
Introduction to Bayesian methods

Lecture 10

David Sontag
New York University

Slides adapted from Luke Zettlemoyer, Carlos Guestrin, Dan Klein,
and Vibhav Gogate
Bayesian learning
• Bayesian learning uses probability to model
  data and quantify uncertainty of predictions
  – Facilitates incorporation of prior knowledge
  – Gives optimal predictions
• Allows for decision-theoretic reasoning
Your first consulting job
• A billionaire from the suburbs of Manhattan asks
  you a question:
  – He says: I have a thumbtack; if I flip it, what’s the
    probability it will fall with the nail up?
  – You say: Please flip it a few times:

  – You say: The probability is:

    P(heads) = 3/5

  – He says: Why???
  – You say: Because…
Outline of lecture
• Review of probability
• Maximum likelihood estimation
• 2 examples of Bayesian classifiers:
  – Naïve Bayes
  – Logistic regression
Random Variables
• A random variable is some aspect of the world about
which we (may) have uncertainty
– R = Is it raining?
– D = How long will it take to drive to work?
– L = Where am I?

• We denote random variables with capital letters

• Random variables have domains


– R in {true, false} (sometimes write as {+r, ¬r})
– D in [0, ∞)
– L in possible locations, maybe {(0,0), (0,1), …}
Probability Distributions
• Discrete random variables have distributions

  P(T):                P(W):
  T      P             W       P
  warm   0.5           sun     0.6
  cold   0.5           rain    0.1
                       fog     0.3
                       meteor  0.0

• A discrete distribution is a TABLE of probabilities of values

• The probability of a state (lower case) is a single number

• Must have: P(x) ≥ 0 for every value x, and Σx P(x) = 1
Joint Distributions
• A joint distribution over a set of random variables X1, …, Xn
  specifies a real number for each assignment: P(X1 = x1, …, Xn = xn)

  T     W      P
  hot   sun    0.4
  hot   rain   0.1
  cold  sun    0.2
  cold  rain   0.3

  – How many assignments if n variables with domain sizes d?  (dⁿ)
  – Must obey: every entry is ≥ 0 and the entries sum to 1

• For all but the smallest distributions, impractical to write out or estimate
  – Instead, we make additional assumptions about the distribution
Marginal Distributions
• Marginal distributions are sub-tables which eliminate variables
• Marginalization (summing out): combine collapsed rows by adding

  Joint P(T, W):        P(t) = Σw P(t, w):     P(w) = Σt P(t, w):
  T     W      P        T     P                W     P
  hot   sun    0.4      hot   0.5              sun   0.6
  hot   rain   0.1      cold  0.5              rain  0.4
  cold  sun    0.2
  cold  rain   0.3
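
The marginalization above is easy to sketch in code. A minimal Python example using the joint table from this slide (the dictionary layout is just an illustrative choice, not from the slides):

```python
from collections import defaultdict

# Joint distribution P(T, W) from the slide.
joint = {
    ("hot", "sun"): 0.4,
    ("hot", "rain"): 0.1,
    ("cold", "sun"): 0.2,
    ("cold", "rain"): 0.3,
}

# Marginalize (sum out) one variable: P(t) = sum_w P(t, w), P(w) = sum_t P(t, w).
p_T = defaultdict(float)
p_W = defaultdict(float)
for (t, w), p in joint.items():
    p_T[t] += p
    p_W[w] += p

print(dict(p_T))  # {'hot': 0.5, 'cold': 0.5}
print(dict(p_W))  # {'sun': ~0.6, 'rain': 0.4}
```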
Conditional Probabilities
• A simple relation between joint and conditional probabilities:

  P(a | b) = P(a, b) / P(b)

  – In fact, this is taken as the definition of a conditional probability

  T     W      P
  hot   sun    0.4       e.g., P(W = sun | T = cold)
  hot   rain   0.1            = P(T = cold, W = sun) / P(T = cold)
  cold  sun    0.2            = 0.2 / 0.5 = 0.4
  cold  rain   0.3
Conditional Distributions
• Conditional distributions are probability distributions over
  some variables given fixed values of others

  Conditional Distributions        Joint Distribution
  P(W | T = hot):                  T     W      P
  W     P                          hot   sun    0.4
  sun   0.8                        hot   rain   0.1
  rain  0.2                        cold  sun    0.2
                                   cold  rain   0.3
  P(W | T = cold):
  W     P
  sun   0.4
  rain  0.6
The Product Rule
• Sometimes have conditional distributions but want the joint:

  P(d, w) = P(d | w) P(w)

• Example:

  P(D | W):               P(W):            Joint P(D, W):
  D     W      P          W     P          D     W      P
  wet   sun    0.1        sun   0.8        wet   sun    0.08
  dry   sun    0.9        rain  0.2        dry   sun    0.72
  wet   rain   0.7                         wet   rain   0.14
  dry   rain   0.3                         dry   rain   0.06
Bayes’ Rule
• Two ways to factor a joint distribution over two variables:

  P(x, y) = P(x | y) P(y) = P(y | x) P(x)

• Dividing, we get:

  P(x | y) = P(y | x) P(x) / P(y)

• Why is this at all helpful?
  – Lets us build one conditional from its reverse
  – Often one conditional is tricky but the other one is simple
  – Foundation of many practical systems (e.g. ASR, MT)

• In the running for most important ML equation!
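
As a small numerical sketch of Bayes’ rule (not from the slides): take the prior P(T) from the marginals and the conditional tables P(W | T) from the Conditional Distributions slide, and recover P(T | W = sun); the answer matches what the joint table gives directly.

```python
# P(T) and P(W | T) from the earlier weather tables.
p_T = {"hot": 0.5, "cold": 0.5}
p_W_given_T = {
    "hot":  {"sun": 0.8, "rain": 0.2},
    "cold": {"sun": 0.4, "rain": 0.6},
}

w = "sun"
# Normalization constant: P(W = sun) = sum_t P(sun | t) P(t)
p_w = sum(p_W_given_T[t][w] * p_T[t] for t in p_T)
# Bayes' rule: P(T = t | W = sun) = P(sun | t) P(t) / P(sun)
p_T_given_w = {t: p_W_given_T[t][w] * p_T[t] / p_w for t in p_T}
print(p_T_given_w)  # {'hot': 0.666..., 'cold': 0.333...}
```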


Returning to thumbtack example…
• P(Heads) = θ,  P(Tails) = 1 − θ

• Flips are i.i.d.:  D = {xi | i = 1…n},  P(D | θ) = Πi P(xi | θ)
  – Independent events
  – Identically distributed according to a Bernoulli
    distribution
• Sequence D of αH Heads and αT Tails:

  P(D | θ) = θ^αH (1 − θ)^αT

  Called the “likelihood” of the data under the model


Maximum Likelihood Estimation
• Data: Observed set D of αH Heads and αT Tails
• Hypothesis: Bernoulli distribution
• Learning: finding θ is an optimization problem
  – What’s the objective function? The log-likelihood L(θ; D) = ln P(D | θ)

• MLE: Choose θ to maximize the probability of D


Your first parameter learning algorithm

  θ̂_MLE = arg maxθ ln P(D | θ)
        = arg maxθ ln [ θ^αH (1 − θ)^αT ]

• Set derivative to zero, and solve!

  d/dθ ln P(D | θ) = d/dθ ln [ θ^αH (1 − θ)^αT ]
                   = d/dθ [ αH ln θ + αT ln(1 − θ) ]
                   = αH/θ − αT/(1 − θ) = 0

  ⇒  θ̂_MLE = αH / (αH + αT)

  (The objective being maximized is the log-likelihood L(θ; D) = ln P(D | θ).)
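
A minimal sketch of this closed-form MLE, θ̂ = αH / (αH + αT), in Python; the five flips are made up to match the 3/5 thumbtack example:

```python
def bernoulli_mle(flips):
    """MLE for P(heads) of a Bernoulli from a list of 'H'/'T' outcomes."""
    n_heads = sum(1 for f in flips if f == "H")
    return n_heads / len(flips)

flips = ["H", "T", "H", "H", "T"]  # alpha_H = 3, alpha_T = 2
print(bernoulli_mle(flips))        # 0.6
```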


What if I have prior beliefs?
• Billionaire says: Wait, I know that the thumbtack
  is “close” to 50-50. What can you do for me now?
• You say: I can learn it the Bayesian way…
• Rather than estimating a single θ, we obtain a
  distribution over possible values of θ

  In the beginning: Pr(θ)   →   observe flips, e.g. {tails, tails}   →   after observations: Pr(θ | D)



Bayesian Learning
• Use Bayes’ rule!

  P(θ | D) = P(D | θ) P(θ) / P(D)

  Posterior = Data Likelihood × Prior / Normalization

• Or equivalently:  P(θ | D) ∝ P(D | θ) P(θ)

• For uniform priors, this reduces to
  maximum likelihood estimation!

  P(θ) ∝ 1   ⇒   P(θ | D) ∝ P(D | θ)
Bayesian Learning for Thumbtacks

  Likelihood:  P(D | θ) = θ^αH (1 − θ)^αT

• What should the prior be?
  – Represent expert knowledge
  – Simple posterior form

• For binary variables, a commonly used prior is the
  Beta distribution:

Beta prior distribution – P(θ)

  P(θ) = θ^(βH − 1) (1 − θ)^(βT − 1) / B(βH, βT)   ~   Beta(βH, βT)

• Since the Beta distribution is conjugate to the Bernoulli distribution, the
  posterior distribution has a particularly simple form:

  P(θ | D) ∝ P(D | θ) P(θ)
           ∝ θ^αH (1 − θ)^αT · θ^(βH − 1) (1 − θ)^(βT − 1)
           = θ^(αH + βH − 1) (1 − θ)^(αT + βT − 1)
           ~   Beta(αH + βH, αT + βT)
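
A short sketch of this conjugate update in Python (the Beta(10, 10) prior is an illustrative choice for the billionaire’s “close to 50-50” belief, not from the slides):

```python
def beta_bernoulli_update(prior_heads, prior_tails, n_heads, n_tails):
    """Posterior Beta parameters after observing thumbtack flips."""
    post_heads = n_heads + prior_heads
    post_tails = n_tails + prior_tails
    # Posterior mean and mode (MAP) of theta = P(heads); the MAP formula
    # assumes both posterior parameters are > 1.
    mean = post_heads / (post_heads + post_tails)
    map_est = (post_heads - 1) / (post_heads + post_tails - 2)
    return post_heads, post_tails, mean, map_est

# Beta(10, 10) prior plus the 3 heads / 2 tails observed earlier.
print(beta_bernoulli_update(10, 10, 3, 2))
# (13, 12, 0.52, ~0.522): the posterior pulls the 3/5 MLE back toward 1/2.
```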
Using Bayesian inference for prediction
• We now have a distribution over parameters
• For any specific f, a function of interest, compute the
  expected value of f:

  E[f] = ∫ f(θ) P(θ | D) dθ

• Integral is often hard to compute

• As more data is observed, the posterior becomes more concentrated
• MAP (Maximum a posteriori approximation): use the most
  likely parameter, θ̂_MAP = arg maxθ P(θ | D), to approximate the expectation
What about continuous variables?
• Billionaire says: If I am measuring a continuous
  variable, what can you do for me?
• You say: Let me tell you about Gaussians…
Some properties of Gaussians
• Affine transformations (multiplying by a
  scalar and adding a constant) of a Gaussian
  are Gaussian
  – X ~ N(µ, σ²)
  – Y = aX + b   ⇒   Y ~ N(aµ + b, a²σ²)

• Sum of (independent) Gaussians is Gaussian
  – X ~ N(µX, σ²X)
  – Y ~ N(µY, σ²Y)
  – Z = X + Y   ⇒   Z ~ N(µX + µY, σ²X + σ²Y)

• Easy to differentiate, as we will see soon!


Learning a Gaussian
• Collect a bunch of data
  – Hopefully, i.i.d. samples
  – e.g., exam scores

• Learn parameters
  – µ (“mean”)
  – σ² (“variance”)

  i     xi = Exam Score
  0     85
  1     95
  2     100
  3     12
  …     …
  99    89
MLE for Gaussian
• Prob. of i.i.d. samples D = {x1, …, xN}:

  P(D | µ, σ) = Πi=1..N (1 / (σ√(2π))) exp(−(xi − µ)² / (2σ²))

  µ_MLE, σ_MLE = arg maxµ,σ P(D | µ, σ)

• Log-likelihood of data:

  ln P(D | µ, σ) = −N ln(σ√(2π)) − Σi=1..N (xi − µ)² / (2σ²)
Your second learning algorithm:
MLE for mean of a Gaussian
• What’s the MLE for the mean?

  µ_MLE, σ_MLE = arg maxµ,σ P(D | µ, σ)

  ∂/∂µ ln P(D | µ, σ) = Σi=1..N (xi − µ) / σ² = 0

  ⇒  Σi xi − Nµ = 0   ⇒   µ_MLE = (1/N) Σi xi
MLE for variance
• Again, set derivative to zero:

  µ_MLE, σ_MLE = arg maxµ,σ P(D | µ, σ)

  ∂/∂σ ln P(D | µ, σ) = −N/σ + Σi=1..N (xi − µ)² / σ³ = 0

  ⇒  σ²_MLE = (1/N) Σi (xi − µ_MLE)²
Learning Gaussian parameters
• MLE:

  µ_MLE = (1/N) Σi xi
  σ²_MLE = (1/N) Σi (xi − µ_MLE)²

• MLE for the variance of a Gaussian is biased
  – Expected result of estimation is not the true parameter!
  – Unbiased variance estimator:

    σ²_unbiased = (1/(N − 1)) Σi (xi − µ_MLE)²
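
A small Python sketch of these estimators, using exam-score style data as in the earlier slide (the particular numbers are illustrative):

```python
import math

def gaussian_mle(xs):
    """MLE for the mean and variance of a Gaussian, plus the unbiased variance."""
    n = len(xs)
    mu = sum(xs) / n
    var_mle = sum((x - mu) ** 2 for x in xs) / n              # biased: divides by N
    var_unbiased = sum((x - mu) ** 2 for x in xs) / (n - 1)   # divides by N - 1
    return mu, var_mle, var_unbiased

scores = [85, 95, 100, 12, 89]
mu, var_mle, var_unb = gaussian_mle(scores)
print(mu, var_mle, var_unb, math.sqrt(var_mle))
```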
Bayesian learning of Gaussian
parameters
• Conjugate priors
  – Mean: Gaussian prior
  – Variance: Wishart distribution

• Prior for mean: a Gaussian with its own hyperparameters, µ ~ N(η, λ²)


Outline of lecture
• Review of probability
• Maximum likelihood estimation
• 2 examples of Bayesian classifiers:
  – Naïve Bayes
  – Logistic regression
Bayesian Classification
• Problem statement:
  – Given features X1, X2, …, Xn
  – Predict a label Y
[Next several slides adapted from:


Vibhav Gogate, Jonathan Huang, Luke Zettlemoyer, Carlos
Guestrin, and Dan Weld]
Example Application
• Digit Recognition

  (image of a handwritten digit) → Classifier → “5”

• X1, …, Xn ∈ {0,1} (Black vs. White pixels)

• Y ∈ {0,1,2,3,4,5,6,7,8,9}
The Bayes Classifier
• If we had the joint distribution on X1, …, Xn and Y, we could predict
  using:

  arg maxy P(Y = y | X1, …, Xn)

  – (for example: what is the probability that the image
    represents a 5 given its pixels?)

• So … How do we compute that?


The Bayes Classifier
• Use Bayes’ Rule!

  P(Y | X1, …, Xn) = P(X1, …, Xn | Y) P(Y) / P(X1, …, Xn)
                   = Likelihood × Prior / Normalization Constant

• Why did this help? Well, we think that we might be able to
  specify how features are “generated” by the class label
The Bayes Classifier
• Let’s expand this for our digit recognition task:

  P(Y = y | X1, …, Xn) ∝ P(X1, …, Xn | Y = y) P(Y = y),   y ∈ {0, 1, …, 9}

• To classify, we’ll simply compute these probabilities, one per
  class, and predict based on which one is largest
Model Parameters
• How many parameters are required to specify the likelihood,
  P(X1, …, Xn | Y)?
  – (Supposing that each image is 30x30 pixels: 2^900 − 1
    parameters for each class)

• The problem with explicitly modeling P(X1, …, Xn | Y) is that
  there are usually way too many parameters:
  – We’ll run out of space
  – We’ll run out of time
  – And we’ll need tons of training data (which is usually not
    available)
Naïve Bayes
• Naïve Bayes assumption:
  – Features are independent given class:

    P(X1, X2 | Y) = P(X1 | Y) P(X2 | Y)

  – More generally:

    P(X1, …, Xn | Y) = Πi P(Xi | Y)

• How many parameters now?
  – Suppose X is composed of n binary features: only n parameters
    per class (one P(Xi = 1 | Y = y) per feature), instead of 2^n − 1
The Naïve Bayes Classifier
• Given:
  – Prior P(Y)
  – n conditionally independent
    features X given the class Y
  – For each Xi, we have the
    likelihood P(Xi | Y)

  (Graphical model: Y → X1, X2, …, Xn)

• Decision rule:

  y* = arg maxy P(y) Πi P(xi | y)

If the conditional independence assumption holds, NB is the optimal classifier!
(but the assumptions typically don’t hold)
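
A minimal sketch of this decision rule in Python. The spam/ham tables below are made-up toy values, not the digit model from the slides:

```python
import math

def naive_bayes_predict(x, prior, likelihood):
    """Return arg max_y [log P(y) + sum_i log P(x_i | y)].
    x: dict feature -> value; prior: dict class -> P(Y=y);
    likelihood[y][feature][value] = P(X_feature = value | Y = y)."""
    best_class, best_score = None, -math.inf
    for y, p_y in prior.items():
        score = math.log(p_y)
        for feat, val in x.items():
            score += math.log(likelihood[y][feat][val])
        if score > best_score:
            best_class, best_score = y, score
    return best_class

# Toy binary-feature example.
prior = {"spam": 0.3, "ham": 0.7}
likelihood = {
    "spam": {"has_link": {1: 0.8, 0: 0.2}, "all_caps": {1: 0.6, 0: 0.4}},
    "ham":  {"has_link": {1: 0.2, 0: 0.8}, "all_caps": {1: 0.1, 0: 0.9}},
}
print(naive_bayes_predict({"has_link": 1, "all_caps": 1}, prior, likelihood))  # spam
```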
A Digit Recognizer

• Input: pixel grids

• Output: a digit 0-9


Are the naïve Bayes assumptions realistic here?
What has to be learned?

  The prior P(Y) over digits and, for each pixel feature, P(F = on | Y).
  Shown below for two example pixel features F1 and F2:

  Y    P(Y)    P(F1 = on | Y)    P(F2 = on | Y)
  1    0.1     0.01              0.05
  2    0.1     0.05              0.01
  3    0.1     0.05              0.90
  4    0.1     0.30              0.80
  5    0.1     0.80              0.90
  6    0.1     0.90              0.90
  7    0.1     0.05              0.25
  8    0.1     0.60              0.85
  9    0.1     0.50              0.60
  0    0.1     0.80              0.80
MLE for the parameters of NB
• Given dataset
  – Count(A=a, B=b) = number of examples where A=a and B=b

• MLE for discrete NB, simply:
  – Prior:

    P(Y = y) = Count(Y = y) / Σy′ Count(Y = y′)

  – Observation distribution:

    P(Xi = x | Y = y) = Count(Xi = x, Y = y) / Σx′ Count(Xi = x′, Y = y)
MLE for the parameters of NB
• Training amounts to, for each of the classes, averaging all of
  the examples together: for binary features, P(Xi = 1 | Y = y) is just the
  fraction of class-y examples with Xi = 1

MAP estimation for NB
• Given dataset
  – Count(A=a, B=b) = number of examples where A=a and B=b

• MAP estimation for discrete NB, simply (see the code sketch after this slide):
  – Prior:

    P(Y = y) = Count(Y = y) / Σy′ Count(Y = y′)

  – Observation distribution:

    P(Xi = x | Y = y) = ( Count(Xi = x, Y = y) + a ) / ( Σx′ Count(Xi = x′, Y = y) + |Xi|·a )

• Called “smoothing”. Corresponds to a Dirichlet prior!
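
A sketch of these counting formulas in Python, with the smoothing count a from the MAP version (the tiny dataset is hypothetical; a = 1 gives the common “add-one” smoothing, and a = 0 recovers the plain MLE counts):

```python
from collections import Counter

def nb_fit(X, y, a=1.0):
    """MAP estimates for a discrete naive Bayes model.
    X: list of dicts feature -> value, y: list of class labels, a: smoothing count."""
    class_counts = Counter(y)
    n = len(y)
    prior = {c: class_counts[c] / n for c in class_counts}

    # Count(X_i = x, Y = y) for every feature/value/class, and the value set of each feature.
    joint_counts = {}
    feature_values = {}
    for xi, yi in zip(X, y):
        for feat, val in xi.items():
            joint_counts[(feat, val, yi)] = joint_counts.get((feat, val, yi), 0) + 1
            feature_values.setdefault(feat, set()).add(val)

    def cond_prob(feat, val, c):
        num = joint_counts.get((feat, val, c), 0) + a
        den = class_counts[c] + a * len(feature_values[feat])  # |X_i| possible values
        return num / den

    return prior, cond_prob

X = [{"f1": 1, "f2": 0}, {"f1": 1, "f2": 1}, {"f1": 0, "f2": 0}]
y = ["+", "+", "-"]
prior, cond_prob = nb_fit(X, y)
print(prior, cond_prob("f1", 1, "+"))  # {'+': 0.667, '-': 0.333}, (2+1)/(2+2) = 0.75
```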


What about if there is missing data?
• One of the key strengths of Bayesian approaches is that
  they can naturally handle missing data

Missing data
• Suppose we don’t have a value for some attribute Xi
  – applicant’s credit history unknown
  – some medical test not performed on the patient
  – how to compute P(X1=x1 … Xj=? … Xd=xd | y)?

• Easy with Naïve Bayes
  – ignore the attribute in the instance
    where its value is missing
  – compute the likelihood based on observed attributes
  – no need to “fill in” or explicitly model missing values
  – based on conditional independence between attributes

[Slide from Victor Lavrenko and Nigel Goddard]
Naive Bayes = Linear Classifier
Theorem: assume that xi ∈ {0, 1} for all i ∈ [1, N].
Then, the Naive Bayes classifier is defined by

  x ↦ sgn(w · x + b),

where

  wi = log ( Pr[xi = 1 | +1] / Pr[xi = 1 | −1] ) − log ( Pr[xi = 0 | +1] / Pr[xi = 0 | −1] )

and

  b = log ( Pr[+1] / Pr[−1] ) + Σi=1..N log ( Pr[xi = 0 | +1] / Pr[xi = 0 | −1] ).

Proof: observe that for any i ∈ [1, N],

  log ( Pr[xi | +1] / Pr[xi | −1] )
    = [ log ( Pr[xi = 1 | +1] / Pr[xi = 1 | −1] ) − log ( Pr[xi = 0 | +1] / Pr[xi = 0 | −1] ) ] · xi
      + log ( Pr[xi = 0 | +1] / Pr[xi = 0 | −1] ).

[Slide from Mehryar Mohri]
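
A numerical sketch of this construction (the probabilities below are made up): build w and b from the class-conditional probabilities and check that w · x + b equals the naive Bayes log-odds for every binary input.

```python
import math

# Toy binary NB model with classes +1 / -1 and two binary features.
p_class = {+1: 0.5, -1: 0.5}
p_x1 = {  # p_x1[y][i] = Pr[x_i = 1 | y]
    +1: [0.9, 0.2],
    -1: [0.3, 0.7],
}

N = 2
w = [math.log(p_x1[+1][i] / p_x1[-1][i])
     - math.log((1 - p_x1[+1][i]) / (1 - p_x1[-1][i])) for i in range(N)]
b = math.log(p_class[+1] / p_class[-1]) + sum(
    math.log((1 - p_x1[+1][i]) / (1 - p_x1[-1][i])) for i in range(N))

def nb_log_odds(x):
    """log Pr[+1 | x] - log Pr[-1 | x] computed directly from the NB model."""
    s = math.log(p_class[+1] / p_class[-1])
    for i in range(N):
        p_pos = p_x1[+1][i] if x[i] == 1 else 1 - p_x1[+1][i]
        p_neg = p_x1[-1][i] if x[i] == 1 else 1 - p_x1[-1][i]
        s += math.log(p_pos / p_neg)
    return s

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    linear = b + sum(w[i] * x[i] for i in range(N))
    assert abs(linear - nb_log_odds(x)) < 1e-12   # the two agree exactly
    print(x, math.copysign(1, linear))            # sgn(w.x + b)
```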
Outline of lecture
• Review of probability
• Maximum likelihood estimation
• 2 examples of Bayesian classifiers:
  – Naïve Bayes
  – Logistic regression

[Next several slides adapted from:


Vibhav Gogate, Luke Zettlemoyer, Carlos Guestrin, and Dan Weld]
Logistic Regression
• Learn P(Y|X) directly!
• Assume a particular functional form
  – Linear classifier? On one side we say P(Y=1|X)=1, and on
    the other P(Y=1|X)=0
  – But, this is not differentiable (hard to learn)… doesn’t
    allow for label noise...

  (Figure: a hard threshold decision boundary, with P(Y=1)=1 on one side and
  P(Y=1)=0 on the other)
(For the class-conditional variances, the minimum variance unbiased estimator
(MVUE) is sometimes used instead:

  σ̂²_ik = ( 1 / (Σj δ(Y^j = yk) − 1) ) Σj (Xi^j − µ̂_ik)² δ(Y^j = yk)        (15) )

Logistic Regression
• Learn P(Y|X) directly!
• Assume a particular functional form
• Sigmoid applied to a linear function of the data

  Logistic function (Sigmoid):  1 / (1 + e^(−z))

Logistic Regression is an approach to learning functions of the form f : X → Y, or
P(Y|X), in the case where Y is discrete-valued, and X = ⟨X1 … Xn⟩ is any vector
containing discrete or continuous variables. In this section we will primarily consider
the case where Y is a boolean variable, in order to simplify notation. In the final
section we extend our treatment to the case where Y takes on any finite number
of discrete values.

Logistic Regression assumes a parametric form for the distribution P(Y|X),
then directly estimates its parameters from the training data. The parametric
model assumed by Logistic Regression in the case where Y is boolean is:

  P(Y = 1|X) = 1 / (1 + exp(w0 + Σi=1..n wi Xi))                            (16)

  P(Y = 0|X) = exp(w0 + Σi=1..n wi Xi) / (1 + exp(w0 + Σi=1..n wi Xi))      (17)

  Features can be discrete or continuous!

Notice that equation (17) follows directly from equation (16), because the sum of
these two probabilities must equal 1.
Logistic Function in n Dimensions

  Sigmoid applied to a linear function of the data:

  P(Y = 1 | X) = 1 / (1 + exp(w0 + Σi wi Xi))

  Features can be discrete or continuous!

  (Figure: 3D surface plot of the sigmoid of a linear function of two inputs x1, x2)
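
A minimal sketch of equations (16)–(17) in Python. Note the sign convention from the text above, which places +(w0 + Σ wi Xi) inside the exponential, so larger w·x makes Y = 1 less probable; the weights below are arbitrary illustrative values:

```python
import math

def p_y1_given_x(w0, w, x):
    """Equation (16): P(Y=1 | X) = 1 / (1 + exp(w0 + sum_i w_i X_i))."""
    z = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(z))

def p_y0_given_x(w0, w, x):
    """Equation (17): follows from (16) since the two probabilities sum to 1."""
    return 1.0 - p_y1_given_x(w0, w, x)

w0, w = -1.0, [2.0, -0.5]   # arbitrary illustrative weights
x = [0.3, 1.0]              # features can be discrete or continuous
print(p_y1_given_x(w0, w, x), p_y0_given_x(w0, w, x))
```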
Logistic Regression: decision boundary

  P(Y = 1|X) = 1 / (1 + exp(w0 + Σi=1..n wi Xi))
  P(Y = 0|X) = exp(w0 + Σi=1..n wi Xi) / (1 + exp(w0 + Σi=1..n wi Xi))

  (Figure: form of the logistic function. In Logistic Regression, P(Y|X) is
  assumed to follow this form.)

• Prediction: Output the Y with highest P(Y|X)
  – For binary Y, output Y = 0 if P(Y = 0|X) / P(Y = 1|X) > 1

One highly convenient property of this form for P(Y|X) is that it leads to a
simple linear expression for classification. To classify any given X we generally
want to assign the value yk that maximizes P(Y = yk|X). Put another way, we
assign the label Y = 0 if the following condition holds:

  1 < P(Y = 0|X) / P(Y = 1|X)

Substituting from equations (16) and (17), this becomes

  1 < exp(w0 + Σi=1..n wi Xi)

Taking the natural log of both sides, we have a linear classification rule that
assigns label Y = 0 if X satisfies

  0 < w0 + Σi=1..n wi Xi                                                    (18)

and assigns Y = 1 otherwise.

  A Linear Classifier!  (decision boundary: w·X + w0 = 0)

Interestingly, the parametric form of P(Y|X) used by Logistic Regression is
the form implied by the assumptions of a Gaussian Naive Bayes classifier.
Therefore, we can view Logistic Regression as a closely related alternative to GNB.
Likelihood vs. Conditional Likelihood
Generative (Naïve Bayes) maximizes Data likelihood:

  Σj ln P(x^j, y^j | w)

Discriminative (Logistic Regr.) maximizes Conditional Data Likelihood:

  Σj ln P(y^j | x^j, w)

Focuses only on learning P(Y|X) – all that matters for classification
Maximizing Conditional Log Likelihood

  l(w) ≡ ln Πj P(y^j | x^j, w)
       = Σj [ y^j (w0 + Σi wi xi^j) − ln (1 + exp(w0 + Σi wi xi^j)) ]

  (each label y^j is 0 or 1!)

Bad news: no closed-form solution to maximize l(w)

Good news: l(w) is a concave function of w!
  → No local maxima
  → Concave functions are easy to optimize
Optimizing concave function –
Gradient ascent
• Conditional likelihood for Logistic Regression is concave!

  Gradient:  ∇w l(w) = [ ∂l(w)/∂w0 , …, ∂l(w)/∂wn ]

  Update rule (learning rate η > 0):  w ← w + η ∇w l(w)
Maximize Conditional Log Likelihood: Gradient ascent

  l(w) = Σj ln P(y^j | x^j, w)
       = Σj [ y^j ln ( exp(w0 + Σi wi xi^j) / (1 + exp(w0 + Σi wi xi^j)) )
              + (1 − y^j) ln ( 1 / (1 + exp(w0 + Σi wi xi^j)) ) ]
       = Σj [ y^j (w0 + Σi wi xi^j) − ln (1 + exp(w0 + Σi wi xi^j)) ]

  ∂l(w)/∂wi = Σj [ ∂/∂wi y^j (w0 + Σi wi xi^j) − ∂/∂wi ln (1 + exp(w0 + Σi wi xi^j)) ]
            = Σj [ y^j xi^j − xi^j exp(w0 + Σi wi xi^j) / (1 + exp(w0 + Σi wi xi^j)) ]
            = Σj xi^j [ y^j − P(Y = 1 | x^j, w) ]

  (here P(Y = 1 | x, w) = exp(w0 + Σi wi xi) / (1 + exp(w0 + Σi wi xi)))

For more than two classes (classes 1, …, r):

  P(Y = k | X) ∝ exp(wk0 + Σi wki Xi)   for k = 1, …, r − 1
  P(Y = r | X) = 1 − Σj=1..r−1 P(Y = j | X)

MAP estimation for Logistic Regression: with a zero-mean Gaussian prior on w,
ln p(w) ∝ −λ Σi wi², which adds an L2 penalty (regularization) to the objective.
Gradient Ascent for LR
Gradient ascent algorithm: (learning rate η > 0)
do:
    w0 ← w0 + η Σj [ y^j − P(Y = 1 | x^j, w) ]
    For i=1 to n: (iterate over features)
        wi ← wi + η Σj xi^j [ y^j − P(Y = 1 | x^j, w) ]
until “change” < ε

Loop over training examples!
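
A sketch of this gradient ascent loop in Python, under the convention P(Y = 1 | x, w) = exp(w0 + w·x) / (1 + exp(w0 + w·x)) used in the derivation above (opposite sign to equation (16)). The tiny dataset, learning rate, and stopping constants are illustrative:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lr_gradient_ascent(X, y, eta=0.1, eps=1e-6, max_iters=10000):
    """Batch gradient ascent on the conditional log likelihood of logistic regression.
    X: list of feature vectors, y: list of 0/1 labels."""
    n_features = len(X[0])
    w0, w = 0.0, [0.0] * n_features
    for _ in range(max_iters):
        # Prediction error y^j - P(Y=1 | x^j, w) for each training example.
        errors = [yj - sigmoid(w0 + sum(wi * xi for wi, xi in zip(w, xj)))
                  for xj, yj in zip(X, y)]
        grad0 = sum(errors)
        grad = [sum(e * xj[i] for e, xj in zip(errors, X)) for i in range(n_features)]
        w0 += eta * grad0
        for i in range(n_features):                      # iterate over features
            w[i] += eta * grad[i]
        if max(abs(g) for g in [grad0] + grad) < eps:    # "change" < epsilon
            break
    return w0, w

# Tiny toy set; note that on linearly separable data the unregularized weights
# keep growing, so the loop may stop at max_iters rather than at the eps test.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]
print(lr_gradient_ascent(X, y))
```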
Naïve Bayes  vs.  Logistic Regression
Learning: h: X → Y    (X – features, Y – target classes)

Generative (Naïve Bayes)
• Assume functional form for
  – P(X|Y)  (assume cond. indep.)
  – P(Y)
  – Est. params from train data
• Gaussian NB for cont. features
• Bayes rule to calc. P(Y|X = x):
  – P(Y | X) ∝ P(X | Y) P(Y)
• Indirect computation
  – Can generate a sample of the data
  – Can easily handle missing data

Discriminative (Logistic Regression)
• Assume functional form for
  – P(Y|X)  (no assumptions)
  – Est. params from training data
• Handles discrete & cont. features
• Directly calculate P(Y|X = x)
  – Can’t generate data sample
Naïve Bayes vs. Logistic Regression
[Ng & Jordan, 2002]

• Generative vs. Discriminative classifiers

• Asymptotic comparison
  (# training examples → infinity)
  – when model correct
    • NB, Linear Discriminant Analysis (with class-independent
      variances), and Logistic Regression produce identical
      classifiers
  – when model incorrect
    • LR is less biased – does not assume conditional
      independence
      – therefore LR expected to outperform NB
Naïve Bayes vs. Logistic Regression
[Ng & Jordan, 2002]

• Generative vs. Discriminative classifiers

• Non-asymptotic analysis
  – convergence rate of parameter estimates
    (n = # of attributes in X)
    • Size of training data to get close to the infinite-data solution
    • Naïve Bayes needs O(log n) samples
    • Logistic Regression needs O(n) samples
  – Naïve Bayes converges more quickly to its (perhaps
    less helpful) asymptotic estimates
Some experiments from UCI data sets
(Figure: plots of test error vs. training set size for naïve Bayes and logistic
regression on several UCI data sets)
