Lecture 6: Generative Models
Boyu Wang
Department of Computer Science
University of Western Ontario
Discriminative model vs. generative model
p(y = 1|x; w) = σ(h_w(x)) = 1 / (1 + e^{−w^T x})
– This is called a discriminative model, because we only care about discriminating between examples of the two classes.
I Another way is to model p(y) and p(x|y) and then use the Bayes rule:
p(y = 1|x) = p(x, y = 1) / p(x)
           = p(x|y = 1) p(y = 1) / [p(x|y = 1) p(y = 1) + p(x|y = 0) p(y = 0)]
– This is called a generative model, because we can actually use the model to generate data.
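As a minimal sketch of using the generative route for classification (not from the slides; the parameter values below are made up for illustration), model p(y) with a class prior and p(x|y) with one Gaussian per class, then apply the Bayes rule to get p(y = 1|x):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Univariate Gaussian density N(x; mu, sigma^2)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Illustrative (made-up) parameters.
theta = 0.3                     # p(y = 1)
mu1, sigma1 = 2.0, 1.0          # p(x | y = 1) = N(2, 1)
mu0, sigma0 = -1.0, 1.5         # p(x | y = 0) = N(-1, 1.5^2)

def posterior_y1(x):
    """p(y = 1 | x) via the Bayes rule with the generative model above."""
    joint1 = gaussian_pdf(x, mu1, sigma1) * theta        # p(x, y = 1)
    joint0 = gaussian_pdf(x, mu0, sigma0) * (1 - theta)  # p(x, y = 0)
    return joint1 / (joint1 + joint0)                    # divide by p(x) = sum of the joints

print(posterior_y1(0.0))  # posterior probability of class 1 at x = 0
```

Drawing y from p(y) and then x from p(x|y) is what makes the model "generative": the same densities can produce new data.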
Bayes classifier for continuous features
Examples of multivariate Gaussian distribution
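The example plots are not reproduced here; as a small sketch (mean and covariance chosen arbitrarily), the following samples from a 2-D Gaussian and evaluates its density, which is what the contour examples visualize:

```python
import numpy as np

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.8],   # positive off-diagonal entries give tilted, elongated contours
                  [0.8, 2.0]])

# Draw samples from N(mu, Sigma).
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(mu, Sigma, size=500)
print(samples.mean(axis=0))     # close to mu

def mvn_density(x, mu, Sigma):
    """Density of an n-dimensional Gaussian N(mu, Sigma) at point x."""
    n = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)   # (x - mu)^T Sigma^{-1} (x - mu)
    norm_const = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm_const

print(mvn_density(np.array([1.0, 1.0]), mu, Sigma))
```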
Gaussian discriminant analysis
I For 2 classes:
p(y = 1) = θ; p(y = 0) = 1 − θ
p(x|y = 1) = 1 / ((2π)^{n/2} |Σ_1|^{1/2}) · exp(−(1/2) (x − µ_1)^T Σ_1^{−1} (x − µ_1))
p(x|y = 0) = 1 / ((2π)^{n/2} |Σ_0|^{1/2}) · exp(−(1/2) (x − µ_0)^T Σ_0^{−1} (x − µ_0))
I For C classes:
p(y = c) = θ_c,  s.t.  ∑_{c=1}^C θ_c = 1
p(x|y = c) = 1 / ((2π)^{n/2} |Σ_c|^{1/2}) · exp(−(1/2) (x − µ_c)^T Σ_c^{−1} (x − µ_c))
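A minimal sketch of fitting the C-class model by maximum likelihood (the per-class estimates are the class frequencies, sample means, and sample covariances; the function and variable names are mine, not from the slides):

```python
import numpy as np

def fit_gda(X, y, num_classes):
    """Estimate theta_c, mu_c, Sigma_c from data X (one example per row) and labels y in {0, ..., C-1}."""
    n = X.shape[0]
    theta, mu, Sigma = [], [], []
    for c in range(num_classes):
        Xc = X[y == c]                      # examples of class c
        nc = Xc.shape[0]
        theta.append(nc / n)                # p(y = c)
        mu_c = Xc.mean(axis=0)              # class mean
        mu.append(mu_c)
        diff = Xc - mu_c
        Sigma.append(diff.T @ diff / nc)    # MLE covariance (divides by nc, not nc - 1)
    return np.array(theta), np.array(mu), np.array(Sigma)
```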
Other variants to simplify the model
I If we assume the same covariance matrix Σ for all classes, the maximum likelihood estimate of Σ is
Σ = ∑_{c=1}^C (n_c / n) Σ_c
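Continuing the sketch above, the shared-covariance variant simply replaces every Σ_c by this weighted average (theta_c plays the role of n_c/n returned by fit_gda):

```python
def pooled_covariance(theta, Sigma):
    """Sigma = sum_c (n_c / n) * Sigma_c, using theta_c = n_c / n and the per-class Sigma_c from fit_gda."""
    return sum(t * S for t, S in zip(theta, Sigma))
```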
Classification using quadratic discriminant analysis
Recall:
p(y = c) = θ_c;   p(x|y = c) = 1 / ((2π)^{n/2} |Σ_c|^{1/2}) · exp(−(1/2) (x − µ_c)^T Σ_c^{−1} (x − µ_c))
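To classify a new x, pick the class with the largest posterior, argmax_c p(x|y = c) p(y = c); taking logs gives the quadratic discriminant function. A sketch reusing the fit_gda estimates from the earlier sketch (not the slides' worked example):

```python
import numpy as np

def qda_predict(x, theta, mu, Sigma):
    """Return argmax_c [log p(x | y = c) + log p(y = c)] for one input x."""
    scores = []
    for t, m, S in zip(theta, mu, Sigma):
        diff = x - m
        quad = diff @ np.linalg.solve(S, diff)                  # (x - mu_c)^T Sigma_c^{-1} (x - mu_c)
        log_lik = -0.5 * quad - 0.5 * np.linalg.slogdet(S)[1]   # the (n/2) log(2*pi) term is shared, so drop it
        scores.append(log_lik + np.log(t))
    return int(np.argmax(scores))
```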
Conditional independence: an example
I A box contains two coins: a regular coin (R) and one fake two-headed
coin (F). I choose a coin at random and toss it twice. Define the
following two events:
- A = First coin toss results in a head
- B = Second coin toss results in a head
Are A and B independent?
I p(A) = p(B) = p(head) = p(head|R) × p(R) + p(head|F) × p(F) = 1/2 × 1/2 + 1 × 1/2 = 3/4
p(A, B) = p(head, head) = p(head, head|R) × p(R) + p(head, head|F) × p(F) = 1/2 × 1/2 × 1/2 + 1 × 1/2 = 5/8
p(A)p(B) = 9/16 ≠ 5/8 = p(A, B) ⇒ A and B are dependent!
I Consider an additional event: C = the regular coin (R) has been selected. Given C, the two tosses are independent: p(A, B|C) = 1/2 × 1/2 = 1/4 = p(A|C) p(B|C), so A and B are conditionally independent given C even though they are not independent unconditionally.
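A quick Monte Carlo check of the calculation above (the simulation itself is my addition, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
trials = 200_000

fake = rng.random(trials) < 0.5                       # choose the fake coin F with prob 1/2
# The fake coin always lands heads; the regular coin lands heads with prob 1/2.
A = fake | (rng.random(trials) < 0.5)                 # first toss is a head
B = fake | (rng.random(trials) < 0.5)                 # second toss is a head

print(A.mean(), B.mean())                             # both approx 3/4
print((A & B).mean(), A.mean() * B.mean())            # approx 5/8 vs approx 9/16 -> dependent
R = ~fake                                             # condition on the regular coin
print((A & B)[R].mean(), A[R].mean() * B[R].mean())   # both approx 1/4 -> conditionally independent
```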
Maximum likelihood estimation for Naive Bayes
θ_c = n_c / n
I Computing the gradient with respect to β_{j,c} and setting it to 0 gives us:
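The closed-form solution for β_{j,c} is not shown above; as a sketch under the assumption of binary (Bernoulli) features with β_{j,c} = p(x_j = 1 | y = c), the maximum likelihood estimates reduce to class counts and within-class feature frequencies:

```python
import numpy as np

def fit_naive_bayes(X, y, num_classes):
    """MLE for naive Bayes with binary features (an assumed feature model, not stated on the slides):
    theta_c = n_c / n,  beta[j, c] = (# class-c examples with x_j = 1) / n_c."""
    n, d = X.shape
    theta = np.zeros(num_classes)
    beta = np.zeros((d, num_classes))
    for c in range(num_classes):
        Xc = X[y == c]
        nc = Xc.shape[0]
        theta[c] = nc / n
        beta[:, c] = Xc.sum(axis=0) / nc
    return theta, beta
```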
Training Naive Bayes
Laplace smoothing
Training Naive Bayes with Laplace smoothing
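The worked slides are not reproduced here; as a sketch, add-one (Laplace) smoothing modifies the binary-feature estimates from the earlier sketch so that no β_{j,c} is ever exactly 0 or 1 (the smoothing constant 1 and the two outcomes per binary feature follow the usual add-one convention and are my assumptions):

```python
import numpy as np

def fit_naive_bayes_laplace(X, y, num_classes):
    """Same as fit_naive_bayes above, but with add-one smoothing on the feature estimates."""
    n, d = X.shape
    theta = np.zeros(num_classes)
    beta = np.zeros((d, num_classes))
    for c in range(num_classes):
        Xc = X[y == c]
        nc = Xc.shape[0]
        theta[c] = nc / n
        beta[:, c] = (Xc.sum(axis=0) + 1) / (nc + 2)   # +1 per outcome; a binary feature has 2 outcomes
    return theta, beta
```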
Generative model summary
I Advantages:
- Easy to train
- Can handle streaming data well
- Can handle both real-valued and discrete data
I Disadvantages: