Lecture 6: Generative Models

The document discusses the differences between discriminative and generative models in classification, focusing on logistic regression as a discriminative model and Bayesian approaches as generative models. It explains how to estimate parameters for Gaussian discriminant analysis and introduces naive Bayes classifiers for both continuous and discrete features, including the use of Laplace smoothing. The advantages and disadvantages of generative models are also summarized, highlighting their ease of training and limitations in handling high-dimensional data.

Artificial Intelligence II (CS4442 & CS9542)

Classification: Generative Models

Boyu Wang
Department of Computer Science
University of Western Ontario
Discriminative model vs. generative model

▶ Recall: in logistic regression, we directly model p(y|x):

      p(y = 1 | x; w) ≜ σ(h_w(x)) = 1 / (1 + e^{−w^⊤ x})

  - This is called a discriminative model, because we only care about
    discriminating between examples of the two classes.

▶ Another way: model p(y) and p(x|y), and then use Bayes' rule:

      p(y = 1 | x) = p(x, y = 1) / p(x)
                   = p(x | y = 1) p(y = 1) / [ p(x | y = 1) p(y = 1) + p(x | y = 0) p(y = 0) ]

  - This is called a generative model, because we can actually use the
    model to generate data.
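To make the Bayes-rule computation above concrete, here is a minimal Python sketch (my own illustration, not from the slides). It assumes a made-up one-dimensional generative model: a prior p(y = 1) and Gaussian class-conditional densities with invented parameters.

    import numpy as np
    from scipy.stats import norm

    # Hypothetical 1-D generative model: p(y) and p(x|y) are assumed, not estimated.
    prior_y1 = 0.3                             # p(y = 1)
    p_x_given_y1 = norm(loc=2.0, scale=1.0)    # p(x | y = 1)
    p_x_given_y0 = norm(loc=0.0, scale=1.0)    # p(x | y = 0)

    def posterior_y1(x):
        """Bayes' rule: p(y=1|x) = p(x|y=1)p(y=1) / [p(x|y=1)p(y=1) + p(x|y=0)p(y=0)]."""
        num = p_x_given_y1.pdf(x) * prior_y1
        den = num + p_x_given_y0.pdf(x) * (1.0 - prior_y1)
        return num / den

    print(posterior_y1(1.5))   # posterior probability of class 1 at x = 1.5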
Bayes classifier for continuous features

▶ Idea: Use the training data to estimate p(y) and p(x|y)
▶ p(y) can be estimated by counting the number of data points of each class.
▶ How to estimate p(x|y)?
  - Need additional assumptions (for continuous inputs): multivariate Gaussian
    with mean µ ∈ R^n and covariance Σ ∈ R^{n×n}
  - Each class has its own mean µ_c and covariance Σ_c, c ∈ {0, 1}
Examples of multivariate Gaussian distribution

Figure: 2D Gaussian distributions with different Σ

Figure credit: Doina Precup

Gaussian discriminant analysis

▶ For 2 classes:

      p(y = 1) = θ;  p(y = 0) = 1 − θ
      p(x | y = 1) = 1 / ((2π)^{n/2} |Σ_1|^{1/2}) · e^{−½ (x − µ_1)^⊤ Σ_1^{−1} (x − µ_1)}
      p(x | y = 0) = 1 / ((2π)^{n/2} |Σ_0|^{1/2}) · e^{−½ (x − µ_0)^⊤ Σ_0^{−1} (x − µ_0)}

▶ The parameters to estimate are: θ, µ_1, Σ_1, µ_0, Σ_0

▶ For C classes:

      p(y = c) = θ_c,  s.t. ∑_{c=1}^{C} θ_c = 1
      p(x | y = c) = 1 / ((2π)^{n/2} |Σ_c|^{1/2}) · e^{−½ (x − µ_c)^⊤ Σ_c^{−1} (x − µ_c)}

▶ The parameters to estimate are: {θ_c, µ_c, Σ_c}_{c=1}^{C}
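The class-conditional density above is easy to evaluate directly. Below is a small NumPy sketch (my own, not from the slides; the function name and the example parameters are made up) that computes the multivariate Gaussian log-density in a numerically stable way.

    import numpy as np

    def gaussian_log_density(x, mu, Sigma):
        """Log of the multivariate Gaussian density used as p(x | y = c)."""
        n = len(mu)
        diff = x - mu
        _, logdet = np.linalg.slogdet(Sigma)          # log |Sigma|, numerically stable
        quad = diff @ np.linalg.solve(Sigma, diff)    # (x - mu)^T Sigma^{-1} (x - mu)
        return -0.5 * (n * np.log(2 * np.pi) + logdet + quad)

    # Illustrative (made-up) parameters for one class:
    mu_c = np.array([0.0, 1.0])
    Sigma_c = np.array([[2.0, 0.3],
                        [0.3, 1.0]])
    print(np.exp(gaussian_log_density(np.array([0.5, 0.5]), mu_c, Sigma_c)))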
Estimate the parameters

▶ We can write down the likelihood function, as for linear regression
  and logistic regression.
▶ Compute the gradient with respect to the parameters and set it to 0:
  - The parameter θ_c is given by θ_c = n_c / n, where n_c is the number of
    instances of class c.
  - The mean µ_c is given by

        µ_c = (1 / n_c) ∑_{i : y_i = c} x_i

  - The covariance matrix Σ_c is given by

        Σ_c = (1 / n_c) ∑_{i : y_i = c} (x_i − µ_c)(x_i − µ_c)^⊤
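These closed-form estimates translate directly into code. The following is a minimal NumPy sketch (my own illustration, not code from the course; the names fit_gda, X, and y are assumptions) that computes θ_c, µ_c, and Σ_c per class from a data matrix and label vector.

    import numpy as np

    def fit_gda(X, y):
        """Maximum likelihood estimates for Gaussian discriminant analysis.
        X: (m, n) data matrix; y: (m,) integer class labels."""
        classes = np.unique(y)
        m = X.shape[0]
        params = {}
        for c in classes:
            Xc = X[y == c]                       # instances of class c
            n_c = Xc.shape[0]
            theta_c = n_c / m                    # theta_c = n_c / n
            mu_c = Xc.mean(axis=0)               # mean of the x_i with y_i = c
            diff = Xc - mu_c
            Sigma_c = diff.T @ diff / n_c        # (1/n_c) sum (x_i - mu_c)(x_i - mu_c)^T
            params[c] = (theta_c, mu_c, Sigma_c)
        return params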
Other variants to simplify the model

▶ If we assume the same covariance matrix Σ for all the classes, the
  maximum likelihood estimate of Σ is

      Σ = ∑_{c=1}^{C} (n_c / n) Σ_c

▶ The covariance matrix can be restricted to be diagonal, or mostly diagonal
  with a few off-diagonal elements, based on prior knowledge.
▶ The covariance matrix can even be the identity matrix.
▶ The shape of the covariance is influenced both by assumptions about
  the domain and by the amount of data available.
▶ If the covariance matrices are different across classes, the model is called
  quadratic discriminant analysis (QDA); if the covariance matrices are
  the same, the model is called linear discriminant analysis (LDA); if
  the covariance matrices are diagonal, the model is called the naive Bayes
  classifier (NBC).
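A quick sketch of the shared-covariance estimate (again my own illustration, with an assumed function name pooled_covariance): each per-class covariance is weighted by its class proportion n_c / n.

    import numpy as np

    def pooled_covariance(X, y):
        """Shared covariance for LDA: Sigma = sum_c (n_c / n) * Sigma_c."""
        m, n = X.shape
        Sigma = np.zeros((n, n))
        for c in np.unique(y):
            Xc = X[y == c]
            diff = Xc - Xc.mean(axis=0)
            Sigma_c = diff.T @ diff / Xc.shape[0]     # per-class MLE covariance
            Sigma += (Xc.shape[0] / m) * Sigma_c      # weight by class proportion n_c / n
        return Sigma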
Classification using quadratic discriminant analysis

Recall:

    p(y = c) = θ_c;   p(x | y = c) = 1 / ((2π)^{n/2} |Σ_c|^{1/2}) · e^{−½ (x − µ_c)^⊤ Σ_c^{−1} (x − µ_c)}

Using the Bayes rule, we have

    p(y = c | x) ∝ θ_c |2πΣ_c|^{−1/2} e^{−½ (x − µ_c)^⊤ Σ_c^{−1} (x − µ_c)}

Predict the class label as the most probable label:

    y = arg max_c p(y = c | x)

Figure credit: Kevin Murphy
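The prediction rule can be written in a few lines. This sketch (mine, not from the slides) assumes per-class parameters stored as in the fit_gda sketch above, i.e. a dict mapping class label to (θ_c, µ_c, Σ_c), and compares log-posteriors.

    import numpy as np
    from scipy.stats import multivariate_normal

    def predict_qda(x, params):
        """QDA prediction: argmax_c  log theta_c + log N(x; mu_c, Sigma_c).
        `params` maps class label c -> (theta_c, mu_c, Sigma_c)."""
        scores = {c: np.log(theta) + multivariate_normal.logpdf(x, mean=mu, cov=Sigma)
                  for c, (theta, mu, Sigma) in params.items()}
        return max(scores, key=scores.get)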
Classification using linear discriminant analysis

If we assume the covariance matrices are the same for all the classes:

    p(y = c | x) ∝ θ_c e^{−½ (x − µ_c)^⊤ Σ^{−1} (x − µ_c)}
                 = e^{µ_c^⊤ Σ^{−1} x − ½ µ_c^⊤ Σ^{−1} µ_c + log θ_c} · e^{−½ x^⊤ Σ^{−1} x}
                 ∝ e^{µ_c^⊤ Σ^{−1} x − ½ µ_c^⊤ Σ^{−1} µ_c + log θ_c}

(the factor e^{−½ x^⊤ Σ^{−1} x} does not depend on c and can be dropped).

Let w_c = Σ^{−1} µ_c and b_c = −½ µ_c^⊤ Σ^{−1} µ_c + log θ_c  ⇒  we get a linear model!

Figure credit: Kevin Murphy
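Here is a short sketch of that linear form (my own illustration; the names lda_weights, predict_lda, and the parameter layout are assumptions): each class gets a weight vector w_c and bias b_c, and prediction is an argmax over linear scores.

    import numpy as np

    def lda_weights(theta, mu, Sigma):
        """w_c = Sigma^{-1} mu_c,  b_c = -0.5 * mu_c^T Sigma^{-1} mu_c + log theta_c."""
        w = np.linalg.solve(Sigma, mu)
        b = -0.5 * mu @ w + np.log(theta)
        return w, b

    def predict_lda(x, class_params, Sigma):
        """class_params maps label c -> (theta_c, mu_c); Sigma is the shared covariance."""
        scores = {}
        for c, (theta, mu) in class_params.items():
            w, b = lda_weights(theta, mu, Sigma)
            scores[c] = w @ x + b                 # score is linear in x
        return max(scores, key=scores.get)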
Bayes classifier for discrete features

▶ Idea: Use the training data to estimate p(y) and p(x|y)
▶ p(y) can be estimated in the same way as for continuous features
▶ How to estimate p(x|y) for discrete values?
  - Assume x = [x_1, . . . , x_n]^⊤ ∈ R^n has n features. Then, using the
    chain rule, we have

        p(x|y) = p(x_1|y) p(x_2|y, x_1) · · · p(x_n|y, x_1, . . . , x_{n−1})

    - even for binary features, this requires O(2^n) numbers to describe
      the model!
  - If we assume that the features x_j are conditionally independent
    given y, so that p(x_j | y, x_1, . . . , x_{j−1}) = p(x_j | y), then we have

        p(x|y) = p(x_1|y) p(x_2|y, x_1) · · · p(x_n|y, x_1, . . . , x_{n−1})
               = p(x_1|y) p(x_2|y) · · · p(x_n|y)

    - this only requires O(n) numbers to describe the model!
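To make the O(2^n) vs. O(n) comparison concrete, here is a tiny illustrative calculation (my own example, with an arbitrarily chosen n): a full table for p(x | y = c) over n binary features needs 2^n − 1 free probabilities per class, while the naive Bayes factorization needs only n per class.

    # Per-class parameter counts for binary features (illustrative arithmetic).
    n = 20
    full_joint = 2**n - 1      # one probability per configuration of x, minus normalization
    naive_bayes = n            # one Bernoulli parameter p(x_j = 1 | y = c) per feature
    print(full_joint, naive_bayes)   # 1048575 vs 20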
Conditional independence: an example

▶ A box contains two coins: a regular coin (R) and a fake two-headed
  coin (F). I choose a coin at random and toss it twice. Define the
  following two events:
  - A = First coin toss results in a head
  - B = Second coin toss results in a head
  Are A and B independent?
▶ p(A) = p(B) = p(head) = p(head|R) × p(R) + p(head|F) × p(F)
       = 1/2 × 1/2 + 1 × 1/2 = 3/4
  p(A, B) = p(head, head) = p(head, head|R) × p(R) + p(head, head|F) × p(F)
          = 1/2 × 1/2 × 1/2 + 1 × 1/2 = 5/8
  p(A)p(B) = 9/16 ≠ 5/8 = p(A, B)  ⇒  A and B are dependent!
▶ Consider an additional event:
  - C = Coin R (regular) has been selected.
  Then it is easy to show that p(A|C) p(B|C) = p(A, B|C)  ⇒  A and B are
  conditionally independent given C!
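These numbers can be checked empirically. The following Monte Carlo sketch (my own, not from the slides; trial count and seed are arbitrary) simulates choosing a coin and tossing it twice.

    import numpy as np

    # Monte Carlo check of the coin example: pick a coin uniformly, toss it twice.
    rng = np.random.default_rng(0)
    trials = 200_000
    fake = rng.random(trials) < 0.5                  # True -> two-headed coin F
    p_head = np.where(fake, 1.0, 0.5)                # per-trial probability of heads
    A = rng.random(trials) < p_head                  # first toss is a head
    B = rng.random(trials) < p_head                  # second toss is a head

    print(A.mean(), B.mean())                        # both close to 3/4
    print((A & B).mean(), A.mean() * B.mean())       # ~5/8 vs ~9/16 -> dependent
    reg = ~fake                                      # condition on C: regular coin chosen
    print((A & B)[reg].mean(), A[reg].mean() * B[reg].mean())   # both ~1/4 given C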
Naive Bayes classifier for binary features

▶ The model parameters are {θ_c = p(y = c)}_{c=1}^{C} and
  {β_{j,c} = p(x_j = 1 | y = c)}_{j,c=1}^{n,C}

▶ Predict the class label as the most probable label:

      y = arg max_c  p(y = c) ∏_{j=1}^{n} p(x_j | y = c)

▶ In practice, use the log trick to avoid numerical issues:

      y = arg max_c  log [ p(y = c) ∏_{j=1}^{n} p(x_j | y = c) ]
        = arg max_c  log p(y = c) + ∑_{j=1}^{n} log p(x_j | y = c)
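A minimal sketch of this prediction rule (mine, not from the slides; the name predict_nb_binary and the array layout of theta and beta are assumptions):

    import numpy as np

    def predict_nb_binary(x, theta, beta):
        """Naive Bayes prediction for binary features using the log trick.
        theta: (C,) class priors; beta: (n, C) with beta[j, c] = p(x_j = 1 | y = c);
        x: (n,) vector of 0/1 features."""
        # log p(x_j | y = c) is log beta[j, c] if x_j = 1, and log(1 - beta[j, c]) if x_j = 0
        log_lik = x[:, None] * np.log(beta) + (1 - x[:, None]) * np.log(1 - beta)
        scores = np.log(theta) + log_lik.sum(axis=0)   # log p(y=c) + sum_j log p(x_j | y=c)
        return np.argmax(scores)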
Maximum likelihood estimation for naive Bayes

▶ The log-likelihood function is

      log L({θ_c}_{c=1}^{C}, {β_{j,c}}_{j,c=1}^{n,C}) = ∑_{i=1}^{m} [ log p(y_i) + ∑_{j=1}^{n} log p(x_{i,j} | y_i) ]

▶ Computing the gradient with respect to θ_c and setting it to 0 gives us:

      θ_c = n_c / n

▶ Computing the gradient with respect to β_{j,c} and setting it to 0 gives us:

      β_{j,c} = p(x_j = 1 | y = c)
              = (number of instances for which x_{i,j} = 1 and y_i = c) /
                (number of instances for which y_i = c)
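Since both estimates are just counts, training reduces to a few array operations. This sketch (my own illustration; fit_nb_binary and the array layout are assumptions) produces theta and beta in the format used by the prediction sketch above.

    import numpy as np

    def fit_nb_binary(X, y):
        """Maximum likelihood estimates for binary-feature naive Bayes (no smoothing).
        X: (m, n) 0/1 matrix; y: (m,) labels in {0, ..., C-1}."""
        classes = np.unique(y)
        theta = np.array([np.mean(y == c) for c in classes])               # theta_c = n_c / m
        beta = np.column_stack([X[y == c].mean(axis=0) for c in classes])  # beta[j, c]
        return theta, beta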
Training Naive Bayes

[Worked example shown on slides 13-23 (figures only). Slide credit: Eric Eaton]
Laplace smoothing

▶ Notice that some probabilities estimated by counting might be zero!
▶ Instead of the maximum likelihood estimate

      β_{j,c} = (number of instances for which x_{i,j} = 1 and y_i = c) /
                (number of instances for which y_i = c)

  use

      β_{j,c} = ((number of instances for which x_{i,j} = 1 and y_i = c) + 1) /
                ((number of instances for which y_i = c) + K)

  - i.e., add 1 to each count; K is the number of values x_j can take
    (K = 2 for binary features).

▶ If a feature value appears many times, this estimate is only slightly
  different from the maximum likelihood estimate.
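A smoothed version of the earlier training sketch (again my own illustration; fit_nb_binary_laplace is an assumed name) only changes the β estimate:

    import numpy as np

    def fit_nb_binary_laplace(X, y):
        """Naive Bayes estimates for binary features with Laplace (add-one) smoothing."""
        classes = np.unique(y)
        theta = np.array([np.mean(y == c) for c in classes])
        # beta[j, c] = (#{x_ij = 1, y_i = c} + 1) / (#{y_i = c} + 2), since x_j takes 2 values
        beta = np.column_stack([(X[y == c].sum(axis=0) + 1.0) / (np.sum(y == c) + 2.0)
                                for c in classes])
        return theta, beta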
Training Naive Bayes with Laplace smoothing

[Worked example shown on slides 25-28 (figures only). Slide credit: Eric Eaton]
Generative model summary

▶ Advantages:
  - Easy to train
  - Can handle streaming data well
  - Can handle both real and discrete data

▶ Disadvantages:
  - Requires additional assumptions (e.g., Gaussian distribution,
    conditional independence of features)
  - Cannot handle high-dimensional data very well
