Lecture 6: Generative Models

The document discusses the differences between discriminative and generative models in classification, focusing on logistic regression as a discriminative model and Bayesian approaches as generative models. It explains how to estimate parameters for Gaussian discriminant analysis and introduces naive Bayes classifiers for both continuous and discrete features, including the use of Laplace smoothing. The advantages and disadvantages of generative models are also summarized, highlighting their ease of training and limitations in handling high-dimensional data.

Artificial Intelligence II (CS4442 & CS9542)

Classification: Generative Models

Boyu Wang
Department of Computer Science
University of Western Ontario
Discriminative model vs. generative model

▶ Recall: in logistic regression, we directly model p(y|x):

      p(y = 1 | x; w) ≜ σ(h_w(x)) = 1 / (1 + e^{−w^⊤ x})

  - This is called a discriminative model, because we only care about
    discriminating between examples of the two classes.

▶ Another way: model p(y) and p(x|y), and then use Bayes' rule:

      p(y = 1 | x) = p(x, y = 1) / p(x)
                   = p(x | y = 1) p(y = 1) / [ p(x | y = 1) p(y = 1) + p(x | y = 0) p(y = 0) ]

  - This is called a generative model, because we can actually use the
    model to generate data.
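To make the Bayes-rule computation above concrete, here is a minimal Python sketch (my own illustration, not from the slides). It assumes a made-up one-dimensional generative model: a prior p(y = 1) and Gaussian class-conditional densities with invented parameters.

    import numpy as np
    from scipy.stats import norm

    # Hypothetical 1-D generative model: p(y) and p(x|y) are assumed, not estimated.
    prior_y1 = 0.3                             # p(y = 1)
    p_x_given_y1 = norm(loc=2.0, scale=1.0)    # p(x | y = 1)
    p_x_given_y0 = norm(loc=0.0, scale=1.0)    # p(x | y = 0)

    def posterior_y1(x):
        """Bayes' rule: p(y=1|x) = p(x|y=1)p(y=1) / [p(x|y=1)p(y=1) + p(x|y=0)p(y=0)]."""
        num = p_x_given_y1.pdf(x) * prior_y1
        den = num + p_x_given_y0.pdf(x) * (1.0 - prior_y1)
        return num / den

    print(posterior_y1(1.5))   # posterior probability of class 1 at x = 1.5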
Bayes classifier for continuous features

▶ Idea: Use the training data to estimate p(y) and p(x|y)
▶ p(y) can be estimated by counting the number of data points of each class.
▶ How to estimate p(x|y)?
  - Need additional assumptions (for continuous inputs): multivariate Gaussian
    with mean µ ∈ R^n and covariance Σ ∈ R^{n×n}
  - Each class has its own mean µ_c and covariance Σ_c, c ∈ {0, 1}
Examples of multivariate Gaussian distribution

Figure: 2D Gaussian distributions with different Σ

Figure credit: Doina Precup

Gaussian discriminant analysis

▶ For 2 classes:

      p(y = 1) = θ;  p(y = 0) = 1 − θ
      p(x | y = 1) = 1 / ((2π)^{n/2} |Σ_1|^{1/2}) · e^{−½ (x − µ_1)^⊤ Σ_1^{−1} (x − µ_1)}
      p(x | y = 0) = 1 / ((2π)^{n/2} |Σ_0|^{1/2}) · e^{−½ (x − µ_0)^⊤ Σ_0^{−1} (x − µ_0)}

▶ The parameters to estimate are: θ, µ_1, Σ_1, µ_0, Σ_0

▶ For C classes:

      p(y = c) = θ_c,  s.t. ∑_{c=1}^{C} θ_c = 1
      p(x | y = c) = 1 / ((2π)^{n/2} |Σ_c|^{1/2}) · e^{−½ (x − µ_c)^⊤ Σ_c^{−1} (x − µ_c)}

▶ The parameters to estimate are: {θ_c, µ_c, Σ_c}_{c=1}^{C}
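The class-conditional density above is easy to evaluate directly. Below is a small NumPy sketch (my own, not from the slides; the function name and the example parameters are made up) that computes the multivariate Gaussian log-density in a numerically stable way.

    import numpy as np

    def gaussian_log_density(x, mu, Sigma):
        """Log of the multivariate Gaussian density used as p(x | y = c)."""
        n = len(mu)
        diff = x - mu
        _, logdet = np.linalg.slogdet(Sigma)          # log |Sigma|, numerically stable
        quad = diff @ np.linalg.solve(Sigma, diff)    # (x - mu)^T Sigma^{-1} (x - mu)
        return -0.5 * (n * np.log(2 * np.pi) + logdet + quad)

    # Illustrative (made-up) parameters for one class:
    mu_c = np.array([0.0, 1.0])
    Sigma_c = np.array([[2.0, 0.3],
                        [0.3, 1.0]])
    print(np.exp(gaussian_log_density(np.array([0.5, 0.5]), mu_c, Sigma_c)))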
Estimate the parameters

▶ We can write down the likelihood function, as for linear regression
  and logistic regression.
▶ Compute the gradient with respect to the parameters and set it to 0:
  - The parameter θ_c is given by θ_c = n_c / n, where n_c is the number of
    instances of class c.
  - The mean µ_c is given by

        µ_c = (1 / n_c) ∑_{i : y_i = c} x_i

  - The covariance matrix Σ_c is given by

        Σ_c = (1 / n_c) ∑_{i : y_i = c} (x_i − µ_c)(x_i − µ_c)^⊤
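These closed-form estimates translate directly into code. The following is a minimal NumPy sketch (my own illustration, not code from the course; the names fit_gda, X, and y are assumptions) that computes θ_c, µ_c, and Σ_c per class from a data matrix and label vector.

    import numpy as np

    def fit_gda(X, y):
        """Maximum likelihood estimates for Gaussian discriminant analysis.
        X: (m, n) data matrix; y: (m,) integer class labels."""
        classes = np.unique(y)
        m = X.shape[0]
        params = {}
        for c in classes:
            Xc = X[y == c]                       # instances of class c
            n_c = Xc.shape[0]
            theta_c = n_c / m                    # theta_c = n_c / n
            mu_c = Xc.mean(axis=0)               # mean of the x_i with y_i = c
            diff = Xc - mu_c
            Sigma_c = diff.T @ diff / n_c        # (1/n_c) sum (x_i - mu_c)(x_i - mu_c)^T
            params[c] = (theta_c, mu_c, Sigma_c)
        return params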
Other variants to simplify the model

▶ If we assume the same covariance matrix Σ for all the classes, the
  maximum likelihood estimate of Σ is

      Σ = ∑_{c=1}^{C} (n_c / n) Σ_c

▶ The covariance matrix can be restricted to be diagonal, or mostly diagonal
  with a few off-diagonal elements, based on prior knowledge.
▶ The covariance matrix can even be the identity matrix.
▶ The shape of the covariance is influenced both by assumptions about
  the domain and by the amount of data available.
▶ If the covariance matrices are different across classes, the model is called
  quadratic discriminant analysis (QDA); if the covariance matrices are
  the same, the model is called linear discriminant analysis (LDA); if
  the covariance matrices are diagonal, the model is called the naive Bayes
  classifier (NBC).
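A quick sketch of the shared-covariance estimate (again my own illustration, with an assumed function name pooled_covariance): each per-class covariance is weighted by its class proportion n_c / n.

    import numpy as np

    def pooled_covariance(X, y):
        """Shared covariance for LDA: Sigma = sum_c (n_c / n) * Sigma_c."""
        m, n = X.shape
        Sigma = np.zeros((n, n))
        for c in np.unique(y):
            Xc = X[y == c]
            diff = Xc - Xc.mean(axis=0)
            Sigma_c = diff.T @ diff / Xc.shape[0]     # per-class MLE covariance
            Sigma += (Xc.shape[0] / m) * Sigma_c      # weight by class proportion n_c / n
        return Sigma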
Classification using quadratic discriminant analysis

Recall:

    p(y = c) = θ_c;   p(x | y = c) = 1 / ((2π)^{n/2} |Σ_c|^{1/2}) · e^{−½ (x − µ_c)^⊤ Σ_c^{−1} (x − µ_c)}

Using the Bayes rule, we have

    p(y = c | x) ∝ θ_c |2πΣ_c|^{−1/2} e^{−½ (x − µ_c)^⊤ Σ_c^{−1} (x − µ_c)}

Predict the class label as the most probable label:

    y = arg max_c p(y = c | x)

Figure credit: Kevin Murphy
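The prediction rule can be written in a few lines. This sketch (mine, not from the slides) assumes per-class parameters stored as in the fit_gda sketch above, i.e. a dict mapping class label to (θ_c, µ_c, Σ_c), and compares log-posteriors.

    import numpy as np
    from scipy.stats import multivariate_normal

    def predict_qda(x, params):
        """QDA prediction: argmax_c  log theta_c + log N(x; mu_c, Sigma_c).
        `params` maps class label c -> (theta_c, mu_c, Sigma_c)."""
        scores = {c: np.log(theta) + multivariate_normal.logpdf(x, mean=mu, cov=Sigma)
                  for c, (theta, mu, Sigma) in params.items()}
        return max(scores, key=scores.get)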
Classification using linear discriminant analysis

If we assume the covariance matrices are the same for all the classes:

    p(y = c | x) ∝ θ_c e^{−½ (x − µ_c)^⊤ Σ^{−1} (x − µ_c)}
                 = e^{µ_c^⊤ Σ^{−1} x − ½ µ_c^⊤ Σ^{−1} µ_c + log θ_c} · e^{−½ x^⊤ Σ^{−1} x}
                 ∝ e^{µ_c^⊤ Σ^{−1} x − ½ µ_c^⊤ Σ^{−1} µ_c + log θ_c}

(the factor e^{−½ x^⊤ Σ^{−1} x} does not depend on c and can be dropped).

Let w_c = Σ^{−1} µ_c and b_c = −½ µ_c^⊤ Σ^{−1} µ_c + log θ_c  ⇒  we get a linear model!

Figure credit: Kevin Murphy
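Here is a short sketch of that linear form (my own illustration; the names lda_weights, predict_lda, and the parameter layout are assumptions): each class gets a weight vector w_c and bias b_c, and prediction is an argmax over linear scores.

    import numpy as np

    def lda_weights(theta, mu, Sigma):
        """w_c = Sigma^{-1} mu_c,  b_c = -0.5 * mu_c^T Sigma^{-1} mu_c + log theta_c."""
        w = np.linalg.solve(Sigma, mu)
        b = -0.5 * mu @ w + np.log(theta)
        return w, b

    def predict_lda(x, class_params, Sigma):
        """class_params maps label c -> (theta_c, mu_c); Sigma is the shared covariance."""
        scores = {}
        for c, (theta, mu) in class_params.items():
            w, b = lda_weights(theta, mu, Sigma)
            scores[c] = w @ x + b                 # score is linear in x
        return max(scores, key=scores.get)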
Bayes classifier for discrete features

▶ Idea: Use the training data to estimate p(y) and p(x|y)
▶ p(y) can be estimated in the same way as for continuous features
▶ How to estimate p(x|y) for discrete values?
  - Assume x = [x_1, . . . , x_n]^⊤ ∈ R^n has n features. Then, using the
    chain rule, we have

        p(x|y) = p(x_1|y) p(x_2|y, x_1) · · · p(x_n|y, x_1, . . . , x_{n−1})

    - even for binary features, this requires O(2^n) numbers to describe
      the model!
  - If we assume that the features x_j are conditionally independent
    given y, so that p(x_j | y, x_1, . . . , x_{j−1}) = p(x_j | y), then we have

        p(x|y) = p(x_1|y) p(x_2|y, x_1) · · · p(x_n|y, x_1, . . . , x_{n−1})
               = p(x_1|y) p(x_2|y) · · · p(x_n|y)

    - this only requires O(n) numbers to describe the model!
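To make the O(2^n) vs. O(n) comparison concrete, here is a tiny illustrative calculation (my own example, with an arbitrarily chosen n): a full table for p(x | y = c) over n binary features needs 2^n − 1 free probabilities per class, while the naive Bayes factorization needs only n per class.

    # Per-class parameter counts for binary features (illustrative arithmetic).
    n = 20
    full_joint = 2**n - 1      # one probability per configuration of x, minus normalization
    naive_bayes = n            # one Bernoulli parameter p(x_j = 1 | y = c) per feature
    print(full_joint, naive_bayes)   # 1048575 vs 20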
Conditional independence: an example

▶ A box contains two coins: a regular coin (R) and a fake two-headed
  coin (F). I choose a coin at random and toss it twice. Define the
  following two events:
  - A = First coin toss results in a head
  - B = Second coin toss results in a head
  Are A and B independent?
▶ p(A) = p(B) = p(head) = p(head|R) × p(R) + p(head|F) × p(F)
       = 1/2 × 1/2 + 1 × 1/2 = 3/4
  p(A, B) = p(head, head) = p(head, head|R) × p(R) + p(head, head|F) × p(F)
          = 1/2 × 1/2 × 1/2 + 1 × 1/2 = 5/8
  p(A)p(B) = 9/16 ≠ 5/8 = p(A, B)  ⇒  A and B are dependent!
▶ Consider an additional event:
  - C = Coin R (regular) has been selected.
  Then it is easy to show that p(A|C) p(B|C) = p(A, B|C)  ⇒  A and B are
  conditionally independent given C!
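These numbers can be checked empirically. The following Monte Carlo sketch (my own, not from the slides; trial count and seed are arbitrary) simulates choosing a coin and tossing it twice.

    import numpy as np

    # Monte Carlo check of the coin example: pick a coin uniformly, toss it twice.
    rng = np.random.default_rng(0)
    trials = 200_000
    fake = rng.random(trials) < 0.5                  # True -> two-headed coin F
    p_head = np.where(fake, 1.0, 0.5)                # per-trial probability of heads
    A = rng.random(trials) < p_head                  # first toss is a head
    B = rng.random(trials) < p_head                  # second toss is a head

    print(A.mean(), B.mean())                        # both close to 3/4
    print((A & B).mean(), A.mean() * B.mean())       # ~5/8 vs ~9/16 -> dependent
    reg = ~fake                                      # condition on C: regular coin chosen
    print((A & B)[reg].mean(), A[reg].mean() * B[reg].mean())   # both ~1/4 given C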
Naive Bayes classifier for binary features

▶ The model parameters are {θ_c = p(y = c)}_{c=1}^{C} and
  {β_{j,c} = p(x_j = 1 | y = c)}_{j,c=1}^{n,C}

▶ Predict the class label as the most probable label:

      y = arg max_c  p(y = c) ∏_{j=1}^{n} p(x_j | y = c)

▶ In practice, use the log trick to avoid numerical issues:

      y = arg max_c  log [ p(y = c) ∏_{j=1}^{n} p(x_j | y = c) ]
        = arg max_c  log p(y = c) + ∑_{j=1}^{n} log p(x_j | y = c)
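A minimal sketch of this prediction rule (mine, not from the slides; the name predict_nb_binary and the array layout of theta and beta are assumptions):

    import numpy as np

    def predict_nb_binary(x, theta, beta):
        """Naive Bayes prediction for binary features using the log trick.
        theta: (C,) class priors; beta: (n, C) with beta[j, c] = p(x_j = 1 | y = c);
        x: (n,) vector of 0/1 features."""
        # log p(x_j | y = c) is log beta[j, c] if x_j = 1, and log(1 - beta[j, c]) if x_j = 0
        log_lik = x[:, None] * np.log(beta) + (1 - x[:, None]) * np.log(1 - beta)
        scores = np.log(theta) + log_lik.sum(axis=0)   # log p(y=c) + sum_j log p(x_j | y=c)
        return np.argmax(scores)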
Maximum likelihood estimation for naive Bayes

▶ The log-likelihood function is

      log L({θ_c}_{c=1}^{C}, {β_{j,c}}_{j,c=1}^{n,C}) = ∑_{i=1}^{m} [ log p(y_i) + ∑_{j=1}^{n} log p(x_{i,j} | y_i) ]

▶ Computing the gradient with respect to θ_c and setting it to 0 gives us:

      θ_c = n_c / n

▶ Computing the gradient with respect to β_{j,c} and setting it to 0 gives us:

      β_{j,c} = p(x_j = 1 | y = c)
              = (number of instances for which x_{i,j} = 1 and y_i = c) /
                (number of instances for which y_i = c)
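Since both estimates are just counts, training reduces to a few array operations. This sketch (my own illustration; fit_nb_binary and the array layout are assumptions) produces theta and beta in the format used by the prediction sketch above.

    import numpy as np

    def fit_nb_binary(X, y):
        """Maximum likelihood estimates for binary-feature naive Bayes (no smoothing).
        X: (m, n) 0/1 matrix; y: (m,) labels in {0, ..., C-1}."""
        classes = np.unique(y)
        theta = np.array([np.mean(y == c) for c in classes])               # theta_c = n_c / m
        beta = np.column_stack([X[y == c].mean(axis=0) for c in classes])  # beta[j, c]
        return theta, beta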
Training Naive Bayes

[Worked example shown on slides 13-23 (figures only). Slide credit: Eric Eaton]
Laplace smoothing

▶ Notice that some probabilities estimated by counting might be zero!
▶ Instead of the maximum likelihood estimate

      β_{j,c} = (number of instances for which x_{i,j} = 1 and y_i = c) /
                (number of instances for which y_i = c)

  use

      β_{j,c} = ((number of instances for which x_{i,j} = 1 and y_i = c) + 1) /
                ((number of instances for which y_i = c) + K)

  - i.e., add 1 to each count; K is the number of values x_j can take
    (K = 2 for binary features).

▶ If a feature value appears many times, this estimate is only slightly
  different from the maximum likelihood estimate.
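A smoothed version of the earlier training sketch (again my own illustration; fit_nb_binary_laplace is an assumed name) only changes the β estimate:

    import numpy as np

    def fit_nb_binary_laplace(X, y):
        """Naive Bayes estimates for binary features with Laplace (add-one) smoothing."""
        classes = np.unique(y)
        theta = np.array([np.mean(y == c) for c in classes])
        # beta[j, c] = (#{x_ij = 1, y_i = c} + 1) / (#{y_i = c} + 2), since x_j takes 2 values
        beta = np.column_stack([(X[y == c].sum(axis=0) + 1.0) / (np.sum(y == c) + 2.0)
                                for c in classes])
        return theta, beta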
Training Naive Bayes with Laplace smoothing

[Worked example shown on slides 25-28 (figures only). Slide credit: Eric Eaton]
Generative model summary

▶ Advantages:
  - Easy to train
  - Can handle streaming data well
  - Can handle both real and discrete data

▶ Disadvantages:
  - Requires additional assumptions (e.g., Gaussian distribution,
    conditional independence of features)
  - Cannot handle high-dimensional data very well
