McGill COMP 550, Fall 2024 lecture note

Lecture 3: Linear Classifiers

Instructor: Jackie CK Cheung & David Adelani


COMP-550
Readings: Eisenstein Ch. 2
Classification
Map input 𝑥 to output 𝑦:
𝑦 = 𝑓(𝑥)

Classification: 𝑦 is a discrete outcome


• Genre of the document (news text, novel, …?)
• Overall topic of the document
• Spam vs. non-spam
• Identity, gender, native language, etc. of author
• Positive vs. negative movie review
• Other examples?

2
Review of Last Lecture
How is classification different from regression?

What does it mean to train a text classifier?

What is the use of a training set? A validation set? A
test set?

3
Cross Validation
k-fold cross validation: splitting training data into k
partitions or folds; iteratively test on each after training
on the rest
e.g., 3-fold CV: split dataset into 3 folds
         Fold 1   Fold 2   Fold 3
Exp. 1   test     train    train
Exp. 2   train    test     train
Exp. 3   train    train    test
Average results from above experiments
• CV is often used if the corpus is small
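
A minimal sketch of 3-fold cross validation with scikit-learn; the toy data below are made up, and cross_val_score handles the fold splitting:

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

# Toy data: six samples with two features each, three per class.
X = [[0, 0], [0, 1], [1, 0], [3, 3], [3, 4], [4, 3]]
y = [0, 0, 0, 1, 1, 1]

# cv=3 splits the data into 3 folds; each fold is used once as the test set.
scores = cross_val_score(LogisticRegression(), X, y, cv=3)
print(scores, scores.mean())   # one accuracy per fold, then the average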

4
Supervised Classifiers in Python
scikit-learn has many simple classifiers implemented,
with a common interface.

e.g., SVMs
>>> from sklearn import svm
>>> X = [[0, 0], [1, 1]]        # training samples (two features each)
>>> y = [0, 1]                  # their class labels
>>> clf = svm.SVC()             # a support vector classifier
>>> clf.fit(X, y)
>>> clf.predict([[2., 2.]])     # predicts class 1 for the new point

5
Steps
1. Define problem and collect data set
2. Extract features from documents
3. Train a classifier on a training set [today]
4. Apply classifier on test data

6
Feature Extraction
y = f(x⃗)
(x⃗: the document; f: the classifier; y: the document label)

Represent document x⃗ as a list of features


[Figure: a document's text is mapped to a feature vector
x_1, x_2, x_3, … = 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0, …]
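
As a concrete illustration, a binary bag-of-words representation can be built with scikit-learn's CountVectorizer; the documents below are made-up examples:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["buy my stuff now", "lecture notes on linear classifiers"]
vectorizer = CountVectorizer(binary=True)   # one 1.0/0.0 feature per word type
X = vectorizer.fit_transform(docs)          # one feature vector per document
print(vectorizer.get_feature_names_out())
print(X.toarray())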

7
Think Abstractly
y = f(x⃗)
(x⃗: the document; f: the classifier; y: the document label)

What are possible choices for the form of f?


Some popular approaches:
• Naïve Bayes
• Logistic regression
• Support vector machines
• Artificial neural networks – nonlinear, for next class

8
Training
y = f(x⃗)
Say we select an architecture (e.g., Naïve Bayes). f can
now be described in terms of parameters θ:
y = f(x⃗; θ)

Training the model specifically means to select
parameters θ* according to some objective function
(e.g., minimize error on the training set; maximize
likelihood of the training data).

9
Naïve Bayes
A probabilistic classifier that uses Bayes' rule:
P(y | x⃗) = P(x⃗, y) / P(x⃗) = P(y) P(x⃗ | y) / P(x⃗)
Naïve Bayes is a generative model
• Probabilistic account of the data P(x⃗, y)
• Naïve Bayes assumes the dataset is generated in the
following way:
For each sample:
1. Generate the label from P(y)
2. Generate the feature vector x⃗ by generating each feature
independently, conditioned on y
• P(x_i | y)
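
A minimal sketch of this generative story; all probabilities below are made-up numbers for illustration:

import random

random.seed(0)
p_y = {"spam": 0.3, "non-spam": 0.7}                          # P(y)
p_xi_given_y = {"spam": [0.8, 0.1], "non-spam": [0.2, 0.6]}   # P(x_i = 1 | y)

# 1. Generate the label from P(y)
y = random.choices(list(p_y), weights=list(p_y.values()))[0]
# 2. Generate each feature independently, conditioned on y
x = [1 if random.random() < p else 0 for p in p_xi_given_y[y]]
print(y, x)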
10
Naïve Bayes Graphically
Assumption about how data is generated, as a
probabilistic graphical model:

[Figure: graphical model in which the label y generates the features
x_1, x_2, …, x_n]

P(x⃗, y) = P(y) ∏_i P(x_i | y)
Note how the independence
between features is expressed!
11
Naïve Bayes Model Parameters
The parameters to the model, θ, consist of:
• Parameters of the prior class distribution P(y)
• Parameters of each feature's distribution conditioned on
class, P(x_i | y)

With discrete data, we assume that the distributions
P(y) and P(x_i | y) are categorical distributions

12
Reminder: Categorical Distribution
A categorical random variable follows this distribution
if it can take one of k outcomes, each with a certain
probability
• The probabilities of the outcomes must sum to 1
Examples:
• Coin flip (k = 2; Bernoulli distribution)
• Die roll (k = 6)
• Distribution of class labels (e.g., spam vs non-spam, k =
number of classes)
• Generating unigrams! (k = size of vocabulary)

13
Training a Naïve Bayes Classifier
Objective: pick θ so as to maximize the likelihood of
the training corpus, D:
L_ML(θ) = ∏_{(x⃗, y) ∈ D} P(x⃗, y; θ)
        = ∏_{(x⃗, y) ∈ D} P(y) ∏_i P(x_i | y)

Can show that this boils down to computing relative
frequencies:
P(Y = y) should be set to the proportion of samples with
class y
P(X_i = x | Y = y) should be set to the proportion of samples
with feature value x among samples of class y
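
A minimal counting sketch of these relative-frequency estimates; the toy labelled data and variable names are made up:

from collections import Counter, defaultdict

labels = ["spam", "spam", "non-spam"]
features = [{"money": 1, "meeting": 0},
            {"money": 1, "meeting": 1},
            {"money": 0, "meeting": 1}]

# P(Y = y): proportion of samples with class y
p_y = {y: c / len(labels) for y, c in Counter(labels).items()}

# P(X_i = x | Y = y): proportion of samples with value x among samples of class y
value_counts = defaultdict(Counter)
for x, y in zip(features, labels):
    for name, value in x.items():
        value_counts[(y, name)][value] += 1
p_x_given_y = {key: {v: c / sum(cnt.values()) for v, c in cnt.items()}
               for key, cnt in value_counts.items()}

print(p_y)
print(p_x_given_y[("spam", "money")])   # {1: 1.0} for this toy data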
14
Inference in Naïve Bayes
After training, we would like to classify a new instance
(e.g., is a new document spam?)
• i.e., we want P(y | x⃗)
Easy to get from P(x⃗, y):
P(y | x⃗) = P(x⃗, y) / P(x⃗)
         = P(y) ∏_i P(x_i | y) / P(x⃗)

To calculate the denominator P(x⃗), marginalize over the random
variable y by summing up the numerator for all possible classes
(all possible values of y).
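
A minimal sketch of that marginalization with a single feature; the per-class numbers are the illustrative probabilities from the summary slide that follows:

# Per-class joint scores P(y) * prod_i P(x_i | y) for one document
joint = {"spam": 0.24 * 0.07, "non-spam": 0.76 * 0.0024}

p_x = sum(joint.values())                           # P(x⃗): sum over all classes y
posterior = {y: s / p_x for y, s in joint.items()}  # P(y | x⃗)
print(posterior)                                    # sums to 1 by construction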

15
Naïve Bayes in Summary
Bayes' rule:
P(y | x⃗) = P(y) P(x⃗ | y) / P(x⃗)
Assume that all the features are independent:
P(y | x⃗) = P(y) ∏_i P(x_i | y) / P(x⃗)

Training the model means estimating the parameters
P(y) and P(x_i | y).
• e.g., P(SPAM) = 0.24, P(NON-SPAM) = 0.76
P(money at home|SPAM) = 0.07
P(money at home|NON-SPAM) = 0.0024

16
Exercise: Train a NB Classifier
Table of whether a student will get an A or not based
on their habits (nominal data, Bernoulli distributions):
Reviews notes   Does assignments   Asks questions   Grade
Y               N                  Y                A
Y               Y                  N                A
N               Y                  N                A
Y               N                  N                non-A
N               Y                  Y                non-A
N               N                  Y                ?

What is the probability that this student gets an A?
• Doesn't review notes, no assignments, asks questions
P(y | x⃗) = P(y) ∏_i P(x_i | y) / P(x⃗)

17
Train a NB Classifier (solution)
Bayes' rule:
P(y = A | x⃗) = P(y = A) P(x⃗ | y = A) / P(x⃗)
P(y = ¬A | x⃗) = P(y = ¬A) P(x⃗ | y = ¬A) / P(x⃗)

Assume that all the features are independent:
P(y = A | x⃗)  = 3/5 · 1/3 · 1/3 · 1/3 · 1/P(x⃗) = 1 / (45 P(x⃗))
P(y = ¬A | x⃗) = 2/5 · 1/2 · 1/2 · 1/2 · 1/P(x⃗) = 1 / (20 P(x⃗))

P(y = A | x⃗) = 1 / (45 P(x⃗)) < 1 / (20 P(x⃗)) = P(y = ¬A | x⃗)

Reviews notes   Does assignments   Asks questions   Grade
Y               N                  Y                A
Y               Y                  N                A
N               Y                  N                A
Y               N                  N                non-A
N               Y                  Y                non-A
N               N                  Y                non-A (predicted)
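
A quick counting check of the numbers above (a sketch; the table is encoded directly as tuples):

from fractions import Fraction as F

# (reviews notes, does assignments, asks questions, grade) for the 5 labelled students
data = [("Y", "N", "Y", "A"), ("Y", "Y", "N", "A"), ("N", "Y", "N", "A"),
        ("Y", "N", "N", "non-A"), ("N", "Y", "Y", "non-A")]
query = ("N", "N", "Y")   # the student to classify

for grade in ("A", "non-A"):
    rows = [r for r in data if r[3] == grade]
    score = F(len(rows), len(data))                                  # prior P(y)
    for i, value in enumerate(query):
        score *= F(sum(r[i] == value for r in rows), len(rows))      # P(x_i | y)
    print(grade, score)   # A: 1/45, non-A: 1/20 (each still divided by P(x⃗))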
18
Type/Token Distinction
What if a word appears more than once in a
document? Frequency matters!
Type: the identity of a word (i.e., count unique words)
Token: an instance of a word (i.e., each occurrence is
separate)
In text classification, we usually deal with tokens, and
assume that there is a categorical distribution that is
used to generate all of the tokens seen in a sample,
conditioned on class y.
Document: "yo buy my stuff yo"   class: spam
P(spam) P(yo|spam) P(buy|spam) P(my|spam) P(stuff|spam) P(yo|spam)
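
A minimal sketch of scoring the example document token by token; the probability tables are made-up numbers:

import math

p_class = {"spam": 0.5}
p_word_given_spam = {"yo": 0.10, "buy": 0.05, "my": 0.05, "stuff": 0.02}

tokens = "yo buy my stuff yo".split()   # 5 tokens, 4 types ("yo" occurs twice)
log_joint = math.log(p_class["spam"]) + sum(
    math.log(p_word_given_spam[t]) for t in tokens)   # one factor per *token*
print(log_joint)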

19
Generative vs. Discriminative
Generative models learn a distribution for all of the
random variables involved: the joint distribution, P(x⃗, y)
But for text classification, we really only care about the
conditional distribution P(y | x⃗)!

Discriminative models directly parameterize and learn
P(y | x⃗)
• May be easier than learning the joint!
• Can flexibly design many different features
• Model can only do classification!

20
Logistic Regression
Linear regression:
y = a_1 x_1 + a_2 x_2 + … + a_n x_n + b
Intuition: Linear regression gives us continuous values
in (-∞, ∞); let's squish the values to be in [0, 1]!
Function that does this: the logistic function
P(y | x⃗) = (1/Z) e^{a_1 x_1 + a_2 x_2 + … + a_n x_n + b}
This Z is a normalizing constant to ensure
this is a probability distribution.

(a.k.a., maximum entropy or MaxEnt classifier)

N.B.: Don't be confused by the name: this method is most often used to solve
classification problems.
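
A minimal sketch of the binary case, where the normalizer works out to Z = 1 + e^{a·x + b}, giving the familiar logistic (sigmoid) curve; the weights below are arbitrary:

import numpy as np

def p_positive(x, a, b):
    score = np.dot(a, x) + b
    return 1.0 / (1.0 + np.exp(-score))   # = e^score / (1 + e^score)

print(p_positive(np.array([1.0, 2.0]), np.array([0.3, -0.1]), 0.05))  # ≈ 0.54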

21
Logistic Function
[Figure: the logistic curve.
 y-axis: P(y | x⃗) = (1/Z) e^{a_1 x_1 + a_2 x_2 + … + a_n x_n + b}
 x-axis: a_1 x_1 + a_2 x_2 + … + a_n x_n + b]

22
Features Can Be Anything!
We don't have to care about generating the data, so we
can go wild in designing features!
• Does the document start with a capitalized letter?
• What is the length of the document in words? In
sentences?
• Actually, would usually scale and/or bin this
• How many sentiment-bearing words are there?
In practice, the features depend on both the document
and the proposed class:
• Does the document contain the word money with the
proposed class being spam?
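
A hypothetical sketch of such a (document, class) feature; the function name and inputs are made up for illustration:

def feature_money_and_spam(doc_tokens, proposed_class):
    # Fires (value 1.0) only if "money" occurs AND the candidate label is spam.
    return 1.0 if ("money" in doc_tokens and proposed_class == "spam") else 0.0

print(feature_money_and_spam(["send", "money", "now"], "spam"))      # 1.0
print(feature_money_and_spam(["send", "money", "now"], "non-spam"))  # 0.0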

23
Parameters in Logistic Regression
P(y | x⃗; θ) = (1/Z) e^{a_1 x_1 + a_2 x_2 + … + a_n x_n + b}
where θ = {a_1, a_2, …, a_n, b}

Learning means to maximize the conditional likelihood
of the training corpus:
L_CL(θ) = ∏_{(x⃗, y) ∈ D} P(y | x⃗; θ)
or more usually, the log conditional likelihood:
log L_CL(θ) = ∑_{(x⃗, y) ∈ D} log P(y | x⃗; θ)

24
Optimizing the Objective
We want to maximize
log L_CL(θ) = ∑_{(x⃗, y) ∈ D} log P(y | x⃗; θ)
            = ∑_{(x⃗, y) ∈ D} log((1/Z) e^{a_1 x_1 + a_2 x_2 + … + a_n x_n + b})
            = ∑_{(x⃗, y) ∈ D} ((∑_i a_i x_i) + b − log Z)
This can be optimized by gradient descent
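
A minimal sketch for the binary case: gradient ascent on the log conditional likelihood (equivalently, gradient descent on its negative); the toy data, step size, and iteration count are made up:

import numpy as np

X = np.array([[0.0, 0.0], [1.0, 1.0], [0.2, 0.1], [0.9, 1.2]])
y = np.array([0, 1, 0, 1])
a, b = np.zeros(2), 0.0

for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ a + b)))   # P(y = 1 | x) under current parameters
    a += 0.1 * (X.T @ (y - p))               # gradient of the log-likelihood w.r.t. a
    b += 0.1 * np.sum(y - p)                 # ... and w.r.t. b
print(a, b)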

25
Support Vector Machines
Let’s visualize 𝑥⃗ as points in a high dimensional space.
e.g., if we have two features, each sample is a point in a
2D scatter plot. Label y using colour.

[Figure: samples plotted in a 2D feature space with axes x_1 and x_2,
 coloured by label y]

26
Support Vector Machines
A SVM learns a decision boundary as a line (or
hyperplane when >2 features)

[Figure: the same 2D plot with a linear decision boundary separating the
 two classes]

27
Margin
This hyperplane is chosen to maximize the margin to
the nearest sample in each of the two classes.
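
A minimal sklearn sketch of fitting such a max-margin classifier on toy points; the C parameter controls the soft margin used when the data are not perfectly separable, as noted below the figure:

from sklearn import svm

X = [[0, 0], [0.2, 0.1], [1, 1], [0.9, 1.1]]
y = [0, 0, 1, 1]
clf = svm.SVC(kernel="linear", C=1.0).fit(X, y)
print(clf.support_vectors_)   # the samples closest to the decision boundary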

[Figure: the decision boundary with the margin to the nearest sample in
 each class shown]

The method also deals with the fact that the samples
may not be linearly separable.
28
SVMs – Generative or Discriminative?
Are SVMs a generative or a discriminative model?

[Figure: same margin plot as on the previous slide]

29
How To Decide?
• Naïve Bayes, logistic regression, and SVMs can all
work well in different tasks and settings.
• Usually, given little training data, Naïve Bayes is a
good bet, thanks to its strong independence assumptions.
• In practice, try them all and select between them on
a development set!

30
Steps
1. Define problem and collect data set
2. Extract features from documents
3. Train a classifier on a training set
• Train many versions of the classifier and select between
them on a validation set
4. Apply classifier on test data

31
Perceptron
Closely related to logistic regression (differences in
training and output interpretation)
f(x⃗) = 1 if w⃗ · x⃗ + b > 0
       0 otherwise
Let's visualize this graphically:

[Figure: a single unit that takes the input x⃗ and produces the output f(x⃗)]
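
A direct reading of the decision rule above; the weights and input here are arbitrary:

import numpy as np

def perceptron_predict(x, w, b):
    return 1 if np.dot(w, x) + b > 0 else 0

print(perceptron_predict(np.array([2.0, 2.0]), np.array([0.5, 0.5]), -1.0))  # 1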
32
Stacked Perceptrons
Let’s have multiple units, then stack and recombine
their outputs:

[Figure: the input x⃗ feeds a layer of units f_1 … f_6, whose outputs feed
 units g_1 … g_4, which in turn feed h_1, the final output]
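
A minimal sketch of the stacking idea with threshold units and random placeholder weights; the layer sizes 6 → 4 → 1 mirror the figure:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=6)                                   # input features
W1, b1 = rng.normal(size=(4, 6)), rng.normal(size=4)     # first layer: units g1..g4
W2, b2 = rng.normal(size=(1, 4)), rng.normal(size=1)     # output layer: unit h1

g = (W1 @ x + b1 > 0).astype(float)   # each g_i is a perceptron-like unit over x
h = (W2 @ g + b2 > 0).astype(float)   # the final output recombines the g_i
print(g, h)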
33
Artificial Neural Networks
Above is an example of an artificial neural network:
• Each unit is a neuron with many inputs (dendrites) and
one output (axon)
• The nucleus fires (sends an electric signal along the axon)
given input from other neurons.
• Learning occurs at the synapses that connect neurons,
either by amplifying or attenuating signals.

34
Artificial Neural Networks
Advantages:
• Can learn very complex functions
• Many different network structures are possible
• Given enough training data, they are currently achieving the
best results in many NLP tasks
Disadvantages:
• Training can take a long time
• Often need a lot of training data to work well

35
Even More Classification Algorithms
Read up on them or ask me if you’re interested:
• k-nearest neighbour
• decision trees
• transformation-based learning
• random forests

Next class: non-linear classifiers

36
