McGill COMP 550, Fall 2024 lecture note

Lecture 3: Linear Classifiers

Instructor: Jackie CK Cheung & David Adelani


COMP-550
Readings: Eisenstein Ch. 2
Classification
Map input 𝑥 to output 𝑦:
𝑦 = 𝑓(𝑥)

Classification: 𝑦 is a discrete outcome


• Genre of the document (news text, novel, …?)
• Overall topic of the document
• Spam vs. non-spam
• Identity, gender, native language, etc. of author
• Positive vs. negative movie review
• Other examples?

2
Review of Last Lecture
How is classification different from regression?

What does it mean to train a text classifier?

What is the use of a training set? A validation set? A
test set?

3
Cross Validation
k-fold cross validation: splitting training data into k
partitions or folds; iteratively test on each after training
on the rest
e.g., 3-fold CV: split dataset into 3 folds
         Fold 1   Fold 2   Fold 3
Exp. 1   test     train    train
Exp. 2   train    test     train
Exp. 3   train    train    test
Average results from above experiments
• CV is often used if the corpus is small
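
A minimal sketch of 3-fold cross validation with scikit-learn; the toy data below are made up, and cross_val_score handles the fold splitting:

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

# Toy data: six samples with two features each, three per class.
X = [[0, 0], [0, 1], [1, 0], [3, 3], [3, 4], [4, 3]]
y = [0, 0, 0, 1, 1, 1]

# cv=3 splits the data into 3 folds; each fold is used once as the test set.
scores = cross_val_score(LogisticRegression(), X, y, cv=3)
print(scores, scores.mean())   # one accuracy per fold, then the average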

4
Supervised Classifiers in Python
scikit-learn has many simple classifiers implemented,
with a common interface.

e.g., SVMs
>>> from sklearn import svm
>>> X = [[0, 0], [1, 1]]        # training samples (two features each)
>>> y = [0, 1]                  # their class labels
>>> clf = svm.SVC()             # a support vector classifier
>>> clf.fit(X, y)
>>> clf.predict([[2., 2.]])     # predicts class 1 for the new point

5
Steps
1. Define problem and collect data set
2. Extract features from documents
3. Train a classifier on a training set [today]
4. Apply classifier on test data

6
Feature Extraction
y = f(x⃗)
(x⃗: the document; f: the classifier; y: the document label)

Represent document x⃗ as a list of features


[Figure: a document's text is mapped to a feature vector
x_1, x_2, x_3, … = 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0, …]
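
As a concrete illustration, a binary bag-of-words representation can be built with scikit-learn's CountVectorizer; the documents below are made-up examples:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["buy my stuff now", "lecture notes on linear classifiers"]
vectorizer = CountVectorizer(binary=True)   # one 1.0/0.0 feature per word type
X = vectorizer.fit_transform(docs)          # one feature vector per document
print(vectorizer.get_feature_names_out())
print(X.toarray())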

7
Think Abstractly
y = f(x⃗)
(x⃗: the document; f: the classifier; y: the document label)

What are possible choices for the form of f?


Some popular approaches:
• Naïve Bayes
• Logistic regression
• Support vector machines
• Artificial neural networks – nonlinear, for next class

8
Training
y = f(x⃗)
Say we select an architecture (e.g., Naïve Bayes). f can
now be described in terms of parameters θ:
y = f(x⃗; θ)

Training the model specifically means to select
parameters θ* according to some objective function
(e.g., minimize error on the training set; maximize
likelihood of the training data).

9
Naïve Bayes
A probabilistic classifier that uses Bayes' rule:
P(y | x⃗) = P(x⃗, y) / P(x⃗) = P(y) P(x⃗ | y) / P(x⃗)
Naïve Bayes is a generative model
• Probabilistic account of the data P(x⃗, y)
• Naïve Bayes assumes the dataset is generated in the
following way:
For each sample:
1. Generate the label from P(y)
2. Generate the feature vector x⃗ by generating each feature
independently, conditioned on y
• P(x_i | y)
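
A minimal sketch of this generative story; all probabilities below are made-up numbers for illustration:

import random

random.seed(0)
p_y = {"spam": 0.3, "non-spam": 0.7}                          # P(y)
p_xi_given_y = {"spam": [0.8, 0.1], "non-spam": [0.2, 0.6]}   # P(x_i = 1 | y)

# 1. Generate the label from P(y)
y = random.choices(list(p_y), weights=list(p_y.values()))[0]
# 2. Generate each feature independently, conditioned on y
x = [1 if random.random() < p else 0 for p in p_xi_given_y[y]]
print(y, x)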
10
Naïve Bayes Graphically
Assumption about how data is generated, as a
probabilistic graphical model:

[Figure: graphical model in which the label y generates the features
x_1, x_2, …, x_n]

P(x⃗, y) = P(y) ∏_i P(x_i | y)
Note how the independence
between features is expressed!
11
Naïve Bayes Model Parameters
The parameters to the model, θ, consist of:
• Parameters of the prior class distribution P(y)
• Parameters of each feature's distribution conditioned on
class, P(x_i | y)

With discrete data, we assume that the distributions
P(y) and P(x_i | y) are categorical distributions

12
Reminder: Categorical Distribution
A categorical random variable follows this distribution
if it can take one of k outcomes, each with a certain
probability
• The probabilities of the outcomes must sum to 1
Examples:
• Coin flip (k = 2; Bernoulli distribution)
• Die roll (k = 6)
• Distribution of class labels (e.g., spam vs non-spam, k =
number of classes)
• Generating unigrams! (k = size of vocabulary)

13
Training a Naïve Bayes Classifier
Objective: pick θ so as to maximize the likelihood of
the training corpus, D:
L_ML(θ) = ∏_{(x⃗, y) ∈ D} P(x⃗, y; θ)
        = ∏_{(x⃗, y) ∈ D} P(y) ∏_i P(x_i | y)

Can show that this boils down to computing relative
frequencies:
P(Y = y) should be set to the proportion of samples with
class y
P(X_i = x | Y = y) should be set to the proportion of samples
with feature value x among samples of class y
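
A minimal counting sketch of these relative-frequency estimates; the toy labelled data and variable names are made up:

from collections import Counter, defaultdict

labels = ["spam", "spam", "non-spam"]
features = [{"money": 1, "meeting": 0},
            {"money": 1, "meeting": 1},
            {"money": 0, "meeting": 1}]

# P(Y = y): proportion of samples with class y
p_y = {y: c / len(labels) for y, c in Counter(labels).items()}

# P(X_i = x | Y = y): proportion of samples with value x among samples of class y
value_counts = defaultdict(Counter)
for x, y in zip(features, labels):
    for name, value in x.items():
        value_counts[(y, name)][value] += 1
p_x_given_y = {key: {v: c / sum(cnt.values()) for v, c in cnt.items()}
               for key, cnt in value_counts.items()}

print(p_y)
print(p_x_given_y[("spam", "money")])   # {1: 1.0} for this toy data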
14
Inference in Naïve Bayes
After training, we would like to classify a new instance
(e.g., is a new document spam?)
• i.e., we want P(y | x⃗)
Easy to get from P(x⃗, y):
P(y | x⃗) = P(x⃗, y) / P(x⃗)
         = P(y) ∏_i P(x_i | y) / P(x⃗)

To calculate the denominator P(x⃗), marginalize over the random
variable y by summing up the numerator for all possible classes
(all possible values of y).
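
A minimal sketch of that marginalization with a single feature; the per-class numbers are the illustrative probabilities from the summary slide that follows:

# Per-class joint scores P(y) * prod_i P(x_i | y) for one document
joint = {"spam": 0.24 * 0.07, "non-spam": 0.76 * 0.0024}

p_x = sum(joint.values())                           # P(x⃗): sum over all classes y
posterior = {y: s / p_x for y, s in joint.items()}  # P(y | x⃗)
print(posterior)                                    # sums to 1 by construction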

15
Naïve Bayes in Summary
Bayes' rule:
P(y | x⃗) = P(y) P(x⃗ | y) / P(x⃗)
Assume that all the features are independent:
P(y | x⃗) = P(y) ∏_i P(x_i | y) / P(x⃗)

Training the model means estimating the parameters
P(y) and P(x_i | y).
• e.g., P(SPAM) = 0.24, P(NON-SPAM) = 0.76
P(money at home|SPAM) = 0.07
P(money at home|NON-SPAM) = 0.0024

16
Exercise: Train a NB Classifier
Table of whether a student will get an A or not based
on their habits (nominal data, Bernoulli distributions):
Reviews notes   Does assignments   Asks questions   Grade
Y               N                  Y                A
Y               Y                  N                A
N               Y                  N                A
Y               N                  N                non-A
N               Y                  Y                non-A
N               N                  Y                ?

What is the probability that this student gets an A?
• Doesn't review notes, no assignments, asks questions
P(y | x⃗) = P(y) ∏_i P(x_i | y) / P(x⃗)

17
Train a NB Classifier (solution)
Bayes' rule:
P(y = A | x⃗) = P(y = A) P(x⃗ | y = A) / P(x⃗)
P(y = ¬A | x⃗) = P(y = ¬A) P(x⃗ | y = ¬A) / P(x⃗)

Assume that all the features are independent:
P(y = A | x⃗)  = 3/5 · 1/3 · 1/3 · 1/3 · 1/P(x⃗) = 1 / (45 P(x⃗))
P(y = ¬A | x⃗) = 2/5 · 1/2 · 1/2 · 1/2 · 1/P(x⃗) = 1 / (20 P(x⃗))

P(y = A | x⃗) = 1 / (45 P(x⃗)) < 1 / (20 P(x⃗)) = P(y = ¬A | x⃗)

Reviews notes   Does assignments   Asks questions   Grade
Y               N                  Y                A
Y               Y                  N                A
N               Y                  N                A
Y               N                  N                non-A
N               Y                  Y                non-A
N               N                  Y                non-A (predicted)
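
A quick counting check of the numbers above (a sketch; the table is encoded directly as tuples):

from fractions import Fraction as F

# (reviews notes, does assignments, asks questions, grade) for the 5 labelled students
data = [("Y", "N", "Y", "A"), ("Y", "Y", "N", "A"), ("N", "Y", "N", "A"),
        ("Y", "N", "N", "non-A"), ("N", "Y", "Y", "non-A")]
query = ("N", "N", "Y")   # the student to classify

for grade in ("A", "non-A"):
    rows = [r for r in data if r[3] == grade]
    score = F(len(rows), len(data))                                  # prior P(y)
    for i, value in enumerate(query):
        score *= F(sum(r[i] == value for r in rows), len(rows))      # P(x_i | y)
    print(grade, score)   # A: 1/45, non-A: 1/20 (each still divided by P(x⃗))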
18
Type/Token Distinction
What if a word appears more than once in a
document? Frequency matters!
Type: the identity of a word (i.e., count unique words)
Token: an instance of a word (i.e., each occurrence is
separate)
In text classification, we usually deal with tokens, and
assume that there is a categorical distribution that is
used to generate all of the tokens seen in a sample,
conditioned on class y.
Document: "yo buy my stuff yo"   class: spam
P(spam) P(yo|spam) P(buy|spam) P(my|spam) P(stuff|spam) P(yo|spam)
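
A minimal sketch of scoring the example document token by token; the probability tables are made-up numbers:

import math

p_class = {"spam": 0.5}
p_word_given_spam = {"yo": 0.10, "buy": 0.05, "my": 0.05, "stuff": 0.02}

tokens = "yo buy my stuff yo".split()   # 5 tokens, 4 types ("yo" occurs twice)
log_joint = math.log(p_class["spam"]) + sum(
    math.log(p_word_given_spam[t]) for t in tokens)   # one factor per *token*
print(log_joint)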

19
Generative vs. Discriminative
Generative models learn a distribution for all of the
random variables involved: the joint distribution, P(x⃗, y)
But for text classification, we really only care about the
conditional distribution P(y | x⃗)!

Discriminative models directly parameterize and learn
P(y | x⃗)
• May be easier than learning the joint!
• Can flexibly design many different features
• Model can only do classification!

20
Logistic Regression
Linear regression:
y = a_1 x_1 + a_2 x_2 + … + a_n x_n + b
Intuition: Linear regression gives us continuous values
in (-∞, ∞); let's squish the values to be in [0, 1]!
Function that does this: the logistic function
P(y | x⃗) = (1/Z) e^{a_1 x_1 + a_2 x_2 + … + a_n x_n + b}
This Z is a normalizing constant to ensure
this is a probability distribution.

(a.k.a., maximum entropy or MaxEnt classifier)

N.B.: Don't be confused by the name: this method is most often used to solve
classification problems.
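
A minimal sketch of the binary case, where the normalizer works out to Z = 1 + e^{a·x + b}, giving the familiar logistic (sigmoid) curve; the weights below are arbitrary:

import numpy as np

def p_positive(x, a, b):
    score = np.dot(a, x) + b
    return 1.0 / (1.0 + np.exp(-score))   # = e^score / (1 + e^score)

print(p_positive(np.array([1.0, 2.0]), np.array([0.3, -0.1]), 0.05))  # ≈ 0.54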

21
Logistic Function
[Figure: the logistic curve.
 y-axis: P(y | x⃗) = (1/Z) e^{a_1 x_1 + a_2 x_2 + … + a_n x_n + b}
 x-axis: a_1 x_1 + a_2 x_2 + … + a_n x_n + b]

22
Features Can Be Anything!
We don't have to care about generating the data, so we
can go wild in designing features!
• Does the document start with a capitalized letter?
• What is the length of the document in words? In
sentences?
• Actually, would usually scale and/or bin this
• How many sentiment-bearing words are there?
In practice, the features depend on both the document
and the proposed class:
• Does the document contain the word money with the
proposed class being spam?
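
A hypothetical sketch of such a (document, class) feature; the function name and inputs are made up for illustration:

def feature_money_and_spam(doc_tokens, proposed_class):
    # Fires (value 1.0) only if "money" occurs AND the candidate label is spam.
    return 1.0 if ("money" in doc_tokens and proposed_class == "spam") else 0.0

print(feature_money_and_spam(["send", "money", "now"], "spam"))      # 1.0
print(feature_money_and_spam(["send", "money", "now"], "non-spam"))  # 0.0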

23
Parameters in Logistic Regression
P(y | x⃗; θ) = (1/Z) e^{a_1 x_1 + a_2 x_2 + … + a_n x_n + b}
where θ = {a_1, a_2, …, a_n, b}

Learning means to maximize the conditional likelihood
of the training corpus:
L_CL(θ) = ∏_{(x⃗, y) ∈ D} P(y | x⃗; θ)
or more usually, the log conditional likelihood:
log L_CL(θ) = ∑_{(x⃗, y) ∈ D} log P(y | x⃗; θ)

24
Optimizing the Objective
We want to maximize
log L_CL(θ) = ∑_{(x⃗, y) ∈ D} log P(y | x⃗; θ)
            = ∑_{(x⃗, y) ∈ D} log((1/Z) e^{a_1 x_1 + a_2 x_2 + … + a_n x_n + b})
            = ∑_{(x⃗, y) ∈ D} ((∑_i a_i x_i) + b − log Z)
This can be optimized by gradient descent
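
A minimal sketch for the binary case: gradient ascent on the log conditional likelihood (equivalently, gradient descent on its negative); the toy data, step size, and iteration count are made up:

import numpy as np

X = np.array([[0.0, 0.0], [1.0, 1.0], [0.2, 0.1], [0.9, 1.2]])
y = np.array([0, 1, 0, 1])
a, b = np.zeros(2), 0.0

for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ a + b)))   # P(y = 1 | x) under current parameters
    a += 0.1 * (X.T @ (y - p))               # gradient of the log-likelihood w.r.t. a
    b += 0.1 * np.sum(y - p)                 # ... and w.r.t. b
print(a, b)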

25
Support Vector Machines
Let’s visualize 𝑥⃗ as points in a high dimensional space.
e.g., if we have two features, each sample is a point in a
2D scatter plot. Label y using colour.

[Figure: samples plotted in a 2D feature space with axes x_1 and x_2,
 coloured by label y]

26
Support Vector Machines
A SVM learns a decision boundary as a line (or
hyperplane when >2 features)

[Figure: the same 2D plot with a linear decision boundary separating the
 two classes]

27
Margin
This hyperplane is chosen to maximize the margin to
the nearest sample in each of the two classes.
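
A minimal sklearn sketch of fitting such a max-margin classifier on toy points; the C parameter controls the soft margin used when the data are not perfectly separable, as noted below the figure:

from sklearn import svm

X = [[0, 0], [0.2, 0.1], [1, 1], [0.9, 1.1]]
y = [0, 0, 1, 1]
clf = svm.SVC(kernel="linear", C=1.0).fit(X, y)
print(clf.support_vectors_)   # the samples closest to the decision boundary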

[Figure: the decision boundary with the margin to the nearest sample in
 each class shown]

The method also deals with the fact that the samples
may not be linearly separable.
28
SVMs – Generative or Discriminative?
Are SVMs a generative or a discriminative model?

[Figure: same margin plot as on the previous slide]

29
How To Decide?
• Naïve Bayes, logistic regression, and SVMs can all
work well in different tasks and settings.
• Usually, given little training data, Naïve Bayes is a
good bet, thanks to its strong independence assumptions.
• In practice, try them all and select between them on
a development set!

30
Steps
1. Define problem and collect data set
2. Extract features from documents
3. Train a classifier on a training set
• Train many versions of the classifier and select between
them on a validation set
4. Apply classifier on test data

31
Perceptron
Closely related to logistic regression (differences in
training and output interpretation)
f(x⃗) = 1 if w⃗ · x⃗ + b > 0
       0 otherwise
Let's visualize this graphically:

[Figure: a single unit that takes the input x⃗ and produces the output f(x⃗)]
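
A direct reading of the decision rule above; the weights and input here are arbitrary:

import numpy as np

def perceptron_predict(x, w, b):
    return 1 if np.dot(w, x) + b > 0 else 0

print(perceptron_predict(np.array([2.0, 2.0]), np.array([0.5, 0.5]), -1.0))  # 1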
32
Stacked Perceptrons
Let’s have multiple units, then stack and recombine
their outputs:

[Figure: the input x⃗ feeds a layer of units f_1 … f_6, whose outputs feed
 units g_1 … g_4, which in turn feed h_1, the final output]
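
A minimal sketch of the stacking idea with threshold units and random placeholder weights; the layer sizes 6 → 4 → 1 mirror the figure:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=6)                                   # input features
W1, b1 = rng.normal(size=(4, 6)), rng.normal(size=4)     # first layer: units g1..g4
W2, b2 = rng.normal(size=(1, 4)), rng.normal(size=1)     # output layer: unit h1

g = (W1 @ x + b1 > 0).astype(float)   # each g_i is a perceptron-like unit over x
h = (W2 @ g + b2 > 0).astype(float)   # the final output recombines the g_i
print(g, h)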
33
Artificial Neural Networks
Above is an example of an artificial neural network:
• Each unit is a neuron with many inputs (dendrites) and
one output (axon)
• The nucleus fires (sends an electric signal along the axon)
given input from other neurons.
• Learning occurs at the synapses that connect neurons,
either by amplifying or attenuating signals.

34
Artificial Neural Networks
Advantages:
• Can learn very complex functions
• Many different network structures are possible
• Given enough training data, they are currently achieving the
best results in many NLP tasks
Disadvantages:
• Training can take a long time
• Often need a lot of training data to work well

35
Even More Classification Algorithms
Read up on them or ask me if you’re interested:
• k-nearest neighbour
• decision trees
• transformation-based learning
• random forests

Next class: non-linear classifiers

36
