lecture3-linear-classifiers
Review of Last Lecture
How is classification different from regression?
Cross Validation
k-fold cross validation: splitting training data into k
partitions or folds; iteratively test on each after training
on the rest
e.g., 3-fold CV: split dataset into 3 folds
        Fold 1   Fold 2   Fold 3
Exp. 1  test     train    train
Exp. 2  train    test     train
Exp. 3  train    train    test
Average results from above experiments
• CV is often used if the corpus is small
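For example, a minimal sketch with scikit-learn (toy feature vectors and labels; cross_val_score does the splitting and the per-fold training and testing):

from sklearn import svm
from sklearn.model_selection import cross_val_score

X = [[0, 0], [1, 1], [0, 1], [1, 0], [2, 2], [2, 0]]   # toy feature vectors
y = [0, 1, 0, 1, 1, 0]                                  # toy class labels

clf = svm.SVC()
scores = cross_val_score(clf, X, y, cv=3)   # 3 folds: train on 2, test on 1
print(scores)                               # accuracy on each fold
print(scores.mean())                        # average over the 3 experiments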
Supervised Classifiers in Python
scikit-learn has many simple classifiers implemented,
with a common interface.
e.g., SVMs
>>> from sklearn import svm
>>> X = [[0, 0], [1, 1]]       # feature vectors
>>> y = [0, 1]                 # class labels
>>> clf = svm.SVC()            # a support vector classifier
>>> clf.fit(X, y)              # train on the labelled data
>>> clf.predict([[2., 2.]])    # predict the label of a new sample
Steps
1. Define problem and collect data set
2. Extract features from documents
3. Train a classifier on a training set [today]
4. Apply classifier on test data
Feature Extraction
y = f(x⃗)

where x⃗ is the document (as a feature vector), y is the document label, and f is the classifier.
Think Abstractly
y = f(x⃗)

x⃗: the document, y: the document label, f: the classifier.
Training
y = f(x⃗)

Say we select an architecture (e.g., Naïve Bayes). f can now be described in terms of parameters θ:

y = f(x⃗; θ)
Naïve Bayes
A probabilistic classifier that uses Bayes' rule:

P(y | x⃗) = P(x⃗, y) / P(x⃗) = P(y) P(x⃗ | y) / P(x⃗)
Naïve Bayes is a generative model
• Probabilistic account of the data, P(x⃗, y)
• Naïve Bayes assumes the dataset is generated in the
following way:
For each sample:
1. Generate the label from P(y)
2. Generate the feature vector x⃗ by generating each feature independently, conditioned on y: P(x_i | y)
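A minimal sketch of this generative story with made-up parameters (two classes, three binary features):

import random

p_y = {"spam": 0.4, "ham": 0.6}              # prior class distribution P(y)
p_x_given_y = {                              # P(x_i = 1 | y) for three binary features
    "spam": [0.8, 0.1, 0.7],
    "ham":  [0.2, 0.5, 0.1],
}

def generate_sample():
    # 1. Generate the label from P(y)
    y = random.choices(list(p_y), weights=list(p_y.values()))[0]
    # 2. Generate each feature independently, conditioned on y: P(x_i | y)
    x = [1 if random.random() < p else 0 for p in p_x_given_y[y]]
    return x, y

print(generate_sample())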
Naïve Bayes Graphically
Assumption about how data is generated, as a
probabilistic graphical model:
[Graphical model: the label y has an arrow to each of the features x_1, x_2, …, x_n.]

P(x⃗, y) = P(y) ∏_i P(x_i | y)
Note how the independence
between features is expressed!
Naïve Bayes Model Parameters
The parameters to the model, 𝜃, consist of:
• Parameters of the prior class distribution P(y)
• Parameters of each feature's distribution conditioned on the class, P(x_i | y)
Reminder: Categorical Distribution
A categorical random variable follows this distribution
if it can take one of k outcomes, each with a certain
probability
• The probabilities of the outcomes must sum to 1
Examples:
• Coin flip (k = 2; Bernoulli distribution)
• Die roll (k = 6)
• Distribution of class labels (e.g., spam vs non-spam, k =
number of classes)
• Generating unigrams! (k = size of vocabulary)
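For instance, drawing unigrams from a toy categorical distribution with numpy (vocabulary and probabilities are made up):

import numpy as np

vocab = ["the", "cat", "sat", "mat"]   # toy vocabulary
probs = [0.5, 0.2, 0.2, 0.1]           # outcome probabilities; must sum to 1

rng = np.random.default_rng(0)
print(rng.choice(vocab, size=10, p=probs))   # ten unigrams drawn independently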
Training a Naïve Bayes Classifier
Objective: pick θ so as to maximize the likelihood of the training corpus D:

L_ML(θ) = ∏_{(x⃗, y) ∈ D} P(x⃗, y; θ) = ∏_{(x⃗, y) ∈ D} P(y) ∏_i P(x_i | y)
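Because the likelihood factorizes, this maximum has a closed form: each distribution is estimated by relative frequency in D, i.e., P(y = c) = count(y = c) / |D| and P(x_i = v | y = c) = count(x_i = v, y = c) / count(y = c) (before any smoothing). The exercise below uses exactly these counts.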
Naïve Bayes in Summary
Bayes’ rule:
P(y | x⃗) = P(y) P(x⃗ | y) / P(x⃗)

Assume that all the features are independent given the class:

P(y | x⃗) = P(y) ∏_i P(x_i | y) / P(x⃗)
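scikit-learn implements this model as BernoulliNB for binary features (MultinomialNB for count features). A minimal sketch with toy data; note that the default settings apply Laplace smoothing:

from sklearn.naive_bayes import BernoulliNB

X = [[1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1]]   # toy binary feature vectors
y = [1, 1, 0, 0]                                   # toy class labels

clf = BernoulliNB()               # alpha=1.0 (Laplace smoothing) by default
clf.fit(X, y)
print(clf.predict([[1, 0, 1]]))        # most probable class for a new sample
print(clf.predict_proba([[1, 0, 1]]))  # P(y | x) for each class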
Exercise: Train a NB Classifier
Table of whether a student will get an A or not based
on their habits (nominal data, Bernoulli distributions):
Reviews notes   Does assignments   Asks questions   Grade
Y               N                  Y                A
Y               Y                  N                A
N               Y                  N                A
Y               N                  N                non-A
N               Y                  Y                non-A
N               N                  Y                ?  (the student to classify)
P(y = ¬A | x⃗) = P(y = ¬A) P(x⃗ | y = ¬A) / P(x⃗)

P(y = A | x⃗) = 1 / (45 P(x⃗)) < 1 / (20 P(x⃗)) = P(y = ¬A | x⃗)

so the classifier predicts non-A.
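A small script to check these numbers by counting (unsmoothed maximum-likelihood estimates, matching the hand calculation):

data = [                          # (reviews notes, does assignments, asks questions, grade)
    ("Y", "N", "Y", "A"),
    ("Y", "Y", "N", "A"),
    ("N", "Y", "N", "A"),
    ("Y", "N", "N", "non-A"),
    ("N", "Y", "Y", "non-A"),
]
query = ("N", "N", "Y")           # the student to classify

def unnormalized_posterior(label):
    rows = [r for r in data if r[3] == label]
    p = len(rows) / len(data)                      # prior P(y)
    for i, v in enumerate(query):                  # times P(x_i = v | y) for each feature
        p *= sum(r[i] == v for r in rows) / len(rows)
    return p

print(unnormalized_posterior("A"))       # 1/45 ~ 0.0222  (still to be divided by P(x))
print(unnormalized_posterior("non-A"))   # 1/20 = 0.05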
Generative vs. Discriminative
Generative models learn a distribution for all of the random variables involved: the joint distribution P(x⃗, y)

But for text classification, we really only care about the conditional distribution P(y | x⃗)!
Logistic Regression
Linear regression:
y = a_1 x_1 + a_2 x_2 + … + a_n x_n + b
Intuition: linear regression gives us continuous values in (-∞, ∞); let's squish the values to be in [0, 1]!
The function that does this: the logistic function
P(y | x⃗) = (1/Z) e^(a_1 x_1 + a_2 x_2 + … + a_n x_n + b)

This Z is a normalizing constant to ensure this is a probability distribution.
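A sketch of this prediction for two classes with made-up weights; taking the other class's score to be 0, Z is the sum of the two unnormalized scores, which reduces to the familiar sigmoid:

import math

a = [1.5, -2.0, 0.5]      # made-up weights a_1 ... a_n
b = 0.3                   # made-up bias
x = [1.0, 0.0, 2.0]       # a feature vector

score = sum(a_i * x_i for a_i, x_i in zip(a, x)) + b
p_pos = math.exp(score) / (math.exp(score) + 1.0)   # Z = e^score + e^0
print(p_pos, 1.0 - p_pos)                           # P(y=1 | x) and P(y=0 | x)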
Logistic Function
[Plot of the logistic function: the x-axis is a_1 x_1 + a_2 x_2 + … + a_n x_n + b; the y-axis is P(y | x⃗).]
Features Can Be Anything!
We don't have to care about generating the data, so we can go wild in designing features!
• Does the document start with a capitalized letter?
• What is the length of the document in words? In
sentences?
• Actually, would usually scale and/or bin this
• How many sentiment-bearing words are there?
In practice, the features depend on both the document
and the proposed class:
• Does the document contain the word money with the
proposed class being spam?
Parameters in Logistic Regression
P(y | x⃗; θ) = (1/Z) e^(a_1 x_1 + a_2 x_2 + … + a_n x_n + b)

where θ = {a_1, a_2, …, a_n, b}
Optimizing the Objective
We want to maximize
log L_ML(θ) = ∑_{(x⃗, y) ∈ D} log P(y | x⃗; θ)
            = ∑_{(x⃗, y) ∈ D} log((1/Z) e^(a_1 x_1 + a_2 x_2 + … + a_n x_n + b))
            = ∑_{(x⃗, y) ∈ D} (∑_i a_i x_i + b − log Z)
This can be optimized by gradient descent
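A minimal sketch of this for the binary (sigmoid) case on a toy dataset, with plain numpy and a fixed learning rate; it ascends the log-likelihood (equivalently, descends the negative log-likelihood) and is an illustration of the idea, not a tuned implementation:

import numpy as np

X = np.array([[0., 0.], [1., 1.], [0., 1.], [1., 0.]])   # toy feature vectors
y = np.array([0., 1., 1., 0.])                           # toy labels (here y = x_2)

w = np.zeros(2)
b = 0.0
lr = 0.5

for step in range(1000):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # P(y=1 | x) for every sample
    w += lr * (X.T @ (y - p))                # gradient of the log-likelihood w.r.t. w
    b += lr * np.sum(y - p)                  # ... and w.r.t. b

p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
print((p > 0.5).astype(int))                 # predictions on the training set (should match y)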
Support Vector Machines
Let’s visualize 𝑥⃗ as points in a high dimensional space.
e.g., if we have two features, each sample is a point in a
2D scatter plot. Label y using colour.
[Scatter plot: each sample is a point in the (x_1, x_2) plane, coloured by its label y.]
Support Vector Machines
An SVM learns a decision boundary as a line (or a hyperplane when there are more than 2 features)
[The same scatter plot, now with a line separating the two classes.]
Margin
This hyperplane is chosen to maximize the margin to
the nearest sample in each of the two classes.
[The same plot, highlighting the margin: the distance from the separating line to the nearest sample on each side.]
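With scikit-learn (toy 2D points and a linear kernel, so the boundary is a line), the samples that define this margin are exposed after fitting:

from sklearn import svm

X = [[0, 0], [1, 0], [0, 1], [2, 2], [3, 2], [2, 3]]   # toy points in 2D
y = [0, 0, 0, 1, 1, 1]

clf = svm.SVC(kernel="linear")
clf.fit(X, y)
print(clf.support_vectors_)        # the nearest samples (the support vectors)
print(clf.coef_, clf.intercept_)   # the separating hyperplane w.x + b = 0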
Steps
1. Define problem and collect data set
2. Extract features from documents
3. Train a classifier on a training set
• Train many versions of the classifier and select between
them on a validation set
4. Apply classifier on test data
Perceptron
Closely related to logistic regression (differences in
training and output interpretation)
f(x⃗) = 1 if w · x⃗ + b > 0, and 0 otherwise
Let’s visualize this graphically:
[Diagram: the input x⃗ feeds a single unit, which outputs f(x⃗).]
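A direct transcription of this decision rule, with made-up weights:

def perceptron(x, w, b):
    # Fire (output 1) if the weighted sum of the inputs exceeds the threshold.
    return 1 if sum(w_i * x_i for w_i, x_i in zip(w, x)) + b > 0 else 0

w = [2.0, -1.0]    # made-up weights
b = -0.5           # made-up bias
print(perceptron([1.0, 0.0], w, b))   # 2.0 - 0.5 = 1.5 > 0  ->  1
print(perceptron([0.0, 2.0], w, b))   # -2.0 - 0.5 = -2.5    ->  0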
Stacked Perceptrons
Let’s have multiple units, then stack and recombine
their outputs:

[Diagram: the input x⃗ feeds a layer of units f_1 … f_6; their outputs feed a layer of units g_1 … g_4, which connect to a single output unit h_1 (the final output).]
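A minimal sketch of such a stack with made-up weights: each layer is a weight matrix and bias followed by a threshold (real networks usually use smooth activations so they can be trained by gradient descent):

import numpy as np

def layer(x, W, b):
    # Each row of W is one unit; apply the threshold to every unit's w.x + b.
    return (W @ x + b > 0).astype(float)

x = np.array([1.0, 0.0, 1.0])                # input features

W1 = np.array([[1.0, -1.0, 0.5],             # first layer of units (the f's)
               [-0.5, 1.0, 1.0]])
b1 = np.array([0.0, -1.0])

W2 = np.array([[1.0, 1.0]])                  # output unit (h), combining the layer below
b2 = np.array([-1.5])

print(layer(layer(x, W1, b1), W2, b2))       # the final output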
Artificial Neural Networks
Above is an example of an artificial neural network:
• Each unit is a neuron with many inputs (dendrites) and
one output (axon)
• The nucleus fires (sends an electric signal along the axon)
given input from other neurons.
• Learning occurs at the synapses that connect neurons,
either by amplifying or attenuating signals.
Artificial Neural Networks
Advantages:
• Can learn very complex functions
• Many different network structures are possible
• Given enough training data, they are currently achieving the best results in many NLP tasks
Disadvantages:
• Training can take a long time
• Often need a lot of training data to work well
Even More Classification Algorithms
Read up on them or ask me if you’re interested:
• k-nearest neighbour
• decision trees
• transformation-based learning
• random forests