
Learning Systems (DT8008)

Classification

Dr. Mohamed-Rafik Bouguelia


[email protected]

Halmstad University
Classification
• The variable y that you want to predict (the output variable) is discrete.

• Examples (with two classes):
  – Email: Spam / Not Spam?
  – Online Transactions: Fraudulent (Yes/No)?
  – Tumor: Malignant / Benign

• y ∈ {0, 1}
  – 0: "Negative Class" (e.g. benign tumor)
  – 1: "Positive Class" (e.g. malignant tumor)

• We will first start by talking about binary classification (with two classes).
• Then, we will talk more about multi-class classification (with more than two classes), y ∈ {0, 1, 2, 3, …, c}
Some Applications of
Classification
Some applications of classification
(image-based examples)

Some applications of classification
• Edible or poisonous? (image example)
Some applications of classification
• e.g. Laryngeal disease diagnostics

• Features / Attributes:
– Age
– Subjectively estimated illness duration (months)
– Education (five grades)
– Average duration of intensive speech use (hours/day)
– Number of days of intensive speech use (days/week)
– Smoking (Yes/No)
– Smoked cigarettes/day
– Smoking duration (years)
– Subjective voice function assessment by the patient
– Maximal tonality duration for "aaaaaa" (sec)
– Functional voice index (F)
– Emotional condition index (E)
– Physical condition index (P)
– Voice deficiency index
– …
Some applications of classification
• Image classification: apple, pear, tomato, cow, dog, horse
• Training set (labels known) vs. test set (labels unknown)
Linear Classification with
Logistic Regression
Logistic Regression
• This is a classification method (don't get confused by the name).

• In binary classification, we want y = 0 or y = 1
  – but if you use a simple linear regression model h_θ(x) = θᵀx, then h_θ(x) can be > 1 or < 0.

• The logistic regression model is defined so that 0 ≤ h_θ(x) ≤ 1:
  – h_θ(x) = g(θᵀx), where g(·) is the sigmoid function (or logistic function).
  – Sigmoid function: g(a) = 1 / (1 + e^(−a))

• h_θ(x) = 1 / (1 + e^(−θᵀx))
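As a concrete illustration (a minimal NumPy sketch, not part of the original slides; the example values of θ and x are made up), the sigmoid and the logistic regression hypothesis can be written as:

```python
# Minimal sketch of the sigmoid and the hypothesis h_theta(x) = g(theta^T x).
import numpy as np

def sigmoid(a):
    """Sigmoid / logistic function g(a) = 1 / (1 + e^(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def h(theta, x):
    """Hypothesis h_theta(x); x is assumed to include the bias term x0 = 1."""
    return sigmoid(theta @ x)

# Illustrative (made-up) parameter vector and input.
theta = np.array([-3.0, 1.0, 1.0])
x = np.array([1.0, 2.0, 2.0])    # x0 = 1, x1 = 2, x2 = 2
print(h(theta, x))               # ~0.73, read as P(y = 1 | x; theta)
```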
Logistic Regression
h_θ(x) = 1 / (1 + e^(−θᵀx))

• Interpretation of the hypothesis output:
  h_θ(x) = estimated probability that y = 1 on input x

• Example: if h_θ(x) = 0.7, there is a 70% chance of the tumor being malignant.

• h_θ(x) = P(y = 1 | x; θ), i.e. the probability that y = 1, given x, parametrized by θ.

• Note: since y ∈ {0, 1}, P(y = 1 | x; θ) + P(y = 0 | x; θ) = 1
Logistic Regression – Linear decision boundary

h_θ(x) = g(θᵀx) = P(y = 1 | x; θ),  where g(a) = 1 / (1 + e^(−a))

• If h_θ(x) ≥ 0.5 we predict class y = 1 (equivalently, if θᵀx ≥ 0 we predict y = 1).
• If h_θ(x) < 0.5 we predict class y = 0 (equivalently, if θᵀx < 0 we predict y = 0).

• Example of a linear decision boundary:

  h_θ(x) = g(θᵀx)
         = g(θ₀ + θ₁x₁ + θ₂x₂)
         = g(−3 + x₁ + x₂)

  Predict y = 1 if −3 + x₁ + x₂ ≥ 0.

• For all regions where x₁ + x₂ ≥ 3, this predicts y = 1.
• For all regions where x₁ + x₂ < 3, this predicts y = 0.
• The decision boundary is the line x₁ + x₂ = 3.
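A small sketch of this decision rule for the example parameters θ = (−3, 1, 1) above (illustrative code, not from the slides):

```python
# Predict y = 1 exactly when theta^T x >= 0, i.e. when x1 + x2 >= 3 here.
import numpy as np

theta = np.array([-3.0, 1.0, 1.0])

def predict(x1, x2):
    z = theta @ np.array([1.0, x1, x2])   # theta^T x, with bias feature x0 = 1
    return 1 if z >= 0 else 0             # same test as h_theta(x) >= 0.5

print(predict(1.0, 1.0))   # x1 + x2 = 2  < 3  -> 0
print(predict(2.5, 1.0))   # x1 + x2 = 3.5 >= 3 -> 1
```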
Defining the Cost Function for
Logistic Regression

Logistic Regression – Error function
• Training dataset: {(x⁽¹⁾, y⁽¹⁾), (x⁽²⁾, y⁽²⁾), …, (x⁽ⁿ⁾, y⁽ⁿ⁾)}
• x = (x₀, x₁, …, x_d)ᵀ ∈ ℝ^(d+1), with x₀ = 1, and y ∈ {0, 1}
• h_θ(x) = 1 / (1 + e^(−θᵀx))

• How do we choose the parameters θ?
  – By minimizing some error (cost) function.

• If the cost function is defined as in linear regression (squared error applied to h_θ(x)), it will be non-convex, with several local minima: gradient descent is not guaranteed to converge to the global minimum.
Logistic Regression – Error function
Instead, we use the following convex cost function:

  cost(h_θ(x), y) = −log(h_θ(x))       if y = 1
  cost(h_θ(x), y) = −log(1 − h_θ(x))   if y = 0

  E(θ) = (1/n) Σᵢ cost(h_θ(x⁽ⁱ⁾), y⁽ⁱ⁾)

This gives us a convex optimization problem when we want to minimize E(θ).
Logistic Regression – Error function
In the case where y = 1:  cost(h_θ(x), y) = −log(h_θ(x))

• When h_θ(x) is closer to 1, cost(h_θ(x), y) is closer to 0.
• cost(h_θ(x), y) = 0 if h_θ(x) = 1.
• As h_θ(x) → 0, the cost → ∞.
• This captures the intuition that if h_θ(x) = 0 (i.e. P(y = 1 | x; θ) = 0) but y = 1, then we penalize the learning algorithm with a very large cost.
Logistic Regression – Error function
In the case where y = 0:  cost(h_θ(x), y) = −log(1 − h_θ(x))

• When h_θ(x) is closer to 0, cost(h_θ(x), y) is closer to 0.
• cost(h_θ(x), y) = 0 if h_θ(x) = 0.
• As h_θ(x) → 1, the cost → ∞.
• This captures the intuition that if h_θ(x) = 1 (i.e. P(y = 0 | x; θ) = 0) but y = 0, then we penalize the learning algorithm with a very large cost.
Logistic Regression – Error function
Simpler way to write the error function:

  cost(h_θ(x), y) = −y log(h_θ(x)) − (1 − y) log(1 − h_θ(x))

  E(θ) = −(1/n) Σᵢ [ y⁽ⁱ⁾ log(h_θ(x⁽ⁱ⁾)) + (1 − y⁽ⁱ⁾) log(1 − h_θ(x⁽ⁱ⁾)) ]

• To find the best parameters θ: min_θ E(θ)
• To make a prediction given a new x: compute h_θ(x) = 1 / (1 + e^(−θᵀx)), interpreted as P(y = 1 | x; θ).
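A NumPy sketch of this error function (the variable names and toy data are assumptions, not from the slides):

```python
# Cross-entropy cost E(theta) = -(1/n) * sum( y*log(h) + (1-y)*log(1-h) ).
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cost(theta, X, y, eps=1e-12):
    """X has shape (n, d+1) with a leading column of ones; y holds 0/1 labels."""
    h = sigmoid(X @ theta)
    h = np.clip(h, eps, 1 - eps)          # avoid log(0)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

# Toy data, for illustration only.
X = np.array([[1.0, 0.5], [1.0, 2.0], [1.0, 3.0]])
y = np.array([0.0, 1.0, 1.0])
print(cost(np.array([-2.0, 1.0]), X, y))
```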
Gradient of the Cost Function

Derivative of the cost function

To use gradient descent, we need to know ∂E(θ)/∂θⱼ for j = 0, …, d.

Working out this derivative for the cost function above gives:

  ∂E(θ)/∂θⱼ = (1/n) Σᵢ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾) xⱼ⁽ⁱ⁾

This looks identical to the gradient used in linear regression.
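The sketch below (toy data and names are made up) computes this gradient and checks it against a numerical finite-difference estimate:

```python
# Analytic gradient (1/n) * X^T (h - y) vs. a numerical check.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cost(theta, X, y):
    h = np.clip(sigmoid(X @ theta), 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def gradient(theta, X, y):
    """Vector of partial derivatives dE/dtheta_j, j = 0..d."""
    h = sigmoid(X @ theta)
    return X.T @ (h - y) / len(y)

# Toy data (illustrative values only).
X = np.array([[1.0, 0.5], [1.0, 2.0], [1.0, 3.0]])
y = np.array([0.0, 1.0, 1.0])
theta = np.array([-2.0, 1.0])

eps = 1e-6
numerical = [(cost(theta + eps * np.eye(2)[j], X, y) -
              cost(theta - eps * np.eye(2)[j], X, y)) / (2 * eps)
             for j in range(2)]
print(gradient(theta, X, y), numerical)   # the two should agree closely
```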
Gradient Descent for the Logistic
Regression Classifier

Gradient descent algorithm

Repeat until convergence {
  θⱼ := θⱼ − α ∂E(θ)/∂θⱼ = θⱼ − α (1/n) Σᵢ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾) xⱼ⁽ⁱ⁾
}
(simultaneously update all θⱼ, for j = 0, …, d)

• Looks identical to linear regression!
• But here in logistic regression h_θ(x) = 1 / (1 + e^(−θᵀx)), instead of h_θ(x) = θᵀx, which was used in linear regression.
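A minimal batch gradient descent loop built from these pieces (a sketch under assumed hyperparameters α and number of iterations, not the lecture's reference implementation):

```python
# Batch gradient descent for logistic regression on a toy 1-D problem.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic_regression(X, y, alpha=0.1, n_iters=5000):
    """X: (n, d+1) with a leading column of ones; y: (n,) of 0/1 labels."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        h = sigmoid(X @ theta)
        grad = X.T @ (h - y) / len(y)   # gradient of E(theta)
        theta = theta - alpha * grad    # simultaneous update of all theta_j
    return theta

# Made-up data: y = 1 when the single feature x1 is large.
X = np.column_stack([np.ones(6), [0.5, 1.0, 1.5, 3.0, 3.5, 4.0]])
y = np.array([0, 0, 0, 1, 1, 1])
theta = fit_logistic_regression(X, y)
print((sigmoid(X @ theta) >= 0.5).astype(int))   # should reproduce y
```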
Can We Use Logistic Regression
for Multi-class Classification?

Multi-class classification
• Examples of multi-class classification applications

– Activity recognition in smart homes:


• Sleeping, Cooking, Taking Lunch, Watching TV …

– Medical diagnosis:
• Cold, Flu, Not ill, …

– Email classification/folding/tagging
• Work, Friends, Family, Hobby, …

– Weather:
• Sunny, Cloudy, Rainy, Snow

Multi-class classification

• Binary classification: two classes, separated by one decision boundary.
• Multi-class classification: more than two classes. How do we do this?
Multi-class classification: one-vs-all (one-vs-rest)

• Train a logistic regression classifier h_θ⁽ⁱ⁾(x) for each class i to predict the probability that y = i.
• h_θ⁽ⁱ⁾(x) = P(y = i | x; θ), for i = 1, 2, 3 (one classifier per class).
• To make a prediction on a new input x, pick the class i that maximizes the probability h_θ⁽ⁱ⁾(x).
Multi-class classification
• One-vs-all (one-vs-rest)
  – Train one binary classification model for each class (vs. all the other classes).
  – Number of models is equal to the number of classes (c).

• One-vs-one
  – You can also train one binary classification model for each pair of classes.
  – Number of models is c(c − 1)/2, i.e. on the order of c².
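For illustration, scikit-learn (assumed available; not part of the lecture material) provides both strategies as wrappers around a binary classifier:

```python
# One-vs-rest and one-vs-one built on top of binary logistic regression.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier

X, y = load_iris(return_X_y=True)            # 3 classes, so c = 3

ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)  # c models
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)   # c(c-1)/2 models

print(len(ovr.estimators_), len(ovo.estimators_))   # 3 and 3 for c = 3
print(ovr.predict(X[:5]), ovo.predict(X[:5]))
```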
Nonlinear Classification
Non-linear classification with
Logistic Regression.

Logistic Regression – Non-linear decision boundary

Example of a non-linear decision boundary:
• Let's add extra higher-order polynomial terms to the features:

  h_θ(x) = g(θᵀx)
         = g(θ₀ + θ₁x₁ + θ₂x₂ + θ₃x₁² + θ₄x₂²)
         = g(−1 + x₁² + x₂²)

• Predict y = 1 if x₁² + x₂² ≥ 1.
• The decision boundary is the circle x₁² + x₂² = 1: y = 1 outside the circle, y = 0 inside it.

NOTE:
• The decision boundary is a property of the hypothesis and the parameters θ, not a property of the training dataset. Choosing a different θ leads to a different decision boundary (regardless of the training dataset).
• The training dataset is used to fit the parameters θ (i.e. to find the optimal θ), as we saw with the cost function and gradient descent.
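A sketch of this idea using scikit-learn's PolynomialFeatures and LogisticRegression on synthetic data shaped like the example above (the data, seed, and settings are assumptions):

```python
# Adding x1^2, x2^2 (and x1*x2) lets a linear classifier learn a circular boundary.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
y = (X[:, 0]**2 + X[:, 1]**2 >= 1).astype(int)    # y = 1 outside the unit circle

poly = PolynomialFeatures(degree=2, include_bias=False)   # adds x1^2, x1*x2, x2^2
X_poly = poly.fit_transform(X)

clf = LogisticRegression(max_iter=1000).fit(X_poly, y)
print(clf.score(X_poly, y))                                # close to 1.0 on this toy data
print(clf.predict(poly.transform([[0.0, 0.0], [2.0, 0.0]])))   # expect [0, 1]
```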
Logistic Regression – More complex non-linear decision boundary

Example of a more complex non-linear decision boundary:
• Let's add even more higher-order polynomial terms to the features:

  h_θ(x) = g(θ₀ + θ₁x₁ + θ₂x₂ + θ₃x₁² + θ₄x₁²x₂ + θ₅x₁²x₂² + θ₆x₁³x₂ + …)

• With enough such terms, the decision boundary can take very complex shapes, e.g. a region of y = 1 surrounded by regions of y = 0.
The K Nearest Neighbors
Classifier
K Nearest Neighbors (KNN)

Classifiers covered in this lecture: Logistic Regression, K Nearest Neighbors, Decision Tree.
K Nearest Neighbors (KNN) – Classification

• Simple method that does not require learning (the model is just the labeled training dataset itself).

• For each test data point x to be classified, find the K nearest points in the training data.

• Classify the point x according to the majority vote of their class labels.

Example: K = 3, two classes (red / blue): the test point x is assigned the majority class among its 3 nearest neighbors.
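A small from-scratch sketch of this procedure (Euclidean distance and majority vote; the toy points are made up):

```python
# KNN classification: distances to all training points, then a majority vote.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    dists = np.linalg.norm(X_train - x, axis=1)     # Euclidean distances to x
    nearest = np.argsort(dists)[:k]                 # indices of the K nearest points
    return Counter(y_train[nearest]).most_common(1)[0][0]   # majority class label

# Toy 2-class example (red = 0, blue = 1).
X_train = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
                    [1.0, 1.0], [0.9, 1.1], [1.2, 0.8]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.95, 0.9]), k=3))   # -> 1
```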
Classification by Nearest Neighbor

• Word-vector document classification: here the vector space is illustrated as having 2 dimensions. But for real text document data, what would be our features? How many?

• Classify the test document as the class of the document "nearest" to the query document (use vector similarity to find the most similar document).

Classification by KNN

• Classify the test document as the majority class of the k nearest documents.
KNN – Model Complexity

• Linear model (e.g. Logistic Regression): a very simple decision boundary.
• KNN with K = 1: produces a complex decision boundary on this dataset.
• KNN with K = 15: produces a simpler decision boundary than K = 1.

A smaller K produces a more complex decision boundary (see the sketch below).
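An illustrative scikit-learn sketch (dataset and K values are assumptions) showing how the choice of K affects training and test performance:

```python
# K controls model complexity: K = 1 memorizes the training set,
# larger K gives a smoother decision boundary.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=400, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for k in (1, 15, 50):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    print(k, knn.score(X_tr, y_tr), knn.score(X_te, y_te))
# K = 1 gives a perfect training score (very complex boundary) but usually
# a lower test score than a moderate K such as 15.
```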
Decision Tree
Classifier
Classification with Decision Trees

Classifiers covered in this lecture: Logistic Regression, K Nearest Neighbors, Decision Tree.
Decision Tree
• Example: Shall we play golf today ?

Decision Tree – Classification
• At each internal node:
  – A question is asked about the data
  – One child node per possible answer
• Leaf nodes:
  – Class label (i.e. the decision to take)

• Building the tree:
  – For each node, find the feature F and threshold value T ...
  – ... that split the samples assigned to the node into 2 subsets ...
  – ... so as to maximize the label purity within these subsets.

• Decision trees are simple, practical and easy to interpret. Given a set of instances (with a set of features), a tree is constructed with internal nodes as the features and the leaves as the classes.
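An illustrative scikit-learn sketch (the dataset and depth limit are assumptions) of building such a tree and printing the learned questions and leaf classes:

```python
# Each internal node tests one feature against a threshold; leaves hold class labels.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X, y = iris.data, iris.target
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Text view of the learned "feature <= threshold" questions and leaf classes.
print(export_text(tree, feature_names=iris.feature_names))
print(tree.predict(X[:3]))
```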
In the next lecture:

Overfitting, Generalization,
Regularization
Overfitting
• Overfitting:
– A model that performs well on the training examples, but poorly on
new examples.
– Training and testing on the same data will generally lead to overfitting
and produce a model which looks good only for this particular training
dataset.

• To avoid overfitting:
– Use separate training and testing data
– Use cross-validation
– Try using simple models first
– Use regularization or ensemble models …

– We will talk more about this next week.

Performance evaluation

