Lecture 3. Classification
Halmstad University

Classification
• The variable 𝑦 that you want to predict (the output variable) is discrete.
Some applications of classification
• Edible or poisonous?
Some applications of classification
• e.g. Laryngeal disease diagnostics
• Features / Attributes:
– Age
– Subjectively estimated illness duration (months)
– Education (five grades)
– Average duration of intensive speech use (hours/day)
– Number of days of intensive speech use (days/week)
– Smoking (Yes/No)
– Smoked cigarettes/day
– Smoking duration (years)
– Subjective voice function assessment by the patient
– Maximal tonality duration for “aaaaaa” (sec)
– Functional voice index (F)
– Emotional condition index (E)
– Physical condition index (P)
– Voice deficiency index
– …
Some applications of classification
[Figure: image classification examples – apple, pear, tomato, cow, dog, horse]
Linear Classification with Logistic Regression
Logistic Regression
• This is a classification method (don’t get confused by the name).
• h_θ(x) = g(θᵀx) = 1 / (1 + e^(−θᵀx)), where g(a) = 1 / (1 + e^(−a)) is the sigmoid (logistic) function.
Example: If h_θ(x) = 0.7, then there is a 70% chance of the tumor being malignant (y = 1).
h_θ(x) gives the probability that y = 1 given x, parametrized by θ:
h_θ(x) = P(y = 1 | x; θ) = 0.7
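To make this computation concrete, here is a minimal NumPy sketch of evaluating h_θ(x); the parameter and feature values are made up purely for illustration:

import numpy as np

def sigmoid(a):
    """Logistic (sigmoid) function g(a) = 1 / (1 + e^(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def h(theta, x):
    """Hypothesis h_theta(x) = g(theta^T x); x must include the bias term x0 = 1."""
    return sigmoid(theta @ x)

# Hypothetical numbers, just to illustrate the computation:
theta = np.array([-6.0, 0.05])   # theta_0 (bias) and theta_1 (weight on tumor size)
x = np.array([1.0, 145.0])       # x0 = 1, x1 = tumor size
print(h(theta, x))               # estimated probability that y = 1 for this x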
Logistic Regression
Linear decision boundary
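Why the decision boundary is linear: predicting y = 1 whenever h_θ(x) ≥ 0.5 is the same as predicting y = 1 whenever θᵀx ≥ 0 (since g(0) = 0.5). The boundary is therefore the set of points where θ₀ + θ₁x₁ + … + θ_d x_d = 0, a straight line in two dimensions and a hyperplane in general.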
Logistic Regression – Error function
• Training dataset: {(x⁽¹⁾, y⁽¹⁾), (x⁽²⁾, y⁽²⁾), …, (x⁽ⁿ⁾, y⁽ⁿ⁾)}
• x = (x₀, x₁, …, x_d)ᵀ ∈ ℝ^(d+1), with x₀ = 1 and y ∈ {0, 1}
• h_θ(x) = 1 / (1 + e^(−θᵀx))
• How do we choose the parameters θ?
  – By minimizing some error (cost) function
Logistic Regression – Error function
• The squared-error cost used for linear regression is non-convex once the sigmoid hypothesis is plugged in, so instead we use the following convex cost function.
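The standard convex choice (and presumably the formula shown on this slide) is the cross-entropy error over the n training examples:

E(θ) = −(1/n) Σᵢ₌₁ⁿ [ y⁽ⁱ⁾ log h_θ(x⁽ⁱ⁾) + (1 − y⁽ⁱ⁾) log(1 − h_θ(x⁽ⁱ⁾)) ]

Each term penalizes confident wrong predictions heavily: when y⁽ⁱ⁾ = 1 the cost of that example is −log h_θ(x⁽ⁱ⁾), which grows without bound as h_θ(x⁽ⁱ⁾) → 0.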
Logistic Regression – Error function
• To find the best parameters θ: solve min_θ E(θ)
• To make a prediction for a new x, compute h_θ(x) = 1 / (1 + e^(−θᵀx))
• h_θ(x) is interpreted as P(y = 1 | x; θ)
Gradient of the Cost Function
Derivative of the cost function
To use gradient descent, we need to know ∂E/∂θⱼ for j = 0, …, d
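Working this derivative out (assuming the cross-entropy cost above) gives the standard result

∂E/∂θⱼ = (1/n) Σᵢ₌₁ⁿ ( h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾ ) xⱼ⁽ⁱ⁾

which has the same form as the gradient in linear regression, except that h_θ is now the sigmoid of θᵀx.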
Gradient descent algorithm
Simultaneously update all θⱼ for j = 0, …, d
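Each step performs θⱼ := θⱼ − α · ∂E/∂θⱼ for all j at once, where α is the learning rate. A minimal NumPy sketch of the whole training loop, assuming the cross-entropy cost and gradient above (the learning rate, iteration count and toy data are illustrative choices, not values from the lecture):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_logistic_regression(X, y, alpha=0.1, n_iters=1000):
    """Gradient descent for logistic regression.
    X: (n, d+1) matrix whose first column is all ones (x0 = 1); y: (n,) vector of 0/1 labels."""
    n, p = X.shape
    theta = np.zeros(p)
    for _ in range(n_iters):
        h = sigmoid(X @ theta)          # h_theta(x^(i)) for every training example
        grad = (X.T @ (h - y)) / n      # dE/dtheta_j = (1/n) sum_i (h - y) x_j
        theta -= alpha * grad           # simultaneous update of all theta_j
    return theta

# Toy usage with made-up data:
X = np.array([[1, 0.5], [1, 1.5], [1, 3.0], [1, 4.5]])   # first column is x0 = 1
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = train_logistic_regression(X, y)
print(sigmoid(X @ theta))   # predicted P(y = 1 | x) for each training point

In practice one would monitor E(θ) across iterations to choose α and to decide when to stop.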
Can We Use Logistic Regression for Multi-class Classification?
Multi-class classification
• Examples of multi-class classification applications
  – Medical diagnosis:
    • Cold, Flu, Not ill, …
  – Email classification/foldering/tagging:
    • Work, Friends, Family, Hobby, …
  – Weather:
    • Sunny, Cloudy, Rainy, Snow
Multi-class classification
[Figures: example data in the (x₁, x₂) plane with more than two classes – how do we classify it?]
Multi-class classification: one-vs-all (one-vs-rest)
• Train a logistic regression classifier h_θ⁽ⁱ⁾(x) for each class i to predict the probability that y = i:
  h_θ⁽ⁱ⁾(x) = P(y = i | x; θ), i = 1, 2, 3
[Figures: one binary sub-problem per class in the (x₁, x₂) plane – h_θ⁽¹⁾(x) = P(y = 1 | x; θ), h_θ⁽²⁾(x) = P(y = 2 | x; θ), …]
• One-vs-one
  – You can also train one binary classification model for each pair of classes.
  – The number of models is then on the order of c² (one per pair of classes).
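A sketch of the one-vs-all scheme in NumPy, using the same gradient-descent trainer idea as above (the hyper-parameters are assumptions for illustration):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_binary(X, y, alpha=0.1, n_iters=2000):
    """Binary logistic regression by gradient descent (as in the earlier sketch)."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        theta -= alpha * (X.T @ (sigmoid(X @ theta) - y)) / len(y)
    return theta

def train_one_vs_all(X, y, classes):
    """One classifier per class i, trained on the binary problem 'class i vs. the rest'."""
    return {c: train_binary(X, (y == c).astype(float)) for c in classes}

def predict(thetas, x):
    """Pick the class whose classifier assigns the highest probability P(y = i | x; theta)."""
    return max(thetas, key=lambda c: sigmoid(thetas[c] @ x))

At prediction time the class with the largest estimated probability wins; one-vs-one would instead train a model per pair of classes and let the pairwise winners vote.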
Nonlinear Classification

Non-linear classification with Logistic Regression
Logistic Regression – Non-linear decision boundary
Example of a non-linear decision boundary
• Let’s add extra higher-order polynomial terms to the features:
  h_θ(x) = g(θᵀx)
         = g(θ₀ + θ₁x₁ + θ₂x₂ + θ₃x₁² + θ₄x₂²)
         = g(−1 + x₁² + x₂²)
  Here the decision boundary θᵀx = 0 is the circle x₁² + x₂² = 1: we predict y = 1 outside the circle and y = 0 inside.
• With even higher-order terms, more complex boundaries become possible:
  h_θ(x) = g(θ₀ + θ₁x₁ + θ₂x₂ + θ₃x₁² + θ₄x₁²x₂ + θ₅x₁²x₂² + θ₆x₁³x₂ + …)
[Figure: a non-linear boundary in the (x₁, x₂) plane separating a y = 1 region from surrounding y = 0 regions]
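A small NumPy sketch of the polynomial feature mapping; θ = (−1, 0, 0, 1, 1) is the illustrative value from the slide, and the test points are made up:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def poly_features(x1, x2):
    """Map (x1, x2) to [1, x1, x2, x1^2, x2^2] so a linear model can learn a circular boundary."""
    return np.array([1.0, x1, x2, x1**2, x2**2])

theta = np.array([-1.0, 0.0, 0.0, 1.0, 1.0])    # the example parameters from the slide

for x1, x2 in [(0.0, 0.0), (2.0, 0.0), (0.5, 0.5)]:
    p = sigmoid(theta @ poly_features(x1, x2))  # P(y = 1 | x)
    print((x1, x2), p, "-> y =", int(p >= 0.5)) # points outside the unit circle get y = 1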
The K Nearest Neighbors Classifier

K Nearest Neighbors (KNN)
K Nearest Neighbors (KNN) – Classification
• A simple method that does not require learning (the model is just the labeled training dataset itself).
• For each test data point x to be classified, find the K nearest points in the training data.
• Classify the point x according to the majority vote of their class labels.
Example:
• K = 3
• 2 classes (red / blue)
[Figure: the 3 nearest neighbors of a test point x determine its class]
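A minimal NumPy sketch of the KNN rule described above, using Euclidean distance and a toy two-class data set:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, K=3):
    """Classify x by majority vote among the K nearest training points (Euclidean distance)."""
    distances = np.linalg.norm(X_train - x, axis=1)   # distance from x to every training point
    nearest = np.argsort(distances)[:K]               # indices of the K nearest neighbors
    votes = Counter(y_train[i] for i in nearest)      # count the class labels among them
    return votes.most_common(1)[0][0]

# Toy data: two classes ("red" / "blue") in 2D
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [3.0, 3.0], [3.2, 2.9]])
y_train = np.array(["red", "red", "red", "blue", "blue"])
print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), K=3))   # -> "red"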
Classification by Nearest Neighbor
Classify the test document as the class of the document “nearest” to the query document (use vector similarity to find the most similar document).

Classification by KNN
KNN – Model Complexity
[Figures: KNN decision boundaries for different values of K]
Decision Tree Classifier

Classification with Decision Trees
Decision Tree
• Example: Shall we play golf today?
[Figure: a decision tree for this example]
Decision Tree – Classification
• At each internal node:
  – A question is asked about the data
  – There is one child node per possible answer
• At each leaf node:
  – A class label (i.e. the decision to take)
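A hand-written sketch of such a tree for the golf example; the particular questions and branches are assumptions for illustration, not necessarily the tree shown in the lecture:

# Each internal node asks a question about the example; each branch is one possible
# answer; each leaf returns a class label ("play" / "don't play").
def classify(example):
    """Walk the tree from the root to a leaf and return the class label stored there."""
    if example["outlook"] == "overcast":
        return "play"
    if example["outlook"] == "sunny":
        return "don't play" if example["humidity"] == "high" else "play"
    # outlook == "rain"
    return "don't play" if example["windy"] else "play"

print(classify({"outlook": "sunny", "humidity": "high", "windy": False}))   # -> "don't play"
print(classify({"outlook": "rain", "humidity": "normal", "windy": False}))  # -> "play"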
In the next lecture:
Overfitting, Generalization, Regularization
Overfitting
• Overfitting:
  – A model that performs well on the training examples, but poorly on new examples.
  – Training and testing on the same data will generally lead to overfitting and produce a model which looks good only for this particular training dataset.
• To avoid overfitting:
  – Use separate training and testing data
  – Use cross-validation (see the sketch below)
  – Try using simple models first
  – Use regularization or ensemble models …
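A short scikit-learn sketch of the first two points (held-out test data and cross-validation); the data set and classifier are arbitrary illustrative choices:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hold out a test set that the model never sees during training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print("train accuracy:", model.score(X_train, y_train))   # usually optimistic
print("test accuracy:", model.score(X_test, y_test))      # honest estimate of generalization

# 5-fold cross-validation on the training data
scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X_train, y_train, cv=5)
print("CV accuracy:", scores.mean())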
Performance evaluation