Machine Learning - Lecture 5
(521289S)
Multi-class classification
• In practice, many classification problems have more than two classes (C > 2) that we wish
to distinguish
• Here we study how to extend to multi-class cases
– One-versus-All classification (OvA)
– Multi-class Perceptrons
OvA principle
• For a C-class problem:
1. Solve C sub-problems, in each of which a decision boundary between one class and the rest of the data is
computed, and
2. Combine the individual decision boundaries with a fusion rule to get the final classifier (a sketch of step 1 follows).
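A minimal numpy sketch of step 1 (illustrative only; train_binary stands for any two-class learner, e.g. a perceptron or logistic-regression trainer, and labels are assumed to be integers 0,…,C−1 — none of these names come from the lecture):

    import numpy as np

    def train_ova(X, y, C, train_binary):
        # X: (P, N) data matrix, y: (P,) integer labels in {0, ..., C-1}
        # train_binary(X, labels) must return a parameter vector for a +1 / -1 problem
        W = []
        for c in range(C):
            labels = np.where(y == c, 1.0, -1.0)   # class c against the rest
            W.append(train_binary(X, labels))
        return np.stack(W, axis=1)                  # one column of parameters per class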
About positive and negative sides of a hyperplane
• A data point is on the positive / negative side of a hyperplane according to the sign of the
dot-product:
– The dot-product is positive if the data point is on the side the normal vector points toward (the positive side)
– The dot-product is negative if the data point is on the opposite side (the negative side)
• By convention, we label data points ‘+1’ and ‘-1’ depending on which side of the hyperplane they are
expected to lie on after model optimization (in the OvA sub-problem for class c, the points of class c are labelled ‘+1’)
• As mentioned earlier, the parameter vector w can be decomposed into two parts: w0
(intercept / bias) and ω = [w1,…,wN]T (the normal vector of the hyperplane)
• We choose ω so that it points toward the side where the ‘+1’ data points should lie after optimization
• Note: the dot product is related to the signed length of the projection of the point/vector x onto
the normal vector ω. If the length of ω equals 1, then the dot-product is the signed
distance of the point from the hyperplane.
[Figure: two data points x1 and x2 on opposite sides of a hyperplane with normal vector ω, labelled ‘-1’ and ‘+1’ according to the side they lie on]
d = xTω = ‖x‖ ‖ω‖ cos(θ), where θ is the angle between x and ω:
d > 0 if θ ∈ (−90°, +90°), and d < 0 otherwise
• When all C classifiers “agree” that classifier c is the winner, we can write
the fusion classifier as shown below:
– Note: the signed distance can be negative or positive depending on which side of the boundary the point lies on!
– Note: the parameter vectors wj must be normalized (unit-length normal vectors) for a fair comparison
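(A sketch with the notation above: w_{0,j} and ω_j are the bias and unit-length normal vector of the sub-classifier for class j, so the quantity being maximized is the signed distance of x from boundary j.)

\hat{y}(x) = \arg\max_{j=1,\dots,C} \big( w_{0,j} + x^{T}\omega_j \big)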
A problem with OvA classifier
• In OvA, the sub-classifiers are trained independently and then fused, which
can cause problems when the data are distributed in complex ways
– A small class in the center of the data has a hard time placing an optimal decision boundary, as all
the other classes surround it
Multi-class Perceptron cost function (1/2)
• The fusion rule is the same as before
• In other words, the signed distance from point x_p to its own class (y_p) decision
boundary should be greater than (or equal to) its distance to every other
two-class decision boundary
– Note: if the wrong class wins for a data point, the optimizer updates the parameters,
translating/rotating all decision boundaries accordingly
• The total cost function is then:
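(A sketch of this cost: writing s_j(x_p) = w_{0,j} + x_p^T ω_j for the score of class j on training point x_p, with y_p the label of x_p, P the number of training points, and W collecting all parameters, the requirement above is s_{y_p}(x_p) ≥ s_j(x_p) for all j, and violations are penalized as)

g(W) = \frac{1}{P}\sum_{p=1}^{P}\Big[\max_{j=1,\dots,C} s_j(x_p) \;-\; s_{y_p}(x_p)\Big]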
Multi-class Perceptron cost function (2/2)
• Equivalently (ReLU form):
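(The same cost written with a rectified max(0, ·) term, using the class scores s_j defined above; each summand is non-negative and equals zero exactly when the correct class has the largest score.)

g(W) = \frac{1}{P}\sum_{p=1}^{P}\max\Big(0,\;\max_{j\neq y_p} s_j(x_p) \;-\; s_{y_p}(x_p)\Big)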
Categorical classification (1/4)
• Class labels can be arbitrary numbers. However, we can also use a multi-
input-multi-output approach, as in regression: the output y_p is then a vector of
dimension C
• Let’s define a one-hot encoded vector for each class:
– Each one-hot encoded categorical label is a length-C vector containing all zeros except
a ‘1’ at the index given by the value of y_p
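A minimal numpy sketch of the encoding (assuming integer labels 0,…,C−1; the helper name one_hot is illustrative, not from the lecture):

    import numpy as np

    def one_hot(y, C):
        # y: integer class labels in {0, ..., C-1}; returns a (len(y), C) matrix
        # with a single 1 per row at the column given by the label
        Y = np.zeros((len(y), C))
        Y[np.arange(len(y)), y] = 1.0
        return Y

    # Example: three labels from a C = 4 class problem
    print(one_hot(np.array([2, 0, 3]), C=4))
    # [[0. 0. 1. 0.]
    #  [1. 0. 0. 0.]
    #  [0. 0. 0. 1.]]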
Categorical classification (2/4)
• Combining logistic regression and multi-output regression, we can set our
goal to find a parameter matrix W such that the transformed output vector approximates the one-hot encoded label
• For example, if the largest entry of the output vector is 0.7 and it sits at the index of class 1, we classify the
input vector x to class 1 and assign it a confidence of 0.7. We can then state: the
input vector belongs to class 1 with probability 0.7.
• Thus, the classifier picks the class with the maximum output value (a sketch follows):
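(A sketch of both the goal and the classifier, where σ denotes the logistic sigmoid applied to each class score s_j(x) = w_{0,j} + x^T ω_j; the exact notation on the original slides may differ.)

Goal, for every training point p and every class j:
\sigma\big(s_j(x_p)\big) \approx \big[\text{one-hot}(y_p)\big]_j

Classifier:
\hat{y}(x) = \arg\max_{j=1,\dots,C} \sigma\big(s_j(x)\big)

Since σ is monotonically increasing, taking the arg max of the transformed outputs gives the same decision as taking the arg max of the raw scores, so this agrees with the fusion rule used earlier.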
• However, we can also use the log-error cost function, as that works better
for binary outputs
• The total cost function will be the standard multi-class Cross Entropy /
Softmax cost function:
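(A sketch with the same score notation s_j(x_p) = w_{0,j} + x_p^T ω_j; the fraction inside the logarithm is the softmax probability assigned to the correct class y_p.)

g(W) = -\frac{1}{P}\sum_{p=1}^{P}\log\frac{e^{\,s_{y_p}(x_p)}}{\sum_{j=1}^{C} e^{\,s_j(x_p)}}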
Nonlinear multi-class classification (1/2)
• Generalization to nonlinear decision boundaries is similar to logistic regression
• The model in the case where each class has its own separate nonlinear feature functions:
• The model in the case where all classes share the same nonlinear feature functions:
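(Sketches of the two variants; f_1, f_2, … denote nonlinear feature functions, B is the number of features, and the superscript (j) marks quantities that belong only to class j. The notation on the original slides may differ.)

Separate nonlinear functions for each class j:
\text{model}_j(x) = w_{0}^{(j)} + f_{1}^{(j)}(x)\, w_{1}^{(j)} + \dots + f_{B}^{(j)}(x)\, w_{B}^{(j)}

Shared nonlinear functions:
\text{model}_j(x) = w_{0}^{(j)} + f_{1}(x)\, w_{1}^{(j)} + \dots + f_{B}(x)\, w_{B}^{(j)}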
Nonlinear multi-class classification (2/2)
• We can stack all C models into a single matrix form:
• Fusion classifier:
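(A sketch with shared features: f(x) = [1, f_1(x), …, f_B(x)]^T is the feature vector with a prepended 1, and W is the (B+1) × C matrix whose j-th column contains the parameters of class j.)

\text{model}(x) = f(x)^{T} W \qquad \text{(a row of C class scores)}

\hat{y}(x) = \arg\max_{j=1,\dots,C} \big(f(x)^{T} W\big)_j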
Mini-batch learning
• With very large data sets, the learning task is often divided into
consecutive learning steps: the data set is split into disjoint sub-
sets (mini-batches), and the model is optimized with them in sequence
– The cost functions naturally decompose over mini-batches, as they are sums over all data points
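A minimal numpy sketch of one such scheme, mini-batch gradient descent (illustrative only; grad_cost stands for the gradient of whichever cost above is used, evaluated on one mini-batch, and the step size and batch size are arbitrary example values):

    import numpy as np

    def minibatch_gd(X, y, W, grad_cost, step=0.1, batch_size=32, epochs=10):
        # X: (P, N) data matrix, y: (P,) labels, W: initial parameters
        P = X.shape[0]
        for _ in range(epochs):
            order = np.random.permutation(P)            # shuffle once per epoch
            for start in range(0, P, batch_size):
                batch = order[start:start + batch_size]
                # update the parameters using the gradient of the cost on this mini-batch only
                W = W - step * grad_cost(W, X[batch], y[batch])
        return W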