
Machine Learning (521289S)
Lecture 5

Prof. Tapio Seppänen
Center for Machine Vision and Signal Analysis
University of Oulu
Multi-class classification
• In practice, many classification problems have more than two classes (C > 2) that we wish
to distinguish
• Here we study how to extend two-class classifiers to the multi-class case:
– One-versus-All classification (OvA)
– Multi-class Perceptrons

OvA principle
• For a C-class problem:
1. Solve C sub-problems, in each of which a decision boundary between one class and the rest of the
data is computed, and
2. Combine the individual decision boundaries with a fusion rule to get the final classifier.

• Data is represented in the usual notation as:

• To solve the class c decision boundary, assign temporary class labels:
– The class c is assigned label ’+1’, and all other classes are combined and assigned label ’-1’!
• The C linear decision boundaries are found individually by optimization:

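To make the two-step recipe above concrete, here is a minimal NumPy sketch of OvA training. The lecture leaves the choice of two-class cost/optimizer open, so this sketch uses a simple perceptron-style subgradient step; train_ova and all parameter names are illustrative, not the lecture's.

import numpy as np

def train_ova(X, y, C, epochs=100, lr=0.1):
    # Minimal One-versus-All sketch. X: (P, N) data matrix, y: (P,) integer
    # labels in {0, ..., C-1}. Returns W: (N+1, C), one extended weight
    # vector (bias w0 stacked on top of the normal vector) per class.
    y = np.asarray(y)
    P, N = X.shape
    Xe = np.hstack([np.ones((P, 1)), X])      # extended inputs: prepend '1' for the bias
    W = np.zeros((N + 1, C))
    for c in range(C):
        t = np.where(y == c, 1.0, -1.0)       # temporary labels: class c -> +1, the rest -> -1
        w = np.zeros(N + 1)
        for _ in range(epochs):               # subgradient descent on the two-class perceptron cost
            wrong = t * (Xe @ w) <= 0         # points on the wrong side (or on the boundary)
            grad = -(Xe[wrong] * t[wrong, None]).sum(axis=0) / P
            w -= lr * grad
        W[:, c] = w
    return W
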
About positive and negative sides of a hyperplane
• A data point is on the positive / negative side of a hyperplane according to the sign of the
dot-product:

– The dot-product is positive if the data point is on the side that the normal vector points to (the
positive side)
– The dot-product is negative if the data point is on the opposite side (the negative side)
• By convention, we label data points ‘+1’ and ‘-1’ depending on which side they are expected to
lie on after model optimization (class c is labelled ‘+1’ on the previous slide)
• As mentioned earlier, the parameter vector w can be decomposed into two parts: w0
(the intercept / bias) and ω = [w1,…,wN]T (the normal vector of the hyperplane)
• We choose ω to point to the side where the ‘+1’ data points should lie after optimization
• Note: the dot product is related to the signed length of the projection of the point/vector x onto
the normal vector ω. If the length of ω equals 1, then the dot-product is the signed distance
of the point from the hyperplane.

[Figure: two example points relative to a hyperplane with normal vector ω; x1 lies on the
positive (’+1’) side with d1 > 0, x2 on the negative (’-1’) side with d2 < 0.]

d = xTω = ‖x‖ ‖ω‖ cos(angle):
  d > 0 if the angle is in (-90°, +90°)
  d < 0 otherwise

Note: x is the original feature vector here, not the extended one (with the leading ’1’)!
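The slide only states the unit-normal special case; for completeness, with the decomposition w = (w0, ω) introduced above, the signed distance of a point x from the hyperplane w0 + xTω = 0 can be written in general as (a standard result, not printed on the slide):

$$ d(\mathbf{x}) = \frac{w_0 + \mathbf{x}^{T}\boldsymbol{\omega}}{\lVert \boldsymbol{\omega} \rVert_{2}} $$

With ‖ω‖ = 1 this reduces to the (bias-shifted) dot-product used in the figure.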
Fusion: Case 1 - Data points on the positive
side of a single classifier
• We have now set each linear decision surface (hyperplane) optimally
between the data clusters representing the classes
• For any data point x that is on the positive side of the cth optimized
hyperplane, but on the negative side of all of the other hyperplanes:

• We can therefore write:

• As all C classifiers “agree” that the classifier c is the winner, we can write
the fusion classifier as:

– This classifier assigns these data points to Class c because classifier c is the only one that
yields a positive value of the dot-product
Fusion: Case 2 - Data points on the positive
side of more than one classifier
• We have now set each linear decision surface (hyperplane) optimally between the data
clusters representing the classes
• When at least two classifiers claim that the data point is on the positive side of their decision
boundary, we need to consider which one has the highest confidence
• A simple rule can be applied: the further the data point is on the positive side of a
particular decision boundary, the more confident that classifier is that it is the winner
• We already know how to compute the distance of a point from a hyperplane:

– Note: the distance can be negative or positive depending on which side the point lies on!
– Note: the parameter vectors wj must be in normalized form (unit length) for a fair comparison

• So, the classifier fusion rule is then:

• The same rule applies as in Case 1!
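The fusion rule itself did not survive into this text version of the slide; the argmax over signed distances described by the surrounding text would read, as a reconstruction in the notation of the earlier slides:

$$ y = \arg\max_{j=1,\dots,C} \; \frac{w_{0,j} + \mathbf{x}^{T}\boldsymbol{\omega}_{j}}{\lVert \boldsymbol{\omega}_{j} \rVert_{2}} $$

i.e. the point is assigned to the classifier on whose positive side it lies furthest.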


Fusion: Case 3 - Data points on the positive
side of no classifier
• We have now set each linear decision surface (hyperplane) optimally between
the data clusters representing the classes
• The data point is located on the negative side of all classifiers
– All distances from decision boundaries are negative
• We will classify the point to the class that has the closest decision boundary:
the largest signed distance
– For example: max(-4, -10, -2) = -2
• The fusion classifier for this case is:

• The same rule applies as in Case 1 and Case 2!
• Note: all parameter vectors wj must be in normalized form (unit length)
OvA fusion classifier summary
• First, find C One-versus-All decision boundaries (hyperplanes)
independently by training the C two-class classifiers
• Then, establish the OvA fusion classifier:

• The decision boundary of the fusion classifier is based on the individual decision boundaries
but, in general, cannot be solved in closed form
• Data balancing (e.g. weighted classifiers) should be used in the cost functions, as the ’+1’ and
’-1’ data sets can have very different sizes
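Continuing the train_ova sketch from the OvA principle slide, a minimal implementation of this fusion rule might look as follows (function and variable names are ours, not the lecture's):

import numpy as np

def predict_ova(W, x):
    # Fusion rule sketch: pick the class whose boundary gives the largest signed
    # distance. W is the (N+1, C) matrix returned by train_ova; x is an
    # ordinary (non-extended) feature vector.
    xe = np.concatenate([[1.0], x])                  # extended input with leading '1'
    omega_norms = np.linalg.norm(W[1:, :], axis=0)   # ||omega_j|| for a fair comparison
    signed_dist = (xe @ W) / omega_norms             # normalized signed distances
    return int(np.argmax(signed_dist))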

A problem with OvA classifier
• In OvA, the subclassifiers are trained independently and then fused, which can cause problems
if the data are distributed in complex ways (left panel of the figure below)
– The small class in the center has a hard time placing an optimal decision boundary, as all the
other classes surround it (middle panel)

• Instead, a multi-class Perceptron is able to find good decision boundaries in the feature space
(right panel)
– It learns all sub-classifier decision boundaries simultaneously by including the classifier
fusion in the cost function

Multi-class Perceptron cost function (1/2)
• The fusion rule is as before:

• In other words, the signed distance from point xp to its class (yp) decision
boundary should be greater than (or equal to) its distance to every other
two-class decision boundary:

• A point-wise cost function can now be defined as:

– Note: if a wrong class is the winner for a data point, the optimizer updates the parameters
accordingly, translating/rotating all decision boundaries
• The total cost function is then:
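The cost expressions themselves are missing from this text version; written in the course notation (extended inputs x̊_p, one weight vector w_j per class), the standard multi-class Perceptron point-wise and total costs matching the description above are:

$$ g_p(\mathbf{W}) = \Big(\max_{j=1,\dots,C} \mathring{\mathbf{x}}_p^{T}\mathbf{w}_j\Big) - \mathring{\mathbf{x}}_p^{T}\mathbf{w}_{y_p}, \qquad g(\mathbf{W}) = \frac{1}{P}\sum_{p=1}^{P} g_p(\mathbf{W}) $$

Each g_p is zero exactly when the class y_p boundary attains the (tied-)largest value for x̊_p, and positive otherwise.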

Multi-class Perceptron cost function (2/2)
• Equivalently (ReLU form):

• A regularized version of the cost function:

• Softmax cost function:

• Alternative formulation of Softmax cost function:

• Regularized version of Softmax cost function:
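As on the previous slide, the formulas are missing from this text version; the multi-class Softmax cost that serves as a smooth approximation of the Perceptron cost above would be, as a reconstruction:

$$ g(\mathbf{W}) = \frac{1}{P}\sum_{p=1}^{P}\Big[\log\Big(\sum_{j=1}^{C} e^{\,\mathring{\mathbf{x}}_p^{T}\mathbf{w}_j}\Big) - \mathring{\mathbf{x}}_p^{T}\mathbf{w}_{y_p}\Big] $$

A regularized version adds a penalty such as λ Σ_j ‖ω_j‖² to this cost.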

Categorical classification (1/4)
• Class labels can be arbitrary numbers. However, we can also use a multi-input,
multi-output approach as in regression: the output yp is a vector of dimension C
• Let’s define a one-hot encoded vector for each class:
– Each one-hot encoded categorical label is a length-C vector containing all zeros except
a ‘1’ at the index equal to the value of yp
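A minimal sketch of the one-hot encoding described above (the helper name one_hot and the 0-based labels are our assumptions):

import numpy as np

def one_hot(y, C):
    # Encode integer labels y in {0, ..., C-1} as length-C vectors that are all
    # zeros except for a '1' at the position given by the label.
    return np.eye(C)[np.asarray(y)]

# Example: one_hot([0, 2, 1], C=3) ->
# [[1., 0., 0.],
#  [0., 0., 1.],
#  [0., 1., 0.]]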

Categorical classification (2/4)
• Combining logistic regression and multi-output regression, we can set our goal as finding a
parameter matrix W such that:

– The sigmoid function should output a vector of dimension C which approximates the
one-hot encoded vector yp for each data point xp

• The logistic sigmoid function can be generalized as:

• The generalized sigmoid function applies one form of normalization to the output vector so
that all output elements are values in [0, 1] and sum to 1
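A minimal NumPy sketch of this generalized sigmoid (the function name is ours; this normalization is more commonly known as the softmax function):

import numpy as np

def generalized_sigmoid(z):
    # Map a length-C score vector to values in [0, 1] that sum to 1.
    # Subtracting the maximum is a standard trick for numerical stability.
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

# Example: generalized_sigmoid([2.0, 1.0, 0.1]) -> approx. [0.66, 0.24, 0.10]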
Categorical classification (3/4)
• This output vector (prediction) can be interpreted as a discrete probability
distribution as all elements are nonnegative and sum to 1
– Very convenient as we can now classify an input vector and assign a confidence value
in the range [0, 1] to it!
– An example:

We classify the input vector x to class 1, as that corresponds to the maximum value in
the output vector. In addition, we assign a confidence of 0.7 to it. We can now state: the
input vector belongs to Class 1 with probability 0.7.
• Thus, the classifier is:

• In the figure, we can see that the xTwc distances are normalized to represent the probabilities
for any data point x
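A sketch of this argmax-with-confidence rule (names are our own; note that the slide numbers classes from 1, while the code index is 0-based):

import numpy as np

def classify_with_confidence(W, x_ext):
    # W: (N+1, C) parameter matrix, x_ext: extended input with a leading '1'.
    # Returns the winning class index and its normalized output as a confidence.
    z = x_ext @ W                        # length-C raw scores x^T w_c
    p = np.exp(z - z.max())
    p /= p.sum()                         # elements in [0, 1], summing to 1
    c = int(np.argmax(p))
    return c, float(p[c])                # e.g. (0, 0.7): the slide's "Class 1 with probability 0.7"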
Categorical classification (4/4)
• One could use the standard point-wise Least Squares cost function:

• However, we can also use the log-error cost function as that works better
for binary outputs:

• The total cost function will be the standard multi-class Cross Entropy /
Softmax cost function:
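The formula itself is missing from this text version; with one-hot labels y_p and the generalized sigmoid σ from the previous slides, the standard multi-class Cross Entropy / Softmax cost takes the form (reconstruction):

$$ g(\mathbf{W}) = -\frac{1}{P}\sum_{p=1}^{P}\sum_{c=1}^{C} y_{p,c}\,\log \sigma_{c}\big(\mathbf{W}^{T}\mathring{\mathbf{x}}_p\big) = -\frac{1}{P}\sum_{p=1}^{P}\log \sigma_{y_p}\big(\mathbf{W}^{T}\mathring{\mathbf{x}}_p\big) $$

where y_{p,c} denotes element c of the one-hot vector y_p.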

Nonlinear multi-class classification (1/2)
• Generalization to nonlinear decision boundary is similar to logistic regression
• The model in the case where each class has its own separate nonlinear functions:

• The model in the case where all classes share the nonlinear functions:
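As a reconstruction consistent with the nonlinear logistic regression treated earlier in the course (B nonlinear feature functions f_b, parameters collected in Θ), the shared-feature model for class c would read:

$$ \text{model}_c(\mathbf{x}, \Theta) = w_{0,c} + \sum_{b=1}^{B} w_{b,c}\, f_b(\mathbf{x}), \qquad c = 1,\dots,C $$

with class-specific feature functions f_{b,c} replacing the shared f_b in the separate-function case.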

Nonlinear multi-class classification (2/2)
• We can stack all C models to have:

• Multi-class Softmax cost function can be expressed as:

• Fusion classifier:

Mini-batch learning
• With very large data sets, the learning task is often divided into consecutive learning steps by
splitting the data set into independent subsets (mini-batches) and optimizing the model on them
in sequence
– The cost functions decompose naturally into mini-batches, as they are sums over all data points

• One sweep through all the data is called an epoch
– Several epochs/sweeps need to be done before optimization is completed
• This increases learning speed
• Often, 32 is used as the mini-batch size
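A generic mini-batch gradient descent loop as a sketch (minibatch_epochs, grad_fn and the default values are our assumptions, not the lecture's):

import numpy as np

def minibatch_epochs(X, y, W, grad_fn, lr=0.1, batch_size=32, epochs=10):
    # grad_fn(X_batch, y_batch, W) should return the gradient of the cost
    # evaluated on that mini-batch only.
    P = X.shape[0]
    for _ in range(epochs):                        # one epoch = one sweep through all data
        perm = np.random.permutation(P)            # shuffle before splitting into mini-batches
        for start in range(0, P, batch_size):
            idx = perm[start:start + batch_size]
            W = W - lr * grad_fn(X[idx], y[idx], W)
    return W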
Classifier accuracy
• Each data point x’ is classified with the fusion classifier into class y’:

• We compute the confusion matrix in the same way as for two-class applications.
Often, the accuracy A is also computed:
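A minimal sketch of building the confusion matrix and computing the accuracy A from true and predicted labels (helper name and 0-based labels are our assumptions):

import numpy as np

def confusion_and_accuracy(y_true, y_pred, C):
    # cm[t, p] counts points of true class t that were classified into class p.
    cm = np.zeros((C, C), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    A = np.trace(cm) / cm.sum()          # accuracy = fraction of correctly classified points
    return cm, A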
