Machine Learning - Lecture 5
(521289S)
Multi-class classification
• In practice, many classification problems have more than two classes (C > 2) that we wish
to distinguish
• Here we study how to extend to multi-class cases
– One-versus-All classification (OvA)
– Multi-class Perceptrons
OvA principle
• For a C-class problem:
1. Solve C sub-problems, in each of which a decision boundary between one class and the rest of the data is
computed, and
2. Combine the individual decision boundaries with a fusion rule to get the final classifier (a sketch of step 1 follows).
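A minimal numpy sketch of step 1 (illustrative only; train_binary stands for any two-class learner, e.g. a perceptron or logistic-regression trainer, and labels are assumed to be integers 0,…,C−1 — none of these names come from the lecture):

    import numpy as np

    def train_ova(X, y, C, train_binary):
        # X: (P, N) data matrix, y: (P,) integer labels in {0, ..., C-1}
        # train_binary(X, labels) must return a parameter vector for a +1 / -1 problem
        W = []
        for c in range(C):
            labels = np.where(y == c, 1.0, -1.0)   # class c against the rest
            W.append(train_binary(X, labels))
        return np.stack(W, axis=1)                  # one column of parameters per class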
About positive and negative sides of a hyperplane
• A data point is on the positive / negative side of a hyperplane according to the sign of the
dot-product:
– The dot-product is positive if the data point is on the side the normal vector points toward (the positive side)
– The dot-product is negative if the data point is on the opposite side (the negative side)
• By convention, we label data points ‘+1’ and ‘-1’ depending on which side of the hyperplane they are
expected to lie on after model optimization (in the OvA sub-problem for class c, the points of class c are labelled ‘+1’)
• As mentioned earlier, the parameter vector w can be decomposed into two parts: w0
(intercept / bias) and ω = [w1,…,wN]T (the normal vector of the hyperplane)
• We choose ω so that it points toward the side where the ‘+1’ data points should lie after optimization
• Note: the dot product is related to the signed length of the projection of the point/vector x onto
the normal vector ω. If the length of ω equals 1, then the dot-product is the signed
distance of the point from the hyperplane.
[Figure: two data points x1 and x2 on opposite sides of a hyperplane with normal vector ω, labelled ‘-1’ and ‘+1’ according to the side they lie on]
d = xTω = ‖x‖ ‖ω‖ cos(θ), where θ is the angle between x and ω:
d > 0 if θ ∈ (−90°, +90°), and d < 0 otherwise
• When all C classifiers “agree” that classifier c is the winner, we can write
the fusion classifier as shown below:
– Note: the signed distance can be negative or positive depending on which side of the boundary the point lies on!
– Note: the parameter vectors wj must be normalized (unit-length normal vectors) for a fair comparison
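(A sketch with the notation above: w_{0,j} and ω_j are the bias and unit-length normal vector of the sub-classifier for class j, so the quantity being maximized is the signed distance of x from boundary j.)

\hat{y}(x) = \arg\max_{j=1,\dots,C} \big( w_{0,j} + x^{T}\omega_j \big)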
A problem with OvA classifier
• In OvA, the sub-classifiers are trained independently and then fused, which
can cause problems when the data are distributed in complex ways
– A small class in the center of the data has a hard time placing an optimal decision boundary, as all
the other classes surround it
Multi-class Perceptron cost function (1/2)
• The fusion rule is the same as before
• In other words, the signed distance from point x_p to its own class (y_p) decision
boundary should be greater than (or equal to) its distance to every other
two-class decision boundary
– Note: if the wrong class wins for a data point, the optimizer updates the parameters,
translating/rotating all decision boundaries accordingly
• The total cost function is then:
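(A sketch of this cost: writing s_j(x_p) = w_{0,j} + x_p^T ω_j for the score of class j on training point x_p, with y_p the label of x_p, P the number of training points, and W collecting all parameters, the requirement above is s_{y_p}(x_p) ≥ s_j(x_p) for all j, and violations are penalized as)

g(W) = \frac{1}{P}\sum_{p=1}^{P}\Big[\max_{j=1,\dots,C} s_j(x_p) \;-\; s_{y_p}(x_p)\Big]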
Multi-class Perceptron cost function (2/2)
• Equivalently (ReLU form):
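(The same cost written with a rectified max(0, ·) term, using the class scores s_j defined above; each summand is non-negative and equals zero exactly when the correct class has the largest score.)

g(W) = \frac{1}{P}\sum_{p=1}^{P}\max\Big(0,\;\max_{j\neq y_p} s_j(x_p) \;-\; s_{y_p}(x_p)\Big)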
Categorical classification (1/4)
• Class labels can be arbitrary numbers. However, we can also use a multi-
input-multi-output approach, as in regression: the output y_p is then a vector of
dimension C
• Let’s define a one-hot encoded vector for each class:
– Each one-hot encoded categorical label is a length-C vector containing all zeros except
a ‘1’ at the index given by the value of y_p
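A minimal numpy sketch of the encoding (assuming integer labels 0,…,C−1; the helper name one_hot is illustrative, not from the lecture):

    import numpy as np

    def one_hot(y, C):
        # y: integer class labels in {0, ..., C-1}; returns a (len(y), C) matrix
        # with a single 1 per row at the column given by the label
        Y = np.zeros((len(y), C))
        Y[np.arange(len(y)), y] = 1.0
        return Y

    # Example: three labels from a C = 4 class problem
    print(one_hot(np.array([2, 0, 3]), C=4))
    # [[0. 0. 1. 0.]
    #  [1. 0. 0. 0.]
    #  [0. 0. 0. 1.]]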
Categorical classification (2/4)
• Combining logistic regression and multi-output regression, we can set our
goal to find a parameter matrix W such that the transformed output vector approximates the one-hot encoded label
• For example, if the largest entry of the output vector is 0.7 and it sits at the index of class 1, we classify the
input vector x to class 1 and assign it a confidence of 0.7. We can then state: the
input vector belongs to class 1 with probability 0.7.
• Thus, the classifier picks the class with the maximum output value (a sketch follows):
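(A sketch of both the goal and the classifier, where σ denotes the logistic sigmoid applied to each class score s_j(x) = w_{0,j} + x^T ω_j; the exact notation on the original slides may differ.)

Goal, for every training point p and every class j:
\sigma\big(s_j(x_p)\big) \approx \big[\text{one-hot}(y_p)\big]_j

Classifier:
\hat{y}(x) = \arg\max_{j=1,\dots,C} \sigma\big(s_j(x)\big)

Since σ is monotonically increasing, taking the arg max of the transformed outputs gives the same decision as taking the arg max of the raw scores, so this agrees with the fusion rule used earlier.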
• However, we can also use the log-error cost function, as that works better
for binary outputs
• The total cost function will be the standard multi-class Cross Entropy /
Softmax cost function:
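(A sketch with the same score notation s_j(x_p) = w_{0,j} + x_p^T ω_j; the fraction inside the logarithm is the softmax probability assigned to the correct class y_p.)

g(W) = -\frac{1}{P}\sum_{p=1}^{P}\log\frac{e^{\,s_{y_p}(x_p)}}{\sum_{j=1}^{C} e^{\,s_j(x_p)}}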
Nonlinear multi-class classification (1/2)
• Generalization to nonlinear decision boundaries is similar to logistic regression
• The model in the case where each class has its own separate nonlinear feature functions:
• The model in the case where all classes share the same nonlinear feature functions:
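(Sketches of the two variants; f_1, f_2, … denote nonlinear feature functions, B is the number of features, and the superscript (j) marks quantities that belong only to class j. The notation on the original slides may differ.)

Separate nonlinear functions for each class j:
\text{model}_j(x) = w_{0}^{(j)} + f_{1}^{(j)}(x)\, w_{1}^{(j)} + \dots + f_{B}^{(j)}(x)\, w_{B}^{(j)}

Shared nonlinear functions:
\text{model}_j(x) = w_{0}^{(j)} + f_{1}(x)\, w_{1}^{(j)} + \dots + f_{B}(x)\, w_{B}^{(j)}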
Nonlinear multi-class classification (2/2)
• We can stack all C models into a single matrix form:
• Fusion classifier:
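(A sketch with shared features: f(x) = [1, f_1(x), …, f_B(x)]^T is the feature vector with a prepended 1, and W is the (B+1) × C matrix whose j-th column contains the parameters of class j.)

\text{model}(x) = f(x)^{T} W \qquad \text{(a row of C class scores)}

\hat{y}(x) = \arg\max_{j=1,\dots,C} \big(f(x)^{T} W\big)_j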
Mini-batch learning
• With very large data sets, the learning task is often divided into
consecutive learning steps: the data set is split into disjoint sub-
sets (mini-batches), and the model is optimized with them in sequence
– The cost functions naturally decompose over mini-batches, as they are sums over all data points
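A minimal numpy sketch of one such scheme, mini-batch gradient descent (illustrative only; grad_cost stands for the gradient of whichever cost above is used, evaluated on one mini-batch, and the step size and batch size are arbitrary example values):

    import numpy as np

    def minibatch_gd(X, y, W, grad_cost, step=0.1, batch_size=32, epochs=10):
        # X: (P, N) data matrix, y: (P,) labels, W: initial parameters
        P = X.shape[0]
        for _ in range(epochs):
            order = np.random.permutation(P)            # shuffle once per epoch
            for start in range(0, P, batch_size):
                batch = order[start:start + batch_size]
                # update the parameters using the gradient of the cost on this mini-batch only
                W = W - step * grad_cost(W, X[batch], y[batch])
        return W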