0% found this document useful (0 votes)
4 views

SML_Lecture5

The document covers the theory and algorithms of supervised machine learning, focusing on linear classification, including concepts like generalization error, model selection, and specific algorithms such as perceptron and logistic regression. It explains the geometric interpretation of linear classifiers, their properties, and the learning process involved in training these models. Additionally, it discusses the convergence of the perceptron algorithm and the loss function associated with it.

Uploaded by

mohamnaf.b
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
4 views

SML_Lecture5

The document covers the theory and algorithms of supervised machine learning, focusing on linear classification, including concepts like generalization error, model selection, and specific algorithms such as perceptron and logistic regression. It explains the geometric interpretation of linear classifiers, their properties, and the learning process involved in training these models. Additionally, it discusses the convergence of the perceptron algorithm and the loss function associated with it.

Uploaded by

mohamnaf.b
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 45

CS-E4715 Supervised Machine Learning

Lecture 5: Linear classification


Course topics

• Part I: Theory
• Introduction
• Generalization error analysis & PAC learning
• Rademacher Complexity & VC dimension
• Model selection
• Part II: Algorithms and models
• Linear models: perceptron, logistic regession
• Support vector machines
• Kernel methods
• Boosting
• Neural networks (MLPs)
• Part III: Additional topics
• Feature learning, selection and sparsity
• Multi-class classification
• Preference learning, ranking

1
Linear classification
Linear classification

• Input space X ⊂ Rd , each x ∈ X is a d-dimensional real-valued


vector, output space: Y = {−1, +1}
• Training sample S = {(x1 , y1 ), . . . , (xm , ym )} drawn from an
unknown distribution D
• Hypothesis class 
Pd d
H = {x 7→ sgn j=1 wj xj + w0 |w ∈ R , w0 ∈ R} consists of
P 
d
functions h(x) = sgn j=1 wj xj + w0 that map each example in
one of the two classes
(
+1, a ≥ 0
• sgn (a) = is the sign function
−1 a < 0

2
Linear classifiers

Linear classifiers
 
d
X
wj xj + w0  = sgn wT x + w0

h(x) = sgn 
j=1

have several attractive properties

• They are fast to evaluate and takes small space to store (O(d) time
and space)
• Easy to understand: |wj | shows the importance of variable xj and its
sign tells if the effect is positive or negative
• Linear models have relatively low complexity (e.g. VCdim = d + 1)
so they can be reliably estimated from limited data

Good practise is to try a linear model before something more complicated

3
The geometry of the linear classifier

• The points
{x ∈ X |g (x) = wT x + w0 = 0} define
a hyperplane in Rd , where d is the
number of variables in x
• The hyperplane g (x) = wT x + w0 = 0
splits the input space into two
half-spaces. The linear classifier
predicts +1 for points in the halfspace
{x ∈ X |g (x) = wT x + w0 ≥ 0} and
−1 for points in
{x ∈ X |g (x) = wT x + w0 < 0}

4
The geometry of the linear classifier

• w is the normal vector of the


hyperplane g (x) = wT x + w0 = 0
• The distance of the hyperplane from
the origin
qP is |w0 |/ kwk, where
kwk = 2
j wj denotes the
Euclidean norm
• If w0 < 0 the hyperplane lies in the
direction of w from origin, otherwise
it lies in the direction of −w

5
The geometry of the linear classifier

• The value g (x0 ) tells where x0 lies in


relation to the hyperplane:
• g (x0 ) > 0: x0 lies in the halfspace
that is in the direction of w from
the hyperplane
• g (x0 ) = 0: x0 lies on the hyperplane
• g (x0 ) < 0: x0 lies in the direction of
−w from the hyperplane
• The distance of a point x0 from the
hyperplane g (x) = 0 is |g (x0 )|/ kwk

6
Learning linear classifiers
Change of representation

• Consider the parameters (w, w0 ) of the linear function


g (x) = wT x + w0
• For presentation is is convenient to subsume term w0 into the weight
vector " #
w
w⇐
w0
and augment all inputs with a constant 1:
" #
x
x⇐
1

• The models give the same value for x:


" #T " #
w x
= w T x + w0
w0 1

7
Geometric interpretation

• Geometrically, the hyperplane in the


changed representation goes now
through origin
• The positive points have an acute
angle with w: wT x ≥ 0
• The negative points have an obtuse
angle with w: wT x < 0

8
Checking for prediction errors

• When the labels are Y = {−1, +1} for a training example (x, y ) we
have for g (x) = wT x,
(
y if x is correctly classified
sgn (g (x)) =
−y if x is incorrectly classified
• Alternative we can just multiply with the correct label to check for
misclassification:
(
≥ 0 if x is correctly classified
yg (x) =
< 0 if x is incorrectly classified

9
Margin

• The geometric margin of a labeled


example (x, y ) is given by
γ(x) = yg (x)/ kwk
• It takes into account both the
distance |wT x|/ kwk from the
hyperplane, and whether x is on the
correct side of the hyperplane
• The unnormalized version of the
margin is sometimes called the
functional margin γ(x) = yg (x)
• Often the term margin is used for
both variants, assuming the context
makes clear which one is meant

10
Perceptron
Perceptron

• Perceptron algorithm by Frank


Rosenblatt (1956) is perhaps the first
machine learning algorithm
• Its purpose was to learn a linear
function separating two classes
• It was built in hardware and shown to
be capable of performing rudimentary
pattern recognition tasks
• New York Times in 1958: ”the
embryo of an electronic computer that
[the Navy] expects will be able to
walk, talk, see, write, reproduce itself
Mark I perceptron ca. 1958 (Picture: Wikipedia)
and be conscious of its existence.”
(Source: Wikipedia)

11
The perceptron algorithm

• The perceptron algorithm a learns a hyperplane separating two


classes
g (x) = wT x
• It processes incrementally a set of training examples
• At each step, it finds a training example xi that is incorrectly
classified by the current model
• It updates the model by adding the example to the current weight
vector together with the label: w(t+1) ← w(t) + yi xi
• This process is continued until incorrectly predicted training
examples are not found

12
The perceptron algorithm

Input: Training set S = {(xi , yi )}m d


i=1 , x ∈ R , y ∈ {−1, +1}
(1)
Initialize w ← (0, . . . , 0), t ← 1, stop ← FALSE
repeat
T
if exists i, s.t. yi w(t) xi ≤ 0 then
w(t+1) ← w(t) + yi xi
else
stop ← TRUE
end if
t ←t +1
until stop

13
Understanding the update rule

• Let us examine the update rule

w(t+1) ← w(t) + yi xi

• We can see that the margin of the example (xi , yi ) increases after
the update
T
yi g (t+1) (xi ) = yi w(t+1) xi = yi (w(t) + yi xi )T xi
T 2
= yi w(t) xi + yi2 xT
i xi = yi g
(t)
(xi ) + kxi k
≥ yi g (t) (xi )

• Note that this does not guarantee that yi g (t+1) (xi ) > 0 after the
update, further updates may be required to achieve that

14
Perceptron animation

• Assume w(t) has been found by running the algorithm for t steps
• We notice two misclassified examples

15
Perceptron animation

• Select the misclassified example (φ(xi ), −1)


• Note: φ(xi ) is here some transformation of xi e.g. with some basis
functions but it could be identity φ(x) = x

(τ) T φ
+ w >0
+ φ(x i)
(τ) _
w
+

_
+ _
(τ)T
w φ <0

15
Perceptron animation

• Update the weight vector: w(t+1) = w(t) + yi φ(xi )

+
+ φ(x i)
(τ) _
w
+

(τ) _ φ(x
w i)
_

_
+ _

15
Perceptron animation

• The update tilts the hyperplane to make the example ”more


correct”, i.e. more negative
• We repeat the process by finding the next misclassified example
φ(xi+1 ) and update: w(t+2) = w(t+1) + yi+1 φ(xi+1 )

+
+
(τ+1) _ φ(x )
w i+1 _
+
(τ+1)
w

_
+ φ(x i+1) _

15
Perceptron animation

• Next iteration

+
+
(τ+2)
w _
+

_
+ _

15
Perceptron animation

• Next iteration

+
+

_
+

_
+ _

15
Perceptron animation

• Finally we have found a hyperplane that correctly classify the


training points
• We can stop the iteration and output the final weight vector

+
+

_
+

_
+ _

15
Convergence of the perceptron algorithm

• The perceptron algorithm can be shown to eventually converge to a


consistent hyperplane if the two classes are linearly separable, that
is, if there exists a hyperplane that separates the two classes
• Theorem (Novikoff):
• Let S = {(xi , yi )}m
i=1 be a linearly separable training set.
• Let R = maxxi ∈S kxi k.
• Let there exist a vector w∗ that satisfies kw∗ k = 1 and
yi w∗T xi + bopt ≥ γ for i = 1 . . . , m.
• Then the perceptron algorithm will stop after at most t ≤ ( 2R γ
)2
(t) (t)
iterations and output a weight vector w for which yi w xi ≥ 0 for
all i = 1 . . . , m

16
Convergence of the perceptron algorithm

The number of iterations in the bound t ≤ ( 2R 2


γ ) depend on:

• γ: The largest achievable


geometric margin so that all
training examples have at
least that margin
• R: The smallest radius of the γ
d-dimensional ball that
encloses the training data R

• Intuitively: how large the


||w|| = 1
margin in is relative to the w
distances of the training
points
However, Perceptron algorithm does not stop on a non-separable training
set, since there will always be a misclassified example that causes an
update
17
The loss function of the Perceptron algorithm

It can be shown that the


Perceptron algorithm is using the
following loss:

LPerceptron (y , wT x) = max(0, −y wT x)

• y wT x is the margin
• if y wT x < 0, a loss of
−y wT x is incurred, otherwise
no loss is incurred

18
Convexity of Perceptron loss

A function f : Rn 7→ R is convex if for all x, y , and 0 ≤ θ ≤ 1, we have

f (θx + (1 − θ)y ) ≤ θf (x) + (1 − θ)f (y ).

• Geometrical interpretation:
the graph of a convex
function lies below the line
segment from (x, f (x)) to
(y , f (y ))
• It is easy to see that
Perceptron loss is convex but
zero-one loss is not convex

19
Convexity of Perceptron loss

• The convexity of the Perceptron loss has an important consequence:


every local minimum is also the global minimum
• In principle we can minimize it with incremental updates that
gradually decrease the loss
• In contrast, finding a hyperplane that minimizes the zero-one loss is
computationally hard (NP-hard to minimize training error)
• However, we need better algorithms than the Perceptron, which
terminate when we are close to the optimum

20
Logistic regression
Logistic regression

Logistic regression is a classification technique (despite the name)

• it gets its name from the logistic


function
1 exp(z)
φlogistic (z) = =
1 + exp(−z) 1 + exp(z)

that maps a real valued input z onto


the interval 0 < φlogistic (z) < 1
• The function is an example of
sigmoid (”S” shaped) functions

21
Logistic function: a probabilistic interpretation

• The logistic function φlogistic (z) is the inverse of logit function


• The logit function is the logarithm of odds ratio of probability p of
and event happening vs. the probability of the event not happening,
1 − p;
p
z = logit(p) = log = log p − log(1 − p)
1−p
• Thus the logistic function
1
φlogistic (z) = logit −1 (z) =
1 + exp(−z)

answer the question ”what is the probability p that gives the log
odds ratio of z”

22
Logistic regression

• Logistic regression model assumes a underlying conditional


probability:

exp(+ 12 y wT x)
Pr (y |x) =
exp(+ 12 y wT x) + exp(− 12 y wT x)

where the denominator normalizes the right-hand side to be between


zero and one.
• Dividing the numerator and denominator by exp(+ 12 y wT x) reveals
1
the logistic function Pr (y |x) = φlogistic (y wT x) = 1+exp(−y wT x)

• The margin z = y wT x is thus interpreted as the log odds ratio of


(y |x)
label y vs. label −y given input x: y wT x = log PrPr(−y |x)
• Note: these equations assume the labels Y = {−1, +1}. With labels
Y = {0, 1} the equations will be slightly different.

23
Logistic loss

• Consider the maximization of the likelihood of the observed


input-output in the training data:
m m
Y Y 1
w∗ = argmaxw P(yi |xi ) = argmaxw
1 + exp(−y wT x)
i=1 i=1

• Since the logarithm is monotonically increasing function, we can


take the logarithm to obtain an equivalent objective:
m
X m
X
log Pr (yi |xi ) = − log(1 + exp(−yi wT xi ))
i=1 i=1

• The right-hand side is the logistic loss:

Llogistic (y , wT x) = log(1 + exp(−y wT x))

• Minimizing the logistic loss correspond maximizing the likelihood of


the training data
24
Geometric interpretation of Logistic loss

Llogistic (y , wT x) = log(1 + exp(−y wT x))

• Logistic loss is convex and


differentiable
• It is a monotonically decreasing
function of the margin y wT x
• The loss changes fast when the
margin is highly negative =⇒
penalization of examples far in the
incorrect halfspace
• It changes slowly for highly positive
margins =⇒ does not give extra
bonus for being very far in the correct
halfspace
25
Logistic regression optimization problem

• To train a logistic regression model, we need to find the w that


Pm
minimizes the average logistic loss J(w) = m1 i=1 Llogistic (yi , wT xi )
over the training set:
m
1 X
min J(w) = log(1 + exp(−yi wT xi )
m
i=1

w .r .t parameters w ∈ Rd

• The function to be minimized is continuous and differentiable


• However, it is a non-linear function so it is not easy to find the
optimum directly (e.g. unlike in linear regression)
• We will use stochastic gradient descent to incrementally step
towards the direction where the objective decreases fastest, the
negative gradient

26
Gradient

• The gradient is the vector of partial derivatives of the objective


function J(w) with respect to all parameters wj
m m iT
1 X 1 Xh ∂ ∂
∇J(w) = ∇Ji (w) = ∂w1 Ji (w), . . . , ∂wd J i (w)
m m
i=1 i=1

• Compute the gradient by using the regular rules for differentiation.


For the logistic loss we have

∂ ∂ exp(−yi wT xi )
Ji (w) = log(1 + exp(−yi wT xi )) = · (−yi xij )
∂wj ∂wj 1 + exp(−yi wT xi )
1
=− yi xij = −φlogistic (−yi wT xi )yi xij
1 + exp(yi wT xi )

27
Stochastic gradient descent

• We collect the partial derivatives with respect to a single training


example into a vector:
 
−(φlogistic (−yi wT xi )yi ) · xi1
 .. 

 . 

∇Ji (w) =  −(φlogistic (−yi wT xi )yi ) · xij  = −φlogistic (−yi wT xi )yi · xi
 

 .. 

 . 
T
−(φlogistic (−yi w xi )yi ) · xid

• The vector −∇Ji (w) gives the update direction that fastest
decreases the loss on training example (xi , yi )

28
Stochastic gradient descent

• Evaluating the full gradient


m m
1 X 1 X
∇J(w) = ∇Ji (w) = − φlogistic (−yi wT xi )yi · xi
m m
i=1 i=1

is costly since we need to process all training examples


• Stochastic gradient descent instead uses a series of smaller updates
that depend on single randomly drawn training example (xi , yi ) at a
time
• The update direction is taken as −∇Ji (w)
• Its expectation is the full negative gradient:

−Ei=1...,m [ ∇Ji (w) ] = −∇J(w)

• Thus on average, the updates match that of using the full gradient

29
Stochastic gradient descent algorithm

Initialize w = 0; t = 1;
repeat
Draw a training example (x, y ) uniformly at random;
Compute the update direction corresponding to the training example:
∆w = −∇Jt (w);
Determine a stepsize ηt ;
Update w = w − ηt ∇Jt (w);
t = t + 1;
until stopping criterion statisfied
Output w;

30
Stepsize selection

Consider the SGD update: w = w − ηt OJt (w)


• The stepsize parameter ηt , also called the learning rate is a critical
one for convergence to the optimum value
• One uses small constant stepsize, the initial convergence may be
unnecessarily slow
• Too large stepsize may cause the method to continually overshoot
the optimum.

Source: https://dunglai.github.io/2017/12/21/gradient-descent/ 31
Diminishing stepsize

• We can use a diminishing stepsize by starting with an initial larger


stepsize, controlled by hyperparameter η0 > 0
• In each iteration, the stepsize is divided by the iteration counter
t > 0:
η0
ηt =
t
• Caution: In practice, finding a good value for hyperparameter η0
requires experimenting with several values

Source: https://dunglai.github.io/2017/12/21/gradient-descent/ 32
Stopping criterion

When should we stop the algorithm? Some possible choices:

1. Set a maximum number of iterations, after which the algorithm


terminates
• This needs to be separately calibrated for each dataset to avoid
premature termination
2. Gradient of the objective: If we are at a optimum point w∗ of J(w),
the gradient vanishes ∇J(w∗ ) = 0, so we can stop kJ(w)k < γ
where γ is some user-defined parameter
3. It is usually sufficient to train until the zero-one error on training
data does not change anymore
• This usually happens before the logistic loss converges

33
Summary

• Linear classification model are and important class of machine


learning models, they are used as standalone models and appear as
building blocks of more complicated, non-liner models
• Perceptron is a simple algorithm to train linear classifiers on linearly
separable data
• Logistic regression is a classification method that can be interpreted
as maximizing odds ratios of conditional class probabilities
• Stochastic gradient descent is an efficient optimization method for
large data that is nowadays very widely used

34

You might also like