SML_Lecture5
• Part I: Theory
• Introduction
• Generalization error analysis & PAC learning
• Rademacher Complexity & VC dimension
• Model selection
• Part II: Algorithms and models
• Linear models: perceptron, logistic regression
• Support vector machines
• Kernel methods
• Boosting
• Neural networks (MLPs)
• Part III: Additional topics
• Feature learning, selection and sparsity
• Multi-class classification
• Preference learning, ranking
Linear classification
Linear classifiers

h(x) = sgn( ∑_{j=1}^{d} wj xj + w0 ) = sgn(wT x + w0)
• They are fast to evaluate and take little space to store (O(d) time and space)
• Easy to understand: |wj | shows the importance of variable xj and its sign tells whether the effect is positive or negative
• Linear models have relatively low complexity (e.g. VCdim = d + 1)
so they can be reliably estimated from limited data
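As a concrete illustration of the definition above, a minimal sketch of evaluating h(x) = sgn(wT x + w0); the weights here are made up for illustration:

```python
import numpy as np

def predict(w, w0, x):
    """Return +1 or -1 depending on which side of the hyperplane x lies."""
    return 1 if np.dot(w, x) + w0 >= 0 else -1

w = np.array([2.0, -1.0])   # |wj| indicates the importance of feature xj
w0 = 0.5
print(predict(w, w0, np.array([1.0, 1.0])))   # 2 - 1 + 0.5 = 1.5 >= 0 -> +1
print(predict(w, w0, np.array([-1.0, 1.0])))  # -2 - 1 + 0.5 = -2.5 < 0 -> -1
```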
The geometry of the linear classifier
• The points {x ∈ X | g(x) = wT x + w0 = 0} define a hyperplane in Rd , where d is the number of variables in x
• The hyperplane g(x) = wT x + w0 = 0 splits the input space into two half-spaces. The linear classifier predicts +1 for points in the half-space {x ∈ X | g(x) = wT x + w0 ≥ 0} and −1 for points in {x ∈ X | g(x) = wT x + w0 < 0}
Learning linear classifiers

Change of representation

Geometric interpretation
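A common change of representation for linear models (an assumption here, since the slide content is a figure) absorbs the bias w0 into the weight vector by appending a constant 1 to each input, so that g(x) = wT x + w0 becomes a single inner product:

```python
import numpy as np

# Assumed change of representation: x' = (x, 1), w' = (w, w0),
# so g(x) = w^T x + w0 equals w'^T x'.
def augment(x):
    return np.append(x, 1.0)

w, w0 = np.array([2.0, -1.0]), 0.5   # illustrative values
x = np.array([1.0, 1.0])
w_aug = np.append(w, w0)
print(np.isclose(np.dot(w, x) + w0, np.dot(w_aug, augment(x))))  # True
```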
Checking for prediction errors
• When the labels are Y = {−1, +1}, for a training example (x, y) we have, with g(x) = wT x,

  sgn(g(x)) = y if x is correctly classified, and sgn(g(x)) = −y if x is incorrectly classified

• Alternatively, we can just multiply by the correct label to check for misclassification:

  yg(x) ≥ 0 if x is correctly classified, and yg(x) < 0 if x is incorrectly classified
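The misclassification check above can be coded directly from the sign of y·g(x):

```python
import numpy as np

def is_misclassified(w, x, y):
    """True iff y * g(x) < 0, with g(x) = w^T x."""
    return y * np.dot(w, x) < 0

w = np.array([1.0, -1.0])                              # illustrative weights
print(is_misclassified(w, np.array([2.0, 1.0]), +1))   # y*g = 1 >= 0 -> False
print(is_misclassified(w, np.array([2.0, 1.0]), -1))   # y*g = -1 < 0 -> True
```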
Margin
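A hedged sketch of the margin of an example, y·g(x) (the definition used later in the lecture); dividing by ‖w‖ to get the geometric distance to the hyperplane is an addition for illustration, not stated on this slide:

```python
import numpy as np

def functional_margin(w, w0, x, y):
    # y * g(x) with g(x) = w^T x + w0; positive iff correctly classified
    return y * (np.dot(w, x) + w0)

def geometric_margin(w, w0, x, y):
    # normalized margin: signed distance of x to the hyperplane
    return functional_margin(w, w0, x, y) / np.linalg.norm(w)

w, w0 = np.array([3.0, 4.0]), 0.0
print(geometric_margin(w, w0, np.array([3.0, 4.0]), +1))  # 25 / 5 = 5.0
```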
Perceptron
The perceptron algorithm
Understanding the update rule
w(t+1) ← w(t) + yi xi
• We can see that the margin of the example (xi , yi ) increases after
the update
yi g(t+1) (xi ) = yi (w(t+1) )T xi = yi (w(t) + yi xi )T xi
              = yi (w(t) )T xi + yi² xiT xi = yi g(t) (xi ) + ‖xi‖²
              ≥ yi g(t) (xi )
• Note that this does not guarantee that yi g (t+1) (xi ) > 0 after the
update, further updates may be required to achieve that
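The full perceptron loop built from this update rule can be sketched as follows; the toy data, the epoch cap, and counting the boundary case yi g(xi) = 0 as an error (needed to move off the all-zero start) are illustrative choices:

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Sweep over the data, applying w <- w + yi*xi on misclassified points."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * np.dot(w, xi) <= 0:   # misclassified (or on the boundary)
                w = w + yi * xi           # the update rule from the slide
                errors += 1
        if errors == 0:                   # all examples correct: converged
            break
    return w

# Linearly separable toy data (illustrative, not from the lecture)
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = perceptron(X, y)
print(all(yi * np.dot(w, xi) > 0 for xi, yi in zip(X, y)))  # True
```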
Perceptron animation
• Assume w(t) has been found by running the algorithm for t steps
• We notice two misclassified examples
• [Figure sequence: plots of + and − points separated by the hyperplane of w(τ), with regions w(τ)T φ > 0 and w(τ)T φ < 0; at each iteration a misclassified φ(xi) is added to the weight vector, rotating the boundary through w(τ+1), w(τ+2), … until no misclassified examples remain]
Convergence of the perceptron algorithm

LPerceptron (y, wT x) = max(0, −y wT x)

• y wT x is the margin
• if y wT x < 0, a loss of −y wT x is incurred, otherwise no loss is incurred
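The loss can be coded directly from the definition:

```python
def perceptron_loss(margin):
    """Perceptron loss max(0, -margin), where margin = y * w^T x."""
    return max(0.0, -margin)

print(perceptron_loss(2.0))   # correctly classified: 0.0
print(perceptron_loss(-1.5))  # misclassified: 1.5
```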
Convexity of Perceptron loss
• Geometrical interpretation: the graph of a convex function lies below the line segment from (x, f (x)) to (y, f (y))
• It is easy to see that Perceptron loss is convex but zero-one loss is not convex
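The claim can be checked numerically against the chord definition of convexity, f(t·x + (1−t)·y) ≤ t·f(x) + (1−t)·f(y); the grid of test margins is an illustrative choice:

```python
import numpy as np

perceptron_loss = lambda m: max(0.0, -m)
zero_one_loss = lambda m: 0.0 if m >= 0 else 1.0

def is_convex_on(f, points, ts):
    """Check the chord inequality on all pairs of points and mixing weights."""
    return all(f(t * x + (1 - t) * y) <= t * f(x) + (1 - t) * f(y) + 1e-12
               for x in points for y in points for t in ts)

points = np.linspace(-2, 2, 9)
ts = np.linspace(0, 1, 11)
print(is_convex_on(perceptron_loss, points, ts))  # True
print(is_convex_on(zero_one_loss, points, ts))    # False, e.g. x=-2, y=2, t=0.6
```

For the zero-one loss, mixing a misclassified point (loss 1) with a correctly classified one (loss 0) can land on another misclassified point, so the chord inequality fails.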
Logistic regression
Logistic function: a probabilistic interpretation
answer the question ”what is the probability p that gives the log
odds ratio of z”
22
Logistic regression

Pr (y|x) = exp(+(1/2) y wT x) / ( exp(+(1/2) y wT x) + exp(−(1/2) y wT x) )
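This model can be sketched and checked against its sigmoid simplification, Pr(y|x) = 1/(1 + exp(−y wT x)); the weights and input are arbitrary test values:

```python
import numpy as np

def pr(y, w, x):
    """Pr(y|x) = exp(z) / (exp(z) + exp(-z)) with z = (1/2) * y * w^T x."""
    z = 0.5 * y * np.dot(w, x)
    return np.exp(z) / (np.exp(z) + np.exp(-z))

w, x = np.array([1.0, -0.5]), np.array([2.0, 1.0])
p_pos, p_neg = pr(+1, w, x), pr(-1, w, x)
print(np.isclose(p_pos + p_neg, 1.0))                   # the two labels sum to 1
print(np.isclose(p_pos, 1 / (1 + np.exp(-np.dot(w, x)))))  # sigmoid form
```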
Logistic loss
w .r .t parameters w ∈ Rd
26
Gradient

∂/∂wj Ji (w) = ∂/∂wj log(1 + exp(−yi wT xi )) = exp(−yi wT xi ) / (1 + exp(−yi wT xi )) · (−yi xij )
            = − 1 / (1 + exp(yi wT xi )) · yi xij = −φlogistic (−yi wT xi ) yi xij
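The derived gradient can be verified against a finite-difference estimate; the test point is arbitrary:

```python
import numpy as np

def loss(w, x, y):
    """Logistic loss J_i(w) = log(1 + exp(-y * w^T x))."""
    return np.log(1 + np.exp(-y * np.dot(w, x)))

def grad(w, x, y):
    """Gradient from the slide: dJ/dw_j = -y * x_j / (1 + exp(y * w^T x))."""
    return -y * x / (1 + np.exp(y * np.dot(w, x)))

w, x, y = np.array([0.5, -1.0]), np.array([1.0, 2.0]), 1
eps = 1e-6
num = np.array([(loss(w + eps * e, x, y) - loss(w - eps * e, x, y)) / (2 * eps)
                for e in np.eye(2)])      # central differences per coordinate
print(np.allclose(grad(w, x, y), num))    # True
```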
Stochastic gradient descent
• The vector −∇Ji (w) gives the update direction that decreases the loss on training example (xi , yi ) the fastest
Stochastic gradient descent
• Thus, on average, the updates match those obtained using the full gradient
Stochastic gradient descent algorithm

Initialize w = 0; t = 1;
repeat
  Draw a training example (x, y) uniformly at random;
  Compute the update direction corresponding to the training example: ∆w = −∇Jt (w);
  Determine a stepsize ηt ;
  Update w = w + ηt ∆w;
  t = t + 1;
until stopping criterion satisfied
Output w;
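A sketch of this algorithm applied to the logistic loss; the toy data, step count, and the 1/t diminishing stepsize are illustrative choices, not prescribed by the slide:

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd_logistic(X, y, n_steps=2000, eta0=1.0):
    w = np.zeros(X.shape[1])
    for t in range(1, n_steps + 1):
        i = rng.integers(len(y))          # draw an example uniformly at random
        xi, yi = X[i], y[i]
        grad = -yi * xi / (1 + np.exp(yi * np.dot(w, xi)))  # logistic gradient
        eta = eta0 / t                    # diminishing stepsize
        w = w - eta * grad                # step along the negative gradient
    return w

X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = sgd_logistic(X, y)
print(all(yi * np.dot(w, xi) > 0 for xi, yi in zip(X, y)))  # separates the data
```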
Stepsize selection
Source: https://dunglai.github.io/2017/12/21/gradient-descent/
Diminishing stepsize
Source: https://dunglai.github.io/2017/12/21/gradient-descent/
Stopping criterion

Summary