Linear Classifier
by
Dr. Sanjeev Kumar
Associate Professor
Department of Mathematics
IIT Roorkee, Roorkee-247 667, India
[email protected]
Linear models
A strong high-bias assumption is linear separability:
in two dimensions, the classes can be separated by a line
Goals:
Explore a number of linear training algorithms
The model (hyperplane):
$$0 = b + \sum_{j=1}^{m} w_j f_j$$

Indicator function:
$$1[x] = \begin{cases} 1 & \text{if } x = \text{True} \\ 0 & \text{if } x = \text{False} \end{cases}$$
Distance from the hyperplane:
$$\text{distance} = b + \sum_{j=1}^{m} w_j x_j = w \cdot x + b$$

Total number of mistakes on the training data, aka the 0/1 loss:
$$\text{0/1 loss} = \sum_{i=1}^{n} 1[\, y_i (w \cdot x_i + b) \le 0 \,]$$
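To make these two quantities concrete, here is a minimal NumPy sketch (my own illustration, not from the slides): the function name and the toy data are made up, and the labels are assumed to be in {-1, +1} so that a mistake corresponds to y_i(w·x_i + b) ≤ 0.

```python
import numpy as np

def zero_one_loss(w, b, X, y):
    """Distance-like scores w.x + b and the number of mistakes (0/1 loss).
    X: (n, m) feature matrix, y: (n,) labels in {-1, +1} (assumption)."""
    scores = X @ w + b              # b + sum_j w_j x_ij for every example
    mistakes = y * scores <= 0      # 1[ y_i (w.x_i + b) <= 0 ]
    return scores, int(mistakes.sum())

# toy usage with made-up data
X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.0, -1.0]])
y = np.array([+1, -1, -1])
w, b = np.array([0.5, 0.5]), -0.5
print(zero_one_loss(w, b, X, y))    # (scores, number of mistakes)
```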
Model-based machine learning
1. pick a model
$$0 = b + \sum_{j=1}^{m} w_j f_j$$

Find $w$ and $b$ that minimize the 0/1 loss:
$$\operatorname*{argmin}_{w,b} \; \sum_{i=1}^{n} 1[\, y_i (w \cdot x_i + b) \le 0 \,]$$
How do we do this?
How do we minimize a function?
Why is it hard for this function?
Minimizing 0/1 in one dimension
$$\operatorname*{argmin}_{w,b} \; \sum_{i=1}^{n} 1[\, y_i (w \cdot x_i + b) \le 0 \,] \qquad \text{(find } w, b \text{ that minimize the 0/1 loss)}$$

[Plot: the 0/1 loss as a function of a single weight w]
What property/properties do we want from our loss function?
More manageable loss functions
Ideas?
Some function that is a proxy for error, but is continuous and convex.
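As one concrete (illustrative, not from the slides) way to compare candidates, the snippet below evaluates the 0/1 loss against two standard convex surrogates, the hinge loss and the exponential loss that these slides use later, as functions of the margin y(w·x + b).

```python
import numpy as np

# margin z = y * (w.x + b); z > 0 means the example is classified correctly
z = np.linspace(-2.0, 2.0, 9)

zero_one    = (z <= 0).astype(float)   # 1[ y(w.x + b) <= 0 ] -- discontinuous, not convex
hinge       = np.maximum(0.0, 1 - z)   # convex, continuous upper bound on 0/1
exponential = np.exp(-z)               # convex, smooth; the surrogate used later on

for name, vals in [("0/1", zero_one), ("hinge", hinge), ("exp", exponential)]:
    print(f"{name:5s}", np.round(vals, 2))
```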
Surrogate loss functions
You’re blindfolded, but you can see out of the bottom of the
blindfold to the ground right by your feet. I drop you off
somewhere and tell you that you’re in a convex-shaped valley
and escape is at the bottom/minimum. How do you get out?
Finding the minimum
[Plot: a convex loss as a function of w]
One approach: gradient descent
Approach:
pick a starting point (w)
repeat:
pick a dimension
move a small amount in that dimension towards decreasing loss (using the derivative)
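A minimal sketch of this procedure (my own illustration): pick one dimension at a time and step against a finite-difference estimate of the derivative along it. The loss function, step size, and names here are made up for the example.

```python
import numpy as np

def coordinate_descent(loss, w0, eta=0.1, steps=200, eps=1e-6):
    """Repeat: pick a dimension, estimate the derivative along it,
    and move a small amount towards decreasing loss."""
    w = np.array(w0, dtype=float)
    for t in range(steps):
        j = t % len(w)                                   # pick a dimension (cycled)
        e = np.zeros_like(w)
        e[j] = eps
        deriv = (loss(w + e) - loss(w - e)) / (2 * eps)  # derivative in dimension j
        w[j] -= eta * deriv                              # small step downhill
    return w

# made-up convex loss with minimum at (1, -2)
loss = lambda w: (w[0] - 1) ** 2 + (w[1] + 2) ** 2
print(coordinate_descent(loss, w0=[0.0, 0.0]))           # approaches [1, -2]
```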
Gradient descent
For the exponential (surrogate) loss:
$$\frac{d}{dw_j}\,\text{loss} = \frac{d}{dw_j} \sum_{i=1}^{n} \exp(-y_i (w \cdot x_i + b))$$
$$= \sum_{i=1}^{n} \exp(-y_i (w \cdot x_i + b)) \, \frac{d}{dw_j}\big[ -y_i (w \cdot x_i + b) \big]$$
$$= \sum_{i=1}^{n} -y_i x_{ij} \exp(-y_i (w \cdot x_i + b))$$
Gradient descent
$$w_j = w_j + \eta \sum_{i=1}^{n} y_i x_{ij} \exp(-y_i (w \cdot x_i + b))$$
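In code, one batch step of this update might look like the sketch below (an illustration under the slide's assumptions: exponential loss, labels in {-1, +1}). The bias update is not written out on the slide, so treating b like a feature that is always 1 is my assumption.

```python
import numpy as np

def exp_loss_step(w, b, X, y, eta=0.1):
    """One gradient-descent step on sum_i exp(-y_i (w.x_i + b))."""
    c = np.exp(-y * (X @ w + b))        # exp(-y_i (w.x_i + b)) for every example
    w = w + eta * (y * c) @ X           # w_j += eta * sum_i y_i x_ij exp(...)
    b = b + eta * np.sum(y * c)         # bias treated as an always-1 feature (assumption)
    return w, b
```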
Model:
$$\text{prediction} = b + \sum_{j=1}^{m} w_j f_j$$

Objective (exponential surrogate loss), where $y_i$ is the label and $w \cdot x_i + b$ is the prediction:
$$\operatorname*{argmin}_{w,b} \; \sum_{i=1}^{n} \exp(-y_i (w \cdot x_i + b))$$
$$0 = b + \sum_{j=1}^{m} w_j f_j$$
Any preferences?
$$\operatorname*{argmin}_{w,b} \; \sum_{i=1}^{n} \text{loss}(y\, y') + \lambda\, \text{regularizer}(w, b)$$
Common regularizers
Sum of the weights:
$$r(w, b) = \sum_{w_j} |w_j|$$

Sum of the squared weights (2-norm):
$$r(w, b) = \sqrt{\sum_{w_j} w_j^2}$$
p-norm:
$$r(w, b) = \sqrt[p]{\sum_{w_j} |w_j|^p} = \|w\|_p$$
For example, if $w_1 = 0.5$, the value of $w_2$ that gives $\|w\|_p = 1$:

p     w2
1     0.5
1.5   0.75
2     0.87
3     0.95
∞     1
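These numbers are easy to reproduce (my own check; the slide's values are rounded): for w1 = 0.5, solve ||(w1, w2)||_p = 1 for w2.

```python
w1 = 0.5
for p in [1, 1.5, 2, 3]:
    w2 = (1 - w1 ** p) ** (1 / p)      # ||(w1, w2)||_p = 1  =>  w2 = (1 - w1^p)^(1/p)
    print(f"p = {p:<3}  w2 = {w2:.2f}")
print("p = inf  w2 = 1.00")            # infinity-norm: max(|w1|, |w2|) = 1
```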
p-norms visualized
$$\operatorname*{argmin}_{w,b} \; \sum_{i=1}^{n} \text{loss}(y\, y') + \lambda\, \text{regularizer}(w)$$
The regularizer should be convex.
Convexity revisited
Prove:
$$z(t x_1 + (1-t) x_2) \le t\, z(x_1) + (1-t)\, z(x_2) \qquad \forall\; 0 < t < 1$$
for
$$r(w, b) = \sqrt[p]{\sum_{w_j} |w_j|^p} = \|w\|_p$$
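A short proof sketch (my addition, assuming $p \ge 1$ so that $\|\cdot\|_p$ is a norm): convexity follows from the triangle (Minkowski) inequality together with absolute homogeneity.

```latex
% Convexity of z(w) = ||w||_p for p >= 1, for 0 < t < 1:
\begin{align*}
\|t w_1 + (1-t) w_2\|_p
  &\le \|t w_1\|_p + \|(1-t) w_2\|_p   && \text{(Minkowski / triangle inequality)} \\
  &= t\,\|w_1\|_p + (1-t)\,\|w_2\|_p   && \text{(absolute homogeneity, } t,\, 1-t \ge 0\text{)}
\end{align*}
```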
Putting it together with the exponential loss and the 2-norm regularizer:
$$\operatorname*{argmin}_{w,b} \; \sum_{i=1}^{n} \exp(-y_i (w \cdot x_i + b)) + \frac{\lambda}{2} \|w\|^2$$
Some more maths
$$\frac{d}{dw_j}\,\text{objective} = \frac{d}{dw_j}\left[ \sum_{i=1}^{n} \exp(-y_i (w \cdot x_i + b)) + \frac{\lambda}{2} \|w\|^2 \right]$$

… (some math happens: the first term was differentiated above, and $\frac{d}{dw_j}\,\frac{\lambda}{2}\|w\|^2 = \lambda w_j$)

$$= -\sum_{i=1}^{n} y_i x_{ij} \exp(-y_i (w \cdot x_i + b)) + \lambda w_j$$
Gradient descent
$$w_j = w_j + \eta \sum_{i=1}^{n} y_i x_{ij} \exp(-y_i (w \cdot x_i + b)) - \eta \lambda w_j$$
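Put together as a full training loop, the regularized update might look like this (an illustrative sketch: the function name, toy data, step size, and the unregularized bias update are my choices, not from the slides):

```python
import numpy as np

def train_exp_loss_l2(X, y, lam=0.1, eta=0.01, steps=500):
    """Batch gradient descent on sum_i exp(-y_i (w.x_i + b)) + (lam/2)||w||^2,
    with labels y in {-1, +1}."""
    m = X.shape[1]
    w, b = np.zeros(m), 0.0
    for _ in range(steps):
        c = np.exp(-y * (X @ w + b))                # exp(-y_i (w.x_i + b))
        w = w + eta * (y * c) @ X - eta * lam * w   # the regularized update above
        b = b + eta * np.sum(y * c)                 # bias left unregularized (assumption)
    return w, b

# toy usage on made-up separable data
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([+1, +1, -1, -1])
w, b = train_exp_loss_l2(X, y)
print(np.sign(X @ w + b))                           # should match y
```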
The update
$$w_j = w_j + \eta\, y_i x_{ij} \exp(-y_i (w \cdot x_i + b)) - \eta \lambda\, w_j$$
With the 1-norm regularizer, $\lambda \|w\|_1$:
$$\frac{d}{dw_j}\,\text{objective} = \frac{d}{dw_j}\left[ \sum_{i=1}^{n} \exp(-y_i (w \cdot x_i + b)) + \lambda \|w\|_1 \right]$$
$$= -\sum_{i=1}^{n} y_i x_{ij} \exp(-y_i (w \cdot x_i + b)) + \lambda\, \text{sign}(w_j)$$
L1 regularization
$$w_j = w_j + \eta\, y_i x_{ij} \exp(-y_i (w \cdot x_i + b)) - \eta \lambda\, \text{sign}(w_j)$$
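For a single example, the 2-norm and 1-norm versions of the update differ only in the last term; a small sketch (illustrative names; the bias update is handled analogously as an assumption):

```python
import numpy as np

def per_example_update(w, b, x_i, y_i, eta=0.01, lam=0.1, reg="l2"):
    """Per-example update for the exponential loss with L2 or L1 regularization."""
    c = np.exp(-y_i * (np.dot(w, x_i) + b))          # exp(-y_i (w.x_i + b))
    penalty = lam * w if reg == "l2" else lam * np.sign(w)
    w = w + eta * y_i * x_i * c - eta * penalty      # ... - eta*lam*w_j  or  - eta*lam*sign(w_j)
    b = b + eta * y_i * c                            # bias: analogous, unregularized (assumption)
    return w, b
```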
Lp:
$$w_j = w_j + \eta\, \big(\text{loss\_correction} - \lambda\, c\, w_j^{\,p-1}\big)$$