02u Handout: Learning to Answer Yes/No
(Machine Learning Foundations, 機器學習基石)
Roadmap
1 When Can Machines Learn?
hypothesis set H
h(\mathbf{x}) = \mathrm{sign}\!\left(\left(\sum_{i=1}^{d} w_i x_i\right) - \mathrm{threshold}\right)
             = \mathrm{sign}\!\left(\sum_{i=1}^{d} w_i x_i + \underbrace{(-\mathrm{threshold})}_{w_0} \cdot \underbrace{(+1)}_{x_0}\right)
             = \mathrm{sign}\!\left(\sum_{i=0}^{d} w_i x_i\right)
             = \mathrm{sign}\!\left(\mathbf{w}^T \mathbf{x}\right)
Perceptrons in R²
h(\mathbf{x}) = \mathrm{sign}(w_0 + w_1 x_1 + w_2 x_2)
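To make the vector form concrete, here is a minimal NumPy sketch (not part of the handout; the weights and features are made up) that evaluates h(x) = sign(wᵀx) with the threshold absorbed as w_0 and the constant feature x_0 = +1:

```python
import numpy as np

def perceptron_predict(w, x):
    """Evaluate h(x) = sign(w^T x), where x already contains x_0 = +1."""
    # breaking the sign(0) tie toward -1 is just one convention (cf. 'Handling sign(.) = 0' below)
    return 1 if np.dot(w, x) > 0 else -1

# toy example in R^2: w = (w_0, w_1, w_2), x = (1, x_1, x_2)
w = np.array([-0.5, 1.0, 1.0])    # threshold 0.5 absorbed as w_0 = -0.5
x = np.array([1.0, 0.2, 0.7])     # x_0 = +1 prepended to the raw features
print(perceptron_predict(w, x))   # +1, since w^T x = 0.4 > 0
```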
Fun Time
Consider using a perceptron to detect spam messages.
Assume that each email is represented by the frequency of keyword
occurrence, and output +1 indicates a spam. Which keywords below
shall have large positive weights in a good perceptron for the task?
1 coffee, tea, hamburger, steak
2 free, drug, fantastic, deal
3 machine, learning, statistics, textbook
4 national, Taiwan, university, coursera
Reference Answer: 2
The occurrence of keywords with positive weights increases the ‘spam score’, and hence those keywords should be ones that often appear in spam messages.
Select g from H
H = all possible perceptrons, g = ?

For t = 0, 1, . . .
  1 find a mistake of \mathbf{w}_t, called (\mathbf{x}_{n(t)}, y_{n(t)}):
        \mathrm{sign}\!\left(\mathbf{w}_t^T \mathbf{x}_{n(t)}\right) \neq y_{n(t)}
  2 (try to) correct the mistake by
        \mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + y_{n(t)} \mathbf{x}_{n(t)}
. . . until no more mistakes
return last \mathbf{w} (called \mathbf{w}_{\mathrm{PLA}}) as g

[Figure: geometry of the update \mathbf{w} + y\mathbf{x}, rotating \mathbf{w} toward \mathbf{x} when y = +1 and away from \mathbf{x} when y = -1]
That’s it!
—A fault confessed is half redressed. :-)
Handling sign(·) = 0
Perceptron Learning Algorithm
start from some w0 (say, 0), and ‘correct’ its mistakes on D
Cyclic PLA
For t = 0, 1, . . .
  1 find the next mistake of \mathbf{w}_t, called (\mathbf{x}_{n(t)}, y_{n(t)}):
        \mathrm{sign}\!\left(\mathbf{w}_t^T \mathbf{x}_{n(t)}\right) \neq y_{n(t)}
  2 correct the mistake by
        \mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + y_{n(t)} \mathbf{x}_{n(t)}
. . . until a full cycle of not encountering mistakes
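Below is a minimal sketch (my own, not from the handout) of the cyclic PLA loop just described, in NumPy. It assumes X already contains the x_0 = +1 column and y holds ±1 labels; the names cyclic_pla and max_updates are made up for illustration.

```python
import numpy as np

def cyclic_pla(X, y, max_updates=10000):
    """Cyclic PLA: scan D in order, correct the next mistake found,
    and halt after a full cycle with no mistakes (or after max_updates)."""
    N, d = X.shape
    w = np.zeros(d)                        # start from w_0 = 0
    updates = 0
    while updates < max_updates:
        mistake_found = False
        for n in range(N):                 # one cycle through D
            if np.sign(w @ X[n]) != y[n]:  # sign(w_t^T x_n) != y_n; sign(0) counts as a mistake here
                w = w + y[n] * X[n]        # w_{t+1} <- w_t + y_n x_n
                updates += 1
                mistake_found = True
        if not mistake_found:              # a full cycle without mistakes: done
            break
    return w, updates

# toy usage on a linearly separable set (made-up data)
rng = np.random.default_rng(0)
X_raw = rng.uniform(-1, 1, size=(20, 2))
y = np.where(X_raw[:, 0] + X_raw[:, 1] > 0, 1, -1)   # separable by x_1 + x_2 = 0
X = np.hstack([np.ones((20, 1)), X_raw])              # prepend x_0 = +1
w_pla, T = cyclic_pla(X, y)
```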
Seeing is Believing

[Figure sequence: PLA on a 2D dataset, from the initial \mathbf{w} through updates 1-9 (on examples such as \mathbf{x}_1, \mathbf{x}_3, \mathbf{x}_9, \mathbf{x}_{14}), each panel showing \mathbf{w}_t and the corrected \mathbf{w}_{t+1}, and finally the returned \mathbf{w}_{\mathrm{PLA}}]
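The figure panels are not recoverable from the extracted text; the following matplotlib sketch (entirely illustrative, with made-up data and weights, not the handout's actual dataset) shows how one such snapshot of an intermediate w_t could be drawn.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(20, 2))                     # toy 2D inputs
y = np.where(X[:, 0] - 0.3 * X[:, 1] + 0.1 > 0, 1, -1)   # toy labels (made up)
w = np.array([0.1, 1.0, -0.3])                           # some intermediate w_t = (w_0, w_1, w_2)

plt.scatter(X[y > 0, 0], X[y > 0, 1], marker='o', label='y = +1')
plt.scatter(X[y < 0, 0], X[y < 0, 1], marker='x', label='y = -1')
x1 = np.linspace(-1, 1, 2)
plt.plot(x1, -(w[0] + w[1] * x1) / w[2], 'k-')           # boundary w_0 + w_1 x_1 + w_2 x_2 = 0
plt.legend()
plt.title('one PLA snapshot (sketch)')
plt.show()
```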
Fun Time
Let’s try to think about why PLA may work.
Let n = n(t), according to the rule of PLA below, which formula is true?
\mathrm{sign}\!\left(\mathbf{w}_t^T \mathbf{x}_n\right) \neq y_n, \qquad \mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + y_n \mathbf{x}_n
1 \mathbf{w}_{t+1}^T \mathbf{x}_n = y_n
2 \mathrm{sign}(\mathbf{w}_{t+1}^T \mathbf{x}_n) = y_n
3 y_n \mathbf{w}_{t+1}^T \mathbf{x}_n \geq y_n \mathbf{w}_t^T \mathbf{x}_n
4 y_n \mathbf{w}_{t+1}^T \mathbf{x}_n < y_n \mathbf{w}_t^T \mathbf{x}_n
Reference Answer: 3

Take the inner product of the update rule with y_n \mathbf{x}_n:
y_n \mathbf{w}_{t+1}^T \mathbf{x}_n = y_n \mathbf{w}_t^T \mathbf{x}_n + y_n^2\, \mathbf{x}_n^T \mathbf{x}_n = y_n \mathbf{w}_t^T \mathbf{x}_n + \|\mathbf{x}_n\|^2 \geq y_n \mathbf{w}_t^T \mathbf{x}_n.
The result shows that the rule somewhat ‘tries to correct the mistake.’
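A quick numeric check of answer 3 with made-up values (a sketch, not part of the handout): take an example that w_t misclassifies and verify that the update increases y_n wᵀx_n.

```python
import numpy as np

w_t = np.array([0.0, -1.0, 0.5])   # made-up current weights
x_n = np.array([1.0, 0.8, 0.2])    # made-up example (x_0 = +1 included)
y_n = +1                           # its label; sign(w_t^T x_n) = -1 != y_n, so this is a mistake

w_next = w_t + y_n * x_n           # the PLA update
print(y_n * w_t @ x_n)             # about -0.7
print(y_n * w_next @ x_n)          # about 0.98 = -0.7 + ||x_n||^2, larger than before (answer 3)
```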
Linear Separability
• if PLA halts (i.e. no more mistakes),
(necessary condition) D allows some w to make no mistake
• call such D linearly separable
If D is linearly separable and \mathbf{w}_f denotes target weights that make no mistake on D, then after T corrections \mathbf{w}_T gets more aligned with \mathbf{w}_f:

\frac{\mathbf{w}_f^T \mathbf{w}_T}{\|\mathbf{w}_f\|\,\|\mathbf{w}_T\|} \geq \sqrt{T} \cdot \mathrm{constant}
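The handout states this inequality without derivation; the following is a sketch of the standard argument, with notation I introduce here: \mathbf{w}_0 = \mathbf{0}, R = \max_n \|\mathbf{x}_n\|, and \rho = \min_n y_n \mathbf{w}_f^T \mathbf{x}_n / \|\mathbf{w}_f\| (the same \rho referred to in the Cons below).

```latex
% each correction uses a mistaken example, so y_{n(t)} \mathbf{w}_t^T \mathbf{x}_{n(t)} \le 0
\begin{aligned}
\mathbf{w}_f^T \mathbf{w}_{t+1}
  &= \mathbf{w}_f^T \mathbf{w}_t + y_{n(t)}\,\mathbf{w}_f^T \mathbf{x}_{n(t)}
   \;\ge\; \mathbf{w}_f^T \mathbf{w}_t + \rho\,\|\mathbf{w}_f\|,\\[2pt]
\|\mathbf{w}_{t+1}\|^2
  &= \|\mathbf{w}_t\|^2 + 2\,y_{n(t)}\,\mathbf{w}_t^T \mathbf{x}_{n(t)} + \|\mathbf{x}_{n(t)}\|^2
   \;\le\; \|\mathbf{w}_t\|^2 + R^2.
\end{aligned}
```

After T corrections these give \mathbf{w}_f^T \mathbf{w}_T \ge T\rho\|\mathbf{w}_f\| and \|\mathbf{w}_T\| \le \sqrt{T}\,R, so the normalized inner product above is at least \sqrt{T}\cdot(\rho/R); the ‘constant’ is \rho/R, which also yields the 1/\mathrm{constant}^2 bound in the answer that follows.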
Reference Answer: 2

The maximum value of \frac{\mathbf{w}_f^T \mathbf{w}_t}{\|\mathbf{w}_f\|\,\|\mathbf{w}_t\|} is 1. Since T mistake corrections increase the inner product by \sqrt{T} \cdot \mathrm{constant}, the maximum number of corrected mistakes is 1/\mathrm{constant}^2.
Guarantee of PLA
Pros
simple to implement, fast, works in any dimension d
Cons
• ‘assumes’ linearly separable D to halt
—property unknown in advance (no need for PLA if we know wf )
• not fully sure how long halting takes (ρ depends on wf )
—though practically fast
Summary
1 When Can Machines Learn?