Machine Learning Foundations
(機器學習基石)
Lecture 2: Learning to Answer Yes/No
Hsuan-Tien Lin (林軒田)
[email protected] Department of Computer Science
& Information Engineering
National Taiwan University
(國立台灣大學資訊工程系)
Roadmap
1 When Can Machines Learn?
Lecture 1: The Learning Problem
A takes D and H to get g
Lecture 2: Learning to Answer Yes/No
Perceptron Hypothesis Set
Perceptron Learning Algorithm (PLA)
Guarantee of PLA
2 Why Can Machines Learn?
3 How Can Machines Learn?
4 How Can Machines Learn Better?
Credit Approval Problem Revisited
Applicant Information
• age: 23 years
• gender: female
• annual salary: NTD 1,000,000
• year in residence: 1 year
• year in job: 0.5 year
• current debt: 200,000

Learning setup (from Lecture 1)
• unknown target function f: X → Y (ideal credit approval formula)
• training examples D: (x_1, y_1), ..., (x_N, y_N) (historical records in bank)
• learning algorithm A
• final hypothesis g ≈ f (‘learned’ formula to be used)
• hypothesis set H (set of candidate formulas)
what hypothesis set can we use?
A Simple Hypothesis Set: the ‘Perceptron’
• age: 23 years
• annual salary: NTD 1,000,000
• year in job: 0.5 year
• current debt: 200,000
• For x = (x_1, x_2, ..., x_d) ‘features of customer’, compute a weighted ‘score’, and
    approve credit if  ∑_{i=1}^{d} w_i x_i > threshold
    deny credit if     ∑_{i=1}^{d} w_i x_i < threshold
• Y: {+1 (good), −1 (bad)}, 0 ignored — the linear formulas h ∈ H of the form
    h(x) = sign( (∑_{i=1}^{d} w_i x_i) − threshold )
  are historically called the ‘perceptron’ hypothesis
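A minimal Python sketch of this hypothesis (the feature values, weights, and threshold below are made-up numbers for illustration, not from the lecture):

```python
import numpy as np

def perceptron_h(x, w, threshold):
    """Perceptron hypothesis: weighted score of the features vs. a threshold."""
    score = np.dot(w, x) - threshold
    return 1 if score > 0 else -1        # +1: approve credit, -1: deny credit

# made-up applicant and weights, for illustration only
x = np.array([23.0, 1_000_000.0, 0.5, 200_000.0])   # age, salary, year in job, debt
w = np.array([0.02, 1e-6, 0.3, -2e-6])
print(perceptron_h(x, w, threshold=1.0))             # -> +1 or -1
```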
Vector Form of Perceptron Hypothesis
h(x) = sign( (∑_{i=1}^{d} w_i x_i) − threshold )
     = sign( (∑_{i=1}^{d} w_i x_i) + (−threshold)·(+1) )      [let w_0 = −threshold and x_0 = +1]
     = sign( ∑_{i=0}^{d} w_i x_i )
     = sign( w^T x )
• each ‘taller’ w represents a hypothesis h & is multiplied with
‘taller’ x —will use taller versions to simplify notation
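A small sketch of the same vector-form computation, where prepending the constant x_0 = 1 absorbs the threshold into w_0 (the numbers are made up):

```python
import numpy as np

def h(w, x):
    """Vector-form perceptron: sign(w^T x) with the 'taller' vectors."""
    return np.sign(np.dot(w, x))

x_raw = np.array([23.0, 100.0, 0.5, 20.0])    # made-up features (x_1, ..., x_d)
x = np.concatenate(([1.0], x_raw))             # 'taller' x with x_0 = 1
w = np.array([-5.0, 0.1, 0.02, 3.0, -0.2])     # 'taller' w with w_0 = -threshold
print(h(w, x))                                 # same as sign(sum_i w_i x_i - threshold)
```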
what do perceptrons h ‘look like’?
Some Notation Conventions
fonts
• normal x: just a scalar
• bold x: a vector (x0 , x1 , . . . , xd )
bold w: a vector (w0 , w1 , . . . , wd )
• normal xi : the i-th component in x
normal wi : the i-th component in w
• bold xn : the n-th vector (in the data)
bold wt : the t-th vector (we will see)
• normal xn,i (rarely used): the i-th component in xn
normal wt,i (rarely used): the i-th component in wt
• calligraphic for sets: input X , output Y, data D, hypothesis H (exception: algorithm A is calligraphic but not a set)
two important numbers
• N examples (xn , yn ), indexed by n = 1, 2, . . . , N
• d features, indexed by i = 0, 1, 2, . . . , d
important to follow the notations
from the very beginning
Perceptrons in R^2
h(x) = sign(w_0 + w_1 x_1 + w_2 x_2)
• customer features x: points on the plane (or points in R^d)
• labels y: ◦ (+1), × (−1)
• hypothesis h: visually a line w_0 + w_1 x_1 + w_2 x_2 = 0
  (or a hyperplane in R^d)
  —positive on one side of the line, negative on the other side
• different lines classify customers differently
perceptrons ⇔ linear (binary) classifiers
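A tiny illustration of this geometric view, with a made-up line and a few made-up points in R^2:

```python
import numpy as np

# a made-up line in R^2: w_0 + w_1 x_1 + w_2 x_2 = 0
w = np.array([-1.0, 2.0, 3.0])                               # (w_0, w_1, w_2)
points = np.array([[1.0, 1.0], [0.0, 0.0], [-1.0, -1.0]])    # (x_1, x_2) pairs
X = np.column_stack([np.ones(len(points)), points])          # prepend x_0 = 1
print(np.sign(X @ w))                                        # [+1. -1. -1.]: the line splits the plane
```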
Fun Time
Consider using a perceptron to detect spam messages.
Assume that each email is represented by the frequency of keyword
occurrence, and output +1 indicates a spam. Which keywords below
shall have large positive weights in a good perceptron for the task?
1 coffee, tea, hamburger, steak
2 free, drug, fantastic, deal
3 machine, learning, statistics, textbook
4 national, Taiwan, university, coursera
Reference Answer: 2
The occurrence of keywords with large positive
weights increases the ‘spam score’, so those
keywords should be ones that often appear in spam messages.
Select g from H
H = all possible perceptrons, g =?
• want: g ≈ f (hard when f unknown)
• almost necessary: g ≈ f on D, ideally
  g(x_n) = f(x_n) = y_n
• difficult: H is of infinite size
• idea: start from some g0 , and ‘correct’ its
mistakes on D
will represent g0 by its weight vector w0
Perceptron Learning Algorithm
start from some w_0 (say, 0), and ‘correct’ its mistakes on D

For t = 0, 1, . . .
  1 find a mistake of w_t, called (x_{n(t)}, y_{n(t)}):
        sign(w_t^T x_{n(t)}) ≠ y_{n(t)}
  2 (try to) correct the mistake by
        w_{t+1} ← w_t + y_{n(t)} x_{n(t)}
. . . until no more mistakes
return the last w (called w_PLA) as g

(figure: the update w + y x rotates w toward x when y = +1,
and away from x when y = −1)
That’s it!
—A fault confessed is half redressed. :-)
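A minimal PLA sketch in Python, under the convention that sign(0) also counts as a mistake; the max_updates safeguard and the function name are my own additions, not part of the lecture:

```python
import numpy as np

def pla(X, y, max_updates=100_000):
    """Perceptron Learning Algorithm on (X, y), where X already has x_0 = 1
    prepended and y is in {+1, -1}.  Assumes D is linearly separable;
    otherwise the loop may hit the max_updates safeguard."""
    w = np.zeros(X.shape[1])                        # start from w_0 = 0
    for _ in range(max_updates):
        mistakes = [n for n in range(len(y)) if np.sign(X[n] @ w) != y[n]]
        if not mistakes:                            # no more mistakes: halt
            return w
        n = mistakes[0]                             # any mistake (x_n, y_n) will do
        w = w + y[n] * X[n]                         # w_{t+1} <- w_t + y_n x_n
    return w                                        # returned w is w_PLA, i.e. g
```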
Handling sign(·) = 0
Perceptron Learning Algorithm
start from some w0 (say, 0), and ‘correct’ its mistakes on D
When w_0 = 0, technically sign(w_0^T x_{n(0)}) = 0; shall we update?
• convention -1: sign(0) = −1 (update if yn(0) = +1)
• convention +1: sign(0) = +1 (update if yn(0) = −1)
• convention 0: sign(0) = 0 (always update)
• convention r: sign(0) = random flip (50% chance of update)
—usually does not matter much, as long as w1 often becomes
non-zero
w_t^T x_{n(t)} = 0 rarely happens in practice
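A one-function sketch showing how such a convention can be made explicit in code (the function and parameter names, and the default of −1, are mine, just for illustration):

```python
def sign_pm1(score, zero_as=-1):
    """sign(.) that returns only +1/-1, with an explicit convention
    for the rare case score == 0 (here: 'convention -1' by default)."""
    if score > 0:
        return 1
    if score < 0:
        return -1
    return zero_as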
How to Update w_0?
Perceptron Learning Algorithm
For t = 0, 1, . . .
  1 find a mistake of w_t, called (x_{n(t)}, y_{n(t)}): sign(w_t^T x_{n(t)}) ≠ y_{n(t)}
  2 (try to) correct the mistake by w_{t+1} ← w_t + y_{n(t)} x_{n(t)}, i.e., component-wise

      [ w_{t+1,0} ]   [ w_{t,0} ]             [ x_0 (= what?) ]
      [ w_{t+1,1} ]   [ w_{t,1} ]             [ x_{n(t),1}    ]
      [    ...    ] = [   ...   ] + y_{n(t)}  [     ...       ]
      [ w_{t+1,d} ]   [ w_{t,d} ]             [ x_{n(t),d}    ]
. . . until no more mistakes
return last w (called wpla ) as g
each update changes w_{t,0} by y_{n(t)}, because x_0 = 1
Practical Implementation of PLA
start from some w0 (say, 0), and ‘correct’ its mistakes on D
Cyclic PLA
For t = 0, 1, . . .
  1 find the next mistake of w_t, called (x_{n(t)}, y_{n(t)}):
        sign(w_t^T x_{n(t)}) ≠ y_{n(t)}
  2 correct the mistake by
        w_{t+1} ← w_t + y_{n(t)} x_{n(t)}
. . . until a full cycle of not encountering mistakes
next can follow naïve cycle (1, · · · , N)
or precomputed random cycle
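A sketch of Cyclic PLA under the same assumptions as before (linearly separable D, x_0 = 1 prepended, sign(0) treated as a mistake); shuffle=True gives the precomputed random cycle:

```python
import numpy as np

def cyclic_pla(X, y, shuffle=False, seed=None):
    """Cyclic PLA: scan the examples in a fixed cycle, correcting mistakes
    as they are met, and halt after one full cycle without any mistake."""
    w = np.zeros(X.shape[1])
    order = np.arange(len(y))                        # naive cycle (1, ..., N)
    if shuffle:                                      # precomputed random cycle
        order = np.random.default_rng(seed).permutation(len(y))
    while True:
        mistake_found = False
        for n in order:
            if np.sign(X[n] @ w) != y[n]:            # the next mistake
                w = w + y[n] * X[n]                  # correct it
                mistake_found = True
        if not mistake_found:                        # a full mistake-free cycle
            return w
```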
Seeing is Believing

(figure: a run of PLA on a 2-D toy dataset — starting from the initial w,
updates 1–9 each pick a misclassified point such as x1, x9, x14, or x3 and
rotate w(t) to w(t+1); the final w_PLA separates all of the examples)

worked like a charm with < 20 lines!!
(note: made x_i ≫ x_0 = 1 for visual purposes)
Fun Time
Let’s try to think about why PLA may work.
Let n = n(t), according to the rule of PLA below, which formula is true?
sign(w_t^T x_n) ≠ y_n,   w_{t+1} ← w_t + y_n x_n
1 w_{t+1}^T x_n = y_n
2 sign(w_{t+1}^T x_n) = y_n
3 y_n w_{t+1}^T x_n ≥ y_n w_t^T x_n
4 y_n w_{t+1}^T x_n < y_n w_t^T x_n
Reference Answer: 3
Simply take the inner product of both sides of the update rule with y_n x_n:
y_n w_{t+1}^T x_n = y_n w_t^T x_n + y_n² ∥x_n∥² ≥ y_n w_t^T x_n.
The result shows that the rule somewhat ‘tries to correct the mistake.’
Linear Separability
• if PLA halts (i.e. no more mistakes),
(necessary condition) D allows some w to make no mistake
• call such D linear separable
(figure: three toy datasets — one linear separable, two not linear separable)
assume linear separable D,
does PLA always halt?
PLA Fact: wt Gets More Aligned with wf
linear separable D ⇔ there exists a perfect w_f such that y_n = sign(w_f^T x_n)
• w_f perfect, hence every x_n is correctly (and strictly) away from the line:
      y_{n(t)} w_f^T x_{n(t)} ≥ min_n y_n w_f^T x_n > 0
• w_f^T w_t ↑ by updating with any (x_{n(t)}, y_{n(t)}):
      w_f^T w_{t+1} = w_f^T (w_t + y_{n(t)} x_{n(t)})
                    = w_f^T w_t + y_{n(t)} w_f^T x_{n(t)}
                    ≥ w_f^T w_t + min_n y_n w_f^T x_n
                    > w_f^T w_t + 0
wt appears more aligned with wf after update
(really?)
PLA Fact: wt Does Not Grow Too Fast
w_t changed only when there is a mistake
  ⇔ sign(w_t^T x_{n(t)}) ≠ y_{n(t)}  ⇔  y_{n(t)} w_t^T x_{n(t)} ≤ 0
• a mistake ‘limits’ the growth of ∥w_t∥², even when updating with the ‘longest’ x_n:
      ∥w_{t+1}∥² = ∥w_t + y_{n(t)} x_{n(t)}∥²
                 = ∥w_t∥² + 2 y_{n(t)} w_t^T x_{n(t)} + ∥y_{n(t)} x_{n(t)}∥²
                 ≤ ∥w_t∥² + 0 + ∥y_{n(t)} x_{n(t)}∥²
                 ≤ ∥w_t∥² + max_n ∥y_n x_n∥²

start from w_0 = 0; after T mistake corrections,
      w_f^T w_T / (∥w_f∥ ∥w_T∥) ≥ √T · constant
Fun Time
Let’s upper-bound T , the number of mistakes that PLA ‘corrects’.
Define R² = max_n ∥x_n∥²,   ρ = min_n y_n (w_f^T x_n) / ∥w_f∥.
We want to show that T ≤ □. Express the upper bound □ by the two terms above.
1 R/ρ
2 R²/ρ²
3 R/ρ²
4 ρ²/R²
Reference Answer: 2
The maximum value of w_f^T w_T / (∥w_f∥ ∥w_T∥) is 1 (it is a cosine). Since T
mistake corrections increase this normalized inner product to at least
√T · constant, the maximum number of corrected mistakes is 1/constant².
Here the constant is ρ/R, so T ≤ R²/ρ².
PLA Mistake Bound
inner product grows fast:
      w_f^T w_{t+1} ≥ w_f^T w_t + min_n y_n w_f^T x_n = w_f^T w_t + ρ·∥w_f∥
length² grows slowly:
      ∥w_{t+1}∥² ≤ ∥w_t∥² + max_n ∥x_n∥² = ∥w_t∥² + R²

Magic chain! Apply each inequality for t = 0, 1, . . . , T−1:
      w_f^T w_1 ≥ w_f^T w_0 + ρ·∥w_f∥          ∥w_1∥² ≤ ∥w_0∥² + R²
      w_f^T w_2 ≥ w_f^T w_1 + ρ·∥w_f∥          ∥w_2∥² ≤ ∥w_1∥² + R²
      ···                                      ···
      w_f^T w_T ≥ w_f^T w_{T−1} + ρ·∥w_f∥      ∥w_T∥² ≤ ∥w_{T−1}∥² + R²

start from w_0 = 0; after T mistake corrections,
      1 ≥ w_f^T w_T / (∥w_f∥ ∥w_T∥) ≥ (T·ρ·∥w_f∥) / (∥w_f∥ · √T·R) = √T · (ρ/R)
      ⟹  T ≤ (R/ρ)²
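A quick numerical sanity check of this bound on a synthetic, linearly separable dataset (all names, sizes, and numbers below are my own illustration, not from the lecture):

```python
import numpy as np

def pla_count_updates(X, y):
    """Run PLA (sign(0) counts as a mistake) and return (w, number of updates T)."""
    w, T = np.zeros(X.shape[1]), 0
    while True:
        mistakes = [n for n in range(len(y)) if np.sign(X[n] @ w) != y[n]]
        if not mistakes:
            return w, T
        w = w + y[mistakes[0]] * X[mistakes[0]]
        T += 1

rng = np.random.default_rng(0)
w_f = np.array([-0.5, 1.0, 1.0])                       # a made-up 'perfect' separator
X = np.column_stack([np.ones(300), rng.uniform(-1, 1, (300, 2))])
scores = X @ w_f
keep = np.abs(scores) > 0.1                            # keep a margin so rho > 0
X, y = X[keep], np.sign(scores[keep])

R2 = max(float(x @ x) for x in X)                      # R^2 = max_n ||x_n||^2
rho = min(y[n] * (w_f @ X[n]) for n in range(len(y))) / np.linalg.norm(w_f)
_, T = pla_count_updates(X, y)
print(f"T = {T}, bound R^2/rho^2 = {R2 / rho**2:.1f}")  # T should not exceed the bound
```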
More about PLA
Guarantee
as long as D is linear separable and PLA corrects by mistake
• inner product of wf and wt grows fast; length of wt grows slowly
• PLA ‘lines’ are more and more aligned with wf ⇒ halts
Pros
simple to implement, fast, works in any dimension d
Cons
• ‘assumes’ linear separable D to halt
—property unknown in advance (no need for PLA if we know wf )
• not fully sure how long halting takes (ρ depends on wf )
—though practically fast
what if D not linear separable?
Learning with Noisy Data
• unknown target function f: X → Y, plus noise (ideal credit approval formula)
• training examples D: (x_1, y_1), ..., (x_N, y_N) (historical records in bank)
• learning algorithm A
• final hypothesis g ≈ f (‘learned’ formula to be used)
• hypothesis set H (set of candidate formulas)
how to at least get g ≈ f on noisy D?
Line with Noise Tolerance
• assume ‘little’ noise: y_n = f(x_n) usually
• if so, g ≈ f on D ⇔ y_n = g(x_n) usually
• how about
      w_g ← argmin_w  ∑_{n=1}^{N} ⟦ y_n ≠ sign(w^T x_n) ⟧
  —NP-hard to solve, unfortunately
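Evaluating this objective for a fixed w is cheap even though minimizing it over all w is NP-hard; a minimal sketch of the mistake count (the function name is mine):

```python
import numpy as np

def num_mistakes(w, X, y):
    """In-sample mistake count: sum_n [[ y_n != sign(w^T x_n) ]]."""
    return int(np.sum(np.sign(X @ w) != y))
```

Being able to evaluate this cheaply is what makes it possible to compare candidate weight vectors even when the data are noisy.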
will discuss other solutions for an
‘approximately good’ g later?
Summary
1 When Can Machines Learn?
Lecture 1: The Learning Problem
Lecture 2: Learning to Answer Yes/No
Perceptron Hypothesis Set
hyperplanes/linear classifiers in Rd
Perceptron Learning Algorithm (PLA)
correct mistakes and improve iteratively
Guarantee of PLA
no mistake eventually if linear separable
• next: the zoo of learning problems
2 Why Can Machines Learn?
3 How Can Machines Learn?
4 How Can Machines Learn Better?