Machine Learning Foundations (機器學習基石) : Lecture 2: Learning to Answer Yes/No

This document summarizes Lecture 2 of a machine learning foundations course, on learning to answer yes/no questions with perceptrons. It introduces the perceptron hypothesis set, which represents hypotheses as linear classifiers that assign labels (+1 or −1) based on the sign of a weighted linear combination of features, and the Perceptron Learning Algorithm (PLA), which starts from an initial weight vector and iteratively updates it to correct mistakes on the training data, halting with a hypothesis that classifies all examples correctly when the data are linearly separable. The lecture also sketches why PLA is guaranteed to halt on linearly separable data and presents the pocket algorithm for data that are not.


Machine Learning Foundations

(機器學習基石)

Lecture 2: Learning to Answer Yes/No


Hsuan-Tien Lin (林軒田)
[email protected]

Department of Computer Science & Information Engineering
National Taiwan University
(國立台灣大學資訊工程系)



Learning to Answer Yes/No

Roadmap
1 When Can Machines Learn?

Lecture 1: The Learning Problem


A takes D and H to get g

Lecture 2: Learning to Answer Yes/No


Perceptron Hypothesis Set
Perceptron Learning Algorithm (PLA)
Guarantee of PLA
Non-Separable Data

2 Why Can Machines Learn?


3 How Can Machines Learn?
4 How Can Machines Learn Better?



Learning to Answer Yes/No Perceptron Hypothesis Set

Credit Approval Problem Revisited


Applicant Information
• age: 23 years
• gender: female
• annual salary: NTD 1,000,000
• year in residence: 1 year
• year in job: 0.5 year
• current debt: 200,000

Learning flow from Lecture 1:
• unknown target function f: X → Y (ideal credit approval formula)
• training examples D: (x1, y1), · · · , (xN, yN) (historical records in bank)
• learning algorithm A, using hypothesis set H (set of candidate formula), produces final hypothesis g ≈ f (‘learned’ formula to be used)

what hypothesis set can we use?


Learning to Answer Yes/No Perceptron Hypothesis Set

A Simple Hypothesis Set: the ‘Perceptron’


age 23 years
annual salary NTD 1,000,000
year in job 0.5 year
current debt 200,000

• For x = (x1, x2, · · · , xd) ‘features of customer’, compute a weighted ‘score’ and
  approve credit if Σ_{i=1}^{d} wi xi > threshold
  deny credit if Σ_{i=1}^{d} wi xi < threshold
• Y: {+1 (good), −1 (bad)}, 0 ignored; the linear formulas h ∈ H,
  h(x) = sign( Σ_{i=1}^{d} wi xi − threshold ),
  are called ‘perceptron’ hypotheses historically


Learning to Answer Yes/No Perceptron Hypothesis Set

Vector Form of Perceptron Hypothesis

h(x) = sign( Σ_{i=1}^{d} wi xi − threshold )
     = sign( Σ_{i=1}^{d} wi xi + (−threshold) · (+1) )    (call −threshold = w0 and +1 = x0)
     = sign( Σ_{i=0}^{d} wi xi )
     = sign( w^T x )

• each ‘tall’ w represents a hypothesis h and is multiplied with ‘tall’ x; will use tall versions to simplify notation
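As a small illustration (not from the slides), the vector form maps directly to code; here is a minimal sketch assuming NumPy arrays and a weight vector that already holds w0 = −threshold in front:

```python
import numpy as np

def perceptron_h(w, x):
    """Perceptron hypothesis h(x) = sign(w^T x), with x0 = 1 already prepended to x.
    Illustrative sketch only; ties (w^T x = 0) are broken toward +1 here."""
    return 1 if np.dot(w, x) >= 0 else -1

# hypothetical numbers: w = (-threshold, w1, w2) acting on x = (1, x1, x2)
w = np.array([-0.5, 1.0, 1.0])
print(perceptron_h(w, np.array([1.0, 0.8, 0.1])))   # +1 (approve)
print(perceptron_h(w, np.array([1.0, 0.1, 0.1])))   # -1 (deny)
```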

what do perceptrons h ‘look like’?


Learning to Answer Yes/No Perceptron Hypothesis Set

Perceptrons in R²
h(x) = sign( w0 + w1 x1 + w2 x2 )

• customer features x: points on the plane (or points in R^d)
• labels y: ◦ (+1), × (−1)
• hypothesis h: lines (or hyperplanes in R^d),
  positive on one side of a line, negative on the other side
• different lines classify customers differently

perceptrons ⇔ linear (binary) classifiers

Learning to Answer Yes/No Perceptron Hypothesis Set

Fun Time
Consider using a perceptron to detect spam messages.
Assume that each email is represented by the frequency of keyword
occurrence, and output +1 indicates a spam. Which keywords below
shall have large positive weights in a good perceptron for the task?
1 coffee, tea, hamburger, steak
2 free, drug, fantastic, deal
3 machine, learning, statistics, textbook
4 national, Taiwan, university, coursera

Reference Answer: 2
The occurrence of keywords with positive weights increases the ‘spam score’,
and hence those keywords should be ones that often appear in spam messages.



Learning to Answer Yes/No Perceptron Learning Algorithm (PLA)

Select g from H
H = all possible perceptrons, g =?

• want: g ≈ f (hard when f unknown)


• almost necessary: g ≈ f on D, ideally
g(xn ) = f (xn ) = yn
• difficult: H is of infinite size
• idea: start from some g0 , and ‘correct’ its
mistakes on D

will represent g0 by its weight vector w0



Learning to Answer Yes/No Perceptron Learning Algorithm (PLA)

Perceptron Learning Algorithm


start from some w0 (say, 0), and ‘correct’ its mistakes on D

For t = 0, 1, . . .
1 find a mistake of wt, called (xn(t), yn(t)):
  sign(wt^T xn(t)) ≠ yn(t)
2 (try to) correct the mistake by
  wt+1 ← wt + yn(t) xn(t)
. . . until no more mistakes
return last w (called wPLA) as g

(figure: the update w + y x rotates w toward x when y = +1 and away from x when y = −1)

That’s it!
—A fault confessed is half redressed. :-)
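A quick worked update with illustrative numbers (not from the slides): suppose wt = (0, 0, 0) and the mistake is on xn(t) = (1, 2, −1) (with x0 = 1) and yn(t) = +1, so sign(wt^T xn(t)) = sign(0) ≠ +1. The correction gives wt+1 = wt + yn(t) xn(t) = (1, 2, −1), and now yn(t) wt+1^T xn(t) = 1 + 4 + 1 = 6 > 0, so this example is classified correctly; in general one update only pushes the score in the right direction and need not fix the mistake in a single step.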



Learning to Answer Yes/No Perceptron Learning Algorithm (PLA)

Practical Implementation of PLA


start from some w0 (say, 0), and ‘correct’ its mistakes on D

Cyclic PLA
For t = 0, 1, . . .

1 find the next mistake of wt, called (xn(t), yn(t)):
  sign(wt^T xn(t)) ≠ yn(t)

2 correct the mistake by

  wt+1 ← wt + yn(t) xn(t)

. . . until a full cycle of not encountering mistakes

next can follow naïve cycle (1, · · · , N) or precomputed random cycle
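To make the procedure concrete, here is a minimal NumPy sketch of cyclic PLA with the naïve cycle (my illustration, not the course's reference code; cyclic_pla and max_passes are assumed names):

```python
import numpy as np

def cyclic_pla(X, y, max_passes=1000):
    """Cyclic PLA sketch. X: (N, d) raw features; y: (N,) labels in {+1, -1}."""
    N = X.shape[0]
    Xb = np.hstack([np.ones((N, 1)), X])    # prepend x0 = 1 to every example
    w = np.zeros(Xb.shape[1])               # start from w0 = 0
    for _ in range(max_passes):             # safeguard in case D is not separable
        mistake_found = False
        for n in range(N):                  # naive cycle over n = 1, ..., N
            if np.sign(Xb[n] @ w) != y[n]:  # sign(wt^T xn) != yn: a mistake
                w = w + y[n] * Xb[n]        # wt+1 <- wt + yn xn
                mistake_found = True
        if not mistake_found:               # a full cycle without mistakes: halt
            return w
    return w
```

A precomputed random cycle only changes the order in which n is visited.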
Learning to Answer Yes/No Perceptron Learning Algorithm (PLA)

Seeing is Believing

(figure sequence: a demo run of PLA on 2D data. Starting from the initial w, updates 1 through 9 each correct one mistake, on x1, x9, x14, x3, x9, x14, x9, x14, and x9 in turn, rotating w(t) to w(t+1); the final wPLA separates all of the examples.)

worked like a charm with < 20 lines!!

(note: made xi include x0 = 1 for visual purpose)
Learning to Answer Yes/No Perceptron Learning Algorithm (PLA)

Some Remaining Issues of PLA


‘correct’ mistakes on D until no mistakes

Algorithmic: halt (with no mistake)?


• naïve cyclic: ??
• random cyclic: ??
• other variant: ??

Learning: g ≈ f ?
• on D, if halt, yes (no mistake)
• outside D: ??
• if not halting: ??

[to be shown] if (...), after ‘enough’ corrections,


any PLA variant halts


Learning to Answer Yes/No Perceptron Learning Algorithm (PLA)

Fun Time
Let’s try to think about why PLA may work.
Let n = n(t), according to the rule of PLA below, which formula is true?

  sign(wt^T xn) ≠ yn,   wt+1 ← wt + yn xn

1 wt+1^T xn = yn
2 sign(wt+1^T xn) = yn
3 yn wt+1^T xn ≥ yn wt^T xn
4 yn wt+1^T xn < yn wt^T xn

Reference Answer: 3
Take the inner product of the update rule with xn and multiply by yn:
yn wt+1^T xn = yn wt^T xn + yn² ‖xn‖² ≥ yn wt^T xn.
The result shows that the rule somewhat ‘tries to correct the mistake.’



Learning to Answer Yes/No Guarantee of PLA

Linear Separability
• if PLA halts (i.e. no more mistakes),
(necessary condition) D allows some w to make no mistake
• call such D linear separable

(figures: one linear separable D, and two D that are not linear separable)

assume linear separable D,


does PLA always halt?



Learning to Answer Yes/No Guarantee of PLA

PLA Fact: wt Gets More Aligned with wf


linear separable D ⇔ exists perfect wf such that yn = sign(wf^T xn)

• wf perfect hence every xn correctly away from line:
  yn(t) wf^T xn(t) ≥ min_n yn wf^T xn > 0
• wf^T wt ↑ by updating with any (xn(t), yn(t)):
  wf^T wt+1 = wf^T ( wt + yn(t) xn(t) )
            ≥ wf^T wt + min_n yn wf^T xn
            > wf^T wt + 0

wt appears more aligned with wf after update
(really?)
Learning to Answer Yes/No Guarantee of PLA

PLA Fact: wt Does Not Grow Too Fast


wt changed only when mistake
⇔ sign(wt^T xn(t)) ≠ yn(t) ⇔ yn(t) wt^T xn(t) ≤ 0

• mistake ‘limits’ ‖wt‖² growth, even when updating with ‘longest’ xn:
  ‖wt+1‖² = ‖wt + yn(t) xn(t)‖²
          = ‖wt‖² + 2 yn(t) wt^T xn(t) + ‖yn(t) xn(t)‖²
          ≤ ‖wt‖² + 0 + ‖yn(t) xn(t)‖²
          ≤ ‖wt‖² + max_n ‖yn xn‖²

start from w0 = 0, after T mistake corrections,

  wf^T w_T / ( ‖wf‖ ‖w_T‖ ) ≥ √T · constant



Learning to Answer Yes/No Guarantee of PLA

Fun Time
Let’s upper-bound T, the number of mistakes that PLA ‘corrects’.

Define R² = max_n ‖xn‖²  and  ρ = min_n yn wf^T xn / ‖wf‖.
We want to show that T ≤ □. Express the upper bound □ by the two terms above.
1 R/ρ
2 R²/ρ²
3 R/ρ²
4 ρ²/R²

Reference Answer: 2
The maximum value of wf^T wt / ( ‖wf‖ ‖wt‖ ) is 1. Since T mistake
corrections grow this normalized inner product to at least √T · constant,
the maximum number of corrected mistakes is 1/constant².
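For completeness, here is a compact sketch of the derivation behind this answer, reconstructed from the two PLA facts above (not copied verbatim from the slides); it also identifies the constant as ρ/R:

```latex
% assuming w_0 = 0 and T mistake corrections:
%   alignment grows:      w_f^T w_T \ge T \rho \|w_f\|
%   length grows slowly:  \|w_T\|^2 \le T R^2
\frac{w_f^T w_T}{\|w_f\| \, \|w_T\|}
  \;\ge\; \frac{T \rho \|w_f\|}{\|w_f\| \cdot \sqrt{T} R}
  \;=\; \sqrt{T} \cdot \frac{\rho}{R}
% the left-hand side is a cosine, hence at most 1, so
\sqrt{T} \cdot \frac{\rho}{R} \;\le\; 1
  \quad\Longrightarrow\quad
  T \;\le\; \frac{R^2}{\rho^2}
```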
Learning to Answer Yes/No Non-Separable Data

More about PLA


Guarantee
as long as linear separable and correct by mistake
• inner product of wf and wt grows fast; length of wt grows slowly
• PLA ‘lines’ are more and more aligned with wf ⇒ halts

Pros
simple to implement, fast, works in any dimension d

Cons
• ‘assumes’ linear separable D to halt
—property unknown in advance (no need for PLA if we know wf )
• not fully sure how long halting takes (ρ depends on wf )
—though practically fast

what if D not linear separable?


Learning to Answer Yes/No Non-Separable Data

Learning with Noisy Data


• unknown target function f: X → Y plus noise (ideal credit approval formula)
• training examples D: (x1, y1), · · · , (xN, yN) (historical records in bank)
• learning algorithm A, using hypothesis set H (set of candidate formula), produces final hypothesis g ≈ f (‘learned’ formula to be used)

how to at least get g ≈ f on noisy D?



Learning to Answer Yes/No Non-Separable Data

Line with Noise Tolerance


• assume ‘little’ noise: yn = f (xn ) usually
• if so, g ≈ f on D ⇔ yn = g(xn ) usually
• how about
  wg ← argmin_w Σ_{n=1}^{N} ⟦ yn ≠ sign(w^T xn) ⟧
  (NP-hard to solve, unfortunately)
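The bracketed term ⟦·⟧ is 1 when w misclassifies (xn, yn) and 0 otherwise, so the objective simply counts mistakes on D. Evaluating it for one candidate w is easy; the hardness lies in minimizing over all w. A tiny sketch of the evaluation (illustrative, not from the slides; num_mistakes is an assumed helper name):

```python
import numpy as np

def num_mistakes(w, Xb, y):
    """Count the examples (xn, yn) that w misclassifies; Xb already has x0 = 1 prepended."""
    return int(np.sum(np.sign(Xb @ w) != y))
```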

can we modify PLA to get


an ‘approximately good’ g?



Learning to Answer Yes/No Non-Separable Data

Pocket Algorithm
modify the PLA algorithm by additionally keeping the best weights seen so far in the pocket

initialize pocket weights ŵ


For t = 0, 1, · · ·
1 find a (random) mistake of wt called (xn(t) , yn(t) )
2 (try to) correct the mistake by

wt+1 ← wt + yn(t) xn(t)

3 if wt+1 makes fewer mistakes than ŵ, replace ŵ by wt+1


...until enough iterations
return ŵ (called wPOCKET ) as g

a simple modification of PLA to find


(somewhat) ‘best’ weights
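A minimal NumPy sketch of the pocket idea (my illustration, not the course's reference code; pocket, max_updates, and rng are assumed names):

```python
import numpy as np

def pocket(X, y, max_updates=1000, rng=None):
    """Pocket algorithm sketch. X: (N, d) raw features; y: (N,) labels in {+1, -1}."""
    rng = np.random.default_rng() if rng is None else rng
    N = X.shape[0]
    Xb = np.hstack([np.ones((N, 1)), X])            # prepend x0 = 1
    w = np.zeros(Xb.shape[1])
    w_hat = w.copy()                                 # pocket weights
    best = np.sum(np.sign(Xb @ w_hat) != y)          # mistakes of the pocket weights
    for _ in range(max_updates):
        wrong = np.flatnonzero(np.sign(Xb @ w) != y)
        if wrong.size == 0:                          # no mistakes left: D was separable
            return w
        n = rng.choice(wrong)                        # find a (random) mistake
        w = w + y[n] * Xb[n]                         # PLA-style correction
        mistakes = np.sum(np.sign(Xb @ w) != y)
        if mistakes < best:                          # keep the better weights in the pocket
            best, w_hat = mistakes, w.copy()
    return w_hat
```

Note the extra pass over D in every iteration to count mistakes; this is exactly why pocket runs slower than plain PLA on linear separable data.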




Learning to Answer Yes/No Non-Separable Data

Fun Time
Should we use pocket or PLA?
Since we do not know whether D is linear separable in advance, we
may decide to just go with pocket instead of PLA. If D is actually linear
separable, what’s the difference between the two?
1 pocket on D is slower than PLA
2 pocket on D is faster than PLA
3 pocket on D returns a better g in approximating f than PLA
4 pocket on D returns a worse g in approximating f than PLA

Reference Answer: 1
Because pocket needs to check whether wt+1 is
better than ŵ in each iteration, it is slower than
PLA. On a linear separable D, wPOCKET ends up
making no mistakes on D, just like wPLA.
Learning to Answer Yes/No Non-Separable Data

Summary
1 When Can Machines Learn?

Lecture 1: The Learning Problem


Lecture 2: Learning to Answer Yes/No
Perceptron Hypothesis Set
hyperplanes/linear classifiers in Rd
Perceptron Learning Algorithm (PLA)
correct mistakes and improve iteratively
Guarantee of PLA
no mistake eventually if linear separable
Non-Separable Data
hold somewhat ‘best’ weights in pocket
• next: the zoo of learning problems
2 Why Can Machines Learn?
3 How Can Machines Learn?
4 How Can Machines Learn Better?