
Machine Learning Foundations

(機器學習基石)

Lecture 2: Learning to Answer Yes/No


Hsuan-Tien Lin (林軒田)
[email protected]

Department of Computer Science


& Information Engineering
National Taiwan University
(國立台灣大學資訊工程系)

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 0/23


Learning to Answer Yes/No

Roadmap
1 When Can Machines Learn?

Lecture 1: The Learning Problem


A takes D and H to get g

Lecture 2: Learning to Answer Yes/No


Perceptron Hypothesis Set
Perceptron Learning Algorithm (PLA)
Guarantee of PLA
2 Why Can Machines Learn?
3 How Can Machines Learn?
4 How Can Machines Learn Better?

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 1/23


Learning to Answer Yes/No Perceptron Hypothesis Set

Credit Approval Problem Revisited


Applicant Information
  age                 23 years
  gender              female
  annual salary       NTD 1,000,000
  year in residence   1 year
  year in job         0.5 year
  current debt        200,000

[Diagram: unknown target function f : X → Y (ideal credit approval formula)
→ training examples D : (x1 , y1 ), · · · , (xN , yN ) (historical records in bank)
→ learning algorithm A, together with hypothesis set H (set of candidate formulas)
→ final hypothesis g ≈ f (‘learned’ formula to be used)]

what hypothesis set can we use?

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 2/23


Learning to Answer Yes/No Perceptron Hypothesis Set

A Simple Hypothesis Set: the ‘Perceptron’


age 23 years
annual salary NTD 1,000,000
year in job 0.5 year
current debt 200,000

• For x = (x1 , x2 , · · · , xd ) ‘features of customer’, compute a weighted ‘score’ and

      approve credit if  Σ_{i=1}^{d} wi xi > threshold
      deny credit if     Σ_{i=1}^{d} wi xi < threshold

• Y: {+1 (good), −1 (bad)}, 0 ignored — the linear formulas h ∈ H are

      h(x) = sign( (Σ_{i=1}^{d} wi xi) − threshold )

called ‘perceptron’ hypothesis historically


Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 3/23
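A minimal numeric sketch of the weighted-score idea above. The feature values follow the applicant on this slide, while the weights and the threshold are made-up placeholders, not values from the lecture.

```python
# Sketch of the perceptron 'score': weighted sum of features vs. a threshold.
# Weights and threshold below are hypothetical, chosen only for illustration.
import numpy as np

x = np.array([23.0, 1_000_000.0, 0.5, 200_000.0])   # age, salary, year in job, debt
w = np.array([0.1, 0.00001, 2.0, -0.00002])          # hypothetical importance weights
threshold = 5.0                                       # hypothetical cutoff

score = np.dot(w, x)                                  # weighted 'score' sum_i w_i x_i
decision = +1 if score > threshold else -1            # approve (+1) or deny (-1)
print(score, decision)
```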
Learning to Answer Yes/No Perceptron Hypothesis Set

Vector Form of Perceptron Hypothesis

      h(x) = sign( (Σ_{i=1}^{d} wi xi) − threshold )
           = sign( Σ_{i=1}^{d} wi xi + (−threshold) · (+1) )      (call them w0 and x0 )
           = sign( Σ_{i=0}^{d} wi xi )
           = sign( wT x )

• each ‘taller’ w represents a hypothesis h & is multiplied with ‘taller’ x
  —will use taller versions to simplify notation

what do perceptrons h ‘look like’?


Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 4/23
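A small sketch of the vector form, reusing the hypothetical numbers from the previous snippet: the threshold is folded into w0 = −threshold and x0 = +1 is prepended, so the decision becomes sign(wT x).

```python
# Vector-form perceptron with the 'taller' w and x. Numbers are the same
# hypothetical ones as before, not values from the lecture.
import numpy as np

def perceptron_h(w, x):
    """h(x) = sign(w^T x) with augmented w and x."""
    return 1 if np.dot(w, x) > 0 else -1   # sign(0) convention discussed later

x = np.array([1.0, 23.0, 1_000_000.0, 0.5, 200_000.0])   # x_0 = +1 prepended
w = np.array([-5.0, 0.1, 0.00001, 2.0, -0.00002])        # w_0 = -threshold
print(perceptron_h(w, x))                                 # same decision as before
```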
Learning to Answer Yes/No Perceptron Hypothesis Set

Some Notation Conventions


fonts
• normal x: just a scalar
• bold x: a vector (x0 , x1 , . . . , xd )
bold w: a vector (w0 , w1 , . . . , wd )
• normal xi : the i-th component in x
normal wi : the i-th component in w
• bold xn : the n-th vector (in the data)
bold wt : the t-th vector (we will see)
• normal xn,i (rarely used): the i-th component in xn
normal wt,i (rarely used): the i-th component in wt
• calligraphic as sets: input X , output Y, data D, hypothesis set H; exception: algorithm A

two important numbers


• N examples (xn , yn ), indexed by n = 1, 2, . . . , N
• d features, indexed by i = 0, 1, 2, . . . , d

important to follow the notations


from the very beginning
Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 5/23
Learning to Answer Yes/No Perceptron Hypothesis Set

Perceptrons in R2
h(x) = sign (w0 + w1 x1 + w2 x2 )

• customer features x: points on the plane (or points in Rd )


• labels y : ◦ (+1), × (-1)
• hypothesis h: visually lines w0 + w1 x1 + w2 x2 = 0
(or hyperplanes in Rd )
—positive on one side of a line, negative on the other side
• different line classifies customers differently

perceptrons ⇔ linear (binary) classifiers

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 6/23
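A quick sketch of a perceptron in R2: one hypothetical line w0 + w1 x1 + w2 x2 = 0 classifying a few made-up points, just to show that the hypothesis is a linear (binary) classifier.

```python
# A hypothetical line x_1 + x_2 = 1 classifying three made-up 2D points.
import numpy as np

w = np.array([-1.0, 1.0, 1.0])           # hypothetical (w_0, w_1, w_2)
points = np.array([[1.0, 1.0],           # above the line -> +1 (o)
                   [0.2, 0.3],           # below the line -> -1 (x)
                   [0.9, 0.0]])          # below the line -> -1 (x)

X = np.column_stack([np.ones(len(points)), points])   # prepend x_0 = 1
print(np.sign(X @ w))                                  # [ 1. -1. -1.]
```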



Learning to Answer Yes/No Perceptron Hypothesis Set

Fun Time
Consider using a perceptron to detect spam messages.
Assume that each email is represented by the frequency of keyword
occurrence, and output +1 indicates a spam. Which keywords below
shall have large positive weights in a good perceptron for the task?
1 coffee, tea, hamburger, steak
2 free, drug, fantastic, deal
3 machine, learning, statistics, textbook
4 national, Taiwan, university, coursera

Reference Answer: 2
The occurrence of keywords with positive weights increases the ‘spam score’, and
hence those keywords should appear often in spam messages.

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 7/23


Learning to Answer Yes/No Perceptron Learning Algorithm (PLA)

Select g from H
H = all possible perceptrons, g =?

• want: g ≈ f (hard when f unknown)


• almost necessary: g ≈ f on D, ideally
g(xn ) = f (xn ) = yn
• difficult: H is of infinite size
• idea: start from some g0 , and ‘correct’ its
mistakes on D

will represent g0 by its weight vector w0

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 8/23


Learning to Answer Yes/No Perceptron Learning Algorithm (PLA)

Perceptron Learning Algorithm


start from some w0 (say, 0), and ‘correct’ its mistakes on D

For t = 0, 1, . . .
  1 find a mistake of wt, called (xn(t) , yn(t)):
        sign(wTt xn(t)) ̸= yn(t)
  2 (try to) correct the mistake by
        wt+1 ← wt + yn(t) xn(t)
. . . until no more mistakes
return last w (called wpla ) as g

[Figure: update intuition — for a mistake with y = +1, adding y x rotates w toward x;
for a mistake with y = −1, adding y x rotates w away from x.]

That’s it!
—A fault confessed is half redressed. :-)

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 9/23
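A minimal PLA sketch in Python under the assumptions on this slide: start from w0 = 0, repeatedly pick a mistaken example and correct it, and stop when no mistakes remain (which presumes the data is linear separable). The toy dataset is made up for illustration.

```python
# Perceptron Learning Algorithm: correct one mistake at a time until none remain.
import numpy as np

def pla(X, y):
    """X: N x (d+1) matrix with x_0 = 1 already prepended; y: labels in {+1, -1}."""
    w = np.zeros(X.shape[1])                      # start from w_0 = 0
    while True:
        mistakes = np.where(np.sign(X @ w) != y)[0]
        if len(mistakes) == 0:                    # no more mistakes: halt
            return w                              # w_PLA, returned as g
        n = mistakes[0]                           # pick a mistaken example
        w = w + y[n] * X[n]                       # w_{t+1} <- w_t + y_{n(t)} x_{n(t)}

# toy linearly separable data (labels consistent with sign(x_1 - x_2))
X = np.array([[1, 2.0, 1.0], [1, 1.0, 3.0], [1, 0.5, 0.1], [1, 0.2, 0.9]])
y = np.array([+1, -1, +1, -1])
w_pla = pla(X, y)
print(w_pla, np.sign(X @ w_pla))                  # all predictions match y
```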


Learning to Answer Yes/No Perceptron Learning Algorithm (PLA)

Handling sign(·) = 0
Perceptron Learning Algorithm
start from some w0 (say, 0), and ‘correct’ its mistakes on D

When w0 = 0, technically sign(wT0 xn(0) ) = 0, shall we update?

• convention -1: sign(0) = −1 (update if yn(0) = +1)


• convention +1: sign(0) = +1 (update if yn(0) = −1)
• convention 0: sign(0) = 0 (always update)
• convention r: sign(0) = random flip (50% chance of update)
—usually does not matter much, as long as w1 often becomes
non-zero

wTt xn(t) = 0 rarely happens in practice

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 10/23
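A tiny sketch of the sign(0) conventions listed above; the function name and the convention argument are illustrative choices, not fixed by the lecture.

```python
# sign with a selectable tie-breaking convention for the rare case s == 0.
import numpy as np

def sign(s, convention=-1):
    if s > 0:
        return +1
    if s < 0:
        return -1
    if convention in (-1, +1, 0):     # conventions -1, +1, or 0 (0 => always update)
        return convention
    return np.random.choice([-1, +1]) # convention 'r': random flip

print(sign(0.0), sign(0.0, convention=+1))   # -1 1
```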


Learning to Answer Yes/No Perceptron Learning Algorithm (PLA)
How to Update w0 ?
Perceptron Learning Algorithm
For t = 0, 1, . . .
  1 find a mistake of wt, called (xn(t) , yn(t)): sign(wTt xn(t)) ̸= yn(t)
  2 (try to) correct the mistake by wt+1 ← wt + yn(t) xn(t) , i.e.,

      wt+1,0 = wt,0 + yn(t) · x0 (= what?)
      wt+1,1 = wt,1 + yn(t) · xn(t),1
        ···
      wt+1,d = wt,d + yn(t) · xn(t),d

. . . until no more mistakes


return last w (called wpla ) as g

each update changes wt,0 by yn(t) (because x0 = +1)


Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 11/23
Learning to Answer Yes/No Perceptron Learning Algorithm (PLA)

Practical Implementation of PLA


start from some w0 (say, 0), and ‘correct’ its mistakes on D

Cyclic PLA
For t = 0, 1, . . .

  1 find the next mistake of wt, called (xn(t) , yn(t)):
        sign(wTt xn(t)) ̸= yn(t)

2 correct the mistake by

wt+1 ← wt + yn(t) xn(t)

. . . until a full cycle of not encountering mistakes

next can follow naïve cycle (1, · · · , N)


or precomputed random cycle

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 12/23
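A sketch of Cyclic PLA, assuming the same augmented-data convention as before: scan the examples in a naïve cycle (or a precomputed random cycle), correct each mistake as it is met, and halt after a full pass with no mistakes. The toy data is the same made-up set as earlier.

```python
# Cyclic PLA: fix mistakes in a fixed (or pre-shuffled) scan order.
import numpy as np

def cyclic_pla(X, y, shuffle=False, seed=0):
    """X: N x (d+1) with x_0 = 1 prepended; y: labels in {+1, -1}."""
    order = np.arange(len(y))
    if shuffle:                                   # precomputed random cycle
        np.random.default_rng(seed).shuffle(order)
    w = np.zeros(X.shape[1])
    clean_pass = False
    while not clean_pass:
        clean_pass = True
        for n in order:                           # naive cycle 1, ..., N (or shuffled)
            if np.sign(X[n] @ w) != y[n]:         # next mistake found
                w = w + y[n] * X[n]               # correct it
                clean_pass = False                # this pass was not mistake-free
    return w

X = np.array([[1, 2.0, 1.0], [1, 1.0, 3.0], [1, 0.5, 0.1], [1, 0.2, 0.9]])
y = np.array([+1, -1, +1, -1])
print(cyclic_pla(X, y), cyclic_pla(X, y, shuffle=True))
```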


Learning to Answer Yes/No Perceptron Learning Algorithm (PLA)

Seeing is Believing

[Figure sequence: PLA run on a 2D toy dataset. Starting from the initial w, each of
updates 1–9 picks a mistaken example (x1, x9, x14, x3, x9, x14, x9, x14, x9), rotates
w(t) to w(t+1), and the final weight vector wPLA separates all points.]

worked like a charm with < 20 lines!!

(note: made xi ≫ x0 = 1 for visual purpose)
Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 13/23

Learning to Answer Yes/No Perceptron Learning Algorithm (PLA)

Fun Time
Let’s try to think about why PLA may work.
Let n = n(t), according to the rule of PLA below, which formula is true?
 
      sign(wTt xn) ̸= yn ,    wt+1 ← wt + yn xn
1 wTt+1 xn = yn
2 sign(wTt+1 xn ) = yn
3 yn wTt+1 xn ≥ yn wTt xn
4 yn wTt+1 xn < yn wTt xn

Reference Answer: 3
Simply multiply the second part of the rule by
yn xn . The result shows that the rule somewhat
‘tries to correct the mistake.’

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 14/23


Learning to Answer Yes/No Guarantee of PLA

Linear Separability
• if PLA halts (i.e. no more mistakes),
(necessary condition) D allows some w to make no mistake
• call such D linear separable

[Figure: three toy datasets in R2 — one linear separable, two not linear separable.]

assume linear separable D,


does PLA always halt?

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 15/23


Learning to Answer Yes/No Guarantee of PLA

PLA Fact: wt Gets More Aligned with wf


linear separable D ⇔ exists perfect wf such that yn = sign(wTf xn )

• wf perfect hence every xn correctly away from line:

      yn(t) wTf xn(t) ≥ min_n yn wTf xn > 0

• wTf wt ↑ by updating with any (xn(t) , yn(t)):

      wTf wt+1 = wTf ( wt + yn(t) xn(t) )
               ≥ wTf wt + min_n yn wTf xn
               > wTf wt + 0.

wt appears more aligned with wf after update


(really?)
Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 16/23
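A quick numeric check of this fact on a made-up separable dataset with a known perfect separator wf = (0, 1, −1): the inner product wTf wt goes up at every PLA correction, as the inequality above predicts.

```python
# Verify that w_f^T w_t strictly increases at each PLA correction (toy data).
import numpy as np

X = np.array([[1, 2.0, 1.0], [1, 1.0, 3.0], [1, 0.5, 0.1], [1, 0.2, 0.9]])
y = np.array([+1, -1, +1, -1])
w_f = np.array([0.0, 1.0, -1.0])          # a perfect separator for this toy set

w = np.zeros(3)
while True:
    mistakes = np.where(np.sign(X @ w) != y)[0]
    if len(mistakes) == 0:
        break
    n = mistakes[0]
    before, w = w_f @ w, w + y[n] * X[n]  # one correction step
    print(before, "->", w_f @ w)          # inner product with w_f goes up each update
```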
Learning to Answer Yes/No Guarantee of PLA

PLA Fact: wt Does Not Grow Too Fast


wt changed only when mistake
⇔ sign(wTt xn(t)) ̸= yn(t) ⇔ yn(t) wTt xn(t) ≤ 0

• mistake ‘limits’ ∥wt ∥2 growth, even when updating with ‘longest’ xn

      ∥wt+1 ∥2 = ∥wt + yn(t) xn(t) ∥2
               = ∥wt ∥2 + 2 yn(t) wTt xn(t) + ∥yn(t) xn(t) ∥2
               ≤ ∥wt ∥2 + 0 + ∥yn(t) xn(t) ∥2
               ≤ ∥wt ∥2 + max_n ∥yn xn ∥2

start from w0 = 0, after T mistake corrections,

      wTf wT / (∥wf ∥ ∥wT ∥) ≥ √T · constant

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 17/23




Learning to Answer Yes/No Guarantee of PLA
Fun Time
Let’s upper-bound T , the number of mistakes that PLA ‘corrects’.
Define  R2 = max_n ∥xn ∥2 ,    ρ = min_n yn (wTf / ∥wf ∥) xn
We want to show that T ≤ □. Express the upper bound □ by the two
terms above.
1 R/ρ
2 R 2 /ρ2
3 R/ρ2
4 ρ2 /R 2

Reference Answer: 2
The maximum value of wTf wt / (∥wf ∥ ∥wt ∥) is 1. Since T mistake corrections grow this
normalized inner product to at least √T · constant, the maximum number of corrected
mistakes is 1/constant2 .
Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 18/23
Learning to Answer Yes/No Guarantee of PLA

PLA Mistake Bound


inner product grows fast:
      wTf wt+1 ≥ wTf wt + min_n yn wTf xn ,    where min_n yn wTf xn = ρ · ∥wf ∥

length2 grows slowly:
      ∥wt+1 ∥2 ≤ ∥wt ∥2 + max_n ∥xn ∥2 ,       where max_n ∥xn ∥2 = R2

Magic Chain!
      wTf w1 ≥ wTf w0 + ρ · ∥wf ∥              ∥w1 ∥2 ≤ ∥w0 ∥2 + R2
      wTf w2 ≥ wTf w1 + ρ · ∥wf ∥              ∥w2 ∥2 ≤ ∥w1 ∥2 + R2
      wTf w3 ≥ wTf w2 + ρ · ∥wf ∥              ∥w3 ∥2 ≤ ∥w2 ∥2 + R2
      ···                                      ···
      wTf wT ≥ wTf wT−1 + ρ · ∥wf ∥            ∥wT ∥2 ≤ ∥wT−1 ∥2 + R2

start from w0 = 0, after T mistake corrections,

      1 ≥ wTf wT / (∥wf ∥ ∥wT ∥) ≥ (T ρ ∥wf ∥) / (∥wf ∥ · √T R) = √T · ρ/R   =⇒   T ≤ (R/ρ)2
Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 19/23
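A numeric sketch of the bound on the same made-up separable dataset: compute R2 and ρ from a known perfect wf, run PLA while counting corrections T, and confirm T ≤ (R/ρ)2. The data and wf are for illustration only.

```python
# Check the PLA mistake bound T <= (R / rho)^2 on toy data.
import numpy as np

X = np.array([[1, 2.0, 1.0], [1, 1.0, 3.0], [1, 0.5, 0.1], [1, 0.2, 0.9]])
y = np.array([+1, -1, +1, -1])
w_f = np.array([0.0, 1.0, -1.0])                      # one perfect separator

R2  = np.max(np.sum(X**2, axis=1))                    # R^2 = max_n ||x_n||^2
rho = np.min(y * (X @ w_f) / np.linalg.norm(w_f))     # rho = min_n y_n w_f^T x_n / ||w_f||

w, T = np.zeros(3), 0
while True:                                            # count PLA corrections
    mistakes = np.where(np.sign(X @ w) != y)[0]
    if len(mistakes) == 0:
        break
    w = w + y[mistakes[0]] * X[mistakes[0]]
    T += 1
print(T, "<=", R2 / rho**2)                            # T never exceeds R^2 / rho^2
```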
Learning to Answer Yes/No Guarantee of PLA

More about PLA


Guarantee
as long as linear separable and correct by mistake
• inner product of wf and wt grows fast; length of wt grows slowly
• PLA ‘lines’ are more and more aligned with wf ⇒ halts

Pros
simple to implement, fast, works in any dimension d

Cons
• ‘assumes’ linear separable D to halt
—property unknown in advance (no need for PLA if we know wf )
• not fully sure how long halting takes (ρ depends on wf )
—though practically fast

what if D not linear separable?


Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 20/23
Learning to Answer Yes/No Guarantee of PLA

Learning with Noisy Data


[Diagram: unknown target function f : X → Y + noise (ideal credit approval formula)
→ training examples D : (x1 , y1 ), · · · , (xN , yN ) (historical records in bank)
→ learning algorithm A, together with hypothesis set H (set of candidate formulas)
→ final hypothesis g ≈ f (‘learned’ formula to be used)]

how to at least get g ≈ f on noisy D?

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 21/23


Learning to Answer Yes/No Guarantee of PLA

Line with Noise Tolerance


• assume ‘little’ noise: yn = f (xn ) usually
• if so, g ≈ f on D ⇔ yn = g(xn ) usually
• how about

      wg ← argmin_w  Σ_{n=1}^{N} [[ yn ̸= sign(wT xn) ]]

  —NP-hard to solve, unfortunately

will discuss other solutions for an ‘approximately good’ g later

Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 22/23
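A sketch of the in-sample 0/1 error that the argmin above tries to minimize; evaluating it for one candidate w is easy — the NP-hardness is in searching over all w. The noisy toy labels below are made up.

```python
# Count the 0/1 errors [[ y_n != sign(w^T x_n) ]] of one candidate w on noisy data.
import numpy as np

def zero_one_errors(w, X, y):
    return int(np.sum(np.sign(X @ w) != y))

X = np.array([[1, 2.0, 1.0], [1, 1.0, 3.0], [1, 0.5, 0.1], [1, 0.2, 0.9]])
y_noisy = np.array([+1, -1, -1, -1])                  # third label flipped by 'noise'

# with noisy labels, even a good line may leave a few mistakes uncorrected
print(zero_one_errors(np.array([0.0, 1.0, -1.0]), X, y_noisy))   # 1 mistake remains
```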


Learning to Answer Yes/No Guarantee of PLA

Summary
1 When Can Machines Learn?

Lecture 1: The Learning Problem


Lecture 2: Learning to Answer Yes/No
Perceptron Hypothesis Set
hyperplanes/linear classifiers in Rd
Perceptron Learning Algorithm (PLA)
correct mistakes and improve iteratively
Guarantee of PLA
no mistake eventually if linear separable

• next: the zoo of learning problems


2 Why Can Machines Learn?
3 How Can Machines Learn?
4 How Can Machines Learn Better?
Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations 23/23
