Lecture 8 (ML): On-Line Learning

This document summarizes key concepts in online machine learning. It discusses the motivation for online learning compared to PAC learning, including that online learning makes no distributional assumptions and allows for a worst-case analysis. It then covers prediction with expert advice, including the weighted majority algorithm and its mistake bound, which is O(log N) in the realizable case. Finally, it introduces the exponential weighted average algorithm and proves its regret bound of O(√(T log N)).

Foundations of Machine Learning

On-Line Learning

Mehryar Mohri
Courant Institute and Google Research
[email protected]
Motivation
PAC learning:
• distribution fixed over time (training and test).
• IID assumption.
On-line learning:
• no distributional assumption.
• worst-case analysis (adversarial).
• mixed training and test.
• Performance measure: mistake model, regret.

This Lecture
Prediction with expert advice
Linear classification

General On-Line Setting
For t = 1 to T do
• receive instance x_t ∈ X.
• predict ŷ_t ∈ Y.
• receive label y_t ∈ Y.
• incur loss L(ŷ_t, y_t).
Classification: Y = {0, 1}, L(y, y′) = |y′ − y|.
Regression: Y ⊆ R, L(y, y′) = (y′ − y)².
Objective: minimize total loss ∑_{t=1}^T L(ŷ_t, y_t).

Prediction with Expert Advice
For t = 1 to T do
• receive instance x_t ∈ X and advice y_{t,i} ∈ Y, i ∈ [1, N].
• predict ŷ_t ∈ Y.
• receive label y_t ∈ Y.
• incur loss L(ŷ_t, y_t).
Objective: minimize regret, i.e., the difference between the total loss incurred and that of the best expert:

Regret(T) = ∑_{t=1}^T L(ŷ_t, y_t) − min_{1≤i≤N} ∑_{t=1}^T L(y_{t,i}, y_t).
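The regret definition above is easy to compute directly. The following minimal Python sketch (an illustration added here, not part of the original slides; the function name and array shapes are my own choices) evaluates it for the 0/1 loss:

```python
import numpy as np

def regret(pred, expert_pred, labels):
    """Regret of a learner against the best single expert in hindsight (0/1 loss).

    pred:        shape (T,)   - learner's predictions ŷ_t
    expert_pred: shape (T, N) - expert advice y_{t,i}
    labels:      shape (T,)   - true labels y_t
    """
    learner_loss = np.sum(pred != labels)                           # ∑_t L(ŷ_t, y_t)
    expert_losses = np.sum(expert_pred != labels[:, None], axis=0)  # total loss of each expert
    return learner_loss - expert_losses.min()                       # subtract the best expert's loss
```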

Mistake Bound Model
Definition: the maximum number of mistakes a learning algorithm L makes to learn a concept c is defined by
M_L(c) = max_{x_1,...,x_T} |mistakes(L, c)|.
Definition: for any concept class C, the maximum number of mistakes a learning algorithm L makes is
M_L(C) = max_{c ∈ C} M_L(c).
A mistake bound is a bound M on M_L(C).

Halving Algorithm
see (Mitchell, 1997)

Halving(H)
1  H_1 ← H
2  for t ← 1 to T do
3      Receive(x_t)
4      ŷ_t ← MajorityVote(H_t, x_t)
5      Receive(y_t)
6      if ŷ_t ≠ y_t then
7          H_{t+1} ← {c ∈ H_t : c(x_t) = y_t}
8      else H_{t+1} ← H_t
9  return H_{T+1}
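A minimal executable sketch of the halving algorithm (my illustration, assuming the realizable setting and representing each hypothesis as a Python callable; the names are hypothetical):

```python
def halving(hypotheses, stream):
    """Halving algorithm: predict by majority vote over the current version space.

    hypotheses: list of callables h(x) -> label in {0, 1}
    stream:     iterable of (x_t, y_t) pairs, assumed consistent with some hypothesis
    Returns the surviving version space and the number of mistakes made.
    """
    H, mistakes = list(hypotheses), 0
    for x, y in stream:
        votes = sum(h(x) for h in H)
        y_hat = 1 if 2 * votes >= len(H) else 0      # majority vote (ties broken toward 1)
        if y_hat != y:
            mistakes += 1
            H = [h for h in H if h(x) == y]          # a mistake removes at least half of H
    return H, mistakes
```

Since the version space at least halves on every mistake, the mistake count of this sketch can never exceed log₂|H|, matching the bound on the next slide.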

Halving Algorithm - Bound
(Littlestone, 1988)
Theorem: Let H be a finite hypothesis set. Then,
M_Halving(H) ≤ log₂ |H|.
Proof: at each mistake, the hypothesis set is reduced at least by half.

VC Dimension Lower Bound
(Littlestone, 1988)
Theorem: Let opt(H) be the optimal mistake bound for H. Then,
VCdim(H) ≤ opt(H) ≤ M_Halving(H) ≤ log₂ |H|.
Proof: for a fully shattered set, form a complete binary tree of the mistakes with height VCdim(H).

Weighted Majority Algorithm
(Littlestone and Warmuth, 1988)
Weighted-Majority(N experts)        y_t, y_{t,i} ∈ {0, 1}, β ∈ [0, 1).
 1  for i ← 1 to N do
 2      w_{1,i} ← 1
 3  for t ← 1 to T do
 4      Receive(x_t)
 5      ŷ_t ← 1 if ∑_{i: y_{t,i}=1} w_{t,i} ≥ ∑_{i: y_{t,i}=0} w_{t,i}, else 0        (weighted majority vote)
 6      Receive(y_t)
 7      if ŷ_t ≠ y_t then
 8          for i ← 1 to N do
 9              if y_{t,i} ≠ y_t then
10                  w_{t+1,i} ← β w_{t,i}
11              else w_{t+1,i} ← w_{t,i}
12  return w_{T+1}
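Here is a short Python sketch of the algorithm above (my illustration; the function name, array shapes and the default β = 1/2 are assumptions, not from the slides):

```python
import numpy as np

def weighted_majority(expert_pred, labels, beta=0.5):
    """Weighted majority (Littlestone & Warmuth) for binary advice, beta in (0, 1).

    expert_pred: shape (T, N), entries in {0, 1};  labels: shape (T,), entries in {0, 1}.
    Returns the final weights and the number of mistakes made by the algorithm.
    """
    T, N = expert_pred.shape
    w = np.ones(N)
    mistakes = 0
    for t in range(T):
        advice = expert_pred[t]
        y_hat = 1 if w[advice == 1].sum() >= w[advice == 0].sum() else 0   # weighted majority vote
        if y_hat != labels[t]:
            mistakes += 1
            w[advice != labels[t]] *= beta        # penalize the experts that were wrong
    return w, mistakes
```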
Weighted Majority - Bound
Theorem: Let m_t be the number of mistakes made by the WM algorithm up to time t and m_t* that of the best expert. Then, for all t,
m_t ≤ (log N + m_t* log(1/β)) / log(2/(1+β)).

• Thus, m_t ≤ O(log N) + constant × (mistakes of best expert).
• Realizable case: m_t ≤ O(log N).
• Halving algorithm: β = 0.
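To make the bound concrete, here is a small sketch (my addition; the sample values are chosen only for illustration) that evaluates the right-hand side of the theorem:

```python
from math import log

def wm_mistake_bound(N, m_star, beta):
    """Evaluate (log N + m* log(1/beta)) / log(2/(1+beta)) for 0 < beta < 1."""
    return (log(N) + m_star * log(1.0 / beta)) / log(2.0 / (1.0 + beta))

# e.g. N = 100 experts, best expert makes m* = 10 mistakes, beta = 1/2:
print(wm_mistake_bound(100, 10, 0.5))   # ≈ 40.1 mistakes at most
```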
Weighted Majority - Proof
Potential: Φ_t = ∑_{i=1}^N w_{t,i}.
Upper bound: after each error,
Φ_{t+1} ≤ [1/2 + (1/2)β] Φ_t = ((1+β)/2) Φ_t.
Thus, Φ_t ≤ ((1+β)/2)^{m_t} N.
Lower bound: for any expert i, Φ_t ≥ w_{t,i} = β^{m_{t,i}}.
Comparison:
β^{m_t*} ≤ ((1+β)/2)^{m_t} N
⇒ m_t* log β ≤ log N + m_t log((1+β)/2)
⇒ m_t log(2/(1+β)) ≤ log N + m_t* log(1/β).
Weighted Majority - Notes
Advantage: remarkable bound requiring no
assumption.
Disadvantage: no deterministic algorithm can achieve a regret R_T = o(T) with the binary loss.
• better guarantee with randomized WM.
• better guarantee for WM with convex losses.

Exponential Weighted Average
Algorithm:
• weight update: w_{t+1,i} ← w_{t,i} e^{−η L(y_{t,i}, y_t)} = e^{−η L_{t,i}}, where L_{t,i} is the total loss incurred by expert i up to time t.
• prediction: ŷ_t = ∑_{i=1}^N w_{t,i} y_{t,i} / ∑_{i=1}^N w_{t,i}.

Theorem: assume that L is convex in its first argument and takes values in [0, 1]. Then, for any η > 0 and any sequence y_1, ..., y_T ∈ Y, the regret at T satisfies
Regret(T) ≤ (log N)/η + ηT/8.
For η = √(8 log N / T),
Regret(T) ≤ √((T/2) log N).
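A minimal sketch of the exponentially weighted average forecaster (my illustration; the function name and the default squared loss, which is convex in its first argument and bounded by 1 for predictions and outcomes in [0, 1], are assumptions):

```python
import numpy as np

def exponential_weighted_average(expert_pred, labels, eta,
                                 loss=lambda p, y: (p - y) ** 2):
    """Exponentially weighted average forecaster for losses in [0, 1].

    expert_pred: shape (T, N) expert predictions;  labels: shape (T,) outcomes;
    eta: learning rate, e.g. sqrt(8 * log(N) / T).
    Returns the learner's predictions and its cumulative loss.
    """
    T, N = expert_pred.shape
    cum_loss = np.zeros(N)                    # L_{t,i}: cumulative loss of each expert
    preds, total = [], 0.0
    for t in range(T):
        w = np.exp(-eta * (cum_loss - cum_loss.min()))     # weights (shifted for numerical stability)
        y_hat = np.dot(w, expert_pred[t]) / w.sum()        # weighted-average prediction
        preds.append(y_hat)
        total += loss(y_hat, labels[t])
        cum_loss += loss(expert_pred[t], labels[t])        # update every expert's loss
    return np.array(preds), total
```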
Exponential Weighted Avg - Proof
Potential: Φ_t = log ∑_{i=1}^N w_{t,i}.
Upper bound:
Φ_t − Φ_{t−1} = log [ ∑_{i=1}^N w_{t−1,i} e^{−η L(y_{t,i}, y_t)} / ∑_{i=1}^N w_{t−1,i} ]
= log E_{w_{t−1}}[ e^{−η L(y_{t,i}, y_t)} ]
= log E_{w_{t−1}}[ exp( −η (L(y_{t,i}, y_t) − E_{w_{t−1}}[L(y_{t,i}, y_t)]) − η E_{w_{t−1}}[L(y_{t,i}, y_t)] ) ]
≤ −η E_{w_{t−1}}[L(y_{t,i}, y_t)] + η²/8          (Hoeffding's ineq.)
≤ −η L(E_{w_{t−1}}[y_{t,i}], y_t) + η²/8          (convexity of first arg. of L)
= −η L(ŷ_t, y_t) + η²/8.

Exponential Weighted Avg - Proof
Upper bound: summing up the inequalities yields
Φ_{T+1} − Φ_1 ≤ −η ∑_{t=1}^T L(ŷ_t, y_t) + η²T/8.
Lower bound:
Φ_{T+1} − Φ_1 = log ∑_{i=1}^N e^{−η L_{T,i}} − log N ≥ log max_{1≤i≤N} e^{−η L_{T,i}} − log N = −η min_{1≤i≤N} L_{T,i} − log N.
Comparison:
−η min_{1≤i≤N} L_{T,i} − log N ≤ −η ∑_{t=1}^T L(ŷ_t, y_t) + η²T/8
⇒ ∑_{t=1}^T L(ŷ_t, y_t) − min_{1≤i≤N} L_{T,i} ≤ (log N)/η + ηT/8.
Exponential Weighted Avg - Notes
Advantage: the bound on the regret per round is of the form R_T/T = O(√(log(N)/T)).
Disadvantage: the choice of η requires knowledge of the horizon T.

Doubling Trick
Idea: divide time into periods [2^k, 2^{k+1} − 1] of length 2^k with k = 0, ..., n, T ≥ 2^n − 1, and choose η_k = √(8 log N / 2^k) in each period.
Theorem: with the same assumptions as before, for any T, the following holds:

Regret(T) ≤ (√2/(√2 − 1)) √((T/2) log N) + √((log N)/2).
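The period structure is easy to write down explicitly. Below is a small sketch (my addition; the helper name is hypothetical) that lists the periods and the learning rate η_k used in each; a fresh run of the exponentially weighted average forecaster would be started at the beginning of each period with the listed η_k:

```python
from math import sqrt, log

def doubling_trick_schedule(T, N):
    """Periods I_k = [2^k, 2^(k+1) - 1] covering rounds 1..T, with eta_k = sqrt(8 log N / 2^k)."""
    schedule, k = [], 0
    while 2 ** k <= T:
        start, end = 2 ** k, min(2 ** (k + 1) - 1, T)
        schedule.append((start, end, sqrt(8 * log(N) / 2 ** k)))
        k += 1
    return schedule
```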

Doubling Trick - Proof
By the previous theorem, for any I_k = [2^k, 2^{k+1} − 1],
L_{I_k} − min_{1≤i≤N} L_{I_k,i} ≤ √((2^k/2) log N).
Thus,
L_T = ∑_{k=0}^n L_{I_k} ≤ ∑_{k=0}^n min_{1≤i≤N} L_{I_k,i} + ∑_{k=0}^n √(2^k (log N)/2)
    ≤ min_{1≤i≤N} L_{T,i} + ∑_{k=0}^n 2^{k/2} √((log N)/2),
with
∑_{k=0}^n 2^{k/2} = (2^{(n+1)/2} − 1)/(√2 − 1) ≤ (√2 √(T+1) − 1)/(√2 − 1) ≤ (√2 (√T + 1) − 1)/(√2 − 1) = (√2/(√2 − 1)) √T + 1.
Notes
Doubling trick used in a variety of other contexts and proofs.
More general method, learning parameter as a function of time: η_t = √((8 log N)/t). Constant factor improvement:

Regret(T) ≤ 2√((T/2) log N) + √((1/8) log N).

This Lecture
Prediction with expert advice
Linear classification

Perceptron Algorithm
(Rosenblatt, 1958)

Perceptron(w_0)
1  w_1 ← w_0            (typically w_0 = 0)
2  for t ← 1 to T do
3      Receive(x_t)
4      ŷ_t ← sgn(w_t · x_t)
5      Receive(y_t)
6      if ŷ_t ≠ y_t then
7          w_{t+1} ← w_t + y_t x_t        (more generally w_t + η y_t x_t, η > 0)
8      else w_{t+1} ← w_t
9  return w_{T+1}
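A runnable Python sketch of the algorithm above (my illustration; labels are taken in {−1, +1} and sgn(0) is treated as +1, a convention not fixed by the slides):

```python
import numpy as np

def perceptron(X, y, eta=1.0, w0=None):
    """Perceptron (Rosenblatt): one online pass over (X, y), labels in {-1, +1}.

    X: shape (T, d);  y: shape (T,).
    Returns the final weight vector and the rounds at which updates were made.
    """
    T, d = X.shape
    w = np.zeros(d) if w0 is None else np.asarray(w0, dtype=float).copy()
    updates = []
    for t in range(T):
        y_hat = 1 if np.dot(w, X[t]) >= 0 else -1       # sgn(w_t · x_t)
        if y_hat != y[t]:
            w = w + eta * y[t] * X[t]                    # w_{t+1} = w_t + η y_t x_t
            updates.append(t)
    return w, updates
```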

Separating Hyperplane
Margin and errors
[Figure: two separating hyperplanes w·x = 0 with margin ρ; the signed margin of a point x_i is y_i(w · x_i)/‖w‖.]
Perceptron = Stochastic Gradient Descent
Objective function: convex but not differentiable.
F(w) = (1/T) ∑_{t=1}^T max(0, −y_t (w · x_t)) = E_{x∼D̂}[f(w, x)],
with f(w, x) = max(0, −y (w · x)).

Stochastic gradient: for each x_t, the update is
w_{t+1} ← w_t − η ∇_w f(w_t, x_t)   if f is differentiable at w_t,
w_{t+1} ← w_t                        otherwise,
where η > 0 is a learning rate parameter.
Here:
w_{t+1} ← w_t + η y_t x_t   if y_t (w_t · x_t) < 0,
w_{t+1} ← w_t               otherwise.
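The correspondence can be spelled out in one function (a sketch with my own naming): the perceptron update is exactly a stochastic (sub)gradient step on f(w, x) = max(0, −y (w · x)):

```python
import numpy as np

def perceptron_sgd_step(w, x, y, eta=1.0):
    """One stochastic (sub)gradient step on f(w, x) = max(0, -y * (w . x))."""
    if y * np.dot(w, x) < 0:        # f is differentiable here with gradient -y * x
        return w + eta * y * x      # w - eta * grad  =  perceptron update
    return w                        # choose the zero subgradient otherwise
```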
Perceptron Algorithm - Bound
(Novikoff, 1962)
Theorem: Assume that ‖x_t‖ ≤ R for all t ∈ [1, T] and that for some ρ > 0 and v ∈ R^N, for all t ∈ [1, T],
ρ ≤ y_t (v · x_t)/‖v‖.
Then, the number of mistakes made by the perceptron algorithm is bounded by R²/ρ².
Proof: Let I be the set of rounds at which there is an update and let M be the total number of updates.
• Summing up the assumption inequalities gives:
Mρ ≤ v · ∑_{t∈I} y_t x_t / ‖v‖
   = v · ∑_{t∈I} (w_{t+1} − w_t) / ‖v‖        (definition of updates)
   = v · w_{T+1} / ‖v‖
   ≤ ‖w_{T+1}‖                                 (Cauchy-Schwarz ineq.)
   = ‖w_{t_m} + y_{t_m} x_{t_m}‖               (t_m largest t in I)
   = [ ‖w_{t_m}‖² + ‖x_{t_m}‖² + 2 y_{t_m} w_{t_m} · x_{t_m} ]^{1/2}      (last term ≤ 0)
   ≤ [ ‖w_{t_m}‖² + R² ]^{1/2}
   ≤ [ M R² ]^{1/2} = √M R.                     (applying the same to previous ts in I)
Thus, M ≤ R²/ρ².
• Notes:
• bound independent of dimension and tight.
• convergence can be slow for small margin: it can be in Ω(2^N).
• among the many variants: voted perceptron algorithm. Predict according to
  sgn( (∑_{t∈I} c_t w_t) · x ),
  where c_t is the number of iterations w_t survives.
• {x_t : t ∈ I} are the support vectors for the perceptron algorithm.
• non-separable case: does not converge.

Perceptron - Leave-One-Out Analysis
Theorem: Let h_S be the hypothesis returned by the perceptron algorithm for a sample S = (x_1, ..., x_T) drawn according to D and let M(S) be the number of updates defining h_S. Then,
E_{S∼D^m}[R(h_S)] ≤ E_{S∼D^{m+1}}[ min(M(S), R²_{m+1}/ρ²_{m+1}) / (m+1) ].
Proof: Let S ∼ D^{m+1} be a linearly separable sample and let x ∈ S. If h_{S−{x}} misclassifies x, then x must be a 'support vector' for h_S (update at x). Thus,
R̂_loo(perceptron) ≤ M(S)/(m+1).
Perceptron - Non-Separable Bound
(MM and Rostamizadeh, 2013)
Theorem: let I denote the set of rounds at which the Perceptron algorithm makes an update when processing x_1, ..., x_T and let M_T = |I|. Then,
M_T ≤ inf_{ρ>0, ‖u‖₂≤1} [ √(L_ρ(u)) + R/ρ ]²,
where R = max_{t∈I} ‖x_t‖ and L_ρ(u) = ∑_{t∈I} (1 − y_t(u · x_t)/ρ)_+.
• Proof: for any t, 1 − y_t(u · x_t)/ρ ≤ (1 − y_t(u · x_t)/ρ)_+; summing up these inequalities for t ∈ I yields:
M_T ≤ ∑_{t∈I} (1 − y_t(u · x_t)/ρ)_+ + ∑_{t∈I} y_t(u · x_t)/ρ
    ≤ L_ρ(u) + √(M_T) R/ρ,
by upper-bounding ∑_{t∈I} y_t(u · x_t) as in the proof for the separable case.
• Solving the second-degree inequality
M_T ≤ L_ρ(u) + √(M_T) R/ρ
gives
√(M_T) ≤ [ R/ρ + √(R²/ρ² + 4 L_ρ(u)) ] / 2 ≤ R/ρ + √(L_ρ(u)).
Non-Separable Case - L2 Bound
(Freund and Schapire, 1998; MM and Rostamizadeh, 2013)
Theorem: let I denote the set of rounds at which the Perceptron algorithm makes an update when processing x_1, ..., x_T and let M_T = |I|. Then,
M_T ≤ inf_{ρ>0, ‖u‖₂≤1} [ ‖L_ρ(u)‖₂/2 + ( ‖L_ρ(u)‖₂²/4 + √(∑_{t∈I} ‖x_t‖²)/ρ )^{1/2} ]².

• when ‖x_t‖ ≤ R for all t ∈ I, this implies
M_T ≤ inf_{ρ>0, ‖u‖₂≤1} ( R/ρ + ‖L_ρ(u)‖₂ )²,
where L_ρ(u) = ( (1 − y_t(u · x_t)/ρ)_+ )_{t∈I}.
• Proof: Reduce the problem to the separable case in higher dimension. Let l_t = (1 − y_t(u · x_t)/ρ)_+ 1_{t∈I}, for t ∈ [1, T].
• Mapping (similar to the trivial mapping): extend each x_t = (x_{t,1}, ..., x_{t,N})ᵀ to
x'_t = (x_{t,1}, ..., x_{t,N}, 0, ..., 0, Δ, 0, ..., 0)ᵀ ∈ R^{N+T},
with Δ placed in the (N+t)-th component, and map u to
u' = (u_1/Z, ..., u_N/Z, y_1 l_1 ρ/(ΔZ), ..., y_T l_T ρ/(ΔZ))ᵀ,
with ‖u'‖ = 1 ⟺ Z = √(1 + ρ² ‖L_ρ(u)‖₂²/Δ²).


• Observe that the Perceptron algorithm makes the same predictions and makes updates at the same rounds when processing x'_1, ..., x'_T.
• For any t ∈ I,
y_t(u' · x'_t) = y_t(u · x_t)/Z + y_t (y_t l_t ρ)/(ΔZ) · Δ
             = y_t(u · x_t)/Z + ρ l_t/Z
             = (1/Z) ( y_t(u · x_t) + [ρ − y_t(u · x_t)]_+ ) ≥ ρ/Z.
• Summing up and using the proof in the separable case yields:
M_T ρ/Z ≤ ∑_{t∈I} y_t(u' · x'_t) ≤ √(∑_{t∈I} ‖x'_t‖²) = √(∑_{t∈I} ‖x_t‖² + M_T Δ²).
• The inequality can be rewritten as
M_T² ≤ (Z²/ρ²)(r² + M_T Δ²) = (1/ρ²)(1 + ρ² ‖L_ρ(u)‖₂²/Δ²)(r² + M_T Δ²)
     = r²/ρ² + M_T Δ²/ρ² + r² ‖L_ρ(u)‖₂²/Δ² + M_T ‖L_ρ(u)‖₂²,
where r = √(∑_{t∈I} ‖x_t‖²).
• Selecting Δ to minimize the bound gives Δ² = ρ ‖L_ρ(u)‖₂ r/√(M_T) and leads to
M_T² ≤ r²/ρ² + 2 √(M_T) ‖L_ρ(u)‖₂ r/ρ + M_T ‖L_ρ(u)‖₂² = (r/ρ + √(M_T) ‖L_ρ(u)‖₂)².
• Solving the second-degree inequality
M_T − √(M_T) ‖L_ρ(u)‖₂ − r/ρ ≤ 0
yields directly the first statement. The second one results from replacing r with √(M_T) R.
Dual Perceptron Algorithm
Dual-Perceptron(α_0)
1  α ← α_0            (typically α_0 = 0)
2  for t ← 1 to T do
3      Receive(x_t)
4      ŷ_t ← sgn(∑_{s=1}^T α_s y_s (x_s · x_t))
5      Receive(y_t)
6      if ŷ_t ≠ y_t then
7          α_t ← α_t + 1
8  return α

Kernel Perceptron Algorithm
(Aizerman et al., 1964)

K a PDS kernel.
Kernel-Perceptron(α_0)
1  α ← α_0            (typically α_0 = 0)
2  for t ← 1 to T do
3      Receive(x_t)
4      ŷ_t ← sgn(∑_{s=1}^T α_s y_s K(x_s, x_t))
5      Receive(y_t)
6      if ŷ_t ≠ y_t then
7          α_t ← α_t + 1
8  return α
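A compact Python sketch of the dual/kernel form above (my illustration; with the default linear kernel it coincides with the dual perceptron of the previous slide):

```python
import numpy as np

def kernel_perceptron(X, y, kernel=lambda a, b: np.dot(a, b)):
    """Kernel perceptron: labels in {-1, +1}, kernel assumed PDS.

    alpha[s] counts the updates made at round s; the returned hypothesis is
    x -> sgn(sum_s alpha[s] * y[s] * kernel(X[s], x)).
    """
    T = len(X)
    alpha = np.zeros(T)
    for t in range(T):
        # only past rounds can have alpha[s] > 0, so summing over s < t suffices
        score = sum(alpha[s] * y[s] * kernel(X[s], X[t]) for s in range(t))
        y_hat = 1 if score >= 0 else -1
        if y_hat != y[t]:
            alpha[t] += 1
    return alpha
```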

Winnow Algorithm
(Littlestone, 1988)
Winnow(η)
 1  w_1 ← (1/N, ..., 1/N)
 2  for t ← 1 to T do
 3      Receive(x_t)
 4      ŷ_t ← sgn(w_t · x_t)            (y_t ∈ {−1, +1})
 5      Receive(y_t)
 6      if ŷ_t ≠ y_t then
 7          Z_t ← ∑_{i=1}^N w_{t,i} exp(η y_t x_{t,i})
 8          for i ← 1 to N do
 9              w_{t+1,i} ← w_{t,i} exp(η y_t x_{t,i}) / Z_t
10      else w_{t+1} ← w_t
11  return w_{T+1}
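A short Python sketch of Winnow as given above (my illustration; the learning rate η is left to the caller, and the bound later in the lecture suggests the choice η = ρ_∞/R_∞²):

```python
import numpy as np

def winnow(X, y, eta):
    """Winnow (Littlestone): multiplicative weight updates, labels in {-1, +1}.

    X: shape (T, N); weights are kept normalized on the simplex.
    Returns the final weights and the number of mistakes.
    """
    T, N = X.shape
    w = np.full(N, 1.0 / N)
    mistakes = 0
    for t in range(T):
        y_hat = 1 if np.dot(w, X[t]) >= 0 else -1
        if y_hat != y[t]:
            mistakes += 1
            w = w * np.exp(eta * y[t] * X[t])    # boost coordinates that agree with y_t
            w = w / w.sum()                       # normalize by Z_t
    return w, mistakes
```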

Notes
Winnow = weighted majority:
• for y_{t,i} = x_{t,i} ∈ {−1, +1}, sgn(w_t · x_t) coincides with the majority vote.
• multiplying the weights of correct or incorrect experts by e^η or e^{−η} is equivalent to multiplying the weights of incorrect ones by β = e^{−2η}.
Relationships with other algorithms: e.g., boosting and Perceptron (Winnow and Perceptron can be viewed as special instances of a general family).

Winnow Algorithm - Bound
Theorem: Assume that ‖x_t‖_∞ ≤ R_∞ for all t ∈ [1, T] and that for some ρ_∞ > 0 and v ∈ R^N, v ≥ 0, for all t ∈ [1, T],
ρ_∞ ≤ y_t (v · x_t)/‖v‖_1.
Then, the number of mistakes made by the Winnow algorithm is bounded by 2 (R_∞²/ρ_∞²) log N.
Proof: Let I be the set of rounds at which there is an update and let M be the total number of updates.
Notes
Comparison with perceptron bound:
• dual norms: L_∞ norm for x_t and L_1 norm for v.
• similar bounds with different norms.
• each advantageous in different cases:

Winnow bound favorable when a sparse set of experts can predict well. For example, if v = e_1 and x_t ∈ {±1}^N: log N vs N.

Perceptron favorable in the opposite situation.

Winnow Algorithm - Bound
Potential: Φ_t = ∑_{i=1}^N (v_i/‖v‖_1) log( (v_i/‖v‖_1) / w_{t,i} ).        (relative entropy)
Upper bound: for each t in I,
Φ_{t+1} − Φ_t = ∑_{i=1}^N (v_i/‖v‖_1) log(w_{t,i}/w_{t+1,i})
= ∑_{i=1}^N (v_i/‖v‖_1) log( Z_t / exp(η y_t x_{t,i}) )
= log Z_t − η ∑_{i=1}^N (v_i/‖v‖_1) y_t x_{t,i}
≤ log [ ∑_{i=1}^N w_{t,i} exp(η y_t x_{t,i}) ] − η ρ_∞
= log E_{w_t}[ exp(η y_t x_t) ] − η ρ_∞
≤ log [ exp(η² (2R_∞)²/8) ] + η y_t w_t · x_t − η ρ_∞        (Hoeffding)
≤ η² R_∞²/2 − η ρ_∞.        (y_t w_t · x_t ≤ 0 at an update)
Winnow Algorithm - Bound
Upper bound: summing up the inequalities yields
Φ_{T+1} − Φ_1 ≤ M (η² R_∞²/2 − η ρ_∞).
Lower bound: note that
Φ_1 = ∑_{i=1}^N (v_i/‖v‖_1) log( (v_i/‖v‖_1) / (1/N) ) = log N + ∑_{i=1}^N (v_i/‖v‖_1) log(v_i/‖v‖_1) ≤ log N
and for all t, Φ_t ≥ 0 (property of relative entropy).
Thus, Φ_{T+1} − Φ_1 ≥ 0 − log N = −log N.
Comparison: −log N ≤ M (η² R_∞²/2 − η ρ_∞). For η = ρ_∞/R_∞², we obtain
M ≤ 2 log N · R_∞²/ρ_∞².
Conclusion
On-line learning:
• wide and fast-growing literature.
• many related topics, e.g., game theory, text
compression, convex optimization.
• online to batch bounds and techniques.
• online version of batch algorithms, e.g.,
regression algorithms (see regression lecture).

References
• Aizerman, M. A., Braverman, E. M., & Rozonoer, L. I. (1964). Theoretical foundations of the
potential function method in pattern recognition learning. Automation and Remote Control,
25, 821-837.

• Nicolò Cesa-Bianchi, Alex Conconi, Claudio Gentile: On the Generalization Ability of On-
Line Learning Algorithms. IEEE Transactions on Information Theory 50(9): 2050-2057. 2004.

• Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, learning, and games. Cambridge
University Press, 2006.

• Yoav Freund and Robert Schapire. Large margin classification using the perceptron
algorithm. In Proceedings of COLT 1998. ACM Press, 1998.

• Nick Littlestone. From On-Line to Batch Learning. COLT 1989: 269-284.

• Nick Littlestone. "Learning Quickly When Irrelevant Attributes Abound: A New Linear-
threshold Algorithm" Machine Learning 285-318(2). 1988.

References
• Nick Littlestone, Manfred K. Warmuth: The Weighted Majority Algorithm. FOCS 1989:
256-261.

• Tom Mitchell. Machine Learning, McGraw Hill, 1997.

• Mehryar Mohri and Afshin Rostamizadeh. Perceptron Mistake Bounds. arXiv:1305.0208, 2013.

• Novikoff, A. B. (1962). On convergence proofs on perceptrons. Symposium on the Mathematical Theory of Automata, 12, 615-622. Polytechnic Institute of Brooklyn.

• Rosenblatt, Frank, The Perceptron: A Probabilistic Model for Information Storage and
Organization in the Brain, Cornell Aeronautical Laboratory, Psychological Review, v65, No. 6,
pp. 386-408, 1958.

• Vladimir N.Vapnik. Statistical Learning Theory. Wiley-Interscience, New York, 1998.

Appendix



SVMs - Leave-One-Out Analysis
(Vapnik, 1995)
Theorem: let h_S be the optimal hyperplane for a sample S and let N_SV(S) be the number of support vectors defining h_S. Then,
E_{S∼D^m}[R(h_S)] ≤ E_{S∼D^{m+1}}[ min(N_SV(S), R²_{m+1}/ρ²_{m+1}) / (m+1) ].
Proof: one part proven in lecture 4. The other part due to α_i ≥ 1/R²_{m+1} for x_i misclassified by SVMs.

Comparison
Bounds on expected error, not high probability
statements.
Leave-one-out bounds not sufficient to distinguish
SVMs and perceptron algorithm. Note however:
• same maximum margin ρ_{m+1} can be used in both.
• but different radius R_{m+1} of support vectors.
Difference: margin distribution.
