Lecture 8 (ML): On-Line Learning

This document summarizes key concepts in online machine learning. It discusses the motivation for online learning compared to PAC learning, including that online learning makes no distributional assumptions and allows for a worst-case analysis. It then covers prediction with expert advice, including the weighted majority algorithm and its mistake bound, which is O(log N) in the realizable case. Finally, it introduces the exponential weighted average algorithm and proves its regret bound of O(√(T log N)).

Foundations of Machine Learning

On-Line Learning

Mehryar Mohri
Courant Institute and Google Research
[email protected]
Motivation
PAC learning:
• distribution fixed over time (training and test).
• IID assumption.
On-line learning:
• no distributional assumption.
• worst-case analysis (adversarial).
• mixed training and test.
• Performance measure: mistake model, regret.

This Lecture
Prediction with expert advice
Linear classification

General On-Line Setting
For t = 1 to T do
• receive instance x_t ∈ X.
• predict ŷ_t ∈ Y.
• receive label y_t ∈ Y.
• incur loss L(ŷ_t, y_t).
Classification: Y = {0, 1}, L(y, y′) = |y′ − y|.
Regression: Y ⊆ R, L(y, y′) = (y′ − y)².
Objective: minimize total loss ∑_{t=1}^T L(ŷ_t, y_t).

Prediction with Expert Advice
For t = 1 to T do
• receive instance x_t ∈ X and advice y_{t,i} ∈ Y, i ∈ [1, N].
• predict ŷ_t ∈ Y.
• receive label y_t ∈ Y.
• incur loss L(ŷ_t, y_t).
Objective: minimize regret, i.e., the difference between the total loss incurred and that of the best expert:

Regret(T) = ∑_{t=1}^T L(ŷ_t, y_t) − min_{1≤i≤N} ∑_{t=1}^T L(y_{t,i}, y_t).
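The regret definition above is easy to compute directly. The following minimal Python sketch (an illustration added here, not part of the original slides; the function name and array shapes are my own choices) evaluates it for the 0/1 loss:

```python
import numpy as np

def regret(pred, expert_pred, labels):
    """Regret of a learner against the best single expert in hindsight (0/1 loss).

    pred:        shape (T,)   - learner's predictions ŷ_t
    expert_pred: shape (T, N) - expert advice y_{t,i}
    labels:      shape (T,)   - true labels y_t
    """
    learner_loss = np.sum(pred != labels)                           # ∑_t L(ŷ_t, y_t)
    expert_losses = np.sum(expert_pred != labels[:, None], axis=0)  # total loss of each expert
    return learner_loss - expert_losses.min()                       # subtract the best expert's loss
```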

Mistake Bound Model
Definition: the maximum number of mistakes a learning algorithm L makes to learn a concept c is defined by
M_L(c) = max_{x_1,...,x_T} |mistakes(L, c)|.
Definition: for any concept class C, the maximum number of mistakes a learning algorithm L makes is
M_L(C) = max_{c ∈ C} M_L(c).
A mistake bound is a bound M on M_L(C).

Halving Algorithm
see (Mitchell, 1997)

Halving(H)
1  H_1 ← H
2  for t ← 1 to T do
3      Receive(x_t)
4      ŷ_t ← MajorityVote(H_t, x_t)
5      Receive(y_t)
6      if ŷ_t ≠ y_t then
7          H_{t+1} ← {c ∈ H_t : c(x_t) = y_t}
8      else H_{t+1} ← H_t
9  return H_{T+1}
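A minimal executable sketch of the halving algorithm (my illustration, assuming the realizable setting and representing each hypothesis as a Python callable; the names are hypothetical):

```python
def halving(hypotheses, stream):
    """Halving algorithm: predict by majority vote over the current version space.

    hypotheses: list of callables h(x) -> label in {0, 1}
    stream:     iterable of (x_t, y_t) pairs, assumed consistent with some hypothesis
    Returns the surviving version space and the number of mistakes made.
    """
    H, mistakes = list(hypotheses), 0
    for x, y in stream:
        votes = sum(h(x) for h in H)
        y_hat = 1 if 2 * votes >= len(H) else 0      # majority vote (ties broken toward 1)
        if y_hat != y:
            mistakes += 1
            H = [h for h in H if h(x) == y]          # a mistake removes at least half of H
    return H, mistakes
```

Since the version space at least halves on every mistake, the mistake count of this sketch can never exceed log₂|H|, matching the bound on the next slide.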

Halving Algorithm - Bound
(Littlestone, 1988)
Theorem: Let H be a finite hypothesis set. Then,
M_Halving(H) ≤ log₂ |H|.
Proof: at each mistake, the hypothesis set is reduced at least by half.

VC Dimension Lower Bound
(Littlestone, 1988)
Theorem: Let opt(H) be the optimal mistake bound for H. Then,
VCdim(H) ≤ opt(H) ≤ M_Halving(H) ≤ log₂ |H|.
Proof: for a fully shattered set, form a complete binary tree of the mistakes with height VCdim(H).

Weighted Majority Algorithm
(Littlestone and Warmuth, 1988)
Weighted-Majority(N experts)        y_t, y_{t,i} ∈ {0, 1}, β ∈ [0, 1).
 1  for i ← 1 to N do
 2      w_{1,i} ← 1
 3  for t ← 1 to T do
 4      Receive(x_t)
 5      ŷ_t ← 1 if ∑_{i: y_{t,i}=1} w_{t,i} ≥ ∑_{i: y_{t,i}=0} w_{t,i}, else 0        (weighted majority vote)
 6      Receive(y_t)
 7      if ŷ_t ≠ y_t then
 8          for i ← 1 to N do
 9              if y_{t,i} ≠ y_t then
10                  w_{t+1,i} ← β w_{t,i}
11              else w_{t+1,i} ← w_{t,i}
12  return w_{T+1}
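Here is a short Python sketch of the algorithm above (my illustration; the function name, array shapes and the default β = 1/2 are assumptions, not from the slides):

```python
import numpy as np

def weighted_majority(expert_pred, labels, beta=0.5):
    """Weighted majority (Littlestone & Warmuth) for binary advice, beta in (0, 1).

    expert_pred: shape (T, N), entries in {0, 1};  labels: shape (T,), entries in {0, 1}.
    Returns the final weights and the number of mistakes made by the algorithm.
    """
    T, N = expert_pred.shape
    w = np.ones(N)
    mistakes = 0
    for t in range(T):
        advice = expert_pred[t]
        y_hat = 1 if w[advice == 1].sum() >= w[advice == 0].sum() else 0   # weighted majority vote
        if y_hat != labels[t]:
            mistakes += 1
            w[advice != labels[t]] *= beta        # penalize the experts that were wrong
    return w, mistakes
```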
Weighted Majority - Bound
Theorem: Let m_t be the number of mistakes made by the WM algorithm up to time t and m_t* that of the best expert. Then, for all t,
m_t ≤ (log N + m_t* log(1/β)) / log(2/(1+β)).

• Thus, m_t ≤ O(log N) + constant × (mistakes of best expert).
• Realizable case: m_t ≤ O(log N).
• Halving algorithm: β = 0.
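To make the bound concrete, here is a small sketch (my addition; the sample values are chosen only for illustration) that evaluates the right-hand side of the theorem:

```python
from math import log

def wm_mistake_bound(N, m_star, beta):
    """Evaluate (log N + m* log(1/beta)) / log(2/(1+beta)) for 0 < beta < 1."""
    return (log(N) + m_star * log(1.0 / beta)) / log(2.0 / (1.0 + beta))

# e.g. N = 100 experts, best expert makes m* = 10 mistakes, beta = 1/2:
print(wm_mistake_bound(100, 10, 0.5))   # ≈ 40.1 mistakes at most
```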
Weighted Majority - Proof
Potential: Φ_t = ∑_{i=1}^N w_{t,i}.
Upper bound: after each error,
Φ_{t+1} ≤ [1/2 + (1/2)β] Φ_t = ((1+β)/2) Φ_t.
Thus, Φ_t ≤ ((1+β)/2)^{m_t} N.
Lower bound: for any expert i, Φ_t ≥ w_{t,i} = β^{m_{t,i}}.
Comparison:
β^{m_t*} ≤ ((1+β)/2)^{m_t} N
⇒ m_t* log β ≤ log N + m_t log((1+β)/2)
⇒ m_t log(2/(1+β)) ≤ log N + m_t* log(1/β).
Weighted Majority - Notes
Advantage: remarkable bound requiring no
assumption.
Disadvantage: no deterministic algorithm can achieve a regret R_T = o(T) with the binary loss.
• better guarantee with randomized WM.
• better guarantee for WM with convex losses.

Exponential Weighted Average
Algorithm:
• weight update: w_{t+1,i} ← w_{t,i} e^{−η L(y_{t,i}, y_t)} = e^{−η L_{t,i}}, where L_{t,i} is the total loss incurred by expert i up to time t.
• prediction: ŷ_t = ∑_{i=1}^N w_{t,i} y_{t,i} / ∑_{i=1}^N w_{t,i}.

Theorem: assume that L is convex in its first argument and takes values in [0, 1]. Then, for any η > 0 and any sequence y_1, ..., y_T ∈ Y, the regret at T satisfies
Regret(T) ≤ (log N)/η + ηT/8.
For η = √(8 log N / T),
Regret(T) ≤ √((T/2) log N).
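A minimal sketch of the exponentially weighted average forecaster (my illustration; the function name and the default squared loss, which is convex in its first argument and bounded by 1 for predictions and outcomes in [0, 1], are assumptions):

```python
import numpy as np

def exponential_weighted_average(expert_pred, labels, eta,
                                 loss=lambda p, y: (p - y) ** 2):
    """Exponentially weighted average forecaster for losses in [0, 1].

    expert_pred: shape (T, N) expert predictions;  labels: shape (T,) outcomes;
    eta: learning rate, e.g. sqrt(8 * log(N) / T).
    Returns the learner's predictions and its cumulative loss.
    """
    T, N = expert_pred.shape
    cum_loss = np.zeros(N)                    # L_{t,i}: cumulative loss of each expert
    preds, total = [], 0.0
    for t in range(T):
        w = np.exp(-eta * (cum_loss - cum_loss.min()))     # weights (shifted for numerical stability)
        y_hat = np.dot(w, expert_pred[t]) / w.sum()        # weighted-average prediction
        preds.append(y_hat)
        total += loss(y_hat, labels[t])
        cum_loss += loss(expert_pred[t], labels[t])        # update every expert's loss
    return np.array(preds), total
```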
Exponential Weighted Avg - Proof
Potential: Φ_t = log ∑_{i=1}^N w_{t,i}.
Upper bound:
Φ_t − Φ_{t−1} = log [ ∑_{i=1}^N w_{t−1,i} e^{−η L(y_{t,i}, y_t)} / ∑_{i=1}^N w_{t−1,i} ]
= log E_{w_{t−1}}[ e^{−η L(y_{t,i}, y_t)} ]
= log E_{w_{t−1}}[ exp( −η (L(y_{t,i}, y_t) − E_{w_{t−1}}[L(y_{t,i}, y_t)]) − η E_{w_{t−1}}[L(y_{t,i}, y_t)] ) ]
≤ −η E_{w_{t−1}}[L(y_{t,i}, y_t)] + η²/8          (Hoeffding's ineq.)
≤ −η L(E_{w_{t−1}}[y_{t,i}], y_t) + η²/8          (convexity of first arg. of L)
= −η L(ŷ_t, y_t) + η²/8.

Exponential Weighted Avg - Proof
Upper bound: summing up the inequalities yields
Φ_{T+1} − Φ_1 ≤ −η ∑_{t=1}^T L(ŷ_t, y_t) + η²T/8.
Lower bound:
Φ_{T+1} − Φ_1 = log ∑_{i=1}^N e^{−η L_{T,i}} − log N ≥ log max_{1≤i≤N} e^{−η L_{T,i}} − log N = −η min_{1≤i≤N} L_{T,i} − log N.
Comparison:
−η min_{1≤i≤N} L_{T,i} − log N ≤ −η ∑_{t=1}^T L(ŷ_t, y_t) + η²T/8
⇒ ∑_{t=1}^T L(ŷ_t, y_t) − min_{1≤i≤N} L_{T,i} ≤ (log N)/η + ηT/8.
Exponential Weighted Avg - Notes
Advantage: the bound on the regret per round is of the form R_T/T = O(√(log(N)/T)).
Disadvantage: the choice of η requires knowledge of the horizon T.

Doubling Trick
Idea: divide time into periods [2^k, 2^{k+1} − 1] of length 2^k with k = 0, ..., n, T ≥ 2^n − 1, and choose η_k = √(8 log N / 2^k) in each period.
Theorem: with the same assumptions as before, for any T, the following holds:

Regret(T) ≤ (√2/(√2 − 1)) √((T/2) log N) + √((log N)/2).
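The period structure is easy to write down explicitly. Below is a small sketch (my addition; the helper name is hypothetical) that lists the periods and the learning rate η_k used in each; a fresh run of the exponentially weighted average forecaster would be started at the beginning of each period with the listed η_k:

```python
from math import sqrt, log

def doubling_trick_schedule(T, N):
    """Periods I_k = [2^k, 2^(k+1) - 1] covering rounds 1..T, with eta_k = sqrt(8 log N / 2^k)."""
    schedule, k = [], 0
    while 2 ** k <= T:
        start, end = 2 ** k, min(2 ** (k + 1) - 1, T)
        schedule.append((start, end, sqrt(8 * log(N) / 2 ** k)))
        k += 1
    return schedule
```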

Doubling Trick - Proof
By the previous theorem, for any I_k = [2^k, 2^{k+1} − 1],
L_{I_k} − min_{1≤i≤N} L_{I_k,i} ≤ √((2^k/2) log N).
Thus,
L_T = ∑_{k=0}^n L_{I_k} ≤ ∑_{k=0}^n min_{1≤i≤N} L_{I_k,i} + ∑_{k=0}^n √(2^k (log N)/2)
    ≤ min_{1≤i≤N} L_{T,i} + ∑_{k=0}^n 2^{k/2} √((log N)/2),
with
∑_{k=0}^n 2^{k/2} = (2^{(n+1)/2} − 1)/(√2 − 1) ≤ (√2 √(T+1) − 1)/(√2 − 1) ≤ (√2 (√T + 1) − 1)/(√2 − 1) = (√2/(√2 − 1)) √T + 1.
Notes
Doubling trick used in a variety of other contexts and proofs.
More general method, learning parameter as a function of time: η_t = √((8 log N)/t). Constant factor improvement:

Regret(T) ≤ 2√((T/2) log N) + √((1/8) log N).

This Lecture
Prediction with expert advice
Linear classification

Perceptron Algorithm
(Rosenblatt, 1958)

Perceptron(w_0)
1  w_1 ← w_0            (typically w_0 = 0)
2  for t ← 1 to T do
3      Receive(x_t)
4      ŷ_t ← sgn(w_t · x_t)
5      Receive(y_t)
6      if ŷ_t ≠ y_t then
7          w_{t+1} ← w_t + y_t x_t        (more generally w_t + η y_t x_t, η > 0)
8      else w_{t+1} ← w_t
9  return w_{T+1}
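A runnable Python sketch of the algorithm above (my illustration; labels are taken in {−1, +1} and sgn(0) is treated as +1, a convention not fixed by the slides):

```python
import numpy as np

def perceptron(X, y, eta=1.0, w0=None):
    """Perceptron (Rosenblatt): one online pass over (X, y), labels in {-1, +1}.

    X: shape (T, d);  y: shape (T,).
    Returns the final weight vector and the rounds at which updates were made.
    """
    T, d = X.shape
    w = np.zeros(d) if w0 is None else np.asarray(w0, dtype=float).copy()
    updates = []
    for t in range(T):
        y_hat = 1 if np.dot(w, X[t]) >= 0 else -1       # sgn(w_t · x_t)
        if y_hat != y[t]:
            w = w + eta * y[t] * X[t]                    # w_{t+1} = w_t + η y_t x_t
            updates.append(t)
    return w, updates
```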

Separating Hyperplane
Margin and errors
[Figure: two separating hyperplanes w·x = 0 with margin ρ; the signed margin of a point x_i is y_i(w · x_i)/‖w‖.]
Perceptron = Stochastic Gradient Descent
Objective function: convex but not differentiable.
F(w) = (1/T) ∑_{t=1}^T max(0, −y_t (w · x_t)) = E_{x∼D̂}[f(w, x)],
with f(w, x) = max(0, −y (w · x)).

Stochastic gradient: for each x_t, the update is
w_{t+1} ← w_t − η ∇_w f(w_t, x_t)   if f is differentiable at w_t,
w_{t+1} ← w_t                        otherwise,
where η > 0 is a learning rate parameter.
Here:
w_{t+1} ← w_t + η y_t x_t   if y_t (w_t · x_t) < 0,
w_{t+1} ← w_t               otherwise.
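The correspondence can be spelled out in one function (a sketch with my own naming): the perceptron update is exactly a stochastic (sub)gradient step on f(w, x) = max(0, −y (w · x)):

```python
import numpy as np

def perceptron_sgd_step(w, x, y, eta=1.0):
    """One stochastic (sub)gradient step on f(w, x) = max(0, -y * (w . x))."""
    if y * np.dot(w, x) < 0:        # f is differentiable here with gradient -y * x
        return w + eta * y * x      # w - eta * grad  =  perceptron update
    return w                        # choose the zero subgradient otherwise
```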
Perceptron Algorithm - Bound
(Novikoff, 1962)
Theorem: Assume that ‖x_t‖ ≤ R for all t ∈ [1, T] and that for some ρ > 0 and v ∈ R^N, for all t ∈ [1, T],
ρ ≤ y_t (v · x_t)/‖v‖.
Then, the number of mistakes made by the perceptron algorithm is bounded by R²/ρ².
Proof: Let I be the set of rounds at which there is an update and let M be the total number of updates.
• Summing up the assumption inequalities gives:
Mρ ≤ v · ∑_{t∈I} y_t x_t / ‖v‖
   = v · ∑_{t∈I} (w_{t+1} − w_t) / ‖v‖        (definition of updates)
   = v · w_{T+1} / ‖v‖
   ≤ ‖w_{T+1}‖                                 (Cauchy-Schwarz ineq.)
   = ‖w_{t_m} + y_{t_m} x_{t_m}‖               (t_m largest t in I)
   = [ ‖w_{t_m}‖² + ‖x_{t_m}‖² + 2 y_{t_m} w_{t_m} · x_{t_m} ]^{1/2}      (last term ≤ 0)
   ≤ [ ‖w_{t_m}‖² + R² ]^{1/2}
   ≤ [ M R² ]^{1/2} = √M R.                     (applying the same to previous ts in I)
Thus, M ≤ R²/ρ².
• Notes:
• bound independent of dimension and tight.
• convergence can be slow for small margin: it can be in Ω(2^N).
• among the many variants: voted perceptron algorithm. Predict according to
  sgn( (∑_{t∈I} c_t w_t) · x ),
  where c_t is the number of iterations w_t survives.
• {x_t : t ∈ I} are the support vectors for the perceptron algorithm.
• non-separable case: does not converge.

Perceptron - Leave-One-Out Analysis
Theorem: Let h_S be the hypothesis returned by the perceptron algorithm for a sample S = (x_1, ..., x_T) drawn according to D and let M(S) be the number of updates defining h_S. Then,
E_{S∼D^m}[R(h_S)] ≤ E_{S∼D^{m+1}}[ min(M(S), R²_{m+1}/ρ²_{m+1}) / (m+1) ].
Proof: Let S ∼ D^{m+1} be a linearly separable sample and let x ∈ S. If h_{S−{x}} misclassifies x, then x must be a 'support vector' for h_S (update at x). Thus,
R̂_loo(perceptron) ≤ M(S)/(m+1).
Perceptron - Non-Separable Bound
(MM and Rostamizadeh, 2013)
Theorem: let I denote the set of rounds at which the Perceptron algorithm makes an update when processing x_1, ..., x_T and let M_T = |I|. Then,
M_T ≤ inf_{ρ>0, ‖u‖₂≤1} [ √(L_ρ(u)) + R/ρ ]²,
where R = max_{t∈I} ‖x_t‖ and L_ρ(u) = ∑_{t∈I} (1 − y_t(u · x_t)/ρ)_+.
• Proof: for any t, 1 − y_t(u · x_t)/ρ ≤ (1 − y_t(u · x_t)/ρ)_+; summing up these inequalities for t ∈ I yields:
M_T ≤ ∑_{t∈I} (1 − y_t(u · x_t)/ρ)_+ + ∑_{t∈I} y_t(u · x_t)/ρ
    ≤ L_ρ(u) + √(M_T) R/ρ,
by upper-bounding ∑_{t∈I} y_t(u · x_t) as in the proof for the separable case.
• Solving the second-degree inequality
M_T ≤ L_ρ(u) + √(M_T) R/ρ
gives
√(M_T) ≤ [ R/ρ + √(R²/ρ² + 4 L_ρ(u)) ] / 2 ≤ R/ρ + √(L_ρ(u)).
Non-Separable Case - L2 Bound
(Freund and Schapire, 1998; MM and Rostamizadeh, 2013)
Theorem: let I denote the set of rounds at which the Perceptron algorithm makes an update when processing x_1, ..., x_T and let M_T = |I|. Then,
M_T ≤ inf_{ρ>0, ‖u‖₂≤1} [ ‖L_ρ(u)‖₂/2 + ( ‖L_ρ(u)‖₂²/4 + √(∑_{t∈I} ‖x_t‖²)/ρ )^{1/2} ]².

• when ‖x_t‖ ≤ R for all t ∈ I, this implies
M_T ≤ inf_{ρ>0, ‖u‖₂≤1} ( R/ρ + ‖L_ρ(u)‖₂ )²,
where L_ρ(u) = ( (1 − y_t(u · x_t)/ρ)_+ )_{t∈I}.
• Proof: Reduce the problem to the separable case in higher dimension. Let l_t = (1 − y_t(u · x_t)/ρ)_+ 1_{t∈I}, for t ∈ [1, T].
• Mapping (similar to the trivial mapping): extend each x_t = (x_{t,1}, ..., x_{t,N})ᵀ to
x'_t = (x_{t,1}, ..., x_{t,N}, 0, ..., 0, Δ, 0, ..., 0)ᵀ ∈ R^{N+T},
with Δ placed in the (N+t)-th component, and map u to
u' = (u_1/Z, ..., u_N/Z, y_1 l_1 ρ/(ΔZ), ..., y_T l_T ρ/(ΔZ))ᵀ,
with ‖u'‖ = 1 ⟺ Z = √(1 + ρ² ‖L_ρ(u)‖₂²/Δ²).


• Observe that the Perceptron algorithm makes the same predictions and makes updates at the same rounds when processing x'_1, ..., x'_T.
• For any t ∈ I,
y_t(u' · x'_t) = y_t(u · x_t)/Z + y_t (y_t l_t ρ)/(ΔZ) · Δ
             = y_t(u · x_t)/Z + ρ l_t/Z
             = (1/Z) ( y_t(u · x_t) + [ρ − y_t(u · x_t)]_+ ) ≥ ρ/Z.
• Summing up and using the proof in the separable case yields:
M_T ρ/Z ≤ ∑_{t∈I} y_t(u' · x'_t) ≤ √(∑_{t∈I} ‖x'_t‖²) = √(∑_{t∈I} ‖x_t‖² + M_T Δ²).
• The inequality can be rewritten as
M_T² ≤ (Z²/ρ²)(r² + M_T Δ²) = (1/ρ²)(1 + ρ² ‖L_ρ(u)‖₂²/Δ²)(r² + M_T Δ²)
     = r²/ρ² + M_T Δ²/ρ² + r² ‖L_ρ(u)‖₂²/Δ² + M_T ‖L_ρ(u)‖₂²,
where r = √(∑_{t∈I} ‖x_t‖²).
• Selecting Δ to minimize the bound gives Δ² = ρ ‖L_ρ(u)‖₂ r/√(M_T) and leads to
M_T² ≤ r²/ρ² + 2 √(M_T) ‖L_ρ(u)‖₂ r/ρ + M_T ‖L_ρ(u)‖₂² = (r/ρ + √(M_T) ‖L_ρ(u)‖₂)².
• Solving the second-degree inequality
M_T − √(M_T) ‖L_ρ(u)‖₂ − r/ρ ≤ 0
yields directly the first statement. The second one results from replacing r with √(M_T) R.
Dual Perceptron Algorithm
Dual-Perceptron(α_0)
1  α ← α_0            (typically α_0 = 0)
2  for t ← 1 to T do
3      Receive(x_t)
4      ŷ_t ← sgn(∑_{s=1}^T α_s y_s (x_s · x_t))
5      Receive(y_t)
6      if ŷ_t ≠ y_t then
7          α_t ← α_t + 1
8  return α

Kernel Perceptron Algorithm
(Aizerman et al., 1964)

K a PDS kernel.
Kernel-Perceptron(α_0)
1  α ← α_0            (typically α_0 = 0)
2  for t ← 1 to T do
3      Receive(x_t)
4      ŷ_t ← sgn(∑_{s=1}^T α_s y_s K(x_s, x_t))
5      Receive(y_t)
6      if ŷ_t ≠ y_t then
7          α_t ← α_t + 1
8  return α
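A compact Python sketch of the dual/kernel form above (my illustration; with the default linear kernel it coincides with the dual perceptron of the previous slide):

```python
import numpy as np

def kernel_perceptron(X, y, kernel=lambda a, b: np.dot(a, b)):
    """Kernel perceptron: labels in {-1, +1}, kernel assumed PDS.

    alpha[s] counts the updates made at round s; the returned hypothesis is
    x -> sgn(sum_s alpha[s] * y[s] * kernel(X[s], x)).
    """
    T = len(X)
    alpha = np.zeros(T)
    for t in range(T):
        # only past rounds can have alpha[s] > 0, so summing over s < t suffices
        score = sum(alpha[s] * y[s] * kernel(X[s], X[t]) for s in range(t))
        y_hat = 1 if score >= 0 else -1
        if y_hat != y[t]:
            alpha[t] += 1
    return alpha
```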

Winnow Algorithm
(Littlestone, 1988)
Winnow(η)
 1  w_1 ← (1/N, ..., 1/N)
 2  for t ← 1 to T do
 3      Receive(x_t)
 4      ŷ_t ← sgn(w_t · x_t)            (y_t ∈ {−1, +1})
 5      Receive(y_t)
 6      if ŷ_t ≠ y_t then
 7          Z_t ← ∑_{i=1}^N w_{t,i} exp(η y_t x_{t,i})
 8          for i ← 1 to N do
 9              w_{t+1,i} ← w_{t,i} exp(η y_t x_{t,i}) / Z_t
10      else w_{t+1} ← w_t
11  return w_{T+1}
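A short Python sketch of Winnow as given above (my illustration; the learning rate η is left to the caller, and the bound later in the lecture suggests the choice η = ρ_∞/R_∞²):

```python
import numpy as np

def winnow(X, y, eta):
    """Winnow (Littlestone): multiplicative weight updates, labels in {-1, +1}.

    X: shape (T, N); weights are kept normalized on the simplex.
    Returns the final weights and the number of mistakes.
    """
    T, N = X.shape
    w = np.full(N, 1.0 / N)
    mistakes = 0
    for t in range(T):
        y_hat = 1 if np.dot(w, X[t]) >= 0 else -1
        if y_hat != y[t]:
            mistakes += 1
            w = w * np.exp(eta * y[t] * X[t])    # boost coordinates that agree with y_t
            w = w / w.sum()                       # normalize by Z_t
    return w, mistakes
```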

Notes
Winnow = weighted majority:
• for y_{t,i} = x_{t,i} ∈ {−1, +1}, sgn(w_t · x_t) coincides with the majority vote.
• multiplying the weights of correct or incorrect experts by e^η or e^{−η} is equivalent to multiplying the weights of incorrect ones by β = e^{−2η}.
Relationships with other algorithms: e.g., boosting and Perceptron (Winnow and Perceptron can be viewed as special instances of a general family).

Winnow Algorithm - Bound
Theorem: Assume that ‖x_t‖_∞ ≤ R_∞ for all t ∈ [1, T] and that for some ρ_∞ > 0 and v ∈ R^N, v ≥ 0, for all t ∈ [1, T],
ρ_∞ ≤ y_t (v · x_t)/‖v‖_1.
Then, the number of mistakes made by the Winnow algorithm is bounded by 2 (R_∞²/ρ_∞²) log N.
Proof: Let I be the set of rounds at which there is an update and let M be the total number of updates.
Notes
Comparison with perceptron bound:
• dual norms: L_∞ norm for x_t and L_1 norm for v.
• similar bounds with different norms.
• each advantageous in different cases:

Winnow bound favorable when a sparse set of experts can predict well. For example, if v = e_1 and x_t ∈ {±1}^N: log N vs N.

Perceptron favorable in the opposite situation.

Winnow Algorithm - Bound
Potential: Φ_t = ∑_{i=1}^N (v_i/‖v‖_1) log( (v_i/‖v‖_1) / w_{t,i} ).        (relative entropy)
Upper bound: for each t in I,
Φ_{t+1} − Φ_t = ∑_{i=1}^N (v_i/‖v‖_1) log(w_{t,i}/w_{t+1,i})
= ∑_{i=1}^N (v_i/‖v‖_1) log( Z_t / exp(η y_t x_{t,i}) )
= log Z_t − η ∑_{i=1}^N (v_i/‖v‖_1) y_t x_{t,i}
≤ log [ ∑_{i=1}^N w_{t,i} exp(η y_t x_{t,i}) ] − η ρ_∞
= log E_{w_t}[ exp(η y_t x_t) ] − η ρ_∞
≤ log [ exp(η² (2R_∞)²/8) ] + η y_t w_t · x_t − η ρ_∞        (Hoeffding)
≤ η² R_∞²/2 − η ρ_∞.        (y_t w_t · x_t ≤ 0 at an update)
Winnow Algorithm - Bound
Upper bound: summing up the inequalities yields
Φ_{T+1} − Φ_1 ≤ M (η² R_∞²/2 − η ρ_∞).
Lower bound: note that
Φ_1 = ∑_{i=1}^N (v_i/‖v‖_1) log( (v_i/‖v‖_1) / (1/N) ) = log N + ∑_{i=1}^N (v_i/‖v‖_1) log(v_i/‖v‖_1) ≤ log N
and for all t, Φ_t ≥ 0 (property of relative entropy).
Thus, Φ_{T+1} − Φ_1 ≥ 0 − log N = −log N.
Comparison: −log N ≤ M (η² R_∞²/2 − η ρ_∞). For η = ρ_∞/R_∞², we obtain
M ≤ 2 log N · R_∞²/ρ_∞².
Conclusion
On-line learning:
• wide and fast-growing literature.
• many related topics, e.g., game theory, text
compression, convex optimization.
• online to batch bounds and techniques.
• online version of batch algorithms, e.g.,
regression algorithms (see regression lecture).

References
• Aizerman, M. A., Braverman, E. M., & Rozonoer, L. I. (1964). Theoretical foundations of the
potential function method in pattern recognition learning. Automation and Remote Control,
25, 821-837.

• Nicolò Cesa-Bianchi, Alex Conconi, Claudio Gentile: On the Generalization Ability of On-
Line Learning Algorithms. IEEE Transactions on Information Theory 50(9): 2050-2057. 2004.

• Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, learning, and games. Cambridge
University Press, 2006.

• Yoav Freund and Robert Schapire. Large margin classification using the perceptron
algorithm. In Proceedings of COLT 1998. ACM Press, 1998.

• Nick Littlestone. From On-Line to Batch Learning. COLT 1989: 269-284.

• Nick Littlestone. "Learning Quickly When Irrelevant Attributes Abound: A New Linear-
threshold Algorithm" Machine Learning 285-318(2). 1988.

References
• Nick Littlestone, Manfred K. Warmuth: The Weighted Majority Algorithm. FOCS 1989:
256-261.

• Tom Mitchell. Machine Learning, McGraw Hill, 1997.

• Mehryar Mohri and Afshin Rostamizadeh. Perceptron Mistake Bounds. arXiv:1305.0208, 2013.

• Novikoff, A. B. (1962). On convergence proofs on perceptrons. Symposium on the Mathematical Theory of Automata, 12, 615-622. Polytechnic Institute of Brooklyn.

• Rosenblatt, Frank, The Perceptron: A Probabilistic Model for Information Storage and
Organization in the Brain, Cornell Aeronautical Laboratory, Psychological Review, v65, No. 6,
pp. 386-408, 1958.

• Vladimir N.Vapnik. Statistical Learning Theory. Wiley-Interscience, New York, 1998.

Appendix



SVMs - Leave-One-Out Analysis
(Vapnik, 1995)
Theorem: let h_S be the optimal hyperplane for a sample S and let N_SV(S) be the number of support vectors defining h_S. Then,
E_{S∼D^m}[R(h_S)] ≤ E_{S∼D^{m+1}}[ min(N_SV(S), R²_{m+1}/ρ²_{m+1}) / (m+1) ].
Proof: one part proven in lecture 4. The other part due to α_i ≥ 1/R²_{m+1} for x_i misclassified by SVMs.

Comparison
Bounds on expected error, not high probability
statements.
Leave-one-out bounds not sufficient to distinguish
SVMs and perceptron algorithm. Note however:
• same maximum margin ρ_{m+1} can be used in both.
• but different radius R_{m+1} of support vectors.
Difference: margin distribution.
