
8.

Online learning
COMP0078: Supervised Learning

Carlo Ciliberto
(Slides thanks to Mark Herbster)

University College London


Department of Computer Science

Batch versus Online learning

Batch
Model: There exists training data set (sampled IID)
Aim: To build a classifier from the training data that predicts well on
new data (from same distribution)
Evaluation metric: Generalization error

Online
Model: There exists an online sequence of data (usually no
distributional assumptions)
Aim: To sequentially predict and update a classifier to predict well on
the sequence (i.e. there is no training and test set distinction)
Evaluation metric: Cumulative error

Note
There are a variety of models for online learning. Here we focus on the so-called worst-case model. Alternatively, distributional assumptions may be made on the data sequence. Sometimes the phrase “online learning” is also used to refer to “online optimisation”, that is, using online-learning-type algorithms as a training method for a batch classifier.
Why online learning?

Pragmatically
• “Often” fast algorithms
• “Often” small memory footprint
• “Often” no “statistical” assumptions required e.g. IID-ness
• As a training method for “BIG DATA” batch classifiers

Theoretically (learning performance guarantees)


• Non-asymptotic
• No statistical assumptions
• There exist techniques to convert cumulative error guarantees to
generalisation error guarantees

Today

Our focus today is on three foundational online “hypotheses” classes.

• Learning with experts


1. Halving algorithm
2. Weighted Majority algorithm
3. Refining and generalising the experts model
• Learning linear classifiers
• Perceptron

Experts

Part I
Learning with Expert Advice

On-Line Learning with expert advice (1) [V90,LW94,HKW98]

Model: There exists an online sequence of data


S = Sm = {(x1, y1), . . . , (xm, ym)} with (xt, yt) ∈ {0, 1}^n × {0, 1}.
Interpretation: The vector xt is the set of predictions from n experts
about an outcome yt, where expert i predicts xt,i ∈ {0, 1} at time t.
Each expert at time t is aiming to predict yt.
What is an “expert”? Example: human experts or the predictions of n
separate algorithms.

         E1    E2    E3   ...   En | prediction | true label | loss
day 1     1     1     0         0  |     0      |     1      |  1
day 2     1     0     1         0  |     1      |     0      |  1
day 3     0     1     1         1  |     1      |     1      |  0
day t   xt,1  xt,2  xt,3      xt,n |     ŷt     |     yt     | |yt − ŷt|

Goal: Find a “Master” algorithm to combine the predictions xt of the n
experts (based on past performance) to predict ŷt, an estimate of yt.
On-Line Learning with experts (2)
Protocol of the Master Algorithm

For t = 1, . . . , m:
Get instance xt ∈ {0, 1}n
Predict ŷt ∈ {0, 1}
Get label yt ∈ {0, 1}
Incur loss (mistakes) |yt − ŷt |

Evaluation metric: The loss (mistakes) of Master Algorithm A on
sequence S is just

LA(S) := Σ_{t=1}^{m} |yt − ŷt|

where ŷt = A(St−1)(xt) is the output of the online algorithm A trained
(online) on St−1 and evaluated on xt.
Our Goal: Design master algorithms with “small loss”.
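To make the protocol concrete, here is a minimal Python sketch of the interaction loop. The function and argument names (run_protocol, master_predict, the list-based data layout) are illustrative assumptions, standing in for any master algorithm considered below.

def run_protocol(expert_predictions, labels, master_predict):
    """expert_predictions: length-m list of rows x_t in {0,1}^n;
    labels: list of y_t in {0,1}. Returns the cumulative loss L_A(S)."""
    loss = 0
    history = []                                 # S_{t-1}: examples seen so far
    for x_t, y_t in zip(expert_predictions, labels):
        y_hat = master_predict(history, x_t)     # predict before seeing y_t
        loss += abs(y_t - y_hat)                 # 0/1 mistake loss
        history.append((x_t, y_t))
    return loss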
Special Case: The (Unknown) “Perfect” Expert

Let’s consider the special setting where there exists at least one expert
that is never wrong...

...how “fast” could we find them?

A Solution: Halving Algorithm

(figure: the pool of all experts splits into consistent and inconsistent experts; the master predicts with the majority of the consistent ones)

The master algorithm:

• Keeps track of only the consistent experts


(those that never made a mistake so far)
• Predicts according to the majority vote
• Eliminates wrong experts after each prediction.

Question: How many mistakes does it make, at most?

A run of the Halving Algorithm

E1 E2 E3 E4 E5 E6 E7 E8 ŷ y loss
1 1 0 0 1 1 0 0 1 0 1
x x 0 1 x x 1 1 1 1 0
x x x 1 x x 0 0 0 1 1
x x x ↑ x x x x
consistent

Question: How many mistakes does it make, at most?

Answer: For any sequence with a consistent expert, the Halving
Algorithm makes ≤ log2(n) mistakes.
Exercise: Prove this!
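A minimal Python sketch of the Halving Algorithm (illustrative; it assumes binary predictions and, as in this setting, at least one consistent expert):

def halving(expert_predictions, labels):
    """Run the Halving Algorithm; returns the number of master mistakes."""
    n = len(expert_predictions[0])
    consistent = set(range(n))        # experts with no mistakes so far
    mistakes = 0
    for x_t, y_t in zip(expert_predictions, labels):
        votes_for_1 = sum(x_t[i] for i in consistent)
        y_hat = 1 if 2 * votes_for_1 >= len(consistent) else 0  # majority vote
        mistakes += int(y_hat != y_t)
        consistent = {i for i in consistent if x_t[i] == y_t}   # drop wrong experts
    return mistakes

Note that on every master mistake at least half of the consistent experts voted with the master and are eliminated, which is the heart of the ≤ log2(n) argument the exercise asks for.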
What if no expert is consistent?
Notation

• Recall LA(S) := Σ_{t=1}^{m} |yt − ŷt| is the loss of algorithm A on S
• Denote the loss of the i-th expert Ei as

Li(S) := Σ_{t=1}^{m} |yt − xt,i|

Aim
Bounds of the form:

∀S : LA(S) ≤ a · min_{i} Li(S) + b · log(n)

where min_i Li(S) is the loss of the best expert and a, b are “small” constants.

Comment: These are known as “Regret” or “Worst-case” loss bounds, i.e., bounds
that hold in any case (even the “worst-case”).
A Solution: Weighted Majority Algorithm [LW94]

The Master algorithm:

• Can’t eliminate experts!
• Keeps track of how reliable each expert is
  (by keeping track of a weight wi for each expert)

(figure: all experts vote with their weight; the weighted vote splits into a “predict 0” side and a “predict 1” side)

• Predicts according to the larger (weighted) vote
• Weights of wrong experts are multiplied by β ∈ [0, 1)
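A minimal Python sketch of the Weighted Majority algorithm [LW94] (illustrative; ties are broken towards predicting 1, which does not affect the analysis):

def weighted_majority(expert_predictions, labels, beta=0.5):
    """Weighted vote over experts; wrong experts are penalised by beta."""
    n = len(expert_predictions[0])
    w = [1.0] * n                     # w_{1,i} = 1
    mistakes = 0
    for x_t, y_t in zip(expert_predictions, labels):
        vote_1 = sum(w[i] for i in range(n) if x_t[i] == 1)
        vote_0 = sum(w[i] for i in range(n) if x_t[i] == 0)
        y_hat = 1 if vote_1 >= vote_0 else 0
        mistakes += int(y_hat != y_t)
        for i in range(n):            # multiplicative penalty for wrong experts
            if x_t[i] != y_t:
                w[i] *= beta
    return mistakes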
Number of mistakes of the WM algorithm

M = # mistakes of master algorithm at the “end”
Mt,i = # mistakes of expert Ei at the start of trial t
Mi = Mm+1,i = # of total mistakes of expert Ei
wt,i = β^(Mt,i) = weight of Ei at the beginning of trial t (w1,i = 1)
Wt = Σ_{i=1}^{n} wt,i = total sum of weights at the start of trial t

For each trial, split Wt into the weight voting with the minority and the
majority: (Minority) ≤ (1/2)Wt and (Majority) ≥ (1/2)Wt.

If the Master algorithm:

• ...is right, then the weights of the “minority” experts are multiplied by β:
  Wt = Minority + Majority ≥ β · Minority + Majority = Wt+1

• ...makes a mistake, then the majority weights are multiplied by β:
  Wt+1 ≤ (1/2)Wt + β (1/2)Wt    (why?)
       minority   majority
Number of mistakes of the WM algorithm – Continued-1

Hence, Wt+1 ≤ ((1+β)/2) Wt on mistake trials. Therefore, we have

Wm+1 ≤ ((1+β)/2)^M · W1 = ((1+β)/2)^M · n

(Wm+1 is the total final weight; W1 = n, the number of experts).

At the same time, for any expert Ei,

Wm+1 = Σ_{j=1}^{n} wm+1,j = Σ_{j=1}^{n} β^(Mj) ≥ β^(Mi)

Combining the upper and lower bounds...

β^(Mi) ≤ ((1+β)/2)^M · n
Number of mistakes of the WM algorithm – Continued-2

Taking the log and solving for M...

M ≤ (ln(1/β) / ln(2/(1+β))) · Mi + (1 / ln(2/(1+β))) · ln n

For example, choosing β = 1/e,

M ≤ 2.63 · min_i Mi + 2.63 · ln n
      (a)               (b)

For all sequences, the loss of the master algorithm is comparable to the loss
of the best expert.
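The constants can be checked numerically; a quick Python sanity check of a = ln(1/β)/ln(2/(1+β)) and b = 1/ln(2/(1+β)) at β = 1/e:

from math import e, log

beta = 1 / e
a = log(1 / beta) / log(2 / (1 + beta))   # coefficient of min_i M_i
b = 1 / log(2 / (1 + beta))               # coefficient of ln n
print(round(a, 2), round(b, 2))           # both print 2.63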
Refining and generalising the experts model – 1

More generally we would like to obtain regret bounds for arbitrary loss
functions L : Y × Ŷ → [0, +∞]. Making our notion of regret more
precise, we would like guarantees of the form

LA(S) − min_{i∈[n]} Li(S) ≤ o(m),

where the left-hand side is termed regret since it is how much we
“regret” predicting with the algorithm as opposed to the optimal
predictor on the data sequence.

Here o(m) denotes some function sublinear in m (the number of
examples in S) and possibly dependent on other parameters.

Why o(m)?
Refining and generalising the experts model – 2

Remember that LA(S) is the cumulative sum of the errors incurred by A.
Therefore (1/m) LA(S) is the average error incurred (so far) by A.

If the sublinear regret bound holds,

(LA(S) − min_{i∈[n]} Li(S)) / m ≤ o(m) / m.

Since o(m)/m → 0 as m → ∞, this implies that asymptotically our
algorithm incurs the same average loss as the best expert.
Refining and generalising the experts model – 3

In the following we will show two examples of regret bounds generalising
the analysis of the weighted majority algorithm.

1. For the entropic loss L : {0, 1} × [0, 1] → [0, +∞]

   L(y, ŷ) = y ln(y/ŷ) + (1 − y) ln((1 − y)/(1 − ŷ))

2. For an arbitrary bounded loss function L : Y × Ŷ → [0, B].

For the first, the regret will be the small constant log(n); for the
second, the regret will be O(√(m log n)).
A regret bound for the entropic (log) loss

Unlike in the case of the weighted majority we will now:

• allow predictions in [0, 1] rather than just {0, 1}, and
• predict with the weighted average rather than the “majority”.

At trial t, expert i will have weight wt,i = β^(Lt,i) = e^(−η Lt,i), where:
• Lt,i is the cumulative loss of Ei at the start of trial t
• η is a positive learning rate

The Master predicts with the weighted average (dot product)

ŷt = Σ_{i=1}^{n} vt,i xt,i = vt · xt,    with vt,i := wt,i / Σ_{j=1}^{n} wt,j

where vt,i are the normalized weights and xt,i is the prediction of Ei in
trial t.

Set the initial weights w1 = (1, . . . , 1) and thus v1 = (1/n, . . . , 1/n).
The Weighted Average Algorithm – Summary

WA Algorithm
Initialise: v1 = (1/n, . . . , 1/n), LWA := 0, Li := 0 (i ∈ [n])
Input: η ∈ (0, ∞), loss function L : Y × Ŷ → R.

For t = 1, . . . , m Do
  Receive instance xt ∈ [0, 1]^n
  Predict ŷt := vt · xt
  Receive label yt ∈ [0, 1]
  Incur loss LWA := LWA + L(yt, ŷt),
             Li := Li + L(yt, xt,i) (i ∈ [n])
  Update vt+1,i := vt,i e^(−ηL(yt,xt,i)) / Σ_{j=1}^{n} vt,j e^(−ηL(yt,xt,j)) for i ∈ [n].
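A Python sketch of the WA algorithm (illustrative; the loss function and learning rate are passed in, e.g. the square loss with η = 1/2):

import numpy as np

def weighted_average(X, y, loss, eta):
    """X: (m, n) array of expert predictions in [0,1]; y: labels in [0,1].
    Returns (L_WA, array of per-expert cumulative losses L_i)."""
    m, n = X.shape
    v = np.full(n, 1.0 / n)            # v_1 = (1/n, ..., 1/n)
    L_WA, L = 0.0, np.zeros(n)
    for t in range(m):
        y_hat = v @ X[t]               # weighted-average prediction
        L_WA += loss(y[t], y_hat)
        losses = np.array([loss(y[t], x) for x in X[t]])
        L += losses
        v = v * np.exp(-eta * losses)  # multiplicative update
        v /= v.sum()                   # renormalise onto the simplex
    return L_WA, L

square_loss = lambda y, p: (y - p) ** 2    # use eta = 1/2 with this loss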
Weighted Average Algorithm - Theorem
Theorem
For all sequences of examples

S = (x1, y1), . . . , (xm, ym) with (xt, yt) ∈ [0, 1]^n × [0, 1]

the regret of the Weighted Average (WA) algorithm is

LWA(S) − min_i Li(S) ≤ (1/η) ln(n)

with square and entropic loss for η = 1/2 and η = 1 respectively.

Constant b = 1/η as dependent on the loss function:

Loss                                                             b = 1/η
entropic  Len(y, ŷ) = y ln(y/ŷ) + (1 − y) ln((1 − y)/(1 − ŷ))       1
square    Lsq(y, ŷ) = (y − ŷ)^2                                     2

• For simplicity, we will prove the result only for the entropic loss with
Y := {0, 1} and Ŷ := [0, 1]. The result holds for many loss functions
(given sufficient smoothness and convexity, with different η and b). See
[KW99] for the proof.
Weighted Average Algorithm - Proof

Notation: ∆n := {x ∈ [0, 1]^n : Σ_{i=1}^{n} xi = 1}. Let d : ∆n × ∆n → [0, ∞] be the
relative entropy d(u, v) := Σ_{i=1}^{n} ui ln(ui/vi). Note: Len(y, ŷ) = d((y, 1 − y), (ŷ, 1 − ŷ)).

Proof – 1
We first prove the following “progress versus regret” equality:

Len(yt, ŷt) − Σ_{i=1}^{n} ui Len(yt, xt,i) = d(u, vt) − d(u, vt+1) for all u ∈ ∆n. (1)

Observe that

d(u, vt) − d(u, vt+1) = Σ_{i=1}^{n} ui ln(vt+1,i / vt,i)

Let yt = 1. Then (using Len(1, x) = − ln x, so that e^(−Len(1,x)) = x)

Σ_{i=1}^{n} ui ln(vt+1,i / vt,i)
  = Σ_{i=1}^{n} ui ln( e^(−Len(1,xt,i)) / Σ_{j=1}^{n} vt,j e^(−Len(1,xt,j)) )
  = Σ_{i=1}^{n} ui ln( xt,i / Σ_{j=1}^{n} vt,j xt,j )
  = Σ_{i=1}^{n} ui ln( xt,i / ŷt )
  = Σ_{i=1}^{n} ui ln(xt,i) − ln(ŷt)
  = Len(yt, ŷt) − Σ_{i=1}^{n} ui Len(yt, xt,i)

By symmetry we also have the case yt = 0, demonstrating (1).


Weighted Average Algorithm - Proof continued

Proof – 2
Now observe that (1) is a telescoping equality, so summing over t gives

Σ_{t=1}^{m} Len(yt, ŷt) − Σ_{t=1}^{m} Σ_{i=1}^{n} ui Len(yt, xt,i) = d(u, v1) − d(u, vm+1)

Now, since the above holds for any u ∈ ∆n, it holds in particular for the
unit vectors (1, 0, . . . , 0), (0, 1, . . . , 0), . . . , (0, 0, . . . , 1); upper bounding
by noting that d(u, v1) ≤ ln n and −d(u, vm+1) ≤ 0, we have proved the
theorem.
Hedge Algorithm
Hedge was introduced in [FS97], generalising the weighted majority
analysis to the allocation setting.
Allocation setting
On each trial the learner plays an allocation vt ∈ ∆n , then nature
returns a loss vector ℓt . I.e., the loss of expert i is ℓt,i .
Two models for the learner’s play (HA-1, HA-2):

1. [HA-1]: We simply incur loss, so that the loss on trial t is vt · ℓt.
2. [HA-2]: Alternatively, the learner randomly selects an action ŷt ∈ [n]
   according to the discrete distribution over [n] with Prob(j) := vt,j.
   Thus
   E[loss of HA on trial t] = E_{ŷt∼vt}[ℓt,ŷt] = vt · ℓt.

• Observe that this setting can simulate the setting where we receive
side-information xt and have a fixed loss function.
• For the randomised setting the “mechanism” generating the loss
vectors ℓt must be oblivious to the learner’s selection ŷt until trial t + 1.
The Hedge Algorithm (HA) – Summary
Hedge Algorithm (HA-1)
Initialise: v1 := (1/n, . . . , 1/n), LHA := 0, Li := 0 (i ∈ [n]); Select: η ∈ (0, ∞)
For t = 1 To m Do
  Predict vt ∈ ∆n
  Receive loss ℓt ∈ [0, 1]^n
  Incur loss LHA := LHA + vt · ℓt,  Li := Li + ℓt,i (i ∈ [n])
  Update Weights vt+1,i := vt,i e^(−ηℓt,i) / Σ_{j=1}^{n} vt,j e^(−ηℓt,j) for i ∈ [n].

Hedge Algorithm (HA-2)
Initialise: v1 := (1/n, . . . , 1/n), LHA := 0, Li := 0 (i ∈ [n]); Select: η ∈ (0, ∞)
For t = 1 To m Do
  Predict vt ∈ ∆n and sample ŷt ∼ vt
  Receive loss ℓt ∈ [0, 1]^n
  Incur loss E[LHA] := E[LHA] + vt · ℓt,  Li := Li + ℓt,i (i ∈ [n])
  Update Weights vt+1,i := vt,i e^(−ηℓt,i) / Σ_{j=1}^{n} vt,j e^(−ηℓt,j) for i ∈ [n].
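A Python sketch of HA-1 (illustrative; HA-2 would additionally sample ŷt from vt, e.g. with numpy.random.choice):

import numpy as np

def hedge(loss_vectors, eta):
    """Hedge (HA-1) [FS97]: play v_t in the simplex, incur v_t · l_t.
    loss_vectors: (m, n) array with entries in [0, 1]."""
    m, n = loss_vectors.shape
    v = np.full(n, 1.0 / n)
    L_HA, L = 0.0, np.zeros(n)
    for t in range(m):
        ell = loss_vectors[t]
        L_HA += v @ ell                # allocation loss on trial t
        L += ell                       # per-expert cumulative losses
        v = v * np.exp(-eta * ell)     # exponential-weights update
        v /= v.sum()
    return L_HA, L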
Hedge - Theorem

Theorem Hedge (Bound) [LW94, FS97]
For all sequences of loss vectors

S = ℓ1, . . . , ℓm ∈ [0, 1]^n

the regret of the Hedge HA-2 algorithm with η = √(2 ln(n)/m) is

E[LHA(S)] − min_i Li(S) ≤ √(2 m ln n).
Hedge Theorem - Proof (1)
Proof – 1
We first prove the following “progress versus regret” inequality:

vt · ℓt − u · ℓt ≤ (1/η)(d(u, vt) − d(u, vt+1)) + (η/2) Σ_{i=1}^{n} vt,i ℓt,i^2 for all u ∈ ∆n. (2)

Let Zt := Σ_{i=1}^{n} vt,i exp(−ηℓt,i), so that vt+1,i = vt,i exp(−ηℓt,i)/Zt. Observe that

d(u, vt) − d(u, vt+1) = Σ_{i=1}^{n} ui ln(vt+1,i / vt,i)
  = −η Σ_{i=1}^{n} ui ℓt,i − Σ_{i=1}^{n} ui ln Zt
  = −η u · ℓt − ln Σ_{i=1}^{n} vt,i exp(−ηℓt,i)
  ≥ −η u · ℓt − ln Σ_{i=1}^{n} vt,i (1 − ηℓt,i + (1/2)η^2 ℓt,i^2)        (3)
  = −η u · ℓt − ln(1 − η vt · ℓt + (1/2)η^2 Σ_{i=1}^{n} vt,i ℓt,i^2)
  ≥ η(vt · ℓt − u · ℓt) − (1/2)η^2 Σ_{i=1}^{n} vt,i ℓt,i^2               (4)

using the inequalities e^(−x) ≤ 1 − x + x^2/2 for x ≥ 0 (for (3)) and
ln(1 + x) ≤ x (for (4)), demonstrating (2).
Hedge Theorem - Proof (2)

Proof – 2
Summing (2) over t and rearranging we have

Σ_{t=1}^{m} (vt · ℓt − u · ℓt) ≤ (1/η)(d(u, v1) − d(u, vm+1)) + (η/2) Σ_{t=1}^{m} Σ_{i=1}^{n} vt,i ℓt,i^2
                              ≤ (ln n)/η + (η/2) Σ_{t=1}^{m} Σ_{i=1}^{n} vt,i ℓt,i^2   (5)

Now, since the above holds for any u ∈ ∆n, it holds in particular for
the unit vectors (1, 0, . . . , 0), (0, 1, . . . , 0), . . . , (0, 0, . . . , 1); we then upper
bound by noting that d(u, v1) ≤ ln n, −d(u, vm+1) ≤ 0, and
Σ_{t=1}^{m} Σ_{i=1}^{n} vt,i ℓt,i^2 ≤ m. Finally we “tune” by choosing η = √(2 ln(n)/m) and we
have proved the theorem.

Question: how can we use the above to prove a theorem if the loss is now in
the range [0, B]?
Comments

• Easy to combine many pretty good experts (algorithms) so that the
Master is guaranteed to be almost as good as the best.
• Bounds are logarithmic in the number of experts. Use many experts! The
limits are only computational.
• Observe that the updating is multiplicative.

Next: So far we have given bounds which grow slowly in the number of
experts. The only significant drawback is potentially computational, if we
wish to work with large classes of experts. With this in mind we may wish
to work with structured sets of experts, for either computational
advantages or advantages in the bound.
We now consider linear combinations of experts that are linear classifiers.
Part II
Online learning of linear classifiers

A more general setting (1)
Instance   Prediction of alg A   Label   Loss of alg A
  x1              ŷ1               y1      L(y1, ŷ1)
  ..              ..               ..         ..
  xt              ŷt               yt      L(yt, ŷt)
  ..              ..               ..         ..
  xm              ŷm               ym      L(ym, ŷm)
                                           —————————
                                  Total loss LA(S)

Sequence of examples S = (x1, y1), . . . , (xm, ym)
Comparison class U = {u} (AKA hypothesis space, concept class)
Relative loss (Regret):

LA(S) − inf_{u∈U} Lossu(S)

Goal: Bound the relative loss for an arbitrary sequence S
A more general setting (2)

Now
• We consider the case where U is a set of linear threshold functions.
• For simplicity we will focus on the case where there exists a u ∈ U
s.t. Lossu(S) = 0. This is known as the realizable case. (Compare with the
previously considered halving algorithm versus weighted majority
algorithm.)
Perceptron
The Perceptron set-up

Assumption: The data is linearly separable with some margin γ. Hence there
exists a hyperplane with normal vector v such that:

1. ∥v∥ = 1
2. All examples (xt, yt) satisfy yt ∈ {−1, +1} and ∥xt∥ ≤ R
3. ∀(xt, yt), yt(xt · v) ≥ γ

(figure: positively and negatively labelled points separated by the hyperplane with normal v, with margin γ and radius R)
The Perceptron learning algorithm

Perceptron Algorithm
Input: (x1, y1), . . . , (xm, ym) with (xt, yt) ∈ R^n × {−1, 1}

1. Initialise w1 = 0 (the zero vector); M1 = 0.
2. For t = 1 to m do
3.   Receive pattern: xt ∈ R^n
4.   Predict: ŷt = sign(wt · xt)
5.   Receive label: yt
6.   If mistake (ŷt yt ≤ 0)
       Then Update wt+1 = wt + yt xt; Mt+1 = Mt + 1
7.   Else wt+1 = wt; Mt+1 = Mt.
8. End For
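A direct Python transcription of the algorithm (illustrative sketch; a single online pass, with M counting updates):

import numpy as np

def perceptron(X, y):
    """One online pass of the Perceptron; X is (m, n), y in {-1, +1}^m.
    Returns the final weight vector and the mistake count M."""
    w = np.zeros(X.shape[1])          # w_1 = 0
    M = 0
    for x_t, y_t in zip(X, y):
        y_hat = np.sign(w @ x_t)      # predict (sign(0) = 0 counts as a mistake)
        if y_hat * y_t <= 0:          # mistake: update
            w = w + y_t * x_t
            M += 1
    return w, M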
Example: trace for the Perceptron algorithm

(figure: a sequence of snapshots showing the Perceptron weight vector being updated after each mistake; images not reproduced)
Bound on number of mistakes

• The number of mistakes that the Perceptron algorithm can make is
at most (R/γ)^2.
• Proof: by combining upper and lower bounds on ∥w∥.
Pythagorean Lemma

On trials where “mistakes” occur we have the following inequality.

Lemma: If (wt · xt)yt ≤ 0 then ∥wt+1∥^2 ≤ ∥wt∥^2 + ∥xt∥^2
Proof:

∥wt+1∥^2 = ∥wt + yt xt∥^2
         = ∥wt∥^2 + 2(wt · xt)yt + ∥xt∥^2
         ≤ ∥wt∥^2 + ∥xt∥^2
Upper bound on ∥wt ∥

Lemma: ∥wt∥^2 ≤ Mt R^2
Proof: By induction.

• Claim: ∥wt∥^2 ≤ Mt R^2
• Base: M1 = 0, ∥w1∥^2 = 0
• Induction step (assume for t and prove for t + 1). If there is a
mistake on trial t (so Mt+1 = Mt + 1):

∥wt+1∥^2 ≤ ∥wt∥^2 + ∥xt∥^2 ≤ Mt R^2 + R^2 = Mt+1 R^2

Here we used the Pythagorean lemma. Otherwise there is no mistake;
then trivially wt+1 = wt and Mt+1 = Mt.
Lower bound on ∥wt ∥

Lemma: Mt γ ≤ ∥wt∥
Observe: ∥wt∥ ≥ wt · v because ∥v∥ = 1 (Cauchy–Schwarz).
We prove a lower bound on wt · v using induction over t.

• Claim: wt · v ≥ Mt γ
• Base: t = 1, w1 · v = 0
• Induction step (assume for t and prove for t + 1):
If mistake (Mt+1 = Mt + 1) then

wt+1 · v = (wt + yt xt) · v
         = wt · v + yt (xt · v)
         ≥ Mt γ + γ
         = (Mt + 1)γ
Combining the upper and lower bounds

Let M := Mm+1 denote the total number of updates (“mistakes”); then

(Mγ)^2 ≤ ∥wm+1∥^2 ≤ M R^2

Thus, simplifying, we have the famous ...

Theorem (Perceptron Bound [Novikoff])
For all sequences of examples

S = (x1, y1), . . . , (xm, ym) with (xt, yt) ∈ R^n × {−1, 1}

the number of mistakes of the Perceptron algorithm is bounded by

M ≤ (R/γ)^2,

with R := maxt ∥xt∥, when there exists a vector v with ∥v∥ = 1 and a
constant γ such that (v · xt)yt ≥ γ for all t.
Comments

• It is often convenient to express the bound in the following form.
Define u := v/γ; then

M ≤ R^2 ∥u∥^2    (∀u : (u · xt)yt ≥ 1 for all t)

• Suppose we have a linearly separable data set S. Questions:

1. Observe that wm+1 does not necessarily linearly separate S. Why?
2. How can we use the Perceptron to find a vector w that separates S?
3. How long will this computation take?

• There are variants on the perceptron that operate on a single
example at a time and converge to the “SVM” max-margin linear
separator.
Going Deeper: Regret Bounds for Linear Separation

Recall the regularisation approach to supervised learning:

h = argmin_{h∈H} Σ_{t=1}^{m} L(yt, h(xt)) + λ · penalty(h)

Example: Ridge Regression

argmin_{w∈R^n} Σ_{t=1}^{m} L(yt, w · xt) + λ∥w∥^2

Example: Soft Margin SVM

argmin_{w∈R^n, b∈R} Σ_{t=1}^{m} Lhi(yt, w · xt + b) + λ∥w∥^2

with Lhi(y, ŷ) := max(0, 1 − y ŷ).
Online Approach

Recall the regularisation approach to “BATCH” supervised learning:

argmin_{h∈H} Σ_{t=1}^{m} L(yt, h(xt)) + λ · penalty(h)

Question: how can we approach it online?

A possible strategy is, every time we see a new sample (xt, yt), to
produce a new ht+1 such that:

• it fits the new data point well
• it is not “too different” from the previous ht

ht+1 = argmin_{h∈H} L(yt, h(xt)) + λ · penalty(h, ht)
Online Gradient Descent with Hinge Loss and ∥·∥^2 penalty

Let’s consider SVMs:

• Hinge loss: Lhi(y, ŷ) = max(0, 1 − y ŷ).
• Linear hypotheses: h(x) = hw(x) = w · x.

Then, the online update becomes

wt+1 = argmin_{w∈R^n} Lhi(yt, w · xt) + λ∥w − wt∥^2

Solving for the update (taking the “derivative” and setting it to zero)
corresponds to choosing wt+1 as follows:

wt+1 = wt                    if yt(w · xt) > 1
wt+1 = wt + (yt xt)/(2λ)     if yt(w · xt) < 1

Note: if yt(w · xt) = 1 then we may choose either.
OGD with Hinge Loss and ∥·∥^2 penalty

OGD Algorithm
Initialise: w1 := 0, LOGD := 0
Select: η ∈ (0, ∞) (interpretation: η = 1/(2λ))

For t = 1 To m Do
  Receive instance xt ∈ R^n
  Predict ŷt := wt · xt
  Receive label yt ∈ {−1, 1}
  Incur loss LOGD := LOGD + Lhi(yt, ŷt)
  Update weights wt+1 := wt + 1{yt ŷt < 1} η yt xt.

How does the above differ from the perceptron?
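A Python sketch of OGD with the hinge loss (illustrative); note that, unlike the Perceptron, it also updates on margin errors (0 < yt ŷt < 1):

import numpy as np

def ogd_hinge(X, y, eta):
    """Online gradient descent with hinge loss and squared-norm penalty.
    X is (m, n), y in {-1, +1}^m; returns (w, cumulative hinge loss)."""
    w = np.zeros(X.shape[1])          # w_1 = 0
    L_OGD = 0.0
    for x_t, y_t in zip(X, y):
        y_hat = w @ x_t
        L_OGD += max(0.0, 1.0 - y_t * y_hat)   # hinge loss
        if y_t * y_hat < 1:                    # update on margin errors too
            w = w + eta * y_t * x_t
    return w, L_OGD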
Regret Bound for OGD

Theorem (Based on [G03])
Let R = maxt ∥xt∥ and η := U/(R√m). Then, for any vector u such that
∥u∥ ≤ U, the sequence produced by OGD satisfies

Σ_{t=1}^{m} Lhi(yt, ŷt) − Lhi(yt, u · xt) ≤ √(U^2 R^2 m) = U R √m.   (6)
Regret Bound for OGD – Proof (1)

Proof
Using the convexity of the hinge loss (w.r.t. its 2nd argument), we have

Lhi(yt, ŷt) − Lhi(yt, u · xt) ≤ (wt − u) · zt,   (7)

where zt := −yt xt 1{yt(wt·xt)<1} ∈ ∂w Lhi(yt, wt · xt) (the subdifferential).

From the update wt+1 = wt − ηzt we have

∥wt+1 − u∥^2 = ∥wt − ηzt − u∥^2 = ∥wt − u∥^2 − 2η(wt − u) · zt + η^2 ∥zt∥^2.

Thus

(wt − u) · zt = (1/(2η)) (∥wt − u∥^2 − ∥wt+1 − u∥^2 + η^2 ∥zt∥^2).   (8)
Regret Bound for OGD – Proof (2)

Proof – Continued
From (8) we have

Σ_{t=1}^{m} (wt − u) · zt = Σ_{t=1}^{m} (1/(2η)) (∥wt − u∥^2 − ∥wt+1 − u∥^2 + η^2 ∥zt∥^2)
  ≤ (1/(2η)) (∥u∥^2 + η^2 Σ_{t=1}^{m} ∥zt∥^2)    (telescoping; w1 = 0 and −∥wm+1 − u∥^2 ≤ 0)
  = (1/(2η)) ∥u∥^2 + (η/2) Σ_{t=1}^{m} ∥xt∥^2 1{yt(wt·xt)<1}
  ≤ (1/(2η)) U^2 + (η/2) m R^2
  = √(U^2 R^2 m)    (recall η := U/(R√m))

Lower bounding the L.H.S. with (7) and we are done. ■
Deriving the perceptron algorithm/bound from OGD

Going back to the hinge loss: we can recover the Perceptron bound via OGD.

1. Observe that equation (6) implies

Σ_{t=1}^{m} [yt ≠ sign(ŷt)] − Σ_{t=1}^{m} Lhi(yt, u · xt) ≤ √(U^2 R^2 m),

since on every mistake the hinge loss of the algorithm is at least 1.

2. Now assume there exists a linear classifier u such that yt(u · xt) ≥ 1 for all
t = 1, . . . , m. Thus

Σ_{t=1}^{m} [yt ≠ sign(ŷt)] ≤ √(U^2 R^2 m).

3. Now make OGD conservative, that is, we only update when yt ŷt ≤ 0
(versus yt ŷt ≤ 1), i.e., on trials in which a mistake is made.

4. Thus with respect to the bound we can ignore the trials where a mistake is
not made, so that we can set m = M := Σ_{t=1}^{m} [yt ≠ sign(ŷt)], which implies

M ≤ √(U^2 R^2 M)  ⟹  M ≤ U^2 R^2
OGD Beyond the Hinge Loss

How much does this result depend on our choice of the hinge loss Lhi?
(Spoiler: very little.)

Look back at our class on the subgradient optimization method: do you
see any similarities with what we are doing here?

Consider the following algorithm to minimize a generic loss L:

• start from w1 = 0. Then...
• for t = 1, . . . , m

wt+1 = wt − η zt with zt ∈ ∂w L(yt, wt · xt)

Exercise: can you get a theorem for general OGD? Under what
assumptions on L?
Wrapping Up

• We have considered a supervised learning setting where no
randomness in the data is assumed (it could even be adversarial!).
• We have identified a different goal from the stochastic setting:
having a cumulative error that is close to that of the best model
in the class.
• We first studied the case where we want to leverage the
recommendations of experts.
• We then considered the case of “transforming” our previous
stochastic approaches to supervised learning (e.g. Tikhonov
regularization) to online settings.
Problems – 1

1. Suppose X = {True, False}^n. Give a polynomial-time algorithm
A with a mistake bound M(A) ≤ O(n^2) for any example sequence
which is consistent with a k-literal conjunction. Your answer should
contain an argument that M(A) ≤ O(n^2).
2. State the perceptron convergence theorem [Novikoff], explaining the
relation with the hard margin support vector machine solution.
3. [Hard]: Define the c-regret of a learning algorithm as

c-regret(m) = LA(S) − c · min_{i∈[n]} Li(S)

thus the usual regret is the 1-regret.

3.1 For the weighted majority set-up, argue that without
randomised prediction it is impossible, for all training sequences, to
obtain sublinear c-regret for c < 2.
3.2 Show how to select β to obtain sublinear 2-regret.
4. Consider binary prediction with expert advice, with a perfect expert.
Prove that any algorithm makes at least Ω(min(m, log2 n)) mistakes
in the worst case.
Problems – 2

1. Recall that by tuning the weighted majority we achieved a bound

M ≤ 2.63 · min_i Mi + 2.63 · ln n

Now, by using randomisation in the prediction, design an algorithm
with a bound that has the property

M/m ≤ min_{i∈[n]} Mi/m as m → ∞,

for the weighted majority setting (i.e., the mean prediction error of
the algorithm is bounded by the mean prediction error of the “best”
expert). Recall that m is the number of examples (and the
“tuning” of the algorithm may depend on m). For contrast, compare
this to problem 3.1 above.
Recommended Reading

Chapters 2, 4 and 12 of Cesa-Bianchi, Nicolò, and Gábor Lugosi,
Prediction, Learning, and Games. Cambridge University Press, 2006.
Useful references

1. Nicolò Cesa-Bianchi and Gábor Lugosi, Prediction, Learning, and Games,
(2006). Note this is a book.
2. N. Littlestone, Learning quickly when irrelevant attributes abound: a new
linear threshold algorithm, (1988).
3. N. Littlestone and M. K. Warmuth, The weighted majority algorithm,
(1994).
4. V. Vovk, Aggregating strategies, (1990).
5. Y. Freund and R. Schapire, A decision-theoretic generalization of
on-line learning and an application to boosting, (1997).
6. D. Haussler, J. Kivinen, and M. K. Warmuth, Sequential prediction of
individual sequences under general loss functions, (1998).
7. M. Herbster and M. Warmuth, Tracking the best expert, (1998).
8. J. Kivinen and M. Warmuth, Averaging expert predictions, (1999).
9. S. Shalev-Shwartz, Online Learning and Online Convex Optimization,
(2011).
