
8.

Online learning
COMP0078: Supervised Learning

Carlo Ciliberto
(Slides thanks to Mark Herbster)

University College London


Department of Computer Science

Batch versus Online learning

Batch
Model: There exists training data set (sampled IID)
Aim: To build a classifier from the training data that predicts well on
new data (from same distribution)
Evaluation metric: Generalization error

Online
Model: There exists an online sequence of data (usually no
distributional assumptions)
Aim: To sequentially predict and update a classifier to predict well on
the sequence (i.e. there is no training and test set distinction)
Evaluation metric: Cumulative error

Note
There are a variety of models for online learning. Here we focus on the so-called worst-case model. Alternatively, distributional assumptions may be made on the data sequence. Sometimes the phrase “online learning” is also used to refer to “online optimisation”, that is, using online-learning-type algorithms as a training method for a batch classifier.
Why online learning?

Pragmatically
• “Often” fast algorithms
• “Often” small memory footprint
• “Often” no “statistical” assumptions required e.g. IID-ness
• As a training method for “BIG DATA” batch classifiers

Theoretically (learning performance guarantees)


• Non-asymptotic
• No statistical assumptions
• There exist techniques to convert cumulative error guarantees to
generalisation error guarantees

Today

Our focus today is on three foundational online “hypotheses” classes.

• Learning with experts


1. Halving algorithm
2. Weighted Majority algorithm
3. Refining and generalising the experts model
• Learning linear classifiers
• Perceptron

Experts

Part I
Learning with Expert Advice

On-Line Learning with expert advice (1) [V90,LW94,HKW98]

Model: There exists an online sequence of data


S = Sm = {(x1, y1), . . . , (xm, ym)} with (xt, yt) ∈ {0, 1}^n × {0, 1}.
Interpretation: The vector xt is the set of predictions from n experts
about an outcome yt, where expert i predicts xt,i ∈ {0, 1} at time t.
Each expert at time t is aiming to predict yt.
What is an “expert”? Example: human experts or the predictions of n
separate algorithms.

         E1    E2    E3   ...   En | prediction | true label | loss
day 1     1     1     0         0  |     0      |     1      |  1
day 2     1     0     1         0  |     1      |     0      |  1
day 3     0     1     1         1  |     1      |     1      |  0
day t   xt,1  xt,2  xt,3      xt,n |     ŷt     |     yt     | |yt − ŷt|

Goal: Find a “Master” algorithm to combine the predictions xt of the n
experts (based on past performance) to predict ŷt, an estimate of yt.
On-Line Learning with experts (2)
Protocol of the Master Algorithm

For t = 1, . . . , m:
Get instance xt ∈ {0, 1}n
Predict ŷt ∈ {0, 1}
Get label yt ∈ {0, 1}
Incur loss (mistakes) |yt − ŷt |

Evaluation metric: The loss (mistakes) of Master Algorithm A on
sequence S is just

LA(S) := Σ_{t=1}^{m} |yt − ŷt|

where ŷt = A(St−1)(xt) is the output of the online algorithm A trained
(online) on St−1 and evaluated on xt.
Our Goal: Design master algorithms with “small loss”.
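To make the protocol concrete, here is a minimal Python sketch of the interaction loop. The function and argument names (run_protocol, master_predict, the list-based data layout) are illustrative assumptions, standing in for any master algorithm considered below.

def run_protocol(expert_predictions, labels, master_predict):
    """expert_predictions: length-m list of rows x_t in {0,1}^n;
    labels: list of y_t in {0,1}. Returns the cumulative loss L_A(S)."""
    loss = 0
    history = []                                 # S_{t-1}: examples seen so far
    for x_t, y_t in zip(expert_predictions, labels):
        y_hat = master_predict(history, x_t)     # predict before seeing y_t
        loss += abs(y_t - y_hat)                 # 0/1 mistake loss
        history.append((x_t, y_t))
    return loss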
Special Case: The (Unknown) “Perfect” Expert

Let’s consider the special setting where there exists at least one expert
that is never wrong...

...how “fast” could we find them?

A Solution: Halving Algorithm

(figure: the pool of all experts splits into consistent and inconsistent experts; the master predicts with the majority of the consistent ones)

The master algorithm:

• Keeps track of only the consistent experts


(those that never made a mistake so far)
• Predicts according to the majority vote
• Eliminates wrong experts after each prediction.

Question: How many mistakes does it make, at most?

A run of the Halving Algorithm

E1 E2 E3 E4 E5 E6 E7 E8 ŷ y loss
1 1 0 0 1 1 0 0 1 0 1
x x 0 1 x x 1 1 1 1 0
x x x 1 x x 0 0 0 1 1
x x x ↑ x x x x
consistent

Question: How many mistakes does it make, at most?

Answer: For any sequence with a consistent expert, the Halving
Algorithm makes ≤ log2(n) mistakes.
Exercise: Prove this!
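A minimal Python sketch of the Halving Algorithm (illustrative; it assumes binary predictions and, as in this setting, at least one consistent expert):

def halving(expert_predictions, labels):
    """Run the Halving Algorithm; returns the number of master mistakes."""
    n = len(expert_predictions[0])
    consistent = set(range(n))        # experts with no mistakes so far
    mistakes = 0
    for x_t, y_t in zip(expert_predictions, labels):
        votes_for_1 = sum(x_t[i] for i in consistent)
        y_hat = 1 if 2 * votes_for_1 >= len(consistent) else 0  # majority vote
        mistakes += int(y_hat != y_t)
        consistent = {i for i in consistent if x_t[i] == y_t}   # drop wrong experts
    return mistakes

Note that on every master mistake at least half of the consistent experts voted with the master and are eliminated, which is the heart of the ≤ log2(n) argument the exercise asks for.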
What if no expert is consistent?
Notation

• Recall LA(S) := Σ_{t=1}^{m} |yt − ŷt| is the loss of algorithm A on S
• Denote the loss of the i-th expert Ei as

Li(S) := Σ_{t=1}^{m} |yt − xt,i|

Aim
Bounds of the form:

∀S : LA(S) ≤ a · min_{i} Li(S) + b · log(n)

where min_i Li(S) is the loss of the best expert and a, b are “small” constants.

Comment: These are known as “Regret” or “Worst-case” loss bounds, i.e., bounds
that hold in any case (even the “worst-case”).
A Solution: Weighted Majority Algorithm [LW94]

The Master algorithm:

• Can’t eliminate experts!
• Keeps track of how reliable each expert is
  (by keeping track of a weight wi for each expert)

(figure: all experts vote with their weight; the weighted vote splits into a “predict 0” side and a “predict 1” side)

• Predicts according to the larger (weighted) vote
• Weights of wrong experts are multiplied by β ∈ [0, 1)
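A minimal Python sketch of the Weighted Majority algorithm [LW94] (illustrative; ties are broken towards predicting 1, which does not affect the analysis):

def weighted_majority(expert_predictions, labels, beta=0.5):
    """Weighted vote over experts; wrong experts are penalised by beta."""
    n = len(expert_predictions[0])
    w = [1.0] * n                     # w_{1,i} = 1
    mistakes = 0
    for x_t, y_t in zip(expert_predictions, labels):
        vote_1 = sum(w[i] for i in range(n) if x_t[i] == 1)
        vote_0 = sum(w[i] for i in range(n) if x_t[i] == 0)
        y_hat = 1 if vote_1 >= vote_0 else 0
        mistakes += int(y_hat != y_t)
        for i in range(n):            # multiplicative penalty for wrong experts
            if x_t[i] != y_t:
                w[i] *= beta
    return mistakes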
Number of mistakes of the WM algorithm

M = # mistakes of master algorithm at the “end”
Mt,i = # mistakes of expert Ei at the start of trial t
Mi = Mm+1,i = # of total mistakes of expert Ei
wt,i = β^(Mt,i) = weight of Ei at the beginning of trial t (w1,i = 1)
Wt = Σ_{i=1}^{n} wt,i = total sum of weights at the start of trial t

For each trial, split Wt into the weight voting with the minority and the
majority: (Minority) ≤ (1/2)Wt and (Majority) ≥ (1/2)Wt.

If the Master algorithm:

• ...is right, then the weights of the “minority” experts are multiplied by β:
  Wt = Minority + Majority ≥ β · Minority + Majority = Wt+1

• ...makes a mistake, then the majority weights are multiplied by β:
  Wt+1 ≤ (1/2)Wt + β (1/2)Wt    (why?)
       minority   majority
Number of mistakes of the WM algorithm – Continued-1

Hence, Wt+1 ≤ ((1+β)/2) Wt on mistake trials. Therefore, we have

Wm+1 ≤ ((1+β)/2)^M · W1 = ((1+β)/2)^M · n

(Wm+1 is the total final weight; W1 = n, the number of experts).

At the same time, for any expert Ei,

Wm+1 = Σ_{j=1}^{n} wm+1,j = Σ_{j=1}^{n} β^(Mj) ≥ β^(Mi)

Combining the upper and lower bounds...

β^(Mi) ≤ ((1+β)/2)^M · n
Number of mistakes of the WM algorithm – Continued-2

Taking the log and solving for M...

M ≤ (ln(1/β) / ln(2/(1+β))) · Mi + (1 / ln(2/(1+β))) · ln n

For example, choosing β = 1/e,

M ≤ 2.63 · min_i Mi + 2.63 · ln n
      (a)               (b)

For all sequences, the loss of the master algorithm is comparable to the loss
of the best expert.
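The constants can be checked numerically; a quick Python sanity check of a = ln(1/β)/ln(2/(1+β)) and b = 1/ln(2/(1+β)) at β = 1/e:

from math import e, log

beta = 1 / e
a = log(1 / beta) / log(2 / (1 + beta))   # coefficient of min_i M_i
b = 1 / log(2 / (1 + beta))               # coefficient of ln n
print(round(a, 2), round(b, 2))           # both print 2.63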
Refining and generalising the experts model – 1

More generally we would like to obtain regret bounds for arbitrary loss
functions L : Y × Ŷ → [0, +∞]. Making our notion of regret more
precise, we would like guarantees of the form

LA(S) − min_{i∈[n]} Li(S) ≤ o(m),

where the left-hand side is termed regret since it is how much we
“regret” predicting with the algorithm as opposed to the optimal
predictor on the data sequence.

Here o(m) denotes some function sublinear in m (the number of
examples in S) and possibly dependent on other parameters.

Why o(m)?
Refining and generalising the experts model – 2

Remember that LA(S) is the cumulative sum of the errors incurred by A.
Therefore (1/m) LA(S) is the average error incurred (so far) by A.

If the sublinear regret bound holds,

(LA(S) − min_{i∈[n]} Li(S)) / m ≤ o(m) / m.

Since o(m)/m → 0 as m → ∞, this implies that asymptotically our
algorithm incurs the same average loss as the best expert.
Refining and generalising the experts model – 3

In the following we will show two examples of regret bounds generalising
the analysis of the weighted majority algorithm.

1. For the entropic loss L : {0, 1} × [0, 1] → [0, +∞]

   L(y, ŷ) = y ln(y/ŷ) + (1 − y) ln((1 − y)/(1 − ŷ))

2. For an arbitrary bounded loss function L : Y × Ŷ → [0, B].

For the first, the regret will be the small constant log(n); for the
second, the regret will be O(√(m log n)).
A regret bound for the entropic (log) loss

Unlike in the case of the weighted majority we will now:

• allow predictions in [0, 1] rather than just {0, 1}, and
• predict with the weighted average rather than the “majority”.

At trial t, expert i will have weight wt,i = β^(Lt,i) = e^(−η Lt,i), where:
• Lt,i is the cumulative loss of Ei at the start of trial t
• η is a positive learning rate

The Master predicts with the weighted average (dot product)

ŷt = Σ_{i=1}^{n} vt,i xt,i = vt · xt,    with vt,i := wt,i / Σ_{j=1}^{n} wt,j

where vt,i are the normalized weights and xt,i is the prediction of Ei in
trial t.

Set the initial weights w1 = (1, . . . , 1) and thus v1 = (1/n, . . . , 1/n).
The Weighted Average Algorithm – Summary

WA Algorithm
Initialise: v1 = (1/n, . . . , 1/n), LWA := 0, Li := 0 (i ∈ [n])
Input: η ∈ (0, ∞), loss function L : Y × Ŷ → R.

For t = 1, . . . , m Do
  Receive instance xt ∈ [0, 1]^n
  Predict ŷt := vt · xt
  Receive label yt ∈ [0, 1]
  Incur loss LWA := LWA + L(yt, ŷt),
             Li := Li + L(yt, xt,i) (i ∈ [n])
  Update vt+1,i := vt,i e^(−ηL(yt,xt,i)) / Σ_{j=1}^{n} vt,j e^(−ηL(yt,xt,j)) for i ∈ [n].
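A Python sketch of the WA algorithm (illustrative; the loss function and learning rate are passed in, e.g. the square loss with η = 1/2):

import numpy as np

def weighted_average(X, y, loss, eta):
    """X: (m, n) array of expert predictions in [0,1]; y: labels in [0,1].
    Returns (L_WA, array of per-expert cumulative losses L_i)."""
    m, n = X.shape
    v = np.full(n, 1.0 / n)            # v_1 = (1/n, ..., 1/n)
    L_WA, L = 0.0, np.zeros(n)
    for t in range(m):
        y_hat = v @ X[t]               # weighted-average prediction
        L_WA += loss(y[t], y_hat)
        losses = np.array([loss(y[t], x) for x in X[t]])
        L += losses
        v = v * np.exp(-eta * losses)  # multiplicative update
        v /= v.sum()                   # renormalise onto the simplex
    return L_WA, L

square_loss = lambda y, p: (y - p) ** 2    # use eta = 1/2 with this loss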
Weighted Average Algorithm - Theorem
Theorem
For all sequences of examples

S = (x1, y1), . . . , (xm, ym) with (xt, yt) ∈ [0, 1]^n × [0, 1]

the regret of the Weighted Average (WA) algorithm is

LWA(S) − min_i Li(S) ≤ (1/η) ln(n)

with square and entropic loss for η = 1/2 and η = 1 respectively.

Constant b = 1/η as dependent on the loss function:

Loss                                                             b = 1/η
entropic  Len(y, ŷ) = y ln(y/ŷ) + (1 − y) ln((1 − y)/(1 − ŷ))       1
square    Lsq(y, ŷ) = (y − ŷ)^2                                     2

• For simplicity, we will prove the result only for the entropic loss with
Y := {0, 1} and Ŷ := [0, 1]. The result holds for many loss functions
(given sufficient smoothness and convexity, with different η and b). See
[KW99] for the proof.
Weighted Average Algorithm - Proof

Notation: ∆n := {x ∈ [0, 1]^n : Σ_{i=1}^{n} xi = 1}. Let d : ∆n × ∆n → [0, ∞] be the
relative entropy d(u, v) := Σ_{i=1}^{n} ui ln(ui/vi). Note: Len(y, ŷ) = d((y, 1 − y), (ŷ, 1 − ŷ)).

Proof – 1
We first prove the following “progress versus regret” equality:

Len(yt, ŷt) − Σ_{i=1}^{n} ui Len(yt, xt,i) = d(u, vt) − d(u, vt+1) for all u ∈ ∆n. (1)

Observe that

d(u, vt) − d(u, vt+1) = Σ_{i=1}^{n} ui ln(vt+1,i / vt,i)

Let yt = 1. Then (using Len(1, x) = − ln x, so that e^(−Len(1,x)) = x)

Σ_{i=1}^{n} ui ln(vt+1,i / vt,i)
  = Σ_{i=1}^{n} ui ln( e^(−Len(1,xt,i)) / Σ_{j=1}^{n} vt,j e^(−Len(1,xt,j)) )
  = Σ_{i=1}^{n} ui ln( xt,i / Σ_{j=1}^{n} vt,j xt,j )
  = Σ_{i=1}^{n} ui ln( xt,i / ŷt )
  = Σ_{i=1}^{n} ui ln(xt,i) − ln(ŷt)
  = Len(yt, ŷt) − Σ_{i=1}^{n} ui Len(yt, xt,i)

By symmetry we also have the case yt = 0, demonstrating (1).


Weighted Average Algorithm - Proof continued

Proof – 2
Now observe that (1) is a telescoping equality, so summing over t gives

Σ_{t=1}^{m} Len(yt, ŷt) − Σ_{t=1}^{m} Σ_{i=1}^{n} ui Len(yt, xt,i) = d(u, v1) − d(u, vm+1)

Now, since the above holds for any u ∈ ∆n, it holds in particular for the
unit vectors (1, 0, . . . , 0), (0, 1, . . . , 0), . . . , (0, 0, . . . , 1); upper bounding
by noting that d(u, v1) ≤ ln n and −d(u, vm+1) ≤ 0, we have proved the
theorem.
Hedge Algorithm
Hedge was introduced in [FS97], generalising the weighted majority
analysis to the allocation setting.
Allocation setting
On each trial the learner plays an allocation vt ∈ ∆n , then nature
returns a loss vector ℓt . I.e., the loss of expert i is ℓt,i .
Two models for the learner’s play (HA-1, HA-2):

1. [HA-1]: We simply incur loss, so that the loss on trial t is vt · ℓt.
2. [HA-2]: Alternatively, the learner randomly selects an action ŷt ∈ [n]
   according to the discrete distribution over [n] with Prob(j) := vt,j.
   Thus
   E[loss of HA on trial t] = E_{ŷt∼vt}[ℓt,ŷt] = vt · ℓt.

• Observe that this setting can simulate the setting where we receive
side-information xt and have a fixed loss function.
• For the randomised setting the “mechanism” generating the loss
vectors ℓt must be oblivious to the learner’s selection ŷt until trial t + 1.
The Hedge Algorithm (HA) – Summary
Hedge Algorithm (HA-1)
Initialise: v1 := (1/n, . . . , 1/n), LHA := 0, Li := 0 (i ∈ [n]); Select: η ∈ (0, ∞)
For t = 1 To m Do
  Predict vt ∈ ∆n
  Receive loss ℓt ∈ [0, 1]^n
  Incur loss LHA := LHA + vt · ℓt,  Li := Li + ℓt,i (i ∈ [n])
  Update Weights vt+1,i := vt,i e^(−ηℓt,i) / Σ_{j=1}^{n} vt,j e^(−ηℓt,j) for i ∈ [n].

Hedge Algorithm (HA-2)
Initialise: v1 := (1/n, . . . , 1/n), LHA := 0, Li := 0 (i ∈ [n]); Select: η ∈ (0, ∞)
For t = 1 To m Do
  Predict vt ∈ ∆n and sample ŷt ∼ vt
  Receive loss ℓt ∈ [0, 1]^n
  Incur loss E[LHA] := E[LHA] + vt · ℓt,  Li := Li + ℓt,i (i ∈ [n])
  Update Weights vt+1,i := vt,i e^(−ηℓt,i) / Σ_{j=1}^{n} vt,j e^(−ηℓt,j) for i ∈ [n].
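A Python sketch of HA-1 (illustrative; HA-2 would additionally sample ŷt from vt, e.g. with numpy.random.choice):

import numpy as np

def hedge(loss_vectors, eta):
    """Hedge (HA-1) [FS97]: play v_t in the simplex, incur v_t · l_t.
    loss_vectors: (m, n) array with entries in [0, 1]."""
    m, n = loss_vectors.shape
    v = np.full(n, 1.0 / n)
    L_HA, L = 0.0, np.zeros(n)
    for t in range(m):
        ell = loss_vectors[t]
        L_HA += v @ ell                # allocation loss on trial t
        L += ell                       # per-expert cumulative losses
        v = v * np.exp(-eta * ell)     # exponential-weights update
        v /= v.sum()
    return L_HA, L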
Hedge - Theorem

Theorem Hedge (Bound) [LW94, FS97]
For all sequences of loss vectors

S = ℓ1, . . . , ℓm ∈ [0, 1]^n

the regret of the Hedge HA-2 algorithm with η = √(2 ln(n)/m) is

E[LHA(S)] − min_i Li(S) ≤ √(2 m ln n).
Hedge Theorem - Proof (1)
Proof – 1
We first prove the following “progress versus regret” inequality:

vt · ℓt − u · ℓt ≤ (1/η)(d(u, vt) − d(u, vt+1)) + (η/2) Σ_{i=1}^{n} vt,i ℓt,i^2 for all u ∈ ∆n. (2)

Let Zt := Σ_{i=1}^{n} vt,i exp(−ηℓt,i), so that vt+1,i = vt,i exp(−ηℓt,i)/Zt. Observe that

d(u, vt) − d(u, vt+1) = Σ_{i=1}^{n} ui ln(vt+1,i / vt,i)
  = −η Σ_{i=1}^{n} ui ℓt,i − Σ_{i=1}^{n} ui ln Zt
  = −η u · ℓt − ln Σ_{i=1}^{n} vt,i exp(−ηℓt,i)
  ≥ −η u · ℓt − ln Σ_{i=1}^{n} vt,i (1 − ηℓt,i + (1/2)η^2 ℓt,i^2)        (3)
  = −η u · ℓt − ln(1 − η vt · ℓt + (1/2)η^2 Σ_{i=1}^{n} vt,i ℓt,i^2)
  ≥ η(vt · ℓt − u · ℓt) − (1/2)η^2 Σ_{i=1}^{n} vt,i ℓt,i^2               (4)

using the inequalities e^(−x) ≤ 1 − x + x^2/2 for x ≥ 0 (for (3)) and
ln(1 + x) ≤ x (for (4)), demonstrating (2).
Hedge Theorem - Proof (2)

Proof – 2
Summing (2) over t and rearranging we have

Σ_{t=1}^{m} (vt · ℓt − u · ℓt) ≤ (1/η)(d(u, v1) − d(u, vm+1)) + (η/2) Σ_{t=1}^{m} Σ_{i=1}^{n} vt,i ℓt,i^2
                              ≤ (ln n)/η + (η/2) Σ_{t=1}^{m} Σ_{i=1}^{n} vt,i ℓt,i^2   (5)

Now, since the above holds for any u ∈ ∆n, it holds in particular for
the unit vectors (1, 0, . . . , 0), (0, 1, . . . , 0), . . . , (0, 0, . . . , 1); we then upper
bound by noting that d(u, v1) ≤ ln n, −d(u, vm+1) ≤ 0, and
Σ_{t=1}^{m} Σ_{i=1}^{n} vt,i ℓt,i^2 ≤ m. Finally we “tune” by choosing η = √(2 ln(n)/m) and we
have proved the theorem.

Question: how can we use the above to prove a theorem if the loss is now in
the range [0, B]?
Comments

• Easy to combine many pretty good experts (algorithms) so that the
Master is guaranteed to be almost as good as the best.
• Bounds are logarithmic in the number of experts. Use many experts! The
limits are only computational.
• Observe that the updating is multiplicative.

Next: So far we have given bounds which grow slowly in the number of
experts. The only significant drawback is potentially computational, if we
wish to work with large classes of experts. With this in mind we may wish
to work with structured sets of experts, for either computational
advantages or advantages in the bound.
We now consider linear combinations of experts that are linear classifiers.
Part II
Online learning of linear classifiers

A more general setting (1)
Instance   Prediction of alg A   Label   Loss of alg A
  x1              ŷ1               y1      L(y1, ŷ1)
  ..              ..               ..         ..
  xt              ŷt               yt      L(yt, ŷt)
  ..              ..               ..         ..
  xm              ŷm               ym      L(ym, ŷm)
                                           —————————
                                  Total loss LA(S)

Sequence of examples S = (x1, y1), . . . , (xm, ym)
Comparison class U = {u} (AKA hypothesis space, concept class)
Relative loss (Regret):

LA(S) − inf_{u∈U} Lossu(S)

Goal: Bound the relative loss for an arbitrary sequence S
A more general setting (2)

Now
• We consider the case where U is a set of linear threshold functions.
• For simplicity we will focus on the case where there exists a u ∈ U
s.t. Lossu(S) = 0. This is known as the realizable case. (Compare with the
previously considered halving algorithm versus weighted majority
algorithm.)
Perceptron
The Perceptron set-up

Assumption: The data is linearly separable with some margin γ. Hence there
exists a hyperplane with normal vector v such that:

1. ∥v∥ = 1
2. All examples (xt, yt) satisfy yt ∈ {−1, +1} and ∥xt∥ ≤ R
3. ∀(xt, yt), yt(xt · v) ≥ γ

(figure: positively and negatively labelled points separated by the hyperplane with normal v, with margin γ and radius R)
The Perceptron learning algorithm

Perceptron Algorithm
Input: (x1, y1), . . . , (xm, ym) with (xt, yt) ∈ R^n × {−1, 1}

1. Initialise w1 = 0 (the zero vector); M1 = 0.
2. For t = 1 to m do
3.   Receive pattern: xt ∈ R^n
4.   Predict: ŷt = sign(wt · xt)
5.   Receive label: yt
6.   If mistake (ŷt yt ≤ 0)
       Then Update wt+1 = wt + yt xt; Mt+1 = Mt + 1
7.   Else wt+1 = wt; Mt+1 = Mt.
8. End For
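A direct Python transcription of the algorithm (illustrative sketch; a single online pass, with M counting updates):

import numpy as np

def perceptron(X, y):
    """One online pass of the Perceptron; X is (m, n), y in {-1, +1}^m.
    Returns the final weight vector and the mistake count M."""
    w = np.zeros(X.shape[1])          # w_1 = 0
    M = 0
    for x_t, y_t in zip(X, y):
        y_hat = np.sign(w @ x_t)      # predict (sign(0) = 0 counts as a mistake)
        if y_hat * y_t <= 0:          # mistake: update
            w = w + y_t * x_t
            M += 1
    return w, M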
Example: trace for the Perceptron algorithm

(figure: a sequence of snapshots showing the Perceptron weight vector being updated after each mistake; images not reproduced)
Bound on number of mistakes

• The number of mistakes that the Perceptron algorithm can make is
at most (R/γ)^2.
• Proof: by combining upper and lower bounds on ∥w∥.
Pythagorean Lemma

On trials where “mistakes” occur we have the following inequality.

Lemma: If (wt · xt)yt ≤ 0 then ∥wt+1∥^2 ≤ ∥wt∥^2 + ∥xt∥^2
Proof:

∥wt+1∥^2 = ∥wt + yt xt∥^2
         = ∥wt∥^2 + 2(wt · xt)yt + ∥xt∥^2
         ≤ ∥wt∥^2 + ∥xt∥^2
Upper bound on ∥wt ∥

Lemma: ∥wt∥^2 ≤ Mt R^2
Proof: By induction.

• Claim: ∥wt∥^2 ≤ Mt R^2
• Base: M1 = 0, ∥w1∥^2 = 0
• Induction step (assume for t and prove for t + 1). If there is a
mistake on trial t (so Mt+1 = Mt + 1):

∥wt+1∥^2 ≤ ∥wt∥^2 + ∥xt∥^2 ≤ Mt R^2 + R^2 = Mt+1 R^2

Here we used the Pythagorean lemma. Otherwise there is no mistake;
then trivially wt+1 = wt and Mt+1 = Mt.
Lower bound on ∥wt ∥

Lemma: Mt γ ≤ ∥wt∥
Observe: ∥wt∥ ≥ wt · v because ∥v∥ = 1 (Cauchy–Schwarz).
We prove a lower bound on wt · v using induction over t.

• Claim: wt · v ≥ Mt γ
• Base: t = 1, w1 · v = 0
• Induction step (assume for t and prove for t + 1):
If mistake (Mt+1 = Mt + 1) then

wt+1 · v = (wt + yt xt) · v
         = wt · v + yt (xt · v)
         ≥ Mt γ + γ
         = (Mt + 1)γ
Combining the upper and lower bounds

Let M := Mm+1 denote the total number of updates (“mistakes”); then

(Mγ)^2 ≤ ∥wm+1∥^2 ≤ M R^2

Thus, simplifying, we have the famous ...

Theorem (Perceptron Bound [Novikoff])
For all sequences of examples

S = (x1, y1), . . . , (xm, ym) with (xt, yt) ∈ R^n × {−1, 1}

the number of mistakes of the Perceptron algorithm is bounded by

M ≤ (R/γ)^2,

with R := maxt ∥xt∥, when there exists a vector v with ∥v∥ = 1 and a
constant γ such that (v · xt)yt ≥ γ for all t.
Comments

• It is often convenient to express the bound in the following form.
Define u := v/γ; then

M ≤ R^2 ∥u∥^2    (∀u : (u · xt)yt ≥ 1 for all t)

• Suppose we have a linearly separable data set S. Questions:

1. Observe that wm+1 does not necessarily linearly separate S. Why?
2. How can we use the Perceptron to find a vector w that separates S?
3. How long will this computation take?

• There are variants on the perceptron that operate on a single
example at a time and converge to the “SVM” max-margin linear
separator.
Going Deeper: Regret Bounds for Linear Separation

Recall the regularisation approach to supervised learning:

h = argmin_{h∈H} Σ_{t=1}^{m} L(yt, h(xt)) + λ · penalty(h)

Example: Ridge Regression

argmin_{w∈R^n} Σ_{t=1}^{m} L(yt, w · xt) + λ∥w∥^2

Example: Soft Margin SVM

argmin_{w∈R^n, b∈R} Σ_{t=1}^{m} Lhi(yt, w · xt + b) + λ∥w∥^2

with Lhi(y, ŷ) := max(0, 1 − y ŷ).
Online Approach

Recall the regularisation approach to “BATCH” supervised learning:

argmin_{h∈H} Σ_{t=1}^{m} L(yt, h(xt)) + λ · penalty(h)

Question: how can we approach it online?

A possible strategy is, every time we see a new sample (xt, yt), to
produce a new ht+1 such that:

• it fits the new data point well
• it is not “too different” from the previous ht

ht+1 = argmin_{h∈H} L(yt, h(xt)) + λ · penalty(h, ht)
Online Gradient Descent with Hinge Loss and ∥·∥^2 penalty

Let’s consider SVMs:

• Hinge loss: Lhi(y, ŷ) = max(0, 1 − y ŷ).
• Linear hypotheses: h(x) = hw(x) = w · x.

Then, the online update becomes

wt+1 = argmin_{w∈R^n} Lhi(yt, w · xt) + λ∥w − wt∥^2

Solving for the update (taking the “derivative” and setting it to zero)
corresponds to choosing wt+1 as follows:

wt+1 = wt                    if yt(w · xt) > 1
wt+1 = wt + (yt xt)/(2λ)     if yt(w · xt) < 1

Note: if yt(w · xt) = 1 then we may choose either.
OGD with Hinge Loss and ∥·∥^2 penalty

OGD Algorithm
Initialise: w1 := 0, LOGD := 0
Select: η ∈ (0, ∞) (interpretation: η = 1/(2λ))

For t = 1 To m Do
  Receive instance xt ∈ R^n
  Predict ŷt := wt · xt
  Receive label yt ∈ {−1, 1}
  Incur loss LOGD := LOGD + Lhi(yt, ŷt)
  Update weights wt+1 := wt + 1{yt ŷt < 1} η yt xt.

How does the above differ from the perceptron?
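A Python sketch of OGD with the hinge loss (illustrative); note that, unlike the Perceptron, it also updates on margin errors (0 < yt ŷt < 1):

import numpy as np

def ogd_hinge(X, y, eta):
    """Online gradient descent with hinge loss and squared-norm penalty.
    X is (m, n), y in {-1, +1}^m; returns (w, cumulative hinge loss)."""
    w = np.zeros(X.shape[1])          # w_1 = 0
    L_OGD = 0.0
    for x_t, y_t in zip(X, y):
        y_hat = w @ x_t
        L_OGD += max(0.0, 1.0 - y_t * y_hat)   # hinge loss
        if y_t * y_hat < 1:                    # update on margin errors too
            w = w + eta * y_t * x_t
    return w, L_OGD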
Regret Bound for OGD

Theorem (Based on [G03])
Let R = maxt ∥xt∥ and η := U/(R√m). Then, for any vector u such that
∥u∥ ≤ U, the sequence produced by OGD satisfies

Σ_{t=1}^{m} Lhi(yt, ŷt) − Lhi(yt, u · xt) ≤ √(U^2 R^2 m) = U R √m.   (6)
Regret Bound for OGD – Proof (1)

Proof
Using the convexity of the hinge loss (w.r.t. its 2nd argument), we have

Lhi(yt, ŷt) − Lhi(yt, u · xt) ≤ (wt − u) · zt,   (7)

where zt := −yt xt 1{yt(wt·xt)<1} ∈ ∂w Lhi(yt, wt · xt) (the subdifferential).

From the update wt+1 = wt − ηzt we have

∥wt+1 − u∥^2 = ∥wt − ηzt − u∥^2 = ∥wt − u∥^2 − 2η(wt − u) · zt + η^2 ∥zt∥^2.

Thus

(wt − u) · zt = (1/(2η)) (∥wt − u∥^2 − ∥wt+1 − u∥^2 + η^2 ∥zt∥^2).   (8)
Regret Bound for OGD – Proof (2)

Proof – Continued
From (8) we have

Σ_{t=1}^{m} (wt − u) · zt = Σ_{t=1}^{m} (1/(2η)) (∥wt − u∥^2 − ∥wt+1 − u∥^2 + η^2 ∥zt∥^2)
  ≤ (1/(2η)) (∥u∥^2 + η^2 Σ_{t=1}^{m} ∥zt∥^2)    (telescoping; w1 = 0 and −∥wm+1 − u∥^2 ≤ 0)
  = (1/(2η)) ∥u∥^2 + (η/2) Σ_{t=1}^{m} ∥xt∥^2 1{yt(wt·xt)<1}
  ≤ (1/(2η)) U^2 + (η/2) m R^2
  = √(U^2 R^2 m)    (recall η := U/(R√m))

Lower bounding the L.H.S. with (7) and we are done. ■
Deriving the perceptron algorithm/bound from OGD

Going back to the hinge loss: we can recover the Perceptron bound via OGD.

1. Observe that equation (6) implies

Σ_{t=1}^{m} [yt ≠ sign(ŷt)] − Σ_{t=1}^{m} Lhi(yt, u · xt) ≤ √(U^2 R^2 m),

since on every mistake the hinge loss of the algorithm is at least 1.

2. Now assume there exists a linear classifier u such that yt(u · xt) ≥ 1 for all
t = 1, . . . , m. Thus

Σ_{t=1}^{m} [yt ≠ sign(ŷt)] ≤ √(U^2 R^2 m).

3. Now make OGD conservative, that is, we only update when yt ŷt ≤ 0
(versus yt ŷt ≤ 1), i.e., on trials in which a mistake is made.

4. Thus with respect to the bound we can ignore the trials where a mistake is
not made, so that we can set m = M := Σ_{t=1}^{m} [yt ≠ sign(ŷt)], which implies

M ≤ √(U^2 R^2 M)  ⟹  M ≤ U^2 R^2
OGD Beyond the Hinge Loss

How much does this result depend on our choice of the hinge loss Lhi?
(Spoiler: very little.)

Look back at our class on the subgradient optimization method: do you
see any similarities with what we are doing here?

Consider the following algorithm to minimize a generic loss L:

• start from w1 = 0. Then...
• for t = 1, . . . , m

wt+1 = wt − η zt with zt ∈ ∂w L(yt, wt · xt)

Exercise: can you get a theorem for general OGD? Under what
assumptions on L?
Wrapping Up

• We have considered a supervised learning setting where no
randomness in the data is assumed (it could even be adversarial!).
• We have identified a different goal from the stochastic setting:
having a cumulative error that is close to that of the best model
in the class.
• We first studied the case where we want to leverage the
recommendations of experts.
• We then considered the case of “transforming” our previous
stochastic approaches to supervised learning (e.g. Tikhonov
regularization) to online settings.
Problems – 1

1. Suppose X = {True, False}^n. Give a polynomial-time algorithm
A with a mistake bound M(A) ≤ O(n^2) for any example sequence
which is consistent with a k-literal conjunction. Your answer should
contain an argument that M(A) ≤ O(n^2).
2. State the perceptron convergence theorem [Novikoff], explaining the
relation with the hard margin support vector machine solution.
3. [Hard]: Define the c-regret of a learning algorithm as

c-regret(m) = LA(S) − c · min_{i∈[n]} Li(S)

thus the usual regret is the 1-regret.

3.1 For the weighted majority set-up, argue that without
randomised prediction it is impossible, for all training sequences, to
obtain sublinear c-regret for c < 2.
3.2 Show how to select β to obtain sublinear 2-regret.
4. Consider binary prediction with expert advice, with a perfect expert.
Prove that any algorithm makes at least Ω(min(m, log2 n)) mistakes
in the worst case.
Problems – 2

1. Recall that by tuning the weighted majority we achieved a bound

M ≤ 2.63 · min_i Mi + 2.63 · ln n

Now, by using randomisation in the prediction, design an algorithm
with a bound that has the property

M/m ≤ min_{i∈[n]} Mi/m as m → ∞,

for the weighted majority setting (i.e., the mean prediction error of
the algorithm is bounded by the mean prediction error of the “best”
expert). Recall that m is the number of examples (and the
“tuning” of the algorithm may depend on m). For contrast, compare
this to problem 3.1 above.
Recommended Reading

Chapters 2, 4 and 12 of Cesa-Bianchi, Nicolò, and Gábor Lugosi,
Prediction, Learning, and Games. Cambridge University Press, 2006.
Useful references

1. Nicolò Cesa-Bianchi and Gábor Lugosi, Prediction, Learning, and Games,
(2006). Note this is a book.
2. N. Littlestone, Learning quickly when irrelevant attributes abound: a new
linear threshold algorithm, (1988).
3. N. Littlestone and M. K. Warmuth, The weighted majority algorithm,
(1994).
4. V. Vovk, Aggregating strategies, (1990).
5. Y. Freund and R. Schapire, A decision-theoretic generalization of
on-line learning and an application to boosting, (1997).
6. D. Haussler, J. Kivinen, and M. K. Warmuth, Sequential prediction of
individual sequences under general loss functions, (1998).
7. M. Herbster and M. Warmuth, Tracking the best expert, (1998).
8. J. Kivinen and M. Warmuth, Averaging expert predictions, (1999).
9. S. Shalev-Shwartz, Online Learning and Online Convex Optimization,
(2011).
