Online Learning

COMP0078: Supervised Learning
Carlo Ciliberto
(Slides thanks to Mark Herbster)
Batch versus Online learning

Batch
Model: there exists a training data set (sampled IID)
Aim: to build a classifier from the training data that predicts well on
new data (from the same distribution)
Evaluation metric: generalization error

Online
Model: there exists an online sequence of data (usually with no
distributional assumptions)
Aim: to sequentially predict and update a classifier so as to predict well
on the sequence (i.e., there is no training/test set distinction)
Evaluation metric: cumulative error

Note
There are a variety of models for online learning. Here we focus on the
so-called worst-case model. Alternatively, distributional assumptions may be
made on the data sequence. Also, the phrase “online learning” is sometimes
used to refer to “online optimisation”, that is, to using online-learning-type
algorithms as a training method for a batch classifier.
Why online learning?
Pragmatically
• “Often” fast algorithms
• “Often” small memory footprint
• “Often” no “statistical” assumptions required e.g. IID-ness
• As a training method for “BIG DATA” batch classifiers
Part I
Learning with Expert Advice
On-Line Learning with expert advice (1) [V90,LW94,HKW98]
For t = 1, . . . , m:
Get instance xt ∈ {0, 1}^n (the vector of the n experts’ predictions)
Predict ŷt ∈ {0, 1}
Get label yt ∈ {0, 1}
Incur loss (mistake indicator) |yt − ŷt|

Let’s consider the special setting where there exists at least one expert
that is never wrong...
A Solution: Halving Algorithm

Maintain the set of experts that have been consistent so far. Predict with
the majority vote of the consistent experts (predict 1 if more than half of
them predict 1, else predict 0). After the label is revealed, discard the
experts that were inconsistent on this trial.
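To make the procedure concrete, here is a minimal Python sketch of the Halving Algorithm (the function name `halving` and the array interface are our own choices; we assume the realisable setting above, where at least one expert never errs):

```python
import numpy as np

def halving(expert_predictions, labels):
    """Halving Algorithm: majority vote of the experts consistent so far.

    expert_predictions: (m, n) array with entries in {0, 1}.
    labels: length-m array with entries in {0, 1}.
    Returns the number of mistakes made by the algorithm.
    """
    m, n = expert_predictions.shape
    consistent = np.ones(n, dtype=bool)   # experts with no errors so far
    mistakes = 0
    for t in range(m):
        votes = expert_predictions[t, consistent]
        y_hat = int(votes.sum() > len(votes) / 2)   # majority vote (ties -> 0)
        mistakes += int(y_hat != labels[t])
        # discard every expert that predicted the wrong label this trial
        consistent &= (expert_predictions[t] == labels[t])
    return mistakes
```

Each mistake at least halves the number of consistent experts, which is where the log₂(n) mistake bound for this setting comes from.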
A run of the Halving Algorithm
        E1 E2 E3 E4 E5 E6 E7 E8 | ŷ  y  loss
t = 1:   1  1  0  0  1  1  0  0 | 1  0   1
t = 2:   x  x  0  1  x  x  1  1 | 1  1   0
t = 3:   x  x  x  1  x  x  0  0 | 0  1   1
                  ↑
             consistent

(x marks an expert already eliminated; after trial 3 only E4 remains consistent.)
What if no expert is consistent?
Notation
• Recall L_A(S) := Σ_{t=1}^m |yt − ŷt| is the loss of algorithm A on S
• Denote the loss of the i-th expert Ei as L_i(S) := Σ_{t=1}^m |yt − x_{t,i}|

Aim
Bounds of the form

    L_A(S) ≤ a · min_i L_i(S) + b · ln n

Comment: these are known as “regret” or “worst-case” loss bounds, i.e., bounds
that hold for any sequence (even the “worst case”).
A Solution: Weighted Majority Algorithm [LW94]

Each expert starts with weight 1, and all experts vote with their weight:
predict 1 if the total weight of the experts predicting 1 exceeds the total
weight of those predicting 0, and 0 otherwise. After the label is revealed,
multiply the weight of every expert that erred by β ∈ [0, 1).
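A minimal Python sketch of Weighted Majority (the function name `weighted_majority` is ours, and β = 1/2 is only an illustrative default):

```python
import numpy as np

def weighted_majority(expert_predictions, labels, beta=0.5):
    """Weighted Majority [LW94]: experts vote with their weights.

    expert_predictions: (m, n) array with entries in {0, 1}.
    labels: length-m array with entries in {0, 1}.
    beta: multiplier in [0, 1) applied to the weight of each erring expert.
    """
    m, n = expert_predictions.shape
    w = np.ones(n)        # every expert starts with weight 1
    mistakes = 0
    for t in range(m):
        preds = expert_predictions[t]
        weight_for_1 = w[preds == 1].sum()
        y_hat = int(weight_for_1 > w.sum() / 2)   # weighted majority vote
        mistakes += int(y_hat != labels[t])
        w[preds != labels[t]] *= beta             # downweight erring experts
    return mistakes
```

Note that with β = 0 this reduces to the Halving Algorithm: erring experts are eliminated outright.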
Number of mistakes of the WM algorithm
Let Wt be the total weight at the start of trial t, split between the experts
voting with the weighted majority and the “minority” experts.

• If the weighted majority is right, then the weights of the “minority”
  experts are multiplied by β:

      Wt = Minority + Majority ≥ β · Minority + Majority = Wt+1

• If the weighted majority is wrong (the algorithm makes a mistake), then

      Wt+1 = Minority + β · Majority ≤ (1/2) Wt + β (1/2) Wt      (why?)

  (the minority weighs at most Wt/2, while the majority weighs at least Wt/2).
Number of mistakes of the WM algorithm – Continued-1

Hence, on every trial where the algorithm makes a mistake, Wt+1 ≤ ((1+β)/2) Wt.
Therefore, if M is the total number of mistakes on the sequence, the total
final weight satisfies

    Wm+1 ≤ ((1+β)/2)^M · W1 ,     where W1 = n (the number of experts).
Number of mistakes of the WM algorithm – Continued-2
Since expert i’s final weight is β^{Mi} and β^{Mi} ≤ Wm+1 ≤ ((1+β)/2)^M · n,
taking logarithms and rearranging gives, for every expert i,

    M ≤ (ln(1/β) / ln(2/(1+β))) · Mi + (1 / ln(2/(1+β))) · ln n

Choosing, e.g., β = 1/e makes both constants ≈ 2.63:

    M ≤ 2.63 · min_i Mi + 2.63 · ln n
          (a = 2.63)       (b = 2.63)

For all sequences, the loss of the master algorithm is comparable to the loss
of the best expert.
Refining and generalising the experts model – 1
More generally we would like to obtain regret bounds for arbitrary loss
functions L : Y × Ŷ → [0, +∞]. Making our notion of regret more precise, we
would like guarantees of the form

    L_A(S) − min_i L_i(S) ≤ o(m) .

Why o(m)?
Refining and generalising the experts model – 2

Therefore (1/m) L_A(S) is the average error incurred (so far) by A.

Since o(m)/m → 0 as m → ∞, this implies that asymptotically our algorithm
incurs the same average loss as the best expert.
Refining and generalising the experts model – 3
1. For a loss function L : {0, 1} × [0, 1] → [0, +∞], the entropic loss

       L(y, ŷ) = y ln(y/ŷ) + (1 − y) ln((1 − y)/(1 − ŷ)) .

2. For general losses bounded in [0, 1] (as in the Hedge setting below).

For the first, the regret will be the small constant log(n); for the second,
the regret will be O(√(m log n)).
A regret bound for the entropic (log) loss
WA Algorithm
Initialise: v1 := (1/n, . . . , 1/n), L_WA := 0, L_i := 0 (i ∈ [n])
Input: η ∈ (0, ∞), loss function L : Y × Ŷ → R
For t = 1, . . . , m Do
    Receive instance xt ∈ [0, 1]^n
    Predict ŷt := vt · xt
    Receive label yt ∈ [0, 1]
    Incur loss L_WA := L_WA + L(yt, ŷt),  L_i := L_i + L(yt, x_{t,i}) (i ∈ [n])
    Update v_{t+1,i} := v_{t,i} e^{−η L(yt, x_{t,i})} / Σ_{j=1}^n v_{t,j} e^{−η L(yt, x_{t,j})}   for i ∈ [n]
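A minimal Python sketch of the WA algorithm (the names `weighted_average` and `loss` are ours; any loss L : Y × Ŷ → R can be plugged in):

```python
import numpy as np

def weighted_average(expert_predictions, labels, loss, eta):
    """Weighted Average algorithm: exponential weights over experts.

    expert_predictions: (m, n) array with entries in [0, 1].
    labels: length-m array with entries in [0, 1].
    loss: function loss(y, y_hat) -> float.
    eta: learning rate in (0, inf).
    Returns the cumulative loss of the algorithm's predictions.
    """
    m, n = expert_predictions.shape
    v = np.full(n, 1.0 / n)      # uniform initial weights on the simplex
    total_loss = 0.0
    for t in range(m):
        x_t = expert_predictions[t]
        y_hat = v @ x_t          # predict with the weighted average
        total_loss += loss(labels[t], y_hat)
        # multiplicative update, then renormalise back onto the simplex
        v = v * np.exp(-eta * np.array([loss(labels[t], x) for x in x_t]))
        v /= v.sum()
    return total_loss
```

For the entropic loss with yt ∈ {0, 1} we have loss(1, p) = −ln p and loss(0, p) = −ln(1 − p); expert predictions should then lie strictly inside (0, 1) so the weights stay finite.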
Weighted Average Algorithm - Theorem
Theorem
For all sequences of examples, the Weighted Average algorithm with the
entropic loss and η = 1 satisfies

    L_WA(S) ≤ min_i L_i(S) + ln n .

• For simplicity, we will prove this only for the entropic loss, with
  Y := {0, 1} and Ŷ := [0, 1]. The result holds for many loss functions
  (given sufficient smoothness and convexity, with a different η and a
  constant b multiplying ln n). See [KW99] for a proof.
Weighted Average Algorithm - Proof
Proof – 1
We first prove the following “progress versus regret” equality: for all u ∈ ∆n,

    Len(yt, ŷt) − Σ_{i=1}^n u_i Len(yt, x_{t,i}) = d(u, vt) − d(u, v_{t+1}) .   (1)

Observe that

    d(u, vt) − d(u, v_{t+1}) = Σ_{i=1}^n u_i ln ( v_{t+1,i} / v_{t,i} ) .

Let yt = 1. Then (using Len(1, x) = − ln x)

    Σ_{i=1}^n u_i ln ( v_{t+1,i} / v_{t,i} )
        = Σ_{i=1}^n u_i ln ( e^{−Len(1, x_{t,i})} / Σ_{j=1}^n v_{t,j} e^{−Len(1, x_{t,j})} )
        = Σ_{i=1}^n u_i ln ( x_{t,i} / Σ_{j=1}^n v_{t,j} x_{t,j} )
        = Σ_{i=1}^n u_i ln ( x_{t,i} / ŷt )
        = ( Σ_{i=1}^n u_i ln x_{t,i} ) − ln ŷt
        = Len(yt, ŷt) − Σ_{i=1}^n u_i Len(yt, x_{t,i}) .

(The case yt = 0 is symmetric, using Len(0, x) = − ln(1 − x).)
Proof – 2
Now observe that (1) is a telescoping equality; summing over t we have

    Σ_{t=1}^m Len(yt, ŷt) − Σ_{t=1}^m Σ_{i=1}^n u_i Len(yt, x_{t,i}) = d(u, v1) − d(u, v_{m+1}) .

Since the above holds for any u ∈ ∆n, it holds in particular for the unit
vectors (1, 0, . . . , 0), (0, 1, . . . , 0), . . . , (0, 0, . . . , 1). Upper-bounding via
d(u, v1) ≤ ln n and −d(u, v_{m+1}) ≤ 0 then proves the theorem.
Hedge Algorithm
Hedge was introduced in [FS97], generalising the weighted majority
analysis to the allocation setting.
Allocation setting
On each trial the learner plays an allocation vt ∈ ∆n , then nature
returns a loss vector ℓt . I.e., the loss of expert i is ℓt,i .
Two models for the learner’s play (HA-1, HA-2):
• HA-1: the learner plays the allocation vt itself and incurs the mixture
  loss vt · ℓt.
• HA-2: the learner samples a single expert i with probability v_{t,i} and
  incurs loss ℓ_{t,i}; its expected loss per trial is again vt · ℓt.

• Observe that this setting can simulate the setting where we receive
  side-information xt and have a fixed loss function.
• For the randomised setting (HA-2) the “mechanism” generating the loss
  vectors ℓt must be oblivious to the learner’s selection until trial t + 1.
The Hedge Algorithm (HA) – Summary
Hedge Algorithm (HA-1)
Initialise: v1 := (1/n, . . . , 1/n), L_HA := 0, L_i := 0 (i ∈ [n]); Select: η ∈ (0, ∞)
For t = 1 To m Do
    Predict vt ∈ ∆n
    Receive loss ℓt ∈ [0, 1]^n
    Incur loss L_HA := L_HA + vt · ℓt ,  L_i := L_i + ℓ_{t,i} (i ∈ [n])
    Update weights v_{t+1,i} := v_{t,i} e^{−η ℓ_{t,i}} / Σ_{j=1}^n v_{t,j} e^{−η ℓ_{t,j}}   for i ∈ [n]
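A minimal Python sketch of HA-1 (the function name `hedge` is ours; HA-2 would instead sample a single expert from vt on each trial and incur its loss):

```python
import numpy as np

def hedge(loss_vectors, eta):
    """Hedge (HA-1) [FS97] in the allocation setting.

    loss_vectors: (m, n) array; row t is nature's loss vector, entries in [0, 1].
    eta: learning rate, e.g. sqrt(2 * ln(n) / m) as in the theorem below.
    Returns the cumulative mixture loss sum_t v_t . l_t.
    """
    m, n = loss_vectors.shape
    v = np.full(n, 1.0 / n)      # uniform initial allocation
    total = 0.0
    for t in range(m):
        l_t = loss_vectors[t]
        total += v @ l_t         # HA-1 incurs the mixture loss
        v = v * np.exp(-eta * l_t)
        v /= v.sum()             # renormalise onto the simplex
    return total
```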
Hedge - Theorem
Theorem
For all loss sequences S = ℓ1, . . . , ℓm ∈ [0, 1]^n, the regret of the Hedge
HA-2 algorithm with η = √(2 ln n / m) is

    E[L_HA(S)] − min_i L_i(S) ≤ √(2 m ln n) .
Hedge Theorem - Proof (1)
Proof – 1
We first prove the following “progress versus regret” inequality: for all u ∈ ∆n,

    vt · ℓt − u · ℓt ≤ (1/η)(d(u, vt) − d(u, v_{t+1})) + (η/2) Σ_{i=1}^n v_{t,i} ℓ_{t,i}² .   (2)

Let Zt := Σ_{i=1}^n v_{t,i} exp(−η ℓ_{t,i}), so that v_{t+1,i} = v_{t,i} exp(−η ℓ_{t,i}) / Zt.
Observe that

    d(u, vt) − d(u, v_{t+1}) = Σ_{i=1}^n u_i ln ( v_{t+1,i} / v_{t,i} )
                             = −η Σ_{i=1}^n u_i ℓ_{t,i} − Σ_{i=1}^n u_i ln Zt
                             = −η u · ℓt − ln Σ_{i=1}^n v_{t,i} exp(−η ℓ_{t,i})
                             ≥ −η u · ℓt − ln Σ_{i=1}^n v_{t,i} (1 − η ℓ_{t,i} + ½ η² ℓ_{t,i}²)   (3)
                             = −η u · ℓt − ln (1 − η vt · ℓt + ½ η² Σ_{i=1}^n v_{t,i} ℓ_{t,i}²)
                             ≥ η (vt · ℓt − u · ℓt) − ½ η² Σ_{i=1}^n v_{t,i} ℓ_{t,i}²   (4)

using the inequality e^{−x} ≤ 1 − x + x²/2 for x ≥ 0 in (3) and ln(1 + x) ≤ x
in (4); this demonstrates (2).
Hedge Theorem - Proof (2)
Proof – 2
Summing (2) over t and rearranging we have

    Σ_{t=1}^m (vt · ℓt − u · ℓt) ≤ (1/η)(d(u, v1) − d(u, v_{m+1})) + (η/2) Σ_{t=1}^m Σ_{i=1}^n v_{t,i} ℓ_{t,i}²
                                 ≤ (ln n)/η + (η/2) Σ_{t=1}^m Σ_{i=1}^n v_{t,i} ℓ_{t,i}²   (5)

Since the above holds for any u ∈ ∆n, it holds in particular for the unit
vectors (1, 0, . . . , 0), (0, 1, . . . , 0), . . . , (0, 0, . . . , 1). We then upper-bound by
noting that d(u, v1) ≤ ln n, −d(u, v_{m+1}) ≤ 0, and Σ_{t=1}^m Σ_{i=1}^n v_{t,i} ℓ_{t,i}² ≤ m.
Finally we “tune” by choosing η = √(2 ln n / m), which proves the theorem.

Question: how can we use the above to prove a theorem if the losses now lie
in the range [0, B]?
Comments
Next: So far we have given bounds which grow slowly in the number of experts.
The only significant drawback is potentially computational, if we wish to
work with large classes of experts. With this in mind we may wish to work
with structured sets of experts, for either computational advantages or
advantages in the bounds.

We now consider linear combinations of experts that are themselves linear
classifiers.
Part II
Online learning of linear classifiers
A more general setting (1)
On each trial: alg A receives an instance, makes a prediction, receives the
label, and incurs a loss.
Now
• We consider the case where U is a set of linear threshold functions.
• For simplicity we will focus on the case where there exists a u ∈ U
  such that Loss_u(S) = 0. This is known as the realizable case. (Compare
  with the previously considered halving algorithm versus the weighted
  majority algorithm.)
Perceptron
The Perceptron set-up
There exists a vector v with ∥v∥ = 1 and a margin γ > 0 such that
yt (xt · v) ≥ γ for all (xt, yt).

[Figure: “+” and “−” points in the plane, linearly separated with margin γ
by the hyperplane normal to v.]
The Perceptron learning algorithm
Perceptron Algorithm
Input: (x1, y1), . . . , (xm, ym) with (x, y) ∈ R^n × {−1, 1}
1. Initialise w1 = ⃗0; M1 = 0.
2. For t = 1 to m do
3. Receive pattern: xt ∈ R^n
4. Predict: ŷt = sign(wt · xt )
5. Receive label: yt
6. If mistake (ŷt yt ≤ 0)
• Then Update wt+1 = wt + yt xt ; Mt+1 = Mt + 1
7. Else wt+1 = wt ; Mt+1 = Mt .
8. End For
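A minimal Python sketch of the algorithm above (the function name `perceptron` is ours):

```python
import numpy as np

def perceptron(X, y):
    """Perceptron algorithm with mistake counting.

    X: (m, n) array of patterns; y: length-m array of labels in {-1, 1}.
    Returns the final weight vector and the number of mistakes M.
    """
    m, n = X.shape
    w = np.zeros(n)
    mistakes = 0
    for t in range(m):
        y_hat = np.sign(w @ X[t])      # predict (sign(0) counts as a mistake)
        if y_hat * y[t] <= 0:          # mistake
            w = w + y[t] * X[t]        # move towards the misclassified example
            mistakes += 1
    return w, mistakes
```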
Example: trace for the Perceptron algorithm

[Figures: a sequence of snapshots of the weight vector and decision boundary
after successive mistake-driven updates.]
Bound on number of mistakes

Theorem: if there exists v with ∥v∥ = 1 and γ > 0 such that yt (xt · v) ≥ γ
for all t, then the Perceptron makes at most M ≤ R²/γ² mistakes, where
R := maxt ∥xt∥.
Pythagorean Lemma
On a mistake trial,

    ∥wt+1∥² = ∥wt + yt xt∥²
            = ∥wt∥² + 2 (wt · xt) yt + ∥xt∥²
            ≤ ∥wt∥² + ∥xt∥²

since (wt · xt) yt ≤ 0 whenever a mistake is made.
Upper bound on ∥wt ∥
Lemma: ∥wt∥² ≤ Mt R²
Proof: by induction.
• Claim: ∥wt∥² ≤ Mt R²
• Base: M1 = 0, ∥w1∥² = 0
• Induction step (assume for t and prove for t + 1) when we have a mistake on
  trial t:

      ∥wt+1∥² ≤ ∥wt∥² + ∥xt∥² ≤ Mt R² + R² = (Mt + 1) R² = Mt+1 R²

  (on a non-mistake trial wt+1 = wt and Mt+1 = Mt, so the claim carries over).
Lower bound on ∥wt ∥
Lemma: Mt γ ≤ ∥wt ∥
Observe: ∥wt ∥ ≥ wt · v because ∥v∥ = 1. (Cauchy-Schwarz)
We prove a lower bound on wt · v using induction over t
• Claim: wt · v ≥ Mt γ
• Base: t = 1, w1 · v = 0
• Induction step (assume for t and prove for t + 1):
If mistake (Mt+1 = Mt + 1) then

    wt+1 · v = (wt + yt xt) · v
             = wt · v + yt xt · v
             ≥ Mt γ + γ
             = (Mt + 1)γ
Combining the upper and lower bounds
    (Mγ)² ≤ ∥wm+1∥² ≤ M R²   ⟹   M ≤ R²/γ²

with R := maxt ∥xt∥, whenever there exists a vector v with ∥v∥ = 1 and a
constant γ > 0 such that (v · xt) yt ≥ γ for all t.
Comments
• It is often convenient to express the bound in the following form. Here
  define u := v/γ; then

      M ≤ R² ∥u∥²      (for any u with (u · xt) yt ≥ 1 for all t).
Going Deeper: Regret Bounds for Linear Separation

In the batch setting we would solve the regularised problem

    h* = arg min_{h∈H}  Σ_{t=1}^m L(yt, h(xt)) + λ · penalty(h)
Online Approach
A possible strategy is, every time we see a new sample (xt+1, yt+1), to
produce a new ht+1 such that

    ht+1 = arg min_{h∈H}  L(yt+1, h(xt+1)) + λ ∥h − ht∥²

i.e., fit the newest sample while staying close to the current hypothesis.
Online Gradient Descent with Hinge Loss and ∥·∥₂² penalty

Solving for the update (taking the “derivative” and setting it to zero)
corresponds to choosing wt+1 as follows:

    wt+1 = wt                      if yt (wt · xt) > 1
    wt+1 = wt + (yt xt) / (2λ)     if yt (wt · xt) < 1
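As a sanity check, here is the derivation written out; a sketch assuming the update solves the per-trial objective from the previous slide, min_w Lhi(yt, w · xt) + λ∥w − wt∥²:

```latex
\[
w_{t+1} = \arg\min_{w}\; L_{\mathrm{hi}}(y_t,\, w \cdot x_t) + \lambda\,\lVert w - w_t\rVert^2
\quad\Longrightarrow\quad
0 \in \partial_w L_{\mathrm{hi}}(y_t,\, w \cdot x_t) + 2\lambda\,(w - w_t).
\]
If $y_t (w \cdot x_t) < 1$ the hinge is active with gradient $-y_t x_t$, giving
\[
-\,y_t x_t + 2\lambda\,(w_{t+1} - w_t) = 0
\quad\Longrightarrow\quad
w_{t+1} = w_t + \frac{y_t x_t}{2\lambda};
\]
if $y_t (w \cdot x_t) > 1$ the hinge term is flat and $w_{t+1} = w_t$.
```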
OGD with Hinge Loss and ∥·∥₂² penalty

OGD Algorithm
Initialise: w1 := 0, L_OGD := 0
Select: η ∈ (0, ∞)   (interpretation: η = 1/(2λ))
For t = 1 To m Do
    Receive instance xt ∈ R^n
    Predict ŷt := wt · xt
    Receive label yt ∈ {−1, 1}
    Incur loss L_OGD := L_OGD + Lhi(yt, ŷt)
    Update weights wt+1 := wt + 1{yt ŷt < 1} η yt xt
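A minimal Python sketch of the algorithm (the function name `ogd_hinge` is ours; η = U/(R√m), as in the regret bound on the next slide, is one natural step size):

```python
import numpy as np

def ogd_hinge(X, y, eta):
    """Online gradient descent with the hinge loss.

    X: (m, n) array of instances; y: length-m array of labels in {-1, 1}.
    eta: step size in (0, inf).
    Returns the cumulative hinge loss of the online predictions.
    """
    m, n = X.shape
    w = np.zeros(n)
    total_hinge = 0.0
    for t in range(m):
        y_hat = w @ X[t]                                # predict
        total_hinge += max(0.0, 1.0 - y[t] * y_hat)     # hinge loss
        if y[t] * y_hat < 1:                            # margin violated
            w = w + eta * y[t] * X[t]                   # gradient step
    return total_hinge
```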
Regret Bound for OGD
Theorem
For all sequences with ∥xt∥ ≤ R (t ∈ [m]) and all u with ∥u∥ ≤ U, OGD with
η := U/(R√m) satisfies

    L_OGD(S) − Σ_{t=1}^m Lhi(yt, u · xt) ≤ U R √m .
Regret Bound for OGD – Proof (1)
Proof
Using the convexity of the hinge loss (w.r.t. its second argument), we have

    Lhi(yt, wt · xt) − Lhi(yt, u · xt) ≤ (wt − u) · zt

where

    zt := −yt xt 1{yt (wt · xt) < 1} ∈ ∂w Lhi(yt, wt · xt)    (a subgradient).

Thus, since wt+1 = wt − η zt,

    (wt − u) · zt = (1/(2η)) ( ∥wt − u∥² − ∥wt+1 − u∥² + η² ∥zt∥² ) .   (8)
Regret Bound for OGD – Proof (2)
Proof – Continued
From (8), summing over t and telescoping (w1 = 0, so ∥w1 − u∥² = ∥u∥², and
−∥wm+1 − u∥² ≤ 0) we have

    Σ_{t=1}^m (wt − u) · zt = Σ_{t=1}^m (1/(2η)) ( ∥wt − u∥² − ∥wt+1 − u∥² + η² ∥zt∥² )
                            ≤ (1/(2η)) ( ∥u∥² + η² Σ_{t=1}^m ∥zt∥² )
                            = (1/(2η)) ∥u∥² + (η/2) Σ_{t=1}^m ∥xt∥² 1{yt (wt · xt) < 1}
                            ≤ (1/(2η)) U² + (η/2) m R²
                            = U R √m      (recall η := U/(R√m))
Deriving the perceptron algorithm/bound from OGD
Going back to the hinge loss: we can recover the perceptron bound via OGD.

1. On every mistake trial (yt ≠ sign(ŷt)) we have yt ŷt ≤ 0, so the hinge
   loss is at least 1; hence the number of mistakes is at most L_OGD(S).
2. Now assume there exists a linear classifier u such that yt (u · xt) ≥ 1 for
   all t = 1, . . . , m, so that Σ_{t=1}^m Lhi(yt, u · xt) = 0. Thus,

    Σ_{t=1}^m 1[yt ≠ sign(ŷt)] ≤ √(U² R² m) .
OGD Beyond the Hinge Loss
How much does this result depend on our choice of the Hinge loss Lhi ?
(Spoiler: very little)
Exercise: can you get a theorem for general OGD? Under what
assumptions on L?
Wrapping Up
Problems – 1
for the weighted majority setting (i.e., the mean prediction error of the
algorithm is bounded by the mean prediction error of the “best” expert),
recalling that m is the number of examples (and that the “tuning” of the
algorithm may depend on m). For contrast, compare this to problem 3.1 above.
Recommended Reading

Useful references
[V90] V. Vovk. Aggregating strategies. COLT 1990.
[LW94] N. Littlestone and M. K. Warmuth. The weighted majority algorithm. Information and Computation, 108(2):212–261, 1994.
[FS97] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.
[HKW98] D. Haussler, J. Kivinen, and M. K. Warmuth. Sequential prediction of individual sequences under general loss functions. IEEE Transactions on Information Theory, 44(5):1906–1925, 1998.
[KW99] J. Kivinen and M. K. Warmuth. Averaging expert predictions. EuroCOLT 1999.