EE 6106: Online Learning and Optimisation

Homework 1
Questions
1. Prove the following inequalities:
   ln x ≤ x − 1 for x > 0.
   1 − ax ≥ (1 − a)^x for 0 ≤ a, x ≤ 1.
   − ln(1 − x) ≤ 1 + 2x for 0 < x < 0.5.
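A quick numerical sanity check of these three bounds (a grid check in Python, not a substitute for a proof; the grids and tolerances are arbitrary choices):

import numpy as np

# Numerical sanity check (not a proof) of the three inequalities on a grid.
x = np.linspace(0.01, 5.0, 500)
assert np.all(np.log(x) <= x - 1 + 1e-12)            # ln x <= x - 1 for x > 0

a = np.linspace(0.0, 1.0, 101)
u = np.linspace(0.0, 1.0, 101)
A, U = np.meshgrid(a, u)
assert np.all(1 - A * U >= (1 - A) ** U - 1e-12)      # 1 - au >= (1 - a)^u on [0, 1]^2

v = np.linspace(0.001, 0.499, 499)
assert np.all(-np.log(1 - v) <= 1 + 2 * v)            # -ln(1 - v) <= 1 + 2v for 0 < v < 0.5
print("all three inequalities hold on the test grids")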

2. For T rounds, you have to predict a Boolean variable. In each round, after you make
your prediction, the ‘true’ value is revealed. We are interested in the best strategy
defined as the one that minimizes the number of mistakes in the worst case. Clearly,
in the worst case, a deterministic strategy can have T errors. What is the optimal
randomised strategy? This suggests that in the ‘prediction with experts’ setting,
counting the total number of mistakes is not a meaningful benchmark. Why?
3. Let Xt be a binary 0/1 sequence. There are N experts, and expert i predicts Y_{i,t} for Xt. Recall the MAJORITY algorithm, which predicts

   X̂t = 1{ Σ_i Y_{i,t} ≥ Σ_i (1 − Y_{i,t}) },

   where 1{·} is the indicator function. In class, we showed that if there is a perfect expert, the regret is upper bounded by log₂ N. Now suppose instead that the best expert makes up to m mistakes.
   (a) Assume that it is known in advance that the best expert will make up to m mistakes. Describe a suitable modification of the algorithm described in class and obtain a bound on the regret for this case.
   (b) Now assume that you do not know that there is such a best expert. What algorithm would you use, and what regret would you get? Compare with the result from the previous part.
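A minimal Python sketch of this majority vote; the optional restriction of the vote to experts that have not yet erred is the halving variant behind the log₂ N bound, and the function name and tie-breaking rule are illustrative choices rather than part of the problem:

import numpy as np

def run_majority(Y, X, discard_wrong=True):
    """Y: T x N 0/1 array of expert predictions; X: length-T true 0/1 sequence.
    Predicts by majority vote each round. With discard_wrong=True the vote is
    restricted to experts that have made no mistake so far (the halving variant,
    which makes at most log2(N) mistakes when a perfect expert exists)."""
    T, N = Y.shape
    alive = np.ones(N, dtype=bool)
    mistakes = 0
    for t in range(T):
        pool = Y[t, alive] if (discard_wrong and alive.any()) else Y[t]
        pred = 1 if pool.sum() >= (1 - pool).sum() else 0   # ties broken toward 1
        mistakes += int(pred != X[t])
        if discard_wrong:
            alive &= (Y[t] == X[t])
    return mistakes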
4. Let Xt be a sequence of IID Bernoulli(p) random variables with p > 0.5.
(a) Consider the oracle which knows the value of p. What should be the oracle’s
strategy and what is the expected number of mistakes it makes in t rounds?
(b) A forecasting algorithm that does not know p applies the majority algorithm to the past data to predict X̂t. Specifically,

   X̂t = 1{ Σ_{s=1}^{t−1} Xs ≥ Σ_{s=1}^{t−1} (1 − Xs) },

i.e., the forecast for round t is the majority value over rounds 1, …, t − 1, with randomisation if there is no majority. First, obtain the probability of the forecaster making a mistake in round t. Then characterise the expected number of mistakes that the forecaster makes in t rounds. Also characterise the regret, computed as the difference between the expected number of mistakes of the forecaster and that of the oracle.
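A small simulation sketch of part (b); the values of p and T, the tie-breaking convention, and the assumption that the oracle always predicts 1 (one natural strategy for p > 0.5) are illustrative choices:

import numpy as np

rng = np.random.default_rng(0)

def simulate(p=0.7, T=1000):
    """Majority-of-the-past forecaster vs. an oracle that always predicts 1 (p > 0.5)."""
    X = rng.binomial(1, p, size=T)
    forecaster_mistakes = oracle_mistakes = 0
    ones_so_far = 0
    for t in range(T):
        if 2 * ones_so_far > t:        # strict majority of the past t values is 1
            pred = 1
        elif 2 * ones_so_far < t:      # strict majority of the past t values is 0
            pred = 0
        else:                          # tie (including the first round): randomise
            pred = int(rng.integers(2))
        forecaster_mistakes += int(pred != X[t])
        oracle_mistakes += int(X[t] == 0)
        ones_so_far += int(X[t])
    return forecaster_mistakes, oracle_mistakes

print(simulate())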
5. Here is a different kind of online learning problem. x1, …, xn are Boolean variables; let x := [x1, …, xn]. A function f(x) = x_{i1} ∨ x_{i2} ∨ · · · ∨ x_{ir} is to be learned as follows.
1. An example x is revealed. This is called an instance of x.
2. The algorithm predicts a value f̂(x).
3. The true value f(x) is revealed, and the algorithm uses it to update its prediction rule.
Consider the following learning/prediction algorithm.
1. Initialise w1, …, wn to 1.
2. Receive x and predict f̂(x) = 1 if

   w1 x1 + w2 x2 + · · · + wn xn ≥ n,

   and predict f̂(x) = 0 otherwise.
3. Receive the true value f(x). If f(x) ≠ f̂(x), then update the weights as follows.
   (a) If f̂ = 0 and f = 1, then for each xi = 1 in x, double wi.
   (b) If f̂ = 1 and f = 0, then for each xi = 1 in x, halve wi.
4. Go to step 2 above.
Prove the claim that the algorithm makes at most 2 + 3r(1 + log n) mistakes. You can obtain this by separately bounding the number of mistakes of each of the two types above.
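A direct Python transcription of this learning rule, with one worked round appended for illustration; the helper names and the example target are not part of the problem:

import numpy as np

def predict(w, x, n):
    """Predict 1 iff the weighted sum of the active variables reaches the threshold n."""
    return 1 if np.dot(w, x) >= n else 0

def update(w, x, f_true, f_hat):
    """On a mistake, double (false negative) or halve (false positive) the
    weights of the coordinates with x_i = 1; otherwise leave w unchanged."""
    if f_hat == 0 and f_true == 1:
        w[x == 1] *= 2.0
    elif f_hat == 1 and f_true == 0:
        w[x == 1] /= 2.0
    return w

# One illustrative round with n = 4 and target f(x) = x1 OR x3:
n = 4
w = np.ones(n)
x = np.array([1, 0, 0, 0])
f_true = 1                          # x1 = 1, so the disjunction is true
f_hat = predict(w, x, n)            # 1*1 + 0 + 0 + 0 >= 4 fails, so f_hat = 0: a mistake
w = update(w, x, f_true, f_hat)     # w becomes [2, 1, 1, 1]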
6. In the EWMA algorithm discussed in class, we chose the value of the learning parameter β = √((8 ln K)/T), where we needed to know T, the maximum number of rounds. If we don't know T, then we can use a time-varying learning parameter βt = √((8 ln K)/t). Rework the proof and obtain the regret at time t with this choice of time-varying βt.
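One way the time-varying choice might look in code, assuming the exponentially-weighted-average form of EWMA with expert losses in [0, 1]; the exact weighting convention used in class may differ:

import numpy as np

def ewma_time_varying(losses):
    """losses: T x K array of expert losses, each entry in [0, 1].
    Each round the forecaster averages the experts with weights proportional to
    exp(-beta_t * cumulative loss), where beta_t = sqrt(8 ln K / t) is
    recomputed from the current round index t."""
    T, K = losses.shape
    cum_loss = np.zeros(K)
    total_loss = 0.0
    for t in range(1, T + 1):
        beta_t = np.sqrt(8.0 * np.log(K) / t)
        w = np.exp(-beta_t * cum_loss)
        p = w / w.sum()
        total_loss += p @ losses[t - 1]    # forecaster's (expected) loss in round t
        cum_loss += losses[t - 1]
    return total_loss, cum_loss.min()      # forecaster's total loss, best expert's loss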
7. In the EWMA algorithm discussed in class, we chose to use Hoeffding’s Lemma
in calculating W_{t+1}/W_t. Recall that there are other concentration inequalities that
provide tighter bounds when additional information about the random variables is
available. Specifically, investigate Bernstein’s Inequality and describe the conditions
under which this can be used in deriving the regret bounds. Work out the regret for
this case.
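For reference, one standard form of Bernstein's inequality (the variant in the course notes may differ slightly): if X1, …, Xn are independent with E[Xi] = 0 and |Xi| ≤ M almost surely, then for every s > 0,

   P( Σ_{i=1}^n Xi > s ) ≤ exp( − (s²/2) / ( Σ_{i=1}^n E[Xi²] + Ms/3 ) ).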

8. Consider the following shooting game: you are shooting into a unit square and you have to “predict” the point of impact, say X̂t, of the t-th attempt. Let Xt be the actual location of the hit. For every X̂t there is a cost ‖X̂t − Xt‖².
First consider the offline problem, or the one-shot problem: the sequence Xt for t = 1, …, T is available to you and you have to choose the best fixed point, i.e., x is fixed independent of t:

   X*_T := arg min_{x ∈ [0,1]²} Σ_{t=1}^T ‖Xt − x‖².

Prove the claim that the explicit formula for X*_T is

   X*_T = (1/T) Σ_{t=1}^T Xt,

i.e., the constant prediction X̂t = X*_T minimizes the total cost.


Now convert this to an online question: predict the point before every shot such that the cumulative error is minimized. Specifically, the objective is

   min Σ_{t=1}^T ‖X̂t − Xt‖²,

with X̂t to be determined before the t-th shot using X1, …, X_{t−1}.


Consider the “Follow the Leader (FTL)” prediction algorithm

   X̂t = X*_{t−1} = arg min_{x ∈ [0,1]²} Σ_{s=1}^{t−1} ‖Xs − x‖² = (1/(t−1)) Σ_{s=1}^{t−1} Xs.

Our interest is to determine how this compares with the “offline” algorithm. Specifically, what is the cumulative loss of the online algorithm compared to the full-information offline case, i.e., what is

   Σ_{t=1}^T ‖X̂t − Xt‖² − Σ_{t=1}^T ‖X*_T − Xt‖² = Σ_{t=1}^T ‖X*_{t−1} − Xt‖² − Σ_{t=1}^T ‖X*_T − Xt‖².

First, using induction, prove the following lemma:

   Σ_{t=1}^T ‖X*_t − Xt‖² ≤ Σ_{t=1}^T ‖X*_T − Xt‖².   (1)

Since

   LT := Σ_{t=1}^T ‖X̂t − Xt‖² − Σ_{t=1}^T ‖X*_T − Xt‖²,

use the previous lemma (1) and the identity ‖x‖² − ‖y‖² = ⟨x + y, x − y⟩ to show that

   LT ≤ Σ_{t=1}^T ⟨X*_{t−1} − X*_t, X*_{t−1} + X*_t − 2Xt⟩.

Next, use the Cauchy-Schwarz inequality |⟨x, y⟩| ≤ ‖x‖ · ‖y‖ and the triangle inequality to show that

   LT ≤ 4 Σ_{t=1}^T ‖X*_{t−1} − X*_t‖,

using the upper bound for ‖ · ‖ in the second term.

Then use the formula for X*_t, the triangle inequality, and the upper bound for ‖ · ‖ to show that

   ‖X*_{t−1} − X*_t‖ ≤ 2/t.
Finally, show that for any sequence X1, …, XT in the unit square, the FTL algorithm has a loss relative to the offline optimum of at most 8(1 + ln T).
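A short simulation sketch of FTL on this game; the uniform random data and the choice of the first prediction (which the problem leaves unspecified) are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(1)

def ftl_excess_loss(X):
    """X: T x 2 array of hit locations in the unit square.
    Returns the cumulative squared loss of FTL (predict the mean of the past
    hits) minus the loss of the best fixed point in hindsight (the overall mean)."""
    T = X.shape[0]
    ftl_loss = 0.0
    running_sum = np.zeros(2)
    for t in range(T):
        # The first round has no history; the problem leaves that prediction open,
        # so the centre of the square is used here.
        x_hat = running_sum / t if t > 0 else np.array([0.5, 0.5])
        ftl_loss += np.sum((x_hat - X[t]) ** 2)
        running_sum += X[t]
    offline_loss = np.sum((X - X.mean(axis=0)) ** 2)
    return ftl_loss - offline_loss

X = rng.random((1000, 2))
print(ftl_excess_loss(X), 8 * (1 + np.log(1000)))   # observed excess loss vs. the 8(1 + ln T) bound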
9. Recall the Weighted Majority Algorithm (WMA) for learning to predict a binary sequence using K experts. Here, recall the notation that LT is the number of mistakes made by the algorithm in T rounds, and L_{i,T} is the number of mistakes made by expert i in T rounds.
1. Prove the useful inequality that −β ≥ ln(1 − β) ≥ −β − β².
2. Using this inequality, tighten the upper bound to

   LT ≤ 2(1 + β) min_i L_{i,T} + ln(K)/β.
3. Argue that the factor 2 in the above bound cannot be eliminated in a deterministic algorithm.
4. Consider the Hedge algorithm to predict a binary sequence. Here, an expert is chosen with probability proportional to the weights, and the prediction of the algorithm is the prediction of this chosen expert. The weights are updated as in WMA. Show that the expected loss is upper bounded as

   LT ≤ (1 + β) min_i L_{i,T} + ln(K)/β.
5. Argue that this bound is tight by considering the case where the experts predict by tossing a fair coin and the sequence being predicted is an i.i.d. Bernoulli(0.5) sequence.
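A compact Python sketch covering both the Weighted Majority Algorithm and the Hedge variant, assuming the update that multiplies an erring expert's weight by (1 − β); the exact parametrisation used in class may differ:

import numpy as np

rng = np.random.default_rng(2)

def weighted_majority(Y, X, beta=0.1, hedge=False):
    """Y: T x K 0/1 expert predictions; X: length-T true 0/1 sequence.
    WMA (hedge=False): predict by weighted vote. Hedge (hedge=True): follow one
    expert drawn with probability proportional to its weight. In both cases the
    weight of every expert that errs in a round is multiplied by (1 - beta)."""
    T, K = Y.shape
    w = np.ones(K)
    algo_mistakes = 0
    for t in range(T):
        if hedge:
            chosen = rng.choice(K, p=w / w.sum())
            pred = Y[t, chosen]
        else:
            pred = 1 if w @ Y[t] >= w @ (1 - Y[t]) else 0
        algo_mistakes += int(pred != X[t])
        w[Y[t] != X[t]] *= (1 - beta)
    best_expert_mistakes = np.sum(Y != X[:, None], axis=0).min()
    return algo_mistakes, best_expert_mistakes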
