
CS 747, Autumn 2023: Lecture 4

Shivaram Kalyanakrishnan

Department of Computer Science and Engineering


Indian Institute of Technology Bombay

Autumn 2023



Multi-armed Bandits
The exploration-exploitation dilemma
Definitions: Bandit, Algorithm
ϵ-greedy algorithms
Evaluating algorithms: Regret
Achieving sub-linear regret
A lower bound on regret
UCB, KL-UCB algorithms
Thompson Sampling algorithm

Understanding Thompson Sampling


Concentration bounds

Analysis of UCB
Other bandit problems


Thompson Sampling (Thompson, 1933)
- At time t, arm a has s_a^t successes (1’s) and f_a^t failures (0’s).
- Beta(s_a^t + 1, f_a^t + 1) represents a “belief” about p_a.

[Figure: the Beta(s_a^t + 1, f_a^t + 1) density plotted over p_a ∈ [0, 1].]

- Computational step: For every arm a, draw a sample
  x_a^t ∼ Beta(s_a^t + 1, f_a^t + 1).
- Sampling step: Pull an arm a for which x_a^t is maximum.
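As a concrete illustration of these two steps, here is a minimal Python sketch (my own, not from the slides) for Bernoulli rewards; the pull(a) callback and the example arm means are hypothetical:

    import numpy as np

    def thompson_sampling(pull, n_arms, horizon, seed=0):
        # pull(a) is assumed to return a 0/1 reward for arm a.
        rng = np.random.default_rng(seed)
        successes = np.zeros(n_arms)   # s_a^t: 1-rewards observed so far per arm
        failures = np.zeros(n_arms)    # f_a^t: 0-rewards observed so far per arm
        total_reward = 0
        for t in range(horizon):
            # Computational step: one Beta(s+1, f+1) sample per arm.
            samples = rng.beta(successes + 1, failures + 1)
            # Sampling step: pull an arm with the maximum sample.
            a = int(np.argmax(samples))
            r = pull(a)
            successes[a] += r
            failures[a] += 1 - r
            total_reward += r
        return total_reward

    # Usage on a toy 3-armed instance with (made-up) means 0.3, 0.5, 0.7.
    means = [0.3, 0.5, 0.7]
    env_rng = np.random.default_rng(1)
    print(thompson_sampling(lambda a: int(env_rng.random() < means[a]), 3, 10_000))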


Bayesian Inference
Bayes’ Rule of Probability for events A and B:

P{A|B} = P{B|A} P{A} / P{B}.

Application: there is an unknown world w from among possible worlds W, in which we live.
We maintain a belief distribution over w ∈ W.
Belief_0(w) = P{w}.
The process by/probability with which each w produces evidence e is known.
Evidence samples e_1, e_2, ..., e_m are produced i.i.d. by the unknown world w.
How to refine our belief distribution based on incoming evidence?
Belief_m(w) = P{w | e_1, e_2, ..., e_m}.


Bayesian Inference

Belief_{m+1}(w) = P{w | e_1, e_2, ..., e_{m+1}}

                = P{e_1, e_2, ..., e_{m+1} | w} P{w} / P{e_1, e_2, ..., e_{m+1}}

                = P{e_1, e_2, ..., e_m | w} P{e_{m+1} | w} P{w} / P{e_1, e_2, ..., e_{m+1}}

                = P{e_1, e_2, ..., e_m, w} P{e_{m+1} | w} / P{e_1, e_2, ..., e_{m+1}}

                = P{w | e_1, e_2, ..., e_m} P{e_1, e_2, ..., e_m} P{e_{m+1} | w} / P{e_1, e_2, ..., e_{m+1}}

                = Belief_m(w) P{e_{m+1} | w} / Σ_{w' ∈ W} Belief_m(w') P{e_{m+1} | w'}.
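The same update in code, for a finite set of worlds (a small sketch of my own; the coin-bias worlds and the evidence sequence are made up for illustration):

    def update_belief(belief, likelihood, evidence):
        # Belief_{m+1}(w) = Belief_m(w) * P{e_{m+1} | w} / sum over w' of the same.
        unnormalized = {w: belief[w] * likelihood(evidence, w) for w in belief}
        z = sum(unnormalized.values())
        return {w: p / z for w, p in unnormalized.items()}

    # Hypothetical example: the unknown world is a coin bias in {0.2, 0.5, 0.8}.
    worlds = [0.2, 0.5, 0.8]
    belief = {w: 1.0 / len(worlds) for w in worlds}    # Belief_0: uniform prior
    bernoulli = lambda e, w: w if e == 1 else 1 - w    # P{e | w} for a 0/1 sample
    for e in [1, 1, 0, 1, 1]:                          # i.i.d. evidence samples
        belief = update_belief(belief, bernoulli, e)
    print(belief)    # mass shifts toward worlds most consistent with the evidence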


Bayesian Inference in Thompson Sampling
View each arm a’s mean p_a as world w, estimated from rewards (evidence).
Belief_0 over p_a is typically set to Uniform(0, 1), but need not be.
If e_{m+1} is a 1-reward, we must set, for x ∈ [0, 1],

Belief_{m+1}(x) = Belief_m(x) · x / ∫_{y=0}^{1} Belief_m(y) · y dy.

If e_{m+1} is a 0-reward, we must set, for x ∈ [0, 1],

Belief_{m+1}(x) = Belief_m(x) · (1 − x) / ∫_{y=0}^{1} Belief_m(y) · (1 − y) dy.

We achieve exactly that by taking

Belief_m(x) = Beta_{s+1, f+1}(x)

when the first m pulls yield s 1’s and f 0’s!
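A quick numerical sanity check of the last claim (my own sketch, assuming SciPy is available): apply the 1-reward update to a discretized Beta(s+1, f+1) density and compare it against the Beta(s+2, f+1) density.

    import numpy as np
    from scipy.stats import beta    # assumption: SciPy is installed

    s, f = 3, 2                     # suppose the first m = 5 pulls gave 3 ones, 2 zeros
    x = np.linspace(0.0, 1.0, 10001)
    belief_m = beta.pdf(x, s + 1, f + 1)

    # 1-reward update: multiply by x, then renormalize over [0, 1].
    unnormalized = belief_m * x
    belief_m1 = unnormalized / (unnormalized.sum() * (x[1] - x[0]))

    # The conjugacy claim: this should be (numerically close to) Beta(s+2, f+1).
    print(np.max(np.abs(belief_m1 - beta.pdf(x, s + 2, f + 1))))   # small discretization error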


Principle of Selecting Arm to Pull
We have a belief distribution for each arm’s mean.
Together, these distributions represent a belief distribution over bandit instances.
We sample a bandit instance I from the joint belief distribution, and
We act optimally w.r.t. I.

Alternative view: the probability with which we pick an arm is our belief that it is optimal. For example, if A = {1, 2}, the probability of pulling arm 1 is

P{x_1^t > x_2^t} = ∫_{x_1=0}^{1} ∫_{x_2=0}^{x_1} Beta_{s_1^t+1, f_1^t+1}(x_1) Beta_{s_2^t+1, f_2^t+1}(x_2) dx_2 dx_1.
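This double integral is easy to approximate by Monte Carlo (a sketch with made-up counts, not from the slides): the fraction of paired Beta draws in which arm 1’s sample exceeds arm 2’s estimates the probability that Thompson Sampling pulls arm 1 at this step.

    import numpy as np

    rng = np.random.default_rng(0)
    s1, f1 = 6, 2    # hypothetical counts for arm 1 at time t
    s2, f2 = 4, 4    # hypothetical counts for arm 2 at time t

    n = 1_000_000
    x1 = rng.beta(s1 + 1, f1 + 1, size=n)
    x2 = rng.beta(s2 + 1, f2 + 1, size=n)
    print(np.mean(x1 > x2))    # estimate of P{x_1^t > x_2^t}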


Multi-armed Bandits

1. Understanding Thompson Sampling

2. Concentration bounds



Hoeffding’s Inequality (Hoeffding, 1963)
Let X be a random variable bounded in [0, 1], with E[X] = µ;
Let u ≥ 1;
Let x_1, x_2, ..., x_u be i.i.d. samples of X; and
Let x̄ be the mean of these samples (an empirical mean):

x̄ = (1/u) Σ_{i=1}^{u} x_i.

Then, for any fixed ϵ > 0, we have

P{x̄ ≥ µ + ϵ} ≤ e^{−2uϵ²}, and
P{x̄ ≤ µ − ϵ} ≤ e^{−2uϵ²}.

Note the bounds are trivial for large ϵ, since x̄ ∈ [0, 1].
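A small simulation (my own sketch, with an arbitrarily chosen Bernoulli X) comparing the observed deviation probability against the bound e^{−2uϵ²}:

    import numpy as np

    rng = np.random.default_rng(0)
    mu, u, eps, trials = 0.4, 50, 0.1, 200_000
    xbar = rng.binomial(u, mu, size=trials) / u    # empirical means over repeated experiments
    print("observed P{xbar >= mu + eps}:", np.mean(xbar >= mu + eps))
    print("Hoeffding bound:             ", np.exp(-2 * u * eps ** 2))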


Applications
For given mistake probability δ and tolerance ϵ, how many samples u_0 of X do we need to guarantee that with probability at least 1 − δ, the empirical mean x̄ will not exceed the true mean µ by ϵ or more?

u_0 = ⌈(1/(2ϵ²)) ln(1/δ)⌉ pulls are sufficient, since Hoeffding’s Inequality gives

P{x̄ ≥ µ + ϵ} ≤ e^{−2u_0 ϵ²} ≤ δ.

We have u samples of X. How do we fill up this blank?:

With probability at least 1 − δ, the empirical mean x̄ exceeds the true mean µ by at most ϵ_0 = ____.

We can write ϵ_0 = √((1/(2u)) ln(1/δ)); by Hoeffding’s Inequality:

P{x̄ ≥ µ + ϵ_0} ≤ e^{−2u(ϵ_0)²} ≤ δ.
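Both answers translate directly into code (a short sketch; the example values of ϵ, δ, and u are made up):

    import math

    def samples_needed(eps, delta):
        # Smallest u_0 with e^{-2 u_0 eps^2} <= delta.
        return math.ceil(math.log(1 / delta) / (2 * eps ** 2))

    def tolerance_achieved(u, delta):
        # Smallest eps_0 with e^{-2 u eps_0^2} <= delta, given u samples.
        return math.sqrt(math.log(1 / delta) / (2 * u))

    print(samples_needed(0.05, 0.01))      # eps = 0.05, delta = 0.01 -> 922
    print(tolerance_achieved(1000, 0.01))  # u = 1000, delta = 0.01 -> about 0.048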


Arbitrary Bounded Range
Suppose X is a random variable bounded in [a, b]. Can we still apply Hoeffding’s Inequality?
Yes. Assume u; x_1, x_2, ..., x_u; ϵ as defined earlier.

Consider Y = (X − a)/(b − a); for 1 ≤ i ≤ u, y_i = (x_i − a)/(b − a); ȳ = (1/u) Σ_{i=1}^{u} y_i.

Since Y is bounded in [0, 1], we get

P{x̄ ≥ µ + ϵ} = P{ȳ ≥ (µ − a)/(b − a) + ϵ/(b − a)} ≤ e^{−2uϵ²/(b−a)²}, and
P{x̄ ≤ µ − ϵ} = P{ȳ ≤ (µ − a)/(b − a) − ϵ/(b − a)} ≤ e^{−2uϵ²/(b−a)²}.
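In code, the rescaling just replaces ϵ by ϵ/(b − a) in the exponent (a sketch; the range [−1, 3] below is an arbitrary example):

    import math

    def hoeffding_bound(u, eps, a=0.0, b=1.0):
        # One-sided Hoeffding bound for i.i.d. samples bounded in [a, b].
        return math.exp(-2 * u * eps ** 2 / (b - a) ** 2)

    print(hoeffding_bound(100, 0.5, a=-1.0, b=3.0))   # range width 4: exp(-2*100*0.25/16)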


A “KL” Inequality
Let X be a random variable bounded in [0, 1], with E[X] = µ;
Let u ≥ 1;
Let x_1, x_2, ..., x_u be i.i.d. samples of X; and
Let x̄ be the mean of these samples (an empirical mean):

x̄ = (1/u) Σ_{i=1}^{u} x_i.

Then, for any fixed ϵ ∈ [0, 1 − µ], we have

P{x̄ ≥ µ + ϵ} ≤ e^{−u KL(µ+ϵ, µ)},

and for any fixed ϵ ∈ [0, µ], we have

P{x̄ ≤ µ − ϵ} ≤ e^{−u KL(µ−ϵ, µ)},

where for p, q ∈ [0, 1], KL(p, q) is defined as p ln(p/q) + (1 − p) ln((1 − p)/(1 − q)).


Some Observations
The KL inequality gives a tighter upper bound:
For p, q ∈ [0, 1],

KL(p, q) ≥ 2(p − q)² =⇒ e^{−u KL(p, q)} ≤ e^{−2u(p−q)²}.

Both bounds are instances of “Chernoff bounds”, of which there are many more forms.

Similar bounds can also be given when X has infinite support (such as a Gaussian), but might need additional assumptions.
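A short numerical comparison of the two bounds (my own sketch; µ, ϵ, and u are arbitrary example values): as claimed, the KL bound is never larger, and it is much tighter when µ + ϵ is close to 1.

    import math

    def kl(p, q):
        # Bernoulli KL divergence, with the 0 ln 0 = 0 convention.
        terms = 0.0
        if p > 0:
            terms += p * math.log(p / q)
        if p < 1:
            terms += (1 - p) * math.log((1 - p) / (1 - q))
        return terms

    mu, eps, u = 0.9, 0.08, 100
    print("KL bound:       ", math.exp(-u * kl(mu + eps, mu)))   # about 0.006
    print("Hoeffding bound:", math.exp(-2 * u * eps ** 2))       # about 0.278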


Multi-armed Bandits
The exploration-exploitation dilemma
Definitions: Bandit, Algorithm
ϵ-greedy algorithms
Evaluating algorithms: Regret
Achieving sub-linear regret
A lower bound on regret
UCB, KL-UCB algorithms
Thompson Sampling algorithm

Understanding Thompson Sampling


Concentration bounds

Analysis of UCB
Other bandit problems
