Lecture 9: Exploration and Exploitation (David Silver)

The document discusses the exploration-exploitation dilemma in online decision making. It introduces the multi-armed bandit problem, where an agent must balance exploring new actions against exploiting the currently most rewarding action. Greedy and epsilon-greedy algorithms can become stuck exploiting suboptimal actions, resulting in linear regret. Decaying epsilon-greedy achieves logarithmic regret by gradually reducing exploration over time. The document outlines lower bounds on regret based on similarities between action distributions. It introduces the concept of optimism in the face of uncertainty to guide exploration.



Outline

1 Introduction

2 Multi-Armed Bandits

3 Contextual Bandits

4 MDPs
Introduction

Exploration vs. Exploitation Dilemma

Online decision-making involves a fundamental choice:


Exploitation: make the best decision given current information
Exploration: gather more information
The best long-term strategy may involve short-term sacrifices
Gather enough information to make the best overall decisions

Examples

Restaurant Selection
    Exploitation: go to your favourite restaurant
    Exploration: try a new restaurant
Online Banner Advertisements
    Exploitation: show the most successful advert
    Exploration: show a different advert
Oil Drilling
    Exploitation: drill at the best known location
    Exploration: drill at a new location
Game Playing
    Exploitation: play the move you believe is best
    Exploration: play an experimental move

Principles

Naive Exploration
    Add noise to the greedy policy (e.g. ε-greedy)
Optimistic Initialisation
    Assume the best until proven otherwise
Optimism in the Face of Uncertainty
    Prefer actions with uncertain values
Probability Matching
    Select actions according to the probability they are best
Information State Search
    Lookahead search incorporating the value of information
Multi-Armed Bandits

The Multi-Armed Bandit

A multi-armed bandit is a tuple ⟨A, R⟩

A is a known set of m actions (or “arms”)
R^a(r) = P[r | a] is an unknown probability distribution over rewards
At each step t the agent selects an action a_t ∈ A
The environment generates a reward r_t ∼ R^{a_t}
The goal is to maximise cumulative reward Σ_{τ=1}^t r_τ
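As a concrete reference for the algorithms on the following slides, here is a minimal sketch of such a bandit in Python. The Bernoulli reward distributions and the names (BernoulliBandit, pull) are illustrative assumptions, not part of the lecture.

```python
import numpy as np

class BernoulliBandit:
    """Multi-armed bandit <A, R> with Bernoulli reward distributions R_a = B(mu_a)."""

    def __init__(self, success_probs, seed=0):
        self.mu = np.asarray(success_probs)      # true means, unknown to the agent
        self.rng = np.random.default_rng(seed)
        self.n_arms = len(self.mu)

    def pull(self, a):
        """Selecting action a returns a reward r ~ R_a."""
        return float(self.rng.random() < self.mu[a])

    def optimal_value(self):
        """V* = max_a Q(a); here Q(a) = mu_a."""
        return float(self.mu.max())
```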

Regret
The action-value is the mean reward for action a,

    Q(a) = E[r | a]

The optimal value V* is

    V* = Q(a*) = max_{a∈A} Q(a)

The regret is the opportunity loss for one step

    l_t = E[V* − Q(a_t)]

The total regret is the total opportunity loss

    L_t = E[ Σ_{τ=1}^t (V* − Q(a_τ)) ]

Maximise cumulative reward ≡ minimise total regret



Counting Regret
The count N_t(a) is the expected number of selections for action a
The gap ∆_a is the difference in value between action a and the optimal action a*, ∆_a = V* − Q(a)
Regret is a function of gaps and counts

    L_t = E[ Σ_{τ=1}^t (V* − Q(a_τ)) ]
        = Σ_{a∈A} E[N_t(a)] (V* − Q(a))
        = Σ_{a∈A} E[N_t(a)] ∆_a

A good algorithm ensures small counts for large gaps
Problem: the gaps are not known!
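When the true means (and hence the gaps) are known, e.g. in a simulation, the decomposition above can be computed directly. A tiny sketch; the function name is illustrative.

```python
def total_regret(counts, gaps):
    """L_t = sum_a N_t(a) * Delta_a, using realised counts in place of E[N_t(a)]."""
    return sum(n * g for n, g in zip(counts, gaps))
```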

Linear or Sublinear Regret

[Figure: total regret vs. time-steps for the greedy, ϵ-greedy and decaying ϵ-greedy algorithms]

If an algorithm forever explores it will have linear total regret


If an algorithm never explores it will have linear total regret
Is it possible to achieve sublinear total regret?
Greedy and ε-Greedy Algorithms

Greedy Algorithm

We consider algorithms that estimate Q̂_t(a) ≈ Q(a)
Estimate the value of each action by Monte-Carlo evaluation

    Q̂_t(a) = (1/N_t(a)) Σ_{t=1}^T r_t 1(a_t = a)

The greedy algorithm selects the action with the highest estimated value

    a_t* = argmax_{a∈A} Q̂_t(a)

Greedy can lock onto a suboptimal action forever
⇒ Greedy has linear total regret

ε-Greedy Algorithm

The ε-greedy algorithm continues to explore forever
    With probability 1 − ε select a = argmax_{a∈A} Q̂(a)
    With probability ε select a random action
Constant ε ensures minimum regret

    l_t ≥ (ε/|A|) Σ_{a∈A} ∆_a

⇒ ε-greedy has linear total regret
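A minimal sketch of ε-greedy with incremental Monte-Carlo value estimates, written against the illustrative BernoulliBandit interface above; the hyperparameters are arbitrary.

```python
import numpy as np

def epsilon_greedy(bandit, n_steps=1000, epsilon=0.1, seed=0):
    """Run epsilon-greedy on a bandit; returns value estimates and the reward sequence."""
    rng = np.random.default_rng(seed)
    Q = np.zeros(bandit.n_arms)   # action-value estimates Q_hat(a)
    N = np.zeros(bandit.n_arms)   # selection counts N(a)
    rewards = []
    for _ in range(n_steps):
        if rng.random() < epsilon:
            a = int(rng.integers(bandit.n_arms))   # explore: uniformly random action
        else:
            a = int(np.argmax(Q))                  # exploit: greedy action
        r = bandit.pull(a)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]                  # incremental Monte-Carlo update
        rewards.append(r)
    return Q, np.array(rewards)
```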



Optimistic Initialisation

Simple and practical idea: initialise Q(a) to a high value
Update the action value by incremental Monte-Carlo evaluation
Starting with N(a) > 0,

    Q̂_t(a_t) = Q̂_{t−1} + (1/N_t(a_t)) (r_t − Q̂_{t−1})

Encourages systematic exploration early on
But can still lock onto a suboptimal action
⇒ greedy + optimistic initialisation has linear total regret
⇒ ε-greedy + optimistic initialisation has linear total regret

Decaying ε_t-Greedy Algorithm

Pick a decay schedule for ε_1, ε_2, ...
Consider the following schedule

    c > 0
    d = min_{a | ∆_a > 0} ∆_a
    ε_t = min{ 1, c|A| / (d² t) }

Decaying ε_t-greedy has logarithmic asymptotic total regret!
Unfortunately, the schedule requires advance knowledge of the gaps
Goal: find an algorithm with sublinear regret for any
multi-armed bandit (without knowledge of R)
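For reference, the schedule above as a short sketch; it takes the gaps as an argument precisely because of the limitation just noted, and the function name is illustrative.

```python
def decaying_epsilon(t, n_arms, gaps, c=1.0):
    """eps_t = min(1, c|A| / (d^2 t)), with d the smallest positive gap (assumed known here)."""
    d = min(g for g in gaps if g > 0)
    return min(1.0, c * n_arms / (d * d * max(t, 1)))
```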

Lower Bound

The performance of any algorithm is determined by the similarity
between the optimal arm and the other arms
Hard problems have similar-looking arms with different means
This is described formally by the gap ∆_a and the similarity in
distributions KL(R^a || R^{a*})

Theorem (Lai and Robbins)
Asymptotic total regret is at least logarithmic in the number of steps

    lim_{t→∞} L_t ≥ log t Σ_{a | ∆_a > 0} ∆_a / KL(R^a || R^{a*})
Upper Confidence Bound

Optimism in the Face of Uncertainty

[Figure: probability densities p(Q) over the action-values Q(a1), Q(a2), Q(a3)]

Which action should we pick?


The more uncertain we are about an action-value
The more important it is to explore that action
It could turn out to be the best action

Optimism in the Face of Uncertainty (2)

After picking the blue action


We are less uncertain about the value
And more likely to pick another action
Until we home in on best action

Upper Confidence Bounds

Estimate an upper confidence Û_t(a) for each action value
Such that Q(a) ≤ Q̂_t(a) + Û_t(a) with high probability
This depends on the number of times N(a) has been selected
    Small N_t(a) ⇒ large Û_t(a) (estimated value is uncertain)
    Large N_t(a) ⇒ small Û_t(a) (estimated value is accurate)
Select the action maximising the Upper Confidence Bound (UCB)

    a_t = argmax_{a∈A} [ Q̂_t(a) + Û_t(a) ]

Hoeffding’s Inequality

Theorem (Hoeffding’s Inequality)
Let X_1, ..., X_t be i.i.d. random variables in [0, 1], and let
X̄_t = (1/t) Σ_{τ=1}^t X_τ be the sample mean. Then

    P[ E[X] > X̄_t + u ] ≤ e^{−2tu²}

We will apply Hoeffding’s Inequality to the rewards of the bandit,
conditioned on selecting action a:

    P[ Q(a) > Q̂_t(a) + U_t(a) ] ≤ e^{−2 N_t(a) U_t(a)²}

Calculating Upper Confidence Bounds

Pick a probability p that the true value exceeds the UCB
Now solve for U_t(a)

    e^{−2 N_t(a) U_t(a)²} = p
    U_t(a) = √( −log p / (2 N_t(a)) )

Reduce p as we observe more rewards, e.g. p = t^{−4}
This ensures we select the optimal action as t → ∞

    U_t(a) = √( 2 log t / N_t(a) )

UCB1

This leads to the UCB1 algorithm

    a_t = argmax_{a∈A} [ Q̂_t(a) + √( 2 log t / N_t(a) ) ]

Theorem
The UCB algorithm achieves logarithmic asymptotic total regret

    lim_{t→∞} L_t ≤ 8 log t Σ_{a | ∆_a > 0} ∆_a
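A sketch of UCB1 against the same illustrative bandit interface as above; pulling each arm once before applying the bound (to avoid dividing by a zero count) is an assumed implementation detail.

```python
import numpy as np

def ucb1(bandit, n_steps=1000):
    """UCB1: pick a_t = argmax_a Q_hat(a) + sqrt(2 log t / N(a))."""
    Q = np.zeros(bandit.n_arms)
    N = np.zeros(bandit.n_arms)
    rewards = []
    for t in range(1, n_steps + 1):
        if t <= bandit.n_arms:
            a = t - 1                                # pull each arm once to initialise N(a)
        else:
            bonus = np.sqrt(2.0 * np.log(t) / N)     # uncertainty bonus U_t(a)
            a = int(np.argmax(Q + bonus))
        r = bandit.pull(a)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]                    # incremental Monte-Carlo update
        rewards.append(r)
    return Q, np.array(rewards)
```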

Example: UCB vs. ε-Greedy on a 10-Armed Bandit

[Figure 8: comparison on distribution 3 (2 machines with parameters 0.55, 0.45)]
[Figure 9: comparison on distribution 11 (10 machines with parameters 0.9, 0.6, ..., 0.6)]
[Figure 10: comparison on distribution 12 (10 machines with parameters 0.9, 0.8, 0.8, 0.8, 0.7, 0.7, 0.7, 0.6, 0.6, 0.6)]

Bayesian Bandits

So far we have made no assumptions about the reward distribution R
(except bounds on the rewards)
Bayesian bandits exploit prior knowledge of rewards, p[R]
They compute the posterior distribution of rewards p[R | h_t],
where h_t = a_1, r_1, ..., a_{t−1}, r_{t−1} is the history
Use posterior to guide exploration
Upper confidence bounds (Bayesian UCB)
Probability matching (Thompson sampling)
Better performance if prior knowledge is accurate

Bayesian UCB Example: Independent Gaussians


Assume the reward distribution is Gaussian, R_a(r) = N(r; µ_a, σ_a²)

[Figure: Gaussian posteriors over the action-values Q(a1), Q(a2), Q(a3), showing the means µ(a1), µ(a2), µ(a3) and bounds c standard deviations above each mean]

Compute the Gaussian posterior over µ_a and σ_a² (by Bayes law)

    p(µ_a, σ_a² | h_t) ∝ p(µ_a, σ_a²) Π_{t | a_t = a} N(r_t; µ_a, σ_a²)

Pick the action with the largest upper bound, c standard deviations above the mean

    a_t = argmax_{a∈A} [ µ_a + c σ_a / √N(a) ]
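A rough sketch of the rule above, using simple per-arm running mean and variance estimates rather than a full conjugate Gaussian posterior; the constant c, the estimator, and the handling of unseen arms are all assumptions.

```python
import numpy as np

def bayesian_ucb_gaussian(bandit, n_steps=1000, c=2.0):
    """Pick a_t = argmax_a mu_a + c * sigma_a / sqrt(N(a)) with running Gaussian estimates."""
    n = bandit.n_arms
    mu = np.zeros(n)    # running mean estimate of Q(a)
    m2 = np.zeros(n)    # running sum of squared deviations (Welford)
    N = np.zeros(n)
    for _ in range(n_steps):
        sigma = np.sqrt(np.where(N > 1, m2 / np.maximum(N - 1, 1), 1.0))
        score = mu + c * sigma / np.sqrt(np.maximum(N, 1))
        a = int(np.argmax(np.where(N == 0, np.inf, score)))   # try unseen arms first
        r = bandit.pull(a)
        N[a] += 1
        delta = r - mu[a]
        mu[a] += delta / N[a]
        m2[a] += delta * (r - mu[a])                           # Welford variance update
    return mu, N
```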

Probability Matching

Probability matching selects action a according to the probability
that a is the optimal action

    π(a | h_t) = P[ Q(a) > Q(a'), ∀a' ≠ a | h_t ]

Probability matching is optimistic in the face of uncertainty
    Uncertain actions have a higher probability of being the max
Can be difficult to compute analytically from the posterior

Thompson Sampling

Thompson sampling implements probability matching

    π(a | h_t) = P[ Q(a) > Q(a'), ∀a' ≠ a | h_t ]
               = E_{R | h_t}[ 1(a = argmax_{a∈A} Q(a)) ]

Use Bayes law to compute the posterior distribution p[R | h_t]
Sample a reward distribution R from the posterior
Compute the action-value function Q(a) = E[R_a]
Select the action maximising value on the sample, a_t = argmax_{a∈A} Q(a)
Thompson sampling achieves the Lai and Robbins lower bound!
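A sketch of Thompson sampling for Bernoulli rewards with an assumed Beta(1, 1) prior per arm. Note it uses the standard Beta convention (α counts successes), which is the reverse of the α/β labelling on the Bernoulli-bandit slide later in this lecture.

```python
import numpy as np

def thompson_sampling(bandit, n_steps=1000, seed=0):
    """Beta-Bernoulli Thompson sampling: sample Q from the posterior, act greedily on the sample."""
    rng = np.random.default_rng(seed)
    alpha = np.ones(bandit.n_arms)   # pseudo-counts of successes (r = 1)
    beta = np.ones(bandit.n_arms)    # pseudo-counts of failures  (r = 0)
    for _ in range(n_steps):
        sampled_q = rng.beta(alpha, beta)   # one sample of each Q(a) from the posterior
        a = int(np.argmax(sampled_q))       # greedy with respect to the sampled values
        r = bandit.pull(a)
        alpha[a] += r
        beta[a] += 1 - r
    return alpha, beta
```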
Information State Search

Value of Information

Exploration is useful because it gains information


Can we quantify the value of information?
How much reward would a decision-maker be prepared to pay
in order to have that information, prior to making a decision?
    i.e. long-term reward after getting the information minus immediate reward
Information gain is higher in uncertain situations
Therefore it makes sense to explore uncertain situations more
If we know value of information, we can trade-off exploration
and exploitation optimally

Information State Space

We have viewed bandits as one-step decision-making problems
They can also be viewed as sequential decision-making problems
At each step there is an information state s̃
    s̃ is a statistic of the history, s̃_t = f(h_t),
    summarising all information accumulated so far
Each action a causes a transition to a new information state s̃'
(by adding information), with probability P̃^a_{s̃,s̃'}
This defines an MDP M̃ in augmented information state space

    M̃ = ⟨S̃, A, P̃, R, γ⟩

Example: Bernoulli Bandits

Consider a Bernoulli bandit, such that R_a = B(µ_a)
    e.g. win or lose a game with probability µ_a
We want to find which arm has the highest µ_a
The information state is s̃ = ⟨α, β⟩
    α_a counts the pulls of arm a where the reward was 0
    β_a counts the pulls of arm a where the reward was 1

Solving Information State Space Bandits

We now have an infinite MDP over information states


This MDP can be solved by reinforcement learning
Model-free reinforcement learning
e.g. Q-learning (Duff, 1994)
Bayesian model-based reinforcement learning
e.g. Gittins indices (Gittins, 1979)
This approach is known as Bayes-adaptive RL
Finds Bayes-optimal exploration/exploitation trade-off
with respect to prior distribution

Bayes-Adaptive Bernoulli Bandits

Start with a Beta(α_a, β_a) prior over each reward function R_a
Each time a is selected, update the posterior for R_a
    Beta(α_a + 1, β_a) if r = 0
    Beta(α_a, β_a + 1) if r = 1
This defines the transition function P̃ for the Bayes-adaptive MDP
The information state ⟨α, β⟩ corresponds to the reward model Beta(α, β)
Each state transition corresponds to a Bayesian model update

[Figure: a tree of Beta posterior densities over µ_1 and µ_2 for two arms (“Drug 1”, “Drug 2”), branching on the success or failure of each pull]
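A tiny sketch of the information-state transition just described, following the slide's convention that α_a counts reward-0 pulls and β_a counts reward-1 pulls; the function name and tuple representation are illustrative.

```python
def info_state_transition(alpha, beta, a, r):
    """Return the new information state <alpha, beta> after pulling arm a and observing reward r."""
    alpha, beta = list(alpha), list(beta)
    if r == 0:
        alpha[a] += 1   # Beta(alpha_a + 1, beta_a) if r = 0
    else:
        beta[a] += 1    # Beta(alpha_a, beta_a + 1) if r = 1
    return tuple(alpha), tuple(beta)
```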

Bayes-Adaptive MDP for Bernoulli Bandits



Gittins Indices for Bernoulli Bandits

Bayes-adaptive MDP can be solved by dynamic programming


The solution is known as the Gittins index
Exact solution to Bayes-adaptive MDP is typically intractable
Information state space is too large
Recent idea: apply simulation-based search (Guez et al. 2012)

Forward search in information state space


Using simulations from current information state

Contextual Bandits

A contextual bandit is a tuple ⟨A, S, R⟩

A is a known set of actions (or “arms”)
S = P[s] is an unknown distribution over states (or “contexts”)
R^a_s(r) = P[r | s, a] is an unknown probability distribution over rewards
At each step t
    The environment generates a state s_t ∼ S
    The agent selects an action a_t ∈ A
    The environment generates a reward r_t ∼ R^{a_t}_{s_t}
The goal is to maximise cumulative reward Σ_{τ=1}^t r_τ
Linear UCB

Linear Regression
The action-value function is the expected reward for state s and action a

    Q(s, a) = E[r | s, a]

Estimate the value function with a linear function approximator

    Q_θ(s, a) = φ(s, a)^T θ ≈ Q(s, a)

Estimate the parameters by least squares regression

    A_t = Σ_{τ=1}^t φ(s_τ, a_τ) φ(s_τ, a_τ)^T
    b_t = Σ_{τ=1}^t φ(s_τ, a_τ) r_τ
    θ_t = A_t^{−1} b_t

Linear Upper Confidence Bounds

Least squares regression estimates the mean action-value Q_θ(s, a)
But it can also estimate the variance of the action-value σ_θ²(s, a),
i.e. the uncertainty due to parameter estimation error
Add on a bonus for uncertainty, U_θ(s, a) = c σ_θ(s, a)
i.e. define the UCB to be c standard deviations above the mean
Geometric Interpretation

[Figure: confidence ellipsoid E in parameter space around the least squares estimate]

Define a confidence ellipsoid E_t around the parameters θ_t
Such that E_t includes the true parameters θ* with high probability
Use this ellipsoid to estimate the uncertainty of the action values
Pick the parameters within the ellipsoid that maximise the action value

    argmax_{θ∈E} Q_θ(s, a)

Calculating Linear Upper Confidence Bounds

For least squares regression, the parameter covariance is A^{−1}
The action-value is linear in the features, Q_θ(s, a) = φ(s, a)^T θ
So the action-value variance is quadratic,

    σ_θ²(s, a) = φ(s, a)^T A^{−1} φ(s, a)

The upper confidence bound is Q_θ(s, a) + c √( φ(s, a)^T A^{−1} φ(s, a) )
Select the action maximising the upper confidence bound

    a_t = argmax_{a∈A} [ Q_θ(s_t, a) + c √( φ(s_t, a)^T A_t^{−1} φ(s_t, a) ) ]
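A compact sketch of this linear UCB computation; the feature function φ is supplied by the caller, and the small ridge term added to A (to keep it invertible before enough data arrives) and the class/method names are assumptions.

```python
import numpy as np

class LinUCB:
    """Linear UCB: Q_theta(s,a) = phi(s,a)^T theta, bonus = c * sqrt(phi^T A^{-1} phi)."""

    def __init__(self, n_features, c=1.0, reg=1e-3):
        self.A = reg * np.eye(n_features)   # ridge term keeps A invertible early on
        self.b = np.zeros(n_features)
        self.c = c

    def select(self, features):
        """features: array of shape (n_actions, n_features), row i = phi(s_t, a_i)."""
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b
        q = features @ theta                                  # mean action-values
        var = np.sum((features @ A_inv) * features, axis=1)   # phi^T A^{-1} phi per action
        return int(np.argmax(q + self.c * np.sqrt(var)))

    def update(self, phi, reward):
        """Accumulate the least squares statistics: A += phi phi^T, b += phi r."""
        self.A += np.outer(phi, phi)
        self.b += phi * reward
```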
Example: Linear UCB for Selecting Front Page News

[Figures (a)-(d) from Li, Chu, Langford, Moon and Wang]
MDPs

Exploration/Exploitation Principles to MDPs

The same principles for exploration/exploitation apply to MDPs


Naive Exploration
Optimistic Initialisation
Optimism in the Face of Uncertainty
Probability Matching
Information State Search

Optimistic Initialisation: Model-Free RL

Initialise the action-value function Q(s, a) to r_max / (1 − γ)
Run favourite model-free RL algorithm
Monte-Carlo control
Sarsa
Q-learning
...
Encourages systematic exploration of states and actions
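A sketch of optimistic initialisation for tabular Q-learning (one of the model-free algorithms listed above); the environment interface (reset/step returning state, reward, done) and the hyperparameters are assumptions.

```python
import numpy as np

def optimistic_q_learning(env, n_states, n_actions, r_max,
                          gamma=0.99, alpha=0.1, n_episodes=500):
    """Tabular Q-learning with every Q(s, a) initialised optimistically to r_max / (1 - gamma)."""
    Q = np.full((n_states, n_actions), r_max / (1.0 - gamma))
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            a = int(np.argmax(Q[s]))        # purely greedy: optimism drives exploration
            s_next, r, done = env.step(a)   # assumed interface: (state, reward, done)
            target = r if done else r + gamma * np.max(Q[s_next])
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```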

Optimistic Initialisation: Model-Based RL

Construct an optimistic model of the MDP


Initialise transitions to go to heaven
(i.e. transition to a terminal state with r_max reward)
Solve optimistic MDP by favourite planning algorithm
policy iteration
value iteration
tree search
...
Encourages systematic exploration of states and actions
e.g. RMax algorithm (Brafman and Tennenholtz)
Optimism in the Face of Uncertainty

Upper Confidence Bounds: Model-Free RL

Maximise the UCB on the action-value function Q^π(s, a)

    a_t = argmax_{a∈A} [ Q(s_t, a) + U(s_t, a) ]

    Estimates uncertainty in policy evaluation (easy)
    Ignores uncertainty from policy improvement
Maximise the UCB on the optimal action-value function Q*(s, a)

    a_t = argmax_{a∈A} [ Q(s_t, a) + U_1(s_t, a) + U_2(s_t, a) ]

    Estimates uncertainty in policy evaluation (easy)
    plus uncertainty from policy improvement (hard)

Bayesian Model-Based RL

Maintain posterior distribution over MDP models


Estimate both transitions and rewards, p[P, R | h_t],
where h_t = s_1, a_1, r_2, ..., s_t is the history
Use posterior to guide exploration
Upper confidence bounds (Bayesian UCB)
Probability matching (Thompson sampling)
Probability Matching

Thompson Sampling: Model-Based RL

Thompson sampling implements probability matching

    π(s, a | h_t) = P[ Q*(s, a) > Q*(s, a'), ∀a' ≠ a | h_t ]
                  = E_{P,R | h_t}[ 1(a = argmax_{a∈A} Q*(s, a)) ]

Use Bayes law to compute the posterior distribution p[P, R | h_t]
Sample an MDP ⟨P, R⟩ from the posterior
Solve the sampled MDP using your favourite planning algorithm to get Q*(s, a)
Select the optimal action for the sampled MDP, a_t = argmax_{a∈A} Q*(s_t, a)
Information State Search

Information State Search in MDPs

MDPs can be augmented to include the information state
Now the augmented state is ⟨s, s̃⟩
    where s is the original state within the MDP
    and s̃ is a statistic of the history (accumulated information)
Each action a causes a transition
    to a new state s' with probability P^a_{s,s'}
    to a new information state s̃'
This defines an MDP M̃ in augmented information state space

    M̃ = ⟨S̃, A, P̃, R, γ⟩

Bayes Adaptive MDPs

The posterior distribution over the MDP model is an information state

    s̃_t = P[P, R | h_t]

The augmented MDP over ⟨s, s̃⟩ is called the Bayes-adaptive MDP
Solve this MDP to find the optimal exploration/exploitation trade-off (with respect to the prior)
However, the Bayes-adaptive MDP is typically enormous
Simulation-based search has proven effective (Guez et al.)

Conclusion

We have covered several principles for exploration/exploitation:

Naive methods such as ε-greedy
Optimistic initialisation
Upper confidence bounds
Probability matching
Information state search
Each principle was developed in bandit setting
But same principles also apply to MDP setting
