

Introduction to bandits

(some slides stolen from Csaba’s AAAI tutorial)


Motivation
In a clinical trial, we do not have complete information about the
effectiveness or side-effects of the drugs.
Aim: infer the best drug by running a sequence of trials.
Mapping to a bandit algorithm:
● Each drug choice is mapped to an arm, and its reward is mapped to the drug's effectiveness.
● Administering a drug is an action and is equivalent to pulling the corresponding arm.
● The trial goes on for n rounds.
Other applications: Recommender Systems, Viral Marketing, Network Routing, Ad Placement
Introduction

Assumptions:
1. Stochasticity: The reward for each arm is sampled from its underlying distribution, which is unknown to the learner.
2. Finiteness and Independence: The number of arms is finite and the reward for each arm is independent of the others.
3. Stationarity: The reward distributions of the arms do not change over time.
Introduction

Bandits are a special, tractable case of RL (actions do not affect future states or rewards)

Performance Metric: Cumulative regret

Results in an exploration-exploitation trade-off:


Exploration: Pull an arm to learn more about it.
Exploitation: Pull the arm that currently has the highest estimated reward.
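
For reference, a standard way to write the cumulative regret over n rounds (with \mu^* the mean reward of the best arm and X_t the reward received in round t; notation added here for concreteness):

R_n = n \mu^* - E[ \sum_{t=1}^{n} X_t ]

An algorithm balances exploration and exploitation well if its regret grows sublinearly, i.e. R_n / n -> 0.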
Multi-armed bandits
OBSERVE: The reward is observed immediately upon pulling an arm. Rewards are scalars bounded in the [0,1] interval.

UPDATE: Use the mean of the rewards obtained from pulling arm i as the empirical reward estimate for that arm.

SELECT: Explore-Then-Commit, Epsilon-Greedy, Upper Confidence Bound, Thompson sampling
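
A minimal sketch of the OBSERVE and UPDATE steps in Python (the Bernoulli environment and all names below are illustrative, not taken from the slides):

import random

class BernoulliBandit:
    """k arms; pulling arm i yields a 0/1 reward with unknown mean means[i]."""
    def __init__(self, means):
        self.means = means

    def pull(self, i):
        # OBSERVE: the reward is seen immediately and lies in [0, 1]
        return 1.0 if random.random() < self.means[i] else 0.0


class EmpiricalMeans:
    """UPDATE: keep a running mean of the rewards observed for each arm."""
    def __init__(self, k):
        self.counts = [0] * k
        self.means = [0.0] * k

    def update(self, i, reward):
        self.counts[i] += 1
        # incremental form of the empirical mean
        self.means[i] += (reward - self.means[i]) / self.counts[i]

The SELECT strategies below differ only in which arm they choose given these statistics.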
Explore-Then-Commit
When to commit: the length of the exploration phase can be tuned using knowledge of the gap (gap-dependent bound) or without it (gap-free bound); standard forms of both bounds are sketched below.
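
For reference, standard forms of these bounds (following Lattimore and Szepesvári's analysis for two arms with 1-subgaussian rewards, gap \Delta, and m exploration pulls per arm; the exact expressions on the original slides may differ):

Gap-dependent:  R_n \le m \Delta + (n - 2m) \Delta \exp(-m \Delta^2 / 4)
Gap-free:       choosing m \propto n^{2/3} without knowing \Delta gives R_n = O(n^{2/3})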
Epsilon-Greedy

+ Interleaves exploration and exploitation.
+ Doesn’t require knowledge of the gap or the horizon.
+ Popularly used and works well in practice.
- Performance is sensitive to the choice of epsilon.
- Results in suboptimal O(n^{2/3}) regret.
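
A minimal epsilon-greedy selection rule, written against the empirical means maintained in the UPDATE step above (names are illustrative):

import random

def select_epsilon_greedy(est_means, epsilon=0.1):
    """With probability epsilon pull a uniformly random arm (explore);
    otherwise pull the arm with the highest empirical mean (exploit)."""
    if random.random() < epsilon:
        return random.randrange(len(est_means))
    return max(range(len(est_means)), key=lambda i: est_means[i])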
Optimism in the face of uncertainty

Pull the arm with the largest upper confidence bound: its empirical mean plus an exploration bonus that shrinks as the arm is pulled more often.

+ Doesn’t require knowledge of the gap or the horizon.
+ Results in near-optimal regret.
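
A sketch of a UCB1-style rule; the sqrt(2 log t / pulls) bonus is the classic choice, though constants vary across variants:

import math

def select_ucb(est_means, counts, t):
    """Pick the arm with the largest optimistic estimate:
    empirical mean + sqrt(2 * log(t) / number of pulls)."""
    for i, c in enumerate(counts):
        if c == 0:
            return i  # pull every arm at least once
    scores = [est_means[i] + math.sqrt(2.0 * math.log(t) / counts[i])
              for i in range(len(est_means))]
    return max(range(len(scores)), key=lambda i: scores[i])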
Thompson sampling
P_i is the posterior distribution (conditioned on the observed rewards) for arm i.
Select: sample a mean reward from each arm's posterior P_i and pull the arm whose sample is largest.
Update: after observing the reward, update the posterior of the pulled arm.

+ Simple to implement. Only requires a sampling procedure.
+ Theoretically, it results in near-optimal regret.
+ Often works better than UCB in practice.
- In some variants, it tends to over-explore.
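
A concrete sketch of Thompson sampling for Bernoulli rewards with a Beta posterior (the uniform Beta(1, 1) prior is an illustrative choice):

import random

def select_thompson(successes, failures):
    """Sample a mean from each arm's Beta(1 + s, 1 + f) posterior
    and pull the arm with the largest sample."""
    samples = [random.betavariate(1 + s, 1 + f)
               for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=lambda i: samples[i])

def update_thompson(successes, failures, arm, reward):
    """Posterior update for a 0/1 reward on the pulled arm."""
    if reward >= 0.5:
        successes[arm] += 1
    else:
        failures[arm] += 1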


Structured Bandits
● Arms (choices) can be related by a structural assumption on the action space or according to their corresponding features, e.g. items in a recommender system.
● In problems with a large number of arms, learning about each arm separately is inefficient.
● Contextual Bandits: Each arm j has a feature vector x_j and there exists an unknown function f such that the expected reward of arm j is f(x_j).
● Linear Bandits: the expected reward is linear in the features, i.e. f(x_j) = ⟨θ*, x_j⟩ for an unknown parameter vector θ*.
● Combinatorial Bandits: The arms are related according to a combinatorial constraint.
Contextual Bandits
UPDATE: fit a model that predicts the reward from the arm features using the observed (feature, reward) pairs.

Linear Bandits: estimate θ* by (regularized) least squares on the observed feature-reward pairs.
(Non)-Linear Bandits
Epsilon-Greedy
- O(n^{2/3}) regret
+ Easy to extend for non-linear bandits

LinUCB

- Don’t know how to construct confidence intervals for complex functions
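
A compact sketch of a shared-parameter LinUCB: ridge regression for the estimate plus an ellipsoidal confidence bonus (the regularizer and the bonus scale alpha are illustrative choices):

import numpy as np

class LinUCB:
    """Linear bandit with optimism: assumes E[reward] = <theta*, x>.
    Maintains A = reg*I + sum x x^T and b = sum reward * x."""
    def __init__(self, dim, alpha=1.0, reg=1.0):
        self.alpha = alpha
        self.A = reg * np.eye(dim)
        self.b = np.zeros(dim)

    def select(self, arm_features):
        A_inv = np.linalg.inv(self.A)
        theta_hat = A_inv @ self.b
        # estimated reward + confidence width for each candidate arm
        scores = [x @ theta_hat + self.alpha * np.sqrt(x @ A_inv @ x)
                  for x in arm_features]
        return int(np.argmax(scores))

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x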
(Non)-Linear Bandits
Thompson sampling
+ O(d n^{1/2}) regret
+ Can use approximate sampling procedures for complex functions

Bootstrapping
- The theory is not well developed.
+ Only need to compute point estimates.
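
A sketch of linear Thompson sampling, which replaces the confidence interval with a draw from a Gaussian posterior over the parameter (the noise scale v and regularizer are illustrative):

import numpy as np

class LinearThompson:
    """Sample theta ~ N(theta_hat, v^2 * A^{-1}) and act greedily on the sample."""
    def __init__(self, dim, v=1.0, reg=1.0):
        self.v = v
        self.A = reg * np.eye(dim)
        self.b = np.zeros(dim)
        self.rng = np.random.default_rng()

    def select(self, arm_features):
        A_inv = np.linalg.inv(self.A)
        theta_hat = A_inv @ self.b
        theta_sample = self.rng.multivariate_normal(theta_hat, self.v ** 2 * A_inv)
        return int(np.argmax([x @ theta_sample for x in arm_features]))

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x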
Bandits everywhere!
● Adversarial Bandits (relaxing assumption 1)
● Gaussian process Bandits (relaxing assumption 2)
● Restless Bandits (relaxing assumption 3)
● Rotting Bandits
● Duelling Bandits
● Firing Bandits
● ………….

Different objective functions:
● Best-arm identification
● Bayesian bandits
