
Foundations of Reinforcement Learning and Interactive Decision Making

Dylan J. Foster and Alexander Rakhlin

Last Updated: December 2023

arXiv:2312.16730v1 [cs.LG] 27 Dec 2023

These lecture notes are based on a course taught at MIT in Fall 2022 and Fall 2023. This
is a live draft, and all parts will be updated regularly. Please send us an email if you find a
mistake, typo, or missing reference.

Contents

1 Introduction
  1.1 Decision Making
  1.2 A Spectrum of Decision Making Problems
  1.3 Minimax Perspective
  1.4 Statistical Learning: Brief Refresher
  1.5 Refresher: Random Variables and Averages
  1.6 Online Learning and Prediction
      1.6.1 Connection to Statistical Learning
      1.6.2 The Exponential Weights Algorithm
  1.7 Exercises

2 Multi-Armed Bandits
  2.1 The Need for Exploration
  2.2 The ε-Greedy Algorithm
  2.3 The Upper Confidence Bound (UCB) Algorithm
  2.4 Bayesian Bandits and the Posterior Sampling Algorithm⋆
  2.5 Adversarial Bandits and the Exp3 Algorithm⋆
  2.6 Deferred Proofs
  2.7 Exercises

3 Contextual Bandits
  3.1 Optimism: Generic Template
  3.2 Optimism for Linear Models: The LinUCB Algorithm
  3.3 Moving Beyond Linear Classes: Challenges
  3.4 The ε-Greedy Algorithm for Contextual Bandits
  3.5 Inverse Gap Weighting: An Optimal Algorithm for General Model Classes
      3.5.1 Extending to Offline Regression
  3.6 Exercises

4 Structured Bandits
  4.1 Building Intuition: Optimism for Structured Bandits
      4.1.1 UCB for Structured Bandits
      4.1.2 The Eluder Dimension
      4.1.3 Suboptimality of Optimism
  4.2 The Decision-Estimation Coefficient
  4.3 Decision-Estimation Coefficient: Examples
      4.3.1 Cheating Code
      4.3.2 Linear Bandits
      4.3.3 Nonparametric Bandits
      4.3.4 Further Examples
  4.4 Relationship to Optimism and Posterior Sampling
      4.4.1 Connection to Optimism
      4.4.2 Connection to Posterior Sampling
  4.5 Incorporating Contexts⋆
  4.6 Additional Properties of the Decision-Estimation Coefficient⋆
  4.7 Exercises

5 Reinforcement Learning: Basics
  5.1 Finite-Horizon Episodic MDP Formulation
  5.2 Planning via Dynamic Programming
  5.3 Failure of Uniform Exploration
  5.4 Analysis Tools
  5.5 Optimism
  5.6 The UCB-VI Algorithm for Tabular MDPs
      5.6.1 Analysis for a Single Episode
      5.6.2 Regret Analysis

6 General Decision Making
  6.1 Setting
  6.2 Refresher: Information-Theoretic Divergences
  6.3 The Decision-Estimation Coefficient for General Decision Making
      6.3.1 Basic Examples
  6.4 E2D Algorithm for General Decision Making
      6.4.1 Online Estimation with Hellinger Distance
  6.5 Decision-Estimation Coefficient: Lower Bound on Regret
      6.5.1 The Constrained Decision-Estimation Coefficient
      6.5.2 Lower Bound
      6.5.3 Proof of Proposition 28
      6.5.4 Examples for the Lower Bound
  6.6 Decision-Estimation Coefficient and E2D: Application to Tabular RL
      6.6.1 Proof of Proposition 31
  6.7 Tighter Regret Bounds for the Decision-Estimation Coefficient
      6.7.1 Guarantees Based on Decision Space Complexity
      6.7.2 General Divergences and Randomized Estimators
      6.7.3 Optimistic Estimation
  6.8 Decision-Estimation Coefficient: Structural Properties⋆
  6.9 Deferred Proofs
  6.10 Exercises

7 Reinforcement Learning: Function Approximation and Large State Spaces
  7.1 Is Realizability Sufficient?
  7.2 Linear Function Approximation
      7.2.1 The LSVI-UCB Algorithm
      7.2.2 Proof of Proposition 46
  7.3 Bellman Rank
      7.3.1 The BiLinUCB Algorithm
      7.3.2 Proof of Proposition 47
      7.3.3 Bellman Rank: Examples
      7.3.4 Generalizations of Bellman Rank
      7.3.5 Decision-Estimation Coefficient for Bellman Rank

A Technical Tools
  A.1 Probabilistic Inequalities
      A.1.1 Tail Bounds with Stopping Times
      A.1.2 Tail Bounds for Martingales
  A.2 Information Theory
      A.2.1 Properties of Hellinger Distance
      A.2.2 Change-of-Measure Inequalities
  A.3 Minimax Theorem
1. INTRODUCTION

1.1 Decision Making


This is a course about learning to make decisions in an interactive, data-driven fashion.
When we say interactive decision making, we are thinking of problems such as:

• Medical treatment: based on a patient’s medical history and vital signs, we need to
decide what treatment will lead to the most positive outcome.

• Controlling a robot: based on sensor signals, we need to decide what signals to send
to a robot’s actuators in order to navigate to a goal.

For both problems, we (the learner /agent) are interacting with an unknown environment.
In the robotics example, we do not necessarily a-priori know how the signals we send to
our robot’s actuators change its configuration, or what the landscape it’s trying to navigate
looks like. However, because we are able to actively control the agent, we can learn to
model the environment on the fly as we make decisions and collect data, which will reduce
uncertainty and allow us to make better decisions in the future. The crux of the interactive
decision making problem is to make decisions in a way that balances (i) exploring the
environment to reduce our uncertainty and (ii) maximizing our overall performance (e.g.,
reaching a goal state as fast as possible).
Figure 1 depicts an idealized interactive decision making setting, which we will return
to throughout this course. Here, at each round t, the agent (doctor) observes the medical
history and vital signs of a patient, summarized in a context xt , makes a treatment decision
π t , and then observes the outcomes of the treatment in the form of a reward rt , and an
auxiliary observation ot about, say, illness progression. With time, we hope that the doctor
will learn a good mapping xt 7→ π t from contexts to decisions. How can we develop an
automated system that can achieve this goal?
It is tempting to cast the problem of finding a good mapping xt 7→ π t as a supervised
learning problem. After all, modern deep neural networks are able to achieve excellent
performance on many tasks, such as image classification and recognition, and it is not
out of the question that there exists a good neural network for the medical example as
well. The question is: how do we find it? In supervised learning, finding a good predictor
often amounts to fitting an appropriate model—such as a neural network—to the data. In
the above example, however, the available data may be limited to what treatments have
been assigned to patients, potentially missing better options. It is the process of active
data collection with a controlled amount of exploration that we would like to study in this
course.
The decision making framework in Figure 1 generalizes many interactive decision mak-
ing problems the reader might already be familiar with, including multi-armed bandits,
contextual bandits, and reinforcement learning. We will cover the foundations of algorithm
design and analysis for all of these settings from a unified perspective, with an emphasis on
sample efficiency (i.e., how to learn a good decision making policy using as few rounds of
interaction as possible).

1.2 A Spectrum of Decision Making Problems


To design algorithms for general interactive decision making problems such as Figure 1,
there are many complementary challenges we must overcome. These challenges correspond

[Figure: at each round, the agent observes a context $x^t$, selects a decision $\pi^t$, and receives a reward $r^t$ and an observation $o^t$.]
Figure 1: A general decision making problem.

[Figure: three axes (interactivity, function approximation, nature of data) spanning statistical learning, online learning, multi-armed bandits, contextual bandits, structured bandits, adversarial data, complex observations, tabular RL, and reinforcement learning.]
Figure 2: Landscape of decision making problems.

to different assumptions we can place on the underlying environment and decision making
protocol, and give rise to what we describe as a spectrum of decision making problems,
which is illustrated in Figure 2. There are three core challenges we will focus on throughout
the course, which are given by the axes of Figure 2.

• Interactivity. Does the learning agent observe data passively, or do the decisions
they make actively influence what data we collect? In the setting of Figure 1, the
doctor observes the effects of the prescribed treatments, but not the counterfactuals
(the effects of the treatments not given). Hence, the doctor’s decisions influence the data
they can collect, which in turn may significantly alter the ability to estimate the effects
of different treatments. On the other hand, in classical machine learning, a dataset is
typically given to the learner upfront, with no control over how it is collected.

• Function approximation and generalization. In supervised statistical learning
and estimation, one typically employs function approximation (e.g., models such as
neural networks, kernels, or forests) to generalize across the space of covariates. For
decision making, we can employ function approximation in a similar fashion, either
to generalize across a space of contexts, or to generalize across the space of decisions.
In the setting of Figure 1, the context xt summarizing the medical history and vital

signs might be a highly structured object. Likewise, the treatment π t might be a high-
dimensional vector with interacting components, or a complex multi-stage treatment
strategy.

• Data. Is the data (e.g., rewards or observations) observed by our learning algorithm
produced by a fixed data-generating process, or does it evolve arbitrarily, and even
adversarially in response to our actions? If there is a fixed data-generating process,
do we wish to directly model it, or should we instead aim to be agnostic? Do we
observe only the labels of images, as in supervised learning, or a full trajectory of
states/actions/rewards for a policy employed by the robot?

As shown in Figure 2, many basic decision making and learning frameworks (contextual
bandits, structured bandits, statistical learning, online learning) can be thought of as ideal-
ized problems that each capture one or more of the possible challenges, while richer settings
such as reinforcement learning encompass all of them.
Figure 2 can be viewed as a roadmap for the course. We start with a brief introduction
to Statistical Learning (Section 1.4) and Online Learning (Section 1.6); the concepts and
results stated here will serve as a backbone for the rest of the course. We will then study, in
order, the problems of Multi-Armed Bandits (Section 2), Contextual Bandits (Section 3),
Structured Bandits (Section 4), Tabular Reinforcement Learning (Section 5), General Deci-
sion Making (Section 6), and Reinforcement Learning with General Function Approximation
(Section 7). Each of these topics will add a layer of complexity, and our aim is to develop a
unified approach to all the aforementioned problems, both in terms of statistical complexity
(the number of interactions required to achieve the goal), and in terms of algorithm design.

1.3 Minimax Perspective


For much of the course, we take a minimax point of view. Abstractly, let M be a set of possi-
ble models (or, choices for the environment) that can be encountered by the learner/decision
maker. The set M can be thought of as representing the prior knowledge of the learner
about the underlying environment. Let Alg denote a learning algorithm, and PerfT (Alg, M )
be some notion of performance of algorithm Alg on model M ∈ M after T rounds of inter-
action (or—in passive learning—after observing T datapoints). We would like to develop
algorithms that perform well, no matter what the model M ∈ M is, in the sense that Alg
approximately solves the minimax problem

\[ \min_{\mathrm{Alg}} \; \max_{M \in \mathcal{M}} \; \mathbf{Perf}_T(\mathrm{Alg}, M). \tag{1.1} \]

Understanding the statistical complexity (or, difficulty) of a given problem amounts to


establishing matching (or nearly matching) upper bounds ϕT (M) and lower bounds ϕT (M)
on the minimax value in (1.1). While developing such upper and lower bounds for specific
model classes M of interest might be a simple task, the grand aim of this course is to
develop a more fundamental, unified understanding of what makes any model class M easy
versus hard, and to give sharp results for all (or nearly all) M.
On the algorithmic side, we would like to better understand the scope of optimal algo-
rithms that solve (1.1). While the minimax problem is itself an optimization problem, the
space of all algorithms is typically prohibitively large. One of the key insights to be lever-
aged in this course is that for general decision making problems, we can restrict ourselves
to algorithms that interleave a type of supervised learning called online estimation (this

will be described in Sections 1.4 and 1.6), with a principled choice of exploration strategy
that balances greedily maximizing performance (exploitation) with information acquisition
(exploration). As we show, such algorithms achieve or nearly achieve optimality in (1.1) for
a surprisingly wide range of decision making problems.

1.4 Statistical Learning: Brief Refresher


We begin with a short refresher on the statistical learning problem. Statistical learning is a
purely passive problem in which the learner does not directly interact with the environment,
but it captures the challenge of generalization and function approximation in the context
of Figure 2.
In the statistical learning problem, we receive examples $(x^1, y^1), \ldots, (x^T, y^T) \in \mathcal{X} \times \mathcal{Y}$,
i.i.d. from an (unknown) distribution $M^\star$. Here $x^t \in \mathcal{X}$ are features (sometimes called
contexts or covariates), and $\mathcal{X}$ is the feature space; $y^t \in \mathcal{Y}$ are called outcomes, and $\mathcal{Y}$ is the
outcome space. Given $(x^1, y^1), \ldots, (x^T, y^T)$, the goal is to produce a model (or, estimator)
$\widehat{f} : \mathcal{X} \to \mathcal{Y}'$ that will do a good job predicting outcomes from features for future examples
$(x, y)$ drawn from $M^\star$.¹
To measure prediction performance, we take as given a loss function ℓ : Y ′ × Y → R.
Standard examples include:

• Regression, where common losses include the square loss $\ell(a, b) = (a - b)^2$ when $\mathcal{Y} = \mathcal{Y}' = \mathbb{R}$.

• Classification, where $\mathcal{Y} = \mathcal{Y}' = \{0, 1\}$ and we consider the indicator (or 0-1) loss $\ell(a, b) = \mathbb{I}\{a \neq b\}$.

• Conditional density estimation with the logarithmic loss (log loss). Here $\mathcal{Y}' = \Delta(\mathcal{Y})$, the set of distributions on $\mathcal{Y}$, and for $p \in \mathcal{Y}'$,

\[ \ell_{\log}(p, y) = -\log p(y). \tag{1.2} \]

For a function $f : \mathcal{X} \to \mathcal{Y}'$, we measure the prediction performance via the population (or, "test") loss:

\[ L(f) := \mathbb{E}_{(x,y)\sim M^\star}[\ell(f(x), y)]. \tag{1.3} \]

Letting $\mathcal{H}^T := \{(x^t, y^t)\}_{t=1}^T$ denote the dataset, a (deterministic) algorithm is a map that takes the dataset as input and returns a function/predictor:

\[ \widehat{f}(\cdot\,; \mathcal{H}^T) : \mathcal{X} \to \mathcal{Y}'. \tag{1.4} \]

The goal in designing algorithms is to ensure that $\mathbb{E}\big[L(\widehat{f})\big]$ is minimized, where $\mathbb{E}[\cdot]$ denotes
expectation with respect to the draw of the dataset HT . Without any assumptions, it is not
possible to learn a good predictor unless the number of examples T scales with |X | (this is
sometimes called the no-free-lunch theorem). The basic idea behind statistical learning is
to work with a restricted class of functions

\[ \mathcal{F} \subseteq \{f : \mathcal{X} \to \mathcal{Y}\} \]

¹Note that we allow the outcome space $\mathcal{Y}$ to be different from the prediction space $\mathcal{Y}'$.

in order to facilitate generalization. The class F can be thought of as (implicitly) encoding
prior knowledge about the structure of the data. For example, in computer vision, if the
features xt correspond to images and the outcomes y t are labels (e.g., “cat” or “dog”), one
might expect that choosing F to be a class of convolutional neural networks will work well,
since this encodes spatial structure.

Remark 1 (Conditional density estimation): For the problem of conditional den-


sity estimation, we shall overload the notation and interchangeably write f (x) and
f (·|x) for the conditional distribution. In this setting, the learner is required to com-
pute a distribution for each x rather than form a point estimate (see Figure 3). For an
outcome y, the loss is the negative log of the conditional density for the outcome.

[Figure: a conditional density $f(y|x)$ over outcomes $y$ for each value of $x$.]
Figure 3: Conditional density estimation.

Empirical risk minimization and excess risk. The most basic and well-studied algo-
rithmic principle for statistical learning is Empirical Risk Minimization (ERM). Define the
empirical loss for the dataset $\mathcal{H}^T$ as

\[ \widehat{L}(f) = \frac{1}{T}\sum_{i=1}^{T} \ell(f(x^i), y^i). \tag{1.5} \]

Then, the empirical risk minimizer with respect to the class $\mathcal{F}$ is given by

\[ \widehat{f} \in \operatorname*{arg\,min}_{f \in \mathcal{F}} \widehat{L}(f). \tag{1.6} \]

To measure the performance of ERM and other algorithms that attempt to learn with $\mathcal{F}$, we consider the excess loss (or, regret)

\[ \mathcal{E}(f) = L(f) - \min_{f' \in \mathcal{F}} L(f'). \tag{1.7} \]

Intuitively, the quantity $\min_{f' \in \mathcal{F}} L(f')$ in (1.7) captures the best prediction performance any function in $\mathcal{F}$ can achieve, even with knowledge of the true distribution. If an algorithm $\widehat{f}$ has low excess risk, this means that we are predicting future outcomes nearly as well as any algorithm based on samples can hope to perform. ERM and other algorithms can ensure that $\mathcal{E}(\widehat{f})$ is small in expectation or with high probability over the draw of the dataset $\mathcal{H}^T$.
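To make the ERM template concrete, here is a minimal Python sketch that runs empirical risk minimization over a small finite class under the square loss and estimates the resulting excess loss. The class of threshold predictors and the simulated data-generating distribution below are illustrative assumptions, not part of the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# A small finite class F: threshold predictors f_c(x) = 1{x >= c} on a grid of thresholds.
# (Illustrative choice; any finite collection of functions X -> Y' works the same way.)
thresholds = np.linspace(0, 1, 21)
F = [lambda x, c=c: (x >= c).astype(float) for c in thresholds]

def square_loss(pred, y):
    return (pred - y) ** 2

# Simulated data from an assumed M*: y = 1{x >= 0.3}, flipped with probability 0.1.
T = 500
x = rng.uniform(0, 1, size=T)
y = ((x >= 0.3).astype(float) + (rng.uniform(size=T) < 0.1)) % 2

# Empirical risk minimization (1.5)-(1.6): pick the f in F with smallest empirical loss.
emp_losses = [square_loss(f(x), y).mean() for f in F]
f_hat = F[int(np.argmin(emp_losses))]

# Approximate the excess loss (1.7) by evaluating population losses on fresh samples.
x_test = rng.uniform(0, 1, size=100_000)
y_test = ((x_test >= 0.3).astype(float) + (rng.uniform(size=100_000) < 0.1)) % 2
test_losses = [square_loss(f(x_test), y_test).mean() for f in F]
print("excess loss of ERM:", square_loss(f_hat(x_test), y_test).mean() - min(test_losses))
```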

Connection to estimation. An appealing feature of the formulation in (1.7) is that it
does not presuppose any relationship between the class F and the data distribution; in
other words, it is agnostic. However, if F does happen to be good at modeling the data
distribution, the excess loss has an additional interpretation based on estimation.

Definition 1: For prediction with square loss, we say that the problem is well-specified
(or, realizable) if the regression function f ⋆ (a) := E[y|x = a] is in F.

The regression function f ⋆ can also be seen as a minimizer of L(f ) over measurable functions
f , for the same reason that Ez (z − b)2 is minimized at b = E[z].

Lemma 1: For the square loss, if the problem is well-specified, then for all $f : \mathcal{X} \to \mathcal{Y}$,

\[ \mathcal{E}(f) = \mathbb{E}_x\big[(f(x) - f^\star(x))^2\big] \tag{1.8} \]

Proof of Lemma 1. Adding and subtracting $f^\star$ in the first term of (1.7), we have

\[ \mathbb{E}(f(x) - y)^2 - \mathbb{E}(f^\star(x) - y)^2 = \mathbb{E}(f(x) - f^\star(x))^2 + 2\,\mathbb{E}[(f^\star(x) - y)(f(x) - f^\star(x))]. \]

The cross term vanishes, since conditioning on $x$ gives $\mathbb{E}[(f^\star(x) - y) \mid x] = 0$ when the problem is well-specified; this yields (1.8).

Inspecting (1.8), we see that any $f$ achieving low excess loss necessarily estimates the true regression function $f^\star$; hence, the goals of prediction and estimation coincide.

Guarantees for ERM. We give bounds on the excess loss of ERM for perhaps the
simplest special case, in which F is finite.

Proposition 1: For any finite class $\mathcal{F}$, empirical risk minimization satisfies

\[ \mathbb{E}\big[\mathcal{E}(\widehat{f})\big] \lesssim \mathrm{comp}(\mathcal{F}, T), \tag{1.9} \]

where

1. For any bounded loss (including classification), $\mathrm{comp}(\mathcal{F}, T) = \sqrt{\frac{\log|\mathcal{F}|}{T}}$.

2. For square loss regression, if the problem is well-specified, $\mathrm{comp}(\mathcal{F}, T) = \frac{\log|\mathcal{F}|}{T}$.

In addition, there exists a (different) algorithm that achieves $\mathrm{comp}(\mathcal{F}, T) = \frac{\log|\mathcal{F}|}{T}$ for both square loss regression and conditional density estimation, even when the problem is not well-specified.

Henceforth, we shall use the symbol ≲ to indicate an inequality that holds up to constants,
or other problem parameters deemed less important for the present discussion. As an
example, the range of losses for the first part is hidden in this notation, and we only focus
on the dependence of the right-hand side on F and T .

The rate $\mathrm{comp}(\mathcal{F}, T) = \sqrt{\frac{\log|\mathcal{F}|}{T}}$ above is sometimes referred to as a slow rate, and is optimal for generic losses. The rate $\mathrm{comp}(\mathcal{F}, T) = \frac{\log|\mathcal{F}|}{T}$ is referred to as a fast rate, and takes advantage of additional structure (curvature, or strong convexity) of the square loss. Critically, both bounds scale only with the cardinality of $\mathcal{F}$, and do not depend on the size of the feature space $\mathcal{X}$, which could be infinite. This reflects the fact that working with a restricted function class is allowing us to generalize across the feature space $\mathcal{X}$. In this context the cardinality $\log|\mathcal{F}|$ should be thought of as a notion of capacity, or expressiveness, for $\mathcal{F}$. Intuitively, choosing a larger, more expressive class will require a larger amount of data, but will make the excess loss bound in (1.7) more meaningful, since the benchmark will be stronger.

Remark 2 (From finite to infinite classes): Throughout these lecture notes, we


restrict our attention to finite classes whenever possible in order to simplify presentation.
If one wishes to move beyond finite classes, a well-developed literature within statistical
learning provides various notions of complexity for F that lead to bounds on comp(F, T )
for ERM and other algorithms. These include the Vapnik-Chervonenkis (VC) dimension
for classification, Rademacher complexity, and covering numbers. Standard references
include Bousquet et al. [20], Boucheron et al. [19], Anthony and Bartlett [10], Shalev-
Shwartz and Ben-David [77].

1.5 Refresher: Random Variables and Averages


To prove Proposition 1 and similar generalization bounds, the main tools we will use are
concentration inequalities (or, tail bounds) for random variables.

Definition 2: A random variable $Z$ is sub-Gaussian with variance factor (or variance proxy) $\sigma^2$ if

\[ \forall \eta \in \mathbb{R}, \quad \mathbb{E}\, e^{\eta(Z - \mathbb{E}[Z])} \leq e^{\sigma^2 \eta^2 / 2}. \]

Note that if Z ∼ N (0, σ 2 ) is Gaussian with variance σ 2 , then it is sub-Gaussian with vari-
ance proxy σ 2 . In this sense, sub-Gaussian random variables generalize the tail behavior of
Gaussians. A standard application of the Chernoff method yields the following result.

Lemma 2: If $Z_1, \ldots, Z_T$ are i.i.d. sub-Gaussian random variables with variance proxy $\sigma^2$, then

\[ \mathbb{P}\left(\frac{1}{T}\sum_{i=1}^{T} Z_i - \mathbb{E}[Z] \geq u\right) \leq \exp\left(-\frac{T u^2}{2\sigma^2}\right) \tag{1.10} \]

Applying this result with $Z$ and $-Z$ and taking a union bound yields the following two-sided guarantee:

\[ \mathbb{P}\left(\left|\frac{1}{T}\sum_{i=1}^{T} Z_i - \mathbb{E}[Z]\right| \geq u\right) \leq 2\exp\left(-\frac{T u^2}{2\sigma^2}\right). \tag{1.11} \]

Setting the right-hand side of (1.11) to $\delta$ and solving for $u$, we find that for any $\delta \in (0, 1)$, with probability at least $1 - \delta$,

\[ \left|\frac{1}{T}\sum_{i=1}^{T} Z_i - \mathbb{E}[Z]\right| \leq \sqrt{\frac{2\sigma^2 \log(2/\delta)}{T}}. \tag{1.12} \]

Remark 3 (Union bound): The factor 2 under the logarithm in (1.12) is the result
of applying union bound to (1.10). Throughout the course, we will frequently apply the
union bound to multiple—say N —high probability events involving sub-Gaussian ran-
dom variables. In this case, the union bound will result in terms of the form log(N/δ).
The mild logarithmic dependence is due to the sub-Gaussian tail behavior of the aver-
ages.

The following result shows that any bounded random variable is sub-Gaussian.

Lemma 3 (Hoeffding’s inequality): Any random variable $Z$ taking values in $[a, b]$ is sub-Gaussian with variance proxy $(b - a)^2/4$, i.e.

\[ \forall \eta \in \mathbb{R}, \quad \ln \mathbb{E}\exp\{-\eta(Z - \mathbb{E}[Z])\} \leq \frac{\eta^2 (b - a)^2}{8}. \tag{1.13} \]

As a consequence, for i.i.d. random variables $Z_1, \ldots, Z_T$ taking values in $[a, b]$ almost surely, with probability at least $1 - \delta$,

\[ \frac{1}{T}\sum_{i=1}^{T} Z_i - \mathbb{E}[Z] \leq (b - a)\sqrt{\frac{\log(1/\delta)}{2T}} \tag{1.14} \]
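As an illustration (not from the original text), the short Python simulation below checks the one-sided bound (1.14) empirically for Bernoulli random variables; the sample size, number of repetitions, and confidence level are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)

T, reps, delta = 200, 20_000, 0.05
p = 0.3  # Bernoulli mean (illustrative); values lie in [a, b] = [0, 1]

# Hoeffding bound (1.14) on the deviation of the empirical mean above E[Z].
bound = np.sqrt(np.log(1 / delta) / (2 * T))

# Empirical frequency with which the deviation exceeds the bound.
samples = rng.binomial(1, p, size=(reps, T))
deviations = samples.mean(axis=1) - p
print("Hoeffding bound:", bound)
print("empirical exceedance probability:", (deviations > bound).mean(), "vs delta =", delta)
```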

In particular, in the setting of Section 1.4, Lemma 3 applies to the random variables $Z_i = \ell(f(x^i), y^i)$ for any fixed $f \in \mathcal{F}$ with bounded loss, so that the empirical loss $\widehat{L}(f)$ concentrates around the population loss $L(f)$.

Using Hoeffding’s inequality, we can now prove Part 1 (the slow rate) from Proposition 1.

Lemma 4 (Proposition 1, Part 1): Let $\mathcal{F} \subseteq \{f : \mathcal{X} \to \mathcal{Y}\}$ be finite, and assume $\ell \circ f \in [0, 1]$ almost surely. Then with probability at least $1 - \delta$, ERM satisfies

\[ L(\widehat{f}) - \min_{f \in \mathcal{F}} L(f) \leq 2\sqrt{\frac{\log(2|\mathcal{F}|/\delta)}{2T}}. \]

Proof of Lemma 4. For any $f \in \mathcal{F}$, we can write

\[ L(\widehat{f}) - L(f) = \Big[L(\widehat{f}) - \widehat{L}(\widehat{f})\Big] + \Big[\widehat{L}(\widehat{f}) - \widehat{L}(f)\Big] + \Big[\widehat{L}(f) - L(f)\Big]. \]

Observe that for all $f : \mathcal{X} \to \mathcal{Y}$, we have

\[ L(f) - \widehat{L}(f) = \mathbb{E}\,\ell(f(X), Y) - \frac{1}{T}\sum_{i=1}^{T} \ell(f(X_i), Y_i). \]

By union bound and Lemma 3, with probability at least $1 - |\mathcal{F}|\delta$,

\[ \forall f \in \mathcal{F}, \quad \left|\mathbb{E}\,\ell(f(X), Y) - \frac{1}{T}\sum_{i=1}^{T} \ell(f(X_i), Y_i)\right| \leq \sqrt{\frac{\log(2/\delta)}{2T}} \tag{1.15} \]

Since $\widehat{L}(\widehat{f}) - \widehat{L}(f) \leq 0$ by the definition of ERM, applying (1.15) to the first and third terms in the decomposition above and replacing $\delta$ with $\delta/|\mathcal{F}|$ gives the claimed bound.

To deduce the in-expectation bound of Proposition 1 from the high-probability tail bound of Lemma 4, a standard technique of “integrating out the tail” is employed. More precisely, for a nonnegative random variable $U$, it holds that $\mathbb{E}[U] \leq \tau + \int_{\tau}^{\infty} \mathbb{P}(U \geq z)\, dz$ for all $\tau > 0$; choosing $\tau \propto T^{-1/2}$ concludes the proof.
To prove Part 2 (the fast rate) from Proposition 1, we need a more refined concentration inequality (Bernstein’s inequality), which gives tighter guarantees for random variables with small variance.

Lemma 5 (Bernstein’s inequality): Let $Z_1, \ldots, Z_T, Z$ be i.i.d. with variance $\mathbb{V}(Z_i) = \sigma^2$, and range $|Z - \mathbb{E} Z| \leq B$ almost surely. Then with probability at least $1 - \delta$,

\[ \frac{1}{T}\sum_{i=1}^{T} Z_i - \mathbb{E} Z \leq \sigma\sqrt{\frac{2\log(1/\delta)}{T}} + \frac{B\log(1/\delta)}{3T}. \tag{1.16} \]

The proof for Part 2 is given as an exercise in Section 1.7. We refer the reader to Ap-
pendix A.1 for further background on tail bounds.

1.6 Online Learning and Prediction


We now move on to the problem of online learning, or sequential prediction. The online
learning problem generalizes statistical learning on two fronts:

• Rather than receiving a batch dataset of T examples all at once, we receive the
examples (xt , y t ) one by one, and must predict y t from xt only using the examples we
have already observed.

• Instead of assuming that examples are drawn from a fixed distribution, we allow
examples to be generated in an arbitrary, potentially adversarial fashion.

Online Learning Protocol

for t = 1, . . . , T do
    Compute predictor $\widehat{f}^t : \mathcal{X} \to \mathcal{Y}$
    Observe $(x^t, y^t) \in \mathcal{X} \times \mathcal{Y}$

In more detail, at each timestep $t$, given the examples

\[ \mathcal{H}^{t-1} = \{(x^1, y^1), \ldots, (x^{t-1}, y^{t-1})\} \tag{1.17} \]

observed so far, the algorithm produces a predictor

\[ \widehat{f}^t = \widehat{f}^t(\cdot \mid \mathcal{H}^{t-1}), \]

which aims to predict the outcome $y^t$ from the features $x^t$. The algorithm’s goal is to minimize the cumulative loss over $T$ rounds, given by

\[ \sum_{t=1}^{T} \ell(\widehat{f}^t(x^t), y^t) \]

for a known loss function ℓ : Y ′ × Y → R; the cumulative loss can be thought of as a sum
of “out-of-sample” prediction errors. Since we will not be placing assumptions on the data-
generating process, it is not possible to make meaningful statements about the cumulative
loss itself. However, we can aim to ensure that this cumulative loss is not much worse than
the best empirical explanation of the data by functions in a given class F. That is, we
measure the algorithm’s performance via regret to $\mathcal{F}$:

\[ \mathbf{Reg} = \sum_{t=1}^{T} \ell(\widehat{f}^t(x^t), y^t) - \min_{f \in \mathcal{F}} \sum_{t=1}^{T} \ell(f(x^t), y^t). \tag{1.18} \]

Our aim is to design prediction algorithms that keep regret small for any sequence of
data. As in statistical learning, the class F should be thought of as capturing our prior
knowledge about the problem, and might be a linear model or neural network. At first
glance, keeping the regret small for arbitrary sequences might seem like an impossible task,
as it stands in stark contrast with statistical learning, where data is generated i.i.d. from
a fixed distribution. Nonetheless, we will see that algorithms with guarantees similar to those
for statistical learning are available.
Let us remark that it is often useful to apply online learning methods in settings where
data is not fully adversarial, but evolves according to processes too difficult to directly
model. For example, in the chapters that follow, we will apply online methods as a sub-
routine within more sophisticated algorithms for decision making. Here, the choice of past
decisions, while in our purview, does not look like i.i.d. or simple time-series data.

Remark 4 (Proper learning, improper learning, and randomization): The


online learning protocol does not require that fbt lies in F (fbt ∈ F). A method that
chooses functions from F will be called proper, and the one that selects predictors
outside of F will be called improper. It will also be useful to allow for randomized
predictions of the form
fbt ∼ q t (·|Ht−1 ),
where q t is a distribution on functions, typically on elements of F. For randomized
predictions, we slightly abuse notation and write regret as

\[ \mathbf{Reg} = \sum_{t=1}^{T} \mathbb{E}_{\widehat{f}^t \sim q^t}\big[\ell(\widehat{f}^t(x^t), y^t)\big] - \min_{f \in \mathcal{F}} \sum_{t=1}^{T} \ell(f(x^t), y^t). \tag{1.19} \]

The algorithms we introduce in the sequel below ensure small regret even if data are ad-
versarially and adaptively chosen. More precisely, for deterministic algorithms, (xt , y t )
may be chosen based on fbt and all the past data, while for randomized algorithms,
Nature can only base this choice on q t .

In the context of Figure 2, online learning generalizes statistical learning by considering
arbitrary sequences of data, but still allows for general-purpose function approximation and
generalization via the class F. While the setting involves making predictions in an online
fashion, we do not think of this as an interactive decision making problem, because the
predictions made by the learning agent do not directly influence what data the agent gets
to observe.

1.6.1 Connection to Statistical Learning


Online learning can be thought of as a generalization of statistical learning, and in fact,
algorithms for online learning immediately yield algorithms for statistical learning via a
technique called online-to-batch conversion. This result, which is formalized by the following
proposition, rests on two observations: the cumulative loss of the algorithm looks like a
sum of out-of-sample errors, and the minimum empirical fit to realized data (over F) is,
on average, a harder (that is, smaller) benchmark than the minimum expected loss in
F.

Proposition 2: Suppose the examples $(x^1, y^1), \ldots, (x^T, y^T)$ are drawn i.i.d. from a distribution $M^\star$, and suppose the loss function $a \mapsto \ell(a, b)$ is convex in the first argument for all $b$. Then for any online learning algorithm, if we define

\[ \widehat{f}(x) = \frac{1}{T}\sum_{t=1}^{T} \widehat{f}^t(x), \]

we have

\[ \mathbb{E}\big[\mathcal{E}(\widehat{f})\big] \leq \frac{1}{T}\cdot\mathbb{E}[\mathbf{Reg}]. \]

Proof of Proposition 2. Let $(x, y) \sim M^\star$ be a fresh sample which is independent of the history $\mathcal{H}^T$. First, by Jensen’s inequality,

\[ \mathbb{E}\big[L(\widehat{f})\big] = \mathbb{E}\left[\mathbb{E}_{(x,y)}\,\ell\left(\frac{1}{T}\sum_{t=1}^{T}\widehat{f}^t(x),\, y\right)\right] \leq \mathbb{E}\left[\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}_{(x,y)}\,\ell\big(\widehat{f}^t(x), y\big)\right] \tag{1.20} \]

which is equal to

\[ \mathbb{E}\left[\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}_{(x^t,y^t)}\,\ell\big(\widehat{f}^t(x^t), y^t\big)\right] \tag{1.21} \]

since $\widehat{f}^t$ is a function of $\mathcal{H}^{t-1}$, and $(x, y)$ and $(x^t, y^t)$ are i.i.d. Second,

\[ \min_{f \in \mathcal{F}} L(f) = \min_{f \in \mathcal{F}} \mathbb{E}\left[\frac{1}{T}\sum_{t=1}^{T}\ell(f(x^t), y^t)\right] \geq \mathbb{E}\left[\min_{f \in \mathcal{F}}\frac{1}{T}\sum_{t=1}^{T}\ell(f(x^t), y^t)\right] \tag{1.22} \]

In light of Proposition 2, one can interpret regret as generalizing the notion of excess risk
from i.i.d. data to arbitrary sequences.
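The conversion in Proposition 2 is easy to implement: run any online learner over the data and average its predictors. The sketch below does this for a finite class of constant predictors under the square loss, using a simple follow-the-leader online learner as an illustrative stand-in (stronger online learners such as exponential weights, introduced next, can be dropped in the same way); the class and data-generating process are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(2)

# Finite class of constant predictors f_c(x) = c (illustrative); square loss.
grid = np.linspace(0, 1, 11)

def online_to_batch(xs, ys):
    """Run a simple online learner (follow-the-leader over the finite class)
    and return the averaged predictor of Proposition 2."""
    chosen = []                     # index of the predictor used at each round
    cum_loss = np.zeros(len(grid))
    for x, y in zip(xs, ys):
        chosen.append(int(np.argmin(cum_loss)))   # f-hat^t depends on past data only
        cum_loss += (grid - y) ** 2                # update empirical losses with (x^t, y^t)
    weights = np.bincount(chosen, minlength=len(grid)) / len(xs)
    # Averaged predictor f-hat(x) = (1/T) sum_t f-hat^t(x); constants make this a number.
    return float(weights @ grid)

# i.i.d. data from an assumed M*: y = 0.4 + noise; x is unused by the constant class.
T = 1000
xs = rng.uniform(size=T)
ys = 0.4 + 0.1 * rng.standard_normal(T)
print("averaged predictor:", online_to_batch(xs, ys), "(population minimizer is 0.4)")
```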

Similar to Lemma 1 in the setting of statistical learning, the regret for online learning
has an additional interpretation in terms of estimation if the outcomes for the problem are
well-specified.

Lemma 6: Suppose that the features x1 , . . . , xT are generated in an arbitrary fashion,


but that for all t, the variable y t is random with mean given by a fixed function f ⋆ ∈ F:

E[y t | xt = x] = f ⋆ (x).

Then for the problem of prediction with square loss,


" T #
X
E[Reg] ≥ E (fbt (xt ) − f ⋆ (xt ))2 .
t=1

Notably, this result holds even if the features x1 , . . . , xT are generated adversarially, with no
prior knowledge of the sequence. This is a significant departure from classical estimation
results in statistics, where estimation of an unknown function is typically done over a fixed,
known sequence (“design”) x1 , . . . , xT , or with respect to an i.i.d. dataset.

1.6.2 The Exponential Weights Algorithm


The main online learning algorithm is the Exponential Weights algorithm, which is appli-
cable to finite classes F. At each time t, the algorithm computes a distribution q t ∈ ∆(F)
via
\[ q^t(f) \propto \exp\left\{-\eta \sum_{i=1}^{t-1} \ell(f(x^i), y^i)\right\}, \tag{1.23} \]

where η > 0 is a learning rate. Based on q t , the algorithm forms the prediction fbt . We give
two variants of the method here.

Exponential Weights (averaged)
for t = 1, . . . , T do
    Compute $q^t$ in (1.23).
    Let $\widehat{f}^t = \mathbb{E}_{f \sim q^t}[f]$.
    Observe $(x^t, y^t)$, incur $\ell(\widehat{f}^t(x^t), y^t)$.

Exponential Weights (randomized)
for t = 1, . . . , T do
    Compute $q^t$ in (1.23).
    Sample $\widehat{f}^t \sim q^t$.
    Observe $(x^t, y^t)$, incur $\ell(\widehat{f}^t(x^t), y^t)$.
The only difference between these variants lies in whether we compute the prediction fbt
from q t via

\[ \widehat{f}^t = \mathbb{E}_{f \sim q^t}[f], \quad \text{or} \quad \widehat{f}^t \sim q^t. \tag{1.24} \]

The latter can be applied to any bounded loss functions, while the former leads to faster
rates for specific losses such as the square loss and log loss, but is only applicable when Y ′ is
convex. Note that the averaged version is inherently improper, while the second is proper,
yet randomized. From the point of view of regret, the key difference between these two
versions is the placement of “Ef ∼qt ”: For the averaged version it is inside the loss function,
and for the randomized version it is outside (see (1.19)). The averaged version can therefore
take advantage of the structure of the loss function, such as strong convexity, leading to

faster rates. The following result shows that Exponential Weights leads to regret bounds
for online learning, with rates that parallel those in Proposition 1.

Proposition 3: For any finite class $\mathcal{F}$, the Exponential Weights algorithm (with appropriate choice of $\eta$) satisfies

\[ \frac{1}{T}\mathbf{Reg} \lesssim \mathrm{comp}(\mathcal{F}, T) \tag{1.25} \]

for any sequence, where:

1. For arbitrary bounded losses (including classification), $\mathrm{comp}(\mathcal{F}, T) = \sqrt{\frac{\log|\mathcal{F}|}{T}}$. This is achieved by the randomized variant.

2. For regression with the square loss and conditional density estimation with the log loss, $\mathrm{comp}(\mathcal{F}, T) = \frac{\log|\mathcal{F}|}{T}$. This is achieved by the averaged variant.
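A minimal Python sketch of the randomized variant is given below, checking the $\sqrt{T\log|\mathcal{F}|}$ regret scaling on simulated data. The finite class of constant predictors and the outcome sequence are assumptions made for illustration, not part of the notes.

```python
import numpy as np

rng = np.random.default_rng(3)

# Finite class: constant predictors on a grid (illustrative); square loss, bounded in [0, 1].
experts = np.linspace(0, 1, 20)
T = 5000
eta = np.sqrt(8 * np.log(len(experts)) / T)      # learning rate from the proof of Proposition 3

ys = rng.binomial(1, 0.7, size=T).astype(float)  # an arbitrary outcome sequence

cum_loss = np.zeros(len(experts))
alg_loss = 0.0
for t in range(T):
    q = np.exp(-eta * (cum_loss - cum_loss.min()))   # exponential weights update (1.23)
    q /= q.sum()
    f_t = rng.choice(experts, p=q)                   # randomized variant: sample f-hat^t ~ q^t
    alg_loss += (f_t - ys[t]) ** 2
    cum_loss += (experts - ys[t]) ** 2               # loss of every expert on round t

regret = alg_loss - cum_loss.min()
print("realized regret:", regret, " bound sqrt(T log|F| / 2):", np.sqrt(T * np.log(len(experts)) / 2))
```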

We now turn to the proof of Proposition 3. Since we are not placing any assumptions
on the data generating process, we cannot hope to control the algorithm’s loss at any
particular time t, but only cumulatively. It is then natural to employ amortized analysis
with a potential function.
In more detail, the proof of Proposition 3 relies on several steps, common to standard
analyses of online learning: (i) define a potential function, (ii) relate the increase in potential
at each time step, to the loss of the algorithm, (iii) relate cumulative loss of any expert
f ∈ F to the final potential. For the Exponential Weights Algorithm, the proof relies on
the following potential for time $t$, parameterized by $\eta > 0$:

\[ \Phi^t_\eta = -\log \sum_{f \in \mathcal{F}} \exp\left\{-\eta \sum_{i=1}^{t} \ell(f(x^i), y^i)\right\}. \tag{1.26} \]

The choice of this potential is rather opaque, and a full explanation of its origin is beyond
the scope of the course, but we mention in passing that there are principled ways of coming
up with potentials in general online learning problems.

Proof of Proposition 3. We first prove the second statement, focusing on conditional density
with the logarithmic loss; for the square loss, see Remark 6 below.
Proof for Part 2: Log loss. Recall that for each $x$, $f(x)$ is a distribution over $\mathcal{Y}$, and $\ell_{\log}(f(x), y) = -\log f(y|x)$, where we abuse the notation and write $f(x)$ and $f(\cdot|x)$ interchangeably. With $\eta = 1$, the averaged variant of exponential weights satisfies

\[ \widehat{f}^t(y^t|x^t) = \sum_{f \in \mathcal{F}} q^t(f) f(y^t|x^t) = \sum_{f \in \mathcal{F}} f(y^t|x^t)\, \frac{\exp\big\{-\sum_{i=1}^{t-1} \ell_{\log}(f(x^i), y^i)\big\}}{\sum_{f' \in \mathcal{F}} \exp\big\{-\sum_{i=1}^{t-1} \ell_{\log}(f'(x^i), y^i)\big\}}, \tag{1.27} \]

and thus

\[ \ell_{\log}(\widehat{f}^t(x^t), y^t) = -\log \widehat{f}^t(y^t|x^t) = \Phi^t_1 - \Phi^{t-1}_1. \tag{1.28} \]

Hence, by telescoping,

\[ \sum_{t=1}^{T} \ell_{\log}(\widehat{f}^t(x^t), y^t) = \Phi^T_1 - \Phi^0_1. \]

Finally, observe that $\Phi^0_1 = -\log|\mathcal{F}|$ and, since $-\log$ is monotonically decreasing, we have

\[ \Phi^T_1 \leq -\log \exp\left\{-\sum_{i=1}^{T} \ell_{\log}(f^\star(x^i), y^i)\right\} = \sum_{i=1}^{T} \ell_{\log}(f^\star(x^i), y^i), \tag{1.29} \]

for any f ⋆ ∈ F. This establishes the result for conditional density estimation with the log
loss. As already discussed, the above proof follows the strategy: the loss on each round is related to the change in potential (1.28), and the cumulative loss of any expert is related to the final potential (1.29). We now aim to replicate these steps for arbitrary bounded losses.

Proof for Part 1: Generic loss. To prove this result, we build on the log loss result above.
First, observe that without loss of generality, we may assume that ℓ ◦ f ∈ [0, 1] for all f ∈ F
and (x, y), as we can always re-scale the problem. The randomized variant of exponential
weights (1.24) satisfies

\[ \mathbb{E}_{\widehat{f}^t \sim q^t}\big[\ell(\widehat{f}^t(x^t), y^t)\big] = \sum_{f \in \mathcal{F}} \ell(f(x^t), y^t)\, \frac{\exp\big\{-\eta\sum_{i=1}^{t-1} \ell(f(x^i), y^i)\big\}}{\sum_{f' \in \mathcal{F}} \exp\big\{-\eta\sum_{i=1}^{t-1} \ell(f'(x^i), y^i)\big\}}. \tag{1.30} \]

Hoeffding’s inequality (1.13) implies that

\[ \eta\, \mathbb{E}_{\widehat{f}^t \sim q^t}\big[\ell(\widehat{f}^t(x^t), y^t)\big] \leq -\log \frac{\sum_{f \in \mathcal{F}} \exp\{-\eta\ell(f(x^t), y^t)\}\exp\big\{-\eta\sum_{i=1}^{t-1} \ell(f(x^i), y^i)\big\}}{\sum_{f \in \mathcal{F}} \exp\big\{-\eta\sum_{i=1}^{t-1} \ell(f(x^i), y^i)\big\}} + \frac{\eta^2}{8}. \tag{1.31} \]

Note that the right-hand side of this inequality is simply

\[ \Phi^t_\eta - \Phi^{t-1}_\eta + \frac{\eta^2}{8}, \]

establishing the analogue of (1.28). Summing over $t$, this gives

\[ \eta \sum_{t=1}^{T} \mathbb{E}_{\widehat{f}^t \sim q^t}\big[\ell(\widehat{f}^t(x^t), y^t)\big] \leq \Phi^T_\eta - \Phi^0_\eta + \frac{T\eta^2}{8}. \tag{1.32} \]

As in the first part, for any $f^\star \in \mathcal{F}$, we can upper bound

\[ \Phi^T_\eta \leq \eta \sum_{t=1}^{T} \ell(f^\star(x^t), y^t), \]

while $\Phi^0_\eta = -\log|\mathcal{F}|$. Hence, we have that for any $f^\star \in \mathcal{F}$,

\[ \sum_{t=1}^{T} \mathbb{E}_{\widehat{f}^t \sim q^t}\big[\ell(\widehat{f}^t(x^t), y^t)\big] - \ell(f^\star(x^t), y^t) \leq \frac{T\eta}{8} + \frac{\log|\mathcal{F}|}{\eta}. \]

With $\eta = \sqrt{\frac{8\log|\mathcal{F}|}{T}}$, we conclude that

\[ \sum_{t=1}^{T} \mathbb{E}_{\widehat{f}^t \sim q^t}\big[\ell(\widehat{f}^t(x^t), y^t)\big] - \ell(f^\star(x^t), y^t) \leq \sqrt{\frac{T\log|\mathcal{F}|}{2}}. \tag{1.33} \]

Observe that Hoeffding’s inequality was all that was needed for Lemma 4. Curiously
enough, it was also the only nontrivial step in the proof of Proposition 3. In fact, the
connection between probabilistic inequalities and online learning regret inequalities (that
hold for arbitrary sequences) runs much deeper.

Remark 5 (Beyond finite classes): As in statistical learning, there are (sequential)


complexity measures for F that can be used to generalize the regret bounds in Propo-
sition 3 to infinite classes. In general, the optimal regret for a class F will reflect the
statistical capacity of the class [69].

Remark 6 (Mixable losses): We did not provide a proof of Proposition 3 for square
loss. It is tempting to reduce square loss regression to density estimation by taking the
conditional density to be a Gaussian distribution. Indeed, the log loss of a distribution
with density proportional to exp{−(fbt (xt )−y t )2 } is, up to constants, the desired square
loss. However, the mixture in (1.27) does not immediately lead to a prediction strategy
for the square loss, as the expectation appears in the wrong location. This issue is fixed
by a notion known as mixability.
We say that a loss $\ell$ is mixable with parameter $\eta$ if there exists a constant $c > 0$ such that the following holds: for any $x$ and a distribution $q \in \Delta(\mathcal{F})$, there exists a prediction $\widehat{f}(x) \in \mathcal{Y}'$ such that for all $y \in \mathcal{Y}$,

\[ \ell(\widehat{f}(x), y) \leq -\frac{c}{\eta}\log\left(\sum_{f \in \mathcal{F}} q(f)\exp\{-\eta\ell(f(x), y)\}\right). \tag{1.34} \]

If the loss is mixable, then given the exponential weights distribution $q^t$, the best prediction $\widehat{y}^t = \widehat{f}^t(x^t)$ can be written (by bringing the right-hand side of (1.34) to the left side) as an optimization problem

\[ \operatorname*{arg\,min}_{\widehat{y}^t \in \mathcal{Y}'}\; \max_{y^t \in \mathcal{Y}}\; \ell(\widehat{y}^t, y^t) + \frac{c}{\eta}\log\left(\sum_{f \in \mathcal{F}} q^t(f)\exp\{-\eta\ell(f(x^t), y^t)\}\right) \tag{1.35} \]

which is equivalent to

\[ \operatorname*{arg\,min}_{\widehat{y}^t \in \mathcal{Y}'}\; \max_{y^t \in \mathcal{Y}}\; \ell(\widehat{y}^t, y^t) + \frac{c}{\eta}\log\left(\sum_{f \in \mathcal{F}} \exp\Big\{-\eta\sum_{i=1}^{t}\ell(f(x^i), y^i)\Big\}\right) \tag{1.36} \]
once we remove the normalization factor. With this choice, mixability allows one to
replicate the proof of Proposition 3 for the logarithmic loss, with the only difference
being that (1.27) (after applying − log to both sides) becomes an inequality. It can be
verified that square loss is mixable with parameter η = 2 and c = 1 when Y = Y ′ = [0, 1],
leading to the desired fast rate for square loss in Proposition 3. The idea of translating
the English statement “there exists a strategy such that for any outcome...” into a
min-max inequality will come up again in the course.
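To make (1.35) concrete, the sketch below computes the mixable prediction for the square loss on $\mathcal{Y} = \mathcal{Y}' = [0, 1]$ with $\eta = 2$, $c = 1$ by brute-force grid search over the min-max problem; the grid resolution and the example weights are arbitrary illustrative choices.

```python
import numpy as np

def mixable_prediction(expert_preds, q, eta=2.0, c=1.0, grid_size=1001):
    """Solve the min-max problem (1.35) for the square loss on [0, 1] by grid search:
    choose y_hat minimizing the worst-case gap over outcomes y on a fine grid."""
    y_hat_grid = np.linspace(0, 1, grid_size)      # candidate predictions
    y_grid = np.linspace(0, 1, grid_size)          # candidate outcomes
    # mix(y) = -(c/eta) * log sum_f q(f) exp(-eta * (f(x) - y)^2), the RHS of (1.34)
    mix = -(c / eta) * np.log(
        np.sum(q[:, None] * np.exp(-eta * (expert_preds[:, None] - y_grid[None, :]) ** 2), axis=0)
    )
    # objective(y_hat) = max_y [ (y_hat - y)^2 - mix(y) ]; mixability says the optimum is <= 0.
    gaps = (y_hat_grid[:, None] - y_grid[None, :]) ** 2 - mix[None, :]
    best = int(np.argmin(gaps.max(axis=1)))
    return y_hat_grid[best], gaps[best].max()

# Example: three experts predicting 0.1, 0.5, 0.9 with weights from exponential weights.
preds = np.array([0.1, 0.5, 0.9])
q = np.array([0.6, 0.3, 0.1])
y_hat, worst_gap = mixable_prediction(preds, q)
print("mixable prediction:", y_hat, " worst-case violation of (1.34):", worst_gap)
```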

Remark 7 (Online linear optimization): For the slow rate in Proposition 3, the
nature of the loss and the dependence on the function f is immaterial for the proof. The
guarantee can be stated in a more abstract form that depends only on the vector of losses
for functions in $\mathcal{F}$ as follows. Let $|\mathcal{F}| = N$. For timestep $t$, define $\ell^t_f = \ell(f(x^t), y^t)$ and $\ell^t = (\ell^t_{f_1}, \ldots, \ell^t_{f_N}) \in \mathbb{R}^N$ for $\mathcal{F} = \{f_1, \ldots, f_N\}$. For a randomized strategy $q^t \in \Delta([N])$, the expected loss of the learner can be written as

\[ \mathbb{E}_{\widehat{f}^t \sim q^t}\big[\ell(\widehat{f}^t(x^t), y^t)\big] = \langle q^t, \ell^t \rangle, \]

and the expected regret can be written as

\[ \mathbf{Reg} = \sum_{t=1}^{T} \langle q^t, \ell^t \rangle - \min_{j \in \{1,\ldots,N\}} \sum_{t=1}^{T} \langle e_j, \ell^t \rangle \tag{1.37} \]

where $e_j \in \mathbb{R}^N$ is the standard basis vector with 1 in the $j$th position. In its most general form, the exponential weights algorithm gives bounds on the regret in (1.37) for any sequence of vectors $\ell^1, \ldots, \ell^T$, and the update takes the form

\[ q^t(k) \propto \exp\left\{-\eta \sum_{i=1}^{t-1} \ell^i(k)\right\}. \]

This formulation can be viewed as a special case of a problem known as online linear
optimization, and the exponential weights method can be viewed as an instance of an
algorithm known as mirror descent.

1.7 Exercises

Exercise 1 (Proposition 1, Part 2): Consider the setting of Proposition 1, where $(x^1, y^1), \ldots, (x^T, y^T)$ are i.i.d., $\mathcal{F} \subseteq \{f : \mathcal{X} \to [0, 1]\}$ is finite, the true regression function satisfies $f^\star \in \mathcal{F}$, and $y^i \in [0, 1]$ almost surely. Prove that the empirical risk minimizer $\widehat{f}$ with respect to the square loss satisfies the following bound on the excess risk: with probability at least $1 - \delta$,

\[ \mathcal{E}(\widehat{f}) \lesssim \frac{\log(|\mathcal{F}|/\delta)}{T}. \tag{1.38} \]
Follow these steps:

1. For a fixed function $f \in \mathcal{F}$, consider the random variable

\[ Z_i(f) = (f(x^i) - y^i)^2 - (f^\star(x^i) - y^i)^2 \]

for $i = 1, \ldots, T$. Show that

\[ \mathbb{E}[Z_i(f)] = \mathbb{E}(f(x^i) - f^\star(x^i))^2 = \mathcal{E}(f). \]

2. Show that for any fixed $f \in \mathcal{F}$, the variance $\mathbb{V}(Z_i(f))$ is bounded as

\[ \mathbb{V}(Z_i(f)) \leq 4\,\mathbb{E}(f(x^i) - f^\star(x^i))^2. \]

3. Apply Bernstein’s inequality (Lemma 5) to show that for any $f \in \mathcal{F}$, with probability at least $1 - \delta$,

\[ \mathcal{E}(f) \leq 2\big(\widehat{L}(f) - \widehat{L}(f^\star)\big) + \frac{C\log(1/\delta)}{T}, \tag{1.39} \]

for an absolute constant $C$, where $\widehat{L}(f) = \frac{1}{T}\sum_{t=1}^{T}(f(x^t) - y^t)^2$.

4. Extend this probabilistic inequality to simultaneously hold for all f ∈ F by taking the
union bound over f ∈ F. Conclude as a consequence that the bound holds for fb, the empirical
minimizer, implying (1.38).

Exercise 2 (ERM in Online Learning): Consider the problem of Online Supervised Learning with indicator loss $\ell(f(x), y) = \mathbb{I}\{f(x) \neq y\}$, $\mathcal{Y} = \mathcal{Y}' = \{0, 1\}$, and a finite class $\mathcal{F}$.

1. Exhibit a class $\mathcal{F}$ for which ERM cannot ensure sublinear growth of regret for all sequences, i.e. there exists a sequence $(x^1, y^1), \ldots, (x^T, y^T)$ such that

\[ \sum_{t=1}^{T} \ell(\widehat{f}^t(x^t), y^t) - \min_{f \in \mathcal{F}} \sum_{t=1}^{T} \ell(f(x^t), y^t) = \Omega(T), \]

where $\widehat{f}^t$ is the empirical minimizer for the indicator loss on $(x^1, y^1), \ldots, (x^{t-1}, y^{t-1})$. Note: the construction must have $|\mathcal{F}| \leq C$, where $C$ is an absolute constant that does not depend on $T$.

2. Show that if data are i.i.d., then in expectation over the data, ERM attains a sublinear bound $O(\sqrt{T\log|\mathcal{F}|})$ on regret for any finite class $\mathcal{F}$.

Exercise 3 (Low Noise): 1. For a nonnegative random variable $X$, prove that for any $\eta \geq 0$,

\[ \ln \mathbb{E}\exp\{-\eta(X - \mathbb{E}[X])\} \leq \frac{\eta^2}{2}\mathbb{E}[X^2]. \tag{1.40} \]
Hint: use the fact that $\ln x \leq x - 1$ and $\exp(-x) \leq 1 - x + x^2/2$ for $x \geq 0$.

2. Consider the setting of Proposition 3, Part 1 (Generic Loss). Prove that the randomized variant of the Exponential Weights Algorithm satisfies, for any $f^\star \in \mathcal{F}$,

\[ \sum_{t=1}^{T} \mathbb{E}_{\widehat{f}^t \sim q^t}\big[\ell(\widehat{f}^t(x^t), y^t)\big] - \ell(f^\star(x^t), y^t) \leq \frac{\eta}{2}\sum_{t=1}^{T} \mathbb{E}_{\widehat{f}^t \sim q^t}\big[\ell(\widehat{f}^t(x^t), y^t)^2\big] + \frac{\log|\mathcal{F}|}{\eta}. \tag{1.41} \]
for any sequence of data and nonnegative losses. Hint: replace Hoeffding’s Lemma by (1.40).
3. Suppose $\ell(f(x), y) \in [0, 1]$ for all $x \in \mathcal{X}$, $y \in \mathcal{Y}$, and $f \in \mathcal{F}$. Suppose that there is a “perfect expert” $f^\star \in \mathcal{F}$ such that $\ell(f^\star(x^t), y^t) = 0$ for all $t \in [T]$. Conclude that the above algorithm, with an appropriate choice of $\eta$, enjoys a bound of $O(\log|\mathcal{F}|)$ on the cumulative loss of the algorithm (equivalently, the fast rate $\frac{\log|\mathcal{F}|}{T}$ for the average regret). This setting is called “zero-noise.”
4. Consider the binary classification problem with indicator loss, and suppose F contains a
perfect expert, as above. The Halving Algorithm maintains a version space F t = {f ∈ F :
f (xs ) = y s , s < t} and, given xt , follows the majority vote of remaining experts in F t . Show
that this algorithm incurs cumulative loss at most O(log|F|). Hence, the Exponential Weights
Algorithm can be viewed as an extension of the Halving algorithm to settings where the optimal
loss is non-zero.

2. MULTI-ARMED BANDITS

This chapter introduces the multi-armed bandit problem, which is the simplest interactive
decision making framework we will consider in this course.

Multi-Armed Bandit Protocol


for t = 1, . . . , T do
Select decision π t ∈ Π := {1, . . . , A}.
Observe reward rt ∈ R.

The protocol (see above) proceeds in $T$ rounds. At each round $t \in [T]$, the learning agent selects a discrete decision² $\pi^t \in \Pi = \{1, \ldots, A\}$ using the data

\[ \mathcal{H}^{t-1} = \{(\pi^1, r^1), \ldots, (\pi^{t-1}, r^{t-1})\} \]

collected so far; we refer to Π as the decision space or action space, with A ∈ N denoting
the size of the space. We allow the learner to randomize the decision at step t according
to a distribution pt = pt (· | Ht−1 ), sampling π t ∼ pt . Based on the decision π t , the learner
receives a reward rt , and their goal is to maximize the cumulative reward across all T
rounds. As an example, one might consider an application in which the learner is a doctor
(or personalized medical assistant) who aims to select a treatment (the decision) in order
to make a patient feel better (maximize reward); see Figure 4.
The multi-armed bandit problem can be studied in a stochastic framework, in which re-
wards are generated from a fixed (conditional) distribution, or an non-stochastic/adversarial
framework in the vein of online learning (Section 1.6). We will focus on the stochastic frame-
work, and make the following assumption.

Assumption 1 (Stochastic Rewards): Rewards are generated independently via

rt ∼ M ⋆ (· | π t ), (2.1)

where M ⋆ (· | ·) is the underlying model (conditional distribution).

²In the literature on bandits, decisions are often referred to as actions. We will use these terms interchangeably throughout this section.

We define

\[ f^\star(\pi) := \mathbb{E}[r \mid \pi] \tag{2.2} \]

as the mean reward function under $r \sim M^\star(\cdot \mid \pi)$. We measure the learner’s performance via regret to the action $\pi^\star := \arg\max_{\pi \in \Pi} f^\star(\pi)$ with highest reward:

\[ \mathbf{Reg} := \sum_{t=1}^{T} f^\star(\pi^\star) - \sum_{t=1}^{T} \mathbb{E}_{\pi^t \sim p^t}\big[f^\star(\pi^t)\big]. \tag{2.3} \]

Regret is a natural notion of performance for the multi-armed bandit problem because
it is cumulative: it measures not just how well the learner can identify an action with
good reward, but how well it can maximize reward as it goes. This notion is well-suited
to settings like the personalized medicine example in Figure 4, where regret captures the
overall quality of treatments, not just the quality of the final treatment. As in the online
learning framework, we would like to develop algorithms that enjoy sublinear regret, i.e.

\[ \frac{\mathbb{E}[\mathbf{Reg}]}{T} \to 0 \quad \text{as } T \to \infty. \]
The most important feature of the multi-armed bandit problem, and what makes the
problem fundamentally interactive, is that the learner only receives a reward signal for the
single decision π t ∈ Π they select at each round. That is, the observed reward rt gives a
noisy estimate for f ⋆ (π t ), but reveals no information about the rewards for other decisions
π ̸= π t . For example in Figure 4, if the doctor prescribes a particular treatment to the

[Figure: the doctor (learner) observes a context $x^t$, selects a treatment (decision) $\pi^t$, and receives a reward $r^t$ and an auxiliary observation $o^t$.]
Figure 4: An illustration of the multi-armed bandit problem. A doctor (the learner) aims
to select a treatment (the decision) to improve a patient’s vital signs (the reward).

patient, they can observe whether the patient responds favorably, but they do not directly
observe whether other possible treatments might have led to an even better outcome. This
issue is often referred to as partial feedback or bandit feedback. Partial feedback introduces
an element of active data collection, as it means that the information contained in the
dataset Ht depends on the decisions made by the learner, which we will see necessitates
exploring different actions. This should be contrasted with statistical learning (where the
dataset is generated independently from the learner) and online learning (where losses may
be chosen by nature in response to the learner’s behavior, but where the outcome y t — and
hence the full loss function ℓ(·, y t )—is always revealed).

In the context of Figure 2, the multi-armed bandit problem constitutes our first step
along the “interactivity” axis, but does not incorporate any structure in the decision space
(and does not involve features/contexts/covariates). In particular, information about one
action does not reveal information about any other actions, so there is no hope of using
function approximation to generalize across actions.3 As a result, the algorithms we will
cover in this section will have regret that scales with Ω(|Π|) = Ω(A). This shortcoming is
addressed by the structured bandit framework we will introduce in Section 4, which allows
for the use of function approximation to model structure in the decision space.4

Remark 8 (Other notions of regret): It is also reasonable to consider empirical regret, defined as

    \max_{π∈Π}\sum_{t=1}^{T} r^t(π) − \sum_{t=1}^{T} r^t(π^t),    (2.4)

where, for π ≠ π^t, r^t(π) denotes the counterfactual reward the learner would have received if they had played π at round t. Using Hoeffding's inequality, one can show that this is equivalent to the definition in (2.3) up to O(\sqrt{T}) factors.

2.1 The Need for Exploration


In statistical learning, we saw that the empirical risk minimization algorithm, which greedily
chooses the function that best fits the data, leads to interesting bounds on excess risk. For
multi-armed bandits, since we assume the data generating process is stochastic, a natural
first attempt at designing an algorithm is to apply the greedy principle here in the same
fashion. Concretely, at time t, we can compute an empirical estimate for the reward function
f ⋆ via
    \hat{f}^t(π) = \frac{1}{n^t(π)} \sum_{s<t} r^s\, I\{π^s = π\},    (2.5)

where n^t(π) is the number of times π has been selected up to time t.⁵ Then, we can choose the greedy action

    \hat{π}^t = \arg\max_{π∈Π} \hat{f}^t(π).

Unfortunately, due to the interactive nature of the bandit problem, this strategy can fail,
leading to linear regret (Reg = Ω(T )). Consider the following problem with Π = {1, 2}
(A = 2).

• Decision 1 has reward 1/2 almost surely.

• Decision 2 has reward Ber(3/4).


³Another way to say this is that we take F = R^A, so that f^⋆ ∈ F.
⁴Throughout the lecture notes, we will exclusively use the term "multi-armed bandit" to refer to bandit problems with finite action spaces, and use the term "structured bandit" for problems with large action spaces.
⁵If n^t(π) = 0, we will set \hat{f}^t(π) = 0.
Suppose we initialize by playing each decision a single time to ensure that nt (π) > 0, then
follow the greedy strategy. One can see that with probability 1/4, the greedy algorithm will
get stuck on action 1, leading to regret Ω(T ).
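The failure mode above is easy to reproduce numerically. The following is a minimal simulation sketch (not part of the original argument; the environment wrapper, constants, and function names are our own) of the greedy strategy on this two-action instance. Averaged over many runs, at least a 1/4 fraction of them get stuck on decision 1, so the average regret grows linearly in T.

    import numpy as np

    def greedy_two_arm(T=10_000, seed=0):
        # Index 0 = "decision 1" (pays 1/2 deterministically), index 1 = "decision 2" (Bernoulli(3/4)).
        rng = np.random.default_rng(seed)
        totals, counts = np.zeros(2), np.zeros(2)

        def pull(a):
            return 0.5 if a == 0 else float(rng.binomial(1, 0.75))

        for a in range(2):                        # initialize: play each decision once
            totals[a] += pull(a)
            counts[a] += 1
        regret = 0.0
        for _ in range(T - 2):
            a = int(np.argmax(totals / counts))   # purely greedy choice
            totals[a] += pull(a)
            counts[a] += 1
            regret += 0.75 - (0.5 if a == 0 else 0.75)   # expected per-round regret
        return regret

    # Average expected regret over independent runs grows like a constant times T.
    print(np.mean([greedy_two_arm(seed=s) for s in range(200)]))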
The issue in this example is that the greedy algorithm immediately gives up on the
optimal action and never revisits it. To address this, we will consider algorithms that
deliberately explore less visited actions to ensure that their estimated rewards are not
misleading.

2.2 The ε-Greedy Algorithm


The greedy algorithm for bandits can fail because it can insufficiently explore good decisions
that initially seem bad, leading it to get stuck playing suboptimal decisions. In light of this
failure, a reasonable solution is to manually force the algorithm to explore, so as to ensure
that this situation never occurs. This leads us to what is known as the ε-Greedy algorithm
(e.g., Sutton and Barto [81], Auer et al. [13]).
Let ε ∈ [0, 1] be the exploration parameter. At each time t ∈ [T ], the ε-Greedy algorithm
computes the estimated reward function fbt as in (2.5). With probability 1−ε, the algorithm
chooses the greedy decision

    \hat{π}^t = \arg\max_{π} \hat{f}^t(π),    (2.6)

and with probability ε it samples a uniform random action π t ∼ unif({1, . . . , A}). As the
name suggests, ε-Greedy usually plays the greedy action (exploiting what it has already
learned), but the uniform sampling ensures that the algorithm will also explore unseen
actions. We can think of the parameter ε as modulating the tradeoff between exploiting
and exploring.
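For concreteness, here is a minimal sketch of ε-Greedy in code (illustrative only, not from the text; the reward-sampling function pull is an assumed stand-in for the environment, and ties in the argmax are broken arbitrarily). Following Proposition 4 below, a natural choice is ε ∝ (A log(AT/δ)/T)^{1/3}.

    import numpy as np

    def epsilon_greedy(pull, A, T, eps, seed=0):
        """eps-Greedy: with probability 1 - eps play the empirical maximizer of f-hat^t,
        otherwise sample an action uniformly at random."""
        rng = np.random.default_rng(seed)
        totals = np.zeros(A)   # running sum of rewards per action
        counts = np.zeros(A)   # number of times each action has been played
        for t in range(T):
            if rng.random() < eps:
                a = int(rng.integers(A))   # explore uniformly
            else:
                means = np.divide(totals, counts, out=np.zeros(A), where=counts > 0)
                a = int(np.argmax(means))  # exploit: greedy with respect to f-hat^t
            r = pull(a)
            totals[a] += r
            counts[a] += 1
        return totals, counts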

Proposition 4: Assume that f^⋆(π) ∈ [0, 1] and r^t is 1-sub-Gaussian. Then for any T, by choosing ε appropriately, the ε-Greedy algorithm ensures that with probability at least 1 − δ,

    \mathrm{Reg} ≲ A^{1/3} T^{2/3} \cdot \log^{1/3}(AT/δ).

This regret bound has E[\mathrm{Reg}]/T → 0 as T → ∞ as desired, though we will see in the sequel that more sophisticated strategies can attain improved regret bounds that scale with \sqrt{AT}.⁶

⁶Note that \sqrt{AT} ≤ A^{1/3}T^{2/3} whenever A ≤ T, and when A ≥ T both guarantees are vacuous.

Proof of Proposition 4. Recall that \hat{π}^t := \arg\max_{π} \hat{f}^t(π) denotes the greedy action at round t, and that p^t denotes the distribution over π^t. We can decompose the regret into two terms, representing the contribution from choosing the greedy action and the contribution from
exploring uniformly:

    \mathrm{Reg} = \sum_{t=1}^{T} E_{π^t∼p^t}\left[f^⋆(π^⋆) − f^⋆(π^t)\right]
                 = (1 − ε)\sum_{t=1}^{T}\left[f^⋆(π^⋆) − f^⋆(\hat{π}^t)\right] + ε\sum_{t=1}^{T} E_{π^t∼\mathrm{unif}([A])}\left[f^⋆(π^⋆) − f^⋆(π^t)\right]
                 \le \sum_{t=1}^{T}\left[f^⋆(π^⋆) − f^⋆(\hat{π}^t)\right] + εT.

In the last inequality, we have simply written off the contribution from exploring uniformly
by using that f ⋆ (π) ∈ [0, 1]. It remains to bound the regret we incur from playing the
greedy action. Here, we bound the per-step regret in terms of estimation error using a
similar decomposition to Lemma 4 (note that we are now working with rewards rather than
losses):
    f^⋆(π^⋆) − f^⋆(\hat{π}^t) = [f^⋆(π^⋆) − \hat{f}^t(π^⋆)] + \underbrace{[\hat{f}^t(π^⋆) − \hat{f}^t(\hat{π}^t)]}_{\le 0} + [\hat{f}^t(\hat{π}^t) − f^⋆(\hat{π}^t)]    (2.7)
        \le 2\max_{π∈\{π^⋆,\hat{π}^t\}} |f^⋆(π) − \hat{f}^t(π)| \le 2\max_{π} |f^⋆(π) − \hat{f}^t(π)|.    (2.8)

Note that this regret decomposition can also be applied to the pure greedy algorithm, which
we have already shown can fail. The reason why ε-Greedy succeeds, which we use in the
argument that follows, is that because we explore, the “effective” number of times that each
arm will be pulled prior to round t is of the order εt/A, which will ensure that the sample
mean converges to f ⋆ . In particular, we will show that the event
    E^t = \left\{\max_{π} |f^⋆(π) − \hat{f}^t(π)| ≲ \sqrt{\frac{A\log(AT/δ)}{εt}}\right\}    (2.9)

occurs for all t with probability at least 1 − δ.


To prove that (2.9) holds, we first use Hoeffding’s inequality for adaptive stopping times
(Lemma 33), which gives that for any fixed π, with probability at least 1 − δ over the draw
of rewards,
    |f^⋆(π) − \hat{f}^t(π)| \le \sqrt{\frac{2\log(2T/δ)}{n^t(π)}}.    (2.10)

From here, taking a union bound over all t ∈ [T] and π ∈ Π ensures that

    |f^⋆(π) − \hat{f}^t(π)| \le \sqrt{\frac{2\log(2AT^2/δ)}{n^t(π)}}    (2.11)

for all π and t simultaneously. It remains to show that the number of pulls n^t(π) is sufficiently large.
Let e^t ∈ \{0, 1\} be a random variable whose value indicates whether the algorithm explored uniformly at step t, and let m^t(π) = |\{i < t : π^i = π, e^i = 1\}|, which has n^t(π) ≥ m^t(π). Let Z^t = I\{π^t = π, e^t = 1\}. Observe that we can write

    m^t(π) = \sum_{i<t} Z^i.

In addition, Z^t ∼ \mathrm{Ber}(ε/A), so we have E[m^t(π)] = ε(t − 1)/A. Using Bernstein's inequality (Lemma 5) with Z^1, . . . , Z^{t−1}, we have that for any fixed π and all u > 0, with probability at least 1 − 2e^{−u},

    \left|m^t(π) − \frac{ε(t − 1)}{A}\right| \le \sqrt{2\,\mathbb{V}[Z](t − 1)u} + \frac{u}{3} \le \sqrt{\frac{2ε(t − 1)u}{A}} + \frac{u}{3} \le \frac{ε(t − 1)}{2A} + \frac{4u}{3},

where we have used that \mathbb{V}[Z] = ε/A \cdot (1 − ε/A) \le ε/A, and then applied the arithmetic mean-geometric mean (AM-GM) inequality, which states that \sqrt{xy} \le \frac{x}{2} + \frac{y}{2} for x, y ≥ 0. Rearranging, this gives

    m^t(π) ≥ \frac{ε(t − 1)}{2A} − \frac{4u}{3}.    (2.12)
Setting u = log(2AT /δ) and taking a union bound, we are guaranteed that with probability
at least 1 − δ, for all π ∈ Π and t ∈ [T ]

    m^t(π) ≥ \frac{ε(t − 1)}{2A} − \frac{4\log(2AT/δ)}{3}.    (2.13)

As long as εt ≳ A\log(AT/δ) (we can write off the rounds where this does not hold), this yields

    n^t(π) ≥ m^t(π) ≳ \frac{εt}{A}.
Taking a union bound and combining with (2.11), this implies that with probability at least 1 − δ, for all t,

    \max_{π} |f^⋆(π) − \hat{f}^t(π)| ≲ \sqrt{\frac{A\log(AT/δ)}{εt}},
which leads to the overall regret bound

    \mathrm{Reg} ≲ \sum_{t=1}^{T} \max_{π} |f^⋆(π) − \hat{f}^t(π)| + εT ≲ \sum_{t=1}^{T} \sqrt{\frac{A\log(AT/δ)}{εt}} + εT ≲ \sqrt{\frac{AT\log(AT/δ)}{ε}} + εT.    (2.14)
To balance the terms on the right-hand side, we set
    ε ∝ \left(\frac{A\log(AT/δ)}{T}\right)^{1/3},

which gives the final result.
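To make the final balancing step explicit (a short computation that is implicit above): ignoring constants, the two terms in (2.14) are equalized when

    \sqrt{\frac{AT\log(AT/δ)}{ε}} = εT \quad\Longleftrightarrow\quad ε^3 = \frac{A\log(AT/δ)}{T} \quad\Longleftrightarrow\quad ε = \left(\frac{A\log(AT/δ)}{T}\right)^{1/3},

and substituting this choice back into either term gives \mathrm{Reg} ≲ A^{1/3} T^{2/3} \log^{1/3}(AT/δ).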

This proof shows that the ε-Greedy strategy allows the learner to acquire information
uniformly for all actions, but we pay for this in terms of regret (specifically, through the
εT factor in the final regret bound (2.14)). The issue here is that the ε-Greedy strategy
continually explores all actions, even though we might expect to rule out actions with very
low reward after a relatively small amount of exploration. To address this shortcoming, we
will consider more adaptive strategies.

Remark 9 (Explore-then-commit): A relative of ε-Greedy is the explore-then-
commit (ETC) algorithm (e.g., Robbins [73], Langford and Zhang [57]), which uni-
formly explores actions for the first N rounds, then estimates rewards based on the
data collected and commits to the greedy action for the remaining T − N rounds.
This strategy can be shown to attain Reg ≲ A1/3 T 2/3 for an appropriate choice of N ,
matching ε-Greedy.
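A minimal sketch of explore-then-commit in code (illustrative; pull is an assumed environment stand-in, and the choice N ∝ A^{1/3}T^{2/3}, up to logarithmic factors, is the standard one that recovers the stated rate):

    import numpy as np

    def explore_then_commit(pull, A, T, N):
        """Pull each action N // A times, then commit to the empirical best for the remaining rounds."""
        per_arm = max(1, N // A)
        means = np.array([np.mean([pull(a) for _ in range(per_arm)]) for a in range(A)])
        best = int(np.argmax(means))
        tail_rewards = [pull(best) for _ in range(T - A * per_arm)]
        return best, float(np.sum(tail_rewards))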

2.3 The Upper Confidence Bound (UCB) Algorithm


The next algorithm we will study for bandits is the Upper Confidence Bound (UCB) algorithm [56, 6, 13]. The UCB algorithm attains a regret bound of the order \widetilde{O}(\sqrt{AT}), which
improves upon the regret bound for ε-Greedy, and is optimal (in a worst-case sense) up
to logarithmic factors. In addition to optimality, the algorithm offers several secondary
benefits, including adaptivity to favorable structure in the underlying reward function.
The UCB algorithm is based on the notion of optimism in the face of uncertainty,
which is a general principle we will revisit throughout this text in increasingly rich settings.
The idea behind the principle is that at each time t, we should adopt the most optimistic
perspective of the world possible given the data collected so far, and then choose the decision
π t based on this perspective.
To apply the idea of optimism to the multi-armed bandit problem, suppose that for each
step t, we can construct “confidence intervals”
f t , f¯t : Π → R, (2.15)
with the following property: with probability at least 1 − δ,
∀t ∈ [T ], π ∈ Π, f ∗ (π) ∈ [f t (π), f¯t (π)]. (2.16)
We refer to \underline{f}^t as a lower confidence bound and \bar{f}^t as an upper confidence bound, since we are

guaranteed that with high probability, they lower (resp. upper) bound f^⋆.

Figure 5: Illustration of the UCB algorithm. Selecting the action π^t optimistically ensures that the suboptimality never exceeds the confidence width.

Given confidence intervals, the UCB algorithm simply chooses π^t as the "optimistic" action that maximizes the upper confidence bound:

    π^t = \arg\max_{π∈Π} \bar{f}^t(π).

The following lemma shows that the instantaneous regret for this strategy is bounded by the width of the confidence interval; see Figure 5 for an illustration.

Lemma 7: Fix t, and suppose that f ⋆ (π) ∈ [f t (π), f¯t (π)] for all π. Then the optimistic
action
π t = arg max f¯t (π)
π∈Π

has

f ⋆ (π ⋆ ) − f ⋆ (π t ) ≤ f¯t (π t ) − f ⋆ (π t ) ≤ f¯t (π t ) − f t (π t ). (2.17)

Proof of Lemma 7. The result follows immediately from the observation that for any t ∈ [T ]
and any π ⋆ ∈ Π, we have

f ⋆ (π ⋆ ) ≤ f¯t (π ⋆ ) ≤ f¯t (π t ) and − f ⋆ (π t ) ≤ −f t (π t ).

Lemma 7 implies that as long as we can build confidence intervals for which the width
f¯t (π t ) − f t (π t ) shrinks, the regret for the UCB strategy will be small. To construct such
intervals, here we appeal to Hoeffding’s inequality for adaptive stopping times (Lemma 33).7
As long as rt ∈ [0, 1], a union bound gives that with probability at least 1 − δ, for all t ∈ [T ]
and π ∈ Π,

    |\hat{f}^t(π) − f^⋆(π)| \le \sqrt{\frac{2\log(2T^2 A/δ)}{n^t(π)}},    (2.18)

where we recall that \hat{f}^t is the sample mean and n^t(π) := \sum_{i<t} I\{π^i = π\}. This suggests that by choosing

    \bar{f}^t(π) = \hat{f}^t(π) + \sqrt{\frac{2\log(2T^2 A/δ)}{n^t(π)}}, \quad\text{and}\quad \underline{f}^t(π) = \hat{f}^t(π) − \sqrt{\frac{2\log(2T^2 A/δ)}{n^t(π)}},    (2.19)

we obtain a valid confidence interval. With this choice—along with Lemma 7—we are in a
favorable position, because for a given round t, one of two things must happen:

• The optimistic action has high reward, so the instantaneous regret is small.

• The instantaneous regret is large, which by Lemma 7 implies that confidence width
is large as well (and nt (π t ) is small). This can only happen a small number of times,
since nt (π t ) will increase as a result, causing the width to shrink.
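Before turning to the formal regret bound, here is a minimal sketch of the resulting algorithm with the widths from (2.19) (illustrative only; the environment function pull and the initial round-robin over arms are our own choices, not prescribed above).

    import numpy as np

    def ucb(pull, A, T, delta):
        """UCB: play each arm once, then choose the arm maximizing sample mean + confidence width."""
        totals, counts = np.zeros(A), np.zeros(A)
        for t in range(T):
            if t < A:
                a = t  # ensure n^t(a) > 0 for every arm
            else:
                means = totals / counts
                widths = np.sqrt(2 * np.log(2 * T**2 * A / delta) / counts)
                a = int(np.argmax(means + widths))   # optimistic (upper confidence) choice
            r = pull(a)
            totals[a] += r
            counts[a] += 1
        return totals, counts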

Using this idea, we can prove the following regret bound.

Proposition 5: Using the confidence bounds in (2.19), the UCB algorithm ensures
that with probability at least 1 − δ,
    \mathrm{Reg} ≲ \sqrt{AT\log(AT/δ)}.

⁷While asymptotic confidence intervals in classical statistics arise from limit theorems, we are interested in valid non-asymptotic intervals, and thus appeal to concentration inequalities.
This result is optimal up to the log(AT ) factor, which can be removed by using the same
algorithm with a slightly more sophisticated confidence interval construction [11]. Note
that compared to the statistical learning and online learning setting, where we were able to
attain regret bounds that scaled logarithmically with the size of the benchmark class, here
the optimal regret scales linearly with |Π| = A. This is the price we pay for partial/bandit
feedback, and reflects the fact that we must explore all actions to learn.

Proof of Proposition 5. Let us condition on the event in (2.18). Whenever this occurs, we have that f^⋆(π) ∈ [\underline{f}^t(π), \bar{f}^t(π)] for all t ∈ [T] and π ∈ Π, so the confidence intervals are valid. As a result, Lemma 7 bounds regret in terms of the confidence width:

    \sum_{t=1}^{T} f^⋆(π^⋆) − f^⋆(π^t) \le \sum_{t=1}^{T} \left[\bar{f}^t(π^t) − \underline{f}^t(π^t)\right] \wedge 1 = \sum_{t=1}^{T} \left[2\sqrt{\frac{2\log(2T^2 A/δ)}{n^t(π^t)}}\right] \wedge 1;    (2.20)

here, the “∧1” term appears because we can write off the regret for early rounds where
nt (π t ) = 0 as 1.
To bound the right-hand side, we use a potential argument. The basic idea is that at every round, n^t(π) must increase for some action π, and since there are only A actions, this means that 1/\sqrt{n^t(π^t)} can only be large for a small number of rounds. This can be thought of as a quantitative instance of the pigeonhole principle.

Lemma 8 (Confidence width potential lemma): We have


    \sum_{t=1}^{T} \frac{1}{\sqrt{n^t(π^t)}} \wedge 1 ≲ \sqrt{AT}.

Proof of Lemma 8. We begin by writing

    \sum_{t=1}^{T} \frac{1}{\sqrt{n^t(π^t)}} \wedge 1 = \sum_{π}\sum_{t=1}^{T} \frac{I\{π^t = π\}}{\sqrt{n^t(π)}} \wedge 1 = \sum_{π}\sum_{t=1}^{n^{T+1}(π)} \frac{1}{\sqrt{t−1}} \wedge 1.    (2.21)

For any n ∈ N, we have \sum_{t=1}^{n} \frac{1}{\sqrt{t−1}} \wedge 1 \le 1 + 2\sqrt{n}, which allows us to bound the display above by

    A + 2\sum_{π}\sqrt{n^{T+1}(π)}.

The factor of A above is a lower-order term (recall that we have A \le \sqrt{AT} whenever A \le T, and if A > T the regret bound we are proving is vacuous). To bound the second term, using Jensen's inequality, we have

    \sum_{π}\sqrt{n^{T+1}(π)} \le A\sqrt{\sum_{π}\frac{n^{T+1}(π)}{A}} = A\sqrt{T/A} = \sqrt{AT}.

The main regret bound now follows from Lemma 8 and (2.20).

To summarize, the key steps in the proof of Proposition 5 were to:

1. Use the optimistic property and validity of the confidence bounds to bound regret by
the sum of confidence widths.

2. Use a potential argument to show that the sum of confidence widths is small.

We will revisit and generalize both ideas in subsequent chapters for more sophisticated
settings, including contextual bandits, structured bandits, and reinforcement learning.


Remark 10 (Instance-dependent regret for UCB): The \widetilde{O}(\sqrt{AT}) regret bound attained by UCB holds uniformly for all models, and is (nearly) minimax-optimal, in the sense that for any algorithm, there exists a model M^⋆ for which the regret must scale as Ω(\sqrt{AT}). Minimax optimality is a useful notion of performance, but may be overly pessimistic. As an alternative, it is possible to show that UCB attains what is known as an instance-dependent regret bound, which adapts to the underlying reward function, and can be smaller for "nice" problem instances.
Let Δ(π) := f^⋆(π^⋆) − f^⋆(π) be the suboptimality gap for decision π. Then, when f^⋆(π) ∈ [0, 1], UCB can be shown to achieve

    \mathrm{Reg} ≲ \sum_{π:Δ(π)>0} \frac{\log(AT/δ)}{Δ(π)}.

If we keep the underlying model fixed and take T → ∞, this regret bound scales only logarithmically in T, which improves upon the \sqrt{T}-scaling of the minimax regret bound.

2.4 Bayesian Bandits and the Posterior Sampling Algorithm⋆


Up to this point, we have been designing and analyzing algorithms from a frequentist viewpoint, in which we aim to minimize regret for a worst-case choice of the underlying model M^⋆. An alternative is to adopt a Bayesian viewpoint, and assume that the underlying model is drawn from a known prior µ ∈ ∆(M).⁸ In this case, rather than worst-case performance, we will be concerned with average regret under the prior, defined via

    \mathrm{Reg}_{\mathsf{Bayes}}(µ) := E_{M^⋆∼µ}\, E^{M^⋆}[\mathrm{Reg}],

where E^{M^⋆}[\mathrm{Reg}] denotes the algorithm's expected regret when M^⋆ is the underlying reward distribution.
Working in the Bayesian setting opens up additional avenues for designing algorithms,
because we can take advantage of our knowledge of the prior to compute quantities of interest
that are not available in the frequentist setting, such as the posterior distribution over π^⋆ after observing the dataset H^{t−1}. The most basic and well-known strategy here is posterior
sampling (also known as Thompson sampling or probability matching) [82, 75].

⁸It is important that µ is known, otherwise this is no different from the frequentist setting.
Posterior Sampling
for t = 1, . . . , T do
Set pt (π) = P(π ⋆ = π | Ht−1 ), where Ht−1 = (π 1 , r1 ), . . . , (π t−1 , rt−1 ).
Sample π t ∼ pt and observe rt .

The basic idea is as follows. At each time t, we can use our knowledge of the prior to
compute the distribution P(π ⋆ = · | Ht−1 ), which represents the posterior distribution over
π ⋆ given all of the data we have collected from rounds 1, . . . , t − 1. The posterior sampling
algorithm simply samples the learner’s action π t from this distribution, thereby “matching”
the posterior distribution of π ⋆ .
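As a concrete instance, here is a minimal sketch of posterior sampling for Bernoulli rewards with an independent Beta(1, 1) prior on each arm's mean (the conjugate-prior setup and the function names are assumptions made for the sketch; they are not specified above).

    import numpy as np

    def posterior_sampling_bernoulli(pull, A, T, seed=0):
        """Thompson sampling: sample a reward function from the posterior and act greedily on it."""
        rng = np.random.default_rng(seed)
        alpha, beta = np.ones(A), np.ones(A)    # Beta posterior parameters for each arm
        for t in range(T):
            theta = rng.beta(alpha, beta)       # posterior sample of the mean-reward vector
            a = int(np.argmax(theta))           # equivalent to sampling pi^t ~ P(pi* = . | H^{t-1})
            r = pull(a)                         # observe a {0, 1} reward
            alpha[a] += r
            beta[a] += 1 - r
        return alpha, beta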

Proposition 6: For any prior µ, the posterior sampling algorithm ensures that
    \mathrm{Reg}_{\mathsf{Bayes}}(µ) \le \sqrt{AT\log(A)}.    (2.22)

In what follows, we prove a simplified version of Proposition 6; the full proof is given in
Section 2.6.

Proof of Proposition 6 (simplified version). We will make the following simplified assump-
tions:
• We restrict to reward distributions where M ⋆ (· | π) = N (f ⋆ (π), 1). That is, f ⋆ is the
only part of the reward distribution that is unknown.
• f ⋆ belongs to a known class F, and rather than proving the regret bound in Proposi-
tion 6, we will prove a bound of the form
    \mathrm{Reg}_{\mathsf{Bayes}}(µ) ≲ \sqrt{AT\log|F|},
which replaces the log A factor in the proposition with log|F|.
Since the mean reward function f ⋆ is the only part of the reward distribution M ⋆ that is
unknown, we can simplify by considering an equivalent formulation where the prior has the
form µ ∈ ∆(F). That is, we have a prior over f ⋆ rather than M ⋆ .
Before proceeding, let us introduce some notation. The process through which we sample
f^⋆ ∼ µ and then run the bandit algorithm induces a joint law over (f^⋆, H^T), which we call

P. Throughout the proof, we use E[·] to denote the expectation under this law. We also
define Et [·] = E[· | Ht ] and Pt [·] = P[· | Ht ].
We begin by using the law of total expectation to express the expected regret as

    \mathrm{Reg}_{\mathsf{Bayes}}(µ) = E\left[\sum_{t=1}^{T} E_{t−1}\left[f^⋆(π_{f^⋆}) − f^⋆(π^t)\right]\right].

Above, we have written π ⋆ = πf ⋆ to make explicit the fact that this is a random variable
whose value is a function of f ⋆ .
We first simplify the expected regret for each step t. Let µt (f ) := P(f ⋆ = f | Ht−1 ) be the
posterior distribution at timestep t. The learner’s decision π t is conditionally independent
of f ⋆ given Ht−1 , so we can write
Et−1 [f ⋆ (πf ⋆ ) − f ⋆ (π t )] = Ef ⋆ ∼µt ,πt ∼pt [f ⋆ (πf ⋆ ) − f ⋆ (π t )].

If we define \bar{f}^t(π) = E_{f^⋆∼µ^t}[f^⋆(π)] as the expected reward function under the posterior, we can further write this as

    E_{f^⋆∼µ^t, π^t∼p^t}\left[f^⋆(π_{f^⋆}) − \bar{f}^t(π^t)\right].

By the design of the posterior sampling algorithm, π^t ∼ p^t is identical in distribution to π_{f^⋆} under f^⋆ ∼ µ^t, so this is equal to

    E_{f^⋆∼µ^t}\left[f^⋆(π_{f^⋆}) − \bar{f}^t(π_{f^⋆})\right].

This quantity captures—on average—how far a given realization of f ⋆ deviates from the
posterior mean f¯t , for a specific decision πf ⋆ which is coupled to f ⋆ . The expression above
might appear to be unrelated to the learner’s decision distribution, but the next lemma
shows that it is possible to relate this quantity back to the learner’s decision distribution
using a notion of information gain (or, estimation error).

Lemma 9 (Decoupling): For any function \bar{f} : Π → R, it holds that

    E_{f^⋆∼µ^t}\left[f^⋆(π_{f^⋆}) − \bar{f}(π_{f^⋆})\right] \le \sqrt{A \cdot E_{f^⋆∼µ^t} E_{π^t∼p^t}\left[(f^⋆(π^t) − \bar{f}(π^t))^2\right]}.    (2.23)

Proof of Lemma 9. We will show a more general result. Namely, for any ν ∈ ∆(F) and \bar{f} : Π → R, if we define p(π) = P_{f∼ν}(π_f = π), then

    E_{f∼ν}\left[f(π_f) − \bar{f}(π_f)\right] \le \sqrt{A \cdot E_{f∼ν} E_{π∼p}\left[(f(π) − \bar{f}(π))^2\right]}.    (2.24)

This can be thought of as a “decoupling” lemma. On the left-hand side, the random
variables f and πf are coupled, but on the right-hand side, π is drawn from the marginal
distribution over πf , independent of the draw of f itself.
To prove the result, we use Cauchy-Schwarz as follows:

    E_{f∼ν}\left[f(π_f) − \bar{f}(π_f)\right] = E_{f∼ν}\left[\frac{p^{1/2}(π_f)}{p^{1/2}(π_f)}\left(f(π_f) − \bar{f}(π_f)\right)\right]
        \le \left(E_{f∼ν}\left[\frac{1}{p(π_f)}\right]\right)^{1/2} \cdot \left(E_{f∼ν}\left[p(π_f)\left(f(π_f) − \bar{f}(π_f)\right)^2\right]\right)^{1/2}.

For the first term, we have

    E_{f∼ν}\left[\frac{1}{p(π_f)}\right] = \sum_{f}\frac{ν(f)}{p(π_f)} = \sum_{π}\sum_{f:π_f=π}\frac{ν(f)}{p(π)} = \sum_{π}\frac{p(π)}{p(π)} = A.

For the second term, we have

    E_{f∼ν}\left[p(π_f)\left(f(π_f) − \bar{f}(π_f)\right)^2\right] \le E_{f∼ν}\left[\sum_{π} p(π)\left(f(π) − \bar{f}(π)\right)^2\right] = E_{f∼ν} E_{π∼p}\left[(f(π) − \bar{f}(π))^2\right].

Putting these bounds together yields (2.24).

Using Lemma 9, we have that

    E[\mathrm{Reg}] \le E\left[\sum_{t=1}^{T}\sqrt{A \cdot E_{f^⋆∼µ^t} E_{π^t∼p^t}\left[(f^⋆(π^t) − \bar{f}^t(π^t))^2\right]}\right]
                    \le \sqrt{AT \cdot E\left[\sum_{t=1}^{T} E_{f^⋆∼µ^t} E_{π^t∼p^t}\left[(f^⋆(π^t) − \bar{f}^t(π^t))^2\right]\right]}.

To finish up, we will show that E\left[\sum_{t=1}^{T} E_{f^⋆∼µ^t} E_{π^t∼p^t}\left[(f^⋆(π^t) − \bar{f}^t(π^t))^2\right]\right] ≲ \log|F|. To do this, we need some additional information-theoretic tools.
we need some additional information-theoretic tools.

• For a random variable X with distribution P, \mathrm{Ent}(X) ≡ \mathrm{Ent}(P) := \sum_{x} p(x)\log(1/p(x)).

• For random variables X and Y, \mathrm{Ent}(X \mid Y = y) := \mathrm{Ent}(P_{X|Y=y}) and \mathrm{Ent}(X \mid Y) := E_{y∼p_Y}[\mathrm{Ent}(X \mid Y = y)].

• For distributions P and Q, D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{x} p(x)\log(p(x)/q(x)).

To keep notation as clear as possible going forward, let us use boldface script (π t , π ⋆ , f ⋆ ,
Ht ) to refer to the abstract random variables under consideration, and use non-boldface
script (π t , π ⋆ , f ⋆ , Ht ) to refer to their realizations. Our aim will be to use the conditional
entropy \mathrm{Ent}(f^⋆ \mid H^t) as a potential function, and show that for each t,

    \frac{1}{2}\, E\left[E_{f^⋆∼µ^t} E_{π^t∼p^t}\left[(f^⋆(π^t) − \bar{f}^t(π^t))^2\right]\right] \le \mathrm{Ent}(f^⋆ \mid H^{t−1}) − \mathrm{Ent}(f^⋆ \mid H^t).    (2.25)
From here the result will follow, because

    \frac{1}{2}\, E\left[\sum_{t=1}^{T} E_{f^⋆∼µ^t} E_{π^t∼p^t}\left[(f^⋆(π^t) − \bar{f}^t(π^t))^2\right]\right] \le \sum_{t=1}^{T}\left[\mathrm{Ent}(f^⋆ \mid H^{t−1}) − \mathrm{Ent}(f^⋆ \mid H^t)\right]
        = \mathrm{Ent}(f^⋆ \mid H^0) − \mathrm{Ent}(f^⋆ \mid H^T)
        \le \mathrm{Ent}(f^⋆ \mid H^0)
        \le \log|F|,

where the last inequality follows because the entropy of a random variable X over a set X
is always bounded by log|X |.
We proceed to prove (2.25). To begin, we use Lemma 39, which implies that

    \frac{1}{2}\left(f^⋆(π^t) − \bar{f}^t(π^t)\right)^2 \le D_{\mathrm{KL}}\left(P_{r^t \mid f^⋆, π^t, H^{t−1}} \,\|\, P_{r^t \mid π^t, H^{t−1}}\right),

and hence

    \frac{1}{2}\, E_{f^⋆∼µ^t} E_{π^t∼p^t}\left[(f^⋆(π^t) − \bar{f}^t(π^t))^2\right] \le E_{f^⋆∼µ^t} E_{π^t∼p^t}\left[D_{\mathrm{KL}}\left(P_{r^t \mid f^⋆, π^t, H^{t−1}} \,\|\, P_{r^t \mid π^t, H^{t−1}}\right)\right].

Since KL divergence satisfies E_{x∼P_X}\left[D_{\mathrm{KL}}\left(P_{Y|X=x} \,\|\, P_Y\right)\right] = E_{y∼P_Y}\left[D_{\mathrm{KL}}\left(P_{X|Y=y} \,\|\, P_X\right)\right], this is equal to

    E_{t−1}\left[D_{\mathrm{KL}}\left(P_{f^⋆ \mid π^t, r^t, H^{t−1}} \,\|\, P_{f^⋆ \mid H^{t−1}}\right)\right] = E_{t−1}\left[D_{\mathrm{KL}}\left(P_{f^⋆ \mid H^t} \,\|\, P_{f^⋆ \mid H^{t−1}}\right)\right].    (2.26)

Taking the expectation over H^{t−1}, we can write this as

    E\left[E_{t−1}\left[D_{\mathrm{KL}}\left(P_{f^⋆ \mid H^t} \,\|\, P_{f^⋆ \mid H^{t−1}}\right)\right]\right] = E_{H^{t−1}} E_{H^t \mid H^{t−1}}\left[D_{\mathrm{KL}}\left(P_{f^⋆ \mid H^t} \,\|\, P_{f^⋆ \mid H^{t−1}}\right)\right].

A simple exercise shows that for random variables X, Y, Z,

    E_{(x,y)∼P_{X,Y}}\left[D_{\mathrm{KL}}\left(P_{Z|X=x,Y=y} \,\|\, P_{Z|X=x}\right)\right] = \mathrm{Ent}(Z \mid X) − \mathrm{Ent}(Z \mid X, Y).

Applying this result above (and using that H^{t−1} ⊂ H^t) gives

    E_{H^{t−1}} E_{H^t \mid H^{t−1}}\left[D_{\mathrm{KL}}\left(P_{f^⋆ \mid H^t} \,\|\, P_{f^⋆ \mid H^{t−1}}\right)\right] = \mathrm{Ent}(f^⋆ \mid H^{t−1}) − \mathrm{Ent}(f^⋆ \mid H^t),

as desired.

The analysis above critically makes use of the fact that we are concerned with Bayesian
regret, and have access to the true prior. One might hope that by choosing a sufficiently
uninformative prior, this approach might continue to work in the frequentist setting. In
fact, this is indeed the case for bandits, though a different analysis is required [7, 8]. However,
one can show (Sections 4 and 6) that the Bayesian analysis we have given here extends to
significantly richer decision making settings, while the frequentist counterpart is limited to
simple variants of the multi-armed bandit.

Remark 11 (Equivalence of min-max frequentist regret and max-min Bayesian


regret): Using the minimax theorem, it is possible to show that under appropriate
technical conditions
    \min_{\mathrm{Alg}} \max_{M^⋆} E^{M^⋆, \mathrm{Alg}}[\mathrm{Reg}] = \max_{µ∈∆(M)} \min_{\mathrm{Alg}} E_{M^⋆∼µ}\, E^{M^⋆, \mathrm{Alg}}[\mathrm{Reg}].

That is, if we take the worst-case value of the Bayesian regret over all possible choices
of prior, this coincides with the minimax value of the frequentist regret.

2.5 Adversarial Bandits and the Exp3 Algorithm⋆


We conclude this section with a brief introduction to the multi-armed bandit problem with
non-stochastic/adversarial rewards, which dispenses with Assumption 1. In the context of
Figure 2, the non-stochastic nature of rewards adds a new “adversarial data” dimension to
the problem. As one might expect, the solution we will present for non-stochastic bandits
will leverage the online learning tools introduced in Section 1.6.
To simplify the presentation, suppose that the collection of rewards

{rt (π) ∈ [0, 1] : π ∈ [A], t ∈ [T ]}

for each action and time step is arbitrary and fixed ahead of the interaction by an oblivious
adversary. Since we do not posit a stochastic model for rewards, we define regret as in (2.4).
The algorithm we present will build upon the exponential weights algorithm studied in
the context of online supervised learning in Section 1.6. To make the connection as clear
as possible, we make a temporary switch from rewards to losses, mapping rt to 1 − rt , a
transformation that does not change the problem itself.

Recall that p^t denotes the randomization distribution for the learner at round t. As discussed in Remark 7, we can write expected regret as

    \mathrm{Reg} = \sum_{t=1}^{T} \langle p^t, ℓ^t\rangle − \min_{π∈[A]} \sum_{t=1}^{T} \langle e_π, ℓ^t\rangle    (2.27)

where ℓt ∈ [0, 1]A is the vector of losses for each of the actions at time t.
Since only the loss (equivalently, reward) of the chosen action π t ∼ pt is observed, we
cannot directly appeal to the exponential weights algorithm, which requires knowledge of
the full vector ℓt . To address this, we build an unbiased estimate of the vector ℓt from
a single real-valued observation ℓt (π t ). At first, this might appear impossible, but it is
straightforward to show that

    \tilde{ℓ}^t(π) = \frac{ℓ^t(π)}{p^t(π)} \times I\{π^t = π\}    (2.28)

is an unbiased estimate for all π ∈ [A], or in vector notation

    E_{π^t∼p^t}\left[\tilde{ℓ}^t\right] = ℓ^t.    (2.29)
If we apply the exponential weights algorithm with the loss vectors \tilde{ℓ}^t, it can be shown to attain regret

    E[\mathrm{Reg}] = E\left[\sum_{t=1}^{T} \langle p^t, ℓ^t\rangle\right] − \min_{π}\sum_{t=1}^{T} \langle e_π, ℓ^t\rangle    (2.30)
                    = E\left[\sum_{t=1}^{T} \langle p^t, \tilde{ℓ}^t\rangle\right] − \min_{π} E\left[\sum_{t=1}^{T} \langle e_π, \tilde{ℓ}^t\rangle\right] ≲ \sqrt{AT\log A}.    (2.31)

This algorithm is known as Exp3 (“Exponential Weights for Exploration and Exploitation”).
A full proof of this result is left as an exercise in Section 2.7.
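In code, Exp3 amounts to running exponential weights on the estimated losses (2.28). The sketch below is illustrative only (the feedback function loss_of is an assumed stand-in for the adversary's losses, and the step size η = \sqrt{\log A/(AT)} is a standard choice compatible with the stated rate).

    import numpy as np

    def exp3(loss_of, A, T, seed=0):
        """Exp3: exponential weights over importance-weighted loss estimates."""
        rng = np.random.default_rng(seed)
        eta = np.sqrt(np.log(A) / (A * T))
        cum_est = np.zeros(A)                   # cumulative estimated losses
        for t in range(T):
            logits = -eta * cum_est
            p = np.exp(logits - logits.max())
            p /= p.sum()                        # p^t from exponential weights
            a = int(rng.choice(A, p=p))
            ell = loss_of(t, a)                 # only the chosen action's loss in [0, 1] is observed
            cum_est[a] += ell / p[a]            # unbiased estimate (2.28); zero for the other arms
        return cum_est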

2.6 Deferred Proofs


Proof of Proposition 6 (full version). Let E_t[·] = E[· \mid H^t] and P_t[·] = P[· \mid H^t]. We begin by using the law of total expectation to express the expected regret as

    \mathrm{Reg}_{\mathsf{Bayes}}(µ) = E\left[\sum_{t=1}^{T} E_{t−1}\left[f^⋆(π^⋆) − f^⋆(π^t)\right]\right].

Here and throughout the proof, E[·] will denote the joint expectation over both M ⋆ ∼ µ and
over the sequence HT = (π 1 , r1 ), . . . , (π T , rT ) that the algorithm generates by interacting
with M ⋆ .
We first simplify the (conditional) expected regret for each step t. Let f¯t (π) := Et−1 [f ⋆ (π)]
denote the posterior mean reward function at time t, which should be thought of as
the expected value of f ⋆ given everything we have learned so far. Next, let f¯πt ′ (π) =
Et−1 [f ⋆ (π) | π ⋆ = π ′ ], which is the expected reward given everything we have learned so far,
assuming that π ⋆ = π ′ . We proceed to write the expression

Et−1 [f ⋆ (π ⋆ ) − f ⋆ (π t )]

in terms of these quantities. For the learner's reward, since f^⋆ is conditionally independent of π^t given H^{t−1}, we have

    E_{t−1}[f^⋆(π^t)] = E_{π∼p^t}[\bar{f}^t(π)].

For the reward of the optimal action, we begin by writing

    E_{t−1}[f^⋆(π^⋆)] = \sum_{π∈Π} P_{t−1}(π^⋆ = π)\, E_{t−1}[f^⋆(π) \mid π^⋆ = π]
                      = \sum_{π∈Π} P_{t−1}(π^⋆ = π)\, \bar{f}^t_π(π)
                      = E_{π∼p^t}\left[\bar{f}^t_π(π)\right],

where we have used that p^t was chosen to match the posterior distribution over π^⋆. This establishes that

    E_{t−1}[f^⋆(π^⋆) − f^⋆(π^t)] = E_{π∼p^t}\left[\bar{f}^t_π(π) − \bar{f}^t(π)\right].

We now make use of the following decoupling-type inequality, which follows from (2.24):

    E_{π∼p^t}\left[\bar{f}^t_π(π) − \bar{f}^t(π)\right] \le \sqrt{A \cdot E_{π,π^⋆∼p^t}\left[(\bar{f}^t_{π^⋆}(π) − \bar{f}^t(π))^2\right]}.    (2.32)
To keep notation as clear as possible going forward, let us use boldface script (π^t, π^⋆, f^⋆, H^t) to refer to the abstract random variables under consideration, and use non-boldface script (π^t, π^⋆, f^⋆, H^t) to refer to their realizations. As in the simplified proof, we will show that the right-hand side in (2.32) is related to a notion of information gain (that is, information about π^⋆ acquired at step t). Using Pinsker's inequality, we have

    E_{π^t,π^⋆∼p^t}\left[(\bar{f}^t_{π^⋆}(π^t) − \bar{f}^t(π^t))^2\right] \le E_{t−1}\left[D_{\mathrm{KL}}\left(P_{r^t \mid π^⋆, π^t, H^{t−1}} \,\|\, P_{r^t \mid π^t, H^{t−1}}\right)\right].

Since KL divergence satisfies E_X\left[D_{\mathrm{KL}}\left(P_{Y|X} \,\|\, P_Y\right)\right] = E_Y\left[D_{\mathrm{KL}}\left(P_{X|Y} \,\|\, P_X\right)\right], this is equal to

    E_{t−1}\left[D_{\mathrm{KL}}\left(P_{π^⋆ \mid π^t, r^t, H^{t−1}} \,\|\, P_{π^⋆ \mid H^{t−1}}\right)\right] = E_{t−1}\left[D_{\mathrm{KL}}\left(P_{π^⋆ \mid H^t} \,\|\, P_{π^⋆ \mid H^{t−1}}\right)\right].    (2.33)

This is quantifying how much information about π^⋆ we gain by playing π^t and observing r^t at step t, relative to what we knew at step t − 1. Applying (2.32) and (2.33), we have
    \sum_{t=1}^{T} E_{π∼p^t}\left[\bar{f}^t_π(π) − \bar{f}^t(π)\right] \le \sum_{t=1}^{T} \sqrt{A \cdot E_{π,π^⋆∼p^t}\left[(\bar{f}^t_{π^⋆}(π) − \bar{f}^t(π))^2\right]}
        \le \sum_{t=1}^{T} \sqrt{A \cdot E_{t−1}\left[D_{\mathrm{KL}}\left(P_{π^⋆ \mid H^t} \,\|\, P_{π^⋆ \mid H^{t−1}}\right)\right]}
        \le \sqrt{AT \cdot \sum_{t=1}^{T} E\left[D_{\mathrm{KL}}\left(P_{π^⋆ \mid H^t} \,\|\, P_{π^⋆ \mid H^{t−1}}\right)\right]}.

We can write

    E\left[D_{\mathrm{KL}}\left(P_{π^⋆ \mid H^t} \,\|\, P_{π^⋆ \mid H^{t−1}}\right)\right] = \mathrm{Ent}(π^⋆ \mid H^{t−1}) − \mathrm{Ent}(π^⋆ \mid H^t),

so telescoping gives

    \sum_{t=1}^{T} E\left[D_{\mathrm{KL}}\left(P_{π^⋆ \mid H^t} \,\|\, P_{π^⋆ \mid H^{t−1}}\right)\right] = \mathrm{Ent}(π^⋆ \mid H^0) − \mathrm{Ent}(π^⋆ \mid H^T) \le \log(A).

2.7 Exercises

Exercise 4 (Adversarial Bandits): In this exercise, we will prove a regret bound for adver-
sarial bandits (Section 2.5), where the sequence of rewards (losses) is non-stochastic. To make
a direct connection to the Exponential Weights Algorithm, we switch from rewards to losses,
mapping rt to 1 − rt , a transformation that does not change the problem itself. To simplify
the presentation, suppose that a collection of losses

{ℓt (π) ∈ [0, 1] : π ∈ [A], t ∈ [T ]}

for each action π and time step t is arbitrary and chosen before round t = 1; this is referred to
as an oblivious adversary. We denote by ℓt = (ℓt (1), . . . , ℓt (A)) the vector of losses at time t.
The protocol for the problem of adversarial multi-armed bandits (with losses) is as follows:
Multi-Armed Bandit Protocol
for t = 1, . . . , T do
Select decision π t ∈ Π := {1, . . . , A} by sampling π t ∼ pt
Observe loss ℓt (π t )
Let pt be the randomization distribution of the decision-maker on round t. Expected regret
can be written as

    E[\mathrm{Reg}] = E\left[\sum_{t=1}^{T} \langle p^t, ℓ^t\rangle\right] − \min_{π∈[A]}\sum_{t=1}^{T} \langle e_π, ℓ^t\rangle.    (2.34)

Since only the loss of the chosen action π t ∼ pt is observed, we cannot directly appeal to the
Exponential Weights Algorithm. The solution is to build an unbiased estimate of the vector ℓt
from the single real-valued observation ℓt (π t ).
1. Prove that the vector \tilde{ℓ}^t(· \mid π^t) defined by

    \tilde{ℓ}^t(π \mid π^t) = \frac{ℓ^t(π)}{p^t(π)} \times I\{π^t = π\}    (2.35)

is an unbiased estimate for ℓ^t(π) for all π ∈ [A]. In vector notation, this means E_{π^t∼p^t}[\tilde{ℓ}^t(· \mid π^t)] = ℓ^t.

Conclude that

    E[\mathrm{Reg}] = E\left[\sum_{t=1}^{T} E_{π^t∼p^t}\langle p^t, \tilde{ℓ}^t\rangle\right] − \min_{π∈[A]} E\left[\sum_{t=1}^{T} E_{π^t∼p^t}\langle e_π, \tilde{ℓ}^t\rangle\right].    (2.36)

Above, we use the shorthand \tilde{ℓ}^t = \tilde{ℓ}^t(· \mid π^t).
2. Show that given π′,

    E_{π∼p^t}\left[\tilde{ℓ}^t(π \mid π′)^2\right] = \frac{ℓ^t(π′)^2}{p^t(π′)}, \quad\text{so that}\quad E_{π^t∼p^t} E_{π∼p^t}\left[\tilde{ℓ}^t(π \mid π^t)^2\right] \le A.    (2.37)

3. Define

    p^t(π) ∝ \exp\left(−η\sum_{s=1}^{t−1} \langle e_π, \tilde{ℓ}^s(· \mid π^s)\rangle\right),

which corresponds to the exponential weights algorithm on the estimated losses \tilde{ℓ}^s. Apply (1.41) to the estimated losses to show that for any π ∈ [A],

    E\left[\sum_{t=1}^{T} E_{π^t∼p^t}\langle p^t, \tilde{ℓ}^t\rangle\right] − E\left[\sum_{t=1}^{T} E_{π^t∼p^t}\langle e_π, \tilde{ℓ}^t\rangle\right] ≲ \sqrt{AT\log A}.

Hence, the price of bandit feedback in the adversarial model, as compared to full-information online learning, is only \sqrt{A}.

3. CONTEXTUAL BANDITS

In the last section, we studied the multi-armed bandit problem, which is arguably the simplest
framework for interactive decision making. This simplicity comes at a cost: few real-world
problems can be modeled as a multi-armed bandit problem directly. For example, for the
problem of selecting medical treatments, the multi-armed bandit formulation presupposes
that one treatment rule (action/decision) is good for all patients, which is clearly unreason-
able. To address this, we augment the problem formulation by allowing the decision-maker
to select the action π t after observing a context xt ; this is called the contextual bandit prob-
lem. The context xt , which may also be thought of as a feature vector or collection of


Figure 6: An illustration of the contextual multi-armed bandit problem. A doctor (the


learner) aims to select a treatment based on the context (medical history, symptoms).

covariates (e.g., a patient’s medical history, or the profile of a user arriving at a website),
can be used by the learner to better maximize rewards by tailoring decisions to the specific
patient or user under consideration.

Contextual Bandit Protocol


for t = 1, . . . , T do
Observe context xt ∈ X .
Select decision π t ∈ Π = {1, . . . , A}.
Observe reward rt ∈ R.

As with multi-armed bandits, contextual bandits can be studied in a stochastic frame-


work or in an adversarial framework. In this course, we will allow the contexts x1 , . . . , xT
to be generated in an arbitrary, potentially adversarial fashion, but assume that rewards
are generated from a fixed conditional distribution.

Assumption 2 (Stochastic Rewards): Rewards are generated independently via

rt ∼ M ⋆ (· | xt , π t ), (3.1)

where M ⋆ (· | ·, ·) is the underlying model (or conditional distribution).

This generalizes the stochastic multi-armed bandit framework in Section 2. We define

f ⋆ (x, π) := E [r | x, π] (3.2)

as the mean reward function under r ∼ M ⋆ (· | x, π), and define π ⋆ (x) := arg maxπ∈Π f ⋆ (x, π)
as the optimal policy, which maps each context x to the optimal action for the context. We
measure performance via regret relative to π ⋆ :
    \mathrm{Reg} := \sum_{t=1}^{T} f^⋆(x^t, π^⋆(x^t)) − \sum_{t=1}^{T} E_{π^t∼p^t}\left[f^⋆(x^t, π^t)\right],    (3.3)

where p^t ∈ ∆(Π) is the learner's action distribution at step t (conditioned on H^{t−1} and x^t). This provides a (potentially) much stronger notion of performance than what we
considered for the multi-armed bandit: Rather than competing with the reward of the single
best action, we are competing with the reward of the best sequence of decisions tailored to
the context sequence we observe.

Remark 12 (Contextual bandits versus reinforcement learning): To readers


already familiar with reinforcement learning, the contextual bandit setting may appear
quite similar at first glance, with the term “context” replacing “state”. The key dif-
ference is that in reinforcement learning, we aim to control the evolution of x1 , . . . , xT
(which is why they are referred to as state), whereas in contextual bandits, we take
the sequence as a given, and only aim to maximize our rewards conditioned on the
sequence.

Function approximation and desiderata. If X , the set of possible contexts, is finite,


one might imagine running a separate MAB algorithm for each context. In this case, the
regret bound would scale with |X |,9 an undesirable property which reflects the fact that
this approach does not allow for generalization across contexts. Instead, we would like to
share information between different contexts. After all, a doctor prescribing treatments
might never observe exactly the same medical history and symptoms twice, but they might
see similar patients or recognize underlying patterns. In the spirit of statistical learning
(Section 1) this means assuming access to a class F that can model the mean reward
function, and aiming for regret bounds that scale with log|F| (reflecting the statistical
capacity of F), with no dependence on the cardinality of X . To facilitate this, we will
assume a well-specified/realizable model.

⁹One can show that running an independent instance of UCB for each context leads to regret \widetilde{O}(\sqrt{AT \cdot |X|}); see Exercise 5.
Assumption 3: The decision-maker has access to a class F ⊂ {f : X × Π → R} such
that f ⋆ ∈ F.

Using the class F, we would like to develop algorithms that can model the underlying reward
function for better decision making performance. With this goal in mind, it is reasonable to
try leveraging the algorithms and respective guarantees we have already seen for statistical
and online supervised learning. At this point, however, the decision-making problem—with
its exploration-exploitation dilemma—appears to be quite distinct from these supervised
learning frameworks. Indeed, naively applying supervised learning methods, which do not
account for the interactive nature of the problem, can lead to failure, as we saw with the
greedy algorithm in Section 2.1. In spite of these apparent difficulties, in the next few
lectures, we will show that it is possible to leverage supervised learning methods to develop
provable decision making methods, thereby bridging the two methodologies.

3.1 Optimism: Generic Template


What algorithmic principles should we employ to solve the contextual bandit problem?
One approach is to adapt solutions from the multi-armed bandit setting. There, we saw
that the principle of optimism (in particular, the UCB algorithm) led to (nearly) optimal
rates for bandits, so a natural question is whether optimism can be adapted to give optimal
guarantees in the presence of contexts. The answer to this last question is: it depends. We
will first describe some positive results under assumptions on F, then provide a negative
example, and finally turn to an entirely different algorithmic principle.

Optimism via confidence sets. Let us describe a general approach (or, template) for
applying the principle of optimism to contextual bandits [26, 1, 74, 37]. Suppose that at
each time, we have a way to construct a confidence set

Ft ⊆ F

based on the data observed so far, with the important property that f ⋆ ∈ F t . Given such
a confidence set we can define upper and lower confidence functions f t , f¯t : X × Π → R via

f t (x, π) = min f (x, π), f¯t (x, π) = max f (x, π). (3.4)
f ∈F t f ∈F t

These functions generalize the upper and lower confidence bounds we constructed in Sec-
tion 2. Since f ⋆ ∈ F t , they have the property that

f t (x, π) ≤ f ⋆ (x, π) ≤ f¯t (x, π) (3.5)

for all x ∈ X , π ∈ Π. As such, if we consider a contextual analogue of the UCB algorithm,


given by
π t = arg max f¯t (xt , π), (3.6)
π∈Π

then as in Lemma 7, the optimistic action satisfies

f ⋆ (xt , π ⋆ ) − f ⋆ (xt , π t ) ≤ f¯t (xt , π t ) − f t (xt , π t ).

That is, the suboptimality is bounded by the width of the confidence interval at (xt , π t ),
and the total regret is bounded as
    \mathrm{Reg} \le \sum_{t=1}^{T} \bar{f}^t(x^t, π^t) − \underline{f}^t(x^t, π^t).    (3.7)

To make this approach concrete and derive sublinear bounds on the regret, we need a way to
construct the confidence set F t , ideally so that the width in (3.7) shrinks as fast as possible.

Constructing confidence sets with least squares. We construct confidence sets by


appealing to a supervised learning method, empirical risk minimization with the square loss
(or, least squares). Assume that f (x, a) ∈ [0, 1] for all f ∈ F, and that rt ∈ [0, 1] almost
surely. Let

    \hat{f}^t = \arg\min_{f∈F} \sum_{i=1}^{t−1} (f(x^i, π^i) − r^i)^2    (3.8)

be the empirical risk minimizer at round t, and with β := 8\log(|F|/δ) define F^1 = F and

    F^t = \left\{f ∈ F : \sum_{i=1}^{t−1} (f(x^i, π^i) − r^i)^2 \le \sum_{i=1}^{t−1} (\hat{f}^t(x^i, π^i) − r^i)^2 + β\right\}    (3.9)

for t > 1. That is, our confidence set F t is the collection of all functions that have empirical
squared error close to that of fbt . The idea behind this construction is to set β “just large
enough”, to ensure that we do not accidentally exclude f ⋆ , with the precise value for β
informed by the concentration inequalities we explored in Section 1. The only catch here is
that we need to use variants of these inequalities that handle dependent data, since xt and
π t are not i.i.d. in (3.8). The following result shows that F t is indeed valid and, moreover,
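When F is finite, the construction (3.8)–(3.9) and the induced bounds (3.4) can be computed by direct enumeration. The sketch below is purely illustrative: it represents each f ∈ F as an array indexed by (context, action), which is an assumption made for the example rather than anything required by the text.

    import numpy as np

    def confidence_set(F, data, beta):
        """F: array of shape (|F|, |X|, A); data: list of (x, a, r) triples observed so far.
        Returns the indices of the functions belonging to F^t, as in (3.9)."""
        if not data:
            return np.arange(len(F))
        xs, acts, rs = (np.array(v) for v in zip(*data))
        sq_err = ((F[:, xs, acts] - rs) ** 2).sum(axis=1)   # empirical square loss of each f
        return np.flatnonzero(sq_err <= sq_err.min() + beta)

    def confidence_bounds(F, active, x):
        """Lower and upper confidence functions (3.4) at context x, one value per action."""
        vals = F[active, x, :]                  # values of the surviving functions at x
        return vals.min(axis=0), vals.max(axis=0)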
that all functions f ∈ F t have low estimation error on the history.

Lemma 10: Let π 1 , . . . , π T be chosen by an arbitrary (and possibly randomized)


decision-making algorithm. With probability at least 1 − δ, f ⋆ ∈ F t for all t ∈ [T ].
Moreover, with probability at least 1 − δ, for all τ ≤ T , all f ∈ F τ satisfy
    \sum_{t=1}^{τ−1} E_{π^t∼p^t}\left[(f(x^t, π^t) − f^⋆(x^t, π^t))^2\right] \le 4β,    (3.10)

where β = 8 log(|F|/δ).

Lemma 10 is valid for any algorithm, but it is particularly useful for UCB as it establishes
the validity of the confidence bounds as per (3.5); however, it is not yet enough to show
that the algorithm attains low regret. Indeed, to bound the regret, we need to control the
confidence widths in (3.7), but there is a mismatch: for step τ , the regret bound in (3.7)
considers the width at (xτ , π τ ), but (3.10) only ensures closeness of functions in F τ under
(x1 , π 1 ), . . . , (xτ −1 , π τ −1 ). We will show in the sequel that for linear models, it is possible to
control this mismatch, but that this is not possible in general.

Proof of Lemma 10. For f ∈ F, define

    U^t(f) = (f(x^t, π^t) − r^t)^2 − (f^⋆(x^t, π^t) − r^t)^2.    (3.11)

It is straightforward to check that¹⁰

    E_{t−1}[U^t(f)] = E_{t−1}\left[(f(x^t, π^t) − f^⋆(x^t, π^t))^2\right],    (3.12)

where E_{t−1}[·] := E[· \mid H^{t−1}, x^t]. Then Z^t(f) = E_{t−1}[U^t(f)] − U^t(f) is a martingale difference sequence and \sum_{t=1}^{τ} Z^t(f) is a martingale. Since the increments Z^t(f) are bounded as |Z^t(f)| \le 1 (this holds whenever f ∈ [0, 1], r^t ∈ [0, 1]), according to Lemma 35 with η = 1/8, with probability at least 1 − δ, for all τ \le T,

    \sum_{t=1}^{τ} Z^t(f) \le \frac{1}{8}\sum_{t=1}^{τ} E_{t−1}\left[Z^t(f)^2\right] + 8\log(δ^{−1}).    (3.13)

To control the right-hand side, we again use that f, r^t ∈ [0, 1] to bound

    E_{t−1}\left[Z^t(f)^2\right] \le E_{t−1}\left[\left((f(x^t, π^t) − r^t)^2 − (f^⋆(x^t, π^t) − r^t)^2\right)^2\right]    (3.14)
        \le 4\, E_{t−1}\left[(f(x^t, π^t) − f^⋆(x^t, π^t))^2\right] = 4\, E_{t−1}[U^t(f)].    (3.15)

Then, after rearranging, (3.13) becomes

    \frac{1}{2}\sum_{t=1}^{τ} E_{t−1}[U^t(f)] \le \sum_{t=1}^{τ} U^t(f) + 8\log(δ^{−1}).    (3.16)

Since the left-hand side is nonnegative, we conclude that with probability at least 1 − δ,

    \sum_{t=1}^{τ} (f^⋆(x^t, π^t) − r^t)^2 \le \sum_{t=1}^{τ} (f(x^t, π^t) − r^t)^2 + 8\log(δ^{−1}).    (3.17)

Taking a union bound over f ∈ F gives that with probability at least 1 − δ,

    ∀f ∈ F, ∀τ ∈ [T], \quad \sum_{t=1}^{τ} (f^⋆(x^t, π^t) − r^t)^2 \le \sum_{t=1}^{τ} (f(x^t, π^t) − r^t)^2 + 8\log(|F|/δ),    (3.18)

and in particular

    ∀τ ∈ [T + 1], \quad \sum_{t=1}^{τ−1} (f^⋆(x^t, π^t) − r^t)^2 \le \sum_{t=1}^{τ−1} (\hat{f}^τ(x^t, π^t) − r^t)^2 + 8\log(|F|/δ);    (3.19)

that is, we have f ⋆ ∈ F τ for all τ ∈ {1, . . . , T + 1}, proving the first claim. For the second
part of the claim, observe that any f ∈ F τ must satisfy
    \sum_{t=1}^{τ−1} U^t(f) \le β

since the empirical risk of f^⋆ is never better than the empirical risk of the minimizer \hat{f}^τ. Thus from (3.16), with probability at least 1 − δ, for all τ \le T,

    \sum_{t=1}^{τ−1} E_{t−1}[U^t(f)] \le 2β + 16\log(δ^{−1}).    (3.20)

The second claim follows by taking a union bound over f ∈ F^τ ⊆ F, and by (3.12).

¹⁰We leave E_{t−1} on the right-hand side to include the case of randomized decisions π^t ∼ p^t.
3.2 Optimism for Linear Models: The LinUCB Algorithm
We now instantiate the general template for optimistic algorithms developed in the previous
section for the special case where F is a class of linear functions.

Linear models. We fix a feature map ϕ : X × Π → Bd2 (1), where Bd2 (1) is the unit-norm
Euclidean ball in Rd . The feature map is assumed to be known to the learning agent. For
example, in the case of medical treatments, ϕ transforms the medical history and symptoms
x for the patient, along with a possible treatment π, to a representation ϕ(x, π) ∈ Bd2 (1).
We take F to be the set of linear functions given by

F = {(x, π) 7→ ⟨θ, ϕ(x, π)⟩ | θ ∈ Θ}, (3.21)

where Θ ⊆ Bd2 (1) is the parameter set. As before, we assume f ⋆ ∈ F; we let θ∗ denote
the corresponding parameter vector, so that f ⋆ (x, π) = ⟨θ⋆ , ϕ(x, π)⟩. With some abuse of
notation, we associate the set of parameters Θ to the corresponding functions in F.
To apply the technical results in the previous section, we assume for simplicity that
|Θ| = |F| is finite. To extend our results to potentially non-finite sets, one can work with
an ε-discretization, or ε-net, which is of size at most O(ε−d ) using standard arguments.
Taking ε ∼ 1/T ensures only a constant loss in cumulative regret relative to the continuous
set of parameters, while log |F| ≲ d log T .

The LinUCB algorithm. The following figure displays an algorithm we refer to as


LinUCB [12, 26, 1], which adapts the generic template for optimistic algorithms to the case
where F is linear in the sense of (3.21).

LinUCB
Input: β > 0
for t = 1, . . . , T do
    Compute the least squares solution \hat{θ}^t (over θ ∈ Θ) given by

        \hat{θ}^t = \arg\min_{θ∈Θ} \sum_{i<t} (\langle θ, ϕ(x^i, π^i)\rangle − r^i)^2.

    Define

        \tilde{Σ}^t = \sum_{i=1}^{t−1} ϕ(x^i, π^i)ϕ(x^i, π^i)^{\mathsf{T}} + I.

    Given x^t, select action

        π^t ∈ \arg\max_{π∈Π}\; \max_{θ:\|θ−\hat{θ}^t\|^2_{\tilde{Σ}^t} \le 16β+4} \langle θ, ϕ(x^t, π)\rangle.

    Observe reward r^t

The following result shows that LinUCB enjoys a regret bound that scales with the
complexity log|F| of the model class and the feature dimension d.

43
Proposition 7: Let Θ ⊆ Bd2 (1) and fix ϕ : X × Π → Bd2 (1). For a finite set F of linear
functions (3.21), taking β = 8 log(|F|/δ), LinUCB satisfies, with probability at least
1 − δ, p p
Reg ≲ βdT log(1 + T /d) ≲ dT log(|F|/δ) log(1 + T /d)
for any sequence of contexts x1 , . . . , xT . More generally, for infinite F, we may take
β = O(d log(T ))a and √
Reg ≲ d T log(T ).
a
This follows from a simple covering number argument.

Notably, this regret bound has no explicit dependence on the context space size |X |. Inter-
estingly, the bound is also independent of the number of actions |Π|, which is replaced by
the dimension d; this reflects that the linear structure of F allows the learner to generalize
not just across contexts, but across decisions. We will expand upon the idea of generalizing
across actions in Section 4.

Proof of Proposition 7. The confidence set (3.9) in the generic optimistic algorithm tem-
plate is
t−1 t−1
( )
X X
Ft = θ ∈ Θ : (⟨θ, ϕ(xi , π i )⟩ − ri )2 ≤ (⟨θbt , ϕ(xi , π i )⟩ − ri )2 + β , (3.22)
i=1 i=1

where θbt is the least squares solution computed in LinUCB. According to Lemma 10, with
probability at least 1 − δ, for all t ∈ [T ], all θ ∈ F t satisfy
t−1
X
(⟨θ − θ∗ , ϕ(xi , π i )⟩)2 ≤ 4β, (3.23)
i=1

which means that F t is a subset of11


n o t−1
X
Θ = θ ∈ Θ : ∥θ − θ∗ ∥2Σt ≤ 4β ,

where t
Σ = ϕ(xi , π i )ϕ(xi , π i ) T . (3.24)
i=1

2
Since θbt ∈ F t , we have that for any θ ∈ Θ′ , by triangle inequality, θ − θbt Σt ≤ 16β.
Furthermore, since θbt ∈ Θ ⊆ Bd2 (1), θ − θbt 2 ≤ 2. Combining the two constraints into one,
we find that Θ′ is a subset of
n o t−1
2
X
′′
Θ = θ ∈ Rd : θ − θbt et
Σ
≤ 16β + 4 , where et =
Σ ϕ(xi , π i )ϕ(xi , π i ) T + I.
i=1
(3.25)

The definition of f¯t in (3.4) and the inclusion Θ′ ⊆ Θ′′ implies that

f¯t (x, π) ≤
p
max√ ⟨θ, ϕ(x, π)⟩ = θbt , ϕ(x, π) + 16β + 4 ∥ϕ(x, π)∥(Σe t )−1 ,
θ:∥θ−θbt ∥Σ
e t ≤ 16β+4

(3.26)
11
p
For a PSD matrix Σ ⪰ 0, we define ∥x∥Σ = ⟨x, Σx⟩.

44

and similarly f t (x, π) ≥ θbt , ϕ(x, π) − 16β + 4∥ϕ(x, π)∥(Σe t )−1 . We conclude that regret
of the UCB algorithm, in view of Lemma 7, is
v
X T u
u T
X
∥ϕ(xt , π t )∥2(Σe t )−1 .
p t t
Reg ≤ 2 16β + 4 ∥ϕ(x , π )∥(Σe t )−1 ≲ tβT (3.27)
t=1 t=1

The above upper bound has the same flavor as the one in Lemma 8: as we obtain more and
more information in some direction v, the matrix Σe t has a larger and larger component in
that direction, and for that direction v, the term ∥v∥2(Σe t )−1 becomes smaller and smaller.
To conclude, we apply a potential argument, Lemma 11 below, to bound
T
X
∥ϕ(xt , π t )∥2(Σe t )−1 ≲ d log(1 + T /d).
t=1

The following result is referred to as the elliptic potential lemma, and it can be thought
of as a generalization of Lemma 8.

Lemma 11 (Elliptic potential


P lemma): Let a1 , . . . , aT ∈ Rd satisfy ∥at ∥ ≤ 1 for all
t ∈ [T ], and let Vt = I + s≤t as asT . Then

T
X
∥at ∥2V −1 ≤ 2d log(1 + T /d). (3.28)
t−1
t=1

Proof Lemma 11 (sketch). First, the determinant of Vt evolves as


 
det(Vt ) = det(Vt−1 ) 1 + ∥at ∥2V −1 .
t−1

Second, using the


PT  identity u ∧  1 ≤ 2 ln(1 + u) for u ≥ 0, the left-hand side of (3.28) is at
2
most 2 t=1 log 1 + ∥at ∥V −1 . The proof concludes by upper bounding the determinant
t−1
of Vn via the AM-GM inequality. We leave the details as an exercise; see also Lattimore
and Szepesvári [60].

3.3 Moving Beyond Linear Classes: Challenges


We now present an example of a class F for which optimistic methods necessarily incur
regret that scales linearly with either the cardinality of F or with cardinality of X , meaning
that we do not achieve the desired log|F| scaling of regret that one might expect in (offline
or online) supervised learning.
Example 3.1 (Failure of optimism for contextual bandits [37]). Let A = 2, and let N ∈ N
be given. Let πg and πb be two actions available in each context, so that Π = {πg , πb }. Let
X = {x1 , . . . , xN } be a set of distinct contexts, and define a class F = {f ⋆ , f1 , . . . , fN } of
cardinality N + 1 as follows. Fix 0 < ε < 1. Let f ⋆ (x, πg ) = 1 − ε and f ⋆ (x, πb ) = 0 for any
x ∈ X . For each i ∈ [N ], fi (xj , πg ) = 1 − ε and fi (xj , πb ) = 0 for j ̸= i, while fi (xi , πg ) = 0
and fi (xi , πb ) = 1.

45
Now, consider a (well-specified) problem instances in which rewards are deterministic
and given by
rt = f ⋆ (xt , π t ),
which we note is a constant function with respect to the context. Since f ⋆ is the true model,
πg is always the best action, bringing a reward of 1 − ε per round. Any time πb is chosen,
the decision-maker incurs instantaneous regret 1 − ε. We will now argue that if we apply
the generic optimistic algorithm from Section 3.1, it will choose πb every time a new context
is encountered, leading to Ω(N ) regret.
Let S t be the set of distinct contexts encountered before round t. Clearly, the exact
minimizers of empirical square loss (see (3.8)) are f ⋆ , and all fi where i is such that xi ∈ / St.
Hence, for any choice of β ≥ 0, the confidence set in (3.9) contains all fi for which xi ∈ / St.
This implies that for each t ∈ [T ] where x = xi ∈t
/ S , action πb has a higher upper confidence
t

bound than πg , since


f¯t (xt , πb ) = fi (xi , πb ) = 1 > f¯t (xt , πg ) = f ⋆ (xt , πg ) = 1 − ε.
Hence, the cumulative regret grows by 1 − ε every time a new context is presented, and thus
scales as Ω(N (1−ε)) if the contexts are presented in order. That is, since N = |X | = |F|−1,
the confidence-based algorithm fails to achieve logarithmic dependence on F (note that we
may take ε = 1/2 for concreteness).
Let us remark that this failure continues even if contexts are stochastic. If the contexts
are chosen via the uniform distribution on X , then for T ≥ N , at least a constant proportion
of the domain will be presented, which still leads to a lower bound of
E[Reg] = Ω(N ) = Ω(min{|F|, |X |}).

What is behind the failure of optimism in this example? The structure of F forces
optimistic methods to over-explore, as the algorithm puts too much hope into trying the
arm πb for each new context. As a result, the confidence widths in (3.7) do not shrink
quickly enough. Below, we will see that there are alternative methods which do enjoy
logarithmic
p dependence on the size of F, with the best of these methods achieving regret
O( AT log|F|).
We mention in passing that even though optimism does not succeed in general, it is
useful to understand in what cases it works. We saw that the structure of linear classes
in Rd only allowed for d “different” directions, while in the example above, the optimistic
algorithm gets tricked by each new context, and is not able to shrink the confidence band
quickly enough over the domain. In a few lectures (Section 4), we will introduce the eluder
dimension, a structural property of the class F which is sufficient for optimistic methods to
experience low regret, generalizing the positive result for the linear setting.

3.4 The ε-Greedy Algorithm for Contextual Bandits


Given that the principle of optimism only leads to low regret for classes F with special
structure, we are left wondering whether there are more general algorithmic principles for
decision making that can succeed for any class F. In this section and the following one,
we will present two such principles. Both approaches will still make use of supervised
learning with the class F, but will build upon online supervised learning as opposed to
offline/statistical learning. To make the use of supervised learning as modular as possible,
we will abstract this away using the notion of an online regression oracle [36].

46
Definition 3 (Online Regression Oracle): At each time t ∈ [T ], an online regression
oracle returns, given
(x1 , π 1 , r1 ), . . . , (xt−1 , π t−1 , rt−1 )
with E[ri |xi , π i ] = f ⋆ (xi , π i ) and π i ∼ pi , a function fbt : X × Π → R such that
T
X
Eπt ∼pt (fbt (xt , π t ) − f ⋆ (xt , π t ))2 ≤ EstSq (F, T, δ)
t=1

with probability at least 1 − δ. For the results that follow, pi = pi (·|xi , Hi−1 ) will
represent the randomization distribution of a decision-maker.

For example, for finite classes, the (averaged) exponential weights method introduced in
Section 1.6 is an online regression oracle with EstSq (F, T, δ) = log(|F|/δ). More generally,
in view of Lemma 6, any online learning algorithm that attains low square loss regret for
the problem of predicting of rt based on (xt , π t ) leads to a valid online regression oracle.
Note that we make use of online learning oracles for the results that follow because
we aim to derive regret bounds that hold for arbitrary, potentially adversarial sequences
x1 , . . . , xT . If we instead assume that contexts are i.i.d., it is reasonable to make use of
algorithms for offline estimation, or statistical learning with F. See Section 3.5.1 for further
discussion.
The first general-purpose contextual bandit algorithm we will study, illustrated below,
is a contextual counterpart to the ε-Greedy method introduced in Section 2.

ε-Greedy for Contextual Bandits


Input: ε ∈ (0, 1).
for t = 1, . . . , T do
Obtain fbt from online regression oracle for (x1 , π 1 , r1 ), . . . , (xt−1 , π t−1 , rt−1 ).
Observe xt .
With prob. ε, select π t ∼ unif([A]), and with prob. 1 − ε, choose the greedy
action
bt = arg max fbt (xt , π).
π
π∈[A]

Observe reward r . t

At each step t, the algorithm uses an online regression oracle to compute a reward estimator
fbt (x, a) based on the data Ht−1 collected so far. Given this estimator, the algorithm uses the
same sampling strategy as in the non-contextual case: with probability 1 − ε, the algorithm
chooses the greedy decision

bt = arg max fbt (xt , π),


π (3.29)
π

and with probability ε it samples a uniform random action π t ∼ unif({1, . . . , A}). The
following theorem shows that whenever the online estimation oracle has low estimation
error EstSq (F, T, δ), this method achieves low regret.

47
Proposition 8: Assume f ⋆ ∈ F and f ⋆ (x, a) ∈ [0, 1]. Suppose the decision-maker
has access to an online regression oracle (Definition 3) with a guarantee EstSq (F, T, δ).
Then by choosing ε appropriately, the ε-Greedy algorithm ensures that with probability
at least 1 − δ,

Reg ≲ A1/3 T 2/3 · EstSq (F, T, δ)1/3

for any sequence x1 , . . . , xT . As a special case, when F is finite, if we use the (averaged)
exponential weights algorithm as an online regression oracle, the ε-Greedy algorithm
has

Reg ≲ A1/3 T 2/3 · log1/3 (|F|/δ).

Notably, this result scales with log|F| for any finite class, analogous to regret bounds for
offline/online supervised learning. The T 2/3 -dependence in the regret bound is suboptimal
(as seen for the special case of non-contextual bandits), which we will address using more
deliberate exploration methods in the sequel.

Proof of Proposition 8. Recall that pt denotes the randomization strategy on round t, com-
puted after observing xt . Following the same steps as the proof of Proposition 4, we can
bound regret by
T
X T
X
Reg = Eπt ∼pt [f ⋆ (xt , π ⋆ (xt )) − f ⋆ (xt , π t )] ≤ f ⋆ (xt , π ⋆ (xt )) − f ⋆ (xt , π
bt ) + εT,
t=1 t=1

where the εT term represents the bias incurred by exploring uniformly.


Fix t and abbreviate π ⋆ (xt ) = π ⋆ . We have

f ⋆ (xt , π ⋆ ) − f ⋆ (xt , π
bt )
= [f ⋆ (xt , π ⋆ ) − fbt (xt , π ⋆ )] + [fbt (xt , π ⋆ ) − fbt (xt , π bt ) − f ⋆ (xt , π
bt )] + [fbt (xt , π bt )]
X
≤ |f ⋆ (xt , π) − fbt (xt , π)|
π t ,π ⋆ }
π∈{b
X 1 p t
= p p (π)|f ⋆ (xt , π) − fbt (xt , π)|.
π t ,π ⋆ }
π∈{b
p (π)
t

By the Cauchy-Schwarz inequality, the last expression is at most


 1/2  1/2
 X 1   X  2 
pt (π) f ⋆ (xt , π) − fbt (xt , π) (3.30)
 pt (π)   
π t ,π ⋆ }
π∈{b π t ,π ⋆ }
π∈{b
r
2A
  2 1/2
⋆ t t t t t
≤ Eπt ∼pt f (x , π ) − f (x , π )
b . (3.31)
ε

48
Summing across t, this gives
T T r 2 1/2
X
⋆ t ⋆ t ⋆ t 2A X
t

⋆ t t t t t
f (x , π (x )) − f (x , π
b)≤ Eπt ∼pt f (x , π ) − f (x , π )
b (3.32)
ε
t=1 t=1

2 1/2
r ( T )
2AT X 
≤ Eπt ∼pt f ⋆ (xt , π t ) − fbt (xt , π t ) . (3.33)
ε
t=1

Now observe that the online regression oracle guarantees that with probability 1 − δ,
T
X  2
Eπt ∼pt f ⋆ (xt , π t ) − fbt (xt , π t ) ≤ EstSq (F, T, δ).
t=1

Whenever this occurs, we have


r
AT EstSq (F, T, δ)
Reg ≲ + εT.
ε
Choosing ε to balance the two terms leads to the claimed result.

3.5 Inverse Gap Weighting: An Optimal Algorithm for General Model Classes
To conclude this section, we present a general, oracle-based algorithm for contextual bandits
which achieves
p
Reg ≲ AT log|F|

for any finite class F. As with ε-Greedy, this approach has no dependence on the cardinality
|X | of the context space, reflecting the ability to generalize across contexts. The dependence
on T improves upon ε-Greedy, and is optimal.
To motivate the approach, recall that conceptually, the key step of the proof of Propo-
sition 8 involved relating the instantaneous regret

Eπt ∼pt [f ⋆ (xt , π ⋆ (xt )) − f ⋆ (xt , π t )] (3.34)

of the decision maker at time t to the instantaneous estimation error


h 2 i
Eπt ∼pt f ⋆ (xt , π t ) − fbt (xt , π t ) (3.35)

between fbt and f ⋆ under the randomization distribution pt . The ε-Greedy exploration
distribution gives a way to relate these quantities, but the algorithm’s regret is suboptimal
because the randomization distribution puts mass at least ε/A on every action, even those
that are clearly suboptimal and should be discarded. One can ask whether there exists a
better randomization strategy that still admits an upper bound on (3.34) in terms of (3.35).
Proposition 9 below establishes exactly that. At first glance, this distribution might appear
to be somewhat arbitrary or “magical”, but we will show in subsequent chapters that it
arises as a special case of more general—and in some sense, universal—principle for designing
decision making algorithms, which extends well beyond contextual bandits.

49
Definition 4 (Inverse Gap Weighting [2, 36]): Given a vector fb = (fb(1), . . . , fb(A)) ∈
RA , the Inverse Gap Weighting distribution p = IGWγ (fb(1), . . . , fb(A)) with parameter
γ ≥ 0 is defined as
1
p(π) = , (3.36)
π ) − fb(π))
λ + 2γ(fb(b
b = arg maxπ fb(π) is the greedy action, and where λ ∈ [1, A] is chosen such that
where π
P
π p(π) = 1.

Above,P the normalizing constant λ ∈ P[1, A] is always guaranteed to exist, because we have
1 A
λ ≤ π p(π) ≤ λ , and because λ 7→ π p(π) is continuous over [1, A].
Let us give some intuition behind the distribution in (3.36). We can interpret the
parameter γ as trading off exploration and exploitation. Indeed, γ → 0 gives a uniform
distribution, while γ → ∞ amplifies the gap between the greedy action π b and any action
with fb(π) < fb(b
π ), resulting in a distribution supported only on actions that achieve the
largest estimated value fb(bπ ).
The following fundamental technical result shows that playing the Inverse Gap Weight-
ing distribution always suffices to link the instantaneous regret in (3.34) in to the instanta-
neous estimation error in (3.35).

Proposition 9: Consider a finite decision space Π = {1, . . . , A}. For any vector fb ∈ RA
and γ > 0, define p = IGWγ (fb(1), . . . , fb(A)). This strategy guarantees that for all
f ⋆ ∈ RA ,
A
Eπ∼p [f ⋆ (π ⋆ ) − f ⋆ (π)] ≤ + γ · Eπ∼p (fb(π) − f ⋆ (π))2 .
 
(3.37)
γ

Proof of Proposition 9. We break the “regret” term on the left-hand side of (3.37) into three
terms:
Eπ∼p f ⋆ (π ⋆ ) − f ⋆ (π) = Eπ∼p fb(b
π ) − fb(π) + Eπ∼p fb(π) − f ⋆ (π) + f ⋆ (π ⋆ ) − fb(b
     
π) .
| {z } | {z } | {z }
(I) exploration bias (II) est. error on policy (III) est. error at opt

The first term asks “how much would we lose by exploring, if fb were the true reward
function?”, and is equal to
X π ) − fb(π)
fb(b A−1
≤ ,
π λ + 2γ fb(bπ ) − fb(π) 2γ
while the second term is at most
1 γ
q
+ Eπ∼p (fb(π) − f ⋆ (π))2 .
 
Eπ∼p (fb(π) − f ⋆ (π))2 ≤
2γ 2
The third term can be further written as
γ 1
f ⋆ (π ⋆ ) − fb(π ⋆ ) − (fb(b
π ) − fb(π ⋆ )) ≤ p(π ⋆ )(f ⋆ (π ⋆ ) − fb(π ⋆ ))2 + − (fb(bπ ) − fb(π ⋆ ))
2 2γp(π ⋆ )
 
γ ⋆ 2 1 ⋆
≤ Eπ∼p (f (π) − f (π)) +
b − (f (b
b π ) − f (π )) .
b
2 2γp(π ⋆ )

50
The term in brackets above is equal to
π ) − fb(π ⋆ ))
λ + 2γ(fb(b λ A
π ) − fb(π ⋆ )) =
− (fb(b ≤ .
2γ 2γ 2γ

The simple result we just proved is remarkable. The special IGW strategy guarantees a
relation between regret and estimation error for any estimator fb and any f ⋆ , irrespective of
the problem structure or the class F. Proposition 9 will be at the core of the development for
the rest of the course, and will be greatly generalized to general decision making problems
and reinforcement learning.
Below, we present a contextual bandit algorithm called SquareCB [36] which makes use
of the Inverse Gap Weighting distribution.
SquareCB
Input: Exploration parameter γ > 0.
for t = 1, . . . , T do
Obtain fbt from online regression oracle with (x1 , π 1 , r1 ), . . . , (xt−1 , π t−1 , rt−1 ).
Observe xt .  
Compute pt = IGWγ fbt (xt , 1), . . . , fbt (xt , A) .
Select action π t ∼ pt .
Observe reward rt .

At each step t, the algorithm uses an online regression oracle to compute a reward estimator
fbt (x, a) based on the data Ht−1 collected so far. Given this estimator, the algorithm uses
Inverse Gap Weighting to compute pt = IGWγ (fbt (xt , ·)) as an exploratory distribution, then
samples π t ∼ pt .
The following result, which is a near-immediate consequence of Proposition 9, gives a
regret bound for this algorithm.

Proposition 10: Given a class F with f ⋆ ∈ F, assume the decision-maker has access
to an online regression
p oracle (Definition 3) with estimation error EstSq (F, T, δ). Then
SquareCB with γ = T A/EstSq (F, T, δ) attains a regret bound of
q
Reg ≲ AT EstSq (F, T, δ)

with probability at least 1 − δ for any sequence x1 , . . . , xT . As a special case, when F is


finite, the averaged exponential weights algorithm achieves EstSq (F, T, δ) ≲ log(|F|/δ),
leading to p
Reg ≲ AT log(|F|/δ).

Proof of Proposition 10. We begin with regret, then add and subtract the squared estima-
tion error as follows:
XT
Reg = Eπt ∼pt [f ⋆ (xt , π ⋆ ) − f ⋆ (xt , π t )]
t=1
T
X h i
= Eπt ∼pt f ⋆ (xt , π ⋆ ) − f ⋆ (xt , π t ) − γ · (f ⋆ (xt , π t ) − fbt (xt , π t ))2 + γ · EstSq (F, T, δ).
t=1

51
By appealing to Proposition 9 with fb(xt , ·) and f ⋆ (xt , ·), for each step t, we have
h i A
Eπt ∼pt f ⋆ (xt , π ⋆ ) − f ⋆ (xt , π t ) − γ · (f ⋆ (xt , π t ) − fbt (xt , π t ))2 ≤ ,
γ
and thus
TA
+ γ · EstSq (F, T, δ).
Reg ≤
γ
Choosing γ to balance these terms yields the result.

If the online regression oracle is minimax optimal (that is, EstSq (F, T, δ) is the “best
possible” for F) then SquareCB is also minimax optimal for F. Thus, IGW not only provides
a connection between online supervised learning and decision making, but it does so in an
optimal fashion. Establishing minimax optimality is beyond the scope of this course: it
requires understanding of minimax optimality of online regression with arbitrary F, as well
as lower bound on regret of contextual bandits with arbitrary sequences of contexts. We
refer to Foster and Rakhlin [36] for details.

3.5.1 Extending to Offline Regression


When x1 , . . . , xT are i.i.d., it is natural to ask whether an online regression method that
works for arbitrary sequences is necessary, or whether one can work with a weaker oracle
tuned to i.i.d. data. For SquareCB, it turns out that any oracle for offline regression
(defined below) is sufficient.

Definition 5 (Offline Regression Oracle): Given

(x1 , π 1 , r1 ), . . . , (xt−1 , π t−1 , rt−1 )

where x1 , . . . , xt−1 are i.i.d., π i ∼ p(xi ) for fixed p : X → ∆(Π) and E[ri |xi , π i ] =
f ⋆ (xi , π i ), an offline regression oracle returns a function fb : X × Π → R such that

Ex,π∼p(x) (fb(x, π) − f ⋆ (x, π))2 ≤ t−1 Estoff


Sq (F, t, δ)

with probability at least 1 − δ.

Note that the normalization t−1 above is introduced to keep the scaling consistent with our
conventions for offline estimation.
Below, we state a variant of SquareCB which is adapted to offline oracles [79]. Compared
to the SquareCB for online oracles, the main change is that we update the estimation
oracle and exploratory distribution on an epoched schedule as opposed to updating at every
round. In addition, the parameter γ for the Inverse Gap Weighting distribution changes as
a function of the epoch.

SquareCB with offline oracles


Input: Exploration parameters γ1 , γ2 , . . . and epoch sizes τ1 , τ2 , . . .
for m = 1, 2, . . . do
Obtain fbm from offline regression oracle with
(xτm−2 +1 , π τm−2 +1 , rτm−2 +1 ), . . . , (xτm−1 , π τm−1 , rτm−1 ).

52
for t = τm−1 + 1, . . . , τm do
Observe xt .  
Compute pt = IGWγm fbm (xt , 1), . . . , fbm (xt , A) .
Select action π t ∼ pt .
Observe reward rt .
While this algorithm is quite intuitive, proving a regret bound for it is quite non-trivial—
much more so than the online oracle variant. They key challenge is that, while the con-
texts x1 , . . . , xT are i.i.d., the decisions π 1 , . . . , π T evolve in a time-dependent fashion, which
makes it unclear to invoke the guarantee in Definition 5. Nonetheless, the following remark-
able result shows that this algorithm attains a regret bound similar to that of Proposition
10.

q
Proposition 11 (Simchi-Levi and Xu [79]): Let τm = 2m and γm = AT /Estoff Sq (F, τm−1 , δ)
for m = 1, 2, . . .. Then with probability at least 1 − δ, regret of SquareCB with an offline
oracle is at most
⌈log T ⌉ q
X
Reg ≲ A · τm · Estoff 2
Sq (F, τm , δ/m ).
m=1

Under mild assumptions, above bound scales as


q
Reg ≲ A · T · Estoff
Sq (F, τm , δ/ log T ).

For a finite class F, we recall from Section 1 that empirical risk with the square loss (least
squares) achieves Estoff
Sq (F, T, δ) ≲ log(|F|/δ), which gives
p
Reg ≲ AT log(|F|/δ).

3.6 Exercises

Exercise 5 (Unstructured Contextual Bandits): Consider a contextual bandit problem


with a finite set X of possible contexts, and a finite set of actions A.
pShow that running UCB
independently for each context yields a regret bound of the order O( e |X||A|T ) in expectation,
ignoring logarithmic factors. In the setting where F = X × A → [0, 1] is unstructured, and
consists of all possible functions, this is essentially optimal.

Exercise 6 (ε-Greedy with Offline Oracles): In Proposition 8, we analyzed the ε-Greedy


contextual bandit algorithm assuming access to an online regression oracle. Because we appeal
to online learning, this algorithm was able to handle adversarial contexts x1 , . . . , xT . In the
present problem, we will modify the ε-greedy algorithm and proof to show that if contexts are
stochastic (that is xt ∼ D ∀t, where D is a fixed distribution), ε-greedy works even if we use
an offline oracle (Definition 5).
We consider the following variant of ε-greedy. The algorithm proceeds in epochs m =
0, 1, . . . of doubling size

{2}, {3, 4}, {5 . . . 8}, . . . , {2m + 1, 2m+1 }, . . . , {T /2 + 1, T };


| {z }
epoch m

53
we assume without loss of generality that T is a power of 2, and that an arbitrary decision is
made on round t = 1. At the end of each epoch m − 1, the offline oracle is invoked with the
data from the epoch, producing an estimated model fbm . This model is used for the greedy
step in the next epoch m. In other words, for any round t ∈ [2m + 1, 2m+1 ] of epoch m, the
algorithm observes xt ∼ D, chooses an action π t ∼ unif[A] with probability ε and chooses the
greedy action
π t = arg max fbm (xt , π)
π∈[A]

with probability 1 − ε. Subsequently, the reward rt is observed.

1. Prove that for any T ∈ N and δ > 0, by setting ε appropriately, this method ensures that
with probability at least 1 − δ,
 2/3
log2 T
X
Reg ≲ A1/3 T 1/3  2m/2 Estoff
Sq (F, 2
m−1
, δ/m2 )1/2 
m=1

2. Recall that for a finite class, ERM achieves Estoff


Sq (F, T, δ) ≲ log(|F|/δ). Show that with
this choice, the above upper bound matches that in Proposition 8, up to logarithmic in T
factors.

Exercise 7 (Model Misspecification in Contextual Bandits): In Proposition 10, we


showed that for contextual bandits with a general class F, SquareCB attains regret
q
Reg ≲ AT · EstSq (F, T, δ). (3.38)

To do so, we assumed that f ⋆ ∈ F, where f ⋆ (x, a) := Er∼M ⋆ (·|x,a) [r]; that is, we have a well-
specified model. In practice, it may be unreasonable to assume that we have f ⋆ ∈ F. Instead,
a weaker assumption is that there exists some function f¯ ∈ F such that

max |f¯(x, a) − f ⋆ (x, a)| ≤ ε


x∈X ,a∈A

for some ε > 0; that is, the model is ε-misspecified. In this problem, we will generalize the
regret bound for SquareCB to handle misspecification. Recall that in the lecture notes, we
assumed (Definition 3) that the regression oracle satisfies
T
X
Eπt ∼pt (fbt (xt , π t ) − f ⋆ (xt , π t ))2 ≤ EstSq (F, T, δ).
 
t=1

In the misspecified setting, this is too much to ask for. Instead, we will assume that the oracle
satisfies the following guarantee for every sequence:
T
X T
X
(fbt (xt , π t ) − rt )2 − min (f (xt , π t ) − rt )2 ≤ RegSq (F, T ).
f ∈F
t=1 t=1

Whenever f ⋆ ∈ F, we have EstSq (F, T, δ) ≲ RegSq (F, T ) + log(1/δ) with probability at least
1 − δ. However, it is possible to keep RegSq (F, T ) small even when f ⋆ ∈
/ F. For example, the
averaged exponential weights algorithm satisfies this guarantee with RegSq (F, T ) ≲ log|F|,
regardless of whether f ⋆ ∈ F.

54
We will show that for every δ > 0, with an appropriate choice of γ, SquareCB (that is, the
algorithm that chooses pt = IGWγ (fbt (xt , ·))) ensures that with probability at least 1 − δ,
q
Reg ≲ AT · (RegSq (F, T ) + log(1/δ)) + ε · A1/2 T.

Assume that all functions in F and rewards take values in [0, 1].

1. Show that for any sequence of estimators fb1 , . . . , fbt , by choosing pt = IGWγ (fbt (xt , ·)), we
have that
T T
AT h i
Eπt ∼pt (fbt (xt , π t ) − f¯(xt , π t ))2 +εT.
X X
Reg = Eπt ∼pt [f ⋆ (xt , π ⋆ (xt )) − f ⋆ (xt , π t )] ≲ +γ
t=1
γ t=1

If we had f ⋆ = f¯, this would follow from Proposition 9, but the difference is that in general
(f¯ ̸= f ⋆ ), the expression above measures estimation error with respect to the best-in-class
model f¯ rather than the true model f ⋆ (at the cost of an extra εT factor).
2. Show that the following inequality holds for every sequence
T T
(fbt (xt , π t ) − f¯(xt , π t ))2 ≤ RegSq (F, T ) + 2 (rt − f¯(xt , π t ))(fbt (xt , π t ) − f¯(xt , π t )).
X X

t=1 t=1

3. Using Freedman’s inequality (Lemma 35), show that with probability at least 1 − δ,
T h i T
Eπt ∼pt (fbt (xt , π t ) − f¯(xt , π t ))2 ≤ 2 (fbt (xt , π t ) − f¯(xt , π t ))2 + O(log(1/δ)).
X X

t=1 t=1

4. Using Freedman’s inequality once more, show that with probability at least 1 − δ,
T T
1X h i
(r −f¯(xt , π t ))(fbt (xt , π t )−f¯(xt , π t )) ≤ Eπt ∼pt (fbt (xt , π t ) − f¯(xt , π t ))2 +O(ε2 T +log(1/δ)).
X
t
2
t=1
4 t=1

Conclude that with probability at least 1 − δ,


T h i
Eπt ∼pt (fbt (xt , π t ) − f¯(xt , π t ))2 ≲ RegSq (F, T ) + ε2 T + log(1/δ).
X

t=1

5. Combining the previous results, show that for any δ > 0, by choosing γ > 0 appropriately,
we have that with probability at least 1 − δ,
q
Reg ≲ AT · (RegSq (F, T ) + log(1/δ)) + ε · A1/2 T.

4. STRUCTURED BANDITS

Up to this point, we have focused our attention on bandit problems (with or without
contexts) in which the decision space Π is a small, finite set. This section introduces
the structured bandit problem, which generalizes the basic (non-contextual) multi-armed
bandit problem by allowing for large, potentially infinite or continuous decision spaces.
The protocol for the setting is as follows.

55
Structured Bandit Protocol
for t = 1, . . . , T do
Select decision π t ∈ Π. // Π is large and potentially continuous.
Observe reward rt ∈ R.

This protocol is exactly the same as for multi-armed bandits (Section 2), except that we
have removed the restriction that Π = {1, . . . , A}, and now allow it to be arbitrary. This
added generality is natural in many applications:

• In medicine, the treatment may be a continuous variable, such as a dosage. The


treatment could even by a high-dimensional vector (such as dosages for many different
medications). See Figure 7.

• In pricing applications, a seller might aim to select a continuous price or vector or


prices in order to maximize their returns.

• In routing applications, the decision space may be finite, but combinatorially large.
For example, the decision might be a path or flow in a graph.

Both contextual bandits and structured bandits generalize the basic multi-armed bandit
problem, by incorporating function approximation and generalization, but in different ways:

• The contextual bandit formulation in Section 3 assumes structure in the context space.
The aim here was to generalize across contexts, but we restricted the decision space
to be finite (unstructured).

• In structured bandits, we will focus our attention on the case of no contexts, but will
assume the decision space is structured, and aim to generalize across decisions.

Clearly, both ideas above can be combined, and we will touch on this in Section 4.5.

xt
<latexit sha1_base64="sQRFZDGuUi34DGiPO7E0IvzubxY=">AAAB6nicdVDJSgNBEO2JW4xb1KOXxiB4Ct0hZLkFvHiMaBZIxtDT6Uma9Cx014gh5BO8eFDEq1/kzb+xJ4mgog8KHu9VUVXPi5U0QMiHk1lb39jcym7ndnb39g/yh0dtEyWaixaPVKS7HjNCyVC0QIIS3VgLFnhKdLzJRep37oQ2MgpvYBoLN2CjUPqSM7DS9f0tDPIFUiSEUEpxSmi1Qiyp12slWsM0tSwKaIXmIP/eH0Y8CUQIXDFjepTE4M6YBsmVmOf6iREx4xM2Ej1LQxYI484Wp87xmVWG2I+0rRDwQv0+MWOBMdPAs50Bg7H57aXiX14vAb/mzmQYJyBCvlzkJwpDhNO/8VBqwUFNLWFcS3sr5mOmGQebTs6G8PUp/p+0S0VaKZavyoVGZRVHFp2gU3SOKKqiBrpETdRCHI3QA3pCz45yHp0X53XZmnFWM8foB5y3T7v/jhU=</latexit>

context

⇡t
<latexit sha1_base64="9kZ+lwA+TgNCBG1gj9jwQVh/mJQ=">AAAB7HicbZDNSsNAFIVv6l+tf1WXboJFqJuQiKTuLLhxWcG0hSaWyXTSDp1MwsxEKKXP0I0LRdz6FK58BHc+iHsnbRdaPTBw+M69zL03TBmVyrY/jcLK6tr6RnGztLW9s7tX3j9oyiQTmHg4YYloh0gSRjnxFFWMtFNBUBwy0gqHV3neuidC0oTfqlFKghj1OY0oRkojz0/pneqWK7Zlz2TaVs113VpuFsRZmMrlW/XrfeqfNrrlD7+X4CwmXGGGpOw4dqqCMRKKYkYmJT+TJEV4iPqkoy1HMZHBeDbsxDzRpGdGidCPK3NGf3aMUSzlKA51ZYzUQC5nOfwv62QqugjGlKeZIhzPP4oyZqrEzDc3e1QQrNhIG4QF1bOaeIAEwkrfp6SP4Cyv/Nc0zyzHtc5vnErdhbmKcATHUAUHalCHa2iABxgoTOERngxuPBjPxsu8tGAseg7hl4zXb1JNkrs=</latexit>

decision

rt
<latexit sha1_base64="eU/BQXfxVYbmM9MfJPow7LIBoPM=">AAAB6nicdVDLSsNAFL2pr1pfVZduBovgKiShpu2u4MZlRfuANpbJdNIOnTyYmQgl9BPcuFDErV/kzr9x0lZQ0QMXDufcy733+AlnUlnWh1FYW9/Y3Cpul3Z29/YPyodHHRmngtA2iXksej6WlLOIthVTnPYSQXHoc9r1p5e5372nQrI4ulWzhHohHkcsYAQrLd2IOzUsVyzTsmpOo4Es03Gqrutq0nCr9YsasrWVowIrtIbl98EoJmlII0U4lrJvW4nyMiwUI5zOS4NU0gSTKR7TvqYRDqn0ssWpc3SmlREKYqErUmihfp/IcCjlLPR1Z4jVRP72cvEvr5+qoO5lLEpSRSOyXBSkHKkY5X+jEROUKD7TBBPB9K2ITLDAROl0SjqEr0/R/6TjmLZrVq+rlaa7iqMIJ3AK52BDDZpwBS1oA4ExPMATPBvceDRejNdla8FYzRzDDxhvn+pxjjU=</latexit>

reward

ot
<latexit sha1_base64="02emXtykcQvGDEHd3kakHTy/oFg=">AAAB6nicdVDLSgMxFM3UV62vqks3wSK4KkkpfewKblxWtLXQjiWTZtrQTDIkGaEM/QQ3LhRx6xe582/MtBVU9MCFwzn3cu89QSy4sQh9eLm19Y3Nrfx2YWd3b/+geHjUNSrRlHWoEkr3AmKY4JJ1LLeC9WLNSBQIdhtMLzL/9p5pw5W8sbOY+REZSx5ySqyTrtWdHRZLqIwQwhjDjOB6DTnSbDYquAFxZjmUwArtYfF9MFI0iZi0VBBj+hjF1k+JtpwKNi8MEsNiQqdkzPqOShIx46eLU+fwzCkjGCrtSlq4UL9PpCQyZhYFrjMidmJ+e5n4l9dPbNjwUy7jxDJJl4vCRECrYPY3HHHNqBUzRwjV3N0K6YRoQq1Lp+BC+PoU/k+6lTKulatX1VKrtoojD07AKTgHGNRBC1yCNugACsbgATyBZ094j96L97pszXmrmWPwA97bJ65Jjgw=</latexit>

observation

Figure 7: An illustration of the structured bandit problem. A doctor aims to select a


continuous, high-dimensional treatment.

Assumptions and regret. To build intuition as to what it means to generalize across


decisions, and to give a sense for what sort of guarantees we might hope to prove, let us
first give the formal setup for the structured bandit problem. As in preceding sections, we
will assume that rewards are stochastic, and generated from a fixed model.

56
Assumption 4 (Stochastic Rewards): Rewards are generated independently via

rt ∼ M ⋆ (· | π t ), (4.1)

where M ⋆ (· | ·) is the underlying model.

We define

f ⋆ (π) := E [r | π] (4.2)

as the mean reward function under r ∼ M ⋆ (· | π), and measure regret via
T
X T
X
Reg := f ⋆ (π ⋆ ) − Eπt ∼pt [f ⋆ (π t )]. (4.3)
t=1 t=1

Here, π ⋆ := arg maxπ∈Π f ⋆ (π) as usual. We will define the history as Ht = (π 1 , r1 ), . . . , (rt , π t ).

Function approximation. A first attempt to tackle the structured bandit problem might
be to applyp algorithms for the multi-armed bandit setting, such as UCB. This would give
e |Π|T ), which could be vacuous if Π is large relative to T . However, with no
regret O(
further assumptions on the underlying reward function f ⋆ , this is unavoidable. To allow for
better regret, we will make assumptions on the structure of f ⋆ that will allow us to share
information across decisions, and to generalize to decisions that we may not have played.
This is well-suited for the applications described above, where Π is a continuous set (e.g.,
Π ⊆ Rd ), but we expect f ⋆ to be continuous, or perhaps even linear with respect some
well-designed set of features. To make this idea precise, we follow the same approach as in
statistical learning and contextual bandits, and assume access to a well-specified function
class F that aims to capture our prior knowledge about f ⋆ .

Assumption 5: The decision-maker has access to a class F ⊂ {f : Π → R} such that


f ⋆ ∈ F.

Given such a class, a reasonable goal—particularly in light of the development in Section 1


and Section 3—would be to achieve guarantees that scale with the complexity of supervised
learning or estimation with F, e.g. log|F| for finite classes; this is what we were able to
achieve for contextual bandits, after all. Unfortunately, this is too good to be true, as the
following example shows.

Example 4.1 (Necessity of structural assumptions). Let Π = [A], and let F = {fi }i∈[A] ,
where
1 1
fi (π) := + I {π = i} .
2 2
It is clear that one needs
p Reg ≳ A for this setting, yet log|F| = log(A), so a regret bound
of the form Reg ≲ T log|F| is not possible if A is large relative to T . ◁

What this example highlights is that generalizing across decisions is fundamentally dif-
ferent (and, in some sense, more challenging) than generalizing across contexts. In light

57
of this, we will aim for guarantees that scale with log|F|, but additionally scale with an
appropriate notion of complexity of exploration for the decision space Π. Such a notion of
complexity should reflect how much information is shared across decisions, which depends
on the interplay between Π and F.

4.1 Building Intuition: Optimism for Structured Bandits


Our goal is to obtain regret bounds for structured bandits that reflect the intrinsic difficulty
of exploring the decision space Π, which should reflect the structure of the function class F
under consideration. To build intuition as to what such guarantees will look like, and how
they can be obtained, we first investigate the behavior of the optimism principle and the
UCB algorithm when applied to structured bandits. We will see that:

1. UCB attains guarantees that scale with log|F|, and additionally scale with a notion
of complexity called the eluder dimension, which is small for simple problems such as
bandits with linear rewards.

2. In general, UCB is not optimal, and can have regret that is exponentially large com-
pared to the optimal rate.

4.1.1 UCB for Structured Bandits


We can adapt the UCB algorithm from multi-armed bandits to structured bandit by ap-
pealing to least squares and confidence sets, similar to the approach we took for contextual
bandits [74]. Assume F = {f : Π → [0, 1]} and rt ∈ [0, 1] almost surely. Let
t−1
X
fbt = arg min (f (π i ) − ri )2 (4.4)
f ∈F i=1

be the empirical minimizer on round t, and with β := 8 log(|F|/δ), define confidence sets
F 1 = F and
t−1 t−1
( )
X X
Ft = f ∈ F : (f (π i ) − ri )2 ≤ (fbt (π i ) − ri )2 + β . (4.5)
i=1 i=1

Defining f¯t (π) := maxf ∈F t f (π) as the upper confidence bound, the generalized UCB algo-
rithm is given by
π t = arg max f¯t (π). (4.6)
π∈Π

When does the confidence width shrink? Using Proposition 7, one can see the gen-
eralized UCB algorithm ensures that f ⋆ ∈ F t for all t with high probability. Whenever this
happens, regret is bounded by the upper confidence width:
T
f¯t (π t ) − f ⋆ (π t ).
X
Reg ≤ (4.7)
t=1

This bound holds for all structured bandit problems, with no assumption on the structure
of Π and F. Hence, to derive a regret bound, the only question we need to answer is when
will the confidence widths shrink?

58
For the unstructured multi-armed bandit, we need to shrinkpthe width for every arm
separately, and the best bound on (4.7) we can hope for is O( |Π|T ). One might hope
that if Π and F have nice structure, we can do better. In fact, we have already seen one
such case: For linear models, where
n o
F = π 7→ ⟨θ, ϕ(π)⟩ | θ ∈ Θ ⊂ Bd2 (1) , (4.8)
p
Proposition 7 shows that we can bound (4.7) by dT log|F|. Here, the number of decisions
|Π| is replaced by the dimension d, which reflects the fact that there are only d truly unique
directions to explore before we can start extrapolating to new actions. Is there a more
general version of this phenomenon when we move beyond linear models?

4.1.2 The Eluder Dimension


The eluder dimension [74] is a complexity measure that aims to capture the extent to which
the function class F facilitates extrapolation (i.e., generalization to unseen decisions), and
gives a generic way of bounding the confidence width in (4.7). It is defined for a class F as
follows.

Definition 6 (Eluder Dimension): Let F ⊂ (Π → R) and f ⋆ : Π → R be given, and


define Edimf ⋆ (F, ε) as the length of the longest sequence of decisions π 1 , . . . , π d ∈ Π
such that for all t ∈ [d], there exists f t ∈ F such that
X
|f t (π t ) − f ⋆ (π t )| > ε, and (f t (π i ) − f ⋆ (π i ))2 ≤ ε2 . (4.9)
i<t

The eluder dimension is defined as Edimf ⋆ (F, ε) = supε′ ≥ε Edimf ⋆ (F, ε′ ) ∨ 1. We ab-
breviate Edim(F, ε) = maxf ⋆ ∈F Edimf ⋆ (F, ε).

The intuition behind the eluder dimension is simple: It asks, for a worst-case sequence of
decisions, how many times we can be “surprised” by a new decision π t if we can estimate
the underlying model f ⋆ well on all of the preceding points. In particular, if we form
confidence sets as in (4.5) with β = ε2 , then the number of times the upper confidence
width in (4.7) can be larger than ε is at most Edimf ⋆ (F, ε). We consider the definition
Edimf ⋆ (F, ε) = supε′ ≥ε Edimf ⋆ (F, ε′ ) ∨ 1 instead of directly working with Edimf ⋆ (F, ε) to
ensure monotonicity with respect to ε, which will be useful in the proofs that follow.
The following result gives a regret bound for UCB for generic structured bandit prob-
lems. The regret bound has no dependence on the size of the decision space, and scales
only with Edim(F, ε) and log|F|.

Proposition 12: For a finite set of functions F ⊂ (Π → [0, 1]), using β = 8 log(|F|/δ),
the generalized UCB algorithm guarantees that with probability at least 1 − δ,
np o q
Reg ≲ min Edim(F, ε) · T log(|F|/δ) + εT ≲ Edim(F, T −1/2 ) · T log(|F|/δ).
ε>0
(4.10)

59
For the case of linear models in (4.8), it is possible to use the elliptic potential lemma
(Lemma 11) to show that
Edim(F, ε) ≲ d log(ε−1 ).
p
For finite classes, this gives Reg ≲ dT log(|F|/δ) log(T ), which recovers the guarantee in
Proposition 7. Another well-known example is that of generalized linear models. Here, we
fix link function σ : [−1, +1] → R and define
n o
F = π 7→ σ ⟨θ, ϕ(π)⟩ | θ ∈ Θ ⊂ Bd2 (1) .


This is a more flexible model than linear bandits. A well-known special case is the logistic
bandit problem, where σ(z) = 1/(1 + e−z ). One can show [74] that for any choice of σ, if
there exist µ, L > 0 such that µ < σ ′ (z) < L for all z ∈ [−1, +1], then

L2
Edim(F, ε) ≲ · d log(ε−1 ). (4.11)
µ2

This leads to a regret bound that scales with Lµ dT log|F|, generalizing the regret bound
p

for linear bandits.


In general, the eluder dimension can be quite large. Consider the generalized linear
model setup above with σ(z) = +relu(z) or σ(z) = −relu(z) (either choice of sign works),
where relu(z) := max{z, 0} is the ReLU function; this can be interpreted as a neural network
with a single neuron. Here, we can have σ ′ (z) = 0, so (4.11) does not apply, and it turns
out [61] that

Edim(F, ε) ≳ ed (4.12)

for constant ε. That is, even for a single ReLU neuron, the eluder dimension is already
exponential, which is a bit disappointing. Fortunately, we will show in the sequel that the
eluder dimension can be overly pessimistic, and it is possible to do better, but this will
require changing the algorithm.

Proof of Proposition 12. Define


( )
X
t i ⋆ i 2
F = f ∈F | (f (π ) − f (π )) ≤ 4β .
i<t

By Lemma 10, we have that with probability at least 1 − δ, for all t:

1. f ⋆ ∈ F t .

2. F t ⊆ F t .

Let us condition on this event. As in Lemma 7 , since f ⋆ ∈ F t , we can upper bound


T
f¯t (π t ) − f ⋆ (π t ).
X
Reg ≤
t=1

Now, define
wt (π) = sup [f (π) − f ⋆ (π)],
f ∈F t

60
which is a useful upper bound on the upper confidence width at time t. Since F t ⊆ F t , we
have
XT
Reg ≤ wt (π t ).
t=1
We now appeal to the following technical lemma concerning the eluder dimension.

Lemma 12 (Russo and Van Roy [74], Lemma 3): Fix a function class F, function
f ⋆ ∈ F, and parameter β > 0. For any sequence π 1 , . . . , π T , if we define
( )
X
wt (π) = sup f (π) − f ⋆ (π) : (f (π i ) − f ⋆ (π i ))2 ≤ β ,
f ∈F i<t

then for all α > 0,


T  
X
t t β
I {w (π ) > α} ≤ + 1 · Edimf ⋆ (F, α).
α2
t=1

Note that for the special case where β = α2 , the bound in Lemma 12 immediately
follows from the definition of the eluder dimension. The point of this lemma is to show that
a similar bound holds for all scales α simultaneously, but with a pre-factor αβ2 that grows
large when α2 ≪ β.
To apply this result, fix ε > 0, and bound
T
X T
X
t t
w (π ) ≤ wt (π t )I {wt (π t ) > ε} + εT. (4.13)
t=1 t=1

Let us order the indices {1, . . . , T } as {i1 , . . . , iT }, so that wi1 (π i1 ) ≥ wi2 (π i2 ) ≥ . . . ≥


wiτ (π iτ ). Consider any index τ for which wiτ (π iτ ) > ε. For any α > ε, if we have wiτ (π iτ ) >
α, then Lemma 12 (since α ≤ 1 ≤ β) implies that
T  
X
t t 4β 5β
τ≤ I {w (π ) > α} ≤ 2
+ 1 Edimf ⋆ (F, α) ≤ 2 Edimf ⋆ (F, α). (4.14)
α α
t=1

Since we have restricted to α ≥ ε and α 7→ Edimf ⋆ (F, α) is decreasing, rearranging yields


r
iτ iτ 5βEdim(F, ε)
w (π ) ≤ .
τ
With this, we can bound the main term in (4.13) by
T T
r
X
t t t t
X βEdim(F, ε) p
w (π )I {w (π ) > ε} ≲ ≲ βEdim(F, ε)T .
t
t=1 t=1
p
Combining this with (4.13) gives Reg ≲ βEdim(F, ε)T + εT . Since ε > 0 was arbitrary,
we are free to minimize over it.

61
Proof of Lemma 12. Let us adopt the shorthand d = Edimf ⋆ (F, α). We begin with a
definition. We say π P is α-independent of π 1 , . . . , π t if there exists f ∈ F such that
⋆ t ⋆ i 2 2
|f (π) − f (π)| > α and
Pt i=1 (fi(π ) −⋆ f i(π2 )) ≤2 α . We say⋆ π is α-dependent on π , . . . , π
i 1 t

if for all f ∈ F with i=1 (f (π ) − f (π )) ≤ α , |f (π) − f (π)| ≤ α.


We first claim that for any t, if wt (π t ) > α, then πt is α-dependent on at most β/α2
disjoint subsequences of π 1 , . . . , π t−1 . Indeed, let f be such that |f (π t ) − f ⋆ (π t )| > α. If π t
is α-dependent on a particular subsequence π i1 , . . . , π ik but wt (π t ) > α, we must have
k
X
(f (π ij ) − f ⋆ (π ij ))2 ≥ α2 .
j=1

If there are M such disjoint sequences, we have


X
M α2 ≤ (f (π i ) − f ⋆ (π i ))2 ≤ β,
i<t

so M ≤ αβ2 .
Next, we claim that for τ and any sequence (π 1 , . . . , π τ ), there is some j such that π j
is α-dependent on at least ⌊τ /d⌋ disjoint subsequences of π 1 , . . . , π j−1 . Let N = ⌊τ /d⌋,
and let B1 , . . . , BN be subsequences of π 1 , . . . , π τ . We initialize with Bi = (π i ). If π N +1 is
α-dependent on Bi = (π i ) for all 1 ≤ i ≤ N we are done. Otherwise, choose i such that
π N +1 is α-independent of Bi , and add it to Bi . Repeat this process until we reach j such
that either π j is α-dependent P on all Bi or j = τ . In the first case we are done, while in
the second case, we have N |B
i=1 i | ≥ τ ≥ dN . Moreover, |B i | ≤ d, since each π j
∈ B i
is α-independent of its prefix (this follows from the definition of eluder dimension). We
conclude that |Bi | = d for all i, so in this case π τ is α-dependent on all Bi .
Finally, let (π t1 , . . . , π tτ ) be the subsequence π 1 , . . . , π T consisting of all elements for
which wii (π ti ) > α. Each element of the sequence is dependent on at most β/α2 disjoint
subsequences of (π t1 , . . . , π tτ ), and by the argument above, one element is dependent on at
least ⌊τ /d⌋ disjoint subsequences, so we must have ⌊τ /d⌋ ≤ β/α2 , and which implies that
τ ≤ (β/α2 + 1)d.

4.1.3 Suboptimality of Optimism


The following example shows a function class F for which the regret experienced by UCB
is exponentially large compared to the regret obtained by a simple alternative algorithm.
This shows that while the algorithm is useful for some special cases, it does not provide a
general principle that attains optimal regret for any structured bandit problem.

Example 4.2 (Cheating Code [9, 50]). Let A ∈ N be a power of 2 and consider the following
function class F.

• The decision space is Π = [A] ∪ C, where C = {c1 , . . . , clog2 (A) } is a set of “cheating”
actions.

• For all actions π ∈ [A], f (π) ∈ [0, 1] for all f ∈ F, but we otherwise make no
assumption on the reward.

• For each f ∈ F, rewards for actions in C take the following form. Let πf ∈ [A] denote
the action in [A] with highest reward. Let b(f ) = (b1 (f ), . . . , blog2 (A) (f )) ∈ {0, 1}log2 (A)

62
be a binary encoding for the index of πf ∈ [A] (e.g., if πf = 1, b(f ) = (0, 0, . . . , 0), if
πf = 2, b(f ) = (0, 0, . . . , 0, 1), and so on). For each action ci ∈ C, we set
f (ci ) = −bi (f ).

The idea here is that if we ignore the actions √ C, this looks like a standard multi-armed
bandit problem, and the optimal regret is Θ( AT ). However, we can use the actions in C
to “cheat” and get an exponential improvement in sample complexity. The argument is as
follows.
Suppose for simplicity that rewards are Gaussian with r ∼ N (f ⋆ (π), 1) under π. For
each cheating action ci ∈ C, since f ⋆ (ci ) = −bi (f ⋆ ) ∈ {0, −1}, we can determine whether the
value is bi (f ⋆ ) = 0 or bi (f ⋆ ) = 1 with high probability using O(1)
e action pulls. If we do this
for each ci ∈ C, which will incur O(log(A)) regret (there are log(A) such actions and each one
e
leads to constant regret), we can infer the binary encoding b(f ⋆ ) = b1 (f ⋆ ), . . . , blog2 (A) (f ⋆ )
for the optimal action πf ⋆ with high probability. At this point, we can simply stop exploring,
and commit to playing πf ⋆ for the remaining rounds, which will incur no more regret. If
one is careful with the details, this gives that with probability at least 1 − δ,
Reg ≲ log2 (A/δ).
In other words, by exploiting the cheating actions, our regret has gone from linear to
logarithmic in A (we have also improved the dependence on T , which is a secondary bonus).
Now, let us consider the behavior of the generalized UCB algorithm. Unfortunately,
since all actions ci ∈ C have f (ci ) ≤ 0 for all f ∈ F, we have f¯t (ci ) ≤ 0. As a result, the
generalized UCB algorithm will only ever pull actions in [A], ignoring the cheating actions
and effectively turning this into a vanilla multi-armed bandit problem, which means that

Reg ≳ AT .

This example shows that UCB can behave suboptimally in the presence of decisions that
reveal useful information but do not necessarily lead to high reward. Since the “cheating”
actions are guaranteed to have low reward, UCB avoids them even though they are very
informative. We conclude that:
1. Obtaining optimal sample complexity for structured bandits requires algorithms that
more deliberately balance the tradeoff between optimizing reward and acquiring in-
formation.
2. In general, the optimal strategy for picking decisions can be very different depending
on the choice of the class F. This contrasts the contextual bandit setting, where we
saw that the Inverse Gap Weighting algorithm attained optimal sample complexity for
any choice of class F, and all that needed to change was how to perform estimation.

Remark 13 (Suboptimality of posterior sampling): Recall the Bayesian bandit


setting in√Section 2.4, where we showed that the posterior sampling algorithm attains
e AT ) when Π = {1, . . . , A}. Posterior sampling is a general-purpose algo-
regret O(
rithm, and can be applied to directly to arbitrary structured bandit problems (as long
as a prior is available). However, similar to UCB, the cheating code construction in

63
Example 4.2 implies that posterior sampling is not optimal in general. Indeed, poste-
rior sampling will never select the cheating arms in C, as these have sub-optimal reward
for all models
√ in F. As a result, the Bayesian regret of the algorithm will scale with
Reg ≳ AT for a worst-case prior.

4.2 The Decision-Estimation Coefficient


The discussion in the prequel highlights two challenges in designing algorithms and under-
standing sample complexity for structured bandits: 1) the optimal regret (in a sense, the
complexity of exploration) can depend on the class F in a subtle, sometimes surprising
fashion, and 2) the algorithms required to achieve optimal regret can heavily depend on the
choice of F. In light of these challenges, it is natural to ask whether it is possible to have
any sort of unified understanding of the optimal regret. We will now show that the answer
is yes, and this will be achieved by a single, general-purpose principle for algorithm design.
The algorithm we will present in this section reduces the problem of decision making to
that of supervised online learning/estimation, in a similar fashion to the SquareCB method
for contextual bandits in Section 3. To apply this method, we require the following oracle
for supervised estimation.

Definition 7 (Online Regression Oracle): At each time t ∈ [T ], an online regression


oracle returns, given
(π 1 , r1 ), . . . , (π t−1 , rt−1 )
with E[ri |π i ] = f ⋆ (π i ) and π i ∼ pi , a function fbt : Π → R such that
T
X
Eπt ∼pt (fbt (π t ) − f ⋆ (π t ))2 ≤ EstSq (F, T, δ)
t=1

with probability at least 1 − δ. Here, pi (·|Hi−1 ) is the randomization distribution for


the decision-maker.

Recall, following the discussion in Section 3, that the averaged exponential weights algorithm
achieves is an online regression oracle with EstSq (F, t, δ) ≲ log(|F|/δ).
The following algorithm, which we call Estimation-to-Decisions or E2D [40, 43], is a
general-purpose meta-algorithm for structured bandits.

Estimation-to-Decisions (E2D) for Structured Bandits


Input: Exploration parameter γ > 0.
for t = 1, . . . , T do
Obtain fbt from online regression oracle with (π 1 , r1 ), . . . , (π t−1 , rt−1 ).
Compute
 
pt = arg min max Eπ∼p f (πf ) − f (π) − γ · (f (π) − fbt (π))2 .
p∈∆(Π) f ∈F

Select action π t ∼ pt .

64
At each timestep t, the algorithm calls invokes an online regression oracle to obtain an esti-
mator fbt using the data Ht−1 = (π 1 , r1 , . . . , π t−1 , rt−1 ) observed so far. The algorithm then
finds a distribution pt by solving a min-max optimization problem involving the estimator
fbt and the class F, then samples the decision π t from this distribution.
The minimax problem in E2D is derived from a complexity measure (or, structural
parameter) for F called the Decision-Estimation Coefficient [40, 43], whose value is given
by  
2
decγ (F, fb) = min max Eπ∼p f (πf ) − f (π) −γ · (f (π) − fb(π)) . (4.15)
p∈∆(Π) f ∈F | {z } | {z }
regret of decision information gain for obs.
The Decision-Estimation Coefficient can be thought of as the value of a game in which
the learner (represented by the min player) aims to find a distribution over decisions such
that for a worst-case problem instance (represented by the max player), the regret of their
decision is controlled by a notion of information gain (or, estimation error) relative to a
reference model fb. Conceptually, fb should be thought of as a guess for the true model,
and the learner (the min player) aims to—in the face of an unknown environment (the max
player)—optimally balance the regret of their decision with the amount information they
acquire. With enough information, the learner can confirm or rule out their guess fb, and
scale parameter γ controls how much regret they are willing to incur to do this. In general,
the larger the value of decγ (F, fb), the more difficult it is to explore.
To state a regret bound for E2D, we define
decγ (F) = sup decγ (F, fb). (4.16)
fb∈co(F )

Here, co(F) denotes the set of all convex combinations of elements in F. The reason we
consider the set co(F) is that in general, online estimation algorithms such as exponential
weights will produce improper predictions with fb ∈ co(F). In fact, it turns out (see Propo-
sition 24) that even if we allow fb to be unconstrained above, the maximizer always lies in
co(F) without loss of generality.
The main result for this section shows that the regret for E2D is controlled by the value
of the DEC and the estimation error EstSq (F, T, δ) for the online regression oracle.

Proposition 13 (Foster et al. [40]): The E2D algorithm with exploration parameter
γ > 0 guarantees that with probability at least 1 − δ,

Reg ≤ decγ (F) · T + γ · EstSq (F, T, δ). (4.17)

We can optimize over the parameter γ in the result above, which yields
 
Reg ≤ inf decγ (F) · T + γ · EstSq (F, T, δ)
γ>0
 
≤ 2 · inf max decγ (F) · T, γ · EstSq (F, T, δ) .
γ>0

For finite classes, we can use the exponential weights method to obtain EstSq (F, T, δ) ≲
log(|F|/δ), and this bound specializes to
 
Reg ≲ inf max decγ (F) · T, γ · log(|F|/δ) . (4.18)
γ>0

65
As desired, this gives a bound on regret that scales only with:

1. the complexity log|F| for estimation.

2. the complexity of exploration in the decision space, which is captured by decγ (F).

Before interpreting the result further, we give the proof, which is a nearly immediate con-
sequence of the definition of the DEC, and bears strong similarity to the proof of the regret
bound for SquareCB (Proposition 10), minus contexts.

Proof of Proposition 13. We write


T
X
Reg = Eπt ∼pt [f ⋆ (π ⋆ ) − f ⋆ (π t )]
t=1
XT h i
= Eπt ∼pt [f ⋆ (π ⋆ ) − f ⋆ (π t )] − γ · Eπt ∼pt (f ⋆ (π t ) − fbt (π t ))2 + γ · EstSq (F, T, δ).
t=1

For each t, since f ⋆ ∈ F, we have


h i
Eπt ∼pt [f ⋆ (π ⋆ ) − f ⋆ (π t )] − γ · Eπt ∼pt (f ⋆ (π t ) − fbt (π t ))2
n h io
≤ sup Eπt ∼pt [f (πf ) − f (π t )] − γ · Eπt ∼pt (f (π t ) − fbt (π t ))2
f ∈F
h i
= inf sup Eπ∼p f (πf ) − f (π) − γ · (f (π t ) − fbt (π t ))2
p∈∆(Π) f ∈F

= decγ (F, fbt ), (4.19)

where the first equality above uses that pt is chosen as the minimizer for decγ (F, fbt ). Sum-
ming across rounds, we conclude that

Reg ≤ sup decγ (F, fb) · T + γ · EstSq (F, T, δ).


fb

When designing algorithms for structured bandits, a common challenge is that the
connection between decision making (where the learner’s decisions influence what feedback
is collected) and estimation (where data is collected passively) may not seem apparent a-
priori. The power of the Decision-Estimation Coefficient is that it—by definition—provides
a bridge, which the proof of Proposition 13 highlights. One can select decisions by building
an estimate for the model using all of the observations collected so far, then sampling
from the distribution p that solves (4.15) with the estimated reward function fb plugged
in. Boundedness of the DEC implies that at every round, any learner using this strategy
either enjoys small regret or acquires information, with their total regret controlled by the
cumulative online estimation error.

66
Example: Multi-Armed Bandit. Of course, the perspective above is only useful if the
DEC is indeed bounded, which itself is not immediately apparent. In Section 6, we will show
that boundedness of the DEC is not just sufficient, but in fact necessary for low regret in
a fairly strong quantitative sense. For now, we will build intuition about the DEC through
examples. We begin with the multi-armed bandit, where Π = [A] and F = RA . Our first
result shows that decγ (F) ≤ A γ , and that this is achieved with the Inverse Gap Weighting
method introduced in Section 3.

Proposition 14 (IGW minimizes the DEC): For the Multi-Armed Bandit setting,
where Π = [A] and F = RA , the Inverse Gap Weighting distribution p = IGW4γ (fb) in
(3.36) is the exact minimizer for decγ (F, fb), and certifies that decγ (F, fb) = A−1
4γ .

By rewriting Proposition 9, it is straightforward to deduce that the DEC is bounded by A/γ, but Proposition 14 shows that IGW is actually the best possible distribution for this minimax problem. In this sense, the SquareCB algorithm can be seen as a (contextual) special case of the Estimation-to-Decisions principle. Note that to attain the exact optimal value (instead of a bound that is optimal up to constants), we use IGW_{4γ} as opposed to IGW_γ as in Proposition 9; the reason why this choice is optimal is related to the fact that the inequality xy ≤ x² + (1/4)y² is tight in general.
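
As a quick numerical sanity check (our own sketch, assuming the λ-normalized form of (3.36), i.e. p(π) = (λ + γ(fb(πfb) − fb(π)))^{−1} with λ chosen so the probabilities sum to one), one can verify both the equalizing property and the value (A − 1)/(4γ) claimed in Proposition 14.

import numpy as np

def igw(fhat, gamma):
    # Inverse gap weighting: p(a) = 1 / (lam + gamma * (max_a' fhat(a') - fhat(a))),
    # with lam in [1, A] found by bisection so that the probabilities sum to one.
    A, gaps = len(fhat), np.max(fhat) - fhat
    lo, hi = 1.0, float(A)
    for _ in range(100):
        lam = 0.5 * (lo + hi)
        if np.sum(1.0 / (lam + gamma * gaps)) > 1.0:
            lo = lam
        else:
            hi = lam
    return 1.0 / (lam + gamma * gaps)

A, gamma = 5, 10.0
fhat = np.random.default_rng(0).uniform(size=A)
p = igw(fhat, 4 * gamma)                      # IGW_{4 gamma}, as in Proposition 14

# Objective (4.20) evaluated at each candidate pi*: the bracketed term is (nearly)
# constant across pi* (equalizing), and equals A / (4 gamma) up to bisection error.
bracket = fhat - p @ fhat + 1.0 / (4 * gamma * p)
print(np.round(bracket, 4))                   # all entries approx A / (4 gamma)
print(np.max(bracket) - 1.0 / (4 * gamma), (A - 1) / (4 * gamma))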

Proof of Proposition 14. We rewrite the minimax problem as

min_{p∈∆([A])} max_{f∈R^A} E_{π∼p}[ f(πf) − f(π) − γ(f(π) − fb(π))² ]
  = min_{p∈∆([A])} max_{f∈R^A} max_{π⋆∈[A]} E_{π∼p}[ f(π⋆) − f(π) − γ(f(π) − fb(π))² ]
  = min_{p∈∆([A])} max_{π⋆∈[A]} max_{f∈R^A} E_{π∼p}[ f(π⋆) − f(π) − γ(f(π) − fb(π))² ].

For any fixed p and π⋆, first-order conditions for optimality imply that the choice for f that maximizes this expression is

f(π) = fb(π) − 1/(2γ) + I{π = π⋆}/(2γ p(π⋆)).

This choice gives

E_{π∼p}[f(π⋆) − f(π)] = E_{π∼p}[fb(π⋆) − fb(π)] + (1 − p(π⋆))/(2γ p(π⋆))

and

γ · E_{π∼p}[(f(π) − fb(π))²] = (1 − p(π⋆))/(4γ) + (1 − p(π⋆))²/(4γ p(π⋆)) = 1/(4γ p(π⋆)) − 1/(4γ).

Plugging in and simplifying, we compute that the original minimax game is equivalent to

min_{p∈∆([A])} max_{π⋆∈[A]} { E_{π∼p}[fb(π⋆) − fb(π)] + 1/(4γ p(π⋆)) − 1/(4γ) }.    (4.20)
Finishing the proof: Ad-hoc approach. Observe that for any p ∈ ∆(Π), we have

max_{π⋆∈[A]} { E_{π∼p}[fb(π⋆) − fb(π)] + 1/(4γ p(π⋆)) } ≥ E_{π⋆∼p}[ E_{π∼p}[fb(π⋆) − fb(π)] + 1/(4γ p(π⋆)) ] = A/(4γ),

so no p can attain value better than A/(4γ). If we can show that IGW achieves this value, we are done.
Observe that by setting p = IGW_{4γ}(fb), which takes the form p(π) = (λ + 4γ(fb(πfb) − fb(π)))^{−1} for a normalization constant λ as in (3.36), we have 1/(4γ p(π⋆)) = λ/(4γ) + fb(πfb) − fb(π⋆), and hence for all π⋆,

E_{π∼p}[fb(π⋆) − fb(π)] + 1/(4γ p(π⋆)) = E_{π∼p}[fb(π⋆) − fb(π)] + fb(πfb) − fb(π⋆) + λ/(4γ)    (4.21)
  = E_{π∼p}[fb(πfb) − fb(π)] + λ/(4γ).

Note that the value on the right-hand side is independent of π⋆. That is, the inverse gap weighting distribution is an equalizing strategy. This means that for this choice of p, we have

max_{π⋆∈[A]} { E_{π∼p}[fb(π⋆) − fb(π)] + 1/(4γ p(π⋆)) } = min_{π⋆∈[A]} { E_{π∼p}[fb(π⋆) − fb(π)] + 1/(4γ p(π⋆)) }
  = E_{π⋆∼p}[ E_{π∼p}[fb(π⋆) − fb(π)] + 1/(4γ p(π⋆)) ] = A/(4γ).
Hence, p = IGW4γ (fb) achieves the optimal value.
Finishing the proof: Principled approach. We begin by relaxing to p ∈ R^A_+. Define

g_{π⋆}(p) = fb(π⋆) + 1/(4γ p(π⋆)).

Let ν ∈ R be a Lagrange multiplier and p ∈ R^A_+, and consider the Lagrangian

L(p, ν) = max_{π⋆} g_{π⋆}(p) − Σ_π p(π) fb(π) + ν( Σ_π p(π) − 1 ).

By the KKT conditions, if we wish to show that p ∈ ∆(Π) is optimal for the objective in (4.20), it suffices to find ν such that¹²

0 ∈ ∂_p L(p, ν),

where ∂_p denotes the subgradient with respect to p. Recall that for a convex function h(x) = max_y g(x, y), we have ∂_x h(x) = co({ ∇_x g(x, y) | g(x, y) = max_{y′} g(x, y′) }). As a result,

∂_p L(p, ν) = ν1 − fb + co({ ∇_p g_{π⋆}(p) | g_{π⋆}(p) = max_{π′} g_{π′}(p) }).

Now, let p = IGW_{4γ}(fb). We will argue that 0 ∈ ∂_p L(p, ν) for an appropriate choice of ν. By (4.21), we know that g_π(p) = g_{π′}(p) for all π, π′ (p is equalizing), so the expression above simplifies to

∂_p L(p, ν) = ν1 − fb + co({ ∇_p g_{π⋆}(p) }_{π⋆∈Π}).    (4.22)

Noting that ∇_p g_{π⋆}(p) = −(1/(4γ p²(π⋆))) e_{π⋆}, we compute

δ := Σ_π p(π) ∇_p g_π(p) = ( −1/(4γ p(π)) )_{π∈Π} = ( −λ/(4γ) − fb(πfb) + fb(π) )_{π∈Π},

which is a convex combination (with weights p(π)) of the vectors ∇_p g_π(p), so that δ ∈ co({ ∇_p g_{π⋆}(p) }_{π⋆∈Π}). By choosing ν = λ/(4γ) + fb(πfb), we have

ν1 − fb + δ = 0,

so (4.22) is satisfied.

¹² If p ∈ ∆(Π), the KKT condition that (d/dν) L(p, ν) = 0 is already satisfied.

4.3 Decision-Estimation Coefficient: Examples
We now show how to bound the Decision-Estimation Coefficient for a number of examples
beyond finite-armed bandits—some familiar and others new—and show how this leads to
bounds on regret via E2D.

Approximately solving the DEC. Before proceeding, let us mention that to apply E2D, it is not necessary to exactly solve the minimax problem (4.15). Instead, let us say that a distribution p = p(fb, γ) certifies an upper bound on the DEC if, given fb and γ > 0, it ensures that

sup_{f∈F} E_{π∼p}[ f(πf) − f(π) − γ · (f(π) − fb(π))² ] ≤ dec̄γ(F, fb)

for some known upper bound dec̄γ(F, fb) ≥ decγ(F, fb). In this case, letting dec̄γ(F) := sup_{fb} dec̄γ(F, fb), it is simple to see that if we use this distribution pt = p(fbt, γ) within E2D, we have

Reg ≤ dec̄γ(F) · T + γ · EstSq(F, T, δ).

4.3.1 Cheating Code


For a first example, we show that the DEC leads to regret bounds that scale with log(A) for the cheating code example in Example 4.2; that is, unlike UCB and posterior sampling, the DEC correctly adapts to the structure of this problem.

Proposition 15 (DEC for Cheating Code): Consider the cheating code in Example 4.2. For this class F, we have

decγ(F) ≲ log₂(A)/γ.

Note that while the strategy p in Proposition 15 certifies a bound on the DEC, it is not necessarily the exact minimizer, and hence the distributions p1, . . . , pT played by E2D may be different. Nonetheless, since the regret of E2D is bounded by the DEC, this result (via Proposition 13) implies that its regret is bounded by Reg ≲ √(log₂(A) · T · log|F|). Using a slightly more refined version of the E2D algorithm [43], one can improve this to match the log(T) regret bound given in Example 4.2.

Proof of Proposition 15. To simplify exposition, we present a bound on decγ(F, fb) for this example only for fb ∈ F, not for fb ∈ co(F). A similar approach (albeit with a slightly different choice for p) leads to the same bound on decγ(F). Let fb ∈ F and γ > 0 be given, and define

p = (1 − ε) · I_{πfb} + ε · unif(C).

We will show that if we choose ε = 2 log₂(A)/γ, this strategy certifies that

decγ(F, fb) ≲ log₂(A)/γ.

Let f ∈ F be fixed, and consider the value

E_{π∼p}[ f(πf) − f(π) − γ · (f(π) − fb(π))² ].

We consider two cases. For the first, if πf = πfb, then we can upper bound

E_{π∼p}[ f(πf) − f(π) − γ · (f(π) − fb(π))² ] ≤ E_{π∼p}[f(πf) − f(π)] = E_{π∼p}[f(πfb) − f(π)] ≤ 2ε,

since f ∈ [−1, 1].

For the second case, suppose that πf ≠ πfb. We begin by bounding

E_{π∼p}[ f(πf) − f(π) − γ · (f(π) − fb(π))² ] ≤ 2 − γ · E_{π∼p}[(f(π) − fb(π))²],

using that f ∈ [−1, 1]. To proceed, we want to argue that the negative offset term above is sufficiently large; informally, this means that we are exploring "enough". Observe that since πf ≠ πfb, if we let b1, . . . , b_{log₂(A)} and b′1, . . . , b′_{log₂(A)} denote the binary representations for πf and πfb, there exists i such that bi ≠ b′i. As a result, we have

E_{π∼p}[(f(π) − fb(π))²] ≥ (ε/log₂(A)) · (f(ci) − fb(ci))² = (ε/log₂(A)) · (bi − b′i)² = ε/log₂(A).

We conclude that in the second case,

E_{π∼p}[ f(πf) − f(π) − γ · (f(π) − fb(π))² ] ≤ 2 − γ · ε/log₂(A).

Putting the cases together, we have

E_{π∼p}[ f(πf) − f(π) − γ · (f(π) − fb(π))² ] ≤ max{ 2ε, 2 − γ · ε/log₂(A) }.

To balance these terms, we set ε = 2 log₂(A)/γ,
which leads to the result.
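
The two-case bound above is easy to evaluate numerically. The following small sketch (ours, using the choice of p and ε from the proof) computes the certified bound max{2ε, 2 − γε/log₂(A)} and compares it to the log₂(A)/γ rate; for small γ, the choice ε = 2 log₂(A)/γ is capped at 1, in which case the bound is vacuous anyway.

import numpy as np

def cheating_code_dec_bound(A, gamma):
    # Certified bound from the proof of Proposition 15: p places mass 1 - eps on the
    # greedy decision for fhat and eps uniformly on the log2(A) cheating decisions.
    m = np.log2(A)
    eps = min(1.0, 2.0 * m / gamma)
    return max(2.0 * eps, 2.0 - gamma * eps / m)

for gamma in [10.0, 100.0, 1000.0]:
    print(gamma, cheating_code_dec_bound(A=1024, gamma=gamma), 4 * np.log2(1024) / gamma)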

4.3.2 Linear Bandits


We next consider the linear bandit problem [2, 13, 28, 26, 1], which is a special case of the linear contextual bandit problem we saw in Section 3. We let Π be arbitrary, and define F = {π ↦ ⟨θ, ϕ(π)⟩ | θ ∈ Θ}, where Θ ⊆ B^d_2(1) is a parameter set and ϕ : Π → B^d_2(1) is a fixed feature map that is known to the learner.
To prove bounds on the DEC for this setting, we make use of a primitive from convex
analysis and experimental design known as the G-optimal design.

Proposition 16 (G-optimal design [52]): For any compact set Z ⊆ R^d with dim span(Z) = d, there exists a distribution p ∈ ∆(Z), called the G-optimal design, which has

sup_{z∈Z} ⟨Σ_p^{−1} z, z⟩ ≤ d,    (4.23)

where Σ_p := E_{z∼p}[zz^⊤].

The G-optimal design ensures coverage in every direction of the decision space, generalizing the notion of uniform exploration for finite action spaces. In this sense, it can be thought of as a "universal" exploratory distribution for linearly structured action spaces. Special cases include:
• When Z = ∆([A]), we can take p = unif(e1, . . . , eA) as an optimal design.
• When Z = B^d_2(1), we can again take p = unif(e1, . . . , ed) as an optimal design.
• For any positive definite matrix A ≻ 0, the set Z = {z ∈ R^d | ⟨Az, z⟩ ≤ 1} is an ellipsoid. Letting λ1, . . . , λd and v1, . . . , vd denote the eigenvalues and eigenvectors for A, respectively, the distribution p = unif(λ1^{−1/2}v1, . . . , λd^{−1/2}vd) is an optimal design.
To see how the G-optimal design can be used for exploration, consider the following
generalization of the ε-greedy algorithm.
• Let q ∈ ∆(Π) be the G-optimal design for the set {ϕ(π)}_{π∈Π}.
• At each step t, obtain fbt from a supervised estimation oracle. Play π̂t = πfbt with probability 1 − ε, and sample πt ∼ q otherwise.
It is straightforward to show that this strategy gives Reg ≲ d^{1/3} T^{2/3} log|F| for linear
bandits. The basic idea is to replace (3.30) in the proof of Proposition 8 with the optimal
design property (4.23), using that the reward functions under consideration are linear. The
intuition is that even though we are no longer guaranteed to explore every single action
with some minimum probability, by exploring with the optimal design, we ensure that some
fraction of the data we collect covers every possible direction in action space to the greatest
extent possible.
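
For concreteness, here is a sketch (ours) of how one might compute a G-optimal design numerically, via Frank-Wolfe iterations on the D-optimal design objective; by the Kiefer-Wolfowitz equivalence theorem, the resulting distribution also satisfies the G-optimality property (4.23). The resulting distribution can then be used as the exploratory component q in the ε-greedy strategy above.

import numpy as np

def g_optimal_design(X, iters=1000):
    # Frank-Wolfe (Fedorov-Wynn) iterations for the D-optimal design, which by the
    # Kiefer-Wolfowitz theorem is also G-optimal: max_i x_i' Sigma_p^{-1} x_i -> d.
    n, d = X.shape
    p = np.full(n, 1.0 / n)
    for _ in range(iters):
        Sigma = X.T @ (p[:, None] * X)
        g = np.einsum("ij,jk,ik->i", X, np.linalg.inv(Sigma), X)   # x_i' Sigma^{-1} x_i
        i = np.argmax(g)
        step = (g[i] / d - 1.0) / (g[i] - 1.0)                     # standard FW step size
        p = (1.0 - step) * p
        p[i] += step
    return p

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))                     # feature vectors phi(pi) for 50 actions
p = g_optimal_design(X)
Sigma = X.T @ (p[:, None] * X)
print(np.max(np.einsum("ij,jk,ik->i", X, np.linalg.inv(Sigma), X)))  # close to d = 4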
The following result shows that by combining the optimal design with inverse gap weighting, we can obtain a d/γ bound on the DEC, which leads to an improved √(dT) regret bound.

Proposition 17 (DEC for Linear Bandits): Consider the linear bandit setting. Let a linear function fb and γ > 0 be given, and consider the following distribution p:

• Define ϕ̄(π) = ϕ(π)/√(1 + (γ/d)(fb(πfb) − fb(π))), where πfb = arg max_{π∈Π} fb(π).

• Let q̄ ∈ ∆(Π) be the G-optimal design for the set {ϕ̄(π)}_{π∈Π}, and define q = (1/2)q̄ + (1/2)I_{πfb}.

• For each π ∈ Π, set

p(π) = q(π) / (λ + (γ/d)(fb(πfb) − fb(π))),

where λ ∈ [1/2, 1] is chosen such that Σ_π p(π) = 1.ᵃ

This strategy certifies that

decγ(F) ≲ d/γ.

ᵃ The normalizing constant λ always exists because we have 1/(2λ) ≤ Σ_π p(π) ≤ 1/λ.

One can show that decγ(F) ≳ d/γ for this setting as well, so this is the best bound we can hope for. Combining this result with Proposition 13 and using the averaged exponential weights algorithm for estimation as in (4.18) gives Reg ≲ √(dT log(|F|/δ)).
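
To make the construction in Proposition 17 concrete, the following sketch (ours, for a finite action set, reusing a compact Frank-Wolfe routine for the optimal design as in the earlier sketch) assembles the reweighted features, the mixed design q, and the final distribution p, solving for the normalizing constant λ ∈ [1/2, 1] by bisection.

import numpy as np

def g_optimal_design(X, iters=2000):
    # Frank-Wolfe for the D-/G-optimal design (see the earlier sketch).
    n, d = X.shape
    p = np.full(n, 1.0 / n)
    for _ in range(iters):
        g = np.einsum("ij,jk,ik->i", X, np.linalg.inv(X.T @ (p[:, None] * X)), X)
        i = np.argmax(g)
        step = (g[i] / d - 1.0) / (g[i] - 1.0)
        p = (1.0 - step) * p
        p[i] += step
    return p

def linear_bandit_dec_distribution(X, theta_hat, gamma):
    # Sketch of the distribution from Proposition 17 for a finite action set with features X.
    n, d = X.shape
    fhat = X @ theta_hat
    gaps = np.max(fhat) - fhat
    Xbar = X / np.sqrt(1.0 + (gamma / d) * gaps)[:, None]   # reweighted features phi-bar
    q = 0.5 * g_optimal_design(Xbar)
    q[np.argmax(fhat)] += 0.5                               # mix with the greedy action
    lo, hi = 0.5, 1.0
    for _ in range(100):                                    # bisection for lambda in [1/2, 1]
        lam = 0.5 * (lo + hi)
        p = q / (lam + (gamma / d) * gaps)
        lo, hi = (lam, hi) if p.sum() > 1.0 else (lo, lam)
    return p / p.sum()                                      # tiny renormalization for safety

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 3)); X /= np.linalg.norm(X, axis=1, keepdims=True)
theta_hat = np.array([0.6, -0.2, 0.3])
p = linear_bandit_dec_distribution(X, theta_hat, gamma=50.0)
print(p.sum(), p[np.argmax(X @ theta_hat)])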

Proof of Proposition 17. Fix f ∈ F, and abbreviate η = γ/d. As in Proposition 9, we break the regret into three terms:

E_{π∼p}[f(πf) − f(π)] = E_{π∼p}[fb(πfb) − fb(π)] + E_{π∼p}[fb(π) − f(π)] + (f(πf) − fb(πfb)),

where we refer to the three terms as (I) the exploration bias, (II) the estimation error on-policy, and (III) the estimation error at the optimum. The first term captures the loss in exploration that we would incur if fb were the true reward function, and is bounded as

(I) = Σ_π q(π)(fb(πfb) − fb(π)) / (λ + η(fb(πfb) − fb(π))) ≤ Σ_π q(π)/η = 1/η,

and the second term, as before, is at most

(II) ≤ ( E_{π∼p}[(fb(π) − f(π))²] )^{1/2} ≤ 1/(2γ) + (γ/2) · E_{π∼p}[(fb(π) − f(π))²].

The third term can be written as

(III) = f(πf) − fb(πf) − (fb(πfb) − fb(πf)) = ⟨θ − θb, ϕ(πf)⟩ − (fb(πfb) − fb(πf)),

where θ, θb ∈ Θ are parameters such that f(π) = ⟨θ, ϕ(π)⟩ and fb(π) = ⟨θb, ϕ(π)⟩. Defining Σ_p = E_{π∼p}[ϕ(π)ϕ(π)^⊤], we can bound

⟨θ − θb, ϕ(πf)⟩ = ⟨Σ_p^{1/2}(θ − θb), Σ_p^{−1/2}ϕ(πf)⟩ ≤ ∥Σ_p^{1/2}(θ − θb)∥₂ ∥Σ_p^{−1/2}ϕ(πf)∥₂ ≤ (γ/2)∥Σ_p^{1/2}(θ − θb)∥₂² + (1/(2γ))∥Σ_p^{−1/2}ϕ(πf)∥₂².

Note that ∥Σ_p^{1/2}(θ − θb)∥₂² = E_{π∼p}[(fb(π) − f(π))²] and ∥Σ_p^{−1/2}ϕ(πf)∥₂² = ⟨ϕ(πf), Σ_p^{−1}ϕ(πf)⟩, so we have

(III) ≤ (γ/2) · E_{π∼p}[(fb(π) − f(π))²] + (1/(2γ)) · ⟨ϕ(πf), Σ_p^{−1}ϕ(πf)⟩ − (fb(πfb) − fb(πf)),

and we denote (IV) := (1/(2γ)) · ⟨ϕ(πf), Σ_p^{−1}ϕ(πf)⟩ − (fb(πfb) − fb(πf)). To proceed, observe that

Σ_p ⪰ (1/2) Σ_π ( q̄(π) / (λ + η(fb(πfb) − fb(π))) ) ϕ(π)ϕ(π)^⊤
    ⪰ (1/2) Σ_π ( q̄(π) / (1 + η(fb(πfb) − fb(π))) ) ϕ(π)ϕ(π)^⊤ = (1/2) Σ_π q̄(π) ϕ̄(π)ϕ̄(π)^⊤ =: (1/2) Σ_{q̄}.

This means that we can bound

⟨ϕ(πf), Σ_p^{−1}ϕ(πf)⟩ ≤ 2⟨ϕ(πf), Σ_{q̄}^{−1}ϕ(πf)⟩ = 2(1 + η(fb(πfb) − fb(πf))) · ⟨ϕ̄(πf), Σ_{q̄}^{−1}ϕ̄(πf)⟩ ≤ 2d(1 + η(fb(πfb) − fb(πf))),

where the last inequality uses that q̄ is the G-optimal design for {ϕ̄(π)}_{π∈Π}. We conclude that

(IV) ≤ 2d/(2γ) + (2dη/(2γ)) · (fb(πfb) − fb(πf)) − (fb(πfb) − fb(πf)) = d/γ,

since 2dη/(2γ) = 1 when η = γ/d. Combining the bounds on (I), (II), (III), and (IV), we obtain

E_{π∼p}[ f(πf) − f(π) − γ · (f(π) − fb(π))² ] ≤ 1/η + 1/(2γ) + d/γ = 2d/γ + 1/(2γ) ≲ d/γ.

Remark 14: In fact, it can be shown [39] that when Θ = R^d, the exact minimizer of the DEC for linear bandits is given by

p = arg max_{p∈∆(Π)} { E_{π∼p}[fb(π)] + (1/(4γ)) log det(E_{π∼p}[ϕ(π)ϕ(π)^⊤]) }.

4.3.3 Nonparametric Bandits


For all of the examples so far, we have shown that

decγ(F) ≲ eff-dim(F, Π)/γ,

where eff-dim(F, Π) is some quantity that (informally) reflects the amount of exploration required for the class F under consideration (A for bandits, log₂(A) for the cheating code, and d for linear bandits). In general though, the Decision-Estimation Coefficient does not always shrink at a γ^{−1} rate, and can have slower decay for problems where the optimal rate is worse than √T. We now consider such a setting: a standard nonparametric bandit problem called Lipschitz bandits in metric spaces [14, 54].
We take Π to be a metric space equipped with metric ρ, and define
F = {f : Π → [0, 1] | f is 1-Lipschitz w.r.t ρ}.
We give a bound on the Decision-Estimation Coefficient which depends on the covering
number for the space Π (with respect to the metric ρ). Let us say that Π′ ⊆ Π is an ε-cover
with respect to ρ if
∀π ∈ Π ∃π ′ ∈ Π′ s.t. ρ(π, π ′ ) ≤ ε,
and let Nρ (Π, ε) denote the size of the smallest such cover.

Proposition 18 (DEC for Lipschitz Bandits): Consider the Lipschitz bandit setting, and suppose that there exists d > 0 such that Nρ(Π, ε) ≤ ε^{−d} for all ε > 0. Let fb : Π → [0, 1] and γ ≥ 1 be given and consider the following distribution:

1. Let Π′ ⊆ Π witness the covering number Nρ(Π, ε) for a parameter ε > 0.

2. Let p be the result of applying the inverse gap weighting strategy in (3.36) to fb, restricted to the (finite) decision space Π′.

By setting ε ∝ γ^{−1/(d+1)}, this strategy certifies that

decγ(F, fb) ≲ γ^{−1/(d+1)}.

Ignoring dependence on EstSq(F, T, δ), this result leads to regret bounds that scale as T^{(d+1)/(d+2)} (after tuning γ in Proposition 13), which cannot be improved.

Proof of Proposition 18. Let f ∈ F be fixed, and let Π′ be the ε-cover for Π. Since f is 1-Lipschitz, for all π ∈ Π there exists a corresponding covering element ι(π) ∈ Π′ such that ρ(π, ι(π)) ≤ ε, and consequently for any distribution p,

E_{π∼p}[f(πf) − f(π)] ≤ E_{π∼p}[f(ι(πf)) − f(π)] + |f(πf) − f(ι(πf))|
  ≤ E_{π∼p}[f(ι(πf)) − f(π)] + ρ(πf, ι(πf))
  ≤ E_{π∼p}[f(ι(πf)) − f(π)] + ε.

At this point, since ι(πf) ∈ Π′, Proposition 9 ensures that if we choose p using inverse gap weighting over Π′, we have

E_{π∼p}[f(ι(πf)) − f(π)] ≤ |Π′|/γ + γ · E_{π∼p}[(f(π) − fb(π))²].

From our assumption on the growth of Nρ(Π, ε), |Π′| ≤ ε^{−d}, so the value is at most

ε + ε^{−d}/γ.

We choose ε ∝ γ^{−1/(d+1)} to balance the terms, leading to the result.
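
As an illustration (our own sketch), for Π = [0, 1] with the usual metric (so d = 1), the strategy of Proposition 18 amounts to laying down a grid at scale ε ∝ γ^{−1/(d+1)} and running inverse gap weighting over the grid:

import numpy as np

def lipschitz_dec_distribution(fhat, gamma, d=1):
    # Discretize [0, 1] at scale eps ~ gamma^{-1/(d+1)} and apply IGW_gamma on the cover.
    eps = gamma ** (-1.0 / (d + 1))
    cover = np.arange(0.0, 1.0 + eps, eps)           # eps-cover of [0, 1]
    gaps = np.max(fhat(cover)) - fhat(cover)
    lo, hi = 1.0, float(len(cover))                  # bisection for the IGW normalizer
    for _ in range(100):
        lam = 0.5 * (lo + hi)
        lo, hi = (lam, hi) if np.sum(1.0 / (lam + gamma * gaps)) > 1.0 else (lo, lam)
    p = 1.0 / (lam + gamma * gaps)
    return cover, p / p.sum()

cover, p = lipschitz_dec_distribution(lambda x: 1.0 - np.abs(x - 0.3), gamma=100.0)
print(len(cover), np.round(p.max(), 3))              # cover size and mass on the greedy point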

4.3.4 Further Examples


We state the following additional upper bounds on the DEC without proof; details can be
found in [40].
Example 4.3 (Decision-Estimation Coefficient subsumes Eluder Dimension). Consider any
class F with values in [0, 1]. For all γ ≥ e, we have

decγ(F) ≲ inf_{ε>0} { ε + Edim(F − F, ε) · log²(γ)/γ + γ^{−1} }.    (4.24)

As a special case, this example implies that E2D enjoys a regret bound for generalized
linear bandits similar to that of UCB.
Example 4.4 (Bandits with Concave Rewards). The concave (or convex, if one considers
losses rather than rewards) bandit problem [53, 35, 4, 21, 23, 58] is a generalization of the
linear bandit. We take Π ⊆ Bd2 (1) and define

F = {f : Π → [0, 1] | f is concave and 1-Lipschitz w.r.t ℓ2 }.

For this setting, whenever F ⊆ (Π → [0, 1]), results of Lattimore [58] imply that

decγ(F) ≲ (d⁴/γ) · polylog(d, γ)    (4.25)

for all γ > 0. ◁

For the function class

F = { f(π) = −relu(⟨ϕ(π), θ⟩) | θ ∈ Θ ⊂ B^d_2(1) },

(4.25) leads to a √(poly(d) · T) regret bound for E2D. This highlights a case where the Eluder dimension is overly pessimistic, since we saw that it grows exponentially for this class.

4.4 Relationship to Optimism and Posterior Sampling


We close this section by highlighting some connections between the Decision-Estimation Coefficient and E2D on the one hand, and other techniques we have covered so far on the other: optimism (UCB) and posterior sampling. Additional connections to optimism can be found in Section 6.7.3.

4.4.1 Connection to Optimism


The E2D meta-algorithm and the Decision-Estimation Coefficient can be combined with the
idea of confidence sets that we used in the UCB algorithm. Consider the following variant
of E2D.
Estimation-to-Decisions (E2D) with Confidence Sets
Input: Exploration parameter γ > 0, confidence radius β > 0.
for t = 1, . . . , T do
  Obtain fbt from online regression oracle with (π1, r1), . . . , (πt−1, rt−1).
  Set

  F t = { f ∈ F : Σ_{i<t} E_{πi∼pi}[(fbi(πi) − f(πi))²] ≤ β }.

  Compute

  pt = arg min_{p∈∆(Π)} max_{f∈F t} E_{π∼p}[ f(πf) − f(π) − γ · (f(π) − fbt(π))² ].

  Select action πt ∼ pt.

This strategy is the same as the basic E2D algorithm, except that at each step, we compute a confidence set F t and modify the minimax problem so that the max player is restricted to choose f ∈ F t.¹³ With this change, the distribution pt can be interpreted as the minimizer for decγ(F t, fbt).
To analyze this algorithm, we show that as long as f ⋆ ∈ F t for all t, the same per-step
analysis as in Proposition 13 goes through, with F replaced by F t . This allows us to prove
the following result.
¹³ Note that compared to the confidence sets used in UCB, a slight difference is that we compute F t using the estimates fb1, . . . , fbT produced by the online regression oracle (this is sometimes referred to as "online-to-confidence-set conversion") as opposed to using ERM; this difference is unimportant, and the latter would work as well.

Proposition 19: For any δ ∈ (0, 1) and γ > 0, if we set β = EstSq (F, T, δ), then E2D
with confidence sets ensures that with probability at least 1 − δ,
Reg ≤ Σ_{t=1}^T decγ(F t) + γ · EstSq(F, T, δ).    (4.26)

This bound is never worse than the one in Proposition 13, but it can be smaller if the
confidence sets F 1 , . . . , F T shrink quickly. For a proof, see Exercise 9.

Remark 15: In fact, the regret bound in (4.26) can be shown to hold for any sequence
of confidence sets F 1 , . . . , F T , as long as f ⋆ ∈ F t ∀t with probability at least 1 −
δ; the specific construction we use within the E2D variant above is chosen only for
concreteness.

Relation to confidence width and UCB. It turns out that the usual UCB algorithm, which selects πt = arg max_{π∈Π} f̄t(π) for f̄t(π) = max_{f∈F t} f(π), certifies a bound on decγ(F t) which is never worse than the usual confidence width we use in the UCB analysis.

Proposition 20: The UCB strategy πt = arg max_{π∈Π} f̄t(π) certifies that

dec₀(F t) ≤ f̄t(πt) − f̲t(πt),    (4.27)

where f̲t(π) := min_{f∈F t} f(π) denotes the corresponding lower confidence bound.

Proof of Proposition 20. By choosing πt = arg max_{π∈Π} f̄t(π), we have that for any fb,

dec₀(F t, fb) = inf_{p∈∆(Π)} sup_{f∈F t} E_{π∼p}[ max_{π⋆} f(π⋆) − f(π) ]
  ≤ sup_{f∈F t} [ max_{π⋆} f(π⋆) − f(πt) ]
  ≤ sup_{f∈F t} [ max_{π⋆} f̄t(π⋆) − f(πt) ]
  = sup_{f∈F t} [ f̄t(πt) − f(πt) ] = f̄t(πt) − f̲t(πt).

As we saw in the analysis of UCB for multi-armed bandits with Π = {1, . . . , A} (Section 2.3),
the confidence width in (4.27) might be large for a given round t, but by the pigeonhole
argument (Lemma 8), when we sum over all rounds we have
Σ_{t=1}^T dec₀(F t) ≤ Σ_{t=1}^T ( f̄t(πt) − f̲t(πt) ) ≤ Õ(√(AT)).

Hence, even though UCB is not the optimal strategy to minimize the DEC, it can still lead
to upper bounds on regret when the confidence width shrinks sufficiently quickly. Of course,
as examples like the cheating code show, we should not expect this to happen in general.

Interestingly, the bound on the DEC in Proposition 20 holds for γ = 0, which only leads to meaningful bounds on regret because F 1, . . . , F T are shrinking. Indeed, Proposition 14 shows that with F = R^A, we have

decγ(F) ≳ A/γ,

so the unrestricted class F has decγ(F) → ∞ as γ → 0. By allowing for γ > 0, we can prove the following slightly stronger result, which replaces f̲t by fbt.

Proposition 21: For any γ > 0, the UCB strategy πt = arg max_{π∈Π} f̄t(π) certifies that

decγ(F t, fbt) ≤ f̄t(πt) − fbt(πt) + 1/(4γ).
Proof of Proposition 21. This is a slight generalization of the proof of Proposition 20. By choosing πt = arg max_{π∈Π} f̄t(π), we have

decγ(F t, fbt) = min_{p∈∆(Π)} max_{f∈F t} E_{π∼p}[ max_{π⋆} f(π⋆) − f(π) − γ · (f(π) − fbt(π))² ]
  ≤ max_{f∈F t} [ max_{π⋆} f(π⋆) − f(πt) − γ · (f(πt) − fbt(πt))² ]
  ≤ max_{f∈F t} [ f̄t(πt) − f(πt) − γ · (f(πt) − fbt(πt))² ]
  = max_{f∈F t} [ fbt(πt) − f(πt) − γ · (f(πt) − fbt(πt))² ] + f̄t(πt) − fbt(πt)
  ≤ 1/(4γ) + f̄t(πt) − fbt(πt),

where the last step uses that x − γx² ≤ 1/(4γ) for all x.

4.4.2 Connection to Posterior Sampling


The Decision-Estimation Coefficient (4.15) is a min-max optimization problem, which we
have mentioned can be interpreted as a game in which the learner (the “min” player) aims
to find a decision distribution p that optimally trades off regret and information acquisition
in the face of an adversary (the “max” player) that selects a worst-case model in M. We
can define a natural dual (or, max-min) analogue of the DEC via

dec̲γ(F, fb) = sup_{µ∈∆(F)} inf_{p∈∆(Π)} E_{f∼µ} E_{π∼p}[ f(πf) − f(π) − γ · (f(π) − fb(π))² ].    (4.28)

The dual Decision-Estimation Coefficient has the following Bayesian interpretation. The
adversary selects a prior distribution µ over models in M, and the learner (with knowledge
of the prior) finds a decision distribution p that balances the average tradeoff between regret
and information acquisition when the underlying model is drawn from µ.
Using the minimax theorem (Lemma 41), one can show that the Decision-Estimation
Coefficient and its Bayesian counterpart coincide.

Proposition 22 (Equivalence of primal and dual DEC): Under mild regularity conditions, we have

decγ(F, fb) = dec̲γ(F, fb).    (4.29)

Thus, any bound on the dual DEC immediately yields a bound on the primal DEC. This
perspective is useful because it allows us to bring existing tools for Bayesian bandits and
reinforcement learning to bear on the primal Decision-Estimation Coefficient. As an ex-
ample, we can adapt the posterior sampling/probability matching strategy introduced in Section 2: when applied to the Bayesian DEC, this approach selects p to be the action distribution induced by sampling f ∼ µ and selecting πf. Using Lemma 9, one can show that this strategy certifies that

dec̲γ(F) ≲ |Π|/γ
for the multi-armed bandit. In fact, existing analysis techniques for the Bayesian setting
can be viewed as implicitly providing bounds on the dual Decision-Estimation Coefficient
[75, 22, 21, 76, 59, 58]. Notably, the dual DEC is always bounded by a Bayesian complexity
measure known as the information ratio, which is used throughout the literature on Bayesian
bandits and reinforcement learning [40].
Beyond the primal and dual Decision-Estimation Coefficient, there are deeper connec-
tions between the DEC and Bayesian algorithms, including a Bayesian counterpart to the
E2D algorithm itself [40].

4.5 Incorporating Contexts⋆


The Decision-Estimation Coefficient and E2D algorithm trivially extend to handle contextual
structured bandits. This approach generalizes the SquareCB method introduced in Section 3
from finite action spaces to general action spaces. Consider the following protocol.
Contextual Structured Bandit Protocol
for t = 1, . . . , T do
Observe context xt ∈ X .
Select decision π t ∈ Π. // Π is large and potentially continuous.
Observe reward rt ∈ R.

This is the same as the contextual bandit protocol in Section 3, except that we allow Π to
be large and potentially continuous. As in that section, we allow the contexts x1 , . . . , xT to
be generated in an arbitrary, potentially adversarial fashion, but assume that
rt ∼ M⋆(· | xt, πt),

and define f⋆(x, π) = E_{r∼M⋆(·|x,π)}[r]. We assume access to a function class F such that f⋆ ∈ F, and assume access to an estimation oracle for F that ensures that with probability at least 1 − δ,

Σ_{t=1}^T E_{πt∼pt}[(fbt(xt, πt) − f⋆(xt, πt))²] ≤ EstSq(F, T, δ).

For f ∈ F, we define πf(x) = arg max_{π∈Π} f(x, π).
To extend the E2D algorithm to this setting, at each time t we solve the minimax
problem corresponding to the DEC, but condition on the context xt .

Estimation-to-Decisions (E2D) for Contextual Structured Bandits
Input: Exploration parameter γ > 0.
for t = 1, . . . , T do
  Observe xt ∈ X.
  Obtain fbt from online regression oracle with (x1, π1, r1), . . . , (xt−1, πt−1, rt−1).
  Compute

  pt = arg min_{p∈∆(Π)} max_{f∈F} E_{π∼p}[ f(xt, πf(xt)) − f(xt, π) − γ · (f(xt, π) − fbt(xt, π))² ].

  Select action πt ∼ pt.

For x ∈ X , define
F(x, ·) = {f (x, ·) | f ∈ F}
as the projection of F onto x ∈ X . The following result shows that whenever the DEC is
bounded conditionally—that is, whenever it is bounded for F(x, ·) for all x—this strategy
has low regret.

Proposition 23: The E2D algorithm with exploration parameter γ > 0 guarantees
that
Reg ≤ sup_{x∈X} decγ(F(x, ·)) · T + γ · EstSq(F, T, δ).    (4.30)

We omit the proof of this result, which is nearly identical to that of Proposition 13. The
basic idea is that for each round, once we condition on the context xt, the DEC allows us to link regret to estimation error in the same fashion as in the non-contextual setting.
We showed in Proposition 14 that the IGW distribution exactly solves the DEC minimax
problem when F = RA . Hence, the SquareCB algorithm in Section 3 is precisely the special
case of Contextual E2D in which F = RA .
Going beyond the finite-action setting, it is simplest to interpret Proposition 23 when
F(x, ·) has the same structure for all contexts. One example is contextual bandits with
linearly structured action spaces. Here, we take

F = {f (x, a) = ⟨ϕ(x, a), g(x)⟩ | g ∈ G},

where ϕ(x, a) ∈ Rd is a fixed feature map and G ⊂ (X → Bd2 (1)) is an arbitrary func-
tion class. This setting generalizes the linear contextual bandit problem from Section 3,
which corresponds to the case where G is a set of constant functions. We can apply
Proposition 17 to conclude that sup_{x∈X} decγ(F(x, ·)) ≲ d/γ, so that Proposition 23 gives Reg ≲ √(dT · EstSq(F, T, δ)).

4.6 Additional Properties of the Decision-Estimation Coefficient⋆


The following proposition indicates that the value of the Decision-Estimation Coefficient
decγ(F, fb) cannot be increased by taking reference models fb outside the convex hull of
F:

Proposition 24: For any γ > 0,

sup_{fb:Π→R} decγ(F, fb) = sup_{fb∈co(F)} decγ(F, fb).

4.7 Exercises

Exercise 8 (Posterior Sampling for Multi-Armed Bandits): Prove that for the standard multi-armed bandit,

dec̲γ(F) ≲ |Π|/γ,

by using the Posterior Sampling strategy (select p to be the action distribution induced by sampling f ∼ µ and selecting πf), and applying the decoupling lemma (Lemma 9). Recall that here, dec̲γ(F) is the "maxmin" version of the DEC (4.28).

Exercise 9: Prove Proposition 19.

Exercise 10: In this exercise, we will prove Proposition 24 as follows. First, show that the
left-hand side is an upper bound on the right-hand side. For the other direction:
1. Prove that

inf_{fb∈co(F)} E_{f∼µ} E_{π∼p}[(f(π) − fb(π))²] ≤ inf_{fb:Π→R} E_{f∼µ} E_{π∼p}[(f(π) − fb(π))²].    (4.31)

2. Use the Minimax Theorem (Lemma 41 in Appendix A.3) to conclude Proposition 24.

5. REINFORCEMENT LEARNING: BASICS

We now introduce the framework of reinforcement learning, which encompasses a rich set
of dynamic, stateful decision making problems. Consider the task of repeated medical
treatment assignment, depicted in Figure 4. To make the setting more realistic, it is natural
to allow the decision-maker to apply multi-stage strategies rather than simple one-shot decisions
such as “prescribe a painkiller.” In principle, in the language of structured bandits, nothing
is preventing us from having each decision π t be a complex multi-stage treatment strategy
that, at each stage, acts on the patient’s dynamic state, which evolves as a function of
the treatments at previous stages. As an example, intermediate actions of the type “if
patient’s blood pressure is above X then do Y” can form a decision tree that defines the
complex strategy π t . Methods from the previous lectures provide guarantees for such a
setting, as long as we have a succinct model of expected rewards. What sets RL apart from
structured bandits is the additional information about the intermediate state transitions
and intermediate rewards. This information facilitates credit assignment, the mechanism for recognizing which of the actions caused the overall (composite) decision to be good or bad.
This extra information can reduce what would otherwise be exponential sample complexity
in terms of the number of stages, states, and actions in multi-stage decision making.

This section is structured as follows. We first present the formal reinforcement learning framework and basic principles, including Bellman optimality and dynamic programming, which facilitate efficient computation of optimal decisions when the environment is known. We then consider the case in which the environment is unknown, and give algorithms for perhaps the simplest reinforcement learning setting, tabular reinforcement learning, where the state and action spaces are finite. Algorithms for more complex reinforcement learning settings are given in Section 6.

5.1 Finite-Horizon Episodic MDP Formulation


We consider an episodic finite-horizon reinforcement learning framework. With H denoting the horizon, a Markov Decision Process (MDP) M takes the form

M = ( S, A, {P_h^M}_{h=1}^H, {R_h^M}_{h=1}^H, d1 ),

where S is the state space, A is the action space,

PhM : S × A → ∆(S)

is the probability transition kernel at step h,

RhM : S × A → ∆(R)

is the reward distribution, and d1 ∈ ∆(S) is the initial state distribution. We allow the
reward distribution and transition kernel to vary across MDPs, but assume for simplicity
that the initial state distribution is fixed and known.
For a fixed MDP M , an episode proceeds under the following protocol. At the beginning
of the episode, the learner selects a randomized, non-stationary policy

π = (π1 , . . . , πH ),

where πh : S → ∆(A); we let Πrns for “randomized, non-stationary” denote the set of
all such policies. The episode then evolves through the following process, beginning from
s1 ∼ d1 . For h = 1, . . . , H:
• ah ∼ πh (sh ).

• rh ∼ RhM (sh , ah ) and sh+1 ∼ PhM (sh , ah ).


For notational convenience, we take sH+1 to be a deterministic terminal state. The Markov
property refers to the fact that under this evolution,

PM (sh+1 = s′ | sh , ah ) = PM (sh+1 = s′ | sh , ah , sh−1 , ah−1 , . . . , s1 , a1 ).

The value for a policy π under M is given by

f^M(π) := E^{M,π}[ Σ_{h=1}^H r_h ],    (5.1)

where E^{M,π}[·] denotes expectation under the process above. We define an optimal policy for model M as

π_M ∈ arg max_{π∈Πrns} f^M(π).    (5.2)
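
As a concrete reference point (our own toy example, not from the text), the following sketch stores a tabular finite-horizon MDP as arrays and estimates f^M(π) for a randomized, non-stationary policy by rolling out episodes according to the protocol above.

import numpy as np

rng = np.random.default_rng(0)
S, A, H = 3, 2, 4
P = rng.dirichlet(np.ones(S), size=(H, S, A))        # P[h, s, a] in Delta(S)
Rmean = rng.uniform(size=(H, S, A))                  # mean rewards; noise added below
d1 = np.array([1.0, 0.0, 0.0])                       # initial state distribution

def rollout(pi):
    # Roll out one episode with a randomized, non-stationary policy pi[h, s] in Delta(A).
    s = rng.choice(S, p=d1)
    total = 0.0
    for h in range(H):
        a = rng.choice(A, p=pi[h, s])
        total += Rmean[h, s, a] + 0.1 * rng.normal()  # r_h ~ R_h^M(s, a)
        s = rng.choice(S, p=P[h, s, a])               # s_{h+1} ~ P_h^M(s, a)
    return total

pi_unif = np.full((H, S, A), 1.0 / A)
print(np.mean([rollout(pi_unif) for _ in range(5000)]))  # Monte Carlo estimate of f^M(pi)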

Value functions. Maximization in (5.2) is a daunting task, since each policy π is a complex multi-stage object. It is useful to define intermediate "reward-to-go" functions to start breaking this complex task into smaller sub-tasks. Specifically, for a given model M and policy π, we define the state-action value function and state value function via

Q_h^{M,π}(s, a) = E^{M,π}[ Σ_{h′=h}^H r_{h′} | s_h = s, a_h = a ],  and  V_h^{M,π}(s) = E^{M,π}[ Σ_{h′=h}^H r_{h′} | s_h = s ].

Hence, the definition in (5.1) reads

f^M(π) = E_{s∼d1, a∼π1(s)}[ Q_1^{M,π}(s, a) ] = E_{s∼d1}[ V_1^{M,π}(s) ].

Online RL. For reinforcement learning, our main focus will be on what is called the
online reinforcement learning problem, in which we interact with an unknown MDP M ⋆ for
T episodes. For each episode t = 1, . . . , T , the learner selects a policy π t ∈ Πrns . The policy
is executed in the MDP M ⋆ , and the learner observes the resulting trajectory

τ^t = (s_1^t, a_1^t, r_1^t), . . . , (s_H^t, a_H^t, r_H^t).

The goal is to minimize the total regret


Σ_{t=1}^T E_{πt∼pt}[ f^{M⋆}(π_{M⋆}) − f^{M⋆}(πt) ]    (5.3)

against the optimal policy πM ⋆ for M ⋆ .


The online RL framework is a strict generalization of (structured) bandits and contextual
bandits (with i.i.d. contexts). Indeed, if S = {s0 } and H = 1, each episode amounts to
choosing an action at ∈ A and observing a reward rt with mean f M (at ), which is precisely
a bandit problem. On the other hand, taking S = X and H = 1 puts us in the setting of
contextual bandits, with d1 being the distribution of contexts. In both cases, the notion of
regret (5.3) coincides with the notion of regret in the respective setting.
We mention in passing that many alternative formulations for Markov decision processes
and for the reinforcement learning problem appear throughout the literature. For example,
MDPs can be studied with infinite horizon (with or without discounting), and an alternative
to minimizing regret is to consider PAC-RL which aims to minimize the sub-optimality of
a final output policy produced after exploring for T rounds.

5.2 Planning via Dynamic Programming


In some reinforcement learning problems, it is natural to assume that the true MDP M ⋆ is
known. This may be the case with games, such as chess or backgammon, where transition
probabilities are postulated by the game itself. In other settings, such as robotics or medical
treatment, the agent interacts with an unknown M ⋆ and needs to learn at least some aspects
of this environment. The online reinforcement learning problem described above falls in the
latter category. Before attacking the learning problem, we need to understand the structure
of solutions to (5.2) in the case where M ⋆ is known to the decision-maker. In this section,
we show that the problem of maximizing f M (π) over π ∈ Π in a known model M (known
as planning) can be solved efficiently via the principle of dynamic programming. Dynamic

programming can be viewed as solving the problem of credit assignment by breaking down
a complex multi-stage decision (policy) into a sequence of small decisions.
We start by observing that the optimal policy πM in (5.2) may not be uniquely defined.
For instance, if d1 assigns zero probability to some state s1 , the behavior of πM on this state
is immaterial. In what follows, we introduce a fundamental result, Proposition 25, which
guarantees existence of an optimal policy πM = (πM,1 , . . . , πM,H ) that maximizes V1M ,π (s)
over π ∈ Πrns for all states s ∈ S simultaneously (rather than just on average, as in (5.2)).
The fact that such a policy exists may seem magical at first, but it is rather straightforward.
Indeed, if πM,h(s) is defined for all s ∈ S and h = 2, . . . , H, then defining the optimal πM,1(s) at any s is a matter of greedily choosing an action that maximizes the sum of the expected immediate reward and the remaining expected reward under the optimal policy. This observation is Bellman's principle of optimality [17].

To state the result formally, we introduce the optimal value functions as

Q_h^{M,⋆}(s, a) = max_{π∈Πrns} E^{M,π}[ Σ_{h′=h}^H r_{h′} | s_h = s, a_h = a ]  and  V_h^{M,⋆}(s) = max_a Q_h^{M,⋆}(s, a)    (5.4)

for all s ∈ S, a ∈ A, and h ∈ [H]; we adopt the convention that V_{H+1}^{M,⋆}(s) = Q_{H+1}^{M,⋆}(s, a) = 0. Since these optimal values are separate maximizations for each s, a, h, it is reasonable to ask whether there exists a single policy that maximizes all these value functions simultaneously. Indeed, the following result shows that there exists πM such that for all s, a, h,

Q_h^{M,⋆}(s, a) = Q_h^{M,πM}(s, a),  and  V_h^{M,⋆}(s) = V_h^{M,πM}(s).    (5.5)

Proposition 25 (Bellman Optimality): The optimal value function (5.4) for MDP M can be computed via V_{H+1}^{M,πM}(s) := 0, and for each s ∈ S,

V_h^{M,πM}(s) = max_{a∈A} E^M[ r_h + V_{h+1}^{M,πM}(s_{h+1}) | s_h = s, a_h = a ].    (5.6)

The optimal policy is given by

πM,h(s) ∈ arg max_{a∈A} E^M[ r_h + V_{h+1}^{M,πM}(s_{h+1}) | s_h = s, a_h = a ].    (5.7)

Equivalently, for all s ∈ S, a ∈ A,

Q_h^{M,πM}(s, a) = E^M[ r_h + max_{a′∈A} Q_{h+1}^{M,πM}(s_{h+1}, a′) | s_h = s, a_h = a ],    (5.8)

and the optimal policy is given by

πM,h(s) ∈ arg max_{a∈A} Q_h^{M,πM}(s, a).    (5.9)

The update in (5.8) is referred to as value iteration (VI). It is useful to introduce a more succinct notation for this update. For an MDP M, define the Bellman operators T_1^M, . . . , T_H^M via

[T_h^M Q](s, a) = E_{s_{h+1}∼P_h^M(s,a), r_h∼R_h^M(s,a)}[ r_h + max_{a′∈A} Q(s_{h+1}, a′) ]    (5.10)

for any Q : S × A → R. Going forward, we will write the expectation above more succinctly as

[T_h^M Q](s, a) = E^M[ r_h + max_{a′∈A} Q(s_{h+1}, a′) | s_h = s, a_h = a ].    (5.11)

In the language of Bellman operators, (5.8) can be written as

Q_h^{M,πM} = T_h^M Q_{h+1}^{M,πM}.    (5.12)
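
The backward recursion (5.8)/(5.12) is straightforward to implement for a tabular MDP with known transitions and mean rewards; the following sketch (ours, on a toy instance) computes the optimal value functions together with the greedy optimal policy from (5.9).

import numpy as np

rng = np.random.default_rng(0)
S, A, H = 3, 2, 4
P = rng.dirichlet(np.ones(S), size=(H, S, A))   # P[h, s, a] in Delta(S)
Rmean = rng.uniform(size=(H, S, A))             # mean rewards

Q = np.zeros((H + 1, S, A))                     # Q[H] = 0 by convention
for h in reversed(range(H)):
    Vnext = Q[h + 1].max(axis=1)                # V_{h+1}(s') = max_a Q_{h+1}(s', a)
    Q[h] = Rmean[h] + P[h] @ Vnext              # Bellman backup T_h^M Q_{h+1}
pi_star = Q[:H].argmax(axis=2)                  # greedy policy pi_{M,h}(s), as in (5.9)
print(Q[0].max(axis=1))                         # V_1^{M,*}(s) for each initial state s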

5.3 Failure of Uniform Exploration


The task of planning using dynamic programming—which requires knowledge of the MDP—
is fairly straightforward, at least if we disregard the computational concerns. In this course,
however, we are interested in the problem of learning to make decisions in the face of an
unknown environment. Minimizing regret in an unknown MDP requires exploration. As
the next example shows, exploration in MDPs is a more delicate issue than in bandits.
Recall that ε-Greedy, a simple algorithm, is a reasonable solution for bandits and contextual bandits, albeit with a suboptimal rate (T^{2/3} as opposed to √T). The next (classical) example, a so-called "combination lock," shows that such a strategy can be disastrous in reinforcement learning, as it leads to exponential (in the horizon H) regret.

Consider the MDP depicted in Figure 8, with H + 2 states, two actions ag and ab, and a starting state 1. The "good" action ag deterministically leads to the next state in the chain, while the "bad" action deterministically leads to a terminal state. The only place where a non-zero reward can be received is the last state H, if the good action is chosen. The starting state is 1, and so the only way to receive non-zero reward is to select ag for all of the H time steps within the episode. Since the length of the episode is also H, selecting actions uniformly brings no information about the optimal sequence of actions, unless by chance all of the actions sampled happen to be good; the probability that this occurs is exponentially small in H. This means that T needs to be at least of order 2^H to achieve nontrivial regret, and highlights the need for more strategic exploration.
Figure 8: Combination Lock MDP.
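
A quick simulation (ours) illustrates the failure of uniform exploration on this instance: an episode collects reward only if the uniform policy happens to choose ag at every one of the H steps, which occurs with probability 2^{−H}.

import numpy as np

def uniform_episode_reward(H, rng):
    # In the combination lock of Figure 8, the only rewarding trajectory plays a_g
    # at every one of the H steps; uniform play succeeds with probability 2^{-H}.
    return 1.0 if np.all(rng.integers(2, size=H) == 1) else 0.0

rng = np.random.default_rng(0)
H, T = 12, 10000
hits = sum(uniform_episode_reward(H, rng) for _ in range(T))
print(hits / T, 2.0 ** (-H))   # empirical success rate vs. the 2^{-H} prediction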

Given the failure of ε-Greedy for this example, one can ask whether other algorithmic
principles also fail. As we will show now, the principle of optimism succeeds, and an
analogue of the UCB method yields a regret bound that is polynomial in the parameters
|S|, |A|, and H. Before diving into the details, we present a collection of standard tools for
analysis in MDPs, which will find use throughout the remainder of the lecture notes.

5.4 Analysis Tools


One of the most basic tools employed in the analysis of reinforcement learning algorithms is
the performance difference lemma, which expresses the difference in values for two policies
in terms of differences in single-step decisions made by the two policies. The simple proof,
stated below, proceeds by successively changing one policy into the other and keeping track of
the ensuing differences in expected rewards. One may also interpret this lemma as a version
of the credit assignment mechanism.
Henceforth, we adopt the following simplified notation. When a policy π is applied to
the random variable sh , we drop the subscript h and write π(sh ) instead of πh (sh ), whenever
this does not cause confusion.

Lemma 13 (Performance Difference Lemma): For any s ∈ S, and π, π′ ∈ Πrns,

V_1^{M,π′}(s) − V_1^{M,π}(s) = Σ_{h=1}^H E^{M,π}[ Q_h^{M,π′}(s_h, π′(s_h)) − Q_h^{M,π′}(s_h, a_h) | s_1 = s ].    (5.13)

Proof of Lemma 13. Fix a pair of policies π, π′ and define

π^h = (π_1, . . . , π_{h−1}, π′_h, . . . , π′_H),

with π^1 = π′ and π^{H+1} = π. By telescoping, we can write

V_1^{M,π′}(s) − V_1^{M,π}(s) = Σ_{h=1}^H ( V_1^{M,π^h}(s) − V_1^{M,π^{h+1}}(s) ).    (5.14)

Observe that for each h, we have

V_1^{M,π^h}(s) − V_1^{M,π^{h+1}}(s) = E^{M,π^h}[ Σ_{t=1}^H r_t | s_1 = s ] − E^{M,π^{h+1}}[ Σ_{t=1}^H r_t | s_1 = s ].    (5.15)

Here, one process evolves according to (M, π^h) and the other evolves according to (M, π^{h+1}). The two processes only differ in the action taken once the state s_h is reached: in the former, the action π′(s_h) is taken, whereas in the latter it is π(s_h). Hence, (5.15) is equal to

E_{s_h∼(M,π)}[ Q_h^{M,π′}(s_h, π′(s_h)) − Q_h^{M,π′}(s_h, π(s_h)) | s_1 = s ],    (5.16)

which can be written as

E^{M,π}[ Q_h^{M,π′}(s_h, π′(s_h)) − Q_h^{M,π′}(s_h, a_h) | s_1 = s ].    (5.17)

Summing over h = 1, . . . , H and using (5.14) yields (5.13).
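
Since both sides of (5.13) can be computed exactly for a small tabular MDP, the identity is easy to verify numerically; the following sketch (our own toy instance) evaluates Q^{M,π′} by backward recursion, computes the state occupancies of π, and checks that the two sides agree.

import numpy as np

rng = np.random.default_rng(0)
S, A, H = 3, 2, 4
P = rng.dirichlet(np.ones(S), size=(H, S, A))      # P[h, s, a] in Delta(S)
R = rng.uniform(size=(H, S, A))                    # mean rewards
s0 = 0

def evaluate(pi):
    # Backward recursion for Q^{M,pi} and V^{M,pi}.
    Q, V = np.zeros((H + 1, S, A)), np.zeros((H + 1, S))
    for h in reversed(range(H)):
        Q[h] = R[h] + P[h] @ V[h + 1]
        V[h] = np.sum(pi[h] * Q[h], axis=1)
    return Q, V

def occupancies(pi):
    # d[h, s] = P^{M,pi}(s_h = s | s_1 = s0).
    d = np.zeros((H, S)); d[0, s0] = 1.0
    for h in range(H - 1):
        d[h + 1] = np.einsum("s,sa,sat->t", d[h], pi[h], P[h])
    return d

pi = rng.dirichlet(np.ones(A), size=(H, S))        # two arbitrary randomized policies
pi2 = rng.dirichlet(np.ones(A), size=(H, S))       # pi2 plays the role of pi'
Q2, V2 = evaluate(pi2)
_, V1 = evaluate(pi)
d = occupancies(pi)

lhs = V2[0, s0] - V1[0, s0]
rhs = sum(np.sum(d[h] * (V2[h] - np.sum(pi[h] * Q2[h], axis=1))) for h in range(H))
print(lhs, rhs)                                    # the two sides of (5.13) agree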

In contrast to the performance difference lemma, which relates the values of two policies
under the same MDP, the next result relates the performance of the same policy under two
different MDPs. Specifically, the difference in initial value for two MDPs is decomposed
into a sum of errors between layer-wise value functions.

Lemma 14 (Bellman residual decomposition): For any pair of MDPs M = (P^M, R^M) and M̂ = (P^{M̂}, R^{M̂}), any s ∈ S, and any policy π ∈ Πrns,

V_1^{M̂,π}(s) − V_1^{M,π}(s) = Σ_{h=1}^H E^{M,π}[ Q_h^{M̂,π}(s_h, a_h) − r_h − V_{h+1}^{M̂,π}(s_{h+1}) | s_1 = s ].    (5.18)

Hence, for M, M̂ with the same initial state distribution,

f^{M̂}(π) − f^M(π) = Σ_{h=1}^H E^{M,π}[ Q_h^{M̂,π}(s_h, a_h) − r_h − V_{h+1}^{M̂,π}(s_{h+1}) ].    (5.19)

In addition, for any MDP M and function Q = (Q_1, . . . , Q_H, Q_{H+1}) with Q_{H+1} ≡ 0, letting π_{Q,h}(s) = arg max_{a∈A} Q_h(s, a), we have

max_{a∈A} Q_1(s, a) − V_1^{M,π_Q}(s) = Σ_{h=1}^H E^{M,π_Q}[ Q_h(s_h, a_h) − [T_h^M Q_{h+1}](s_h, a_h) | s_1 = s ],    (5.20)

and, hence,

E_{s_1∼d_1}[ max_{a∈A} Q_1(s_1, a) ] − f^M(π_Q) = Σ_{h=1}^H E^{M,π_Q}[ Q_h(s_h, a_h) − [T_h^M Q_{h+1}](s_h, a_h) ].    (5.21)

Note that for the second part of Lemma 14, Q = (Q1 , . . . , QH ) can be any sequence of
functions, and need not be a value function corresponding to a particular policy or MDP.
It is worth noting that Q gives rise to the greedy policy πQ , which, in turn, gives rise to
QM ,πQ (the value of πQ in model M ), but it may well be the case that QM ,πQ ̸= Q.

Proof of Lemma 14. We will prove (5.19), and omit the proof of (5.18), which is similar but more verbose. We have

Σ_{h=1}^{H} E^{M,π}[ Q_h^{M̂,π}(s_h, a_h) − r_h − V_{h+1}^{M̂,π}(s_{h+1}) ]
  = Σ_{h=1}^{H} E^{M,π}[ Q_h^{M̂,π}(s_h, a_h) − V_{h+1}^{M̂,π}(s_{h+1}) ] − E^{M,π}[ Σ_{h=1}^{H} r_h ]
  = Σ_{h=1}^{H} E^{M,π}[ Q_h^{M̂,π}(s_h, a_h) − V_{h+1}^{M̂,π}(s_{h+1}) ] − f^{M}(π).

On the other hand, since V_h^{M̂,π}(s) = E_{a∼π_h(s)}[ Q_h^{M̂,π}(s, a) ], a telescoping argument yields

Σ_{h=1}^{H} E^{M,π}[ Q_h^{M̂,π}(s_h, a_h) − V_{h+1}^{M̂,π}(s_{h+1}) ] = Σ_{h=1}^{H} E^{M,π}[ V_h^{M̂,π}(s_h) − V_{h+1}^{M̂,π}(s_{h+1}) ]
  = E^{M,π}[ V_1^{M̂,π}(s_1) ] − E^{M,π}[ V_{H+1}^{M̂,π}(s_{H+1}) ] = f^{M̂}(π),

where we have used that V_{H+1}^{M̂,π} ≡ 0, and that both MDPs have the same initial state distribution. Combining the two displays proves (5.19).

We prove (5.21) (omitting the proof of (5.20)) using a similar argument. We have

Σ_{h=1}^{H} E^{M,π_Q}[ Q_h(s_h, a_h) − r_h − max_{a∈A} Q_{h+1}(s_{h+1}, a) ]
  = Σ_{h=1}^{H} E^{M,π_Q}[ Q_h(s_h, a_h) − max_{a∈A} Q_{h+1}(s_{h+1}, a) ] − E^{M,π_Q}[ Σ_{h=1}^{H} r_h ]
  = Σ_{h=1}^{H} E^{M,π_Q}[ Q_h(s_h, a_h) − max_{a∈A} Q_{h+1}(s_{h+1}, a) ] − f^{M}(π_Q).

Since a_{h+1} = π_{Q,h+1}(s_{h+1}) = argmax_{a∈A} Q_{h+1}(s_{h+1}, a), we have

E^{M,π_Q}[ Q_h(s_h, a_h) − max_{a∈A} Q_{h+1}(s_{h+1}, a) ] = E^{M,π_Q}[ Q_h(s_h, a_h) − Q_{h+1}(s_{h+1}, a_{h+1}) ],

and the result follows by telescoping, together with the fact that [T_h^M Q_{h+1}](s_h, a_h) = E[ r_h + max_{a∈A} Q_{h+1}(s_{h+1}, a) | s_h, a_h ].

Another similar analysis tool for MDPs, the simulation lemma, is deferred to Section 6
(Lemma 23). This result can be proven as a consequence of Lemma 14.

5.5 Optimism
To develop algorithms for regret minimization in unknown MDPs, we turn to the principle
of optimism, which we have seen is successful in tackling multi-armed bandits and linear
bandits (in small dimension). Recall that for bandits, Lemma 7 gave a way to decompose
the regret of optimistic algorithms into the widths of confidence intervals. What is the analogue
of Lemma 7 for MDPs? Thinking of optimistic estimates at the level of expected rewards for
policies π is unwieldy, and we need to dig into the structure of these multi-stage decisions. In
particular, the approach we employ is to construct a sequence of optimistic value functions
Q1 , . . . , QH which are guaranteed to over-estimate the optimal value function QM ,⋆ . For
multi-armed bandits, implementing optimism amounted to adding “bonuses,” constructed
from past data, to estimates for the reward function. We will construct optimistic value
functions in a similar fashion. Before giving the construction, we introduce a technical
lemma, which quantifies the error in using such optimistic estimates in terms of Bellman
residuals; Bellman residuals measure self-consistency of the optimistic estimates under the
application of the Bellman operator.

Lemma 15 (Error decomposition for optimistic policies): Let Q_1, …, Q_H be a sequence of functions Q_h : S × A → R with the property that for all (s, a),

Q_h^{M,⋆}(s, a) ≤ Q_h(s, a),    (5.22)

and set Q_{H+1} ≡ 0. Let π̂ = (π̂_1, …, π̂_H) be such that π̂_h(s) = argmax_a Q_h(s, a). Then for all s ∈ S,

V_1^{M,⋆}(s) − V_1^{M,π̂}(s) ≤ Σ_{h=1}^{H} E^{M,π̂}[ (Q_h − T_h^M Q_{h+1})(s_h, π̂(s_h)) | s_1 = s ].    (5.23)

Lemma 15 tells us that closeness of Q_h to the Bellman backup T_h^M Q_{h+1} implies closeness of π̂ to π_M in terms of the value. As a sanity check, if Q_h = Q_h^{M,⋆}, the right-hand side of (5.23) is zero, since Q_h^{M,⋆} = T_h^M Q_{h+1}^{M,⋆}. Crucially, errors do not accumulate too fast as a function of the horizon. This fact should not be taken for granted: in general, if Q is not optimistic, it could be the case that small changes in Q_h exponentially degrade the quality of the policy π̂.
Another important aspect of the decomposition (5.23) is the on-policy nature of the terms in the sum. Observe that the law of s_h for each of the terms is given by executing π̂ in model M. The distribution of s_h is often referred to as the roll-in distribution; when this distribution is induced by the policy executed by the algorithm, we may have better control of the error than in the off-policy case, where the roll-in distribution is given by π_M or another unknown policy.

Proof of Lemma 15. Let V̄_h(s) := max_{a∈A} Q_h(s, a). Just as in the proof of Lemma 7, the assumption that Q_h is "optimistic" implies that

Q_h^{M,⋆}(s_h, π_M(s_h)) ≤ Q_h(s_h, π_M(s_h)) ≤ Q_h(s_h, π̂(s_h)),

and, hence, V_1^{M,⋆}(s) ≤ V̄_1(s). Then, (5.20) applied to Q and π_Q = π̂ states that

V̄_1(s) − V_1^{M,π̂}(s) = Σ_{h=1}^{H} E^{M,π̂}[ Q_h(s_h, a_h) − [T_h^M Q_{h+1}](s_h, a_h) | s_1 = s ].    (5.24)

Combining the two displays, and noting that a_h = π̂(s_h) under E^{M,π̂}, yields (5.23).
h=1

Remark 16: In fact, the proof of Lemma 15 only uses that the initial value Q1 is opti-
mistic. However, to construct a value function with this property, the algorithms we con-
sider will proceed by backwards induction, producing optimistic estimates Q1 , . . . , QH
in the process.

5.6 The UCB-VI Algorithm for Tabular MDPs


We now instantiate the principle of optimism to give regret bounds for online reinforcement
learning in tabular MDPs. Tabular RL may be thought of as an analogue of finite-armed

bandits: we assume no structure across states and actions, but require that the state and
action spaces are small. The regret bounds we present will depend polynomially on S = |S|
and A = |A|, as well as the horizon H.

Preliminaries. For simplicity, we assume that the reward function is known to the
learner, so that only the transition probabilities are unknown. This does not change the
difficulty of the problem in a meaningful way, but allows us to keep notation light.

Assumption 6: Rewards are deterministic, bounded, and known to the learner: R_h^M(s, a) = δ_{r_h(s,a)} for a known function r_h : S × A → [0, 1], for all M. In addition, assume for simplicity that V_1^{M,⋆}(s) ∈ [0, 1] for all s ∈ S.

Define, with a slight abuse of notation,

n_h^t(s, a) = Σ_{i=1}^{t−1} I{ (s_h^i, a_h^i) = (s, a) },  and  n_h^t(s, a, s′) = Σ_{i=1}^{t−1} I{ (s_h^i, a_h^i, s_{h+1}^i) = (s, a, s′) },

as the empirical state-action and state-action-next-state frequencies. We can estimate the transition probabilities via

P̂_h^t(s′ | s, a) = n_h^t(s, a, s′) / n_h^t(s, a).    (5.25)

The UCB-VI algorithm. The following algorithm, UCB-VI ("Upper Confidence Bound Value Iteration") [16], combines the notion of optimism with dynamic programming.

UCB-VI
for t = 1, . . . , T do
  Set V̄_{H+1}^t ≡ 0.
  for h = H, . . . , 1 do
    Update n_h^t(s, a), n_h^t(s, a, s′), and b_{h,δ}^t(s, a), for all (s, a) ∈ S × A.
      // b_{h,δ}^t(s, a) is a bonus computed in (5.27).
    Compute:
      Q_h^t(s, a) = { r_h(s, a) + E_{s′∼P̂_h^t(·|s,a)}[ V̄_{h+1}^t(s′) ] + b_{h,δ}^t(s, a) } ∧ 1.    (5.26)
    Set V̄_h^t(s) = max_{a∈A} Q_h^t(s, a) and π̂_h^t(s) = argmax_{a∈A} Q_h^t(s, a).
  Collect trajectory (s_1^t, a_1^t, r_1^t), . . . , (s_H^t, a_H^t, r_H^t) according to π̂^t.
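
To complement the pseudocode, here is a minimal Python sketch of the UCB-VI loop under Assumption 6. The simulator interface (env.reset(), env.step(h, s, a) returning the next state, and env.rewards holding the known rewards) is an assumption made purely for illustration, and the constants in the bonus are not tuned; this is a sketch of the structure, not a reference implementation.

# Minimal UCB-VI sketch (assumes a simulator `env`; interface and names are illustrative).
import numpy as np

def ucb_vi(env, S, A, H, T, delta):
    n_sa = np.zeros((H, S, A))                 # visit counts n_h^t(s, a)
    n_sas = np.zeros((H, S, A, S))             # counts n_h^t(s, a, s')
    r = env.rewards                            # known rewards r_h(s, a), shape (H, S, A), in [0, 1]
    episode_rewards = []
    for t in range(T):
        # Backward induction with bonuses, as in (5.26).
        V_bar = np.zeros((H + 1, S))
        pi_hat = np.zeros((H, S), dtype=int)
        for h in range(H - 1, -1, -1):
            bonus = 2 * np.sqrt(np.log(2 * S * A * H * T / delta) / np.maximum(n_sa[h], 1))
            P_hat = n_sas[h] / np.maximum(n_sa[h][..., None], 1)   # unvisited (s, a): all-zero row
            Q = np.minimum(r[h] + P_hat @ V_bar[h + 1] + bonus, 1.0)  # clipping keeps unvisited pairs optimistic
            V_bar[h] = Q.max(axis=1)
            pi_hat[h] = Q.argmax(axis=1)
        # Execute the greedy (optimistic) policy and update counts.
        s = env.reset()
        total = 0.0
        for h in range(H):
            a = pi_hat[h, s]
            s_next = env.step(h, s, a)
            n_sa[h, s, a] += 1
            n_sas[h, s, a, s_next] += 1
            total += r[h, s, a]
            s = s_next
        episode_rewards.append(total)
    return episode_rewards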

The UCB-VI algorithm will be analyzed using Lemma 15. In constructing the functions Q_h, we need to satisfy two goals: (1) ensure that with high probability (5.22) is satisfied, i.e., the Q_h's are optimistic; and (2) ensure that the Q_h's are "self-consistent," in the sense that the Bellman residuals in (5.23) are small. The second requirement already suggests that we should define Q_h approximately as a Bellman backup T_h^M Q_{h+1}, going backwards for h = H, . . . , 1 as in dynamic programming, while ensuring the first requirement. In addition to these considerations, we will have to use a surrogate for the Bellman operator T_h^M, since the model M is not known. This is achieved by estimating M using empirical transition frequencies.
Putting these ideas together gives the update in (5.26). We apply the principle of value
iteration, except that
1. For each episode t, we augment the rewards rh (s, a) with a “bonus” bth,δ (s, a) designed
to ensure optimism.
2. The Bellman operator is approximated using the estimated transition probabilities in
(5.25).
The bonus functions play precisely the same role as the width of the confidence interval in
(2.19): these bonuses ensure that (5.22) holds with high probability, as we will show below
in Lemma 16.
The following theorem shows that with an appropriate choice of bonus, this algorithm
achieves a polynomial regret bound.

Theorem 1: For any δ > 0, UCB-VI with

b_{h,δ}^t(s, a) = 2 √( log(2SAHT/δ) / n_h^t(s, a) )    (5.27)

guarantees that with probability at least 1 − δ,

Reg ≲ HS √(AT) · √(log(SAHT/δ)).

We mention that a slight variation on Lemma 18 below (using the Freedman inequality instead of the Azuma-Hoeffding inequality) yields an improved rate of O(H√(SAT) + poly(H, S, A) log T), and the optimal rate can be shown to be Θ(√(HSAT)); this is achieved through a more careful choice for the bonus b_{h,δ}^t and a more refined analysis. We remark that care should be taken in comparing results in the literature, as scaling conventions for the individual and cumulative rewards (as in Assumption 6) can vary.

5.6.1 Analysis for a Single Episode

Our aim is to bound the regret

Reg = Σ_{t=1}^{T} [ f^{M}(π_M) − f^{M}(π^t) ]

for UCB-VI. To do so, we first prove several helper lemmas concerning the performance within each episode t. In what follows, we fix t and drop the superscript t.
Given the estimated transitions {P̂_h(· | s, a)}_{h,s,a}, define the estimated MDP M̂ = {S, A, {P̂_h}_{h=1}^{H}, {R_h^M}_{h=1}^{H}, d_1}. The associated Bellman operator is

[T_h^{M̂} Q](s, a) = r_h(s, a) + E_{s′∼P̂_h(·|s,a)}[ max_{a′} Q(s′, a′) ]    (5.28)

for Q : S × A → R. Consider the sequence of functions Q_h : S × A → [0, 1], V̄_h : S → [0, 1], for h = 1, . . . , H + 1, with Q_{H+1} ≡ 0 and

Q_h(s, a) = { [T_h^{M̂} Q_{h+1}](s, a) + b_{h,δ}(s, a) } ∧ 1,  and  V̄_h(s) = max_{a} Q_h(s, a)    (5.29)
for bonus functions bh,δ : S × A → R to be chosen later. Henceforth, we follow the usual
notation that for functions f, g over the same domain, f ≤ g indicates pointwise inequality
over the domain.
The first lemma we present shows that as long as the bonuses bh,δ are large enough to
bound the error between the estimated transition probabilities and true transition proba-
bilities, the functions Q1 , . . . , QH constructed above will be optimistic.


Lemma 16: Suppose we have estimates {P̂_h(· | s, a)}_{h,s,a} and a function b_{h,δ} : S × A → R with the property that for all s ∈ S, a ∈ A,

| Σ_{s′} P̂_h(s′ | s, a) V_{h+1}^{M,⋆}(s′) − Σ_{s′} P_h^M(s′ | s, a) V_{h+1}^{M,⋆}(s′) | ≤ b_{h,δ}(s, a).    (5.30)

Then for all h ∈ [H], we have

Q_h ≥ Q_h^{M,⋆},  and  V̄_h ≥ V_h^{M,⋆}    (5.31)

for Q_h, V̄_h defined in (5.29).

Proof of Lemma 16. The proof proceeds by backward induction on the statement

V̄_h ≥ V_h^{M,⋆}

from h = H + 1 down to h = 1. We start with the base case h = H + 1, which is trivial because V̄_{H+1} = V_{H+1}^{M,⋆} ≡ 0. Now, assume V̄_{h+1} ≥ V_{h+1}^{M,⋆}, and let us prove the induction step. Fix (s, a) ∈ S × A. If Q_h(s, a) = 1, then, trivially, Q_h(s, a) ≥ Q_h^{M,⋆}(s, a). Otherwise, Q_h(s, a) = [T_h^{M̂} Q_{h+1}](s, a) + b_{h,δ}(s, a), and thus

Q_h(s, a) − Q_h^{M,⋆}(s, a) = b_{h,δ}(s, a) + E_{s′∼P̂_h(·|s,a)}[ V̄_{h+1}(s′) ] − E_{s′∼P_h^M(·|s,a)}[ V_{h+1}^{M,⋆}(s′) ]
  ≥ b_{h,δ}(s, a) + E_{s′∼P̂_h(·|s,a)}[ V_{h+1}^{M,⋆}(s′) ] − E_{s′∼P_h^M(·|s,a)}[ V_{h+1}^{M,⋆}(s′) ] ≥ 0.

This, in turn, implies that V̄_h(s) = max_a Q_h(s, a) ≥ max_a Q_h^{M,⋆}(s, a) = V_h^{M,⋆}(s), concluding the induction step.

We now analyze the effect of using the estimated model M̂ in the Bellman operator, rather than the true (unknown) operator T_h^M.

Lemma 17: Suppose we have estimates {P̂_h(· | s, a)}_{h,s,a} and a function b′_{h,δ} : S × A → R with the property that

max_{V∈{0,1}^S} | Σ_{s′} P̂_h(s′ | s, a) V(s′) − Σ_{s′} P_h^M(s′ | s, a) V(s′) | ≤ b′_{h,δ}(s, a).    (5.32)

Then the Bellman residual satisfies

Q_h − T_h^M Q_{h+1} ≤ (b_{h,δ} + b′_{h,δ}) ∧ 1    (5.33)

for Q_h, V̄_h defined in (5.29).

Proof of Lemma 17. That Q_h − T_h^M Q_{h+1} ≤ 1 is immediate. To prove the main result, observe that

Q_h − T_h^M Q_{h+1} = { T_h^{M̂} Q_{h+1} + b_{h,δ} } ∧ 1 − T_h^M Q_{h+1} ≤ (T_h^{M̂} − T_h^M) Q_{h+1} + b_{h,δ}.    (5.34)

For any Q : S × A → [0, 1],

(T_h^{M̂} − T_h^M) Q(s, a) = E_{s′∼P̂_h(·|s,a)}[ max_{a′} Q(s′, a′) ] − E_{s′∼P_h^M(·|s,a)}[ max_{a′} Q(s′, a′) ]    (5.35)
  ≤ max_{V∈[0,1]^S} | E_{s′∼P̂_h(·|s,a)}[ V(s′) ] − E_{s′∼P_h^M(·|s,a)}[ V(s′) ] |.    (5.36)

Since the maximum is achieved at a vertex of [0, 1]^S, the statement follows.

5.6.2 Regret Analysis

We now bring back the time index t and show that the estimated transition probabilities in UCB-VI satisfy the conditions of Lemma 16 and Lemma 17, ensuring that the functions Q_1^t, . . . , Q_H^t are optimistic.

Lemma 18: Let {P̂_h^t}_{h∈[H],t∈[T]} be defined as in (5.25). Then with probability at least 1 − δ, the functions

b_{h,δ}^t(s, a) = 2 √( log(2SAHT/δ) / n_h^t(s, a) ),  and  b′^t_{h,δ}(s, a) = 8 √( S log(2SAHT/δ) / n_h^t(s, a) )

satisfy the assumptions of Lemma 16 and Lemma 17, respectively, for all s ∈ S, a ∈ A, h ∈ [H], and t ∈ [T] simultaneously.

Proof of Lemma 18. We leave the proof as an exercise.

Proof of Theorem 1. Putting everything together, we can now prove Theorem 1. Under the event in Lemma 18, the functions Q_1^t, . . . , Q_H^t are optimistic, which means that the conditions of Lemma 15 hold, and the instantaneous regret on round t (conditionally on s_1 ∼ d_1) is at most

Σ_{h=1}^{H} E^{M,π̂^t}[ (Q_h^t − T_h^M Q_{h+1}^t)(s_h, π̂_h^t(s_h)) | s_1 = s ] ≤ Σ_{h=1}^{H} E^{M,π̂^t}[ (b_{h,δ}^t(s_h, π̂_h^t(s_h)) + b′^t_{h,δ}(s_h, π̂_h^t(s_h))) ∧ 1 ],

where the inequality invokes Lemma 17. Summing over t = 1, . . . , T, and applying the Azuma-Hoeffding inequality, we have that with probability at least 1 − δ, the regret of UCB-VI is bounded by

Σ_{t=1}^{T} Σ_{h=1}^{H} E^{M,π̂^t}[ (b_{h,δ}^t(s_h, π̂_h^t(s_h)) + b′^t_{h,δ}(s_h, π̂_h^t(s_h))) ∧ 1 ]
  ≲ Σ_{t=1}^{T} Σ_{h=1}^{H} (b_{h,δ}^t(s_h^t, π̂_h^t(s_h^t)) + b′^t_{h,δ}(s_h^t, π̂_h^t(s_h^t))) ∧ 1 + √(HT log(1/δ)).

Using the definitions of the bonuses in Lemma 18, the bonus term above is bounded, up to a constant factor, by

Σ_{t=1}^{T} Σ_{h=1}^{H} √( S log(2SAHT/δ) / n_h^t(s_h^t, π̂_h^t(s_h^t)) ) ∧ 1 ≤ √(S log(2SAHT/δ)) Σ_{t=1}^{T} Σ_{h=1}^{H} 1/√(n_h^t(s_h^t, π̂_h^t(s_h^t))) ∧ 1.    (5.37)

The double summation can be handled in the same fashion as Lemma 8:

Σ_{t=1}^{T} Σ_{h=1}^{H} 1/√(n_h^t(s_h^t, π̂_h^t(s_h^t))) ∧ 1 = Σ_{h=1}^{H} Σ_{(s,a)} Σ_{t=1}^{T} I{ (s_h^t, π̂_h^t(s_h^t)) = (s, a) } / √(n_h^t(s, a)) ∧ 1
  ≲ Σ_{h=1}^{H} Σ_{(s,a)} √(n_h^T(s, a)) ≤ H √(SAT).
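
To spell out the last two steps (this elaboration is ours): fix (h, s, a) and let t_1 < · · · < t_m be the episodes in which (s, a) is visited at layer h, so that n_h^{t_j}(s, a) = j − 1. Then

Σ_{t=1}^{T} I{ (s_h^t, π̂_h^t(s_h^t)) = (s, a) } / √(n_h^t(s, a)) ∧ 1 = 1 + Σ_{j=2}^{m} 1/√(j − 1) ≤ 1 + 2√(m − 1) ≲ √(n_h^T(s, a)) + 1,

and, by the Cauchy-Schwarz inequality,

Σ_{(s,a)} √(n_h^T(s, a)) ≤ √( SA · Σ_{(s,a)} n_h^T(s, a) ) ≤ √(SAT),

since each episode contributes exactly one state-action pair at layer h, so Σ_{(s,a)} n_h^T(s, a) ≤ T. Summing over h ∈ [H] gives the H√(SAT) bound above.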

6. GENERAL DECISION MAKING

So far, we have covered three general frameworks for interactive decision making: the
contextual bandit problem, the structured bandit problem, and the episodic reinforcement
learning problem; all of these frameworks generalize the classical multi-armed bandit prob-
lem in different directions. In the context of structured bandits, we introduced a complexity
measure called the Decision-Estimation Coefficient (DEC), which gave a generic approach
to algorithm design, and allowed us to reduce the problem of interactive decision making
to that of supervised online estimation. In this section, we will build on this develop-
ment on two fronts: First, we will introduce a unified framework for decision making,
which subsumes all of the frameworks we have covered so far. Then, we will show that
i) the Decision-Estimation Coefficient and its associated meta-algorithm (E2D) extend to
the general decision making framework, and ii) boundedness of the DEC is not just suf-
ficient, but actually necessary for low regret, and thus constitutes a fundamental limit.
As an application of the general tools we introduce, we will show how to use the (general-
ized) Decision-Estimation Coefficient to solve the problem of tabular reinforcement learning
(Section 6.6), offering an alternative to the UCB-VI method we introduced in Section 5.

6.1 Setting
For the remainder of the course, we will focus on a framework called Decision Making with
Structured Observations (DMSO), which subsumes all of the decision making frameworks
we have encountered so far. The protocol proceeds in T rounds, where for each round
t = 1, . . . , T :

1. The learner selects a decision π t ∈ Π, where Π is the decision space.

2. Nature selects a reward rt ∈ R and observation ot ∈ O based on the decision, where


R ⊆ R is the reward space and O is the observation space. The reward and observation
are then observed by the learner.

We focus on a stochastic variant of the DMSO framework.

Assumption 7 (Stochastic Rewards and Observations): Rewards and observa-
tions are generated independently via

(rt , ot ) ∼ M ⋆ (· | π t ), (6.1)

where M ⋆ : Π → ∆(R × O) is the underlying model.

To facilitate the use of learning and function approximation, we assume the learner has
access to a model class M that contains the model M ⋆ . Depending on the problem do-
main, M might consist of linear models, neural networks, random forests, or other complex
function approximators; this generalizes the role of the reward function class F used in
contextual/structured bandits. We make the following standard realizability assumption,
which asserts that M is flexible enough to express the true model.

Assumption 8 (Realizability): The model class M contains the true model M ⋆ .

For a model M ∈ M, let E^{M,π}[·] denote the expectation under (r, o) ∼ M(π). Further, following the notation in Section 5, let

f^{M}(π) := E^{M,π}[r]

denote the mean reward function, and let

π_M := argmax_{π∈Π} f^{M}(π)

denote the optimal decision with maximal expected reward. Finally, define

F_M := { f^{M} | M ∈ M }    (6.2)

as the induced class of mean reward functions. We evaluate the learner's performance in terms of regret to the optimal decision for M^⋆:

Reg := Σ_{t=1}^{T} E_{π^t∼p^t}[ f^{M^⋆}(π_{M^⋆}) − f^{M^⋆}(π^t) ],    (6.3)

where p^t ∈ ∆(Π) is the learner's distribution over decisions at round t. Going forward, we abbreviate f^⋆ = f^{M^⋆} and π^⋆ = π_{M^⋆}.
abbreviate f ⋆ = f M and π ⋆ = πM ⋆ ,.
The DMSO framework is general enough to capture most online decision making prob-
lems. Let us first see how it subsumes the structured bandit and contextual bandit problems.
Example 6.1 (Structured bandits). When there are no observations (i.e., O = {∅}), the
DMSO framework is equivalent to structured bandits studied earlier in Section 4. Therein,
we defined a structured bandit instance by specifying a class F of mean reward functions
and a general class of reward distributions, such as sub-Gaussian or bounded. In the DMSO
framework, we may equivalently start with a set of models M and let FM be the induced
class (6.2). By changing the class F, this encompasses all of the concrete examples of
structured bandit problems we studied in Section 4, including linear bandits, nonparametric
bandits, and concave/convex bandits.

Example 6.2 (Contextual bandits). The DMSO framework readily captures contextual bandits (Section 3) with stochastic contexts (see Assumption 2). To make this precise, we will slightly abuse notation and think of decisions π^t as functions mapping the context x^t to an action in [A]. To this end, on round t, the decision-maker selects a mapping π^t : X → [A] from contexts to actions, and the context o^t = x^t is observed at the end of the round. This is equivalent to first observing x^t and selecting π^t(x^t) ∈ [A].
Formally, let O = X be the space of contexts, [A] be the set of actions, and Π = {π : X → [A]} be the space of decisions. The distribution (r, x) ∼ M(π) then has the following structure: x ∼ D^M and r ∼ R^M(· | x, π(x)) for some context distribution D^M and reward distribution R^M. In other words, the distribution D^M for the context x (treated as an observation) is part of the model M.
We mention in passing that the DMSO framework also naturally extends to the case
when contexts are adversarial rather than i.i.d., as in Section 4.5; see Foster et al. [40]. ◁

Example 6.3 (Online reinforcement learning). The online reinforcement learning framework we introduced in Section 5 immediately falls into the DMSO framework by taking Π = Π_rns, r^t = Σ_{h=1}^{H} r_h^t, and o^t = τ^t. While we have only covered tabular reinforcement
learning so far, the literature on online reinforcement learning contains algorithms and
sample complexity bounds for a rich and extensive collection of different MDP structures
(e.g., Dean et al. [30], Yang and Wang [87], Jin et al. [48], Modi et al. [65], Ayoub et al.
[15], Krishnamurthy et al. [55], Du et al. [33], Li [62], Dong et al. [31]). All of these settings
correspond to specific choices for the model class M in the DMSO framework, and we will
cover this topic in detail in Section 7. ◁

We adopt the DMSO framework because it gives a simple, yet unified approach to describ-
ing and understanding what is—at first glance—a very general and seemingly complicated
problem. Other examples that are covered by the DMSO framework include:

• Partially Observed Markov Decision Processes (POMDPs)

• Bandits with graph-structured feedback

• Partial monitoring⋆

6.2 Refresher: Information-Theoretic Divergences


To develop algorithms and complexity measures for general decision making, we need a
way to measure the distance between distributions over abstract observations (this was
not a concern for the structured and contextual bandit settings, where we only needed to
consider the mean reward function). To do this, we will introduce the notion of the Csiszar
f -divergence, which generalizes a number of familiar divergences including the Kullback-
Leibler (KL) divergence, total variation distance, and Hellinger distance.
Let P and Q be probability distributions over a measurable space (Ω, F ). We say that P
is absolutely continuous with respect to Q if for all events A ∈ F , Q(A) = 0 =⇒ P(A) = 0;
we denote this by P ≪ Q. For a convex function f : (0, ∞) → R with f (1) = 0, the
associated f-divergence for P and Q is given by

D_f(P ‖ Q) := E_Q[ f( dP/dQ ) ]    (6.4)

whenever P ≪ Q. More generally, defining p = dP/dν and q = dQ/dν for a common dominating measure ν, we have

D_f(P ‖ Q) := ∫_{q>0} q f(p/q) dν + P(q = 0) · f′(∞),    (6.5)

where f′(∞) := lim_{x→0^+} x f(1/x).
We will make use of the following f -divergences, all of which have unique properties
that make them useful in different contexts.
• Choosing f(t) = (1/2)|t − 1| gives the total variation (TV) distance

  D_TV(P, Q) = (1/2) ∫ | dP/dν − dQ/dν | dν,

  which can also be written as

  D_TV(P, Q) = sup_{A∈F} |P(A) − Q(A)|.

• Choosing f(t) = (1 − √t)^2 gives the squared Hellinger distance

  D_H^2(P, Q) = ∫ ( √(dP/dν) − √(dQ/dν) )^2 dν.

• Choosing f(t) = t log(t) gives the Kullback-Leibler divergence:

  D_KL(P ‖ Q) = ∫ log(dP/dQ) dP if P ≪ Q, and D_KL(P ‖ Q) = +∞ otherwise.
Note that for TV distance and Hellinger distance, we use the notation D(·, ·) rather than
D(· ∥ ·) to emphasize that the divergence is symmetric. Other standard examples include
the Neyman-Pearson χ2 -divergence.

Lemma 19: For all distributions P and Q,

D_TV^2(P, Q) ≤ D_H^2(P, Q) ≤ D_KL(P ‖ Q).    (6.6)

It is known that D_TV(P, Q) = 1 if and only if D_H^2(P, Q) = 2, and D_TV(P, Q) = 0 if and only if D_H^2(P, Q) = 0 (more generally, D_H^2(P, Q) ≤ 2 D_TV(P, Q)). Moreover, the two distances induce the same topology: a sequence converges in one distance if and only if it converges in the other. The KL divergence cannot be bounded by the TV distance or the Hellinger distance in general, but the following lemma shows that it is possible to relate these quantities if the density ratios under consideration are bounded.
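
As a quick numerical sanity check of (6.6) for discrete distributions, consider the following Python sketch (the helper functions are our own, not from any particular library):

# Sketch: verify D_TV^2 <= D_H^2 <= D_KL on random discrete distributions.
import numpy as np

def tv(p, q):
    return 0.5 * np.abs(p - q).sum()

def hellinger_sq(p, q):
    return np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)

def kl(p, q):
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))   # assumes p << q (true here a.s.)

rng = np.random.default_rng(0)
for _ in range(1000):
    p = rng.dirichlet(np.ones(5))
    q = rng.dirichlet(np.ones(5))
    assert tv(p, q) ** 2 <= hellinger_sq(p, q) + 1e-12
    assert hellinger_sq(p, q) <= kl(p, q) + 1e-12
print("Lemma 19 inequalities hold on all sampled pairs.")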

Lemma 20: Let P and Q be probability distributions over a measurable space (Ω, F). If sup_{F∈F} P(F)/Q(F) ≤ V, then

D_KL(P ‖ Q) ≤ (2 + log(V)) D_H^2(P, Q).    (6.7)

Other properties we will use include:

• Boundedness of TV (by 1) and Hellinger (by 2).

• Triangle inequality for TV and Hellinger distance.

• The data-processing inequality, which is satisfied by all f -divergences.

• Chain rule and subadditivity properties for KL and Hellinger divergence (see Lemma
22).

• A variational representation for the TV distance:

  D_TV(P, Q) = sup_{g:Ω→[0,1]} | E_P[g] − E_Q[g] |.    (6.8)

See Polyanskiy [68] for further background.

6.3 The Decision-Estimation Coefficient for General Decision Making


Developing algorithms for the general decision making framework poses a number of ad-
ditional challenges compared to the basic bandit frameworks we have studied so far. The
problem of understanding how to optimally explore and make decisions for a given model
class M is deeply connected to the problem of understanding the optimal statistical com-
plexity (i.e., minimax regret) for M. Any notion of problem complexity needs to capture
both i) simple problems like the multi-armed bandit, where the mean rewards serve as
a sufficient statistic, and ii) problems with rich, structured feedback (e.g., reinforcement
learning), where observations, or even structure in the noise itself, can provide non-trivial
information about the underlying problem instance. In spite of these apparent difficulties,
we will show that by incorporating an appropriate information-theoretic divergence, we can
use the Decision-Estimation Coefficient to address these challenges, in a similar fashion to
Section 4.
For a model class M, reference model M̂, and scale parameter γ > 0, the Decision-Estimation Coefficient for general decision making [40, 43] is defined via

dec_γ(M, M̂) = inf_{p∈∆(Π)} sup_{M∈M} E_{π∼p}[ (f^{M}(π_M) − f^{M}(π)) − γ · D_H^2( M(π), M̂(π) ) ],    (6.9)

where the first term is the regret of the decision and the second is the information gain from the observation. We further define

dec_γ(M) = sup_{M̂∈co(M)} dec_γ(M, M̂).    (6.10)

The DEC in (6.9) should look familiar to the definition we used for structured bandits in
Section 4 (Eq. (4.15)). The main difference is that instead of being defined over a class F
of reward functions, the general DEC is defined over the class of models M, and the notion
of estimation error/information gain has changed to account for this. In particular, rather

than measuring information gain via the distance between mean reward functions, we now
consider the information gain

E_{π∼p}[ D_H^2( M(π), M̂(π) ) ],

which measures the distance between the distributions over rewards and observations under the models M and M̂ (for the learner's decision π). This is a stronger notion of distance since
i) it incorporates observations (e.g., trajectories for reinforcement learning), and ii) even for
bandit problems, we consider distance between distributions as opposed to distance between
means; the latter feature means that this notion of information gain can capture fine-grained
properties of the models under consideration, such as noise in the reward distribution.
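
When Π and M are both finite, the min-max problem in (6.9) can be computed exactly: the inner supremum is a maximum of finitely many functions that are linear in p, so the outer minimization over the simplex is a linear program. The following Python sketch (using scipy; the inputs f and hell_sq, and the toy example at the end, are illustrative assumptions) computes dec_γ(M, M̂) and the minimizing distribution p.

# Sketch: compute dec_gamma(M, Mhat) for finite Pi and finite M via a linear program.
import numpy as np
from scipy.optimize import linprog

def dec_gamma(f, hell_sq, gamma):
    """f[m, pi]: mean reward of decision pi under model m; hell_sq[m, pi]: D_H^2(M_m(pi), Mhat(pi))."""
    n_models, K = f.shape
    # Coefficients of the (linear-in-p) objective for each model m.
    c = f.max(axis=1, keepdims=True) - f - gamma * hell_sq
    # Variables: (p_1, ..., p_K, v); minimize v subject to c[m] @ p <= v, p in the simplex.
    obj = np.zeros(K + 1); obj[-1] = 1.0
    A_ub = np.hstack([c, -np.ones((n_models, 1))])
    b_ub = np.zeros(n_models)
    A_eq = np.hstack([np.ones((1, K)), np.zeros((1, 1))])
    b_eq = np.array([1.0])
    bounds = [(0, None)] * K + [(None, None)]
    res = linprog(obj, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.fun, res.x[:K]        # value of dec_gamma and the minimizing p

# Toy example: two models, each preferring a different arm; reference has mean 1/2 everywhere.
f = np.array([[1.0, 0.0], [0.0, 1.0]])
hell_sq = 0.5 * (f - 0.5) ** 2       # a Gaussian-style squared-mean proxy for D_H^2 (illustrative)
value, p = dec_gamma(f, hell_sq, gamma=10.0)
print(value, p)

This is exactly the subproblem that the E2D meta-algorithm of Section 6.4 needs to solve at every round.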

6.3.1 Basic Examples


To build intuition as to how the general Decision-Estimation Coefficient adapts to the
structure of the model class M, let us review a few examples—some familiar, and some
new.
Example 6.4 (Multi-armed bandit with Gaussian rewards). Let Π = [A], R = R, O = {∅}. We define

M_MAB-G = { M : M(π) = N(f(π), 1), f : Π → [0, 1] }.

We claim that

dec_γ(M_MAB-G) ∝ A/γ.    (6.11)

To prove this, consider the case where M̂ ∈ M for simplicity. Recall that we have previously shown that this behavior holds for the squared error version of the DEC defined in (4.15). Thus, it is sufficient to argue that the squared Hellinger divergence for Gaussian distributions reduces to the squared difference between the means:

D_H^2( M(π), M̂(π) ) ∝ ( f^{M}(π) − f^{M̂}(π) )^2.

The claim will then follow from Proposition 14. To prove this, first note that

D_H^2( M(π), M̂(π) ) ≤ D_KL( M(π) ‖ M̂(π) ) = (1/2)( f^{M}(π) − f^{M̂}(π) )^2.    (6.12)

In the other direction, one can directly compute

D_H^2( M(π), M̂(π) ) = 2( 1 − exp{ −(1/8)( f^{M}(π) − f^{M̂}(π) )^2 } ) ≥ 1 − exp{ −(1/8)( f^{M}(π) − f^{M̂}(π) )^2 },

and using that 1 − exp{−x} ≥ (1 − e^{−1}) x for x ∈ [0, 1], we establish

D_H^2( M(π), M̂(π) ) ≥ c · ( f^{M}(π) − f^{M̂}(π) )^2

for c = (1 − 1/e)/8. ◁
In fact, one can show that the general DEC in (6.9) coincides with the basic squared
error version from Section 4 for general structured bandit problems, not just multi-armed
bandits; see Proposition 41.
Let us next consider a twist on the bandit problem that is more information-theoretic in
nature, and highlights the need to work with information-theoretic divergences if we want
to handle general decision making problems.

Example 6.5 (Bandits with structured noise). Let Π = [A], R = R, O = {∅}. We define

M_MAB-SN = { M_1, . . . , M_A } ∪ { M̂ },

where M_i(π) := N(1/2, 1) for π ≠ i and M_i(π) := Ber(3/4) for π = i; we further define M̂(π) := N(1/2, 1) for all π ∈ Π. Before proceeding with the calculations, observe that we can solve the general decision making problem when the underlying model is M^⋆ ∈ M with a simple algorithm. It is sufficient to select every action in [A] only once: all suboptimal actions have Gaussian rewards and give r ∉ {0, 1} almost surely, while the optimal action has Bernoulli rewards and gives r ∈ {0, 1} almost surely. Thus, if we select an action and observe a reward r ∈ {0, 1}, we know that we have identified the optimal action.
The valuable information contained in the reward distribution is reflected in the Hellinger divergence, which attains its maximum value when comparing a continuous distribution to a discrete one:

D_H^2( M_i(π), M̂(π) ) = 2 I{π = i}.

To use this property to derive an upper bound on dec_γ(M_MAB-SN, M̂), first note that the maximum over M in the definition of dec_γ(M_MAB-SN, M̂) is not attained at M = M̂, since in that case both the divergence and regret terms are zero, irrespective of p. Now, take p = unif([A]). Then for any M ∈ {M_1, . . . , M_A},

E_{π∼p}[ f^{M}(π_M) − f^{M}(π) ] = (1 − 1/A)(3/4 − 1/2),

and

dec_γ(M, M̂) ≲ (1 − 1/A)(3/4 − 1/2) − γ·(2/A) ≲ I{γ ≤ A/4}.

This leads to an upper bound

dec_γ(M_MAB-SN, M̂) ≲ I{γ ≤ A/4},    (6.13)

which can also be shown to be tight. ◁

Example 6.6 (Bandits with full information). Consider a "full-information" learning setting. We have Π = [A] and R = [0, 1], and for a given decision π we observe a reward r as in the standard multi-armed bandit, but also receive an observation o = (r(π′))_{π′∈[A]} consisting of (counterfactual) rewards for every action.
For a given model M, let M_R(π) denote the distribution over the reward r for π, and let M_O(π) denote the distribution of o. Then for any decision π, since all rewards are observed, the data-processing inequality implies that for all M, M̂ ∈ M and π′ ∈ Π,

D_H^2( M(π), M̂(π) ) ≥ D_H^2( M_O(π), M̂_O(π) )    (6.14)
  = D_H^2( M_O(π′), M̂_O(π′) ) ≥ D_H^2( M_R(π′), M̂_R(π′) ).    (6.15)

Using this property, we will show that for any M̂ ∈ M,

dec_γ(M, M̂) ≤ 1/γ.    (6.16)

Comparing to the finite-armed bandit, we see that the DEC for this example is independent of A, which reflects the extra information contained in the observation o.
To prove (6.16), for a given model M̂ ∈ M we choose p = I_{π_{M̂}} (i.e., the decision maker selects π_{M̂} deterministically), and bound E_{π∼p}[ f^{M}(π_M) − f^{M}(π) ] by

f^{M}(π_M) − f^{M}(π_{M̂}) ≤ f^{M}(π_M) − f^{M̂}(π_M) + f^{M̂}(π_{M̂}) − f^{M}(π_{M̂})
  ≤ 2 · max_{π∈{π_M, π_{M̂}}} | f^{M}(π) − f^{M̂}(π) |
  ≤ 2 · max_{π∈{π_M, π_{M̂}}} D_H( M_R(π), M̂_R(π) ).

We then use the AM-GM inequality, which implies that for any γ > 0,

max_{π∈{π_M, π_{M̂}}} D_H( M_R(π), M̂_R(π) ) ≲ γ · max_{π∈{π_M, π_{M̂}}} D_H^2( M_R(π), M̂_R(π) ) + 1/γ
  ≤ γ · D_H^2( M(π_{M̂}), M̂(π_{M̂}) ) + 1/γ,

where the final inequality uses (6.14) and (6.15). This certifies that for all M ∈ M, the choice of p above satisfies

E_{π∼p}[ f^{M}(π_M) − f^{M}(π) − γ · D_H^2( M(π), M̂(π) ) ] ≲ 1/γ,

so we have dec_γ(M, M̂) ≲ 1/γ. ◁

In what follows, we will show that the different behavior for the DEC for these examples
reflects the fact that the optimal regret is fundamentally different.

6.4 E2D Algorithm for General Decision Making

Estimation-to-Decisions (E2D) for General Decision Making

parameters: Exploration parameter γ > 0.
for t = 1, 2, . . . , T do
  Obtain M̂^t from the online estimation oracle with input (π^1, r^1, o^1), . . . , (π^{t−1}, r^{t−1}, o^{t−1}).
  Compute   // Minimizer for dec_γ(M, M̂^t).

    p^t = argmin_{p∈∆(Π)} sup_{M∈M} E_{π∼p}[ f^{M}(π_M) − f^{M}(π) − γ · D_H^2( M(π), M̂^t(π) ) ].

  Sample decision π^t ∼ p^t and update the estimation oracle with (π^t, r^t, o^t).
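
As a minimal illustration of the full loop, the sketch below instantiates E2D for a finite class of Bernoulli-bandit models (no auxiliary observations). The estimation oracle is the averaged exponential-weights method mentioned below; for Bernoulli rewards the averaged (mixture) model is again a Bernoulli model whose mean is the weighted average of the class means, which the sketch exploits. The DEC subproblem is solved by the same linear-programming reduction sketched earlier in Section 6.3. The model class, horizon, and γ are arbitrary illustrative choices, not prescriptions.

# Sketch of E2D for a finite class of Bernoulli-bandit models (illustrative only).
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
A, T, gamma = 5, 2000, 50.0
means = rng.uniform(0.1, 0.9, size=(20, A))     # model class M: 20 Bernoulli bandit models
true_idx = 3                                    # index of M* (realizable)

def hellinger_sq_bernoulli(p, q):
    return (np.sqrt(p) - np.sqrt(q)) ** 2 + (np.sqrt(1 - p) - np.sqrt(1 - q)) ** 2

def solve_dec(f_class, f_hat):
    """min_p max_M E_{pi~p}[ f^M(pi_M) - f^M(pi) - gamma * D_H^2(M(pi), Mhat(pi)) ] via an LP."""
    hell = hellinger_sq_bernoulli(f_class, f_hat[None, :])
    c = f_class.max(axis=1, keepdims=True) - f_class - gamma * hell
    n_models, K = c.shape
    res = linprog(np.r_[np.zeros(K), 1.0],
                  A_ub=np.hstack([c, -np.ones((n_models, 1))]), b_ub=np.zeros(n_models),
                  A_eq=np.r_[np.ones(K), 0.0][None, :], b_eq=[1.0],
                  bounds=[(0, None)] * K + [(None, None)])
    p = np.clip(res.x[:K], 0, None)
    return p / p.sum()

log_weights = np.zeros(len(means))              # exponential weights over the model class
for t in range(T):
    post = np.exp(log_weights - log_weights.max()); post /= post.sum()
    f_hat = post @ means                        # mean-reward function of the averaged estimator
    p_t = solve_dec(means, f_hat)
    pi_t = rng.choice(A, p=p_t)
    r_t = rng.binomial(1, means[true_idx, pi_t])
    # log-loss (Bayesian) update: weight_M *= P_M(r_t | pi_t)
    lik = means[:, pi_t] if r_t == 1 else 1 - means[:, pi_t]
    log_weights += np.log(np.clip(lik, 1e-12, None))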

Estimation-to-Decisions (E2D), the meta-algorithm based on the DEC that we gave for
structured bandits in Section 4, readily extends to general decision making [40, 43]. The
general version of the meta-algorithm is given above. Compared to structured bandits,
the main difference is that rather than trying to estimate the reward function f ⋆ , we now
estimate the underlying model M ⋆ . To do so, we appeal once again to the notion of an
online estimation oracle, but this time for model estimation.
At each timestep t, the algorithm invokes an online estimation oracle to obtain an
estimate Mct for M ⋆ using the data Ht−1 = (π 1 , r1 , o1 ), . . . , (π t−1 , rt−1 , ot−1 ) observed so far.

Using this estimate, E2D proceeds by computing the distribution p^t that achieves the value dec_γ(M, M̂^t) of the Decision-Estimation Coefficient. That is, we set

p^t = argmin_{p∈∆(Π)} sup_{M∈M} E_{π∼p}[ f^{M}(π_M) − f^{M}(π) − γ · D_H^2( M(π), M̂^t(π) ) ].    (6.17)

E2D then samples the decision π t from this distribution and moves on to the next round.
Like structured bandits, one can show that by running Estimation-to-Decisions in the
general decision making setting, the regret for decision making is bounded in terms of the
DEC and a notion of estimation error for the estimation oracle. The main difference is that
for general decision making, the notion of estimation error we need to control is the sum of Hellinger distances between the estimates from the supervised estimation oracle and M^⋆, which we define via

Est_H := Σ_{t=1}^{T} E_{π^t∼p^t}[ D_H^2( M^⋆(π^t), M̂^t(π^t) ) ].    (6.18)
With this definition, we can show that E2D enjoys the following bound on regret, analogous
to Proposition 13.

Proposition 26 (Foster et al. [40]): E2D with exploration parameter γ > 0 guarantees that

Reg ≤ sup_{M̂∈M̄} dec_γ(M, M̂) · T + γ · Est_H,    (6.19)

almost surely, where M̄ is any set such that M̂^t ∈ M̄ for all t ∈ [T].

Note that we can optimize over the parameter γ in the result above, which yields

Reg ≤ inf_{γ>0}{ sup_{M̂∈M̄} dec_γ(M, M̂) · T + γ · Est_H } ≤ 2 · inf_{γ>0} max{ sup_{M̂∈M̄} dec_γ(M, M̂) · T, γ · Est_H }.

We will show in the sequel that for any finite class M, the averaged exponential weights algorithm with the logarithmic loss achieves Est_H ≲ log(|M|/δ) with probability at least 1 − δ. For this algorithm, and most others we will consider, one can take M̄ = co(M). In fact, one can show (via an analogue of Proposition 24) that for any M̂, even if M̂ ∉ co(M), we have dec_γ(M, M̂) ≤ sup_{M̂′∈co(M)} dec_{cγ}(M, M̂′) ≤ dec_{cγ}(M) for an absolute constant c > 0. This means we can restrict our attention to the convex hull without loss of generality. Putting these facts together, we see that for any finite class, it is possible to achieve

Reg ≤ dec_γ(M) · T + γ · log(|M|/δ)    (6.20)

with probability at least 1 − δ.

Proof of Proposition 26. We write

Reg = Σ_{t=1}^{T} E_{π^t∼p^t}[ f^{M^⋆}(π_{M^⋆}) − f^{M^⋆}(π^t) ]
  = Σ_{t=1}^{T} ( E_{π^t∼p^t}[ f^{M^⋆}(π_{M^⋆}) − f^{M^⋆}(π^t) ] − γ · E_{π^t∼p^t}[ D_H^2( M^⋆(π^t), M̂^t(π^t) ) ] ) + γ · Est_H.

For each t, since M^⋆ ∈ M, we have

E_{π^t∼p^t}[ f^{M^⋆}(π_{M^⋆}) − f^{M^⋆}(π^t) ] − γ · E_{π^t∼p^t}[ D_H^2( M^⋆(π^t), M̂^t(π^t) ) ]
  ≤ sup_{M∈M}{ E_{π^t∼p^t}[ f^{M}(π_M) − f^{M}(π^t) ] − γ · E_{π^t∼p^t}[ D_H^2( M(π^t), M̂^t(π^t) ) ] }
  = inf_{p∈∆(Π)} sup_{M∈M} E_{π∼p}[ f^{M}(π_M) − f^{M}(π) − γ · D_H^2( M(π), M̂^t(π) ) ]
  = dec_γ(M, M̂^t),    (6.21)

where the second equality holds because p^t is chosen as the minimizer. Summing over all rounds t, we conclude that

Reg ≤ sup_{M̂∈M̄} dec_γ(M, M̂) · T + γ · Est_H.

Examples for the upper bound. We now revisit the examples from Section 6.3 and
use E2D and Proposition 26 to derive regret bounds for them.

Example 6.4 (cont’d). For the Gaussian bandit problem from Example 6.4, plugging the
bound dec_γ(M_MAB-G) ≲ A/γ into Proposition 26 yields

Reg ≲ AT/γ + γ · Est_H.

Choosing γ = √(AT/Est_H) balances the terms above and gives

Reg ≲ √(AT · Est_H).

Example 6.5 (cont’d). For the bandit-type problem with structured noise from Exam-
ple 6.5, the bound decγ (MMAB-SN ) ≲ I {γ ≤ A/4} yields

Reg ≲ I {γ ≤ A/4} · T + γ · EstH .

We can choose γ = A, which gives

Reg ≲ A · EstH .

6.4.1 Online Estimation with Hellinger Distance


Let us now give some more detail as to how to perform the online model estimation required
by Proposition 26. Model estimation is a more challenging problem than regression, since we
are estimating the underlying conditional distribution rather than just the conditional mean.
In spite of this difficulty, estimating the model M ⋆ with respect to Hellinger distance is a
classical problem that we can solve using the online learning tools introduced in Section 1.6;
in particular, online conditional density estimation with the log loss. This generalizes the
method of online regression employed in Sections 3 and 4.

Instead of directly performing estimation with respect to Hellinger distance, the simplest
way to develop conditional density estimation algorithms is to work with the logarithmic
loss. Given a tuple (π^t, r^t, o^t), define the logarithmic loss for a model M as

ℓ_log^t(M) = log( 1 / m^{M}(r^t, o^t | π^t) ),    (6.22)

where we define m^{M}(·, · | π) as the conditional density for (r, o) under M. We define regret under the logarithmic loss as

Reg_KL = Σ_{t=1}^{T} ℓ_log^t(M̂^t) − inf_{M∈M} Σ_{t=1}^{T} ℓ_log^t(M).    (6.23)

The following result shows that a bound on the log-loss regret immediately yields a bound
on the Hellinger estimation error.

Lemma 21: For any online estimation algorithm, whenever Assumption 8 holds, we have

E[Reg_KL] ≥ E[ Σ_{t=1}^{T} D_KL( M^⋆(π^t) ‖ M̂^t(π^t) ) ],    (6.24)

so that

E[Est_H] ≤ E[Reg_KL].    (6.25)

Furthermore, for any δ ∈ (0, 1), with probability at least 1 − δ,

Est_H ≤ Reg_KL + 2 log(δ^{−1}).    (6.26)

This result is desirable because regret minimization with the logarithmic loss is a well-
studied problem in online learning. Efficient algorithms are known for model classes of
interest [27, 84, 51, 44, 67, 70, 38, 63], and this is complemented by theory which provides
minimax rates for generic model classes [78, 66, 24, 18]. One example we have already seen
(Section 1) is the averaged exponential weights method, which guarantees

RegKL ≤ log|M|

for finite classes M. Another example is the class of linear models (i.e., m^{M}(r, o | π) = ⟨ϕ(r, o, π), θ⟩ for a fixed feature map ϕ ∈ R^d), for which algorithms with Reg_KL = O(d log(T)) are known [72, 78]. All of these algorithms satisfy M̄ = co(M). We refer the reader to Chapter
9 of [25] for further examples and discussion.
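
For concreteness, here is a minimal Python sketch of the averaged exponential weights method for online conditional density estimation with the logarithmic loss (the interface, a list of callables m_i(r, o, π) returning conditional densities, is an illustrative assumption). For a finite class it attains Reg_KL ≤ log|M|, as stated above.

# Sketch: averaged exponential weights for online conditional density estimation (log loss).
import numpy as np

def averaged_exponential_weights(densities, data):
    """densities: list of callables m_i(r, o, pi); data: iterable of (pi, r, o) tuples.
    Returns Reg_KL as in (6.23); for this method it is at most log(len(densities))."""
    N = len(densities)
    log_w = np.zeros(N)                          # log-weights; uniform prior over the class
    learner_loss, cum_loss = 0.0, np.zeros(N)
    for (pi, r, o) in data:
        w = np.exp(log_w - log_w.max()); w /= w.sum()
        liks = np.array([m(r, o, pi) for m in densities])
        mhat = float(w @ liks)                   # averaged predictive density \hat m^t(r, o | pi)
        learner_loss += -np.log(mhat)            # log loss suffered by the averaged estimator
        log_liks = np.log(np.clip(liks, 1e-300, None))
        cum_loss += -log_liks
        log_w += log_liks                        # multiplicative-weights (Bayesian) update
    return learner_loss - cum_loss.min()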
While (6.25) is straightforward, (6.26) is rather remarkable, as the remainder term does
not scale with T . Indeed, a naive attempt at applying concentration inequalities to control
the deviations of the random quantities EstH and RegKL would require boundedness of the
loss function, which is problematic because the logarithmic loss can be unbounded. The
proof exploits unique properties of the moment generating function for the log loss.

6.5 Decision-Estimation Coefficient: Lower Bound on Regret


Up to this point, we have been focused on developing algorithms that lead to upper bounds
on regret for specific model classes. We now turn our focus to lower bounds, and the

question of optimality: That is, for a given class of models M, what is the best regret that
can be achieved by any algorithm? We will show that in addition to upper bounds, the
Decision-Estimation Coefficient actually leads to lower bounds on the optimal regret.

Background: Minimax regret. What does it mean to say that an algorithm is optimal
for a model class M? There are many notions of optimality, but in this course we will focus
on minimax optimality, which is one of the most basic and well-studied notions.
For a model class M, we define the minimax regret via^{14}

M(M, T) = inf_{p^1,...,p^T} sup_{M^⋆∈M} E^{M^⋆,p}[ Reg(T) ],    (6.27)

where pt = pt (· | Ht−1 ) is the algorithm’s strategy for step t (a function of the history Ht−1 ),
and where we write regret as Reg(T ) to make the dependence on T explicit. Intuitively,
minimax regret asks how well the best algorithm can perform on a worst-case model
(in M) possibly chosen with the algorithm in mind. Another way to say this is: For any
algorithm, there exists a model in M for which E[Reg(T )] ≥ M(M, T ). We will say that
an algorithm is minimax optimal if it achieves (6.27) up to absolute constants that do not
depend on M or T .

6.5.1 The Constrained Decision-Estimation Coefficient


We now show how to lower bound the minimax regret for any model class M in terms of
the DEC for M. Instead of working with the quantity decγ (M) appearing in Proposition
26 directly, it will be more convenient to work with a related quantity called the constrained
Decision-Estimation Coefficient, which we define for a parameter ε > 0 as^{15}

dec^c_ε(M, M̂) = inf_{p∈∆(Π)} sup_{M∈M} { E_{π∼p}[ f^{M}(π_M) − f^{M}(π) ] | E_{π∼p}[ D_H^2( M(π), M̂(π) ) ] ≤ ε^2 },

with

dec^c_ε(M) := sup_{M̂∈co(M)} dec^c_ε(M ∪ {M̂}, M̂).

This is similar to the definition for the DEC we have been working with so far (which we will call the offset DEC going forward), except that it places a hard constraint on the information gain as opposed to subtracting the information gain. Both quantities have a similar interpretation, since subtracting the information gain implicitly biases the max player towards models where the gain is small. Indeed, the offset DEC can be thought of as
^{14} Here, for any algorithm p = p^1, . . . , p^T, E^{M^⋆,p} denotes the expectation with respect to the observation process (r^t, o^t) ∼ M^⋆(π^t) and any randomization used by the algorithm, when M^⋆ is the true model.
^{15} We adopt the convention that the value of dec^c_ε(M, M̂) is zero if there exists p such that the set of M ∈ M with E_{π∼p}[ D_H^2( M(π), M̂(π) ) ] ≤ ε^2 is empty.

a Lagrangian relaxation of the constrained DEC, and always upper bounds it via

dec^c_ε(M, M̂) = inf_{p∈∆(Π)} sup_{M∈M} { E_{π∼p}[ f^{M}(π_M) − f^{M}(π) ] | E_{π∼p}[ D_H^2( M(π), M̂(π) ) ] ≤ ε^2 }
  = inf_{p∈∆(Π)} sup_{M∈M} inf_{γ≥0} { E_{π∼p}[ f^{M}(π_M) − f^{M}(π) ] − γ( E_{π∼p}[ D_H^2( M(π), M̂(π) ) ] − ε^2 ) } ∨ 0
  ≤ inf_{γ≥0} inf_{p∈∆(Π)} sup_{M∈M} { E_{π∼p}[ f^{M}(π_M) − f^{M}(π) ] − γ( E_{π∼p}[ D_H^2( M(π), M̂(π) ) ] − ε^2 ) } ∨ 0
  = inf_{γ≥0} { dec_γ(M, M̂) + γε^2 } ∨ 0.

For the opposite direction, it is straightforward to show that

dec_γ(M) ≲ dec^c_{γ^{−1/2}}(M).

This inequality is lossy, but cannot be improved in general. That is, there are some classes for which the constrained DEC is meaningfully smaller than the offset DEC. However, it is possible to relate the two quantities if we restrict to a "localized" sub-class of models that are not "too far" from the reference model M̂.

Proposition 27: Given a model M̂ and a parameter α > 0, define the localized subclass around M̂ via

M_α(M̂) = { M ∈ M : f^{M}(π_{M̂}) ≥ f^{M}(π_M) − α }.    (6.28)

For all ε > 0 and γ ≥ c_1 · ε^{−1}, we have

dec^c_ε(M) ≤ c_3 · sup_{γ≥c_1 ε^{−1}} sup_{M̂∈co(M)} dec_γ( M_{α(ε,γ)}(M̂), M̂ ),    (6.29)

where α(ε, γ) := c_2 · γε^2, and c_1, c_2, c_3 > 0 are absolute constants.

For many "well-behaved" classes one can consider (e.g., multi-armed bandits and linear bandits), one has dec_γ(M_{α(ε,γ)}(M̂), M̂) ≈ dec_γ(M, M̂) whenever dec_γ(M, M̂) ≈ γε^2 (that is, localization does not change the complexity), so that lower bounds in terms of the constrained DEC immediately imply lower bounds in terms of the offset DEC. In general, this is not the case, and it turns out that it is possible to obtain tighter upper bounds that depend on the constrained DEC by using a refined version of the E2D algorithm. We refer to Foster et al. [43] for details and further background on the constrained DEC.

6.5.2 Lower Bound


The main lower bound based on the constrained DEC is as follows.

Proposition 28 (DEC Lower Bound [43]): Let ε_T := c · 1/√T, where c > 0 is a sufficiently small numerical constant. For all T such that the condition^a

dec^c_{ε_T}(M) ≥ 10 ε_T    (6.30)

is satisfied, it holds that for any algorithm, there exists a model M ∈ M for which

E[Reg(T)] ≳ dec^c_{ε_T}(M) · T.    (6.31)

^a The numerical constant here is not important.

Proposition 28 shows that for any algorithm and model class M, the optimal regret must scale with the constrained DEC in the worst case. As a concrete example, we will show in the sequel that for the multi-armed bandit with A actions, dec^c_ε(M) ∝ ε√A, which leads to

E[Reg] ≳ √(AT).
We mention in passing that by combining Proposition 28 with Proposition 27, we obtain
the following lower bound based on the (localized) offset DEC.

Corollary 1: Fix T ∈ N. Then for any algorithm, there exists a model M ∈ M for which

E[Reg(T)] ≳ sup_{γ≳√T} sup_{M̂∈co(M)} dec_γ( M_{α(T,γ)}(M̂), M̂ ),    (6.32)

where α(T, γ) := c · γ/T for an absolute constant c > 0.
The DEC is necessary and sufficient. To understand the significance of Proposition


28 more broadly, we state but do not prove the following upper bound on regret based on
the constrained DEC, which is based on a refined variant of E2D.

Proposition 29 (Upper bound for constrained DEC [43]): Let M be a finite class, and set ε_T := c · √(log(|M|/δ)/T), where c > 0 is a sufficiently large numerical constant. Under appropriate technical conditions, there exists an algorithm that achieves

Reg(T) ≲ dec^c_{ε_T}(M) · T    (6.33)

with probability at least 1 − δ.

This matches the lower bound in Proposition 28 up to a difference in the radius: we have ε_T ∝ √(1/T) for the lower bound, and ε_T ∝ √(log(|M|/δ)/T) for the upper bound. This implies that for any class where log|M| < ∞, the constrained DEC is necessary and sufficient for low regret. By the discussion in the prequel, a similar conclusion holds for the offset DEC (albeit with a polynomial loss in rate). The interpretation of the log|M| gap between the upper and lower bounds is that the DEC captures the complexity of exploring the decision space, while the statistical capacity required to estimate the underlying model is a separate issue which is not captured.

6.5.3 Proof of Proposition 28


Before proving Proposition 28, let us give some background on a typical approach to proving
lower bounds on the minimax regret for a decision making problem.

Anatomy of a lower bound. How should one go about proving a lower bound on the minimax regret in (6.27)? We will follow a general recipe which can be found throughout statistics, information theory, and decision making [32, 89, 83]. The approach will be to find a pair of models M and M̂ that satisfy the following properties:

1. Any algorithm with regret much smaller than the DEC must query substantially different decisions in Π depending on whether the underlying model is M or M̂. Intuitively, this means that any algorithm that achieves low regret must be able to distinguish between the two models.

2. M and M̂ are "close" in a statistical sense (typically via total variation distance or another f-divergence), which implies via change-of-measure arguments that the decisions played by any algorithm which interacts with the models only via observations (in our case, (π^t, r^t, o^t)) will be similar for both models. In other words, the models are difficult to distinguish.

One then concludes that the algorithm must have large regret on either M or M̂.
To make this approach concrete, classical results in statistical estimation and supervised learning choose the models M and M̂ in a way that is oblivious to the algorithm under consideration [32, 89, 83]. However, due to the interactive nature of the decision making problem, the lower bound proof we present now will choose the models in an adaptive fashion.

Simplifications. Rather than proving the full result in Proposition 28, we will make the following simplifying assumptions:

• There exists a constant C such that

  D_KL( M(π) ‖ M′(π) ) ≤ C · D_H^2( M(π), M′(π) )    (6.34)

  for all M, M′ ∈ M and π ∈ Π.

• Rather than proving a lower bound that scales with dec^c_ε(M) = sup_{M̂∈co(M)} dec^c_ε(M ∪ {M̂}, M̂), we will prove a weaker lower bound that scales with sup_{M̂∈M} dec^c_ε(M, M̂).

We refer to Foster et al. [43] for a full proof that removes these restrictions.

Preliminaries. We use the following technical lemma for the proof of Proposition 28.

Lemma 22 (Chain Rule for KL Divergence): Let (X_1, F_1), . . . , (X_n, F_n) be a sequence of measurable spaces, and let X^i = ∏_{t=1}^{i} X_t and F^i = ⊗_{t=1}^{i} F_t. For each i, let P_i(· | ·) and Q_i(· | ·) be probability kernels from (X^{i−1}, F^{i−1}) to (X_i, F_i). Let P and Q be the laws of X_1, . . . , X_n under X_i ∼ P_i(· | X_{1:i−1}) and X_i ∼ Q_i(· | X_{1:i−1}), respectively. Then it holds that

D_KL(P ‖ Q) = E_P[ Σ_{i=1}^{n} D_KL( P_i(· | X_{1:i−1}) ‖ Q_i(· | X_{1:i−1}) ) ].    (6.35)

Figure 9: Models M and M̂ with corresponding mean rewards and average action distributions. The overlap between the action distributions is at least 0.9, while near-optimal choices for one model incur large regret for the other.

Proof of Proposition 28. Fix T ∈ N and consider any fixed algorithm, which we recall is
defined by a sequence of mappings p1 , . . . , pT , where pt = pt (· | Ht−1 ). Let PM denote the
distribution over HT for this algorithm when M is the true model, and let EM denote the
corresponding expectation.
Viewed as a function of the history Ht−1 , each pt is a random variable, and we can
consider its expected value under the model M . To this end, for any model M ∈ M, let
p_M := E^{M}[ (1/T) Σ_{t=1}^{T} p^t ] ∈ ∆(Π)

be the algorithm’s average action distribution when M is the true model. Our aim is to
show that we can find a model in M for which the algorithm’s regret is at least as large as
the lower bound in (6.32).
Let T ∈ N, and fix a value ε > 0 to be chosen momentarily. Fix an arbitrary model M̂ ∈ M and set

M = argmax_{M∈M} { E_{π∼p_{M̂}}[ f^{M}(π_M) − f^{M}(π) ] | E_{π∼p_{M̂}}[ D_H^2( M(π), M̂(π) ) ] ≤ ε^2 }.    (6.36)

The model M should be thought of as a "worst-case alternative" to M̂, but only for the specific algorithm under consideration. We will show that the algorithm needs to have large regret on either M or M̂. To this end, we establish some basic properties; let us abbreviate g^{M}(π) = f^{M}(π_M) − f^{M}(π) going forward:

• For all models M, we have

  (1/T) E^{M}[ Reg(T) ] = E_{π∼p_M}[ g^{M}(π) ].    (6.37)

  So, to prove the desired lower bound, we need to show that either E_{π∼p_M}[ g^{M}(π) ] or E_{π∼p_{M̂}}[ g^{M̂}(π) ] is large.

• By the definition of the constrained DEC, we have

  E_{π∼p_{M̂}}[ g^{M}(π) ] ≥ dec^c_ε(M, M̂) =: ∆,    (6.38)

since by (6.36), the model M is the best response to the (potentially suboptimal) choice p_{M̂}. This is almost what we want, but there is a mismatch in models, since g^{M} considers the model M while p_{M̂} considers the model M̂.

• Using the chain rule for KL divergence, we have

  D_KL( P^{M̂} ‖ P^{M} ) = E^{M̂}[ Σ_{t=1}^{T} E_{π^t∼p^t}[ D_KL( M̂(π^t) ‖ M(π^t) ) ] ]
    ≤ C · E^{M̂}[ Σ_{t=1}^{T} E_{π^t∼p^t}[ D_H^2( M̂(π^t), M(π^t) ) ] ] = CT · E_{π∼p_{M̂}}[ D_H^2( M(π), M̂(π) ) ].

  To see why the first equality holds, we apply the chain rule to the sequence π^1, z^1, . . . , π^T, z^T with z^t = (r^t, o^t). Let us use the bold notation z^t to refer to the random variable under consideration, and z^t to refer to its realization. Then we have

  D_KL( P^{M̂} ‖ P^{M} ) = E^{M̂}[ Σ_{t=1}^{T} D_KL( P^{M̂}(z^t | H^{t−1}, π^t) ‖ P^{M}(z^t | H^{t−1}, π^t) ) + D_KL( P^{M̂}(π^t | H^{t−1}) ‖ P^{M}(π^t | H^{t−1}) ) ]
    = E^{M̂}[ Σ_{t=1}^{T} D_KL( M̂(π^t) ‖ M(π^t) ) ],

  since, conditionally on H^{t−1}, the law of π^t does not depend on the model.

  We can now choose ε = c_1 · 1/√(CT), where c_1 > 0 is a sufficiently small numerical constant, to ensure that

  D_TV^2( P^{M}, P^{M̂} ) ≤ D_KL( P^{M̂} ‖ P^{M} ) ≤ 1/100.    (6.39)

  In other words, with constant probability, the algorithm can fail to distinguish M and M̂.

Finally, we will make use of the fact that since rewards are in [0, 1], we have

E_{π∼p_{M̂}}[ |f^{M}(π) − f^{M̂}(π)| ] ≤ E_{π∼p_{M̂}}[ D_TV( M(π), M̂(π) ) ] ≤ √( E_{π∼p_{M̂}}[ D_H^2( M(π), M̂(π) ) ] ) ≤ ε.    (6.40)

Step 1. Define G_M = { π ∈ Π | g^{M}(π) ≤ ∆/10 }. Observe that

E_{π∼p_M}[ g^{M}(π) ] ≥ (∆/10) · p_M(π ∉ G_M) ≥ (∆/10) · ( p_{M̂}(π ∉ G_M) − D_TV(p_M, p_{M̂}) )    (6.41)
  ≥ (∆/10) · ( p_{M̂}(π ∉ G_M) − 1/10 ),    (6.42)

since D_TV(p_M, p_{M̂}) ≤ D_TV( P^{M}, P^{M̂} ) ≤ 1/10 by the data-processing inequality and (6.39).
Going forward, let us assume that

E_{π∼p_{M̂}}[ g^{M̂}(π) ] ≤ ∆/10,    (6.43)

or else we are done, by (6.37). Our aim is to show that under this assumption, p_{M̂}(π ∉ G_M) ≥ 1/2, which will imply that E_{π∼p_M}[ g^{M}(π) ] ≳ ∆ via (6.42).

Step 2. By adding the inequalities (6.43) and (6.38), we have that

f^{M}(π_M) − f^{M̂}(π_{M̂}) ≥ E_{π∼p_{M̂}}[ g^{M}(π) − g^{M̂}(π) ] − E_{π∼p_{M̂}}[ |f^{M}(π) − f^{M̂}(π)| ]
  ≥ (9/10)∆ − E_{π∼p_{M̂}}[ |f^{M}(π) − f^{M̂}(π)| ].

In addition, by (6.40), we have E_{π∼p_{M̂}}[ |f^{M}(π) − f^{M̂}(π)| ] ≤ ε, so that

f^{M}(π_M) − f^{M̂}(π_{M̂}) ≥ (9/10)∆ − ε.    (6.44)

Hence, as long as ε ≤ (1/10)∆, which is implied by (6.30), we have

f^{M}(π_M) − f^{M̂}(π_{M̂}) ≥ (4/5)∆.    (6.45)
Step 3. Observe that if π ∈ GM , then


c c c 7
|f M (π) − f M (π)|+ ≥ |f M (πM ) − f M (π) − ∆/10|+ ≥ |f M (πM ) − f M (πM
c ) − ∆/10|+ ≥ ∆,
10
where we have used (6.45). As a result, using (6.40),
h
M c
M
i 7
ε ≥ Eπ∼pM c
|f (π) − f (π)| + ≥ ∆ · pM
c (π ∈ GM ).
10
Hence, since ε ≤ ∆/10 by (6.30), we have
∆ 7
≥ ∆ · pM
c (π ∈ GM ),
10 10
c (π ∈ GM ) ≤ 1/7. Combining this with (6.42) gives
or pM
1 M ∆ ∆
E [Reg(T )] = Eπ∼pM [g M (π)] ≥ · (1 − 1/7 − 1/10) ≥ .
T 10 20

Finishing up. Note that since the choice of Mc ∈ M for this lower bound was arbitrary,
c
we are free to choose M to maximize decε (M, M ).
c c

6.5.4 Examples for the Lower Bound


We now instantiate the lower bound in Proposition 28 for concrete model classes of interest.
We begin by revisiting the examples at the beginning of the section.
Example 6.4 (cont’d). Let us lower bound the constrained DEC for the Gaussian ban-
dit problem from Example 6.4. Set M c(π) = N (1/2, 1), and let {M1 , . . . , MA } ⊆ M be
a sub-family of models with Mi (π) = N (f Mi (π), 1), where f Mi (π) := 21 + ∆I {π = i}
for ah parameter
 ∆ whose
i value will be chosen in a moment. Observe that for all i,
2 1 2
 M 
Eπ∼p DH Mi (π), M (π)
c ≤ 2 ∆ p(i) by (6.12), and Eπ∼p f (πMi ) − f Mi (π) = (1 −
i

p(i))∆, so we can lower bound


n h  i o
deccε (M, Mc) = inf sup Eπ∼p [f M (πM ) − f M (π)] | Eπ∼p D2 M (π), M
H
c(π) ≤ ε2
p∈∆(Π) M ∈M

∆2
 
≥ inf max (1 − p(i))∆ | p(i) ≤ ε2
p∈∆(Π) i 2

110

For any p, there exists i such that p(i) ≤ 1/A. If we choose ∆ = ε · 2A, this choice for i
2
will satisfy the constraint p(i) ∆2 ≤ ε2 , and we will be left with
p
deccε (M, M
c) ≥ (1 − p(i))∆ ≥ ε A/2,

since 1 − p(i) ≥ 1/2.


Plugging this lower bound on the constrained Decision-Estimation Coefficient into Propo-
sition 28 yields √
E[Reg] ≥ Ω(e AT ).

Generalizing the argument above, we can prove a lower bound on the Decision-Estimation
Coefficient for any model class M that “embeds” the multi-armed bandit problem in a cer-
tain sense.

Proposition 30: Let a reference model M c be given, and suppose that a class M con-
tains a sub-class {M1 , . . . , MN } and collection of decisions π1 , . . . , πN with the property
that for all i:
 
1. DH 2 M (π), Mc(π) ≤ β 2 · I {π = πi }.
i

2. f Mi (πMi ) − f Mi (π) ≥ α · I {π ̸= πi }.

Then n √ o
deccε (M, M
c) ≳ α · I ε ≥ β/ N .

The examples that follow can be obtained by applying this result with an appropriate
sub-family.

Example 6.5 (cont’d). Recall the bandit-type problem with structured noise from Exam-
ple 6.5, where we have M = {M1 , . . . , MA }, with Mi (π) = N (1/2, 1)I {π ̸= i}+Ber(3/4)I {π = i}.
c(π) = N (1/2, 1), then this family satisfies the conditions of Proposition 30 with
If we set M n o
α = 1/4 and β 2 = 2. As a result, we have deccε (MMAB-SN ) ≳ I ε ≥ 2/A , which yields
p

E[Reg] ≳ O(A)
e

if we apply Proposition 28.


Example 6.6 (cont’d). Consider the full-information variant of the bandit setting in
Example 6.6. By adapting the argument in Example 6.4, one can show that

deccε (M) ≳ ε,

which leads to a lower bound of the form



E[Reg] ≳ T.

111
Next, we revisit some of the structured bandit classes considered in Section 4.

Example 6.7. Consider the linear bandit setting in Section 4.3.2, with F = {π 7→ ⟨θ, ϕ(π)⟩ | θ ∈ Θ},
where Θ ⊆ Bd2 (1) is a parameter set and ϕ : Π → Rd is a fixed feature map that is known to
the learner. Let M be the set of all reward distributions with f M ∈ F and 1-sub-Gaussian
noise. Then √
deccε (M) ≳ ε d,
which gives √
E[Reg] ≳ dT .

Example 6.8. Consider the Lipschitz bandit setting in Section 4.3.3, where Π is a metric
space with metric ρ, and

F = {f : Π → [0, 1] | f is 1-Lipschitz w.r.t ρ}.

Let M be the set of all reward distributions with f M ∈ F and 1-sub-Gaussian noise. Let
d > 0 be such that the covering number for Π satisfies

Nρ (Π, ε) ≥ ε−d .

Then 2
deccε (M) ≳ ε d+2 ,
d+1
which leads to E[Reg] ≳ T d+2 . ◁

See Foster et al. [40, 43] for further details.

6.6 Decision-Estimation Coefficient and E2D: Application to Tabular RL


In this section, we use the Decision-Estimation Coefficient and E2D meta-algorithm to
provide regret bounds for the tabular reinforcement learning. This will be the most complex
example we consider in this section, and showcases the full power of DEC for general
decision making. In particular, the example will show how the DEC can take advantage
of the observations ot , in the form of trajectories. This will provide an alternative to the
optimistic algorithm (UCB-VI) we introduced in Section 5, and we will build on this approach
to give guarantees for reinforcement learning with function approximation in Section 7.

Tabular reinforcement learning. When we view tabular reinforcement learning as a


special case of the general
 decision makingMframework, M is the collection of all non-
stationary MDPs M = S, A, {PhM }H h=1 , {R } H ,d
h h=1 1 (cf. Section 5), with state space
S = [S], action space A = [A], and horizon H. The decision space Π = Πrns is the collection
of all randomized, non-stationaryPMarkov policies (cf. Example 6.3). We assume that
rewards are normalized such that H h=1 rh ∈ [0, 1] almost surely (so that R = [0, 1]). Recall
that for each M ∈ M, {Ph }h=1 and {RhM }H
M H
h=1 denote the associated transition kernels and
reward distributions, and d1 is the initial state distribution.

112
Occupancy measures. The results we present make use of the notion of occupancy
measures for an MDP M . Let PM ,π (·) denote the law of a trajectory evolving under MDP
M and policy π. We define state occupancy measures via
,π M ,π
dM
h (s) = P (sh = s)

and state-action occupancy measures via


,π M ,π
dM
h (s, a) = P (sh = s, ah = a).

Note that we have dM
1 (s) = d1 (s) for all M and π.

Bounding the DEC for tabular RL. Recall, that to certify a bound on the DEC, we
need to—given any parameter γ > 0 and estimator M c, exhibit a distribution (or, “strategy”)
p such that
h  i
2
sup Eπ∼p f M (πM ) − f M (π) − γ · DH c(π) ≤ decγ (M, M
M (π), M c)
M ∈M

for some upper bound decγ (M, M c). For tabular RL, we will choose p using an algorithm
called Policy Cover Inverse Gap Weighting. As the name suggests, the approach combines
the inverse gap weighting technique introduced in the multi-armed bandit setting with the
notion of a policy cover —that is, a collection of policies that ensures good coverage on every
state [33, 64, 47].

Policy Cover Inverse Gap Weighting (PC-IGW)


parameters: Estimated model M c, Exploration parameter η > 0.
Define inverse gap weighted policy cover Ψ = {πh,s,a }h∈[H],s∈[S],a∈[A] via

,π c
dMh (s, a)
πh,s,a = arg max c c
. (6.46)
π∈Πrns 2HSA + η(f (πMc) − f (π))
M M

For each policy π ∈ Ψ ∪ {πM


c }, define

1
p(π) = c c
, (6.47)
c) − f
λ + η(f (πM (π))
M M

P
where λ ∈ [1, 2HSA] is chosen such that π p(π) = 1.
return p.

The algorithm consists of two steps. First, in (6.46), we compute the collection of policies
Ψ = {πh,s,a }h∈[H],s∈[S],a∈[A] that constitutes the policy cover. The basic idea here is that
each policies in the policy cover should balance (i) regret and (ii) coverage—that is—ensure
that all the states are sufficiently reached, which means we are exploring. We accomplish
this by using policies of the form
,π c
dMh (s, a)
πh,s,a := arg max c c
π∈Πrns 2HSA + η(f (πMc) − f (π))
M M

113
which—for each (s, a, h) tuple—maximize the ratio of the occupancy measure for (s, a)
at layer h to the regret gap under M c. This inverse gap weighted policy cover balances
exploration and exploration by trading off coverage with suboptimality. With the policy
cover in hand, the second step of the algorithm computes the exploratory distribution p by
simply applying inverse gap weighting to the elements of the cover and the greedy policy
πMc.

The bound on the Decision-Estimation Coefficient for the PC-IGW algorithm is as fol-
lows.

Proposition 31: Consider the tabular reinforcement learning setting with H


P
h=1 rh ∈
γ
R := [0, 1]. For any γ > 0 and M ∈ M, the PC-IGW strategy with η = 21H 2 , ensures
c
that
3
c(π) ≲ H SA ,
h i
2
sup Eπ∼p f M (πM ) − f M (π) − γ · DH M (π), M
M ∈M γ

H 3 SA
and consequently certifies that decγ (M, M
c) ≲
γ .

We remark that it is also possible to prove this bound non-constructively, by moving to


the Bayesian DEC and adapting the posterior sampling approach described in Section 4.4.2.

Remark 17 (Computational efficiency): The PC-IGW strategy can be implemented


in a computationally efficient fashion. Briefly, the idea is to solve (6.46) by taking a
dual approach and optimizing over occupancy measures rather than policies. With
this parameterization, (6.46) becomes a linear-fractional program, which can then be
transformed into a standard linear program using classical techniques.

How to estimate the model. The bound on the DEC we proved using the PC-IGW
algorithm assumes that M c ∈ M, but in general, estimators from online learning algorithm
such as exponential weights will produce M ct ∈ co(M). While it is possible to show that
the same bound on the DEC holds for M c ∈ co(M), a slightly more complex version of the
algorithm is required to certify such a bound. To run the PC-IGW algorithm as-is, we can
use a simple approach to obtain a proper estimator M c ∈ M.
Assume for simplicity that rewards are known, i.e. RhM (s, a) = Rh (s, a) for all M ∈ M.
Instead of directly working with an estimator for the entire model M , we work with layer-
wise estimators AlgEst;1 , . . . , AlgEst;H . At each round t, given the history {(π i , ri , oi )}t−1
i=1 ,

the layer-h estimator AlgEst;h produces an estimate Pbh for the true transition kernel PhM .
t

We measure performance of the estimator via layer-wise Hellinger error:


T h  ⋆ i
⋆ ,π t
X
2
EstH;h := Eπt ∼pt EM DH PhM (sh , ah ), Pbht (sh , ah ) . (6.48)
t=1

We obtain an estimation algorithm for the full model M ⋆ by taking M ct as the MDP that
has Ph as the transition kernel for each layer h. This algorithm has the following guaran-
b t

tee.

114
Proposition 32: The estimator described above has
H
X
EstH ≤ O(log(H)) · EstH;h .
h=1

ct ∈ M.
In addition, M

For each layer, we can obtain EstH;h ≤ O(S e 2 A) using the averaged exponential weights
algorithm, by applying the approach described in Section 6.4.1 to each layer. That is, for
each layer, we obtain Pbht by running averaged exponential weights with the loss ℓtlog (Ph ) =
e 2 A) with this approach because there are
− log(Ph (sh+1 | sh , ah )). We obtain EstH;h ≤ O(S
2
S A parameters for the transition distribution at each layer.

A lower bound on the DEC. We state, but do not prove a complementary lower bound
on the DEC for tabular RL.

PropositionP33: Let M be the class of tabular MDPs with S ≥ 2 states, A ≥ 2


actions, and H
h=1 rh ∈ R := [0, 1]. If H ≥ 2 log2 (S/2), then

deccε (M) ≳ ε HSA.


Using Proposition 28, this gives E[Reg] ≳ HSAT .

6.6.1 Proof of Proposition 31


Toward proving Proposition 31, we provide some general-purpose technical lemmas which
will find further use in Section 7. First, we provide a simulation lemma, which allow us to
decompose the difference in value functions for two MDPs into errors between their per-layer
reward functions and transition probabilities.

Lemma 23 (Simulation lemma): For any pair of MDPs M = (P M , RM ) and M c=


c c P H
(P , R ) with the same initial state distribution and h=1 rh ∈ [0, 1], we have
M M

 
c
f M (π) − f M (π) ≤ DTV M (π), Mc(π) (6.49)

c(π) ≤ 1 + η D2 M (π), M
   
≤ DH M (π), M c(π) ∀η > 0, (6.50)
2η 2 H

115
and
c
f M (π) − f M (π)
H hh i i XH h i
M ,π
X c c c
= EM ,π (PhM − PhM )Vh+1 (sh , ah ) + EM ,π Erh ∼RM (sh ,ah ) [rh ] − Er c
M [rh ]
h h ∼Rh (sh ,ah )
h=1 h=1
(6.51)
H
X h    i
c c c
≤ EM ,π DTV PhM (sh , ah ), PhM (sh , ah ) + DTV RhM (sh , ah ), RhM (sh , ah ) . (6.52)
h=1

Next, we provide a “change-of-measure” lemma, which allows one to move from between
quantities involving an estimator M
c and those involving another model M .

Lemma 24 (Change of measure for RL): Consider any MDP M and reference
c which satisfy PH rh ∈ [0, 1]. For all p ∈ ∆(Π) and η > 0 we have
MDP M h=1

Eπ∼p [f M (πM ) − f M (π)]


c(π) + 1 .
h i h  i
c 2
≤ Eπ∼p f M (πM ) − f M (π) + η Eπ∼p DH M (π), M (6.53)

and
H
" #
X    
c,π
M 2 M c
M 2 M c
M
Eπ∼p E DTV P (sh , ah ), P (sh , ah ) + DTV R (sh , ah ), R (sh , ah )
h=1
h i
2
≤ 8H Eπ∼p DH M (π), M
c(π) . (6.54)

Proof of Proposition 31. Let M ∈ M be fixed. The main effort in the proof will be to
bound the quantity
h i
c
Eπ∼p f M (πM ) − f M (π)

in terms of the quantity on the right-hand side of (6.54), then apply change of measure
(Lemma 24). We begin with the decomposition
h i h i
c c c c
Eπ∼p f M (πM ) − f M (π) = Eπ∼p f M (πMc) − f
M
(π) + f M (πM ) − f M (πM
c) . (6.55)
| {z } | {z }
(II)
(I)

For the first term (I), which may be thought of as exploration bias, we have
c c
c) − f
f M (πM (π) 2HSA
h i X M
c c
Eπ∼p f M (πM
c) − f
M
(π) = c c
≤ , (6.56)
λ + η(f (πM
M
c) − f
M
(π)) η
c}
π∈Ψ∪{πM

where we have used that λ ≥ 0. We next bound the second term (II), which entails showing
that the PC-IGW distribution explores enough. We have
c c c c
f M (πM ) − f M (πM
c) = f
M
(πM ) − f M (πM ) − (f M (πM
c) − f
M
(πM )). (6.57)

116
We use the simulation lemma to bound
H
X h    i
c c c c
f M (πM ) − f M (πM ) ≤ EM ,πM DTV PhM (sh , ah ), PhM (sh , ah ) + DTV RhM (sh , ah ), RhM (sh , ah )
h=1
H X
c ,πM
X
= dM
h (s, a)errM
h (s, a),
h=1 s,a

. Define d¯h (s, a) =


c c
 
where errM
h (s, a) := D TV P M
(s, a), P M
(s, a) + D TV R M
(s, a), R M
(s, a)
c,π
Eπ∼p dh (s, a) . Then, using the AM-GM inequality, we have that for any η ′ > 0,
 M 

H X H X ¯ 1/2
X c ,πM
X c,πM dh (s, a) 2
dM (s, a)[errM
h (s, a)] = dM (s, a) (errM
h (s, a))
h h
d¯h (s, a)
h=1 s,a h=1 s,a
H c,π H
1 X X (dh M (s, a))2 η ′ X X ¯
M
2
≤ ¯h (s, a) + dh (s, a)(errM
h (s, a))
2η ′ s,a
d 2 s,a
h=1 h=1
H X c,πM H
(s, a))2 η′ X
M
1 X (dh c 
Eπ∼p EM ,π (errM 2

= + h (sh , ah )) .
2η ′ d¯h (s, a) 2
h=1 s,a h=1

The second term is exactly the upper bound we want, so it remains to bound the ratio of
occupancy measures in the first term. Observe that for each (h, s, a), we have
c,πM c,πM c,πM
dM
h (s, a) dM
h (s, a) 1 dM
h (s, a)  c
M c
M

≤ · ≤ 2HSA + η(f (π c ) − f (πh,s,a ,
)
d¯h (s, a) c,π c,π M
dh h,s,a (s, a) p(πh,s,a )
M M
dh h,s,a (s, a)

where the second inequality follows from the definition of p and the fact that λ ≤ 2HSA.
Furthermore, since
,π c
dMh (s, a)
πh,s,a = arg max c c
,
π∈Πrns 2HSA + η(f (πMc) − f (π))
M M

and since πM ∈ Πrns , we can upper bound by


c ,πM
dM
h (s, a)  c c

c c
c ,πM
2HSA + η(f M (πM
c ) − f M
(πM ) = 2HSA + η(f M (πM
c) − f
M
(πM ). (6.58)
dM
h (s, a)

As a result, we have
H X c,πM H X
X (dM
h (s, a))2 X c,πM c c
¯ ≤ dM
h (s, a)(2HSA + η(f M (πM
c) − f
M
(πM ))
dh (s, a)
h=1 s,a h=1 s,a
2 c c
= 2H SA + ηH(f M (πM
c) − f
M
(πM )).

Putting everything together and returning to (6.57), this establishes that


c
f M (πM ) − f M (πM
c)

H
H 2 SA η ′ X c   ηH M c c c c
≤ ′
+ Eπ∼p EM ,π (errM 2
h (sh , ah )) + ′ c) − f
(f (πM M
(πM )) − (f M (πM
c) − f
M
(πM )).
η 2 2η
h=1

117
ηH
We set η ′ = 2 so that the latter terms cancel and we are left with

H
c 2HSA ηH X c 
Eπ∼p EM ,π (errM 2

f M (πM ) − f M (πM
c) ≤ + h (sh , ah )) .
η 4
h=1

Combining this with (6.55) and (6.56) gives


h i
c
Eπ∼p f M (πM ) − f M (π)
H
4HSA ηH X c 
Eπ∼p EM ,π (errM 2

≤ + h (sh , ah ))
η 4
h=1
H
4HSA ηH X c
h 
c
 
c
i
≤ + Eπ∼p EM ,π DTV
2 2
P M (sh , ah ), P M (sh , ah ) + DTV RM (sh , ah ), RM (sh , ah ) .
η 2
h=1

We conclude by applying the change-of-measure lemma (Lemma 24), which implies that for
any η ′ > 0,
4HSA h  i
Eπ∼p [f M (πM ) − f M (π)] ≤ + (4η ′ )−1 + (4H 2 η + η ′ ) · Eπ∼p DH
2
M (π), M
c(π) .
η
γ
The result follows by choosing η = η ′ = 21H 2
(we have made no effort to optimize the
constants here).

6.7 Tighter Regret Bounds for the Decision-Estimation Coefficient


To close this section, we provide a number of refined regret bounds based on the Decision-
Estimation Coefficient, which improve upon Proposition 26 in various situations.

6.7.1 Guarantees Based on Decision Space Complexity


In general, low estimation complexity (i.e., a small bound on EstH or log|M|) is not required
to achieve low regret for decision making. This is because our end goal is to make good
decisions, so we can give up on accurately estimating the model in regions of the decision
space that do not help to distinguish the relative quality of decisions. The following result
provides a tighter bound that scales only with log|Π|, at the cost of depending on the DEC
for a larger model class: co(M) rather than M.

Proposition 34: There exists an algorithm that for any δ > 0, ensures that with
probability at least 1 − δ,

Reg ≲ inf {decγ (co(M)) · T + γ · log(|Π|/δ)}. (6.59)


γ>0

Compared to (6.20), this replaces the estimation term log|M| with the smaller quantity
log|Π|, replaces decγ (M) with the potentially larger quantity decγ (co(M)). Whether or
not this leads to an improvement depends on the class M. For multi-armed bandits, linear
bandits, and convex bandits, M is already convex, so this offers strict improvement. For
MDPs though, M is not convex: Even for the simple tabular MDP setting where |S| = S

118
and |A| = A, grows exponentially decγ (co(M)) in either H or S, whereas decγ (M) is
polynomial in all parameters.
We mention in passing that this result is proven using a different algorithm from E2D;
see Foster et al. [40, 42] for more background.

6.7.2 General Divergences and Randomized Estimators

E2D for General Divergences and Randomized Estimators


parameters: Exploration parameter γ > 0, divergence D(· ∥ ·).
for t = 1, 2, · · · , T do
Obtain randomized estimate ν t ∈ ∆(M) from estimation oracle with {(π i , ri , oi )}i<t .
Compute // Eq. (6.61).
 h  i
t M M π c
p = arg min sup Eπ∼p f (πM ) − f (π) − γ · EM
c∼ν t D M ∥M .
p∈∆(Π) M ∈M

Sample decision π t ∼ pt and update estimation algorithm with (π t , rt , ot ).

In this section we give a generalization of the E2D algorithm that incorporates two extra
features: general divergences and randomized estimators.

General divergences. The Decision-Estimation Coefficient measures estimation error


2 M (π), M

via the Hellinger distance DH c(π) , which is fundamental in the sense that it
leads to lower bounds on the optimal regret (Proposition 28). Nonetheless, for specific
applications and model classes, it can be useful to work with alternative distance measures
and divergences. For a non-negative function (“divergence”) Dπ (· ∥ ·), we define
  
D M M π c
decγ (M, M ) = inf sup Eπ∼p f (πM ) − f (π) − γ · D M ∥ M .
c (6.60)
p∈∆(Π) M ∈M

This variant of the DEC naturally leadsto regretbounds in terms of estimation error under
Dπ (· ∥ ·). Note that we use notation Dπ M

c ∥ M instead of say, D M c(π), M (π) , to reflect
that fact that the divergence may depend on M (resp. M c) and π through properties other
than M (π) (resp. M (π)).
c

Randomized estimators. The basic version of E2D assumes that at each round, the
online estimation oracle provides a point estimate M
ct . In some settings, it useful to consider
randomized estimators that, at each round, produce a distribution ν t ∈ ∆(M) over models.
For this setting, we further generalize the DEC by defining
 h  i
D M M π c
decγ (M, ν) = inf sup Eπ∼p f (πM ) − f (π) − γ · EM c∼ν D M ∥M (6.61)
p∈∆(Π) M ∈M

for distributions ν ∈ ∆(M). We additionally define decD D


γ (M) = supν∈∆(M) decγ (M, ν).

119
Algorithm. A generalization of E2D that incorporates general divergences and random-
ized estimators is given above on page 119. The algorithm is identical to E2D with Option
I, with the only differences being that i) we play the distribution that solves the minimax
problem (6.61) with the user-specified divergence Dπ (· ∥ ·) rather than squared Hellinger
distance, and ii) we use the randomized estimate ν t rather than a point estimate. Our per-
formance guarantee for this algorithm depends on the estimation performance of the oracle’s
randomized estimates ν 1 , . . . , ν T ∈ ∆(M) with respect to the given divergence Dπ (· ∥ ·),
which we define as
XT h t i
π t ⋆
EstD := Eπt ∼pt EM t
c ∼ν t D M
c ∥ M . (6.62)
t=1

We have the following guarantee.

Proposition 35: The algorithm E2D for General Divergences and Randomized Esti-
mators with exploration parameter γ > 0 guarantees that

Reg ≤ decD
γ (M) · T + γ · EstD (6.63)

almost surely.

Sufficient statistics and benefits of general divergences. Many divergences of in-


terest have the useful property that they depend on the estimated model M c only through
a “sufficient statistic” for the model class under consideration. Formally, there exists a
sufficient statistic space Ψ and sufficient statistic ψ : M → Ψ with the property that we
can write (overloading notation)
Dπ M ∥ M ′ = Dπ ψ(M ) ∥ M ′ , f M (π) = f ψ(M ) (π), and πM = πψ(M )
 

for all models M, M ′ . In this case, it suffices for the online estimation oracle to directly
estimate the sufficient statistic by producing a randomized estimator ν t ∈ ∆(Ψ), and we
can write the estimation error as
T
X h t i
EstD := Eπt ∼pt Eψbt ∼ν t Dπ ψbt ∥ M ⋆ . (6.64)
t=1

The benefit of this perspective is that for many examples of interest, since the divergence
depends on the estimate only through ψ, we can derive bounds on Est that scale with
log|Ψ| instead of log|M|.
For example, in structured bandit problems, one can work with the divergence
 
c
DSq Mc(π), M (π) := (f M (π) − f M (π))2

which uses the mean reward function as a sufficient statistic, i.e. ψ(M ) = f M . Here, it is
clear that one can achieve EstD ≲ log|F|, which improves upon the rate EstH ≲ log|M| for
Hellinger distance, and recovers the specialized version of the E2D algorithm we considered
in Section 4. Analogously, for reinforcement learning, one can consider value functions as a
sufficient statistic, and use an appropriate divergence based on Bellman residuals to derive
estimation guarantees that scale with the complexity log|Q| of a given value function class
Q; see Section 7 for details.

120
Does randomized estimation help? Note that whenever D is convex in the first argu-
D D
ment, we have decD γ (M) ≤ supM c∈co(M) decγ (M, M ) = decγ (M) (that is, the randomized
c
DEC is never larger than the vanilla DEC), but it is not immediately apparent whether the
opposite direction of this inequality holds, and one might hope that working with the ran-
domized DEC in (6.61) would lead to improvements over the non-randomized counterpart.
The next result shows that this is not the case: Under mild assumptions on the divergence
D, randomization offers no improvement.

Proposition 36: Let D be any bounded divergence with the property that for all
models M, M ′ , M
c and π ∈ Π,
    
Dπ M ∥ M ′ ≤ C Dπ M
c ∥ M + Dπ M c ∥ M′ .

(6.65)

Then for all γ > 0,


D
sup decD
γ (M, M ) ≤ decγ/(2C) (M).
c (6.66)
M
c

Squared Hellinger distance is symmetric and satisfies Condition (6.65) with C = 2. Hence,
writing decH D 2
γ (M) as shorthand for decγ (M) with D = DH (·, ·), we obtain the following
corollary.

Proposition 37: Suppose that R ⊆ [0, 1]. Then for all γ > 0,

decH
γ (M) ≤ sup decH H H
γ (M, M ) ≤ sup decγ (M, M ) ≤ decγ/4 (M).
c c
c∈co(M)
M M
c

This shows that for Hellinger distance—at least from a statistical perspective—there is no
benefit to using the randomized DEC compared to the original version. In some cases,
however, strategies p that minimize decH
γ (M, ν) can be simpler to compute than strategies
H c ∈ co(M).
that minimize decγ (M, Mc) for M

6.7.3 Optimistic Estimation


To derive stronger regret bounds that allow for estimation with general divergences, we
can combine Estimation-to-Decisions with a specialized estimation approach introduced by
Zhang [90] (see also Dann et al. [29], Agarwal and Zhang [3], Zhong et al. [91]), which we
refer to as optimistic estimation. The results we present here are based on Foster et al. [41].
Let a divergence Dπ (· ∥ ·) be fixed. An optimistic estimation oracle AlgEst is an al-
gorithm which, at each step t, given Ht−1 = (π 1 , r1 , o1 ), . . . , (π t−1 , rt−1 , ot−1 ), produces a
randomized estimator ν t ∈ ∆(M). Compared to the previous section, the only change is
that for a parameter γ > 0, we will measure the performance of the oracle via optimistic
estimation error, defined as
T h   i
−1 M ⋆ ct
X
OptEstD
γ := Eπt ∼pt EM t
c ∼ν t D π ct
M ∥ M ⋆
+ γ (f (π M ⋆) − f
M
(π c
M t ) . (6.67)
t=1

121
This quantity is similar to (6.62), but incorporates a bonus term
⋆ ct
γ −1 (f M (πM ⋆ ) − f M (πM
ct )),


which encourages the estimation algorithm to over-estimate the optimal value f M (πM ⋆ )
for the underlying model, leading to a form of optimism.
Example 6.9 (Structured bandits). Consider any structured bandit problem with decision
space Π, function class F ⊆ (Π → [0, 1]), and O = {∅}. Let MF be the class
MF = {M | f M ∈ F, M (π) is 1-sub-Gaussian ∀π}.
To derive bounds on the optimistic estimation error, we can appeal to an augmented version
of the (randomized) exponential weights algorithm which, for a learning rate parameter
η > 0, sets !!
X
ν t (f M ) ∝ exp −η (f M (π i ) − ri )2 − γ −1 f M (πM ) .
i<t

For an appropriate choice of η, this method achieves E OptEstD


  p
γ ≲ log|F| + T log|F|/γ
for D = DSq (·, ·) [90]. ◁

Optimistic E2D.

Optimistic E2D (E2D.Opt)


parameters: Exploration parameter γ > 0, divergence D(· ∥ ·).
for t = 1, 2, · · · , T do
Obtain randomized estimate ν t ∈ ∆(M) from optimistic estimation oracle with
{(π i , ri , oi )}i<t .
Compute // Eq. (6.68).
  
t c
M M π
c) − f (π) − γ · D
p = arg min sup Eπ∼p EM
c∼ν t f (πM
c∥M
M .
p∈∆(Π) M ∈M

Sample decision π t ∼ pt and update estimation algorithm with (π t , rt , ot ).

E2D.Opt is an optimistic variant of E2D, which we refer to as E2D.Opt. At each timestep


t, the algorithm calls the estimation oracle to obtain a randomized estimator ν t using the
data (π 1 , r1 , o1 ), . . . , (π t−1 , rt−1 , ot−1 ) collected so far. The algorithm then uses the estimator
to compute a distribution pt ∈ ∆(Π) and samples π t from this distribution. The main
change relative to the version of E2D on page 119 is that the minimax problem in E2D.Opt
is derived from an “optimistic” variant of the DEC tailored to the optimistic estimation
error in (6.67). This quantity, which we refer to as the Optimistic Decision-Estimation
Coefficient, is defined for ν ∈ ∆(M) as
h  i
o-decD
γ (M, ν) = inf sup E π∼p EMc∼ν f M
(πM ) − f M
(π) − γ · D π c
M ∥ M . (6.68)
p∈∆(Π) M ∈M

and
o-decD
γ (M) = sup o-decD
γ (M, ν). (6.69)
ν∈∆(M)

122
The Optimistic DEC the same as the generalized DEC in (6.61), except that the optimal
c
value f M (πM ) in (6.61) is replaced by the optimal value f M (πM
c ) for the (randomized)

reference model Mc ∼ ν. This seemingly small change is the main advantage of incorporating
optimistic estimation, and makes it possible to bound the Optimistic DEC for certain
divergences D for which the value of the generalized DEC in (6.61) would otherwise be
unbounded.

Remark 18: When the divergence D admits a sufficient statistic ψ : M → Ψ, for any
distribution ν ∈ ∆(M), if we define ν ∈ ∆(Ψ) via ν(ψ) = ν({M ∈ M : ψ(M ) = ψ}),
we have
h i
o-decD
γ (M, ν) = inf sup E E
π∼p ψ∼ν f ψ
(πψ ) − f M
(π) − γ · D π
(ψ ∥ M ) .
p∈∆(Π) M ∈M

In this case, by overloading notation slightly, we may simplify the definition in (6.69)
to
o-decD D
γ (M) = sup o-decγ (M, ν).
ν∈∆(Ψ)

Regret bound for optimistic E2D. The following result shows that the regret of Op-
timistic Estimation-to-Decisions is controlled by the Optimistic DEC and the optimistic
estimation error for the oracle.

Proposition 38: E2D.Opt ensures that

Reg ≤ o-decD D
γ (M) · T + γ · OptEstγ (6.70)

almost surely.

This regret bound has the same structure as that of Proposition 35, but the DEC and
estimation error are replaced by their optimistic counterparts.

When does optimistic estimation help?. When does the regret bound in Proposition
38 improve upon its non-optimistic counterpart in Proposition 35? It turns out that for
asymmetric divergences such as those found in the context of reinforcement learning, the
regret bound in (6.70) can be much smaller than the corresponding bound in (6.63); see
Section 7.3.5 for an example. However, for symmetric divergences such as Hellinger distance,
we will show now that the result never improves upon Proposition 35.
Given a divergence D, we define the flipped divergence, which swaps the first and second
arguments, by    
qπ M
D c ∥ M := Dπ M ∥ M c .

Proposition 39 (Equivalence of optimistic DEC and randomized DEC): As-


c
c ∈ co(M), we have (f M
sume that For all pairs of models M, M (π) − f M (π))2 ≤

123
 
L2lip · Dπ Mc ∥ M for a constant Llip > 0. Then for all γ > 0,

D L2lip D L2lip
≤ o-decD
q q
dec3γ/2 (M) − γ (M) ≤ decγ/2 (M) + . (6.71)
2γ 2γ

This result shows that the optimistic DEC with divergence D is equivalent to the generalized
DEC in (6.61), but with the arguments to the divergence flipped. Thus, for symmetric
divergences, the quantities are equivalent. In particular, we can combine Proposition 39
with Proposition 36 to derive the following corollary for Hellinger distance.

Proposition 40: Suppose that rewards are bounded in [0, 1]. Then for all γ > 0,
1 3
o-decH
2γ (M) − ≤ sup decH H
γ (M, M ) ≤ o-decγ/6 (M) + .
γ M γ

For asymmetric divergences, in settings where there exists an estimation oracle for which
the flipped estimation error
T
X h t i
D π
Est = ct ∼ν t D M⋆ ∥ M
ct
q
Eπ∼pt EM
t=1

is controlled, Proposition 39 shows that to match the guarantee in Proposition 38, optimism
is not required, and it suffices to run the non-optimistic algorithm on page 119. However,
we show in Section 7.3.5 that for certain divergences found in the context of reinforcement
learning, estimation with respect to the flipped divergence is not feasible, yet working with
the optimistic DEC E2D.Opt leads to meaningful guarantees.
[Note: This subsection will be expanded in the next version.]

6.8 Decision-Estimation Coefficient: Structural Properties⋆


In what follows, we state some structural properties of the Decision-Estimation Coefficient,
which are useful for calculating the value for specific model classes of interest.

Proposition 41 (Square loss is sufficient for structured bandit problems):


Consider any structured bandit problem with decision space Π, function class F ⊆
(Π → [0, 1]), and O = {∅}. Let MF be the class

MF = {M | f M ∈ F, M (π) is 1-sub-Gaussian ∀π}.

Then, letting
h i
decSq
γ (F, f ) =
b inf sup Eπ∼p f (πf ) − f (π) − γ(f (π) − fb(π))2 ,
p∈∆(Π) f ∈F

we have

decSq Sq
c1 γ (F) ≤ decγ (MF ) ≤ decc2 γ (F),

124
where c1 , c2 ≥ 0 are numerical constants.

Proposition 42 (Filtering irrelevant information): Adding observations that are


unrelated to the model under consideration never changes the value of the Decision-
Estimation Coefficient. In more detail, consider a model class M with observation space
O1 , and consider a class of conditional distributions D over a secondary observation
space O2 , where each D ∈ D has the form D(π) ∈ ∆(O2 ). For M ∈ M and D ∈ D, let
(M ⊗ D)(π) be the model that, given π ∈ Π, samples (r, o1 ) ∼ M (π) and o2 ∼ D(π),
then emits (r, (o1 , o2 )). Set

M ⊗ D = {M ⊗ D | M ∈ M, D ∈ D}.

c ∈ M and D
Then for all M b ∈ D,

decγ (M ⊗ D, M
c ⊗ D)
b = decγ (M, M
c).

This can be seen to hold by restricting the supremum in (6.9) to range over models of
the form M ⊗ D.b

Proposition 43 (Data processing): Passing observations through a channel never


decreases the Decision-Estimation Coefficient. Consider a class of models M with
observation space O. Let ρ : O → O′ be given, and define ρ ◦ M to be the model
that, given decision π, samples (r, o) ∼ M (π), then emits (r, ρ(o)). Let ρ ◦ M :=
{ρ ◦ M | M ∈ M}. Then for all Mc ∈ M, we have

c) ≤ decγ (ρ ◦ M, ρ ◦ M
decγ (M, M c).

This is an immediate consequence


 of the data processing inequality for Hellinger dis-
2
   2
 
tance, which implies that DH ρ ◦ M (π), ρ ◦ M (π) ≤ DH M (π), M (π) .
c c

6.9 Deferred Proofs


Proof of Lemma 21. We first prove the in-expectation bound. By assumption, we have that
T
X T
X
ℓtlog (M
ct ) − ℓtlog (M ⋆ ) ≤ RegKL .
t=1 t=1

Taking expectations, Assumption 8 implies that


T
X h  i
E DKL M ⋆ (π t ) ∥ M
ct (π t ) ≤ E[RegKL ].
t=1

The bound now follows from Lemma 19.

125
We now prove the high-probability bound using Lemma 34. Define Zt = 12 (ℓtlog (M
ct ) −
ℓtlog (M ⋆ )). Applying Lemma 34 with the sequence (−Zt )t≤T , we are guaranteed that with
probability at least 1 − δ,
T T T
X  X 1 X t ct 
− log Et−1 e−Zt ≤ Zt + log(δ −1 ) = ℓlog (M ) − ℓtlog (M ⋆ ) + log(δ −1 ).

2
t=1 t=1 t=1

Let t be fixed, and define abbreviate z t = (rt , ot ). Let ν(· | π) be any (conditional) domi-
ct ⋆
nating measure for mM and mM , and observe that
s 
ct
 −Zt  m M
(z t
| π t
)
Et−1 e | π t = Et−1  ⋆ | πt
mM (z t | π t )
s
ct
mM (z | π t )
Z
M⋆ t
= m (z | π ) ⋆ ν(dz | π t )
mM (z | π t )
1 2  ⋆ t ct t 
Z q
⋆ ct
= mM (z | π t )mM (z | π t )ν(dz | π t ) = 1 − DH M (π ), M (π ) .
2
Hence,
1 h  i
Et−1 e−Zt = 1 − Et−1 DH 2
M ⋆ (π t ), M
 
ct (π t )
2
and, since − log(1 − x) ≥ x for x ∈ [0, 1], we conclude that
T T 
1X h 
2 ct (π t ) ≤ 1
i X 
ct ) − ℓt (M ⋆ ) + log(δ −1 ).
Et−1 DH M ⋆ (π t ), M ℓtlog (M log
2 2
t=1 t=1

PH
Proof of Lemma 23. We first prove (6.49). Let X = h=1 rh . Since X ∈ [0, 1] almost
surely, we have
   
c c
f M (π) − f M (π) = EM ,π [X] − EM ,π [X] ≤ DTV M (π), M
c(π) ≤ DH M (π), M c(π) .

The final result now follows from the AM-GM inequality.


We now prove (6.51). From Lemma 14, we have
H
,π M ,π
c
X c 
EM ,π QM
M M

f (π) − f (π) = h (sh , ah ) − rh − Vh+1 (sh+1 )
h=1
H h i
M ,π M ,π
X c
EM ,π PhM Vh+1

= (sh , ah ) − Vh+1 (sh+1 ) + Erh ∼RM (sh ,ah ) [rh ] − Er c
M [r
) h
]
h h ∼Rh (sh ,ah
h=1
H hh i i XH h i
M ,π
X c c c
= EM ,π (PhM − PhM )Vh+1 (sh , ah ) + EM ,π Erh ∼RM (sh ,ah ) [rh ] − Er c
M [rh ]
h h ∼Rh (sh ,ah )
h=1 h=1
XH h    i
c c c
≤ EM ,π DTV PhM (sh , ah ), PhM (sh , ah ) + DTV RhM (sh , ah ), RhM (sh , ah ) ,
h=1
M ,π
where we have used that Vh+1 (s) ∈ [0, 1].

126
Proof of Lemma 24. We first prove (6.53). For all η > 0, we have

EM ∼µ Eπ∼p [f M (πM ) − f M (π)]


c(π) + 1 .
h i h  i
c 2
≤ EM ∼µ Eπ∼p f M (πM ) − f M (π) + η EM ∼µ Eπ∼p DH M (π), M

We now prove (6.54). Using Lemma 37, we have that for all h,
h  i h  i  
c c c c
EM ,π DH
2
P M (sh , ah ), P M (sh , ah ) +EM ,π DH
2 2
RM (sh , ah ), RM (sh , ah ) ≤ 8DH M (π), M
c(π) .

As a result,
"H #
X      
c c c
EM ,π DH 2 2
P M (sh , ah ), P M (sh , ah ) + DH RM (sh , ah ), RM (sh , ah ) 2
≤ 8HDH M (π), M
c(π) .
h=1

Since this holds uniformly for all π, we conclude that


"H #
X    
c c c
Eπ∼p EM ,π 2
DTV 2
P M (sh , ah ), P M (sh , ah ) + DTV RM (sh , ah ), RM (sh , ah )
h=1
h  i
2
≤ 8H Eπ∼p DH M (π), M
c(π) .

6.10 Exercises

Exercise 11: Prove Lemma 19.

Exercise 12: In this exercise, we will prove Proposition 37 as follows:

1. Prove the first two inequalities.


2. Use properties of the Hellinger distance to show that for any π ∈ Π, µ ∈ ∆(M), and M
c,

c(π) ≥ 1 EM,M ′ ∼µ DH2 (M (π), M ′ (π)).


 
EM ∼µ DH2 M (π), M
4
Hint: start with the right-hand side and use symmetry and triangle inequality for Hellinger
distance.
3. With the help of Part 2, show that for any M
c,
 
γ
c) ≤ sup
decγ (M, M inf Eπ∼p,M ∼µ f M (πM ) − f M (π) − EM ′ ∼µ DH2 (M (π), M ′ (π)) .
µ∈∆(M) p∈∆(Π) 4

4. Argue that
 
γ
c) ≤
decγ (M, M sup sup Eπ∼p,M ∼µ f M (πM )−f M (π)− EM ′ ∼ν DH2 (M (π), M ′ (π)) .
inf
ν∈∆(M) µ∈∆(M) p∈∆(Π) 4

and conclude the third inequality in Proposition 37.

127
5. Show that
c) ≤
sup decγ (M, M sup decγ/4 (M, M
c). (6.72)
c
M c∈co(M)
M

In other words, the estimation oracle cannot significantly increase the value of the DEC by
selecting models outside co(M).

Exercise 13 (Lower Bound on DEC for Tabular RL): We showed that for Gaussian
bandits,
p
deccε (M, M
c) ≥ ε A/2,

for all ε ≲ 1/ A by considering a small sub-family models and explicitly computing the DEC
for this sub-family. Show that if M is the set of all tabular MDPs with |S| = S, |A| = A, and
PH
h=1 rh ∈ [0, 1],

deccε (M, M
c) ≳ ε SA

for all ε ≲ 1/ SA, as long as H ≳ logA (S).

Exercise 14 (Structured Bandits with ReLU Rewards): We will show that structured
bandits with ReLU rewards suffer from the curse of dimensionality. Let relu(x) = max{x, 0}
and take Π = Bd2 (1) = π ∈ Rd | ∥π∥2 ≤ 1 . Consider the class of value functions of the form


fθ (π) = relu(⟨θ, π⟩ − b), (6.73)

where θ ∈ Θ = Sd−1 , is an unknown parameter vector and b ∈ [0, 1] is a known bias parameter.
Here Sd−1 := v ∈ Rd | ∥v∥ = 1 denotes the unit sphere. Let M = {Mθ }θ∈Θ , where for all π,
Mθ (π) := N (fθ (π), 1).
We will prove that for all d ≥ 16, there exists M ∈ M such that for all γ > 0,

ed/8
decγ (M, M ) ≳ ∧ 1, (6.74)
γ
for an appropriate choice of bias b. By slightly strengthening this result and appealing to
(6.32), it is possible to show that any algorithm must have E[Reg] ≳ ed/8 .
To prove (6.74), we will use the fact that for large d, a random vector v chosen uniformly
from the unit sphere is nearly orthogonal to any direction π. This fact is quantified as follows
(see Ball ’97):

α2
 
Pv∼unif(Sd−1 ) (⟨π, v⟩ > α) ≤ exp − d . (6.75)
2

for any π with ∥π∥ = 1.

1. Prove that for all π ∈ Π, v ∈ Θ, and any choice of b,

max fv (π ′ ) − fv (π) ≥ (1 − b)I {⟨v, π⟩ ≤ b}


π ′ ∈Π

In other words, instantaneous regret is at least (1 − b) whenever the decision π does not align
well with v.

128
2. Let M (π) = N (0, 1). Show that for all π ∈ Π, v ∈ Θ, and for any choice of b,

 1 (1 − b)2
DH2 Mv (π), M (π) ≤ fv2 (π) ≤ I {⟨v, π⟩ > b} ,
2 2
i.e. information is obtained by the decision-maker only if the decision π aligns well with v in
the model Mv .
3. Show that
(1 − b)2
 
decγ (M, M ) ≥ inf Ev∼unif(Sd−1 ) Eπ∼p (1 − b) − (1 − b)I {⟨v, π⟩ > b} − γ I {⟨v, π⟩ > b} .
p∈∆(Π) 2

4. Set ε := 1 − b. Use (6.75) and Part 3 above to argue that

ε2
decγ (M′ , M ) ≥ ε − ε exp(−d/8) − γ exp(−d/8).
2
Conclude that for d ≥ 8,

ε ε2
decγ (M′ , M ) ≥ − γ exp(−d/8)
2 2

ed/8 1
5. Show that by choosing ε = 6γ ∧ 2 and recalling that b = 1 − ε, we get (6.74).

7. REINFORCEMENT LEARNING: FUNCTION APPROXIMATION AND


LARGE STATE SPACES

In this section, we consider the problem of online reinforcement learning with function ap-
proximation. The framework is the same as that of Section 5 but, in developing algorithms,
we no longer assume that the state and action spaces are finite/tabular, and in particular
we will aim for regret bounds that are independent of the number of states. To do this,
we will make use of function approximation—either directly modeling the transition prob-
abilities for the underlying MDP, or modeling quantities such as value functions—and our
goal will be to design algorithms that are capable of generalizing across the state space as
they explore. This will pose challenges similar to that of the structured and contextual
bandit settings, but we now face the additional challenge of credit assignment. Note that
the online reinforcement learning framework is a special case of the general decision making
setting in Section 6, but the algorithms we develop in this section will be tailored to the
MDP structure.
Recall (Section 5) that for reinforcement learning, each MDP M takes the form

M = S, A, {PhM }H M H

h=1 , {Rh }h=1 , d1 ,

where S is the state space, A is the action space, PhM : S × A → ∆(S) is the probability
transition kernel at step h, RhM : S × A → ∆(R) is the reward distribution, and d1 ∈ ∆(S1 )
is the initial state distribution.
PH All of the results in this section will take Π = ΠRNS , and
we will assume that h=1 rh ∈ [0, 1] unless otherwise specified.

7.1 Is Realizability Sufficient?


For the frameworks we have considered so far (contextual and structured bandits, general
decision making), all of the algorithms we analyzed leveraged the assumption of realizability,

129
which asserts that we have a function class that is capable of modeling the underlying
environment well. For reinforcement learning, there are various realizability assumptions
one can consider:

• Model realizability: We have a model class M of MDPs that contains the true MDP
M ⋆.

• Value function realizability: We have a class Q of state-action value functions (Q-



functions) that contains the optimal function QM ,⋆ for the underlying MDP.

• Policy realizability: We have a class Π of policies that contains the optimal policy
πM ⋆ .

Note that model realizability implies value function realizability, which in turn implies policy
realizability. Ideally, we would like to be able to say that whenever one of these assumptions
holds, we can obtain regret bounds that scale with the complexity of the function class (e.g.,
log|M| for model realizability, or log|Q| for value function realizability), but do not depend
on the number of states |S| or other properties of the underlying MDP, analogous to the
situation for statistical learning. Unfortunately, the following result shows that this is too
much to ask for.

Proposition 44 (e.g., Krishnamurthy et al. [55]): For any S ∈ N and H ∈ N,


there exists a class of horizon-H MDPs M with |S| = S, |A| = 2, and log|M| = log(S),
yet any algorithm must have
q
E[Reg] ≳ min{S, 2H } · T .

The interpretation of this result is that even if model realizability holds, any algorithm needs
regret that scales with min{|S|, |M|, 2H }. This means additional structural assumptions on
the underlying MDP M ⋆ —beyond realizability—are required if we want to obtain sample-
efficient learning guarantees. Note that since this construction satisfies model realizability,
the strongest form of realizability, it also rules out sample-efficient results for value function
and policy realizability.
In what follows, we will explore different structural assumptions that facilitate low regret
for reinforcement learning with function approximation. Briefly, the idea will be to make
assumptions that either i) allow for extrapolation across the state space, or ii) control the
number of “effective” state distributions the algorithm can encounter. We will begin by
investigating reinforcement learning with linear models, then explore a general structural
property known as Bellman rank.

Remark 19 (Comparison to structured bandits): Proposition 44 is analogous


to the impossibility result we proved for structured bandits (Example 4.1), which is
subsumed by the RL framework. That result required a large number of actions, while
Proposition 44 holds even when |A| = 2.

130
Remark 20 (Further notions of realizability): There are many notions of realiz-
ability beyond those we consider above. For example, for value function approximation,

one can assume that QM ,π ∈ Q for all π, or assume that the class Q obeys certain
notions of consistency with respect to the Bellman operator for M ⋆ .

7.2 Linear Function Approximation


Toward understanding the complexity of RL with function approximation, let us consider
perhaps the simplest possible modeling approach: Linear function approximation. A natural
idea here is to assume linearity of the underlying Q-function corresponding to the true model
M , generalizing the linear bandit setting in Section 4:
,⋆
QM M
h (s, a) = ⟨ϕ(s, a), θh ⟩, ∀h ∈ [H] (7.1)

where ϕ(s, a) ∈ Bd2 (1) is a feature map that is known to the learner and θhM ∈ Bd2 (1) is an
unknown parameter vector. Equivalently, we can define
n o
Q = Qh (s, a) = ⟨ϕ(s, a), θh ⟩ | θh ∈ Bd2 (1) ∀h , (7.2)

and assume that QM ,⋆ ∈ Q. This is called the Linear-Q⋆ model.


Linearity is a strong assumption, and it is reasonable to imagine that this would be
sufficient for low regret. Indeed, one might hope that using linearity, we can extrapolate
the value of QM ,⋆ once we estimate it for a small number of states. Unfortunately, even for
this very simple class of functions, it turns out that realizability is still insufficient.

Proposition 45 (Weisz et al. [86], Wang et al. [85]): For any d ∈ N and H ∈ N
sufficiently large, any algorithm for the Linear-Q⋆ model must have
n o
E[Reg] ≳ min 2Ω(d) , 2Ω(H) .

This contrasts the situation for contextual bandits and linear bandits, where linear rewards
were sufficient for low regret. The intuition is that, even though QM ,⋆ is linear, it might
take a very long time to estimate the value for even a small number of states. That is,
linearity of the optimal value function is not a useful assumption unless there is some kind
of additional structure that can guide us toward the optimal value function to being with.
We mention in passing that Proposition 45 can be proven by lower bounding the
Decision-Estimation Coefficient [40].

The Low-Rank MDP model. Proposition 45 implies that linearity of the optimal Q-
function alone is not sufficient for sample-efficient RL. To proceed, we will make a stronger
assumption, which asserts that the transition probabilities themselves have linear structure:
For all s ∈ S, a ∈ A, and h ∈ [H], we have

PhM (s′ | s, a) = ϕ(s, a), µM ′


h (s ) , and E[rh |s, a] = ⟨ϕ(s, a), whM ⟩. (7.3)

Here, ϕ(s, a) ∈ Bd2 (1) is a feature map that is known to the learner, ′ d
√ h (s ) ∈ R is another
µM
feature map which is unknown to the learner, and whM ∈ Bd2 ( d) is an unknown parameter

131
P ′

vector. Additionally, for simplicity, we assume that s′ ∈S |µh (s )| ≤ d, which in
M

particular holds in the tabular example below. As before, assume that both cumulative and
individual-step rewards are in [0, 1]. For the remainder of the subsection, we let M denote
the set of MDPs with these properties.
The linear structure in (7.3) implies that the transition matrix has rank at most d, thus
facilitating (as we shall see shortly) information sharing and generalization across states,
even when the cardinality of S and A is large or infinite. For this reason, we refer to MDPs
with this structure as low-rank MDPs [71, 88, 48, 5].
Just as linear bandits generalize unstructured multi-armed bandits, the low-rank MDP
model (7.3) generalizes tabular RL, which corresponds to the special case in which d =
|S| · |A|, ϕ(s, a) = es,a , and (µh (s′ ))s,a = PhM (s′ | s, a).

Properties of low-rank MDPs. The linear structure of the transition probabilities and
,⋆
mean rewards is a significantly more stringent assumption than linearity of QM h (s, a) in
(7.1). Notably, it implies that Bellman backups of arbitrary functions are linear.

Lemma 25: For any low-rank MDP M ∈ M and any Q : S × A → R and any h ∈ [H],
the Bellman operator is linear in ϕ:
M
[ThM Q](s, a) = ϕ(s, a), θQ

for some θQM


∈ Rd . In particular, this implies that for any policy π = (π1 , . . . , πH ),
QM,π
functions √ h are linear in ϕ for every h. Finally, for Q : S × A → [0, 1], it holds that
θQ ≤ 2 d.
M


As a special case, this lemma implies that for low-rank MDPs, QM
h is linear for all π.

Proof of Lemma 25. We have


X
[ThM Q](s, a) = ⟨ϕ(s, a), whM ⟩ + PhM (s′ | s, a) max

Q(s′ , a′ ) (7.4)
a
s′
X

= ⟨ϕ(s, a), whM ⟩ + ϕ(s, a), µM
h (s ) max

Q(s′ , a′ ) (7.5)
a
s′
* +
X
M M ′ ′ ′
= ϕ(s, a), wh + µh (s ) max

Q(s , a ) . (7.6)
a
s′
 
The second statement follows since QM,π
h = ThM QM,π
h+1 . For the last statement,

M
X
′ ′

θQ ≤ ∥whM ∥ + µM
h (s )Q(s ) ≤ 2 d, (7.7)
s′

h is a vector of distributions on S.
since µM

7.2.1 The LSVI-UCB Algorithm


To provide regret bounds for the low-rank MDP model, we analyze an algorithm called
LSVI-UCB (“Least Squares Value Iteration UCB”), which was introduced and analyzed in

132
the influential paper of Jin et al. [48]. Similar to the UCB-VI algorithm we analyzed for
tabular RL, the main idea behind the algorithm is to compute a state-action value Qt with
the optimistic property that
,⋆
Qth (s, a) ≥ QM
h (s, a)
for all s, a, h. This is achieved by combining the principle of dynamic programming with
an appropriate choice of bonus to ensure optimism. However, unlike UCB-VI, the algo-
rithm does not directly estimate transition probabilities (which is not feasible when µM is
unknown), and instead implements approximate value iteration by solving a certain least
squares objective.
LSVI-UCB
Input: R, ρ > 0
for t = 1, . . . , T do
Let QtH+1 ≡ 0.
for h = H, . . . , 1 do
Compute least-squares estimator
X 2
θbht = arg min ⟨ϕ(sih , aih ), θ⟩ − rhi − max Qth+1 (sih+1 , a) ,
a
θ∈Bd2 (ρ) i<t

b t (s, a) := ϕ(s, a), θbt .


and let Q h h
Define X
Σth = ϕ(sih , aih )ϕ(sih , aih )⊤ + I.
i<t

Compute bonus:

bth,δ (s, a) = R∥ϕ(s, a)∥(Σt )−1 .
h

Compute optimistic value function:


n o
Qth (s, a) = Q b t (s, a) + bt (s, a) ∧ 1.
h h,δ

Set V th (s) = maxa∈A Qth (s, a) and π


bht (s) = arg maxa∈A Qth (s, a).
Collect trajectory (st1 , at1 , r1t ), . . . , (stH , atH , rH
t
) according to π
bt .

In more detail, for each episode t, the algorithm computes Qt1 , . . . , QtH through approximate
dynamic programming. At layer h, given Qth+1 , the algorithm computes a linear Q-function
Qb t (s, a) := ϕ(s, a), θbt , by solving a least squares problem in which X = ϕ(sh , ah ) is the
h h
feature vector and Y = rh + maxa Qth+1 (sh+1 , a) is the target/outcome. This is motivated
 
by Lemma 25, which asserts that the Bellman backup ThM Qth+1 (s, a) is linear. Given Q bt ,
h
the algorithm forms the optimistic estimate Qth via
n o
Qth (s, a) = Qb t (s, a) + bt (s, a) ∧ 1,
h h,δ

where
√ X
bth,δ (s, a) = R∥ϕ(s, a)∥(Σt )−1 , with Σth = ϕ(sih , aih )ϕ(sih , aih )⊤ + I,
h
i<t

133
is an elliptic bonus analogous to the bonus used within LinUCB. With this, the algorithm
proceeds to the next layer h−1. Once Qt is computed for every layer, the algorithm executes
the optimistic policy π
bt given by π
bht (s) = arg maxa∈A Qth (s, a).
The LSVI-UCB algorithm enjoys the following regret bound.

Proposition 46: If any δ > 0,√if we set R = c · d2 log(HT /δ) for a sufficiently large
numerical constant c and ρ = 2 d, LSVI-UCB has that with probability at least 1 − δ,
p
Reg ≲ H d3 · T log(HT /δ). (7.8)

7.2.2 Proof of Proposition 46


The starting point of our analysis for UCB-VI was Lemma 15, which states that it is sufficient
,⋆
to construct optimistic estimates {Q1 , . . . , QH } (i.e. QM
h ≤ Qh ) such that the Bellman
M ,b
π
 
residuals E (Qh − Th Qh+1 )(sh , ah ) are small under the greedy (with respect to Q’s)
M

policy π
b. In order to control these residuals, we constructed an estimated model M c and
c
defined empirical Bellman operators Th in terms of estimated transition kernels. We then
M

c
set Qh to be the empirical Bellman backup ThM Qh+1 , plus an optimistic bonus term. In
contrast, LSVI-UCB does not directly estimate the model. Instead, it performs regression
with a target that is an empirical Bellman backup. As we shall see shortly, subtleties arise
in the analysis of this regression step due to lack of independence.

Technical lemmas for regression. Recall from Lemma 25 that for any fixed Q : S×A →
R,
h i
EM rhi + max Q(sih+1 , a) | sih , aih = [ThM Q](sih , aih ). (7.9)
a

However, for layer h, the regression problem within LSVI-UCB concerns a data-dependent
function Q = Qth+1 (with i < t), which is chosen as a function of all the trajectories
τ 1 , . . . , τ t−1 . This dependence implies that the regression problem solved by LSVI-UCB is
not of the type studied, say, in Proposition 1. Instead, in the language of Section 1.4, the
mean of the outcome variable is itself a function that depends on all the data. The saving
grace here is that this dependence does not result in arbitrarily complex functions, which
will allow us to appeal to uniform convergence arguments. In particular, for every h and t,
Qth belongs to the class
n n √ o √ o
Q := (s, a) 7→ ⟨θ, ϕ(s, a)⟩ + R∥ϕ(s, a)∥(Σ)−1 ∧ 1 : ∥θ∥ ≤ 2 d, σmin (Σ) ≥ 1 . (7.10)

To make use of this fact, we first state an abstract result concerning regression with depen-
dent outcomes.

Lemma 26: Let G be an abstract set with |G| < ∞. Let x1 , . . . , xT ∈ X be fixed, and
for each g ∈ G, let y1 (g), . . . , yT (g) ∈ R be 1-subGaussian outcomes satisfying

E[yi (g) | xi ] = fg (xi )

134
for fg ∈ F ⊆ {f : X → R}.a In addition, assume that y1 (g), . . . , yT (g) are conditionally
independent given x1 , . . . , xT . For any latent g ∈ G, define the least-squares solution
T
X
fbg ∈ arg min (yi (g) − f (xi ))2 .
f ∈F i=1

With probability at least 1 − δ, simultaneously for all g ∈ G,


T
X
(fbg (xi ) − fg (xi ))2 ≲ log(|F||G|/δ).
i=1
a
The random variables {yi (g)}g∈G may be correlated.

Proof of Lemma 26. Fix g ∈ G. To shorten the notation, it is useful toP


introduce empirical
norms ∥f ∥T = T i=1 f (xi ) and empirical inner product ⟨f, f ⟩T = Ti=1 f (xi )f ′ (xi ) for
2 1 PT 2 ′

f, f ′ ∈ F. Optimality of fbg implies that


T
X T
X
(yi (g) − fbg (xi ))2 ≤ (yi (g) − fg (xi ))2
i=1 i=1
2
which can be written succinctly (with a slight abuse of notation) as Yg − fbg T
≤ ∥Yg − fg ∥2T
for Yg = (y1 (g), . . . , yT (g)). This implies
2
fbg − fg T
≤ 2 Yg − fg , fbg − fg T
.

Dividing both sides by fbg − fg T


and taking supremum over fbg ∈ F leads to
f − fg
fbg − fg ≤ 2 max Yg − fg , . (7.11)
T
f ∈F f − fg T
T

The random vector Yg −fg has independent zero-mean 1-subGaussian entries by assumption,
f −fg √
while the multiplier is simply a T -dimensional vector of Euclidean length T , for
f −fg
T
each f ∈ F. Hence, each inner product in (7.11) is a sub-Gaussian vector with variance
proxy T1 (see Definition 2). Thus, with probability at least 1 − δ, the maximum on the
p
right-hand side does not exceed C log(|F|/δ)/T for an appropriate constant C. Taking
the union bound over g and squaring both sides of (7.11) yields the desired bound.

We may now apply Lemma 26 to analyze the regression step of LSVI-UCB.

Lemma 27: With probability at least 1 − δ, we have that for all t and h,
X  i i 2
≲ d2 log(HT /δ).

b t (si , ai ) − T M Qt
Q h h h h h+1 (sh , ah ) (7.12)
i<t

Proof sketch for Lemma 27. Let t ∈ [T ] and h ∈ [H] be fixed. To make the correspondence
with Lemma 26 explicit, for the data (sih , aih , sih+1 , rhi ), we define xi = ϕ(sih , aih ) and yi (Q) =
rhi + maxa Q(sih+1 , a), with Q ∈ Q playing the role of the index g ∈ G. With this, we have
h i
E[yi (Q) | xi ] = EM rhi + max Q(sih+1 , a) | sih , aih = [ThM Q](sih , aih ) = ϕ(sih , aih ), θQ
M
a

135
which is linear in xi = ϕ(sih , aih ), with the vector of coefficients θQ
M
depending on Q. The
regression problem is well-specified as long as we choose
n √ o
F = ϕ(s, a) 7→ ⟨ϕ(s, a), θ⟩ : ∥θ∥ ≤ 2 d

and Q as in (7.10). While both of these sets are infinite, we can to a standard covering
number argument for an appropriate scale ε. The cardinalities of ε-discretized classes can
be shown to be of size O(d)
e e 2 ), respectively, up to factors logarithmic in 1/ε and
and O(d
d. The statement follows after checking that discretization incurs a small price due to
Lipschitzness with respect to parameters. Finally, we union bound over t and h.

Establishing optimism. The next lemma shows that closeness of the regression estimate
to the Bellman backup on the data {(sih , aih )}i<t translates into closeness at an arbitrary
(s, a) pair as long as ϕ(s, a) is sufficiently covered by the data collected so far. This, in turn,
implies that Qt1 , . . . , QtH are optimistic.

Lemma 28: Whenever the event in Lemma 27 occurs, we have that for all (s, a, h)
and t ∈ [T ],
  p
b t (s, a) − T M Qt
Q h h h+1 (s, a) ≲ d2 log(HT /δ) · ∥ϕ(s, a)∥(Σt )−1 =: bth,δ (s, a). (7.13)
h

and
,⋆
Qth (s, a) ≥ QM
h (s, a). (7.14)

Proof of Lemma 28. Writing the Bellman backup, via Lemma 25, as
 M t 
Th Qh+1 (s, a) = ⟨ϕ(s, a), θht ⟩

for some θht ∈ Rd with ∥θht ∥2 ≤ 2 d, we have that
  D E
b t (s, a) − T M Qt (s, a) = ϕ(s, a), θbt − θt
Q h h h h h
D E
= (Σth )−1/2 ϕ(s, a), (Σth )1/2 (θbht − θht )

≤ ϕ(s, a) (Σth )−1


· θbht − θht Σth
.

Lemma 27 then implies (7.13), since


!
2
X
t t t t T ⊤
θbh − θh Σt = (θbh − θh ) ϕ(sh , ah )ϕ(sh , ah ) + I (θbht − θht )
i i i i
(7.15)
h
i<t
X   i i 2 2
= b t (si , ai ) − T M Qt
Q + θbht − θht
h h h h h+1 (sh , ah ) (7.16)
i<t

2
and θbht − θht ≲ d by (7.7).
To show (7.14), we proceed by induction on V th ≥ VhM ,⋆ , as in the proof of Lemma
M ,⋆
16. We start with the base case h = H + 1, which has V tH+1 = VH+1 ≡ 0. Assuming

136
M ,⋆ M ,⋆ ,⋆
V th+1 ≥ Vh+1 , we first observe that ThM is monotone and ThM V th+1 ≥ ThM Vh+1 = QM
h .
Hence,
bt = Q
Q b t − T M V h+1 + T M V h+1 (7.17)
h h h h
≥ Q − T V h+1 + QM ,⋆
b t
h
M
h h (7.18)
t M ,⋆
≥ −b h,δ + Qh (7.19)
and thus Qb t + bth,δ ≥ QM ,⋆ . Since QM ,⋆ ≤ 1, the clipped version Qt also satisfies Qt ≥ QM ,⋆ .
h h h h h h
This, in turn, implies V th ≥ VhM ,⋆ .

Finishing the proof. With the technical results above established, the proof of Proposition 46 follows fairly quickly.

Proof of Proposition 46. Let M be the true model. Condition on the event in Lemma 27. Then, since Q^t is optimistic by Lemma 28, we have that for each timestep t,

    f^M(π_M) − f^M(π̂^t) ≤ E_{s_1∼d_1}[V^t_1(s_1)] − f^M(π̂^t) = ∑_{h=1}^{H} E^{M,π̂^t}[ Q^t_h(s_h, a_h) − [T^M_h Q^t_{h+1}](s_h, a_h) ]

by Lemma 14. Using the definition of b^t_{h,δ} and Lemma 28, we have

    ∑_{h=1}^{H} E^{M,π̂^t}[ Q^t_h(s_h, a_h) − [T^M_h Q^t_{h+1}](s_h, a_h) ] ≲ √R · ∑_{h=1}^{H} E^{M,π̂^t}[ ‖ϕ(s_h, a_h)‖_{(Σ^t_h)^{-1}} ].

Summing over all timesteps t gives

    Reg ≤ √R · ∑_{t=1}^{T} ∑_{h=1}^{H} E^{M,π̂^t}[ ‖ϕ(s_h, a_h)‖_{(Σ^t_h)^{-1}} ].

By Hoeffding's inequality, we have that with probability at least 1 − δ, this is at most

    √R · ∑_{t=1}^{T} ∑_{h=1}^{H} ‖ϕ(s^t_h, a^t_h)‖_{(Σ^t_h)^{-1}} + √(RHT log(1/δ)).

The elliptic potential lemma (Lemma 11) now allows us to bound

    ∑_{t=1}^{T} ‖ϕ(s^t_h, a^t_h)‖_{(Σ^t_h)^{-1}} ≲ √(dT log(T/d))

for each h, which gives the result.
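
As a quick numerical sanity check on the last step (an illustration, not part of the proof), the snippet below feeds an arbitrary stream of unit-norm features through the recursion Σ_t = I + ∑_{s<t} ϕ_s ϕ_s^⊤ and compares ∑_t ‖ϕ_t‖_{Σ_t^{-1}} to the √(dT log(1 + T/d)) scaling promised by the elliptic potential lemma.

    import numpy as np

    rng = np.random.default_rng(0)
    d, T = 10, 2000

    # An arbitrary sequence of unit-norm feature vectors (the lemma holds for any such sequence).
    phis = rng.normal(size=(T, d))
    phis /= np.linalg.norm(phis, axis=1, keepdims=True)

    Sigma = np.eye(d)       # Sigma_t = I + sum_{s<t} phi_s phi_s^T
    potential = 0.0
    for phi in phis:
        potential += np.sqrt(phi @ np.linalg.solve(Sigma, phi))   # ||phi_t||_{Sigma_t^{-1}}
        Sigma += np.outer(phi, phi)

    # The accumulated potential is at most a constant multiple of sqrt(d * T * log(1 + T/d)).
    print(potential, np.sqrt(d * T * np.log(1 + T / d)))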

7.3 Bellman Rank


In this section, we continue our study of value-based methods, which assume access to a

class Q of state-action value functions such that QM ,⋆ ∈ Q. In the prequel, we saw that
the Low-Rank MDP assumption facilitates sample-efficient reinforcement learning when Q
is a class of linear functions, but what if we want to learn with nonlinear functions such as
neural networks? To this end, we will introduce a new structural property, Bellman rank
[46, 34, 49], which allows for sample-efficient learning with general classes Q, and subsumes
a number of well-studied MDP families, including:

• Low-Rank MDPs [87, 48, 5].
• Block MDPs and reactive POMDPs [55, 33].
• MDPs with Linear Q⋆ and V⋆ [34].
• MDPs with low occupancy complexity [34].
• Linear mixture MDPs [65, 15].
• Linear dynamical systems (LQR) [30].

We will learn about these examples in Section 7.3.3.

Building intuition. Bellman rank is a property of the underlying MDP M^⋆ which gives a way of controlling distribution shift—that is, how many times a deliberate algorithm can be surprised by a substantially new state distribution d^{M,π} when it updates its policy. To motivate the property, let us revisit the low-rank MDP model. Let M be a low-rank MDP with feature map ϕ(s, a) ∈ R^d, and let Q_h(s, a) = ⟨ϕ(s, a), θ^Q_h⟩ be an arbitrary linear value function. Observe that since M is a Low-Rank MDP, we have [T^M_h Q](s, a) = ⟨ϕ(s, a), θ̃^{M,Q}_h⟩, where θ̃^{M,Q}_h := w^M_h + ∫ μ^M_h(s′) max_{a′} Q_{h+1}(s′, a′) ds′. As a result, for any policy π, we can write the Bellman residual for Q as

    E^{M,π}[ Q_h(s_h, a_h) − r_h − max_a Q_{h+1}(s_{h+1}, a) ] = ⟨ E^{M,π}[ϕ(s_h, a_h)], θ^Q_h − w^M_h − θ̃^{M,Q}_h ⟩    (7.20)
        = ⟨ X^M_h(π), W^M_h(Q) ⟩,    (7.21)

where X^M_h(π) := E^{M,π}[ϕ(s_h, a_h)] ∈ R^d is an "embedding" that depends on π but not Q, and W^M_h(Q) := θ^Q_h − w^M_h − θ̃^{M,Q}_h ∈ R^d is an embedding that depends on Q but not π (both embeddings depend on M). Notably, if we view the Bellman residual as a huge Π × Q matrix E_h(·, ·) ∈ R^{Π×Q} with

    E_h(π, Q) := E^{M,π}[ Q_h(s_h, a_h) − (r_h + max_a Q_{h+1}(s_{h+1}, a)) ],    (7.22)

then the property (7.21) implies that rank(E_h(·, ·)) ≤ d. Bellman rank is an abstraction of this property.^16

Definition 8 (Bellman rank): For an MDP M with value function class Q and policy class Π, the Bellman rank is defined as

    d_B(M) = max_{h∈[H]} rank( {E_h(π, Q)}_{π∈Π, Q∈Q} ).    (7.23)

Equivalently, Bellman rank is the smallest dimension d such that for all h, there exist embeddings X^M_h(π), W^M_h(Q) ∈ R^d such that

    E_h(π, Q) = ⟨ X^M_h(π), W^M_h(Q) ⟩    (7.24)

for all π ∈ Π and Q ∈ Q.

^16 Bellman rank was originally introduced in the pioneering work of Jiang et al. [46]. The definition of Bellman rank we present, which is slightly different from the original definition, is taken from the later work of Du et al. [34], Jin et al. [49], and is often referred to as Q-type Bellman rank.

The utility of Bellman rank is that the factorization in (7.24) gives a way of controlling
distribution shift in the MDP M , which facilitates the application of standard generaliza-
tion guarantees for supervised learning/estimation. Informally, there are only d effective
directions in which we can be “surprised” by the state distribution induced by a policy π,
to the extent that this matters for the class Q under consideration; this property was used
implicitly in the proof of the regret bound for LSVI-UCB. As we will see, low Bellman rank
is satisfied in many settings that go beyond the Low-Rank MDP model.

7.3.1 The BiLinUCB Algorithm


We now present an algorithm, BiLinUCB [34], which attains low regret for MDPs with low Bellman rank under the realizability assumption that Q^{M^⋆,⋆} ∈ Q.

Like many of the algorithms we have covered, BiLinUCB is based on confidence sets and
optimism, though the way we will construct the confidence sets and implement optimism is
new.

PAC versus regret. For technical reasons, we will not directly give a regret bound for BiLinUCB. Instead, we will prove a PAC ("probably approximately correct") guarantee. For PAC, the algorithm plays for T episodes, then outputs a final policy π̂, and its performance is measured via

    f^{M^⋆}(π_{M^⋆}) − f^{M^⋆}(π̂).    (7.25)

That is, instead of considering cumulative performance as with regret, we are only concerned with final performance. For PAC, we want to ensure that f^{M^⋆}(π_{M^⋆}) − f^{M^⋆}(π̂) ≤ ε for some ε ≪ 1 using a number of episodes that is polynomial in ε^{-1} and other problem parameters. This is an easier task than achieving low regret: If we have an algorithm that ensures that E[Reg] ≲ √(CT) for some problem-dependent constant C, we can turn this into an algorithm that achieves PAC error ε using O(C/ε^2) episodes via online-to-batch conversion. In the other direction, if we have an algorithm that achieves PAC error ε using O(C/ε^2) episodes, one can use this to achieve E[Reg] ≲ C^{1/3} T^{2/3} using a simple explore-then-commit approach; this is lossy, but is the best one can hope for in general.

Algorithm overview. BiLinUCB proceeds in K iterations, each of which consists of n episodes. The algorithm maintains a confidence set Q^k ⊆ Q of value functions (generalizing the confidence sets we constructed for structured bandits in Section 4), with the property that Q^{M^⋆,⋆} ∈ Q^k with high probability. Each iteration k consists of two parts:

• Given the current confidence set Q^k, the algorithm computes a value function Q^k and corresponding policy π^k := π_{Q^k} that is optimistic on average:

    Q^k = arg max_{Q∈Q^k} E_{s_1∼d_1}[ Q_1(s_1, π_Q(s_1)) ].

The main novelty here is that we are only aiming for optimism with respect to the initial state distribution.

• Using the new policy π^k, the algorithm gathers n episodes and uses these to compute estimators {Ê^k_h(Q)}_{h∈[H]} which approximate the Bellman residual E_h(π^k, Q) for all Q ∈ Q. Then, in (7.26), the algorithm computes the new confidence set Q^{k+1} by restricting to value functions for which the estimated Bellman residual is small for π^1, …, π^k. Eliminating value functions with large Bellman residual is a natural idea, because we know from the Bellman equation that Q^{M^⋆,⋆} has zero Bellman residual.

BiLinUCB
Input: β > 0, iteration count K ∈ N, batch size n ∈ N.
Q^1 ← Q.
for iteration k = 1, …, K do
    Compute optimistic value function

        Q^k = arg max_{Q∈Q^k} E_{s_1∼d_1}[ Q_1(s_1, π_Q(s_1)) ],

    and let π^k := π_{Q^k}.
    for l = 1, …, n do
        Execute π^k for an episode and observe trajectory (s^{k,l}_1, a^{k,l}_1, r^{k,l}_1), …, (s^{k,l}_H, a^{k,l}_H, r^{k,l}_H).
    Compute confidence set

        Q^{k+1} = { Q ∈ Q | ∑_{i≤k} (Ê^i_h(Q))^2 ≤ β  ∀h ∈ [H] },    (7.26)

    where

        Ê^i_h(Q) := (1/n) ∑_{l=1}^{n} ( Q_h(s^{i,l}_h, a^{i,l}_h) − r^{i,l}_h − max_{a∈A} Q_{h+1}(s^{i,l}_{h+1}, a) ).

Let k̂ = arg max_{k∈[K]} V̂^k, where V̂^k := (1/n) ∑_{l=1}^{n} ∑_{h=1}^{H} r^{k,l}_h.
Return π̂ = π^{k̂}.
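
For a finite class Q, the two computational steps of each BiLinUCB iteration are easy to write down explicitly. The sketch below is an illustration only: the names Q_set, residual_history, init_states, and actions are placeholders, data collection with π^k is left to the environment, and each Q is represented as a callable Q(h, s, a) with π_Q acting greedily. It implements the confidence-set update (7.26) and the average-optimism arg max.

    import numpy as np

    def estimate_residuals(Q, trajectories, H, actions):
        """Estimated Bellman residuals E_hat_h(Q), h = 1..H, from n trajectories of one policy.

        Each trajectory is a list of H tuples (s_h, a_h, r_h, s_{h+1}); the layer-(H+1)
        backup is taken to be zero, matching the convention Q_{H+1} = 0.
        """
        n = len(trajectories)
        resid = np.zeros(H)
        for traj in trajectories:
            for h, (s, a, r, s_next) in enumerate(traj, start=1):
                backup = 0.0 if h == H else max(Q(h + 1, s_next, b) for b in actions)
                resid[h - 1] += (Q(h, s, a) - r - backup) / n
        return resid

    def bilinucb_iteration(Q_set, residual_history, init_states, actions, beta):
        """One iteration: confidence set (7.26), then the on-average-optimistic Q.

        residual_history[q] holds the residual vectors E_hat^i(Q_set[q]) from earlier iterations.
        """
        confidence_set = []
        for q in range(len(Q_set)):
            past = np.asarray(residual_history[q])   # shape (k-1, H); empty at the first iteration
            if past.size == 0 or np.all(np.sum(past ** 2, axis=0) <= beta):
                confidence_set.append(q)

        def avg_initial_value(q):
            # Empirical version of E_{s_1 ~ d_1}[Q_1(s_1, pi_Q(s_1))], with pi_Q greedy.
            return np.mean([max(Q_set[q](1, s, a) for a in actions) for s in init_states])

        q_star = max(confidence_set, key=avg_initial_value)
        return q_star, confidence_set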

Main guarantee. The main result for this section is the following PAC guarantee for BiLinUCB.

Proposition 47: Suppose that M^⋆ has Bellman rank d and Q^{M^⋆,⋆} ∈ Q. For any ε > 0 and δ > 0, if we set n ≳ H^3 d log(|Q|/δ)/ε^2, K ≳ Hd log(1 + n/d), and β ∝ (K log|Q| + log(HK/δ))/n, then BiLinUCB learns a policy π̂ such that

    f^{M^⋆}(π_{M^⋆}) − f^{M^⋆}(π̂) ≤ ε

with probability at least 1 − δ, and does so using

    Õ( H^4 d^2 log(|Q|/δ) / ε^2 )

episodes.

This result shows that low Bellman rank suffices to learn a near-optimal policy, with sample
complexity that only depends on the rank d, the horizon H, and the capacity log|Q| for the
value function class; this reflects that the algorithm is able to generalize across the state
space, with d and log|Q| controlling the degree of generalization. The basic principles at
play are:

• By choosing Q^k optimistically, we ensure that the suboptimality of the algorithm is controlled by the Bellman residual for Q^k, on-policy, similar to what we saw for UCB-VI and LSVI-UCB. An important difference compared to the LSVI-UCB algorithm we covered in the previous section is that BiLinUCB is only optimistic "on average" with respect to the initial state distribution, i.e.,

    E_{s_1∼d_1}[ Q^k_1(s_1, π_{Q^k}(s_1)) ] ≥ f^{M^⋆}(π_{M^⋆}),

while LSVI-UCB aims to find a value function that is uniformly optimistic for all states and actions.

• The confidence set construction (7.26) explicitly removes value functions that have large Bellman residual on the policies encountered so far. The key role of the Bellman rank property is to ensure that there are only Õ(d) "effective" state distributions that lead to substantially different values for the Bellman residual, which means that eventually, only value functions with low residual will remain.

Interestingly, the Bellman rank property is only used for analysis, and the algorithm does
not explicitly compute or estimate the factorization.

Regret bounds. The BiLinUCB algorithm can be lifted to provide a regret guarantee via an explore-then-commit strategy: Run the algorithm for T_0 episodes to learn an ε-optimal policy, then commit to this policy for the remaining rounds. It is a simple exercise to show that by choosing T_0 appropriately, this approach gives

    Reg ≤ Õ( (H^4 d^2 log(|Q|/δ))^{1/3} · T^{2/3} ).

Under an additional assumption known as Bellman completeness, it is possible to attain √T regret with a variant of this algorithm that uses a slightly different confidence set construction [49].
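
To see where the T^{2/3} rate comes from (a short back-of-the-envelope calculation, under the normalization that values lie in [0, 1], so that each episode contributes at most 1 to the regret): running BiLinUCB for the first T_0 episodes produces a policy with suboptimality ε(T_0) ≈ √(C/T_0), where C := Õ(H^4 d^2 log(|Q|/δ)) is the sample complexity from Proposition 47, and committing to that policy afterwards gives

    Reg ≲ T_0 + ε(T_0) · T = T_0 + √(C/T_0) · T.

Balancing the two terms with T_0 ≈ C^{1/3} T^{2/3} yields Reg ≲ C^{1/3} T^{2/3}, which is the bound displayed above.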

7.3.2 Proof of Proposition 47


Recall from the definition of Bellman rank that there exist embeddings X^{M^⋆}_h(π), W^{M^⋆}_h(Q) ∈ R^d such that for all π ∈ Π and Q ∈ Q,

    E_h(π, Q) = ⟨ X^{M^⋆}_h(π), W^{M^⋆}_h(Q) ⟩.

We assume throughout this proof that ‖X^{M^⋆}_h(π)‖, ‖W^{M^⋆}_h(Q)‖_2 ≤ 1 for simplicity.
≤ 1 for simplicity.

Technical lemmas. Before proceeding, we state two technical lemmas. The first lemma
establishes validity for the confidence set Qk constructed by BiLinUCB.

Lemma 29: For any δ > 0, if we set β = c · (K log|Q| + log(HK/δ))/n, where c > 0 is a sufficiently large absolute constant, then with probability at least 1 − δ, for all k ∈ [K]:

1. All Q ∈ Q^k have

    ∑_{i<k} (E_h(π^i, Q))^2 ≲ β  ∀h ∈ [H].    (7.27)

2. Q^{M^⋆,⋆} ∈ Q^k.
2. QM ∈ Qk .

Proof of Lemma 29. Using Hoeffding's inequality and a union bound (Lemma 3), we have that with probability at least 1 − δ, for all k ∈ [K], h ∈ [H], and Q ∈ Q,

    | Ê^k_h(Q) − E_h(π^k, Q) | ≤ C · √( log(|Q|HK/δ) / n ),    (7.28)

where C is an absolute constant.
    To prove Part 1, we observe that for all k, using the AM-GM inequality, we have that for all Q ∈ Q,

    ∑_{i<k} (E_h(π^i, Q))^2 ≤ 2 ∑_{i<k} (Ê^i_h(Q))^2 + 2 ∑_{i<k} (E_h(π^i, Q) − Ê^i_h(Q))^2.

For Q ∈ Q^k, the definition of Q^k implies that ∑_{i<k} (Ê^i_h(Q))^2 ≤ β, while (7.28) implies that ∑_{i<k} (E_h(π^i, Q) − Ê^i_h(Q))^2 ≲ β, which gives the result.
    For Part 2, we similarly observe that for all k, h and Q ∈ Q,

    ∑_{i<k} (Ê^i_h(Q))^2 ≤ 2 ∑_{i<k} (E_h(π^i, Q))^2 + 2 ∑_{i<k} (E_h(π^i, Q) − Ê^i_h(Q))^2.

Since Q^{M^⋆,⋆} has E_h(π, Q^{M^⋆,⋆}) = 0 for all π by Bellman optimality, we have

    ∑_{i<k} (Ê^i_h(Q^{M^⋆,⋆}))^2 ≤ 2 ∑_{i<k} (E_h(π^i, Q^{M^⋆,⋆}) − Ê^i_h(Q^{M^⋆,⋆}))^2 ≤ 2C^2 · K log(|Q|HK/δ) / n,

where the last inequality uses (7.28). It follows that as long as β ≥ 2C^2 K log(|Q|HK/δ)/n, we will have Q^{M^⋆,⋆} ∈ Q^k for all k.

The next result shows that whenever the event in the previous lemma holds, the value
functions constructed by BiLinUCB are optimistic.

Lemma 30: Whenever the event in Lemma 29 occurs, the following properties hold:

1. Define

    Σ^k_h = ∑_{i<k} X^{M^⋆}_h(π^i) X^{M^⋆}_h(π^i)^⊤.    (7.29)

For all k ∈ [K], all Q ∈ Q^k satisfy

    ‖W^{M^⋆}_h(Q)‖^2_{Σ^k_h} ≲ β.    (7.30)

2. For all k, Q^k is optimistic in the sense that

    E_{s_1∼d_1}[ Q^k_1(s_1, π_{Q^k}(s_1)) ] ≥ E_{s_1∼d_1}[ Q^{M^⋆,⋆}_1(s_1, π_{M^⋆}(s_1)) ] = f^{M^⋆}(π_{M^⋆}).    (7.31)

Proof of Lemma 30. For Part 1, recall that by the Bellman rank property, we can write E_h(π^k, Q) = ⟨X^{M^⋆}_h(π^k), W^{M^⋆}_h(Q)⟩, so that (7.27) implies that

    ‖W^{M^⋆}_h(Q)‖^2_{Σ^k_h} = ∑_{i<k} ⟨X^{M^⋆}_h(π^i), W^{M^⋆}_h(Q)⟩^2 = ∑_{i<k} (E_h(π^i, Q))^2 ≲ β.

For Part 2, we observe that for all k, since Q^{M^⋆,⋆} ∈ Q^k, we have

    E_{s_1∼d_1}[ Q^k_1(s_1, π_{Q^k}(s_1)) ] = sup_{Q∈Q^k} E_{s_1∼d_1}[ Q_1(s_1, π_Q(s_1)) ]
        ≥ E_{s_1∼d_1}[ Q^{M^⋆,⋆}_1(s_1, π_{M^⋆}(s_1)) ]
        = f^{M^⋆}(π_{M^⋆}).

Proving the main result. Equipped with the lemmas above, we prove Proposition 47.

Proof of Proposition 47. We first prove a generic bound on the suboptimality of each policy π^k for k ∈ [K]. Let us condition on the event in Lemma 29, which occurs with probability at least 1 − δ. Whenever this event occurs, Lemma 30 implies that Q^k is optimistic, so we can bound

    f^{M^⋆}(π_{M^⋆}) − f^{M^⋆}(π^k) ≤ E_{s_1∼d_1}[ Q^k_1(s_1, π_{Q^k}(s_1)) ] − f^{M^⋆}(π^k)    (7.32)
        = ∑_{h=1}^{H} E^{M^⋆,π^k}[ Q^k_h(s_h, a_h) − r_h − max_{a∈A} Q^k_{h+1}(s_{h+1}, a) ]    (7.33)
        = ∑_{h=1}^{H} ⟨ X^{M^⋆}_h(π^k), W^{M^⋆}_h(Q^k) ⟩,    (7.34)

where the first equality uses the Bellman residual decomposition (Lemma 14), and the second equality uses the Bellman rank assumption. For any λ ≥ 0, using Cauchy–Schwarz, we can bound

    ∑_{h=1}^{H} ⟨ X^{M^⋆}_h(π^k), W^{M^⋆}_h(Q^k) ⟩ ≤ ∑_{h=1}^{H} ‖X^{M^⋆}_h(π^k)‖_{(λI+Σ^k_h)^{-1}} · ‖W^{M^⋆}_h(Q^k)‖_{λI+Σ^k_h}.

For each h ∈ [H], applying the bound in (7.30) gives

    ‖W^{M^⋆}_h(Q^k)‖_{λI+Σ^k_h} ≤ √( λ‖W^{M^⋆}_h(Q^k)‖^2_2 + β ) ≤ λ^{1/2} + β^{1/2},

where we have used that ‖W^{M^⋆}_h(Q^k)‖_2 ≤ 1 by assumption. This allows us to bound

    ∑_{h=1}^{H} ⟨ X^{M^⋆}_h(π^k), W^{M^⋆}_h(Q^k) ⟩ ≲ (λ^{1/2} + β^{1/2}) · ∑_{h=1}^{H} ‖X^{M^⋆}_h(π^k)‖_{(λI+Σ^k_h)^{-1}}.    (7.35)

If we can find a policy π^k for which the right-hand side of (7.35) is small, this policy will be guaranteed to have low regret. The following lemma shows that such a policy is guaranteed to exist.
Lemma 31: For any λ > 0, as long as K ≥ Hd log(1 + λ^{-1}K/d), there exists k ∈ [K] such that

    ‖X^{M^⋆}_h(π^k)‖^2_{(λI+Σ^k_h)^{-1}} ≲ Hd log(1 + λ^{-1}K/d) / K   ∀h ∈ [H].    (7.36)

We choose λ = β, which implies that it suffices to take K ≳ Hd log(1 + n/d) to satisfy the condition in Lemma 31. By choosing k to satisfy (7.36) and plugging this bound into (7.35), we conclude that the policy π^k has

    f^{M^⋆}(π_{M^⋆}) − f^{M^⋆}(π^k) ≲ H √( β · Hd log(1 + β^{-1}K/d) / K ) ≲ Õ( H^{3/2} √( d log(|Q|/δ) / n ) ) ≲ ε    (7.37)

as desired.
    Finally, we need to argue that the policy π̂ returned by the algorithm is at least as good as π^k. This is straightforward and we only sketch the argument: By Hoeffding's inequality and a union bound, we have that with probability at least 1 − δ, for all k,

    | f^{M^⋆}(π^k) − V̂^k | ≲ √( log(K/δ) / n ),

which implies that f^{M^⋆}(π̂) ≳ f^{M^⋆}(π^k) − √( log(K/δ) / n ). The error term here is of lower order than (7.37).

Deferred proofs. To finish up, we prove Lemma 31.

Proof of Lemma 31. To prove the result, we need a variant of the elliptic potential lemma
(Lemma 11).

Lemma 32 (e.g. Lemma 19.4 in Lattimore and Szepesvári [60]): Let a_1, …, a_T ∈ R^d satisfy ‖a_t‖_2 ≤ 1 for all t ∈ [T]. Fix λ > 0, and let V_t = λI + ∑_{s<t} a_s a_s^⊤. Then

    ∑_{t=1}^{T} log( 1 + ‖a_t‖^2_{V_t^{-1}} ) ≤ d log(1 + λ^{-1}T/d).    (7.38)

For any λ > 0, applying this result for each h ∈ [H] and summing gives

    ∑_{k=1}^{K} ∑_{h=1}^{H} log( 1 + ‖X^{M^⋆}_h(π^k)‖^2_{(λI+Σ^k_h)^{-1}} ) ≤ Hd log(1 + λ^{-1}K/d).

This implies that there exists k such that

    ∑_{h=1}^{H} log( 1 + ‖X^{M^⋆}_h(π^k)‖^2_{(λI+Σ^k_h)^{-1}} ) ≤ Hd log(1 + λ^{-1}K/d) / K,

which means that for all h ∈ [H], log(1 + ‖X^{M^⋆}_h(π^k)‖^2_{(λI+Σ^k_h)^{-1}}) ≤ Hd log(1 + λ^{-1}K/d)/K, or equivalently:

    ‖X^{M^⋆}_h(π^k)‖^2_{(λI+Σ^k_h)^{-1}} ≤ exp( Hd log(1 + λ^{-1}K/d) / K ) − 1.

As long as K ≥ Hd log(1 + λ^{-1}K/d), using that e^x ≤ 1 + 2x for 0 ≤ x ≤ 1, we have

    ‖X^{M^⋆}_h(π^k)‖^2_{(λI+Σ^k_h)^{-1}} ≤ 2 · Hd log(1 + λ^{-1}K/d) / K.

7.3.3 Bellman Rank: Examples


We now highlight concrete examples of models with low Bellman rank. We start with familiar examples, then introduce new models that allow for nonlinear function approximation.
Example 7.1 (Tabular MDPs). If M is a tabular MDP with |S| ≤ S and |A| ≤ A, we can write the Bellman residual for any function Q and policy π as

    E_h(π, Q) = E^{M,π}[ Q_h(s_h, a_h) − (r_h + max_a Q_{h+1}(s_{h+1}, a)) ]
        = ∑_{s,a} d^{M,π}_h(s, a) · E^M[ Q_h(s, a) − (r_h + max_{a′} Q_{h+1}(s_{h+1}, a′)) | s_h = s, a_h = a ].

It follows that if we define

    X^M_h(π) = ( d^{M,π}_h(s, a) )_{s∈S, a∈A} ∈ R^{SA}

and

    W^M_h(Q) = ( E^M[ Q_h(s, a) − (r_h + max_{a′} Q_{h+1}(s_{h+1}, a′)) | s_h = s, a_h = a ] )_{s∈S, a∈A} ∈ R^{SA},

we have

    E_h(π, Q) = ⟨ X^M_h(π), W^M_h(Q) ⟩.

This shows that d_B(M) ≤ SA. ◁
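
The factorization above is easy to verify numerically. The following sketch (illustration only; the MDP, policies, and value functions are randomly generated placeholders) computes the occupancies d^{M,π}_h by a forward recursion, forms the matrix E_h(π, Q) over a finite collection of policies and value functions, and checks that its rank is at most SA.

    import numpy as np

    rng = np.random.default_rng(0)
    S, A, H, n_pi, n_Q = 4, 3, 5, 60, 60

    # Random tabular MDP: P[h, s, a] is a distribution over next states, R[h, s, a] in [0, 1].
    P = rng.dirichlet(np.ones(S), size=(H, S, A))
    R = rng.uniform(size=(H, S, A))
    d1 = rng.dirichlet(np.ones(S))

    policies = rng.dirichlet(np.ones(A), size=(n_pi, H, S))    # pi[h, s] = distribution over actions
    Qs = rng.uniform(size=(n_Q, H + 1, S, A))
    Qs[:, H] = 0.0                                             # convention: Q_{H+1} = 0

    def occupancy(pi, h):
        """d_h^{M,pi}(s, a), via d_{h+1}(s') = sum_{s,a} d_h(s, a) P_h(s' | s, a)."""
        d_s = d1.copy()
        for layer in range(h):
            d_sa = d_s[:, None] * pi[layer]                    # (S, A)
            d_s = np.einsum('sa,sap->p', d_sa, P[layer])
        return d_s[:, None] * pi[h]

    def bellman_residual(pi, Q, h):
        backup = R[h] + np.einsum('sap,p->sa', P[h], Q[h + 1].max(axis=1))
        return np.sum(occupancy(pi, h) * (Q[h] - backup))

    h = 2
    E = np.array([[bellman_residual(pi, Q, h) for Q in Qs] for pi in policies])
    print(np.linalg.matrix_rank(E, tol=1e-8), "<=", S * A)     # Bellman rank bound d_B <= SA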
Example 7.2 (Low-Rank MDPs). The calculation in (7.21) shows that by choosing X^M_h(π) := E^{M,π}[ϕ(s_h, a_h)] ∈ R^d and W^M_h(Q) := θ^Q_h − w^M_h − θ̃^{M,Q}_h ∈ R^d, any Low-Rank MDP M has d_B(M) ≤ d. When specialized to this setting, the regret of BiLinUCB is worse than that of LSVI-UCB (though still polynomial in all of the problem parameters). This is because BiLinUCB is a more general algorithm, and does not take advantage of an additional feature of the Low-Rank MDP model known as Bellman completeness: If M is a Low-Rank MDP, then for all Q ∈ Q, we have T^M_h Q_{h+1} ∈ Q_h. By using a more specialized relative of BiLinUCB that incorporates a modified confidence set construction to exploit completeness, it is possible to match and actually improve upon the regret of LSVI-UCB [49]. ◁

We now explore Bellman rank for some MDP families that have not already been covered.

Example 7.3 (Low Occupancy Complexity). An MDP M is said to have low occupancy complexity if there exists a feature map ϕ^M(s, a) ∈ R^d such that for all π, there exists θ^{M,π}_h ∈ R^d such that

    d^{M,π}_h(s, a) = ⟨ ϕ^M(s, a), θ^{M,π}_h ⟩.    (7.39)

Note that neither ϕ^M nor θ^{M,π} is assumed to be known to the learner. If M has low occupancy complexity, then for any value function Q and policy π, we have

    E_h(π, Q) = E^{M,π}[ Q_h(s_h, a_h) − (r_h + max_a Q_{h+1}(s_{h+1}, a)) ]
        = ∑_{s,a} d^{M,π}_h(s, a) · E^M[ Q_h(s, a) − (r_h + max_{a′} Q_{h+1}(s_{h+1}, a′)) | s_h = s, a_h = a ]
        = ∑_{s,a} ⟨ ϕ^M(s, a), θ^{M,π}_h ⟩ · E^M[ Q_h(s, a) − (r_h + max_{a′} Q_{h+1}(s_{h+1}, a′)) | s_h = s, a_h = a ]
        = ⟨ θ^{M,π}_h, ∑_{s,a} ϕ^M(s, a) · E^M[ Q_h(s, a) − (r_h + max_{a′} Q_{h+1}(s_{h+1}, a′)) | s_h = s, a_h = a ] ⟩.

It follows that if we define

    X^M_h(π) = θ^{M,π}_h

and

    W^M_h(Q) = ∑_{s,a} ϕ^M(s, a) · E^M[ Q_h(s, a) − (r_h + max_{a′} Q_{h+1}(s_{h+1}, a′)) | s_h = s, a_h = a ],

then E_h(π, Q) = ⟨X^M_h(π), W^M_h(Q)⟩, which shows that d_B(M) ≤ d.
    This setting subsumes tabular MDPs and low-rank MDPs, but is substantially more general. Notably, low occupancy complexity allows for nonlinear function approximation: As long as the occupancies satisfy (7.39), the Bellman rank is at most d for any class Q, which might consist of neural networks or other nonlinear models. ◁
We close with two more examples.

Example 7.4 (LQR). A classical problem in continuous control is the Linear Quadratic Regulator, or LQR. Here, we have S = A = R^d, and states evolve via

    s_{h+1} = A^M s_h + B^M a_h + ζ_h,

where ζ_h ∼ N(0, I), and s_1 ∼ N(0, I). We assume that rewards have the form^17

    r_h = −s_h^⊤ Q^M s_h − a_h^⊤ R^M a_h

for matrices Q^M, R^M ⪰ 0. A classical result, dating back to Kalman, is that the optimal controller for this system is a linear mapping of the form

    π_{M,h}(s) = K^M_h s,

and that the value function Q^{M,⋆}_h(s, a) = (s, a)^⊤ P^M_h (s, a) is quadratic. Hence, it suffices to take Q to be the set of all quadratic functions in (s, a). With this choice, it can be shown that d_B(M) ≤ d^2 + 1. The basic idea is to choose X^M_h(π) = (vec(E^{M,π}[s_h s_h^⊤]), 1) and use the quadratic structure of the value functions. ◁

^17 LQR is typically stated in terms of losses; we negate because we consider rewards.

Example 7.5 (Linear Q^⋆/V^⋆). In Proposition 45, we showed that for RL with linear function approximation, assuming only that Q^{M,⋆} is linear is not enough to achieve low regret. It turns out that if we assume in addition that V^{M,⋆} is linear, the situation improves.
    Consider an MDP M. Assume that there are known feature maps ϕ(s, a) ∈ R^d and ψ(s′) ∈ R^d such that

    Q^{M,⋆}_h(s, a) = ⟨ϕ(s, a), θ^M_h⟩,  and  V^{M,⋆}_h(s) = ⟨ψ(s), w^M_h⟩.

Let

    Q = { Q | Q_h(s, a) = ⟨ϕ(s, a), θ_h⟩ : θ_h ∈ R^d, ∃w such that max_{a∈A} ⟨ϕ(s, a), θ_h⟩ = ⟨ψ(s), w⟩ ∀s }.

Then d_B(M) ≤ 2d. We will not prove this result, but the basic idea is to choose X^M_h(π) = E^{M,π}[(ϕ(s_h, a_h), ψ(s_{h+1}))]. ◁

See Jiang et al. [46], Du et al. [34] for further examples.

7.3.4 Generalizations of Bellman Rank


While we gave (7.24) as the definition of Bellman rank, there are many variations on the assumption that also lead to low regret. One well-known variant is V-type Bellman rank [46], which asserts that for all π ∈ Π and Q ∈ Q,

    E^{M,π} E^M_{s_{h+1} | s_h, a_h ∼ π_Q(s_h)}[ Q_h(s_h, a_h) − r_h − max_{a∈A} Q_{h+1}(s_{h+1}, a) ] = ⟨ X^M_h(π), W^M_h(Q) ⟩.    (7.40)

This is the same as the definition (7.24) (which is typically referred to as Q-type Bellman rank), except that we take a_h = π_Q(s_h) instead of a_h = π(s_h).^18 With an appropriate modification, BiLinUCB can be shown to give sample complexity guarantees that scale with V-type Bellman rank instead of Q-type. This definition captures meaningful classes of tractable RL models that are not captured by the Q-type definition (7.24), with a canonical example being Block MDPs.

^18 The name "V-type" refers to the fact that (7.40) only depends on Q through the induced V-function s ↦ Q_h(s, π_Q(s)), while (7.24) depends on the full Q-function, hence "Q-type".

Example 7.6 (Block MDP). The Block MDP [46, 33, 64] is a model in which the ("observed") state space S is large/high-dimensional, but the dynamics are governed by a (small) latent state space Z. Formally, a Block MDP M = (S, A, P, R, H, d_1) is defined based on an (unobserved) latent state space Z, with z_h denoting the latent state at layer h. We first describe the dynamics for the latent space. Given initial latent state z_1, the latent states evolve via

    z_{h+1} ∼ P^{M,latent}_h(z_h, a_h).

The latent state z_h is not observed. Instead, we observe

    s_h ∼ q^M_h(z_h),

where q^M_h : Z → ∆(S) is an emission distribution with the property that supp(q_h(z)) ∩ supp(q_h(z′)) = ∅ if z ≠ z′. This property (decodability) ensures that there exists a unique mapping ϕ^M_h : S → Z that maps the observed state s_h to the corresponding latent state z_h. We assume that R^M_h(s, a) = R^M_h(ϕ^M_h(s), a), which implies that the optimal policy π_M depends only on the endogenous latent state, i.e. π_{M,h}(s) = π_{M,h}(ϕ^M_h(s)).
    The main challenge of learning in Block MDPs is that the decoder ϕ^M is not known to the learner in advance. Indeed, given access to the decoder, one can obtain regret poly(H, |Z|, |A|) · √T by applying tabular reinforcement learning algorithms to the latent state space. In light of this, the aim of the Block MDP setting is to obtain sample complexity guarantees that are independent of the size of the observed state space |S|, and scale as poly(|Z|, |A|, H, log|F|), where F is an appropriate class of function approximators (typically either a value function class Q containing Q^{M,⋆} or a class of decoders Φ that attempts to model ϕ^M directly).
We now show that the Block MDP setting admits low V-type Bellman rank. Observe that we can write

    E^{M,π} E^M_{s_{h+1} | s_h, a_h ∼ π_Q(s_h)}[ Q_h(s_h, π_Q(s_h)) − r_h − max_{a∈A} Q_{h+1}(s_{h+1}, a) ]
        = ∑_{z∈Z} d^{M,π}_h(z) · E_{s_h∼q^M_h(z)} E^M_{s_{h+1} | s_h, a_h ∼ π_Q(s_h)}[ Q_h(s_h, π_Q(s_h)) − r_h − max_{a∈A} Q_{h+1}(s_{h+1}, a) ].

This implies that we can take

    X^M_h(π) = ( d^{M,π}_h(z) )_{z∈Z}

and

    W^M_h(Q) = ( E_{s_h∼q^M_h(z)} E^M_{s_{h+1} | s_h, a_h ∼ π_Q(s_h)}[ Q_h(s_h, π_Q(s_h)) − r_h − max_{a∈A} Q_{h+1}(s_{h+1}, a) ] )_{z∈Z},

so that the V-type Bellman rank is at most |Z|. This means that as long as Q contains Q^{M,⋆}, we can obtain sample complexity guarantees that scale with |Z| rather than |S|, as desired. ◁

There are a number of variants of Bellman rank, including Bilinear rank [34] and
Bellman-Eluder dimension [49], which subsume and slightly generalize both Bellman rank
definitions.

7.3.5 Decision-Estimation Coefficient for Bellman Rank


An alternative to the BiLinUCB method is to appeal to the E2D meta-algorithm and
the Decision-Estimation Coefficient. The following result [40] shows that the Decision-
Estimation Coefficient is always bounded for classes with low Bellman rank.

Proposition 48: For any class of MDPs M for which all M ∈ M have Bellman rank at most d, we have

    dec_γ(M) ≲ H^2 d / γ.    (7.41)
148

This implies that the E2D meta-algorithm has E[Reg] ≲ H√(dT · Est_H) whenever we have access to a realizable model class with low Bellman rank. As a special case, for any finite class M, using averaged exponential weights as an estimation oracle gives

    E[Reg] ≲ H √( dT log|M| ).    (7.42)

We will not prove Proposition 48, but interested readers can refer to Foster et al. [40].
The result can be proven using two approaches, both of which build on the techniques we
have already covered. The first approach is to apply a more general version of the PC-IGW
algorithm from Section 6.6, which incorporates optimal design in the space of policies. The
second approach is to move to the Bayesian DEC and appeal to posterior sampling, as in
Section 4.4.2.

Value-based guarantees via optimistic estimation. In general, the model estimation complexity log|M| in (7.42) can be arbitrarily large compared to the complexity log|Q| for a realizable value function class (consider the low-rank MDP—since μ is unknown, it is not possible to construct a small model class M). To derive value-based guarantees along the lines of what BiLinUCB achieves in Proposition 47, a natural approach is to replace the Hellinger distance appearing in the DEC with a divergence tailored to value function iteration, following the development in Sections 6.7.2 and 6.7.3. One such choice is the divergence

    D^π_sbr(Q ∥ M) = ∑_{h=1}^{H} ( E^{M,π}[ Q_h(s_h, a_h) − (r_h + max_a Q_{h+1}(s_{h+1}, a)) ] )^2,

which measures the squared Bellman residual for an estimated value function under M. With this choice, we appeal to the optimistic E2D algorithm (E2D.Opt) from Section 6.7.3. One can show that the optimistic DEC for this divergence is bounded as

    o-dec^{D_sbr}_γ(M) ≲ H·d / γ.

This implies that E2D.Opt, with an appropriate choice of estimation algorithm tailored to D^π_sbr(· ∥ ·), achieves

    E[Reg] ≲ (H^2 d log|Q|)^{1/2} · T^{3/4}.

Note that due to the asymmetric nature of D^π_sbr(· ∥ ·), it is critical to appeal to optimistic estimation to derive this result. Indeed, the non-optimistic generalized DEC dec^{D_sbr}_γ does not enjoy a polynomial bound. See Foster et al. [41] for details.

[Note: This subsection will be expanded in the next version.]

References

[1] Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári. Improved algorithms for linear stochastic
bandits. In Advances in Neural Information Processing Systems, 2011.

[2] N. Abe and P. M. Long. Associative reinforcement learning using linear probabilis-
tic concepts. In Proceedings of the Sixteenth International Conference on Machine
Learning, pages 3–11. Morgan Kaufmann Publishers Inc., 1999.

[3] A. Agarwal and T. Zhang. Model-based RL with optimistic posterior sampling: Struc-
tural conditions and sample complexity. arXiv preprint arXiv:2206.07659, 2022.

[4] A. Agarwal, D. P. Foster, D. Hsu, S. M. Kakade, and A. Rakhlin. Stochastic convex


optimization with bandit feedback. SIAM Journal on Optimization, 23(1):213–240,
2013.

[5] A. Agarwal, S. Kakade, A. Krishnamurthy, and W. Sun. FLAMBE: Structural com-


plexity and representation learning of low rank MDPs. Neural Information Processing
Systems (NeurIPS), 2020.

[6] R. Agrawal. Sample mean based index policies by o (log n) regret for the multi-armed
bandit problem. Advances in applied probability, 27(4):1054–1078, 1995.

[7] S. Agrawal and N. Goyal. Analysis of thompson sampling for the multi-armed bandit
problem. In Conference on learning theory, pages 39–1. JMLR Workshop and Confer-
ence Proceedings, 2012.

[8] S. Agrawal and N. Goyal. Thompson sampling for contextual bandits with linear
payoffs. In International conference on machine learning, pages 127–135. PMLR, 2013.

[9] K. Amin, M. Kearns, and U. Syed. Bandits, query learning, and the haystack dimen-
sion. In Proceedings of the 24th Annual Conference on Learning Theory, pages 87–106.
JMLR Workshop and Conference Proceedings, 2011.

[10] M. Anthony and P. L. Bartlett. Neural network learning: Theoretical foundations.


Cambridge University Press, 2009.

[11] J.-Y. Audibert and S. Bubeck. Minimax policies for adversarial and stochastic bandits.
In COLT, volume 7, pages 1–122, 2009.

[12] P. Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of


Machine Learning Research, 3(Nov):397–422, 2002.

[13] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit
problem. Machine learning, 47(2-3):235–256, 2002.

[14] P. Auer, R. Ortner, and C. Szepesvári. Improved rates for the stochastic continuum-
armed bandit problem. In International Conference on Computational Learning The-
ory, pages 454–468. Springer, 2007.

[15] A. Ayoub, Z. Jia, C. Szepesvari, M. Wang, and L. Yang. Model-based reinforce-


ment learning with value-targeted regression. In International Conference on Machine
Learning, pages 463–474. PMLR, 2020.

[16] M. G. Azar, I. Osband, and R. Munos. Minimax regret bounds for reinforcement
learning. In International Conference on Machine Learning, pages 263–272, 2017.

[17] R. Bellman. The theory of dynamic programming. Bulletin of the American Mathe-
matical Society, 60(6):503–515, 1954.

[18] B. Bilodeau, D. J. Foster, and D. Roy. Tight bounds on minimax regret under loga-
rithmic loss via self-concordance. In International Conference on Machine Learning,
2020.

[19] S. Boucheron, O. Bousquet, and G. Lugosi. Theory of classification: A survey of some


recent advances. ESAIM: probability and statistics, 9:323–375, 2005.

[20] O. Bousquet, S. Boucheron, and G. Lugosi. Introduction to statistical learning theory.


In Advanced lectures on machine learning, pages 169–207. Springer, 2004.

[21] S. Bubeck and R. Eldan. Multi-scale exploration of convex functions and bandit convex
optimization. In Conference on Learning Theory, pages 583–589, 2016.

[22] S. Bubeck, O. Dekel, T. Koren, and Y. Peres. Bandit convex optimization: T regret
in one dimension. In Conference on Learning Theory, pages 266–278, 2015.

[23] S. Bubeck, Y. T. Lee, and R. Eldan. Kernel-based methods for bandit convex opti-
mization. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of
Computing, pages 72–85, 2017.

[24] N. Cesa-Bianchi and G. Lugosi. Minimax regret under log loss for general classes of
experts. In Conference on Computational Learning Theory, 1999.

[25] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge Univer-
sity Press, New York, NY, USA, 2006. ISBN 0521841089.

[26] W. Chu, L. Li, L. Reyzin, and R. E. Schapire. Contextual bandits with linear payoff
functions. In International Conference on Artificial Intelligence and Statistics, 2011.

[27] T. M. Cover. Universal portfolios. Mathematical Finance, 1991.

[28] V. Dani, T. P. Hayes, and S. M. Kakade. Stochastic linear optimization under bandit
feedback. In Conference on Learning Theory (COLT), 2008.

[29] C. Dann, M. Mohri, T. Zhang, and J. Zimmert. A provably efficient model-free posterior
sampling method for episodic reinforcement learning. Advances in Neural Information
Processing Systems, 34:12040–12051, 2021.

[30] S. Dean, H. Mania, N. Matni, B. Recht, and S. Tu. On the sample complexity of the
linear quadratic regulator. Foundations of Computational Mathematics, 20(4):633–679,
2020.

[31] S. Dong, B. Van Roy, and Z. Zhou. Provably efficient reinforcement learning with
aggregated states. arXiv preprint arXiv:1912.06366, 2019.

[32] D. L. Donoho and R. C. Liu. Geometrizing rates of convergence. Annals of Statistics,


1987.

[33] S. Du, A. Krishnamurthy, N. Jiang, A. Agarwal, M. Dudik, and J. Langford. Prov-
ably efficient RL with rich observations via latent state decoding. In International
Conference on Machine Learning, pages 1665–1674. PMLR, 2019.
[34] S. S. Du, S. M. Kakade, J. D. Lee, S. Lovett, G. Mahajan, W. Sun, and R. Wang. Bi-
linear classes: A structural framework for provable generalization in RL. International
Conference on Machine Learning, 2021.
[35] A. D. Flaxman, A. T. Kalai, and H. B. McMahan. Online convex optimization in the
bandit setting: gradient descent without a gradient. In Proceedings of the sixteenth
annual ACM-SIAM symposium on Discrete algorithms, pages 385–394, 2005.
[36] D. J. Foster and A. Rakhlin. Beyond UCB: Optimal and efficient contextual bandits
with regression oracles. International Conference on Machine Learning (ICML), 2020.
[37] D. J. Foster, A. Agarwal, M. Dudı́k, H. Luo, and R. E. Schapire. Practical contextual
bandits with regression oracles. International Conference on Machine Learning, 2018.
[38] D. J. Foster, S. Kale, H. Luo, M. Mohri, and K. Sridharan. Logistic regression: The
importance of being improper. Conference on Learning Theory, 2018.
[39] D. J. Foster, C. Gentile, M. Mohri, and J. Zimmert. Adapting to misspecification in
contextual bandits. Advances in Neural Information Processing Systems, 33, 2020.
[40] D. J. Foster, S. M. Kakade, J. Qian, and A. Rakhlin. The statistical complexity of
interactive decision making. arXiv preprint arXiv:2112.13487, 2021.
[41] D. J. Foster, N. Golowich, J. Qian, A. Rakhlin, and A. Sekhari. A note on model-
free reinforcement learning with the decision-estimation coefficient. arXiv preprint
arXiv:2211.14250, 2022.
[42] D. J. Foster, A. Rakhlin, A. Sekhari, and K. Sridharan. On the complexity of adver-
sarial decision making. arXiv preprint arXiv:2206.13063, 2022.
[43] D. J. Foster, N. Golowich, and Y. Han. Tight guarantees for interactive decision making
with the decision-estimation coefficient. arXiv preprint arXiv:2301.08215, 2023.
[44] E. Hazan and S. Kale. An online portfolio selection algorithm with regret logarithmic
in price variation. Mathematical Finance, 2015.
[45] S. R. Howard, A. Ramdas, J. McAuliffe, and J. Sekhon. Time-uniform chernoff bounds
via nonnegative supermartingales. Probability Surveys, 17:257–317, 2020.
[46] N. Jiang, A. Krishnamurthy, A. Agarwal, J. Langford, and R. E. Schapire. Contex-
tual decision processes with low Bellman rank are PAC-learnable. In International
Conference on Machine Learning, pages 1704–1713, 2017.
[47] C. Jin, A. Krishnamurthy, M. Simchowitz, and T. Yu. Reward-free exploration for
reinforcement learning. In International Conference on Machine Learning, pages 4870–
4879. PMLR, 2020.
[48] C. Jin, Z. Yang, Z. Wang, and M. I. Jordan. Provably efficient reinforcement learning
with linear function approximation. In Conference on Learning Theory, pages 2137–
2143, 2020.

[49] C. Jin, Q. Liu, and S. Miryoosefi. Bellman eluder dimension: New rich classes of RL
problems, and sample-efficient algorithms. Neural Information Processing Systems,
2021.

[50] K.-S. Jun and C. Zhang. Crush optimism with pessimism: Structured bandits beyond
asymptotic optimality. Advances in Neural Information Processing Systems, 33, 2020.

[51] A. Kalai and S. Vempala. Efficient algorithms for universal portfolios. Journal of
Machine Learning Research, 2002.

[52] J. Kiefer and J. Wolfowitz. The equivalence of two extremum problems. Canadian
Journal of Mathematics, 12:363–366, 1960.

[53] R. Kleinberg. Nearly tight bounds for the continuum-armed bandit problem. Advances
in Neural Information Processing Systems, 17:697–704, 2004.

[54] R. Kleinberg, A. Slivkins, and E. Upfal. Bandits and experts in metric spaces. Journal
of the ACM (JACM), 66(4):1–77, 2019.

[55] A. Krishnamurthy, A. Agarwal, and J. Langford. PAC reinforcement learning with rich
observations. In Advances in Neural Information Processing Systems, pages 1840–1848,
2016.

[56] T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances
in Applied Mathematics, 6(1):4–22, 1985.

[57] J. Langford and T. Zhang. The epoch-greedy algorithm for multi-armed bandits with
side information. In Advances in neural information processing systems, pages 817–824,
2008.

[58] T. Lattimore. Improved regret for zeroth-order adversarial bandit convex optimisation.
Mathematical Statistics and Learning, 2(3):311–334, 2020.

[59] T. Lattimore and C. Szepesvári. An information-theoretic approach to minimax regret


in partial monitoring. In Conference on Learning Theory, pages 2111–2139. PMLR,
2019.

[60] T. Lattimore and C. Szepesvári. Bandit algorithms. Cambridge University Press, 2020.

[61] G. Li, P. Kamath, D. J. Foster, and N. Srebro. Eluder dimension and generalized rank.
arXiv preprint arXiv:2104.06970, 2021.

[62] L. Li. A unifying framework for computational reinforcement learning theory. Rutgers,
The State University of New Jersey—New Brunswick, 2009.

[63] H. Luo, C.-Y. Wei, and K. Zheng. Efficient online portfolio with logarithmic regret. In
Advances in Neural Information Processing Systems, 2018.

[64] D. Misra, M. Henaff, A. Krishnamurthy, and J. Langford. Kinematic state abstrac-


tion and provably efficient rich-observation reinforcement learning. In International
conference on machine learning, pages 6961–6971. PMLR, 2020.

[65] A. Modi, N. Jiang, A. Tewari, and S. Singh. Sample complexity of reinforcement
learning using linearly combined model ensembles. In International Conference on
Artificial Intelligence and Statistics, pages 2010–2020. PMLR, 2020.

[66] M. Opper and D. Haussler. Worst case prediction over sequences under log loss. In
The Mathematics of Information Coding, Extraction and Distribution, 1999.

[67] L. Orseau, T. Lattimore, and S. Legg. Soft-bayes: Prod for mixtures of experts with
log-loss. In International Conference on Algorithmic Learning Theory, 2017.

[68] Y. Polyanskiy. Information theoretic methods in statistics and computer science. 2020.
URL https://people.lids.mit.edu/yp/homepage/sdpi course.html.

[69] A. Rakhlin and K. Sridharan. Statistical learning and sequential prediction, 2012.
Available at http://www.mit.edu/∼rakhlin/courses/stat928/stat928 notes.pdf.

[70] A. Rakhlin, K. Sridharan, and A. Tewari. Sequential complexities and uniform martin-
gale laws of large numbers. Probability Theory and Related Fields, 161(1-2):111–153,
2015.

[71] S. Rendle, C. Freudenthaler, and L. Schmidt-Thieme. Factorizing personalized markov


chains for next-basket recommendation. In Proceedings of the 19th International Con-
ference on World Wide Web, WWW 2010, Raleigh, North Carolina, USA, April 26-30,
2010, pages 811–820. ACM, 2010.

[72] J. Rissanen. Complexity of strings in the class of markov sources. IEEE Transactions
on Information Theory, 32(4):526–532, 1986.

[73] H. Robbins. Some aspects of the sequential design of experiments. 1952.

[74] D. Russo and B. Van Roy. Eluder dimension and the sample complexity of optimistic
exploration. In Advances in Neural Information Processing Systems, pages 2256–2264,
2013.

[75] D. Russo and B. Van Roy. Learning to optimize via posterior sampling. Mathematics
of Operations Research, 39(4):1221–1243, 2014.

[76] D. Russo and B. Van Roy. Learning to optimize via information-directed sampling.
Operations Research, 66(1):230–252, 2018.

[77] S. Shalev-Shwartz and S. Ben-David. Understanding machine learning: From theory


to algorithms. Cambridge university press, 2014.

[78] Y. M. Shtar’kov. Universal sequential coding of single messages. Problemy Peredachi


Informatsii, 1987.

[79] D. Simchi-Levi and Y. Xu. Bypassing the monster: A faster and simpler optimal algo-
rithm for contextual bandits under realizability. Mathematics of Operations Research,
2021.

[80] M. Sion. On general minimax theorems. Pacific J. Math., 8:171–176, 1958.

[81] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT press,


2018.

[82] W. R. Thompson. On the likelihood that one unknown probability exceeds another in
view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.

[83] A. B. Tsybakov. Introduction to Nonparametric Estimation. Springer Publishing Com-


pany, Incorporated, 2008.

[84] V. Vovk. A game of prediction with expert advice. In Proceedings of the eighth annual
conference on computational learning theory, pages 51–60. ACM, 1995.

[85] Y. Wang, R. Wang, and S. M. Kakade. An exponential lower bound for linearly-
realizable MDPs with constant suboptimality gap. Neural Information Processing Sys-
tems (NeurIPS), 2021.

[86] G. Weisz, P. Amortila, and C. Szepesvári. Exponential lower bounds for planning in
MDPs with linearly-realizable optimal action-value functions. In Algorithmic Learning
Theory, pages 1237–1264. PMLR, 2021.

[87] L. Yang and M. Wang. Sample-optimal parametric Q-learning using linearly additive
features. In International Conference on Machine Learning, pages 6995–7004. PMLR,
2019.

[88] H. Yao, C. Szepesvári, B. Á. Pires, and X. Zhang. Pseudo-mdps and factored linear
action models. In 2014 IEEE Symposium on Adaptive Dynamic Programming and Re-
inforcement Learning, ADPRL 2014, Orlando, FL, USA, December 9-12, 2014, pages
1–9. IEEE, 2014.

[89] B. Yu. Assouad, Fano, and Le Cam. In Festschrift for Lucien Le Cam, pages 423–435.
Springer, 1997.

[90] T. Zhang. Feel-good thompson sampling for contextual bandits and reinforcement
learning. SIAM Journal on Mathematics of Data Science, 4(2):834–857, 2022.

[91] H. Zhong, W. Xiong, S. Zheng, L. Wang, Z. Wang, Z. Yang, and T. Zhang. A posterior
sampling framework for interactive decision making. arXiv preprint arXiv:2211.01962,
2022.

A. TECHNICAL TOOLS

A.1 Probabilistic Inequalities


A.1.1 Tail Bounds with Stopping Times

Lemma 33 (Hoeffding's inequality with adaptive stopping time): For i.i.d. random variables Z_1, …, Z_T taking values in [a, b] almost surely, with probability at least 1 − δ,

    (1/T′) ∑_{i=1}^{T′} Z_i − E[Z] ≤ (b − a) √( log(T/δ) / (2T′) )   ∀1 ≤ T′ ≤ T.    (A.1)

As a consequence, for any random variable τ ∈ [T] with the property that for all t ∈ [T], I{τ ≤ t} is a measurable function of Z_1, …, Z_{t−1} (τ is called a stopping time), we have that with probability at least 1 − δ,

    (1/τ) ∑_{i=1}^{τ} Z_i − E[Z] ≤ (b − a) √( log(T/δ) / (2τ) ).    (A.2)

Proof of Lemma 33. Lemma 3 states that for any fixed T′ ∈ [T], with probability at least 1 − δ,

    (1/T′) ∑_{i=1}^{T′} Z_i − E[Z] ≤ (b − a) √( log(1/δ) / (2T′) ).

(A.1) follows by applying this result with δ′ = δ/T and taking a union bound over all T choices for T′ ∈ [T]. For (A.2), we observe that

    (1/τ) ∑_{i=1}^{τ} (Z_i − E[Z]) − (b − a) √( log(T/δ) / (2τ) ) ≤ max_{T′∈[T]} { (1/T′) ∑_{i=1}^{T′} (Z_i − E[Z]) − (b − a) √( log(T/δ) / (2T′) ) }.

The result now follows from (A.1).
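
As a small illustration (a simulation, not part of the proof), the following snippet checks that the uniform bound in (A.1) holds simultaneously for every T′ with frequency at least 1 − δ, which is exactly what makes it applicable at a data-dependent stopping time.

    import numpy as np

    rng = np.random.default_rng(0)
    T, delta, n_trials = 200, 0.05, 2000

    failures = 0
    for _ in range(n_trials):
        Z = rng.uniform(size=T)                       # i.i.d. in [a, b] = [0, 1] with mean 1/2
        means = np.cumsum(Z) / np.arange(1, T + 1)    # running means, one for each T'
        radii = np.sqrt(np.log(T / delta) / (2 * np.arange(1, T + 1)))
        failures += np.any(means - 0.5 > radii)       # did the bound fail at any T'?

    print(failures / n_trials, "<=", delta)           # empirical failure frequency vs. delta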

A.1.2 Tail Bounds for Martingales

Lemma 34: For any sequence of real-valued random variables (X_t)_{t≤T} adapted to a filtration (F_t)_{t≤T}, it holds that with probability at least 1 − δ, for all T′ ≤ T,

    ∑_{t=1}^{T′} X_t ≤ ∑_{t=1}^{T′} log E_{t−1}[ e^{X_t} ] + log(δ^{-1}).    (A.3)

Proof of Lemma 34. We claim that the sequence

    Z_τ := exp( ∑_{t=1}^{τ} ( X_t − log E_{t−1}[ e^{X_t} ] ) )

is a nonnegative supermartingale with respect to the filtration (F_τ)_{τ≤T}. Indeed, for any choice of τ, we have

    E_{τ−1}[Z_τ] = E_{τ−1}[ exp( ∑_{t=1}^{τ} ( X_t − log E_{t−1}[ e^{X_t} ] ) ) ]
        = exp( ∑_{t=1}^{τ−1} ( X_t − log E_{t−1}[ e^{X_t} ] ) ) · E_{τ−1}[ exp( X_τ − log E_{τ−1}[ e^{X_τ} ] ) ]
        = exp( ∑_{t=1}^{τ−1} ( X_t − log E_{t−1}[ e^{X_t} ] ) )
        = Z_{τ−1}.

Since Z_0 = 1, Ville's inequality (e.g., Howard et al. [45]) implies that for all λ > 0,

    P_0( ∃τ : Z_τ > λ ) ≤ 1/λ.

The result now follows by the Chernoff method.

The next result is a martingale counterpart to Bernstein’s inequality (Lemma 5).

Lemma 35 (Freedman's inequality (Bernstein for martingales)): Let (X_t)_{t≤T} be a real-valued martingale difference sequence adapted to a filtration (F_t)_{t≤T}. If |X_t| ≤ R almost surely, then for any η ∈ (0, 1/R), with probability at least 1 − δ, for all T′ ≤ T,

    ∑_{t=1}^{T′} X_t ≤ η ∑_{t=1}^{T′} E_{t−1}[ X_t^2 ] + log(δ^{-1}) / η.

Proof of Lemma 35. Without loss of generality, let R = 1, and fix η ∈ (0, 1). The result follows by invoking Lemma 34 with ηX_t in place of X_t, and by the facts that e^a ≤ 1 + a + (e − 2)a^2 for a ≤ 1 and 1 + b ≤ e^b for all b ∈ R.

The following result is an immediate consequence of Lemma 35.

Lemma 36: Let (X_t)_{t≤T} be a sequence of random variables adapted to a filtration (F_t)_{t≤T}. If 0 ≤ X_t ≤ R almost surely, then with probability at least 1 − δ,

    ∑_{t=1}^{T} X_t ≤ (3/2) ∑_{t=1}^{T} E_{t−1}[X_t] + 4R log(2δ^{-1}),

and

    ∑_{t=1}^{T} E_{t−1}[X_t] ≤ 2 ∑_{t=1}^{T} X_t + 8R log(2δ^{-1}).

A.2 Information Theory
A.2.1 Properties of Hellinger Distance

Lemma 37: For any distributions P and Q over a pair of random variables (X, Y),

    E_{X∼P_X}[ D^2_H( P_{Y|X}, Q_{Y|X} ) ] ≤ 4 D^2_H( P_{X,Y}, Q_{X,Y} ).

Proof of Lemma 37. Since squared Hellinger distance is an f-divergence, we have

    E_{X∼P_X}[ D^2_H( P_{Y|X}, Q_{Y|X} ) ] = D^2_H( P_{Y|X} ⊗ P_X, Q_{Y|X} ⊗ P_X ).

Next, using that Hellinger distance satisfies the triangle inequality, along with the elementary inequality (a + b)^2 ≤ 2(a^2 + b^2), we have

    E_{X∼P_X}[ D^2_H( P_{Y|X}, Q_{Y|X} ) ] ≤ 2 D^2_H( P_{Y|X} ⊗ P_X, Q_{Y|X} ⊗ Q_X ) + 2 D^2_H( Q_{Y|X} ⊗ P_X, Q_{Y|X} ⊗ Q_X )
        = 2 D^2_H( P_{X,Y}, Q_{X,Y} ) + 2 D^2_H( P_X, Q_X )
        ≤ 4 D^2_H( P_{X,Y}, Q_{X,Y} ),

where the final line follows from the data processing inequality.

Lemma 38 (Subadditivity for squared Hellinger distance): Let (X_1, F_1), …, (X_n, F_n) be a sequence of measurable spaces, and let X^i = ∏_{t=1}^{i} X_t and F^i = ⊗_{t=1}^{i} F_t. For each i, let P^i(· | ·) and Q^i(· | ·) be probability kernels from (X^{i−1}, F^{i−1}) to (X_i, F_i). Let P and Q be the laws of X_1, …, X_n under X_i ∼ P^i(· | X_{1:i−1}) and X_i ∼ Q^i(· | X_{1:i−1}) respectively. Then it holds that

    D^2_H(P, Q) ≤ 10 log(n) · E_P[ ∑_{i=1}^{n} D^2_H( P^i(· | X_{1:i−1}), Q^i(· | X_{1:i−1}) ) ].    (A.4)

A.2.2 Change-of-Measure Inequalities

Lemma 39 (Pinsker for sub-Gaussian random variables): Suppose that X ∼ P and Y ∼ Q are both σ^2-sub-Gaussian. Then

    | E_P[X] − E_Q[Y] | ≤ √( 2σ^2 · D_KL(P ∥ Q) ).

Lemma 40 (Multiplicative Pinsker-type inequality for Hellinger distance): Let P and Q be probability measures on (X, F). For all h : X → R with 0 ≤ h(X) ≤ R almost surely under P and Q, we have

    | E_P[h(X)] − E_Q[h(X)] | ≤ √( 2R ( E_P[h(X)] + E_Q[h(X)] ) · D^2_H(P, Q) ).    (A.5)

In particular,

    E_P[h(X)] ≤ 3 E_Q[h(X)] + 4R · D^2_H(P, Q).    (A.6)

Proof of Lemma 40. Let a measurable event A be fixed. Let p = P(A) and q = Q(A). Then we have

    (p − q)^2 / (2(p + q)) ≤ (√p − √q)^2 ≤ D^2_H( (p, 1−p), (q, 1−q) ) ≤ D^2_H(P, Q),

where the third inequality is the data-processing inequality. It follows that

    |p − q| ≤ √( 2(p + q) D^2_H(P, Q) ).

To deduce the final result for R = 1, we observe that E_P[h(X)] = ∫_0^1 P(h(X) > t) dt and likewise for E_Q[h(X)], then apply Jensen's inequality. The result for general R follows by rescaling.
    The inequality in (A.6) follows by applying the AM-GM inequality to (A.5) and rearranging.
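
As a quick numerical illustration of (A.5) (a sketch, assuming the convention D^2_H(P, Q) = ∑_x (√P(x) − √Q(x))^2; with the 1/2-normalized convention the constants in the check change accordingly), the snippet below verifies the inequality for two random distributions on a finite set and a random bounded function h.

    import numpy as np

    rng = np.random.default_rng(0)
    n, R = 20, 1.0

    P = rng.dirichlet(np.ones(n))                 # two distributions on {1, ..., n}
    Q = rng.dirichlet(np.ones(n))
    h = rng.uniform(0, R, size=n)                 # a function with 0 <= h <= R

    hellinger_sq = np.sum((np.sqrt(P) - np.sqrt(Q)) ** 2)
    lhs = abs(P @ h - Q @ h)
    rhs = np.sqrt(2 * R * (P @ h + Q @ h) * hellinger_sq)
    print(lhs, "<=", rhs)                         # (A.5) for this discrete example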

A.3 Minimax Theorem

Lemma 41 (Sion's Minimax Theorem [80]): Let X and Y be convex sets in linear topological spaces, and assume X is compact. Let f : X × Y → R be such that (i) f(x, ·) is concave and upper semicontinuous over Y for all x ∈ X and (ii) f(·, y) is convex and lower semicontinuous over X for all y ∈ Y. Then

    inf_{x∈X} sup_{y∈Y} f(x, y) = sup_{y∈Y} inf_{x∈X} f(x, y).    (A.7)
