arXiv:2312.16730v1 [cs.LG] 27 Dec 2023

These lecture notes are based on a course taught at MIT in Fall 2022 and Fall 2023. This is a live draft, and all parts will be updated regularly. Please send us an email if you find a mistake, typo, or missing reference.
Contents
1 Introduction 4
1.1 Decision Making . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 A Spectrum of Decision Making Problems . . . . . . . . . . . . . . . . . . . 4
1.3 Minimax Perspective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 Statistical Learning: Brief Refresher . . . . . . . . . . . . . . . . . . . . . . 7
1.5 Refresher: Random Variables and Averages . . . . . . . . . . . . . . . . . . 10
1.6 Online Learning and Prediction . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.6.1 Connection to Statistical Learning . . . . . . . . . . . . . . . . . . . 14
1.6.2 The Exponential Weights Algorithm . . . . . . . . . . . . . . . . . . 15
1.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2 Multi-Armed Bandits 21
2.1 The Need for Exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2 The ε-Greedy Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3 The Upper Confidence Bound (UCB) Algorithm . . . . . . . . . . . . . . . 27
2.4 Bayesian Bandits and the Posterior Sampling Algorithm⋆ . . . . . . . . . . 30
2.5 Adversarial Bandits and the Exp3 Algorithm⋆ . . . . . . . . . . . . . . . . . 34
2.6 Deferred Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3 Contextual Bandits 38
3.1 Optimism: Generic Template . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.2 Optimism for Linear Models: The LinUCB Algorithm . . . . . . . . . . . . 43
3.3 Moving Beyond Linear Classes: Challenges . . . . . . . . . . . . . . . . . . 45
3.4 The ε-Greedy Algorithm for Contextual Bandits . . . . . . . . . . . . . . . 46
3.5 Inverse Gap Weighting: An Optimal Algorithm for General Model Classes . 49
3.5.1 Extending to Offline Regression . . . . . . . . . . . . . . . . . . . . . 52
3.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4 Structured Bandits 55
4.1 Building Intuition: Optimism for Structured Bandits . . . . . . . . . . . . . 58
4.1.1 UCB for Structured Bandits . . . . . . . . . . . . . . . . . . . . . . . 58
4.1.2 The Eluder Dimension . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.1.3 Suboptimality of Optimism . . . . . . . . . . . . . . . . . . . . . . . 62
4.2 The Decision-Estimation Coefficient . . . . . . . . . . . . . . . . . . . . . . 64
4.3 Decision-Estimation Coefficient: Examples . . . . . . . . . . . . . . . . . . . 69
4.3.1 Cheating Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.3.2 Linear Bandits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.3.3 Nonparametric Bandits . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.3.4 Further Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.4 Relationship to Optimism and Posterior Sampling . . . . . . . . . . . . . . 75
4.4.1 Connection to Optimism . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.4.2 Connection to Posterior Sampling . . . . . . . . . . . . . . . . . . . 77
4.5 Incorporating Contexts⋆ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.6 Additional Properties of the Decision-Estimation Coefficient⋆ . . . . . . . . 79
4.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
6.10 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
1. INTRODUCTION
• Medical treatment: based on a patient’s medical history and vital signs, we need to
decide what treatment will lead to the most positive outcome.
• Controlling a robot: based on sensor signals, we need to decide what signals to send
to a robot’s actuators in order to navigate to a goal.
For both problems, we (the learner/agent) are interacting with an unknown environment. In the robotics example, we do not necessarily know a priori how the signals we send to our robot's actuators change its configuration, or what the landscape it's trying to navigate
looks like. However, because we are able to actively control the agent, we can learn to
model the environment on the fly as we make decisions and collect data, which will reduce
uncertainty and allow us to make better decisions in the future. The crux of the interactive
decision making problem is to make decisions in a way that balances (i) exploring the
environment to reduce our uncertainty and (ii) maximizing our overall performance (e.g.,
reaching a goal state as fast as possible).
Figure 1 depicts an idealized interactive decision making setting, which we will return
to throughout this course. Here, at each round t, the agent (doctor) observes the medical
history and vital signs of a patient, summarized in a context xt , makes a treatment decision
π t , and then observes the outcomes of the treatment in the form of a reward rt , and an
auxiliary observation ot about, say, illness progression. With time, we hope that the doctor
will learn a good mapping x^t ↦ π^t from contexts to decisions. How can we develop an automated system that can achieve this goal?
It is tempting to cast the problem of finding a good mapping x^t ↦ π^t as a supervised
learning problem. After all, modern deep neural networks are able to achieve excellent
performance on many tasks, such as image classification and recognition, and it is not
out of the question that there exists a good neural network for the medical example as
well. The question is: how do we find it? In supervised learning, finding a good predictor
often amounts to fitting an appropriate model—such as a neural network—to the data. In
the above example, however, the available data may be limited to what treatments have
been assigned to patients, potentially missing better options. It is the process of active
data collection with a controlled amount of exploration that we would like to study in this
course.
The decision making framework in Figure 1 generalizes many interactive decision mak-
ing problems the reader might already be familiar with, including multi-armed bandits,
contextual bandits, and reinforcement learning. We will cover the foundations of algorithm
design and analysis for all of these settings from a unified perspective, with an emphasis on
sample efficiency (i.e., how to learn a good decision making policy using as few rounds of
interaction as possible).
Figure 1: The interactive decision making protocol: at each round t, the agent observes a context x^t, makes a decision π^t, and receives a reward r^t and an auxiliary observation o^t.
Figure 2: A spectrum of decision making problems, organized along three axes: interactivity, structure in decisions and contexts (function approximation), and the nature of the data (adversarial nature, complex observations). The settings shown range from statistical learning and online learning through multi-armed, contextual, and structured bandits to tabular RL and reinforcement learning.
to different assumptions we can place on the underlying environment and decision making
protocol, and give rise to what we describe as a spectrum of decision making problems,
which is illustrated in Figure 2. There are three core challenges we will focus on throughout
the course, which are given by the axes of Figure 2.
• Interactivity. Does the learning agent observe data passively, or do the decisions it makes actively influence what data it collects? In the setting of Figure 1, the doctor observes the effects of the prescribed treatments, but not the counterfactuals (the effects of the treatments not given). Hence, the doctor's decisions influence the data they can collect, which in turn may significantly alter their ability to estimate the effects of different treatments. On the other hand, in classical machine learning, a dataset is typically given to the learner upfront, with no control over how it is collected.
signs might be a highly structured object. Likewise, the treatment π t might be a high-
dimensional vector with interacting components, or a complex multi-stage treatment
strategy.
• Data. Is the data (e.g., rewards or observations) observed by our learning algorithm
produced by a fixed data-generating process, or does it evolve arbitrarily, and even
adversarially in response to our actions? If there is a fixed data-generating process, do we wish to directly model it, or should we instead aim to be agnostic? Do we
observe only the labels of images, as in supervised learning, or a full trajectory of
states/actions/rewards for a policy employed by the robot?
As shown in Figure 2, many basic decision making and learning frameworks (contextual
bandits, structured bandits, statistical learning, online learning) can be thought of as ideal-
ized problems that each capture one or more of the possible challenges, while richer settings
such as reinforcement learning encompass all of them.
Figure 2 can be viewed as a roadmap for the course. We start with a brief introduction
to Statistical Learning (Section 1.4) and Online Learning (Section 1.6); the concepts and
results stated here will serve as a backbone for the rest of the course. We will then study, in
order, the problems of Multi-Armed Bandits (Section 2), Contextual Bandits (Section 3),
Structured Bandits (Section 4), Tabular Reinforcement Learning (Section 5), General Deci-
sion Making (Section 6), and Reinforcement Learning with General Function Approximation
(Section 7). Each of these topics will add a layer of complexity, and our aim is to develop a
unified approach to all the aforementioned problems, both in terms of statistical complexity
(the number of interactions required to achieve the goal), and in terms of algorithm design.
will be described in Sections 1.4 and 1.6), with a principled choice of exploration strategy
that balances greedily maximizing performance (exploitation) with information acquisition
(exploration). As we show, such algorithms achieve or nearly achieve optimality in (1.1) for
a surprisingly wide range of decision making problems.
• Regression, where common losses include the square loss ℓ(a, b) = (a − b)2 when
Y = Y ′ = R.
• Classification, where Y = Y ′ = {0, 1} and we consider the indicator (or 0-1) loss
ℓ(a, b) = I {a ̸= b}.
• Conditional density estimation with the logarithmic loss (log loss). Here Y′ = ∆(Y), the set of distributions on Y, and for p ∈ Y′, the loss is ℓ(p, y) = − log p(y).
f̂(·; H_T) : X → Y′.   (1.4)
The goal in designing algorithms is to ensure that E L(f̂) is minimized, where E[·] denotes
expectation with respect to the draw of the dataset HT . Without any assumptions, it is not
possible to learn a good predictor unless the number of examples T scales with |X | (this is
sometimes called the no-free-lunch theorem). The basic idea behind statistical learning is
to work with a restricted class of functions
F ⊆ {f : X → Y}
Footnote 1: Note that we allow the outcome space Y to be different from the prediction space Y′.
in order to facilitate generalization. The class F can be thought of as (implicitly) encoding
prior knowledge about the structure of the data. For example, in computer vision, if the
features xt correspond to images and the outcomes y t are labels (e.g., “cat” or “dog”), one
might expect that choosing F to be a class of convolutional neural networks will work well,
since this encodes spatial structure.
[Figure: the statistical learning setting, with features x, outcomes y, and conditional distribution f(y|x).]
Empirical risk minimization and excess risk. The most basic and well-studied algo-
rithmic principle for statistical learning is Empirical Risk Minimization (ERM). Define the
empirical loss for the dataset HT as
L̂(f) = (1/T) Σ_{i=1}^T ℓ(f(x^i), y^i).   (1.5)
Then, the empirical risk minimizer with respect to the class F is given by
f̂ ∈ arg min_{f∈F} L̂(f).   (1.6)
To measure the performance of ERM and other algorithms that attempt to learn with F, we consider the excess loss (or, regret)
E(f̂) := L(f̂) − min_{f′∈F} L(f′).   (1.7)
Intuitively, the quantity min_{f′∈F} L(f′) in (1.7) captures the best prediction performance any function in F can achieve, even with knowledge of the true distribution. If an algorithm f̂ has low excess risk, this means that we are predicting future outcomes nearly as well as any algorithm based on samples can hope to perform. ERM and other algorithms can ensure that E(f̂) is small in expectation or with high probability over the draw of the dataset H_T.
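To make the ERM procedure in (1.5)–(1.7) concrete, here is a minimal Python sketch for a finite class, assuming a toy threshold-classification setting with the indicator loss; the class, data-generating process, and all names below are illustrative, not part of the notes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: x uniform on [0, 1], y in {0, 1} given by a threshold rule plus 10% label noise.
T = 500
x = rng.uniform(0, 1, size=T)
y = (x > 0.3).astype(int)
y = np.where(rng.uniform(size=T) < 0.1, 1 - y, y)

# Finite class F: threshold classifiers f_c(x) = 1{x > c} on a grid of thresholds.
thresholds = np.linspace(0, 1, 21)
F = [lambda z, c=c: (z > c).astype(int) for c in thresholds]

def empirical_loss(f, xs, ys):
    """Empirical indicator loss, i.e. (1.5) with the 0-1 loss."""
    return float(np.mean(f(xs) != ys))

# Empirical risk minimizer (1.6): the function in F with smallest empirical loss.
f_hat = min(F, key=lambda f: empirical_loss(f, x, y))

# Approximate the population loss on a large fresh sample to gauge the excess loss (1.7).
x_test = rng.uniform(0, 1, size=100_000)
y_test = (x_test > 0.3).astype(int)
y_test = np.where(rng.uniform(size=x_test.size) < 0.1, 1 - y_test, y_test)
L_of_erm = empirical_loss(f_hat, x_test, y_test)
L_best_in_class = min(empirical_loss(f, x_test, y_test) for f in F)
print("estimated excess loss of ERM:", L_of_erm - L_best_in_class)
```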
Connection to estimation. An appealing feature of the formulation in (1.7) is that it
does not presuppose any relationship between the class F and the data distribution; in
other words, it is agnostic. However, if F does happen to be good at modeling the data
distribution, the excess loss has an additional interpretation based on estimation.
Definition 1: For prediction with square loss, we say that the problem is well-specified
(or, realizable) if the regression function f ⋆ (a) := E[y|x = a] is in F.
The regression function f⋆ can also be seen as a minimizer of L(f) over measurable functions f, for the same reason that E_z(z − b)² is minimized at b = E[z].
Lemma 1: For the square loss, if the problem is well-specified, then for all f : X → Y,
E(f) = E(f(x) − f⋆(x))².   (1.8)
Proof of Lemma 1. Adding and subtracting f⋆ in the first term of (1.7), we have
E(f(x) − y)² − E(f⋆(x) − y)² = E(f(x) − f⋆(x))² + 2 E[(f⋆(x) − y)(f(x) − f⋆(x))].
The cross term vanishes: conditioning on x and using E[y | x] = f⋆(x) gives E[(f⋆(x) − y)(f(x) − f⋆(x)) | x] = 0, which establishes (1.8).
Inspecting (1.8), we see that any f achieving low excess loss necessarily estimates the true
regression function f ⋆ ; hence, the goals of prediction and estimation coincide.
Guarantees for ERM. We give bounds on the excess loss of ERM for perhaps the
simplest special case, in which F is finite.
where:
1. For any bounded loss (including classification), comp(F, T) = √( log|F| / T ).
2. For square loss regression, if the problem is well-specified, comp(F, T) = log|F| / T.
In addition, there exists a (different) algorithm that achieves comp(F, T) = log|F| / T for both square loss regression and conditional density estimation, even when the problem is not well-specified.
Henceforth, we shall use the symbol ≲ to indicate an inequality that holds up to constants,
or other problem parameters deemed less important for the present discussion. As an
example, the range of losses for the first part is hidden in this notation, and we only focus
on the dependence of the right-hand side on F and T .
The rate comp(F, T) = √( log|F| / T ) above is sometimes referred to as a slow rate, and is optimal for generic losses. The rate comp(F, T) = log|F| / T is referred to as a fast rate, and takes advantage of additional structure (curvature, or strong convexity) of the square loss.
Critically, both bounds scale only with the cardinality of F, and do not depend on the size
of the feature space X , which could be infinite. This reflects the fact that working with
a restricted function class is allowing us to generalize across the feature space X . In this
context, the cardinality log|F| should be thought of as a notion of capacity, or expressiveness,
for F. Intuitively, choosing a larger, more expressive class will require a larger amount of
data, but will make the excess loss bound in (1.7) more meaningful, since the benchmark
will be stronger.
Note that if Z ∼ N (0, σ 2 ) is Gaussian with variance σ 2 , then it is sub-Gaussian with vari-
ance proxy σ 2 . In this sense, sub-Gaussian random variables generalize the tail behavior of
Gaussians. A standard application of the Chernoff method yields the following result.
Applying this result with Z and −Z and taking a union bound yields the following two-sided
guarantee:
P( |(1/T) Σ_{i=1}^T Z_i − E[Z]| ≥ u ) ≤ 2 exp( −Tu² / (2σ²) ).   (1.11)
Setting the right-hand side of (1.11) to δ and solving for u, we find that for any δ ∈ (0, 1),
with probability at least 1 − δ,
|(1/T) Σ_{i=1}^T Z_i − E[Z]| ≤ √( 2σ² log(2/δ) / T ).   (1.12)
Remark 3 (Union bound): The factor 2 under the logarithm in (1.12) is the result
of applying union bound to (1.10). Throughout the course, we will frequently apply the
union bound to multiple—say N —high probability events involving sub-Gaussian ran-
dom variables. In this case, the union bound will result in terms of the form log(N/δ).
The mild logarithmic dependence is due to the sub-Gaussian tail behavior of the aver-
ages.
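As a quick numerical sanity check of the two-sided bound (1.12), the following sketch (assuming Gaussian Z_i with σ = 1; all parameters are illustrative) estimates how often the sample mean deviates by more than the stated threshold; the empirical failure frequency should fall below δ.

```python
import numpy as np

rng = np.random.default_rng(1)
T, sigma, delta = 200, 1.0, 0.05
threshold = np.sqrt(2 * sigma**2 * np.log(2 / delta) / T)  # right-hand side of (1.12)

n_trials = 20_000
samples = rng.normal(loc=0.0, scale=sigma, size=(n_trials, T))
deviations = np.abs(samples.mean(axis=1) - 0.0)  # |sample mean - E[Z]| in each trial

# (1.12) says the deviation exceeds the threshold with probability at most delta.
print("empirical failure probability:", float(np.mean(deviations > threshold)))
print("delta:", delta)
```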
The following result shows that any bounded random variable is sub-Gaussian.
∀η ∈ R,  ln E exp{−η(Z − E[Z])} ≤ η²(b − a)² / 8.   (1.13)
As a consequence, for i.i.d. random variables Z_1, . . . , Z_T taking values in [a, b] almost surely, with probability at least 1 − δ,
(1/T) Σ_{i=1}^T Z_i − E[Z] ≤ (b − a) √( log(1/δ) / (2T) ).   (1.14)
By union bound and Lemma 3, with probability at least 1 − |F|δ,
∀f ∈ F,  E ℓ(f(X), Y) − (1/T) Σ_{i=1}^T ℓ(f(X_i), Y_i) ≤ √( log(2/δ) / (2T) ).   (1.15)
To deduce the in-expectation bound of Proposition 1 from the high-probability tail bound
of Lemma 4, a standard technique of "integrating out the tail" is employed. More precisely, for a nonnegative random variable U, it holds that E[U] ≤ τ + ∫_τ^∞ P(U ≥ z) dz for all τ > 0; choosing τ ∝ T^{−1/2} concludes the proof.
To prove Part 2 (the fast rate) of Proposition 1, we need a more refined concentration inequality (Bernstein's inequality), which gives tighter guarantees for random variables with small variance.
The proof for Part 2 is given as an exercise in Section 1.7. We refer the reader to Ap-
pendix A.1 for further background on tail bounds.
• Rather than receiving a batch dataset of T examples all at once, we receive the
examples (xt , y t ) one by one, and must predict y t from xt only using the examples we
have already observed.
• Instead of assuming that examples are drawn from a fixed distribution, we allow
examples to be generated in an arbitrary, potentially adversarial fashion.
which aims to predict the outcome y t from the features xt . The algorithm’s goal is to
minimize the cumulative loss over T rounds, given by
Σ_{t=1}^T ℓ(f̂^t(x^t), y^t)
for a known loss function ℓ : Y ′ × Y → R; the cumulative loss can be thought of as a sum
of “out-of-sample” prediction errors. Since we will not be placing assumptions on the data-
generating process, it is not possible to make meaningful statements about the cumulative
loss itself. However, we can aim to ensure that this cumulative loss is not much worse than
the best empirical explanation of the data by functions in a given class F. That is, we
measure the algorithm’s performance via regret to F:
Reg = Σ_{t=1}^T ℓ(f̂^t(x^t), y^t) − min_{f∈F} Σ_{t=1}^T ℓ(f(x^t), y^t).   (1.18)
Our aim is to design prediction algorithms that keep regret small for any sequence of
data. As in statistical learning, the class F should be thought of as capturing our prior
knowledge about the problem, and might be a linear model or neural network. At first
glance, keeping the regret small for arbitrary sequences might seem like an impossible task,
as it stands in stark contrast with statistical learning, where data is generated i.i.d. from
a fixed distribution. Nonetheless, we will see that algorithms with guarantees similar to those for statistical learning are available.
Let us remark that it is often useful to apply online learning methods in settings where
data is not fully adversarial, but evolves according to processes too difficult to directly
model. For example, in the chapters that follow, we will apply online methods as a subroutine within more sophisticated algorithms for decision making. Here, the choice of past decisions, while in our purview, does not look like i.i.d. or simple time-series data.
The algorithms we introduce in the sequel ensure small regret even if data are adversarially and adaptively chosen. More precisely, for deterministic algorithms, (x^t, y^t) may be chosen based on f̂^t and all the past data, while for randomized algorithms, Nature can only base this choice on q^t.
In the context of Figure 2, online learning generalizes statistical learning by considering
arbitrary sequences of data, but still allows for general-purpose function approximation and
generalization via the class F. While the setting involves making predictions in an online
fashion, we do not think of this as an interactive decision making problem, because the
predictions made by the learning agent do not directly influence what data the agent gets
to observe.
Proposition 2: Suppose the examples (x1 , y 1 ), . . . , (xT , y T ) are drawn i.i.d. from a dis-
tribution M ⋆ , and suppose the loss function a 7→ ℓ(a, b) is convex in the first argument
for all b. Then for any online learning algorithm, if we define
f̂(x) = (1/T) Σ_{t=1}^T f̂^t(x),
we have
E[E(f̂)] ≤ (1/T) · E[Reg].
which is equal to
(1/T) Σ_{t=1}^T E[ E_{(x^t, y^t)} ℓ(f̂^t(x^t), y^t) ]   (1.21)
since f̂^t is a function of H^{t−1}, and (x, y) and (x^t, y^t) are i.i.d. Second,
min_{f∈F} L(f) = min_{f∈F} E[ (1/T) Σ_{t=1}^T ℓ(f(x^t), y^t) ] ≥ E[ min_{f∈F} (1/T) Σ_{t=1}^T ℓ(f(x^t), y^t) ].   (1.22)
In light of Proposition 2, one can interpret regret as generalizing the notion of excess risk
from i.i.d. data to arbitrary sequences.
Similar to Lemma 1 in the setting of statistical learning, the regret for online learning
has an additional interpretation in terms of estimation if the outcomes for the problem are
well-specified.
E[y t | xt = x] = f ⋆ (x).
Notably, this result holds even if the features x1 , . . . , xT are generated adversarially, with no
prior knowledge of the sequence. This is a significant departure from classical estimation
results in statistics, where estimation of an unknown function is typically done over a fixed,
known sequence (“design”) x1 , . . . , xT , or with respect to an i.i.d. dataset.
where η > 0 is a learning rate. Based on q^t, the algorithm forms the prediction f̂^t. We give two variants of the method here.
Exponential Weights (averaged):
  for t = 1, . . . , T do
    Compute q^t in (1.23).
    Let f̂^t = E_{f∼q^t}[f].
    Observe (x^t, y^t), incur ℓ(f̂^t(x^t), y^t).

Exponential Weights (randomized):
  for t = 1, . . . , T do
    Compute q^t in (1.23).
    Sample f̂^t ∼ q^t.
    Observe (x^t, y^t), incur ℓ(f̂^t(x^t), y^t).
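The following Python sketch implements both variants for a small finite class, assuming the square loss with Y = Y′ = [0, 1]; the class of constant predictors, the data stream, and the choice of η are illustrative stand-ins rather than the notes' own setup.

```python
import numpy as np

rng = np.random.default_rng(2)

# Finite class F of constant predictors f(x) = b on a grid (purely illustrative).
grid = np.linspace(0, 1, 11)
F = [lambda x, b=b: b for b in grid]
N, T, eta = len(F), 1000, 1.0

def sq_loss(pred, y):
    return (pred - y) ** 2

cum_losses = np.zeros(N)          # cumulative loss of each f in F
loss_avg, loss_rand = 0.0, 0.0    # cumulative loss of the two variants

for t in range(T):
    x_t = rng.uniform()                                   # feature (unused by constant predictors)
    y_t = float(np.clip(0.7 + 0.1 * rng.standard_normal(), 0, 1))

    # q^t(f) proportional to exp(-eta * cumulative loss of f), as in (1.23).
    weights = np.exp(-eta * (cum_losses - cum_losses.min()))
    q = weights / weights.sum()

    preds = np.array([f(x_t) for f in F])
    loss_avg += sq_loss(float(q @ preds), y_t)             # averaged variant: predict E_{f~q^t}[f(x^t)]
    loss_rand += sq_loss(preds[rng.choice(N, p=q)], y_t)   # randomized variant: sample f^t ~ q^t

    cum_losses += sq_loss(preds, y_t)                      # both variants use the same update

best = cum_losses.min()
print("regret (averaged):  ", loss_avg - best)
print("regret (randomized):", loss_rand - best)
```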
The only difference between these variants lies in whether we compute the prediction f̂^t from q^t via averaging, f̂^t = E_{f∼q^t}[f], or sampling, f̂^t ∼ q^t.   (1.24)
The latter can be applied to any bounded loss function, while the former leads to faster
rates for specific losses such as the square loss and log loss, but is only applicable when Y ′ is
convex. Note that the averaged version is inherently improper, while the second is proper,
yet randomized. From the point of view of regret, the key difference between these two
versions is the placement of “Ef ∼qt ”: For the averaged version it is inside the loss function,
and for the randomized version it is outside (see (1.19)). The averaged version can therefore
take advantage of the structure of the loss function, such as strong convexity, leading to
faster rates. The following result shows that Exponential Weights leads to regret bounds
for online learning, with rates that parallel those in Proposition 1.
Proposition 3: For any finite class F, the Exponential Weights algorithm (with appropriate choice of η) satisfies
(1/T) · Reg ≲ comp(F, T)   (1.25)
for any sequence, where:
1. For arbitrary bounded losses (including classification), comp(F, T) = √( log|F| / T ). This is achieved by the randomized variant.
2. For regression with the square loss and conditional density estimation with the log loss, comp(F, T) = log|F| / T. This is achieved by the averaged variant.
We now turn to the proof of Proposition 3. Since we are not placing any assumptions
on the data generating process, we cannot hope to control the algorithm’s loss at any
particular time t, but only cumulatively. It is then natural to employ amortized analysis
with a potential function.
In more detail, the proof of Proposition 3 relies on several steps, common to standard analyses of online learning: (i) define a potential function, (ii) relate the increase in potential at each time step to the loss of the algorithm, and (iii) relate the cumulative loss of any expert f ∈ F to the final potential. For the Exponential Weights Algorithm, the proof relies on
the following potential for time t, parameterized by η > 0:
Φ^t_η = − log Σ_{f∈F} exp{ −η Σ_{i=1}^t ℓ(f(x^i), y^i) }.   (1.26)
The choice of this potential is rather opaque, and a full explanation of its origin is beyond
the scope of the course, but we mention in passing that there are principled ways of coming
up with potentials in general online learning problems.
Proof of Proposition 3. We first prove the second statement, focusing on conditional density
with the logarithmic loss; for the square loss, see Remark 6 below.
Proof for Part 2: Log loss. Recall that for each x, f (x) is a distribution over Y, and
ℓlog (f (x), y) = − log f (y|x) where we abuse the notation and write f (x) and f (·|x) inter-
changeably. With η = 1, the averaged variant of exponential weights satisfies
f̂^t(y^t|x^t) = Σ_{f∈F} q^t(f) f(y^t|x^t) = Σ_{f∈F} f(y^t|x^t) · exp{ −Σ_{i=1}^{t−1} ℓ_log(f(x^i), y^i) } / Σ_{f′∈F} exp{ −Σ_{i=1}^{t−1} ℓ_log(f′(x^i), y^i) },   (1.27)
and thus
Hence, by telescoping,
Σ_{t=1}^T ℓ_log(f̂^t(x^t), y^t) = Φ^T_1 − Φ^0_1.
Finally, observe that Φ^0_1 = − log|F| and, since − log is monotonically decreasing, we have
Φ^T_1 ≤ − log exp{ −Σ_{i=1}^T ℓ_log(f⋆(x^i), y^i) } = Σ_{i=1}^T ℓ_log(f⋆(x^i), y^i),   (1.29)
for any f⋆ ∈ F. This establishes the result for conditional density estimation with the log loss. As already discussed, the above proof follows the general strategy: the loss on each round is related to the change in potential (1.28), and the cumulative loss of any expert is related to the final potential (1.29). We now aim to replicate these steps for arbitrary bounded losses.
Proof for Part 1: Generic loss. To prove this result, we build on the log loss result above.
First, observe that without loss of generality, we may assume that ℓ ◦ f ∈ [0, 1] for all f ∈ F
and (x, y), as we can always re-scale the problem. The randomized variant of exponential
weights (1.24) satisfies
E_{f̂^t∼q^t}[ℓ(f̂^t(x^t), y^t)] = Σ_{f∈F} ℓ(f(x^t), y^t) · exp{ −η Σ_{i=1}^{t−1} ℓ(f(x^i), y^i) } / Σ_{f′∈F} exp{ −η Σ_{i=1}^{t−1} ℓ(f′(x^i), y^i) }.   (1.30)
By Hoeffding's lemma (1.13), applied to the random variable ℓ(f(x^t), y^t) with f ∼ q^t, η times this quantity is at most
Φ^t_η − Φ^{t−1}_η + η²/8,
establishing the analogue of (1.28). Summing over t, this gives
η Σ_{t=1}^T E_{f̂^t∼q^t}[ℓ(f̂^t(x^t), y^t)] ≤ Φ^T_η − Φ^0_η + Tη²/8.   (1.32)
Since Φ^0_η = − log|F| and, as in (1.29), Φ^T_η ≤ η Σ_{t=1}^T ℓ(f⋆(x^t), y^t) for any f⋆ ∈ F, rearranging and dividing by η yields
Σ_{t=1}^T ( E_{f̂^t∼q^t}[ℓ(f̂^t(x^t), y^t)] − ℓ(f⋆(x^t), y^t) ) ≤ Tη/8 + log|F|/η.
With η = √( 8 log|F| / T ), we conclude that
Σ_{t=1}^T ( E_{f̂^t∼q^t}[ℓ(f̂^t(x^t), y^t)] − ℓ(f⋆(x^t), y^t) ) ≤ √( T log|F| / 2 ).   (1.33)
Observe that Hoeffding’s inequality was all that was needed for Lemma 4. Curiously
enough, it was also the only nontrivial step in the proof of Proposition 3. In fact, the
connection between probabilistic inequalities and online learning regret inequalities (that
hold for arbitrary sequences) runs much deeper.
Remark 6 (Mixable losses): We did not provide a proof of Proposition 3 for square
loss. It is tempting to reduce square loss regression to density estimation by taking the
conditional density to be a Gaussian distribution. Indeed, the log loss of a distribution
with density proportional to exp{−(fbt (xt )−y t )2 } is, up to constants, the desired square
loss. However, the mixture in (1.27) does not immediately lead to a prediction strategy
for the square loss, as the expectation appears in the wrong location. This issue is fixed
by a notion known as mixability.
We say that a loss ℓ is mixable with parameter η if there exists a constant c > 0
such that the following holds: for any x and a distribution q ∈ ∆(F), there exists a
prediction fb(x) ∈ Y ′ such that for all y ∈ Y,
ℓ(f̂(x), y) ≤ −(c/η) log( Σ_{f∈F} q(f) exp{−η ℓ(f(x), y)} ).   (1.34)
If the loss is mixable, then given the exponential weights distribution q^t, the best prediction ŷ^t = f̂^t(x^t) can be written (by bringing the right-hand side of (1.34) to the left side) as an optimization problem
arg min_{ŷ^t ∈ Y′} max_{y^t ∈ Y} [ ℓ(ŷ^t, y^t) + (c/η) log Σ_{f∈F} q^t(f) exp{−η ℓ(f(x^t), y^t)} ]   (1.35)
which is equivalent to
arg min_{ŷ^t ∈ Y′} max_{y^t ∈ Y} [ ℓ(ŷ^t, y^t) + (c/η) log Σ_{f∈F} exp{−η Σ_{i=1}^t ℓ(f(x^i), y^i)} ]   (1.36)
once we remove the normalization factor. With this choice, mixability allows one to
replicate the proof of Proposition 3 for the logarithmic loss, with the only difference
being that (1.27) (after applying − log to both sides) becomes an inequality. It can be
verified that square loss is mixable with parameter η = 2 and c = 1 when Y = Y ′ = [0, 1],
leading to the desired fast rate for square loss in Proposition 3. The idea of translating
the English statement “there exists a strategy such that for any outcome...” into a
min-max inequality will come up again in the course.
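To illustrate how (1.35) produces a concrete prediction, the sketch below numerically solves the min-max problem for the square loss with η = 2 and c = 1 on a discretized Y = Y′ = [0, 1]; the distribution q and the experts' predictions are toy values chosen purely for illustration.

```python
import numpy as np

eta, c = 2.0, 1.0                      # mixability parameters for the square loss on [0, 1]
y_grid = np.linspace(0, 1, 101)        # discretization of Y = [0, 1]
yhat_grid = np.linspace(0, 1, 101)     # candidate predictions in Y' = [0, 1]

# Toy exponential-weights distribution q over three experts and their predictions at x.
q = np.array([0.5, 0.3, 0.2])
f_x = np.array([0.2, 0.6, 0.9])

def mix_bound(y):
    """-(c/eta) * log sum_f q(f) exp(-eta * (f(x) - y)^2), the right-hand side of (1.34)."""
    return -(c / eta) * np.log(np.sum(q * np.exp(-eta * (f_x - y) ** 2)))

# For each candidate prediction, the worst case over y of the loss minus the bound (as in (1.35)).
worst_case = [max((yhat - y) ** 2 - mix_bound(y) for y in y_grid) for yhat in yhat_grid]
best_idx = int(np.argmin(worst_case))
print("mixable prediction:", yhat_grid[best_idx])
print("min-max value (should be <= 0 if (1.34) is attainable):", worst_case[best_idx])
```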
Remark 7 (Online linear optimization): For the slow rate in Proposition 3, the
nature of the loss and the dependence on the function f is immaterial for the proof. The
guarantee can be stated in a more abstract form that depends only on the vector of losses
for functions in F as follows. Let |F| = N. For timestep t, define ℓ^t_f = ℓ(f(x^t), y^t) and ℓ^t = (ℓ^t_{f_1}, . . . , ℓ^t_{f_N}) ∈ R^N for F = {f_1, . . . , f_N}. For a randomized strategy q^t ∈ ∆([N]), the expected loss of the learner can be written as ⟨q^t, ℓ^t⟩, and the regret becomes
Σ_{t=1}^T ⟨q^t, ℓ^t⟩ − min_{j∈[N]} Σ_{t=1}^T ⟨e_j, ℓ^t⟩,   (1.37)
where e_j ∈ R^N is the standard basis vector with 1 in the jth position. In its most general form, the exponential weights algorithm gives bounds on the regret in (1.37) for any sequence of vectors ℓ^1, . . . , ℓ^T, and the update takes the form
q^t(k) ∝ exp{ −η Σ_{i=1}^{t−1} ℓ^i(k) }.
This formulation can be viewed as a special case of a problem known as online linear
optimization, and the exponential weights method can be viewed as an instance of an
algorithm known as mirror descent.
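In this abstract form, the method needs only the loss vectors ℓ^1, . . . , ℓ^T. A minimal sketch (with toy uniformly random loss vectors, purely for illustration) is:

```python
import numpy as np

rng = np.random.default_rng(3)
N, T = 10, 2000
eta = np.sqrt(8 * np.log(N) / T)   # the learning rate used in the proof of Proposition 3

cum = np.zeros(N)                  # cumulative loss vector sum_{i < t} l^i
learner_loss = 0.0

for t in range(T):
    # q^t(k) proportional to exp(-eta * sum_{i < t} l^i(k)).
    weights = np.exp(-eta * (cum - cum.min()))
    q = weights / weights.sum()

    loss_vec = rng.uniform(0, 1, size=N)   # an arbitrary loss vector l^t in [0, 1]^N
    learner_loss += float(q @ loss_vec)    # expected loss <q^t, l^t> as in (1.37)
    cum += loss_vec

print("regret:", learner_loss - cum.min())
print("bound from (1.33):", np.sqrt(T * np.log(N) / 2))
```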
1.7 Exercises
Exercise 1 (Proposition 1, Part 2): Consider the setting of Proposition 1, where (x^1, y^1), . . . , (x^T, y^T) are i.i.d., F ⊆ {f : X → [0, 1]} is finite, the true regression function satisfies f⋆ ∈ F, and y^i ∈ [0, 1] almost surely. Prove that the empirical risk minimizer f̂ with respect to the square loss satisfies the following bound on the excess risk: with probability at least 1 − δ,
E(f̂) ≲ log(|F|/δ) / T.   (1.38)
Follow these steps:
1. For a fixed function f ∈ F, consider the random variable
Z_i(f) = (f(x^i) − y^i)² − (f⋆(x^i) − y^i)².
3. Apply Bernstein's inequality (Lemma 5) to show that for any f ∈ F, with probability at least 1 − δ,
4. Extend this probabilistic inequality to hold simultaneously for all f ∈ F by taking the union bound over f ∈ F. Conclude as a consequence that the bound holds for f̂, the empirical risk minimizer, implying (1.38).
Exercise 2 (ERM in Online Learning): Consider the problem of Online Supervised Learn-
ing with indicator loss ℓ(f (x), y) = I {f (x) ̸= y}, Y = Y ′ = {0, 1}, and a finite class F.
1. Exhibit a class F for which ERM cannot ensure sublinear growth of regret for all sequences,
i.e. there exists a sequence (x1 , y 1 ), . . . , (xT , y T ) such that
Σ_{t=1}^T ℓ(f̂^t(x^t), y^t) − min_{f∈F} Σ_{t=1}^T ℓ(f(x^t), y^t) = Ω(T),
where fbt is the empirical minimizer for the indicator loss on (x1 , y 1 ), . . . , (xt−1 , y t−1 ). Note:
The construction must have |F| ≤ C, where C is an absolute constant that does not depend
on T .
2. Show that if data are i.i.d., then in expectation over the data, ERM attains a sublinear bound O(√(T log|F|)) on regret for any finite class F.
Exercise 3 (Low Noise): 1. For a nonnegative random variable X, prove that for any η ≥ 0,
ln E exp{−η(X − E[X])} ≤ (η²/2) E[X²].   (1.40)
Hint: use the fact that ln x ≤ x − 1 and exp(−x) ≤ 1 − x + x²/2 for x ≥ 0.
2. Consider the setting of Proposition 3, Part 1 (Generic Loss). Prove that the randomized
variant of the Exponential Weights Algorithm satisfies, for any f ⋆ ∈ F,
Σ_{t=1}^T ( E_{f̂^t∼q^t}[ℓ(f̂^t(x^t), y^t)] − ℓ(f⋆(x^t), y^t) ) ≤ (η/2) Σ_{t=1}^T E_{f̂^t∼q^t}[ℓ(f̂^t(x^t), y^t)²] + log|F|/η.   (1.41)
for any sequence of data and nonnegative losses. Hint: replace Hoeffding’s Lemma by (1.40).
3. Suppose ℓ(f(x), y) ∈ [0, 1] for all x ∈ X, y ∈ Y, and f ∈ F. Suppose that there is a "perfect expert" f⋆ ∈ F such that ℓ(f⋆(x^t), y^t) = 0 for all t ∈ [T]. Conclude that the above algorithm, with an appropriate choice of η, enjoys a bound of O(log|F|) on the cumulative loss of the algorithm (equivalently, the fast rate log|F|/T for the average regret). This setting is called "zero-noise."
4. Consider the binary classification problem with indicator loss, and suppose F contains a
perfect expert, as above. The Halving Algorithm maintains a version space F t = {f ∈ F :
f (xs ) = y s , s < t} and, given xt , follows the majority vote of remaining experts in F t . Show
that this algorithm incurs cumulative loss at most O(log|F|). Hence, the Exponential Weights
Algorithm can be viewed as an extension of the Halving algorithm to settings where the optimal
loss is non-zero.
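For concreteness, here is a sketch of the Halving Algorithm as described above, run on a toy finite class of threshold classifiers containing a perfect expert; the class and data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

# Finite class of threshold classifiers f_c(x) = 1{x > c}; one of them labels the data perfectly.
thresholds = np.linspace(0, 1, 33)
F = [lambda x, c=c: int(x > c) for c in thresholds]
perfect = F[16]

version_space = list(range(len(F)))   # indices of experts consistent with the data so far
mistakes = 0

for t in range(1000):
    x_t = rng.uniform()
    y_t = perfect(x_t)                # zero-noise labels

    # Predict with the majority vote of the remaining experts.
    votes = sum(F[i](x_t) for i in version_space)
    pred = int(2 * votes >= len(version_space))
    mistakes += int(pred != y_t)

    # Discard every expert that was wrong on (x_t, y_t); the perfect expert always survives.
    version_space = [i for i in version_space if F[i](x_t) == y_t]

print("mistakes:", mistakes, " log2 |F| =", np.log2(len(F)))
```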
2. MULTI-ARMED BANDITS
This chapter introduces the multi-armed bandit problem, which is the simplest interactive
decision making framework we will consider in this course.
The protocol (see above) proceeds in T rounds. At each round t ∈ [T], the learning agent selects a discrete decision π^t ∈ Π = {1, . . . , A} (see Footnote 2) using the data
collected so far; we refer to Π as the decision space or action space, with A ∈ N denoting
the size of the space. We allow the learner to randomize the decision at step t according
to a distribution pt = pt (· | Ht−1 ), sampling π t ∼ pt . Based on the decision π t , the learner
receives a reward rt , and their goal is to maximize the cumulative reward across all T
rounds. As an example, one might consider an application in which the learner is a doctor
(or personalized medical assistant) who aims to select a treatment (the decision) in order
to make a patient feel better (maximize reward); see Figure 4.
The multi-armed bandit problem can be studied in a stochastic framework, in which re-
wards are generated from a fixed (conditional) distribution, or a non-stochastic/adversarial
framework in the vein of online learning (Section 1.6). We will focus on the stochastic frame-
work, and make the following assumption.
rt ∼ M ⋆ (· | π t ), (2.1)
Footnote 2: In the literature on bandits, decisions are often referred to as actions. We will use these terms interchangeably throughout this section.
We define
f ⋆ (π) := E [r | π] (2.2)
as the mean reward function under r ∼ M ⋆ (· | π). We measure the learner’s performance
via regret to the action π ⋆ := arg maxπ∈Π f ⋆ (π) with highest reward:
Reg := Σ_{t=1}^T f⋆(π⋆) − Σ_{t=1}^T E_{π^t∼p^t}[f⋆(π^t)].   (2.3)
Regret is a natural notion of performance for the multi-armed bandit problem because
it is cumulative: it measures not just how well the learner can identify an action with
good reward, but how well it can maximize reward as it goes. This notion is well-suited
to settings like the personalized medicine example in Figure 4, where regret captures the
overall quality of treatments, not just the quality of the final treatment. As in the online
learning framework, we would like to develop algorithms that enjoy sublinear regret, i.e., E[Reg]/T → 0 as T → ∞.
The most important feature of the multi-armed bandit problem, and what makes the
problem fundamentally interactive, is that the learner only receives a reward signal for the
single decision π t ∈ Π they select at each round. That is, the observed reward rt gives a
noisy estimate for f ⋆ (π t ), but reveals no information about the rewards for other decisions
π ̸= π t . For example in Figure 4, if the doctor prescribes a particular treatment to the
Figure 4: An illustration of the multi-armed bandit problem. A doctor (the learner) aims to select a treatment (the decision) to improve a patient's vital signs (the reward).
patient, they can observe whether the patient responds favorably, but they do not directly
observe whether other possible treatments might have led to an even better outcome. This
issue is often referred to as partial feedback or bandit feedback. Partial feedback introduces
an element of active data collection, as it means that the information contained in the
dataset Ht depends on the decisions made by the learner, which we will see necessitates
exploring different actions. This should be contrasted with statistical learning (where the
dataset is generated independently from the learner) and online learning (where losses may
be chosen by nature in response to the learner’s behavior, but where the outcome y t — and
hence the full loss function ℓ(·, y t )—is always revealed).
In the context of Figure 2, the multi-armed bandit problem constitutes our first step
along the “interactivity” axis, but does not incorporate any structure in the decision space
(and does not involve features/contexts/covariates). In particular, information about one
action does not reveal information about any other actions, so there is no hope of using
function approximation to generalize across actions.3 As a result, the algorithms we will
cover in this section will have regret that scales with Ω(|Π|) = Ω(A). This shortcoming is
addressed by the structured bandit framework we will introduce in Section 4, which allows
for the use of function approximation to model structure in the decision space.4
where, for π ≠ π^t, r^t(π) denotes the counterfactual reward the learner would have received if they had played π at round t. Using Hoeffding's inequality, one can show that this is equivalent to the definition in (2.3) up to O(√T) factors.
where n^t(π) is the number of times π has been selected up to time t. Then, we can choose the greedy action
π̂^t = arg max_{π∈Π} f̂^t(π).
Unfortunately, due to the interactive nature of the bandit problem, this strategy can fail,
leading to linear regret (Reg = Ω(T )). Consider the following problem with Π = {1, 2}
(A = 2).
Suppose we initialize by playing each decision a single time to ensure that nt (π) > 0, then
follow the greedy strategy. One can see that with probability 1/4, the greedy algorithm will
get stuck on action 1, leading to regret Ω(T ).
The issue in this example is that the greedy algorithm immediately gives up on the
optimal action and never revisits it. To address this, we will consider algorithms that
deliberately explore less visited actions to ensure that their estimated rewards are not
misleading.
and with probability ε it samples a uniform random action π t ∼ unif({1, . . . , A}). As the
name suggests, ε-Greedy usually plays the greedy action (exploiting what it has already
learned), but the uniform sampling ensures that the algorithm will also explore unseen
actions. We can think of the parameter ε as modulating the tradeoff between exploiting
and exploring.
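A minimal simulation of ε-Greedy on a toy Bernoulli bandit instance follows; the arm means, the horizon, and the particular choice of ε are illustrative (ε is taken only of the rough order suggested by the analysis, not a tuned constant).

```python
import numpy as np

rng = np.random.default_rng(5)

means = np.array([0.3, 0.5, 0.45, 0.7])       # f*(pi) for each arm (toy instance)
A, T = len(means), 20_000
eps = (A * np.log(A * T) / T) ** (1 / 3)      # roughly the order suggested by the analysis

counts = np.zeros(A)    # n^t(pi): number of pulls of each arm
sums = np.zeros(A)      # running sum of rewards for each arm
regret = 0.0

for t in range(T):
    if counts.min() == 0 or rng.uniform() < eps:
        arm = int(rng.integers(A))             # explore uniformly (also covers initialization)
    else:
        arm = int(np.argmax(sums / counts))    # greedy with respect to the sample means
    reward = float(rng.uniform() < means[arm]) # Bernoulli reward with mean f*(arm)
    counts[arm] += 1
    sums[arm] += reward
    regret += means.max() - means[arm]

print("regret:", regret)
print("A^(1/3) T^(2/3) =", A ** (1 / 3) * T ** (2 / 3))
```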
Proposition 4: Assume that f ⋆ (π) ∈ [0, 1] and rt is 1-sub-Gaussian. Then for any T ,
by choosing ε appropriately, the ε-Greedy algorithm ensures that with probability at
least 1 − δ,
This regret bound has E[Reg]/T → 0 as T → ∞, as desired, though we will see in the sequel that more sophisticated strategies can attain improved regret bounds that scale with √(AT). (Note that √(AT) ≤ A^{1/3} T^{2/3} whenever A ≤ T, and when A ≥ T both guarantees are vacuous.)
Proof of Proposition 4. Recall that π̂^t := arg max_π f̂^t(π) denotes the greedy action at round t, and that p^t denotes the distribution of π^t. We can decompose the regret into two terms, representing the contribution from choosing the greedy action and the contribution from
exploring uniformly:
Reg = Σ_{t=1}^T E_{π^t∼p^t}[f⋆(π⋆) − f⋆(π^t)]
    = (1 − ε) Σ_{t=1}^T [f⋆(π⋆) − f⋆(π̂^t)] + ε Σ_{t=1}^T E_{π^t∼unif([A])}[f⋆(π⋆) − f⋆(π^t)]
    ≤ Σ_{t=1}^T [f⋆(π⋆) − f⋆(π̂^t)] + εT.
In the last inequality, we have simply written off the contribution from exploring uniformly
by using that f ⋆ (π) ∈ [0, 1]. It remains to bound the regret we incur from playing the
greedy action. Here, we bound the per-step regret in terms of estimation error using a
similar decomposition to Lemma 4 (note that we are now working with rewards rather than
losses):
f⋆(π⋆) − f⋆(π̂^t) = [f⋆(π⋆) − f̂^t(π⋆)] + [f̂^t(π⋆) − f̂^t(π̂^t)] + [f̂^t(π̂^t) − f⋆(π̂^t)],   (2.7)
where the middle term satisfies f̂^t(π⋆) − f̂^t(π̂^t) ≤ 0 by the definition of the greedy action π̂^t.
Note that this regret decomposition can also be applied to the pure greedy algorithm, which
we have already shown can fail. The reason why ε-Greedy succeeds, which we use in the
argument that follows, is that because we explore, the “effective” number of times that each
arm will be pulled prior to round t is of the order εt/A, which will ensure that the sample
mean converges to f ⋆ . In particular, we will show that the event
E^t := { max_π |f⋆(π) − f̂^t(π)| ≲ √( A log(AT/δ) / (εt) ) }   (2.9)
From here, taking a union bound over all t ∈ [T ] and π ∈ Π ensures that
|f⋆(π) − f̂^t(π)| ≤ √( 2 log(2AT²/δ) / n^t(π) )   (2.11)
for all π and t simultaneously. It remains to show that the number of pulls nt (π) is suffi-
ciently large.
Let et ∈ {0, 1} be a random variable whose value indicates whether the algorithm
explored uniformly at step t, and let mt (π) = |{i < t : π i = π, ei = 1}|, which has
nt (π) ≥ mt (π). Let Z t = I {π t = π, et = 1}. Observe that we can write
m^t(π) = Σ_{i<t} Z^i.
In addition, Z^t ∼ Ber(ε/A), so we have E[m^t(π)] = ε(t − 1)/A. Using Bernstein's inequality (Lemma 5) with Z^1, . . . , Z^{t−1}, we have that for any fixed π and all u > 0, with probability at least 1 − 2e^{−u},
|m^t(π) − ε(t − 1)/A| ≤ √( 2 V[Z](t − 1)u ) + u/3 ≤ √( 2ε(t − 1)u/A ) + u/3 ≤ ε(t − 1)/(2A) + 4u/3,
where we have used that V[Z] = ε/A · (1 − ε/A) ≤ ε/A, and then applied the arithmetic mean-geometric mean (AM-GM) inequality, which states that √(xy) ≤ x/2 + y/2 for x, y ≥ 0. Rearranging, this gives
m^t(π) ≥ ε(t − 1)/(2A) − 4u/3.   (2.12)
Setting u = log(2AT /δ) and taking a union bound, we are guaranteed that with probability
at least 1 − δ, for all π ∈ Π and t ∈ [T ]
This proof shows that the ε-Greedy strategy allows the learner to acquire information
uniformly for all actions, but we pay for this in terms of regret (specifically, through the
εT factor in the final regret bound (2.14)). The issue here is that the ε-Greedy strategy
continually explores all actions, even though we might expect to rule out actions with very
low reward after a relatively small amount of exploration. To address this shortcoming, we
will consider more adaptive strategies.
Remark 9 (Explore-then-commit): A relative of ε-Greedy is the explore-then-
commit (ETC) algorithm (e.g., Robbins [73], Langford and Zhang [57]), which uni-
formly explores actions for the first N rounds, then estimates rewards based on the
data collected and commits to the greedy action for the remaining T − N rounds.
This strategy can be shown to attain Reg ≲ A^{1/3} T^{2/3} for an appropriate choice of N,
matching ε-Greedy.
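A sketch of explore-then-commit on the same kind of toy Bernoulli instance; the instance and the exploration budget N (taken of order A^{1/3}T^{2/3}, one natural choice consistent with the Reg ≲ A^{1/3}T^{2/3} guarantee) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)

means = np.array([0.3, 0.5, 0.45, 0.7])       # toy Bernoulli instance
A, T = len(means), 20_000
N = int(A ** (1 / 3) * T ** (2 / 3))          # exploration budget (one natural choice)
pulls_per_arm = max(1, N // A)

# Phase 1: uniform exploration, pulling every arm the same number of times.
estimates = np.array([np.mean(rng.uniform(size=pulls_per_arm) < mu) for mu in means])
regret = pulls_per_arm * float(np.sum(means.max() - means))

# Phase 2: commit to the empirically best arm for the remaining rounds.
best_arm = int(np.argmax(estimates))
regret += (T - pulls_per_arm * A) * (means.max() - means[best_arm])

print("ETC regret:", regret)
```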
Figure 5: Illustration of the UCB algorithm. Selecting the action π^t optimistically ensures that the suboptimality never exceeds the confidence width.
guaranteed that with high probability, they lower (resp. upper) bound f ⋆ . Given confidence
intervals, the UCB algorithm simply chooses π t as the “optimistic” action that maximizes
the upper confidence bound:
π t = arg max f¯t (π).
π∈Π
The following lemma shows that the instantaneous regret for this strategy is bounded by
the width of the confidence interval; see Figure 5 for an illustration.
27
Lemma 7: Fix t, and suppose that f ⋆ (π) ∈ [f t (π), f¯t (π)] for all π. Then the optimistic
action
π t = arg max f¯t (π)
π∈Π
has
Proof of Lemma 7. The result follows immediate from the observation that for any t ∈ [T ]
and any π ⋆ ∈ Π, we have
Lemma 7 implies that as long as we can build confidence intervals for which the width
f¯t (π t ) − f t (π t ) shrinks, the regret for the UCB strategy will be small. To construct such
intervals, here we appeal to Hoeffding’s inequality for adaptive stopping times (Lemma 33).7
As long as rt ∈ [0, 1], a union bound gives that with probability at least 1 − δ, for all t ∈ [T ]
and π ∈ Π, s
2 log(2T 2 A/δ)
|fbt (π) − f ⋆ (π)| ≤ , (2.18)
nt (π)
P
where we recall that fbt is the sample mean and nt (π) := i<t I {π = π}. This suggests
i
that by choosing
s s
2 log(2T 2 A/δ) 2
f¯t (π) = fbt (π) + , and f t
(π) = bt (π) − 2 log(2T A/δ) ,
f (2.19)
nt (π) nt (π)
we obtain a valid confidence interval. With this choice—along with Lemma 7—we are in a
favorable position, because for a given round t, one of two things must happen:
• The optimistic action has high reward, so the instantaneous regret is small.
• The instantaneous regret is large, which by Lemma 7 implies that confidence width
is large as well (and nt (π t ) is small). This can only happen a small number of times,
since nt (π t ) will increase as a result, causing the width to shrink.
Proposition 5: Using the confidence bounds in (2.19), the UCB algorithm ensures
that with probability at least 1 − δ,
p
Reg ≲ AT log(AT /δ).
7
While asymptotic confidence intervals in classical statistics arise from limit theorems, we are interested
in valid non-asymptotic intervals, and thus appeal to concentration inequalities.
28
This result is optimal up to the log(AT ) factor, which can be removed by using the same
algorithm with a slightly more sophisticated confidence interval construction [11]. Note
that compared to the statistical learning and online learning setting, where we were able to
attain regret bounds that scaled logarithmically with the size of the benchmark class, here
the optimal regret scales linearly with |Π| = A. This is the price we pay for partial/bandit
feedback, and reflects that fact that we must explore all actions to learn.
Proof of Proposition 5. Let us condition on the event in (2.18). Whenever this occurs, we
have that f ⋆ (π) ∈ f t (π), f¯t (π) for all t ∈ [T ] and π ∈ Π, so the confidence intervals are
here, the “∧1” term appears because we can write off the regret for early rounds where
nt (π t ) = 0 as 1.
To bound the right-hand side, we use a potential argument. The basic idea is that at
every round, np t
(π) must increase for some action π, and since there are only A actions, this
means that 1/ nt (π t ) can only be large for a small number of rounds. This can be thought
of as a quantitative instance of the pigeonhole principle.
The main regret bound now follows from Lemma 8 and (2.20).
29
To summarize, the key steps in the proof of Proposition 5 were to:
1. Use the optimistic property and validity of the confidence bounds to bound regret by
the sum of confidence widths.
2. Use a potential argument to show that the sum of confidence widths is small.
We will revisit and generalize both ideas in subsequent chapters for more sophisticated
settings, including contextual bandits, structured bandits, and reinforcement learning.
√
Remark 10 (Instance-dependent regret for UCB): The O( e AT ) regret bound
attained by UCB holds uniformly for all models, and is (nearly) minimax-optimal, in
the sense that ⋆
√ for any algorithm, there exists a model M for which the regret must
scale as Ω( AT ). Minimax optimality is a useful notion of performance, but may be
overly pessimistic. As an alternative, it is possible to show that the UCB attains what
is known as an instance-dependent regret bound, which adapts to the underlying reward
function, and can be smaller for “nice” problem instances.
Let ∆(π) := f ⋆ (π ⋆ ) − f ⋆ (π) be the suboptimality gap for decision π. Then, when
⋆
f (π) ∈ [0, 1], UCB can be shown to achieve
X log(AT /δ)
Reg ≲ .
∆(π)
π:∆(π)>0
If we keep the underlying model fixed and take√ T → ∞, this regret bound scales only
logarithmically in T , which improves upon the T -scaling of the minimax regret bound.
8
It is important that µ is known, otherwise this is no different from the frequentist setting.
30
Posterior Sampling
for t = 1, . . . , T do
Set pt (π) = P(π ⋆ = π | Ht−1 ), where Ht−1 = (π 1 , r1 ), . . . , (π t−1 , rt−1 ).
Sample π t ∼ pt and observe rt .
The basic idea is as follows. At each time t, we can use our knowledge of the prior to
compute the distribution P(π ⋆ = · | Ht−1 ), which represents the posterior distribution over
π ⋆ given all of the data we have collected from rounds 1, . . . , t − 1. The posterior sampling
algorithm simply samples the learner’s action π t from this distribution, thereby “matching”
the posterior distribution of π ⋆ .
Proposition 6: For any prior µ, the posterior sampling algorithm ensures that
p
RegBayes (µ) ≤ AT log(A). (2.22)
In what follows, we prove a simplified version of Proposition 6; the full proof is given in
Section 2.6.
Proof of Proposition 6 (simplified version). We will make the following simplified assump-
tions:
• We restrict to reward distributions where M ⋆ (· | π) = N (f ⋆ (π), 1). That is, f ⋆ is the
only part of the reward distribution that is unknown.
• f ⋆ belongs to a known class F, and rather than proving the regret bound in Proposi-
tion 6, we will prove a bound of the form
p
RegBayes (µ) ≲ AT log|F|,
which replaces the log A factor in the proposition with log|F|.
Since the mean reward function f ⋆ is the only part of the reward distribution M ⋆ that is
unknown, we can simplify by considering an equivalent formulation where the prior has the
form µ ∈ ∆(F). That is, we have a prior over f ⋆ rather than M ⋆ .
Before proceeding, let us introduce some notation. The process through which we sample
f ∼ µ and the run the bandit algorithm induces a joint law over (f ⋆ , HT ), which we call
⋆
P. Throughout the proof, we use E[·] to denote the expectation under this law. We also
define Et [·] = E[· | Ht ] and Pt [·] = P[· | Ht ].
We begin by using the law of total expectation to express the expected regret as
" T #
X
RegBayes (µ) = E Et−1 [f ⋆ (πf ⋆ ) − f ⋆ (π t )] .
t=1
Above, we have written π ⋆ = πf ⋆ to make explicit the fact that this is a random variable
whose value is a function of f ⋆ .
We first simplify the expected regret for each step t. Let µt (f ) := P(f ⋆ = f | Ht−1 ) be the
posterior distribution at timestep t. The learner’s decision π t is conditionally independent
of f ⋆ given Ht−1 , so we can write
Et−1 [f ⋆ (πf ⋆ ) − f ⋆ (π t )] = Ef ⋆ ∼µt ,πt ∼pt [f ⋆ (πf ⋆ ) − f ⋆ (π t )].
31
If we define f¯t (π) = Ef ⋆ ∼µt [f ⋆ (π)] as the expected reward function under the posterior, we
can further write this as
This quantity captures—on average—how far a given realization of f ⋆ deviates from the
posterior mean f¯t , for a specific decision πf ⋆ which is coupled to f ⋆ . The expression above
might appear to be unrelated to the learner’s decision distribution, but the next lemma
shows that it is possible to relate this quantity back to the learner’s decision distribution
using a notion of information gain (or, estimation error).
Proof of Lemma 9. We will show a more general result. Namely, for any ν ∈ ∆(F) and
f¯ : Π → R, if we define p(π) = Pf ∼ν (πf = π), then
q
¯
Ef ∼ν f (πf ) − f (πf ) ≤ A · Ef ∼ν Eπ∼p (f (π) − f¯(π))2 .
(2.24)
This can be thought of as a “decoupling” lemma. On the left-hand side, the random
variables f and πf are coupled, but on the right-hand side, π is drawn from the marginal
distribution over πf , independent of the draw of f itself.
To prove the result, we use Cauchy-Schwarz as follows:
" #
p 1/2 (π )
f
Ef ∼ν f (πf ) − f¯(πf ) = Ef ∼ν 1/2 f (πf ) − f¯(πf )
p (πf )
1/2
1 h 2 i1/2
≤ Ef ∼ν · Ef ∼ν p(πf ) f (πf ) − f¯(πf ) .
p(πf )
32
Using Lemma 9, we have that
" T #
Xq
A · Ef ⋆ ∼µt Eπt ∼pt (f ⋆ (π t ) − f¯t (π t ))2
E[Reg] ≤ E
t=1
v " T #
u
u
Ef ⋆ ∼µt Eπt ∼pt (f ⋆ (π t ) − f¯t (π t ))2 .
X
≤ AT · E
t
t=1
To finish up we will show that Tt=1 Ef ⋆ ∼µt Eπt ∼pt (f ⋆ (π t ) − f¯t (π t ))2 ≤ log|F|. To do this,
P
To keep notation as clear as possible going forward, let us use boldface script (π t , π ⋆ , f ⋆ ,
Ht ) to refer to the abstract random variables under consideration, and use non-boldface
script (π t , π ⋆ , f ⋆ , Ht ) to refer to their realizations. Our aim will be to use the conditional
entropy Ent(f ⋆ | Ht ) as a potential function, and show that for each t,
1
E Ef ⋆ ∼µt Eπt ∼pt (f ⋆ (π t ) − f¯t (π t ))2 = Ent(f ⋆ | Ht−1 ) − Ent(f ⋆ | Ht ).
(2.25)
2
From here the result will follow, because
" T # T
1 X ⋆ t
¯ 2
X
Ent(f ⋆ | Ht−1 ) − Ent(f ⋆ | Ht )
t t
E Ef ⋆ ∼µt Eπt ∼pt (f (π ) − f (π )) =
2
t=1 t=1
= Ent(f ⋆ | H0 ) − Ent(f ⋆ | HT )
≤ Ent(f ⋆ | H0 )
≤ log|F|,
where the last inequality follows because the entropy of a random variable X over a set X
is always bounded by log|X |.
We proceed to prove (2.25). To begin, we use Lemma 39, which implies that
1 ⋆ t
(f (π ) − f¯t (π t ))2 ≤ DKL Prt |f ⋆ ,πt ,Ht−1 ∥ Prt |πt ,Ht−1 .
2
and
1
Ef ⋆ ∼µt Eπt ∼pt (f ⋆ (π t ) − f¯t (π t ))2 = Ef ⋆ ∼µt Eπt ∼pt DKL Prt |f ⋆ ,πt ,Ht−1 ∥ Prt |πt ,Ht−1
2
Since KL divergence satisfies Ex∼PX DKL PY |X=x ∥ PY = Ey∼PY DKL PX|Y =y ∥ PX , this
is equal to
Et−1 DKL Pf ⋆ |πt ,rt ,Ht−1 ∥ Pf ⋆ |Ht−1 = Et−1 DKL Pf ⋆ |Ht ∥ Pf ⋆ |Ht−1 . (2.26)
33
Taking the expectation over Ht−1 , we can write this as
E Et−1 DKL Pf ⋆ |Ht ∥ Pf ⋆ |Ht−1 = EHt−1 EHt |Ht−1 DKL Pf ⋆ |Ht ∥ Pf ⋆ |Ht−1 .
as desired.
The analysis above critically makes use of the fact that we are concerned with Bayesian
regret, and have access to the true prior. One might hope that by choosing a sufficiently
uninformative prior, this approach might continue to work in the frequentist setting. In
fact, this indeed the case for bandits, though a different analysis is required [7, 8]. However,
one can show (Sections 4 and 6) that the Bayesian analysis we have given here extends to
significantly richer decision making settings, while the frequentist counterpart is limited to
simple variants of the multi-armed bandit.
That is, if we take the worst-case value of the Bayesian regret over all possible choices
of prior, this coincides with the minimax value of the frequentist regret.
for each action and time step is arbitrary and fixed ahead of the interaction by an oblivious
adversary. Since we do not posit a stochastic model for rewards, we define regret as in (2.4).
The algorithm we present will build upon the exponential weights algorithm studied in
the context of online supervised learning in Section 1.6. To make the connection as clear
as possible, we make a temporary switch from rewards to losses, mapping rt to 1 − rt , a
transformation that does not change the problem itself.
34
Recall that pt denotes the randomization distribution for the learner at round t. As
discussed in Remark 7, we can write expected regret as
T
X T
X
Reg = ⟨pt , ℓt ⟩ − min ⟨eπ , ℓt ⟩ (2.27)
π∈[A]
t=1 t=1
where ℓt ∈ [0, 1]A is the vector of losses for each of the actions at time t.
Since only the loss (equivalently, reward) of the chosen action π t ∼ pt is observed, we
cannot directly appeal to the exponential weights algorithm, which requires knowledge of
the full vector ℓt . To address this, we build an unbiased estimate of the vector ℓt from
a single real-valued observation ℓt (π t ). At first, this might appear impossible, but it is
straightforward to show that
t ℓt (π)
ℓ (π) = t
e × I {π t = π} (2.28)
p (π)
This algorithm is known as Exp3 (“Exponential Weights for Exploration and Exploitation”).
A full proof of this result is left as an exercise in Section 2.7.
Here and throughout the proof, E[·] will denote the joint expectation over both M ⋆ ∼ µ and
over the sequence HT = (π 1 , r1 ), . . . , (π T , rT ) that the algorithm generates by interacting
with M ⋆ .
We first simplify the (conditional) expected regret for each step t. Let f¯t (π) := Et−1 [f ⋆ (π)]
denote the posterior mean reward function at time t, which should be thought of as
the expected value of f ⋆ given everything we have learned so far. Next, let f¯πt ′ (π) =
Et−1 [f ⋆ (π) | π ⋆ = π ′ ], which is the expected reward given everything we have learned so far,
assuming that π ⋆ = π ′ . We proceed to write the expression
Et−1 [f ⋆ (π ⋆ ) − f ⋆ (π t )]
35
in terms of these quantities. For the learner’s reward, we observe that f ⋆ is conditionally
independent of π t given Ht−1 , we have
Et−1 [f ⋆ (π t )] = Eπ∼pt [f¯t (π)].
For the reward of the optimal action, we begin by writing
X
Et−1 [f ⋆ (π ⋆ )] = Pt−1 (π ⋆ = π) Et−1 [f ⋆ (π) | π ⋆ = π]
π∈Π
where we have used that pt was chosen to match the posterior distribution over π ⋆ . This
establishes that
Et−1 [f ⋆ (π ⋆ ) − f ⋆ (π t )] = Eπ∼pt f¯πt (π) − f¯t (π) .
We now make use of the following decoupling-type inequality, which follows from (2.24):
q
Eπ∼pt f¯πt (π) − f¯t (π) ≤ A · Eπ,π⋆ ∼pt (f¯πt ⋆ (π) − f¯t (π))2 .
(2.32)
To keep notation as clear as possible going forward, let us use boldface script (π t , π ⋆ , f ⋆ ,
t
H ) to refer to the abstract random variables under consideration, and use non-boldface
script (π t , π ⋆ , f ⋆ , Ht ) to refer to their realizations. As in the simplified proof, we will
show that the right-hand side in (2.32) is related to a notion of information gain (that is,
information about π ⋆ acquired at step t). Using Pinsker’s inequality, we have
Eπt ,π⋆ ∼pt (f¯πt ⋆ (π t ) − f¯t (π t ))2 ≤ Et−1 DKL Prt |π⋆ ,πt ,Ht−1 ∥ Prt |πt ,Ht−1 .
Since KL divergence satisfies EX DKL PY |X ∥ PY = EY DKL PX|Y ∥ PX , this is equal to
Et−1 DKL Pπ⋆ |πt ,rt ,Ht−1 ∥ Pπ⋆ |Ht−1 = Et−1 DKL Pπ⋆ |Ht ∥ Pπ⋆ |Ht−1 . (2.33)
This is quantifying how much information about π ⋆ we gain by playing π t and observing rt
at step t, relative to what we knew at step t − 1. Applying (2.32) and (2.33), we have
T T q
¯ ¯ A · Eπ,π⋆ ∼pt (f¯πt ⋆ (π) − f¯t (π))2
X t X
t
Eπ∼pt fπ (π) − f (π) ≤
t=1 t=1
XT q
≤ A · Et−1 DKL Pπ⋆ |Ht ∥ Pπ⋆ |Ht−1
t=1
v
u
u T
X
≤ tAT · E DKL Pπ⋆ |Ht ∥ Pπ⋆ |Ht−1 .
t=1
We can write
E DKL Pπ⋆ |Ht ∥ Pπ⋆ |Ht−1 = Ent(π ⋆ | Ht−1 ) − Ent(π ⋆ | Ht ),
so telescoping gives
T
X
E DKL Pπ⋆ |Ht ∥ Pπ⋆ |Ht−1 = Ent(π ⋆ | H0 ) − Ent(π ⋆ | HT ) ≤ log(A).
t=1
36
2.7 Exercises
Exercise 4 (Adversarial Bandits): In this exercise, we will prove a regret bound for adver-
sarial bandits (Section 2.5), where the sequence of rewards (losses) is non-stochastic. To make
a direct connection to the Exponential Weights Algorithm, we switch from rewards to losses,
mapping rt to 1 − rt , a transformation that does not change the problem itself. To simplify
the presentation, suppose that a collection of losses
for each action π and time step t is arbitrary and chosen before round t = 1; this is referred to
as an oblivious adversary. We denote by ℓt = (ℓt (1), . . . , ℓt (A)) the vector of losses at time t.
The protocol for the problem of adversarial multi-armed bandits (with losses) is as follows:
Multi-Armed Bandit Protocol
for t = 1, . . . , T do
Select decision π t ∈ Π := {1, . . . , A} by sampling π t ∼ pt
Observe loss ℓt (π t )
Let pt be the randomization distribution of the decision-maker on round t. Expected regret
can be written as
" T # T
X X
t
E[Reg] = E t
p ,ℓ − min eπ , ℓt . (2.34)
π∈[A]
t=1 t=1
Since only the loss of the chosen action π t ∼ pt is observed, we cannot directly appeal to the
Exponential Weights Algorithm. The solution is to build an unbiased estimate of the vector ℓt
from the single real-valued observation ℓt (π t ).
t
ℓ (· | π t ) defined by
1. Prove that the vector e
t ℓt (π)
ℓ (π | π t ) = t
e × I {π t = π} (2.35)
p (π)
is an unbiased estimate for ℓt (π) for all π ∈ [A]. In vector notation, this means
t
ℓ (· | π t )] = ℓt .
Eπt ∼pt [e
Conclude that
" T # " T #
X t X t
E[Reg] = E t e
Eπt ∼pt p , ℓ − min E Eπt ∼pt eπ , e
ℓ (2.36)
π∈[A]
t=1 t=1
t
Above, we use the shorthand e ℓ(· | π t ).
ℓ =e
2. Show that given π ′ ,
h t i ℓt (π ′ )2 h t i
ℓ (π | π ′ )2 = t ′ ,
Eπ∼pt e so that ℓ (π | π t )2 ≤ A.
Eπt ∼pt Eπ∼pt e (2.37)
p (π )
3. Define
t−1
( )
X s
pt (π) ∝ exp −η ℓ (· | π s )
eπ , e ,
s=1
37
s
which corresponds to the exponential weights algorithm on the estimated losses e
ℓ . Apply
(1.41) to the estimated losses to show that for any π ∈ [A],
" T # " T #
X t X t p
E t e
Eπt ∼pt p , ℓ −E Eπt ∼pt eπ , ℓ
e ≲ AT log A
t=1 t=1
3. CONTEXTUAL BANDITS
In the last section, we studied the multi-armed bandit problem, which arguably the simplest
framework for interactive decision making. This simplicity comes at a cost: few real-world
problems can be modeled as a multi-armed bandit problem directly. For example, for the
problem of selecting medical treatments, the multi-armed bandit formulation presupposes
that one treatment rule (action/decision) is good for all patients, which is clearly unreason-
able. To address this, we augment the problem formulation by allowing the decision-maker
to select the action π t after observing a context xt ; this is called the contextual bandit prob-
lem. The context xt , which may also be thought of as a feature vector or collection of
xt
<latexit sha1_base64="sQRFZDGuUi34DGiPO7E0IvzubxY=">AAAB6nicdVDJSgNBEO2JW4xb1KOXxiB4Ct0hZLkFvHiMaBZIxtDT6Uma9Cx014gh5BO8eFDEq1/kzb+xJ4mgog8KHu9VUVXPi5U0QMiHk1lb39jcym7ndnb39g/yh0dtEyWaixaPVKS7HjNCyVC0QIIS3VgLFnhKdLzJRep37oQ2MgpvYBoLN2CjUPqSM7DS9f0tDPIFUiSEUEpxSmi1Qiyp12slWsM0tSwKaIXmIP/eH0Y8CUQIXDFjepTE4M6YBsmVmOf6iREx4xM2Ej1LQxYI484Wp87xmVWG2I+0rRDwQv0+MWOBMdPAs50Bg7H57aXiX14vAb/mzmQYJyBCvlzkJwpDhNO/8VBqwUFNLWFcS3sr5mOmGQebTs6G8PUp/p+0S0VaKZavyoVGZRVHFp2gU3SOKKqiBrpETdRCHI3QA3pCz45yHp0X53XZmnFWM8foB5y3T7v/jhU=</latexit>
context
⇡t
<latexit sha1_base64="9kZ+lwA+TgNCBG1gj9jwQVh/mJQ=">AAAB7HicbZDNSsNAFIVv6l+tf1WXboJFqJuQiKTuLLhxWcG0hSaWyXTSDp1MwsxEKKXP0I0LRdz6FK58BHc+iHsnbRdaPTBw+M69zL03TBmVyrY/jcLK6tr6RnGztLW9s7tX3j9oyiQTmHg4YYloh0gSRjnxFFWMtFNBUBwy0gqHV3neuidC0oTfqlFKghj1OY0oRkojz0/pneqWK7Zlz2TaVs113VpuFsRZmMrlW/XrfeqfNrrlD7+X4CwmXGGGpOw4dqqCMRKKYkYmJT+TJEV4iPqkoy1HMZHBeDbsxDzRpGdGidCPK3NGf3aMUSzlKA51ZYzUQC5nOfwv62QqugjGlKeZIhzPP4oyZqrEzDc3e1QQrNhIG4QF1bOaeIAEwkrfp6SP4Cyv/Nc0zyzHtc5vnErdhbmKcATHUAUHalCHa2iABxgoTOERngxuPBjPxsu8tGAseg7hl4zXb1JNkrs=</latexit>
decision
rt
<latexit sha1_base64="eU/BQXfxVYbmM9MfJPow7LIBoPM=">AAAB6nicdVDLSsNAFL2pr1pfVZduBovgKiShpu2u4MZlRfuANpbJdNIOnTyYmQgl9BPcuFDErV/kzr9x0lZQ0QMXDufcy733+AlnUlnWh1FYW9/Y3Cpul3Z29/YPyodHHRmngtA2iXksej6WlLOIthVTnPYSQXHoc9r1p5e5372nQrI4ulWzhHohHkcsYAQrLd2IOzUsVyzTsmpOo4Es03Gqrutq0nCr9YsasrWVowIrtIbl98EoJmlII0U4lrJvW4nyMiwUI5zOS4NU0gSTKR7TvqYRDqn0ssWpc3SmlREKYqErUmihfp/IcCjlLPR1Z4jVRP72cvEvr5+qoO5lLEpSRSOyXBSkHKkY5X+jEROUKD7TBBPB9K2ITLDAROl0SjqEr0/R/6TjmLZrVq+rlaa7iqMIJ3AK52BDDZpwBS1oA4ExPMATPBvceDRejNdla8FYzRzDDxhvn+pxjjU=</latexit>
reward
ot
<latexit sha1_base64="02emXtykcQvGDEHd3kakHTy/oFg=">AAAB6nicdVDLSgMxFM3UV62vqks3wSK4KkkpfewKblxWtLXQjiWTZtrQTDIkGaEM/QQ3LhRx6xe582/MtBVU9MCFwzn3cu89QSy4sQh9eLm19Y3Nrfx2YWd3b/+geHjUNSrRlHWoEkr3AmKY4JJ1LLeC9WLNSBQIdhtMLzL/9p5pw5W8sbOY+REZSx5ySqyTrtWdHRZLqIwQwhjDjOB6DTnSbDYquAFxZjmUwArtYfF9MFI0iZi0VBBj+hjF1k+JtpwKNi8MEsNiQqdkzPqOShIx46eLU+fwzCkjGCrtSlq4UL9PpCQyZhYFrjMidmJ+e5n4l9dPbNjwUy7jxDJJl4vCRECrYPY3HHHNqBUzRwjV3N0K6YRoQq1Lp+BC+PoU/k+6lTKulatX1VKrtoojD07AKTgHGNRBC1yCNugACsbgATyBZ094j96L97pszXmrmWPwA97bJ65Jjgw=</latexit>
observation
covariates (e.g., a patient’s medical history, or the profile of a user arriving at a website),
can be used by the learner to better maximize rewards by tailoring decisions to the specific
patient or user under consideration.
38
Assumption 2 (Stochastic Rewards): Rewards are generated independently via
rt ∼ M ⋆ (· | xt , π t ), (3.1)
f ⋆ (x, π) := E [r | x, π] (3.2)
as the mean reward function under r ∼ M ⋆ (· | x, π), and define π ⋆ (x) := arg maxπ∈Π f ⋆ (x, π)
as the optimal policy, which maps each context x to the optimal action for the context. We
measure performance via regret relative to π ⋆ :
T
X T
X
Reg := f ⋆ (xt , π ⋆ (xt )) − Eπt ∼pt [f ⋆ (xt , π t )], (3.3)
t=1 t=1
where pt ∈ ∆(Π) is the learner’s action distribution at step t (conditioned on the Ht−1
and xt ). This provides a (potentially) much stronger notion of performance than what we
considered for the multi-armed bandit: Rather than competing with the reward of the single
best action, we are competing with the reward of the best sequence of decisions tailored to
the context sequence we observe.
9
pOne can show that running an independent instance of UCB for each context leads to regret
O(
e AT · |X |); see Exercise 5.
39
Assumption 3: The decision-maker has access to a class F ⊂ {f : X × Π → R} such
that f ⋆ ∈ F.
Using the class F, we would like to develop algorithms that can model the underlying reward
function for better decision making performance. With this goal in mind, it is reasonable to
try leveraging the algorithms and respective guarantees we have already seen for statistical
and online supervised learning. At this point, however, the decision-making problem—with
its exploration-exploitation dilemma—appears to be quite distinct from these supervised
learning frameworks. Indeed, naively applying supervised learning methods, which do not
account for the interactive nature of the problem, can lead to failure, as we saw with the
greedy algorithm in Section 2.1. In spite of these apparent difficulties, in the next few
lectures, we will show that it is possible to leverage supervised learning methods to develop
provable decision making methods, thereby bridging the two methodologies.
Optimism via confidence sets. Let us describe a general approach (or, template) for
applying the principle of optimism to contextual bandits [26, 1, 74, 37]. Suppose that at
each time, we have a way to construct a confidence set
Ft ⊆ F
based on the data observed so far, with the important property that f ⋆ ∈ F t . Given such
a confidence set we can define upper and lower confidence functions f t , f¯t : X × Π → R via
f t (x, π) = min f (x, π), f¯t (x, π) = max f (x, π). (3.4)
f ∈F t f ∈F t
These functions generalize the upper and lower confidence bounds we constructed in Sec-
tion 2. Since f ⋆ ∈ F t , they have the property that
40
That is, the suboptimality is bounded by the width of the confidence interval at (xt , π t ),
and the total regret is bounded as
T
f¯t (xt , π t ) − f t (xt , π t ).
X
Reg ≤ (3.7)
t=1
To make this approach concrete and derive sublinear bounds on the regret, we need a way to
construct the confidence set F t , ideally so that the width in (3.7) shrinks as fast as possible.
be the empirical risk minimizer at round t, and with β := 8 log(|F|/δ) define F 1 = F and
t−1 t−1
( )
X X
Ft = f ∈ F : (f (xi , π i ) − ri )2 ≤ (fbt (xi , π i ) − ri )2 + β (3.9)
i=1 i=1
for t > 1. That is, our confidence set F t is the collection of all functions that have empirical
squared error close to that of fbt . The idea behind this construction is to set β “just large
enough”, to ensure that we do not accidentally exclude f ⋆ , with the precise value for β
informed by the concentration inequalities we explored in Section 1. The only catch here is
that we need to use variants of these inequalities that handle dependent data, since xt and
π t are not i.i.d. in (3.8). The following result shows that F t is indeed valid and, moreover,
that all functions f ∈ F t have low estimation error on the history.
where β = 8 log(|F|/δ).
Lemma 10 is valid for any algorithm, but it is particularly useful for UCB as it establishes
the validity of the confidence bounds as per (3.5); however, it is not yet enough to show
that the algorithm attains low regret. Indeed, to bound the regret, we need to control the
confidence widths in (3.7), but there is a mismatch: for step τ , the regret bound in (3.7)
considers the width at (xτ , π τ ), but (3.10) only ensures closeness of functions in F τ under
(x1 , π 1 ), . . . , (xτ −1 , π τ −1 ). We will show in the sequel that for linear models, it is possible to
control this mismatch, but that this is not possible in general.
41
Proof of Lemma 10. For f ∈ F, define
U t (f ) = (f (xt , π t ) − rt )2 − (f ⋆ (xt , π t ) − rt )2 . (3.11)
It is straightforward to check that10
Et−1 U t (f ) = Et−1 (f (xt , π t ) − f ⋆ (xt , π t ))2 , (3.12)
:= E[· | Ht−1 , xt ]. Then Z t (f ) = Et−1 U t (f ) − U t (f ) is a martingale difference
where Et−1 [·] P
sequence and τt=1 Z t (f ) is a martingale. Since increments Z t (f ) are bounded as |Z t (f )| ≤ 1
(this holds whenever f ∈ [0, 1], rt ∈ [0, 1]), according to Lemma 35 with η = 81 , with
probability at least 1 − δ, for all τ ≤ T ,
τ τ
X 1X
Et−1 Z t (f )2 + 8 log(δ −1 ).
Z t (f ) ≤ (3.13)
8
t=1 t=1
Since the left-hand side is nonnegative, we conclude that with probability at least 1 − δ,
τ
X τ
X
(f ⋆ (xt , π t ) − rt )2 ≤ (f (xt , π t ) − rt )2 + 8 log(δ −1 ). (3.17)
t=1 t=1
and in particular
τ −1
X τ −1
X
⋆ 2
∀τ ∈ [T + 1], t t
(f (x , π ) − r ) ≤ t
(fbτ (xt , π t ) − rt )2 + 8 log(|F|/δ); (3.19)
t=1 t=1
that is, we have f ⋆ ∈ F τ for all τ ∈ {1, . . . , T + 1}, proving the first claim. For the second
part of the claim, observe that any f ∈ F τ must satisfy
τ −1
X
U t (f ) ≤ β
t=1
since the empirical risk of f ⋆ is never better than the empirical risk of the minimizer fbt .
Thus from (3.16), with probability at least 1 − δ, for all τ ≤ T ,
τ −1
X
Et−1 U t (f ) ≤ 2β + 16 log(δ −1 ). (3.20)
t=1
The second claim follows by taking union bound over f ∈ F τ ⊆ F, and by (3.12).
10
We leave Et−1 on the right-hand side to include the case of randomized decisions π t ∼ pt .
42
3.2 Optimism for Linear Models: The LinUCB Algorithm
We now instantiate the general template for optimistic algorithms developed in the previous
section for the special case where F is a class of linear functions.
Linear models. We fix a feature map ϕ : X × Π → Bd2 (1), where Bd2 (1) is the unit-norm
Euclidean ball in Rd . The feature map is assumed to be known to the learning agent. For
example, in the case of medical treatments, ϕ transforms the medical history and symptoms
x for the patient, along with a possible treatment π, to a representation ϕ(x, π) ∈ Bd2 (1).
We take F to be the set of linear functions given by
where Θ ⊆ Bd2 (1) is the parameter set. As before, we assume f ⋆ ∈ F; we let θ∗ denote
the corresponding parameter vector, so that f ⋆ (x, π) = ⟨θ⋆ , ϕ(x, π)⟩. With some abuse of
notation, we associate the set of parameters Θ to the corresponding functions in F.
To apply the technical results in the previous section, we assume for simplicity that
|Θ| = |F| is finite. To extend our results to potentially non-finite sets, one can work with
an ε-discretization, or ε-net, which is of size at most O(ε−d ) using standard arguments.
Taking ε ∼ 1/T ensures only a constant loss in cumulative regret relative to the continuous
set of parameters, while log |F| ≲ d log T .
LinUCB
Input: β > 0
for t = 1, . . . , T do
Compute the least squares solution θbt (over θ ∈ Θ) given by
(⟨θ, ϕ(xi , π i )⟩ − ri )2 .
X
θbt = arg min
θ∈Θ i<t
Define
t−1
X
et =
Σ ϕ(xi , π i )ϕ(xi , π i ) T + I.
i=1
Observe reward rt
The following result shows that LinUCB enjoys a regret bound that scales with the
complexity log|F| of the model class and the feature dimension d.
43
Proposition 7: Let Θ ⊆ Bd2 (1) and fix ϕ : X × Π → Bd2 (1). For a finite set F of linear
functions (3.21), taking β = 8 log(|F|/δ), LinUCB satisfies, with probability at least
1 − δ, p p
Reg ≲ βdT log(1 + T /d) ≲ dT log(|F|/δ) log(1 + T /d)
for any sequence of contexts x1 , . . . , xT . More generally, for infinite F, we may take
β = O(d log(T ))a and √
Reg ≲ d T log(T ).
a
This follows from a simple covering number argument.
Notably, this regret bound has no explicit dependence on the context space size |X |. Inter-
estingly, the bound is also independent of the number of actions |Π|, which is replaced by
the dimension d; this reflects that the linear structure of F allows the learner to generalize
not just across contexts, but across decisions. We will expand upon the idea of generalizing
across actions in Section 4.
Proof of Proposition 7. The confidence set (3.9) in the generic optimistic algorithm tem-
plate is
t−1 t−1
( )
X X
Ft = θ ∈ Θ : (⟨θ, ϕ(xi , π i )⟩ − ri )2 ≤ (⟨θbt , ϕ(xi , π i )⟩ − ri )2 + β , (3.22)
i=1 i=1
where θbt is the least squares solution computed in LinUCB. According to Lemma 10, with
probability at least 1 − δ, for all t ∈ [T ], all θ ∈ F t satisfy
t−1
X
(⟨θ − θ∗ , ϕ(xi , π i )⟩)2 ≤ 4β, (3.23)
i=1
2
Since θbt ∈ F t , we have that for any θ ∈ Θ′ , by triangle inequality, θ − θbt Σt ≤ 16β.
Furthermore, since θbt ∈ Θ ⊆ Bd2 (1), θ − θbt 2 ≤ 2. Combining the two constraints into one,
we find that Θ′ is a subset of
n o t−1
2
X
′′
Θ = θ ∈ Rd : θ − θbt et
Σ
≤ 16β + 4 , where et =
Σ ϕ(xi , π i )ϕ(xi , π i ) T + I.
i=1
(3.25)
The definition of f¯t in (3.4) and the inclusion Θ′ ⊆ Θ′′ implies that
f¯t (x, π) ≤
p
max√ ⟨θ, ϕ(x, π)⟩ = θbt , ϕ(x, π) + 16β + 4 ∥ϕ(x, π)∥(Σe t )−1 ,
θ:∥θ−θbt ∥Σ
e t ≤ 16β+4
(3.26)
11
p
For a PSD matrix Σ ⪰ 0, we define ∥x∥Σ = ⟨x, Σx⟩.
44
√
and similarly f t (x, π) ≥ θbt , ϕ(x, π) − 16β + 4∥ϕ(x, π)∥(Σe t )−1 . We conclude that regret
of the UCB algorithm, in view of Lemma 7, is
v
X T u
u T
X
∥ϕ(xt , π t )∥2(Σe t )−1 .
p t t
Reg ≤ 2 16β + 4 ∥ϕ(x , π )∥(Σe t )−1 ≲ tβT (3.27)
t=1 t=1
The above upper bound has the same flavor as the one in Lemma 8: as we obtain more and
more information in some direction v, the matrix Σe t has a larger and larger component in
that direction, and for that direction v, the term ∥v∥2(Σe t )−1 becomes smaller and smaller.
To conclude, we apply a potential argument, Lemma 11 below, to bound
T
X
∥ϕ(xt , π t )∥2(Σe t )−1 ≲ d log(1 + T /d).
t=1
The following result is referred to as the elliptic potential lemma, and it can be thought
of as a generalization of Lemma 8.
T
X
∥at ∥2V −1 ≤ 2d log(1 + T /d). (3.28)
t−1
t=1
45
Now, consider a (well-specified) problem instances in which rewards are deterministic
and given by
rt = f ⋆ (xt , π t ),
which we note is a constant function with respect to the context. Since f ⋆ is the true model,
πg is always the best action, bringing a reward of 1 − ε per round. Any time πb is chosen,
the decision-maker incurs instantaneous regret 1 − ε. We will now argue that if we apply
the generic optimistic algorithm from Section 3.1, it will choose πb every time a new context
is encountered, leading to Ω(N ) regret.
Let S t be the set of distinct contexts encountered before round t. Clearly, the exact
minimizers of empirical square loss (see (3.8)) are f ⋆ , and all fi where i is such that xi ∈ / St.
Hence, for any choice of β ≥ 0, the confidence set in (3.9) contains all fi for which xi ∈ / St.
This implies that for each t ∈ [T ] where x = xi ∈t
/ S , action πb has a higher upper confidence
t
46
Definition 3 (Online Regression Oracle): At each time t ∈ [T ], an online regression
oracle returns, given
(x1 , π 1 , r1 ), . . . , (xt−1 , π t−1 , rt−1 )
with E[ri |xi , π i ] = f ⋆ (xi , π i ) and π i ∼ pi , a function fbt : X × Π → R such that
T
X
Eπt ∼pt (fbt (xt , π t ) − f ⋆ (xt , π t ))2 ≤ EstSq (F, T, δ)
t=1
with probability at least 1 − δ. For the results that follow, pi = pi (·|xi , Hi−1 ) will
represent the randomization distribution of a decision-maker.
For example, for finite classes, the (averaged) exponential weights method introduced in
Section 1.6 is an online regression oracle with EstSq (F, T, δ) = log(|F|/δ). More generally,
in view of Lemma 6, any online learning algorithm that attains low square loss regret for
the problem of predicting of rt based on (xt , π t ) leads to a valid online regression oracle.
Note that we make use of online learning oracles for the results that follow because
we aim to derive regret bounds that hold for arbitrary, potentially adversarial sequences
x1 , . . . , xT . If we instead assume that contexts are i.i.d., it is reasonable to make use of
algorithms for offline estimation, or statistical learning with F. See Section 3.5.1 for further
discussion.
The first general-purpose contextual bandit algorithm we will study, illustrated below,
is a contextual counterpart to the ε-Greedy method introduced in Section 2.
Observe reward r . t
At each step t, the algorithm uses an online regression oracle to compute a reward estimator
fbt (x, a) based on the data Ht−1 collected so far. Given this estimator, the algorithm uses the
same sampling strategy as in the non-contextual case: with probability 1 − ε, the algorithm
chooses the greedy decision
and with probability ε it samples a uniform random action π t ∼ unif({1, . . . , A}). The
following theorem shows that whenever the online estimation oracle has low estimation
error EstSq (F, T, δ), this method achieves low regret.
47
Proposition 8: Assume f ⋆ ∈ F and f ⋆ (x, a) ∈ [0, 1]. Suppose the decision-maker
has access to an online regression oracle (Definition 3) with a guarantee EstSq (F, T, δ).
Then by choosing ε appropriately, the ε-Greedy algorithm ensures that with probability
at least 1 − δ,
for any sequence x1 , . . . , xT . As a special case, when F is finite, if we use the (averaged)
exponential weights algorithm as an online regression oracle, the ε-Greedy algorithm
has
Notably, this result scales with log|F| for any finite class, analogous to regret bounds for
offline/online supervised learning. The T 2/3 -dependence in the regret bound is suboptimal
(as seen for the special case of non-contextual bandits), which we will address using more
deliberate exploration methods in the sequel.
Proof of Proposition 8. Recall that pt denotes the randomization strategy on round t, com-
puted after observing xt . Following the same steps as the proof of Proposition 4, we can
bound regret by
T
X T
X
Reg = Eπt ∼pt [f ⋆ (xt , π ⋆ (xt )) − f ⋆ (xt , π t )] ≤ f ⋆ (xt , π ⋆ (xt )) − f ⋆ (xt , π
bt ) + εT,
t=1 t=1
f ⋆ (xt , π ⋆ ) − f ⋆ (xt , π
bt )
= [f ⋆ (xt , π ⋆ ) − fbt (xt , π ⋆ )] + [fbt (xt , π ⋆ ) − fbt (xt , π bt ) − f ⋆ (xt , π
bt )] + [fbt (xt , π bt )]
X
≤ |f ⋆ (xt , π) − fbt (xt , π)|
π t ,π ⋆ }
π∈{b
X 1 p t
= p p (π)|f ⋆ (xt , π) − fbt (xt , π)|.
π t ,π ⋆ }
π∈{b
p (π)
t
48
Summing across t, this gives
T T r 2 1/2
X
⋆ t ⋆ t ⋆ t 2A X
t
⋆ t t t t t
f (x , π (x )) − f (x , π
b)≤ Eπt ∼pt f (x , π ) − f (x , π )
b (3.32)
ε
t=1 t=1
2 1/2
r ( T )
2AT X
≤ Eπt ∼pt f ⋆ (xt , π t ) − fbt (xt , π t ) . (3.33)
ε
t=1
Now observe that the online regression oracle guarantees that with probability 1 − δ,
T
X 2
Eπt ∼pt f ⋆ (xt , π t ) − fbt (xt , π t ) ≤ EstSq (F, T, δ).
t=1
3.5 Inverse Gap Weighting: An Optimal Algorithm for General Model Classes
To conclude this section, we present a general, oracle-based algorithm for contextual bandits
which achieves
p
Reg ≲ AT log|F|
for any finite class F. As with ε-Greedy, this approach has no dependence on the cardinality
|X | of the context space, reflecting the ability to generalize across contexts. The dependence
on T improves upon ε-Greedy, and is optimal.
To motivate the approach, recall that conceptually, the key step of the proof of Propo-
sition 8 involved relating the instantaneous regret
between fbt and f ⋆ under the randomization distribution pt . The ε-Greedy exploration
distribution gives a way to relate these quantities, but the algorithm’s regret is suboptimal
because the randomization distribution puts mass at least ε/A on every action, even those
that are clearly suboptimal and should be discarded. One can ask whether there exists a
better randomization strategy that still admits an upper bound on (3.34) in terms of (3.35).
Proposition 9 below establishes exactly that. At first glance, this distribution might appear
to be somewhat arbitrary or “magical”, but we will show in subsequent chapters that it
arises as a special case of more general—and in some sense, universal—principle for designing
decision making algorithms, which extends well beyond contextual bandits.
49
Definition 4 (Inverse Gap Weighting [2, 36]): Given a vector fb = (fb(1), . . . , fb(A)) ∈
RA , the Inverse Gap Weighting distribution p = IGWγ (fb(1), . . . , fb(A)) with parameter
γ ≥ 0 is defined as
1
p(π) = , (3.36)
π ) − fb(π))
λ + 2γ(fb(b
b = arg maxπ fb(π) is the greedy action, and where λ ∈ [1, A] is chosen such that
where π
P
π p(π) = 1.
Above,P the normalizing constant λ ∈ P[1, A] is always guaranteed to exist, because we have
1 A
λ ≤ π p(π) ≤ λ , and because λ 7→ π p(π) is continuous over [1, A].
Let us give some intuition behind the distribution in (3.36). We can interpret the
parameter γ as trading off exploration and exploitation. Indeed, γ → 0 gives a uniform
distribution, while γ → ∞ amplifies the gap between the greedy action π b and any action
with fb(π) < fb(b
π ), resulting in a distribution supported only on actions that achieve the
largest estimated value fb(bπ ).
The following fundamental technical result shows that playing the Inverse Gap Weight-
ing distribution always suffices to link the instantaneous regret in (3.34) in to the instanta-
neous estimation error in (3.35).
Proposition 9: Consider a finite decision space Π = {1, . . . , A}. For any vector fb ∈ RA
and γ > 0, define p = IGWγ (fb(1), . . . , fb(A)). This strategy guarantees that for all
f ⋆ ∈ RA ,
A
Eπ∼p [f ⋆ (π ⋆ ) − f ⋆ (π)] ≤ + γ · Eπ∼p (fb(π) − f ⋆ (π))2 .
(3.37)
γ
Proof of Proposition 9. We break the “regret” term on the left-hand side of (3.37) into three
terms:
Eπ∼p f ⋆ (π ⋆ ) − f ⋆ (π) = Eπ∼p fb(b
π ) − fb(π) + Eπ∼p fb(π) − f ⋆ (π) + f ⋆ (π ⋆ ) − fb(b
π) .
| {z } | {z } | {z }
(I) exploration bias (II) est. error on policy (III) est. error at opt
The first term asks “how much would we lose by exploring, if fb were the true reward
function?”, and is equal to
X π ) − fb(π)
fb(b A−1
≤ ,
π λ + 2γ fb(bπ ) − fb(π) 2γ
while the second term is at most
1 γ
q
+ Eπ∼p (fb(π) − f ⋆ (π))2 .
Eπ∼p (fb(π) − f ⋆ (π))2 ≤
2γ 2
The third term can be further written as
γ 1
f ⋆ (π ⋆ ) − fb(π ⋆ ) − (fb(b
π ) − fb(π ⋆ )) ≤ p(π ⋆ )(f ⋆ (π ⋆ ) − fb(π ⋆ ))2 + − (fb(bπ ) − fb(π ⋆ ))
2 2γp(π ⋆ )
γ ⋆ 2 1 ⋆
≤ Eπ∼p (f (π) − f (π)) +
b − (f (b
b π ) − f (π )) .
b
2 2γp(π ⋆ )
50
The term in brackets above is equal to
π ) − fb(π ⋆ ))
λ + 2γ(fb(b λ A
π ) − fb(π ⋆ )) =
− (fb(b ≤ .
2γ 2γ 2γ
The simple result we just proved is remarkable. The special IGW strategy guarantees a
relation between regret and estimation error for any estimator fb and any f ⋆ , irrespective of
the problem structure or the class F. Proposition 9 will be at the core of the development for
the rest of the course, and will be greatly generalized to general decision making problems
and reinforcement learning.
Below, we present a contextual bandit algorithm called SquareCB [36] which makes use
of the Inverse Gap Weighting distribution.
SquareCB
Input: Exploration parameter γ > 0.
for t = 1, . . . , T do
Obtain fbt from online regression oracle with (x1 , π 1 , r1 ), . . . , (xt−1 , π t−1 , rt−1 ).
Observe xt .
Compute pt = IGWγ fbt (xt , 1), . . . , fbt (xt , A) .
Select action π t ∼ pt .
Observe reward rt .
At each step t, the algorithm uses an online regression oracle to compute a reward estimator
fbt (x, a) based on the data Ht−1 collected so far. Given this estimator, the algorithm uses
Inverse Gap Weighting to compute pt = IGWγ (fbt (xt , ·)) as an exploratory distribution, then
samples π t ∼ pt .
The following result, which is a near-immediate consequence of Proposition 9, gives a
regret bound for this algorithm.
Proposition 10: Given a class F with f ⋆ ∈ F, assume the decision-maker has access
to an online regression
p oracle (Definition 3) with estimation error EstSq (F, T, δ). Then
SquareCB with γ = T A/EstSq (F, T, δ) attains a regret bound of
q
Reg ≲ AT EstSq (F, T, δ)
Proof of Proposition 10. We begin with regret, then add and subtract the squared estima-
tion error as follows:
XT
Reg = Eπt ∼pt [f ⋆ (xt , π ⋆ ) − f ⋆ (xt , π t )]
t=1
T
X h i
= Eπt ∼pt f ⋆ (xt , π ⋆ ) − f ⋆ (xt , π t ) − γ · (f ⋆ (xt , π t ) − fbt (xt , π t ))2 + γ · EstSq (F, T, δ).
t=1
51
By appealing to Proposition 9 with fb(xt , ·) and f ⋆ (xt , ·), for each step t, we have
h i A
Eπt ∼pt f ⋆ (xt , π ⋆ ) − f ⋆ (xt , π t ) − γ · (f ⋆ (xt , π t ) − fbt (xt , π t ))2 ≤ ,
γ
and thus
TA
+ γ · EstSq (F, T, δ).
Reg ≤
γ
Choosing γ to balance these terms yields the result.
If the online regression oracle is minimax optimal (that is, EstSq (F, T, δ) is the “best
possible” for F) then SquareCB is also minimax optimal for F. Thus, IGW not only provides
a connection between online supervised learning and decision making, but it does so in an
optimal fashion. Establishing minimax optimality is beyond the scope of this course: it
requires understanding of minimax optimality of online regression with arbitrary F, as well
as lower bound on regret of contextual bandits with arbitrary sequences of contexts. We
refer to Foster and Rakhlin [36] for details.
where x1 , . . . , xt−1 are i.i.d., π i ∼ p(xi ) for fixed p : X → ∆(Π) and E[ri |xi , π i ] =
f ⋆ (xi , π i ), an offline regression oracle returns a function fb : X × Π → R such that
Note that the normalization t−1 above is introduced to keep the scaling consistent with our
conventions for offline estimation.
Below, we state a variant of SquareCB which is adapted to offline oracles [79]. Compared
to the SquareCB for online oracles, the main change is that we update the estimation
oracle and exploratory distribution on an epoched schedule as opposed to updating at every
round. In addition, the parameter γ for the Inverse Gap Weighting distribution changes as
a function of the epoch.
52
for t = τm−1 + 1, . . . , τm do
Observe xt .
Compute pt = IGWγm fbm (xt , 1), . . . , fbm (xt , A) .
Select action π t ∼ pt .
Observe reward rt .
While this algorithm is quite intuitive, proving a regret bound for it is quite non-trivial—
much more so than the online oracle variant. They key challenge is that, while the con-
texts x1 , . . . , xT are i.i.d., the decisions π 1 , . . . , π T evolve in a time-dependent fashion, which
makes it unclear to invoke the guarantee in Definition 5. Nonetheless, the following remark-
able result shows that this algorithm attains a regret bound similar to that of Proposition
10.
q
Proposition 11 (Simchi-Levi and Xu [79]): Let τm = 2m and γm = AT /Estoff Sq (F, τm−1 , δ)
for m = 1, 2, . . .. Then with probability at least 1 − δ, regret of SquareCB with an offline
oracle is at most
⌈log T ⌉ q
X
Reg ≲ A · τm · Estoff 2
Sq (F, τm , δ/m ).
m=1
For a finite class F, we recall from Section 1 that empirical risk with the square loss (least
squares) achieves Estoff
Sq (F, T, δ) ≲ log(|F|/δ), which gives
p
Reg ≲ AT log(|F|/δ).
3.6 Exercises
53
we assume without loss of generality that T is a power of 2, and that an arbitrary decision is
made on round t = 1. At the end of each epoch m − 1, the offline oracle is invoked with the
data from the epoch, producing an estimated model fbm . This model is used for the greedy
step in the next epoch m. In other words, for any round t ∈ [2m + 1, 2m+1 ] of epoch m, the
algorithm observes xt ∼ D, chooses an action π t ∼ unif[A] with probability ε and chooses the
greedy action
π t = arg max fbm (xt , π)
π∈[A]
1. Prove that for any T ∈ N and δ > 0, by setting ε appropriately, this method ensures that
with probability at least 1 − δ,
2/3
log2 T
X
Reg ≲ A1/3 T 1/3 2m/2 Estoff
Sq (F, 2
m−1
, δ/m2 )1/2
m=1
To do so, we assumed that f ⋆ ∈ F, where f ⋆ (x, a) := Er∼M ⋆ (·|x,a) [r]; that is, we have a well-
specified model. In practice, it may be unreasonable to assume that we have f ⋆ ∈ F. Instead,
a weaker assumption is that there exists some function f¯ ∈ F such that
for some ε > 0; that is, the model is ε-misspecified. In this problem, we will generalize the
regret bound for SquareCB to handle misspecification. Recall that in the lecture notes, we
assumed (Definition 3) that the regression oracle satisfies
T
X
Eπt ∼pt (fbt (xt , π t ) − f ⋆ (xt , π t ))2 ≤ EstSq (F, T, δ).
t=1
In the misspecified setting, this is too much to ask for. Instead, we will assume that the oracle
satisfies the following guarantee for every sequence:
T
X T
X
(fbt (xt , π t ) − rt )2 − min (f (xt , π t ) − rt )2 ≤ RegSq (F, T ).
f ∈F
t=1 t=1
Whenever f ⋆ ∈ F, we have EstSq (F, T, δ) ≲ RegSq (F, T ) + log(1/δ) with probability at least
1 − δ. However, it is possible to keep RegSq (F, T ) small even when f ⋆ ∈
/ F. For example, the
averaged exponential weights algorithm satisfies this guarantee with RegSq (F, T ) ≲ log|F|,
regardless of whether f ⋆ ∈ F.
54
We will show that for every δ > 0, with an appropriate choice of γ, SquareCB (that is, the
algorithm that chooses pt = IGWγ (fbt (xt , ·))) ensures that with probability at least 1 − δ,
q
Reg ≲ AT · (RegSq (F, T ) + log(1/δ)) + ε · A1/2 T.
Assume that all functions in F and rewards take values in [0, 1].
1. Show that for any sequence of estimators fb1 , . . . , fbt , by choosing pt = IGWγ (fbt (xt , ·)), we
have that
T T
AT h i
Eπt ∼pt (fbt (xt , π t ) − f¯(xt , π t ))2 +εT.
X X
Reg = Eπt ∼pt [f ⋆ (xt , π ⋆ (xt )) − f ⋆ (xt , π t )] ≲ +γ
t=1
γ t=1
If we had f ⋆ = f¯, this would follow from Proposition 9, but the difference is that in general
(f¯ ̸= f ⋆ ), the expression above measures estimation error with respect to the best-in-class
model f¯ rather than the true model f ⋆ (at the cost of an extra εT factor).
2. Show that the following inequality holds for every sequence
T T
(fbt (xt , π t ) − f¯(xt , π t ))2 ≤ RegSq (F, T ) + 2 (rt − f¯(xt , π t ))(fbt (xt , π t ) − f¯(xt , π t )).
X X
t=1 t=1
3. Using Freedman’s inequality (Lemma 35), show that with probability at least 1 − δ,
T h i T
Eπt ∼pt (fbt (xt , π t ) − f¯(xt , π t ))2 ≤ 2 (fbt (xt , π t ) − f¯(xt , π t ))2 + O(log(1/δ)).
X X
t=1 t=1
4. Using Freedman’s inequality once more, show that with probability at least 1 − δ,
T T
1X h i
(r −f¯(xt , π t ))(fbt (xt , π t )−f¯(xt , π t )) ≤ Eπt ∼pt (fbt (xt , π t ) − f¯(xt , π t ))2 +O(ε2 T +log(1/δ)).
X
t
2
t=1
4 t=1
t=1
5. Combining the previous results, show that for any δ > 0, by choosing γ > 0 appropriately,
we have that with probability at least 1 − δ,
q
Reg ≲ AT · (RegSq (F, T ) + log(1/δ)) + ε · A1/2 T.
4. STRUCTURED BANDITS
Up to this point, we have focused our attention on bandit problems (with or without
contexts) in which the decision space Π is a small, finite set. This section introduces
the structured bandit problem, which generalizes the basic (non-contextual) multi-armed
bandit problem by allowing for large, potentially infinite or continuous decision spaces.
The protocol for the setting is as follows.
55
Structured Bandit Protocol
for t = 1, . . . , T do
Select decision π t ∈ Π. // Π is large and potentially continuous.
Observe reward rt ∈ R.
This protocol is exactly the same as for multi-armed bandits (Section 2), except that we
have removed the restriction that Π = {1, . . . , A}, and now allow it to be arbitrary. This
added generality is natural in many applications:
• In routing applications, the decision space may be finite, but combinatorially large.
For example, the decision might be a path or flow in a graph.
Both contextual bandits and structured bandits generalize the basic multi-armed bandit
problem, by incorporating function approximation and generalization, but in different ways:
• The contextual bandit formulation in Section 3 assumes structure in the context space.
The aim here was to generalize across contexts, but we restricted the decision space
to be finite (unstructured).
• In structured bandits, we will focus our attention on the case of no contexts, but will
assume the decision space is structured, and aim to generalize across decisions.
Clearly, both ideas above can be combined, and we will touch on this in Section 4.5.
xt
<latexit sha1_base64="sQRFZDGuUi34DGiPO7E0IvzubxY=">AAAB6nicdVDJSgNBEO2JW4xb1KOXxiB4Ct0hZLkFvHiMaBZIxtDT6Uma9Cx014gh5BO8eFDEq1/kzb+xJ4mgog8KHu9VUVXPi5U0QMiHk1lb39jcym7ndnb39g/yh0dtEyWaixaPVKS7HjNCyVC0QIIS3VgLFnhKdLzJRep37oQ2MgpvYBoLN2CjUPqSM7DS9f0tDPIFUiSEUEpxSmi1Qiyp12slWsM0tSwKaIXmIP/eH0Y8CUQIXDFjepTE4M6YBsmVmOf6iREx4xM2Ej1LQxYI484Wp87xmVWG2I+0rRDwQv0+MWOBMdPAs50Bg7H57aXiX14vAb/mzmQYJyBCvlzkJwpDhNO/8VBqwUFNLWFcS3sr5mOmGQebTs6G8PUp/p+0S0VaKZavyoVGZRVHFp2gU3SOKKqiBrpETdRCHI3QA3pCz45yHp0X53XZmnFWM8foB5y3T7v/jhU=</latexit>
context
⇡t
<latexit sha1_base64="9kZ+lwA+TgNCBG1gj9jwQVh/mJQ=">AAAB7HicbZDNSsNAFIVv6l+tf1WXboJFqJuQiKTuLLhxWcG0hSaWyXTSDp1MwsxEKKXP0I0LRdz6FK58BHc+iHsnbRdaPTBw+M69zL03TBmVyrY/jcLK6tr6RnGztLW9s7tX3j9oyiQTmHg4YYloh0gSRjnxFFWMtFNBUBwy0gqHV3neuidC0oTfqlFKghj1OY0oRkojz0/pneqWK7Zlz2TaVs113VpuFsRZmMrlW/XrfeqfNrrlD7+X4CwmXGGGpOw4dqqCMRKKYkYmJT+TJEV4iPqkoy1HMZHBeDbsxDzRpGdGidCPK3NGf3aMUSzlKA51ZYzUQC5nOfwv62QqugjGlKeZIhzPP4oyZqrEzDc3e1QQrNhIG4QF1bOaeIAEwkrfp6SP4Cyv/Nc0zyzHtc5vnErdhbmKcATHUAUHalCHa2iABxgoTOERngxuPBjPxsu8tGAseg7hl4zXb1JNkrs=</latexit>
decision
rt
<latexit sha1_base64="eU/BQXfxVYbmM9MfJPow7LIBoPM=">AAAB6nicdVDLSsNAFL2pr1pfVZduBovgKiShpu2u4MZlRfuANpbJdNIOnTyYmQgl9BPcuFDErV/kzr9x0lZQ0QMXDufcy733+AlnUlnWh1FYW9/Y3Cpul3Z29/YPyodHHRmngtA2iXksej6WlLOIthVTnPYSQXHoc9r1p5e5372nQrI4ulWzhHohHkcsYAQrLd2IOzUsVyzTsmpOo4Es03Gqrutq0nCr9YsasrWVowIrtIbl98EoJmlII0U4lrJvW4nyMiwUI5zOS4NU0gSTKR7TvqYRDqn0ssWpc3SmlREKYqErUmihfp/IcCjlLPR1Z4jVRP72cvEvr5+qoO5lLEpSRSOyXBSkHKkY5X+jEROUKD7TBBPB9K2ITLDAROl0SjqEr0/R/6TjmLZrVq+rlaa7iqMIJ3AK52BDDZpwBS1oA4ExPMATPBvceDRejNdla8FYzRzDDxhvn+pxjjU=</latexit>
reward
ot
<latexit sha1_base64="02emXtykcQvGDEHd3kakHTy/oFg=">AAAB6nicdVDLSgMxFM3UV62vqks3wSK4KkkpfewKblxWtLXQjiWTZtrQTDIkGaEM/QQ3LhRx6xe582/MtBVU9MCFwzn3cu89QSy4sQh9eLm19Y3Nrfx2YWd3b/+geHjUNSrRlHWoEkr3AmKY4JJ1LLeC9WLNSBQIdhtMLzL/9p5pw5W8sbOY+REZSx5ySqyTrtWdHRZLqIwQwhjDjOB6DTnSbDYquAFxZjmUwArtYfF9MFI0iZi0VBBj+hjF1k+JtpwKNi8MEsNiQqdkzPqOShIx46eLU+fwzCkjGCrtSlq4UL9PpCQyZhYFrjMidmJ+e5n4l9dPbNjwUy7jxDJJl4vCRECrYPY3HHHNqBUzRwjV3N0K6YRoQq1Lp+BC+PoU/k+6lTKulatX1VKrtoojD07AKTgHGNRBC1yCNugACsbgATyBZ094j96L97pszXmrmWPwA97bJ65Jjgw=</latexit>
observation
56
Assumption 4 (Stochastic Rewards): Rewards are generated independently via
rt ∼ M ⋆ (· | π t ), (4.1)
We define
f ⋆ (π) := E [r | π] (4.2)
as the mean reward function under r ∼ M ⋆ (· | π), and measure regret via
T
X T
X
Reg := f ⋆ (π ⋆ ) − Eπt ∼pt [f ⋆ (π t )]. (4.3)
t=1 t=1
Here, π ⋆ := arg maxπ∈Π f ⋆ (π) as usual. We will define the history as Ht = (π 1 , r1 ), . . . , (rt , π t ).
Function approximation. A first attempt to tackle the structured bandit problem might
be to applyp algorithms for the multi-armed bandit setting, such as UCB. This would give
e |Π|T ), which could be vacuous if Π is large relative to T . However, with no
regret O(
further assumptions on the underlying reward function f ⋆ , this is unavoidable. To allow for
better regret, we will make assumptions on the structure of f ⋆ that will allow us to share
information across decisions, and to generalize to decisions that we may not have played.
This is well-suited for the applications described above, where Π is a continuous set (e.g.,
Π ⊆ Rd ), but we expect f ⋆ to be continuous, or perhaps even linear with respect some
well-designed set of features. To make this idea precise, we follow the same approach as in
statistical learning and contextual bandits, and assume access to a well-specified function
class F that aims to capture our prior knowledge about f ⋆ .
Example 4.1 (Necessity of structural assumptions). Let Π = [A], and let F = {fi }i∈[A] ,
where
1 1
fi (π) := + I {π = i} .
2 2
It is clear that one needs
p Reg ≳ A for this setting, yet log|F| = log(A), so a regret bound
of the form Reg ≲ T log|F| is not possible if A is large relative to T . ◁
What this example highlights is that generalizing across decisions is fundamentally dif-
ferent (and, in some sense, more challenging) than generalizing across contexts. In light
57
of this, we will aim for guarantees that scale with log|F|, but additionally scale with an
appropriate notion of complexity of exploration for the decision space Π. Such a notion of
complexity should reflect how much information is shared across decisions, which depends
on the interplay between Π and F.
1. UCB attains guarantees that scale with log|F|, and additionally scale with a notion
of complexity called the eluder dimension, which is small for simple problems such as
bandits with linear rewards.
2. In general, UCB is not optimal, and can have regret that is exponentially large com-
pared to the optimal rate.
be the empirical minimizer on round t, and with β := 8 log(|F|/δ), define confidence sets
F 1 = F and
t−1 t−1
( )
X X
Ft = f ∈ F : (f (π i ) − ri )2 ≤ (fbt (π i ) − ri )2 + β . (4.5)
i=1 i=1
Defining f¯t (π) := maxf ∈F t f (π) as the upper confidence bound, the generalized UCB algo-
rithm is given by
π t = arg max f¯t (π). (4.6)
π∈Π
When does the confidence width shrink? Using Proposition 7, one can see the gen-
eralized UCB algorithm ensures that f ⋆ ∈ F t for all t with high probability. Whenever this
happens, regret is bounded by the upper confidence width:
T
f¯t (π t ) − f ⋆ (π t ).
X
Reg ≤ (4.7)
t=1
This bound holds for all structured bandit problems, with no assumption on the structure
of Π and F. Hence, to derive a regret bound, the only question we need to answer is when
will the confidence widths shrink?
58
For the unstructured multi-armed bandit, we need to shrinkpthe width for every arm
separately, and the best bound on (4.7) we can hope for is O( |Π|T ). One might hope
that if Π and F have nice structure, we can do better. In fact, we have already seen one
such case: For linear models, where
n o
F = π 7→ ⟨θ, ϕ(π)⟩ | θ ∈ Θ ⊂ Bd2 (1) , (4.8)
p
Proposition 7 shows that we can bound (4.7) by dT log|F|. Here, the number of decisions
|Π| is replaced by the dimension d, which reflects the fact that there are only d truly unique
directions to explore before we can start extrapolating to new actions. Is there a more
general version of this phenomenon when we move beyond linear models?
The eluder dimension is defined as Edimf ⋆ (F, ε) = supε′ ≥ε Edimf ⋆ (F, ε′ ) ∨ 1. We ab-
breviate Edim(F, ε) = maxf ⋆ ∈F Edimf ⋆ (F, ε).
The intuition behind the eluder dimension is simple: It asks, for a worst-case sequence of
decisions, how many times we can be “surprised” by a new decision π t if we can estimate
the underlying model f ⋆ well on all of the preceding points. In particular, if we form
confidence sets as in (4.5) with β = ε2 , then the number of times the upper confidence
width in (4.7) can be larger than ε is at most Edimf ⋆ (F, ε). We consider the definition
Edimf ⋆ (F, ε) = supε′ ≥ε Edimf ⋆ (F, ε′ ) ∨ 1 instead of directly working with Edimf ⋆ (F, ε) to
ensure monotonicity with respect to ε, which will be useful in the proofs that follow.
The following result gives a regret bound for UCB for generic structured bandit prob-
lems. The regret bound has no dependence on the size of the decision space, and scales
only with Edim(F, ε) and log|F|.
Proposition 12: For a finite set of functions F ⊂ (Π → [0, 1]), using β = 8 log(|F|/δ),
the generalized UCB algorithm guarantees that with probability at least 1 − δ,
np o q
Reg ≲ min Edim(F, ε) · T log(|F|/δ) + εT ≲ Edim(F, T −1/2 ) · T log(|F|/δ).
ε>0
(4.10)
59
For the case of linear models in (4.8), it is possible to use the elliptic potential lemma
(Lemma 11) to show that
Edim(F, ε) ≲ d log(ε−1 ).
p
For finite classes, this gives Reg ≲ dT log(|F|/δ) log(T ), which recovers the guarantee in
Proposition 7. Another well-known example is that of generalized linear models. Here, we
fix link function σ : [−1, +1] → R and define
n o
F = π 7→ σ ⟨θ, ϕ(π)⟩ | θ ∈ Θ ⊂ Bd2 (1) .
This is a more flexible model than linear bandits. A well-known special case is the logistic
bandit problem, where σ(z) = 1/(1 + e−z ). One can show [74] that for any choice of σ, if
there exist µ, L > 0 such that µ < σ ′ (z) < L for all z ∈ [−1, +1], then
L2
Edim(F, ε) ≲ · d log(ε−1 ). (4.11)
µ2
This leads to a regret bound that scales with Lµ dT log|F|, generalizing the regret bound
p
Edim(F, ε) ≳ ed (4.12)
for constant ε. That is, even for a single ReLU neuron, the eluder dimension is already
exponential, which is a bit disappointing. Fortunately, we will show in the sequel that the
eluder dimension can be overly pessimistic, and it is possible to do better, but this will
require changing the algorithm.
1. f ⋆ ∈ F t .
2. F t ⊆ F t .
Now, define
wt (π) = sup [f (π) − f ⋆ (π)],
f ∈F t
60
which is a useful upper bound on the upper confidence width at time t. Since F t ⊆ F t , we
have
XT
Reg ≤ wt (π t ).
t=1
We now appeal to the following technical lemma concerning the eluder dimension.
Lemma 12 (Russo and Van Roy [74], Lemma 3): Fix a function class F, a function f⋆ ∈ F, and a parameter β > 0. For any sequence π^1, …, π^T, if we define
    w^t(π) = sup_{f∈F} { f(π) − f⋆(π) : Σ_{i<t} (f(π^i) − f⋆(π^i))² ≤ β },
then for all α > 0,
    Σ_{t=1}^T I{w^t(π^t) > α} ≤ (β/α² + 1) · Edim_{f⋆}(F, α).
Note that for the special case where β = α², the bound in Lemma 12 immediately follows from the definition of the eluder dimension. The point of this lemma is to show that a similar bound holds for all scales α simultaneously, but with a pre-factor β/α² that grows large when α² ≪ β.
To apply this result, fix ε > 0, and bound
    Σ_{t=1}^T w^t(π^t) ≤ Σ_{t=1}^T w^t(π^t) · I{w^t(π^t) > ε} + εT.    (4.13)
Proof of Lemma 12. Let us adopt the shorthand d = Edim_{f⋆}(F, α). We begin with a definition. We say that π is α-independent of π^1, …, π^t if there exists f ∈ F such that |f(π) − f⋆(π)| > α and Σ_{i=1}^t (f(π^i) − f⋆(π^i))² ≤ α². We say that π is α-dependent on π^1, …, π^t otherwise.
First, suppose that w^t(π^t) > α, and let f ∈ F be a witness, so that f(π^t) − f⋆(π^t) > α and Σ_{i<t} (f(π^i) − f⋆(π^i))² ≤ β. If π^t is α-dependent on each of M disjoint subsequences of π^1, …, π^{t−1}, then each such subsequence B must satisfy Σ_{π∈B} (f(π) − f⋆(π))² > α² (otherwise π^t would be α-independent of B), so Mα² < β, and so M ≤ β/α².
Next, we claim that for any τ and any sequence (π^1, …, π^τ), there is some j such that π^j is α-dependent on at least ⌊τ/d⌋ disjoint subsequences of π^1, …, π^{j−1}. Let N = ⌊τ/d⌋, and let B_1, …, B_N be subsequences of π^1, …, π^τ. We initialize with B_i = (π^i). If π^{N+1} is α-dependent on B_i = (π^i) for all 1 ≤ i ≤ N we are done. Otherwise, choose i such that π^{N+1} is α-independent of B_i, and add it to B_i. Repeat this process until we reach j such that either π^j is α-dependent on all B_i or j = τ. In the first case we are done, while in the second case, we have Σ_{i=1}^N |B_i| ≥ τ ≥ dN. Moreover, |B_i| ≤ d, since each π^j ∈ B_i is α-independent of its prefix (this follows from the definition of the eluder dimension). We conclude that |B_i| = d for all i, so in this case π^τ is α-dependent on all B_i.
Finally, let (π^{t_1}, …, π^{t_τ}) be the subsequence of π^1, …, π^T consisting of all elements for which w^{t_i}(π^{t_i}) > α. Each element of the sequence is α-dependent on at most β/α² disjoint subsequences of (π^{t_1}, …, π^{t_τ}), and by the argument above, one element is α-dependent on at least ⌊τ/d⌋ disjoint subsequences, so we must have ⌊τ/d⌋ ≤ β/α², which implies that τ ≤ (β/α² + 1)d.
Example 4.2 (Cheating Code [9, 50]). Let A ∈ N be a power of 2 and consider the following
function class F.
• The decision space is Π = [A] ∪ C, where C = {c1 , . . . , clog2 (A) } is a set of “cheating”
actions.
• For all actions π ∈ [A], f (π) ∈ [0, 1] for all f ∈ F, but we otherwise make no
assumption on the reward.
• For each f ∈ F, rewards for actions in C take the following form. Let πf ∈ [A] denote
the action in [A] with highest reward. Let b(f ) = (b1 (f ), . . . , blog2 (A) (f )) ∈ {0, 1}log2 (A)
be a binary encoding for the index of πf ∈ [A] (e.g., if πf = 1, b(f ) = (0, 0, . . . , 0), if
πf = 2, b(f ) = (0, 0, . . . , 0, 1), and so on). For each action ci ∈ C, we set
f (ci ) = −bi (f ).
The idea here is that if we ignore the actions in C, this looks like a standard multi-armed bandit problem, and the optimal regret is Θ(√(AT)). However, we can use the actions in C
to “cheat” and get an exponential improvement in sample complexity. The argument is as
follows.
Suppose for simplicity that rewards are Gaussian with r ∼ N(f⋆(π), 1) under π. For each cheating action c_i ∈ C, since f⋆(c_i) = −b_i(f⋆) ∈ {0, −1}, we can determine whether the value is b_i(f⋆) = 0 or b_i(f⋆) = 1 with high probability using Õ(1) action pulls. If we do this for each c_i ∈ C, which will incur Õ(log(A)) regret (there are log(A) such actions and each one leads to constant regret), we can infer the binary encoding b(f⋆) = (b_1(f⋆), …, b_{log₂(A)}(f⋆)) for the optimal action π_{f⋆} with high probability. At this point, we can simply stop exploring, and commit to playing π_{f⋆} for the remaining rounds, which will incur no more regret. If one is careful with the details, this gives that with probability at least 1 − δ,
    Reg ≲ log₂(A/δ).
In other words, by exploiting the cheating actions, our regret has gone from linear to
logarithmic in A (we have also improved the dependence on T , which is a secondary bonus).
Now, let us consider the behavior of the generalized UCB algorithm. Unfortunately,
since all actions ci ∈ C have f (ci ) ≤ 0 for all f ∈ F, we have f¯t (ci ) ≤ 0. As a result, the
generalized UCB algorithm will only ever pull actions in [A], ignoring the cheating actions
and effectively turning this into a vanilla multi-armed bandit problem, which means that
    Reg ≳ √(AT).
◁
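The cheating strategy sketched in the example is straightforward to simulate. Below is a hedged illustration with Gaussian rewards: each cheating arm is pulled a fixed number of times (`n_pulls`, a hypothetical constant standing in for the Õ(1) in the argument), the bits are decoded by thresholding the empirical mean at −1/2, and the algorithm then commits to the decoded arm. Arms are 0-indexed and `f_star` is a hypothetical dictionary mapping arms in [A] and cheating arms ("c", i) to their mean rewards.

```python
import numpy as np

def cheat_and_commit(A, f_star, T, n_pulls=50, rng=None):
    """Decode the optimal arm's index from the cheating arms, then commit to it."""
    rng = np.random.default_rng() if rng is None else rng
    num_bits = int(np.log2(A))
    total_reward, t, bits = 0.0, 0, []
    for i in range(num_bits):
        # Pull cheating arm c_i a few times; its mean reward is either 0 or -1.
        pulls = rng.normal(f_star[("c", i)], 1.0, size=n_pulls)
        total_reward += pulls.sum()
        t += n_pulls
        bits.append(1 if pulls.mean() < -0.5 else 0)
    # Decode the binary encoding (b_1 treated as the most significant bit).
    pi_hat = int("".join(map(str, bits)), 2)
    # Commit to the decoded arm for the remaining rounds.
    total_reward += rng.normal(f_star[pi_hat], 1.0, size=T - t).sum()
    return pi_hat, total_reward
```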
This example shows that UCB can behave suboptimally in the presence of decisions that
reveal useful information but do not necessarily lead to high reward. Since the “cheating”
actions are guaranteed to have low reward, UCB avoids them even though they are very
informative. We conclude that:
1. Obtaining optimal sample complexity for structured bandits requires algorithms that
more deliberately balance the tradeoff between optimizing reward and acquiring in-
formation.
2. In general, the optimal strategy for picking decisions can be very different depending
on the choice of the class F. This contrasts with the contextual bandit setting, where we
saw that the Inverse Gap Weighting algorithm attained optimal sample complexity for
any choice of class F, and all that needed to change was how to perform estimation.
Example 4.2 implies that posterior sampling is not optimal in general. Indeed, posterior sampling will never select the cheating arms in C, as these have sub-optimal reward for all models in F. As a result, the Bayesian regret of the algorithm will scale with Reg ≳ √(AT) for a worst-case prior.
Recall, following the discussion in Section 3, that the averaged exponential weights algorithm is an online regression oracle with EstSq(F, t, δ) ≲ log(|F|/δ).
The following algorithm, which we call Estimation-to-Decisions or E2D [40, 43], is a general-purpose meta-algorithm for structured bandits.

Estimation-to-Decisions (E2D) for Structured Bandits
Input: Exploration parameter γ > 0.
for t = 1, …, T do
    Obtain f̂^t from the online regression oracle with (π^1, r^1), …, (π^{t−1}, r^{t−1}).
    Compute
        p^t = argmin_{p∈Δ(Π)} max_{f∈F} E_{π∼p}[ f(π_f) − f(π) − γ · (f(π) − f̂^t(π))² ].
    Select action π^t ∼ p^t.
At each timestep t, the algorithm invokes an online regression oracle to obtain an estimator f̂^t using the data H^{t−1} = (π^1, r^1, …, π^{t−1}, r^{t−1}) observed so far. The algorithm then finds a distribution p^t by solving a min-max optimization problem involving the estimator f̂^t and the class F, then samples the decision π^t from this distribution.
The minimax problem in E2D is derived from a complexity measure (or, structural
parameter) for F called the Decision-Estimation Coefficient [40, 43], whose value is given
by
    dec_γ(F, f̂) = min_{p∈Δ(Π)} max_{f∈F} E_{π∼p}[ f(π_f) − f(π) − γ · (f(π) − f̂(π))² ],    (4.15)
where the first term is the regret of the decision and the second is the information gain for the observation.
The Decision-Estimation Coefficient can be thought of as the value of a game in which the learner (represented by the min player) aims to find a distribution over decisions such that for a worst-case problem instance (represented by the max player), the regret of their decision is controlled by a notion of information gain (or, estimation error) relative to a reference model f̂. Conceptually, f̂ should be thought of as a guess for the true model, and the learner (the min player) aims to—in the face of an unknown environment (the max player)—optimally balance the regret of their decision with the amount of information they acquire. With enough information, the learner can confirm or rule out their guess f̂, and the scale parameter γ controls how much regret they are willing to incur to do this. In general, the larger the value of dec_γ(F, f̂), the more difficult it is to explore.
To state a regret bound for E2D, we define
    dec_γ(F) = sup_{f̂∈co(F)} dec_γ(F, f̂).    (4.16)
Here, co(F) denotes the set of all convex combinations of elements in F. The reason we consider the set co(F) is that in general, online estimation algorithms such as exponential weights will produce improper predictions with f̂ ∈ co(F). In fact, it turns out (see Proposition 24) that even if we allow f̂ to be unconstrained above, the maximizer always lies in co(F) without loss of generality.
The main result for this section shows that the regret for E2D is controlled by the value
of the DEC and the estimation error EstSq (F, T, δ) for the online regression oracle.
Proposition 13 (Foster et al. [40]): The E2D algorithm with exploration parameter γ > 0 guarantees that with probability at least 1 − δ,
    Reg ≤ dec_γ(F) · T + γ · EstSq(F, T, δ).
We can optimize over the parameter γ in the result above, which yields
    Reg ≤ inf_{γ>0} { dec_γ(F) · T + γ · EstSq(F, T, δ) } ≤ 2 · inf_{γ>0} max{ dec_γ(F) · T, γ · EstSq(F, T, δ) }.
For finite classes, we can use the exponential weights method to obtain EstSq(F, T, δ) ≲ log(|F|/δ), and this bound specializes to
    Reg ≲ inf_{γ>0} max{ dec_γ(F) · T, γ · log(|F|/δ) }.    (4.18)
As desired, this gives a bound on regret that scales only with:
1. the complexity of estimation for the model class, which is captured by EstSq(F, T, δ) (e.g., log(|F|/δ) for finite classes); and
2. the complexity of exploration in the decision space, which is captured by dec_γ(F).
Before interpreting the result further, we give the proof, which is a nearly immediate con-
sequence of the definition of the DEC, and bears strong similarity to the proof of the regret
bound for SquareCB (Proposition 10), minus contexts.
where the first equality above uses that p^t is chosen as the minimizer for dec_γ(F, f̂^t). Summing across rounds, we conclude that Reg ≤ dec_γ(F) · T + γ · EstSq(F, T, δ).
When designing algorithms for structured bandits, a common challenge is that the
connection between decision making (where the learner’s decisions influence what feedback
is collected) and estimation (where data is collected passively) may not seem apparent a priori. The power of the Decision-Estimation Coefficient is that it—by definition—provides
a bridge, which the proof of Proposition 13 highlights. One can select decisions by building
an estimate for the model using all of the observations collected so far, then sampling
from the distribution p that solves (4.15) with the estimated reward function fb plugged
in. Boundedness of the DEC implies that at every round, any learner using this strategy
either enjoys small regret or acquires information, with their total regret controlled by the
cumulative online estimation error.
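When Π and F are both finite, the min-max problem (4.15) can be solved exactly: for each fixed f the objective is linear in p, so the problem is a small linear program. The sketch below is one way to implement the resulting E2D loop in Python using scipy.optimize.linprog; the `regression_oracle` argument is an abstract stand-in for the online estimation procedure (e.g., averaged exponential weights), and the whole thing is an illustration rather than an efficient implementation.

```python
import numpy as np
from scipy.optimize import linprog

def dec_distribution(F, f_hat, gamma):
    """Solve min_p max_{f in F} E_{pi~p}[f(pi_f) - f(pi) - gamma (f(pi) - f_hat(pi))^2]
    as a linear program over (p, v), where v upper bounds the max over models."""
    num_models, num_decisions = F.shape
    # a[j, pi] = f_j(pi_{f_j}) - f_j(pi) - gamma * (f_j(pi) - f_hat(pi))^2
    a = F.max(axis=1, keepdims=True) - F - gamma * (F - f_hat) ** 2
    c = np.zeros(num_decisions + 1)
    c[-1] = 1.0                                       # minimize v
    A_ub = np.hstack([a, -np.ones((num_models, 1))])  # a_j . p - v <= 0 for every model
    b_ub = np.zeros(num_models)
    A_eq = np.hstack([np.ones((1, num_decisions)), np.zeros((1, 1))])
    b_eq = np.array([1.0])                            # p sums to one
    bounds = [(0, None)] * num_decisions + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:num_decisions], res.x[-1]

def e2d(F, regression_oracle, reward_fn, T, gamma, rng=None):
    """E2D meta-algorithm: estimate, solve the DEC min-max problem, sample, repeat."""
    rng = np.random.default_rng() if rng is None else rng
    history = []
    for t in range(T):
        f_hat = regression_oracle(history)            # e.g. averaged exponential weights
        p, _ = dec_distribution(F, f_hat, gamma)
        p = np.clip(p, 0, None)
        pi_t = int(rng.choice(len(f_hat), p=p / p.sum()))
        history.append((pi_t, reward_fn(pi_t)))
    return history
```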
Example: Multi-Armed Bandit. Of course, the perspective above is only useful if the
DEC is indeed bounded, which itself is not immediately apparent. In Section 6, we will show
that boundedness of the DEC is not just sufficient, but in fact necessary for low regret in
a fairly strong quantitative sense. For now, we will build intuition about the DEC through
examples. We begin with the multi-armed bandit, where Π = [A] and F = R^A. Our first result shows that dec_γ(F) ≤ A/γ, and that this is achieved with the Inverse Gap Weighting method introduced in Section 3.

Proposition 14 (IGW minimizes the DEC): For the Multi-Armed Bandit setting, where Π = [A] and F = R^A, the Inverse Gap Weighting distribution p = IGW_{4γ}(f̂) in (3.36) is the exact minimizer for dec_γ(F, f̂), and certifies that dec_γ(F, f̂) = (A − 1)/(4γ).
For any fixed p and π ⋆ , first-order conditions for optimality imply that the choice for f that
maximizes this expression is
    f(π) = f̂(π) − 1/(2γ) + I{π = π⋆} · 1/(2γ p(π⋆)).
This choice gives
    E_{π∼p}[f(π⋆) − f(π)] = E_{π∼p}[f̂(π⋆) − f̂(π)] + (1 − p(π⋆))/(2γ p(π⋆))
and
    γ · E_{π∼p}[(f(π) − f̂(π))²] = (1 − p(π⋆))/(4γ) + (1 − p(π⋆))²/(4γ p(π⋆)) = 1/(4γ p(π⋆)) − 1/(4γ).
Plugging in and simplifying, we compute that the original minimax game is equivalent to
    min_{p∈Δ([A])} max_{π⋆∈[A]} { E_{π∼p}[f̂(π⋆) − f̂(π)] + 1/(4γ p(π⋆)) − 1/(4γ) }.    (4.20)
Finishing the proof: Ad-hoc approach. Observe that for any p ∈ Δ(Π), we have
    max_{π⋆∈[A]} { E_{π∼p}[f̂(π⋆) − f̂(π)] + 1/(4γ p(π⋆)) } ≥ E_{π⋆∼p} E_{π∼p}[ f̂(π⋆) − f̂(π) + 1/(4γ p(π⋆)) ] = A/(4γ),
so no p can attain value better than A/(4γ). If we can show that IGW achieves this value, we are done.
Observe that by setting p = IGW_{4γ}(f̂), we have that for all π⋆,
    E_{π∼p}[f̂(π⋆) − f̂(π)] + 1/(4γ p(π⋆)) = E_{π∼p}[f̂(π⋆) − f̂(π)] + f̂(π̂) − f̂(π⋆) + λ/(4γ)    (4.21)
                                           = E_{π∼p}[f̂(π̂) − f̂(π)] + λ/(4γ)
(here π̂ := π_{f̂} = argmax_{π} f̂(π), and λ is the normalizing constant in the definition of IGW in (3.36)). Note that the value on the right-hand side is independent of π⋆. That is, the inverse gap weighting distribution is an equalizing strategy. This means that for this choice of p, we have
    max_{π⋆∈[A]} { E_{π∼p}[f̂(π⋆) − f̂(π)] + 1/(4γ p(π⋆)) } = min_{π⋆∈[A]} { E_{π∼p}[f̂(π⋆) − f̂(π)] + 1/(4γ p(π⋆)) }
                                                          = E_{π⋆∼p} E_{π∼p}[ f̂(π⋆) − f̂(π) + 1/(4γ p(π⋆)) ] = A/(4γ).
Hence, p = IGW_{4γ}(f̂) achieves the optimal value.
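The equalizing property is easy to verify numerically. The sketch below implements one realization of the inverse gap weighting distribution (probabilities 1/(λ + 4γ(f̂(π̂) − f̂(π))), with λ found by bisection so that they sum to one, which is how (3.36) is interpreted here) and checks that the quantity appearing in (4.20) takes the same value A/(4γ) for every π⋆.

```python
import numpy as np

def igw(f_hat, eta):
    """Inverse gap weighting: p(pi) = 1 / (lam + eta * (max f_hat - f_hat(pi))),
    with lam determined by bisection so that the probabilities sum to one."""
    gaps = f_hat.max() - f_hat
    lo, hi = 1e-12, float(len(f_hat))   # sum > 1 at lo, sum <= 1 at hi
    for _ in range(100):
        lam = 0.5 * (lo + hi)
        lo, hi = (lam, hi) if np.sum(1.0 / (lam + eta * gaps)) > 1 else (lo, lam)
    return 1.0 / (lam + eta * gaps)

def equalized_values(f_hat, gamma):
    """E_{pi~p}[f_hat(pi_star) - f_hat(pi)] + 1/(4 gamma p(pi_star)) for each pi_star."""
    p = igw(f_hat, 4 * gamma)
    return f_hat - p @ f_hat + 1.0 / (4 * gamma * p)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    f_hat, gamma = rng.uniform(0, 1, size=6), 5.0
    print(equalized_values(f_hat, gamma))   # all entries approximately A / (4 gamma)
    print(len(f_hat) / (4 * gamma))
```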
Finishing the proof: Principled approach. We begin by relaxing to p ∈ R^A_+. Define
    g_{π⋆}(p) = f̂(π⋆) + 1/(4γ p(π⋆)).
Let ν ∈ R be a Lagrange multiplier and p ∈ R^A_+, and consider the Lagrangian
    L(p, ν) = max_{π⋆∈[A]} g_{π⋆}(p) − Σ_π p(π) f̂(π) + ν ( Σ_π p(π) − 1 ).
By the KKT conditions, if we wish to show that p ∈ Δ(Π) is optimal for the objective in (4.20), it suffices to find ν such that¹²
    0 ∈ ∂_p L(p, ν),
where ∂_p denotes the subgradient with respect to p. Recall that for a convex function h(x) = max_y g(x, y), we have ∂_x h(x) = co({∇_x g(x, y) | g(x, y) = max_{y′} g(x, y′)}). As a result,
    ∂_p L(p, ν) = ν1 − f̂ + co({∇_p g_{π⋆}(p) | g_{π⋆}(p) = max_{π′} g_{π′}(p)}).
Now, let p = IGW_{4γ}(f̂). We will argue that 0 ∈ ∂_p L(p, ν) for an appropriate choice of ν. By (4.21), we know that g_π(p) = g_{π′}(p) for all π, π′ (p is equalizing), so the expression above simplifies to
    ∂_p L(p, ν) = ν1 − f̂ + co({∇_p g_{π⋆}(p)}_{π⋆∈Π}).    (4.22)
Noting that ∇_p g_{π⋆}(p) = −(1/(4γ p²(π⋆))) e_{π⋆}, we compute
    δ := Σ_π p(π) ∇_p g_π(p) = ( −1/(4γ p(π)) )_{π∈Π} = ( −λ/(4γ) − f̂(π̂) + f̂(π) )_{π∈Π},
which has δ ∈ co({∇_p g_{π⋆}(p)}_{π⋆∈Π}). By choosing ν = λ/(4γ) + f̂(π̂), we have
    ν1 − f̂ + δ = 0,
so (4.22) is satisfied.
¹²If p ∈ Δ(Π), the KKT condition that (d/dν) L(p, ν) = 0 is already satisfied.
4.3 Decision-Estimation Coefficient: Examples
We now show how to bound the Decision-Estimation Coefficient for a number of examples
beyond finite-armed bandits—some familiar and others new—and show how this leads to
bounds on regret via E2D.
Approximately solving the DEC. Before proceeding, let us mention that to apply E2D, it is not necessary to exactly solve the minimax problem (4.15). Instead, let us say that a distribution p = p(f̂, γ) certifies an upper bound on the DEC if, given f̂ and γ > 0, it ensures that
    sup_{f∈F} E_{π∼p}[ f(π_f) − f(π) − γ · (f(π) − f̂(π))² ] ≤ C(f̂, γ)
for some known quantity C(f̂, γ) ≥ dec_γ(F, f̂). In this case, it is simple to see that if we use the distribution p^t = p(f̂^t, γ) within E2D, the guarantee of Proposition 13 continues to hold with dec_γ(F) replaced by sup_{f̂} C(f̂, γ).
Proposition 15 (DEC for Cheating Code): Consider the cheating code in Exam-
ple 4.2. For this class F, we have
    dec_γ(F) ≲ log₂(A)/γ.
Note that while the strategy p in Proposition 15 certifies a bound on the DEC, it is not necessarily the exact minimizer, and hence the distributions p^1, …, p^T played by E2D may be different. Nonetheless, since the regret of E2D is bounded by the DEC, this result (via Proposition 13) implies that its regret is bounded by Reg ≲ √(log₂(A) · T · log|F|). Using a slightly more refined version of the E2D algorithm [43], one can improve this to match the log(T) regret bound given in Example 4.2.
Proof of Proposition 15. To simplify exposition, we present a bound on dec_γ(F, f̂) for this example only for f̂ ∈ F, not for f̂ ∈ co(F). A similar approach (albeit with a slightly different choice for p) leads to the same bound on dec_γ(F). Let f̂ ∈ F and γ > 0 be given, and define
    p = (1 − ε) · I_{π_{f̂}} + ε · unif(C).
We will show that this choice certifies that
    dec_γ(F, f̂) ≲ log₂(A)/γ.
Let f ∈ F be fixed, and consider the value
    E_{π∼p}[ f(π_f) − f(π) − γ · (f(π) − f̂(π))² ].
We consider two cases. For the first, if π_f = π_{f̂}, then we can upper bound
    E_{π∼p}[ f(π_f) − f(π) − γ · (f(π) − f̂(π))² ] ≤ E_{π∼p}[f(π_f) − f(π)] = E_{π∼p}[f(π_{f̂}) − f(π)] ≤ 2ε,
using that f ∈ [−1, 1]. For the second case, suppose π_f ≠ π_{f̂}. Here, we want to argue that the negative offset term above is sufficiently large; informally, this means that we are exploring "enough". Observe that since π_f ≠ π_{f̂}, if we let b_1, …, b_{log₂(A)} and b′_1, …, b′_{log₂(A)} denote the binary representations for π_f and π_{f̂}, there exists i such that b_i ≠ b′_i. As a result, we have
    E_{π∼p}[(f(π) − f̂(π))²] ≥ (ε/log₂(A)) · (f(c_i) − f̂(c_i))² = (ε/log₂(A)) · (b_i − b′_i)² = ε/log₂(A).
We conclude that in the second case,
    E_{π∼p}[ f(π_f) − f(π) − γ · (f(π) − f̂(π))² ] ≤ 2 − γ · ε/log₂(A).
Putting the cases together, we have
    E_{π∼p}[ f(π_f) − f(π) − γ · (f(π) − f̂(π))² ] ≤ max{ 2ε, 2 − γ · ε/log₂(A) }.
To balance these terms, we set
    ε = 2 log₂(A)/γ,
which leads to the result.
where Σ_p := E_{z∼p}[zz^⊤].
The G-optimal design ensures coverage in every direction of the decision space, generalizing the notion of uniform exploration for finite action spaces. In this sense, it can be thought of as a "universal" exploratory distribution for linearly structured action spaces. Special cases include:
• When Z = Δ([A]), we can take p = unif(e_1, …, e_A) as an optimal design.
• When Z = B_2^d(1), we can again take p = unif(e_1, …, e_d) as an optimal design.
• For any positive definite matrix A ≻ 0, the set Z = {z ∈ R^d | ⟨Az, z⟩ ≤ 1} is an ellipsoid, and an optimal design is given by the uniform distribution over the endpoints of its principal semi-axes.
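For a finite set of vectors, a (near-)G-optimal design can be computed with the classical Frank–Wolfe (Kiefer–Wolfowitz) iteration, which maximizes log det of the design covariance; by the Kiefer–Wolfowitz equivalence theorem the optimal value of max_z ⟨z, Σ_q^{-1} z⟩ is d. The sketch below is a bare-bones version of this standard procedure (assuming the vectors span R^d) and is not tuned for numerical robustness.

```python
import numpy as np

def g_optimal_design(Z, num_iters=1000):
    """Frank-Wolfe / Kiefer-Wolfowitz iteration for an (approximately) G-optimal design.

    Z : array (n, d) whose rows span R^d. Returns weights q of length n that
    approximately minimize max_i <z_i, Sigma_q^{-1} z_i> (optimal value: d).
    """
    n, d = Z.shape
    q = np.full(n, 1.0 / n)
    for _ in range(num_iters):
        sigma = Z.T @ (q[:, None] * Z)                 # Sigma_q = sum_i q_i z_i z_i^T
        inv_norms = np.einsum("ij,jk,ik->i", Z, np.linalg.pinv(sigma), Z)
        i = int(np.argmax(inv_norms))                  # worst-covered direction
        g = inv_norms[i]
        step = (g / d - 1.0) / (g - 1.0)               # exact line-search step size
        q = (1.0 - step) * q
        q[i] += step
    return q
```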
Proposition 17 (DEC for Linear Bandits): Consider the linear bandit setting. Let a linear function f̂ and γ > 0 be given, and consider the following distribution p:
• Define ϕ̄(π) = ϕ(π)/√(1 + (γ/d)(f̂(π_{f̂}) − f̂(π))), where π_{f̂} = argmax_{π∈Π} f̂(π).
• Let q̄ ∈ Δ(Π) be the G-optimal design for the set {ϕ̄(π)}_{π∈Π}, and define q = (1/2) q̄ + (1/2) I_{π_{f̂}}.
• Let p be obtained from q by reweighting each decision in proportion to its inverse gap under f̂, with normalizing constant λ.ᵃ
Then p certifies that dec_γ(F, f̂) ≲ d/γ.
ᵃThe normalizing constant λ always exists because we have 1/(2λ) ≤ Σ_π p(π) ≤ 1/λ.
One can show that dec_γ(F) ≳ d/γ for this setting as well, so this is the best bound we can hope for. Combining this result with Proposition 13 and using the averaged exponential weights algorithm for estimation as in (4.18) gives Reg ≲ √(dT log(|F|/δ)).
The first term captures the loss in exploration that we would incur if f̂ were the true reward function. For the remaining term, let θ, θ̂ ∈ Θ be parameters such that f(π) = ⟨θ, ϕ(π)⟩ and f̂(π) = ⟨θ̂, ϕ(π)⟩. Defining Σ_p = E_{π∼p}[ϕ(π)ϕ(π)^⊤], we can bound
    ⟨θ − θ̂, ϕ(π_f)⟩ = ⟨Σ_p^{1/2}(θ − θ̂), Σ_p^{−1/2} ϕ(π_f)⟩
                    ≤ ∥Σ_p^{1/2}(θ − θ̂)∥₂ ∥Σ_p^{−1/2} ϕ(π_f)∥₂
                    ≤ (γ/2) ∥Σ_p^{1/2}(θ − θ̂)∥₂² + (1/(2γ)) ∥Σ_p^{−1/2} ϕ(π_f)∥₂².
Next, observe that
    Σ_p ⪰ (1/2) Σ_π [ q̄(π)/(λ + η(f̂(π_{f̂}) − f̂(π))) ] ϕ(π)ϕ(π)^⊤
        ⪰ (1/2) Σ_π [ q̄(π)/(1 + η(f̂(π_{f̂}) − f̂(π))) ] ϕ(π)ϕ(π)^⊤
        ⪰ (1/2) Σ_π q̄(π) ϕ̄(π)ϕ̄(π)^⊤ =: (1/2) Σ_{q̄}.
This means that we can bound
    ⟨ϕ̄(π_f), Σ_p^{−1} ϕ̄(π_f)⟩ ≤ 2 ⟨ϕ̄(π_f), Σ_{q̄}^{−1} ϕ̄(π_f)⟩ ≤ 2d,
where the last inequality uses that q̄ is the G-optimal design for {ϕ̄(π)}_{π∈Π}. We conclude that
    (IV) ≤ 2d/(2γ) + (2dη/(2γ)) (f̂(π_{f̂}) − f̂(π_f)) − (f̂(π_{f̂}) − f̂(π_f)) ≤ d/γ.
Remark 14: In fact, it can be shown [39] that when Θ = Rd , the exact minimizer of
the DEC for linear bandits is given by
    p = argmax_{p∈Δ(Π)} { E_{π∼p}[f̂(π)] + (1/(4γ)) log det(E_{π∼p}[ϕ(π)ϕ(π)^⊤]) }.
Proposition 18 (DEC for Lipschitz Bandits): Consider the Lipschitz bandit setting, and suppose that there exists d > 0 such that N_ρ(Π, ε) ≤ ε^{−d} for all ε > 0. Let f̂ : Π → [0, 1] and γ ≥ 1 be given and consider the following distribution:
1. Let Π′ be an ε-cover for Π with respect to ρ, of size at most N_ρ(Π, ε).
2. Let p be the result of applying the inverse gap weighting strategy in (3.36) to f̂, restricted to the (finite) decision space Π′.
By setting ε ∝ γ^{−1/(d+1)}, this strategy certifies that
    dec_γ(F, f̂) ≲ γ^{−1/(d+1)}.
Ignoring dependence on EstSq(F, T, δ), this result leads to regret bounds that scale as T^{(d+1)/(d+2)} (after tuning γ in Proposition 13), which cannot be improved.
Proof of Proposition 18. Let f ∈ F be fixed. Let Π′ be the ε-cover for Π. Since f is 1-Lipschitz, for all π ∈ Π there exists a corresponding covering element ι(π) ∈ Π′ such that ρ(π, ι(π)) ≤ ε, and consequently, for any distribution p,
    E_{π∼p}[f(π_f) − f(π)] ≤ E_{π∼p}[f(ι(π_f)) − f(π)] + ε.
At this point, since ι(π_f) ∈ Π′, Proposition 9 ensures that if we choose p using inverse gap weighting over Π′, we have
    E_{π∼p}[f(ι(π_f)) − f(π)] ≤ |Π′|/γ + γ · E_{π∼p}[(f(π) − f̂(π))²].
From our assumption on the growth of N_ρ(Π, ε), |Π′| ≤ ε^{−d}, so the value is at most
    ε + ε^{−d}/γ.
We choose ε ∝ γ^{−1/(d+1)} to balance the terms, leading to the result.
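As a concrete (hypothetical) instantiation of this strategy, the sketch below takes Π = [0, 1] with the absolute-value metric (so the covering assumption holds with d = 1), discretizes at scale ε ∝ γ^{−1/(d+1)}, and applies inverse gap weighting over the resulting finite grid, reusing the `igw` routine sketched earlier in this section.

```python
import numpy as np

def lipschitz_dec_distribution(f_hat, gamma, d=1):
    """Cover-then-IGW strategy in the spirit of Proposition 18 for Pi = [0, 1].

    f_hat : callable giving the current estimate of the mean-reward function.
    Returns the covering points and the IGW distribution over them.
    """
    eps = gamma ** (-1.0 / (d + 1))
    grid = np.arange(0.5 * eps, 1.0, eps)          # eps-cover of [0, 1]
    values = np.array([f_hat(x) for x in grid])
    p = igw(values, 4 * gamma)                     # inverse gap weighting over the cover
    return grid, p
```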
For this setting, whenever F ⊆ (Π → [0, 1]), results of Lattimore [58] imply that
    dec_γ(F) ≲ (d⁴/γ) · polylog(d, γ)    (4.25)
for all γ > 0. ◁
For the function class
    F = { f(π) = −relu(⟨ϕ(π), θ⟩) | θ ∈ Θ ⊂ B_2^d(1) },
(4.25) leads to a √(poly(d) · T) regret bound for E2D. This highlights a case where the eluder dimension is overly pessimistic, since we saw that it grows exponentially for this class.
Compute
    p^t = argmin_{p∈Δ(Π)} max_{f∈F^t} E_{π∼p}[ f(π_f) − f(π) − γ · (f(π) − f̂^t(π))² ].
Select action π^t ∼ p^t.
This strategy is the same as the basic E2D algorithm, except that at each step, we compute
a confidence set F t and modify the minimax problem so that the max player is restricted to
choose f ∈ F t .13 With this change, the distribution pt can be interpreted as the minimizer
for decγ (F t , fbt ).
To analyze this algorithm, we show that as long as f ⋆ ∈ F t for all t, the same per-step
analysis as in Proposition 13 goes through, with F replaced by F t . This allows us to prove
the following result.
¹³Note that compared to the confidence sets used in UCB, a slight difference is that we compute F^t using the estimates f̂^1, …, f̂^T produced by the online regression oracle (this is sometimes referred to as "online-to-confidence-set conversion") as opposed to using ERM; this difference is unimportant, and the latter would work as well.
Proposition 19: For any δ ∈ (0, 1) and γ > 0, if we set β = EstSq (F, T, δ), then E2D
with confidence sets ensures that with probability at least 1 − δ,
    Reg ≤ Σ_{t=1}^T dec_γ(F^t) + γ · EstSq(F, T, δ).    (4.26)
This bound is never worse than the one in Proposition 13, but it can be smaller if the
confidence sets F 1 , . . . , F T shrink quickly. For a proof, see Exercise 9.
Remark 15: In fact, the regret bound in (4.26) can be shown to hold for any sequence of confidence sets F^1, …, F^T, as long as f⋆ ∈ F^t for all t with probability at least 1 − δ; the specific construction we use within the E2D variant above is chosen only for concreteness.
Relation to confidence width and UCB. It turns out that the usual UCB algorithm, which selects π^t = argmax_{π∈Π} f̄^t(π) for f̄^t(π) = max_{f∈F^t} f(π), certifies a bound on dec_γ(F^t) which is never worse than the usual confidence width we use in the UCB analysis.

Proposition 20: The UCB strategy π^t = argmax_{π∈Π} f̄^t(π) certifies that
    dec_0(F^t) ≤ f̄^t(π^t) − f̲^t(π^t),    (4.27)
where f̲^t(π) := min_{f∈F^t} f(π) denotes the lower confidence bound.
Proof of Proposition 20. By choosing π^t = argmax_{π∈Π} f̄^t(π), we have that for any f̂,
    dec_0(F^t, f̂) = inf_{p∈Δ(Π)} sup_{f∈F^t} E_{π∼p}[ max_{π⋆} f(π⋆) − f(π) ]
                  ≤ sup_{f∈F^t} [ max_{π⋆} f(π⋆) − f(π^t) ]
                  ≤ sup_{f∈F^t} [ max_{π⋆} f̄^t(π⋆) − f(π^t) ]
                  = f̄^t(π^t) − f̲^t(π^t),
where the final equality uses the definitions of π^t and f̲^t.
As we saw in the analysis of UCB for multi-armed bandits with Π = {1, . . . , A} (Section 2.3),
the confidence width in (4.27) might be large for a given round t, but by the pigeonhole
argument (Lemma 8), when we sum over all rounds we have
    Σ_{t=1}^T dec_0(F^t) ≤ Σ_{t=1}^T [ f̄^t(π^t) − f̲^t(π^t) ] ≤ Õ(√(AT)).
Hence, even though UCB is not the optimal strategy to minimize the DEC, it can still lead
to upper bounds on regret when the confidence width shrinks sufficiently quickly. Of course,
as examples like the cheating code show, we should not expect this to happen in general.
Interestingly, the bound on the DEC in Proposition 20 holds for γ = 0, which only leads to meaningful bounds on regret because F^1, …, F^T are shrinking. Indeed, Proposition 14 shows that with F = R^A, we have
    dec_γ(F) ≳ A/γ,
so the unrestricted class F has dec_γ(F) → ∞ as γ → 0. By allowing for γ > 0, we can prove the following slightly stronger result, which replaces f̲^t by f̂^t.
Proposition 21: For any γ > 0, the UCB strategy π^t = argmax_{π∈Π} f̄^t(π) certifies that
    dec_γ(F^t, f̂^t) ≤ f̄^t(π^t) − f̂^t(π^t) + 1/(4γ).
Proof of Proposition 21. This is a slight generalization of the proof of Proposition 20. By choosing π^t = argmax_{π∈Π} f̄^t(π), we have
    dec_γ(F^t, f̂^t) = min_{p∈Δ(Π)} max_{f∈F^t} E_{π∼p}[ max_{π⋆} f(π⋆) − f(π) − γ · (f(π) − f̂^t(π))² ]
                    ≤ max_{f∈F^t} [ max_{π⋆} f(π⋆) − f(π^t) − γ · (f(π^t) − f̂^t(π^t))² ]
                    ≤ max_{f∈F^t} [ f̄^t(π^t) − f(π^t) − γ · (f̂^t(π^t) − f(π^t))² ]
                    = max_{f∈F^t} [ f̂^t(π^t) − f(π^t) − γ · (f̂^t(π^t) − f(π^t))² ] + f̄^t(π^t) − f̂^t(π^t),
and the first term in the last line is at most 1/(4γ), since x − γx² ≤ 1/(4γ) for all x.
The dual Decision-Estimation Coefficient has the following Bayesian interpretation. The
adversary selects a prior distribution µ over models in M, and the learner (with knowledge
of the prior) finds a decision distribution p that balances the average tradeoff between regret
and information acquisition when the underlying model is drawn from µ.
Using the minimax theorem (Lemma 41), one can show that the Decision-Estimation
Coefficient and its Bayesian counterpart coincide.
Proposition 22 (Equivalence of primal and dual DEC): Under mild regularity conditions, the minimax (primal) value dec_γ(F, f̂) defined in (4.15) coincides with its maxmin (dual, Bayesian) counterpart from (4.28).    (4.29)
Thus, any bound on the dual DEC immediately yields a bound on the primal DEC. This
perspective is useful because it allows us to bring existing tools for Bayesian bandits and
reinforcement learning to bear on the primal Decision-Estimation Coefficient. As an ex-
ample, we can adapt the posterior sampling/probability matching strategy introduced in
Section 2. When applied to the Bayesian DEC, this approach selects p to be the action distribution induced by sampling f ∼ µ and selecting π_f. Using Lemma 9, one can show that this strategy certifies that
    dec_γ(F) ≲ |Π|/γ
for the multi-armed bandit. In fact, existing analysis techniques for the Bayesian setting
can be viewed as implicitly providing bounds on the dual Decision-Estimation Coefficient
[75, 22, 21, 76, 59, 58]. Notably, the dual DEC is always bounded by a Bayesian complexity
measure known as the information ratio, which is used throughout the literature on Bayesian
bandits and reinforcement learning [40].
Beyond the primal and dual Decision-Estimation Coefficient, there are deeper connec-
tions between the DEC and Bayesian algorithms, including a Bayesian counterpart to the
E2D algorithm itself [40].
This is the same as the contextual bandit protocol in Section 3, except that we allow Π to
be large and potentially continuous. As in that section, we allow the contexts x1 , . . . , xT to
be generated in an arbitrary, potentially adversarial fashion, but assume that
    r^t ∼ M⋆(· | x^t, π^t),
and define f⋆(x, π) = E_{r∼M⋆(·|x,π)}[r]. We assume access to a function class F such that f⋆ ∈ F,
and assume access to an estimation oracle for F that ensures that with probability at least
1 − δ,
    Σ_{t=1}^T E_{π^t∼p^t}[(f̂^t(x^t, π^t) − f⋆(x^t, π^t))²] ≤ EstSq(F, T, δ).
For f ∈ F, we define πf (x) = arg maxπ∈Π f (x, π).
To extend the E2D algorithm to this setting, at each time t we solve the minimax
problem corresponding to the DEC, but condition on the context xt .
Estimation-to-Decisions (E2D) for Contextual Structured Bandits
Input: Exploration parameter γ > 0.
for t = 1, . . . , T do
Observe xt ∈ X .
Obtain f̂^t from the online regression oracle with (x^1, π^1, r^1), …, (x^{t−1}, π^{t−1}, r^{t−1}).
Compute
    p^t = argmin_{p∈Δ(Π)} max_{f∈F} E_{π∼p}[ f(x^t, π_f(x^t)) − f(x^t, π) − γ · (f(x^t, π) − f̂^t(x^t, π))² ].
Select action π^t ∼ p^t.
For x ∈ X , define
F(x, ·) = {f (x, ·) | f ∈ F}
as the projection of F onto x ∈ X . The following result shows that whenever the DEC is
bounded conditionally—that is, whenever it is bounded for F(x, ·) for all x—this strategy
has low regret.
Proposition 23: The E2D algorithm with exploration parameter γ > 0 guarantees that
    Reg ≤ sup_{x∈X} dec_γ(F(x, ·)) · T + γ · EstSq(F, T, δ).    (4.30)
We omit the proof of this result, which is nearly identical to that of Proposition 13. The basic idea is that for each round, once we condition on the context x^t, the DEC allows us to link regret to estimation error in the same fashion as in the non-contextual setting.
We showed in Proposition 14 that the IGW distribution exactly solves the DEC minimax
problem when F = RA . Hence, the SquareCB algorithm in Section 3 is precisely the special
case of Contextual E2D in which F = RA .
Going beyond the finite-action setting, it is simplest to interpret Proposition 23 when
F(x, ·) has the same structure for all contexts. One example is contextual bandits with
linearly structured action spaces. Here, we take
    F = { (x, π) ↦ ⟨g(x), ϕ(x, π)⟩ | g ∈ G },
where ϕ(x, a) ∈ R^d is a fixed feature map and G ⊂ (X → B_2^d(1)) is an arbitrary function class. This setting generalizes the linear contextual bandit problem from Section 3, which corresponds to the case where G is a set of constant functions. We can apply Proposition 17 to conclude that sup_{x∈X} dec_γ(F(x, ·)) ≲ d/γ, so that Proposition 23 gives
    Reg ≲ √(dT · EstSq(F, T, δ)).
Proposition 24: For any γ > 0,
4.7 Exercises
Exercise 8 (Posterior Sampling for Multi-Armed Bandits): Prove that for the standard
multi-armed bandit,
    dec_γ(F) ≲ |Π|/γ,
by using the Posterior Sampling strategy (select p to be the action distribution induced by
sampling f ∼ µ and selecting πf ), and applying the decoupling lemma (Lemma 9). Recall that
here, decγ (F) is the “maxmin” version of the DEC (4.28).
Exercise 10: In this exercise, we will prove Proposition 24 as follows. First, show that the
left-hand side is an upper bound on the right-hand side. For the other direction:
1. Prove that
2. Use the Minimax Theorem (Lemma 41 in Appendix A.3) to conclude Proposition 24.
We now introduce the framework of reinforcement learning, which encompasses a rich set
of dynamic, stateful decision making problems. Consider the task of repeated medical
treatment assignment, depicted in Figure 4. To make the setting more realistic, it is natural
to allow the decision-maker to apply multi-stage strategies rather than simple one-shot decisions such as "prescribe a painkiller." In principle, in the language of structured bandits, nothing
is preventing us from having each decision π t be a complex multi-stage treatment strategy
that, at each stage, acts on the patient’s dynamic state, which evolves as a function of
the treatments at previous stages. As an example, intermediate actions of the type “if
patient’s blood pressure is above X then do Y” can form a decision tree that defines the
complex strategy π t . Methods from the previous lectures provide guarantees for such a
setting, as long as we have a succinct model of expected rewards. What sets RL apart from
structured bandits is the additional information about the intermediate state transitions
and intermediate rewards. This information facilitates credit assignment, the mechanism for recognizing which of the actions caused the overall (composite) decision to be good or bad.
This extra information can reduce what would otherwise be exponential sample complexity
in terms of the number of stages, states, and actions in multi-stage decision making.
This section is structured as follows. We first present the formal reinforcement learning framework and basic principles including Bellman optimality and dynamic programming, which facilitate efficiently computing optimal decisions when the environment
is known. We then consider the case in which the environment is unknown, and give
algorithms for perhaps the simplest reinforcement learning setting, tabular reinforcement
learning, where the state and action spaces are finite. Algorithms for more complex rein-
forcement learning settings are given in Section 5.
An (episodic, finite-horizon) MDP is specified by a tuple
    M = (S, A, {P_h^M}_{h=1}^H, {R_h^M}_{h=1}^H, d_1),
where S is the state space, A is the action space, H is the horizon,
    P_h^M : S × A → Δ(S)
is the transition kernel,
    R_h^M : S × A → Δ(R)
is the reward distribution, and d_1 ∈ Δ(S) is the initial state distribution. We allow the reward distribution and transition kernel to vary across MDPs, but assume for simplicity that the initial state distribution is fixed and known.
For a fixed MDP M , an episode proceeds under the following protocol. At the beginning
of the episode, the learner selects a randomized, non-stationary policy
π = (π1 , . . . , πH ),
where πh : S → ∆(A); we let Πrns for “randomized, non-stationary” denote the set of
all such policies. The episode then evolves through the following process, beginning from
s_1 ∼ d_1. For h = 1, …, H:
• a_h ∼ π_h(s_h).
• r_h ∼ R_h^M(s_h, a_h).
• s_{h+1} ∼ P_h^M(s_h, a_h).
The total expected reward for policy π in MDP M is
    f^M(π) := E^{M,π}[ Σ_{h=1}^H r_h ],
where E^{M,π}[·] denotes expectation under the process above. We define an optimal policy for model M as
    π_M ∈ argmax_{π∈Π_rns} f^M(π).    (5.2)
Value functions. Maximization in (5.2) is a daunting task, since each policy π is a
complex multi-stage object. It is useful to define intermediate “reward-to-go” functions to
start breaking this complex task into smaller sub-tasks. Specifically, for a given model M
and policy π, we define the state-action value function and state value function via
" H # " H #
,π
rh′ | sh = s, ah = a , and VhM ,π (s) = EM ,π
X X
M ,π
QMh (s, a) = E rh′ | sh = s .
h′ =h h′ =h
Online RL. For reinforcement learning, our main focus will be on what is called the
online reinforcement learning problem, in which we interact with an unknown MDP M ⋆ for
T episodes. For each episode t = 1, . . . , T , the learner selects a policy π t ∈ Πrns . The policy
is executed in the MDP M ⋆ , and the learner observes the resulting trajectory
programming can be viewed as solving the problem of credit assignment by breaking down
a complex multi-stage decision (policy) into a sequence of small decisions.
We start by observing that the optimal policy πM in (5.2) may not be uniquely defined.
For instance, if d1 assigns zero probability to some state s1 , the behavior of πM on this state
is immaterial. In what follows, we introduce a fundamental result, Proposition 25, which
guarantees existence of an optimal policy πM = (πM,1 , . . . , πM,H ) that maximizes V1M ,π (s)
over π ∈ Πrns for all states s ∈ S simultaneously (rather than just on average, as in (5.2)).
The fact that such a policy exists may seem magical at first, but it is rather straightforward.
Indeed, if πM,h (s) is defined for all s ∈ S and h = 2, . . . , H, then defining the optimal πM,1 (s)
at any s is a matter of greedily choosing an action that maximizes the sum of the expected
immediate reward and the remaining expected reward under the optimal policy. Indeed,
this observation is Bellman’s principle of optimality, stated more generally as follows [17]:
    V_h^{M,⋆}(s) := max_{π∈Π_rns} V_h^{M,π}(s),  and  Q_h^{M,⋆}(s, a) := max_{π∈Π_rns} Q_h^{M,π}(s, a)    (5.4)
for all s ∈ S, a ∈ A, and h ∈ [H]; we adopt the convention that V_{H+1}^{M,⋆}(s) = Q_{H+1}^{M,⋆}(s, a) = 0. Since these optimal values are separate maximizations for each s, a, h, it is reasonable to ask whether there exists a single policy that maximizes all these value functions simultaneously. Indeed, the following lemma shows that there exists π_M such that for all s, a, h,
    Q_h^{M,⋆}(s, a) = Q_h^{M,π_M}(s, a),  and  V_h^{M,⋆}(s) = V_h^{M,π_M}(s).    (5.5)
Proposition 25 (Bellman Optimality): The optimal value function (5.4) for MDP M can be computed via V_{H+1}^{M,π_M}(s) := 0 and, for each s ∈ S and h = H, …, 1,
    Q_h^{M,π_M}(s, a) = E[ r_h + V_{h+1}^{M,π_M}(s_{h+1}) | s_h = s, a_h = a ],  V_h^{M,π_M}(s) = max_{a∈A} Q_h^{M,π_M}(s, a),    (5.8)
and the optimal policy is given by
    π_{M,h}(s) ∈ argmax_{a∈A} Q_h^{M,π_M}(s, a).    (5.9)
The update in (5.8) is referred to as value iteration (VI). It is useful to introduce a more succinct notation for this update. For an MDP M, define the Bellman operators T_1^M, …, T_H^M via
    [T_h^M Q](s, a) = E_{s_{h+1}∼P_h^M(s,a), r_h∼R_h^M(s,a)}[ r_h(s, a) + max_{a′∈A} Q(s_{h+1}, a′) ]    (5.10)
for any Q : S × A → R. Going forward, we will write the expectation above more succinctly as
    [T_h^M Q](s, a) = E^M[ r_h(s_h, a_h) + max_{a′∈A} Q(s_{h+1}, a′) | s_h = s, a_h = a ].    (5.11)
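The Bellman operator and the value iteration recursion translate directly into code for a tabular MDP. The sketch below computes the optimal Q- and V-functions and the greedy optimal policy by the backward recursion, assuming the transition kernels and (mean) rewards are supplied as arrays; it is a plain implementation of (5.8)–(5.11) rather than anything specific to the algorithms discussed later.

```python
import numpy as np

def value_iteration(P, R):
    """Finite-horizon value iteration for a tabular MDP.

    P : array (H, S, A, S); P[h, s, a] is the next-state distribution at layer h.
    R : array (H, S, A);    R[h, s, a] is the (mean) reward at layer h.
    Returns Q (H, S, A), V (H + 1, S), and the greedy policy pi (H, S).
    """
    H, S, A, _ = P.shape
    Q = np.zeros((H, S, A))
    V = np.zeros((H + 1, S))           # convention: V_{H+1} = 0
    pi = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        # Bellman backup: [T_h Q_{h+1}](s, a) = r_h(s, a) + E_{s'}[max_{a'} Q_{h+1}(s', a')]
        Q[h] = R[h] + P[h] @ V[h + 1]
        V[h] = Q[h].max(axis=1)
        pi[h] = Q[h].argmax(axis=1)
    return Q, V, pi
```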
[Figure: an MDP with states 1, 2, …, H, H + 1 and actions a_g, a_b, with rewards r = 0 and r = 1.]
Given the failure of ε-Greedy for this example, one can ask whether other algorithmic
principles also fail. As we will show now, the principle of optimism succeeds, and an
analogue of the UCB method yields a regret bound that is polynomial in the parameters
|S|, |A|, and H. Before diving into the details, we present a collection of standard tools for
analysis in MDPs, which will find use throughout the remainder of the lecture notes.
Here, one process evolves according to (M, π^h) and the other evolves according to (M, π^{h+1}). The processes only differ in the action taken once the state s_h is reached. In the former, the action π′(s_h) is taken, whereas in the latter it is π(s_h). Hence, (5.15) is equal to
    E_{s_h∼(M,π)} E^{M,π′}[ Q_h^{M,π′}(s_h, π′(s_h)) − Q_h^{M,π′}(s_h, π(s_h)) | s_1 = s ].    (5.16)
In contrast to the performance difference lemma, which relates the values of two policies
under the same MDP, the next result relates the performance of the same policy under two
different MDPs. Specifically, the difference in initial value for two MDPs is decomposed
into a sum of errors between layer-wise value functions.
Hence, for M, M̂ with the same initial state distribution,
    f^{M̂}(π) − f^M(π) = Σ_{h=1}^H E^{M,π}[ Q_h^{M̂,π}(s_h, a_h) − r_h − V_{h+1}^{M̂,π}(s_{h+1}) ].    (5.19)
In addition, for any MDP M and function Q = (Q_1, …, Q_H, Q_{H+1}) with Q_{H+1} ≡ 0, letting π_{Q,h}(s) = argmax_{a∈A} Q_h(s, a), we have
    max_{a∈A} Q_1(s, a) − V_1^{M,π_Q}(s) = Σ_{h=1}^H E^{M,π_Q}[ Q_h(s_h, a_h) − [T_h^M Q_{h+1}](s_h, a_h) | s_1 = s ],    (5.20)
and, hence,
    E_{s_1∼d_1}[ max_{a∈A} Q_1(s_1, a) ] − f^M(π_Q) = Σ_{h=1}^H E^{M,π_Q}[ Q_h(s_h, a_h) − [T_h^M Q_{h+1}](s_h, a_h) ].    (5.21)
Note that for the second part of Lemma 14, Q = (Q_1, …, Q_H) can be any sequence of functions, and need not be a value function corresponding to a particular policy or MDP. It is worth noting that Q gives rise to the greedy policy π_Q, which, in turn, gives rise to Q^{M,π_Q} (the value of π_Q in model M), but it may well be the case that Q^{M,π_Q} ≠ Q.
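Identities like (5.19) are easy to sanity-check numerically. The sketch below evaluates both sides exactly on tabular MDPs by dynamic programming: policy evaluation in M̂ and a forward pass of the state distribution under (M, π). It assumes deterministic mean rewards and a deterministic policy purely for brevity.

```python
import numpy as np

def policy_eval(P, R, pi):
    """Q^{M,pi} and V^{M,pi} for a deterministic policy pi with shape (H, S)."""
    H, S, A, _ = P.shape
    Q = np.zeros((H, S, A))
    V = np.zeros((H + 1, S))
    for h in reversed(range(H)):
        Q[h] = R[h] + P[h] @ V[h + 1]
        V[h] = Q[h][np.arange(S), pi[h]]
    return Q, V

def check_value_difference(P_M, R_M, P_Mhat, R_Mhat, pi, d1):
    """Return (f^{Mhat}(pi) - f^{M}(pi), right-hand side of (5.19)); they should match."""
    H, S, A, _ = P_M.shape
    Q_hat, V_hat = policy_eval(P_Mhat, R_Mhat, pi)   # value functions under Mhat
    _, V_M = policy_eval(P_M, R_M, pi)               # value functions under M
    lhs = d1 @ V_hat[0] - d1 @ V_M[0]
    rhs, d_h = 0.0, d1.copy()                        # d_h = law of s_h under (M, pi)
    for h in range(H):
        idx = np.arange(S)
        P_row = P_M[h, idx, pi[h]]                   # next-state distributions under (M, pi)
        term = Q_hat[h, idx, pi[h]] - R_M[h, idx, pi[h]] - P_row @ V_hat[h + 1]
        rhs += d_h @ term
        d_h = d_h @ P_row                            # roll the state distribution forward
    return lhs, rhs
```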
Proof of Lemma 14. We will prove (5.19), and omit the proof for (5.18), which is similar but more verbose. We have
    Σ_{h=1}^H E^{M,π}[ Q_h^{M̂,π}(s_h, a_h) − r_h − V_{h+1}^{M̂,π}(s_{h+1}) ]
        = Σ_{h=1}^H E^{M,π}[ Q_h^{M̂,π}(s_h, a_h) − V_{h+1}^{M̂,π}(s_{h+1}) ] − E^{M,π}[ Σ_{h=1}^H r_h ]
        = Σ_{h=1}^H E^{M,π}[ Q_h^{M̂,π}(s_h, a_h) − V_{h+1}^{M̂,π}(s_{h+1}) ] − f^M(π).
On the other hand, since V_h^{M̂,π}(s) = E_{a∼π_h(s)}[Q_h^{M̂,π}(s, a)], a telescoping argument yields
    Σ_{h=1}^H E^{M,π}[ Q_h^{M̂,π}(s_h, a_h) − V_{h+1}^{M̂,π}(s_{h+1}) ] = Σ_{h=1}^H E^{M,π}[ V_h^{M̂,π}(s_h) − V_{h+1}^{M̂,π}(s_{h+1}) ]
        = E^{M,π}[ V_1^{M̂,π}(s_1) ] − E^{M,π}[ V_{H+1}^{M̂,π}(s_{H+1}) ]
        = f^{M̂}(π),
where we have used that V_{H+1}^{M̂,π} = 0, and that both MDPs have the same initial state distribution. We prove (5.21) (omitting the proof of (5.20)) using a similar argument. We have
    Σ_{h=1}^H E^{M,π_Q}[ Q_h(s_h, a_h) − r_h − max_{a∈A} Q_{h+1}(s_{h+1}, a) ]
        = Σ_{h=1}^H E^{M,π_Q}[ Q_h(s_h, a_h) − max_{a∈A} Q_{h+1}(s_{h+1}, a) ] − E^{M,π_Q}[ Σ_{h=1}^H r_h ]
        = Σ_{h=1}^H E^{M,π_Q}[ Q_h(s_h, a_h) − max_{a∈A} Q_{h+1}(s_{h+1}, a) ] − f^M(π_Q).
Since a_{h+1} = π_{Q,h+1}(s_{h+1}) = argmax_{a∈A} Q_{h+1}(s_{h+1}, a), we have
    E^{M,π_Q}[ Q_h(s_h, a_h) − max_{a∈A} Q_{h+1}(s_{h+1}, a) ] = E^{M,π_Q}[ Q_h(s_h, a_h) − Q_{h+1}(s_{h+1}, a_{h+1}) ],
Another similar analysis tool for MDPs, the simulation lemma, is deferred to Section 6
(Lemma 23). This result can be proven as a consequence of Lemma 14.
5.5 Optimism
To develop algorithms for regret minimization in unknown MDPs, we turn to the principle
of optimism, which we have seen is successful in tackling multi-armed bandits and linear
bandits (in small dimension). Recall that for bandits, Lemma 7 gave a way to decompose
the regret of optimistic algorithms into width of confidence intervals. What is the analogue
of Lemma 7 for MDPs? Thinking of optimistic estimates at the level of expected rewards for
policies π is unwieldy, and we need to dig into the structure of these multi-stage decisions. In
particular, the approach we employ is to construct a sequence of optimistic value functions
Q1 , . . . , QH which are guaranteed to over-estimate the optimal value function QM ,⋆ . For
multi-armed bandits, implementing optimism amounted to adding “bonuses,” constructed
from past data, to estimates for the reward function. We will construct optimistic value
functions in a similar fashion. Before giving the construction, we introduce a technical
lemma, which quantifies the error in using such optimistic estimates in terms of Bellman
residuals; Bellman residuals measure self-consistency of the optimistic estimates under the
application of the Bellman operator.
Lemma 15 (Error decomposition for optimistic policies): Let Q_1, …, Q_H be a sequence of functions Q_h : S × A → R with the property that for all (s, a),
    Q_h^{M,⋆}(s, a) ≤ Q_h(s, a),    (5.22)
and let π̂ denote the greedy policy given by π̂_h(s) ∈ argmax_{a∈A} Q_h(s, a), with the convention Q_{H+1} ≡ 0. Then
    f^M(π_M) − f^M(π̂) ≤ Σ_{h=1}^H E^{M,π̂}[ (Q_h − T_h^M Q_{h+1})(s_h, a_h) ].    (5.23)
Lemma 15 tells us that closeness of Q_h to the Bellman backup T_h^M Q_{h+1} implies closeness of π̂ to π_M in terms of the value. As a sanity check, if Q_h = Q_h^{M,⋆}, the right-hand side of (5.23) is zero, since Q_h^{M,⋆} = T_h^M Q_{h+1}^{M,⋆}. Crucially, errors do not accumulate too fast as a function of the horizon. This fact should not be taken for granted: in general, if Q is not optimistic, it could have been the case that small changes in Q_h exponentially degrade the quality of the policy π̂.
Another important aspect of the decomposition (5.23) is the on-policy nature of the
terms in the sum. Observe that the law of sh for each of the terms is given by executing
π̂ in model M. The distribution of s_h is often referred to as the roll-in distribution; when this distribution is induced by the policy executed by the algorithm, we may have better control of the error than in the off-policy case when the roll-in distribution is given by π_M or another unknown policy.
Proof of Lemma 15. Let V̄_h(s) := max_{a∈A} Q_h(s, a). Just as in the proof of Lemma 7, the assumption that Q_h is "optimistic" implies that
    Q_h^{M,⋆}(s_h, π_M(s_h)) ≤ Q_h(s_h, π_M(s_h)) ≤ Q_h(s_h, π̂(s_h))
Remark 16: In fact, the proof of Lemma 15 only uses that the initial value Q1 is opti-
mistic. However, to construct a value function with this property, the algorithms we con-
sider will proceed by backwards induction, producing optimistic estimates Q1 , . . . , QH
in the process.
This setting is the reinforcement learning analogue of unstructured multi-armed bandits: we assume no structure across states and actions, but require that the state and action spaces are small. The regret bounds we present will depend polynomially on S = |S| and A = |A|, as well as the horizon H.
Preliminaries. For simplicity, we assume that the reward function is known to the
learner, so that only the transition probabilities are unknown. This does not change the
difficulty of the problem in a meaningful way, but allows us to keep notation light.
Assumption 6: Rewards are deterministic, bounded, and known to the learner: R_h^M(s, a) = δ_{r_h(s,a)} for known r_h : S × A → [0, 1], for all M. In addition, assume for simplicity that V_1^{M,⋆}(s) ∈ [0, 1] for any s ∈ S.
For each episode t, layer h, and pair (s, a), let n_h^t(s, a) and n_h^t(s, a, s′) denote the empirical state-action and state-action-next-state frequencies, that is, the number of times the pair (s, a) (respectively, the triple (s, a, s′)) has been encountered at layer h prior to episode t. We can estimate the transition probabilities via
    P̂_h^t(s′ | s, a) = n_h^t(s, a, s′) / n_h^t(s, a).    (5.25)
The UCB-VI algorithm. The following algorithm, UCB-VI (“Upper Confidence Bound
Value Iteration”) [16], combines the notion of optimism with dynamic programming.
UCB-VI
for t = 1, . . . , T do
    Let V_{H+1}^t ≡ 0.
    for h = H, …, 1 do
        Update n_h^t(s, a), n_h^t(s, a, s′), and b_{h,δ}^t(s, a), for all (s, a) ∈ S × A.
            // b_{h,δ}^t(s, a) is a bonus computed in (5.27).
        Compute:
            Q_h^t(s, a) = { r_h(s, a) + E_{s′∼P̂_h^t(·|s,a)}[V_{h+1}^t(s′)] + b_{h,δ}^t(s, a) } ∧ 1,    (5.26)
            V_h^t(s) = max_{a∈A} Q_h^t(s, a).
    Execute the greedy policy π̂^t given by π̂_h^t(s) ∈ argmax_{a∈A} Q_h^t(s, a), and observe the resulting trajectory.
The UCB-VI algorithm will be analyzed using Lemma 15. In constructing functions Qh , we
will need to satisfy two goals: (1) ensure that with high probability (5.22) is satisfied, i.e.
Qh s are optimistic; and (2) that Qh s are “self-consistent,” in the sense that the Bellman
residuals in (5.23) are small. The second requirement already suggests that we should define
Qh approximately as a Bellman backup ThM Qh+1 , going backwards for h = H + 1, . . . , 1
as in dynamic programming, while ensuring the first requirement. In addition to these
considerations, we will have to use a surrogate for the Bellman operator ThM , since the model
89
M is not known. This is achieved by estimating M using empirical transition frequencies.
Putting these ideas together gives the update in (5.26). We apply the principle of value
iteration, except that
1. For each episode t, we augment the rewards rh (s, a) with a “bonus” bth,δ (s, a) designed
to ensure optimism.
2. The Bellman operator is approximated using the estimated transition probabilities in
(5.25).
The bonus functions play precisely the same role as the width of the confidence interval in
(2.19): these bonuses ensure that (5.22) holds with high probability, as we will show below
in Lemma 16.
The following theorem shows that with an appropriate choice of bonus, this algorithm
achieves a polynomial regret bound.
90
for bonus functions bh,δ : S × A → R to be chosen later. Henceforth, we follow the usual
notation that for functions f, g over the same domain, f ≤ g indicates pointwise inequality
over the domain.
The first lemma we present shows that as long as the bonuses bh,δ are large enough to
bound the error between the estimated transition probabilities and true transition proba-
bilities, the functions Q1 , . . . , QH constructed above will be optimistic.
Lemma 16: Suppose we have estimates Pbh (· | s, a) h,s,a
and a function bh,δ : S ×A →
R with the property that for all s ∈ §, a ∈ A,
Pbh (s′ | s, a)VhM ,⋆ (s′ ) − PhM (s′ | s, a)VhM ,⋆ (s′ ) ≤ bh,δ (s, a).
X X
(5.30)
s′ s′
Proof of Lemma 16. The proof proceeds by backward induction on the statement
V h ≥ VhM ,⋆
This, in turn, implies that V h (s) = maxa Qh (s, a) ≥ maxa Qh (s, a) = VhM ,⋆ (s), concluding
M ,⋆
X X
max Pbh (s′ | s, a)V (s′ ) − PhM (s′ | s, a)V (s′ ) ≤ b′h,δ (s, a) (5.32)
V ∈{0,1}S
s′ s′
91
for Qh , V h defined in (5.29).
Proof of Lemma 17. That Qh − ThM Qh+1 ≤ 1 is immediate. To prove the main result,
observe that
n o
c c
Qh − ThM Qh+1 = ThM Qh+1 + bh,δ ∧ 1 − ThM Qh+1 ≤ (ThM − ThM )Qh+1 + bh,δ (5.34)
Since the maximum is achieved at a vertex of [0, 1]S , the statement follows.
Lemma 18: Let Pbht h∈[H],t∈[T ]
be defined as in (5.25). Then with probability at least
1 − δ, the functions
s s
t log(2SAHT /δ) ′t S log(2SAHT /δ)
bh,δ (s, a) = 2 , and bh,δ (s, a) = 8
nth (s, a) nth (s, a)
satisfy the assumptions of Lemma 16 and Lemma 17, respectively, for all s ∈ S, a ∈ A,
h ∈ [H], and t ∈ [T ] simultaneously.
Proof of Theorem 1. Putting everything together, we can now prove Theorem 1. Under
the event in Lemma 18, the functions Qt1 , . . . , QtH are optimistic, which means that the
conditions of Lemma 15 hold, and the instantaneous regret on round t (conditionally on
s1 ∼ d1 ) is at most
H H
t t
X X
EM ,bπ (Qth − ThM Qth+1 )(sth , π bht (sth )) + b′h,δ (sth , π
EM ,bπ (bh,δ (sth , π
bht (sth )) | s1 = s ≤ bht (sth ))) ∧ 1 ,
h=1 h=1
where the second inequality invokes Lemma 17. Summing over t = 1, . . . , T , and applying
the Azuma-Hoeffding inequality, we have that with probability at least 1 − δ, the regret of
UCB-VI is bounded by
T X
H
t
X
bht (sth )) + b′h,δ (sth , π
EM ,bπ (bh,δ (sth , π
bht (sth ))) ∧ 1
t=1 h=1
XT X H
bht (sth )) + b′h,δ (sth , π
p
≲ (bh,δ (sth , π bht (sth ))) ∧ 1 + HT log(1/δ).
t=1 h=1
92
Using the bonus definition in (5.27), the bonus term above is bounded by
s
T X H T XH
X S log(2SAHT /δ) p X 1
∧ 1 ≤ S log(2SAHT /δ) p t t t t ∧ 1 (5.37)
t t
bh (sh ))
nh (sh , π t t
nh (sh , π
bh (sh ))
t=1 h=1 t=1 h=1
So far, we have covered three general frameworks for interaction decision making: The
contextual bandit problem, the structured bandit problem, and the episodic reinforcement
learning problem; all of these frameworks generalize the classical multi-armed bandit prob-
lem in different directions. In the context of structured bandits, we introduced a complexity
measure called the Decision-Estimation Coefficient (DEC), which gave a generic approach
to algorithm design, and allowed us to reduce the problem of interactive decision making
to that of supervised online estimation. In this section, we will build on this develop-
ment on two fronts: First, we will introduce a unified framework for decision making,
which subsumes all of the frameworks we have covered so far. Then, we will show that
i) the Decision-Estimation Coefficient and its associated meta-algorithm (E2D) extend to
the general decision making framework, and ii) boundedness of the DEC is not just suf-
ficient, but actually necessary for low regret, and thus constitutes a fundamental limit.
As an application of the general tools we introduce, we will show how to use the (general-
ized) Decision-Estimation Coefficient to solve the problem of tabular reinforcement learning
(Section 6.6), offering an alternative to the UCB-VI method we introduced in Section 5.
6.1 Setting
For the remainder of the course, we will focus on a framework called Decision Making with
Structured Observations (DMSO), which subsumes all of the decision making frameworks
we have encountered so far. The protocol proceeds in T rounds, where for each round
t = 1, . . . , T :
93
Assumption 7 (Stochastic Rewards and Observations): Rewards and observa-
tions are generated independently via
(rt , ot ) ∼ M ⋆ (· | π t ), (6.1)
To facilitate the use of learning and function approximation, we assume the learner has
access to a model class M that contains the model M ⋆ . Depending on the problem do-
main, M might consist of linear models, neural networks, random forests, or other complex
function approximators; this generalizes the role of the reward function class F used in
contextual/structured bandits. We make the following standard realizability assumption,
which asserts that M is flexible enough to express the true model.
For a model M ∈ M, let EM ,π [·] denote the expectation under (r, o) ∼ M (π). Further,
following the notation in Section 5, let
f M (π) := EM ,π [r]
denote the optimal decision with maximal expected reward. Finally, define
FM := {f M | M ∈ M} (6.2)
as the induced class of mean reward functions. We evaluate the learner’s performance in
terms of regret to the optimal decision for M ⋆ :
T
X ⋆ ⋆
Reg := Eπt ∼pt f M (πM ⋆ ) − f M (π t ) , (6.3)
t=1
where p ∈ ∆(Π) is the learner’s distribution over decisions at round t. Going forward, we
t
⋆
abbreviate f ⋆ = f M and π ⋆ = πM ⋆ ,.
The DMSO framework is general enough to capture most online decision making prob-
lems. Let us first see how it subsumes the structured bandit and contextual bandit problems.
Example 6.1 (Structured bandits). When there are no observations (i.e., O = {∅}), the
DMSO framework is equivalent to structured bandits studied earlier in Section 4. Therein,
we defined a structured bandit instance by specifying a class F of mean reward functions
and a general class of reward distributions, such as sub-Gaussian or bounded. In the DMSO
framework, we may equivalently start with a set of models M and let FM be the induced
class (6.2). By changing the class F, this encompasses all of the concrete examples of
structured bandit problems we studied in Section 4, including linear bandits, nonparametric
bandits, and concave/convex bandits.
◁
94
Example 6.2 (Contextual bandits). The DMSO framework readily captures contextual
bandits (Section 3) with stochastic contexts (see Assumption 2). To make this precise,
we will slightly abuse the notation and think of π t as functions mapping the context xt
to an action in Π = [A]. To this end, on round t, the decision-maker selects a mapping
π t : X → [A] from contexts to actions, and the context ot = xt is observed at the end of the
round. This is equivalent to first observing xt and selecting π t (xt ) ∈ [A].
Formally, let O = X be the space of contexts, Π = [A] be the set of actions, and
Π : X → [A] be the space of decisions. The distribution (r, x) ∼ M (π) then has the
following structure: x ∼ DM and r ∼ RM (·|x, π(x)) for some context distribution DM and
reward distribution RM . In other words, the distribution DM for the context x (treated as
an observation) is part of the model M .
We mention in passing that the DMSO framework also naturally extends to the case
when contexts are adversarial rather than i.i.d., as in Section 4.5; see Foster et al. [40]. ◁
Example 6.3 (Online reinforcement learning). The online reinforcement learning frame-
work we introduced
P in t Section t5 immediately falls into the DMSO framework by taking
Π = Πrns , rt = H r
h=1 h , and o = τ t
. While we have only covered tabular reinforcement
learning so far, the literature on online reinforcement learning contains algorithms and
sample complexity bounds for a rich and extensive collection of different MDP structures
(e.g., Dean et al. [30], Yang and Wang [87], Jin et al. [48], Modi et al. [65], Ayoub et al.
[15], Krishnamurthy et al. [55], Du et al. [33], Li [62], Dong et al. [31]). All of these settings
correspond to specific choices for the model class M in the DMSO framework, and we will
cover this topic in detail in Section 7. ◁
We adopt the DMSO framework because it gives simple, yet unified approach to describ-
ing and understanding what is—at first glance—a very general and seemingly complicated
problem. Other examples that are covered by the DMSO framework include:
• Partial monitoring⋆
95
dQ
whenever P ≪ Q. More generally, defining p = dP
dν and q = dν for a common dominating
measure ν, we have
p
Z
Df (P ∥ Q) := qf dν + P(q = 0) · f ′ (∞), (6.5)
q>0 q
where f ′ (∞) := limx→0+ xf (1/x).
We will make use of the following f -divergences, all of which have unique properties
that make them useful in different contexts.
• Choosing f (t) = 12 |t − 1| gives the total variation (TV) distance
1 dP dQ
Z
DTV (P, Q) = − dν,
2 dν dν
which can also be written as
DTV (P, Q) = sup |P(A) − Q(A)|.
A∈F
√
• Choosing f (t) = (1 − t)2 gives squared Hellinger distance
Z r r !2
2 dP dQ
DH (P, Q) = − dν.
dν dν
Note that for TV distance and Hellinger distance, we use the notation D(·, ·) rather than
D(· ∥ ·) to emphasize that the divergence is symmetric. Other standard examples include
the Neyman-Pearson χ2 -divergence.
2 (P, Q) = 2, and D
It is known that DTV (P, Q) = 1 if and only if DH TV (P, Q) = 0 if and
2 2
only if DH (P, Q) = 0 (more generally, DH (P, Q) ≤ 2DTV (P, Q)). Moreover, they induce
same topology, i.e. a sequence converges in one distance if and only if it converges the
other. KL divergence cannot be bounded by TV distance or Hellinger distance in general,
but the following lemma shows that it is possible to relate these quantities if the density
ratios under consideration are bounded.
Lemma 20: Let P and Q be probability distributions over a measurable space (Ω, F ).
P(F )
If supF ∈F Q(F ) ≤ V , then
2
DKL (P ∥ Q) ≤ (2 + log(V ))DH (P, Q). (6.7)
96
Other properties we will use include:
• Chain rule and subadditivity properties for KL and Hellinger divergence (see Lemma
22).
We further define
The DEC in (6.9) should look familiar to the definition we used for structured bandits in
Section 4 (Eq. (4.15)). The main difference is that instead of being defined over a class F
of reward functions, the general DEC is defined over the class of models M, and the notion
of estimation error/information gain has changed to account for this. In particular, rather
97
than measuring information gain via the distance between mean reward functions, we now
consider the information gain
h i
2
Eπ∼p DH M (π), M
c(π) ,
which measures the distance between the distributions over rewards and observations under
the models M and M c (for the learner’s decision π). This is a stronger notion of distance since
i) it incorporates observations (e.g., trajectories for reinforcement learning), and ii) even for
bandit problems, we consider distance between distributions as opposed to distance between
means; the latter feature means that this notion of information gain can capture fine-grained
properties of the models under consideration, such as noise in the reward distribution.
The claim will then follow from Proposition 14. To prove this, first note that
c(π) = 1 (f M (π) − f M
2 c
DH c(π) ≤ DKL M (π) ∥ M
M (π), M (π))2 . (6.12)
2
In the other direction, one can directly compute
2
1 M c
M 2
DH M (π), M (π) = 1 − exp − (f (π) − f (π))
c
8
and using that 1 − exp{−x} ≥ (1 − e−1 )x for x ∈ [0, 1], we establish
2 c
DH c(π) ≥ c · (f M (π) − f M
M (π), M (π))2
1−1/e
for c = 8 . ◁
In fact, one can show that the general DEC in (6.9) coincides with the basic squared
error version from Section 4 for general structured bandit problems, not just multi-armed
bandits; see Proposition 41.
Let us next consider a twist on the bandit problem that is more information-theoretic in
nature, and highlights the need to work with information-theoretic divergences if we want
to handle general decision making problems.
98
Example 6.5 (Bandits with structured noise). Let Π = [A], R = R, O = {∅}. We define
n o
MMAB-SN = {M1 , . . . , MA } ∪ M
c
where Mi (π) := N (1/2, 1) for π ̸= i and Mi (π) := Ber(3/4) for π = i; we further define
c(π) := N (1/2, 1) for all π ∈ Π. Before proceeding with the calculations, observe that we
M
can solve the general decision making problem when the underlying model is M ⋆ ∈ M with
a simple algorithm. It is sufficient to select every action in [A] only once: all suboptimal
actions have Bernoulli rewards and give r ∈ {0, 1} almost surely, while the optimal action
has Gaussian rewards, and gives r ∈ / {0, 1} almost surely. Thus, if we select an action and
observe a reward r ∈/ {0, 1}, we know that we have identified the optimal action.
The valuable information contained in the reward distribution is reflected in the Hellinger
divergence, which attains its maximum value when comparing a continuous distribution to
a discrete one:
2 c(π) = 2I {π = i} .
DH Mi (π), M
and
c) ≲ (1 − 1/A)(3/4 − 1/2) − γ 2 ≲ I {γ ≤ A/4} .
decγ (M, M
A
This leads to an upper bound
c) ≲ I {γ ≤ A/4}
decγ (MMAB-SN , M (6.13)
Example 6.6 (Bandits with Full Information). Consider a “full-information” learning set-
ting. We have Π = [A] and R = [0, 1], and for a given decision π we observe a reward
r as in the standard multi-armed bandit, but also receive an observation o = (r(π ′ ))π′ ∈[A]
consisting of (counterfactual) rewards for every action.
For a given model M , let MR (π) denote the distribution over the reward r for π, and let
MO (π) denote the distribution of o. Then for any decision π, since all rewards are observed,
the data processing inequality implies that for all M, M c ∈ M and π ′ ∈ Π,
2 c(π) ≥ D2 MO (π), M
DH M (π), M H
cO (π) (6.14)
= DH 2
MO (π ′ ), M
cO (π ′ ) ≥ D2 MR (π ′ ), M
H
cR (π ′ ) . (6.15)
c ∈ M,
Using this property, we will show that for any M
c) ≤ 1
decγ (M, M . (6.16)
γ
99
Comparing to the finite-armed bandit, we see that the DEC for this example is independent
of A, which reflects the extra information contained in the observation o.
To prove (6.16), for a given model Mc ∈ M we choose p = Iπ (i.e. the decision maker
c
M
selects πM
c deterministically), and bound Eπ∼p [f
M
(πM ) − f M (π)] by
c c
f M (πM ) − f M (πM
c) ≤ f
M
(πM ) − f M (πM
c) + f
M
c) − f
(πM M
(πM )
c
≤2· max |f M (π) − f M (π)|
c}
π∈{πM ,πM
≤2· max DH MR (π), M
cR (π) .
c}
π∈{πM ,πM
We then use the AM-GM inequality, which implies that for any γ > 0,
cR (π) + 1
2 cR (π) ≲ γ · 2
max DH MR (π), M max DH MR (π), M
c}
π∈{πM ,πM c}
π∈{πM ,πM γ
1
2
≤ γ · DH M (πMc ), M (πM
c c) + ,
γ
where the final inequality uses (6.14). This certifies that for all M ∈ M, the choice for p
above satisfies
c(π) ≲ 1 ,
h i
2
Eπ∼p f M (πM ) − f M (π) − γ · DH M (π), M
γ
c) ≲ 1 .
so we have decγ (M, M ◁
γ
In what follows, we will show that the different behavior for the DEC for these examples
reflects the fact that the optimal regret is fundamentally different.
Estimation-to-Decisions (E2D), the meta-algorithm based on the DEC that we gave for
structured bandits in Section 4, readily extends to general decision making [40, 43]. The
general version of the meta-algorithm is given above. Compared to structured bandits,
the main difference is that rather than trying to estimate the reward function f ⋆ , we now
estimate the underlying model M ⋆ . To do so, we appeal once again to the notion of an
online estimation oracle, but this time for model estimation.
At each timestep t, the algorithm calls invokes an online estimation oracle to obtain an
estimate Mct for M ⋆ using the data Ht−1 = (π 1 , r1 , o1 ), . . . , (π t−1 , rt−1 , ot−1 ) observed so far.
100
Using this estimate, E2D proceeds by computing the distribution pt that achieves the value
decγ (M, M
ct ) for the Decision-Estimation Coefficient. That is, we set
2
t M M t
p = arg min sup Eπ∼p f (πM ) − f (π) − γ · DH M (π), M (π) .c (6.17)
p∈∆(Π) M ∈M
E2D then samples the decision π t from this distribution and moves on to the next round.
Like structured bandits, one can show that by running Estimation-to-Decisions in the
general decision making setting, the regret for decision making is bounded in terms of the
DEC and a notion of estimation error for the estimation oracle. The main difference is that
for general decision making, the notion of estimation error we need to control is the sum of
Hellinger distances between the estimates from the supervised estimation oracle M ⋆ , which
we define via
XT h i
2
EstH := Eπt ∼pt DH M ⋆ (π t ), M
ct (π t ) . (6.18)
t=1
With this definition, we can show that E2D enjoys the following bound on regret, analogous
to Proposition 13.
Proposition 26 (Foster et al. [40]): E2D with exploration parameter γ > 0 guar-
antees that
Reg ≤ sup decγ (M, M c) · T + γ · EstH , (6.19)
c∈M
M c
Note that we can optimize over the parameter γ in the result above, which yields
Reg ≤ inf sup decγ (M, M )·T +γ ·EstH ≤ 2· inf max sup decγ (M, M )·T, γ ·EstH .
c c
γ>0 c∈M γ>0 c∈M
M c M c
We will show in the sequel that for any finite class M, the averaged exponential weights
algorithm with the logarithmic loss achieves EstH ≲ log(|M|/δ) with probability at least
1 − δ. For this algorithm, and most others we will consider, one can take M c = co(M). In
fact, one can show (via an analogue of Proposition 24) that for any M c, even if Mc∈ / co(M),
we have decγ (M, M ) ≤ supM
c c∈co(M) deccγ (M, M ) ≤ deccγ (M) for any absolute constant
c
c > 0. This means we can restrict our attention to the convex hull without loss of generality.
Putting these facts together, we see that for any finite class, it is possible to achieve
101
For each t, since M ⋆ ∈ M, we have
⋆ ⋆
h i
2
M ⋆ (π t ), M
Eπt ∼pt f M (πM ⋆ ) − f M (π t ) − γ · Eπt ∼pt DH ct (π t )
h i
2
≤ sup Eπt ∼pt [f M (πM ) − f M (π t )] − γ · Eπt ∼pt DH M (π t ), Mct (π t )
M ∈M
h i
2
= inf sup Eπ∼p f M (πM ) − f M (π) − γ · DH M (π), M ct (π)
p∈∆(Π) M ∈M
ct ).
= decγ (M, M (6.21)
Examples for the upper bound. We now revisit the examples from Section 6.3 and
use E2D and Proposition 26 to derive regret bounds for them.
Example 6.4 (cont’d). For the Gaussian bandit problem from Example 6.4, plugging the
bound decγ (MMAB-G ) ≲ A/γ into Proposition 26 yields
AT
Reg ≲ + γ · EstH ,
γ
p
Choosing γ = AT /EstH balances the terms above and gives
p
Reg ≲ AT · EstH .
Example 6.5 (cont’d). For the bandit-type problem with structured noise from Exam-
ple 6.5, the bound decγ (MMAB-SN ) ≲ I {γ ≤ A/4} yields
Reg ≲ A · EstH .
102
Instead of directly performing estimation with respect to Hellinger distance, the simplest
way to develop conditional density estimation algorithms is to work with the logarithmic
loss. Given a tuple (π t , rt , ot ), define the logarithmic loss for a model M as
t 1
ℓlog (M ) = log , (6.22)
mM (rt , ot | π t )
where we define mM (·, · | π) as the conditional density for (r, o) under M . We define regret
under the logarithmic loss as:
T
X T
X
t ct ) − inf
RegKL = ℓlog (M ℓtlog (M ). (6.23)
M ∈M
t=1 t=1
The following result shows that a bound on the log-loss regret immediately yields a bound
on the Hellinger estimation error.
Lemma 21: For any online estimation algorithm, whenever Assumption 8 holds, we
have " T #
X
⋆ t t t
E[RegKL ] ≥ E DKL M (π ) ∥ M (π ) ,
c (6.24)
t=1
so that
E[EstH ] ≤ E[RegKL ]. (6.25)
Furthermore, for any δ ∈ (0, 1), with probability at least 1 − δ,
This result is desirable because regret minimization with the logarithmic loss is a well-
studied problem in online learning. Efficient algorithms are known for model classes of
interest [27, 84, 51, 44, 67, 70, 38, 63], and this is complemented by theory which provides
minimax rates for generic model classes [78, 66, 24, 18]. One example we have already seen
(Section 1) is the averaged exponential weights method, which guarantees
RegKL ≤ log|M|
for finite classes M. Another example is that for linear models, where (i.e., mM (r, o | π) =
⟨ϕ(r, o, π), θ⟩ for a fixed feature map in ϕ ∈ Rd ), algorithms with RegKL = O(d log(T )) are
known [72, 78]. All of these algorithms satisfy M c = co(M). We refer the reader to Chapter
9 of [25] for further examples and discussion.
While (6.25) is straightforward, (6.26) is rather remarkable, as the remainder term does
not scale with T . Indeed, a naive attempt at applying concentration inequalities to control
the deviations of the random quantities EstH and RegKL would require boundedness of the
loss function, which is problematic because the logarithmic loss can be unbounded. The
proof exploits unique properties of the moment generating function for the log loss.
103
question of optimality: That is, for a given class of models M, what is the best regret that
can be achieved by any algorithm? We will show that in addition to upper bounds, the
Decision-Estimation Coefficient actually leads to lower bounds on the optimal regret.
Background: Minimax regret. What does it mean to say that an algorithm is optimal
for a model class M? There are many notions of optimality, but in this course we will focus
on minimax optimality, which is one of the most basic and well-studied notions.
For a model class M, we define the minimax regret via14
⋆ ,p
M(M, T ) = inf sup EM [Reg(T )], (6.27)
p1 ,...,pT M ⋆ ∈M
where pt = pt (· | Ht−1 ) is the algorithm’s strategy for step t (a function of the history Ht−1 ),
and where we write regret as Reg(T ) to make the dependence on T explicit. Intuitively,
minimax regret asks what is the best any algorithm can perform on a worst-case model
(in M) possibly chosen with the algorithm in mind. Another way to say this is: For any
algorithm, there exists a model in M for which E[Reg(T )] ≥ M(M, T ). We will say that
an algorithm is minimax optimal if it achieves (6.27) up to absolute constants that do not
depend on M or T .
with
This is similar to the definition for the DEC we have been working with so far— which
we will call the offset DEC going forward—-except that it places a hard constraint on the
information gain as opposed to subtracting the information gain. Both quantities have
a similar interpretation, since subtracting the information gain implicitly biases the max
player towards model where the gain is small. Indeed, the offset DEC can be thought of as
14 ⋆
Here, for any algorithm p = p1 , . . . , pT , EM ,p denotes the expectation with respect to the observation
process (rt , ot ) ∼ M ⋆ (π t ) and any randomization used by the algorithm, when M ⋆ is the true model.
15 c
We adopt the hconvention
that the
i value of decε (M, M ) is zero is there exists p such that the set of
c
2 2
M ∈ M with Eπ∼p DH M (π), M (π) ≤ ε is empty.
c
104
a Lagrangian relaxation of the constrained DEC, and always upper bounds it via
n h i o
deccε (M, M
c) = inf sup Eπ∼p [f M (πM ) − f M (π)] | Eπ∼p D2 M (π), M
H
c(π) ≤ ε2
p∈∆(Π) M ∈M
n h i o
2 c(π) − ε2
= inf sup inf Eπ∼p [f M (πM ) − f M (π)] − γ Eπ∼p DH M (π), M ∨0
p∈∆(Π) M ∈M γ≥0
n h i o
2 c(π) − ε2
≤ inf inf sup Eπ∼p [f M (πM ) − f M (π)] − γ Eπ∼p DH M (π), M ∨0
γ≥0 p∈∆(Π) M ∈M
n o
= inf decγ (M, Mc) + γε2 ∨ 0.
γ≥0
This inequality is lossy, but cannot be improved in general. That is, there some classes
for which the constrained DEC is meaningfully smaller than the offset DEC. However, it is
possible to relate the two quantities if we restrict to a “localized” sub-class of models that
are not “too far” from the reference model M c.
For many “well-behaved” classes one can consider (e.g., multi-armed bandits and linear
bandits), one has decγ (Mα(ε,γ) (M c), Mc) ≈ decγ (M, Mc) whenever decγ (M, M c) ≈ γε2 (that
is, localization does not change the complexity), so that lower bounds in terms of the
constrained DEC immediately imply lower bounds in terms of the offset DEC. In general,
this is not the case, and it turns out that it is possible to obtain tighter upper bounds that
depend on the constrained DEC by using a refined version of the E2D algorithm. We refer
to Foster et al. [43] for details and further background on the constrained DEC.
105
is satisfied, it holds that for any algorithm, there exists a model M ∈ M for which
Proposition 28 shows that for any algorithm and model class M, the optimal regret
must scale with the constrained DEC in the worst-case. As a concrete example, √ we will
c
show in the sequel that for the multi-armed bandit with A actions, decε (M) ∝ ε A, which
leads to √
E[Reg] ≳ AT .
We mention in passing that by combining Proposition 28 with Proposition 27, we obtain
the following lower bound based on the (localized) offset DEC.
Corollary 1: Fix T ∈ N. Then for any algorithm, there exists a model M ∈ M for
which
E[Reg(T )] ≳ sup
√
sup decγ (Mα(T,γ) (M
c), M
c), (6.32)
γ≳ T M
c∈co(M)
This matches
q the lower boud in Propositionq 28 upper to a difference in the radius: we have
εT ∝ T for the lower bound, and εT ∝ log(|M|/δ)
1
T for the upper bound. This implies
that for any class where log|M| < ∞, the constrained DEC is necessary and sufficient for
low regret. By the discussion in the prequel, a similar conclusion holds for the offset DEC
(albeit, with a polynomial loss in rate). The interpretation of the log|M| gap between
the upper and lower bounds is that the DEC is capturing the complexity of exploring the
decision space, but the statistical capacity required to estimate the underlying model is a
separate issue which is not captured.
106
Anatomy of a lower bound. How should one go about proving a lower bound on the
minimax regret in (6.27)? We will follow a general recipe which can be found throughout
statistics, information theory, and decision making [32, 89, 83]. The approach will be to
find a pair of models M and M c that satisfy the following properties:
1. Any algorithm with regret much smaller than the DEC must query substantially
different decisions in Π depending on whether the underlying model is M or M c.
Intuitively, this means that any algorithm that achieves low regret must be able to
distinguish between the two models.
2. M and M c are “close” in a statistical sense (typically via total variation distance or
another f -divergence), which implies via change-of-measure arguments that the deci-
sions played by any algorithm which interacts with the models only via observations
(in our case, (π t , rt , ot )) will be similar for both models. In other words, the models
are difficult to distinguish.
One then concludes that the algorithm must have large regret on either M or M c.
To make this approach concrete, classical results in statistical estimation and supervised
learning choose the models M and M c in a way that is oblivious to the algorithm under
consideration [32, 89, 83]. However, due to the interactive nature of the decision making
problem, the lower bound proof we present now will choose the models in an adaptive
fashion.
Simplifications. Rather than proving the full result in Proposition 28, we will make the
following simplifying assumptions:
• Rather than proving a lower bound that scales with deccε (M) = supM c
c∈co(M) decε (M∪
{M
c}, M
c), we will prove a weaker lower bound that scales with sup c decc (M, M
c).
M ∈M ε
We refer to Foster et al. [43] for a full proof that removes these restrictions.
Preliminaries. We use the following technical lemma for the proof of Proposition 28.
107
fM
<latexit sha1_base64="H21olRoLeBWYGJ8QwsrdxBF93VI=">AAAB6nicdVDLSsNAFJ3UV62vVHe6GSyCq5CEmra7ghs3QkX7gDaWyXTSDp08mJkIJfQTunGhiFs/wN9w684Pce+kVVDRAxcO59zLPfd6MaNCmuablltaXlldy68XNja3tnf04m5LRAnHpIkjFvGOhwRhNCRNSSUjnZgTFHiMtL3xaea3bwgXNAqv5CQmboCGIfUpRlJJl/71eV8vmYZpVuxaDZqGbZcdx1Gk5pSrJxVoKStDqV6c6e/7L8+Nvv7aG0Q4CUgoMUNCdC0zlm6KuKSYkWmhlwgSIzxGQ9JVNEQBEW46jzqFR0oZQD/iqkIJ5+r3iRQFQkwCT3UGSI7Eby8T//K6ifSrbkrDOJEkxItFfsKgjGB2NxxQTrBkE0UQ5lRlhXiEOMJSfaegnvB1KfyftGzDcozyhVWqO2CBPDgAh+AYWKAC6uAMNEATYDAEM3AH7jWm3WoP2uOiNad9zuyBH9CePgBBmZF1</latexit>
<latexit sha1_base64="LwHbksNaMaAo6rKVfKDapB5JWLI=">AAAB+HicdVDLSsNAFJ3UV62PprrTzWARXIWk1LTdFdy4ESrYB7S1TCaTdujkwcxEqaGf4Be4caGIO/Ev3LrzQ9w7aRVU9MCFwzn3cu89TsSokKb5pmUWFpeWV7KrubX1jc28XthqiTDmmDRxyELecZAgjAakKalkpBNxgnyHkbYzPkr99gXhgobBmZxEpO+jYUA9ipFU0kDPe+dJ75K6ZIRkcjKdDvSiaZhmpVSrQdMolcq2bStSs8vVwwq0lJWiWC9c6+87L0+Ngf7ac0Mc+ySQmCEhupYZyX6CuKSYkWmuFwsSITxGQ9JVNEA+Ef1kdvgU7ivFhV7IVQUSztTvEwnyhZj4jur0kRyJ314q/uV1Y+lV+wkNoliSAM8XeTGDMoRpCtClnGDJJoogzKm6FeIR4ghLlVVOhfD1KfyftEqGZRvlU6tYt8EcWbAL9sABsEAF1MExaIAmwCAGN+AO3GtX2q32oD3OWzPa58w2+AHt+QN+35dQ</latexit>
c
fM
<latexit sha1_base64="zh7phG5US28P9HLf6AYbQLgjWWo=">AAAB6nicdZDLSsNAFIYn9VZbL1WX3QwWwVVI0prWXcFF3QgV7QXaUCaTSTt0cmFmIpTQR3DjQilufRVfwJ1v46RVUNEfBn6+/xzmnOPGjAppGO9abm19Y3Mrv10o7uzu7ZcODrsiSjgmHRyxiPddJAijIelIKhnpx5ygwGWk504vsrx3R7igUXgrZzFxAjQOqU8xkgrdtEZXo1LF0M8btnVmQUM3jLpVtTNj1WtWFZqKZKo0i61F+bXvtUelt6EX4SQgocQMCTEwjVg6KeKSYkbmhWEiSIzwFI3JQNkQBUQ46XLUOTxRxIN+xNULJVzS7x0pCoSYBa6qDJCciN9ZBv/KBon0G05KwziRJMSrj/yEQRnBbG/oUU6wZDNlEOZUzQrxBHGEpbpOQR3ha1P4v+laumnrtWuz0rTBSnlQBsfgFJigDprgErRBB2AwBvfgETxpTHvQFtrzqjSnffYcgR/SXj4ACBCQkA==</latexit>
GM
⇡M
<latexit sha1_base64="Udu75tOVB86XU7nbhcZ1RpsgTSY=">AAAB7HicdVDLSsNAFJ34rK2PqstuBovgKiSlpu2u4EI3QgXTFtpQJpNJO3QyCTMToYR+gxsXFnHrn/gD7vwbJ62Cih64cDjnXu65108Ylcqy3o219Y3Nre3CTrG0u7d/UD486so4FZi4OGax6PtIEkY5cRVVjPQTQVDkM9Lzpxe537sjQtKY36pZQrwIjTkNKUZKS+4woaPrUblqmZbVqLVa0DJrtbrjOJq0nHrzvAFtbeWotkuXi8prP+iMym/DIMZpRLjCDEk5sK1EeRkSimJG5sVhKkmC8BSNyUBTjiIivWwZdg5PtRLAMBa6uIJL9ftEhiIpZ5GvOyOkJvK3l4t/eYNUhU0vozxJFeF4tShMGVQxzC+HARUEKzbTBGFBdVaIJ0ggrPR/ivoJX5fC/0m3ZtqOWb+xq20HrFAAFXACzoANGqANrkAHuAADCu7BI1gY3HgwnoznVeua8TlzDH7AePkA31CRqg==</latexit>
⇡M
<latexit sha1_base64="ywKCnNPKhuVAF+BkLUoggyoNO0I=">AAAB+nicdVDLSsNAFJ3UV62vVJduhhbBVUhKTVvcFNy4USrYB7QlTCbTdujkwczEUmI+xY0LRdz6Gy5052f4B05bBRU9cOFwzr3ce48bMSqkab5pmaXlldW17HpuY3Nre0fP77ZEGHNMmjhkIe+4SBBGA9KUVDLSiThBvstI2x2fzPz2FeGChsGlnEak76NhQAcUI6kkR8/3IuokvQn1yAjJ5CxNHb1oGqZZKdVq0DRKpbJt24rU7HL1qAItZc1QrBfeX86Pn+2Go7/2vBDHPgkkZkiIrmVGsp8gLilmJM31YkEihMdoSLqKBsgnop/MT0/hgVI8OAi5qkDCufp9IkG+EFPfVZ0+kiPx25uJf3ndWA6q/YQGUSxJgBeLBjGDMoSzHKBHOcGSTRVBmFN1K8QjxBGWKq2cCuHrU/g/aZUMyzbKF1axboMFsmAfFMAhsEAF1MEpaIAmwGACbsAduNeutVvtQXtctGa0z5k98APa0wdh9Zhu</latexit>
c ⇧
<latexit sha1_base64="IUlsOhKosL6nzCFslUXyyNlS3Mg=">AAAB6nicdVDLSsNAFJ34rPVVdelmsAh1E5JQ03ZlwY3LivYBTSiT6bQdOpmEmYlQQj9BBBeKuPUzXPkJ7vwQ905aBRU9cOFwzr3cc28QMyqVZb0ZC4tLyyurubX8+sbm1nZhZ7clo0Rg0sQRi0QnQJIwyklTUcVIJxYEhQEj7WB8mvntKyIkjfilmsTED9GQ0wHFSGnpwmvQXqFomZZVcWo1aJmOU3ZdV5OaW64eV6CtrQzFk+fS+8uNd9ToFV69foSTkHCFGZKya1ux8lMkFMWMTPNeIkmM8BgNSVdTjkIi/XQWdQoPtdKHg0jo4grO1O8TKQqlnISB7gyRGsnfXib+5XUTNaj6KeVxogjH80WDhEEVwexu2KeCYMUmmiAsqM4K8QgJhJX+Tl4/4etS+D9pOabtmuVzu1h3wRw5sA8OQAnYoALq4Aw0QBNgMATX4A7cG8y4NR6Mx3nrgvE5swd+wHj6AOzZkfE=</latexit>
<latexit sha1_base64="CESz1jP2Pwc1WGRJkHq5OtY5sSA=">AAAB7HicdVBNS8NAEN3Ur1q/Ur3pZbEInkoSatreCl68CBVMW2hD2Ww37dLNJuxuhBL6E8SLB0W8evZvePXmD/HutlVQ0QcDj/dmmDcTJIxKZVlvRm5peWV1Lb9e2Njc2t4xi7stGacCEw/HLBadAEnCKCeeooqRTiIIigJG2sH4dOa3r4iQNOaXapIQP0JDTkOKkdKSl/Sz82nfLFlly6o69Tq0yo5TcV1Xk7pbqZ1Uoa2tGUqN4rX5vv/y3Oybr71BjNOIcIUZkrJrW4nyMyQUxYxMC71UkgThMRqSrqYcRUT62TzsFB5pZQDDWOjiCs7V7xMZiqScRIHujJAayd/eTPzL66YqrPkZ5UmqCMeLRWHKoIrh7HI4oIJgxSaaICyozgrxCAmElf5PQT/h61L4P2k5ZdstVy7sUsMFC+TBATgEx8AGVdAAZ6AJPIABBTfgDtwb3Lg1HozHRWvO+JzZAz9gPH0AFwuSjA==</latexit>
pM <latexit sha1_base64="InV1JmttPX1G1CRO3YG1mpdDMEU=">AAAB+HicdVDLSsNAFJ34rPXRqks3Q4vgKiSlpi1uCm7cKBXsA9oQJpNJO3TyYGai1JAvceNCEbf+hwvd+Rn+gdNWQUUPXDiccy/33uPGjAppGG/awuLS8spqbi2/vrG5VShu73RElHBM2jhiEe+5SBBGQ9KWVDLSizlBgctI1x0fT/3uJeGCRuGFnMTEDtAwpD7FSCrJKRZiJx1cUY+MkExPs8wplg3dMGqVRgMaeqVStSxLkYZVrR/WoKmsKcrN0vvL2dGz1XKKrwMvwklAQokZEqJvGrG0U8QlxYxk+UEiSIzwGA1JX9EQBUTY6ezwDO4rxYN+xFWFEs7U7xMpCoSYBK7qDJAcid/eVPzL6yfSr9spDeNEkhDPF/kJgzKC0xSgRznBkk0UQZhTdSvEI8QRliqrvArh61P4P+lUdNPSq+dmuWmBOXJgD5TAATBBDTTBCWiBNsAgATfgDtxr19qt9qA9zlsXtM+ZXfAD2tMH4RWXlQ==</latexit>
pM
c
⇧
<latexit sha1_base64="IUlsOhKosL6nzCFslUXyyNlS3Mg=">AAAB6nicdVDLSsNAFJ34rPVVdelmsAh1E5JQ03ZlwY3LivYBTSiT6bQdOpmEmYlQQj9BBBeKuPUzXPkJ7vwQ905aBRU9cOFwzr3cc28QMyqVZb0ZC4tLyyurubX8+sbm1nZhZ7clo0Rg0sQRi0QnQJIwyklTUcVIJxYEhQEj7WB8mvntKyIkjfilmsTED9GQ0wHFSGnpwmvQXqFomZZVcWo1aJmOU3ZdV5OaW64eV6CtrQzFk+fS+8uNd9ToFV69foSTkHCFGZKya1ux8lMkFMWMTPNeIkmM8BgNSVdTjkIi/XQWdQoPtdKHg0jo4grO1O8TKQqlnISB7gyRGsnfXib+5XUTNaj6KeVxogjH80WDhEEVwexu2KeCYMUmmiAsqM4K8QgJhJX+Tl4/4etS+D9pOabtmuVzu1h3wRw5sA8OQAnYoALq4Aw0QBNgMATX4A7cG8y4NR6Mx3nrgvE5swd+wHj6AOzZkfE=</latexit>
Figure 9: Models M and M c with corresponding mean rewards and average action distri-
butions. The overlap between the action distributions is at least 0.9, while near-optimal
choices for one model incur large regret for the other.
Proof of Proposition 28. Fix T ∈ N and consider any fixed algorithm, which we recall is
defined by a sequence of mappings p1 , . . . , pT , where pt = pt (· | Ht−1 ). Let PM denote the
distribution over HT for this algorithm when M is the true model, and let EM denote the
corresponding expectation.
Viewed as a function of the history Ht−1 , each pt is a random variable, and we can
consider its expected value under the model M . To this end, for any model M ∈ M, let
T
" #
1 X
pM := EM pt ∈ ∆(Π)
T
t=1
be the algorithm’s average action distribution when M is the true model. Our aim is to
show that we can find a model in M for which the algorithm’s regret is at least as large as
the lower bound in (6.32).
Let T ∈ N, and fix a value ε > 0 to be chosen momentarily. Fix an arbitrary model
c ∈ M and set
M
n h i o
M M 2 c(π) ≤ ε2 ,
M = arg max Eπ∼pM c
[f (π M ) − f (π)] | Eπ∼p c
M
D H M (π), M (6.36)
M ∈M
The model M should be thought of as a “worst-case alternative” g to M c, but only for the
specific algorithm under consideration. We will show that the algorithm needs to have large
regret on either M or Mc. To this end, we establish some basic properties; let us abbreviate
g (π) = f (πM ) − f (π) going forward:
M M M
c
Eπ∼pMc
g (π) is large.
M
Eπ∼pM
c
[g M (π)] ≥ deccε (M, M
c) =: ∆, (6.38)
108
since by (6.36), the model M is the best response to a potentially suboptimal choice
pMc. This is almost what we want, but there is a mismatch in models, since g M
considers the model M while pMc considers the model M .
c
To see why the first equality holds, we apply the chain rule to the sequence π 1 , z 1 , . . . , π T , z T
with z t = (rt , ot ). Let us use the bold notation zt to refer to a random variable under
consideration, and let z t refer to its realization. Then we have
c
DKL PM ∥ PM
" T #
X
c c c
= EM DKL PM (zt |Ht−1 , π t ) ∥ PM (zt |Ht−1 , π t ) + DKL PM (π t |Ht−1 ∥ PM (π t |Ht−1 )
t=1
T
" #
X
c
M t t
=E c(π ) ∥ M (π )
DKL M
t=1
since conditionally on Ht−1 , the law of π t does not depend on the model.
1
We can now choose ε = c1 · √CT , where c1 > 0 is a sufficiently small numerical
constant, to ensure that
2 c c
DTV PM , PM ≤ DKL PM ∥ PM ≤ 1/100. (6.39)
In other words, with constant probability, the algorithm can fail to distinguish M and
M
c.
Finally, we will make use of the fact that since rewards are in [0, 1], we have
h i h i r h i
c
M M
Eπ∼p f (π) − f (π) ≤ Eπ∼p DTV M (π), M c(π) ≤ Eπ∼p D2 M (π), M c(π) ≤ ε.
c
M c
M c
M H
(6.40)
c (π ∈
or else we are done, by (6.37). Our aim is to show that under this assumption, pM /
GM ) ≥ 1/2, which will imply that Eπ∼pM [g M (π)] ≳ ∆ via (6.42).
109
Step 2. By adding the inequalities (6.43) and (6.38), we have that
h i h i
c c c
f M (πM ) − f M (πM M
M M
c ) ≥ Eπ∼p c g (π) − g (π) − Eπ∼p c |f
M
M
(π) − f M
(π)|
9 h
M c
M
i
≥ ∆ − Eπ∼pM c
|f (π) − f (π)| .
10
c
M
In addition, by (6.40), we have Eπ∼pM c
|f (π) − f M (π)| ≤ ε, so that
c 9
f M (πM ) − f M (πM
c) ≥ ∆ − ε. (6.44)
10
1
Hence, as long as ε ≤ 10 ∆, which is implied by (6.30), we have
c 4
f M (πM ) − f M (πM
c) ≥ ∆. (6.45)
5
Finishing up. Note that since the choice of Mc ∈ M for this lower bound was arbitrary,
c
we are free to choose M to maximize decε (M, M ).
c c
∆2
≥ inf max (1 − p(i))∆ | p(i) ≤ ε2
p∈∆(Π) i 2
110
√
For any p, there exists i such that p(i) ≤ 1/A. If we choose ∆ = ε · 2A, this choice for i
2
will satisfy the constraint p(i) ∆2 ≤ ε2 , and we will be left with
p
deccε (M, M
c) ≥ (1 − p(i))∆ ≥ ε A/2,
Generalizing the argument above, we can prove a lower bound on the Decision-Estimation
Coefficient for any model class M that “embeds” the multi-armed bandit problem in a cer-
tain sense.
Proposition 30: Let a reference model M c be given, and suppose that a class M con-
tains a sub-class {M1 , . . . , MN } and collection of decisions π1 , . . . , πN with the property
that for all i:
1. DH 2 M (π), Mc(π) ≤ β 2 · I {π = πi }.
i
2. f Mi (πMi ) − f Mi (π) ≥ α · I {π ̸= πi }.
Then n √ o
deccε (M, M
c) ≳ α · I ε ≥ β/ N .
The examples that follow can be obtained by applying this result with an appropriate
sub-family.
Example 6.5 (cont’d). Recall the bandit-type problem with structured noise from Exam-
ple 6.5, where we have M = {M1 , . . . , MA }, with Mi (π) = N (1/2, 1)I {π ̸= i}+Ber(3/4)I {π = i}.
c(π) = N (1/2, 1), then this family satisfies the conditions of Proposition 30 with
If we set M n o
α = 1/4 and β 2 = 2. As a result, we have deccε (MMAB-SN ) ≳ I ε ≥ 2/A , which yields
p
E[Reg] ≳ O(A)
e
Example 6.6 (cont’d). Consider the full-information variant of the bandit setting in
Example 6.6. By adapting the argument in Example 6.4, one can show that
deccε (M) ≳ ε,
111
Next, we revisit some of the structured bandit classes considered in Section 4.
Example 6.7. Consider the linear bandit setting in Section 4.3.2, with F = {π 7→ ⟨θ, ϕ(π)⟩ | θ ∈ Θ},
where Θ ⊆ Bd2 (1) is a parameter set and ϕ : Π → Rd is a fixed feature map that is known to
the learner. Let M be the set of all reward distributions with f M ∈ F and 1-sub-Gaussian
noise. Then √
deccε (M) ≳ ε d,
which gives √
E[Reg] ≳ dT .
◁
Example 6.8. Consider the Lipschitz bandit setting in Section 4.3.3, where Π is a metric
space with metric ρ, and
Let M be the set of all reward distributions with f M ∈ F and 1-sub-Gaussian noise. Let
d > 0 be such that the covering number for Π satisfies
Nρ (Π, ε) ≥ ε−d .
Then 2
deccε (M) ≳ ε d+2 ,
d+1
which leads to E[Reg] ≳ T d+2 . ◁
112
Occupancy measures. The results we present make use of the notion of occupancy
measures for an MDP M . Let PM ,π (·) denote the law of a trajectory evolving under MDP
M and policy π. We define state occupancy measures via
,π M ,π
dM
h (s) = P (sh = s)
Bounding the DEC for tabular RL. Recall, that to certify a bound on the DEC, we
need to—given any parameter γ > 0 and estimator M c, exhibit a distribution (or, “strategy”)
p such that
h i
2
sup Eπ∼p f M (πM ) − f M (π) − γ · DH c(π) ≤ decγ (M, M
M (π), M c)
M ∈M
for some upper bound decγ (M, M c). For tabular RL, we will choose p using an algorithm
called Policy Cover Inverse Gap Weighting. As the name suggests, the approach combines
the inverse gap weighting technique introduced in the multi-armed bandit setting with the
notion of a policy cover —that is, a collection of policies that ensures good coverage on every
state [33, 64, 47].
,π c
dMh (s, a)
πh,s,a = arg max c c
. (6.46)
π∈Πrns 2HSA + η(f (πMc) − f (π))
M M
1
p(π) = c c
, (6.47)
c) − f
λ + η(f (πM (π))
M M
P
where λ ∈ [1, 2HSA] is chosen such that π p(π) = 1.
return p.
The algorithm consists of two steps. First, in (6.46), we compute the collection of policies
Ψ = {πh,s,a }h∈[H],s∈[S],a∈[A] that constitutes the policy cover. The basic idea here is that
each policies in the policy cover should balance (i) regret and (ii) coverage—that is—ensure
that all the states are sufficiently reached, which means we are exploring. We accomplish
this by using policies of the form
,π c
dMh (s, a)
πh,s,a := arg max c c
π∈Πrns 2HSA + η(f (πMc) − f (π))
M M
113
which—for each (s, a, h) tuple—maximize the ratio of the occupancy measure for (s, a)
at layer h to the regret gap under M c. This inverse gap weighted policy cover balances
exploration and exploration by trading off coverage with suboptimality. With the policy
cover in hand, the second step of the algorithm computes the exploratory distribution p by
simply applying inverse gap weighting to the elements of the cover and the greedy policy
πMc.
The bound on the Decision-Estimation Coefficient for the PC-IGW algorithm is as fol-
lows.
H 3 SA
and consequently certifies that decγ (M, M
c) ≲
γ .
How to estimate the model. The bound on the DEC we proved using the PC-IGW
algorithm assumes that M c ∈ M, but in general, estimators from online learning algorithm
such as exponential weights will produce M ct ∈ co(M). While it is possible to show that
the same bound on the DEC holds for M c ∈ co(M), a slightly more complex version of the
algorithm is required to certify such a bound. To run the PC-IGW algorithm as-is, we can
use a simple approach to obtain a proper estimator M c ∈ M.
Assume for simplicity that rewards are known, i.e. RhM (s, a) = Rh (s, a) for all M ∈ M.
Instead of directly working with an estimator for the entire model M , we work with layer-
wise estimators AlgEst;1 , . . . , AlgEst;H . At each round t, given the history {(π i , ri , oi )}t−1
i=1 ,
⋆
the layer-h estimator AlgEst;h produces an estimate Pbh for the true transition kernel PhM .
t
We obtain an estimation algorithm for the full model M ⋆ by taking M ct as the MDP that
has Ph as the transition kernel for each layer h. This algorithm has the following guaran-
b t
tee.
114
Proposition 32: The estimator described above has
H
X
EstH ≤ O(log(H)) · EstH;h .
h=1
ct ∈ M.
In addition, M
For each layer, we can obtain EstH;h ≤ O(S e 2 A) using the averaged exponential weights
algorithm, by applying the approach described in Section 6.4.1 to each layer. That is, for
each layer, we obtain Pbht by running averaged exponential weights with the loss ℓtlog (Ph ) =
e 2 A) with this approach because there are
− log(Ph (sh+1 | sh , ah )). We obtain EstH;h ≤ O(S
2
S A parameters for the transition distribution at each layer.
A lower bound on the DEC. We state, but do not prove a complementary lower bound
on the DEC for tabular RL.
√
Using Proposition 28, this gives E[Reg] ≳ HSAT .
c
f M (π) − f M (π) ≤ DTV M (π), Mc(π) (6.49)
c(π) ≤ 1 + η D2 M (π), M
≤ DH M (π), M c(π) ∀η > 0, (6.50)
2η 2 H
115
and
c
f M (π) − f M (π)
H hh i i XH h i
M ,π
X c c c
= EM ,π (PhM − PhM )Vh+1 (sh , ah ) + EM ,π Erh ∼RM (sh ,ah ) [rh ] − Er c
M [rh ]
h h ∼Rh (sh ,ah )
h=1 h=1
(6.51)
H
X h i
c c c
≤ EM ,π DTV PhM (sh , ah ), PhM (sh , ah ) + DTV RhM (sh , ah ), RhM (sh , ah ) . (6.52)
h=1
Next, we provide a “change-of-measure” lemma, which allows one to move from between
quantities involving an estimator M
c and those involving another model M .
Lemma 24 (Change of measure for RL): Consider any MDP M and reference
c which satisfy PH rh ∈ [0, 1]. For all p ∈ ∆(Π) and η > 0 we have
MDP M h=1
Proof of Proposition 31. Let M ∈ M be fixed. The main effort in the proof will be to
bound the quantity
h i
c
Eπ∼p f M (πM ) − f M (π)
in terms of the quantity on the right-hand side of (6.54), then apply change of measure
(Lemma 24). We begin with the decomposition
h i h i
c c c c
Eπ∼p f M (πM ) − f M (π) = Eπ∼p f M (πMc) − f
M
(π) + f M (πM ) − f M (πM
c) . (6.55)
| {z } | {z }
(II)
(I)
For the first term (I), which may be thought of as exploration bias, we have
c c
c) − f
f M (πM (π) 2HSA
h i X M
c c
Eπ∼p f M (πM
c) − f
M
(π) = c c
≤ , (6.56)
λ + η(f (πM
M
c) − f
M
(π)) η
c}
π∈Ψ∪{πM
where we have used that λ ≥ 0. We next bound the second term (II), which entails showing
that the PC-IGW distribution explores enough. We have
c c c c
f M (πM ) − f M (πM
c) = f
M
(πM ) − f M (πM ) − (f M (πM
c) − f
M
(πM )). (6.57)
116
We use the simulation lemma to bound
H
X h i
c c c c
f M (πM ) − f M (πM ) ≤ EM ,πM DTV PhM (sh , ah ), PhM (sh , ah ) + DTV RhM (sh , ah ), RhM (sh , ah )
h=1
H X
c ,πM
X
= dM
h (s, a)errM
h (s, a),
h=1 s,a
H X H X ¯ 1/2
X c ,πM
X c,πM dh (s, a) 2
dM (s, a)[errM
h (s, a)] = dM (s, a) (errM
h (s, a))
h h
d¯h (s, a)
h=1 s,a h=1 s,a
H c,π H
1 X X (dh M (s, a))2 η ′ X X ¯
M
2
≤ ¯h (s, a) + dh (s, a)(errM
h (s, a))
2η ′ s,a
d 2 s,a
h=1 h=1
H X c,πM H
(s, a))2 η′ X
M
1 X (dh c
Eπ∼p EM ,π (errM 2
= + h (sh , ah )) .
2η ′ d¯h (s, a) 2
h=1 s,a h=1
The second term is exactly the upper bound we want, so it remains to bound the ratio of
occupancy measures in the first term. Observe that for each (h, s, a), we have
c,πM c,πM c,πM
dM
h (s, a) dM
h (s, a) 1 dM
h (s, a) c
M c
M
≤ · ≤ 2HSA + η(f (π c ) − f (πh,s,a ,
)
d¯h (s, a) c,π c,π M
dh h,s,a (s, a) p(πh,s,a )
M M
dh h,s,a (s, a)
where the second inequality follows from the definition of p and the fact that λ ≤ 2HSA.
Furthermore, since
,π c
dMh (s, a)
πh,s,a = arg max c c
,
π∈Πrns 2HSA + η(f (πMc) − f (π))
M M
As a result, we have
H X c,πM H X
X (dM
h (s, a))2 X c,πM c c
¯ ≤ dM
h (s, a)(2HSA + η(f M (πM
c) − f
M
(πM ))
dh (s, a)
h=1 s,a h=1 s,a
2 c c
= 2H SA + ηH(f M (πM
c) − f
M
(πM )).
H
H 2 SA η ′ X c ηH M c c c c
≤ ′
+ Eπ∼p EM ,π (errM 2
h (sh , ah )) + ′ c) − f
(f (πM M
(πM )) − (f M (πM
c) − f
M
(πM )).
η 2 2η
h=1
117
ηH
We set η ′ = 2 so that the latter terms cancel and we are left with
H
c 2HSA ηH X c
Eπ∼p EM ,π (errM 2
f M (πM ) − f M (πM
c) ≤ + h (sh , ah )) .
η 4
h=1
We conclude by applying the change-of-measure lemma (Lemma 24), which implies that for
any η ′ > 0,
4HSA h i
Eπ∼p [f M (πM ) − f M (π)] ≤ + (4η ′ )−1 + (4H 2 η + η ′ ) · Eπ∼p DH
2
M (π), M
c(π) .
η
γ
The result follows by choosing η = η ′ = 21H 2
(we have made no effort to optimize the
constants here).
Proposition 34: There exists an algorithm that for any δ > 0, ensures that with
probability at least 1 − δ,
Compared to (6.20), this replaces the estimation term log|M| with the smaller quantity
log|Π|, replaces decγ (M) with the potentially larger quantity decγ (co(M)). Whether or
not this leads to an improvement depends on the class M. For multi-armed bandits, linear
bandits, and convex bandits, M is already convex, so this offers strict improvement. For
MDPs though, M is not convex: Even for the simple tabular MDP setting where |S| = S
118
and |A| = A, grows exponentially decγ (co(M)) in either H or S, whereas decγ (M) is
polynomial in all parameters.
We mention in passing that this result is proven using a different algorithm from E2D;
see Foster et al. [40, 42] for more background.
In this section we give a generalization of the E2D algorithm that incorporates two extra
features: general divergences and randomized estimators.
This variant of the DEC naturally leadsto regretbounds in terms of estimation error under
Dπ (· ∥ ·). Note that we use notation Dπ M
c ∥ M instead of say, D M c(π), M (π) , to reflect
that fact that the divergence may depend on M (resp. M c) and π through properties other
than M (π) (resp. M (π)).
c
Randomized estimators. The basic version of E2D assumes that at each round, the
online estimation oracle provides a point estimate M
ct . In some settings, it useful to consider
randomized estimators that, at each round, produce a distribution ν t ∈ ∆(M) over models.
For this setting, we further generalize the DEC by defining
h i
D M M π c
decγ (M, ν) = inf sup Eπ∼p f (πM ) − f (π) − γ · EM c∼ν D M ∥M (6.61)
p∈∆(Π) M ∈M
119
Algorithm. A generalization of E2D that incorporates general divergences and random-
ized estimators is given above on page 119. The algorithm is identical to E2D with Option
I, with the only differences being that i) we play the distribution that solves the minimax
problem (6.61) with the user-specified divergence Dπ (· ∥ ·) rather than squared Hellinger
distance, and ii) we use the randomized estimate ν t rather than a point estimate. Our per-
formance guarantee for this algorithm depends on the estimation performance of the oracle’s
randomized estimates ν 1 , . . . , ν T ∈ ∆(M) with respect to the given divergence Dπ (· ∥ ·),
which we define as
XT h t i
π t ⋆
EstD := Eπt ∼pt EM t
c ∼ν t D M
c ∥ M . (6.62)
t=1
Proposition 35: The algorithm E2D for General Divergences and Randomized Esti-
mators with exploration parameter γ > 0 guarantees that
Reg ≤ decD
γ (M) · T + γ · EstD (6.63)
almost surely.
for all models M, M ′ . In this case, it suffices for the online estimation oracle to directly
estimate the sufficient statistic by producing a randomized estimator ν t ∈ ∆(Ψ), and we
can write the estimation error as
T
X h t i
EstD := Eπt ∼pt Eψbt ∼ν t Dπ ψbt ∥ M ⋆ . (6.64)
t=1
The benefit of this perspective is that for many examples of interest, since the divergence
depends on the estimate only through ψ, we can derive bounds on Est that scale with
log|Ψ| instead of log|M|.
For example, in structured bandit problems, one can work with the divergence
c
DSq Mc(π), M (π) := (f M (π) − f M (π))2
which uses the mean reward function as a sufficient statistic, i.e. ψ(M ) = f M . Here, it is
clear that one can achieve EstD ≲ log|F|, which improves upon the rate EstH ≲ log|M| for
Hellinger distance, and recovers the specialized version of the E2D algorithm we considered
in Section 4. Analogously, for reinforcement learning, one can consider value functions as a
sufficient statistic, and use an appropriate divergence based on Bellman residuals to derive
estimation guarantees that scale with the complexity log|Q| of a given value function class
Q; see Section 7 for details.
120
Does randomized estimation help? Note that whenever D is convex in the first argu-
D D
ment, we have decD γ (M) ≤ supM c∈co(M) decγ (M, M ) = decγ (M) (that is, the randomized
c
DEC is never larger than the vanilla DEC), but it is not immediately apparent whether the
opposite direction of this inequality holds, and one might hope that working with the ran-
domized DEC in (6.61) would lead to improvements over the non-randomized counterpart.
The next result shows that this is not the case: Under mild assumptions on the divergence
D, randomization offers no improvement.
Proposition 36: Let D be any bounded divergence with the property that for all
models M, M ′ , M
c and π ∈ Π,
Dπ M ∥ M ′ ≤ C Dπ M
c ∥ M + Dπ M c ∥ M′ .
(6.65)
Squared Hellinger distance is symmetric and satisfies Condition (6.65) with C = 2. Hence,
writing decH D 2
γ (M) as shorthand for decγ (M) with D = DH (·, ·), we obtain the following
corollary.
Proposition 37: Suppose that R ⊆ [0, 1]. Then for all γ > 0,
decH
γ (M) ≤ sup decH H H
γ (M, M ) ≤ sup decγ (M, M ) ≤ decγ/4 (M).
c c
c∈co(M)
M M
c
This shows that for Hellinger distance—at least from a statistical perspective—there is no
benefit to using the randomized DEC compared to the original version. In some cases,
however, strategies p that minimize decH
γ (M, ν) can be simpler to compute than strategies
H c ∈ co(M).
that minimize decγ (M, Mc) for M
121
This quantity is similar to (6.62), but incorporates a bonus term
⋆ ct
γ −1 (f M (πM ⋆ ) − f M (πM
ct )),
⋆
which encourages the estimation algorithm to over-estimate the optimal value f M (πM ⋆ )
for the underlying model, leading to a form of optimism.
Example 6.9 (Structured bandits). Consider any structured bandit problem with decision
space Π, function class F ⊆ (Π → [0, 1]), and O = {∅}. Let MF be the class
MF = {M | f M ∈ F, M (π) is 1-sub-Gaussian ∀π}.
To derive bounds on the optimistic estimation error, we can appeal to an augmented version
of the (randomized) exponential weights algorithm which, for a learning rate parameter
η > 0, sets !!
X
ν t (f M ) ∝ exp −η (f M (π i ) − ri )2 − γ −1 f M (πM ) .
i<t
Optimistic E2D.
and
o-decD
γ (M) = sup o-decD
γ (M, ν). (6.69)
ν∈∆(M)
122
The Optimistic DEC the same as the generalized DEC in (6.61), except that the optimal
c
value f M (πM ) in (6.61) is replaced by the optimal value f M (πM
c ) for the (randomized)
reference model Mc ∼ ν. This seemingly small change is the main advantage of incorporating
optimistic estimation, and makes it possible to bound the Optimistic DEC for certain
divergences D for which the value of the generalized DEC in (6.61) would otherwise be
unbounded.
Remark 18: When the divergence D admits a sufficient statistic ψ : M → Ψ, for any
distribution ν ∈ ∆(M), if we define ν ∈ ∆(Ψ) via ν(ψ) = ν({M ∈ M : ψ(M ) = ψ}),
we have
h i
o-decD
γ (M, ν) = inf sup E E
π∼p ψ∼ν f ψ
(πψ ) − f M
(π) − γ · D π
(ψ ∥ M ) .
p∈∆(Π) M ∈M
In this case, by overloading notation slightly, we may simplify the definition in (6.69)
to
o-decD D
γ (M) = sup o-decγ (M, ν).
ν∈∆(Ψ)
Regret bound for optimistic E2D. The following result shows that the regret of Op-
timistic Estimation-to-Decisions is controlled by the Optimistic DEC and the optimistic
estimation error for the oracle.
Reg ≤ o-decD D
γ (M) · T + γ · OptEstγ (6.70)
almost surely.
This regret bound has the same structure as that of Proposition 35, but the DEC and
estimation error are replaced by their optimistic counterparts.
When does optimistic estimation help?. When does the regret bound in Proposition
38 improve upon its non-optimistic counterpart in Proposition 35? It turns out that for
asymmetric divergences such as those found in the context of reinforcement learning, the
regret bound in (6.70) can be much smaller than the corresponding bound in (6.63); see
Section 7.3.5 for an example. However, for symmetric divergences such as Hellinger distance,
we will show now that the result never improves upon Proposition 35.
Given a divergence D, we define the flipped divergence, which swaps the first and second
arguments, by
qπ M
D c ∥ M := Dπ M ∥ M c .
123
L2lip · Dπ Mc ∥ M for a constant Llip > 0. Then for all γ > 0,
D L2lip D L2lip
≤ o-decD
q q
dec3γ/2 (M) − γ (M) ≤ decγ/2 (M) + . (6.71)
2γ 2γ
This result shows that the optimistic DEC with divergence D is equivalent to the generalized
DEC in (6.61), but with the arguments to the divergence flipped. Thus, for symmetric
divergences, the quantities are equivalent. In particular, we can combine Proposition 39
with Proposition 36 to derive the following corollary for Hellinger distance.
Proposition 40: Suppose that rewards are bounded in [0, 1]. Then for all γ > 0,
1 3
o-decH
2γ (M) − ≤ sup decH H
γ (M, M ) ≤ o-decγ/6 (M) + .
γ M γ
For asymmetric divergences, in settings where there exists an estimation oracle for which
the flipped estimation error
T
X h t i
D π
Est = ct ∼ν t D M⋆ ∥ M
ct
q
Eπ∼pt EM
t=1
is controlled, Proposition 39 shows that to match the guarantee in Proposition 38, optimism
is not required, and it suffices to run the non-optimistic algorithm on page 119. However,
we show in Section 7.3.5 that for certain divergences found in the context of reinforcement
learning, estimation with respect to the flipped divergence is not feasible, yet working with
the optimistic DEC E2D.Opt leads to meaningful guarantees.
[Note: This subsection will be expanded in the next version.]
Then, letting
h i
decSq
γ (F, f ) =
b inf sup Eπ∼p f (πf ) − f (π) − γ(f (π) − fb(π))2 ,
p∈∆(Π) f ∈F
we have
decSq Sq
c1 γ (F) ≤ decγ (MF ) ≤ decc2 γ (F),
124
where c1 , c2 ≥ 0 are numerical constants.
M ⊗ D = {M ⊗ D | M ∈ M, D ∈ D}.
c ∈ M and D
Then for all M b ∈ D,
decγ (M ⊗ D, M
c ⊗ D)
b = decγ (M, M
c).
This can be seen to hold by restricting the supremum in (6.9) to range over models of
the form M ⊗ D.b
c) ≤ decγ (ρ ◦ M, ρ ◦ M
decγ (M, M c).
125
We now prove the high-probability bound using Lemma 34. Define Z_t = ½(ℓ^t_log(M̂^t) − ℓ^t_log(M⋆)). Applying Lemma 34 with the sequence (−Z_t)_{t≤T}, we are guaranteed that with probability at least 1 − δ,

−Σ_{t=1}^T log E_{t−1}[e^{−Z_t}] ≤ Σ_{t=1}^T Z_t + log(δ^{−1}) = ½ Σ_{t=1}^T (ℓ^t_log(M̂^t) − ℓ^t_log(M⋆)) + log(δ^{−1}).

Let t be fixed, and abbreviate z^t = (r^t, o^t). Let ν(· | π) be any (conditional) dominating measure for m^{M̂^t} and m^{M⋆}, and observe that

E_{t−1}[e^{−Z_t} | π^t] = E_{t−1}[ √( m^{M̂^t}(z^t | π^t) / m^{M⋆}(z^t | π^t) ) | π^t ]
= ∫ m^{M⋆}(z | π^t) √( m^{M̂^t}(z | π^t) / m^{M⋆}(z | π^t) ) ν(dz | π^t)
= ∫ √( m^{M⋆}(z | π^t) m^{M̂^t}(z | π^t) ) ν(dz | π^t) = 1 − ½ D²_H(M⋆(π^t), M̂^t(π^t)).

Hence,

E_{t−1}[e^{−Z_t}] = 1 − ½ E_{t−1}[ D²_H(M⋆(π^t), M̂^t(π^t)) ],

and, since −log(1 − x) ≥ x for x ∈ [0, 1], we conclude that

½ Σ_{t=1}^T E_{t−1}[ D²_H(M⋆(π^t), M̂^t(π^t)) ] ≤ ½ Σ_{t=1}^T (ℓ^t_log(M̂^t) − ℓ^t_log(M⋆)) + log(δ^{−1}).
Proof of Lemma 23. We first prove (6.49). Let X = Σ_{h=1}^H r_h. Since X ∈ [0, 1] almost surely, we have

f^M(π) − f^{M̂}(π) = E^{M,π}[X] − E^{M̂,π}[X] ≤ D_TV(M(π), M̂(π)) ≤ D_H(M(π), M̂(π)).
Proof of Lemma 24. We first prove (6.53). For all η > 0, we have

We now prove (6.54). Using Lemma 37, we have that for all h,

E^{M,π}[ D²_H(P^M(s_h, a_h), P^{M̂}(s_h, a_h)) ] + E^{M,π}[ D²_H(R^M(s_h, a_h), R^{M̂}(s_h, a_h)) ] ≤ 8 D²_H(M(π), M̂(π)).

As a result,

E^{M,π}[ Σ_{h=1}^H D²_H(P^M(s_h, a_h), P^{M̂}(s_h, a_h)) + D²_H(R^M(s_h, a_h), R^{M̂}(s_h, a_h)) ] ≤ 8H D²_H(M(π), M̂(π)).
6.10 Exercises
4. Argue that

dec_γ(M, M̂) ≤ sup_{ν∈∆(M)} sup_{µ∈∆(M)} inf_{p∈∆(Π)} E_{π∼p, M∼µ}[ f^M(π_M) − f^M(π) − (γ/4)·E_{M′∼ν}[D²_H(M(π), M′(π))] ].
5. Show that

sup_{M̂} dec_γ(M, M̂) ≤ sup_{M̂∈co(M)} dec_{γ/4}(M, M̂).   (6.72)

In other words, the estimation oracle cannot significantly increase the value of the DEC by selecting models outside co(M).
Exercise 13 (Lower Bound on DEC for Tabular RL): We showed that for Gaussian bandits,

dec^c_ε(M, M̂) ≥ ε√(A/2),

for all ε ≲ 1/√A, by considering a small sub-family of models and explicitly computing the DEC for this sub-family. Show that if M is the set of all tabular MDPs with |S| = S, |A| = A, and Σ_{h=1}^H r_h ∈ [0, 1], then

dec^c_ε(M, M̂) ≳ ε√(SA)

for all ε ≲ 1/√(SA), as long as H ≳ log_A(S).
Exercise 14 (Structured Bandits with ReLU Rewards): We will show that structured bandits with ReLU rewards suffer from the curse of dimensionality. Let relu(x) = max{x, 0} and take Π = B^d_2(1) = {π ∈ R^d | ∥π∥_2 ≤ 1}. Consider the class of value functions of the form

f_θ(π) = relu(⟨θ, π⟩ − b),

where θ ∈ Θ = S^{d−1} is an unknown parameter vector and b ∈ [0, 1] is a known bias parameter. Here S^{d−1} := {v ∈ R^d | ∥v∥ = 1} denotes the unit sphere. Let M = {M_θ}_{θ∈Θ}, where for all π, M_θ(π) := N(f_θ(π), 1).

We will prove that for all d ≥ 16, there exists M̄ ∈ M such that for all γ > 0,

dec_γ(M, M̄) ≳ (e^{d/8}/γ) ∧ 1,   (6.74)

for an appropriate choice of bias b. By slightly strengthening this result and appealing to (6.32), it is possible to show that any algorithm must have E[Reg] ≳ e^{d/8}.

To prove (6.74), we will use the fact that for large d, a random vector v chosen uniformly from the unit sphere is nearly orthogonal to any direction π. This fact is quantified as follows (see Ball '97):

P_{v∼unif(S^{d−1})}(⟨π, v⟩ > α) ≤ exp(−α²d/2).   (6.75)

In other words, instantaneous regret is at least (1 − b) whenever the decision π does not align well with v.
2. Let M̄(π) = N(0, 1). Show that for all π ∈ Π, v ∈ Θ, and for any choice of b,

D²_H(M_v(π), M̄(π)) ≤ ½ f²_v(π) ≤ ((1 − b)²/2)·I{⟨v, π⟩ > b},

i.e. information is obtained by the decision-maker only if the decision π aligns well with v in the model M_v.

3. Show that

dec_γ(M, M̄) ≥ inf_{p∈∆(Π)} E_{v∼unif(S^{d−1})} E_{π∼p}[ (1 − b) − (1 − b)·I{⟨v, π⟩ > b} − γ((1 − b)²/2)·I{⟨v, π⟩ > b} ].

dec_γ(M′, M̄) ≥ ε − ε·exp(−d/8) − γ(ε²/2)·exp(−d/8).

Conclude that for d ≥ 8,

dec_γ(M′, M̄) ≥ ε/2 − γ(ε²/2)·exp(−d/8).

5. Show that by choosing ε = (e^{d/8}/(6γ)) ∧ ½ and recalling that b = 1 − ε, we get (6.74).
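As a quick numerical sanity check of the concentration estimate (6.75) used in Exercise 14 (a sketch of ours, not part of the exercise), the following Python snippet draws uniform samples from S^{d−1} and compares the empirical alignment probability with the bound exp(−α²d/2); the dimension, threshold, and sample count are arbitrary illustrative choices.

    import numpy as np

    rng = np.random.default_rng(0)
    d, alpha, trials = 50, 0.3, 200_000
    pi = np.zeros(d); pi[0] = 1.0                      # any fixed unit vector (the direction pi)
    v = rng.normal(size=(trials, d))
    v /= np.linalg.norm(v, axis=1, keepdims=True)      # uniform samples from the unit sphere
    empirical = np.mean(v @ pi > alpha)
    print(empirical, "<=", np.exp(-alpha**2 * d / 2))  # empirical frequency vs. the bound in (6.75)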
In this section, we consider the problem of online reinforcement learning with function ap-
proximation. The framework is the same as that of Section 5 but, in developing algorithms,
we no longer assume that the state and action spaces are finite/tabular, and in particular
we will aim for regret bounds that are independent of the number of states. To do this,
we will make use of function approximation—either directly modeling the transition prob-
abilities for the underlying MDP, or modeling quantities such as value functions—and our
goal will be to design algorithms that are capable of generalizing across the state space as
they explore. This will pose challenges similar to that of the structured and contextual
bandit settings, but we now face the additional challenge of credit assignment. Note that
the online reinforcement learning framework is a special case of the general decision making
setting in Section 6, but the algorithms we develop in this section will be tailored to the
MDP structure.
Recall (Section 5) that for reinforcement learning, each MDP M takes the form

M = {S, A, {P^M_h}_{h=1}^H, {R^M_h}_{h=1}^H, d_1},

where S is the state space, A is the action space, P^M_h : S × A → ∆(S) is the probability transition kernel at step h, R^M_h : S × A → ∆(R) is the reward distribution, and d_1 ∈ ∆(S_1) is the initial state distribution. All of the results in this section will take Π = Π_RNS, and we will assume that Σ_{h=1}^H r_h ∈ [0, 1] unless otherwise specified.
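For concreteness, the following is a minimal Python container (ours, not part of the notes) matching the tuple above in the tabular case; for non-tabular MDPs the arrays would be replaced by function handles, and reward distributions are represented here by their means for simplicity.

    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class TabularMDP:
        P: np.ndarray    # shape (H, S, A, S): transition kernels P_h^M(s' | s, a)
        R: np.ndarray    # shape (H, S, A): mean rewards, standing in for R_h^M
        d1: np.ndarray   # shape (S,): initial state distribution

        @property
        def horizon(self) -> int:
            return self.P.shape[0]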
As in earlier sections, we will work with a realizability assumption, which asserts that we have a function class that is capable of modeling the underlying environment well. For reinforcement learning, there are various realizability assumptions one can consider:

• Model realizability: We have a model class M of MDPs that contains the true MDP M⋆.

• Value function realizability: We have a class Q of state-action value functions that contains the optimal value function Q^{M⋆,⋆}.

• Policy realizability: We have a class Π of policies that contains the optimal policy π_{M⋆}.

Note that model realizability implies value function realizability, which in turn implies policy realizability. Ideally, we would like to be able to say that whenever one of these assumptions holds, we can obtain regret bounds that scale with the complexity of the function class (e.g., log|M| for model realizability, or log|Q| for value function realizability), but do not depend on the number of states |S| or other properties of the underlying MDP, analogous to the situation for statistical learning. Unfortunately, the following result shows that this is too much to ask for.
The interpretation of this result is that even if model realizability holds, any algorithm must have regret that scales with min{|S|, |M|, 2^H}. This means additional structural assumptions on the underlying MDP M⋆—beyond realizability—are required if we want to obtain sample-efficient learning guarantees. Note that since this construction satisfies model realizability, the strongest form of realizability, it also rules out sample-efficient results for value function and policy realizability.
In what follows, we will explore different structural assumptions that facilitate low regret
for reinforcement learning with function approximation. Briefly, the idea will be to make
assumptions that either i) allow for extrapolation across the state space, or ii) control the
number of “effective” state distributions the algorithm can encounter. We will begin by
investigating reinforcement learning with linear models, then explore a general structural
property known as Bellman rank.
Remark 20 (Further notions of realizability): There are many notions of realizability beyond those we consider above. For example, for value function approximation, one can assume that Q^{M⋆,π} ∈ Q for all π, or assume that the class Q obeys certain notions of consistency with respect to the Bellman operator for M⋆.
The Linear-Q⋆ model asserts that the optimal Q-function is linear in a known feature map:

Q^{M,⋆}_h(s, a) = ⟨ϕ(s, a), θ^M_h⟩,   (7.1)

where ϕ(s, a) ∈ B^d_2(1) is a feature map that is known to the learner and θ^M_h ∈ B^d_2(1) is an unknown parameter vector. Equivalently, we can define

Q = {Q_h(s, a) = ⟨ϕ(s, a), θ_h⟩ | θ_h ∈ B^d_2(1) ∀h}.   (7.2)
Proposition 45 (Weisz et al. [86], Wang et al. [85]): For any d ∈ N and H ∈ N sufficiently large, any algorithm for the Linear-Q⋆ model must have

E[Reg] ≳ min{2^{Ω(d)}, 2^{Ω(H)}}.

This contrasts with the situation for contextual bandits and linear bandits, where linear rewards were sufficient for low regret. The intuition is that, even though Q^{M,⋆} is linear, it might take a very long time to estimate the value for even a small number of states. That is, linearity of the optimal value function is not a useful assumption unless there is some kind of additional structure that can guide us toward the optimal value function to begin with. We mention in passing that Proposition 45 can be proven by lower bounding the Decision-Estimation Coefficient [40].
The Low-Rank MDP model. Proposition 45 implies that linearity of the optimal Q-
function alone is not sufficient for sample-efficient RL. To proceed, we will make a stronger
assumption, which asserts that the transition probabilities themselves have linear structure:
For all s ∈ S, a ∈ A, and h ∈ [H], we have

P^M_h(s′ | s, a) = ⟨ϕ(s, a), µ^M_h(s′)⟩ and E^M[r_h | s_h = s, a_h = a] = ⟨ϕ(s, a), w^M_h⟩.   (7.3)

Here, ϕ(s, a) ∈ B^d_2(1) is a feature map that is known to the learner, µ^M_h(s′) ∈ R^d is another feature map which is unknown to the learner, and w^M_h ∈ B^d_2(√d) is an unknown parameter vector. Additionally, for simplicity, we assume that Σ_{s′∈S} |µ^M_h(s′)| ≤ √d, which in particular holds in the tabular example below. As before, assume that both cumulative and individual-step rewards are in [0, 1]. For the remainder of the subsection, we let M denote the set of MDPs with these properties.
The linear structure in (7.3) implies that the transition matrix has rank at most d, thus
facilitating (as we shall see shortly) information sharing and generalization across states,
even when the cardinality of S and A is large or infinite. For this reason, we refer to MDPs
with this structure as low-rank MDPs [71, 88, 48, 5].
Just as linear bandits generalize unstructured multi-armed bandits, the low-rank MDP
model (7.3) generalizes tabular RL, which corresponds to the special case in which d =
|S| · |A|, ϕ(s, a) = es,a , and (µh (s′ ))s,a = PhM (s′ | s, a).
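To make the tabular special case concrete, here is a small numerical check (ours, not from the notes) that the indicator features ϕ(s, a) = e_{s,a} together with µ_h(s′) built from the transition kernel reproduce P_h(s′ | s, a) exactly; the random instance is purely illustrative.

    import numpy as np

    rng = np.random.default_rng(3)
    S, A = 4, 3
    P_h = rng.dirichlet(np.ones(S), size=(S, A))       # P_h[s, a] is a distribution over s'
    d = S * A

    def phi(s, a):
        e = np.zeros(d)
        e[s * A + a] = 1.0                             # indicator feature e_{s,a}
        return e

    # mu_h[s'] stacks P_h(s' | s, a) over (s, a), flattened consistently with phi.
    mu_h = np.array([P_h[:, :, s_next].reshape(d) for s_next in range(S)])

    s, a, s_next = 1, 2, 0
    assert np.isclose(phi(s, a) @ mu_h[s_next], P_h[s, a, s_next])
    print("tabular MDP embeds as a low-rank MDP with d = S*A =", d)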
Properties of low-rank MDPs. The linear structure of the transition probabilities and mean rewards is a significantly more stringent assumption than linearity of Q^{M,⋆}_h(s, a) in (7.1). Notably, it implies that Bellman backups of arbitrary functions are linear.
Lemma 25: For any low-rank MDP M ∈ M, any Q : S × A → R, and any h ∈ [H], the Bellman operator is linear in ϕ:

[T^M_h Q](s, a) = ⟨ϕ(s, a), θ^M_Q⟩.

As a special case, this lemma implies that for low-rank MDPs, Q^{M,π}_h is linear for all π. In particular, when Q takes values in [0, 1], we have

∥θ^M_Q∥ ≤ ∥w^M_h∥ + ∥Σ_{s′} µ^M_h(s′) max_{a′} Q(s′, a′)∥ ≤ 2√d,   (7.7)

since µ^M_h is a vector of distributions on S.
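For intuition, Lemma 25 can be seen via a short calculation using the low-rank structure (7.3):

[T^M_h Q](s, a) = E^M[r_h | s, a] + Σ_{s′} P^M_h(s′ | s, a) max_{a′} Q(s′, a′)
= ⟨ϕ(s, a), w^M_h⟩ + Σ_{s′} ⟨ϕ(s, a), µ^M_h(s′)⟩ max_{a′} Q(s′, a′)
= ⟨ϕ(s, a), w^M_h + Σ_{s′} µ^M_h(s′) max_{a′} Q(s′, a′)⟩,

so that θ^M_Q = w^M_h + Σ_{s′} µ^M_h(s′) max_{a′} Q(s′, a′), which also explains the bound (7.7).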
The LSVI-UCB algorithm, displayed below, was introduced in the influential paper of Jin et al. [48]. Similar to the UCB-VI algorithm we analyzed for tabular RL, the main idea behind the algorithm is to compute a state-action value function Q^t with the optimistic property that

Q^t_h(s, a) ≥ Q^{M,⋆}_h(s, a)

for all s, a, h. This is achieved by combining the principle of dynamic programming with an appropriate choice of bonus to ensure optimism. However, unlike UCB-VI, the algorithm does not directly estimate transition probabilities (which is not feasible when µ^M is unknown), and instead implements approximate value iteration by solving a certain least squares objective.
LSVI-UCB
Input: R, ρ > 0.
for t = 1, . . . , T do
  Let Q^t_{H+1} ≡ 0.
  for h = H, . . . , 1 do
    Compute least-squares estimator
      θ̂^t_h = argmin_{θ∈B^d_2(ρ)} Σ_{i<t} (⟨ϕ(s^i_h, a^i_h), θ⟩ − r^i_h − max_a Q^t_{h+1}(s^i_{h+1}, a))².
    Compute bonus:
      b^t_{h,δ}(s, a) = √R ∥ϕ(s, a)∥_{(Σ^t_h)^{−1}}.
    Let Q^t_h(s, a) = {⟨ϕ(s, a), θ̂^t_h⟩ + b^t_{h,δ}(s, a)} ∧ 1.
  Execute π̂^t with π̂^t_h(s) = argmax_{a∈A} Q^t_h(s, a) and observe the resulting trajectory.
In more detail, for each episode t, the algorithm computes Q^t_1, . . . , Q^t_H through approximate dynamic programming. At layer h, given Q^t_{h+1}, the algorithm computes a linear Q-function Q̂^t_h(s, a) := ⟨ϕ(s, a), θ̂^t_h⟩ by solving a least squares problem in which X = ϕ(s_h, a_h) is the feature vector and Y = r_h + max_a Q^t_{h+1}(s_{h+1}, a) is the target/outcome. This is motivated by Lemma 25, which asserts that the Bellman backup [T^M_h Q^t_{h+1}](s, a) is linear. Given Q̂^t_h, the algorithm forms the optimistic estimate Q^t_h via

Q^t_h(s, a) = {Q̂^t_h(s, a) + b^t_{h,δ}(s, a)} ∧ 1,

where

b^t_{h,δ}(s, a) = √R ∥ϕ(s, a)∥_{(Σ^t_h)^{−1}}, with Σ^t_h = Σ_{i<t} ϕ(s^i_h, a^i_h)ϕ(s^i_h, a^i_h)^⊤ + I,

is an elliptic bonus analogous to the bonus used within LinUCB. With this, the algorithm proceeds to the next layer h − 1. Once Q^t is computed for every layer, the algorithm executes the optimistic policy π̂^t given by π̂^t_h(s) = argmax_{a∈A} Q^t_h(s, a).
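To make the backward/forward structure concrete, the following is a minimal Python sketch of the scheme just described (a sketch of ours, not a reference implementation). It assumes a finite action set {0, . . . , A−1}, a known feature map phi(s, a) returning a d-dimensional numpy array, and a hypothetical episodic environment env with reset() and step(a) methods; the constrained least-squares step is approximated by a ridge solution followed by projection onto B^d_2(ρ).

    import numpy as np

    def lsvi_ucb(env, phi, d, H, T, A, R_bonus, rho):
        """Run LSVI-UCB for T episodes; returns the reward collected in each episode."""
        data = [[] for _ in range(H)]          # data[h] holds (phi(s_h, a_h), r_h, s_{h+1}) tuples
        episode_rewards = []
        for t in range(T):
            # Backward pass: compute optimistic Q^t_h for h = H, ..., 1 via regularized least squares.
            theta = [np.zeros(d) for _ in range(H)]
            Sigma_inv = [np.eye(d) for _ in range(H)]

            def Q(h, s, a):
                if h >= H:                                     # Q^t_{H+1} = 0 by convention
                    return 0.0
                x = phi(s, a)
                bonus = np.sqrt(R_bonus) * np.sqrt(x @ Sigma_inv[h] @ x)
                return min(x @ theta[h] + bonus, 1.0)          # optimistic estimate, clipped at 1

            for h in reversed(range(H)):
                Sigma = np.eye(d)                              # Sigma^t_h = I + sum_{i<t} phi phi^T
                target = np.zeros(d)
                for (x, r, s_next) in data[h]:
                    y = r + max(Q(h + 1, s_next, a) for a in range(A))
                    Sigma += np.outer(x, x)
                    target += y * x
                theta_h = np.linalg.solve(Sigma, target)       # ridge stand-in for the constrained LS
                norm = np.linalg.norm(theta_h)
                if norm > rho:                                 # project onto the ball B(rho)
                    theta_h *= rho / norm
                theta[h] = theta_h
                Sigma_inv[h] = np.linalg.inv(Sigma)

            # Forward pass: execute the optimistic greedy policy and record the new trajectory.
            s = env.reset()
            total = 0.0
            for h in range(H):
                a = max(range(A), key=lambda act: Q(h, s, act))
                s_next, r = env.step(a)
                data[h].append((phi(s, a), r, s_next))
                total += r
                s = s_next
            episode_rewards.append(total)
        return episode_rewards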
The LSVI-UCB algorithm enjoys the following regret bound.
Proposition 46: For any δ > 0, if we set R = c · d² log(HT/δ) for a sufficiently large numerical constant c and ρ = 2√d, then LSVI-UCB ensures that with probability at least 1 − δ,

Reg ≲ H√(d³ · T log(HT/δ)).   (7.8)
policy π̂. In order to control these residuals, we constructed an estimated model M̂ and defined empirical Bellman operators T̂_h in terms of estimated transition kernels. We then set Q_h to be the empirical Bellman backup [T^{M̂}_h Q_{h+1}], plus an optimistic bonus term. In contrast, LSVI-UCB does not directly estimate the model. Instead, it performs regression with a target that is an empirical Bellman backup. As we shall see shortly, subtleties arise in the analysis of this regression step due to lack of independence.
Technical lemmas for regression. Recall from Lemma 25 that for any fixed Q : S × A → R,

E^M[ r^i_h + max_a Q(s^i_{h+1}, a) | s^i_h, a^i_h ] = [T^M_h Q](s^i_h, a^i_h).   (7.9)

However, for layer h, the regression problem within LSVI-UCB concerns a data-dependent function Q = Q^t_{h+1} (with i < t), which is chosen as a function of all the trajectories τ^1, . . . , τ^{t−1}. This dependence implies that the regression problem solved by LSVI-UCB is not of the type studied, say, in Proposition 1. Instead, in the language of Section 1.4, the mean of the outcome variable is itself a function that depends on all the data. The saving grace here is that this dependence does not result in arbitrarily complex functions, which will allow us to appeal to uniform convergence arguments. In particular, for every h and t, Q^t_h belongs to the class

Q := {(s, a) ↦ {⟨θ, ϕ(s, a)⟩ + √R ∥ϕ(s, a)∥_{Σ^{−1}}} ∧ 1 : ∥θ∥ ≤ 2√d, σ_min(Σ) ≥ 1}.   (7.10)

To make use of this fact, we first state an abstract result concerning regression with dependent outcomes.
Lemma 26: Let G be an abstract set with |G| < ∞. Let x_1, . . . , x_T ∈ X be fixed, and for each g ∈ G, let y_1(g), . . . , y_T(g) ∈ R be 1-subGaussian outcomes satisfying E[y_i(g) | x_i] = f_g(x_i) for f_g ∈ F ⊆ {f : X → R}. In addition, assume that y_1(g), . . . , y_T(g) are conditionally independent given x_1, . . . , x_T. For any latent g ∈ G, define the least-squares solution

f̂_g ∈ argmin_{f∈F} Σ_{i=1}^T (y_i(g) − f(x_i))².
The random vector Y_g − f_g has independent zero-mean 1-subGaussian entries by assumption, while the multiplier (f − f_g)/∥f − f_g∥_T is simply a T-dimensional vector of Euclidean length √T, for each f ∈ F. Hence, each inner product in (7.11) is sub-Gaussian with variance proxy 1/T (see Definition 2). Thus, with probability at least 1 − δ, the maximum on the right-hand side does not exceed C√(log(|F|/δ)/T) for an appropriate constant C. Taking the union bound over g and squaring both sides of (7.11) yields the desired bound.
Lemma 27: With probability at least 1 − δ, we have that for all t and h,

Σ_{i<t} (Q̂^t_h(s^i_h, a^i_h) − [T^M_h Q^t_{h+1}](s^i_h, a^i_h))² ≲ d² log(HT/δ).   (7.12)
Proof sketch for Lemma 27. Let t ∈ [T] and h ∈ [H] be fixed. To make the correspondence with Lemma 26 explicit, for the data (s^i_h, a^i_h, s^i_{h+1}, r^i_h), we define x_i = ϕ(s^i_h, a^i_h) and y_i(Q) = r^i_h + max_a Q(s^i_{h+1}, a), with Q ∈ Q playing the role of the index g ∈ G. With this, we have

E[y_i(Q) | x_i] = E^M[ r^i_h + max_a Q(s^i_{h+1}, a) | s^i_h, a^i_h ] = [T^M_h Q](s^i_h, a^i_h) = ⟨ϕ(s^i_h, a^i_h), θ^M_Q⟩,

which is linear in x_i = ϕ(s^i_h, a^i_h), with the vector of coefficients θ^M_Q depending on Q. The regression problem is well-specified as long as we choose

F = {ϕ(s, a) ↦ ⟨ϕ(s, a), θ⟩ : ∥θ∥ ≤ 2√d}

and Q as in (7.10). While both of these sets are infinite, we can appeal to a standard covering number argument at an appropriate scale ε. The cardinalities of the ε-discretized classes can be shown to be of size Õ(d) and Õ(d²), respectively, up to factors logarithmic in 1/ε and d. The statement follows after checking that discretization incurs a small price due to Lipschitzness with respect to parameters. Finally, we union bound over t and h.
Establishing optimism. The next lemma shows that closeness of the regression estimate to the Bellman backup on the data {(s^i_h, a^i_h)}_{i<t} translates into closeness at an arbitrary (s, a) pair as long as ϕ(s, a) is sufficiently covered by the data collected so far. This, in turn, implies that Q^t_1, . . . , Q^t_H are optimistic.

Lemma 28: Whenever the event in Lemma 27 occurs, we have that for all (s, a, h) and t ∈ [T],

|Q̂^t_h(s, a) − [T^M_h Q^t_{h+1}](s, a)| ≲ √(d² log(HT/δ)) · ∥ϕ(s, a)∥_{(Σ^t_h)^{−1}} =: b^t_{h,δ}(s, a),   (7.13)

and

Q^t_h(s, a) ≥ Q^{M,⋆}_h(s, a).   (7.14)
Proof of Lemma 28. Writing the Bellman backup, via Lemma 25, as

[T^M_h Q^t_{h+1}](s, a) = ⟨ϕ(s, a), θ^t_h⟩

for some θ^t_h ∈ R^d with ∥θ^t_h∥_2 ≤ 2√d, we have that

Q̂^t_h(s, a) − [T^M_h Q^t_{h+1}](s, a) = ⟨ϕ(s, a), θ̂^t_h − θ^t_h⟩ = ⟨(Σ^t_h)^{−1/2}ϕ(s, a), (Σ^t_h)^{1/2}(θ̂^t_h − θ^t_h)⟩,

and ∥θ̂^t_h − θ^t_h∥² ≲ d by (7.7).

To show (7.14), we proceed by induction on V̄^t_h ≥ V^{M,⋆}_h, as in the proof of Lemma 16. We start with the base case h = H + 1, which has V̄^t_{H+1} = V^{M,⋆}_{H+1} ≡ 0. Assuming V̄^t_{h+1} ≥ V^{M,⋆}_{h+1}, we first observe that T^M_h is monotone and T^M_h V̄^t_{h+1} ≥ T^M_h V^{M,⋆}_{h+1} = Q^{M,⋆}_h. Hence,

Q̂^t_h = Q̂^t_h − T^M_h V̄^t_{h+1} + T^M_h V̄^t_{h+1}   (7.17)
≥ Q̂^t_h − T^M_h V̄^t_{h+1} + Q^{M,⋆}_h   (7.18)
≥ −b^t_{h,δ} + Q^{M,⋆}_h,   (7.19)

and thus Q̂^t_h + b^t_{h,δ} ≥ Q^{M,⋆}_h. Since Q^{M,⋆}_h ≤ 1, the clipped version Q^t_h also satisfies Q^t_h ≥ Q^{M,⋆}_h. This, in turn, implies V̄^t_h ≥ V^{M,⋆}_h.
Finishing the proof. With the technical results above established, the proof of Proposition 46 follows fairly quickly.

Proof of Proposition 46. Let M be the true model. Condition on the event in Lemma 27. Then, since Q^t is optimistic by Lemma 28, we have that for each timestep t,

f^M(π_M) − f^M(π̂^t) ≤ E_{s_1∼d_1}[V̄^t_1(s_1)] − f^M(π̂^t)
= Σ_{h=1}^H E^{M,π̂^t}[ Q^t_h(s_h, a_h) − [T^M_h Q^t_{h+1}](s_h, a_h) ]

by Lemma 14. Using the definition of b^t_{h,δ} and Lemma 28, we have

Σ_{h=1}^H E^{M,π̂^t}[ Q^t_h(s_h, a_h) − [T^M_h Q^t_{h+1}](s_h, a_h) ] ≲ √R Σ_{h=1}^H E^{M,π̂^t}[ ∥ϕ(s_h, a_h)∥_{(Σ^t_h)^{−1}} ].

Summing over all timesteps t gives

Reg ≤ √R Σ_{t=1}^T Σ_{h=1}^H E^{M,π̂^t}[ ∥ϕ(s_h, a_h)∥_{(Σ^t_h)^{−1}} ].
• Low-Rank MDPs [87, 48, 5].
• Block MDPs and reactive POMDPs [55, 33].
• MDPs with Linear Q⋆ and V⋆ [34].
• MDPs with low occupancy complexity [34].
• Linear mixture MDPs [65, 15].
• Linear dynamical systems (LQR) [30].
Building intuition. Bellman rank is a property of the underlying MDP M⋆ which gives a way of controlling distribution shift—that is, how many times a deliberate algorithm can be surprised by a substantially new state distribution d^{M,π} when it updates its policy. To motivate the property, let us revisit the low-rank MDP model. Let M be a low-rank MDP with feature map ϕ(s, a) ∈ R^d, and let Q_h(s, a) = ⟨ϕ(s, a), θ^Q_h⟩ be an arbitrary linear value function. Observe that since M is a Low-Rank MDP, we have [T^M_h Q](s, a) = ⟨ϕ(s, a), θ̃^{M,Q}_h⟩, where θ̃^{M,Q}_h := w^M_h + ∫ µ^M_h(s′) max_{a′} Q_{h+1}(s′, a′) ds′. As a result, for any policy π, we can write the Bellman residual for Q as

E^{M,π}[ Q_h(s_h, a_h) − r_h − max_a Q_{h+1}(s_{h+1}, a) ] = ⟨E^{M,π}[ϕ(s_h, a_h)], θ^Q_h − θ̃^{M,Q}_h⟩   (7.20)
= ⟨X^M_h(π), W^M_h(Q)⟩.   (7.21)

If we view the Bellman residual E_h(π, Q) as a bivariate function of (π, Q), then the property (7.21) implies that rank(E_h(·, ·)) ≤ d. Bellman rank is an abstraction of this property.16
Definition 8 (Bellman rank): For an MDP M with value function class Q and policy class Π, the Bellman rank is defined as

d_B(M) := max_{h∈[H]} rank(E_h(·, ·)).

Equivalently, Bellman rank is the smallest dimension d such that for all h, there exist embeddings X^M_h(π), W^M_h(Q) ∈ R^d such that

E_h(π, Q) = ⟨X^M_h(π), W^M_h(Q)⟩ for all π ∈ Π and Q ∈ Q.   (7.24)

16 Bellman rank was originally introduced in the pioneering work of Jiang et al. [46]. The definition of Bellman rank we present, which is slightly different from the original definition, is taken from the later work of Du et al. [34], Jin et al. [49], and is often referred to as Q-type Bellman rank.
The utility of Bellman rank is that the factorization in (7.24) gives a way of controlling
distribution shift in the MDP M , which facilitates the application of standard generaliza-
tion guarantees for supervised learning/estimation. Informally, there are only d effective
directions in which we can be “surprised” by the state distribution induced by a policy π,
to the extent that this matters for the class Q under consideration; this property was used
implicitly in the proof of the regret bound for LSVI-UCB. As we will see, low Bellman rank
is satisfied in many settings that go beyond the Low-Rank MDP model.
Like many of the algorithms we have covered, BiLinUCB is based on confidence sets and
optimism, though the way we will construct the confidence sets and implement optimism is
new.
PAC versus regret. For technical reasons, we will not directly give a regret bound for BiLinUCB. Instead, we will prove a PAC ("probably approximately correct") guarantee. For PAC, the algorithm plays for T episodes, then outputs a final policy π̂, and its performance is measured via

f^{M⋆}(π_{M⋆}) − f^{M⋆}(π̂).   (7.25)

That is, instead of considering cumulative performance as with regret, we are only concerned with final performance. For PAC, we want to ensure that f^{M⋆}(π_{M⋆}) − f^{M⋆}(π̂) ≤ ε for some ε ≪ 1 using a number of episodes that is polynomial in ε^{−1} and other problem parameters. This is an easier task than achieving low regret: If we have an algorithm that ensures that E[Reg] ≲ √(CT) for some problem-dependent constant C, we can turn this into an algorithm that achieves PAC error ε using O(C/ε²) episodes via online-to-batch conversion. In the other direction, if we have an algorithm that achieves PAC error ε using O(C/ε²) episodes, one can use this to achieve E[Reg] ≲ C^{1/3}T^{2/3} using a simple explore-then-commit approach; this is lossy, but is the best one can hope for in general.
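To spell out the explore-then-commit conversion sketched above: run the PAC algorithm for T_0 = C/ε² episodes to obtain an ε-optimal policy π̂, then play π̂ for the remaining T − T_0 episodes. Each exploration episode contributes at most 1 to the regret (rewards lie in [0, 1]) and each commit episode contributes at most ε, so

E[Reg] ≲ T_0 + εT = C/ε² + εT.

Choosing ε = (C/T)^{1/3} balances the two terms and gives E[Reg] ≲ C^{1/3}T^{2/3}.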
• Given the current confidence set Q^k, the algorithm computes a value function Q^k and corresponding policy π^k := π_{Q^k} that is optimistic on average. The main novelty here is that we are only aiming for optimism with respect to the initial state distribution.

• Using the new policy π^k, the algorithm gathers n episodes and uses these to compute estimators {Ê^k_h(Q)}_{h∈[H]} which approximate the Bellman residual E_h(π^k, Q) for all Q ∈ Q. Then, in (7.26), the algorithm computes the new confidence set Q^{k+1} by restricting to value functions for which the estimated Bellman residual is small for π^1, . . . , π^k. Eliminating value functions with large Bellman residual is a natural idea, because we know from the Bellman equation that Q^{M⋆,⋆} has zero Bellman residual.
BiLinUCB
Input: β > 0, iteration count K ∈ N, batch size n ∈ N.
Q^1 ← Q.
for iteration k = 1, . . . , K do
  Compute optimistic value function:
    Q^k = argmax_{Q∈Q^k} E_{s_1∼d_1}[Q_1(s_1, π_Q(s_1))], and set π^k := π_{Q^k}.
  Execute π^k for n episodes, observing trajectories {(s^{k,l}_1, a^{k,l}_1, r^{k,l}_1), . . . , (s^{k,l}_H, a^{k,l}_H, r^{k,l}_H)}_{l∈[n]}.
  Compute the updated confidence set
    Q^{k+1} = {Q ∈ Q : Σ_{i≤k} (Ê^i_h(Q))² ≤ β ∀h ∈ [H]},   (7.26)
  where
    Ê^i_h(Q) := (1/n) Σ_{l=1}^n [ Q_h(s^{i,l}_h, a^{i,l}_h) − r^{i,l}_h − max_{a∈A} Q_{h+1}(s^{i,l}_{h+1}, a) ].
Let k̂ = argmax_{k∈[K]} V̂^k, where V̂^k := (1/n) Σ_{l=1}^n Σ_{h=1}^H r^{k,l}_h.
Return π̂ = π^{k̂}.
Main guarantee. The main result for this section is the following PAC guarantee for
BiLinUCB.
Proposition 47: Suppose that M⋆ has Bellman rank d and Q^{M⋆,⋆} ∈ Q. For any ε > 0 and δ > 0, if we set n ≳ H³d log(|Q|/δ)/ε², K ≳ Hd log(1 + n/d), and β = c·(K log|Q| + log(HK/δ))/n, then BiLinUCB learns a policy π̂ such that

f^{M⋆}(π_{M⋆}) − f^{M⋆}(π̂) ≤ ε

using at most K · n episodes.
This result shows that low Bellman rank suffices to learn a near-optimal policy, with sample
complexity that only depends on the rank d, the horizon H, and the capacity log|Q| for the
value function class; this reflects that the algorithm is able to generalize across the state
space, with d and log|Q| controlling the degree of generalization. The basic principles at
play are:
• BiLinUCB only requires optimism with respect to the initial state distribution ("optimism on average"), while LSVI-UCB aims to find a value function that is uniformly optimistic for all states and actions.

• The confidence set construction (7.26) explicitly removes value functions that have large Bellman residual on the policies encountered so far. The key role of the Bellman rank property is to ensure that there are only Õ(d) "effective" state distributions that lead to substantially different values for the Bellman residual, which means that eventually, only value functions with low residual will remain.
Interestingly, the Bellman rank property is only used for analysis, and the algorithm does
not explicitly compute or estimate the factorization.
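The following Python sketch (ours, not a reference implementation) mirrors this structure for a finite value-function class. It assumes integer-indexed states, a finite action set, value functions given as arrays Q[h][s, a], and a hypothetical episodic environment env with reset() and step(a); the initial value used for the optimistic step is estimated from a handful of sampled initial states.

    import numpy as np

    def bilinucb(env, Q_class, H, K, n, beta):
        n_Q = len(Q_class)
        est_residuals = [[] for _ in range(n_Q)]       # per Q: list over iterations of \hat{E}_h(Q)
        confidence_set = list(range(n_Q))              # indices into Q_class; Q^1 = Q
        policies, value_estimates = [], []

        def greedy_action(Q, h, s):
            return int(np.argmax(Q[h][s]))

        for k in range(K):
            # Optimistic step: choose the Q in the confidence set with the largest (estimated)
            # initial value E_{s1 ~ d1}[Q_1(s_1, pi_Q(s_1))], estimated here from sampled resets.
            def initial_value(j):
                states = [env.reset() for _ in range(20)]
                Q = Q_class[j]
                return np.mean([Q[0][s][greedy_action(Q, 0, s)] for s in states])
            jk = max(confidence_set, key=initial_value)
            Qk = Q_class[jk]
            policies.append(jk)

            # Gather n episodes with pi_k := pi_{Q^k}; estimate Bellman residuals for every Q.
            residual_sums = np.zeros((n_Q, H))
            total_reward = 0.0
            for _ in range(n):
                s = env.reset()
                for h in range(H):
                    a = greedy_action(Qk, h, s)
                    s_next, r = env.step(a)
                    total_reward += r
                    for j, Q in enumerate(Q_class):
                        next_val = 0.0 if h + 1 == H else np.max(Q[h + 1][s_next])
                        residual_sums[j, h] += Q[h][s][a] - r - next_val
                    s = s_next
            value_estimates.append(total_reward / n)
            for j in range(n_Q):
                est_residuals[j].append(residual_sums[j] / n)

            # Confidence set update (7.26): keep value functions whose accumulated squared
            # estimated residuals are at most beta at every layer h.
            confidence_set = [j for j in range(n_Q)
                              if all(sum(e[h] ** 2 for e in est_residuals[j]) <= beta
                                     for h in range(H))]

        k_hat = int(np.argmax(value_estimates))        # \hat{k} = argmax_k \hat{V}^k
        return Q_class[policies[k_hat]]                # act greedily w.r.t. this Q to obtain pi_hat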
Regret bounds. The BiLinUCB algorithm can be lifted to provide a regret guarantee via an explore-then-commit strategy: Run the algorithm for T_0 episodes to learn an ε-optimal policy, then commit to this policy for the remaining rounds. It is a simple exercise to show that by choosing T_0 appropriately, this approach gives

Reg ≤ Õ((H⁴d² log(|Q|/δ))^{1/3} · T^{2/3}).

Under an additional assumption known as Bellman completeness, it is possible to attain √T regret with a variant of this algorithm that uses a slightly different confidence set construction [49].
Technical lemmas. Before proceeding, we state two technical lemmas. The first lemma
establishes validity for the confidence set Qk constructed by BiLinUCB.
Lemma 29: For any δ > 0, if we set β = c·(K log|Q| + log(HK/δ))/n, where c > 0 is a sufficiently large absolute constant, then with probability at least 1 − δ, for all k ∈ [K]:

1. All Q ∈ Q^k have
   Σ_{i<k} (E_h(π^i, Q))² ≲ β   ∀h ∈ [H].   (7.27)

2. Q^{M⋆,⋆} ∈ Q^k.
Proof of Lemma 29. Using Hoeffding's inequality and a union bound (Lemma 3), we have that with probability at least 1 − δ, for all k ∈ [K], h ∈ [H], and Q ∈ Q,

|Ê^k_h(Q) − E_h(π^k, Q)| ≤ C·√(log(|Q|HK/δ)/n),   (7.28)

where C is an absolute constant.

To prove Part 1, we observe that for all k, using the AM-GM inequality, we have that for all Q ∈ Q,

Σ_{i<k} (E_h(π^i, Q))² ≤ 2 Σ_{i<k} (Ê^i_h(Q))² + 2 Σ_{i<k} (E_h(π^i, Q) − Ê^i_h(Q))².

For Q ∈ Q^k, the definition of Q^k implies that Σ_{i<k}(Ê^i_h(Q))² ≤ β, while (7.28) implies that Σ_{i<k}(E_h(π^i, Q) − Ê^i_h(Q))² ≲ β, which gives the result.

For Part 2, we similarly observe that for all k, h and Q ∈ Q,

Σ_{i<k} (Ê^i_h(Q))² ≤ 2 Σ_{i<k} (E_h(π^i, Q))² + 2 Σ_{i<k} (E_h(π^i, Q) − Ê^i_h(Q))².

Since Q^{M⋆,⋆} has E_h(π, Q^{M⋆,⋆}) = 0 for all π by Bellman optimality, we have

Σ_{i<k} (Ê^i_h(Q^{M⋆,⋆}))² ≤ 2 Σ_{i<k} (E_h(π^i, Q^{M⋆,⋆}) − Ê^i_h(Q^{M⋆,⋆}))² ≤ 2C² K log(|Q|HK/δ)/n,

where the last inequality uses (7.28). It follows that as long as β ≥ 2C² K log(|Q|HK/δ)/n, we will have Q^{M⋆,⋆} ∈ Q^k for all k.
k
The next result shows that whenever the event in the previous lemma holds, the value
functions constructed by BiLinUCB are optimistic.
Lemma 30: Whenever the event in Lemma 29 occurs, the following properties hold:
1. Define
⋆ ⋆
X
Σkh = XhM (π i )XhM (π i )⊤ . (7.29)
i<k
142
2. For all k, Qk is optimistic in the sense that
⋆ ,⋆ ⋆
Es1 ∼d1 [Qk1 (s1 , πQ (s1 ))] ≥ Es1 ∼d1 QM
1 (s1 , πM ⋆ (s1 )) = f M (πM ⋆ ). (7.31)
Proof of Lemma 30. For Part 1, recall that by the Bilinear class property, we can write E_h(π^k, Q) = ⟨X^{M⋆}_h(π^k), W^{M⋆}_h(Q)⟩, so that (7.27) implies that

∥W^{M⋆}_h(Q)∥²_{Σ^k_h} = Σ_{i<k} ⟨X^{M⋆}_h(π^i), W^{M⋆}_h(Q)⟩² = Σ_{i<k} (E_h(π^i, Q))² ≲ β.

For Part 2, we observe that for all k, since Q^{M⋆,⋆} ∈ Q^k, we have

E_{s_1∼d_1}[Q^k_1(s_1, π_{Q^k}(s_1))] = sup_{Q∈Q^k} E_{s_1∼d_1}[Q_1(s_1, π_Q(s_1))]
≥ E_{s_1∼d_1}[Q^{M⋆,⋆}_1(s_1, π_{M⋆}(s_1))]
= f^{M⋆}(π_{M⋆}).
Proving the main result. Equipped with the lemmas above, we prove Proposition 47.

Proof of Proposition 47. We first prove a generic bound on the suboptimality of each policy π^k for k ∈ [K]. Let us condition on the event in Lemma 29, which occurs with probability at least 1 − δ. Whenever this event occurs, Lemma 30 implies that Q^k is optimistic, so we can bound

f^{M⋆}(π_{M⋆}) − f^{M⋆}(π^k) ≤ E_{s_1∼d_1}[Q^k_1(s_1, π_{Q^k}(s_1))] − f^{M⋆}(π^k)   (7.32)
= Σ_{h=1}^H E^{M⋆,π^k}[ Q^k_h(s_h, a_h) − r_h − max_{a∈A} Q^k_{h+1}(s_{h+1}, a) ]   (7.33)
= Σ_{h=1}^H ⟨X^{M⋆}_h(π^k), W^{M⋆}_h(Q^k)⟩,   (7.34)

where the first equality uses the Bellman residual decomposition (Lemma 14), and the second equality uses the Bellman rank assumption. For any λ ≥ 0, using Cauchy-Schwarz, we can bound

Σ_{h=1}^H ⟨X^{M⋆}_h(π^k), W^{M⋆}_h(Q^k)⟩ ≤ Σ_{h=1}^H ∥X^{M⋆}_h(π^k)∥_{(λI+Σ^k_h)^{−1}} ∥W^{M⋆}_h(Q^k)∥_{λI+Σ^k_h}.

Combining this with Lemma 30, which controls ∥W^{M⋆}_h(Q^k)∥_{Σ^k_h}, gives

Σ_{h=1}^H ⟨X^{M⋆}_h(π^k), W^{M⋆}_h(Q^k)⟩ ≲ (λ^{1/2} + β^{1/2}) · Σ_{h=1}^H ∥X^{M⋆}_h(π^k)∥_{(λI+Σ^k_h)^{−1}}.   (7.35)

If we can find a policy π^k for which the right-hand side of (7.35) is small, this policy will be guaranteed to have low regret. The following lemma shows that such a policy is guaranteed to exist.
Lemma 31: For any λ > 0, as long as K ≥ Hd log(1 + λ^{−1}K/d), there exists k ∈ [K] such that

∥X^{M⋆}_h(π^k)∥²_{(λI+Σ^k_h)^{−1}} ≲ Hd log(1 + λ^{−1}K/d)/K   ∀h ∈ [H].   (7.36)

as desired.

Finally, we need to argue that the policy π̂ returned by the algorithm is at least as good as π^k. This is straightforward and we only sketch the argument: By Hoeffding's inequality and a union bound, we have that with probability at least 1 − δ, for all k,

|f^{M⋆}(π^k) − V̂^k| ≲ √(log(K/δ)/n),

which implies that f^{M⋆}(π̂) ≳ f^{M⋆}(π^k) − √(log(K/δ)/n). The error term here is of lower order than (7.37).
Proof of Lemma 31. To prove the result, we need a variant of the elliptic potential lemma (Lemma 11).

Σ_{t=1}^T log(1 + ∥a_t∥²_{V_t^{−1}}) ≤ d log(1 + λ^{−1}T/d).   (7.38)

For any λ > 0, applying this result for each h ∈ [H] and summing gives

Σ_{k=1}^K Σ_{h=1}^H log(1 + ∥X^{M⋆}_h(π^k)∥²_{(λI+Σ^k_h)^{−1}}) ≤ Hd log(1 + λ^{−1}K/d).

By the pigeonhole principle, there exists k ∈ [K] for which Σ_{h=1}^H log(1 + ∥X^{M⋆}_h(π^k)∥²_{(λI+Σ^k_h)^{−1}}) ≤ Hd log(1 + λ^{−1}K/d)/K, which means that for all h ∈ [H], log(1 + ∥X^{M⋆}_h(π^k)∥²_{(λI+Σ^k_h)^{−1}}) ≤ Hd log(1 + λ^{−1}K/d)/K, or equivalently:

∥X^{M⋆}_h(π^k)∥²_{(λI+Σ^k_h)^{−1}} ≤ exp(Hd log(1 + λ^{−1}K/d)/K) − 1.
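As a quick numerical sanity check of the elliptic potential bound (7.38) (a sketch of ours, not part of the proof), the following Python snippet accumulates the potential for random unit-norm vectors a_t with V_t = λI + Σ_{i≤t} a_i a_i^⊤ and compares it against d log(1 + λ^{−1}T/d); the dimensions and sample counts are arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)
    d, T, lam = 5, 200, 1.0
    V = lam * np.eye(d)
    potential = 0.0
    for _ in range(T):
        a = rng.normal(size=d)
        a /= np.linalg.norm(a)               # unit-norm vector, so the bound applies
        V += np.outer(a, a)                  # V_t = lam * I + sum_{i<=t} a_i a_i^T
        potential += np.log(1.0 + a @ np.linalg.solve(V, a))
    print(potential, "<=", d * np.log(1 + T / (d * lam)))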
Example 7.1 (Tabular MDPs). For a tabular MDP M, choosing X^M_h(π) = (d^{M,π}_h(s, a))_{s∈S,a∈A} ∈ R^{SA} and

W^M_h(Q) = (E^M[ Q_h(s, a) − r_h − max_{a′} Q_{h+1}(s_{h+1}, a′) | s_h = s, a_h = a ])_{s∈S,a∈A} ∈ R^{SA},

we have

E_h(π, Q) = ⟨X^M_h(π), W^M_h(Q)⟩.

This shows that d_B(M) ≤ SA. ◁
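To see this factorization numerically, the following sketch (ours, not from the notes) generates a random tabular MDP, forms the matrix of Bellman residuals E_h(π_i, Q_j) over random stochastic policies π_i and random value functions Q_j, and checks that its numerical rank is at most SA; the layer index and random instance are arbitrary illustrative choices.

    import numpy as np

    rng = np.random.default_rng(1)
    S, A, H, h = 4, 3, 5, 2                               # small tabular MDP; inspect layer h
    P = rng.dirichlet(np.ones(S), size=(H, S, A))         # P[h, s, a] is a distribution over s'
    R = rng.uniform(0, 1.0 / H, size=(H, S, A))           # mean rewards
    d1 = rng.dirichlet(np.ones(S))                        # initial state distribution

    def occupancy(pi):
        """State-action occupancy d_h^{M,pi}(s, a) for a stochastic policy pi[h, s, a]."""
        d_state = d1.copy()
        for layer in range(h):
            d_sa = d_state[:, None] * pi[layer]           # joint distribution over (s, a)
            d_state = np.einsum("sa,sap->p", d_sa, P[layer])
        return d_state[:, None] * pi[h]

    def bellman_residual(pi, Q):
        """E_h(pi, Q) = E^{M,pi}[Q_h(s_h, a_h) - r_h - max_a Q_{h+1}(s_{h+1}, a)]."""
        backup = R[h] + np.einsum("sap,p->sa", P[h], Q[h + 1].max(axis=1))
        return float(np.sum(occupancy(pi) * (Q[h] - backup)))

    n_pi, n_Q = 30, 30
    policies = rng.dirichlet(np.ones(A), size=(n_pi, H, S))
    Qs = rng.uniform(0, 1, size=(n_Q, H + 1, S, A))
    E = np.array([[bellman_residual(policies[i], Qs[j]) for j in range(n_Q)] for i in range(n_pi)])
    print(np.linalg.matrix_rank(E, tol=1e-8), "<= S*A =", S * A)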
Example 7.2 (Low-Rank MDPs). The calculation in (7.21) shows that by choosing X^M_h(π) := E^{M,π}[ϕ(s_h, a_h)] ∈ R^d and W^M_h(Q) := θ^Q_h − θ̃^{M,Q}_h ∈ R^d, any Low-Rank MDP M has d_B(M) ≤ d. When specialized to this setting, the regret of BiLinUCB is worse than that of LSVI-UCB (though still polynomial in all of the problem parameters). This is because BiLinUCB is a more general algorithm, and does not take advantage of an additional feature of the Low-Rank MDP model known as Bellman completeness: If M is a Low-Rank MDP, then for all Q ∈ Q, we have T^M_h Q_{h+1} ∈ Q_h. By using a more specialized relative of BiLinUCB that incorporates a modified confidence set construction to exploit completeness, it is possible to match and actually improve upon the regret of LSVI-UCB [49]. ◁
We now explore Bellman rank for some MDP families that have not already been covered.
Example 7.3 (Low Occupancy Complexity). An MDP M is said to have low occupancy complexity if there exists a feature map ϕ^M(s, a) ∈ R^d such that for all π, there exists θ^{M,π}_h ∈ R^d such that

d^{M,π}_h(s, a) = ⟨ϕ^M(s, a), θ^{M,π}_h⟩.   (7.39)

Note that neither ϕ^M nor θ^{M,π} is assumed to be known to the learner. If M has low occupancy complexity, then for any value function Q and policy π, we have

E_h(π, Q) = E^{M,π}[ Q_h(s_h, a_h) − r_h − max_a Q_{h+1}(s_{h+1}, a) ]
= Σ_{s,a} d^{M,π}_h(s, a) E^M[ Q_h(s, a) − r_h − max_{a′} Q_{h+1}(s_{h+1}, a′) | s_h = s, a_h = a ]
= Σ_{s,a} ⟨ϕ^M(s, a), θ^{M,π}_h⟩ E^M[ Q_h(s, a) − r_h − max_{a′} Q_{h+1}(s_{h+1}, a′) | s_h = s, a_h = a ]
= ⟨θ^{M,π}_h, Σ_{s,a} ϕ^M(s, a) E^M[ Q_h(s, a) − r_h − max_{a′} Q_{h+1}(s_{h+1}, a′) | s_h = s, a_h = a ]⟩,

so that d_B(M) ≤ d. ◁
Example 7.4 (Linear dynamical systems). Consider a linear dynamical system in which

s_{h+1} = A^M s_h + B^M a_h + ζ_h,

where ζ_h ∼ N(0, I) and s_1 ∼ N(0, I). We assume that rewards have the form17

r_h = −s_h^⊤ Q^M s_h − a_h^⊤ R^M a_h

for matrices Q^M, R^M ⪰ 0. A classical result, dating back to Kalman, is that the optimal controller for this system is a linear mapping of the form π_{M,h}(s) = −K^M_h s, and that the value function Q^{M,⋆}_h(s, a) = (s, a)^⊤ P^M_h (s, a) is quadratic. Hence, it suffices to take Q to be the set of all quadratic functions in (s, a). With this choice, it can be shown that d_B(M) ≤ d² + 1. The basic idea is to choose X^M_h(π) = (vec(E^{M,π}[s_h s_h^⊤]), 1) and use the quadratic structure of the value functions. ◁

17 LQR is typically stated in terms of losses; we negate because we consider rewards.
Example 7.5 (Linear Q⋆/V⋆). In Proposition 45, we showed that for RL with linear function approximation, assuming only that Q^{M⋆,⋆} is linear is not enough to achieve low regret. It turns out that if we assume in addition that V^{M⋆,⋆} is linear, the situation improves.

Consider an MDP M. Assume that there exist known feature maps ϕ(s, a) ∈ R^d and ψ(s′) ∈ R^d such that

Q^{M,⋆}_h(s, a) = ⟨ϕ(s, a), θ^M_h⟩, and V^{M,⋆}_h(s) = ⟨ψ(s), w^M_h⟩.

Let

Q = {Q | Q_h(s, a) = ⟨ϕ(s, a), θ_h⟩ : θ_h ∈ R^d, ∃w: max_{a∈A}⟨ϕ(s, a), θ_h⟩ = ⟨ψ(s), w⟩ ∀s}.

Then d_B(M) ≤ 2d. We will not prove this result, but the basic idea is to choose X^M_h(π) = E^{M,π}[(ϕ(s_h, a_h), ψ(s_{h+1}))]. ◁
This is the same as the definition (7.24) (which is typically referred to as Q-type Bellman
rank), except that we take ah = πQ (sh ) instead of ah = π(sh ).18 With an appropriate
modification, BiLinUCB can be shown to give sample complexity guarantees that scale with
V-type Bellman rank instead of Q-type. This definition captures meaningful classes of
tractable RL models that are not captured by the Q-type definition (7.24), with a canonical
example being Block MDPs.
Example 7.6 (Block MDP). The Block MDP [46, 33, 64] is a model in which the (“ob-
served”) state space S is large/high-dimensional, but the dynamics are governed by a (small)
latent state space Z. Formally, a Block MDP M = (S, A, P, R, H, d1 ) is defined based on
an (unobserved) latent state space Z, with zh denoting the latent state at layer h. We first
describe the dynamics for the latent space. Given initial latent state z1 , the latent states
evolve via
z_{h+1} ∼ P^{M,latent}_h(z_h, a_h).

The latent state z_h is not observed. Instead, we observe

s_h ∼ q^M_h(z_h),

where q^M_h : Z → ∆(S) is an emission distribution with the property that supp(q_h(z)) ∩ supp(q_h(z′)) = ∅ if z ≠ z′. This property (decodability) ensures that there exists a unique mapping ϕ^M_h : S → Z that maps the observed state s_h to the corresponding latent state z_h. We assume that R^M_h(s, a) = R^M_h(ϕ^M_h(s), a), which implies that the optimal policy π_M depends only on the endogenous latent state, i.e. π_{M,h}(s) = π_{M,h}(ϕ^M_h(s)).

18 The name "V-type" refers to the fact that (7.40) only depends on Q through the induced V-function s ↦ Q_h(s, π_Q(s)), while (7.24) depends on the full Q-function, hence "Q-type".
The main challenge of learning in Block MDPs is that the decoder ϕ^M is not known to the learner in advance. Indeed, given access to the decoder, one can obtain regret poly(H, |Z|, |A|)·√T by applying tabular reinforcement learning algorithms to the latent state space. In light of this, the aim of the Block MDP setting is to obtain sample complexity guarantees that are independent of the size of the observed state space |S|, and scale as poly(|Z|, |A|, H, log|F|), where F is an appropriate class of function approximators (typically either a value function class Q containing Q^{M,⋆} or a class of decoders Φ that attempts to model ϕ^M directly).
We now show that the Block MDP setting admits low V-type Bellman rank. Observe that we can write

E^{M,π}[ E^M_{s_{h+1} | s_h, a_h=π_Q(s_h)}[ Q_h(s_h, π_Q(s_h)) − r_h − max_{a∈A} Q_{h+1}(s_{h+1}, a) ] ]
= Σ_{z∈Z} d^{M,π}_h(z) E_{s∼q^M_h(z)} E^M_{s_{h+1} | s_h, a_h=π_Q(s_h)}[ Q_h(s_h, π_Q(s_h)) − r_h − max_{a∈A} Q_{h+1}(s_{h+1}, a) ].

Defining X^M_h(π) = (d^{M,π}_h(z))_{z∈Z} and

W^M_h(Q) = ( E_{s∼q^M_h(z)} E^M_{s_{h+1} | s_h, a_h=π_Q(s_h)}[ Q_h(s_h, π_Q(s_h)) − r_h − max_{a∈A} Q_{h+1}(s_{h+1}, a) ] )_{z∈Z},

we see that the V-type Bellman rank is at most |Z|. This means that as long as Q contains Q^{M,⋆}, we can obtain sample complexity guarantees that scale with |Z| rather than |S|, as desired. ◁
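For concreteness, the following is a minimal generative sketch of a Block MDP (ours, not from the notes): a small latent chain over Z with disjoint-support emissions over a larger observed space, so that every observation decodes uniquely to its latent state. The uniform initial distribution, layer-independent emissions, and deterministic mean rewards are simplifying assumptions.

    import numpy as np

    rng = np.random.default_rng(2)
    Z, A, H = 3, 2, 4
    obs_per_latent = 5                                     # each latent state owns a disjoint block
    S = Z * obs_per_latent

    P_latent = rng.dirichlet(np.ones(Z), size=(H, Z, A))   # latent transition kernels
    R_latent = rng.uniform(0, 1.0 / H, size=(H, Z, A))     # rewards depend only on (latent state, action)

    def emit(z):
        """Sample an observation from the emission for latent z: block z owns [z*k, (z+1)*k)."""
        return z * obs_per_latent + int(rng.integers(obs_per_latent))

    def decode(s):
        """The decoder phi^M (unknown to the learner): observation -> latent state."""
        return s // obs_per_latent

    def rollout(policy):
        """Generate one episode under a policy mapping (h, s) -> action."""
        z = int(rng.integers(Z))                           # latent initial state (uniform, for simplicity)
        trajectory = []
        for h in range(H):
            s = emit(z)
            a = policy(h, s)
            r = R_latent[h, decode(s), a]
            trajectory.append((s, a, r))
            z = int(rng.choice(Z, p=P_latent[h, z, a]))
        return trajectory

    print(rollout(lambda h, s: int(rng.integers(A))))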
There are a number of variants of Bellman rank, including Bilinear rank [34] and
Bellman-Eluder dimension [49], which subsume and slightly generalize both Bellman rank
definitions.
Proposition 48: For any class of MDPs M for which all M ∈ M have Bellman rank at most d, we have

dec_γ(M) ≲ H²d/γ.   (7.41)
This implies that the E2D meta-algorithm has E[Reg] ≲ H√(dT · Est_H) whenever we have access to a realizable model class with low Bellman rank. As a special case, for any finite class M, using averaged exponential weights as an estimation oracle gives

E[Reg] ≲ H√(dT log|M|).   (7.42)
We will not prove Proposition 48, but interested readers can refer to Foster et al. [40].
The result can be proven using two approaches, both of which build on the techniques we
have already covered. The first approach is to apply a more general version of the PC-IGW
algorithm from Section 6.6, which incorporates optimal design in the space of policies. The
second approach is to move to the Bayesian DEC and appeal to posterior sampling, as in
Section 4.4.2.
which measures the squared Bellman residual for an estimated value function under M. With this choice, we appeal to the optimistic E2D algorithm (E2D.Opt) from Section 6.7.3. One can show that the optimistic DEC for this divergence is bounded as

o-dec^{D_sbr}_γ(M) ≲ H·d/γ.

This implies that E2D.Opt, with an appropriate choice of estimation algorithm tailored to D^π_sbr(· ∥ ·), achieves a corresponding regret bound. Note that due to the asymmetric nature of D^π_sbr(· ∥ ·), it is critical to appeal to optimistic estimation to derive this result. Indeed, the non-optimistic generalized DEC dec^{D_sbr}_γ does not enjoy a polynomial bound. See Foster et al. [41] for details.
References
[1] Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári. Improved algorithms for linear stochastic
bandits. In Advances in Neural Information Processing Systems, 2011.
[2] N. Abe and P. M. Long. Associative reinforcement learning using linear probabilis-
tic concepts. In Proceedings of the Sixteenth International Conference on Machine
Learning, pages 3–11. Morgan Kaufmann Publishers Inc., 1999.
[3] A. Agarwal and T. Zhang. Model-based RL with optimistic posterior sampling: Struc-
tural conditions and sample complexity. arXiv preprint arXiv:2206.07659, 2022.
[6] R. Agrawal. Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, 27(4):1054–1078, 1995.
[7] S. Agrawal and N. Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. In Conference on Learning Theory, pages 39–1. JMLR Workshop and Conference Proceedings, 2012.
[8] S. Agrawal and N. Goyal. Thompson sampling for contextual bandits with linear
payoffs. In International conference on machine learning, pages 127–135. PMLR, 2013.
[9] K. Amin, M. Kearns, and U. Syed. Bandits, query learning, and the haystack dimen-
sion. In Proceedings of the 24th Annual Conference on Learning Theory, pages 87–106.
JMLR Workshop and Conference Proceedings, 2011.
[11] J.-Y. Audibert and S. Bubeck. Minimax policies for adversarial and stochastic bandits.
In COLT, volume 7, pages 1–122, 2009.
[13] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit
problem. Machine learning, 47(2-3):235–256, 2002.
[14] P. Auer, R. Ortner, and C. Szepesvári. Improved rates for the stochastic continuum-
armed bandit problem. In International Conference on Computational Learning The-
ory, pages 454–468. Springer, 2007.
[16] M. G. Azar, I. Osband, and R. Munos. Minimax regret bounds for reinforcement
learning. In International Conference on Machine Learning, pages 263–272, 2017.
[17] R. Bellman. The theory of dynamic programming. Bulletin of the American Mathe-
matical Society, 60(6):503–515, 1954.
[18] B. Bilodeau, D. J. Foster, and D. Roy. Tight bounds on minimax regret under loga-
rithmic loss via self-concordance. In International Conference on Machine Learning,
2020.
[21] S. Bubeck and R. Eldan. Multi-scale exploration of convex functions and bandit convex
optimization. In Conference on Learning Theory, pages 583–589, 2016.
[22] S. Bubeck, O. Dekel, T. Koren, and Y. Peres. Bandit convex optimization: √T regret in one dimension. In Conference on Learning Theory, pages 266–278, 2015.
[23] S. Bubeck, Y. T. Lee, and R. Eldan. Kernel-based methods for bandit convex opti-
mization. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of
Computing, pages 72–85, 2017.
[24] N. Cesa-Bianchi and G. Lugosi. Minimax regret under log loss for general classes of
experts. In Conference on Computational Learning Theory, 1999.
[25] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge Univer-
sity Press, New York, NY, USA, 2006. ISBN 0521841089.
[26] W. Chu, L. Li, L. Reyzin, and R. E. Schapire. Contextual bandits with linear payoff
functions. In International Conference on Artificial Intelligence and Statistics, 2011.
[28] V. Dani, T. P. Hayes, and S. M. Kakade. Stochastic linear optimization under bandit
feedback. In Conference on Learning Theory (COLT), 2008.
[29] C. Dann, M. Mohri, T. Zhang, and J. Zimmert. A provably efficient model-free posterior
sampling method for episodic reinforcement learning. Advances in Neural Information
Processing Systems, 34:12040–12051, 2021.
[30] S. Dean, H. Mania, N. Matni, B. Recht, and S. Tu. On the sample complexity of the
linear quadratic regulator. Foundations of Computational Mathematics, 20(4):633–679,
2020.
[31] S. Dong, B. Van Roy, and Z. Zhou. Provably efficient reinforcement learning with
aggregated states. arXiv preprint arXiv:1912.06366, 2019.
[33] S. Du, A. Krishnamurthy, N. Jiang, A. Agarwal, M. Dudik, and J. Langford. Prov-
ably efficient RL with rich observations via latent state decoding. In International
Conference on Machine Learning, pages 1665–1674. PMLR, 2019.
[34] S. S. Du, S. M. Kakade, J. D. Lee, S. Lovett, G. Mahajan, W. Sun, and R. Wang. Bi-
linear classes: A structural framework for provable generalization in RL. International
Conference on Machine Learning, 2021.
[35] A. D. Flaxman, A. T. Kalai, and H. B. McMahan. Online convex optimization in the
bandit setting: gradient descent without a gradient. In Proceedings of the sixteenth
annual ACM-SIAM symposium on Discrete algorithms, pages 385–394, 2005.
[36] D. J. Foster and A. Rakhlin. Beyond UCB: Optimal and efficient contextual bandits
with regression oracles. International Conference on Machine Learning (ICML), 2020.
[37] D. J. Foster, A. Agarwal, M. Dudı́k, H. Luo, and R. E. Schapire. Practical contextual
bandits with regression oracles. International Conference on Machine Learning, 2018.
[38] D. J. Foster, S. Kale, H. Luo, M. Mohri, and K. Sridharan. Logistic regression: The
importance of being improper. Conference on Learning Theory, 2018.
[39] D. J. Foster, C. Gentile, M. Mohri, and J. Zimmert. Adapting to misspecification in
contextual bandits. Advances in Neural Information Processing Systems, 33, 2020.
[40] D. J. Foster, S. M. Kakade, J. Qian, and A. Rakhlin. The statistical complexity of
interactive decision making. arXiv preprint arXiv:2112.13487, 2021.
[41] D. J. Foster, N. Golowich, J. Qian, A. Rakhlin, and A. Sekhari. A note on model-
free reinforcement learning with the decision-estimation coefficient. arXiv preprint
arXiv:2211.14250, 2022.
[42] D. J. Foster, A. Rakhlin, A. Sekhari, and K. Sridharan. On the complexity of adver-
sarial decision making. arXiv preprint arXiv:2206.13063, 2022.
[43] D. J. Foster, N. Golowich, and Y. Han. Tight guarantees for interactive decision making
with the decision-estimation coefficient. arXiv preprint arXiv:2301.08215, 2023.
[44] E. Hazan and S. Kale. An online portfolio selection algorithm with regret logarithmic
in price variation. Mathematical Finance, 2015.
[45] S. R. Howard, A. Ramdas, J. McAuliffe, and J. Sekhon. Time-uniform chernoff bounds
via nonnegative supermartingales. Probability Surveys, 17:257–317, 2020.
[46] N. Jiang, A. Krishnamurthy, A. Agarwal, J. Langford, and R. E. Schapire. Contex-
tual decision processes with low Bellman rank are PAC-learnable. In International
Conference on Machine Learning, pages 1704–1713, 2017.
[47] C. Jin, A. Krishnamurthy, M. Simchowitz, and T. Yu. Reward-free exploration for
reinforcement learning. In International Conference on Machine Learning, pages 4870–
4879. PMLR, 2020.
[48] C. Jin, Z. Yang, Z. Wang, and M. I. Jordan. Provably efficient reinforcement learning
with linear function approximation. In Conference on Learning Theory, pages 2137–
2143, 2020.
[49] C. Jin, Q. Liu, and S. Miryoosefi. Bellman eluder dimension: New rich classes of RL
problems, and sample-efficient algorithms. Neural Information Processing Systems,
2021.
[50] K.-S. Jun and C. Zhang. Crush optimism with pessimism: Structured bandits beyond
asymptotic optimality. Advances in Neural Information Processing Systems, 33, 2020.
[51] A. Kalai and S. Vempala. Efficient algorithms for universal portfolios. Journal of
Machine Learning Research, 2002.
[52] J. Kiefer and J. Wolfowitz. The equivalence of two extremum problems. Canadian
Journal of Mathematics, 12:363–366, 1960.
[53] R. Kleinberg. Nearly tight bounds for the continuum-armed bandit problem. Advances
in Neural Information Processing Systems, 17:697–704, 2004.
[54] R. Kleinberg, A. Slivkins, and E. Upfal. Bandits and experts in metric spaces. Journal
of the ACM (JACM), 66(4):1–77, 2019.
[55] A. Krishnamurthy, A. Agarwal, and J. Langford. PAC reinforcement learning with rich
observations. In Advances in Neural Information Processing Systems, pages 1840–1848,
2016.
[56] T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances
in Applied Mathematics, 6(1):4–22, 1985.
[57] J. Langford and T. Zhang. The epoch-greedy algorithm for multi-armed bandits with
side information. In Advances in neural information processing systems, pages 817–824,
2008.
[58] T. Lattimore. Improved regret for zeroth-order adversarial bandit convex optimisation.
Mathematical Statistics and Learning, 2(3):311–334, 2020.
[60] T. Lattimore and C. Szepesvári. Bandit algorithms. Cambridge University Press, 2020.
[61] G. Li, P. Kamath, D. J. Foster, and N. Srebro. Eluder dimension and generalized rank.
arXiv preprint arXiv:2104.06970, 2021.
[62] L. Li. A unifying framework for computational reinforcement learning theory. Rutgers,
The State University of New Jersey—New Brunswick, 2009.
[63] H. Luo, C.-Y. Wei, and K. Zheng. Efficient online portfolio with logarithmic regret. In
Advances in Neural Information Processing Systems, 2018.
[65] A. Modi, N. Jiang, A. Tewari, and S. Singh. Sample complexity of reinforcement
learning using linearly combined model ensembles. In International Conference on
Artificial Intelligence and Statistics, pages 2010–2020. PMLR, 2020.
[66] M. Opper and D. Haussler. Worst case prediction over sequences under log loss. In
The Mathematics of Information Coding, Extraction and Distribution, 1999.
[67] L. Orseau, T. Lattimore, and S. Legg. Soft-bayes: Prod for mixtures of experts with
log-loss. In International Conference on Algorithmic Learning Theory, 2017.
[68] Y. Polyanskiy. Information theoretic methods in statistics and computer science. 2020.
URL https://people.lids.mit.edu/yp/homepage/sdpi course.html.
[69] A. Rakhlin and K. Sridharan. Statistical learning and sequential prediction, 2012.
Available at http://www.mit.edu/∼rakhlin/courses/stat928/stat928 notes.pdf.
[70] A. Rakhlin, K. Sridharan, and A. Tewari. Sequential complexities and uniform martin-
gale laws of large numbers. Probability Theory and Related Fields, 161(1-2):111–153,
2015.
[72] J. Rissanen. Complexity of strings in the class of markov sources. IEEE Transactions
on Information Theory, 32(4):526–532, 1986.
[74] D. Russo and B. Van Roy. Eluder dimension and the sample complexity of optimistic
exploration. In Advances in Neural Information Processing Systems, pages 2256–2264,
2013.
[75] D. Russo and B. Van Roy. Learning to optimize via posterior sampling. Mathematics
of Operations Research, 39(4):1221–1243, 2014.
[76] D. Russo and B. Van Roy. Learning to optimize via information-directed sampling.
Operations Research, 66(1):230–252, 2018.
[79] D. Simchi-Levi and Y. Xu. Bypassing the monster: A faster and simpler optimal algo-
rithm for contextual bandits under realizability. Mathematics of Operations Research,
2021.
[82] W. R. Thompson. On the likelihood that one unknown probability exceeds another in
view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.
[84] V. Vovk. A game of prediction with expert advice. In Proceedings of the eighth annual
conference on computational learning theory, pages 51–60. ACM, 1995.
[85] Y. Wang, R. Wang, and S. M. Kakade. An exponential lower bound for linearly-
realizable MDPs with constant suboptimality gap. Neural Information Processing Sys-
tems (NeurIPS), 2021.
[86] G. Weisz, P. Amortila, and C. Szepesvári. Exponential lower bounds for planning in
MDPs with linearly-realizable optimal action-value functions. In Algorithmic Learning
Theory, pages 1237–1264. PMLR, 2021.
[87] L. Yang and M. Wang. Sample-optimal parametric Q-learning using linearly additive
features. In International Conference on Machine Learning, pages 6995–7004. PMLR,
2019.
[88] H. Yao, C. Szepesvári, B. Á. Pires, and X. Zhang. Pseudo-mdps and factored linear
action models. In 2014 IEEE Symposium on Adaptive Dynamic Programming and Re-
inforcement Learning, ADPRL 2014, Orlando, FL, USA, December 9-12, 2014, pages
1–9. IEEE, 2014.
[89] B. Yu. Assouad, Fano, and Le Cam. In Festschrift for Lucien Le Cam, pages 423–435.
Springer, 1997.
[90] T. Zhang. Feel-good thompson sampling for contextual bandits and reinforcement
learning. SIAM Journal on Mathematics of Data Science, 4(2):834–857, 2022.
[91] H. Zhong, W. Xiong, S. Zheng, L. Wang, Z. Wang, Z. Yang, and T. Zhang. A posterior
sampling framework for interactive decision making. arXiv preprint arXiv:2211.01962,
2022.
A. TECHNICAL TOOLS
As a consequence, for any random variable τ ∈ [T] with the property that for all t ∈ [T], I{τ ≤ t} is a measurable function of Z_1, . . . , Z_{t−1} (τ is called a stopping time), we have that with probability at least 1 − δ,

|1/τ Σ_{i=1}^τ Z_i − E[Z]| ≤ (b − a)√(log(T/δ)/(2τ)).   (A.2)
Proof of Lemma 33. Lemma 3 states that for any fixed T′ ∈ [T], with probability at least 1 − δ,

|1/T′ Σ_{i=1}^{T′} Z_i − E[Z]| ≤ (b − a)√(log(T/δ)/(2T′)).

(A.1) follows by applying this result with δ′ = δ/T and taking a union bound over all T choices for T′ ∈ [T]. For (A.2), we observe that

1/τ Σ_{i=1}^τ (Z_i − E[Z]) − (b − a)√(log(T/δ)/(2τ)) ≤ max_{T′∈[T]} { 1/T′ Σ_{i=1}^{T′} (Z_i − E[Z]) − (b − a)√(log(T/δ)/(2T′)) }.
Lemma 34: For any sequence of real-valued random variables (X_t)_{t≤T} adapted to a filtration (F_t)_{t≤T}, it holds that with probability at least 1 − δ, for all T′ ≤ T,

Σ_{t=1}^{T′} X_t ≤ Σ_{t=1}^{T′} log E_{t−1}[e^{X_t}] + log(δ^{−1}).   (A.3)
Proof of Lemma 34. Define Z_τ := exp(Σ_{t=1}^τ (X_t − log E_{t−1}[e^{X_t}])). We claim that (Z_τ)_{τ≤T} is a nonnegative supermartingale with respect to the filtration (F_τ)_{τ≤T}. Indeed, for any choice of τ, we have

E_{τ−1}[Z_τ] = E_{τ−1}[ exp( Σ_{t=1}^τ (X_t − log E_{t−1}[e^{X_t}]) ) ]
= exp( Σ_{t=1}^{τ−1} (X_t − log E_{t−1}[e^{X_t}]) ) · E_{τ−1}[ exp( X_τ − log E_{τ−1}[e^{X_τ}] ) ]
= exp( Σ_{t=1}^{τ−1} (X_t − log E_{t−1}[e^{X_t}]) )
= Z_{τ−1}.

Since Z_0 = 1, Ville's inequality (e.g., Howard et al. [45]) implies that for all λ > 0,

P_0(∃τ : Z_τ > λ) ≤ 1/λ.

The result now follows by the Chernoff method.
Proof of Lemma 35. Without loss of generality, let R = 1, and fix η ∈ (0, 1). The result follows by invoking Lemma 34 with ηX_t in place of X_t, and by the facts that e^a ≤ 1 + a + (e − 2)a² for a ≤ 1 and 1 + b ≤ e^b for all b ∈ R.
Lemma 36: Let (X_t)_{t≤T} be a sequence of random variables adapted to a filtration (F_t)_{t≤T}. If 0 ≤ X_t ≤ R almost surely, then with probability at least 1 − δ,

Σ_{t=1}^T X_t ≤ (3/2) Σ_{t=1}^T E_{t−1}[X_t] + 4R log(2δ^{−1}),

and

Σ_{t=1}^T E_{t−1}[X_t] ≤ 2 Σ_{t=1}^T X_t + 8R log(2δ^{−1}).
A.2 Information Theory
A.2.1 Properties of Hellinger Distance
Lemma 37: For any distributions P and Q over a pair of random variables (X, Y),

E_{X∼P_X}[ D²_H(P_{Y|X}, Q_{Y|X}) ] ≤ 4 D²_H(P_{X,Y}, Q_{X,Y}).

Next, using that Hellinger distance satisfies the triangle inequality, along with the elementary inequality (a + b)² ≤ 2(a² + b²), we have

E_{X∼P_X}[ D²_H(P_{Y|X}, Q_{Y|X}) ] ≤ 2 D²_H(P_{Y|X} ⊗ P_X, Q_{Y|X} ⊗ Q_X) + 2 D²_H(Q_{Y|X} ⊗ P_X, Q_{Y|X} ⊗ Q_X)
= 2 D²_H(P_{X,Y}, Q_{X,Y}) + 2 D²_H(P_X, Q_X)
≤ 4 D²_H(P_{X,Y}, Q_{X,Y}),

where the final line follows from the data processing inequality.
In particular,

E_P[h(X)] ≤ 3 E_Q[h(X)] + 4R · D²_H(P, Q).   (A.6)

Proof of Lemma 40. Let a measurable event A be fixed. Let p = P(A) and q = Q(A). Then we have

(p − q)²/(2(p + q)) ≤ (√p − √q)² ≤ D²_H((p, 1 − p), (q, 1 − q)) ≤ D²_H(P, Q),

where the third inequality is the data-processing inequality. It follows that

|p − q| ≤ √(2(p + q) D²_H(P, Q)).

To deduce the final result for R = 1, we observe that E_P[h(X)] = ∫_0^1 P(h(X) > t)dt and likewise for E_Q[h(X)], then apply Jensen's inequality. The result for general R follows by rescaling.

The inequality in (A.6) follows by applying the AM-GM inequality to (A.5) and rearranging.
Lemma 41 (Sion’s Minimax Theorem [80]): Let X and Y be convex sets in linear
topological spaces, and assume X is compact. Let f : X ×Y → R be such that (i) f (x, ·)
is concave and upper semicontinuous over Y for all x ∈ X and (ii) f (·, y) is convex and
lower semicontinuous over X for all y ∈ Y. Then

inf_{x∈X} sup_{y∈Y} f(x, y) = sup_{y∈Y} inf_{x∈X} f(x, y).