Contextual Multi-Armed Bandits

Tyler Lu, Dávid Pál, Martin Pál

Abstract

We study contextual multi-armed bandit problems where the context comes from a metric space and the payoff satisfies a Lipschitz condition with respect to the metric. Abstractly, a contextual multi-armed bandit problem models a situation where, in a sequence of independent trials, an online algorithm chooses, based on a given context (side information), an action from a set of possible actions so as to maximize the total payoff of the chosen actions. The payoff depends on both the action chosen and the context. In contrast, context-free multi-armed bandit problems, a focus of much previous research, model situations where no side information is available and the payoff depends only on the action chosen. Our problem is motivated by sponsored web search, where the task is to display ads to a user of an Internet search engine based on her search query so as to maximize the click-through rate (CTR) of the ads displayed. We cast this problem as a contextual multi-armed bandit problem where queries and ads form metric spaces and the payoff function is Lipschitz with respect to both metrics. For any ε > 0 we present an algorithm with regret O(T^{(a+b+1)/(a+b+2)+ε}), where a, b are the covering dimensions of the query space and the ad space respectively. We prove a lower bound Ω(T^{(ã+b̃+1)/(ã+b̃+2)−ε}) for the regret of any algorithm, where ã, b̃ are the packing dimensions of the query space and the ad space respectively. For finite spaces or convex bounded subsets of Euclidean spaces, this gives almost matching upper and lower bounds.

1 INTRODUCTION

Internet search engines, such as Google, Yahoo! and Microsoft's Bing, receive revenue from advertisements shown alongside a user's query. Whenever a user decides to click on an ad displayed for a search query, the advertiser pays the search engine. Thus, part of the search engine's goal is to display ads that are most relevant to the user, in the hope of increasing the chance of a click and possibly increasing its expected revenue. In order to achieve this, the search engine has to learn over time which ads are the most relevant to display for different queries. On the one hand, it is important to exploit currently relevant ads; on the other hand, one should explore potentially relevant ads. This problem can be naturally posed as a multi-armed bandit problem with context. Here by context we mean a user's query. Each time a query x arrives and an ad y is displayed, there is an (unknown) probability µ(x, y) that the user clicks on the ad (for simplicity we assume that one ad is displayed per query). We call µ(x, y) the click-through rate (or CTR) of x and y.

We want to design an online algorithm which, given a query in each time step and a history of past queries and ad clicks, displays an ad so as to maximize the expected number of clicks. In our setting, we make a crucial yet very natural assumption: the spaces of queries and ads are endowed with metrics, and µ(x, y) satisfies a Lipschitz condition with respect to each coordinate. Informally, we assume that the CTRs of two similar ads for the same query are close, and that the CTRs of two similar queries for the same ad are also close. Lastly, we assume that the sequence of queries is fixed in advance by an adversary and revealed one query per time step (an oblivious adversary).
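One natural way to write the Lipschitz assumption just described (the formal model is given in section 1.2; the coordinate-wise form below, with Lipschitz constant 1, is a shorthand rather than the precise definition) is

    |µ(x, y) − µ(x′, y)| ≤ L_X(x, x′)   and   |µ(x, y) − µ(x, y′)| ≤ L_Y(y, y′)   for all x, x′ ∈ X and y, y′ ∈ Y,

where L_X and L_Y denote the metrics on queries and ads; together the two conditions imply |µ(x, y) − µ(x′, y′)| ≤ L_X(x, x′) + L_Y(y, y′).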
Clearly, the best possible algorithm, the Bayes optimal, displays, for a given query, the ad which has the highest CTR. Of course, in order to execute it the CTRs must be known. Instead, we are interested in algorithms that do not depend on knowledge of the CTRs and whose performance is still asymptotically the same as that of the Bayes optimal. More precisely, for any algorithm A, we consider the expected difference between the number of clicks that the Bayes optimal receives and the number A receives over T queries. This difference is called the regret of A and is denoted by R_A(T). An algorithm is said to be asymptotically Bayes optimal if the per-query regret R_A(T)/T approaches 0 as T → ∞ for any sequence of queries.

The standard measure of quality of an asymptotically Bayes optimal algorithm is the speed at which its per-round regret approaches zero. Equivalently, one measures the growth of the regret R_A(T) as T → ∞. The bounds are usually of the form R_A(T) = O(T^γ) for some γ < 1. Such regret bounds are the standard way of measuring the performance of algorithms for multi-armed bandit problems, for online learning problems and, more broadly, for reinforcement learning problems.

The main contributions of this paper are 1) a formal model of the Lipschitz contextual bandit problem on metric spaces, 2) a novel, conceptually simple and clean algorithm, which we call query-ad-clustering, and 3) lower bounds that show the algorithm is essentially optimal with respect to regret. In particular, the following theorem states our results in our contextual bandit model. Note that the covering dimension of a metric space is defined as the smallest d such that the number of balls of radius r required to cover the space is O(r^{−d}). The packing dimension is defined as the largest d̃ such that for any r there exists a subset of disjoint balls of radius r of size Ω(r^{−d̃}).

Theorem 1. Consider a contextual Lipschitz multi-armed bandit problem with query metric space (X, L_X) and ad metric space (Y, L_Y) of size at least 2. Let a, b be the covering dimensions of X, Y respectively, and let ã, b̃ be the packing dimensions of X, Y respectively. Then,

• For any γ > (a+b+1)/(a+b+2), the query-ad-clustering algorithm A has the property that there exist constants T_0, C such that for any instance µ, any T ≥ T_0 and any sequence of T queries the regret satisfies R_A(T) ≤ C · T^γ.

• For any γ < (ã+b̃+1)/(ã+b̃+2), there exist positive constants C, T_0 such that for any T ≥ T_0 and any algorithm A there exist an instance µ and a sequence of T queries such that the regret satisfies R_A(T) ≥ C · T^γ.

If the query space and the ad space are convex bounded subsets of Euclidean spaces, or are finite, then ã = a and b̃ = b (finite spaces have zero dimension) and the theorem provides matching upper and lower bounds.

The paper is organized as follows. In section 1.1 we discuss related work, and in section 1.2 we introduce our Lipschitz contextual multi-armed bandit model. Then we introduce the query-ad-clustering algorithm in section 2 and give an upper bound on its regret. In section 3 we present what is essentially a matching lower bound on the regret of any Lipschitz contextual bandit algorithm, showing that our algorithm is essentially optimal.

1.1 RELATED WORK

There is a body of relevant literature on context-free multi-armed bandit problems: the first bounds on the regret for the model with a finite action space were obtained in the classic paper by Lai and Robbins [1985]; a more detailed exposition can be found in Auer et al. [2002]. Auer et al. [2003] introduced the non-stochastic bandit problem, in which payoffs are adversarial, and provided regret-optimal algorithms for it. In recent years much work has been done on very large action spaces. Flaxman et al. [2005] considered a setting where the actions form a convex set and in each round a convex payoff function is adversarially chosen. Continuum action spaces and payoff functions satisfying (variants of) a Lipschitz condition were studied in Kleinberg [2005a,b] and Auer et al. [2007]. Most recently, metric action spaces where the payoff function is Lipschitz were considered by Kleinberg et al. [2008]. Inspired by their work, we also consider metric spaces. In a follow-up paper by Bubeck et al. [2008] the results of Kleinberg et al. [2008] are extended to more general settings.

Our model can be viewed as a direct and strict generalization of the classical multi-armed bandit problem of Lai and Robbins and of the bandit problem in continuum and general metric spaces as presented by Agrawal [1995] and Kleinberg et al. [2008]. These models can be viewed as a special case of our model where the query space is a singleton. Our upper and lower bounds on the regret apply to these models as well. See section 1.3 for a closer comparison with the model of Kleinberg et al. [2008].

Online learning with expert advice is a class of problems related to multi-armed bandits; see the book by Cesa-Bianchi and Lugosi [2006]. These can be viewed as multi-armed bandit problems with side information, but their structure is different from the structure of our model. The most relevant work is the Exp4 algorithm of Auer et al. [2003], where the experts are simply arbitrary multi-armed bandit algorithms and the goal is to compete against the best expert. In fact this setting and the Exp4 algorithm can be reformulated in our model, which is discussed further at the end of section 2.

We are aware of three papers that define a multi-armed bandit problem with side information. The first two are by Wang et al. [2005] and Goldenshluger and Zeevi [2007]; however, the models in these papers are very different from ours. The epoch-greedy algorithm proposed in Langford and Zhang [2007] pertains to a setting where contexts arrive i.i.d. and regret is defined relative to the best context-to-action mapping in some fixed class of such mappings. They upper bound the regret of epoch-greedy in terms of an exploitation parameter, which makes it hard to compare their bound with ours.
We first present the algorithm and then prove an O(T^γ) upper bound on its regret.

Before we state the algorithm we define several parameters that depend on (X, Y) and γ and fully specify the algorithm. Let a, b be the covering dimensions of X, Y respectively. We define a′, b′ so that a′ > a, b′ > b and γ > (a′+b′+1)/(a′+b′+2). We also let c, d be constants such that, for every r ∈ (0, 1], X can be covered by at most c · r^{−a′} balls of radius r and Y can be covered by at most d · r^{−b′} balls of radius r; such constants exist by the definition of the covering dimension.

In the analysis we use Hoeffding's inequality; a proof can be found in the book [Devroye and Lugosi, 2001, Chapter 2] or in the original paper by Hoeffding [1963].

Hoeffding's Inequality. Let X_1, X_2, ..., X_n be independent bounded random variables such that X_i, 1 ≤ i ≤ n, has support [a_i, b_i]. Then for the sum S = X_1 + X_2 + ... + X_n we have, for any u ≥ 0,

    Pr[ S − E[S] ≥ u ] ≤ exp( −2u² / Σ_{i=1}^{n} (b_i − a_i)² ).
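For orientation, here is a minimal Python sketch of a query-ad-clustering scheme built around these parameters. It is an illustrative reconstruction, not the paper's exact specification: the greedy covering routine, the UCB-style index, and the treatment of X and Y as finite lists are placeholders; the arguments a1, b1 play the role of a′, b′ above, and the per-phase radius r follows the values recalled in the proof of Lemma 6 below.

```python
import math
import random


def greedy_cover(points, dist, r):
    """Greedily pick centers so that every point is within distance r of some center."""
    centers = []
    for p in points:
        if all(dist(p, q) > r for q in centers):
            centers.append(p)
    return centers


def nearest(p, centers, dist):
    return min(centers, key=lambda q: dist(p, q))


def query_ad_clustering(queries, X, Y, dist_x, dist_y, mu, a1, b1, rng=random):
    """Play len(queries) rounds; mu is the unknown CTR, used only to draw Bernoulli clicks."""
    T = len(queries)
    total_clicks = 0
    t, phase = 1, 0
    while t <= T:
        # Phase i lasts for rounds 2^i, ..., 2^(i+1) - 1 and uses radius r.
        r = 2.0 ** (-phase / (a1 + b1 + 2))
        x_centers = greedy_cover(X, dist_x, r)    # query clusters X_1, ..., X_N
        y_centers = greedy_cover(Y, dist_y, r)    # candidate ads Y_0
        n = {(x0, y0): 0 for x0 in x_centers for y0 in y_centers}  # display counts
        s = dict(n)                               # click counts
        phase_end = min(T, 2 ** (phase + 1) - 1)
        while t <= phase_end:
            x = queries[t - 1]
            x0 = nearest(x, x_centers, dist_x)    # cluster of the current query

            def ucb(y0):
                # Optimistic index for ad y0 within cluster x0 (placeholder rule).
                if n[(x0, y0)] == 0:
                    return float("inf")
                mean = s[(x0, y0)] / n[(x0, y0)]
                return mean + math.sqrt(2.0 * math.log(t + 1) / n[(x0, y0)])

            y = max(y_centers, key=ucb)
            click = 1 if rng.random() < mu(x, y) else 0   # Bernoulli(mu(x, y)) feedback
            total_clicks += click
            n[(x0, y)] += 1
            s[(x0, y)] += click
            t += 1
        phase += 1
    return total_clicks
```

Keeping one index per (cluster, candidate ad) pair means the algorithm maintains only N · K statistics per phase; this is the structural saving, relative to running Exp4 over all K^N cluster-to-ad mappings, that is discussed at the end of section 2.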
Now suppose that the good event occurs. Let R̂ be the actual regret,

    R̂ = Σ_{t : 2^i ≤ t ≤ min(T, 2^{i+1}−1), x_t ∈ X_j} ( sup_{y′ ∈ Y} µ(x_t, y′) − µ(x_t, y_t) ).

Since during phase i the algorithm displays ads only from Y_0, the actual regret R̂ can be decomposed as a sum R̂ = Σ_{y ∈ Y_0} R̂_y, where R̂_y is the contribution to the regret from displaying the ad y, that is,

    R̂_y = Σ_{t : 2^i ≤ t ≤ min(T, 2^{i+1}−1), x_t ∈ X_j, y_t = y} ( sup_{y′ ∈ Y} µ(x_t, y′) − µ(x_t, y) ).

Fix y ∈ Y_0. Pick any ε > 0. Let y* be an ε-optimal ad for the query x_0, that is, y* is such that µ(x_0, y*) ≥ sup_{y ∈ Y} µ(x_0, y) − ε. Let y_0* be the optimal ad in Y_0 for the query x_0, that is, y_0* = argmax_{y ∈ Y_0} µ(x_0, y). The Lipschitz condition guarantees that for any x_t ∈ X_j

    sup_{y′ ∈ Y} µ(x_t, y′) ≤ sup_{y ∈ Y} µ(x_0, y) + r ≤ µ(x_0, y*) + r + ε ≤ µ(x_0, y_0*) + 2r + ε.

Equivalently,

    2 R_{t−1}(y) ≥ µ(x_0, y_0*) − µ(x_0, y) − 2r.

We substitute the definition of R_{t−1}(y) into this inequality and square both sides of the inequality. (Note that both sides are positive.) This gives an upper bound on n_T(y) = n_{t−1}(y) + 1:

    n_T(y) = n_{t−1}(y) + 1 ≤ 16 i / ( µ(x_0, y_0*) − µ(x_0, y) − 2r )².

Combining with (5) we have

    R̂_y ≤ n_T(y) [ µ(x_0, y_0*) − µ(x_0, y) + 3r ]
        ≤ n_T(y) [ µ(x_0, y_0*) − µ(x_0, y) − 2r ] + 5r n_T(y)
        ≤ 16 i / ( µ(x_0, y_0*) − µ(x_0, y) − 2r ) + 5r n_T(y).

Using the definition of a bad ad we get that

    ∀ y ∈ Y_bad :  R̂_y ≤ 16 i / r + 5r n_T(y).        (7)

Summing over all ads, both bad and good, we have

    R̂ = Σ_{y ∈ Y_good} R̂_y + Σ_{y ∈ Y_bad} R̂_y
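Summed over the at most K candidate ads in Y_0 (splitting into the good and the bad ads as above), these per-ad bounds are what yield, for cluster j in phase i, a per-cluster bound of the form

    R_{i,j}(T) ≤ 6 r n_j + K ( 16 i / r + 1 ),

where n_j is the number of queries from cluster X_j during the phase; this is the inequality invoked as "the preceding lemma" in the proof that follows.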
Proof. Let n_j denote the number of queries belonging to cluster X_j. Clearly n = Σ_{j=1}^{N} n_j. From the preceding lemma we have

    R_i(T) = Σ_{j=1}^{N} R_{i,j}(T) ≤ Σ_{j=1}^{N} [ 6 r n_j + K ( 16 i / r + 1 ) ]
           ≤ 6 r n + N K ( 16 i / r + 1 ).

Lemma 6. For any T ≥ 0, the regret of the query-ad-clustering algorithm is bounded as

    R_A(T) ≤ (24 + 64 c d log_2 T + 4 c d) · T^{(a′+b′+1)/(a′+b′+2)} = O(T^γ).

The lemma proves the first part of Theorem 1.

Proof. Let k be the last phase, that is, k is such that 2^k ≤ T < 2^{k+1}. In other words, k = ⌊log_2 T⌋. We sum the regret over all phases 0, 1, ..., k. We use the preceding lemma and recall that in phase i

    r = 2^{−i/(a′+b′+2)},   N = c · 2^{a′i/(a′+b′+2)},   K = d · 2^{b′i/(a′+b′+2)},   n ≤ 2^i.

We have

    R_A(T) = Σ_{i=0}^{k} R_i(T)
           ≤ Σ_{i=0}^{k} [ 6 · 2^{−i/(a′+b′+2)} · 2^i + c · 2^{a′i/(a′+b′+2)} · d · 2^{b′i/(a′+b′+2)} · ( 16 i / 2^{−i/(a′+b′+2)} + 1 ) ]
           ≤ Σ_{i=0}^{k} [ 6 · 2^{i(a′+b′+1)/(a′+b′+2)} + 16 i c d · 2^{i(a′+b′+1)/(a′+b′+2)} + c d · 2^{i(a′+b′)/(a′+b′+2)} ].

Bounding each of the three geometric sums by a constant multiple of its largest term (the middle sum picks up the extra factor i ≤ log_2 T) and using 2^k ≤ T yields the bound stated in the lemma.

While the query-ad-clustering algorithm achieves what turns out to be the optimal regret bound, we note that a modification of the Exp4 "experts" algorithm of Auer et al. [2003] achieves the same bound (but we discuss the problems with this algorithm below). Each expert is defined by a mapping f : {X_1, ..., X_N} → Y_0: given a query x ∈ X, the expert finds the appropriate cluster X_x and recommends f(X_x). There are E = ((1/ε)^b)^{(1/ε)^a} such experts (mappings), and one of them is ε-close to the Bayes optimal strategy. The regret bound of Auer et al. [2003] for Exp4 gives O(√(T |Y_0| log E)) with respect to the best expert, which in turn has regret at most εT with respect to the Bayes optimal strategy; setting ε = T^{−1/(a+b+2)} we retrieve the same regret upper bound as query-ad-clustering. However, the problem with this algorithm is that it must keep track of an extremely large number, E, of experts, while ignoring the structure of our model—it does not exploit the fact that a bandit algorithm can be run for each context "piece" as opposed to each expert.

3 A LOWER BOUND

In this section we prove, for any γ < (ã+b̃+1)/(ã+b̃+2), a lower bound Ω(T^γ) on the regret of any algorithm for a contextual Lipschitz MAB (X, Y) with ã = PACK(X, L_X) and b̃ = PACK(Y, L_Y). On the highest level, the main idea of the lower bound is a simple averaging argument. We construct several "hard" instances and we show that the average regret of any algorithm on those instances is Ω(T^γ).

Before we construct the instances we define several parameters that depend on (X, Y) and γ. We define a′, b′ so that a′ ∈ [0, ã], b′ ∈ [0, b̃] and γ = (a′+b′+1)/(a′+b′+2). Moreover, if ã > 0 we ensure that a′ ∈ (0, ã), and likewise if b̃ > 0 we ensure that b′ ∈ (0, b̃). Let c, d be constants such that for any r ∈ (0, 1] there exist 2r-separated subsets of X, Y of sizes at least c r^{−a′} and d r^{−b′} respectively. Existence of such constants is guaranteed by the definition of the packing dimension. We also use positive constants α, β, C, T_0 that can be expressed in terms of a′, b′, c, d only. We do not give the formulas for these constants; in principle they can be extracted from the proofs.

Hard instances: Let the time horizon T be given. The "hard" instances are constructed as follows. Let r = α · T^{−1/(a′+b′+2)} and let X_0 ⊆ X, Y_0 ⊆ Y be 2r-separated subsets of the sizes guaranteed above. For every mapping v ∈ Y_0^{X_0} we define an instance µ_v by

    µ_v(x, y) = 1/2 + max{ 0, r − L_Y(y, v(x_0)) − L_X(x, x_0) },

where x_0 ∈ X_0 is the point of X_0 closest to x. Furthermore, we assume that in each round t the payoff µ̂_t the algorithm receives lies in {0, 1}, that is, µ̂_t is a Bernoulli random variable with parameter µ_v(x_t, y_t).

Now, we choose a sequence of T queries. The sequence of queries consists of |X_0| subsequences, one for each x_0 ∈ X_0, concatenated together. For each x_0 ∈ X_0 the corresponding subsequence consists of M = ⌊T/|X_0|⌋ (or M = ⌊T/|X_0|⌋ + 1) copies of x_0. In Lemma 7 we lower bound the contribution of each subsequence to the total regret.
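To see how these choices produce the claimed rate, the following back-of-the-envelope calculation may help; it simply combines the bound of Lemma 7 below with |X_0| ≈ c r^{−a′}, |Y_0| ≈ d r^{−b′} and M ≈ T/|X_0|, absorbing all constant factors into C:

    Σ_{x_0 ∈ X_0} β √(|Y_0| M) ≈ |X_0| · β √(|Y_0| T / |X_0|) = β √(|Y_0| |X_0| T)
        ≈ β √(c d) · r^{−(a′+b′)/2} · √T
        = β √(c d) · α^{−(a′+b′)/2} · T^{1/2 + (a′+b′)/(2(a′+b′+2))}
        = C · T^{(a′+b′+1)/(a′+b′+2)} = C · T^γ.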
The proof of Lemma 7 is an adaptation of the proof of Theorem 6.11 from Cesa-Bianchi and Lugosi [2006, Chapter 6], a lower bound for the finitely-armed bandit problem. In Lemma 8 we sum the contributions together and give the final lower bound.

Lemma 7. For x_0 ∈ X_0 consider a sequence of M copies of the query x_0. Then for T ≥ T_0 and for any algorithm A, the average regret on this sequence of queries is lower bounded as

    R̄_{x_0} = (1/|Y_0|^{|X_0|}) Σ_{v ∈ Y_0^{X_0}} R_A^v(M) ≥ β √(|Y_0| M),

where R_A^v(M) denotes the regret on instance µ_v.

Proof. Deferred to the full version of the paper.

Lemma 8. For any algorithm A, there exist a v ∈ Y_0^{X_0}, an instance µ_v and a sequence of T ≥ T_0 queries on which the regret satisfies

    R_A(T) ≥ C · T^γ.

Proof. We use the preceding lemma and sum the regret over all x_0 ∈ X_0:

    sup_{v ∈ Y_0^{X_0}} R_A^v(T) ≥ (1/|Y_0|^{|X_0|}) Σ_{v ∈ Y_0^{X_0}} R_A^v(T)

Our model is a strict generalization of previously studied multi-armed bandit settings where no side information is given in each round. We believe that our model applies to many other real-life scenarios where additional information is available that affects the rewards of the actions.

We present a very natural and conceptually simple algorithm, called query-ad-clustering, which, roughly speaking, clusters the contexts into similar regions and runs a multi-armed bandit algorithm for each context cluster. When the query and ad spaces are endowed with a metric for which the reward function is Lipschitz, we prove an upper bound on the regret of query-ad-clustering and a lower bound on the regret of any algorithm, showing that query-ad-clustering is optimal. Specifically, the upper bound O(T^{(a+b+1)/(a+b+2)+ε}) depends on the covering dimensions of the query space (a) and the ad space (b), and the lower bound Ω(T^{(ã+b̃+1)/(ã+b̃+2)−ε}) depends on the packing dimensions of the spaces (ã, b̃). For bounded Euclidean spaces and finite sets, these dimensions are equal and imply nearly tight bounds on the regret. The lower bound can be strengthened to Ω_∞(T^γ) for any γ < max{ (a+b̃+1)/(a+b̃+2), (ã+b+1)/(ã+b+2) }. So, if either ã = a or b̃ = b, then we can still prove a lower bound that matches the upper bound. However, this lower bound holds "only" for infinitely many time horizons T (as opposed to all horizons). It seems that for Lipschitz contextual MABs where ã ≠ a and b̃ ≠ b one needs to craft a different notion of dimension, which would somehow capture both the covering and the packing behavior of the spaces.
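To make these rates concrete, a few instantiations of the upper-bound exponent (a+b+1)/(a+b+2) are worth recording (simple arithmetic on the exponent, included only as illustration):

• finite query and ad spaces: a = b = 0, exponent 1/2, i.e. regret O(T^{1/2+ε}), the familiar worst-case √T behaviour of finite-armed bandits up to the ε slack;
• a singleton query space and ads from [0, 1]: a = 0, b = 1, exponent 2/3, recovering the T^{2/3} rate of the continuum-armed bandit of Kleinberg [2005a];
• queries and ads both from [0, 1]: a = b = 1, exponent 3/4, i.e. regret O(T^{3/4+ε}).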
References

…vances in Neural Information Processing Systems 19 (NIPS 2007), pages 49–56. MIT Press, 2007.

Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.

Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2003.

Peter Auer, Ronald Ortner, and Csaba Szepesvári. Improved rates for the stochastic continuum-armed bandit problem. In Proceedings of the 20th Annual Conference on Learning Theory (COLT 2007), pages 454–468. Springer, 2007.

Sébastien Bubeck, Rémi Munos, Gilles Stoltz, and Csaba Szepesvári. Online optimization in X-armed bandits. In NIPS, pages 201–208, 2008.

Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

Luc Devroye and Gábor Lugosi. Combinatorial Methods in Density Estimation. Springer, 2001.

Eyal Even-Dar, Shie Mannor, and Yishay Mansour. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7:1079–1105, 2006.

Abraham D. Flaxman, Adam T. Kalai, and H. Brendan McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2005), pages 385–394. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2005.

Alexander Goldenshluger and Assaf Zeevi. Performance limitations in bandit problems with side observations. Manuscript, 2007.

Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.

Robert D. Kleinberg. Nearly tight bounds for the continuum-armed bandit problem. In Lawrence K. Saul, Yair Weiss, and Léon Bottou, editors, Advances in Neural Information Processing Systems 17 (NIPS 2005), pages 697–704. MIT Press, 2005a.

Robert D. Kleinberg. Online Decision Problems with Large Strategy Sets. PhD thesis, Massachusetts Institute of Technology, June 2005b.

Robert D. Kleinberg, Aleksandrs Slivkins, and Eli Upfal. Multi-armed bandits in metric spaces. In Proceedings of the 40th Annual ACM Symposium on Theory of Computing (STOC 2008), pages 681–690. Association for Computing Machinery, 2008.

T. L. Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.

John Langford. How do we get weak action dependence for learning with partial observations? Blog post: http://hunch.net/?p=421, September 2008.

John Langford and Tong Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. In NIPS, 2007.

Richard S. Sutton and Andrew G. Barto. Reinforcement Learning. MIT Press, 1998.

Chih-Chun Wang, Sanjeev R. Kulkarni, and H. Vincent Poor. Bandit problems with side observations. IEEE Transactions on Automatic Control, 50(3):338–355, May 2005.