
A Survey on Practical Applications of Multi-Armed and Contextual Bandits

Djallel Bouneffouf, Irina Rish
IBM Thomas J. Watson Research Center, Yorktown Heights, NY, USA
{dbouneffouf, Irish}@us.ibm.com

Abstract

In recent years, the multi-armed bandit (MAB) framework has attracted a lot of attention in various applications, from recommender systems and information retrieval to healthcare and finance, due to its stellar performance combined with certain attractive properties, such as learning from less feedback. The multi-armed bandit field is currently flourishing, as novel problem settings and algorithms motivated by various practical applications are being introduced, building on top of the classical bandit problem. This article aims to provide a comprehensive review of top recent developments in multiple real-life applications of the multi-armed bandit. Specifically, we introduce a taxonomy of common MAB-based applications and summarize the state of the art for each of those domains. Furthermore, we identify important current trends and provide new perspectives pertaining to the future of this exciting and fast-growing field.

1 Introduction

Sequential decision-making problems, where at each point in time an agent must choose the best action out of several alternatives, are frequently encountered in various practical applications, from clinical trials [Durand et al., 2018] to recommender systems [Mary et al., 2015] and anomaly detection [Ding et al., 2019]. Often, there is side information, or context, associated with each action (e.g., a user's profile), and the feedback, or reward, is limited to the chosen option. For example, in clinical trials [Durand et al., 2018; Bastani and Bayati, 2015] the context is the patient's medical record (e.g., health condition, family history), the actions correspond to the treatment options being compared, and the reward represents the outcome of the proposed treatment (e.g., success or failure). An important aspect affecting the long-term success in such settings is finding a good trade-off between exploration (e.g., trying a new drug) and exploitation (choosing the known drug).

This inherent exploration vs. exploitation trade-off exists in many sequential decision problems, and is traditionally formulated as the multi-armed bandit (MAB) problem, stated as follows: given K possible actions, or "arms", each associated with a fixed but unknown reward probability distribution [Lai and Robbins, 1985; Auer et al., 2002], at each iteration (time point) an agent selects an arm to play and receives a reward, sampled from the respective arm's probability distribution independently of the previous actions. The task of the agent is to learn how to choose its actions so that the cumulative reward over time is maximized. Different solutions have been proposed for this problem, based on a stochastic formulation [Lai and Robbins, 1985; Auer et al., 2002; Bouneffouf and Féraud, 2016] and a Bayesian formulation [Agrawal and Goyal, 2012]; however, these approaches did not take into account the context, or side information, available to the agent.

A particularly useful version of the MAB is the contextual multi-armed bandit (CMAB), or simply the contextual bandit problem, where at each iteration, before choosing an arm, the agent observes an N-dimensional context, or feature vector. The agent uses this context, along with the rewards of the arms played in the past, to choose which arm to play in the current iteration. Over time, the agent's aim is to collect enough information about the relationship between the context vectors and rewards to predict the next best arm to play by looking at the current context [Langford and Zhang, 2008; Agrawal and Goyal, 2013]. Different algorithms were proposed for the general case, including LinUCB [Li et al., 2010], Neural Bandit [Allesiardo et al., 2014] and Contextual Thompson Sampling (CTS) [Agrawal and Goyal, 2013], where a linear dependency is typically assumed between the expected reward of an action and its context.

We will now provide an extensive overview of various applications of the bandit framework, both in real-life problem settings arising in multiple practical domains (healthcare, computer network routing, finance, and beyond) and in computer science and machine learning in particular, where bandit approaches can help improve hyperparameter tuning and other important algorithmic choices in supervised learning, active learning and reinforcement learning.
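To make the classical setting concrete, the sketch below is a minimal, self-contained implementation of the UCB1 index policy of [Auer et al., 2002] on a simulated Bernoulli bandit. It is an illustrative toy (the arm probabilities, horizon and seed are made-up values), not code taken from any of the surveyed papers.

```python
import math
import random

def ucb1(reward_probs, horizon, seed=0):
    """Run UCB1 [Auer et al., 2002] on a simulated Bernoulli bandit whose success
    probabilities are unknown to the agent; return the cumulative reward."""
    rng = random.Random(seed)
    k = len(reward_probs)
    counts = [0] * k       # how many times each arm has been played
    means = [0.0] * k      # empirical mean reward of each arm
    total = 0.0
    for t in range(1, horizon + 1):
        if t <= k:         # play every arm once to initialize the estimates
            arm = t - 1
        else:              # otherwise play the arm with the largest upper confidence bound
            arm = max(range(k),
                      key=lambda a: means[a] + math.sqrt(2.0 * math.log(t) / counts[a]))
        reward = 1.0 if rng.random() < reward_probs[arm] else 0.0
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]   # incremental mean update
        total += reward
    return total

# Example: three arms with hidden success rates 0.2, 0.5 and 0.7.
print(ucb1([0.2, 0.5, 0.7], horizon=5000))
```

The confidence term shrinks as an arm is played more often, so exploration automatically concentrates on the arms whose estimates are still uncertain.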
2 Real-Life Applications of Bandit

As a general mathematical framework, the stochastic multi-armed bandit setting addresses the primary difficulty in sequential decision-making under uncertainty, namely the exploration versus exploitation dilemma, and therefore provides a natural formalism for most real-life online decision-making problems.

2.1 Healthcare

Clinical trials. Collecting data for assessing treatment effectiveness on animal models during the full range of disease stages can be difficult when using conventional random treatment allocation procedures, since poor treatments can cause deterioration of the subject's health. The authors of [Durand et al., 2018] aim to design an adaptive allocation strategy that improves the efficiency of data collection by allocating more samples to exploring promising treatments. They cast this application as a contextual bandit problem and introduce a practical algorithm for balancing exploration and exploitation in this framework. The work relies on sub-sampling to compare treatment options using an equivalent amount of information; more precisely, they extend the sub-sampling strategy to the contextual bandit setting by applying sub-sampling within Gaussian Process regression.

Warfarin is the most widely used oral anticoagulant agent in the world; however, dosing it correctly remains a significant challenge, as the appropriate dose can be highly variable among individuals due to various clinical, demographic and genetic factors. Physicians currently follow a fixed-dose strategy: they start patients on 5 mg/day (the appropriate dose for the majority of patients) and slowly adjust the dose over the course of a few weeks by tracking the patient's anti-coagulation levels. However, an incorrect initial dosage can result in highly adverse consequences such as stroke (if the initial dose is too low) or internal bleeding (if the initial dose is too high). Thus, the authors of [Bastani and Bayati, 2015] tackle the problem of learning and assigning an appropriate initial dosage to patients by modeling the problem as a multi-armed bandit with high-dimensional covariates, and propose a novel and efficient bandit algorithm based on the LASSO estimator.

Brain and behavior modeling. Drawing inspiration from behavioral studies of human decision making in both healthy controls and patients with different mental disorders, the authors of [Bouneffouf et al., 2017a] propose a general parametric framework for the multi-armed bandit problem which extends the standard Thompson Sampling approach to incorporate reward-processing biases associated with several neurological and psychiatric conditions, including Parkinson's and Alzheimer's diseases, attention-deficit/hyperactivity disorder (ADHD), addiction, and chronic pain. They demonstrate empirically, from the behavioral modeling perspective, that their parametric framework can be viewed as a first step towards a unifying computational model capturing reward-processing abnormalities across multiple mental conditions.
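As a toy illustration of adaptive allocation, the following sketch runs Beta-Bernoulli Thompson Sampling over a handful of hypothetical "treatments" with made-up success rates. It is only meant to show how posterior sampling gradually shifts allocation towards the more promising arms; it is not the Gaussian-process sub-sampling algorithm of [Durand et al., 2018] nor the LASSO bandit of [Bastani and Bayati, 2015].

```python
import random

def thompson_allocation(success_probs, horizon, seed=1):
    """Toy Thompson Sampling: each 'arm' is a treatment with an unknown success
    probability; a Beta(1, 1) prior per arm is updated after every observed outcome."""
    rng = random.Random(seed)
    k = len(success_probs)
    alpha = [1.0] * k      # prior/posterior successes + 1
    beta = [1.0] * k       # prior/posterior failures + 1
    assignments = [0] * k
    for _ in range(horizon):
        # Sample a plausible success rate for every treatment, then act greedily on the draws.
        samples = [rng.betavariate(alpha[a], beta[a]) for a in range(k)]
        arm = max(range(k), key=lambda a: samples[a])
        success = rng.random() < success_probs[arm]
        alpha[arm] += 1 if success else 0
        beta[arm] += 0 if success else 1
        assignments[arm] += 1
    return assignments

# Most subjects end up allocated to the most effective (last) treatment.
print(thompson_allocation(success_probs=[0.30, 0.45, 0.60], horizon=2000))
```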
2.2 Finance

In recent years, sequential portfolio selection has been a focus of increasing interest at the intersection of machine learning and quantitative finance. The trade-off between exploration and exploitation, with the goal of maximizing cumulative reward, is a natural formulation of portfolio choice problems. In [Shen et al., 2015], the authors proposed a bandit algorithm for making online portfolio choices by exploiting correlations among multiple arms. By constructing orthogonal portfolios from multiple assets and integrating their approach with the upper-confidence-bound bandit framework, the authors derive the optimal portfolio strategy, representing a combination of passive and active investments according to a risk-adjusted reward function. In [Huo and Fu, 2017], the authors incorporate risk-awareness into the classic multi-armed bandit setting and introduce a novel algorithm for portfolio construction. By filtering assets based on the topological structure of the financial market and combining the optimal multi-armed bandit policy with the minimization of a coherent risk measure, they achieve a balance between risk and return.

2.3 Dynamic Pricing

Online retail companies are often faced with the dynamic pricing problem: the company must decide on real-time prices for each of its multiple products. The company can run price experiments (make frequent price changes) to learn about demand and maximize long-run profits. The authors of [Misra et al., 2018] propose a dynamic price experimentation policy for the setting where the company has only incomplete demand information. For this general setting, the authors derive a pricing algorithm that balances earning an immediate profit against learning for future profits. The approach combines multi-armed bandits with partial identification of consumer demand from economic theory. Similar to [Misra et al., 2018], the authors of [Mueller et al., 2018] consider high-dimensional dynamic multi-product pricing with an evolving low-dimensional linear demand model. They show that the revenue maximization problem reduces to an online bandit convex optimization with side information given by the observed demands. The approach applies a bandit convex optimization algorithm in a projected low-dimensional space spanned by the latent product features, while simultaneously learning this span via online singular value decomposition of a carefully crafted matrix containing the observed demands.
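A drastically simplified view of price experimentation as a bandit can be sketched as follows: each candidate price is an arm, and the reward of pulling an arm is the realized revenue. The logistic demand curve and the epsilon-greedy rule below are assumptions made only for this example; this is not the partial-identification policy of [Misra et al., 2018] nor the low-rank method of [Mueller et al., 2018].

```python
import math
import random

def price_experiment(prices, horizon, eps=0.1, seed=2):
    """Toy dynamic-pricing bandit: each candidate price is an arm, the reward is the
    realized revenue (price if the customer buys, 0 otherwise). The demand model is a
    made-up logistic curve used only to simulate customer feedback."""
    rng = random.Random(seed)
    buy_prob = lambda p: 1.0 / (1.0 + math.exp(0.35 * (p - 20.0)))   # hypothetical demand
    counts = [0] * len(prices)
    avg_revenue = [0.0] * len(prices)
    for _ in range(horizon):
        if rng.random() < eps or 0 in counts:      # explore: offer a random price
            arm = rng.randrange(len(prices))
        else:                                      # exploit: best average revenue so far
            arm = max(range(len(prices)), key=lambda a: avg_revenue[a])
        r = prices[arm] if rng.random() < buy_prob(prices[arm]) else 0.0
        counts[arm] += 1
        avg_revenue[arm] += (r - avg_revenue[arm]) / counts[arm]
    return dict(zip(prices, (round(v, 2) for v in avg_revenue)))

# Estimated average revenue per offer for each candidate price.
print(price_experiment(prices=[10, 15, 20, 25, 30], horizon=10000))
```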
2.4 Recommender Systems

Recommender systems are frequently used in various applications to predict user preferences. However, they also face the exploration-exploitation dilemma when making a recommendation, since they need to exploit their knowledge about the previously chosen items the user is interested in, while also exploring new items the user may like. The authors of [Zhou et al., 2017] approach this challenge in the multi-armed bandit setting, especially for large-scale recommender systems that have a very large or infinite number of items. They propose two large-scale bandit approaches for situations where no prior information is available; continuous exploration in their approaches can address the cold-start problem in recommender systems. In context-aware recommender systems, most existing approaches focus on recommending relevant items to users, taking into account contextual information such as time, location, or social aspects. However, none of those approaches has considered the problem of the evolution of the user's content. In [Bouneffouf et al., 2012], the authors introduce an algorithm that takes these dynamics into account; it is based on dynamic exploration/exploitation and can adaptively balance the two aspects, deciding which situation is most relevant for exploration or exploitation. In the same spirit, [Bouneffouf, 2014] proposes to study the "freshness" of the user's content through the bandit problem, introducing an algorithm named Freshness-Aware Thompson Sampling that manages the recommendation of fresh documents according to the risk of the user's situation.
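For context-aware recommendation, a common baseline is the disjoint LinUCB algorithm of [Li et al., 2010], which the survey mentions in the introduction. The sketch below is a generic, illustrative implementation with made-up dimensions, contexts and click feedback, rather than the exact systems described above.

```python
import numpy as np

class LinUCBArm:
    """One arm of disjoint LinUCB [Li et al., 2010]: a ridge-regression estimate of the
    reward as a linear function of the d-dimensional user/item context."""
    def __init__(self, d, alpha=1.0):
        self.alpha = alpha
        self.A = np.eye(d)        # X^T X + I
        self.b = np.zeros(d)      # X^T y

    def ucb(self, x):
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b
        return float(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x

def choose(arms, x):
    """Recommend the item (arm) with the highest upper confidence bound for context x."""
    return max(range(len(arms)), key=lambda a: arms[a].ucb(x))

# Toy usage with made-up contexts and placeholder click feedback.
rng = np.random.default_rng(3)
arms = [LinUCBArm(d=5) for _ in range(4)]
for _ in range(1000):
    x = rng.normal(size=5)                 # current user context
    a = choose(arms, x)
    reward = float(rng.random() < 0.5)     # placeholder feedback (click / no click)
    arms[a].update(x, reward)
```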
2.5 Influence Maximization

The authors of [Vaswani et al., 2017] consider influence maximization (IM) in social networks, which is the problem of maximizing the number of users who become aware of a product by selecting a set of seed users to expose the product to. They propose a novel parametrization that not only makes the framework agnostic to the underlying diffusion model, but is also statistically efficient to learn from data. They give a corresponding monotone, submodular surrogate function, and show that it is a good approximation to the original IM objective. They also consider the case of a new marketer looking to exploit an existing social network, while simultaneously learning the factors governing information propagation; for this, they develop a LinUCB-based bandit algorithm. The authors of [Wen et al., 2017] also study the online influence maximization problem in social networks, but under the independent cascade model. Specifically, they try to learn the set of best seeds, or influencers, in a social network online while repeatedly interacting with it. They address the challenges of a combinatorial action space, since the number of feasible influencer sets grows exponentially with the maximum number of influencers, and of limited feedback, since only the influenced portion of the network is observed. They propose and analyze IMLinUCB, a computationally efficient UCB-based algorithm.

2.6 Information Retrieval

The authors of [Losada et al., 2017] argue that the iterative selection process in Information Retrieval can be naturally modeled as a contextual bandit problem: casting document judging as a multi-armed bandit problem leads to highly effective adjudication methods. Under this bandit allocation framework, they consider stationary and non-stationary models and propose seven new document adjudication methods (five stationary methods and two non-stationary variants). This comparative study includes existing methods designed for pooling-based evaluation and existing methods designed for metasearch. In mobile information retrieval, the authors of [Bouneffouf et al., 2013] introduce an algorithm that tackles the exploration/exploitation dilemma in the Context-Based Information Retrieval (CBIR) area. It is based on dynamic exploration/exploitation and can adaptively balance the two aspects by deciding which user situation is most relevant for exploration or exploitation. They conduct evaluations with mobile users within a deliberately designed online framework.

2.7 Dialogue Systems

Dialogue response selection. Dialogue response selection is an important step towards natural response generation in conversational agents. Existing work on conversational models mainly focuses on offline supervised learning using a large set of context-response pairs. In [Liu et al., 2018], the authors focus on online learning of response selection in dialog systems. They propose a contextual multi-armed bandit model with a nonlinear reward function that uses distributed representations of text for online response selection. A bidirectional LSTM is used to produce the distributed representations of the dialog context and responses, which serve as the input to a contextual bandit. They propose a customized Thompson Sampling method that is applied in a polynomial feature space to approximate the reward.
Pro-activity dialogue systems. An objective of pro-activity in dialogue systems is to enhance the usability of conversational agents by enabling them to initiate conversation on their own. While dialogue systems have become increasingly popular during the last couple of years, current task-oriented dialogue systems are still mainly reactive, as users tend to initiate conversations. The authors of [Silander and others, 2018] propose to introduce the paradigm of contextual bandits as a framework for proactive dialog systems. Contextual bandits have been the model of choice for the problem of reward maximization with partial feedback, since they fit the task description well. The authors also introduce the notion of memory into this paradigm, proposing two differentiable memory models that act as parts of the parametric reward estimation function. The first one, Convolutional Selective Memory Networks, uses a selection of past interactions as part of the decision support. The second model, called Contextual Attentive Memory Network, implements a differentiable attention mechanism over the past interactions of the agent. The goal is to generalize the classic model of contextual bandits to settings where temporal information needs to be incorporated and leveraged in a learnable manner.

Multi-domain dialogue systems. Building multi-domain dialogue agents is a challenging task and an open problem in modern AI. Within the domain of dialogue, the ability to orchestrate multiple independently trained dialog agents, or skills, to create a unified system is of particular significance. In [Upadhyay et al., 2018], the authors study the task of online posterior dialogue orchestration, where they define posterior orchestration as the task of selecting the subset of skills which most appropriately answers a user input, using features extracted from both the user input and the individual skills. To account for the various costs associated with extracting skill features, they consider online posterior orchestration under a skill execution budget. This setting is formalized as the Context-Attentive Bandit with Observations, a variant of context-attentive bandits, and is evaluated on simulated non-conversational and proprietary conversational datasets.

2.8 Anomaly Detection

Anomaly detection on attributed networks is concerned with finding nodes whose behaviors deviate significantly from those of the majority of nodes. The authors of [Ding et al., 2019] investigate the problem of anomaly detection in an interactive setting, allowing the system to proactively communicate with a human expert by making a limited number of queries about ground-truth anomalies. Their objective is to maximize the number of true anomalies presented to the human expert under a given query budget. Along this line, they formulate the problem through the principled multi-armed bandit framework and develop a novel collaborative contextual bandit algorithm that explicitly models the nodal attributes and node dependencies seamlessly in a joint framework, and handles the exploration-exploitation dilemma when querying anomalies of different types. Credit card transactions predicted to be fraudulent by automated detection systems are typically handed over to human experts for verification. To limit costs, it is standard practice to select only the most suspicious transactions for investigation. The authors of [Soemers et al., 2018] argue that a trade-off between exploration and exploitation is imperative to enable adaptation to changes in behavior: exploration consists of selecting and investigating transactions with the purpose of improving the predictive models, while exploitation consists of investigating transactions detected to be suspicious. Modeling the detection of fraudulent transactions as rewarding, they use an incremental regression tree learner to create clusters of transactions with similar expected rewards. This enables the use of a contextual multi-armed bandit (CMAB) algorithm to provide the exploration/exploitation trade-off.

2.9 Telecommunication

In [Boldrini et al., 2018], a multi-armed bandit model is used to describe the problem of best wireless network selection by a multi-Radio Access Technology (multi-RAT) device, with the goal of maximizing the quality perceived by the final user. The proposed model extends the classical MAB model in a twofold manner: first, it foresees two different actions, to measure and to use; second, it allows actions to span multiple time steps. Two new algorithms designed to take advantage of the higher flexibility provided by this muMAB model are also introduced. The first one, referred to as measure-use-UCB1, is derived from the UCB1 algorithm, while the second one, referred to as Measure with Logarithmic Interval, is specifically designed for the new model to take advantage of the new measure action while aggressively using the best arm. The authors of [Kerkouche et al., 2018] demonstrate the possibility of optimizing the performance of the Long Range Wide Area Network technology. They suggest that nodes use multi-armed bandit algorithms to select their communication parameters (spreading factor and emission power). Evaluations show that such learning methods manage the trade-off between energy consumption and packet loss much better than an Adaptive Data Rate algorithm that adapts spreading factors and transmission powers on the basis of Signal to Interference and Noise Ratio values.

2.10 Bandit in Real-Life Applications: Summary and Future Directions

Table 1: Bandit for Real-Life Applications — a checkmark matrix indicating, for each application domain (Healthcare, Finance, Dynamic pricing, Recommender systems, Influence maximization, Dialogue systems, Telecommunication, Anomaly detection), which bandit settings have been used: MAB, non-stationary MAB, CMAB, or non-stationary CMAB.
Table 1 provides a summary of the bandit problem formulations used in each of the above application domains. We see that, for example, the non-stationary bandit was not used in healthcare applications, perhaps because no significant change was assumed to happen in the environment in the process of making the treatment decision, i.e., no transition in the state of the patient; such a transition, if it occurred, would be better modeled using reinforcement learning rather than a non-stationary bandit. There are clearly other domains where the non-stationary bandit is a more appropriate setting, but it appears that this setting has not yet been much investigated in the healthcare domain. Anomaly detection, for example, is a domain where the non-stationary contextual bandit could be used, since in this setting the anomaly could be adversarial, which means that any bandit applied to this setting should have some kind of drift condition in order to adapt to new types of attack. Another observation is that none of the existing work has tried to develop an algorithm that could solve these different tasks at the same time, or apply the knowledge obtained in one domain to another domain, thus opening a direction of research on multi-task and transfer learning in the bandit setting. Furthermore, given the online nature of the bandit problem, continuous, or lifelong, learning would be a natural next step, adapting the model learned in previous tasks to the new one while still remembering how to perform earlier tasks, thus avoiding the problem of "catastrophic forgetting".
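One standard way to obtain the kind of drift-awareness discussed above is to base the arm statistics on a sliding window of recent feedback. The sketch below is a generic sliding-window UCB (an illustrative construction with a placeholder window length, not an algorithm proposed in the surveyed papers).

```python
import math
from collections import deque

class SlidingWindowUCB:
    """UCB computed over a sliding window: only the most recent `window` plays are
    used, so the policy can track reward distributions that drift over time."""
    def __init__(self, n_arms, window=200):
        self.history = deque()     # (arm, reward) pairs, oldest first
        self.window = window
        self.n_arms = n_arms
        self.t = 0

    def select(self):
        self.t += 1
        counts = [0] * self.n_arms
        sums = [0.0] * self.n_arms
        for arm, r in self.history:
            counts[arm] += 1
            sums[arm] += r
        for arm in range(self.n_arms):          # play any arm unseen within the window
            if counts[arm] == 0:
                return arm
        horizon = min(self.t, self.window)
        return max(range(self.n_arms),
                   key=lambda a: sums[a] / counts[a]
                   + math.sqrt(2.0 * math.log(horizon) / counts[a]))

    def update(self, arm, reward):
        self.history.append((arm, reward))
        if len(self.history) > self.window:     # forget observations that fell out of the window
            self.history.popleft()
```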
3 Bandit for Better Machine Learning

In this section we describe how bandit algorithms can be used to improve other algorithms, e.g., various machine-learning techniques.

3.1 Algorithm Selection

Algorithm selection is typically based on models of algorithm performance, learned during a separate offline training sequence, which can be prohibitively expensive. More recent work adopted an online approach, in which a performance model is iteratively updated and used to guide selection on a sequence of problem instances. The resulting exploration-exploitation trade-off was represented as a bandit problem with expert advice, using an existing solver for this game, but this required an arbitrary bound on algorithm runtimes, thus invalidating the optimal regret of the solver. In [Gagliolo and Schmidhuber, 2010], a simpler framework was proposed for representing algorithm selection as a bandit problem, using partial information and an unknown bound on losses.

3.2 Hyperparameter Optimization

The performance of machine learning algorithms depends critically on identifying a good set of hyperparameters. While recent approaches use Bayesian optimization to adaptively select good hyperparameter configurations, [Li et al., 2016] focus instead on speeding up random search through adaptive resource allocation and early stopping. They formulated hyperparameter optimization as a pure-exploration, non-stochastic, infinite-armed bandit problem in which predefined resources, such as iterations, data samples, or features, are allocated to randomly sampled configurations. This work introduced a novel algorithm, Hyperband, for this framework and analyzed its theoretical properties, providing several desirable guarantees. Furthermore, Hyperband was compared with popular Bayesian optimization methods on a suite of hyperparameter optimization problems; it was observed that Hyperband can provide more than an order-of-magnitude speedup over its competitors on a variety of deep-learning and kernel-based learning problems.
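The resource-allocation idea behind Hyperband can be illustrated with the successive-halving subroutine it builds on: randomly sampled configurations receive a small budget, and only the best fraction survives to the next rung with a larger budget. The sketch below is a simplified illustration; `sample_config` and `evaluate` are hypothetical stand-ins for drawing a hyperparameter configuration and for training/evaluating a model under a given budget.

```python
import math
import random

def successive_halving(sample_config, evaluate, n_configs=27, min_budget=1, eta=3):
    """Keep the best 1/eta of the configurations at every rung, multiplying the
    per-configuration budget by eta, until a single configuration remains."""
    configs = [sample_config() for _ in range(n_configs)]
    budget = min_budget
    while len(configs) > 1:
        ranked = sorted(configs, key=lambda c: evaluate(c, budget))   # lower loss is better
        configs = ranked[:max(1, len(configs) // eta)]
        budget *= eta
    return configs[0]

# Toy usage: configurations are learning rates; the hypothetical evaluate() returns a
# noisy loss that shrinks as the budget (e.g., number of epochs) grows.
rng = random.Random(4)
best_lr = successive_halving(
    sample_config=lambda: 10 ** rng.uniform(-4, 0),
    evaluate=lambda lr, b: abs(math.log10(lr) + 2) + rng.random() / b,
)
print(best_lr)
```

Hyperband itself runs several such brackets with different trade-offs between the number of configurations and the minimum budget, which hedges against losses that are unreliable at small budgets.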
3.3 Feature Selection

In classical online supervised learning, the true label of a sample is always revealed to the classifier, unlike in a bandit setting, where any wrong classification results in zero reward and only the single correct classification yields a reward of 1. The authors of [Wang et al., 2014] investigate the problem of online feature selection, where the aim is to make accurate predictions using only a small number of active features, using an epsilon-greedy algorithm. The authors of [Bouneffouf et al., 2017b] tackle the online feature selection problem by addressing the combinatorial optimization problem in the stochastic bandit setting with bandit feedback, utilizing the Thompson Sampling algorithm.

3.4 Bandit for Active Learning

Labelling all training examples in the supervised classification setting can be costly. Active learning strategies address this problem by selecting the most useful unlabelled examples to obtain labels for and to train a predictive model. The choice of examples to label can be seen as a dilemma between exploration and exploitation over the input space. In [Bouneffouf et al., 2014], a novel active learning strategy manages this compromise by modelling the active learning problem as a contextual bandit problem. They propose a sequential algorithm named Active Thompson Sampling (ATS) which, in each round, assigns a sampling distribution on the pool, samples one point from this distribution, and queries the oracle for this sample point's label. The authors of [Ganti and Gray, 2013] also propose a multi-armed-bandit-inspired, pool-based active learning algorithm for the problem of binary classification. They utilize ideas such as lower confidence bounds and self-concordant regularization from the multi-armed bandit literature to design their proposed algorithm. In each round, the proposed algorithm assigns a sampling distribution on the pool, samples one point from this distribution, and queries the oracle for the label of this sampled point.
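The pool-based loop shared by the two approaches above can be sketched as follows. This is a simplified illustration in the same spirit of sampling a query point from a distribution over the pool (not the exact ATS algorithm): here the distribution mixes predictive uncertainty with uniform exploration, the classifier uses the standard scikit-learn interface, and the data and oracle are placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def active_learning_loop(X_pool, oracle, n_queries=30, eps=0.2, seed=5):
    """Pool-based active learning sketch: each round builds a sampling distribution
    over the unlabelled pool, draws one point from it, and queries the oracle."""
    rng = np.random.default_rng(seed)
    labelled, labels = [], []
    for i in rng.choice(len(X_pool), size=2, replace=False):   # seed with two random points
        labelled.append(int(i))
        labels.append(oracle(X_pool[i]))
    for _ in range(n_queries):
        unlabelled = [i for i in range(len(X_pool)) if i not in labelled]
        if len(set(labels)) < 2:                               # cannot fit a classifier yet
            probs = np.full(len(unlabelled), 1.0 / len(unlabelled))
        else:
            model = LogisticRegression().fit(X_pool[labelled], labels)
            p = model.predict_proba(X_pool[unlabelled])[:, 1]
            uncertainty = 1.0 - np.abs(p - 0.5) * 2.0          # highest near the decision boundary
            probs = (1 - eps) * uncertainty / uncertainty.sum() + eps / len(unlabelled)
            probs = probs / probs.sum()
        pick = rng.choice(unlabelled, p=probs)                 # sample one point to label
        labelled.append(int(pick))
        labels.append(oracle(X_pool[pick]))
    return labelled, labels

# Toy usage with a made-up pool and a noiseless oracle.
X = np.random.default_rng(0).normal(size=(200, 2))
queried, y = active_learning_loop(X, oracle=lambda x: int(x[0] + x[1] > 0))
print(len(queried), "points labelled")
```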
3.5 Clustering

[Sublime and Lefebvre, 2018] considers collaborative clustering, a machine-learning paradigm concerned with the unsupervised analysis of complex multi-view data using several algorithms working together. Well-known applications of collaborative clustering include multi-view clustering and distributed data clustering, where several algorithms exchange information in order to mutually improve each other. One of the key issues with multi-view and collaborative clustering is to assess which collaborations are going to be beneficial or detrimental. Many solutions have been proposed for this problem, and all of them conclude that, unless two models are very close, it is difficult to predict in advance the result of a collaboration. To address this problem, the authors of [Sublime and Lefebvre, 2018] propose a collaborative peer-to-peer clustering algorithm based on the principle of non-stochastic multi-armed bandits to assess in real time which algorithms or views can bring useful information.

3.6 Reinforcement Learning

Autonomous cyber-physical systems play a large role in our lives. To ensure that agents behave in ways aligned with the values of the societies in which they operate, we must develop techniques that allow these agents not only to maximize their reward in an environment, but also to learn and follow the implicit constraints assumed by the society. In [Noothigattu et al., 2018], the authors study a setting where an agent can observe traces of the behavior of members of the society but has no access to the explicit set of constraints that give rise to the observed behavior. Instead, inverse reinforcement learning is used to learn such constraints, which are then combined with a possibly orthogonal value function through the use of a contextual-bandit-based orchestrator that picks a contextually appropriate choice between the two policies (constraint-based and environment-reward-based) when taking actions. The contextual bandit orchestrator allows the agent to mix policies in novel ways, taking the best actions from either the reward-maximizing or the constrained policy. [Laroche and Féraud, 2017] tackles the problem of online RL algorithm selection. A meta-algorithm is given as input a portfolio of several off-policy RL algorithms; it then determines, at the beginning of each new trajectory, which algorithm in the portfolio is in control of the behaviour during the next trajectory, in order to maximise the return. A novel meta-algorithm, called Epochal Stochastic Bandit Algorithm Selection, is proposed: its principle is to freeze the policy updates at each epoch and to leave a rebooted stochastic bandit in charge of the algorithm selection.

3.7 Bandit for Machine Learning: Summary and Future Directions

Table 2: Bandit in Machine Learning — a checkmark matrix indicating, for each machine-learning task (Algorithm selection, Hyperparameter optimization, Feature selection, Active learning, Clustering, Reinforcement learning), which bandit settings have been used: MAB, non-stationary MAB, CMAB, or non-stationary CMAB.

Table 2 summarizes the types of bandit problems used to solve the machine-learning problems mentioned above. We see, for example, that the contextual bandit was not used in feature selection or hyperparameter optimization. This observation could point to a direction for future work, where side information could be employed in feature selection. Also, the non-stationary bandit was rarely considered in these problem settings, which also suggests possible extensions of current work. For instance, the non-stationary contextual bandit could be useful in a non-stationary feature selection setting, where finding the right features is time-dependent and context-dependent as the environment keeps changing. Our main observation is also that each technique solves just one machine learning problem at a time; thus, the question is whether a bandit setting and algorithms can be developed to solve multiple machine learning problems simultaneously, and whether transfer and continual learning can be achieved in this setting. One solution could be to model all these problems in a combinatorial bandit framework, where the bandit algorithm would find the optimal solution for each problem at each iteration; thus, the combinatorial bandit could be further used as a tool for advancing automated machine learning.

4 Conclusions

In this article, we reviewed some of the most notable recent work on applications of the multi-armed bandit and the contextual bandit, both in real-life domains and in automated machine learning. We summarized, in an organized way (Tables 1 and 2), various existing applications by the types of bandit settings used, and discussed the advantages of using bandit techniques in each domain. We also briefly outlined several important open problems and promising future extensions.

In summary, the bandit framework, including both the multi-armed and the contextual bandit, is currently a very active and promising research area, with multiple novel techniques and applications emerging each year. We hope our survey can help the reader better understand some key aspects of this exciting field and get a better perspective on its notable advancements and future promises.
References

[Agrawal and Goyal, 2012] Shipra Agrawal and Navin Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. In COLT 2012 - The 25th Annual Conference on Learning Theory, June 25-27, 2012, Edinburgh, Scotland, pages 39.1–39.26, 2012.
[Agrawal and Goyal, 2013] Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. In ICML (3), pages 127–135, 2013.
[Allesiardo et al., 2014] Robin Allesiardo, Raphaël Féraud, and Djallel Bouneffouf. A neural networks committee for the contextual bandit problem. In Neural Information Processing - 21st International Conference, ICONIP 2014, Kuching, Malaysia, November 3-6, 2014, Proceedings, Part I, pages 374–381, 2014.
[Auer et al., 2002] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.
[Bastani and Bayati, 2015] Hamsa Bastani and Mohsen Bayati. Online decision-making with high-dimensional covariates. Available at SSRN 2661896, 2015.
[Boldrini et al., 2018] Stefano Boldrini, Luca De Nardis, Giuseppe Caso, Mai Le, Jocelyn Fiorina, and Maria-Gabriella Di Benedetto. muMAB: A multi-armed bandit model for wireless network selection. Algorithms, 11(2):13, 2018.
[Bouneffouf and Féraud, 2016] Djallel Bouneffouf and Raphaël Féraud. Multi-armed bandit problem with known trend. Neurocomputing, 205:16–21, 2016.
[Bouneffouf et al., 2012] Djallel Bouneffouf, Amel Bouzeghoub, and Alda Lopes Gançarski. A contextual-bandit algorithm for mobile context-aware recommender system. In International Conference on Neural Information Processing, pages 324–331. Springer, 2012.
[Bouneffouf et al., 2013] Djallel Bouneffouf, Amel Bouzeghoub, and Alda Lopes Gançarski. Contextual bandits for context-based information retrieval. In International Conference on Neural Information Processing, pages 35–42. Springer, 2013.
[Bouneffouf et al., 2014] Djallel Bouneffouf, Romain Laroche, Tanguy Urvoy, Raphael Féraud, and Robin Allesiardo. Contextual bandit for active learning: Active Thompson sampling. In International Conference on Neural Information Processing, pages 405–412. Springer, 2014.
[Bouneffouf et al., 2017a] Djallel Bouneffouf, Irina Rish, and Guillermo A. Cecchi. Bandit models of human behavior: Reward processing in mental disorders. In AGI, pages 237–248. Springer, 2017.
[Bouneffouf et al., 2017b] Djallel Bouneffouf, Irina Rish, Guillermo A. Cecchi, and Raphaël Féraud. Context attentive bandits: Contextual bandit with restricted context. In IJCAI 2017, Melbourne, Australia, August 19-25, 2017, pages 1468–1475, 2017.
[Bouneffouf, 2014] Djallel Bouneffouf. Freshness-aware Thompson sampling. In International Conference on Neural Information Processing, pages 373–380. Springer, 2014.
[Ding et al., 2019] Kaize Ding, Jundong Li, and Huan Liu. Interactive anomaly detection on attributed networks. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, WSDM '19, pages 357–365, New York, NY, USA, 2019. ACM.
[Durand et al., 2018] Audrey Durand, Charis Achilleos, Demetris Iacovides, Katerina Strati, Georgios D. Mitsis, and Joelle Pineau. Contextual bandits for adapting treatment in a mouse model of de novo carcinogenesis. In Machine Learning for Healthcare Conference, pages 67–82, 2018.
[Gagliolo and Schmidhuber, 2010] Matteo Gagliolo and Jürgen Schmidhuber. Algorithm selection as a bandit problem with unbounded losses. In Learning and Intelligent Optimization, 4th International Conference, LION 4, Venice, Italy, January 18-22, 2010, Selected Papers, pages 82–96, 2010.
[Ganti and Gray, 2013] Ravi Ganti and Alexander G. Gray. Building bridges: Viewing active learning from the multi-armed bandit lens. arXiv preprint arXiv:1309.6830, 2013.
[Huo and Fu, 2017] Xiaoguang Huo and Feng Fu. Risk-aware multi-armed bandit problem with application to portfolio selection. Royal Society Open Science, 4(11):171377, 2017.
[Kerkouche et al., 2018] Raouf Kerkouche, Réda Alami, Raphaël Féraud, Nadège Varsier, and Patrick Maillé. Node-based optimization of LoRa transmissions with multi-armed bandit algorithms. In ICT 2018, Saint Malo, France, June 26-28, 2018, pages 521–526, 2018.
[Lai and Robbins, 1985] T. L. Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.
[Langford and Zhang, 2008] John Langford and Tong Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in Neural Information Processing Systems, pages 817–824, 2008.
[Laroche and Féraud, 2017] Romain Laroche and Raphaël Féraud. Algorithm selection of off-policy reinforcement learning algorithm. CoRR, abs/1701.08810, 2017.
[Li et al., 2010] Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. CoRR, 2010.
[Li et al., 2016] Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. Hyperband: A novel bandit-based approach to hyperparameter optimization. arXiv preprint arXiv:1603.06560, 2016.
[Liu et al., 2018] Bing Liu, Tong Yu, Ian Lane, and Ole J. Mengshoel. Customized nonlinear bandits for online response selection in neural conversation models. In AAAI 2018, pages 5245–5252, 2018.
[Losada et al., 2017] David E. Losada, Javier Parapar, and Alvaro Barreiro. Multi-armed bandits for adjudicating documents in pooling-based evaluation of information retrieval systems. Information Processing & Management, 53(5):1005–1025, 2017.
[Mary et al., 2015] Jérémie Mary, Romaric Gaudel, and Philippe Preux. Bandits and recommender systems. In Machine Learning, Optimization, and Big Data - First International Workshop, MOD 2015, pages 325–336, 2015.
[Misra et al., 2018] Kanishka Misra, Eric M. Schwartz, and Jacob Abernethy. Dynamic online pricing with incomplete information using multi-armed bandit experiments. 2018.
[Mueller et al., 2018] Jonas Mueller, Vasilis Syrgkanis, and Matt Taddy. Low-rank bandit methods for high-dimensional dynamic pricing. arXiv preprint arXiv:1801.10242, 2018.
[Noothigattu et al., 2018] Ritesh Noothigattu, Djallel Bouneffouf, Nicholas Mattei, Rachita Chandra, Piyush Madan, Kush Varshney, Murray Campbell, Moninder Singh, and Francesca Rossi. Interpretable multi-objective reinforcement learning through policy orchestration. arXiv preprint arXiv:1809.08343, 2018.
[Shen et al., 2015] Weiwei Shen, Jun Wang, Yu-Gang Jiang, and Hongyuan Zha. Portfolio choices with orthogonal bandit learning. In Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.
[Silander and others, 2018] Tomi Silander et al. Contextual memory bandit for pro-active dialog engagement. 2018.
[Soemers et al., 2018] Dennis J. N. J. Soemers, Tim Brys, Kurt Driessens, Mark H. M. Winands, and Ann Nowé. Adapting to concept drift in credit card transaction data streams using contextual bandits and decision trees. In AAAI, 2018.
[Sublime and Lefebvre, 2018] Jérémie Sublime and Sylvain Lefebvre. Collaborative clustering through constrained networks using bandit optimization. In 2018 International Joint Conference on Neural Networks, IJCNN 2018, Rio de Janeiro, Brazil, July 8-13, 2018, pages 1–8, 2018.
[Upadhyay et al., 2018] Sohini Upadhyay, Mayank Agarwal, Djallel Bouneffouf, and Yasaman Khazaeni. A bandit approach to posterior dialog orchestration under a budget. 2018.
[Vaswani et al., 2017] Sharan Vaswani, Branislav Kveton, Zheng Wen, Mohammad Ghavamzadeh, Laks V. S. Lakshmanan, and Mark Schmidt. Model-independent online learning for influence maximization. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 3530–3539. JMLR.org, 2017.
[Wang et al., 2014] Jialei Wang, Peilin Zhao, Steven C. H. Hoi, and Rong Jin. Online feature selection and its applications. IEEE Transactions on Knowledge and Data Engineering, 26(3):698–710, 2014.
[Wen et al., 2017] Zheng Wen, Branislav Kveton, Michal Valko, and Sharan Vaswani. Online influence maximization under independent cascade model with semi-bandit feedback. In Advances in Neural Information Processing Systems, pages 3022–3032, 2017.
[Zhou et al., 2017] Qian Zhou, XiaoFang Zhang, Jin Xu, and Bin Liang. Large-scale bandit approaches for recommender systems. In International Conference on Neural Information Processing, pages 811–821. Springer, 2017.