dimensional space spanned by the latent product features, while simultaneously learning this span via online singular value decomposition of a carefully-crafted matrix containing the observed demands.

2.4 Recommender Systems
Recommender systems are frequently used in various applications to predict user preferences. However, they also face the exploration-exploitation dilemma when making a recommendation, since they need to exploit their knowledge about the previously chosen items the user is interested in, while also exploring new items the user may like. Authors in [Zhou et al., 2017] approach this challenge in the multi-armed bandit setting, especially for large-scale recommender systems that have a very large or infinite number of items. They propose two large-scale bandit approaches for situations where no prior information is available; continuous exploration in their approaches can address the cold-start problem in recommender systems. In context-aware recommender systems, most existing approaches focus on recommending relevant items to users, taking into account contextual information such as time, location, or social aspects. However, none of those approaches has considered the problem of the evolution of the user's content. In [Bouneffouf et al., 2012], the authors introduce an algorithm that takes these dynamics into account. It is based on dynamic exploration/exploitation and can adaptively balance the two aspects, deciding which situation is most relevant for exploration or exploitation. In this vein, [Bouneffouf, 2014] proposes to study the "freshness" of the user's content through the bandit problem, introducing an algorithm named Freshness-Aware Thompson Sampling that manages the recommendation of fresh documents according to the risk level of the user's situation.
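To make the exploration step above concrete, the sketch below shows a minimal Bernoulli Thompson sampling loop over a catalogue of items; the Beta posteriors and the get_click feedback function are illustrative assumptions and not the large-scale algorithms proposed in [Zhou et al., 2017].

```python
import random

def thompson_sampling_recommender(item_ids, get_click, n_rounds):
    """Bernoulli Thompson sampling over items: each item keeps a Beta posterior
    over its click probability; new items start from the uniform prior (cold start).

    get_click(item_id) is assumed to return 1 if the user clicked the item, else 0.
    """
    stats = {item: [1, 1] for item in item_ids}       # Beta(alpha, beta) per item
    for _ in range(n_rounds):
        # sample a plausible click rate per item and recommend the best sample
        samples = {item: random.betavariate(a, b) for item, (a, b) in stats.items()}
        item = max(samples, key=samples.get)
        click = get_click(item)
        stats[item][0] += click                       # posterior update on success
        stats[item][1] += 1 - click                   # posterior update on failure
    return stats
```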
2.5 Influence Maximization
Authors in [Vaswani et al., 2017] consider influence maximization (IM) in social networks, which is the problem of maximizing the number of users who become aware of a product by selecting a set of seed users to expose the product to. They propose a novel parametrization that not only makes the framework agnostic to the underlying diffusion model, but is also statistically efficient to learn from data. They give a corresponding monotone, submodular surrogate function and show that it is a good approximation to the original IM objective. They also consider the case of a new marketer looking to exploit an existing social network while simultaneously learning the factors governing information propagation; for this, they develop a LinUCB-based bandit algorithm. Authors in [Wen et al., 2017] also study the online influence maximization problem in social networks, but under the independent cascade model. Specifically, they try to learn the set of best seeds or influencers in a social network online while repeatedly interacting with it. They address the challenges of a combinatorial action space, since the number of feasible influencer sets grows exponentially with the maximum number of influencers, and of limited feedback, since only the influenced portion of the network is observed. They propose and analyze IMLinUCB, a computationally efficient UCB-based algorithm.
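Both lines of work rely on LinUCB-style optimism. As a point of reference, the following is a minimal sketch of a generic (disjoint-model) LinUCB selection rule over per-arm context vectors; it is a simplified illustration rather than IMLinUCB itself, and the exploration parameter alpha and feature dimension are assumptions.

```python
import numpy as np

class LinUCB:
    """Minimal disjoint-model LinUCB sketch: one ridge-regression model per arm."""

    def __init__(self, n_arms, dim, alpha=1.0):
        self.alpha = alpha                               # exploration strength (assumed)
        self.A = [np.eye(dim) for _ in range(n_arms)]    # per-arm Gram matrices
        self.b = [np.zeros(dim) for _ in range(n_arms)]  # per-arm reward statistics

    def select(self, contexts):
        """contexts: list of feature vectors, one per arm; returns the chosen arm index."""
        scores = []
        for A, b, x in zip(self.A, self.b, contexts):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                            # ridge-regression estimate
            scores.append(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(scores))

    def update(self, arm, context, reward):
        self.A[arm] += np.outer(context, context)
        self.b[arm] += reward * context
```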
2.6 Information Retrieval
Authors in [Losada et al., 2017] argue that the iterative selection process in Information Retrieval can be naturally modeled as a contextual bandit problem. Casting document judging as a multi-armed bandit problem leads to highly effective adjudication methods. Under this bandit allocation framework, they consider stationary and non-stationary models and propose seven new document adjudication methods (five stationary methods and two non-stationary variants). Their comparative study also includes existing methods designed for pooling-based evaluation and for metasearch. In mobile information retrieval, authors in [Bouneffouf et al., 2013] introduce an algorithm that tackles this dilemma in the Context-Based Information Retrieval (CBIR) area. It is based on dynamic exploration/exploitation and can adaptively balance the two aspects by deciding which user's situation is most relevant for exploration or exploitation. They conduct evaluations with mobile users within a deliberately designed online framework.

2.7 Dialogue Systems
Dialogue response selection. Dialogue response selection is an important step towards natural response generation in conversational agents. Existing work on conversational models mainly focuses on offline supervised learning using a large set of context-response pairs. In [Liu et al., 2018], the authors focus on online learning of response selection in dialog systems. They propose a contextual multi-armed bandit model with a nonlinear reward function that uses distributed representations of text for online response selection. A bidirectional LSTM is used to produce the distributed representations of dialog context and responses, which serve as the input to a contextual bandit. They propose a customized Thompson sampling method that is applied to a polynomial feature space to approximate the reward.
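A simplified way to picture the last step is linear Thompson sampling over pre-computed context-response feature vectors. The sketch below is such a generic sampler, not the exact model of [Liu et al., 2018], whose features come from bidirectional LSTM encoders and a polynomial feature map; the variance scale v and feature dimension are assumptions.

```python
import numpy as np

class LinearThompsonSampling:
    """Generic linear Thompson sampling over candidate feature vectors."""

    def __init__(self, dim, v=1.0):
        self.v = v                      # posterior-variance scale (assumed)
        self.B = np.eye(dim)            # precision matrix
        self.f = np.zeros(dim)          # accumulated reward-weighted features

    def select(self, candidate_features):
        """candidate_features: array of shape (n_candidates, dim); returns chosen index."""
        B_inv = np.linalg.inv(self.B)
        mu = B_inv @ self.f
        theta = np.random.multivariate_normal(mu, self.v ** 2 * B_inv)  # posterior sample
        return int(np.argmax(candidate_features @ theta))

    def update(self, features, reward):
        self.B += np.outer(features, features)
        self.f += reward * features
```

In the response-selection setting, each candidate response paired with the current dialog context would first be mapped to a feature vector by the encoder before being scored by such a sampler.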
Pro-active dialogue systems. An objective of pro-activity in dialogue systems is to enhance the usability of conversational agents by enabling them to initiate conversations on their own. While dialogue systems have become increasingly popular during the last couple of years, current task-oriented dialogue systems are still mainly reactive, as users tend to initiate the conversations. Authors of [Silander and others, 2018] propose to introduce the paradigm of contextual bandits as a framework for proactive dialog systems.
Contextual bandits have been the model of choice for reward maximization with partial feedback, since they fit the task description well. The authors also explore incorporating a notion of memory into this paradigm, proposing two differentiable memory models that act as parts of the parametric reward-estimation function. The first one, Convolutional Selective Memory Networks, uses a selection of past interactions as part of the decision support. The second model, called Contextual Attentive Memory Network, implements a differentiable attention mechanism over the past interactions of the agent. The goal is to generalize the classic contextual bandit model to settings where temporal information needs to be incorporated and leveraged in a learnable manner.

Multi-domain dialogue systems. Building multi-domain dialogue agents is a challenging task and an open problem in modern AI. Within the domain of dialogue, the ability to orchestrate multiple independently trained dialog agents, or skills, to create a unified system is of particular significance. In [Upadhyay et al., 2018], the authors study the task of online posterior dialogue orchestration, where they define posterior orchestration as the task of selecting the subset of skills which most appropriately answers a user input, using features extracted from both the user input and the individual skills. To account for the various costs associated with extracting skill features, they consider online posterior orchestration under a skill execution budget. This setting is formalized as Context-Attentive Bandit with Observations, a variant of context-attentive bandits, and evaluated on simulated non-conversational and proprietary conversational datasets.

2.8 Anomaly Detection
Anomaly detection on attributed networks is concerned with finding nodes whose behaviors deviate significantly from the majority of nodes. Authors in [Ding et al., 2019] investigate the problem of anomaly detection in an interactive setting, allowing the system to proactively communicate with a human expert by making a limited number of queries about ground-truth anomalies. Their objective is to maximize the number of true anomalies presented to the human expert by the time a given budget is used up. Along this line, they formulate the problem in the principled multi-armed bandit framework and develop a novel collaborative contextual bandit algorithm that explicitly models the nodal attributes and node dependencies seamlessly in a joint framework, and handles the exploration-exploitation dilemma when querying anomalies of different types. Credit card transactions predicted to be fraudulent by automated detection systems are typically handed over to human experts for verification. To limit costs, only the most suspicious transactions are typically selected for investigation; the authors of [Soemers et al., 2018] argue that a trade-off between exploration and exploitation is imperative to enable adaptation to changes in behavior. Exploration consists of the selection and investigation of transactions with the purpose of improving predictive models, and exploitation consists of investigating transactions detected to be suspicious. Modeling the detection of fraudulent transactions as rewarding, they use an incremental regression tree learner to create clusters of transactions with similar expected rewards. This enables the use of a contextual multi-armed bandit (CMAB) algorithm to provide the exploration/exploitation trade-off.

2.9 Telecommunication
In [Boldrini et al., 2018], a multi-armed bandit model was used to describe the problem of best wireless network selection by a multi-Radio Access Technology (multi-RAT) device, with the goal of maximizing the quality perceived by the final user. The proposed model extends the classical MAB model in a twofold manner. First, it foresees two different actions: to measure and to use; second, it allows actions to span multiple time steps. Two new algorithms designed to take advantage of the higher flexibility provided by the muMAB model are also introduced. The first one, referred to as measure-use-UCB1, is derived from the UCB1 algorithm, while the second one, referred to as Measure with Logarithmic Interval, is purposely designed for the new model so as to take advantage of the new measure action while aggressively using the best arm. The authors in [Kerkouche et al., 2018] demonstrate the possibility of optimizing the performance of the Long Range Wide Area Network technology. They suggest that nodes use multi-armed bandit algorithms to select the communication parameters (spreading factor and emission power). Evaluations show that such learning methods manage the trade-off between energy consumption and packet loss much better than an Adaptive Data Rate algorithm that adapts spreading factors and transmission powers on the basis of Signal to Interference and Noise Ratio values.
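Both muMAB algorithms build on the classical UCB1 index of [Auer et al., 2002]. The sketch below shows plain UCB1 (empirical mean plus an exploration bonus), without the measure/use extension; the pull interface and the toy Bernoulli arms are assumptions for illustration.

```python
import math
import random

def ucb1(pull, n_arms, horizon):
    """Minimal UCB1 sketch: pull(arm) is assumed to return an observed reward in [0, 1]."""
    counts = [0] * n_arms
    sums = [0.0] * n_arms

    for arm in range(n_arms):            # initialization: play each arm once
        sums[arm] += pull(arm)
        counts[arm] += 1

    for t in range(n_arms, horizon):
        # index = empirical mean + exploration bonus that shrinks as an arm is sampled
        ucb = [sums[a] / counts[a] + math.sqrt(2 * math.log(t + 1) / counts[a])
               for a in range(n_arms)]
        arm = max(range(n_arms), key=lambda a: ucb[a])
        sums[arm] += pull(arm)
        counts[arm] += 1
    return [sums[a] / counts[a] for a in range(n_arms)]

# toy usage with Bernoulli arms (illustrative success probabilities)
probs = [0.2, 0.5, 0.7]
means = ucb1(lambda a: float(random.random() < probs[a]), n_arms=3, horizon=2000)
```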
2.10 Bandit in Real-Life Applications: Summary and Future Directions

Table 1: Bandits for Real-Life Applications. For each application domain (Healthcare, Finance, Dynamic pricing, Recommender systems, Influence Maximization, Dialogue systems, Telecommunication), the table indicates which bandit settings (MAB, non-stationary MAB, CMAB, non-stationary CMAB) have been used in the surveyed work.
Table 1 summarizes the bandit formulations used in each of the above domain applications. We see that, for example, the non-stationary bandit was not used in healthcare applications, perhaps because no significant change was assumed to happen in the environment during the process of making the treatment decision, i.e., no transition in the state of the patient; such a transition, if it occurred, would be better modeled using reinforcement learning rather than a non-stationary bandit. There are clearly other domains where the non-stationary bandit is a more appropriate setting, but it looks like this setting has not yet been much investigated in the healthcare domain. For example, anomaly detection is a domain where a non-stationary contextual bandit could be used, since in this setting the anomaly could be adversarial, which means that any bandit applied to this setting should have some kind of drift condition in order to adapt to new types of attacks. Another observation is that none of the existing work tried to develop an algorithm that could solve these different tasks at the same time, or apply the knowledge obtained in one domain to another domain, thus opening a direction of research on multi-task and transfer learning in the bandit setting. Furthermore, given the online nature of the bandit problem, continual or lifelong learning would be a natural next step, adapting the model learned on previous tasks to the new one while still remembering how to perform the earlier tasks, thus avoiding the problem of "catastrophic forgetting".
3 Bandit for Better Machine Learning
In this section, we describe how bandit algorithms can be used to improve other algorithms, e.g., various machine-learning techniques.

3.1 Algorithm Selection
Algorithm selection is typically based on models of algorithm performance learned during a separate offline training sequence, which can be prohibitively expensive. More recent work adopted an online approach, in which a performance model is iteratively updated and used to guide selection on a sequence of problem instances. The resulting exploration-exploitation trade-off was represented as a bandit problem with expert advice, using an existing solver for this game, but this required an arbitrary bound on algorithm runtimes, thus invalidating the optimal regret of the solver. In [Gagliolo and Schmidhuber, 2010], a simpler framework was proposed for representing algorithm selection as a bandit problem, using partial information and an unknown bound on losses.
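A minimal way to picture algorithm selection as a bandit is to treat each candidate algorithm as an arm and let an adversarial-bandit rule decide which one to run on the next instance. The sketch below uses EXP3 as a standard example of such a rule; it is not necessarily the solver used in the cited work, and the reward mapping returned by run is an assumption.

```python
import math
import random

def exp3_algorithm_selection(run, n_algos, n_instances, gamma=0.1):
    """Treat each candidate algorithm as an arm and pick which one to run per instance.

    run(algo, instance) is assumed to return a reward in [0, 1]
    (e.g., a normalized measure of solution quality or inverse runtime).
    """
    weights = [1.0] * n_algos
    for instance in range(n_instances):
        total = sum(weights)
        probs = [(1 - gamma) * w / total + gamma / n_algos for w in weights]
        algo = random.choices(range(n_algos), weights=probs)[0]
        reward = run(algo, instance)
        # importance-weighted update: only the played arm's weight changes
        weights[algo] *= math.exp(gamma * reward / (probs[algo] * n_algos))
    return weights
```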
3.2 Hyperparameter Optimization
The performance of machine learning algorithms depends critically on identifying a good set of hyperparameters. While recent approaches use Bayesian optimization to adaptively select optimal hyperparameter configurations, [Li et al., 2016] focus instead on speeding up random search through adaptive resource allocation and early stopping. They formulated hyperparameter optimization as a pure-exploration non-stochastic infinite-armed bandit problem where predefined resources, such as iterations, data samples, or features, are allocated to randomly sampled configurations. This work introduced a novel algorithm, Hyperband, for this framework and analyzed its theoretical properties, providing several desirable guarantees. Furthermore, Hyperband was compared with popular Bayesian optimization methods on a suite of hyperparameter optimization problems; it was observed that Hyperband can provide more than an order-of-magnitude speedup over its competitors on a variety of deep-learning and kernel-based learning problems.
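Hyperband repeatedly invokes a successive-halving subroutine over randomly sampled configurations. The sketch below illustrates that subroutine only, not the full Hyperband bracket schedule; the sample_config and train_and_score interfaces and the elimination factor eta are assumptions.

```python
import random

def successive_halving(sample_config, train_and_score, n_configs=27, min_budget=1, eta=3):
    """Successive-halving sketch: give a small budget to many random configurations,
    then repeatedly keep the best 1/eta fraction and multiply their budget by eta.

    sample_config() returns a random hyperparameter configuration;
    train_and_score(config, budget) is assumed to return a validation score (higher is better).
    """
    configs = [sample_config() for _ in range(n_configs)]
    budget = min_budget
    while len(configs) > 1:
        scores = [train_and_score(c, budget) for c in configs]
        ranked = sorted(zip(scores, range(len(configs))), reverse=True)
        keep = max(1, len(configs) // eta)
        configs = [configs[i] for _, i in ranked[:keep]]   # survivors
        budget *= eta                                      # survivors get more resources
    return configs[0]

# toy usage: "configurations" are numbers; the score improves with budget and closeness to 0.5
best = successive_halving(
    sample_config=lambda: random.random(),
    train_and_score=lambda c, b: -abs(c - 0.5) + 0.01 * b,
)
```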
3.3 Feature Selection
In classical online supervised learning, the true label of a sample is always revealed to the classifier, unlike in a bandit setting, where any wrong classification results in zero reward and only the single correct classification yields reward 1. The authors of [Wang et al., 2014] investigate the problem of Online Feature Selection, where the aim is to make accurate predictions using only a small number of active features, using an epsilon-greedy algorithm. The authors of [Bouneffouf et al., 2017b] tackle the online feature selection problem by addressing the combinatorial optimization problem in the stochastic bandit setting with bandit feedback, utilizing the Thompson Sampling algorithm.
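As a rough illustration of the epsilon-greedy idea in this setting, the snippet below selects a small subset of active features either at random (exploration) or by their current estimated usefulness (exploitation); it is a simplification and not the exact procedure of [Wang et al., 2014] or [Bouneffouf et al., 2017b].

```python
import random

def epsilon_greedy_feature_subset(feature_scores, k, epsilon=0.1):
    """Pick k active features: with probability epsilon explore a random subset,
    otherwise exploit the features with the highest running usefulness estimates.

    feature_scores: dict mapping feature index -> current estimated usefulness.
    """
    features = list(feature_scores)
    if random.random() < epsilon:
        return random.sample(features, k)                               # exploration
    return sorted(features, key=feature_scores.get, reverse=True)[:k]   # exploitation

# after observing the bandit feedback for a round, the caller would update
# feature_scores for the features that were active in that round.
```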
3.4 Bandit for Active Learning
Labelling all training examples in a supervised classification setting can be costly. Active learning strategies address this problem by selecting the most useful unlabelled examples to obtain labels for and to train a predictive model. The choice of examples to label can be seen as a dilemma between exploration and exploitation over the input space. In [Bouneffouf et al., 2014], a novel active learning strategy manages this compromise by modelling the active learning problem as a contextual bandit problem. They propose a sequential algorithm named Active Thompson Sampling (ATS) which, in each round, assigns a sampling distribution on the pool, samples one point from this distribution, and queries the oracle for this sample point's label. The authors of [Ganti and Gray, 2013] also propose a multi-armed-bandit-inspired, pool-based active learning algorithm for the problem of binary classification. They utilize ideas such as lower confidence bounds and self-concordant regularization from the multi-armed bandit literature to design their proposed algorithm. In each round, the proposed algorithm assigns a sampling distribution on the pool, samples one point from this distribution, and queries the oracle for the label of this sampled point.
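The round structure shared by both works can be sketched as the following pool-based loop; the uncertainty-proportional weighting used here stands in for the actual ATS or lower-confidence-bound rules and is only an assumption for illustration, as are the model and oracle interfaces.

```python
import random

def pool_based_active_learning(pool, oracle, model, n_queries):
    """Skeleton of the loop described above: weight the pool, sample one point,
    query its label, and retrain.

    model is assumed to expose fit(X, y) and uncertainty(x) -> float;
    oracle(x) returns the true label of x.
    """
    labelled_X, labelled_y = [], []
    for _ in range(n_queries):
        # assign a sampling distribution on the pool (here: proportional to uncertainty)
        weights = [model.uncertainty(x) + 1e-9 for x in pool]
        idx = random.choices(range(len(pool)), weights=weights)[0]
        x = pool.pop(idx)                     # sample one point from the distribution
        labelled_X.append(x)
        labelled_y.append(oracle(x))          # query the oracle for its label
        model.fit(labelled_X, labelled_y)     # update the predictive model
    return model
```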
3.5 Clustering
[Sublime and Lefebvre, 2018] considers collaborative clustering, which is a machine-learning paradigm concerned with the unsupervised analysis of complex multi-view data using several algorithms working together. Well-known applications of collaborative clustering in-

Table 2: Bandits in Machine Learning. For each task (Algorithm Selection, Parameter Optimization, Feature Selection, Active Learning), the table indicates which bandit settings (MAB, non-stationary MAB, CMAB, non-stationary CMAB) have been used in the surveyed work.
References

[Agrawal and Goyal, 2012] Shipra Agrawal and Navin Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. In COLT 2012 - The 25th Annual Conference on Learning Theory, June 25-27, 2012, Edinburgh, Scotland, pages 39.1–39.26, 2012.

[Agrawal and Goyal, 2013] Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. In ICML (3), pages 127–135, 2013.

[Allesiardo et al., 2014] Robin Allesiardo, Raphaël Féraud, and Djallel Bouneffouf. A neural networks committee for the contextual bandit problem. In Neural Information Processing - 21st International Conference, ICONIP 2014, Kuching, Malaysia, November 3-6, 2014. Proceedings, Part I, pages 374–381, 2014.

[Auer et al., 2002] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.

[Bastani and Bayati, 2015] Hamsa Bastani and Mohsen Bayati. Online decision-making with high-dimensional covariates. Available at SSRN 2661896, 2015.

[Boldrini et al., 2018] Stefano Boldrini, Luca De Nardis, Giuseppe Caso, Mai Le, Jocelyn Fiorina, and Maria-Gabriella Di Benedetto. muMAB: A multi-armed bandit model for wireless network selection. Algorithms, 11(2):13, 2018.

[Bouneffouf and Féraud, 2016] Djallel Bouneffouf and Raphaël Féraud. Multi-armed bandit problem with known trend. Neurocomputing, 205:16–21, 2016.

[Bouneffouf et al., 2012] Djallel Bouneffouf, Amel Bouzeghoub, and Alda Lopes Gançarski. A contextual-bandit algorithm for mobile context-aware recommender system. In International Conference on Neural Information Processing, pages 324–331. Springer, 2012.

[Bouneffouf et al., 2013] Djallel Bouneffouf, Amel Bouzeghoub, and Alda Lopes Gançarski. Contextual bandits for context-based information retrieval. In International Conference on Neural Information Processing, pages 35–42. Springer, 2013.

[Bouneffouf et al., 2014] Djallel Bouneffouf, Romain Laroche, Tanguy Urvoy, Raphael Féraud, and Robin Allesiardo. Contextual bandit for active learning: Active Thompson sampling. In International Conference on Neural Information Processing, pages 405–412. Springer, 2014.

[Bouneffouf et al., 2017a] Djallel Bouneffouf, Irina Rish, and Guillermo A. Cecchi. Bandit models of human behavior: Reward processing in mental disorders. In AGI, pages 237–248. Springer, 2017.

[Bouneffouf et al., 2017b] Djallel Bouneffouf, Irina Rish, Guillermo A. Cecchi, and Raphaël Féraud. Context attentive bandits: Contextual bandit with restricted context. In IJCAI 2017, Melbourne, Australia, August 19-25, 2017, pages 1468–1475, 2017.

[Bouneffouf, 2014] Djallel Bouneffouf. Freshness-aware Thompson sampling. In International Conference on Neural Information Processing, pages 373–380. Springer, 2014.

[Ding et al., 2019] Kaize Ding, Jundong Li, and Huan Liu. Interactive anomaly detection on attributed networks. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, WSDM '19, pages 357–365, New York, NY, USA, 2019. ACM.

[Durand et al., 2018] Audrey Durand, Charis Achilleos, Demetris Iacovides, Katerina Strati, Georgios D. Mitsis, and Joelle Pineau. Contextual bandits for adapting treatment in a mouse model of de novo carcinogenesis. In Machine Learning for Healthcare Conference, pages 67–82, 2018.

[Gagliolo and Schmidhuber, 2010] Matteo Gagliolo and Jürgen Schmidhuber. Algorithm selection as a bandit problem with unbounded losses. In Learning and Intelligent Optimization, 4th International Conference, LION 4, Venice, Italy, January 18-22, 2010. Selected Papers, pages 82–96, 2010.

[Ganti and Gray, 2013] Ravi Ganti and Alexander G. Gray. Building bridges: Viewing active learning from the multi-armed bandit lens. arXiv preprint arXiv:1309.6830, 2013.

[Huo and Fu, 2017] Xiaoguang Huo and Feng Fu. Risk-aware multi-armed bandit problem with application to portfolio selection. Royal Society Open Science, 4(11):171377, 2017.

[Kerkouche et al., 2018] Raouf Kerkouche, Réda Alami, Raphaël Féraud, Nadège Varsier, and Patrick Maillé. Node-based optimization of LoRa transmissions with multi-armed bandit algorithms. In ICT 2018, Saint Malo, France, June 26-28, 2018, pages 521–526, 2018.

[Lai and Robbins, 1985] T. L. Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.

[Langford and Zhang, 2008] John Langford and Tong Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in Neural Information Processing Systems, pages 817–824, 2008.

[Laroche and Féraud, 2017] Romain Laroche and Raphaël Féraud. Algorithm selection of off-policy reinforcement learning algorithm. CoRR, abs/1701.08810, 2017.

[Li et al., 2010] Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. CoRR, 2010.

[Li et al., 2016] Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. Hyperband: A novel bandit-based approach to hyperparameter optimization. arXiv preprint arXiv:1603.06560, 2016.

[Liu et al., 2018] Bing Liu, Tong Yu, Ian Lane, and Ole J. Mengshoel. Customized nonlinear bandits for online response selection in neural conversation models. In AAAI 2018, pages 5245–5252, 2018.

[Losada et al., 2017] David E. Losada, Javier Parapar, and Alvaro Barreiro. Multi-armed bandits for adjudicating documents in pooling-based evaluation of information retrieval systems. Information Processing & Management, 53(5):1005–1025, 2017.

[Mary et al., 2015] Jérémie Mary, Romaric Gaudel, and Philippe Preux. Bandits and recommender systems. In Machine Learning, Optimization, and Big Data - First International Workshop, MOD 2015, pages 325–336, 2015.

[Misra et al., 2018] Kanishka Misra, Eric M. Schwartz, and Jacob Abernethy. Dynamic online pricing with incomplete information using multi-armed bandit experiments. 2018.

[Mueller et al., 2018] Jonas Mueller, Vasilis Syrgkanis, and Matt Taddy. Low-rank bandit methods for high-dimensional dynamic pricing. arXiv preprint arXiv:1801.10242, 2018.

[Noothigattu et al., 2018] Ritesh Noothigattu, Djallel Bouneffouf, Nicholas Mattei, Rachita Chandra, Piyush Madan, Kush Varshney, Murray Campbell, Moninder Singh, and Francesca Rossi. Interpretable multi-objective reinforcement learning through policy orchestration. arXiv preprint arXiv:1809.08343, 2018.

[Shen et al., 2015] Weiwei Shen, Jun Wang, Yu-Gang Jiang, and Hongyuan Zha. Portfolio choices with orthogonal bandit learning. In Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.

[Silander and others, 2018] Tomi Silander et al. Contextual memory bandit for pro-active dialog engagement. 2018.

[Soemers et al., 2018] Dennis J. N. J. Soemers, Tim Brys, Kurt Driessens, Mark H. M. Winands, and Ann Nowé. Adapting to concept drift in credit card transaction data streams using contextual bandits and decision trees. In AAAI, 2018.

[Sublime and Lefebvre, 2018] Jérémie Sublime and Sylvain Lefebvre. Collaborative clustering through constrained networks using bandit optimization. In 2018 International Joint Conference on Neural Networks, IJCNN 2018, Rio de Janeiro, Brazil, July 8-13, 2018, pages 1–8, 2018.

[Upadhyay et al., 2018] Sohini Upadhyay, Mayank Agarwal, Djallel Bouneffouf, and Yasaman Khazaeni. A bandit approach to posterior dialog orchestration under a budget. 2018.

[Vaswani et al., 2017] Sharan Vaswani, Branislav Kveton, Zheng Wen, Mohammad Ghavamzadeh, Laks V. S. Lakshmanan, and Mark Schmidt. Model-independent online learning for influence maximization. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 3530–3539. JMLR.org, 2017.

[Wang et al., 2014] Jialei Wang, Peilin Zhao, Steven C. H. Hoi, and Rong Jin. Online feature selection and its applications. IEEE Transactions on Knowledge and Data Engineering, 26(3):698–710, 2014.

[Wen et al., 2017] Zheng Wen, Branislav Kveton, Michal Valko, and Sharan Vaswani. Online influence maximization under independent cascade model with semi-bandit feedback. In Advances in Neural Information Processing Systems, pages 3022–3032, 2017.

[Zhou et al., 2017] Qian Zhou, XiaoFang Zhang, Jin Xu, and Bin Liang. Large-scale bandit approaches for recommender systems. In International Conference on Neural Information Processing, pages 811–821. Springer, 2017.