Search | arXiv e-print repository

Rethinking Knowledge Transfer in Learning Using Privileged Information

Authors: Danil Provodin, Bram van den Akker, Christina Katsimerou, Maurits Kaptein, Mykola Pechenizkiy

Abstract: In supervised machine learning, privileged information (PI) is information that is unavailable at inference, but is accessible during training time. Research on learning using privileged information (LUPI) aims to transfer the knowledge captured in PI onto a model that can perform inference without PI. It seems that this extra bit of information ought to make the resulting model better. However, finding conclusive theoretical or empirical evidence that supports the ability to transfer knowledge using PI has been challenging. In this paper, we critically examine the assumptions underlying existing theoretical analyses and argue that there is little theoretical justification for when LUPI should work. We analyze LUPI methods and reveal that apparent improvements in empirical risk of existing research may not directly result from PI. Instead, these improvements often stem from dataset anomalies or modifications in model design misguidedly attributed to PI. Our experiments for a wide variety of application domains further demonstrate that state-of-the-art LUPI approaches fail to effectively transfer knowledge from PI. Thus, we advocate for practitioners to exercise caution when working with PI to avoid unintended inductive biases.

Submitted 26 August, 2024; originally announced August 2024.

arXiv:2405.19017

Efficient Exploration in Average-Reward Constrained Reinforcement Learning: Achieving Near-Optimal Regret With Posterior Sampling

Authors: Danil Provodin, Maurits Kaptein, Mykola Pechenizkiy

Abstract: We present a new algorithm based on posterior sampling for learning in Constrained Markov Decision Processes (CMDP) in the infinite-horizon undiscounted setting. The algorithm achieves near-optimal regret bounds while being advantageous empirically compared to the existing algorithms. Our main theoretical result is a Bayesian regret bound for each cost component of $\tilde{O} (DS\sqrt{AT})$ for any communicating CMDP with $S$ states, $A$ actions, and diameter $D$. This regret bound matches the lower bound in order of time horizon $T$ and is the best-known regret bound for communicating CMDPs achieved by a computationally tractable algorithm. Empirical results show that our posterior sampling algorithm outperforms the existing algorithms for constrained reinforcement learning.

Submitted 29 May, 2024; originally announced May 2024.

Comments: To appear at ICML'24

arXiv:2309.15737

Provably Efficient Exploration in Constrained Reinforcement Learning: Posterior Sampling Is All You Need

Authors: Danil Provodin, Pratik Gajane, Mykola Pechenizkiy, Maurits Kaptein

Abstract: We present a new algorithm based on posterior sampling for learning in constrained Markov decision processes (CMDP) in the infinite-horizon undiscounted setting. The algorithm achieves near-optimal regret bounds while being advantageous empirically compared to the existing algorithms. Our main theoretical result is a Bayesian regret bound for each cost component of $\tilde{O} (HS\sqrt{AT})$ for any communicating CMDP with $S$ states, $A$ actions, and bound on the hitting time $H$. This regret bound matches the lower bound in order of time horizon $T$ and is the best-known regret bound for communicating CMDPs in the infinite-horizon undiscounted setting. Empirical results show that, despite its simplicity, our posterior sampling algorithm outperforms the existing algorithms for constrained reinforcement learning.

Submitted 27 September, 2023; originally announced September 2023.

arXiv:2209.03596

An Empirical Evaluation of Posterior Sampling for Constrained Reinforcement Learning

Authors: Danil Provodin, Pratik Gajane, Mykola Pechenizkiy, Maurits Kaptein

Abstract: We study a posterior sampling approach to efficient exploration in constrained reinforcement learning. As an alternative to existing algorithms, we propose two simple algorithms that are more efficient statistically, simpler to implement, and computationally cheaper. The first algorithm is based on a linear formulation of CMDP, and the second algorithm leverages the saddle-point formulation of CMDP. Our empirical results demonstrate that, despite its simplicity, posterior sampling achieves state-of-the-art performance and, in some cases, significantly outperforms optimistic algorithms.

Submitted 8 September, 2022; originally announced September 2022.

arXiv:2202.06657

The Impact of Batch Learning in Stochastic Linear Bandits

Authors: Danil Provodin, Pratik Gajane, Mykola Pechenizkiy, Maurits Kaptein

Abstract: We consider a special case of bandit problems, named batched bandits, in which an agent observes batches of responses over a certain time period. Unlike previous work, we consider a more practically relevant batch-centric scenario of batch learning. That is to say, we provide a policy-agnostic regret analysis and demonstrate upper and lower bounds for the regret of a candidate policy. Our main theoretical results show that the impact of batch learning is a multiplicative factor of batch size relative to the regret of online behavior. Primarily, we study two settings of the stochastic linear bandits: bandits with finitely and infinitely many arms. While the regret bounds are the same for both settings, the former setting results hold under milder assumptions. Also, we provide a more robust result for the 2-armed bandit problem as an important insight. Finally, we demonstrate the consistency of theoretical results by conducting empirical experiments and reflect on optimal batch size choice.

Submitted 1 September, 2022; v1 submitted 14 February, 2022; originally announced February 2022.

Comments: This is a longer version of the paper published at ICDM'22. arXiv admin note: text overlap with arXiv:2111.02071

arXiv:2111.02071

The Impact of Batch Learning in Stochastic Bandits

Authors: Danil Provodin, Pratik Gajane, Mykola Pechenizkiy, Maurits Kaptein

Abstract: We consider a special case of bandit problems, namely batched bandits. Motivated by natural restrictions of recommender systems and e-commerce platforms, we assume that a learning agent observes responses batched in groups over a certain time period. Unlike previous work, we consider a more practically relevant batch-centric scenario of batch learning. We provide a policy-agnostic regret analysis and demonstrate upper and lower bounds for the regret of a candidate policy. Our main theoretical results show that the impact of batch learning can be measured in terms of online behavior. Finally, we demonstrate the consistency of theoretical results by conducting empirical experiments and reflect on the optimal batch size choice.

Submitted 3 November, 2021; originally announced November 2021.

Comments: To appear at the workshop on the Ecological Theory of Reinforcement Learning, NeurIPS 2021

arXiv:1908.07808

Exploring Offline Policy Evaluation for the Continuous-Armed Bandit Problem

Authors: Jules Kruijswijk, Petri Parvinen, Maurits Kaptein

Abstract: The (contextual) multi-armed bandit problem (MAB) provides a formalization of sequential decision-making which has many applications. However, validly evaluating MAB policies is challenging; we either resort to simulations which inherently include debatable assumptions, or we resort to expensive field trials. Recently an offline evaluation method has been suggested that is based on empirical data, thus relaxing the assumptions, and can be used to evaluate multiple competing policies in parallel. This method is, however, not directly suited for the continuous-armed bandit (CAB) problem, an often-encountered version of the MAB problem in which the action set is continuous instead of discrete. We propose and evaluate an extension of the existing method such that it can be used to evaluate CAB policies. We empirically demonstrate that our method provides a relatively consistent ranking of policies. Furthermore, we detail how our method can be used to select policies in a real-life CAB problem.

Submitted 21 August, 2019; originally announced August 2019.

arXiv:1904.09339

Continuous-Time Birth-Death MCMC for Bayesian Regression Tree Models

Authors: Reza Mohammadi, Matthew Pratola, Maurits Kaptein

Abstract: Decision trees are flexible models that are well suited for many statistical regression problems. In a Bayesian framework for regression trees, Markov Chain Monte Carlo (MCMC) search algorithms are required to generate samples of tree models according to their posterior probabilities. The critical component of such an MCMC algorithm is to construct good Metropolis-Hastings steps for updating the tree topology. However, such algorithms frequently suffer from local mode stickiness and poor mixing. As a result, the algorithms are slow to converge. Hitherto, authors have primarily used discrete-time birth/death mechanisms for Bayesian (sums of) regression tree models to explore the model space. These algorithms are efficient only if the acceptance rate is high, which is not always the case. Here we overcome this issue by developing a new search algorithm which is based on a continuous-time birth-death Markov process. This search algorithm explores the model space by jumping between parameter spaces corresponding to different tree structures. In the proposed algorithm, the moves between models are always accepted, which can dramatically improve the convergence and mixing properties of the MCMC algorithm. We provide theoretical support of the algorithm for Bayesian regression tree models and demonstrate its performance.

Submitted 26 October, 2020; v1 submitted 19 April, 2019; originally announced April 2019.

Comments: Published at http://jmlr.org/papers/v21/19-307 in the Journal of Machine Learning Research (https://www.jmlr.org)

Journal ref: Journal of Machine Learning Research 2020, Vol. 21, No. 201, 1-26

arXiv:1811.01926

contextual: Evaluating Contextual Multi-Armed Bandit Problems in R

Authors: Robin van Emden, Maurits Kaptein

Abstract: Over the past decade, contextual bandit algorithms have been gaining in popularity due to their effectiveness and flexibility in solving sequential decision problems, from online advertising and finance to clinical trial design and personalized medicine. At the same time, there are, as of yet, surprisingly few options that enable researchers and practitioners to simulate and compare the wealth of new and existing bandit algorithms in a standardized way. To help close this gap between analytical research and empirical evaluation, the current paper introduces the object-oriented R package "contextual": a user-friendly and, through its object-oriented structure, easily extensible framework that facilitates parallelized comparison of contextual and context-free bandit policies through both simulation and offline analysis.

Submitted 1 January, 2020; v1 submitted 6 November, 2018; originally announced November 2018.

Comments: 55 pages, 12 figures

MSC Class: I.2.6; K.4.4; F.2.0 ACM Class: I.2.6; K.4.4; F.2.0

arXiv:1802.10529

Maximum likelihood estimation of a finite mixture of logistic regression models in a continuous data stream

Authors: Maurits Kaptein, Paul Ketelaar

Abstract: In marketing we are often confronted with a continuous stream of responses to marketing messages. Such streaming data provide invaluable information regarding message effectiveness and segmentation. However, streaming data are hard to analyze using conventional methods: their high volume and the fact that they are continuously augmented means that it takes considerable time to analyze them. We propose a method for estimating a finite mixture of logistic regression models which can be used to cluster customers based on a continuous stream of responses. This method, which we coin oFMLR, allows segments to be identified in data streams or extremely large static datasets. Contrary to black box algorithms, oFMLR provides model estimates that are directly interpretable. We first introduce oFMLR, explaining in passing general topics such as online estimation and the EM algorithm, making this paper a high level overview of possible methods of dealing with large data streams in marketing practice. Next, we discuss model convergence, identifiability, and relations to alternative, Bayesian, methods; we also identify more general issues that arise from dealing with continuously augmented data sets. Finally, we introduce the oFMLR [R] package and evaluate the method by numerical simulation and by analyzing a large customer clickstream dataset.

Submitted 28 February, 2018; originally announced February 2018.

Comments: 1 figure. Working paper including [R] package

arXiv:1602.06700

StreamingBandit; Experimenting with Bandit Policies

Authors: Jules Kruijswijk, Robin van Emden, Petri Parvinen, Maurits Kaptein

Abstract: A large number of statistical decision problems in the social sciences and beyond can be framed as a (contextual) multi-armed bandit problem. However, it is notoriously hard to develop and evaluate policies that tackle these types of problem, and to use such policies in applied studies. To address this issue, this paper introduces StreamingBandit, a Python web application for developing and testing bandit policies in field studies. StreamingBandit can sequentially select treatments using (online) policies in real time. Once StreamingBandit is implemented in an applied context, different policies can be tested, altered, nested, and compared. StreamingBandit makes it easy to apply a multitude of bandit policies for sequential allocation in field experiments, and allows for the quick development and re-use of novel policies. In this article, we detail the implementation logic of StreamingBandit and provide several examples of its use.

Submitted 4 September, 2018; v1 submitted 22 February, 2016; originally announced February 2016.

Comments: 47 pages, 15 figures, accepted for publication in Journal of Statistical Software (JSS)

arXiv:1502.00598

Lock in Feedback in Sequential Experiments

Authors: Maurits Kaptein, Davide Iannuzzi

Abstract: We often encounter situations in which an experimenter wants to find, by sequential experimentation, $x_{max} = \arg\max_{x} f(x)$, where $f(x)$ is a (possibly unknown) function of a well controllable variable $x$. Taking inspiration from physics and engineering, we have designed a new method to address this problem. In this paper, we first introduce the method in continuous time, and then present two algorithms for use in sequential experiments. Through a series of simulation studies, we show that the method is effective for finding maxima of unknown functions by experimentation, even when the maximum of the functions drifts or when the signal to noise ratio is low.

Submitted 12 January, 2016; v1 submitted 2 February, 2015; originally announced February 2015.

Comments: 20 Pages, 7 Figures

arXiv:1410.4009

Thompson sampling with the online bootstrap

Authors: Dean Eckles, Maurits Kaptein

Abstract: Thompson sampling provides a solution to bandit problems in which new observations are allocated to arms with the posterior probability that an arm is optimal. While sometimes easy to implement and asymptotically optimal, Thompson sampling can be computationally demanding in large scale bandit problems, and its performance is dependent on the model fit to the observed data. We introduce bootstrap Thompson sampling (BTS), a heuristic method for solving bandit problems which modifies Thompson sampling by replacing the posterior distribution used in Thompson sampling by a bootstrap distribution. We first explain BTS and show that the performance of BTS is competitive to Thompson sampling in the well-studied Bernoulli bandit case. Subsequently, we detail why BTS using the online bootstrap is more scalable than regular Thompson sampling, and we show through simulation that BTS is more robust to a misspecified error distribution. BTS is an appealing modification of Thompson sampling, especially when samples from the posterior are otherwise not available or are costly.

Submitted 15 October, 2014; originally announced October 2014.

Comments: 13 pages, 4 figures

MSC Class: 68W27; 62L05 ACM Class: G.3; I.2.6
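
The bootstrap Thompson sampling (BTS) idea summarized in the abstract above, replacing posterior draws with draws from a bootstrap distribution, can be illustrated with a minimal Python sketch for the Bernoulli bandit case. This is not the authors' implementation: the class and function names, the number of replicates, the initialization with one pseudo-success and one pseudo-failure, and the rule that each replicate includes a new observation with probability 1/2 are all assumptions made here for illustration.

```python
import random


class BootstrapThompsonSampling:
    """Illustrative BTS policy for a Bernoulli bandit (hypothetical names)."""

    def __init__(self, n_arms, n_replicates=100, seed=0):
        self.rng = random.Random(seed)
        self.n_arms = n_arms
        self.n_replicates = n_replicates
        # Assumption: every replicate starts with one pseudo-success and one
        # pseudo-failure so that its mean estimate is always defined.
        self.successes = [[1.0] * n_replicates for _ in range(n_arms)]
        self.failures = [[1.0] * n_replicates for _ in range(n_arms)]

    def select_arm(self):
        # For each arm, draw one bootstrap replicate uniformly at random and
        # play the arm whose drawn replicate has the highest mean estimate.
        best_arm, best_value = 0, -1.0
        for arm in range(self.n_arms):
            j = self.rng.randrange(self.n_replicates)
            s, f = self.successes[arm][j], self.failures[arm][j]
            value = s / (s + f)
            if value > best_value:
                best_arm, best_value = arm, value
        return best_arm

    def update(self, arm, reward):
        # Assumed online bootstrap scheme: each replicate of the played arm
        # includes the new observation with probability 1/2.
        for j in range(self.n_replicates):
            if self.rng.random() < 0.5:
                if reward:
                    self.successes[arm][j] += 1.0
                else:
                    self.failures[arm][j] += 1.0


def simulate(true_means, horizon=2000, seed=0):
    """Run the sketch on simulated Bernoulli arms; return pulls per arm."""
    env_rng = random.Random(seed + 1)
    policy = BootstrapThompsonSampling(len(true_means), seed=seed)
    pulls = [0] * len(true_means)
    for _ in range(horizon):
        arm = policy.select_arm()
        reward = env_rng.random() < true_means[arm]
        policy.update(arm, reward)
        pulls[arm] += 1
    return pulls


if __name__ == "__main__":
    print(simulate([0.2, 0.8]))
```

Each update touches only a fixed number of counters per observation, so no posterior sampling routine is needed; this is the sense in which such a scheme can stay cheap as data accumulate, under the assumptions stated above.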

Showing 1–13 of 13 results for author: Kaptein, M