Indexability of Finite State Restless Multi-Armed Bandit and Rollout Policy

Authors: Vishesh Mittal, Rahul Meshram, Deepak Dev, Surya Prakash

Abstract: We consider the finite-state restless multi-armed bandit problem. The decision maker can act on M out of N bandits in each time step. Playing an arm (the active action) yields a state-dependent reward, and an arm that is not played also yields a reward that depends on its state and action. The objective of the decision maker is to maximize the infinite-horizon discounted reward. The classical approach to restless bandits is the Whittle index policy, in which the M arms with the highest indices are played at each time step. Here, one first analyzes a relaxed constrained restless bandit problem; a Lagrangian relaxation then decouples it into N single-armed restless bandit problems. We analyze the single-armed restless bandit. To study the Whittle index policy, we show structural results on the single-armed bandit model. We define indexability and show indexability in special cases. We propose an alternative approach to verifying the indexability criterion for a single-armed bandit model using the value iteration algorithm. We demonstrate the performance of our algorithm with different examples. We provide insight into the conditions for indexability of restless bandits under different structural assumptions on the transition probability and reward matrices. We also study an online rollout policy, discuss the computational complexity of the algorithm, and compare it with the complexity of index computation. Numerical examples illustrate that the index policy and the rollout policy perform better than the myopic policy.

Submitted 30 April, 2023; originally announced May 2023.

Comments: 15 Pages, submitted to conference
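
The abstract above mentions checking indexability of a single-armed bandit with value iteration. The following is a minimal sketch of one way such a check could look, not the authors' algorithm: it assumes the standard definition in which a subsidy is paid for the passive action and the set of states where passive is optimal must grow as the subsidy increases. The function names, the discount factor `beta`, and the subsidy grid are all hypothetical, and a finite grid can only give evidence of indexability, not a proof.

```python
def q_values(P0, P1, r0, r1, subsidy, beta=0.9, tol=1e-10, max_iter=100_000):
    """Value iteration for one arm; returns (passive, active) Q-values per state.

    P0/P1 are transition matrices and r0/r1 reward vectors for the passive and
    active action. The subsidy is added to the passive reward (an assumption
    following the usual Whittle formulation).
    """
    n = len(r0)
    V = [0.0] * n

    def backup(V):
        q0 = [r0[s] + subsidy + beta * sum(P0[s][t] * V[t] for t in range(n))
              for s in range(n)]
        q1 = [r1[s] + beta * sum(P1[s][t] * V[t] for t in range(n))
              for s in range(n)]
        return q0, q1

    for _ in range(max_iter):
        q0, q1 = backup(V)
        V_new = [max(a, b) for a, b in zip(q0, q1)]
        converged = max(abs(a - b) for a, b in zip(V, V_new)) < tol
        V = V_new
        if converged:
            break
    return backup(V)


def passive_set(P0, P1, r0, r1, subsidy, beta=0.9):
    """States in which the passive action is optimal (ties go to passive)."""
    q0, q1 = q_values(P0, P1, r0, r1, subsidy, beta)
    return {s for s in range(len(r0)) if q0[s] >= q1[s]}


def looks_indexable_on_grid(P0, P1, r0, r1, subsidies, beta=0.9):
    """True if the passive set never shrinks as the subsidy increases on the grid."""
    previous = set()
    for lam in sorted(subsidies):
        current = passive_set(P0, P1, r0, r1, lam, beta)
        if not previous <= current:
            return False
        previous = current
    return True


# Toy two-state arm (made-up numbers): both actions share the same transitions,
# so only the immediate rewards decide which action is better.
P = [[0.7, 0.3], [0.4, 0.6]]
print(passive_set(P, P, [0.0, 0.0], [0.2, 1.0], subsidy=0.5))
print(looks_indexable_on_grid(P, P, [0.0, 0.0], [0.2, 1.0], [-0.5, 0.5, 1.5]))
```

In this toy arm the passive set grows from empty to all states as the subsidy rises, which is the monotone behaviour that indexability requires.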

arXiv:2108.00892 [pdf, other]

Indexability and Rollout Policy for Multi-State Partially Observable Restless Bandits

Authors: Rahul Meshram, Kesav Kaza

Abstract: Restless multi-armed bandits with partially observable states have applications in communication systems, age of information and recommendation systems. In this paper, we study multi-state partially observable restless bandit models. We consider three different models based on the information observable to the decision maker: 1) no information is observable from the actions on a bandit; 2) perfect information from the bandit is observable for only one action, and there is a fixed restart state, i.e., transitions occur from all other states to that state; 3) perfect state information is available to the decision maker for both actions on a bandit, and there are two restart states for the two actions. We develop the structural properties. We also show a threshold-type policy and indexability for models 2 and 3. We present a Monte Carlo (MC) rollout policy and use it for Whittle index computation in the case of model 2. We obtain a concentration bound on the value function in terms of the horizon length and the number of trajectories for the MC rollout policy. We derive an explicit index formula for model 3. Finally, we describe a Monte Carlo rollout policy for model 1, where it is difficult to show indexability. We present numerical examples using the myopic policy, the Monte Carlo rollout policy and the Whittle index policy. We observe that the Monte Carlo rollout policy is a good competitive policy relative to the myopic policy.

Submitted 29 July, 2021; originally announced August 2021.

Comments: 8 pages, submitted to CDC
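
The abstract above describes a Monte Carlo rollout policy whose accuracy depends on the horizon length and the number of trajectories. The following is a minimal sketch of the generic rollout idea on a fully observable toy problem, not the paper's partially observable bandit model: for each candidate first action, simulate several trajectories that follow a base policy afterwards, average the discounted returns, and pick the best action. The names `rollout_action`, `toy_step` and `base_policy`, and all numbers, are hypothetical.

```python
import random


def rollout_action(state, actions, step, base_policy,
                   horizon=20, n_traj=50, beta=0.9, rng=None):
    """Pick the first action with the highest average simulated discounted return.

    step(state, action, rng) must return (next_state, reward); base_policy(state)
    gives the action used after the first step.
    """
    rng = rng or random.Random(0)
    best_action, best_value = None, float("-inf")
    for a in actions:
        total = 0.0
        for _ in range(n_traj):
            s, act, discount, ret = state, a, 1.0, 0.0
            for _ in range(horizon):
                s, reward = step(s, act, rng)
                ret += discount * reward
                discount *= beta
                act = base_policy(s)  # follow the base policy after the first step
            total += ret
        value = total / n_traj
        if value > best_value:
            best_action, best_value = a, value
    return best_action, best_value


def toy_step(s, a, rng):
    """Deterministic toy dynamics: state 1 is absorbing and pays 1 per step."""
    if s == 1:
        return 1, 1.0
    if a == 1:
        return 1, 0.0   # moving pays nothing now but reaches state 1
    return 0, 0.1       # staying pays a small immediate reward


def base_policy(s):
    """Base policy used inside the simulation: always take action 0."""
    return 0


action, value = rollout_action(0, [0, 1], toy_step, base_policy, horizon=5, n_traj=10)
print(action, value)
```

In this toy problem a myopic rule would prefer action 0 for its immediate reward, while the rollout looks ahead and prefers action 1. The dynamics here are deterministic, so the number of trajectories does not matter; with random transitions, more trajectories reduce the variance of the estimate.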

arXiv:2102.04321 [pdf, ps, other]

Monte Carlo Rollout Policy for Recommendation Systems with Dynamic User Behavior

Authors: Rahul Meshram, Kesav Kaza

Abstract: We model online recommendation systems using the hidden Markov multi-state restless multi-armed bandit problem. To solve this, we present a Monte Carlo rollout policy. We illustrate numerically that the Monte Carlo rollout policy performs better than the myopic policy for arbitrary transition dynamics with no specific structure. However, when some structure is imposed on the transition dynamics, the myopic policy performs better than the Monte Carlo rollout policy.

Submitted 8 February, 2021; originally announced February 2021.

Comments: 5 Pages, 4 figures, conference COMSNETS 2021

arXiv:2007.12933 [pdf, ps, other]

Simulation Based Algorithms for Markov Decision Processes and Multi-Action Restless Bandits

Authors: Rahul Meshram, Kesav Kaza

Abstract: We consider multi-dimensional Markov decision processes and formulate a long-term discounted reward optimization problem. Two simulation-based algorithms, the Monte Carlo rollout policy and the parallel rollout policy, are studied, and various properties of these policies are discussed. We next consider a restless multi-armed bandit (RMAB) with a multi-dimensional state space and a multi-action bandit model. A standard RMAB has two actions for each arm, whereas in a multi-action RMAB there are more than two actions for each arm. A popular approach for RMABs is the Whittle index based heuristic policy. Indexability is an important requirement for using an index-based policy. Based on this, an RMAB is classified as indexable or non-indexable. Our interest is in the study of the Monte Carlo rollout policy for both indexable and non-indexable restless bandits. We first analyze a standard indexable RMAB (two-action model) and discuss an index-based policy approach. We present an approximate index computation algorithm using the Monte Carlo rollout policy. This algorithm's convergence is shown using a two-timescale stochastic approximation scheme. Later, we analyze the multi-action indexable RMAB and discuss the index-based policy approach. We also study non-indexable RMABs for both standard and multi-action bandits using the Monte Carlo rollout policy.

Submitted 25 July, 2020; originally announced July 2020.

Comments: 3 Figures

arXiv:1910.01860 [pdf, ps, other]

Online repeated posted price auctions with a demand side platform

Authors: Rahul Meshram, Kesav Kaza

Abstract: We consider an online ad network problem in which an ad exchange auctions ad slots and intermediaries called demand side platforms (DSPs) buy these ad slots for their clients (advertisers). An intermediary represents multiple advertisers. Different types of ad slots are auctioned by the ad exchange, e.g., video ads, banner ads, etc. We study repeated posted price auctions for homogeneous and heterogeneous items when there is an intermediary. In a posted price auction, the auctioneer sets a fixed reserve price. The buyer can accept the price and win the ad slot or reject the price. We analyze the system from the auctioneer's perspective and show that the optimal reserve price is dynamic for heterogeneous items. We also investigate the system from the intermediary's perspective and devise algorithms for scheduling advertisers. Often the advertisers have budget constraints and impression constraints. We formulate a revenue optimization problem at the intermediary and also consider the problem of scheduling advertisers with budget and impression constraints. Finally, we present a numerical study for the single seller and advertiser model which considers various valuation distributions such as uniform, exponential and lognormal.

Submitted 4 October, 2019; originally announced October 2019.

arXiv:1904.08962 [pdf, ps, other]

Constrained Restless Bandits for Dynamic Scheduling in Cyber-Physical Systems

Authors: Kesav Kaza, Rahul Meshram, Varun Mehta, S. N. Merchant

Abstract: This paper studies a class of constrained restless multi-armed bandits (CRMAB). The constraints take the form of a time-varying set of actions (the set of available arms). This variation can be either stochastic or semi-deterministic. Given a set of arms, a fixed number of them can be chosen to be played in each decision interval. The play of each arm yields a state-dependent reward. The current states of the arms are partially observable through binary feedback signals from the arms that are played. The current availability of arms is fully observable. The objective is to maximize the long-term cumulative reward. The uncertainty about the future availability of arms, along with partial state information, makes this objective challenging. Applications of CRMAB can be found in resource allocation in cyber-physical systems involving components with time-varying availability. First, this optimization problem is analyzed using Whittle's index policy. To this end, a constrained restless single-armed bandit is studied. It is shown to admit a threshold-type optimal policy and to be indexable. An algorithm to compute Whittle's index is presented. An alternative solution method with lower complexity is also presented in the form of an online rollout policy. A detailed discussion of the complexity of both schemes is also presented, which suggests that the online rollout policy with a short look-ahead is simpler to implement than Whittle's index computation. Further, upper bounds on the value function are derived in order to estimate the degree of sub-optimality of various solutions. The simulation study compares the performance of Whittle's index, online rollout, myopic and modified Whittle's index policies.

Submitted 6 September, 2021; v1 submitted 18 April, 2019; originally announced April 2019.

Comments: 17 pages, 2 figures

arXiv:1803.08651 [pdf, ps, other]

Learning Recommendations While Influencing Interests

Authors: Rahul Meshram, D. Manjunath, Nikhil Karamchandani

Abstract: Personalized recommendation systems (RS) are extensively used in many services. Many of these are based on learning algorithms where the RS uses the recommendation history and the user response to learn an optimal strategy. Further, these algorithms are based on the assumption that the user interests are rigid. Specifically, they do not account for the effect of the learning strategy on the evolution of the user interests. In this paper we develop influence models for a learning algorithm that is used to optimally recommend websites to web users. We adapt the model of \cite{Ioannidis10} to include an item-dependent reward to the RS from the suggestions that are accepted by the user. For this we first develop a static optimisation scheme for the case when all the parameters are known. Next we develop a stochastic approximation based learning scheme for the RS to learn the optimal strategy when the user profiles are not known. Finally, we describe several user-influence models for the learning algorithm and analyze their effect on the steady-state user interests and on the steady-state optimal strategy, as compared to the case when the users are not influenced.

Submitted 23 March, 2018; originally announced March 2018.

Comments: 13 pages, submitted to conference

arXiv:1801.01301 [pdf, ps, other]

Sequential Decision Making with Limited Observation Capability: Application to Wireless Networks

Authors: Kesav Kaza, Rahul Meshram, Varun Mehta, S. N. Merchant

Abstract: This work studies a generalized class of restless multi-armed bandits with hidden states that allows cumulative feedback, as opposed to the conventional instantaneous feedback. We call them lazy restless bandits (LRB), as the events of decision-making are sparser than the events of state transition. Hence, feedback after each decision event is the cumulative effect of the following state transition events. The states of arms are hidden from the decision-maker and rewards for actions are state dependent. The decision-maker needs to choose one arm in each decision interval, such that the long-term cumulative reward is maximized. As the states are hidden, the decision-maker maintains and updates its belief about them. It is shown that LRBs admit an optimal policy which has a threshold structure in belief space. The Whittle-index policy for solving the LRB problem is analyzed; indexability of LRBs is shown. Further, closed-form index expressions are provided for two sets of special cases; for more general cases, an algorithm for index computation is provided. An extensive simulation study is presented; Whittle-index, modified Whittle-index and myopic policies are compared. Lagrangian relaxation of the problem provides an upper bound on the optimal value function; it is used to assess the degree of sub-optimality of various policies.

Submitted 29 January, 2019; v1 submitted 4 January, 2018; originally announced January 2018.

arXiv:1603.09233 [pdf, ps, other]

Optimal Recommendation to Users that React: Online Learning for a Class of POMDPs

Authors: Rahul Meshram, Aditya Gopalan, D. Manjunath

Abstract: We describe and study a model for an Automated Online Recommendation System (AORS) in which a user's preferences can be time-dependent and can also depend on the history of past recommendations and play-outs. The three key features of the model that make it more realistic compared to existing models for recommendation systems are (1) user preference is inherently latent, (2) current recommendations can affect future preferences, and (3) it allows for the development of learning algorithms with provable performance guarantees. The problem is cast as an average-cost restless multi-armed bandit for a given user, with an independent partially observable Markov decision process (POMDP) for each item of content. We analyze the POMDP for a single arm, describe its structural properties, and characterize its optimal policy. We then develop a Thompson sampling-based online reinforcement learning algorithm to learn the parameters of the model and optimize utility from the binary responses of the users to continuous recommendations. We then analyze the performance of the learning algorithm and characterize the regret. Illustrative numerical results and directions for extension to the restless hidden Markov multi-armed bandit problem are also presented.

Submitted 30 March, 2016; originally announced March 2016.

Comments: 8 pages, submitted to conference

arXiv:1110.0310 [pdf, other]

Joint Routing, Scheduling And Power Control For Multihop Wireless Networks With Multiple Antennas

Authors: Harish Vangala, Rahul Meshram, Prof. Vinod Sharma

Abstract: We consider the Joint Routing, Scheduling and Power-control (JRSP) problem for multihop wireless networks (MHWN) with multiple antennas. We extend the problem and a (sub-optimal) heuristic solution method for JRSP in MHWN with single antennas. We present an iterative scheme to calculate link capacities (achievable rates) in the interference environment of the network using the SINR model. We then present the algorithm for solving the JRSP problem. This completes a feasible system model for MHWN when nodes have multiple antennas. We show that the gain we achieve by using multiple antennas in the network is linear both in optimal performance and in heuristic algorithmic performance.

Submitted 3 October, 2011; originally announced October 2011.

Comments: Submitted to NCC-2012. First Draft is here. Final version has many changes

Showing 1–10 of 10 results for author: Meshram, R