CSD311: Artificial Intelligence
This document discusses reinforcement learning and Markov decision processes. It defines reinforcement learning as a reward-driven trial-and-error process where an agent learns to interact with an environment to maximize rewards. Multi-armed bandit problems are introduced as a simple reinforcement learning problem where choosing actions can explore unknown reward distributions or exploit apparently best actions. Strategies for solving multi-armed bandits like epsilon-greedy and upper confidence bound methods are described. Markov decision processes generalize reinforcement learning problems by incorporating state-based representations and stochastic state transitions. The Q-learning algorithm is presented as a model-free method for learning action values in finite state Markov decision processes.
Reinforcement learning
Definition 4 (Reinforcement learning (RL))
A reward-driven trial-and-error process in which a system learns to interact with a complex environment to achieve rewarding outcomes. The action policy is typically learnt from data via the trial-and-error process and is not based on a user-defined heuristic function. RL is used in complex environments where it is easy to decide the reward value but difficult to reason about how to achieve the reward. For example, in chess it is trivial to know whether a game ended in a win for white or black or in a draw, but very hard to come up with a sequence of decisions (moves) that would ensure the reward with high probability. Any heuristic is likely to be a much poorer predictor of a good action sequence compared to learning the most successful (rewarding) decisions from the data, assuming large amounts of data are available.
Multi-armed bandits (MAB)
MAB: a situation where there are some machines (say m), for example slot machines, whose expected rewards are unknown. A player who plays the machines wants to maximize reward in the long run. Since the expected rewards are not known, the player has to explore the slot machines before it can play to maximize reward (exploit). Clearly, the best policy is to continuously play the machine with the highest expected reward, so the player has to design an algorithm/strategy that will maximize reward. This typically means finding a policy that decides between explore and exploit actions. The MAB is the simplest such problem since it is stateless: at any point the environment (in this case the machine behaviour) does not change.
Strategies for MAB I
Naive algorithm: Do a fixed number of trials for each machine and then choose the machine with the highest expectation and stick with it. Problems: how many trials in the exploration phase? What if the wrong machine was chosen after the trials?
ε-Greedy: Choose a random machine for a fraction ε of the trials; that is, each trial explores with probability ε and otherwise chooses the slot machine with the current highest payoff. The value of ε can be annealed, that is, start with a larger value and keep reducing it progressively. This strategy is not efficient in learning the payoffs of new machines (in a dynamic setting).
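A minimal Python sketch of the ε-greedy strategy (not from the slides; the machine count, payoff probabilities, and ε value below are illustrative, and rewards are assumed to be Bernoulli):

```python
import random

def epsilon_greedy(true_probs, epsilon=0.1, trials=10000):
    """Run epsilon-greedy on Bernoulli bandit machines with the given win probabilities."""
    m = len(true_probs)
    counts = [0] * m        # n_i: number of times machine i was played
    means = [0.0] * m       # m_i: running average payoff of machine i
    total_reward = 0.0
    for _ in range(trials):
        if random.random() < epsilon:            # explore with probability epsilon
            i = random.randrange(m)
        else:                                    # exploit the currently best machine
            i = max(range(m), key=lambda j: means[j])
        reward = 1.0 if random.random() < true_probs[i] else 0.0
        counts[i] += 1
        means[i] += (reward - means[i]) / counts[i]   # incremental mean update
        total_reward += reward
    return means, counts, total_reward

# Example: three machines whose payoff probabilities are unknown to the player.
means, counts, total = epsilon_greedy([0.2, 0.5, 0.7], epsilon=0.1, trials=10000)
print(means, counts, total)
```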
Strategies for MAB II
Upper bounding methods: Choose the machine with the highest statistical upper bound u_i, where u_i = m_i + c_i, m_i is the average payoff of machine i, and c_i = K × s_i/√n_i. Here K is a positive constant, s_i is the sample standard deviation, and n_i is the number of times the i-th machine was chosen. Note that c_i is K times the standard error (Central Limit Theorem). Explore and exploit are integrated in this policy: a machine that has been chosen fewer times is likely to have a higher c_i and therefore a higher u_i. The value of K mediates between explore and exploit; larger values of K bias the choice towards exploration.
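A sketch of this upper-bound selection rule under the same illustrative Bernoulli-reward assumption as above (the constant K and the payoff probabilities are made up):

```python
import math
import random

def ucb_choice(means, stds, counts, K=2.0):
    """Pick the machine with the highest upper bound u_i = m_i + K * s_i / sqrt(n_i)."""
    best, best_u = 0, float("-inf")
    for i, (m, s, n) in enumerate(zip(means, stds, counts)):
        if n == 0:                      # play every machine at least once
            return i
        u = m + K * s / math.sqrt(n)
        if u > best_u:
            best, best_u = i, u
    return best

def run_ucb(true_probs, K=2.0, trials=10000):
    m = len(true_probs)
    counts = [0] * m
    means = [0.0] * m
    sq_sums = [0.0] * m                 # sums of squared rewards, for the std estimate
    for _ in range(trials):
        stds = [math.sqrt(max(sq_sums[i] / counts[i] - means[i] ** 2, 0.0)) if counts[i] else 0.0
                for i in range(m)]
        i = ucb_choice(means, stds, counts, K)
        r = 1.0 if random.random() < true_probs[i] else 0.0
        counts[i] += 1
        means[i] += (r - means[i]) / counts[i]
        sq_sums[i] += r * r
    return means, counts

print(run_ucb([0.2, 0.5, 0.7]))
```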
State based RL
In MAB the policy could be decided based only on the agent's knowledge; the environment was fixed and did not change. More generally, the environment changes and the agent has to take the state into account while deciding the action. Examples:
Chess/Go/tic-tac-toe: the reward (1, -1, 0 for win, loss and draw respectively) is available only at the end of the game. The state is the board position and the actions are moves.
Driving agent: the state is the vector of sensor inputs; the actions are the driving actions (steer, accelerate, brake, or some combination of these).
Both current and past state-action pairs may be relevant for the reward. One goal of RL is to find the value of an action in a state regardless of when the reward actually arrives. Actions can then be chosen based on these action-values in that state.
Schematic of RL
In MAB, s_{t+1} = s_t and r_t is fixed and tied to the machine, which essentially means the state does not change over time. A slightly different version of MAB is one where new machines are added with their own payoff probabilities, but the payoff probabilities of the existing machines always remain the same. The most general version of RL is one where states change with time, rewards also change with time, and rewards are not available immediately but have to be inferred after a terminal state is reached.
Markov Decision Process (MDP) I
An MDP is a stochastic transition system where the current state has sufficient information to take an action and decide the reward for the action. An unfolding instance of an MDP can be represented as the sequence s_0 a_0 r_0, s_1 a_1 r_1, ..., which can be finite or infinite. A fragment s_i a_i r_i s_{i+1} means that in the i-th time step the MDP was in state s_i and did action a_i, which resulted in state s_{i+1} and immediate reward r_i. A notational variant used sometimes writes the reward as r_{i+1} instead of r_i.
Formally, an MDP is a four-tuple (S, A, P_a, R_a), a ∈ A.
S: a set of states; can be finite or infinite.
A: a set of actions.
P_a(s, s') = Pr(s_{t+1} = s' | s_t = s, a_t = a): the probability at time t of transiting from the current state s_t = s to state s_{t+1} = s' under action a.
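As a concrete illustration (not part of the slides), a small finite MDP can be written down directly as tables for P_a and R_a; the two-state, two-action numbers below are invented:

```python
import random

# Hypothetical two-state, two-action MDP: P[a][s][s_next] is the transition
# probability P_a(s, s'), and R[a][s][s_next] is the immediate reward R_a(s, s').
S = [0, 1]
A = ["stay", "move"]

P = {
    "stay": {0: {0: 0.9, 1: 0.1}, 1: {0: 0.0, 1: 1.0}},
    "move": {0: {0: 0.2, 1: 0.8}, 1: {0: 0.7, 1: 0.3}},
}
R = {
    "stay": {0: {0: 0.0, 1: 1.0}, 1: {0: 0.0, 1: 0.0}},
    "move": {0: {0: 0.0, 1: 2.0}, 1: {0: 5.0, 1: 0.0}},
}

def step(s, a):
    """Sample s' ~ P_a(s, .) and return (s', R_a(s, s'))."""
    next_states = list(P[a][s].keys())
    probs = list(P[a][s].values())
    s_next = random.choices(next_states, weights=probs, k=1)[0]
    return s_next, R[a][s][s_next]

print(step(0, "move"))
```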
Markov Decision Process (MDP) II
R_a(s, s'): the immediate reward when transiting from state s to state s' under action a.
A policy π is a (possibly probabilistic) mapping π : S → A. An MDP models a stochastic decision-making process, and the goal is to find a 'good' (ideally optimal) policy that will maximize the cumulative reward E[R_t | s_t, a_t] as t → ∞. For finite MDPs, t is bounded. The expected cumulative reward is given by
E[R_t | s_t, a_t] = Σ_{i=0}^{∞} γ^i E[r_{t+i} | s_t, a_t], where a_t = π(s_t),
and the expectation is over s_{t+1} ∼ P_{a_t}(s_t, s_{t+1}). Here 0 < γ < 1 is the discount factor, a parameter that models the trade-off between early and late rewards. For example, small values of γ promote maximizing immediate rewards, while values close to 1 emphasize delayed or long-term rewards.
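For illustration (the reward sequences and the value of γ below are made up), the discounted sum can be computed directly:

```python
def discounted_return(rewards, gamma=0.9):
    """Compute sum_i gamma**i * r_{t+i} for a finite reward sequence."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

# A reward of 1 received three steps in the future contributes 0.9**3 = 0.729,
# whereas the same reward received immediately contributes 1.0.
print(discounted_return([0, 0, 0, 1], gamma=0.9))   # 0.729
print(discounted_return([1, 0, 0, 0], gamma=0.9))   # 1.0
```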
Markov Decision Process (MDP) III
RL algorithms aim to learn the expected cumulative reward for each state-action pair so that the right action can be chosen in each state. When the number of possible pairs (s, a) is very large, the expected cumulative reward cannot be learnt separately for each pair; instead it is approximated by a function of (s_t, a_t).
Finite MDP, Q-algorithm
Q-learning is an algorithm to learn good/optimal actions when the state set S is finite. Q : S × A → R is a table that is updated in an iterative manner. The algorithm is model-free, that is, it does not have to learn a state-transition model, but it will not work for infinite state spaces. The update is done using Bellman's equation (see qlearning.ipynb for an example):
Q(s_t, a_t) ← (1 - α) Q(s_t, a_t) + α (r_t + γ max_a Q(s_{t+1}, a)),
where 0 < α ≤ 1 is the learning rate.
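A minimal tabular Q-learning sketch in Python. This is a generic illustration of the update above, not the contents of qlearning.ipynb; the environment interface (env.reset(), env.step(), env.actions) and the hyperparameter values are assumptions:

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning with an epsilon-greedy behaviour policy.

    Assumes env.reset() -> state, env.step(action) -> (next_state, reward, done),
    and env.actions giving the finite list of available actions.
    """
    Q = defaultdict(float)          # Q[(state, action)], defaults to 0.0

    def best_action(s):
        return max(env.actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy behaviour policy: explore with probability epsilon
            a = random.choice(env.actions) if random.random() < epsilon else best_action(s)
            s_next, r, done = env.step(a)
            # Bellman update: Q(s,a) <- (1-alpha) Q(s,a) + alpha (r + gamma max_a' Q(s',a'))
            target = r if done else r + gamma * max(Q[(s_next, a2)] for a2 in env.actions)
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
            s = s_next
    return Q
```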