CSD311: Artificial Intelligence

Reinforcement learning

Definition 4 (Reinforcement learning (RL))


A reward-driven trial-and-error process in which a system learns to
interact with a complex environment to achieve rewarding outcomes.
The action policy is typically learnt from data via this trial-and-error
process and is not based on a user-defined heuristic function.
RL is used in complex environments where it is easy to decide the
reward value but difficult to reason about how to achieve the reward.
For example, in chess it is trivial to know whether a game ended in a
win for white or black or in a draw, but very hard to come up with a
sequence of decisions (moves) that would ensure the reward with
high probability.
Any heuristic is likely to be a much poorer predictor of a good action
sequence than learning the most successful (rewarding) decisions
from the data, assuming large amounts of data are available.
Multi-armed bandits (MAB)

MAB: a situation where there are some machines (say m slot
machines) whose expected rewards are unknown. A player who plays
the machines wants to maximize reward in the long run.
Since the expected rewards are not known, the player has to explore
the slot machines before playing to maximize reward (exploit).
Clearly, the best policy is to continuously play the machine with the
highest expected reward.
The player has to design an algorithm/strategy that will maximize
reward. This typically means finding a policy that decides when to
explore and when to exploit.
The MAB is the simplest such problem since it is stateless: at any
point the environment (in this case the machine behaviour) does not
change.
Strategies for MAB I

Naive algorithm: Do a fixed number of trials on each machine, then
choose the machine with the highest estimated expectation and stick
with it. Problems: how many trials in the exploration phase? What if
the wrong machine was chosen after the trials?
ε-Greedy: On each trial, explore with probability ε by choosing a
random machine; otherwise exploit by choosing the slot machine with
the current highest estimated payoff (see the sketch below). The ε
value can be annealed, that is, start with a larger value and reduce it
progressively. This strategy is not efficient at learning the payoffs of
newly added machines (in a dynamic setting).
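
As a concrete illustration, here is a minimal Python sketch of the
ε-greedy strategy on a simulated bandit (not from the course material).
The three machines and the payoff probabilities in true_means are
invented for illustration; pull simulates one Bernoulli-reward play.

import random

# Hypothetical 3-armed bandit: each machine pays 1 with a probability that is
# unknown to the player. These payoff probabilities are invented for illustration.
true_means = [0.3, 0.5, 0.6]

def pull(machine):
    """Simulate one play of the given machine (Bernoulli reward)."""
    return 1.0 if random.random() < true_means[machine] else 0.0

def epsilon_greedy(n_trials=10000, epsilon=0.1):
    m = len(true_means)
    counts = [0] * m          # n_i: number of times machine i was played
    estimates = [0.0] * m     # m_i: running average payoff of machine i
    total = 0.0
    for _ in range(n_trials):
        if random.random() < epsilon:      # explore: pick a random machine
            choice = random.randrange(m)
        else:                              # exploit: pick the current best estimate
            choice = max(range(m), key=lambda i: estimates[i])
        r = pull(choice)
        counts[choice] += 1
        estimates[choice] += (r - estimates[choice]) / counts[choice]  # update average
        total += r
    return estimates, total

estimates, total = epsilon_greedy()
print("estimated payoffs:", [round(e, 3) for e in estimates])
print("total reward:", total)

Annealing ε would amount to replacing the fixed epsilon with a value
that decays with the trial index; the naive algorithm corresponds to a
fixed number of forced exploratory pulls per machine followed by ε = 0.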
Strategies for MAB II

Upper bounding methods: Choose the machine with the highest
statistical upper bound u_i = m_i + c_i, where m_i is the average
payoff of machine i and c_i = K × s_i/√n_i. Here K is a positive
constant, s_i is the sample standard deviation of machine i's payoffs,
and n_i is the number of times the i-th machine was chosen. Note
that c_i is K times the standard error of the mean (Central Limit
Theorem). Explore and exploit are integrated in this policy: a machine
that has been chosen fewer times is likely to have a higher c_i and
therefore a higher u_i. The value of K mediates between exploration
and exploitation; larger values of K bias the choice towards
exploration (see the sketch below).
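
A minimal sketch of the upper-bounding strategy under the same
simulated-bandit assumptions as above (the machines and the constant
K are invented for illustration). Each machine is played twice up front
so that the sample standard deviation is defined.

import math
import random

# Hypothetical machines with payoff probabilities unknown to the player.
true_means = [0.3, 0.5, 0.6]

def pull(machine):
    return 1.0 if random.random() < true_means[machine] else 0.0

def upper_bound_play(n_trials=10000, K=2.0):
    m = len(true_means)
    n = [0] * m        # n_i: number of plays of machine i
    s1 = [0.0] * m     # running sum of rewards of machine i
    s2 = [0.0] * m     # running sum of squared rewards (for the sample std dev)
    total = 0.0
    for t in range(n_trials):
        if t < 2 * m:
            i = t % m  # play every machine twice so s_i and n_i are defined
        else:
            def u(i):
                mean = s1[i] / n[i]                                   # m_i
                var = max((s2[i] - n[i] * mean * mean) / (n[i] - 1), 0.0)
                se = math.sqrt(var) / math.sqrt(n[i])                 # s_i / sqrt(n_i)
                return mean + K * max(se, 1e-3)                       # u_i = m_i + c_i
            i = max(range(m), key=u)
        r = pull(i)
        n[i] += 1
        s1[i] += r
        s2[i] += r * r
        total += r
    return [s1[i] / n[i] for i in range(m)], total

means, total = upper_bound_play()
print("estimated payoffs:", [round(x, 3) for x in means])
print("total reward:", total)

The small floor on the standard error (1e-3) is a practical safeguard
and not part of the formula on the slide: with Bernoulli rewards a
machine can show zero sample deviation early on, which would
otherwise prevent it from ever being explored again.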
State based RL

In MAB the policy could be decided based only on the agent's
knowledge; the environment was fixed and did not change. More
generally, the environment changes and the agent has to take the
current state into account when deciding the action.
Examples:
Chess/Go/tic-tac-toe: reward (1, -1, 0) for win, loss, draw
respectively, available only at the end of the game. The state is the
board position and the actions are moves.
Driving agent: the state is the vector of sensor inputs. Actions are the
driving actions: steer, accelerate, brake, or some combination of these.
Both current and past state-action pairs may be relevant for the
reward.
One goal of RL is to find the value of an action in a state regardless
of when the reward actually arrives. Actions can then be chosen based
on these action-values in that state.
Schematic of RL

In MAB s_{t+1} = s_t and the reward r_t is fixed and tied to the
machine; essentially, the state does not change over time.
A slightly different version of MAB is one where new machines with
new payoff probabilities are added over time, but the payoff
probabilities of the machines, once set, always remain the same.
The most general version of RL is one where states and rewards both
change with time, and rewards are not available immediately but have
to be inferred after a terminal state is reached.
Markov Decision Process (MDP) I

An MDP is a stochastic transition system where the current state has
sufficient information to take an action and decide the reward for
that action. An unfolding instance of an MDP can be represented as
the sequence s_0 a_0 r_0, s_1 a_1 r_1, ..., which can be finite or
infinite. Here s_i a_i r_i s_{i+1} ... means that in the i-th time step
the MDP was in state s_i, did action a_i that resulted in state
s_{i+1}, and got immediate reward r_i. A notational variant used
sometimes: the reward is written r_{i+1} instead of r_i.
Formally, an MDP is a four-tuple (S, A, P_a, R_a), a ∈ A.
S - a set of states; can be finite or infinite.
A - a set of actions.
P_a(s, s') = Pr(s_{t+1} = s' | s_t = s, a_t = a) - the probability at
time t of transitioning from the current state s_t = s to state
s_{t+1} = s' under action a.
Markov Decision Process (MDP) II

R_a(s, s') - the immediate reward received when transitioning from
state s to state s' under action a.
A policy π is a (possibly probabilistic) mapping π : S → A.
An MDP models a stochastic decision-making process, and the goal is
to find a 'good' (ideally optimal) policy that maximizes the expected
cumulative reward E[R_t | s_t, a_t] as t → ∞. For finite MDPs t is
bounded. The expected cumulative reward is given by:
E[R_t | s_t, a_t] = Σ_{i=0}^{∞} γ^i E[r_{t+i} | s_t, a_t], where a_t = π(s_t).
The expectation is over s_{t+1} ∼ P_{a_t}(s_t, s_{t+1}).
0 < γ < 1 is the discount factor, a parameter that models the
trade-off between immediate and delayed rewards. Small values of γ
promote maximizing immediate rewards, while values close to 1
favour delayed or long-term rewards (see the sketch below).
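
To make the definitions concrete, here is a minimal Python sketch of a
toy two-state MDP together with a Monte Carlo estimate of the
discounted cumulative reward under a fixed policy. The states, actions,
transition probabilities, rewards and the policy itself are all invented
for illustration.

import random

# A toy two-state, two-action MDP; all numbers are invented for illustration.
# P[a][s][s'] is the transition probability P_a(s, s'),
# R[a][(s, s')] is the immediate reward R_a(s, s').
S = ["s0", "s1"]
A = ["stay", "move"]
P = {
    "stay": {"s0": {"s0": 0.9, "s1": 0.1}, "s1": {"s0": 0.1, "s1": 0.9}},
    "move": {"s0": {"s0": 0.2, "s1": 0.8}, "s1": {"s0": 0.8, "s1": 0.2}},
}
R = {a: {(s, sp): (2.0 if sp == "s1" else 0.0) for s in S for sp in S} for a in A}

def step(s, a):
    """Sample s' ~ P_a(s, .) and return (s', immediate reward)."""
    s_next = random.choices(S, weights=[P[a][s][sp] for sp in S])[0]
    return s_next, R[a][(s, s_next)]

def policy(s):
    """A fixed deterministic policy pi: S -> A, chosen arbitrarily."""
    return "stay" if s == "s1" else "move"

def discounted_return(s0, gamma=0.9, horizon=200):
    """One rollout of sum_i gamma^i * r_{t+i} under the policy (truncated)."""
    s, total, discount = s0, 0.0, 1.0
    for _ in range(horizon):
        a = policy(s)
        s, r = step(s, a)
        total += discount * r
        discount *= gamma
    return total

# Monte Carlo estimate of E[R_t | s_t = "s0"] under the fixed policy.
est = sum(discounted_return("s0") for _ in range(2000)) / 2000
print("estimated discounted return from s0:", round(est, 3))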
Markov Decision Process (MDP) III

RL algorithms aim to learn the expected cumulative reward for each
state-action pair so that the right action can be chosen in each state.
Due to the large number of possible pairs (s, a), the expected
cumulative reward cannot be learnt for each pair separately; instead it
is approximated by a function of (s_t, a_t), as sketched below.
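
One common way to do this, shown here purely as a sketch, is linear
function approximation: Q(s, a) ≈ w · φ(s, a) for a hand-chosen feature
map φ. The feature map, learning rate and dimensions below are
invented for illustration; in practice φ (or a neural network) is
problem specific.

import numpy as np

def phi(state, action, n_actions=4):
    """Feature vector for (s, a): the state vector concatenated with a one-hot action."""
    one_hot = np.zeros(n_actions)
    one_hot[action] = 1.0
    return np.concatenate([np.asarray(state, dtype=float), one_hot])

def q_hat(w, state, action):
    """Approximate action value Q(s, a) ~ w . phi(s, a)."""
    return float(w @ phi(state, action))

def update(w, state, action, target, lr=0.01):
    """One stochastic gradient step pulling q_hat(s, a) towards a target value."""
    features = phi(state, action)
    error = target - w @ features
    return w + lr * error * features

# Usage: a 3-dimensional state and 4 actions give a 7-dimensional weight vector.
w = np.zeros(7)
w = update(w, state=[0.5, -1.0, 2.0], action=2, target=1.0)
print(q_hat(w, [0.5, -1.0, 2.0], 2))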
Finite MDP, Q-algorithm

Q-learning is an algorithm to learn good/optimal actions in MDPs
with a finite state set S.
Q : S × A → R is a table that is updated in an iterative manner. The
algorithm is model-free, that is, it does not have to learn a
state-transition model. But it will not work for infinite state spaces.
The update is done using Bellman's equation (see qlearning.ipynb for
an example):
Q(s_t, a_t) ← Q(s_t, a_t) + α (r_t + γ max_a' Q(s_{t+1}, a') − Q(s_t, a_t)),
where α is a learning rate.
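
The sketch below, which is not the course's qlearning.ipynb, applies
this update on an invented toy environment: a five-state chain where
moving right into the last state yields reward 1 and ends the episode.
The environment, hyper-parameters and ε-greedy exploration are all
illustrative choices.

import random
from collections import defaultdict

# Toy episodic chain: states 0..4 in a line, actions move left/right,
# reaching state 4 gives reward 1 and ends the episode (illustration only).
N_STATES = 5
ACTIONS = ["left", "right"]

def step(s, a):
    """Return (next_state, reward, done) for the toy chain environment."""
    s_next = max(s - 1, 0) if a == "left" else min(s + 1, N_STATES - 1)
    done = s_next == N_STATES - 1
    return s_next, (1.0 if done else 0.0), done

def q_learning(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)  # Q[(s, a)], implicitly initialised to 0
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # epsilon-greedy action selection over the current Q table
            if random.random() < epsilon:
                a = random.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda act: Q[(s, act)])
            s_next, r, done = step(s, a)
            best_next = 0.0 if done else max(Q[(s_next, act)] for act in ACTIONS)
            # Bellman-style update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q

Q = q_learning()
print({s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES - 1)})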
