CSD311: Artificial Intelligence

Reinforcement learning

Definition 4 (Reinforcement learning (RL))


A reward-driven trial-and-error process in which a system learns to
interact with a complex environment to achieve rewarding outcomes.
The action policy is typically learnt from data via this trial-and-error
process and is not based on a user-defined heuristic function.
RL is used in complex environments where it is easy to decide the
reward value but difficult to reason about how to achieve the reward.
For example, in chess it is trivial to know whether a game ended in a
win for white or black or in a draw, but very hard to come up with a
sequence of decisions (moves) that would ensure the reward with
high probability.
Any heuristic is likely to be a much poorer predictor of a good action
sequence than learning the most successful (rewarding) decisions
from the data, assuming large amounts of data are available.
Multi-armed bandits (MAB)

MAB: a situation where there are some machines (say m slot
machines) whose expected rewards are unknown. A player who plays
the machines wants to maximize reward in the long run.
Since the expected rewards are not known, the player has to explore
the slot machines before playing to maximize reward (exploit).
Clearly, the best policy is to continuously play the machine with the
highest expected reward.
The player has to design an algorithm/strategy that will maximize
reward. This typically means finding a policy that decides when to
explore and when to exploit.
The MAB is the simplest such problem since it is stateless: at any
point the environment (in this case the machine behaviour) does not
change.
Strategies for MAB I

Naive algorithm: Do a fixed number of trials on each machine, then
choose the machine with the highest estimated expectation and stick
with it. Problems: how many trials in the exploration phase? What if
the wrong machine was chosen after the trials?
ε-Greedy: On each trial, explore with probability ε by choosing a
random machine; otherwise exploit by choosing the slot machine with
the current highest estimated payoff (see the sketch below). The ε
value can be annealed, that is, start with a larger value and reduce it
progressively. This strategy is not efficient at learning the payoffs of
newly added machines (in a dynamic setting).
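
As a concrete illustration, here is a minimal Python sketch of the
ε-greedy strategy on a simulated bandit (not from the course material).
The three machines and the payoff probabilities in true_means are
invented for illustration; pull simulates one Bernoulli-reward play.

import random

# Hypothetical 3-armed bandit: each machine pays 1 with a probability that is
# unknown to the player. These payoff probabilities are invented for illustration.
true_means = [0.3, 0.5, 0.6]

def pull(machine):
    """Simulate one play of the given machine (Bernoulli reward)."""
    return 1.0 if random.random() < true_means[machine] else 0.0

def epsilon_greedy(n_trials=10000, epsilon=0.1):
    m = len(true_means)
    counts = [0] * m          # n_i: number of times machine i was played
    estimates = [0.0] * m     # m_i: running average payoff of machine i
    total = 0.0
    for _ in range(n_trials):
        if random.random() < epsilon:      # explore: pick a random machine
            choice = random.randrange(m)
        else:                              # exploit: pick the current best estimate
            choice = max(range(m), key=lambda i: estimates[i])
        r = pull(choice)
        counts[choice] += 1
        estimates[choice] += (r - estimates[choice]) / counts[choice]  # update average
        total += r
    return estimates, total

estimates, total = epsilon_greedy()
print("estimated payoffs:", [round(e, 3) for e in estimates])
print("total reward:", total)

Annealing ε would amount to replacing the fixed epsilon with a value
that decays with the trial index; the naive algorithm corresponds to a
fixed number of forced exploratory pulls per machine followed by ε = 0.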
Strategies for MAB II

Upper bounding methods: Choose the machine with the highest
statistical upper bound u_i = m_i + c_i, where m_i is the average
payoff of machine i and c_i = K × s_i/√n_i. Here K is a positive
constant, s_i is the sample standard deviation of machine i's payoffs,
and n_i is the number of times the i-th machine was chosen. Note
that c_i is K times the standard error of the mean (Central Limit
Theorem). Explore and exploit are integrated in this policy: a machine
that has been chosen fewer times is likely to have a higher c_i and
therefore a higher u_i. The value of K mediates between exploration
and exploitation; larger values of K bias the choice towards
exploration (see the sketch below).
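
A minimal sketch of the upper-bounding strategy under the same
simulated-bandit assumptions as above (the machines and the constant
K are invented for illustration). Each machine is played twice up front
so that the sample standard deviation is defined.

import math
import random

# Hypothetical machines with payoff probabilities unknown to the player.
true_means = [0.3, 0.5, 0.6]

def pull(machine):
    return 1.0 if random.random() < true_means[machine] else 0.0

def upper_bound_play(n_trials=10000, K=2.0):
    m = len(true_means)
    n = [0] * m        # n_i: number of plays of machine i
    s1 = [0.0] * m     # running sum of rewards of machine i
    s2 = [0.0] * m     # running sum of squared rewards (for the sample std dev)
    total = 0.0
    for t in range(n_trials):
        if t < 2 * m:
            i = t % m  # play every machine twice so s_i and n_i are defined
        else:
            def u(i):
                mean = s1[i] / n[i]                                   # m_i
                var = max((s2[i] - n[i] * mean * mean) / (n[i] - 1), 0.0)
                se = math.sqrt(var) / math.sqrt(n[i])                 # s_i / sqrt(n_i)
                return mean + K * max(se, 1e-3)                       # u_i = m_i + c_i
            i = max(range(m), key=u)
        r = pull(i)
        n[i] += 1
        s1[i] += r
        s2[i] += r * r
        total += r
    return [s1[i] / n[i] for i in range(m)], total

means, total = upper_bound_play()
print("estimated payoffs:", [round(x, 3) for x in means])
print("total reward:", total)

The small floor on the standard error (1e-3) is a practical safeguard
and not part of the formula on the slide: with Bernoulli rewards a
machine can show zero sample deviation early on, which would
otherwise prevent it from ever being explored again.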
State based RL

In MAB the policy could be decided based only on the agent's
knowledge; the environment was fixed and did not change. More
generally, the environment changes and the agent has to take the
current state into account when deciding the action.
Examples:
Chess/Go/tic-tac-toe: reward (1, -1, 0) for win, loss, draw
respectively, available only at the end of the game. The state is the
board position and the actions are moves.
Driving agent: the state is the vector of sensor inputs. Actions are the
driving actions: steer, accelerate, brake, or some combination of these.
Both current and past state-action pairs may be relevant for the
reward.
One goal of RL is to find the value of an action in a state regardless
of when the reward actually arrives. Actions can then be chosen based
on these action-values in that state.
Schematic of RL

In MAB s_{t+1} = s_t and the reward r_t is fixed and tied to the
machine; essentially, the state does not change over time.
A slightly different version of MAB is one where new machines with
new payoff probabilities are added over time, but the payoff
probabilities of the machines, once set, always remain the same.
The most general version of RL is one where states and rewards both
change with time, and rewards are not available immediately but have
to be inferred after a terminal state is reached.
Markov Decision Process (MDP) I

An MDP is a stochastic transition system where the current state has
sufficient information to take an action and decide the reward for
that action. An unfolding instance of an MDP can be represented as
the sequence s_0 a_0 r_0, s_1 a_1 r_1, ..., which can be finite or
infinite. Here s_i a_i r_i s_{i+1} ... means that in the i-th time step
the MDP was in state s_i, did action a_i that resulted in state
s_{i+1}, and got immediate reward r_i. A notational variant used
sometimes: the reward is written r_{i+1} instead of r_i.
Formally, an MDP is a four-tuple (S, A, P_a, R_a), a ∈ A.
S - a set of states; can be finite or infinite.
A - a set of actions.
P_a(s, s') = Pr(s_{t+1} = s' | s_t = s, a_t = a) - the probability at
time t of transitioning from the current state s_t = s to state
s_{t+1} = s' under action a.
Markov Decision Process (MDP) II

R_a(s, s') - the immediate reward received when transitioning from
state s to state s' under action a.
A policy π is a (possibly probabilistic) mapping π : S → A.
An MDP models a stochastic decision-making process, and the goal is
to find a 'good' (ideally optimal) policy that maximizes the expected
cumulative reward E[R_t | s_t, a_t] as t → ∞. For finite MDPs t is
bounded. The expected cumulative reward is given by:
E[R_t | s_t, a_t] = Σ_{i=0}^{∞} γ^i E[r_{t+i} | s_t, a_t], where a_t = π(s_t).
The expectation is over s_{t+1} ∼ P_{a_t}(s_t, s_{t+1}).
0 < γ < 1 is the discount factor, a parameter that models the
trade-off between immediate and delayed rewards. Small values of γ
promote maximizing immediate rewards, while values close to 1
favour delayed or long-term rewards (see the sketch below).
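
To make the definitions concrete, here is a minimal Python sketch of a
toy two-state MDP together with a Monte Carlo estimate of the
discounted cumulative reward under a fixed policy. The states, actions,
transition probabilities, rewards and the policy itself are all invented
for illustration.

import random

# A toy two-state, two-action MDP; all numbers are invented for illustration.
# P[a][s][s'] is the transition probability P_a(s, s'),
# R[a][(s, s')] is the immediate reward R_a(s, s').
S = ["s0", "s1"]
A = ["stay", "move"]
P = {
    "stay": {"s0": {"s0": 0.9, "s1": 0.1}, "s1": {"s0": 0.1, "s1": 0.9}},
    "move": {"s0": {"s0": 0.2, "s1": 0.8}, "s1": {"s0": 0.8, "s1": 0.2}},
}
R = {a: {(s, sp): (2.0 if sp == "s1" else 0.0) for s in S for sp in S} for a in A}

def step(s, a):
    """Sample s' ~ P_a(s, .) and return (s', immediate reward)."""
    s_next = random.choices(S, weights=[P[a][s][sp] for sp in S])[0]
    return s_next, R[a][(s, s_next)]

def policy(s):
    """A fixed deterministic policy pi: S -> A, chosen arbitrarily."""
    return "stay" if s == "s1" else "move"

def discounted_return(s0, gamma=0.9, horizon=200):
    """One rollout of sum_i gamma^i * r_{t+i} under the policy (truncated)."""
    s, total, discount = s0, 0.0, 1.0
    for _ in range(horizon):
        a = policy(s)
        s, r = step(s, a)
        total += discount * r
        discount *= gamma
    return total

# Monte Carlo estimate of E[R_t | s_t = "s0"] under the fixed policy.
est = sum(discounted_return("s0") for _ in range(2000)) / 2000
print("estimated discounted return from s0:", round(est, 3))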
Markov Decision Process (MDP) III

RL algorithms aim to learn the expected cumulative reward for each
state-action pair so that the right action can be chosen in each state.
Due to the large number of possible pairs (s, a), the expected
cumulative reward cannot be learnt for each pair separately; instead it
is approximated by a function of (s_t, a_t), as sketched below.
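
One common way to do this, shown here purely as a sketch, is linear
function approximation: Q(s, a) ≈ w · φ(s, a) for a hand-chosen feature
map φ. The feature map, learning rate and dimensions below are
invented for illustration; in practice φ (or a neural network) is
problem specific.

import numpy as np

def phi(state, action, n_actions=4):
    """Feature vector for (s, a): the state vector concatenated with a one-hot action."""
    one_hot = np.zeros(n_actions)
    one_hot[action] = 1.0
    return np.concatenate([np.asarray(state, dtype=float), one_hot])

def q_hat(w, state, action):
    """Approximate action value Q(s, a) ~ w . phi(s, a)."""
    return float(w @ phi(state, action))

def update(w, state, action, target, lr=0.01):
    """One stochastic gradient step pulling q_hat(s, a) towards a target value."""
    features = phi(state, action)
    error = target - w @ features
    return w + lr * error * features

# Usage: a 3-dimensional state and 4 actions give a 7-dimensional weight vector.
w = np.zeros(7)
w = update(w, state=[0.5, -1.0, 2.0], action=2, target=1.0)
print(q_hat(w, [0.5, -1.0, 2.0], 2))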
Finite MDP, Q-algorithm

Q-learning is an algorithm to learn good/optimal actions in MDPs
with a finite state set S.
Q : S × A → R is a table that is updated in an iterative manner. The
algorithm is model-free, that is, it does not have to learn a
state-transition model. But it will not work for infinite state spaces.
The update is done using Bellman's equation (see qlearning.ipynb for
an example):
Q(s_t, a_t) ← Q(s_t, a_t) + α (r_t + γ max_a' Q(s_{t+1}, a') − Q(s_t, a_t)),
where α is a learning rate.
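
The sketch below, which is not the course's qlearning.ipynb, applies
this update on an invented toy environment: a five-state chain where
moving right into the last state yields reward 1 and ends the episode.
The environment, hyper-parameters and ε-greedy exploration are all
illustrative choices.

import random
from collections import defaultdict

# Toy episodic chain: states 0..4 in a line, actions move left/right,
# reaching state 4 gives reward 1 and ends the episode (illustration only).
N_STATES = 5
ACTIONS = ["left", "right"]

def step(s, a):
    """Return (next_state, reward, done) for the toy chain environment."""
    s_next = max(s - 1, 0) if a == "left" else min(s + 1, N_STATES - 1)
    done = s_next == N_STATES - 1
    return s_next, (1.0 if done else 0.0), done

def q_learning(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)  # Q[(s, a)], implicitly initialised to 0
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # epsilon-greedy action selection over the current Q table
            if random.random() < epsilon:
                a = random.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda act: Q[(s, act)])
            s_next, r, done = step(s, a)
            best_next = 0.0 if done else max(Q[(s_next, act)] for act in ACTIONS)
            # Bellman-style update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q

Q = q_learning()
print({s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES - 1)})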
