RL 1

Reinforcement Learning (RL) involves an agent learning to behave optimally in an environment without knowing the transition model or reward function. It can be approached through passive learning, where the agent evaluates a fixed policy, or active learning, where the agent explores to find an optimal policy. Key methods include Direct Utility Estimation, Adaptive Dynamic Programming, and Temporal Difference Learning, each with its own advantages and challenges in estimating state utilities and learning optimal policies.

Reinforcement Learning

Reinforcement Learning in a nutshell

Imagine playing a new game whose rules you
don't know; after a hundred or so moves, your
opponent announces, "You lose".

‐ Russell and Norvig

Introduction to Artificial Intelligence
Reinforcement Learning
• Agent placed in an environment and must
learn to behave optimally in it
• Assume that the world behaves like an
MDP, except:
– Agent can act but does not know the transition
model
– Agent observes its current state and its reward, but
doesn't know the reward function
• Goal: learn an optimal policy
Factors that Make RL Difficult
• Actions have non‐deterministic effects
– which are initially unknown and must be
learned
• Rewards / punishments can be infrequent
– Often at the end of long sequences of actions
– How do we determine what action(s) were
really responsible for reward or punishment?
(credit assignment problem)
• The world is large and complex
Passive vs. Active learning
• Passive learning
– The agent acts based on a fixed policy π and
tries to learn how good the policy is by
observing the world go by
– Analogous to policy evaluation in policy
iteration
• Active learning
– The agent attempts to find an optimal (or at
least good) policy by exploring different
actions in the world
– Analogous to solving the underlying MDP
Model‐Based vs. Model‐Free RL
• Model‐based approach to RL:
– learn the MDP model (T and R), or an
approximation of it
– use it to find the optimal policy
• Model‐free approach to RL:
– derive the optimal policy without explicitly
learning the model

We will consider both types of approaches


Passive Reinforcement Learning
• Suppose the agent's policy π is fixed
• It wants to learn how good that policy is in
the world, i.e. it wants to learn Uπ(s)
• This is just like the policy evaluation part of
policy iteration
• The big difference: the agent doesn’t know
the transition model or the reward function
(but it gets to observe the reward in each
state it is in)
Passive RL
• Suppose we are given a policy π
• Want to determine how good it is
Given π: Need to learn Uπ(s)
Approach 1: Direct Utility Estimation
• Direct utility estimation (model free)
– Estimate Uπ(s) as the average total reward of
epochs containing s (calculated from s to the end
of the epoch)
• Reward‐to‐go of a state s
– the sum of the (discounted) rewards from that
state until a terminal state is reached
• Key: use the observed reward‐to‐go of the
state as direct evidence of the actual
expected utility of that state
Direct Utility Estimation
Suppose we observe the following trial:
(1,1)-0.04 → (1,2)-0.04 → (1,3)-0.04 → (1,2)-0.04 → (1,3)-0.04
→ (2,3)-0.04 → (3,3)-0.04 → (4,3)+1

The total reward starting at (1,1) is 0.72. We call this a sample
of the observed reward-to-go for (1,1).
For (1,2) there are two samples of the observed reward-to-go
(assuming γ = 1):
1. (1,2)-0.04 → (1,3)-0.04 → (1,2)-0.04 → (1,3)-0.04 → (2,3)-0.04 →
   (3,3)-0.04 → (4,3)+1 [Total: 0.76]
2. (1,2)-0.04 → (1,3)-0.04 → (2,3)-0.04 → (3,3)-0.04 → (4,3)+1
   [Total: 0.84]
Direct Utility Estimation
• Direct Utility Estimation keeps a running
average of the observed reward‐to‐go for
each state
• E.g. for state (1,2), it stores (0.76 + 0.84)/2 =
0.8
• As the number of trials goes to infinity, the
sample average converges to the true
utility
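
As a concrete illustration, here is a minimal Python sketch (not from the slides; the trial format, a list of (state, reward) pairs ending at a terminal state, is an assumption made for this example) of direct utility estimation as a running average of observed reward-to-go:

from collections import defaultdict

def direct_utility_estimation(trials, gamma=1.0):
    """Estimate U_pi(s) as the average observed reward-to-go over all trials."""
    totals = defaultdict(float)   # sum of reward-to-go samples per state
    counts = defaultdict(int)     # number of samples per state
    for trial in trials:
        reward_to_go = 0.0
        # Walk backwards so the reward-to-go can be accumulated incrementally.
        for state, reward in reversed(trial):
            reward_to_go = reward + gamma * reward_to_go
            totals[state] += reward_to_go
            counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}

# The trial from the earlier slide: -0.04 per step, +1 at the goal (4,3).
trial = [((1,1), -0.04), ((1,2), -0.04), ((1,3), -0.04), ((1,2), -0.04),
         ((1,3), -0.04), ((2,3), -0.04), ((3,3), -0.04), ((4,3), 1.0)]
print(direct_utility_estimation([trial])[(1,2)])   # (0.76 + 0.84) / 2 = 0.80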
Direct Utility Estimation
• The big problem with Direct Utility
Estimation: it converges very slowly!
• Why?
– Doesn’t exploit the fact that utilities of states are
not independent
– Utilities follow the Bellman equation

Uπ(s) = R(s) + γ Σs' T(s, π(s), s') Uπ(s')

Note the dependence on neighboring states
Direct Utility Estimation
Using the dependence to your advantage:
Suppose you know that state (3,3) has
a high utility
Suppose you are now at (3,2)
The Bellman equation would be able
to tell you that (3,2) is likely to have a
high utility because (3,3) is a
neighbor.
(Remember that each blank state in the grid has R(s) = -0.04)
Direct Utility Estimation can't tell you that until the end
of the trial.
Adaptive Dynamic Programming
(Model based)
• This method does take advantage of the
constraints in the Bellman equation
• Basically learns the transition model T and
the reward function R
• Based on the underlying MDP (T and R) we
can perform policy evaluation (which is
part of policy iteration, previously taught)
Adaptive Dynamic Programming
• Recall that policy evaluation in policy
iteration involves solving for the utility of each
state if policy πi is followed.
• This leads to the equations:
Uπ(s) = R(s) + γ Σs' T(s, π(s), s') Uπ(s')
• The equations above are linear, so they can
be solved with linear algebra in time O(n³)
where n is the number of states
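
For concreteness, here is a minimal sketch (not from the slides) of exact policy evaluation as a linear solve with NumPy; the tiny 3-state chain and its numbers are hypothetical:

import numpy as np

def policy_evaluation_exact(T_pi, R, gamma=1.0):
    """Solve the linear system U = R + gamma * T_pi U for a fixed policy.

    T_pi[i, j] is the probability of going from state i to state j under
    the fixed policy; R[i] is the reward of state i. The solve costs O(n^3).
    """
    n = len(R)
    return np.linalg.solve(np.eye(n) - gamma * T_pi, R)

# Hypothetical 3-state chain; state 2 is terminal (no successors).
T_pi = np.array([[0.1, 0.9, 0.0],
                 [0.0, 0.1, 0.9],
                 [0.0, 0.0, 0.0]])
R = np.array([-0.04, -0.04, 1.0])
print(policy_evaluation_exact(T_pi, R))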
Adaptive Dynamic Programming
• Make use of policy evaluation to learn the
utilities of states
• In order to use the policy evaluation equation:
Uπ(s) = R(s) + γ Σs' T(s, π(s), s') Uπ(s')
the agent needs to learn the transition
model T(s,a,s') and the reward function R(s)
How do we learn these models?
Adaptive Dynamic Programming
• Learning the reward function R(s):
Easy because it's deterministic. Whenever you
see a new state, store the observed reward value
as R(s)
• Learning the transition model T(s,a,s'):
Keep track of how often you get to state s' given
that you're in state s and do action a.
– e.g. if you are in s = (1,3) and you execute Right three
times and you end up in s' = (2,3) twice, then
T(s, Right, s') = 2/3
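
A minimal Python sketch (not from the slides; the class and method names are illustrative) of turning these counts into maximum-likelihood estimates of R and T:

from collections import defaultdict

class ModelLearner:
    """Tabular maximum-likelihood estimates of R(s) and T(s, a, s')."""

    def __init__(self):
        self.R = {}                    # R[s]: observed reward in state s (deterministic)
        self.N_sa = defaultdict(int)   # counts of (s, a)
        self.N_sas = defaultdict(int)  # counts of (s, a, s')

    def observe(self, s, a, s_next, r_next):
        """Record one observed transition s --a--> s_next with reward r_next in s_next."""
        self.R.setdefault(s_next, r_next)
        self.N_sa[(s, a)] += 1
        self.N_sas[(s, a, s_next)] += 1

    def T(self, s, a, s_next):
        """Estimated transition probability N(s,a,s') / N(s,a)."""
        n = self.N_sa[(s, a)]
        return self.N_sas[(s, a, s_next)] / n if n else 0.0

# The example above: Right executed three times from (1,3), ending in (2,3) twice.
m = ModelLearner()
m.observe((1,3), 'Right', (2,3), -0.04)
m.observe((1,3), 'Right', (2,3), -0.04)
m.observe((1,3), 'Right', (1,3), -0.04)   # third outcome is a guess (e.g. bumped into a wall)
print(m.T((1,3), 'Right', (2,3)))          # 2/3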
ADP Algorithm
function PASSIVE‐ADP‐AGENT(percept) returns an action
  inputs: percept, a percept indicating the current state s' and reward signal r'
  static: π, a fixed policy
          mdp, an MDP with model T, rewards R, discount γ
          U, a table of utilities, initially empty
          Nsa, a table of frequencies for state‐action pairs, initially zero
          Nsas', a table of frequencies for state‐action‐state triples, initially zero
          s, a, the previous state and action, initially null

  if s' is new then do U[s'] ← r'; R[s'] ← r'            // update reward function
  if s is not null then do
      increment Nsa[s,a] and Nsas'[s,a,s']               // update transition model
      for each t such that Nsas'[s,a,t] is nonzero do
          T[s,a,t] ← Nsas'[s,a,t] / Nsa[s,a]
  U ← POLICY‐EVALUATION(π, U, mdp)
  if TERMINAL?[s'] then s, a ← null else s, a ← s', π[s']
  return a
The Problem with ADP
• Need to solve a system of simultaneous
equations – costs O(n³)
– Very hard to do if you have 10⁵⁰ states, as in
Backgammon
– Could make things a little easier with modified
policy iteration
• Can we avoid the computational expense
of full policy evaluation?
Temporal Difference Learning
• Instead of calculating the exact utility for a state
can we approximate it and possibly make it less
computationally expensive?
• Yes we can! Using Temporal Difference (TD)
learning
Uπ(s) = R(s) + γ Σs' T(s, π(s), s') Uπ(s')

• Instead of doing this sum over all successors, only adjust the
utility of the state based on the successor observed in the trial.
• It does not estimate the transition model – model free
TD Learning
Example:
• Suppose you see that Uπ(1,3) = 0.84 and Uπ(2,3) =
0.92 after the first trial.
• If the transition (1,3) → (2,3) happens all the
time, you would expect to see:
Uπ(1,3) = R(1,3) + Uπ(2,3)
⇒ Uπ(1,3) = ‐0.04 + Uπ(2,3)
⇒ Uπ(1,3) = ‐0.04 + 0.92 = 0.88
• Since you observe Uπ(1,3) = 0.84 in the first trial,
which is a little lower than 0.88, you might want to
"bump" it towards 0.88.
Temporal Difference Update
When we move from state s to s’, we apply the
following update rule:
Uπ(s) ← Uπ(s) + α (R(s) + γ Uπ(s') − Uπ(s))

This is similar to one step of value iteration

We call this equation a "backup"
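
A minimal sketch (not from the slides) of this backup in code, using the numbers from the earlier example; the learning rate α = 0.1 is an arbitrary choice:

def td_update(U, s, s_next, r_s, alpha=0.1, gamma=1.0):
    """One TD backup after observing the transition s -> s_next; r_s is R(s)."""
    U[s] = U[s] + alpha * (r_s + gamma * U[s_next] - U[s])

# U(1,3) = 0.84 and U(2,3) = 0.92 after the first trial; R(1,3) = -0.04.
U = {(1, 3): 0.84, (2, 3): 0.92}
td_update(U, (1, 3), (2, 3), r_s=-0.04)
print(U[(1, 3)])   # bumped from 0.84 towards 0.88: 0.84 + 0.1 * 0.04 = 0.844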
Convergence
• Since we’re using the observed successor s’ instead of all
the successors, what happens if the transition s → s' is
very rare and there is a big jump in utilities from s to s’?
• How can Uπ(s) converge to the true equilibrium value?
• Answer: The average value of Uπ(s) will converge to the
correct value
• This means we need to observe enough trials that have
transitions from s to its successors
• Essentially, the effects of the TD backups will be averaged
over a large number of transitions
• Rare transitions will be rare in the set of transitions
observed
Comparison between ADP and TD
• Advantages of ADP:
– Converges to the true utilities faster
– Utility estimates don’t vary as much from the true
utilities
• Advantages of TD:
– Simpler, less computation per observation
– Crude but efficient first approximation to ADP
– Doesn't need to build a transition model in order to
perform its updates (this is important because we can
interleave computation with exploration rather than
having to wait for the whole model to be built first)
ADP and TD
Overall comparisons
What You Should Know
• How reinforcement learning differs from
supervised learning and from MDPs
• Pros and cons of:
– Direct Utility Estimation
– Adaptive Dynamic Programming
– Temporal Difference Learning
