RL 1

Reinforcement Learning (RL) involves an agent learning to behave optimally in an environment without knowing the transition model or reward function. It can be approached through passive learning, where the agent evaluates a fixed policy, or active learning, where the agent explores to find an optimal policy. Key methods include Direct Utility Estimation, Adaptive Dynamic Programming, and Temporal Difference Learning, each with its own advantages and challenges in estimating state utilities and learning optimal policies.

Reinforcement Learning

Reinforcement Learning in a nutshell

Imagine playing a new game whose rules you
don't know; after a hundred or so moves, your
opponent announces, "You lose".

‐ Russell and Norvig

Introduction to Artificial Intelligence
Reinforcement Learning
• Agent placed in an environment and must
learn to behave optimally in it
• Assume that the world behaves like an
MDP, except:
– Agent can act but does not know the transition
model
– Agent observes its current state and its reward, but
doesn't know the reward function
• Goal: learn an optimal policy
Factors that Make RL Difficult
• Actions have non‐deterministic effects
– which are initially unknown and must be
learned
• Rewards / punishments can be infrequent
– Often at the end of long sequences of actions
– How do we determine what action(s) were
really responsible for reward or punishment?
(credit assignment problem)
• The world is large and complex
Passive vs. Active learning
• Passive learning
– The agent acts based on a fixed policy π and
tries to learn how good the policy is by
observing the world go by
– Analogous to policy evaluation in policy
iteration
• Active learning
– The agent attempts to find an optimal (or at
least good) policy by exploring different
actions in the world
– Analogous to solving the underlying MDP
Model‐Based vs. Model‐Free RL
• Model‐based approach to RL:
– learn the MDP model (T and R), or an
approximation of it
– use it to find the optimal policy
• Model‐free approach to RL:
– derive the optimal policy without explicitly
learning the model

We will consider both types of approaches


Passive Reinforcement Learning
• Suppose the agent's policy π is fixed
• It wants to learn how good that policy is in
the world, i.e. it wants to learn Uπ(s)
• This is just like the policy evaluation part of
policy iteration
• The big difference: the agent doesn’t know
the transition model or the reward function
(but it gets to observe the reward in each
state it is in)
Passive RL
• Suppose we are given a policy π
• Want to determine how good it is
Given π: Need to learn Uπ(s)
Approach 1: Direct Utility Estimation
• Direct utility estimation (model free)
– Estimate Uπ(s) as the average total reward of
epochs containing s (calculated from s to the end
of the epoch)
• Reward‐to‐go of a state s
– the sum of the (discounted) rewards from that
state until a terminal state is reached
• Key: use the observed reward‐to‐go of the
state as direct evidence of the actual
expected utility of that state
Direct Utility Estimation
Suppose we observe the following trial:
(1,1)-0.04 → (1,2)-0.04 → (1,3)-0.04 → (1,2)-0.04 → (1,3)-0.04
→ (2,3)-0.04 → (3,3)-0.04 → (4,3)+1

The total reward starting at (1,1) is 0.72. We call this a sample
of the observed reward-to-go for (1,1).
For (1,2) there are two samples of the observed reward-to-go
(assuming γ = 1):
1. (1,2)-0.04 → (1,3)-0.04 → (1,2)-0.04 → (1,3)-0.04 → (2,3)-0.04 →
   (3,3)-0.04 → (4,3)+1 [Total: 0.76]
2. (1,2)-0.04 → (1,3)-0.04 → (2,3)-0.04 → (3,3)-0.04 → (4,3)+1
   [Total: 0.84]
Direct Utility Estimation
• Direct Utility Estimation keeps a running
average of the observed reward‐to‐go for
each state
• E.g. for state (1,2), it stores (0.76 + 0.84)/2 =
0.8
• As the number of trials goes to infinity, the
sample average converges to the true
utility
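
As a concrete illustration, here is a minimal Python sketch (not from the slides; the trial format, a list of (state, reward) pairs ending at a terminal state, is an assumption made for this example) of direct utility estimation as a running average of observed reward-to-go:

from collections import defaultdict

def direct_utility_estimation(trials, gamma=1.0):
    """Estimate U_pi(s) as the average observed reward-to-go over all trials."""
    totals = defaultdict(float)   # sum of reward-to-go samples per state
    counts = defaultdict(int)     # number of samples per state
    for trial in trials:
        reward_to_go = 0.0
        # Walk backwards so the reward-to-go can be accumulated incrementally.
        for state, reward in reversed(trial):
            reward_to_go = reward + gamma * reward_to_go
            totals[state] += reward_to_go
            counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}

# The trial from the earlier slide: -0.04 per step, +1 at the goal (4,3).
trial = [((1,1), -0.04), ((1,2), -0.04), ((1,3), -0.04), ((1,2), -0.04),
         ((1,3), -0.04), ((2,3), -0.04), ((3,3), -0.04), ((4,3), 1.0)]
print(direct_utility_estimation([trial])[(1,2)])   # (0.76 + 0.84) / 2 = 0.80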
Direct Utility Estimation
• The big problem with Direct Utility
Estimation: it converges very slowly!
• Why?
– Doesn’t exploit the fact that utilities of states are
not independent
– Utilities follow the Bellman equation

Uπ(s) = R(s) + γ Σs' T(s, π(s), s') Uπ(s')

Note the dependence on neighboring states
Direct Utility Estimation
Using the dependence to your advantage:
Suppose you know that state (3,3) has
a high utility
Suppose you are now at (3,2)
The Bellman equation would be able
to tell you that (3,2) is likely to have a
high utility because (3,3) is a
neighbor.
(Remember that each blank state in the grid has R(s) = -0.04)
Direct Utility Estimation can't tell you that until the end
of the trial.
Adaptive Dynamic Programming
(Model based)
• This method does take advantage of the
constraints in the Bellman equation
• Basically learns the transition model T and
the reward function R
• Based on the underlying MDP (T and R) we
can perform policy evaluation (which is
part of policy iteration, previously taught)
Adaptive Dynamic Programming
• Recall that policy evaluation in policy
iteration involves solving for the utility of each
state if policy πi is followed.
• This leads to the equations:
Uπ(s) = R(s) + γ Σs' T(s, π(s), s') Uπ(s')
• The equations above are linear, so they can
be solved with linear algebra in time O(n³)
where n is the number of states
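
For concreteness, here is a minimal sketch (not from the slides) of exact policy evaluation as a linear solve with NumPy; the tiny 3-state chain and its numbers are hypothetical:

import numpy as np

def policy_evaluation_exact(T_pi, R, gamma=1.0):
    """Solve the linear system U = R + gamma * T_pi U for a fixed policy.

    T_pi[i, j] is the probability of going from state i to state j under
    the fixed policy; R[i] is the reward of state i. The solve costs O(n^3).
    """
    n = len(R)
    return np.linalg.solve(np.eye(n) - gamma * T_pi, R)

# Hypothetical 3-state chain; state 2 is terminal (no successors).
T_pi = np.array([[0.1, 0.9, 0.0],
                 [0.0, 0.1, 0.9],
                 [0.0, 0.0, 0.0]])
R = np.array([-0.04, -0.04, 1.0])
print(policy_evaluation_exact(T_pi, R))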
Adaptive Dynamic Programming
• Make use of policy evaluation to learn the
utilities of states
• In order to use the policy evaluation equation:
Uπ(s) = R(s) + γ Σs' T(s, π(s), s') Uπ(s')
the agent needs to learn the transition
model T(s,a,s') and the reward function R(s)
How do we learn these models?
Adaptive Dynamic Programming
• Learning the reward function R(s):
Easy because it's deterministic. Whenever you
see a new state, store the observed reward value
as R(s)
• Learning the transition model T(s,a,s'):
Keep track of how often you get to state s' given
that you're in state s and do action a.
– e.g. if you are in s = (1,3) and you execute Right three
times and you end up in s' = (2,3) twice, then
T(s, Right, s') = 2/3
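
A minimal Python sketch (not from the slides; the class and method names are illustrative) of turning these counts into maximum-likelihood estimates of R and T:

from collections import defaultdict

class ModelLearner:
    """Tabular maximum-likelihood estimates of R(s) and T(s, a, s')."""

    def __init__(self):
        self.R = {}                    # R[s]: observed reward in state s (deterministic)
        self.N_sa = defaultdict(int)   # counts of (s, a)
        self.N_sas = defaultdict(int)  # counts of (s, a, s')

    def observe(self, s, a, s_next, r_next):
        """Record one observed transition s --a--> s_next with reward r_next in s_next."""
        self.R.setdefault(s_next, r_next)
        self.N_sa[(s, a)] += 1
        self.N_sas[(s, a, s_next)] += 1

    def T(self, s, a, s_next):
        """Estimated transition probability N(s,a,s') / N(s,a)."""
        n = self.N_sa[(s, a)]
        return self.N_sas[(s, a, s_next)] / n if n else 0.0

# The example above: Right executed three times from (1,3), ending in (2,3) twice.
m = ModelLearner()
m.observe((1,3), 'Right', (2,3), -0.04)
m.observe((1,3), 'Right', (2,3), -0.04)
m.observe((1,3), 'Right', (1,3), -0.04)   # third outcome is a guess (e.g. bumped into a wall)
print(m.T((1,3), 'Right', (2,3)))          # 2/3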
ADP Algorithm
function PASSIVE‐ADP‐AGENT(percept) returns an action
  inputs: percept, a percept indicating the current state s' and reward signal r'
  static: π, a fixed policy
          mdp, an MDP with model T, rewards R, discount γ
          U, a table of utilities, initially empty
          Nsa, a table of frequencies for state‐action pairs, initially zero
          Nsas', a table of frequencies for state‐action‐state triples, initially zero
          s, a, the previous state and action, initially null

  if s' is new then do U[s'] ← r'; R[s'] ← r'            // update reward function
  if s is not null then do
      increment Nsa[s,a] and Nsas'[s,a,s']               // update transition model
      for each t such that Nsas'[s,a,t] is nonzero do
          T[s,a,t] ← Nsas'[s,a,t] / Nsa[s,a]
  U ← POLICY‐EVALUATION(π, U, mdp)
  if TERMINAL?[s'] then s, a ← null else s, a ← s', π[s']
  return a
The Problem with ADP
• Need to solve a system of simultaneous
equations – costs O(n³)
– Very hard to do if you have 10⁵⁰ states, as in
Backgammon
– Could make things a little easier with modified
policy iteration
• Can we avoid the computational expense
of full policy evaluation?
Temporal Difference Learning
• Instead of calculating the exact utility for a state
can we approximate it and possibly make it less
computationally expensive?
• Yes we can! Using Temporal Difference (TD)
learning
Uπ(s) = R(s) + γ Σs' T(s, π(s), s') Uπ(s')

• Instead of doing this sum over all successors, only adjust the
utility of the state based on the successor observed in the trial.
• It does not estimate the transition model – model free
TD Learning
Example:
• Suppose you see that Uπ(1,3) = 0.84 and Uπ(2,3) =
0.92 after the first trial.
• If the transition (1,3) → (2,3) happens all the
time, you would expect to see:
Uπ(1,3) = R(1,3) + Uπ(2,3)
⇒ Uπ(1,3) = ‐0.04 + Uπ(2,3)
⇒ Uπ(1,3) = ‐0.04 + 0.92 = 0.88
• Since you observe Uπ(1,3) = 0.84 in the first trial,
which is a little lower than 0.88, you might want to
"bump" it towards 0.88.
Temporal Difference Update
When we move from state s to s’, we apply the
following update rule:
Uπ(s) ← Uπ(s) + α (R(s) + γ Uπ(s') − Uπ(s))

This is similar to one step of value iteration

We call this equation a "backup"
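
A minimal sketch (not from the slides) of this backup in code, using the numbers from the earlier example; the learning rate α = 0.1 is an arbitrary choice:

def td_update(U, s, s_next, r_s, alpha=0.1, gamma=1.0):
    """One TD backup after observing the transition s -> s_next; r_s is R(s)."""
    U[s] = U[s] + alpha * (r_s + gamma * U[s_next] - U[s])

# U(1,3) = 0.84 and U(2,3) = 0.92 after the first trial; R(1,3) = -0.04.
U = {(1, 3): 0.84, (2, 3): 0.92}
td_update(U, (1, 3), (2, 3), r_s=-0.04)
print(U[(1, 3)])   # bumped from 0.84 towards 0.88: 0.84 + 0.1 * 0.04 = 0.844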
Convergence
• Since we’re using the observed successor s’ instead of all
the successors, what happens if the transition s → s' is
very rare and there is a big jump in utilities from s to s’?
• How can Uπ(s) converge to the true equilibrium value?
• Answer: The average value of Uπ(s) will converge to the
correct value
• This means we need to observe enough trials that have
transitions from s to its successors
• Essentially, the effects of the TD backups will be averaged
over a large number of transitions
• Rare transitions will be rare in the set of transitions
observed
Comparison between ADP and TD
• Advantages of ADP:
– Converges to the true utilities faster
– Utility estimates don’t vary as much from the true
utilities
• Advantages of TD:
– Simpler, less computation per observation
– Crude but efficient first approximation to ADP
– Doesn't need to build a transition model in order to
perform its updates (this is important because we can
interleave computation with exploration rather than
having to wait for the whole model to be built first)
ADP and TD
Overall comparisons
What You Should Know
• How reinforcement learning differs from
supervised learning and from MDPs
• Pros and cons of:
– Direct Utility Estimation
– Adaptive Dynamic Programming
– Temporal Difference Learning
