
Model-free methods

EE5531 - Reinforcement learning based control


Model-free methods

● Monte Carlo (MC) methods (averaging)
● Temporal difference (TD) methods
  ○ SARSA
  ○ Q-learning
Dynamic programming backup

Monte Carlo (MC) methods

● MC methods directly learn from episodes of experience

● MC methods are model-free: no knowledge of the MDP transitions is required

● Value = mean return

● Caveat: All episodes must terminate

First-visit MC for evaluation

● Goal: learn the value function $v_\pi$ for a given policy $\pi$

● Idea: estimate it from experience by averaging the returns observed after visits to each state

● Recall: the return is the total discounted reward, $G_t = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{T-t-1} R_T$

● Recall: the value function is the expected return, $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$

● Monte Carlo policy prediction uses the empirical mean return instead of the expected return

Incremental and running mean

● We can compute the mean of a sequence $x_1, x_2, \dots$ incrementally:
$\mu_k = \mu_{k-1} + \frac{1}{k}\,(x_k - \mu_{k-1})$

● Observe an episode and compute the return $G_t$ following each visit to state $s$

● From MC: $V(s)$ is the average of the returns observed after visits to $s$, where $N(s)$ is the number of times state $s$ is visited

● Incremental way: $V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)}\,\big(G_t - V(S_t)\big)$


Value update in MC

● In MC, the update of $V(S_t)$ is done once per episode using $V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)}\,\big(G_t - V(S_t)\big)$

● Instead of $\frac{1}{N(S_t)}$, we can use a constant step size $\alpha$ to compute a running mean: $V(S_t) \leftarrow V(S_t) + \alpha\,\big(G_t - V(S_t)\big)$
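As a concrete illustration, here is a minimal Python sketch (not from the slides; the dictionary-based tabular representation is an assumption) of the two update rules, the exact incremental mean with $1/N(s)$ and the running mean with a constant step size $\alpha$:

```python
from collections import defaultdict

V = defaultdict(float)   # tabular value estimates V[s]
N = defaultdict(int)     # visit counts N[s]

def mc_update_incremental(state, G):
    """Exact incremental mean: V(s) <- V(s) + (1/N(s)) * (G - V(s))."""
    N[state] += 1
    V[state] += (G - V[state]) / N[state]

def mc_update_running(state, G, alpha=0.1):
    """Running mean with constant step size: V(s) <- V(s) + alpha * (G - V(s))."""
    V[state] += alpha * (G - V[state])
```

The constant-$\alpha$ form forgets old returns exponentially, which is useful when the environment is non-stationary.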

MC prediction

● First-visit MC method: estimates $v_\pi(s)$ as the average of the returns following first visits to $s$.
● Every-visit MC method: estimates $v_\pi(s)$ as the average of the returns following all visits to $s$.
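A minimal first-visit MC prediction sketch in Python; the episode format (a list of (state, reward) pairs, with the reward received on leaving that state) is an assumption for illustration:

```python
from collections import defaultdict

def first_visit_mc_prediction(episodes, gamma=1.0):
    """Estimate V(s) as the average return following the first visit to s."""
    V = defaultdict(float)
    N = defaultdict(int)
    for episode in episodes:
        # Work backwards to get the return G_t at every time step.
        G, returns = 0.0, []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns.append((state, G))
        returns.reverse()
        # Average only the return of the *first* visit to each state.
        seen = set()
        for state, G in returns:
            if state not in seen:
                seen.add(state)
                N[state] += 1
                V[state] += (G - V[state]) / N[state]
    return V
```

Dropping the `seen` check gives the every-visit variant.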

Are we updating the policy here? No.
Example-1: Random walk

The value of a state is the probability of terminating on the right once started from C (undiscounted, with a reward of $+1$ only on the right terminal). Consider a step size $\alpha$ and an initial value estimate for each state, and consider the following episodes, where the final number is the terminal reward:

● Episode-1: C, D, E, D, C, B, A, 0
● Episode-2: C, D, E, 1
● Episode-3: C, D, C, B, A, 0

Compute the value estimates $V(s)$ after episodes 1, 2 and 3 using MC and TD methods.
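The sketch below reproduces the exercise numerically. The step size $\alpha = 0.1$, the initial estimates $V(s) = 0.5$, and the use of first-visit MC with the same constant step size are assumptions for illustration, not values fixed by the slide:

```python
# Random-walk exercise: TD(0) vs constant-alpha MC (illustrative sketch).
episodes = [  # visited states, followed by the terminal reward
    (["C", "D", "E", "D", "C", "B", "A"], 0.0),
    (["C", "D", "E"], 1.0),
    (["C", "D", "C", "B", "A"], 0.0),
]
alpha, gamma = 0.1, 1.0            # assumed values

# TD(0): update after every step, bootstrapping from the next state's value.
V_td = {s: 0.5 for s in "ABCDE"}
for states, terminal_reward in episodes:
    for t, s in enumerate(states):
        if t + 1 < len(states):
            target = 0.0 + gamma * V_td[states[t + 1]]  # intermediate rewards are 0
        else:
            target = terminal_reward                     # next state is terminal (value 0)
        V_td[s] += alpha * (target - V_td[s])

# First-visit MC: update towards the full return (here just the terminal reward).
V_mc = {s: 0.5 for s in "ABCDE"}
for states, terminal_reward in episodes:
    G = terminal_reward
    for s in dict.fromkeys(states):                      # first visits, in order
        V_mc[s] += alpha * (G - V_mc[s])

print("TD(0):", V_td)
print("MC   :", V_mc)
```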

Example-1: Random walk contd

(Worked results on the slide: $V(s)$ after episode 1 for TD, and $V(s)$ after episodes 1, 2 and 3 for MC.)
Generalized Policy Iteration with MC Evaluation

● Policy evaluation: Monte Carlo policy evaluation, $V \approx v_\pi$

● Policy improvement: greedy policy improvement?

Monte Carlo Estimation of Action Values

● Greedy policy improvement over $V(s)$ requires a model of the MDP (a one-step look-ahead through the transition probabilities and rewards)

● Greedy policy improvement over $Q(s, a)$ is model-free: $\pi'(s) = \arg\max_a Q(s, a)$

Recall: policy iteration

● Generalized policy iteration with the action-value function:

○ Policy evaluation: Monte Carlo policy evaluation, $Q \approx q_\pi$
○ Policy improvement: greedy policy improvement?

$\epsilon$-greedy Policy Improvement

● We have to ensure that each state-action pair is visited a sufficient (in the limit, infinite) number of times
● Simple idea: $\epsilon$-greedy exploration
● All actions are selected with non-zero probability
● With probability $\epsilon$ choose a random action; with probability $1 - \epsilon$ take the greedy action (a minimal sketch follows)
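A minimal sketch of $\epsilon$-greedy action selection over a tabular $Q$; the dictionary keyed by (state, action) pairs is an assumption:

```python
import random

def epsilon_greedy_action(Q, state, actions, epsilon=0.1):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))
```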

On-policy First-visit MC Control

● Action-value evaluation: estimate $Q(s, a)$ from the returns of episodes generated by the current policy

● Policy improvement: make the policy $\epsilon$-greedy with respect to $Q$ (a condensed sketch follows)
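A condensed sketch of on-policy first-visit MC control with an $\epsilon$-greedy policy. The `env.reset()` / `env.step()` interface returning `(next_state, reward, done)` is an assumption for illustration:

```python
from collections import defaultdict
import random

def mc_control_epsilon_greedy(env, actions, num_episodes=10000, gamma=1.0, epsilon=0.1):
    """On-policy first-visit MC control: evaluate Q under the epsilon-greedy
    policy, and improve the policy greedily with respect to Q."""
    Q = defaultdict(float)
    N = defaultdict(int)

    def policy(state):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    for _ in range(num_episodes):
        # Generate one episode with the current epsilon-greedy policy.
        episode, state, done = [], env.reset(), False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state
        # First-visit evaluation of Q(s, a) from the observed returns.
        G, tail = 0.0, []
        for s, a, r in reversed(episode):
            G = r + gamma * G
            tail.append((s, a, G))
        seen = set()
        for s, a, G in reversed(tail):
            if (s, a) not in seen:
                seen.add((s, a))
                N[(s, a)] += 1
                Q[(s, a)] += (G - Q[(s, a)]) / N[(s, a)]
    return Q
```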

Temporal difference (TD) learning

Goal: estimate or optimize the value function of an unknown MDP using temporal-difference learning.

● TD is a combination of Monte Carlo and dynamic programming ideas


● Like MC methods, TD methods learn directly from raw experience without a model of the environment's dynamics
● TD learns from incomplete episodes by bootstrapping
● Bootstrapping: update estimates based on other estimates, without waiting for a final outcome

Temporal difference (TD) learning

● Goal: learn the value function $v_\pi$ for a given policy $\pi$

● Monte Carlo update: update the value $V(S_t)$ towards the actual return $G_t$:
$V(S_t) \leftarrow V(S_t) + \alpha\,\big(G_t - V(S_t)\big)$

● Temporal-difference learning algorithm TD(0): update the value $V(S_t)$ towards the estimated return (the TD target) $R_{t+1} + \gamma V(S_{t+1})$:
$V(S_t) \leftarrow V(S_t) + \alpha\,\big(R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\big)$

TD Prediction

The update is done without waiting for the episode to end.
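A minimal TD(0) prediction sketch; the same generic `env.reset()` / `env.step()` interface as above is assumed:

```python
from collections import defaultdict

def td0_prediction(env, policy, num_episodes=1000, alpha=0.1, gamma=1.0):
    """Tabular TD(0): after every step, move V(S_t) towards the TD target
    R_{t+1} + gamma * V(S_{t+1}), without waiting for the episode to end."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            target = reward + (0.0 if done else gamma * V[next_state])
            V[state] += alpha * (target - V[state])
            state = next_state
    return V
```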

Example-1 (revisit): Random walk

The value of a state is the probability of terminating on the right once started from C (undiscounted, with a reward of $+1$ only on the right terminal). Consider a step size $\alpha$ and an initial value estimate for each state, and consider the following episodes, where the final number is the terminal reward:

● Episode-1: C, D, E, D, C, B, A, 0
● Episode-2: C, D, E, 1
● Episode-3: C, D, C, B, A, 0

Compute the value estimates $V(s)$ after episodes 1, 2 and 3 using MC and TD methods.

Example-1 (revisit): Random walk contd

(Worked results on the slide: $V(s)$ after episode 1 for TD, and $V(s)$ after episodes 1, 2 and 3 for MC.)
Example-2: MC vs TD

Observe the following episodes (no discounting). What are the value estimates for the observed states using MC and using TD(0)?

(The slide lists the episodes together with the optimal values, the MC estimates and the TD estimates.)

● TD exploits the Markov structure of the problem

● MC explains the observed data better (minimum mean-squared error on the observed returns)
MC vs TD

● Batch MC converges to the solution with minimum MSE on the observed returns
● Batch TD converges to the solution of the maximum-likelihood Markov model
● Temporal difference (TD) methods
○ SARSA
○ Q-learning

SARSA: On-policy TD Control

● Learn the action-value function $Q(s, a)$ via the update
$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\,\big(R_{t+1} + \gamma\, Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)\big)$

● The update is applied on every transition from one state-action pair to the next state-action pair

● Episode: $S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, \dots$

● SARSA: State, Action, Reward, State, Action (a minimal sketch follows)
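A minimal SARSA sketch, reusing the `epsilon_greedy_action` helper and the assumed `env` interface from the earlier sketches:

```python
from collections import defaultdict

def sarsa(env, actions, num_episodes=1000, alpha=0.1, gamma=1.0, epsilon=0.1):
    """On-policy TD control: the target uses the action A_{t+1} actually
    selected by the behaviour (epsilon-greedy) policy."""
    Q = defaultdict(float)
    for _ in range(num_episodes):
        state, done = env.reset(), False
        action = epsilon_greedy_action(Q, state, actions, epsilon)
        while not done:
            next_state, reward, done = env.step(action)
            next_action = epsilon_greedy_action(Q, next_state, actions, epsilon)
            target = reward + (0.0 if done else gamma * Q[(next_state, next_action)])
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state, action = next_state, next_action
    return Q
```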

SARSA: On-policy TD Control

● Why is it called an on-policy method? The action $A_{t+1}$ used in the update target is selected by the same ($\epsilon$-greedy) policy that is being evaluated and improved.

Q-learning: Off-policy TD Control

● Learn the action-value function $Q(s, a)$ via the update
$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\,\big(R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)\big)$

● Why is it called an off-policy method? The behaviour policy (e.g. $\epsilon$-greedy) selects the actions that are taken, while the target bootstraps from the greedy policy via $\max_a Q(S_{t+1}, a)$.
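A minimal Q-learning sketch under the same assumed interface; note that only the target differs from SARSA:

```python
from collections import defaultdict

def q_learning(env, actions, num_episodes=1000, alpha=0.1, gamma=1.0, epsilon=0.1):
    """Off-policy TD control: behave epsilon-greedily, but bootstrap from the
    greedy value max_a Q(S_{t+1}, a)."""
    Q = defaultdict(float)
    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            action = epsilon_greedy_action(Q, state, actions, epsilon)
            next_state, reward, done = env.step(action)
            best_next = max(Q[(next_state, a)] for a in actions)
            target = reward + (0.0 if done else gamma * best_next)
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state = next_state
    return Q
```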


Example: Q-learning

● Consider a one-dimensional grid of cells
● Two actions are possible: Backward and Forward
● Forward always moves one cell; on the last cell the agent bumps into the wall
● Backward always takes the agent back to the first cell
● Entering the last cell gives a reward of +10
● Entering the first cell gives a reward of +2
● Find the action values for each cell using Q-learning (a sketch follows below)
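A self-contained sketch of Q-learning on this example. The number of cells (5), the discount factor, the exploration rate, and the convention that the entry reward is earned only when actually moving into the cell are all assumptions, since the slide leaves them open:

```python
import random
from collections import defaultdict

N_CELLS = 5                              # assumed grid length
ACTIONS = ["forward", "backward"]
alpha, gamma, epsilon = 0.1, 0.9, 0.2    # assumed hyperparameters

def step(state, action):
    """Assumed dynamics of the grid."""
    if action == "forward":
        next_state = min(state + 1, N_CELLS - 1)   # bump into the wall on the last cell
    else:
        next_state = 0                             # backward always returns to the first cell
    if next_state == N_CELLS - 1 and next_state != state:
        reward = 10.0                              # entering the last cell
    elif next_state == 0 and next_state != state:
        reward = 2.0                               # entering the first cell
    else:
        reward = 0.0
    return next_state, reward

Q = defaultdict(float)
state = 0
for _ in range(20000):                             # continuing task: one long run
    if random.random() < epsilon:
        action = random.choice(ACTIONS)
    else:
        action = max(ACTIONS, key=lambda a: Q[(state, a)])
    next_state, reward = step(state, action)
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
    state = next_state

for s in range(N_CELLS):
    print(s, {a: round(Q[(s, a)], 2) for a in ACTIONS})
```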

Convergence of Q-learning

● Q-learning converges to the optimal action-value function $q_*$ with probability 1, provided every state-action pair is visited infinitely often and the step sizes satisfy $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$.
Unified view

Summary

● TD can learn before knowing the final outcome


○ TD can learn online after every step
○ MC must wait until end of episode before return is known
○ TD can learn without the final outcome
● TD can learn from incomplete sequences
○ MC can only learn from complete sequences
○ TD works in continuing (non-terminating) environments
○ MC only works for episodic (terminating) environments
● MC has high variance and zero bias. It is not very sensitive to the initial values and has good convergence properties (even with function approximation).
● TD has lower variance but some bias. It is sensitive to the initial values, and convergence is not always guaranteed (in particular with function approximation).

