
Model-free methods

EE5531 - Reinforcement learning based control


Model-free methods

● Monte Carlo (MC) methods (averaging)
● Temporal difference (TD) methods
  ○ SARSA
  ○ Q-learning
Dynamic programming backup

Monte Carlo (MC) methods

● MC methods directly learn from episodes of experience

● MC methods are model-free: no knowledge of the MDP transitions is required

● Value = mean return

● Caveat: All episodes must terminate

First-visit MC for evaluation

● Goal: learn the value function $v_\pi$ for a given policy $\pi$

● Idea: estimate it from experience by averaging the returns observed after visits to each state

● Recall: the return is the total discounted reward, $G_t = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{T-t-1} R_T$

● Recall: the value function is the expected return, $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$

● Monte Carlo policy prediction uses the empirical mean return instead of the expected return

Incremental and running mean

● We can compute the mean of a sequence $x_1, x_2, \dots$ incrementally:
$\mu_k = \mu_{k-1} + \frac{1}{k}\,(x_k - \mu_{k-1})$

● Observe an episode and compute the return $G_t$ following each visit to state $s$

● From MC: $V(s)$ is the average of the returns observed after visits to $s$, where $N(s)$ is the number of times state $s$ is visited

● Incremental way: $V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)}\,\big(G_t - V(S_t)\big)$


Value update in MC

● In MC, the update of $V(S_t)$ is done once per episode using $V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)}\,\big(G_t - V(S_t)\big)$

● Instead of $\frac{1}{N(S_t)}$, we can use a constant step size $\alpha$ to compute a running mean: $V(S_t) \leftarrow V(S_t) + \alpha\,\big(G_t - V(S_t)\big)$
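As a concrete illustration, here is a minimal Python sketch (not from the slides; the dictionary-based tabular representation is an assumption) of the two update rules, the exact incremental mean with $1/N(s)$ and the running mean with a constant step size $\alpha$:

```python
from collections import defaultdict

V = defaultdict(float)   # tabular value estimates V[s]
N = defaultdict(int)     # visit counts N[s]

def mc_update_incremental(state, G):
    """Exact incremental mean: V(s) <- V(s) + (1/N(s)) * (G - V(s))."""
    N[state] += 1
    V[state] += (G - V[state]) / N[state]

def mc_update_running(state, G, alpha=0.1):
    """Running mean with constant step size: V(s) <- V(s) + alpha * (G - V(s))."""
    V[state] += alpha * (G - V[state])
```

The constant-$\alpha$ form forgets old returns exponentially, which is useful when the environment is non-stationary.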

MC prediction

● First-visit MC method: estimates $v_\pi(s)$ as the average of the returns following first visits to $s$.
● Every-visit MC method: estimates $v_\pi(s)$ as the average of the returns following all visits to $s$.
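A minimal first-visit MC prediction sketch in Python; the episode format (a list of (state, reward) pairs, with the reward received on leaving that state) is an assumption for illustration:

```python
from collections import defaultdict

def first_visit_mc_prediction(episodes, gamma=1.0):
    """Estimate V(s) as the average return following the first visit to s."""
    V = defaultdict(float)
    N = defaultdict(int)
    for episode in episodes:
        # Work backwards to get the return G_t at every time step.
        G, returns = 0.0, []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns.append((state, G))
        returns.reverse()
        # Average only the return of the *first* visit to each state.
        seen = set()
        for state, G in returns:
            if state not in seen:
                seen.add(state)
                N[state] += 1
                V[state] += (G - V[state]) / N[state]
    return V
```

Dropping the `seen` check gives the every-visit variant.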

Are we updating the policy here? No.
Example-1: Random walk

The value of a state is the probability of terminating on the right once started from C (undiscounted, with a reward of $+1$ only on the right terminal). Consider a step size $\alpha$ and an initial value estimate for each state, and consider the following episodes, where the final number is the terminal reward:

● Episode-1: C, D, E, D, C, B, A, 0
● Episode-2: C, D, E, 1
● Episode-3: C, D, C, B, A, 0

Compute the value estimates $V(s)$ after episodes 1, 2 and 3 using MC and TD methods.
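The sketch below reproduces the exercise numerically. The step size $\alpha = 0.1$, the initial estimates $V(s) = 0.5$, and the use of first-visit MC with the same constant step size are assumptions for illustration, not values fixed by the slide:

```python
# Random-walk exercise: TD(0) vs constant-alpha MC (illustrative sketch).
episodes = [  # visited states, followed by the terminal reward
    (["C", "D", "E", "D", "C", "B", "A"], 0.0),
    (["C", "D", "E"], 1.0),
    (["C", "D", "C", "B", "A"], 0.0),
]
alpha, gamma = 0.1, 1.0            # assumed values

# TD(0): update after every step, bootstrapping from the next state's value.
V_td = {s: 0.5 for s in "ABCDE"}
for states, terminal_reward in episodes:
    for t, s in enumerate(states):
        if t + 1 < len(states):
            target = 0.0 + gamma * V_td[states[t + 1]]  # intermediate rewards are 0
        else:
            target = terminal_reward                     # next state is terminal (value 0)
        V_td[s] += alpha * (target - V_td[s])

# First-visit MC: update towards the full return (here just the terminal reward).
V_mc = {s: 0.5 for s in "ABCDE"}
for states, terminal_reward in episodes:
    G = terminal_reward
    for s in dict.fromkeys(states):                      # first visits, in order
        V_mc[s] += alpha * (G - V_mc[s])

print("TD(0):", V_td)
print("MC   :", V_mc)
```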

Example-1: Random walk contd

(Worked results on the slide: $V(s)$ after episode 1 for TD, and $V(s)$ after episodes 1, 2 and 3 for MC.)
Generalized Policy Iteration with MC Evaluation

● Policy evaluation: Monte Carlo policy evaluation, $V \approx v_\pi$

● Policy improvement: greedy policy improvement?

Monte Carlo Estimation of Action Values

● Greedy policy improvement over $V(s)$ requires a model of the MDP (a one-step look-ahead through the transition probabilities and rewards)

● Greedy policy improvement over $Q(s, a)$ is model-free: $\pi'(s) = \arg\max_a Q(s, a)$

Recall: policy iteration

● Generalized policy iteration with the action-value function:

○ Policy evaluation: Monte Carlo policy evaluation, $Q \approx q_\pi$
○ Policy improvement: greedy policy improvement?

$\epsilon$-greedy Policy Improvement

● We have to ensure that each state-action pair is visited a sufficient (in the limit, infinite) number of times
● Simple idea: $\epsilon$-greedy exploration
● All actions are selected with non-zero probability
● With probability $\epsilon$ choose a random action; with probability $1 - \epsilon$ take the greedy action (a minimal sketch follows)
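A minimal sketch of $\epsilon$-greedy action selection over a tabular $Q$; the dictionary keyed by (state, action) pairs is an assumption:

```python
import random

def epsilon_greedy_action(Q, state, actions, epsilon=0.1):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))
```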

On-policy First-visit MC Control

● Action-value evaluation: estimate $Q(s, a)$ from the returns of episodes generated by the current policy

● Policy improvement: make the policy $\epsilon$-greedy with respect to $Q$ (a condensed sketch follows)
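A condensed sketch of on-policy first-visit MC control with an $\epsilon$-greedy policy. The `env.reset()` / `env.step()` interface returning `(next_state, reward, done)` is an assumption for illustration:

```python
from collections import defaultdict
import random

def mc_control_epsilon_greedy(env, actions, num_episodes=10000, gamma=1.0, epsilon=0.1):
    """On-policy first-visit MC control: evaluate Q under the epsilon-greedy
    policy, and improve the policy greedily with respect to Q."""
    Q = defaultdict(float)
    N = defaultdict(int)

    def policy(state):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    for _ in range(num_episodes):
        # Generate one episode with the current epsilon-greedy policy.
        episode, state, done = [], env.reset(), False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state
        # First-visit evaluation of Q(s, a) from the observed returns.
        G, tail = 0.0, []
        for s, a, r in reversed(episode):
            G = r + gamma * G
            tail.append((s, a, G))
        seen = set()
        for s, a, G in reversed(tail):
            if (s, a) not in seen:
                seen.add((s, a))
                N[(s, a)] += 1
                Q[(s, a)] += (G - Q[(s, a)]) / N[(s, a)]
    return Q
```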

Temporal difference (TD) learning

Goal: estimate or optimize the value function of an unknown MDP using temporal-difference learning.

● TD is a combination of Monte Carlo and dynamic programming ideas


● Like MC methods, TD methods learn directly from raw experience without a model of the environment's dynamics
● TD learns from incomplete episodes by bootstrapping
● Bootstrapping: update estimates based on other estimates, without waiting for a final outcome

Temporal difference (TD) learning

● Goal: learn the value function $v_\pi$ for a given policy $\pi$

● Monte Carlo update: update the value $V(S_t)$ towards the actual return $G_t$:
$V(S_t) \leftarrow V(S_t) + \alpha\,\big(G_t - V(S_t)\big)$

● Temporal-difference learning algorithm TD(0): update the value $V(S_t)$ towards the estimated return (the TD target) $R_{t+1} + \gamma V(S_{t+1})$:
$V(S_t) \leftarrow V(S_t) + \alpha\,\big(R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\big)$

TD Prediction

The update is done without waiting for the episode to end.
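A minimal TD(0) prediction sketch; the same generic `env.reset()` / `env.step()` interface as above is assumed:

```python
from collections import defaultdict

def td0_prediction(env, policy, num_episodes=1000, alpha=0.1, gamma=1.0):
    """Tabular TD(0): after every step, move V(S_t) towards the TD target
    R_{t+1} + gamma * V(S_{t+1}), without waiting for the episode to end."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            target = reward + (0.0 if done else gamma * V[next_state])
            V[state] += alpha * (target - V[state])
            state = next_state
    return V
```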

Example-1 (revisit): Random walk

The value of a state is the probability of terminating on the right once started from C (undiscounted, with a reward of $+1$ only on the right terminal). Consider a step size $\alpha$ and an initial value estimate for each state, and consider the following episodes, where the final number is the terminal reward:

● Episode-1: C, D, E, D, C, B, A, 0
● Episode-2: C, D, E, 1
● Episode-3: C, D, C, B, A, 0

Compute the value estimates $V(s)$ after episodes 1, 2 and 3 using MC and TD methods.

Example-1 (revisit): Random walk contd

(Worked results on the slide: $V(s)$ after episode 1 for TD, and $V(s)$ after episodes 1, 2 and 3 for MC.)
Example-2: MC vs TD

Observe the following episodes (no discounting). What are the value estimates for the observed states using MC and using TD(0)?

(The slide lists the episodes together with the optimal values, the MC estimates and the TD estimates.)

● TD exploits the Markov structure of the problem

● MC explains the observed data better (minimum mean-squared error on the observed returns)
MC vs TD

● Batch MC converges to the solution with minimum MSE on the observed returns
● Batch TD converges to the solution of the maximum-likelihood Markov model
● Temporal difference (TD) methods
○ SARSA
○ Q-learning

SARSA: On-policy TD Control

● Learn the action-value function $Q(s, a)$ via the update
$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\,\big(R_{t+1} + \gamma\, Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)\big)$

● The update is applied on every transition from one state-action pair to the next state-action pair

● Episode: $S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, \dots$

● SARSA: State, Action, Reward, State, Action (a minimal sketch follows)
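A minimal SARSA sketch, reusing the `epsilon_greedy_action` helper and the assumed `env` interface from the earlier sketches:

```python
from collections import defaultdict

def sarsa(env, actions, num_episodes=1000, alpha=0.1, gamma=1.0, epsilon=0.1):
    """On-policy TD control: the target uses the action A_{t+1} actually
    selected by the behaviour (epsilon-greedy) policy."""
    Q = defaultdict(float)
    for _ in range(num_episodes):
        state, done = env.reset(), False
        action = epsilon_greedy_action(Q, state, actions, epsilon)
        while not done:
            next_state, reward, done = env.step(action)
            next_action = epsilon_greedy_action(Q, next_state, actions, epsilon)
            target = reward + (0.0 if done else gamma * Q[(next_state, next_action)])
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state, action = next_state, next_action
    return Q
```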

SARSA: On-policy TD Control

● Why is it called an on-policy method? The action $A_{t+1}$ used in the update target is selected by the same ($\epsilon$-greedy) policy that is being evaluated and improved.

Q-learning: Off-policy TD Control

● Learn the action-value function $Q(s, a)$ via the update
$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\,\big(R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)\big)$

● Why is it called an off-policy method? The behaviour policy (e.g. $\epsilon$-greedy) selects the actions that are taken, while the target bootstraps from the greedy policy via $\max_a Q(S_{t+1}, a)$.
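A minimal Q-learning sketch under the same assumed interface; note that only the target differs from SARSA:

```python
from collections import defaultdict

def q_learning(env, actions, num_episodes=1000, alpha=0.1, gamma=1.0, epsilon=0.1):
    """Off-policy TD control: behave epsilon-greedily, but bootstrap from the
    greedy value max_a Q(S_{t+1}, a)."""
    Q = defaultdict(float)
    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            action = epsilon_greedy_action(Q, state, actions, epsilon)
            next_state, reward, done = env.step(action)
            best_next = max(Q[(next_state, a)] for a in actions)
            target = reward + (0.0 if done else gamma * best_next)
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state = next_state
    return Q
```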


Example: Q-learning

● Consider a one-dimensional grid of cells
● Two actions are possible: Backward and Forward
● Forward always moves one cell; on the last cell the agent bumps into the wall
● Backward always takes the agent back to the first cell
● Entering the last cell gives a reward of +10
● Entering the first cell gives a reward of +2
● Find the action values for each cell using Q-learning (a sketch follows below)
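A self-contained sketch of Q-learning on this example. The number of cells (5), the discount factor, the exploration rate, and the convention that the entry reward is earned only when actually moving into the cell are all assumptions, since the slide leaves them open:

```python
import random
from collections import defaultdict

N_CELLS = 5                              # assumed grid length
ACTIONS = ["forward", "backward"]
alpha, gamma, epsilon = 0.1, 0.9, 0.2    # assumed hyperparameters

def step(state, action):
    """Assumed dynamics of the grid."""
    if action == "forward":
        next_state = min(state + 1, N_CELLS - 1)   # bump into the wall on the last cell
    else:
        next_state = 0                             # backward always returns to the first cell
    if next_state == N_CELLS - 1 and next_state != state:
        reward = 10.0                              # entering the last cell
    elif next_state == 0 and next_state != state:
        reward = 2.0                               # entering the first cell
    else:
        reward = 0.0
    return next_state, reward

Q = defaultdict(float)
state = 0
for _ in range(20000):                             # continuing task: one long run
    if random.random() < epsilon:
        action = random.choice(ACTIONS)
    else:
        action = max(ACTIONS, key=lambda a: Q[(state, a)])
    next_state, reward = step(state, action)
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
    state = next_state

for s in range(N_CELLS):
    print(s, {a: round(Q[(s, a)], 2) for a in ACTIONS})
```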

Convergence of Q-learning

● Q-learning converges to the optimal action-value function $q_*$ with probability 1, provided every state-action pair is visited infinitely often and the step sizes satisfy $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$.
Unified view

Summary

● TD can learn before knowing the final outcome


○ TD can learn online after every step
○ MC must wait until end of episode before return is known
○ TD can learn without the final outcome
● TD can learn from incomplete sequences
○ MC can only learn from complete sequences
○ TD works in continuing (non-terminating) environments
○ MC only works for episodic (terminating) environments
● MC has high variance and zero bias. It is not very sensitive to the initial values and has good convergence properties (even with function approximation).
● TD has lower variance but some bias. It is sensitive to the initial values, and convergence is not always guaranteed (in particular with function approximation).

