Model-Free Methods
Dynamic programming backup
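For contrast with the sample-based methods that follow, the dynamic programming (Bellman expectation) backup can be written out; this is the standard form, not reproduced on the slide:

$$ v_{k+1}(s) = \sum_a \pi(a \mid s) \sum_{s',r} p(s', r \mid s, a)\,\bigl[\, r + \gamma\, v_k(s') \,\bigr] $$

It requires the transition model $p(s', r \mid s, a)$, which is exactly what model-free methods avoid.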
Monte Carlo (MC) methods
First-visit MC for evaluation
● Idea: estimate the value of a state from experience by averaging the returns observed after visits to that state
● Monte Carlo policy prediction uses the empirical mean return in place of the expected return
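Concretely, the empirical mean replaces the expectation with an average over sampled returns; written out in standard notation (not shown on the slide):

$$ V(s) \approx \frac{1}{N(s)} \sum_{i=1}^{N(s)} G_i(s) \;\longrightarrow\; v_\pi(s) \quad \text{as } N(s) \to \infty $$

where $G_i(s)$ is the return observed after the $i$-th counted visit to $s$ and $N(s)$ is the number of such visits.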
Incremental and running mean
● From MC, the value estimate is the mean of the returns observed so far
● The mean can be computed incrementally, without storing all returns (written out below)
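The running mean can be maintained incrementally; the standard identity is:

$$ \mu_k = \mu_{k-1} + \frac{1}{k}\bigl(x_k - \mu_{k-1}\bigr) $$

Applied to MC prediction this gives

$$ V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)}\bigl(G_t - V(S_t)\bigr), $$

and replacing $1/N(S_t)$ with a constant step size $\alpha$ yields $V(S_t) \leftarrow V(S_t) + \alpha\,(G_t - V(S_t))$, which also tracks non-stationary problems.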
MC prediction
● First-visit MC method: estimates $v_\pi(s)$ as the average of the returns following first visits to s
● Every-visit MC method: estimates $v_\pi(s)$ as the average of the returns following all visits to s
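A minimal sketch of first-visit MC prediction; the episode format (a list of (state, reward) pairs, where the reward is the one received on leaving that state) and the default γ = 1 are assumptions for illustration, not part of the slide.

```python
from collections import defaultdict

def first_visit_mc_prediction(episodes, gamma=1.0):
    """Estimate V(s) by averaging returns observed after the first visit to s."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    V = defaultdict(float)

    for episode in episodes:
        # Compute the return G_t for every step, working backwards from the end.
        G = 0.0
        returns = []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns.append((state, G))
        returns.reverse()

        seen = set()
        for state, G in returns:
            if state in seen:          # first-visit: ignore later visits in this episode
                continue
            seen.add(state)
            returns_sum[state] += G
            returns_count[state] += 1
            V[state] = returns_sum[state] / returns_count[state]
    return V
```

For the every-visit variant, simply drop the `seen` check so every occurrence of a state contributes its return.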
Are we updating the policy here? No; this is prediction (evaluation) only.
Example-1: Random walk
The value of a state is the probability of terminating on the right; each episode starts from C. Consider the following episodes (a worked estimate follows the list):
● Episode-1: C, D, E, D, C, B, A, 0
● Episode-2: C, D, E, 1
● Episode-3: C, D, C, B, A, 0
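Assuming the trailing 0/1 is the undiscounted return of the episode (+1 for terminating on the right, 0 on the left), first-visit MC averages one return per episode in which the state appears:

$$ V(C) = \frac{0+1+0}{3} = \tfrac{1}{3}, \quad V(D) = \frac{0+1+0}{3} = \tfrac{1}{3}, \quad V(E) = \frac{0+1}{2} = \tfrac{1}{2}, \quad V(A) = V(B) = 0. $$

Every-visit MC differs where a state is revisited: C is visited five times across the three episodes with returns (0, 0, 1, 0, 0), giving $V(C) = \tfrac{1}{5}$.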
Example-1: Random walk contd
Generalized Policy Iteration with MC Evaluation
Monte Carlo Estimation of Action Values
ε-greedy Policy Improvement
● We have to ensure that each state-action pair is visited a sufficient (infinite) number of times
● Simple idea: ε-greedy
● All actions have non-zero probability
● With probability ε choose a random action; with probability 1 − ε take the greedy action (see the sketch below)
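A minimal sketch of ε-greedy action selection; the representation of Q (indexable by state, then action) is an assumption for illustration.

```python
import random

def epsilon_greedy_action(Q, state, n_actions, epsilon=0.1):
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    if random.random() < epsilon:
        return random.randrange(n_actions)                  # explore
    values = Q[state]
    return max(range(n_actions), key=lambda a: values[a])   # exploit (greedy)
```

Under this rule every action has probability at least ε/|A|, and the greedy action has probability 1 − ε + ε/|A|, so all state-action pairs keep being tried.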
On-policy First-visit MC Control
Alternates action-value (Q) evaluation with ε-greedy policy improvement (generalized policy iteration).
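A compact sketch of on-policy first-visit MC control with ε-greedy exploration; the Gym-style environment interface (reset() returning a state, step() returning a 4-tuple, a discrete action_space.n) and the constants are assumptions for illustration.

```python
import random
from collections import defaultdict

def mc_control_epsilon_greedy(env, n_episodes=10_000, gamma=1.0, epsilon=0.1):
    """On-policy first-visit MC control (sketch)."""
    n_actions = env.action_space.n
    Q = defaultdict(lambda: [0.0] * n_actions)
    counts = defaultdict(lambda: [0] * n_actions)

    def policy(state):
        # epsilon-greedy with respect to the current Q
        if random.random() < epsilon:
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda a: Q[state][a])

    for _ in range(n_episodes):
        # Generate one episode following the current epsilon-greedy policy.
        episode, state, done = [], env.reset(), False
        while not done:
            action = policy(state)
            next_state, reward, done, _ = env.step(action)
            episode.append((state, action, reward))
            state = next_state

        # First-visit MC update of Q along the episode.
        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = r + gamma * G
            if (s, a) not in [(episode[k][0], episode[k][1]) for k in range(t)]:
                counts[s][a] += 1
                Q[s][a] += (G - Q[s][a]) / counts[s][a]   # incremental mean
    return Q
```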
Temporal difference (TD) learning
Estimate or optimize the value function of an unknown MDP using temporal difference learning.
TD Prediction
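For reference, the TD(0) prediction update in its standard form is:

$$ V(S_t) \leftarrow V(S_t) + \alpha \bigl[\, R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \,\bigr] $$

where $R_{t+1} + \gamma V(S_{t+1})$ is the TD target and $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ is the TD error. Unlike MC, the update is applied after every step, without waiting for the episode to end (bootstrapping).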
Example-1 (revisit): Random walk
The value of a state is the probability of terminating on the right; each episode starts from C. Consider the same episodes (a TD(0) sketch follows the list):
● Episode-1: C, D, E, D, C, B, A, 0
● Episode-2: C, D, E, 1
● Episode-3: C, D, C, B, A, 0
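A small sketch that runs TD(0) over these three episodes. The step size α = 0.1, the initial values V = 0.5, and the reward encoding (+1 only when entering the right terminal) are assumptions for illustration, since those constants are not shown above.

```python
def td0_random_walk(episodes, alpha=0.1, gamma=1.0):
    """TD(0) over episodes from the 5-state random walk.

    Each episode is a list of states ending in a terminal marker:
    0 = terminated on the left (reward 0), 1 = terminated on the right (reward +1).
    """
    V = {s: 0.5 for s in "ABCDE"}          # illustrative initial values
    for episode in episodes:
        *states, terminal = episode
        for t, s in enumerate(states):
            if t + 1 < len(states):                      # non-terminal transition
                r, v_next = 0.0, V[states[t + 1]]
            else:                                        # transition into a terminal state
                r, v_next = float(terminal), 0.0
            V[s] += alpha * (r + gamma * v_next - V[s])  # TD(0) update
    return V

episodes = [
    ["C", "D", "E", "D", "C", "B", "A", 0],
    ["C", "D", "E", 1],
    ["C", "D", "C", "B", "A", 0],
]
print(td0_random_walk(episodes))
```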
Example-1 (revisit): Random walk contd
Example-2: MC vs TD
Observe the given episodes (no discounting). What are the state-value estimates under MC and under TD(0)? Compare the optimal value of each state with its batch MC and batch TD estimates.
● Batch MC converges to the solution with minimum MSE on the observed returns
● Batch TD converges to the solution of the maximum-likelihood Markov model
● Temporal difference (TD) methods
○ SARSA
○ Q-learning
SARSA: On-policy TD Control
● Episode: …, S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}, … (the quintuple that gives SARSA its name)
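The SARSA update applied after every transition (standard form):

$$ Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \bigl[\, R_{t+1} + \gamma\, Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \,\bigr] $$

Because $A_{t+1}$ is chosen by the same (e.g. ε-greedy) policy that is being evaluated and improved, the method is on-policy.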
Q-learning: Off-policy TD Control
● Consider a row of grid cells (a corridor)
● Two possible actions: Backward and Forward
● Forward always moves one cell; in the last cell it bumps into the wall
● Backward always takes the agent to the first cell
● Entering the last cell gives a reward of +10
● Entering the first cell gives a reward of +2
● Find the action values for each cell using Q-learning (a sketch follows)
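A minimal sketch of tabular Q-learning on the corridor just described. The corridor length (5 cells), α, γ, ε, the episode length, and the exact reward-on-entry interpretation are assumptions for illustration.

```python
import random

N_CELLS, BACKWARD, FORWARD = 5, 0, 1   # corridor length is an assumption

def step(state, action):
    """Corridor dynamics as described above; rewards are given on entering a cell."""
    if action == FORWARD:
        next_state = min(state + 1, N_CELLS - 1)   # bump into the wall on the last cell
    else:
        next_state = 0                              # Backward always goes to the first cell
    if next_state == N_CELLS - 1 and next_state != state:
        reward = 10.0                               # entering the last cell
    elif next_state == 0 and next_state != state:
        reward = 2.0                                # entering the first cell
    else:
        reward = 0.0
    return next_state, reward

def q_learning(n_episodes=500, steps_per_episode=50, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = [[0.0, 0.0] for _ in range(N_CELLS)]        # Q[cell][action]
    for _ in range(n_episodes):
        state = random.randrange(N_CELLS)
        for _ in range(steps_per_episode):
            # epsilon-greedy behaviour policy
            if random.random() < epsilon:
                action = random.randrange(2)
            else:
                action = max((BACKWARD, FORWARD), key=lambda a: Q[state][a])
            next_state, reward = step(state, action)
            # Q-learning target uses the max over next actions (off-policy)
            td_target = reward + gamma * max(Q[next_state])
            Q[state][action] += alpha * (td_target - Q[state][action])
            state = next_state
    return Q

print(q_learning())
```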
Example: Q-learning
Convergence of Q-learning
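For reference, the standard result (Watkins and Dayan, 1992) is that tabular Q-learning converges to $q_*$ with probability 1 provided every state-action pair continues to be updated and the step sizes satisfy the Robbins-Monro conditions:

$$ \sum_t \alpha_t = \infty, \qquad \sum_t \alpha_t^2 < \infty. $$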
Unified view
Summary