Sec 12
Reinforcement Learning
1 Introduction
In the reinforcement learning setting, we don’t have direct access to the transition distribution p(s′ | s, a) or the reward function r(s, a) — information about these only comes to us through the outcomes of interacting with the environment. This problem is hard because some states can lead to high rewards, but we don’t know which ones; even if we did, we don’t know how to get there!
To deal with this, in lecture we discussed model-based and model-free reinforcement learning. Two examples illustrate the distinction:
• Finding the best route to take through a treacherous forest using a map (model-based: we have a model of the world and can plan with it).
• Bringing the new Boston Dynamics robot into a new obstacle course it has never seen before (model-free: no model is available, so the robot must learn from interaction).
3 Model-based Learning
For model-based learning, we estimate the missing world models: r(s, a) and p(s′ | s, a), and then use planning (value or policy iteration) to develop a policy π.
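As a minimal sketch of this pipeline (the data format, state/action sets, and function names below are illustrative assumptions, not part of the original notes), one could estimate r̂ and p̂ by counting observed transitions and then plan on the estimated model with value iteration:

from collections import defaultdict

def estimate_model(transitions, states, actions):
    """Estimate reward and transition models from observed (s, a, r, s') samples by counting."""
    counts = defaultdict(lambda: defaultdict(int))   # counts[(s, a)][s'] = times s' followed (s, a)
    reward_sums = defaultdict(float)                 # total reward observed for (s, a)
    visits = defaultdict(int)                        # number of times (s, a) was tried
    for s, a, r, s_next in transitions:
        counts[(s, a)][s_next] += 1
        reward_sums[(s, a)] += r
        visits[(s, a)] += 1
    r_hat = {(s, a): reward_sums[(s, a)] / visits[(s, a)]
             for s in states for a in actions if visits[(s, a)] > 0}
    p_hat = {(s, a): {s2: n / visits[(s, a)] for s2, n in counts[(s, a)].items()}
             for s in states for a in actions if visits[(s, a)] > 0}
    return r_hat, p_hat

def plan_value_iteration(states, actions, r_hat, p_hat, gamma=0.9, iters=100):
    """Standard value iteration on the estimated model; unvisited (s, a) pairs default to reward 0."""
    def q(s, a, V):
        return r_hat.get((s, a), 0.0) + gamma * sum(p * V[s2] for s2, p in p_hat.get((s, a), {}).items())
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        V = {s: max(q(s, a, V) for a in actions) for s in states}
    pi = {s: max(actions, key=lambda a: q(s, a, V)) for s in states}  # greedy policy on the model
    return V, pi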
4 Model-Free Learning
In model-free learning, we are no longer interested in learning the transition
function and reward function. Instead, we are looking to directly infer the
optimal policy from samples of the world — that is, given that we are in state
s, we want to know the best action a = π∗(s) to take. This makes model-free
learning cheaper and simpler.
To do this, we look to learn the optimal Q-values, defined as
\[
Q^*(s, a) = r(s, a) + \gamma \sum_{s' \in S} p(s' \mid s, a)\, V^*(s'), \quad \forall\, s, a \tag{1}
\]
where V∗(s) is the optimal value function. The value Q∗(s, a) is the value of taking action a in state s and then following the optimal continuation from the next state.
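As a quick numerical check of Equation 1 (all values here are made up for illustration), suppose γ = 0.9, r(s, a) = 1, the action leads to s1 with probability 0.8 and to s2 with probability 0.2, and V∗(s1) = 10, V∗(s2) = 0. Then
\[
Q^*(s, a) = 1 + 0.9\,(0.8 \cdot 10 + 0.2 \cdot 0) = 1 + 7.2 = 8.2.
\]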
By learning this Q-value function, Q∗, we also have the optimal policy, with
\[
\pi^*(s) = \arg\max_{a \in A} Q^*(s, a), \quad \forall\, s \tag{2}
\]
To learn Q∗, we can substitute V∗(s′) = max_{a′ ∈ A} Q∗(s′, a′) on the right-hand side and get an alternate form of the Bellman equations, which states that for an optimal policy π∗,
\[
Q^*(s, a) = r(s, a) + \gamma \sum_{s' \in S} p(s' \mid s, a) \max_{a' \in A} Q^*(s', a'), \quad \forall\, s, a \tag{3}
\]
The question then becomes how we can find the Q values that satisfy the
Bellman equations as written in Equation 3.
There are two ways that we do this. One is “on-policy” (SARSA) and one
is “off-policy” (Q-learning).
5 On-Policy RL: SARSA
Whatever the behavior of the RL agent, a first way to update the Q-values is an on-policy method, which updates toward the value of the action the agent actually takes next.
Given current state s, current action a, reward r, next state s′, and next action a′ (s, a, r, s′, a′, hence the name SARSA), the update is
\[
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma\, Q(s', a') - Q(s, a) \right]
\]
(a code sketch of this update appears after the list below). For the learned Q-values to converge to the optimal Q∗, we need to:
• Decay the learning rate over time, but not too quickly.
• Move from ε-greedy to greedy over time, so that in the limit the policy is greedy; e.g., it can be useful to set ε for a state s to c/N(s), where N(s) is the number of times the state has been visited.
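The sketch below illustrates the SARSA update and the ε = c/N(s) schedule from the bullets above; the helper names, the use of a defaultdict for Q, and the ε-greedy routine are illustrative assumptions.

import random
from collections import defaultdict

Q = defaultdict(float)            # Q-values, default 0 for unseen (s, a) pairs
N = defaultdict(int)              # visit counts per state, for the ε = c / N(s) schedule

def epsilon_for(s, c=1.0):
    """ε schedule from the bullet above: ε(s) = c / N(s), shrinking as s is visited more."""
    N[s] += 1
    return c / N[s]

def epsilon_greedy(Q, s, actions, eps):
    """Explore with probability eps, otherwise act greedily with respect to the current Q."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy TD update: the target uses the next action a_next the agent actually takes."""
    td_target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])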
6 Off-Policy RL: Q-Learning
Given current state s, current action a, reward r, and next state s′, the update in Q-learning is:
\[
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a' \in A} Q(s', a') - Q(s, a) \right]
\]
Note that the target uses the best next action under the current Q-values rather than the action the agent actually takes next; this is what makes Q-learning off-policy (a code sketch appears after the list below). For the learned Q-values to converge to the optimal Q∗, we need to:
• Decay the learning rate over time, but not too quickly.
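A matching sketch of the Q-learning update (same illustrative setup as the SARSA sketch above); the only change is that the target takes a max over next actions instead of using the action actually taken:

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Off-policy TD update: the target uses the best next action under the current Q-values."""
    td_target = r + gamma * max(Q[(s_next, a_next)] for a_next in actions)
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])

# One step of interaction with a hypothetical environment (env.step is a placeholder):
# a = epsilon_greedy(Q, s, actions, eps=epsilon_for(s))
# r, s_next = env.step(s, a)
# q_learning_update(Q, s, a, r, s_next, actions)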
7 Exercise: Model-free RL
Consider the same MDP as before, shown on the following grid.
At each square, we can go left, right, up, or down. Normally we get a reward
of 0 from moving, but if we attempt to move off the grid, we get a reward of
−1 and stay where we are. Also, if we move onto square A, we get a reward
of 10 and are teleported to square B. The discount factor is γ = 0.9.
Suppose an RL agent starts at the top-left square, (0, 0), and follows an ε-greedy policy. At the beginning, suppose Q(s, a) = 0 for all s, a, except we know that we shouldn’t go off the grid, so the values of Q for the corresponding s, a pairs are −1 (i.e., moving off the grid from position (0, 0) to (−1, 0) is disallowed and corresponds to an initialization of Q(s, a) = −1). The learning rate is α = 0.1. With RL, the realized reward for an action will depend on the state, the action, and whether or not the action succeeds.
1. For the first step, suppose ε-greedy tells the agent to explore, and the agent selects right as its action (and this action succeeds). Write the Q-learning update for step one.
2. Now write the SARSA update for step one, assuming that in addition to right in step one, ε-greedy tells the agent to explore and go down in the second step (and this action succeeds).
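For checking answers numerically, here is a sketch of the gridworld dynamics; the grid dimensions and the positions of A and B below are placeholders (substitute the layout from the grid above), and the update functions referenced at the end are the sketches from the SARSA and Q-learning sections.

GRID_W, GRID_H = 4, 3                         # placeholder grid size
A_SQUARE, B_SQUARE = (3, 0), (3, 2)           # placeholder positions of squares A and B
MOVES = {"left": (-1, 0), "right": (1, 0), "up": (0, -1), "down": (0, 1)}

def step(s, action):
    """One move: reward -1 and stay put for leaving the grid, +10 and teleport to B on A, else 0."""
    dx, dy = MOVES[action]
    x, y = s[0] + dx, s[1] + dy
    if not (0 <= x < GRID_W and 0 <= y < GRID_H):
        return -1, s                          # attempted to move off the grid
    if (x, y) == A_SQUARE:
        return 10, B_SQUARE                   # landed on A: reward 10, teleport to B
    return 0, (x, y)

# After initializing Q as in the exercise (0 everywhere, -1 for moves off the grid):
# s = (0, 0); a = "right"
# r, s_next = step(s, a)
# q_learning_update(Q, s, a, r, s_next, list(MOVES))   # part 1
# a_next = "down"
# sarsa_update(Q, s, a, r, s_next, a_next)             # part 2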