
CS 181 Spring 2021 Section 12

Reinforcement Learning

1 Introduction
In the reinforcement learning setting, we don’t have direct access to the transition distribution p(s′|s, a) or the reward function r(s, a); information about these comes to us only through the outcomes of interacting with the environment. This problem is hard because some states can lead to high rewards, but we don’t know which ones; even if we did, we don’t know how to get there!
To deal with this, in lecture, we discussed model-based and model-free reinforcement learning.

2 From Planning to Reinforcement Learning


Recall that MDPs are defined by a set of states, actions, rewards, and transition probabilities {S, A, r, p}, and our goal is to find the policy π∗ that maximizes the expected sum of discounted rewards.
In planning, we are explicitly provided with the model of the environment,
whereas in reinforcement learning, an agent does not have a model of the
environment to begin with. Instead, it must interact with the environment to
learn what its policy should be.

2.1 Concept Question


Which of the following problems would more likely be solved with MDP planning, and which through reinforcement learning?

• Finding the best route to take through a treacherous forest using a map.

• Bringing the new Boston Dynamics robot into a new obstacle course it
has never seen before.

3 Model-based Learning
For model-based learning, we estimate the missing world models, r(s, a) and p(s′|s, a), and then use planning (value or policy iteration) to develop a policy π.

3.1 Concept Question


Can you think of a way to do this in practice? What are some downsides?
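One natural answer, as a minimal sketch: keep counts of observed transitions and running sums of observed rewards, and form maximum-likelihood estimates of p(s′|s, a) and r(s, a). The function name and array layout below are illustrative assumptions, not notation from lecture.

import numpy as np

def estimate_model(transitions, S, A):
    """Maximum-likelihood estimates of p(s'|s, a) and r(s, a) from observed data.

    transitions: list of (s, a, r, s_next) tuples, with states and actions
    encoded as integer indices; S and A are the numbers of states and actions.
    """
    counts = np.zeros((S, A, S))         # N(s, a, s')
    reward_sums = np.zeros((S, A))       # total reward observed for each (s, a)
    for s, a, r, s_next in transitions:
        counts[s, a, s_next] += 1
        reward_sums[s, a] += r

    visits = counts.sum(axis=2)                          # N(s, a)
    p_hat = counts / np.maximum(visits[:, :, None], 1)   # estimated p(s'|s, a)
    r_hat = reward_sums / np.maximum(visits, 1)          # estimated r(s, a)
    return p_hat, r_hat

The estimates p_hat and r_hat can then be handed to value or policy iteration; (s, a) pairs that were never visited get arbitrary estimates, which hints at one of the downsides the question asks about.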

4 Model-Free Learning
In model-free learning, we are no longer interested in learning the transition
function and reward function. Instead, we are looking to directly infer the
optimal policy from samples of the world — that is, given that we are in state
s, we want to know the best action a = π ∗ (s) to take. This makes model-free
learning cheaper and simpler.
To do this, we look to learn the optimal Q-values, defined as

Q∗(s, a) = r(s, a) + γ Σ_{s′∈S} p(s′|s, a) V∗(s′),  ∀s, a        (1)

where V∗(s) is the optimal value function. The value Q∗(s, a) is the value from taking action a in state s and then following the optimal continuation from the next state.
By learning this Q-value function, Q∗, we also have the optimal policy, with

π∗(s) = arg max_{a} Q∗(s, a)        (2)

To learn Q∗, we can substitute V∗(s′) = max_{a′∈A} Q∗(s′, a′) on the right-hand side and get an alternate form of the Bellman equations, which states that for an optimal policy π∗,

Q∗(s, a) = r(s, a) + γ Σ_{s′∈S} p(s′|s, a) max_{a′∈A} Q∗(s′, a′),  ∀s, a        (3)

The question then becomes how we can find the Q values that satisfy the
Bellman equations as written in Equation 3.
There are two ways that we do this. One is “on-policy” (SARSA) and one
is “off-policy” (Q-learning).
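For intuition, if the model p and r were known, the Q values satisfying Equation 3 could be found by fixed-point iteration (often called Q-value iteration); Q-learning below can be read as a sample-based, incremental version of this backup. A minimal sketch, where the helper name and array layout are illustrative assumptions:

import numpy as np

def q_value_iteration(p, r, gamma=0.9, tol=1e-8):
    """Fixed-point iteration on Equation (3), assuming the model is known.

    p: array of shape (S, A, S) with p[s, a, s_next] = p(s'|s, a)
    r: array of shape (S, A) with r[s, a] = r(s, a)
    Returns an (S, A) array approximating Q*.
    """
    Q = np.zeros_like(r, dtype=float)
    while True:
        V = Q.max(axis=1)               # V*(s') = max_{a'} Q*(s', a')
        Q_new = r + gamma * (p @ V)     # Bellman backup for every (s, a)
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new
        Q = Q_new

In reinforcement learning we cannot run this directly because p and r are unknown, which is exactly why the sample-based updates below are needed.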

4.1 Exploration vs. Exploitation


An RL agent also needs to decide how to act in the environment while collecting
observations. This gets to the key issue of exploration vs. exploitation.
In an exploitative approach, when we are in state s, we can simply take action a = arg max_{a∈A} Q(s, a) based on our current estimate of the Q-function.
In an explorative approach, we want to ensure that we have visited enough
states and taken enough actions from those states to get good Q-function
estimates, and this can lead us to prefer to add some randomization to the
behavior of the agent.
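One common way to add that randomization is an ε-greedy rule: with probability ε take a uniformly random action, and otherwise act greedily with respect to the current Q estimates. A minimal sketch (the function name and Q-table layout are illustrative assumptions):

import random

def epsilon_greedy(Q, s, actions, epsilon):
    """With probability epsilon, explore (uniform random action); otherwise exploit.

    Q is assumed to be a dict mapping (state, action) -> current Q estimate.
    """
    if random.random() < epsilon:
        return random.choice(actions)                 # explore
    return max(actions, key=lambda a: Q[(s, a)])      # exploit: arg max_a Q(s, a)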

4.1.1 Concept Question


What would be a problem if our approach were only exploitative? In practice, how might we balance exploration vs. exploitation?

5 On-Policy RL: SARSA
Whatever the behavior of the RL agent, a first way to update the Q-values is
an on-policy method.
Given current state s, current action a, reward r, next state s′, next action a′ (s, a, r, s′, a′, hence the name SARSA), the update is

Q(s, a) ← Q(s, a) + αt [r + γ Q(s′, a′) − Q(s, a)]        (4)

This is known as the SARSA update (State-Action-Reward-State-Action), since we look ahead to get the action a′ = π(s′). Here, αt, with 0 ≤ αt < 1, is the learning rate at update t, and γ is the discount factor.
Here, we are taking the difference between our current Q-value at a state-action pair and the one that we predict using the current reward and the discounted Q-value following the policy. We update the Q(s, a) value in the direction of this difference, which is known as the “temporal difference error.”
Since we follow π in choosing action a′, this gradient method is on-policy. In particular, it learns Q-values that correspond to the behavior of the agent. The SARSA update rule is like taking a stochastic gradient descent step for a single observation, looking to improve our estimate of Q(s, a).
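Putting the SARSA update in Equation 4 together with ε-greedy action selection gives a learning loop roughly like the sketch below; the env.reset()/env.step() interface and the helper names are illustrative assumptions, not a prescribed API.

import random
from collections import defaultdict

def sarsa(env, actions, episodes, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular SARSA: on-policy TD learning of Q(s, a).

    Assumes env.reset() -> state and env.step(action) -> (next_state, reward, done).
    """
    Q = defaultdict(float)                              # Q(s, a), initialized to 0

    def behavior(s):                                    # epsilon-greedy behavior policy
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s = env.reset()
        a = behavior(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = behavior(s_next)                   # the action we will actually take next
            target = r + (0.0 if done else gamma * Q[(s_next, a_next)])
            Q[(s, a)] += alpha * (target - Q[(s, a)])   # Equation 4
            s, a = s_next, a_next
    return Q

The key on-policy detail is that a_next is the action the agent will actually take, so the learned Q-values reflect the agent's own (ε-greedy) behavior.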
Because SARSA is on-policy, it is not guaranteed to converge to the optimal Q-values. In order to converge to the optimal Q-values, SARSA needs to (stated informally):

• Visit every action in every state infinitely often

• Decay the learning rate over time, but not too quickly.¹

• Move from ε-greedy to greedy over time, so that in the limit the policy is greedy; e.g., it can be useful to set ε for a state s to c/N(s), where N(s) is the number of times the state has been visited (see the sketch of such schedules below).
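The visit-count schedules mentioned in the last bullet and in the footnote (ε = c/N(s) and αt = 1/N(s, a)) might look like the following small sketch; the constant c and the bookkeeping of the counts are illustrative assumptions.

from collections import defaultdict

state_visits = defaultdict(int)      # N(s): incremented each time state s is visited
sa_visits = defaultdict(int)         # N(s, a): incremented each time (s, a) is updated

def epsilon_for(s, c=1.0):
    """Exploration rate that decays with visits to s: epsilon = c / N(s)."""
    return min(1.0, c / max(1, state_visits[s]))

def alpha_for(s, a):
    """Learning rate that decays with updates to (s, a): alpha = 1 / N(s, a)."""
    return 1.0 / max(1, sa_visits[(s, a)])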

5.1 Concept Question


What would SARSA learn when following a fixed policy π? What is the tension in reducing the exploration rate ε in ε-greedy when using SARSA to learn the optimal Q∗ values?

¹ For each (s, a) pair, we need Σ_t αt = ∞ for the periods t in which we update Q(s, a) (don’t reduce the learning rate too quickly), and Σ_t αt² < ∞ (eventually the learning rate becomes small). A typical choice is to set the learning rate αt for an update on (s, a) to 1/N(s, a), where N(s, a) is the number of times action a has been taken in state s.

6 Off-Policy RL: Q-Learning

Whatever the behavior of the RL agent, a second way to update the Q-values is an off-policy method.

Given current state s, current action a, reward r, next state s′, the update in Q-learning is:

Q(s, a) ← Q(s, a) + αt [r + γ max_{a′} Q(s′, a′) − Q(s, a)]        (5)

Q-learning uses State-Action-Reward-State from the environment. It is the max over actions a′ that makes this an “off-policy” method. Here, we are taking the difference between our current Q-value for a state-action pair and the one that we predict using the current reward and the discounted Q-value when following the best action from the next state. We update the Q(s, a) value to reduce this difference, which is known as the “temporal difference error.”
We can see that the Q-learning update can be viewed as a stochastic gradient descent step for one observation, looking to find estimates of Q-values that better approximate the Bellman equations.
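A tabular sketch of the corresponding loop, under the same illustrative env.reset()/env.step() interface and ε-greedy behavior assumed in the SARSA sketch above:

import random
from collections import defaultdict

def q_learning(env, actions, episodes, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning: off-policy TD learning of Q(s, a)."""
    Q = defaultdict(float)

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Behavior policy: epsilon-greedy in the current Q estimates.
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])

            s_next, r, done = env.step(a)
            # Target uses the best next action, regardless of what will actually be taken.
            best_next = 0.0 if done else max(Q[(s_next, act)] for act in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])   # Equation 5
            s = s_next
    return Q

The only difference from the SARSA sketch is the max over next actions in the target; the behavior used to collect data is unchanged, which is what makes the method off-policy.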
Because Q-learning is off-policy, it is guaranteed to converge to the optimal Q-values as long as the following is true (stated informally):

• Visit every action in every state infinitely often

• Decay the learning rate over time, but not too quickly.²

6.1 Concept Question


Is the Q-learning update equal to the SARSA learning update in the case that the behavior in SARSA is greedy and not ε-greedy?
² For each (s, a) pair, we need Σ_t αt = ∞ for the periods t in which we update Q(s, a) (don’t reduce the learning rate too quickly), and Σ_t αt² < ∞ (eventually the learning rate becomes small). A typical choice is to set the learning rate αt for an update on (s, a) to 1/N(s, a), where N(s, a) is the number of times action a has been taken in state s.

7 Exercise: Model-free RL
Consider the same MDP, shown on the following grid.

At each square, we can go left, right, up, or down. Normally we get a reward
of 0 from moving, but if we attempt to move off the grid, we get a reward of
−1 and stay where we are. Also, if we move onto square A, we get a reward
of 10 and are teleported to square B. The discount factor is γ = 0.9.
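These dynamics could be sketched as a small environment like the one below; since the grid figure is not reproduced here, the grid dimensions and the positions of squares A and B are placeholder parameters, and the class name and interface are illustrative assumptions.

class GridWorld:
    """Grid MDP sketch: reward 0 for an ordinary move, -1 for trying to move off
    the grid (the agent stays put), and +10 for entering square A, which teleports
    the agent to square B. Discounting (gamma = 0.9) is handled by the learner."""

    MOVES = {"left": (0, -1), "right": (0, 1), "up": (-1, 0), "down": (1, 0)}

    def __init__(self, n_rows, n_cols, a_square, b_square):
        self.n_rows, self.n_cols = n_rows, n_cols
        self.a_square, self.b_square = a_square, b_square
        self.state = (0, 0)                     # start at the top-left square

    def reset(self):
        self.state = (0, 0)
        return self.state

    def step(self, action):
        dr, dc = self.MOVES[action]
        row, col = self.state
        new_row, new_col = row + dr, col + dc
        if not (0 <= new_row < self.n_rows and 0 <= new_col < self.n_cols):
            return self.state, -1, False        # bumped the edge: reward -1, stay put
        if (new_row, new_col) == self.a_square:
            self.state = self.b_square          # entered A: reward +10, teleport to B
            return self.state, 10, False
        self.state = (new_row, new_col)
        return self.state, 0, False             # ordinary move: reward 0

Since the task is continuing, an episode-length cap would be needed before plugging this into the episodic loops sketched earlier.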

Suppose an RL agent starts at the top-left square, (0, 0), and follows an ε-greedy policy. At the beginning, suppose Q(s, a) = 0 for all s, a, except we know that we shouldn’t go off the grid, so the values of Q for the corresponding (s, a) pairs are −1 (i.e., moving off the grid from position (0, 0) to (−1, 0) is disallowed and corresponds to an initialization of Q(s, a) = −1).

The learning rate is α = 0.1. With RL, the realized reward for an action will depend on the state, the action, and whether or not the action succeeds.

1. For the first step, suppose ε-greedy tells the agent to explore, and the agent selects right as its action (and this action succeeds). Write the Q-learning update in step one.

2. Now write the SARSA update for step one, assuming that in addition to right in step one, ε-greedy tells the agent to explore and go down in the second step (and this action succeeds).

3. Are the updates the same? If not, why not?
