
CS 181 Spring 2021 Section 12

Reinforcement Learning

1 Introduction
In the reinforcement learning setting, we don’t have direct access to the transition distribution p(s′|s, a) or the reward function r(s, a); information about these comes to us only through the outcomes of interacting with the environment. This problem is hard because some states can lead to high rewards, but we don’t know which ones; even if we did, we don’t know how to get there!
To deal with this, in lecture, we discussed model-based and model-free reinforcement learning.

2 From Planning to Reinforcement Learning


Recall that MDPs are defined by a set of states, actions, rewards, and transition probabilities {S, A, r, p}, and our goal is to find the policy π∗ that maximizes the expected sum of discounted rewards.
In planning, we are explicitly provided with the model of the environment,
whereas in reinforcement learning, an agent does not have a model of the
environment to begin with. Instead, it must interact with the environment to
learn what its policy should be.

2.1 Concept Question


Which of the following problems would more likely be solved with MDP planning, and which through reinforcement learning?

• Finding the best route to take through a treacherous forest using a map.

• Bringing the new Boston Dynamics robot into a new obstacle course it
has never seen before.

3 Model-based Learning
For model-based learning, we estimate the missing world models, r(s, a) and p(s′|s, a), and then use planning (value or policy iteration) to develop a policy π.

3.1 Concept Question


Can you think of a way to do this in practice? What are some downsides?
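One natural answer, as a minimal sketch: keep counts of observed transitions and running sums of observed rewards, and form maximum-likelihood estimates of p(s′|s, a) and r(s, a). The function name and array layout below are illustrative assumptions, not notation from lecture.

import numpy as np

def estimate_model(transitions, S, A):
    """Maximum-likelihood estimates of p(s'|s, a) and r(s, a) from observed data.

    transitions: list of (s, a, r, s_next) tuples, with states and actions
    encoded as integer indices; S and A are the numbers of states and actions.
    """
    counts = np.zeros((S, A, S))         # N(s, a, s')
    reward_sums = np.zeros((S, A))       # total reward observed for each (s, a)
    for s, a, r, s_next in transitions:
        counts[s, a, s_next] += 1
        reward_sums[s, a] += r

    visits = counts.sum(axis=2)                          # N(s, a)
    p_hat = counts / np.maximum(visits[:, :, None], 1)   # estimated p(s'|s, a)
    r_hat = reward_sums / np.maximum(visits, 1)          # estimated r(s, a)
    return p_hat, r_hat

The estimates p_hat and r_hat can then be handed to value or policy iteration; (s, a) pairs that were never visited get arbitrary estimates, which hints at one of the downsides the question asks about.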

4 Model-Free Learning
In model-free learning, we are no longer interested in learning the transition
function and reward function. Instead, we are looking to directly infer the
optimal policy from samples of the world — that is, given that we are in state
s, we want to know the best action a = π ∗ (s) to take. This makes model-free
learning cheaper and simpler.
To do this, we look to learn the optimal Q-values, defined as

Q∗(s, a) = r(s, a) + γ Σ_{s′∈S} p(s′|s, a) V∗(s′),  ∀s, a        (1)

where V∗(s) is the optimal value function. The value Q∗(s, a) is the value from taking action a in state s and then following the optimal continuation from the next state.
By learning this Q-value function, Q∗, we also have the optimal policy, with

π∗(s) = arg max_{a} Q∗(s, a)        (2)

To learn Q∗, we can substitute V∗(s′) = max_{a′∈A} Q∗(s′, a′) on the right-hand side and get an alternate form of the Bellman equations, which states that for an optimal policy π∗,

Q∗(s, a) = r(s, a) + γ Σ_{s′∈S} p(s′|s, a) max_{a′∈A} Q∗(s′, a′),  ∀s, a        (3)

The question then becomes how we can find the Q values that satisfy the
Bellman equations as written in Equation 3.
There are two ways that we do this. One is “on-policy” (SARSA) and one
is “off-policy” (Q-learning).
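For intuition, if the model p and r were known, the Q values satisfying Equation 3 could be found by fixed-point iteration (often called Q-value iteration); Q-learning below can be read as a sample-based, incremental version of this backup. A minimal sketch, where the helper name and array layout are illustrative assumptions:

import numpy as np

def q_value_iteration(p, r, gamma=0.9, tol=1e-8):
    """Fixed-point iteration on Equation (3), assuming the model is known.

    p: array of shape (S, A, S) with p[s, a, s_next] = p(s'|s, a)
    r: array of shape (S, A) with r[s, a] = r(s, a)
    Returns an (S, A) array approximating Q*.
    """
    Q = np.zeros_like(r, dtype=float)
    while True:
        V = Q.max(axis=1)               # V*(s') = max_{a'} Q*(s', a')
        Q_new = r + gamma * (p @ V)     # Bellman backup for every (s, a)
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new
        Q = Q_new

In reinforcement learning we cannot run this directly because p and r are unknown, which is exactly why the sample-based updates below are needed.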

4.1 Exploration vs. Exploitation


An RL agent also needs to decide how to act in the environment while collecting
observations. This gets to the key issue of exploration vs. exploitation.
In an exploitative approach, when we are in state s, we can simply take action a = arg max_{a∈A} Q(s, a) based on our current estimate of the Q-function.
In an explorative approach, we want to ensure that we have visited enough
states and taken enough actions from those states to get good Q-function
estimates, and this can lead us to prefer to add some randomization to the
behavior of the agent.
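One common way to add that randomization is an ε-greedy rule: with probability ε take a uniformly random action, and otherwise act greedily with respect to the current Q estimates. A minimal sketch (the function name and Q-table layout are illustrative assumptions):

import random

def epsilon_greedy(Q, s, actions, epsilon):
    """With probability epsilon, explore (uniform random action); otherwise exploit.

    Q is assumed to be a dict mapping (state, action) -> current Q estimate.
    """
    if random.random() < epsilon:
        return random.choice(actions)                 # explore
    return max(actions, key=lambda a: Q[(s, a)])      # exploit: arg max_a Q(s, a)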

4.1.1 Concept Question


What would be a problem if our approach were only exploitative? In practice, how might we balance exploration vs. exploitation?

5 On-Policy RL: SARSA
Whatever the behavior of the RL agent, a first way to update the Q-values is
an on-policy method.
Given current state s, current action a, reward r, next state s′, next action a′ (s, a, r, s′, a′, hence the name SARSA), the update is

Q(s, a) ← Q(s, a) + αt [r + γ Q(s′, a′) − Q(s, a)]        (4)

This is known as the SARSA update (State-Action-Reward-State-Action), since we look ahead to get the action a′ = π(s′). Here, αt, with 0 ≤ αt < 1, is the learning rate at update t, and γ is the discount factor.
Here, we are taking the difference between our current Q-value at a state-action pair and the one that we predict using the current reward and the discounted Q-value following the policy. We update the Q(s, a) value in the direction of this difference, which is known as the “temporal difference error.”
Since we follow π in choosing action a′, this gradient method is on-policy. In particular, it learns Q-values that correspond to the behavior of the agent. The SARSA update rule is like taking a stochastic gradient descent step for a single observation, looking to improve our estimate of Q(s, a).
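Putting the SARSA update in Equation 4 together with ε-greedy action selection gives a learning loop roughly like the sketch below; the env.reset()/env.step() interface and the helper names are illustrative assumptions, not a prescribed API.

import random
from collections import defaultdict

def sarsa(env, actions, episodes, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular SARSA: on-policy TD learning of Q(s, a).

    Assumes env.reset() -> state and env.step(action) -> (next_state, reward, done).
    """
    Q = defaultdict(float)                              # Q(s, a), initialized to 0

    def behavior(s):                                    # epsilon-greedy behavior policy
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s = env.reset()
        a = behavior(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = behavior(s_next)                   # the action we will actually take next
            target = r + (0.0 if done else gamma * Q[(s_next, a_next)])
            Q[(s, a)] += alpha * (target - Q[(s, a)])   # Equation 4
            s, a = s_next, a_next
    return Q

The key on-policy detail is that a_next is the action the agent will actually take, so the learned Q-values reflect the agent's own (ε-greedy) behavior.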
Because SARSA is on-policy, it is not guaranteed to converge to the optimal Q-values. In order to converge to the optimal Q-values, SARSA needs to (stated informally):

• Visit every action in every state infinitely often

• Decay the learning rate over time, but not too quickly.¹

• Move from ε-greedy to greedy over time, so that in the limit the policy is greedy; e.g., it can be useful to set ε for a state s to c/N(s), where N(s) is the number of times the state has been visited (see the sketch of such schedules below).
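The visit-count schedules mentioned in the last bullet and in the footnote (ε = c/N(s) and αt = 1/N(s, a)) might look like the following small sketch; the constant c and the bookkeeping of the counts are illustrative assumptions.

from collections import defaultdict

state_visits = defaultdict(int)      # N(s): incremented each time state s is visited
sa_visits = defaultdict(int)         # N(s, a): incremented each time (s, a) is updated

def epsilon_for(s, c=1.0):
    """Exploration rate that decays with visits to s: epsilon = c / N(s)."""
    return min(1.0, c / max(1, state_visits[s]))

def alpha_for(s, a):
    """Learning rate that decays with updates to (s, a): alpha = 1 / N(s, a)."""
    return 1.0 / max(1, sa_visits[(s, a)])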

5.1 Concept Question


What would SARSA learn when following a fixed policy π? What is the tension in reducing the exploration rate ε in ε-greedy when using SARSA to learn the optimal Q∗ values?

¹ For each (s, a) pair, we need Σ_t αt = ∞ for the periods t in which we update Q(s, a) (don’t reduce the learning rate too quickly), and Σ_t αt² < ∞ (eventually the learning rate becomes small). A typical choice is to set the learning rate αt for an update on (s, a) to 1/N(s, a), where N(s, a) is the number of times action a has been taken in state s.

6 Off-Policy RL: Q-Learning

Whatever the behavior of the RL agent, a second way to update the Q-values is an off-policy method.

Given current state s, current action a, reward r, next state s′, the update in Q-learning is:

Q(s, a) ← Q(s, a) + αt [r + γ max_{a′} Q(s′, a′) − Q(s, a)]        (5)

Q-learning uses State-Action-Reward-State from the environment. It is the max over actions a′ that makes this an “off-policy” method. Here, we are taking the difference between our current Q-value for a state-action pair and the one that we predict using the current reward and the discounted Q-value when following the best action from the next state. We update the Q(s, a) value to reduce this difference, which is known as the “temporal difference error.”
We can see that the Q-learning update can be viewed as a stochastic gradient descent step for one observation, looking to find estimates of Q-values that better approximate the Bellman equations.
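A tabular sketch of the corresponding loop, under the same illustrative env.reset()/env.step() interface and ε-greedy behavior assumed in the SARSA sketch above:

import random
from collections import defaultdict

def q_learning(env, actions, episodes, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning: off-policy TD learning of Q(s, a)."""
    Q = defaultdict(float)

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Behavior policy: epsilon-greedy in the current Q estimates.
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])

            s_next, r, done = env.step(a)
            # Target uses the best next action, regardless of what will actually be taken.
            best_next = 0.0 if done else max(Q[(s_next, act)] for act in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])   # Equation 5
            s = s_next
    return Q

The only difference from the SARSA sketch is the max over next actions in the target; the behavior used to collect data is unchanged, which is what makes the method off-policy.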
Because Q-learning is off-policy, it is guaranteed to converge to the optimal Q-values as long as the following is true (stated informally):

• Visit every action in every state infinitely often

• Decay the learning rate over time, but not too quickly.²

6.1 Concept Question


Is the Q-learning update equal to the SARSA learning update in the case that the behavior in SARSA is greedy and not ε-greedy?
² For each (s, a) pair, we need Σ_t αt = ∞ for the periods t in which we update Q(s, a) (don’t reduce the learning rate too quickly), and Σ_t αt² < ∞ (eventually the learning rate becomes small). A typical choice is to set the learning rate αt for an update on (s, a) to 1/N(s, a), where N(s, a) is the number of times action a has been taken in state s.

7 Exercise: Model-free RL
Consider the same MDP, shown on the following grid.

At each square, we can go left, right, up, or down. Normally we get a reward
of 0 from moving, but if we attempt to move off the grid, we get a reward of
−1 and stay where we are. Also, if we move onto square A, we get a reward
of 10 and are teleported to square B. The discount factor is γ = 0.9.
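These dynamics could be sketched as a small environment like the one below; since the grid figure is not reproduced here, the grid dimensions and the positions of squares A and B are placeholder parameters, and the class name and interface are illustrative assumptions.

class GridWorld:
    """Grid MDP sketch: reward 0 for an ordinary move, -1 for trying to move off
    the grid (the agent stays put), and +10 for entering square A, which teleports
    the agent to square B. Discounting (gamma = 0.9) is handled by the learner."""

    MOVES = {"left": (0, -1), "right": (0, 1), "up": (-1, 0), "down": (1, 0)}

    def __init__(self, n_rows, n_cols, a_square, b_square):
        self.n_rows, self.n_cols = n_rows, n_cols
        self.a_square, self.b_square = a_square, b_square
        self.state = (0, 0)                     # start at the top-left square

    def reset(self):
        self.state = (0, 0)
        return self.state

    def step(self, action):
        dr, dc = self.MOVES[action]
        row, col = self.state
        new_row, new_col = row + dr, col + dc
        if not (0 <= new_row < self.n_rows and 0 <= new_col < self.n_cols):
            return self.state, -1, False        # bumped the edge: reward -1, stay put
        if (new_row, new_col) == self.a_square:
            self.state = self.b_square          # entered A: reward +10, teleport to B
            return self.state, 10, False
        self.state = (new_row, new_col)
        return self.state, 0, False             # ordinary move: reward 0

Since the task is continuing, an episode-length cap would be needed before plugging this into the episodic loops sketched earlier.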

Suppose an RL agent starts at the top-left square, (0, 0), and follows an ε-greedy policy. At the beginning, suppose Q(s, a) = 0 for all s, a, except we know that we shouldn’t go off the grid, so the values of Q for the corresponding (s, a) pairs are −1 (i.e., moving off the grid from position (0, 0) to (−1, 0) is disallowed and corresponds to an initialization of Q(s, a) = −1).

The learning rate is α = 0.1. With RL, the realized reward for an action will depend on the state, the action, and whether or not the action succeeds.

1. For the first step, suppose ε-greedy tells the agent to explore, and the agent selects right as its action (and this action succeeds). Write the Q-learning update in step one.

2. Now write the SARSA update for step one, assuming that in addition to right in step one, ε-greedy tells the agent to explore and go down in the second step (and this action succeeds).

3. Are the updates the same? If not, why not?
