This document provides an introduction to reinforcement learning concepts, including:
1. The goal of reinforcement learning: learning behaviors or policies that maximize expected return through interaction with an environment.
2. The general anatomy of reinforcement learning algorithms: estimating returns or value functions from samples generated by running a policy, then improving the policy based on those estimates.
3. The main families of reinforcement learning algorithms (value-based methods, policy gradients, actor-critic methods, and model-based RL), which make different assumptions and vary in sample efficiency and stability.


Introduction to Reinforcement Learning

CS 285
Instructor: Sergey Levine
UC Berkeley
Definitions
Terminology & notation

The agent receives an observation (in the running example, an image of a tiger) and must choose among discrete actions:
1. run away
2. ignore
3. pet
Notation: s_t (state), o_t (observation), a_t (action), and a policy π_θ(a_t | o_t), or π_θ(a_t | s_t) in the fully observed case.
Imitation Learning

training data → supervised learning → policy π_θ(a_t | o_t)

Images: Bojarski et al. ‘16, NVIDIA


Reward functions

The reward function r(s, a) tells us which states and actions are better; the agent must figure out what to do now to get high reward later.

Definitions

Andrey Markov
Richard Bellman
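The formal definitions on these slides appear only as images in the source; a standard reconstruction (notation assumed here, consistent with the rest of the lecture) is:

\text{Markov chain: } \mathcal{M} = \{\mathcal{S}, \mathcal{T}\}, \quad \mathcal{T} \text{ encodes } p(s_{t+1} \mid s_t)
\text{Markov decision process: } \mathcal{M} = \{\mathcal{S}, \mathcal{A}, \mathcal{T}, r\}, \quad p(s_{t+1} \mid s_t, a_t), \; r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}
\text{Partially observed MDP: } \mathcal{M} = \{\mathcal{S}, \mathcal{A}, \mathcal{O}, \mathcal{T}, \mathcal{E}, r\}, \quad \mathcal{E} \text{ encodes } p(o_t \mid s_t)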
The goal of reinforcement learning

we'll come back to partially observed later
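The objective itself is an image in the source; in standard notation (assumed here), the trajectory distribution and the goal of RL are:

p_\theta(\tau) = p_\theta(s_1, a_1, \ldots, s_T, a_T) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)
\theta^\star = \arg\max_\theta \; \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \sum_{t} r(s_t, a_t) \right]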
Finite horizon case: state-action marginal

Infinite horizon case: stationary distribution
stationary = the same before and after transition
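The corresponding formulas are images in the source; a standard reconstruction (same assumed notation as above) is:

\text{Finite horizon: } \theta^\star = \arg\max_\theta \sum_{t=1}^{T} \mathbb{E}_{(s_t, a_t) \sim p_\theta(s_t, a_t)} \left[ r(s_t, a_t) \right]
\text{Stationary distribution: } \mu = p_\theta(s, a) \text{ with } \mu = \mathcal{T}\mu, \text{ i.e. an eigenvector of the transition operator with eigenvalue } 1
\text{Infinite horizon: } \theta^\star = \arg\max_\theta \; \mathbb{E}_{(s, a) \sim p_\theta(s, a)} \left[ r(s, a) \right]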
Expectations and stochastic systems

The same objective can be written for the infinite horizon case (via the stationary distribution) and the finite horizon case (via the per-timestep state-action marginals).

In RL, we almost always care about expectations: even if the reward itself is discontinuous (e.g., +1 for staying on the road, -1 for falling off the cliff), its expectation under a stochastic policy is a smooth function of the policy parameters.
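A small worked example (illustrative, not from the slides) of why expectations help: suppose the reward is +1 on the road and -1 off the cliff, and the policy falls with probability ψ. The reward is discontinuous in the outcome, but its expectation is smooth in ψ:

\mathbb{E}[r] = (+1)(1 - \psi) + (-1)\,\psi = 1 - 2\psi, \qquad \frac{d}{d\psi}\mathbb{E}[r] = -2 \quad \text{(well-defined everywhere, so gradient-based optimization applies)}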
Algorithms
The anatomy of a reinforcement learning algorithm

A loop of three repeating parts:
1. generate samples (i.e. run the policy)
2. fit a model / estimate the return
3. improve the policy
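As a hedged illustration of this loop (not code from the course), here is a minimal Python sketch; env, policy, estimate_returns, and update_policy are hypothetical interfaces assumed for the example:

def rl_training_loop(env, policy, estimate_returns, update_policy,
                     n_iterations=100, rollouts_per_iter=10):
    """Generic anatomy of an RL algorithm: sample, estimate, improve."""
    for _ in range(n_iterations):
        # 1. generate samples (i.e. run the policy)
        trajectories = []
        for _ in range(rollouts_per_iter):
            state, trajectory, done = env.reset(), [], False
            while not done:
                action = policy.sample_action(state)
                next_state, reward, done = env.step(action)
                trajectory.append((state, action, reward))
                state = next_state
            trajectories.append(trajectory)
        # 2. fit a model / estimate the return
        estimates = estimate_returns(trajectories)
        # 3. improve the policy
        update_policy(policy, trajectories, estimates)
    return policy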


A simple example

The same loop with a simple choice in each box, e.g.:
• generate samples: roll out the current policy to collect trajectories
• estimate the return: sum the rewards along each trajectory
• improve the policy: take a gradient step on the policy parameters


Another example: RL by backprop

• generate samples: run the policy to collect transitions
• fit a model: learn dynamics f_φ such that f_φ(s_t, a_t) ≈ s_{t+1}
• improve the policy: backpropagate through f_φ and r into the policy parameters
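A minimal sketch of the "RL by backprop" idea using JAX autodiff; the linear-tanh policy, linear dynamics model, and quadratic reward below are hypothetical stand-ins chosen so the example runs, not the lecture's actual models:

import jax
import jax.numpy as jnp

# Hypothetical stand-ins for illustration: a linear-tanh policy, a learned
# linear dynamics model f_phi(s, a) = A s + B a, and a simple quadratic reward.
def policy(theta, s):
    return jnp.tanh(theta @ s)

def dynamics_model(phi, s, a):
    return phi["A"] @ s + phi["B"] @ a

def reward_fn(s, a):
    return -jnp.sum(s ** 2) - 0.1 * jnp.sum(a ** 2)

def unrolled_return(theta, phi, s0, horizon=20):
    """Roll the differentiable learned model forward under the policy and sum rewards."""
    s, total = s0, 0.0
    for _ in range(horizon):
        a = policy(theta, s)
        total = total + reward_fn(s, a)
        s = dynamics_model(phi, s, a)
    return total

# Improve the policy: backpropagate through the learned dynamics and reward
# into the policy parameters, then take a gradient ascent step.
policy_grad = jax.grad(unrolled_return, argnums=0)

def policy_update(theta, phi, s0, learning_rate=1e-2):
    return theta + learning_rate * policy_grad(theta, phi, s0)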


Which parts are expensive?

• generate samples (i.e. run the policy): on a real robot/car/power grid/whatever this is expensive and runs at 1x real time (until we invent time travel); the MuJoCo simulator can run at up to 10000x real time
• fit a model / estimate the return: can be trivial and fast (e.g., summing rewards) or costly (e.g., fitting a neural network), depending on the algorithm
• improve the policy: likewise depends on the algorithm


Value Functions
How do we deal with all these expectations?

The objective is a nested expectation over states and actions; what if we knew the inner part, i.e. the expected total reward starting from a given state and action?


Definition: Q-function

Definition: value function
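The formulas for these definitions are images in the source; in standard notation they are:

Q^\pi(s_t, a_t) = \sum_{t'=t}^{T} \mathbb{E}_{\pi}\big[ r(s_{t'}, a_{t'}) \mid s_t, a_t \big] \quad \text{(total expected reward from taking } a_t \text{ in } s_t \text{, then following } \pi\text{)}
V^\pi(s_t) = \mathbb{E}_{a_t \sim \pi(a_t \mid s_t)}\big[ Q^\pi(s_t, a_t) \big] \quad \text{(total expected reward from } s_t\text{)}
\mathbb{E}_{s_1 \sim p(s_1)}\big[ V^\pi(s_1) \big] \text{ is the RL objective}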


Using Q-functions and value functions
If we know Q^π for the current policy π, we can improve the policy, e.g. by picking the argmax action in each state, or by increasing the probability of actions with Q^π(s, a) > V^π(s), which are better than average.
The anatomy of a reinforcement learning algorithm

1. generate samples (i.e. run the policy)
2. fit a model / estimate the return (this step often uses Q-functions or value functions)
3. improve the policy


Types of Algorithms
Types of RL algorithms

• Policy gradients: directly differentiate the above objective
• Value-based: estimate the value function or Q-function of the optimal policy (no explicit policy)
• Actor-critic: estimate the value function or Q-function of the current policy, use it to improve the policy
• Model-based RL: estimate the transition model, and then…
  • Use it for planning (no explicit policy)
  • Use it to improve a policy
  • Something else
Model-based RL algorithms

1. generate samples (i.e. run the policy)
2. fit a model: estimate the transition model p(s_{t+1} | s_t, a_t)
3. improve the policy (several options, listed on the next slide)


Model-based RL algorithms

Options for improving the policy:
1. Just use the model to plan (no policy; see the planning sketch after this list)
   • Trajectory optimization/optimal control (primarily in continuous spaces): essentially backpropagation to optimize over actions
   • Discrete planning in discrete action spaces, e.g., Monte Carlo tree search
2. Backpropagate gradients into the policy
   • Requires some tricks to make it work
3. Use the model to learn a value function
   • Dynamic programming
   • Generate simulated experience for a model-free learner
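As a concrete sketch of option 1 above: a minimal random-shooting planner in Python (model-predictive-control style). The dynamics and reward callables stand in for a learned model and are assumptions for illustration, not the lecture's method:

import random

def plan_with_model(dynamics, reward, s0, horizon=10, n_candidates=1000,
                    action_low=-1.0, action_high=1.0):
    """Random shooting: sample candidate action sequences, roll each through the
    learned model, and return the sequence with the highest predicted return."""
    best_sequence, best_return = None, float("-inf")
    for _ in range(n_candidates):
        sequence = [random.uniform(action_low, action_high) for _ in range(horizon)]
        s, total = s0, 0.0
        for a in sequence:
            total += reward(s, a)
            s = dynamics(s, a)  # predicted next state from the learned model
        if total > best_return:
            best_sequence, best_return = sequence, total
    return best_sequence

# In MPC fashion, one would execute only the first planned action in the real
# environment, observe the new state, and replan from there.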
Value function based algorithms

1. generate samples (i.e. run the policy)
2. estimate the return: fit V(s) or Q(s, a)
3. improve the policy: set π(s) = argmax_a Q(s, a)
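A minimal sketch of this recipe (fit a Q-function, act greedily with respect to it): tabular Q-learning in Python. The env interface (reset/step/n_actions) is a hypothetical assumption for illustration:

import random
from collections import defaultdict

def tabular_q_learning(env, n_episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Fit Q(s, a) from samples; the policy is implicit: pi(s) = argmax_a Q(s, a)."""
    Q = defaultdict(float)                 # Q[(s, a)] -> estimated return
    actions = list(range(env.n_actions))
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy exploration around the greedy (argmax) policy
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            s_next, r, done = env.step(a)
            # Bellman backup: move Q(s, a) toward r + gamma * max_a' Q(s', a')
            target = r if done else r + gamma * max(Q[(s_next, act)] for act in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next

    def greedy_policy(state):
        return max(actions, key=lambda act: Q[(state, act)])

    return Q, greedy_policy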


Direct policy gradients

1. generate samples (i.e. run the policy)
2. estimate the return: evaluate the returns R(τ) = Σ_t r(s_t, a_t) of the sampled trajectories
3. improve the policy: gradient ascent on the policy parameters, θ ← θ + α ∇_θ E[Σ_t r(s_t, a_t)]
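For reference (a standard result, covered in detail later in the course), directly differentiating the objective yields the REINFORCE estimator and the corresponding gradient ascent update:

\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \left( \sum_{t} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \right) \left( \sum_{t} r(s_{i,t}, a_{i,t}) \right)
\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)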


Actor-critic: value functions + policy gradients

1. generate samples (i.e. run the policy)
2. estimate the return: fit V(s) or Q(s, a) for the current policy
3. improve the policy: take a policy gradient step that uses the fitted value function (the critic)
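Schematically (a standard formulation consistent with this slide, not a verbatim reconstruction), the critic's value estimates enter the policy gradient through the advantage:

A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t)
\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \, A^\pi(s_{i,t}, a_{i,t})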


Tradeoffs Between Algorithms
Why so many RL algorithms?
• Different tradeoffs
  • Sample efficiency
  • Stability & ease of use
• Different assumptions
  • Stochastic or deterministic?
  • Continuous or discrete?
  • Episodic or infinite horizon?
• Different things are easy or hard in different settings
  • Easier to represent the policy?
  • Easier to represent the model?
Comparison: sample efficiency
• Sample efficiency = how many samples do we need to get a good policy?
• Most important question: is the algorithm off policy?
  • Off policy: able to improve the policy without generating new samples from that policy
  • On policy: each time the policy is changed, even a little bit (just one gradient step), we need to generate new samples
Comparison: sample efficiency

A rough spectrum, from more efficient (fewer samples; off-policy) to less efficient (more samples; on-policy):
model-based shallow RL → model-based deep RL → off-policy Q-function learning → actor-critic style methods → on-policy policy gradient algorithms → evolutionary or gradient-free algorithms

Why would we use a less efficient algorithm?
Wall clock time is not the same as efficiency!
Comparison: stability and ease of use
• Does it converge?
• And if it converges, to what?
• And does it converge every time?

Why is any of this even a question???


• Supervised learning: almost always gradient descent
• Reinforcement learning: often not gradient descent
  • Q-learning: fixed point iteration
  • Model-based RL: the model is not optimized for expected reward
  • Policy gradient: is gradient descent, but also often the least efficient!
Comparison: stability and ease of use
• Value function fitting
  • At best, minimizes error of fit ("Bellman error")
    • Not the same as expected reward
  • At worst, doesn't optimize anything
    • Many popular deep RL value fitting algorithms are not guaranteed to converge to anything in the nonlinear case
• Model-based RL
  • Model minimizes error of fit
    • This will converge
  • No guarantee that a better model = a better policy
• Policy gradient
  • The only one that actually performs gradient descent (ascent) on the true objective
Comparison: assumptions
• Common assumption #1: full observability
  • Generally assumed by value function fitting methods
  • Can be mitigated by adding recurrence
• Common assumption #2: episodic learning
  • Often assumed by pure policy gradient methods
  • Assumed by some model-based RL methods
• Common assumption #3: continuity or smoothness
  • Assumed by some continuous value function learning methods
  • Often assumed by some model-based RL methods
Examples of Algorithms
Examples of specific algorithms
• Value function fitting methods
  • Q-learning, DQN
  • Temporal difference learning
  • Fitted value iteration
• Policy gradient methods
  • REINFORCE
  • Natural policy gradient
  • Trust region policy optimization
• Actor-critic algorithms
  • Asynchronous advantage actor-critic (A3C)
  • Soft actor-critic (SAC)
• Model-based RL algorithms
  • Dyna
  • Guided policy search

We'll learn about most of these in the next few weeks!
Example 1: Atari games with Q-functions

• Playing Atari with deep reinforcement learning, Mnih et al. '13
• Q-learning with convolutional neural networks
Example 2: robots and model-based RL

• End-to-end training of deep visuomotor policies, Levine*, Finn* '16
• Guided policy search (model-based RL) for image-based robotic manipulation
Example 3: walking with policy gradients

• High-dimensional continuous control with generalized advantage estimation, Schulman et al. '16
• Trust region policy optimization with value function approximation
Example 4: robotic grasping with Q-functions

• QT-Opt, Kalashnikov et al. '18
• Q-learning from images for real-world robotic grasping
