
Week 7: Model-based Reinforcement Learning

Bolei Zhou

UCLA

November 10, 2023

Bolei Zhou CS260R Reinforcement Learning November 10, 2023 1 / 56


Recap of Last Week on Policy Optimization SOTAs
1 Policy Gradient→TRPO→ACKTR→PPO
1 Stochastic policy that outputs a probability distribution over discrete actions
2 Start with policy gradient and importance sampling for off-policy
learning
2 Q-learning→DDPG→TD3
1 Deterministic policy (like a regression function)
2 Start with Bellman equation, which doesn’t care which transition tuples
are used, or how the actions were selected, or what happens after a
given transition
3 Optimal Q-function should satisfy the Bellman equation for all possible
transitions, so it is very easy to enable off-policy learning
3 SAC
1 SAC optimizes a stochastic policy in an off-policy way, which unifies
stochastic policy optimization and DDPG-style approaches
2 An off-policy method with high sample efficiency; it incorporates the
clipped double-Q trick like TD3, and due to the inherent stochasticity of
the policy in SAC, it also winds up benefiting from something like target
policy smoothing.
Bolei Zhou CS260R Reinforcement Learning November 10, 2023 2 / 56
Recap of Last Week on Policy Optimization SOTAs

1 Great implementations of SOTA methods are ready for your course


project and research project!
2 SpinningUp: Nice implementations and summary of the algorithms
from OpenAI
1 https://spinningup.openai.com/

3 Stable-baseline3 in PyTorch:
1 https://github.com/DLR-RM/stable-baselines3
4 CleanRL:
1 https://github.com/vwxyzjn/cleanrl
2 High-quality single file implementation of deep RL algorithms

Bolei Zhou CS260R Reinforcement Learning November 10, 2023 3 / 56


Plan for the rest of the quarter

1 Previous lectures: Value-based RL, policy-based RL, Policy


optimization SOTA
2 Other topics in RL:
1 Model-based RL
2 Imitation learning
3 Distributed ML system
4 Offline RL and more
5 RL theory
6 Environment and reward function
7 LLM + RL
3 Doing RL research: switch to a track that is less crowded, or establish a
new setting as a new track, and succeed!

Bolei Zhou CS260R Reinforcement Learning November 10, 2023 4 / 56


This Week’s Plan

1 Today
1 Introduction of model-based reinforcement learning
2 Model-based value optimization
3 Model-based policy optimization
4 Case studies on robot object manipulation and learning world models
from images
2 Thursday: Optimal Control and RL

Bolei Zhou CS260R Reinforcement Learning November 10, 2023 5 / 56


Model-based Reinforcement Learning

1 Previous lectures on model-free RL


1 Learning policy directly from experiences through policy gradient
2 Learning value function through MC or TD
2 This lecture will be on model-based RL
1 Learning model of the environment from experience

Bolei Zhou CS260R Reinforcement Learning November 10, 2023 6 / 56


Model-based and Model-free RL

1 Model-free RL
1 No model
2 Learn value/policy functions from experience
2 Model-based RL
1 Besides learning a policy function or value function from experience,
also learn a model of the environment from experience
2 Plan value/policy functions from model

Bolei Zhou CS260R Reinforcement Learning November 10, 2023 7 / 56


Building a Model of the Environment
1 Diagram of model-free reinforcement learning

2 Diagram of model-based reinforcement learning

Bolei Zhou CS260R Reinforcement Learning November 10, 2023 8 / 56


Modeling the Environment for Planning

1 Plan to better interact with the real environment

Bolei Zhou CS260R Reinforcement Learning November 10, 2023 9 / 56


Modeling the Environment for Planning

1 Planning is the computational process that takes a model as input


and produces or improves a policy by interacting with the modeled
environment
experience —learning→ model —planning→ better policy

2 State-space planning: search through the state space for an optimal


policy or an optimal path to a goal
3 Model-based value optimization methods share a common structure
model → simulated trajectories —backups→ values → policy

4 Model-based policy optimization methods have a simpler structure as

model → policy

Bolei Zhou CS260R Reinforcement Learning November 10, 2023 10 / 56


Structure of the Model-based RL

1 Relationships among learning, planning and acting

2 Two roles of the real experience:


1 Improve the value and policy directly using the previously introduced methods
2 Improve the model to match the real environment more accurately
(a predictive model of the environment): p(s_{t+1} | s_t, a_t), R(s_t, a_t)

Bolei Zhou CS260R Reinforcement Learning November 10, 2023 11 / 56


Advantage of Model-based RL

1 Pros: Higher sample efficiency

1 Sample-efficient learning is crucial for real-world RL applications such


as robotics
(DARPA robotics challenge failures)
2 Model can be learned efficiently by supervised learning methods

Bolei Zhou CS260R Reinforcement Learning November 10, 2023 12 / 56


Advantage of Model-based RL

1 Pros: Higher sample efficiency

2 Cons:
1 First learning a model then constructing a value function or policy
function leads to two sources of approximation error
2 It is difficult to guarantee convergence

Bolei Zhou CS260R Reinforcement Learning November 10, 2023 13 / 56


What is a Model

1 A model M is a representation of an MDP parameterized by η


2 Usually a model M = (P, R) represents state transitions and rewards

S_{t+1} ∼ P_η(S_{t+1} | S_t, A_t)
R_{t+1} = R_η(R_{t+1} | S_t, A_t)

3 Typically we assume conditional independence between state


transitions and rewards

P(S_{t+1}, R_{t+1} | S_t, A_t) = P(S_{t+1} | S_t, A_t) P(R_{t+1} | S_t, A_t)

Bolei Zhou CS260R Reinforcement Learning November 10, 2023 14 / 56


Sometimes it is easy to access the model

1 Known models: in the game of Go, the rules of the game are the model

2 Physics models: vehicle dynamics model and the kinematic bicycle model

Bolei Zhou CS260R Reinforcement Learning November 10, 2023 15 / 56


Today’s Plan

1 Intro on model-based reinforcement learning


2 Model-based value optimization
3 Model-based policy optimization
4 Case studies on robot object manipulation and learning world models
from images

Bolei Zhou CS260R Reinforcement Learning November 10, 2023 16 / 56


Learning the Model

1 Goal: learn model M_η from experience {S_1, A_1, R_2, ..., S_T}


1 So consider it as a supervised learning problem

S_1, A_1 → R_2, S_2
S_2, A_2 → R_3, S_3
...
S_{T−1}, A_{T−1} → R_T, S_T

2 Learning s, a → r is a regression problem

3 Learning s, a → s' is a density estimation problem
4 Pick a loss function, e.g., mean-squared error or KL divergence, then
optimize the model parameters to minimize the empirical loss
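
To make the supervised-learning view concrete, here is a minimal sketch (not from the slides; the network architecture, hyperparameters, and function names are assumptions) that fits a small neural network to predict the reward and next state from (s, a) with a mean-squared-error loss:

```python
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    """Predict (s', r) from (s, a); trained by plain supervised regression."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.next_state_head = nn.Linear(hidden, state_dim)  # s, a -> s'
        self.reward_head = nn.Linear(hidden, 1)              # s, a -> r

    def forward(self, s, a):
        h = self.body(torch.cat([s, a], dim=-1))
        return self.next_state_head(h), self.reward_head(h).squeeze(-1)

def fit_model(model, states, actions, rewards, next_states, epochs=100, lr=1e-3):
    """Minimize the empirical MSE loss over a batch of observed transitions."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        pred_s, pred_r = model(states, actions)
        loss = ((pred_s - next_states) ** 2).mean() + ((pred_r - rewards) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```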

Bolei Zhou CS260R Reinforcement Learning November 10, 2023 17 / 56


Examples of Models for the World Model

1 Table Lookup Model


2 Linear Expectation Model
3 Linear Gaussian Model
4 Gaussian Process Model
5 Deep Belief Network Model ...

Bolei Zhou CS260R Reinforcement Learning November 10, 2023 18 / 56


Table Lookup Model

1 Model is an explicit MDP, P̂ and R̂


2 Count visits N(s, a) to each state-action pair

P̂^a_{s,s'} = (1 / N(s, a)) Σ_{t=1}^{T} 1(S_t = s, A_t = a, S_{t+1} = s')

R̂^a_s = (1 / N(s, a)) Σ_{t=1}^{T} 1(S_t = s, A_t = a) R_t
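
A minimal code sketch of such a table lookup model (illustrative only; the class and method names are my own): count N(s, a) and N(s, a, s') over the observed transitions, then read off the empirical transition probabilities and mean rewards.

```python
import random
from collections import defaultdict

class TableLookupModel:
    """Empirical MDP model: P_hat and R_hat estimated by counting visits."""
    def __init__(self):
        self.counts = defaultdict(int)          # N(s, a)
        self.next_counts = defaultdict(int)     # N(s, a, s')
        self.reward_sums = defaultdict(float)   # sum of rewards observed after (s, a)

    def update(self, s, a, r, s_next):
        self.counts[(s, a)] += 1
        self.next_counts[(s, a, s_next)] += 1
        self.reward_sums[(s, a)] += r

    def transition_prob(self, s, a, s_next):
        return self.next_counts[(s, a, s_next)] / self.counts[(s, a)]

    def expected_reward(self, s, a):
        return self.reward_sums[(s, a)] / self.counts[(s, a)]

    def sample(self, s, a):
        # Sample a next state from the empirical distribution P_hat(. | s, a).
        candidates = [(sp, c) for (ss, aa, sp), c in self.next_counts.items()
                      if ss == s and aa == a]
        states, weights = zip(*candidates)
        s_next = random.choices(states, weights=weights)[0]
        return s_next, self.expected_reward(s, a)
```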

Bolei Zhou CS260R Reinforcement Learning November 10, 2023 19 / 56


Example of AB

1 Two states A and B; no discounting;


2 Observed 8 episodes of experience:
1 (State, Reward, Next State, Next Reward...)
2 (A, 0, B, 0), (B, 1), (B, 1), (B, 1), (B, 1), (B, 1), (B, 1), (B, 0)
3 So we estimate a table lookup model from this experience as follows

Bolei Zhou CS260R Reinforcement Learning November 10, 2023 20 / 56


Sample-Based Planning

1 A simple but sample-efficient approach to planning


2 Use the model only to generate samples
3 General procedure:
1 Sample experience from the model

S_{t+1} ∼ P_η(S_{t+1} | S_t, A_t)
R_{t+1} = R_η(R_{t+1} | S_t, A_t)

2 Apply model-free RL to sampled experiences:


1 Monte-Carlo control
2 Sarsa
3 Q-learning
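
A minimal sketch of this procedure (illustrative; it assumes a learned model object exposing a sample(s, a) method, e.g. the table lookup sketch above): draw simulated transitions from the model and apply a tabular Q-learning update to each one.

```python
import random
from collections import defaultdict

def plan_with_model(model, states, actions, n_updates=1000, alpha=0.1, gamma=0.95):
    Q = defaultdict(float)                     # tabular Q(s, a), default 0
    for _ in range(n_updates):
        s = random.choice(states)              # a previously visited state
        a = random.choice(actions)             # a previously taken action
        s_next, r = model.sample(s, a)         # simulated experience from the model
        best_next = max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q
```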

Bolei Zhou CS260R Reinforcement Learning November 10, 2023 21 / 56


Sample-Based Planning for AB Example

1 Observed 8 episodes of experience in the format of (State, Reward,


Next State, Next Reward...)
1 (A, 0, B, 0), (B, 1), (B, 1), (B, 1), (B, 1), (B, 1), (B, 1), (B, 0)
2 Construct the model

3 Sample experience from the model


1 (B, 1), (B, 0), (B, 1), (A, 0, B, 1), (B, 1), (A, 0, B, 1), (B, 1), (B, 0)
4 Monte-Carlo Learning on the sampled experience
1 V(A) = 1, V(B) = 0.75 (both sampled episodes from A return 0 + 1 = 1;
6 of the 8 sampled visits to B yield reward 1)

Bolei Zhou CS260R Reinforcement Learning November 10, 2023 22 / 56


Planning with an Inaccurate Model

1 Given an imperfect model ⟨P_η, R_η⟩ ≠ ⟨P, R⟩
2 Performance of model-based RL is limited to the optimal policy for the
approximate MDP ⟨S, A, P_η, R_η⟩
1 Model-based RL is only as good as the estimated model
3 When the model is inaccurate, the planning process will compute a
suboptimal policy
4 Possible solutions:
1 When the accuracy of the model is low, use model-free RL
2 Reason explicitly about the model uncertainty (how confident we are
about the estimated state): use a probabilistic model such as Bayesian
methods or Gaussian Processes

Bolei Zhou CS260R Reinforcement Learning November 10, 2023 23 / 56


Real and Simulated Experience

1 We now have two sources of experience


2 Real experience: sampled from the environment (true MDP)

S' ∼ P^a_{s,s'}
R = R^a_s

3 Simulated experience: sampled from the model (approximate MDP)

Ŝ' ∼ P_η(S' | S, A)
R̂ = R_η(R | S, A)

Bolei Zhou CS260R Reinforcement Learning November 10, 2023 24 / 56


Integrating Learning and Planning

1 Model-free RL
1 No model
2 Learn value function (and/or policy) from real experience
2 Model-based RL (using Sample-based Planning)
1 Learn a model from real experience
2 Plan value function (and/or policy) from simulated experience
3 Dyna developed by RS Sutton 1991
1 Learn a model from real experience
2 Learn and plan value function (and/or policy) from both real and
simulated experience

Bolei Zhou CS260R Reinforcement Learning November 10, 2023 25 / 56


Dyna for Integrating Learning, Planning, and Reacting

1 Architecture of Dyna

2 By Richard Sutton. ACM SIGART Bulletin 1991


3 Chapter 8 of the Textbook

Bolei Zhou CS260R Reinforcement Learning November 10, 2023 26 / 56


Algorithm of Dyna

1 Combining direct RL, model learning, and planning together
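
A minimal sketch of tabular Dyna-Q following this structure (the env interface with reset()/step() returning (next_state, reward, done) and the hyperparameters are assumptions; a deterministic environment is assumed so the model can store the last observed outcome of each (s, a)):

```python
import random
from collections import defaultdict

def dyna_q(env, actions, episodes=50, n_planning=10, alpha=0.1, gamma=0.95, eps=0.1):
    Q = defaultdict(float)
    model = {}                                   # (s, a) -> (r, s', done)
    for _ in range(episodes):
        s, done = env.reset(), False             # assumed env interface
        while not done:
            # (a) act eps-greedily in the real environment
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda b: Q[(s, b)])
            s_next, r, done = env.step(a)
            # (b) direct RL update from real experience
            target = r + gamma * max(Q[(s_next, b)] for b in actions) * (not done)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            # (c) model learning: remember the observed outcome of (s, a)
            model[(s, a)] = (r, s_next, done)
            # (d) planning: n simulated one-step updates from the model
            for _ in range(n_planning):
                (ps, pa), (pr, ps_next, pdone) = random.choice(list(model.items()))
                ptarget = pr + gamma * max(Q[(ps_next, b)] for b in actions) * (not pdone)
                Q[(ps, pa)] += alpha * (ptarget - Q[(ps, pa)])
            s = s_next
    return Q
```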

Bolei Zhou CS260R Reinforcement Learning November 10, 2023 27 / 56


Result of Dyna
1 A simple maze environment: travel from S to G as quickly as possible
2 Learning curves varying the number of planning steps per real step

3 Policies found by planning and nonplanning Dyna-Q agents

Bolei Zhou CS260R Reinforcement Learning November 10, 2023 28 / 56


Today’s Plan

1 Intro on model-based reinforcement learning


2 Model-based value optimization
3 Model-based policy optimization
4 Case studies on robot object manipulation and learning world models
from images

Bolei Zhou CS260R Reinforcement Learning November 10, 2023 29 / 56


A quick announcement

1 A seminar talk in this lecture room from 4:15 pm - 5:15 pm today


2 Coordinated Learning-based Autonomy for Urban Air Mobility
Operations
1 In this talk the speaker will initiate exciting and open-ended discussions
on possible new flight planning and coordination models, learning-based
separation assurance algorithms, and AI certification concerns in aviation
autonomy.
2 Prof. Peng Wei from George Washington University

Bolei Zhou CS260R Reinforcement Learning November 10, 2023 30 / 56


Policy Optimization with Model-based RL

1 Previous model-based value-based RL:


model → simulated trajectories —backups→ values → policy

2 Can we optimize the policy and learn the model directly, without
estimating the value?
model —improves→ policy

Bolei Zhou CS260R Reinforcement Learning November 10, 2023 31 / 56


Model-based Policy Optimization in RL

1 Policy gradient, as a model-free RL method, only cares about the policy
π_θ(a_t | s_t) and the expected return

τ = {s_1, a_1, s_2, a_2, ..., s_T, a_T} ∼ π_θ(a_t | s_t)

arg max_θ E_{τ∼π_θ} [ Σ_t γ^t r(s_t, a_t) ]

2 In policy gradient, the transition model p(s_{t+1} | s_t, a_t) is not needed
(no matter whether it is known or unknown)

p(s_1, a_1, ..., s_T, a_T) = p(s_1) Π_{t=1}^{T} π_θ(a_t | s_t) p(s_{t+1} | s_t, a_t)

3 But can we do better if we know the model or are able to learn the
model?
Bolei Zhou CS260R Reinforcement Learning November 10, 2023 32 / 56
Model-based Policy Optimization in RL

1 Model-based policy optimization in RL is strongly influenced by control
theory, which optimizes a controller
2 The controller uses the model, also termed the system dynamics
s_t = f(s_{t−1}, a_{t−1}), to decide the optimal controls for a trajectory by
minimizing the cost:

arg min_{a_1,...,a_T} Σ_{t=1}^{T} c(s_t, a_t)   subject to   s_t = f(s_{t−1}, a_{t−1})

Bolei Zhou CS260R Reinforcement Learning November 10, 2023 33 / 56


Optimal Control for Trajectory Optimization

min_{a_1,...,a_T} Σ_{t=1}^{T} c(s_t, a_t)   subject to   s_t = f(s_{t−1}, a_{t−1})

1 If the dynamics is known, this becomes an optimal control problem

2 The cost function is the negative reward of the RL problem
3 The optimal solution can be computed by the Linear-Quadratic Regulator
(LQR) and iterative LQR (iLQR) under some simplifying assumptions
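
As a minimal illustration of the LQR idea (a sketch under assumed linear dynamics x_{t+1} = A x_t + B u_t and quadratic cost x_t' Q x_t + u_t' R u_t, not the iLQR used in practice), the finite-horizon feedback gains come from a backward Riccati recursion:

```python
import numpy as np

def lqr(A, B, Q, R, T):
    """Return time-indexed feedback gains K_t such that u_t = -K_t x_t."""
    P = Q.copy()                       # terminal value matrix P_T = Q
    gains = []
    for _ in range(T):
        # Backward pass: K_t = (R + B'PB)^{-1} B'PA
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)  # Riccati update
        gains.append(K)
    return gains[::-1]                 # reorder to t = 0, ..., T-1

# Usage: roll out the controller on a toy (known or learned) linear model.
A = np.array([[1.0, 0.1], [0.0, 1.0]])     # assumed double-integrator dynamics
B = np.array([[0.0], [0.1]])
Q, R = np.eye(2), 0.01 * np.eye(1)
Ks = lqr(A, B, Q, R, T=50)
x = np.array([1.0, 0.0])
for K in Ks:
    u = -K @ x
    x = A @ x + B @ u
```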
Bolei Zhou CS260R Reinforcement Learning November 10, 2023 34 / 56
Model Learning for Trajectory Optimization: Algorithm 1

1 If the dynamics model is unknown, we can combine model learning


and trajectory optimization
2 Algorithm 1
1 run base policy π_0(a_t | s_t) (random policy) to collect D = {(s, a, s')_i}
2 learn dynamics model s' = f(s, a) to minimize Σ_i ||f(s_i, a_i) − s'_i||²
3 plan through f(s, a) to choose actions
3 Step 2 is supervised learning: train a model to minimize the least-squares
error on the sampled data
4 Step 3 can be solved by the Linear Quadratic Regulator (LQR), which
calculates the optimal trajectory using the model and a cost function:

min_{a_1,...,a_T} Σ_{t=1}^{T} c(s_t, a_t)   subject to   s_t = f(s_{t−1}, a_{t−1})

Bolei Zhou CS260R Reinforcement Learning November 10, 2023 35 / 56


Model Learning for Trajectory Optimization: Algorithm 2

1 The previous solution is vulnerable to drift: a tiny model error
accumulates quickly along the trajectory
2 We may also land in regions where the model has not been learned yet

3 So we have the following improved algorithm that learns the model
iteratively
4 Algorithm 2
1 run base policy π_0(a_t | s_t) (random policy) to collect D = {(s, a, s')_i}
2 Loop
1 learn dynamics model s' = f(s, a) to minimize Σ_i ||f(s_i, a_i) − s'_i||²
2 plan through f(s, a) to choose actions
3 execute those actions and add the resulting data {(s, a, s')_i} to D

Bolei Zhou CS260R Reinforcement Learning November 10, 2023 36 / 56


Model Learning for Trajectory Optimization: Algorithm 3

1 Nevertheless, the previous method executes all planned actions before
fitting the model again, so we may already have drifted too far from the
data the model was trained on
2 Instead we can use Model Predictive Control (MPC): we optimize the whole
trajectory but take only the first action, then observe and replan
3 Replanning gives us a chance to take corrective action after observing the
current state again

Bolei Zhou CS260R Reinforcement Learning November 10, 2023 37 / 56


Model Learning for Trajectory Optimization: Algorithm 3

Algorithm 3 with MPC

1 run base policy π_0(a_t | s_t) to collect D = {(s, a, s')_i}
2 Loop each step
1 every N steps: learn dynamics model s' = f(s, a) to minimize
Σ_i ||f(s_i, a_i) − s'_i||²
2 MPC
1 plan through f(s, a) to choose actions
2 execute only the first planned action and observe the resulting state s'
3 append (s, a, s') to dataset D
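
A minimal sketch of this MPC loop (illustrative only: the env interface, the fit_model helper, the action bounds, and the random-shooting planner are assumptions; a real implementation would plan with LQR/iLQR or CEM instead of random shooting):

```python
import numpy as np

def plan_random_shooting(f, cost, s, horizon=15, n_candidates=1000, action_dim=2):
    """Pick the lowest-cost action sequence among random candidates in [-1, 1]."""
    best_seq, best_cost = None, np.inf
    for _ in range(n_candidates):
        seq = np.random.uniform(-1.0, 1.0, size=(horizon, action_dim))
        s_t, total = s, 0.0
        for a_t in seq:                      # roll out the candidate inside the model
            total += cost(s_t, a_t)
            s_t = f(s_t, a_t)
        if total < best_cost:
            best_seq, best_cost = seq, total
    return best_seq

def mpc_episode(env, f, cost, fit_model, D, steps=200, refit_every=50):
    s = env.reset()                          # assumed env interface
    for t in range(steps):
        if t % refit_every == 0:
            f = fit_model(D)                 # refit the dynamics model every N steps
        a = plan_random_shooting(f, cost, s)[0]   # execute only the first planned action
        s_next, _, done = env.step(a)
        D.append((s, a, s_next))             # append the observed transition
        s = s_next
        if done:
            break
    return f, D
```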

Bolei Zhou CS260R Reinforcement Learning November 10, 2023 38 / 56


Model Learning for Trajectory Optimization: Algorithm 4

1 Finally, we can plug policy learning in along with model learning and
optimal control

2 Algorithm 4: Learning Model and Policy Together

1 run base policy π_0(a_t | s_t) (random policy) to collect D = {(s, a, s')_i}
2 Loop
1 learn dynamics model f(s, a) to minimize Σ_i ||f(s_i, a_i) − s'_i||²
2 backpropagate through f(s, a) into the policy to optimize π_θ(a_t | s_t)
3 run π_θ(a_t | s_t), appending the visited (s, a, s') to D
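
A minimal sketch of step 2 of Algorithm 4 (assumed differentiable dynamics model, policy, and reward function; all names are my own): roll the policy out inside the learned model and backpropagate the negative return into the policy parameters only.

```python
import torch

def policy_update_through_model(policy, dynamics, reward_fn, s0, horizon=20, lr=1e-3):
    """policy: nn.Module mapping state -> action
       dynamics: differentiable callable (state, action) -> next state
       reward_fn: differentiable callable (state, action) -> scalar reward"""
    opt = torch.optim.Adam(policy.parameters(), lr=lr)  # only policy params are updated
    s = s0
    total_reward = 0.0
    for _ in range(horizon):
        a = policy(s)
        total_reward = total_reward + reward_fn(s, a)
        s = dynamics(s, a)          # gradients flow back through the learned model
    loss = -total_reward            # maximize return = minimize negative return
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```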

Bolei Zhou CS260R Reinforcement Learning November 10, 2023 39 / 56


Parameterizing the Model
What function is used to parameterize the dynamics?
1 Global model: s_{t+1} = f(s_t, a_t) is represented by a big neural network
1 Pro: very expressive and can use lots of data to fit
2 Con: not so great in low data regimes, and cannot express model
uncertainty
2 Local model: model the transition as time-varying linear-Gaussian
dynamics
1 Pro: very data-efficient and can express model uncertainty
2 Con: not great with non-smooth dynamics
3 Con: very slow when dataset is large
3 Local model as time-varying linear-Gaussian dynamics

p(x_{t+1} | x_t, u_t) = N(f(x_t, u_t))
f(x_t, u_t) = A_t x_t + B_t u_t

1 All we need are the local gradients A_t = df/dx_t and B_t = df/du_t
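
As a small illustration (an assumption-laden sketch, not from the slides), these local Jacobians can be obtained from a differentiable learned model with automatic differentiation:

```python
import torch
from torch.autograd.functional import jacobian

def linearize(dynamics, x_t, u_t):
    """dynamics: callable (x, u) -> next state, differentiable in both arguments."""
    A_t = jacobian(lambda x: dynamics(x, u_t), x_t)   # df/dx evaluated at (x_t, u_t)
    B_t = jacobian(lambda u: dynamics(x_t, u), u_t)   # df/du evaluated at (x_t, u_t)
    return A_t, B_t
```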
Bolei Zhou CS260R Reinforcement Learning November 10, 2023 40 / 56
Global Model versus Local Model

1 Local model as time-varying linear-Gaussian

p(x_{t+1} | x_t, u_t) = N(f(x_t, u_t))
f(x_t, u_t) = A_t x_t + B_t u_t

Bolei Zhou CS260R Reinforcement Learning November 10, 2023 41 / 56


Today’s Plan

1 Intro on model-based reinforcement learning


2 Model-based value optimization
3 Model-based policy optimization
4 Case studies on robot object manipulation and learning world models
from images

Bolei Zhou CS260R Reinforcement Learning November 10, 2023 42 / 56


Case Study 1: Model-based Robotic Object Manipulation

1 Learning to Control a Low-Cost Manipulator using Data-Efficient


Reinforcement Learning. RSS 2011

2 No pose feedback; visual feedback from a Kinect-type depth camera

3 Total cost: $500 = 6-DoF arm ($370) + Kinect ($130)
4 System setup:
1 Control signal u ∈ R^4: pulse widths for the first four motors
2 State x ∈ R^3: 3D center of the object
3 Policy π: R^3 → R^4
4 Expected return J^π = Σ_{t=0}^{T} E_{x_t}[c(x_t)], where c = −exp(−d²/σ_c²)
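
As a small illustration of the saturating cost (a sketch; the width σ_c and the target position below are assumed values), c approaches −1 when the object center is at the target and 0 far away:

```python
import numpy as np

def saturating_cost(x, target, sigma_c=0.25):
    """c(x) = -exp(-d^2 / sigma_c^2), with d the distance from x to the target."""
    d = np.linalg.norm(np.asarray(x) - np.asarray(target))
    return -np.exp(-d**2 / sigma_c**2)

# Example: near-target states get cost close to -1, distant states close to 0.
print(saturating_cost([0.0, 0.0, 0.05], target=[0.0, 0.0, 0.0]))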

Bolei Zhou CS260R Reinforcement Learning November 10, 2023 43 / 56


Case Study 1: Model-based Robotic Object Manipulation

1 Model the system dynamics as a probabilistic non-parametric Gaussian
process (GP)

2 PILCO: A model-based and data-efficient approach to policy search.


Deisenroth and Rasmussen. ICML 2011
3 Demo link: https://www.youtube.com/watch?v=gdT6dwUOYC0

Bolei Zhou CS260R Reinforcement Learning November 10, 2023 44 / 56


Case Study 2: Model-based Robotic Object Manipulation

1 Learning Contact-Rich Manipulation Skills with Guided Policy Search.


Sergey Levine and Pieter Abbeel. Best Robotics Manipulation Paper
Award at ICRA 2015
2 One of Sergey Levine’s representative works

Bolei Zhou CS260R Reinforcement Learning November 10, 2023 45 / 56


Case Study 2: Model-based Robotic Object Manipulation
1 Local models + Iterative LQR
1 Linear-Gaussian controller: p(u_t | x_t) = N(K_t x_t + k_t, C_t)
2 Time-varying linear-Gaussian dynamics:
p(x_{t+1} | x_t, u_t) = N(f_{x_t} x_t + f_{u_t} u_t, F_t)
3 Can be solved as linear-quadratic-Gaussian (LQG) problem using
optimal control
2 Guided policy search for global model:
1 policy model: π_θ
2 supervised learning of a neural network policy using the guidance of the
linear-Gaussian controller

Bolei Zhou CS260R Reinforcement Learning November 10, 2023 46 / 56


Case Study 2: Model-based Robotic Object Manipulation

1 Demo link: https://www.youtube.com/embed/mSzEyKaJTSU


(LINK IS NOT AVAILABLE ANYMORE)
Bolei Zhou CS260R Reinforcement Learning November 10, 2023 47 / 56
Case Study 3: Learning world models

1 Interactive blog by David Ha: https://worldmodels.github.io/


1 NeurIPS’18 Oral: Recurrent World Models Facilitate Policy Evolution

Bolei Zhou CS260R Reinforcement Learning November 10, 2023 48 / 56


Case Study 3: Learning world models
1 VAE for feature extraction

2 RNN model

Bolei Zhou CS260R Reinforcement Learning November 10, 2023 49 / 56


Case Study 4: Learning world models and planning from
images
1 A recent hot research topic in the RL field
2 Deep Planning Network (PlaNet) at ICML’19 and Dreamer at
ICLR’20 from Google and DeepMind
1 https://ai.googleblog.com/2020/03/introducing-dreamer-scalable.html

3 PlaNet solves a variety of image-based control tasks, competing with


advanced model-free agents in terms of final performance while being
5000% more data efficient on average.
Bolei Zhou CS260R Reinforcement Learning November 10, 2023 50 / 56
Case Study 4: Learning model and planning from images
1 Given five input images, the model reconstructs them and predicts the
future images up to time step 50

2 Extremely sample efficient compared to model-free methods

Bolei Zhou CS260R Reinforcement Learning November 10, 2023 51 / 56


Case Study 5: MuZero, model-based RL for AlphaZero

1 Nature paper 2020, Mastering Atari, Go, chess and shogi by planning
with a learned model
1 Paper:
https://www.nature.com/articles/s41586-020-03051-4.epdf
2 Blog article: https://www.deepmind.com/blog/muzero-mastering-go-chess-shogi-and-atari-without-rules
2 Evolution of AlphaGo→AlphaGo Zero→ AlphaZero → MuZero
1 less and less domain knowledge
3 MuZero combines a learned model with AlphaZero's powerful lookahead tree
search
4 Two planning methods in AI:
1 lookahead search (AlphaZero)
1 but relies on being given knowledge of the environment's dynamics
2 model-based planning

Bolei Zhou CS260R Reinforcement Learning November 10, 2023 52 / 56


Case Study 5: MuZero, model-based RL for AlphaZero

Bolei Zhou CS260R Reinforcement Learning November 10, 2023 53 / 56


Case Study 5: MuZero, model-based RL for AlphaZero

Bolei Zhou CS260R Reinforcement Learning November 10, 2023 54 / 56


Case Study 5: MuZero, model-based RL for AlphaZero

1 a. How MuZero uses its model to plan; b. How MuZero acts in the
environment; c. How MuZero trains its model.

Bolei Zhou CS260R Reinforcement Learning November 10, 2023 55 / 56


Summary of Model-based RL

1 Instead of fitting a policy or a value function, we develop a model to


predict the system dynamics
2 Model-based RL has much higher sample efficiency, which is crucial
for real-world applications such as robotic manipulation

Bolei Zhou CS260R Reinforcement Learning November 10, 2023 56 / 56
