2023 Week 7: Model-based RL (Updated)
Bolei Zhou
UCLA
3 Stable-Baselines3 in PyTorch:
1 https://github.com/DLR-RM/stable-baselines3
4 CleanRL:
1 https://github.com/vwxyzjn/cleanrl
2 High-quality single-file implementations of deep RL algorithms
1 Today
1 Introduction to model-based reinforcement learning
2 Model-based value optimization
3 Model-based policy optimization
4 Case studies on robot object manipulation and learning world models
from images
2 Thursday: Optimal Control and RL
1 Model-free RL
1 No model
2 Learn value/policy functions from experience
2 Model-based RL
1 Besides learning a policy or value function from experience, also learn a model from that experience
2 Plan the value/policy functions from the model:
model → policy
2 Cons:
1 First learning a model and then constructing a value or policy function from it introduces two sources of approximation error
2 It is difficult to guarantee convergence
1 Known models: in the game of Go, the rules of the game are the model
S_1, A_1 → R_2, S_2
S_2, A_2 → R_3, S_3
...
S_{T−1}, A_{T−1} → R_T, S_T

S' ∼ P^a_{s,s'},   R = R^a_s
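Learning the model amounts to fitting the mapping (S, A) → (R, S') by supervised regression on observed transitions. Below is a minimal PyTorch sketch, an illustrative assumption rather than the course's reference code (names like `DynamicsModel` and `fit_model` are made up for this example):

```python
# Minimal sketch (illustrative): fit the model s, a -> r, s' as a supervised
# regression problem on real transitions (S_t, A_t, R_{t+1}, S_{t+1}).
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    """Predicts (next state, reward) from (state, action); names are illustrative."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.next_state_head = nn.Linear(hidden, state_dim)
        self.reward_head = nn.Linear(hidden, 1)

    def forward(self, state, action):
        h = self.body(torch.cat([state, action], dim=-1))
        return self.next_state_head(h), self.reward_head(h)

def fit_model(model, states, actions, rewards, next_states, epochs=100, lr=1e-3):
    """Supervised training: regress (R, S') from (S, A) batches of real experience."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        pred_next, pred_r = model(states, actions)
        loss = nn.functional.mse_loss(pred_next, next_states) \
             + nn.functional.mse_loss(pred_r.squeeze(-1), rewards)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```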
1 Model-free RL
1 No model
2 Learn value function (and/or policy) from real experience
2 Model-based RL (using Sample-based Planning)
1 Learn a model from real experience
2 Plan value function (and/or policy) from simulated experience
3 Dyna, developed by R. S. Sutton (1991)
1 Learn a model from real experience
2 Learn and plan the value function (and/or policy) from both real and simulated experience (a minimal Dyna-Q sketch follows below)
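A minimal tabular Dyna-Q sketch, assuming a Gymnasium-style environment with discrete states and actions (function and variable names are illustrative): step (a) is direct RL on the real transition, step (b) records the transition in the learned model, and step (c) performs extra planning updates on transitions replayed from the model.

```python
# Minimal tabular Dyna-Q sketch (illustrative; assumes a Gymnasium-style
# environment with discrete observations and actions).
import random
from collections import defaultdict

def dyna_q(env, episodes=200, n_planning=10, alpha=0.1, gamma=0.95, eps=0.1):
    n_actions = env.action_space.n
    Q = defaultdict(float)          # Q[(s, a)]: action-value estimates
    model = {}                      # learned model: (s, a) -> (r, s', done)

    def epsilon_greedy(s):
        if random.random() < eps:
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda a: Q[(s, a)])

    def q_update(s, a, r, s2, done):
        target = r + (0.0 if done else gamma * max(Q[(s2, b)] for b in range(n_actions)))
        Q[(s, a)] += alpha * (target - Q[(s, a)])

    for _ in range(episodes):
        s, _ = env.reset()
        done = False
        while not done:
            a = epsilon_greedy(s)
            s2, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            q_update(s, a, r, s2, done)       # (a) direct RL from real experience
            model[(s, a)] = (r, s2, done)     # (b) model learning
            for _ in range(n_planning):       # (c) planning from simulated experience
                (ps, pa), (pr, ps2, pdone) = random.choice(list(model.items()))
                q_update(ps, pa, pr, ps2, pdone)
            s = s2
    return Q
```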
1 Architecture of Dyna
2 Can we optimize the policy and learn the model directly, without
estimating the value?
model —improves→ policy
3 But can we do better if we know the model or are able to learn the
model?
Model-based Policy Optimization in RL
min_{a_1,…,a_T} Σ_{t=1}^{T} c(s_t, a_t)   subject to   s_t = f(s_{t−1}, a_{t−1})
1 Finally, we can combine policy learning with model learning and optimal control (a minimal shooting-method sketch follows below)
1 All we need are the local gradients A_t = df/dx_t and B_t = df/du_t
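To make the objective above concrete, here is a minimal shooting-method sketch: roll an action sequence through a differentiable learned model f(s, a), sum the per-step costs c(s, a), and descend on the actions by backpropagation. This is an illustrative assumption, not the lecture's algorithm; iLQR would instead linearize f around the current trajectory to obtain A_t and B_t. The toy model and cost in the usage lines are made up for the example.

```python
# Minimal shooting-method sketch: directly optimize a_1..a_T through a
# differentiable learned model f(s, a) with per-step cost c(s, a).
import torch

def plan_actions(f, c, s0, horizon=20, action_dim=2, iters=200, lr=0.05):
    """Gradient descent on min_{a_1..a_T} sum_t c(s_t, a_t), with s_t = f(s_{t-1}, a_{t-1})."""
    actions = torch.zeros(horizon, action_dim, requires_grad=True)
    opt = torch.optim.Adam([actions], lr=lr)
    for _ in range(iters):
        s, total_cost = s0, 0.0
        for t in range(horizon):
            total_cost = total_cost + c(s, actions[t])
            s = f(s, actions[t])              # roll the learned model forward
        opt.zero_grad()
        total_cost.backward()
        opt.step()
    return actions.detach()

# Toy usage with an illustrative linear model and quadratic cost:
f = lambda s, a: s + 0.1 * a.sum() * torch.ones_like(s)
c = lambda s, a: (s ** 2).sum() + 0.01 * (a ** 2).sum()
planned = plan_actions(f, c, s0=torch.ones(3))
```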
Global Model versus Local Model
2 RNN model
1 Nature paper (2020): "Mastering Atari, Go, chess and shogi by planning with a learned model"
1 Paper:
https://www.nature.com/articles/s41586-020-03051-4.epdf
2 Blog article: https://www.deepmind.com/blog/muzero-mastering-go-chess-shogi-and-atari-without-rules
2 Evolution: AlphaGo → AlphaGo Zero → AlphaZero → MuZero
1 requiring less and less domain knowledge
3 MuZero combines a learned model with AlphaZero’s powerful lookahead tree search
4 Two planning methods in AI:
1 lookahead search (AlphaZero)
1 but it relies on being given knowledge of the environment’s dynamics
2 model-based planning with a learned model (a toy lookahead sketch follows below)
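To illustrate what "planning with a learned model" means, here is a toy depth-limited lookahead that rolls forward entirely inside learned dynamics and value functions, never querying the real environment's rules. This is only an illustrative sketch under assumed function signatures; real MuZero uses Monte Carlo tree search guided by learned policy/value priors, not exhaustive search.

```python
# Toy sketch of lookahead planning inside a learned model (illustrative only;
# real MuZero uses MCTS with learned policy/value priors).
def lookahead_value(dynamics, value_fn, hidden, actions, depth, gamma=0.997):
    """dynamics(h, a) -> (reward, next_hidden); value_fn(h) -> scalar value (floats)."""
    if depth == 0:
        return value_fn(hidden)                    # bootstrap with the learned value
    best = float("-inf")
    for a in actions:
        reward, next_hidden = dynamics(hidden, a)  # learned dynamics, not env rules
        best = max(best, reward + gamma *
                   lookahead_value(dynamics, value_fn, next_hidden, actions, depth - 1, gamma))
    return best

def plan_action(dynamics, value_fn, hidden, actions, depth=3, gamma=0.997):
    """Pick the action with the highest depth-limited lookahead value."""
    def score(a):
        reward, next_hidden = dynamics(hidden, a)
        return reward + gamma * lookahead_value(dynamics, value_fn, next_hidden, actions, depth - 1, gamma)
    return max(actions, key=score)
```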