Reinforcement Learning: An Overview
Kevin P. Murphy
December 9, 2024
Parts of this monograph are borrowed from chapters 34 and 35 of my textbook [Mur23]. However, I have added a
lot of new material, so this text supersedes those chapters. Thanks to Lihong Li, who wrote Section 5.4 and parts of
Section 1.4, and Pablo Samuel Castro, who proof-read a draft of this manuscript.
Contents
1 Introduction 9
1.1 Sequential decision making . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.1.1 Problem definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.1.2 Universal model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.1.3 Episodic vs continuing tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.1.4 Regret . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.1.5 Further reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.2 Canonical examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.2.1 Partially observed MDPs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.2.2 Markov decision processes (MDPs) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.2.3 Contextual MDPs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.2.4 Contextual bandits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.2.5 Belief state MDPs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.2.6 Optimization problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.2.6.1 Best-arm identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.2.6.2 Bayesian optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.2.6.3 Active learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.2.6.4 Stochastic Gradient Descent (SGD) . . . . . . . . . . . . . . . . . . . . . . . 17
1.3 Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.3.1 Value-based RL (Approximate Dynamic Programming) . . . . . . . . . . . . . . . . . 18
1.3.2 Policy-based RL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.3.3 Model-based RL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.3.4 Dealing with partial observability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.3.4.1 Optimal solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.3.4.2 Finite observation history . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.3.4.3 Stateful (recurrent) policies . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.3.5 Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.4 Exploration-exploitation tradeoff . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.4.1 Simple heuristics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.4.2 Methods based on the belief state MDP . . . . . . . . . . . . . . . . . . . . . . . . . . 22
1.4.2.1 Bandit case (Gittins indices) . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
1.4.2.2 MDP case (Bayes Adaptive MDPs) . . . . . . . . . . . . . . . . . . . . . . . 22
1.4.3 Upper confidence bounds (UCBs) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
1.4.3.1 Basic idea . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
1.4.3.2 Bandit case: Frequentist approach . . . . . . . . . . . . . . . . . . . . . . . . 23
1.4.3.3 Bandit case: Bayesian approach . . . . . . . . . . . . . . . . . . . . . . . . . 23
1.4.3.4 MDP case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
1.4.4 Thompson sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
1.4.4.1 Bandit case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
1.4.4.2 MDP case (posterior sampling RL) . . . . . . . . . . . . . . . . . . . . . . . 25
1.5 RL as a posterior inference problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
1.5.1 Modeling assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
1.5.2 Soft value functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
1.5.3 Maximum entropy RL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
1.5.4 Active inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2 Value-based RL 31
2.1 Basic concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.1.1 Value functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.1.2 Bellman’s equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.1.3 Example: 1d grid world . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.2 Computing the value function and policy given a known world model . . . . . . . . . . . . . . 33
2.2.1 Value iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.2.2 Real-time dynamic programming (RTDP) . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.2.3 Policy iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.3 Computing the value function without knowing the world model . . . . . . . . . . . . . . . . 35
2.3.1 Monte Carlo estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.3.2 Temporal difference (TD) learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.3.3 Combining TD and MC learning using TD(λ) . . . . . . . . . . . . . . . . . . . . . . . 36
2.3.4 Eligibility traces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.4 SARSA: on-policy TD control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.5 Q-learning: off-policy TD control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.5.1 Tabular Q learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.5.2 Q learning with function approximation . . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.5.2.1 Neural fitted Q . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.5.2.2 DQN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.5.2.3 Experience replay . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.5.2.4 The deadly triad . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.5.2.5 Target networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.5.2.6 Two time-scale methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.5.2.7 Layer norm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.5.3 Maximization bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.5.3.1 Double Q-learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.5.3.2 Double DQN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.5.3.3 Randomized ensemble DQN . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.5.4 DQN extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.5.4.1 Q learning for continuous actions . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.5.4.2 Dueling DQN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.5.4.3 Noisy nets and exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.5.4.4 Multi-step DQN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.5.4.5 Rainbow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
2.5.4.6 Bigger, Better, Faster . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
2.5.4.7 Other methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3 Policy-based RL 49
3.1 The policy gradient theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.2 REINFORCE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.3 Actor-critic methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.3.1 Advantage actor critic (A2C) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.3.2 Generalized advantage estimation (GAE) . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.3.3 Two-time scale actor critic algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.3.4 Natural policy gradient methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.3.4.1 Natural gradient descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.3.4.2 Natural actor critic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.4 Policy improvement methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.4.1 Policy improvement lower bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.4.2 Trust region policy optimization (TRPO) . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.4.3 Proximal Policy Optimization (PPO) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.4.4 VMPO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.5 Off-policy methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.5.1 Policy evaluation using importance sampling . . . . . . . . . . . . . . . . . . . . . . . 59
3.5.2 Off-policy actor critic methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.5.2.1 Learning the critic using V-trace . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.5.2.2 Learning the actor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.5.2.3 IMPALA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.5.3 Off-policy policy improvement methods . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.5.3.1 Off-policy PPO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.5.3.2 Off-policy VMPO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.5.3.3 Off-policy TRPO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.5.4 Soft actor-critic (SAC) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.5.4.1 Policy evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.5.4.2 Policy improvement: Gaussian policy . . . . . . . . . . . . . . . . . . . . . . 64
3.5.4.3 Policy improvement: softmax policy . . . . . . . . . . . . . . . . . . . . . . . 64
3.5.4.4 Adjusting the temperature . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.6 Deterministic policy gradient methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.6.1 DDPG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.6.2 Twin Delayed DDPG (TD3) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4 Model-based RL 69
4.1 Decision-time planning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.1.1 Model predictive control (MPC) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.1.2 Heuristic search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.1.3 Monte Carlo tree search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.1.3.1 AlphaGo and AlphaZero . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.1.3.2 MuZero . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.1.3.3 EfficientZero . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.1.4 Trajectory optimization for continuous actions . . . . . . . . . . . . . . . . . . . . . . 73
4.1.4.1 Random shooting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.1.4.2 LQG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.1.4.3 CEM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.1.4.4 MPPI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.1.4.5 GP-MPC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.1.5 SMC for MPC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.2 Background planning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.2.1 A game-theoretic perspective on MBRL . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.2.2 Dyna . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.2.2.1 Tabular Dyna . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.2.2.2 Dyna with function approximation . . . . . . . . . . . . . . . . . . . . . . . . 78
4.2.3 Dealing with model errors and uncertainty . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.2.3.1 Avoiding compounding errors in rollouts . . . . . . . . . . . . . . . . . . . . . 79
4.2.3.2 End-to-end differentiable learning of model and planner . . . . . . . . . . . . 80
4.2.3.3 Unified model and planning variational lower bound . . . . . . . . . . . . . . 80
4.2.3.4 Dynamically switching between MFRL and MBRL . . . . . . . . . . . . . . . 80
4.3 World models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.3.1 Generative world models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.3.1.1 Observation-space world models . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.3.1.2 Factored models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.3.1.3 Latent-space world models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.3.1.4 Dreamer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.3.1.5 Iris . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.3.2 Non-generative world models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.3.2.1 Value prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.3.2.2 Self prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.3.2.3 Policy prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.3.2.4 Observation prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.3.2.5 Partial observation prediction . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.3.2.6 BYOL-Explore . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.4 Beyond one-step models: predictive representations . . . . . . . . . . . . . . . . . . . . . . . . 88
4.4.1 General value functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.4.2 Successor representations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.4.3 Successor models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.4.3.1 Learning SMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.4.3.2 Jumpy models using geometric policy composition . . . . . . . . . . . . . . . 92
4.4.4 Successor features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.4.4.1 Generalized policy improvement . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.4.4.2 Option keyboard . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
4.4.4.3 Learning SFs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
4.4.4.4 Choosing the tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5 Other topics in RL 97
5.1 Distributional RL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
5.1.1 Quantile regression methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
5.1.2 Replacing regression with classification . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
5.2 Reward functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.2.1 Reward hacking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.2.2 Sparse reward . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.2.3 Reward shaping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.2.4 Intrinsic reward . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.2.4.1 Knowledge-based intrinsic motivation . . . . . . . . . . . . . . . . . . . . . . 99
5.2.4.2 Goal-based intrinsic motivation . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.3 Hierarchical RL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.3.1 Feudal (goal-conditioned) HRL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.3.1.1 Hindsight Experience Relabeling (HER) . . . . . . . . . . . . . . . . . . . . . 101
5.3.1.2 Hierarchical HER . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.3.1.3 Learning the subgoal space . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.3.2 Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.3.2.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.3.2.2 Learning options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.4 Imitation learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.4.1 Imitation learning by behavior cloning . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.4.2 Imitation learning by inverse reinforcement learning . . . . . . . . . . . . . . . . . . . 104
5.4.3 Imitation learning by divergence minimization . . . . . . . . . . . . . . . . . . . . . . 105
5.5 Offline RL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
5.5.1 Offline model-free RL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.5.1.1 Policy constraint methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.5.1.2 Behavior-constrained policy gradient methods . . . . . . . . . . . . . . . . . 107
5.5.1.3 Uncertainty penalties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
5.5.1.4 Conservative Q-learning and pessimistic value functions . . . . . . . . . . . . 107
5.5.2 Offline model-based RL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
5.5.3 Offline RL using reward-conditioned sequence modeling . . . . . . . . . . . . . . . . . 108
5.5.4 Hybrid offline/online methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
5.6 LLMs and RL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
5.6.1 RL for LLMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
5.6.1.1 RLHF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
5.6.1.2 Assistance game . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
5.6.1.3 Run-time inference as MPC . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
5.6.2 LLMs for RL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
5.6.2.1 LLMs for pre-processing the input . . . . . . . . . . . . . . . . . . . . . . . . 111
5.6.2.2 LLMs for rewards . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
5.6.2.3 LLMs for world models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
5.6.2.4 LLMs for policies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
5.7 General RL, AIXI and universal AGI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
Chapter 1
Introduction
1.1 Sequential decision making

1.1.1 Problem definition

The goal of the agent is to choose a policy π so as to maximize the expected sum of rewards,

V_π(s_0) ≜ E[ Σ_{t=0}^{T} R(s_t, a_t) | s_0, π ]     (1.1)

where s_0 is the agent's initial state, R(s_t, a_t) is the reward function that the agent uses to measure the value of performing an action in a given state, V_π(s_0) is the value function for policy π evaluated at s_0, and the expectation is wrt
p(a0 , s1 , a1 , . . . , aT , sT |s0 , π) = π(a0 |s0 )penv (o1 |a0 )δ(s1 = U (s0 , a0 , o1 )) (1.2)
× π(a1 |s1 )penv (o2 |a1 , o1 )δ(s2 = U (s1 , a1 , o2 )) (1.3)
× π(a2 |s2 )penv (o3 |a1:2 , o1:2 )δ(s3 = U (s2 , a2 , o3 )) . . . (1.4)
where penv is the environment’s distribution over observations (which is usually unknown). We define the
optimal policy as
π ∗ = arg max Ep0 (s0 ) [Vπ (s0 )] (1.5)
π
Note that picking a policy to maximize the sum of expected rewards is an instance of the maximum
expected utility principle. There are various ways to design or learn an optimal policy, depending on the
assumptions we make about the environment, and the form of the agent. We will discuss some of these
options below.
Figure 1.1: A small agent interacting with a big external world.
Figure 1.2: Diagram illustrating the interaction of the agent and environment. The agent has internal state st , and
chooses action at based on its policy πt . It then predicts its next internal state, st+1|t , via the predict function P ,
and optionally predicts the resulting observation, ôt+1 , via the observation decoder D. The environment has (hidden)
internal state zt , which gets updated by the world model W to give the new state zt+1 = W (zt , at ) in response to the
agent’s action. The environment also emits an observation ot+1 via the observation model O. This gets encoded to et+1
by the agent’s observation encoder E, which the agent uses to update its internal state using st+1 = U (st , at , et+1 ).
The policy is parameterized by θt , and these parameters may be updated (at a slower time scale) by the RL policy π RL .
Square nodes are functions, circles are variables (either random or deterministic). Dashed square nodes are stochastic
functions that take an extra source of randomness (not shown).
environment can be modeled by a controlled Markov process1 with hidden state zt , which gets updated
at each step in response to the agent’s action at . To allow for non-deterministic dynamics, we write this as
zt+1 = W (zt , at , ϵzt ), where W is the environment’s state transition function (which is usually not known
to the agent) and ϵzt is random system noise.2 The agent does not see the world state zt , but instead
sees a potentially noisy and/or partial observation ot+1 = O(zt+1 , ϵot+1 ) at each step, where ϵot+1 is random
observation noise. For example, when navigating a maze, the agent may only see what is in front of it, rather
than seeing everything in the world all at once; furthermore, even the current view may be corrupted by
sensor noise. Any given image, such as one containing a door, could correspond to many different locations in
the world (this is called perceptual aliasing), each of which may require a different action. Thus the agent
needs to use these observations to incrementally update its own internal belief state about the world, using
the state update function st+1 = SU (st , at , ot+1 ); this represents the agent’s beliefs about the underlying
world state zt , as well as the unknown world model W itself (or some proxy thereof). In the simplest setting,
the internal state st can just store all the past observations, ht = (o1:t , a1:t−1 ), but such non-parametric models
can take a lot of time and space to work with, so we will usually consider parametric approximations. The
agent can then pass its state to its policy to pick actions, using at+1 = πt (st+1 ).
We can further elaborate the behavior of the agent by breaking the state-update function into two
parts. First the agent predicts its own next state, st+1|t = P (st , at ), using a prediction function P ,
and then it updates this prediction given the observation using update function U , to give st+1 =
U (st+1|t , ot+1 ). Thus the SU function is defined as the composition of the predict and update functions:
st+1 = SU (st , at , ot+1 ) = U (P (st , at ), ot+1 ). If the observations are high dimensional (e.g., images), the
agent may choose to encode its observations into a low-dimensional embedding et+1 using an encoder,
et+1 = E(ot+1 ); this can encourage the agent to focus on the relevant parts of the sensory signal. (The state
update then becomes st+1 = U (st+1|t , et+1 ).) Optionally the agent can also learn to invert this encoder by
training a decoder to predict the next observation using ôt+1 = D(st+1|t ); this can be a useful training signal,
as we will discuss in Chapter 4. Finally, the agent needs to learn the action policy πt . We parameterize this
by θt , so πt (st ) = π(st ; θt ). These parameters themselves may need to be learned; we use the notation π RL
to denote the RL policy which specifies how to update the policy parameters at each step. See Figure 1.2 for
an illustration.
We see that, in general, there are three interacting stochastic processes we need to deal with: the
environment’s states zt (which are usually affected by the agent’s actions); the agent’s internal states st (which
reflect its beliefs about the environment based on the observed data); and the agent’s policy parameters
θt (which are updated based on the information stored in the belief state). The reason there are so many
RL algorithms is that this framework is very general. In the rest of this manuscript we will study special
cases, where we make different assumptions about the environment’s state zt and dynamics, the agent’s
state st and dynamics, the form of the action policy π(st |θt ), and the form of the policy learning method
θt+1 = π RL (θt , st , at , ot+1 ).
or structural equation model. This is standard practice in the control theory and causality communities.
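To make the interface in Figure 1.2 concrete, here is a minimal sketch of the interaction loop written in Python. All of the functions (the environment step, the encoder E, predictor P, updater U, and policy π) are hypothetical stand-ins chosen only for illustration; they are not part of the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Hypothetical environment (unknown to the agent) ---
def env_step(z, a):
    """World model W and observation model O: returns next hidden state and observation."""
    z_next = 0.9 * z + a + 0.1 * rng.normal()      # z' = W(z, a, eps_z)
    o_next = z_next + 0.1 * rng.normal()           # o' = O(z', eps_o)
    return z_next, o_next

# --- Hypothetical agent components (cf. Figure 1.2) ---
def encode(o):                 # E: observation -> embedding
    return o
def predict(s, a):             # P: predict the next internal state
    return s + a
def update(s_pred, e):         # U: correct the prediction with the encoded observation
    return 0.5 * s_pred + 0.5 * e
def policy(s, theta):          # pi(s; theta): here a simple linear feedback policy
    return -theta * s

# --- Interaction loop ---
z, s, theta = rng.normal(), 0.0, 0.5
for t in range(10):
    a = policy(s, theta)                   # a_t = pi(s_t; theta_t)
    z, o = env_step(z, a)                  # environment transitions and emits o_{t+1}
    s = update(predict(s, a), encode(o))   # s_{t+1} = U(P(s_t, a_t), E(o_{t+1}))
    # theta could be updated here (on a slower time scale) by an RL policy pi_RL (omitted)
    print(f"t={t}, action={a:.2f}, obs={o:.2f}, state={s:.2f}")
```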
1.1.3 Episodic vs continuing tasks

The return is defined as the sum of rewards, where each reward is multiplied by a discount factor γ ∈ [0, 1]:

G_t ≜ r_t + γ r_{t+1} + γ^2 r_{t+2} + · · · = Σ_{k=0}^{T−t−1} γ^k r_{t+k}

where r_t = R(s_t, a_t) is the reward, and G_t is the reward-to-go. For episodic tasks that terminate at time T, we define G_t = 0 for t ≥ T. Clearly, the return satisfies the following recursive relationship:

G_t = r_t + γ G_{t+1}
The discount factor γ plays two roles. First, it ensures the return is finite even if T = ∞ (i.e., infinite
horizon), provided we use γ < 1 and the rewards rt are bounded. Second, it puts more weight on short-term
rewards, which generally has the effect of encouraging the agent to achieve its goals more quickly. (For
example, if γ = 0.99, then an agent that reaches a terminal reward of 1.0 in 15 steps will receive an expected
discounted reward of 0.99^15 ≈ 0.86, whereas if it takes 17 steps it will only get 0.99^17 ≈ 0.84.) However, if γ is
too small, the agent will become too greedy. In the extreme case where γ = 0, the agent is completely myopic,
and only tries to maximize its immediate reward. In general, the discount factor reflects the assumption that
there is a probability of 1 − γ that the interaction will end at the next step. For finite horizon problems,
where T is known, we can set γ = 1, since we know the life time of the agent a priori.3
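As a small illustration of the return and its recursion, the sketch below computes G_t by scanning backwards over a reward sequence; the reward sequences are made up to match the γ = 0.99 example above.

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """Compute G_t = r_t + gamma * G_{t+1} for all t by scanning backwards."""
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

# Terminal reward of 1.0 received 15 steps vs 17 steps after t = 0, with gamma = 0.99.
r15 = [0.0] * 15 + [1.0]
r17 = [0.0] * 17 + [1.0]
print(discounted_returns(r15, 0.99)[0])   # 0.99**15, approx 0.86
print(discounted_returns(r17, 0.99)[0])   # 0.99**17, approx 0.84
```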
1.1.4 Regret
So far we have been discussing maximizing the reward. However, the upper bound on this is usually unknown,
so it can be hard to know how well a given agent is doing. An alternative approach is to work in terms of
the regret, which is defined as the difference between the expected reward under the agent’s policy and the
oracle policy π∗ , which knows the true MDP. Specifically, let πt be the agent’s policy at time t. Then the
per-step regret at t is defined as
l_t ≜ E_{s_{1:t}}[ R(s_t, π_*(s_t)) − E_{π(a_t|s_t)}[R(s_t, a_t)] ]     (1.10)
Here the expectation is with respect to the randomness in choosing actions using the policy π, as well as earlier
states, actions, and rewards, and any other potential sources of randomness.
If we only care about the final performance of the agent, as in most optimization problems, it is enough
to look at the simple regret at the last step, namely lT . Optimizing simple regret results in a problem
known as pure exploration [BMS11], where the agent needs to interact with the environment to learn
the underlying MDP; at the end, it can then solve for the resulting policy using planning methods (see
Section 2.2). However, in RL, it is more common to focus on the cumulative regret, also called the total
regret or just the regret, which is defined as
" T #
X
LT ≜ E lt (1.11)
t=1
Thus the agent will accumulate reward (and regret) while it learns a model and policy. This is called earning
while learning, and requires performing exploratory actions, to learn the model (and hence optimize
long-term reward), while also performing actions that maximize the reward at each step. This requires solving
the exploration-exploitation tradeoff, which we discuss in Section 1.4.
3We may also use γ = 1 for continuing tasks, targeting the (undiscounted) average reward criterion [Put94].
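The sketch below (with made-up Bernoulli arms and a deliberately naive uniform-random policy) illustrates how per-step and cumulative regret are computed in practice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 3-armed Bernoulli bandit: the oracle always plays the best arm.
mu = np.array([0.2, 0.5, 0.8])        # true expected rewards (assumed only for illustration)
oracle_value = mu.max()

T = 1000
actions = rng.integers(0, 3, size=T)               # a deliberately naive uniform-random policy
per_step_regret = oracle_value - mu[actions]       # l_t = E[R(a*)] - E[R(a_t)]
cumulative_regret = np.cumsum(per_step_regret)     # L_T = sum_t l_t

print("simple regret (last step):", per_step_regret[-1])
print("cumulative regret L_T:", cumulative_regret[-1])
```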
1.1.5 Further reading
In later chapters, we will describe methods for learning the best policy to maximize Vπ (s0 ) = E [G0 |s0 , π].
More details on RL can be found in textbooks such as [Sze10; SB18; Aga+22a; Pla22; ID19; RJ22; Li23;
MMT24], and reviews such as [Aru+17; FL+18; Li18; Wen18a]. For details on how RL relates to control
theory, see e.g., [Son98; Rec19; Ber19; Mey22], and for connections to operations research, see [Pow22].
Note that we can combine these two distributions to derive the joint world model pW O (zt+1 , ot+1 |zt , at ).
Also, we can use these distributions to derive the environment’s non-Markovian observation distribution,
penv (ot+1 |o1:t , a1:t ), used in Equation (1.4), as follows:
p_env(o_{t+1} | o_{1:t}, a_{1:t}) = Σ_{z_{t+1}} p(o_{t+1} | z_{t+1}) p(z_{t+1} | a_{1:t})     (1.14)

p(z_{t+1} | a_{1:t}) = Σ_{z_1} · · · Σ_{z_t} p(z_1 | a_1) p(z_2 | z_1, a_1) · · · p(z_{t+1} | z_t, a_t)     (1.15)
If the world model (both p(o|z) and p(z ′ |z, a)) is known, then we can — in principle — solve for the optimal
policy. The method requires that the agent’s internal state correspond to the belief state st = bt = p(zt |ht ),
where ht = (o1:t , a1:t−1 ) is the observation history. The belief state can be updated recursively using Bayes rule.
See Section 1.2.5 for details. The belief state forms a sufficient statistic for the optimal policy. Unfortunately,
computing the belief state and the resulting optimal policy is wildly intractable [PT87; KLC98]. We discuss
some approximate methods in Section 1.3.4.
4 In the control theory community, the environment is called the plant, and the agent is called the controller. States are denoted by xt ∈ X ⊆ RD , actions are denoted by ut ∈ U ⊆ RK , and
rewards are replaced by costs ct ∈ R.
Figure 1.3: Illustration of an MDP as a finite state machine (FSM). The MDP has three discrete states (green
circles), two discrete actions (orange circles), and two non-zero rewards (orange arrows). The numbers on the
black edges represent state transition probabilities, e.g., p(s′ = s0 | a = a0 , s = s0 ) = 0.7; most state transitions
are impossible (probability 0), so the graph is sparse. The numbers on the yellow wiggly edges represent expected
rewards, e.g., R(s = s1 , a = a0 , s′ = s0 ) = +5; state transitions with zero reward are not annotated. From
https: // en. wikipedia. org/ wiki/ Markov_ decision_ process . Used with kind permission of Wikipedia author
waldoalvarez.
In lieu of an observation model, we assume the environment (as opposed to the agent) sends out a reward
signal, sampled from pR (rt |st , at , st+1 ). The expected reward is then given by
R(s_t, a_t, s_{t+1}) = Σ_r r · p_R(r | s_t, a_t, s_{t+1})     (1.17)

R(s_t, a_t) = Σ_{s_{t+1}} p_S(s_{t+1} | s_t, a_t) R(s_t, a_t, s_{t+1})     (1.18)
Given a stochastic policy π(at |st ), the agent can interact with the environment over many steps. Each
step is called a transition, and consists of the tuple (st , at , rt , st+1 ), where at ∼ π(·|st ), st+1 ∼ pS (st , at ),
and rt ∼ pR (st , at , st+1 ). Hence, under policy π, the probability of generating a trajectory of length T ,
τ = (s0 , a0 , r0 , s1 , a1 , r1 , s2 , . . . , sT ), can be written explicitly as
p(τ) = p_0(s_0) ∏_{t=0}^{T−1} π(a_t | s_t) p_S(s_{t+1} | s_t, a_t) p_R(r_t | s_t, a_t, s_{t+1})     (1.19)
In general, the state and action sets of an MDP can be discrete or continuous. When both sets are finite,
we can represent these functions as lookup tables; this is known as a tabular representation. In this case,
we can represent the MDP as a finite state machine, which is a graph where nodes correspond to states,
and edges correspond to actions and the resulting rewards and next states. Figure 1.3 gives a simple example
of an MDP with 3 states and 2 actions.
If we know the world model pS and pR , and if the state and action space is tabular, then we can solve for
the optimal policy using dynamic programming techniques, as we discuss in Section 2.2. However, typically
the world model is unknown, and the states and actions may need complex nonlinear models to represent
their transitions. In such cases, we will have to use RL methods to learn a good policy.
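To make Equation (1.19) concrete, the sketch below samples a trajectory from a small made-up tabular MDP (not the one in Figure 1.3) and computes its probability under a uniform random policy.

```python
import numpy as np

rng = np.random.default_rng(0)

# A made-up tabular MDP with 3 states and 2 actions.
S, A = 3, 2
P = rng.dirichlet(np.ones(S), size=(S, A))        # P[s, a, s'] = p_S(s' | s, a)
R = rng.normal(size=(S, A, S))                    # deterministic reward R(s, a, s')
pi = np.full((S, A), 1.0 / A)                     # uniform random policy pi(a | s)
p0 = np.array([1.0, 0.0, 0.0])                    # initial state distribution

def sample_trajectory(T):
    """Sample tau = (s_0, a_0, r_0, ..., s_T) and return it with its probability (Eq. 1.19)."""
    s = rng.choice(S, p=p0)
    prob, traj = p0[s], [s]
    for _ in range(T):
        a = rng.choice(A, p=pi[s])
        s_next = rng.choice(S, p=P[s, a])
        prob *= pi[s, a] * P[s, a, s_next]         # rewards here are deterministic given (s, a, s')
        traj += [a, R[s, a, s_next], s_next]
        s = s_next
    return traj, prob

traj, prob = sample_trajectory(T=5)
print("trajectory:", traj)
print("probability under pi:", prob)
```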
drawn from a common distribution. This requires the agent to generalize across multiple MDPs, rather than
overfitting to a specific environment [Cob+19; Kir+21; Tom+22]. (This form of generalization is different
from generalization within an MDP, which requires generalizing across states, rather than across environments;
both are important.)
A contextual MDP is a special kind of POMDP where the hidden variable corresponds to the unknown
parameters of the model. In [Gho+21], they call this an epistemic POMDP, which is closely related to the
concept of belief state MDP which we discuss in Section 1.2.5.
p(bt+1 |bt , ot+1 , at+1 , rt+1 ) = I (bt+1 = BayesRule(bt , ot+1 , at+1 , rt+1 )) (1.20)
5 The terminology arises by analogy to a slot machine (sometimes called a “bandit”) in a casino. If there are K slot machines,
each with different rewards (payout rates), then the agent (player) must explore the different machines until they have discovered
which one is best, and can then stick to exploiting it.
6 Technically speaking, this is a POMDP, where we assume the states are observed, and the parameters are the unknown
hidden random variables. This is in contrast to Section 1.2.1, where the states were not observed, and the parameters were
assumed to be known.
Figure 1.4: Illustration of sequential belief updating for a two-armed beta-Bernoulli bandit. The prior for the reward
for action 1 is the (blue) uniform distribution Beta(1, 1); the prior for the reward for action 2 is the (orange) unimodal
distribution Beta(2, 2). We update the parameters of the belief state based on the chosen action, and based on whether
the observed reward is success (1) or failure (0).
If we can solve this (PO)MDP, we have the optimal solution to the exploration-exploitation problem.
As a simple example, consider a context-free Bernoulli bandit, where pR (r|a) = Ber(r|µa ), and
µa = pR (r = 1|a) = R(a) is the expected reward for taking action a. The only unknown parameters are
w = µ_{1:A}. Suppose we use a factored beta prior,

p_0(w) = ∏_a Beta(µ_a | α_{0a}, β_{0a})     (1.22)

Then the belief state at step t remains factored,

p(w | h_t) = ∏_a Beta(µ_a | α_{0a} + N_t^1(a), β_{0a} + N_t^0(a))     (1.23)

where

N_t^r(a) = Σ_{i=1}^{t−1} I(a_i = a, r_i = r)     (1.24)
This is illustrated in Figure 1.4 for a two-armed Bernoulli bandit. We can use a similar method for a
Gaussian bandit, where pR (r|a) = N (r|µa , σa2 ).
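A minimal sketch of this sequential Beta-Bernoulli update, using the priors from Figure 1.4; the true success probabilities are invented purely to simulate rewards.

```python
import numpy as np

rng = np.random.default_rng(0)

# Priors from Figure 1.4: arm 1 ~ Beta(1, 1), arm 2 ~ Beta(2, 2).
alpha = np.array([1.0, 2.0])
beta = np.array([1.0, 2.0])

# True (unknown) success probabilities, chosen here only for simulation.
mu_true = np.array([0.3, 0.7])

for t in range(100):
    a = rng.integers(0, 2)                 # any exploration rule could go here
    r = rng.random() < mu_true[a]          # Bernoulli reward
    # Conjugate update of the belief state: success increments alpha, failure increments beta.
    alpha[a] += r
    beta[a] += 1 - r

posterior_mean = alpha / (alpha + beta)
print("posterior means:", posterior_mean)   # should approach mu_true with more data
```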
In the case of contextual bandits, the problem is conceptually the same, but becomes more complicated
computationally. If we assume a linear regression bandit, pR (r|s, a; w) = N (r|ϕ(s, a)T w, σ 2 ), we can use
Bayesian linear regression to compute p(w|Dt ) exactly in closed form. If we assume a logistic regression
bandit, pR (r|s, a; w) = Ber(r|σ(ϕ(s, a)T w)), we have to use approximate methods for approximate Bayesian
logistic regression to compute p(w|Dt ). If we have a neural bandit of the form pR (r|s, a; w) = N (r|f (s, a; w))
for some nonlinear function f , then posterior inference is even more challenging (this is equivalent to the
problem of inference in Bayesian neural networks, see e.g., [Arb+23] for a review paper for the offline case,
and [DMKM22; JCM24] for some recent online methods).
We can generalize the above methods to compute the belief state for the parameters of an MDP in the
obvious way, but modeling both the reward function and state transition function.
Once we have computed the belief state, we can derive a policy with optimal regret using the methods
like UCB (Section 1.4.3) or Thompson sampling (Section 1.4.4).
changes over time, but the environment state does not.7 Such problems commonly arise when we are trying
to optimize a fixed but unknown function R. We can “query” the function by evaluating it at different points
(parameter values), and in some cases, the resulting observation may also include gradient information. The
agent’s goal is to find the optimum of the function in as few steps as possible. We give some examples of this
problem setting below.
1.2.6.1 Best-arm identification

In the best-arm identification problem, the agent interacts with the bandit for T rounds and then recommends a single arm; the objective is the expected reward of the recommended arm, E[R(â)], where â = πT (a1:T , r1:T ) is the estimated optimal arm as computed by the terminal policy πT applied to
the sequence of observations obtained by the exploration policy π. This can be solved by a simple adaptation
of the methods used for standard bandits.
1.2.6.2 Bayesian optimization

In Bayesian optimization, the goal is to find

w_* = argmax_w R(w)

for some unknown function R, where w ∈ R^N, using as few actions (function evaluations of R) as possible.
This is essentially an “infinite arm” version of the best-arm identification problem [Tou14], where we replace
the discrete choice of arms a ∈ {1, . . . , K} with the parameter vector w ∈ RN . In this case, the optimal
policy can be computed if the agent’s state st is a belief state over the unknown function, i.e., st = p(R|ht ).
A common way to represent this distribution is to use Gaussian processes. We can then use heuristics like
expected improvement, knowledge gradient or Thompson sampling to implement the corresponding policy,
wt = π(st ). For details, see e.g., [Gar23].
Approach Method Functions learned On/Off Section
Value-based SARSA Q(s, a) On Section 2.4
Value-based Q-learning Q(s, a) Off Section 2.5
Policy-based REINFORCE π(a|s) On Section 3.2
Policy-based A2C π(a|s), V (s) On Section 3.3.1
Policy-based TRPO/PPO π(a|s), A(s, a) On Section 3.4.3
Policy-based DDPG a = π(s), Q(s, a) Off Section 3.6.1
Policy-based Soft actor-critic π(a|s), Q(s, a) Off Section 3.5.4
Model-based MBRL p(s′ |s, a) Off Chapter 4
Table 1.1: Summary of some popular methods for RL. On/off refers to on-policy vs off-policy methods.
Although in principle it is possible to learn the learning rate (stepsize) policy using RL (see e.g., [Xu+17]),
the policy is usually chosen by hand, either using a learning rate schedule or some kind of manually
designed adaptive learning rate policy (e.g., based on second order curvature information).
The value function for the optimal policy π∗ is known to satisfy the following recursive condition, known as Bellman’s equation:

V_∗(s) = max_a [ R(s, a) + γ E_{p_S(s′|s,a)}[V_∗(s′)] ]

This follows from the principle of dynamic programming, which computes the optimal solution to a problem (here the value of state s) by combining the optimal solutions of various subproblems (here the values of the next states s′). This can be used to derive the following learning rule:

V(s) ← V(s) + η [ r + γ V(s′) − V(s) ]

where s′ ∼ pS (·|s, a) is the next state sampled from the environment, η is a learning rate, and r = R(s, a) is the observed reward.
This is called Temporal Difference or TD learning (see Section 2.3.2 for details). Unfortunately, it is not
clear how to derive a policy if all we know is the value function. We now describe a solution to this problem.
We first generalize the notion of a value function so that it assigns a value to a state-action pair, by defining the Q function as follows:

Q_π(s, a) ≜ E_π[G_0 | s_0 = s, a_0 = a] = E_π[ Σ_{t=0}^{∞} γ^t r_t | s_0 = s, a_0 = a ]     (1.30)
This quantity represents the expected return obtained if we start by taking action a in state s, and then
follow π to choose actions thereafter. The Q function for the optimal policy satisfies a modified Bellman
equation
Q_∗(s, a) = R(s, a) + γ E_{p_S(s′|s,a)}[ max_{a′} Q_∗(s′, a′) ]     (1.31)
This yields the following Q-learning update:

Q(s, a) ← Q(s, a) + η [ r + γ max_{a′} Q(s′, a′) − Q(s, a) ]     (1.32)
where we sample s′ ∼ pS (·|s, a) from the environment. The action is chosen at each step from the implicit
policy

a = argmax_{a′} Q(s, a′)     (1.33)
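As a preview of Section 2.5, here is a minimal sketch of tabular Q-learning with ϵ-greedy exploration on a small made-up MDP; all constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# A small made-up tabular MDP (3 states, 2 actions) for illustration only.
S, A = 3, 2
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] = p_S(. | s, a)
R = rng.normal(size=(S, A))                  # R(s, a)
gamma, eta, eps = 0.9, 0.1, 0.1

Q = np.zeros((S, A))
s = 0
for step in range(20_000):
    # epsilon-greedy version of the implicit policy a = argmax_a' Q(s, a')
    a = rng.integers(A) if rng.random() < eps else int(np.argmax(Q[s]))
    s_next = rng.choice(S, p=P[s, a])
    r = R[s, a]
    # Q-learning update (Eq. 1.32): move Q(s,a) towards r + gamma * max_a' Q(s', a')
    Q[s, a] += eta * (r + gamma * np.max(Q[s_next]) - Q[s, a])
    s = s_next

print("learned Q:", np.round(Q, 2))
print("greedy policy:", np.argmax(Q, axis=1))
```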
1.3.2 Policy-based RL
In this section we give a brief introduction to policy-based RL; for details see Chapter 3.
In policy-based methods, we try to directly maximize J(πθ ) = Ep(s0 ) [Vπ (s0 )] wrt the parameters θ; this
is called policy search. If J(πθ ) is differentiable wrt θ, we can use stochastic gradient ascent to optimize θ,
which is known as policy gradient (see Section 3.1).
Policy gradient methods have the advantage that they provably converge to a local optimum for many
common policy classes, whereas Q-learning may diverge when approximation is used (Section 2.5.2.4). In
addition, policy gradient methods can easily be applied to continuous action spaces, since they do not need
to compute argmaxa Q(s, a). Unfortunately, the score function estimator for ∇θ J(πθ ) can have a very high
variance, so the resulting method can converge slowly.
One way to reduce the variance is to learn an approximate value function, Vw (s), and to use it as a
baseline in the score function estimator. We can learn Vw (s) using TD learning. Alternatively, we can
learn an advantage function, Aw (s, a), and use it as a baseline. These policy gradient variants are called actor
critic methods, where the actor refers to the policy πθ and the critic refers to Vw or Aw . See Section 3.3 for
details.
1.3.3 Model-based RL
In this section, we give a brief introduction to model-based RL; for more details, see Chapter 4.
Value-based methods, such as Q-learning, and policy search methods, such as policy gradient, can be very
sample inefficient, which means they may need to interact with the environment many times before finding
a good policy, which can be problematic when real-world interactions are expensive. In model-based RL, we
first learn the MDP, including the pS (s′ |s, a) and R(s, a) functions, and then compute the policy, either using
approximate dynamic programming on the learned model, or doing lookahead search. In practice, we often
interleave the model learning and planning phases, so we can use the partially learned policy to decide what
data to collect, to help learn a better model.
1.3.4 Dealing with partial observability
In an MDP, we assume that the state of the environment st is the same as the observation ot obtained by the
agent. But in many problems, the observation only gives partial information about the underlying state of the
world (e.g., a rodent or robot navigating in a maze). This is called partial observability. In this case, using
a policy of the form at = π(ot ) is suboptimal, since ot does not give us complete state information. Instead
we need to use a policy of the form at = π(ht ), where ht = (a1 , o1 , . . . , at−1 , ot ) is the entire past history of
observations and actions, plus the current observation. Since depending on the entire past is not tractable for
a long-lived agent, various approximate solution methods have been developed, as we summarize below.
1.3.5 Software
Implementing RL algorithms is much trickier than methods for supervised learning, or generative methods
such as language modeling and diffusion, all of which have stable (easy-to-optimize) loss functions. Therefore
it is often wise to build on existing software rather than starting from scratch. We list some useful libraries
in the table below.
Library              Language   Comments
Stoix                Jax        Mini-library with many methods (including MBRL)
PureJaxRL            Jax        Single files with DQN; PPO, DPO
JaxRL                Jax        Single files with AWAC, DDPG, SAC, SAC+REDQ
Stable Baselines Jax Jax        Library with DQN, CrossQ, TQC; PPO, DDPG, TD3, SAC
Jax Baselines        Jax        Library with many methods
Rejax                Jax        Library with DDQN, PPO, (discrete) SAC, DDPG
Dopamine             Jax/TF     Library with many methods
Rlax                 Jax        Library of RL utility functions (used by Acme)
Acme                 Jax/TF     Library with many methods (uses rlax)
CleanRL              PyTorch    Single files with many methods
Stable Baselines 3   PyTorch    Library with DQN; A2C, PPO, DDPG, TD3, SAC, HER
TianShou             PyTorch    Library with many methods (including offline RL)

In addition, RL experiments can be very high variance, making it hard to draw valid conclusions. See [Aga+21b; Pat+24; Jor+24] for some recommended experimental practices. For example, when reporting performance across different environments with different intrinsic difficulties (e.g., different kinds of Atari games), [Aga+21b] recommend reporting the interquartile mean (IQM) of the performance metric, which is the mean of the samples between the 0.25 and 0.75 percentiles (this is a special case of a trimmed mean). Let this estimate be denoted by µ̂(Di), where Di is the empirical data (e.g., reward vs time) from the i'th run. We can estimate the uncertainty in this estimate using a nonparametric method, such as bootstrap resampling, or a parametric approximation, such as a Gaussian approximation. (The latter requires computing the standard error of the mean, σ̂/√n, where n is the number of trials, and σ̂ is the estimated standard deviation of the (trimmed) data.)
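A minimal sketch of this procedure, assuming NumPy and SciPy and using synthetic per-run scores: compute the IQM with a 25% trimmed mean and estimate its uncertainty by bootstrap resampling over runs.

```python
import numpy as np
from scipy.stats import trim_mean

rng = np.random.default_rng(0)

# Synthetic final scores from n independent runs (one number per run).
scores = rng.normal(loc=100.0, scale=20.0, size=30)

# Interquartile mean: trimmed mean that drops the bottom and top 25% of samples.
iqm = trim_mean(scores, proportiontocut=0.25)

# Nonparametric uncertainty estimate via bootstrap resampling of runs.
boot = np.array([
    trim_mean(rng.choice(scores, size=len(scores), replace=True), 0.25)
    for _ in range(10_000)
])
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"IQM = {iqm:.1f}, 95% bootstrap CI = [{lo:.1f}, {hi:.1f}]")
```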
1.4 Exploration-exploitation tradeoff

1.4.1 Simple heuristics

A simple heuristic is the ϵ-greedy policy πϵ, which picks the greedy action a = argmax_{a′} R̂(s, a′) with probability 1 − ϵ, and picks an action uniformly at random with probability ϵ. Another common heuristic is the Boltzmann (softmax) policy,

π_τ(a|s) = exp(R̂(s, a)/τ) / Σ_{a′} exp(R̂(s, a′)/τ)

where τ > 0 is a temperature parameter that controls how entropic the distribution is. As τ gets close to 0, π_τ becomes close to a greedy policy. On the other hand, higher values of τ will make π(a|s) more uniform, and encourage more exploration. Its action selection probabilities can be much “smoother” with respect to changes in the reward estimates than ϵ-greedy, as illustrated in Table 1.3.

R̂(s, a1)   R̂(s, a2)   πϵ(a1|s)   πϵ(a2|s)   πτ(a1|s)   πτ(a2|s)
1.00        9.00        0.05       0.95       0.00       1.00
4.00        6.00        0.05       0.95       0.12       0.88
4.90        5.10        0.05       0.95       0.45       0.55
5.05        4.95        0.95       0.05       0.53       0.48
7.00        3.00        0.95       0.05       0.98       0.02
8.00        2.00        0.95       0.05       1.00       0.00

Table 1.3: Comparison of the ϵ-greedy policy (with ϵ = 0.1) and the Boltzmann policy (with τ = 1) for a simple MDP with 6 states and 2 actions. Adapted from Table 4.1 of [GK19].
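The sketch below reproduces the two policies compared in Table 1.3 (with ϵ = 0.1 and τ = 1); the reward estimates are the rows of the table.

```python
import numpy as np

def epsilon_greedy(R_hat, eps=0.1):
    """pi_eps(a|s): mass 1-eps on the greedy action, remaining mass spread uniformly."""
    p = np.full(len(R_hat), eps / len(R_hat))
    p[np.argmax(R_hat)] += 1.0 - eps
    return p

def boltzmann(R_hat, tau=1.0):
    """pi_tau(a|s) proportional to exp(R_hat / tau)."""
    z = np.exp((R_hat - np.max(R_hat)) / tau)   # subtract max for numerical stability
    return z / z.sum()

# Reward estimates from the rows of Table 1.3.
for R_hat in [(1.0, 9.0), (4.0, 6.0), (4.9, 5.1), (5.05, 4.95), (7.0, 3.0), (8.0, 2.0)]:
    R_hat = np.array(R_hat)
    print(R_hat, np.round(epsilon_greedy(R_hat), 2), np.round(boltzmann(R_hat), 2))
```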
The Boltzmann policy explores equally widely in all states. An alternative approach is to try to explore
(state,action) combinations where the consequences of the outcome might be uncertain. This can be achieved
using an exploration bonus Rtb (s, a), which is large if the number of times we have tried action a in state
s is small. We can then add Rb to the regular reward, to bias the behavior in a way that will hopefully
cause the agent to learn useful information about the world. This is called an intrinsic reward function
(Section 5.2.4).
1.4.2 Methods based on the belief state MDP

Suppose we have a way to recursively compute the belief state over model parameters, p(θt |D1:t ).
How do we use this to solve for the policy in the resulting belief state MDP?
In the special case of context-free bandits with a finite number of arms, the optimal policy of this belief
state MDP can be computed using dynamic programming. The result can be represented as a table of action
probabilities, πt (a1 , . . . , aK ), for each step; these are known as Gittins indices [Git89] (see [PR12; Pow22] for
a detailed explanation). However, computing the optimal policy for general contextual bandits is intractable
[PT87].
We can extend the above techniques to the MDP case by constructing a BAMDP, which stands for “Bayes-
Adaptive MDP” [Duf02]. However, this is computationally intractable to solve, so various approximations are
made (see e.g., [Zin+21; AS22; Mik+20]).
1.4.3 Upper confidence bounds (UCBs)

1.4.3.1 Basic idea
To use a UCB strategy, the agent maintains an optimistic reward function estimate R̃t , so that R̃t (st , a) ≥
R(st , a) for all a with high probability, and then chooses the greedy action accordingly:
a_t = argmax_a R̃_t(s_t, a)     (1.35)
UCB can be viewed as a form of exploration bonus, where the optimistic estimate encourages exploration.
Typically, the amount of optimism, R̃t −R, decreases over time so that the agent gradually reduces exploration.
With properly constructed optimistic reward estimates, the UCB strategy has been shown to achieve near-
optimal regret in many variants of bandits [LS19]. (We discuss regret in Section 1.1.4.)
The optimistic function R̃ can be obtained in different ways, sometimes in closed forms, as we discuss
below.
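As one concrete (frequentist) way of constructing such an optimistic estimate, here is a sketch of the classic UCB1 rule for a Bernoulli bandit; the arm probabilities and the constant c are chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

mu_true = np.array([0.2, 0.5, 0.8])    # unknown to the agent; used only to simulate rewards
K, T, c = len(mu_true), 5000, 2.0

counts = np.zeros(K)                   # N_t(a): number of pulls of each arm
sums = np.zeros(K)                     # cumulative reward per arm

for t in range(1, T + 1):
    if t <= K:
        a = t - 1                      # pull each arm once to initialize the estimates
    else:
        mu_hat = sums / counts
        bonus = np.sqrt(c * np.log(t) / counts)
        a = int(np.argmax(mu_hat + bonus))   # greedy wrt an optimistic reward estimate
    r = float(rng.random() < mu_true[a])
    counts[a] += 1
    sums[a] += r

print("pull counts:", counts.astype(int))    # most pulls should go to the best arm
```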
Figure 1.5: Illustration of the reward distribution Q(a) for a Gaussian bandit with 3 different actions, and the
corresponding lower and upper confidence bounds. We show the posterior means Q(a) = µ(a) with a vertical dotted line,
and the scaled posterior standard deviations cσ(a) as a horizontal solid line. From [Sil18]. Used with kind permission
of David Silver.
[AJO08] presents the more sophisticated UCRL2 algorithm, which computes confidence intervals on all the
MDP model parameters at the start of each episode; it then computes the resulting optimistic MDP and
solves for the optimal policy, which it uses to collect more data.
If the posterior is uncertain, the agent will sample many different actions, automatically resulting in exploration.
As the uncertainty decreases, it will start to exploit its knowledge.
To see how we can implement this method, note that we can compute the expression in Equation (1.42)
by using a single Monte Carlo sample θ̃t ∼ p(θ|ht ). We then plug this parameter into our reward model,
and greedily pick the best action:
a_t = argmax_{a′} R(s_t, a′; θ̃_t)     (1.43)
This sample-then-exploit approach will choose actions with exactly the desired probability, since
p_a = ∫ I( a = argmax_{a′} R(s_t, a′; θ̃_t) ) p(θ̃_t | h_t) dθ̃_t = Pr_{θ̃_t ∼ p(θ|h_t)}( a = argmax_{a′} R(s_t, a′; θ̃_t) )     (1.44)
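A minimal sketch of Thompson sampling for a beta-Bernoulli bandit, combining the conjugate belief update of Section 1.2.5 with the sample-then-exploit rule above; the true arm probabilities are invented for the simulation.

```python
import numpy as np

rng = np.random.default_rng(0)

mu_true = np.array([0.3, 0.5, 0.7])    # unknown Bernoulli reward probabilities
K, T = len(mu_true), 5000
alpha, beta = np.ones(K), np.ones(K)   # Beta(1,1) priors on each arm

for t in range(T):
    theta_tilde = rng.beta(alpha, beta)          # one posterior sample per arm
    a = int(np.argmax(theta_tilde))              # greedy wrt the sampled parameters (Eq. 1.43)
    r = float(rng.random() < mu_true[a])
    alpha[a] += r                                # conjugate posterior update
    beta[a] += 1 - r

print("posterior means:", np.round(alpha / (alpha + beta), 2))
print("pulls per arm:", (alpha + beta - 2).astype(int))   # most pulls concentrate on the best arm
```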
Figure 1.6: Illustration of Thompson sampling applied to a linear-Gaussian contextual bandit. The context has the
form st = (1, t, t2 ). (a) True reward for each arm vs time. (b) Cumulative reward per arm vs time. (c) Cumulative
regret vs time. Generated by thompson_sampling_linear_gaussian.ipynb.
Despite its simplicity, this approach can be shown to achieve optimal regret (see e.g., [Rus+18] for a
survey). In addition, it is very easy to implement, and hence is widely used in practice [Gra+10; Sco10;
CL11].
In Figure 1.6, we give a simple example of Thompson sampling applied to a linear regression bandit. The
context has the form st = (1, t, t^2). The true reward function has the form R(st, a) = w_a^T st. The weights per
arm are chosen as follows: w0 = (−5, 2, 0.5), w1 = (0, 0, 0), w2 = (5, −1.5, −1). Thus we see that arm 0 is
initially worse (large negative bias) but gets better over time (positive slope), arm 1 is useless, and arm 2 is
initially better (large positive bias) but gets worse over time. The observation noise is the same for all arms,
σ 2 = 1. See Figure 1.6(a) for a plot of the reward function. We use a conjugate Gaussian-gamma prior and
perform exact Bayesian updating. Thompson sampling quickly discovers that arm 1 is useless. Initially it
pulls arm 2 more, but it adapts to the non-stationary nature of the problem and switches over to arm 0, as
shown in Figure 1.6(b). In Figure 1.6(c), we show that the empirical cumulative regret in blue is close to the
optimal lower bound in red.
Figure 1.7: A graphical model for optimal control.
also forms the foundation of the SAC method discussed in Section 3.5.4, the VMPO method discussed in Section 3.4.4,
and the MPC method discussed in Section 4.1.5.
In the above, we have assumed that R(s, a) < 0, so that Equation (1.46) gives a valid probability. However,
this is not required, since we can simply replace the likelihood term p(Ot = 1|st , at ) with an unnormalized
potential, ϕt (st , at ); this will not affect the results of inference. For brevity, we will just write p(Ot ) rather
than p(Ot = 1), since 1 is just a dummy value.
To simplify notation, we assume a uniform action prior, p(at |st ) = 1/|A|; this is without loss of generality,
since we can always push an informative action prior p(at |st ) into the potential function ϕt (st , at ). (We call
this an “action prior” rather than a policy, since we are going to derive the policy using posterior inference, as
we explain below.) Under these assumptions, the posterior probability of observing a length-T trajectory τ ,
when optimality is achieved at every step, is
" −1
TY
#" T #
Y
p(τ |O1:T ) ∝ p(τ , O1:T ) ∝ p(s1 ) pS (st+1 |st , at ) p(Ot |st , at )
t=1 t=1
−1
TY T
!
X
= p(s1 ) pS (st+1 |st , at ) exp R(st , at ) (1.47)
t=1 t=1
(Typically p(s1 ) is a delta function at the observed initial state s1 .) The intuition of Equation (1.47) is
clearest when the state transitions are deterministic. In this case, pS (st+1 |st , at ) is either 1 or 0, depending
on whether the transition is dynamically feasible or not. Hence we have
p(τ | O_{1:T}) ∝ I(p(τ) ≠ 0) exp( Σ_{t=1}^{T} R(s_t, a_t) )     (1.48)
where the first term determines if τ is feasible or not. In this case, finding the action sequence that maximizes
the sum of rewards is equivalent to inferring the MAP sequence of actions, which we denote by â1:T (s1 ).
(The case of stochastic transitions is more complicated, and will be discussed later.)
For deterministic environments, the optimal policy is open loop, and corresponds to following the optimal
action sequence â1:T (s1 ). (This is like a shortest path planning problem.) However, in the stochastic case,
we need to compute a closed loop policy, π(at |st ), that conditions on the observed state. To compute this,
let us define the following quantities:

β_t(s_t) ≜ p(O_{t:T} | s_t),     β_t(s_t, a_t) ≜ p(O_{t:T} | s_t, a_t)

(These terms are analogous to the backwards messages in the forwards-backwards algorithm for HMMs [Rab89].) Using this notation, we can write the optimal policy using

p(a_t | s_t, O_{t:T}) = β_t(s_t, a_t) / β_t(s_t)

where we have assumed the action prior p(a_t|s_t) = 1/|A| for notational simplicity. (Recall that the action prior is distinct from the optimal policy, which is given by p(a_t | s_t, O_{t:T}).) We also define V(s_t) ≜ log β_t(s_t) and Q(s_t, a_t) ≜ log β_t(s_t, a_t); with the uniform action prior, the backward recursion β_t(s_t) = Σ_{a_t} p(a_t|s_t) β_t(s_t, a_t) can then be written (up to an additive constant) as

V(s_t) = log Σ_{a_t} exp(Q(s_t, a_t))
This is a standard log-sum-exp computation, and is similar to the softmax operation. Thus we call it a soft
value function. When the values of Q(st , at ) are large (which can be ensured by scaling up all the rewards), this approximates the standard hard max operation:

V(s_t) = log Σ_{a_t} exp(Q(s_t, a_t)) ≈ max_{a_t} Q(s_t, a_t)     (1.57)
For the deterministic case, the backup for Q becomes the usual
Q(st , at ) = log p(Ot |st , at ) + log βt+1 (st+1 ) = r(st , at ) + V (st+1 ) (1.58)
where st+1 = f (st , at ) is the next state. However, for the stochastic case, we get
Q(st , at ) = r(st , at ) + log EpS (st+1 |st ,at ) [exp(V (st+1 ))] (1.59)
This replaces the standard expectation over the next state with a softmax. This can result in Q functions that
are optimistic, since if there is one next state with particularly high reward (e.g., you win the lottery), it will
dominate the backup, even if on average it is unlikely. This can result in risk seeking behavior, and is known
as the optimism bias (see e.g., [Mad+17; Cha+21] for discussion). We will discuss a solution to this below.
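To illustrate these backups, here is a sketch of finite-horizon "soft" value iteration on a small made-up tabular MDP (assuming SciPy's logsumexp), alongside the standard hard Bellman backup for comparison; the optimistic (soft) values are never smaller than the hard ones.

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)

# A made-up tabular MDP (4 states, 2 actions) and horizon T, for illustration only.
S, A, T = 4, 2, 10
P = rng.dirichlet(np.ones(S), size=(S, A))     # P[s, a, s'] = p_S(s' | s, a)
R = rng.normal(size=(S, A))                    # r(s, a)

V_soft = np.zeros(S)   # V_{T+1} = 0
V_hard = np.zeros(S)
for t in range(T, 0, -1):
    # Eq. (1.59): optimistic backup over next states, then Eq. (1.57): soft max over actions.
    Q_soft = R + np.log(P @ np.exp(V_soft))            # r(s,a) + log E_{s'}[exp(V(s'))]
    V_soft = logsumexp(Q_soft, axis=1)
    # Standard ("hard") Bellman backup, for comparison.
    Q_hard = R + P @ V_hard
    V_hard = Q_hard.max(axis=1)

print("soft V_1:", np.round(V_soft, 2))
print("hard V_1:", np.round(V_hard, 2))   # soft values are >= hard values (optimism)
```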
1.5.3 Maximum entropy RL
Recall that the true posterior is given by
p(τ | O_{1:T}) ≜ p_*(τ) ∝ p(s_1) ∏_{t=1}^{T−1} p_S(s_{t+1} | s_t, a_t) exp( Σ_{t=1}^{T} R(s_t, a_t) )     (1.60)
In the sections above, we derived the exact posterior over states and actions conditioned on the optimality
variables. However, in general we will have to approximate it.
Let us denote the approximate posterior by q(τ). Variational inference corresponds to minimizing (wrt q) the following objective:
DKL (q(τ ) ∥ p∗ (τ )) = −Eq(τ ) [log p∗ (τ ) − log q(τ )] (1.61)
We can drive this loss to its minimum value of 0 by performing exact inference, which sets q(τ ) = p∗ (τ ),
which is given by
p_*(τ) = p(s_1 | O_{1:T}) ∏_{t=1}^{T−1} p_S(s_{t+1} | s_t, a_t, O_{1:T}) p(a_t | s_t, O_{1:T})     (1.62)
Unfortunately, this uses an optimistic form of the dynamics, pS (st+1 |st , at , O1:T ), in which the agent plans
assuming it directly controls the state distribution itself, rather than just the action distribution. We can
solve this optimism bias problem by instead using a “causal” variational posterior of the following form:8
q(τ) = p(s_1) ∏_{t=1}^{T−1} p_S(s_{t+1} | s_t, a_t) p(a_t | s_t, O_{1:T}) = p(s_1) ∏_{t=1}^{T−1} p_S(s_{t+1} | s_t, a_t) π(a_t | s_t)     (1.63)
where π(at |st ) is the policy we wish to learn. In the case of deterministic transitions, where pS (st+1 |st , at ) =
δ(st+1 − f (st , at )), we do not need this simplification, since pS (st+1 |st , at , O1:T ) = pS (st+1 |st , at ). (And in
both cases p(s1 |O1:T ) = p(s1 ), which is assumed to be a delta function.) We can now write the (negative of)
the objective as follows:
−D_KL(q(τ) ∥ p*(τ)) = E_{q(τ)}[ log p(s_1) + ∑_{t=1}^T (log p_S(s_{t+1}|s_t, a_t) + R(s_t, a_t))    (1.64)
    − log p(s_1) − ∑_{t=1}^T (log p_S(s_{t+1}|s_t, a_t) + log π(a_t|s_t)) ]    (1.65)
  = E_{q(τ)}[ ∑_{t=1}^T (R(s_t, a_t) − log π(a_t|s_t)) ]    (1.66)
  = ∑_{t=1}^T ( E_{q(s_t,a_t)}[R(s_t, a_t)] + E_{q(s_t)}[H(π(·|s_t))] )    (1.67)
This is known as the maximum entropy RL objective [ZABD10]. We can optimize this using the soft actor
critic algorithm which we discuss in Section 3.5.4.
Note that we can tune the magnitude of the entropy regularizer by defining the optimality variable using
p(O_t = 1|s_t, a_t) = exp((1/α) R(s_t, a_t)). This gives the objective
J(π) = ∑_{t=1}^T ( E_{q(s_t,a_t)}[R(s_t, a_t)] + α E_{q(s_t)}[H(π(·|s_t))] )    (1.68)
As α → 0 (equivalent to scaling up the rewards), this approaches the standard (unregularized) RL objective.
8 Unfortunately, this trick is specific to variational inference, which means that other posterior inference methods, such as
sequential Monte Carlo [Pic+19; Lio+22], will still suffer from the optimism bias in the stochastic case (see e.g., [Mad+17] for
discussion).
1.5.4 Active inference
Control as inference is closely related to a technique known as active inference, as we explain below. For
more details on the connection, see [Mil+20; WIP20; LÖW21; Saj+21; Tsc+20].
The active inference technique was developed in the neuroscience community, which has its own vocabulary
for standard ML concepts. We start with the free energy principle [Fri09; Buc+17; SKM18; Ger19;
Maz+22]. The FEP is equivalent to using variational inference to perform state estimation (perception) and
parameter estimation (learning) in a latent variable model. In particular, consider an LVM p(z, o|θ) with
hidden states z, observations o, and parameters θ. We define the variational free energy to be
F(o|θ) = D_KL(q(z|o, θ) ∥ p(z|o, θ)) − log p(o|θ) = E_{q(z|o,θ)}[log q(z|o, θ) − log p(o, z|θ)] ≥ − log p(o|θ)    (1.69)
which is the KL between the approximate variational posterior q and the true posterior p, minus the log evidence
log p(o|θ); the overall quantity F is known as the (variational) free energy. State estimation (perception) corresponds to solving
minq(z|o,θ) F(o|θ), and parameter estimation (model fitting) corresponds to solving minθ F(o|θ), just as in
the EM (expectation maximization) algorithm. (We can also be Bayesian about θ, as in variational Bayes
EM, instead of just computing a point estimate.) This EM procedure will minimize the VFE, which is an
upper bound on the negative log marginal likelihood of the data. In other words, it adjusts the model (belief
state and parameters) so that it better predicts the observations, so the agent is less surprised (minimizes
prediction errors).
To extend the above FEP to decision making problems, we define the expected free energy as follows
G(a) = Eq(o|a) [F(o)] = Eq(o,z|a) [log q(z|o) − log p(o, z)] (1.70)
where q(o|a) is the posterior predictive distribution over future observations given action sequence a. (We
can also condition on any observed history or agent state h, but we omit this (and the model parameters θ)
from the notation for brevity.) We can decompose the EFE (which the agent wants to minimize) into two
terms. First there is the intrinsic value, known as the epistemic drive:
Minimizing this will encourage the agent to choose actions which maximize the mutual information between
the observations o and the hidden states z, thus reducing uncertainty about the hidden states. (This is called
epistemic foraging.) The extrinsic value, known as the exploitation term, is given by
Minimizing this will encourage the agent to choose actions that result in observations that match its prior.
For example, if the agent predicts that the world will look brighter when it flips a light switch, it can take
the action of flipping the switch to fulfill this prediction. This prior can be related to a reward function by
defining p(o) ∝ e^{R(o)}, encouraging goal-directed behavior, exactly as in control-as-inference. However, the
active inference approach provides a way of choosing actions without needing to specify a reward. Since
solving for the optimal action at each step can be slow, it is possible to amortize this cost by training a
policy network to compute π(a|h) = argmin_a G(a|h), where h is the observation history (or current state),
as shown in [Mil20; HL20]; this is called “deep active inference”.
Overall, we see that this framework provides a unified theory of both perception and action, both of which
try to minimize some form of free energy. In particular, minimizing the expected free energy will cause the
agent to pick actions to reduce its uncertainty about its hidden states, which can then be used to improve
its predictive model pθ of observations; this in turn will help minimize the VFE of future observations, by
updating the internal belief state q(z|o, θ) to explain the observations. In other words, the agent acts so it
can learn so it becomes less surprised by what it sees. This ensures the agent is in homeostasis with its
environment.
Note that active inference is often discussed in the context of predictive coding. This is equivalent to a
special case of the FEP where two assumptions are made: (1) the generative model p(z, o|θ) is a nonlinear
hierarchical Gaussian model (similar to a VAE decoder), and (2) the variational posterior approximation uses
a diagonal Laplace approximation, q(z|o, θ) = N(z|ẑ, H^{−1}), with the mode ẑ computed using gradient
descent and H the Hessian at the mode. This can be considered a non-amortized version of a VAE,
where inference (E step) is done with iterated gradient descent, and parameter estimation (M step) is also
done with gradient descent. (A more efficient incremental EM version of predictive coding, which updates
{ẑn : n = 1 : N } and θ in parallel, was recently presented in [Sal+24], and an amortized version in [Tsc+23].)
For more details on predictive coding, see [RB99; Fri03; Spr17; HM20; MSB21; Mar21; OK22; Sal+23;
Sal+24].
Chapter 2
Value-based RL
This is the expected return obtained if we start in state s and follow π to choose actions in a continuing task
(i.e., T = ∞).
Similarly, we define the state-action value function, also known as the Q-function, as follows:
Q_π(s, a) ≜ E_π[G_0 | s_0 = s, a_0 = a] = E_π[ ∑_{t=0}^∞ γ^t r_t | s_0 = s, a_0 = a ]    (2.2)
This quantity represents the expected return obtained if we start by taking action a in state s, and then
follow π to choose actions thereafter.
Finally, we define the advantage function as A_π(s, a) ≜ Q_π(s, a) − V_π(s).
This tells us the benefit of picking action a in state s and then switching to policy π, relative to the baseline
return of always following π. Note that Aπ (s, a) can be both positive and negative, and Eπ(a|s) [Aπ (s, a)] = 0
due to a useful equality: Vπ (s) = Eπ(a|s) [Qπ (s, a)].
A fundamental result about the optimal value function is Bellman’s optimality equations:
Conversely, the optimal value functions are the only solutions that satisfy the equations. In other words,
although the value function is defined as the expectation of a sum of infinitely many rewards, it can be
characterized by a recursive equation that involves only one-step transition and reward models of the MDP.
Such recursions play a central role in many RL algorithms we will see later.
Given a value function (V or Q), the discrepancy between the right- and left-hand sides of Equations (2.4)
and (2.5) is called the Bellman error or Bellman residual. We can define the Bellman operator B given
an MDP M = (R, T) and policy π as a function that takes a value function V and derives a new value function
V′ that satisfies
V′(s) = B^π_M V(s) ≜ E_{π(a|s)}[ R(s, a) + γ E_{T(s′|s,a)}[V(s′)] ]    (2.6)
This reduces the Bellman error. Applying the Bellman operator to a state is called a Bellman backup. If
we iterate this process (using the analogous optimality operator, which replaces E_{π(a|s)} with max_a), we will
converge to the optimal value function V_∗, as we discuss in Section 2.2.1.
Given the optimal value function, we can derive an optimal policy using
Following such an optimal policy ensures the agent achieves maximum expected return starting from any
state.
The problem of solving for V∗ , Q∗ or π∗ is called policy optimization. In contrast, solving for Vπ or Qπ
for a given policy π is called policy evaluation, which constitutes an important subclass of RL problems as
will be discussed in later sections. For policy evaluation, we have similar Bellman equations, which simply
replace maxa {·} in Equations (2.4) and (2.5) with Eπ(a|s) [·].
In Equations (2.7) and (2.8), as in the Bellman optimality equations, we must take a maximum over all
actions in A, and the maximizing action is called the greedy action with respect to the value functions,
Q∗ or V∗ . Finding greedy actions is computationally easy if A is a small finite set. For high dimensional
continuous spaces, see Section 2.5.4.1.
Figure 2.1: Left: illustration of a simple MDP corresponding to a 1d grid world of 3 non-absorbing states and 2
actions. Right: optimal Q-functions for different values of γ. Adapted from Figures 3.1, 3.2, 3.4 of [GK19].
what we desire. A proper choice of γ is up to the agent designer, just like the design of the reward function,
and has to reflect the desired behavior of the agent.
2.2 Computing the value function and policy given a known world
model
In this section, we discuss how to compute the optimal value function (the prediction problem) and the
optimal policy (the control problem) when the MDP model is known. (Sometimes the term planning is
used to refer to computing the optimal policy, given a known model, but planning can also refer to computing
a sequence of actions, rather than a policy.) The algorithms we discuss are based on dynamic programming
(DP) and linear programming (LP).
For simplicity, in this section, we assume discrete state and action sets with γ < 1. However, even though
the cost of exactly computing an optimal policy grows only polynomially in the sizes of S and A, it becomes
intractable when, for example, the state space is a Cartesian product of several finite sets, so that |S| grows
exponentially with the number of factors. This challenge is known as
the curse of dimensionality. Therefore, approximations are typically needed, such as using parametric
or nonparametric representations of the value function or policy, both for computational tractability and
for extending the methods to handle MDPs with general state and action sets. This requires the use of
approximate dynamic programming (ADP) and approximate linear programming (ALP) algorithms
(see e.g., [Ber19]).
Note that the update rule, sometimes called a Bellman backup, is exactly the right-hand side of the
Bellman optimality equation, Equation (2.4), with the unknown V_∗ replaced by the current estimate V_k. A
fundamental property of Equation (2.9) is that the update is a contraction: it can be verified that
max_s |V_{k+1}(s) − V_∗(s)| ≤ γ max_s |V_k(s) − V_∗(s)|    (2.10)
In other words, every iteration will reduce the maximum value function error by a constant factor.
Vk will converge to V∗ , after which an optimal policy can be extracted using Equation (2.8). In practice,
we can often terminate VI when Vk is close enough to V∗ , since the resulting greedy policy wrt Vk will be
near optimal. Value iteration can be adapted to learn the optimal action-value function Q∗ .
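To make the procedure concrete, here is a minimal tabular value iteration sketch in Python/numpy; the function and argument names are my own, and a simple sup-norm stopping rule is assumed.

import numpy as np

def value_iteration(R, P, gamma=0.9, tol=1e-6):
    # R: [S, A] expected rewards, P: [S, A, S] transition probabilities.
    # Iterates the Bellman optimality backup until the sup-norm change is below tol,
    # then extracts a greedy policy, as in Equation (2.8).
    S, A = R.shape
    V = np.zeros(S)
    while True:
        Q = R + gamma * (P @ V)      # one-step lookahead, [S, A]
        V_new = Q.max(axis=1)        # Bellman optimality backup
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q, Q.argmax(axis=1)
        V = V_new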
We can guarantee that Vπ′ ≥ Vπ . This is called the policy improvement theorem. To see this, define r ′ ,
T′ and v ′ as before, but for the new policy π ′ . The definition of π ′ implies r ′ + γT′ v ≥ r + γTv = v, where
the equality is due to Bellman's equation. Repeating this argument, we have
v ≤ r′ + γT′v ≤ r′ + γT′(r′ + γT′v) ≤ r′ + γT′(r′ + γT′(r′ + γT′v)) ≤ · · ·    (2.13)
  = (I + γT′ + γ²T′² + · · ·)r′ = (I − γT′)^{−1} r′ = v′    (2.14)
Figure 2.2: Policy iteration vs value iteration represented as backup diagrams. Empty circles represent states, solid
(filled) circles represent states and actions. Adapted from Figure 8.6 of [SB18].
Starting from an initial policy π_0, policy iteration alternates between policy evaluation (E) and improvement
(I) steps, as illustrated below:
π_0 →^E V_{π_0} →^I π_1 →^E V_{π_1} →^I · · · →^I π_∗ →^E V_∗    (2.15)
The algorithm stops at iteration k if the policy π_k is greedy with respect to its own value function V_{π_k}. In
this case, the policy is optimal. Since there are at most |A|^{|S|} deterministic policies, and every iteration
strictly improves the policy, the algorithm must converge after finitely many iterations.
In PI, we alternate between policy evaluation (which involves multiple iterations, until convergence of
Vπ ), and policy improvement. In VI, we alternate between one iteration of policy evaluation followed by one
iteration of policy improvement (the “max” operator in the update rule). We are in fact free to intermix any
number of these steps in any order. The process will converge once the policy is greedy wrt its own value
function.
Note that policy evaluation computes Vπ whereas value iteration computes V∗ . This difference is illustrated
in Figure 2.2, using a backup diagram. Here the root node represents any state s, nodes at the next level
represent state-action combinations (solid circles), and nodes at the leaves represent the set of possible
resulting next states s′ for each possible action. In PE, we average over all actions according to the policy,
whereas in VI, we take the maximum over all actions.
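For comparison, here is a minimal policy iteration sketch in Python/numpy that makes the E and I steps of Equation (2.15) explicit; the names are my own, and exact policy evaluation via a linear solve is assumed.

import numpy as np

def policy_iteration(R, P, gamma=0.9):
    # R: [S, A] expected rewards, P: [S, A, S] transition probabilities.
    S, A = R.shape
    policy = np.zeros(S, dtype=int)              # arbitrary initial deterministic policy
    while True:
        # E step: solve (I - gamma * P_pi) v = r_pi exactly for V_pi.
        P_pi = P[np.arange(S), policy]           # [S, S]
        r_pi = R[np.arange(S), policy]           # [S]
        v = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
        # I step: act greedily wrt the one-step lookahead.
        Q = R + gamma * (P @ v)
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):   # greedy wrt its own value function
            return policy, v
        policy = new_policy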
2.3 Computing the value function without knowing the world model
In the rest of this chapter, we assume the agent only has access to samples from the environment, (s′ , r) ∼
p(s′, r|s, a). We will show how to use these samples to learn the optimal value function and Q-function, even
without knowing the MDP dynamics.
where η is the learning rate, and the term in brackets is an error term. We can use a similar technique to
estimate Qπ (s, a) = E [Gt |st = s, at = a] by simply starting the rollout with action a.
We can use MC estimation of Q, together with policy iteration (Section 2.2.3), to learn an optimal policy.
Specifically, at iteration k, we compute a new, improved policy using πk+1 (s) = argmaxa Qk (s, a), where Qk
is approximated using MC estimation. This update can be applied to all the states visited on the sampled
trajectory. This overall technique is called Monte Carlo control.
To ensure this method converges to the optimal policy, we need to collect data for every (state, action)
pair, at least in the tabular case, since there is no generalization across different values of Q(s, a). One way
to achieve this is to use an ϵ-greedy policy (see Section 1.4.1). Since this is an on-policy algorithm, the
resulting method will converge to the optimal ϵ-soft policy, as opposed to the optimal policy. It is possible to
use importance sampling to estimate the value function for the optimal policy, even if actions are chosen
according to the ϵ-greedy policy. However, it is simpler to just gradually reduce ϵ.
where η is the learning rate. (See [RFP15] for ways to adaptively set the learning rate.) The δt = yt − V (st )
term is known as the TD error. A more general form of TD update for parametric value function
representations is w_{t+1} = w_t + η (U_t − V_w(s_t)) ∇_w V_w(s_t), where U_t is a target value such as r_t + γ V_w(s_{t+1});
comparing with the tabular case, we see that Equation (2.16) is a special case. The TD update rule for evaluating Q_π is similar, except we
replace states with states and actions.
It can be shown that TD learning in the tabular case, Equation (2.16), converges to the correct value func-
tion, under proper conditions [Ber19]. However, it may diverge when using nonlinear function approximators,
as we discuss in Section 2.5.2.4. The reason is that this update is a “semi-gradient”, which refers to the fact
that we only take the gradient wrt the value function, ∇w V (st , wt ), treating the target Ut as constant.
The potential divergence of TD is also consistent with the fact that Equation (2.18) does not correspond
to a gradient update on any objective function, despite having a very similar form to SGD (stochastic gradient
descent). Instead, it is an example of bootstrapping, in which the estimate, Vw (st ), is updated to approach
a target, rt + γVw (st+1 ), which is defined by the value function estimate itself. This idea is shared by DP
methods like value iteration, although they rely on the complete MDP model to compute an exact Bellman
backup. In contrast, TD learning can be viewed as using sampled transitions to approximate such backups.
An example of a non-bootstrapping approach is the Monte Carlo estimation in the previous section. It
samples a complete trajectory, rather than individual transitions, to perform an update; this avoids the
divergence issue, but is often much less efficient. Figure 2.3 illustrates the difference between MC, TD, and
DP.
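As an illustration, here is a minimal tabular TD(0) policy-evaluation sketch in Python; the function name and the simple environment interface (env.reset() returning a state, env.step(a) returning (s', r, done)) are assumptions of mine.

import numpy as np

def td0_evaluation(env, policy, num_states, gamma=0.99, lr=0.1, num_episodes=500):
    # After each transition (s, a, r, s'), V(s) is nudged towards the bootstrapped
    # target r + gamma * V(s'), i.e. by lr times the TD error.
    V = np.zeros(num_states)
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            s_next, r, done = env.step(policy(s))
            target = r + (0.0 if done else gamma * V[s_next])
            V[s] += lr * (target - V[s])
            s = s_next
    return V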
Figure 2.3: Backup diagrams of V (st ) for Monte Carlo, temporal difference, and dynamic programming updates of the
state-value function. Used with kind permission of Andy Barto.
time t + 1 is replaced by its value function estimate. In contrast, MC waits until the end of the episode
or until T is large enough, then uses the estimate Gt:T = rt + γrt+1 + · · · + γ T −t−1 rT −1 . It is possible to
interpolate between these by performing an n-step rollout, and then using the value function to approximate
the return for the rest of the trajectory, similar to heuristic search (Section 4.1.2). That is, we can use the
n-step return
Rather than picking a specific lookahead value, n, we can take a weighted average of all possible values,
with a single parameter λ ∈ [0, 1], by using
G^λ_t ≜ (1 − λ) ∑_{n=1}^∞ λ^{n−1} G_{t:t+n}    (2.23)
This is called the lambda return. Note that these coefficients sum to one (since ∑_{n=1}^∞ (1 − λ)λ^{n−1} = (1−λ)/(1−λ) = 1
for λ < 1), so the return is a convex combination of n-step returns. See Figure 2.4 for an illustration. We can
now use Gλt inside the TD update instead of Gt:t+n ; this is called TD(λ).
Note that, if a terminal state is entered at step T (as happens with episodic tasks), then all subsequent
n-step returns are equal to the conventional return, Gt . Hence we can write
G^λ_t = (1 − λ) ∑_{n=1}^{T−t−1} λ^{n−1} G_{t:t+n} + λ^{T−t−1} G_t    (2.24)
From this we can see that if λ = 1, the λ-return becomes equal to the regular MC return G_t. If λ = 0, the
λ-return becomes equal to the one-step return G_{t:t+1} (since 0^{n−1} = 1 iff n = 1), so standard TD learning is
often called TD(0) learning. This episodic form also gives us the following recursive equation
Figure 2.4: The backup diagram for TD(λ). Standard TD learning corresponds to λ = 0, and standard MC learning
corresponds to λ = 1. From Figure 12.1 of [SB18]. Used with kind permission of Richard Sutton.
(This trace term gets reset to 0 at the start of each episode.) We replace the TD(0) update, w_{t+1} =
w_t + η δ_t ∇_w V_w(s_t), with the TD(λ) version to get
where a′ ∼ π(s′ ) is the action the agent will take in state s′ . After Q is updated (for policy evaluation), π
also changes accordingly as it is greedy with respect to Q (for policy improvement). This algorithm, first
proposed by [RN94], was further studied and renamed to SARSA by [Sut96]; the name comes from its
update rule that involves an augmented transition (s, a, r, s′ , a′ ).
In order for SARSA to converge to Q∗ , every state-action pair must be visited infinitely often, at least in
the tabular case, since the algorithm only updates Q(s, a) for (s, a) that it visits. One way to ensure this
condition is to use a “greedy in the limit with infinite exploration” (GLIE) policy. An example is the ϵ-greedy
policy, with ϵ vanishing to 0 gradually. It can be shown that SARSA with a GLIE policy will converge to Q∗
and π∗ [Sin+00].
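A minimal tabular SARSA sketch in Python is given below; the names and the environment interface are my assumptions, and epsilon is held fixed rather than annealed as a GLIE schedule would require.

import numpy as np

def sarsa(env, num_states, num_actions, gamma=0.99, lr=0.1, epsilon=0.1, num_episodes=1000):
    # The update uses the quintuple (s, a, r, s', a'), where a' is the action
    # actually taken in s' by the current epsilon-greedy policy (on-policy).
    Q = np.zeros((num_states, num_actions))

    def eps_greedy(s):
        if np.random.rand() < epsilon:
            return np.random.randint(num_actions)
        return int(Q[s].argmax())

    for _ in range(num_episodes):
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = eps_greedy(s_next)
            target = r + (0.0 if done else gamma * Q[s_next, a_next])
            Q[s, a] += lr * (target - Q[s, a])
            s, a = s_next, a_next
    return Q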
2.5 Q-learning: off-policy TD control
SARSA is an on-policy algorithm, which means it learns the Q-function for the policy it is currently using,
which is typically not the optimal policy, because of the need to perform exploration. However, with a
simple modification, we can convert this to an off-policy algorithm that learns Q∗ , even if a suboptimal or
exploratory policy is used to choose actions.
This is the update rule of Q-learning for the tabular case [WD92].
Since it is off-policy, the method can use (s, a, r, s′) tuples coming from any data source, such as older
versions of the policy, or log data from an existing (non-RL) system. If every state-action pair is visited
infinitely often, the algorithm provably converges to Q∗ in the tabular case, with properly decayed learning
rates [Ber19]. Algorithm 1 gives a vanilla implementation of Q-learning with ϵ-greedy exploration.
For terminal states, s ∈ S^+, we know that Q(s, a) = 0 for all actions a. Consequently, for the optimal
value function, we have V_∗(s) = max_a Q_∗(s, a) = 0 for all terminal states. When performing online learning,
we don't usually know which states are terminal. Therefore we assume that, whenever we take a step in the
environment, we get the next state s′ and reward r, but also a binary indicator done(s′) that tells us if s′ is
terminal. In this case, we set the target value in Q-learning to V_∗(s′) = 0, yielding the modified update rule:
Q(s, a) ← Q(s, a) + η [ r + (1 − done(s′)) γ max_{a′} Q(s′, a′) − Q(s, a) ]    (2.30)
For brevity, we will usually ignore this factor in the subsequent equations, but it needs to be implemented in
the code.
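A minimal tabular Q-learning sketch, including the done mask of Equation (2.30), is shown below; the environment interface and names are my assumptions.

import numpy as np

def q_learning(env, num_states, num_actions, gamma=0.9, lr=0.1, epsilon=0.1, num_episodes=1000):
    Q = np.zeros((num_states, num_actions))
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy behavior policy
            if np.random.rand() < epsilon:
                a = np.random.randint(num_actions)
            else:
                a = int(Q[s].argmax())
            s_next, r, done = env.step(a)
            # off-policy target: bootstrap with the max over next actions, masked by done
            target = r + (1.0 - float(done)) * gamma * Q[s_next].max()
            Q[s, a] += lr * (target - Q[s, a])
            s = s_next
    return Q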
Figure 2.5 gives an example of Q-learning applied to the simple 1d grid world from Figure 2.1, using
γ = 0.9. We show the Q-function at the start and end of each episode, after performing actions chosen by an
ϵ-greedy policy. We initialize Q(s, a) = 0 for all entries, and use a step size of η = 1. At convergence, we have
Q∗ (s, a) = r + γQ∗ (s′ , a∗ ), where a∗ =↓ for all states.
Figure 2.5: Illustration of Q learning for one random trajectory in the 1d grid world in Figure 2.1 using ϵ-greedy
exploration. At the end of episode 1, we make a transition from S3 to ST 2 and get a reward of r = 1, so we estimate
Q(S3 , ↓) = 1. In episode 2, we make a transition from S2 to S3 , so S2 gets incremented by γQ(S3 , ↓) = 0.9. Adapted
from Figure 3.3 of [GK19].
2.5.2 Q learning with function approximation
To make Q learning work with high-dimensional state spaces, we have to replace the tabular (non-parametric)
representation with a parametric approximation, denoted Qw (s, a). We can update this function using one or
more steps of SGD on the following loss function
L(w|s, a, r, s′) = ( r + γ max_{a′} Q_w(s′, a′) − Q_w(s, a) )²    (2.31)
Since nonlinear functions need to be trained on minibatches of data, we compute the average loss over multiple
randomly sampled experience tuples (see Section 2.5.2.3 for discussion) to get
2.5.2.2 DQN
The influential deep Q-network or DQN paper of [Mni+15] also used neural nets to represent the Q function,
but performed a smaller number of gradient updates per iteration. Furthermore, they proposed to modify the
target value when fitting the Q function in order to avoid instabilities during training (see Section 2.5.2.4 for
details).
The DQN method became famous since it was able to train agents that can outperform humans when
playing various Atari games from the ALE (Atari Learning Environment) benchmark [Bel+13]. Here the
input is a small color image, and the action space corresponds to moving left, right, up or down, plus an
optional shoot action.1
Since 2015, many more extensions to DQN have been proposed, with the goal of improving performance
in various ways, either in terms of peak reward obtained, or sample efficiency (e.g., reward obtained after only
1 For more discussion of ALE, see [Mac+18a], and for a recent extension to continuous actions (representing joystick control),
see the CALE benchmark of [FC24]. Note that DQN was not the first deep RL method to train an agent from pixel input; that
honor goes to [LR10], who trained an autoencoder to embed images into low-dimensional latents, and then used neural fitted Q
learning (Section 2.5.2.1) to fit the Q function.
Figure 2.6: (a) A simple MDP. (b) Parameters of the policy diverge over time. From Figures 11.1 and 11.2 of [SB18].
Used with kind permission of Richard Sutton.
100k steps in the environment, as proposed in the Atari-100k benchmark [Kai+19]), or training stability, or
all of the above. We discuss some of these extensions in Section 2.5.4.
Since Q learning is an off-policy method, we can update the Q function using any data source. This is
particularly important when we use nonlinear function approximation (see Section 2.5.2), which often needs a
lot of data for model fitting. A natural source of data is data collected earlier in the trajectory of the agent;
this is called an experience replay buffer, which stores (s, a, r, s′ ) transition tuples into a buffer. This can
improve the stability and sample efficiency of learning, and was originally proposed in [Lin92].
This modification has two advantages. First, it improves data efficiency as every transition can be used
multiple times. Second, it improves stability in training, by reducing the correlation of the data samples
that the network is trained on, since the training tuples do not have to come from adjacent moments in time.
(Note that experience replay requires the use of off-policy learning methods, such as Q learning, since the
training data is sampled from older versions of the policy, not the current policy.)
It is possible to replace the uniform sampling from the buffer with one that favors more important
transition tuples that may be more informative about Q. This idea is formalized in [Sch+16a], who develop a
technique known as prioritized experience replay.
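A minimal uniform replay buffer might look as follows; the class name is mine, and prioritized replay would replace the uniform sampling with sampling proportional to (for example) the absolute TD error of each transition.

import random
from collections import deque

class ReplayBuffer:
    # Stores (s, a, r, s', done) transition tuples; old transitions are evicted first.
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)   # uniform sampling
        s, a, r, s_next, done = zip(*batch)
        return s, a, r, s_next, done

    def __len__(self):
        return len(self.buffer)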
The problem with the naive Q learning objective in Equation (2.31) is that it can lead to instability, since
the target we are regressing towards uses the same parameters w as the function we are updating. So the
network is “chasing its own tail”. Although this is fine for tabular models, it can fail for nonlinear models, as
we discuss below.
In general, an RL algorithm can become unstable when it has these three components: function approxi-
mation (such as neural networks), bootstrapped value function estimation (i.e., using TD-like methods instead
of MC), and off-policy learning (where the actions are sampled from some distribution other than the policy
that is being optimized). This combination is known as the deadly triad [Sut15; van+18].
A classic example of this is the simple MDP depicted in Figure 2.6a, due to [Bai95]. (This is known as
Baird’s counter example.) It has 7 states and 2 actions. Taking the dashed action takes the environment
to the 6 upper states uniformly at random, while the solid action takes it to the bottom state. The reward is
0 in all transitions, and γ = 0.99. The value function Vw uses a linear parameterization indicated by the
expressions shown inside the states, with w ∈ R^8. The target policy π always chooses the solid action in
every state. Clearly, the true value function, Vπ (s) = 0, can be exactly represented by setting w = 0.
Suppose we use a behavior policy b to generate a trajectory, which chooses the dashed and solid actions
with probabilities 6/7 and 1/7, respectively, in every state. If we apply TD(0) on this trajectory, the
parameters diverge to ∞ (Figure 2.6b), even though the problem appears simple. In contrast, with on-policy
data (that is, when b is the same as π), TD(0) with linear approximation can be guaranteed to converge to
a good value function approximation [TR97]. The difference is that with on-policy learning, as we improve
the value function, we also improve the policy, so the two become self-consistent, whereas with off-policy
learning, the behavior policy may not match the optimal value function that is being learned, leading to
inconsistencies.
The divergence behavior is demonstrated in many value-based bootstrapping methods, including TD,
Q-learning, and related approximate dynamic programming algorithms, where the value function is represented
either linearly (like the example above) or nonlinearly [Gor95; TVR97; OCD21]. The root cause of these
divergence phenomena is that bootstrapping methods typically are not minimizing a fixed objective function.
Rather, they create a learning target using their own estimates, thus potentially creating a self-reinforcing
loop to push the estimates to infinity. More formally, the problem is that the contraction property in the
tabular case (Equation (2.10)) may no longer hold when V is approximated by Vw .
We discuss some solutions to the deadly triad problem below.
for training Q_w. We can periodically set w⁻ ← sg(w), usually after a few episodes, where the stop-gradient
operator is used to prevent autodiff propagating gradients back to w. Alternatively, we can use an exponential
moving average (EMA) of the weights, i.e., w⁻ ← ρw⁻ + (1 − ρ)sg(w), where ρ close to 1 ensures that Q_{w⁻}
only slowly catches up with Q_w. (If ρ = 0, we say that this is a detached target, since it is just a frozen copy of
the current weights.) The final loss has the form
L(w) = E_{(s,a,r,s′)∼U(D)}[ L(w|s, a, r, s′) ]    (2.34)
L(w|s, a, r, s′) = ( q(r, s′; w⁻) − Q_w(s, a) )²    (2.35)
Theoretical work justifying this technique is given in [FSW23; Che+24a].
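As a concrete sketch of the loss in Equations (2.34) and (2.35), here is a single gradient step of Q-learning with a linear function approximator and an EMA target network, written in Python/numpy; the linear parameterization, function names, and default hyperparameters are my own assumptions.

import numpy as np

def q_loss_grad(w, w_targ, batch, gamma=0.99):
    # Q_w(s, a) = w[a] @ phi(s); batch is a list of (phi_s, a, r, phi_s_next, done).
    # Returns the gradient of the mean squared TD loss wrt w, with the target
    # computed from the slow-moving parameters w_targ (treated as a constant).
    grad = np.zeros_like(w)
    for phi_s, a, r, phi_s_next, done in batch:
        q_next = 0.0 if done else np.max(w_targ @ phi_s_next)
        td_error = (r + gamma * q_next) - w[a] @ phi_s
        grad[a] -= 2.0 * td_error * phi_s
    return grad / len(batch)                 # apply with w -= lr * grad

def ema_update(w_targ, w, rho=0.99):
    # w_targ <- rho * w_targ + (1 - rho) * w; rho close to 1 keeps the target slow-moving.
    return rho * w_targ + (1.0 - rho) * w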
Figure 2.7: We generate a dataset (left) with inputs x distributed in a circle with radius 0.5 and labels y = ||x||. We
then fit a two-layer MLP without LayerNorm (center) and with LayerNorm (right). LayerNorm bounds the values and
prevents catastrophic overestimation when extrapolating. From Figure 3 of [Bal+23]. Used with kind permission of
Philip Ball.
Figure 2.8: Comparison of Q-learning and double Q-learning on a simple episodic MDP using ϵ-greedy action selection
with ϵ = 0.1. The initial state is A, and squares denote absorbing states. The data are averaged over 10,000 runs.
From Figure 6.5 of [SB18]. Used with kind permission of Richard Sutton.
So we see that Q1 uses Q2 to choose the best action but uses Q1 to evaluate it, and vice versa. This technique
is called double Q-learning [Has10]. Figure 2.8 shows the benefits of the algorithm over standard Q-learning
in a toy problem.
In [HGS16], they combine double Q learning with deep Q networks (Section 2.5.2.2) to get double DQN.
This modifies Equation (2.37) to use the current network for action proposals, but the target network for
action evaluation. Thus the training target becomes
In Section 3.6.2 we discuss an extension called clipped double DQN which uses two Q networks and
their frozen copies to define the following target:
q(r, s′; w_{1:2}, w⁻_{1:2}) = r + γ min_{i=1,2} Q_{w⁻_i}(s′, argmax_{a′} Q_{w_i}(s′, a′))    (2.39)
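A minimal sketch of how the clipped double-Q target of Equation (2.39) can be computed for a single transition is given below; the array conventions and names are my assumptions.

import numpy as np

def clipped_double_q_target(r, q_next_online, q_next_frozen, gamma=0.99):
    # q_next_online: [2, A] values Q_{w_i}(s', .) from the two online networks.
    # q_next_frozen: [2, A] values from their frozen copies.
    # Each online network proposes a greedy action; its frozen copy evaluates it,
    # and the minimum of the two evaluations is used to reduce overestimation.
    a_star = q_next_online.argmax(axis=1)
    evals = q_next_frozen[np.arange(2), a_star]
    return r + gamma * evals.min()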
The double DQN method is extended in the REDQ (randomized ensembled double Q learning) method
of [Che+20], which uses an ensemble of N > 2 Q-networks. Furthermore, at each step, it draws a random
sample of M ≤ N networks, and takes the minimum over them when computing the target value. That is, it
uses the following update (see Algorithm 2 in appendix of [Che+20]):
where M is a random subset from the N value functions. The ensemble reduces the variance, and the
minimum reduces the overestimation bias.2 If we set N = M = 2, we get a method similar to clipped double
Q learning. (Note that REDQ is very similar to the Random Ensemble Mixture method of [ASN20],
which was designed for offline RL.)
Q learning is not directly applicable to continuous actions due to the need to compute the argmax over
actions. An early solution to this problem, based on neural fitted Q learning (see Section 2.5.2.1), is proposed
in [HR11]. This became the basis of the DDPG algorithm of Section 3.6.1, which learns a policy to predict
the argmax.
An alternative approach is to use gradient-free optimizers, such as the cross-entropy method, to approximate
the argmax; this is the approach used by the QT-Opt method of [Kal+18]. The method of [Met+17] instead
treats the action vector a as a sequence of actions, and optimizes one dimension at a time. The CAQL
(continuous action Q-learning) method of [Ryu+20]
uses mixed integer programming to solve the argmax problem, leveraging the ReLU structure of the Q-network.
The method of [Sey+22] quantizes each action dimension separately, and then solves the argmax problem
using methods inspired by multi-agent RL.
2 In addition, REDQ performs G ≫ 1 updates of the value functions for each environment step; this high Update-To-Data
(UTD) ratio (also called Replay Ratio) is critical for sample efficiency, and is commonly used in model-based RL.
2.5.4.2 Dueling DQN
The dueling DQN method of [Wan+16], learns a value function and an advantage function, and derives the
Q function, rather than learning it directly. This is helpful when there are many actions with similar Q-values,
since the advantage A(s, a) = Q(s, a) − V (s) focuses on the differences in value relative to a shared baseline.
In more detail, we define a network with |A| + 1 output heads, which computes Aw (s, a) for a = 1 : A
and Vw (s). We can then derive
Qw (s, a) = Vw (s) + Aw (s, a) (2.41)
However, this naive approach ignores the following constraint that holds for any policy π:
Thus we can satisfy the constraint for the optimal policy by subtracting off max_a A(s, a) from the advantage
head; equivalently, we can compute the Q function using Q_w(s, a) = V_w(s) + A_w(s, a) − max_{a′} A_w(s, a′).
In practice, the max is replaced by an average, which seems to work better empirically.
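A sketch of how the two heads can be combined is shown below; the function name is mine, and the mean variant corresponds to the empirically preferred choice mentioned above.

import numpy as np

def dueling_q_values(v, adv, use_mean=True):
    # v: [B] state values, adv: [B, A] advantages.
    # Subtracting the max (or, in practice, the mean) of the advantages makes the
    # decomposition identifiable, since adding a constant to adv and subtracting it
    # from v would otherwise leave Q unchanged.
    baseline = adv.mean(axis=1, keepdims=True) if use_mean else adv.max(axis=1, keepdims=True)
    return v[:, None] + adv - baseline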
This can be implemented for episodic environments by storing experience tuples of the form
τ = (s, a, ∑_{k=1}^n γ^{k−1} r_k, s_n, done)    (2.50)
where done = 1 if the trajectory ended at any point during the n-step rollout.
Figure 2.9: Plot of median human-normalized score over all 57 Atari games for various DQN agents. The yellow,
red and green curves are distributional RL methods (Section 5.1), namely categorical DQN (C51) (Section 5.1.2),
Quantile Regression DQN (Section 5.1.1), and Implicit Quantile Networks [Dab+18]. Figure from
https://github.com/google-deepmind/dqn_zoo.
Theoretically this method is only valid if all the intermediate actions, a2:n−1 , are sampled from the current
optimal policy derived from Qw , as opposed to some behavior policy, such as epsilon greedy or some samples
from the replay buffer from an old policy. In practice, we can just restrict sampling to recent samples from
the replay buffer, making the resulting method approximately on-policy.
Instead of using a fixed n, it is possible to use a weighted combination of returns; this is known as the
Q(λ) algorithm [PW94; Koz+21].
2.5.4.5 Rainbow
The Rainbow method of [Hes+18] combined 6 improvements to the vanilla DQN method, as listed below.
(The paper is called “Rainbow” due to the color coding of their results plot, a modified version of which is
shown in Figure 2.9.) At the time it was published (2018), this produced SOTA results on the Atari-200M
benchmark. The 6 improvements are as follows:
• Use a larger CNN with residual connections, namely the Impala network from [Esp+18] with the
modifications (including the use of spectral normalization) proposed in [SS21].
• Use Munchausen RL [VPG20], which modifies the Q learning update rule by adding an entropy-like
penalty.
• Collect 1 environment step from 64 parallel workers for each minibatch update (rather than taking
many steps from a smaller number of workers).
2.5.4.6 Bigger, Better, Faster
At the time of writing this document (2024), the SOTA on the 100k sample-efficient Atari benchmark [Kai+19]
is obtained by the BBF algorithm of [Sch+23b]. (BBF stands for “Bigger, Better, Faster”.) It uses the
following tricks, in order of decreasing importance:
• Use a larger CNN with residual connections, namely a modified version of the Impala network from
[Esp+18].
• Increase the update-to-data (UTD) ratio (number of times we update the Q function for every
observation that is observed), in order to increase sample efficiency [HHA19].
• Use a periodic soft reset of (some of) the network weights to avoid loss of plasticity due to increased
network updates, following the SR-SPR method of [D’O+22].
• Use n-step returns, as in Section 2.5.4.4, and gradually decrease (anneal) the n-step return from
n = 10 to n = 3, trading off bias and variance as the value estimates improve.
• Add weight decay.
• Add a self-predictive representation loss (Section 4.3.2.2) to increase sample efficiency.
• Gradually increase the discount factor from γ = 0.97 to γ = 0.997, to encourage longer term planning
once the model starts to be trained.3
• Drop noisy nets (which requires multiple network copies and thus slows down training due to increased
memory use), since it does not help.
• Use dueling DQN (see Section 2.5.4.2).
3 The Agent 57 method of [Bad+20] automatically learns the exploration rate and discount factor using a multi-armed
bandit strategy, which lets it be more exploratory or more exploitative, depending on the game. This resulted in superhuman
performance on all 57 Atari games in ALE. However, it required 80 billion frames (environment steps)! This was subsequently
reduced to the “standard” 200M frames in the MEME method of [Kap+22].
Chapter 3
Policy-based RL
In the previous section, we considered methods that estimate the action-value function, Q(s, a), from which
we derive a policy. However, these methods have several disadvantages: (1) they can be difficult to apply to
continuous action spaces; (2) they may diverge if function approximation is used (see Section 2.5.2.4); (3)
the training of Q, often based on TD-style updates, is not directly related to the expected return garnered
by the learned policy; (4) they learn deterministic policies, whereas in stochastic and partially observed
environments, stochastic policies are provably better [JSJ94].
In this section, we discuss policy search methods, which directly optimize the parameters of the policy
so as to maximize its expected return. We mostly focus on policy gradient methods, that use the gradient
of the loss to guide the search. As we will see, these policy methods often benefit from estimating a value or
advantage function to reduce the variance in the policy search process, so we will also use techniques from
Chapter 2. The parametric policy will be denoted by πθ (a|s). For discrete actions, this can be a DNN with a
softmax output. For continuous actions, we can use a Gaussian output layer, or a diffusion policy [Ren+24].
For more details on policy gradient methods, see [Wen18b; Leh24].
where pπ (s0 → s, t) is the probability of going from s0 to s in t steps, and pπt (s) is the marginal probability of
being in state s at time t (after each episodic reset). Note that ρ^γ_π is a measure of time spent in non-terminal
states, but it is not a probability measure, since it is not normalized, i.e., ∑_s ρ^γ_π(s) ≠ 1. However, we may
abuse notation and still treat it like a probability, so we can write things like
E_{ρ^γ_π(s)}[f(s)] = ∑_s ρ^γ_π(s) f(s)    (3.6)
This is known as the policy gradient theorem [Sut+99]. In statistics, the term ∇θ log πθ (a|s) is called the
(Fisher) score function1 , so sometimes Equation (3.11) is called the score function estimator or SFE
[Fu15; Moh+20].
3.2 REINFORCE
One way to apply the policy gradient theorem to optimize a policy is to use stochastic gradient ascent.
Theoretical results concerning the convergence and sample complexity of such methods can be found in
[Aga+21a].
To implement such a method, let τ = (s0 , a0 , r0 , s1 , . . . , sT ) be a trajectory created by sampling from
s0 ∼ p0 and then following πθ . Then we have
∇_θ J(π_θ) = ∑_{t=0}^∞ γ^t E_{p_t(s) π_θ(a_t|s_t)}[ ∇_θ log π_θ(a_t|s_t) Q_{π_θ}(s_t, a_t) ]    (3.12)
  ≈ ∑_{t=0}^{T−1} γ^t G_t ∇_θ log π_θ(a_t|s_t)    (3.13)
1 This is distinct from the Stein score, which is the gradient wrt the argument of the log probability, ∇_a log π_θ(a|s), as used
in diffusion.
where the return is defined as follows
G_t ≜ r_t + γ r_{t+1} + γ² r_{t+2} + · · · + γ^{T−t−1} r_{T−1} = ∑_{k=0}^{T−t−1} γ^k r_{t+k} = ∑_{j=t}^{T−1} γ^{j−t} r_j    (3.14)
In practice, estimating the policy gradient using Equation (3.11) can have a high variance. A baseline
function b(s) can be used for variance reduction to get
∇θ J(πθ ) = Eρθ (s)πθ (a|s) [∇θ log πθ (a|s)(Qπθ (s, a) − b(s))] (3.15)
Any function b(s) that does not depend on the action a is a valid baseline. This follows since
∑_a ∇_θ π_θ(a|s)(Q(s, a) − b(s)) = ∑_a ∇_θ π_θ(a|s) Q(s, a) − ∇_θ[ ∑_a π_θ(a|s) ] b(s) = ∑_a ∇_θ π_θ(a|s) Q(s, a) − 0    (3.16)
A common choice for the baseline is b(s) = Vπθ (s). This is a good choice since Vπθ (s) and Q(s, a) are
correlated and have similar magnitudes, so the scaling factor in front of the gradient term will be small.
Using this we get an update of the following form
θ ← θ + η ∑_{t=0}^{T−1} γ^t (G_t − b(s_t)) ∇_θ log π_θ(a_t|s_t)    (3.17)
This is called the REINFORCE estimator [Wil92].2 The update equation can be interpreted as follows:
we compute the sum of discounted future rewards induced by a trajectory, compared to a baseline, and if
this is positive, we increase θ so as to make this trajectory more likely, otherwise we decrease θ. Thus, we
reinforce good behaviors, and reduce the chances of generating bad ones.
2 The name is an acronym for “REward Increment = Nonnegative Factor × Offset Reinforcement × Characteristic
Eligibility”. The phrase “characteristic eligibility” refers to the ∇ log πθ (at |st ) term; the phrase “offset reinforcement” refers to
the Gt − b(st ) term; and the phrase “nonnegative factor” refers to the learning rate η of SGD.
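To illustrate the update in Equation (3.17), here is a minimal Python/numpy sketch of the REINFORCE gradient for a linear-softmax policy over discrete actions; the policy parameterization, names, and trajectory format are my own assumptions.

import numpy as np

def reinforce_gradient(theta, trajectory, gamma=0.99, baseline=None):
    # theta: [A, D]; pi(a|s) = softmax(theta @ phi(s)).
    # trajectory: list of (phi_s, a, r) tuples from one rollout of pi_theta.
    rewards = np.array([r for (_, _, r) in trajectory])
    T = len(rewards)
    G = np.zeros(T)                       # discounted returns-to-go
    running = 0.0
    for t in reversed(range(T)):
        running = rewards[t] + gamma * running
        G[t] = running
    grad = np.zeros_like(theta)
    for t, (phi_s, a, _) in enumerate(trajectory):
        logits = theta @ phi_s
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        grad_logp = -np.outer(probs, phi_s)   # gradient of log pi(a|s) for a softmax-linear policy
        grad_logp[a] += phi_s
        b = baseline(phi_s) if baseline is not None else 0.0
        grad += (gamma ** t) * (G[t] - b) * grad_logp
    return grad                            # ascend: theta <- theta + lr * grad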
3.3.1 Advantage actor critic (A2C)
Concretely, consider the use of the one-step TD method to estimate the return in the episodic case, i.e.,
we replace Gt with Gt:t+1 = rt + γVw (st+1 ). If we use Vw (st ) as a baseline, the REINFORCE update in
Equation (3.17) becomes
θ ← θ + η ∑_{t=0}^{T−1} γ^t (G_{t:t+1} − V_w(s_t)) ∇_θ log π_θ(a_t|s_t)    (3.18)
  = θ + η ∑_{t=0}^{T−1} γ^t (r_t + γ V_w(s_{t+1}) − V_w(s_t)) ∇_θ log π_θ(a_t|s_t)    (3.19)
Note that δ_t = r_t + γ V_w(s_{t+1}) − V_w(s_t) is a single-sample approximation to the advantage function
A(st , at ) = Q(st , at ) − V (st ). This method is therefore called advantage actor critic or A2C. See
Algorithm 4 for the pseudo-code.3 (Note that V_w(s_{t+1}) = 0 if s_{t+1} is a terminal (done) state, representing the end of an
episode.) Note that this is an on-policy algorithm, where we update the value function Vwπ to reflect the value
of the current policy π. See Section 3.3.3 for further discussion of this point.
In practice, we should use a stop-gradient operator on the target value for the TD update, for reasons
explained in Section 2.5.2.4. Furthermore, it is common to add an entropy term to the policy, to act as a
regularizer (to ensure the policy remains stochastic, which smoothens the loss function — see Section 3.5.4).
If we use a shared network with separate value and policy heads, we need to use a single loss function for
training all the parameters ϕ. Thus we get the following loss, for each trajectory, where we want to minimize
TD loss, maximize the policy gradient (expected reward) term, and maximize the entropy term.
L(ϕ; τ) = (1/T) ∑_{t=1}^T [ λ_TD L_TD(s_t, a_t, r_t, s_{t+1}) − λ_PG J_PG(s_t, a_t, r_t, s_{t+1}) − λ_ent J_ent(s_t) ]    (3.20)
q_t = r_t + γ (1 − done(s_{t+1})) V_ϕ(s_{t+1})    (3.21)
L_TD(s_t, a_t, r_t, s_{t+1}) = (sg(q_t) − V_ϕ(s_t))²    (3.22)
J_PG(s_t, a_t, r_t, s_{t+1}) = sg(q_t − V_ϕ(s_t)) log π_ϕ(a_t|s_t)    (3.23)
J_ent(s_t) = − ∑_a π_ϕ(a|s_t) log π_ϕ(a|s_t)    (3.24)
3 In [Mni+16], they proposed a distributed version of A2C known as A3C which stands for “asynchrononous advantage actor
critic”.
To handle the dynamically varying scales of the different loss functions, we can use the PopArt method of
[Has+16; Hes+19] to allow for a fixed set of hyper-parameter values for λi . (PopArt stands for “Preserving
Outputs Precisely, while Adaptively Rescaling Targets”.)
A^{(n)}_w(s_t, a_t) = G_{t:t+n} − V_w(s_t)    (3.26)
δ_t = r_t + γ v_{t+1} − v_t    (3.32)
A_t = δ_t + γλ δ_{t+1} + · · · + (γλ)^{T−t−1} δ_{T−1} = δ_t + γλ A_{t+1}    (3.33)
Here λ ∈ [0, 1] is a parameter that controls the bias-variance tradeoff: larger values decrease the bias but
increase the variance. This is called generalized advantage estimation (GAE) [Sch+16b]. See Algorithm 5
for some pseudocode. Using this, we can define a general actor-critic method, as shown in Algorithm 6.
We can generalize this approach even further, by using gradient estimators of the form
∇J(θ) = E[ ∑_{t=0}^∞ Ψ_t ∇ log π_θ(a_t|s_t) ]    (3.34)
Algorithm 6: Actor critic with GAE
1 Initialize parameters ϕ, environment state s
2 repeat
3 (s1 , a1 , r1 , . . . , sT ) = rollout(s, πϕ )
4 v1:T = Vϕ (s1:T )
5 (A1:T , q1:T ) = sg(GAE(r1:T , v1:T , γ, λ))
6 L(ϕ) = (1/T) ∑_{t=1}^T [ λ_TD (V_ϕ(s_t) − q_t)² − λ_PG A_t log π_ϕ(a_t|s_t) − λ_ent H(π_ϕ(·|s_t)) ]
7 ϕ := ϕ − η∇L(ϕ)
8 until converged
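Here is a minimal sketch of the GAE computation used on line 5 of Algorithm 6, following Equations (3.32) and (3.33); the function name and the convention that the value array has length T + 1 (with the last entry being the bootstrap value, 0 at termination) are my assumptions.

import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    # rewards: [T]; values: [T + 1], including the bootstrap value for the final state.
    # Returns advantages A_t and value targets q_t = A_t + V(s_t).
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD error, Eq. (3.32)
        running = delta + gamma * lam * running                  # recursion, Eq. (3.33)
        adv[t] = running
    return adv, adv + values[:-1]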
θk+1 = θk − ηk gk (3.40)
where gk = ∇θ L(θk ) is the gradient of the loss at the previous parameter values, and ηk is the learning rate.
It can be shown that the above update is equivalent to minimizing a locally linear approximation to the loss,
L̂k , subject to the constraint that the new parameters do not move too far (in Euclidean distance) from the
Figure 3.1: Changing the mean of a Gaussian by a fixed amount (from solid to dotted curve) can have more impact
when the (shared) variance is small (as in a) compared to when the variance is large (as in b). Hence the impact (in
terms of prediction accuracy) of a change to µ depends on where the optimizer is in (µ, σ) space. From Figure 3 of
[Hon+10], reproduced from [Val00]. Used with kind permission of Antti Honkela.
previous parameters:
where the step size ηk is proportional to ϵ. This is called a proximal update [PB+14].
One problem with the SGD update is that Euclidean distance in parameter space does not make sense for
probabilistic models. For example, consider comparing two Gaussians, pθ = p(y|µ, σ) and pθ′ = p(y|µ′ , σ ′ ).
The (squared) Euclidean distance between the parameter vectors decomposes as ||θ−θ ′ ||22 = (µ−µ′ )2 +(σ−σ ′ )2 .
However, the predictive distribution has the form exp(−(y − µ)²/(2σ²)), so changes in µ need to be measured
relative to σ. This is illustrated in Figure 3.1(a-b), which shows two univariate Gaussian distributions (dotted
and solid lines) whose means differ by ϵ. In Figure 3.1(a), they share the same small variance σ 2 , whereas in
Figure 3.1(b), they share the same large variance. It is clear that the difference in µ matters much more (in
terms of the effect on the distribution) when the variance is small. Thus we see that the two parameters
interact with each other, which the Euclidean distance cannot capture.
The key to NGD is to measure the notion of distance between two probability distributions in terms
of the KL divergence. This can be approximated in terms of the Fisher information matrix (FIM). In
particular, for any given input x, we have
D_KL(p_θ(y|x) ∥ p_{θ+δ}(y|x)) ≈ (1/2) δ^T F_x δ    (3.43)
where Fx is the FIM
F_x(θ) = −E_{p_θ(y|x)}[∇² log p_θ(y|x)] = E_{p_θ(y|x)}[(∇ log p_θ(y|x))(∇ log p_θ(y|x))^T]    (3.44)
We now replace the Euclidean distance between the parameters, d(θk , θk+1 ) = ||δ||22 , with
where δ = θk+1 − θk and Fk = Fx (θk ) for a randomly chosen input x. This gives rise to the following
constrained optimization problem:
If we replace the constraint with a Lagrange multiplier, we get the unconstrained objective:
Solving ∇_δ J_k(δ) = 0 gives the update
δ = −η_k F_k^{−1} g_k    (3.48)
The term F^{−1} g is called the natural gradient. This is equivalent to a preconditioned gradient update,
where we use the inverse FIM as a preconditioning matrix. We can compute the (adaptive) learning rate
using
η_k = √( ϵ / (g_k^T F_k^{−1} g_k) )    (3.49)
Computing the FIM can be hard. A simple approximation is to replace the model's distribution with the
empirical distribution. In particular, define p_D(x, y) = (1/N) ∑_{n=1}^N δ_{x_n}(x) δ_{y_n}(y), p_D(x) = (1/N) ∑_{n=1}^N δ_{x_n}(x), and
p_θ(x, y) = p_D(x) p(y|x, θ). Then we can compute the empirical Fisher [Mar16] as follows:
F(θ) = E_{p_θ(x,y)}[ ∇ log p(y|x, θ) ∇ log p(y|x, θ)^T ]    (3.50)
  ≈ E_{p_D(x,y)}[ ∇ log p(y|x, θ) ∇ log p(y|x, θ)^T ]    (3.51)
  = (1/|D|) ∑_{(x,y)∈D} ∇ log p(y|x, θ) ∇ log p(y|x, θ)^T    (3.52)
and compute δ_{k+1} = −η_k F_k^{−1} g_k. This approach is called natural policy gradient [Kak01; Raj+17].
We can compute F_k^{−1} g_k without having to invert F_k by using the conjugate gradient method, where
each CG step uses efficient methods for Hessian-vector products [Pea94]. This is called Hessian free
optimization [Mar10]. Similarly, we can efficiently compute g_k^T (F_k^{−1} g_k).
As a more accurate alternative to the empirical Fisher, [MG15] propose the KFAC method, which stands
for “Kronecker factored approximate curvature”; this approximates the FIM of a DNN as a block diagonal
matrix, where each block is a Kronecker product of two small matrices. This was applied to policy gradient
learning in [Wu+17].
Then one can show [Ach+17] that
J(π) − J(π_k) ≥ (1/(1−γ)) E_{p^γ_{π_k}(s) π_k(a|s)}[ (π(a|s)/π_k(a|s)) A^{π_k}(s, a) ] − (2γ C^{π,π_k}/(1−γ)²) E_{p^γ_{π_k}(s)}[ TV(π(·|s), π_k(·|s)) ]    (3.56)
where the first term is the surrogate objective L(π, π_k), C^{π,π_k} = max_s |E_{π(a|s)}[A^{π_k}(s, a)]|, and the second term
is a penalty term.
If we can optimize this lower bound (or a stochastic approximation, based on samples from the current
policy πk ), we can guarantee monotonic policy improvement (in expectation) at each step. We will replace
this objective with a trust-region update that is easier to optimize:
The constraint bounds the worst-case performance decline at each update. The overall procedure becomes
an approximate policy improvement method. There are various ways of implementing the above method in
practice, some of which we discuss below. (See also [GDWF22], who propose a framework called mirror
learning, that justifies these “approximations” as in fact being the optimal thing to do for a different kind of
objective.)
then π also satisfies the TV constraint with δ = ϵ²/2. Next it considers a first-order expansion of the surrogate
objective to get
L(π, π_k) = E_{p^γ_{π_k}(s) π_k(a|s)}[ (π(a|s)/π_k(a|s)) A^{π_k}(s, a) ] ≈ g_k^T (θ − θ_k)    (3.59)
where gk = ∇θ L(πθ , πk )|θk . Finally it considers a second-order expansion of the KL term to get the
approximate constraint
E_{p^γ_{π_k}(s)}[ D_KL(π_k ∥ π)(s) ] ≈ (1/2) (θ − θ_k)^T F_k (θ − θ_k)    (3.60)
where Fk = gk gTk is an approximation to the Fisher information matrix (see Equation (3.54)). We then use
the update
θk+1 = θk + ηk vk (3.61)
where v_k = F_k^{−1} g_k is the natural gradient, and the step size is initialized to η_k = √( 2δ / (v_k^T F_k v_k) ). (In practice we
compute vk by approximately solving the linear system Fk v = gk using conjugate gradient methods, which
just require matrix vector multiplies.) We then use a backtracking line search procedure to ensure the trust
region is satisfied.
This holds provided the support of π is contained in the support of πk at every state. We then use the
following update:
π_{k+1} = argmax_π E_{(s,a)∼p^γ_{π_k}}[ min( ρ_k(s, a) A^{π_k}(s, a), ρ̃_k(s, a) A^{π_k}(s, a) ) ]    (3.63)
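Assuming, as in PPO, that ρ_k is the likelihood ratio π(a|s)/π_k(a|s) and ρ̃_k is its clipped version clip(ρ_k, 1 − ϵ, 1 + ϵ), the objective in Equation (3.63) can be sketched as follows; the function name and defaults are mine.

import numpy as np

def ppo_clip_objective(logp_new, logp_old, adv, clip_eps=0.2):
    # logp_new, logp_old: [N] log-probabilities of the taken actions under the new
    # and old policies; adv: [N] advantage estimates. Returns the surrogate to maximize.
    ratio = np.exp(logp_new - logp_old)                        # rho_k(s, a)
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)   # clipped ratio
    # The min removes any incentive to push the ratio outside the clipping interval.
    return np.mean(np.minimum(ratio * adv, clipped * adv))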
3.4.4 VMPO
In this section, we discuss the VMPO algorithm of [FS+19], which is an on-policy extension of the earlier
off-policy MPO algorithm (MAP policy optimization) from [Abd+18]. It was originally explained in terms of
“control as inference” (see Section 1.5), but we can also view it as a constrained policy improvement method,
based on Equation (3.57). In particular, VMPO leverages the fact that if
E_{p^γ_{π_k}(s)}[ D_KL(π ∥ π_k)(s) ] ≤ δ    (3.64)
then π also satisfies the TV constraint with δ = ϵ²/2.
Note that here the KL is reversed compared to TRPO in Section 3.4.2. This new version will encourage
π to be mode-covering, so it will naturally have high entropy, which can result in improved robustness.
Unfortunately, this kind of KL is harder to compute, since we are taking expectations wrt the unknown
distribution π.
To solve this problem, VMPO adopts an EM-type approach. In the E step, we compute a non-parametric
version of the state-action distribution given by the unknown new policy:
ψ(s, a) = π(a|s)pγπk (s) (3.65)
The optimal new distribution is given by
ψ_{k+1} = argmax_ψ E_{ψ(s,a)}[ A^{π_k}(s, a) ]  s.t.  D_KL(ψ ∥ ψ_k) ≤ δ    (3.66)
where ψk (s, a) = πk (a|s)pγπk (s). The solution to this is
In the M step, we project this target distribution back onto the space of parametric policies, while satisfying
the KL trust region constraint:
π_{k+1} = argmax_π E_{(s,a)∼p^γ_{π_k}}[ w(s, a) log π(a|s) ]  s.t.  E_{p^γ_{π_k}}[ D_KL(ψ_k ∥ ψ)(s) ] ≤ δ    (3.71)
However, since the trajectories are sampled from πb , we use importance sampling (IS) to correct for the
distributional mismatch, as first proposed in [PSS00]. This gives
Ĵ_IS(π) ≜ (1/n) ∑_{i=1}^n [ p(τ^{(i)}|π) / p(τ^{(i)}|π_b) ] ∑_{t=0}^{T−1} γ^t r_t^{(i)}    (3.73)
It can be verified that E_{π_b}[Ĵ_IS(π)] = J(π), that is, Ĵ_IS(π) is unbiased, provided that p(τ|π_b) > 0 whenever
p(τ|π) > 0. The importance ratio, p(τ^{(i)}|π)/p(τ^{(i)}|π_b), is used to compensate for the fact that the data is sampled
from πb and not π. It can be simplified as follows:
p(τ|π) / p(τ|π_b) = [ p(s_0) ∏_{t=0}^{T−1} π(a_t|s_t) p_S(s_{t+1}|s_t, a_t) p_R(r_t|s_t, a_t, s_{t+1}) ] / [ p(s_0) ∏_{t=0}^{T−1} π_b(a_t|s_t) p_S(s_{t+1}|s_t, a_t) p_R(r_t|s_t, a_t, s_{t+1}) ] = ∏_{t=0}^{T−1} π(a_t|s_t) / π_b(a_t|s_t)    (3.74)
This simplification makes it easy to apply IS, as long as the target and behavior policies are known. (If the
behavior policy is unknown, we can estimate it from D, and replace π_b by its estimate π̂_b.) For convenience,
define the per-step importance ratio at time t by
We can reduce the variance of the estimator by noting that the reward rt is independent of the trajectory
beyond time t. This leads to a per-decision importance sampling variant:
Ĵ_PDIS(π) ≜ (1/n) ∑_{i=1}^n ∑_{t=0}^{T−1} ( ∏_{t′≤t} ρ_{t′}(τ^{(i)}) ) γ^t r_t^{(i)}    (3.76)
where we define δt = (rt + γV (st+1 ) − V (st )) as the TD error at time t. To extend this to the off-policy case,
we use the per-step importance ratio trick. However, to bound the variance of the estimator, we truncate the
IS weights. In particular, we define
c_t = min( c, π(a_t|s_t)/π_b(a_t|s_t) ),    ρ_t = min( ρ, π(a_t|s_t)/π_b(a_t|s_t) )    (3.79)
where c and ρ are hyperparameters. We then define the V-trace target value for V(s_i) as
v_i = V(s_i) + ∑_{t=i}^{i+n−1} γ^{t−i} ( ∏_{t′=i}^{t−1} c_{t′} ) ρ_t δ_t    (3.80)
The product of the weights ci . . . ct−1 (known as the “trace”) measures how much a temporal difference δt
at time t impacts the update of the value function at earlier time i. If the policies are very different, the
variance of this product will be large. So the truncation parameter c is used to reduce the variance. In
[Esp+18], they find c = 1 works best.
The use of the target ρt δt rather than δt means we are evaluating the value function for a policy that is
somewhere between πb and π. For ρ = ∞ (i.e., no truncation), we converge to the value function V π , and for
ρ → 0, we converge to the value function V πb . In [Esp+18], they find ρ = 1 works best.
Note that if c = ρ, then ci = ρi . This gives rise to the simplified form
v_t = V(s_t) + ∑_{j=0}^{n−1} γ^j ( ∏_{m=0}^{j} c_{t+m} ) δ_{t+j}    (3.82)
We can use the above V-trace targets to learn an approximate value function by minimizing the usual ℓ2
loss:
L(w) = E_{t∼D}[ (v_t − V_w(s_t))² ]    (3.83)
the gradient of which has the form
Differentiating this and ignoring the term ∇θ Qπ (s, a), as suggested by [DWS12], gives a way to (approximately)
estimate the off-policy policy-gradient using a one-step IS correction ratio:
∇_θ J_{π_b}(π_θ) ≈ ∑_s ∑_a p^γ_{π_b}(s) ∇_θ π_θ(a|s) Q^π(s, a)    (3.86)
  = E_{p^γ_{π_b}(s), π_b(a|s)}[ (π_θ(a|s)/π_b(a|s)) ∇_θ log π_θ(a|s) Q^π(s, a) ]    (3.87)
In practice, we can approximate Qπ (st , at ) by qt = rt + γvt+1 , where vt+1 is the V-trace estimate for state
st+1 . If we use V (st ) as a baseline, to reduce the variance, we get the following gradient estimate for the
policy:
∇J(θ) = E_{t∼D}[ ρ_t ∇_θ log π_θ(a_t|s_t) (r_t + γ v_{t+1} − V_w(s_t)) ]    (3.88)
We can also replace the 1-step IS-weighted TD error ρ_t (r_t + γ v_{t+1} − V_w(s_t)) with an IS-weighted GAE value
by modifying the generalized advantage estimation method in Section 3.3.2. In particular, we just need to
define λt = λ min(1, ρt ). We denote the IS-weighted GAE estimate as Aρt .4
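To make the V-trace computation concrete, here is a minimal sketch of Equation (3.80) over a single n-step segment; the function name and array conventions are my assumptions.

import numpy as np

def vtrace_targets(rewards, values, bootstrap_value, log_rhos, gamma=0.99, clip_rho=1.0, clip_c=1.0):
    # rewards: [n]; values: [n] estimates V(s_i), ..., V(s_{i+n-1});
    # bootstrap_value: V(s_{i+n}); log_rhos: [n] log(pi(a_t|s_t) / pi_b(a_t|s_t)).
    n = len(rewards)
    rhos = np.minimum(clip_rho, np.exp(log_rhos))    # truncated rho_t
    cs = np.minimum(clip_c, np.exp(log_rhos))        # truncated c_t (the "trace")
    values_ext = np.append(values, bootstrap_value)
    deltas = rhos * (rewards + gamma * values_ext[1:] - values_ext[:n])
    # deltas already include the rho_t factor; accumulate backwards with the c_t trace:
    # acc_t = (rho_t * delta_t) + gamma * c_t * acc_{t+1}.
    acc = 0.0
    corrections = np.zeros(n)
    for t in reversed(range(n)):
        acc = deltas[t] + gamma * cs[t] * acc
        corrections[t] = acc
    return values + corrections                      # v_i from Equation (3.80)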
3.5.2.3 IMPALA
As an example of an off-policy AC method, we consider IMPALA, which stands for "Importance Weighted Actor-Learner Architecture" [Esp+18]. This uses shared parameters for the policy and value function (with different output heads), and adds an entropy bonus to ensure the policy remains stochastic. Thus we end up with the following objective, which is very similar to the on-policy actor-critic shown in Algorithm 6:
L(ϕ) = E_{t∼D} [ λ_TD (V_ϕ(s_t) − v_t)² − λ_PG A^ρ_t log π_ϕ(a_t|s_t) − λ_ent H(π_ϕ(·|s_t)) ]    (3.89)
4 For an implementation, see https://github.com/google-deepmind/rlax/blob/master/rlax/_src/multistep.py#L39
The only difference from standard A2C is that we need to store the probabilities of each action, πb(a_t|s_t), in addition to (s_t, a_t, r_t, s_{t+1}) in the dataset D; these are used to compute ρ_t. [Esp+18] used this method to train a single agent (with a shared CNN and LSTM for both value and policy) to play all 57 Atari games at a high level. Furthermore, they showed that their method, thanks to its off-policy corrections, outperformed the A3C method (a parallel version of A2C) discussed in Section 3.3.1.
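As a sketch of how the terms in Equation (3.89) combine into a single scalar loss, assuming the V-trace targets v_t and IS-weighted advantages A^ρ_t have already been computed (and are treated as constants), one per-minibatch computation might look as follows; the coefficient values are illustrative, not those of [Esp+18]:

import numpy as np

def impala_loss(v_pred, v_trace, log_pi_a, adv_rho, entropy,
                lam_td=0.5, lam_pg=1.0, lam_ent=0.01):
    # v_pred:   V_phi(s_t) from the value head, shape [T]
    # v_trace:  V-trace targets v_t (constants), shape [T]
    # log_pi_a: log pi_phi(a_t|s_t) for the behavior actions, shape [T]
    # adv_rho:  IS-weighted advantage estimates A^rho_t, shape [T]
    # entropy:  entropies H(pi_phi(.|s_t)), shape [T]
    td_loss = lam_td * (v_pred - v_trace) ** 2      # value term
    pg_loss = -lam_pg * adv_rho * log_pi_a          # policy-gradient term
    ent_loss = -lam_ent * entropy                   # entropy bonus
    return np.mean(td_loss + pg_loss + ent_loss)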
A^{π_k}_{trace}(s_t, a_t) = δ_t + ∑_{j=1}^{n−1} γ^j ( ∏_{m=1}^{j} c_{t+m} ) δ_{t+j}    (3.91)
where δ_t = r_t + γ V(s_{t+1}) − V(s_t), and c_t = min( c, π_k(a_t|s_t) / π_{k−i}(a_t|s_t) ) is the truncated importance sampling ratio.
To compute the TV penalty term from off-policy data, we need to choose between the PPO (Section 3.4.3), VMPO (Section 3.4.4) and TRPO (Section 3.4.2) approaches. We discuss each of these cases below.
3.5.4 Soft actor-critic (SAC)
The soft actor-critic (SAC) algorithm [Haa+18a; Haa+18b] is an off-policy actor-critic method based on a
framework known as maximum entropy RL, which we introduced in Section 1.5.3. Crucially, even though
SAC is off-policy and utilizes a replay buffer to sample past experiences, the policy update is done using
the actor’s own probability distribution, eliminating the need to use importance sampling to correct for
discrepancies between the behavior policy (used to collect data) and the target policy (used for updating), as
we will see below.
We start by slightly rewriting the maxent RL objective from Equation (1.67) using modified notation:
J SAC (θ) ≜ Epγπθ (s)πθ (a|s) [R(s, a) + α H(πθ (·|s))] (3.93)
Note that the entropy term makes the objective easier to optimize, and encourages exploration.
To optimize this, we can perform a soft policy evaluation step, and then a soft policy improvement step.
In the policy evaluation step, we can repeatedly apply a modified Bellman backup operator T π defined as
T π Q(st , at ) = r(st , at ) + γEst+1 ∼p [V (st+1 )] (3.94)
where
V (st ) = Eat ∼π [Q(st , at ) − α log π(at |st )] (3.95)
is the soft value function. If we iterate Q_{k+1} = T^π Q_k, this will converge to the soft Q function for π.
In the policy improvement step, we derive the new policy based on the soft Q function by softmaxing over
the possible actions for each state. We then project the update back on to the policy class Π:
π_new = argmin_{π′∈Π} D_KL( π′(·|s_t) ∥ exp( (1/α) Q^{π_old}(s_t, ·) ) / Z^{π_old}(s_t) )    (3.96)
(The partition function Z πold (st ) may be intractable to compute for a continuous action space, but it cancels
out when we take the derivative of the objective, so this is not a problem, as we show below.) After solving
the above optimization problem, we are guaranteed to satisfy the soft policy improvement theorem, i.e.,
Qπnew (st , at ) ≥ Qπold (st , at ) for all st and at .
The above equations are intractable in the non-tabular case, so we now extend to the setting where we
use function approximation.
where ã_{t+1} ∼ π_θ(·|s_{t+1}) is a sampled next action. In [Che+20], they propose the REDQ method (Section 2.5.3.3), which uses a random ensemble of N ≥ 2 critic networks instead of just 2.
The policy (actor) is trained by minimizing
J_π(θ) = E_{s_t∼D} [ E_{a_t∼π_θ} [ α log π_θ(a_t|s_t) − Q_w(s_t, a_t) ] ]    (3.101)
Since we are taking gradients wrt θ, which affects the inner expectation term, we need to either use the
REINFORCE estimator from Equation (3.15) or the reparameterization trick (see e.g., [Moh+20]). The
latter is much lower variance, so is preferable.
To explain this in more detail, let us assume the policy distribution has the form π_θ(a_t|s_t) = N(µ_θ(s_t), σ² I). We can write the random action as a_t = f_θ(s_t, ϵ_t), where f is a deterministic function of the state and a noise variable ϵ_t; specifically, a_t = µ_θ(s_t) + σ ϵ_t, where ϵ_t ∼ N(0, I). The objective now becomes
Jπ (θ) = Est ∼D,ϵt ∼N [α log πθ (fθ (st , ϵt )|st ) − Qw (st , fθ (st , ϵt ))] (3.102)
where we have replaced the expectation of at wrt πθ with an expectation of ϵt wrt its noise distribution
N . Hence we can now safely take stochastic gradients. See Algorithm 8 for the pseudocode. (Note that,
for discrete actions, we can avoid the need for the reparameterization trick by computing the expectations
explicitly, as discussed in Section 3.5.4.3.)
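The following is a minimal sketch of the reparameterized objective in Equation (3.102) for a single state, assuming a diagonal Gaussian policy whose mean and standard deviation vectors are given; in practice this quantity would be differentiated with respect to the policy parameters by an autodiff framework:

import numpy as np

def sac_actor_objective(mu, sigma, q_fn, alpha, rng):
    # mu, sigma: mean and std arrays of the Gaussian policy at one state s
    # q_fn:      callable a -> Q_w(s, a) for that same state
    # alpha:     entropy temperature
    eps = rng.standard_normal(mu.shape)     # eps ~ N(0, I)
    a = mu + sigma * eps                    # a = f_theta(s, eps)
    # log N(a; mu, diag(sigma^2)), evaluated at the sampled action
    log_prob = -0.5 * np.sum(((a - mu) / sigma) ** 2
                             + 2.0 * np.log(sigma) + np.log(2.0 * np.pi))
    return alpha * log_prob - q_fn(a)       # J_pi contribution (to be minimized)

For example, calling it with rng = np.random.default_rng(0) and a dummy critic such as q_fn = lambda a: -np.sum(a ** 2) returns one Monte Carlo sample of the objective.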
For discrete actions, the expectation over actions in the actor objective can be computed exactly by summing over the actions, which avoids the need for reparameterization. (In [Zho+22], they propose to augment J′_π with an entropy penalty, adding a term of the form (1/2)(H_old − H_π)², to prevent drastic changes in the policy, where the entropy of the policy can be computed analytically per sampled state.) The J_Q term is similar to before:
J′_Q(w) = E_{(s_t,a_t,r_{t+1},s_{t+1})∼D} [ (1/2) (Q_w(s_t, a_t) − q′(r_{t+1}, s_{t+1}))² ]    (3.104)
where now the frozen target function is given by
q′(r_{t+1}, s_{t+1}) = r_{t+1} + γ ∑_{a_{t+1}} π_θ(a_{t+1}|s_{t+1}) [ min_{i=1,2} Q_{w_i}(s_{t+1}, a_{t+1}) − α log π_θ(a_{t+1}|s_{t+1}) ]    (3.105)
where H is the target entropy (a hyperparameter). This objective is approximated by sampling actions from the replay buffer.
Algorithm 8: SAC
1 Initialize environment state s, policy parameters θ, N critic parameters w_i, target parameters w̄_i = w_i, replay buffer D = ∅, discount factor γ, EMA rate ρ, step sizes η_w, η_θ
2 repeat
3     Take action a ∼ π_θ(·|s)
4     (s′, r) = step(a, s)
5     D := D ∪ {(s, a, r, s′)}
6     s ← s′
7     for G updates do
8         Sample a minibatch B = {(s_j, a_j, r_j, s′_j)} from D
9         (w_{1:N}, w̄_{1:N}) = update-critics(θ, w, w̄, B)
10        Sample a minibatch B = {(s_j, a_j, r_j, s′_j)} from D
11        θ = update-actor(θ, w, B)
12 until converged
13 def update-critics(θ, w, w̄, B):
14     q_j = q(r_j, s′_j; w̄_{1:N}, θ) for j = 1 : |B|    // frozen critic targets
15     for i = 1 : N do
16         L(w_i) = (1/|B|) ∑_{(s,a,r,s′)_j∈B} (Q_{w_i}(s_j, a_j) − sg(q_j))²
17         w_i ← w_i − η_w ∇L(w_i)    // Descent
18         w̄_i := ρ w̄_i + (1 − ρ) w_i    // Update target networks with EMA
19     Return w_{1:N}, w̄_{1:N}
20 def update-actor(θ, w, B):
21     Q̂(s, a) ≜ (1/N) ∑_{i=1}^{N} Q_{w_i}(s, a)    // Average critic
22     J(θ) = (1/|B|) ∑_{s∈B} [ Q̂(s, ã_θ(s)) − α log π_θ(ã_θ(s)|s) ],    ã_θ(s) ∼ π_θ(·|s)
23     θ ← θ + η_θ ∇J(θ)    // Ascent
24     Return θ
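For concreteness, here is a sketch of the frozen target q(r_j, s′_j; w̄_{1:N}, θ) used by update-critics in Algorithm 8, written for a single transition; the callables policy_sample and q_targets are assumptions of this sketch:

def sac_critic_target(r, s_next, policy_sample, q_targets, alpha, gamma):
    # policy_sample: callable s -> (a, log_prob), sampling a ~ pi_theta(.|s)
    # q_targets:     sequence of callables (s, a) -> Q_{w_bar_i}(s, a)
    a_next, log_prob = policy_sample(s_next)
    q_min = min(q(s_next, a_next) for q in q_targets)   # pessimistic ensemble value
    return r + gamma * (q_min - alpha * log_prob)        # soft Bellman backup target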
For discrete actions, the temperature objective is given by
J′(α) = E_{s_t∼D} [ ∑_a π_t(a|s_t) [ −α ( log π_t(a|s_t) + H ) ] ]    (3.106)
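A direct transcription of Equation (3.106) for a batch of states is shown below; the small constant inside the logarithm is a numerical-stability assumption of this sketch, and minimizing the result with respect to log_alpha (via autodiff) adapts the temperature:

import numpy as np

def discrete_temperature_objective(probs, log_alpha, target_entropy):
    # probs:          array [batch, num_actions] with pi_t(a|s_t)
    # target_entropy: the target entropy H (a hyperparameter)
    alpha = np.exp(log_alpha)
    log_probs = np.log(probs + 1e-8)
    per_state = np.sum(probs * (-alpha * (log_probs + target_entropy)), axis=-1)
    return np.mean(per_state)    # J'(alpha)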
The deterministic policy gradient theorem [Sil+14] tells us that the gradient of the objective J(µ_θ) = E_{p^γ_{µ_θ}(s)} [ Q^{µ_θ}(s, µ_θ(s)) ] is given by
∇_θ J(µ_θ) = E_{p^γ_{µ_θ}(s)} [ ∇_θ µ_θ(s) ∇_a Q^{µ_θ}(s, a)|_{a=µ_θ(s)} ]    (3.109)
where ∇_θ µ_θ(s) is the M × N Jacobian matrix, and M and N are the dimensions of A and θ, respectively. For stochastic policies of the form π_θ(a|s) = µ_θ(s) + noise, the standard policy gradient theorem reduces to the above form as the noise level goes to zero.
Note that the gradient estimate in Equation (3.109) integrates over the states but not over the actions,
which helps reduce the variance in gradient estimation from sampled trajectories. However, since the
deterministic policy does not do any exploration, we need to use an off-policy method for training. This
collects data from a stochastic behavior policy πb , whose stationary state distribution is pγπb . The original
objective, J(µθ ), is approximated by the following:
J_b(µ_θ) ≜ E_{p^γ_{πb}(s)} [ V^{µ_θ}(s) ] = E_{p^γ_{πb}(s)} [ Q^{µ_θ}(s, µ_θ(s)) ]    (3.110)
where we have dropped a term that depends on ∇_θ Q^{µ_θ}(s, a) and is hard to estimate [Sil+14].
To apply Equation (3.111), we may learn Qw ≈ Qµθ with TD, giving rise to the following updates:
So we learn both a state-action critic Q_w and an actor µ_θ. This method avoids importance sampling in the actor update because of the deterministic policy gradient, and it avoids it in the critic update because of the use of Q-learning.
If Q_w is linear in w, and uses features of the form ϕ(s, a) = a^T ∇_θ µ_θ(s), then we say the function approximator for the critic is compatible with the actor; in this case, one can show that the above approximation does not bias the overall gradient.
The basic off-policy DPG method has been extended in various ways, some of which we describe below.
3.6.1 DDPG
The DDPG algorithm of [Lil+16], which stands for "deep deterministic policy gradient", uses the DQN method (Section 2.5.2.2) to update the Q function, which is represented by a deep neural network. In more detail, the actor is trained to maximize the output of the critic by optimizing
averaged over states s drawn from the replay buffer. The critic is trained to minimize the 1-step TD loss,
where Q_w̄ is the target critic network, and the samples (s, a, r, s′) are drawn from a replay buffer. (See Section 2.5.2.5 for a discussion of target networks.)
The D4PG algorithm [BM+18], which stands for “distributed distributional DDPG”, extends DDPG to
handle distributed training, and to handle distributional RL (see Section 5.1).
Second, it uses clipped double Q-learning, which is an extension of the double Q-learning discussed in Section 2.5.3.1, to avoid over-estimation bias. In particular, the target values for TD learning are defined using
Third, it uses delayed policy updates, in which it only updates the policy after the value function has stabilized. (See also Section 3.3.3.) See Algorithm 9 for the pseudocode.
Algorithm 9: TD3
1 Initialize environment state s, policy parameters θ, target policy parameters θ̄ = θ, critic parameters w_i, target critic parameters w̄_i = w_i, replay buffer D = ∅, discount factor γ, EMA rate ρ, step sizes η_w, η_θ
2 repeat
3     a = µ_θ(s) + noise
4     (s′, r) = step(a, s)
5     D := D ∪ {(s, a, r, s′)}
6     s ← s′
7     for G updates do
8         Sample a minibatch B = {(s_j, a_j, r_j, s′_j)} from D
9         (w_{1:2}, w̄_{1:2}) = update-critics(θ̄, w, w̄, B)
10        Sample a minibatch B = {(s_j, a_j, r_j, s′_j)} from D
11        (θ, θ̄) = update-actor(θ, θ̄, w, B)
12 until converged
13 def update-critics(θ̄, w, w̄, B):
14     for j = 1 : |B| do
15         ã_j = µ_θ̄(s′_j) + clip(noise, −c, c)
16         q_j = r_j + γ min_{i=1,2} Q_{w̄_i}(s′_j, ã_j)
17     for i = 1 : 2 do
18         L(w_i) = (1/|B|) ∑_{(s,a,r,s′)_j∈B} (Q_{w_i}(s_j, a_j) − sg(q_j))²
19         w_i ← w_i − η_w ∇L(w_i)    // Descent
20         w̄_i := ρ w̄_i + (1 − ρ) w_i    // Update target networks with EMA
21     Return w_{1:2}, w̄_{1:2}
22 def update-actor(θ, θ̄, w, B):
23     J(θ) = (1/|B|) ∑_{s∈B} Q_{w_1}(s, µ_θ(s))
24     θ ← θ + η_θ ∇J(θ)    // Ascent
25     θ̄ := ρ θ̄ + (1 − ρ) θ    // Update target policy network with EMA
26     Return θ, θ̄
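As a sketch of the clipped double-Q target with target-policy smoothing computed inside update-critics of Algorithm 9, written for a single transition (the noise scale and clip value are illustrative hyperparameters):

import numpy as np

def td3_target(r, s_next, target_actor, target_critics, gamma,
               noise_std=0.2, noise_clip=0.5, rng=None):
    # target_actor:   callable s -> mu_{theta_bar}(s)
    # target_critics: pair of callables (s, a) -> Q_{w_bar_i}(s, a)
    rng = rng or np.random.default_rng()
    a_next = np.asarray(target_actor(s_next))
    noise = np.clip(noise_std * rng.standard_normal(a_next.shape),
                    -noise_clip, noise_clip)                   # target-policy smoothing
    a_smoothed = a_next + noise
    q_min = min(q(s_next, a_smoothed) for q in target_critics) # clipped double Q
    return r + gamma * q_min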
Chapter 4
Model-based RL
Model-free approaches to RL typically need a lot of interactions with the environment to achieve good performance. For example, state-of-the-art methods for the Atari benchmark, such as Rainbow (Section 2.5.2.2), use millions of frames, equivalent to many days of playing at the standard frame rate. By contrast, humans can achieve the same performance in minutes [Tsi+17]. Similarly, OpenAI's robot hand controller [And+20] needs 100 years of simulated data to learn to manipulate a Rubik's cube.
One promising approach to greater sample efficiency is model-based RL (MBRL). In the simplest
approach to MBRL, we first learn the state transition or dynamics model pS (s′ |s, a) — also called a world
model — and the reward function R(s, a), using some offline trajectory data, and then we use these models
to compute a policy (e.g., using dynamic programming, as discussed in Section 2.2, or using some model-free
policy learning method on simulated data, as discussed in Chapter 3). It can be shown that the sample
complexity of learning the dynamics is less than the sample complexity of learning the policy [ZHR24].
However, the above two-stage approach — where we first learn the model, and then plan with it — can
suffer from the usual problems encountered in offline RL (Section 5.5), i.e., the policy may query the model
at a state for which no data has been collected, so predictions can be unreliable, causing the policy to learn
the wrong thing. To get better results, we have to interleave the model learning and policy learning, so that
one helps the other (since the policy determines what data is collected).
There are two main ways to perform MBRL. In the first approach, known as decision-time planning or
model predictive control, we use the model to choose the next action by searching over possible future
trajectories. We then score each trajectory, pick the action corresponding to the best one, take a step in the
environment, and repeat. (We can also optionally update the model based on the rollouts.) This is discussed
in Section 4.1.
The second approach is to use the current model and policy to rollout imaginary trajectories, and to use
this data (optionally in addition to empirical data) to improve the policy using model-free RL; this is called
background planning, and is discussed in Section 4.2.
The advantage of decision-time planning is that it allows us to train a world model on reward-free data,
and then use that model to optimize any reward function. This can be particularly useful if the reward
contains changing constraints, or if it is an intrinsic reward (Section 5.2.4) that frequently changes based on
the knowledge state of the agent. The downside of decision-time planning is that it is much slower. However,
it is possible to combine the two methods, as we discuss below. For an empirical comparison of background
planning and decision-time planning, see [AP24].
Some generic pseudo-code for an MBRL agent is given in Algorithm 10. (The rollout function is defined
in Algorithm 11; some simple code for model learning is shown in Algorithm 12, although we discuss other
loss functions in Section 4.3; finally, the code for the policy learning is given in other parts of this manuscript.)
For more details on general MBRL, see e.g., [Wan+19; Moe+23; PKP21].
Algorithm 10: MBRL agent
1 def MBRL-agent(Menv; T, H, N):
2     Initialize state s ∼ Menv
3     Initialize data buffer D = ∅, model M̂
4     Initialize value function V, policy proposal π
5     repeat
6         // Collect data from environment
7         τ_env = rollout(s, π, T, Menv)
8         s = τ_env[−1]
9         D = D ∪ τ_env
10        // Update model
11        if Update model online then
12            M̂ = update-model(M̂, τ_env)
13        if Update model using replay then
14            τ_replay^n = sample-trajectory(D), n = 1 : N
15            M̂ = update-model(M̂, τ_replay^{1:N})
16        // Update policy
17        if Update on-policy with real then
18            (π, V) = update-on-policy(π, V, τ_env)
19        if Update on-policy with imagination then
20            τ_imag^n = rollout(sample-init-state(D), π, T, M̂), n = 1 : N
21            (π, V) = update-on-policy(π, V, τ_imag^{1:N})
22        if Update off-policy with real then
23            τ_replay^n = sample-trajectory(D), n = 1 : N
24            (π, V) = update-off-policy(π, V, τ_replay^{1:N})
25        if Update off-policy with imagination then
26            τ_imag^n = rollout(sample-state(D), π, T, M̂), n = 1 : N
27            (π, V) = update-off-policy(π, V, τ_imag^{1:N})
28    until converged
4.1 Decision-time planning
If the model is known, and the state and action spaces are discrete and low-dimensional, we can use exact
techniques based on dynamic programming to compute the policy, as discussed in Section 2.2. However, for
the general case, approximate methods must be used for planning, whether the model is known (e.g., for
board games like Chess and Go) or learned.
One approach to approximate planning is to be lazy, and just wait until we know what state we are in,
call it st , and then decide what to do, rather than trying to learn a policy that maps any state to the best
action. This is called decision time planning or “planning in the now” [KLP11]. We discuss some
variants of this approach below.
In the simplest setting, we choose the next H actions by solving
a_{t:t+H−1} = argmax_{a_{t:t+H−1}} E [ ∑_{h=0}^{H−1} γ^h R(s_{t+h}, a_{t+h}) + γ^H V̂(s_{t+H}) ]    (4.2)
Here, H is called the planning horizon, and V̂(s_{t+H}) is an estimate of the reward-to-go at the end of this H-step look-ahead process. We can often speed up the optimization process by using a pre-trained proposal policy a_t = π(s_t), which can be used to guide the search process, as we discuss below.
Note that MPC computes a fixed sequence of actions, at:t+H−1 , also called a plan, given the current state
st ; since the future actions at′ for t′ > t are independent of the future states st′ , this is called an open loop
controller. Such a controller can work well in deterministic environments (where st′ can be computed from
st and the action sequence), but in general, we will need to replan at each step, as the actual next state is
observed. Thus MPC is a way of creating a closed loop controller.
We can combine MPC with model and policy/proposal learning using the pseudocode in Algorithm 10,
where the decision policy a_t = π_MPC(s_t) is implemented by Equation (4.2). If we want to learn the proposal
policy at = π(st ), we should use off-policy methods, since the training data (even if imaginary) will be
collected by πMPC rather than by π. When learning the world model, we only need it to be locally accurate,
around the current state, which means we can often use simpler models in MPC than in background planning
approaches.
In the sections below, we discuss particular kinds of MPC methods. Further connections between MPC
and RL are discussed in [Ber24].
Figure 4.1: Illustration of heuristic search. In this figure, the subtrees are ordered according to a depth-first search
procedure. From Figure 8.9 of [SB18]. Used with kind permission of Richard Sutton.
bound of the true value function. Admissibility ensures we will never incorrectly prune off parts of the search
space. In this case, the resulting algorithm is known as A∗ search, and is optimal. For more details on
classical AI heuristic search methods, see [Pea84; RN19].
4.1.3.2 MuZero
AlphaZero assumes the world model is known. The MuZero method of [Sch+20] learns a world model, by
training a latent representation of the observations, z_t = ϕ(o_t), and a corresponding latent dynamics model z_{t+1} = M(z_t, a_t). The world model is trained to predict the immediate reward, the future reward (i.e., the value), and the optimal policy, where the optimal policy is computed using MCTS.
In more detail, to learn the model, MuZero uses a sum of 3 loss terms applied to each (z_{t−1}, a_t, z_t, r_t) tuple in the replay buffer. The first loss is L(r_t, r̂_t), where r_t is the observed reward and r̂_t = R(z_t) is the predicted reward. The second loss is L(π^MCTS_t, π_t), where π^MCTS_t is the target policy from MCTS search (see below) and π_t = f(z_t) is the predicted policy. The third loss is L(G^MCTS_t, v_t), where G^MCTS_t = ∑_{i=0}^{n−1} γ^i r_{t+i} + γ^n v_{t+n} is the n-step bootstrap target value derived from MCTS search (see below), and v_t = V(z_t) is the predicted value from the current model.
To pick an action, MuZero does not use the policy directly. Instead it uses MCTS to rollout a search tree
using the dynamics model, starting from the current state zt . It uses the predicted policy π t and value vt as
heuristics to limit the breadth and depth of the search. Each time it expands a node in the tree, it assigns it
a unique integer id (since we are assuming the dynamics are deterministic), thus lazily creating a discrete
MDP. It then partially solves for the tabular Q function for this MDP using Monte Carlo rollouts, similar to
real-time dynamic programming (Section 2.2.2).
In more detail, the MCTS process is as follows. Let sk = zt be the root node, for k = 0. We initialize
Q(sk , a) = 0 and P (sk , a) = π t (a|sk ), where the latter is the prior for each action. To select the action ak
to perform next (in the rollout), we use the UCB heuristic (Section 1.4.3) based on the empirical counts
N (s, a) combined with the prior policy, P (s, a), which act as pseudocounts. After expanding this node, we
create the child node sk+1 = M (sk , ak ); we initialize Q(sk+1 , a) = 0 and P (sk+1 , a) = π t (a|sk+1 ), and repeat
the process until we reach a maximum depth, where we apply the value function to the corresponding leaf
node. We then compute the empirical sum of discounted rewards along each of the explored paths, and
use this to update the Q(s, a) and N (s, a) values for all visited nodes. After performing 50 such rollouts,
we compute the empirical distribution over actions at the root node to get the MCTS visit count policy, π^MCTS_t(a) = [ N(s^0, a) / ∑_b N(s^0, b) ]^{1/τ}, where τ is a temperature. Finally we sample an action a_t from π^MCTS_t, take a step, add (o_t, a_t, r_t, π^MCTS_t, G^MCTS_t) to the replay buffer, compute the losses, update the model and policy parameters, and repeat.
The Stochastic MuZero method of [Ant+22] extends MuZero to allow for stochastic environments. The
Sampled MuZero method of [Hub+21] extends MuZero to allow for large action spaces.
4.1.3.3 EfficientZero
The Efficient Zero paper [Ye+21] extends MuZero by adding an additional self-prediction loss to help train
the world model. (See Section 4.3.2.2 for a discussion of such losses.) It also makes several other changes. In particular, it replaces the empirical sum of instantaneous rewards, ∑_{i=0}^{n−1} γ^i r_{t+i}, used in computing G^MCTS_t,
with an LSTM model that predicts the sum of rewards for a trajectory starting at zt ; they call this the value
prefix. In addition, it replaces the stored value at the leaf nodes of trajectories in the replay buffer with
new values, by rerunning MCTS using the current model applied to the leaves. They show that all three
changes help, but the biggest gain is from the self-prediction loss. The recent Efficient Zero V2 [Wan+24b]
extends this to also work with continuous actions, by replacing tree search with sampling-based Gumbel
search, amongst other changes.
4.1.4.2 LQG
If the system dynamics are linear and the reward function corresponds to negative quadratic cost, the optimal
action sequence can be solved mathematically, as in the linear-quadratic-Gaussian (LQG) controller (see
e.g., [AM89; HR17]).
If the model is nonlinear, we can use differential dynamic programming (DDP) [JM70; TL05]. In
each iteration, DDP starts with a reference trajectory, and linearizes the system dynamics around states
on the trajectory to form a locally quadratic approximation of the reward function. This system can be
solved using LQG, whose optimal solution results in a new trajectory. The algorithm then moves to the next
iteration, with the new trajectory as the reference trajectory.
4.1.4.3 CEM
It is common to use black-box (gradient-free) optimization methods, like the cross-entropy method or CEM, to find the best action sequence. CEM is a simple derivative-free optimization method for continuous black-box functions f : R^D → R. We start with a multivariate Gaussian, N(µ_0, Σ_0), representing a distribution over possible actions a. We sample from this, evaluate all the proposals, pick the top K, refit the Gaussian to these top K (i.e., perform moment matching on the top K samples), and repeat, until we find a sample with a sufficiently good score. For details, see [Rub97; RK04; Boe+05].
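A minimal sketch of this loop, specialized to MPC over an H-step action sequence, is given below; it assumes a user-supplied score function that evaluates a candidate sequence (e.g., by rolling it out in a learned model and summing rewards), and all names and default settings are illustrative:

import numpy as np

def cem_plan(score_fn, horizon, action_dim, n_samples=100, n_elite=10,
             n_iters=5, rng=None):
    # score_fn: maps an action sequence of shape [horizon, action_dim] to a scalar return.
    rng = rng or np.random.default_rng()
    mu = np.zeros((horizon, action_dim))
    sigma = np.ones((horizon, action_dim))
    for _ in range(n_iters):
        samples = mu + sigma * rng.standard_normal((n_samples, horizon, action_dim))
        scores = np.array([score_fn(a) for a in samples])
        elite = samples[np.argsort(scores)[-n_elite:]]             # top-K action sequences
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-6    # refit the Gaussian
    return mu

An MPC controller would execute only the first action of the returned sequence and then replan at the next step.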
In Section 4.1.4.4, we discuss the MPPI method, which is a common instantiation of the CEM method.
Another example is in the TD-MPC paper [HSW22a]. They learn the world model (dynamics model) in a
latent space so as to predict future value and reward using temporal difference learning, and then use CEM
to implement MPC for this world model. In [BXS20] they discuss how to combine CEM with gradient-based
planning.
4.1.4.4 MPPI
The model predictive path integral or MPPI approach [WAT17] is a version of CEM. Originally MPPI
was limited to models with linear dynamics, but it was extended to general nonlinear models in [Wil+17].
The basic idea is that the initial mean of the Gaussian at step t, namely µt = at:t+H , is computed based on
shifting µ̂t−1 forward by one step. (Here µt is known as a reference trajectory.)
In [Wag+19], they apply this method for robot control. They consider a state vector of the form
s_t = (q_t, q̇_t), where q_t is the configuration of the robot. The deterministic dynamics has the form
s_{t+1} = F(s_t, a_t) = ( q_t + q̇_t Δt,  q̇_t + f(s_t, a_t) Δt )    (4.4)
where f is a 2 layer MLP. This is trained using the Dagger method of [RGB11], which alternates between
fitting the model (using supervised learning) on the current replay buffer (initialized with expert data), and
then deploying the model inside the MPPI framework to collect new data.
4.1.4.5 GP-MPC
[KD18] proposed GP-MPC, which combines a Gaussian process dynamics model with model predictive
control. They compute a Gaussian approximation to the future state trajectory given a candidate action
trajectory, p(st+1:t+H |at:t+H−1 , st ), by moment matching, and use this to deterministically compute the
expected reward and its gradient wrt at:t+H−1 . Using this, they can solve Equation (4.2) to find a∗t:t+H−1 ;
finally, they execute the first step of this plan, a∗t , and repeat the whole process.
The key observation is that moment matching is a deterministic operator that maps p(st |a1:t−1 ) to
p(st+1 |a1:t ), so the problem becomes one of deterministic optimal control, for which many solution methods
exist. Indeed the whole approach can be seen as a generalization of the LQG method from classical control,
which assumes a (locally) linear dynamics model, a quadratic cost function, and a Gaussian distribution over
states [Rec19]. In GP-MPC, the moment matching plays the role of local linearization.
The advantage of GP-MPC over the earlier method known as PILCO (“probabilistic inference for learning
control”), which learns a policy by maximizing the expected reward from rollouts (see [DR11; DFR15] for
details), is that GP-MPC can handle constraints more easily, and it can be more data efficient, since it
continually updates the GP model after every step (instead of at the end of a trajectory).
where xt = (st , at ), and Ot is the “optimality variable” which is clamped to the value 1, with distribution
p(Ot = 1|st , at ) = exp(R(st , at )). (Henceforth we will assume a uniform prior over actions, so p(at ) ∝ 1.) If
we can sample from this distribution, we can find state-action sequences with high expected reward, and then
we can just extract the first action from one of these sampled trajectories.1
In practice we only compute the posterior for h steps into the future, although we still condition on
optimality out to the full horizon T . Thus we define our goal as computing
where p(Ot = 1|st , at ) = exp(R(st , at )) is the probability that the “optimality variable” obtains its observed
(clamped) value of 1. We have decomposed the posterior as a forwards filtering term, αh (x1:h ), and a
backwards likelihood or smoothing term, βh (xh ), as is standard in the literature on inference in state-space
models (see e.g., [Mur23, Ch.8-9]). Note that if we define the value function as V (sh ) = log p(Oh:T |sh ), then
the backwards message can be rewritten as follows [Pic+19]:
A standard way to perform posterior inference in models such as these is to use Sequential Monte Carlo
or SMC, which is an extension of particle filtering (i.e., sequential importance sampling with resampling) to a
general sequence of distributions over a growing state space (see e.g., [Mur23, Ch 13.]). When combined with
an approximation to the backwards message, the approach is called twisted SMC [BDM10; WL14; AL+16;
Law+22; Zha+24]. This was applied to MPC in [Pic+19]. In particular, they suggest using SAC to learn a
value function V , analogous to the backwards twist function, and policy π, which can be used to create the
forwards proposal. More precisely, the policy can be combined with the world model M (st |st−1 , at−1 ) to
give a (Markovian) proposal disribution over the next state and action:
This can then be used inside of an SMC algorithm to sample trajectories from the posterior in Equation (4.6).
In particular, at each step, we sample from the proposal to extend each previous particle (sampled trajectory)
by one step, and then reweight the corresponding particle using
w_t = p(x_{1:t}|O_{1:T}) / q(x_{1:t}) = [ p(x_{1:t−1}|O_{1:T}) p(x_t|x_{1:t−1}, O_{1:T}) ] / [ q(x_{1:t−1}) q(x_t|x_{1:t−1}) ]    (4.9)
    = w_{t−1} p(x_t|x_{1:t−1}, O_{1:T}) / q(x_t|x_{1:t−1}) ∝ w_{t−1} (1 / q(x_t|x_{1:t−1})) · p(x_{1:t}|O_{1:T}) / p(x_{1:t−1}|O_{1:T})    (4.10)
Now plugging in the forward-backward equation from Equation (4.6), and doing some algebra, we get the
following (see [Pic+19, App. A.4] for the detailed derivation):
where
A(st , at , st+1 ) = rt − log π(at |st ) + V (st+1 ) − Ep(st |st−1 ,at−1 ) [exp(V (st ))] (4.13)
is a maximum entropy version of an advantage function. We show the overall pseudocode in Algorithm 13.
An improved version of the above method, called Critic SMC, is presented in [Lio+22]. The main difference is that they first extend each of the N particles (sampled trajectories) by K possible "putative actions" a_i^{nk}, then score them using a learned heuristic function Q(s_i^n, a_i^{nk}), and then resample N winners a_i^n from
1We should really marginalize over the state sequences, and then find the maximum marginal probability action sequence, as
in Equation (4.2), but we approximate this by joint sampling, for simplicity. For more discussion on this point, see [LG+24].
Algorithm 13: SMC for MPC
1 def SMC-MPC(s_t, M, π, V, H):
2     Initialize particles: {s_t^n = s_t}_{n=1}^{N}
3     Initialize weights: {w_t^n = 1}_{n=1}^{N}
4     for i = t : t + H do
5         // Propose one-step extension
6         {a_i^n ∼ π(·|s_i^n)}
7         {(s_{i+1}^n, r_i^n) ∼ M(·|s_i^n, a_i^n)}
8         // Update weights
9         {w_i^n ∝ w_{i−1}^n exp(A(s_i^n, a_i^n, s_{i+1}^n))}
10        // Resampling
11        {x_{1:i}^n} ∼ Multinom(n; w_i^1, . . . , w_i^N)
12        {w_i^n = 1}
13    Sample n ∼ Unif(1 : N)    // Pick one of the top samples
14    Return a_t^n
this set of N × K particles, and then push these winners through the dynamics model to get s_{i+1}^n ∼ M(·|s_i^n, a_i^n). Finally, they reweight the N particles by the advantage and resample, as before. This can be advantageous if the dynamics model is slow to evaluate, since we can evaluate K possible extensions just using the heuristic function. We can think of this as a form of stochastic beam search, where the beam has N candidates, we expand each one using K possible actions, and then reduce the population (beam) back to N.
We define the loss of a model M̂ given a distribution µ(s, a) over states and actions as
ℓ(M̂, µ) = E_{(s,a)∼µ} [ D_KL( Menv(·|s, a) ∥ M̂(·|s, a) ) ]
We now define MBRL as a two-player general-sum game:
max_π J(π, M̂)   (policy player),        min_{M̂} ℓ(M̂, µ^π_{Menv})   (model player)
where µ^π_{Menv}(s, a) = (1/T) ∑_{t=0}^{T} Menv(s_t = s, a_t = a) is the induced state visitation distribution when policy π is applied in the real world Menv, so that minimizing ℓ(M̂, µ^π_{Menv}) gives the maximum likelihood estimate for M̂.
Now consider a Nash equilibrium of this game, that is, a pair (π, M̂) that satisfies ℓ(M̂, µ^π_{Menv}) ≤ ϵ_{Menv} and J(π, M̂) ≥ J(π′, M̂) − ϵ_π for all π′. (That is, the model is accurate when predicting the rollouts from π, and π cannot be improved when evaluated in M̂.) In [RMK20] they prove that the Nash equilibrium policy π is near optimal wrt the real world, in the sense that J(π*, Menv) − J(π, Menv) is bounded by a constant, where π* is an optimal policy for the real world Menv. (The constant depends on the ϵ parameters, and the TV distance between µ^{π*}_{Menv} and µ^{π*}_{M̂}.)
A natural approach to trying to find such a Nash equilibrium is to use gradient descent ascent or
GDA, in which each player updates its parameters simultaneously, using
Unfortunately, GDA is often an unstable algorithm, and often needs very small learning rates η. In addition,
to increase sample efficiency in the real world, it is better to make multiple policy improvement steps using
synthetic data from the model M̂k at each step.
Rather than taking small steps in parallel, the best response strategy fully optimizes each player given
the previous value of the other player, in parallel:
Unfortunately, making such large updates in parallel can often result in a very unstable algorithm.
To avoid the above problems, [RMK20] propose to replace the min-max game with a Stackelberg game,
which is a generalization of min-max games where we impose a specific player ordering. In particular, let the
players be A and B, let their parameters be θA and θB , and let their losses be LA (θA , θB ) and LB (θA , θB ).
If player A is the leader, the Stackelberg game corresponds to the following nested optimization problem,
also called a bilevel optimization problem:
min_{θ_A} L_A(θ_A, θ*_B(θ_A))    s.t.    θ*_B(θ_A) = argmin_θ L_B(θ_A, θ)
Since the follower B chooses the best response to the leader A, the follower’s parameters are a function of the
leader’s. The leader is aware of this, and can utilize this when updating its own parameters.
The main advantage of the Stackelberg approach is that one can derive gradient-based algorithms that
will provably converge to a local optimum [CMS07; ZS22]. In particular, suppose we choose the policy as
leader (PAL). We then just have to solve the following optimization problem:
We can solve the first step by executing πk in the environment to collect data Dk and then fitting a local
(policy-specific) dynamics model by solving M̂k+1 = argmin ℓ(M̂, Dk ). (For example, this could be a locally
linear model, such as those used in trajectory optimization methods discussed in Section 4.1.4.4.) We then
(slightly) improve the policy to get πk+1 using a conservative update algorithm, such as natural actor-critic
(Section 3.3.4) or TRPO (Section 3.4.2), on “imaginary” model rollouts from M̂k+1 .
Alternatively, suppose we choose the model as leader (MAL). We now have to solve
We can solve the first step by using any RL algorithm on “imaginary” model rollouts from M̂k to get πk+1 . We
then apply this in the real world to get data Dk+1 , which we use to slightly improve the model to get M̂k+1
by using a conservative model update applied to Dk+1 . (In practice we can implement a conservative model
update by mixing Dk+1 with data generated from earlier models, an approach known as data aggregation
[RB12].) Compared to PAL, the resulting model will be a more global model, since it is trained on data from
a mixture of policies (including very suboptimal ones at the beginning of learning).
4.2.2 Dyna
The Dyna paper [Sut90] proposed an approach to MBRL that is related to the approach discussed in
Section 4.2.1, in the sense that it trains a policy and world model in parallel, but it differs in one crucial way: the
policy is also trained on real data, not just imaginary data. That is, we define πk+1 = πk +ηπ ∇π J(πk , D̂k ∪Dk ),
where Dk is data from the real environment and D̂k = rollout(πk , M̂k ) is imaginary data from the model.
This makes Dyna a hybrid model-free and model-based RL method, rather than a “pure” MBRL method.
In more detail, at each step of Dyna, the agent collects new data from the environment and adds it to a
real replay buffer. This is then used to do an off-policy update. It also updates its world model given the real
data. Then it simulates imaginary data, starting from a previously visited state (see sample-init-state
function in Algorithm 10), and rolling out the current policy in the learned model. The imaginary data is
then added to the imaginary replay buffer and used by an on-policy learning algorithm. This process continues
until the agent runs out of time and must take the next step in the environment.
Algorithm 14: Tabular Dyna-Q
1 def dyna-Q-agent(s, Menv ; ϵ, η, γ):
2 Initialize data buffer D = ∅, Q(s, a) = 0 and M̂ (s, a) = 0
3 repeat
4 // Collect real data from environment
5 a = eps-greedy(Q, ϵ)
6 (s′ , r) = env.step(s, a)
7 D = D ∪ {(s, a, r, s′ )}
8 // Update policy on real data
9 Q(s, a) := Q(s, a) + η[r + γ maxa′ Q(s′ , a′ ) − Q(s, a)]
10 // Update model on real data
11 M̂ (s, a) = (s′ , r)
12 s := s′
13 // Update policy on imaginary data
14 for n=1:N do
15 Select (s, a) from D
16 (s′ , r) = M̂ (s, a)
17 Q(s, a) := Q(s, a) + η[r + γ maxa′ Q(s′ , a′ ) − Q(s, a)]
18 until converged
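Below is a minimal Python transcription of Algorithm 14, with an added episode loop; the environment interface (reset and step) is an assumption of this sketch rather than a specific library's API:

import random
import numpy as np

def dyna_q(env, num_states, num_actions, episodes=100, n_planning=10,
           eps=0.1, eta=0.5, gamma=0.95, seed=0):
    rng = random.Random(seed)
    Q = np.zeros((num_states, num_actions))
    model = {}   # deterministic tabular model: (s, a) -> (s', r)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection
            a = rng.randrange(num_actions) if rng.random() < eps else int(np.argmax(Q[s]))
            s_next, r, done = env.step(s, a)
            # Q-learning update on real data
            Q[s, a] += eta * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            # update the model on real data
            model[(s, a)] = (s_next, r)
            s = s_next
            # Q-learning updates on imaginary data sampled from the model
            for _ in range(n_planning):
                (ps, pa), (pns, pr) = rng.choice(list(model.items()))
                Q[ps, pa] += eta * (pr + gamma * np.max(Q[pns]) - Q[ps, pa])
    return Q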
In [Wu+23], they combine Dyna with PPO and GP world models. (Technically speaking, these on-policy approaches are not valid with Dyna, but they can work if the replay buffer used for policy training is not too stale.)
gradient-based solver. In [Man+19], they combine the MPO algorithm (Section 3.4.4) for continuous control
with uncertainty sets on the dynamics to learn a policy that optimizes for a worst case expected return
objective.
where P(τ) is the distribution over trajectories induced by the policy applied to the true world model, P(τ) = µ(s_0) ∏_{t=0}^{∞} M(s_{t+1}|s_t, a_t) π(a_t|s_t), and Q(τ) is the distribution over trajectories using the estimated world model, Q(τ) = µ(s_0) ∏_{t=0}^{∞} M̂(s_{t+1}|s_t, a_t) π(a_t|s_t). They then maximize this bound wrt π and M̂.
In [Ghu+22] they extend MNM to work with images (and other high dimensional states) by learning a
latent encoder Ê(zt |ot ) as well as latent dynamics M̂ (zt+1 |zt , at ), similar to other self-predictive methods
(Section 4.3.2.2). They call their method Aligned Latent Models.
p(s_{t+1:T}, r_{t+1:T}, a_{t:T−1}|s_t) = ∏_{i=t}^{T−1} π(a_i|s_i) M(s_{i+1}|s_i, a_i) R(r_{i+1}|s_i, a_i)    (4.14)
4.3.1.1 Observation-space world models
The simplest approach is to define M (s′ |s, a) as a conditional generative model over states. If the state space
is high dimensional (e.g., images), we can use standard techniques for image generation such as diffusion
(see e.g., the Diamond method of [Alo+24]). If the observed states are low-dimensional vectors, such as
proprioceptive states, we can use transformers (see e.g., the Transformer Dynamics Model of [Sch+23a]).
where p(ot |zt ) = D(ot |zt ) is the decoder, or likelihood function, and π(at |zt ) is the policy.
The world model is usually trained by maximizing the marginal likelihood of the observed outputs given
an action sequence. (We discuss non-likelihood based loss functions in Section 4.3.2.) Computing the marginal
likelihood requires marginalizing over the hidden variables zt+1:T . To make this computationally tractable,
it is common to use amortized variational inference, in which we train an encoder network, p(zt |ot ), to
approximate the posterior over the latents. Many papers have followed this basic approach, such as the
“world models” paper [HS18], and the methods we discuss below.
4.3.1.4 Dreamer
In this section, we summarize the approach used in Dreamer paper [Haf+20] and its recent extensions,
such as DreamerV2 [Haf+21] and DreamerV3 [Haf+23]. These are all based on the background planning
approach, in which the policy is trained on imaginary trajectories generated by a latent variable world model.
(Note that Dreamer is based on an earlier approach called PlaNet [Haf+19], which used MPC instead of
background planning.)
In Dreamer, the stochastic dynamic latent variables in Equation (4.15) are replaced by deterministic
dynamic latent variables ht , since this makes the model easier to train. (We will see that ht acts like the
posterior over the hidden state at time t − 1; this is also the prior predictive belief state before we see ot .) A
“static” stochastic variable ϵt is now generated for each time step, and acts like a “random effect” in order
to help generate the observations, without relying on ht to store all of the necessary information. (This
simplifies the recurrent latent state.) In more detail, Dreamer defines the following functions:2
• A hidden dynamics (sequence) model: ht+1 = U (ht , at , ϵt )
• A latent state prior: ϵ̂t ∼ P (ϵ̂t |ht )
• A latent state decoder (observation predictor): ôt ∼ D(ôt |ht , ϵ̂t ).
• A reward predictor: r̂t ∼ R(r̂t |ht , ϵ̂t )
• A latent state encoder: ϵt ∼ E(ϵt |ht , ot ).
2 To map from our notation to the notation in the paper, see the following key: o_t → x_t, U → f_ϕ (sequence model), P → p_ϕ(ẑ_t|h_t) (dynamics predictor), D → p_ϕ(x̂_t|h_t, ẑ_t) (decoder), E → q_ϕ(ϵ_t|h_t, x_t) (encoder).
Figure 4.2: Illustration of Dreamer world model as a factor graph (so squares are functions, circles are variables). We
have unrolled the forwards prediction for only 1 step. Also, we have omitted the reward prediction loss.
• A policy function: at ∼ π(at |ht )
See Figure 4.2 for an illustration of the system.
We now give a simplified explanation of how the world model is trained. The loss has the form
L_WM = E_{q(ϵ_{1:T})} [ ∑_{t=1}^{T} β_o L^o(o_t, ô_t) + β_z L^z(ϵ_t, ϵ̂_t) ]    (4.16)
where the β terms are different weights for each loss, and q is the posterior over the latents, given by
q(ϵ_{1:T}|h_0, o_{1:T}, a_{1:T}) = ∏_{t=1}^{T} E(ϵ_t|h_t, o_t) δ(h_t − U(h_{t−1}, a_{t−1}, ϵ_{t−1}))    (4.17)
4.3.1.5 Iris
The Iris method of [MAF22] follows the MBRL paradigm, in which it alternates between (1) learning a world model using real data D_r and then generating imaginary rollouts D_i using the WM, and (2) learning the policy given D_i and collecting new real data D_r′. In the model learning stage, Iris learns a discrete latent encoding using the VQ-VAE method, and then fits a transformer dynamics model to the latent codes. In the policy learning stage, it uses actor-critic methods. The Delta-Iris method of [MAF24] extends this by training the model to only predict the delta between neighboring frames. Note that, in both cases, the policy has the form a_t = π(o_t), where o_t is an image, so the rollouts need to be grounded in pixel space, and cannot be done purely in latent space.
3 The symlog function is defined as symlog(x) = sign(x) ln(|x| + 1), and its inverse is symexp(x) = sign(x)(exp(|x|) − 1). The
symlog function squashes large positive and negative values, while preserving small values.
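A direct transcription of these two functions:

import numpy as np

def symlog(x):
    # symlog(x) = sign(x) * ln(|x| + 1)
    return np.sign(x) * np.log(np.abs(x) + 1.0)

def symexp(x):
    # inverse of symlog: sign(x) * (exp(|x|) - 1)
    return np.sign(x) * (np.exp(np.abs(x)) - 1.0)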
Loss            | Policy      | Usage    | Examples
OP              | Observables | Dyna     | Diamond [Alo+24], Delta-Iris [MAF24]
OP              | Observables | MCTS     | TDM [Sch+23a]
OP              | Latents     | Dyna     | Dreamer [Haf+23]
RP, VP, PP      | Latents     | MCTS     | MuZero [Sch+20]
RP, VP, PP, ZP  | Latents     | MCTS     | EfficientZero [Ye+21]
RP, VP, ZP      | Latents     | MPC-CEM  | TD-MPC [HSW22b]
VP, ZP          | Latents     | Aux.     | Minimalist [Ni+24]
VP, ZP          | Latents     | Dyna     | DreamingV2 [OT22]
VP, ZP, OP      | Latents     | Dyna     | AIS [Sub+22]
POP             | Latents     | Dyna     | Denoised MDP [Wan+22]
Table 4.1: Summary of some world-modeling methods. The "loss" column refers to the loss used to train the latent encoder (if present) and the dynamics model (OP = observation prediction, ZP = latent state prediction, RP = reward prediction, VP = value prediction, PP = policy prediction, POP = partial observation prediction). The "policy" column refers to the input that is passed to the policy. (For MCTS methods, the policy is just used as a proposal over action sequences to initialize the search/optimization process.) The "usage" column refers to how the world model is used: for background planning (which we call "Dyna"), for decision-time planning (which we call "MCTS"), or just as an auxiliary loss on top of standard policy/value learning (which we call "Aux"). Thus Aux methods are single-stage ("end-to-end"), whereas the other methods are two-phase, and alternate between improving the world model and then using it to improve the policy (or to search for the optimal action).
Figure 4.3: Illustration of an encoder zt = E(ot ), which is passed to a value estimator vt = V (zt ), and a world model,
which predicts the next latent state ẑt+1 = M (zt , at ), the reward rt = R(zt , at ), and the termination (done) flag,
dt = done(zt ). From Figure C.2 of [AP23]. Used with kind permission of Doina Precup.
that two states s1 and s2 are bisimilar if P(s′|s1, a) ≈ P(s′|s2, a) and R(s1, a) = R(s2, a). From this, we can derive a continuous measure called the bisimulation metric [FPP04]. This has the advantage (compared to value equivalence) of being policy independent, but the disadvantage that it can be harder to compute [Cas20; Zha+21], although there has been recent progress on computationally efficient methods such as MICo [Cas+21] and KSMe [Cas+23].
where the LHS is the predicted mean of the next latent state under the true model, and the RHS is the
predicted mean under the learned dynamics model. We call this the EZP, which stands for expected z
prediction.4
A trivial way to minimize the (E)ZP loss is for the embedding to map everything to a constant vector, say E(D) = 0, in which case z_{t+1} will be trivial for the dynamics model M to predict. However, this is not a useful representation. This problem is known as representational collapse [Jin+22]. Fortunately, we can provably prevent collapse (at least for linear encoders) by using a frozen target network [Tan+23; Ni+24]. That is, we use the following auxiliary loss
where
ϕ̄ = ρ ϕ̄ + (1 − ρ) sg(ϕ)    (4.25)
is the (stop-gradient version of the) EMA of the encoder weights. (If we set ρ = 0, this is called a detached network.)
4 In [Ni+24], they also describe the ZP loss, which requires predicting the full distribution over z ′ using a stochastic transition
model. This is strictly more powerful, but somewhat more complicated, so we omit it for simplicity.
We can also train the latent encoder to predict the reward. Formally, we want to ensure we can satisfy
the following condition, which we call RP for “reward prediction”:
See Figure 4.3 for an illustration. In [Ni+24], they prove that a representation that satisfies ZP and RP is
enough to satisfy value equivalence (sufficiency for Q∗ ).
Methods that optimize ZP and VP loss have been used in many papers, such as Predictron [Sil+17b],
Value Prediction Networks [OSL17], Self Predictive Representations (SPR) [Sch+21], Efficient
Zero (Section 4.1.3.3), BYOL-Explore (Section 4.3.2.6), etc.
The value function and reward losses may be too sparse to learn efficiently. Although self-prediction loss can
help somewhat, it does not use any extra information from the environment as feedback. Consequently it is
natural to consider other kinds of prediction targets for learning the latent encoder (and dynamics). When
using MCTS, it is possible to compute what the policy should be for a given state, and this can be used as a
prediction target for the reactive policy at = π(zt ), which in turn can be used as a feedback signal for the
latent state. This method is used by MuZero (Section 4.1.3.2) and EfficientZero (Section 4.1.3.3).
Another natural target to use for learning the encoder and dynamics is the next observation, using a one-step
version of Equation (4.14). Indeed, [Ni+24] say that a representation ϕ satisfies the OP (observation
prediction) criterion if it satisfies the following condition:
where D is the decoder. In order to repeatedly apply this, we need to be able to update the encoding z = ϕ(D)
in a recursive or online way. Thus we must also satisfy the following recurrent encoder condition, which
[Ni+24] call Rec:
where U is the update operator. Note that belief state updates (as in a POMDP) satisfy this property.
Furthermore, belief states are a sufficient statistic to satisfy the OP condition. See Section 4.3.1.3 for a
discussion of generative models of this form. However, there are other approaches to partial observability
which work directly in prediction space (see Section 4.4.2).
We have argued that predicting all the observations is problematic, but not predicting them is also problematic.
A natural compromise is to predict some of the observations, or at least some function of them. This is known
as a partial world model (see e.g., [AP23]).
The best way to do this is an open research problem. A simple approach would be to predict all the
observations, but put a penalty on the resulting OP loss term. A more sophisticated approach would be
to structure the latent space so that we distinguish latent variables that are useful for learning Q∗ (i.e.,
which affect the reward and which are affected by the agent’s actions) from other latent variables that are
needed to explain parts of the observation but otherwise are not useful. We can then impose an information
bottleneck penalty on the latter, to prevent the agent focusing on irrelevant observational details. (See e.g.,
the denoised MDP method of [Wan+22].)
Figure 4.4: Illustration of (a simplified version of ) the BYOL-Explore architecture, represented as a factor graph (so
squares are functions, circles are variables). The dotted lines represent an optional observation prediction loss. The
map from notation in this figure to the paper is as follows: U → hc (closed-loop RNN update), P → ho (open-loop
RNN update), D → g (decoder), E → f (encoder). We have unrolled the forwards prediction for only 1 step. Also, we
have omitted the reward prediction loss. The V node is the EMA version of the value function. The TD node is the
TD operator.
4.3.2.6 BYOL-Explore
As an example of the above framework, consider the BYOL-Explore paper [Guo+22a], which uses a
non-generative world model trained with ZP and VP loss. (BYOL stands for "bootstrap your own latent".) See
Figure 4.4 for the computation graph, which we see is slightly simpler than the Dreamer computation graph
in Figure 4.2 due to the lack of stochastic latents. In addition to using self-prediction loss to help train the
latent representation, the error in this loss can be used to define an intrinsic reward, to encourage the agent
to explore states where the model is uncertain. See Section 5.2.4 for further discussion of this topic.
If C(st+1 ) = Rt+1 , this reduces to the value function.5 However, we can also define the GVF to predict
components of the observation vector; this is called nexting [MWS14], since it refers to next state prediction
at different timescales.
If we define the policy-dependent state-transition matrix by
T^π(s, s′) = ∑_a π(a|s) T(s′|s, a)    (4.31)
then the SR can be written as a discounted sum of powers of this matrix (under the convention of Equation (4.35), M^π = ∑_{t≥1} γ^{t−1} (T^π)^t). Thus we see that the SR replaces information about individual transitions with their cumulants, just as the value function replaces individual rewards with the reward-to-go.
Like the value function, the SR obeys a Bellman equation
M^π(s, s̃) = ∑_a π(a|s) ∑_{s′} T(s′|s, a) ( I(s′ = s̃) + γ M^π(s′, s̃) )    (4.33)
         = E [ I(s′ = s̃) + γ M^π(s′, s̃) ]    (4.34)
This gives rise to a TD update for the SR, in which the term in parentheses below plays the role of the TD error δ:
M^π(s, s̃) ← M^π(s, s̃) + η ( I(s′ = s̃) + γ M^π(s′, s̃) − M^π(s, s̃) )    (4.35)
where s′ is the next state sampled from T(s′|s, a). Compare this to the value-function TD update in Equation (2.16):
V^π(s) ← V^π(s) + η ( R(s′) + γ V^π(s′) − V^π(s) )    (4.36)
However, with an SR, we can easily compute the value function of a given policy for any reward function as follows:
V^{R,π}(s) = ∑_{s̃} M^π(s, s̃) R(s̃)    (4.37)
Similarly, the state-action version of the SR can be updated using
M^π(s, a, s̃) ← M^π(s, a, s̃) + η ( I(s′ = s̃) + γ M^π(s′, a′, s̃) − M^π(s, a, s̃) )    (4.40)
where s′ is the next state sampled from T (s′ |s, a) and a′ is the next action sampled from π(s′ ). Compare this
to the (on-policy) SARSA update from Equation (2.28):
However, from an SR, we can compute the state-action value function for any reward function:
Q^{R,π}(s, a) = ∑_{s̃} M^π(s, a, s̃) R(s̃)    (4.42)
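A minimal tabular sketch of the SR TD update in Equation (4.35), vectorized over s̃, together with the value computation of Equation (4.37); the array layout (an S × S matrix M) is an assumption of this sketch:

import numpy as np

def sr_td_update(M, s, s_next, eta, gamma):
    # One TD update of the successor representation, cf. Equation (4.35).
    indicator = np.zeros(M.shape[0])
    indicator[s_next] = 1.0                      # I(s' = s_tilde), for all s_tilde
    delta = indicator + gamma * M[s_next] - M[s]
    M[s] += eta * delta
    return M

def value_from_sr(M, reward):
    # V^{R,pi}(s) = sum_{s_tilde} M^pi(s, s_tilde) R(s_tilde), cf. Equation (4.37).
    return M @ reward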
Figure 4.5: Illustration of the successor representation for the 2d maze environment shown in (a) with the reward shown in (d), which assigns all states a reward of -0.1 except for the goal state, which has a reward of 1.0. In (b-c) we show the SRs for a random policy and the optimal policy. In (e-f) we show the corresponding value functions. In (b), we see that the SR under the random policy assigns high state occupancy values to states which are close (in Manhattan distance) to the current state s13 (e.g., M^π(s13, s14) = 5.97) and low values to states that are further away (e.g., M^π(s13, s12) = 0.16). In (c), we see that the SR under the optimal policy assigns high state occupancy values to states which are close to the optimal path to the goal (e.g., M^π(s13, s14) = 1.0) and which fade with distance from the current state along that path (e.g., M^π(s13, s12) = 0.66). From Figure 3 of [Car+24]. Used with kind permission of Wilka Carvalho. Generated by https://github.com/wcarvalho/jaxneurorl/blob/main/successor_representation.ipynb.
This can be used to improve the policy as we discuss in Section 4.4.4.1.
We see that the SR representation has the computational advantages of model-free RL (no need to do
explicit planning or rollouts in order to compute the optimal action), but also the flexibility of model-based
RL (we can easily change the reward function without having to learn a new value function). This latter
property makes SR particularly well suited to problems that use intrinsic reward (see Section 5.2.4), which
often changes depending on the information state of the agent.
Unfortunately, the SR is limited in several ways: (1) it assumes a finite, discrete state space; (2) it
depends on a given policy. We discuss ways to overcome limitation 1 in Section 4.4.3, and limitation 2 in
Section 4.4.4.1.
Thus µπ (s̃|s) tells us the probability that s̃ can be reached from s within a horizon determined by γ when
following π, even though we don’t know exactly when we will reach s̃.
SMs obey a Bellman-like recursion, which gives rise to the following TD update (the term in parentheses is the TD error δ):
µ^π(s̃|s, a) ← µ^π(s̃|s, a) + η ( (1 − γ) T(s′|s, a) + γ µ^π(s̃|s′, a′) − µ^π(s̃|s, a) )    (4.51)
where s′ is the next state sampled from T(s′|s, a) and a′ is the next action sampled from π(s′). With an SM, we can compute the state-action value for any reward:
Q^{R,π}(s, a) = (1 / (1 − γ)) E_{µ^π(s̃|s,a)} [ R(s̃) ]    (4.52)
This can be used to improve the policy as we discuss in Section 4.4.4.1.
where (T^π µ^π)(·|s, a) is the Bellman operator applied to µ^π and then evaluated at (s, a), i.e.,
(T^π µ^π)(s̃|s, a) = (1 − γ) T(s̃|s, a) + γ ∑_{s′} T(s′|s, a) ∑_{a′} π(a′|s′) µ^π(s̃|s′, a′)    (4.54)
We can sample from this as follows: first sample s′ ∼ T (s′ |s, a) from the environment and then with probability
1 − γ set s̃ = s′ and terminate. Otherwise sample a′ ∼ π(a′ |s′ ) and then create a bootstrap sample from the
model using s̃ ∼ µπ (s̃|s′ , a′ ).
There are many possible density models we can use for µπ . In [Tha+22], they use a VAE. In [Tom+24],
they use an autoregressive transformer applied to a set of discrete latent tokens, which are learned using
VQ-VAE or a non-reconstructive self-supervised loss. They call their method Video Occupancy Models.
An alternative approach to learning SMs, that avoids fitting a normalized density model over states, is to
use contrastive learning to estimate how likely s̃ is to occur after some number of steps, given (s, a), compared
to some randomly sampled negative state [ESL21; ZSE24]. Although we can’t sample from the resulting
learned model (we can only use it for evaluation), we can use it to improve a policy that achieves a target
state (an approach known as goal-conditioned policy learning, discussed in Section 5.3.1).
We will henceforth drop the ϕ superscript from the notation, for brevity. SFs obey a Bellman equation of the form ψ^π(s, a) = E[ ϕ(s, a, s′) + γ ψ^π(s′, a′) ], so they can be learned with TD methods, analogous to the SR updates above. If the reward can be written as R(s, a, s′) = ϕ(s, a, s′)^T w for some weight vector w, then we can derive the value function for any reward as follows: Q^π(s, a, w) = ψ^π(s, a)^T w.
This allows us to define multiple Q functions (and hence policies) just by changing the weight vector w, as we discuss in Section 4.4.4.1.
a*(s; w_new) = argmax_a max_i Q^{π_i}(s, a, w_new) = argmax_a max_i ψ^{π_i}(s, a)^T w_new    (4.65)
If w_new is in the span of the training tasks (i.e., there exist weights α_i such that w_new = ∑_i α_i w_i), then the GPI
theorem states that π(a|s) = I (a = a∗ (s, wnew )) will perform at least as well as any of the existing policies,
i.e., Qπ (s, a) ≥ maxi Qπi (s, a) (c.f., policy improvement in Section 3.4). See Figure 4.6 for an illustration.
Note that GPI is a model-free approach to computing a new policy, based on an existing library of policies.
In [Ale+23], they propose an extension that can also leverage a (possibly approximate) world model to learn
better policies that can outperform the library of existing policies by performing more decision-time search.
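A minimal sketch of the GPI action choice in Equation (4.65) at a single state, assuming the successor features ψ^{π_i}(s, ·) of the training policies have already been learned and stored in an array (the layout is an assumption of this sketch):

import numpy as np

def gpi_action(successor_features, w_new):
    # successor_features: array [num_policies, num_actions, d], entry [i, a] = psi^{pi_i}(s, a)
    # w_new:              reward weight vector of length d for the new task
    q_values = successor_features @ w_new        # Q^{pi_i}(s, a, w_new), shape [num_policies, num_actions]
    return int(np.argmax(np.max(q_values, axis=0)))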
(a) Successor Features. (b) Generalized Policy Improvement.
Figure 4.6: Illustration of successor features representation. (a) Here ϕt = ϕ(st ) is the vector of features for the state
at time t, and ψ π is the corresponding SF representation, which depends on the policy π. (b) Given a set of existing
policies and their SFs, we can create a new one by specifying a desired weight vector wnew and taking a weighted
combination of the existing SFs. From Figure 5 of [Car+24]. Used with kind permission of Wilka Carvalho.
Thus the policy πi that is chosen depends on the current state. In this way, the task vector w induces a set of policies that are each active for a period of time, similar to playing a chord on a piano.
Once the cumulant function is known, we have to learn the corresponding SF. The standard approach learns a different SF for every policy, which is limiting. In [Bor+19] they introduced Universal Successor Feature Approximators, which take as input a policy encoding zw representing a policy πw (typically we set zw = w). We then define
ψ πw (s, a) = ψ θ (s, a, zw ) (4.68)
The GPI update then becomes a maximization over policy encodings z (rather than over a finite set of policies), so we replace the discrete max over a finite number of policies with a continuous optimization problem (to be solved per state).
If we want to learn the policies and SFs at the same time, we can optimize the following losses in parallel:
where $a^* = \operatorname{argmax}_{a'} \psi_\theta(s',a',z_w)^T w$ (a sketch of these two losses is given at the end of this paragraph). The first is the standard Q-learning loss, and the second is the TD update rule in Equation (4.63) for the SF. In [Car+23], they present the Successor Features
Keyboard, that can learn the policy, the SFs and the task encoding zw , all simultaneously. They also
suggest replacing the squared error regression loss in Equation (4.70) with a cross-entropy loss, where each
dimension of the SF is now a discrete probability distribution over M possible values of the corresponding
feature. (c.f. Section 5.1.2).
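For concreteness, the two losses referred to above can be written as follows. This is a sketch based on the standard SF parameterization and the TD targets described above; the exact form (and equation numbering) used in the source and in [Car+23] may differ.

% A sketch of the two TD losses optimized in parallel (notation as above).
\begin{align*}
L_Q(\theta)    &= \Big( r + \gamma\, \psi_\theta(s', a^*, z_w)^\top w
                       - \psi_\theta(s, a, z_w)^\top w \Big)^2, \\
L_\psi(\theta) &= \big\lVert \phi(s, a) + \gamma\, \psi_\theta(s', a^*, z_w)
                       - \psi_\theta(s, a, z_w) \big\rVert_2^2,
\qquad a^* = \operatorname{argmax}_{a'} \psi_\theta(s', a', z_w)^\top w .
\end{align*}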
Chapter 5
Other topics in RL
5.1 Distributional RL
The distributional RL approach of [BDM17; BDR23] predicts the distribution of (discounted) returns, not just the expected return. More precisely, let $Z^\pi = \sum_{t=0}^{T} \gamma^t r_t$ be a random variable representing the reward-to-go. The standard value function is defined to compute the expectation of this variable: $V^\pi(s) = \mathbb{E}[Z^\pi | s_0 = s]$.
In DRL, we instead attempt to learn the full distribution, p(Z π |s0 = s). For a general review of distributional
regression, see [KSS23]. Below we briefly mention a few algorithms in this class that have been explored in
the context of RL.
Figure 5.1: Illustration of how to encode a scalar target y or distributional target Z using a categorical distribution.
From Figure 1 of [Far+24]. Used with kind permission of Jesse Farebrother.
An even simpler approach is to replace the distributional target with the standard scalar target (representing the mean), and then discretize this target and use cross entropy loss instead of squared error.1 Unfortunately, this encoding is lossy. In [Sch+20], they proposed the two-hot transform, which is a lossless encoding of the target based on putting appropriate weight on the nearest two bins (see Figure 5.1). In [IW18], they proposed the HL-Gauss histogram loss, which convolves the target value y with a Gaussian, and then discretizes the resulting continuous distribution. This is more symmetric than two-hot encoding, as shown in Figure 5.1. Regardless of how the discrete target is chosen, predictions are made using $\hat{y}(s;\theta) = \sum_k p_k(s) b_k$, where $p_k(s)$ is the probability of bin k, and $b_k$ is the bin center.
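The following is a minimal NumPy/SciPy sketch (not the authors' code) of the two encodings and of the prediction rule above; the bin locations and Gaussian width are purely illustrative.

import numpy as np
from scipy.stats import norm

def two_hot(y, bins):
    """Lossless two-hot encoding of a scalar y onto fixed bin centers (cf. [Sch+20])."""
    p = np.zeros(len(bins))
    k = np.clip(np.searchsorted(bins, y), 1, len(bins) - 1)
    lo, hi = bins[k - 1], bins[k]
    p[k - 1], p[k] = (hi - y) / (hi - lo), (y - lo) / (hi - lo)
    return p

def hl_gauss(y, bins, sigma=0.75):
    """HL-Gauss: smear y with a Gaussian, then integrate the density over each bin (cf. [IW18])."""
    edges = np.concatenate([[-np.inf], (bins[:-1] + bins[1:]) / 2, [np.inf]])
    cdf = norm.cdf(edges, loc=y, scale=sigma)   # sigma is a hyperparameter, often tied to bin width
    return np.diff(cdf)

bins = np.linspace(-5, 5, 21)      # bin centers b_k
p_two = two_hot(1.3, bins)         # exactly recovers 1.3 under the weighted-sum decoder
p_hlg = hl_gauss(1.3, bins)
y_hat = p_hlg @ bins               # prediction: sum_k p_k * b_k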
In [Far+24], they show that the HL-Gauss trick works much better than MSE, two-hot and C51 across a
variety of problems (both offline and online), especially when they scale to large networks. They conjecture
that the reason it beats MSE is that cross entropy is more robust to noisy targets (e.g., due to stochasticity)
and nonstationary targets. They also conjecture that the reason HL works better than two-hot is that HL is
closer to ordinal regression, and reduces overfitting by having a softer (more entropic) target distribution
(similar to label smoothing in classification problems).
1 In this way, even a method for predicting the mean leverages a distribution, for robustness and ease of optimization.
If the shaping term has the form $F(s,s') = \gamma\Phi(s') - \Phi(s)$, where $\Phi : S \rightarrow \mathbb{R}$ is a potential function, then we can guarantee that the sum of shaped rewards will match the sum of original rewards plus a constant. This is called Potential-Based Reward Shaping.
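A minimal sketch of potential-based shaping is given below; the potential function used in the example is purely illustrative.

def shaped_reward(r, s, s_next, potential, gamma=0.99):
    """Potential-based reward shaping: r' = r + gamma * Phi(s') - Phi(s).

    The shaping terms telescope along a trajectory, so every trajectory's return
    changes only by a constant depending on the start state, preserving the
    optimal policy.
    """
    return r + gamma * potential(s_next) - potential(s)

# Example: encourage progress towards a goal located at x = 10 (illustrative potential).
potential = lambda s: -abs(10 - s)
r_shaped = shaped_reward(r=0.0, s=3, s_next=4, potential=potential)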
In [Wie03], they prove that (in the tabular case) this approach is equivalent to initializing the value
function to V (s) = Φ(s). In [TMM19], they propose an extension called potential-based advice, where they
show that a potential of the form F (s, a, s′ , a′ ) = γΦ(s′ , a′ ) − Φ(s, a) is also valid (and more expressive). In
[Hu+20], they introduce a reward shaping function z which can be used to down-weight or up-weight the
shaping function:
r′ (s, a) = r(s, a) + zϕ (s, a)F (s, a) (5.2)
They use bilevel optimization to optimize ϕ wrt the original task performance.
To help filter out such random noise, [Pat+17] proposes an Intrinsic Curiosity Module. This first
learns an inverse dynamics model of the form a = f (s, s′ ), which tries to predict which action was used,
given that the agent was in s and is now in s′ . The classifier has the form softmax(g(ϕ(s), ϕ(s′ ), a)), where
z = ϕ(s) is a representation function that focuses on parts of the state that the agent can control. Then the
agent learns a forwards dynamics model in z-space. Finally, it defines the intrinsic reward as the prediction error of this forward model in latent space, i.e., the discrepancy between the predicted latent $\hat{z}'$ and the observed $z' = \phi(s')$. Thus the agent is rewarded for visiting states that lead to unpredictable consequences, where the difference in outcomes is measured in a (hopefully more meaningful) latent space.
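A minimal PyTorch sketch of this idea follows; it is not the architecture of [Pat+17], and the network sizes, stop-gradients, and data are illustrative.

import torch, torch.nn as nn

class ICM(nn.Module):
    """Minimal sketch of an intrinsic-curiosity-style module (inverse + forward model)."""
    def __init__(self, obs_dim, n_actions, z_dim=32):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(obs_dim, z_dim), nn.ReLU())   # encoder z = phi(s)
        self.inverse = nn.Linear(2 * z_dim, n_actions)                   # predicts a from (z, z')
        self.forward_model = nn.Linear(z_dim + n_actions, z_dim)         # predicts z' from (z, a)
        self.n_actions = n_actions

    def losses_and_bonus(self, s, a, s_next):
        z, z_next = self.phi(s), self.phi(s_next)
        logits = self.inverse(torch.cat([z, z_next], dim=-1))
        inv_loss = nn.functional.cross_entropy(logits, a)        # shapes phi towards controllable factors
        a_onehot = nn.functional.one_hot(a, self.n_actions).float()
        # Stop-gradient so the forward loss does not shape phi (one common choice).
        z_pred = self.forward_model(torch.cat([z.detach(), a_onehot], dim=-1))
        fwd_err = ((z_pred - z_next.detach()) ** 2).sum(-1)       # per-sample prediction error
        return inv_loss, fwd_err.mean(), fwd_err.detach()         # last term = intrinsic reward

icm = ICM(obs_dim=8, n_actions=4)
s, s2 = torch.randn(16, 8), torch.randn(16, 8)
a = torch.randint(0, 4, (16,))
inv_loss, fwd_loss, r_int = icm.losses_and_bonus(s, a, s2)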
Another solution is to replace the cross entropy with the KL divergence, R(s, a) = DKL (p||q) = Hce (p, q) −
H(p), which goes to zero once the learned model matches the true model, even for unpredictable states.
This has the desired effect of encouraging exploration towards states which have epistemic uncertainty
(reducible noise) but not aleatoric uncertainty (irreducible noise) [MP+22]. The BYOL-Hindsight method
of [Jar+23] is one recent approach that attempts to use the R(s, a) = DKL (p||q) objective. Unfortunately,
computing the DKL (p||q) term is much harder than the usual variational objective of DKL (q||p). A related
idea, proposed in the RL context by [Sch10], is to use the information gain as a reward. This is defined as $R_t(s_t, a_t) = D_{KL}(q(s_t|h_t, a_t, \theta_t) \,\|\, q(s_t|h_t, a_t, \theta_{t-1}))$, where $h_t$ is the history of past observations, and $\theta_t = \text{update}(\theta_{t-1}, h_t, a_t, s_t)$ are the new model parameters. This is closely related to the BALD (Bayesian Active Learning by Disagreement) criterion [Hou+11; KAG19], and has the advantage of being easier to compute, since it does not reference the true distribution p.
We will discuss goal-conditioned RL in Section 5.3.1. If the agent creates its own goals, then it provides
a way to explore the environment. The question of when and how an agent should switch to pursuing a new
goal is studied in [Pis+22] (see also [BS23]). Some other key work in this space includes the scheduled
auxiliary control method of [Rie+18], and the Go Explore algorithm in [Eco+19; Eco+21] and its recent
LLM extension [LHC24].
5.3 Hierarchical RL
So far we have focused on MDPs that work at a single time scale. However, this is very limiting. For example,
imagine planning a trip from San Francisco to New York: we need to choose high level actions first, such as
which airline to fly, and then medium level actions, such as how to get to the airport, followed by low level
actions, such as motor commands. Thus we need to consider actions that operate at multiple levels of temporal abstraction. This is called hierarchical RL or HRL. This is a big and important topic, and we only briefly mention a few key ideas and methods. Our summary is based in part on [Pat+22]. (See also Section 4.4
where we discuss multi-step predictive models; by contrast, in this section we focus on model-free methods.)
π2(s, g) → subgoal; π1(s, g) → subgoal; π0(s, g) → primitive action (each policy takes the current state and a goal as input).
Figure 5.2: Illustration of a 3 level hierarchical goal-conditioned controller. From http://bigai.cs.brown.edu/2019/09/03/hac.html. Used with kind permission of Andrew Levy.
will be stable, providing a stationary learning target. By contrast, if all policies are learned simultaneously,
the distribution becomes non-stationary, which makes learning harder. For more details, see the paper, or
the corresponding blog post (with animations) at http://bigai.cs.brown.edu/2019/09/03/hac.html.
5.3.2 Options
The feudal approach to HRL is somewhat limited, since not all subroutines or skills can be defined in terms
of reaching a goal state (even if it is a partially specified one, such as being in a desired location but without
specifying the velocity). For example, consider the skill of “driving in a circle”, or “finding food”. The options
framework is a more general framework for HRL first proposed in [SPS99]. We discuss this below.
5.3.2.1 Definitions
An option ω = (I, π, β) is a tuple consisting of: the initiation set Iω ⊂ S, which is a subset of states that this
option can start from (also called the affordances of each state [Khe+20]); the subpolicy πω (a|s) ∈ [0, 1];
and the termination condition βω (s) ∈ [0, 1], which gives the probability of finishing in state s. (This
induces a geometric distribution over option durations, which we denote by τ ∼ βω .) The set of all options is
denoted Ω.
To execute an option at step t entails choosing an action using at = πω(st) and then deciding whether to terminate at step t + 1 with probability βω(st+1) or to continue following the option with probability 1 − βω(st+1). (This is an example of a semi-Markov decision process [Put94].) If we define πω(s) = a and βω(s) = 1 for all s, then this option corresponds to the primitive action a, which terminates in one step. But with options we can
expand the repertoire of actions to include those that take many steps to finish.
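A minimal sketch of executing an option is shown below; the env.step interface, the integer state representation, and the toy chain environment are hypothetical stand-ins.

import random
from dataclasses import dataclass
from typing import Callable, Set

@dataclass
class Option:
    initiation_set: Set[int]                 # I_omega: states where the option may start
    policy: Callable[[int], int]             # pi_omega(s) -> a
    termination: Callable[[int], float]      # beta_omega(s) -> termination probability

def run_option(env, s, option, gamma=0.99):
    """Execute an option until it terminates; return (discounted reward, next state, duration)."""
    assert s in option.initiation_set
    total, discount, k = 0.0, 1.0, 0
    while True:
        a = option.policy(s)
        s, r, done = env.step(a)             # hypothetical environment interface
        total += discount * r
        discount *= gamma
        k += 1
        if done or random.random() < option.termination(s):
            return total, s, k

class _ToyChain:
    """Tiny deterministic chain used only to exercise run_option."""
    def __init__(self): self.s = 0
    def step(self, a):
        self.s = min(self.s + 1, 5)
        return self.s, 1.0, self.s == 5

go_right = Option(initiation_set={0, 1, 2}, policy=lambda s: 0, termination=lambda s: 0.25)
total, s_end, k = run_option(_ToyChain(), 0, go_right)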
To create an MDP with options, we need to define the reward function and dynamics model. The reward
is defined as follows:
$R(s,\omega) = \mathbb{E}\left[ R_1 + \gamma R_2 + \cdots + \gamma^{\tau-1} R_\tau \mid S_0 = s,\ A_{0:\tau-1} \sim \pi_\omega,\ \tau \sim \beta_\omega \right]$ (5.6)
Note that the discounted option dynamics model $p_\gamma(s'|s,\omega)$, which weights the event of terminating in $s'$ after $k$ steps by $\gamma^k$, is not a conditional probability distribution, because of the $\gamma^k$ term, but we can usually treat it like one. Note also that a dynamics model that can predict multiple steps ahead is sometimes called a jumpy model (see also Section 4.4.3.2).
We can use these definitions to define the value function for a hierarchical policy using a generalized Bellman equation, as follows:
$V_\pi(s) = \sum_{\omega \in \Omega(s)} \pi(\omega|s) \left[ R(s,\omega) + \sum_{s'} p_\gamma(s'|s,\omega) V_\pi(s') \right]$ (5.8)
We can compute this using value iteration. We can then learn a policy using policy iteration, or a policy
gradient method. In other words, once we have defined the options, we can use all the standard RL machinery.
Note that GCRL can be considered a special case of options where each option corresponds to a different
goal. Thus the reward function has the form R(s, ω) = I (s = ω), the termination function is βω (s) = I (s = ω),
and the initiation set is the entire state space.
The early work on options, including the MAXQ approach of [Die00], assumed that the set of options was
manually specified. Since then, many methods for learning options have been proposed. We mention a few of
these below.
The first set of methods for option learning rely on two stage training. In the first stage, exploration
methods are used to collect trajectories. Then this data is analysed, either by inferring hidden segments using
EM applied to a latent variable model [Dan+16], or by using the skill chaining method of [KB09], which
uses classifiers to segment the trajectories. The labeled data can then be used to define a set of options,
which can be trained using standard methods.
The second set of methods for option learning use end-to-end training, i.e., the options and their policies
are jointly learned online. For example, [BHP17] propose the option-critic architecture. The number
of options is manually specified, and all policies are randomly initialized. Then they are jointly trained
using policy gradient methods designed for semi-MDPs. (See also [RLT18] for a hierarchical extension of
option-critic to support options calling options.) However, since the learning signal is just the main task
reward, the method can work poorly in problems with sparse reward compared to subgoal methods (see
discussion in [Vez+17; Nac+19]).
Another problem with option-critic is that it requires specialized methods that are designed for optimizing
semi-MDPs. In [ZW19], they propose double actor critic, which allows the use of standard policy gradient
methods. This works by defining two parallel augmented MDPs, where the state space of each MDP is the
cross-product of the original state space and the set of options. The manager learns a policy over options, and
the worker learns a policy over states for each option. Both MDPs just use task rewards, without subgoals or
subtask rewards.
It has been observed that option learning using option-critic or double actor-critic can fail, in the sense
that the top level controller may learn to switch from one option to the next at almost every time step [ZW19;
Har+18]. The reason is that the optimal policy does not require the use of temporally extended options, but
instead can be defined in terms of primitive actions (as in standard RL). Therefore in [Har+18] they propose
to add a regularizer called the deliberation cost, in which the higher level policy is penalized whenever it
switches options. This can speed up learning, at the cost of a potentially suboptimal policy.
Another possible failure mode in option learning is if the higher level policy selects a single option for
the entire task duration. To combat this, [KP19] propose the Interest Option Critic, which learns the
initiation condition Iω so that the option is selected only in certain states of interest, rather than the entire
state space.
In [Mac+23], they discuss how the successor representation (discussed in Section 4.4) can be used to
define options, using a method they call the Representation-driven Option Discovery (ROD) cycle.
In [Lin+24b] they propose to represent options as programs, which are learned using LLMs.
5.4 Imitation learning
In previous sections, the goal of the RL agent was to learn an optimal sequential decision making policy so that the total reward is maximized. Imitation learning (IL), also known as apprenticeship learning and learning
from demonstration (LfD), is a different setting, in which the agent does not observe rewards, but has access
to a collection Dexp of trajectories generated by an expert policy πexp ; that is, τ = (s0 , a0 , s1 , a1 , . . . , sT )
and at ∼ πexp (st ) for τ ∈ Dexp . The goal is to learn a good policy by imitating the expert, in the absence
of reward signals. IL finds many applications in scenarios where we have demonstrations of experts (often
humans) but designing a good reward function is not easy, such as car driving and conversational systems.
(See also Section 5.5, where we discuss the closely related topic of offline RL, where we also learn from a
collection of trajectories, but no longer assume they are generated by an optimal policy.)
where the expectation wrt $p^\gamma_{\pi_{exp}}$ may be approximated by averaging over states in Dexp. A challenge with this method is that the loss does not consider the sequential nature of IL: the future state distribution is not fixed but instead depends on earlier actions. Therefore, if we learn a policy π̂ that has a low imitation error under the distribution $p^\gamma_{\pi_{exp}}$, as defined in Equation (5.9), it may still incur a large error under the distribution $p^\gamma_{\hat\pi}$
(when the policy π̂ is actually run). This problem has been tackled by the offline RL literature, which we
discuss in Section 5.5.
where $R_\theta$ is an unknown reward function with parameter θ. Abusing notation slightly, we denote by $R_\theta(\tau) = \sum_{t=0}^{T-1} R_\theta(s_t, a_t)$ the cumulative reward along the trajectory τ. This model assigns exponentially small probabilities to trajectories with lower cumulative rewards. The partition function, $Z_\theta \triangleq \int_\tau \exp(R_\theta(\tau))\, d\tau$, is in general intractable to compute, and must be approximated. Here, we can take a sample-based approach.
Let Dexp and D be the sets of trajectories generated by an expert, and by some known distribution q,
respectively. We may infer θ by maximizing the likelihood, p(Dexp |θ), or equivalently, minimizing the negative
log-likelihood loss
$L(\theta) = -\frac{1}{|D_{exp}|} \sum_{\tau \in D_{exp}} R_\theta(\tau) + \log \frac{1}{|D|} \sum_{\tau \in D} \frac{\exp(R_\theta(\tau))}{q(\tau)}$ (5.11)
(a) online reinforcement learning (b) off-policy reinforcement learning (c) offline reinforcement learning
Figure 5.3: Comparison of online on-policy RL, online off-policy RL, and offline RL. From Figure 1 of [Lev+20a].
Used with kind permission of Sergey Levine.
The term inside the log of the loss is an importance sampling estimate of Z that is unbiased as long as
q(τ ) > 0 for all τ . However, in order to reduce the variance, we can choose q adaptively as θ is being updated.
The optimal sampling distribution, q∗ (τ ) ∝ exp(Rθ (τ )), is hard to obtain. Instead, we may find a policy π̂
which induces a distribution that is close to q∗ , for instance, using methods of maximum entropy RL discussed
in Section 1.5.3. Interestingly, the process above produces the inferred reward Rθ as well as an approximate
optimal policy π̂. This approach is used by guided cost learning [FLA16], and has been found effective in robotics applications.
$\min_\pi \max_w\; \mathbb{E}_{p^\gamma_{\pi_{exp}}(s,a)}[T_w(s,a)] - \mathbb{E}_{p^\gamma_{\pi}(s,a)}[f^*(T_w(s,a))]$ (5.12)
5.5 Offline RL
Offline reinforcement learning (also called batch reinforcement learning [LGR12]) is concerned with
learning a reward maximizing policy from a fixed, static dataset, collected by some existing policy, known as
the behavior policy. Thus no interaction with the environment is allowed (see Figure 5.3). This makes
policy learning harder than the online case, since we do not know the consequences of actions that were not
taken in a given state, and cannot test any such “counterfactual” predictions by trying them. (This is the
same problem as in off-policy RL, which we discussed in Section 3.5.) In addition, the policy will be deployed
on new states that it may not have seen, requiring that the policy generalize out-of-distribution, which is the
main bottleneck for current offline RL methods [Par+24b].
A very simple and widely used offline RL method is known as behavior cloning or BC. This amounts to
training a policy to predict the observed output action at associated with each observed state st , so we aim
to ensure π(st ) ≈ at , as in supervised learning. This assumes the offline dataset was created by an expert,
and so falls under the umbrella of imitation learning (see Section 5.4.1 for details). By contrast, offline RL
methods can leverage suboptimal data. We give a brief summary of some of these methods below. For more
details, see e.g., [Lev+20b; Che+24b; Cet+24]. For some offline RL benchmarks, see D4RL [Fu+20], RL Unplugged [Gul+20], OGBench (Offline Goal-Conditioned benchmark) [Par+24a], and D5RL [Raf+24].
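Behavior cloning is just supervised learning on the logged (state, action) pairs. The following is a minimal sketch for discrete actions (the network sizes and data are illustrative, not from the source).

import torch, torch.nn as nn

def behavior_cloning_step(policy, optimizer, states, actions):
    """One supervised step of behavior cloning: make pi(s_t) match the logged action a_t."""
    logits = policy(states)                               # [batch, num_actions]
    loss = nn.functional.cross_entropy(logits, actions)   # MLE for discrete actions
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

policy = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 4))
opt = torch.optim.Adam(policy.parameters(), lr=3e-4)
states, actions = torch.randn(256, 8), torch.randint(0, 4, (256,))
loss = behavior_cloning_step(policy, opt, states, actions)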
In the policy constraint method, we use a modified form of actor-critic, which, at iteration k, uses an
update of the form
$Q^{\pi_{k+1}} \leftarrow \operatorname{argmin}_Q\; \mathbb{E}_{(s,a,s') \sim D}\left[ \left( Q(s,a) - \left( R(s,a) + \gamma\, \mathbb{E}_{\pi_k(a'|s')}[Q^{\pi_k}(s',a')] \right) \right)^2 \right]$ (5.13)
$\pi_{k+1} \leftarrow \operatorname{argmax}_\pi\; \mathbb{E}_{s \sim D}\left[ \mathbb{E}_{\pi(a|s)}[Q^{\pi_{k+1}}(s,a)] \right] \quad \text{s.t. } D(\pi, \pi_b) \leq \epsilon$ (5.14)
One problem with the above method is that we have to fit a parametric model to πb (a|s) in order to
evaluate the divergence term. Fortunately, in the case of KL, the divergence can be enforced implicitly, as in
the advantage weighted regression or AWR method of [Pen+19], the reward weighted regression method of [PS07], the advantage weighted actor critic or AWAC method of [Nai+20], and the advantage weighted behavior model or ABM method of [Sie+20]. In this approach, we first solve (nonparametrically) for the new policy under the KL divergence constraint to get $\overline{\pi}_{k+1}$, and then we project this into the required
policy function class via supervised regression, as follows:
$\overline{\pi}_{k+1}(a|s) \leftarrow \frac{1}{Z}\, \pi_b(a|s) \exp\left( \frac{1}{\alpha} Q^{\pi}_{k}(s,a) \right)$ (5.17)
$\pi_{k+1} \leftarrow \operatorname{argmin}_\pi\; D_{KL}(\overline{\pi}_{k+1} \,\|\, \pi)$ (5.18)
where µπ (s) = Eπ(a|s) [a] is the mean of the predicted action, and α is a hyper-parameter. As another example,
the DQL method of [WHZ23] optimizes a diffusion policy using
$\min_\pi\; L(\pi) = L_{\text{diffusion}}(\pi) + L_q(\pi) = L_{\text{diffusion}}(\pi) - \alpha\, \mathbb{E}_{s \sim D, a \sim \pi(\cdot|s)}[Q(s,a)]$ (5.20)
Finally, [Aga+22b] discusses how to transfer the policy from a previous agent to a new agent by combining
BC with Q learning.
where $E(B,w) = \mathbb{E}_{(s,a,s') \in B}\left[ \left( Q_w(s,a) - (r + \gamma \max_{a'} Q_w(s',a')) \right)^2 \right]$ is the usual loss for Q-learning, and
C(B, w) is some conservative penalty. In the conservative Q learning or CQL method of [Kum+20], we
use the following penalty term:
C(B, w) = Es∼B,a∼π(·|s) [Qw (s, a)] − E(s,a)∼B [Qw (s, a)] (5.23)
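A sketch of this penalty for discrete actions is shown below; following a common discrete-action variant of CQL, the expectation over π(·|s) is replaced by a logsumexp over actions (the exact choice in [Kum+20] depends on the regularizer used), and all names are illustrative.

import torch

def cql_penalty(q_values, actions_data):
    """Conservative penalty in the spirit of Eq. (5.23), sketched for discrete actions.

    q_values:     [batch, num_actions] Q_w(s, .) for states s ~ B.
    actions_data: [batch] actions actually taken in the buffer B.
    Pushes Q down on broadly sampled actions (via logsumexp, standing in for the
    policy expectation) and up on the dataset actions.
    """
    pushed_down = torch.logsumexp(q_values, dim=-1)
    pushed_up = q_values.gather(-1, actions_data.unsqueeze(-1)).squeeze(-1)
    return (pushed_down - pushed_up).mean()

q = torch.randn(32, 4)
a = torch.randint(0, 4, (32,))
penalty = cql_penalty(q, a)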
where $\text{RTG}_t = \sum_{k=t}^{T} r_k$ is the return to go. (For a comparison of decision transformers to other offline RL
methods, see [Bha+24].)
The diffuser method of [Jan+22] is a diffusion version of trajectory transformer, so it fits p(s1:T , a1:T , r1:T )
using diffusion, where the action space is assumed to be continuous. They also replace beam search with
classifier guidance. The decision diffuser method of [Aja+23] extends diffuser by using classifier-free guidance, where the conditioning signal is the reward-to-go, similar to the decision transformer. However, unlike
diffuser, the decision diffuser just models the future state trajectories (rather than learning a joint distribution
over states and actions), and infers the actions using an inverse dynamics model at = π(st , st+1 ), which is
trained using supervised learning.
One problem with the above approaches is that conditioning on a desired return and taking the predicted
action can fail dramatically in stochastic environments, since trajectories that result in a return may have
only achieved that return due to chance [PMB22; Yan+23; Bra+22; Vil+22]. (This is related to the optimism
bias in the control-as-inference approach discussed in Section 1.5.)
where the Q(s, a) term inside the max ensures conservatism (so Q lower bounds the value of the learned
policy), and the V πβ (s) term ensures “calibration” (so Q upper bounds the value of the behavior policy).
Then online finetuning is performed in the usual way.
the elementary components wt are sub-words (which allows for generalization), not words. So a more precise term would be
“tokens” instead of “words”.
3 The fact that the action (token) sequence is generated by an autoregressive policy inside the agent’s head is an implementation
detail, and not part of the problem specification; for example, the agent could instead use discrete diffusion to generate
at = (at,1 , . . . , at,Nt ).
5.6.1.1 RLHF
LLMs are usually trained with behavior cloning, i.e., MLE on a fixed dataset, such as a large text (and
tokenized image) corpus scraped from the web. This is called “pre-training”. We can then improve their
performance using RL, as we describe below; this is called “post-training”.
A common way to perform post-training is to use reinforcement learning from human feedback or
RLHF. This technique, which was popularized for LLMs by the InstructGPT paper [Ouy+22], works as follows.
First a large number of (context, answer0, answer1) tuples are generated, either by a human or an LLM.
Then human raters are asked if they prefer answer 0 or answer 1. Let y = 0 denote the event that they prefer
answer 0, and y = 1 the event that they prefer answer 1. We can then fit a model of the form
$p(y = 0 | a_0, a_1, c) = \frac{\exp(\phi(c, a_0))}{\exp(\phi(c, a_0)) + \exp(\phi(c, a_1))}$ (5.26)
using binary cross entropy loss, where ϕ(c, a) is some function that maps text to a scalar (interpreted as
logits). Typically ϕ(c, a) is a shallow MLP on top of the last layer of a pretrained LLM. Finally, we define
the reward function as R(s, a) = ϕ(s, a), where s is the context (e.g., a prompt or previous dialog state), and
a is the action (answer generated by LLM). We then use this reward to fine-tune the LLM using a policy
gradient method such as PPO (Section 3.4.3), or a simpler method such as RLOO [Ahm+24], which is based
on REINFORCE (Section 3.2).
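A minimal sketch of fitting this preference model is shown below; the reward_model callable and the toy embedding scorer are stand-ins (not the MLP-on-LLM architecture described above), and the data is random.

import torch

def preference_loss(reward_model, context, answer0, answer1, prefers_first):
    """Binary cross-entropy loss for the pairwise preference model of Eq. (5.26).

    reward_model(c, a) -> scalar logit phi(c, a).
    prefers_first is 1.0 when the rater chose answer0 (the document's y = 0 event).
    """
    logit0 = reward_model(context, answer0)
    logit1 = reward_model(context, answer1)
    # p(y = 0) = exp(phi0) / (exp(phi0) + exp(phi1)) = sigmoid(phi0 - phi1)
    return torch.nn.functional.binary_cross_entropy_with_logits(
        logit0 - logit1, prefers_first)

# Toy stand-in for phi(c, a): score by an (embedding) dot product.
emb = torch.nn.EmbeddingBag(1000, 16)
def reward_model(c, a):
    return (emb(c) * emb(a)).sum(-1)

c = torch.randint(0, 1000, (4, 10))
a0, a1 = torch.randint(0, 1000, (4, 12)), torch.randint(0, 1000, (4, 12))
y = torch.tensor([1.0, 0.0, 1.0, 1.0])   # 1 when answer0 was preferred
loss = preference_loss(reward_model, c, a0, a1, y)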
Note that this form of training assumes the agent just takes a single action (generates a single answer) in response to a single prompt, so it is learning the reward for a bandit problem, rather than the full MDP. Also, the
learned reward function is a known parametric model (since it is fit to the human feedback data), whereas in
RL, the reward is an unknown non-differentiable blackbox function. When viewed in this light, it becomes
clear that one can also use non-RL algorithms to improve performance of LLMs, such as DPO [Raf+23] or
the density estimation methods of [Dum+24]. For more details on RL for LLMs, see e.g., [Kau+23].
response from the user), making the method much slower than “reactive” LLM policies, which do not use look-ahead search (but still condition on the entire past context). However, once trained, it may be possible
to distill this slower “system 2” policy into a faster reactive “system 1” policy.
reward. The learned reward model is then used as a shaping function (Section 5.2.3) when training an agent
in the NetHack environment, which has very sparse reward.
In [Ma+24], they present the Eureka system, that learns the reward using bilevel optimization, with
RL on the inner loop and LLM-powered evolutionary search on the outer loop. In particular, in the inner
loop, given a candidate reward function Ri , we use PPO to train a policy, and then return a scalar quality
score Si = S(Ri ). In the outer loop, we ask an LLM to generate a new set of reward functions, Ri′ , given
a population of old reward functions and their scores, (Ri , Si ), which have been trained and evaluated in
parallel on a fleet of GPUs. The prompt also includes the source code of the environment simulator. Each
generated reward function Ri is represented as a Python function, that has access to the ground truth state
of the underlying robot simulator. The resulting system is able to learn a complex reward function that is
sufficient to train a policy (using PPO) that can control a simulated robot hand to perform various dexterous
manipulation tasks, including spinning a pen with its finger tips. In [Li+24], they present a somewhat related
approach and apply it to Minecraft.
In [Ven+24], they propose code as reward, in which they prompt a VLM with an initial and goal image,
and ask it to describe the corresponding sequence of tasks needed to reach the goal. They then ask the LLM
to synthesize code that checks for completion of each subtask (based on processing of object properties, such
as relative location, derived from the image). These reward functions are then “verified” by applying them to
an offline set of expert and random trajectories; a good reward function should allocate high reward to the
expert trajectories and low reward to the random ones. Finally, the reward functions are used as auxiliary
rewards inside an RL agent.
There are of course many other ways an LLM could be used to help learn reward functions, and this
remains an active area of research.
Figure 5.4: Illustration of how to use a pretrained LLM (combined with RAG) as a policy. From Figure 5 of [Par+23].
Used with kind permission of Joon Park.
from an external “memory”, rather than explicitly storing the entire history ht in the context (this is called
retrieval augmented generation or RAG); see Figure 5.4 for an illustration. Note that no explicit
learning (in the form of parametric updates) is performed in these systems; instead they rely entirely on
in-context learning (and prompt engineering).
An alternative approach is to enumerate all possible discrete actions, and use the LLM to score them in
terms of their likelihoods given the goal, and their suitability given a learned value function applied to the
current state, i.e. π(at = k|g, pt , ot , ht ) ∝ LLM(wk |gt , pt , ht )Vk (ot ), where gt is the current goal, wk is a text
description of action k, and Vk is the value function for action k. This is the approach used in the robotics
SayCan approach [Ich+23], where the primitive actions ak are separately trained goal-conditioned policies.
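A sketch of this scoring rule is given below; all interfaces (the LLM log-probability function, the per-skill value functions, and the toy scorers) are hypothetical stand-ins.

import numpy as np

def saycan_select(llm_logprob, value_fns, skill_descriptions, goal, obs):
    """Pick a skill by combining LLM 'usefulness' with learned 'feasibility' (SayCan-style).

    llm_logprob(text, goal) -> log p(text | goal) from a frozen LLM (hypothetical interface);
    value_fns[k](obs)       -> value of skill k in the current observation.
    """
    scores = [np.exp(llm_logprob(desc, goal)) * value_fns[k](obs)
              for k, desc in enumerate(skill_descriptions)]
    return int(np.argmax(scores))

# Toy usage with stand-in scorers.
skills = ["pick up the sponge", "go to the counter", "wipe the counter"]
llm_logprob = lambda text, goal: -float(len(text))
value_fns = [lambda obs, k=k: 0.5 + 0.1 * k for k in range(len(skills))]
best = saycan_select(llm_logprob, value_fns, skills, goal="clean the counter", obs=None)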
Calling the LLM at every step is very slow, so an alternative is to use the LLM to generate code that
represents (parts of) the policy. For example, the Voyager system in [Wan+24a] builds up a reusable skill
library (represented as Python functions), by alternating between environment exploration and prompting
the (frozen) LLM to generate new tasks and skills, given the feedback collected so far.
There are of course many other ways an LLM could be used to help learn policies, and this remains an
active area of research.
where Pr(p) is the prior probability of p, and we assume the likelihood is 1 if p can generate the observations
given the actions, and is 0 otherwise.
The key question is: what is a reasonable prior over programs? In [Hut05], Marcus Hutter proposed
to apply the idea of Solomonoff induction [Sol64] to the case of an online decision making agent. This
amounts to using the prior $\Pr(p) = 2^{-\ell(p)}$, where ℓ(p) is the length of program p. This prior favors shorter
programs, and the likelihood filters out programs that cannot explain the data.
The resulting agent is known as AIXI, where “AI” stands for “Artificial Intelligence” and “XI” refers to the Greek letter ξ used in Solomonoff induction. The AIXI agent has been called the “most intelligent
general-purpose agent possible” [HQC24], and can be viewed as the theoretical foundation of (universal)
artificial general intelligence or AGI.
Unfortunately, the AIXI agent is intractable to compute, since it relies on Solomonoff induction and
Kolmogorov complexity, both of which are intractable, but various approximations can be devised. For
example, we can approximate the expectimax with MCTS (see Section 4.1.3). Alternatively, [GM+24] showed
that it is possible to use meta learning to train a generic sequence predictor, such as a transformer or LSTM,
on data generated by random Turing machines, so that the transformer learns to approximate a universal
predictor. Another approach is to learn a policy (to avoid searching over action sequences) using TD-learning
(Section 2.3.2); the weighting term in the policy mixture requires that the agent predict its own future actions,
so this approach is known as self-AIXI [Cat+23].
Note that AIXI is a normative theory for optimal agents, but is not very practical, since it does not take
computational limitations into account. In [Aru+24a; Aru+24b], they describe an approach which extends
the above Bayesian framework, while also taking into account the data budget (due to limited environment
interactions) that real agents must contend with (which prohibits modeling the entire environment or
finding the optimal action). This approach, known as Capacity-Limited Bayesian RL (CBRL), combines
Bayesian inference, RL, and rate distortion theory, and can be seen as a normative theoretical foundation for
computationally bounded rational agents.
Bibliography
[Ale+23] L. N. Alegre, A. L. C. Bazzan, A. Nowé, and B. C. da Silva. “Multi-step generalized policy
improvement by leveraging approximate models”. In: NIPS. Vol. 36. Curran Associates, Inc.,
2023, pp. 38181–38205. url: https://proceedings.neurips.cc/paper_files/paper/2023/
hash/77c7faab15002432ba1151e8d5cc389a-Abstract-Conference.html.
[Alo+24] E. Alonso, A. Jelley, V. Micheli, A. Kanervisto, A. Storkey, T. Pearce, and F. Fleuret. “Diffusion
for world modeling: Visual details matter in Atari”. In: arXiv [cs.LG] (May 2024). url: http:
//arxiv.org/abs/2405.12399.
[AM89] B. D. Anderson and J. B. Moore. Optimal Control: Linear Quadratic Methods. Prentice-Hall
International, Inc., 1989.
[Ama98] S Amari. “Natural Gradient Works Efficiently in Learning”. In: Neural Comput. 10.2 (1998),
pp. 251–276. url: http://dx.doi.org/10.1162/089976698300017746.
[AMH23] A. Aubret, L. Matignon, and S. Hassas. “An information-theoretic perspective on intrinsic
motivation in reinforcement learning: A survey”. en. In: Entropy 25.2 (Feb. 2023), p. 327. url:
https://www.mdpi.com/1099-4300/25/2/327.
[Ami+21] S. Amin, M. Gomrokchi, H. Satija, H. van Hoof, and D. Precup. “A survey of exploration
methods in reinforcement learning”. In: arXiv [cs.LG] (Aug. 2021). url: http://arxiv.org/
abs/2109.00157.
[An+21] G. An, S. Moon, J.-H. Kim, and H. O. Song. “Uncertainty-Based Offline Reinforcement Learning
with Diversified Q-Ensemble”. In: NIPS. Vol. 34. Dec. 2021, pp. 7436–7447. url: https://
proceedings.neurips.cc/paper_files/paper/2021/file/3d3d286a8d153a4a58156d0e02d8570c-
Paper.pdf.
[And+17] M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin,
P. Abbeel, and W. Zaremba. “Hindsight Experience Replay”. In: arXiv [cs.LG] (July 2017).
url: http://arxiv.org/abs/1707.01495.
[And+20] O. M. Andrychowicz et al. “Learning dexterous in-hand manipulation”. In: Int. J. Rob. Res.
39.1 (2020), pp. 3–20. url: https://doi.org/10.1177/0278364919887447.
[Ant+22] I. Antonoglou, J. Schrittwieser, S. Ozair, T. K. Hubert, and D. Silver. “Planning in Stochastic
Environments with a Learned Model”. In: ICLR. 2022. url: https://openreview.net/forum?
id=X6D9bAHhBQ1.
[AP23] S. Alver and D. Precup. “Minimal Value-Equivalent Partial Models for Scalable and Robust
Planning in Lifelong Reinforcement Learning”. en. In: Conference on Lifelong Learning Agents.
PMLR, Nov. 2023, pp. 548–567. url: https://proceedings.mlr.press/v232/alver23a.
html.
[AP24] S. Alver and D. Precup. “A Look at Value-Based Decision-Time vs. Background Planning
Methods Across Different Settings”. In: Seventeenth European Workshop on Reinforcement
Learning. Oct. 2024. url: https://openreview.net/pdf?id=Vx2ETvHId8.
[Arb+23] J. Arbel, K. Pitas, M. Vladimirova, and V. Fortuin. “A Primer on Bayesian Neural Networks:
Review and Debates”. In: arXiv [stat.ML] (Sept. 2023). url: http://arxiv.org/abs/2309.
16314.
[ARKP24] S. Alver, A. Rahimi-Kalahroudi, and D. Precup. “Partial models for building adaptive model-
based reinforcement learning agents”. In: COLLAS. May 2024. url: https://arxiv.org/abs/
2405.16899.
[Aru+17] K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath. “A Brief Survey of Deep
Reinforcement Learning”. In: IEEE Signal Processing Magazine, Special Issue on Deep Learning
for Image Understanding (2017). url: http://arxiv.org/abs/1708.05866.
[Aru+24a] D. Arumugam, M. K. Ho, N. D. Goodman, and B. Van Roy. “Bayesian Reinforcement Learning
With Limited Cognitive Load”. en. In: Open Mind 8 (Apr. 2024), pp. 395–438. url: https:
//direct.mit.edu/opmi/article- pdf/doi/10.1162/opmi_a_00132/2364075/opmi_a_
00132.pdf.
[Aru+24b] D. Arumugam, S. Kumar, R. Gummadi, and B. Van Roy. “Satisficing exploration for deep
reinforcement learning”. In: Finding the Frame Workshop at RLC. July 2024. url: https:
//openreview.net/forum?id=tHCpsrzehb.
[AS22] D. Arumugam and S. Singh. “Planning to the information horizon of BAMDPs via epistemic
state abstraction”. In: NIPS. Oct. 2022.
[AS66] S. M. Ali and S. D. Silvey. “A General Class of Coefficients of Divergence of One Distribution
from Another”. In: J. R. Stat. Soc. Series B Stat. Methodol. 28.1 (1966), pp. 131–142. url:
http://www.jstor.org/stable/2984279.
[ASN20] R. Agarwal, D. Schuurmans, and M. Norouzi. “An Optimistic Perspective on Offline Reinforce-
ment Learning”. en. In: ICML. PMLR, Nov. 2020, pp. 104–114. url: https://proceedings.
mlr.press/v119/agarwal20c.html.
[Att03] H. Attias. “Planning by Probabilistic Inference”. In: AI-Stats. 2003. url: http://research.
goldenmetallic.com/aistats03.pdf.
[AY20] B. Amos and D. Yarats. “The Differentiable Cross-Entropy Method”. In: ICML. 2020. url:
http://arxiv.org/abs/1909.12830.
[Bad+20] A. P. Badia, B. Piot, S. Kapturowski, P Sprechmann, A. Vitvitskyi, D. Guo, and C Blundell.
“Agent57: Outperforming the Atari Human Benchmark”. In: ICML 119 (Mar. 2020), pp. 507–517.
url: https://proceedings.mlr.press/v119/badia20a/badia20a.pdf.
[Bai95] L. C. Baird. “Residual Algorithms: Reinforcement Learning with Function Approximation”. In:
ICML. 1995, pp. 30–37.
[Bal+23] P. J. Ball, L. Smith, I. Kostrikov, and S. Levine. “Efficient Online Reinforcement Learning with
Offline Data”. en. In: ICML. PMLR, July 2023, pp. 1577–1594. url: https://proceedings.
mlr.press/v202/ball23a.html.
[Ban+23] D. Bansal, R. T. Q. Chen, M. Mukadam, and B. Amos. “TaskMet: Task-driven metric learning for
model learning”. In: NIPS. Ed. by A Oh, T Naumann, A Globerson, K Saenko, M Hardt, and S
Levine. Vol. abs/2312.05250. Dec. 2023, pp. 46505–46519. url: https://proceedings.neurips.
cc / paper _ files / paper / 2023 / hash / 91a5742235f70ae846436d9780e9f1d4 - Abstract -
Conference.html.
[Bar+17] A. Barreto, W. Dabney, R. Munos, J. J. Hunt, T. Schaul, H. P. van Hasselt, and D. Silver. “Suc-
cessor Features for Transfer in Reinforcement Learning”. In: NIPS. Vol. 30. 2017. url: https://
proceedings.neurips.cc/paper_files/paper/2017/file/350db081a661525235354dd3e19b8c05-
Paper.pdf.
[Bar+19] A. Barreto et al. “The Option Keyboard: Combining Skills in Reinforcement Learning”. In:
NIPS. Vol. 32. 2019. url: https://proceedings.neurips.cc/paper_files/paper/2019/
file/251c5ffd6b62cc21c446c963c76cf214-Paper.pdf.
[Bar+20] A. Barreto, S. Hou, D. Borsa, D. Silver, and D. Precup. “Fast reinforcement learning with
generalized policy updates”. en. In: PNAS 117.48 (Dec. 2020), pp. 30079–30087. url: https:
//www.pnas.org/doi/abs/10.1073/pnas.1907370117.
[BBS95] A. G. Barto, S. J. Bradtke, and S. P. Singh. “Learning to act using real-time dynamic pro-
gramming”. In: AIJ 72.1 (1995), pp. 81–138. url: http://www.sciencedirect.com/science/
article/pii/000437029400011O.
[BDG00] C. Boutilier, R. Dearden, and M. Goldszmidt. “Stochastic dynamic programming with factored
representations”. en. In: Artif. Intell. 121.1-2 (Aug. 2000), pp. 49–107. url: http://dx.doi.
org/10.1016/S0004-3702(00)00033-3.
[BDM10] M. Briers, A. Doucet, and S. Maskel. “Smoothing algorithms for state-space models”. In: Annals
of the Institute of Statistical Mathematics 62.1 (2010), pp. 61–89.
[BDM17] M. G. Bellemare, W. Dabney, and R. Munos. “A Distributional Perspective on Reinforcement
Learning”. In: ICML. 2017. url: http://arxiv.org/abs/1707.06887.
[BDR23] M. G. Bellemare, W. Dabney, and M. Rowland. Distributional Reinforcement Learning. http:
//www.distributional-rl.org. MIT Press, 2023.
[Bel+13] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. “The Arcade Learning Environment:
An Evaluation Platform for General Agents”. In: JAIR 47 (2013), pp. 253–279.
[Bel+16] M. G. Bellemare, S. Srinivasan, G. Ostrovski, T. Schaul, D. Saxton, and R. Munos. “Unifying
Count-Based Exploration and Intrinsic Motivation”. In: NIPS. 2016. url: http://arxiv.org/
abs/1606.01868.
[Ber19] D. Bertsekas. Reinforcement learning and optimal control. Athena Scientific, 2019. url: http:
//www.mit.edu/~dimitrib/RLbook.html.
[Ber24] D. P. Bertsekas. “Model Predictive Control and Reinforcement Learning: A unified framework
based on Dynamic Programming”. In: arXiv [eess.SY] (June 2024). url: http://arxiv.org/
abs/2406.00592.
[Bha+24] P. Bhargava, R. Chitnis, A. Geramifard, S. Sodhani, and A. Zhang. “When should we prefer
Decision Transformers for Offline Reinforcement Learning?” In: ICLR. 2024. url: https :
//arxiv.org/abs/2305.14550.
[BHP17] P.-L. Bacon, J. Harb, and D. Precup. “The Option-Critic Architecture”. In: AAAI. 2017.
[BKH16] J. L. Ba, J. R. Kiros, and G. E. Hinton. “Layer Normalization”. In: (2016). arXiv: 1607.06450
[stat.ML]. url: http://arxiv.org/abs/1607.06450.
[BLM16] S. Boucheron, G. Lugosi, and P. Massart. Concentration Inequalities: A Nonasymptotic Theory
of Independence. Oxford University Press, 2016.
[BM+18] G. Barth-Maron, M. W. Hoffman, D. Budden, W. Dabney, D. Horgan, T. B. Dhruva, A. Muldal,
N. Heess, and T. Lillicrap. “Distributed Distributional Deterministic Policy Gradients”. In:
ICLR. 2018. url: https://openreview.net/forum?id=SyZipzbCb&noteId=SyZipzbCb.
[BMS11] S. Bubeck, R. Munos, and G. Stoltz. “Pure Exploration in Finitely-armed and Continuous-armed
Bandits”. In: Theoretical Computer Science 412.19 (2011), pp. 1832–1852.
[Boe+05] P.-T. de Boer, D. P. Kroese, S. Mannor, and R. Y. Rubinstein. “A Tutorial on the Cross-Entropy
Method”. en. In: Ann. Oper. Res. 134.1 (2005), pp. 19–67. url: https://link.springer.com/
article/10.1007/s10479-005-5724-z.
[Bor+19] D. Borsa, A. Barreto, J. Quan, D. J. Mankowitz, H. van Hasselt, R. Munos, D. Silver, and
T. Schaul. “Universal Successor Features Approximators”. In: ICLR. 2019. url: https://
openreview.net/pdf?id=S1VWjiRcKX.
[Bos16] N. Bostrom. Superintelligence: Paths, Dangers, Strategies. en. London, England: Oxford Uni-
versity Press, Mar. 2016. url: https://www.amazon.com/Superintelligence- Dangers-
Strategies-Nick-Bostrom/dp/0198739834.
[Bou+23] K. Bousmalis et al. “RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation”.
In: TMLR (June 2023). url: http://arxiv.org/abs/2306.11706.
[Bra+22] D. Brandfonbrener, A. Bietti, J. Buckman, R. Laroche, and J. Bruna. “When does return-
conditioned supervised learning work for offline reinforcement learning?” In: NIPS. June 2022.
url: http://arxiv.org/abs/2206.01079.
[BS23] A. Bagaria and T. Schaul. “Scaling goal-based exploration via pruning proto-goals”. en. In: IJCAI.
Aug. 2023, pp. 3451–3460. url: https://dl.acm.org/doi/10.24963/ijcai.2023/384.
[BSA83] A. G. Barto, R. S. Sutton, and C. W. Anderson. “Neuronlike adaptive elements that can
solve difficult learning control problems”. In: SMC 13.5 (1983), pp. 834–846. url: http :
//dx.doi.org/10.1109/TSMC.1983.6313077.
[BT12] M. Botvinick and M. Toussaint. “Planning as inference”. en. In: Trends Cogn. Sci. 16.10 (2012),
pp. 485–488. url: https://pdfs.semanticscholar.org/2ba7/88647916f6206f7fcc137fe7866c58e6211e.
pdf.
[Buc+17] C. L. Buckley, C. S. Kim, S. McGregor, and A. K. Seth. “The free energy principle for action
and perception: A mathematical review”. In: J. Math. Psychol. 81 (2017), pp. 55–79. url:
https://www.sciencedirect.com/science/article/pii/S0022249617300962.
[Bur+18] Y. Burda, H. Edwards, A Storkey, and O. Klimov. “Exploration by random network distillation”.
In: ICLR. Vol. abs/1810.12894. Sept. 2018.
[BXS20] H. Bharadhwaj, K. Xie, and F. Shkurti. “Model-Predictive Control via Cross-Entropy and
Gradient-Based Optimization”. en. In: Learning for Dynamics and Control. PMLR, July 2020,
pp. 277–286. url: https://proceedings.mlr.press/v120/bharadhwaj20a.html.
[CA13] E. F. Camacho and C. B. Alba. Model predictive control. Springer, 2013.
[Cao+24] Y. Cao, H. Zhao, Y. Cheng, T. Shu, G. Liu, G. Liang, J. Zhao, and Y. Li. “Survey on large
language model-enhanced reinforcement learning: Concept, taxonomy, and methods”. In: arXiv
[cs.LG] (Mar. 2024). url: http://arxiv.org/abs/2404.00282.
[Car+23] W. C. Carvalho, A. Saraiva, A. Filos, A. Lampinen, L. Matthey, R. L. Lewis, H. Lee, S. Singh, D.
Jimenez Rezende, and D. Zoran. “Combining Behaviors with the Successor Features Keyboard”.
In: NIPS. Vol. 36. 2023, pp. 9956–9983. url: https://proceedings.neurips.cc/paper_
files / paper / 2023 / hash / 1f69928210578f4cf5b538a8c8806798 - Abstract - Conference .
html.
[Car+24] W. Carvalho, M. S. Tomov, W. de Cothi, C. Barry, and S. J. Gershman. “Predictive rep-
resentations: building blocks of intelligence”. In: Neural Comput. (Feb. 2024). url: https:
//gershmanlab.com/pubs/Carvalho24.pdf.
[Cas11] P. S. Castro. “On planning, prediction and knowledge transfer in Fully and Partially Observable
Markov Decision Processes”. en. PhD thesis. McGill, 2011. url: https://www.proquest.com/
openview/d35984acba38c072359f8a8d5102c777/1?pq-origsite=gscholar&cbl=18750.
[Cas20] P. S. Castro. “Scalable methods for computing state similarity in deterministic Markov Decision
Processes”. In: AAAI. 2020.
[Cas+21] P. S. Castro, T. Kastner, P. Panangaden, and M. Rowland. “MICo: Improved representations
via sampling-based state similarity for Markov decision processes”. In: NIPS. Nov. 2021. url:
https://openreview.net/pdf?id=wFp6kmQELgu.
[Cas+23] P. S. Castro, T. Kastner, P. Panangaden, and M. Rowland. “A kernel perspective on behavioural
metrics for Markov decision processes”. In: TMLR abs/2310.19804 (Oct. 2023). url: https:
//openreview.net/pdf?id=nHfPXl1ly7.
[Cat+23] E. Catt, J. Grau-Moya, M. Hutter, M. Aitchison, T. Genewein, G. Delétang, K. Li, and J.
Veness. “Self-Predictive Universal AI”. In: NIPS. Vol. 36. 2023, pp. 27181–27198. url: https://
proceedings.neurips.cc/paper_files/paper/2023/hash/56a225639da77e8f7c0409f6d5ba996b-
Abstract-Conference.html.
[Cen21] Center for Research on Foundation Models (CRFM). “On the Opportunities and Risks of
Foundation Models”. In: (2021). arXiv: 2108.07258 [cs.LG]. url: http://arxiv.org/abs/
2108.07258.
[Cet+24] E. Cetin, A. Tirinzoni, M. Pirotta, A. Lazaric, Y. Ollivier, and A. Touati. “Simple ingredients
for offline reinforcement learning”. In: arXiv [cs.LG] (Mar. 2024). url: http://arxiv.org/
abs/2403.13097.
[Cha+21] A. Chan, H. Silva, S. Lim, T. Kozuno, A. Mahmood, and M. White. “Greedification operators for policy optimization: Investigating forward and reverse KL divergences”. In: JMLR 23.253 (2022), pp. 1–79. url: http://jmlr.org/papers/v23/21-054.html.
[Che+20] X. Chen, C. Wang, Z. Zhou, and K. W. Ross. “Randomized Ensembled Double Q-Learning:
Learning Fast Without a Model”. In: ICLR. Oct. 2020. url: https://openreview.net/pdf?
id=AY8zfZm0tDd.
[Che+21a] C. Chen, Y.-F. Wu, J. Yoon, and S. Ahn. “TransDreamer: Reinforcement Learning with
Transformer World Models”. In: Deep RL Workshop NeurIPS. 2021. url: http://arxiv.org/
abs/2202.09481.
[Che+21b] L. Chen, K. Lu, A. Rajeswaran, K. Lee, A. Grover, M. Laskin, P. Abbeel, A. Srinivas, and
I. Mordatch. “Decision Transformer: Reinforcement Learning via Sequence Modeling”. In: arXiv
[cs.LG] (June 2021). url: http://arxiv.org/abs/2106.01345.
[Che+24a] F. Che, C. Xiao, J. Mei, B. Dai, R. Gummadi, O. A. Ramirez, C. K. Harris, A. R. Mahmood, and
D. Schuurmans. “Target networks and over-parameterization stabilize off-policy bootstrapping
with function approximation”. In: ICML. May 2024.
[Che+24b] J. Chen, B. Ganguly, Y. Xu, Y. Mei, T. Lan, and V. Aggarwal. “Deep Generative Models for
Offline Policy Learning: Tutorial, Survey, and Perspectives on Future Directions”. In: TMLR
(Feb. 2024). url: https://openreview.net/forum?id=Mm2cMDl9r5.
[Che+24c] J. Chen, B. Ganguly, Y. Xu, Y. Mei, T. Lan, and V. Aggarwal. “Deep Generative Models for
Offline Policy Learning: Tutorial, Survey, and Perspectives on Future Directions”. In: TMLR
(Feb. 2024). url: https://openreview.net/forum?id=Mm2cMDl9r5.
[Che+24d] W. Chen, O. Mees, A. Kumar, and S. Levine. “Vision-language models provide promptable
representations for reinforcement learning”. In: arXiv [cs.LG] (Feb. 2024). url: http://arxiv.
org/abs/2402.02651.
[Chr19] P. Christodoulou. “Soft Actor-Critic for discrete action settings”. In: arXiv [cs.LG] (Oct. 2019).
url: http://arxiv.org/abs/1910.07207.
[Chu+18] K. Chua, R. Calandra, R. McAllister, and S. Levine. “Deep Reinforcement Learning in a Handful
of Trials using Probabilistic Dynamics Models”. In: NIPS. 2018. url: http://arxiv.org/abs/
1805.12114.
[CL11] O. Chapelle and L. Li. “An empirical evaluation of Thompson sampling”. In: NIPS. 2011.
[CMS07] B. Colson, P. Marcotte, and G. Savard. “An overview of bilevel optimization”. en. In: Ann. Oper.
Res. 153.1 (Sept. 2007), pp. 235–256. url: https://link.springer.com/article/10.1007/
s10479-007-0176-2.
[Cob+19] K. Cobbe, O. Klimov, C. Hesse, T. Kim, and J. Schulman. “Quantifying Generalization in
Reinforcement Learning”. en. In: ICML. May 2019, pp. 1282–1289. url: https://proceedings.
mlr.press/v97/cobbe19a.html.
[Col+22] C. Colas, T. Karch, O. Sigaud, and P.-Y. Oudeyer. “Autotelic agents with intrinsically motivated
goal-conditioned reinforcement learning: A short survey”. en. In: JAIR 74 (July 2022), pp. 1159–
1199. url: https://www.jair.org/index.php/jair/article/view/13554.
[CS04] I. Csiszár and P. C. Shields. “Information theory and statistics: A tutorial”. In: Foundations and Trends in Communications and Information Theory 1.4 (2004).
[Csi67] I. Csiszar. “Information-Type Measures of Difference of Probability Distributions and Indirect
Observations”. In: Studia Scientiarum Mathematicarum Hungarica 2 (1967), pp. 299–318.
[CVRM23] F. Che, G. Vasan, and A. R. Mahmood. “Correcting discount-factor mismatch in on-policy policy gradient methods”. en. In: ICML. PMLR, July 2023, pp. 4218–4240. url: https://proceedings.mlr.press/v202/che23a.html.
[Dab+17] W. Dabney, M. Rowland, M. G. Bellemare, and R. Munos. “Distributional reinforcement learning
with quantile regression”. In: arXiv [cs.AI] (Oct. 2017). url: http://arxiv.org/abs/1710.
10044.
[Dab+18] W. Dabney, G. Ostrovski, D. Silver, and R. Munos. “Implicit quantile networks for distributional
reinforcement learning”. In: arXiv [cs.LG] (June 2018). url: http://arxiv.org/abs/1806.
06923.
[Dan+16] C. Daniel, H. van Hoof, J. Peters, and G. Neumann. “Probabilistic inference for determining
options in reinforcement learning”. en. In: Mach. Learn. 104.2-3 (Sept. 2016), pp. 337–357. url:
https://link.springer.com/article/10.1007/s10994-016-5580-x.
[Day93] P. Dayan. “Improving generalization for temporal difference learning: The successor representa-
tion”. en. In: Neural Comput. 5.4 (July 1993), pp. 613–624. url: https://ieeexplore.ieee.
org/abstract/document/6795455.
[DFR15] M. P. Deisenroth, D. Fox, and C. E. Rasmussen. “Gaussian Processes for Data-Efficient Learning
in Robotics and Control”. en. In: IEEE PAMI 37.2 (2015), pp. 408–423. url: http://dx.doi.
org/10.1109/TPAMI.2013.218.
[DH92] P. Dayan and G. E. Hinton. “Feudal Reinforcement Learning”. In: NIPS 5 (1992). url: https://
proceedings.neurips.cc/paper_files/paper/1992/file/d14220ee66aeec73c49038385428ec4c-
Paper.pdf.
[Die00] T. G. Dietterich. “Hierarchical reinforcement learning with the MAXQ value function decompo-
sition”. en. In: JAIR 13 (Nov. 2000), pp. 227–303. url: https://www.jair.org/index.php/
jair/article/view/10266.
[Die+07] M. Diehl, H. G. Bock, H. Diedam, and P.-B. Wieber. “Fast Direct Multiple Shooting Algo-
rithms for Optimal Robot Control”. In: Lecture Notes in Control and Inform. Sci. 340 (2007).
url: https://www.researchgate.net/publication/29603798_Fast_Direct_Multiple_
Shooting_Algorithms_for_Optimal_Robot_Control.
[DMKM22] G. Duran-Martin, A. Kara, and K. Murphy. “Efficient Online Bayesian Inference for Neural
Bandits”. In: AISTATS. 2022. url: http://arxiv.org/abs/2112.00195.
[D’O+22] P. D’Oro, M. Schwarzer, E. Nikishin, P.-L. Bacon, M. G. Bellemare, and A. Courville. “Sample-
Efficient Reinforcement Learning by Breaking the Replay Ratio Barrier”. In: Deep Reinforcement
Learning Workshop NeurIPS 2022. Dec. 2022. url: https://openreview.net/pdf?id=4GBGwVIEYJ.
[DOB21] W. Dabney, G. Ostrovski, and A. Barreto. “Temporally-Extended epsilon-Greedy Exploration”.
In: ICLR. 2021. url: https://openreview.net/pdf?id=ONBPHFZ7zG4.
[DR11] M. P. Deisenroth and C. E. Rasmussen. “PILCO: A Model-Based and Data-Efficient Approach
to Policy Search”. In: ICML. 2011. url: http://www.icml-2011.org/papers/323_icmlpaper.
pdf.
[Du+21] C. Du, Z. Gao, S. Yuan, L. Gao, Z. Li, Y. Zeng, X. Zhu, J. Xu, K. Gai, and K.-C. Lee.
“Exploration in Online Advertising Systems with Deep Uncertainty-Aware Learning”. In: KDD.
KDD ’21. Association for Computing Machinery, 2021, pp. 2792–2801. url: https://doi.org/
10.1145/3447548.3467089.
[Duf02] M. Duff. “Optimal Learning: Computational procedures for Bayes-adaptive Markov decision
processes”. PhD thesis. U. Mass. Dept. Comp. Sci., 2002. url: http://envy.cs.umass.edu/
People/duff/diss.html.
[Dum+24] V. Dumoulin, D. D. Johnson, P. S. Castro, H. Larochelle, and Y. Dauphin. “A density estimation
perspective on learning from pairwise human preferences”. In: Trans. on Machine Learning
Research 2024 (2024). url: https://openreview.net/pdf?id=YH3oERVYjF.
[DVRZ22] S. Dong, B. Van Roy, and Z. Zhou. “Simple Agent, Complex Environment: Efficient Reinforce-
ment Learning with Agent States”. In: J. Mach. Learn. Res. (2022). url: https://www.jmlr.
org/papers/v23/21-0773.html.
[DWS12] T. Degris, M. White, and R. S. Sutton. “Off-Policy Actor-Critic”. In: ICML. 2012. url: http:
//arxiv.org/abs/1205.4839.
[Eco+19] A. Ecoffet, J. Huizinga, J. Lehman, K. O. Stanley, and J. Clune. “Go-Explore: a New Approach
for Hard-Exploration Problems”. In: (2019). arXiv: 1901.10995 [cs.LG]. url: http://arxiv.
org/abs/1901.10995.
[Eco+21] A. Ecoffet, J. Huizinga, J. Lehman, K. O. Stanley, and J. Clune. “First return, then explore”.
en. In: Nature 590.7847 (Feb. 2021), pp. 580–586. url: https://www.nature.com/articles/
s41586-020-03157-9.
[Emm+21] S. Emmons, B. Eysenbach, I. Kostrikov, and S. Levine. “RvS: What is essential for offline RL
via Supervised Learning?” In: arXiv [cs.LG] (Dec. 2021). url: http://arxiv.org/abs/2112.
10751.
[ESL21] B. Eysenbach, R. Salakhutdinov, and S. Levine. “C-Learning: Learning to Achieve Goals via
Recursive Classification”. In: ICLR. 2021. url: https://openreview.net/pdf?id=tc5qisoB-
C.
[Esp+18] L. Espeholt et al. “IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-
Learner Architectures”. en. In: ICML. PMLR, July 2018, pp. 1407–1416. url: https://proceedings.mlr.press/v80/espeholt18a.html.
[Eys+20] B. Eysenbach, X. Geng, S. Levine, and R. Salakhutdinov. “Rewriting History with Inverse RL:
Hindsight Inference for Policy Improvement”. In: NIPS. Feb. 2020.
[Eys+21] B. Eysenbach, A. Khazatsky, S. Levine, and R. Salakhutdinov. “Mismatched No More: Joint
Model-Policy Optimization for Model-Based RL”. In: (2021). arXiv: 2110.02758 [cs.LG]. url:
http://arxiv.org/abs/2110.02758.
[Eys+22] B. Eysenbach, A. Khazatsky, S. Levine, and R. Salakhutdinov. “Mismatched No More: Joint
Model-Policy Optimization for Model-Based RL”. In: NIPS. 2022.
[Far+18] G. Farquhar, T. Rocktäschel, M. Igl, and S. Whiteson. “TreeQN and ATreeC: Differentiable
Tree-Structured Models for Deep Reinforcement Learning”. In: ICLR. Feb. 2018. url: https:
//openreview.net/pdf?id=H1dh6Ax0Z.
[Far+23] J. Farebrother, J. Greaves, R. Agarwal, C. Le Lan, R. Goroshin, P. S. Castro, and M. G.
Bellemare. “Proto-Value Networks: Scaling Representation Learning with Auxiliary Tasks”. In:
ICLR. 2023. url: https://openreview.net/pdf?id=oGDKSt9JrZi.
[Far+24] J. Farebrother et al. “Stop regressing: Training value functions via classification for scalable
deep RL”. In: arXiv [cs.LG] (Mar. 2024). url: http://arxiv.org/abs/2403.03950.
[FC24] J. Farebrother and P. S. Castro. “CALE: Continuous Arcade Learning Environment”. In: NIPS.
Oct. 2024. url: https://arxiv.org/abs/2410.23810.
[FG21] S. Fujimoto and S. s. Gu. “A Minimalist Approach to Offline Reinforcement Learning”. In: NIPS.
Vol. 34. Dec. 2021, pp. 20132–20145. url: https://proceedings.neurips.cc/paper_files/
paper/2021/file/a8166da05c5a094f7dc03724b41886e5-Paper.pdf.
[FHM18] S. Fujimoto, H. van Hoof, and D. Meger. “Addressing Function Approximation Error in Actor-
Critic Methods”. In: ICML. 2018. url: http://arxiv.org/abs/1802.09477.
[FL+18] V. François-Lavet, P. Henderson, R. Islam, M. G. Bellemare, and J. Pineau. “An Introduction
to Deep Reinforcement Learning”. In: Foundations and Trends in Machine Learning 11.3 (2018).
url: http://arxiv.org/abs/1811.12560.
[FLA16] C. Finn, S. Levine, and P. Abbeel. “Guided Cost Learning: Deep Inverse Optimal Control via
Policy Optimization”. In: ICML. 2016, pp. 49–58.
[FLL18] J. Fu, K. Luo, and S. Levine. “Learning Robust Rewards with Adversarial Inverse Reinforcement
Learning”. In: ICLR. 2018.
[For+18] M. Fortunato et al. “Noisy Networks for Exploration”. In: ICLR. 2018. url: http://arxiv.
org/abs/1706.10295.
[FPP04] N. Ferns, P. Panangaden, and D. Precup. “Metrics for finite Markov decision processes”. en. In:
UAI. 2004. url: https://dl.acm.org/doi/10.5555/1036843.1036863.
[Fra+24] B. Frauenknecht, A. Eisele, D. Subhasish, F. Solowjow, and S. Trimpe. “Trust the Model Where
It Trusts Itself - Model-Based Actor-Critic with Uncertainty-Aware Rollout Adaption”. In:
ICML. June 2024. url: https://openreview.net/pdf?id=N0ntTjTfHb.
[Fre+24] B. Freed, T. Wei, R. Calandra, J. Schneider, and H. Choset. “Unifying Model-Based and
Model-Free Reinforcement Learning with Equivalent Policy Sets”. In: RL Conference. 2024. url:
https://rlj.cs.umass.edu/2024/papers/RLJ_RLC_2024_37.pdf.
[Fri03] K. Friston. “Learning and inference in the brain”. en. In: Neural Netw. 16.9 (2003), pp. 1325–1352.
url: http://dx.doi.org/10.1016/j.neunet.2003.06.005.
[Fri09] K. Friston. “The free-energy principle: a rough guide to the brain?” en. In: Trends Cogn. Sci.
13.7 (2009), pp. 293–301. url: http://dx.doi.org/10.1016/j.tics.2009.04.005.
[FS+19] H. F. Song et al. “V-MPO: On-Policy Maximum a Posteriori Policy Optimization for
Discrete and Continuous Control”. In: arXiv [cs.AI] (Sept. 2019). url: http://arxiv.org/
abs/1909.12238.
[FSW23] M. Fellows, M. J. A. Smith, and S. Whiteson. “Why Target Networks Stabilise Temporal Differ-
ence Methods”. en. In: ICML. PMLR, July 2023, pp. 9886–9909. url: https://proceedings.
mlr.press/v202/fellows23a.html.
[Fu15] M. Fu, ed. Handbook of Simulation Optimization. 1st ed. Springer-Verlag New York, 2015. url:
http://www.springer.com/us/book/9781493913831.
[Fu+20] J. Fu, A. Kumar, O. Nachum, G. Tucker, and S. Levine. D4RL: Datasets for Deep Data-Driven
Reinforcement Learning. arXiv:2004.07219. 2020.
[Fuj+19] S. Fujimoto, E. Conti, M. Ghavamzadeh, and J. Pineau. “Benchmarking batch deep reinforcement
learning algorithms”. In: Deep RL Workshop NeurIPS. Oct. 2019. url: https://arxiv.org/
abs/1910.01708.
[Gal+24] M. Gallici, M. Fellows, B. Ellis, B. Pou, I. Masmitja, J. N. Foerster, and M. Martin. “Simplifying
deep temporal difference learning”. In: ICML. July 2024.
[Gar23] R. Garnett. Bayesian Optimization. Cambridge University Press, 2023. url: https://bayesoptbook.
com/.
[GBS22] C. Grimm, A. Barreto, and S. Singh. “Approximate Value Equivalence”. In: NIPS. Oct. 2022.
url: https://openreview.net/pdf?id=S2Awu3Zn04v.
[GDG03] R. Givan, T. Dean, and M. Greig. “Equivalence notions and model minimization in Markov
decision processes”. en. In: Artif. Intell. 147.1-2 (July 2003), pp. 163–223. url: https://www.
sciencedirect.com/science/article/pii/S0004370202003764.
[GDWF22] J. Grudzien, C. A. S. De Witt, and J. Foerster. “Mirror Learning: A Unifying Framework of Policy
Optimisation”. In: ICML. Vol. 162. Proceedings of Machine Learning Research. PMLR, 2022,
pp. 7825–7844. url: https://proceedings.mlr.press/v162/grudzien22a/grudzien22a.
pdf.
[Ger18] S. J. Gershman. “Deconstructing the human algorithms for exploration”. en. In: Cognition 173
(Apr. 2018), pp. 34–42. url: https://www.sciencedirect.com/science/article/abs/pii/
S0010027717303359.
[Ger19] S. J. Gershman. “What does the free energy principle tell us about the brain?” In: Neurons,
Behavior, Data Analysis, and Theory (2019). url: http://arxiv.org/abs/1901.07945.
[GGN22] S. K. S. Ghasemipour, S. S. Gu, and O. Nachum. “Why So Pessimistic? Estimating Uncertainties
for Offline RL through Ensembles, and Why Their Independence Matters”. In: NIPS. Oct. 2022.
url: https://openreview.net/pdf?id=z64kN1h1-rR.
[Ghi+20] S. Ghiassian, A. Patterson, S. Garg, D. Gupta, A. White, and M. White. “Gradient temporal-
difference learning with Regularized Corrections”. In: ICML. July 2020.
[Gho+21] D. Ghosh, J. Rahme, A. Kumar, A. Zhang, R. P. Adams, and S. Levine. “Why Generalization
in RL is Difficult: Epistemic POMDPs and Implicit Partial Observability”. In: NIPS. Vol. 34.
Dec. 2021, pp. 25502–25515. url: https://proceedings.neurips.cc/paper_files/paper/
2021/file/d5ff135377d39f1de7372c95c74dd962-Paper.pdf.
[Ghu+22] R. Ghugare, H. Bharadhwaj, B. Eysenbach, S. Levine, and R. Salakhutdinov. “Simplifying Model-
based RL: Learning Representations, Latent-space Models, and Policies with One Objective”.
In: ICLR. Sept. 2022. url: https://openreview.net/forum?id=MQcmfgRxf7a.
[Git89] J. Gittins. Multi-armed Bandit Allocation Indices. Wiley, 1989.
[GK19] L. Graesser and W. L. Keng. Foundations of Deep Reinforcement Learning: Theory and Practice
in Python. en. 1 edition. Addison-Wesley Professional, 2019. url: https://www.amazon.com/
Deep-Reinforcement-Learning-Python-Hands/dp/0135172381.
[GM+24] J. Grau-Moya et al. “Learning Universal Predictors”. In: arXiv [cs.LG] (Jan. 2024). url:
https://arxiv.org/abs/2401.14953.
[Gor95] G. J. Gordon. “Stable Function Approximation in Dynamic Programming”. In: ICML. 1995,
pp. 261–268.
[Gra+10] T. Graepel, J. Quinonero-Candela, T. Borchert, and R. Herbrich. “Web-Scale Bayesian Click-
Through Rate Prediction for Sponsored Search Advertising in Microsoft’s Bing Search Engine”.
In: ICML. 2010.
[Gri+20] C. Grimm, A. Barreto, S. Singh, and D. Silver. “The Value Equivalence Principle for Model-Based
Reinforcement Learning”. In: NIPS 33 (2020), pp. 5541–5552. url: https://proceedings.
neurips.cc/paper_files/paper/2020/file/3bb585ea00014b0e3ebe4c6dd165a358-Paper.
pdf.
[Gul+20] C. Gulcehre et al. RL Unplugged: Benchmarks for Offline Reinforcement Learning. arXiv:2006.13888.
2020.
[Guo+22a] Z. D. Guo et al. “BYOL-Explore: Exploration by Bootstrapped Prediction”. In: Advances in
Neural Information Processing Systems. Oct. 2022. url: https://openreview.net/pdf?id=
qHGCH75usg.
[Guo+22b] Z. D. Guo et al. “BYOL-Explore: Exploration by bootstrapped prediction”. In: NIPS. June 2022.
url: https://proceedings.neurips.cc/paper_files/paper/2022/hash/ced0d3b92bb83b15c43ee32c7f57d
Abstract-Conference.html.
[GZG19] S. K. S. Ghasemipour, R. S. Zemel, and S. Gu. “A Divergence Minimization Perspective on
Imitation Learning Methods”. In: CORL. 2019, pp. 1259–1277.
[Haa+18a] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. “Soft Actor-Critic: Off-Policy Maximum
Entropy Deep Reinforcement Learning with a Stochastic Actor”. In: ICML. 2018. url: http:
//arxiv.org/abs/1801.01290.
[Haa+18b] T. Haarnoja et al. “Soft Actor-Critic Algorithms and Applications”. In: (2018). arXiv: 1812.
05905 [cs.LG]. url: http://arxiv.org/abs/1812.05905.
[Haf+19] D. Hafner, T. Lillicrap, I. Fischer, R. Villegas, D. Ha, H. Lee, and J. Davidson. “Learning Latent
Dynamics for Planning from Pixels”. In: ICML. 2019. url: http://arxiv.org/abs/1811.
04551.
[Haf+20] D. Hafner, T. Lillicrap, J. Ba, and M. Norouzi. “Dream to Control: Learning Behaviors by
Latent Imagination”. In: ICLR. 2020. url: https://openreview.net/forum?id=S1lOTC4tDS.
[Haf+21] D. Hafner, T. Lillicrap, M. Norouzi, and J. Ba. “Mastering Atari with discrete world models”.
In: ICLR. 2021.
[Haf+23] D. Hafner, J. Pasukonis, J. Ba, and T. Lillicrap. “Mastering Diverse Domains through World
Models”. In: arXiv [cs.AI] (Jan. 2023). url: http://arxiv.org/abs/2301.04104.
[Han+19] S. Hansen, W. Dabney, A. Barreto, D. Warde-Farley, T. Van de Wiele, and V. Mnih. “Fast
Task Inference with Variational Intrinsic Successor Features”. In: ICLR. Sept. 2019. url:
https://openreview.net/pdf?id=BJeAHkrYDS.
[Har+18] J. Harb, P.-L. Bacon, M. Klissarov, and D. Precup. “When waiting is not an option: Learning
options with a deliberation cost”. en. In: AAAI 32.1 (Apr. 2018). url: https://ojs.aaai.
org/index.php/AAAI/article/view/11831.
[Has10] H. van Hasselt. “Double Q-learning”. In: NIPS. Ed. by J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta. Curran Associates, Inc., 2010, pp. 2613–2621. url:
http://papers.nips.cc/paper/3964-double-q-learning.pdf.
[Has+16] H. van Hasselt, A. Guez, M. Hessel, V. Mnih, and D. Silver. “Learning values across many
orders of magnitude”. In: NIPS. Feb. 2016.
[HDCM15] A. Hallak, D. Di Castro, and S. Mannor. “Contextual Markov decision processes”. In: arXiv
[stat.ML] (Feb. 2015). url: http://arxiv.org/abs/1502.02259.
[HE16] J. Ho and S. Ermon. “Generative Adversarial Imitation Learning”. In: NIPS. 2016, pp. 4565–
4573.
[Hes+18] M. Hessel, J. Modayil, H. van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot,
M. Azar, and D. Silver. “Rainbow: Combining Improvements in Deep Reinforcement Learning”.
In: AAAI. 2018. url: http://arxiv.org/abs/1710.02298.
[Hes+19] M. Hessel, H. Soyer, L. Espeholt, W. Czarnecki, S. Schmitt, and H. van Hasselt. “Multi-task
deep reinforcement learning with PopArt”. In: AAAI. 2019.
[HGS16] H. van Hasselt, A. Guez, and D. Silver. “Deep Reinforcement Learning with Double Q-Learning”.
In: AAAI. AAAI’16. AAAI Press, 2016, pp. 2094–2100. url: http://dl.acm.org/citation.
cfm?id=3016100.3016191.
[HHA19] H. van Hasselt, M. Hessel, and J. Aslanides. “When to use parametric models in reinforcement
learning?” In: NIPS. 2019. url: http://arxiv.org/abs/1906.05243.
[HL04] D. R. Hunter and K. Lange. “A Tutorial on MM Algorithms”. In: The American Statistician 58
(2004), pp. 30–37.
[HL20] O. van der Himst and P. Lanillos. “Deep active inference for partially observable MDPs”. In:
ECML workshop on active inference. Sept. 2020. url: https://arxiv.org/abs/2009.03622.
[HM20] M. Hosseini and A. Maida. “Hierarchical Predictive Coding Models in a Deep-Learning Frame-
work”. In: (2020). arXiv: 2005.03230 [cs.CV]. url: http://arxiv.org/abs/2005.03230.
[Hon+10] A. Honkela, T. Raiko, M. Kuusela, M. Tornio, and J. Karhunen. “Approximate Riemannian
Conjugate Gradient Learning for Fixed-Form Variational Bayes”. In: JMLR 11.Nov (2010),
pp. 3235–3268. url: http://www.jmlr.org/papers/volume11/honkela10a/honkela10a.pdf.
[Hon+23] M. Hong, H.-T. Wai, Z. Wang, and Z. Yang. “A two-timescale stochastic algorithm framework for
bilevel optimization: Complexity analysis and application to actor-critic”. en. In: SIAM J. Optim.
33.1 (Mar. 2023), pp. 147–180. url: https://epubs.siam.org/doi/10.1137/20M1387341.
[Hou+11] N. Houlsby, F. Huszár, Z. Ghahramani, and M. Lengyel. “Bayesian active learning for classifica-
tion and preference learning”. In: arXiv [stat.ML] (Dec. 2011). url: http://arxiv.org/abs/
1112.5745.
[HQC24] M. Hutter, D. Quarel, and E. Catt. An introduction to universal artificial intelligence. Chapman
and Hall, 2024. url: http://www.hutter1.net/ai/uaibook2.htm.
[HR11] R. Hafner and M. Riedmiller. “Reinforcement learning in feedback control: Challenges and
benchmarks from technical process control”. en. In: Mach. Learn. 84.1-2 (July 2011), pp. 137–169.
url: https://link.springer.com/article/10.1007/s10994-011-5235-x.
[HR17] C. Hoffmann and P. Rostalski. “Linear Optimal Control on Factor Graphs — A Message
Passing Perspective”. In: Intl. Federation of Automatic Control 50.1 (2017), pp. 6314–6319. url:
https://www.sciencedirect.com/science/article/pii/S2405896317313800.
[HS18] D. Ha and J. Schmidhuber. “World Models”. In: NIPS. 2018. url: http://arxiv.org/abs/
1803.10122.
[HSW22a] N. A. Hansen, H. Su, and X. Wang. “Temporal Difference Learning for Model Predictive Control”.
en. In: ICML. PMLR, June 2022, pp. 8387–8406. url: https://proceedings.mlr.press/
v162/hansen22a.html.
[HSW22b] N. A. Hansen, H. Su, and X. Wang. “Temporal Difference Learning for Model Predictive Control”.
en. In: ICML. PMLR, June 2022, pp. 8387–8406. url: https://proceedings.mlr.press/
v162/hansen22a.html.
[HTB18] G. Z. Holland, E. J. Talvitie, and M. Bowling. “The effect of planning shape on Dyna-style
planning in high-dimensional state spaces”. In: arXiv [cs.AI] (June 2018). url: http://arxiv.
org/abs/1806.01825.
[Hu+20] Y. Hu, W. Wang, H. Jia, Y. Wang, Y. Chen, J. Hao, F. Wu, and C. Fan. “Learning to Utilize Shap-
ing Rewards: A New Approach of Reward Shaping”. In: NIPS 33 (2020), pp. 15931–15941. url:
https://proceedings.neurips.cc/paper_files/paper/2020/file/b710915795b9e9c02cf10d6d2bdb688c-
Paper.pdf.
[Hub+21] T. Hubert, J. Schrittwieser, I. Antonoglou, M. Barekatain, S. Schmitt, and D. Silver. “Learning
and planning in complex action spaces”. In: arXiv [cs.LG] (Apr. 2021). url: http://arxiv.
org/abs/2104.06303.
[Hut05] M. Hutter. Universal Artificial Intelligence: Sequential Decisions Based On Algorithmic Proba-
bility. en. 2005th ed. Springer, 2005. url: http://www.hutter1.net/ai/uaibook.htm.
[Ich+23] B. Ichter et al. “Do As I Can, Not As I Say: Grounding Language in Robotic Affordances”. en. In:
Conference on Robot Learning. PMLR, Mar. 2023, pp. 287–318. url: https://proceedings.
mlr.press/v205/ichter23a.html.
[ID19] S. Ivanov and A. D’yakonov. “Modern Deep Reinforcement Learning algorithms”. In: arXiv
[cs.LG] (June 2019). url: http://arxiv.org/abs/1906.10025.
[IW18] E. Imani and M. White. “Improving Regression Performance with Distributional Losses”. en.
In: ICML. PMLR, July 2018, pp. 2157–2166. url: https://proceedings.mlr.press/v80/
imani18a.html.
[Jae00] H. Jaeger. “Observable operator models for discrete stochastic time series”. en. In: Neural
Comput. 12.6 (June 2000), pp. 1371–1398. url: https://direct.mit.edu/neco/article-
pdf/12/6/1371/814514/089976600300015411.pdf.
[Jan+19a] M. Janner, J. Fu, M. Zhang, and S. Levine. “When to Trust Your Model: Model-Based Policy
Optimization”. In: NIPS. 2019. url: http://arxiv.org/abs/1906.08253.
[Jan+19b] D. Janz, J. Hron, P. Mazur, K. Hofmann, J. M. Hernández-Lobato, and S. Tschiatschek.
“Successor Uncertainties: Exploration and Uncertainty in Temporal Difference Learning”. In:
NIPS. Vol. 32. 2019. url: https://proceedings.neurips.cc/paper_files/paper/2019/
file/1b113258af3968aaf3969ca67e744ff8-Paper.pdf.
[Jan+22] M. Janner, Y. Du, J. B. Tenenbaum, and S. Levine. “Planning with Diffusion for Flexible
Behavior Synthesis”. In: ICML. May 2022. url: http://arxiv.org/abs/2205.09991.
[Jar+23] D. Jarrett, C. Tallec, F. Altché, T. Mesnard, R. Munos, and M. Valko. “Curiosity in Hindsight:
Intrinsic Exploration in Stochastic Environments”. In: ICML. June 2023. url: https://openreview.net/pdf?id=fIH2G4fnSy.
[JCM24] M. Jones, P. Chang, and K. Murphy. “Bayesian online natural gradient (BONG)”. In: arXiv (May 2024). url: http://arxiv.org/abs/2405.19681.
[JGP16] E. Jang, S. Gu, and B. Poole. “Categorical Reparameterization with Gumbel-Softmax”. In:
(2016). arXiv: 1611.01144 [stat.ML]. url: http://arxiv.org/abs/1611.01144.
[Jia+15] N. Jiang, A. Kulesza, S. Singh, and R. Lewis. “The Dependence of Effective Planning Horizon
on Model Accuracy”. en. In: Proceedings of the 2015 International Conference on Autonomous
Agents and Multiagent Systems. AAMAS ’15. Richland, SC: International Foundation for
Autonomous Agents and Multiagent Systems, May 2015, pp. 1181–1189. url: https://dl.acm.
org/doi/10.5555/2772879.2773300.
[Jin+22] L. Jing, P. Vincent, Y. LeCun, and Y. Tian. “Understanding Dimensional Collapse in Contrastive
Self-supervised Learning”. In: ICLR. 2022. url: https://openreview.net/forum?id=YevsQ05DEN7.
[JLL21] M. Janner, Q. Li, and S. Levine. “Offline Reinforcement Learning as One Big Sequence Modeling
Problem”. In: NIPS. June 2021.
[JM70] D. H. Jacobson and D. Q. Mayne. Differential Dynamic Programming. Elsevier Press, 1970.
[JML20] M. Janner, I. Mordatch, and S. Levine. “Gamma-Models: Generative Temporal Difference
Learning for Infinite-Horizon Prediction”. In: NIPS. Vol. 33. 2020, pp. 1724–1735. url: https://
proceedings.neurips.cc/paper_files/paper/2020/file/12ffb0968f2f56e51a59a6beb37b2859-
Paper.pdf.
[Jor+24] S. M. Jordan, A. White, B. C. da Silva, M. White, and P. S. Thomas. “Position: Benchmarking
is Limited in Reinforcement Learning Research”. In: ICML. June 2024. url: https://arxiv.
org/abs/2406.16241.
[JSJ94] T. Jaakkola, S. Singh, and M. Jordan. “Reinforcement Learning Algorithm for Partially Observ-
able Markov Decision Problems”. In: NIPS. 1994.
[KAG19] A. Kirsch, J. van Amersfoort, and Y. Gal. “BatchBALD: Efficient and Diverse Batch Acquisition
for Deep Bayesian Active Learning”. In: NIPS. 2019. url: http://arxiv.org/abs/1906.08158.
[Kai+19] L. Kaiser et al. “Model-based reinforcement learning for Atari”. In: arXiv [cs.LG] (Mar. 2019).
url: http://arxiv.org/abs/1903.00374.
[Kak01] S. M. Kakade. “A Natural Policy Gradient”. In: NIPS. Vol. 14. 2001. url: https://proceedings.
neurips.cc/paper_files/paper/2001/file/4b86abe48d358ecf194c56c69108433e-Paper.
pdf.
[Kal+18] D. Kalashnikov et al. “QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic
Manipulation”. In: CORL. 2018. url: http://arxiv.org/abs/1806.10293.
[Kap+18] S. Kapturowski, G. Ostrovski, J. Quan, R. Munos, and W. Dabney. “Recurrent Experience Replay
in Distributed Reinforcement Learning”. In: ICLR. Sept. 2018. url: https://openreview.
net/pdf?id=r1lyTjAqYX.
[Kap+22] S. Kapturowski, V. Campos, R. Jiang, N. Rakicevic, H. van Hasselt, C. Blundell, and A. P.
Badia. “Human-level Atari 200x faster”. In: ICLR. Sept. 2022. url: https://openreview.net/
pdf?id=JtC6yOHRoJJ.
[Kau+23] T. Kaufmann, P. Weng, V. Bengs, and E. Hüllermeier. “A survey of reinforcement learning from
human feedback”. In: arXiv [cs.LG] (Dec. 2023). url: http://arxiv.org/abs/2312.14925.
[KB09] G. Konidaris and A. Barto. “Skill Discovery in Continuous Reinforcement Learning Domains
using Skill Chaining”. In: Advances in Neural Information Processing Systems 22 (2009). url:
https://proceedings.neurips.cc/paper_files/paper/2009/file/e0cf1f47118daebc5b16269099ad7347-
Paper.pdf.
[KD18] S. Kamthe and M. P. Deisenroth. “Data-Efficient Reinforcement Learning with Probabilistic
Model Predictive Control”. In: AISTATS. 2018. url: http://proceedings.mlr.press/v84/
kamthe18a/kamthe18a.pdf.
[Ke+19] L. Ke, S. Choudhury, M. Barnes, W. Sun, G. Lee, and S. Srinivasa. Imitation Learning as
f-Divergence Minimization. arXiv:1905.12888. 2019.
[KGO12] H. J. Kappen, V. Gómez, and M. Opper. “Optimal control as a graphical model inference
problem”. In: Mach. Learn. 87.2 (2012), pp. 159–182. url: https://doi.org/10.1007/s10994-
012-5278-7.
[Khe+20] K. Khetarpal, Z. Ahmed, G. Comanici, D. Abel, and D. Precup. “What can I do here? A
Theory of Affordances in Reinforcement Learning”. In: Proceedings of the 37th International
Conference on Machine Learning. Ed. by H. D. Iii and A. Singh. Vol. 119. Proceedings of
Machine Learning Research. PMLR, 2020, pp. 5243–5253. url: https://proceedings.mlr.
press/v119/khetarpal20a.html.
[Kid+20] R. Kidambi, A. Rajeswaran, P. Netrapalli, and T. Joachims. “MOReL: Model-Based Offline
Reinforcement Learning”. In: NIPS. Vol. 33. 2020, pp. 21810–21823. url: https://proceedings.
neurips.cc/paper_files/paper/2020/file/f7efa4f864ae9b88d43527f4b14f750f-Paper.
pdf.
[Kir+21] R. Kirk, A. Zhang, E. Grefenstette, and T. Rocktäschel. “A survey of zero-shot generalisation
in deep Reinforcement Learning”. In: JAIR (Nov. 2021). url: http://jair.org/index.php/
jair/article/view/14174.
[KLC98] L. P. Kaelbling, M. Littman, and A. Cassandra. “Planning and acting in Partially Observable
Stochastic Domains”. In: AIJ 101 (1998).
[Kli+24] M. Klissarov, P. D’Oro, S. Sodhani, R. Raileanu, P.-L. Bacon, P. Vincent, A. Zhang, and M.
Henaff. “Motif: Intrinsic motivation from artificial intelligence feedback”. In: ICLR. 2024.
[KLP11] L. P. Kaelbling and T. Lozano-Pérez. “Hierarchical task and motion planning in the now”. In:
ICRA. 2011, pp. 1470–1477. url: http://dx.doi.org/10.1109/ICRA.2011.5980391.
[Koz+21] T. Kozuno, Y. Tang, M. Rowland, R. Munos, S. Kapturowski, W. Dabney, M. Valko, and
D. Abel. “Revisiting Peng’s Q-lambda for modern reinforcement learning”. In: ICML 139 (Feb.
2021). Ed. by M. Meila and T. Zhang, pp. 5794–5804. url: https://proceedings.mlr.press/
v139/kozuno21a/kozuno21a.pdf.
[KP19] K. Khetarpal and D. Precup. “Learning options with interest functions”. en. In: AAAI 33.01 (July
2019), pp. 9955–9956. url: https://ojs.aaai.org/index.php/AAAI/article/view/5114.
[KPL19] A. Kumar, X. B. Peng, and S. Levine. “Reward-Conditioned Policies”. In: arXiv [cs.LG] (Dec.
2019). url: http://arxiv.org/abs/1912.13465.
[KS02] M. Kearns and S. Singh. “Near-Optimal Reinforcement Learning in Polynomial Time”. en. In:
MLJ 49.2/3 (Nov. 2002), pp. 209–232. url: https://link.springer.com/article/10.1023/
A:1017984413808.
[KS06] L. Kocsis and C. Szepesvári. “Bandit Based Monte-Carlo Planning”. In: ECML. 2006, pp. 282–
293.
[KSS23] T. Kneib, A. Silbersdorff, and B. Säfken. “Rage Against the Mean – A Review of Distributional
Regression Approaches”. In: Econometrics and Statistics 26 (Apr. 2023), pp. 99–123. url:
https://www.sciencedirect.com/science/article/pii/S2452306221000824.
[Kum+19] A. Kumar, J. Fu, M. Soh, G. Tucker, and S. Levine. “Stabilizing Off-Policy Q-Learning via
Bootstrapping Error Reduction”. In: NIPS. Vol. 32. 2019. url: https://proceedings.neurips.
cc/paper_files/paper/2019/file/c2073ffa77b5357a498057413bb09d3a-Paper.pdf.
[Kum+20] A. Kumar, A. Zhou, G. Tucker, and S. Levine. “Conservative Q-Learning for Offline Reinforce-
ment Learning”. In: NIPS. June 2020.
[Kum+23] A. Kumar, R. Agarwal, X. Geng, G. Tucker, and S. Levine. “Offline Q-Learning on Diverse
Multi-Task Data Both Scales And Generalizes”. In: ICLR. 2023. url: http://arxiv.org/abs/
2211.15144.
[Kum+24] S. Kumar, H. J. Jeon, A. Lewandowski, and B. Van Roy. “The Need for a Big World Simulator:
A Scientific Challenge for Continual Learning”. In: Finding the Frame: An RLC Workshop
for Examining Conceptual Frameworks. July 2024. url: https://openreview.net/pdf?id=
10XMwt1nMJ.
[Kur+19] T. Kurutach, I. Clavera, Y. Duan, A. Tamar, and P. Abbeel. “Model-Ensemble Trust-Region
Policy Optimization”. In: ICLR. 2019. url: http://arxiv.org/abs/1802.10592.
[LA21] H. Liu and P. Abbeel. “APS: Active Pretraining with Successor Features”. en. In: ICML. PMLR,
July 2021, pp. 6736–6747. url: https://proceedings.mlr.press/v139/liu21b.html.
[Lai+21] H. Lai, J. Shen, W. Zhang, Y. Huang, X. Zhang, R. Tang, Y. Yu, and Z. Li. “On effective
scheduling of model-based reinforcement learning”. In: NIPS 34 (Nov. 2021). Ed. by M. Ranzato, A. Beygelzimer, Y. Dauphin, P. S. Liang, and J. W. Vaughan, pp. 3694–3705. url: https://
proceedings.neurips.cc/paper_files/paper/2021/hash/1e4d36177d71bbb3558e43af9577d70e-
Abstract.html.
[Lam+20] N. Lambert, B. Amos, O. Yadan, and R. Calandra. “Objective Mismatch in Model-based
Reinforcement Learning”. In: Conf. on Learning for Dynamics and Control (L4DC). Feb. 2020.
[Law+22] D. Lawson, A. Raventós, A. Warrington, and S. Linderman. “SIXO: Smoothing Inference with
Twisted Objectives”. In: NIPS. June 2022.
[Leh24] M. Lehmann. “The definitive guide to policy gradients in deep reinforcement learning: Theory,
algorithms and implementations”. In: arXiv [cs.LG] (Jan. 2024). url: http://arxiv.org/abs/
2401.13662.
[Lev18] S. Levine. “Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review”.
In: (2018). arXiv: 1805.00909 [cs.LG]. url: http://arxiv.org/abs/1805.00909.
[Lev+18] A. Levy, G. Konidaris, R. Platt, and K. Saenko. “Learning Multi-Level Hierarchies with
Hindsight”. In: ICLR. Sept. 2018. url: https://openreview.net/pdf?id=ryzECoAcY7.
[Lev+20a] S. Levine, A. Kumar, G. Tucker, and J. Fu. “Offline Reinforcement Learning: Tutorial, Review,
and Perspectives on Open Problems”. In: (2020). arXiv: 2005.01643 [cs.LG]. url: http:
//arxiv.org/abs/2005.01643.
[Lev+20b] S. Levine, A. Kumar, G. Tucker, and J. Fu. Offline Reinforcement Learning: Tutorial, Review,
and Perspectives on Open Problems. arXiv:2005.01643. 2020.
[LG+24] M. Lázaro-Gredilla, L. Y. Ku, K. P. Murphy, and D. George. “What type of inference is
planning?” In: NIPS. June 2024.
[LGR12] S. Lange, T. Gabel, and M. Riedmiller. “Batch reinforcement learning”. en. In: Adaptation,
Learning, and Optimization. Adaptation, learning, and optimization. Berlin, Heidelberg: Springer
Berlin Heidelberg, 2012, pp. 45–73. url: https://link.springer.com/chapter/10.1007/978-
3-642-27645-3_2.
[LHC24] C. Lu, S. Hu, and J. Clune. “Intelligent Go-Explore: Standing on the shoulders of giant foundation
models”. In: arXiv [cs.LG] (May 2024). url: http://arxiv.org/abs/2405.15143.
[LHS13] T. Lattimore, M. Hutter, and P. Sunehag. “The Sample-Complexity of General Reinforcement
Learning”. en. In: ICML. PMLR, May 2013, pp. 28–36. url: https://proceedings.mlr.
press/v28/lattimore13.html.
[Li+10] L. Li, W. Chu, J. Langford, and R. E. Schapire. “A contextual-bandit approach to personalized
news article recommendation”. In: WWW. 2010.
[Li18] Y. Li. “Deep Reinforcement Learning”. In: (2018). arXiv: 1810.06339 [cs.LG]. url: http:
//arxiv.org/abs/1810.06339.
[Li23] S. E. Li. Reinforcement learning for sequential decision and optimal control. en. Singapore:
Springer Nature Singapore, 2023. url: https://link.springer.com/book/10.1007/978-
981-19-7784-8.
[Li+24] H. Li, X. Yang, Z. Wang, X. Zhu, J. Zhou, Y. Qiao, X. Wang, H. Li, L. Lu, and J. Dai. “Auto
MC-Reward: Automated Dense Reward Design with Large Language Models for Minecraft”. In:
CVPR. 2024, pp. 16426–16435. url: https://openaccess.thecvf.com/content/CVPR2024/papers/Li_Auto_MC-Reward_Automated_Dense_Reward_Design_with_Large_Language_Models_CVPR_2024_paper.pdf.
[Lil+16] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra.
“Continuous control with deep reinforcement learning”. In: ICLR. 2016. url: http://arxiv.
org/abs/1509.02971.
[Lin+19] C. Linke, N. M. Ady, M. White, T. Degris, and A. White. “Adapting behaviour via intrinsic
reward: A survey and empirical study”. In: J. Artif. Intell. Res. (June 2019). url: http://arxiv.org/abs/1906.07865.
[Lin+24a] J. Lin, Y. Du, O. Watkins, D. Hafner, P. Abbeel, D. Klein, and A. Dragan. “Learning to model
the world with language”. In: ICML. 2024.
[Lin+24b] Y.-A. Lin, C.-T. Lee, C.-H. Yang, G.-T. Liu, and S.-H. Sun. “Hierarchical Programmatic Option
Framework”. In: NIPS. Nov. 2024. url: https://openreview.net/pdf?id=FeCWZviCeP.
[Lin92] L.-J. Lin. “Self-Improving Reactive Agents Based on Reinforcement Learning, Planning and
Teaching”. In: Mach. Learn. 8.3-4 (1992), pp. 293–321. url: https://doi.org/10.1007/
BF00992699.
[Lio+22] V. Lioutas, J. W. Lavington, J. Sefas, M. Niedoba, Y. Liu, B. Zwartsenberg, S. Dabiri, F.
Wood, and A. Scibior. “Critic Sequential Monte Carlo”. In: ICLR. Sept. 2022. url: https:
//openreview.net/pdf?id=ObtGcyKmwna.
[LMW24] B. Li, N. Ma, and Z. Wang. “Rewarded Region Replay (R3) for policy learning with discrete
action space”. In: arXiv [cs.LG] (May 2024). url: http://arxiv.org/abs/2405.16383.
[Lor24] J. Lorraine. “Scalable nested optimization for deep learning”. In: arXiv [cs.LG] (July 2024).
url: http://arxiv.org/abs/2407.01526.
[LÖW21] T. van de Laar, A. Özçelikkale, and H. Wymeersch. “Application of the Free Energy Principle
to Estimation and Control”. In: IEEE Trans. Signal Process. 69 (2021), pp. 4234–4244. url:
http://dx.doi.org/10.1109/TSP.2021.3095711.
[LPC22] N. Lambert, K. Pister, and R. Calandra. “Investigating Compounding Prediction Errors in
Learned Dynamics Models”. In: arXiv [cs.LG] (Mar. 2022). url: http://arxiv.org/abs/
2203.09637.
[LR10] S. Lange and M. Riedmiller. “Deep auto-encoder neural networks in reinforcement learning”.
en. In: IJCNN. IEEE, July 2010, pp. 1–8. url: https://ieeexplore.ieee.org/abstract/
document/5596468.
[LS01] M. Littman and R. S. Sutton. “Predictive Representations of State”. In: Advances in Neural
Information Processing Systems 14 (2001). url: https://proceedings.neurips.cc/paper_
files/paper/2001/file/1e4d36177d71bbb3558e43af9577d70e-Paper.pdf.
[LS19] T. Lattimore and C. Szepesvari. Bandit Algorithms. Cambridge, 2019.
[Lu+23] X. Lu, B. Van Roy, V. Dwaracherla, M. Ibrahimi, I. Osband, and Z. Wen. “Reinforcement Learn-
ing, Bit by Bit”. In: Found. Trends® Mach. Learn. (2023). url: https://www.nowpublishers.
com/article/Details/MAL-097.
[LV06] F. Liese and I. Vajda. “On divergences and informations in statistics and information theory”.
In: IEEE Transactions on Information Theory 52.10 (2006), pp. 4394–4412.
[LWL06] L. Li, T. J. Walsh, and M. L. Littman. “Towards a Unified Theory of State Abstraction for
MDPs”. In: (2006). url: https://thomasjwalsh.net/pub/aima06Towards.pdf.
[LZZ22] M. Liu, M. Zhu, and W. Zhang. “Goal-conditioned reinforcement learning: Problems and
solutions”. In: IJCAI. Jan. 2022.
[Ma+24] Y. J. Ma, W. Liang, G. Wang, D.-A. Huang, O. Bastani, D. Jayaraman, Y. Zhu, L. Fan, and
A. Anandkumar. “Eureka: Human-Level Reward Design via Coding Large Language Models”.
In: ICLR. 2024.
[Mac+18a] M. C. Machado, M. G. Bellemare, E. Talvitie, J. Veness, M. Hausknecht, and M. Bowling.
“Revisiting the Arcade Learning Environment: Evaluation Protocols and Open Problems for
General Agents”. In: J. Artif. Intell. Res. (2018). url: http://arxiv.org/abs/1709.06009.
[Mac+18b] M. C. Machado, C. Rosenbaum, X. Guo, M. Liu, G. Tesauro, and M. Campbell. “Eigenoption
Discovery through the Deep Successor Representation”. In: ICLR. Feb. 2018. url: https://openreview.net/pdf?id=Bk8ZcAxR-.
[Mac+23] M. C. Machado, A. Barreto, D. Precup, and M. Bowling. “Temporal Abstraction in Reinforcement
Learning with the Successor Representation”. In: JMLR 24.80 (2023), pp. 1–69. url: http:
//jmlr.org/papers/v24/21-1213.html.
[Mad+17] C. J. Maddison, D. Lawson, G. Tucker, N. Heess, A. Doucet, A. Mnih, and Y. W. Teh. “Particle
Value Functions”. In: ICLR Workshop on RL. Mar. 2017.
[Mae+09] H. Maei, C. Szepesvári, S. Bhatnagar, D. Precup, D. Silver, and R. S. Sutton. “Convergent
Temporal-Difference Learning with Arbitrary Smooth Function Approximation”. In: NIPS.
Vol. 22. 2009. url: https://proceedings.neurips.cc/paper_files/paper/2009/file/
3a15c7d0bbe60300a39f76f8a5ba6896-Paper.pdf.
[MAF22] V. Micheli, E. Alonso, and F. Fleuret. “Transformers are Sample-Efficient World Models”. In:
ICLR. Sept. 2022.
[MAF24] V. Micheli, E. Alonso, and F. Fleuret. “Efficient world models with context-aware tokenization”.
In: ICML. June 2024.
[Maj21] S. J. Majeed. “Abstractions of general reinforcement learning: An inquiry into the scalability
of generally intelligent agents”. PhD thesis. ANU, Dec. 2021. url: https://arxiv.org/abs/
2112.13404.
[Man+19] D. J. Mankowitz, N. Levine, R. Jeong, Y. Shi, J. Kay, A. Abdolmaleki, J. T. Springenberg,
T. Mann, T. Hester, and M. Riedmiller. “Robust Reinforcement Learning for Continuous
Control with Model Misspecification”. In: (2019). arXiv: 1906.07516 [cs.LG]. url: http:
//arxiv.org/abs/1906.07516.
[Mar10] J. Martens. “Deep learning via Hessian-free optimization”. In: ICML. 2010. url: http://www.
cs.toronto.edu/~asamir/cifar/HFO_James.pdf.
[Mar16] J. Martens. “Second-order optimization for neural networks”. PhD thesis. Toronto, 2016. url:
http://www.cs.toronto.edu/~jmartens/docs/thesis_phd_martens.pdf.
[Mar20] J. Martens. “New insights and perspectives on the natural gradient method”. In: JMLR (2020).
url: http://arxiv.org/abs/1412.1193.
[Mar21] J. Marino. “Predictive Coding, Variational Autoencoders, and Biological Connections”. en. In:
Neural Comput. 34.1 (2021), pp. 1–44. url: http://dx.doi.org/10.1162/neco_a_01458.
[Maz+22] P. Mazzaglia, T. Verbelen, O. Çatal, and B. Dhoedt. “The Free Energy Principle for Perception
and Action: A Deep Learning Perspective”. en. In: Entropy 24.2 (2022). url: http://dx.doi.
org/10.3390/e24020301.
[MBB20] M. C. Machado, M. G. Bellemare, and M. Bowling. “Count-based exploration with the successor
representation”. en. In: AAAI 34.04 (Apr. 2020), pp. 5125–5133. url: https://ojs.aaai.org/
index.php/AAAI/article/view/5955.
[McM+13] H. B. McMahan, G. Holt, D. Sculley, M. Young, D. Ebner, J. Grady, L. Nie, T. Phillips, E.
Davydov, D. Golovin, et al. “Ad click prediction: a view from the trenches”. In: KDD. 2013,
pp. 1222–1230.
[Men+23] W. Meng, Q. Zheng, G. Pan, and Y. Yin. “Off-Policy Proximal Policy Optimization”. en. In:
AAAI 37.8 (June 2023), pp. 9162–9170. url: https://ojs.aaai.org/index.php/AAAI/
article/view/26099.
[Met+17] L. Metz, J. Ibarz, N. Jaitly, and J. Davidson. “Discrete Sequential Prediction of Continuous
Actions for Deep RL”. In: (2017). arXiv: 1705.05035 [cs.LG]. url: http://arxiv.org/abs/
1705.05035.
[Mey22] S. Meyn. Control Systems and Reinforcement Learning. Cambridge, 2022. url: https://meyn.
ece.ufl.edu/2021/08/01/control-systems-and-reinforcement-learning/.
[MG15] J. Martens and R. Grosse. “Optimizing Neural Networks with Kronecker-factored Approximate
Curvature”. In: ICML. 2015. url: http://arxiv.org/abs/1503.05671.
[Mik+20] V. Mikulik, G. Delétang, T. McGrath, T. Genewein, M. Martic, S. Legg, and P. Ortega. “Meta-
trained agents implement Bayes-optimal agents”. In: NIPS 33 (2020), pp. 18691–18703. url:
https://proceedings.neurips.cc/paper_files/paper/2020/file/d902c3ce47124c66ce615d5ad9ba304f-
Paper.pdf.
[Mil20] B. Millidge. “Deep Active Inference as Variational Policy Gradients”. In: J. Mathematical
Psychology (2020). url: http://arxiv.org/abs/1907.03876.
[Mil+20] B. Millidge, A. Tschantz, A. K. Seth, and C. L. Buckley. “On the Relationship Between Active
Inference and Control as Inference”. In: International Workshop on Active Inference. 2020. url:
http://arxiv.org/abs/2006.12964.
[MM90] D. Q. Mayne and H. Michalska. “Receding horizon control of nonlinear systems”. In: IEEE Trans.
Automat. Contr. 35.7 (1990), pp. 814–824.
[MMT17] C. J. Maddison, A. Mnih, and Y. W. Teh. “The Concrete Distribution: A Continuous Relaxation
of Discrete Random Variables”. In: ICLR. 2017. url: http://arxiv.org/abs/1611.00712.
[MMT24] S. Mannor, Y. Mansour, and A. Tamar. Reinforcement Learning: Foundations. 2024. url:
https://sites.google.com/corp/view/rlfoundations/home.
[Mni+15] V. Mnih et al. “Human-level control through deep reinforcement learning”. In: Nature 518.7540
(2015), pp. 529–533.
[Mni+16] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K.
Kavukcuoglu. “Asynchronous Methods for Deep Reinforcement Learning”. In: ICML. 2016. url:
http://arxiv.org/abs/1602.01783.
[Moe+23] T. M. Moerland, J. Broekens, A. Plaat, and C. M. Jonker. “Model-based Reinforcement
Learning: A Survey”. In: Foundations and Trends in Machine Learning 16.1 (2023), pp. 1–118.
url: https://arxiv.org/abs/2006.16712.
[Moh+20] S. Mohamed, M. Rosca, M. Figurnov, and A. Mnih. “Monte Carlo Gradient Estimation in
Machine Learning”. In: JMLR 21.132 (2020), pp. 1–62. url: http://jmlr.org/papers/v21/19-
346.html.
[Mor63] T. Morimoto. “Markov Processes and the H-Theorem”. In: J. Phys. Soc. Jpn. 18.3 (1963),
pp. 328–331. url: https://doi.org/10.1143/JPSJ.18.328.
[MP+22] A. Mavor-Parker, K. Young, C. Barry, and L. Griffin. “How to Stay Curious while avoiding Noisy
TVs using Aleatoric Uncertainty Estimation”. en. In: ICML. PMLR, June 2022, pp. 15220–15240.
url: https://proceedings.mlr.press/v162/mavor-parker22a.html.
[MSB21] B. Millidge, A. Seth, and C. L. Buckley. “Predictive Coding: a Theoretical and Experimental
Review”. In: (2021). arXiv: 2107.12979 [cs.AI]. url: http://arxiv.org/abs/2107.12979.
[Mun14] R. Munos. “From Bandits to Monte-Carlo Tree Search: The Optimistic Principle Applied to
Optimization and Planning”. In: Foundations and Trends in Machine Learning 7.1 (2014),
pp. 1–129. url: http://dx.doi.org/10.1561/2200000038.
[Mun+16] R. Munos, T. Stepleton, A. Harutyunyan, and M. G. Bellemare. “Safe and Efficient Off-Policy
Reinforcement Learning”. In: NIPS. 2016, pp. 1046–1054.
[Mur00] K. Murphy. A Survey of POMDP Solution Techniques. Tech. rep. Comp. Sci. Div., UC Berkeley,
2000. url: https://www.cs.ubc.ca/~murphyk/Papers/pomdp.pdf.
[Mur23] K. P. Murphy. Probabilistic Machine Learning: Advanced Topics. MIT Press, 2023.
[MWS14] J. Modayil, A. White, and R. S. Sutton. “Multi-timescale nexting in a reinforcement learning
robot”. en. In: Adapt. Behav. 22.2 (Apr. 2014), pp. 146–160. url: https://sites.ualberta.
ca/~amw8/nexting.pdf.
[Nac+18] O. Nachum, S. Gu, H. Lee, and S. Levine. “Data-Efficient Hierarchical Reinforcement Learn-
ing”. In: NIPS. May 2018. url: https://proceedings.neurips.cc/paper/2018/hash/e6384711491713d29bc63fc5eeb5ba4f-Abstract.html.
[Nac+19] O. Nachum, S. Gu, H. Lee, and S. Levine. “Near-Optimal Representation Learning for Hier-
archical Reinforcement Learning”. In: ICLR. 2019. url: https://openreview.net/pdf?id=
H1emus0qF7.
[Nai+20] A. Nair, A. Gupta, M. Dalal, and S. Levine. “AWAC: Accelerating Online Reinforcement
Learning with Offline Datasets”. In: arXiv [cs.LG] (June 2020). url: http://arxiv.org/abs/
2006.09359.
[Nak+23] M. Nakamoto, Y. Zhai, A. Singh, M. S. Mark, Y. Ma, C. Finn, A. Kumar, and S. Levine.
“Cal-QL: Calibrated offline RL pre-training for efficient online fine-tuning”. In: arXiv [cs.LG]
(Mar. 2023). url: http://arxiv.org/abs/2303.05479.
[NHR99] A. Ng, D. Harada, and S. Russell. “Policy invariance under reward transformations: Theory and
application to reward shaping”. In: ICML. 1999.
[Ni+24] T. Ni, B. Eysenbach, E. Seyedsalehi, M. Ma, C. Gehring, A. Mahajan, and P.-L. Bacon. “Bridging
State and History Representations: Understanding Self-Predictive RL”. In: ICLR. Jan. 2024.
url: http://arxiv.org/abs/2401.08898.
[Nik+22] E. Nikishin, R. Abachi, R. Agarwal, and P.-L. Bacon. “Control-oriented model-based reinforce-
ment learning with implicit differentiation”. en. In: AAAI 36.7 (June 2022), pp. 7886–7894. url:
https://ojs.aaai.org/index.php/AAAI/article/view/20758.
[NR00] A. Ng and S. Russell. “Algorithms for inverse reinforcement learning”. In: ICML. 2000.
[NT20] C. Nota and P. S. Thomas. “Is the policy gradient a gradient?” In: Proc. of the 19th International
Conference on Autonomous Agents and MultiAgent Systems. 2020.
[NWJ10] X. Nguyen, M. J. Wainwright, and M. I. Jordan. “Estimating Divergence Functionals and the
Likelihood Ratio by Convex Risk Minimization”. In: IEEE Trans. Inf. Theory 56.11 (2010),
pp. 5847–5861. url: http://dx.doi.org/10.1109/TIT.2010.2068870.
[OCD21] G. Ostrovski, P. S. Castro, and W. Dabney. “The Difficulty of Passive Learning in Deep Reinforce-
ment Learning”. In: NIPS. Vol. 34. Dec. 2021, pp. 23283–23295. url: https://proceedings.
neurips.cc/paper_files/paper/2021/file/c3e0c62ee91db8dc7382bde7419bb573-Paper.
pdf.
[OK22] A. Ororbia and D. Kifer. “The neural coding framework for learning generative models”. en. In:
Nat. Commun. 13.1 (Apr. 2022), p. 2064. url: https://www.nature.com/articles/s41467-
022-29632-7.
[ORVR13] I. Osband, D. Russo, and B. Van Roy. “(More) Efficient Reinforcement Learning via Posterior
Sampling”. In: NIPS. 2013. url: http://arxiv.org/abs/1306.0940.
[Osb+19] I. Osband, B. Van Roy, D. J. Russo, and Z. Wen. “Deep exploration via randomized value
functions”. In: JMLR 20.124 (2019), pp. 1–62. url: http://jmlr.org/papers/v20/18-339.html.
[Osb+23a] I. Osband, Z. Wen, S. M. Asghari, V. Dwaracherla, M. Ibrahimi, X. Lu, and B. Van Roy.
“Approximate Thompson Sampling via Epistemic Neural Networks”. en. In: UAI. PMLR, July
2023, pp. 1586–1595. url: https://proceedings.mlr.press/v216/osband23a.html.
[Osb+23b] I. Osband, Z. Wen, S. M. Asghari, V. Dwaracherla, M. Ibrahimi, X. Lu, and B. Van Roy.
“Epistemic Neural Networks”. In: NIPS. 2023. url: https://proceedings.neurips.cc/paper_
files/paper/2023/file/07fbde96bee50f4e09303fd4f877c2f3-Paper-Conference.pdf.
[OSL17] J. Oh, S. Singh, and H. Lee. “Value Prediction Network”. In: NIPS. July 2017.
[OT22] M. Okada and T. Taniguchi. “DreamingV2: Reinforcement learning with discrete world models
without reconstruction”. en. In: 2022 IEEE/RSJ International Conference on Intelligent Robots
and Systems (IROS). IEEE, Oct. 2022, pp. 985–991. url: https://ieeexplore.ieee.org/
abstract/document/9981405.
[Ouy+22] L. Ouyang et al. “Training language models to follow instructions with human feedback”. In:
(Mar. 2022). arXiv: 2203.02155 [cs.CL]. url: http://arxiv.org/abs/2203.02155.
[OVR17] I. Osband and B. Van Roy. “Why is posterior sampling better than optimism for reinforcement
learning?” In: ICML. 2017, pp. 2701–2710.
[PAG24] M. Panwar, K. Ahuja, and N. Goyal. “In-context learning through the Bayesian prism”. In:
ICLR. 2024. url: https://arxiv.org/abs/2306.04891.
[Par+23] J. S. Park, J. O’Brien, C. J. Cai, M. R. Morris, P. Liang, and M. S. Bernstein. “Generative
agents: Interactive simulacra of human behavior”. en. In: Proceedings of the 36th Annual ACM
Symposium on User Interface Software and Technology. New York, NY, USA: ACM, Oct. 2023.
url: https://dl.acm.org/doi/10.1145/3586183.3606763.
[Par+24a] S. Park, K. Frans, B. Eysenbach, and S. Levine. “OGBench: Benchmarking Offline Goal-
Conditioned RL”. In: arXiv [cs.LG] (Oct. 2024). url: http://arxiv.org/abs/2410.20092.
[Par+24b] S. Park, K. Frans, S. Levine, and A. Kumar. “Is value learning really the main bottleneck in
offline RL?” In: NIPS. June 2024. url: https://arxiv.org/abs/2406.09329.
[Pat+17] D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell. “Curiosity-driven Exploration by Self-
supervised Prediction”. In: ICML. 2017. url: http://arxiv.org/abs/1705.05363.
[Pat+22] S. Pateria, B. Subagdja, A.-H. Tan, and C. Quek. “Hierarchical Reinforcement Learning: A
comprehensive survey”. en. In: ACM Comput. Surv. 54.5 (June 2022), pp. 1–35. url: https:
//dl.acm.org/doi/10.1145/3453160.
[Pat+24] A. Patterson, S. Neumann, M. White, and A. White. “Empirical design in reinforcement learning”.
In: JMLR (2024). url: http://arxiv.org/abs/2304.01315.
[PB+14] N. Parikh, S. Boyd, et al. “Proximal algorithms”. In: Foundations and Trends in Optimization
1.3 (2014), pp. 127–239.
[Pea84] J. Pearl. Heuristics: Intelligent Search Strategies for Computer Problem Solving. Addison-Wesley
Longman Publishing Co., Inc., 1984. url: https://dl.acm.org/citation.cfm?id=525.
[Pea94] B. A. Pearlmutter. “Fast Exact Multiplication by the Hessian”. In: Neural Comput. 6.1 (1994),
pp. 147–160. url: https://doi.org/10.1162/neco.1994.6.1.147.
[Pen+19] X. B. Peng, A. Kumar, G. Zhang, and S. Levine. “Advantage-weighted regression: Simple
and scalable off-policy reinforcement learning”. In: arXiv [cs.LG] (Sept. 2019). url: http:
//arxiv.org/abs/1910.00177.
[Pic+19] A. Piche, V. Thomas, C. Ibrahim, Y. Bengio, and C. Pal. “Probabilistic Planning with Sequential
Monte Carlo methods”. In: ICLR. 2019. url: https://openreview.net/pdf?id=ByetGn0cYX.
[Pis+22] M. Pislar, D. Szepesvari, G. Ostrovski, D. L. Borsa, and T. Schaul. “When should agents
explore?” In: ICLR. 2022. url: https://openreview.net/pdf?id=dEwfxt14bca.
[PKP21] A. Plaat, W. Kosters, and M. Preuss. “High-Accuracy Model-Based Reinforcement Learning, a
Survey”. In: (2021). arXiv: 2107.08241 [cs.LG]. url: http://arxiv.org/abs/2107.08241.
[Pla22] A. Plaat. Deep reinforcement learning, a textbook. Berlin, Germany: Springer, Jan. 2022. url:
https://link.springer.com/10.1007/978-981-19-0638-1.
[PMB22] K. Paster, S. McIlraith, and J. Ba. “You can’t count on luck: Why decision transformers and
RvS fail in stochastic environments”. In: arXiv [cs.LG] (May 2022). url: http://arxiv.org/
abs/2205.15967.
[Pom89] D. Pomerleau. “ALVINN: An Autonomous Land Vehicle in a Neural Network”. In: NIPS. 1989,
pp. 305–313.
[Pow22] W. B. Powell. Reinforcement Learning and Stochastic Optimization: A Unified Framework
for Sequential Decisions. en. 1st ed. Wiley, Mar. 2022. url: https://www.amazon.com/Reinforcement-Learning-Stochastic-Optimization-Sequential/dp/1119815037.
[PR12] W. B. Powell and I. O. Ryzhov. Optimal Learning. Wiley Series in Probability and Statistics.
http://optimallearning.princeton.edu/. Hoboken, NJ: Wiley-Blackwell, Mar. 2012. url: https:
//castle.princeton.edu/wp-content/uploads/2019/02/Powell-OptimalLearningWileyMarch112018.
pdf.
[PS07] J. Peters and S. Schaal. “Reinforcement Learning by Reward-Weighted Regression for Operational
Space Control”. In: ICML. 2007, pp. 745–750.
[PSS00] D. Precup, R. S. Sutton, and S. P. Singh. “Eligibility Traces for Off-Policy Policy Evaluation”.
In: ICML. ICML ’00. Morgan Kaufmann Publishers Inc., 2000, pp. 759–766. url: http :
//dl.acm.org/citation.cfm?id=645529.658134.
[PT87] C. Papadimitriou and J. Tsitsiklis. “The complexity of Markov decision processes”. In: Mathe-
matics of Operations Research 12.3 (1987), pp. 441–450.
[Put94] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley,
1994.
[PW94] J. Peng and R. J. Williams. “Incremental Multi-Step Q-Learning”. In: Machine Learning
Proceedings. Elsevier, Jan. 1994, pp. 226–232. url: http://dx.doi.org/10.1016/B978-1-
55860-335-6.50035-0.
[QPC21] J. Queeney, I. C. Paschalidis, and C. G. Cassandras. “Generalized Proximal Policy Optimization
with Sample Reuse”. In: NIPS. Oct. 2021.
[QPC24] J. Queeney, I. C. Paschalidis, and C. G. Cassandras. “Generalized Policy Improvement algorithms
with theoretically supported sample reuse”. In: IEEE Trans. Automat. Contr. (2024). url:
http://arxiv.org/abs/2206.13714.
[Rab89] L. R. Rabiner. “A Tutorial on Hidden Markov Models and Selected Applications in Speech
Recognition”. In: Proc. of the IEEE 77.2 (1989), pp. 257–286.
[Raf+23] R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, and C. Finn. “Direct Preference
Optimization: Your language model is secretly a reward model”. In: arXiv [cs.LG] (May 2023).
url: http://arxiv.org/abs/2305.18290.
[Raf+24] R. Rafailov et al. “D5RL: Diverse datasets for data-driven deep reinforcement learning”. In:
RLC. Aug. 2024. url: https://arxiv.org/abs/2408.08441.
[Raj+17] A. Rajeswaran, K. Lowrey, E. Todorov, and S. Kakade. “Towards generalization and simplicity
in continuous control”. In: NIPS. Mar. 2017.
[Rao10] A. V. Rao. “A Survey of Numerical Methods for Optimal Control”. In: Adv. Astronaut. Sci. 135.1 (2010).
[RB12] S. Ross and J. A. Bagnell. “Agnostic system identification for model-based reinforcement
learning”. In: ICML. Mar. 2012.
[RB99] R. P. Rao and D. H. Ballard. “Predictive coding in the visual cortex: a functional interpretation
of some extra-classical receptive-field effects”. en. In: Nat. Neurosci. 2.1 (1999), pp. 79–87. url:
http://dx.doi.org/10.1038/4580.
[Rec19] B. Recht. “A Tour of Reinforcement Learning: The View from Continuous Control”. In: Annual
Review of Control, Robotics, and Autonomous Systems 2 (2019), pp. 253–279. url: http :
//arxiv.org/abs/1806.09460.
[Ree+22] S. Reed et al. “A Generalist Agent”. In: TMLR (May 2022). url: https://arxiv.org/abs/
2205.06175.
[Ren+24] A. Z. Ren, J. Lidard, L. L. Ankile, A. Simeonov, P. Agrawal, A. Majumdar, B. Burchfiel, H. Dai,
and M. Simchowitz. “Diffusion Policy Policy Optimization”. In: arXiv [cs.RO] (Aug. 2024). url:
http://arxiv.org/abs/2409.00588.
[RFP15] I. O. Ryzhov, P. I. Frazier, and W. B. Powell. “A new optimal stepsize for approximate dynamic
programming”. en. In: IEEE Trans. Automat. Contr. 60.3 (Mar. 2015), pp. 743–758. url:
https://castle.princeton.edu/Papers/Ryzhov-OptimalStepsizeforADPFeb242015.pdf.
[RGB11] S. Ross, G. J. Gordon, and J. A. Bagnell. “A reduction of imitation learning and structured
prediction to no-regret online learning”. In: AISTATS. 2011.
[Rie05] M. Riedmiller. “Neural fitted Q iteration – first experiences with a data efficient neural reinforce-
ment learning method”. en. In: ECML. Lecture notes in computer science. 2005, pp. 317–328.
url: https://link.springer.com/chapter/10.1007/11564096_32.
[Rie+18] M. Riedmiller, R. Hafner, T. Lampe, M. Neunert, J. Degrave, T. Wiele, V. Mnih, N. Heess,
and J. T. Springenberg. “Learning by Playing – Solving Sparse Reward Tasks from Scratch”. en.
In: ICML. PMLR, July 2018, pp. 4344–4353. url: https://proceedings.mlr.press/v80/
riedmiller18a.html.
[RJ22] A. Rao and T. Jelvis. Foundations of Reinforcement Learning with Applications in Finance.
Chapman and Hall/CRC, 2022. url: https://github.com/TikhonJelvis/RL-book.
[RK04] R. Rubinstein and D. Kroese. The Cross-Entropy Method: A Unified Approach to Combinatorial
Optimization, Monte-Carlo Simulation, and Machine Learning. Springer-Verlag, 2004.
[RLT18] M. Riemer, M. Liu, and G. Tesauro. “Learning Abstract Options”. In: NIPS 31 (2018). url:
https://proceedings.neurips.cc/paper_files/paper/2018/file/cdf28f8b7d14ab02d12a2329d71e4079-
Paper.pdf.
[RMD22] J. B. Rawlings, D. Q. Mayne, and M. M. Diehl. Model Predictive Control: Theory, Computa-
tion, and Design (2nd ed). en. Nob Hill Publishing, LLC, Sept. 2022. url: https://sites.
engineering.ucsb.edu/~jbraw/mpc/MPC-book-2nd-edition-1st-printing.pdf.
[RMK20] A. Rajeswaran, I. Mordatch, and V. Kumar. “A game theoretic framework for model based
reinforcement learning”. In: ICML. 2020.
[RN19] S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. 4th edition. Prentice Hall,
2019.
[RN94] G. A. Rummery and M. Niranjan. On-Line Q-Learning Using Connectionist Systems. Tech. rep.
Cambridge Univ. Engineering Dept., 1994.
[RR14] D. Russo and B. V. Roy. “Learning to Optimize via Posterior Sampling”. In: Math. Oper. Res.
39.4 (2014), pp. 1221–1243.
[RTV12] K. Rawlik, M. Toussaint, and S. Vijayakumar. “On stochastic optimal control and reinforcement
learning by approximate inference”. In: Robotics: Science and Systems VIII. Robotics: Science and
Systems Foundation, 2012. url: https://blogs.cuit.columbia.edu/zp2130/files/2019/
03/On_Stochasitc_Optimal_Control_and_Reinforcement_Learning_by_Approximate_
Inference.pdf.
[Rub97] R. Y. Rubinstein. “Optimization of computer simulation models with rare events”. In: Eur. J.
Oper. Res. 99.1 (1997), pp. 89–112. url: http://www.sciencedirect.com/science/article/
pii/S0377221796003852.
[Rus+18] D. J. Russo, B. Van Roy, A. Kazerouni, I. Osband, and Z. Wen. “A Tutorial on Thompson
Sampling”. In: Foundations and Trends in Machine Learning 11.1 (2018), pp. 1–96. url:
http://dx.doi.org/10.1561/2200000070.
[Rus19] S. Russell. Human Compatible: Artificial Intelligence and the Problem of Control. en. Kin-
dle. Viking, 2019. url: https://www.amazon.com/Human-Compatible-Artificial-Intelligence-Problem-ebook/dp/B07N5J5FTS/ref=zg_bs_3887_4?_encoding=UTF8&psc=1&refRID=0JE0ST011W4K15PTFZAT.
[RW91] S. Russell and E. Wefald. “Principles of metareasoning”. en. In: Artif. Intell. 49.1-3 (May 1991),
pp. 361–395. url: http://dx.doi.org/10.1016/0004-3702(91)90015-C.
[Ryu+20] M. Ryu, Y. Chow, R. Anderson, C. Tjandraatmadja, and C. Boutilier. “CAQL: Continuous
Action Q-Learning”. In: ICLR. 2020. url: https://openreview.net/forum?id=BkxXe0Etwr.
[Saj+21] N. Sajid, P. J. Ball, T. Parr, and K. J. Friston. “Active Inference: Demystified and Compared”.
en. In: Neural Comput. 33.3 (Mar. 2021), pp. 674–712. url: https://web.archive.org/web/
20210628163715id_/https://discovery.ucl.ac.uk/id/eprint/10119277/1/Friston_
neco_a_01357.pdf.
[Sal+23] T. Salvatori, A. Mali, C. L. Buckley, T. Lukasiewicz, R. P. N. Rao, K. Friston, and A. Ororbia.
“Brain-inspired computational intelligence via predictive coding”. In: arXiv [cs.AI] (Aug. 2023).
url: http://arxiv.org/abs/2308.07870.
[Sal+24] T. Salvatori, Y. Song, Y. Yordanov, B. Millidge, L. Sha, C. Emde, Z. Xu, R. Bogacz, and
T. Lukasiewicz. “A Stable, Fast, and Fully Automatic Learning Algorithm for Predictive Coding
Networks”. In: ICLR. Oct. 2024. url: https://openreview.net/pdf?id=RyUvzda8GH.
[SB18] R. Sutton and A. Barto. Reinforcement learning: an introduction (2nd edn). MIT Press, 2018.
[Sch10] J. Schmidhuber. “Formal Theory of Creativity, Fun, and Intrinsic Motivation”. In: IEEE Trans.
Autonomous Mental Development 2 (2010). url: http://people.idsia.ch/~juergen/ieeecreative.pdf.
[Sch+15a] T. Schaul, D. Horgan, K. Gregor, and D. Silver. “Universal Value Function Approximators”. en.
In: ICML. PMLR, June 2015, pp. 1312–1320. url: https://proceedings.mlr.press/v37/
schaul15.html.
[Sch+15b] J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel. “Trust Region Policy Optimiza-
tion”. In: ICML. 2015. url: http://arxiv.org/abs/1502.05477.
[Sch+16a] T. Schaul, J. Quan, I. Antonoglou, and D. Silver. “Prioritized Experience Replay”. In: ICLR.
2016. url: http://arxiv.org/abs/1511.05952.
[Sch+16b] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel. “High-Dimensional Continuous
Control Using Generalized Advantage Estimation”. In: ICLR. 2016. url: http://arxiv.org/
abs/1506.02438.
[Sch+17] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. “Proximal Policy Optimization
Algorithms”. In: (2017). arXiv: 1707.06347 [cs.LG]. url: http://arxiv.org/abs/1707.
06347.
[Sch19] J. Schmidhuber. “Reinforcement learning Upside Down: Don’t predict rewards – just map them
to actions”. In: arXiv [cs.AI] (Dec. 2019). url: http://arxiv.org/abs/1912.02875.
[Sch+20] J. Schrittwieser et al. “Mastering Atari, Go, Chess and Shogi by Planning with a Learned
Model”. In: Nature (2020). url: http://arxiv.org/abs/1911.08265.
[Sch+21] M. Schwarzer, A. Anand, R. Goel, R Devon Hjelm, A. Courville, and P. Bachman. “Data-
Efficient Reinforcement Learning with Self-Predictive Representations”. In: ICLR. 2021. url:
https://openreview.net/pdf?id=uCQfPZwRaUu.
[Sch+23a] I. Schubert, J. Zhang, J. Bruce, S. Bechtle, E. Parisotto, M. Riedmiller, J. T. Springenberg,
A. Byravan, L. Hasenclever, and N. Heess. “A Generalist Dynamics Model for Control”. In:
arXiv [cs.AI] (May 2023). url: http://arxiv.org/abs/2305.10912.
[Sch+23b] M. Schwarzer, J. Obando-Ceron, A. Courville, M. Bellemare, R. Agarwal, and P. S. Castro.
“Bigger, Better, Faster: Human-level Atari with human-level efficiency”. In: ICML. May 2023.
url: http://arxiv.org/abs/2305.19452.
[Sco10] S. Scott. “A modern Bayesian look at the multi-armed bandit”. In: Applied Stochastic Models in
Business and Industry 26 (2010), pp. 639–658.
[Sei+16] H. van Seijen, A. Rupam Mahmood, P. M. Pilarski, M. C. Machado, and R. S. Sutton. “True
Online Temporal-Difference Learning”. In: JMLR (2016). url: http://jmlr.org/papers/
volume17/15-599/15-599.pdf.
[Sey+22] T. Seyde, P. Werner, W. Schwarting, I. Gilitschenski, M. Riedmiller, D. Rus, and M. Wulfmeier.
“Solving Continuous Control via Q-learning”. In: ICLR. Sept. 2022. url: https://openreview.
net/pdf?id=U5XOGxAgccS.
[Sha+20] R. Shah, P. Freire, N. Alex, R. Freedman, D. Krasheninnikov, L. Chan, M. D. Dennis, P.
Abbeel, A. Dragan, and S. Russell. “Benefits of Assistance over Reward Learning”. In: NIPS
Workshop. 2020. url: https://aima.cs.berkeley.edu/~russell/papers/neurips20ws-
assistance.pdf.
[Sie+20] N. Siegel, J. T. Springenberg, F. Berkenkamp, A. Abdolmaleki, M. Neunert, T. Lampe, R.
Hafner, N. Heess, and M. Riedmiller. “Keep Doing What Worked: Behavior Modelling Priors
for Offline Reinforcement Learning”. In: ICLR. 2020. url: https://openreview.net/pdf?id=
rke7geHtwH.
[Sil+14] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller. “Deterministic
Policy Gradient Algorithms”. In: ICML. ICML’14. JMLR.org, 2014, pp. I–387–I–395. url:
http://dl.acm.org/citation.cfm?id=3044805.3044850.
[Sil+16] D. Silver et al. “Mastering the game of Go with deep neural networks and tree search”. en. In:
Nature 529.7587 (2016), pp. 484–489. url: http://dx.doi.org/10.1038/nature16961.
[Sil+17a] D. Silver et al. “Mastering the game of Go without human knowledge”. en. In: Nature 550.7676
(2017), pp. 354–359. url: http://dx.doi.org/10.1038/nature24270.
[Sil+17b] D. Silver et al. “The predictron: end-to-end learning and planning”. In: ICML. 2017. url:
https://openreview.net/pdf?id=BkJsCIcgl.
[Sil18] D. Silver. Lecture 9: Exploration and Exploitation. 2018. url: http://www0.cs.ucl.ac.uk/
staff/d.silver/web/Teaching_files/XX.pdf.
[Sil+18] D. Silver et al. “A general reinforcement learning algorithm that masters chess, shogi, and Go
through self-play”. en. In: Science 362.6419 (2018), pp. 1140–1144. url: http://dx.doi.org/
10.1126/science.aar6404.
[Sin+00] S. Singh, T. Jaakkola, M. L. Littman, and C. Szepesvári. “Convergence Results for Single-
Step On-Policy Reinforcement-Learning Algorithms”. In: MLJ 38.3 (2000), pp. 287–308. url:
https://doi.org/10.1023/A:1007678930559.
[Ska+22] J. Skalse, N. H. R. Howe, D. Krasheninnikov, and D. Krueger. “Defining and characterizing
reward hacking”. In: NIPS. Sept. 2022.
[SKM18] S. Schwöbel, S. Kiebel, and D. Marković. “Active Inference, Belief Propagation, and the Bethe
Approximation”. en. In: Neural Comput. 30.9 (2018), pp. 2530–2567. url: http://dx.doi.org/
10.1162/neco_a_01108.
[Sli19] A. Slivkins. “Introduction to Multi-Armed Bandits”. In: Foundations and Trends in Machine
Learning (2019). url: http://arxiv.org/abs/1904.07272.
[Smi+23] F. B. Smith, A. Kirsch, S. Farquhar, Y. Gal, A. Foster, and T. Rainforth. “Prediction-Oriented
Bayesian Active Learning”. In: AISTATS. Apr. 2023. url: http://arxiv.org/abs/2304.08151.
[Sol64] R. J. Solomonoff. “A formal theory of inductive inference. Part I”. In: Information and Control
7.1 (Mar. 1964), pp. 1–22. url: https://www.sciencedirect.com/science/article/pii/
S0019995864902232.
[Son98] E. D. Sontag. Mathematical Control Theory: Deterministic Finite Dimensional Systems. 2nd.
Vol. 6. Texts in Applied Mathematics. Springer, 1998.
[Spi+24] B. A. Spiegel, Z. Yang, W. Jurayj, B. Bachmann, S. Tellex, and G. Konidaris. “Informing
Reinforcement Learning Agents by Grounding Language to Markov Decision Processes”. In:
Workshop on Training Agents with Foundation Models at RLC 2024. Aug. 2024. url: https:
//openreview.net/pdf?id=uFm9e4Ly26.
[Spr17] M. W. Spratling. “A review of predictive coding algorithms”. en. In: Brain Cogn. 112 (2017),
pp. 92–97. url: http://dx.doi.org/10.1016/j.bandc.2015.11.003.
[SPS99] R. S. Sutton, D. Precup, and S. Singh. “Between MDPs and semi-MDPs: A framework for
temporal abstraction in reinforcement learning”. In: Artif. Intell. 112.1 (Aug. 1999), pp. 181–211.
url: http://www.sciencedirect.com/science/article/pii/S0004370299000521.
[SS21] D. Schmidt and T. Schmied. “Fast and Data-Efficient Training of Rainbow: an Experimental
Study on Atari”. In: Deep RL Workshop NeurIPS 2021. Dec. 2021. url: https://openreview.
net/pdf?id=GvM7A3cv63M.
[SSM08] R. S. Sutton, C. Szepesvári, and H. R. Maei. “A convergent O(n) algorithm for off-policy temporal-
difference learning with linear function approximation”. en. In: NIPS. NIPS’08. Red Hook, NY,
USA: Curran Associates Inc., Dec. 2008, pp. 1609–1616. url: https://proceedings.neurips.
cc/paper_files/paper/2008/file/e0c641195b27425bb056ac56f8953d24-Paper.pdf.
[Str00] M. Strens. “A Bayesian Framework for Reinforcement Learning”. In: ICML. 2000.
[Sub+22] J. Subramanian, A. Sinha, R. Seraj, and A. Mahajan. “Approximate information state for
approximate planning and reinforcement learning in partially observed systems”. In: JMLR
23.12 (2022), pp. 1–83. url: http://jmlr.org/papers/v23/20-1165.html.
[Sut+08] R. S. Sutton, C. Szepesvari, A. Geramifard, and M. P. Bowling. “Dyna-style planning with
linear function approximation and prioritized sweeping”. In: UAI. 2008.
[Sut15] R. Sutton. Introduction to RL with function approximation. NIPS Tutorial. 2015. url: http://media.nips.cc/Conferences/2015/tutorialslides/SuttonIntroRL-nips-2015-tutorial.pdf.
[Sut88] R. Sutton. “Learning to predict by the methods of temporal differences”. In: Machine Learning
3.1 (1988), pp. 9–44.
[Sut90] R. S. Sutton. “Integrated Architectures for Learning, Planning, and Reacting Based on Ap-
proximating Dynamic Programming”. In: ICML. Ed. by B. Porter and R. Mooney. Morgan
Kaufmann, 1990, pp. 216–224. url: http://www.sciencedirect.com/science/article/
pii/B9781558601413500304.
[Sut95] R. S. Sutton. “TD models: Modeling the world at a mixture of time scales”. en. In: ICML.
Jan. 1995, pp. 531–539. url: https://www.sciencedirect.com/science/article/abs/pii/
B9781558603776500724.
[Sut96] R. S. Sutton. “Generalization in Reinforcement Learning: Successful Examples Using Sparse
Coarse Coding”. In: NIPS. Ed. by D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo. MIT
Press, 1996, pp. 1038–1044. url: http://papers.nips.cc/paper/1109-generalization-in-
reinforcement-learning-successful-examples-using-sparse-coarse-coding.pdf.
[Sut+99] R. Sutton, D. McAllester, S. Singh, and Y. Mansour. “Policy Gradient Methods for Reinforcement
Learning with Function Approximation”. In: NIPS. 1999.
[SW06] J. E. Smith and R. L. Winkler. “The Optimizer’s Curse: Skepticism and Postdecision Surprise
in Decision Analysis”. In: Manage. Sci. 52.3 (2006), pp. 311–322.
[Sze10] C. Szepesvari. Algorithms for Reinforcement Learning. Morgan & Claypool, 2010.
[Tam+16] A. Tamar, Y. Wu, G. Thomas, S. Levine, and P. Abbeel. “Value Iteration Networks”. In: NIPS.
2016. url: http://arxiv.org/abs/1602.02867.
[Tan+23] Y. Tang et al. “Understanding Self-Predictive Learning for Reinforcement Learning”. In: ICML.
2023. url: https://proceedings.mlr.press/v202/tang23d/tang23d.pdf.
[Ten02] R. I. Brafman and M. Tennenholtz. “R-max – A General Polynomial Time Algorithm for Near-Optimal
Reinforcement Learning”. In: JMLR 3 (2002), pp. 213–231. url: http://www.ai.mit.edu/
projects/jmlr/papers/volume3/brafman02a/source/brafman02a.pdf.
[Tha+22] S. Thakoor, M. Rowland, D. Borsa, W. Dabney, R. Munos, and A. Barreto. “Generalised
Policy Improvement with Geometric Policy Composition”. en. In: ICML. PMLR, June 2022,
pp. 21272–21307. url: https://proceedings.mlr.press/v162/thakoor22a.html.
[Tho33] W. R. Thompson. “On the Likelihood that One Unknown Probability Exceeds Another in View
of the Evidence of Two Samples”. In: Biometrika 25.3/4 (1933), pp. 285–294.
[TKE24] H. Tang, D. Key, and K. Ellis. “WorldCoder, a model-based LLM agent: Building world models
by writing code and interacting with the environment”. In: arXiv [cs.AI] (Feb. 2024). url:
http://arxiv.org/abs/2402.12275.
[TL05] E. Todorov and W. Li. “A Generalized Iterative LQG Method for Locally-optimal Feedback
Control of Constrained Nonlinear Stochastic Systems”. In: ACC. 2005, pp. 300–306.
[TMM19] C. Tessler, D. J. Mankowitz, and S. Mannor. “Reward Constrained Policy Optimization”. In:
ICLR. 2019. url: https://openreview.net/pdf?id=SkfrvsA9FX.
[Tom+22] T. Tomilin, T. Dai, M. Fang, and M. Pechenizkiy. “LevDoom: A benchmark for generalization
on level difficulty in reinforcement learning”. In: 2022 IEEE Conference on Games (CoG). IEEE,
Aug. 2022. url: https://ieee-cog.org/2022/assets/papers/paper_30.pdf.
[Tom+24] M. Tomar, P. Hansen-Estruch, P. Bachman, A. Lamb, J. Langford, M. E. Taylor, and S. Levine.
“Video Occupancy Models”. In: arXiv [cs.CV] (June 2024). url: http://arxiv.org/abs/2407.
09533.
[Tou09] M. Toussaint. “Robot Trajectory Optimization using Approximate Inference”. In: ICML. 2009,
pp. 1049–1056.
[Tou14] M. Toussaint. Bandits, Global Optimization, Active Learning, and Bayesian RL – understanding
the common ground. Autonomous Learning Summer School. 2014. url: https://www.user.
tu-berlin.de/mtoussai/teaching/14-BanditsOptimizationActiveLearningBayesianRL.
pdf.
[TR97] J. Tsitsiklis and B. V. Roy. “An analysis of temporal-difference learning with function approxi-
mation”. In: IEEE Trans. on Automatic Control 42.5 (1997), pp. 674–690.
[TS06] M. Toussaint and A. Storkey. “Probabilistic inference for solving discrete and continuous state
Markov Decision Processes”. In: ICML. 2006, pp. 945–952.
[Tsc+20] A. Tschantz, B. Millidge, A. K. Seth, and C. L. Buckley. “Reinforcement learning through active
inference”. In: ICLR workshop on “Bridging AI and Cognitive Science“. Feb. 2020.
[Tsc+23] A. Tschantz, B. Millidge, A. K. Seth, and C. L. Buckley. “Hybrid predictive coding: Inferring,
fast and slow”. en. In: PLoS Comput. Biol. 19.8 (Aug. 2023), e1011280. url: https : / /
journals.plos.org/ploscompbiol/article/file?id=10.1371/journal.pcbi.1011280&
type=printable.
[Tsi+17] P. A. Tsividis, T. Pouncy, J. L. Xu, J. B. Tenenbaum, and S. J. Gershman. “Human Learning
in Atari”. en. In: AAAI Spring Symposium Series. 2017. url: https://www.aaai.org/ocs/
index.php/SSS/SSS17/paper/viewPaper/15280.
[TVR97] J. N. Tsitsiklis and B. Van Roy. “An analysis of temporal-difference learning with function
approximation”. en. In: IEEE Trans. Automat. Contr. 42.5 (May 1997), pp. 674–690. url:
https://ieeexplore.ieee.org/abstract/document/580874.
[Unk24] Unknown. “Beyond The Rainbow: High Performance Deep Reinforcement Learning On A
Desktop PC”. In: (Oct. 2024). url: https://openreview.net/pdf?id=0ydseYDKRi.
[Val00] H. Valpola. “Bayesian Ensemble Learning for Nonlinear Factor Analysis”. PhD thesis. Helsinki
University of Technology, 2000. url: https://users.ics.aalto.fi/harri/thesis/valpola_
thesis.ps.gz.
[van+18] H. van Hasselt, Y. Doron, F. Strub, M. Hessel, N. Sonnerat, and J. Modayil. Deep Reinforcement
Learning and the Deadly Triad. arXiv:1812.02648. 2018.
[VBW15] S. S. Villar, J. Bowden, and J. Wason. “Multi-armed Bandit Models for the Optimal Design
of Clinical Trials: Benefits and Challenges”. en. In: Stat. Sci. 30.2 (2015), pp. 199–215. url:
http://dx.doi.org/10.1214/14-STS504.
[Vee+19] V. Veeriah, M. Hessel, Z. Xu, J. Rajendran, R. L. Lewis, J. Oh, H. P. van Hasselt, D. Silver, and S.
Singh. “Discovery of Useful Questions as Auxiliary Tasks”. In: NIPS. Vol. 32. 2019. url: https://
proceedings.neurips.cc/paper_files/paper/2019/file/10ff0b5e85e5b85cc3095d431d8c08b4-
Paper.pdf.
[Ven+24] D. Venuto, S. N. Islam, M. Klissarov, D. Precup, S. Yang, and A. Anand. “Code as re-
ward: Empowering reinforcement learning with VLMs”. In: ICML. Feb. 2024. url: https:
//openreview.net/forum?id=6P88DMUDvH.
[Vez+17] A. S. Vezhnevets, S. Osindero, T. Schaul, N. Heess, M. Jaderberg, D. Silver, and K. Kavukcuoglu.
“FeUdal Networks for Hierarchical Reinforcement Learning”. en. In: ICML. PMLR, July 2017,
pp. 3540–3549. url: https://proceedings.mlr.press/v70/vezhnevets17a.html.
[Vil+22] A. R. Villaflor, Z. Huang, S. Pande, J. M. Dolan, and J. Schneider. “Addressing Optimism
Bias in Sequence Modeling for Reinforcement Learning”. en. In: ICML. PMLR, June 2022,
pp. 22270–22283. url: https://proceedings.mlr.press/v162/villaflor22a.html.
[VPG20] N. Vieillard, O. Pietquin, and M. Geist. “Munchausen Reinforcement Learning”. In: NIPS.
Vol. 33. 2020, pp. 4235–4246. url: https://proceedings.neurips.cc/paper_files/paper/
2020/file/2c6a0bae0f071cbbf0bb3d5b11d90a82-Paper.pdf.
[Wag+19] N. Wagener, C.-A. Cheng, J. Sacks, and B. Boots. “An online learning approach to model
predictive control”. In: Robotics: Science and Systems. Feb. 2019. url: https://arxiv.org/
abs/1902.08967.
[Wan+16] Z. Wang, T. Schaul, M. Hessel, H. van Hasselt, M. Lanctot, and N. de Freitas. “Dueling Network
Architectures for Deep Reinforcement Learning”. In: ICML. 2016. url: http://proceedings.
mlr.press/v48/wangf16.pdf.
[Wan+19] T. Wang, X. Bao, I. Clavera, J. Hoang, Y. Wen, E. Langlois, S. Zhang, G. Zhang, P. Abbeel,
and J. Ba. “Benchmarking Model-Based Reinforcement Learning”. In: arXiv [cs.LG] (July 2019).
url: http://arxiv.org/abs/1907.02057.
[Wan+22] T. Wang, S. S. Du, A. Torralba, P. Isola, A. Zhang, and Y. Tian. “Denoised MDPs: Learning
World Models Better Than the World Itself”. In: ICML. June 2022. url: http://arxiv.org/
abs/2206.15477.
[Wan+24a] G. Wang, Y. Xie, Y. Jiang, A. Mandlekar, C. Xiao, Y. Zhu, L. Fan, and A. Anandkumar.
“Voyager: An Open-Ended Embodied Agent with Large Language Models”. In: TMLR (2024).
url: https://openreview.net/forum?id=ehfRiF0R3a.
[Wan+24b] S. Wang, S. Liu, W. Ye, J. You, and Y. Gao. “EfficientZero V2: Mastering discrete and continuous
control with limited data”. In: arXiv [cs.LG] (Mar. 2024). url: http://arxiv.org/abs/2403.
00564.
[WAT17] G. Williams, A. Aldrich, and E. A. Theodorou. “Model Predictive Path Integral Control: From
Theory to Parallel Computation”. In: J. Guid. Control Dyn. 40.2 (Feb. 2017), pp. 344–357. url:
https://doi.org/10.2514/1.G001921.
[Wat+21] J. Watson, H. Abdulsamad, R. Findeisen, and J. Peters. “Stochastic Control through Approxi-
mate Bayesian Input Inference”. In: arXiv (2021). url: http://arxiv.org/abs/2105.07693.
[WCM24] C. Wang, Y. Chen, and K. Murphy. “Model-based Policy Optimization under Approximate
Bayesian Inference”. en. In: AISTATS. PMLR, Apr. 2024, pp. 3250–3258. url: https : / /
proceedings.mlr.press/v238/wang24g.html.
[WD92] C. Watkins and P. Dayan. “Q-learning”. In: Machine Learning 8.3 (1992), pp. 279–292.
[Wei+24] R. Wei, N. Lambert, A. McDonald, A. Garcia, and R. Calandra. “A unified view on solving
objective mismatch in model-based Reinforcement Learning”. In: Trans. on Machine Learning
Research (2024). url: https://openreview.net/forum?id=tQVZgvXhZb.
[Wen18a] L. Weng. “A (Long) Peek into Reinforcement Learning”. In: lilianweng.github.io (2018). url:
https://lilianweng.github.io/posts/2018-02-19-rl-overview/.
[Wen18b] L. Weng. “Policy Gradient Algorithms”. In: lilianweng.github.io (2018). url: https://lilianweng.
github.io/posts/2018-04-08-policy-gradient/.
[WHT19] Y. Wang, H. He, and X. Tan. “Truly Proximal Policy Optimization”. In: UAI. 2019. url:
http://auai.org/uai2019/proceedings/papers/21.pdf.
[WHZ23] Z. Wang, J. J. Hunt, and M. Zhou. “Diffusion Policies as an Expressive Policy Class for Offline
Reinforcement Learning”. In: ICLR. 2023. url: https://openreview.net/pdf?id=AHvFDPi-
FA.
[Wie03] E. Wiewiora. “Potential-Based Shaping and Q-Value Initialization are Equivalent”. In: JAIR.
2003. url: https://jair.org/index.php/jair/article/view/10338.
[Wil+17] G. Williams, N. Wagener, B. Goldfain, P. Drews, J. M. Rehg, B. Boots, and E. A. Theodorou.
“Information theoretic MPC for model-based reinforcement learning”. In: ICRA. IEEE, May
2017, pp. 1714–1721. url: https://ieeexplore.ieee.org/document/7989202.
[Wil92] R. J. Williams. “Simple statistical gradient-following algorithms for connectionist reinforcement
learning”. In: MLJ 8.3-4 (1992), pp. 229–256.
[WIP20] J. Watson, A. Imohiosen, and J. Peters. “Active Inference or Control as Inference? A Unifying
View”. In: International Workshop on Active Inference. 2020. url: http://arxiv.org/abs/
2010.00262.
[WL14] N. Whiteley and A. Lee. “Twisted particle filters”. en. In: Annals of Statistics 42.1 (Feb. 2014),
pp. 115–141. url: https://projecteuclid.org/journals/annals-of-statistics/volume-
42/issue-1/Twisted-particle-filters/10.1214/13-AOS1167.full.
[Wu+17] Y. Wu, E. Mansimov, S. Liao, R. Grosse, and J. Ba. “Scalable trust-region method for deep
reinforcement learning using Kronecker-factored approximation”. In: NIPS. 2017. url: https:
//arxiv.org/abs/1708.05144.
[Wu+21] Y. Wu, S. Zhai, N. Srivastava, J. Susskind, J. Zhang, R. Salakhutdinov, and H. Goh. “Uncertainty
Weighted Actor-critic for offline Reinforcement Learning”. In: ICML. May 2021. url: https:
//arxiv.org/abs/2105.08140.
[Wu+22] P. Wu, A. Escontrela, D. Hafner, K. Goldberg, and P. Abbeel. “DayDreamer: World Models
for Physical Robot Learning”. In: (June 2022). arXiv: 2206.14176 [cs.RO]. url: http://arxiv.org/abs/2206.14176.
[Wu+23] G. Wu, W. Fang, J. Wang, P. Ge, J. Cao, Y. Ping, and P. Gou. “Dyna-PPO reinforcement
learning with Gaussian process for the continuous action decision-making in autonomous driving”.
en. In: Appl. Intell. 53.13 (July 2023), pp. 16893–16907. url: https://link.springer.com/
article/10.1007/s10489-022-04354-x.
[Wur+22] P. R. Wurman et al. “Outracing champion Gran Turismo drivers with deep reinforcement
learning”. en. In: Nature 602.7896 (Feb. 2022), pp. 223–228. url: https://www.researchgate.
net/publication/358484368_Outracing_champion_Gran_Turismo_drivers_with_deep_
reinforcement_learning.
[Xu+17] C. Xu, T. Qin, G. Wang, and T.-Y. Liu. “Reinforcement learning for learning rate control”. In:
arXiv [cs.LG] (May 2017). url: http://arxiv.org/abs/1705.11159.
[Yan+23] M. Yang, D. Schuurmans, P. Abbeel, and O. Nachum. “Dichotomy of control: Separating what
you can control from what you cannot”. In: ICLR. Vol. abs/2210.13435. 2023. url: https:
//github.com/google-research/google-research/tree/.
[Yan+24] S. Yang, Y. Du, S. K. S. Ghasemipour, J. Tompson, L. P. Kaelbling, D. Schuurmans, and
P. Abbeel. “Learning Interactive Real-World Simulators”. In: ICLR. 2024. url: https : / /
openreview.net/pdf?id=sFyTZEqmUY.
[Yao+22] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. R. Narasimhan, and Y. Cao. “ReAct: Synergizing
Reasoning and Acting in Language Models”. In: ICLR. Sept. 2022. url: https://openreview.
net/pdf?id=WE_vluYUL-X.
[Ye+21] W. Ye, S. Liu, T. Kurutach, P. Abbeel, and Y. Gao. “Mastering Atari games with limited data”.
In: NIPS. Oct. 2021.
[Yu17] H. Yu. “On convergence of some gradient-based temporal-differences algorithms for off-policy
learning”. In: arXiv [cs.LG] (Dec. 2017). url: http://arxiv.org/abs/1712.09652.
[Yu+20] T. Yu, G. Thomas, L. Yu, S. Ermon, J. Y. Zou, S. Levine, C. Finn, and T. Ma. “MOPO: Model-
based Offline Policy Optimization”. In: NIPS. Vol. 33. 2020, pp. 14129–14142. url: https://
proceedings.neurips.cc/paper_files/paper/2020/hash/a322852ce0df73e204b7e67cbbef0d0a-
Abstract.html.
[Yu+23] C. Yu, N. Burgess, M. Sahani, and S. Gershman. “Successor-Predecessor Intrinsic Exploration”. In:
NIPS. Vol. abs/2305.15277. Curran Associates, Inc., May 2023, pp. 73021–73038. url: https://
proceedings.neurips.cc/paper_files/paper/2023/hash/e6f2b968c4ee8ba260cd7077e39590dd-
Abstract-Conference.html.
[Yua22] M. Yuan. “Intrinsically-motivated reinforcement learning: A brief introduction”. In: arXiv [cs.LG]
(Mar. 2022). url: http://arxiv.org/abs/2203.02298.
[Yua+24] M. Yuan, R. C. Castanyer, B. Li, X. Jin, G. Berseth, and W. Zeng. “RLeXplore: Accelerating
research in intrinsically-motivated reinforcement learning”. In: arXiv [cs.LG] (May 2024). url:
http://arxiv.org/abs/2405.19548.
[YZ22] Y. Yang and P. Zhai. “Click-through rate prediction in online advertising: A literature review”.
In: Inf. Process. Manag. 59.2 (2022), p. 102853. url: https://www.sciencedirect.com/
science/article/pii/S0306457321003241.
[ZABD10] B. D. Ziebart, J Andrew Bagnell, and A. K. Dey. “Modeling Interaction via the Principle of
Maximum Causal Entropy”. In: ICML. 2010. url: https://www.cs.uic.edu/pub/Ziebart/
Publications/maximum-causal-entropy.pdf.
[Zel+24] E. Zelikman, G. Harik, Y. Shao, V. Jayasiri, N. Haber, and N. D. Goodman. “Quiet-STaR:
Language Models Can Teach Themselves to Think Before Speaking”. In: arXiv [cs.CL] (Mar.
2024). url: http://arxiv.org/abs/2403.09629.
[Zha+19] S. Zhang, B. Liu, H. Yao, and S. Whiteson. “Provably convergent two-timescale off-policy actor-
critic with function approximation”. In: ICML 119 (Nov. 2019). Ed. by H. D. Iii and A. Singh,
pp. 11204–11213. url: https://proceedings.mlr.press/v119/zhang20s/zhang20s.pdf.
[Zha+21] A. Zhang, R. T. McAllister, R. Calandra, Y. Gal, and S. Levine. “Learning Invariant Repre-
sentations for Reinforcement Learning without Reconstruction”. In: ICLR. 2021. url: https:
//openreview.net/pdf?id=-2FCwDKRREu.
[Zha+23a] J. Zhang, J. T. Springenberg, A. Byravan, L. Hasenclever, A. Abdolmaleki, D. Rao, N. Heess,
and M. Riedmiller. “Leveraging Jumpy Models for Planning and Fast Learning in Robotic
Domains”. In: arXiv [cs.RO] (Feb. 2023). url: http://arxiv.org/abs/2302.12617.
[Zha+23b] W. Zhang, G. Wang, J. Sun, Y. Yuan, and G. Huang. “STORM: Efficient Stochastic Transformer
based world models for reinforcement learning”. In: arXiv [cs.LG] (Oct. 2023). url: http:
//arxiv.org/abs/2310.09615.
[Zha+24] S. Zhao, R. Brekelmans, A. Makhzani, and R. B. Grosse. “Probabilistic Inference in Language
Models via Twisted Sequential Monte Carlo”. In: ICML. June 2024. url: https://openreview.
net/pdf?id=frA0NNBS1n.
[Zhe+22] L. Zheng, T. Fiez, Z. Alumbaugh, B. Chasnov, and L. J. Ratliff. “Stackelberg actor-critic: Game-
theoretic reinforcement learning algorithms”. en. In: AAAI 36.8 (June 2022), pp. 9217–9224.
url: https://ojs.aaai.org/index.php/AAAI/article/view/20908.
[Zho+22] H. Zhou, Z. Lin, J. Li, Q. Fu, W. Yang, and D. Ye. “Revisiting discrete soft actor-critic”. In:
arXiv [cs.LG] (Sept. 2022). url: http://arxiv.org/abs/2209.10081.
[Zho+24] G. Zhou, S. Swaminathan, R. V. Raju, J. S. Guntupalli, W. Lehrach, J. Ortiz, A. Dedieu,
M. Lázaro-Gredilla, and K. Murphy. “Diffusion Model Predictive Control”. In: arXiv [cs.LG]
(Oct. 2024). url: http://arxiv.org/abs/2410.05364.
[ZHR24] H. Zhu, B. Huang, and S. Russell. “On representation complexity of model-based and model-free
reinforcement learning”. In: ICLR. 2024.
[Zie+08] B. D. Ziebart, A. L. Maas, J. A. Bagnell, and A. K. Dey. “Maximum Entropy Inverse Reinforce-
ment Learning”. In: AAAI. 2008, pp. 1433–1438.
[Zin+21] L. Zintgraf, S. Schulze, C. Lu, L. Feng, M. Igl, K. Shiarlis, Y. Gal, K. Hofmann, and S. Whiteson.
“VariBAD: Variational Bayes-Adaptive Deep RL via meta-learning”. In: J. Mach. Learn. Res.
22.289 (2021), 289:1–289:39. url: https://www.jmlr.org/papers/volume22/21-0657/21-
0657.pdf.
[Zit+23] B. Zitkovich et al. “RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic
Control”. en. In: Conference on Robot Learning. PMLR, Dec. 2023, pp. 2165–2183. url: https:
//proceedings.mlr.press/v229/zitkovich23a.html.
[ZS22] N. Zucchet and J. Sacramento. “Beyond backpropagation: Bilevel optimization through implicit
differentiation and equilibrium propagation”. en. In: Neural Comput. 34.12 (Nov. 2022), pp. 2309–
2346. url: https://direct.mit.edu/neco/article-pdf/34/12/2309/2057431/neco_a_
01547.pdf.
[ZSE24] C. Zheng, R. Salakhutdinov, and B. Eysenbach. “Contrastive Difference Predictive Coding”.
In: ICLR. 2024. url: https:
//openreview.net/pdf?id=0akLDTFR9x.
[ZW19] S. Zhang and S. Whiteson. “DAC: The Double Actor-Critic Architecture for Learning Options”.
In: NIPS 32 (2019). url: https://proceedings.neurips.cc/paper_files/paper/2019/
file/4f284803bd0966cc24fa8683a34afc6e-Paper.pdf.