
arXiv:1811.12560v2 [cs.LG] 3 Dec 2018

An Introduction to Deep
Reinforcement Learning
Vincent François-Lavet, Peter Henderson, Riashat Islam, Marc G. Bellemare and Joelle
Pineau (2018), “An Introduction to Deep Reinforcement Learning”, Foundations and
Trends in Machine Learning: Vol. 11, No. 3-4. DOI: 10.1561/2200000071.

Vincent François-Lavet, McGill University ([email protected])
Peter Henderson, McGill University ([email protected])
Riashat Islam, McGill University ([email protected])
Marc G. Bellemare, Google Brain ([email protected])
Joelle Pineau, Facebook, McGill University ([email protected])

Boston — Delft
Contents

1 Introduction
1.1 Motivation
1.2 Outline

2 Machine learning and deep learning
2.1 Supervised learning and the concepts of bias and overfitting
2.2 Unsupervised learning
2.3 The deep learning approach

3 Introduction to reinforcement learning
3.1 Formal framework
3.2 Different components to learn a policy
3.3 Different settings to learn a policy from data

4 Value-based methods for deep RL
4.1 Q-learning
4.2 Fitted Q-learning
4.3 Deep Q-networks
4.4 Double DQN
4.5 Dueling network architecture
4.6 Distributional DQN
4.7 Multi-step learning
4.8 Combination of all DQN improvements and variants of DQN

5 Policy gradient methods for deep RL
5.1 Stochastic Policy Gradient
5.2 Deterministic Policy Gradient
5.3 Actor-Critic Methods
5.4 Natural Policy Gradients
5.5 Trust Region Optimization
5.6 Combining policy gradient and Q-learning

6 Model-based methods for deep RL
6.1 Pure model-based methods
6.2 Integrating model-free and model-based methods

7 The concept of generalization
7.1 Feature selection
7.2 Choice of the learning algorithm and function approximator selection
7.3 Modifying the objective function
7.4 Hierarchical learning
7.5 How to obtain the best bias-overfitting tradeoff

8 Particular challenges in the online setting
8.1 Exploration/Exploitation dilemma
8.2 Managing experience replay

9 Benchmarking Deep RL
9.1 Benchmark Environments
9.2 Best practices to benchmark deep RL
9.3 Open-source software for Deep RL

10 Deep reinforcement learning beyond MDPs
10.1 Partial observability and the distribution of (related) MDPs
10.2 Transfer learning
10.3 Learning without explicit reward function
10.4 Multi-agent systems
11 Perspectives on deep reinforcement learning
11.1 Successes of deep reinforcement learning
11.2 Challenges of applying reinforcement learning to real-world problems
11.3 Relations between deep RL and neuroscience

12 Conclusion
12.1 Future development of deep RL
12.2 Applications and societal impact of deep RL

Appendices

References
An Introduction to Deep
Reinforcement Learning
Vincent François-Lavet¹, Peter Henderson², Riashat Islam³, Marc G. Bellemare⁴ and Joelle Pineau⁵

¹ McGill University; [email protected]
² McGill University; [email protected]
³ McGill University; [email protected]
⁴ Google Brain; [email protected]
⁵ Facebook, McGill University; [email protected]

ABSTRACT
Deep reinforcement learning is the combination of reinforce-
ment learning (RL) and deep learning. This field of research
has been able to solve a wide range of complex decision-
making tasks that were previously out of reach for a machine.
Thus, deep RL opens up many new applications in domains
such as healthcare, robotics, smart grids, finance, and many
more. This manuscript provides an introduction to deep
reinforcement learning models, algorithms and techniques.
Particular focus is on the aspects related to generalization
and how deep RL can be used for practical applications. We
assume the reader is familiar with basic machine learning
concepts.
1 Introduction

1.1 Motivation

A core topic in machine learning is that of sequential decision-making.


This is the task of deciding, from experience, the sequence of actions
to perform in an uncertain environment in order to achieve some
goals. Sequential decision-making tasks cover a wide range of possible
applications with the potential to impact many domains, such as
robotics, healthcare, smart grids, finance, self-driving cars, and many
more.
Inspired by behavioral psychology (see e.g., Sutton, 1984), rein-
forcement learning (RL) proposes a formal framework to this problem.
The main idea is that an artificial agent may learn by interacting with
its environment, similarly to a biological agent. Using the experience
gathered, the artificial agent should be able to optimize some objectives
given in the form of cumulative rewards. This approach applies in prin-
ciple to any type of sequential decision-making problem relying on past
experience. The environment may be stochastic, the agent may only
observe partial information about the current state, the observations
may be high-dimensional (e.g., frames and time series), the agent may
freely gather experience in the environment or, on the contrary, the data


may be constrained (e.g., no access to an accurate simulator or limited data).
Over the past few years, RL has become increasingly popular due
to its success in addressing challenging sequential decision-making
problems. Several of these achievements are due to the combination of
RL with deep learning techniques (LeCun et al., 2015; Schmidhuber,
2015; Goodfellow et al., 2016). This combination, called deep RL, is
most useful in problems with a high-dimensional state space. Previous RL
approaches had a difficult design issue in the choice of features (Munos
and Moore, 2002; Bellemare et al., 2013). However, deep RL has been
successful in complicated tasks with lower prior knowledge thanks to its
ability to learn different levels of abstractions from data. For instance,
a deep RL agent can successfully learn from visual perceptual inputs
made up of thousands of pixels (Mnih et al., 2015). This opens up the
possibility to mimic some human problem solving capabilities, even in
high-dimensional space — which, only a few years ago, was difficult to
conceive.
Several notable works using deep RL in games have stood out for
attaining super-human level in playing Atari games from the pixels
(Mnih et al., 2015), mastering Go (Silver et al., 2016a) or beating the
world’s top professionals at the game of Poker (Brown and Sandholm,
2017; Moravčik et al., 2017). Deep RL also has potential for real-world
applications such as robotics (Levine et al., 2016; Gandhi et al., 2017;
Pinto et al., 2017), self-driving cars (You et al., 2017), finance (Deng
et al., 2017) and smart grids (François-Lavet, 2017), to name a few.
Nonetheless, several challenges arise in applying deep RL algorithms.
Among others, exploring the environment efficiently or being able
to generalize a good behavior in a slightly different context are not
straightforward. Thus, a large array of algorithms has been proposed for the deep RL framework, depending on a variety of settings of the sequential decision-making tasks.

1.2 Outline

The goal of this introduction to deep RL is to guide the reader towards


effective use and understanding of core methods, as well as provide
references for further reading. After reading this introduction, the reader
should be able to understand the key deep RL approaches and
algorithms and should be able to apply them. The reader should also
have enough background to investigate the scientific literature further
and pursue research on deep RL.
In Chapter 2, we introduce the field of machine learning and the deep
learning approach. The goal is to provide the general technical context
and explain briefly where deep learning is situated in the broader field
of machine learning. We assume the reader is familiar with basic notions
of supervised and unsupervised learning; however, we briefly review the
essentials.
In Chapter 3, we provide the general RL framework along with
the case of a Markov Decision Process (MDP). In that context, we
examine the different methodologies that can be used to train a deep
RL agent. On the one hand, learning a value function (Chapter 4)
and/or a direct representation of the policy (Chapter 5) belong to the
so-called model-free approaches. On the other hand, planning algorithms
that can make use of a learned model of the environment belong to the
so-called model-based approaches (Chapter 6).
We dedicate Chapter 7 to the notion of generalization in RL.
Within either a model-based or a model-free approach, we discuss the
importance of different elements: (i) feature selection, (ii) function
approximator selection, (iii) modifying the objective function and
(iv) hierarchical learning. In Chapter 8, we present the main challenges of
using RL in the online setting. In particular, we discuss the exploration-
exploitation dilemma and the use of a replay memory.
In Chapter 9, we provide an overview of different existing bench-
marks for evaluation of RL algorithms. Furthermore, we present a set
of best practices to ensure consistency and reproducibility of the results
obtained on the different benchmarks.
In Chapter 10, we discuss more general settings than MDPs: (i) the
Partially Observable Markov Decision Process (POMDP), (ii) the
distribution of MDPs (instead of a given MDP) along with the notion
of transfer learning, (iii) learning without explicit reward function and
(iv) multi-agent systems. We provide descriptions of how deep RL can
be used in these settings.

In Chapter 11, we present broader perspectives on deep RL. This


includes a discussion on applications of deep RL in various domains,
along with the successes achieved and remaining challenges (e.g., robotics, self-driving cars, smart grids, healthcare, etc.). This also includes a brief
discussion on the relationship between deep RL and neuroscience.
Finally, we provide a conclusion in Chapter 12 with an outlook on
the future development of deep RL techniques, their future applications,
as well as the societal impact of deep RL and artificial intelligence.
2 Machine learning and deep learning

Machine learning provides automated methods that can detect patterns


in data and use them to achieve some tasks (Christopher, 2006; Murphy,
2012). Three types of machine learning tasks can be considered:

• Supervised learning is the task of inferring a classification or


regression from labeled training data.

• Unsupervised learning is the task of drawing inferences from


datasets consisting of input data without labeled responses.

• Reinforcement learning (RL) is the task of learning how agents


ought to take sequences of actions in an environment in order to
maximize cumulative rewards.

To solve these machine learning tasks, the idea of function


approximators is at the heart of machine learning. There exist many
different types of function approximators: linear models (Anderson et al.,
1958), SVMs (Cortes and Vapnik, 1995), decision trees (Liaw, Wiener,
et al., 2002; Geurts et al., 2006), Gaussian processes (Rasmussen, 2004),
deep learning (LeCun et al., 2015; Schmidhuber, 2015; Goodfellow et al.,
2016), etc.


In recent years, mainly due to recent developments in deep learning,


machine learning has undergone dramatic improvements when learning
from high-dimensional data such as time series, images and videos. These
improvements can be linked to the following aspects: (i) an exponential
increase of computational power with the use of GPUs and distributed
computing (Krizhevsky et al., 2012), (ii) methodological breakthroughs
in deep learning (Srivastava et al., 2014; Ioffe and Szegedy, 2015; He
et al., 2016; Szegedy et al., 2016; Klambauer et al., 2017), (iii) a growing
eco-system of softwares such as Tensorflow (Abadi et al., 2016) and
datasets such as ImageNet (Russakovsky et al., 2015). All these aspects
are complementary and, in the last few years, they have lead to a
virtuous circle for the development of deep learning.
In this chapter, we discuss the supervised learning setting along
with the key concepts of bias and overfitting. We briefly discuss the
unsupervised setting with tasks such as data compression and generative
models. We also introduce the deep learning approach that has become
key to the whole field of machine learning. Using the concepts introduced
in this chapter, we cover the reinforcement learning setting in later
chapters.

2.1 Supervised learning and the concepts of bias and overfitting

In its most abstract form, supervised learning consists in finding a function $f : X \to Y$ that takes as input $x \in X$ and gives as output $y \in Y$ ($X$ and $Y$ depend on the application):
$$y = f(x). \tag{2.1}$$

A supervised learning algorithm can be viewed as a function that maps a dataset $D_{LS}$ of learning samples $(x, y) \overset{i.i.d.}{\sim} (X, Y)$ into a model. The prediction of such a model at a point $x \in X$ of the input space is denoted by $f(x \mid D_{LS})$. Assuming a random sampling scheme, s.t. $D_{LS} \sim \mathcal{D}_{LS}$, $f(x \mid D_{LS})$ is a random variable, and so is its average error over the input space. The expected value of this quantity is given by:
$$I[f] = \mathop{\mathbb{E}}_{X}\, \mathop{\mathbb{E}}_{D_{LS}}\, \mathop{\mathbb{E}}_{Y \mid X}\, L\big(Y, f(X \mid D_{LS})\big), \tag{2.2}$$

where $L(\cdot, \cdot)$ is the loss function. If $L(y, \hat{y}) = (y - \hat{y})^2$, the error decomposes naturally into a sum of a bias term and a variance term.¹ This bias-variance decomposition can be useful because it highlights a tradeoff between an error due to erroneous assumptions in the model selection/learning algorithm (the bias) and an error due to the fact that only a finite set of data is available to learn that model (the parametric variance). Note that the parametric variance is also called the overfitting error.² Even though there is no such direct decomposition for other loss functions (James, 2003), there is always a tradeoff between a sufficiently rich model (to reduce the model bias, which is present even when the amount of data would be unlimited) and a model not too complex (so as to avoid overfitting to the limited amount of data). Figure 2.1 provides an illustration.
Without knowing the joint probability distribution, it is impossible to compute $I[f]$. Instead, we can compute the empirical error on a sample of data. Given $n$ data points $(x_i, y_i)$, the empirical error is
$$I_S[f] = \frac{1}{n} \sum_{i=1}^{n} L\big(y_i, f(x_i)\big).$$

The generalization error is the difference between the error on a sample set (used for training) and the error on the underlying joint probability distribution. It is defined as
$$G = I[f] - I_S[f].$$

¹ The bias-variance decomposition (Geman et al., 1992) is given by:
$$\mathop{\mathbb{E}}_{D_{LS}}\, \mathop{\mathbb{E}}_{Y \mid X} \big(Y - f(X \mid D_{LS})\big)^2 = \sigma^2(x) + \text{bias}^2(x), \tag{2.3}$$
where
$$\text{bias}^2(x) \triangleq \Big( \mathop{\mathbb{E}}_{Y \mid x}(Y) - \mathop{\mathbb{E}}_{D_{LS}} f(x \mid D_{LS}) \Big)^2,$$
$$\sigma^2(x) \triangleq \underbrace{\mathop{\mathbb{E}}_{Y \mid x}\Big(Y - \mathop{\mathbb{E}}_{Y \mid x}(Y)\Big)^2}_{\text{Internal variance}} + \underbrace{\mathop{\mathbb{E}}_{D_{LS}}\Big(f(x \mid D_{LS}) - \mathop{\mathbb{E}}_{D_{LS}} f(x \mid D_{LS})\Big)^2}_{\text{Parametric variance}}. \tag{2.4}$$

² For any given model, the parametric variance goes to zero with an arbitrarily large dataset by considering the strong law of convergence.


Figure 2.1: Illustration of overfitting and underfitting for a simple 1D regression task
in supervised learning (based on one example from the library scikit-learn (Pedregosa
et al., 2011)). In this illustration, the data points (x, y) are noisy samples from a
true function represented in green. In the left figure, the degree 1 approximation is
underfitting, which means that it is not a good model, even for the training samples;
on the right, the degree 10 approximation is a very good model for the training
samples but is overly complex and fails to provide a good generalization.
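The tradeoff illustrated in Figure 2.1 can be reproduced numerically. The sketch below (not part of the original text) fits polynomials of degree 1, 4 and 10 to noisy samples and compares the empirical error $I_S[f]$ with a held-out approximation of $I[f]$; the true function, noise level and sample size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    return np.cos(1.5 * np.pi * x)        # arbitrary "true" function (shown in green in Fig. 2.1)

# n noisy learning samples (x_i, y_i)
n = 15
x_train = np.sort(rng.uniform(0, 1, n))
y_train = true_f(x_train) + 0.1 * rng.normal(size=n)

# A large held-out sample to approximate the expected error I[f]
x_test = np.linspace(0, 1, 1000)
y_test = true_f(x_test) + 0.1 * rng.normal(size=1000)

for degree in (1, 4, 10):
    coeffs = np.polyfit(x_train, y_train, degree)                    # least-squares fit
    emp_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)  # I_S[f]
    exp_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)    # approximates I[f]
    print(f"degree {degree:2d}: I_S[f]={emp_err:.4f}  I[f]~{exp_err:.4f}  gap G~{exp_err - emp_err:.4f}")
```

The degree-1 fit shows a large error on both sets (bias), while the degree-10 fit drives $I_S[f]$ down but leaves a large generalization gap $G$.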

In machine learning, the complexity of the function approximator


provides upper bounds on the generalization error. The generalization
error can be bounded by making use of complexity measures, such as
the Rademacher complexity (Bartlett and Mendelson, 2002), or the
VC-dimension (Vapnik, 1998). However, even though it lacks strong
theoretical foundations, it has become clear in practice that the strength
of deep neural networks is their generalization capability, even with
a high number of parameters (hence a potentially high complexity)
(Zhang et al., 2016).

2.2 Unsupervised learning

Unsupervised learning is a branch of machine learning that learns from


data that do not have any label. It relates to using and identifying
patterns in the data for tasks such as data compression or generative
models.
Data compression or dimensionality reduction involves encoding information using a smaller representation (e.g., fewer bits) than the original representation. For instance, an auto-encoder consists of an encoder and a decoder. The encoder maps the original image $x_i \in \mathbb{R}^M$
onto a low-dimensional representation $z_i = e(x_i; \theta_e) \in \mathbb{R}^m$, where $m \ll M$; the decoder maps these features back to a high-dimensional representation $d(z_i; \theta_d) \approx e^{-1}(z_i; \theta_e)$. Auto-encoders can be trained by optimizing for the reconstruction of the input through supervised learning objectives.
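As a concrete, deliberately minimal sketch of this idea (not taken from the original text), the snippet below trains a purely linear auto-encoder by gradient descent on the reconstruction error; the synthetic data, dimensions, learning rate and number of steps are arbitrary, and a practical auto-encoder would use non-linear layers and a deep learning framework.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, m = 500, 20, 3                    # samples, input dimension M, code dimension m << M

# Synthetic data lying close to an m-dimensional subspace of R^M
X = rng.normal(size=(N, m)) @ rng.normal(size=(m, M)) + 0.01 * rng.normal(size=(N, M))

W_e = 0.1 * rng.normal(size=(M, m))     # encoder:  z = x W_e
W_d = 0.1 * rng.normal(size=(m, M))     # decoder:  x_hat = z W_d (biases omitted for brevity)

alpha = 0.01
for step in range(5000):
    Z = X @ W_e                         # codes, shape (N, m)
    err = Z @ W_d - X                   # reconstruction error, shape (N, M)
    loss = np.mean(np.sum(err ** 2, axis=1))
    # Gradients of the mean squared reconstruction error
    grad_W_d = (2.0 / N) * Z.T @ err
    grad_W_e = (2.0 / N) * X.T @ (err @ W_d.T)
    W_d -= alpha * grad_W_d
    W_e -= alpha * grad_W_e

print(f"final reconstruction error: {loss:.4f}")
```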
Generative models aim at approximating the true data distribution
of a training set so as to generate new data points from the distribution.
Generative adversarial networks (Goodfellow et al., 2014) use an
adversarial process, in which two models are trained simulatenously: a
generative model G captures the data distribution, while a discriminative
model D estimates whether a sample comes from the training data rather
than G. The training procedure corresponds to a minimax two-player
game.

2.3 The deep learning approach

Deep learning relies on a function $f : X \to Y$ parameterized with $\theta \in \mathbb{R}^{n_\theta}$ ($n_\theta \in \mathbb{N}$):
$$y = f(x; \theta). \tag{2.5}$$
A deep neural network is characterized by a succession of multiple
processing layers. Each layer consists in a non-linear transformation
and the sequence of these transformations leads to learning different
levels of abstraction (Erhan et al., 2009; Olah et al., 2017).
First, let us describe a very simple neural network with one fully-connected hidden layer (see Fig 2.2). The first layer is given the input values (i.e., the input features) $x$ in the form of a column vector of size $n_x$ ($n_x \in \mathbb{N}$). The values of the next hidden layer are a transformation of these values by a non-linear parametric function, which is a matrix multiplication by $W_1$ of size $n_h \times n_x$ ($n_h \in \mathbb{N}$), plus a bias term $b_1$ of size $n_h$, followed by a non-linear transformation:
$$h = A(W_1 \cdot x + b_1), \tag{2.6}$$

where A is the activation function. This non-linear activation function is


what makes the transformation at each layer non-linear, which ultimately
provides the expressivity of the neural network. The hidden layer h of
size $n_h$ can in turn be transformed to other sets of values up to the last transformation that provides the output values $y$. In this case:
$$y = (W_2 \cdot h + b_2), \tag{2.7}$$
where $W_2$ is of size $n_y \times n_h$ and $b_2$ is of size $n_y$ ($n_y \in \mathbb{N}$).


Figure 2.2: Example of a neural network with one hidden layer.

All these layers are trained to minimize the empirical error $I_S[f]$. The most common method for optimizing the parameters of a neural network is based on gradient descent via the backpropagation algorithm (Rumelhart et al., 1988). In the simplest case, at every iteration, the algorithm changes its internal parameters $\theta$ so as to fit the desired function:
$$\theta \leftarrow \theta - \alpha \nabla_{\theta} I_S[f], \tag{2.8}$$
where $\alpha$ is the learning rate.
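The NumPy sketch below makes Equations (2.6)-(2.8) concrete for the one-hidden-layer network of Figure 2.2: a forward pass followed by a single gradient-descent step on a squared loss for one sample. The layer sizes, the tanh activation and the learning rate are arbitrary illustrative choices, not prescriptions from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_h, n_y = 4, 8, 2                      # input, hidden and output sizes

# Parameters theta = (W1, b1, W2, b2)
W1 = 0.1 * rng.normal(size=(n_h, n_x)); b1 = np.zeros(n_h)
W2 = 0.1 * rng.normal(size=(n_y, n_h)); b2 = np.zeros(n_y)

A = np.tanh                                  # activation function in Eq. (2.6)

def forward(x):
    h = A(W1 @ x + b1)                       # hidden layer, Eq. (2.6)
    y = W2 @ h + b2                          # output layer, Eq. (2.7)
    return h, y

# One gradient-descent step (Eq. 2.8) on the squared loss of a single sample
x, y_target = rng.normal(size=n_x), rng.normal(size=n_y)
alpha = 0.05

h, y = forward(x)
delta_y = 2.0 * (y - y_target)               # dL/dy for L = ||y - y_target||^2
delta_h = (W2.T @ delta_y) * (1.0 - h ** 2)  # backpropagation through tanh

W2 -= alpha * np.outer(delta_y, h); b2 -= alpha * delta_y
W1 -= alpha * np.outer(delta_h, x); b1 -= alpha * delta_h
```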


In current applications, many different types of neural network
layers have appeared beyond the simple feedforward networks just
introduced. Each variation provides specific advantages, depending on
the application (e.g., good tradeoff between bias and overfitting in
a supervised learning setting). In addition, within one given neural
network, an arbitrarily large number of layers is possible, and the trend
in the last few years is to have an ever-growing number of layers, with


more than 100 in some supervised learning tasks (Szegedy et al., 2017).
We merely describe here two types of layers that are of particular interest
in deep RL (and in many other tasks).
Convolutional layers (LeCun, Bengio, et al., 1995) are particularly
well suited for images and sequential data (see Fig 2.3), mainly due
to their translation invariance property. The layer’s parameters consist
of a set of learnable filters (or kernels), which have a small receptive
field and which apply a convolution operation to the input, passing
the result to the next layer. As a result, the network learns filters that
activate when it detects some specific features. In image classification,
the first layers learn how to detect edges, textures and patterns; then
the following layers are able to detect parts of objects and whole objects
(Erhan et al., 2009; Olah et al., 2017). In fact, a convolutional layer is
a particular kind of feedforward layer, with the specificity that many
weights are set to 0 (not learnable) and that other weights are shared.

Figure 2.3: Illustration of a convolutional layer with one input feature map that is
convolved by different filters to yield the output feature maps. The parameters that
are learned for this type of layer are those of the filters. For illustration purposes,
some results are displayed for one of the output feature maps with a given filter (in
practice, that operation is followed by a non-linear activation function).
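To make the operation of Figure 2.3 concrete, here is a minimal, intentionally naive sketch of a valid-mode convolution of one input feature map with one filter; deep learning libraries implement the same operation far more efficiently and over batches of multi-channel feature maps.

```python
import numpy as np

def conv2d(feature_map, kernel):
    """Valid-mode 2D convolution of one input feature map with one filter
    (technically a cross-correlation, as implemented in most deep learning libraries)."""
    H, W = feature_map.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(feature_map[i:i + kH, j:j + kW] * kernel)
    return out

image = np.eye(12)                           # toy 12x12 input feature map (a diagonal edge)
kernel = np.array([[1.0, 0.0],
                   [0.0, 1.0]])              # a 2x2 filter that responds to that diagonal pattern
print(conv2d(image, kernel))                 # large activations where the pattern is present
```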

Figure 2.4: Illustration of a simple recurrent neural network. The layer denoted by "h" may represent any non-linear function that takes two inputs and provides two outputs. On the left is the simplified view of a recurrent neural network that is applied recursively to $(x_t, y_t)$ for increasing values of $t$ and where the blue line represents a delay of one time step. On the right, the neural network is unfolded with the implicit requirement of presenting all inputs and outputs simultaneously.

Recurrent layers are particularly well suited for sequential data (see
Fig 2.4). Several different variants provide particular benefits in different
settings. One such example is the long short-term memory network
(LSTM) (Hochreiter and Schmidhuber, 1997), which is able to encode
information from long sequences, unlike a basic recurrent neural network.
Neural Turing Machines (NTMs) (Graves et al., 2014) are another such
example. In such systems, a differentiable "external memory" is used
for inferring even longer-term dependencies than LSTMs with low
degradation.
Several other specific neural network architectures have also been
studied to improve generalization in deep learning. For instance, it is
possible to design an architecture in such a way that it automatically
focuses on only some parts of the inputs with a mechanism called
attention (Xu et al., 2015; Vaswani et al., 2017). Other approaches aim
to work with symbolic rules by learning to create programs (Reed and
De Freitas, 2015; Neelakantan et al., 2015; Johnson et al., 2017; Chen
et al., 2017).
To be able to actually apply the deep RL methods described in the
later chapters, the reader should have practical knowledge of applying
deep learning methods in simple supervised learning settings (e.g.,
MNIST classification). For information on topics such as the importance
of input normalizations, weight initialization techniques, regularization


techniques and the different variants of gradient descent techniques,
the reader can refer to several reviews on the subject (LeCun et al.,
2015; Schmidhuber, 2015; Goodfellow et al., 2016) as well as references
therein.
In the following chapters, the focus is on reinforcement learning,
in particular on methods that scale to deep neural network function
approximators. These methods allow for learning a wide variety of
challenging sequential decision-making tasks directly from rich high-
dimensional inputs.
3 Introduction to reinforcement learning

Reinforcement learning (RL) is the area of machine learning that deals


with sequential decision-making. In this chapter, we describe how the
RL problem can be formalized as an agent that has to make decisions
in an environment to optimize a given notion of cumulative rewards.
It will become clear that this formalization applies to a wide variety
of tasks and captures many essential features of artificial intelligence
such as a sense of cause and effect as well as a sense of uncertainty and
nondeterminism. This chapter also introduces the different approaches
to learning sequential decision-making tasks and how deep RL can be
useful.
A key aspect of RL is that an agent learns a good behavior. This
means that it modifies or acquires new behaviors and skills incrementally.
Another important aspect of RL is that it uses trial-and-error experience
(as opposed to e.g., dynamic programming that assumes full knowledge
of the environment a priori). Thus, the RL agent does not require
complete knowledge or control of the environment; it only needs to be
able to interact with the environment and collect information. In an
offline setting, the experience is acquired a priori, then it is used as a
batch for learning (hence the offline setting is also called batch RL).


This is in contrast to the online setting where data becomes available


in a sequential order and is used to progressively update the behavior
of the agent. In both cases, the core learning algorithms are essentially
the same but the main difference is that in an online setting, the agent
can influence how it gathers experience so that it is the most useful for
learning. This is an additional challenge mainly because the agent has to
deal with the exploration/exploitation dilemma while learning (see §8.1
for a detailed discussion). But learning in the online setting can also be
an advantage since the agent is able to gather information specifically
on the most interesting part of the environment. For that reason, even
when the environment is fully known, RL approaches may provide the
most computationally efficient approach in practice as compared to
some dynamic programming methods that would be inefficient due to
this lack of specificity.

3.1 Formal framework

The reinforcement learning setting


The general RL problem is formalized as a discrete time stochastic control process where an agent interacts with its environment in the following way: the agent starts, in a given state within its environment $s_0 \in S$, by gathering an initial observation $\omega_0 \in \Omega$. At each time step $t$, the agent has to take an action $a_t \in A$. As illustrated in Figure 3.1, three consequences follow: (i) the agent obtains a reward $r_t \in \mathcal{R}$, (ii) the state transitions to $s_{t+1} \in S$, and (iii) the agent obtains an observation $\omega_{t+1} \in \Omega$. This control setting was first proposed by Bellman, 1957b and later extended to learning by Barto et al., 1983. A comprehensive treatment of RL fundamentals is provided by Sutton and Barto, 2017. Here, we review the main elements of RL before delving into deep RL in the following chapters.

The Markov property


For the sake of simplicity, let us consider first the case of Markovian
stochastic control processes (Norris, 1998).

Figure 3.1: Agent-environment interaction in RL.

Definition 3.1. A discrete time stochastic control process is Markovian (i.e., it has the Markov property) if

• $P(\omega_{t+1} \mid \omega_t, a_t) = P(\omega_{t+1} \mid \omega_t, a_t, \ldots, \omega_0, a_0)$, and

• $P(r_t \mid \omega_t, a_t) = P(r_t \mid \omega_t, a_t, \ldots, \omega_0, a_0)$.

The Markov property means that the future of the process only depends on the current observation, and the agent has no interest in looking at the full history.
A Markov Decision Process (MDP) (Bellman, 1957a) is a discrete
time stochastic control process defined as follows:

Definition 3.2. An MDP is a 5-tuple $(S, A, T, R, \gamma)$ where:

• $S$ is the state space,

• $A$ is the action space,

• $T : S \times A \times S \to [0, 1]$ is the transition function (set of conditional transition probabilities between states),

• $R : S \times A \times S \to \mathcal{R}$ is the reward function, where $\mathcal{R}$ is a continuous set of possible rewards in a range $R_{\max} \in \mathbb{R}^+$ (e.g., $[0, R_{\max}]$),

• $\gamma \in [0, 1)$ is the discount factor.

The system is fully observable in an MDP, which means that the observation is the same as the state of the environment: $\omega_t = s_t$. At each time step $t$, the probability of moving to $s_{t+1}$ is given by the state
transition function $T(s_t, a_t, s_{t+1})$ and the reward is given by a bounded reward function $R(s_t, a_t, s_{t+1}) \in \mathcal{R}$. This is illustrated in Figure 3.2. Note that more general cases than MDPs are introduced in Chapter 10.

Figure 3.2: Illustration of an MDP. At each step, the agent takes an action that changes its state in the environment and provides a reward.
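As a purely illustrative sketch of Definition 3.2 (not an example from the text), a tiny MDP can be written down explicitly; the two states, two actions, transition probabilities and rewards below are arbitrary, and this toy environment is reused in the sketches of the following chapters.

```python
import numpy as np

# A toy MDP (S, A, T, R, gamma) matching Definition 3.2; all numbers are arbitrary.
S = [0, 1]                                   # state space
A = [0, 1]                                   # action space
gamma = 0.9                                  # discount factor

# T[s][a] is the vector of transition probabilities over next states s'
T = {0: {0: np.array([0.9, 0.1]), 1: np.array([0.2, 0.8])},
     1: {0: np.array([0.8, 0.2]), 1: np.array([0.1, 0.9])}}

# R[s][a][s'] is the reward obtained for the transition (s, a, s')
R = {0: {0: np.array([0.0, 1.0]), 1: np.array([0.0, 1.0])},
     1: {0: np.array([0.0, 2.0]), 1: np.array([0.0, 2.0])}}

def step(s, a, rng):
    """Sample one agent-environment transition, as illustrated in Figure 3.2."""
    s_next = rng.choice(S, p=T[s][a])
    return s_next, R[s][a][s_next]
```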

Different categories of policies


A policy defines how an agent selects actions. Policies can be categorized
under the criterion of being either stationary or non-stationary. A non-
stationary policy depends on the time-step and is useful for the finite-
horizon context where the cumulative rewards that the agent seeks to
optimize are limited to a finite number of future time steps (Bertsekas
et al., 1995). In this introduction to deep RL, infinite horizons are
considered and the policies are stationary.¹
Policies can also be categorized under a second criterion of being
either deterministic or stochastic:

• In the deterministic case, the policy is described by π(s) : S → A.

• In the stochastic case, the policy is described by π(s, a) : S × A →


[0, 1] where π(s, a) denotes the probability that action a may be
chosen in state s.

¹ The formalism can be directly extended to the finite horizon context. In that case, the policy and the cumulative expected returns should be time-dependent.
The expected return


Throughout this survey, we consider the case of an RL agent whose goal is to find a policy $\pi(s, a) \in \Pi$, so as to optimize an expected return $V^\pi(s) : S \to \mathbb{R}$ (also called V-value function) such that
$$V^\pi(s) = \mathbb{E}\Big[ \sum_{k=0}^{\infty} \gamma^k r_{t+k} \;\Big|\; s_t = s, \pi \Big], \tag{3.1}$$
where:

• $r_t = \underset{a \sim \pi(s_t, \cdot)}{\mathbb{E}} R\big(s_t, a, s_{t+1}\big)$,

• $P\big(s_{t+1} \mid s_t, a_t\big) = T(s_t, a_t, s_{t+1})$ with $a_t \sim \pi(s_t, \cdot)$.

From the definition of the expected return, the optimal expected return can be defined as:
$$V^*(s) = \max_{\pi \in \Pi} V^\pi(s). \tag{3.2}$$

In addition to the V-value function, a few other functions of interest can be introduced. The Q-value function $Q^\pi(s, a) : S \times A \to \mathbb{R}$ is defined as follows:
$$Q^\pi(s, a) = \mathbb{E}\Big[ \sum_{k=0}^{\infty} \gamma^k r_{t+k} \;\Big|\; s_t = s, a_t = a, \pi \Big]. \tag{3.3}$$

This equation can be rewritten recursively in the case of an MDP using Bellman's equation:
$$Q^\pi(s, a) = \sum_{s' \in S} T(s, a, s') \Big( R(s, a, s') + \gamma Q^\pi\big(s', a = \pi(s')\big) \Big). \tag{3.4}$$

Similarly to the V-value function, the optimal Q-value function $Q^*(s, a)$ can also be defined as
$$Q^*(s, a) = \max_{\pi \in \Pi} Q^\pi(s, a). \tag{3.5}$$

The particularity of the Q-value function as compared to the V-value function is that the optimal policy can be obtained directly from $Q^*(s, a)$:
$$\pi^*(s) = \underset{a \in A}{\operatorname{argmax}}\ Q^*(s, a). \tag{3.6}$$

The optimal V-value function $V^*(s)$ is the expected discounted reward when in a given state $s$ while following the policy $\pi^*$ thereafter.
The optimal Q-value $Q^*(s, a)$ is the expected discounted return when in a given state $s$ and for a given action $a$ while following the policy $\pi^*$ thereafter.

It is also possible to define the advantage function
$$A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s). \tag{3.7}$$
This quantity describes how good the action $a$ is, as compared to the expected return when following directly policy $\pi$.

Note that one straightforward way to obtain estimates of either $V^\pi(s)$, $Q^\pi(s, a)$ or $A^\pi(s, a)$ is to use Monte Carlo methods, i.e. defining an estimate by performing several simulations from $s$ while following policy $\pi$. In practice, we will see that this may not be possible in the case of limited data. In addition, even when it is possible, we will see that other methods should usually be preferred for computational efficiency.
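Continuing the toy MDP sketch from §3.1 (it assumes `S`, `A`, `gamma` and `step` from that snippet are in scope), the following illustrates a Monte Carlo estimate of $V^\pi(s)$ from Equation (3.1) using truncated rollouts; the horizon and the number of rollouts are arbitrary choices.

```python
def monte_carlo_v(s0, policy, n_rollouts=1000, horizon=200, seed=0):
    """Monte Carlo estimate of V^pi(s0) = E[sum_k gamma^k r_{t+k}] (Eq. 3.1),
    using truncated rollouts of the toy MDP defined above."""
    rng = np.random.default_rng(seed)
    returns = []
    for _ in range(n_rollouts):
        s, ret, discount = s0, 0.0, 1.0
        for _ in range(horizon):             # gamma^200 is negligible for gamma = 0.9
            a = policy(s, rng)
            s, r = step(s, a, rng)
            ret += discount * r
            discount *= gamma
        returns.append(ret)
    return np.mean(returns)

uniform_policy = lambda s, rng: rng.choice(A)    # a stochastic policy with pi(s, a) = 0.5
print(monte_carlo_v(0, uniform_policy))
```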

3.2 Different components to learn a policy

An RL agent includes one or more of the following components:

• a representation of a value function that provides a prediction of


how good each state or each state/action pair is,

• a direct representation of the policy π(s) or π(s, a), or

• a model of the environment (the estimated transition function and


the estimated reward function) in conjunction with a planning
algorithm.
The first two components are related to what is called model-free RL
and are discussed in Chapters 4, 5. When the latter component is used,
the algorithm is referred to as model-based RL, which is discussed in
Chapter 6. A combination of both and why using the combination can
be useful is discussed in §6.2. A schema with all possible approaches is
provided in Figure 3.3.
For most problems approaching real-world complexity, the state
space is high-dimensional (and possibly continuous). In order to learn
an estimate of the model, the value function or the policy, there are
two main advantages for RL algorithms to rely on deep learning:


Figure 3.3: General schema of the different methods for RL. The direct approach
uses a representation of either a value function or a policy to act in the environment.
The indirect approach makes use of a model of the environment.

• Neural networks are well suited for dealing with high-dimensional


sensory inputs (such as time series, frames, etc.) and, in practice,
they do not require an exponential increase of data when adding
extra dimensions to the state or action space (see Chapter 2).

• In addition, they can be trained incrementally and make use of


additional samples obtained as learning happens.

3.3 Different settings to learn a policy from data

We now describe key settings that can be tackled with RL.

3.3.1 Offline and online learning


Learning a sequential decision-making task appears in two cases: (i) in
the offline learning case where only limited data on a given environment
is available and (ii) in an online learning case where, in parallel to
learning, the agent gradually gathers experience in the environment. In
both cases, the core learning algorithms introduced in Chapters 4 to 6
are essentially the same. The specificity of the batch setting is that the
agent has to learn from limited data without the possibility of interacting
further with the environment. In that case, the idea of generalization
introduced in Chapter 7 is the main focus. In the online setting, the
learning problem is more intricate and learning without requiring a
large amount of data (sample efficiency) is not only influenced by the

capability of the learning algorithm to generalize well from the limited


experience. Indeed, the agent has the possibility to gather experience
via an exploration/exploitation strategy. In addition, it can use a replay
memory to store its experience so that it can be reprocessed at a later
time. Both the exploration and the replay memory will be discussed in Chapter 8. In both the batch and the online settings, a supplementary
consideration is also the computational efficiency, which, among other
things, depends on the efficiency of a given gradient descent step. All
these elements will be introduced with more details in the following
chapters. A general schema of the different elements that can be found
in most deep RL algorithms is provided in Figure 3.4.


Figure 3.4: General schema of deep RL methods.

3.3.2 Off-policy and on-policy learning


According to Sutton and Barto, 2017, « on-policy methods attempt to
evaluate or improve the policy that is used to make decisions, whereas

off-policy methods evaluate or improve a policy different from that


used to generate the data ». In off-policy based methods, learning is
straightforward when using trajectories that are not necessarily obtained
under the current policy, but from a different behavior policy β(s, a).
In those cases, experience replay allows reusing samples from a different
behavior policy. On the contrary, on-policy based methods usually
introduce a bias when used with a replay buffer as the trajectories
are usually not obtained solely under the current policy π. As will be
discussed in the following chapters, this makes off-policy methods sample
efficient as they are able to make use of any experience; in contrast,
on-policy methods would, if specific care is not taken, introduce a bias
when using off-policy trajectories.
4 Value-based methods for deep RL

The value-based class of algorithms aims to build a value function,


which subsequently lets us define a policy. We discuss hereafter one
of the simplest and most popular value-based algorithms, the Q-
learning algorithm (Watkins, 1989) and its variant, the fitted Q-learning,
that uses parameterized function approximators (Gordon, 1996). We
also specifically discuss the main elements of the deep Q-network
(DQN) algorithm (Mnih et al., 2015) which has achieved superhuman-
level control when playing ATARI games from the pixels by using
neural networks as function approximators. We then review various
improvements of the DQN algorithm and provide resources for further
details. At the end of this chapter and in the next chapter, we discuss the
intimate link between value-based methods and policy-based methods.

4.1 Q-learning

The basic version of Q-learning keeps a lookup table of values Q(s, a)


(Equation 3.3) with one entry for every state-action pair. In order to
learn the optimal Q-value function, the Q-learning algorithm makes use

of the Bellman equation for the Q-value function (Bellman and Dreyfus,
1962) whose unique solution is Q∗ (s, a):

Q∗ (s, a) = (BQ∗ )(s, a), (4.1)

where B is the Bellman operator mapping any function K : S × A → R


into another function S × A → R and is defined as follows:
∑ ( )
′ ′ ′ ′
(BK )(s, a) = T (s, a, s ) R(s, a, s ) + γ max

K (s , a ) . (4.2)
a ∈A
s′ ∈S

By Banach’s theorem, the fixed point of the Bellman operator B exists since it is a contraction mapping.¹ In practice, one general proof of convergence to the optimal value function is available (Watkins and Dayan, 1992) under the conditions that:

• the state-action pairs are represented discretely, and

• all actions are repeatedly sampled in all states (which ensures sufficient exploration, hence not requiring access to the transition model).

This simple setting is often inapplicable due to the high-dimensional (possibly continuous) state-action space. In that context, a parameterized value function Q(s, a; θ) is needed, where θ refers to some parameters that define the Q-values.
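To make the tabular case above concrete, here is a minimal sketch (in Python/NumPy) of the classical Q-learning update Q(s, a) ← Q(s, a) + α(r + γ max_{a′} Q(s′, a′) − Q(s, a)); the `env` interface, the ε-greedy exploration and the hyperparameter values are illustrative assumptions and not part of the original text.

```python
import numpy as np

def tabular_q_learning(env, n_states, n_actions, n_episodes=500,
                       alpha=0.1, gamma=0.99, epsilon=0.1):
    """Minimal tabular Q-learning sketch (assumes a small gym-like env)."""
    Q = np.zeros((n_states, n_actions))  # lookup table of Q-values
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy exploration/exploitation
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # Q-learning update towards the bootstrapped target
            target = r + (0.0 if done else gamma * np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```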

4.2 Fitted Q-learning

Experiences are gathered in a given dataset D in the form of tuples < s, a, r, s′ > where the state at the next time-step s′ is drawn from T(s, a, ·) and the reward r is given by R(s, a, s′). In fitted Q-learning (Gordon, 1996), the algorithm starts with some random initialization of the Q-values Q(s, a; θ_0) where θ_0 refers to the initial parameters (usually
¹ The Bellman operator is a contraction mapping because it can be shown that for any pair of bounded functions K, K′ : S × A → R, the following condition is respected: ‖B K − B K′‖_∞ ≤ γ ‖K − K′‖_∞.
such that the initial Q-values should be relatively close to 0 so as to avoid slow learning). Then, an approximation of the Q-values at the kth iteration Q(s, a; θ_k) is updated towards the target value

Y_k^Q = r + γ max_{a′∈A} Q(s′, a′; θ_k),    (4.3)

where θ_k refers to some parameters that define the Q-values at the kth iteration.
In neural fitted Q-learning (NFQ) (Riedmiller, 2005), the state can
be provided as an input to the Q-network and a different output is given
for each of the possible actions. This provides an efficient structure that
has the advantage of obtaining the computation of max_{a′∈A} Q(s′, a′; θ_k) in a single forward pass in the neural network for a given s′. The Q-values are parameterized with a neural network Q(s, a; θ_k) where the parameters θ_k are updated by stochastic gradient descent (or a variant) by minimizing the square loss:

L_DQN = ( Q(s, a; θ_k) − Y_k^Q )².    (4.4)

Thus, the Q-learning update amounts to updating the parameters:

θ_{k+1} = θ_k + α ( Y_k^Q − Q(s, a; θ_k) ) ∇_{θ_k} Q(s, a; θ_k),    (4.5)

where α is a scalar step size called the learning rate. Note that using the square loss is not arbitrary. Indeed, it ensures that Q(s, a; θ_k) should tend without bias to the expected value of the random variable Y_k^Q.² Hence, it ensures that Q(s, a; θ_k) should tend to Q∗(s, a) after many iterations, under the hypothesis that the neural network is well-suited for the task and that the experience gathered in the dataset D is sufficient (more details will be given in Chapter 7).
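As a rough illustration of Equations 4.3–4.5, the following sketch performs one stochastic gradient step on the squared loss with a small Q-network; the PyTorch architecture, the tensor shapes and all names are assumptions made for the example, not the reference implementation.

```python
import torch
import torch.nn as nn

# Q-network: state in, one Q-value per discrete action out (as in NFQ/DQN)
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.SGD(q_net.parameters(), lr=1e-3)
gamma = 0.99

def fitted_q_step(s, a, r, s_next, done):
    """One update of theta_k towards Y_k^Q = r + gamma * max_a' Q(s', a'; theta_k)."""
    with torch.no_grad():  # the target uses the current parameters, held fixed
        target = r + gamma * (1.0 - done) * q_net(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = ((q_sa - target) ** 2).mean()   # squared loss of Equation 4.4
    optimizer.zero_grad()
    loss.backward()                         # gradient step of Equation 4.5
    optimizer.step()
    return loss.item()
```

Here `s` and `s_next` would be float tensors of shape (batch, 4), `a` an int64 tensor of shape (batch,), and `r`, `done` float tensors of shape (batch,).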
When updating the weights, one also changes the target. Due to
the generalization and extrapolation abilities of neural networks, this
approach can build large errors at different places in the state-action
space.³ Therefore, the contraction mapping property of the Bellman
² The minimum of E[(Z − c)²] occurs when the constant c equals the expected value of the random variable Z.
³ Note that even fitted value iteration with linear regression can diverge (Boyan and Moore, 1995). However, this drawback does not happen when using linear function approximators that only have interpolation abilities such as kernel-based regressors (k-nearest neighbors, linear and multilinear interpolation, etc.) (Gordon, 1999) or tree-based ensemble methods (Ernst et al., 2005). However, these methods have not proved able to handle successfully high-dimensional inputs.
operator in Equation 4.2 is not enough to guarantee convergence. It is verified experimentally that these errors may propagate with this update rule and, as a consequence, convergence may be slow or even unstable (Baird, 1995; Tsitsiklis and Van Roy, 1997; Gordon, 1999; Riedmiller, 2005). Another related damaging side-effect of using function approximators is the fact that Q-values tend to be overestimated due to the max operator (Van Hasselt et al., 2016). Because of the instabilities and the risk of overestimation, specific care has to be taken to ensure proper learning.

4.3 Deep Q-networks

Leveraging ideas from NFQ, the deep Q-network (DQN) algorithm introduced by Mnih et al. (2015) is able to obtain strong performance in an online setting for a variety of ATARI games, directly by learning from the pixels. It uses two heuristics to limit the instabilities:

• The target Q-network in Equation 4.3 is replaced by Q(s′, a′; θ_k⁻), where its parameters θ_k⁻ are updated only every C ∈ N iterations with the following assignment: θ_k⁻ = θ_k. This prevents the instabilities from propagating quickly and it reduces the risk of divergence as the target values Y_k^Q are kept fixed for C iterations. The idea of target networks can be seen as an instantiation of fitted Q-learning, where each period between target network updates corresponds to a single fitted Q-iteration.

• In an online setting, the replay memory (Lin, 1992) keeps all information for the last N_replay ∈ N time steps, where the experience is collected by following an ε-greedy policy.⁴ The updates are then made on a set of tuples < s, a, r, s′ > (called mini-batch) selected randomly within the replay memory. This

⁴ It takes a random action with probability ε and follows the policy given by argmax_{a∈A} Q(s, a; θ_k) with probability 1 − ε.
technique allows for updates that cover a wide range of the state-action space. In addition, one mini-batch update has less variance
compared to a single tuple update. Consequently, it provides the
possibility to make a larger update of the parameters, while having
an efficient parallelization of the algorithm.

A sketch of the algorithm is given in Figure 4.1.


In addition to the target Q-network and the replay memory, DQN
uses other important heuristics. To keep the target values in a reasonable
scale and to ensure proper learning in practice, rewards are clipped
between -1 and +1. Clipping the rewards limits the scale of the error
derivatives and makes it easier to use the same learning rate across
multiple games (however, it introduces a bias). In games where the
player has multiple lives, one trick is also to associate a terminal state
to the loss of a life such that the agent avoids these terminal states (in
a terminal state the discount factor is set to 0).
In DQN, many deep learning specific techniques are also used. In
particular, a preprocessing step of the inputs is used to reduce the
input dimensionality, to normalize inputs (it scales pixels value into
[-1,1]) and to deal with some specificities of the task. In addition,
convolutional layers are used for the first layers of the neural network
function approximator and the optimization is performed using a variant
of stochastic gradient descent called RMSprop (Tieleman, 2012).
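A minimal sketch of how the replay memory and the target network could be wired together is given below; `q_target_net` is assumed to be a callable that maps a batch of states to an array of Q-values of shape (batch, n_actions), and all names, sizes and the ε-greedy data collection are assumptions rather than the reference implementation of Mnih et al. (2015).

```python
import random
from collections import deque

import numpy as np

replay_memory = deque(maxlen=100_000)   # keeps the last N_replay transitions
C, gamma, batch_size = 10_000, 0.99, 32

def store(transition):
    """transition = (s, a, r, s_next, done), collected with an eps-greedy policy."""
    replay_memory.append(transition)

def dqn_targets(q_target_net, batch):
    """Y^Q = clip(r, -1, 1) + gamma * max_a' Q(s', a'; theta_k^-) for a mini-batch."""
    s, a, r, s_next, done = map(np.array, zip(*batch))
    r = np.clip(r, -1.0, 1.0)                    # reward clipping heuristic
    q_next = q_target_net(s_next)                # shape: (batch_size, n_actions)
    return r + gamma * (1.0 - done) * q_next.max(axis=1)
```

A mini-batch would be drawn with `random.sample(replay_memory, batch_size)`, and every C updates the online parameters would be copied into the target network (e.g., `q_target_net.load_state_dict(q_net.state_dict())` in a PyTorch-style implementation).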

4.4 Double DQN

The max operation in Q-learning (Equations 4.2, 4.3) uses the same
values both to select and to evaluate an action. This makes it more likely
to select overestimated values in case of inaccuracies or noise, resulting
in overoptimistic value estimates. Therefore, the DQN algorithm induces
an upward bias. The double estimator method uses two estimates for
each variable, which allows for the selection of an estimator and its value
to be uncoupled (Hasselt, 2010). Thus, regardless of whether errors
in the estimated Q-values are due to stochasticity in the environment,
function approximation, non-stationarity, or any other source, this
allows for the removal of the positive bias in estimating the action
values.

[Figure 4.1: Sketch of the DQN algorithm. Q(s, a; θ_k) is initialized to random values (close to 0) everywhere in its domain and the replay memory is initially empty; the target Q-network parameters θ_k⁻ are only updated every C iterations with the Q-network parameters θ_k and are held fixed between updates; the update uses a mini-batch (e.g., 32 elements) of tuples < s, a, r, s′ > taken randomly in the replay memory along with the corresponding mini-batch of target values for the tuples.]

In Double DQN, or DDQN (Van Hasselt et al., 2016), the target value Y_k^Q is replaced by

Y_k^DDQN = r + γ Q(s′, argmax_{a∈A} Q(s′, a; θ_k); θ_k⁻),    (4.6)

which leads to less overestimation of the Q-learning values, as well as improved stability, hence improved performance. As compared to DQN, the target network with weights θ_k⁻ is used for the evaluation of the current greedy action. Note that the policy is still chosen according to the values obtained by the current weights θ_k.
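The difference between the two targets can be made explicit with a short NumPy sketch; `q_values_next` and `q_values_target_next` are assumed to be precomputed arrays of shape (batch, n_actions) obtained from the online network θ_k and the target network θ_k⁻ at the next states.

```python
import numpy as np

def dqn_target(r, gamma, q_values_target_next, done):
    # Y^Q: the same (target) network both selects and evaluates the action
    return r + gamma * (1.0 - done) * q_values_target_next.max(axis=1)

def double_dqn_target(r, gamma, q_values_next, q_values_target_next, done):
    # Y^DDQN: the online network selects the action, the target network evaluates it
    a_star = q_values_next.argmax(axis=1)
    rows = np.arange(len(a_star))
    return r + gamma * (1.0 - done) * q_values_target_next[rows, a_star]
```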

4.5 Dueling network architecture

In (Wang et al., 2015), the neural network architecture decouples the value and advantage function A^π(s, a) (Equation 3.7), which leads to improved performance. The Q-value function is given by

Q(s, a; θ^(1), θ^(2), θ^(3)) = V(s; θ^(1), θ^(3)) + ( A(s, a; θ^(1), θ^(2)) − max_{a′∈A} A(s, a′; θ^(1), θ^(2)) ).    (4.7)

Now, for a∗ = argmax_{a′∈A} Q(s, a′; θ^(1), θ^(2), θ^(3)), we obtain Q(s, a∗; θ^(1), θ^(2), θ^(3)) = V(s; θ^(1), θ^(3)). As illustrated in Figure 4.2,
the stream V (s; θ(1), θ (3) ) provides an estimate of the value function,
while the other stream produces an estimate of the advantage function.
The learning update is done as in DQN and it is only the structure of
the neural network that is modified.

[Figure 4.2: Illustration of the dueling network architecture with the two streams that separately estimate the value V(s) and the advantages A(s, a). The boxes represent layers of a neural network and the grey output implements Equation 4.7 to combine V(s) and A(s, a).]

In fact, even though it loses the original semantics of V and A, a slightly different approach is preferred in practice because it increases the stability of the optimization:

Q(s, a; θ^(1), θ^(2), θ^(3)) = V(s; θ^(1), θ^(3)) + ( A(s, a; θ^(1), θ^(2)) − (1/|A|) Σ_{a′∈A} A(s, a′; θ^(1), θ^(2)) ).    (4.8)

In that case, the advantages only need to change as fast as the mean,
which appears to work better in practice (Wang et al., 2015).
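A minimal PyTorch sketch of the aggregation of Equation 4.8 (the mean-subtracted variant) is given below; the layer sizes and names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    def __init__(self, n_inputs=4, n_actions=2, n_hidden=64):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(n_inputs, n_hidden), nn.ReLU())  # theta^(1)
        self.value = nn.Linear(n_hidden, 1)              # V(s; theta^(1), theta^(3))
        self.advantage = nn.Linear(n_hidden, n_actions)  # A(s, a; theta^(1), theta^(2))

    def forward(self, s):
        h = self.shared(s)
        v = self.value(h)                  # shape (batch, 1)
        adv = self.advantage(h)            # shape (batch, n_actions)
        # Equation 4.8: subtract the mean advantage rather than the max
        return v + adv - adv.mean(dim=1, keepdim=True)
```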
4.6 Distributional DQN

The approaches described so far in this chapter all directly approximate the expected return in a value function. Another approach is to aim for a richer representation through a value distribution, i.e. the distribution
a richer representation through a value distribution, i.e. the distribution
of possible cumulative returns (Jaquette et al., 1973; Morimura et al.,
2010). This value distribution provides more complete information of
the intrinsic randomness of the rewards and transitions of the agent
within its environment (note that it is not a measure of the agent’s
uncertainty about the environment).
The value distribution Z^π is a mapping from state-action pairs to distributions of returns when following policy π. It has an expectation equal to Q^π:

Q^π(s, a) = E[Z^π(s, a)].

This random return is also described by a recursive equation, but one of a distributional nature:

Z^π(s, a) = R(s, a, S′) + γ Z^π(S′, A′),    (4.9)

where we use capital letters to emphasize the random nature of the next state-action pair (S′, A′) and A′ ∼ π(·|S′). The distributional
Bellman equation states that the distribution of Z is characterized by
the interaction of three random variables: the reward R(s, a, S ′ ), the
next state-action (S′ , A ′ ), and its random return Z π(S′ , A ′ ).
It has been shown that such a distributional Bellman equation can
be used in practice, with deep learning as the function approximator
(Bellemare et al., 2017; Dabney et al., 2017; Rowland et al., 2018). This
approach has the following advantages:

• It is possible to implement risk-aware behavior (see e.g., Morimura et al., 2010).

• It leads to more performant learning in practice. This may appear surprising since both DQN and the distributional DQN aim to maximize the expected return (as illustrated in Figure 4.3). One of the main elements is that the distributional perspective naturally provides a richer set of training signals than a scalar value function
Q(s, a). These training signals that are not a priori necessary
for optimizing the expected return are known as auxiliary tasks
(Jaderberg et al., 2016) and lead to an improved learning (this is
discussed in §7.2.1).
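As a small numerical illustration of the value-distribution idea (and not the full categorical algorithm of Bellemare et al., 2017, which additionally projects the shifted distribution back onto the fixed support), the sketch below represents Z(s, a) by a fixed set of atoms with learned probabilities, recovers Q(s, a) = E[Z(s, a)], and shifts the support as in the distributional Bellman equation; all names and the support bounds are assumptions.

```python
import numpy as np

n_atoms, v_min, v_max = 51, -10.0, 10.0
support = np.linspace(v_min, v_max, n_atoms)      # fixed return atoms z_i

def expected_q(probs):
    """Q(s, a) = E[Z(s, a)] for probs of shape (n_actions, n_atoms)."""
    return probs @ support

def shifted_support(r, gamma):
    """Support of r + gamma * Z(S', A'), to be projected back onto `support`
    in a full categorical (C51-style) implementation."""
    return np.clip(r + gamma * support, v_min, v_max)
```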

[Figure 4.3: (a) Example MDP and (b) sketch (in an idealized version) of the estimates of the resulting value distributions Ẑ^π1 and Ẑ^π2 as well as the estimates of the Q-values Q̂^π1 ≈ Q̂^π2. For the two policies illustrated on Fig. (a), the illustration on Fig. (b) gives the value distribution Z^π(s, a) as compared to the expected value Q^π(s, a). One can see that π1 moves with certainty to an absorbing state with reward at every step R_max/5, while π2 moves with probability 0.2 and 0.8 to absorbing states with respectively rewards at every step R_max and 0. From the pair (s, a), the policies π1 and π2 have the same expected return but different value distributions.]

4.7 Multi-step learning

In DQN, the target value used to update the Q-network parameters (given in Equation 4.3) is estimated as the sum of the immediate reward
and a contribution of the following steps in the return. That contribution
is estimated based on its own value estimate at the next time-step. For
that reason, the learning algorithm is said to bootstrap as it recursively
uses its own value estimates (Sutton, 1988).
This method of estimating a target value is not the only possibility.
Non-bootstrapping methods learn directly from returns (Monte Carlo)
and an intermediate solution is to use a multi-step target (Sutton, 1988; Watkins, 1989; Peng and Williams, 1994; Singh and Sutton, 1996). Such a variant in the case of DQN can be obtained by using the n-step target value given by:

Y_k^{Q,n} = Σ_{t=0}^{n−1} γ^t r_t + γ^n max_{a′∈A} Q(s_n, a′; θ_k)    (4.10)

where (s_0, a_0, r_0, · · · , s_{n−1}, a_{n−1}, r_{n−1}, s_n) is any trajectory of n + 1 time steps with s = s_0 and a = a_0. A combination of different multi-step targets can also be used:

Y_k^{Q,n} = Σ_{i=0}^{n−1} λ_i ( Σ_{t=0}^{i} γ^t r_t + γ^{i+1} max_{a′∈A} Q(s_{i+1}, a′; θ_k) )    (4.11)

with Σ_{i=0}^{n−1} λ_i = 1. In the method called TD(λ) (Sutton, 1988), n → ∞ and the λ_i follow a geometric law: λ_i ∝ λ^i where 0 ≤ λ ≤ 1.
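A minimal NumPy sketch of the n-step target of Equation 4.10, assuming the n rewards of a trajectory and the Q-values at the n-th state are already available; the names are illustrative.

```python
import numpy as np

def n_step_target(rewards, q_values_sn, gamma):
    """Y^{Q,n} = sum_{t=0}^{n-1} gamma^t r_t + gamma^n max_a' Q(s_n, a'; theta_k).

    rewards:      array of shape (n,) with r_0, ..., r_{n-1}
    q_values_sn:  array of shape (n_actions,) with Q(s_n, .; theta_k)
    """
    n = len(rewards)
    discounts = gamma ** np.arange(n)
    return np.dot(discounts, rewards) + gamma ** n * np.max(q_values_sn)
```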

To bootstrap or not to bootstrap? Bootstrapping has both advantages and disadvantages. On the negative side, pure bootstrapping methods (such as in DQN) are prone to instabilities when combined
with function approximation because they make recursive use of their
own value estimate at the next time-step. On the contrary, methods
such as n-step Q-learning rely less on their own value estimate because
the estimate used is decayed by γ n for the n th step backup. In addition,
methods that rely less on bootstrapping can propagate information
more quickly from delayed rewards as they learn directly from returns
(Sutton, 1996). Hence they might be more computationally efficient.
Bootstrapping also has advantages. The main advantage is that
using value bootstrap allows learning from off-policy samples. Indeed,
methods that do not use pure bootstrapping, such as n-step Q-learning
with n > 1 or T D(λ), are in principle on-policy based methods that
would introduce a bias when used with trajectories that are not obtained
solely under the behavior policy μ (e.g., stored in a replay buffer).
The conditions required to learn efficiently and safely with eligibility
traces from off-policy experience are provided by Munos et al., 2016;
Harutyunyan et al., 2016. In the control setting, the retrace operator
(Munos et al., 2016) considers a sequence of target policies π that depend
on the sequence of Q-functions (such as ε-greedy policies), and seek to approximate Q∗ (if π is greedy or becomes increasingly greedy w.r.t. the Q estimates). It leads to the following target:

Y = Q(s, a) + Σ_{t≥0} γ^t ( Π_{s=1}^{t} c_s ) ( r_t + γ E_π Q(s_{t+1}, a′) − Q(s_t, a_t) )    (4.12)

where c_s = λ min(1, π(s, a)/μ(s, a)) with 0 ≤ λ ≤ 1 and μ is the behavior policy (estimated from observed samples). This way of updating the Q-network has guaranteed convergence, does not suffer from a high variance and it does not cut the traces unnecessarily when π and μ are close. Nonetheless, one can note that estimating the target is more expensive to compute as compared to the one-step target (such as in DQN) because the Q-value function has to be estimated on more states.

4.8 Combination of all DQN improvements and variants of DQN

The original DQN algorithm can combine the different variants discussed in §4.4 to §4.7 (as well as some discussed in Chapter 8.1), a combination that has been studied by Hessel et al. (2017). Their experiments show that the combination of all the previously mentioned extensions to DQN provides state-of-the-art performance on the Atari 2600 benchmarks, both in terms of sample efficiency and final performance. Overall, a large majority of Atari games can be solved such that the deep RL agents surpass human-level performance.
Some limitations remain with DQN-based approaches. Among others,
these types of algorithms are not well-suited to deal with large and/or
continuous action spaces. In addition, they cannot explicitly learn
stochastic policies. Modifications that address these limitations will be
discussed in the following Chapter 5, where we discuss policy-based
approaches. Actually, the next section will also show that value-based
and policy-based approaches can be seen as two facets of the same
model-free approach. Therefore, the limitations of discrete action spaces
and deterministic policies are only related to DQN.
One can also note that value-based or policy-based approaches do
not make use of any model of the environment, which limits their sample
efficiency. Ways to combine model-free and model-based approaches will be discussed in Chapter 6.
5 Policy gradient methods for deep RL

This section focuses on a particular family of reinforcement learning algorithms that use policy gradient methods. These methods optimize a performance objective (typically the expected cumulative reward) by finding a good policy (e.g., a neural network parameterized policy) thanks to variants of stochastic gradient ascent with respect to the policy parameters. Note that policy gradient methods belong to a broader
parameters. Note that policy gradient methods belong to a broader
class of policy-based methods that includes, among others, evolution
strategies. These methods use a learning signal derived from sampling
instantiations of policy parameters and the set of policies is developed
towards policies that achieve better returns (e.g., Salimans et al., 2017).
In this chapter, we introduce the stochastic and deterministic
gradient theorems that provide gradients on the policy parameters in
order to optimize the performance objective. Then, we present different
RL algorithms that make use of these theorems.

5.1 Stochastic Policy Gradient

The expected return of a stochastic policy π starting from a given state s_0 from Equation 3.1 can be written as (Sutton et al., 2000):

V^π(s_0) = ∫_S ρ^π(s) ∫_A π(s, a) R′(s, a) da ds,    (5.1)

where R′(s, a) = Σ_{s′∈S} T(s, a, s′) R(s, a, s′) and ρ^π(s) is the discounted state distribution defined as

ρ^π(s) = Σ_{t=0}^{∞} γ^t Pr{s_t = s | s_0, π}.

For a differentiable policy π_w, the fundamental result underlying these algorithms is the policy gradient theorem (Sutton et al., 2000):

∇_w V^{π_w}(s_0) = ∫_S ρ^{π_w}(s) ∫_A ∇_w π_w(s, a) Q^{π_w}(s, a) da ds.    (5.2)

This result allows us to adapt the policy parameters w: Δw ∝ ∇_w V^{π_w}(s_0) from experience. This result is particularly interesting since the policy gradient does not depend on the gradient of the state distribution (even though one might have expected it to). The simplest way to derive the policy gradient estimator (i.e., estimating ∇_w V^{π_w}(s_0) from experience) is to use a score function gradient estimator, commonly known as the REINFORCE algorithm (Williams, 1992). The likelihood ratio trick can be exploited as follows to derive a general method of estimating gradients from expectations:

∇_w π_w(s, a) = π_w(s, a) ( ∇_w π_w(s, a) / π_w(s, a) )
             = π_w(s, a) ∇_w log(π_w(s, a)).    (5.3)

Considering Equation 5.3, it follows that

∇_w V^{π_w}(s_0) = E_{s∼ρ^{π_w}, a∼π_w} [ ∇_w (log π_w(s, a)) Q^{π_w}(s, a) ].    (5.4)
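A minimal PyTorch sketch of the score-function (REINFORCE) estimator of Equation 5.4 for a small discrete-action policy is given below; the network, the use of Monte-Carlo returns as the estimate of Q^{π_w}(s, a), and all names are assumptions.

```python
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def reinforce_step(states, actions, returns):
    """One gradient ascent step on E[ log pi_w(s, a) * Q^{pi_w}(s, a) ],
    with Monte-Carlo returns standing in for Q^{pi_w}(s, a)."""
    logits = policy(states)                                   # (T, n_actions)
    log_probs = torch.log_softmax(logits, dim=1)
    log_pi_sa = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = -(log_pi_sa * returns).mean()   # minus sign: optimizers minimize
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Here `states` would be a float tensor of shape (T, 4), `actions` an int64 tensor and `returns` a float tensor of shape (T,).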
Note that, in practice, most policy gradient methods effectively use undiscounted state distributions, without hurting their performance (Thomas, 2014).
So far, we have shown that policy gradient methods should include
a policy evaluation followed by a policy improvement. On the one
hand, the policy evaluation estimates Qπw . On the other hand, the
policy improvement takes a gradient step to optimize the policy πw (s, a)
with respect to the value function estimation. Intuitively, the policy
improvement step increases the probability of the actions proportionally
to their expected return.
The question that remains is how the agent can perform the policy
evaluation step, i.e., how to obtain an estimate of Qπw (s, a). The
simplest approach to estimating gradients is to replace the Q function
estimator with a cumulative return from entire trajectories. In the Monte-
Carlo policy gradient, we estimate the Qπw (s, a) from rollouts on the
environment while following policy πw . The Monte-Carlo estimator is an
unbiased well-behaved estimate when used in conjunction with the back-
propagation of a neural network policy, as it estimates returns until the
end of the trajectories (without instabilities induced by bootstrapping).
However, the main drawback is that the estimate requires on-policy
rollouts and can exhibit high variance. Several rollouts are typically
needed to obtain a good estimate of the return. A more efficient approach
is to instead use an estimate of the return given by a value-based
approach, as in actor-critic methods discussed in §5.3.
We make two additional remarks. First, to prevent the policy from
becoming deterministic, it is common to add an entropy regularizer
to the gradient. With this regularizer, the learnt policy can remain
stochastic. This ensures that the policy keeps exploring.
Second, instead of using the value function Qπw in Eq. 5.4, an
advantage value function A πw can also be used. While Qπw (s, a)
summarizes the performance of each action for a given state under policy
πw , the advantage function A πw (s, a) provides a measure of comparison
for each action to the expected return at the state s, given by V πw (s).
Using A^{π_w}(s, a) = Q^{π_w}(s, a) − V^{π_w}(s) usually gives lower magnitudes than Q^{π_w}(s, a). This helps reduce the variance of the gradient estimator ∇_w V^{π_w}(s_0) in the policy improvement step, while not modifying the
expectation.¹ In other words, the value function V^{π_w}(s) can be seen as a baseline or control variate for the gradient estimator. When updating the neural network that fits the policy, using such a baseline allows for improved numerical efficiency – i.e. reaching a given performance with fewer updates – because the learning rate can be bigger.
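A short sketch of this baseline idea, assuming Monte-Carlo returns and state-value predictions are already available; the standardization step is a common practical choice, not something prescribed by the text.

```python
import numpy as np

def advantages_from_baseline(returns, values):
    """A_hat(s_t, a_t) = G_t - V(s_t): same expected gradient, lower variance."""
    adv = np.asarray(returns) - np.asarray(values)
    # optional standardization, commonly used in practice to stabilize updates
    return (adv - adv.mean()) / (adv.std() + 1e-8)
```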

5.2 Deterministic Policy Gradient

The policy gradient methods may be extended to deterministic policies. The Neural Fitted Q Iteration with Continuous Actions (NFQCA) (Hafner and Riedmiller, 2011) and the Deep Deterministic Policy Gradient (DDPG) (Silver et al., 2014; Lillicrap et al., 2015) algorithms introduce the direct representation of a policy in such a way that it can extend the NFQ and DQN algorithms to overcome the restriction of discrete actions.

Let us denote by π(s) the deterministic policy: π(s) : S → A. In discrete action spaces, a direct approach is to build the policy iteratively with:

π_{k+1}(s) = argmax_{a∈A} Q^{π_k}(s, a),    (5.5)

where π_k is the policy at the kth iteration. In continuous action spaces, a greedy policy improvement becomes problematic, requiring a global maximisation at every step. Instead, let us denote by π_w(s) a differentiable deterministic policy. In that case, a simple and computationally attractive alternative is to move the policy in the direction of the gradient of Q, which leads to the Deep Deterministic Policy Gradient (DDPG) algorithm (Lillicrap et al., 2015):

∇_w V^{π_w}(s_0) = E_{s∼ρ^{π_w}} [ ∇_w (π_w) ∇_a (Q^{π_w}(s, a)) |_{a=π_w(s)} ].    (5.6)

This equation implies relying on ∇_a (Q^{π_w}(s, a)) (in addition to ∇_w π_w), which usually requires using actor-critic methods (see §5.3).
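A minimal PyTorch sketch of the actor update implied by Equation 5.6 is given below: the actor is improved by ascending the critic's gradient with respect to the action, evaluated at a = π_w(s). The two networks, their sizes and all names are assumptions.

```python
import torch
import torch.nn as nn

state_dim, action_dim = 3, 1
actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                      nn.Linear(64, action_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)

def ddpg_actor_step(states):
    """Gradient ascent on E_s[ Q(s, pi_w(s)) ] with respect to the actor parameters w."""
    actions = actor(states)
    q = critic(torch.cat([states, actions], dim=1))
    loss = -q.mean()                 # minimizing -Q == ascending the policy gradient
    actor_opt.zero_grad()
    loss.backward()                  # backpropagates grad_a Q through the actor
    actor_opt.step()
    return loss.item()
```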

¹ Indeed, subtracting a baseline that only depends on s from Q^{π_w}(s, a) in Eq. 5.2 does not change the gradient estimator because ∀s, ∫_A ∇_w π_w(s, a) da = 0.
5.3 Actor-Critic Methods

As we have seen in §5.1 and §5.2, a policy represented by a neural network can be updated by gradient ascent for both the deterministic
and the stochastic case. In both cases, the policy gradient typically
requires an estimate of a value function for the current policy. One
common approach is to use an actor-critic architecture that consists of
two parts: an actor and a critic (Konda and Tsitsiklis, 2000). The actor
refers to the policy and the critic to the estimate of a value function (e.g.,
the Q-value function). In deep RL, both the actor and the critic can be
represented by non-linear neural network function approximators (Mnih
et al., 2016). The actor uses gradients derived from the policy gradient
theorem and adjusts the policy parameters w. The critic, parameterized
by θ, estimates the approximate value function for the current policy π:
Q(s, a; θ) ≈ Qπ(s, a).

The critic

From a (set of) tuples < s, a, r, s′ >, possibly taken from a replay memory, the simplest off-policy approach to estimating the critic is to use a pure bootstrapping algorithm TD(0) where, at every iteration, the current value Q(s, a; θ) is updated towards a target value:

Y_k^Q = r + γ Q(s′, a = π(s′); θ)    (5.7)
This approach has the advantage of being simple, yet it is not
computationally efficient as it uses a pure bootstrapping technique that
is prone to instabilities and has a slow reward propagation backwards
in time (Sutton, 1996). This is similar to the elements discussed in the
value-based methods in §4.7.
The ideal is to have an architecture that is

• sample-efficient such that it should be able to make use of both off-policy and on-policy trajectories (i.e., it should be able to use a replay memory), and

• computationally efficient: it should be able to profit from the stability and the fast reward propagation of on-policy methods for samples collected from near on-policy behavior policies.
There are many methods that combine on- and off-policy data for
policy evaluation (Precup, 2000). The algorithm Retrace(λ) (Munos
et al., 2016) has the advantages that (i) it can make use of samples
collected from any behavior policy without introducing a bias and
(ii) it is efficient as it makes the best use of samples collected from
near on-policy behavior policies. That approach was used in actor-critic
architectures described by Wang et al. (2016b) and Gruslys et al. (2017).
These architectures are sample-efficient thanks to the use of a replay
memory, and computationally efficient since they use multi-step returns
which improves the stability of learning and increases the speed of
reward propagation backwards in time.

The actor

From Equation 5.4, the off-policy gradient in the policy improvement phase for the stochastic case is given as:

∇_w V^{π_w}(s_0) = E_{s∼ρ^{π_β}, a∼π_β} [ ∇_w (log π_w(s, a)) Q^{π_w}(s, a) ],    (5.8)

where β is a behavior policy generally different from π, which makes the gradient generally biased. This approach usually behaves properly in practice but the use of a biased policy gradient estimator makes the analysis of its convergence difficult without the GLIE assumption (Munos et al., 2016; Gruslys et al., 2017).²
In the case of actor-critic methods, an approach to perform the policy
gradient on-policy without experience replay has been investigated with
the use of asynchronous methods, where multiple agents are executed
in parallel and the actor-learners are trained asynchronously (Mnih
et al., 2016). The parallelization of agents also ensures that each agent
experiences different parts of the environment at a given time step.
In that case, n-step returns can be used without introducing a bias.
This simple idea can be applied to any learning algorithm that requires
² Greedy in the Limit with Infinite Exploration (GLIE) means that the behavior policies are required to become greedy (no exploration) in the limit of an online learning setting where the agent has gathered an infinite amount of experience. It is required that «(i) each action is executed infinitely often in every state that is visited infinitely often, and (ii) in the limit, the learning policy is greedy with respect to the Q-value function with probability 1» (Singh et al., 2000).
on-policy data and it removes the need to maintain a replay buffer. However, this asynchronous trick is not sample efficient.
An alternative is to combine off-policy and on-policy samples to
trade-off both the sample efficiency of off-policy methods and the
stability of on-policy gradient estimates. For instance, Q-Prop (Gu
et al., 2017b) uses a Monte Carlo on-policy gradient estimator, while
reducing the variance of the gradient estimator by using an off-policy
critic as a control variate. One limitation of Q-Prop is that it requires
using on-policy samples for estimating the policy gradient.

5.4 Natural Policy Gradients

Natural policy gradients are inspired by the idea of natural gradients for the updates of the policy. Natural gradients can be traced back to the work of Amari, 1998 and have later been adapted to reinforcement learning (Kakade, 2001).
Natural policy gradient methods use the steepest direction given by the Fisher information metric, which uses the manifold of the objective function. In the simplest form of steepest ascent for an objective function J(w), the update is of the form Δw ∝ ∇_w J(w). In other words, the update follows the direction that maximizes (J(w + Δw) − J(w)) under a constraint on ‖Δw‖_2. Under the hypothesis that the constraint on Δw is defined with another metric than L_2, the first-order solution to the constrained optimization problem typically has the form Δw ∝ B⁻¹ ∇_w J(w) where B is an n_w × n_w matrix. In natural gradients, the norm uses the Fisher information metric, given by a local quadratic approximation to the KL divergence D_KL(π_w || π_{w+Δw}). The natural gradient ascent for improving the policy π_w is given by

Δw ∝ F_w⁻¹ ∇_w V^{π_w}(·),    (5.9)

where F_w is the Fisher information matrix given by

F_w = E_{π_w}[ ∇_w log π_w(s, ·) (∇_w log π_w(s, ·))^T ].    (5.10)

Policy gradients following ∇ w V πw (·) are often slow because they are
prone to getting stuck in local plateaus. Natural gradients, however, do
not follow the usual steepest direction in the parameter space, but the
steepest direction with respect to the Fisher metric. Note that, as the
angle between natural and ordinary gradient is never larger than ninety
degrees, convergence is also guaranteed when using natural gradients.
The caveat with natural gradients is that, in the case of neural
networks and their large number of parameters, it is usually impractical
to compute, invert, and store the Fisher information matrix (Schulman
et al., 2015). This is the reason why natural policy gradients are usually
not used in practice for deep RL; however alternatives inspired by this
idea have been found and they are discussed in the following section.
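For intuition, a small NumPy sketch of a natural gradient step (Equations 5.9–5.10) in a low-dimensional parameter space is shown below: the Fisher matrix is estimated from sampled score vectors ∇_w log π_w(s, a) and the update solves F_w Δw ∝ ∇_w V. For neural networks this is exactly the computation that becomes impractical, which motivates the trust-region methods of the next section; the damping term and all names are assumptions.

```python
import numpy as np

def natural_gradient_step(score_samples, vanilla_grad, step_size=0.01, damping=1e-3):
    """score_samples: array (N, n_w) of grad_w log pi_w(s, a) over sampled (s, a).
    vanilla_grad:  array (n_w,) estimating grad_w V^{pi_w}."""
    n_w = score_samples.shape[1]
    fisher = score_samples.T @ score_samples / len(score_samples)  # Eq. 5.10 estimate
    fisher += damping * np.eye(n_w)            # damping keeps the solve well-posed
    delta_w = np.linalg.solve(fisher, vanilla_grad)   # Eq. 5.9: F_w^-1 grad
    return step_size * delta_w
```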

5.5 Trust Region Optimization

As a modification to the natural gradient method, policy optimization methods based on a trust region aim at improving the policy while changing it in a controlled way. These constraint-based policy optimization methods focus on restricting the changes in a policy using the KL divergence between the action distributions. By bounding the size of the policy update, trust region methods also bound the changes in state distributions, guaranteeing improvements in policy.
TRPO (Schulman et al., 2015) uses constrained updates and advantage function estimation to perform the update, resulting in the reformulated optimization given by

max_{Δw} E_{s∼ρ^{π_w}, a∼π_w} [ (π_{w+Δw}(s, a) / π_w(s, a)) A^{π_w}(s, a) ]    (5.11)

subject to E[ D_KL(π_w(s, ·) || π_{w+Δw}(s, ·)) ] ≤ δ, where δ ∈ R is a hyperparameter. From empirical data, TRPO uses a conjugate gradient with KL constraint to optimize the objective function.
Proximal Policy Optimization (PPO) (Schulman et al., 2017b) is a variant of the TRPO algorithm, which formulates the constraint as a penalty or a clipping objective, instead of using the KL constraint. Unlike TRPO, PPO considers modifying the objective function to penalize changes to the policy that move r_t(w) = π_{w+Δw}(s, a) / π_w(s, a) away from 1. The clipping objective that PPO maximizes is given by

E_{s∼ρ^{π_w}, a∼π_w} [ min( r_t(w) A^{π_w}(s, a), clip(r_t(w), 1 − ε, 1 + ε) A^{π_w}(s, a) ) ]    (5.12)
where ε ∈ R is a hyperparameter. This objective function clips the probability ratio to constrain the changes of r_t in the interval [1 − ε, 1 + ε].
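A minimal PyTorch sketch of the clipped surrogate of Equation 5.12 is given below; `log_probs_old` are the log-probabilities of the actions under the policy that collected the data, `advantages` is an estimate of A^{π_w}(s, a), and all names are assumptions.

```python
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, eps=0.2):
    """Negative of the PPO clipping objective (to be minimized)."""
    ratio = torch.exp(log_probs_new - log_probs_old)      # r_t(w)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```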

5.6 Combining policy gradient and Q-learning

Policy gradient is an efficient technique for improving a policy in a reinforcement learning setting. As we have seen, this typically requires an estimate of a value function for the current policy and a sample efficient approach is to use an actor-critic architecture that can work with off-policy data.
Unlike the methods based on DQN discussed in Chapter 4, these algorithms have the following properties:

• They are able to work with continuous action spaces. This is particularly interesting in applications such as robotics, where forces and torques can take a continuum of values.

• They can represent stochastic policies, which is useful for building policies that can explicitly explore. This is also useful in settings where the optimal policy is a stochastic policy (e.g., in a multi-agent setting where the Nash equilibrium is a stochastic policy).

However, another approach is to combine policy gradient methods directly with off-policy Q-learning (O’Donoghue et al., 2016). In some specific settings, depending on the loss function and the entropy regularization used, value-based methods and policy-based methods are equivalent (Fox et al., 2015; O’Donoghue et al., 2016; Haarnoja et al., 2017; Schulman et al., 2017a). For instance, when adding an entropy regularization, Eq. 5.4 can be written as

∇_w V^{π_w}(s_0) = E_{s,a} [ ∇_w (log π_w(s, a)) Q^{π_w}(s, a) ] + α E_s [ ∇_w H^{π_w}(s) ],    (5.13)

where H^π(s) = −Σ_a π(s, a) log π(s, a). From this, one can note that an optimum is satisfied by the following policy: π_w(s, a) = exp(A^{π_w}(s, a)/α − H^{π_w}(s)). Therefore, we can use the policy to derive an estimate of the advantage function: Ã^{π_w}(s, a) = α( log π_w(s, a) + H^{π_w}(s) ).
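As a small numerical illustration of this correspondence (under the assumption of a single state with toy advantage values), the sketch below builds the entropy-regularized policy π_w(s, a) ∝ exp(A(s, a)/α) as a softmax over advantages and then evaluates the estimate α(log π_w(s, a) + H^π(s)) from the text.

```python
import numpy as np

alpha = 0.5
advantages = np.array([1.0, 0.2, -0.5])        # toy A(s, .) for one state

# pi(s, a) = exp(A(s, a)/alpha - H(s)) is a softmax over A(s, .)/alpha
logits = advantages / alpha
pi = np.exp(logits - logits.max())
pi /= pi.sum()

entropy = -np.sum(pi * np.log(pi))             # H^pi(s)
# the text's A-tilde: equals A(s, .) up to a state-dependent constant shift
adv_estimate = alpha * (np.log(pi) + entropy)
```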
We can thus think of all model-free methods as different facets of the same approach.
One remaining limitation is that both value-based and policy-based
methods are model-free and they do not make use of any model of the
environment. The next chapter describes algorithms with a model-based
approach.
6 Model-based methods for deep RL

In Chapters 4 and 5, we have discussed the model-free approach, which relies either on a value-based or a policy-based method. In this chapter,
we introduce the model-based approach that relies on a model of the
environment (dynamics and reward function) in conjunction with a
planning algorithm. In §6.2, the respective strengths of the model-based
versus the model-free approaches are discussed, along with how the two
approaches can be integrated.

6.1 Pure model-based methods

A model of the environment is either explicitly given (e.g., in the game of Go for which all the rules are known a priori) or learned from
experience. To learn the model, yet again function approximators bring
significant advantages in high-dimensional (possibly partially observable)
environments (Oh et al., 2015; Mathieu et al., 2015; Finn et al., 2016a;
Kalchbrenner et al., 2016; Duchesne et al., 2017; Nagabandi et al., 2018).
The model can then act as a proxy for the actual environment.
When a model of the environment is available, planning consists
in interacting with the model to recommend an action. In the case
of discrete actions, lookahead search is usually done by generating

potential trajectories. In the case of a continuous action space, trajectory optimization with a variety of controllers can be used.

6.1.1 Lookahead search


A lookahead search in an MDP iteratively builds a decision tree where
the current state is the root node. It stores the obtained returns in the
nodes and focuses attention on promising potential trajectories. The
main difficulty in sampling trajectories is to balance exploration and
exploitation. On the one hand, the purpose of exploration is to gather
more information on the part of the search tree where few simulations
have been performed (i.e., where the expected value has a high variance).
On the other hand, the purpose of exploitation is to refine the expected
value of the most promising moves.
Monte-Carlo tree search (MCTS) techniques (Browne et al., 2012)
are popular approaches to lookahead search. Among others, they have
gained popularity thanks to prolific achievements in the challenging
task of computer Go (Brügmann, 1993; Gelly et al., 2006; Silver et al.,
2016a). The idea is to sample multiple trajectories from the current state
until a terminal condition is reached (e.g., a given maximum depth)
(see Figure 6.1 for an illustration). From those simulation steps, the
MCTS algorithm then recommends an action to take.
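The exploration/exploitation trade-off inside the search tree is commonly handled with an upper-confidence rule such as UCT; the specific formula below is a standard choice from the MCTS literature rather than something prescribed by the text, and `values`, `visits` and the constant c are assumptions.

```python
import numpy as np

def uct_select(values, visits, c=1.4):
    """Pick the child maximizing mean value + exploration bonus.

    values: summed returns per child node, visits: visit counts per child node."""
    visits = np.asarray(visits, dtype=float)
    total = visits.sum()
    mean = np.asarray(values, dtype=float) / np.maximum(visits, 1.0)
    bonus = c * np.sqrt(np.log(total + 1.0) / np.maximum(visits, 1.0))
    # unvisited children get an infinite bonus so they are tried first
    bonus = np.where(visits == 0, np.inf, bonus)
    return int(np.argmax(mean + bonus))
```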
Recent works have developed strategies to directly learn end-to-end
the model, along with how to make the best use of it, without relying on
explicit tree search techniques (Pascanu et al., 2017). These approaches
show improved sample efficiency, performance, and robustness to model
misspecification compared to the separated approach (simply learning
the model and then relying on it during planning).

6.1.2 Trajectory optimization


Lookahead search techniques are limited to discrete actions, and
alternative techniques have to be used for the case of continuous actions.
If the model is differentiable, one can directly compute an analytic policy
gradient by backpropagation of rewards along trajectories (Nguyen
and Widrow, 1990). For instance, PILCO (Deisenroth and Rasmussen,
2011) uses Gaussian processes to learn a probabilistic model of the
dynamics.

[Figure 6.1: Illustration of how a MCTS algorithm performs a Monte-Carlo simulation and builds a tree by updating the statistics of the different nodes. Based on the statistics gathered for the current node s_t, the MCTS algorithm chooses an action to perform on the actual environment.]

It can then explicitly use the uncertainty for planning and
policy evaluation in order to achieve a good sample efficiency. However, Gaussian processes have not been able to scale reliably to high-dimensional problems.
One approach to scale planning to higher dimensions is to aim at
leveraging the generalization capabilities of deep learning. For instance,
Wahlström et al. (2015) uses a deep learning model of the dynamics
(with an auto-encoder) along with a model in a latent state space.
Model-predictive control (Morari and Lee, 1999) can then be used to
find the policy by repeatedly solving a finite-horizon optimal control
problem in the latent space. It is also possible to build a probabilistic
generative model in a latent space with the objective that it possesses
a locally linear dynamics, which allows control to be performed more
efficiently (Watter et al., 2015). Another approach is to use the trajectory
optimizer as a teacher rather than a demonstrator: guided policy
search (Levine and Koltun, 2013) takes a few sequences of actions
suggested by another controller. It then learns to adjust the policy from
these sequences. Methods that leverage trajectory optimization have
demonstrated many capabilities, for instance in the case of simulated 3D bipeds and quadrupeds (e.g., Mordatch et al., 2015).

6.2 Integrating model-free and model-based methods

The respective strengths of the model-free versus model-based approaches depend on different factors. First, the best suited approach
depends on whether the agent has access to a model of the environment.
If that’s not the case, the learned model usually has some inaccuracies
that should be taken into account. Note that learning the model can
share the hidden-state representations with a value-based approach by
sharing neural network parameters (Li et al., 2015).
Second, a model-based approach requires working in conjunction
with a planning algorithm (or controller), which is often computationally
demanding. The time constraints for computing the policy π(s) via
planning must therefore be taken into account (e.g., for applications
with real-time decision-making or simply due to resource limitations).
Third, for some tasks, the structure of the policy (or value function) is
the easiest one to learn, but for other tasks, the model of the environment
may be learned more efficiently due to the particular structure of the
task (less complex or with more regularity). Thus, the most performant
approach depends on the structure of the model, policy, and value
function (see the coming Chapter 7 for more details on generalization).
Let us consider two examples to better understand this key consideration.
In a labyrinth where the agent has full observability, it is clear how
actions affect the next state and the dynamics of the model may easily be
generalized by the agent from only a few tuples (for instance, the agent
is blocked when trying to cross a wall of the labyrinth). Once the model
is known, a planning algorithm can then be used with high performance.
Let us now discuss another example where, on the contrary, planning
is more difficult: an agent has to cross a road with random events
happening everywhere on the road. Let us suppose that the best policy
is simply to move forward except when an object has just appeared
in front of the agent. In that case, the optimal policy may easily be
captured by a model-free approach, while a model-based approach would
be more difficult (mainly due to the stochasticity of the model which
leads to many different possible situations, even for one given sequence
of actions).

Figure 6.2: Venn diagram in the space of possible RL algorithms (model-based RL, value-based RL and policy-based RL).

We now describe how it is possible to obtain advantages from both worlds by integrating learning and planning into one end-to-end training
procedure so as to obtain an efficient algorithm both in performance
(sample efficient) and in computation time. A Venn diagram of the
different combinations is given in Figure 6.2.
When the model is available, one direct approach is to use tree
search techniques that make use of both value and policy networks
(e.g., Silver et al., 2016a). When the model is not available and under
the assumption that the agent has only access to a limited number of
trajectories, the key property is to have an algorithm that generalizes
well (see Chapter 7 for a discussion on generalization). One possibility
is to build a model that is used to generate additional samples for a
model-free reinforcement learning algorithm (Gu et al., 2016b). Another
possibility is to use a model-based approach along with a controller
such as MPC to perform basic tasks and use model-free fine-tuning in
order to achieve task success (Nagabandi et al., 2017).
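
As an illustration of the first possibility (a learned model generating additional samples for a model-free learner), the following Dyna-style sketch complements each real transition with several imagined ones in a tabular Q-learning update. The `env` and `model` interfaces are assumptions made for the example; this is a generic Dyna-like loop, not the specific algorithm of Gu et al. (2016b).

```python
import random
from collections import defaultdict

def dyna_q(env, model, n_steps=10_000, n_imagined=10,
           alpha=0.1, gamma=0.95, epsilon=0.1):
    """Sketch of a Dyna-style loop: each real transition is complemented
    by `n_imagined` transitions generated from the learned model."""
    Q = defaultdict(float)   # Q[(s, a)] estimates
    visited = []             # (s, a) pairs seen so far

    def q_update(s, a, r, s_next):
        target = r + gamma * max(Q[(s_next, a_)] for a_ in env.actions(s_next))
        Q[(s, a)] += alpha * (target - Q[(s, a)])

    s = env.reset()
    for _ in range(n_steps):
        # epsilon-greedy action on the current Q estimates
        if random.random() < epsilon:
            a = env.sample_action()
        else:
            a = max(env.actions(s), key=lambda a_: Q[(s, a_)])
        s_next, r, done = env.step(a)
        model.update(s, a, r, s_next)        # improve the learned model
        visited.append((s, a))

        q_update(s, a, r, s_next)            # learn from the real sample
        for _ in range(n_imagined):          # learn from imagined samples
            s_i, a_i = random.choice(visited)
            r_i, s_next_i = model.predict(s_i, a_i)
            q_update(s_i, a_i, r_i, s_next_i)

        s = env.reset() if done else s_next
    return Q
```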
Other approaches build neural network architectures that combine
both model-free and model-based elements. For instance, it is possible
to combine a value function with steps of back-propagation through a
model (Heess et al., 2015). The VIN architecture (Tamar et al., 2016)
is a fully differentiable neural network with a planning module that learns to plan from model-free objectives (given by a value function). It
works well for tasks that involve planning-based reasoning (navigation
tasks) from one initial position to one goal position and it demonstrates
strong generalization in a few different domains.
In the same spirit, the predictron (Silver et al., 2016b) is aimed at
developing a more generally applicable algorithm that is effective in the
context of planning. It works by implicitly learning an internal model
in an abstract state space, which is used for policy evaluation. The
predictron is trained end-to-end to learn, from the abstract state space,
(i) the immediate reward and (ii) value functions over multiple planning
depths. The predictron architecture is limited to policy evaluation, but
the idea was extended to an algorithm that can learn an optimal policy
in an architecture called VPN (Oh et al., 2017). Since VPN relies on n-step Q-learning, it however requires on-policy data.
Other works have proposed architectures that combine model-based
and model-free approaches. Schema Networks (Kansky et al., 2017) learn
the dynamics of an environment directly from data by enforcing some
relational structure. The idea is to use a richly structured architecture
such that it provides robust generalization thanks to an object-oriented
approach for the model.
I2As (Weber et al., 2017) does not use the model to directly perform
planning but it uses the predictions as additional context in deep policy
networks. The proposed idea is that I2As could learn to interpret
predictions from the learned model to construct implicit plans.
TreeQN (Farquhar et al., 2017) constructs a tree by recursively
applying an implicit transition model in an implicitly learned abstract
state space, built by estimating Q-values. Farquhar et al. (2017) also
propose ATreeC, which is an actor-critic variant that augments TreeQN
with a softmax layer to form a stochastic policy network.
The CRAR agent explicitly learns both a value function and a model
via a shared low-dimensional learned encoding of the environment,
which is meant to capture summarized abstractions and allow for
efficient planning (François-Lavet et al., 2018). By forcing an expressive
representation, the CRAR approach creates an interpretable low-
dimensional representation of the environment, even far temporally from any rewards or in the absence of model-free objectives.
Improving the combination of model-free and model-based ideas
is one key area of research for the future development of deep RL
algorithms. We therefore expect to see smarter and richer structures in
that domain.

7 The concept of generalization

Generalization is a central concept in the field of machine learning, and reinforcement learning is no exception. In an RL algorithm (model-free
or model-based), generalization refers to either

• the capacity to achieve good performance in an environment where limited data has been gathered, or

• the capacity to obtain good performance in a related environment.

In the former case, the agent must learn how to behave in a test
environment that is identical to the one it has been trained on. In
that case, the idea of generalization is directly related to the notion
of sample efficiency (e.g., when the state-action space is too large to
be fully visited). In the latter case, the test environment has common
patterns with the training environment but can differ in the dynamics
and the rewards. For instance, the underlying dynamics may be the
same but a transformation on the observations may have happened
(e.g., noise, shift in the features, etc.). That case is related to the idea
of transfer learning (discussed in §10.2) and meta-learning (discussed
in §10.1.2).

Note that, in the online setting, one mini-batch gradient update is usually done at every step. In that case, the community has also
used the term sample efficiency to refer to how fast the algorithm
learns, which is measured in terms of performance for a given number of steps (number of learning steps = number of transitions observed).
However, in that context, the result depends on many different elements.
It depends on the learning algorithm and it is, for instance, influenced by the possible variance of the target in a model-free setting. It also depends on the exploration/exploitation, which will be discussed in §8.1 (e.g., instabilities may be good). Finally, it depends on the actual
generalization capabilities.
In this chapter, the goal is to study specifically the aspect of
generalization. We are not interested in the number of mini-batch
gradient descent steps that are required but rather in the performance
that a deep RL algorithm can have in the offline case where the agent
has to learn from limited data. Let us consider the case of a finite
dataset D obtained on the exact same task as the test environment.
Formally, a dataset available to the agent D ∼ 𝒟 can be defined as a set of four-tuples <s, a, r, s′> ∈ S × A × R × S gathered by sampling independently and identically (i.i.d.)¹:

• a given number of state-action pairs (s, a) from some fixed distribution with P(s, a) > 0, ∀(s, a) ∈ S × A,

• a next state s′ ∼ T(s, a, ·), and

• a reward r = R(s, a, s′).

We denote by D∞ the particular case of a dataset D where the number of tuples tends to infinity.

¹ That i.i.d. assumption can, for instance, be obtained from a given distribution of initial states by following a stochastic sampling policy that ensures a non-zero probability of taking any action in any given state. That sampling policy should be followed during at least H time steps, with the assumption that all states of the MDP can be reached in a number of steps smaller than H from the given distribution of initial states.

A learning algorithm can be seen as a mapping of a dataset D into a policy πD (independently of whether the learning algorithm has a
model-based or a model-free approach). In that case, we can decompose the suboptimality of the expected return as follows:

$$
\begin{aligned}
\mathop{\mathbb{E}}_{D \sim \mathcal{D}}\left[ V^{\pi^*}(s) - V^{\pi_D}(s) \right]
&= \mathop{\mathbb{E}}_{D \sim \mathcal{D}}\left[ V^{\pi^*}(s) - V^{\pi_{D_\infty}}(s) + V^{\pi_{D_\infty}}(s) - V^{\pi_D}(s) \right] \\
&= \underbrace{\left( V^{\pi^*}(s) - V^{\pi_{D_\infty}}(s) \right)}_{\text{asymptotic bias}}
 + \underbrace{\mathop{\mathbb{E}}_{D \sim \mathcal{D}}\left[ V^{\pi_{D_\infty}}(s) - V^{\pi_D}(s) \right]}_{\text{error due to finite size of the dataset } D}
\end{aligned}
\tag{7.1}
$$

This decomposition highlights two different terms: (i) an asymptotic bias which is independent of the quantity of data and (ii) an overfitting
term directly related to the fact that the amount of data is limited.
The goal of building a policy πD from a dataset D is to obtain the
lowest overall suboptimality. To do so, the RL algorithm should be well
adapted to the task (or the set of tasks).
In the previous section, two different types of approaches (model-
based and model-free) have been discussed, as well as how to combine
them. We have discussed the algorithms that can be used for different
approaches but we have in fact left out many important elements that
have an influence on the bias-overfitting tradeoff (e.g., Zhang et al.,
2018c; Zhang et al., 2018a for illustrations of overfitting in deep RL).
As illustrated in Figure 7.1, improving generalization can be seen as
a tradeoff between (i) an error due to the fact that the algorithm trusts
completely the frequentist assumption (i.e., discards any uncertainty
on the limited data distribution) and (ii) an error due to the bias
introduced to reduce the risk of overfitting. For instance, the function
approximator can be seen as a form of structure introduced to force some
generalization, at the risk of introducing a bias. When the quality of the
dataset is low, the learning algorithm should favor more robust policies
(i.e., consider a smaller class of policies with stronger generalization
capabilities). When the quality of the dataset increases, the risk of
overfitting is lower and the learning algorithm can trust the data more,
hence reducing the asymptotic bias.

Figure 7.1: Schematic representation of the bias-overfitting tradeoff: depending on the data and the policy class, a percentage of the error is due to overfitting and a percentage is due to the asymptotic bias.

As we will see, for many algorithmic choices, there is in practice a tradeoff to be made between asymptotic bias and overfitting that we simply call "bias-overfitting tradeoff". In this section, we discuss the following key elements that are at stake when one wants to improve generalization in deep RL:

• the state representation,

• the learning algorithm (type of function approximator and model-free vs model-based),

• the objective function (e.g., reward shaping, tuning the training discount factor), and

• using hierarchical learning.


Throughout those discussions, a simple example is considered. This
example is, by no means, representative of the complexity of real-world
problems but it is enlightening to simply illustrate the concepts that
will be discussed. Let us consider an MDP with N_S = 11 states and N_A = 4 actions. Let us suppose that the main part of the environment is a square 3 × 3 grid world (each state represented by a tuple (x, y) with x ∈ {0, 1, 2}, y ∈ {0, 1, 2}), as illustrated in Figure 7.2. The agent starts in the central state (1, 1). In every state, it selects one of the 4 actions corresponding to the 4 cardinal directions (up, down, left and right), which leads the agent to transition deterministically to the state immediately next to it, except when it tries to move out of the domain. On the upper part and lower part of the domain, the agent is
stuck in the same state if it tries to move out of the domain. On the
left, the agent transitions deterministically to a given state, which will
provide a reward of 0.6 for any action at the next time step. On the
right side of the square, the agent transitions with a probability 25% to
another state that will provide, at the next time step, a reward of 1 for
any action (the rewards are 0 for all other states). When a reward is
obtained, the agent transitions back to the central state.

Figure 7.2: Representation of a simple MDP that illustrates the need of generalization. On the left, a deterministic transition (P = 1) leads to a state that provides a reward r = 0.6; on the right, a transition with probability P = 0.25 leads to a state that provides a reward r = 1.

In this example, if the agent has perfect knowledge of its environment, the best expected cumulative reward (for a discount factor close to 1)
would be to always go to the left direction and repeatedly gather a
reward of 0.6 every 3 steps (as compared to gathering a reward of 1 every
6 steps on average). Let us now suppose that only limited information
has been obtained on the MDP with only one tuple of experience
< s, a, r, s′ > for each couple < s, a >. According to the limited data in
the frequentist assumption, there is a rather high probability (∼ 58%)
that at least one transition from the right side seems to provide a
deterministic access to r = 1. In those cases and for either a model-
based or a model-free approach, if the learning algorithm comes up
with the optimal policy in an empirical MDP built from the frequentist
statistics, it would actually suffer from poor generalization as it would
choose to try to obtain the reward r = 1.
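
The ∼58% figure can be recovered directly under one natural reading of the example: with one sample per state-action pair, each of the three transitions from the right side reaches the rewarding state with probability 0.25, so at least one of them looks deterministic with probability 1 − 0.75³ ≈ 0.58. The short sketch below, a simulation under that assumption, confirms it.

```python
import random

def fraction_misleading_datasets(n_datasets=100_000, p=0.25, n_right_edge=3):
    """Estimate how often a dataset with one sample per (s, a) makes at least
    one right-side transition look like a deterministic path to r = 1."""
    count = 0
    for _ in range(n_datasets):
        samples = [random.random() < p for _ in range(n_right_edge)]
        if any(samples):
            count += 1
    return count / n_datasets

if __name__ == "__main__":
    print(fraction_misleading_datasets())   # ~0.58 empirically
    print(1 - (1 - 0.25) ** 3)               # 0.578125 analytically
```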
We discuss hereafter the different aspects that can be used to avoid
overfitting to limited data; we show that it is done by favoring robust
policies within the policy class, usually at the expense of introducing


some bias. At the end, we also discuss how the bias-overfitting tradeoff
can be used in practice to obtain the best performance from limited
data.

7.1 Feature selection

The idea of selecting the right features for the task at hand is key in the
whole field of machine learning and also highly prevalent in reinforcement
learning (see e.g., Munos and Moore, 2002; Ravindran and Barto, 2004;
Leffler et al., 2007; Kroon and Whiteson, 2009; Dinculescu and Precup,
2010; Li et al., 2011; Ortner et al., 2014; Mandel et al., 2014; Jiang
et al., 2015a; Guo and Brunskill, 2017; François-Lavet et al., 2017). The
appropriate level of abstraction plays a key role in the bias-overfitting
tradeoff and one of the key advantages of using a small but rich abstract
representation is to allow for improved generalization.

Overfitting When considering many features on which to base the policy (in the example, the y-coordinate of the state as illustrated in Figure 7.3), an RL algorithm may take into consideration spurious correlations, which leads to overfitting (in the example, the agent may infer from the limited data that the y-coordinate has an influence on the expected return).

Figure 7.3: Illustration of the state representation and feature selection process: the environment, the state representation with a set of features (x, y), and the feature selection where only the x-coordinate has been kept. In this case, after the feature selection process, all states with the same x-coordinate are considered as indistinguishable.

Asymptotic bias Removing features that discriminate states with a very different role in the dynamics introduces an asymptotic bias. Indeed, the same policy would be enforced on indistinguishable states, hence leading to a sub-optimal policy.

In deep RL, one approach is to first infer a factorized set of generative factors from the observations. This can be done for instance
with an encoder-decoder architecture variant (Higgins et al., 2017;
Zhang et al., 2018b). These features can then be used as inputs to a
reinforcement learning algorithm. The learned representation can, in
some contexts, greatly help for generalization as it provides a more
succinct representation that is less prone to overfitting. However, an
auto-encoder is often too strong of a constraint. On the one hand, some
features may be kept in the abstract representation because they are
important for the reconstruction of the observations, though they are
otherwise irrelevant for the task at hand (e.g., the color of the cars in a
self-driving car context). On the other hand, crucial information about
the scene may also be discarded in the latent representation, particularly
if that information takes up a small proportion of the observations x in
pixel space (Higgins et al., 2017). Note that in the deep RL setting, the
abstraction representation is intertwined with the use of deep learning.
This is discussed in detail in the following section.

7.2 Choice of the learning algorithm and function approximator selection

The function approximator in deep learning characterizes how the features will be treated into higher levels of abstraction (a fortiori it can
thus give more or less weight to some features). If there is, for instance,
an attention mechanism in the first layers of a deep neural network, the
mapping made up of those first layers can be seen as a feature selection
mechanism.
On the one hand, if the function approximator used for the value
function and/or the policy and/or the model is too simple, an asymptotic
bias may appear. On the other hand, when the function approximator
has poor generalization, there will be a large error due to the finite size
of the dataset (overfitting). In the example above, a particularly good choice of a model-based or model-free approach associated with a good
choice of a function approximator could infer that the y-coordinate of
the state is less important than the x-coordinate, and generalize that
to the policy.
Depending on the task, finding a performant function approximator
is easier in either a model-free or a model-based approach. The choice of
relying more on one or the other approach is thus also a crucial element
to improve generalization, as discussed in §6.2.
One approach to mitigate non-informative features is to force the
agent to acquire a set of symbolic rules adapted to the task and to
reason on a more abstract level. This abstract level reasoning and the
improved generalization have the potential to induce high-level cognitive
functions such as transfer learning and analogical reasoning (Garnelo
et al., 2016). For instance, the function approximator may embed a
relational learning structure (Santoro et al., 2017) and thus build on
the idea of relational reinforcement learning (Džeroski et al., 2001).

7.2.1 Auxiliary tasks


In the context of deep reinforcement learning, it was shown by Jaderberg
et al. (2016) that augmenting a deep reinforcement learning agent with
auxiliary tasks within a jointly learned representation can drastically
improve sample efficiency in learning. This is done by maximizing
simultaneously many pseudo-reward functions such as immediate reward
prediction (γ = 0), predicting pixel changes in the next observation, or
predicting activation of some hidden unit of the agent’s neural network.
The argument is that learning related tasks introduces an inductive
bias that causes a model to build features in the neural network that
are useful for the range of tasks (Ruder, 2017). Hence, this formation of
more significant features leads to less overfitting.
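
A minimal sketch of the auxiliary-task idea in PyTorch: a shared torso feeds both the main RL head and an auxiliary head, and the two losses are combined with a weight. The network sizes, the choice of predicting the immediate reward as the auxiliary task, and the weighting are illustrative assumptions, not the specific architecture of Jaderberg et al. (2016).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedTorso(nn.Module):
    """Shared representation feeding an RL head and an auxiliary head."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.torso = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden), nn.ReLU())
        self.q_head = nn.Linear(hidden, n_actions)   # main RL objective
        self.reward_head = nn.Linear(hidden, 1)      # auxiliary: predict r

    def forward(self, obs):
        h = self.torso(obs)
        return self.q_head(h), self.reward_head(h)

def joint_loss(model, obs, actions, q_targets, rewards, aux_weight=0.5):
    """Total loss = RL loss + weighted auxiliary loss on the same representation.

    `actions` is a LongTensor of shape (batch,); the other tensors are float."""
    q_values, reward_pred = model(obs)
    q_taken = q_values.gather(1, actions.unsqueeze(1)).squeeze(1)
    rl_loss = F.smooth_l1_loss(q_taken, q_targets)
    aux_loss = F.mse_loss(reward_pred.squeeze(1), rewards)
    return rl_loss + aux_weight * aux_loss
```

Sharing the torso is the point of the construction: the auxiliary gradients shape features that are also useful for the main objective.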
In deep RL, it is possible to build an abstract state such that it
provides sufficient information for simultaneously fitting an internal
meaningful dynamics as well as the estimation of the expected value
of an optimal policy. By explicitly learning both the model-free and
model-based components through the state representation, along with an
approximate entropy maximization penalty, the CRAR agent (François-Lavet et al., 2018) shows how it is possible to learn a low-dimensional
representation of the task. In addition, this approach can directly make
use of a combination of model-free and model-based, with planning
happening in a smaller latent state space.

7.3 Modifying the objective function

In order to improve the policy learned by a deep RL algorithm, one can optimize an objective function that diverts from the actual objective.
By doing so, a bias is usually introduced but this can in some cases
help with generalization. The main approaches to modify the objective
function are either (i) to modify the reward of the task to ease learning
(reward shaping), or (ii) tune the discount factor at training time.

7.3.1 Reward shaping


Reward shaping is a heuristic for faster learning. In practice, reward
shaping uses prior knowledge by giving intermediate rewards for actions
that lead to the desired outcome. It is usually formalized as a function F(s, a, s′) added to the reward function R(s, a, s′) of the original MDP (Ng et al., 1999). This technique is often used in deep reinforcement
learning to improve the learning process in settings with sparse and
delayed rewards (e.g., Lample and Chaplot, 2017).
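
A classic way to instantiate F(s, a, s′) without changing the optimal policy is potential-based shaping, F(s, a, s′) = γΦ(s′) − Φ(s) (Ng et al., 1999). The sketch below assumes a user-supplied potential function Φ, here a hypothetical negated Manhattan distance to a goal cell.

```python
def shaped_reward(r, s, s_next, potential, gamma=0.99):
    """Potential-based reward shaping:
    R'(s, a, s') = R(s, a, s') + gamma * Phi(s') - Phi(s).

    With this particular form of F, the optimal policy of the shaped MDP
    is the same as that of the original MDP (Ng et al., 1999)."""
    return r + gamma * potential(s_next) - potential(s)

# Example potential for a hypothetical grid world: reward progress towards a goal.
def negative_manhattan_distance_to_goal(state, goal=(2, 1)):
    x, y = state
    return -(abs(x - goal[0]) + abs(y - goal[1]))
```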

7.3.2 Discount factor


When the model available to the agent is estimated from data, the
policy found using a shorter planning horizon can actually be better
than a policy learned with the true horizon (Petrik and Scherrer, 2009;
Jiang et al., 2015b). On the one hand, artificially reducing the planning
horizon leads to a bias since the objective function is modified. On the
other hand, if a long planning horizon is targeted (the discount factor
γ is close to 1), there is a higher risk of overfitting. This overfitting can
intuitively be understood as linked to the accumulation of the errors
in the transitions and rewards estimated from data as compared to
the actual transition and reward probabilities. In the example above
(Figure 7.2), in the case where the upper right or lower right states
would seem to lead deterministically to r = 1 from the limited data,
one may take into account that it requires more steps and thus more
uncertainty on the transitions (and rewards). In that context, a low
training discount factor would reduce the impact of rewards that are
temporally distant. In the example, a discount factor close to 0 would
discount the estimated rewards at three time steps much more strongly
than the rewards two time steps away, hence practically discarding the
potential rewards that can be obtained by going through the corners as
compared to the ones that only require moving along the x-axis.
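
A short sketch can make the role of the training discount factor tangible on the earlier example, treating the left option as a reward of 0.6 every 3 steps and the right option as a reward of 1 every 6 steps on average (the stylized description given above). The closed form below shows how strongly a small γ discounts the temporally more distant option relative to the nearer one.

```python
def discounted_value(reward, period, gamma):
    """Discounted return of a cycle that earns `reward` every `period` steps:
    sum_{k>=1} reward * gamma**(period*k) = reward * gamma**period / (1 - gamma**period)."""
    return reward * gamma ** period / (1.0 - gamma ** period)

for gamma in (0.3, 0.6, 0.9, 0.99):
    left = discounted_value(0.6, 3, gamma)    # reward 0.6 every 3 steps
    right = discounted_value(1.0, 6, gamma)   # reward 1 every 6 steps (on average)
    print(f"gamma={gamma:.2f}  left={left:.3f}  right={right:.3f}  right/left={right / left:.3f}")
```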
In addition to the bias-overfitting tradeoff, a high discount factor
also requires specific care in value iteration algorithms as it can lead to
instabilities in convergence. This effect is due to the mappings used in
the value iteration algorithms with bootstrapping (e.g., Equation 4.2
for the Q-learning algorithm) that propagate errors more strongly with
a high discount factor. This issue is discussed by Gordon (1999) with
the notion of non-expansion/expansion mappings. When bootstrapping
is used in a deep RL value iteration algorithm, the risk of instabilities
and overestimation of the value function is empirically stronger for a
discount factor close to one (François-Lavet et al., 2015).

7.4 Hierarchical learning

The possibility of learning temporally extended actions (as opposed to atomic actions that last for one time-step) has been formalized
under the name of options (Sutton et al., 1999). Similar ideas have
also been denoted in the literature as macro-actions (McGovern et
al., 1997) or abstract actions (Hauskrecht et al., 1998). The usage of
options is an important challenge in RL because it is essential when
the task at hand requires working on long time scales while developing
generalization capabilities and easier transfer learning between the
strategies. A few recent works have brought interesting results in
the context of fully differentiable (hence learnable in the context of
deep RL) options discovery. In the work of Bacon et al., 2016, an
option-critic architecture is presented with the capability of learning
simultaneously the internal policies and the termination conditions of
options, as well as the policy over options. In the work of Vezhnevets et al., 2016, the deep recurrent neural network is made up of two main
elements. The first module generates an action-plan (stochastic plan
of future actions) while the second module maintains a commitment-
plan which determines when the action-plan has to be updated or
terminated. Many variations of these approaches are also of interest
(e.g., Kulkarni et al., 2016; Mankowitz et al., 2016). Overall, building
a learning algorithm that is able to do hierarchical learning can be a
good way of constraining/favoring some policies that have interesting
properties and thus improving generalization.

7.5 How to obtain the best bias-overfitting tradeoff

From the previous sections, it is clear that there is a large variety of algorithmic choices and parameters that have an influence on the
bias-overfitting tradeoff (including the choice of approach between model-
based and model-free). An overall combination of all these elements
provides a low overall sub-optimality.
For a given algorithmic parameter setting and keeping all other
things equal, the right level of complexity is the one at which the increase
in bias is equivalent to the reduction of overfitting (or the increase in
overfitting is equivalent to the reduction of bias). However, in practice,
there is usually not an analytical way to find the right tradeoffs to be
made between all the algorithmic choices and parameters. Still, there
are a variety of practical strategies that can be used. We now discuss
them for the batch setting case and the online setting case.

7.5.1 Batch setting


In the batch setting case, the selection of the policy parameters to
effectively balance the bias-overfitting tradeoff can be done similarly
to that in supervised learning (e.g., cross-validation) as long as the
performance criterion can be estimated from a subset of the trajectories
from the dataset D not used during training (i.e., a validation set).
One approach is to fit an MDP model to the data via regression (or
simply use the frequentist statistics for finite state and action space).
The empirical MDP can then be used to evaluate the policy. This purely
model-based estimator has alternatives that do not require fitting a
model. One possibility is to use a policy evaluation step obtained
by generating artificial trajectories from the data, without explicitly
referring to a model, thus designing a Model-free Monte Carlo-like
(MFMC) estimator (Fonteneau et al., 2013). Another approach is to
use the idea of importance sampling that lets us obtain an estimate of V^π(s) from trajectories that come from a behavior policy β ≠ π, where β is assumed to be known (Precup, 2000). That approach is unbiased
but the variance usually grows exponentially in horizon, which renders
the method unsuitable when the amount of data is low. A mix of the
regression-based approach and the importance sampling approach is
also possible (Jiang and Li, 2016; Thomas and Brunskill, 2016), and
the idea is to use a doubly-robust estimator that is both unbiased and
with a lower variance than the importance sampling estimators.
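
As a concrete illustration of the importance sampling idea mentioned above, the sketch below computes the basic trajectory-wise estimator of V^π(s) from trajectories collected under a known behavior policy β; the weighted and doubly-robust estimators refine this basic form. The trajectory format and the policy interfaces are assumptions made for the example.

```python
def importance_sampling_estimate(trajectories, pi, beta, gamma=0.99):
    """Basic trajectory-wise importance sampling estimator of V^pi.

    Each trajectory is a list of (s, a, r) tuples collected under the
    behavior policy beta; pi(a, s) and beta(a, s) return action probabilities.
    The estimator is unbiased, but its variance typically grows exponentially
    with the horizon.
    """
    estimates = []
    for trajectory in trajectories:
        weight, discounted_return, discount = 1.0, 0.0, 1.0
        for (s, a, r) in trajectory:
            weight *= pi(a, s) / beta(a, s)   # likelihood ratio of the trajectory
            discounted_return += discount * r
            discount *= gamma
        estimates.append(weight * discounted_return)
    return sum(estimates) / len(estimates)
```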
Note that there exists a particular case where the environment’s
dynamics are known to the agent, but contain a dependence on
an exogenous time series (e.g., trading in energy markets, weather-
dependent dynamics) for which the agent only has finite data. In that
case, the exogenous signal can be broken down into training time series and validation time series (François-Lavet et al., 2016b). This allows training on the environment with the training time series and estimating any policy on the environment with the validation time series.

7.5.2 Online setting


In the online setting, the agent continuously gathers new experience.
The bias-overfitting tradeoff still plays a key role at each stage of the
learning process in order to achieve good sampling efficiency. Indeed,
a performant policy from given data is part of the solution to an
efficient exploration/exploitation tradeoff. For that reason, progressively
fitting a function approximator as more data becomes available can in
fact be understood as a way to obtain a good bias-overfitting tradeoff
throughout learning. With the same logic, progressively increasing the
discount factor allows optimizing the bias-overfitting tradeoff through
learning (François-Lavet et al., 2015). Besides, optimizing the bias-overfitting tradeoff also suggests the possibility to dynamically adapt
the feature space and/or the function approximator. For example, this
can be done through ad hoc regularization, or by adapting the neural
network architecture, using for instance the NET2NET transformation
(Chen et al., 2015).

8 Particular challenges in the online setting

As discussed in the introduction, reinforcement learning can be used in two main settings: (i) the batch setting (also called offline setting), and
(ii) the online setting. In a batch setting, the whole set of transitions
(s, a, r, s′ ) to learn the task is fixed. This is in contrast to the online
setting where the agent can gather new experience gradually. In the
online setting, two specific elements have not yet been discussed in
depth. First, the agent can influence how to gather experience so that
it is the most useful for learning. This is the exploration/exploitation
dilemma that we discuss in Section 8.1. Second, the agent has the
possibility to use a replay memory (Lin, 1992) that allows for a good
data-efficiency. We discuss in Section 8.2 what experience to store and
how to reprocess that experience.

8.1 Exploration/Exploitation dilemma

The exploration-exploitation dilemma is a well-studied tradeoff in RL (e.g., Thrun, 1992). Exploration is about obtaining information about the
environment (transition model and reward function) while exploitation
is about maximizing the expected return given the current knowledge.
As an agent starts accumulating knowledge about its environment, it

has to make a tradeoff between learning more about its environment (exploration) or pursuing what seems to be the most promising strategy
with the experience gathered so far (exploitation).

8.1.1 Different settings in the exploration/exploitation dilemma


There exist mainly two different settings. In the first setting, the agent
is expected to perform well without a separate training phase. Thus,
an explicit tradeoff between exploration versus exploitation appears so
that the agent should explore only when the learning opportunities are
valuable enough for the future to compensate what direct exploitation
can provide. The sub-optimality E V ∗ (s0) − V π(s0) of an algorithm
s0
obtained in this context is known as the cumulative regret 1 . The deep
RL community is usually not focused on this case, except when explicitly
stated such as in the works of Wang et al. (2016a) and Duan et al.
(2016b).
In the more common setting, the agent is allowed to follow a training
policy during a first phase of interactions with the environment so as to
accumulate training data and hence learn a test policy. In the training
phase, exploration is only constrained by the interactions it can make
with the environment (e.g., a given number of interactions). The test
policy should then be able to maximize a cumulative sum of rewards in
a separate phase of interaction. The sub-optimality E_{s_0}[V^*(s_0) − V^π(s_0)] obtained in this setting is known as the simple regret. Note that an implicit exploration/exploitation tradeoff is still important. On the
one hand, the agent has to ensure that the lesser-known parts of the
environment are not promising (exploration). On the other hand, the
agent is interested in gathering experience in the most promising parts of
the environment (which relates to exploitation) to refine the knowledge
of the dynamics. For instance, in the bandit task provided in Figure 8.1,
it should be clear with only a few samples that the option on the right
is less promising and the agent should gather experience mainly on the
two most promising arms to be able to discriminate the best one.
¹ This term is mainly used in the bandit community, where the agent is in only one state and where a distribution of rewards is associated to each action; see e.g., Bubeck et al., 2011.

Figure 8.1: Illustration of the reward probabilities of 3 arms in a multi-armed bandit problem (each arm is characterized by a probability distribution P(·) of rewards over [0, R_max]).

8.1.2 Different approaches to exploration


The exploration techniques are split into two main categories: (i) directed
exploration and (ii) undirected exploration (Thrun, 1992).
In the undirected exploration techniques, the agent does not rely on
any exploration specific knowledge of the environment (Thrun, 1992).
For instance, the technique called ε-greedy takes a random action with probability ε and follows the policy that is believed to be optimal with probability 1 − ε. Other variants such as softmax exploration (also called Boltzmann exploration) take an action with a probability that depends on the associated expected return.
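
The two undirected schemes just described can be written in a few lines; the sketch below assumes that the estimated Q-values of the current state are available as a NumPy array.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng=None):
    """With probability epsilon take a random action, otherwise the greedy one."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def softmax_exploration(q_values, temperature=1.0, rng=None):
    """Boltzmann exploration: sample an action with probability
    proportional to exp(Q(s, a) / temperature)."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(q_values, dtype=float) / temperature
    logits -= logits.max()                       # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(rng.choice(len(q_values), p=probs))
```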
Contrary to the undirected exploration, directed exploration
techniques make use of a memory of the past interactions with the
environment. For MDPs, directed exploration can scale polynomially
with the size of the state space while undirected exploration scales
in general exponentially with the size of the state space (e.g., E3 by
Kearns and Singh, 2002; R-max by Brafman and Tennenholtz, 2003; ...).
Inspired by the Bayesian setting, directed exploration can be done via
heuristics of exploration bonus (Kolter and Ng, 2009) or by maximizing
Shannon information gains (e.g., Sun et al., 2011).
Directed exploration is, however, not trivially applicable in high-
dimensional state spaces (e.g., Kakade et al., 2003). With the develop-
ment of the generalization capabilities of deep learning, some possibil-
ities have been investigated. The key challenge is to handle, for high-
dimensional spaces, the exploration/exploitation tradeoff in a principled
way – with the idea to encourage the exploration of the environment
where the uncertainty due to limited data is the highest. When rewards
are not sparse, a measure of the uncertainty on the value function can
be used to drive the exploration (Dearden et al., 1998; Dearden et al.,
1999). When rewards are sparse, this is even more challenging and
exploration should in addition be driven by some novelty measures on
the observations (or states in a Markov setting).
Before discussing the different techniques that have been proposed
in the deep RL setting, one can note that the success of the first deep RL algorithms such as DQN also comes from the exploration that arises naturally. Indeed, following a simple ε-greedy scheme online often proves
to be already relatively efficient thanks to the natural instability of the
Q-network that drives exploration (see Chapter 4 for why there are
instabilities when using bootstrapping in a fitted Q-learning algorithm
with neural networks).
Different improvements are directly built on that observation. For
instance, the method of "Bootstrapped DQN" (Osband et al., 2016)
makes an explicit use of randomized value functions. Along similar lines,
efficient exploration has been obtained by the induced stochasticity
of uncertainty estimates given by a dropout Q-network (Gal and
Ghahramani, 2016) or parametric noise added to its weights (Lipton
et al., 2016; Plappert et al., 2017; Fortunato et al., 2017). One specificity
of the work done by Fortunato et al., 2017 is that, similarly to Bayesian
deep learning, the variance parameters are learned by gradient descent
from the reinforcement learning loss function.
Another common approach is to have a directed scheme thanks
to exploration rewards given to the agent via heuristics that estimate
novelty (Schmidhuber, 2010; Stadie et al., 2015; Houthooft et al., 2016).
In (Bellemare et al., 2016; Ostrovski et al., 2017), an algorithm provides
the notion of novelty through a pseudo-count from an arbitrary density
model that provides an estimate of how many times an action has been
taken in similar states. This has shown good results on one of the most
difficult Atari 2600 games, Montezuma’s Revenge.
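
A simple count-based version of such an exploration bonus, for a discrete (hashable) state space, can be sketched as follows; pseudo-count methods replace the explicit count by a quantity derived from a learned density model so that the idea scales to large state spaces. The bonus scale β is an illustrative assumption.

```python
from collections import defaultdict
import math

class CountBasedBonus:
    """Adds an exploration bonus beta / sqrt(N(s)) to the extrinsic reward,
    where N(s) is the number of times state s has been visited so far.
    Pseudo-count approaches (Bellemare et al., 2016) generalize N(s) via a
    density model so that the idea applies to large state spaces."""

    def __init__(self, beta=0.1):
        self.beta = beta
        self.counts = defaultdict(int)

    def reward(self, state, extrinsic_reward):
        self.counts[state] += 1
        bonus = self.beta / math.sqrt(self.counts[state])
        return extrinsic_reward + bonus
```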
In (Florensa et al., 2017), useful skills are learned in pre-training
environments, which can then be utilized in the actual environment
to improve exploration and train a high-level policy over these skills.
Similarly, an agent that learns a set of auxiliary tasks may use them to
efficiently explore its environment (Riedmiller et al., 2018). These ideas are also related to the creation of options studied in (Machado et al.,
2017a), where it is suggested that exploration may be tackled by learning
options that lead to specific modifications in the state representation
derived from proto-value functions.
Exploration strategies can also make use of a model of the
environment along with planning. In that case, a strategy investigated
in (Salge et al., 2014; Mohamed and Rezende, 2015; Gregor et al., 2016;
Chiappa et al., 2017) is to have the agent choose a sequence of actions by
planning that leads to a representation of state as different as possible
to the current state. In (Pathak et al., 2017; Haber et al., 2018), the
agent optimizes both a model of its environment and a separate model
that predicts the error/uncertainty of its own model. The agent can
thus seek to take actions that adversarially challenge its knowledge of
the environment (Savinov et al., 2018).
By providing rewards on unfamiliar states, it is also possible to
explore efficiently the environments. To determine the bonus, the current
observation can be compared with the observations in memory. One
approach is to define the rewards based on how many environment steps
it takes to reach the current observation from those in memory (Savinov
et al., 2018). Another approach is to use a bonus positively correlated
to the error of predicting features from the observations (e.g., features
given by a fixed randomly initialized neural network) (Burda et al.,
2018).
Other approaches require either demonstrations or guidance from
human demonstrators. One line of work suggests using natural
language to guide the agent by providing exploration bonuses when an
instruction is correctly executed (Kaplan et al., 2017). In the case where
demonstrations from expert agents are available, another strategy for
guiding exploration in these domains is to imitate good trajectories. In
some cases, it is possible to use demonstrations from experts even when
they are given in an environment setup that is not exactly the same
(Aytar et al., 2018).
8.2 Managing experience replay

In online learning, the agent has the possibility to use a replay memory (Lin, 1992) that allows for data-efficiency by storing the past experience of the agent in order to have the opportunity to reprocess it later. In addition, a replay memory also ensures that the mini-batch updates are done from a reasonably stable data distribution kept in the replay memory (for N_replay sufficiently large), which helps for convergence/stability. This approach is particularly well-suited in the case of off-policy learning as using experience from past (i.e. different) policies does not introduce any bias (usually it is even good for exploration). In that context, methods based for instance on a DQN learning algorithm or model-based learning can safely and efficiently make use of a replay memory. In an online setting, the replay memory keeps all information for the last N_replay ∈ ℕ time steps, where N_replay is constrained by the amount of memory available.
While a replay memory allows processing the transitions in a different
order than they are experienced, there is also the possibility to use
prioritized replay. This allows for consideration of the transitions with
a different frequency than they are experienced depending on their
significance (that could be which experience to store and which ones
to replay). In (Schaul et al., 2015b), the prioritization increases with
the magnitude of the transitions’ TD error, with the aim that the
"unexpected" transitions are replayed more often.
A disadvantage of prioritized replay is that, in general, it also
introduces a bias; indeed, by modifying the apparent probabilities of
transitions and rewards, the expected return gets biased. This can
readily be understood by considering the simple example illustrated
in Figure 8.2 where an agent tries to estimate the expected return for
a given tuple <s, a>. In that example, a cumulative return of 0 is obtained with probability 1 − ε (from next state s(1)) while a cumulative return of C > 0 is obtained with probability ε (from next state s(2)). In that case, using prioritized experience replay will bias the expected return towards a value higher than εC since any transition leading to s(2) will be replayed with a probability higher than ε.
Figure 8.2: Illustration of a state s where, for a given action a, the value of Qπ(s, a; θ) would be biased if prioritized experience replay is used (ε ≪ 1). From s, the transition to s(1) occurs with probability T(s, a, s(1)) = 1 − ε and V π(s(1)) = 0, while the transition to s(2) occurs with probability T(s, a, s(2)) = ε and V π(s(2)) = C > 0.

Note that this bias can be partly or completely corrected using weighted importance sampling, and this correction is important near convergence at the end of training (Schaul et al., 2015b).
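
To make this concrete, the following is a minimal sketch (in Python/NumPy, not taken from any particular library) of a proportional prioritized replay buffer with importance-sampling weights in the spirit of Schaul et al., 2015b. The class name, the hyperparameters alpha and beta, and the use of a simple O(N) scan instead of a sum-tree are illustrative choices rather than a reference implementation.

```python
import numpy as np

class PrioritizedReplayBuffer:
    """Proportional prioritization with importance-sampling (IS) weights.

    A simple O(N) sampling scheme is used for clarity; practical
    implementations typically rely on a sum-tree for efficiency.
    """

    def __init__(self, capacity, alpha=0.6, eps=1e-6):
        self.capacity = capacity
        self.alpha = alpha      # how strongly priorities affect sampling
        self.eps = eps          # ensures every transition keeps a nonzero priority
        self.data = []          # list of (s, a, r, s_next, done) tuples
        self.priorities = np.zeros(capacity, dtype=np.float64)
        self.pos = 0            # circular write index

    def add(self, transition):
        # New transitions get the current maximum priority so that
        # they are replayed at least once.
        max_prio = self.priorities.max() if self.data else 1.0
        if len(self.data) < self.capacity:
            self.data.append(transition)
        else:
            self.data[self.pos] = transition
        self.priorities[self.pos] = max_prio
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size, beta=0.4):
        prios = self.priorities[:len(self.data)] ** self.alpha
        probs = prios / prios.sum()
        idx = np.random.choice(len(self.data), batch_size, p=probs)
        # IS weights correct the bias introduced by non-uniform sampling;
        # beta -> 1 recovers the full correction near convergence.
        weights = (len(self.data) * probs[idx]) ** (-beta)
        weights /= weights.max()
        batch = [self.data[i] for i in idx]
        return batch, idx, weights

    def update_priorities(self, idx, td_errors):
        # Priority grows with the magnitude of the TD error.
        self.priorities[idx] = np.abs(td_errors) + self.eps
```
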
9
Benchmarking Deep RL

Comparing deep learning algorithms is a challenging problem due


to the stochastic nature of the learning process and the narrow
scope of the datasets examined during algorithm comparisons. This
problem is exacerbated in deep reinforcement learning. Indeed, deep
RL involves both stochasticity in the environment and stochasticity
inherent to model learning, which makes ensuring fair comparisons
and reproducibility especially difficult. To this end, simulations of
many sequential decision-making tasks have been created to serve
as benchmarks. In this section, we present several such benchmarks.
Next, we give key elements to ensure consistency and reproducibility
of experimental results. Finally, we also discuss some open-source
implementations for deep RL algorithms.

9.1 Benchmark Environments

9.1.1 Classic control problems


Several classic control problems have long been used to evaluate
reinforcement learning algorithms. These include balancing a pole
on a cart (Cartpole) (Barto et al., 1983), trying to get a car
up a mountain using momentum (Mountain Car) (Moore, 1990),
swinging a pole up using momentum and subsequently balancing
it (Acrobot) (Sutton and Barto, 1998). These problems have been
commonly used as benchmarks for tabular RL and RL algorithms using
linear function approximators (Whiteson et al., 2011). Nonetheless,
these simple environments are still sometimes used to benchmark deep
RL algorithms (Ho and Ermon, 2016; Duan et al., 2016a; Lillicrap et al.,
2015).

9.1.2 Games
Board-games have also been used for evaluating artificial intelligence
methods for decades (Shannon, 1950; Turing, 1953; Samuel, 1959; Sutton,
1988; Littman, 1994; Schraudolph et al., 1994; Tesauro, 1995; Campbell
et al., 2002). In recent years, several notable works have stood out in
using deep RL for mastering Go (Silver et al., 2016a) or Poker (Brown
and Sandholm, 2017; Moravčik et al., 2017).
In parallel to the achievements in board games, video games have
also been used to further investigate reinforcement learning algorithms.
In particular,

• many of these games have a large observation space and/or a large action space;

• they are often non-Markovian, which requires specific care (see §10.1);

• they also usually require very long planning horizons (e.g., due to sparse rewards).

Several platforms based on video games have been popularized.


The Arcade Learning Environment (ALE) (Bellemare et al., 2013)
was developed to test reinforcement algorithms across a wide range of
different tasks. The system encompasses a suite of iconic Atari games,
including Pong, Asteroids, Montezuma’s Revenge, etc. Figure 9.1 shows
sample frames from some of these games. On most of the Atari games,
deep RL algorithms have reached super-human level (Mnih et al., 2015).
Due to the similarity in state and action spaces between different Atari
games or different variants of the same game, they are also a good test-
bed for evaluating generalization of reinforcement learning algorithms
(Machado et al., 2017b), multi-task learning (Parisotto et al., 2015) and
for transfer learning (Rusu et al., 2015).

Figure 9.1: Illustration of three Atari games: (a) Space Invaders, (b) Seaquest, (c) Breakout.

The General Video Game AI (GVGAI) competition framework


(Perez-Liebana et al., 2016) was created and released with the purpose
of providing researchers a platform for testing and comparing their
algorithms on a large variety of games and under different constraints.
The agents are required to either play multiple unknown games with
or without access to game simulations, or to design new game levels or
rules.
VizDoom (Kempka et al., 2016) implements the Doom video game as
a simulated environment for reinforcement learning. VizDoom has been
used as a platform for investigations of reward shaping (Lample and
Chaplot, 2017), curriculum learning (Wu and Tian, 2016), predictive
planning (Dosovitskiy and Koltun, 2016), and meta-reinforcement
learning (Duan et al., 2016b).
The open-world nature of Minecraft also provides a convenient
platform for exploring reinforcement learning and artificial intelligence.
Project Malmo (Johnson et al., 2016) is a framework that provides easy
access to the Minecraft video game. The environment and framework
provide layers of abstraction that facilitate tasks ranging from simple
navigation to collaborative problem solving. Due to the nature of
the simulation, several works have also investigated lifelong-learning,
curriculum learning, and hierarchical planning using Minecraft as a
platform (Tessler et al., 2017; Matiisen et al., 2017; Branavan et al., 2012; Oh et al., 2016).
Similarly, Deepmind Lab (Beattie et al., 2016) provides a 3D
platform adapted from the Quake video game. The Labyrinth maze
environments provided with the framework have been used in work on
hierarchical, lifelong and curriculum learning (Jaderberg et al., 2016;
Mirowski et al., 2016; Teh et al., 2017).
Finally, “StarCraft II" (Vinyals et al., 2017) and “Starcraft:
Broodwar" (Wender and Watson, 2012; Synnaeve et al., 2016) provide
similar benefits in exploring lifelong-learning, curriculum learning, and
other related hierarchical approaches. In addition, real-time strategy
(RTS) games – as with the Starcraft series – are also an ideal testbed
for multi-agent systems. Consequently, several works have investigated
these aspects in the Starcraft framework (Foerster et al., 2017b; Peng
et al., 2017a; Brys et al., 2014).

9.1.3 Continuous control systems and robotics domains


While games provide a convenient platform for reinforcement learning,
the majority of those environments investigate discrete action decisions.
In many real-world systems, as in robotics, it is necessary to provide
frameworks for continuous control.
In that setting, the MuJoCo (Todorov et al., 2012) simulation
framework is used to provide several locomotion benchmark tasks.
These tasks typically involve learning a gait to move a simulated robotic
agent as fast as possible. The action space is the amount of torque to
apply to motors on the agents’ joints, while the observations provided
are typically the joint angles and positions in the 3D space. Several
frameworks have built on top of these locomotion tasks to provide
hierarchical task environments (Duan et al., 2016a) and multi-task
learning platforms (Henderson et al., 2017a).
Because the MuJoCo simulator is closed-source and requires a license,
an open-source initiative called Roboschool (Schulman et al., 2017b)
provides the same locomotion tasks along with more complex tasks
involving humanoid robot simulations (such as learning to run and
chase a moving flag while being hit by obstacles impeding progress).
Figure 9.2: Screenshots from MuJoCo locomotion benchmark environments provided by OpenAI Gym.

These tasks allow for evaluation of complex planning in reinforcement learning algorithms.
Physics engines have also been used to investigate transfer learning
to real-world applications. For instance, the Bullet physics engine
(Coumans, Bai, et al., 2016) has been used to learn locomotion skills
in simulation, for character animation in games (Peng et al., 2017b)
or for being transferred to real robots (Tan et al., 2018). This also
includes manipulation tasks (Rusu et al., 2016; Duan et al., 2017) where
a robotic arm stacks cubes in a given order. Several works integrate
Robot Operating System (ROS) with physics engines (such as ODE,
or Bullet) to provide RL-compatible access to near real-world robotic
simulations (Zamora et al., 2016; Ueno et al., 2017). Most of them can
also be run on real robotic systems using the same software.
There exists also a toolkit that leverages the Unity platform for
creating simulation environments (Juliani et al., 2018). This toolkit
enables the development of learning environments that are rich in
sensory and physical complexity and supports the multi-agent setting.

9.1.4 Frameworks
Most of the previously cited benchmarks have open-source code available.
There also exists easy-to-use wrappers for accessing many different
benchmarks. One such example is OpenAI Gym (Brockman et al., 2016).
This wrapper provides ready access to environments such as algorithmic,
Atari, board games, Box2d games, classical control problems, MuJoCo
robotics simulations, toy text problems, and others. Gym Retro (https://github.com/openai/retro) is a wrapper similar to OpenAI Gym and it provides over 1,000 games
across a variety of backing emulators. The goal is to study the ability of deep RL agents to generalize between games that have similar concepts but different appearances. Other frameworks such as μniverse (https://github.com/unixpickle/muniverse) and SerpentAI (https://github.com/SerpentAI/SerpentAI) also provide wrappers for specific games or simulations.
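
As an illustration, a minimal interaction loop with one of the classic control environments exposed by OpenAI Gym could look as follows. The exact signatures of reset() and step() have changed across Gym versions, so this sketch assumes the older 4-tuple API and may need small adjustments for recent releases.

```python
import gym

env = gym.make("CartPole-v1")

for episode in range(5):
    obs = env.reset()                        # newer versions return (obs, info)
    done, total_reward = False, 0.0
    while not done:
        action = env.action_space.sample()   # random policy as a placeholder
        # newer versions also return a separate `truncated` flag
        obs, reward, done, info = env.step(action)
        total_reward += reward
    print(f"episode {episode}: return = {total_reward}")

env.close()
```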

9.2 Best practices to benchmark deep RL

Ensuring best practices in scientific experiments is crucial to continued


scientific progress. Across various fields, investigations in reproducibility
have found problems in numerous publications, resulting in several works
providing experimental guidelines in proper scientific practices (Sandve
et al., 2013; Baker, 2016; Halsey et al., 2015; Casadevall and Fang,
2010). To this end, several works have investigated proper metrics and
experimental practices when comparing deep RL algorithms (Henderson
et al., 2017b; Islam et al., 2017; Machado et al., 2017b; Whiteson et al.,
2011).

Number of Trials, Random Seeds and Significance Testing


Stochasticity plays a large role in deep RL, both from randomness within
initializations of neural networks and stochasticity in environments.
Results may vary significantly simply by changing the random seed.
When comparing the performance of algorithms, it is therefore important
to run many trials across different random seeds.
In deep RL, it has become common to simply test an algorithm’s
effectiveness with an average across a few learning trials. While this is
a reasonable benchmark strategy, techniques derived from significance
testing (Demšar, 2006; Bouckaert and Frank, 2004; Bouckaert, 2003;
Dietterich, 1998) have the advantage of providing statistically grounded
arguments in favor of a given hypothesis. In practice for deep RL, significance testing can be used to take into account the standard deviation across several trials with different random seeds and environment conditions. For instance, a simple 2-sample t-test can give an idea of whether
performance gains are significantly due to the algorithm performance
or to noisy results in highly stochastic settings. In particular, while


several works have used the top-K trials and simply presented those
as performance gains, this has been argued to be inadequate for fair
comparisons (Machado et al., 2017b; Henderson et al., 2017b).
In addition, one should be careful not to over-interpret the results. It is possible that a hypothesis can be shown to hold for one or several given environments and under one or several given sets of hyperparameters, but fail in other settings.
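
As a simple illustration of such a test, the snippet below compares the final returns of two algorithms across independent random seeds with Welch's t-test (a 2-sample t-test that does not assume equal variances); the return values are made up for the example.

```python
import numpy as np
from scipy import stats

# Final average returns over, e.g., 10 random seeds per algorithm (illustrative numbers).
returns_algo_a = np.array([212., 198., 305., 250., 241., 190., 264., 228., 255., 237.])
returns_algo_b = np.array([180., 195., 205., 176., 211., 188., 199., 170., 215., 182.])

# Welch's t-test: 2-sample t-test without assuming equal variances.
t_stat, p_value = stats.ttest_ind(returns_algo_a, returns_algo_b, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
# A small p-value suggests the difference in mean return is unlikely
# to be explained by seed-to-seed noise alone.
```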

Hyperparameter Tuning and Ablation Comparisons


Another important consideration is ensuring a fair comparison between
learning algorithms. In this case, an ablation analysis compares
alternate configurations across several trials with different random
seeds. It is especially important to tune hyperparameters to the greatest
extent possible for baseline algorithms. Poorly chosen hyperparameters
can lead to an unfair comparison between a novel and a baseline
algorithm. In particular, network architecture, learning rate, reward
scale, training discount factor, and many other parameters can
affect results significantly. Ensuring that a novel algorithm is indeed
performing much better requires proper scientific procedure when
choosing such hyperparameters (Henderson et al., 2017b).

Reporting Results, Benchmark Environments, and Metrics


Average returns (or cumulative reward) across evaluation trajectories are
often reported as a comparison metric. While some literature (Gu et al.,
2016a; Gu et al., 2017c) has also used metrics such as average maximum
return or maximum return within Z samples, these may be biased to
make results for highly unstable algorithms appear more significant.
For example, if an algorithm reaches a high maximum return quickly,
but then diverges, such metrics would ensure this algorithm appears
successful. When choosing metrics to report, it is important to select
those that provide a fair comparison. If the algorithm performs better in
average maximum return, but worse by using an average return metric,
it is important to highlight both results and describe the benefits and
shortcomings of such an algorithm (Henderson et al., 2017b).
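
The short sketch below illustrates how the choice of metric can change the picture: for the same (made-up) learning curves, the mean of the final returns and the mean of the maximum returns across seeds rank a stable and an unstable learner differently, which is why the chosen metrics and their limitations should both be reported.

```python
import numpy as np

n_seeds, n_evals = 10, 50
rng = np.random.default_rng(0)

# Illustrative learning curves of shape (n_seeds, n_evals).
stable = np.linspace(0, 100, n_evals) + rng.normal(0, 5, (n_seeds, n_evals))
peak_then_crash = np.concatenate([np.linspace(0, 150, 25), np.full(25, 20.0)])
unstable = peak_then_crash + rng.normal(0, 5, (n_seeds, n_evals))

for name, curves in [("stable", stable), ("unstable", unstable)]:
    mean_final = curves[:, -1].mean()        # average return at the end of training
    mean_max = curves.max(axis=1).mean()     # average of the best return ever reached
    print(f"{name:9s} mean final = {mean_final:6.1f}   mean max = {mean_max:6.1f}")

# The unstable learner looks better under "mean max" (about 150 vs 100) even though
# its final performance has collapsed, illustrating why max-based metrics can mislead.
```
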
This is also applicable to the selection of which benchmark


environments to report during evaluation. Ideally, empirical results
should cover a large mixture of environments to determine in which
settings an algorithm performs well and in which settings it does not.
This is vital for determining real-world performance applications and
capabilities.

9.3 Open-source software for Deep RL

A deep RL agent is composed of a learning algorithm (model-based or


model-free) along with specific structure(s) of function approximator(s).
In the online setting (more details are given in Chapter 8), the agent
follows a specific exploration/exploitation strategy and typically uses a
memory of its previous experience for sample efficiency.
While many papers release implementations of various deep RL
algorithms, there also exist some frameworks built to facilitate the
development of new deep RL algorithms or to apply existing algorithms
to a variety of environments. We provide a list of some of the existing
frameworks in Appendix A.
10
Deep reinforcement learning beyond MDPs

We have so far mainly discussed how an agent is able to learn how


to behave in a given Markovian environment where all the interesting
information (the state st ∈ S) is obtained at every time step t. In
this chapter, we discuss more general settings with (i) non-Markovian
environments, (ii) transfer learning and (iii) multi-agent systems.

10.1 Partial observability and the distribution of (related) MDPs

In domains where the Markov hypothesis holds, it is straightforward to


show that the policy need not depend on what happened at previous time
steps to recommend an action (by definition of the Markov hypothesis).
This section describes two different cases that complicate the Markov
setting: the partially observable environments and the distribution of
(related) environments.
Those two settings are at first sight quite different conceptually.
However, in both settings, at each step in the sequential decision process,
the agent may benefit from taking into account its whole observable
history up to the current time step t when deciding what action to
perform. In other words, a history of observations can be used as a
pseudo-state (pseudo-state because that refers to a different and abstract
stochastic control process). Any missing information in the history of


observations (potentially long before time t) can introduce a bias in
the RL algorithm (as described in Chapter 7 when some features are
discarded).

10.1.1 The partially observable scenario


In this setting, the agent only receives, at each time step, an observation
of its environment that does not allow it to identify the state with
certainty. A Partially Observable Markov Decision Process (POMDP)
(Sondik, 1978; Kaelbling et al., 1998) is a discrete time stochastic control
process defined as follows:

Definition 10.1. A POMDP is a 7-tuple (S, A, T, R, Ω, O, γ) where:

• S is a finite set of states {1, . . . , NS},

• A is a finite set of actions {1, . . . , NA},

• T : S × A × S → [0, 1] is the transition function (set of conditional transition probabilities between states),

• R : S × A × S → R is the reward function, where R is a continuous set of possible rewards in a range Rmax ∈ R+ (e.g., [0, Rmax] without loss of generality),

• Ω is a finite set of observations {1, . . . , NΩ},

• O : S × Ω → [0, 1] is a set of conditional observation probabilities, and

• γ ∈ [0, 1) is the discount factor.

The environment starts in a distribution of initial states b(s0). At each time step t ∈ N0, the environment is in a state st ∈ S. At the same time, the agent receives an observation ωt ∈ Ω that depends on the state of the environment with probability O(st, ωt), after which the agent chooses an action at ∈ A. Then, the environment transitions to state st+1 ∈ S with probability T(st, at, st+1) and the agent receives a reward rt ∈ R equal to R(st, at, st+1).
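
As a minimal, self-contained sketch of these dynamics, one step of a (made-up) tabular POMDP can be simulated as follows; the numerical values of T, R, O and b(s0) are arbitrary and only serve to make the sampling explicit.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny POMDP with 2 states, 2 actions and 2 observations (illustrative numbers).
N_S, N_A, N_O = 2, 2, 2
T = np.array([[[0.9, 0.1], [0.2, 0.8]],        # T[s, a, s'] = P(s' | s, a)
              [[0.7, 0.3], [0.1, 0.9]]])
R = np.array([[[0.0, 1.0], [0.0, 0.0]],        # R[s, a, s'] = reward
              [[0.0, 0.0], [1.0, 0.0]]])
O = np.array([[0.8, 0.2],                      # O[s, w] = P(w | s)
              [0.3, 0.7]])
b0 = np.array([0.5, 0.5])                      # distribution over initial states
gamma = 0.95

def reset():
    s = rng.choice(N_S, p=b0)
    w = rng.choice(N_O, p=O[s])
    return s, w

def step(s, a):
    s_next = rng.choice(N_S, p=T[s, a])
    r = R[s, a, s_next]
    w_next = rng.choice(N_O, p=O[s_next])      # the agent only ever sees w, not s
    return s_next, w_next, r

# The agent's policy may only condition on the history of (w, a, r), never on s.
s, w = reset()
history = [w]
for t in range(5):
    a = int(rng.integers(N_A))                 # placeholder for pi(a | history)
    s, w, r = step(s, a)
    history += [a, r, w]
print(history)
```
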
When the full model (T, R and O) is known, methods such as the Point-Based Value Iteration (PBVI) algorithm (Pineau et al., 2003) for POMDP planning can be used to solve the problem. If the full POMDP model is not available, other reinforcement learning techniques have to be used.
A naive approach to building a space of candidate policies is to
consider the set of mappings taking only the very last observation(s) as
input. However, in a POMDP setting, this leads to candidate policies
that are typically not rich enough to capture the system dynamics,
thus suboptimal. In that case, the best achievable policy is stochastic
(Singh et al., 1994), and it can be obtained using policy gradient. The
alternative is to use a history of previously observed features to better
estimate the hidden state dynamics. We denote by Ht = Ω × (A × R × Ω)^t the set of histories observed up to time t for t ∈ N0 (see Fig. 10.1), and by H = ∪_{t=0}^{∞} Ht the space of all possible observable histories.

Figure 10.1: Illustration of a POMDP. The actual dynamics of the POMDP (the hidden states s0, s1, s2, . . . ) is depicted in dark, while the information that the agent can use to select the action at each step is the whole history Ht (the observations ωt, actions at and rewards rt), depicted in blue.

A straightforward approach is to take the whole history Ht ∈ H as input (Braziunas, 2003). However, increasing the size of the set of
candidate optimal policies generally implies: (i) more computation to
search within this set (Singh et al., 1994; McCallum, 1996) and, (ii) an
increased risk of including candidate policies suffering from overfitting
due to lack of sufficient data, which thus leads to a bias-overfitting


tradeoff when learning policies from data (François-Lavet et al., 2017).
In the case of deep RL, the architectures used usually have a smaller
number of parameters and layers than in supervised learning due to
the more complex RL setting, but the trend of using ever smarter
and complex architectures in deep RL happens similarly to supervised
learning tasks. Architectures such as convolutional layers or recurrency
are particularly well-suited to deal with a large input space because they
offer interesting generalization properties. A few empirical successes
on large scale POMDPs make use of convolutional layers (Mnih et al.,
2015) and/or recurrent layers (Hausknecht and Stone, 2015), such as
LSTMs (Hochreiter and Schmidhuber, 1997).
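
As a hedged sketch of this second option (in the spirit of recurrent value-based agents such as Hausknecht and Stone, 2015), the network below summarizes a sequence of observations with an LSTM and outputs Q-values for the last time step. It is written in PyTorch only for concreteness, and the layer sizes and names are arbitrary.

```python
import torch
import torch.nn as nn

class RecurrentQNetwork(nn.Module):
    """Q(H_t, a): the history of observations is summarized by an LSTM state."""

    def __init__(self, obs_dim, n_actions, hidden_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden_dim), nn.ReLU())
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.q_head = nn.Linear(hidden_dim, n_actions)

    def forward(self, obs_seq, hidden=None):
        # obs_seq: (batch, time, obs_dim), i.e., a truncated observable history.
        z = self.encoder(obs_seq)
        out, hidden = self.lstm(z, hidden)
        q_values = self.q_head(out[:, -1])       # Q-values for the current time step
        return q_values, hidden                  # hidden can be carried across steps

# Example: a batch of 8 histories of length 10 with 4-dimensional observations.
net = RecurrentQNetwork(obs_dim=4, n_actions=3)
q, h = net(torch.randn(8, 10, 4))
print(q.shape)  # torch.Size([8, 3])
```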

10.1.2 The distribution of (related) environments


In this setting, the environment of the agent is a distribution of different
(yet related) tasks that differ for instance in the reward function or in
the probabilities of transitions from one state to another. Each task
Ti ∼ T can be defined by the observations ωt ∈ Ω (which are equal
to st if the environments are Markov), the rewards r t ∈ R, as well as
the effect of the actions at ∈ A taken at each step. Similarly to the
partially observable context, we denote the history of observations by Ht, where Ht ∈ Ω × (A × R × Ω)^t. The agent aims at finding a policy π(at | Ht; θ) with the objective of maximizing its expected return, defined (in the discounted setting) as

E_{Ti ∼ T} [ ∑_{k=0}^{∞} γ^k r_{t+k} | Ht, π ].

An illustration of the general setting of meta learning on non-Markov


environments is given in Figure 10.2.
Different approaches have been investigated in the literature. The
Bayesian approach aims at explicitly modeling the distribution of the
different environments, if a prior is available (Ghavamzadeh et al.,
2015). However, it is often intractable to compute the Bayesian-optimal
strategy and one has to rely on more practical approaches that do
not require an explicit model of the distribution. The concept of meta-
learning or learning to learn aims at discovering, from experience, how
to behave in a range of tasks and how to negotiate the exploration-exploitation tradeoff (Hochreiter et al., 2001). In that case, deep RL techniques have been investigated by, e.g., Wang et al., 2016a; Duan et al., 2016b, with the idea of using recurrent networks trained on a set of environments drawn i.i.d. from the distribution.

Figure 10.2: Illustration of the general setting of meta learning on POMDPs for a set of labyrinth tasks (an RL algorithm is trained on a set of tasks drawn from the distribution and tested on related tasks). In this illustration, it is supposed that the agent only sees the nature of the environment one time step away from it.

Some other approaches have also been investigated. One possibility
is to train a neural network to imitate the behavior of known optimal
policies on MDPs drawn from the distribution (Castronovo et al., 2017).
The parameters of the model can also be explicitly trained such that a
small number of gradient steps in a new task from the distribution will
produce fast learning on that task (Finn et al., 2017).
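
The following is a heavily simplified, first-order sketch of that last idea (in the spirit of Finn et al., 2017), shown on a toy regression objective rather than a full RL task so that it stays self-contained; the task distribution, model, learning rates and number of inner steps are all placeholder choices.

```python
import copy
import numpy as np
import torch
import torch.nn as nn

def sample_task():
    """A task = fitting a sine wave with random amplitude and phase (toy stand-in)."""
    amp, phase = np.random.uniform(0.5, 2.0), np.random.uniform(0, np.pi)
    def batch(n=16):
        x = torch.rand(n, 1) * 10 - 5
        y = amp * torch.sin(x + phase)
        return x, y
    return batch

model = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 1))
meta_opt = torch.optim.Adam(model.parameters(), lr=1e-3)
inner_lr, inner_steps = 1e-2, 3

for meta_iter in range(1000):
    batch = sample_task()

    # Inner loop: adapt a copy of the current parameters with a few gradient steps.
    adapted = copy.deepcopy(model)
    inner_opt = torch.optim.SGD(adapted.parameters(), lr=inner_lr)
    for _ in range(inner_steps):
        x, y = batch()
        loss = nn.functional.mse_loss(adapted(x), y)
        inner_opt.zero_grad()
        loss.backward()
        inner_opt.step()

    # Outer loop (first-order approximation): evaluate the adapted parameters on
    # fresh data from the same task and apply their gradient to the initial parameters.
    x_val, y_val = batch()
    meta_loss = nn.functional.mse_loss(adapted(x_val), y_val)
    grads = torch.autograd.grad(meta_loss, adapted.parameters())
    meta_opt.zero_grad()
    for p, g in zip(model.parameters(), grads):
        p.grad = g.clone()
    meta_opt.step()
```
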
There exist different denominations for this setting with a distribution of environments. For instance, the denomination of "multi-task setting" is usually used in settings where a short history of
observations is sufficient to clearly distinguish the tasks. As an example
of a multi-task setting, a deep RL agent can exceed median human
performance on the set of 57 Atari games with a single set of weights
(Hessel et al., 2018). Other related denominations are the concepts
of "contextual" policies (Da Silva et al., 2012) and "universal"/"goal
conditioned" value functions (Schaul et al., 2015a) that refer to learning
policies or value function within the same dynamics for multiple goals
(multiple reward functions).

10.2 Transfer learning

Transfer learning is the task of efficiently using previous knowledge


from a source environment to achieve a good performance in a target
environment. In a transfer learning setting, the target environment
should not be in the distribution of the source tasks. However, in
practice, the concept of transfer learning is sometimes closely related to
meta learning, as we discuss hereafter.

10.2.1 Zero-shot learning


The idea of zero-shot learning is that an agent should be able to
act appropriately in a new task directly from experience acquired on
other similar tasks. For instance, one use case is to learn a policy
in a simulation environment and then use it in a real-world context
where gathering experience is not possible or severely constrained (see
§11.2). To achieve this, the agent must either (i) develop generalization
capacities described in Chapter 7 or (ii) use specific transfer strategies
that explicitly retrain or replace some of its components to adjust to
new tasks.
To develop generalization capacities, one approach is to use an idea
similar to data augmentation in supervised learning so as to make sense
of variations that were not encountered in the training data. Exactly
as in the meta-learning setting (§10.1.2), the actual (unseen) task may
appear to the agent as just another variation if there is enough data


augmentation on the training data. For instance, the agent can be
trained with deep RL techniques on different tasks simultaneously, and
it is shown by Parisotto et al., 2015 that it can generalize to new related
domains where the exact state representation has never been observed.
Similarly, the agent can be trained in a simulated environment while
being provided with different renderings of the observations. In that case,
the learned policy can transfer well to real images (Sadeghi and Levine,
2016; Tobin et al., 2017). The underlying reason for these successes
is the ability of the deep learning architecture to generalize between
states that have similar high-level representations and should therefore
have the same value function/policy in different domains. Rather than
manually tuning the randomization of simulations, one can also adapt
the simulation parameters by matching the policy behavior in simulation
to the real world (Chebotar et al., 2018). Another approach to zero-shot
transfer is to use algorithms that enforce states that relate to the same
underlying task but have different renderings to be mapped into an
abstract state that is close (Tzeng et al., 2015; François-Lavet et al.,
2018).
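
A minimal sketch of this randomization idea is given below; make_simulated_env, the agent interface and the ranges of the randomized parameters are hypothetical placeholders for whatever simulator and learning algorithm are available.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_sim_params():
    # Each episode uses a different rendering / dynamics configuration so that the
    # real environment later looks like "just another variation" to the agent.
    return {
        "friction": rng.uniform(0.5, 1.5),
        "mass_scale": rng.uniform(0.8, 1.2),
        "texture_id": int(rng.integers(0, 100)),
        "light_intensity": rng.uniform(0.3, 1.0),
    }

def train_with_domain_randomization(make_simulated_env, agent, n_episodes=10_000):
    """make_simulated_env(params) and agent are assumed to be provided by the user."""
    for _ in range(n_episodes):
        env = make_simulated_env(sample_sim_params())
        obs = env.reset()
        done = False
        while not done:
            action = agent.act(obs)
            next_obs, reward, done, _ = env.step(action)
            agent.observe(obs, action, reward, next_obs, done)  # any deep RL update
            obs = next_obs
```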

10.2.2 Lifelong learning or continual learning


A specific way of achieving transfer learning is to aim at lifelong learning
or continual learning. According to Silver et al., 2013, lifelong machine
learning relates to the capability of a system to learn many tasks over
a lifetime from one or more domains.
In general, deep learning architectures can generalize knowledge
across multiple tasks by sharing network parameters. A direct approach
is thus to train function approximators (e.g. policy, value function,
model, etc.) sequentially in different environments. The difficulty of
this approach is to find methods that enable the agent to retain
knowledge in order to more efficiently learn new tasks. The problem
of retaining knowledge in deep reinforcement learning is complicated
by the phenomenon of catastrophic forgetting, where generalization to
previously seen data is lost at later stages of learning.
The straightforward approach is to either (i) use experience replay


from all previous experience (as discussed in §8.2), or (ii) retrain
occasionally on previous tasks similar to the meta-learning setting
(as discussed in §10.1.2).
When these two options are not available, or as a complement to the
two previous approaches, one can use deep learning techniques that are
robust to forgetting, such as progressive networks (Rusu et al., 2016).
The idea is to leverage prior knowledge by adding, for each new task,
lateral connections to previously learned features (that are kept fixed).
Other approaches to limiting catastrophic forgetting include slowing
down learning on the weights important for previous tasks (Kirkpatrick
et al., 2016) and decomposing learning into skill hierarchies (Stone and
Veloso, 2000; Tessler et al., 2017).

Figure 10.3: Illustration of the continual learning setting, where an agent has to interact sequentially with a sequence of related (but different) tasks (Task 0, Task 1, Task 2, . . . ).

10.2.3 Curriculum learning


A particular setting of continual learning is curriculum learning. Here,
the goal is to explicitly design a sequence of source tasks for an agent to
train on such that the final performance or learning speed is improved
on a target task. The idea is to start by learning small and easy aspects
of the target task and then to gradually increase the difficulty level
(Bengio et al., 2009; Narvekar et al., 2016). For instance, Florensa et al.
(2018) use generative adversarial training to automatically generate
goals for a contextual policy such that they are always at the appropriate
level of difficulty. As the difficulty and number of tasks increase, one
possibility to satisfy the bias-overfitting tradeoff is to consider network transformations through learning.

10.3 Learning without explicit reward function

In reinforcement learning, the reward function defines the goals to be


achieved by the agent (for a given environment and a given discount
factor). Due to the complexity of environments in practical applications,
defining a reward function can turn out to be rather complicated. There
are two other possibilities: (i) given demonstrations of the desired task,
we can use imitation learning or extract a reward function using inverse
reinforcement learning; (ii) a human may provide feedback on the agent’s
behavior in order to define the task.

10.3.1 Learning from demonstrations


In some circumstances, the agent is only provided with trajectories of
an expert agent (also called the teacher), without rewards. Given an
observed behavior, the goal is to have the agent perform similarly. Two
approaches are possible:

• Imitation learning uses supervised learning to map states to actions from the observations of the expert’s behavior (e.g., Giusti et al., 2016); a minimal behavior cloning sketch is given after this list. Among other applications, this approach has been used for self-driving cars to map raw pixels directly to steering commands thanks to a deep neural network (Bojarski et al., 2016).

• Inverse reinforcement learning (IRL) determines a possible reward function given observations of optimal behavior. When the
system dynamics is known (except the reward function), this is
an appealing approach particularly when the reward function
provides the most generalizable definition of the task (Ng, Russell,
et al., 2000; Abbeel and Ng, 2004). For example, let us consider a
large MDP for which the expert always ends up transitioning to
the same state. In that context, one may be able to easily infer,
from only a few trajectories, what the probable goal of the task
is (a reward function that explains the behavior of the teacher),
as opposed to directly learning the policy via imitation learning,


which is much less efficient. Note that recent works bypass the
requirement of the knowledge of the system dynamics (Boularias
et al., 2011; Kalakrishnan et al., 2013; Finn et al., 2016b) by using
techniques based on the principle of maximum causal entropy
(Ziebart, 2010).
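
The minimal behavior cloning sketch referred to in the first bullet above is given below. It assumes that expert state-action pairs are already available as tensors and that actions are discrete; the network architecture, optimizer and hyperparameters are arbitrary illustrative choices.

```python
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader

def behavior_cloning(expert_states, expert_actions, obs_dim, n_actions,
                     epochs=50, lr=1e-3):
    """Supervised imitation: fit pi(a|s) to the expert's state-action pairs.

    expert_states:  float tensor of shape (N, obs_dim)
    expert_actions: long tensor of shape (N,) with discrete expert actions
    """
    policy = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                           nn.Linear(128, n_actions))
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    loader = DataLoader(TensorDataset(expert_states, expert_actions),
                        batch_size=64, shuffle=True)

    for _ in range(epochs):
        for s, a in loader:
            logits = policy(s)
            loss = nn.functional.cross_entropy(logits, a)  # match the expert's action
            opt.zero_grad()
            loss.backward()
            opt.step()
    return policy

# Usage (with made-up data): actions for new states are then the argmax of the logits.
# policy = behavior_cloning(states, actions, obs_dim=8, n_actions=4)
# greedy_action = policy(new_state).argmax(dim=-1)
```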

A combination of the two approaches has also been investigated by


Neu and Szepesvári, 2012; Ho and Ermon, 2016. In particular, Ho and
Ermon, 2016 use adversarial methods to learn a discriminator (i.e.,
a reward function) such that the policy matches the distribution of
demonstrative samples.
It is important to note that in many real-world applications, the
teacher is not exactly in the same context as the agent. Thus, transfer
learning may also be of crucial importance (Schulman et al., 2016; Liu
et al., 2017).
Another setting requires the agent to learn directly from a sequence
of observations without corresponding actions (and possibly in a slightly
different context). This may be done in a meta-learning setting by
providing positive reward to the agent when it performs as it is expected
based on the demonstration of the teacher. The agent can then act
based on new unseen trajectories of the teacher, with the objective that it can generalize sufficiently well to perform new tasks (Paine et al., 2018).

10.3.2 Learning from direct feedback


Learning from feedback investigates how an agent can interactively learn
behaviors from a human teacher who provides positive and negative
feedback signals. In order to learn complex behaviors, feedback from a human trainer has the potential to be more effective than a reward function defined a priori (MacGlashan et al., 2017; Warnell et al., 2017).
This setting can be related to the idea of curriculum learning discussed
in §10.2.3.
In the work of Hadfield-Menell et al., 2016, the cooperative inverse
reinforcement learning framework considers a two-player game between
a human and a robot interacting with an environment with the purpose
of maximizing the human’s reward function. In the work of Christiano et al., 2017, it is shown how learning a separate reward model using
supervised learning lets us significantly reduce the amount of feedback
required from a human teacher. They also present the first practical
applications of using human feedback in the context of deep RL to solve
tasks with a high dimensional observation space.

10.4 Multi-agent systems

A multi-agent system is composed of multiple interacting agents within


an environment (Littman, 1994).

Definition 10.2. A multi-agent POMDP with N agents is a tuple (S, A1, . . . , AN, T, R1, . . . , RN, Ω, O1, . . . , ON, γ) where:

• S is a finite set of states {1, . . . , NS} (describing the possible configurations of all agents),

• A = A1 × . . . × AN is a finite set of actions {1, . . . , NA},

• T : S × A × S → [0, 1] is the transition function (set of conditional transition probabilities between states),

• ∀i, Ri : S × Ai × S → R is the reward function for agent i, where R is a continuous set of possible rewards in a range Rmax ∈ R+ (e.g., [0, Rmax] without loss of generality),

• Ω is a finite set of observations {1, . . . , NΩ},

• ∀i, Oi : S × Ω → [0, 1] is a set of conditional observation probabilities, and

• γ ∈ [0, 1) is the discount factor.

For this type of system, many different settings can be considered.

• Collaborative versus non-collaborative setting. In a pure collaborative setting, agents have a shared reward measurement (Ri = Rj, ∀ i, j ∈ [1, . . . , N]). In a mixed or non-collaborative (possibly adversarial) setting each agent obtains different rewards. In both
cases, each agent i aims to maximize a discounted sum of its rewards ∑_{t=0}^{H} γ^t r_t^{(i)}.

• Decentralized versus centralized setting. In a decentralized setting, each agent selects its own action conditioned only on its local information. When collaboration is beneficial, this decentralized setting can lead to the emergence of communication between agents in order to share information (e.g., Sukhbaatar et al., 2016). In a centralized setting, the RL algorithm has access to all observations ω(i) and all rewards r(i). The problem can be reduced to a single-agent RL problem on the condition that a single objective can be defined (in a purely collaborative setting, the unique objective is straightforward). Note that even when a centralized approach can be considered (depending on the problem), an architecture that does not make use of the multi-agent structure usually leads to sub-optimal learning (e.g., Sunehag et al., 2017).

In general, multi-agent systems are challenging because agents


are independently updating their policies as learning progresses, and
therefore the environment appears non-stationary to any particular
agent. For training one particular agent, one approach is to select
randomly the policies of all other agents from a pool of previously
learned policies. This can stabilize training of the agent that is learning
and prevent overfitting to the current policy of the other agents (Silver
et al., 2016a).
In addition, from the perspective of a given agent, the environment is
usually strongly stochastic even with a known, fixed policy for all other
agents. Indeed, any given agent does not know how the other agents will
act and consequently, it doesn’t know how its own actions contribute
to the rewards it obtains. This can be partly explained due to partial
observability, and partly due to the intrinsic stochasticity of the policies
followed by other agents (e.g., when there is a high level of exploration).
For these reasons, a high variance of the expected global return is
observed, which makes learning challenging (particularly when used in
conjunction with bootstrapping). In the context of the collaborative
setting, a common approach is to use an actor-critic architecture with a


centralized critic during learning and decentralized actor (the agents can
be deployed independently). These topics have already been investigated
in works by Foerster et al., 2017a; Sunehag et al., 2017; Lowe et al.,
2017 as well as in the related work discussed in these papers. Other
works have shown how it is possible to take into account a term that
either considers the learning of the other agent (Foerster et al., 2018)
or an internal model that predicts the actions of other agents (Jaques
et al., 2018).
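
As a rough sketch of this centralized-critic idea (inspired by, but not reproducing, the works cited above), the critic below takes the joint observations and action distributions of all agents as input during training, while each actor only conditions on its own local observation; all sizes and layer choices are placeholders.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Decentralized actor: acts from its own local observation only."""
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                 nn.Linear(64, n_actions))
    def forward(self, local_obs):
        return torch.softmax(self.net(local_obs), dim=-1)

class CentralizedCritic(nn.Module):
    """Centralized critic: used only during training, sees all agents."""
    def __init__(self, n_agents, obs_dim, n_actions):
        super().__init__()
        joint_dim = n_agents * (obs_dim + n_actions)
        self.net = nn.Sequential(nn.Linear(joint_dim, 128), nn.ReLU(),
                                 nn.Linear(128, 1))
    def forward(self, all_obs, all_action_probs):
        # all_obs: (batch, n_agents, obs_dim); all_action_probs: (batch, n_agents, n_actions)
        joint = torch.cat([all_obs.flatten(1), all_action_probs.flatten(1)], dim=-1)
        return self.net(joint)                  # a single joint value estimate

n_agents, obs_dim, n_actions = 3, 10, 5
actors = [Actor(obs_dim, n_actions) for _ in range(n_agents)]
critic = CentralizedCritic(n_agents, obs_dim, n_actions)

obs = torch.randn(4, n_agents, obs_dim)        # a batch of joint observations
probs = torch.stack([actors[i](obs[:, i]) for i in range(n_agents)], dim=1)
value = critic(obs, probs)
print(value.shape)  # torch.Size([4, 1])
```
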
Deep RL agents are able to achieve human-level performance in 3D
multi-player first-person video games such as Quake III Arena Capture
the Flag (Jaderberg et al., 2018). Thus, techniques from deep RL have
a large potential for many real-world tasks that require multiple agents
to cooperate in domains such as robotics, self-driving cars, etc.
11
Perspectives on deep reinforcement learning

In this section, we first mention some of the main successes of deep


RL. Then, we describe some of the main challenges for tackling an even
wider range of real-world problems. Finally, we discuss some parallels
that can be found between deep RL and neuroscience.

11.1 Successes of deep reinforcement learning

Deep RL techniques have demonstrated their ability to tackle a wide


range of problems that were previously unsolved. Some of the most
renowned achievements are

• beating previous computer programs in the game of backgammon (Tesauro, 1995),

• attaining superhuman-level performance in playing Atari games from the pixels (Mnih et al., 2015),

• mastering the game of Go (Silver et al., 2016a), as well as

• beating professional poker players in the game of heads up no-limit Texas hold’em: Libratus (Brown and Sandholm, 2017) and Deepstack (Moravčik et al., 2017).

These achievements in popular games are important because they


show the potential of deep RL in a variety of complex and diverse
tasks that require working with high-dimensional inputs. Deep RL has
also shown lots of potential for real-world applications such as robotics
(Kalashnikov et al., 2018), self-driving cars (You et al., 2017), finance
(Deng et al., 2017), smart grids (François-Lavet et al., 2016b), dialogue
systems (Fazel-Zarandi et al., 2017), etc. In fact, Deep RL systems are
already in production environments. For example, Gauci et al. (2018)
describe how Facebook uses Deep RL such as for pushing notifications
and for faster video loading with smart prefetching.
RL is also applicable to fields where one could think that supervised
learning alone is sufficient, such as sequence prediction (Ranzato et al.,
2015; Bahdanau et al., 2016). Designing the right neural architecture for
supervised learning tasks has also been cast as an RL problem (Zoph
and Le, 2016). Note that those types of tasks can also be tackled with
evolutionary strategies (Miikkulainen et al., 2017; Real et al., 2017).
Finally, it should be mentioned that deep RL has applications in
classic and fundamental algorithmic problems in the field of computer
science, such as the travelling salesman problem (Bello et al., 2016).
This is an NP-complete problem and the possibility to tackle it with
deep RL shows the potential impact that it could have on several other
NP-complete problems, on the condition that the structure of these
problems can be exploited.

11.2 Challenges of applying reinforcement learning to real-world problems

The algorithms discussed in this introduction to deep RL can, in


principle, be used to solve many different types of real-world problems. In
practice, even in the case where the task is well defined (explicit reward
function), there is one fundamental difficulty: it is often not possible to
let an agent interact freely and sufficiently in the actual environment
(or set of environments), due to either safety, cost or time constraints.
We can distinguish two main cases in real-world applications:
1. The agent may not be able to interact with the true environment
but only with an inaccurate simulation of it. This scenario occurs
for instance in robotics (Zhu et al., 2016; Gu et al., 2017a). When first learning in a simulation, the difference with the real-world domain is known as the reality gap (see e.g. Jakobi et al., 1995).

2. The acquisition of new observations may not be possible anymore (e.g., the batch setting). This scenario happens for instance in medical trials, in tasks with a dependence on weather conditions or in trading markets (e.g., energy markets and stock markets); a minimal sketch of this batch setting is given just after this list.
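As a minimal illustration of this second case, the sketch below runs fitted Q-iteration (in the spirit of the fitted Q-learning of Chapter 4) on a small, fixed batch of transitions, with no further interaction with the environment. The toy MDP (3 states, 2 actions), its transitions and all numerical values are hypothetical and serve only to illustrate the batch setting:

import numpy as np

# Fixed batch of transitions (state, action, reward, next_state, terminal):
# this is all the data the agent will ever see.
batch = [
    (0, 1, 0.0, 1, False),
    (1, 1, 0.0, 2, False),
    (1, 0, 0.0, 0, False),
    (2, 0, 1.0, 2, True),
    (0, 0, 0.0, 0, False),
]
n_states, n_actions, gamma = 3, 2, 0.95

Q = np.zeros((n_states, n_actions))
for _ in range(200):                      # fitted Q-iteration sweeps
    targets = np.copy(Q)
    for s, a, r, s_next, done in batch:
        bootstrap = 0.0 if done else gamma * np.max(Q[s_next])
        targets[s, a] = r + bootstrap     # regression target for (s, a)
    Q = targets   # with function approximation, a regressor would be refit here

print(Q)   # the greedy policy w.r.t. Q is the best that can be done without new data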

Note that a combination of the two scenarios is also possible in the case
where the dynamics of the environment may be simulated but where
there is a dependence on an exogenous time series that is only accessible
via limited data (François-Lavet et al., 2016b).
In order to deal with these limitations, different elements are
important:

• One can aim to develop a simulator that is as accurate as possible.

• One can design the learning algorithm so as to improve generalization and/or use transfer learning methods (see Chapter 7).

11.3 Relations between deep RL and neuroscience

One of the interesting aspects of deep RL is its relations to neuroscience. During the development of algorithms able to solve challenging sequential decision-making tasks, biological plausibility was not a requirement from an engineering standpoint. However, biological intelligence has been a key inspiration for many of the most successful algorithms. Indeed, even the ideas of reinforcement and deep learning have strong links with neuroscience and biological intelligence.

Reinforcement In general, RL has had a rich conceptual relationship to neuroscience. RL has used neuroscience as an inspiration and it has also been a tool to explain neuroscience phenomena (Niv, 2009). RL models have also been used as a tool in the related field of neuroeconomics (Camerer et al., 2005), which uses models of human decision-making to inform economic analyses.

The idea of reinforcement (or at least the term) can be traced back
to the work of Pavlov (1927) in the context of animal behavior. In
the Pavlovian conditioning model, reinforcement is described as the
strengthening/weakening effect of a behavior whenever that behavior is
preceded by a specific stimulus. The Pavlovian conditioning model led
to the development of the Rescorla-Wagner Theory (Rescorla, Wagner,
et al., 1972), which assumed that learning is driven by the error between
predicted and received reward, among other prediction models. In
computational RL, those concepts have been at the heart of many
different algorithms, such as in the development of temporal-difference
(TD) methods (Sutton, 1984; Schultz et al., 1997; Russek et al., 2017).
These connections were further strengthened when it was found that
the dopamine neurons in the brain act in a similar manner to TD-like
updates to direct learning in the brain (Schultz et al., 1997).
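As a schematic reminder of this link (using the notation of the earlier chapters, and without any claim of biological accuracy), the TD error and the corresponding value update can be written as

$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t), \qquad V(s_t) \leftarrow V(s_t) + \alpha \, \delta_t,$$

where $\alpha$ is a learning rate; the Rescorla-Wagner model similarly updates its prediction in proportion to the difference between the received and the predicted reward.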
Driven by such connections, many aspects of reinforcement learning
have also been investigated directly to explain certain phenomena in
the brain. For instance, computational models have been an inspiration
to explain cognitive phenomena such as exploration (Cohen et al., 2007)
and temporal discounting of rewards (Story et al., 2014). In cognitive
science, Kahneman (2011) has also described that there is a dichotomy
between two modes of thoughts: a "System 1" that is fast and instinctive
and a "System 2" that is slower and more logical. In deep reinforcement,
a similar dichotomy can be observed when we consider the model-free
and the model-based approaches. As another example, the idea of having
a meaningful abstract representation in deep RL can also be related to
how animals (including humans) think. Indeed, a conscious thought at
a particular time instant can be seen as a low-dimensional combination
of a few concepts in order to take decisions (Bengio, 2017).
There is a dense and rich literature about the connections between
RL and neuroscience and, as such, the reader is referred to the work of
Sutton and Barto (2017), Niv (2009), Lee et al. (2012), Holroyd and
Coles (2002), Dayan and Niv (2008), Dayan and Daw (2008), Montague
(2013), and Niv and Montague (2009) for an in-depth history of the
development of reinforcement learning and its relations to neuroscience.

Deep learning Deep learning also finds its origin in models of neural processing in the brain of biological entities. However, subsequent developments are such that deep learning has become partly incompatible with current knowledge of neurobiology (Bengio et al., 2015). There are nonetheless many parallels. One such example is the convolutional structure used in deep learning, which is inspired by the organization of the animal visual cortex (Fukushima and Miyake, 1982; LeCun et al., 1998).

Much work is still needed to bridge the gap between machine learning and the general intelligence of humans (or even animals). Looking back at all the achievements obtained by taking inspiration from neuroscience, it is natural to believe that further understanding of biological brains could play a vital role in building more powerful algorithms, and conversely. In particular, we refer the reader to the survey by Hassabis et al. (2017), where the bidirectional influence between deep RL and neuroscience is discussed.
12 Conclusion

Sequential decision-making remains an active field of research with many theoretical, methodological and experimental challenges still open. The important developments in the field of deep learning have contributed to many new avenues where RL methods and deep learning are combined. In particular, deep learning has brought important generalization capabilities, which opens new possibilities to work with large, high-dimensional state and/or action spaces. There is every reason to think that this development will continue in the coming years with more efficient algorithms and many new applications.

12.1 Future development of deep RL

In deep RL, we have emphasized in this manuscript that one of the central questions is the concept of generalization. To this end, new developments in the field of deep RL will surely continue the current trend of taking explicit algorithms and making them differentiable, so that they can be embedded in a specific form of neural network and trained end-to-end. This can bring algorithms with richer and smarter structures that would be better suited for reasoning on a more abstract level, which would make it possible to tackle an even wider range of
applications than they currently do today. Smart architectures could also be used for hierarchical learning, where much progress is still needed in the domain of temporal abstraction.
We also expect to see deep RL algorithms going in the direction of meta-learning and lifelong learning, where previous knowledge (e.g., in the form of pre-trained networks) can be embedded so as to increase performance and reduce training time. Another key challenge is to improve current transfer learning abilities between simulations and real-world cases. This would allow learning complex decision-making problems in simulations (with the possibility to gather samples in a flexible way), and then using the learned skills in real-world environments, with applications in robotics, self-driving cars, etc.
Finally, we expect deep RL techniques to develop improved curiosity-driven abilities so as to better discover their environment by themselves.

12.2 Applications and societal impact of deep RL and artificial intelligence in general

In terms of applications, many areas are likely to be impacted by the possibilities brought by deep RL. It is always difficult to predict the timelines for the different developments, but the current interest in deep RL could be the beginning of profound transformations in information and communication technologies, with applications in clinical decision support, marketing, finance, resource management, autonomous driving, robotics, smart grids, and more.
Current developments in artificial intelligence (both in deep RL and in machine learning in general) follow the development of many tools brought by information and communications technologies. As with all new technologies, these developments come with different potential opportunities and challenges for our society.
On the positive side, algorithms based on (deep) reinforcement learning promise great value to people and society. They have the potential to enhance the quality of life by automating tedious and exhausting tasks with robots (Levine et al., 2016; Gandhi et al., 2017; Pinto et al., 2017). They may improve education by providing adaptive
content and keeping students engaged (Mandel et al., 2014). They can improve public health with, for instance, intelligent clinical decision-making (Fonteneau et al., 2008; Bennett and Hauser, 2013). They may provide robust solutions to some of the challenges of self-driving cars (Bojarski et al., 2016; You et al., 2017). They also have the possibility to help manage ecological resources (Dietterich, 2009) or reduce greenhouse gas emissions by, e.g., optimizing traffic (Li et al., 2016). They have applications in computer graphics, such as for character animation (Peng et al., 2017b). They also have applications in finance (Deng et al., 2017), smart grids (François-Lavet, 2017), etc.
However, we need to be careful that deep RL algorithms are safe, reliable and predictable (Amodei et al., 2016; Bostrom, 2017). As a simple example, to capture what we want an agent to do in deep RL, we frequently end up, in practice, designing the reward function somewhat arbitrarily. Often this works well, but sometimes it produces unexpected, and potentially catastrophic, behaviors. For instance, to remove a certain invasive species from an environment, one may design an agent that obtains a reward every time it removes one of these organisms. However, it is likely that, in order to obtain the maximum cumulative reward, the agent will learn to let the invasive species develop first and only then eliminate many of the invasive organisms, which is of course not the intended behavior. All aspects related to safe exploration are also potential concerns when deep RL algorithms are deployed in real-life settings.
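As a small and purely illustrative sketch of this failure mode (the toy "population" MDP below and all of its numbers are assumptions made for this example, not taken from the literature), value iteration shows that an agent rewarded per removal never eradicates the species: at the lowest population size, waiting is worth more than removing because it preserves the supply of future rewards.

import numpy as np

# Hypothetical toy MDP: the state is the population size of the invasive
# species (0..N). Action "remove" kills one organism and yields +1 reward;
# action "wait" lets the population grow by one and yields no reward.
# Population 0 is absorbing with no further reward.
N, gamma = 5, 0.98
V = np.zeros(N + 1)                      # V[p]: value at population size p
for _ in range(2000):                    # value iteration
    new_V = np.zeros_like(V)
    for p in range(1, N + 1):
        q_remove = 1.0 + gamma * V[p - 1]
        q_wait = gamma * V[min(p + 1, N)]
        new_V[p] = max(q_remove, q_wait)
    V = new_V

policy = ["remove" if 1.0 + gamma * V[p - 1] >= gamma * V[min(p + 1, N)]
          else "wait" for p in range(1, N + 1)]
print(policy)   # ['wait', 'remove', 'remove', 'remove', 'remove']:
                # at population 1 the optimal action is to wait, so the
                # species is never eradicated and the reward is harvested forever.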
In addition, as with all powerful tools, deep RL algorithms also
bring societal and ethical challenges (Brundage et al., 2018), raising
the question of how they can be used for the benefit of all. Even though
different interpretations can come into play when one discusses human
sciences, we mention in this conclusion some of the potential issues that
may need further investigation.
The ethical use of artificial intelligence is a broad concern. The
specificity of RL as compared to supervised learning techniques is that
it can naturally deal with sequences of interactions, which is ideal for
chatbots, smart assistants, etc. As is the case with most technologies,
regulation should, at some point, ensure a positive impact of its usage.

In addition, machine learning and deep RL algorithms will likely lead to more automation and robotisation than is currently possible. This is clearly a concern in the context of autonomous weapons, for instance (Walsh, 2017). Automation also influences the economy, the job market and our society as a whole. A key challenge for humanity is to make sure that future technological developments in artificial intelligence do not create an ecological crisis (Harari, 2014) or deepen the inequalities in our society, with potential social and economic instabilities (Piketty, 2013).
We are still at the very first steps of deep RL and artificial intelligence
in general. The future is hard to predict; however, it is key that the
potential issues related to the use of these algorithms are progressively
taken into consideration in public policies. If that is the case, these new
algorithms can have a positive impact on our society.
Appendices
A Deep RL frameworks

Here is a list of some well-known frameworks used for deep RL:

• DeeR (François-Lavet et al., 2016a) is focused on being (i) easily accessible and (ii) modular for researchers.

• Dopamine (Bellemare et al., 2018) provides standard algorithms along with baselines for the ATARI games.

• ELF (Tian et al., 2017) is a research platform for deep RL, aimed mainly at real-time strategy games.

• OpenAI baselines (Dhariwal et al., 2017) is a set of popular deep RL algorithms, including DDPG, TRPO, PPO and ACKTR. The focus of this framework is to provide implementations of baselines.

• PyBrain (Schaul et al., 2010) is a machine learning library with some RL support.

• rllab (Duan et al., 2016a) provides a set of benchmarked implementations of deep RL algorithms.

• TensorForce (Schaarschmidt et al., 2017) is a framework for deep RL built around Tensorflow with several algorithm implementations. It aims at moving reinforcement computations into the Tensorflow graph for performance gains and efficiency. As such, it is heavily tied to the Tensorflow deep learning library. It provides many algorithm implementations including TRPO, DQN, PPO, and A3C.

Even though they are not tailored specifically for deep RL, we can also cite the following two frameworks for reinforcement learning:

• RL-Glue (Tanner and White, 2009) provides a standard interface that allows connecting RL agents, environments, and experiment programs together.
• RLPy (Geramifard et al., 2015) is a framework focused on value-based RL using linear function approximators with discrete actions.

Table 1 provides a summary of some properties of the aforementioned libraries.

Framework          Deep RL   Python interface   Automatic GPU support
DeeR               yes       yes                yes
Dopamine           yes       yes                yes
ELF                yes       no                 yes
OpenAI baselines   yes       yes                yes
PyBrain            yes       yes                no
RL-Glue            no        yes                no
RLPy               no        yes                no
rllab              yes       yes                yes
TensorForce        yes       yes                yes

Table 1: Summary of some characteristics for a few existing RL frameworks.
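Most of these libraries are organized around the same agent-environment interaction loop. As an illustration (not tied to any specific framework above), here is a minimal sketch of such a loop written against the OpenAI Gym interface (Brockman et al., 2016), which several of the libraries above can interoperate with; the environment name is a placeholder, the random policy merely stands in for a trained agent, and the exact API may differ across Gym versions:

import gym

env = gym.make("CartPole-v0")                    # placeholder environment
obs = env.reset()
done, episode_return = False, 0.0
while not done:
    action = env.action_space.sample()           # random policy as a stand-in
    obs, reward, done, info = env.step(action)   # one interaction step
    episode_return += reward
env.close()
print("Episode return:", episode_return)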


References

Abadi, M., A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al. 2016. “TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems”. arXiv preprint arXiv:1603.04467.
Abbeel, P. and A. Y. Ng. 2004. “Apprenticeship learning via inverse rein-
forcement learning”. In: Proceedings of the twenty-first international
conference on Machine learning. ACM. 1.
Amari, S. 1998. “Natural Gradient Works Efficiently in Learning”.
Neural Computation. 10(2): 251–276.
Amodei, D., C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and
D. Mané. 2016. “Concrete problems in AI safety”. arXiv preprint
arXiv:1606.06565.
Anderson, T. W., T. W. Anderson, T. W. Anderson, T. W. Anderson,
and E.-U. Mathématicien. 1958. An introduction to multivariate
statistical analysis. Vol. 2. Wiley New York.
Aytar, Y., T. Pfaff, D. Budden, T. L. Paine, Z. Wang, and N. de Freitas.
2018. “Playing hard exploration games by watching YouTube”. arXiv
preprint arXiv:1805.11592.
Bacon, P.-L., J. Harb, and D. Precup. 2016. “The option-critic
architecture”. arXiv preprint arXiv:1609.05140.

Bahdanau, D., P. Brakel, K. Xu, A. Goyal, R. Lowe, J. Pineau, A.
Courville, and Y. Bengio. 2016. “An actor-critic algorithm for
sequence prediction”. arXiv preprint arXiv:1607.07086.
Baird, L. 1995. “Residual algorithms: Reinforcement learning with
function approximation”. In: ICML. 30–37.
Baker, M. 2016. “1,500 scientists lift the lid on reproducibility”. Nature
News. 533(7604): 452.
Bartlett, P. L. and S. Mendelson. 2002. “Rademacher and Gaussian
complexities: Risk bounds and structural results”. Journal of
Machine Learning Research. 3(Nov): 463–482.
Barto, A. G., R. S. Sutton, and C. W. Anderson. 1983. “Neuronlike
adaptive elements that can solve difficult learning control problems”.
IEEE transactions on systems, man, and cybernetics. (5): 834–846.
Beattie, C., J. Z. Leibo, D. Teplyashin, T. Ward, M. Wainwright, H.
Küttler, A. Lefrancq, S. Green, V. Valdés, A. Sadik, et al. 2016.
“DeepMind Lab”. arXiv preprint arXiv:1612.03801.
Bellemare, M. G., P. S. Castro, C. Gelada, K. Saurabh, and S. Moitra.
2018. “Dopamine”. https://github.com/google/dopamine.
Bellemare, M. G., W. Dabney, and R. Munos. 2017. “A distri-
butional perspective on reinforcement learning”. arXiv preprint
arXiv:1707.06887.
Bellemare, M. G., Y. Naddaf, J. Veness, and M. Bowling. 2013. “The
Arcade Learning Environment: An evaluation platform for general
agents.” Journal of Artificial Intelligence Research. 47: 253–279.
Bellemare, M. G., S. Srinivasan, G. Ostrovski, T. Schaul, D. Saxton, and
R. Munos. 2016. “Unifying Count-Based Exploration and Intrinsic
Motivation”. arXiv preprint arXiv:1606.01868.
Bellman, R. 1957a. “A Markovian decision process”. Journal of
Mathematics and Mechanics: 679–684.
Bellman, R. 1957b. “Dynamic Programming”.
Bellman, R. E. and S. E. Dreyfus. 1962. “Applied dynamic program-
ming”.
Bello, I., H. Pham, Q. V. Le, M. Norouzi, and S. Bengio. 2016. “Neural
Combinatorial Optimization with Reinforcement Learning”. arXiv
preprint arXiv:1611.09940.
Bengio, Y. 2017. “The Consciousness Prior”. arXiv preprint
arXiv:1709.08568.
Bengio, Y., D.-H. Lee, J. Bornschein, T. Mesnard, and Z. Lin. 2015.
“Towards biologically plausible deep learning”. arXiv preprint
arXiv:1502.04156.
Bengio, Y., J. Louradour, R. Collobert, and J. Weston. 2009. “Cur-
riculum learning”. In: Proceedings of the 26th annual international
conference on machine learning. ACM. 41–48.
Bennett, C. C. and K. Hauser. 2013. “Artificial intelligence framework
for simulating clinical decision-making: A Markov decision process
approach”. Artificial intelligence in medicine. 57(1): 9–19.
Bertsekas, D. P., D. P. Bertsekas, D. P. Bertsekas, and D. P. Bertsekas.
1995. Dynamic programming and optimal control. Vol. 1. No. 2.
Athena scientific Belmont, MA.
Bojarski, M., D. Del Testa, D. Dworakowski, B. Firner, B. Flepp,
P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, et al.
2016. “End to end learning for self-driving cars”. arXiv preprint
arXiv:1604.07316.
Bostrom, N. 2017. Superintelligence. Dunod.
Bouckaert, R. R. 2003. “Choosing between two learning algorithms
based on calibrated tests”. In: Proceedings of the 20th International
Conference on Machine Learning (ICML-03). 51–58.
Bouckaert, R. R. and E. Frank. 2004. “Evaluating the replicability of
significance tests for comparing learning algorithms”. In: PAKDD.
Springer. 3–12.
Boularias, A., J. Kober, and J. Peters. 2011. “Relative Entropy Inverse
Reinforcement Learning.” In: AISTATS. 182–189.
Boyan, J. A. and A. W. Moore. 1995. “Generalization in reinforcement
learning: Safely approximating the value function”. In: Advances in
neural information processing systems. 369–376.
Brafman, R. I. and M. Tennenholtz. 2003. “R-max-a general polynomial
time algorithm for near-optimal reinforcement learning”. The
Journal of Machine Learning Research. 3: 213–231.
Branavan, S., N. Kushman, T. Lei, and R. Barzilay. 2012. “Learning
high-level planning from text”. In: Proceedings of the 50th Annual
Meeting of the Association for Computational Linguistics: Long
Papers-Volume 1. Association for Computational Linguistics. 126–
135.
Braziunas, D. 2003. “POMDP solution methods”. University of Toronto,
Tech. Rep.
Brockman, G., V. Cheung, L. Pettersson, J. Schneider, J. Schulman,
J. Tang, and W. Zaremba. 2016. “OpenAI Gym”.
Brown, N. and T. Sandholm. 2017. “Libratus: The Superhuman AI
for No-Limit Poker”. International Joint Conference on Artificial
Intelligence (IJCAI-17).
Browne, C. B., E. Powley, D. Whitehouse, S. M. Lucas, P. I. Cowling,
P. Rohlfshagen, S. Tavener, D. Perez, S. Samothrakis, and S.
Colton. 2012. “A survey of monte carlo tree search methods”. IEEE
Transactions on Computational Intelligence and AI in games. 4(1):
1–43.
Brügmann, B. 1993. “Monte carlo go”. Tech. rep. Citeseer.
Brundage, M., S. Avin, J. Clark, H. Toner, P. Eckersley, B. Garfinkel,
A. Dafoe, P. Scharre, T. Zeitzoff, B. Filar, et al. 2018. “The
Malicious Use of Artificial Intelligence: Forecasting, Prevention,
and Mitigation”. arXiv preprint arXiv:1802.07228.
Brys, T., A. Harutyunyan, P. Vrancx, M. E. Taylor, D. Kudenko, and
A. Nowé. 2014. “Multi-objectivization of reinforcement learning
problems by reward shaping”. In: Neural Networks (IJCNN), 2014
International Joint Conference on. IEEE. 2315–2322.
Bubeck, S., R. Munos, and G. Stoltz. 2011. “Pure exploration in
finitely-armed and continuous-armed bandits”. Theoretical Computer
Science. 412(19): 1832–1852.
Burda, Y., H. Edwards, A. Storkey, and O. Klimov. 2018. “Exploration
by Random Network Distillation”. arXiv preprint arXiv:1810.12894.
Camerer, C., G. Loewenstein, and D. Prelec. 2005. “Neuroeconomics:
How neuroscience can inform economics”. Journal of economic
Literature. 43(1): 9–64.
Campbell, M., A. J. Hoane, and F.-h. Hsu. 2002. “Deep blue”. Artificial
intelligence. 134(1-2): 57–83.
Casadevall, A. and F. C. Fang. 2010. “Reproducible science”.
Castronovo, M., V. François-Lavet, R. Fonteneau, D. Ernst, and A.
Couëtoux. 2017. “Approximate Bayes Optimal Policy Search using
Neural Networks”. In: 9th International Conference on Agents and
Artificial Intelligence (ICAART 2017).
Chebotar, Y., A. Handa, V. Makoviychuk, M. Macklin, J. Issac, N.
Ratliff, and D. Fox. 2018. “Closing the Sim-to-Real Loop: Adapting
Simulation Randomization with Real World Experience”. arXiv
preprint arXiv:1810.05687.
Chen, T., I. Goodfellow, and J. Shlens. 2015. “Net2net: Accelerating
learning via knowledge transfer”. arXiv preprint arXiv:1511.05641.
Chen, X., C. Liu, and D. Song. 2017. “Learning Neural Programs To
Parse Programs”. arXiv preprint arXiv:1706.01284.
Chiappa, S., S. Racaniere, D. Wierstra, and S. Mohamed. 2017. “Recur-
rent Environment Simulators”. arXiv preprint arXiv:1704.02254.
Christiano, P., J. Leike, T. B. Brown, M. Martic, S. Legg, and D. Amodei.
2017. “Deep reinforcement learning from human preferences”. arXiv
preprint arXiv:1706.03741.
Christopher, M. B. 2006. Pattern recognition and machine learning.
Springer.
Cohen, J. D., S. M. McClure, and J. Y. Angela. 2007. “Should I stay or
should I go? How the human brain manages the trade-off between
exploitation and exploration”. Philosophical Transactions of the
Royal Society of London B: Biological Sciences. 362(1481): 933–942.
Cortes, C. and V. Vapnik. 1995. “Support-vector networks”. Machine
learning. 20(3): 273–297.
Coumans, E., Y. Bai, et al. 2016. “Bullet”. http://pybullet.org/.
Da Silva, B., G. Konidaris, and A. Barto. 2012. “Learning parameterized
skills”. arXiv preprint arXiv:1206.6398.
Dabney, W., M. Rowland, M. G. Bellemare, and R. Munos. 2017.
“Distributional Reinforcement Learning with Quantile Regression”.
arXiv preprint arXiv:1710.10044.
Dayan, P. and N. D. Daw. 2008. “Decision theory, reinforcement learning,
and the brain”. Cognitive, Affective, & Behavioral Neuroscience.
8(4): 429–453.
Dayan, P. and Y. Niv. 2008. “Reinforcement learning: the good, the
bad and the ugly”. Current opinion in neurobiology. 18(2): 185–196.
Dearden, R., N. Friedman, and D. Andre. 1999. “Model based
Bayesian exploration”. In: Proceedings of the Fifteenth conference on
Uncertainty in artificial intelligence. Morgan Kaufmann Publishers
Inc. 150–159.
Dearden, R., N. Friedman, and S. Russell. 1998. “Bayesian Q-learning”.
Deisenroth, M. and C. E. Rasmussen. 2011. “PILCO: A model-based
and data-efficient approach to policy search”. In: Proceedings of
the 28th International Conference on machine learning (ICML-11).
465–472.
Demšar, J. 2006. “Statistical comparisons of classifiers over multiple
data sets”. Journal of Machine learning research. 7(Jan): 1–30.
Deng, Y., F. Bao, Y. Kong, Z. Ren, and Q. Dai. 2017. “Deep
direct reinforcement learning for financial signal representation
and trading”. IEEE transactions on neural networks and learning
systems. 28(3): 653–664.
Dhariwal, P., C. Hesse, M. Plappert, A. Radford, J. Schulman, S. Sidor,
and Y. Wu. 2017. “OpenAI Baselines”.
Dietterich, T. G. 1998. “Approximate statistical tests for comparing
supervised classification learning algorithms”. Neural computation.
10(7): 1895–1923.
Dietterich, T. G. 2009. “Machine learning and ecosystem informatics:
challenges and opportunities”. In: Asian Conference on Machine
Learning. Springer. 1–5.
Dinculescu, M. and D. Precup. 2010. “Approximate predictive represen-
tations of partially observable systems”. In: Proceedings of the 27th
International Conference on Machine Learning (ICML-10). 895–902.
Dosovitskiy, A. and V. Koltun. 2016. “Learning to act by predicting
the future”. arXiv preprint arXiv:1611.01779.
Duan, Y., M. Andrychowicz, B. Stadie, J. Ho, J. Schneider, I. Sutskever,
P. Abbeel, and W. Zaremba. 2017. “One-Shot Imitation Learning”.
arXiv preprint arXiv:1703.07326.
Duan, Y., X. Chen, R. Houthooft, J. Schulman, and P. Abbeel. 2016a.
“Benchmarking deep reinforcement learning for continuous control”.
In: International Conference on Machine Learning. 1329–1338.
Duan, Y., J. Schulman, X. Chen, P. L. Bartlett, I. Sutskever, and
P. Abbeel. 2016b. “RL^2: Fast Reinforcement Learning via Slow
Reinforcement Learning”. arXiv preprint arXiv:1611.02779.
Duchesne, L., E. Karangelos, and L. Wehenkel. 2017. “Machine learning
of real-time power systems reliability management response”.
PowerTech Manchester 2017 Proceedings.
Džeroski, S., L. De Raedt, and K. Driessens. 2001. “Relational
reinforcement learning”. Machine learning. 43(1-2): 7–52.
Erhan, D., Y. Bengio, A. Courville, and P. Vincent. 2009. “Visualizing
higher-layer features of a deep network”. University of Montreal.
1341(3): 1.
Ernst, D., P. Geurts, and L. Wehenkel. 2005. “Tree-based batch mode
reinforcement learning”. In: Journal of Machine Learning Research.
503–556.
Farquhar, G., T. Rocktäschel, M. Igl, and S. Whiteson. 2017. “TreeQN
and ATreeC: Differentiable Tree Planning for Deep Reinforcement
Learning”. arXiv preprint arXiv:1710.11417.
Fazel-Zarandi, M., S.-W. Li, J. Cao, J. Casale, P. Henderson, D. Whitney,
and A. Geramifard. 2017. “Learning Robust Dialog Policies in Noisy
Environments”. arXiv preprint arXiv:1712.04034.
Finn, C., P. Abbeel, and S. Levine. 2017. “Model-agnostic meta-
learning for fast adaptation of deep networks”. arXiv preprint
arXiv:1703.03400.
Finn, C., I. Goodfellow, and S. Levine. 2016a. “Unsupervised learning
for physical interaction through video prediction”. In: Advances In
Neural Information Processing Systems. 64–72.
Finn, C., S. Levine, and P. Abbeel. 2016b. “Guided cost learning: Deep
inverse optimal control via policy optimization”. In: Proceedings of
the 33rd International Conference on Machine Learning. Vol. 48.
Florensa, C., Y. Duan, and P. Abbeel. 2017. “Stochastic neural
networks for hierarchical reinforcement learning”. arXiv preprint
arXiv:1704.03012.
Florensa, C., D. Held, X. Geng, and P. Abbeel. 2018. “Automatic
goal generation for reinforcement learning agents”. In: International
Conference on Machine Learning. 1514–1523.
Foerster, J., R. Y. Chen, M. Al-Shedivat, S. Whiteson, P. Abbeel, and
I. Mordatch. 2018. “Learning with opponent-learning awareness”. In:
Proceedings of the 17th International Conference on Autonomous
Agents and MultiAgent Systems. International Foundation for
Autonomous Agents and Multiagent Systems. 122–130.
Foerster, J., G. Farquhar, T. Afouras, N. Nardelli, and S. Whiteson.
2017a. “Counterfactual Multi-Agent Policy Gradients”. arXiv
preprint arXiv:1705.08926.
Foerster, J., N. Nardelli, G. Farquhar, P. Torr, P. Kohli, S. Whiteson,
et al. 2017b. “Stabilising experience replay for deep multi-agent
reinforcement learning”. arXiv preprint arXiv:1702.08887.
Fonteneau, R., S. A. Murphy, L. Wehenkel, and D. Ernst. 2013. “Batch
mode reinforcement learning based on the synthesis of artificial
trajectories”. Annals of operations research. 208(1): 383–416.
Fonteneau, R., L. Wehenkel, and D. Ernst. 2008. “Variable selection for
dynamic treatment regimes: a reinforcement learning approach”.
Fortunato, M., M. G. Azar, B. Piot, J. Menick, I. Osband, A. Graves,
V. Mnih, R. Munos, D. Hassabis, O. Pietquin, et al. 2017. “Noisy
networks for exploration”. arXiv preprint arXiv:1706.10295.
Fox, R., A. Pakman, and N. Tishby. 2015. “Taming the noise in reinforce-
ment learning via soft updates”. arXiv preprint arXiv:1512.08562.
François-Lavet, V. et al. 2016a. “DeeR”. https://deer.readthedocs.io/.
François-Lavet, V. 2017. “Contributions to deep reinforcement learning
and its applications in smartgrids”. PhD thesis. University of Liege,
Belgium.
François-Lavet, V., Y. Bengio, D. Precup, and J. Pineau. 2018.
“Combined Reinforcement Learning via Abstract Representations”.
arXiv preprint arXiv:1809.04506.
François-Lavet, V., D. Ernst, and F. Raphael. 2017. “On overfitting
and asymptotic bias in batch reinforcement learning with partial
observability”. arXiv preprint arXiv:1709.07796.
François-Lavet, V., R. Fonteneau, and D. Ernst. 2015. “How to Discount
Deep Reinforcement Learning: Towards New Dynamic Strategies”.
arXiv preprint arXiv:1512.02011.
François-Lavet, V., D. Taralla, D. Ernst, and R. Fonteneau. 2016b.
“Deep Reinforcement Learning Solutions for Energy Microgrids
Management”. In: European Workshop on Reinforcement Learning.
Fukushima, K. and S. Miyake. 1982. “Neocognitron: A self-organizing
neural network model for a mechanism of visual pattern recognition”.
In: Competition and cooperation in neural nets. Springer. 267–285.
Gal, Y. and Z. Ghahramani. 2016. “Dropout as a Bayesian Approx-
imation: Representing Model Uncertainty in Deep Learning”. In:
Proceedings of the 33nd International Conference on Machine Learn-
ing, ICML 2016, New York City, NY, USA, June 19-24, 2016. 1050–
1059.
Gandhi, D., L. Pinto, and A. Gupta. 2017. “Learning to Fly by Crashing”.
arXiv preprint arXiv:1704.05588.
Garnelo, M., K. Arulkumaran, and M. Shanahan. 2016. “To-
wards Deep Symbolic Reinforcement Learning”. arXiv preprint
arXiv:1609.05518.
Gauci, J., E. Conti, Y. Liang, K. Virochsiri, Y. He, Z. Kaden,
V. Narayanan, and X. Ye. 2018. “Horizon: Facebook’s Open
Source Applied Reinforcement Learning Platform”. arXiv preprint
arXiv:1811.00260.
Gelly, S., Y. Wang, R. Munos, and O. Teytaud. 2006. “Modification of
UCT with patterns in Monte-Carlo Go”.
Geman, S., E. Bienenstock, and R. Doursat. 1992. “Neural networks
and the bias/variance dilemma”. Neural computation. 4(1): 1–58.
Geramifard, A., C. Dann, R. H. Klein, W. Dabney, and J. P. How. 2015.
“RLPy: A Value-Function-Based Reinforcement Learning Framework
for Education and Research”. Journal of Machine Learning Research.
16: 1573–1578.
Geurts, P., D. Ernst, and L. Wehenkel. 2006. “Extremely randomized
trees”. Machine learning. 63(1): 3–42.
Ghavamzadeh, M., S. Mannor, J. Pineau, A. Tamar, et al. 2015.
“Bayesian reinforcement learning: A survey”. Foundations and
Trends® in Machine Learning. 8(5-6): 359–483.
Giusti, A., J. Guzzi, D. C. Cireşan, F.-L. He, J. P. Rodriguez, F. Fontana,
M. Faessler, C. Forster, J. Schmidhuber, G. Di Caro, et al. 2016.
“A machine learning approach to visual perception of forest trails
for mobile robots”. IEEE Robotics and Automation Letters. 1(2):
661–667.
Goodfellow, I., Y. Bengio, and A. Courville. 2016. Deep learning. MIT
Press.
Goodfellow, I., J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley,
S. Ozair, A. Courville, and Y. Bengio. 2014. “Generative adversarial
nets”. In: Advances in neural information processing systems. 2672–
2680.
Gordon, G. J. 1996. “Stable fitted reinforcement learning”. In: Advances
in neural information processing systems. 1052–1058.
Gordon, G. J. 1999. “Approximate solutions to Markov decision
processes”. Robotics Institute: 228.
Graves, A., G. Wayne, and I. Danihelka. 2014. “Neural turing machines”.
arXiv preprint arXiv:1410.5401.
Gregor, K., D. J. Rezende, and D. Wierstra. 2016. “Variational Intrinsic
Control”. arXiv preprint arXiv:1611.07507.
Gruslys, A., M. G. Azar, M. G. Bellemare, and R. Munos. 2017.
“The Reactor: A Sample-Efficient Actor-Critic Architecture”. arXiv
preprint arXiv:1704.04651.
Gu, S., E. Holly, T. Lillicrap, and S. Levine. 2017a. “Deep reinforcement
learning for robotic manipulation with asynchronous off-policy
updates”. In: Robotics and Automation (ICRA), 2017 IEEE
International Conference on. IEEE. 3389–3396.
Gu, S., T. Lillicrap, Z. Ghahramani, R. E. Turner, and S. Levine. 2016a.
“Q-prop: Sample-efficient policy gradient with an off-policy critic”.
arXiv preprint arXiv:1611.02247.
Gu, S., T. Lillicrap, Z. Ghahramani, R. E. Turner, and S. Levine. 2017b.
“Q-Prop: Sample-Efficient Policy Gradient with An Off-Policy Critic”.
In: 5th International Conference on Learning Representations (ICLR
2017).
Gu, S., T. Lillicrap, Z. Ghahramani, R. E. Turner, B. Schölkopf, and S.
Levine. 2017c. “Interpolated Policy Gradient: Merging On-Policy and
Off-Policy Gradient Estimation for Deep Reinforcement Learning”.
arXiv preprint arXiv:1706.00387.
Gu, S., T. Lillicrap, I. Sutskever, and S. Levine. 2016b. “Continuous
Deep Q-Learning with Model-based Acceleration”. arXiv preprint
arXiv:1603.00748.
Guo, Z. D. and E. Brunskill. 2017. “Sample efficient feature selection
for factored mdps”. arXiv preprint arXiv:1703.03454.
Haarnoja, T., H. Tang, P. Abbeel, and S. Levine. 2017. “Reinforce-
ment learning with deep energy-based policies”. arXiv preprint
arXiv:1702.08165.
Haber, N., D. Mrowca, L. Fei-Fei, and D. L. Yamins. 2018. “Learning to
Play with Intrinsically-Motivated Self-Aware Agents”. arXiv preprint
arXiv:1802.07442.
Hadfield-Menell, D., S. J. Russell, P. Abbeel, and A. Dragan. 2016.
“Cooperative inverse reinforcement learning”. In: Advances in neural
information processing systems. 3909–3917.
Hafner, R. and M. Riedmiller. 2011. “Reinforcement learning in feedback
control”. Machine learning. 84(1-2): 137–169.
Halsey, L. G., D. Curran-Everett, S. L. Vowler, and G. B. Drummond.
2015. “The fickle P value generates irreproducible results”. Nature
methods. 12(3): 179–185.
Harari, Y. N. 2014. Sapiens: A brief history of humankind.
Harutyunyan, A., M. G. Bellemare, T. Stepleton, and R. Munos.
2016. “Q (\ lambda) with Off-Policy Corrections”. In: International
Conference on Algorithmic Learning Theory. Springer. 305–320.
Hassabis, D., D. Kumaran, C. Summerfield, and M. Botvinick. 2017.
“Neuroscience-inspired artificial intelligence”. Neuron. 95(2): 245–
258.
Hasselt, H. V. 2010. “Double Q-learning”. In: Advances in Neural
Information Processing Systems. 2613–2621.
Hausknecht, M. and P. Stone. 2015. “Deep recurrent Q-learning for
partially observable MDPs”. arXiv preprint arXiv:1507.06527.
Hauskrecht, M., N. Meuleau, L. P. Kaelbling, T. Dean, and C. Boutilier.
1998. “Hierarchical solution of Markov decision processes using
macro-actions”. In: Proceedings of the Fourteenth conference on
Uncertainty in artificial intelligence. Morgan Kaufmann Publishers
Inc. 220–229.
He, K., X. Zhang, S. Ren, and J. Sun. 2016. “Deep residual learning
for image recognition”. In: Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition. 770–778.
Heess, N., G. Wayne, D. Silver, T. Lillicrap, T. Erez, and Y. Tassa. 2015.
“Learning continuous control policies by stochastic value gradients”.
In: Advances in Neural Information Processing Systems. 2944–2952.
Henderson, P., W.-D. Chang, F. Shkurti, J. Hansen, D. Meger, and G.
Dudek. 2017a. “Benchmark Environments for Multitask Learning in
Continuous Domains”. ICML Lifelong Learning: A Reinforcement
Learning Approach Workshop.
Henderson, P., R. Islam, P. Bachman, J. Pineau, D. Precup, and D.
Meger. 2017b. “Deep Reinforcement Learning that Matters”. arXiv
preprint arXiv:1709.06560.
Hessel, M., J. Modayil, H. van Hasselt, T. Schaul, G. Ostrovski, W.
Dabney, D. Horgan, B. Piot, M. Azar, and D. Silver. 2017. “Rainbow:
Combining Improvements in Deep Reinforcement Learning”. arXiv
preprint arXiv:1710.02298.
Hessel, M., H. Soyer, L. Espeholt, W. Czarnecki, S. Schmitt, and H.
van Hasselt. 2018. “Multi-task Deep Reinforcement Learning with
PopArt”. arXiv preprint arXiv:1809.04474.
Higgins, I., A. Pal, A. A. Rusu, L. Matthey, C. P. Burgess, A.
Pritzel, M. Botvinick, C. Blundell, and A. Lerchner. 2017. “Darla:
Improving zero-shot transfer in reinforcement learning”. arXiv
preprint arXiv:1707.08475.
Ho, J. and S. Ermon. 2016. “Generative adversarial imitation learning”.
In: Advances in Neural Information Processing Systems. 4565–4573.
Hochreiter, S. and J. Schmidhuber. 1997. “Long short-term memory”.
Neural computation. 9(8): 1735–1780.
Hochreiter, S., A. S. Younger, and P. R. Conwell. 2001. “Learning
to learn using gradient descent”. In: International Conference on
Artificial Neural Networks. Springer. 87–94.
Holroyd, C. B. and M. G. Coles. 2002. “The neural basis of human error
processing: reinforcement learning, dopamine, and the error-related
negativity.” Psychological review. 109(4): 679.
Houthooft, R., X. Chen, Y. Duan, J. Schulman, F. De Turck,
and P. Abbeel. 2016. “Vime: Variational information maximizing
exploration”. In: Advances in Neural Information Processing Systems.
1109–1117.
Ioffe, S. and C. Szegedy. 2015. “Batch normalization: Accelerating deep
network training by reducing internal covariate shift”. arXiv preprint
arXiv:1502.03167.
Islam, R., P. Henderson, M. Gomrokchi, and D. Precup. 2017.
“Reproducibility of Benchmarked Deep Reinforcement Learning
Tasks for Continuous Control”. ICML Reproducibility in Machine
Learning Workshop.
Jaderberg, M., V. Mnih, W. M. Czarnecki, T. Schaul, J. Z. Leibo, D.
Silver, and K. Kavukcuoglu. 2016. “Reinforcement learning with
unsupervised auxiliary tasks”. arXiv preprint arXiv:1611.05397.
Jaderberg, M., W. M. Czarnecki, I. Dunning, L. Marris, G. Lever,
A. G. Castaneda, C. Beattie, N. C. Rabinowitz, A. S. Morcos,
A. Ruderman, et al. 2018. “Human-level performance in first-
person multiplayer games with population-based deep reinforcement
learning”. arXiv preprint arXiv:1807.01281.
Jakobi, N., P. Husbands, and I. Harvey. 1995. “Noise and the reality
gap: The use of simulation in evolutionary robotics”. In: European
Conference on Artificial Life. Springer. 704–720.
James, G. M. 2003. “Variance and bias for general loss functions”.
Machine Learning. 51(2): 115–135.
Jaques, N., A. Lazaridou, E. Hughes, C. Gulcehre, P. A. Ortega, D.
Strouse, J. Z. Leibo, and N. de Freitas. 2018. “Intrinsic Social
Motivation via Causal Influence in Multi-Agent RL”. arXiv preprint
arXiv:1810.08647.
Jaquette, S. C. et al. 1973. “Markov decision processes with a new
optimality criterion: Discrete time”. The Annals of Statistics. 1(3):
496–505.
Jiang, N., A. Kulesza, and S. Singh. 2015a. “Abstraction selection in
model-based reinforcement learning”. In: Proceedings of the 32nd
International Conference on Machine Learning (ICML-15). 179–188.
Jiang, N., A. Kulesza, S. Singh, and R. Lewis. 2015b. “The Dependence
of Effective Planning Horizon on Model Accuracy”. In: Proceedings
of the 2015 International Conference on Autonomous Agents and
Multiagent Systems. International Foundation for Autonomous
Agents and Multiagent Systems. 1181–1189.
Jiang, N. and L. Li. 2016. “Doubly robust off-policy value evaluation for
reinforcement learning”. In: Proceedings of The 33rd International
Conference on Machine Learning. 652–661.
Johnson, J., B. Hariharan, L. van der Maaten, J. Hoffman, L. Fei-
Fei, C. L. Zitnick, and R. Girshick. 2017. “Inferring and Executing
Programs for Visual Reasoning”. arXiv preprint arXiv:1705.03633.
Johnson, M., K. Hofmann, T. Hutton, and D. Bignell. 2016. “The
Malmo Platform for Artificial Intelligence Experimentation.” In:
IJCAI. 4246–4247.
Juliani, A., V.-P. Berges, E. Vckay, Y. Gao, H. Henry, M. Mattar, and
D. Lange. 2018. “Unity: A General Platform for Intelligent Agents”.
arXiv preprint arXiv:1809.02627.
Kaelbling, L. P., M. L. Littman, and A. R. Cassandra. 1998. “Planning
and acting in partially observable stochastic domains”. Artificial
intelligence. 101(1): 99–134.
Kahneman, D. 2011. Thinking, fast and slow. Macmillan.
Kakade, S. 2001. “A Natural Policy Gradient”. In: Advances in Neural
Information Processing Systems 14 [Neural Information Processing
Systems: Natural and Synthetic, NIPS 2001, December 3-8, 2001,
Vancouver, British Columbia, Canada]. 1531–1538.
Kakade, S., M. Kearns, and J. Langford. 2003. “Exploration in metric
state spaces”. In: ICML. Vol. 3. 306–312.
Kalakrishnan, M., P. Pastor, L. Righetti, and S. Schaal. 2013. “Learning
objective functions for manipulation”. In: Robotics and Automation
(ICRA), 2013 IEEE International Conference on. IEEE. 1331–1336.
Kalashnikov, D., A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D.
Quillen, E. Holly, M. Kalakrishnan, V. Vanhoucke, and S. Levine.
2018. “Qt-opt: Scalable deep reinforcement learning for vision-based
robotic manipulation”. arXiv preprint arXiv:1806.10293.
Kalchbrenner, N., A. v. d. Oord, K. Simonyan, I. Danihelka, O. Vinyals,
A. Graves, and K. Kavukcuoglu. 2016. “Video pixel networks”. arXiv
preprint arXiv:1610.00527.
Kansky, K., T. Silver, D. A. Mély, M. Eldawy, M. Lázaro-Gredilla,
X. Lou, N. Dorfman, S. Sidor, S. Phoenix, and D. George. 2017.
“Schema Networks: Zero-shot Transfer with a Generative Causal
Model of Intuitive Physics”. arXiv preprint arXiv:1706.04317.
Kaplan, R., C. Sauer, and A. Sosa. 2017. “Beating Atari with
Natural Language Guided Reinforcement Learning”. arXiv preprint
arXiv:1704.05539.
Kearns, M. and S. Singh. 2002. “Near-optimal reinforcement learning
in polynomial time”. Machine Learning. 49(2-3): 209–232.
Kempka, M., M. Wydmuch, G. Runc, J. Toczek, and W. Jaśkowski.
2016. “Vizdoom: A doom-based ai research platform for visual
reinforcement learning”. In: Computational Intelligence and Games
(CIG), 2016 IEEE Conference on. IEEE. 1–8.
Kirkpatrick, J., R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins,
A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska,
et al. 2016. “Overcoming catastrophic forgetting in neural networks”.
arXiv preprint arXiv:1612.00796.
Klambauer, G., T. Unterthiner, A. Mayr, and S. Hochreiter. 2017. “Self-
Normalizing Neural Networks”. arXiv preprint arXiv:1706.02515.
Kolter, J. Z. and A. Y. Ng. 2009. “Near-Bayesian exploration in
polynomial time”. In: Proceedings of the 26th Annual International
Conference on Machine Learning. ACM. 513–520.
Konda, V. R. and J. N. Tsitsiklis. 2000. “Actor-critic algorithms”. In:
Advances in neural information processing systems. 1008–1014.
Krizhevsky, A., I. Sutskever, and G. E. Hinton. 2012. “Imagenet
classification with deep convolutional neural networks”. In: Advances
in neural information processing systems. 1097–1105.
Kroon, M. and S. Whiteson. 2009. “Automatic feature selection
for model-based reinforcement learning in factored MDPs”. In:
Machine Learning and Applications, 2009. ICMLA’09. International
Conference on. IEEE. 324–330.
Kulkarni, T. D., K. Narasimhan, A. Saeedi, and J. Tenenbaum. 2016.
“Hierarchical deep reinforcement learning: Integrating temporal
abstraction and intrinsic motivation”. In: Advances in Neural
Information Processing Systems. 3675–3683.
Lample, G. and D. S. Chaplot. 2017. “Playing FPS Games with Deep
Reinforcement Learning.” In: AAAI. 2140–2146.
LeCun, Y., Y. Bengio, and G. Hinton. 2015. “Deep learning”. Nature.
521(7553): 436–444.
LeCun, Y., L. Bottou, Y. Bengio, and P. Haffner. 1998. “Gradient-based
learning applied to document recognition”. Proceedings of the IEEE.
86(11): 2278–2324.
LeCun, Y., Y. Bengio, et al. 1995. “Convolutional networks for images,
speech, and time series”. The handbook of brain theory and neural
networks. 3361(10): 1995.
Lee, D., H. Seo, and M. W. Jung. 2012. “Neural basis of reinforcement
learning and decision making”. Annual review of neuroscience. 35:
287–308.
Leffler, B. R., M. L. Littman, and T. Edmunds. 2007. “Efficient
reinforcement learning with relocatable action models”. In: AAAI.
Vol. 7. 572–577.
Levine, S., C. Finn, T. Darrell, and P. Abbeel. 2016. “End-to-end
training of deep visuomotor policies”. Journal of Machine Learning
Research. 17(39): 1–40.
Levine, S. and V. Koltun. 2013. “Guided policy search”. In: International
Conference on Machine Learning. 1–9.
Li, L., Y. Lv, and F.-Y. Wang. 2016. “Traffic signal timing via deep
reinforcement learning”. IEEE/CAA Journal of Automatica Sinica.
3(3): 247–254.
Li, L., W. Chu, J. Langford, and X. Wang. 2011. “Unbiased offline
evaluation of contextual-bandit-based news article recommendation
algorithms”. In: Proceedings of the fourth ACM international
conference on Web search and data mining. ACM. 297–306.
Li, X., L. Li, J. Gao, X. He, J. Chen, L. Deng, and J. He. 2015.
“Recurrent reinforcement learning: a hybrid approach”. arXiv
preprint arXiv:1509.03044.
Liaw, A., M. Wiener, et al. 2002. “Classification and regression by
randomForest”. R news. 2(3): 18–22.
Lillicrap, T. P., J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa,
D. Silver, and D. Wierstra. 2015. “Continuous control with deep
reinforcement learning”. arXiv preprint arXiv:1509.02971.
Lin, L.-J. 1992. “Self-improving reactive agents based on reinforcement
learning, planning and teaching”. Machine learning. 8(3-4): 293–321.
Lipton, Z. C., J. Gao, L. Li, X. Li, F. Ahmed, and L. Deng. 2016.
“Efficient exploration for dialogue policy learning with BBQ networks
& replay buffer spiking”. arXiv preprint arXiv:1608.05081.
Littman, M. L. 1994. “Markov games as a framework for multi-agent
reinforcement learning”. In: Proceedings of the eleventh international
conference on machine learning. Vol. 157. 157–163.
Liu, Y., A. Gupta, P. Abbeel, and S. Levine. 2017. “Imitation from
Observation: Learning to Imitate Behaviors from Raw Video via
Context Translation”. arXiv preprint arXiv:1707.03374.
Lowe, R., Y. Wu, A. Tamar, J. Harb, P. Abbeel, and I. Mordatch.
2017. “Multi-Agent Actor-Critic for Mixed Cooperative-Competitive
Environments”. arXiv preprint arXiv:1706.02275.
MacGlashan, J., M. K. Ho, R. Loftin, B. Peng, D. Roberts, M. E.
Taylor, and M. L. Littman. 2017. “Interactive Learning from Policy-
Dependent Human Feedback”. arXiv preprint arXiv:1701.06049.
Machado, M. C., M. G. Bellemare, and M. Bowling. 2017a. “A Laplacian
Framework for Option Discovery in Reinforcement Learning”. arXiv
preprint arXiv:1703.00956.
Machado, M. C., M. G. Bellemare, E. Talvitie, J. Veness, M. Hausknecht,
and M. Bowling. 2017b. “Revisiting the Arcade Learning Environ-
ment: Evaluation Protocols and Open Problems for General Agents”.
arXiv preprint arXiv:1709.06009.
Mandel, T., Y.-E. Liu, S. Levine, E. Brunskill, and Z. Popovic. 2014.
“Offline policy evaluation across representations with applications
to educational games”. In: Proceedings of the 2014 international
conference on Autonomous agents and multi-agent systems. Interna-
tional Foundation for Autonomous Agents and Multiagent Systems.
1077–1084.
Mankowitz, D. J., T. A. Mann, and S. Mannor. 2016. “Adaptive Skills
Adaptive Partitions (ASAP)”. In: Advances in Neural Information
Processing Systems. 1588–1596.
Mathieu, M., C. Couprie, and Y. LeCun. 2015. “Deep multi-scale
video prediction beyond mean square error”. arXiv preprint
arXiv:1511.05440.
Matiisen, T., A. Oliver, T. Cohen, and J. Schulman. 2017. “Teacher-
Student Curriculum Learning”. arXiv preprint arXiv:1707.00183.
McCallum, A. K. 1996. “Reinforcement learning with selective percep-
tion and hidden state”. PhD thesis. University of Rochester.
McGovern, A., R. S. Sutton, and A. H. Fagg. 1997. “Roles of macro-
actions in accelerating reinforcement learning”. In: Grace Hopper
celebration of women in computing. Vol. 1317.
Miikkulainen, R., J. Liang, E. Meyerson, A. Rawal, D. Fink, O. Francon,
B. Raju, A. Navruzyan, N. Duffy, and B. Hodjat. 2017. “Evolving
Deep Neural Networks”. arXiv preprint arXiv:1703.00548.
Mirowski, P., R. Pascanu, F. Viola, H. Soyer, A. Ballard, A. Banino,
M. Denil, R. Goroshin, L. Sifre, K. Kavukcuoglu, et al. 2016.
“Learning to navigate in complex environments”. arXiv preprint
arXiv:1611.03673.
Mnih, V., A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D.
Silver, and K. Kavukcuoglu. 2016. “Asynchronous methods for deep
reinforcement learning”. In: International Conference on Machine
Learning.
Mnih, V., K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G.
Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski,
et al. 2015. “Human-level control through deep reinforcement
learning”. Nature. 518(7540): 529–533.
Mohamed, S. and D. J. Rezende. 2015. “Variational information
maximisation for intrinsically motivated reinforcement learning”. In:
Advances in neural information processing systems. 2125–2133.
Montague, P. R. 2013. “Reinforcement Learning Models Then-and-
Now: From Single Cells to Modern Neuroimaging”. In: 20 Years of
Computational Neuroscience. Springer. 271–277.
Moore, A. W. 1990. “Efficient memory-based learning for robot control”.
Morari, M. and J. H. Lee. 1999. “Model predictive control: past, present
and future”. Computers & Chemical Engineering. 23(4-5): 667–682.
Moravčik, M., M. Schmid, N. Burch, V. Lisy, D. Morrill, N. Bard, T.
Davis, K. Waugh, M. Johanson, and M. Bowling. 2017. “DeepStack:
Expert-level artificial intelligence in heads-up no-limit poker”.
Science. 356(6337): 508–513.
Mordatch, I., K. Lowrey, G. Andrew, Z. Popovic, and E. V. Todorov.
2015. “Interactive control of diverse complex characters with neural
networks”. In: Advances in Neural Information Processing Systems.
3132–3140.
Morimura, T., M. Sugiyama, H. Kashima, H. Hachiya, and T.
Tanaka. 2010. “Nonparametric return distribution approximation
for reinforcement learning”. In: Proceedings of the 27th International
Conference on Machine Learning (ICML-10). 799–806.
Munos, R. and A. Moore. 2002. “Variable resolution discretization in
optimal control”. Machine learning. 49(2): 291–323.
Munos, R., T. Stepleton, A. Harutyunyan, and M. Bellemare. 2016.
“Safe and efficient off-policy reinforcement learning”. In: Advances
in Neural Information Processing Systems. 1046–1054.
Murphy, K. P. 2012. “Machine Learning: A Probabilistic Perspective.”
Nagabandi, A., G. Kahn, R. S. Fearing, and S. Levine. 2017. “Neural
network dynamics for model-based deep reinforcement learning with
model-free fine-tuning”. arXiv preprint arXiv:1708.02596.
Nagabandi, A., G. Kahn, R. S. Fearing, and S. Levine. 2018. “Neural
network dynamics for model-based deep reinforcement learning with
model-free fine-tuning”. In: 2018 IEEE International Conference on
Robotics and Automation (ICRA). IEEE. 7559–7566.
Narvekar, S., J. Sinapov, M. Leonetti, and P. Stone. 2016. “Source
task creation for curriculum learning”. In: Proceedings of the 2016
International Conference on Autonomous Agents & Multiagent
Systems. International Foundation for Autonomous Agents and
Multiagent Systems. 566–574.
Neelakantan, A., Q. V. Le, and I. Sutskever. 2015. “Neural programmer:
Inducing latent programs with gradient descent”. arXiv preprint
arXiv:1511.04834.
Neu, G. and C. Szepesvári. 2012. “Apprenticeship learning using
inverse reinforcement learning and gradient methods”. arXiv preprint
arXiv:1206.5264.
Ng, A. Y., D. Harada, and S. Russell. 1999. “Policy invariance under
reward transformations: Theory and application to reward shaping”.
In: ICML. Vol. 99. 278–287.
Ng, A. Y., S. J. Russell, et al. 2000. “Algorithms for inverse reinforcement
learning.” In: Icml. 663–670.
Nguyen, D. H. and B. Widrow. 1990. “Neural networks for self-learning
control systems”. IEEE Control systems magazine. 10(3): 18–23.
Niv, Y. 2009. “Reinforcement learning in the brain”. Journal of
Mathematical Psychology. 53(3): 139–154.
Niv, Y. and P. R. Montague. 2009. “Theoretical and empirical studies
of learning”. In: Neuroeconomics. Elsevier. 331–351.
Norris, J. R. 1998. Markov chains. No. 2. Cambridge university press.
O’Donoghue, B., R. Munos, K. Kavukcuoglu, and V. Mnih. 2016.
“PGQ: Combining policy gradient and Q-learning”. arXiv preprint
arXiv:1611.01626.
Oh, J., V. Chockalingam, S. Singh, and H. Lee. 2016. “Control of
Memory, Active Perception, and Action in Minecraft”. arXiv preprint
arXiv:1605.09128.
Oh, J., X. Guo, H. Lee, R. L. Lewis, and S. Singh. 2015. “Action-
conditional video prediction using deep networks in atari games”.
In: Advances in Neural Information Processing Systems. 2863–2871.
Oh, J., S. Singh, and H. Lee. 2017. “Value Prediction Network”. arXiv
preprint arXiv:1707.03497.
Olah, C., A. Mordvintsev, and L. Schubert. 2017. “Feature Visualiza-
tion”. Distill. https://distill.pub/2017/feature-visualization.
Ortner, R., O.-A. Maillard, and D. Ryabko. 2014. “Selecting near-
optimal approximate state representations in reinforcement learning”.
In: International Conference on Algorithmic Learning Theory.
Springer. 140–154.
Osband, I., C. Blundell, A. Pritzel, and B. Van Roy. 2016. “Deep Explo-
ration via Bootstrapped DQN”. arXiv preprint arXiv:1602.04621.
Ostrovski, G., M. G. Bellemare, A. v. d. Oord, and R. Munos. 2017.
“Count-based exploration with neural density models”. arXiv preprint
arXiv:1703.01310.
Paine, T. L., S. G. Colmenarejo, Z. Wang, S. Reed, Y. Aytar, T. Pfaff,
M. W. Hoffman, G. Barth-Maron, S. Cabi, D. Budden, et al. 2018.
“One-Shot High-Fidelity Imitation: Training Large-Scale Deep Nets
with RL”. arXiv preprint arXiv:1810.05017.
Parisotto, E., J. L. Ba, and R. Salakhutdinov. 2015. “Actor-mimic:
Deep multitask and transfer reinforcement learning”. arXiv preprint
arXiv:1511.06342.
Pascanu, R., Y. Li, O. Vinyals, N. Heess, L. Buesing, S. Racanière,
D. Reichert, T. Weber, D. Wierstra, and P. Battaglia. 2017.
“Learning model-based planning from scratch”. arXiv preprint
arXiv:1707.06170.
Pathak, D., P. Agrawal, A. A. Efros, and T. Darrell. 2017. “Curiosity-
driven exploration by self-supervised prediction”. In: International
Conference on Machine Learning (ICML). Vol. 2017.
Pavlov, I. P. 1927. Conditioned reflexes. Oxford University Press.
Pedregosa, F., G. Varoquaux, A. Gramfort, V. Michel, B. Thirion,
O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et
al. 2011. “Scikit-learn: Machine learning in Python”. Journal of
Machine Learning Research. 12(Oct): 2825–2830.
Peng, J. and R. J. Williams. 1994. “Incremental multi-step Q-learning”.
In: Machine Learning Proceedings 1994. Elsevier. 226–232.
Peng, P., Q. Yuan, Y. Wen, Y. Yang, Z. Tang, H. Long, and J. Wang.
2017a. “Multiagent Bidirectionally-Coordinated Nets for Learning to
Play StarCraft Combat Games”. arXiv preprint arXiv:1703.10069.
Peng, X. B., G. Berseth, K. Yin, and M. van de Panne. 2017b.
“DeepLoco: Dynamic Locomotion Skills Using Hierarchical Deep
Reinforcement Learning”. ACM Transactions on Graphics (Proc.
SIGGRAPH 2017). 36(4).
Perez-Liebana, D., S. Samothrakis, J. Togelius, T. Schaul, S. M. Lucas,
A. Couëtoux, J. Lee, C.-U. Lim, and T. Thompson. 2016. “The 2014
general video game playing competition”. IEEE Transactions on
Computational Intelligence and AI in Games. 8(3): 229–243.
Petrik, M. and B. Scherrer. 2009. “Biasing approximate dynamic
programming with a lower discount factor”. In: Advances in neural
information processing systems. 1265–1272.
Piketty, T. 2013. “Capital in the Twenty-First Century”.
Pineau, J., G. Gordon, S. Thrun, et al. 2003. “Point-based value iteration:
An anytime algorithm for POMDPs”. In: IJCAI. Vol. 3. 1025–1032.
Pinto, L., M. Andrychowicz, P. Welinder, W. Zaremba, and P. Abbeel.
2017. “Asymmetric Actor Critic for Image-Based Robot Learning”.
arXiv preprint arXiv:1710.06542.
Plappert, M., R. Houthooft, P. Dhariwal, S. Sidor, R. Y. Chen, X. Chen,
T. Asfour, P. Abbeel, and M. Andrychowicz. 2017. “Parameter Space
Noise for Exploration”. arXiv preprint arXiv:1706.01905.
Precup, D. 2000. “Eligibility traces for off-policy policy evaluation”.
Computer Science Department Faculty Publication Series: 80.
Ranzato, M., S. Chopra, M. Auli, and W. Zaremba. 2015. “Sequence
level training with recurrent neural networks”. arXiv preprint
arXiv:1511.06732.
Rasmussen, C. E. 2004. “Gaussian processes in machine learning”. In:
Advanced lectures on machine learning. Springer. 63–71.
Ravindran, B. and A. G. Barto. 2004. “An algebraic approach to
abstraction in reinforcement learning”. PhD thesis. University of
Massachusetts at Amherst.
Real, E., S. Moore, A. Selle, S. Saxena, Y. L. Suematsu, Q. Le, and A.
Kurakin. 2017. “Large-Scale Evolution of Image Classifiers”. arXiv
preprint arXiv:1703.01041.
Reed, S. and N. De Freitas. 2015. “Neural programmer-interpreters”.
arXiv preprint arXiv:1511.06279.
Rescorla, R. A., A. R. Wagner, et al. 1972. “A theory of Pavlovian
conditioning: Variations in the effectiveness of reinforcement and
nonreinforcement”. Classical conditioning II: Current research and
theory. 2: 64–99.
Riedmiller, M. 2005. “Neural fitted Q iteration–first experiences with a
data efficient neural reinforcement learning method”. In: Machine
Learning: ECML 2005. Springer. 317–328.
Riedmiller, M., R. Hafner, T. Lampe, M. Neunert, J. Degrave, T. Van de
Wiele, V. Mnih, N. Heess, and J. T. Springenberg. 2018. “Learning
by Playing - Solving Sparse Reward Tasks from Scratch”. arXiv
preprint arXiv:1802.10567.
Rowland, M., M. G. Bellemare, W. Dabney, R. Munos, and Y. W. Teh.
2018. “An Analysis of Categorical Distributional Reinforcement
Learning”. arXiv preprint arXiv:1802.08163.
Ruder, S. 2017. “An overview of multi-task learning in deep neural
networks”. arXiv preprint arXiv:1706.05098.
Rumelhart, D. E., G. E. Hinton, and R. J. Williams. 1988. “Learning
representations by back-propagating errors”. Cognitive modeling.
5(3): 1.
Russakovsky, O., J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z.
Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. 2015. “Imagenet
large scale visual recognition challenge”. International Journal of
Computer Vision. 115(3): 211–252.
Russek, E. M., I. Momennejad, M. M. Botvinick, S. J. Gershman, and
N. D. Daw. 2017. “Predictive representations can link model-based
reinforcement learning to model-free mechanisms”. bioRxiv: 083857.
Rusu, A. A., S. G. Colmenarejo, C. Gulcehre, G. Desjardins, J.
Kirkpatrick, R. Pascanu, V. Mnih, K. Kavukcuoglu, and R. Hadsell.
2015. “Policy distillation”. arXiv preprint arXiv:1511.06295.
Rusu, A. A., M. Vecerik, T. Rothörl, N. Heess, R. Pascanu, and
R. Hadsell. 2016. “Sim-to-real robot learning from pixels with
progressive nets”. arXiv preprint arXiv:1610.04286.
Sadeghi, F. and S. Levine. 2016. “CAD2RL: Real single-image flight
without a single real image”. arXiv preprint arXiv:1611.04201.
Salge, C., C. Glackin, and D. Polani. 2014. “Changing the environment
based on empowerment as intrinsic motivation”. Entropy. 16(5):
2789–2819.
Salimans, T., J. Ho, X. Chen, and I. Sutskever. 2017. “Evolution
Strategies as a Scalable Alternative to Reinforcement Learning”.
arXiv preprint arXiv:1703.03864.
Samuel, A. L. 1959. “Some studies in machine learning using the game of
checkers”. IBM Journal of research and development. 3(3): 210–229.
Sandve, G. K., A. Nekrutenko, J. Taylor, and E. Hovig. 2013.
“Ten simple rules for reproducible computational research”. PLoS
computational biology. 9(10): e1003285.
Santoro, A., D. Raposo, D. G. Barrett, M. Malinowski, R. Pascanu, P.
Battaglia, and T. Lillicrap. 2017. “A simple neural network module
for relational reasoning”. arXiv preprint arXiv:1706.01427.
Savinov, N., A. Raichuk, R. Marinier, D. Vincent, M. Pollefeys,
T. Lillicrap, and S. Gelly. 2018. “Episodic Curiosity through
Reachability”. arXiv preprint arXiv:1810.02274.
Schaarschmidt, M., A. Kuhnle, and K. Fricke. 2017. “TensorForce: A
TensorFlow library for applied reinforcement learning”.
Schaul, T., J. Bayer, D. Wierstra, Y. Sun, M. Felder, F. Sehnke, T.
Rückstieß, and J. Schmidhuber. 2010. “PyBrain”. The Journal of
Machine Learning Research. 11: 743–746.
Schaul, T., D. Horgan, K. Gregor, and D. Silver. 2015a. “Universal value
function approximators”. In: Proceedings of the 32nd International
Conference on Machine Learning (ICML-15). 1312–1320.
Schaul, T., J. Quan, I. Antonoglou, and D. Silver. 2015b. “Prioritized
Experience Replay”. arXiv preprint arXiv:1511.05952.
Schmidhuber, J. 2010. “Formal theory of creativity, fun, and intrinsic
motivation (1990–2010)”. IEEE Transactions on Autonomous
Mental Development. 2(3): 230–247.
Schmidhuber, J. 2015. “Deep learning in neural networks: An overview”.
Neural Networks. 61: 85–117.
Schraudolph, N. N., P. Dayan, and T. J. Sejnowski. 1994. “Temporal
difference learning of position evaluation in the game of Go”. In:
Advances in Neural Information Processing Systems. 817–824.
Schulman, J., P. Abbeel, and X. Chen. 2017a. “Equivalence Be-
tween Policy Gradients and Soft Q-Learning”. arXiv preprint
arXiv:1704.06440.
Schulman, J., J. Ho, C. Lee, and P. Abbeel. 2016. “Learning from
demonstrations through the use of non-rigid registration”. In:
Robotics Research. Springer. 339–354.
Schulman, J., S. Levine, P. Abbeel, M. I. Jordan, and P. Moritz. 2015.
“Trust Region Policy Optimization”. In: ICML. 1889–1897.
Schulman, J., F. Wolski, P. Dhariwal, A. Radford, and O. Klimov.
2017b. “Proximal policy optimization algorithms”. arXiv preprint
arXiv:1707.06347.
Schultz, W., P. Dayan, and P. R. Montague. 1997. “A neural substrate
of prediction and reward”. Science. 275(5306): 1593–1599.
Shannon, C. 1950. “Programming a Computer for Playing Chess”.
Philosophical Magazine. 41(314).
Silver, D. L., Q. Yang, and L. Li. 2013. “Lifelong Machine Learning Sys-
tems: Beyond Learning Algorithms.” In: AAAI Spring Symposium:
Lifelong Machine Learning. Vol. 13. 05.
Silver, D., G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller.
2014. “Deterministic Policy Gradient Algorithms”. In: ICML.
Silver, D., A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den
Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M.
Lanctot, et al. 2016a. “Mastering the game of Go with deep neural
networks and tree search”. Nature. 529(7587): 484–489.
Silver, D., H. van Hasselt, M. Hessel, T. Schaul, A. Guez, T. Harley,
G. Dulac-Arnold, D. Reichert, N. Rabinowitz, A. Barreto, et al.
2016b. “The predictron: End-to-end learning and planning”. arXiv
preprint arXiv:1612.08810.
Singh, S. P., T. S. Jaakkola, and M. I. Jordan. 1994. “Learning Without
State-Estimation in Partially Observable Markovian Decision
Processes.” In: ICML. 284–292.
Singh, S. P. and R. S. Sutton. 1996. “Reinforcement learning with
replacing eligibility traces”. Machine learning. 22(1-3): 123–158.
Singh, S., T. Jaakkola, M. L. Littman, and C. Szepesvári. 2000.
“Convergence results for single-step on-policy reinforcement-learning
algorithms”. Machine learning. 38(3): 287–308.
Sondik, E. J. 1978. “The optimal control of partially observable Markov
processes over the infinite horizon: Discounted costs”. Operations
research. 26(2): 282–304.
Srivastava, N., G. E. Hinton, A. Krizhevsky, I. Sutskever, and R.
Salakhutdinov. 2014. “Dropout: a simple way to prevent neural
networks from overfitting.” Journal of Machine Learning Research.
15(1): 1929–1958.
Stadie, B. C., S. Levine, and P. Abbeel. 2015. “Incentivizing Exploration
In Reinforcement Learning With Deep Predictive Models”. arXiv
preprint arXiv:1507.00814.
Stone, P. and M. Veloso. 2000. “Layered learning”. Machine Learning:
ECML 2000: 369–381.
Story, G., I. Vlaev, B. Seymour, A. Darzi, and R. Dolan. 2014. “Does
temporal discounting explain unhealthy behavior? A systematic re-
view and reinforcement learning perspective”. Frontiers in behavioral
neuroscience. 8: 76.
Sukhbaatar, S., A. Szlam, and R. Fergus. 2016. “Learning multiagent
communication with backpropagation”. In: Advances in Neural
Information Processing Systems. 2244–2252.
Sun, Y., F. Gomez, and J. Schmidhuber. 2011. “Planning to be
surprised: Optimal bayesian exploration in dynamic environments”.
In: Artificial General Intelligence. Springer. 41–51.
Sunehag, P., G. Lever, A. Gruslys, W. M. Czarnecki, V. Zambaldi, M.
Jaderberg, M. Lanctot, N. Sonnerat, J. Z. Leibo, K. Tuyls, et al.
2017. “Value-Decomposition Networks For Cooperative Multi-Agent
Learning”. arXiv preprint arXiv:1706.05296.
Sutton, R. S. 1988. “Learning to predict by the methods of temporal
differences”. Machine learning. 3(1): 9–44.
Sutton, R. S. 1996. “Generalization in reinforcement learning: Suc-
cessful examples using sparse coarse coding”. Advances in neural
information processing systems: 1038–1044.
Sutton, R. S. and A. G. Barto. 1998. Reinforcement learning: An
introduction. Vol. 1. No. 1. MIT press Cambridge.
Sutton, R. S. and A. G. Barto. 2017. Reinforcement Learning: An
Introduction (2nd Edition, in progress). MIT Press.
Sutton, R. S., D. A. McAllester, S. P. Singh, and Y. Mansour. 2000.
“Policy gradient methods for reinforcement learning with function
approximation”. In: Advances in neural information processing
systems. 1057–1063.
Sutton, R. S., D. Precup, and S. Singh. 1999. “Between MDPs and
semi-MDPs: A framework for temporal abstraction in reinforcement
learning”. Artificial intelligence. 112(1-2): 181–211.
Sutton, R. S. 1984. “Temporal credit assignment in reinforcement
learning”.
Synnaeve, G., N. Nardelli, A. Auvolat, S. Chintala, T. Lacroix, Z. Lin, F.
Richoux, and N. Usunier. 2016. “TorchCraft: a Library for Machine
Learning Research on Real-Time Strategy Games”. arXiv preprint
arXiv:1611.00625.
Szegedy, C., S. Ioffe, V. Vanhoucke, and A. Alemi. 2016. “Inception-v4,
inception-resnet and the impact of residual connections on learning”.
arXiv preprint arXiv:1602.07261.
Szegedy, C., S. Ioffe, V. Vanhoucke, and A. A. Alemi. 2017. “Inception-
v4, inception-resnet and the impact of residual connections on
learning.” In: AAAI. Vol. 4. 12.
Tamar, A., S. Levine, P. Abbeel, Y. WU, and G. Thomas. 2016. “Value
iteration networks”. In: Advances in Neural Information Processing
Systems. 2146–2154.
Tan, J., T. Zhang, E. Coumans, A. Iscen, Y. Bai, D. Hafner, S. Bohez,
and V. Vanhoucke. 2018. “Sim-to-Real: Learning Agile Locomotion
For Quadruped Robots”. arXiv preprint arXiv:1804.10332.
Tanner, B. and A. White. 2009. “RL-Glue: Language-independent
software for reinforcement-learning experiments”. The Journal of
Machine Learning Research. 10: 2133–2136.
Teh, Y. W., V. Bapst, W. M. Czarnecki, J. Quan, J. Kirkpatrick, R.
Hadsell, N. Heess, and R. Pascanu. 2017. “Distral: Robust Multitask
Reinforcement Learning”. arXiv preprint arXiv:1707.04175.
Tesauro, G. 1995. “Temporal difference learning and TD-Gammon”.
Communications of the ACM. 38(3): 58–68.
Tessler, C., S. Givony, T. Zahavy, D. J. Mankowitz, and S. Mannor. 2017.
“A Deep Hierarchical Approach to Lifelong Learning in Minecraft.”
In: AAAI. 1553–1561.
Thomas, P. 2014. “Bias in natural actor-critic algorithms”. In: Interna-
tional Conference on Machine Learning. 441–448.
Thomas, P. S. and E. Brunskill. 2016. “Data-efficient off-policy policy
evaluation for reinforcement learning”. In: International Conference
on Machine Learning.
Thrun, S. B. 1992. “Efficient exploration in reinforcement learning”.
Tian, Y., Q. Gong, W. Shang, Y. Wu, and C. L. Zitnick. 2017. “ELF:
An Extensive, Lightweight and Flexible Research Platform for Real-
time Strategy Games”. Advances in Neural Information Processing
Systems (NIPS).
Tieleman, H. 2012. “Lecture 6.5-rmsprop: Divide the gradient by a
running average of its recent magnitude”. COURSERA: Neural
Networks for Machine Learning.
Tobin, J., R. Fong, A. Ray, J. Schneider, W. Zaremba, and P.
Abbeel. 2017. “Domain Randomization for Transferring Deep Neural
Networks from Simulation to the Real World”. arXiv preprint
arXiv:1703.06907.
Todorov, E., T. Erez, and Y. Tassa. 2012. “MuJoCo: A physics engine
for model-based control”. In: Intelligent Robots and Systems (IROS),
2012 IEEE/RSJ International Conference on. IEEE. 5026–5033.
Tsitsiklis, J. N. and B. Van Roy. 1997. “An analysis of temporal-
difference learning with function approximation”. Automatic Control,
IEEE Transactions on. 42(5): 674–690.
Turing, A. M. 1953. “Digital computers applied to games”. Faster than
thought.
Tzeng, E., C. Devin, J. Hoffman, C. Finn, P. Abbeel, S. Levine,
K. Saenko, and T. Darrell. 2015. “Adapting deep visuomotor
representations with weak pairwise constraints”. arXiv preprint
arXiv:1511.07111.
Ueno, S., M. Osawa, M. Imai, T. Kato, and H. Yamakawa. 2017.
““Re: ROS”: Prototyping of Reinforcement Learning Environment
for Asynchronous Cognitive Architecture”. In: First International
Early Research Career Enhancement School on Biologically Inspired
Cognitive Architectures. Springer. 198–203.
Van Hasselt, H., A. Guez, and D. Silver. 2016. “Deep Reinforcement
Learning with Double Q-Learning.” In: AAAI. 2094–2100.
Vapnik, V. N. 1998. “Statistical learning theory. Adaptive and learning
systems for signal processing, communications, and control”.
Vaswani, A., N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez,
L. Kaiser, and I. Polosukhin. 2017. “Attention Is All You Need”.
arXiv preprint arXiv:1706.03762.
Vezhnevets, A., V. Mnih, S. Osindero, A. Graves, O. Vinyals, J. Agapiou,
et al. 2016. “Strategic attentive writer for learning macro-actions”.
In: Advances in Neural Information Processing Systems. 3486–3494.
Vinyals, O., T. Ewalds, S. Bartunov, P. Georgiev, A. S. Vezhnevets,
M. Yeo, A. Makhzani, H. Küttler, J. Agapiou, J. Schrittwieser, et al.
2017. “StarCraft II: A New Challenge for Reinforcement Learning”.
arXiv preprint arXiv:1708.04782.
Wahlström, N., T. B. Schön, and M. P. Deisenroth. 2015. “From pixels
to torques: Policy learning with deep dynamical models”. arXiv
preprint arXiv:1502.02251.
Walsh, T. 2017. It’s Alive!: Artificial Intelligence from the Logic Piano
to Killer Robots. La Trobe University Press.
Wang, J. X., Z. Kurth-Nelson, D. Tirumala, H. Soyer, J. Z. Leibo,
R. Munos, C. Blundell, D. Kumaran, and M. Botvinick. 2016a.
“Learning to reinforcement learn”. arXiv preprint arXiv:1611.05763.
Wang, Z., V. Bapst, N. Heess, V. Mnih, R. Munos, K. Kavukcuoglu, and
N. de Freitas. 2016b. “Sample efficient actor-critic with experience
replay”. arXiv preprint arXiv:1611.01224.
Wang, Z., N. de Freitas, and M. Lanctot. 2015. “Dueling network
architectures for deep reinforcement learning”. arXiv preprint
arXiv:1511.06581.
Warnell, G., N. Waytowich, V. Lawhern, and P. Stone. 2017. “Deep
TAMER: Interactive Agent Shaping in High-Dimensional State
Spaces”. arXiv preprint arXiv:1709.10163.
Watkins, C. J. and P. Dayan. 1992. “Q-learning”. Machine learning.
8(3-4): 279–292.
Watkins, C. J. C. H. 1989. “Learning from delayed rewards”. PhD thesis.
King’s College, Cambridge.
Watter, M., J. Springenberg, J. Boedecker, and M. Riedmiller. 2015.
“Embed to control: A locally linear latent dynamics model for control
from raw images”. In: Advances in neural information processing
systems. 2746–2754.
Weber, T., S. Racanière, D. P. Reichert, L. Buesing, A. Guez, D. J.
Rezende, A. P. Badia, O. Vinyals, N. Heess, Y. Li, et al. 2017.
“Imagination-Augmented Agents for Deep Reinforcement Learning”.
arXiv preprint arXiv:1707.06203.
Wender, S. and I. Watson. 2012. “Applying reinforcement learning
to small scale combat in the real-time strategy game StarCraft:
Broodwar”. In: Computational Intelligence and Games (CIG), 2012
IEEE Conference on. IEEE. 402–408.
Whiteson, S., B. Tanner, M. E. Taylor, and P. Stone. 2011. “Protecting
against evaluation overfitting in empirical reinforcement learning”.
In: Adaptive Dynamic Programming And Reinforcement Learning
(ADPRL), 2011 IEEE Symposium on. IEEE. 120–127.
Williams, R. J. 1992. “Simple statistical gradient-following algorithms
for connectionist reinforcement learning”. Machine learning. 8(3-4):
229–256.
Wu, Y. and Y. Tian. 2016. “Training agent for first-person shooter game
with actor-critic curriculum learning”.
Xu, K., J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel,
and Y. Bengio. 2015. “Show, attend and tell: Neural image caption
generation with visual attention”. In: International Conference on
Machine Learning. 2048–2057.
You, Y., X. Pan, Z. Wang, and C. Lu. 2017. “Virtual to Real
Reinforcement Learning for Autonomous Driving”. arXiv preprint
arXiv:1704.03952.
Zamora, I., N. G. Lopez, V. M. Vilches, and A. H. Cordero. 2016.
“Extending the OpenAI Gym for robotics: a toolkit for reinforcement
learning using ROS and Gazebo”. arXiv preprint arXiv:1608.05742.
Zhang, A., N. Ballas, and J. Pineau. 2018a. “A Dissection of Overfitting
and Generalization in Continuous Reinforcement Learning”. arXiv
preprint arXiv:1806.07937.
Zhang, A., H. Satija, and J. Pineau. 2018b. “Decoupling Dynamics and
Reward for Transfer Learning”. arXiv preprint arXiv:1804.10689.
Zhang, C., O. Vinyals, R. Munos, and S. Bengio. 2018c. “A Study
on Overfitting in Deep Reinforcement Learning”. arXiv preprint
arXiv:1804.06893.
Zhang, C., S. Bengio, M. Hardt, B. Recht, and O. Vinyals. 2016.
“Understanding deep learning requires rethinking generalization”.
arXiv preprint arXiv:1611.03530.
Zhu, Y., R. Mottaghi, E. Kolve, J. J. Lim, A. Gupta, L. Fei-Fei,
and A. Farhadi. 2016. “Target-driven visual navigation in
indoor scenes using deep reinforcement learning”. arXiv preprint
arXiv:1609.05143.
Ziebart, B. D. 2010. Modeling purposeful adaptive behavior with the
principle of maximum causal entropy. Carnegie Mellon University.
Zoph, B. and Q. V. Le. 2016. “Neural architecture search with
reinforcement learning”. arXiv preprint arXiv:1611.01578.
