Reinforcement Learning
doi:10.1038/nature14540
Reinforcement learning is a branch of machine learning concerned with using experience gained through interacting
with the world and evaluative feedback to improve a system's ability to make behavioural decisions. It has been called the
artificial intelligence problem in a microcosm because learning algorithms must act autonomously to perform well and
achieve their goals. Partly driven by the increasing availability of rich data, recent years have seen exciting advances in
the theory and practice of reinforcement learning, including developments in fundamental technical areas such as generalization, planning, exploration and empirical methodology, leading to increasing applicability to real-life problems.
Reinforcement-learning algorithms1,2 are inspired by our understanding of decision making in humans and other animals in
which learning is supervised through the use of reward signals
in response to the observed outcomes of actions. As our understanding
of this class of problems improves, so does our ability to bring it to bear
in practical settings. Reinforcement learning is having a considerable
impact on nuts-and-bolts questions such as how to create more effective personalized Web experiences, as well as esoteric questions such
as how to design better computerized players for the traditional board
game Go or 1980s video games. Reinforcement learning is also providing a valuable conceptual framework for work in psychology, cognitive
science, behavioural economics and neuroscience that seeks to explain
the process of decision making in the natural world.
One way to think about machine learning is as a set of techniques
to try to answer the question: when I am in situation x, what response
should I choose? As a concrete example, consider the problem of assigning patrons to tables in a restaurant. Parties of varying sizes arrive in
an unknown order and the host allocates each party to a table. The
host maps the current situation x (size of the latest party and information about which tables are occupied and for how long) to a decision
(which table to assign to the party), trying to satisfy a set of competing
goals such as minimizing the waiting time for a table, balancing the load
among the various servers, and ensuring that the members of a party
can sit together. Similar allocation challenges come up in games such
as Tetris and computational problems such as data-centre task allocation. Viewed more broadly, this framework fits any problem in which a
sequence of decisions needs to be made to maximize a scoring function
over uncertain outcomes.
The kinds of approaches needed to learn good behaviour for the
patron-assignment problem depend on what kinds of feedback information are available to the decision maker while learning (Fig. 1).
Exhaustive versus sampled feedback is concerned with the coverage of the training examples. A learner given exhaustive feedback is
exposed to all possible situations. Sampled feedback is weaker, in that
the learner is only provided with experience of a subset of situations.
The central problem in classic supervised learning is generalizing from
sampled examples.
Supervised versus evaluative feedback is concerned with how the
learner is informed of right and wrong answers. A requirement for
applying supervised learning methods is the availability of examples
with known optimal decisions. In the patron-assignment problem, a
host-in-training could work as an apprentice to a much more experienced supervisor to learn how to handle a range of situations. If the
apprentice can only learn from supervised feedback, however, she would
have no opportunities to improve after the apprenticeship ends.
By contrast, evaluative feedback provides the learner with an assessment of the effectiveness of the decisions that she made; no information
is available on the appropriateness of alternatives. For example, a host
might learn about the ability of a server to handle unruly patrons by trial
and error: when the host makes an assignment of a difficult customer,
it is possible to tell whether things went smoothly with the selected
server, but no direct information is available as to whether one of the
other servers might have been a better choice. The central problem in
the field of reinforcement learning is addressing the challenge of evaluative feedback.
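One simple way to act on purely evaluative feedback is to keep a running estimate of how well each option has worked, choosing the best-looking option most of the time while occasionally trying the others. The sketch below is a minimal illustration of that idea; the option set, the try_option callback and the parameter values are assumptions made for the example, not part of any particular system.

```python
import random

def trial_and_error(options, try_option, rounds=1000, epsilon=0.1):
    """Estimate the value of each option from its own observed outcomes only."""
    value = {o: 0.0 for o in options}   # running average outcome per option
    count = {o: 0 for o in options}
    for _ in range(rounds):
        if random.random() < epsilon:
            choice = random.choice(options)        # occasionally explore an arbitrary option
        else:
            choice = max(options, key=value.get)   # otherwise exploit the best estimate so far
        outcome = try_option(choice)               # evaluative feedback for the chosen option only
        count[choice] += 1
        value[choice] += (outcome - value[choice]) / count[choice]
    return value
```

For the restaurant host, the options would be the available servers and try_option the observed outcome of assigning a difficult customer to one of them; nothing is ever learned about the servers that were not chosen on a given occasion.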
One-shot versus sequential feedback is concerned with the relative timing of learning signals. Evaluative feedback can be subdivided
into whether it is provided directly for each decision or whether it has
longer-term impacts that are evaluated over a sequence of decisions.
For example, if a host needs to seat a party of 12 and there are no large
tables available, some past decision to seat a small party at a big table
might be to blame. Reinforcement learners must solve this temporal
credit assignment problem to be able to derive good behaviour in the
face of this weak sequential feedback.
From the beginning, reinforcement-learning methods have contended with all three forms of weak feedback simultaneously: sampled, evaluative and sequential feedback. As such, the problem is
considerably harder than that of supervised learning. However, methods
that can learn from weak sources of feedback are more generally applicable and can be mapped to a variety of naturally occurring problems,
as suggested by the patron-assignment problem.
The following sections describe recent advances in several sub-areas
of reinforcement learning that are expanding its power and applicability.
[Figure 1 | Learning problems arranged along three dimensions of feedback: sampled versus exhaustive, evaluative versus supervised, and sequential versus one-shot. Supervised machine learning, contextual bandits, bandits, tabular reinforcement learning and reinforcement learning occupy different combinations of these dimensions.]
Bandit problems
In temporal difference learning13, the learner maintains an estimate V(x) of the long-term value of each situation x. The quantity r(x′) represents the feedback received from the environment for the transition to the next situation x′, so r(x′) + V(x′) is a one-step-delayed estimate of the value of x. Classic temporal difference methods set about minimizing the difference between these quantities.
Reinforcement learning in the face of evaluative, sampled and sequential feedback requires combining methods for temporal credit assignment with methods for generalization. Concretely, if the V values are represented with a neural network or similar representation, finding predictions such that V(x) ≈ E_x′[r(x′) + V(x′)] can lead to instability and divergence in the learning process14,15.
Recent work16 has modified the goal of learning slightly by incorporating the generalization process into the learning objective itself. Specifically, consider the goal of seeking V values such that V(x) ≈ Π E_x′[r(x′) + V(x′)], where Π is the projection of the values resulting from their representation by the generalization method. When feedback is exhaustive and no generalization is used, Π is the identity function and this goal is no different from classic temporal difference learning. However, when linear functions are used to represent values, this modified goal can be used to create provably convergent learning algorithms17.
This insight was rapidly generalized to non-linear function approximators that are smooth (locally linear)18, to the control setting19, and to
the use of eligibility traces in the learning process20. This line of work
holds a great deal of promise for creating reliable methods for learning
effective behaviour from very weak feedback.
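To make the flavour of these methods concrete, the following sketch performs temporal-difference prediction with a linear value estimate and an optional gradient-correction term in the spirit of the convergent methods discussed above. The environment interface (env.reset and env.step, with the evaluated policy executed inside the environment), the feature map phi, the inclusion of a discount factor gamma and the step sizes are all assumptions made for illustration.

```python
import numpy as np

def td_prediction(env, phi, n_features, episodes=100,
                  alpha=0.05, beta=0.01, gamma=0.99, gradient_correction=True):
    """Estimate V(x) as theta . phi(x) from sampled transitions."""
    theta = np.zeros(n_features)   # weights of the learned value estimate
    w = np.zeros(n_features)       # auxiliary weights used by the gradient correction
    for _ in range(episodes):
        x, done = env.reset(), False
        while not done:
            # The environment executes the policy being evaluated and returns
            # the next situation, the reward and a termination flag.
            x_next, r, done = env.step()
            f = phi(x)                                           # features of the current situation
            f_next = np.zeros(n_features) if done else phi(x_next)
            delta = r + gamma * (theta @ f_next) - theta @ f     # temporal-difference error
            if gradient_correction:
                # Gradient-TD style update towards the projected objective.
                theta += alpha * (delta * f - gamma * f_next * (f @ w))
                w += beta * (delta - f @ w) * f
            else:
                theta += alpha * delta * f                       # classic (semi-gradient) TD(0)
            x = x_next
    return theta
```

Setting gradient_correction to False recovers the classic update, which is the form that can become unstable when combined with function approximation.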
Planning
[Box 1 | Noughts-and-crosses (tic-tac-toe) board positions and an interaction loop of state, action and reward, annotated with the reward function R(s, a), a policy π*(s) and the transition probabilities P(s′|s, a).]
Evaluation-function methods
In a small game such as noughts and crosses, a program can expand the full game tree and determine whether its moves will lead to a win, a loss or a tie from each position. In a game such as chess, however, the size of the game tree is astronomical and the
best an algorithm can hope for is to assemble a representative sample of
boards to consider. The classic approach for dealing with this problem
is to use an evaluation function. Programs search as deeply in the game
tree as they can, but if they cannot reach the end of the game where the
outcome is known, a heuristic evaluation function is used to estimate
how the game will end. Even a moderately good guess can lead to excellent game play; in this context, evaluation-function methods have
been used in chess from the earliest days of artificial intelligence to the
decision making in the Deep Blue system that beat the World Chess
Champion Garry Kasparov22.
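A minimal sketch of depth-limited game-tree search with a heuristic evaluation function follows. The game interface (legal_moves, apply, is_over and outcome, all expressed from the point of view of the maximizing player) and the evaluate function are assumptions made for the example, not the design of any particular game-playing system.

```python
def minimax_value(state, depth, maximizing, evaluate):
    """Return an estimated value of `state` for the maximizing player."""
    if state.is_over():
        return state.outcome()        # exact value: the win, loss or tie is known
    if depth == 0:
        return evaluate(state)        # heuristic guess at how the game will end
    values = (minimax_value(state.apply(m), depth - 1, not maximizing, evaluate)
              for m in state.legal_moves())
    return max(values) if maximizing else min(values)

def choose_move(state, depth, evaluate):
    """Pick the move whose depth-limited search value is highest."""
    return max(state.legal_moves(),
               key=lambda m: minimax_value(state.apply(m), depth - 1, False, evaluate))
```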
A powerful variation of this idea is to use machine learning to improve the evaluation function over time. Here, temporal difference learning can be used to nudge the evaluation function's estimates towards the outcomes of the games the system actually plays.
To plan, a learning algorithm needs to be able to predict the immediate outcomes of its actions. Since it can try an action and observe
the effects right away, this part of the problem can be addressed using
supervised learning methods. In model-based reinforcement-learning
approaches, experiences are used to estimate a transition model, mapping from situations and decisions to resulting situations, and then a
planning approach is used to make decisions that maximize predicted
long-term outcomes.
Model-based reinforcement-learning methods and model-free methods such as temporal differences strike a different balance between computational expense and experience efficiency: to a first approximation, model-free methods are inexpensive but slow to learn, whereas model-based methods are computationally intensive but squeeze the most out of each experience (Box 2). In applications such as learning to manage memory resources in a central processing unit, decisions need to be made at the speed of a computer's clock and data are abundant, making
temporal difference learning an excellent fit28. In applications such as a
robot helicopter learning acrobatic tricks, experience is slow to collect
and there is ample time offline to compute decision policies, making a
model-based approach appropriate29. There are also methods that draw
on both styles of learning30 to try to get the complementary advantages.
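The sketch below illustrates one such hybrid in the Dyna style30: each real transition triggers a model-free update, and a learned model of past transitions is then replayed to squeeze extra value-function updates out of the same experience. The environment interface, the discrete action set, the deterministic model and the hyperparameters are simplifying assumptions for the example.

```python
import random
from collections import defaultdict

def dyna_q(env, actions, episodes=200, alpha=0.1, gamma=0.95,
           epsilon=0.1, planning_steps=20):
    Q = defaultdict(float)   # state-action values (the model-free component)
    model = {}               # last observed outcome of each (state, action) pair

    def greedy(s):
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = random.choice(actions) if random.random() < epsilon else greedy(s)
            s2, r, done = env.step(a)
            # Model-free: one-step Q-learning update from the real transition.
            target = r + (0 if done else gamma * Q[(s2, greedy(s2))])
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            # Model-based: remember the transition, then replay remembered
            # transitions to perform extra updates without new experience.
            model[(s, a)] = (r, s2, done)
            for _ in range(planning_steps):
                (ps, pa), (pr, ps2, pdone) = random.choice(list(model.items()))
                ptarget = pr + (0 if pdone else gamma * Q[(ps2, greedy(ps2))])
                Q[(ps, pa)] += alpha * (ptarget - Q[(ps, pa)])
            s = s2
    return Q
```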
Model-based algorithms can make use of insights from supervised learning to deal with sampled feedback when learning their transition models.
Empirical methods
An important turning point in the development of practical supervised learning methods was a shift from artificial learning problems
to learning problems grounded in measured data and careful experimental design34. The reinforcement-learning community is showing
signs of a similar evolution, which promises to help to solidify the
technical gains described in the previous sections. Of particular interest are new evaluation methodologies that go beyond case studies of
individual algorithms learning in individual environments to controlled studies of different learning algorithms and their performance
across multiple natural domains. In supervised learning, data-set collections such as the University of California, Irvine Machine Learning
Repository (http://archive.ics.uci.edu/ml/) and the Weka Machine
Learning library (http://www.cs.waikato.ac.nz/ml/weka/) facilitate
this kind of research by providing data for learning problems and a
suite of algorithms in a modular form. A challenge for the study of
reinforcement learning, however, is that learning takes place through
active exploration between the learner and her environment. Environments are almost always represented by simulators, which can be
difficult to standardize and share, and are rarely realistic stand-ins
for the actual environment of interest.
One noteworthy exception is the Arcade Learning Environment35,
which provides a reinforcement-learning-friendly interface for an
eclectic collection of more than 50 Atari 2600 video games. Although
video games are by no means the only domain of interest, the diversity of game types and the unity of the interface make the Arcade Learning Environment an exciting platform for the empirical study of machine reinforcement learning. An extremely impressive example of such a study is the recent demonstration that a learner combining reinforcement learning with deep neural networks can reach human-level performance on many of these games36.
Evaluating learning algorithms offline is difficult, in part because it is hard to know whether each decision an algorithm would have made was a good choice. Longer learning periods and contextual decisions make
the number of possibilities to consider astronomically high. Despite
this daunting combinatorial explosion, offline evaluation procedures
for contextual bandit problems have been proposed38. The key idea is to
collect a static (but enormous) set of contextual bandit decisions by running a uniform random selection algorithm on a real problem (a news
article recommendation, for example). Then, to test a bandit algorithm
using this collection, present the algorithm with the same series of contexts. Whenever the selection made by the algorithm does not match
the selection made during data collection, the algorithm is forced to
forget what it saw. Thus, as far as the algorithm knows, all of its choices
during learning matched the choices made during data collection. The
end result is an unbiased evaluation of a dynamic bandit algorithm using
static pre-recorded data.
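A minimal sketch of this replay-style offline evaluation follows; the policy interface (select and update methods) and the format of the logged data are assumptions for the example. Because the logged actions were selected uniformly at random, the rounds that survive the matching test behave like a sample of genuine online interaction.

```python
def replay_evaluate(policy, logged_rounds):
    """Evaluate a bandit algorithm on logged (context, action, reward) triples
    that were collected with uniform random action selection."""
    total_reward, matched = 0.0, 0
    for context, logged_action, reward in logged_rounds:
        chosen = policy.select(context)
        if chosen != logged_action:
            continue                                 # no feedback available; skip this round
        policy.update(context, chosen, reward)       # the algorithm only ever sees matches
        total_reward += reward
        matched += 1
    return total_reward / max(matched, 1)            # estimated per-round reward
```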
One reason that this approach works for contextual bandit problems
is that, apart from the state of the learning algorithm, there are no
dependencies from one round of decision making to the next. So, skipping many rounds because the algorithm and the data collection do not
match does not change the fundamental character of the learning problem. In the presence of sequential feedback, however, this trick no longer
applies. Proposals have been made for creating reinforcement-learning
data repositories39 and for evaluating policies using observational data
sets40, but no solid solution for evaluating general learning algorithms
has yet emerged.
Finding appropriate reward functions
Reinforcement-learning systems strive to identify behaviour that maximizes expected total reward. There is a sense, therefore, that they are
programmable through their reward functions: the mappings from
situation to score. A helpful analogy to consider is between learning
and program interpretation in digital computers: a program (reward
function) directs an interpreter (learning algorithm) to process inputs
(environments) into desirable outputs (behaviour). The vast majority
of research in reinforcement learning has been into the development
of effective interpreters with little attention paid to how we should be
programming them (providing reward functions).
There are a few approaches that have been explored for producing
reward functions that induce a target behaviour given some conception
of that behaviour. As a simple example, if one reward function for the
target behaviour is known, the space of behaviour-preserving transformations to this reward function is well understood41; other, essentially
equivalent, reward functions can be created. If a human teacher is available who can provide evaluative feedback, modified reinforcement-learning algorithms can be used to search for the intended target behaviour42–44. The problem of inverse reinforcement learning45,46
addresses the challenge of creating appropriate reward functions given
the availability of behavioural traces from an expert executing the target
behaviour. It is called inverse reinforcement learning because the learner
is given behaviour and needs to generate a reward function instead of
the other way around. These methods infer a reward function that the
demonstrated behaviour optimizes. One can turn to evolutionary optimization47 to generate reward functions if there is an effective way to
evaluate the appropriateness of the resulting behaviour. One advantage of
the evolutionary approach is that, among the many possible reward functions that generate good behaviour, it can identify ones that provide helpful but not too distracting hints that can speed up the learning process.
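As one concrete instance of the behaviour-preserving transformations mentioned above, potential-based reward shaping41 folds a designer-supplied hint into the reward without changing which behaviours are optimal. The sketch below assumes a potential function phi chosen by the designer; the function names are illustrative.

```python
def shaped_reward(reward_fn, phi, gamma):
    """Wrap reward_fn(s, a, s2) so that optimal behaviour is unchanged.

    The shaped reward adds gamma * phi(s2) - phi(s); because this term
    telescopes along any trajectory, it can alter learning speed but not
    which policies are ultimately preferred.
    """
    def r_shaped(s, a, s2):
        return reward_fn(s, a, s2) + gamma * phi(s2) - phi(s)
    return r_shaped
```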
Looking forward, new techniques for specifying complex behaviour and
translations of these specifications into appropriate reward functions are
essential. Existing reward-function specifications lack the fundamental
ideas that enable the design of today's massive software systems, such as
abstraction, modularity and encapsulation. Analogues of these ideas could
greatly extend the practical utility of reinforcement-learning systems.
Reinforcement learning as a cognitive model
The term reinforcement learning itself originated in the animallearning community. It had long been observed that certain stimuli,
11. Bubeck, S. & Liu, C.-Y. Prior-free and prior-dependent regret bounds for Thompson sampling. In Proc. Advances in Neural Information Processing Systems 638–646 (2013).
12. Gershman, S. & Blei, D. A tutorial on Bayesian nonparametric models. J. Math. Psychol. 56, 1–12 (2012).
13. Sutton, R. S. Learning to predict by the method of temporal differences. Mach. Learn. 3, 9–44 (1988).
14. Boyan, J. A. & Moore, A. W. Generalization in reinforcement learning: safely approximating the value function. In Proc. Advances in Neural Information Processing Systems 369–376 (1995).
15. Baird, L. Residual algorithms: reinforcement learning with function approximation. In Proc. 12th International Conference on Machine Learning (eds Prieditis, A. & Russell, S.) 30–37 (Morgan Kaufmann, 1995).
16. Sutton, R. S. et al. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proc. 26th Annual International Conference on Machine Learning 993–1000 (2009).
17. Sutton, R. S., Maei, H. R. & Szepesvári, C. A convergent O(n) temporal-difference algorithm for off-policy learning with linear function approximation. In Proc. Advances in Neural Information Processing Systems 1609–1616 (2009).
18. Maei, H. R. et al. Convergent temporal-difference learning with arbitrary smooth function approximation. In Proc. Advances in Neural Information Processing Systems 1204–1212 (2009).
19. Maei, H. R., Szepesvári, C., Bhatnagar, S. & Sutton, R. S. Toward off-policy learning control with function approximation. In Proc. 27th International Conference on Machine Learning 719–726 (2010).
20. van Hasselt, H., Mahmood, A. R. & Sutton, R. S. Off-policy TD(λ) with a true online equivalence. In Proc. 30th Conference on Uncertainty in Artificial Intelligence 324 (2014).
21. Russell, S. J. & Norvig, P. Artificial Intelligence: A Modern Approach (Prentice Hall, 1994).
22. Campbell, M., Hoane, A. J. & Hsu, F. H. Deep Blue. Artif. Intell. 134, 57–83 (2002).
23. Samuel, A. L. Some studies in machine learning using the game of checkers. IBM J. Res. Develop. 3, 211–229 (1959).
24. Tesauro, G. TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Comput. 6, 215–219 (1994).
This article describes the first reinforcement-learning system to solve a truly non-trivial task.
25. Tesauro, G., Gondek, D., Lenchner, J., Fan, J. & Prager, J. M. Simulation, learning, and optimization techniques in Watson's game strategies. IBM J. Res. Develop. 56, 1–11 (2012).
26. Kocsis, L. & Szepesvári, C. Bandit based Monte-Carlo planning. In Proc. 17th European Conference on Machine Learning 282–293 (2006).
This article introduces UCT, the decision-making algorithm that revolutionized gameplay in Go.
27. Gelly, S. et al. The grand challenge of computer Go: Monte Carlo tree search and extensions. Commun. ACM 55, 106–113 (2012).
28. İpek, E., Mutlu, O., Martínez, J. F. & Caruana, R. Self-optimizing memory controllers: a reinforcement learning approach. In Proc. 35th International Symposium on Computer Architecture 39–50 (2008).
29. Ng, A. Y., Kim, H. J., Jordan, M. I. & Sastry, S. Autonomous helicopter flight via reinforcement learning. In Proc. Advances in Neural Information Processing Systems http://papers.nips.cc/paper/2455-autonomous-helicopter-flight-via-reinforcement-learning (2003).
30. Sutton, R. S. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proc. 7th International Conference on Machine Learning 216–224 (Morgan Kaufmann, 1990).
31. Kearns, M. J. & Singh, S. P. Near-optimal reinforcement learning in polynomial time. Mach. Learn. 49, 209–232 (2002).
This article provides the first algorithm and analysis that shows that reinforcement-learning tasks can be solved approximately optimally with a relatively small amount of experience.
32. Brafman, R. I. & Tennenholtz, M. R-MAX – a general polynomial time algorithm for near-optimal reinforcement learning. J. Mach. Learn. Res. 3, 213–231 (2002).
33. Li, L., Littman, M. L., Walsh, T. J. & Strehl, A. L. Knows what it knows: a framework for self-aware learning. Mach. Learn. 82, 399–443 (2011).
34. Langley, P. Machine learning as an experimental science. Mach. Learn. 3, 5–8 (1988).
35. Bellemare, M. G., Naddaf, Y., Veness, J. & Bowling, M. The arcade learning environment: an evaluation platform for general agents. J. Artif. Intell. Res. 47, 253–279 (2013).
36. Mnih, V. et al. Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015).
This article describes the application of deep learning in a reinforcement-learning setting to address the challenging task of decision making in an arcade environment.
37. Murphy, S. A. An experimental design for the development of adaptive treatment strategies. Stat. Med. 24, 1455–1481 (2005).
38. Li, L., Chu, W., Langford, J. & Wang, X. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In Proc. 4th ACM International Conference on Web Search and Data Mining 297–306 (2011).
39. Nouri, A. et al. A novel benchmark methodology and data repository for real-life reinforcement learning. In Proc. Multidisciplinary Symposium on Reinforcement Learning, Poster (2009).
40. Marivate, V. N., Chemali, J., Littman, M. & Brunskill, E. Discovering multimodal characteristics in observational clinical data. In Proc. Machine Learning for Clinical Data Analysis and Healthcare NIPS Workshop http://paul.rutgers.edu/~vukosi/papers/nips2013workshop.pdf (2013).
41. Ng, A. Y., Harada, D. & Russell, S. Policy invariance under reward transformations: theory and application to reward shaping. In Proc. 16th International Conference on Machine Learning 278–287 (1999).
42. Thomaz, A. L. & Breazeal, C. Teachable robots: understanding human teaching behaviour to build more effective robot learners. Artif. Intell. 172, 716–737 (2008).
43. Knox, W. B. & Stone, P. Interactively shaping agents via human reinforcement: the TAMER framework. In Proc. 5th International Conference on Knowledge Capture 9–16 (2009).
44. Loftin, R. et al. A strategy-aware technique for learning behaviors from discrete human feedback. In Proc. 28th Association for the Advancement of Artificial Intelligence Conference https://www.aaai.org/ocs/index.php/AAAI/AAAI14/paper/view/8579 (2014).
45. Ng, A. Y. & Russell, S. Algorithms for inverse reinforcement learning. In Proc. International Conference on Machine Learning 663–670 (2000).
46. Babes, M., Marivate, V. N., Littman, M. L. & Subramanian, K. Apprenticeship learning about multiple intentions. In Proc. International Conference on Machine Learning 897–904 (2011).
47. Singh, S., Lewis, R. L., Barto, A. G. & Sorg, J. Intrinsically motivated reinforcement learning: an evolutionary perspective. IEEE Trans. Auton. Mental Dev. 2, 70–82 (2010).
48. Newell, A. The chess machine: an example of dealing with a complex task by adaptation. In Proc. Western Joint Computer Conference 101–108 (1955).
49. Minsky, M. L. Some methods of artificial intelligence and heuristic programming. In Proc. Symposium on the Mechanization of Thought Processes 24–27 (1958).
50. Sutton, R. S. & Barto, A. G. Toward a modern theory of adaptive networks: expectation and prediction. Psychol. Rev. 88, 135–170 (1981).
51. Schultz, W., Dayan, P. & Montague, P. R. A neural substrate of prediction and reward. Science 275, 1593–1599 (1997).
52. Dayan, P. & Niv, Y. Reinforcement learning and the brain: the good, the bad and the ugly. Curr. Opin. Neurobiol. 18, 185–196 (2008).
53. Niv, Y. Neuroscience: dopamine ramps up. Nature 500, 533–535 (2013).
54. Cushman, F. Action, outcome, and value: a dual-system framework for morality. Pers. Soc. Psychol. Rev. 17, 273–292 (2013).
55. Shapley, L. Stochastic games. Proc. Natl Acad. Sci. USA 39, 1095–1100 (1953).
56. Bellman, R. Dynamic Programming (Princeton Univ. Press, 1957).
57. Kober, J., Bagnell, J. A. & Peters, J. Reinforcement learning in robotics: a survey. Int. J. Rob. Res. 32, 1238–1274 (2013).
58. Watkins, C. J. C. H. & Dayan, P. Q-learning. Mach. Learn. 8, 279–292 (1992).
This article introduces the first provably correct approach to reinforcement learning for both prediction and decision making.
59. Jaakkola, T., Jordan, M. I. & Singh, S. P. Convergence of stochastic iterative dynamic programming algorithms. In Advances in Neural Information Processing Systems 6, 703–710 (Morgan Kaufmann, 1994).
60. Diuk, C., Li, L. & Leffler, B. R. The adaptive k-meteorologists problem and its application to structure learning and feature selection in reinforcement learning. In Proc. 26th International Conference on Machine Learning 32–40 (2009).
Acknowledgements
The author appreciates his discussions with his colleagues that led to this synthesis
of current work.
Author Information Reprints and permissions information is available at
www.nature.com/reprints. The author declares no competing financial
interests. Readers are welcome to comment on the online version of this paper
at go.nature.com/csIqwl. Correspondence should be addressed to M.L.L.
([email protected]).