A Beginner's Guide to Deep Reinforcement Learning
skymind.ai
In the feedback loop above, the subscripts denote the time steps t and t+1, each of which refers to a different state: the state at moment t, and the state at moment t+1. Unlike other forms of machine learning, such as supervised and unsupervised learning, reinforcement learning can only be thought about sequentially, in terms of state-action pairs that occur one after the other.
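To make this loop concrete, here is a minimal sketch of a single episode in Python; the toy step function and the random policy are illustrative placeholders rather than part of any particular library.

import random

# One episode of the agent-environment feedback loop. At each time step t the
# agent observes state s_t and picks action a_t; the environment answers with
# reward r_(t+1) and the next state s_(t+1).

def step(state, action):
    """Toy environment: a walk along positions 0..10, with the goal at 10."""
    next_state = max(0, min(10, state + action))
    reward = 1.0 if next_state == 10 else -0.1   # small cost per step, bonus at the goal
    done = next_state == 10
    return next_state, reward, done

state, total_reward, done = 0, 0.0, False
for t in range(1000):                            # cap the episode length
    action = random.choice([-1, +1])             # placeholder policy: act at random
    state, reward, done = step(state, action)    # (s_t, a_t) -> r_(t+1), s_(t+1)
    total_reward += reward
    if done:
        break
print("episode ended after", t + 1, "steps with return", round(total_reward, 1))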
• In video games, the goal is to finish the game with the most
points, so each additional point obtained throughout the
game will affect the agent’s subsequent behavior; i.e. the
agent may learn that it should shoot battleships, touch coins
or dodge meteors to maximize its score.
• In the real world, the goal might be for a robot to travel from point A to point B, and every inch the robot is able to move closer to point B could be counted as points.
Here’s an example of an objective function for reinforcement
learning; i.e. the way it defines its goal.
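A common way to write such an objective is as the expected sum of discounted rewards collected while following a policy \pi:

J(\pi) = \mathbb{E}_{\pi}\left[ \sum_{t=0}^{\infty} \gamma^{t} \, r_{t+1} \right], \qquad 0 \le \gamma < 1

where the discount factor \gamma weights immediate rewards more heavily than distant ones, and the expectation runs over the trajectories of states, actions and rewards generated by following \pi. The agent's job is to find the policy that maximizes this quantity.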
(In fact, deciding which types of input and feedback your agent should
pay attention to is a hard problem to solve. This is known as domain
selection. Algorithms that are learning how to play video games can
mostly ignore this problem, since the environment is man-made and
strictly limited. Thus, video games provide the sterile environment of
the lab, where ideas about reinforcement learning can be tested.
Domain selection requires human decisions, usually based on
knowledge or theories about the problem to be solved; e.g. selecting
the domain of input for an algorithm in a self-driving car might
include choosing to include radar sensors in addition to cameras and
GPS data.)
The above image illustrates what a policy agent does, mapping a state
to the best action.
If you recall, this is distinct from Q, which maps state-action pairs to rewards.
To be more specific, Q maps state-action pairs to the highest
combination of immediate reward with all future rewards that might
be harvested by later actions in the trajectory. Here is the equation
for Q, from Wikipedia:
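In its Q-learning form, the estimate for a state-action pair is updated after every step:

Q^{\mathrm{new}}(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \, \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]

where \alpha is the learning rate and \gamma is the discount factor. The bracketed term is the difference between the new estimate, r_{t+1} + \gamma \max_a Q(s_{t+1}, a), and the old estimate Q(s_t, a_t), so each observed transition nudges Q toward the immediate reward plus the discounted value of the best action available in the next state.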
Footnotes
1) It might be helpful to imagine a reinforcement learning algorithm in
action, to paint it visually. Let’s say the algorithm is learning to play
the video game Super Mario. It’s trying to get Mario through the game
and acquire the most points. To do that, we can spin up lots of
different Marios in parallel and run them through the space of all
possible game states. It’s as though you have 1,000 Marios all
tunnelling through a mountain, and as they dig (e.g. as they decide
again and again which action to take to affect the game environment),
their experience-tunnels branch like the intricate and fractal twigs of a
tree. The Marios’ experience-tunnels are corridors of light cutting
through the mountain. And as in life itself, one successful action may
make it more likely that successful action is possible in a larger
decision flow, propelling the winning Marios onward. You might also
imagine, if each Mario is an agent, that in front of him is a heat map
tracking the rewards he can associate with state-action pairs. (Imagine
each state-action pair as having its own screen overlaid with heat from
yellow to red. The many screens are assembled in a grid, like you
might see in front of a Wall St. trader with many monitors. One action
screen might be “jump harder from this state”, another might be “run
faster in this state” and so on and so forth.) Since some state-action
pairs lead to significantly more reward than others, and different kinds
of actions such as jumping, squatting or running can be taken, the
probability distribution of reward over actions is not a bell curve but
instead complex, which is why Markov and Monte Carlo techniques are
used to explore it, much as Stan Ulam explored winning Solitaire
hands. That is, while it is difficult to describe the reward distribution
in a formula, it can be sampled. Because the algorithm starts ignorant
and many of the paths through the game-state space are unexplored,
the heat maps will reflect their lack of experience; i.e. there could be
blanks in the heatmap of the rewards they imagine, or they might just
start with some default assumptions about rewards that will be
adjusted with experience. The Marios are essentially reward-seeking
missiles guided by those heatmaps, and the more times they run
through the game, the more accurate their heatmap of potential future
reward becomes. The heatmaps are basically probability distributions
of reward over the state-action pairs possible from the Mario’s current
state.
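Those heat maps are, in essence, tables of estimated Q-values, one entry per state-action pair, refined with experience. Below is a minimal tabular Q-learning sketch in Python that fills in such a table on a toy corridor; the environment, rewards and hyperparameters are illustrative assumptions rather than anything taken from the Mario example.

import random
from collections import defaultdict

# Toy corridor: states are positions 0..10, the goal is state 10.
# Actions: -1 (step left) or +1 (step right). Reaching the goal pays +10,
# every other step costs -1.
ACTIONS = [-1, +1]
GOAL = 10

def step(state, action):
    next_state = max(0, min(GOAL, state + action))
    if next_state == GOAL:
        return next_state, 10.0, True
    return next_state, -1.0, False

# The "heat map": a table of Q-value estimates, one per state-action pair,
# starting at a default of 0 and adjusted with experience.
Q = defaultdict(float)
alpha, gamma, epsilon = 0.1, 0.9, 0.2   # learning rate, discount, exploration rate

for episode in range(500):
    state, done = 0, False
    for _ in range(1000):                # cap the episode length
        # Epsilon-greedy: mostly follow the current heat map, sometimes explore.
        if random.random() < epsilon:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[(state, a)])
        next_state, reward, done = step(state, action)
        # Q-learning update: nudge the estimate toward the observed reward
        # plus the discounted value of the best action from the next state.
        best_next = max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = next_state
        if done:
            break

# Read the heat map greedily: the "hottest" action in each state.
print([max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(GOAL)])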
Further Reading
RL Theory
Lectures
Survey Papers
Foundational Papers
• Monte Carlo:
• Temporal-Difference:
• Hierarchical RL
Traditional Games
Computer Games
Online Demos
• Real-world demonstrations of Reinforcement Learning
• Deep Q-Learning Demo - A deep Q-learning demonstration using ConvNetJS
• Deep Q-Learning with TensorFlow - A deep Q-learning demonstration using Google TensorFlow
• Reinforcement Learning Demo - A reinforcement learning
demo using reinforcejs by Andrej Karpathy