A Beginner's Guide to Deep Reinforcement Learning
skymind.ai
In the feedback loop above, the subscripts denote the time steps t and t+1, each of which refers to a different state: the state at moment t, and the state at moment t+1. Unlike other forms of machine learning, such as supervised and unsupervised learning, reinforcement learning can only be thought about sequentially, in terms of state-action pairs that occur one after the other.
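To make this loop concrete, here is a minimal sketch of a single episode in Python; the toy step function and the random policy are illustrative placeholders rather than part of any particular library.

import random

# One episode of the agent-environment feedback loop. At each time step t the
# agent observes state s_t and picks action a_t; the environment answers with
# reward r_(t+1) and the next state s_(t+1).

def step(state, action):
    """Toy environment: a walk along positions 0..10, with the goal at 10."""
    next_state = max(0, min(10, state + action))
    reward = 1.0 if next_state == 10 else -0.1   # small cost per step, bonus at the goal
    done = next_state == 10
    return next_state, reward, done

state, total_reward, done = 0, 0.0, False
for t in range(1000):                            # cap the episode length
    action = random.choice([-1, +1])             # placeholder policy: act at random
    state, reward, done = step(state, action)    # (s_t, a_t) -> r_(t+1), s_(t+1)
    total_reward += reward
    if done:
        break
print("episode ended after", t + 1, "steps with return", round(total_reward, 1))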
• In video games, the goal is to finish the game with the most
points, so each additional point obtained throughout the
game will affect the agent’s subsequent behavior; i.e. the
agent may learn that it should shoot battleships, touch coins
or dodge meteors to maximize its score.
• In the real world, the goal might be for a robot to travel from point A to point B, and every inch the robot is able to move closer to point B could be counted as points.
Here’s an example of an objective function for reinforcement
learning; i.e. the way it defines its goal.
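A common way to write such an objective is as the expected sum of discounted rewards collected while following a policy \pi:

J(\pi) = \mathbb{E}_{\pi}\left[ \sum_{t=0}^{\infty} \gamma^{t} \, r_{t+1} \right], \qquad 0 \le \gamma < 1

where the discount factor \gamma weights immediate rewards more heavily than distant ones, and the expectation runs over the trajectories of states, actions and rewards generated by following \pi. The agent's job is to find the policy that maximizes this quantity.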
(In fact, deciding which types of input and feedback your agent should
pay attention to is a hard problem to solve. This is known as domain
selection. Algorithms that are learning how to play video games can
mostly ignore this problem, since the environment is man-made and
strictly limited. Thus, video games provide the sterile environment of
the lab, where ideas about reinforcement learning can be tested.
Domain selection requires human decisions, usually based on
knowledge or theories about the problem to be solved; e.g. selecting
the domain of input for an algorithm in a self-driving car might
include choosing to include radar sensors in addition to cameras and
GPS data.)
The above image illustrates what a policy agent does, mapping a state
to the best action.
If you recall, this is distinct from Q, which maps state-action pairs to rewards.
To be more specific, Q maps state-action pairs to the highest
combination of immediate reward with all future rewards that might
be harvested by later actions in the trajectory. Here is the equation
for Q, from Wikipedia:
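In its Q-learning form, the estimate for a state-action pair is updated after every step:

Q^{\mathrm{new}}(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \, \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]

where \alpha is the learning rate and \gamma is the discount factor. The bracketed term is the difference between the new estimate, r_{t+1} + \gamma \max_a Q(s_{t+1}, a), and the old estimate Q(s_t, a_t), so each observed transition nudges Q toward the immediate reward plus the discounted value of the best action available in the next state.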
Footnotes
1) It might be helpful to imagine a reinforcement learning algorithm in
action, to paint it visually. Let’s say the algorithm is learning to play
the video game Super Mario. It’s trying to get Mario through the game
and acquire the most points. To do that, we can spin up lots of
different Marios in parallel and run them through the space of all
possible game states. It’s as though you have 1,000 Marios all
tunnelling through a mountain, and as they dig (e.g. as they decide
again and again which action to take to affect the game environment),
their experience-tunnels branch like the intricate and fractal twigs of a
tree. The Marios’ experience-tunnels are corridors of light cutting
through the mountain. And as in life itself, one successful action may
make it more likely that successful action is possible in a larger
decision flow, propelling the winning Marios onward. You might also
imagine, if each Mario is an agent, that in front of him is a heat map
tracking the rewards he can associate with state-action pairs. (Imagine
each state-action pair as having its own screen overlaid with heat from
yellow to red. The many screens are assembled in a grid, like you
might see in front of a Wall St. trader with many monitors. One action
screen might be “jump harder from this state”, another might be “run
faster in this state” and so on and so forth.) Since some state-action
pairs lead to significantly more reward than others, and different kinds
of actions such as jumping, squatting or running can be taken, the
probability distribution of reward over actions is not a bell curve but
instead complex, which is why Markov and Monte Carlo techniques are
used to explore it, much as Stan Ulam explored winning Solitaire
hands. That is, while it is difficult to describe the reward distribution
in a formula, it can be sampled. Because the algorithm starts ignorant
and many of the paths through the game-state space are unexplored,
the heat maps will reflect their lack of experience; i.e. there could be
blanks in the heatmap of the rewards they imagine, or they might just
start with some default assumptions about rewards that will be
adjusted with experience. The Marios are essentially reward-seeking
missiles guided by those heatmaps, and the more times they run
through the game, the more accurate their heatmap of potential future
reward becomes. The heatmaps are basically probability distributions
of reward over the state-action pairs possible from the Mario’s current
state.
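Those heat maps are, in essence, tables of estimated Q-values, one entry per state-action pair, refined with experience. Below is a minimal tabular Q-learning sketch in Python that fills in such a table on a toy corridor; the environment, rewards and hyperparameters are illustrative assumptions rather than anything taken from the Mario example.

import random
from collections import defaultdict

# Toy corridor: states are positions 0..10, the goal is state 10.
# Actions: -1 (step left) or +1 (step right). Reaching the goal pays +10,
# every other step costs -1.
ACTIONS = [-1, +1]
GOAL = 10

def step(state, action):
    next_state = max(0, min(GOAL, state + action))
    if next_state == GOAL:
        return next_state, 10.0, True
    return next_state, -1.0, False

# The "heat map": a table of Q-value estimates, one per state-action pair,
# starting at a default of 0 and adjusted with experience.
Q = defaultdict(float)
alpha, gamma, epsilon = 0.1, 0.9, 0.2   # learning rate, discount, exploration rate

for episode in range(500):
    state, done = 0, False
    for _ in range(1000):                # cap the episode length
        # Epsilon-greedy: mostly follow the current heat map, sometimes explore.
        if random.random() < epsilon:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[(state, a)])
        next_state, reward, done = step(state, action)
        # Q-learning update: nudge the estimate toward the observed reward
        # plus the discounted value of the best action from the next state.
        best_next = max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = next_state
        if done:
            break

# Read the heat map greedily: the "hottest" action in each state.
print([max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(GOAL)])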
Further Reading
RL Theory
Lectures
Survey Papers
Foundational Papers
• Monte Carlo:
• Temporal-Difference:
• Hierarchical RL
Traditional Games
Computer Games
Online Demos
• Real-world demonstrations of Reinforcement Learning
• Deep Q-Learning Demo - A deep Q-learning demonstration using ConvNetJS
• Deep Q-Learning with TensorFlow - A deep Q-learning demonstration using Google TensorFlow
• Reinforcement Learning Demo - A reinforcement learning
demo using reinforcejs by Andrej Karpathy