Reinforcement Learning

The document provides an overview of a course on reinforcement learning algorithms. It introduces the instructor, Janani Ravi, and their background. The course will cover basic reinforcement learning principles, taxonomy, and specific techniques like Q-learning and SARSA. Students will learn how to model environments as Markov decision processes and use techniques like dynamic programming, temporal difference learning, and SARSA to find optimal policies. The goal is for students to understand reinforcement learning techniques and be able to implement basic RL algorithms.


Course Overview

Hi. My name is Janani Ravi, and welcome to this course on Understanding Algorithms for
Reinforcement Learning. A little about myself: I have a master's degree in Electrical
Engineering from Stanford and have worked at companies such as Microsoft, Google, and
Flipkart. At Google, I was one of the first engineers working on real-time collaborative
editing in Google Docs, and I hold four patents for its underlying technologies. I currently
work on my own startup, Loonycorn, a studio for high-quality video content. In this course,
you will learn basic principles of reinforcement learning algorithms, RL taxonomy, and
specific policy search techniques such as Q-learning and SARSA. We'll start off by
understanding the objective of reinforcement learning to find an optimal policy, which allows
agents to make the right decisions to maximize long-term rewards. RL has a wide variety of
use cases, such as optimizing trucking routes to conserve fuel or finding the best moves to beat
an opponent in chess. We'll study how to model the environment using Markov decision
processes so that RL algorithms are computationally tractable. We'll then study dynamic
programming, an important technique used to memoize intermediate results, which simplifies
the computation of complex problems. We'll understand and implement policy search
techniques such as temporal difference learning, also called Q-learning, and SARSA, which
help converge to an optimal policy for our RL algorithm. We'll then study reinforcement
learning platforms, which allow us to prototype and develop our policies. We'll work
with both Q-learning and SARSA techniques on OpenAI Gym. At the end of this course, you
should have a solid understanding of reinforcement learning techniques, Q-learning and
SARSA, and be able to implement basic RL algorithms.

Understanding the Reinforcement Learning Problem


Module Overview

Hi, and welcome to this course on Understanding Algorithms for Reinforcement Learning.
When a student gets started with machine learning techniques, they typically work on
supervised and unsupervised techniques, techniques such as classification, regression,
clustering, and dimensionality reduction. Reinforcement learning differs from supervised and
unsupervised learning. In fact, reinforcement learning is used for creating agents that know
how to explore an uncertain environment. Often in the real world, there is no training data for
your machine learning algorithm to work with, either labeled or unlabeled. This is where
reinforcement learning comes in. You'll train a model to make decisions where you have no
idea initially how those decisions will turn out. The job of your model, or agent in this case,
is to learn about the uncertain environment until it's known. Your agent in an RL technique is
the decision maker, and the decision maker needs to choose his actions appropriately so that
rewards are maximized. Your model learns by getting positive reinforcement in the form of a
reward or negative reinforcement for every action in every state. The objective of a
reinforcement learning algorithm, or the model as we call it, is to determine a policy for such
decision making. This policy is what will drive the decisions or the actions of our algorithm,
and the output of reinforcement learning is basically a series of actions that have been created
using this policy. The output of a classification problem might be labels. The output of a
regression problem is predictions. The output of a reinforcement learning algorithm is a series
of actions.
Prerequisites and Course Overview

Let's see some of the prereqs that will help you make the most of this course. This course
assumes that you have some familiarity with traditional machine learning models, either
supervised or unsupervised models. This will help because it will allow you to put
reinforcement learning in context. If you haven't been exposed to any kind of machine
learning before, here are some courses on Pluralsight that might help you out. How to Think
About Machine Learning Algorithms and Understanding Machine Learning with Python.
Both of these are beginner courses on machine learning and should help you get warmed up
for this course. This course assumes that you're very comfortable programming in Python. All
the code in this course will be written using Python 3. All the code in this course will be
written using Jupyter Notebooks. If you're not familiar with Jupyter, this is something you
should get set up. Jupyter is an interactive shell on the browser for Python and is a must have
IDE for any analyst or engineer working in Python. In this course, we'll assume that you have
a basic understanding of machine learning, but really, we don't go into any of the other
machine learning techniques here. Reinforcement learning is completely different and, in
fact, is an area of research in its own right. We'll start this course off with an introduction to
reinforcement learning. We'll talk about the basic concepts that underlie RL, and we'll talk
about the differences between RL and other machine learning techniques. We'll talk about the
basic elements of a reinforcement learning problem and how we use the Markov decision
process to model the environment in order to formulate a policy. In the next module, we'll get
a little more hands on with the basic concepts that underlie reinforcement learning. We'll talk
about dynamic programming first, which is a standard technique used to solve many
reinforcement learning algorithms. We'll then talk about Q-learning techniques, temporal
difference learning and SARSA specifically. The next module is all about using
reinforcement learning platforms for actual implementation. We'll talk about why we need a
gym environment in order to try out your reinforcement learning prototypes. We'll use RL
techniques in order to see how we can cross a frozen lake to retrieve a frisbee without falling
into a hole and how we can balance a pole on a cart, all using OpenAI Gym.

Supervised and Unsupervised Machine Learning Techniques

If you've worked with machine learning so far, chances are you've used either a supervised or
an unsupervised technique. Supervised learning algorithms have labels associated with the
training data, and these labels are used to correct your model parameters. These labels on the
training data give your model an idea of what is correct, and it tries to fix itself so that its
predictions are more correct. The training data that we use with unsupervised learning
algorithms is not labeled. The model has to be set up right in order to learn structure and
underlying patterns in data. Supervised learning techniques are the most common and
typically are the first thing that you learn when exposed to machine learning. In supervised
learning, the ML algorithm focuses on the attributes of each individual data point. These
attributes are called features. Every training instance or data point is made up of a list or a
vector of such features, and this is called a feature vector. The input to a supervised ML
model is a feature vector, and the output is some kind of prediction, a classification or a
regression. These feature vectors are typically called x variables. The output of a supervised
learning model is a label. This is the prediction that your machine learning algorithm tries to
make. And there are two broad categories of labels possible. Categorical values are typically
the output of any classification problem. Categorical values are discrete outputs, outputs
which fall into a certain category such as days of the week, months of the year, true or false,
cat or dog. Your model can also predict values in a continuous range in between some start
and end values. This is typically the output of a regression model, for example, the price of a
house in San Francisco, the price of Google stock next week, and so on. In machine learning
parlance, these output values of your model are called the y variables. Supervised learning
techniques make the assumption that the input x, that is your feature vector or x variable, is
linked to the output y by some function f, and the objective of the supervised technique is to
learn what f is. If you're working on a linear regression problem, which tries to draw a line
through your input data, you're specifying up front that this function f, which links the
features and the labels, is linear, which means it's of the form y = Wx + b. With the advances
made in machine learning to date, techniques such as neural networks can learn or reverse
engineer pretty much anything, any relationship which links the input x variables to the
corresponding y labels. This function f, which links the input to the output, is typically called
the mapping function. Supervised learning techniques learn to approximate this mapping
function so that, for new values of x, we can predict y. This mapping function is constantly
improved and tweaked using the training data that we have available. The most common
supervised learning techniques out there today are regression and classification. Regression,
given the values of feature x, will try to find the predicted output y where y is in a continuous
range of numbers. Classification problems, on the other hand, try to figure out what category
the input data falls into, true or false, fish or mammal, cat or dog. As opposed to supervised
learning techniques, unsupervised techniques only have the underlying data to work with.
There are no corresponding y variables or labels on this data. Unsupervised learning involves
setting up your model correctly so that it has the ability to learn structure or patterns in the
underlying data. These algorithms are set up in such a manner that they're able to self-
discover the patterns and structure in the data without the help of any labeled instances. One
of the most common unsupervised learning algorithms that you might have worked with is
clustering. Clustering is used to find logical groups in the underlying data. For example,
when you cluster user data, you might want to find all users who like the same music in the
same cluster. This will allow you to target specific ads to those users. There are many popular
clustering algorithms out there, for example, k-means clustering, mean shift clustering,
hierarchical clustering, and so on. Another common unsupervised learning technique is
autoencoding for dimensionality reduction. Typically, when you work with data sets, you
have a huge number of features. Not all of these features might be significant, and some of
these features might be correlated with each other. This is where you'd use dimensionality
reduction in order to find the latent factors that drive your data. Principal component analysis,
or PCA, is a classic technique that you'd use for dimensionality reduction. This is an
unsupervised technique.
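To make the idea of a mapping function concrete, here is a minimal sketch, not taken from the course, that fits the linear form y = Wx + b with NumPy; the data and the variable names are purely illustrative.

```python
import numpy as np

# Illustrative training data: x is the feature, y is the continuous label.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 3.0 * x + 2.0 + np.random.normal(scale=0.1, size=x.shape)  # noisy y = Wx + b

# Learn the mapping function f(x) = Wx + b with a least-squares fit.
W, b = np.polyfit(x, y, deg=1)

# Use the learned mapping function to predict y for an unseen x.
print(f"W ~ {W:.2f}, b ~ {b:.2f}, prediction for x = 6: {W * 6 + b:.2f}")
```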

Introducing Reinforcement Learning

With that basic introduction out of the way, we are now ready to get started with
reinforcement learning. You can see an aphorism on screen here. If you ask the wrong
question, you will never get the right answer. Reinforcement learning is all about asking a
number of different questions, that is exploring, and then figuring out which of the answers
are correct, and then training an algorithm to make decisions based on these right answers.
An important characteristic of the supervised and unsupervised learning techniques that we
spoke of earlier is that the value of x is known. X represents the features in our training data. X may
or may not have y labels associated with it; however, x is known. It's not an unknown
environment. Supervised and unsupervised machine learning models are not geared to work
in an unknown environment where x is unknown when that type of data has never been
encountered before by that machine learning model. And this unknown environment is where
reinforcement learning operates. RL trains decision makers to take actions to maximize
rewards in an uncertain environment. That is a mouthful here, but let's parse it. Let's break it
down. Decision makers are a principal concept in a reinforcement learning algorithm.
Decision makers are essentially software programs which take a series of actions based on
decisions. Decision makers are also often called agents. So instead of a machine learning
model or model parameters, you'll speak of reinforcement learning agents. The agent can be a
car that's learning to self-drive or a programmatic model that's learning to trade. The actions
in RL refer to the decisions that the agent has taken. The output of reinforcement learning is a
set of actions, do this, go left, go right, jump higher, rather than a set of predictions. All of
these actions are geared towards a desirable outcome or an end. How long can your self-
driving car go without an accident? How can you maximize your returns from trading? These
are our desired outcomes. These actions are determined using a reinforcement learning
algorithm, and this algorithm is called your agent's policy. Your agent uses this policy in
order to take actions or make decisions. A self-driving car might have a policy: if a human is
within 5 meters of me, hit the brakes. An action for your trading algorithm might be: if the
market falls below a certain level, buy. All of these actions that your agent takes have an
objective. These actions must be optimized to earn rewards or to avoid punishment, so you
can have positive reinforcement or negative reinforcement. Your stock trading agent is
rewarded whenever it makes money, and it's punished when it loses money. All of these RL
agents operate in some kind of external environment, and it is this external environment that
imposes these rewards and punishments. Let's say your agent is a robot that's learning to
walk. The external environment is the plane or the surface on which it's walking. If your
agent is a thermostat, the external environment is the temperature or the electricity bill,
humans fiddling with the temperature knob, and so on. An important part of reinforcement
learning is the fact that the environment is uncertain or unknown. It's very complex. If your
robot is learning to climb a hill, it has no idea of what pitfalls it might find. What if your
robot uses this tree branch to swing itself up? Will the branch break, or will it be able to get
higher up on the hill? Unless it's encountered this particular tree and branch before, it has
absolutely no idea, and this is where the training for your agent comes in. The decision
maker, or the agent, needs to be trained in order to explore this uncertain environment. Take
various routes to see which is the best way to get up this hill. Is it even possible? This
exploration should combine both caution and courage. It needs to explore new paths so that it
can find new better ways, but it also might need to stick to familiar paths at some point in
order to ensure that it makes progress towards its goal. An overly adventurous agent might
learn more about the environment, but it's also likely to fail fairly often. An overly cautious
agent might learn very little about the environment. It might find one way up the hill, but it's
possible that there are other better ways which it hasn't explored. RL agents have to maintain
a tricky balance in order to maximize their rewards. Here are the principal elements in any
reinforcement learning algorithm. We have the agent, which is the decision maker that has to
be trained to make the right decision to achieve our objectives. The decision maker is
responsible for observing the environment and understanding the current state of the
environment. A combination of observations make up the current state, and different actions
can be taken based on what that state is. This decision maker is responsible for taking actions
based on the policy that it has learned. The policy that an agent follows comes from the
environment that the agent is in. The objective of the agent is to maximize its rewards. Every
action that an agent takes in a particular state has an associated reward or punishment.
Depending on the policy that the agent follows, the agent knows what decision it has to take.
If you change the policy, the agent's actions will also change. If your tree-climbing robot has
a policy that it should always use tree branches in order to climb uphill, it will do so. Policies
are directly determined by the environment that your agent operates in. If the environment
changes, the policy also changes. For example, if your robot is climbing a hill where the tree
branches are strong and they don't tend to fall, then it'll always use trees to climb or pull itself
up. But if your robot is moved to a different hill, then its policy might be different. Maybe
that hill doesn't have any trees. Maybe it has only weak shrubs. Any agent needs to explore
an uncertain environment before a policy can be generated for it.
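The agent-environment loop just described can be sketched in a few lines of Python. This is an illustrative sketch rather than course code; it assumes the classic OpenAI Gym API (env.reset and env.step returning state, reward, done, info) and uses the FrozenLake-v0 environment mentioned later in the course, with a random policy standing in for a learned one.

```python
import gym

# Assumes the classic Gym API; FrozenLake-v0 is the environment used later in this course.
env = gym.make('FrozenLake-v0')

state = env.reset()                 # the agent observes the initial state of the environment
done = False
total_reward = 0.0

while not done:
    action = env.action_space.sample()              # placeholder policy: act at random
    state, reward, done, info = env.step(action)    # environment returns the new state and a reward
    total_reward += reward                          # accumulate rewards over the episode

print("Cumulative reward for this episode:", total_reward)
```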

Reinforcement Learning vs. Supervised and Unsupervised Learning

Now that we've got a fair understanding of what reinforcement learning is all about, let's
quickly compare and contrast it with supervised and unsupervised techniques. The objective
of a reinforcement learning algorithm is to choose the best set of actions that takes the agent
towards the required goal. Supervised techniques make predictions or classifications based on
training data. Unsupervised techniques are used to simplify or find patterns in the underlying
data. Reinforcement learning works in an unknown or uncertain environment. In fact, the
agent has no idea what it's in for until it starts exploring this environment. When we work
with other ML techniques, the environment is known. The training data provides a sampling
of what is out there in the environment. The environment is the x variables that we feed into
our model. The training process of an RL agent involves exploring the environment. It goes
in different directions and figures out what exactly lies in that direction. Is it positive
reinforcement or negative reinforcement? The training process in supervised learning
techniques involves feeding a whole corpus of data to the machine learning model, tweaking
its parameters so that it makes robust and correct predictions. For unsupervised techniques,
an explicit training phase is entirely absent. The objective of training a reinforcement learning
agent is to find the right policy that works for the particular environment that the agent has
explored. Once we have the best policy, this policy will drive the actions of the agent in that
environment. For supervised techniques, the training process involves finding the best model,
tweaking the model parameters so that it best fits the underlying data. In reinforcement
learning, the reward that you get for a particular action explicitly depends on previous
actions. Let's say that the explicit action of your smart thermostat is to make the temperature
higher. But if the room is already warm, making the temperature even higher might involve a
punishment, not a reward. But if the room is freezing cold, the reward for increasing the
temperature will be very high. This reward will diminish as you make the temperature higher
and higher and make the room warmer. In other ML techniques, the individual data points are
completely independent of one another. There is no explicit dependency.

Modeling the Environment as a Markov Decision Process

What must have been fairly clear to you in the last few clips is the fact that the environment
in which a reinforcement learning agent operates is critical, and we need to model this
environment. Environments are complex. Modeling environments is hard, which means we
need to use some kind of simplification technique. We model the environment as a Markov
decision process. The Markov property states that the future is independent of the past given
the present. That means everything that we needed to learn from the past is embedded in the
present, and there is no reason for us to look backwards. We have the information in the
present state, and this present state can be used to model the future. Let's take a brief look at
the steps involved in modeling the environment as an MDP, or a Markov decision process. At
every time step, you can assume that the environment is in some state, and that state is
represented by S subscript t. There are a certain fixed variety of options available to the agent
or to the decision maker, and the decision maker can choose some action a, and this action a
will move our environment to a new state in the next time interval. The state can be depicted
by St+1. Now for having taken that particular decision a at time t, the decision maker will
receive some kind of reward. This reward is very specific to the action a that was taken to
move our state from St to St+1. The important thing to note here is that St+1 depends only on
the action a and the previous state, St. Now what we discussed here is the crux of the Markov
decision process. Let's visualize this on a timeline. We have discrete time intervals going
from t=0 to t=3. We'll visualize how states and actions look along this timeline. At any discrete
time interval t, the environment can be thought of being in some state St. If you try to
imagine the state in the real world, let's say for a walking robot, a simplified representation of
state will be simply the coordinates of the robot. The state can also contain other information
such as whether there's a hill up ahead, a hole to the left, and so on. The agent then observes
the current state and then takes some action a within this state. What this action is depends on
the policy that the agent uses, and this policy has been predetermined when the agent
explored the environment earlier. The consequence of the agent taking this action a is that the
state has now moved to a new state, St+1. This new state will have an entirely new set of
observations and conditions. The decision maker then receives some reinforcement for this,
some positive or negative reinforcement. Let's assume positive reinforcement for now. The
decision maker receives some reward for taking this action a when in state St. And this is the
Markov property. St+1 depends only on a and St. This is our model of the environment as a
Markov decision process. The future state St+1 is completely independent of the past. St-1,
St-2 are all irrelevant given that we know the present St. The simplifying assumption here is
that the current state that we are in embeds all of the information that we need to know about
the past. This Markov property is embedded in some significant factors in the state diagram
that we just set up. Notice the reward. The reward depends only on the action that the
decision maker took, the current state, and the next state. The reward is not a function that
includes previous state information. It does not include St-2, St-1, or any of the
earlier states. All the information of the reward is embedded in the current state, and this is
the simplifying assumption in the Markov decision process, which allows us to model even
very complex environments. All of the information is encapsulated in the present state. How
we got to the current state does not matter. What path we took does not matter. The Markov
decision process implies no path dependence, and this is a simplifying assumption. This is an
assumption that has worked and helped us model many real-world situations, which is why
it's used so often. Now a fair question to ask at this point is whether this Markov property is
always a realistic assumption. Consider two scenarios. In the first, a particular stock is priced
at $100 this month, but last month it was at $10. Is this the same situation as the stock being
at $100 this month when, last month, the same stock was at $1000? Clearly in the first
case, the value of the particular stock has been going up; the company is getting more
valuable. In the second scenario, the value of the company is going down; the company's
getting less valuable. Intuitively, you might say that the two scenarios are completely
different, but counterintuitively, the theory of finance says that the scenarios are equivalent.
Now this is, of course, a discussion for another day, but this is the implication of using a
Markov decision process to model the environment. We need to model the environment so
that the agent is able to explore it, finding the best possible action at every step. The MDP
greatly simplifies the exploration of the environment because it allows the use of dynamic
programming techniques. The basic assumption of the MDP is that the current state embeds
within it all the information needed in order to make decisions. The path taken to get to the
current state is immaterial. This means that for every state the reward for the best action in
that state can be cached at that state. This caching, or memoization of information, is the
underlying principle of dynamic programming. If you've seen a state before, don't recompute
the reward for the best action. Simply cache it. The simplifying assumptions that we make
because of modeling the environment as a Markov decision process allows us to use dynamic
programming, which makes policy search tractable. Finding the best policy to use in order to
make decisions in a complex environment is now computable.
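As a concrete and entirely illustrative picture of these ideas (not code from the course), the sketch below encodes a tiny Markov decision process as Python dictionaries: each state maps every available action to a (next state, reward) pair, so the next state and the reward depend only on the current state and the action taken, never on the path that led there.

```python
# A toy two-state MDP; the state names, actions, and rewards are purely illustrative.
# transitions[state][action] -> (next_state, reward) is the Markov property in code:
# nothing here depends on how we arrived at `state`.
transitions = {
    'room_a': {'stay': ('room_a', 0.0), 'move': ('room_b', 1.0)},
    'room_b': {'stay': ('room_b', 0.0), 'move': ('room_a', -1.0)},
}

def step(state, action):
    """Return the next state and the reward for taking `action` in `state`."""
    return transitions[state][action]

state = 'room_a'
for action in ['move', 'stay', 'move']:
    state, reward = step(state, action)
    print(action, '->', state, 'reward:', reward)
```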

Reinforcement Learning Applications

Before we move on to specific concepts, let's look at some use cases for reinforcement
learning. Reinforcement learning has been around for a really long time like other machine
learning techniques, but it's experienced a new surge of popularity as people find uses in
critical applications which involve an unknown or uncertain environment such as game
theory or multiple-decision making systems or control systems engineering. Reinforcement
learning is widely used for gaming. In fact, DeepMind, a British artificial intelligence
company, was acquired by Google in 2014, and it specializes in learning how to play video
games like a human being. You might have heard of this company when it made headlines in
2016 after its AlphaGo program beat a human professional Go player for the first time, and
later, within a few months, the same program was able to beat the world champion as well.
Reinforcement learning agents can find application in warehouses or other logistic
operations. If you want to optimize the scheduling of parcel delivery by a robot, you might
train an agent, which is rewarded when it minimizes the drop-off time for parcels or reduces
fuel consumption. On the other hand, you might punish this agent when it misses a delivery,
for example. Or you might be interested in playing the stock market, and you want to set up
an RL agent which optimizes the trades made by our programmatic trader. This robotic agent
will seek to maximize profits and minimize volatility. It'll be rewarded for profits and
penalized for losses. Or you might use an RL agent for better temperature control using a
smart thermostat. The objective of this agent is to maintain temperature, as well as save
energy. This agent is rewarded when a human does not tweak the system and the
optimal temperature is maintained and energy consumption is reduced. It might be penalized
whenever a human has to adjust the temperature manually. Or you might train an RL agent
for something that is more fun. You want to optimize the songs played by an automated DJ.
The objective is to maximize the energy at a party. In this case, you can see that measuring
reward is hard. You might have to come up with something creative, and this is where we
come to the core of any reinforcement learning problem. Modeling the environment. This is
the key to generating a good policy that the agent can use to make decisions. In reinforcement
learning, agents are typically trained to work on two broad categories of tasks. Episodic tasks
have a clear end state such as winning a chess game or reaching a particular target, finding
the best way to get to a destination in traffic, and so on. These are episodic tasks. An RL agent
can also be used to work on continuous tasks. Continuous tasks have no clear end state. Stock
market trading. You want to maximize your profits, but there's no real end. You might trade
forever. Music selection can also continue forever. There is no end state or winning state. A
smart thermostat maintaining the temperature of your home, also continuous.

Understanding Policy Search

Once you've modeled your environment, the next step is to perform a policy search algorithm
to find the best policy for your agent to make decisions in this environment. First, let's start
off by considering the basic elements of any reinforcement learning problem. There is the
reward, which is a favorable result awarded for good actions. The corollary is the negative
rewards, or the punishments, for bad actions. There is also the decision maker, or the agent,
which is our software program competing for these rewards and trying to avoid
punishments. And then we have the policy. This is the algorithm that will allow our agent to
take the right actions that will result in a reward and avoid punishments. The policy is
significant here because the policy tells the agent what needs to be done. Policy determines
action. Reinforcement learning is actually very easy to understand when you consider the real
world. Let's say you believe that honesty is the best policy. Then your policy is honesty, and
your action will be to always speak the truth no matter who is asking you the question, and
the reward that you get for this may be that you sleep better at night. Your policy drove your
action, and you slept well as a reward. You can imagine that this policy is a map or a guide
for the decision maker. Once the decision maker has a policy, he can then use this to figure
out what the next action should be. Should he be running, using a rocket, or riding a car?
Each of these actions may be associated with a particular reward. The magnitude of the
reward will depend on the action and the current state of the environment. These rewards can
be negative rewards or punishments as well. A hill-climbing robot steps in the wrong
direction. It might fall off a cliff. As you can see, this policy, which is an agent's guide or
navigator, is pretty important. Where does this policy come from? It is the agent's exploration
of the environment, which determines policy. It is the environment which tells you what
actions under what conditions are good or bad. The environment rewards some actions and
punishes others. The environment is largely uncertain. That is, before an agent explores the
environment, there is no way to know in advance what actions are good or bad. The decision
maker is responsible for observing the environment and learning from it. The decision maker
explores the environment by performing a number of actions and seeing whether it receives
rewards or punishments, and it learns to modify its behavior accordingly. You can think of
training an agent like toilet training a dog. If you want your dog to poop outside, you'll
reward him when he does, and you'll scold him when he poops in your house. The decision
maker that we saw previously uses this policy that we got from the environment to make his
decisions. If you're racing a tortoise, maybe you just want to run. But if you have a rocket and
you want to get somewhere quickly, then you'd use the rocket. Or if traffic is clear near your
house, you'd rather drive. The decision maker here is constantly observing the environment in
order to take the right action based on a previously determined policy to get rewards. So it's
pretty clear that we need to find this policy, and this policy is determined using a policy
search algorithm, and the responsibility for this algorithm is to find the policy that the
decision maker should follow. We create our reinforcement learning algorithm in order to
find this policy. This policy is a function P that takes in the current state of the environment
represented by S and returns an action a, the action a that the decision maker must take. The
objective of this function P should be such that the policy maximizes cumulative rewards that
an agent collects. Let's say you're playing a game of chess. The environment is the
chessboard, and the current state is the position of every piece on that board. Every action that
each of the participants takes involves moving a piece on the board. This will lead us to a new
state. Every action that is taken in a particular state of the world is associated with a reward.
Each of these rewards will be different, and the optimal policy tries to maximize these
cumulative rewards that an agent will collect, the current reward, as well as future rewards.
An optimal policy has to consider cumulative rewards as opposed to immediate rewards
because you might find that in certain cases the immediate reward might be huge, but in the
long run, it doesn't pay off. So if you're playing a game of chess and if your next move gains
a pawn, but ignores a threat to your king, your game will be over soon. When you're looking
for the right policy, you need to balance the immediate, as well as future rewards. Just like in
life, delayed gratification is important here. When we modeled our environment as a Markov
decision process, at time t the reward for the next move was represented as you see on screen.
For the move after that, we simply change the suffixes, which define the discrete time
intervals, and this is repeated over an infinite horizon as the game continues. The cumulative
reward for the period over which your agent runs is the sum of all the rewards that you get
based on your agent's actions from now to infinity. There is a significant caveat here. Make
sure that you discount future rewards by the discount factor gamma. This discount factor is
applied due to the uncertainty of future rewards. You can be very sure as to what happens in
the next time interval. But in the future, you're less sure, which is why this discount factor
applies. As you go further into the future, the rewards are discounted more heavily. As the
uncertainty increases, you'll tend to discount the rewards even more. So when you're finding
the right policy for your agent, you'll try to maximize this formula that you see here on
screen. First is the current reward. No discount factor is applied to the current reward. The
rewards that you might expect in future time periods are more uncertain, which is why you
apply a discount factor to them. The further in the future a reward, the more you discount it.
This discount factor gamma is a value between 0 and 1. The suffix a that you see here on
screen should make it clear that every reward depends on the action that you've taken at that
time instance. We've spoken before that the output of a policy search algorithm is a policy,
which is a function that takes in a state S and tells you the best action to perform for that
state, which means your rewards depend on your policy. And what you see on screen gives us
a mathematical formula, which allows us to estimate the best possible policy for a particular
environment. The process of solving this optimization problem is called policy search. We
want to find the best policy to drive our actions to maximize cumulative rewards.
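The on-screen formula referred to above is not reproduced in this transcript; a standard way to write the discounted cumulative reward it describes, where each reward depends on the action a_t = P(S_t) chosen by the policy at that time step, is the following (this is the usual textbook form, stated here as an assumption about what the slide showed):

```latex
R = r_{a_t} + \gamma\, r_{a_{t+1}} + \gamma^2\, r_{a_{t+2}} + \dots
  = \sum_{k=0}^{\infty} \gamma^k\, r_{a_{t+k}}, \qquad 0 \le \gamma \le 1
```

Policy search then amounts to finding the policy P that maximizes the expected value of this discounted sum.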

Policy Search Algorithms

Our objective now is to find the best policy for our environment. This will be an estimate, of
course. We estimate the best policy using policy search algorithms. There are three broad
categories of algorithms. The first is, of course, brute force methods, the second is policy
gradient methods, and the third is value function methods. Let's study each of these briefly
first. Brute force methods, as the name implies, require us to evaluate every possible policy
over every possible state. Now we've already spoken about how complex and uncertain
environments can be, which means there are infinite states. We'll then pick the policy with
the best expected reward. As you might expect, brute force methods don't perform very well.
They scale poorly as state space increases. Another less obvious drawback of the brute force
method is that you might measure the best policy using some kind of averaging mechanism,
but it's possible that this best policy has a high variance, which means you'll end up with the
wrong policy. Policy gradient methods use some kind of exploration of the state space in
order to find the best possible policy. The state space is basically all the states that an
environment can be in. Policy gradient methods use a gradient ascent mechanism to find the
best possible policy. This is conceptually very similar to the gradient descent that you might
have used if you've studied neural networks. You'll typically use policy gradient methods to
find the best policy if you're using neural networks in your reinforcement learning
algorithm. We've spoken earlier about how neural networks can be used to learn or reverse
engineer arbitrarily complex functions, which connect the input variables with the y values.
Similarly, when you use neural networks for policy search, neural networks will constantly
have the agent play the game over and over again, explore the environment, and learn the
optimal policy. Using policy gradient methods for policy search has some good
characteristics. For example, you do not need to model the environment as a Markov decision
process, and you can use neural network technologies, such as TensorFlow. However, using
neural networks makes it hard to model the reinforcement learning problem. It's complicated
to understand, which is why they are not quite as popular as value function methods. The
third kind of algorithm for policy search are value function methods, and these are what we'll
study in this course. Here we need to explicitly model our environment as a Markov decision
process. Value function methods are popular, they are robust, and they are computationally
less intensive. There are several implementations of value function methods. In this
course, we'll specifically focus on two of these techniques, Q-learning and SARSA. And this
brings us to the end of this introductory module on reinforcement learning. We learned how
reinforcement learning differs from traditional machine learning methods such as supervised
and unsupervised learning. We've seen that reinforcement learning works very well in an
uncertain environment where the environment is complex and it's hard to see how to take the
right decisions and what the right decisions are. The first step in any reinforcement learning
algorithm is to model the environment so that it is more tractable and easier to understand.
We model it typically as a Markov decision process where every current state embeds all the
information from past states. Once we model the environment, we'll use this environment to
determine the best policy that maximizes cumulative rewards across the lifetime of the agent.
We've seen that finding the best policy involves a policy search algorithm and that there
exists different possible approaches to policy search. In the next module, we'll focus
specifically on the Q-learning and SARSA techniques for policy search, both value function
methods.

Implementing Reinforcement Learning Algorithms


Module Overview

Hi, and welcome to this module where we'll see how we can go about implementing
reinforcement learning algorithms. We'll start off by studying the basic concepts of dynamic
programming. Dynamic programming is a key technique that is used to implement
reinforcement learning. Dynamic programming allows us to cache information at every state,
thus making exploring the environment tractable. In this module, we'll then move on to
studying Q-learning, which is a technique used in reinforcement learning to find the best
policy to use to take decisions within a certain environment. In order to use Q-learning
techniques, we'll first need to model our environment as a Markov decision process where all
the information about a particular state is embedded in that state itself. How we got to that
state is irrelevant. There are a bunch of different techniques that can be used to implement Q-
learning. We'll focus on understanding two specific techniques, the temporal difference
method and SARSA. We'll also get hands on with Q-learning here. We'll use the temporal
difference method to find the shortest path from a source to a destination node on a graph.

Dynamic Programming

Before we dive into policy search using Q-learning, let's talk briefly about dynamic
programming. Dynamic programming is a programming technique where we take problems
that are very computationally intense and reduce their computational
complexity by using caching or memoization. If you've worked with dynamic programming
before, you know that it's typically used with problems that are recursive in nature, which
involve the same set of calculations over and over again. Dynamic programming is a standard
technique used in reinforcement learning. Q-learning techniques, which we'll talk about in
detail in this module, are based on dynamic programming. Here we cache the reward
associated with every action in a particular state and use these cached values to make state
space exploration tractable. Let's understand what dynamic programming really is by looking
at its definition. It's a mouthful, so we'll parse it and break it down. Dynamic programming is
typically used to solve very complex problems. A simple example of a complex problem
where we can use dynamic programming is calculating the factorial of an integer. We know
that 5 factorial is actually 5 x 4 x 3 x 2 x 1. If you look closely and apply your mind, you'll
see that 5 factorial can be decomposed into a simpler subproblem. It's actually 5 x 4 factorial.
Dynamic programming involves decomposing complex problems into simpler subproblems.
If you consider 5 x 4 factorial, computing 4 factorial is obviously simpler. There is one
multiplication less that you have to do. You can also see that this problem is recursive in
nature. Four factorial is actually 4 x 3 factorial. This problem has been further decomposed into
simpler sub-problems. The whole idea behind dynamic programming is that you calculate all
of these complex computations exactly once, and once you have the result, you cache this
result so that it can be reused. If you have to find the factorial of all numbers starting from 1
up to 100, calculate each factorial exactly once and store these results so that they can be
reused. So if you're asked to calculate 6 factorial, reuse the results of 5 factorial and then
multiply it by 6. Make sure you cache the results of 6 factorial so that you can reuse it when
you're calculating 7 factorial. You can consider the result of each integer's factorial as state.
All the information about that state is embedded in that state itself when you cache the result
along with that state.
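Here is a minimal sketch of the factorial example with memoization, written for illustration rather than copied from the course: each factorial is computed exactly once and cached so later calls can reuse it.

```python
# Cache of previously computed factorials: each integer's result is its cached "state".
factorial_cache = {0: 1}

def factorial(n):
    """Compute n! recursively, reusing cached results wherever possible."""
    if n not in factorial_cache:
        # Decompose into a simpler subproblem, n! = n * (n - 1)!, then cache the answer.
        factorial_cache[n] = n * factorial(n - 1)
    return factorial_cache[n]

print(factorial(5))   # 120, computed and cached along the way
print(factorial(6))   # 720, reuses the cached value of 5!
```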

Demo: 8-Queens Algorithm Using Dynamic Programming, Helper Functions

Let's get hands on now to get a feel for how dynamic programming works. We'll use this
memoization technique to solve the 8-queens problem. This is a pretty standard programming
problem and often used as an interview question. The objective here is to place 8 queens on
an 8x8 chessboard so that no 2 queens threaten each other. None of the queens can kill each
other. If you haven't played chess before, queens are extremely powerful pieces in chess.
They can move diagonally, horizontally, as well as vertically, which means your 8-queen
solution requires that no 2 queens share the same row, column, or diagonal. It might be
interesting for you to pause a little bit here and see if you can actually place eight queens on
this particular chessboard. Here is one possible solution. There are many such. So here are the
conditions. There is one queen in every column of the board. There is also one queen for
every row of this board. Take a careful look at the right, as well as the left diagonal for every
queen. You'll find that no two queens share diagonals as well. This is true for all queens on
this board. Eight queens can be solved recursively. The way you'd approach this problem is
you'll try to place a queen at every column in your board, and each time you place a queen,
you'll check to see whether your board is safe. If yes, you'll move on to the next column and
place your next queen. If not, you'll backtrack to a previous column and try to change the
position of your queen so that you can find a new arrangement of the board that is safe. Let's
solve this in Python using dynamic programming. Notice that we are writing our code in a
Jupyter Notebook. This code is in Python 3. We'll use the NumPy library here. We'll use a
NumPy 2D array to represent our board. Our board is an 8x8 board, but you can extend this
to 16x16, 32x32, and so on. The board_state_memory variable is a Python dictionary which
we use to cache our results. For every state of the board when it has queens placed on it, we'll
cache the result for whether that board is safe or not so that we don't have to recalculate it
from scratch. Our board is a two-dimensional NumPy array. It's an N by N array, which is
initialized to all 0s. A value of 1 in any cell indicates that a queen has been placed at that
position. In order to help us with our caching, we'll use a helper function that allows us to
represent an N by N board in string form. This is the create_board_string function. This
function uses a nested for loop to iterate through every cell in our board and constructs a
string to represent the current state of our board. This string representation of our board is
basically a series of 0s and 1s where 0 indicates a cell that is empty, it does not have a queen,
and 1 indicates a cell which has a queen positioned there. We'll use this helper function when
we want to memoize, or cache, the state of our board. Our board is completely empty at this
point. We have no queens positioned. Let's check the state of our board_string. It consists
of all 0s: 8 by 8, that is, 64 0s. Let's make a copy of our board and position our first queen
at 0, 1. This queen is in the topmost row in the second column, and here is the string
representation of this board. This contains exactly one 1 in the second position, the second
column in the topmost row. We'll now define another helper function called is_board_safe.
This takes as its input the current chessboard with the queens positioned on it and checks to
see whether the board is safe. It checks to see whether any of the queens threaten each other.
The board is not considered safe when one queen can kill another. The first thing here, and
this is the portion which uses dynamic programming, is to get a string representation of the
current state of the board and check to see whether we've seen this board before, whether
we've seen the current configuration of queens that exist on this board. If we have, then we
don't need to explicitly check to see whether any queens threaten each other on this board.
We already have this information cached thanks to dynamic programming. We simply return
the cached value for this particular state of the board. This cached value will contain true or
false, true if the board is safe, false if any queen threatens another. If you haven't seen this
board before, then we need to perform some calculations to check whether this board is safe.
We start off by checking to see whether any two queens are in the same row. We calculate
the sum for every row in the board using NumPy. If we find that there is any row within this
board that has sum greater than 1, that means that row has more than 1 queen. Remember, we
use the integer 1 to represent the presence of a queen. If we find that a row has more than one
queen, then the current state of the board is not safe. The first thing we do here is to add to
our board_state_memory, or our cached values. We use the string representation of this board
as the key and indicate that this board is not safe. We also return False to the caller of this
function. In almost an identical manner using NumPy, we can calculate whether any two
columns in this board have more than one queen. If the column sum is greater than 1 for any
column, that means 2 queens have been placed in that column. Add to our cache that this
particular board is not safe and return False to the calling function. The next step is to check
whether any diagonal on the board has more than one queen placed on it. NumPy has some
special functions here that allow us to access the diagonal elements, so get the elements from
the lower-left to the upper-right diagonals then from the upper-left to the lower-right
diagonals. Once we have all the diagonals, for every diagonal, we can check to see whether
more than one queen is positioned on it. If yes, we'll cache this information in our
board_state_memory dictionary and then return. If all of these checks have been completed,
you can now conclude that the current state of the board is safe. None of the queens are
killing each other. We'll save that in our cache and return True.

Demo: 8-Queens Algorithm Using Dynamic Programming, Place Queens

Whenever you write helper functions, it's always good practice to test them before you move on.
Let's make a copy of the board. And on this copy of the board, let's set up two queens which
are in the same row. We'll call is_board_safe on our copy to see whether the right result is
returned, and you can see that is_board_safe returns False. We'll try this again. We'll make
another copy of the board, and this time we'll position two queens in the same column. That
is column 0. We'll then check to see whether this board is safe, and you can see that
is_board_safe returns False. Let's check the diagonals as well. We'll make another copy of the
board, and this time we set up three queens. Two of these queens are on the same diagonal,
and you can see that is_board_safe will once again return False. It has detected that two
queens are on the same diagonal and can kill one another. We are now ready to place queens
on this board. This will be a recursive algorithm. The recursive algorithm uses dynamic
programming when it calls the is_board_safe helper function. This place_queen function will
be called recursively. At each point, we pass in the current state of the board and the column
where we want the new queen to be positioned. If the column number passed in has moved
beyond the edges of the board, it's greater than equal to N, that means all N queens have been
positioned correctly, we can return True. We have successfully placed all queens. If not, we
set up a for loop that iterates through every row and tries to place the queen on each row of
that column to see whether that queen can be placed correctly. Initialize a variable indicating
that the current position of the queen is not safe, we haven't checked it yet, and then call
is_board_safe to check whether the current state of the board is safe. If it so happens that the
current queen has been placed in a safe position, then we can move on to the next column to
place the next queen. We'll do this recursively by calling place_queen once again. We pass in
the board as it stands at this point in time. We pass in the next column because that's where
the next queen has to be placed. This recursive call to the place_queen function will return
True if all the remaining queens have been placed successfully. It'll return False if it has not
managed to place all of the queens. If safe is False, that means we did not manage to place
the rest of the queens successfully. That means there might be a problem with the current
position of the current queen. We unplace the queen, and we move on to the next row to find
a new position. If safe was True, that means we managed to place all queens successfully. We
can simply break out of this loop and return True from the place_queen function. And that's
it. That's the recursive algorithm. Now if you're not used to recursion, this might take some
understanding, but I suggest you sit down with a pencil and paper and work your way
through to ensure that this works. Before we start the place_queens algorithm, let's look at the
current state of our board_state_memory. It has only three entries within it for the three tests
that we ran with different queen positions in our board. All three board states were unsafe,
which is why we have False associated with each key. Let's make a fresh start. Initialize the
board to all 0s, and start off by placing the queen in column 0. This is the first queen. Placed
is the variable that will hold the information for whether we manage to place all queens
successfully, and we did. Our recursive algorithm worked. If you check out the state of the
board, you'll find that all eight queens have been placed such that none of them kill one
another. But wait a second. We've placed all eight queens, but we didn't really use the
caching or the memoization that we set up. This is because each configuration of the board
was new. It was something that our board_state_memory saw for the very first time. We have
now a fully populated board_state_memory dictionary. If you run the place_queen algorithm
once again, you don't need to recalculate whether the board is safe or not for every position.
Chances are that we'll be able to retrieve this information from our cached state, so let's go
ahead and execute the place_queen function once again, and notice that this time we're using
cached information wherever possible. Now it's important to note here that for an 8x8 board
the program runs through very quickly, and using cached information doesn't really speed
things up. Try a 32x32 board. That's where I found the difference. Without memoization, it
ran for about 42 minutes, and with memoization, it was down to around 36 minutes.
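Again as an assumed reconstruction rather than the course's exact code, a recursive place_queen consistent with the walkthrough might look like the following; it continues the sketch above and relies on N and the is_board_safe helper defined there.

```python
import numpy as np  # N, board_state_memory, and is_board_safe come from the previous sketch

def place_queen(board, column):
    """Recursively place one queen per column; return True once all N queens are placed."""
    if column >= N:                        # moved past the last column: every queen is placed
        return True

    for row in range(N):
        board[row, column] = 1             # try placing the queen on this row
        if is_board_safe(board) and place_queen(board, column + 1):
            return True                    # this queen and all later queens are placed safely
        board[row, column] = 0             # unplace the queen and try the next row

    return False                           # no row works in this column: backtrack

board = np.zeros((N, N), dtype=int)        # fresh board, all 0s
placed = place_queen(board, 0)             # start by placing the first queen in column 0
print(placed)
print(board)
```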

Policy Search Techniques: Q-learning and SARSA

When people talk about reinforcement learning, you might have heard terms such as Q-
learning, the temporal difference method, or SARSA. Q-learning is a technique that is used to
find the optimal policy that your agent can use to make decisions about what actions to take.
If you remember the basic concepts in reinforcement learning, you have an agent, the
decision maker, that observes the environment then takes actions based on a policy, and
based on these actions, it collects rewards. Just like the output for a traditional machine
learning model can be a classification or a prediction, the output of a reinforcement learning
algorithm is a set of actions which maximize cumulative rewards. Cumulative rewards are the
rewards accumulated over the lifetime of an agent. The agent determines what action to take
at any state based on the policy, and the policy is determined by the environment. We use
algorithms in order to get an estimate of the best policy to use under different conditions, and
there are various types of policy search algorithms which exist, brute force, policy gradient
methods, and value function methods. Of these techniques, value function methods are the
most common and popular because they are the most tractable, made so using dynamic
programming. In this module, we'll focus on Q-learning techniques. There are several
implementations of Q-learning techniques. The temporal difference method is the most
popular, and that's what we term as Q-learning. The SARSA method is another
implementation of Q-learning. We explicitly refer to it as the SARSA method to differentiate
it from the temporal difference method, which we refer to as Q-learning. Let's quickly
summarize here. For any reinforcement learning agent, policy determines action,
environment determines the policy, and the environment is modeled as a Markov decision
process. The Markov property states that the future is independent of the past given the
present, and this allows us to make even very complex environments tractable. The objective
of the policy search algorithm is to find an optimal policy that will allow us to maximize
cumulative rewards over the lifetime of the agent.

Intuition Behind Q-learning

You'll find that the actual implementation of Q-learning techniques using the temporal
difference method, as well as SARSA, is very straightforward. All you need to use is the right
formula. Understanding the intuition behind the formula is what is important, and that's what
we'll study in this clip. Let's use the intuition behind these techniques to make a decision.
Should you go to grad school or start your first job? Now on one hand, if you go to grad
school, you will learn more now, and you'll start work two years on. If, on the other hand,
you start your first job, you'll join the workforce right away and start earning. You won't get
another degree though. So what's the right choice for us here? Let's solve this using a value
function method. In an earlier module, we discussed the three elements of reinforcement
learning problems. The first of these was the reward, and let's say that the reward that we
want to maximize is our lifetime savings. The reward for each of our actions is the money
that we save, and we want to maximize savings up to the age of 60. This is our cumulative
reward. The agent over here, or the decision maker of this problem, is our graduating senior
from an undergrad university. She needs to decide whether she should continue with grad
school or start on her first job. The policy that she'll follow to make this decision is basically
the algorithm to choose between these two options. This algorithm is the Q-learning
algorithm, and this is what we're seeking to intuitively understand. In order to use Q-learning,
we need to model our environment as a Markov decision process, which means we have to
divide time t into discrete intervals. Time t = 0 is when we make our decision. When we
use the MDP, we model the environment as distinct states. Now there are various state
variables, which, when considered together, represent a state. These state variables can also
be thought of as observations. For this environment and for this problem statement, let's
assume that any state S can be entirely defined by two pieces of information, the highest
education qualification that our graduating senior has received and the years of work
experience she has. At time t = 0, which is when we have to make the decision, the initial
state can be thought of as BS-EE, 0. Our student who has just graduated has received a
bachelor of science degree in electrical engineering and has 0 years of work experience. If
you phrase this a little differently, you can see that the 2 observations which make up our
initial state are BS-EE and 0 years of work ex. Now there are two possible actions that our
student can take from here. If action is equal to grad school, then the next state that the
student will reach will be an MS in computer science with 0 years of work experience. The
student now has a post grad degree, but no work ex. Or if the student chooses instead to work
at her first job, then the next state after two years will be BS-EE. The educational
qualification hasn't changed, but work ex is now 2. The next state is different based on what
action the student takes. So here is our student at the very left in the current state, and there
are two possible states to which the student can move. If the student goes to grad school, she
will be an MS-CS with 0 years of work ex. If the student starts her first job, she will continue
to be a BS-EE, but with two years of work experience. Now each of these individual states
can spawn off other states as well based on the decisions our student makes in the future. If
she continues to work for the next 10 years, she could be an MS-CS with 10 years of work
ex. Or if she decides to study further, she could be an MBA with 8 years of work ex. Or if she
had decided to start working right away, 10 years from now she could be an MBA with 10
years of work ex. Or she could have chosen to enroll in a doctoral program, and 10 years
from now she would be a PhD with 5 years of work ex. Let's say our student has been fairly
successful in her career, and many years from now she might find herself in one of these few
terminal states. She could be the CEO of a company, a vice president, or a general manager.
All of these paths could lead to one of these outcomes. Value function policies, such as Q-
learning, involve estimating rewards at these few terminal states and the probabilities of
getting to these states and then working backwards to see what decision is right for you.
Remember that our example is a simplification of how the real world functions. We're using
this to understand the intuition behind Q-learning. One more simplification here is that we'll
ignore the discount factor that we applied to future rewards. Let's say you've earned really
well as the CEO of a company, and at age 60 you're able to retire with $10 million in
savings. As a VP, your savings might be $5 million, and as a GM, your savings might be $1
million. It's pretty obvious that in the real world none of these outcomes are certain, that is
any of these paths will always be associated with a probability. So if you work backwards,
the probability weighted savings, if you're an MS-CS with 10 years of experience, is $1.58
million. You can assign random probabilities to each of the other paths and work backwards
to other states as well. We calculate the probability weighted estimate of earnings for all of
the possible intermediate states. That means we now have a number assigned to each
intermediate state. We've now worked backwards and are now at a point where we have to
make the decision between grad school and our first job. Let's take the highest estimate for
each state. If you do an MS-CS, the highest estimate is that our savings will be $2.7 million.
This, of course, assumes that after the MS we do an MBA and continue working. On the
other hand, if we started our first job, the highest estimate we have is $3. 09 million. This is
an estimate for when our student does a PhD later on in life. At this point, we have an
estimate of future rewards associated with every state that we might move to next. Each of
the actions that our student takes at this point in time is also associated with an immediate
reward or punishment. In the case of grad school, she might have a tuition payment of
$100,000, which means that her cumulative rewards will now be $2.6 million. It has been
reduced a bit. But if her action is to start her first job, she might accumulate a savings of say
$200,000. Her first job is a really well paying one. That means her cumulative
rewards will be $3.29 million. Once we know her objective and we work backwards from the
terminal states, it's pretty clear what her decision should be at this point in time. The best
possible outcome is that she has a savings of $3.29 million at age 60, which she'll get if she
starts her first job rather than going to grad school now. Remember, this is also dependent on
future decisions. The best possible outcome assumes that our student will do a PhD later in
order to get these rewards, so the optimal action at this point in time is to take up the first job.
If you go back to the state diagram that we've seen before, the student is currently at state St,
the action that she should take is to take up her first job, and that will move her to state of
St+1. If you were to represent this mathematically, this is what the equation will look like.
What we've done intuitively is to find that action that maximizes the sum of the immediate
reward and discounted future rewards. And this is the intuition behind the Q-learning
techniques that we'll study next using the temporal difference method and the SARSA
method. We want to maximize the sum of the immediate reward and discounted future
rewards.
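To make the arithmetic concrete, here is a tiny sketch of that final comparison in Python, using the numbers from the narration (values are in millions of dollars); the dictionary and variable names are mine, purely for illustration.

# Probability-weighted future estimates worked out by going backwards
# from the terminal states, as described above.
future_estimate = {"grad_school": 2.70, "first_job": 3.09}
immediate_reward = {"grad_school": -0.10,   # tuition payment
                    "first_job":   +0.20}   # savings from the first job

# Action value = immediate reward + estimated future reward.
action_value = {a: immediate_reward[a] + future_estimate[a]
                for a in future_estimate}
best_action = max(action_value, key=action_value.get)
print(action_value)   # grad_school comes to about 2.6, first_job to about 3.29
print(best_action)    # 'first_job'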

Q-learning Using the Temporal Difference Method and SARSA

Now that we've understood the intuition behind Q-learning, let's see how we calculate Q-
values and how we use them. When we implement reinforcement learning algorithms using
Q-learning, we set up something called a Q-table, with one cell for every state-action combination. The
rows of this Q-table are all the possible states in this environment, and the columns of the Q-
table are the various actions that can be taken in each state. We've modeled the environment
as a Markov decision process, but this environment is uncertain and unknown when we start
off. We'll have our agent explore the environment and populate Q-values. Q-values is the
estimate of the optimal state-action value, and this is the basis of the policy that our RL agent
uses to make decisions. Let's start off by understanding the Q-table first. Here, every state is
represented by a row in this table, and every action that's possible is represented by a column,
so this is a state-action table. A state is represented by a discrete row, but a state is actually a
combination of state variables. These state variables are also called observations. Every state
is a unique combination of these observations. This Q-table has to be populated by Q-values,
which is what drives the decision making for our agent. Our agent's job is to explore the
environment and see what rewards are available at the various points. There are many
specific techniques that an agent can use to explore the environment. Random exploration
also works. The agent can choose to move left, right, go up or down at random. During the
course of its exploration, the information that the agent collects is used to find the Q-values,
which are then used to make decisions later on. So what does this Q-value actually mean? Q-
value is the optimal state-action value. Let's say that you are in a certain state s, and you take
some action a. Then the Q-value is the cumulative long run reward that you will get for this
particular state-action combination. Q-values, as you can see from the representation, are
associated with a particular state s and an action a, which is why Q-values fill up the cells of
your Q-table. When you're in a particular state and you're making decisions, you'd like to
choose that action which has the highest Q-value because that's where you'll maximize your
long run reward. Initially, when your agent starts off its exploration, Q-values are not known.
Q-values have to be estimated during the exploration phase of the agent. While the agent is
taking different paths to its destination and figuring out what rewards and punishments it
receives along the way, it can also calculate Q-values using different methods. Popular
methods that are typically used to calculate Q-values are the temporal difference method and
the SARSA method. So when we talk of the temporal difference method, we simply say Q-
learning. The SARSA method is explicitly called out by name. We've understood the
difference between the temporal difference method, as well as the SARSA method in an
earlier clip. This is the precise mathematical formulation for this method of calculating the Q-
values. There's a lot going on here. Let's parse this. Qt+1 is the updated Q-value. This is the
Q-value for a state-action pair that we update during the exploration phase of our agent. This
particular state-action pair might have had another Q-value earlier. This is the previous Q-
value. Every action from a particular state is associated with an immediate reward. This is the
reward for action a from state s. And this last term here is basically the maximum Q-value
across all actions in the next state. We move from state s to s prime. We consider all the
actions in that state and find one where the Q-value is maximum. As in the case of any
machine learning algorithm, we also have a learning rate, which is represented by alpha. The
learning rate is the size of the movement away from your current state while exploring. If
alpha is large, you'll make large movements away from your current state. If alpha is small,
your exploration will be more limited. You'll make smaller movements away from your
current state. And here we have the discount factor, the gamma in our formula. This discount
factor tells us how much to down weigh future rewards, which are uncertain, relative to
immediate rewards, which are much more certain. If gamma is equal to 1, future rewards are
as important as immediate rewards. If it's smaller than 1, then future rewards are less
important. This was the temporal difference method. Here is the mathematical formula for the
SARSA method of Q-value calculation. SARSA is an acronym, which stands for state-action-
reward-state-action. As you can see, the formula for the SARSA method is very similar to
that of the temporal difference method, except for this term that you see at the very end. In
the SARSA method, we use the actual reward for the action that we've taken in order to
update the Q-value, not the estimated maximum reward. This brings in many subtle
differences. We'll explore that in a later clip.
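In symbols, the temporal difference update is Q(s, a) <- Q(s, a) + alpha * (r + gamma * max over a' of Q(s', a') - Q(s, a)), while SARSA replaces the max term with Q(s', a') for the action a' actually taken next. Here is a minimal sketch of both updates in code, assuming Q is a NumPy array indexed by state and action; the function names are mine, not the course's.

import numpy as np

def td_update(Q, s, a, r, s_next, alpha, gamma):
    # Q-learning (temporal difference): bootstrap off the best action
    # available in the next state, whether or not we actually take it.
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    # SARSA: bootstrap off the action we actually take in the next state.
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])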

Exploring State Space

In our example of the graduating senior that we used earlier in order to intuitively understand
Q-learning, our example was very simplistic. When we use Q-learning in the real world, we
use it for realistic state space exploration, and you should understand the differences here. In
our student example, we greatly reduced the number of states that were possible. Our state
space was small so that we could cover it exhaustively. In the real world when we use Q-
learning, state space is typically too vast to cover exhaustively. You have to make some
simplifying assumptions or leave out some portions of the state space. In order to make the
state space tractable, one technique that you'll see used in our demos is to discretize the state
space by bucketing various observations. In our simplistic example, we assume that all
transition probabilities are known. We knew the probability of our student becoming a CEO
whether she did an MS in CS, whether she did an MBA, or whether she did a PhD. The real
world doesn't come with a known set of transition probabilities. In fact, these have to be
learned when our agent performs exploration of the environment. Our agent might have to
perform the same state transitions over and over again in order to get a probability associated
with that transition. Our simplistic example also assumed that the rewards for each state-
action combination are known. This is my degree. If I work this much, this will be my
reward. This was always known. In the real world though, these rewards for state-action
combinations are unknown, which is why we need to perform exploration to populate the Q-
values in our Q-table. In our simplistic example, we did not account for discounting at all.
We assumed that future rewards are as certain as immediate rewards. We did not discount
future rewards in calculating the cumulative reward. In the real world, all mathematical
formulations of Q-learning will include a discount factor gamma. In our simplistic example,
there was no exploration phase at all. We didn't have a policy which determined how our
agent will explore the environment. Here we need an exploration policy and a learning rate,
which we use to learn from the environment. In the real world, once we have the Q-table
populated with Q-values, using the formula that we saw in the last clip, we now want to find
the action a that maximizes the sum of the immediate reward and discounted future reward.
The discount factor that we applied to this future reward is gamma. One thing that we've
mentioned so far, but we haven't studied in detail is this concept of exploring the state space.
Our agent needs to explore to see what rewards it can achieve along different paths. Let's
consider our state space to be very simple. It's represented using two variables, state variable
1 and 2. The Z-axis is the immediate reward that we get when we move from one state to
another. Let's assume that this reinforcement learning task is an episodic task, which means
there is some end state that we want to achieve. Some state represents the final goal state. This
curve that you see on screen represents the state space, and every point on this curve
represents one state. The simplifying assumption here is that we have just two dimensions in
this state. Even in that case, the state space is really vast. Let's consider two specific states in
the state space, and let's consider that there are two actions which can take us from one state
to another. Each of these actions follows a different path. Each action that takes us between
these two states will give us a different reward. Let's say that our agent wants to explore the
state space. It wants to get from the source to the destination. It'll move around in the state
space, exploring to see what it can find and what rewards it can collect. The size of individual
steps that our agent takes during exploration depends on the learning rate alpha. If the learning
rate is small, the steps will be smaller. If the learning rate is large, the steps will be longer.
Every path, as we move from one state to another, will be associated with a reward. The agent keeps
track of these rewards by discounting them backwards by gamma, the discount factor. These
discounted cumulative rewards of future states will give us the Q-values for state-action
transitions. Based on the Q-learning technique that we're using, either the temporal difference
method or SARSA, we'll populate the Q-table with these values. We'll use the mathematical
formulation for individual techniques. As our agent continues to explore the state space, we'll
get to a point where parts of the state space become known. They are mapped out, no
longer unknown. These mapped-out portions of the state space will have their Q-values
populated in our Q-table. Now let's say we have an agent poised at a particular state, and it
has a choice of actions. Given any state, our agent can then choose to stick to the parts of the
state space that it knows about, it knows exactly what rewards it will receive there, or it can
choose to explore the unknown. This choice that our agent makes is extremely important
because this exploration of the unknown is what will let us know if there are better paths to
our destination where we can accumulate more rewards. In the known parts of the state space,
we know the Q-values for every state-action combination, so our agent can always pick the
best path. The unknown paths are, by definition, unknown, which means there might be an
even better path which we haven't explored yet, which we haven't found yet. Ideally, we want
our agent to maintain a balance between the two, exploring the known paths and venturing
out into the unknown, and we can control this by using the parameter epsilon, which controls
the probability of trying out new paths. So typically, for any state-action transition, our agent
with a probability of 1 minus epsilon will stick to known paths and with the probability
epsilon venture out into the unknown. It'll be adventurous and try out new paths. Initially,
when large parts of the state space are unknown, we want our agent to be very adventurous.
We want it to be trying out new paths all the time. But later, we'd ideally want to reduce the
value of epsilon from 1 down to somewhere close to 0 so that our agent sticks to those paths
that are more rewarding, paths which will allow it to maximize cumulative rewards.
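This exploration strategy is usually called epsilon-greedy. A minimal sketch, again assuming a NumPy Q-table indexed by state and action, might look like this; the decay schedule in the comment is just one common choice, not the only one.

import numpy as np

def epsilon_greedy_action(Q, state, epsilon, n_actions):
    # With probability epsilon, venture out into the unknown;
    # otherwise, stick to the best known path.
    if np.random.random() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[state]))

# Typically epsilon starts near 1 and is decayed toward 0 over episodes,
# for example: epsilon = max(0.01, epsilon * 0.995) after every episode.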

Demo: Q-learning for Shortest Path: Initialization

We've understood the intuition behind Q-learning. We also know the mathematical formula
for the temporal difference method. Let's now apply it to a real problem, albeit a simple one.
We'll use Q-learning to find the shortest path in a graph from a source node to a destination
node. Here are the libraries that we'll use in this program. NumPy for manipulating arrays,
pylab for visualizations, and pandas for reading in data and expressing data in the form
of data frames. The one Python package that you may not be familiar with is NetworkX. This
is explicitly used for creating and manipulating complex networks, that is graphs. We'll use
NetworkX to represent a simple graph. If you want to learn more about this package, here is a
GitHub link that you can visit. This Python package can be installed on your local machine
using a simple pip install networkx command. Here is a simple graph that we'll define by
specifying the edges in that graph. This is a graph with 9 nodes labeled from 0 to 8, and here
are all the edges between the nodes. There is an edge which goes from 0 to 2, 0 to 1, 3 to 7, and so
on. This is an undirected graph. If there is an edge from 0 to 1, there is also an edge from 1 to
0. The goal node here is node 7. Our objective is to get to this goal node from any source.
We'll instantiate this graph using the nx.Graph class, and we'll add the edges to this graph
from the edge_list that we just set up earlier. We now want to render and display this graph
on screen in order to visually examine this. We'll do this by using a spring_layout. This is a
specific way to lay out a graph on screen using the force-directed method. This method lays
out the graph such that edges are more or less of equal length, and it minimizes
overlapping edges. We want this graph to render nodes, edges, and every node should have
an associated label. And this is how the graph looks on screen. You can see that there are 9
nodes here labeled from 0 to 8 and edges which connect these nodes. This is an undirected
graph. Note that there are no arrows on the edges. Transitions can be in either direction along
every edge. We are going to set up a rewards table and then a Q-table. Both of these are
matrices, and both of these are 9x9 matrices where every row and column represent a node.
We'll first set up a rewards matrix. The rewards matrix specifies the reward that you get when
you transition from one node to another. The initial values in the rewards matrix are all set to
-1. Iterate through all the edges that we have in our edge list, and for every edge which leads
to our goal, which is node 7, set the reward to 100. There is a huge reward associated with
any transition that leads directly to the goal. We set the rewards for all other edges to 0. There
is no reward for traversing any edge which doesn't lead to the goal. We can also set the
rewards matrix for each edge in reverse because this is an undirected graph. If an edge leads
away from the goal node, then when it is traversed in the opposite direction it leads into the
goal, so we set the reward to 100 there as well. For all other edges, set the reward to 0 as before.
Once you're in the goal state, you want to ensure that you continue to remain there, which is
why you set the reward for the (goal, goal) transition to 100 as well. And here is our initialized rewards
matrix. It contains 3 different values, -1, 0, and 100. Nodes which are not connected to one
another contain the value -1 in the corresponding cell. You can't traverse those edges. Edges
which are connected directly to the goal, that is node 7, have a reward of 100. If an edge
exists between two nodes, but neither of those two nodes are goal nodes, the reward is 0.
We'll use a discount factor of gamma for future rewards that we might achieve. Let's set up a
Q-table that will hold the Q-values that we calculate when we explore this graph. We'll then
use these Q-values to find the best path from source to destination node. This Q-table will be
a 9x9 NumPy matrix where every row and column represents a node. The initial values
populated in our Q-table are all 0s. You can see this by viewing it as a data frame.
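Here is a rough reconstruction of this initialization. The narration only names a few of the edges, so the edge_list below is an illustrative graph that matches what is described (node 0 connects to 1, 2, 3, 6, and 8; node 7 is the goal) rather than the exact one used in the course, and the gamma value is an assumption since it isn't stated here.

import numpy as np
import networkx as nx
import pylab as plt

# Illustrative edge list for a 9-node undirected graph with goal node 7.
edge_list = [(0, 1), (0, 2), (0, 3), (0, 6), (0, 8),
             (1, 4), (2, 5), (3, 7), (6, 7), (8, 5)]
goal = 7

G = nx.Graph()
G.add_edges_from(edge_list)
pos = nx.spring_layout(G)                  # force-directed layout
nx.draw_networkx(G, pos)                   # nodes, edges, and labels
plt.show()

# 9x9 rewards matrix: -1 = no edge, 0 = ordinary edge, 100 = edge into the goal.
R = np.full((9, 9), -1)
for (u, v) in edge_list:
    R[u, v] = 100 if v == goal else 0
    R[v, u] = 100 if u == goal else 0      # undirected: set both directions
R[goal, goal] = 100                        # staying at the goal is rewarded

gamma = 0.8                                # discount factor (value assumed)
Q = np.zeros((9, 9))                       # Q-table, initialized to all 0s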

Demo: Q-learning for Shortest Path: Implementation

We'll first set up some helper functions that will allow us to explore our graph and populate
our Q-values. The first of these helper functions is to get the available actions for a particular
state. A state in our state space is a node in our graph. We access the row corresponding to
the current state, and then we find what actions are available there. The available_actions are
the state transitions that are possible from that node: all actions where the reward is 0 or
greater, that is, edges which actually exist. We access the nodes to which the current state
leads and return those as the available actions. We'll also have a
helper function set up, which allows us to choose the next state at random. This will help set
up our exploration policy where we start from a random state and try to see where that leads.
The update helper function will help us update the Q-values in our Q-table as we discover
new information during exploration. The action here is simply the transition to the next node
or the next state. We find the index of that transition which has the highest Q-value. If we
find that there is more than one action with the maximum Q-value, then we can pick one
action from this list at random; otherwise, we find that transition or that action which has the
maximum Q-value. We store the index of that in the max_index variable, and we store the
actual maximum Q-value in the max_value variable. We can now calculate and
populate the new Q-value for this current state-action combination using our Q-learning
mathematical formula that we studied in an earlier clip. We consider the immediate reward
for this transition from the rewards table and apply our gamma discount factor to the
maximum expected reward in the next state. We can go ahead and print this to screen for
debugging, and let's start exploring. Our initial state will be node 0. If you start off at node 0,
the actions available to us are transitions to the following nodes. We can go to node 1, 2, 3, 6,
and 8 from node 0. The helper method sample_next_action will pick one of these nodes at
random to transition to. Here it's picked node 6. Let's check out that update helper function.
All our Q-values have been initialized to 0, which means max_value is 0 at this point in time.
We are now ready to run the Q-value iteration algorithm. This is the exploration that we'll do
of our graph to populate the Q-table. We'll run this for 700 iterations, and for each iteration,
we start off in some random state. We get a list of all state transitions or actions that are
possible from that particular node and then choose one of these actions at random.
Remember, we are exploring here to populate our Q-values. We then call the update helper
function to update our Q-table for the current_state, action transition. Let's go ahead and run
our exploration. This will take a little while to run. Notice that, initially, the max Q-value is
0. Slowly, our Q-values increase. It has now grown to 80. Our Q-table is getting populated
slowly yet steadily. Q-values are converging to the optimal policy value. Notice the Q-values
steadily increasing for certain node action combinations until, finally, our Q-values will
converge to the optimal state values as per our policy. Let's take a look at the final Q-values
after exploring for 700 iterations. Here is our Trained Q matrix. Notice that Q-values of all
nodes that lead to 7 are very, very high. The state-action transitions from nodes 1, 2, and 3 to
0 also have a high Q-value because, ultimately, they lead to node 7. Let's normalize the
numbers in our Q matrix by dividing by the maximum Q-value and multiplying by 100. Now
with this, let's find the shortest path from the source node to our goal node 7. We want to find
a path from node 0 to 7. Zero is our initial state. We set up a while loop so long as our current
state is not equal to 7, and then we find the next step that we have to take, the next action.
The Q-table determines the policy, which we use to choose actions. We choose that action
which has the maximum Q-value, and if there are multiple such actions, we'll pick one of
them at random. Once we choose the action which has the maximum Q-value based on our
Q-table, we can then move to that state and append it to our steps. And once we get to the
destination state, the list of steps will give us the most efficient path there.
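Putting those helpers and the loop together, a condensed sketch of this implementation might look as follows. It continues from the initialization sketch above (R, Q, gamma, and goal), and for brevity it breaks ties with argmax instead of picking among equally good actions at random as the course notebook does.

import numpy as np

def available_actions(state):
    # All nodes reachable from this state: columns whose reward is 0 or more.
    return np.where(R[state] >= 0)[0]

def sample_next_action(actions):
    # Exploration policy: pick one of the available transitions at random.
    return int(np.random.choice(actions))

def update(current_state, action, gamma):
    # Immediate reward plus the discounted best Q-value from the next node.
    max_value = np.max(Q[action])
    Q[current_state, action] = R[current_state, action] + gamma * max_value

# Q-value iteration: explore from random start nodes for 700 iterations.
for _ in range(700):
    current_state = np.random.randint(9)
    action = sample_next_action(available_actions(current_state))
    update(current_state, action, gamma)

# Walk the greedy policy from node 0 to the goal node.
state, steps = 0, [0]
while state != goal:
    state = int(np.argmax(Q[state]))       # action with the highest Q-value
    steps.append(state)
print(steps)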

Intuitive Differences Between the Temporal Difference Method and SARSA

In an earlier clip, we had spoken about two different Q-learning techniques, the temporal
difference method, which we just referred to as Q-learning, and the SARSA method. In this
clip, we'll discuss the differences between the two. We already know that the formula for
calculating Q-values using the temporal difference method is different in a subtle way from
the formula for calculating Q-values using the SARSA method. Let's understand what these
differences mean in the real world. Q-learning using the temporal difference method follows
the principle look before you leap, and make sure your leap is the best estimated leap. The
SARSA method follows the principle take the plunge and then go back and update your view.
Here are the two formulas for the Q-learning, as well as for the SARSA method. We've
looked at these before, but what do they mean? In the temporal difference method, the
decision maker evaluates the maximum estimated reward it can get from the next state across
all possible actions. In the case of SARSA, the decision maker first takes an action and then goes
back and updates the previous state based on the actual reward that it received. The temporal
difference method is based off of the estimate of the maximum possible reward that it might
get, whereas in the SARSA method, we work with actual rewards. In the temporal difference
method, we estimate the future reward based on the best currently known action, but we
haven't taken that action yet, whereas when we use SARSA, we always update Q-values for
the previous step. Because it's based on actual rewards, we have to take the action, collect the
reward, and then go back and update our previous step. The temporal difference method
works off of our estimate of the best possible action because it assumes that an optimal policy
is being followed. In the SARSA method, it does not assume that the current exploration
policy is optimal, which is why it takes the action before it feeds the reward back to the
previous time step. The temporal difference method is called an off-policy reinforcement
learning method because it learns the optimal Q-values without considering how the agent
actually acts. It estimates the best possible action and then uses that to update the current Q-
value. Because the actual action hasn't occurred yet, this is an off-policy RL method. On the
other hand, SARSA is called an on-policy reinforcement learning method because the
calculation of Q-values considers how an agent actually acts. The agent takes the step to go
from state S1 to state S2. It'll perform the action, collect the reward, and use the actual value
of that reward to go back and update the Q-values for the previous state. The difference
between these two methods here is subtle and might be a little hard to understand at first, but
this difference is important because Q-learning using the temporal difference method
typically makes sense for offline learning where the agent first performs an exploration of the
state space and then decides later what is the best path. The SARSA method is typically used
for online learning where our agent continues to take decisions as it tries to find the best path
in its environment.
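A tiny toy comparison makes the difference visible in code. Everything here is made up purely for illustration: a two-state, two-action Q-table with pretend values.

import numpy as np

Q = np.zeros((2, 2))                       # toy Q-table: 2 states x 2 actions
alpha, gamma = 0.5, 0.9
s, a, r, s_next, a_next = 0, 0, 1.0, 1, 0
Q[s_next] = [0.2, 0.8]                     # pretend these were learned earlier

# Off-policy Q-learning target: the best action we *could* take from s_next.
td_target = r + gamma * np.max(Q[s_next])          # uses the 0.8 value
# On-policy SARSA target: the action we *actually* take from s_next.
sarsa_target = r + gamma * Q[s_next, a_next]       # uses the 0.2 value

print(td_target, sarsa_target)                     # 1.72 versus 1.18

Both methods then nudge Q[s, a] toward their target by alpha times the difference; only the target itself changes.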

Q-values as a Memoization Technique

You might have figured this out already, but I thought it best to explicitly mention where
exactly dynamic programming is used in Q-learning. Here are the differences between the
simplistic example that we considered early on in this module of our undergrad student and
state space exploration in the real world using Q-learning. In the real world, state space is too
vast to cover exhaustively, which means we use certain techniques to make it more tractable,
and dynamic programming is one of those techniques. Here is how we explore a state space
with two state variables. Even in two dimensions, we can see that the state space is vast.
Every action associated with a particular state has a Q-value. The Q-values for every state-
action pair are stored in the Q-table, and this Q-table is a form of memoization, or dynamic
programming. And remember, the only reason that we can use dynamic programming is
because we model the environment as a Markov decision process. The future is independent
of the past given the present. Everything we need to know about the past is embedded in the
present state. This Markov property allows us to store the Q-value of a particular state-action
combination independent of the path we took to get to that particular state. This caching of Q-
values for each state-action pair is basically dynamic programming, which has reduced the
computational complexity of our state space exploration, making our problem more tractable.
And this brings us to the end of this module. We started this module off by talking about
dynamic programming, which makes exploring the environment tractable, and we
implemented the 8-queens problem using memoization techniques in order to get an
understanding of how caching works. We then studied in detail two implementations of the
Q-learning technique used in reinforcement learning to find the optimal policy. We studied
the temporal difference method and the SARSA method and understood the differences between the
two. We also got a taste of implementing Q-learning. We implemented the shortest path
algorithm using Q-tables and Q-values. The next module is all about demos and getting hands
on with reinforcement learning algorithms. We'll use reinforcement learning platforms such
as the OpenAI Gym for our implementations.

Using Reinforcement Learning Platforms


Module Overview

Hi, and welcome to this module where we'll get really hands on with reinforcement learning
algorithms by using platforms. Now why are platforms needed in reinforcement learning
techniques? If you think about the applications of reinforcement learning in the real world, it
will become pretty obvious to you that these are better prototyped using platforms, and they
are hard to test in the real world. Let's say you're teaching a robot to walk. If your robot keeps
walking around in the real world, it might walk off a cliff or damage itself in some way. The
OpenAI Gym is a very popular reinforcement learning platform for testing and prototyping
your models. This is an open-source platform, which means it's freely available for anyone to
use. We'll implement several reinforcement learning techniques. We'll first implement
reinforcement learning using the SARSA methodology in the FrozenLake environment on
OpenAI Gym. We'll then implement reinforcement learning using the temporal difference
method, or Q-learning, in the CartPole environment in OpenAI Gym.

Exploring Reinforcement Learning Platforms

Let's start off by exploring what platforms are available to us to work with reinforcement
learning techniques. We spoke about the artificial intelligence company DeepMind earlier,
which Google acquired in 2014. This uses neural networks in order to teach programs to play
games like humans do. DeepMind Lab is an open-source project by Google, which offers a
suite of 3D navigation and puzzle-solving tasks for learning agents. You can use these to train
your reinforcement learning algorithms. It's very easy to get started with DeepMind in your
Linux environment. You'll get the Bazel build tools from bazel.io, clone the Git repository,
cd into the lab, and get a random agent up and running. Google's DeepMind isn't the only RL
platform in town. There is an AI research platform which allows you to test your
reinforcement learning techniques on the game Doom. The site is called ViZDoom, and it
allows you to use visual information from the Doom game in order to test and prototype your
reinforcement learning techniques. Once again, if you take a look at the documentation, you'll
find that it's very easy to get started with. There are tutorials available and GitHub
repositories. Reinforcement learning is a hot and important topic, which means Microsoft can't
be far behind either. Microsoft has an AI experimentation platform built on top of the
Minecraft game, and this platform can be used to support AI research. This platform is called
Project Malmo. And if you explore this page, you'll find the GitHub link that'll help you get
started with Project Malmo. But at the time of recording this course, the most popular
reinforcement learning platform by far is the OpenAI Gym. This is the open-source interface
for various reinforcement learning tasks. The OpenAI Gym offers a number of different
reinforcement learning environments in which you can train your agents. You can view the documentation, and it has
many examples, which are hosted on GitHub. Once again, it's very easy to get started with
the OpenAI Gym. All our demos will be based on this platform. There are a variety of
different environments that are available in the OpenAI Gym. Let's briefly explore these. You
can see that we have classic control environments. These are fairly simple reinforcement
learning tasks. Each task will have a well-defined set of actions and a state space that can be
easily discretized. In fact, later on in this module, we'll implement the CartPole environment
where we try to balance a pole on a moving cart. The OpenAI Gym also contains
environments for standard computation tasks. It contains Atari environments where you can
test your reinforcement learning techniques on classic Atari games. Box2D is a simulator
that is available for continuous control tasks, and MuJoCo is a physics simulator.
The Robotics simulator allows you to control the Fetch and ShadowHand robots. These are
special kinds of robots designed for specialized tasks. Each of these is a separate research
platform which contains tasks of varying complexity, more complex than those available in
the MuJoCo environments. And finally, Toy text contains the simplest of all environments. These
are the environments that we'll start off with when we code using the OpenAI Gym. If you
click on the Documentation link, you'll see that getting up and running with the OpenAI Gym
is very simple. All you have to do is run pip install gym on your local machine. So we'll
switch over to the terminal window, and I'll call pip install gym, and you'll see that I already
have it downloaded and installed. I have the packages for the Python 3 version. This is what
I'm going to be using in my demos.

Exploring Environments in the Open AI Gym

Now that we've used pip install to get the OpenAI Gym libraries, let's explore it from within
our Jupyter Notebook. Here are the import statements for this program. The
import gym statement will bring in the OpenAI Gym, which we'll use to create our
environments. Use the gym.make function in order to create an environment. We set up the
CartPole-v0 environment. Here we want to balance a pole on a cart by moving the cart either
left or right on a frictionless surface. After the environment has been created, it can be
initialized using the env.reset function. Each reset of the environment can be considered to
be one episode. An episode is the start of a new task, and this task is seen through until the end.
We consider a maximum of 100 time intervals for each episode, and we render the
visualization to the screen. This starts the CartPole visualization. At every time interval, we
execute some action. This action will be on the cart in the CartPole environment, and it can
cause the cart to go either left or right. env.action_space.sample picks an action at random
from the action space. This should bring up the OpenAI window that you see on
screen, and here you can see the CartPole trying to balance itself. It keeps falling down
though. Let's go ahead and close this environment, and let's take a look at the other
environments that our gym has to offer. The Acrobot environment contains two joints and
two links, and our objective is to swing the end of the lower link up to a given height. By
executing the same code, we can see the visualization of the Acrobot environment. Notice the
two joints and the two links. Let's close this environment and move on to the next one, the
MountainCar-v0. Here we'll use a car to drive back and forth in the gap between mountains,
and we'll try and move up the mountain by using the momentum generated. The next
environment that we'll look at here is the FrozenLake. This is a toy text environment, which
means it's a simpler RL problem. The objective here is to find a safe path across a frozen
lake, which contains frozen ice tiles, as well as holes where you can fall in and die. When
you reset this environment and render it to screen, you can see that the result is in the form of
text. The letters here represent the various tiles in our FrozenLake. Here, F stands for a
frozen tile, H stands for a hole where we can plummet to our deaths, S stands for the start tile,
and G stands for the goal tile. Our objective is to go from the start to the goal to retrieve a
frisbee that has flown into the frozen lake. We'll be working with two demos in this module,
one in the FrozenLake environment and another in the CartPole environment.
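For reference, the basic pattern for poking at these environments looks roughly like this. The environment ids and the four-value step signature below are those of the classic Gym releases used in this course; newer Gym and Gymnasium versions have renamed the ids and changed the reset and step signatures.

import gym

env = gym.make('CartPole-v0')
env.reset()                                # start a new episode
for t in range(100):
    env.render()                           # draw the cart and the pole
    action = env.action_space.sample()     # random action: left or right
    observation, reward, done, info = env.step(action)
    if done:
        break
env.close()

# The toy text FrozenLake environment renders as a grid of letters instead.
env = gym.make('FrozenLake-v0')
env.reset()
env.render()                               # prints the S, F, H, and G tiles
env.close()
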
Demo: Q-learning Using SARSA in the Frozen Lake Environment

In this demo, let's use reinforcement learning to retrieve a frisbee from a frozen lake. We'll
use the OpenAI Gym platform. Go ahead and import the gym environment like we did
before. We'll also be using the NumPy package. Create the FrozenLake environment using
the gym.make function. Let's now explore this environment. env.action_space.n gives us
the number of actions that are possible in this environment. When we move to retrieve a
frisbee from this frozen lake, there are four possible actions. We can move left, right, move to
the top or to the bottom. Let's look at the observation space. An observation is a way to
describe the current state of the environment, the various state variables that need to be
represented. Our FrozenLake is very simple. It's a 4x4 grid with 16 tiles. These tiles are
labeled S, F, H, and G. S is the start, F is a frozen tile, H represents a hole, and G represents
our goal. Initialize our learning rate, alpha, to 0.4, and initialize the discount factor
that we apply to future rewards, gamma, to 0.999. The next step is for us to initialize our
q_table. In this example, we represent our q_table in the form of a dictionary. The key in this
dictionary represents the current state. The current state is when we are in 1 of the 16 tiles on
the FrozenLake, so there are 16 possible states. For every state, there are four possible next
actions. We can move left, right, to the top, or to the bottom tile. We'll initialize the Q-values
for all of the state-action transitions to 1. Here is our initial q_table. Notice on the left all the
keys are the 16 different states numbered from 0 to 15, and on the right are the various
actions that are available in each state. Every state-action value is initialized to 1. We'll
define just one helper method here, the choose_action method. The choose_action takes in
the current state called an observation here, and it chooses that action which has the
maximum Q-value. This action with the highest Q-value will be the recommended action for
our reinforcement learning agent. We know nothing about the environment, so let's start
exploring. We'll populate the values in our q_table using 10,000 different episodes. Reset the
environment to initialize it at the start of each episode and then choose some action from the
sample space at random. When we start a new episode, we do not have any value for the
previous state and the previous action. Both of these are set to None. We start off at the first
time step, that is t is equal to 0, and we run for a maximum of 2500 time steps in each
episode. An episode is considered complete if we reach the goal state and we retrieve the
frisbee, if we fall into a hole and plummet to our death, or if we run continuously for 2500
time instances without reaching the goal and without dying. Render the environment for
every time instance, and let's take the next action. The env.step function, which executes our
action, returns four different values. The observation, that is the next state because of our
action; the reward for that particular action; a Boolean done, which indicates whether this current
episode is complete; and info, which contains additional information. Once we have the next
state in the observ variable, we can use choose_action to determine the next action that we
have to execute. Once we've executed the current action using env.step, we can now update
the Q-values in our q_table provided we had a previous state. We go ahead and retrieve the
Q-value for the previous state and the previous action combination. Remember in the SARSA
method we update the Q-value for the previous state-action combination after executing the
action and collecting the actual reward. If this episode is complete, then we can go ahead and
calculate the new Q-value for the previous state-action combination. This is a simplified
formula which only contains the reward for the current action. The episode is complete.
We've reached the end state. There are no future rewards to discount. If the episode isn't done
yet, then we use the mathematical formula for SARSA in order to compute the new Q-value.
This is the same formula that we've studied before. This is the new Q-value for the previous
state-action combination. We use alpha as the learning rate, Rt+1 is the reward for the current
action, we add the discounted Q-value of the next state-action pair, and we subtract the current Q-value. Remember that
the SARSA technique uses the actual rewards obtained in order to calculate Q-values for the
previous state. Go back and update the Q-values for the previous state-action combination.
The current state now becomes the previous state, and the current action becomes the
previous action. We can now move on and continue until we are done. And this is all the code
we need. Let's go ahead and hit Shift+Enter and execute this cell, and let's see the SARSA
method in action. Let's look at a few interesting things in the output printed to screen. Notice
that the action suggested is left. There is no cell to the left, no grid tile to the left, which is
why there is no movement. But once again, if you scroll down a little below, the action says
left, but the actual action that was executed was a down movement. From the start, our agent
moved one tile down rather than to the left. Looking at the actions suggested and the actual
actions that were taken is interesting. Notice an instance here where the action was up, and
it actually caused us to move up. You can see in our agent's exploration that the first episode
was completed after 24 time intervals with r=1, which implies that we reached the goal tile.
Here is another later episode that reached the goal state, but with fewer timesteps. It did it in
11 timesteps. In our exploration, there are other episodes which fail as well. This episode,
after 37 timesteps, ended with us not in the goal state. We probably fell into a hole while
trying to retrieve our frisbee. Here is the last update that we made to a particular Q-value, and
here is our fully-populated q_table. This has the Q-values computed using the SARSA
method for every state-action combination.
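Here is a condensed sketch of this SARSA demo. It folds the previous-state bookkeeping from the notebook into a single loop, so the structure differs a little from the course code, but the update rule is the same; the FrozenLake-v0 id and the four-value step signature are from the classic Gym release used here.

import gym

env = gym.make('FrozenLake-v0')
alpha, gamma = 0.4, 0.999                  # learning rate and discount factor

n_states, n_actions = env.observation_space.n, env.action_space.n
q_table = {s: {a: 1.0 for a in range(n_actions)} for s in range(n_states)}

def choose_action(observation):
    # Recommend the action with the highest Q-value for this state.
    return max(q_table[observation], key=q_table[observation].get)

for episode in range(10000):
    state = env.reset()
    action = env.action_space.sample()     # first action of the episode at random
    for t in range(2500):
        # FrozenLake is slippery, so the executed move can differ from the
        # suggested one; that randomness supplies the exploration here.
        next_state, reward, done, info = env.step(action)
        next_action = choose_action(next_state)
        old_q = q_table[state][action]
        if done:
            target = reward                # terminal: no future rewards to discount
        else:
            target = reward + gamma * q_table[next_state][next_action]
        q_table[state][action] = old_q + alpha * (target - old_q)
        state, action = next_state, next_action
        if done:
            break
env.close()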

Demo: Q-learning to Balance a Pole on a Cart

In this demo, we'll study one of the more interesting reinforcement learning problems out
there. We'll see how we can balance a pole on a cart in OpenAI Gym. We'll do this using Q-
learning, that is the temporal difference method. The imports here are straightforward. Make
sure you include the OpenAI Gym package. We set up the CartPole environment, which
allows us to try balancing a pole on a cart by moving the cart to the left or to the right on a
frictionless surface. This environment allows two possible actions. You can move the cart to
the left or to the right. There are four variables which represent the current state, that is the
observation space, the position of the cart, the velocity of the cart, the angle of the pole to the
vertical, and the angular velocity at which the pole is moving. env.observation_space.low
gives us the lower bounds of the four values that make up the observation space. The
individual state variables cannot have values below these. Similarly, env.observation_space.high
gives us the upper bounds of the four values that make up our state space. It's pretty
clear from the state variables that are used to define our observation space or state space that
the state space is potentially infinite. We need to make this more tractable, and we'll do this
by discretizing the state space so that we can apply Q-learning to discrete bounded values.
Each of the four variables that define our current state is discretized into buckets. The first
variable represents the position of the cart, which could be bucketed into two states, left or right of center.
When we reduce the number of buckets to 1, this means that we are ignoring this variable
completely in our state space. This is a technique that we can use to reduce the
dimensionality of our Q-value computation. The second variable refers to the cart velocity.
By reducing the number of buckets for this variable to 1, we are ignoring this state variable as
well in our Q-value computation. We've specified two 1s in our number of buckets. That
means we've reduced our state space along two dimensions. This will make our learning
much faster as our Q-table size is smaller. We've completely taken off two dimensions from
the Q-table. The remaining two state variables represent the angular position of the pole with
respect to the vertical and the angular velocity of the pole. We are going to use six buckets, or
six discrete intervals, to represent the angular position of the pole and three buckets to
represent its angular velocity. Let's initialize some values here. NUM_ACTIONS is the
number of cart actions possible. This is two, move left or move right. In the
STATE_BOUNDS variable, we'll store the upper and lower bounds for each of our state
variables. In order to further limit our state space, let's redefine some of these bounds. The
cart velocity bounds are now -0.5 to +0.5, and the bounds for the angular velocity of the
pole are from negative 50 radians to positive 50 radians. By limiting these bounds, we now
have a state space that is tractable to work with on our local machine. You should note that
the CartPole environment in the OpenAI Gym already limits the state space for us. The
episode ends when the pole is more than 15 degrees from the vertical or the cart moves more
than 2.4 units from the center. It's time now to finally initialize our q_table. Our q_table has
five dimensions. If you think about a standard q_table, it basically has number-of-states times
number-of-actions cells. The number of discrete states that are possible in our environment is
1 x 1 x 6 x 3, and for every state we have two possible actions, move the cart left or right. Here
is the shape of our q_table. The first four dimensions here represent the number of states. The
last dimension is the number of actions that we can take in each state. Our q_table is a
NumPy matrix, which has been initialized to all 0s to start off with. We initially want to
explore the state space in order to fill up the Q-values in our q_table. We use an exploration
rate of 0.01 and a learning rate of 0.1. We'll now set up a helper function that will allow us
to decay our exploration rate over time. We want to decay the exploration rate, but not too
fast. We want to explore less over time as we've learned and understood our state space.
Similarly, we want to decay the learning rate as well so that we don't miss any maximums.
The select_action function allows us to select the action that we want to take from the current
state. This takes in the explore_rate as well. When it's time for us to take an action, we can
choose to explore the sample space at random based on what our explore_rate is, or we can
choose to stick with the known and perform that action that gets us to the state with the
highest Q-value.
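A sketch of this setup is below. The bucket counts, the clamped velocity bounds, and the action selection follow the narration; the decay schedules and the use of math.radians(50) for the angular velocity bound are assumptions borrowed from the common reference implementation of this demo, and may differ from the course notebook.

import math
import gym
import numpy as np

env = gym.make('CartPole-v0')              # classic Gym id; newer releases differ

NUM_BUCKETS = (1, 1, 6, 3)                 # position and velocity ignored (1 bucket each)
NUM_ACTIONS = env.action_space.n           # 2: push the cart left or right

STATE_BOUNDS = list(zip(env.observation_space.low, env.observation_space.high))
STATE_BOUNDS[1] = (-0.5, 0.5)                              # clamp cart velocity
STATE_BOUNDS[3] = (-math.radians(50), math.radians(50))    # clamp pole angular velocity (assumed)

q_table = np.zeros(NUM_BUCKETS + (NUM_ACTIONS,))           # shape (1, 1, 6, 3, 2)

MIN_EXPLORE_RATE, MIN_LEARNING_RATE = 0.01, 0.1

def get_explore_rate(episode):
    # Decay exploration slowly over episodes (logarithmic schedule assumed).
    return max(MIN_EXPLORE_RATE, min(1.0, 1.0 - math.log10((episode + 1) / 25)))

def get_learning_rate(episode):
    # Decay the learning rate on the same kind of schedule.
    return max(MIN_LEARNING_RATE, min(0.5, 1.0 - math.log10((episode + 1) / 25)))

def select_action(state, explore_rate):
    # Epsilon-greedy: explore at random, otherwise take the best known action.
    if np.random.random() < explore_rate:
        return env.action_space.sample()
    return int(np.argmax(q_table[state]))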

Demo: Q-learning to Balance a Pole on a Cart Simulation

We'll now set up another useful helper function. If you remember in the last clip, we had
spoken about discretizing our state space so that it's easier to explore. A state in the CartPole
environment is represented using four observations, or four state variables, the position of the
cart, the velocity of the cart, the angle the pole makes with the vertical, and the angular velocity
of the pole. We had discretized the state space by dividing all of these variables into buckets.
This helper function takes in continuous state information and returns a discretized,
bucketized version. We iterate through all four state variables one at a time, and the first
check we do is to see whether the current value of that state variable is less than the lower
bounds that we have set up for it. If yes, then we set the state value to be equal to that of the
smallest bucket, 0. We perform a similar check at the other end. If we find that the value of
the state variable is beyond the upper bounds for that particular state, then we cap it and set it
to the largest bucket. In all other cases, our state variable value is somewhere within the
lower and the upper bound. To figure out which bucket it belongs to, we calculate the width
of the bounds and perform a mathematical computation, which uses the offset and the scaling
factor of each bucket to figure out which discrete bucket our continuous value falls in.
Append the discrete bucket numbers, which makes up our discretized state space, to the
bucket_indices list and return it. And finally, we are done with all the helper functions that
we need. We are now ready to set up a CartPole simulation. We want to balance the pole on
the cart for as long as possible. Get the value of the learning_rate and the explore_rate that we
start off with. Remember, we want these values to be large initially and decay over time. Our
value for gamma, the discount_factor, is 0.99. That means future rewards are almost as
important as immediate rewards. We also calculate something called the streaks to see how
long we can balance the pole on the cart. If we balance for over 200 time instances, that counts as 1
streak. We'll run our CartPole simulation for 1000 episodes. An episode ends when one of the
following conditions is satisfied: the pole fails to stay balanced and tilts more
than 15 degrees from the vertical, or the cart moves more than 2.4 units to either side. The
episode also ends when the number of discrete time intervals are up. We'll have every episode
run for a maximum of 250 time intervals. We'll initialize the environment by calling
env.reset and get the first state information. We can discretize this observation, or state, by calling
state_to_bucket on our observation. We store the result in state_0. Let's run our CartPole
simulation for 250 time instances. We render our environment for each instance. Select an
action to be performed. Given the current state, this action will look up the q_table as we've
set up in the select_action helper function. With a small probability that depends on our
explore_rate, the next action will be chosen at random. Execute the action using env.step and
get the next state, the reward for the current action, and whether the simulation has completed
or not. Each time we have a new observation, or state, in state space, we'll discretize it into
buckets. Look up the q_table to find the best Q-value to go from this state to any of the other
states. This is the best state-action combination. Once we have that, we can use the temporal
difference method formula in order to update the values in our q_table. We assign the new Q-
value for the current state. Alpha is our learning rate parameter. We'll gradually decrease this
for every time instance. Reward is the immediate reward that we get for the current action.
Gamma is the discount factor for a future reward. And finally, we subtract the current Q-
value as well. We've completed one iteration, that is one time interval. Update our current
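The update we've just described is the standard Q-learning rule, Q(s, a) <- Q(s, a) + alpha * (reward + gamma * max Q(s', a') - Q(s, a)). A sketch of one episode's time loop, reusing the names and helpers assumed above (and the classic Gym step API), might look like this; the actual code on screen may differ slightly.

    import numpy as np

    for t in range(MAX_T):
        env.render()

        # Choose an action for the current (discretized) state.
        action = select_action(state_0, explore_rate)

        # Apply the action and observe the result.
        observation, reward, done, _ = env.step(action)
        state = state_to_bucket(observation)

        # Best Q-value reachable from the new state.
        best_q = np.amax(q_table[state])

        # Temporal difference (Q-learning) update:
        # Q(s, a) <- Q(s, a) + alpha * (reward + gamma * best_q - Q(s, a))
        q_table[state_0 + (action,)] += learning_rate * (
            reward + DISCOUNT_FACTOR * best_q - q_table[state_0 + (action,)])

        # The new state becomes the current state for the next time interval.
        state_0 = state

        if done:
            break  # the pole fell, the cart went out of bounds, or time ran out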
We've completed one iteration, that is, one time interval. We update our current state to be the next state, and then we continue. We print out a bunch of information to screen so that we can see what's going on. If the done parameter from the env.step call indicates that the episode is complete, we print information out to screen. If we ran for more than 199 discrete time intervals, that means we balanced the pole on the cart for a fairly long time; that counts as one streak. We reset num_streaks to 0 if we can't make 200 time instances, and then we break out of the episode. If we find that we've had 120 streaks over the 1000 episodes that we ran, that's a pretty good number, and that's the time for us to bail out. For every episode, we get a new explore_rate and learning_rate; we are constantly reducing these values. And that's it.
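The walkthrough doesn't spell out the exact decay schedules, so treat the following as an assumption: a commonly used choice is a clipped logarithmic decay applied once per episode.

    import math

    MIN_EXPLORE_RATE = 0.01
    MIN_LEARNING_RATE = 0.1

    def get_explore_rate(episode):
        # Close to 1.0 for early episodes, then shrinks, never dropping below the floor.
        return max(MIN_EXPLORE_RATE, min(1.0, 1.0 - math.log10((episode + 1) / 25.0)))

    def get_learning_rate(episode):
        # Same shape of decay, capped at 0.5 and floored at MIN_LEARNING_RATE.
        return max(MIN_LEARNING_RATE, min(0.5, 1.0 - math.log10((episode + 1) / 25.0)))

Both functions are called once per episode, so exploration and learning gradually taper off as the q_table improves.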
We are now ready to kickstart our simulation. Invoke the simulate function, and you can see the OpenAI Gym environment with the pole balancing on the cart. Notice that as time goes on the balancing gets better: the cart moves less, and the velocity of the cart movements is also lower. This completes our simulation. Let's examine the messages on screen to see if there's anything interesting. You can see that our very first episode completed in only 16 time instances. We achieved our first streak at episode 156, which ran for 199 time instances. As our q_table is gradually updated, we find that the number of streaks keeps increasing; we have 56 streaks by episode 258. It's possible that for some episode we don't manage 200 time instances, which means num_streaks gets reset to 0, but we climb back up. You can see that as the q_table gets updated we get steadily better at balancing the pole on the cart and the number of streaks keeps rising. Let's close the environment now that we are done with it, and let's take a look at the q_table values after 1000 episodes. And here you can see them on screen.
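In code, these final steps amount to something like the following, where simulate is assumed to be the function wrapping the episode loop described above.

    # Train the agent while watching the rendered CartPole window improve.
    simulate()

    # Clean up the rendering window and inspect what the agent has learned.
    env.close()
    print(q_table)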

Summary and Further Study

And on this note, we come to the very end of this module and to the end of this course on
reinforcement learning. This module was very hands-on, and we explored and worked with a
number of reinforcement learning platforms. Reinforcement learning is hard to test out in the
real world, which means it's good for you to start building your prototypes using these RL
platforms. There are a number of different platforms available, but we specifically worked
with the OpenAI Gym. This is a very popular open-source platform with a number of
different environments in which your reinforcement learning agents can train. We
implemented two different reinforcement learning agents: one used SARSA in the FrozenLake environment, and in the second demo we saw how to balance a pole on a cart using Q-learning, a temporal difference technique. With this,
we come to the very end of this course. If you're interested in furthering your study of
machine learning algorithms, here are some courses that you can take on Pluralsight.
Building Classification Models with TensorFlow focuses on developing and implementing
classification models using neural networks such as convolutional neural networks, as well as
recurrent neural networks. Classification is a supervised learning technique. If you're
interested in unsupervised learning models, there is a course for that on Pluralsight as well,
Building Unsupervised Learning Models with TensorFlow. Or maybe you'd rather not work with TensorFlow directly and prefer a higher-level abstraction, in which case there is Deep Learning with Keras, also available on Pluralsight. If you're looking for good books on
machine learning, then I'd highly recommend Hands-On Machine Learning with Scikit-Learn
and TensorFlow by Aurelien Geron. This is an extremely interesting book, and in fact, the
reinforcement learning chapter is very, very fun to read.
