Value Functions & Bellman Equations
From the above equation, we can see that the value of a state can be decomposed into the immediate reward (R[t+1]) plus the discounted value of the successor state (γ·v[S(t+1)]). This decomposition still holds for the Bellman Expectation Equation, but now we are finding the value of a particular state under some policy (π). This is the difference between the Bellman Equation and the Bellman Expectation Equation.
Let’s call this Equation 1. It tells us that the value of a particular state is determined by the immediate reward plus the discounted value of the successor state when we are following a certain policy (π).
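In standard notation, Equation 1 reads:
vπ(s) = Eπ[ R[t+1] + γ · vπ(S[t+1]) | S[t] = s ]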
Let’s call this Equation 2. From the above equation, we can see that the state-action value of a state can be decomposed into the immediate reward we get on performing a certain action in state (s) and moving to another state (s’), plus the discounted state-action value of the state (s’) with respect to the action (a’) our agent will take from that state onwards.
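In the same notation, Equation 2 reads:
qπ(s,a) = Eπ[ R[t+1] + γ · qπ(S[t+1], A[t+1]) | S[t] = s, A[t] = a ]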
First, let’s understand Bellman Expectation Equation for State-Value Function with the
help of a backup diagram:
This backup diagram describes the value of being in a particular state. From the state s there is some probability of taking each of the two actions. There is a Q-value (state-action value) for each of the actions. We average the Q-values, weighted by the policy’s probabilities, and that tells us how good it is to be in that particular state. Basically, it defines Vπ(s) (see Equation 1).
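Averaging the Q-values with the probabilities π(a|s) that our policy assigns to each action, this can be written as:
vπ(s) = Σ_a π(a|s) · qπ(s,a)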
Value of Being in a state
This equation also tells us the connection between the State-Value Function and the State-Action Value Function. Now, let’s look at the backup diagram for the State-Action Value Function:
Backup Diagram for State-action Value Function
This backup diagram says that suppose we start off by taking some action (a). Because of that action, the agent might be blown by the environment to any of these states. So we are asking the question: how good is it to take action (a)? We again average the state-values of both the states, weighted by the transition probabilities, and add the immediate reward, which tells us how good it is to take that particular action. This defines our qπ(s,a).
Mathematically, we can define this as follows:
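Using R(s,a) for the expected immediate reward and P(s'|s,a) for the transition probability, one common way of writing this is:
qπ(s,a) = R(s,a) + γ · Σ_s' P(s'|s,a) · vπ(s')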
Equation defining how good it is to take a particular action a in state s, where P is the Transition Probability
Now let’s stitch these backup diagrams together to define the State-Value Function, Vπ(s):
Backup Diagram for State-Value Function
From the above diagram, suppose our agent is in some state (s) and from that state it can take two actions, after which the environment might take it to any of the states (s’). Note that the probability of the action our agent takes from state s is weighted by our policy, and after taking that action the probability that we land in any of the states (s’) is weighted by the environment. Now our question is: how good is it to be in state (s), taking some action, landing in another state (s’) and following our policy (π) after that? It is similar to what we have done before: we are going to average the values of the successor states (s’) using the transition probabilities (P), weighted by our policy.
Mathematically, we can define it as follows:
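In the same notation, averaging over actions with the policy and over successor states with the transition probabilities:
vπ(s) = Σ_a π(a|s) · [ R(s,a) + γ · Σ_s' P(s'|s,a) · vπ(s') ]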
It’s very similar to what we did for the State-Value Function, just the other way around. This diagram basically says that our agent takes some action (a), because of which the environment might land us in any of the states (s’); then from that state we can choose to take any of the actions (a’), weighted by the probability our policy (π) assigns to them. Again, we average them together and that gives us how good it is to take a particular action while following a particular policy (π) all along. Mathematically, this can be expressed as:
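In the same notation:
qπ(s,a) = R(s,a) + γ · Σ_s' P(s'|s,a) · Σ_a' π(a'|s') · qπ(s',a')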
So, this is how we can formulate the Bellman Expectation Equation for a given MDP to find its State-Value Function and State-Action Value Function. But it does not tell us the best way to behave in an MDP. For that, let’s talk about what is meant by the Optimal Value Function and the Optimal Policy.
Optimal State-Value Function: it is the maximum value function over all policies. v∗(s) tells us the maximum reward we can get from the system starting from state s.
Optimal State-Action Value Function: it is the maximum action-value function over all policies. q∗(s,a) tells us the maximum reward we are going to get if we are in state s and take action a from there onwards.
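In symbols:
v∗(s) = max_π vπ(s) and q∗(s,a) = max_π qπ(s,a), where the max is taken over all policies π.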
Before we define the Optimal Policy, let’s understand what is meant by one policy being better than another policy.
We know that for any MDP there exists a policy (π) that is better than or equal to any other policy (π’). But how do we compare them? We say that one policy (π) is better than another policy (π’) if the value function under policy π is greater than or equal to the value function under policy π’ for all states. Intuitively, it can be expressed as:
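Using the standard partial ordering over policies:
π ≥ π' if vπ(s) ≥ vπ'(s) for all states s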
Note that there can be more than one optimal policy in an MDP, but all optimal policies achieve the same optimal value function and optimal state-action value function (Q-function).
We find an optimal policy by maximizing over q*(s,a), i.e. our optimal state-action value function. We solve for q*(s,a) and then, in each state, pick the action that gives us the maximum value of q*(s,a).
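One standard way to write such a deterministic optimal policy is:
π∗(a|s) = 1 if a = argmax_a q∗(s,a), and 0 otherwise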
What this says is that for a state s we pick the action a with probability 1 if it gives us the maximum q*(s,a). So, if we know q*(s,a), we can get the optimal policy from it.
In this example, the red arcs denote the optimal policy, which means that if our agent follows this path it will yield the maximum reward from this MDP. Also, by looking at the q* values for each state we can tell which actions our agent will take to yield the maximum reward. So, the optimal policy always takes the action with the higher q* value (state-action value). For example, in the state with value 8, there are q* values of 0 and 8. Our agent chooses the one with the greater q* value, i.e. 8.
The Optimal Value Function is recursively defined by the Bellman Optimality Equation. The Bellman Optimality Equation is the same as the Bellman Expectation Equation, except that instead of averaging over the actions our agent can take, we take the action with the maximum value.
Suppose our agent is in state s and from that state it can take two actions (a). We look at the action-values for each of the actions and, unlike in the Bellman Expectation Equation, instead of taking the average our agent takes the action with the greater q* value. This gives us the value of being in state s.
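In symbols, maximizing over the available actions:
v∗(s) = max_a q∗(s,a)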
Similarly, let’s define Bellman Optimality Equation for State-Action Value Function (Q-
Function).
Suppose our agent has taken an action a in some state s. Now it is up to the environment which of the states (s’) it blows us to. We still take the average of the values of both states, but the difference is that in the Bellman Optimality Equation we use the optimal values of each of those states, unlike in the Bellman Expectation Equation, where we just used the values under the policy.
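In the same notation as before:
q∗(s,a) = R(s,a) + γ · Σ_s' P(s'|s,a) · v∗(s')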
Suppose our agent is in state s and from that state it takes some action (a). Because of the action (a), the agent might get blown to any of the states (s’), where the probabilities are determined by the environment. To find the value of state s, we average the optimal values of the states (s’) weighted by those transition probabilities, add the immediate reward, and then take the action that maximizes this quantity. This gives us the value of being in state s.
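Putting the upper (max) and lower (average) arcs together:
v∗(s) = max_a [ R(s,a) + γ · Σ_s' P(s'|s,a) · v∗(s') ]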
Bellman Optimality Equation for State-Value Function from the Backup Diagram
The max in the equation is there because we are maximizing over the actions the agent can take in the upper arcs. This equation also shows how the V* function relates to itself.
Now, let’s look at the Bellman Optimality Equation for the State-Action Value Function, q*(s,a):
Suppose our agent was in state s and it took some action (a). Because of that action, the environment might land our agent in any of the states (s’), and from those states we get to maximize over the actions our agent will take, i.e. choose the action with the maximum q* value. We back that up to the top and that tells us the value of the action a.
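Putting the two arcs together the other way around:
q∗(s,a) = R(s,a) + γ · Σ_s' P(s'|s,a) · max_a' q∗(s',a')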
Bellman Optimality Equation for State-Action Value Function from the Backup Diagram
Look at the red arrows. Suppose we wish to find the value of the state with value 6 (in red). As we can see, we get a reward of -1 if our agent chooses Facebook and a reward of -2 if our agent chooses to study. To find the value of the state in red we will use the Bellman Optimality Equation for the State-Value Function: taking the optimal values of the other two states as given, we back up the value of each action and maximize over the two actions (choose the one that gives the maximum value). From the diagram we can see that choosing Facebook yields a value of 5 for our red state and choosing to study yields a value of 6, and then we maximize over the two, which gives us 6 as the answer.
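Concretely, taking γ = 1 here and reading the optimal values of the two successor states off the diagram as 6 (via Facebook) and 8 (via Study):
v∗(red) = max(−1 + 6, −2 + 8) = max(5, 6) = 6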