Value Functions & Bellman Equations
From the above equation, we can see that the value of a state can be decomposed into the immediate reward (R[t+1]) plus the discounted value of the successor state (γ·v[S(t+1)]). This decomposition still holds for the Bellman Expectation Equation, but now we are finding the value of a particular state under some policy (π). This is the difference between the Bellman Equation and the Bellman Expectation Equation.
Let’s call this Equation 1. It tells us that the value of a particular state is determined by the immediate reward plus the discounted value of the successor state when we are following a certain policy (π).
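In standard notation, Equation 1 reads:
vπ(s) = Eπ[ R[t+1] + γ · vπ(S[t+1]) | S[t] = s ]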
Let’s call this Equation 2. From the above equation, we can see that the state-action value of a state can be decomposed into the immediate reward we get on performing a certain action in state (s) and moving to another state (s’), plus the discounted state-action value of the state (s’) with respect to the action (a’) our agent will take from that state onwards.
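In the same notation, Equation 2 reads:
qπ(s,a) = Eπ[ R[t+1] + γ · qπ(S[t+1], A[t+1]) | S[t] = s, A[t] = a ]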
First, let’s understand Bellman Expectation Equation for State-Value Function with the
help of a backup diagram:
This backup diagram describes the value of being in a particular state. From the state s there is some probability of taking each of the two actions. There is a Q-value (state-action value) for each of the actions. We average the Q-values, weighted by the policy’s probabilities, and that tells us how good it is to be in that particular state. Basically, it defines Vπ(s) (see Equation 1).
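Averaging the Q-values with the probabilities π(a|s) that our policy assigns to each action, this can be written as:
vπ(s) = Σ_a π(a|s) · qπ(s,a)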
Value of Being in a state
This equation also tells us the connection between the State-Value Function and the State-Action Value Function. Now, let’s look at the backup diagram for the State-Action Value Function:
Backup Diagram for State-action Value Function
This backup diagram says that suppose we start off by taking some action (a). Because of that action, the agent might be blown by the environment to any of these states. So we are asking the question: how good is it to take action (a)? We again average the state-values of both the states, weighted by the transition probabilities, and add the immediate reward, which tells us how good it is to take that particular action. This defines our qπ(s,a).
Mathematically, we can define this as follows:
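Using R(s,a) for the expected immediate reward and P(s'|s,a) for the transition probability, one common way of writing this is:
qπ(s,a) = R(s,a) + γ · Σ_s' P(s'|s,a) · vπ(s')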
Equation defining how good it is to take a particular action a in state s, where P is the Transition Probability
Now let’s stitch these backup diagrams together to define the State-Value Function, Vπ(s):
Backup Diagram for State-Value Function
From the above diagram, suppose our agent is in some state (s) and from that state it can take two actions, after which the environment might take it to any of the states (s’). Note that the probability of the action our agent takes from state s is weighted by our policy, and after taking that action the probability that we land in any of the states (s’) is weighted by the environment. Now our question is: how good is it to be in state (s), taking some action, landing in another state (s’) and following our policy (π) after that? It is similar to what we have done before: we are going to average the values of the successor states (s’) using the transition probabilities (P), weighted by our policy.
Mathematically, we can define it as follows:
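In the same notation, averaging over actions with the policy and over successor states with the transition probabilities:
vπ(s) = Σ_a π(a|s) · [ R(s,a) + γ · Σ_s' P(s'|s,a) · vπ(s') ]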
It’s very similar to what we did for the State-Value Function, just the other way around. This diagram basically says that our agent takes some action (a), because of which the environment might land us in any of the states (s’); then from that state we can choose to take any of the actions (a’), weighted by the probability our policy (π) assigns to them. Again, we average them together and that gives us how good it is to take a particular action while following a particular policy (π) all along. Mathematically, this can be expressed as:
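In the same notation:
qπ(s,a) = R(s,a) + γ · Σ_s' P(s'|s,a) · Σ_a' π(a'|s') · qπ(s',a')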
So, this is how we can formulate the Bellman Expectation Equation for a given MDP to find its State-Value Function and State-Action Value Function. But it does not tell us the best way to behave in an MDP. For that, let’s talk about what is meant by the Optimal Value Function and the Optimal Policy.
Optimal State-Value Function: it is the maximum value function over all policies. v∗(s) tells us the maximum reward we can get from the system starting from state s.
Optimal State-Action Value Function: it is the maximum action-value function over all policies. q∗(s,a) tells us the maximum reward we are going to get if we are in state s and take action a from there onwards.
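In symbols:
v∗(s) = max_π vπ(s) and q∗(s,a) = max_π qπ(s,a), where the max is taken over all policies π.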
Before we define the Optimal Policy, let’s understand what is meant by one policy being better than another policy.
We know that for any MDP there exists a policy (π) that is better than or equal to any other policy (π’). But how do we compare them? We say that one policy (π) is better than another policy (π’) if the value function under policy π is greater than or equal to the value function under policy π’ for all states. Intuitively, it can be expressed as:
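Using the standard partial ordering over policies:
π ≥ π' if vπ(s) ≥ vπ'(s) for all states s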
Note that there can be more than one optimal policy in an MDP, but all optimal policies achieve the same optimal value function and optimal state-action value function (Q-function).
We find an optimal policy by maximizing over q*(s,a), i.e. our optimal state-action value function. We solve for q*(s,a) and then, in each state, pick the action that gives us the maximum value of q*(s,a).
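One standard way to write such a deterministic optimal policy is:
π∗(a|s) = 1 if a = argmax_a q∗(s,a), and 0 otherwise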
What this says is that for a state s we pick the action a with probability 1 if it gives us the maximum q*(s,a). So, if we know q*(s,a), we can get the optimal policy from it.
In this example, the red arcs denote the optimal policy, which means that if our agent follows this path it will yield the maximum reward from this MDP. Also, by looking at the q* values for each state we can tell which actions our agent will take to yield the maximum reward. So, the optimal policy always takes the action with the higher q* value (state-action value). For example, in the state with value 8, there are q* values of 0 and 8. Our agent chooses the one with the greater q* value, i.e. 8.
The Optimal Value Function is recursively defined by the Bellman Optimality Equation. The Bellman Optimality Equation is the same as the Bellman Expectation Equation, except that instead of averaging over the actions our agent can take, we take the action with the maximum value.
Suppose our agent is in state s and from that state it can take two actions (a). We look at the action-values for each of the actions and, unlike in the Bellman Expectation Equation, instead of taking the average our agent takes the action with the greater q* value. This gives us the value of being in state s.
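In symbols, maximizing over the available actions:
v∗(s) = max_a q∗(s,a)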
Similarly, let’s define Bellman Optimality Equation for State-Action Value Function (Q-
Function).
Suppose our agent has taken an action a in some state s. Now it is up to the environment which of the states (s’) it blows us to. We still take the average of the values of both states, but the difference is that in the Bellman Optimality Equation we use the optimal values of each of those states, unlike in the Bellman Expectation Equation, where we just used the values under the policy.
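In the same notation as before:
q∗(s,a) = R(s,a) + γ · Σ_s' P(s'|s,a) · v∗(s')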
Suppose our agent is in state s and from that state it takes some action (a). Because of the action (a), the agent might get blown to any of the states (s’), where the probabilities are determined by the environment. To find the value of state s, we average the optimal values of the states (s’) weighted by those transition probabilities, add the immediate reward, and then take the action that maximizes this quantity. This gives us the value of being in state s.
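Putting the upper (max) and lower (average) arcs together:
v∗(s) = max_a [ R(s,a) + γ · Σ_s' P(s'|s,a) · v∗(s') ]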
Bellman Optimality Equation for State-Value Function from the Backup Diagram
The max in the equation is there because we are maximizing over the actions the agent can take in the upper arcs. This equation also shows how the V* function relates to itself.
Now, let’s look at the Bellman Optimality Equation for the State-Action Value Function, q*(s,a):
Suppose our agent was in state s and it took some action (a). Because of that action, the environment might land our agent in any of the states (s’), and from those states we get to maximize over the actions our agent will take, i.e. choose the action with the maximum q* value. We back that up to the top and that tells us the value of the action a.
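Putting the two arcs together the other way around:
q∗(s,a) = R(s,a) + γ · Σ_s' P(s'|s,a) · max_a' q∗(s',a')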
Bellman Optimality Equation for State-Action Value Function from the Backup Diagram
Look at the red arrows. Suppose we wish to find the value of the state with value 6 (in red). As we can see, we get a reward of -1 if our agent chooses Facebook and a reward of -2 if our agent chooses to study. To find the value of the state in red we will use the Bellman Optimality Equation for the State-Value Function: taking the optimal values of the other two states as given, we back up the value of each action and maximize over the two actions (choose the one that gives the maximum value). From the diagram we can see that choosing Facebook yields a value of 5 for our red state and choosing to study yields a value of 6, and then we maximize over the two, which gives us 6 as the answer.
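Concretely, taking γ = 1 here and reading the optimal values of the two successor states off the diagram as 6 (via Facebook) and 8 (via Study):
v∗(red) = max(−1 + 6, −2 + 8) = max(5, 6) = 6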