
Reinforcement Learning

Unit - 2

Markov Decision Problem, Policy and value function, Reward models (infinite discounted, total, finite horizon, and average), Episodic & continuing tasks, Bellman's optimality operator, and value iteration & policy iteration

Markov Decision Problem

Markov Decision Processes

In reinforcement learning, the interactions between the agent and
the environment are often described by a Markov Decision
Process (MDP), specified by:

• State space S: A finite set of states representing the different
situations or configurations the agent can be in. At each time step, the agent
is in one of these states.
• Action space A: A finite set of actions representing the
choices the agent can make. Actions are the decisions or moves
available to the agent.

• Transition function P : S × A → ∆(S), where ∆(S) is the
space of probability distributions over S (i.e., the probability
simplex). P(s′ | s, a) is the probability of transitioning into state s′
upon taking action a in state s.
• Reward function R : S × A → [0, Rmax], where Rmax >
0 is a constant. R(s, a) is the immediate reward associated with
taking action a in state s.
• Discount factor γ ∈ [0, 1), which defines a horizon for
the problem.

Interaction protocol
In a given MDP M = (S, A, P, R, γ), the agent interacts with
the environment according to the following protocol: the agent
starts at some state s1; at each time step t = 1, 2, . . ., the agent
takes an action at ∈ A, obtains the immediate reward rt = R(st,
at), and observes the next state st+1 sampled from P(st, at), or
st+1 ∼ P(st, at). The interaction record
τ = (s1, a1, r1, s2, . . . , sH+1)
is called a trajectory of length H.

In some situations, it is necessary to specify how the initial


state s1 is generated. We consider s1 sampled from an initial
distribution d0 ∈ ∆(S). When d0 is of importance to the
discussion, we include it as part of the MDP definition, and write
M = (S, A, P, R, γ, d0).
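
To make the interaction protocol concrete, the sketch below represents a small tabular MDP with NumPy arrays and samples a length-H trajectory; the two-state example and all numerical values are assumptions chosen only for illustration, not part of these notes.

import numpy as np

def sample_trajectory(P, R, d0, policy, H, seed=None):
    # P[s, a, s'] holds transition probabilities, R[s, a] immediate rewards,
    # d0 the initial state distribution; policy maps a state index to an action index.
    # Returns the (s_t, a_t, r_t) triples for t = 1..H plus the final state s_{H+1}.
    rng = np.random.default_rng(seed)
    s = rng.choice(len(d0), p=d0)                    # s1 ~ d0
    steps = []
    for _ in range(H):
        a = policy(s)                                # a_t = pi(s_t)
        r = R[s, a]                                  # r_t = R(s_t, a_t)
        s_next = rng.choice(P.shape[2], p=P[s, a])   # s_{t+1} ~ P(s_t, a_t)
        steps.append((s, a, r))
        s = s_next
    return steps, s

# A toy two-state, two-action MDP (all numbers chosen only for illustration).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
d0 = np.array([1.0, 0.0])
print(sample_trajectory(P, R, d0, policy=lambda s: 0, H=5))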

The goal in a Markov Decision Problem is to find an


optimal policy, denoted as π*, that maximizes the expected
cumulative reward over time. This cumulative reward is often
referred to as the "return." In other words, the agent aims to
make a sequence of decisions that yield the highest expected sum
of rewards over the long run.

To find the optimal policy, various algorithms and methods


can be used, including:

Value Iteration: An iterative algorithm that computes the


optimal value function (expected cumulative reward) for each
state and then derives the optimal policy from it.

Policy Iteration: An iterative algorithm that alternates between


policy evaluation (computing the value function for a policy) and
policy improvement (selecting a better policy based on the value
function).

Q-Learning: A popular model-free reinforcement learning


algorithm that learns the optimal action-value function
(Q-function) through exploration and exploitation.

SARSA: Another model-free reinforcement learning algorithm
that learns the Q-function; unlike Q-learning, it is on-policy,
updating its estimates from the action the current policy actually
takes in the next state.
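
As a concrete illustration of the two update rules just described, here is a minimal tabular sketch; the learning rate alpha, the toy table sizes, and the sample transitions are assumptions for illustration, not taken from these notes.

import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # Off-policy TD update: bootstrap from the greedy action in s_next.
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # On-policy TD update: bootstrap from the action actually taken in s_next.
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])

# Q is a table of shape (num_states, num_actions); the transitions below are
# made-up samples standing in for real environment interaction.
Q = np.zeros((2, 2))
q_learning_update(Q, s=0, a=1, r=1.0, s_next=1)
sarsa_update(Q, s=1, a=0, r=0.5, s_next=0, a_next=1)
print(Q)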

MDPs are used to model a wide range of real-world


decision-making problems, including robotics, game playing,
autonomous systems, recommendation systems, and more. They
provide a structured framework for studying and solving
problems where decisions must be made sequentially in the
presence of uncertainty.

Policy and value

A (deterministic and stationary) policy π : S → A specifies a


decision-making strategy in which the agent chooses actions
adaptively based on the current state, i.e., at = π(st). More
generally, the agent may also choose actions according to a
stochastic policy π : S → ∆(A), and with a slight abuse of
notation we write at ∼ π(st). A deterministic policy is the special
case in which π(s) is a point mass for all s ∈ S.
The goal of the agent is to choose a policy π to maximize the
expected discounted sum of rewards, or value:

E[ Σt≥1 γ^(t−1) rt ]    (1)

The expectation is with respect to the randomness of the


trajectory, that is, the randomness in state transitions and the
stochasticity of π. Notice that, since rt is nonnegative and upper
bounded by Rmax, we have

0 ≤ Σt≥1 γ^(t−1) rt ≤ Rmax / (1 − γ)    (2)

Hence, the discounted sum of rewards (or the discounted
return) along any actual trajectory is always bounded in the range
[0, Rmax / (1 − γ)], and so is its expectation of any form. This fact will
be important when we later analyze the error propagation of
planning and learning algorithms.
Note that for a fixed policy, its value may differ for
different choices of s1, and we define the value function
V^π_M : S → R as

V^π_M(s) = E[ Σt≥1 γ^(t−1) rt | π, s1 = s ],

which is the value obtained by following policy π starting at
state s. Similarly, we define the action-value (or Q-value) function
Q^π_M : S × A → R as

Q^π_M(s, a) = E[ Σt≥1 γ^(t−1) rt | π, s1 = s, a1 = a ].

Henceforth, the dependence of any notation on M will be


made implicit whenever it is clear from context.
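
For a finite MDP, V^π can also be computed exactly rather than estimated from trajectories, by solving the linear system V = R_π + γ P_π V. The sketch below does this for a deterministic policy on a toy tabular MDP; the array shapes and values are assumptions for illustration, not from these notes.

import numpy as np

def policy_value(P, R, gamma, pi):
    # Exact policy evaluation for a tabular MDP: V^pi solves the linear system
    # V = R_pi + gamma * P_pi V, where P_pi and R_pi are the transition matrix
    # and reward vector induced by the deterministic policy pi (array of actions).
    n_states = P.shape[0]
    P_pi = P[np.arange(n_states), pi]        # shape (S, S)
    R_pi = R[np.arange(n_states), pi]        # shape (S,)
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)

# Same toy layout as before: P[s, a, s'] and R[s, a].
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
print(policy_value(P, R, gamma=0.9, pi=np.array([0, 1])))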

Reward Models:
In the context of Markov Decision Problems (MDPs), there
are several types of reward models that characterize different
aspects of the expected cumulative rewards the agent aims to
optimize. These reward models include:

1. Infinite Discounted Reward Model:


In the infinite discounted reward model, the objective is to
maximize the expected cumulative discounted reward over an
infinite time horizon.
The cumulative reward is discounted at each time step by a
discount factor γ (0 ≤ γ < 1) to account for the agent's preference
for immediate rewards over future rewards. The objective is to
maximize the following quantity:

E[ Σt≥1 γ^(t−1) rt ]

2. Total Reward Model:


- In the total reward model, the objective is to maximize the
expected cumulative reward over a finite time horizon T.
- Unlike the infinite discounted reward model, there is no
discount factor applied to future rewards. The objective is to
maximize the following quantity:

E[ r1 + r2 + ··· + rT ]

3. Finite Horizon Reward Model:


- In the finite horizon reward model, the agent aims to
maximize the expected cumulative reward over a fixed and finite
time horizon T.
- This model is similar to the total reward model but is
explicitly defined for a predetermined number of time steps.

4. Average Reward Model:


- In the average reward model, the objective is to maximize
the expected average reward per time step over an infinite time
horizon.
- It is particularly useful when comparing different policies
because it normalizes the cumulative reward by the number of
time steps. The objective is to maximize the following quantity:

lim T→∞ (1/T) E[ r1 + r2 + ··· + rT ]

These different reward models reflect variations in the


temporal focus and objectives of the agent's decision-making
process. The choice of reward model depends on the specific
problem and the agent's goals.
If the agent is interested in optimizing rewards over a fixed
time horizon without discounting, the total or finite horizon
reward model may be used. The average reward model is often
used when comparing policies in a steady-state setting.
It's important to note that the choice of reward model can
significantly influence the optimal policy and the behavior of the
agent in an MDP. Therefore, it should be carefully considered
when formulating and solving MDPs.
Episodic & Continuing tasks:
Episodic Tasks:
● Episodic tasks are problems that have a well-defined starting point
(initial state) and a terminal point (terminal state) or goal. The agent
interacts with the environment for a finite number of time steps, and
the episode terminates when a specific goal state is reached or when
a predetermined maximum number of steps is reached.

1. Episodic tasks naturally have a finite and discrete time horizon.


2. The objective is typically to maximize the cumulative reward within a
single episode.
3. Learning and decision-making occur independently in each episode,
and the agent's behavior doesn't have to consider long-term
consequences beyond the current episode.

● Examples:
● Playing a single game of chess, where the game starts from an
initial board state, and it ends when one player wins or a draw
occurs.
● Solving a maze, where the agent starts at the entrance and
finishes upon reaching the exit.
● Training an agent to perform a specific task in a video game
level, where an episode ends when the level is completed or the
character dies.

Gt = Rt+1 + Rt+2 + Rt+3 + ··· + RT

Continuing Tasks:

● Continuing tasks, on the other hand, do not have a natural endpoint or


terminal state. The agent interacts with the environment indefinitely,
and there is no predefined limit on the number of time steps or
episodes.

1. Continuing tasks involve a potentially infinite and continuous time


horizon.
2. The objective is to maximize the agent's expected cumulative reward
over the long run, rather than within a single episode.
3. Learning and decision-making must take into account the long-term
consequences of actions because there is no natural endpoint to the
task.

● Examples:
● Stock trading, where an agent makes investment decisions over
an indefinite time horizon.

● Robot control, where a robot must continuously adapt to its


surroundings and perform tasks over time.
● Recommendation systems, where an algorithm continuously
suggests items to users based on their preferences.

Gt = Rt+1 + γRt+2 + γ²Rt+3 + γ³Rt+4 + ···

   = Rt+1 + γ(Rt+2 + γRt+3 + γ²Rt+4 + ···)

   = Rt+1 + γGt+1
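
This recursion translates directly into a backward pass over an observed reward sequence. The short sketch below is only an illustration; the reward values are assumed, not taken from these notes.

def discounted_returns(rewards, gamma=0.99):
    # Compute G_t for every step of a finite reward sequence using the recursion
    # G_t = R_{t+1} + gamma * G_{t+1}, working backwards from the last reward.
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

# Rewards observed along one trajectory (illustrative values only).
print(discounted_returns([1.0, 0.0, 2.0, 1.0], gamma=0.9))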

Bellman's equation

Bellman's equation is a fundamental concept in dynamic programming and


reinforcement learning. It plays a crucial role in solving problems that involve
making a sequence of decisions over time. The equation is named after Richard E.
Bellman, who made significant contributions to the field of dynamic programming.

The basic form of Bellman's equation can be expressed as follows:

V(s) = maxa [ R(s, a) + γ Σs' P(s' | s, a) V(s') ]

or, when taking action a in state s leads deterministically to a single next state s',

V(s) = maxa [ R(s, a) + γ V(s') ]



Bellman's equation is a key component of various algorithms in reinforcement


learning, such as the Bellman equation for policy evaluation, Q-learning, and the
value iteration algorithm. These algorithms use Bellman's equation as a foundation
for finding optimal policies and values in Markov decision processes (MDPs) and
other sequential decision-making problems.

Where:
- V(s) represents the value of being in state s: the expected
cumulative reward or utility that can be obtained starting from state s and
following an optimal policy.
- a represents the action taken in state s.
- R(s, a) is the immediate reward obtained after taking action a in state s.
- γ (gamma) is the discount factor, which represents the importance of future
rewards. It is a value between 0 and 1.
- Σs' denotes a sum over all possible next states s' that can be reached from state s
by taking action a.
- P(s' | s, a) is the probability of transitioning to state s' when action a is taken in
state s.
- V(s') represents the value of the next state s'.

The objective of using Bellman's equation is to find the optimal value


function V*(s), which represents the maximum expected cumulative reward
achievable from each state under an optimal policy. Solving for V*(s) allows you
to determine the best actions to take in each state to maximize your expected return.
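
The right-hand side of the equation is just a maximization over actions of an expected one-step return. For a tabular MDP stored as arrays P[s, a, s'] and R[s, a] (a toy setup assumed here, not taken from these notes), a single Bellman backup for one state can be sketched as follows.

import numpy as np

def bellman_backup(V, P, R, gamma, s):
    # One Bellman optimality backup for state s:
    # V(s) <- max_a [ R(s, a) + gamma * sum_{s'} P(s'|s, a) V(s') ]
    return np.max(R[s] + gamma * P[s] @ V)

# Toy arrays: P[s, a, s'] transition probabilities, R[s, a] rewards.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
V = np.zeros(2)
print(bellman_backup(V, P, R, gamma=0.9, s=0))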

Bellman equations for policy iteration:


Bellman's policy iteration is an iterative algorithm used in reinforcement
learning and dynamic programming to find an optimal policy for a Markov decision
process (MDP). The goal of policy iteration is to determine the best actions to take
in each state in order to maximize the expected cumulative reward.

The policy iteration algorithm consists of two main steps that are repeated
iteratively until convergence:

1. Policy Evaluation:
- In this step, we evaluate the value function for a given policy. The value
function, denoted as Vπ (s), represents the expected cumulative reward starting from
state s and following policy π thereafter.
- The value function is updated iteratively using the Bellman expectation
equation:

Vπ(s) = Σa π(a | s) Σs', r P(s', r | s, a) [ r + γ Vπ(s') ]

- Here, π(a | s) is the probability of taking action a in state s, and P(s', r | s, a)
represents the transition probabilities and rewards associated with taking action a in
state s and transitioning to state s' with reward r.
- The above equation is solved for each state until the value function converges to
a fixed point.

2. Policy Improvement:
- Once we have the value function Vπ(s) for the current policy π, we can
improve the policy by selecting, in each state, the action that maximizes the expected
return. This results in a new policy π':

π'(s) = argmaxa Σs', r P(s', r | s, a) [ r + γ Vπ(s') ]

- The new policy π' is a greedy policy with respect to the current value function
Vπ(s). It selects the action that is expected to yield the highest return in each state.
- If π' is the same as π (i.e., the policy no longer changes), the algorithm
terminates, indicating that the optimal policy has been found. Otherwise, the process
continues with policy evaluation using the new policy π'.

The policy iteration algorithm alternates between policy evaluation and


policy improvement until convergence. At convergence, the policy becomes
optimal, meaning that it maximizes the expected cumulative reward in the given
MDP. Policy iteration is guaranteed to converge to the optimal policy for finite
MDPs, although it may take several iterations to do so.

This algorithm is effective for finding the optimal policy in MDPs and is widely
used in reinforcement learning and dynamic programming.
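
As a rough sketch of the two alternating steps (not the example from the class notes), the following runs policy iteration on a small tabular MDP; the P(s', r | s, a) form used in the text is simplified here to a deterministic reward table R[s, a], and all numbers are illustrative assumptions.

import numpy as np

def policy_iteration(P, R, gamma=0.9):
    # Alternate exact policy evaluation and greedy policy improvement until the
    # policy stops changing. P has shape (S, A, S), R has shape (S, A).
    n_states = R.shape[0]
    pi = np.zeros(n_states, dtype=int)               # arbitrary initial policy
    while True:
        # Policy evaluation: solve V = R_pi + gamma * P_pi V for the current policy.
        P_pi = P[np.arange(n_states), pi]
        R_pi = R[np.arange(n_states), pi]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Policy improvement: act greedily with respect to the evaluated V.
        Q = R + gamma * P @ V                        # shape (S, A)
        new_pi = np.argmax(Q, axis=1)
        if np.array_equal(new_pi, pi):               # policy unchanged: optimal
            return pi, V
        pi = new_pi

P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
print(policy_iteration(P, R))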

Example for policy iteration given in class notes

Bellman equations for Value iteration:

Bellman's value iteration is an iterative algorithm used in reinforcement


learning and dynamic programming to find an optimal policy for a Markov decision
process (MDP). It is a powerful approach for solving MDPs and determining the
best actions to take in each state in order to maximize the expected cumulative
reward.

Value iteration is an iterative process that works as follows:

1. Initialization:
- Initialize a value function V(s) for each state s in the MDP. This can be done
arbitrarily or with an initial guess.
- Set a convergence threshold epsilon to determine when the algorithm has
converged.

2. Value Iteration:
● For each state s, update the value function V(s) using the Bellman optimality
equation:
V(s) ← maxa Σs', r P(s', r | s, a) [ r + γ V(s') ]
In this equation:
● a represents the action taken in state s.
● P(s', r | s, a) represents the transition probabilities and rewards
associated with taking action a in state s and transitioning to state s'
with reward r.
● γ (gamma) is the discount factor, which represents the importance of future
rewards.

● Update V(s) for all states simultaneously.


● Repeat this process until the change in the value function ΔV(s) for all
states is smaller than the convergence threshold ε (ΔV(s) < ε for all s).

3. Policy Extraction:
● Once the value iteration process converges, you can extract the optimal
policy π* by choosing the action that maximizes the right-hand side of the
Bellman optimality equation for each state:

π*(s) = argmaxa Σs', r P(s', r | s, a) [ r + γ V*(s') ]

● The policy π* is now the optimal policy that maximizes the expected
cumulative reward in the MDP.

Value iteration converges to the optimal policy in finite MDPs, guaranteeing


that the policy it produces is the best policy. It combines policy evaluation and
policy improvement into a single step and iteratively refines the value function until
it converges. This makes it a computationally efficient way to solve MDPs and is
widely used in reinforcement learning and dynamic programming.
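
A compact sketch of this procedure on the same kind of toy tabular MDP follows; the arrays and the tolerance value are illustrative assumptions, not the example from the class notes.

import numpy as np

def value_iteration(P, R, gamma=0.9, epsilon=1e-6):
    # Repeatedly apply V(s) <- max_a [ R(s, a) + gamma * sum_{s'} P(s'|s, a) V(s') ]
    # until the largest change across states falls below epsilon, then extract
    # the greedy policy from the converged value function.
    V = np.zeros(R.shape[0])                         # arbitrary initialization
    while True:
        Q = R + gamma * P @ V                        # shape (S, A)
        V_new = np.max(Q, axis=1)
        if np.max(np.abs(V_new - V)) < epsilon:
            V = V_new
            break
        V = V_new
    pi_star = np.argmax(R + gamma * P @ V, axis=1)   # greedy policy extraction
    return pi_star, V

P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
print(value_iteration(P, R))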

Example for value iteration given in class notes
