
Reinforcement Learning
UNIT-II
Markov Decision Process: problem, policy, and value function. Reward models (infinite discounted, total, finite horizon, and average). Episodic & continuing tasks, Bellman's optimality operator, and value iteration & policy iteration.

Markov Decision Process (MDP):


A Markov Decision Process is a mathematical framework used to model decision-making in
situations where the outcome depends on both the current state and the action taken by an agent.
It is widely used in the field of artificial intelligence, reinforcement learning, operations research,
and control theory.

In such a problem, an agent must decide the best action to take based on its current state. When this decision step is repeated over time, the problem is known as a Markov Decision Process.

The Agent–Environment Interface


A Markov Decision Process (MDP) model consists of the tuple

MDP = (S, A, P, R, γ)

1. States (S): A finite set of possible states. These states represent the different
situations or configurations of the environment.
2. Actions (A): A finite set of actions that the agent can take.
Actions represent the decisions or choices that the agent can
make to influence the state transitions.
3. Transition Probabilities (P): A set of transition probabilities
that describe the likelihood of moving from one state to
another when an action is taken.

These probabilities are denoted by P(s'|s, a), which is the


probability of transitioning to state s' given that the current
state is s and the action a is taken.
4. Rewards (R): A reward function that provides a scalar value
as feedback to the agent after taking an action in a particular
state. The reward function is denoted by R(s, a, s'), where s
is the current state, a is the action taken, and s' is the next
state.
5. Discount Factor (γ): A discount factor that is used to weigh
the importance of future rewards compared to immediate
rewards. It ranges between 0 and 1, where 0 indicates that
the agent only cares about immediate rewards, and 1
indicates that the agent values future rewards equally with
immediate rewards.
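For concreteness, the five components above might be written down in Python as in the following minimal sketch; the two states, two actions, and all of the probabilities and rewards are invented purely for illustration.

# A hypothetical two-state MDP represented with plain Python data structures.
# The state names, actions, probabilities, and rewards are made up for illustration.

states = ["s0", "s1"]          # S: finite set of states
actions = ["stay", "move"]     # A: finite set of actions
gamma = 0.9                    # gamma: discount factor, 0 <= gamma <= 1

# P[(s, a)] lists (next_state, probability) pairs, i.e. P(s' | s, a).
P = {
    ("s0", "stay"): [("s0", 0.9), ("s1", 0.1)],
    ("s0", "move"): [("s0", 0.2), ("s1", 0.8)],
    ("s1", "stay"): [("s1", 1.0)],
    ("s1", "move"): [("s0", 0.7), ("s1", 0.3)],
}

# R[(s, a, s_next)] gives the scalar reward R(s, a, s').
R = {
    ("s0", "stay", "s0"): 0.0, ("s0", "stay", "s1"): 1.0,
    ("s0", "move", "s0"): 0.0, ("s0", "move", "s1"): 1.0,
    ("s1", "stay", "s1"): 2.0,
    ("s1", "move", "s0"): -1.0, ("s1", "move", "s1"): 0.0,
}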


 The objective of an agent in a Markov Decision Process


is to find a policy (π), which is a mapping from states to
actions that maximizes the expected cumulative reward
over time.
 The policy tells the agent which action to take in each
state to achieve the best overall outcome.
 Solving a Markov Decision Process involves finding the
optimal policy that maximizes the expected total
reward.
 There are various algorithms to do this, such as Value
Iteration, Policy Iteration, Q-Learning, and SARSA,
which are commonly used in the context of
reinforcement learning.
 Markov Decision Processes provide a powerful
framework for modeling decision-making problems in
stochastic and uncertain environments, making them
essential tools in the study of artificial intelligence and
decision theory.


Policy and value function

 In the context of a Markov Decision Process (MDP), both


policy and value function are fundamental concepts
used to guide decision-making and evaluate the quality
of actions in different states.
Policy: A policy (π) in an MDP is a strategy or a rule that defines the
agent's behavior.
 It specifies the action taken by the agent in each state to
maximize its expected cumulative reward over time.
 Mathematically, a policy is represented as:

π(a | s) = P(take action a | in state s),


where:

π(a | s) is the probability of taking action a in state s according to the policy.

 A policy can be deterministic (choosing a single action for


each state) or stochastic (selecting actions with certain
probabilities).
 The objective of the agent is to find the best policy that leads
to the highest expected total reward
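To make the deterministic/stochastic distinction concrete, here is a small Python sketch; the state and action names and the probabilities are illustrative assumptions, in the same style as the toy MDP sketched earlier.

import random

# Deterministic policy: each state maps to exactly one action.
deterministic_pi = {"s0": "move", "s1": "stay"}

# Stochastic policy: each state maps to a probability distribution over actions, pi(a | s).
stochastic_pi = {
    "s0": {"stay": 0.3, "move": 0.7},
    "s1": {"stay": 0.9, "move": 0.1},
}

def sample_action(pi, state):
    """Sample an action from a stochastic policy pi(a | s)."""
    acts, probs = zip(*pi[state].items())
    return random.choices(acts, weights=probs, k=1)[0]

print(deterministic_pi["s0"])              # always "move"
print(sample_action(stochastic_pi, "s0"))  # "stay" with prob. 0.3, "move" with prob. 0.7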


Let us take the example of a grid world:

An agent lives in the grid. The above example is a 3×4 grid.
The grid has a START state (grid no. 1,1). The purpose of the agent is to wander around the grid and finally reach the Blue Diamond (grid no. 4,3). Under all circumstances, the agent should avoid the Fire grid (orange colour, grid no. 4,2). Grid no. 2,2 is a blocked grid; it acts as a wall, so the agent cannot enter it.
The agent can take any one of these actions: UP, DOWN, LEFT, RIGHT.
Walls block the agent's path, i.e., if there is a wall in the direction the agent would have moved, the agent stays in the same place. So, for example, if the agent chooses LEFT in the START grid, it stays put in the START grid.
First Aim: To find the shortest sequence getting from START to the Diamond.
Two such sequences can be found:

 RIGHT RIGHT UP UP RIGHT


 UP UP RIGHT RIGHT RIGHT


Let us take the second one (UP UP RIGHT RIGHT RIGHT) for the subsequent
discussion.
For example, if the agent chooses UP, the probability of actually going UP is 0.8, whereas the probability of going LEFT is 0.1 and the probability of going RIGHT is 0.1 (since LEFT and RIGHT are at right angles to UP).
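A minimal Python sketch of this noisy action model, assuming the 0.8/0.1/0.1 slip behaviour described above (the helper name actual_move and the SLIP table are invented for illustration):

import random

# Perpendicular directions for each intended grid action (straight, slip-left, slip-right).
SLIP = {
    "UP":    ("UP", "LEFT", "RIGHT"),
    "DOWN":  ("DOWN", "RIGHT", "LEFT"),
    "LEFT":  ("LEFT", "DOWN", "UP"),
    "RIGHT": ("RIGHT", "UP", "DOWN"),
}

def actual_move(intended):
    """Return the direction actually taken under the 0.8 / 0.1 / 0.1 slip model."""
    straight, slip_left, slip_right = SLIP[intended]
    return random.choices([straight, slip_left, slip_right],
                          weights=[0.8, 0.1, 0.1], k=1)[0]

print(actual_move("UP"))   # "UP" about 80% of the time, "LEFT" or "RIGHT" about 10% each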

The agent receives rewards each time step:-

 Small reward at each step (it can be negative, in which case it can also be termed a punishment; in the above example, entering the Fire grid can have a reward of -1).
 Big rewards come at the end (good or bad).
 The goal is to Maximize the sum of rewards.

Value Function:
 The value function is a function that estimates the expected
cumulative reward an agent can achieve starting from a
particular state and following a specific policy.
 The value function provides a measure of how good it is for
the agent to be in a particular state and follow a particular
policy.
 The notion of “how good” here is defined in terms of
future rewards that can be expected
 There are two types of value functions:


1. The state-value function (Vπ(s))

We call the function Vπ the state-value function for policy π. It is defined as:

Vπ(s) = Eπ[ R_{t+1} + γ·R_{t+2} + γ²·R_{t+3} + ... | S_t = s ] = Eπ[ Σ_{k=0..∞} γ^k · R_{t+k+1} | S_t = s ]

 where Eπ[·] denotes the expected value of a random variable given that the agent follows policy π, and t is any time step.
 γ is the discount factor that weighs the importance of future rewards.
 R_t is the reward obtained at time step t.

Action-Value Function
We define the value of taking action a in state s under a policy π, denoted qπ(s, a), as the expected return starting from s, taking the action a, and thereafter following policy π:

qπ(s, a) = Eπ[ Σ_{k=0..∞} γ^k · R_{t+k+1} | S_t = s, A_t = a ]
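The two value functions are linked: under a stochastic policy, the state value is the policy-weighted average of the action values, Vπ(s) = Σ_a π(a | s) · qπ(s, a). A short Python sketch of this relationship, with invented numbers:

# Hypothetical action values q_pi(s, a) for a single state, for illustration only.
q_pi = {("s0", "stay"): 1.2, ("s0", "move"): 2.5}

# Stochastic policy pi(a | s) for that state.
pi = {"s0": {"stay": 0.3, "move": 0.7}}

def state_value(state, pi, q_pi):
    """V_pi(s) = sum over a of pi(a | s) * q_pi(s, a)."""
    return sum(prob * q_pi[(state, action)]
               for action, prob in pi[state].items())

print(state_value("s0", pi, q_pi))   # 0.3 * 1.2 + 0.7 * 2.5 = 2.11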


Reward models
In the context of reinforcement learning and Markov Decision
Processes (MDPs), reward models play a crucial role in guiding the
learning process and shaping the behavior of an agent.
 Reward models define the immediate feedback an agent
receives for each action taken in a particular state.
 They are used to represent the goals and objectives of the
agent in the learning environment.

A reward model can be thought of as a mapping from state-action pairs to


real-valued rewards. Mathematically, a reward model can be represented as
a function R(s, a), where:
 s is the current state in the environment.
 a is the action taken by the agent in that state.
 R(s, a) is the reward received by the agent after taking action a in
state s.


Various types of reward models are used to represent different objectives and goals
for an agent. The four main types of reward models are:

1. Infinite Discounted Reward Model: In the infinite discounted


reward model, the agent aims to maximize the cumulative discounted
reward over an infinite time horizon.
The cumulative discounted reward at time step t is given by the sum
of rewards obtained at each time step, each multiplied by a discount
factor γ raised to the power of the time step:
Cumulative Discounted Reward = Σ [γ^t * R_t],
where:
 γ (gamma) is the discount factor, 0 ≤ γ < 1. It determines the relative importance of immediate rewards compared to future rewards. Smaller γ values emphasize immediate rewards, while larger γ values emphasize future rewards.

 The objective of the agent is to find a policy that maximizes the expected cumulative discounted reward.
 The discounted reward model helps the agent prioritize getting rewards sooner rather than later, as future rewards are discounted with each time step.

2. Total Reward Model: In the total reward model, the agent aims to maximize the cumulative sum of rewards over a finite time horizon T. The objective is to obtain as much reward as possible within a fixed number of time steps.
 Total Reward = Σ [R_t],
 where t ranges from 0 to T.
 Unlike the infinite discounted reward model, the
total reward model does not involve a discount
factor, and the agent's focus is on maximizing the
sum of rewards obtained over a fixed time horizon.


3. Finite Horizon Reward Model: The finite horizon reward model is similar
to the total reward model, but it allows for variable time horizons.
Instead of having a fixed time horizon T, each episode (sequence of
actions) may have a different length, and the agent aims to maximize
the cumulative reward over each individual episode.
 Finite Horizon Reward = Σ [R_t],
 where t ranges from 0 to T_k, the time step at the end of
episode k.
 The finite horizon reward model is particularly useful in
episodic environments, where each episode has a natural
ending point.

4. Average Reward Model: In the average reward model, the agent aims to
maximize the average reward obtained per time step over an infinite
time horizon. Unlike the infinite discounted reward model, which
sums up discounted rewards, the average reward model calculates
the average of rewards without discounting.
 Average Reward = lim (T → ∞) [Σ [R_t] / T],
 where T is the number of time steps.
 The average reward model is often used in stationary
environments, where the reward distribution remains constant
over time, and the agent's objective is to find a policy that
maximizes the long-term average reward per time step.

Each of these reward models represents different objectives for the


agent and can lead to distinct optimal policies and behaviors. The
choice of reward model depends on the specific task and objectives
of the reinforcement learning problem at hand.
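As a rough illustration of how the different criteria score the same experience, the sketch below computes the (truncated) discounted return, the total return, and the empirical average reward for one arbitrary reward sequence with γ = 0.9; both the sequence and γ are made-up values:

rewards = [1.0, 0.0, 2.0, -1.0, 3.0]   # hypothetical rewards R_0 ... R_4
gamma = 0.9                            # discount factor for the discounted model

# Infinite discounted model (truncated to the rewards we actually have):
discounted = sum((gamma ** t) * r for t, r in enumerate(rewards))

# Total / finite-horizon model: the plain sum of rewards over the horizon.
total = sum(rewards)

# Average reward model: mean reward per time step.
average = sum(rewards) / len(rewards)

print(discounted, total, average)      # approximately 3.86, 5.0, 1.0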


Episodic and continuing tasks


Episodic and continuing tasks are two types of environments that
are commonly encountered in the context of reinforcement learning
and Markov Decision Processes (MDPs).
 They differ in terms of the duration and structure of the agent's
interactions with the environment.

Episodic Tasks:
 In episodic tasks, the agent's interactions with the environment
are organized into episodes.
 Each episode is a sequence of actions and states that begins
with an initial state and ends with a terminal state.
 The terminal state is a special state that marks the end of an
episode, after which the environment is reset to the initial state,
and a new episode begins.
 Examples of episodic tasks include games with multiple rounds
or levels, where each round is an episode, or a robotic task that
requires the agent to complete a specific task within a limited
time frame.
The objective of the agent in an episodic task is often to maximize the
cumulative reward obtained within each individual episode.
After each episode, the agent receives a terminal signal (usually a
reward of 0) to indicate the end of the episode.

Continuing Tasks:

 In continuing tasks, the agent's interactions with the


environment do not have an explicit episode structure or a
terminal state.
 The agent's experience continues indefinitely without any
predefined end point.
 Examples of continuing tasks include controlling a robot that
operates continuously without a predefined stopping condition
or an agent learning to navigate a virtual environment without
any specific endpoint.


The objective of the agent in a continuing task is typically to maximize the long-term
cumulative reward over an infinite or very long time horizon.
Since there are no episodes, the agent's experience continually accumulates, and there
is no explicit end-of-episode signal.

Key Differences:

1. Unified Notation:
 Episodic tasks: the agent's experience is divided into atomic episodes, and the choice of action in each episode depends only on that episode itself.
 Continuing tasks: in a sequential (continuing) environment, on the other hand, the current decision affects future decisions.

2. Temporal Structure:
 Episodic tasks have well-defined episodes with a fixed number of time steps.
 Continuing tasks do not have a predefined episode structure or terminal states.

3. Learning Objective:
 In episodic tasks, the agent aims to maximize the cumulative reward within each episode.
 In continuing tasks, the objective is to maximize the long-term cumulative reward over an infinite or extended time horizon.

4. Resetting:
 Episodic tasks involve resetting the environment to its initial state after each episode, allowing the agent to start a new episode.
 In continuing tasks, there is no reset between episodes since they do not have an end.

5. Termination Signal:
 In episodic tasks, the terminal state or a terminal signal marks the end of each episode.
 In continuing tasks, there is no explicit termination signal.

Reinforcement learning algorithms and approaches can be


adapted to handle both episodic and continuing tasks, and
the choice of the task type depends on the specific problem
being addressed.


Worked example: to calculate the discounted cumulative rewards (returns) G0, G1, ..., G5, we work backward starting from the terminal state (T = 5) and use the given discount factor γ = 0.5. The discounted cumulative reward G_t at time step t is defined as:
G_t = R_{t+1} + (γ * G_{t+1}).

Using the given sequence of rewards R1 = -1, R2 = 2, R3 = 6, R4 = 3, and R5 = 2, let's calculate


G0, G1, ..., G5 step by step:

1. G5 (terminal state):
G5 = 0 (the return from the terminal state is zero, since no rewards follow it).
2. G4:
G4 = R5 + (γ * G5) = 2 + (0.5 * 0) = 2.
3. G3:
G3 = R4 + (γ * G4) = 3 + (0.5 * 2) = 3 + 1 = 4.
4. G2:
G2 = R3 + (γ * G3) = 6 + (0.5 * 4) = 6 + 2 = 8.
5. G1:
G1 = R2 + (γ * G2) = 2 + (0.5 * 8) = 2 + 4 = 6.
6. G0 (initial state):
G0 = R1 + (γ * G1) = -1 + (0.5 * 6) = -1 + 3 = 2.

So, the discounted cumulative returns for each time step are: G0 = 2, G1 = 6,
G2 = 8, G3 = 4, G4 = 2, G5 = 0.
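The same backward recursion can be sketched in a few lines of Python to check the numbers above; it assumes, as in the calculation, that the return from the terminal state is zero:

def returns_backward(rewards, gamma):
    """Compute G_0 ... G_T from rewards R_1 ... R_T using G_t = R_{t+1} + gamma * G_{t+1}."""
    G = [0.0] * (len(rewards) + 1)     # G[T] = 0 at the terminal state
    for t in range(len(rewards) - 1, -1, -1):
        G[t] = rewards[t] + gamma * G[t + 1]
    return G

print(returns_backward([-1, 2, 6, 3, 2], gamma=0.5))
# -> [2.0, 6.0, 8.0, 4.0, 2.0, 0.0], i.e. G0 = 2, G1 = 6, G2 = 8, G3 = 4, G4 = 2, G5 = 0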

Bellman’s optimality operator

 The Bellman optimality operator is a mathematical operator used to define the optimal value function in a Markov Decision Process (MDP) and is a central tool of dynamic programming.
 In an MDP, an agent makes decisions in an environment,
transitioning from one state to another while receiving rewards
based on its actions.
 The goal of the agent is to find a policy (a strategy for selecting
actions) that maximizes the expected cumulative reward over
time.

The Bellman optimality operator is used to iteratively update the value function based on the Bellman optimality equation, and thereby to derive the optimal policy.

For a given state s, the Bellman optimality equation for the state value V*(s) is defined as follows:

V*(s) = max[a] { Σ[s', r] P(s', r | s, a) * [r + γ * V*(s')] }


Where:

 V*(s) is the optimal value function for state s.


 max[a] denotes taking the maximum over all possible actions a that
the agent can take in state s.
 P(s', r | s, a) is the probability of transitioning to state s' and receiving
reward r, given that the agent is in state s and takes action a.
 r is the immediate reward the agent receives from the environment
after taking action a in state s and transitioning to state s'.
 γ (gamma) is the discount factor, a constant between 0 and 1, which
determines the agent's preference for immediate rewards versus
future rewards.

 The Bellman optimality operator takes the current value function V*(s)
and updates it to a new value based on the expected rewards of taking
the optimal action from each state.
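A minimal Python sketch of a single Bellman optimality backup applied to one state. The dictionary-of-triples representation of P(s', r | s, a), the state names, and the numbers are assumptions made purely for illustration:

# Hypothetical model: P[(s, a)] -> list of (next_state, reward, probability) triples.
P = {
    ("s0", "stay"): [("s0", 0.0, 1.0)],
    ("s0", "move"): [("s1", 1.0, 0.8), ("s0", 0.0, 0.2)],
}
gamma = 0.9
V = {"s0": 0.0, "s1": 5.0}   # current (made-up) value estimates

def bellman_backup(s, actions, P, V, gamma):
    """Apply V*(s) <- max over a of sum over (s', r) of P(s', r | s, a) * [r + gamma * V(s')]."""
    return max(
        sum(prob * (r + gamma * V[s_next]) for s_next, r, prob in P[(s, a)])
        for a in actions
    )

print(bellman_backup("s0", ["stay", "move"], P, V, gamma))
# "stay" gives 0.0; "move" gives 0.8 * (1 + 0.9 * 5) + 0.2 * (0 + 0) = 4.4, so the backup returns 4.4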


Policy iteration:
 Policy iteration is an iterative algorithm used to find the optimal policy (the policy that specifies which action to take in each state so as to maximize the total discounted reward) in a Markov Decision Process (MDP) or reinforcement learning problem.
 It combines two main steps: 1. policy evaluation and 2. policy improvement.
 The goal of policy iteration is to converge to the optimal policy
by improving an initial policy through multiple iterations.

The algorithm starts with an initial policy, which can be any arbitrary
policy or a random policy. Then, it alternates between the following
two steps until the policy converges to the optimal policy:

1. Policy Evaluation: In this step, the algorithm evaluates the value function of each state under the current policy. The value function for a policy π is denoted by Vπ(s) and can be computed using the following equation:
Vπ(s) = Σ[s', r] P(s', r | s, π(s)) * [r + γ * Vπ(s')]
Where:

 Vπ(s) is the value function for policy π in state s.


 Vπ(s') is the value function of the future(next) state
 P(s', r | s, π(s)) is the probability of transitioning to state s' and receiving
reward r when taking the action determined by policy π in state s.
 γ (gamma) is the discount factor, a constant between 0 and 1, which
determines the agent's preference for immediate rewards versus future
rewards.

The policy evaluation step involves solving the above equation for each state
in the MDP, and this process is usually repeated until the values of the value
function converge.
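A rough Python sketch of this iterative policy evaluation step, assuming a deterministic policy pi and the same dictionary-of-triples model format used in the earlier sketches; the function name and the tolerance theta are illustrative choices:

def policy_evaluation(states, pi, P, gamma, theta=1e-6):
    """Iteratively sweep the Bellman expectation equation for a deterministic policy pi
    until the largest change in any state value is below the tolerance theta."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = sum(prob * (r + gamma * V[s_next])
                        for s_next, r, prob in P[(s, pi[s])])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V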


2. Policy Improvement: The policy improvement step involves selecting the best
action in each state to maximize the expected cumulative reward according to
the current value function.

The updated policy, denoted as π', is given by:

π'(s) = argmax[a] { Σ[s', r] P(s', r | s, a) * [r + γ * Vπ(s')] }

Where:

 π'(s) is the action selected by the updated policy π' in state s.


 The argmax[a] denotes taking the action that maximizes the expression in the
curly brackets.

After policy improvement, the new policy π' is obtained, and the process
repeats.



Policy iteration Algorithm:
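A minimal Python sketch of the full policy iteration loop, combining the evaluation and improvement steps above; it reuses the hypothetical model format and the policy_evaluation helper sketched earlier:

def policy_improvement(states, actions, P, V, gamma):
    """Return the greedy (deterministic) policy with respect to the value function V."""
    return {
        s: max(actions,
               key=lambda a: sum(prob * (r + gamma * V[s_next])
                                 for s_next, r, prob in P[(s, a)]))
        for s in states
    }

def policy_iteration(states, actions, P, gamma):
    """Alternate policy evaluation and policy improvement until the policy stops changing."""
    pi = {s: actions[0] for s in states}              # arbitrary initial policy
    while True:
        V = policy_evaluation(states, pi, P, gamma)   # from the sketch above
        new_pi = policy_improvement(states, actions, P, V, gamma)
        if new_pi == pi:                              # policy is stable, hence optimal
            return pi, V
        pi = new_pi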

Value iteration:

 The process of repeatedly applying the Bellman optimality operator to


the value function until it converges to the true optimal value function
is known as the value iteration algorithm, a common approach to
solving MDPs in dynamic programming.

 It is a dynamic programming method; because it folds policy improvement into each value update, it avoids the separate full policy-evaluation step used by policy iteration, which can make it practical even for MDPs with fairly large state spaces.
 The value iteration algorithm starts with an initial estimate of the optimal
value function and then repeatedly updates the value function until it
converges to the true optimal value function.


 The algorithm is based on the Bellman optimality equation, which relates the
value of a state to the value of its successor states under the optimal policy.
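A minimal Python sketch of the value iteration loop under the same assumed model format (P[(s, a)] as (next_state, reward, probability) triples); the stopping tolerance theta is an arbitrary choice:

def value_iteration(states, actions, P, gamma, theta=1e-6):
    """Repeatedly apply the Bellman optimality backup to every state until the value
    function converges, then read off a greedy policy."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = max(sum(prob * (r + gamma * V[s_next])
                            for s_next, r, prob in P[(s, a)])
                        for a in actions)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    # Greedy policy with respect to the converged value function.
    pi = {s: max(actions,
                 key=lambda a: sum(prob * (r + gamma * V[s_next])
                                   for s_next, r, prob in P[(s, a)]))
          for s in states}
    return V, pi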
