
Unit 4

Markov Decision Process

Reinforcement Learning is a type of Machine Learning. It allows machines and software agents to automatically determine the ideal behavior within a specific context in order to maximize their performance. Only simple reward feedback is required for the agent to learn its behavior; this is known as the reinforcement signal.

There are many different algorithms that tackle this issue. In fact, Reinforcement Learning is defined by a specific type of problem, and all its solutions are classed as Reinforcement Learning algorithms. In this problem, an agent must decide the best action to take based on its current state. When this decision step is repeated, the problem is known as a Markov Decision Process.

A Markov Decision Process (MDP) model contains:

 A set of possible world states S.

 A set of Models.

 A set of possible actions A.

 A real-valued reward function R(s,a).

 A policy, which is the solution to the Markov Decision Process.

State

A State is a token representing a situation the agent can be in; S is the set of all such states.

Model
A Model (sometimes called a Transition Model) gives an action’s effect in a state. In particular, T(S, a, S’) defines a transition T where being in state S and taking an action ‘a’ takes us to state S’ (S and S’ may be the same). For stochastic actions (noisy, non-deterministic) we also define a probability P(S’ | S, a), which represents the probability of reaching state S’ if action ‘a’ is taken in state S. Note that the Markov property states that the effects of an action taken in a state depend only on that state and not on the prior history.

Actions

A is the set of all possible actions. A(s) defines the set of actions that can be taken in state S.

Reward

A Reward is a real-valued reward function. R(s) indicates the reward for simply being in the state S. R(S, a) indicates the reward for being in a state S and taking an action ‘a’. R(S, a, S’) indicates the reward for being in a state S, taking an action ‘a’, and ending up in a state S’.

Policy

A Policy is a solution to the Markov Decision Process. A policy is a mapping from states to actions; it indicates the action ‘a’ to be taken while in state S.
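Putting these components together, the following minimal sketch shows one way the pieces of an MDP could be represented in code; all names and numbers are illustrative assumptions, not part of the notes.

# A tiny, illustrative MDP (all names and values are assumed for the example).
states = ["s1", "s2", "s3"]                              # S: set of possible world states
actions = {"s1": ["a", "b"], "s2": ["a"], "s3": []}      # A(s): actions available in each state

# Transition model: T[(s, a)][s2] = P(s2 | s, a)
transition = {
    ("s1", "a"): {"s1": 0.2, "s2": 0.8},
    ("s1", "b"): {"s3": 1.0},
    ("s2", "a"): {"s3": 1.0},
}

# Reward function R(s, a)
reward = {("s1", "a"): 0.0, ("s1", "b"): -1.0, ("s2", "a"): 10.0}

# A policy maps each (non-terminal) state to the action to take there
policy = {"s1": "a", "s2": "a"}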
Let us take the example of a grid world:

An agent lives in the grid. The above example is a 3x4 grid. The grid has a START state (grid no 1,1). The purpose of the agent is to wander around the grid and finally reach the Blue Diamond (grid no 4,3). Under all circumstances, the agent should avoid the Fire grid (orange color, grid no 4,2). Also, grid no 2,2 is a blocked grid; it acts as a wall, hence the agent cannot enter it.

The agent can take any one of these actions: UP, DOWN, LEFT, RIGHT

Walls block the agent’s path, i.e., if there is a wall in the direction the agent would have taken, the agent stays in the same place. So, for example, if the agent says LEFT in the START grid, it would stay put in the START grid.

First Aim: To find the shortest sequence getting from START to the Diamond.
Two such sequences can be found:

 RIGHT RIGHT UP UP RIGHT

 UP UP RIGHT RIGHT RIGHT


Let us take the second one (UP UP RIGHT RIGHT RIGHT) for the subsequent
discussion.
The moves are now noisy: 80% of the time the intended action works correctly, and 20% of the time the action the agent takes causes it to move at right angles to the intended direction. For example, if the agent says UP, the probability of going UP is 0.8, whereas the probability of going LEFT is 0.1 and the probability of going RIGHT is 0.1 (since LEFT and RIGHT are at right angles to UP).
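As a rough illustration of this noisy transition model, the sketch below builds P(s’ | s, a) for the grid world described above. The coordinate scheme, wall position, and helper names are assumptions made for the example.

# Illustrative noisy grid-world transition model (assumed layout: 4 columns x 3 rows, wall at (2,2)).
WALL = (2, 2)
COLS, ROWS = 4, 3

MOVES = {"UP": (0, 1), "DOWN": (0, -1), "LEFT": (-1, 0), "RIGHT": (1, 0)}
RIGHT_ANGLES = {"UP": ["LEFT", "RIGHT"], "DOWN": ["LEFT", "RIGHT"],
                "LEFT": ["UP", "DOWN"], "RIGHT": ["UP", "DOWN"]}

def step(state, direction):
    """Deterministic move; bumping into the wall or the grid edge leaves the agent in place."""
    x, y = state
    dx, dy = MOVES[direction]
    nxt = (x + dx, y + dy)
    if nxt == WALL or not (1 <= nxt[0] <= COLS and 1 <= nxt[1] <= ROWS):
        return state
    return nxt

def transition_probs(state, action):
    """P(s' | s, a): 0.8 for the intended direction, 0.1 for each right-angle slip."""
    probs = {}
    for direction, p in [(action, 0.8)] + [(d, 0.1) for d in RIGHT_ANGLES[action]]:
        nxt = step(state, direction)
        probs[nxt] = probs.get(nxt, 0.0) + p
    return probs

# From START (1,1), saying UP goes up with 0.8; the LEFT slip hits the boundary, so the agent stays put.
print(transition_probs((1, 1), "UP"))   # {(1, 2): 0.8, (1, 1): 0.1, (2, 1): 0.1}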

Utility theory
Utility theory is a fundamental concept in economics and decision theory. It provides a framework for understanding how individuals make choices under uncertainty. The aim of such an agent is not only to achieve the goal but to reach it in the best possible way. The idea is that people (or agents) assign a value to each possible result of a choice, showing how much they like or are satisfied with that result. The aim is to obtain the highest expected value, which is the average of the values of all possible results, weighted by how likely each one is to happen.
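In symbols (a standard way of writing this idea, not spelled out in the notes), the expected utility of an action a is EU(a) = Σ_s P(s | a) · U(s), and a rational agent selects the action with the highest EU(a).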

1. Utility Function: Definition and Purpose

The utility function is a core element of utility-based agents, serving as a mathematical representation of the agent’s preferences. It assigns a numerical value (utility) to each possible outcome, reflecting the desirability or satisfaction associated with that outcome.

 Definition: A utility function U(s) is a mapping from states s to real numbers, indicating the utility or value of each state.
 Purpose: The primary purpose of the utility function is to quantify the
agent’s preferences, allowing it to compare and evaluate different
states. By maximizing utility, the agent can choose actions that lead to
the most desirable outcomes according to its objectives.

For example, in an autonomous vehicle, the utility function might consider factors such as safety, speed, fuel efficiency, and passenger comfort. Each possible driving state would be assigned a utility value based on these criteria.
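A minimal sketch of how such a utility-based choice could look in code follows; the criteria weights, candidate actions, and outcome probabilities are invented purely for illustration and are not taken from the notes.

# Illustrative utility-based action selection; all names and numbers are assumed for the example.
def utility(state):
    """U(s): weighted sum of criteria such as safety, speed, fuel efficiency, comfort."""
    weights = {"safety": 0.5, "speed": 0.2, "fuel": 0.15, "comfort": 0.15}
    return sum(weights[k] * state[k] for k in weights)

# Each action leads to possible outcome states with some probability P(s | a).
outcomes = {
    "overtake":  [(0.7, {"safety": 0.6, "speed": 0.9, "fuel": 0.5, "comfort": 0.6}),
                  (0.3, {"safety": 0.2, "speed": 0.9, "fuel": 0.4, "comfort": 0.3})],
    "stay_lane": [(1.0, {"safety": 0.9, "speed": 0.5, "fuel": 0.8, "comfort": 0.8})],
}

def expected_utility(action):
    """EU(a) = sum over outcomes of P(s | a) * U(s)."""
    return sum(p * utility(s) for p, s in outcomes[action])

best = max(outcomes, key=expected_utility)   # the agent picks the action with the highest EU
print(best, expected_utility(best))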

Difference between Value Iteration and Policy Iteration

 Methodology: Value iteration iteratively updates value functions until convergence, while policy iteration alternates between policy evaluation and policy improvement.

 Goal: Value iteration converges to the optimal value function; policy iteration converges to the optimal policy.

 Execution: Value iteration directly computes value functions; policy iteration evaluates and improves policies sequentially.

 Complexity: Value iteration is typically simpler to implement and understand; policy iteration involves more steps and computations.

 Convergence: Value iteration may converge faster in some scenarios; policy iteration generally converges slower but yields better policies.
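To make the comparison concrete, here is a minimal value iteration sketch for a generic MDP represented with Python dictionaries; the data structures and the tiny example MDP are assumptions for illustration, not part of the notes. Policy iteration would instead alternate a full policy-evaluation step with a greedy policy-improvement step, operating directly on the policy rather than only on the value function.

# Minimal value iteration sketch (illustrative; the tiny MDP below is invented for the example).
def value_iteration(states, actions, T, R, gamma=0.9, theta=1e-6):
    """Iteratively update V(s) = max_a [ R(s, a) + gamma * sum_s2 T(s, a, s2) * V(s2) ]."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            if not actions[s]:
                continue
            best = max(R[(s, a)] + gamma * sum(p * V[s2] for s2, p in T[(s, a)].items())
                       for a in actions[s])
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:          # stop once value updates become negligible
            break
    # Extract a greedy policy from the converged value function.
    policy = {s: max(actions[s],
                     key=lambda a: R[(s, a)] + gamma * sum(p * V[s2] for s2, p in T[(s, a)].items()))
              for s in states if actions[s]}
    return V, policy

# Tiny two-state example.
states = ["s1", "s2"]
actions = {"s1": ["go", "wait"], "s2": []}
T = {("s1", "go"): {"s2": 1.0}, ("s1", "wait"): {"s1": 1.0}}
R = {("s1", "go"): 10.0, ("s1", "wait"): 0.0}
print(value_iteration(states, actions, T, R))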

Partially Observable Markov Decision Process (POMDP)

A POMDP models decision-making tasks where an agent must make decisions based on incomplete or uncertain state information. It is particularly useful in scenarios where the agent cannot directly observe the underlying state of the system but rather receives observations that provide partial information about the state.

Components of a POMDP
A POMDP is formally defined by the following elements:

 States (S): A finite set of states representing all possible conditions the system can be in.

 Actions (A): A finite set of actions available to the agent.

 Transition Model (T): A function T(s, a, s′) = P(s′ | s, a) that defines the probability of transitioning from state s to state s′ under action a.

 Observations (O): A finite set of observations that the agent can perceive.

 Observation Model (Z): A function Z(s′, a, o) = P(o | s′, a) that defines the probability of observing o after taking action a and ending up in state s′.

 Rewards (R): A function R(s, a) that assigns a numerical reward to taking action a in state s.

 Discount Factor (γ): A factor between 0 and 1 that discounts future rewards, reflecting the preference for immediate rewards over future gains.

Mathematical Framework of Partially Observable Markov Decision Process

The decision process in a POMDP is a cycle of states, actions, and observations. At each time step, the agent:

1. Observes a signal that partially reveals the state of the environment.

2. Chooses an action based on the accumulated observations.

3. Receives a reward dependent on the action and the underlying state.

4. Moves to a new state based on the transition model.

The key challenge in a POMDP is that the agent does not know its exact state but instead maintains a belief, a probability distribution over the possible states. This belief is updated using Bayes’ rule as new observations are made, giving the belief update rule:

Bel(s′) = [ P(o | s′, a) Σ_s P(s′ | s, a) Bel(s) ] / P(o | a, Bel)

Where:
 Bel(s) is the prior belief of being in state s.

 Bel(s′) is the updated belief after taking action a and observing o.

 P(o | a, Bel) is a normalizing constant that makes the updated belief sum to 1.
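A minimal sketch of this Bayes-rule belief update follows; the two-state POMDP (its transition and observation tables) is an invented example used only to illustrate the formula above.

# Illustrative belief update: Bel(s2) is proportional to Z(s2, a, o) * sum_s T(s, a, s2) * Bel(s).
states = ["good", "bad"]

# Transition model T[(s, a)][s2] = P(s2 | s, a)
T = {("good", "listen"): {"good": 1.0}, ("bad", "listen"): {"bad": 1.0}}

# Observation model Z[(s2, a)][o] = P(o | s2, a)
Z = {("good", "listen"): {"hear_good": 0.85, "hear_bad": 0.15},
     ("bad", "listen"):  {"hear_good": 0.15, "hear_bad": 0.85}}

def belief_update(belief, action, obs):
    new_belief = {}
    for s2 in states:
        pred = sum(T[(s, action)].get(s2, 0.0) * belief[s] for s in states)  # prediction step
        new_belief[s2] = Z[(s2, action)].get(obs, 0.0) * pred                # correction step
    norm = sum(new_belief.values())            # P(o | a, Bel), the normalizing constant
    return {s2: v / norm for s2, v in new_belief.items()}

# Starting from a uniform belief, hearing "hear_good" shifts the belief toward "good".
print(belief_update({"good": 0.5, "bad": 0.5}, "listen", "hear_good"))
# -> {'good': 0.85, 'bad': 0.15}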

Strategies for Solving Partially Observable Markov Decision Processes

Partially Observable Markov Decision Processes (POMDPs) pose significant challenges in environments where agents have incomplete information. Solving POMDPs involves optimizing decision-making strategies under uncertainty, which is crucial in many real-world applications. This overview highlights key strategies and methods for addressing these challenges.

Belief State Representation:

In POMDPs, agents maintain a belief state, i.e., a probability distribution over all possible states, to manage uncertainty. This belief updates dynamically with actions and observations via Bayes’ rule.

Solving Techniques:

1. Value Iteration: Extends traditional value iteration to belief states, using a piecewise linear and convex value function to calculate the expected rewards and update beliefs accordingly.

2. Point-Based Methods: These methods, such as Perseus and Point-Based Value Iteration (PBVI), focus on a select set of belief points to simplify computations and efficiently approximate the value function.

3. Policy Search Methods: Methods like QMDP and finite-state controllers search for optimal policies, sometimes simplifying the problem by assuming the state becomes fully observable after the next action or by using a finite set of controller states (a minimal QMDP sketch is given after this list).

4. Monte Carlo Methods: Techniques like Partially Observable Monte Carlo Planning (POMCP) and DESPOT leverage Monte Carlo simulations within a tree search framework to estimate policy values under uncertainty, focusing on key scenarios to reduce complexity.
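As referenced in item 3, the sketch below illustrates the QMDP approximation, in which Q-values computed for the underlying fully observable MDP are weighted by the current belief; the Q table and action names are assumed for the example.

# Illustrative QMDP approximation: Q_MDP(b, a) = sum_s b(s) * Q(s, a),
# where Q(s, a) comes from solving the underlying fully observable MDP
# (the Q table below is an assumed example, not from the notes).
Q = {("good", "open"): 10.0, ("good", "listen"): -1.0,
     ("bad", "open"): -100.0, ("bad", "listen"): -1.0}

def qmdp_action(belief, actions):
    """Pick the action maximizing the belief-weighted Q value."""
    def score(a):
        return sum(belief[s] * Q[(s, a)] for s in belief)
    return max(actions, key=score)

# With high confidence in "good", opening looks best; under real uncertainty, listening wins.
print(qmdp_action({"good": 0.95, "bad": 0.05}, ["open", "listen"]))  # -> "open"
print(qmdp_action({"good": 0.5, "bad": 0.5}, ["open", "listen"]))    # -> "listen"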

These methods illustrate the ongoing advancements in computational techniques to manage and solve the complexities of POMDPs, enhancing decision-making in uncertain environments.
