RL Assignment 1
Name: N. Sivasankar
Roll No: 23691f00f9
Subject: Reinforcement Learning
Section: MCA-C
Components of a POMDP:
A POMDP is defined by the tuple (S, A, T, R, O, γ):
● S: a finite set of states.
● A: a finite set of actions.
● T(s, a, s′): the probability of moving to state s′ after taking action a in state s.
● R(s, a): the immediate reward for taking action a in state s.
● O: a set of observations, together with the observation function O(o ∣ s′, a), the probability of observing o after action a lands the agent in state s′.
● γ: the discount factor, 0 ≤ γ < 1.
Key Challenge:
Unlike in an MDP, where the agent knows the exact state, in a POMDP the agent
has only partial information about the state and must maintain a belief state, which
is a probability distribution over all possible states.
1. Belief Update:
o After executing an action a and receiving an observation o, the belief state is
updated using Bayes' rule (a code sketch follows this list):
b′(s′) = η O(o ∣ s′, a) ∑_{s∈S} T(s, a, s′) b(s)
o Here η is a normalizing constant chosen so that b′ sums to one; it equals 1 / P(o ∣ b, a).
2. Value Function:
o The value of a belief state b under a policy π is given by:
V^π(b) = E[ ∑_{t=0}^{∞} γ^t R(s_t, π(b_t)) ∣ b_0 = b ]
o This is the expected discounted reward obtained when starting from the belief
state b and following π.
3. Backup Operation:
o Dynamic programming is often used to compute the value function
iteratively. A common approach is the Bellman backup over belief states (also shown in the sketch after this list):
V(b) = max_a [ ∑_{s∈S} b(s) R(s, a) + γ ∑_{o∈O} P(o ∣ b, a) V(b′) ]
o Here b′ is the belief obtained from b via the update rule above after taking action a and observing o.
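To make the belief update and the Bellman backup concrete, here is a minimal Python sketch. The transition matrix T, observation matrix Z, reward matrix R, and discount factor are made-up numbers for a tiny hypothetical two-state problem, chosen only for illustration; the value function passed to the backup is a placeholder.

```python
import numpy as np

# Tiny hypothetical POMDP with 2 states, 2 actions, 2 observations.
# T[a, s, s'] = T(s, a, s'),  Z[a, s', o] = O(o | s', a),  R[s, a] = R(s, a)
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.4, 0.6]]])
Z = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.6, 0.4], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.95

def belief_update(b, a, o):
    """Bayes' rule: b'(s') = eta * O(o | s', a) * sum_s T(s, a, s') b(s)."""
    predicted = b @ T[a]                   # sum_s T(s, a, s') b(s), for each s'
    unnormalized = Z[a, :, o] * predicted  # multiply by O(o | s', a)
    p_o = unnormalized.sum()               # P(o | b, a); eta = 1 / p_o
    return unnormalized / p_o, p_o

def bellman_backup(b, V):
    """V(b) = max_a [ sum_s b(s) R(s, a) + gamma * sum_o P(o | b, a) V(b') ].
    V is any function mapping a belief vector to a value estimate."""
    best = -np.inf
    for a in range(T.shape[0]):
        immediate = b @ R[:, a]            # expected immediate reward under b
        future = 0.0
        for o in range(Z.shape[2]):
            b_next, p_o = belief_update(b, a, o)
            if p_o > 0:
                future += p_o * V(b_next)
        best = max(best, immediate + gamma * future)
    return best

b0 = np.array([0.5, 0.5])
print("updated belief:", belief_update(b0, a=0, o=1)[0])
print("one-step backup value:", bellman_backup(b0, V=lambda b: 0.0))
```

Repeatedly applying this backup over a set of belief points, with a proper value-function representation in place of the placeholder, is the basic idea behind value-iteration style POMDP solvers.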
Computational Complexity
Solution Techniques:
Applications of POMDPs
Key Differences:
The core difference is which policy generates the training data: on-policy methods evaluate and improve the same policy they use to act, while off-policy methods learn about a target policy from data generated by a separate behavior policy.
Detailed Explanation:
On-Policy RL
● How It Works:
o The agent learns and improves the policy it uses to interact with the
environment.
o Data comes directly from the current policy.
● Example:
o SARSA: Updates the Q-value using the action a′ actually selected in s′ by the same
policy (a code sketch follows below):
Q(s, a) ← Q(s, a) + α [ r + γ Q(s′, a′) − Q(s, a) ]
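The update above can be written in a few lines of Python. The following is a minimal tabular sketch, not taken from the assignment: it assumes a hypothetical environment object with Gymnasium-style reset() and step() methods and small discrete state and action spaces.

```python
import numpy as np

def sarsa(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular SARSA: the epsilon-greedy policy that collects the data is the
    same policy whose Q-values are being updated (on-policy)."""
    Q = np.zeros((env.observation_space.n, env.action_space.n))

    def epsilon_greedy(s):
        # behavior = target policy: epsilon-greedy over the current Q table
        if np.random.rand() < epsilon:
            return env.action_space.sample()
        return int(np.argmax(Q[s]))

    for _ in range(episodes):
        s, _ = env.reset()
        a = epsilon_greedy(s)
        done = False
        while not done:
            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            a_next = epsilon_greedy(s_next)            # a' chosen by the same policy
            target = r + gamma * Q[s_next, a_next] * (not terminated)
            Q[s, a] += alpha * (target - Q[s, a])      # SARSA update
            s, a = s_next, a_next
    return Q
```

Because the bootstrap action a_next is produced by the same epsilon-greedy policy that collects the data, the algorithm is on-policy.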
Off-Policy RL
● How It Works:
o The agent learns a policy (target policy) while using a different policy
(behavior policy) to collect data.
o Often uses importance sampling or other correction techniques to
reconcile the difference between the two policies.
● Example:
o Q-Learning: Updates the Q-value using the maximum estimated future value,
independent of the behavior policy (a code sketch follows this list):
Q(s, a) ← Q(s, a) + α [ r + γ max_{a′} Q(s′, a′) − Q(s, a) ]
o Here a′ is chosen greedily inside the update, not necessarily by the behavior policy.
● Strengths:
o Reuses past experiences, making it more data-efficient.
o Handles scenarios with pre-collected datasets (offline RL).
● Weaknesses:
o Can be unstable due to discrepancies between the target and behavior
policies.
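For comparison, here is a minimal tabular Q-learning sketch under the same assumptions as the SARSA example (a hypothetical Gymnasium-style environment with small discrete spaces); only the update target differs.

```python
import numpy as np

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning: data comes from an epsilon-greedy behavior policy,
    but the bootstrap target uses the max over a', i.e. the greedy target
    policy (off-policy)."""
    Q = np.zeros((env.observation_space.n, env.action_space.n))

    for _ in range(episodes):
        s, _ = env.reset()
        done = False
        while not done:
            # behavior policy: epsilon-greedy over the current Q table
            if np.random.rand() < epsilon:
                a = env.action_space.sample()
            else:
                a = int(np.argmax(Q[s]))

            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated

            # target policy: greedy max over a', regardless of the action the
            # behavior policy will actually take in s_next
            target = r + gamma * np.max(Q[s_next]) * (not terminated)
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```

Note that this one-step target already encodes the greedy target policy, so no importance-sampling correction is needed here; importance weights of the form π(a ∣ s) / b(a ∣ s) become relevant when off-policy methods use multi-step returns.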
When to Use Each:
● Use On-Policy:
o When interaction with the environment is inexpensive.
o For tasks where policy stability and convergence are critical.
o Examples: Real-time decision-making in games or simulations.
● Use Off-Policy:
o When you want to leverage pre-collected experience or data.
o In situations where exploration is risky or costly (e.g., robotics,
healthcare).
o Examples: Offline learning, batch reinforcement learning.