
Assignment - 1

Name: N. Sivasankar
Roll No: 23691f00f9
Subject: Reinforcement Learning
Section: MCA-C

1. Discuss Partially Observable MDPs and their policy evaluation.


A. Partially Observable Markov Decision Processes (POMDPs)
A Partially Observable Markov Decision Process (POMDP) is a
framework used to model decision-making problems where the agent does not
have full knowledge of the state of the environment. It extends the classical
Markov Decision Process (MDP) by incorporating uncertainty in state
observations.

Components of a POMDP:

1. States (S): A set of possible states of the environment.
2. Actions (A): A set of actions available to the agent.
3. Transition Function T(s, a, s'): The probability of transitioning from state s to s' when action a is taken.
4. Rewards R(s, a): The immediate reward received after taking action a in state s.
5. Observations (O): A set of possible observations the agent can receive.
6. Observation Function O(o | s', a): The probability of observing o after taking action a and transitioning to state s'.
7. Discount Factor (γ): A value 0 ≤ γ < 1 that models the importance of future rewards.
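
To make these components concrete, below is a minimal sketch (not part of the original assignment) of how a small POMDP could be written down in Python; the two states, two actions, two observations, and all probabilities are purely illustrative.

# Minimal, illustrative POMDP specification using plain Python dictionaries.
# All names and numbers below are hypothetical examples.
states = ["s1", "s2"]
actions = ["a1", "a2"]
observations = ["o1", "o2"]
gamma = 0.95  # discount factor, 0 <= gamma < 1

# Transition function T(s, a, s'): probability of reaching s' from s under action a.
T = {
    ("s1", "a1"): {"s1": 0.9, "s2": 0.1},
    ("s1", "a2"): {"s1": 0.5, "s2": 0.5},
    ("s2", "a1"): {"s1": 0.2, "s2": 0.8},
    ("s2", "a2"): {"s1": 0.5, "s2": 0.5},
}

# Reward function R(s, a): immediate reward for taking action a in state s.
R = {
    ("s1", "a1"): 1.0, ("s1", "a2"): 0.0,
    ("s2", "a1"): -1.0, ("s2", "a2"): 0.0,
}

# Observation function O(o | s', a): probability of observing o after action a
# has moved the environment to state s'.
O = {
    ("s1", "a1"): {"o1": 0.85, "o2": 0.15},
    ("s1", "a2"): {"o1": 0.5, "o2": 0.5},
    ("s2", "a1"): {"o1": 0.15, "o2": 0.85},
    ("s2", "a2"): {"o1": 0.5, "o2": 0.5},
}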

Key Challenge:

Unlike in an MDP, where the agent knows the exact state, in a POMDP the agent has only partial information about the state and must maintain a belief state, which is a probability distribution over all possible states.

Policy Evaluation in POMDPs


The goal of policy evaluation in POMDPs is to determine the expected utility (value
function) of following a particular policy, given the uncertainties in the state and
observations.

Steps in Policy Evaluation:

1. Belief Update:
o After executing an action a and observing o, the belief state is updated using Bayes' rule:

b'(s') = η O(o | s', a) Σ_{s∈S} T(s, a, s') b(s)

Here, η is a normalization constant ensuring the updated belief is a valid probability distribution.

2. Value Function:
o The value of a belief state b under a policy π is given by:

V^π(b) = E[ Σ_{t=0}^{∞} γ^t R(s_t, π(b_t)) | b_0 = b ]

This computes the expected discounted reward starting from the belief state b.

3. Backup Operation:
o Dynamic programming is often used to compute the value function iteratively. A common approach is the Bellman backup (see the code sketch after these steps):

V(b) = max_a [ Σ_{s∈S} b(s) R(s, a) + γ Σ_{o∈O} P(o | b, a) V(b') ]

Here, P(o | b, a) is the probability of observing o after taking action a under belief b, and b' is the belief obtained by updating b with a and o.
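
To show how the belief update and Bellman backup fit together, here is a minimal sketch in Python. It assumes the dictionary-based POMDP representation (states, actions, observations, T, R, O, gamma) from the earlier sketch; those names are hypothetical and introduced only for illustration.

def belief_update(belief, action, obs):
    # Bayes' rule: b'(s') is proportional to O(o | s', a) * sum_s T(s, a, s') * b(s).
    new_belief = {}
    for s_next in states:
        prior = sum(T[(s, action)][s_next] * belief[s] for s in states)
        new_belief[s_next] = O[(s_next, action)][obs] * prior
    norm = sum(new_belief.values())  # the normalization constant eta is 1 / norm
    if norm == 0.0:
        return belief  # the observation has zero probability under this belief
    return {s: p / norm for s, p in new_belief.items()}

def obs_probability(belief, action, obs):
    # P(o | b, a) = sum_{s'} O(o | s', a) * sum_s T(s, a, s') * b(s)
    return sum(
        O[(s_next, action)][obs]
        * sum(T[(s, action)][s_next] * belief[s] for s in states)
        for s_next in states
    )

def bellman_backup(belief, V):
    # One-step backup: V(b) = max_a [ sum_s b(s) R(s, a) + gamma * sum_o P(o | b, a) V(b') ].
    # V is any callable that maps a belief dictionary to an estimated value.
    best = float("-inf")
    for a in actions:
        immediate = sum(belief[s] * R[(s, a)] for s in states)
        future = sum(
            obs_probability(belief, a, o) * V(belief_update(belief, a, o))
            for o in observations
        )
        best = max(best, immediate + gamma * future)
    return best

For example, bellman_backup({"s1": 0.5, "s2": 0.5}, lambda b: 0.0) performs a single backup from a uniform belief using a zero value estimate.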

Computational Complexity

Policy evaluation in POMDPs is computationally challenging due to:

1. The continuous nature of the belief state space.


2. The combinatorial growth of possible observation sequences.

Solution Techniques:

1. Point-Based Value Iteration (PBVI):
o Approximate the value function using a finite set of belief points.
2. Monte Carlo Sampling:
o Use sampling methods to estimate belief updates and expected rewards (illustrated in the sketch below).
3. Policy Search:
o Directly optimize the policy without explicitly computing the value function.
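
As an illustration of the Monte Carlo idea in this list, the following sketch estimates V^π(b) by sampling a hidden state from the belief and rolling the policy forward. It reuses the hypothetical dictionaries and the belief_update function from the sketches above, and the policy argument is any function mapping a belief to an action.

import random

def sample_from(dist):
    # Draw one outcome from a {outcome: probability} dictionary.
    r, cumulative = random.random(), 0.0
    for outcome, p in dist.items():
        cumulative += p
        if r <= cumulative:
            return outcome
    return outcome  # guard against floating-point round-off

def mc_policy_value(belief, policy, horizon=50, episodes=500):
    # Monte Carlo estimate of V^pi(b): average discounted return over sampled rollouts.
    total = 0.0
    for _ in range(episodes):
        s = sample_from(belief)      # sample the hidden state from the belief
        b = dict(belief)
        ret, discount = 0.0, 1.0
        for _ in range(horizon):
            a = policy(b)            # the policy only sees the belief, not the true state
            ret += discount * R[(s, a)]
            s_next = sample_from(T[(s, a)])
            o = sample_from(O[(s_next, a)])
            b = belief_update(b, a, o)
            s, discount = s_next, discount * gamma
        total += ret
    return total / episodes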

Applications of POMDPs

● Robotics: Navigation with noisy sensors.


● Healthcare: Treatment planning with incomplete patient information.
● Finance: Decision-making under market uncertainty.
● Game AI: Modeling opponents with hidden strategies.

2. Differentiate between on-policy and off-policy methods in reinforcement learning.


A. In Reinforcement Learning (RL), on-policy and off-policy methods differ
based on how the policy used for taking actions (behavior policy) relates to the
policy being improved (target policy).

Key Differences:

● Behavior vs. Target Policy:
o On-Policy RL: The behavior policy is the same as the target policy.
o Off-Policy RL: The behavior policy is different from the target policy.
● Exploration vs. Exploitation:
o On-Policy RL: Balances exploration and exploitation directly within the same policy.
o Off-Policy RL: Can explore using a different behavior policy while improving the target policy.
● Examples of Algorithms:
o On-Policy RL: SARSA, Proximal Policy Optimization (PPO), Actor-Critic.
o Off-Policy RL: Q-Learning, Deep Q-Network (DQN), Deep Deterministic Policy Gradient (DDPG).
● Data Efficiency:
o On-Policy RL: Less data efficient because it learns only from data collected by the current policy.
o Off-Policy RL: More data efficient because it can learn from data collected by other policies.
● Stability and Convergence:
o On-Policy RL: More stable and consistent learning since it follows a single policy.
o Off-Policy RL: May face instability due to off-policy corrections or divergence issues.
● Policy Learning:
o On-Policy RL: Directly improves the policy it uses to make decisions.
o Off-Policy RL: Learns a value function or policy based on an arbitrary behavior policy.
● Real-World Suitability:
o On-Policy RL: Well suited for online learning, where the current policy interacts with the environment.
o Off-Policy RL: Ideal for offline learning, where pre-collected data or experience from other agents can be reused.

Detailed Explanation:

On-Policy RL

● How It Works:
o The agent learns and improves the policy it uses to interact with the
environment.
o Data comes directly from the current policy.
● Example:
o SARSA: Updates the Q-value based on the action chosen by the same policy (a minimal code sketch follows this list):

Q(s, a) ← Q(s, a) + α [ r + γ Q(s', a') − Q(s, a) ]

o Here, a' is selected using the same policy that selected a.


● Strengths:
o Simple and interpretable.
o Naturally incorporates exploration strategies like ε-greedy.
● Weaknesses:
o Inefficient in reusing data from different policies.
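
The SARSA example above can be written as a short sketch. The tabular Q is a dictionary with a default value of zero, and env is an assumed environment object exposing reset() -> state and step(action) -> (next_state, reward, done); these names are illustrative, not from the document.

import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    # Behavior policy = target policy: random action with probability epsilon, else greedy.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa_episode(env, Q, actions, alpha=0.1, gamma=0.99, epsilon=0.1):
    # One on-policy SARSA episode over the assumed env interface.
    state = env.reset()
    action = epsilon_greedy(Q, state, actions, epsilon)
    done = False
    while not done:
        next_state, reward, done = env.step(action)
        # The next action is chosen by the same policy that is being updated.
        next_action = epsilon_greedy(Q, next_state, actions, epsilon)
        Q[(state, action)] += alpha * (
            reward + gamma * Q[(next_state, next_action)] * (not done) - Q[(state, action)]
        )
        state, action = next_state, next_action
    return Q

Q = defaultdict(float)  # unseen state-action pairs default to 0.0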

Off-Policy RL

● How It Works:
o The agent learns a policy (target policy) while using a different policy
(behavior policy) to collect data.
o Often uses importance sampling or other correction techniques to
reconcile the difference between the two policies.
● Example:
o Q-Learning: Updates the Q-value using the maximum possible future reward, independent of the behavior policy (a minimal code sketch follows this list):

Q(s, a) ← Q(s, a) + α [ r + γ max_{a'} Q(s', a') − Q(s, a) ]

o Here, a' is chosen greedily, not necessarily by the behavior policy.
● Strengths:
o Reuses past experiences, making it more data-efficient.
o Handles scenarios with pre-collected datasets (offline RL).
● Weaknesses:
o Can be unstable due to discrepancies between the target and behavior
policies.
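
For contrast, here is a matching sketch of the off-policy Q-Learning update from the example above; it reuses the epsilon_greedy helper, the defaultdict Q-table, and the assumed env interface from the SARSA sketch. The update bootstraps with the greedy maximum over actions, regardless of which action the behavior policy actually executes.

def q_learning_episode(env, Q, actions, alpha=0.1, gamma=0.99, epsilon=0.1):
    # One off-policy Q-Learning episode: behavior policy is epsilon-greedy,
    # target policy is greedy (through the max in the update).
    state = env.reset()
    done = False
    while not done:
        action = epsilon_greedy(Q, state, actions, epsilon)  # behavior policy
        next_state, reward, done = env.step(action)
        best_next = max(Q[(next_state, a)] for a in actions)  # greedy target
        Q[(state, action)] += alpha * (
            reward + gamma * best_next * (not done) - Q[(state, action)]
        )
        state = next_state
    return Q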

Choosing Between On-Policy and Off-Policy

● Use On-Policy:
o When interaction with the environment is inexpensive.
o For tasks where policy stability and convergence are critical.
o Examples: Real-time decision-making in games or simulations.
● Use Off-Policy:
o When you want to leverage pre-collected experience or data.
o In situations where exploration is risky or costly (e.g., robotics,
healthcare).
o Examples: Offline learning and batch reinforcement learning.
