RL Assignment 1
Name: N. Sivasankar
Roll No: 23691f00f9
Subject: Reinforcement Learning
Section: MCA-C
Components of a POMDP:
A POMDP is defined by the tuple (S, A, T, R, O, γ):
● S: a finite set of states.
● A: a finite set of actions.
● T(s, a, s′): the probability of moving to state s′ after taking action a in state s.
● R(s, a): the immediate reward for taking action a in state s.
● O: a set of observations, together with the observation function O(o ∣ s′, a), the probability of observing o after action a lands the agent in state s′.
● γ: the discount factor, 0 ≤ γ < 1.
Key Challenge:
Unlike in an MDP, where the agent knows the exact state, in a POMDP the agent
has only partial information about the state and must maintain a belief state, which
is a probability distribution over all possible states.
1. Belief Update:
o After executing an action a and receiving an observation o, the belief state is
updated using Bayes' rule (a code sketch follows this list):
b′(s′) = η O(o ∣ s′, a) ∑_{s∈S} T(s, a, s′) b(s)
o Here η is a normalizing constant chosen so that b′ sums to one; it equals 1 / P(o ∣ b, a).
2. Value Function:
o The value of a belief state b under a policy π is given by:
V^π(b) = E[ ∑_{t=0}^{∞} γ^t R(s_t, π(b_t)) ∣ b_0 = b ]
o This is the expected discounted reward obtained when starting from the belief
state b and following π.
3. Backup Operation:
o Dynamic programming is often used to compute the value function
iteratively. A common approach is the Bellman backup over belief states (also shown in the sketch after this list):
V(b) = max_a [ ∑_{s∈S} b(s) R(s, a) + γ ∑_{o∈O} P(o ∣ b, a) V(b′) ]
o Here b′ is the belief obtained from b via the update rule above after taking action a and observing o.
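To make the belief update and the Bellman backup concrete, here is a minimal Python sketch. The transition matrix T, observation matrix Z, reward matrix R, and discount factor are made-up numbers for a tiny hypothetical two-state problem, chosen only for illustration; the value function passed to the backup is a placeholder.

```python
import numpy as np

# Tiny hypothetical POMDP with 2 states, 2 actions, 2 observations.
# T[a, s, s'] = T(s, a, s'),  Z[a, s', o] = O(o | s', a),  R[s, a] = R(s, a)
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.4, 0.6]]])
Z = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.6, 0.4], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.95

def belief_update(b, a, o):
    """Bayes' rule: b'(s') = eta * O(o | s', a) * sum_s T(s, a, s') b(s)."""
    predicted = b @ T[a]                   # sum_s T(s, a, s') b(s), for each s'
    unnormalized = Z[a, :, o] * predicted  # multiply by O(o | s', a)
    p_o = unnormalized.sum()               # P(o | b, a); eta = 1 / p_o
    return unnormalized / p_o, p_o

def bellman_backup(b, V):
    """V(b) = max_a [ sum_s b(s) R(s, a) + gamma * sum_o P(o | b, a) V(b') ].
    V is any function mapping a belief vector to a value estimate."""
    best = -np.inf
    for a in range(T.shape[0]):
        immediate = b @ R[:, a]            # expected immediate reward under b
        future = 0.0
        for o in range(Z.shape[2]):
            b_next, p_o = belief_update(b, a, o)
            if p_o > 0:
                future += p_o * V(b_next)
        best = max(best, immediate + gamma * future)
    return best

b0 = np.array([0.5, 0.5])
print("updated belief:", belief_update(b0, a=0, o=1)[0])
print("one-step backup value:", bellman_backup(b0, V=lambda b: 0.0))
```

Repeatedly applying this backup over a set of belief points, with a proper value-function representation in place of the placeholder, is the basic idea behind value-iteration style POMDP solvers.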
Computational Complexity
Solution Techniques:
Applications of POMDPs
Key Differences:
The core difference is which policy generates the training data: on-policy methods evaluate and improve the same policy they use to act, while off-policy methods learn about a target policy from data generated by a separate behavior policy.
Detailed Explanation:
On-Policy RL
● How It Works:
o The agent learns and improves the policy it uses to interact with the
environment.
o Data comes directly from the current policy.
● Example:
o SARSA: Updates the Q-value using the action a′ actually selected in s′ by the same
policy (a code sketch follows below):
Q(s, a) ← Q(s, a) + α [ r + γ Q(s′, a′) − Q(s, a) ]
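The update above can be written in a few lines of Python. The following is a minimal tabular sketch, not taken from the assignment: it assumes a hypothetical environment object with Gymnasium-style reset() and step() methods and small discrete state and action spaces.

```python
import numpy as np

def sarsa(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular SARSA: the epsilon-greedy policy that collects the data is the
    same policy whose Q-values are being updated (on-policy)."""
    Q = np.zeros((env.observation_space.n, env.action_space.n))

    def epsilon_greedy(s):
        # behavior = target policy: epsilon-greedy over the current Q table
        if np.random.rand() < epsilon:
            return env.action_space.sample()
        return int(np.argmax(Q[s]))

    for _ in range(episodes):
        s, _ = env.reset()
        a = epsilon_greedy(s)
        done = False
        while not done:
            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            a_next = epsilon_greedy(s_next)            # a' chosen by the same policy
            target = r + gamma * Q[s_next, a_next] * (not terminated)
            Q[s, a] += alpha * (target - Q[s, a])      # SARSA update
            s, a = s_next, a_next
    return Q
```

Because the bootstrap action a_next is produced by the same epsilon-greedy policy that collects the data, the algorithm is on-policy.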
Off-Policy RL
● How It Works:
o The agent learns a policy (target policy) while using a different policy
(behavior policy) to collect data.
o Often uses importance sampling or other correction techniques to
reconcile the difference between the two policies.
● Example:
o Q-Learning: Updates the Q-value using the maximum estimated future value,
independent of the behavior policy (a code sketch follows this list):
Q(s, a) ← Q(s, a) + α [ r + γ max_{a′} Q(s′, a′) − Q(s, a) ]
o Here a′ is chosen greedily inside the update, not necessarily by the behavior policy.
● Strengths:
o Reuses past experiences, making it more data-efficient.
o Handles scenarios with pre-collected datasets (offline RL).
● Weaknesses:
o Can be unstable due to discrepancies between the target and behavior
policies.
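For comparison, here is a minimal tabular Q-learning sketch under the same assumptions as the SARSA example (a hypothetical Gymnasium-style environment with small discrete spaces); only the update target differs.

```python
import numpy as np

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning: data comes from an epsilon-greedy behavior policy,
    but the bootstrap target uses the max over a', i.e. the greedy target
    policy (off-policy)."""
    Q = np.zeros((env.observation_space.n, env.action_space.n))

    for _ in range(episodes):
        s, _ = env.reset()
        done = False
        while not done:
            # behavior policy: epsilon-greedy over the current Q table
            if np.random.rand() < epsilon:
                a = env.action_space.sample()
            else:
                a = int(np.argmax(Q[s]))

            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated

            # target policy: greedy max over a', regardless of the action the
            # behavior policy will actually take in s_next
            target = r + gamma * np.max(Q[s_next]) * (not terminated)
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```

Note that this one-step target already encodes the greedy target policy, so no importance-sampling correction is needed here; importance weights of the form π(a ∣ s) / b(a ∣ s) become relevant when off-policy methods use multi-step returns.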
When to Use Each:
● Use On-Policy:
o When interaction with the environment is inexpensive.
o For tasks where policy stability and convergence are critical.
o Examples: Real-time decision-making in games or simulations.
● Use Off-Policy:
o When you want to leverage pre-collected experience or data.
o In situations where exploration is risky or costly (e.g., robotics,
healthcare).
o Examples: Offline learning, batch reinforcement learning.