RL Sem Ans
1. Apply the PAC learning framework to design a binary classifier for a given dataset
and determine the minimum number of training examples required for a specific level
of confidence and accuracy.
• Goal: to ensure that the learned model performs well (accuracy) with high probability
(confidence).
Parameters:
• ε (epsilon): The maximum error we can tolerate. For example, ε = 0.1 means we
allow 10% error.
• δ (delta): The maximum allowed probability of failure. For example, δ = 0.05 means
we want 95% confidence.
• The hypothesis space H is the set of all classifiers your algorithm can choose from.
• For example, if you use linear classifiers in 2D, H could be the set of all lines.
4. Example Calculation
Assume:
• Hypothesis space size |H| = 1000 (you have 1000 possible classifiers)
• ε = 0.1
• δ = 0.05
Plug into the PAC sample-complexity bound m ≥ (1/ε)(ln|H| + ln(1/δ)):
m ≥ (1/0.1) × (ln 1000 + ln 20) ≈ 10 × (6.91 + 3.00) ≈ 99
So, you need roughly 99–100 training examples (rounding up) to guarantee that your classifier
has at most 10% error with 95% confidence.
Step 3: Use the PAC formula to calculate how many training examples (m) are needed.
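As a quick illustration, here is a minimal Python check of this bound (assuming the finite-hypothesis PAC formula used above):

import math

def pac_sample_size(h_size, epsilon, delta):
    # m >= (1/epsilon) * (ln|H| + ln(1/delta)) for a finite hypothesis space
    m = (1.0 / epsilon) * (math.log(h_size) + math.log(1.0 / delta))
    return math.ceil(m)

print(pac_sample_size(1000, 0.1, 0.05))  # prints 100 (about 99.04 before rounding up)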
2. Given a multi-armed bandit scenario with five arms and their respective reward
distributions, apply the Upper Confidence Bound (UCB) algorithm to select the best arm
for maximizing cumulative rewards.
3. Explain the Epsilon-Greedy Algorithm for action selection and analyse mathematically
the exploration/exploitation trade-off.
4. a. Compare and contrast the strategies used by the Upper Confidence Bound (UCB)
algorithm and other bandit algorithms for balancing exploration and exploitation.
b. Describe the role of sample complexity (m) in the PAC learning framework and how
it affects the learning process.
(a)
(b)
5. Evaluate the effectiveness of the Upper Confidence Bound (UCB) algorithm in real-
world scenarios with non-stationary reward distributions, discussing its strengths and
limitations.
Evaluating the effectiveness of the Upper Confidence Bound (UCB) algorithm in real-world
scenarios with non-stationary reward distributions requires a deep look into both the
principles of UCB and the nature of non-stationary environments.
In a non-stationary setting:
• An arm that used to give high rewards might start giving lower rewards later, or vice
versa.
Strengths of UCB
• It performs very well when reward distributions are fixed and stable.
Limitations of UCB
2. Inflexibility to change
• UCB might keep recommending an article that was popular in the morning, even though
interest has now shifted.
To make UCB more effective in non-stationary settings, researchers use modified versions,
such as:
1. Sliding-Window UCB
o Uses only the most recent rewards (e.g., last 100 interactions).
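A minimal sketch of how Sliding-Window UCB could be implemented in Python; the window size, exploration constant c, and class interface are assumptions made for illustration:

import math
from collections import deque

class SlidingWindowUCB:
    def __init__(self, n_arms, window_size=100, c=2.0):
        # keep only the most recent rewards per arm; older rewards fall out automatically
        self.windows = [deque(maxlen=window_size) for _ in range(n_arms)]
        self.c = c
        self.t = 0

    def select_arm(self):
        self.t += 1
        best_arm, best_score = 0, float("-inf")
        for arm, w in enumerate(self.windows):
            if not w:
                return arm                      # play every arm at least once
            mean = sum(w) / len(w)              # empirical mean over the recent window only
            bonus = math.sqrt(self.c * math.log(self.t) / len(w))
            if mean + bonus > best_score:
                best_arm, best_score = arm, mean + bonus
        return best_arm

    def update(self, arm, reward):
        self.windows[arm].append(reward)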
(a)
1. The Setup
Imagine a gambler at a casino with multiple slot machines (called "arms"), each
giving a different and unknown reward distribution. The gambler's goal is to play
the machines in a way that maximizes the total reward over time.
• The gambler can try new arms (exploration) or stick with the best-known one
(exploitation).
This dilemma of choosing between exploring new options and exploiting known ones
is what we call the exploration-exploitation trade-off.
- Exploration: trying arms that have not been sampled much, to gather information about their rewards.
- Exploitation: repeatedly playing the arm with the highest observed average reward so far.
The Challenge:
Too much exploration leads to wasted effort on suboptimal arms.
Too much exploitation risks missing out on better options not tried enough.
The UCB algorithm balances exploration and exploitation by assigning each arm a score that
combines its average observed reward (the exploitation term) with an uncertainty bonus that
is large for arms tried only a few times (the exploration term).
UCB Formula:
UCB_i(t) = x̄_i + √(2 ln t / n_i)
where x̄_i is the average reward of arm i so far, n_i is the number of times arm i has been
pulled, and t is the total number of pulls.
What This Does:
• Arms with high average rewards and few samples get high UCB scores.
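A short illustrative implementation of this score in Python (the exploration constant c and the bookkeeping lists are assumptions for the sketch):

import math

def ucb1_select(avg_rewards, counts, t, c=2.0):
    # avg_rewards[i]: mean reward of arm i so far; counts[i]: number of pulls; t: total pulls
    for arm, n in enumerate(counts):
        if n == 0:
            return arm                          # try every arm once before using the formula
    scores = [avg_rewards[i] + math.sqrt(c * math.log(t) / counts[i])
              for i in range(len(counts))]
    return scores.index(max(scores))            # the highest upper confidence bound wins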
Bandit algorithms are traditionally designed to work in environments where rewards are
immediately observable after an action is taken. However, in many real-world
applications — such as online advertising, recommendation systems, or clinical trials —
the feedback is delayed or indirect, making standard bandit approaches ineffective
without adaptation.
To handle such scenarios, bandit algorithms can be modified in the following ways:
• Update reward estimates only when feedback arrives, not immediately after an
action.
• Ensure that actions are not repeatedly penalized just because the result has not been
observed yet.
Approach:
• Instead of updating the belief after each action, the algorithm estimates when
feedback is likely to arrive and updates only at those points.
• Track how earlier actions correlate with later results (e.g., user retention after seeing
recommendations).
o Regression
In some cases, the algorithm doesn't observe the full reward even after a delay. Partial
monitoring bandits address this by:
• Use sliding windows to average rewards only over actions with observed outcomes.
• Treat the problem as a reinforcement learning (RL) task instead of a pure bandit
problem.
• Bandit algorithms can serve as a component (e.g., for action selection in policy
exploration).
The Epsilon-Greedy algorithm is a simple and widely used strategy in reinforcement learning
and multi-armed bandit problems to balance exploration (trying new actions) and exploitation
(choosing the best-known action).
Key Components
o With probability ε, the agent explores (chooses an action at random).
o With probability 1 − ε, the agent exploits (chooses the action with the highest
estimated reward).
• Over time, the algorithm starts favoring the action with the highest Q-value (i.e.,
exploitation).
• You can also use epsilon decay, where ε decreases over time to shift from exploration
to exploitation.
Let’s define: Q(a) as the current estimated value of action a, R as the reward just received
after taking action a, and α as the learning rate (step size).
Update Rule:
Q(a) ← Q(a) + α · (R − Q(a))
• The Q-value moves toward the new reward R with step size α.
Over time, Q-values converge to expected rewards, helping the agent make better decisions.
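A minimal ε-greedy step in Python using the update rule above (the get_reward function stands in for the environment and is an assumption of the sketch):

import random

def epsilon_greedy_step(Q, epsilon, alpha, get_reward):
    # Q: list of estimated action values
    if random.random() < epsilon:
        a = random.randrange(len(Q))            # explore: pick a random action
    else:
        a = Q.index(max(Q))                     # exploit: pick the best-known action
    r = get_reward(a)                           # observe the reward for the chosen action
    Q[a] = Q[a] + alpha * (r - Q[a])            # incremental update toward the new reward
    return a, r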
Maximize cumulative reward over time by optimally balancing exploration and exploitation
in environments where only limited feedback is available.
Key Goals:
3. Sample-Efficient Learning
These algorithms aim to learn the best actions using as few trials as possible, which
is crucial when data collection is expensive or time-sensitive.
4. Foundational Component of RL
The bandit setting is a simplified form of reinforcement learning, with:
o Immediate feedback
It serves as the building block for more complex RL problems like Markov
Decision Processes (MDPs), where actions influence future states.
10. Discuss a real-world application where bandit algorithms have been successfully
used, and explain the benefits of employing such algorithms in that context.
Real-World Application of Bandit Algorithms: Online Advertising (Ad Placement and
Recommendation Systems)
Scenario
Online platforms like Google, Facebook, YouTube, and Amazon often face a crucial
problem:
“Which advertisement or content should we show to a user to maximize engagement (like
clicks or purchases)?”
This is where multi-armed bandit algorithms come into play.
Each ad, video recommendation, or product listing is treated as an arm of the bandit. When a
user visits the platform:
Example:
2. Real-Time Adaptation
• User preferences can change quickly.
• Unlike traditional A/B testing that needs long-term trials, bandits can make decisions
after just a few interactions.
Benefits of Using Bandit Algorithms in Online Advertising
• Higher Efficiency: maximizes returns (clicks, purchases) with fewer trial samples.
11. Given a set of k arms with different reward distributions, apply the Median
Elimination Algorithm to identify the optimal arm based on the provided sample mean
12. Assess the efficiency of the Median Elimination algorithm compared to other
advanced Bandit algorithms for bandit problems with a large number of arms.
13. Evaluate the potential real-time applications of the Policy Gradient algorithm in
various domains, and discuss the challenges it may face in certain scenarios.
Challenges Faced by Policy Gradient Algorithms
1. High Variance in Gradient Estimates
• Problem: The estimates of the gradient (how to update the policy) can have very
high variance, making the learning process unstable and slow.
• Reason: Since PG relies on sampled trajectories, small changes in actions can lead to
large changes in cumulative rewards, especially in long-horizon tasks.
2. Sample Inefficiency
• Reason: The policy is updated incrementally using stochastic gradients, which may
not make full use of all the collected data.
3. Convergence to Local Optima
• Reason: The optimization landscape is complex, and without proper exploration, the
algorithm may get stuck in poor solutions.
4. Exploration Challenges
• Problem: If the initial policy is poor, the algorithm might not explore promising
areas of the action space.
• Reason: Policy gradient relies on the policy's own exploration, which may not cover
all useful state-action pairs.
5. Credit Assignment Problem
• Problem: Determining which actions led to success or failure over long episodes
can be difficult.
• Reason: In delayed reward settings, it's hard to assign reward contributions accurately
to earlier actions.
6. Sensitive to Hyperparameters
• Problem: Learning rate, entropy regularization, and reward discount factor must be
carefully tuned.
7. Weaker in Discrete Action Spaces
• Problem: While PG is often used for continuous actions, in discrete action spaces,
value-based methods like DQN may perform better.
• Reason: Discrete policies can be hard to optimize due to abrupt changes in action
probabilities.
1. Objective
To assess how effectively and accurately the Median Elimination algorithm identifies the
optimal arm in a multi-armed bandit setting with varying reward distributions.
2. Setup
a. Environment
• Let’s define:
o k = 10 arms
Arm    True Mean Reward
A1     0.10
A2     0.25
A3     0.35
A4     0.30
A5     0.50
A6     0.45
A7     0.20
A8     0.40
A9     0.15
b. Parameters
o Eliminate arms whose empirical means are less than the median.
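As an illustration of that elimination step, here is a minimal sketch of one Median Elimination round in Python (the pull(arm) reward function and samples_per_arm are assumptions):

import statistics

def median_elimination_round(arms, pull, samples_per_arm):
    # Estimate each surviving arm's empirical mean, then drop the arms below the median.
    means = {}
    for arm in arms:
        rewards = [pull(arm) for _ in range(samples_per_arm)]
        means[arm] = sum(rewards) / samples_per_arm
    median = statistics.median(means.values())
    return [arm for arm in arms if means[arm] >= median]   # keep arms at or above the median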
4. Performance Metrics
Evaluate performance using:
• Count how many times the algorithm selects the true best arm (A10).
• Check whether the selected arm’s mean is within ε (0.1) of the best arm’s mean with ≥
95% confidence.
P(μ_selected ≥ μ_optimal − ε) ≥ 0.95
6. Visualizations
7. Expected Observations
• MEA should select the optimal arm or one close to it (within ε) ≥ 95% of the time.
15. Create a new variant of the Policy Gradient algorithm that incorporates a baseline
technique to reduce variance in the policy gradient estimates.
16. How does the Policy Gradient algorithm handle continuous action spaces in bandit
problems? What are some advantages of using policy gradient methods in such
scenarios?
Advantages of Using Policy Gradient Methods in Continuous Action Bandit Problems:
1. Direct Optimization of the Policy:
Policy gradient methods optimize the policy directly without needing to discretize the
action space.
17. a. Analyze how the Policy Gradient algorithm can be adapted to handle continuous
action spaces in bandit problems. (Refer Q.16)
b. Compare and contrast the Median Elimination algorithm and the Policy Gradient
Algorithm in terms of their strengths and weaknesses when applied to bandit problems.
(Refer Q.12)
• Exploitation: Selecting the action that currently seems best (i.e., has the highest
estimated reward) to maximize immediate gains.
Example Scenario:
Imagine a gambler at a casino facing several slot machines (arms). Each machine has an
unknown reward probability. The gambler needs to decide whether to keep playing the machine
that has paid best so far (exploitation) or try other machines in case one of them pays better
(exploration).
1. Stochastic Policy:
Because the policy is a probability distribution over actions, even suboptimal actions have a
non-zero chance of being selected, enabling exploration.
2. Gradient-Based Optimization:
The algorithm adjusts policy parameters θ in the direction that improves the expected
reward, but since the policy is probabilistic, it maintains some degree of exploration
throughout training.
• Over time: the policy becomes more confident in high-reward actions (leading to
more exploitation).
Techniques like advantage functions and learned baselines help guide the policy towards
better actions while reducing the noise, enabling more informed exploration.
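A minimal NumPy sketch of a policy-gradient update with a running-average baseline for a bandit setting; the softmax parameterization, learning rates, and pull(arm) function are assumptions made for illustration:

import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

def reinforce_with_baseline(pull, n_arms, steps=1000, lr=0.1, baseline_lr=0.05):
    theta = np.zeros(n_arms)                 # policy parameters (one preference per arm)
    baseline = 0.0                           # running-average reward used as the baseline
    for _ in range(steps):
        probs = softmax(theta)
        a = np.random.choice(n_arms, p=probs)
        r = pull(a)                          # observed reward for the chosen arm
        advantage = r - baseline             # subtracting the baseline reduces variance
        grad_log_pi = -probs                 # gradient of log softmax ...
        grad_log_pi[a] += 1.0                # ... is one-hot(a) - probs
        theta += lr * advantage * grad_log_pi
        baseline += baseline_lr * (r - baseline)
    return softmax(theta)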
To adapt the Median Elimination algorithm for a non-stationary bandit environment (where
reward distributions change over time), we must modify the algorithm to respond to changes
dynamically, since the original Median Elimination is designed for stationary settings with
fixed distributions.
Instead of using all past rewards to estimate the mean for each arm (which assumes
stationarity), use a fixed-size sliding window that only considers the most recent
observations.
This ensures the algorithm reflects recent trends in reward changes.
• How it helps: It forgets outdated information and adapts to new reward distributions.
• Window size tuning: Choose based on how quickly the environment changes.
2. Apply Weighted Reward Averaging (Exponential Decay)
Use exponentially weighted moving averages (EWMA) to estimate the mean reward of
each arm:
μ̂_t = (1 − α) · μ̂_{t−1} + α · r_t
Where:
• μ̂_t: updated mean estimate at time t
• r_t: reward observed at time t
• α ∈ (0, 1]: decay rate controlling how quickly old observations are forgotten
This avoids sticking to a suboptimal arm that became worse over time.
3. Add Change Detection
• Monitor for significant shifts in reward distributions (e.g., sudden drops or spikes), for
example using the Page-Hinkley test.
• Use sliding-window reward estimates in the confidence intervals used for elimination.
• Adjust the elimination threshold dynamically based on reward variance in the recent
window.
By making these changes, the Median Elimination algorithm becomes capable of adapting to
non-stationary environments, which are common in real-world applications like online
recommendation systems, dynamic pricing, or financial trading.
20. a. Recall the key steps involved in the Median Elimination algorithm for bandit
problems.
b. Outline the two specific advanced bandit algorithms used to solve multi-armed
bandit problems.
(b)
1. Upper Confidence Bound (UCB)
Idea:
Choose the arm that has the highest potential for reward by balancing its average observed
reward (exploitation) against the uncertainty of that estimate (exploration).
How it works:
2. Thompson Sampling
Idea:
Choose arms based on probability of being the best, using Bayesian reasoning.
How it works:
• For each arm, maintain a probability distribution over its possible reward.
22. Design a simple MDP for a robotic agent navigating through a grid-based environment
with rewards and penalties.
23. Assess the strengths and weaknesses of using deep neural networks as function
approximators in RL algorithms.
Strengths
2. Scalability to High-Dimensional Inputs: suitable for handling complex inputs like images,
videos, or sensor data where traditional tabular methods fail.
5. Powerful Representation Learning: DNNs learn abstract features from raw data, which can
capture underlying patterns in the environment more effectively than linear methods.
Weaknesses
24. Critique the effectiveness of the reward function in shaping the behaviour of an RL
agent in a complex environment.
• The reward function directly defines the objective for the agent.
Aspect: Impact
• Clear Guidance: helps the agent understand the goal, leading to faster and more stable
learning.
• Encourages Desired Behavior: promotes strategies that align with the designer's intended
outcomes.
• Deceptive Rewards: misleading intermediate rewards can cause the agent to learn suboptimal
or unintended behaviors.
• Overfitting to Rewards: the agent may perform well in training but fail to generalize if it
learns to over-optimize for a poorly designed reward signal.
• Difficult to Design: in complex environments, it's often hard to design a reward function
that captures all desirable outcomes.
3. Imitation or Inverse RL: Learn rewards from expert behavior instead of hand-
designing them.
4. Human Feedback: Incorporate feedback from human preferences to adjust reward
functions.
25. Design an RL framework for a real-world problem of your choice, specifying the
state space, action space, and reward function. (Refer Q.21)
26. Devise a novel algorithm that combines elements of both model-based and model-free
RL approaches.
To leverage the sample efficiency and planning ability of model-based RL, while
preserving the stability and asymptotic performance of model-free methods like policy
gradients or Q-learning.
Component: Description
• Actor-Critic Framework: uses a policy network (actor) and a value function (critic) to guide
learning.
• Real + Simulated Rollouts: combines real experiences from the environment with imagined
experiences from the model.
Algorithm Overview
• Train a neural network model f(s, a) → (s′, r) on the collected data, so that it predicts the
next state s′ and the reward r.
• Periodically retrain the model with new data and continue policy learning (see the sketch
below).
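A rough sketch of how this loop could look in Python; the env, model, and actor_critic interfaces (and the mixing weight) are hypothetical and only illustrate one possible reading of the idea above:

def hybrid_training_loop(env, model, actor_critic, real_steps=1000,
                         imagined_rollouts=10, rollout_len=5, mix_alpha=0.5):
    replay = []
    state = env.reset()
    for _ in range(real_steps):
        action = actor_critic.act(state)
        next_state, reward, done, _ = env.step(action)
        transition = (state, action, reward, next_state, done)
        replay.append(transition)

        # 1) Model-free (actor-critic) update from the real transition
        actor_critic.update([transition], weight=mix_alpha)

        # 2) Model-based updates from short imagined rollouts
        model.fit(replay)                                  # learn f(s, a) -> (s', r)
        for _ in range(imagined_rollouts):
            s = model.sample_start(replay)
            imagined = []
            for _ in range(rollout_len):
                a = actor_critic.act(s)
                s_next, r = model.predict(s, a)            # imagined transition
                imagined.append((s, a, r, s_next, False))
                s = s_next
            actor_critic.update(imagined, weight=1 - mix_alpha)

        state = env.reset() if done else next_state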
Advantages of IMAAC
Strength: Benefit
• Reduced Variance in Updates: synthetic data complements real data and stabilizes learning.
Challenge: Solution
• Inaccurate environment model: use short simulated rollouts and train the model frequently
with diverse data.
• Overfitting to simulated data: balance real vs. simulated data with a mixing coefficient
(e.g., α ∈ [0, 1]).
27. Given a scenario, analyze the impact of changing the discount factor (γ) on the
agent's decision-making process.
• Definition:
The discount factor γ ∈ [0, 1] determines the weight assigned to future rewards in
the return calculation.
• Return formula:
G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ⋯
where R_{t+k} are the future rewards and γ determines how strongly they are discounted.
Scenario:
A robot in a grid world needs to reach a goal located far from the starting point. It receives:
• +10 reward for reaching the goal
• -1 reward for each step taken
• With a high discount factor (γ close to 1), the agent learns optimal paths but may take
longer to converge due to its high reliance on future predictions.
• With a moderate discount factor (e.g., γ ≈ 0.9), the agent learns a practical path to the goal
with a good balance between short-term cost and long-term gain.
Use-Case Suitability: a low γ favours real-time, reactive tasks, while a high γ favours
strategic, planning-based tasks.
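A small numeric illustration of how γ changes the return for the grid-world rewards above (the 10-step episode length is an assumption):

def discounted_return(rewards, gamma):
    # G = r_1 + gamma*r_2 + gamma^2*r_3 + ...
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

episode = [-1] * 9 + [10]                      # nine -1 steps, then +10 at the goal
for gamma in (0.5, 0.9, 0.99):
    print(gamma, round(discounted_return(episode, gamma), 2))
# With a small gamma the distant +10 contributes almost nothing;
# with gamma near 1 it dominates the return.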
28. Design a simple MDP for a robotic agent navigating through a grid-based environment
with rewards and penalties. (Refer Q.22)
29. Describe the role of the reward function in RL and its importance in shaping agent
behaviour.
The reward function is one of the most critical components in Reinforcement Learning (RL).
It defines the goal of the agent by providing feedback from the environment about how good
or bad its actions are. The reward function essentially guides the agent's learning process and
shapes its behavior over time.
• Behavior Induction: the agent tends to repeat actions that yield high rewards and avoid
those that result in penalties.
• Goal Alignment: it ensures the agent's actions are aligned with the task goals. For example,
a robot should reach a destination while avoiding obstacles.
• Trade-offs and Prioritization: when multiple objectives exist, the reward function helps
balance them by assigning different weights or penalties.
Example: Self-Driving Car
• Positive reward: staying in the lane, obeying traffic signals, reaching the destination
quickly.
• Negative reward: collisions, sudden braking, veering off the road.
The car learns to drive safely and efficiently only because these outcomes are explicitly
defined in the reward function.
30. a. Compare and contrast value iteration and policy iteration methods for solving
MDPs in RL.
• Convergence Speed: value iteration may take more iterations, but each iteration is
computationally cheaper; policy iteration needs fewer iterations, but each policy evaluation
step can be computationally heavy.
• Initialization: value iteration starts with an arbitrary value function; policy iteration
starts with an arbitrary policy.
Comparison of RL with supervised and unsupervised learning:
• Feedback Type: RL receives delayed rewards from the environment; supervised learning
receives immediate, labeled feedback; unsupervised learning receives no explicit feedback.
• Interaction with Environment: RL actively interacts with the environment for learning;
supervised learning learns passively from pre-existing data; unsupervised learning passively
explores patterns in the data.
• Data Requirement: RL generates data through experience; supervised learning needs large
labeled datasets; unsupervised learning works with raw, unlabeled data.
• Example Applications: RL is used for game AI, robotics, and self-driving cars; supervised
learning for email spam detection and image recognition; unsupervised learning for customer
segmentation and anomaly detection.
UNIT – 4
31. Assess the effectiveness of Dynamic Programming methods for solving large-scale
RL problems compared to other approaches, such as Monte Carlo methods.
Assessing the Effectiveness of Dynamic Programming (DP) Methods for Solving Large-
Scale RL Problems
Dynamic Programming (DP) methods are classical approaches for solving Reinforcement
Learning (RL) problems, particularly Markov Decision Processes (MDPs). They are highly
structured, mathematical techniques that break down a complex problem into simpler
subproblems, which can then be solved efficiently. However, when it comes to solving large-
scale RL problems, DP methods face certain challenges compared to other approaches, such
as Monte Carlo (MC) methods. Below is a detailed comparison of DP methods and Monte
Carlo methods for large-scale RL problems.
1. Value Iteration: Iteratively updates the value function for each state until it converges
to the optimal value function.
2. Policy Iteration: Alternates between policy evaluation (calculating the value function
for the current policy) and policy improvement (updating the policy based on the
value function).
Strengths of DP Methods:
2. Efficiency in Known Environments: If the full transition model is known (i.e., the
environment is fully observable), DP methods can be highly efficient in determining
the optimal policy.
Weaknesses of DP Methods:
2. Need for a Complete Model: DP methods require a full model of the environment,
which may not always be available or feasible to compute, especially in real-world
problems.
3. Scalability Issues: DP is not well-suited for environments with large state spaces or
continuous spaces, as storing the entire state-value function and updating it becomes
impractical.
2. Scalability: Since MC methods only require sample-based estimates, they are more
scalable to large or continuous state spaces compared to DP methods.
4. Simplicity: MC methods are easier to implement because they do not involve solving
a system of equations as in DP methods. They focus on sampling and averaging the
rewards over multiple episodes.
1. Slow Convergence: Monte Carlo methods may require a large number of samples to
converge to an accurate estimate of the value function, especially in high-variance
environments.
2. High Variance: Since MC methods rely on sampling, they can suffer from high
variance in their estimates, leading to inefficient learning.
• Use in Real-World Problems: DP is limited due to the need for a full model; MC methods are
more widely used in practice due to their flexibility and scalability.
32. Design a new RL algorithm that combines Dynamic Programming and Temporal
Difference methods to address a specific challenge in a complex environment.
In Reinforcement Learning (RL), there are various approaches to solving problems involving
large or complex environments. Two such approaches are Dynamic Programming (DP) and
Temporal Difference (TD) learning. DP methods, such as Value Iteration and Policy Iteration,
require a complete model of the environment, while TD methods, like Q-learning and
SARSA, can learn directly from interaction with the environment without requiring a model.
Combining both DP and TD can help address challenges where we want the benefits of a
model-based approach (like DP) with the flexibility and scalability of TD methods (that
work well in unknown environments). In this scenario, the algorithm will aim to:
2. Model-Free TD Updates:
o If the model is incomplete or unreliable (due to noisy or incomplete
observations), the algorithm shifts to TD updates (like Q-learning or
SARSA).
o The TD updates help the algorithm learn directly from the environment
without relying on transition models, adapting to changes in the environment
over time.
o Model Confidence Metric: This metric estimates how reliable the current
model (transition and reward function) is based on observed feedback (e.g.,
prediction errors or discrepancies between expected vs. observed outcomes).
1. Initialize:
4. Switching Condition:
5. Repeat:
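One possible way to sketch the switching rule at the heart of this loop in Python; the model interface (confidence, transitions, reward) and the thresholds are assumptions made for illustration, not the definitive algorithm:

def metd_update(Q, s, a, r, s_next, model, alpha=0.1, gamma=0.95,
                confidence_threshold=0.8):
    # Q: dict of dicts, Q[state][action] -> value estimate
    if model.confidence() >= confidence_threshold:
        # Model-based (DP-style) backup: expectation over the learned transition model
        target = sum(p * (model.reward(s, a, s2) + gamma * max(Q[s2].values()))
                     for s2, p in model.transitions(s, a).items())
    else:
        # Model-free TD (Q-learning) backup from the single observed transition
        target = r + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])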
Advantages of METD:
1. Scalability: METD is more scalable than purely DP-based approaches since it can
handle large or continuous state spaces by relying on TD methods when the model is
not reliable.
2. Flexibility: The algorithm can adapt to both known and unknown environments.
When a reliable model is available, it can benefit from DP’s structured updates. When
the model is inaccurate or incomplete, it switches to model-free learning.
3. Improved Learning Efficiency: By combining model-based and model-free
learning, METD can converge faster than pure TD methods in environments where a
reliable model is available but still handle model-free learning when the model is
unreliable.
• Model-Based Component: The robot can initially rely on known maps, traffic rules,
and dynamic models to make high-level decisions.
33. Create a novel RL scenario where the Bellman Optimality equation needs to be
modified to accommodate additional constraints.
In this scenario, the goal is to manage the inventory of products in a warehouse using
reinforcement learning (RL). The agent's task is to decide how much of each product to
restock at different times to maximize profit while adhering to constraints such as storage
limits, budget, and demand satisfaction.
Scenario Description:
• State Space: The state represents the current inventory levels of various products in
the warehouse, the available budget for restocking, and the historical demand for each
product.
• Action Space: The actions represent the amount of each product to restock at the
current time.
• Rewards: The agent earns rewards based on the profit it makes by restocking
products, but penalties are applied if it overshoots the budget or exceeds storage
capacity. Additionally, there's a penalty if demand is not satisfied.
Additional Constraints:
1. Storage Limits: Each product has a maximum storage capacity in the warehouse. The
agent cannot restock more than the available storage space.
2. Budget Constraints: The agent has a limited budget for restocking products, which it
cannot exceed.
3. Demand Satisfaction: The agent should aim to meet customer demand, but if it
overestimates demand, it wastes resources. If it underestimates demand, it loses
potential profit.
Modifying the Bellman Optimality Equation
In a typical reinforcement learning setting, the agent aims to maximize rewards over time.
The Bellman equation generally helps compute the expected cumulative reward, but here we
need to modify it to account for the constraints.
To incorporate the constraints of storage limits, budget, and demand satisfaction, we adjust
the reward R(s, a) as follows:
1. Storage Limits: If the agent tries to restock more than the available storage space, it
gets a penalty in the reward.
2. Budget Constraints: If the agent exceeds its restocking budget, it gets a penalty.
3. Demand Satisfaction: If the agent doesn't meet the required demand, it incurs a
penalty. If it overestimates demand, it loses resources.
Thus, the modified Bellman equation for this scenario would be:
V(s) = max_a [ R_c(s, a) + γ Σ_{s′} P(s′ | s, a) V(s′) ]
where R_c(s, a) is the constraint-adjusted reward that now incorporates the penalties for
violating storage, budget, or demand constraints.
34. Analyze how the Bellman Optimality equation changes when the environment has
stochastic transitions and rewards.
When the environment has stochastic transitions and rewards, the Bellman Optimality
Equation must account for the inherent randomness in both the state transitions and the
rewards associated with actions. This change reflects the uncertainty in the outcomes of
taking specific actions in certain states.
Original Bellman Optimality Equation (Deterministic Case)
In a deterministic environment, the Bellman Optimality equation is given by:
V(s) = max_a [ R(s, a) + γ V(s′) ]
Where:
• s′ is the next state reached by taking action a in state s,
• R(s, a) is the immediate reward, and
• γ is the discount factor.
In this case, the next state s′ is fully determined by the current state s and the action a. The
reward R(s, a) is also deterministic, meaning it does not vary based on any probabilistic
factors.
When the environment is stochastic, the transition between states and the rewards are not
deterministic. This means that:
• The agent cannot be sure of the next state given the current state and action. There is a
probability distribution over the next states.
• The reward for taking an action in a state also has a probability distribution.
Thus, the Bellman Optimality equation needs to incorporate expectations to account for the
randomness.
V(s) = max_a [ E[R(s, a)] + γ E[V(s′)] ]
Where:
• E[R(s, a)] is the expected immediate reward when taking action a in state s,
• E[V(s′)] is the expected value of the next state s′, considering the probability of
transitioning to different states.
1. Stochastic Transitions:
o In a stochastic environment, taking action a in state s results in a probability
distribution over possible next states s′. This is represented as P(s′ | s, a), the
probability of transitioning from state s to state s′ when action a is taken.
o Writing the expectation over next states explicitly, the equation becomes:
V(s) = max_a [ E[R(s, a)] + γ Σ_{s′} P(s′ | s, a) V(s′) ]
Where:
• Σ_{s′} P(s′ | s, a) V(s′) is the expected value of the next state, considering the probability
distribution over states.
Impact of Stochasticity on the Agent's Decision-Making:
• Uncertainty in Decision-Making: The agent must make decisions based on expected
values, taking into account the probabilities of different outcomes rather than
deterministic results.
This modified Bellman Optimality equation helps the agent adapt to uncertain environments,
where outcomes are not fully predictable and must be evaluated in terms of expected values.
35. Apply Temporal Difference learning to update the value function for a specific state
in an RL task.
36. Given a simple RL environment, demonstrate how you would apply Dynamic
Programming methods to find the optimal value function
This gives the optimal action to take for each state based on the final value function.
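For concreteness, here is a minimal value-iteration sketch in Python; the model representation P[s][a] = list of (prob, next_state, reward) is an assumption:

def value_iteration(states, actions, P, gamma=0.9, theta=1e-6):
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                       for a in actions)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:                      # stop once value updates become negligible
            break
    # Greedy policy extraction: the optimal action for each state under the final V
    policy = {s: max(actions,
                     key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
              for s in states}
    return V, policy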
37. a. How does the Bellman Optimality equation help in finding the optimal policy in
RL problems?
The exploration-exploitation dilemma in Temporal Difference (TD) learning and the concept
of the "horizon" in Dynamic Programming (DP) both play crucial roles in reinforcement
learning (RL), though they address different aspects of the learning and decision-making
process. Below is a detailed comparison and explanation of their individual impact and
interaction in RL.
Aspect Description
Where It Prominent in model-free methods like TD learning, where the agent learns
Appears directly from interactions with the environment without a model.
If the agent explores too much, learning is slow; if it exploits too early, it
Impact may converge to a suboptimal policy. Proper exploration strategies (like ε-
greedy) are essential.
To ensure that the agent gathers enough information about the environment to
Goal
learn the optimal policy.
Aspect Description
The planning depth or the number of future steps the agent considers while
Definition making decisions. It reflects how far into the future the value of rewards is
considered.
Balancing information gathering vs. Planning how far into the future to
Core Idea
reward maximizing optimize
Impact on Affects how well the agent Affects the depth and quality of
Learning discovers good policies the learned policy
Relevance to Policy Ensures agent doesn’t get stuck Ensures agent plans beyond
Quality with suboptimal actions immediate rewards
Final Insight
• Horizon is about how far ahead the agent considers future consequences when
planning.
o If exploration is poor, even the best-planned horizon won't help (the agent may
not discover better strategies).
o If the horizon is too short, the agent may ignore the long-term benefits of
explored actions.
Thus, effective RL requires a synergy between good exploration strategies and appropriate
horizon planning to achieve optimal learning and decision-making.
39. How does the concept of "Bellman backup" play a crucial role in both Dynamic
Programming and Temporal Difference methods? Can you provide an example of how
this backup process is applied in a specific RL scenario?
A Bellman backup is the process of updating the estimated value of a state (or state-action
pair) based on the immediate reward and the estimated value of its successor state(s).
Let's define:
• V(s): value of state s
Example Scenario
Let’s consider a simple gridworld where an agent can move up, down, left, or right. Each
action gives a reward of -1, and reaching the goal gives +10.
Dynamic Programming:
• Agent has access to the full environment model (knows all transition probabilities).
• It performs Bellman backups for all states by computing expected values over all
possible outcomes.
Example: for each state s, the backup computes
V(s) ← max_a Σ_{s′} P(s′ | s, a) [ R(s, a, s′) + γ V(s′) ]
This is done iteratively over the whole grid.
Temporal Difference:
• From experience: Suppose the agent takes action Right from state A, receives reward
-1, and ends up in state B.
It updates:
V(A) ← V(A) + α [ −1 + γ V(B) − V(A) ]
Why It Matters
• Bellman backups allow RL agents to propagate value estimates backward from goal
states to earlier states.
• In DP, the backups are exact expected updates but require a full model of the environment.
• In TD, they are approximate but more scalable and suitable for unknown environments.
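A tiny sketch of both backup styles for this gridworld example, assuming a dictionary-based value table and a hypothetical model(s, a) function:

GAMMA, ALPHA = 0.9, 0.1

def dp_backup(V, s, actions, model):
    # Full Bellman backup: expectation over all successors, requires a known model
    # model(s, a) -> list of (prob, next_state, reward)
    V[s] = max(sum(p * (r + GAMMA * V[s2]) for p, s2, r in model(s, a)) for a in actions)

def td_backup(V, s, r, s_next):
    # TD(0) backup: one sampled transition, no model required
    V[s] += ALPHA * (r + GAMMA * V[s_next] - V[s])

# Example: the agent moves Right from state "A" to "B" and receives reward -1
V = {"A": 0.0, "B": 0.0}
td_backup(V, "A", -1, "B")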
40. The Bellman Optimality equation is a fundamental concept in RL. How does it
mathematically express the principle of optimality, and how is it used to find the
optimal policy in a Markov Decision Process (MDP)?
UNIT – 5
41. Design an RL agent to navigate a grid world using Fitted Q-learning with function
approximation.
42. Implement the Deep Q-Network (DQN) algorithm to solve a continuous action space
problem.
43. Develop a Policy Gradient algorithm to train a robotic arm to reach a target in a
simulated environment.
4. Implementation Overview
# Collect one episode (env, policy_network, sample_action, max_steps, and gamma are assumed to be defined)
states, actions, rewards = [], [], []
state = env.reset()
for t in range(max_steps):
    action = sample_action(policy_network, state)   # sample a ~ pi_theta(. | state)
    next_state, reward, done, _ = env.step(action)
    states.append(state)
    actions.append(action)
    rewards.append(reward)
    state = next_state
    if done:
        break

# Compute discounted returns G_t = r_t + gamma * G_{t+1}
returns, G = [], 0.0
for r in reversed(rewards):
    G = r + gamma * G
    returns.insert(0, G)

# Update policy: gradient ascent on sum_t log pi_theta(a_t | s_t) * G_t
# (form this loss in the chosen deep-learning library and take an optimizer step)
44. Analyze the impact of using different function approximation architectures in Fitted
Q-learning.
• Linear Approximator: approximates Q(s, a) as a linear combination of features. Strengths:
fast, interpretable, low computation cost. Weaknesses: poor performance in high-dimensional,
non-linear problems.
• Convolutional Neural Networks (CNNs): specialized for spatial/visual input like images.
Strengths: ideal for pixel-based environments (e.g., Atari). Weaknesses: requires large data
and compute; sensitive to architecture choice.
• Recurrent Neural Networks (RNNs / LSTMs): capture temporal dependencies in input sequences.
Strengths: useful in partially observable environments (e.g., the agent doesn't see the full
state). Weaknesses: complex training, risk of vanishing gradients.
Key Considerations:
1. Generalization: Deeper models (e.g., neural nets) tend to generalize better with
enough data but may overfit on small datasets.
2. Stability: Using deep networks requires techniques like target networks and
experience replay to stabilize training.
3. Sample Efficiency: Some models (like trees or linear models) may converge faster in
simple domains due to lower capacity.
4. Computation Cost: Neural networks demand more compute than linear models or
tree-based models.
Summary
• For simple or low-dimensional tasks, linear models or decision trees may suffice.
• For complex, high-dimensional tasks, deep neural networks (with CNNs or RNNs
where appropriate) are more suitable but require stabilization techniques.
• The right choice depends on the nature of the environment, the structure of the
state space, and the amount of data available.
45. Assess the effectiveness of using Eligibility Traces for updating Q values in a
dynamic environment.
46. Evaluate the performance of Deep Q-Network (DQN) compared to Fitted Q-learning
in a grid world scenario with a large state space.
47. Devise a novel function approximation method for handling continuous state spaces
in RL.
48. a. Compare the advantages and disadvantages of Eligibility Traces and Function
Approximation in RL.
Fitted Q-Learning is a batch variant of Q-learning that uses a fixed dataset of experiences to
train a function approximator (e.g., a neural network or regression model) to estimate the Q-
function.
Experience Replay in Fitted Q-Learning:
2. Uses the batch to fit (or re-fit) a Q-function approximator by minimizing the
Bellman error over the batch:
L = Σ ( r + γ max_{a′} Q(s′, a′) − Q(s, a) )²
3. Updates the model iteratively using all past experiences, not just recent ones.
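A compact sketch of fitted Q-iteration over such a batch, assuming numeric state features and a generic regressor with fit/predict methods (the exact model choice is an assumption):

import numpy as np

def fitted_q_iteration(batch, q_model, n_actions, gamma=0.99, iterations=10):
    # batch: list of (state, action, reward, next_state, done) with numeric state vectors
    S  = np.array([b[0] for b in batch])
    A  = np.array([b[1] for b in batch])
    R  = np.array([b[2] for b in batch])
    S2 = np.array([b[3] for b in batch])
    D  = np.array([b[4] for b in batch], dtype=float)
    X  = np.column_stack([S, A])                       # regress Q on (state, action) features
    q_model.fit(X, R)                                  # first pass: immediate rewards only
    for _ in range(iterations):
        # Bellman targets: r + gamma * max_a' Q(s', a'), zero bootstrap for terminal transitions
        q_next = np.column_stack([q_model.predict(np.column_stack([S2, np.full(len(S2), a)]))
                                  for a in range(n_actions)])
        targets = R + gamma * (1.0 - D) * q_next.max(axis=1)
        q_model.fit(X, targets)                        # re-fit the approximator to the new targets
    return q_model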
Benefits:
49. a. What are the main advantages and limitations of Fitted Q learning compared to
DQN? b. In which scenarios would you prefer to use Fitted Q learning over DQN and
vice versa?
a. Advantages and Limitations of Fitted Q-Learning Compared to DQN
• Stability: Fitted Q-Learning is more stable due to batch updates and fixed datasets; DQN is
less stable and needs tricks like target networks and experience replay.
b. Scenario: Offline RL (using logged data from past experiences). Preferred algorithm:
Fitted Q-Learning. Reason: it works well on batch data and is more sample-efficient.
50. How do Policy Gradient algorithms and Least Squares Methods handle the
exploration-exploitation trade-off differently?
Policy Gradient algorithms and Least Squares methods approach the exploration-exploitation
trade-off in reinforcement learning (RL) from fundamentally different angles. Here's an
elaborated comparison of how each handles this trade-off:
1. Policy Gradient Methods
Approach:
• Policy Gradient methods directly optimize the policy (i.e., the probability distribution
over actions) using gradient ascent on the expected return.
Exploration-Exploitation Handling:
o These algorithms often use a soft policy like the softmax or Gaussian
distribution to sample actions.
• The exploration level can be controlled using parameters like the temperature in
softmax or variance in Gaussian policies.
Advantages in Trade-off:
Limitations:
2. Least Squares Methods
Approach:
• These methods use function approximation to estimate the value function or Q-
function based on a batch of experiences, usually using linear regression or similar
techniques.
Exploration-Exploitation Handling:
• These are typically off-policy methods, meaning the behavior policy (used for data
collection) can be different from the target policy (being learned).
Advantages in Trade-off:
• Very sample efficient if the dataset has good coverage of the state-action space.
Limitations:
• If the training data is biased toward exploitation (not enough exploration), the model
might fail to learn the optimal policy.