RL Sem Ans

The document discusses the application of the PAC learning framework for designing binary classifiers, detailing the steps to determine the required number of training examples for desired accuracy and confidence. It also covers various algorithms for multi-armed bandit problems, including Upper Confidence Bound (UCB) and Epsilon Greedy, analyzing their effectiveness in both stationary and non-stationary environments. Additionally, it explores adaptations for delayed feedback scenarios and highlights the use of bandit algorithms in real-world applications like online advertising.


UNIT – 1

1. Apply the PAC learning framework to design a binary classifier for a given dataset
and determine the minimum number of training examples required for a specific level
of confidence and accuracy.

PAC Learning is a framework in machine learning that helps us understand:

• How many training examples we need

• To ensure that our learned model performs well (accurate) with high probability
(confidence)

We want our classifier to be:

• Approximately Correct: It makes only a small error (say less than ε)

• Probably: This happens with high probability (say at least 1 − δ)

Parameters:

• ε (epsilon): The maximum error we can tolerate. For example, ε = 0.1 means we
allow 10% error.

• δ (delta): The maximum allowed probability of failure. For example, δ = 0.05 means
we want 95% confidence.

Step-by-Step: Applying PAC Learning to a Binary Classifier

1. Define the Hypothesis Space (H)


• A hypothesis is a function that tries to classify the data.

• The hypothesis space H is the set of all classifiers your algorithm can choose from.
• For example, if you use linear classifiers in 2D, H could be the set of all lines.

Let’s say the size of your hypothesis space is |H|.

2. Choose your desired ε and δ

• Decide how accurate you want the classifier to be (ε).

• Decide how confident you want to be in that accuracy (1 − δ).


Example:

• You want at most 10% error → ε = 0.1

• You want at least 95% confidence → δ = 0.05

3. Compute the minimum number of training examples (m)

Using the PAC Learning formula:

m ≥ (1/ε) · (ln|H| + ln(1/δ))

Where:
• m is the number of training samples

• |H| is the size of the hypothesis space

4. Example Calculation

Assume:

• Hypothesis space size |H| = 1000 (you have 1000 possible classifiers)

• ε = 0.1

• δ = 0.05
Plug into the formula:

m ≥ (1/0.1) · (ln 1000 + ln(1/0.05)) = 10 × (6.91 + 3.00) ≈ 99
So, you need at least 99 training examples to guarantee that your classifier has at most
10% error with 95% confidence.
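As a quick check, this bound can be evaluated directly. Below is a minimal sketch assuming the finite-hypothesis PAC bound written above; the function name pac_sample_size is illustrative:

```python
import math

def pac_sample_size(h_size, epsilon, delta):
    """m >= (1/epsilon) * (ln|H| + ln(1/delta)) for a finite hypothesis space."""
    return math.ceil((1.0 / epsilon) * (math.log(h_size) + math.log(1.0 / delta)))

# |H| = 1000, epsilon = 0.1, delta = 0.05 -> the bound evaluates to about 99.04
print(pac_sample_size(1000, 0.1, 0.05))  # prints 100 after rounding up
```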

Summary of the steps:

1. Define the hypothesis space H (set of all classifiers you consider).

2. Set desired error (ε) and confidence level (1−δ).

3. Use the PAC formula to calculate how many training examples (m) are needed.

4. Train the classifier on m examples and evaluate its performance.

2. Given a multi-armed bandit scenario with five arms and their respective reward
Distributions, apply the Upper Confidence Bound (UCB) algorithm to select the
best arm for maximizing cumulative rewards.
3. Explain the Epsilon Greedy Algorithm for Action Selection and analyse
mathematically the Exploration/ Exploitation trade off
4. a.Compare and contrast the strategies used by the Upper Confidence Bound (UCB)
Algorithm and other bandit algorithms for balancing exploration and exploitation.
b. Describe the role of sample complexity (m) in the PAC learning framework and how
it affects the learning process.
(a)
(b)

5. Evaluate the effectiveness of the Upper Confidence Bound (UCB) algorithm in real-
world scenarios with non-stationary reward distributions, discussing its strengths and
limitations.
Evaluating the effectiveness of the Upper Confidence Bound (UCB) algorithm in real-world
scenarios with non-stationary reward distributions requires a deep look into both the
principles of UCB and the nature of non-stationary environments.

What is a Non-Stationary Environment?

In a non-stationary setting:

• The reward distribution of arms changes over time.

• An arm that used to give high rewards might start giving lower rewards later, or vice
versa.

Such scenarios occur often in real-world applications like:

• Online recommendation systems (user preferences shift)


• Financial trading (market conditions vary)

• Adaptive routing in networks (latencies change)

Effectiveness of UCB in Non-Stationary Settings

Strengths of UCB

1. Strong Theoretical Foundations (in stationary settings)

o UCB is designed with a provable logarithmic regret in stationary environments.

o It performs very well when reward distributions are fixed and stable.

2. Efficient Exploration-Exploitation Trade-off


o UCB naturally prefers arms with high average reward but still explores
uncertain arms.

o This is useful when changes are slow or infrequent.


3. Simple and Interpretable

o Easy to implement and debug.

o Decisions can be explained due to the confidence-bound-based structure.

Limitations in Non-Stationary Environments

1. Outdated Averages and Counts

o UCB uses cumulative averages over all past pulls.


o If the reward distribution of an arm drops suddenly, UCB may continue
selecting it based on past good performance, ignoring recent decline.

2. Inflexibility to Change

o UCB is slow to adapt to changes because:

▪ It doesn’t "forget" old data.

▪ It lacks built-in mechanisms to detect shifts in the reward distribution.

3. Delayed Reaction to Trends


o If an arm’s reward increases later, UCB might not explore it enough again to
detect the improvement.
Real-World Example: News Article Recommendation

• Suppose you have 5 articles to recommend (arms).

• User interest (reward) in each article changes every few hours.

• UCB might keep recommending an article that was popular in the morning, even
though interest has now shifted.

• This can result in missed opportunities and suboptimal engagement.

Solutions and Alternatives

To make UCB more effective in non-stationary settings, researchers use modified versions,
such as:

1. Sliding-Window UCB

o Uses only the most recent rewards (e.g., last 100 interactions).

o Helps to focus on recent trends, ignoring outdated data.


2. Discounted UCB

o Applies an exponential decay to past rewards.

o Recent rewards have more weight.

3. Change Detection + UCB

o Monitor reward patterns and reset UCB when a change is detected.

o Can combine with statistical tests or drift detection methods.

4. Thompson Sampling or Bandit over Bandits

o In rapidly changing environments, Bayesian methods or more adaptive algorithms sometimes outperform UCB.
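The first two adaptations above (Sliding-Window UCB and Discounted UCB) can be sketched in a few lines. The following is a minimal, illustrative sketch; the class name DiscountedUCB, the decay factor gamma, and the Bernoulli toy rewards are assumptions made for the example, not a standard library API:

```python
import math
import random

class DiscountedUCB:
    """Discounted UCB sketch: recent rewards are weighted more heavily, so the
    per-arm statistics can track a drifting (non-stationary) reward distribution."""

    def __init__(self, n_arms, gamma=0.98, c=2.0):
        self.gamma = gamma                     # decay applied to old observations
        self.c = c                             # exploration strength
        self.weighted_sum = [0.0] * n_arms     # discounted sum of rewards per arm
        self.weighted_count = [0.0] * n_arms   # discounted pull count per arm

    def select(self):
        total = sum(self.weighted_count)
        best_arm, best_score = 0, float("-inf")
        for a, (s, n) in enumerate(zip(self.weighted_sum, self.weighted_count)):
            if n < 1e-9:                       # (nearly) unobserved arm: explore it first
                return a
            score = s / n + math.sqrt(self.c * math.log(total + 1.0) / n)
            if score > best_score:
                best_arm, best_score = a, score
        return best_arm

    def update(self, arm, reward):
        # Decay every arm's statistics, then add the newest observation.
        for a in range(len(self.weighted_sum)):
            self.weighted_sum[a] *= self.gamma
            self.weighted_count[a] *= self.gamma
        self.weighted_sum[arm] += reward
        self.weighted_count[arm] += 1.0

# Toy non-stationary run: arm 0 is best for the first 1000 steps, arm 1 afterwards.
bandit = DiscountedUCB(n_arms=2)
for t in range(2000):
    p = [0.7, 0.3] if t < 1000 else [0.3, 0.7]
    arm = bandit.select()
    bandit.update(arm, 1.0 if random.random() < p[arm] else 0.0)
```

Because old observations decay geometrically, an arm whose reward distribution collapses loses its advantage within a few hundred pulls instead of being exploited indefinitely.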
6.Design an improved variant of the Upper Confidence Bound (UCB) algorithm that
dynamically adjusts the exploration rate based on the feedback received from the
environment.
7. a. Explain the difference between PAC learning and UCB algorithm in terms of
their Fundamental purposes and problem settings.

b. Name the exploration-exploitation trade-off problem that the Upper Confidence Bound (UCB) algorithm aims to address.

(a)

(b) The exploration-exploitation trade-off problem that the Upper Confidence Bound (UCB) algorithm aims to address is known as the Multi-Armed Bandit (MAB) Problem.

Understanding the Multi-Armed Bandit Problem

1. The Setup
Imagine a gambler at a casino with multiple slot machines (called "arms"), each
giving a different and unknown reward distribution. The gambler's goal is to play
the machines in a way that maximizes the total reward over time.

• Each arm provides a reward drawn from an unknown probability distribution.

• The gambler can try new arms (exploration) or stick with the best-known one
(exploitation).

This dilemma of choosing between exploring new options and exploiting known ones
is what we call the exploration-exploitation trade-off.

The Trade-Off in Detail

- Exploration:

• Trying different arms to gather more information about their reward distributions.

• Helps in discovering potentially better arms that were initially underestimated.

- Exploitation:

• Using the arm that has performed the best so far.

• Maximizes immediate reward based on current knowledge.

The Challenge:
Too much exploration leads to wasted effort on suboptimal arms.
Too much exploitation risks missing out on better options not tried enough.

How UCB Addresses This Problem

The UCB algorithm balances exploration and exploitation by assigning each arm a
score that combines:

• Estimated mean reward (exploitation): Based on observed rewards.

• Confidence bound (exploration): A term that increases when an arm is less explored.

UCB Formula:

UCB(a) = Q(a) + √( (2 · ln t) / N(a) )

where Q(a) is the average reward observed for arm a, N(a) is the number of times arm a has been pulled, and t is the total number of pulls so far.

What This Does:

• Arms with high average rewards and few samples get high UCB scores.

• Encourages the algorithm to try uncertain arms early on.

• As more data is collected, it gradually shifts to exploitation.
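A minimal sketch of this selection rule, assuming the UCB1 form written above; the helper names ucb_select and ucb_update and the five Bernoulli arms are illustrative choices for the example:

```python
import math
import random

def ucb_select(counts, values, t, c=2.0):
    """Pick the arm with the highest UCB score.
    counts[a]: pulls of arm a, values[a]: its average reward, t: total pulls so far."""
    for a, n in enumerate(counts):
        if n == 0:                 # make sure every arm is tried at least once
            return a
    return max(range(len(counts)),
               key=lambda a: values[a] + math.sqrt(c * math.log(t) / counts[a]))

def ucb_update(counts, values, arm, reward):
    """Incremental average update for the chosen arm."""
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]

# Five Bernoulli arms (true means unknown to the learner).
true_means = [0.1, 0.3, 0.5, 0.2, 0.4]
counts, values = [0] * 5, [0.0] * 5
for t in range(1, 10001):
    arm = ucb_select(counts, values, t)
    reward = 1.0 if random.random() < true_means[arm] else 0.0
    ucb_update(counts, values, arm, reward)
print(values)   # estimates approach the true means; arm 2 (mean 0.5) is pulled most
```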

8. In what ways can bandit algorithms be adapted to handle situations where the rewards are not immediately observable, but rather, manifest as delayed feedback or indirect consequences?

Bandit algorithms are traditionally designed to work in environments where rewards are
immediately observable after an action is taken. However, in many real-world
applications — such as online advertising, recommendation systems, or clinical trials —
the feedback is delayed or indirect, making standard bandit approaches ineffective
without adaptation.
To handle such scenarios, bandit algorithms can be modified in the following ways:

1. Delayed Feedback Handling

a. Delayed UCB and Thompson Sampling

These are extensions of standard UCB and Thompson Sampling that:

• Maintain a buffer or queue of pending rewards.

• Update reward estimates only when feedback arrives, not immediately after an
action.

• Ensure that actions are not repeatedly penalized just because the result has not been
observed yet.

Approach:

• Use placeholder or estimated values until the actual reward is observed.

• Maintain confidence intervals considering the delay.


Example: In online ads, clicks might be reported a few hours after the ad was shown.

b. Bayesian Updating with Delay Models

• Use a Bayesian framework to model the expected delay.

• Instead of updating the belief after each action, the algorithm estimates when
feedback is likely to arrive and updates only at those points.

• Incorporates uncertainty in both reward and timing.

2. Credit Assignment for Indirect Feedback


When rewards are not directly tied to a single action but are influenced by a sequence of
actions, proper credit assignment becomes crucial.

a. Contextual Bandits with Long-Term Goals

• Use contextual information to relate indirect outcomes to the original actions.

• Track how earlier actions correlate with later results (e.g., user retention after seeing
recommendations).

• Modify reward functions to attribute delayed outcomes to initial actions using models like:

o Regression

o Inverse propensity scoring

o Reinforcement learning (for complex dependencies)

b. Use of Surrogate Rewards


• When real rewards are unavailable or too delayed, algorithms can use proxy
measures (surrogate rewards).
• Example: In e-commerce, user click-through may serve as a short-term proxy for
purchase behavior.
Surrogate rewards:

• Provide faster feedback.

• Should be strongly correlated with long-term goals.

3. Partial Monitoring Bandits

In some cases, the algorithm doesn't observe the full reward even after a delay. Partial
monitoring bandits address this by:

• Maintaining models of how feedback relates to actions.

• Using indirect signals (e.g., engagement, views) to update beliefs.

• Often more computationally intensive due to sparse information.

4. Window-Based and Weighted Averaging


To avoid biases from incomplete feedback:

• Use sliding windows to average rewards only over actions with observed outcomes.

• Weight rewards based on recency or confidence to reduce the effect of outdated or sparse data.
5. Reinforcement Learning Integration
When reward delays are complex and depend on state transitions or long-term effects:

• Treat the problem as a reinforcement learning (RL) task instead of a pure bandit
problem.

• Use policy gradient or Q-learning methods to learn policies over sequences of actions.

• Bandit algorithms can serve as a component (e.g., for action selection in policy
exploration).

9. a. State and explain Epsilon Greedy algorithm in detail graphically by mapping exploration and exploitation. Also analyze the Q value update process with every action selection and state transition.

b. Outline the primary objective of bandit algorithms in the context of reinforcement learning.
a. Epsilon-Greedy Algorithm: Explanation, Exploration vs. Exploitation, and Q-value
Update

The Epsilon-Greedy algorithm is a simple and widely used strategy in reinforcement learning
and multi-armed bandit problems to balance exploration (trying new actions) and exploitation
(choosing the best-known action).

Key Components

• Epsilon (ε): A small positive value between 0 and 1 (e.g., 0.1).

o With probability ε, the agent explores (chooses a random action).

o With probability 1 - ε, the agent exploits (chooses the action with the highest
estimated reward).

Exploration vs. Exploitation: Graphical Intuition


• Early in training, exploration is frequent to learn about all actions.

• Over time, the algorithm starts favoring the action with the highest Q-value (i.e.,
exploitation).

• You can also use epsilon decay, where ε decreases over time to shift from exploration
to exploitation.

Q-Value Update Process

Let’s define:

• Q(a): Estimated value of action a

• R: Reward received after taking action a

• α: Learning rate (0 < α ≤ 1)

Update Rule:

Q(a)←Q(a)+α⋅(R−Q(a))

This is the temporal difference update:

• R - Q(a): The error between actual and estimated reward.

• The Q-value moves toward the new reward R with step size α.

Process with Each Action and State Transition


1. Select action a:

o With ε probability: choose random action (explore)

o With 1 - ε: choose a=argmaxQ(a) (exploit)

2. Take action, observe reward R


3. Update Q-value of that action using the rule above
4. Repeat for next timestep

Example Table for Q-Value Updates

Step Action Chosen Reward (R) Old Q(a) Updated Q(a)

1 A (explore) 1.0 0.5 0.5 + 0.1(1.0 - 0.5) = 0.55

2 B (explore) 0.2 0.0 0.0 + 0.1(0.2 - 0.0) = 0.02

3 A (exploit) 0.9 0.55 0.55 + 0.1(0.9 - 0.55) = 0.59

Over time, Q-values converge to expected rewards, helping the agent make better decisions.
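A table like the one above can be reproduced with a short simulation. Below is a minimal sketch assuming ε = 0.1, α = 0.1, and a two-arm toy environment with Bernoulli-style rewards; all names are illustrative:

```python
import random

def epsilon_greedy_step(q, epsilon, rewards, alpha=0.1):
    """One epsilon-greedy action selection followed by the incremental Q update."""
    if random.random() < epsilon:                    # explore
        action = random.randrange(len(q))
    else:                                            # exploit
        action = max(range(len(q)), key=lambda a: q[a])
    r = rewards(action)                              # observe reward from the environment
    q[action] += alpha * (r - q[action])             # Q(a) <- Q(a) + alpha * (R - Q(a))
    return action, r

# Toy environment: arm A (index 0) pays ~0.9 on average, arm B (index 1) pays ~0.2.
def toy_rewards(a):
    return 1.0 if random.random() < (0.9 if a == 0 else 0.2) else 0.0

q = [0.0, 0.0]
for _ in range(1000):
    epsilon_greedy_step(q, epsilon=0.1, rewards=toy_rewards)
print(q)    # Q-values converge toward the expected rewards (~0.9 and ~0.2)
```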

b. Primary Objective of Bandit Algorithms in Reinforcement Learning


The primary objective of bandit algorithms in reinforcement learning is to:

Maximize cumulative reward over time by optimally balancing exploration and exploitation
in environments where only limited feedback is available.

Key Goals:

1. Efficient Decision-Making Under Uncertainty


Bandit algorithms help an agent decide which action to take when it does not know
the full outcome of each action.
Example: Which ad to display to get the most clicks?

2. Online Learning and Adaptation


The agent learns and adapts in real time based on the rewards it receives, rather
than needing a full model of the environment.

3. Sample-Efficient Learning
These algorithms aim to learn the best actions using as few trials as possible, which
is crucial when data collection is expensive or time-sensitive.

4. Foundational Component of RL
The bandit setting is a simplified form of reinforcement learning, with:

o No state transitions (stateless environment)

o Immediate feedback
It serves as the building block for more complex RL problems like Markov
Decision Processes (MDPs), where actions influence future states.

10. Discuss a real-world application where bandit algorithms have been successfully
used, and explain the benefits of employing such algorithms in that context.
Real-World Application of Bandit Algorithms: Online Advertising (Ad Placement and
Recommendation Systems)
Scenario

Online platforms like Google, Facebook, YouTube, and Amazon often face a crucial
problem:
“Which advertisement or content should we show to a user to maximize engagement (like
clicks or purchases)?”
This is where multi-armed bandit algorithms come into play.

How Bandit Algorithms Are Used

Each ad, video recommendation, or product listing is treated as an arm of the bandit. When a
user visits the platform:

• The system must choose one or a few items to display.


• It gets partial feedback (e.g., whether the user clicked the ad or not).

• It updates the estimated value (reward probability) of each item accordingly.

Example:

• Show Ad A to User X → User clicks → Reward = 1

• Show Ad B to User Y → User skips → Reward = 0


The algorithm then updates the Q-values (estimated click-through rates) for Ads A and B.
Why Use Bandit Algorithms in This Context?
1. Balance Between Exploration and Exploitation

• Bandit algorithms like Epsilon-Greedy or UCB allow the platform to:

o Explore new or less-known ads.

o Exploit ads that historically perform well.


• This prevents the system from showing the same ad always and missing potentially
better ones.

2. Real-Time Adaptation
• User preferences can change quickly.

• Bandit algorithms update in real-time based on each interaction, adapting to current trends or user behavior shifts.

3. Increased Revenue and Engagement

• By choosing the best-performing content dynamically, companies see:

o Higher click-through rates (CTR)

o More purchases or signups

o Better user satisfaction

4. Efficient Use of Data


• Bandit algorithms work well with limited feedback.

• Unlike traditional A/B testing that needs long-term trials, bandits can make decisions
after just a few interactions.
Benefits of Using Bandit Algorithms in Online Advertising

• Higher Efficiency: Maximizes returns (clicks, purchases) with fewer trial samples.

• Faster Learning: Learns what works quickly without full knowledge of all options.

• Personalization: Can be combined with contextual bandits to tailor content per user profile.

• Reduced Opportunity Cost: Avoids wasting impressions on consistently underperforming ads.

• Adaptability: Responds to changing user trends or product popularity over time.
UNIT – 2

11. Given a set of k arms with different reward distributions, apply the Median
Elimination Algorithm to identify the optimal arm based on the provided sample mean

12. Assess the efficiency of the Median Elimination algorithm compared to other
advanced Bandit algorithms for bandit problems with a large number of arms.
13. Evaluate the potential real-time applications of the Policy Gradient algorithm in
various domains, and discuss the challenges it may face in certain scenarios.
Challenges Faced by Policy Gradient Algorithms

1. High Variance in Gradient Estimates

• Problem: The estimates of the gradient (how to update the policy) can have very
high variance, making the learning process unstable and slow.

• Reason: Since PG relies on sampled trajectories, small changes in actions can lead to
large changes in cumulative rewards, especially in long-horizon tasks.

2. Sample Inefficiency

• Problem: Policy Gradient methods often require a large number of episodes (samples) to learn effective policies.

• Reason: The policy is updated incrementally using stochastic gradients, which may
not make full use of all the collected data.

3. Convergence to Local Optima

• Problem: PG algorithms often converge to suboptimal policies due to the non-convexity of the policy space.

• Reason: The optimization landscape is complex, and without proper exploration, the
algorithm may get stuck in poor solutions.

4. Exploration Challenges

• Problem: If the initial policy is poor, the algorithm might not explore promising
areas of the action space.

• Reason: Policy gradient relies on the policy's own exploration, which may not cover
all useful state-action pairs.

5. Credit Assignment Problem

• Problem: Determining which actions led to success or failure over long episodes
can be difficult.

• Reason: In delayed reward settings, it's hard to assign reward contributions accurately
to earlier actions.

6. Sensitive to Hyperparameters

• Problem: Learning rate, entropy regularization, and reward discount factor must be
carefully tuned.

• Reason: Improper tuning may result in divergence or very slow learning.

7. Noisy or Delayed Rewards


• Problem: If the environment provides delayed or noisy rewards, it becomes harder
to learn effective policies.
• Reason: This worsens the variance and hampers the effectiveness of the gradient
estimates.

8. Difficulty in Handling Discrete Action Spaces

• Problem: While PG is often used for continuous actions, in discrete action spaces,
value-based methods like DQN may perform better.

• Reason: Discrete policies can be hard to optimize due to abrupt changes in action
probabilities.

14. Design an experiment to evaluate the performance of the Median Elimination algorithm on a simulated multi-armed bandit problem with different reward distributions.

To design an experiment to evaluate the performance of the Median Elimination Algorithm (MEA) on a simulated multi-armed bandit problem, we follow a structured approach: define the setup, simulate the environment, run the algorithm, and measure performance.

Experiment Design to Evaluate Median Elimination Algorithm

1. Objective

To assess how effectively and accurately the Median Elimination algorithm identifies the
optimal arm in a multi-armed bandit setting with varying reward distributions.

2. Setup

a. Environment

• Create a k-armed bandit environment with known, fixed reward distributions.

• Let’s define:
o k = 10 arms

o Each arm has a Bernoulli distribution (rewards are 0 or 1).

o Assign different true mean rewards to each arm, for example:

Arm True Mean Reward

A1 0.10

A2 0.25

A3 0.35

A4 0.30

A5 0.50
A6 0.45

A7 0.20

A8 0.40

A9 0.15

A10 0.60 (optimal)

b. Parameters

• Accuracy: ε = 0.1

• Confidence: δ = 0.05

3. Median Elimination Implementation

Implement MEA with the following steps:

1. Initialize active set S with all arms.

2. For each round:

o Sample each active arm a number of times based on current ε and δ.

o Estimate the empirical mean for each arm.


o Compute the median of these means.

o Eliminate arms whose empirical means are less than the median.

o Update ε and δ as per MEA rules.

3. Continue until only one arm remains.

4. Output the selected arm.
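A minimal sketch of these steps on the simulated Bernoulli arms from the table above; the function median_elimination and the per-round sample count follow the usual MEA schedule but are illustrative choices, not a library implementation:

```python
import math
import random

def median_elimination(true_means, epsilon=0.1, delta=0.05):
    """Median Elimination sketch on simulated Bernoulli arms.
    Returns the index of the single surviving arm."""
    arms = list(range(len(true_means)))
    eps_l, delta_l = epsilon / 4.0, delta / 2.0
    while len(arms) > 1:
        # Samples per surviving arm this round (standard MEA schedule).
        n = int(math.ceil((4.0 / eps_l ** 2) * math.log(3.0 / delta_l)))
        means = {a: sum(random.random() < true_means[a] for _ in range(n)) / n
                 for a in arms}
        median = sorted(means.values())[len(arms) // 2]
        arms = [a for a in arms if means[a] >= median]      # keep the better half
        eps_l, delta_l = 0.75 * eps_l, delta_l / 2.0         # tighten accuracy and confidence
    return arms[0]

# Arms A1..A10 from the table above; arm index 9 (A10, mean 0.60) is optimal.
true_means = [0.10, 0.25, 0.35, 0.30, 0.50, 0.45, 0.20, 0.40, 0.15, 0.60]
print(median_elimination(true_means))   # should print 9 in almost every run
```

Running this inside a loop of 1000 repetitions and counting how often index 9 is returned gives the success rate described in the metrics below.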

4. Performance Metrics
Evaluate performance using:

a. Accuracy of Best Arm Selection


• Run the experiment 1000 times.

• Count how many times the algorithm selects the true best arm (A10).

• Calculate success rate = (correct selections / total runs) × 100%

b. Number of Samples Used


• Measure the total number of arm pulls required in each run.
• Average over all runs to evaluate sample efficiency.

c. PAC Guarantee Validation

• Check whether the selected arm’s mean is within ε (0.1) of the best arm’s mean with ≥
95% confidence.

• That is, check if:

P(μ_selected ≥ μ_optimal − ε) ≥ 0.95

5. Tools and Libraries


Use Python with:

• NumPy for simulation

• Matplotlib for visualization

• Optional: Jupyter Notebook for experimentation

6. Visualizations

• Histogram of selected arms over 1000 runs

• Line chart of sample count vs accuracy


• Box plot showing reward distributions

7. Expected Observations

• MEA should select the optimal arm or one close to it (within ε) ≥ 95% of the time.

• As ε decreases, accuracy improves but total samples increase.

• Compared to naive methods, MEA should be more sample-efficient while still achieving PAC guarantees.

15. Create a new variant of the Policy Gradient algorithm that incorporates a baseline
technique to reduce variance in the policy gradient estimates.
16. How does the Policy Gradient algorithm handle continuous action spaces in bandit
problems? What are some advantages of using policy gradient methods in such
scenarios?
Advantages of Using Policy Gradient Methods in Continuous Action Bandit Problems:
1. Direct Optimization of the Policy:
Policy gradient methods optimize the policy directly without needing to discretize the
action space.

2. Naturally Handle Continuous Actions:


They model actions as continuous distributions (e.g., Gaussian), making them well-
suited for continuous control tasks.

3. Stochastic Policy Representation:


Enables learning probabilistic policies, which are essential for effective exploration in
continuous spaces.

4. No Need for Value Function Approximation:


In pure bandit problems (no state transitions), policy gradients don’t rely on
estimating a value function, simplifying the learning process.
5. Flexibility in Policy Structure:
Easily adaptable to complex, parameterized policy models (e.g., neural networks) for
high-dimensional action spaces.

6. Effective in High-Dimensional Actions:


More scalable and effective than value-based methods in problems with high-
dimensional or infinite action spaces.

17. a. Analyze how the Policy Gradient algorithm can be adapted to handle continuous
action spaces in bandit problems. (Refer Q.16)

b. Compare and contrast the Median Elimination algorithm and the Policy Gradient
Algorithm in terms of their strengths and weaknesses when applied to bandit problems.
(Refer Q.12)

18. Describe the concept of the exploration-exploitation trade-off in bandit problems. How does the Policy Gradient algorithm handle this trade-off?

Exploration-Exploitation Trade-off in Bandit Problems

The exploration-exploitation trade-off is a central concept in multi-armed bandit problems and reinforcement learning. It refers to the challenge of choosing between two competing actions:

• Exploration: Trying out new or less-selected actions to gather more information about their potential rewards.

• Exploitation: Selecting the action that currently seems best (i.e., has the highest
estimated reward) to maximize immediate gains.

Balancing these two is essential because:


• Relying only on exploitation can cause the algorithm to miss better options.

• Excessive exploration may lead to suboptimal short-term performance.

Example Scenario:

Imagine a gambler at a casino facing several slot machines (arms). Each machine has an
unknown reward probability. The gambler needs to decide whether to:

• Explore new machines (to learn which ones pay better).

• Exploit the known best machine (to earn more immediately).

How Policy Gradient Handles the Trade-off


The Policy Gradient algorithm handles the exploration-exploitation trade-off implicitly
through its use of stochastic policies.
1. Stochastic Policy Representation:
Policy Gradient methods learn a probability distribution over actions rather than a single
deterministic action. For instance:

π_θ(a|s) = Probability of choosing action a in state s

This means even suboptimal actions have a non-zero chance of being selected, enabling
exploration.

2. Gradient-Based Optimization:
The algorithm adjusts policy parameters θ in the direction that improves the expected
reward, but since the policy is probabilistic, it maintains some degree of exploration
throughout training.

3. Exploration Through Sampling:

During training, actions are sampled from the learned distribution:


• Early in training: the policy is more exploratory due to untrained weights.

• Over time: the policy becomes more confident in high-reward actions (leading to
more exploitation).

4. Advantage Functions and Baselines (Advanced):

Techniques like advantage functions and learned baselines help guide the policy towards
better actions while reducing the noise, enabling more informed exploration.

19. Consider a real-world application where the reward distributions in a bandit problem change over time (non-stationary). How could you adapt the Median Elimination algorithm to cope with this dynamic environment?

To adapt the Median Elimination algorithm for a non-stationary bandit environment (where
reward distributions change over time), we must modify the algorithm to respond to changes
dynamically, since the original Median Elimination is designed for stationary settings with
fixed distributions.

Here is how it can be adapted effectively:


1. Introduce a Sliding Window for Rewards

Instead of using all past rewards to estimate the mean for each arm (which assumes
stationarity), use a fixed-size sliding window that only considers the most recent
observations.
This ensures the algorithm reflects recent trends in reward changes.

• How it helps: It forgets outdated information and adapts to new reward distributions.

• Window size tuning: Choose based on how quickly the environment changes.
2. Apply Weighted Reward Averaging (Exponential Decay)
Use exponentially weighted moving averages (EWMA) to estimate the mean reward of
each arm:

μ̂_t = (1 − α) · μ̂_{t−1} + α · r_t

Where:
• μ̂_t: updated mean estimate at time t

• r_t: observed reward

• α: learning rate (higher = more weight to recent rewards)

• Benefit: Gives more importance to recent feedback, allowing faster reaction to distribution changes.
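A minimal sketch of this update rule, assuming a scalar reward stream and α = 0.1:

```python
def ewma_update(mean_estimate, reward, alpha=0.1):
    """Exponentially weighted moving average: recent rewards dominate the estimate."""
    return (1 - alpha) * mean_estimate + alpha * reward

est = 0.5
for r in [1, 1, 1, 0, 0, 0]:       # the arm's rewards drift downward over time
    est = ewma_update(est, r)      # the estimate follows the drift instead of the full history
```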

3. Periodic Re-Elimination or Reset

To adapt to shifting optimal arms:


• Restart Median Elimination periodically after a fixed number of rounds.

• Or reintroduce previously eliminated arms after certain time intervals for reevaluation.

This avoids sticking to a suboptimal arm that became worse over time.

4. Use Change Detection Techniques


Incorporate change-point detection mechanisms:

• Monitor for significant shifts in reward distributions (e.g., sudden drops or spikes).

• If detected, restart the algorithm or re-evaluate previously discarded arms.

Common techniques include:

• Cumulative sum (CUSUM) control

• Page-Hinkley test

5. Hybrid with Sliding-Window UCB


Blend ideas from non-stationary UCB with Median Elimination:

• Use sliding-window reward estimates in the confidence intervals used for elimination.

• Adjust the elimination threshold dynamically based on reward variance in the recent
window.

By making these changes, the Median Elimination algorithm becomes capable of adapting to
non-stationary environments, which are common in real-world applications like online
recommendation systems, dynamic pricing, or financial trading.
20. a. Recall the key steps involved in the Median Elimination algorithm for bandit
problems.

b. Outline the two specific advanced bandit algorithms used to solve multi-armed
bandit problems.
(b)

1. Upper Confidence Bound (UCB) Algorithm

Idea:
Choose the arm that has the highest potential for reward by balancing:

• How much we’ve explored that arm.

• How high its average reward has been.

How it works:

• Choose the arm with the highest UCB value.

Why it’s good:

• Explores less-visited arms.


• Exploits arms with high average rewards.
• Automatically balances exploration and exploitation.

2. Thompson Sampling

Idea:
Choose arms based on probability of being the best, using Bayesian reasoning.

How it works:

• For each arm, maintain a probability distribution over its possible reward.

• At each time step:


o Sample a reward from the distribution of each arm.

o Select the arm with the highest sampled reward.

o Update the distribution based on the actual reward received.

Why it’s good:

• Naturally balances exploration and exploitation.

• Performs well in practice even in complex scenarios.

• Easy to implement and works well with delayed feedback.
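A minimal sketch of the Beta-Bernoulli form of this procedure, assuming 0/1 rewards; random.betavariate from the Python standard library provides the posterior draws, and the function names are illustrative:

```python
import random

def thompson_select(successes, failures):
    """Draw one sample from each arm's Beta posterior and pick the largest draw."""
    samples = [random.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=lambda a: samples[a])

def thompson_update(successes, failures, arm, reward):
    """Bayesian update of the Beta posterior for a 0/1 reward."""
    if reward:
        successes[arm] += 1
    else:
        failures[arm] += 1

# Toy run with three Bernoulli arms.
true_means = [0.2, 0.5, 0.8]
successes, failures = [0, 0, 0], [0, 0, 0]
for _ in range(5000):
    arm = thompson_select(successes, failures)
    reward = 1 if random.random() < true_means[arm] else 0
    thompson_update(successes, failures, arm, reward)
print(successes)   # arm 2 accumulates by far the most successes
```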


UNIT – 3

21. Implement a basic RL algorithm to update the policy of an agent based on Q-learning.
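As one possible illustration, here is a minimal tabular Q-learning sketch on a toy one-dimensional corridor; the environment, the reward values, and the hyperparameters are all assumptions made for the example:

```python
import random
from collections import defaultdict

# Toy 1-D corridor: states 0..4, goal at state 4; actions 0 = left, 1 = right.
def step(state, action):
    next_state = min(4, state + 1) if action == 1 else max(0, state - 1)
    reward = 10.0 if next_state == 4 else -1.0      # goal reward vs. step penalty
    return next_state, reward, next_state == 4

Q = defaultdict(float)                               # Q[(state, action)]
alpha, gamma, epsilon = 0.1, 0.9, 0.1

for episode in range(500):
    state, done = 0, False
    while not done:
        if random.random() < epsilon:                # epsilon-greedy action selection
            action = random.choice([0, 1])
        else:
            action = max([0, 1], key=lambda a: Q[(state, a)])
        next_state, reward, done = step(state, action)
        best_next = max(Q[(next_state, a)] for a in [0, 1])
        # Q-learning update: Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = next_state

greedy_policy = [max([0, 1], key=lambda a: Q[(s, a)]) for s in range(5)]
print(greedy_policy)    # action 1 (move right) should dominate in states 0-3
```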

22. Design a simple MDP for a robotic agent navigating through a grid-based environment with rewards and penalties.
23. Assess the strengths and weaknesses of using deep neural networks as function approximators in RL algorithms.

Using deep neural networks (DNNs) as function approximators in reinforcement learning (RL) has enabled RL agents to solve complex, high-dimensional problems. This approach, known as Deep Reinforcement Learning (Deep RL), has been widely adopted in applications such as game playing (e.g., AlphaGo, DQN in Atari), robotics, and autonomous systems.

Strengths of Using Deep Neural Networks in RL

1. Generalization Across States: DNNs can generalize knowledge to unseen states, allowing the agent to make informed decisions even in unfamiliar parts of the state space.

2. Scalability to High-Dimensional Inputs: Suitable for handling complex inputs like images, videos, or sensor data where traditional tabular methods fail.

3. End-to-End Learning: DNNs enable agents to directly map raw inputs (e.g., pixels) to actions without needing hand-crafted features.

4. Reusability and Transfer Learning: Neural networks can be fine-tuned or transferred across related tasks, reducing the training time for new environments.

5. Powerful Representation Learning: DNNs learn abstract features from raw data, which can capture underlying patterns in the environment more effectively than linear methods.

Weaknesses of Using Deep Neural Networks in RL

1. Instability and Divergence: Training can be unstable due to correlated data, non-stationary targets, and high variance in updates.

2. Sample Inefficiency: Requires a large number of interactions with the environment to learn an effective policy compared to simpler algorithms.

3. Hyperparameter Sensitivity: Performance is highly sensitive to hyperparameters like learning rate, architecture, batch size, etc.

4. Lack of Theoretical Guarantees: Unlike tabular methods, convergence is not guaranteed, and it is hard to interpret what the network has learned.

5. Computationally Expensive: Training deep networks is resource-intensive and requires significant hardware (e.g., GPUs) and time.

24. Critique the effectiveness of the reward function in shaping the behaviour of an RL
agent in a complex environment.

The reward function is a fundamental component in reinforcement learning (RL), acting as the primary signal that shapes the behavior of an agent. Its design greatly influences how effectively the agent learns and how well it performs in a given environment, especially in complex scenarios.
Effectiveness of the Reward Function: A Critical Analysis

1. Importance of the Reward Function

• The reward function directly defines the objective for the agent.

• It guides learning by providing feedback based on the agent's actions.

• In complex environments, it determines what is considered success or failure.

Strengths of a Well-Designed Reward Function

• Clear Guidance: Helps the agent understand the goal, leading to faster and more stable learning.

• Efficient Learning: Well-shaped rewards accelerate convergence by providing consistent signals.

• Encourages Desired Behavior: Promotes strategies that align with the designer's intended outcomes.

• Task Decomposition: Intermediate rewards can help the agent break down a complex task into learnable subgoals.

Limitations and Challenges


• Sparse Rewards: Agent receives feedback too infrequently, making learning slow and exploration difficult.

• Deceptive Rewards: Misleading intermediate rewards can cause the agent to learn suboptimal or unintended behaviors.

• Reward Hacking: Agent may exploit loopholes in the reward function to maximize reward in unintended ways.

• Overfitting to Rewards: Agent may perform well in training but fail to generalize if it learns to over-optimize for a poorly designed reward signal.

• Difficult to Design: In complex environments, it's often hard to design a reward function that captures all desirable outcomes.

Example: Robotic Navigation


• Effective Reward: Gives small positive reward for moving toward the goal and a
large reward for reaching it.
• Problematic Reward: Only gives a reward at the goal, making learning difficult due
to sparse feedback.

Strategies to Improve Reward Effectiveness


1. Reward Shaping: Add intermediate rewards to guide learning without altering the
optimal policy.
2. Curriculum Learning: Start with simple tasks and gradually increase complexity.

3. Imitation or Inverse RL: Learn rewards from expert behavior instead of hand-
designing them.
4. Human Feedback: Incorporate feedback from human preferences to adjust reward
functions.

25. Design an RL framework for a real-world problem of your choice, specifying the
state space, action space, and reward function.(Refer Q.21)

26. Devise a novel algorithm that combines elements of both model-based and model-free RL approaches.

A novel algorithm that combines elements of model-based and model-free reinforcement learning (RL) is a hybrid approach known as:
Integrated Model-Augmented Actor-Critic (IMAAC)
Objective:

To leverage the sample efficiency and planning ability of model-based RL, while
preserving the stability and asymptotic performance of model-free methods like policy
gradients or Q-learning.

Key Components of IMAAC:

• Actor-Critic Framework: Uses a policy network (actor) and a value function (critic) to guide learning.

• Environment Model: A learned model of the environment's dynamics (transition and reward function) used for planning.

• Real + Simulated Rollouts: Combines real experiences from the environment with imagined experiences from the model.

• Hybrid Loss Function: Integrates model-free gradients and model-based planning updates into a unified training objective.

Algorithm Overview

Step 1: Collect Real Experience

• The agent interacts with the environment and stores tuples:


(state, action, reward, next state) in a replay buffer.

Step 2: Learn the Environment Model

• Train a neural network model f(s, a) → (s’, r) using the collected data.

• This model predicts:

o Next state s'

o Reward r

Step 3: Generate Simulated Rollouts


• Use the learned model to generate "imagined" trajectories from states in the replay
buffer.
• Simulated transitions:
(sim_state, sim_action, sim_reward, sim_next_state)

Step 4: Critic Update (Value Function)


• Update critic using both real and simulated transitions:
o Real: Target = r + γ * V(s')

o Simulated: Target = r_sim + γ * V(s'_sim)

Step 5: Actor Update (Policy Improvement)

• Use the critic to compute policy gradients from real transitions.


• Additionally, simulate future rewards using the model to improve the policy's
exploration and foresight.

Step 6: Planning-Augmented Actor Update


• Incorporate model-based planning via short rollouts to compute expected long-term
rewards and guide the policy more effectively.
Step 7: Repeat

• Periodically retrain the model with new data and continue policy learning.

Advantages of IMAAC

• Sample Efficiency: Simulated rollouts reduce the number of real interactions needed.

• Improved Exploration: The model enables foresight by simulating the effect of novel actions.

• Faster Convergence: Planning accelerates learning of optimal behavior.

• Flexible Policy Learning: Combines benefits of deep policy optimization and model-based updates.

• Reduced Variance in Updates: Synthetic data complements real data and stabilizes learning.

Possible Challenges and Solutions

• Inaccurate environment model: Use short simulated rollouts and train the model frequently with diverse data.

• Overfitting to simulated data: Balance real vs. simulated data with a mixing coefficient (e.g., α ∈ [0,1]).

• Computational complexity: Limit the number of imagined rollouts and use lightweight model architectures.

27. Given a scenario, analyze the impact of changing the discount factor (γ) on the
agent's decision-making process.

The discount factor (γ) in reinforcement learning is a crucial hyperparameter that significantly impacts how an agent values future rewards compared to immediate rewards. It directly influences the agent's behavior and learning process over time.

Understanding the Discount Factor (γ)

• Definition:
The discount factor γ ∈ [0, 1] determines the weight assigned to future rewards in
the return calculation.

• Return formula:

G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ⋯

where:

o R_t is the reward at time t

o γ is the discount factor
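A quick sketch of this return computation; the trajectory of rewards below is an assumed example matching the grid-robot scenario that follows:

```python
def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ..."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

trajectory = [-1, -1, -1, -1, 10]          # four -1 step penalties, then the +10 goal reward
for gamma in (0.1, 0.9, 0.99):
    print(gamma, round(discounted_return(trajectory, gamma), 3))
# gamma = 0.1 yields a negative return (the distant +10 is nearly invisible),
# while gamma near 1 makes the goal reward dominate the step penalties.
```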

Impact of Different Values of γ on Decision-Making

• γ ≈ 0 (short-sighted / myopic): Focuses only on immediate rewards. Ignores long-term consequences. May miss better long-term strategies.

• γ ≈ 1 (far-sighted / long-term planner): Strong emphasis on future rewards. Considers long-term consequences, which can lead to more optimal global policies.

• Moderate γ (e.g., 0.9, balanced approach): Balances short-term gain with long-term benefits. Often leads to stable and practical solutions.

Scenario Analysis: Robot Navigation

Scenario:
A robot in a grid world needs to reach a goal located far from the starting point. It receives:
• +10 reward for reaching the goal
• -1 reward for each step taken

Case 1: γ = 0.1 (Short-sighted)

• Robot avoids steps that incur small negative rewards.

• May fail to reach the goal as immediate penalties dominate decision-making.


• Learns suboptimal or local policies (e.g., staying in place to avoid penalties).

Case 2: γ = 0.99 (Far-sighted)

• Robot is willing to endure small negative rewards if it leads to a larger positive reward later.

• Prioritizes reaching the goal, even if it takes many steps.

• Learns optimal paths but may take longer to converge due to high reliance on future
predictions.

Case 3: γ = 0.9 (Balanced)

• Robot seeks the goal efficiently, tolerating some penalties.

• Learns a practical path to the goal with a good balance between short-term cost and
long-term gain.

Trade-offs and Considerations

• Convergence Speed: Low γ converges faster due to the focus on immediate rewards; high γ is slower, with more complex return estimation.

• Stability: Low γ is more stable in uncertain environments; high γ is less stable if future rewards are highly variable.

• Policy Optimality: Low γ is often suboptimal globally; high γ is more likely to achieve a globally optimal policy.

• Use Case Suitability: Low γ suits real-time, reactive tasks; high γ suits strategic, planning-based tasks.

28. Design a simple MDP for a robotic agent navigating through a grid-based environment with rewards and penalties. (Refer Q.22)

29. Describe the role of the reward function in RL and its importance in shaping agent
behaviour.
The reward function is one of the most critical components in Reinforcement Learning (RL).
It defines the goal of the agent by providing feedback from the environment about how good
or bad its actions are. The reward function essentially guides the agent's learning process and
shapes its behavior over time.

Role of the Reward Function in RL

1. Signal for Learning


The reward function returns a numerical signal (reward) after the agent performs an
action in a given state. This signal tells the agent whether the action taken was
beneficial or not.
2. Defines Desired Behavior
By designing specific rewards for specific outcomes, you implicitly define what the
agent should strive for. The agent learns to select actions that maximize the
cumulative reward over time.

3. Drives Policy Optimization


The agent updates its policy (its strategy) based on the rewards received. A well-
designed reward function helps the agent to converge towards an optimal policy
efficiently.

4. Guides Exploration and Exploitation


Rewards help the agent evaluate which actions are worth exploring and which ones
are good enough to exploit, impacting the exploration-exploitation trade-off.

Importance of the Reward Function in Shaping Agent Behavior

• Behavior Induction: The agent tends to repeat actions that yield high rewards and avoid those that result in penalties.

• Learning Efficiency: A clear, consistent reward function accelerates learning, while a poorly designed one leads to confusion or undesired behavior.

• Goal Alignment: It ensures the agent's actions are aligned with the task goals. For example, a robot should reach a destination while avoiding obstacles.

• Stability and Convergence: A well-defined reward structure helps in stabilizing the learning process and ensures convergence to an optimal policy.

• Trade-offs and Prioritization: When multiple objectives exist, the reward function helps balance them by assigning different weights or penalties.

Example: Self-driving Car

• Positive reward: Staying in the lane, obeying traffic signals, reaching the destination
quickly.
• Negative reward: Collisions, sudden braking, veering off the road.

The car learns to drive safely and efficiently only because these outcomes are explicitly
defined in the reward function.

30. a. Compare and contrast value iteration and policy iteration methods for solving
MDPs in RL.

b. Explain how reinforcement learning differs from supervised and unsupervised learning.

a. Compare and Contrast: Value Iteration vs Policy Iteration in MDPs

• Approach: Value Iteration combines policy evaluation and improvement into a single step; Policy Iteration alternates between full policy evaluation and policy improvement.

• Convergence Speed: Value Iteration may take more iterations, but each iteration is computationally cheaper; Policy Iteration needs fewer iterations, but each policy evaluation step can be computationally heavy.

• Computation: Value Iteration uses the Bellman optimality equation iteratively for value updates; Policy Iteration performs full evaluation of the current policy using the Bellman expectation equation.

• Stability: Value Iteration is more stable in environments with large or infinite state spaces; Policy Iteration can be unstable if policy evaluation is not performed accurately.

• Policy Extraction: In Value Iteration the policy is derived after the value function converges; in Policy Iteration the policy is updated in each iteration.

• Use Case: Value Iteration is preferred for large or continuous state spaces due to its simplicity; Policy Iteration is effective in small, discrete MDPs with limited states and actions.

• Initialization: Value Iteration starts with an arbitrary value function; Policy Iteration starts with an arbitrary policy.

b. How Reinforcement Learning Differs from Supervised and Unsupervised Learning

• Objective: RL learns optimal actions by maximizing cumulative rewards; supervised learning learns from labeled data to make predictions; unsupervised learning discovers hidden patterns or groupings in unlabeled data.

• Feedback Type: RL receives delayed rewards from the environment; supervised learning receives immediate, labeled feedback; unsupervised learning has no explicit feedback.

• Type of Learning: RL is trial-and-error learning; supervised learning learns from labeled examples; unsupervised learning learns from data structure.

• Interaction with Environment: RL actively interacts with the environment for learning; supervised learning learns passively from pre-existing data; unsupervised learning passively explores data patterns.

• Data Requirement: RL generates data through experience; supervised learning needs large labeled datasets; unsupervised learning works with raw, unlabeled data.

• Time Dependency: RL accounts for temporal credit assignment (reward over time); supervised learning has no time dimension involved; unsupervised learning is usually time-independent.

• Example Applications: RL is used in game AI, robotics, and self-driving cars; supervised learning in email spam detection and image recognition; unsupervised learning in customer segmentation and anomaly detection.

UNIT – 4

31. Assess the effectiveness of Dynamic Programming methods for solving large-scale
RL problems compared to other approaches, such as Monte Carlo methods.

Assessing the Effectiveness of Dynamic Programming (DP) Methods for Solving Large-
Scale RL Problems

Dynamic Programming (DP) methods are classical approaches for solving Reinforcement
Learning (RL) problems, particularly Markov Decision Processes (MDPs). They are highly
structured, mathematical techniques that break down a complex problem into simpler
subproblems, which can then be solved efficiently. However, when it comes to solving large-
scale RL problems, DP methods face certain challenges compared to other approaches, such
as Monte Carlo (MC) methods. Below is a detailed comparison of DP methods and Monte
Carlo methods for large-scale RL problems.

Dynamic Programming Methods


Dynamic Programming methods in RL primarily include Value Iteration and Policy
Iteration. These methods require a complete model of the environment (i.e., full
knowledge of the transition dynamics and reward functions), which is often not feasible in
large or real-world problems. The general steps involved in DP methods are:

1. Value Iteration: Iteratively updates the value function for each state until it converges
to the optimal value function.

2. Policy Iteration: Alternates between policy evaluation (calculating the value function
for the current policy) and policy improvement (updating the policy based on the
value function).
Strengths of DP Methods:

1. Convergence to Optimal Solution: DP methods, when applicable, guarantee convergence to the optimal policy and value function.

2. Efficiency in Known Environments: If the full transition model is known (i.e., the
environment is fully observable), DP methods can be highly efficient in determining
the optimal policy.

3. Structured Approach: DP provides a clear mathematical framework for solving


MDPs, offering a solid foundation for theoretical analysis.

4. Exact Solutions: DP methods give exact solutions, as they do not rely on


approximations or sampling.

Weaknesses of DP Methods:

1. Exponential Complexity: The primary drawback of DP is its high computational


complexity. As the size of the state or action space grows, the number of operations
required grows exponentially, making DP intractable for large-scale problems.

2. Need for a Complete Model: DP methods require a full model of the environment,
which may not always be available or feasible to compute, especially in real-world
problems.

3. Scalability Issues: DP is not well-suited for environments with large state spaces or
continuous spaces, as storing the entire state-value function and updating it becomes
impractical.

4. Limited Applicability in Unknown Environments: DP methods assume that the


transition and reward models are known, which is often not the case in real-world
scenarios.

Monte Carlo Methods


Monte Carlo methods, on the other hand, do not require a complete model of the
environment. Instead, they rely on sampling to estimate the value function or policy through
actual experience, making them more flexible and suitable for large-scale problems where the
full model of the environment is unknown or hard to compute.
Strengths of Monte Carlo Methods:
1. Model-Free: Monte Carlo methods do not need a full model of the environment.
They estimate the value function based on actual experiences and interactions with the
environment.

2. Scalability: Since MC methods only require sample-based estimates, they are more
scalable to large or continuous state spaces compared to DP methods.

3. Applicability to Real-World Problems: They are more applicable in real-world


scenarios where environments are dynamic and unknown, as they can work in
simulation environments where transition models are not fully known.

4. Simplicity: MC methods are easier to implement because they do not involve solving
a system of equations as in DP methods. They focus on sampling and averaging the
rewards over multiple episodes.

Weaknesses of Monte Carlo Methods:

1. Slow Convergence: Monte Carlo methods may require a large number of samples to
converge to an accurate estimate of the value function, especially in high-variance
environments.

2. High Variance: Since MC methods rely on sampling, they can suffer from high
variance in their estimates, leading to inefficient learning.

3. Delayed Feedback: MC methods depend on full episodes to estimate returns, which


may not provide immediate feedback and can be inefficient in scenarios requiring
more frequent updates.

4. Exploration Challenges: While MC methods allow exploration, they may not be as


effective as DP in quickly converging to an optimal policy in environments where the
state space is large and highly dynamic.

Comparison: DP vs Monte Carlo for Large-Scale RL Problems

• Model Requirement: DP requires a complete model of the environment (transition/reward); MC is model-free and uses experience (sampling).

• Computational Complexity: DP has a high computational cost due to exponential state-action space size; MC has a lower computational cost but requires many samples for convergence.

• Applicability to Large State Spaces: DP struggles with large or continuous state spaces; MC is more scalable for large or continuous state spaces.

• Exploration vs Exploitation: DP has no inherent exploration-exploitation trade-off and assumes full knowledge; MC explores through random sampling but can be inefficient.

• Convergence: DP guarantees convergence to the optimal policy if the model is known; MC may require many episodes to converge, with high variance in results.

• Feedback Type: DP assumes immediate feedback (from the model) and can be inefficient for real-time problems; MC relies on delayed feedback from sample paths and is more suitable for real-time learning.

• Environment Knowledge: DP needs a full environment model (transitions/rewards); MC does not need an environment model and is based purely on experience.

• Use in Real-World Problems: DP is limited due to the need for a full model; MC is more widely used in practice due to flexibility and scalability.

32. Design a new RL algorithm that combines Dynamic Programming and Temporal
Difference methods to address a specific challenge in a complex environment.

In Reinforcement Learning (RL), there are various approaches to solving problems involving
large or complex environments. Two such approaches are Dynamic Programming (DP) and
Temporal Difference (TD) learning. DP methods, such as Value Iteration and Policy Iteration,
require a complete model of the environment, while TD methods, like Q-learning and
SARSA, can learn directly from interaction with the environment without requiring a model.

Combining both DP and TD can help address challenges where we want the benefits of a
model-based approach (like DP) with the flexibility and scalability of TD methods (that
work well in unknown environments). In this scenario, the algorithm will aim to:

1. Leverage known information about the environment (model) to accelerate learning when available.

2. Handle unknown or dynamic environments through online learning from experience.

Hybrid Algorithm: Model-Enhanced Temporal Difference (METD)

This new algorithm, Model-Enhanced Temporal Difference (METD), combines Dynamic Programming (DP) techniques to utilize available model information with Temporal
Difference (TD) learning, which does not require a model. It adapts based on the
environment's reliability (whether the model is accurate or not) and transitions between
model-based learning and model-free learning dynamically.

Key Components of METD:


1. Model-Based Learning:
o Initially, the algorithm leverages available transition models (state transition
probabilities and rewards) to perform DP-based updates (similar to Value
Iteration or Policy Iteration).

o The algorithm uses model-based updates to quickly propagate information


and improve the value function or policy when reliable model information is
available.

2. Model-Free TD Updates:
o If the model is incomplete or unreliable (due to noisy or incomplete
observations), the algorithm shifts to TD updates (like Q-learning or
SARSA).

o The TD updates help the algorithm learn directly from the environment
without relying on transition models, adapting to changes in the environment
over time.

3. Dynamic Switching Mechanism:

o The algorithm includes a dynamic switching mechanism to switch between


DP and TD based on the confidence in the model's accuracy.

o Model Confidence Metric: This metric estimates how reliable the current
model (transition and reward function) is based on observed feedback (e.g.,
prediction errors or discrepancies between expected vs. observed outcomes).

o Threshold for Switching: If the model's confidence falls below a threshold,


the algorithm will lean more heavily on TD methods. If the confidence is high,
the algorithm will rely more on model-based DP methods.

4. Learning Rate and Exploration-Exploitation Balancing:


o Learning Rate Adaptation: The algorithm adapts its learning rate based on
whether it is using model-based or model-free updates. For model-based
updates, it may use a higher learning rate to rapidly propagate information,
while for TD methods, it uses a lower learning rate to ensure stability.

o Exploration-Exploitation: When the model is highly uncertain, the algorithm


emphasizes exploration (more exploratory actions), whereas, in a confident
model state, it focuses more on exploitation (maximizing known rewards).

Steps of the METD Algorithm:

1. Initialize:

o Initialize state values or Q-values based on the model (if available).

o Initialize model confidence based on available environmental models.


o Set exploration-exploitation parameters (e.g., epsilon for epsilon-greedy).
2. Model-Based Updates (when model confidence is high):

o Use DP methods (e.g., Value Iteration) to compute the value function or


update policies based on known transition dynamics and reward models.

3. Model-Free TD Updates (when model confidence is low):

o Use TD learning methods such as Q-learning to update value estimates based


on real-time interactions and feedback from the environment.

4. Switching Condition:

o After each episode or set of transitions, compute the model's confidence by


comparing the predicted vs. actual outcomes.

o If the model’s confidence is above a predefined threshold, continue model-


based updates. Otherwise, switch to TD learning.

5. Repeat:

o Continue alternating between model-based DP updates and model-free TD


updates as the environment evolves and the model’s accuracy changes.
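Below is a minimal sketch of the dynamic switching idea in METD. The helpers (model.predict_reward, model.transition_probs, model.reward) and the prediction-error-based confidence score are hypothetical assumptions used only to illustrate the control flow, not a full implementation.

def metd_step(state, action, reward, next_state, Q, model, confidence,
              threshold=0.7, alpha_td=0.1, gamma=0.95):
    """One METD update: use the model when it is trusted, fall back to TD otherwise."""
    # Update model confidence from the prediction error on this transition
    predicted_reward = model.predict_reward(state, action)       # hypothetical model API
    error = abs(predicted_reward - reward)
    confidence = 0.9 * confidence + 0.1 * (1.0 / (1.0 + error))  # moving average in (0, 1]

    if confidence >= threshold:
        # Model-based (DP-style) backup over the model's transition distribution
        Q[state][action] = sum(
            p * (model.reward(state, action, s2) + gamma * max(Q[s2].values()))
            for s2, p in model.transition_probs(state, action).items()
        )
    else:
        # Model-free TD (Q-learning) update from the observed transition
        td_target = reward + gamma * max(Q[next_state].values())
        Q[state][action] += alpha_td * (td_target - Q[state][action])

    return confidence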

Advantages of METD:

1. Scalability: METD is more scalable than purely DP-based approaches since it can
handle large or continuous state spaces by relying on TD methods when the model is
not reliable.

2. Flexibility: The algorithm can adapt to both known and unknown environments.
When a reliable model is available, it can benefit from DP’s structured updates. When
the model is inaccurate or incomplete, it switches to model-free learning.
3. Improved Learning Efficiency: By combining model-based and model-free
learning, METD can converge faster than pure TD methods in environments where a
reliable model is available but still handle model-free learning when the model is
unreliable.

4. Real-Time Adaptation: The ability to dynamically switch between model-based and


model-free approaches makes METD highly adaptive to changing environments,
allowing it to adjust as more information becomes available.

5. Reduced Exploration Cost: The model-based component can speed up learning by


guiding exploration towards more promising areas of the state space, reducing the
amount of random exploration needed in the early stages of training.

Use Case: Robotics or Autonomous Vehicles


Consider a scenario in robotics or autonomous vehicles where an agent navigates through a
dynamic environment (e.g., traffic conditions, obstacles). The robot may have some
knowledge of its environment (e.g., traffic laws, road layouts) but must also adapt to real-
time changes (e.g., other vehicles' behavior, pedestrian movement).

• Model-Based Component: The robot can initially rely on known maps, traffic rules,
and dynamic models to make high-level decisions.

• Model-Free Component: When the robot encounters unpredictable elements (e.g.,


sudden changes in traffic or obstacles), it switches to model-free learning to adapt its
behavior based on actual observations.
In this setting, METD would allow the robot to efficiently use its prior knowledge while still
being capable of adapting to unforeseen situations without requiring an exhaustive model of
every possible interaction.

33. Create a novel RL scenario where the Bellman Optimality equation needs to be
modified to accommodate additional constraints.

Novel RL Scenario: Warehouse Inventory Management

In this scenario, the goal is to manage the inventory of products in a warehouse using
reinforcement learning (RL). The agent's task is to decide how much of each product to
restock at different times to maximize profit while adhering to constraints such as storage
limits, budget, and demand satisfaction.

Scenario Description:
• State Space: The state represents the current inventory levels of various products in
the warehouse, the available budget for restocking, and the historical demand for each
product.

• Action Space: The actions represent the amount of each product to restock at the
current time.

• Rewards: The agent earns rewards based on the profit it makes by restocking
products, but penalties are applied if it overshoots the budget or exceeds storage
capacity. Additionally, there's a penalty if demand is not satisfied.

Additional Constraints:

1. Storage Limits: Each product has a maximum storage capacity in the warehouse. The
agent cannot restock more than the available storage space.

2. Budget Constraints: The agent has a limited budget for restocking products, which it
cannot exceed.

3. Demand Satisfaction: The agent should aim to meet customer demand, but if it
overestimates demand, it wastes resources. If it underestimates demand, it loses
potential profit.
Modifying the Bellman Optimality Equation
In a typical reinforcement learning setting, the agent aims to maximize rewards over time.
The Bellman equation generally helps compute the expected cumulative reward, but here we
need to modify it to account for the constraints.

The usual Bellman Optimality equation is:

V*(s) = max_a [ R(s, a) + γ Σ_{s′} P(s′|s, a) V*(s′) ]

To incorporate the constraints of storage limits, budget, and demand satisfaction, we adjust the reward R(s, a) as follows:

1. Storage Limits: If the agent tries to restock more than the available storage space, it gets a penalty in the reward.

R(s, a) = Base Reward − λ_storage · (penalty for excess storage)

2. Budget Constraints: If the agent exceeds its restocking budget, it gets a penalty.

R(s, a) = Base Reward − λ_budget · (penalty for exceeding the budget)

3. Demand Satisfaction: If the agent doesn't meet the required demand, it incurs a penalty; if it overestimates demand, it wastes resources.

R(s, a) = Base Reward − λ_demand · (penalty for unsatisfied or overestimated demand)

Thus, the modified Bellman equation for this scenario would be:

V*(s) = max_a [ R(s, a) + γ Σ_{s′} P(s′|s, a) V*(s′) ]

where R(s, a) now incorporates the penalties for violating the storage, budget, or demand constraints.
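As an illustration, a constraint-adjusted reward of this form could be computed as in the sketch below; the penalty weights and the helper quantities (excess_storage, budget_overrun, unmet_demand) are assumptions chosen purely for the example.

def penalized_reward(base_profit, excess_storage, budget_overrun, unmet_demand,
                     lambda_storage=10.0, lambda_budget=5.0, lambda_demand=8.0):
    """Return the constraint-adjusted reward R(s, a) used in the modified Bellman equation."""
    penalty = (lambda_storage * max(0.0, excess_storage)    # units above storage capacity
               + lambda_budget * max(0.0, budget_overrun)   # money spent beyond the budget
               + lambda_demand * max(0.0, unmet_demand))    # demand left unsatisfied
    return base_profit - penalty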

34. Analyze how the Bellman Optimality equation changes when the environment has
stochastic transitions and rewards.

When the environment has stochastic transitions and rewards, the Bellman Optimality
Equation must account for the inherent randomness in both the state transitions and the
rewards associated with actions. This change reflects the uncertainty in the outcomes of
taking specific actions in certain states.
Original Bellman Optimality Equation (Deterministic Case)
In a deterministic environment, the Bellman Optimality equation is given by:

V(s) = max_a [ R(s, a) + γ V(s′) ]

Where:

• V(s) is the value of state s,

• R(s, a) is the immediate reward when taking action a in state s,

• γ is the discount factor,

• s′ is the next state.

In this case, the next state s′ is fully determined by the current state s and the action a. The reward R(s, a) is also deterministic, meaning it does not vary based on any probabilistic factors.

Modified Bellman Optimality Equation (Stochastic Case)

When the environment is stochastic, the transition between states and the rewards are not
deterministic. This means that:

• The agent cannot be sure of the next state given the current state and action. There is a
probability distribution over the next states.

• The reward for taking an action in a state also has a probability distribution.

Thus, the Bellman Optimality equation needs to incorporate expectations to account for the
randomness.

The modified Bellman Optimality equation becomes:

V(s) = max_a [ E[R(s, a)] + γ E[V(s′)] ]

Where:

• E[R(s, a)] is the expected immediate reward when taking action a in state s,

• E[V(s′)] is the expected value of the next state s′, considering the probability of transitioning to different states.

Breaking Down the Changes:

1. Stochastic Transitions:

o In a stochastic environment, taking action a in state s results in a probability distribution over possible next states s′. This is represented as P(s′|s, a), the probability of transitioning from state s to state s′ when action a is taken.

o The expected value of the next state, E[V(s′)], is computed as a weighted sum over all possible next states, where the weights are the transition probabilities:

E[V(s′)] = Σ_{s′} P(s′|s, a) V(s′)

2. Stochastic Rewards:

o Similarly, the reward R(s, a) is no longer deterministic and must be represented as a random variable with a probability distribution. The expected reward E[R(s, a)] is computed as the weighted average of possible rewards, where the weights are the probabilities of each outcome:

E[R(s, a)] = Σ_r P(r|s, a) · r

Where:

• P(r|s, a) is the probability of receiving reward r when taking action a in state s.

Final Modified Bellman Optimality Equation:

Taking both stochastic transitions and stochastic rewards into account, the Bellman Optimality equation becomes:

V(s) = max_a [ Σ_r P(r|s, a) · r + γ Σ_{s′} P(s′|s, a) V(s′) ]

Where:

• Σ_r P(r|s, a) · r is the expected reward for taking action a in state s,

• Σ_{s′} P(s′|s, a) V(s′) is the expected value of the next state, considering the probability distribution over states.
Impact of Stochasticity on the Agent's Decision-Making:
• Uncertainty in Decision-Making: The agent must make decisions based on expected
values, taking into account the probabilities of different outcomes rather than
deterministic results.

• Risk Sensitivity: The agent's decisions will be influenced by the distribution of


possible rewards and transitions, and the agent might need to account for risk
preferences (e.g., whether to take an action with high variance in rewards but
potentially high reward vs. a more stable, low-reward action).

This modified Bellman Optimality equation helps the agent adapt to uncertain environments,
where outcomes are not fully predictable and must be evaluated in terms of expected values.
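To make the expectation concrete, here is a minimal sketch of a single stochastic Bellman backup for a toy two-state MDP; the transition probabilities and rewards are made up purely for illustration.

gamma = 0.9
V = {"s1": 0.0, "s2": 0.0}                      # current value estimates

# Hypothetical model for state "s1": P(s'|s, a) and expected rewards E[R(s, a)]
transitions = {"a1": {"s1": 0.2, "s2": 0.8},    # a1 usually reaches s2
               "a2": {"s1": 0.9, "s2": 0.1}}    # a2 mostly stays in s1
expected_reward = {"a1": 1.0, "a2": 0.5}

# One Bellman backup for "s1": max over actions of E[R] + gamma * E[V(s')]
V["s1"] = max(
    expected_reward[a] + gamma * sum(p * V[s_next] for s_next, p in transitions[a].items())
    for a in transitions
)
print(V["s1"])   # 1.0 on the first sweep, since all values were initialised to 0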
35. Apply Temporal Difference learning to update the value function for a specific state
in an RL task.
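As an illustration of the idea, a single TD(0) update of the value of one state can be sketched as follows; the state values, observed reward, step size α and discount γ below are assumptions made purely for the example.

alpha, gamma = 0.1, 0.9          # assumed step size and discount factor
V = {"A": 0.0, "B": 2.0}         # assumed current value estimates

# Observed transition: from state A the agent receives reward -1 and lands in state B
state, reward, next_state = "A", -1.0, "B"

# TD(0) update: V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s))
td_error = reward + gamma * V[next_state] - V[state]
V[state] += alpha * td_error
print(V["A"])    # 0.0 + 0.1 * (-1 + 0.9*2.0 - 0.0) = 0.08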
36. Given a simple RL environment, demonstrate how you would apply Dynamic
Programming methods to find the optimal value function
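As a hedged sketch of how Dynamic Programming (value iteration) could be applied, the loop below assumes a small environment described by dictionaries P (transition probabilities) and R (rewards); the concrete numbers would come from the chosen environment.

def value_iteration(states, actions, P, R, gamma=0.9, theta=1e-6):
    """P[s][a] is a list of (prob, next_state) pairs; R[s][a] is the immediate reward."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Bellman optimality backup for state s
            best = max(R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a]) for a in actions)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            break
    # Greedy policy extraction from the converged value function
    policy = {s: max(actions, key=lambda a: R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a]))
              for s in states}
    return V, policy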
This gives the optimal action to take for each state based on the final value function.

37. a. How does the Bellman Optimality equation help in finding the optimal policy in
RL problems?

b. Explain the fundamental difference between Dynamic Programming and Temporal


Difference methods for RL.
(b)
38. Compare the exploration-exploitation dilemma in Temporal Difference learning with
the concept of "horizon" in Dynamic Programming. How do these two aspects impact
the learning process and decision-making in RL?

The exploration-exploitation dilemma in Temporal Difference (TD) learning and the concept
of the "horizon" in Dynamic Programming (DP) both play crucial roles in reinforcement
learning (RL), though they address different aspects of the learning and decision-making
process. Below is a detailed comparison and explanation of their individual impact and
interaction in RL.

Exploration-Exploitation Dilemma in Temporal Difference Learning

Definition: The challenge of balancing exploration (trying new actions to discover their rewards) versus exploitation (choosing actions known to yield high rewards) during learning.

Where It Appears: Prominent in model-free methods like TD learning, where the agent learns directly from interactions with the environment without a model.

Example: An agent must decide whether to explore a new path (potentially better reward) or stick to a known path with a decent reward.

Impact: If the agent explores too much, learning is slow; if it exploits too early, it may converge to a suboptimal policy. Proper exploration strategies (like ε-greedy) are essential.

Goal: To ensure that the agent gathers enough information about the environment to learn the optimal policy.

Horizon in Dynamic Programming

Definition: The planning depth or the number of future steps the agent considers while making decisions. It reflects how far into the future the value of rewards is considered.

Where It Appears: In model-based methods like DP, where a complete model of the environment is known and planning is possible.

Types: Finite Horizon considers only a limited number of future steps; Infinite Horizon considers an unlimited number of future steps, usually discounted by a factor γ.

Impact: A short horizon may lead to greedy or short-sighted policies, while a long horizon ensures long-term reward maximization.

Goal: To optimize policies with respect to cumulative rewards over time, balancing immediate and future returns.

Comparison and Combined Impact

Core Idea: Exploration-Exploitation (TD) is about balancing information gathering vs. reward maximizing; Horizon (DP) is about planning how far into the future to optimize.

Nature: Exploration-Exploitation is behavioral (influences what the agent does); Horizon is computational (influences what the agent plans).

Impact on Learning: Exploration-Exploitation affects how well the agent discovers good policies; Horizon affects the depth and quality of the learned policy.

Relevance to Policy Quality: Exploration-Exploitation ensures the agent doesn't get stuck with suboptimal actions; Horizon ensures the agent plans beyond immediate rewards.

Common Tools: Exploration-Exploitation uses ε-greedy, softmax, UCB; Horizon uses the discount factor (γ) and Bellman equations.

Final Insight

• Exploration-exploitation is about how an agent gathers knowledge from the


environment.

• Horizon is about how far ahead the agent considers future consequences when
planning.

• Both influence decision-making:

o If exploration is poor, even the best-planned horizon won't help (the agent may
not discover better strategies).

o If the horizon is too short, the agent may ignore the long-term benefits of
explored actions.

Thus, effective RL requires a synergy between good exploration strategies and appropriate
horizon planning to achieve optimal learning and decision-making.
39. How does the concept of "Bellman backup" play a crucial role in both Dynamic
Programming and Temporal Difference methods? Can you provide an example of how
this backup process is applied in a specific RL scenario?

The concept of Bellman backup is fundamental in Reinforcement Learning (RL) as it


underpins how value functions are updated during learning. It is at the core of both Dynamic
Programming (DP) and Temporal Difference (TD) methods, though it is applied differently in
each.

What is Bellman Backup?

A Bellman backup is the process of updating the estimated value of a state (or state-action
pair) based on:

• Immediate reward, and

• Estimated value of successor states.


It uses the Bellman equation, which expresses the relationship between the value of a state
and the values of its possible next states.
Role of Bellman Backup in RL Methods

Dynamic Programming: Performs full backups using the entire model of the environment (i.e., known transition probabilities and rewards).

Temporal Difference (TD) Learning: Performs sample-based backups using actual experience, without needing a model.

Bellman Backup Equations

Let’s define:
• V(s): Value of state s

• R(s, a): Immediate reward for taking action a in state s

• P(s′|s, a): Probability of moving to state s′

• γ: Discount factor

• Q(s, a): Action-value function

1. In Dynamic Programming (Value Iteration):

V(s) ← max_a [ R(s, a) + γ Σ_{s′} P(s′|s, a) V(s′) ]

This is a full Bellman backup using all possible future states.

2. In Temporal Difference Learning (TD(0)):

V(s) ← V(s) + α [ r + γ V(s′) − V(s) ]

This is a sampled Bellman backup, using a single transition from experience.

Example Scenario

Let’s consider a simple gridworld where an agent can move up, down, left, or right. Each
action gives a reward of -1, and reaching the goal gives +10.

Dynamic Programming:

• Agent has access to the full environment model (knows all transition probabilities).

• It performs Bellman backups for all states by computing expected values over all
possible outcomes.
Example: the value of each grid cell is updated from the rewards and expected values of its neighbouring cells using the backup above. This is done iteratively over the whole grid.

Temporal Difference:

• Agent interacts with the environment.

• From experience: Suppose the agent takes action Right from state A, receives reward
-1, and ends up in state B.

It updates:

V(A) ← V(A) + α [ −1 + γ V(B) − V(A) ]

This is a Bellman backup using one sample (TD learning).

Why It Matters

• Bellman backups allow RL agents to propagate value estimates backward from goal
states to earlier states.

• In DP, they are accurate but require full knowledge.

• In TD, they are approximate but more scalable and suitable for unknown
environments.
40. The Bellman Optimality equation is a fundamental concept in RL. How does it
mathematically express the principle of optimality, and how is it used to find the
optimal policy in a Markov Decision Process (MDP)?
UNIT – 5

41. Design an RL agent to navigate a grid world using Fitted Q-learning with function
approximation.
42. Implement the Deep Q-Network (DQN) algorithm to solve a continuous action space
problem.
43. Develop a Policy Gradient algorithm to train a robotic arm to reach a target in a
simulated environment.
4. Implementation Overview

for episode in range(num_episodes):
    states, actions, rewards = [], [], []
    state = env.reset()

    for t in range(max_steps):
        # Sample an action from the current (stochastic) policy
        action = sample_action(policy_network, state)
        next_state, reward, done, _ = env.step(action)

        states.append(state)
        actions.append(action)
        rewards.append(reward)
        state = next_state

        if done:
            break

    # Compute discounted returns for the episode
    returns = compute_discounted_returns(rewards, gamma)

    # Update policy parameters with the policy-gradient step
    update_policy(policy_network, states, actions, returns, learning_rate)
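The helper compute_discounted_returns used above is not defined in the snippet; one plausible implementation of the standard discounted-return calculation is sketched below.

def compute_discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for every time step."""
    returns = []
    g = 0.0
    for r in reversed(rewards):      # accumulate from the end of the episode backwards
        g = r + gamma * g
        returns.append(g)
    returns.reverse()                # restore chronological order
    return returns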

44. Analyze the impact of using different function approximation architectures in Fitted
Qlearning.

In Fitted Q-Learning, the use of different function approximation architectures


significantly impacts the learning stability, generalization ability, and sample efficiency of
the algorithm. Fitted Q-Learning is a variant of Q-learning where a function approximator
(like a neural network or decision tree) is trained to estimate the Q-values based on a batch of
experience tuples.
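To ground this, below is a minimal sketch of one Fitted Q-iteration step on a batch of (s, a, r, s′, done) tuples using a tree-based regressor; the choice of scikit-learn's ExtraTreesRegressor, the array shapes, and the assumption that q_model has already been fitted once are all illustrative, not prescriptive.

import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def fitted_q_iteration_step(batch, q_model, actions, gamma=0.99):
    """One Fitted Q-learning sweep: build Bellman targets from the batch, then refit the regressor."""
    states, acts, rewards, next_states, dones = batch   # arrays of equal length

    # Q-values of every action in the next states, predicted by the current model
    next_q = np.column_stack([
        q_model.predict(np.column_stack([next_states, np.full(len(next_states), a)]))
        for a in actions
    ])
    targets = rewards + gamma * (1.0 - dones) * next_q.max(axis=1)

    # Refit a fresh approximator on (state, action) -> Bellman target
    new_model = ExtraTreesRegressor(n_estimators=50)
    new_model.fit(np.column_stack([states, acts]), targets)
    return new_model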

Impact of Different Function Approximators in Fitted Q-Learning

Linear Approximator
  Description: Approximates Q(s, a) as a linear combination of features.
  Impact: Fast, interpretable, low computation cost.
  Challenges: Poor performance in high-dimensional, non-linear problems.

Decision Trees / Ensembles (e.g., Random Forests)
  Description: Non-linear, interpretable models trained on batch data.
  Impact: Can handle discrete state spaces well; good for tabular-like environments.
  Challenges: Limited scalability; poor performance in continuous action spaces.

Neural Networks (DNNs)
  Description: Multi-layered non-linear models (e.g., MLPs).
  Impact: Excellent generalization in complex, high-dimensional state-action spaces.
  Challenges: Prone to overfitting or instability (requires careful tuning, regularization).

Convolutional Neural Networks (CNNs)
  Description: Specialized for spatial/visual input like images.
  Impact: Ideal for pixel-based environments (e.g., Atari).
  Challenges: Requires large data and compute; sensitive to architecture choice.

Recurrent Neural Networks (RNNs / LSTMs)
  Description: Capture temporal dependencies in input sequences.
  Impact: Useful in partially observable environments (e.g., the agent doesn't see the full state).
  Challenges: Complex training; risk of vanishing gradients.

Key Considerations:

1. Generalization: Deeper models (e.g., neural nets) tend to generalize better with
enough data but may overfit on small datasets.

2. Stability: Using deep networks requires techniques like target networks and
experience replay to stabilize training.

3. Sample Efficiency: Some models (like trees or linear models) may converge faster in
simple domains due to lower capacity.

4. Computation Cost: Neural networks demand more compute than linear models or
tree-based models.
Summary

The choice of function approximation architecture in Fitted Q-learning must balance


complexity, data availability, and computational resources:

• For simple or low-dimensional tasks, linear models or decision trees may suffice.

• For complex, high-dimensional tasks, deep neural networks (with CNNs or RNNs
where appropriate) are more suitable but require stabilization techniques.

• The right choice depends on the nature of the environment, the structure of the
state space, and the amount of data available.

45. Assess the effectiveness of using Eligibility Traces for updating Q values in a
dynamic environment.
46. Evaluate the performance of Deep Q-Network (DQN) compared to Fitted Q-learning
in a grid world scenario with a large state space.
47. Devise a novel function approximation method for handling continuous state spaces
in RL.
48. a. Compare the advantages and disadvantages of Eligibility Traces and Function
Approximation in RL.

b. How does Fitted Q-learning leverage the concept of experience replay?

a. Comparison of Eligibility Traces vs. Function Approximation in Reinforcement


Learning

Purpose: Eligibility Traces help speed up learning by bridging short-term and long-term rewards; Function Approximation generalizes learning to large or continuous state spaces.

Type: Eligibility Traces are a temporal credit assignment method; Function Approximation is a state (and action) generalization method.

Memory: Eligibility Traces require maintaining traces for visited states/actions; Function Approximation requires maintaining weights/parameters only.

Works Well When: Eligibility Traces suit small or discrete state spaces; Function Approximation suits large or continuous state/action spaces.

Advantage: Eligibility Traces accelerate learning by updating multiple prior states; Function Approximation handles infinite/large input spaces and unseen states.

Disadvantage: Eligibility Traces become inefficient or memory-heavy for large spaces; Function Approximation may suffer from approximation error or divergence.

Examples: Eligibility Traces: TD(λ), Sarsa(λ), Q(λ); Function Approximation: linear models, neural networks (DQN, Actor-Critic).
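As a concrete illustration of the eligibility-trace mechanism listed above (TD(λ) with accumulating traces), one update step can be sketched as follows; the states, trace decay λ, step size and discount are assumptions for the example.

alpha, gamma, lam = 0.1, 0.9, 0.8        # assumed step size, discount, trace decay
V = {s: 0.0 for s in ["A", "B", "C"]}    # value estimates
e = {s: 0.0 for s in V}                  # eligibility traces, one per state

def td_lambda_step(state, reward, next_state):
    """One TD(lambda) update with accumulating traces: every recently visited state is credited."""
    delta = reward + gamma * V[next_state] - V[state]   # TD error for this transition
    e[state] += 1.0                                      # accumulate trace for the visited state
    for s in V:
        V[s] += alpha * delta * e[s]                     # update all states in proportion to their trace
        e[s] *= gamma * lam                              # decay traces towards zero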

b. How Does Fitted Q-Learning Leverage Experience Replay?

Fitted Q-Learning is a batch variant of Q-learning that uses a fixed dataset of experiences to
train a function approximator (e.g., a neural network or regression model) to estimate the Q-
function.
Experience Replay in Fitted Q-Learning:

• Experience Replay stores past experiences in a buffer:


(s,a,r,s′)
• Instead of learning from sequential, on-policy data (which is correlated), Fitted Q-
Learning:
1. Samples a batch of experiences randomly from this buffer

2. Uses the batch to fit (or re-fit) a Q-function approximator by minimizing the Bellman error:

   L = ( r + γ max_{a′} Q(s′, a′) − Q(s, a) )²

3. Updates the model iteratively using all past experiences, not just recent ones.

Benefits:

• Reduces variance and improves sample efficiency

• Breaks correlation between sequential samples

• Allows for offline learning or reusing old data
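A minimal sketch of the replay buffer used for this store-and-sample pattern is shown below; the buffer size and batch size are illustrative assumptions.

import random
from collections import deque

replay_buffer = deque(maxlen=100_000)          # stores (s, a, r, s_next, done) tuples

def store(transition):
    replay_buffer.append(transition)

def sample_batch(batch_size=64):
    """Uniformly sample past experiences, breaking the correlation of sequential data."""
    batch = random.sample(list(replay_buffer), k=min(batch_size, len(replay_buffer)))
    return list(zip(*batch))                   # -> (states, actions, rewards, next_states, dones)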

49. a. What are the main advantages and limitations of Fitted Q learning compared to
DQN? b. In which scenarios would you prefer to use Fitted Q learning over DQN and
vice versa?
a. Advantages and Limitations of Fitted Q-Learning Compared to DQN

Algorithm Type: Fitted Q-Learning is batch Q-learning (offline or iterative training on fixed datasets); DQN is online Q-learning with deep neural networks.

Stability: Fitted Q-Learning is more stable due to batch updates and fixed datasets; DQN is less stable and needs tricks like target networks and experience replay.

Sample Efficiency: Fitted Q-Learning is high, especially in offline settings; DQN is lower, especially without experience replay.

Function Approximation: Fitted Q-Learning can use any regressor (e.g., decision trees, linear models); DQN uses deep neural networks.

Implementation: Fitted Q-Learning is conceptually simpler; DQN is more complex due to the deep learning infrastructure.

Flexibility: Fitted Q-Learning can handle different function approximators easily; DQN is tightly coupled with neural network models.

Limitations: Fitted Q-Learning is not designed for continual/online learning and requires complete episodes or datasets; DQN is more suited for interactive/online environments and can learn from streaming data.

b. When to Prefer Fitted Q-Learning vs. DQN

Offline RL (using logged data from past experiences): prefer Fitted Q-Learning, since it works well on batch data and is more sample-efficient.

Large-scale continuous state spaces: prefer DQN, since neural networks can generalize better in high-dimensional spaces.

Environments with limited data or safety constraints: prefer Fitted Q-Learning, since it allows safe training without interacting with the environment.

Real-time or online learning: prefer DQN, since it is designed for online interaction and continual updates.

Low computational resources: prefer Fitted Q-Learning, since it can use simpler models than deep neural nets.

High-performance deep learning environment available: prefer DQN, since deep networks can approximate complex Q-functions effectively.

50. How do Policy Gradient algorithms and Least Squares Methods handle the
exploration-exploitation trade-off differently?

Policy Gradient algorithms and Least Squares methods approach the exploration-exploitation
trade-off in reinforcement learning (RL) from fundamentally different angles. Here's an
elaborated comparison of how each handles this trade-off:

1. Policy Gradient Algorithms:

Approach:

• Policy Gradient methods directly optimize the policy (i.e., the probability distribution
over actions) using gradient ascent on the expected return.

Exploration-Exploitation Handling:

• Exploration is built into the stochastic policy.

o These algorithms often use a soft policy like the softmax or Gaussian
distribution to sample actions.

o This means even if one action is preferred, there's still a probability of


choosing less-optimal ones, allowing exploration.

• The exploration level can be controlled using parameters like the temperature in
softmax or variance in Gaussian policies.

• There's no explicit ε-greedy mechanism, unlike value-based methods.
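To illustrate how exploration is built into a stochastic policy, the sketch below samples a discrete action from softmax preferences; the preference values and temperature are assumptions chosen for the example.

import numpy as np

def sample_softmax_action(preferences, temperature=1.0):
    """Sample an action from a softmax distribution over action preferences (logits)."""
    prefs = np.asarray(preferences) / temperature
    prefs -= prefs.max()                      # subtract the max for numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return np.random.choice(len(probs), p=probs)

# Even though action 0 is preferred, the other actions keep non-zero probability,
# so the agent still explores them occasionally.
action = sample_softmax_action([2.0, 1.0, 0.5], temperature=1.0)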

Advantages in Trade-off:

• Continuous and smooth exploration.


• Suited for continuous action spaces.
• Naturally balances exploration and exploitation through policy updates.

Limitations:

• May require careful tuning of entropy regularization or variance parameters to ensure


sufficient exploration.

• Exploration may diminish too quickly if the policy converges prematurely.

2. Least Squares Methods (e.g., Least Squares Policy Iteration, LSTD):

Approach:
• These methods use function approximation to estimate the value function or Q-
function based on a batch of experiences, usually using linear regression or similar
techniques.

Exploration-Exploitation Handling:

• Exploration is not inherent in the algorithm.


o They rely on exploration in the data used for training.

o For example, an ε-greedy or Boltzmann strategy must be used externally to


gather diverse samples.

• These are typically off-policy methods, meaning the behavior policy (used for data
collection) can be different from the target policy (being learned).

Advantages in Trade-off:

• Very sample efficient if the dataset has good coverage of the state-action space.

• Can converge quickly with good data.

Limitations:

• Highly sensitive to the quality of exploration in the data.

• If the training data is biased toward exploitation (not enough exploration), the model
might fail to learn the optimal policy.

Summary Comparison Table:

Nature of Policy: Policy Gradient is stochastic (exploration built-in); Least Squares is deterministic (relies on the behavior policy).

Exploration Mechanism: Policy Gradient explores via the softmax/Gaussian distribution; Least Squares requires an external exploration strategy.

Control Over Exploration: Policy Gradient controls it through entropy or variance tuning; Least Squares through behavior policy design.

Data Efficiency: Policy Gradient is less efficient and needs more samples; Least Squares is more efficient with well-explored data.

Action Space Support: Policy Gradient is good for continuous action spaces; Least Squares is typically for discrete actions.

Risk of Premature Exploitation: Policy Gradient is moderate (depends on entropy decay); Least Squares is high if the data lacks exploration.
