
UNIT – IV

BOOTSTRAPPING IN REINFORCEMENT LEARNING


Bootstrapping is a key concept in reinforcement learning (RL) that refers to updating the
value of a state (or state-action pair) based on estimates of future values rather than
waiting for actual returns from full episodes. It allows the agent to learn and update its
estimates incrementally, making learning faster and more efficient.
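Concretely, the two approaches differ only in the target that the value estimate is moved toward. A Monte Carlo update uses the complete return observed at the end of the episode, while a bootstrapped (temporal-difference) update uses one observed reward plus the current estimate of the next state's value:

Monte Carlo target: G = r1 + γ r2 + γ² r3 + … (the full discounted return from the state onward)

Bootstrapped (TD) target: r + γ V(s′) (one real reward plus a current estimate)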
Comparison: Monte Carlo vs. Bootstrapping

Aspect               | Monte Carlo                                  | Bootstrapping
Update Basis         | Uses complete returns from episodes.         | Uses estimates of future values (partial).
Episodic Requirement | Requires episodes to terminate.              | Does not require termination (works online).
Efficiency           | Slower (waits for full episodes).            | Faster (updates after each step).
Bias                 | Unbiased (based on actual returns).          | Biased (depends on value estimates).
Variance             | High variance (due to full episode returns). | Lower variance (due to smaller updates).

Advantages of Bootstrapping

1. Efficiency:

o Updates can be performed online after each step, enabling faster learning.

2. Memory Usage:

o Does not require storing complete episodes or computing full returns.

3. Scalability:

o Works well for large or continuous state spaces.

4. Flexibility:
o Can be used for both on-policy (e.g., SARSA) and off-policy (e.g., Q-learning)
algorithms.

Challenges of Bootstrapping

1. Bias:

o Updates are biased by the current value function estimate, which can lead to
errors if the initial estimates are poor.

2. Error Propagation:

o Errors in the value estimates can propagate and compound over time.

3. Exploration:

o Requires sufficient exploration to visit all states for accurate updates.


Applications of Bootstrapping

1. Game AI:

o Learning optimal strategies in board games like chess or Go.

2. Robotics:

o Bootstrapping enables robots to learn navigation and manipulation tasks in real time.

3. Finance:

o Portfolio optimization using reinforcement learning.

4. Healthcare:

o Personalized treatment planning based on patient data.

Conclusion

Bootstrapping is a fundamental concept in reinforcement learning that significantly speeds up learning by leveraging current estimates of value functions. It is widely used in modern RL algorithms like Q-learning, SARSA, and TD learning. While it introduces bias, its efficiency and scalability make it indispensable for solving large-scale and continuous RL problems.
TD(0) ALGORITHM
TD(0) (one-step Temporal Difference learning) is one of the simplest and most fundamental reinforcement learning algorithms. It combines ideas from Monte Carlo methods (learning from sampled experience) and Dynamic Programming (bootstrapping from current value estimates). It updates the value of a state incrementally based on the immediate reward and the estimated value of the next state.
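After observing a transition from state s to state s′ with reward r, TD(0) applies the update

V(s) ← V(s) + α [ r + γ V(s′) − V(s) ]

where α is the learning rate and γ the discount factor. A minimal sketch of TD(0) policy evaluation in Python, assuming a generic episodic environment with reset() and step() methods and a fixed policy function (these interface names are illustrative, not part of the original text):

```python
from collections import defaultdict

def td0_evaluate(env, policy, episodes=1000, alpha=0.1, gamma=0.99):
    """Estimate V(s) for a fixed policy using one-step TD updates."""
    V = defaultdict(float)              # value estimates, default 0
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)               # action from the policy being evaluated
            s_next, r, done = env.step(a)
            # Bootstrapped target: immediate reward plus discounted estimate of V(s')
            target = r if done else r + gamma * V[s_next]
            V[s] += alpha * (target - V[s])
            s = s_next
    return V
```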
Advantages of TD(0)

1. Efficiency:

o Updates occur after every step, making it suitable for online learning.

2. Memory:

o Requires less memory compared to Monte Carlo methods, as it doesn't store complete episodes.

3. Applicability:

o Can be used for continuing tasks (non-episodic problems).

4. Combination:

o Bridges the gap between Monte Carlo methods (pure sampling) and Dynamic
Programming (bootstrapping).

Disadvantages

1. Bias:

o Bootstrapping introduces bias since the update depends on the current value
estimates.

2. Exploration:

o Requires sufficient exploration to visit all states for accurate value estimation.

3. Convergence:

o Convergence may be slow if the learning rate α is not appropriately tuned.

Applications

1. Robotics:

o Real-time navigation and control.

2. Game AI:

o Learning strategies in board games like tic-tac-toe.

3. Financial Modeling:

o Predicting asset prices and portfolio optimization.

4. Healthcare:

o Dynamic treatment planning for chronic diseases.


Conclusion

TD(0) is a foundational algorithm in reinforcement learning that combines the strengths of sampling and bootstrapping. Its simplicity and efficiency make it a powerful tool for learning value functions in both episodic and continuing tasks.

CONVERGENCE OF MONTE CARLO AND BATCH TD(0) ALGORITHMS


Both Monte Carlo (MC) and batch TD(0) are important methods in reinforcement learning
for evaluating policies by estimating the value function. While both can converge to
accurate estimates of the value function under certain conditions, the nature of their
convergence differs because of how they use data and update value estimates.

Monte Carlo Convergence

Monte Carlo methods estimate the value of a state V(s) by averaging the empirical returns observed from complete episodes starting from that state.
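After N visits to a state s, the Monte Carlo estimate is simply the sample mean of the observed returns,

V(s) ≈ (G1 + G2 + … + GN) / N,

which converges to Vπ(s) by the law of large numbers. Batch TD(0), in contrast, repeatedly replays a fixed batch of experience until the estimates stop changing; the fixed point it reaches is the solution of the Bellman equations for the maximum-likelihood model implied by that batch.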
Comparison of Monte Carlo and Batch TD(0)

Aspect             | Monte Carlo                               | Batch TD(0)
Data Usage         | Uses full episode returns.                | Uses one-step transitions (bootstrapping).
Convergence Target | Converges to Vπ(s).                       | Converges to the Bellman equation solution.
Convergence Speed  | Slower, as it requires complete episodes. | Faster, as it updates after each step.
Bias               | Unbiased (asymptotically).                | Biased due to bootstrapping.
Variance           | High (depends on full returns).           | Low (depends on one-step estimates).
Applicability      | Requires episodic tasks.                  | Works for both episodic and continuing tasks.
Conclusion

 Monte Carlo provides unbiased estimates but requires complete episodes, making
it slower and high variance.

 Batch TD(0) uses bootstrapping for faster and more stable convergence but
introduces bias.

 In practice, TD(0) is often preferred for large-scale problems due to its efficiency and
flexibility in online settings.

MODEL-FREE CONTROL
Model-free control refers to reinforcement learning methods that solve control problems
without explicitly using a model of the environment's dynamics. Instead of predicting the
state transitions and rewards, these algorithms learn directly from the interaction with
the environment to optimize a policy for action selection.

Main Approaches

1. Value-Based Methods:

o The agent learns the value function (Q(s,a) or V(s)) and derives a policy from
it.

o Example: Q-learning.

2. Policy-Based Methods:

o The agent directly learns the policy π(a∣s) without learning a value function.
o Example: REINFORCE algorithm.

3. Actor-Critic Methods:

o Combines the strengths of value-based and policy-based approaches:

 Actor: Updates the policy πθ(a∣s).

 Critic: Updates a value function (e.g., V(s) or Q(s,a)) to guide the actor.

o Example: Deep Deterministic Policy Gradient (DDPG).
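As a concrete illustration, a common one-step actor-critic formulation (a generic sketch, not the specific DDPG update) computes a TD error with the critic and uses it to update both components:

δ = r + γ V(s′) − V(s) (the critic's TD error)

w ← w + αw · δ · ∇w Vw(s) (critic update)

θ ← θ + αθ · δ · ∇θ log πθ(a∣s) (actor update)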

Disadvantages

1. Sample Inefficiency:

o Requires a large number of interactions with the environment to learn effectively.

2. Exploration Challenges:

o Balancing exploration and exploitation can be difficult, especially in sparse reward settings.

3. Instability:

o Training can be unstable or diverge, particularly when using function approximators.

Example: Solving a Cart-Pole Problem with Q-Learning

Environment

 Goal: Balance a pole on a moving cart by applying forces to the left or right.

 State: Position and velocity of the cart and angle and angular velocity of the pole.

 Actions: Left or right.
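Because the cart-pole state is continuous, a tabular method needs the state to be discretized first. A minimal sketch, assuming the Gymnasium CartPole-v1 environment and a simple binning scheme (the bin edges and hyperparameters here are illustrative, not from the original text):

```python
import numpy as np
import gymnasium as gym
from collections import defaultdict

env = gym.make("CartPole-v1")
bins = [np.linspace(-2.4, 2.4, 9),    # cart position
        np.linspace(-3.0, 3.0, 9),    # cart velocity
        np.linspace(-0.21, 0.21, 9),  # pole angle (radians)
        np.linspace(-3.0, 3.0, 9)]    # pole angular velocity

def discretize(obs):
    """Map the continuous observation to a tuple of bin indices."""
    return tuple(int(np.digitize(x, b)) for x, b in zip(obs, bins))

Q = defaultdict(lambda: np.zeros(env.action_space.n))
alpha, gamma, epsilon = 0.1, 0.99, 0.1

for episode in range(5000):
    obs, _ = env.reset()
    s = discretize(obs)
    done = False
    while not done:
        # epsilon-greedy action selection
        a = env.action_space.sample() if np.random.rand() < epsilon else int(np.argmax(Q[s]))
        obs, r, terminated, truncated, _ = env.step(a)
        s_next, done = discretize(obs), terminated or truncated
        # Q-learning update: bootstrap from the best action in the next state
        target = r + gamma * (0.0 if terminated else np.max(Q[s_next]))
        Q[s][a] += alpha * (target - Q[s][a])
        s = s_next
```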


Applications of Model-Free Control

1. Robotics:

o Learning locomotion and manipulation tasks without explicit models.

2. Games:

o Training AI for board games and video games (e.g., AlphaGo and DQN for Atari
games).

3. Autonomous Vehicles:

o Learning driving policies through simulation without modeling dynamics.

4. Healthcare:

o Dynamic treatment optimization for chronic diseases.

Conclusion

Model-free control algorithms, like Q-learning and policy gradient methods, are versatile
tools for solving control problems in unknown environments. They focus on directly
improving the policy or value function without relying on a model of the environment's
dynamics, making them highly applicable to complex real-world tasks.
Q-Learning Algorithm Overview

1. Initialize Q-values:

 Initialize the action-value function Q(s,a) arbitrarily for each state-action pair.
Usually, Q(s,a) is set to zero for all s and a, but can also be initialized to small
random values.

2. For each episode:

 Start in an initial state s0.

 For each time step in the episode:

o Choose an action a in the current state s using an exploration policy (e.g., ϵ-greedy).

o Take action a, observe the reward r and the next state s′.

o Update the Q-value with the Q-learning rule: Q(s,a) ← Q(s,a) + α [ r + γ maxa′ Q(s′,a′) − Q(s,a) ].

o Set s ← s′ and repeat until s is terminal.

Q-Learning Algorithm in Pseudocode
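The pseudocode figure from the original material is not reproduced here; the following Python-style sketch captures the same procedure, assuming a generic episodic environment that exposes reset(), step(), and actions() methods (these interface names are illustrative):

```python
import random
from collections import defaultdict

def q_learning(env, episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning with an epsilon-greedy behavior policy."""
    Q = defaultdict(float)                       # Q[(state, action)] -> value

    def greedy(s):
        return max(env.actions(s), key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy behavior policy
            a = random.choice(env.actions(s)) if random.random() < epsilon else greedy(s)
            s_next, r, done = env.step(a)
            # Off-policy target: the best estimated value in the next state
            best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in env.actions(s_next))
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```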

Key Characteristics of Q-Learning

 Off-policy: Q-learning is off-policy, meaning it learns about the optimal policy regardless of the agent's behavior. The agent can explore using one policy (such as ϵ-greedy) while still learning about the optimal policy.

 Convergence: Under certain conditions (e.g., sufficient exploration and appropriately decaying learning rates), Q-learning is guaranteed to converge to the optimal action-value function Q∗(s,a).

 Exploration vs Exploitation: The ϵ-greedy strategy helps balance exploration (trying new actions) and exploitation (selecting the best-known action). Over time, ϵ is usually decayed to favor exploitation as the Q-values become more accurate.

Exploration Strategy: ϵ-greedy

In Q-learning, the agent often uses an ϵ-greedy policy to explore the environment:

 With probability ϵ, select a random action (exploration).

 With probability 1−ϵ, select the greedy action argmaxa Q(s,a), i.e., the action with the highest Q-value (exploitation).
Over time, ϵ is typically reduced (decayed) to focus more on exploiting the learned Q-values
as the agent becomes more confident in its estimates.
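A minimal helper illustrating this selection rule (illustrative names, assuming Q maps (state, action) pairs to values):

```python
import random

def epsilon_greedy(Q, state, actions, epsilon):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)                   # explore
    return max(actions, key=lambda a: Q[(state, a)])    # exploit
```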

Example: Q-Learning for Gridworld

Consider a simple Gridworld environment where an agent needs to move from a start
position to a goal position.

 States: The cells of the grid.

 Actions: Up, Down, Left, Right.

 Rewards: The agent gets a reward of 0 for each move, and a reward of +1 when it
reaches the goal.

 Goal: Find the shortest path to the goal.

The agent will explore the environment, and the Q-values for each state-action pair will be
updated using the Q-learning update rule. Eventually, the Q-values will converge to an
optimal set, and the agent will follow the best action at each state, which will lead it to the
goal.

Advantages of Q-Learning

1. Model-free: Q-learning does not require a model of the environment (i.e., the
transition function and reward function). It learns purely from experience.

2. Off-policy: The agent can learn the optimal policy even if it is not following the
optimal policy during training. This allows for more flexibility in exploration.

3. Convergence: Q-learning is guaranteed to converge to the optimal action-value function under the assumptions of sufficient exploration and a suitably decaying learning rate.

Disadvantages of Q-Learning

1. Sample inefficiency: Q-learning can require a large number of episodes to converge, especially in large state-action spaces.

2. State-Action Space Explosion: The state-action space can become very large for real-world problems, making it impractical to store and update the Q-table for every state-action pair (this is mitigated using function approximation, e.g., Deep Q-Networks).

3. Exploration challenges: If the exploration strategy (e.g., ϵ-greedy) is not well-tuned, the agent may explore too much or too little, leading to poor learning performance.

Q-Learning with Function Approximation (Deep Q-Networks)


For environments with large or continuous state spaces, storing and updating Q-values
for every state-action pair becomes impractical. In such cases, Deep Q-Networks (DQN)
can be used to approximate Q(s,a) with a neural network.

 Instead of maintaining a table of Q-values, the neural network approximates the Q-function.

 The neural network takes a state as input and outputs Q-values for all possible
actions.

 DQN also uses techniques such as experience replay and target networks to
stabilize learning.
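A highly simplified sketch of the core DQN update, assuming PyTorch and a discrete-action environment; replay-buffer sampling, exploration, and target-network synchronization are only hinted at, and all names are illustrative rather than taken from any particular library's DQN implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per discrete action."""
    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, x):
        return self.net(x)

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    """One gradient step on a minibatch sampled from the replay buffer."""
    states, actions, rewards, next_states, dones = batch   # tensors
    # Q(s, a) for the actions that were actually taken
    q_sa = q_net(states).gather(1, actions.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Bootstrapped target computed from the frozen target network
        max_next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * max_next_q
    loss = F.mse_loss(q_sa, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Periodically copy weights: target_net.load_state_dict(q_net.state_dict())
```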

Conclusion

Q-learning is a powerful and widely used reinforcement learning algorithm for finding the
optimal policy in discrete environments. Its off-policy nature allows the agent to explore
and learn about the optimal policy independently, and its convergence guarantees make
it a robust choice for many tasks. However, for large or continuous state spaces, Q-
learning often requires function approximation techniques like Deep Q-Networks.

SARSA (State-Action-Reward-State-Action)
SARSA is a model-free, on-policy reinforcement learning algorithm. It is similar to Q-
Learning, but the key difference is in how the Q-values are updated. While Q-learning
uses the maximum possible future Q-value to update the current Q-value (off-policy),
SARSA uses the action actually taken by the agent in the next state (on-policy).

SARSA Overview

SARSA stands for State-Action-Reward-State-Action, which refers to the sequence (s, a, r, s′, a′) used in the update rule:

Q(s,a) ← Q(s,a) + α [ r + γ Q(s′,a′) − Q(s,a) ]

where a′ is the action the agent actually selects in the next state s′ according to its current policy.
Key Characteristics of SARSA

1. On-policy: SARSA is an on-policy algorithm because it updates the Q-values based on the action taken by the agent, which is selected according to its current policy. The agent learns the value of the policy it is actually following.

o In contrast, Q-Learning is off-policy because it updates the Q-values assuming the agent will always take the action that maximizes future rewards, regardless of the current exploration strategy.

2. Exploration Strategy: Like Q-learning, SARSA often uses an ϵ-greedy exploration strategy to balance exploration and exploitation:

o With probability ϵ, choose a random action (exploration).

o With probability 1−ϵ, choose the action that maximizes Q(s,a) (exploitation).

3. Learning Process: The agent updates the Q-values based on the action it actually
takes in the next state, rather than assuming the best possible action. This makes
SARSA sensitive to the exploration strategy and more conservative in its updates
compared to Q-learning.

4. Policy Improvement: The agent learns a policy directly while interacting with the
environment. The learned policy becomes a balance of exploration and exploitation
that can be extracted from the Q-table by selecting the action with the highest Q-
value for each state.
Example of SARSA in Gridworld

Imagine the agent in a Gridworld environment where it has to move to a goal while
avoiding obstacles.

Setup:

 States: Each grid cell.

 Actions: Move up, down, left, right.


 Rewards: The agent receives +1 for reaching the goal and 0 for every other move.

Steps in SARSA:

1. The agent starts at a random state and chooses an action using an ϵ-greedy policy.

2. The agent moves, observes the reward, and transitions to the next state.

3. It chooses the next action in the new state according to the ϵ-greedy policy.

4. The Q-value for the state-action pair is updated based on the action taken in the
next state.

5. Repeat this process for many episodes until the Q-values converge and the agent
learns the optimal path to the goal.
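These steps can be sketched in Python as follows, again assuming a generic environment interface with reset(), step(), and actions() methods (illustrative names, as in the earlier Q-learning sketch):

```python
import random
from collections import defaultdict

def sarsa(env, episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular SARSA: the update target uses the action actually taken next."""
    Q = defaultdict(float)

    def epsilon_greedy(s):
        if random.random() < epsilon:
            return random.choice(env.actions(s))
        return max(env.actions(s), key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s = env.reset()
        a = epsilon_greedy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            if done:
                # Terminal transition: no bootstrap term
                Q[(s, a)] += alpha * (r - Q[(s, a)])
                break
            a_next = epsilon_greedy(s_next)
            # On-policy target: bootstraps from Q(s', a') for the action actually chosen
            Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
            s, a = s_next, a_next
    return Q
```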

Advantages of SARSA

1. On-policy learning: SARSA can be more stable since it updates the Q-values based
on the actions the agent actually takes, not based on a hypothetical best-case
future.

2. Conservative Learning: Since SARSA is on-policy, it takes a more cautious approach to learning. If the agent is exploring, the Q-values will reflect the exploration strategy, making it less likely to take risky actions.

Disadvantages of SARSA

1. Slower convergence: Due to its on-policy nature, SARSA may take longer to
converge to the optimal policy compared to Q-learning, especially in environments
with high uncertainty.

2. Exploration dependence: The performance of SARSA is highly sensitive to the exploration strategy (like ϵ-greedy). Poor exploration may result in suboptimal policies.

SARSA with Function Approximation

In more complex environments (e.g., continuous state spaces), the Q-values can be
approximated using function approximators like neural networks. This approach is similar
to Deep SARSA, where a neural network is used to approximate the Q-function for large
state-action spaces.

Conclusion

SARSA is a reinforcement learning algorithm that is both simple and effective for learning
in environments where an agent interacts with the world to make decisions. Its on-policy
nature makes it more conservative in updates, leading to a more stable but potentially
slower learning process compared to off-policy methods like Q-learning. Depending on the
problem and the environment, SARSA may be preferred when the goal is to learn a policy
that balances exploration and exploitation in a more controlled way.
Expected SARSA
Expected SARSA is a refinement of the standard SARSA algorithm that helps reduce the variance of the Q-value updates. While SARSA updates the Q-value based on the action actually taken in the next state, Expected SARSA updates the Q-value using the expected value over the next state's actions, averaged over all possible actions weighted by their respective probabilities under the agent's policy.

Difference from SARSA

In standard SARSA, the Q-value update is based on the action actually taken in the next
state a′. This can lead to high variance if the policy is exploratory (i.e., ϵ-greedy), as it
might randomly select suboptimal actions.

Expected SARSA, on the other hand, uses the expected value of the Q-values over all
actions, weighted by the probability of each action under the current policy. This reduces
the variance and leads to more stable learning.
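Concretely, the Expected SARSA update replaces the sampled next-action value Q(s′,a′) with its expectation under the current policy π:

Q(s,a) ← Q(s,a) + α [ r + γ Σa′ π(a′∣s′) Q(s′,a′) − Q(s,a) ]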

Key Characteristics of Expected SARSA

1. On-policy: Like SARSA, Expected SARSA is an on-policy method. It updates the Q-values based on the actions taken under the current policy, which is typically an ϵ-greedy policy.

2. Reduced Variance: The main advantage of Expected SARSA over SARSA is the
reduction in variance. In SARSA, the Q-value is updated based on the actual action
taken, which can vary significantly if the exploration rate is high. In Expected
SARSA, the Q-value is updated based on the expected value of future actions,
leading to more stable and consistent updates.

3. Expected Value Calculation: Instead of using the Q-value of the action actually
taken in the next state, Expected SARSA uses the expected Q-value, which considers
the probability of taking each possible action according to the current policy.
Advantages of Expected SARSA

1. Stability: By using the expected value of the next state-action pair, Expected SARSA
reduces the variance in Q-value updates, which leads to more stable learning
compared to standard SARSA.

2. More Efficient Learning: The reduction in variance can also lead to more efficient
learning, as the updates are less sensitive to the randomness introduced by
exploration.

3. On-Policy Learning: Expected SARSA retains the on-policy nature of SARSA, meaning it learns the value of the policy the agent is actually following. This ensures the agent gradually improves its policy while learning.
Disadvantages of Expected SARSA

1. Computational Complexity: Expected SARSA requires calculating the expected Q-value over all possible actions at the next state. This can be computationally expensive in environments with a large action space.

2. Requires Full Knowledge of Action Probabilities: The algorithm requires knowledge of the action probabilities under the current policy π(a′∣s′), which can be difficult to compute for complex or continuous action spaces.

3. Slower Convergence in Certain Environments: While Expected SARSA is more stable, it may converge more slowly in environments where the rewards are highly stochastic or the exploration strategy is suboptimal.

Example of Expected SARSA

Consider an agent in a Gridworld environment. The goal is for the agent to navigate
through a grid to reach a goal while avoiding obstacles. The agent can take actions like
up, down, left, and right, and receives a reward for reaching the goal or a negative reward
for hitting an obstacle.

In Expected SARSA:

1. The agent takes an action based on its policy (e.g., ϵ-greedy).

2. After taking the action, it observes the reward and the new state.

3. Instead of using the Q-value of the action actually taken in the next state (as in
SARSA), it computes the expected Q-value of the next state by averaging over all
possible actions according to the current policy.

4. The Q-value for the state-action pair is updated using the expected Q-value, which
leads to smoother updates and more stable learning.
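For an ϵ-greedy policy this expectation has a simple closed form, since each action receives probability ϵ/|A| plus an extra (1−ϵ) on the greedy action. A small illustrative helper (names are illustrative; Q_next is assumed to be an array of Q-values for the actions available in s′):

```python
import numpy as np

def expected_sarsa_target(Q_next, reward, gamma, epsilon):
    """Expected SARSA target: r + gamma * E_pi[Q(s', a')] under an epsilon-greedy policy."""
    n_actions = len(Q_next)
    probs = np.full(n_actions, epsilon / n_actions)    # exploration probability mass
    probs[int(np.argmax(Q_next))] += 1.0 - epsilon     # extra mass on the greedy action
    return reward + gamma * float(np.dot(probs, Q_next))
```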
Conclusion

Expected SARSA improves upon SARSA by reducing the variance in updates through the
use of expected values of the next state-action pair. This can make learning more stable
and efficient, especially in environments where exploration introduces a lot of
randomness. However, the main trade-off is the increased computational cost due to the
need to compute the expected Q-value over all possible actions. Despite this, Expected
SARSA is a solid choice for environments where stable, on-policy learning is required.
