RL Unit - IV
Efficiency: Monte Carlo methods are slower (they wait for full episodes), while TD methods are faster (they update after each step).
Advantages of Bootstrapping
1. Efficiency:
o Updates can be performed online after each step, enabling faster learning.
2. Memory Usage:
o Requires less memory because complete episode trajectories do not need to be stored.
3. Scalability:
o Scales to long or continuing (non-episodic) tasks where waiting for an episode to end is impractical.
4. Flexibility:
o Can be used for both on-policy (e.g., SARSA) and off-policy (e.g., Q-learning)
algorithms.
Challenges of Bootstrapping
1. Bias:
o Updates are biased by the current value function estimate, which can lead to
errors if the initial estimates are poor.
2. Error Propagation:
o Errors in the value estimates can propagate and compound over time.
3. Exploration:
o Accurate value estimates still require sufficient exploration of the state space.
Applications of Bootstrapping
1. Game AI:
2. Robotics:
3. Finance:
4. Healthcare:
Conclusion
1. Efficiency:
o Updates occur after every step, making it suitable for online learning.
2. Memory:
o Only the current value estimates must be stored, not entire episode trajectories.
3. Applicability:
o Works for both episodic and continuing tasks.
4. Combination:
o Bridges the gap between Monte Carlo methods (pure sampling) and Dynamic
Programming (bootstrapping).
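For reference, the standard tabular TD(0) update that implements this bootstrapping is

$$
V(s_t) \leftarrow V(s_t) + \alpha \big[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \big],
$$

where α is the step size, γ the discount factor, and the bracketed term is the TD error. Using the current estimate V(s_{t+1}) in the target, rather than the full sampled return, is exactly what distinguishes TD from Monte Carlo.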
Disadvantages
1. Bias:
o Bootstrapping introduces bias since the update depends on the current value
estimates.
2. Exploration:
o Requires sufficient exploration to visit all states for accurate value estimation.
3. Convergence:
o Convergence can be slow and is sensitive to the choice of step size (learning rate).
Applications
1. Robotics:
2. Game AI:
3. Financial Modeling:
4. Healthcare:
Monte Carlo methods estimate the value of a state V(s) by averaging the empirical
returns observed from complete episodes starting from that state.
Comparison of Monte Carlo and Batch TD(0)
Monte Carlo provides unbiased estimates but requires complete episodes, making
it slower and giving its estimates high variance.
Batch TD(0) uses bootstrapping for faster and more stable convergence but
introduces bias.
In practice, TD(0) is often preferred for large-scale problems due to its efficiency and
flexibility in online settings.
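A minimal sketch contrasting the two estimators on recorded episodes; the (state, reward) episode format and the function names are illustrative assumptions, and the TD estimator shown is a single online pass rather than the repeated sweeps of batch TD(0).

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=0.9):
    """Monte Carlo: estimate V(s) by averaging full returns from the first visit to s in each episode."""
    returns = defaultdict(list)
    for episode in episodes:                      # episode = [(state, reward received after leaving it), ...]
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            first_visit.setdefault(s, t)          # remember when each state was first visited
        G, G_at = 0.0, [0.0] * len(episode)
        for t in reversed(range(len(episode))):   # accumulate discounted returns backwards
            G = episode[t][1] + gamma * G
            G_at[t] = G
        for s, t in first_visit.items():
            returns[s].append(G_at[t])
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}

def td0(episodes, alpha=0.1, gamma=0.9):
    """TD(0): bootstrap from the current estimate of the next state after every single step."""
    V = defaultdict(float)
    for episode in episodes:
        for t, (s, r) in enumerate(episode):
            next_v = V[episode[t + 1][0]] if t + 1 < len(episode) else 0.0   # value 0 at episode end
            V[s] += alpha * (r + gamma * next_v - V[s])                      # online, per-step update
    return dict(V)
```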
MODEL-FREE CONTROL
Model-free control refers to reinforcement learning methods that solve control problems
without explicitly using a model of the environment's dynamics. Instead of predicting the
state transitions and rewards, these algorithms learn directly from the interaction with
the environment to optimize a policy for action selection.
1. Value-Based Methods:
o The agent learns the value function (Q(s,a) or V(s)) and derives a policy from
it.
o Example: Q-learning.
2. Policy-Based Methods:
o The agent directly learns the policy π(a∣s) without learning a value function.
o Example: REINFORCE algorithm.
3. Actor-Critic Methods:
o Actor: Updates the policy π(a∣s), typically in the direction suggested by the critic.
o Critic: Updates a value function (e.g., V(s) or Q(s,a)) to guide the actor.
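A minimal sketch of the first two families (the names q_table and theta and the 5-state, 2-action sizes are illustrative assumptions): a value-based method stores Q(s,a) and reads its policy off the table, while a policy-based method parameterizes π(a∣s) directly.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 2

# Value-based: learn Q(s, a) and derive the policy from it (e.g., greedily or epsilon-greedily).
q_table = np.zeros((n_states, n_actions))

def greedy_action(state):
    return int(np.argmax(q_table[state]))          # exploit the learned value estimates

# Policy-based: parameterize pi(a|s) directly (here, a softmax over per-state action preferences).
theta = np.zeros((n_states, n_actions))

def sample_action(state):
    prefs = theta[state] - theta[state].max()      # subtract the max for numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return int(rng.choice(n_actions, p=probs))     # sample an action from pi(a|s)
```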
Disadvantages
1. Sample Inefficiency:
o Often requires a large number of interactions with the environment to learn a good policy.
2. Exploration Challenges:
o Performance depends on balancing exploration and exploitation; insufficient exploration can leave the policy suboptimal.
3. Instability:
o Learning can be unstable, particularly when combined with function approximation.
Example: The CartPole Environment
Goal: Balance a pole on a moving cart by applying forces to the left or right.
State: The cart's position and velocity, and the pole's angle and angular velocity.
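A minimal interaction sketch, assuming the third-party gymnasium package (the maintained successor to OpenAI Gym) is installed; the random action choice is only a placeholder for a learned policy.

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)   # obs = [cart position, cart velocity, pole angle, pole angular velocity]

total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()                  # 0 = push left, 1 = push right (placeholder policy)
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward                              # +1 for every step the pole stays balanced
    done = terminated or truncated
env.close()
print(f"episode return: {total_reward}")
```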
Applications of Model-Free Control
1. Robotics:
2. Games:
o Training AI for board games and video games (e.g., AlphaGo and DQN for Atari
games).
3. Autonomous Vehicles:
4. Healthcare:
Conclusion
Model-free control algorithms, like Q-learning and policy gradient methods, are versatile
tools for solving control problems in unknown environments. They focus on directly
improving the policy or value function without relying on a model of the environment's
dynamics, making them highly applicable to complex real-world tasks.
Q-Learning Algorithm Overview
1. Initialize Q-values:
Initialize the action-value function Q(s,a) arbitrarily for each state-action pair.
Usually, Q(s,a) is set to zero for all s and a, but can also be initialized to small
random values.
In Q-learning, the agent often uses an ϵ-greedy policy to explore the environment:
o With probability ϵ, select a random action (exploration).
o With probability 1−ϵ, select the action with the highest Q-value, maxₐ Q(s,a) (exploitation).
Over time, ϵ is typically reduced (decayed) to focus more on exploiting the learned Q-values
as the agent becomes more confident in its estimates.
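A minimal sketch of ϵ-greedy selection with decay, assuming Q is a tabular NumPy array indexed by (state, action); the decay schedule and hyperparameter values are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q, state, epsilon):
    """With probability epsilon explore; otherwise exploit the current Q-values."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))     # random action (exploration)
    return int(np.argmax(Q[state]))              # best-known action (exploitation)

# Typical decay: start exploratory, become greedier as the estimates improve.
epsilon, eps_min, eps_decay = 1.0, 0.05, 0.995
for episode in range(1000):
    # ... run one episode, selecting actions with epsilon_greedy(Q, s, epsilon) ...
    epsilon = max(eps_min, epsilon * eps_decay)
```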
Consider a simple Gridworld environment where an agent needs to move from a start
position to a goal position.
Rewards: The agent gets a reward of 0 for each move, and a reward of +1 when it
reaches the goal.
The agent will explore the environment, and the Q-values for each state-action pair will be
updated using the Q-learning update rule. Eventually, the Q-values will converge to an
optimal set, and the agent will follow the best action at each state, which will lead it to the
goal.
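A minimal tabular Q-learning sketch for a gridworld like the one described; the 0-per-move and +1-at-goal rewards follow the text, while the 4x4 layout, start and goal cells, step cap, and hyperparameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_cols, n_actions = 4, 4, 4            # actions: 0=up, 1=down, 2=left, 3=right
goal = (3, 3)
Q = np.zeros((n_rows * n_cols, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.1

def to_index(cell):
    return cell[0] * n_cols + cell[1]

def step(cell, action):
    """Deterministic moves clipped to the grid; reward 0 per move, +1 on reaching the goal."""
    dr, dc = [(-1, 0), (1, 0), (0, -1), (0, 1)][action]
    nxt = (min(max(cell[0] + dr, 0), n_rows - 1), min(max(cell[1] + dc, 0), n_cols - 1))
    return nxt, (1.0 if nxt == goal else 0.0), nxt == goal

def epsilon_greedy(s):
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))                      # explore
    return int(rng.choice(np.flatnonzero(Q[s] == Q[s].max())))   # exploit, ties broken randomly

for episode in range(2000):
    cell, done = (0, 0), False
    for _ in range(200):                                         # cap episode length
        s = to_index(cell)
        a = epsilon_greedy(s)
        cell, reward, done = step(cell, a)
        target = reward + (0.0 if done else gamma * np.max(Q[to_index(cell)]))   # bootstrapped max target
        Q[s, a] += alpha * (target - Q[s, a])                    # Q-learning update rule
        if done:
            break

print(np.argmax(Q, axis=1).reshape(n_rows, n_cols))              # greedy action per cell after learning
```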
Advantages of Q-Learning
1. Model-free: Q-learning does not require a model of the environment (i.e., the
transition function and reward function). It learns purely from experience.
2. Off-policy: The agent can learn the optimal policy even if it is not following the
optimal policy during training. This allows for more flexibility in exploration.
Disadvantages of Q-Learning
2. State-Action Space Explosion: The state-action space can become very large for
real-world problems, making it impractical to store and update the Q-table for every
state-action pair (this is mitigated using function approximation, e.g., Deep Q-
Networks).
Deep Q-Networks (DQN)
In DQN, a neural network takes a state as input and outputs Q-values for all possible
actions.
DQN also uses techniques such as experience replay and target networks to
stabilize learning.
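A minimal sketch of these two stabilization techniques, assuming PyTorch is installed; the network architecture, buffer capacity, and helper names (QNetwork, train_step, replay_buffer) are illustrative assumptions, and the environment interaction loop is omitted.

```python
import random
from collections import deque

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per action."""
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, state):
        return self.net(state)

state_dim, n_actions, gamma = 4, 2, 0.99
online_net = QNetwork(state_dim, n_actions)
target_net = QNetwork(state_dim, n_actions)
target_net.load_state_dict(online_net.state_dict())       # target network starts as a copy
optimizer = torch.optim.Adam(online_net.parameters(), lr=1e-3)

# Experience replay: interaction code (not shown) appends (state, action, reward, next_state, done).
replay_buffer = deque(maxlen=10_000)

def train_step(batch_size=32):
    if len(replay_buffer) < batch_size:
        return
    batch = random.sample(replay_buffer, batch_size)       # random sampling breaks temporal correlations
    s, a, r, s_next, done = map(torch.as_tensor, zip(*batch))
    s, s_next = s.float(), s_next.float()
    q = online_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():                                  # bootstrapped target from the frozen network
        target = r.float() + gamma * target_net(s_next).max(dim=1).values * (1 - done.float())
    loss = nn.functional.mse_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Periodically refresh the target network, e.g. every few thousand steps:
    # target_net.load_state_dict(online_net.state_dict())
```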
Conclusion
Q-learning is a powerful and widely used reinforcement learning algorithm for finding the
optimal policy in discrete environments. Its off-policy nature allows the agent to explore
and learn about the optimal policy independently, and its convergence guarantees make
it a robust choice for many tasks. However, for large or continuous state spaces, Q-
learning often requires function approximation techniques like Deep Q-Networks.
SARSA (State-Action-Reward-State-Action)
SARSA is a model-free, on-policy reinforcement learning algorithm. It is similar to Q-
Learning, but the key difference is in how the Q-values are updated. While Q-learning
uses the maximum possible future Q-value to update the current Q-value (off-policy),
SARSA uses the action actually taken by the agent in the next state (on-policy).
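Since the text states this difference only verbally, the two standard update rules are given here for reference:

$$
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \big] \quad \text{(Q-learning, off-policy)}
$$

$$
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \big] \quad \text{(SARSA, on-policy)}
$$

where a_{t+1} is the action the agent actually selects in s_{t+1} under its behaviour policy.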
SARSA Overview
o With probability ϵ, choose a random action (exploration).
o With probability 1−ϵ, choose the action that maximizes Q(s,a) (exploitation).
3. Learning Process: The agent updates the Q-values based on the action it actually
takes in the next state, rather than assuming the best possible action. This makes
SARSA sensitive to the exploration strategy and more conservative in its updates
compared to Q-learning.
4. Policy Improvement: The agent learns a policy directly while interacting with the
environment. The learned policy becomes a balance of exploration and exploitation
that can be extracted from the Q-table by selecting the action with the highest Q-
value for each state.
Example of SARSA in Gridworld
Imagine the agent in a Gridworld environment where it has to move to a goal while
avoiding obstacles.
Setup:
o The agent can move up, down, left, or right; it receives a positive reward for reaching the goal and a negative reward for hitting an obstacle.
Steps in SARSA:
1. The agent starts at a random state and chooses an action using an ϵ-greedy policy.
2. The agent moves, observes the reward, and transitions to the next state.
3. It chooses the next action in the new state according to the ϵ-greedy policy.
4. The Q-value for the state-action pair is updated based on the action taken in the
next state.
5. Repeat this process for many episodes until the Q-values converge and the agent
learns the optimal path to the goal.
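A minimal sketch of these steps as a self-contained function; the env_reset/env_step interface, the ϵ-greedy helper, and the hyperparameter values are illustrative assumptions.

```python
import numpy as np

def sarsa(env_reset, env_step, n_states, n_actions,
          episodes=2000, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular SARSA. env_reset() -> state index; env_step(s, a) -> (next_state, reward, done)."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))

    def policy(s):                                       # epsilon-greedy behaviour policy
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        return int(rng.choice(np.flatnonzero(Q[s] == Q[s].max())))

    for _ in range(episodes):
        s = env_reset()
        a = policy(s)                                    # step 1: choose the first action
        done = False
        while not done:
            s_next, r, done = env_step(s, a)             # step 2: act and observe reward, next state
            a_next = policy(s_next)                      # step 3: choose the NEXT action first
            target = r + (0.0 if done else gamma * Q[s_next, a_next])
            Q[s, a] += alpha * (target - Q[s, a])        # step 4: on-policy update uses a_next
            s, a = s_next, a_next                        # step 5: continue with the chosen action
    return Q
```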
Advantages of SARSA
1. On-policy learning: SARSA can be more stable since it updates the Q-values based
on the actions the agent actually takes, not based on a hypothetical best-case
future.
Disadvantages of SARSA
1. Slower convergence: Due to its on-policy nature, SARSA may take longer to
converge to the optimal policy compared to Q-learning, especially in environments
with high uncertainty.
In more complex environments (e.g., continuous state spaces), the Q-values can be
approximated using function approximators like neural networks. This approach, known as
Deep SARSA, uses a neural network to approximate the Q-function for large
state-action spaces.
Conclusion
SARSA is a reinforcement learning algorithm that is both simple and effective for learning
in environments where an agent interacts with the world to make decisions. Its on-policy
nature makes it more conservative in updates, leading to a more stable but potentially
slower learning process compared to off-policy methods like Q-learning. Depending on the
problem and the environment, SARSA may be preferred when the goal is to learn a policy
that balances exploration and exploitation in a more controlled way.
Expected SARSA
Expected SARSA is an improvement over the standard SARSA algorithm, which helps to
reduce the variance in the updates to the Q-values. While SARSA updates the Q-value
based on the action actually taken in the next state, Expected SARSA updates the Q-
value using the expected value of the next state-action pair, averaged over all possible
actions weighted by their respective probabilities according to the agent’s policy.
In standard SARSA, the Q-value update is based on the action actually taken in the next
state a′. This can lead to high variance if the policy is exploratory (i.e., ϵ-greedy), as it
might randomly select suboptimal actions.
Expected SARSA, on the other hand, uses the expected value of the Q-values over all
actions, weighted by the probability of each action under the current policy. This reduces
the variance and leads to more stable learning.
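In symbols, the standard Expected SARSA update replaces the sampled next-action value with an expectation under the current policy π:

$$
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \Big[ r_{t+1} + \gamma \sum_{a'} \pi(a' \mid s_{t+1}) \, Q(s_{t+1}, a') - Q(s_t, a_t) \Big].
$$

When π is greedy with respect to Q, this expectation reduces to the Q-learning target; under an ϵ-greedy policy it mixes the greedy action's value with the average value over all actions.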
2. Reduced Variance: The main advantage of Expected SARSA over SARSA is the
reduction in variance. In SARSA, the Q-value is updated based on the actual action
taken, which can vary significantly if the exploration rate is high. In Expected
SARSA, the Q-value is updated based on the expected value of future actions,
leading to more stable and consistent updates.
3. Expected Value Calculation: Instead of using the Q-value of the action actually
taken in the next state, Expected SARSA uses the expected Q-value, which considers
the probability of taking each possible action according to the current policy.
Advantages of Expected SARSA
1. Stability: By using the expected value of the next state-action pair, Expected SARSA
reduces the variance in Q-value updates, which leads to more stable learning
compared to standard SARSA.
2. More Efficient Learning: The reduction in variance can also lead to more efficient
learning, as the updates are less sensitive to the randomness introduced by
exploration.
Consider an agent in a Gridworld environment. The goal is for the agent to navigate
through a grid to reach a goal while avoiding obstacles. The agent can take actions like
up, down, left, and right, and receives a reward for reaching the goal or a negative reward
for hitting an obstacle.
In Expected SARSA:
1. The agent selects an action in the current state according to its current policy (e.g., ϵ-greedy).
2. After taking the action, it observes the reward and the new state.
3. Instead of using the Q-value of the action actually taken in the next state (as in
SARSA), it computes the expected Q-value of the next state by averaging over all
possible actions according to the current policy.
4. The Q-value for the state-action pair is updated using the expected Q-value, which
leads to smoother updates and more stable learning.
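A minimal sketch of this expectation step, assuming an ϵ-greedy policy and a tabular NumPy Q-table; only the target computation differs from the SARSA update.

```python
import numpy as np

def expected_q(Q, s_next, epsilon):
    """Expectation of Q(s', a') over the epsilon-greedy policy's action probabilities."""
    n_actions = Q.shape[1]
    probs = np.full(n_actions, epsilon / n_actions)       # exploration mass spread uniformly
    probs[np.argmax(Q[s_next])] += 1.0 - epsilon          # remaining mass on the greedy action
    return float(np.dot(probs, Q[s_next]))

def expected_sarsa_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.9, epsilon=0.1):
    target = r + (0.0 if done else gamma * expected_q(Q, s_next, epsilon))
    Q[s, a] += alpha * (target - Q[s, a])                 # smoother than using a single sampled next action
```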
Conclusion
Expected SARSA improves upon SARSA by reducing the variance in updates through the
use of expected values of the next state-action pair. This can make learning more stable
and efficient, especially in environments where exploration introduces a lot of
randomness. However, the main trade-off is the increased computational cost due to the
need to compute the expected Q-value over all possible actions. Despite this, Expected
SARSA is a solid choice for environments where stable, on-policy learning is required.