Unit 5
• What is Q-Learning?
• Q-learning is a model-free, value-based, off-policy algorithm
that finds the best sequence of actions to take from the agent's
current state.
• The “Q” stands for quality. Quality represents how valuable the
action is in maximizing future rewards.
• Model-based algorithms use known transition and reward
functions to build a model of the environment and estimate the
optimal policy.
• In contrast, model-free algorithms learn the consequences of
their actions from experience, without access to the transition
and reward functions.
• Value-based methods train a value function to learn which
states are more valuable and then choose actions accordingly.
• Q-Function
• The Q-function uses the Bellman equation and takes a state (s)
and an action (a) as input. The equation simplifies the calculation
of state values and state-action values.
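• For reference, the Bellman (optimality) equation for the action-value function can be written as follows; this is the standard textbook form rather than anything specific to this unit's example:

```latex
Q^{*}(s, a) = \mathbb{E}\big[\, r + \gamma \max_{a'} Q^{*}(s', a') \;\big|\; s, a \,\big]
```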
• Q-learning algorithm
• Initialize Q-Table
• We first initialize the Q-Table, with one column per action and
one row per state.
• In our example, the character can move up, down, left, and
right, so we have four possible actions and four states (start,
idle, wrong path, and end). The wrong path corresponds to
falling into a hole. We initialize every value in the Q-Table to 0.
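• A minimal sketch of this initialization in Python, assuming the 4-state, 4-action layout described above (NumPy is used purely for convenience):

```python
import numpy as np

n_states = 4    # start, idle, wrong path (hole), end -- per the example above
n_actions = 4   # up, down, left, right

# Rows index states, columns index actions; every Q-value starts at 0.
q_table = np.zeros((n_states, n_actions))
print(q_table)
```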
• Choose an Action
• The second step is quite simple. At the start, the agent chooses
a random action (down or right); on later runs, it uses the
updated Q-Table to select the action.
• Perform an Action
• Choosing an action and performing it are repeated until the
training loop stops. The first action and state are selected using
the Q-Table; in our case, all of its values are still zero.
• Then, the agent will move down and update the Q-Table using
the Bellman equation. With every move, we will be updating
values in the Q-Table and also using it for determining the best
course of action.
• Initially, the agent is in exploration mode and chooses random
actions to explore the environment. The Epsilon-Greedy Strategy
is a simple method to balance exploration and exploitation:
epsilon is the probability of exploring (taking a random action),
and with probability 1 − epsilon the agent exploits the
best-known action.
• At the start, the epsilon rate is higher, meaning the agent is
mostly in exploration mode. As it explores the environment,
epsilon decreases and the agent starts to exploit what it has
learned. With every iteration of exploration, the agent becomes
more confident in its Q-value estimates.
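• A minimal epsilon-greedy selection sketch with a simple decay schedule, assuming the q_table array from the initialization sketch above; the decay constants are illustrative, not values prescribed by this unit:

```python
import numpy as np

rng = np.random.default_rng(0)

def choose_action(q_table, state, epsilon):
    """Explore with probability epsilon, otherwise exploit the best-known action."""
    if rng.random() < epsilon:
        return int(rng.integers(q_table.shape[1]))   # explore: random action
    return int(np.argmax(q_table[state]))            # exploit: highest Q-value

# Illustrative schedule: start fully exploratory, shrink epsilon each episode.
epsilon, epsilon_min, epsilon_decay = 1.0, 0.05, 0.995
for episode in range(1000):
    # ... run one episode, calling choose_action(q_table, state, epsilon) ...
    epsilon = max(epsilon_min, epsilon * epsilon_decay)
```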
• In the frozen lake example, the agent initially knows nothing
about the environment, so it starts with a random action (move
down). After this move, the corresponding entry of the Q-Table
is updated using the Bellman equation.
• Measuring the Rewards
• After taking the action, we will measure the outcome and the
reward.
• The reward for reaching the goal is +1
• The reward for taking the wrong path (falling into the hole) is 0
• The reward for Idle or moving on the frozen lake is also 0.
• Update Q-Table
• We update the function Q(St, At) using the update equation. It
uses the previous episode's Q-value estimate, the learning rate,
and the temporal difference (TD) error. The TD error is
calculated from the immediate reward, the discounted maximum
expected future reward, and the former Q-value estimate.
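• Written out, the Q-learning update rule this describes is:

```latex
Q(S_t, A_t) \leftarrow Q(S_t, A_t)
  + \alpha \big[\, R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \,\big]
```

• The term in brackets is the TD error described above.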
• The process is repeated over many episodes until the Q-Table
converges and the Q-value function is maximized.
• At the start, the agent explores the environment to fill in the
Q-table. Once the Q-Table is ready, the agent starts exploiting it
and making better decisions.
• In the case of a frozen lake, the agent will learn to take the
shortest path to reach the goal and avoid jumping into the
holes.
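• Putting the steps together, the sketch below shows one possible training loop. It assumes the Gymnasium FrozenLake-v1 environment (an assumption on our part, not part of this unit's materials), and the hyperparameter values are illustrative:

```python
import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1", is_slippery=False)
q_table = np.zeros((env.observation_space.n, env.action_space.n))

alpha, gamma = 0.1, 0.95                       # learning rate, discount (illustrative)
epsilon, eps_min, eps_decay = 1.0, 0.05, 0.999
rng = np.random.default_rng(0)

for episode in range(5000):
    state, _ = env.reset()
    done = False
    while not done:
        # Epsilon-greedy action selection
        if rng.random() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(q_table[state]))

        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

        # Q-learning update: old estimate + learning rate * TD error
        td_target = reward + gamma * np.max(q_table[next_state]) * (not terminated)
        q_table[state, action] += alpha * (td_target - q_table[state, action])
        state = next_state

    epsilon = max(eps_min, epsilon * eps_decay)
```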
• Q-learning Algorithm: Brief Overview
• Q-learning is a model-free reinforcement learning algorithm used to find the
optimal action-selection policy for an agent interacting with an environment.
It aims to learn the best actions to take under certain conditions to maximize
the cumulative reward.
• Key Components:
• Agent: The decision-maker.
• Environment: The world with which the agent interacts.
• State (S): A specific situation in the environment.
• Action (A): The moves the agent can take.
• Reward (R): Feedback from the environment based on the agent's action.
• Q-value (Q): The expected cumulative reward of taking an action in a given
state.
• Benefits:
• Model-Free: It does not require a model of the environment, making it
versatile.
• Simple Implementation: Easy to implement in discrete action spaces.
• Convergence: It converges to the optimal policy when using proper learning
parameters.
• Uses:
• Robotics: Q-learning is used to teach robots how to navigate environments
autonomously.
• Game AI: It is commonly applied in video games for AI agents to learn optimal
strategies.
• Finance: Q-learning can be used to optimize trading strategies by learning
from historical data.
• Game-Based Example:
• A popular use case is training a Pac-Man agent:
• States: The grid locations of Pac-Man and the ghosts.
• Actions: Moving up, down, left, or right.
• Rewards: Positive for eating dots and negative for being caught by ghosts.
• The agent learns the best policy to maximize its score while avoiding ghosts.
• Evaluation:
• Q-learning is evaluated by:
1.Cumulative Rewards: Measuring how much reward the agent accumulates
over episodes.
2.Policy Performance: Checking the optimality of the learned policy (whether it
selects actions that maximize long-term rewards).
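• As a sketch of the first criterion, under the same Gymnasium-style assumptions as the training loop above, the average cumulative reward per episode when the agent acts purely greedily can be measured like this:

```python
import numpy as np

def evaluate(q_table, env, n_episodes=100):
    """Average total reward per episode when acting greedily on the learned Q-table."""
    returns = []
    for _ in range(n_episodes):
        state, _ = env.reset()
        total, done = 0.0, False
        while not done:
            action = int(np.argmax(q_table[state]))                    # always exploit
            state, reward, terminated, truncated, _ = env.step(action)
            total += reward
            done = terminated or truncated
        returns.append(total)
    return sum(returns) / n_episodes
```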
• Numerical Problem Example:
• Consider a simple environment where:
• 3 states: S1, S2, S3
• 2 actions: A1 (left), A2 (right)
• Transition and reward matrix:
• From S1, taking A1 moves to S2 with reward +5; A2 moves to S3 with reward 0.
• From S2, taking A1 moves to S1 with reward 0; A2 moves to S3 with reward +10.
• From S3, both actions lead to terminal state with no reward.
• Assume learning rate α=0.5, discount γ=0.9, and initial Q-values as 0.
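• As a worked illustration, suppose the agent happens to take A1 from S1 and then A2 from S2 on its first episode. With all Q-values starting at 0, the two updates are:

```latex
Q(S_1, A_1) \leftarrow 0 + 0.5\,[\,5 + 0.9 \max_{a} Q(S_2, a) - 0\,] = 0.5 \times 5 = 2.5
Q(S_2, A_2) \leftarrow 0 + 0.5\,[\,10 + 0.9 \max_{a} Q(S_3, a) - 0\,] = 0.5 \times 10 = 5.0
```

• On later episodes these estimates are refined; for example, the next visit to (S1, A1) would use the updated max Q(S2, ·) = 5.0 in its TD target.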
• Advantages of Q-Learning:
1.Model-Free:
1. Q-learning does not require a model of the environment, meaning the agent doesn't
need to know the environment's dynamics beforehand. This makes it highly versatile
for different types of environments.
2.Guaranteed Convergence:
1. Given infinite exploration and a suitable learning rate, Q-learning will converge to an
optimal policy. This is a significant advantage when searching for the best action-
selection policy over time.
3.Simple Implementation:
1. The algorithm is relatively simple and easy to implement, especially in discrete state-
action spaces. It requires only a Q-table and simple updates.
• Exploration-Exploitation Balance:
• Q-learning naturally supports the exploration-exploitation trade-off, allowing
the agent to explore the environment and gradually learn the optimal
strategy.
• Off-Policy Learning:
• Q-learning is off-policy, meaning it can learn from actions taken outside the
current policy. This makes it more flexible and allows it to learn from different
sources (like simulated experiences or another agent's actions).
• Widely Applicable:
• Q-learning can be applied to a variety of domains, including robotics, game AI,
finance, and more. It is effective in situations where learning from trial and
error is necessary.
• Disadvantages of Q-Learning:
1.High Memory Usage for Large State Spaces:
1. Q-learning uses a Q-table to store values for every state-action pair. In large or
continuous state spaces, this leads to high memory consumption. The algorithm
doesn't scale well for very large environments.
2.Slow Convergence:
1. In complex environments with many states and actions, Q-learning may take a long
time to converge to an optimal policy, especially if the agent needs to explore many
different paths.
3.Lack of Generalization:
1. Since Q-learning assigns a specific value to each state-action pair, it does not generalize
well. Small changes in states (e.g., nearby grid cells in a game) are treated
independently, leading to inefficiencies.
4.Sensitive to Parameter Tuning:
1. The performance of Q-learning depends heavily on the selection of parameters like the
learning rate (α), discount factor (γ), and exploration rate (ϵ). Poor choices can result in
suboptimal learning or slow convergence.
• Inefficient in Continuous Action Spaces:
• In environments with continuous actions, Q-learning struggles since it needs
to maintain and update Q-values for all possible action-state pairs. This leads
to inefficiencies, and alternative methods like Deep Q-Networks (DQN) are
preferred in such cases.
• Exploration vs. Exploitation Balance:
• While Q-learning addresses the exploration-exploitation dilemma, balancing
them effectively over time is still challenging. Excessive exploration can slow
down learning, while too little exploration might result in a suboptimal policy.
• Note: Q-learning is an efficient and effective algorithm for small to medium-
sized problems, but it faces scalability issues in larger, more complex
environments.
• It is also sensitive to hyperparameter settings, which require
careful tuning for successful learning.
• Neural Network Refinement in Q-Learning (Deep Q-Networks - DQNs)
• Neural Network Refinement in Q-Learning enhances the original Q-learning
algorithm by replacing the Q-table with a neural network.
• The neural network takes raw pixel input (game screens), processes it through
convolutional layers, and outputs Q-values for each possible action (e.g.,
moving left, right, or firing).
• Over time, the network learns to master the game by updating its Q-values to
maximize rewards, like breaking bricks in Breakout or eating pellets in Pac-
Man.
• Another use case is in robotics, where DQNs can help robots navigate, grasp
objects, or perform other tasks in complex environments by learning from
interaction rather than being explicitly programmed.
• Note: Neural Network Refinement in Q-learning, especially using Deep Q-
Networks (DQNs), significantly enhances the original algorithm's ability to
handle complex environments with large or continuous state spaces.
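• A minimal sketch of the idea, assuming PyTorch (an assumption on our part): a small fully connected network maps a state vector to one Q-value per action, replacing the Q-table. The Atari-style setup described above would use convolutional layers over pixel input instead:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per action, replacing the Q-table."""
    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        return self.net(state)                        # shape: (batch, n_actions)

# Greedy action selection from the network's Q-value estimates.
q_net = QNetwork(state_dim=4, n_actions=2)            # dimensions are placeholders
state = torch.zeros(1, 4)                             # placeholder state vector
action = int(q_net(state).argmax(dim=1))
```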
• Acting greedily on the learned Q-values ensures the agent
always exploits the best-known action.
• However, the basic greedy approach may fail in cases where the agent needs
to explore the environment to discover better actions.
• To address this, refinements like epsilon-greedy or softmax exploration are
often used.
• 1. Epsilon-Greedy Policy:
• In epsilon-greedy refinement, the agent follows the greedy policy most of the
time but occasionally takes random actions to explore the environment. The
policy is refined to balance exploration and exploitation:
• With probability ϵ, the agent explores by choosing a random action.
• With probability 1−ϵ, it exploits by selecting the action with the highest Q-
value.
• Over time, ϵ decays, so the agent explores less and focuses more on exploiting
the learned policy.
• 2. Advantages:
• Exploration: Ensures the agent explores the environment and doesn't get
stuck in local optima by selecting random actions at times.
• Simplicity: Easy to implement and tune with a decaying ϵ value over
time.
• Efficiency: Works well with smaller environments where optimal actions can
be found through exploration.
• 3. Disadvantages:
• Suboptimal Behavior: During the early phases of training, the random actions
might lead to suboptimal rewards.
• Fixed Exploration: Even with epsilon decay, the exploration might not be
sufficient for large state spaces. A fixed decay rate might be inappropriate for
some environments.
• Inefficiency in Large State Spaces: In environments with large or continuous
state spaces, random exploration becomes less efficient as the chance of
finding optimal actions decreases.
• Neural networks extract identifying features from data without
any pre-programmed understanding. Network components include
neurons, connections, weights, biases, propagation functions,
and a learning rule. Neurons receive inputs and produce outputs
governed by thresholds and activation functions.