Reinforcement Learning
INTRODUCTION TO Q-LEARNING
Q-learning is a type of model-free reinforcement learning. In
reinforcement learning, an agent interacts with an environment and
learns to perform actions that maximize cumulative reward over
time. Q-learning is significant because it lets the agent learn the
optimal policy without requiring a model of the environment.
Goal: Find the optimal action-selection policy (or strategy) that
maximizes total reward.
Policy: A rule or mapping from states to actions that guides the
agent in the environment.
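A policy can be as simple as a lookup table from states to actions. The short Python sketch below is only an illustration; the state and action names are made up and not part of the example used later in these slides.

# A deterministic policy: each state maps to exactly one action.
# The state and action names here are illustrative only.
policy = {
    "room_a": "go_right",
    "hallway": "go_right",
    "exit_door": "leave",
}

def act(state):
    # Return the action the policy prescribes for this state.
    return policy[state]

print(act("hallway"))  # -> go_right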
THE BELLMAN EQUATION
The Bellman Equation serves as the foundation for dynamic
programming in reinforcement learning. It decomposes the value of
a decision into the immediate reward plus the value of the next
state, which is crucial in Q-learning.
For a state s and action a, the Q-value, Q(s,a), represents the
expected cumulative reward the agent will receive from taking
action a in state s and following the optimal policy afterward.
The Bellman equation in Q-learning is as follows:
Q(s, a) = r(s, a) + γ · max_{a′} Q(s′, a′)
where:
Q(s, a): Q-value for taking action a in state s
r(s, a): immediate reward received after taking action a in state s
γ: discount factor (0 ≤ γ < 1), which balances the importance of
immediate and future rewards
s′: the next state resulting from taking action a in state s
max_{a′} Q(s′, a′): maximum future Q-value obtainable from the next
state s′ by taking the best possible action a′
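As a concrete illustration of the right-hand side of the equation, the Python sketch below evaluates the Bellman target for one state-action pair. The Q-table, reward, and transition are made-up values, and NumPy is assumed.

import numpy as np

# Hypothetical Q-table: 3 states x 2 actions (arbitrary values).
Q = np.array([[0.0, 1.0],
              [2.0, 0.5],
              [0.0, 3.0]])

gamma = 0.9      # discount factor, 0 <= gamma < 1
s, a = 0, 1      # current state and chosen action
r = 1.0          # immediate reward r(s, a), assumed for illustration
s_next = 1       # next state s' reached by taking a in s

# Bellman target: r(s, a) + gamma * max over a' of Q(s', a')
target = r + gamma * np.max(Q[s_next])
print(target)    # 1.0 + 0.9 * 2.0 = 2.8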
Q-LEARNING ALGORITHM
The Q-learning algorithm is iterative: after each transition it moves
the current Q-value toward the Bellman target,
Q(s, a) ← Q(s, a) + α · [ r(s, a) + γ · max_{a′} Q(s′, a′) − Q(s, a) ],
where α is the learning rate.
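A minimal tabular Q-learning loop might look like the following sketch. The environment interface (reset() returning a state, step(action) returning next state, reward, and a done flag) and the hyperparameter values are assumptions, not part of the original slides.

import numpy as np

def q_learning(env, n_states, n_actions,
               episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    # Tabular Q-learning with an epsilon-greedy behaviour policy.
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Explore with probability epsilon, otherwise exploit.
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # Update toward the Bellman target.
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next
    return Q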
CONVERGENCE
Q-learning converges to the optimal policy as long as:
Each state-action pair is visited infinitely often.
The learning rate α is properly adjusted, usually decaying over time.
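One common way to decay the learning rate is an inverse-time schedule, sketched below; this is one choice among many, and the constants are arbitrary.

def decayed_alpha(t, alpha0=0.5, decay=0.01):
    # Learning rate shrinks as the number of updates t grows.
    return alpha0 / (1.0 + decay * t)

for t in (0, 100, 1000):
    print(t, decayed_alpha(t))   # 0.5, 0.25, ~0.045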
EXPLORATION VS. EXPLOITATION
A key challenge in Q-learning is balancing exploration (trying new
actions) and exploitation (choosing the best-known action).
Strategies like epsilon-greedy are often used, where the agent
chooses a random action with probability ε and the best-known
action with probability 1 − ε.
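A sketch of epsilon-greedy selection over one row of a Q-table (NumPy assumed):

import numpy as np

def epsilon_greedy(Q, state, epsilon):
    # Explore: random action with probability epsilon.
    if np.random.rand() < epsilon:
        return np.random.randint(Q.shape[1])
    # Exploit: best-known action for this state.
    return int(np.argmax(Q[state]))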
EXAMPLE (MAP)
Suppose we have 5 rooms in a building connected by doors as
shown in the figure below. We'll number each room 0 through 4.
The outside of the building can be thought of as one big room (5).
Notice that doors 1 and 4 lead into the building from room 5
(outside).
EXAMPLE (FIGURE)
[Figure: the rooms and their connecting doors, together with the
corresponding reward matrix R.]
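In code, the reward matrix R can be written as below. The exact door layout comes from the missing figure, so the matrix is an assumption, chosen to be consistent with the values quoted in the solution slides (R(1, 5) = 100, R(3, 1) = 0, three valid actions from state 5, two from state 1); −1 marks pairs of rooms with no connecting door.

import numpy as np

# Rows = current state, columns = action (the room moved into).
# -1: no door, 0: ordinary door, 100: door into the goal state 5.
R = np.array([
    [-1, -1, -1, -1,  0,  -1],   # state 0
    [-1, -1, -1,  0, -1, 100],   # state 1
    [-1, -1, -1,  0, -1,  -1],   # state 2
    [-1,  0,  0, -1,  0,  -1],   # state 3
    [ 0, -1, -1,  0, -1, 100],   # state 4
    [-1,  0, -1, -1,  0, 100],   # state 5
])
gamma = 0.8   # discount factor used in the worked example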
EXAMPLE (SOLUTION)
Suppose the agent is in state 1 and takes the action that moves it to
state 5, so the next state is 5.
Look at the sixth row of the reward matrix R (i.e., state 5): it has 3
possible actions, going to state 1, 4, or 5.
Q(state, action) = R(state, action) + Gamma * Max[Q(next state, all actions)]
Q(1, 5) = R(1, 5) + 0.8 * Max[Q(5, 1), Q(5, 4), Q(5, 5)] = 100 + 0.8 * 0 = 100
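Spelled out in code (the Q matrix is still all zeros at this point, so the max term contributes nothing):

gamma = 0.8
R_1_5 = 100                      # reward for moving from room 1 to 5
Q_5 = [0.0, 0.0, 0.0]            # Q(5, 1), Q(5, 4), Q(5, 5): all still zero
Q_1_5 = R_1_5 + gamma * max(Q_5)
print(Q_1_5)                     # -> 100.0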
EXAMPLE (SOLUTION)
Next, suppose the agent is in state 3 and takes the action that moves
it to state 1, so the next state is 1.
Look at the second row of reward matrix R (i.e. state 1).
It has 2 possible actions: go to state 3 or state 5.
Then, we compute the Q-value:
Q(state, action) = R(state, action) + Gamma * Max[Q(next state, all actions)]
Q(3, 1) = R(3, 1) + 0.8 * Max[Q(1, 3), Q(1, 5)] = 0 + 0.8 * Max(0, 100) = 80
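And the second update in code, using the entry Q(1, 5) = 100 learned in the previous step:

gamma = 0.8
R_3_1 = 0                        # reward for moving from room 3 to 1
Q_1_3, Q_1_5 = 0.0, 100.0        # current entries of row 1 of Q
Q_3_1 = R_3_1 + gamma * max(Q_1_3, Q_1_5)
print(Q_3_1)                     # -> 80.0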
EXAMPLE (SOLUTION)
If our agent learns more through further episodes, the values in
matrix Q will finally converge (the converged Q matrix was shown in
the accompanying figure).
Tracing the best sequence of states is then as simple as following
the links with the highest Q-values at each state, as in the sketch
below.
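Putting the pieces together, the following sketch trains on the rooms example and then traces a greedy path. The R matrix is the assumed layout from earlier, and this simplified variant updates Q directly with the Bellman target (effectively learning rate 1), as in the worked example.

import numpy as np

# Assumed door layout (see the earlier R-matrix sketch).
R = np.array([
    [-1, -1, -1, -1,  0,  -1],
    [-1, -1, -1,  0, -1, 100],
    [-1, -1, -1,  0, -1,  -1],
    [-1,  0,  0, -1,  0,  -1],
    [ 0, -1, -1,  0, -1, 100],
    [-1,  0, -1, -1,  0, 100],
])
gamma, goal = 0.8, 5
Q = np.zeros_like(R, dtype=float)
rng = np.random.default_rng(0)

for _ in range(1000):                        # training episodes
    s = int(rng.integers(0, 6))              # random starting room
    while s != goal:
        actions = np.flatnonzero(R[s] >= 0)  # doors available from s
        a = int(rng.choice(actions))         # explore at random
        Q[s, a] = R[s, a] + gamma * Q[a].max()
        s = a                                # door a leads into room a

# Trace a route from room 2 by always following the largest Q-value.
s, path = 2, [2]
while s != goal:
    s = int(np.argmax(Q[s]))
    path.append(s)
print(path)   # with this layout: [2, 3, 1, 5] (a route via 4 is equally good)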
SUMMARY
Q-learning is a powerful reinforcement learning algorithm that
relies on the Bellman Equation to iteratively learn the optimal
policy. This method's beauty lies in its simplicity and ability to find
optimal policies even without a model of the environment. The
Bellman Equation is central to Q-learning, as it provides the
recursive formula needed to calculate the expected cumulative
reward and thus guide the agent toward optimal behavior.