REINFORCEMENT LEARNING
BROUGHT TO YOU BY: W3B JMI
INTRODUCTION TO REINFORCEMENT LEARNING
WHAT IS REINFORCEMENT LEARNING (RL)?
Reinforcement Learning is a type of Machine Learning where an agent learns to make decisions by interacting with an environment. The goal is to maximize the cumulative reward over time. Unlike supervised learning, where the model learns from labelled data, RL relies on feedback in the form of rewards or penalties.

Key Elements in RL:
o Agent: The learner or decision-maker.
o Environment: The system the agent interacts with.
o Actions (A): Choices the agent can make.
o State (S): A representation of the environment at a specific time.
o Reward (R): Feedback from the environment after an action.
o Policy (π): A strategy that maps states to actions.
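To make these elements concrete, the minimal Python sketch below wires them together in the basic agent-environment loop. The CoinFlipEnvironment and the random policy are made-up illustrations for this sketch, not part of any standard RL library.

import random

# A toy environment: the agent guesses "heads" or "tails" each step and
# receives +1 for a correct guess and -1 otherwise (illustrative only).
class CoinFlipEnvironment:
    def reset(self):
        self.steps = 0
        return "start"                            # initial state S

    def step(self, action):
        self.steps += 1
        outcome = random.choice(["heads", "tails"])
        reward = 1 if action == outcome else -1   # reward R from the environment
        done = self.steps >= 10
        return outcome, reward, done              # next state, reward, terminal flag

# A policy π that maps states to actions; here it simply acts at random.
def random_policy(state):
    return random.choice(["heads", "tails"])      # actions A

env = CoinFlipEnvironment()
state, total_reward, done = env.reset(), 0, False
while not done:                                   # the agent-environment interaction loop
    action = random_policy(state)
    state, reward, done = env.step(action)
    total_reward += reward                        # cumulative reward the agent tries to maximize
print("Cumulative reward:", total_reward)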
DIFFERENCES BETWEEN RL, SUPERVISED LEARNING, AND UNSUPERVISED LEARNING:

ASPECT           | REINFORCEMENT LEARNING                          | SUPERVISED LEARNING          | UNSUPERVISED LEARNING
FEEDBACK         | Reward or penalty based on actions              | Explicit labels              | No labels; only data
LEARNING PROCESS | Trial-and-error (exploration and exploitation)  | Training using ground truth  | Discovering hidden structures
REAL-WORLD APPLICATIONS:
1. Robotics: Teaching robots to perform tasks like walking or
picking objects.
2. Gaming: AI agents that can play complex games (e.g., AlphaGo).
3. Autonomous Vehicles: Learning to navigate roads by maximizing
safety and efficiency.
4. Finance: Portfolio management and stock trading.
5. Healthcare: Personalized treatment planning.
MARKOV DECISION PROCESSES (MDP)
WHAT ARE MARKOV DECISION PROCESSES?
Markov Decision Processes (MDPs) provide the formal framework
for decision-making problems where an agent interacts with an
environment. The agent's goal is to make a sequence of decisions
to maximize its total reward over time.
MDPs are fundamental to Reinforcement Learning because they
formalize the interaction between the agent and the environment
using mathematical principles.
2. Policy (π):
A deterministic policy chooses a
specific action for each state. A
stochastic policy assigns
probabilities to actions.
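As a quick illustration (the state and action names below are invented for the example), the two kinds of policy can be written as simple Python mappings:

# Deterministic policy: each state is mapped to exactly one action.
deterministic_policy = {
    "s1": "right",
    "s2": "up",
}

# Stochastic policy: each state is mapped to a probability distribution over actions.
stochastic_policy = {
    "s1": {"right": 0.8, "up": 0.2},
    "s2": {"right": 0.5, "up": 0.5},
}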
Introduction
Value Iteration and Policy Iteration are two fundamental algorithms in
Reinforcement Learning for solving Markov Decision Processes (MDPs). Both
methods aim to find the optimal policy, but they achieve this through slightly
different approaches.
VALUE ITERATION
Overview
Value Iteration is an iterative algorithm that directly computes the optimal
value function, V∗(s), and derives the optimal policy π* from it. It is based on
the Bellman Optimality Equation.
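For reference, the Bellman Optimality Equation in its standard form, written in the same notation used here, is:

V∗(s) = max_a Σ_s′ P(s′∣s,a) [ R(s,a,s′) + γ V∗(s′) ]

Value Iteration repeatedly applies this update to every state until the values stop changing.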
Example: Grid World
Scenario:
• A 3x3 grid where:
o The agent starts at any state.
o Goal state (3,3) has a reward of +10.
o Other states have a step penalty of -1.
o Actions: Up, Down, Left, Right.
o γ=0.9.
Step-by-Step Execution:
1. Initialize V(s) to 0 for all states.
2. Iteratively update V(s) using the Bellman Optimality Equation.
o Example: For state (2,2), calculate the value of moving in all directions and take the maximum.
3. Extract π*(s) once V(s) converges.
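A minimal Python sketch of these steps on the 3x3 grid follows. It assumes details the slide does not spell out: moves are deterministic, moving off the grid leaves the agent in place, the goal state (3,3) is terminal, and the +10 reward is received on entering it.

# Value Iteration on the 3x3 grid world described above (assumptions noted in the text).
GAMMA = 0.9
STEP_PENALTY = -1
GOAL, GOAL_REWARD = (3, 3), 10
STATES = [(r, c) for r in range(1, 4) for c in range(1, 4)]
MOVES = {"Up": (-1, 0), "Down": (1, 0), "Left": (0, -1), "Right": (0, 1)}

def next_state(s, a):
    r, c = s[0] + MOVES[a][0], s[1] + MOVES[a][1]
    return (r, c) if 1 <= r <= 3 and 1 <= c <= 3 else s   # off-grid moves keep the agent in place

def reward(s_next):
    return GOAL_REWARD if s_next == GOAL else STEP_PENALTY

# Step 1: initialize V(s) to 0 for all states.
V = {s: 0.0 for s in STATES}

# Step 2: repeatedly apply the Bellman Optimality update until V(s) converges.
while True:
    delta = 0.0
    for s in STATES:
        if s == GOAL:
            continue  # terminal state keeps value 0
        best = max(reward(next_state(s, a)) + GAMMA * V[next_state(s, a)] for a in MOVES)
        delta = max(delta, abs(best - V[s]))
        V[s] = best
    if delta < 1e-6:
        break

# Step 3: extract the optimal policy π*(s) from the converged values.
policy = {s: max(MOVES, key=lambda a: reward(next_state(s, a)) + GAMMA * V[next_state(s, a)])
          for s in STATES if s != GOAL}
print(V[(2, 2)], policy[(2, 2)])

Under these assumptions the values converge to, for example, V(2,2) = 8.0, and the extracted action at (2,2) points toward the goal.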
Policy Iteration
Overview
Policy Iteration is an iterative algorithm that alternates between evaluating a policy and
improving it until the optimal policy π* is found.
Example: Grid World
Scenario:
• Same 3x3 grid as before.
Step-by-Step Execution:
1. Initialize a random policy (e.g., move randomly in any direction).
2. Policy Evaluation: Compute Vπ(s) for all states using the current policy.
o Example: For state (2,2), compute Vπ(2,2) assuming the agent moves randomly.
3. Policy Improvement: Update the policy based on the new Vπ(s).
o Example: If moving right from (2,2) gives the highest reward, update the policy to
choose "right".
4. Repeat until the policy converges.
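A matching Python sketch of Policy Iteration on the same grid follows, under the same assumptions as the Value Iteration sketch (deterministic moves, off-grid moves leave the agent in place, the goal is terminal); the 100-sweep policy evaluation is a simple iterative approximation rather than an exact linear solve.

import random

# Policy Iteration on the same 3x3 grid world.
GAMMA = 0.9
STEP_PENALTY, GOAL, GOAL_REWARD = -1, (3, 3), 10
STATES = [(r, c) for r in range(1, 4) for c in range(1, 4)]
MOVES = {"Up": (-1, 0), "Down": (1, 0), "Left": (0, -1), "Right": (0, 1)}

def next_state(s, a):
    r, c = s[0] + MOVES[a][0], s[1] + MOVES[a][1]
    return (r, c) if 1 <= r <= 3 and 1 <= c <= 3 else s

def reward(s_next):
    return GOAL_REWARD if s_next == GOAL else STEP_PENALTY

# Step 1: initialize a random (deterministic) policy.
policy = {s: random.choice(list(MOVES)) for s in STATES if s != GOAL}

while True:
    # Step 2: policy evaluation -- compute Vπ(s) for the current policy.
    V = {s: 0.0 for s in STATES}
    for _ in range(100):                      # fixed number of evaluation sweeps
        for s in policy:
            s2 = next_state(s, policy[s])
            V[s] = reward(s2) + GAMMA * V[s2]

    # Step 3: policy improvement -- act greedily with respect to Vπ.
    new_policy = {s: max(MOVES, key=lambda a: reward(next_state(s, a)) + GAMMA * V[next_state(s, a)])
                  for s in policy}

    # Step 4: repeat until the policy no longer changes.
    if new_policy == policy:
        break
    policy = new_policy

print(policy[(2, 2)])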
ASPECT   | VALUE ITERATION                                     | POLICY ITERATION
APPROACH | Focuses on optimizing the value function directly.  | Alternates between policy evaluation and improvement.
Q-LEARNING
Introduction
Q-Learning is one of the most widely used model-free Reinforcement Learning
algorithms. Unlike Value Iteration and Policy Iteration, Q-Learning doesn't require
knowledge of the environment's dynamics (i.e., transition probabilities P(s′∣s,a) and
rewards R(s,a,s′)). Instead, it learns the optimal action-value function, Q∗(s,a), directly
through experience.
This makes Q-Learning particularly powerful for problems where the environment is
unknown or too complex to model explicitly.
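Concretely, the agent repeatedly applies the standard Q-Learning update, Q(s,a) ← Q(s,a) + α [ R + γ max_a′ Q(s′,a′) − Q(s,a) ], to experience it gathers. The Python sketch below does this on the same 3x3 grid world used earlier, with ε-greedy exploration; the learning rate α = 0.1, ε = 0.1, and 5000 episodes are illustrative choices, not values from the slides.

import random

# Tabular Q-Learning on the 3x3 grid world.
GAMMA, ALPHA, EPSILON = 0.9, 0.1, 0.1
STEP_PENALTY, GOAL, GOAL_REWARD = -1, (3, 3), 10
STATES = [(r, c) for r in range(1, 4) for c in range(1, 4)]
MOVES = {"Up": (-1, 0), "Down": (1, 0), "Left": (0, -1), "Right": (0, 1)}

def step(s, a):
    # Simulated environment step; the agent itself never sees these dynamics.
    r, c = s[0] + MOVES[a][0], s[1] + MOVES[a][1]
    s_next = (r, c) if 1 <= r <= 3 and 1 <= c <= 3 else s
    return s_next, (GOAL_REWARD if s_next == GOAL else STEP_PENALTY)

Q = {(s, a): 0.0 for s in STATES for a in MOVES}  # action-value table Q(s, a)

for _ in range(5000):                             # episodes of experience
    s = random.choice([x for x in STATES if x != GOAL])
    while s != GOAL:
        # Epsilon-greedy: explore with probability EPSILON, otherwise act greedily.
        if random.random() < EPSILON:
            a = random.choice(list(MOVES))
        else:
            a = max(MOVES, key=lambda m: Q[(s, m)])
        s_next, r = step(s, a)
        best_next = 0.0 if s_next == GOAL else max(Q[(s_next, m)] for m in MOVES)
        # Q-Learning update: Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
        s = s_next

# Greedy policy read off the learned Q-values.
print({s: max(MOVES, key=lambda m: Q[(s, m)]) for s in STATES if s != GOAL})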
ADVANTAGES OF Q-LEARNING
• Model-Free: Does not require knowledge of P(s′∣s,a) or R(s,a,s′).
• Off-Policy: Learns the optimal policy independently of the agent's current actions.

LIMITATIONS OF Q-LEARNING
• Slow Convergence: Can take a long time in environments with large state spaces.
• Exploration-Exploitation Trade-off: Proper balancing is critical for success.
CONCLUSION
Q-Learning is a powerful algorithm for learning optimal policies in unknown
environments. It forms the basis for many advanced RL techniques, including deep
reinforcement learning methods like DQN.
THANK YOU!