
TOPIC 3

REINFORCEMENT LEARNING
BROUGHT TO YOU BY:
W3B JMI
INTRODUCTION TO REINFORCEMENT LEARNING
WHAT IS REINFORCEMENT LEARNING (RL)?
Reinforcement Learning is a type of Machine Learning where an agent learns to make decisions by interacting with an environment. The goal is to maximize the cumulative reward over time. Unlike supervised learning, where the model learns from labelled data, RL relies on feedback in the form of rewards or penalties.

Key Elements in RL:
o Agent: The learner or decision-maker.
o Environment: The system the agent interacts with.
o Actions (A): Choices the agent can make.
o State (S): A representation of the environment at a specific time.
o Reward (R): Feedback from the environment after an action.
o Policy (π): A strategy that maps states to actions.
(A minimal interaction-loop sketch follows this list.)
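To see how these elements fit together, here is a minimal sketch of the agent-environment interaction loop in Python. It is only illustrative: the env object and its reset/step interface, and the RandomAgent class, are hypothetical placeholders rather than anything defined on these slides.

```python
import random

class RandomAgent:
    """A toy agent whose policy simply picks a random action."""
    def __init__(self, actions):
        self.actions = actions

    def act(self, state):
        # Policy pi: maps the current state to an action (here: uniformly at random).
        return random.choice(self.actions)

def run_episode(env, agent, max_steps=100):
    """Run one episode and return the cumulative (undiscounted) reward."""
    state = env.reset()                          # initial state S from the environment
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.act(state)                # action A chosen by the policy
        state, reward, done = env.step(action)   # assumed interface: next state, reward R, done flag
        total_reward += reward
        if done:
            break
    return total_reward
```

Any RL algorithm discussed later (Value Iteration, Policy Iteration, Q-Learning) ultimately plugs a smarter policy into a loop of this shape.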
DIFFERENCES BETWEEN RL, SUPERVISED LEARNING, AND UNSUPERVISED LEARNING:

FEATURE           | REINFORCEMENT LEARNING                         | SUPERVISED LEARNING          | UNSUPERVISED LEARNING
GOAL              | Maximize cumulative reward                     | Learn from labelled data     | Find patterns or clusters in unlabelled data
FEEDBACK          | Reward or penalty based on actions             | Explicit labels              | No labels; only data
LEARNING PROCESS  | Trial-and-error (exploration and exploitation) | Training using ground truth  | Discovering hidden structures
REAL-WORLD APPLICATIONS:
1. Robotics: Teaching robots to perform tasks like walking or
picking objects.
2. Gaming: AI agents that can play complex games (e.g., AlphaGo).
3. Autonomous Vehicles: Learning to navigate roads by maximizing
safety and efficiency.
4. Finance: Portfolio management and stock trading.
5. Healthcare: Personalized treatment planning.
MARKOV DECISION PROCESSES (MDP)
WHAT ARE MARKOV DECISION PROCESSES?
Markov Decision Processes (MDPs) provide the formal framework
for decision-making problems where an agent interacts with an
environment. The agent's goal is to make a sequence of decisions
to maximize its total reward over time.
MDPs are fundamental to Reinforcement Learning because they
formalize the interaction between the agent and the environment
using mathematical principles.

KEY COMPONENTS OF AN MDP

An MDP is defined by a tuple (S, A, P, R, γ) where:

1. S (State Space):
o The set of all possible states that the environment can be in.
o Example: In a chess game, each configuration of the chessboard is a state.

2. A (Action Space):
o The set of all possible actions that the agent can take.
o Example: For a robot, actions could include "move forward," "turn left," or "pick up an object."

3. P(s′∣s,a) (Transition Probability):
o The probability of transitioning to a new state s′ when taking action a in state s.
o Example: In a grid world, if the agent moves up, there might be an 80% chance it goes to the intended cell and a 20% chance it lands elsewhere.

4. R(s,a) (Reward Function):
o The immediate reward the agent receives after taking action a in state s.
o Example: In a grid world, reaching the goal cell might give +10, while every other step costs -1.

5. γ (Discount Factor):
o A value between 0 and 1 that determines the importance of future rewards.
o If γ=0, the agent only cares about immediate rewards.
o If γ≈1, future rewards are nearly as important as immediate rewards.
(A quick numeric illustration of the discount factor follows below.)
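The effect of γ is easy to see on a small example. This is a minimal sketch; the reward sequence is made up purely for illustration.

```python
# Effect of the discount factor: discounted return G = sum_t gamma^t * R_{t+1}
rewards = [1, 1, 1, 1, 1]          # hypothetical reward sequence, one reward per step

def discounted_return(rewards, gamma):
    return sum(gamma**t * r for t, r in enumerate(rewards))

print(discounted_return(rewards, 0.0))   # 1.0  -> only the immediate reward counts
print(discounted_return(rewards, 0.9))   # ~4.1 -> future rewards still matter, but less
print(discounted_return(rewards, 1.0))   # 5.0  -> all rewards weighted equally
```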
EXAMPLE: GRID WORLD PROBLEM
Imagine a 4x4 grid world where:
• The agent starts at the top-left corner.
• The goal state is at the bottom-right corner, with a reward of +10.
• There's a pit at one cell with a reward of -5.
• Each step has a small penalty of -1 (to encourage shorter paths).

KEY DETAILS:
• States: Each cell in the grid is a state.
• Actions: Move up, down, left, right.
• Transition Probabilities: Moving in the intended direction occurs 80% of the time, with a 10% chance of veering left or right.
• Rewards: +10 for reaching the goal, -5 for the pit, -1 for each step.
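This grid world can be written down as an MDP in a few lines of Python. The sketch below is only a partial encoding (states, actions, rewards, and the 80/10/10 slip model); the pit cell location is an arbitrary choice for illustration, since the slide does not specify it.

```python
# Minimal encoding of the 4x4 grid world MDP described above (a sketch, not a full simulator).
import random

GRID = 4
STATES = [(r, c) for r in range(GRID) for c in range(GRID)]
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
GOAL, PIT = (3, 3), (1, 2)           # pit cell chosen arbitrarily for illustration

def reward(state):
    if state == GOAL: return +10
    if state == PIT:  return -5
    return -1                         # small step penalty to encourage short paths

def step(state, action):
    """Apply the 80/10/10 slip model: intended move 80%, veer to either side 10% each."""
    veer = {"up": ["left", "right"], "down": ["left", "right"],
            "left": ["up", "down"], "right": ["up", "down"]}
    actual = random.choices([action] + veer[action], weights=[0.8, 0.1, 0.1])[0]
    dr, dc = ACTIONS[actual]
    nr, nc = state[0] + dr, state[1] + dc
    if not (0 <= nr < GRID and 0 <= nc < GRID):   # bumping a wall keeps the agent in place
        nr, nc = state
    return (nr, nc), reward((nr, nc))
```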
FURTHER INSIGHTS
1. Stochastic Behavior: Transition probabilities introduce uncertainty, making the environment realistic for tasks like robotics and gaming.
2. Policy (π): A deterministic policy chooses a specific action for each state. A stochastic policy assigns probabilities to actions (see the sketch below).
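To make the distinction concrete, a policy can be stored very simply in code. The sketch below contrasts a deterministic and a stochastic representation; the specific state-action entries are made up for illustration.

```python
import random

# Deterministic policy: exactly one action per state.
deterministic_policy = {(0, 0): "right", (0, 1): "down"}        # illustrative entries only

# Stochastic policy: a probability distribution over actions per state.
stochastic_policy = {(0, 0): {"right": 0.7, "down": 0.3}}

def sample_action(policy, state):
    choice = policy[state]
    if isinstance(choice, dict):                                 # stochastic case
        actions, probs = zip(*choice.items())
        return random.choices(actions, weights=probs)[0]
    return choice                                                # deterministic case

print(sample_action(deterministic_policy, (0, 0)))   # always "right"
print(sample_action(stochastic_policy, (0, 0)))      # "right" roughly 70% of the time
```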

In the next section, we'll use Bellman Equations to mathematically formalize the optimal policy.
BELLMAN EQUATIONS

INTRODUCTION TO BELLMAN EQUATIONS
Bellman Equations are fundamental to solving Reinforcement Learning problems. They provide a recursive decomposition of the value of a state into immediate rewards and the value of subsequent states.

There are two main forms of Bellman Equations:
1. Bellman Expectation Equation: For evaluating policies.
2. Bellman Optimality Equation: For finding the optimal policy.

STATE-VALUE FUNCTION (V(s))
The state-value function, V(s), represents the expected cumulative reward the agent can achieve starting from state s and following a policy π.

Mathematically:
V^π(s) = E_π[ Σ_{t=0}^{∞} γ^t R_{t+1} | S_0 = s ]
EXPLANATION OF TERMS:
• V^π(s): Value of state s under policy π.
• E_π: Expectation over all possible sequences of actions chosen according to policy π.
• γ^t: Discount factor raised to the power t, reducing the weight of future rewards.
• R_{t+1}: Reward received at time step t+1.

ACTION-VALUE FUNCTION (Q(s,a))
The action-value function, Q(s,a), represents the expected cumulative reward when the agent starts in state s, takes action a, and then follows a policy π.

Mathematically:
Q^π(s,a) = E_π[ Σ_{t=0}^{∞} γ^t R_{t+1} | S_0 = s, A_0 = a ]
BELLMAN EXPECTATION EQUATION
V^π(s) = Σ_a π(a|s) Σ_{s′} P(s′∣s,a) [ R(s,a,s′) + γ V^π(s′) ]

BELLMAN OPTIMALITY EQUATION
V*(s) = max_a Σ_{s′} P(s′∣s,a) [ R(s,a,s′) + γ V*(s′) ]
WHY BELLMAN EQUATIONS MATTER
• Policy Evaluation: Compute the value of a given policy (a short code sketch follows this list).
• Policy Improvement: Derive better policies using value functions.
• Optimization: Solve for the best possible policy using the Bellman Optimality
Equation.
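As a concrete illustration of policy evaluation, the sketch below applies the Bellman Expectation Equation as a repeated sweep over all states until the values stop changing. The dictionary layouts for the policy, P, and R are assumptions made for illustration, not structures defined on the slides.

```python
# Iterative policy evaluation: repeatedly apply the Bellman Expectation Equation
#   V(s) <- sum_a pi(a|s) sum_s' P(s'|s,a) [ R(s,a,s') + gamma * V(s') ]
def policy_evaluation(states, actions, policy, P, R, gamma=0.9, tol=1e-6):
    """Assumed layout: policy[s][a] = pi(a|s); P[(s,a)] = {s': prob}; R[(s,a,s')] = reward."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = sum(policy[s][a] * sum(p * (R[(s, a, s2)] + gamma * V[s2])
                                           for s2, p in P[(s, a)].items())
                        for a in actions)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:          # stop once the largest change in any state's value is tiny
            return V
```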

VALUE ITERATION AND POLICY ITERATION

Introduction
Value Iteration and Policy Iteration are two fundamental algorithms in
Reinforcement Learning for solving Markov Decision Processes (MDPs). Both
methods aim to find the optimal policy, but they achieve this through slightly
different approaches.
VALUE ITERATION

Overview
Value Iteration is an iterative algorithm that directly computes the optimal
value function, V∗(s), and derives the optimal policy π* from it. It is based on
the Bellman Optimality Equation.
Example: Grid World
Scenario:
• A 3x3 grid where:
o The agent starts at any state.
o Goal state (3,3) has a reward of +10.
o Other states have a step penalty of -1.
o Actions: Up, Down, Left, Right.
o γ=0.9.
STEP-BY-STEP EXECUTION:
1. Initialize V(s) to 0 for all states.
2. Iteratively update V(s) using the Bellman Optimality Equation.
o Example: For state (2,2), calculate the value of moving in all directions and take the maximum.
3. Extract π*(s) once V(s) converges (a code sketch of these steps follows below).
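The three steps above can be sketched compactly in Python, reusing the hypothetical P[(s,a)] = {s′: prob} and R[(s,a,s′)] dictionary layout from the policy-evaluation sketch earlier. This is an illustrative sketch under those assumptions, not a definitive implementation.

```python
# Value Iteration: apply the Bellman Optimality Equation until V converges,
# then read off the greedy (optimal) policy.
def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-6):
    V = {s: 0.0 for s in states}                       # Step 1: initialize V(s) = 0
    while True:
        delta = 0.0
        for s in states:                               # Step 2: Bellman Optimality backup
            q = [sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in P[(s, a)].items())
                 for a in actions]
            best = max(q)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    # Step 3: extract pi*(s) = argmax_a Q(s,a) from the converged values
    policy = {s: max(actions, key=lambda a: sum(p * (R[(s, a, s2)] + gamma * V[s2])
                                                for s2, p in P[(s, a)].items()))
              for s in states}
    return V, policy
```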

Advantages of Value Iteration:


• Simpler to implement compared to Policy Iteration.
• Efficient in environments with large state spaces.

Policy Iteration
Overview
Policy Iteration is an iterative algorithm that alternates between evaluating a policy and
improving it until the optimal policy π* is found.
Example: Grid World
Scenario:
• Same 3x3 grid as before.
Step-by-Step Execution:
1. Initialize a random policy (e.g., move randomly in any direction).
2. Policy Evaluation: Compute Vπ(s) for all states using the current policy.
o Example: For state (2,2), compute Vπ(2,2) assuming the agent moves randomly.
3. Policy Improvement: Update the policy based on the new Vπ(s).
o Example: If moving right from (2,2) gives the highest reward, update the policy to
choose "right".
4. Repeat until the policy converges (a code sketch of this loop follows below).
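The evaluate-then-improve loop can be sketched as follows, again assuming the same hypothetical P/R dictionary layout and storing a deterministic policy as a state-to-action mapping (the slides suggest a random initial policy; here the first action is used for simplicity).

```python
# Policy Iteration: evaluate the current policy, then greedily improve it,
# and repeat until the policy stops changing.
def policy_iteration(states, actions, P, R, gamma=0.9, tol=1e-6):
    policy = {s: actions[0] for s in states}          # arbitrary initial (deterministic) policy
    while True:
        # Policy Evaluation for the current deterministic policy
        V = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:
                a = policy[s]
                v_new = sum(p * (R[(s, a, s2)] + gamma * V[s2])
                            for s2, p in P[(s, a)].items())
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < tol:
                break
        # Policy Improvement: act greedily with respect to the new V
        stable = True
        for s in states:
            best = max(actions, key=lambda a: sum(p * (R[(s, a, s2)] + gamma * V[s2])
                                                  for s2, p in P[(s, a)].items()))
            if best != policy[s]:
                policy[s] = best
                stable = False
        if stable:                                    # policy unchanged -> it is optimal
            return V, policy
```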

Advantages of Policy Iteration:


• Converges in fewer iterations compared to Value Iteration.
• Guarantees a valid policy at every step.
COMPARISON: VALUE ITERATION VS. POLICY ITERATION

ASPECT              | VALUE ITERATION                                              | POLICY ITERATION
APPROACH            | Focuses on optimizing the value function directly.           | Alternates between policy evaluation and improvement.
CONVERGENCE SPEED   | May require more iterations for convergence.                 | Fewer iterations due to full policy updates.
COMPUTATIONAL COST  | Lower per iteration, but more iterations needed.             | Higher per iteration, but fewer iterations needed.
USE CASE            | Large state spaces where partial convergence is acceptable.  | Smaller state spaces with exact solutions.
Conclusion
• Both algorithms use Bellman Equations to compute value functions and optimal
policies.
• Value Iteration is simpler and often preferred for large-scale problems.
• Policy Iteration converges faster but can be computationally expensive for large
state spaces.

Q-LEARNING

Introduction
Q-Learning is one of the most widely used model-free Reinforcement Learning algorithms. Unlike Value Iteration and Policy Iteration, Q-Learning doesn't require knowledge of the environment's dynamics (i.e., the transition probabilities P(s′∣s,a) and rewards R(s,a,s′)). Instead, it learns the optimal action-value function, Q*(s,a), directly through experience.
This makes Q-Learning particularly powerful for problems where the environment is
unknown or too complex to model explicitly.
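Concretely, Q-Learning keeps a table of Q(s,a) values and, after every interaction step, applies the temporal-difference update Q(s,a) ← Q(s,a) + α [ R + γ max_a′ Q(s′,a′) − Q(s,a) ]. The sketch below is a minimal tabular version; the env.reset/env.step interface and the hyperparameter values are assumptions for illustration, not taken from the slides.

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-Learning with an epsilon-greedy behaviour policy (a sketch)."""
    Q = defaultdict(float)                              # Q[(state, action)], default 0
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # Epsilon-greedy action selection: explore with prob. epsilon, else exploit.
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)  # assumed env interface
            # Off-policy temporal-difference update toward the greedy next-state value.
            best_next = max(Q[(next_state, a)] for a in actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```

The epsilon-greedy choice is one simple way to handle the exploration-exploitation trade-off noted in the limitations below.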
ADVANTAGES OF Q-LEARNING
• Model-Free: Does not require knowledge of P(s′∣s,a) or R(s,a,s′).
• Off-Policy: Learns the optimal policy independently of the agent's current actions.
• Flexibility: Can handle large or continuous state-action spaces using function approximation (e.g., deep Q-networks).

LIMITATIONS OF Q-LEARNING
• Slow Convergence: Can take a long time in environments with large state spaces.
• Exploration-Exploitation Trade-off: Proper balancing is critical for success.
• High Memory Usage: Maintaining a Q-table for large state-action spaces is computationally expensive.
EXTENSIONS OF Q-LEARNING
1. Deep Q-Learning (DQN): Uses a neural network to approximate Q(s,a) for large state-action spaces.
2. Double Q-Learning: Reduces the overestimation bias in Q-values by maintaining two Q-functions.
3. SARSA (State-Action-Reward-State-Action): An on-policy alternative to Q-Learning (contrasted with the update-rule sketch below).
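To make the on-policy/off-policy distinction concrete, the two algorithms differ only in the bootstrap target of their updates. The sketch below shows both update rules side by side, assuming a Q-table keyed by (state, action) pairs; the argument a2 in the SARSA update is the action actually chosen next by the behaviour policy.

```python
def q_learning_update(Q, s, a, r, s2, actions, alpha=0.1, gamma=0.9):
    # Off-policy: the target uses the greedy (maximal) value at the next state.
    target = r + gamma * max(Q[(s2, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def sarsa_update(Q, s, a, r, s2, a2, alpha=0.1, gamma=0.9):
    # On-policy: the target uses the action a2 the behaviour policy actually takes at s2.
    target = r + gamma * Q[(s2, a2)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```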

CONCLUSION
Q-Learning is a powerful algorithm for learning optimal policies in unknown
environments. It forms the basis for many advanced RL techniques, including deep
reinforcement learning methods like DQN.
THANK YOU!
