
TOPIC 3

REINFORCEMENT LEARNING
BROUGHT TO YOU BY:
W3B JMI
INTRODUCTION TO REINFORCEMENT LEARNING
WHAT IS REINFORCEMENT LEARNING (RL)?
Reinforcement Learning is a type of Machine Learning where an agent learns to make decisions by interacting with an environment. The goal is to maximize the cumulative reward over time. Unlike supervised learning, where the model learns from labelled data, RL relies on feedback in the form of rewards or penalties.

Key Elements in RL:
o Agent: The learner or decision-maker.
o Environment: The system the agent interacts with.
o Actions (A): Choices the agent can make.
o State (S): A representation of the environment at a specific time.
o Reward (R): Feedback from the environment after an action.
o Policy (π): A strategy that maps states to actions.
(A minimal interaction-loop sketch follows this list.)
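To see how these elements fit together, here is a minimal sketch of the agent-environment interaction loop in Python. It is only illustrative: the env object and its reset/step interface, and the RandomAgent class, are hypothetical placeholders rather than anything defined on these slides.

```python
import random

class RandomAgent:
    """A toy agent whose policy simply picks a random action."""
    def __init__(self, actions):
        self.actions = actions

    def act(self, state):
        # Policy pi: maps the current state to an action (here: uniformly at random).
        return random.choice(self.actions)

def run_episode(env, agent, max_steps=100):
    """Run one episode and return the cumulative (undiscounted) reward."""
    state = env.reset()                          # initial state S from the environment
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.act(state)                # action A chosen by the policy
        state, reward, done = env.step(action)   # assumed interface: next state, reward R, done flag
        total_reward += reward
        if done:
            break
    return total_reward
```

Any RL algorithm discussed later (Value Iteration, Policy Iteration, Q-Learning) ultimately plugs a smarter policy into a loop of this shape.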
DIFFERENCES BETWEEN RL, SUPERVISED LEARNING, AND UNSUPERVISED LEARNING:

FEATURE           | REINFORCEMENT LEARNING                         | SUPERVISED LEARNING          | UNSUPERVISED LEARNING
GOAL              | Maximize cumulative reward                     | Learn from labelled data     | Find patterns or clusters in unlabelled data
FEEDBACK          | Reward or penalty based on actions             | Explicit labels              | No labels; only data
LEARNING PROCESS  | Trial-and-error (exploration and exploitation) | Training using ground truth  | Discovering hidden structures
REAL-WORLD APPLICATIONS:
1. Robotics: Teaching robots to perform tasks like walking or
picking objects.
2. Gaming: AI agents that can play complex games (e.g., AlphaGo).
3. Autonomous Vehicles: Learning to navigate roads by maximizing
safety and efficiency.
4. Finance: Portfolio management and stock trading.
5. Healthcare: Personalized treatment planning.
MARKOV DECISION PROCESSES (MDP)
WHAT ARE MARKOV DECISION PROCESSES?
Markov Decision Processes (MDPs) provide the formal framework
for decision-making problems where an agent interacts with an
environment. The agent's goal is to make a sequence of decisions
to maximize its total reward over time.
MDPs are fundamental to Reinforcement Learning because they
formalize the interaction between the agent and the environment
using mathematical principles.

KEY COMPONENTS OF AN MDP

An MDP is defined by a tuple (S, A, P, R, γ) where:

1. S (State Space):
o The set of all possible states that the environment can be in.
o Example: In a chess game, each configuration of the chessboard is a state.

2. A (Action Space):
o The set of all possible actions that the agent can take.
o Example: For a robot, actions could include "move forward," "turn left," or "pick up an object."

3. P(s′∣s,a) (Transition Probability):
o The probability of transitioning to a new state s′ when taking action a in state s.
o Example: In a grid world, if the agent moves up, there might be an 80% chance it goes to the intended cell and a 20% chance it lands elsewhere.

4. R(s,a) (Reward Function):
o The immediate reward the agent receives after taking action a in state s.
o Example: In a grid world, reaching the goal cell might give +10, while every other step costs -1.

5. γ (Discount Factor):
o A value between 0 and 1 that determines the importance of future rewards.
o If γ=0, the agent only cares about immediate rewards.
o If γ≈1, future rewards are nearly as important as immediate rewards.
(A quick numeric illustration of the discount factor follows below.)
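The effect of γ is easy to see on a small example. This is a minimal sketch; the reward sequence is made up purely for illustration.

```python
# Effect of the discount factor: discounted return G = sum_t gamma^t * R_{t+1}
rewards = [1, 1, 1, 1, 1]          # hypothetical reward sequence, one reward per step

def discounted_return(rewards, gamma):
    return sum(gamma**t * r for t, r in enumerate(rewards))

print(discounted_return(rewards, 0.0))   # 1.0  -> only the immediate reward counts
print(discounted_return(rewards, 0.9))   # ~4.1 -> future rewards still matter, but less
print(discounted_return(rewards, 1.0))   # 5.0  -> all rewards weighted equally
```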
EXAMPLE: GRID WORLD PROBLEM
Imagine a 4x4 grid world where:
• The agent starts at the top-left corner.
• The goal state is at the bottom-right corner, with a reward of +10.
• There's a pit at one cell with a reward of -5.
• Each step has a small penalty of -1 (to encourage shorter paths).

KEY DETAILS:
• States: Each cell in the grid is a state.
• Actions: Move up, down, left, right.
• Transition Probabilities: Moving in the intended direction occurs 80% of the time, with a 10% chance of veering left or right.
• Rewards: +10 for reaching the goal, -5 for the pit, -1 for each step.
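This grid world can be written down as an MDP in a few lines of Python. The sketch below is only a partial encoding (states, actions, rewards, and the 80/10/10 slip model); the pit cell location is an arbitrary choice for illustration, since the slide does not specify it.

```python
# Minimal encoding of the 4x4 grid world MDP described above (a sketch, not a full simulator).
import random

GRID = 4
STATES = [(r, c) for r in range(GRID) for c in range(GRID)]
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
GOAL, PIT = (3, 3), (1, 2)           # pit cell chosen arbitrarily for illustration

def reward(state):
    if state == GOAL: return +10
    if state == PIT:  return -5
    return -1                         # small step penalty to encourage short paths

def step(state, action):
    """Apply the 80/10/10 slip model: intended move 80%, veer to either side 10% each."""
    veer = {"up": ["left", "right"], "down": ["left", "right"],
            "left": ["up", "down"], "right": ["up", "down"]}
    actual = random.choices([action] + veer[action], weights=[0.8, 0.1, 0.1])[0]
    dr, dc = ACTIONS[actual]
    nr, nc = state[0] + dr, state[1] + dc
    if not (0 <= nr < GRID and 0 <= nc < GRID):   # bumping a wall keeps the agent in place
        nr, nc = state
    return (nr, nc), reward((nr, nc))
```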
FURTHER INSIGHTS
1. Stochastic Behavior: Transition probabilities introduce uncertainty, making the environment realistic for tasks like robotics and gaming.
2. Policy (π): A deterministic policy chooses a specific action for each state. A stochastic policy assigns probabilities to actions (see the sketch below).
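To make the distinction concrete, a policy can be stored very simply in code. The sketch below contrasts a deterministic and a stochastic representation; the specific state-action entries are made up for illustration.

```python
import random

# Deterministic policy: exactly one action per state.
deterministic_policy = {(0, 0): "right", (0, 1): "down"}        # illustrative entries only

# Stochastic policy: a probability distribution over actions per state.
stochastic_policy = {(0, 0): {"right": 0.7, "down": 0.3}}

def sample_action(policy, state):
    choice = policy[state]
    if isinstance(choice, dict):                                 # stochastic case
        actions, probs = zip(*choice.items())
        return random.choices(actions, weights=probs)[0]
    return choice                                                # deterministic case

print(sample_action(deterministic_policy, (0, 0)))   # always "right"
print(sample_action(stochastic_policy, (0, 0)))      # "right" roughly 70% of the time
```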

In the next section, we'll use Bellman Equations to mathematically formalize the optimal policy.
BELLMAN EQUATIONS

INTRODUCTION TO BELLMAN EQUATIONS
Bellman Equations are fundamental to solving Reinforcement Learning problems. They provide a recursive decomposition of the value of a state into immediate rewards and the value of subsequent states.

There are two main forms of Bellman Equations:
1. Bellman Expectation Equation: For evaluating policies.
2. Bellman Optimality Equation: For finding the optimal policy.

STATE-VALUE FUNCTION (V(s))
The state-value function, V(s), represents the expected cumulative reward the agent can achieve starting from state s and following a policy π.

Mathematically:
V^π(s) = E_π[ Σ_{t=0}^{∞} γ^t R_{t+1} | S_0 = s ]
EXPLANATION OF TERMS:
• V^π(s): Value of state s under policy π.
• E_π: Expectation over all possible sequences of actions chosen according to policy π.
• γ^t: Discount factor raised to the power t, reducing the weight of future rewards.
• R_{t+1}: Reward received at time step t+1.

ACTION-VALUE FUNCTION (Q(s,a))
The action-value function, Q(s,a), represents the expected cumulative reward when the agent starts in state s, takes action a, and then follows a policy π.

Mathematically:
Q^π(s,a) = E_π[ Σ_{t=0}^{∞} γ^t R_{t+1} | S_0 = s, A_0 = a ]
BELLMAN EXPECTATION EQUATION
V^π(s) = Σ_a π(a|s) Σ_{s′} P(s′∣s,a) [ R(s,a,s′) + γ V^π(s′) ]

BELLMAN OPTIMALITY EQUATION
V*(s) = max_a Σ_{s′} P(s′∣s,a) [ R(s,a,s′) + γ V*(s′) ]
WHY BELLMAN EQUATIONS MATTER
• Policy Evaluation: Compute the value of a given policy (a short code sketch follows this list).
• Policy Improvement: Derive better policies using value functions.
• Optimization: Solve for the best possible policy using the Bellman Optimality
Equation.
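As a concrete illustration of policy evaluation, the sketch below applies the Bellman Expectation Equation as a repeated sweep over all states until the values stop changing. The dictionary layouts for the policy, P, and R are assumptions made for illustration, not structures defined on the slides.

```python
# Iterative policy evaluation: repeatedly apply the Bellman Expectation Equation
#   V(s) <- sum_a pi(a|s) sum_s' P(s'|s,a) [ R(s,a,s') + gamma * V(s') ]
def policy_evaluation(states, actions, policy, P, R, gamma=0.9, tol=1e-6):
    """Assumed layout: policy[s][a] = pi(a|s); P[(s,a)] = {s': prob}; R[(s,a,s')] = reward."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = sum(policy[s][a] * sum(p * (R[(s, a, s2)] + gamma * V[s2])
                                           for s2, p in P[(s, a)].items())
                        for a in actions)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:          # stop once the largest change in any state's value is tiny
            return V
```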

VALUE ITERATION AND POLICY ITERATION

Introduction
Value Iteration and Policy Iteration are two fundamental algorithms in
Reinforcement Learning for solving Markov Decision Processes (MDPs). Both
methods aim to find the optimal policy, but they achieve this through slightly
different approaches.
VALUE ITERATION

Overview
Value Iteration is an iterative algorithm that directly computes the optimal
value function, V∗(s), and derives the optimal policy π* from it. It is based on
the Bellman Optimality Equation.
Example: Grid World
Scenario:
• A 3x3 grid where:
o The agent starts at any state.
o Goal state (3,3) has a reward of +10.
o Other states have a step penalty of -1.
o Actions: Up, Down, Left, Right.
o γ=0.9.
STEP-BY-STEP EXECUTION:
1. Initialize V(s) to 0 for all states.
2. Iteratively update V(s) using the Bellman Optimality Equation.
o Example: For state (2,2), calculate the value of moving in all directions and take the maximum.
3. Extract π*(s) once V(s) converges (a code sketch of these steps follows below).
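The three steps above can be sketched compactly in Python, reusing the hypothetical P[(s,a)] = {s′: prob} and R[(s,a,s′)] dictionary layout from the policy-evaluation sketch earlier. This is an illustrative sketch under those assumptions, not a definitive implementation.

```python
# Value Iteration: apply the Bellman Optimality Equation until V converges,
# then read off the greedy (optimal) policy.
def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-6):
    V = {s: 0.0 for s in states}                       # Step 1: initialize V(s) = 0
    while True:
        delta = 0.0
        for s in states:                               # Step 2: Bellman Optimality backup
            q = [sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in P[(s, a)].items())
                 for a in actions]
            best = max(q)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    # Step 3: extract pi*(s) = argmax_a Q(s,a) from the converged values
    policy = {s: max(actions, key=lambda a: sum(p * (R[(s, a, s2)] + gamma * V[s2])
                                                for s2, p in P[(s, a)].items()))
              for s in states}
    return V, policy
```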

Advantages of Value Iteration:


• Simpler to implement compared to Policy Iteration.
• Efficient in environments with large state spaces.

Policy Iteration
Overview
Policy Iteration is an iterative algorithm that alternates between evaluating a policy and
improving it until the optimal policy π* is found.
Example: Grid World
Scenario:
• Same 3x3 grid as before.
Step-by-Step Execution:
1. Initialize a random policy (e.g., move randomly in any direction).
2. Policy Evaluation: Compute Vπ(s) for all states using the current policy.
o Example: For state (2,2), compute Vπ(2,2) assuming the agent moves randomly.
3. Policy Improvement: Update the policy based on the new Vπ(s).
o Example: If moving right from (2,2) gives the highest reward, update the policy to
choose "right".
4. Repeat until the policy converges (a code sketch of this loop follows below).
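The evaluate-then-improve loop can be sketched as follows, again assuming the same hypothetical P/R dictionary layout and storing a deterministic policy as a state-to-action mapping (the slides suggest a random initial policy; here the first action is used for simplicity).

```python
# Policy Iteration: evaluate the current policy, then greedily improve it,
# and repeat until the policy stops changing.
def policy_iteration(states, actions, P, R, gamma=0.9, tol=1e-6):
    policy = {s: actions[0] for s in states}          # arbitrary initial (deterministic) policy
    while True:
        # Policy Evaluation for the current deterministic policy
        V = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:
                a = policy[s]
                v_new = sum(p * (R[(s, a, s2)] + gamma * V[s2])
                            for s2, p in P[(s, a)].items())
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < tol:
                break
        # Policy Improvement: act greedily with respect to the new V
        stable = True
        for s in states:
            best = max(actions, key=lambda a: sum(p * (R[(s, a, s2)] + gamma * V[s2])
                                                  for s2, p in P[(s, a)].items()))
            if best != policy[s]:
                policy[s] = best
                stable = False
        if stable:                                    # policy unchanged -> it is optimal
            return V, policy
```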

Advantages of Policy Iteration:


• Converges in fewer iterations compared to Value Iteration.
• Guarantees a valid policy at every step.
COMPARISON: VALUE ITERATION VS. POLICY ITERATION

ASPECT              | VALUE ITERATION                                              | POLICY ITERATION
APPROACH            | Focuses on optimizing the value function directly.           | Alternates between policy evaluation and improvement.
CONVERGENCE SPEED   | May require more iterations for convergence.                 | Fewer iterations due to full policy updates.
COMPUTATIONAL COST  | Lower per iteration, but more iterations needed.             | Higher per iteration, but fewer iterations needed.
USE CASE            | Large state spaces where partial convergence is acceptable.  | Smaller state spaces with exact solutions.
Conclusion
• Both algorithms use Bellman Equations to compute value functions and optimal
policies.
• Value Iteration is simpler and often preferred for large-scale problems.
• Policy Iteration converges faster but can be computationally expensive for large
state spaces.

Q-LEARNING

Introduction
Q-Learning is one of the most widely used model-free Reinforcement Learning algorithms. Unlike Value Iteration and Policy Iteration, Q-Learning doesn't require knowledge of the environment's dynamics (i.e., the transition probabilities P(s′∣s,a) and rewards R(s,a,s′)). Instead, it learns the optimal action-value function, Q*(s,a), directly through experience.
This makes Q-Learning particularly powerful for problems where the environment is
unknown or too complex to model explicitly.
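Concretely, Q-Learning keeps a table of Q(s,a) values and, after every interaction step, applies the temporal-difference update Q(s,a) ← Q(s,a) + α [ R + γ max_a′ Q(s′,a′) − Q(s,a) ]. The sketch below is a minimal tabular version; the env.reset/env.step interface and the hyperparameter values are assumptions for illustration, not taken from the slides.

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-Learning with an epsilon-greedy behaviour policy (a sketch)."""
    Q = defaultdict(float)                              # Q[(state, action)], default 0
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # Epsilon-greedy action selection: explore with prob. epsilon, else exploit.
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)  # assumed env interface
            # Off-policy temporal-difference update toward the greedy next-state value.
            best_next = max(Q[(next_state, a)] for a in actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```

The epsilon-greedy choice is one simple way to handle the exploration-exploitation trade-off noted in the limitations below.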
ADVANTAGES OF Q-LEARNING
• Model-Free: Does not require knowledge of P(s′∣s,a) or R(s,a,s′).
• Off-Policy: Learns the optimal policy independently of the agent's current actions.
• Flexibility: Can handle large or continuous state-action spaces using function approximation (e.g., deep Q-networks).

LIMITATIONS OF Q-LEARNING
• Slow Convergence: Can take a long time in environments with large state spaces.
• Exploration-Exploitation Trade-off: Proper balancing is critical for success.
• High Memory Usage: Maintaining a Q-table for large state-action spaces is computationally expensive.
EXTENSIONS OF Q-LEARNING
1. Deep Q-Learning (DQN): Uses a neural network to approximate Q(s,a) for large state-action spaces.
2. Double Q-Learning: Reduces the overestimation bias in Q-values by maintaining two Q-functions.
3. SARSA (State-Action-Reward-State-Action): An on-policy alternative to Q-Learning (contrasted with the update-rule sketch below).
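To make the on-policy/off-policy distinction concrete, the two algorithms differ only in the bootstrap target of their updates. The sketch below shows both update rules side by side, assuming a Q-table keyed by (state, action) pairs; the argument a2 in the SARSA update is the action actually chosen next by the behaviour policy.

```python
def q_learning_update(Q, s, a, r, s2, actions, alpha=0.1, gamma=0.9):
    # Off-policy: the target uses the greedy (maximal) value at the next state.
    target = r + gamma * max(Q[(s2, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def sarsa_update(Q, s, a, r, s2, a2, alpha=0.1, gamma=0.9):
    # On-policy: the target uses the action a2 the behaviour policy actually takes at s2.
    target = r + gamma * Q[(s2, a2)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```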

CONCLUSION
Q-Learning is a powerful algorithm for learning optimal policies in unknown
environments. It forms the basis for many advanced RL techniques, including deep
reinforcement learning methods like DQN.
THANK YOU!
