AI Seminar RL
Welcome everyone! Have you ever trained a pet, rewarding it for each correct command? Today, we step into the captivating realm of Reinforcement Learning, a branch of Artificial Intelligence where machines learn through rewards, much like our furry friends do. In this presentation, we will delve into the intricacies of the Q-Learning algorithm, a simple yet powerful technique that enables machines to learn autonomously.
What is Reinforcement Learning?
● Reinforcement Learning is a feedback-based machine learning approach where an agent learns which actions to perform by observing the environment and the results of its actions.
● For each correct action, the agent gets positive feedback; for each incorrect action, the agent gets negative feedback or a penalty.
● In RL, we build an agent that can make smart decisions: for instance, an agent that learns to play a video game, or a trading agent that learns to maximize its profit by deciding which stocks to buy and when to sell. A minimal interaction loop is sketched below.
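The sketch below is purely illustrative: the LineWorld environment and the random agent are invented for this summary, not taken from the slides; it only shows the state → action → reward feedback cycle described above.

```python
import random

class LineWorld:
    """Toy environment: the agent starts at position 0 and must reach position 3."""
    def __init__(self):
        self.pos = 0

    def step(self, action):
        # action is -1 (move left) or +1 (move right); position cannot go below 0
        self.pos = max(0, self.pos + action)
        if self.pos == 3:
            return +10, True      # reaching the goal: positive feedback, episode ends
        return -1, False          # any other move: small penalty, episode continues

env = LineWorld()
done, total_reward = False, 0
while not done:
    action = random.choice([-1, +1])   # a (not yet smart) agent acting randomly
    reward, done = env.step(action)
    total_reward += reward
print("episode return:", total_reward)
```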
Simple Example of Reinforcement Learning :
Example: Imagine an agent (e.g., a robot) in an environment with two key elements:
🔥 Fire: Represents danger (Negative reward).
💧 Water: Represents safety or a goal (Positive reward).
How RL Works Here:
States: The agent’s position in the grid (e.g., near the fire or the water).
Actions: The moves the agent can take (e.g., stepping up, down, left, or right).
Rewards: A positive reward for reaching the water, a negative reward for stepping into the fire.
Policy: The agent’s strategy, i.e., which action it takes in each state; the agent’s goal is to learn the optimal policy that maximizes cumulative reward.
And to find this optimal policy (hence solving the RL problem), there are two main types of RL methods:
● Policy-based methods: Train the policy directly to learn which action to take given a state.
● Value-based methods: Train a value function to learn which state is more valuable and use this value function to take the action
that leads to it.
The Q-Learning algorithm is a value-based method.
Introducing Q-Learning
What is Q-Learning?
Q-Learning is an off-policy, value-based method that uses a temporal-difference (TD) approach to train its action-value function.
Q-Learning is the algorithm we use to train our Q-function, an action-value function that determines the value of being at a particular state and taking a specific action at that state.
Given a state and an action, our Q-function outputs a state-action value (also called a Q-value).
The “Q” comes from the quality (the value) of that action at that state.
Let’s recap the difference between value and reward:
● The value of a state, or of a state-action pair, is the expected cumulative reward our agent gets if it starts at this state (or state-action pair) and then acts according to its policy.
● The reward is the feedback the agent gets from the environment after performing an action in a state.
Internally, our Q-function is encoded by a Q-table, a table where each cell corresponds to a state-action pair value. Think of this Q-
table as the memory or cheat sheet of our Q-function.
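Because the Q-table stores just one number per (state, action) pair, it can be held in a plain 2-D array. The sketch below is illustrative only: the state and action counts are made up here, and the function names are hypothetical.

```python
import numpy as np

n_states, n_actions = 6, 4                 # example sizes, chosen for illustration
q_table = np.zeros((n_states, n_actions))  # one Q-value per (state, action) pair

def q_function(state, action):
    """The Q-function is just a lookup into the table (its 'cheat sheet')."""
    return q_table[state, action]

def best_action(state):
    """Greedy read-out: the action with the highest Q-value in this state."""
    return int(np.argmax(q_table[state]))
```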
The Q-Learning algorithm
This is the Q-Learning pseudocode; let’s study each part and see how it works with a simple example before implementing it. Don’t be intimidated by it; it’s simpler than it looks! We’ll go over each step.
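As a concrete companion to the pseudocode, here is a hedged Python sketch of the same training loop. The env interface (reset()/step()) and the default hyperparameters are assumptions made for illustration, not code from the slides.

```python
import random
import numpy as np

def q_learning(env, n_states, n_actions, episodes=1000,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning: epsilon-greedy behaviour policy, greedy (max) TD target."""
    q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action selection: explore sometimes, exploit otherwise.
            if random.random() < epsilon:
                action = random.randrange(n_actions)
            else:
                action = int(np.argmax(q[state]))
            next_state, reward, done = env.step(action)
            # Off-policy TD update: the target bootstraps from the *best* next action.
            target = reward + gamma * np.max(q[next_state]) * (not done)
            q[state, action] += alpha * (target - q[state, action])
            state = next_state
    return q
```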
Example Of Q-Learning in Reinforcement Learning :
• We have an agent named Jerry (a mouse) in this tiny maze. He always starts at the same starting point.
• Jerry’s aim is to eat the big pile of cheese (reward) placed in the 2nd box of the 1st row and reach his home (GOAL) safely.
• Jerry must avoid getting caught by Tom, the cat (penalty). After all, who doesn’t like to steal cheese without getting caught?
• The episode ends if Tom catches Jerry, if Jerry reaches home with the cheese, or if Jerry takes more than five steps.
• The learning rate is 0.1.
• The discount rate (gamma) is 0.99.
• For an interactive visualization, visit: https://qltesting-4njdfvxgmcjwqnuzcpmwna.streamlit.app/
[Maze figure: a 2×3 grid. Top row: the Entry cell, the cheese cell (+1), and a neutral cell (+0). Bottom row: Tom’s cell (-10) and the GOAL (Home, +10).]
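For reference, the maze above can be encoded as a tiny tabular environment. The JerryMaze class below is a sketch written for this summary, not code from the presentation; the state numbering (S0–S5) and the listed rewards follow the walkthrough on the next slides, and the action indexing is an assumption.

```python
class JerryMaze:
    """2x3 grid: states S0 S1 S2 on the top row, S3 S4 S5 below.
    S0 is the entry, S1 holds the cheese, S4 holds Tom, S5 is Home (GOAL)."""

    # Actions are indexed 0=Left, 1=Right, 2=Up, 3=Down (an assumed ordering).
    # (state, action) -> (next_state, reward, done); only the moves used in the
    # walkthrough are listed, any other move leaves Jerry where he is with reward 0.
    TRANSITIONS = {
        (0, 1): (1, +1,  False),   # S0 --Right--> S1: grab the cheese
        (0, 3): (3,  0,  False),   # S0 --Down---> S3: nothing there
        (1, 1): (2,  0,  False),   # S1 --Right--> S2
        (1, 3): (4, -10, True),    # S1 --Down---> S4: caught by Tom
        (2, 3): (5, +10, True),    # S2 --Down---> S5: Home (GOAL)
    }

    def __init__(self, max_steps=5):
        self.max_steps = max_steps

    def reset(self):
        self.state, self.steps = 0, 0
        return self.state

    def step(self, action):
        self.steps += 1
        next_state, reward, done = self.TRANSITIONS.get(
            (self.state, action), (self.state, 0, False))
        self.state = next_state
        return next_state, reward, done or self.steps >= self.max_steps
```

With this environment, the training-loop sketch from the previous section could be run as, for example, q_learning(JerryMaze(), n_states=6, n_actions=4, episodes=500, alpha=0.1, gamma=0.99), mirroring the hyperparameters listed above.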
State 1: Agent Jerry reasons as follows:
If he moves Right, he gets a [+1] reward and the pile of cheese.
If he moves Down, he gets a [+0] reward.
To maximize reward, Jerry should move Right.
State 2: This state has two possibilities.
Possibility 1: If Jerry moves Right, he gets a [+0] reward, since he does not get caught by Tom and has not yet used up his five steps. That brings him closer to his home, so he is glad.
Possibility 2: If Jerry moves Down, he gets a [-10] penalty, since he gets caught by Tom. Hence, Jerry won’t make it to his home (goal state). That makes Jerry sad.
State 3:
Now Agent Jerry has to move Down, since there is no cell to his right.
He knows the reward for the goal state (Home) is [+10].
So, he moves Down and reaches home.
Now Jerry is happy; he will enjoy the cheese.
Step 1: Initialize the Q-table to zeros, with one row per state (S0–S5) and one column per action (in the tables below, the column order is inferred from the updates that follow: the second column is Right and the fourth is Down):

State | Left | Right | Up | Down
S0    |  0   |   0   |  0 |  0
S1    |  0   |   0   |  0 |  0
S2    |  0   |   0   |  0 |  0
S3    |  0   |   0   |  0 |  0
S4    |  0   |   0   |  0 |  0
S5    |  0   |   0   |  0 |  0
Step 2: Update Q(St, At)
The agent now updates Q(St, At) after every step using the Q-learning (Bellman-style) update rule:
Q(St , At) ← Q(St , At) + α * [ Rt+1 + γ * maxa Q(St+1 , a) - Q(St , At) ]
1. From S0: the agent takes Right, gets an immediate reward of +1, and all Q-values are still 0, so
Q(S0 , Right) = 0 + 0.1 * [+1 + 0.99 * 0 - 0] = 0.1
The Q-table after this first update:
State | Left | Right | Up | Down
S0    |  0   |  0.1  |  0 |  0
S1    |  0   |   0   |  0 |  0
S2    |  0   |   0   |  0 |  0
S3    |  0   |   0   |  0 |  0
S4    |  0   |   0   |  0 |  0
S5    |  0   |   0   |  0 |  0
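The single update above can be reproduced numerically. The td_update helper below is a hypothetical throwaway function using the slide’s α = 0.1 and γ = 0.99, written only to check the arithmetic.

```python
alpha, gamma = 0.1, 0.99    # learning rate and discount rate from the slides

def td_update(q_sa, reward, max_q_next):
    """One Q-learning update: Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]."""
    return q_sa + alpha * (reward + gamma * max_q_next - q_sa)

# Update 1, from S0 taking Right: reward +1, all Q-values still 0.
print(td_update(0.0, +1, 0.0))   # 0.1 -> matches Q(S0, Right) in the table above
```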
2. From S1: the agent takes Right, gets an immediate reward of +0, and maxa Q(S2, a) = 0, so Q(S1 , Right) stays 0.
3. From S2:
• Action: Down
• Transition: S2 → S5 (Home)
• Immediate reward: +10
• Update: Since S5 is terminal, maxa Q(S5, a) = 0, hence
Q(S2 , Down) = 0 + 0.1 * [+10 + 0.99 * 0 - 0]
Q(S2 , Down) = 1
4. From S1: if the agent moves Down, he gets caught by Tom:
• Action: Down
• Transition: S1 → S4
• Immediate reward: -10
• Update: Q(S1 , Down) = 0 + 0.1 * [-10 + 0.99 * maxa Q(S4, a) - 0]
Q(S1 , Down) = 0.1 * [-10 + 0.99 * 0 - 0]
Q(S1 , Down) = -1
For the transition S1 → S4 (Down), the Q-table is updated with Q(S1 , Down) = -1:
State | Left | Right | Up | Down
S0    |  0   |  0.1  |  0 |  0
S1    |  0   |   0   |  0 | -1
S2    |  0   |   0   |  0 |  0
S3    |  0   |   0   |  0 |  0
S4    |  0   |   0   |  0 |  0
S5    |  0   |   0   |  0 |  0
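Updates 3 and 4 above can be checked with two lines of arithmetic (a throwaway snippet using the slide’s α = 0.1 and γ = 0.99):

```python
alpha, gamma = 0.1, 0.99
# Update 3: Q(S2, Down); S5 is terminal, so the bootstrap term max_a Q(S5, a) is 0.
print(0 + alpha * (+10 + gamma * 0 - 0))   # 1.0
# Update 4: Q(S1, Down); Q(S4, .) is still all zeros.
print(0 + alpha * (-10 + gamma * 0 - 0))   # -1.0
```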
After many episodes, using the Bellman optimality equation (written out after this list), the Q-values will converge as:
• For S2: Q(S2 , Down) ≈ 10 (since moving down from S2 directly gives +10)
• For S1: Optimal Action = Right, so
Q(S1 , Right) ≈ 0 + 0.99 * Q(S2 , Down) ≈ 0.99*10 = 9.9
• For S0: Optimal Action = Right, so
Q(S0 , Right) ≈ +1 + 0.99 * Q(S1 , Right) ≈ 1 + 0.99 * 9.9 ≈ 1 + 9.801 ≈ 10.801
• For the other states (S3, S4, etc.), the values remain lower because they do not
lead to high rewards:
S3 remains largely unexplored.
S4 is a terminal “bad” state (Tom is present); any action here ends the episode.
S5 is terminal (GOAL).
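For reference, the standard Bellman optimality equation that this convergence argument relies on is:

```latex
Q^{*}(s,a) \;=\; \mathbb{E}\!\left[\, R_{t+1} + \gamma \max_{a'} Q^{*}(S_{t+1}, a') \,\middle|\, S_t = s,\ A_t = a \right]
```

Because this maze is deterministic, the expectation drops out, which is exactly why, for example, Q(S1, Right) ≈ 0 + 0.99 * Q(S2, Down).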
A plausible final Q-table may look like:

State | Left | Right | Up | Down | Best Action
S0    |  0   | 10.80 |  0 |  0   | Right (→ S1)

• Optimal Path: S0 → S1 → S2 → S5.
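Once the Q-table has converged, the optimal path falls out of a simple greedy read-out. The sketch below is illustrative; it reuses the hypothetical JerryMaze environment and action indexing from earlier.

```python
import numpy as np

def greedy_path(q, env, max_steps=5):
    """Follow the argmax-Q action from the start state and record the visited states."""
    state = env.reset()
    path = [state]
    for _ in range(max_steps):
        next_state, _, done = env.step(int(np.argmax(q[state])))
        path.append(next_state)
        state = next_state
        if done:
            break
    return path

# e.g. greedy_path(q, JerryMaze()) with a converged table returns [0, 1, 2, 5],
# i.e. S0 -> S1 -> S2 -> S5, the optimal path above.
```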
Advantages of Q – Learning :
● Model-Free Algorithm: Q-learning does not require a predefined model of the environment, making it suitable for complex or unknown environments.
● Effective in Stochastic Environments: The algorithm performs well even in environments with
randomness or uncertainty, where the outcome of actions can vary.
● Simplicity and Ease of Implementation: Q-learning is straightforward to implement with minimal
theoretical prerequisites, making it accessible for beginners.
● Convergence to Optimal Policy: In the tabular setting, given sufficient exploration and an appropriately decaying learning rate, Q-learning is guaranteed to converge to an optimal policy.
● Learns Like a Human: The trial-and-error learning approach resembles how people learn from experience.
● Self-Correcting: The agent can correct mistakes during training; once a mistake has been penalized, it is unlikely to be repeated.
Limitations of Q – Learning :
The Q-learning approach in reinforcement learning also has some disadvantages such as −
● Slow Convergence: In environments with many states and actions, the learning process can be slow, requiring many
episodes to achieve optimal performance.
● Inefficient for Large State Spaces: As the state space grows, maintaining and updating a Q-table becomes impractical,
leading to high memory usage.
● Exploration-Exploitation Trade-off: Balancing between exploring new actions and exploiting known actions can be
challenging and may affect learning quality.
● Sensitive to Hyperparameters: Choosing appropriate values for the learning rate, discount factor, and exploration rate is
critical. Poor choices can result in suboptimal learning or even prevent convergence.
● Overestimation Bias: Because the update always bootstraps from the maximum next-state Q-value, Q-learning can be overly optimistic and overestimate how good a particular action or strategy is.
● Slow Strategy Discovery: It can be time-consuming for a Q-learning model to determine the optimal strategy when many alternative solutions exist.
Applications of Q – Learning : The Q-learning models can improve processes in various scenarios.
Some of the fields include −
● Gaming − Q-learning algorithms can teach gaming systems to reach
expert levels of skill in various games by learning the best strategy to
progress.
● Recommendation Systems − Q-learning algorithms can be utilized to
improve recommendation systems, like advertising platforms.
● Robotics − Q-learning algorithms enable robots to learn how to
perform different tasks like manipulating objects, avoiding obstacles,
and transporting items.
● Autonomous Vehicles − Q-learning algorithms are used to train self-
driving cars to make driving choices like changing lanes or coming to a
halt.
● Supply Chain − Q-learning models can enhance the efficiency of supply
chains by optimizing the path for products to market.
Conclusion :
● Q-learning is a powerful reinforcement learning algorithm that enables agents to learn optimal policies through trial and
error. By using a Q-table to store expected rewards for state-action pairs, it helps an agent navigate an environment
without requiring prior knowledge. The algorithm follows the Bellman equation to update Q-values iteratively based on
the rewards received and future state values.
● Q-learning relies on the Markov property: the next state and reward depend only on the current state and action, not on the history of earlier states. This allows the agent to make decisions based only on what it observes now, without needing to remember everything that happened before.
● We have finally come to the end of the presentation. We covered a lot of the preliminary ground of reinforcement learning, which will be useful if you plan to further strengthen your knowledge of the field. We also implemented a simple reinforcement learning example using only NumPy. These from-scratch implementations are not just for fun; they help tremendously in understanding the nuts and bolts of an algorithm.
References :
Reinforcement learning has provided solutions to problems in a wide variety of domains. One that I particularly like is Google’s NASNet, which uses deep reinforcement learning to find an optimal neural network architecture for a given dataset.
Let’s now review some of the best resources for breaking into reinforcement learning in a serious manner:
● Reinforcement Learning: An Introduction (Second Edition) by Richard S. Sutton and Andrew G. Barto, which is considered the textbook of reinforcement learning
● Practical Reinforcement Learning a course designed by the National Research University Higher School of Economics offered by
Coursera
● Reinforcement Learning, a course designed by Georgia Tech and offered by Udacity
● If you are interested in the conjunction of meta-learning and reinforcement learning then you may follow this article
● How about combining deep learning + reinforcement learning? Check out Deep RL Bootcamp.
● Deep Reinforcement Learning Hands-On a book by Maxim Lapan which covers many cutting edge RL concepts like deep Q-networks,
value iteration, policy gradients and so on.
● MIT Deep Learning a course taught by Lex Fridman which teaches you how different deep learning applications are used in
autonomous vehicle systems and more.