AI Seminar RL
Welcome everyone! Have you ever trained a pet, rewarding it for each correct command? Today, we step into the captivating realm of Reinforcement Learning, a branch of Artificial Intelligence where machines learn through rewards, much like our furry friends do. In this presentation, we will delve into the intricacies of the Q-Learning algorithm, a simple yet powerful technique that enables machines to learn autonomously.
What is Reinforcement Learning?
● Reinforcement Learning is a feedback-based machine learning approach where an agent learns which actions to perform by observing the environment and the results of its actions.
● For each correct action, the agent gets positive feedback; for each incorrect action, the agent gets negative feedback or a penalty.
● In RL, we build an agent that can make smart decisions: for instance, an agent that learns to play a video game, or a trading agent that learns to maximize its profit by deciding which stocks to buy and when to sell. A minimal interaction loop is sketched below.
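The sketch below is purely illustrative: the LineWorld environment and the random agent are invented for this summary, not taken from the slides; it only shows the state → action → reward feedback cycle described above.

```python
import random

class LineWorld:
    """Toy environment: the agent starts at position 0 and must reach position 3."""
    def __init__(self):
        self.pos = 0

    def step(self, action):
        # action is -1 (move left) or +1 (move right); position cannot go below 0
        self.pos = max(0, self.pos + action)
        if self.pos == 3:
            return +10, True      # reaching the goal: positive feedback, episode ends
        return -1, False          # any other move: small penalty, episode continues

env = LineWorld()
done, total_reward = False, 0
while not done:
    action = random.choice([-1, +1])   # a (not yet smart) agent acting randomly
    reward, done = env.step(action)
    total_reward += reward
print("episode return:", total_reward)
```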
Simple Example of Reinforcement Learning :
Example: Imagine an agent (e.g., a robot) in an environment with two key elements:
🔥 Fire: Represents danger (Negative reward).
💧 Water: Represents safety or a goal (Positive reward).
How RL Works Here:
States: The agent’s position in the grid (e.g., near the fire or the water).
Actions: The moves the agent can take (e.g., stepping up, down, left, or right).
Rewards: A positive reward for reaching the water, a negative reward for stepping into the fire.
Policy: The agent’s strategy, i.e., which action it takes in each state; the agent’s goal is to learn the optimal policy that maximizes cumulative reward.
And to find this optimal policy (hence solving the RL problem), there are two main types of RL methods:
● Policy-based methods: Train the policy directly to learn which action to take given a state.
● Value-based methods: Train a value function to learn which state is more valuable and use this value function to take the action
that leads to it.
The Q-Learning algorithm is a value-based method.
Introducing Q-Learning
What is Q-Learning?
Q-Learning is an off-policy, value-based method that uses a temporal-difference (TD) approach to train its action-value function.
Q-Learning is the algorithm we use to train our Q-function, an action-value function that determines the value of being at a particular state and taking a specific action at that state.
Given a state and an action, our Q-function outputs a state-action value (also called a Q-value).
The “Q” comes from the quality (the value) of that action at that state.
Let’s recap the difference between value and reward:
● The value of a state, or of a state-action pair, is the expected cumulative reward our agent gets if it starts at this state (or state-action pair) and then acts according to its policy.
● The reward is the feedback the agent gets from the environment after performing an action in a state.
Internally, our Q-function is encoded by a Q-table, a table where each cell corresponds to a state-action pair value. Think of this Q-
table as the memory or cheat sheet of our Q-function.
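Because the Q-table stores just one number per (state, action) pair, it can be held in a plain 2-D array. The sketch below is illustrative only: the state and action counts are made up here, and the function names are hypothetical.

```python
import numpy as np

n_states, n_actions = 6, 4                 # example sizes, chosen for illustration
q_table = np.zeros((n_states, n_actions))  # one Q-value per (state, action) pair

def q_function(state, action):
    """The Q-function is just a lookup into the table (its 'cheat sheet')."""
    return q_table[state, action]

def best_action(state):
    """Greedy read-out: the action with the highest Q-value in this state."""
    return int(np.argmax(q_table[state]))
```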
The Q-Learning algorithm
This is the Q-Learning pseudocode; let’s study each part and see how it works with a simple example before implementing it. Don’t be intimidated by it; it’s simpler than it looks! We’ll go over each step.
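As a concrete companion to the pseudocode, here is a hedged Python sketch of the same training loop. The env interface (reset()/step()) and the default hyperparameters are assumptions made for illustration, not code from the slides.

```python
import random
import numpy as np

def q_learning(env, n_states, n_actions, episodes=1000,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning: epsilon-greedy behaviour policy, greedy (max) TD target."""
    q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action selection: explore sometimes, exploit otherwise.
            if random.random() < epsilon:
                action = random.randrange(n_actions)
            else:
                action = int(np.argmax(q[state]))
            next_state, reward, done = env.step(action)
            # Off-policy TD update: the target bootstraps from the *best* next action.
            target = reward + gamma * np.max(q[next_state]) * (not done)
            q[state, action] += alpha * (target - q[state, action])
            state = next_state
    return q
```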
Example Of Q-Learning in Reinforcement Learning :
• We have an agent named Jerry (a mouse) in this tiny maze. He always starts at the same starting point.
• Jerry’s aim is to eat the big pile of cheese (reward) placed in the 2nd box of the 1st row and reach his home (GOAL) safely.
• Jerry must avoid getting caught by Tom, the cat (penalty). After all, who doesn’t like to steal cheese without getting caught?
• The episode ends if Tom catches Jerry, if Jerry reaches home with the cheese, or if Jerry takes more than five steps.
• The learning rate is 0.1.
• The discount rate (gamma) is 0.99.
• For an interactive visualization, visit: https://qltesting-4njdfvxgmcjwqnuzcpmwna.streamlit.app/
[Maze figure: a 2×3 grid. Top row: the Entry cell, the cheese cell (+1), and a neutral cell (+0). Bottom row: Tom’s cell (-10) and the GOAL (Home, +10).]
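For reference, the maze above can be encoded as a tiny tabular environment. The JerryMaze class below is a sketch written for this summary, not code from the presentation; the state numbering (S0–S5) and the listed rewards follow the walkthrough on the next slides, and the action indexing is an assumption.

```python
class JerryMaze:
    """2x3 grid: states S0 S1 S2 on the top row, S3 S4 S5 below.
    S0 is the entry, S1 holds the cheese, S4 holds Tom, S5 is Home (GOAL)."""

    # Actions are indexed 0=Left, 1=Right, 2=Up, 3=Down (an assumed ordering).
    # (state, action) -> (next_state, reward, done); only the moves used in the
    # walkthrough are listed, any other move leaves Jerry where he is with reward 0.
    TRANSITIONS = {
        (0, 1): (1, +1,  False),   # S0 --Right--> S1: grab the cheese
        (0, 3): (3,  0,  False),   # S0 --Down---> S3: nothing there
        (1, 1): (2,  0,  False),   # S1 --Right--> S2
        (1, 3): (4, -10, True),    # S1 --Down---> S4: caught by Tom
        (2, 3): (5, +10, True),    # S2 --Down---> S5: Home (GOAL)
    }

    def __init__(self, max_steps=5):
        self.max_steps = max_steps

    def reset(self):
        self.state, self.steps = 0, 0
        return self.state

    def step(self, action):
        self.steps += 1
        next_state, reward, done = self.TRANSITIONS.get(
            (self.state, action), (self.state, 0, False))
        self.state = next_state
        return next_state, reward, done or self.steps >= self.max_steps
```

With this environment, the training-loop sketch from the previous section could be run as, for example, q_learning(JerryMaze(), n_states=6, n_actions=4, episodes=500, alpha=0.1, gamma=0.99), mirroring the hyperparameters listed above.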
State 1: Agent Jerry reasons as follows:
If he moves Right, he gets a [+1] reward and the pile of cheese.
If he moves Down, he gets a [+0] reward.
To maximize reward, Jerry should move Right.
State 2: This state has two possibilities.
Possibility 1: If Jerry moves Right, he gets a [+0] reward, since he does not get caught by Tom and has not yet used up his five steps. That brings him closer to his home, so he is glad.
Possibility 2: If Jerry moves Down, he gets a [-10] penalty, since he gets caught by Tom. Hence, Jerry won’t make it to his home (goal state). That makes Jerry sad.
State 3:
Now Agent Jerry has to move Down, since there is no cell to his right.
He knows the reward for the goal state (Home) is [+10].
So, he moves Down and reaches home.
Now Jerry is happy; he will enjoy the cheese.
Step 1: Initialize the Q-table to zeros, with one row per state (S0–S5) and one column per action (in the tables below, the column order is inferred from the updates that follow: the second column is Right and the fourth is Down):

State | Left | Right | Up | Down
S0    |  0   |   0   |  0 |  0
S1    |  0   |   0   |  0 |  0
S2    |  0   |   0   |  0 |  0
S3    |  0   |   0   |  0 |  0
S4    |  0   |   0   |  0 |  0
S5    |  0   |   0   |  0 |  0
Step 2: Update Q(St, At)
The agent now updates Q(St, At) after every step using the Q-learning (Bellman-style) update rule:
Q(St , At) ← Q(St , At) + α * [ Rt+1 + γ * maxa Q(St+1 , a) - Q(St , At) ]
1. From S0: the agent takes Right, gets an immediate reward of +1, and all Q-values are still 0, so
Q(S0 , Right) = 0 + 0.1 * [+1 + 0.99 * 0 - 0] = 0.1
The Q-table after this first update:
State | Left | Right | Up | Down
S0    |  0   |  0.1  |  0 |  0
S1    |  0   |   0   |  0 |  0
S2    |  0   |   0   |  0 |  0
S3    |  0   |   0   |  0 |  0
S4    |  0   |   0   |  0 |  0
S5    |  0   |   0   |  0 |  0
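The single update above can be reproduced numerically. The td_update helper below is a hypothetical throwaway function using the slide’s α = 0.1 and γ = 0.99, written only to check the arithmetic.

```python
alpha, gamma = 0.1, 0.99    # learning rate and discount rate from the slides

def td_update(q_sa, reward, max_q_next):
    """One Q-learning update: Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]."""
    return q_sa + alpha * (reward + gamma * max_q_next - q_sa)

# Update 1, from S0 taking Right: reward +1, all Q-values still 0.
print(td_update(0.0, +1, 0.0))   # 0.1 -> matches Q(S0, Right) in the table above
```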
2. From S1: the agent takes Right, gets an immediate reward of +0, and maxa Q(S2, a) = 0, so Q(S1 , Right) stays 0.
3. From S2:
• Action: Down
• Transition: S2 → S5 (Home)
• Immediate reward: +10
• Update: Since S5 is terminal, maxa Q(S5, a) = 0, hence
Q(S2 , Down) = 0 + 0.1 * [+10 + 0.99 * 0 - 0]
Q(S2 , Down) = 1
4. From S1: if the agent moves Down, he gets caught by Tom:
• Action: Down
• Transition: S1 → S4
• Immediate reward: -10
• Update: Q(S1 , Down) = 0 + 0.1 * [-10 + 0.99 * maxa Q(S4, a) - 0]
Q(S1 , Down) = 0.1 * [-10 + 0.99 * 0 - 0]
Q(S1 , Down) = -1
For the transition S1 → S4 (Down), the Q-table is updated with Q(S1 , Down) = -1:
State | Left | Right | Up | Down
S0    |  0   |  0.1  |  0 |  0
S1    |  0   |   0   |  0 | -1
S2    |  0   |   0   |  0 |  0
S3    |  0   |   0   |  0 |  0
S4    |  0   |   0   |  0 |  0
S5    |  0   |   0   |  0 |  0
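Updates 3 and 4 above can be checked with two lines of arithmetic (a throwaway snippet using the slide’s α = 0.1 and γ = 0.99):

```python
alpha, gamma = 0.1, 0.99
# Update 3: Q(S2, Down); S5 is terminal, so the bootstrap term max_a Q(S5, a) is 0.
print(0 + alpha * (+10 + gamma * 0 - 0))   # 1.0
# Update 4: Q(S1, Down); Q(S4, .) is still all zeros.
print(0 + alpha * (-10 + gamma * 0 - 0))   # -1.0
```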
After many episodes, using the Bellman optimality equation (written out after this list), the Q-values will converge as:
• For S2: Q(S2 , Down) ≈ 10 (since moving down from S2 directly gives +10)
• For S1: Optimal Action = Right, so
Q(S1 , Right) ≈ 0 + 0.99 * Q(S2 , Down) ≈ 0.99*10 = 9.9
• For S0: Optimal Action = Right, so
Q(S0 , Right) ≈ +1 + 0.99 * Q(S1 , Right) ≈ 1 + 0.99 * 9.9 ≈ 1 + 9.801 ≈ 10.801
• For the other states (S3, S4, etc.), the values remain lower because they do not
lead to high rewards:
S3 remains largely unexplored.
S4 is a terminal “bad” state (Tom is present); any action here ends the episode.
S5 is terminal (GOAL).
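For reference, the standard Bellman optimality equation that this convergence argument relies on is:

```latex
Q^{*}(s,a) \;=\; \mathbb{E}\!\left[\, R_{t+1} + \gamma \max_{a'} Q^{*}(S_{t+1}, a') \,\middle|\, S_t = s,\ A_t = a \right]
```

Because this maze is deterministic, the expectation drops out, which is exactly why, for example, Q(S1, Right) ≈ 0 + 0.99 * Q(S2, Down).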
A plausible final Q-table may look like:

State | Left | Right | Up | Down | Best Action
S0    |  0   | 10.80 |  0 |  0   | Right (→ S1)

• Optimal Path: S0 → S1 → S2 → S5.
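Once the Q-table has converged, the optimal path falls out of a simple greedy read-out. The sketch below is illustrative; it reuses the hypothetical JerryMaze environment and action indexing from earlier.

```python
import numpy as np

def greedy_path(q, env, max_steps=5):
    """Follow the argmax-Q action from the start state and record the visited states."""
    state = env.reset()
    path = [state]
    for _ in range(max_steps):
        next_state, _, done = env.step(int(np.argmax(q[state])))
        path.append(next_state)
        state = next_state
        if done:
            break
    return path

# e.g. greedy_path(q, JerryMaze()) with a converged table returns [0, 1, 2, 5],
# i.e. S0 -> S1 -> S2 -> S5, the optimal path above.
```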
Advantages of Q – Learning :
● Model-Free Algorithm: Q-learning does not require a predefined model of the environment, making it suitable for complex or unknown environments.
● Effective in Stochastic Environments: The algorithm performs well even in environments with
randomness or uncertainty, where the outcome of actions can vary.
● Simplicity and Ease of Implementation: Q-learning is straightforward to implement with minimal
theoretical prerequisites, making it accessible for beginners.
● Convergence to Optimal Policy: In the tabular setting, given sufficient exploration and an appropriately decaying learning rate, Q-learning is guaranteed to converge to an optimal policy.
● Learns Like a Human: The trial-and-error learning approach resembles how people learn from experience.
● Self-Correcting: The agent can correct mistakes during training; once a mistake has been penalized, it is unlikely to be repeated.
Limitations of Q – Learning :
The Q-learning approach in reinforcement learning also has some disadvantages such as −
● Slow Convergence: In environments with many states and actions, the learning process can be slow, requiring many
episodes to achieve optimal performance.
● Inefficient for Large State Spaces: As the state space grows, maintaining and updating a Q-table becomes impractical,
leading to high memory usage.
● Exploration-Exploitation Trade-off: Balancing between exploring new actions and exploiting known actions can be
challenging and may affect learning quality.
● Sensitive to Hyperparameters: Choosing appropriate values for the learning rate, discount factor, and exploration rate is
critical. Poor choices can result in suboptimal learning or even prevent convergence.
● Overestimation Bias: Because the update always bootstraps from the maximum next-state Q-value, Q-learning can be overly optimistic and overestimate how good a particular action or strategy is.
● Slow Strategy Discovery: It can be time-consuming for a Q-learning model to determine the optimal strategy when many alternative solutions exist.
Applications of Q – Learning : The Q-learning models can improve processes in various scenarios.
Some of the fields include −
● Gaming − Q-learning algorithms can teach gaming systems to reach
expert levels of skill in various games by learning the best strategy to
progress.
● Recommendation Systems − Q-learning algorithms can be utilized to
improve recommendation systems, like advertising platforms.
● Robotics − Q-learning algorithms enable robots to learn how to
perform different tasks like manipulating objects, avoiding obstacles,
and transporting items.
● Autonomous Vehicles − Q-learning algorithms are used to train self-
driving cars to make driving choices like changing lanes or coming to a
halt.
● Supply Chain − Q-learning models can enhance the efficiency of supply
chains by optimizing the path for products to market.
Conclusion :
● Q-learning is a powerful reinforcement learning algorithm that enables agents to learn optimal policies through trial and
error. By using a Q-table to store expected rewards for state-action pairs, it helps an agent navigate an environment
without requiring prior knowledge. The algorithm follows the Bellman equation to update Q-values iteratively based on
the rewards received and future state values.
● Q-learning relies on the Markov property: the next state and reward depend only on the current state and action, not on the history of earlier states. This allows the agent to make decisions based only on what it observes now, without needing to remember everything that happened before.
● We have finally come to the end of the presentation. We covered a lot of the preliminary ground of reinforcement learning, which will be useful if you plan to further strengthen your knowledge of the field. We also implemented a simple reinforcement learning example using only NumPy. These from-scratch implementations are not just for fun; they help tremendously in understanding the nuts and bolts of an algorithm.
References :
Reinforcement learning has provided solutions to problems in a wide variety of domains. One that I particularly like is Google’s NASNet, which uses deep reinforcement learning to find an optimal neural network architecture for a given dataset.
Let’s now review some of the best resources for breaking into reinforcement learning in a serious manner:
● Reinforcement Learning: An Introduction (Second Edition) by Richard S. Sutton and Andrew G. Barto, which is considered the textbook of reinforcement learning
● Practical Reinforcement Learning a course designed by the National Research University Higher School of Economics offered by
Coursera
● Reinforcement Learning, a course designed by Georgia Tech and offered by Udacity
● If you are interested in the conjunction of meta-learning and reinforcement learning then you may follow this article
● How about combining deep learning + reinforcement learning? Check out Deep RL Bootcamp.
● Deep Reinforcement Learning Hands-On a book by Maxim Lapan which covers many cutting edge RL concepts like deep Q-networks,
value iteration, policy gradients and so on.
● MIT Deep Learning a course taught by Lex Fridman which teaches you how different deep learning applications are used in
autonomous vehicle systems and more.