Q-Learning in Reinforcement Learning
Q-Learning is a popular model-free reinforcement learning algorithm that helps an agent learn how to make the best decisions by interacting with its environment. Instead of requiring a model of the environment, the agent learns purely from experience, trying different actions and observing their results.
Imagine a system that sees an apple but incorrectly says, “It’s a mango.” The system is told, “Wrong! It’s an apple.” It learns from this mistake. Next time, when shown the apple, it correctly says “It’s an apple.” This trial-and-error process, guided by feedback, is similar to how Q-Learning works.
The core idea is that the agent builds a Q-table which stores Q-values. Each Q-value estimates how good it is to take a specific action in a given state, in terms of the expected future rewards. Over time the agent updates this table using the feedback it receives.
Key Components of Q-learning
1. Q-Values or Action-Values
Q-values represent the expected rewards for taking an action in a specific state. These values are updated over time using the Temporal Difference (TD) update rule.
2. Rewards and Episodes
The agent moves through different states by taking actions and receiving rewards. The process continues until the agent reaches a terminal state which ends the episode.
3. Temporal Difference or TD-Update
The agent updates Q-values using the formula:
Q(S,A) \leftarrow Q(S,A) + \alpha \left( R + \gamma \max_{A'} Q(S',A') - Q(S,A) \right)
Where,
- S is the current state.
- A is the action taken by the agent.
- S' is the next state the agent moves to.
- A' is the best next action in state S', i.e. the action with the highest Q-value there.
- R is the reward received for taking action A in state S.
- γ (Gamma) is the discount factor which balances immediate rewards with future rewards.
- α (Alpha) is the learning rate determining how much new information affects the old Q-values.
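To make the update concrete, here is a worked example with illustrative numbers (chosen for demonstration, not tied to any particular environment). Suppose α = 0.5, γ = 0.9, the current estimate is Q(S,A) = 2, the agent receives R = 1 and the best Q-value in the next state is Q(S',A') = 4. Then:
Q(S,A) \leftarrow 2 + 0.5\,(1 + 0.9 \times 4 - 2) = 2 + 0.5 \times 2.6 = 3.3
The estimate moves partway toward the TD target of 4.6, with α controlling how far it moves in a single step.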
4. ϵ-greedy Policy (Exploration vs. Exploitation)
The ϵ-greedy policy helps the agent decide which action to take based on the current Q-value estimates:
- Exploitation: The agent picks the action with the highest Q-value with probability 1 - ϵ. This means the agent uses its current knowledge to maximize rewards.
- Exploration: With probability ϵ, the agent picks a random action, exploring new possibilities to learn if there are better ways to get rewards. This allows the agent to discover new strategies and improve its decision-making over time.
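A minimal sketch of ε-greedy action selection in Python (the function name and arguments here are illustrative, assuming a NumPy Q-table indexed as Q_table[state, action]):
Python
import numpy as np

def epsilon_greedy(Q_table, state, epsilon, n_actions):
    # Explore: with probability epsilon, pick a uniformly random action.
    if np.random.rand() < epsilon:
        return np.random.randint(0, n_actions)
    # Exploit: otherwise pick the action with the highest current Q-value.
    return int(np.argmax(Q_table[state]))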
How does Q-Learning Work?
Q-learning models follow an iterative process where different components work together to train the agent. Here's how it works step-by-step:
1. Start at a State (S)
The environment provides the agent with a starting state which describes the current situation or condition.
2. Agent Selects an Action (A)
Based on the current state, the agent chooses an action using its policy. This decision is guided by a Q-table which estimates the potential rewards for different state-action pairs. The agent typically uses an ε-greedy strategy:
- It sometimes explores new actions (random choice).
- It mostly exploits known good actions (based on current Q-values).
3. Action is Executed and Environment Responds
The agent performs the selected action. The environment then provides:
- A new state (S′) — the result of the action.
- A reward (R) — feedback on the action's effectiveness.
4. Learning Algorithm Updates the Q-Table
The agent updates the Q-table using the new experience:
- It adjusts the value for the state-action pair based on the received reward and the new state.
- This helps the agent better estimate which actions are more beneficial over time.
5. Policy is Refined and the Cycle Repeats
With updated Q-values the agent:
- Improves its policy to make better future decisions.
- Continues this loop — observing states, taking actions, receiving rewards and updating Q-values across many episodes.
Over time the agent learns the optimal policy that consistently yields the highest possible reward in the environment.
Methods for Determining Q-values
1. Temporal Difference (TD):
The temporal-difference approach updates the current Q-value estimate toward a target built from the observed reward and the estimated value of the next state. It provides a way to learn directly from experience, without needing a model of the environment.
2. Bellman’s Equation:
Bellman’s Equation is a recursive formula used to calculate the value of a given state and determine the optimal action. It is fundamental in the context of Q-learning and is expressed as:
Q(s, a) = R(s, a) + \gamma \max_{a'} Q(s', a')
Where:
- Q(s, a) is the Q-value for a given state-action pair.
- R(s, a) is the immediate reward for taking action a in state s.
- γ is the discount factor, representing the importance of future rewards.
- \max_{a'} Q(s', a') is the maximum Q-value over all possible actions a' in the next state s'.
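The two methods are closely related: the Bellman equation describes the condition a fully learned Q-function must satisfy, while the TD update nudges the current estimate toward the Bellman target R + \gamma \max_{a'} Q(s', a') one experience at a time. When the Q-values stop changing, they satisfy the Bellman equation.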
What is a Q-table?
The Q-table is essentially a memory structure where the agent stores information about which actions yield the best rewards in each state. It is a table of Q-values representing the agent's understanding of the environment. As the agent explores and learns from its interactions with the environment, it updates the Q-table. The Q-table helps the agent make informed decisions by showing which actions are likely to lead to better rewards.
Structure of a Q-table:
- Rows represent the states.
- Columns represent the possible actions.
- Each entry in the table corresponds to the Q-value for a state-action pair.
Over time, as the agent learns and refines its Q-values through exploration and exploitation, the Q-table evolves to reflect the best actions for each state, leading to optimal decision-making.
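As a small illustration, here is a hypothetical Q-table for a 3-state, 2-action problem (the values are made up for demonstration):
Python
import numpy as np

# Hypothetical Q-table: 3 states (rows) x 2 actions (columns).
Q = np.array([
    [0.1, 0.5],   # state 0: action 1 currently looks better
    [0.7, 0.2],   # state 1: action 0 currently looks better
    [0.0, 0.0],   # state 2: nothing learned yet
])

# The greedy action in a state is the column with the largest value.
print(np.argmax(Q[1]))  # -> 0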
Implementation of Q-Learning
Here we implement a basic Q-learning algorithm in which the agent learns an action-selection strategy for reaching a goal state in a grid-like environment.
Step 1: Define the Environment
Set up the environment parameters, including the number of states and actions, and initialize the Q-table. Each state represents a position in a 4×4 grid; to keep the example minimal, the transition rule in Step 3 simply advances the agent one state at a time.
Python
import numpy as np

n_states = 16     # 4x4 grid flattened into 16 discrete states
n_actions = 4     # e.g. up, down, left, right
goal_state = 15   # terminal state the agent tries to reach

# Q-table: one row per state, one column per action, initialized to zero
Q_table = np.zeros((n_states, n_actions))
Step 2: Set Hyperparameters
Define the parameters for the Q-learning algorithm which include the learning rate, discount factor, exploration probability and the number of training epochs.
Python
learning_rate = 0.8      # alpha: how strongly new information overrides old estimates
discount_factor = 0.95   # gamma: how much future rewards are worth today
exploration_prob = 0.2   # epsilon: probability of taking a random action
epochs = 1000            # number of training episodes
Step 3: Implement the Q-Learning Algorithm
Perform the Q-learning algorithm over multiple epochs. Each epoch involves selecting actions with an epsilon-greedy strategy, updating Q-values based on the rewards received and transitioning to the next state.
Python
for epoch in range(epochs):
    # Start each episode in a random state.
    current_state = np.random.randint(0, n_states)

    while current_state != goal_state:
        # Epsilon-greedy action selection.
        if np.random.rand() < exploration_prob:
            action = np.random.randint(0, n_actions)    # explore
        else:
            action = np.argmax(Q_table[current_state])  # exploit

        # Simplified transition: the environment always advances one state,
        # regardless of the chosen action.
        next_state = (current_state + 1) % n_states

        # Reward of 1 only when the goal is reached, otherwise 0.
        reward = 1 if next_state == goal_state else 0

        # TD update: move Q(s, a) toward the target r + gamma * max_a' Q(s', a').
        Q_table[current_state, action] += learning_rate * \
            (reward + discount_factor * np.max(Q_table[next_state])
             - Q_table[current_state, action])

        current_state = next_state
Step 4: Output the Learned Q-Table
After training, visualize the learned Q-values on the grid and print the Q-table. The Q-values represent the expected discounted rewards for taking specific actions in each state.
Python
import matplotlib.pyplot as plt

# Best Q-value in each state, arranged on the 4x4 grid.
q_values_grid = np.max(Q_table, axis=1).reshape((4, 4))

# Plot the grid of Q-values.
plt.figure(figsize=(6, 6))
plt.imshow(q_values_grid, cmap='coolwarm', interpolation='nearest')
plt.colorbar(label='Q-value')
plt.title('Learned Q-values for each state')
plt.xticks(np.arange(4), ['0', '1', '2', '3'])
plt.yticks(np.arange(4), ['0', '1', '2', '3'])
plt.gca().invert_yaxis()  # match the grid layout
plt.grid(True)

# Annotate each cell with its Q-value.
for i in range(4):
    for j in range(4):
        plt.text(j, i, f'{q_values_grid[i, j]:.2f}',
                 ha='center', va='center', color='black')

plt.show()

# Print the learned Q-table.
print("Learned Q-table:")
print(Q_table)
Output:
[Figure: heatmap of the learned Q-values on the 4×4 grid]
The learned Q-table shows the expected discounted rewards for each state-action pair. Q-values are highest near the goal state (state 15), where the reward is only one step away, and shrink for states further back because of discounting. The increase in Q-values along the path to the goal reflects how the agent's estimates improve over training.
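Once training is done, a greedy policy can be read directly off the table, one action per state (a short sketch using the variables from the code above):
Python
# For each state, the greedy policy is the action with the highest Q-value.
greedy_policy = np.argmax(Q_table, axis=1)
print("Greedy action per state:", greedy_policy)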
Advantages of Q-learning
- Trial and Error Learning: Q-learning improves over time by trying different actions and learning from experience.
- Self-Improvement: Mistakes lead to learning, helping the agent avoid repeating them.
- Better Decision-Making: Stores successful actions to avoid bad choices in future situations.
- Autonomous Learning: It learns without external supervision, purely through exploration.
Disadvantages of Q-learning
- Slow Learning: Requires many examples, making it time-consuming for complex problems.
- Expensive in Some Environments: In robotics, testing actions can be costly due to physical limitations.
- Curse of Dimensionality: Large state and action spaces make the Q-table too large to handle efficiently.
- Limited to Discrete Actions: It struggles with continuous actions like adjusting speed, making it less suitable for real-world applications involving continuous decisions.