Implementing Deep Q-Learning using PyTorch
Deep Q-Learning is a reinforcement learning method that uses a neural network to help an agent learn how to make decisions by estimating Q-values, which represent how good an action is in a given situation. In this article we’ll implement Deep Q-Learning from scratch using PyTorch.
How Deep Q-Learning Works
Deep Q-Learning works in 5 simple steps that help an agent learn from its surroundings and improve how it makes decisions:
- Define the Q-network: The Q-network is a deep neural network. It takes the current state of the agent as input and outputs Q-values for all possible actions. These Q-values represent how good each action is in that state.
- Initialize the Q-network’s parameters: These are the weights of the neural network. PyTorch can automatically initialize them.
- Define the loss function: This helps the network learn. The most commonly used loss here is Mean Squared Error (MSE), which compares the predicted Q-values to the target Q-values (the target formula is sketched just after this list).
- Define the optimizer: An optimizer adjusts the network’s weights to reduce the loss. Common choices include Adam and RMSprop.
- Collect experiences: The agent plays in the environment and collects data in the form of (state, action, reward, next_state). This experience helps the model learn which actions are better over time.
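At the heart of these steps is the Q-learning target from the Bellman equation: for an experience (state s, action a, reward r, next state s'), the network is trained so that Q(s, a) moves toward r + γ · max over a' of Q(s', a'), where γ (gamma) is the discount factor. Step 7 below computes exactly this target inside the replay() function.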
Let’s implement Deep Q-Learning using PyTorch.
Step 1: Importing the required libraries
First we will import all the necessary libraries like gym, numpy and PyTorch.
Python
import gym
import random
import numpy as np
from collections import deque
import torch
import torch.nn as nn
import torch.optim as optim
Step 2: Defining the Q-Network (Neural Network)
This is a simple neural network with three layers. It takes the state as input and outputs Q-values for all possible actions.
Python
class DQN(nn.Module):
    def __init__(self, state_size, action_size):
        super(DQN, self).__init__()
        self.fc1 = nn.Linear(state_size, 24)
        self.fc2 = nn.Linear(24, 24)
        self.fc3 = nn.Linear(24, action_size)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)
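As a quick sanity check (not part of the original tutorial flow, just an illustration), we can pass a dummy CartPole-sized state through the network and confirm that we get one Q-value per action:
Python
# Illustrative sanity check: CartPole-v1 has 4 state values and 2 actions
net = DQN(state_size=4, action_size=2)
dummy_state = torch.zeros(1, 4)      # a batch containing a single state
print(net(dummy_state).shape)        # torch.Size([1, 2]) -> one Q-value per action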
Step 3: Defining Hyperparameters
These parameters control the learning process: exploration vs exploitation, learning rate, memory, etc.
- gamma: Discount factor for future rewards; a value closer to 1 means a longer-term focus.
- epsilon: Controls exploration vs exploitation.
- batch_size: How many experiences we train on at once.
- memory_size: Maximum size of the experience replay buffer.
Python
env = gym.make("CartPole-v1")
state_size = env.observation_space.shape[0]
action_size = env.action_space.n
# Hyperparameters
gamma = 0.99
epsilon = 1.0
epsilon_min = 0.01
epsilon_decay = 0.995
learning_rate = 0.001
batch_size = 64
memory_size = 10000
Step 4: Creating Replay Memory
The agent stores past experiences in memory and samples from it during training to break correlation in data.
- Stores past experiences: (state, action, reward, next_state, done).
- deque automatically removes old experiences when it exceeds the limit.
Python
memory = deque(maxlen=memory_size)
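For illustration only (the real transitions come from the environment in Step 8), this is how a transition is stored and how a mini-batch is later drawn; the dummy values below are placeholders, not real experience:
Python
# Illustrative only: a transition is a (state, action, reward, next_state, done) tuple
example = (np.zeros(state_size), 0, 1.0, np.zeros(state_size), False)
example_buffer = deque(maxlen=memory_size)
example_buffer.append(example)            # oldest entries are dropped once maxlen is reached
if len(example_buffer) >= batch_size:
    batch = random.sample(example_buffer, batch_size)   # uniform sampling breaks correlation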
Step 5: Initializing Network and Optimizer
We use two networks: policy network (for selecting actions) and target network (for stable learning).
- policy_net: Trained actively and chooses actions.
- target_net: Provides stable target Q-values (updated less frequently).
- Adam: Adaptive optimizer to update weights.
- MSELoss: Compares predicted Q-values to target Q-values.
Python
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
policy_net = DQN(state_size, action_size).to(device)
target_net = DQN(state_size, action_size).to(device)
target_net.load_state_dict(policy_net.state_dict())
target_net.eval()
optimizer = optim.Adam(policy_net.parameters(), lr=learning_rate)
loss_fn = nn.MSELoss()
Step 6: Defining Function to Choose Action
With probability epsilon the agent explores by choosing a random action; otherwise it exploits its learned policy by selecting the action with the maximum predicted Q-value.
Python
def get_action(state, epsilon):
    if random.random() < epsilon:
        return random.choice(range(action_size))
    else:
        state = torch.FloatTensor(state).unsqueeze(0).to(device)
        with torch.no_grad():
            q_values = policy_net(state)
        return q_values.argmax().item()
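As a rough check of how fast exploration fades with the hyperparameters above (this is just arithmetic, not part of the training code), epsilon shrinks by a factor of 0.995 per episode:
Python
# Rough check of the exploration schedule: epsilon after n episodes ≈ max(epsilon_min, 0.995**n)
for n in (100, 300, 500):
    print(n, round(max(0.01, 0.995 ** n), 2))   # ~0.61, ~0.22, ~0.08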
Step 7: Training on a Mini-Batch
We randomly sample experiences and update the network using the Bellman equation.
- Use the Bellman equation to compute target Q-values.
- Minimize the MSE loss between predicted and target Q-values.
Python
def replay():
    if len(memory) < batch_size:
        return
    minibatch = random.sample(memory, batch_size)
    states, actions, rewards, next_states, dones = zip(*minibatch)

    states = torch.FloatTensor(np.array(states)).to(device)
    actions = torch.LongTensor(actions).unsqueeze(1).to(device)
    rewards = torch.FloatTensor(rewards).unsqueeze(1).to(device)
    next_states = torch.FloatTensor(np.array(next_states)).to(device)
    dones = torch.FloatTensor(dones).unsqueeze(1).to(device)

    # Current Q values for the actions that were actually taken
    current_q = policy_net(states).gather(1, actions)

    # Target Q values from the Bellman equation (target network, no gradient)
    next_q = target_net(next_states).max(1)[0].detach().unsqueeze(1)
    target_q = rewards + (gamma * next_q * (1 - dones))

    loss = loss_fn(current_q, target_q)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
Step 8: Training the Agent
This loop trains the agent over multiple episodes. After each episode we decay the exploration rate, and every few episodes we copy the policy network’s weights into the target network.
- Calls get_action() to pick actions and replay() to train.
- Updates target network periodically for stability.
- Decay epsilon to reduce exploration over time.
Python
episodes = 500
target_update_freq = 10

for episode in range(episodes):
    reset_result = env.reset()
    state = reset_result[0] if isinstance(reset_result, tuple) else reset_result
    total_reward = 0

    for t in range(500):
        action = get_action(state, epsilon)
        step_result = env.step(action)

        if len(step_result) == 5:
            next_state, reward, terminated, truncated, _ = step_result
            done = terminated or truncated
        else:
            next_state, reward, done, _ = step_result

        memory.append((state, action, reward, next_state, done))
        state = next_state
        total_reward += reward

        replay()

        if done:
            break

    if epsilon > epsilon_min:
        epsilon *= epsilon_decay

    if episode % target_update_freq == 0:
        target_net.load_state_dict(policy_net.state_dict())

    print(f"Episode {episode}, Total Reward: {total_reward}, Epsilon: {epsilon:.3f}")
Output:
[Console log showing Episode, Total Reward and Epsilon for each training episode]
As shown in the above output:
- Total Reward shows how long the agent balanced the pole (higher is better).
- Epsilon shows how much the agent is exploring; lower values mean it is exploiting more.
As training continues the agent learns better actions and the Total Reward improves. When the total reward approaches the maximum value (500 for CartPole-v1) it means the agent has learned to perform well consistently.
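Once training finishes, one simple way to verify the learned policy (not shown in the original article) is to run a few episodes greedily with epsilon set to 0 and check that the total reward stays close to 500:
Python
# Illustrative evaluation: run the trained policy greedily (epsilon = 0)
for ep in range(5):
    reset_result = env.reset()
    state = reset_result[0] if isinstance(reset_result, tuple) else reset_result
    done, total_reward = False, 0
    while not done:
        action = get_action(state, epsilon=0.0)      # always exploit the learned Q-values
        step_result = env.step(action)
        if len(step_result) == 5:
            state, reward, terminated, truncated, _ = step_result
            done = terminated or truncated
        else:
            state, reward, done, _ = step_result
        total_reward += reward
    print(f"Evaluation episode {ep}, total reward: {total_reward}")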