Program Explanation
1. Stochastic-gradient Implementation
1. Environment Definition
The GridEnvironment class defines a simple gridworld environment:
Grid Layout:
[
[0, 0, 0, 1],
[0, -1, 0, -1],
[0, 0, 0, 0]
]
o Cells contain rewards:
0: Neutral cell.
1: Goal cell (positive reward).
-1: Penalty cell (negative reward).
States:
o The agent starts at (2, 0) and aims to reach the goal state (0, 3).
Actions:
o 0: Move up.
o 1: Move down.
o 2: Move left.
o 3: Move right.
Methods:
o reset(): Resets the agent to the starting position.
o step(action): Applies an action and returns the next state, the reward, and a done flag.
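The class constructor is not shown in the excerpt; a minimal sketch consistent with the grid layout, start state, and goal state listed above (attribute names follow the later code snippets) could be:

import numpy as np

class GridEnvironment:
    def __init__(self):
        # Reward grid: +1 goal (top-right), -1 penalty cells, 0 otherwise.
        self.grid = np.array([
            [0,  0, 0,  1],
            [0, -1, 0, -1],
            [0,  0, 0,  0],
        ])
        self.start_state = (2, 0)    # bottom-left corner
        self.goal_state = (0, 3)     # goal cell with reward +1
        self.current_state = self.start_state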
3. Agent Training
The agent learns through episodes. Each episode consists of multiple steps, where
the agent:
1. Selects an action using a policy.
2. Moves to the next state.
3. Receives a reward.
4. Updates the value function using the Bellman error.
2. Policy:
o Random Action (Exploration): With probability epsilon, the agent chooses a random action.
o Greedy Action (Exploitation): Otherwise, it chooses the action that maximizes the estimated value of the next state (a sketch of this choose_action helper appears after this list).
3. Bellman Error: At each step the bootstrapped target is target = reward + gamma * V(next_state), and the Bellman (TD) error is delta = target - V(state). The weights are then updated with w ← w + alpha * delta * x(state), which minimizes the mean squared error between the estimated values and these bootstrapped targets.
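The training loop later in this section calls a choose_action helper that is not shown in the excerpt. A minimal ε-greedy sketch, assuming the greedy branch compares the estimated values of the candidate next states using the SGD weights (the deterministic move simulation and the tie-breaking rule are assumptions; env, n_actions, weights_sgd, and get_features_sgd follow the rest of the code):

def choose_action(state, epsilon):
    # Epsilon-greedy over the estimated values of the possible next states.
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)              # explore
    x, y = state
    moves = [(max(x - 1, 0), y),                         # 0: up
             (min(x + 1, env.grid.shape[0] - 1), y),     # 1: down
             (x, max(y - 1, 0)),                         # 2: left
             (x, min(y + 1, env.grid.shape[1] - 1))]     # 3: right
    values = [np.dot(weights_sgd, get_features_sgd(s)) for s in moves]
    return int(np.argmax(values))                        # exploit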
4. MSVE Calculation
MSVE (Mean Squared Value Error) measures the average squared error between:
o The estimated value V(s) (computed with the weights).
o The bootstrapped target reward + gamma * V(s') observed at each step.
The code stores the MSVE for each episode, allowing the learning performance to be evaluated (see the sketch below).
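In the training loop shown later, the squared Bellman errors are accumulated over an episode; recording the per-episode MSVE could then look like this (the steps counter is an assumption; the msve_sgd and msve_semi lists match the plotting code):

# At the end of each episode; steps = number of updates performed in the episode.
msve_sgd.append(total_bellman_error_sgd / max(steps, 1))
msve_semi.append(total_bellman_error_semi / max(steps, 1))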
2. Semi-gradient Methods
1. Environment: GridEnvironment
The environment is a 3x4 gridworld where the agent:
Starts at (2, 0).
Tries to reach the goal state (0, 3) with a reward of +1.
Faces penalties (-1) in certain cells.
Navigates using 4 possible actions (Up, Down, Left, Right).
2. State-Action Representation
The agent uses one-hot encoding for state-action pairs:
o A state (x, y) in the grid has a unique index in the range [0, n_states - 1].
o Each action (Up, Down, Left, Right) maps to a specific position in the
feature vector.
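The feature function itself is not shown in the excerpt; a minimal one-hot sketch consistent with this description (the row-major state index and the layout state_index * n_actions + action are assumptions) could be:

import numpy as np

n_rows, n_cols = 3, 4          # grid dimensions
n_states = n_rows * n_cols
n_actions = 4

def get_features_semi(state, action):
    # One-hot vector over all state-action pairs.
    features = np.zeros(n_states * n_actions)
    state_index = state[0] * n_cols + state[1]
    features[state_index * n_actions + action] = 1.0
    return features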
3. Metrics
Rewards: Tracks cumulative rewards in each episode.
MSVE (Mean Squared Value Error):
o Measures the average squared error between the target and estimated Q-values over all steps of an episode:
MSVE = (1 / T) * Σ_t (target_t − Q(s_t, a_t))², where T is the number of steps in the episode.
4. Results Visualization
MSVE Plot:
o Tracks the learning progress and accuracy of the Q-value function
over episodes.
o A decreasing MSVE indicates the agent is learning effectively.
def reset(self):
    self.current_state = self.start_state
    return self.current_state

def step(self, action):
    x, y = self.current_state
    if action == 0:   x = max(x - 1, 0)                       # up
    elif action == 1: x = min(x + 1, self.grid.shape[0] - 1)  # down
    elif action == 2: y = max(y - 1, 0)                       # left
    elif action == 3: y = min(y + 1, self.grid.shape[1] - 1)  # right
    self.current_state = (x, y)
    reward = self.grid[x, y]
    done = (self.current_state == self.goal_state)
    return self.current_state, reward, done
Explanation
1. Environment Definition:
o The grid environment is a 2D array (self.grid) where each cell has a
specific meaning:
1: The goal (where the agent wants to reach).
-1: Obstacles that the agent should avoid.
0: Regular cells with no specific reward or penalty.
2. States:
o The agent starts at start_state = (2, 0) (bottom-left corner).
3. Reset Function:
o Resets the agent to the starting position.
4. Step Function:
o Takes an action (e.g., move up, down, left, or right) and updates the
agent's position.
o Handles edge cases: if an action would move the agent off the grid, the position is clipped with max()/min(), so the agent stays in place.
# Initialize weights
weights_sgd = np.zeros(n_states)               # For SGD (state-value function approximation)
weights_semi = np.zeros(n_states * n_actions)  # For semi-gradient (Q-value approximation)
1. Hyperparameters:
o alpha_sgd and alpha_semi: Learning rates control how quickly the
model updates weights.
o gamma: Discount factor determines the importance of future
rewards.
o epsilon: Exploration rate for the ε-greedy policy.
2. Weights Initialization:
o weights_sgd: Weights for the state-value function (SGD-based).
o weights_semi: Weights for the action-value (Q) function (semi-gradient-based).
3. Feature Encoding:
o For SGD, a state is encoded as a one-hot vector of size n_states (a sketch of get_features_sgd follows).
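get_features_sgd is not shown in the excerpt; a minimal one-hot sketch matching this description (row-major indexing is an assumption):

def get_features_sgd(state):
    # One-hot encoding of the state for the linear value function.
    features = np.zeros(n_states)
    features[state[0] * n_cols + state[1]] = 1.0   # n_cols = 4 for this grid
    return features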
for episode in range(n_episodes):   # n_episodes: number of training episodes (assumed hyperparameter)
    state = env.reset()
    action = choose_action(state, epsilon)
    total_reward_sgd = 0
    total_reward_semi = 0
    done = False
    total_bellman_error_sgd = 0
    total_bellman_error_semi = 0
    while not done:
        next_state, reward, done = env.step(action)

        # SGD update (state-value function)
        target_sgd = reward + gamma * np.dot(weights_sgd, get_features_sgd(next_state))
        bellman_error_sgd = target_sgd - np.dot(weights_sgd, get_features_sgd(state))
        total_bellman_error_sgd += bellman_error_sgd ** 2
        weights_sgd += alpha_sgd * bellman_error_sgd * get_features_sgd(state)

        # Semi-gradient update (action-value function)
        next_action = choose_action(next_state, epsilon)
        target_semi = reward + gamma * np.dot(weights_semi, get_features_semi(next_state, next_action))
        bellman_error_semi = target_semi - np.dot(weights_semi, get_features_semi(state, action))
        total_bellman_error_semi += bellman_error_semi ** 2
        weights_semi += alpha_semi * bellman_error_semi * get_features_semi(state, action)

        state = next_state
        action = next_action
        total_reward_sgd += reward
        total_reward_semi += reward

    if (episode + 1) % 100 == 0:
        print(f"Episode: {episode + 1}, SGD Reward: {total_reward_sgd}, "
              f"Semi-gradient Reward: {total_reward_semi}")
Explanation
1. Loop Over Episodes:
o Reset the environment and start with the initial state.
2. SGD Update:
o Compute Bellman error and update weights based on state values.
3. Semi-Gradient Update:
o Compute Bellman error and update weights based on Q-values.
Part 6: Visualization
import matplotlib.pyplot as plt
# Total rewards
plt.subplot(2, 2, 1)
plt.plot(rewards_sgd)
plt.xlabel('Episode')
plt.ylabel('Total Reward')
plt.title('SGD: Total Reward per Episode')
plt.subplot(2, 2, 2)
plt.plot(rewards_semi)
plt.xlabel('Episode')
plt.ylabel('Total Reward')
plt.title('Semi-gradient: Total Reward per Episode')
# MSVE
plt.subplot(2, 2, 3)
plt.plot(msve_sgd)
plt.xlabel('Episode')
plt.ylabel('MSVE')
plt.title('SGD: MSVE over Episodes')
plt.subplot(2, 2, 4)
plt.plot(msve_semi)
plt.xlabel('Episode')
plt.ylabel('MSVE')
plt.title('Semi-gradient: MSVE over Episodes')
plt.tight_layout()
plt.show()
Explanation
1. Visual Comparison:
o Plots total rewards and MSVE for both methods across episodes.
1. Environment
The gridworld environment provides the next state, the reward, and a done flag for each step.
Actions
o 0: Move Up.
o 1: Move Down.
o 2: Move Left.
o 3: Move Right.
State Transitions
Transitions are deterministic:
o Moving in a specified direction changes the agent's position in the
grid.
o Boundary conditions are handled using max() and min() so that the agent cannot leave the grid (e.g., from row 0, the Up action gives max(0 - 1, 0) = 0, leaving the agent in place).
Rewards
The agent receives:
o +1 for reaching the goal.
o -1 for entering a penalty cell.
o 0 for every other cell.
2. Hyperparameters
Key Hyperparameters:
1. Learning Rate (alpha): Controls how much the weights are adjusted
during learning.
o alpha = 0.1: Small updates to weights.
3. Feature Representation
Each state-action pair is represented using one-hot encoding.
Total features: n_states * n_actions.
o Example: for state (0, 0) and action Up, exactly one feature is 1 and the rest are 0.
get_features(state, action):
o Transforms a state-action pair into a one-hot feature vector of length n_states * n_actions, with a single 1 at the position corresponding to that pair.
4. Epsilon-Greedy Policy
choose_action(state, epsilon):
o With probability ε, choose a random action (exploration); otherwise choose the action with the highest estimated Q-value (exploitation), balancing exploration and exploitation.
5. Episode Loop:
o Take the chosen action and observe the next state, the reward, and the done flag.
o Target Q-value: target = reward + gamma * (estimated Q-value of the next state-action pair).
o Current Q-value: the estimate Q(state, action) = w · x(state, action).
o Update: w ← w + α * (target − Q(state, action)) * x(state, action), where α is the learning rate (see the sketch below).
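A minimal sketch of one such update step, assuming a SARSA-style target and the helper names described above (weights, alpha, gamma, and epsilon are assumed variable names):

# One semi-gradient update step inside the episode loop (sketch).
next_state, reward, done = env.step(action)
next_action = choose_action(next_state, epsilon)

target = reward + gamma * np.dot(weights, get_features(next_state, next_action))
current = np.dot(weights, get_features(state, action))
weights += alpha * (target - current) * get_features(state, action)

state, action = next_state, next_action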
6. Metrics
Total Rewards:
o Tracks cumulative rewards per episode to monitor performance.
7. Visualization
MSVE Plot:
o Tracks how the value approximation improves over episodes.
8. Testing
After training, the agent is tested with a greedy policy:
o ε = ε_min: (effectively) no exploration; the agent acts greedily with respect to the learned Q-values.
o Observes and prints the agent's trajectory, actions, and total reward.
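A minimal sketch of such a greedy test rollout (trajectory printing simplified; epsilon_min and the other names are assumptions following the descriptions above):

# Greedy evaluation after training (sketch).
state = env.reset()
done = False
total_reward = 0
trajectory = [state]

while not done:
    action = choose_action(state, epsilon_min)   # epsilon_min ~ 0: effectively greedy
    state, reward, done = env.step(action)
    trajectory.append(state)
    total_reward += reward

print("Trajectory:", trajectory)
print("Total reward:", total_reward)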
3. Code Components
Discretizing the State Space
The MountainCar-v0 environment has a continuous state space, which
needs to be discretized for tabular reinforcement learning methods like
SARSA.
Discretization Method: Divide the continuous state space into discrete
bins:
state_space = [np.linspace(-1.2, 0.6, num_states), # Position bins
np.linspace(-0.07, 0.07, num_states)] # Velocity bins
The function discretize_state(state) maps a continuous state to discrete
indices:
np.digitize(state[i], state_space[i]) - 1
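The full helper is not shown; a sketch consistent with the digitize call above (the clipping of out-of-range indices is an added safeguard, not necessarily present in the original):

import numpy as np

num_states = 20  # bins per state dimension, as used above
state_space = [np.linspace(-1.2, 0.6, num_states),    # position bins
               np.linspace(-0.07, 0.07, num_states)]  # velocity bins

def discretize_state(state):
    # Map a continuous (position, velocity) pair to a pair of bin indices.
    indices = []
    for i in range(len(state)):
        idx = np.digitize(state[i], state_space[i]) - 1
        indices.append(int(np.clip(idx, 0, num_states - 1)))
    return tuple(indices)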
Q-Table Initialization
The Q-table stores the estimated Q-values for each state-action pair.
Dimensions:
o States: discretized position and velocity (20 bins each).
o Actions: the 3 discrete MountainCar actions (push left, no push, push right).
Each SARSA update then uses the current state (s), action (a), reward (r), next state (s'), and next action (a').
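A sketch of this initialization (MountainCar-v0 has 3 discrete actions, so env.action_space.n is 3):

import gym
import numpy as np

env = gym.make("MountainCar-v0")
num_states = 20  # bins per state dimension, as above

# Q-table indexed by (position bin, velocity bin, action).
Q = np.zeros((num_states, num_states, env.action_space.n))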
7. λ-Return
Block 1: Importing Libraries
import numpy as np
import matplotlib.pyplot as plt
numpy: Used for handling numerical calculations and arrays.
matplotlib.pyplot: Used for plotting graphs and visualizations.
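For reference, the λ-return that this section approximates is the standard exponentially weighted mixture of n-step returns:

G_t^λ = (1 − λ) * Σ_{n=1..∞} λ^(n−1) * G_{t:t+n},  where  G_{t:t+n} = R_{t+1} + γ R_{t+2} + ... + γ^(n−1) R_{t+n} + γ^n V(S_{t+n}).

Setting λ = 0 recovers the one-step TD target, while λ = 1 recovers the Monte Carlo return.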
8. TD(𝜆)
1. Imports:
import gym
import numpy as np
import matplotlib.pyplot as plt
gym: The Gym library is used for creating and interacting with
reinforcement learning environments.
numpy: Used for numerical operations and managing arrays (e.g., value
function and eligibility traces).
matplotlib: Used for plotting the results of the learning process (episode
rewards).
3. Initialization:
V = np.zeros(env.observation_space.n)
eligibility_trace = np.zeros(env.observation_space.n)
V: Initializes the value function, representing the estimated value of each
state. All states are initially assumed to have value 0.
eligibility_trace: This array tracks how recently a state has been visited. It
will decay over time with factor gamma * lambd.
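The per-step update inside td_lambda is not shown; a minimal sketch using accumulating traces, consistent with the decay factor described above (the trace-increment style is an assumption):

# One TD(lambda) step for a transition (state, reward, next_state) (sketch).
delta = reward + gamma * V[next_state] * (not done) - V[state]   # TD error
eligibility_trace[state] += 1.0                                  # accumulating trace
V += alpha * delta * eligibility_trace                           # update all states
eligibility_trace *= gamma * lambd                               # decay traces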
9. Example Usage:
env = gym.make("FrozenLake-v1")
td_lambda(env, gamma=0.9, lambd=0.8, alpha=0.1, episodes=100)
Creates the FrozenLake-v1 environment using gym.make.
Calls the td_lambda function to train the agent using TD(λ) with specified
parameters.
9. SARSA(𝜆)
1. Imports:
import gym
import numpy as np
import matplotlib.pyplot as plt
gym: The Gym library is used for creating and interacting with
reinforcement learning environments.
numpy: Used for numerical operations and managing arrays (e.g., Q-
function and eligibility traces).
matplotlib: Used for plotting the results of the learning process (episode
rewards).
2. The sarsa_lambda Function:
def sarsa_lambda(env, gamma=0.9, lambd=0.8, alpha=0.1, episodes=100):
This function implements the SARSA(λ) algorithm, where:
env: The environment (e.g., FrozenLake-v1).
gamma: The discount factor that determines how much future rewards are
considered.
lambd: The λ parameter that controls the combination of n-step returns
and bootstrapping (eligibility trace decay).
alpha: The learning rate that controls how much to update the Q-function.
episodes: The number of episodes to run the algorithm for.
3. Initialization:
Q = np.zeros((env.observation_space.n, env.action_space.n))
eligibility_trace = np.zeros((env.observation_space.n, env.action_space.n))
Q: Initializes the Q-function, which estimates the value of state-action pairs.
It's a 2D array where rows correspond to states and columns correspond to
actions.
eligibility_trace: This array tracks how recently a state-action pair has
been visited. It will decay over time with factor gamma * lambd.
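The inner update of sarsa_lambda is not shown; a minimal sketch with accumulating traces (names follow the initialization above):

# One SARSA(lambda) step for (state, action, reward, next_state, next_action) (sketch).
delta = reward + gamma * Q[next_state, next_action] * (not done) - Q[state, action]
eligibility_trace[state, action] += 1.0        # accumulating trace
Q += alpha * delta * eligibility_trace         # update all state-action pairs
eligibility_trace *= gamma * lambd             # decay traces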
9. Example Usage:
env = gym.make("FrozenLake-v1")
sarsa_lambda(env, gamma=0.9, lambd=0.8, alpha=0.1, episodes=100)
Creates the FrozenLake-v1 environment using gym.make.
Calls the sarsa_lambda function to train the agent using SARSA(λ) with the specified parameters.
10. Reinforce
1. Imports and Configuration:
import gym
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import matplotlib.pyplot as plt
gym: Used for the reinforcement learning environment.
pandas: Used for data handling (not extensively used in this example).
numpy: For array manipulation and handling numerical operations.
torch: PyTorch library for building and training neural networks.
matplotlib: For plotting the rewards over episodes.
2. Hyperparameters and Device Configuration:
DEVICE = "cpu" # Change to "cuda:0" if using a GPU
class ValueNet(nn.Module):
    def __init__(self, hidden_dim=16):
        super().__init__()
        self.hidden = nn.Linear(4, hidden_dim)
        self.output = nn.Linear(hidden_dim, 1)

    def forward(self, s):
        # Forward pass (the activation choice is assumed; the excerpt omits this method).
        outs = F.relu(self.hidden(s))
        return self.output(outs)
for i in range(1500):
    done = False
    states, actions, rewards = [], [], []
    s, _ = env.reset()
    # ... episode rollout and actor/critic updates (described below) are omitted here ...

print("\nDone")
env.close()
This block contains the core training loop:
Environment reset: Initializes each episode.
Action selection: The agent picks an action using the pick_sample
function.
Reward computation: Cumulative rewards are calculated using
discounting.
Training: The networks are optimized by:
o Critic (ValueNet): Mean squared error loss is used for value
function.
o Actor (ActorNet): Policy gradient is used to update the actor based
on the advantage.
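ActorNet and pick_sample are referenced above but not shown in the excerpt. A minimal sketch assuming a CartPole-style environment (4 state inputs, as in ValueNet, and 2 discrete actions; the pick_sample signature here is also an assumption and may differ from the original):

class ActorNet(nn.Module):
    # Hypothetical policy network matching the explanation above.
    def __init__(self, hidden_dim=16, n_actions=2):
        super().__init__()
        self.hidden = nn.Linear(4, hidden_dim)
        self.output = nn.Linear(hidden_dim, n_actions)

    def forward(self, s):
        outs = F.relu(self.hidden(s))
        return F.log_softmax(self.output(outs), dim=-1)

def pick_sample(s, actor_net):
    # Sample an action from the current policy.
    with torch.no_grad():
        s_t = torch.tensor(np.array([s]), dtype=torch.float32).to(DEVICE)
        probs = torch.exp(actor_net(s_t)).squeeze(0)
        action = torch.multinomial(probs, num_samples=1)
        return int(action.item())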
6. Plotting the Results
# Plotting the results: Rewards over episodes
average_reward = []
for idx in range(len(reward_records)):
    # Moving average over the last (up to) 50 episodes.
    if idx < 50:
        avg_list = reward_records[:idx + 1]
    else:
        avg_list = reward_records[idx - 49:idx + 1]
    average_reward.append(np.average(avg_list))
plt.plot(reward_records)
plt.plot(average_reward)
plt.title('Rewards Over Time')
plt.xlabel('Episode')
plt.ylabel('Total Reward')
plt.show()
After training, this block generates a plot:
reward_records: Shows total rewards per episode.
average_reward: Shows the moving average of the last 50 episodes to
smooth the plot.