Program Explanation

The document describes the implementation of a stochastic-gradient and semi-gradient reinforcement learning agent in a grid environment. It details the environment setup, agent training process, value function approximation, and the use of one-hot encoding for state and action representation. Additionally, it covers the training loop, hyperparameters, and visualization of rewards and Mean Squared Value Error (MSVE) to evaluate the agent's learning performance.


1. Stochastic-gradient Implementation
1. Environment Definition
The GridEnvironment class defines a simple gridworld environment:
 Grid Layout:
[
[0, 0, 0, 1],
[0, -1, 0, -1],
[0, 0, 0, 0]
]
o Cells contain rewards:

 0: Neutral cell.
 1: Goal cell (positive reward).
 -1: Penalty cell (negative reward).
 States:
o The agent starts at (2, 0) and aims to reach the goal state (0, 3).

 Actions:
o 0: Move up.

o 1: Move down.

o 2: Move left.

o 3: Move right.

 Methods:
o reset(): Resets the agent to the starting position.

o step(action): Moves the agent based on the action and returns:

 Next state: New position after the action.


 Reward: The reward at the new state.
 Done: Boolean indicating whether the agent has reached the
goal.

2. Value Function Approximation


In RL, the agent estimates the value of states V(s) using weights for a linear
function approximator.
 Feature Representation:
o The state space is represented as a one-hot encoded vector using
get_features(state).
o Example:

 For state (2, 0) in a 3x4 grid, the state index is 2 × 4 + 0 = 8, so the feature vector is:
[0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
 The weight vector weights is updated during training.

3. Agent Training
The agent learns through episodes. Each episode consists of multiple steps, where
the agent:
1. Selects an action using a policy.
2. Moves to the next state.
3. Receives a reward.
4. Updates the value function using the Bellman error.

Key Steps in Training


1. Hyperparameters:
o alpha: Learning rate controls how much weights are updated in each
step.
o gamma: Discount factor for future rewards.

o epsilon: Exploration probability for the ε-greedy policy.

2. Policy:
o Random Action (Exploration): With probability epsilon, the agent
chooses a random action.
o Greedy Action (Exploitation): Otherwise, it chooses the action that
maximizes the estimated value of the next state.
3. Bellman Error: The Bellman equation relates the value of a state to that of its successor:
o Target: r + γ·V(s′), the reward plus the discounted estimated value of the next state.
o Current Value V(s): The current estimate of the state's value.
o Error: The difference between the target and the current value.

4. Weights Update: Using Stochastic Gradient Descent (SGD), the weights are nudged in the direction that reduces the squared Bellman error (a reconstruction of the update is shown below).
This minimizes the mean squared error between the estimated and true values.
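The update formula itself is not reproduced in this document; reconstructed from the training code in section 3 (Part 5), a linear TD(0)-style SGD update has the form

\delta = \bigl(r + \gamma\, \hat{v}(s', \mathbf{w})\bigr) - \hat{v}(s, \mathbf{w}), \qquad \mathbf{w} \leftarrow \mathbf{w} + \alpha\, \delta\, \mathbf{x}(s)

where x(s) is the one-hot feature vector of the current state, so the gradient of the linear value estimate with respect to the weights is simply x(s).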
4. MSVE Calculation
 MSVE (Mean Squared Value Error) measures the average squared error
between:
o The estimated value V(s) (using weights).

o The true value V∗(s) (if known).

The code stores MSVE for each episode, allowing for evaluation of learning
performance.
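For reference, the textbook mean squared value error is

\overline{\mathrm{VE}}(\mathbf{w}) = \sum_{s} \mu(s)\,\bigl[v_{\pi}(s) - \hat{v}(s, \mathbf{w})\bigr]^{2}

with μ(s) the state-visitation distribution. Since V*(s) is generally unknown here, the code accumulates the squared Bellman error per episode as a practical proxy.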

5. Rewards and Visualization


 The rewards list tracks the total rewards earned in each episode.
 msve_values stores the MSVE for each episode, showing the accuracy of
value estimates over time.
The results are plotted to visualize:
1. Rewards: Indicates how well the agent performs.
2. MSVE: Shows how accurately the agent learns value estimates.

2. Semi-gradient Methods
1. Environment: GridEnvironment
The environment is a 3x4 gridworld where the agent:
 Starts at (2, 0).
 Tries to reach the goal state (0, 3) with a reward of +1.
 Faces penalties (-1) in certain cells.
 Navigates using 4 possible actions (Up, Down, Left, Right).

2. State-Action Representation
 The agent uses one-hot encoding for state-action pairs:
o A state (x, y) in the grid has a unique index in the range [0, n_states - 1].
o Each action (Up, Down, Left, Right) maps to a specific position in the feature vector.
o Feature vector size: n_states * n_actions (one one-hot entry per state-action pair).
3. Training Procedure
The agent is trained for 1000 episodes using the following steps:
3.1 Initialization
 Weights: A zero-initialized vector of size n_features.
 Epsilon-Greedy Policy:
o With probability epsilon, the agent explores by choosing a random
action.
o Otherwise, it exploits by selecting the action with the highest
estimated Q-value.
3.2 Episode Execution
 The agent interacts with the environment until reaching the goal state:
1. Choose an Action:
 Based on the epsilon-greedy policy.
2. Take a Step:
 Execute the action, observe the next state, reward, and
whether the episode ends.
3. Update Weights: apply the semi-gradient update weights += alpha * bellman_error * get_features(state, action) (a reconstruction of the rule appears after this list).
4. Accumulate the squared Bellman error for the MSVE calculation.


5. Transition to the next state-action pair.
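A reconstruction of the weight update referenced in step 3, matching the training code in section 3 (Part 5):

\delta = \bigl(r + \gamma\, \hat{q}(s', a', \mathbf{w})\bigr) - \hat{q}(s, a, \mathbf{w}), \qquad \mathbf{w} \leftarrow \mathbf{w} + \alpha\, \delta\, \mathbf{x}(s, a)

where x(s, a) is the one-hot feature vector of the state-action pair; it equals the gradient of the linear Q-estimate, and the target is treated as a constant, which is why the method is called semi-gradient.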
3.3 Epsilon Decay
 Gradually reduce the exploration rate (epsilon) to favor exploitation over exploration:
epsilon = max(epsilon_min, epsilon * epsilon_decay)
3.4 Metrics
 Rewards: Tracks cumulative rewards in each episode.
 MSVE (Mean Squared Value Error):
o Measures the average squared error between the target and estimated Q-values over all steps of an episode.
4. Results Visualization
 MSVE Plot:
o Tracks the learning progress and accuracy of the Q-value function
over episodes.
o A decreasing MSVE indicates the agent is learning effectively.

3. Linear Methods using Stochastic-gradient and Semi-gradient Methods
Part 1: Grid Environment Definition
import numpy as np

# Define a simple grid environment
class GridEnvironment:
    def __init__(self):
        self.grid = np.array([
            [0, 0, 0, 1],    # 1 is the goal state
            [0, -1, 0, -1],  # -1 are obstacles
            [0, 0, 0, 0]     # 0 are regular cells
        ])
        self.start_state = (2, 0)  # Starting position
        self.goal_state = (0, 3)   # Goal position
        self.current_state = self.start_state

    def reset(self):
        self.current_state = self.start_state
        return self.current_state

    def step(self, action):
        x, y = self.current_state
        if action == 0:    # Up
            x = max(x - 1, 0)
        elif action == 1:  # Down
            x = min(x + 1, self.grid.shape[0] - 1)
        elif action == 2:  # Left
            y = max(y - 1, 0)
        elif action == 3:  # Right
            y = min(y + 1, self.grid.shape[1] - 1)

        self.current_state = (x, y)
        reward = self.grid[x, y]
        done = (self.current_state == self.goal_state)
        return self.current_state, reward, done
Explanation
1. Environment Definition:
o The grid environment is a 2D array (self.grid) where each cell has a
specific meaning:
 1: The goal (where the agent wants to reach).
 -1: Obstacles that the agent should avoid.
 0: Regular cells with no specific reward or penalty.
2. States:
o The agent starts at start_state = (2, 0) (bottom-left corner).

o The agent's goal is to reach goal_state = (0, 3) (top-right corner).

3. Reset Function:
o Resets the agent to the starting position.

4. Step Function:
o Takes an action (e.g., move up, down, left, or right) and updates the
agent's position.
o Handles edge cases:

 Prevents moving out of bounds using max() and min().


5. Returns:
o current_state: The agent's new position after the action.

o reward: The value of the cell the agent moves to.


o done: Boolean indicating if the agent reached the goal.

Part 2: Hyperparameters and Initialization


# Hyperparameters
alpha_sgd = 0.1        # Learning rate for SGD
alpha_semi = 0.1       # Learning rate for Semi-gradient
gamma = 0.99           # Discount factor
epsilon = 1.0          # Exploration rate
epsilon_decay = 0.995
epsilon_min = 0.01
episodes = 1000        # Total episodes for training

# Initialize the environment
env = GridEnvironment()

# Feature construction: one-hot encoding of states and state-action pairs
n_states = env.grid.shape[0] * env.grid.shape[1]
n_actions = 4  # Up, Down, Left, Right

# Initialize weights
weights_sgd = np.zeros(n_states)               # For SGD (value function approximation)
weights_semi = np.zeros(n_states * n_actions)  # For Semi-gradient (Q-value approximation)
1. Hyperparameters:
o alpha_sgd and alpha_semi: Learning rates control how quickly the
model updates weights.
o gamma: Discount factor determines the importance of future
rewards.
o epsilon: Exploration rate for the ε-greedy policy.
o epsilon_decay and epsilon_min: Reduce exploration over time to focus on exploitation.
2. State and Action Representation:
o n_states: Total number of states (grid cells).
o n_actions: Total possible actions (up, down, left, right).

3. Weights Initialization:
o weights_sgd: Weights for the state-value function (SGD-based).

o weights_semi: Weights for the Q-value function (semi-gradient-based).

Part 3: Feature Representation


# Function to transform a state into a feature vector (for SGD)
def get_features_sgd(state):
    x, y = state
    state_index = x * env.grid.shape[1] + y
    features = np.zeros(n_states)
    features[state_index] = 1
    return features

# Function to transform a state-action pair into a feature vector (for Semi-gradient)
def get_features_semi(state, action):
    x, y = state
    state_index = x * env.grid.shape[1] + y
    features = np.zeros(n_states * n_actions)
    features[state_index * n_actions + action] = 1
    return features
1. One-Hot Encoding:
o Converts states and state-action pairs into one-hot vectors.

o Each vector is sparse with a single 1 at the relevant index.

2. For SGD:
o Encodes a state as a vector of size n_states.

o Example: State (0, 3) in a 3×4 grid corresponds to index 0 × 4 + 3 = 3.
3. For Semi-Gradient:
o Encodes a state-action pair as a vector of size n_states * n_actions.
o Example: State (0, 3) with action Right (3) corresponds to index state_index * n_actions + action = 3 × 4 + 3 = 15.
Part 4: Epsilon-Greedy Policy
def choose_action(state, epsilon):
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)  # Random action
    else:
        q_values = [np.dot(weights_semi, get_features_semi(state, a)) for a in range(n_actions)]
        return np.argmax(q_values)  # Greedy action
Explanation
1. Exploration vs. Exploitation:
o With probability ε, choose a random action (exploration).
o Otherwise, choose the action with the highest predicted Q-value (exploitation).
2. Q-Value Calculation:
o Compute Q-values for all possible actions using the current weights.

Part 5: Training Loop


# Training the agent
rewards_sgd = []
rewards_semi = []
msve_sgd = []
msve_semi = []

for episode in range(episodes):
    state = env.reset()
    action = choose_action(state, epsilon)

    total_reward_sgd = 0
    total_reward_semi = 0
    done = False
    total_bellman_error_sgd = 0
    total_bellman_error_semi = 0
    while not done:
        next_state, reward, done = env.step(action)

        # SGD update (state-value approximation)
        target_sgd = reward + gamma * np.dot(weights_sgd, get_features_sgd(next_state))
        bellman_error_sgd = target_sgd - np.dot(weights_sgd, get_features_sgd(state))
        total_bellman_error_sgd += bellman_error_sgd ** 2
        weights_sgd += alpha_sgd * bellman_error_sgd * get_features_sgd(state)

        # Semi-gradient update (Q-value approximation)
        next_action = choose_action(next_state, epsilon)
        target_semi = reward + gamma * np.dot(weights_semi, get_features_semi(next_state, next_action))
        bellman_error_semi = target_semi - np.dot(weights_semi, get_features_semi(state, action))
        total_bellman_error_semi += bellman_error_semi ** 2
        weights_semi += alpha_semi * bellman_error_semi * get_features_semi(state, action)

        state = next_state
        action = next_action
        total_reward_sgd += reward
        total_reward_semi += reward

    epsilon = max(epsilon_min, epsilon * epsilon_decay)

    msve_sgd.append(total_bellman_error_sgd / (episode + 1))
    msve_semi.append(total_bellman_error_semi / (episode + 1))
    rewards_sgd.append(total_reward_sgd)
    rewards_semi.append(total_reward_semi)

    if (episode + 1) % 100 == 0:
        print(f"Episode: {episode + 1}, SGD Reward: {total_reward_sgd}, Semi-gradient Reward: {total_reward_semi}")
Explanation
1. Loop Over Episodes:
o Reset the environment and start with the initial state.

o Repeat until the goal is reached (done=True).

2. SGD Update:
o Compute Bellman error and update weights based on state values.

3. Semi-Gradient Update:
o Compute Bellman error and update weights based on Q-values.

4. Rewards and Errors:


o Track total rewards and Mean Squared Value Errors (MSVE) for
comparison.

Part 6: Visualization
import matplotlib.pyplot as plt

# Plot rewards and MSVE


plt.figure(figsize=(12, 10))

# Total rewards
plt.subplot(2, 2, 1)
plt.plot(rewards_sgd)
plt.xlabel('Episode')
plt.ylabel('Total Reward')
plt.title('SGD: Total Reward per Episode')

plt.subplot(2, 2, 2)
plt.plot(rewards_semi)
plt.xlabel('Episode')
plt.ylabel('Total Reward')
plt.title('Semi-gradient: Total Reward per Episode')
# MSVE
plt.subplot(2, 2, 3)
plt.plot(msve_sgd)
plt.xlabel('Episode')
plt.ylabel('MSVE')
plt.title('SGD: MSVE over Episodes')

plt.subplot(2, 2, 4)
plt.plot(msve_semi)
plt.xlabel('Episode')
plt.ylabel('MSVE')
plt.title('Semi-gradient: MSVE over Episodes')

plt.tight_layout()
plt.show()
Explanation
1. Visual Comparison:
o Plots total rewards and MSVE for both methods across episodes.

o Provides insights into performance improvements over time.

4. Episodic Semi-gradient Control - SARSA


This code implements an agent that uses Episodic Semi-gradient SARSA with
linear function approximation to solve a simple 2x2 grid world problem. The agent
learns the Q-values using a feature-based representation of state-action pairs and
updates its weights based on the semi-gradient method.
1. Environment (GridWorld):
o A 2x2 grid where the agent starts at (0, 0) and aims to reach the goal
at (1, 1).
o Four possible actions (Up, Down, Left, Right) move the agent within
grid boundaries.
o Reward is 1 at the goal and 0 elsewhere.

o The environment provides state, reward, and done flag for each step.

2. Episodic Semi-gradient SARSA:


o Uses state-action features to represent Q-values.
o Weights (w) are updated iteratively based on the semi-gradient of the TD error, w ← w + α · δ · ϕ(s, a), where:
 α: Learning rate.
 δ: Temporal Difference (TD) error, δ = r + γ · w⊤ϕ(s′, a′) − w⊤ϕ(s, a).
 ϕ(s, a): Feature vector for the state-action pair.
3. Feature Representation:
o A one-hot encoded vector represents state-action pairs uniquely.

o Example: For state (0, 0) and action Up, only one feature is 1, and the
rest are 0.
4. Epsilon-Greedy Policy:
o Balances exploration and exploitation.

o Initially explores more (epsilon = 1.0) and decays over episodes to


exploit learned Q-values (epsilon_min = 0.01).
5. MSVE (Mean Squared Value Error):
o Tracks the Bellman error squared across episodes to evaluate
learning stability.
o Helps monitor how well the approximator converges to optimal Q-
values.
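The document summarizes this program without listing its code here; the following is a minimal, self-contained sketch of episodic semi-gradient SARSA on a 2x2 grid under the assumptions above (the names GridWorld, phi, q, and choose_action are illustrative, not the document's exact code):

import numpy as np

# Minimal sketch: 2x2 grid, start at (0, 0), goal at (1, 1), reward 1 at the goal.
class GridWorld:
    def __init__(self):
        self.goal = (1, 1)
        self.state = (0, 0)

    def reset(self):
        self.state = (0, 0)
        return self.state

    def step(self, action):
        x, y = self.state
        if action == 0:   x = max(x - 1, 0)  # Up
        elif action == 1: x = min(x + 1, 1)  # Down
        elif action == 2: y = max(y - 1, 0)  # Left
        elif action == 3: y = min(y + 1, 1)  # Right
        self.state = (x, y)
        reward = 1 if self.state == self.goal else 0
        done = self.state == self.goal
        return self.state, reward, done

n_states, n_actions = 4, 4
w = np.zeros(n_states * n_actions)  # linear weights
alpha, gamma = 0.1, 0.99
epsilon, epsilon_min, epsilon_decay = 1.0, 0.01, 0.995

def phi(state, action):
    # One-hot feature vector for a state-action pair
    features = np.zeros(n_states * n_actions)
    features[(state[0] * 2 + state[1]) * n_actions + action] = 1
    return features

def q(state, action):
    return np.dot(w, phi(state, action))

def choose_action(state):
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax([q(state, a) for a in range(n_actions)]))

env = GridWorld()
for episode in range(200):
    s = env.reset()
    a = choose_action(s)
    done = False
    while not done:
        s2, r, done = env.step(a)
        a2 = choose_action(s2)
        target = r if done else r + gamma * q(s2, a2)
        delta = target - q(s, a)        # TD error
        w += alpha * delta * phi(s, a)  # semi-gradient update
        s, a = s2, a2
    epsilon = max(epsilon_min, epsilon * epsilon_decay)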

5. Semi-gradient n-step Sarsa


1. Environment: GridEnvironment
Grid Setup
 The grid environment is represented as a 2D numpy array:
self.grid = np.array([
[0, 0, 0, 1], # Goal state at (0,3) with reward +1
[0, -1, 0, -1], # Penalty states (-1 reward) at obstacles
[0, 0, 0, 0] # Start state at (2,0) with reward 0
])
o 0 represents neutral cells.

o 1 represents the goal cell with a reward.

o -1 represents penalty cells or obstacles.

State and Action Spaces


 States: Each cell in the grid is a unique state.
 Actions: Four possible actions:
o 0: Move Up.

o 1: Move Down.

o 2: Move Left.

o 3: Move Right.

State Transitions
 Transitions are deterministic:
o Moving in a specified direction changes the agent's position in the
grid.
o Boundary conditions are handled using max() and min() functions to
ensure the agent doesn't leave the grid.
Rewards
 The agent receives:
o +1 for reaching the goal.

o -1 for stepping into penalty cells.

o 0 for other transitions.

2. Hyperparameters
Key Hyperparameters:
1. Learning Rate (alpha): Controls how much the weights are adjusted
during learning.
o alpha = 0.1: Small updates to weights.

2. Discount Factor (gamma): Represents the importance of future rewards.


o gamma = 0.99: High importance for future rewards.

3. Exploration Rate (epsilon): Balances exploration vs. exploitation:


o Starts at epsilon = 1.0 (fully random actions).

o Decays over episodes (epsilon_decay = 0.995) to encourage


exploitation as the agent learns.
o Minimum value: epsilon_min = 0.01.

4. Episodes: Number of training episodes (episodes = 1000).

3. Feature Representation
 Each state-action pair is represented using one-hot encoding.
 Total features: n_states * n_actions.
o Example: for the 3x4 grid with 4 actions, the feature vector has 12 * 4 = 48 entries.
 get_features(state, action):
o Transforms a state-action pair into a feature vector with a single 1 at index state_index * n_actions + action.
4. Epsilon-Greedy Policy
 choose_action(state, epsilon):
o With probability ε, choose a random action (exploration).

o Otherwise, choose the action with the highest Q-value (exploitation).

 Q-values are approximated using:


q_values = [np.dot(weights, get_features(state, a)) for a in range(env.n_actions)]
o The weights vector maps feature vectors to Q-values.

5. Training Loop


The Semi-Gradient SARSA algorithm updates the weights iteratively to
approximate the Q-values for each state-action pair.
1. Initialization:
o Reset the environment (state = env.reset()).

o Choose the initial action using epsilon-greedy.

2. Episode Loop:
o Take the action and observe:

 Next state, reward, and whether the episode is done.


o Compute:
 Target Q-value: reward + gamma * np.dot(weights, get_features(next_state, next_action)).
 Current Q-value: np.dot(weights, get_features(state, action)).
o Calculate the Bellman error: the difference between the target and current Q-values.
o Update the weights:
weights += alpha * bellman_error * get_features(state, action)


3. Repeat:
o Move to the next state and action.

o Continue until the episode ends.

6. Metrics
 Total Rewards:
o Tracks cumulative rewards per episode to monitor performance.

 Mean Squared Value Error (MSVE):


o Bellman error squared is accumulated over an episode to measure
the agent's value approximation accuracy.

7. Visualization
 MSVE Plot:
o Tracks how the value approximation improves over episodes.

o Helps assess the convergence of the learning process.

8. Testing
 After training, the agent is tested with a greedy policy:
o ϵ=ϵmin: No exploration.

o Observes and prints the agent's trajectory, actions, and total reward.

6. Episodic SARSA in mountain car


1. Environment: MountainCar-v0
 Goal: In the MountainCar-v0 environment, the objective is to drive a car to
the top of a mountain.
 Challenge: The car's engine is weak and cannot directly climb the
mountain. The agent needs to build momentum by moving back and forth.
 State Space: The environment provides continuous state variables:
o Position: Ranges from -1.2 to 0.6.

o Velocity: Ranges from -0.07 to 0.07.

 Action Space: There are three discrete actions:


o Accelerate left (0), do nothing (1), or accelerate right (2).
2. SARSA Algorithm
SARSA (State-Action-Reward-State-Action) is an on-policy Temporal Difference (TD)
reinforcement learning algorithm. The key steps are:
Update Rule:
The SARSA update rule for Q-values is Q(s, a) ← Q(s, a) + α [r + γ Q(s′, a′) − Q(s, a)], where:
 s, a: Current state and action.
 r: Reward received for taking action a in state s.
 s′, a′: Next state and next action.
 α: Learning rate, controls how much new information overrides old Q-values.
 γ: Discount factor, balances immediate vs. future rewards.
Steps in SARSA:
1. Start with a state s.
2. Choose an action a using the ε-greedy policy.
3. Perform a and observe r (reward) and s′ (next state).
4. Choose a′ (next action) in s′ using the ε-greedy policy.
5. Update Q(s,a) using the SARSA rule.
6. Set s=s′, a=a′, and repeat until the episode ends.

3. Code Components
Discretizing the State Space
 The MountainCar-v0 environment has a continuous state space, which
needs to be discretized for tabular reinforcement learning methods like
SARSA.
 Discretization Method: Divide the continuous state space into discrete
bins:
state_space = [np.linspace(-1.2, 0.6, num_states), # Position bins
np.linspace(-0.07, 0.07, num_states)] # Velocity bins
 The function discretize_state(state) maps a continuous state to discrete
indices:
np.digitize(state[i], state_space[i]) - 1
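The full body of discretize_state is not shown in the document; the sketch below is consistent with the np.digitize call above (num_states = 20 is assumed, matching the Q-table that follows):

import numpy as np

num_states = 20  # bins per state dimension (assumed)
state_space = [np.linspace(-1.2, 0.6, num_states),    # position bins
               np.linspace(-0.07, 0.07, num_states)]  # velocity bins

def discretize_state(state):
    # Map a continuous (position, velocity) pair to a tuple of bin indices
    return tuple(np.digitize(state[i], state_space[i]) - 1 for i in range(2))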
Q-Table Initialization
 The Q-table stores the estimated Q-values for each state-action pair.
 Dimensions:
o States: Discretized position and velocity (20 bins each).

o Actions: 3 discrete actions.

Q = np.zeros((num_states, num_states, env.action_space.n))

Episodic Training Loop


For Each Episode:
1. Initialize State: Discretize the initial state.
state = discretize_state(env.reset()[0])
2. Action Selection (ε-greedy policy): With probability ϵ, select a random
action (exploration). Otherwise, select the action with the highest Q-value
(exploitation).
action = np.argmax(Q[state]) if np.random.rand() >= epsilon else env.action_space.sample()
3. Action Execution: Take the action in the environment and observe:
o Next state (s′)

o Reward (r)

o Done flag (indicates episode termination).

next_state, reward, done, _, _ = env.step(action)


4. Next Action Selection (ε-greedy): Choose the next action a′ in the new
state s′
next_action = np.argmax(Q[next_state]) if np.random.rand() >= epsilon else env.action_space.sample()
5. Q-value Update: Apply the SARSA update rule:
Q[state][action] += alpha * (reward + gamma * Q[next_state][next_action] - Q[state][action])
6. Transition to the Next State: Update the current state and action for the
next iteration:
state = next_state
action = next_action
7. Decay Epsilon: Gradually reduce ε (the exploration rate) to shift from exploration to exploitation:
epsilon = max(epsilon_min, epsilon * epsilon_decay)
8. Store Reward: Track the cumulative reward for each episode.
4. Post-Training: Testing the Agent
After training, the agent is tested by always selecting the greedy action
in each state.

5. Results and Visualization


1. Reward Trend: The code plots the total reward per episode. A successful
training should show an upward trend in rewards.
plt.plot(rewards)
plt.xlabel('Episode')
plt.ylabel('Total Reward')
plt.title('Episodic SARSA on MountainCar-v0')
2. Visual Testing: During testing, the environment renders the agent's
performance in solving the task.
env.render()

7. λ-Return
Block 1: Importing Libraries
import numpy as np
import matplotlib.pyplot as plt
 numpy: Used for handling numerical calculations and arrays.
 matplotlib.pyplot: Used for plotting graphs and visualizations.

Block 2: lambda_return Function


def lambda_return(rewards, values, gamma, lambd):
    """
    Compute the λ-return for a given trajectory with debugging information.
    """
 lambda_return: The function calculates the λ-return.
 Inputs:
o rewards: List of rewards received at each time step.

o values: Estimated values for each state in the trajectory.

o gamma: Discount factor for future rewards.

o lambd: λ value used to blend n-step returns and bootstrapping.


Block 3: Initialize Variables
T = len(rewards) # Number of steps
g_lambda = 0 # Initialize the λ-return
returns = [] # To store intermediate returns for plotting
 T: The total number of steps (length of the rewards list).
 g_lambda: Variable to store the final λ-return.
 returns: List to store intermediate n-step returns (for visualization).

Block 4: Loop Over Each Step and Compute n-Step Returns


for n in range(1, T + 1):
    n_step_return = sum(gamma**k * rewards[k] for k in range(n))  # n-step return
    if n < T:
        n_step_return += gamma**n * values[n]  # Add bootstrapped value for the next state
    g_lambda += (lambd**(n - 1)) * n_step_return  # Add weighted n-step return to the λ-return
    returns.append(n_step_return)  # Store the n-step return for plotting
 The loop computes the n-step return for each step n and combines it with
bootstrapped values (the estimated values of future states).
 gamma discounts the rewards, and lambd determines how much weight
each n-step return gets.

Block 5: Final Calculation of λ-Return


g_lambda *= (1 - lambd) # Final scaling of λ-return
 Scales the total λ-return by (1 - λ) to adjust for the full sum.
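For comparison, the standard episodic λ-return is

G_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{T-t-1} \lambda^{\,n-1} G_{t:t+n} + \lambda^{\,T-t-1} G_t, \qquad G_{t:t+n} = \sum_{k=0}^{n-1} \gamma^{k} R_{t+k+1} + \gamma^{n} V(S_{t+n})

note that this code applies the (1 − λ) factor once to the whole sum rather than weighting the terminal return separately, which is a simplification of the textbook formula.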

Block 6: Plotting and Debugging


print(f"Intermediate returns: {returns}") # Print n-step returns
plt.plot(returns, label="n-step Returns") # Plot the n-step returns
plt.axhline(y=g_lambda, color='r', linestyle='--', label="λ-return")  # Plot the λ-return
plt.xlabel("Step") # Label for x-axis
plt.ylabel("Return") # Label for y-axis
plt.legend() # Add legend to the plot
plt.title("λ-Return and n-step Returns") # Title for the plot
plt.show() # Display the plot
 Prints out the intermediate n-step returns.
 Plots the n-step returns and the final λ-return.
 axhline is used to plot the horizontal line representing the final λ-return.

Block 7: Return the λ-Return


return g_lambda
 The function returns the final λ-return.

Block 8: Example Usage


# Example
rewards = [1, 1, 1, 1] # Rewards at each time step
values = [0.5, 0.6, 0.7, 0.8] # Estimated values for each state
gamma = 0.9 # Discount factor for future rewards
lambd = 0.8 # λ to combine n-step returns and bootstrapping

# Call the function to compute λ-return and plot results


lambda_return(rewards, values, gamma, lambd)
 Here, we define some example rewards and values.
 gamma is the discount factor (how much future rewards are valued).
 lambd is the mixing factor for λ-return.
 The function lambda_return is called to compute the λ-return and visualize
the results.

8. TD(𝜆)
1. Imports:
import gym
import numpy as np
import matplotlib.pyplot as plt
 gym: The Gym library is used for creating and interacting with
reinforcement learning environments.
 numpy: Used for numerical operations and managing arrays (e.g., value
function and eligibility traces).
 matplotlib: Used for plotting the results of the learning process (episode
rewards).

2. The td_lambda Function:


def td_lambda(env, gamma=0.9, lambd=0.8, alpha=0.1, episodes=100):
This function implements the TD(λ) algorithm, where:
 env: The environment (e.g., FrozenLake-v1).
 gamma: The discount factor that determines how much future rewards are
considered.
 lambd: The λ parameter that controls the combination of n-step returns
and bootstrapping (eligibility trace decay).
 alpha: The learning rate that controls how much to update the value
function.
 episodes: The number of episodes to run the algorithm for.

3. Initialization:
V = np.zeros(env.observation_space.n)
eligibility_trace = np.zeros(env.observation_space.n)
 V: Initializes the value function, representing the estimated value of each
state. All states are initially assumed to have value 0.
 eligibility_trace: This array tracks how recently a state has been visited. It
will decay over time with factor gamma * lambd.

4. Main Loop (Over Episodes):


for episode in range(episodes):
    state, _ = env.reset()  # Reset environment to the initial state
    done = False
    total_reward = 0
 For each episode, the environment is reset.
 state: The starting state of the environment after reset.
 done: A flag indicating if the episode has finished.
 total_reward: Keeps track of the total reward accumulated in the episode.

5. Loop Over Time Steps (Within Each Episode):


while not done:
    action = env.action_space.sample()  # Random action selection
    next_state, reward, done, _, info = env.step(action)  # Take the action and observe the result
 action: A random action is chosen from the action space of the
environment.
 env.step(action): Executes the chosen action and returns the next state,
reward, done flag, additional info.

6. TD(λ) Update Rule:


td_error = reward + gamma * V[next_state] - V[state]  # Temporal-difference error
eligibility_trace[state] += 1  # Increase the eligibility trace for the current state

# Update the value function and eligibility traces
V += alpha * td_error * eligibility_trace
eligibility_trace *= gamma * lambd  # Decay the eligibility traces
 td_error: The difference between the predicted value (V[state]) and the
observed value (reward + discounted value of the next state).
 eligibility_trace[state]: Increases the eligibility trace for the current state,
indicating it has been recently visited.
 The value function V is updated based on the TD(λ) update rule:
o V += alpha * td_error * eligibility_trace: Updates the value
function for all states using the TD error and eligibility traces.
o eligibility_trace *= gamma * lambd: Decays the eligibility traces,
where gamma * lambd represents the decay factor.
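Written out, these updates correspond to accumulating-trace TD(λ):

\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t), \qquad e_t(s) = \gamma \lambda\, e_{t-1}(s) + \mathbb{1}[S_t = s], \qquad V(s) \leftarrow V(s) + \alpha\, \delta_t\, e_t(s) \;\; \text{for all } s.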

7. Recording Total Reward:


state = next_state  # Transition to the next state
total_reward += reward  # Accumulate total reward for this episode
episode_lengths.append(total_reward)  # Store the total reward for this episode
 state = next_state: Transitions to the next state.
 total_reward += reward: Adds the reward from the current time step to
the total reward.

8. Plotting the Learning Curve:


plt.plot(range(episodes), episode_lengths) # Plot rewards over episodes
plt.xlabel("Episode")
plt.ylabel("Total Reward")
plt.title("TD(λ) Learning")
plt.show()
 episode_lengths stores the total reward for each episode.
 The plot shows the total reward per episode as the learning progresses,
which gives an indication of how well the agent is performing.

9. Example Usage:
env = gym.make("FrozenLake-v1")
td_lambda(env, gamma=0.9, lambd=0.8, alpha=0.1, episodes=100)
 Creates the FrozenLake-v1 environment using gym.make.
 Calls the td_lambda function to train the agent using TD(λ) with specified
parameters.

9. SARSA(𝜆)
1. Imports:
import gym
import numpy as np
import matplotlib.pyplot as plt
 gym: The Gym library is used for creating and interacting with
reinforcement learning environments.
 numpy: Used for numerical operations and managing arrays (e.g., Q-
function and eligibility traces).
 matplotlib: Used for plotting the results of the learning process (episode
rewards).
2. The sarsa_lambda Function:
def sarsa_lambda(env, gamma=0.9, lambd=0.8, alpha=0.1, episodes=100):
This function implements the SARSA(λ) algorithm, where:
 env: The environment (e.g., FrozenLake-v1).
 gamma: The discount factor that determines how much future rewards are
considered.
 lambd: The λ parameter that controls the combination of n-step returns
and bootstrapping (eligibility trace decay).
 alpha: The learning rate that controls how much to update the Q-function.
 episodes: The number of episodes to run the algorithm for.

3. Initialization:
Q = np.zeros((env.observation_space.n, env.action_space.n))
eligibility_trace = np.zeros((env.observation_space.n, env.action_space.n))
 Q: Initializes the Q-function, which estimates the value of state-action pairs.
It's a 2D array where rows correspond to states and columns correspond to
actions.
 eligibility_trace: This array tracks how recently a state-action pair has
been visited. It will decay over time with factor gamma * lambd.

4. Main Loop (Over Episodes):


for episode in range(episodes):
    state, _ = env.reset()  # Reset environment to the initial state
    action = np.random.choice(env.action_space.n)  # Random action selection
    done = False
    total_reward = 0
 For each episode, the environment is reset.
 state: The starting state of the environment after reset.
 action: A random action is chosen from the action space of the
environment.
 done: A flag indicating if the episode has finished.
 total_reward: Keeps track of the total reward accumulated in the episode.

5. Loop Over Time Steps (Within Each Episode):


while not done:
    next_state, reward, done, truncated, info = env.step(action)  # Take the action and observe the result
    next_action = np.random.choice(env.action_space.n)
 action: The current action chosen.
 env.step(action): Executes the chosen action and returns the next state,
reward, done flag, truncated flag, and additional info.
 next_action: A new action is chosen for the next state randomly.

6. SARSA(λ) Update Rule:


td_error = reward + gamma * Q[next_state, next_action] - Q[state, action]
eligibility_trace[state, action] += 1  # Increase the eligibility trace for the current state-action pair

# Update the Q-function and eligibility traces
Q += alpha * td_error * eligibility_trace
eligibility_trace *= gamma * lambd  # Decay the eligibility traces
 td_error: The difference between the expected Q-value (using the next
state's action) and the observed Q-value (for the current state-action pair).
 eligibility_trace[state, action]: Increases the eligibility trace for the
current state-action pair, indicating it has been recently visited.
 The Q-function is updated according to the SARSA(λ) update rule:
o Q += alpha * td_error * eligibility_trace: Updates the Q-function
for all state-action pairs using the temporal difference (TD) error and
eligibility traces.
o eligibility_trace *= gamma * lambd: Decays the eligibility traces
with the factor gamma * lambd.
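Written out, these updates correspond to accumulating-trace SARSA(λ):

\delta_t = R_{t+1} + \gamma\, Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t), \qquad e_t(s, a) = \gamma \lambda\, e_{t-1}(s, a) + \mathbb{1}[S_t = s, A_t = a], \qquad Q(s, a) \leftarrow Q(s, a) + \alpha\, \delta_t\, e_t(s, a) \;\; \text{for all } (s, a).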

7. Recording Total Reward:


state = next_state  # Transition to the next state
action = next_action  # Transition to the next action
total_reward += reward  # Accumulate total reward for this episode

episode_lengths.append(total_reward)  # Store the total reward for this episode
 state = next_state: Transitions to the next state.
 action = next_action: Transitions to the next action.
 total_reward += reward: Adds the reward from the current time step to
the total reward.

8. Plotting the Learning Curve:


plt.plot(range(episodes), episode_lengths) # Plot rewards over episodes
plt.xlabel("Episode")
plt.ylabel("Total Reward")
plt.title("SARSA(λ) Learning")
plt.show()
 episode_lengths stores the total reward for each episode.
 The plot shows the total reward per episode as the learning progresses,
giving an indication of how well the agent is performing.

9. Example Usage:
env = gym.make("FrozenLake-v1")
sarsa_lambda(env, gamma=0.9, lambd=0.8, alpha=0.1, episodes=100)
 Creates the FrozenLake-v1 environment using gym.make.
 Calls the sarsa_lambda function to train the agent using SARSA(λ) with the specified parameters.

10. Reinforce
1. Imports and Configuration:
import gym
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import matplotlib.pyplot as plt
 gym: Used for the reinforcement learning environment.
 pandas: Used for data handling (not extensively used in this example).
 numpy: For array manipulation and handling numerical operations.
 torch: PyTorch library for building and training neural networks.
 matplotlib: For plotting the rewards over episodes.
2. Hyperparameters and Device Configuration:
DEVICE = "cpu" # Change to "cuda:0" if using a GPU

ACTION_SPACE = [0, 1] # CartPole actions: 0 (left), 1 (right)


EPISODES = 800 # Number of episodes
STEPS = 500 # Max steps per episode
GAMMA = 0.9 # Discount factor for rewards
RENDER = False # Whether to render the environment during training
 DEVICE: Set to "cpu" by default but can be changed to "cuda:0" for GPU
use.
 ACTION_SPACE: Actions available in CartPole (left or right).
 EPISODES: Number of episodes to train for.
 STEPS: Max steps per episode.
 GAMMA: Discount factor used in the calculation of discounted rewards.
 RENDER: Flag to enable or disable rendering.
3. Reinforce Model Definition:
class ReinforceModel(nn.Module):
    def __init__(self, num_action, num_input):
        super(ReinforceModel, self).__init__()
        self.num_action = num_action
        self.num_input = num_input
        self.layer1 = nn.Linear(num_input, 64)   # Input layer
        self.layer2 = nn.Linear(64, num_action)  # Output layer (actions)

    def forward(self, x):
        x = torch.tensor(x, dtype=torch.float32, device=DEVICE).unsqueeze(0)
        x = F.relu(self.layer1(x))                  # ReLU activation
        actions = F.softmax(self.layer2(x), dim=1)  # Softmax for action probabilities
        action = self.get_action(actions)           # Sample action based on probabilities
        log_prob_action = torch.log(actions.squeeze(0))[action]  # Log probability of the chosen action
        return action, log_prob_action

    def get_action(self, a):
        return np.random.choice(ACTION_SPACE, p=a.squeeze(0).detach().cpu().numpy())  # Action sampling
 ReinforceModel: A neural network for the REINFORCE algorithm. It has two
layers:
o Layer 1: Linear transformation from input (state) to a hidden layer of
size 64.
o Layer 2: Linear transformation to output probabilities for each action.

o forward: Performs forward pass, computes action probabilities, and


returns the chosen action and its log probability.
o get_action: Samples an action from the distribution defined by the
action probabilities.
4. Training Loop:
for episode in range(EPISODES):
    done = False
    state = env.reset()
    lp = []  # Log probabilities
    r = []   # Rewards
    s = []   # States
    a = []   # Actions
    d = []   # Done flags

    for step in range(STEPS):
        action, log_prob = model(state)       # Get action and log probability from the model
        state, r_, done, _ = env.step(action)  # Perform the action and observe the next state and reward
        lp.append(log_prob)  # Save log probabilities
        r.append(r_)         # Save rewards
        if done:
            all_rewards.append(np.sum(r))  # Save total reward for the episode
            if episode % 100 == 0:
                print(f"EPISODE {episode} SCORE: {np.sum(r)} roll{pd.Series(all_rewards).tail(30).mean()}")
            break
 For each episode, it interacts with the environment until the episode ends
(either by reaching a goal or exceeding the step limit).
 action: Chosen by the model.
 log_prob: Log probability of the action.
 state, r_, done, _: The state, reward, done flag, and info returned by
env.step(action).
5. Discounted Reward Calculation:
discounted_rewards = np.zeros_like(r)
for t in range(len(r)):
    Gt = 0
    pw = 0
    for r_ in r[t:]:
        Gt += GAMMA ** pw * r_
        pw += 1
    discounted_rewards[t] = Gt
 discounted_rewards: Computes the discounted rewards for each step in
the episode.
6. Normalize and Update Model:
discounted_rewards = torch.tensor(discounted_rewards, dtype=torch.float32, device=DEVICE)
discounted_rewards = (discounted_rewards - torch.mean(discounted_rewards)) / torch.std(discounted_rewards)  # Normalize
log_prob = torch.stack(lp)                         # Stack log probabilities
policy_gradient = -log_prob * discounted_rewards   # Compute policy gradient
model.zero_grad()                                  # Zero gradients
policy_gradient.sum().backward()                   # Backpropagate
optimizer.step()                                   # Update model parameters
 discounted_rewards: Normalized discounted rewards.
 log_prob: Stacks log probabilities for all time steps in the episode.
 policy_gradient: The policy gradient is computed as the negative log
probability times the discounted reward.
 optimizer.step(): Updates the model parameters using backpropagation.
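This implements the standard REINFORCE gradient estimate: with G_t the (normalized) discounted return,

\nabla_{\theta} J(\theta) \approx \sum_{t} G_t\, \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)

so summing -log π_θ(a_t | s_t) · G_t and calling backward() performs gradient ascent on the expected return.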
7. Plotting Rewards:
plt.plot(all_rewards) # Plot the rewards
plt.xlabel('Episodes') # X-axis label
plt.ylabel('Score') # Y-axis label
plt.title('Training Progress over Episodes') # Title for the plot
plt.show()
 all_rewards: A list containing the total rewards for each episode.
 Plotting: The rewards are plotted to visualize training progress.
8. Saving the Model (Optional):
torch.save(model.state_dict(), "reinforce_cartpole_model.pth")
 torch.save: Saves the trained model's parameters to a file.

11. CartPole with Monte-Carlo


The goal is to evaluate the performance of a random policy in the FrozenLake-v1 environment using OpenAI's Gym. The environment is a grid world where the agent must navigate across a frozen lake to reach the goal while avoiding holes.
Key Functions:
1. create_random_policy(env):
o This function generates a random policy for the environment. Each
state has an equal probability of taking any action. For FrozenLake-v1,
this means the agent could move in any direction with equal
probability at each state.
2. run_game(env, policy, display=True):
o This function simulates one episode in the environment based on the
given policy. It iterates over steps, selecting actions according to the
policy, and recording the states, actions, and rewards at each
timestep.
o The environment is rendered if display=True, showing the agent's
actions as it navigates.
3. test_policy(policy, env, num_episodes=100):
o This function runs multiple episodes (default 100) with the provided
policy and records the results. It counts how many times the agent
wins (reaches the goal) and stores the reward for each episode.
o After running all episodes, it plots the rewards and returns the win
fraction (the percentage of episodes where the agent won).
Process Flow:
1. Policy Creation: A random policy is created using create_random_policy,
where each state-action pair has an equal probability.
2. Game Simulation: The game is run with the random policy using the
run_game function.
3. Evaluation: The test_policy function is used to run multiple episodes, count
the number of wins (episodes where the agent reaches the goal), and plot
the rewards over the episodes.
4. Result: The win fraction (percentage of wins) is printed after testing the
policy.
Expected Outcome:
 The agent's performance with a random policy will be limited since it's
purely random and doesn't learn or adapt.
 The plot will show the rewards per episode, and the win fraction will likely
be low, indicating the random policy is not effective in solving the
environment.
Sample Output:
 Win Fraction: This will be the fraction of episodes where the agent
successfully reaches the goal.
 Rewards Plot: A graph of rewards for each episode will show how the
agent performs across all episodes.
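The functions themselves are not listed in this section; the sketch below is a minimal reconstruction based on the descriptions above (the exact signatures and plotting details are assumptions), using the same new-style Gym API as the TD(λ) and SARSA(λ) sections:

import gym
import numpy as np
import matplotlib.pyplot as plt

def create_random_policy(env):
    # Uniform random policy: each action equally likely in every state
    n_s, n_a = env.observation_space.n, env.action_space.n
    return {s: {a: 1.0 / n_a for a in range(n_a)} for s in range(n_s)}

def run_game(env, policy, display=False):
    # Play one episode following the policy; return the episode's total reward
    state, _ = env.reset()
    done, total_reward = False, 0.0
    while not done:
        if display:
            env.render()
        actions = list(policy[state].keys())
        probs = list(policy[state].values())
        action = np.random.choice(actions, p=probs)
        state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        total_reward += reward
    return total_reward

def test_policy(policy, env, num_episodes=100):
    # Run many episodes, plot the rewards, and return the win fraction
    rewards = [run_game(env, policy) for _ in range(num_episodes)]
    plt.plot(rewards)
    plt.xlabel("Episode")
    plt.ylabel("Reward")
    plt.show()
    return sum(1 for r in rewards if r > 0) / num_episodes

env = gym.make("FrozenLake-v1")
print("Win fraction:", test_policy(create_random_policy(env), env))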

12. Actor–Critic Methods


1. Importing Libraries
import gymnasium as gym
import numpy as np
import torch
import torch.nn as nn
from torch.nn import functional as F
import matplotlib.pyplot as plt

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


This block imports the required libraries:
 gymnasium for the environment (CartPole-v1).
 numpy for array operations.
 torch for defining and training neural networks.
 matplotlib for plotting the results.
2. Defining Actor and Value Networks
class ActorNet(nn.Module):
    def __init__(self, hidden_dim=16):
        super().__init__()
        self.hidden = nn.Linear(4, hidden_dim)  # State dimension is 4 for CartPole
        self.output = nn.Linear(hidden_dim, 2)  # Two actions: left or right

    def forward(self, s):
        outs = self.hidden(s)
        outs = F.relu(outs)
        logits = self.output(outs)
        return logits


class ValueNet(nn.Module):
    def __init__(self, hidden_dim=16):
        super().__init__()
        self.hidden = nn.Linear(4, hidden_dim)
        self.output = nn.Linear(hidden_dim, 1)

    def forward(self, s):
        outs = self.hidden(s)
        outs = F.relu(outs)
        value = self.output(outs)
        return value
Here, two neural networks are defined:
 ActorNet: Outputs logits for choosing actions (left or right).
 ValueNet: Outputs a scalar value (state value) used for computing the
advantage in policy gradient methods.
3. Instantiating Networks and Optimizers
actor_func = ActorNet().to(device)
value_func = ValueNet().to(device)
gamma = 0.99  # Discount factor

# Optimizers
opt1 = torch.optim.AdamW(value_func.parameters(), lr=0.001)  # For the Critic (ValueNet)
opt2 = torch.optim.AdamW(actor_func.parameters(), lr=0.001)  # For the Actor (ActorNet)
The actor and value networks are instantiated and moved to the available device
(GPU/CPU). Optimizers (AdamW) are set up for both networks.
4. Defining the Action Selection Function
def pick_sample(s):
    with torch.no_grad():
        # Convert the state to a tensor and pass it through the actor network
        s_batch = np.expand_dims(s, axis=0)
        s_batch = torch.tensor(s_batch, dtype=torch.float).to(device)
        logits = actor_func(s_batch)
        logits = logits.squeeze(dim=0)
        probs = F.softmax(logits, dim=-1)            # Convert logits to probabilities
        a = torch.multinomial(probs, num_samples=1)  # Sample an action
        return a.tolist()[0]                         # Return the action as an integer
This function uses the actor network to select an action. The state is passed
through the network to get logits, then softmax is applied to get probabilities, and
the action is sampled using torch.multinomial.
5. Main Training Loop
env = gym.make("CartPole-v1")
reward_records = []

for i in range(1500):
    done = False
    states, actions, rewards = [], [], []
    s, _ = env.reset()

    while not done:
        states.append(s.tolist())
        a = pick_sample(s)
        s, r, term, trunc, _ = env.step(a)
        done = term or trunc
        actions.append(a)
        rewards.append(r)

    # Compute cumulative rewards (discounted)
    cum_rewards = np.zeros_like(rewards)
    reward_len = len(rewards)
    for j in reversed(range(reward_len)):
        cum_rewards[j] = rewards[j] + (cum_rewards[j+1] * gamma if j+1 < reward_len else 0)

    # Train the value function (Critic)
    opt1.zero_grad()
    states = torch.tensor(states, dtype=torch.float).to(device)
    cum_rewards = torch.tensor(cum_rewards, dtype=torch.float).to(device)
    values = value_func(states).squeeze(dim=1)
    vf_loss = F.mse_loss(values, cum_rewards, reduction="none")
    vf_loss.sum().backward()
    opt1.step()

    # Train the policy (Actor)
    with torch.no_grad():
        values = value_func(states).squeeze(dim=1)  # squeeze so the advantage matches cum_rewards' shape
    opt2.zero_grad()
    actions = torch.tensor(actions, dtype=torch.int64).to(device)
    advantages = cum_rewards - values  # Compute the advantage
    logits = actor_func(states)
    log_probs = -F.cross_entropy(logits, actions, reduction="none")  # Log probabilities
    pi_loss = -log_probs * advantages
    pi_loss.sum().backward()
    opt2.step()

    print(f"Run episode {i} with rewards {sum(rewards)}", end="\r")
    reward_records.append(sum(rewards))

    # Early stopping if the average reward exceeds the threshold
    if np.average(reward_records[-50:]) > 475.0:
        break

print("\nDone")
env.close()
This block contains the core training loop:
 Environment reset: Initializes each episode.
 Action selection: The agent picks an action using the pick_sample
function.
 Reward computation: Cumulative rewards are calculated using
discounting.
 Training: The networks are optimized by:
o Critic (ValueNet): Mean squared error loss is used for value
function.
o Actor (ActorNet): Policy gradient is used to update the actor based
on the advantage.
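In equations, the per-episode losses being optimized are (with G_t the discounted cumulative reward computed above and A_t the advantage):

A_t = G_t - V_{\phi}(s_t), \qquad L_{\text{critic}} = \sum_t \bigl(V_{\phi}(s_t) - G_t\bigr)^2, \qquad L_{\text{actor}} = -\sum_t A_t \log \pi_{\theta}(a_t \mid s_t).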
6. Plotting the Results
# Plotting the results: Rewards over episodes
average_reward = []
for idx in range(len(reward_records)):
    avg_list = np.empty(shape=(1,), dtype=int)
    if idx < 50:
        avg_list = reward_records[:idx+1]
    else:
        avg_list = reward_records[idx-49:idx+1]
    average_reward.append(np.average(avg_list))

plt.plot(reward_records)
plt.plot(average_reward)
plt.title('Rewards Over Time')
plt.xlabel('Episode')
plt.ylabel('Total Reward')
plt.show()
After training, this block generates a plot:
 reward_records: Shows total rewards per episode.
 average_reward: Shows the moving average of the last 50 episodes to
smooth the plot.
