DSA: Tutorial
Class: B.Tech CSE AIML [Semester 6]
Name: K V NATARAJA YADAV
Reg No: 2022BCSE07AED794
Question No.    Questions    CO    BTL
1. (CO1, BTL2) Develop an agent that can interact with a Multi-Armed Bandit environment, explore the different arms, and gradually converge to the arm that provides the highest average reward. The agent should learn to make decisions that maximize the cumulative reward over time by effectively balancing exploration and exploitation.
Ans Implementation of an agent interacting with a Multi-Armed Bandit (MAB) environment using the ε-greedy strategy, which balances exploration and exploitation. The agent gradually learns to prefer the arm with the highest average reward.
1. Simulate a bandit environment with a given number of arms.
2. Implement an agent using the ε-greedy strategy.
3. Track the cumulative reward over time.
Program:
import numpy as np
import matplotlib.pyplot as plt

class MultiArmedBandit:
    # k-armed Bernoulli bandit with a random success probability per arm
    def __init__(self, k):
        self.k, self.probs = k, np.random.rand(k)

    def pull(self, arm):
        return 1 if np.random.rand() < self.probs[arm] else 0

class EpsilonGreedyAgent:
    # explores a random arm with probability eps, otherwise exploits the best estimate
    def __init__(self, k, eps):
        self.k, self.eps, self.c, self.v = k, eps, np.zeros(k), np.zeros(k)

    def select_arm(self):
        return np.random.randint(self.k) if np.random.rand() < self.eps else np.argmax(self.v)

    def update(self, arm, reward):
        self.c[arm] += 1
        self.v[arm] += (reward - self.v[arm]) / self.c[arm]  # incremental mean

class UCB1Agent:
    # picks the arm with the highest upper confidence bound
    def __init__(self, k):
        self.k, self.c, self.v, self.t = k, np.zeros(k), np.zeros(k), 0

    def select_arm(self):
        for a in range(self.k):
            if self.c[a] == 0:  # play every arm once first
                return a
        return np.argmax(self.v + np.sqrt(2 * np.log(self.t) / self.c))

    def update(self, arm, reward):
        self.c[arm] += 1
        self.t += 1
        self.v[arm] += (reward - self.v[arm]) / self.c[arm]

class ThompsonSamplingAgent:
    # samples from a Beta posterior over each arm's success probability
    def __init__(self, k):
        self.k, self.s, self.f = k, np.ones(k), np.ones(k)

    def select_arm(self):
        return np.argmax(np.random.beta(self.s, self.f))

    def update(self, arm, reward):
        self.s[arm] += reward
        self.f[arm] += 1 - reward
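The classes above only define the bandit and the agents; a minimal run loop to track the cumulative reward is sketched below. The number of arms, the ε value, and the horizon are illustrative choices, not taken from the original program.

# Sketch: run the ε-greedy agent against the bandit and plot cumulative reward
k, steps = 10, 1000
bandit = MultiArmedBandit(k)
agent = EpsilonGreedyAgent(k, eps=0.1)
cumulative, total = np.zeros(steps), 0
for t in range(steps):
    arm = agent.select_arm()      # explore or exploit
    reward = bandit.pull(arm)     # Bernoulli reward from the chosen arm
    agent.update(arm, reward)     # update the running value estimate
    total += reward
    cumulative[t] = total
plt.plot(cumulative)
plt.xlabel("Time step")
plt.ylabel("Cumulative reward")
plt.show()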
2. A. Devise three example tasks of your own that fit into the MDP framework, identifying for each its states, actions, and rewards. Make the three examples as different from each other as possible. The framework is abstract and flexible and can be applied in many ways. Stretch its limits in some way in at least one of your examples.
B. Is the MDP framework adequate to usefully represent all goal-directed learning tasks?
3. (CO1, BTL2) Jack's Car Rental: Jack manages two locations for a nationwide car rental company. Each day, some number of customers arrive at each
location to rent cars. If Jack has a car available, he rents it out and is
credited $10 by the national company. If he is out of cars at that
location, then the business is lost. Cars become available for renting
the day after they are returned. To help ensure that cars are available
where they are needed, Jack can move them between the two
locations overnight, at a cost of $2 per car moved. We assume that
the number of cars requested and returned at each location are
Poisson random variables. Suppose λ is 3 and 4 for rental requests at
the first and second locations and 3 and 2 for returns. To simplify the
problem slightly, we assume that there can be no more than 20 cars at
each location (any additional cars are returned to the nationwide
company, and thus disappear from the problem) and a maximum of
five cars can be moved from one location to the other in one night.
Take the discount rate to be γ=0.9 and formulate this as a continuing
finite MDP, where the time steps are days, the state is the number of
cars at each location at the end of the day, and the actions are the net
numbers of cars moved between the two locations overnight.
Ans MDP Formulation of Jack’s Car Rental
1. States (S)
Each state represents the number of cars at both locations at the
end of the day.
Let s = (x, y), where x is the number of cars at location 1 and y the number at location 2 at the end of the day (0 ≤ x, y ≤ 20).
2. Actions (A)
The action a is the net number of cars moved overnight from location 1 to location 2, with a ∈ {−5, …, +5} (negative values move cars from location 2 to location 1; at most five cars can be moved).
3. Transitions (P(s'|s,a))
This models the probability of moving from one state to another,
based on:
Car requests and returns at both locations.
Requests and returns follow Poisson distributions:
o Location 1:
Rental requests: λ = 3
Returns: λ = 3
o Location 2:
Rental requests: λ = 4
Returns: λ = 2
Cars are rented up to the number available.
Extra cars returned beyond 20 are lost.
The next state depends on:
1. Current cars after moving
2. Cars rented based on request (min(request, available))
3. Cars returned
4. Truncating to 20 max cars per location
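As an illustrative example of the transition dynamics (the numbers are made up for this example, not from the original): if the end-of-day state is s = (10, 10) and a = 3 cars are moved from location 1 to location 2, the post-move counts are (7, 13). If the realized requests are (3, 4) and the returns are (3, 2), then min(3, 7) + min(4, 13) = 7 cars are rented, and the next state is (min(7 − 3 + 3, 20), min(13 − 4 + 2, 20)) = (7, 11).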
4. Rewards (R(s,a))
For a state-action pair:
+$10 per car rented
–$2 per car moved (action cost: 2 * |a|)
Reward =
+10 * (cars rented at loc1 + loc2) – 2 * |a|
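Written out with the Poisson request distributions (a sketch added for clarity; x′ and y′ denote the post-move car counts and req1, req2 the daily requests at the two locations):
E[R(s, a)] = −2·|a| + 10·( E[min(req1, x′)] + E[min(req2, y′)] ),  where req1 ~ Poisson(3) and req2 ~ Poisson(4).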
MDP Summary
MDP Details:
States: S = {s1, s2}
Actions: A = {a1, a2}
Policy:
o π(s1) = a1
o π(s2) = a2
Transition probabilities:
o P(s1 | s1, a1) = 0.8, P(s2 | s1, a1) = 0.2
o P(s1 | s2, a2) = 0.4, P(s2 | s2, a2) = 0.6
Rewards:
o R(s1, a1) = 5
o R(s2, a2) = 10
Discount factor: γ = 0.9
Goal:
Compute the state values under policy π:
V^π(s1)
V^π(s2)
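Under policy π, the Bellman expectation equations form two simultaneous linear equations (a short derivation added for clarity):
V^π(s1) = 5 + 0.9 · (0.8 · V^π(s1) + 0.2 · V^π(s2))
V^π(s2) = 10 + 0.9 · (0.4 · V^π(s1) + 0.6 · V^π(s2))
Rearranging gives 0.28 V^π(s1) − 0.18 V^π(s2) = 5 and −0.36 V^π(s1) + 0.46 V^π(s2) = 10, whose solution is V^π(s1) = 64.0625 and V^π(s2) = 71.875.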
Final Answer:
V^π(s1) ≈ 64.06
V^π(s2) ≈ 71.88
9. (CO3, BTL3) Driving Home: Each day as you drive home from work, you try to predict how long it will take to get home. When you leave your office,
you note the time, the day of week, the weather, and anything else
that might be relevant. Say on this Friday you are leaving at exactly 6
o’clock, and you estimate that it will take 30 minutes to get home. As
you reach your car it is 6:05, and you notice it is starting to rain. Traffic
is often slower in the rain, so you reestimate that it will take 35
minutes from then, or a total of 40 minutes. Fifteen minutes later you
have completed the highway portion of your journey in good time. As
you exit onto a secondary road you cut your estimate of total travel
time to 35 minutes. Unfortunately, at this point you get stuck behind a
slow truck, and the road is too narrow to pass. You end up having to
follow the truck until you turn onto the side street where you live at
6:40. Three minutes later you are home. The sequence of states,
times, and predictions is thus as follows:
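A reconstruction of the sequence from the narrative above (following the standard Driving Home example of Sutton and Barto; times are in minutes):

State                              Elapsed Time   Predicted Time to Go   Predicted Total Time
leaving office, Friday at 6        0              30                     30
reach car, raining                 5              35                     40
exiting highway                    20             15                     35
secondary road, behind truck       30             10                     40
entering home street               40             3                      43
arrive home                        43             0                      43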
10. (CO4, BTL4) Design an agent that can balance a pole on a cart by applying forces (left or right) to the CartPole reinforcement learning problem using a Deep Q-Network.
Ans Designing an agent to balance a pole on a cart using a Deep Q-
Network (DQN) for the CartPole reinforcement learning problem
involves the following steps:
1. Understanding the Problem:
o The CartPole environment has four input features:
1. Cart Position: The position of the cart on the track.
2. Cart Velocity: The velocity of the cart.
3. Pole Angle: The angle of the pole relative to the vertical.
4. Pole Angular Velocity: The rate at which the pole's angle is changing.
o The agent can take one of two actions: apply a force to the left or
right (often denoted as 0 and 1).
o The reward is given as:
+1 for each time step the pole remains balanced.
A terminal state occurs when the pole falls, which ends the
episode.
2. Deep Q-Network (DQN) Overview: A DQN uses a neural network to approximate the Q-value function Q(s, a), where:
o s is the state (a 4-dimensional vector in CartPole).
o a is the action (left or right).
o The neural network learns to predict the Q-values for each action in each state, and these predictions guide the agent's decisions.
3. Steps for Designing the DQN Agent:
o Define the neural network architecture to approximate the Q-
values.
o Implement experience replay to store past experiences and
sample them for training.
o Implement target network to stabilize learning.
o Train the agent using the Q-learning update rule.
o Evaluate the agent's performance over episodes.
1. Imports:
import numpy as np
import random
import torch
import torch.nn as nn
import torch.optim as optim
import gym
from collections import deque
2. Simplified Q-Network:
A minimal neural network with just the essential layers to
approximate Q-values.
class DQN(nn.Module):
    def __init__(self, state_size, action_size):
        super(DQN, self).__init__()
        self.fc = nn.Sequential(
            nn.Linear(state_size, 64),
            nn.ReLU(),
            nn.Linear(64, action_size)
        )
    def forward(self, x):
        # map a state to one Q-value per action
        return self.fc(x)

3. Replay Buffer:
Stores past transitions and samples random mini-batches for training.

class ReplayBuffer:
    def __init__(self, capacity, batch_size):
        self.buffer = deque(maxlen=capacity)
        self.batch_size = batch_size
    def push(self, transition):
        self.buffer.append(transition)
    def sample(self):
        return random.sample(self.buffer, self.batch_size)
    def size(self):
        return len(self.buffer)
4. DQN Agent:
The agent selects actions and learns from its experiences using Q-
learning.
class DQNAgent:
    def __init__(self, state_size, action_size, gamma=0.99,
                 epsilon=1.0, epsilon_decay=0.995, epsilon_min=0.01,
                 learning_rate=0.001, batch_size=64):
        self.state_size = state_size
        self.action_size = action_size
        self.gamma = gamma
        self.epsilon = epsilon
        self.epsilon_decay = epsilon_decay
        self.epsilon_min = epsilon_min
        self.batch_size = batch_size
        # online and target Q-networks plus replay memory (capacity is an illustrative default)
        self.q_network = DQN(state_size, action_size)
        self.target_network = DQN(state_size, action_size)
        self.optimizer = optim.Adam(self.q_network.parameters(), lr=learning_rate)
        self.memory = ReplayBuffer(capacity=10000, batch_size=batch_size)
        self.update_target_network()
    def update_target_network(self):
        # copy online-network weights into the target network
        self.target_network.load_state_dict(self.q_network.state_dict())
    def act(self, state):
        # epsilon-greedy action selection
        if random.random() < self.epsilon:
            return random.randrange(self.action_size)
        with torch.no_grad():
            return self.q_network(torch.FloatTensor(state)).argmax().item()
    def remember(self, state, action, reward, next_state, done):
        self.memory.push((state, action, reward, next_state, done))
    def train_step(self):
        if self.memory.size() < self.batch_size:
            return
        states, actions, rewards, next_states, dones = zip(*self.memory.sample())
        states = torch.FloatTensor(np.array(states))
        next_states = torch.FloatTensor(np.array(next_states))
        actions = torch.LongTensor(actions)
        rewards = torch.FloatTensor(rewards)
        dones = torch.BoolTensor(dones)
        # Q(s, a) for the actions actually taken
        q_values = self.q_network(states).gather(1, actions.unsqueeze(1)).squeeze(1)
        # bootstrap the target from the frozen target network
        next_q_values = self.target_network(next_states).max(1)[0]
        target_q_values = rewards + (self.gamma * next_q_values * ~dones)
        loss = nn.MSELoss()(q_values, target_q_values.detach())
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        # decay exploration over time
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay
if episode % 10 == 0:
    agent.update_target_network()
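The two lines above belong inside a training loop. A minimal loop sketch is given below, assuming the classes defined earlier and the classic Gym API in which env.reset() returns only the observation and env.step() returns four values (newer Gymnasium releases return extra items, so the unpacking may need adjusting). The episode count and step limit are illustrative.

# Sketch of a training loop for CartPole
env = gym.make("CartPole-v1")
agent = DQNAgent(state_size=4, action_size=2)
for episode in range(500):
    state = env.reset()
    total_reward = 0
    for t in range(500):
        action = agent.act(state)
        next_state, reward, done, info = env.step(action)
        agent.remember(state, action, reward, next_state, done)
        agent.train_step()
        state = next_state
        total_reward += reward
        if done:
            break
    if episode % 10 == 0:
        agent.update_target_network()   # periodically sync the target network
    print(f"Episode {episode}, reward {total_reward}, epsilon {agent.epsilon:.3f}")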
class GridWorld(gym.Env):
    def __init__(self, grid_size=5):
        super(GridWorld, self).__init__()
        self.grid_size = grid_size
        self.robot_pos = (0, 0)               # Starting position (top-left)
        self.object_pos = (4, 4)              # Object at bottom-right
        self.obstacle_pos = [(2, 2), (3, 1)]  # Obstacles
        self.done = False
    def reset(self):
        self.robot_pos = (0, 0)
        self.done = False
        return self.robot_pos_to_state(self.robot_pos)
    def robot_pos_to_state(self, pos):
        # flatten the (row, col) position into a single discrete state index
        return pos[0] * self.grid_size + pos[1]
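The extract stops before the environment's transition logic; a minimal step() sketch for the GridWorld class above is given below. The action encoding and the reward values (+10 for reaching the object, −1 for an obstacle, −0.1 per move) are illustrative assumptions, not taken from the original.

    def step(self, action):
        # assumed action encoding: 0 = up, 1 = down, 2 = left, 3 = right
        moves = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}
        dr, dc = moves[action]
        r = min(max(self.robot_pos[0] + dr, 0), self.grid_size - 1)
        c = min(max(self.robot_pos[1] + dc, 0), self.grid_size - 1)
        self.robot_pos = (r, c)
        # illustrative rewards: reach the object +10, hit an obstacle -1, otherwise a small step cost
        if self.robot_pos == self.object_pos:
            reward, self.done = 10, True
        elif self.robot_pos in self.obstacle_pos:
            reward = -1
        else:
            reward = -0.1
        return self.robot_pos_to_state(self.robot_pos), reward, self.done, {}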
class TaxiEnv(gym.Env):
    def __init__(self, grid_size=5):
        super(TaxiEnv, self).__init__()
        self.grid_size = grid_size
        self.taxi_pos = (0, 0)  # Starting position (top-left)
        self.passenger_pos = self.random_position()
        self.destination_pos = self.random_position()
        self.done = False
    def reset(self):
        self.taxi_pos = (0, 0)
        self.passenger_pos = self.random_position()
        self.destination_pos = self.random_position()
        self.done = False
        return self.state()
    def random_position(self):
        # pick a uniformly random cell on the grid
        return (random.randint(0, self.grid_size - 1), random.randint(0, self.grid_size - 1))
    def state(self):
        # state = taxi position, passenger position, destination position
        return (self.taxi_pos[0], self.taxi_pos[1], self.passenger_pos[0],
                self.passenger_pos[1], self.destination_pos[0], self.destination_pos[1])

        # inside step(), after next_pos has been computed from the chosen action:
        self.taxi_pos = next_pos
13. (CO5, BTL5) Two teams of agents are playing a simplified soccer game in a grid environment. Each agent learns its own policy using REINFORCE. Apply Policy Gradient Methods.
Ans To implement a simplified soccer game in a grid environment using
REINFORCE (a policy gradient method), we'll need to model the
environment and agents such that each agent can learn to optimize its
own policy using reinforcement learning.
In REINFORCE, the agent learns a policy directly by optimizing the
parameters of its policy network using gradient descent. It does so by
updating its policy parameters based on the returns (rewards)
collected during episodes.
Here’s how we can approach this problem:
1. Grid Environment Setup for the Soccer Game:
We will have a 2D grid where two teams of agents (Team 1 and
Team 2) will play.
Each agent has a position on the grid and can take actions such
as moving in the grid to either attack, defend, or pass the ball.
There will be a ball in the environment, and the goal is to score
points by getting the ball into the opposing team's goal area.
Each agent has its own policy, and the REINFORCE algorithm will
update the policy for each agent.
2. State Representation:
The state can be represented by:
o The positions of all agents on the grid (for both teams).
o The position of the ball.
o The direction or momentum of the ball.
Each agent can observe the state (its own position, the ball’s
position, and other agents' positions).
3. Actions:
The actions for each agent might include:
o Move up
o Move down
o Move left
o Move right
o Kick the ball (if the ball is in range)
These actions allow agents to control their movement and interactions
with the ball.
4. Rewards:
A positive reward (+1) when an agent scores a goal by getting
the ball into the opponent's goal area.
A small penalty (-0.1) for each move to encourage fewer steps.
A negative reward (−1) for losing possession or moving into a strategically poor position.
5. Policy Gradient Method (REINFORCE):
We will use REINFORCE, a Monte Carlo method, where each
agent learns its policy by updating the parameters of its policy
network using the returns (i.e., total accumulated reward) from
the episodes.
Key Components:
Policy Network: A neural network that takes the state as input
and outputs a probability distribution over actions (policy).
Returns: The total rewards accumulated by an agent during an
episode.
Update Rule: The policy is updated using the gradient of the
log-probability of actions taken, weighted by the return.
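Concretely, the REINFORCE parameter update takes the standard form (added here for clarity):
θ ← θ + α Σ_t ∇_θ log π_θ(a_t | s_t) · G_t,   where G_t = Σ_{k ≥ t} γ^(k−t) r_k is the return from time step t.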
Step-by-Step Implementation
1. Environment Setup:
import numpy as np
import random
import torch
import torch.nn as nn
import torch.optim as optim
class SoccerEnv:
    def __init__(self, grid_size=5):
        self.grid_size = grid_size
        self.agent1_pos = (0, 0)
        self.agent2_pos = (0, 4)
        self.ball_pos = (2, 2)
        self.goal1 = [(0, 2)]  # Goal for team 1
        self.goal2 = [(4, 2)]  # Goal for team 2
        self.done = False
    def reset(self):
        self.agent1_pos = (0, 0)
        self.agent2_pos = (0, 4)
        self.ball_pos = (2, 2)
        self.done = False
        return self.get_state()
    def get_state(self):
        # state = both agents' positions and the ball position
        return np.array([self.agent1_pos[0], self.agent1_pos[1],
                         self.agent2_pos[0], self.agent2_pos[1],
                         self.ball_pos[0], self.ball_pos[1]])

        # reward computation inside step(), after the agents and the ball have moved:
        reward = 0
        if self.agent1_pos == self.goal1[0] and self.ball_pos == self.goal1[0]:
            reward = 1   # Team 1 scores
            self.done = True
        elif self.agent2_pos == self.goal2[0] and self.ball_pos == self.goal2[0]:
            reward = -1  # Team 2 scores
            self.done = True
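The policy network and the REINFORCE update described earlier are not shown in the code above; a minimal sketch is given below. The network architecture, learning rate defaults, and the helper names (PolicyNetwork, select_action, reinforce_update) are illustrative assumptions.

class PolicyNetwork(nn.Module):
    # maps a state to a probability distribution over the discrete actions
    def __init__(self, state_size, action_size):
        super(PolicyNetwork, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(state_size, 64),
            nn.ReLU(),
            nn.Linear(64, action_size),
            nn.Softmax(dim=-1)
        )
    def forward(self, x):
        return self.net(x)

def select_action(policy, state):
    # sample an action from the policy and keep its log-probability for the update
    probs = policy(torch.FloatTensor(state))
    dist = torch.distributions.Categorical(probs)
    action = dist.sample()
    return action.item(), dist.log_prob(action)

def reinforce_update(optimizer, log_probs, rewards, gamma=0.99):
    # compute discounted returns G_t for every step of the episode
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.FloatTensor(returns)
    # REINFORCE loss: -sum_t log pi(a_t | s_t) * G_t
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()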
class DynamicEnv:
    def __init__(self, grid_size=5):
        self.grid_size = grid_size
        self.robot_pos = (0, 0)
        self.goal_pos = (4, 4)
        self.obstacles = [(2, 2), (1, 3), (3, 1)]  # Example static obstacles
        self.done = False
    def reset(self):
        self.robot_pos = (0, 0)
        self.done = False
        return self.get_state()
    def get_state(self):
        # state = robot position and goal position
        return np.array([self.robot_pos[0], self.robot_pos[1],
                         self.goal_pos[0], self.goal_pos[1]])

        # inside the agent's policy update:
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        # inside the outer training loop:
        state = next_state