
ALLIANCE SCHOOL OF ADVANCED COMPUTING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

6CS1216 – REINFORCEMENT LEARNING
DSA: Tutorial
Class: B.Tech CSE AIML [Semester 6]
Name: K V NATARAJA YADAV
Reg No: 2022BCSE07AED794
Question No. | Questions | CO | BTL
1 Develop an agent that can interact with a Multi-Armed Bandit environment, explore the different arms, and gradually converge to the arm that provides the highest average reward. The agent should learn to make decisions that maximize the cumulative reward over time by effectively balancing exploration and exploitation. [CO1, BTL2]
Ans Below is an implementation of an agent interacting with a Multi-Armed Bandit (MAB) environment using the ε-greedy strategy, which balances exploration and exploitation (UCB1 and Thompson Sampling agents are included for comparison). The agent gradually learns to prefer the arm with the highest average reward.
1. Simulate a bandit environment with a given number of arms.
2. Implement an agent using the ε-greedy strategy.
3. Track the cumulative reward over time.
Program:
import numpy as np
import matplotlib.pyplot as plt

class MultiArmedBandit:
    def __init__(self, k):
        self.k, self.probs = k, np.random.rand(k)   # hidden success probability of each arm
    def pull(self, arm):
        return 1 if np.random.rand() < self.probs[arm] else 0

class EpsilonGreedyAgent:
    def __init__(self, k, eps):
        self.k, self.eps = k, eps
        self.c, self.v = np.zeros(k), np.zeros(k)   # pull counts and value estimates
    def select_arm(self):
        if np.random.rand() < self.eps:
            return np.random.randint(self.k)        # explore
        return np.argmax(self.v)                    # exploit
    def update(self, arm, reward):
        self.c[arm] += 1
        self.v[arm] += (reward - self.v[arm]) / self.c[arm]

class UCB1Agent:
    def __init__(self, k):
        self.k, self.t = k, 0
        self.c, self.v = np.zeros(k), np.zeros(k)
    def select_arm(self):
        for a in range(self.k):
            if self.c[a] == 0:
                return a                            # try every arm once first
        return np.argmax(self.v + np.sqrt(2 * np.log(self.t) / self.c))
    def update(self, arm, reward):
        self.c[arm] += 1
        self.t += 1
        self.v[arm] += (reward - self.v[arm]) / self.c[arm]

class ThompsonSamplingAgent:
    def __init__(self, k):
        self.k = k
        self.s, self.f = np.ones(k), np.ones(k)     # Beta(successes, failures) priors
    def select_arm(self):
        return np.argmax(np.random.beta(self.s, self.f))
    def update(self, arm, reward):
        self.s[arm] += reward
        self.f[arm] += 1 - reward

k, steps, eps = 10, 1000, 0.1
bandit = MultiArmedBandit(k)
agent = EpsilonGreedyAgent(k, eps)   # change to UCB1Agent(k) or ThompsonSamplingAgent(k)

rewards = []
for _ in range(steps):
    a = agent.select_arm()
    r = bandit.pull(a)
    agent.update(a, r)
    rewards.append(r)

plt.plot(np.cumsum(rewards) / (np.arange(steps) + 1))
plt.xlabel('Steps'); plt.ylabel('Average Reward')
plt.title(type(agent).__name__); plt.grid(); plt.show()

2 Consider using reinforcement learning to control the motion of a robot arm "Pick-and-Place Robot" in a repetitive pick-and-place task. If we want to learn movements that are fast and smooth, the learning agent will have to control the motors directly and have low-latency information about the current positions and velocities of the mechanical linkages. The actions in this case might be the voltages applied to each motor at each joint, and the states might be the latest readings of joint angles and velocities. The reward might be +1 for each object successfully picked up and placed. To encourage smooth movements, on each time step a small, negative reward can be given as a function of the moment-to-moment "jerkiness" of the motion. [CO1, BTL2]

A. Devise three example tasks of your own that fit into the MDP framework, identifying for each its states, actions, and rewards. Make the three examples as different from each other as possible. The framework is abstract and flexible and can be applied in many ways. Stretch its limits in some way in at least one of your examples.
B. Is the MDP framework adequate to usefully represent all goal-directed learning tasks?

Ans A. Three MDP Examples


1. Smart Traffic Light
 States: Vehicle queue lengths, light status, time since change.
 Actions: Change or keep light configuration.
 Rewards: –Total waiting time; penalty for too frequent switches.
2. Virtual Personal Trainer
 States: Heart rate, fatigue, current activity, time of day.
 Actions: Suggest exercise, intensity, breaks.
 Rewards: +1 for completion, –1 for skip, –0.1 for strain signs.
3. Adaptive Story Generator
 States: Plot status, emotional tone, engagement level.
 Actions: Add twist, resolve tension, escalate.
 Rewards: +1 for engagement, –1 for drop-off or dislike.

B. Is MDP Always Adequate?


Strengths of MDPs:
 Excellent for modeling sequential decision-making under
uncertainty.
 Can represent complex tasks with states, actions, and rewards.
 The assumption of the Markov property (future depends only on
present state and action) simplifies modeling and computation.
Limitations:

 Partial Observability: Need POMDPs.


 Long-Term Dependencies: Hard with strict Markov
assumption.
 Multi-Agent Scenarios: Needs Dec-POMDP or game theory.
 High-Dimensionality: Requires approximations (e.g., deep RL).

3 Jack's Car Rental: Jack manages two locations for a nationwide car rental company. Each day, some number of customers arrive at each location to rent cars. If Jack has a car available, he rents it out and is credited $10 by the national company. If he is out of cars at that location, then the business is lost. Cars become available for renting the day after they are returned. To help ensure that cars are available where they are needed, Jack can move them between the two locations overnight, at a cost of $2 per car moved. We assume that the number of cars requested and returned at each location are Poisson random variables. Suppose λ is 3 and 4 for rental requests at the first and second locations and 3 and 2 for returns. To simplify the problem slightly, we assume that there can be no more than 20 cars at each location (any additional cars are returned to the nationwide company, and thus disappear from the problem) and a maximum of five cars can be moved from one location to the other in one night. Take the discount rate to be γ = 0.9 and formulate this as a continuing finite MDP, where the time steps are days, the state is the number of cars at each location at the end of the day, and the actions are the net numbers of cars moved between the two locations overnight. [CO1, BTL2]
Ans MDP Formulation of Jack’s Car Rental
1. States (S)
Each state represents the number of cars at both locations at the
end of the day.
Let s = (x, y), where:
 o x ∈ {0, 1, ..., 20} = number of cars at location 1
 o y ∈ {0, 1, ..., 20} = number of cars at location 2
→ Total number of states: 21 × 21 = 441

2. Actions (A)
The action is the net number of cars moved overnight from location 1 to location 2:
 a ∈ {−5, −4, ..., 5}
The constraints are:
 o You can't move more cars from location 1 than it has (so a ≤ x)
 o You can't move more cars from location 2 than it has (so a ≥ −y)
→ Valid actions depend on the current state.

3. Transitions (P(s'|s,a))
This models the probability of moving from one state to another,
based on:
 Car requests and returns at both locations.
 Requests and returns follow Poisson distributions:
o Location 1:
 Rental requests: λ = 3
 Returns: λ = 3
o Location 2:
 Rental requests: λ = 4
 Returns: λ = 2
 Cars are rented up to the number available.
 Extra cars returned beyond 20 are lost.
The next state depends on:
1. Current cars after moving
2. Cars rented based on request (min(request, available))
3. Cars returned
4. Truncating to 20 max cars per location

4. Rewards (R(s,a))
For a state-action pair:
 +$10 per car rented
 –$2 per car moved (action cost: 2 * |a|)
Reward =
+10 * (cars rented at loc1 + loc2) – 2 * |a|

5. Discount Factor (γ)


 Given as 0.9, since this is a continuing problem (infinite
horizon).

MDP Summary
Element      | Description
States       | s = (cars at loc1, cars at loc2) ∈ [0, 20]²
Actions      | a ∈ [−5, ..., 5], constrained by the state
Transitions  | Based on Poisson rentals and returns
Rewards      | +$10 per car rented, −$2 per car moved
Discount γ   | 0.9
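As a rough illustration of how these pieces fit together, the sketch below (my own, not part of the original answer; it uses scipy for the Poisson pmf, and the helper names are hypothetical) computes the valid actions of a state and the expected immediate reward of a state-action pair:

import numpy as np
from scipy.stats import poisson

MAX_CARS, MAX_MOVE = 20, 5
REQ_L1, REQ_L2 = 3, 4            # Poisson means of rental requests at the two locations

def valid_actions(x, y):
    # a > 0 moves cars from location 1 to 2; a < 0 moves cars from 2 to 1
    return [a for a in range(-MAX_MOVE, MAX_MOVE + 1) if -y <= a <= x]

def expected_reward(x, y, a):
    # Cars available after the overnight move (extra cars beyond 20 are returned to the company)
    n1 = min(x - a, MAX_CARS)
    n2 = min(y + a, MAX_CARS)
    def expected_rentals(n, lam):
        # E[min(requests, n)] for requests ~ Poisson(lam)
        return sum(min(req, n) * poisson.pmf(req, lam) for req in range(n + 30))
    return 10 * (expected_rentals(n1, REQ_L1) + expected_rentals(n2, REQ_L2)) - 2 * abs(a)

print(valid_actions(3, 18))         # actions feasible in state (3, 18)
print(expected_reward(10, 10, 2))   # expected immediate reward for moving 2 cars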
4 Consider a 3×3 GridWorld environment where the agent can move up, down, left, or right. Each move gives a reward of −1, and the goal is to reach the bottom-right corner, which gives a reward of +10. If the agent hits a wall, it stays in place and receives a reward of −1. The discount factor γ = 0.9. The initial value of all states is 0. Compute the value of the top-left corner after two iterations of the value iteration algorithm. [CO2, BTL3]
Ans Environment Setup:
 Grid size: 3×3
 Actions: Up, Down, Left, Right
 Rewards:
o −1 for each move (including hitting a wall)
o +10 for reaching the goal at (2,2) (bottom-right)
 Transitions: Deterministic
 γ (discount factor) = 0.9
 Initial value: V(s) = 0 for all s

Step 1: Initialize the value function


We represent the grid as a 3×3 matrix:
V = [[0, 0, 0],
     [0, 0, 0],
     [0, 0, 0]]

Step 2: First Iteration

We compute:
V_{k+1}(s) = max_a [ R(s, a) + γ·V_k(s') ]
For state (0,0), the agent has 2 possible moves that don't hit a wall:
 Right → (0,1)
 Down → (1,0)
 Up/Left hit walls → stay in (0,0)
All V(s') = 0 at this point, so every action gives −1 + 0.9·0 = −1.
So,
V1(0,0) = max(−1, −1, −1, −1) = −1
Step 3: Second Iteration
Now, let's compute (0,0) again using the updated V1 values.
We need V1(0,1), V1(1,0), and V1(0,0).
From the first iteration (states adjacent to the goal can reach it for +10, and the terminal goal state keeps value 0):
V1 = [[-1, -1, -1],
      [-1, -1, 10],
      [-1, 10,  0]]
Now for each action from (0,0):
 Right → (0,1): −1 + 0.9×(−1) = −1.9
 Down → (1,0): −1 + 0.9×(−1) = −1.9
 Left/Up → hit wall, stay in (0,0): −1 + 0.9×(−1) = −1.9
Take the max:
V2(0,0) = max(−1.9, −1.9, −1.9, −1.9) = −1.9
Final Answer:
After two iterations, the value of the top-left corner is −1.9.
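A short sketch (my own, not part of the assignment) that runs two sweeps of value iteration on this 3×3 grid and prints V(0,0), under the assumption that entering the goal yields +10 and the goal is a terminal state with value 0:

import numpy as np

N, GAMMA, GOAL = 3, 0.9, (2, 2)
moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]      # up, down, left, right
V = np.zeros((N, N))

for it in range(1, 3):                           # two sweeps of value iteration
    new_V = np.zeros_like(V)
    for i in range(N):
        for j in range(N):
            if (i, j) == GOAL:
                continue                         # terminal state keeps value 0
            values = []
            for di, dj in moves:
                ni, nj = i + di, j + dj
                if not (0 <= ni < N and 0 <= nj < N):
                    ni, nj = i, j                # hitting a wall: stay in place
                r = 10 if (ni, nj) == GOAL else -1
                values.append(r + GAMMA * V[ni, nj])
            new_V[i, j] = max(values)
    V = new_V
    print(f"after sweep {it}: V(0,0) = {V[0, 0]:.2f}")   # -1.00, then -1.90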

5 Given the following Markov Decision Process (MDP): [CO2, BTL3]
 States: S = {s1, s2}
 Actions: A = {a1, a2}
 Transition probabilities:
  o P(s1|s1,a1) = 0.8, P(s2|s1,a1) = 0.2
  o P(s1|s2,a2) = 0.4, P(s2|s2,a2) = 0.6
 Rewards:
  o R(s1,a1) = 5, R(s2,a2) = 10
 Discount factor γ = 0.9

Assume a deterministic policy: π(s1) = a1, π(s2) = a2. Compute the state values V(s1) and V(s2) under the given policy.
Ans We are given an MDP with:

MDP Details:
 States: S = {s1, s2}
 Actions: A = {a1, a2}
 Policy:
  o π(s1) = a1
  o π(s2) = a2
 Transitions:
  o P(s1|s1,a1) = 0.8, P(s2|s1,a1) = 0.2
  o P(s1|s2,a2) = 0.4, P(s2|s2,a2) = 0.6
 Rewards:
  o R(s1,a1) = 5
  o R(s2,a2) = 10
 Discount factor: γ = 0.9

Goal:
Compute the state values under policy π:
 V^π(s1)
 V^π(s2)

Policy Evaluation Equation:

V^π(s) = R(s, π(s)) + γ Σ_{s'} P(s'|s, π(s)) · V^π(s')

Step 1: Write equations using the policy

For s1:
V(s1) = R(s1,a1) + γ [ P(s1|s1,a1) V(s1) + P(s2|s1,a1) V(s2) ]
V(s1) = 5 + 0.9 [ 0.8 V(s1) + 0.2 V(s2) ]          (1)
For s2:
V(s2) = R(s2,a2) + γ [ P(s1|s2,a2) V(s1) + P(s2|s2,a2) V(s2) ]
V(s2) = 10 + 0.9 [ 0.4 V(s1) + 0.6 V(s2) ]         (2)

Step 2: Rearrange the equations

From (1):
V(s1) = 5 + 0.72 V(s1) + 0.18 V(s2)
V(s1) − 0.72 V(s1) = 5 + 0.18 V(s2)
0.28 V(s1) = 5 + 0.18 V(s2)                        (3)
From (2):
V(s2) = 10 + 0.36 V(s1) + 0.54 V(s2)
V(s2) − 0.54 V(s2) = 10 + 0.36 V(s1)
0.46 V(s2) = 10 + 0.36 V(s1)                       (4)

Step 3: Solve Equations (3) and (4)

From (3):
V(s1) = (5 + 0.18 V(s2)) / 0.28                    (5)
Substitute (5) into (4):
0.46 V(s2) = 10 + 0.36 · (5 + 0.18 V(s2)) / 0.28
Compute:
 0.36 · 5 = 1.8
 0.36 · 0.18 = 0.0648
So:
0.46 V(s2) = 10 + (1.8 + 0.0648 V(s2)) / 0.28
0.46 V(s2) = 10 + 6.4286 + 0.2314 V(s2)
0.46 V(s2) − 0.2314 V(s2) = 16.4286
0.2286 V(s2) = 16.4286
V(s2) ≈ 71.88 (exactly 71.875)
Now plug back into (5):
V(s1) = (5 + 0.18 · 71.875) / 0.28 = 17.9375 / 0.28 ≈ 64.06

Final Answer:
 V^π(s1) ≈ 64.06
 V^π(s2) ≈ 71.88
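A quick numerical check (my own sketch): solve the linear policy-evaluation system (I − γ P_π) v = r_π with NumPy.

import numpy as np

gamma = 0.9
P_pi = np.array([[0.8, 0.2],    # transitions under π from s1 (action a1)
                 [0.4, 0.6]])   # transitions under π from s2 (action a2)
r_pi = np.array([5.0, 10.0])    # R(s1,a1), R(s2,a2)

# Solve (I - gamma * P_pi) v = r_pi
v = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
print(v)   # approximately [64.06, 71.88]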

6 Consider the 4 × 4 gridworld shown below. [CO2, BTL3]
[Figure: the standard 4×4 gridworld from Sutton & Barto, with states numbered 0–15 and shaded terminal states in the two corners]

If π is the equiprobable random policy, what is qπ(11, down)? What is qπ(7, down)?
Ans Grid Info Summary:
 Grid size: 4×4, 16 states (numbered 0 to 15)
 Gray squares = terminal states (no reward beyond entering)
 Reward Rt = −1 for every transition
 Actions: Up, Down, Left, Right
 π is the equiprobable random policy: equal probability (¼) for each action in all non-terminal states
 We are asked for:
  o qπ(11, down)
  o qπ(7, down)
We assume a discount factor γ = 1, as in the standard Sutton & Barto GridWorld (unless otherwise stated).

How to compute qπ(s, a)?

The action-value function:
qπ(s, a) = Σ_{s'} P(s'|s, a) [ R(s, a, s') + γ Vπ(s') ]
Where:
 s = current state
 a = action taken
 s' = next state
 R(s, a, s') = −1 for all transitions
 Vπ(s') is the value of the next state under policy π
We need:
 The transition result for the given (state, action)
 The value of the next state, Vπ(s')
We use the precomputed values from Sutton & Barto's GridWorld example, or compute them directly if needed (see the sketch after this answer).

First: qπ(11, down)

From state 11 (3rd row, last column), moving down takes you to state 15, which is a terminal state.
So:
qπ(11, down) = R + γ Vπ(15) = −1 + 1 · 0 = −1

Second: qπ(7, down)

From state 7 (2nd row, last column), moving down takes you to state 11.
Now we need Vπ(11). Under the equiprobable random policy in Sutton & Barto's standard GridWorld, V(11) = −14.
So:
qπ(7, down) = R + γ V(11) = −1 + 1 · (−14) = −15
Final Answers:
 qπ(11, down) = −1
 qπ(7, down) = −15
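A small sketch (my own) that evaluates the equiprobable random policy on the 4×4 gridworld by iterative policy evaluation and then reads off the two action values; it should reproduce V(11) ≈ −14 and the answers above.

import numpy as np

N = 4
TERMINALS = {0, 15}                              # the two shaded corner states
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]     # up, down, left, right

def next_state(s, a):
    r, c = divmod(s, N)
    nr, nc = r + a[0], c + a[1]
    if not (0 <= nr < N and 0 <= nc < N):        # off-grid moves leave the state unchanged
        return s
    return nr * N + nc

V = np.zeros(16)
for _ in range(1000):                            # iterative policy evaluation, gamma = 1
    new_V = np.zeros(16)
    for s in range(16):
        if s in TERMINALS:
            continue
        new_V[s] = sum(0.25 * (-1 + V[next_state(s, a)]) for a in ACTIONS)
    delta = np.max(np.abs(new_V - V))
    V = new_V
    if delta < 1e-6:
        break

def q(s, a):
    return -1 + V[next_state(s, a)]

print(round(V[11], 1))                 # about -14.0
print(q(11, (1, 0)), q(7, (1, 0)))     # q(11, down) = -1, q(7, down) = -15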

7 Consider a 4×12 GridWorld where: [CO3, BTL3]
 The start state is at the bottom-left corner (S).
 The goal state is at the bottom-right corner (G).
 The agent receives a reward of −1 for every step.
 Falling into the "cliff" (cells between S and G) results in a reward of −100 and the agent is sent back to S.
 Discount factor γ = 0.9
 Learning rate α = 0.5
Using the following episode, update the Q-values for SARSA. Assume ε = 0.1 for an ε-greedy policy.
Ans To update the Q-values using SARSA for the given 4×12 GridWorld, we use the SARSA update rule:
Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t) ]
Where:
 s_t is the current state.
 a_t is the current action taken.
 r_{t+1} is the reward obtained from the current state and action.
 s_{t+1} is the next state.
 a_{t+1} is the next action (chosen by the ε-greedy policy).
 α is the learning rate.
 γ is the discount factor.
Example Episode
Since the episode is not listed explicitly, let's assume it is given as the following list of state–action–reward tuples:
[(S, a1, −1), (s2, a2, −1), (s3, a3, −1), (G, a4, 0)]
This episode means:
1. Start at state S, take action a1, receive reward −1.
2. Move to state s2, take action a2, receive reward −1.
3. Move to state s3, take action a3, receive reward −1.
4. Reach the goal G, take action a4, and receive reward 0.
In SARSA the next action a_{t+1} is chosen by the current ε-greedy policy before the update, so the Q-values are updated on-line, in the order the transitions occur.
Steps to Update Q-values
We step through the episode in order, updating each state–action pair as soon as the next state and next action are known:
1. State S, Action a1:
 o Current Q-value: Q(S, a1)
 o Reward: −1
 o Next state: s2
 o Next action: a2 (chosen with the ε-greedy policy)
 o Update:
Q(S, a1) ← Q(S, a1) + 0.5 [ −1 + 0.9 · Q(s2, a2) − Q(S, a1) ]
2. State s2, Action a2:
 o Current Q-value: Q(s2, a2)
 o Reward: −1
 o Next state: s3
 o Next action: a3 (chosen with the ε-greedy policy)
 o Update:
Q(s2, a2) ← Q(s2, a2) + 0.5 [ −1 + 0.9 · Q(s3, a3) − Q(s2, a2) ]
3. State s3, Action a3:
 o Current Q-value: Q(s3, a3)
 o Reward: −1
 o Next state: G
 o Next action: a4 (chosen with the ε-greedy policy)
 o Update:
Q(s3, a3) ← Q(s3, a3) + 0.5 [ −1 + 0.9 · Q(G, a4) − Q(s3, a3) ]
4. State G, Action a4:
 o G is the terminal goal state, so the target contains no bootstrap term.
 o Reward: 0
 o Update:
Q(G, a4) ← Q(G, a4) + 0.5 [ 0 − Q(G, a4) ]
Final Q-value Updates
After applying the SARSA update rule to each of these state–action pairs, the Q-values have been updated for this episode. Assuming all Q-values are initialized to 0, the first three pairs each become 0 + 0.5·(−1) = −0.5 and the terminal pair stays 0.
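A minimal sketch (my own, assuming the example episode above and zero-initialized Q-values) that applies these four SARSA updates numerically:

from collections import defaultdict

alpha, gamma = 0.5, 0.9
Q = defaultdict(float)   # Q[(state, action)], initialized to 0

# (state, action, reward, next_state, next_action); None marks the terminal transition
episode = [('S', 'a1', -1, 's2', 'a2'),
           ('s2', 'a2', -1, 's3', 'a3'),
           ('s3', 'a3', -1, 'G', 'a4'),
           ('G', 'a4', 0, None, None)]

for s, a, r, s_next, a_next in episode:
    target = r if s_next is None else r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])

print(dict(Q))   # {('S','a1'): -0.5, ('s2','a2'): -0.5, ('s3','a3'): -0.5, ('G','a4'): 0.0}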

8 An agent navigates a 3×3 grid. Each cell has a deterministic transition. [CO3, BTL3]
 Rewards:
  o +10 for reaching the goal state (G) at (2,2).
  o −1 for every move.
 Actions: up, down, left, right.
 Discount factor γ = 0.9
 Learning rate α = 0.5
Using the following episode, update the Q-values for Q-learning. Assume ε = 0.1 for an ε-greedy policy.
Ans Q-learning for the 3×3 Gridworld:
For the 3×3 Gridworld, we apply the Q-learning update rule:
Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ max_{a'} Q(s_{t+1}, a') − Q(s_t, a_t) ]
Where:
 s_t is the current state.
 a_t is the action taken.
 r_{t+1} is the reward received after taking action a_t.
 s_{t+1} is the next state.
 a' ranges over the possible next actions; the maximum is taken over them regardless of which action the behavior policy actually selects next (this is what makes Q-learning off-policy).
 α is the learning rate.
 γ is the discount factor.
Given that the goal is to reach (2,2), and assuming an episode generated by the ε-greedy policy that reaches the goal, you update the Q-value of each state–action pair along the episode with this rule: the target is the immediate reward plus γ times the best Q-value of the next state, or just the reward (+10) on the transition into the goal. A worked numeric sketch is given below.
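A minimal sketch (my own; the episode, its states, and the zero initialization are assumptions, since the assignment does not list the episode) applying the Q-learning update along one trajectory from (0,0) to the goal (2,2):

from collections import defaultdict

alpha, gamma = 0.5, 0.9
Q = defaultdict(float)   # Q[(state, action)], initialized to 0
ACTIONS = ['up', 'down', 'left', 'right']

# Assumed episode: (state, action, reward, next_state); next_state None => goal reached
episode = [((0, 0), 'right', -1, (0, 1)),
           ((0, 1), 'down', -1, (1, 1)),
           ((1, 1), 'down', -1, (2, 1)),
           ((2, 1), 'right', 10, None)]   # entering the goal (2,2) gives +10

for s, a, r, s_next in episode:
    if s_next is None:
        target = r                                           # no bootstrap from the terminal state
    else:
        target = r + gamma * max(Q[(s_next, b)] for b in ACTIONS)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

for (s, a), v in Q.items():
    print(s, a, v)   # the first three pairs become -0.5, the last becomes +5.0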

9 Driving Home: Each day as you drive home from work, you try to predict how long it will take to get home. When you leave your office, you note the time, the day of week, the weather, and anything else that might be relevant. Say on this Friday you are leaving at exactly 6 o'clock, and you estimate that it will take 30 minutes to get home. As you reach your car it is 6:05, and you notice it is starting to rain. Traffic is often slower in the rain, so you reestimate that it will take 35 minutes from then, or a total of 40 minutes. Fifteen minutes later you have completed the highway portion of your journey in good time. As you exit onto a secondary road you cut your estimate of total travel time to 35 minutes. Unfortunately, at this point you get stuck behind a slow truck, and the road is too narrow to pass. You end up having to follow the truck until you turn onto the side street where you live at 6:40. Three minutes later you are home. The sequence of states, times, and predictions is thus as follows: [CO3, BTL3]

Use Monte Carlo method to plot the predicted total time


Ans Monte Carlo Method to Predict Travel Time:
To apply the Monte Carlo method for predicting total travel time in the given driving-home scenario:
1. States: the states are the situations encountered along the way, described by features such as:
 o Time of day (e.g., 6:00, 6:05, etc.).
 o Weather (e.g., rain or not).
 o Traffic conditions (e.g., slow truck, clear highway).
2. Predictions: at each state you record the current estimate of the total travel time (30, 40, 35, ... minutes). The actual outcome of this drive is 43 minutes (home at 6:43).
In the Monte Carlo method, the return used as the update target for every state visited is this actual outcome, so each state's prediction is moved toward 43 only once the drive is over:
 V(s) ← V(s) + α [ 43 − V(s) ]
Over many such drives (episodes), averaging the actual outcomes observed from each state gives the expected total travel time from that state.
To plot the predicted total time, plot the prediction made at each point of the journey against elapsed time and, from each point, show the change toward the actual outcome of 43 minutes; a plotting sketch follows.
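A minimal plotting sketch (my own; the elapsed times and predictions are read off the narrative above, so treat the list as an assumed reconstruction of the missing table):

import matplotlib.pyplot as plt

# (situation, minutes elapsed since leaving the office, predicted TOTAL travel time at that point)
states = [("leaving office", 0, 30),
          ("reach car, raining", 5, 40),
          ("exit highway", 20, 35),
          ("arrive home", 43, 43)]

labels = [s for s, _, _ in states]
elapsed = [t for _, t, _ in states]
predicted_total = [p for _, _, p in states]
actual_total = 43   # the drive actually took 43 minutes; this is the Monte Carlo target

plt.plot(elapsed, predicted_total, 'o-', label='predicted total time')
plt.axhline(actual_total, linestyle='--', label='actual outcome (MC target)')
for x, y, name in zip(elapsed, predicted_total, labels):
    plt.annotate(name, (x, y), fontsize=8)
plt.xlabel('Minutes elapsed on the drive')
plt.ylabel('Predicted total travel time (minutes)')
plt.legend(); plt.grid(); plt.show()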

10 Design an agent that can balance a pole on a cart by applying forces (left or right) in the CartPole reinforcement learning problem, using a Deep Q-Network. [CO4, BTL4]
Ans Designing an agent to balance a pole on a cart using a Deep Q-
Network (DQN) for the CartPole reinforcement learning problem
involves the following steps:
1. Understanding the Problem:
o The CartPole environment has four input features:
1. Cart Position: The position of the cart on the track.
2. Cart Velocity: The velocity of the cart.
3. Pole Angle: The angle of the pole relative to the vertical.
4. Pole Angular Velocity: The rate at which the pole's angle is changing.
o The agent can take one of two actions: apply a force to the left or
right (often denoted as 0 and 1).
o The reward is given as:
 +1 for each time step the pole remains balanced.
 A terminal state occurs when the pole falls, which ends the
episode.
2. Deep Q-Network (DQN) Overview: A DQN uses a neural network to approximate the Q-value function Q(s, a), where:
 o s is the state (a 4-dimensional vector in CartPole).
 o a is the action (left or right).
 o The neural network learns to predict the Q-values for each action in each state, and these predictions guide the agent's decisions.
3. Steps for Designing the DQN Agent:
o Define the neural network architecture to approximate the Q-
values.
o Implement experience replay to store past experiences and
sample them for training.
o Implement target network to stabilize learning.
o Train the agent using the Q-learning update rule.
o Evaluate the agent's performance over episodes.
1. Imports:
import numpy as np
import random
import torch
import torch.nn as nn
import torch.optim as optim
import gym
from collections import deque
2. Simplified Q-Network:
A minimal neural network with just the essential layers to approximate Q-values.
class DQN(nn.Module):
    def __init__(self, state_size, action_size):
        super(DQN, self).__init__()
        self.fc = nn.Sequential(
            nn.Linear(state_size, 64),
            nn.ReLU(),
            nn.Linear(64, action_size)
        )

    def forward(self, state):
        return self.fc(state)
3. Experience Replay Buffer:
The experience replay buffer stores the agent's experiences.
class ReplayBuffer:
    def __init__(self, size, batch_size):
        self.buffer = deque(maxlen=size)
        self.batch_size = batch_size

    def push(self, experience):
        self.buffer.append(experience)

    def sample(self):
        return random.sample(self.buffer, self.batch_size)

    def size(self):
        return len(self.buffer)
4. DQN Agent:
The agent selects actions and learns from its experiences using Q-learning.
class DQNAgent:
    def __init__(self, state_size, action_size, gamma=0.99,
                 epsilon=1.0, epsilon_decay=0.995, epsilon_min=0.01,
                 learning_rate=0.001, batch_size=64):
        self.state_size = state_size
        self.action_size = action_size
        self.gamma = gamma
        self.epsilon = epsilon
        self.epsilon_decay = epsilon_decay
        self.epsilon_min = epsilon_min
        self.batch_size = batch_size

        self.q_network = DQN(state_size, action_size)
        self.target_network = DQN(state_size, action_size)
        self.optimizer = optim.Adam(self.q_network.parameters(), lr=learning_rate)
        self.replay_buffer = ReplayBuffer(size=10000, batch_size=batch_size)
        self.update_target_network()

    def update_target_network(self):
        self.target_network.load_state_dict(self.q_network.state_dict())

    def select_action(self, state):
        if random.random() <= self.epsilon:
            return random.choice(range(self.action_size))  # Explore
        else:
            state = torch.FloatTensor(state).unsqueeze(0)
            with torch.no_grad():
                q_values = self.q_network(state)
            return torch.argmax(q_values).item()           # Exploit

    def train(self):
        if self.replay_buffer.size() < self.batch_size:
            return

        # Sample a batch of experiences
        experiences = self.replay_buffer.sample()
        states, actions, rewards, next_states, dones = zip(*experiences)

        states = torch.FloatTensor(np.array(states))
        next_states = torch.FloatTensor(np.array(next_states))
        actions = torch.LongTensor(actions)
        rewards = torch.FloatTensor(rewards)
        dones = torch.BoolTensor(dones)

        q_values = self.q_network(states).gather(1, actions.unsqueeze(1)).squeeze(1)
        # Targets come from the (frozen) target network and are not backpropagated through
        next_q_values = self.target_network(next_states).max(1)[0].detach()
        target_q_values = rewards + (self.gamma * next_q_values * ~dones)

        loss = nn.MSELoss()(q_values, target_q_values)

        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay
5. Training Loop:
def train_dqn(agent, env, episodes=1000):
    # Uses the classic Gym API, where reset() returns the state and step() returns a 4-tuple
    for episode in range(episodes):
        state = env.reset()
        done = False
        total_reward = 0

        while not done:
            action = agent.select_action(state)
            next_state, reward, done, _ = env.step(action)

            agent.replay_buffer.push((state, action, reward, next_state, done))
            agent.train()
            state = next_state
            total_reward += reward

        if episode % 10 == 0:
            agent.update_target_network()

        print(f"Episode {episode}/{episodes}, Reward: {total_reward}, "
              f"Epsilon: {agent.epsilon:.2f}")
6. Running the Training:
# Create environment and agent
env = gym.make('CartPole-v1')
state_size = env.observation_space.shape[0]
action_size = env.action_space.n

agent = DQNAgent(state_size, action_size)

# Train the agent
train_dqn(agent, env, episodes=1000)

# After training, evaluate the agent's performance
state = env.reset()
done = False
total_reward = 0

while not done:
    action = agent.select_action(state)
    state, reward, done, _ = env.step(action)
    total_reward += reward

print(f"Total Reward after training: {total_reward}")


Explanation:
 DQN is used to approximate the Q-values, using a small 2-layer
neural network.
 Replay Buffer is used for experience replay, ensuring that the
agent doesn’t overfit to recent experiences.
 The epsilon-greedy policy balances exploration and
exploitation.
 The target network stabilizes learning by updating at intervals
instead of at every step.

11 A mobile robot is navigating a 5×5 gridworld to collect objects scattered across the environment while avoiding obstacles. Develop a Deep Q-Network (DQN) to learn an optimal policy for navigating the grid. [CO4, BTL4]
Ans To develop a Deep Q-Network (DQN) to enable a mobile robot to
navigate a 5×5 gridworld, collect objects, and avoid obstacles, we
need to define the following:
1. GridWorld Setup:
o A 5×5 grid with obstacles and objects scattered.
o The robot can move in four possible directions: up, down, left,
and right.
o The robot receives rewards for collecting objects and is penalized for hitting obstacles or for each step taken.
2. DQN Setup:
o States: The robot's position on the grid, the objects collected,
or the proximity of obstacles.
o Actions: Move in one of four directions.
o Rewards: Collecting objects gives positive rewards; hitting
obstacles gives negative rewards, and each move may incur a
small penalty.
3. Key Components:
o Neural Network: To approximate Q-values for each action.
o Experience Replay: To store and sample past experiences for
training.
o Target Network: To stabilize learning by preventing quick
updates to the Q-function.
o Training Loop: The robot learns by interacting with the
environment and updating its policy.
GridWorld Environment:
import numpy as np
import random
import gym
from gym import spaces
import torch
import torch.nn as nn
import torch.optim as optim
from collections import deque

class GridWorld(gym.Env):
    def __init__(self, grid_size=5):
        super(GridWorld, self).__init__()
        self.grid_size = grid_size
        self.robot_pos = (0, 0)                 # Starting position (top-left)
        self.object_pos = (4, 4)                # Object at bottom-right
        self.obstacle_pos = [(2, 2), (3, 1)]    # Obstacles
        self.done = False

        # Define the action and observation space
        self.action_space = spaces.Discrete(4)  # 4 actions: up, down, left, right
        self.observation_space = spaces.Discrete(grid_size * grid_size)

    def reset(self):
        self.robot_pos = (0, 0)
        self.done = False
        return self.robot_pos_to_state(self.robot_pos)

    def robot_pos_to_state(self, pos):
        return pos[0] * self.grid_size + pos[1]

    def state_to_robot_pos(self, state):
        return (state // self.grid_size, state % self.grid_size)

    def step(self, action):
        if self.done:
            return self.robot_pos_to_state(self.robot_pos), 0, self.done, {}

        # Determine the new position based on the action
        if action == 0:    # Up
            next_pos = (self.robot_pos[0] - 1, self.robot_pos[1]) if self.robot_pos[0] > 0 else self.robot_pos
        elif action == 1:  # Down
            next_pos = (self.robot_pos[0] + 1, self.robot_pos[1]) if self.robot_pos[0] < self.grid_size - 1 else self.robot_pos
        elif action == 2:  # Left
            next_pos = (self.robot_pos[0], self.robot_pos[1] - 1) if self.robot_pos[1] > 0 else self.robot_pos
        elif action == 3:  # Right
            next_pos = (self.robot_pos[0], self.robot_pos[1] + 1) if self.robot_pos[1] < self.grid_size - 1 else self.robot_pos

        # Check for obstacles
        if next_pos in self.obstacle_pos:
            reward = -10          # Penalize for hitting an obstacle
            self.done = False     # Continue the episode
        else:
            self.robot_pos = next_pos
            if self.robot_pos == self.object_pos:
                reward = 10       # Reward for collecting the object
                self.done = True  # Episode ends when the object is collected
            else:
                reward = -1       # Small negative reward for each step

        return self.robot_pos_to_state(self.robot_pos), reward, self.done, {}
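The environment above exposes a discrete state index (0–24), while the DQN from Question 10 expects a vector input, so one option is to one-hot encode the state. A rough sketch (my own, assuming the DQNAgent class from Question 10 is in scope) of how the two could be combined:

import numpy as np

def one_hot(state_index, n_states=25):
    # Turn the discrete grid state into a vector the Q-network can consume
    v = np.zeros(n_states, dtype=np.float32)
    v[state_index] = 1.0
    return v

env = GridWorld(grid_size=5)
agent = DQNAgent(state_size=25, action_size=4)   # reusing the agent defined in Question 10

for episode in range(500):
    state = one_hot(env.reset())
    done, total_reward, steps = False, 0, 0
    while not done and steps < 200:              # cap episode length while the policy is still random
        action = agent.select_action(state)
        next_state_idx, reward, done, _ = env.step(action)
        next_state = one_hot(next_state_idx)
        agent.replay_buffer.push((state, action, reward, next_state, done))
        agent.train()
        state = next_state
        total_reward += reward
        steps += 1
    if episode % 10 == 0:
        agent.update_target_network()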
12 A self-driving taxi operates in a 5×5 grid city. It needs to pick up passengers from random locations and drop them off at their destinations. Show how the taxi uses a DQN agent to learn the optimal policy. [CO4, BTL4]
Ans To develop a self-driving taxi operating in a 5×5 grid city, we need
to design an environment where the taxi can learn to pick up
passengers from random locations and drop them off at their
designated destinations. The environment will have several states
(taxi positions, passenger positions, and destination locations), actions
(move in the grid), and rewards (positive for successful drop-offs and
negative for unnecessary steps or incorrect actions).
Steps for Designing the Self-Driving Taxi with DQN:
1. Environment Setup:
o A 5×5 grid representing the city.
o Locations for passengers, which can randomly appear at
various positions on the grid.
o Destinations where the passengers need to be dropped off.
o The taxi can take 4 possible actions: move up, move down,
move left, or move right.
o Rewards:
 +20 for successfully dropping off a passenger at the
destination.
 -1 for each move to encourage fewer steps.
 -10 for dropping off the passenger at the wrong location.
2. State Representation: The state will be represented by the
taxi position, the passenger's position, and the
destination's position. Each of these values can be encoded
as integers, and the state will be a combination of them.
3. Action Representation: The taxi has 4 possible actions: up,
down, left, and right.
4. Q-Network: We'll define a neural network that takes the state
as input (taxi position, passenger position, and destination) and
outputs Q-values for each action.
Step-by-Step Implementation:
1. GridWorld for Taxi Environment:
import numpy as np
import random
import gym
from gym import spaces
import torch
import torch.nn as nn
import torch.optim as optim
from collections import deque

class TaxiEnv(gym.Env):
    def __init__(self, grid_size=5):
        super(TaxiEnv, self).__init__()
        self.grid_size = grid_size
        self.taxi_pos = (0, 0)                  # Starting position (top-left)
        self.passenger_pos = self.random_position()
        self.destination_pos = self.random_position()
        self.passenger_picked_up = False
        self.done = False

        # Define the action space: 4 actions (up, down, left, right)
        self.action_space = spaces.Discrete(4)

        # Observation: taxi position, passenger position, destination position
        self.observation_space = spaces.Discrete(grid_size * grid_size * grid_size * grid_size)

    def reset(self):
        self.taxi_pos = (0, 0)
        self.passenger_pos = self.random_position()
        self.destination_pos = self.random_position()
        self.passenger_picked_up = False
        self.done = False
        return self.state()

    def random_position(self):
        return (random.randint(0, self.grid_size - 1), random.randint(0, self.grid_size - 1))

    def state(self):
        return (self.taxi_pos[0], self.taxi_pos[1],
                self.passenger_pos[0], self.passenger_pos[1],
                self.destination_pos[0], self.destination_pos[1])

    def step(self, action):
        if self.done:
            return self.state(), 0, self.done, {}

        # Move taxi based on the action
        if action == 0:    # Move up
            next_pos = (max(0, self.taxi_pos[0] - 1), self.taxi_pos[1])
        elif action == 1:  # Move down
            next_pos = (min(self.grid_size - 1, self.taxi_pos[0] + 1), self.taxi_pos[1])
        elif action == 2:  # Move left
            next_pos = (self.taxi_pos[0], max(0, self.taxi_pos[1] - 1))
        elif action == 3:  # Move right
            next_pos = (self.taxi_pos[0], min(self.grid_size - 1, self.taxi_pos[1] + 1))

        self.taxi_pos = next_pos

        # If the taxi reaches the passenger, it picks the passenger up
        # (kept as persistent environment state so the pickup carries over to later steps)
        if self.taxi_pos == self.passenger_pos:
            self.passenger_picked_up = True

        # If the taxi reaches the destination with the passenger on board, it drops the passenger off
        if self.taxi_pos == self.destination_pos and self.passenger_picked_up:
            reward = 20       # Successful drop-off
            self.done = True
        else:
            reward = -1       # Penalize for each move
            if self.taxi_pos == self.destination_pos and not self.passenger_picked_up:
                reward = -10  # Arriving at the destination without a passenger

        return self.state(), reward, self.done, {}
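As with Question 11, the tuple state has to be turned into a vector before it can be fed to a Q-network. A minimal hookup sketch (my own, assuming the DQNAgent from Question 10; the pickup flag is appended so the agent can tell the pickup and drop-off phases apart):

import numpy as np

def encode_state(env):
    # Normalize positions to [0, 1] and append the pickup flag
    s = np.array(env.state(), dtype=np.float32) / (env.grid_size - 1)
    return np.append(s, float(env.passenger_picked_up))

env = TaxiEnv(grid_size=5)
agent = DQNAgent(state_size=7, action_size=4)   # 6 normalized coordinates + pickup flag

for episode in range(500):
    env.reset()
    state = encode_state(env)
    done, steps = False, 0
    while not done and steps < 200:
        action = agent.select_action(state)
        _, reward, done, _ = env.step(action)
        next_state = encode_state(env)
        agent.replay_buffer.push((state, action, reward, next_state, done))
        agent.train()
        state, steps = next_state, steps + 1
    if episode % 10 == 0:
        agent.update_target_network()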

13 Two teams of agents are playing a simplified soccer game in a grid environment. Each agent learns its own policy using REINFORCE. Apply Policy Gradient Methods. [CO5, BTL5]
Ans To implement a simplified soccer game in a grid environment using
REINFORCE (a policy gradient method), we'll need to model the
environment and agents such that each agent can learn to optimize its
own policy using reinforcement learning.
In REINFORCE, the agent learns a policy directly by optimizing the
parameters of its policy network using gradient descent. It does so by
updating its policy parameters based on the returns (rewards)
collected during episodes.
Here’s how we can approach this problem:
1. Grid Environment Setup for the Soccer Game:
 We will have a 2D grid where two teams of agents (Team 1 and
Team 2) will play.
 Each agent has a position on the grid and can take actions such
as moving in the grid to either attack, defend, or pass the ball.
 There will be a ball in the environment, and the goal is to score
points by getting the ball into the opposing team's goal area.
 Each agent has its own policy, and the REINFORCE algorithm will
update the policy for each agent.
2. State Representation:
 The state can be represented by:
o The positions of all agents on the grid (for both teams).
o The position of the ball.
o The direction or momentum of the ball.
 Each agent can observe the state (its own position, the ball’s
position, and other agents' positions).
3. Actions:
 The actions for each agent might include:
o Move up
o Move down
o Move left
o Move right
o Kick the ball (if the ball is in range)
These actions allow agents to control their movement and interactions
with the ball.
4. Rewards:
 A positive reward (+1) when an agent scores a goal by getting
the ball into the opponent's goal area.
 A small penalty (-0.1) for each move to encourage fewer steps.
 Negative reward (-1) for losing possession or moving into an
unstrategic position.
5. Policy Gradient Method (REINFORCE):
 We will use REINFORCE, a Monte Carlo method, where each
agent learns its policy by updating the parameters of its policy
network using the returns (i.e., total accumulated reward) from
the episodes.
Key Components:
 Policy Network: A neural network that takes the state as input
and outputs a probability distribution over actions (policy).
 Returns: The total rewards accumulated by an agent during an
episode.
 Update Rule: The policy is updated using the gradient of the
log-probability of actions taken, weighted by the return.
Step-by-Step Implementation
1. Environment Setup:
import numpy as np
import random
import torch
import torch.nn as nn
import torch.optim as optim

class SoccerEnv:
    def __init__(self, grid_size=5):
        self.grid_size = grid_size
        self.agent1_pos = (0, 0)
        self.agent2_pos = (0, 4)
        self.ball_pos = (2, 2)
        self.goal1 = [(0, 2)]   # Goal for team 1
        self.goal2 = [(4, 2)]   # Goal for team 2
        self.done = False

    def reset(self):
        self.agent1_pos = (0, 0)
        self.agent2_pos = (0, 4)
        self.ball_pos = (2, 2)
        self.done = False
        return self.get_state()

    def get_state(self):
        return np.array([self.agent1_pos[0], self.agent1_pos[1],
                         self.agent2_pos[0], self.agent2_pos[1],
                         self.ball_pos[0], self.ball_pos[1]], dtype=np.float32)

    def step(self, agent1_action, agent2_action):
        if self.done:
            return self.get_state(), 0, self.done

        # Take the actions for both agents (ball dynamics are left out of this simplified sketch)
        self.agent1_pos = self.take_action(self.agent1_pos, agent1_action)
        self.agent2_pos = self.take_action(self.agent2_pos, agent2_action)

        reward = 0
        if self.agent1_pos == self.goal1[0] and self.ball_pos == self.goal1[0]:
            reward = 1    # Team 1 scores
            self.done = True
        elif self.agent2_pos == self.goal2[0] and self.ball_pos == self.goal2[0]:
            reward = -1   # Team 2 scores
            self.done = True

        return self.get_state(), reward, self.done

    def take_action(self, position, action):
        if action == 0:    # Move up
            return (max(0, position[0] - 1), position[1])
        elif action == 1:  # Move down
            return (min(self.grid_size - 1, position[0] + 1), position[1])
        elif action == 2:  # Move left
            return (position[0], max(0, position[1] - 1))
        elif action == 3:  # Move right
            return (position[0], min(self.grid_size - 1, position[1] + 1))
        return position
2. Policy Network:
class PolicyNetwork(nn.Module):
    def __init__(self, state_size, action_size):
        super(PolicyNetwork, self).__init__()
        self.fc1 = nn.Linear(state_size, 128)
        self.fc2 = nn.Linear(128, action_size)

    def forward(self, state):
        x = torch.relu(self.fc1(state))
        action_probs = torch.softmax(self.fc2(x), dim=-1)
        return action_probs
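The answer describes the REINFORCE update, but the listing stops at the policy network, so here is a rough training-loop sketch (my own, under the assumption that each agent interprets the single environment reward from its own perspective, i.e. agent 2 receives the negated return):

def run_episode(env, policy1, policy2, gamma=0.99, max_steps=50):
    state = torch.FloatTensor(env.reset())
    log_probs1, log_probs2, rewards = [], [], []
    for _ in range(max_steps):
        probs1, probs2 = policy1(state), policy2(state)
        a1 = torch.multinomial(probs1, 1).item()
        a2 = torch.multinomial(probs2, 1).item()
        log_probs1.append(torch.log(probs1[a1]))
        log_probs2.append(torch.log(probs2[a2]))
        next_state, reward, done = env.step(a1, a2)
        rewards.append(reward)
        state = torch.FloatTensor(next_state)
        if done:
            break
    # Discounted return G_t for each time step
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    return log_probs1, log_probs2, returns

env = SoccerEnv()
policy1, policy2 = PolicyNetwork(6, 4), PolicyNetwork(6, 4)
opt1 = optim.Adam(policy1.parameters(), lr=1e-2)
opt2 = optim.Adam(policy2.parameters(), lr=1e-2)

for episode in range(500):
    lp1, lp2, returns = run_episode(env, policy1, policy2)
    returns = torch.tensor(returns)
    # REINFORCE: each agent ascends the gradient of sum_t log pi(a_t|s_t) * G_t
    loss1 = -(torch.stack(lp1) * returns).sum()        # agent 1 maximizes the return
    loss2 = -(torch.stack(lp2) * (-returns)).sum()     # agent 2 sees the negated return
    opt1.zero_grad(); opt2.zero_grad()
    (loss1 + loss2).backward()
    opt1.step(); opt2.step()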
14 A robot is tasked with navigating through a dynamic environment where obstacles move unpredictably. Use the Actor-Critic method to learn an optimal policy. [CO5, BTL5]
Ans To solve the task of navigating a robot through a dynamic environment
with moving obstacles using the Actor-Critic method, we need to
implement both the actor and the critic components of the
reinforcement learning algorithm.
Actor-Critic Method Overview:
The Actor-Critic method is a type of Policy Gradient method where:
1. The Actor is responsible for selecting actions (policy) based on
the current state of the environment.
2. The Critic evaluates the actions taken by the Actor and
provides feedback in the form of value estimates, typically the
State-Value Function (V(s)) or the Advantage Function
(A(s, a)).
In the Actor-Critic method, the Critic helps the Actor improve by
providing feedback on the quality of the chosen actions and thus
refining the policy. This approach balances the exploration-exploitation
trade-off by both evaluating states (Critic) and suggesting actions
(Actor).
Problem Setup:
 A robot is tasked with navigating in a grid environment where
obstacles move unpredictably.
 The robot receives a reward (+1) for reaching a goal state, and it
gets a penalty (-1) for colliding with obstacles or stepping out of
bounds.
 The state of the environment includes the position of the robot,
positions of obstacles, and the goal state.
 The goal is to learn a policy that helps the robot navigate safely
and efficiently towards the goal.
Key Components:
1. State Representation:
o The robot's position in the grid.
o Positions of dynamic obstacles.
o Goal position.
2. Actions:
o The actions might include moving in four directions: up,
down, left, right.
3. Reward Function:
o Positive reward (+1) for reaching the goal.
o Negative reward (-1) for colliding with obstacles or moving
out of bounds.
4. Learning:
o The Critic estimates the value of the state V(s) or the advantage A(s, a).
o The Actor updates its policy using the feedback (from the
Critic) to improve its action selection.
Implementation Steps:
1. Environment Setup (Simplified Dynamic Grid):
python
CopyEdit
import numpy as np
import random
import torch
import torch.nn as nn
import torch.optim as optim

class DynamicEnv:
def __init__(self, grid_size=5):
self.grid_size = grid_size
self.robot_pos = (0, 0)
self.goal_pos = (4, 4)
self.obstacles = [(2, 2), (1, 3), (3, 1)] # Example static obstacles
self.done = False

def reset(self):
self.robot_pos = (0, 0)
self.done = False
return self.get_state()

def get_state(self):
return np.array([self.robot_pos[0], self.robot_pos[1],
self.goal_pos[0], self.goal_pos[1]])

def step(self, action):


if self.done:
return self.get_state(), 0, self.done

# Take action and move robot


if action == 0: # Move up
self.robot_pos = (max(0, self.robot_pos[0] - 1),
self.robot_pos[1])
elif action == 1: # Move down
self.robot_pos = (min(self.grid_size - 1, self.robot_pos[0] + 1),
self.robot_pos[1])
elif action == 2: # Move left
self.robot_pos = (self.robot_pos[0], max(0, self.robot_pos[1] -
1))
elif action == 3: # Move right
self.robot_pos = (self.robot_pos[0], min(self.grid_size - 1,
self.robot_pos[1] + 1))

# Check for obstacles


reward = -1 if self.robot_pos in self.obstacles else 0

# Check if robot reaches the goal


if self.robot_pos == self.goal_pos:
reward = 1
self.done = True

return self.get_state(), reward, self.done
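The listing above only defines the environment, so here is a rough Actor-Critic sketch (my own, assuming the 10-dimensional state returned by get_state above and a one-step advantage r + γV(s') − V(s)):

class Actor(nn.Module):
    def __init__(self, state_size, action_size):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_size, 64), nn.ReLU(),
                                 nn.Linear(64, action_size), nn.Softmax(dim=-1))
    def forward(self, x):
        return self.net(x)

class Critic(nn.Module):
    def __init__(self, state_size):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_size, 64), nn.ReLU(), nn.Linear(64, 1))
    def forward(self, x):
        return self.net(x)

gamma = 0.99
env = DynamicEnv()
actor, critic = Actor(10, 4), Critic(10)
actor_opt = optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = optim.Adam(critic.parameters(), lr=1e-3)

for episode in range(500):
    state = torch.FloatTensor(env.reset())
    done, steps = False, 0
    while not done and steps < 100:
        probs = actor(state)
        action = torch.multinomial(probs, 1).item()
        next_state_np, reward, done = env.step(action)
        next_state = torch.FloatTensor(next_state_np)

        # One-step TD error serves as the advantage estimate
        value = critic(state)
        next_value = torch.zeros(1) if done else critic(next_state).detach()
        td_target = reward + gamma * next_value
        advantage = (td_target - value).detach()

        # Critic: move V(s) toward the TD target; Actor: policy gradient weighted by the advantage
        critic_loss = (td_target - value).pow(2).mean()
        actor_loss = -torch.log(probs[action]) * advantage

        critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
        actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

        state, steps = next_state, steps + 1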

15 Apply Policy Gradient Methods to solve the CartPole problem. [CO5, BTL5]
Ans To solve the CartPole problem using Policy Gradient Methods, we
need to design an agent that learns the optimal policy directly by
adjusting the parameters of its policy function based on the gradients
of the expected reward.
In this case, we'll use a reinforcement learning policy gradient
method with a neural network to approximate the policy function
and update its parameters. Specifically, we'll use the REINFORCE
algorithm, which is a Monte Carlo-based policy gradient method.
CartPole Problem Overview:
 Objective: A pole is placed on a cart, and the goal is to balance
the pole by applying forces (left or right) to the cart.
 Actions: The agent can apply a force to move the cart left or
right.
 State: The state consists of four variables: the position of the
cart, the velocity of the cart, the angle of the pole, and the
angular velocity of the pole.
 Reward: The agent receives a reward of +1 for each time step
that the pole remains balanced.
 Termination: The episode ends when the pole falls over or the
maximum number of steps is reached.
Policy Gradient (REINFORCE) Approach:
 We will use a neural network to model the policy (the
probability distribution over actions). The network will take the
state as input and output the action probabilities.
 We will compute the policy gradient using the reward-to-go
approach, which is the sum of rewards from a given time step
until the end of the episode.
Implementation Steps:
1. Environment Setup:
We will use the CartPole-v1 environment from OpenAI's gym library.
2. Policy Network:
The policy network will take the state of the environment as input and
output a probability distribution over the two actions (left or right).
3. Training Loop:
We will use the REINFORCE algorithm, where we compute the
cumulative reward for each episode, calculate the gradients, and
update the policy network.
Implementation Code:
import gym
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

# 1. Define the Policy Network (Neural Network)
class PolicyNetwork(nn.Module):
    def __init__(self, state_size, action_size):
        super(PolicyNetwork, self).__init__()
        self.fc1 = nn.Linear(state_size, 128)
        self.fc2 = nn.Linear(128, action_size)

    def forward(self, state):
        x = torch.relu(self.fc1(state))
        action_probs = torch.softmax(self.fc2(x), dim=-1)
        return action_probs

# 2. Define the Policy Gradient Agent
class PolicyGradientAgent:
    def __init__(self, state_size, action_size, gamma=0.99, lr=0.01):
        self.state_size = state_size
        self.action_size = action_size
        self.gamma = gamma  # Discount factor
        self.policy = PolicyNetwork(state_size, action_size)
        self.optimizer = optim.Adam(self.policy.parameters(), lr=lr)

    def select_action(self, state):
        state = torch.FloatTensor(state).unsqueeze(0)
        action_probs = self.policy(state)
        action = torch.multinomial(action_probs, 1).item()  # Sample an action from the probabilities
        return action, action_probs

    def update(self, rewards, log_probs):
        discounted_rewards = []
        running_total = 0
        for reward in reversed(rewards):
            running_total = reward + self.gamma * running_total
            discounted_rewards.insert(0, running_total)

        # Normalize the rewards
        discounted_rewards = torch.tensor(discounted_rewards)
        discounted_rewards = (discounted_rewards - discounted_rewards.mean()) / (discounted_rewards.std() + 1e-7)

        # Policy Gradient update
        loss = 0
        for log_prob, reward in zip(log_probs, discounted_rewards):
            loss -= log_prob * reward  # Negative log-likelihood weighted by the return (REINFORCE)

        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

# 3. Train the agent
def train_agent(env, agent, episodes=1000):
    for episode in range(episodes):
        state, _ = env.reset()          # newer Gym/Gymnasium API: reset() returns (obs, info)
        state = np.array(state)
        done = False
        rewards = []
        log_probs = []

        while not done:
            action, action_probs = agent.select_action(state)
            log_prob = torch.log(action_probs[0, action])
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated

            # Collect rewards and log probabilities
            rewards.append(reward)
            log_probs.append(log_prob)

            state = next_state

        # Update the policy after the episode is finished
        agent.update(rewards, log_probs)

        # Print the reward for every 100th episode
        if episode % 100 == 0:
            print(f"Episode {episode}, Total Reward: {sum(rewards)}")

# 4. Initialize the environment and agent
env = gym.make('CartPole-v1')
state_size = env.observation_space.shape[0]  # 4: cart position, cart velocity, pole angle, pole angular velocity
action_size = env.action_space.n             # 2: left or right

agent = PolicyGradientAgent(state_size, action_size)

# 5. Train the agent
train_agent(env, agent, episodes=1000)
