
TUTORIAL QUESTIONS [ANNEXURE I]

Question No. | Questions | CO | BTL
1. Develop an agent that can interact with a Multi-Armed Bandit environment, explore the different arms, and gradually converge to the arm that provides the highest average reward. The agent should learn to make decisions that maximize the cumulative reward over time by effectively balancing exploration and exploitation. [CO 1, BTL 2]
Ans: To develop this agent:
- Initialize the estimated value Q(a) of each arm.
- Use an ε-greedy strategy:
  - With probability ε, explore (choose a random arm).
  - With probability 1 − ε, exploit (choose the arm with the highest estimated value).
- After each pull, update the estimated value using:
  Q_new(a) = Q_old(a) + α[R − Q_old(a)]
This balances exploration and exploitation so that the agent gradually favors the best arm.
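A minimal Python sketch of such an agent; the arm reward distributions, step counts, and parameter values below are illustrative assumptions rather than part of the question:

```python
import random

def run_bandit(true_means, steps=1000, epsilon=0.1, alpha=0.1):
    """Epsilon-greedy agent for a multi-armed bandit with Gaussian rewards."""
    k = len(true_means)
    Q = [0.0] * k                                  # estimated value of each arm
    total_reward = 0.0
    for _ in range(steps):
        if random.random() < epsilon:              # explore
            a = random.randrange(k)
        else:                                      # exploit
            a = max(range(k), key=lambda i: Q[i])
        reward = random.gauss(true_means[a], 1.0)  # pull the chosen arm
        Q[a] += alpha * (reward - Q[a])            # incremental value update
        total_reward += reward
    return Q, total_reward

# Example: three arms whose (unknown to the agent) mean rewards are 1.0, 1.5, 2.0
estimates, total = run_bandit([1.0, 1.5, 2.0])
print(estimates)   # the estimate for the last arm should end up largest
```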

2. Consider using reinforcement learning to control the motion of a robot arm (“Pick-and-Place Robot”) in a repetitive pick-and-place task. If we want to learn movements that are fast and smooth, the learning agent will have to control the motors directly and have low-latency information about the current positions and velocities of the mechanical linkages. The actions in this case might be the voltages applied to each motor at each joint, and the states might be the latest readings of joint angles and velocities. The reward might be +1 for each object successfully picked up and placed. To encourage smooth movements, on each time step a small, negative reward can be given as a function of the moment-to-moment “jerkiness” of the motion. [CO 1, BTL 2]

A. Devise three example tasks of your own that fit into the MDP framework, identifying for each its states, actions, and rewards. Make the three examples as different from each other as possible. The framework is abstract and flexible and can be applied in many ways. Stretch its limits in some way in at least one of your examples.
Ans: Three example MDPs:
1. Autonomous Vacuum Cleaner
   - States: room layout and dust status.
   - Actions: move, clean.
   - Rewards: +1 for cleaning, −1 for bumping into a wall.
2. Stock Trading Bot
   - States: current stock prices and portfolio.
   - Actions: buy, sell, hold.
   - Rewards: profit/loss at each step.
3. Dynamic Game NPC Behavior
   - States: player proximity and health.
   - Actions: attack, defend, hide.
   - Rewards: +1 for dealing damage, −1 for getting hit.

B. Is the MDP framework adequate to usefully represent all goal-directed learning tasks?
Ans: The MDP framework is suitable for many goal-directed tasks in which the environment is Markovian, i.e., the current state summarizes everything relevant about the past. However, it struggles with partial observability or very long-term dependencies unless it is extended (e.g., to POMDPs).

3. Jack’s Car Rental: Jack manages two locations for a nationwide car rental company. Each day, some number of customers arrive at each location to rent cars. If Jack has a car available, he rents it out and is credited $10 by the national company. If he is out of cars at that location, then the business is lost. Cars become available for renting the day after they are returned. To help ensure that cars are available where they are needed, Jack can move them between the two locations overnight, at a cost of $2 per car moved. We assume that the numbers of cars requested and returned at each location are Poisson random variables. Suppose λ is 3 and 4 for rental requests at the first and second locations and 3 and 2 for returns. To simplify the problem slightly, we assume that there can be no more than 20 cars at each location (any additional cars are returned to the nationwide company, and thus disappear from the problem) and a maximum of five cars can be moved from one location to the other in one night. Take the discount rate to be γ = 0.9 and formulate this as a continuing finite MDP, where the time steps are days, the state is the number of cars at each location at the end of the day, and the actions are the net numbers of cars moved between the two locations overnight. [CO 1, BTL 2]
Ans:
- States: (number of cars at location A, number of cars at location B), each between 0 and 20.
- Actions: net number of cars moved overnight, from −5 to +5.
- Rewards: +$10 per car rented, −$2 per car moved.
- Transitions: determined by the Poisson rental and return distributions at each location.
- Discount factor: γ = 0.9.
This is a continuing finite MDP; the optimal policy balances rental income against the cost of moving cars.
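As a small illustration of how the Poisson dynamics enter the formulation, the sketch below computes the expected rental income at one location for a given number of available cars. It is only one building block of the full dynamic-programming solution, and the function names are my own:

```python
import math

def poisson_pmf(n, lam):
    """P(N = n) for a Poisson random variable with mean lam."""
    return (lam ** n) * math.exp(-lam) / math.factorial(n)

def expected_rental_income(cars_available, lam_requests, rental_credit=10, max_n=20):
    """Expected income = 10 * E[min(requests, cars_available)], truncated at 20 requests."""
    expected_rentals = sum(poisson_pmf(n, lam_requests) * min(n, cars_available)
                           for n in range(max_n + 1))
    return rental_credit * expected_rentals

# Location A: lambda = 3 requests per day, 8 cars on hand
print(expected_rental_income(8, 3))   # close to 10 * 3 = 30 when cars are plentiful
```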

4. Consider a 3×3 GridWorld environment where the agent can move up, down, left, or right. Each move gives a reward of −1, and the goal is to reach the bottom-right corner, which gives a reward of +10. If the agent hits a wall, it stays in place and receives a reward of −1. The discount factor γ = 0.9. The initial value of all states is 0. Compute the value of the top-left corner after two iterations of the value iteration algorithm. [CO 2, BTL 3]
Ans: Denote the top-left state as (0,0), with step reward −1, γ = 0.9, and all values initialized to 0.
- 1st iteration: V(0,0) = max over actions of [−1 + 0.9 × 0] = −1 (every reachable neighbour still has value 0).
- 2nd iteration: V(0,0) = max over actions of [−1 + 0.9 × V(next state)]. Every successor of (0,0), including staying in place after a wall bump, has value −1 from the first iteration, so V(0,0) = −1 + 0.9 × (−1) = −1.9.
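A short value-iteration sketch for this 3×3 grid. It assumes one reasonable reading of the problem: the goal is a terminal state with value 0, and a move that enters the goal earns +10 instead of the −1 step cost (this choice does not affect the top-left value after two sweeps):

```python
import numpy as np

N, GOAL, GAMMA = 3, (2, 2), 0.9
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

def value_iteration(iterations=2):
    V = np.zeros((N, N))
    for _ in range(iterations):
        new_V = np.zeros((N, N))
        for r in range(N):
            for c in range(N):
                if (r, c) == GOAL:
                    continue                      # terminal state stays at 0
                best = -float("inf")
                for dr, dc in ACTIONS:
                    nr, nc = r + dr, c + dc
                    if not (0 <= nr < N and 0 <= nc < N):
                        nr, nc = r, c             # hit a wall: stay in place
                    reward = 10 if (nr, nc) == GOAL else -1
                    best = max(best, reward + GAMMA * V[nr, nc])
                new_V[r, c] = best
        V = new_V
    return V

print(value_iteration(2)[0, 0])   # top-left value after two sweeps: -1.9
```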

5. Given the following Markov Decision Process (MDP):
- States: S = {s1, s2}
- Actions: A = {a1, a2}
- Transition probabilities:
  - P(s1 | s1, a1) = 0.8, P(s2 | s1, a1) = 0.2
  - P(s1 | s2, a2) = 0.4, P(s2 | s2, a2) = 0.6
- Rewards: R(s1, a1) = 5, R(s2, a2) = 10
- Discount factor γ = 0.9
Assume a deterministic policy: π(s1) = a1, π(s2) = a2. Compute the state values V(s1) and V(s2) under the given policy. [CO 2, BTL 3]
Ans: Using the Bellman equation under the given policy:
V(s1) = R(s1, a1) + γ[0.8 V(s1) + 0.2 V(s2)]
V(s2) = R(s2, a2) + γ[0.4 V(s1) + 0.6 V(s2)]
Substituting the given values:
- V(s1) = 5 + 0.9[0.8 V(s1) + 0.2 V(s2)]  ⇒  0.28 V(s1) − 0.18 V(s2) = 5
- V(s2) = 10 + 0.9[0.4 V(s1) + 0.6 V(s2)]  ⇒  −0.36 V(s1) + 0.46 V(s2) = 10
Solving this 2×2 linear system gives V(s1) ≈ 64.06 and V(s2) ≈ 71.88.
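The 2×2 system can be checked quickly with NumPy:

```python
import numpy as np

#  0.28 V(s1) - 0.18 V(s2) = 5
# -0.36 V(s1) + 0.46 V(s2) = 10
A = np.array([[0.28, -0.18],
              [-0.36, 0.46]])
b = np.array([5.0, 10.0])
print(np.linalg.solve(A, b))   # [64.0625 71.875 ]
```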

6. Consider the 4×4 gridworld shown below. If π is the equiprobable random policy, what is qπ(11, down)? What is qπ(7, down)? [CO 2, BTL 3]
Ans: Use
qπ(s, a) = Σ_{s'} P(s' | s, a)[R + γ Vπ(s')]
For the equiprobable random policy, plug in the known Vπ values (given in the problem or precomputed by policy evaluation) for the state reached from state 11 by moving down and from state 7 by moving down, and evaluate with the transition model. For example, with the standard undiscounted values for this gridworld (Vπ of the terminal state = 0 and Vπ(11) = −14), qπ(11, down) = −1 + 0 = −1 and qπ(7, down) = −1 + Vπ(11) = −15.
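A small helper that applies this formula, assuming the transition model and the state values Vπ are supplied as dictionaries; the names and the illustrative call are mine, not from the problem:

```python
def q_value(state, action, P, R, V, gamma=1.0):
    """q_pi(s, a) = sum over s' of P(s'|s, a) * (R + gamma * V(s'))."""
    return sum(prob * (R + gamma * V[s_next])
               for s_next, prob in P[(state, action)].items())

# Illustrative use: a deterministic 'down' move from state 11 into the terminal state,
# reward -1 per step, undiscounted, V[terminal] = 0  ->  q = -1
P = {(11, "down"): {"terminal": 1.0}}
V = {"terminal": 0.0}
print(q_value(11, "down", P, R=-1, V=V, gamma=1.0))
```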

7. Consider a 4×12 GridWorld where:
- The start state is at the bottom-left corner (S).
- The goal state is at the bottom-right corner (G).
- The agent receives a reward of −1 for every step.
- Falling into the "cliff" (the cells between S and G) results in a reward of −100, and the agent is sent back to S.
- Discount factor γ = 0.9.
- Learning rate α = 0.5.
Using the following episode, update the Q-values for SARSA. Assume ε = 0.1 for an ε-greedy policy. [CO 3, BTL 3]
Ans: SARSA update:
Q(s, a) = Q(s, a) + α[r + γ Q(s', a') − Q(s, a)]
Apply this update step by step to each (s, a, r, s', a') tuple of the episode, with actions selected by the ε-greedy policy (ε = 0.1).
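A minimal sketch of the SARSA update applied over an episode. The Q-table layout, state encoding, and the two example transitions are assumptions for illustration, since the episode data is not reproduced in the text:

```python
from collections import defaultdict

def sarsa_update(Q, episode, alpha=0.5, gamma=0.9):
    """episode: list of (s, a, r, s_next, a_next) tuples, in order."""
    for s, a, r, s_next, a_next in episode:
        td_target = r + gamma * Q[(s_next, a_next)]   # uses the action actually taken next
        Q[(s, a)] += alpha * (td_target - Q[(s, a)])
    return Q

Q = defaultdict(float)
episode = [((3, 0), "up", -1, (2, 0), "right"),       # illustrative steps only
           ((2, 0), "right", -1, (2, 1), "right")]
sarsa_update(Q, episode)
print(dict(Q))
```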

8. An agent navigates a 3×3 grid. Each cell has a deterministic transition.
- Rewards: +10 for reaching the goal state (G) at (2,2); −1 for every move.
- Actions: up, down, left, right.
- Discount factor γ = 0.9.
- Learning rate α = 0.5.
Using the following episode, update the Q-values for Q-learning. Assume ε = 0.1 for an ε-greedy policy. [CO 3, BTL 3]
Ans: Q-learning update:
Q(s, a) = Q(s, a) + α[r + γ max_a' Q(s', a') − Q(s, a)]
Apply this for each step of the episode, using the maximum Q-value over actions in the next state rather than the Q-value of the next action actually taken (as in SARSA).
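For contrast with SARSA, a sketch of the Q-learning update, which bootstraps from the greedy (max) action in the next state; the example transitions are illustrative, since the episode is not reproduced in the text:

```python
from collections import defaultdict

ACTIONS = ["up", "down", "left", "right"]

def q_learning_update(Q, episode, alpha=0.5, gamma=0.9):
    """episode: list of (s, a, r, s_next) tuples; the next action is not needed."""
    for s, a, r, s_next in episode:
        best_next = max(Q[(s_next, a_next)] for a_next in ACTIONS)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q

Q = defaultdict(float)
episode = [((0, 0), "right", -1, (0, 1)),   # illustrative steps only
           ((0, 1), "down", -1, (1, 1))]
q_learning_update(Q, episode)
print(dict(Q))
```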

9. Driving Home: Each day as you drive home from work, you try to predict how long it will take to get home. When you leave your office, you note the time, the day of week, the weather, and anything else that might be relevant. Say on this Friday you are leaving at exactly 6 o’clock, and you estimate that it will take 30 minutes to get home. As you reach your car it is 6:05, and you notice it is starting to rain. Traffic is often slower in the rain, so you reestimate that it will take 35 minutes from then, or a total of 40 minutes. Fifteen minutes later you have completed the highway portion of your journey in good time. As you exit onto a secondary road you cut your estimate of total travel time to 35 minutes. Unfortunately, at this point you get stuck behind a slow truck, and the road is too narrow to pass. You end up having to follow the truck until you turn onto the side street where you live at 6:40. Three minutes later you are home. The sequence of states, times, and predictions is thus as follows:
Use the Monte Carlo method to plot the predicted total time. [CO 3, BTL 3]
Ans:
- Track the states visited along the way (leaving the office at 6:00, reaching the car at 6:05, exiting the highway, following the truck, arriving home).
- The actual return is the total travel time of 43 minutes (6:00 to 6:43).
- First-visit MC: average the returns observed for each state; with a constant step size, each state's prediction is shifted toward the actual outcome of 43 minutes.
- Plot the predicted total time at each state against the actual outcome to visualize convergence.
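A sketch of a constant-α Monte Carlo update of the total-time predictions. It uses only the predictions that appear in the story (30, 40, 35 minutes); the remaining rows of the sequence table are elided in the text, so they are not filled in here:

```python
def mc_update(predictions, actual_return, alpha=0.5):
    """Shift every state's predicted total time toward the actual outcome."""
    return [p + alpha * (actual_return - p) for p in predictions]

# Predicted total travel time at the states mentioned in the story
predictions = [30, 40, 35]   # leaving office, reaching car, exiting highway
actual = 43                  # total time actually taken (6:00 -> 6:43)
print(mc_update(predictions, actual))   # every prediction moves toward 43
```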

10. Design an agent that can balance a pole on a cart by applying forces (left or right) to the cart, for the CartPole reinforcement learning problem, using a Deep Q-Network (DQN). [CO 4, BTL 4]
Ans:
- State: cart position, cart velocity, pole angle, pole angular velocity.
- Actions: push left, push right.
- Use a replay buffer (experience replay) and a target network that is updated periodically.
- The DQN learns Q(s, a) with a neural network trained on the Bellman (TD) loss.
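A compact PyTorch-style sketch of the DQN pieces mentioned above (Q-network, target network, Bellman/TD loss). Layer sizes, the learning rate, and the minibatch format are illustrative assumptions; the replay buffer and environment loop are omitted:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a CartPole state (4 numbers) to Q-values for the 2 actions."""
    def __init__(self, state_dim=4, n_actions=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, n_actions))

    def forward(self, x):
        return self.net(x)

q_net, target_net = QNetwork(), QNetwork()
target_net.load_state_dict(q_net.state_dict())   # periodically re-copied from q_net
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def dqn_loss(batch, gamma=0.99):
    """Bellman (TD) loss on a minibatch sampled from the replay buffer."""
    states, actions, rewards, next_states, dones = batch
    q = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        q_next = target_net(next_states).max(dim=1).values
        target = rewards + gamma * q_next * (1 - dones)
    return nn.functional.mse_loss(q, target)
```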

11. A mobile robot is navigating a 5×5 gridworld to collect objects scattered across the environment while avoiding obstacles. Develop a Deep Q-Network (DQN) to learn an optimal policy for navigating the grid. [CO 4, BTL 4]
Ans:
- Input: the grid observation.
- Output: Q-values for the four move directions.
- Rewards: +1 for collecting an object, −1 for hitting an obstacle, 0 otherwise.
Train with an ε-greedy behavior policy and the Bellman loss to obtain an optimal navigation policy.
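One possible way to turn the 5×5 grid observation into a network input is to stack one-hot channels for the robot, the objects, and the obstacles, then flatten. This encoding is my assumption; the question does not fix one:

```python
import numpy as np

def encode_grid(robot, objects, obstacles, size=5):
    """Return a flat (3*size*size,) float vector: robot / object / obstacle channels."""
    channels = np.zeros((3, size, size), dtype=np.float32)
    channels[0][robot] = 1.0
    for cell in objects:
        channels[1][cell] = 1.0
    for cell in obstacles:
        channels[2][cell] = 1.0
    return channels.reshape(-1)   # feed this 75-dimensional vector to the Q-network

x = encode_grid(robot=(0, 0), objects=[(2, 3), (4, 4)], obstacles=[(1, 1)])
print(x.shape)   # (75,)
```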

12. A self-driving taxi operates in a 5×5 grid city. It needs to pick up passengers from random locations and drop them off at their destinations. Show how the taxi uses a DQN agent to learn the optimal policy. [CO 4, BTL 4]
Ans:
- State: taxi position, passenger location, destination.
- Actions: move, pick up, drop off.
- Rewards: −1 per step, +20 for a successful drop-off, −10 for an illegal action.
The DQN is trained with experience replay and Q-learning updates.
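A sketch of the remaining training machinery named here, experience replay and ε-greedy action selection; the buffer capacity, batch size, and ε value are illustrative:

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (state, action, reward, next_state, done) transitions for minibatch sampling."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size=32):
        return random.sample(self.buffer, batch_size)

def epsilon_greedy(q_values, n_actions, epsilon=0.1):
    """q_values: sequence of Q(s, a) estimates for the current state."""
    if random.random() < epsilon:
        return random.randrange(n_actions)          # explore
    return max(range(n_actions), key=lambda a: q_values[a])   # exploit
```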

13. Two teams of agents are playing a simplified soccer game in a grid environment. Each agent learns its own policy using REINFORCE. Apply policy gradient methods. [CO 5, BTL 5]
Ans:
- Each agent collects its own (state, action, reward) trajectory.
- Apply the update: θ = θ + α ∇θ log πθ(a|s) G_t
This optimizes each policy directly from sampled returns (the REINFORCE algorithm).
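A PyTorch-style sketch of one REINFORCE update for a single agent, given the log-probabilities and rewards it collected along its trajectory; the policy architecture, optimizer, and discount value are assumptions:

```python
import torch

def reinforce_update(log_probs, rewards, optimizer, gamma=0.99):
    """log_probs: list of log pi_theta(a_t|s_t) tensors; rewards: list of floats."""
    returns, G = [], 0.0
    for r in reversed(rewards):              # compute G_t backwards through the episode
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)
    loss = -(torch.stack(log_probs) * returns).sum()   # ascend on log pi * G_t
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```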

14. A robot is tasked with navigating through a dynamic environment where obstacles move unpredictably. Use the Actor-Critic method to learn an optimal policy. [CO 5, BTL 5]
Ans:
- Actor: learns the policy π(a|s; θ).
- Critic: learns the value function V(s; w).
- Update the actor using the TD error as the advantage: ∇θ log π(a|s) [r + γ V(s') − V(s)]
- Update the critic so that V(s) moves toward the TD target r + γ V(s').
Because updates are made at every step rather than at the end of an episode, the method is effective for real-time adaptation in dynamic environments.
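A sketch of the one-step actor-critic update, with the TD error driving both the critic's loss and the actor's advantage. The actor and critic modules, their optimizers, and a scalar log-probability are assumed to exist with the usual PyTorch interfaces:

```python
import torch

def actor_critic_step(critic, opt_actor, opt_critic,
                      s, a_log_prob, r, s_next, done, gamma=0.99):
    """One online update: the TD error updates the critic and weights the actor's gradient."""
    v = critic(s)
    with torch.no_grad():
        v_next = torch.zeros(1) if done else critic(s_next)
        td_target = r + gamma * v_next
    td_error = td_target - v

    critic_loss = td_error.pow(2).mean()          # move V(s) toward r + gamma V(s')
    opt_critic.zero_grad()
    critic_loss.backward()
    opt_critic.step()

    actor_loss = -a_log_prob * td_error.detach()  # policy gradient with TD-error advantage
    opt_actor.zero_grad()
    actor_loss.backward()
    opt_actor.step()
```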

15. Apply policy gradient methods to solve the CartPole problem. [CO 5, BTL 5]
Ans:
- Use the collected trajectories to compute the return G_t at each time step.
- Update: θ = θ + α Σ_t ∇θ log π(a_t|s_t) G_t
No value function is needed; the policy is learned directly from episode returns.
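A sketch of the policy network and action sampling for CartPole; the sampled log-probabilities and the returns G_t are then plugged into the update θ = θ + α Σ_t ∇θ log π(a_t|s_t) G_t. Layer sizes and the random placeholder state are illustrative assumptions:

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class PolicyNetwork(nn.Module):
    """Maps a CartPole state (4 numbers) to a distribution over the 2 actions."""
    def __init__(self, state_dim=4, n_actions=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, n_actions))

    def forward(self, state):
        return Categorical(logits=self.net(state))

policy = PolicyNetwork()
state = torch.randn(4)               # placeholder for an observed CartPole state
dist = policy(state)
action = dist.sample()               # action to send to the environment
log_prob = dist.log_prob(action)     # stored for the policy-gradient update
```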
