
TUTORIAL QUESTIONS [ANNEXURE I]

Question No. | Questions | CO | BTL
1. Develop an agent that can interact with a Multi-Armed Bandit environment, explore the different arms, and gradually converge to the arm that provides the highest average reward. The agent should learn to make decisions that maximize the cumulative reward over time by effectively balancing exploration and exploitation. [CO 1, BTL 2]
Ans: To develop this agent:
- Initialize the estimated value Q(a) of each arm.
- Use an ε-greedy strategy:
  - With probability ε, explore (choose a random arm).
  - With probability 1 − ε, exploit (choose the arm with the highest estimated value).
- After each pull, update the estimated value using:
  Q_new(a) = Q_old(a) + α[R − Q_old(a)]
This balances exploration and exploitation so that the agent gradually favors the best arm.
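A minimal Python sketch of such an agent; the arm reward distributions, step counts, and parameter values below are illustrative assumptions rather than part of the question:

```python
import random

def run_bandit(true_means, steps=1000, epsilon=0.1, alpha=0.1):
    """Epsilon-greedy agent for a multi-armed bandit with Gaussian rewards."""
    k = len(true_means)
    Q = [0.0] * k                                  # estimated value of each arm
    total_reward = 0.0
    for _ in range(steps):
        if random.random() < epsilon:              # explore
            a = random.randrange(k)
        else:                                      # exploit
            a = max(range(k), key=lambda i: Q[i])
        reward = random.gauss(true_means[a], 1.0)  # pull the chosen arm
        Q[a] += alpha * (reward - Q[a])            # incremental value update
        total_reward += reward
    return Q, total_reward

# Example: three arms whose (unknown to the agent) mean rewards are 1.0, 1.5, 2.0
estimates, total = run_bandit([1.0, 1.5, 2.0])
print(estimates)   # the estimate for the last arm should end up largest
```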

2. Consider using reinforcement learning to control the motion of a robot arm (“Pick-and-Place Robot”) in a repetitive pick-and-place task. If we want to learn movements that are fast and smooth, the learning agent will have to control the motors directly and have low-latency information about the current positions and velocities of the mechanical linkages. The actions in this case might be the voltages applied to each motor at each joint, and the states might be the latest readings of joint angles and velocities. The reward might be +1 for each object successfully picked up and placed. To encourage smooth movements, on each time step a small, negative reward can be given as a function of the moment-to-moment “jerkiness” of the motion. [CO 1, BTL 2]

A. Devise three example tasks of your own that fit into the MDP framework, identifying for each its states, actions, and rewards. Make the three examples as different from each other as possible. The framework is abstract and flexible and can be applied in many ways. Stretch its limits in some way in at least one of your examples.
Ans: Three example MDPs:
1. Autonomous Vacuum Cleaner
   - States: room layout and dust status.
   - Actions: move, clean.
   - Rewards: +1 for cleaning, −1 for bumping into a wall.
2. Stock Trading Bot
   - States: current stock prices and portfolio.
   - Actions: buy, sell, hold.
   - Rewards: profit/loss at each step.
3. Dynamic Game NPC Behavior
   - States: player proximity and health.
   - Actions: attack, defend, hide.
   - Rewards: +1 for dealing damage, −1 for getting hit.

B. Is the MDP framework adequate to usefully represent all goal-directed learning tasks?
Ans: The MDP framework is suitable for many goal-directed tasks in which the environment is Markovian, i.e., the current state summarizes everything relevant about the past. However, it struggles with partial observability or very long-term dependencies unless it is extended (e.g., to POMDPs).

3. Jack’s Car Rental: Jack manages two locations for a nationwide car rental company. Each day, some number of customers arrive at each location to rent cars. If Jack has a car available, he rents it out and is credited $10 by the national company. If he is out of cars at that location, then the business is lost. Cars become available for renting the day after they are returned. To help ensure that cars are available where they are needed, Jack can move them between the two locations overnight, at a cost of $2 per car moved. We assume that the numbers of cars requested and returned at each location are Poisson random variables. Suppose λ is 3 and 4 for rental requests at the first and second locations and 3 and 2 for returns. To simplify the problem slightly, we assume that there can be no more than 20 cars at each location (any additional cars are returned to the nationwide company, and thus disappear from the problem) and a maximum of five cars can be moved from one location to the other in one night. Take the discount rate to be γ = 0.9 and formulate this as a continuing finite MDP, where the time steps are days, the state is the number of cars at each location at the end of the day, and the actions are the net numbers of cars moved between the two locations overnight. [CO 1, BTL 2]
Ans:
- States: (number of cars at location A, number of cars at location B), each between 0 and 20.
- Actions: net number of cars moved overnight, from −5 to +5.
- Rewards: +$10 per car rented, −$2 per car moved.
- Transitions: determined by the Poisson rental and return distributions at each location.
- Discount factor: γ = 0.9.
This is a continuing finite MDP; the optimal policy balances rental income against the cost of moving cars.
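As a small illustration of how the Poisson dynamics enter the formulation, the sketch below computes the expected rental income at one location for a given number of available cars. It is only one building block of the full dynamic-programming solution, and the function names are my own:

```python
import math

def poisson_pmf(n, lam):
    """P(N = n) for a Poisson random variable with mean lam."""
    return (lam ** n) * math.exp(-lam) / math.factorial(n)

def expected_rental_income(cars_available, lam_requests, rental_credit=10, max_n=20):
    """Expected income = 10 * E[min(requests, cars_available)], truncated at 20 requests."""
    expected_rentals = sum(poisson_pmf(n, lam_requests) * min(n, cars_available)
                           for n in range(max_n + 1))
    return rental_credit * expected_rentals

# Location A: lambda = 3 requests per day, 8 cars on hand
print(expected_rental_income(8, 3))   # close to 10 * 3 = 30 when cars are plentiful
```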

4. Consider a 3×3 GridWorld environment where the agent can move up, down, left, or right. Each move gives a reward of −1, and the goal is to reach the bottom-right corner, which gives a reward of +10. If the agent hits a wall, it stays in place and receives a reward of −1. The discount factor γ = 0.9. The initial value of all states is 0. Compute the value of the top-left corner after two iterations of the value iteration algorithm. [CO 2, BTL 3]
Ans: Denote the top-left state as (0,0), with step reward −1, γ = 0.9, and all values initialized to 0.
- 1st iteration: V(0,0) = max over actions of [−1 + 0.9 × 0] = −1 (every reachable neighbour still has value 0).
- 2nd iteration: V(0,0) = max over actions of [−1 + 0.9 × V(next state)]. Every successor of (0,0), including staying in place after a wall bump, has value −1 from the first iteration, so V(0,0) = −1 + 0.9 × (−1) = −1.9.
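A short value-iteration sketch for this 3×3 grid. It assumes one reasonable reading of the problem: the goal is a terminal state with value 0, and a move that enters the goal earns +10 instead of the −1 step cost (this choice does not affect the top-left value after two sweeps):

```python
import numpy as np

N, GOAL, GAMMA = 3, (2, 2), 0.9
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

def value_iteration(iterations=2):
    V = np.zeros((N, N))
    for _ in range(iterations):
        new_V = np.zeros((N, N))
        for r in range(N):
            for c in range(N):
                if (r, c) == GOAL:
                    continue                      # terminal state stays at 0
                best = -float("inf")
                for dr, dc in ACTIONS:
                    nr, nc = r + dr, c + dc
                    if not (0 <= nr < N and 0 <= nc < N):
                        nr, nc = r, c             # hit a wall: stay in place
                    reward = 10 if (nr, nc) == GOAL else -1
                    best = max(best, reward + GAMMA * V[nr, nc])
                new_V[r, c] = best
        V = new_V
    return V

print(value_iteration(2)[0, 0])   # top-left value after two sweeps: -1.9
```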

5. Given the following Markov Decision Process (MDP):
- States: S = {s1, s2}
- Actions: A = {a1, a2}
- Transition probabilities:
  - P(s1 | s1, a1) = 0.8, P(s2 | s1, a1) = 0.2
  - P(s1 | s2, a2) = 0.4, P(s2 | s2, a2) = 0.6
- Rewards: R(s1, a1) = 5, R(s2, a2) = 10
- Discount factor γ = 0.9
Assume a deterministic policy: π(s1) = a1, π(s2) = a2. Compute the state values V(s1) and V(s2) under the given policy. [CO 2, BTL 3]
Ans: Using the Bellman equation under the given policy:
V(s1) = R(s1, a1) + γ[0.8 V(s1) + 0.2 V(s2)]
V(s2) = R(s2, a2) + γ[0.4 V(s1) + 0.6 V(s2)]
Substituting the given values:
- V(s1) = 5 + 0.9[0.8 V(s1) + 0.2 V(s2)]  ⇒  0.28 V(s1) − 0.18 V(s2) = 5
- V(s2) = 10 + 0.9[0.4 V(s1) + 0.6 V(s2)]  ⇒  −0.36 V(s1) + 0.46 V(s2) = 10
Solving this 2×2 linear system gives V(s1) ≈ 64.06 and V(s2) ≈ 71.88.
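The 2×2 system can be checked quickly with NumPy:

```python
import numpy as np

#  0.28 V(s1) - 0.18 V(s2) = 5
# -0.36 V(s1) + 0.46 V(s2) = 10
A = np.array([[0.28, -0.18],
              [-0.36, 0.46]])
b = np.array([5.0, 10.0])
print(np.linalg.solve(A, b))   # [64.0625 71.875 ]
```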

6. Consider the 4×4 gridworld shown below. If π is the equiprobable random policy, what is qπ(11, down)? What is qπ(7, down)? [CO 2, BTL 3]
Ans: Use
qπ(s, a) = Σ_{s'} P(s' | s, a)[R + γ Vπ(s')]
For the equiprobable random policy, plug in the known Vπ values (given in the problem or precomputed by policy evaluation) for the state reached from state 11 by moving down and from state 7 by moving down, and evaluate with the transition model. For example, with the standard undiscounted values for this gridworld (Vπ of the terminal state = 0 and Vπ(11) = −14), qπ(11, down) = −1 + 0 = −1 and qπ(7, down) = −1 + Vπ(11) = −15.
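A small helper that applies this formula, assuming the transition model and the state values Vπ are supplied as dictionaries; the names and the illustrative call are mine, not from the problem:

```python
def q_value(state, action, P, R, V, gamma=1.0):
    """q_pi(s, a) = sum over s' of P(s'|s, a) * (R + gamma * V(s'))."""
    return sum(prob * (R + gamma * V[s_next])
               for s_next, prob in P[(state, action)].items())

# Illustrative use: a deterministic 'down' move from state 11 into the terminal state,
# reward -1 per step, undiscounted, V[terminal] = 0  ->  q = -1
P = {(11, "down"): {"terminal": 1.0}}
V = {"terminal": 0.0}
print(q_value(11, "down", P, R=-1, V=V, gamma=1.0))
```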

7. Consider a 4×12 GridWorld where:
- The start state is at the bottom-left corner (S).
- The goal state is at the bottom-right corner (G).
- The agent receives a reward of −1 for every step.
- Falling into the "cliff" (the cells between S and G) results in a reward of −100, and the agent is sent back to S.
- Discount factor γ = 0.9.
- Learning rate α = 0.5.
Using the following episode, update the Q-values for SARSA. Assume ε = 0.1 for an ε-greedy policy. [CO 3, BTL 3]
Ans: SARSA update:
Q(s, a) = Q(s, a) + α[r + γ Q(s', a') − Q(s, a)]
Apply this update step by step to each (s, a, r, s', a') tuple of the episode, with actions selected by the ε-greedy policy (ε = 0.1).
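A minimal sketch of the SARSA update applied over an episode. The Q-table layout, state encoding, and the two example transitions are assumptions for illustration, since the episode data is not reproduced in the text:

```python
from collections import defaultdict

def sarsa_update(Q, episode, alpha=0.5, gamma=0.9):
    """episode: list of (s, a, r, s_next, a_next) tuples, in order."""
    for s, a, r, s_next, a_next in episode:
        td_target = r + gamma * Q[(s_next, a_next)]   # uses the action actually taken next
        Q[(s, a)] += alpha * (td_target - Q[(s, a)])
    return Q

Q = defaultdict(float)
episode = [((3, 0), "up", -1, (2, 0), "right"),       # illustrative steps only
           ((2, 0), "right", -1, (2, 1), "right")]
sarsa_update(Q, episode)
print(dict(Q))
```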

8. An agent navigates a 3×3 grid. Each cell has a deterministic transition.
- Rewards: +10 for reaching the goal state (G) at (2,2); −1 for every move.
- Actions: up, down, left, right.
- Discount factor γ = 0.9.
- Learning rate α = 0.5.
Using the following episode, update the Q-values for Q-learning. Assume ε = 0.1 for an ε-greedy policy. [CO 3, BTL 3]
Ans: Q-learning update:
Q(s, a) = Q(s, a) + α[r + γ max_a' Q(s', a') − Q(s, a)]
Apply this for each step of the episode, using the maximum Q-value over actions in the next state rather than the Q-value of the next action actually taken (as in SARSA).
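For contrast with SARSA, a sketch of the Q-learning update, which bootstraps from the greedy (max) action in the next state; the example transitions are illustrative, since the episode is not reproduced in the text:

```python
from collections import defaultdict

ACTIONS = ["up", "down", "left", "right"]

def q_learning_update(Q, episode, alpha=0.5, gamma=0.9):
    """episode: list of (s, a, r, s_next) tuples; the next action is not needed."""
    for s, a, r, s_next in episode:
        best_next = max(Q[(s_next, a_next)] for a_next in ACTIONS)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q

Q = defaultdict(float)
episode = [((0, 0), "right", -1, (0, 1)),   # illustrative steps only
           ((0, 1), "down", -1, (1, 1))]
q_learning_update(Q, episode)
print(dict(Q))
```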

9. Driving Home: Each day as you drive home from work, you try to predict how long it will take to get home. When you leave your office, you note the time, the day of week, the weather, and anything else that might be relevant. Say on this Friday you are leaving at exactly 6 o’clock, and you estimate that it will take 30 minutes to get home. As you reach your car it is 6:05, and you notice it is starting to rain. Traffic is often slower in the rain, so you reestimate that it will take 35 minutes from then, or a total of 40 minutes. Fifteen minutes later you have completed the highway portion of your journey in good time. As you exit onto a secondary road you cut your estimate of total travel time to 35 minutes. Unfortunately, at this point you get stuck behind a slow truck, and the road is too narrow to pass. You end up having to follow the truck until you turn onto the side street where you live at 6:40. Three minutes later you are home. The sequence of states, times, and predictions is thus as follows:
Use the Monte Carlo method to plot the predicted total time. [CO 3, BTL 3]
Ans:
- Track the states visited along the way (leaving the office at 6:00, reaching the car at 6:05, exiting the highway, following the truck, arriving home).
- The actual return is the total travel time of 43 minutes (6:00 to 6:43).
- First-visit MC: average the returns observed for each state; with a constant step size, each state's prediction is shifted toward the actual outcome of 43 minutes.
- Plot the predicted total time at each state against the actual outcome to visualize convergence.
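A sketch of a constant-α Monte Carlo update of the total-time predictions. It uses only the predictions that appear in the story (30, 40, 35 minutes); the remaining rows of the sequence table are elided in the text, so they are not filled in here:

```python
def mc_update(predictions, actual_return, alpha=0.5):
    """Shift every state's predicted total time toward the actual outcome."""
    return [p + alpha * (actual_return - p) for p in predictions]

# Predicted total travel time at the states mentioned in the story
predictions = [30, 40, 35]   # leaving office, reaching car, exiting highway
actual = 43                  # total time actually taken (6:00 -> 6:43)
print(mc_update(predictions, actual))   # every prediction moves toward 43
```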

10. Design an agent that can balance a pole on a cart by applying forces (left or right) to the cart, for the CartPole reinforcement learning problem, using a Deep Q-Network (DQN). [CO 4, BTL 4]
Ans:
- State: cart position, cart velocity, pole angle, pole angular velocity.
- Actions: push left, push right.
- Use a replay buffer (experience replay) and a target network that is updated periodically.
- The DQN learns Q(s, a) with a neural network trained on the Bellman (TD) loss.
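A compact PyTorch-style sketch of the DQN pieces mentioned above (Q-network, target network, Bellman/TD loss). Layer sizes, the learning rate, and the minibatch format are illustrative assumptions; the replay buffer and environment loop are omitted:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a CartPole state (4 numbers) to Q-values for the 2 actions."""
    def __init__(self, state_dim=4, n_actions=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, n_actions))

    def forward(self, x):
        return self.net(x)

q_net, target_net = QNetwork(), QNetwork()
target_net.load_state_dict(q_net.state_dict())   # periodically re-copied from q_net
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def dqn_loss(batch, gamma=0.99):
    """Bellman (TD) loss on a minibatch sampled from the replay buffer."""
    states, actions, rewards, next_states, dones = batch
    q = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        q_next = target_net(next_states).max(dim=1).values
        target = rewards + gamma * q_next * (1 - dones)
    return nn.functional.mse_loss(q, target)
```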

11. A mobile robot is navigating a 5×5 gridworld to collect objects scattered across the environment while avoiding obstacles. Develop a Deep Q-Network (DQN) to learn an optimal policy for navigating the grid. [CO 4, BTL 4]
Ans:
- Input: the grid observation.
- Output: Q-values for the four move directions.
- Rewards: +1 for collecting an object, −1 for hitting an obstacle, 0 otherwise.
Train with an ε-greedy behavior policy and the Bellman loss to obtain an optimal navigation policy.
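One possible way to turn the 5×5 grid observation into a network input is to stack one-hot channels for the robot, the objects, and the obstacles, then flatten. This encoding is my assumption; the question does not fix one:

```python
import numpy as np

def encode_grid(robot, objects, obstacles, size=5):
    """Return a flat (3*size*size,) float vector: robot / object / obstacle channels."""
    channels = np.zeros((3, size, size), dtype=np.float32)
    channels[0][robot] = 1.0
    for cell in objects:
        channels[1][cell] = 1.0
    for cell in obstacles:
        channels[2][cell] = 1.0
    return channels.reshape(-1)   # feed this 75-dimensional vector to the Q-network

x = encode_grid(robot=(0, 0), objects=[(2, 3), (4, 4)], obstacles=[(1, 1)])
print(x.shape)   # (75,)
```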

12. A self-driving taxi operates in a 5×5 grid city. It needs to pick up passengers from random locations and drop them off at their destinations. Show how the taxi uses a DQN agent to learn the optimal policy. [CO 4, BTL 4]
Ans:
- State: taxi position, passenger location, destination.
- Actions: move, pick up, drop off.
- Rewards: −1 per step, +20 for a successful drop-off, −10 for an illegal action.
The DQN is trained with experience replay and Q-learning updates.
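A sketch of the remaining training machinery named here, experience replay and ε-greedy action selection; the buffer capacity, batch size, and ε value are illustrative:

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (state, action, reward, next_state, done) transitions for minibatch sampling."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size=32):
        return random.sample(self.buffer, batch_size)

def epsilon_greedy(q_values, n_actions, epsilon=0.1):
    """q_values: sequence of Q(s, a) estimates for the current state."""
    if random.random() < epsilon:
        return random.randrange(n_actions)          # explore
    return max(range(n_actions), key=lambda a: q_values[a])   # exploit
```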

13. Two teams of agents are playing a simplified soccer game in a grid environment. Each agent learns its own policy using REINFORCE. Apply policy gradient methods. [CO 5, BTL 5]
Ans:
- Each agent collects its own (state, action, reward) trajectory.
- Apply the update: θ = θ + α ∇θ log πθ(a|s) G_t
This optimizes each policy directly from sampled returns (the REINFORCE algorithm).
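A PyTorch-style sketch of one REINFORCE update for a single agent, given the log-probabilities and rewards it collected along its trajectory; the policy architecture, optimizer, and discount value are assumptions:

```python
import torch

def reinforce_update(log_probs, rewards, optimizer, gamma=0.99):
    """log_probs: list of log pi_theta(a_t|s_t) tensors; rewards: list of floats."""
    returns, G = [], 0.0
    for r in reversed(rewards):              # compute G_t backwards through the episode
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)
    loss = -(torch.stack(log_probs) * returns).sum()   # ascend on log pi * G_t
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```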

14. A robot is tasked with navigating through a dynamic environment where obstacles move unpredictably. Use the Actor-Critic method to learn an optimal policy. [CO 5, BTL 5]
Ans:
- Actor: learns the policy π(a|s; θ).
- Critic: learns the value function V(s; w).
- Update the actor using the TD error as the advantage: ∇θ log π(a|s) [r + γ V(s') − V(s)]
- Update the critic so that V(s) moves toward the TD target r + γ V(s').
Because updates are made at every step rather than at the end of an episode, the method is effective for real-time adaptation in dynamic environments.
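A sketch of the one-step actor-critic update, with the TD error driving both the critic's loss and the actor's advantage. The actor and critic modules, their optimizers, and a scalar log-probability are assumed to exist with the usual PyTorch interfaces:

```python
import torch

def actor_critic_step(critic, opt_actor, opt_critic,
                      s, a_log_prob, r, s_next, done, gamma=0.99):
    """One online update: the TD error updates the critic and weights the actor's gradient."""
    v = critic(s)
    with torch.no_grad():
        v_next = torch.zeros(1) if done else critic(s_next)
        td_target = r + gamma * v_next
    td_error = td_target - v

    critic_loss = td_error.pow(2).mean()          # move V(s) toward r + gamma V(s')
    opt_critic.zero_grad()
    critic_loss.backward()
    opt_critic.step()

    actor_loss = -a_log_prob * td_error.detach()  # policy gradient with TD-error advantage
    opt_actor.zero_grad()
    actor_loss.backward()
    opt_actor.step()
```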

15. Apply policy gradient methods to solve the CartPole problem. [CO 5, BTL 5]
Ans:
- Use the collected trajectories to compute the return G_t at each time step.
- Update: θ = θ + α Σ_t ∇θ log π(a_t|s_t) G_t
No value function is needed; the policy is learned directly from episode returns.
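A sketch of the policy network and action sampling for CartPole; the sampled log-probabilities and the returns G_t are then plugged into the update θ = θ + α Σ_t ∇θ log π(a_t|s_t) G_t. Layer sizes and the random placeholder state are illustrative assumptions:

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class PolicyNetwork(nn.Module):
    """Maps a CartPole state (4 numbers) to a distribution over the 2 actions."""
    def __init__(self, state_dim=4, n_actions=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, n_actions))

    def forward(self, state):
        return Categorical(logits=self.net(state))

policy = PolicyNetwork()
state = torch.randn(4)               # placeholder for an observed CartPole state
dist = policy(state)
action = dist.sample()               # action to send to the environment
log_prob = dist.log_prob(action)     # stored for the policy-gradient update
```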
