
CSE (AI & ML)

Course Code: MR201CS0244 (R20)    Course Name: REINFORCEMENT LEARNING


QUESTION BANK
Qno Question Marks Section
1 Apply the PAC learning framework to design a binary classifier for a 12 Section-I
given dataset and determine the minimum number of training
examples required for a specific level of confidence and accuracy.
2 Given a multi-armed bandit scenario with five arms and their 12 Section-I
respective reward distributions, apply the Upper Confidence Bound
(UCB) algorithm to select the best arm for maximizing cumulative
rewards.
3 Analyze the impact of increasing the complexity of the hypothesis 12 Section-I
space on the sample complexity and the generalization
performance in Probably Approximately Correct (PAC) learning.
4 a. Compare and contrast the strategies used by the Upper 6 Section-I
Confidence Bound (UCB) algorithm and other bandit
algorithms for balancing exploration and exploitation.
b. Describe the role of sample complexity (m) in the PAC learning 6
framework and how it affects the learning process.
5 Evaluate the effectiveness of the Upper Confidence Bound (UCB) 12 Section-I
algorithm in real-world scenarios with non-stationary reward
distributions, discussing its strengths and limitations.
6 Design an improved variant of the Upper Confidence Bound (UCB) 12 Section-I
algorithm that dynamically adjusts the exploration rate based on
the feedback received from the environment.
7 a. Explain the difference between PAC learning and UCB 6 Section-I
algorithm in terms of their fundamental purposes and
problem settings.
b. Name the exploration-exploitation trade-off problem that the 6
Upper Confidence Bound (UCB) algorithm aims to address.
8 In what ways can bandit algorithms be adapted to handle situations 12 Section-I
where the rewards are not immediately observable, but rather,
manifest as delayed feedback or indirect consequences?
9 a. Illustrate the key components of the Probably Approximately 6 Section-I
Correct (PAC) learning framework.
b. Outline the primary objective of bandit algorithms in the 6
context of reinforcement learning.
10 Discuss a real-world application where bandit algorithms have 12 Section-I
been successfully used, and explain the benefits of employing such
algorithms in that context.
11 Given a set of k arms with different reward distributions, apply the 12 Section-II
Median Elimination Algorithm to identify the optimal arm based on
the provided sample means.
12 Assess the efficiency of the Median Elimination algorithm 12 Section-II
compared to other advanced bandit algorithms for bandit problems
with a large number of arms.
13 Evaluate the potential real-time applications of the Policy Gradient 12 Section-II
algorithm in various domains, and discuss the challenges it may
face in certain scenarios.
14 Design an experiment to evaluate the performance of the Median 12 Section-II
Elimination algorithm on a simulated multi-armed bandit problem
with different reward distributions.
15 Create a new variant of the Policy Gradient algorithm that 12 Section-II
incorporates a baseline technique to reduce variance in the policy
gradient estimates.
16 How does the Policy Gradient algorithm handle continuous action 12 Section-II
spaces in bandit problems? What are some advantages of using
policy gradient methods in such scenarios?
17 a. Analyze how the Policy Gradient algorithm can be adapted to 6 Section-II
handle continuous action spaces in bandit problems.
b. Compare and contrast the Median Elimination algorithm and 6
the Policy Gradient algorithm in terms of their strengths and
weaknesses when applied to bandit problems.
18 Describe the concept of the exploration-exploitation trade-off in 12 Section-II
bandit problems. How does the Policy Gradient algorithm handle
this trade-off?
19 Consider a real-world application where the reward distributions 12 Section-II
in a bandit problem change over time (non-stationary). How could
you adapt the Median Elimination algorithm to cope with this
dynamic environment?
20 a. Recall the key steps involved in the Median Elimination 6 Section-II
algorithm for bandit problems.
b. Outline the two specific advanced bandit algorithms used to 6
solve multi-armed bandit problems.
21 Implement a basic RL algorithm to update the policy of an agent 12 Section-III
based on Q-learning.
22 Design a simple MDP for a robotic agent navigating through a grid- 12 Section-III
based environment with rewards and penalties.
23 Assess the strengths and weaknesses of using deep neural 12 Section-III
networks as function approximators in RL algorithms.
24 Critique the effectiveness of the reward function in shaping the 12 Section-III
behaviour of an RL agent in a complex environment.
25 Design an RL framework for a real-world problem of your choice, 12 Section-III
specifying the state space, action space, and reward function.
26 Devise a novel algorithm that combines elements of both model- 12 Section-III
based and model-free RL approaches.
27 Given a scenario, analyze the impact of changing the discount factor 12 Section-III
(γ) on the agent's decision-making process.
28 Design a simple MDP for a robotic agent navigating through a grid- 12 Section-III
based environment with rewards and penalties.
29 Describe the role of the reward function in RL and its importance in 12 Section-III
shaping agent behaviour.
30 a. Compare and contrast value iteration and policy iteration 6 Section-III
methods for solving MDPs in RL.
b. Explain how reinforcement learning differs from supervised 6
and unsupervised learning.
31 Assess the effectiveness of Dynamic Programming methods for 12 Section-IV
solving large-scale RL problems compared to other approaches,
such as Monte Carlo methods.
32 Design a new RL algorithm that combines Dynamic Programming 12 Section-IV
and Temporal Difference methods to address a specific challenge
in a complex environment.
33 Create a novel RL scenario where the Bellman Optimality 12 Section-IV
equation needs to be modified to accommodate additional
constraints.
34 Analyze how the Bellman Optimality equation changes when the 12 Section-IV
environment has stochastic transitions and rewards.
35 Apply Temporal Difference learning to update the value function 12 Section-IV
for a specific state in an RL task.
36 Given a simple RL environment, demonstrate how you would 12 Section-IV
apply Dynamic Programming methods to find the optimal value
function.
37 a. How does the Bellman Optimality equation help in finding 6 Section-IV
the optimal policy in RL problems?
b. Explain the fundamental difference between Dynamic 6
Programming and Temporal Difference methods for RL.
38 Compare the exploration-exploitation dilemma in Temporal 12 Section-IV
Difference learning with the concept of "horizon" in Dynamic
Programming. How do these two aspects impact the learning
process and decision-making in RL?
39 How does the concept of "Bellman backup" play a crucial role in 12 Section-IV
both Dynamic Programming and Temporal Difference methods?
Can you provide an example of how this backup process is
applied in a specific RL scenario?
40 The Bellman Optimality equation is a fundamental concept in RL. 12 Section-IV
How does it mathematically express the principle of optimality,
and how is it used to find the optimal policy in a Markov Decision
Process (MDP)?
41 Design an RL agent to navigate a grid world using Fitted Q- 12 Section-V
learning with function approximation.
42 Implement the Deep Q-Network (DQN) algorithm to solve a 12 Section-V
continuous state space problem.
43 Develop a Policy Gradient algorithm to train a robotic arm to 12 Section-V
reach a target in a simulated environment.
44 Analyze the impact of using different function approximation 12 Section-V
architectures in Fitted Q-learning.
45 Assess the effectiveness of using Eligibility Traces for updating Q- 12 Section-V
values in a dynamic environment.
46 Evaluate the performance of Deep Q-Network (DQN) compared to 12 Section-V
Fitted Q-learning in a grid world scenario with a large state space.
47 Devise a novel function approximation method for handling 12 Section-V
continuous state spaces in RL.
48 a. Compare the advantages and disadvantages of Eligibility 6 Section-V
Traces and Function Approximation in RL.
b. How does Fitted Q-learning leverage the concept of 6
experience replay?
49 a. What are the main advantages and limitations of Fitted Q- 6 Section-V
learning compared to DQN?
b. In which scenarios would you prefer to use Fitted Q-learning 6
over DQN and vice versa?
50 How do Policy Gradient algorithms and Least Squares Methods 12 Section-V
handle the exploration-exploitation trade-off differently?
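
ILLUSTRATIVE REFERENCE SKETCHES

The short sketches below illustrate, under clearly stated assumptions, the main formulas and algorithms referenced in the questions above. They are outlines, not model answers.

Questions 1, 3, 4(b), and 9(a) rest on the PAC sample-complexity bound. For a finite hypothesis space H, accuracy ε, and confidence 1 − δ, the standard bound for a consistent learner in the realizable setting (assumed here to be the variant used in the course) is

m \ge \frac{1}{\epsilon}\left(\ln\lvert H\rvert + \ln\frac{1}{\delta}\right),

so the required number of training examples grows linearly in 1/ε and only logarithmically in |H| and 1/δ.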
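
Questions 2, 5, 6, and 7(b) concern the Upper Confidence Bound algorithm. A minimal UCB1 sketch in Python follows; the Bernoulli reward model and arm probabilities in the usage example are illustrative assumptions, not part of the question bank:

# Minimal UCB1 sketch for a k-armed bandit.
import math
import random

def ucb1(pull, k, horizon, c=2.0):
    """Run UCB1 for `horizon` steps; `pull(arm)` returns a stochastic reward."""
    counts = [0] * k            # number of times each arm has been pulled
    means = [0.0] * k           # running mean reward per arm
    total = 0.0
    for t in range(1, horizon + 1):
        if t <= k:              # play every arm once to initialise its estimate
            arm = t - 1
        else:                   # otherwise pick the arm with the largest UCB index
            arm = max(range(k),
                      key=lambda a: means[a] + math.sqrt(c * math.log(t) / counts[a]))
        r = pull(arm)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]   # incremental mean update
        total += r
    return means, counts, total

# Hypothetical five-armed example with Bernoulli rewards.
probs = [0.1, 0.3, 0.5, 0.7, 0.9]
means, counts, total = ucb1(lambda a: 1.0 if random.random() < probs[a] else 0.0,
                            k=5, horizon=10_000)
print(counts, round(total))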
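
Questions 11, 12, 14, and 20(a) refer to the Median Elimination algorithm. A compact sketch of the (ε, δ)-PAC elimination loop; the per-round sampling schedule follows the usual textbook choice, and the exact constants should be treated as an assumption about the course's presentation:

# Median Elimination sketch: returns an eps-optimal arm with probability >= 1 - delta.
import math

def median_elimination(pull, arms, eps, delta):
    arms = list(arms)
    eps_l, delta_l = eps / 4.0, delta / 2.0
    while len(arms) > 1:
        n = int((4.0 / eps_l ** 2) * math.log(3.0 / delta_l)) + 1   # samples per arm
        means = {a: sum(pull(a) for _ in range(n)) / n for a in arms}
        # Discard (roughly) the half of the arms below the empirical median;
        # sorting keeps the update well defined when sample means tie.
        arms = sorted(arms, key=lambda a: means[a], reverse=True)[:(len(arms) + 1) // 2]
        eps_l, delta_l = 0.75 * eps_l, delta_l / 2.0
    return arms[0]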
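
Questions 13 and 15-18 (and, later, 43 and 50) involve policy-gradient methods. Below is a gradient-bandit sketch with a running average-reward baseline used to reduce the variance of the gradient estimate; the step size and reward model are illustrative assumptions:

# Gradient-bandit (policy gradient) sketch with an average-reward baseline.
import math
import random

def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    s = sum(exps)
    return [e / s for e in exps]

def gradient_bandit(pull, k, horizon, alpha=0.1):
    prefs = [0.0] * k           # action preferences H(a)
    baseline = 0.0              # running average reward (variance-reducing baseline)
    for t in range(1, horizon + 1):
        pi = softmax(prefs)
        arm = random.choices(range(k), weights=pi)[0]
        r = pull(arm)
        baseline += (r - baseline) / t
        for a in range(k):      # stochastic policy-gradient update of the preferences
            indicator = 1.0 if a == arm else 0.0
            prefs[a] += alpha * (r - baseline) * (indicator - pi[a])
    return softmax(prefs)       # final action-selection probabilities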
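
Questions 34, 37(a), and 40 turn on the Bellman optimality equation. For a finite MDP with transition probabilities P(s' | s, a), rewards R(s, a, s'), and discount factor γ (the exact reward notation is an assumption about the course's convention), the equation reads

V^*(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\,\bigl[ R(s, a, s') + \gamma\, V^*(s') \bigr],

and equivalently, in action-value form,

Q^*(s, a) = \sum_{s'} P(s' \mid s, a)\,\bigl[ R(s, a, s') + \gamma \max_{a'} Q^*(s', a') \bigr].

The optimal policy is then obtained greedily as π*(s) = argmax_a Q*(s, a).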
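
Questions 30(a) and 36 ask for Dynamic Programming on a small MDP. A value-iteration sketch that repeatedly applies the Bellman optimality backup until convergence; the (prob, next_state, reward) transition layout is an assumed data format, not a prescribed one:

# Value-iteration sketch for a small finite MDP.
def value_iteration(P, gamma=0.9, tol=1e-8):
    """P[s][a] is a list of (prob, next_state, reward) triples (assumed layout)."""
    n_states = len(P)
    V = [0.0] * n_states
    while True:
        delta = 0.0
        for s in range(n_states):
            # Bellman optimality backup at state s
            best = max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                       for a in range(len(P[s])))
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    # Greedy policy extraction from the converged value function
    policy = [max(range(len(P[s])),
                  key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
              for s in range(n_states)]
    return V, policy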
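
Question 21 asks for a basic Q-learning update. A tabular sketch with an ε-greedy behaviour policy; the env.reset()/env.step() interface returning (next_state, reward, done) is a hypothetical assumption, not a specific library's API:

# Tabular Q-learning sketch with epsilon-greedy exploration.
import random
from collections import defaultdict

def q_learning(env, n_actions, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    Q = defaultdict(lambda: [0.0] * n_actions)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection (behaviour policy)
            a = random.randrange(n_actions) if random.random() < eps else max(
                range(n_actions), key=lambda i: Q[s][i])
            s2, r, done = env.step(a)
            # TD target uses the greedy (max) action value in the next state
            target = r + (0.0 if done else gamma * max(Q[s2]))
            Q[s][a] += alpha * (target - Q[s][a])   # update towards the TD error
            s = s2
    return Q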
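
Questions 35 and 45 involve Temporal Difference updates and eligibility traces. A TD(λ) sketch with accumulating traces for state-value prediction under a fixed policy; the same hypothetical environment interface as above is assumed, and `policy` is any callable mapping a state to an action:

# TD(lambda) sketch with accumulating eligibility traces.
from collections import defaultdict

def td_lambda(env, policy, episodes=500, alpha=0.1, gamma=0.99, lam=0.9):
    V = defaultdict(float)
    for _ in range(episodes):
        traces = defaultdict(float)
        s, done = env.reset(), False
        while not done:
            s2, r, done = env.step(policy(s))
            delta = r + (0.0 if done else gamma * V[s2]) - V[s]   # TD error
            traces[s] += 1.0                                       # accumulate trace
            for x in list(traces):
                V[x] += alpha * delta * traces[x]                  # credit all visited states
                traces[x] *= gamma * lam                           # decay traces
            s = s2
    return V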
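
Questions 41, 44, 48, and 49 concern Fitted Q-learning with function approximation, and Question 50 mentions least-squares methods. A fitted Q-iteration sketch with a linear least-squares regressor over a user-supplied feature map; the batch format and the feature map phi are illustrative assumptions:

# Fitted Q-iteration sketch with a linear least-squares regressor.
import numpy as np

def fitted_q_iteration(batch, phi, n_actions, gamma=0.99, iters=50):
    """batch: list of (s, a, r, s2, done); phi(s, a) -> 1-D numpy feature vector."""
    d = len(phi(batch[0][0], batch[0][1]))
    w = np.zeros(d)
    for _ in range(iters):
        X, y = [], []
        for s, a, r, s2, done in batch:
            # Regression target: one-step Bellman backup under the current weights
            q_next = 0.0 if done else max(phi(s2, b) @ w for b in range(n_actions))
            X.append(phi(s, a))
            y.append(r + gamma * q_next)
        X, y = np.asarray(X), np.asarray(y)
        w, *_ = np.linalg.lstsq(X, y, rcond=None)   # refit Q(s, a) ≈ phi(s, a) · w
    return w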
