CSE (AI & ML)
Course Code: MR201CS0244 (R20)    Course Name: REINFORCEMENT LEARNING
QUESTION BANK
(Questions are grouped by section; marks for each question or sub-question are shown in parentheses.)

Section-I
1. Apply the PAC learning framework to design a binary classifier for a given dataset and determine the minimum number of training examples required for a specific level of confidence and accuracy. (12 marks)
2. Given a multi-armed bandit scenario with five arms and their respective reward distributions, apply the Upper Confidence Bound (UCB) algorithm to select the best arm for maximizing cumulative rewards. (12 marks)
3. Analyze the impact of increasing the complexity of the hypothesis space on the sample complexity and the generalization performance in Probably Approximately Correct (PAC) learning. (12 marks)
4. a. Compare and contrast the strategies used by the Upper Confidence Bound (UCB) algorithm and other bandit algorithms for balancing exploration and exploitation. (6 marks)
   b. Describe the role of sample complexity (m) in the PAC learning framework and how it affects the learning process. (6 marks)
5. Evaluate the effectiveness of the Upper Confidence Bound (UCB) algorithm in real-world scenarios with non-stationary reward distributions, discussing its strengths and limitations. (12 marks)
6. Design an improved variant of the Upper Confidence Bound (UCB) algorithm that dynamically adjusts the exploration rate based on the feedback received from the environment. (12 marks)
7. a. Explain the difference between PAC learning and the UCB algorithm in terms of their fundamental purposes and problem settings. (6 marks)
   b. Name the exploration-exploitation trade-off problem that the Upper Confidence Bound (UCB) algorithm aims to address. (6 marks)
8. In what ways can bandit algorithms be adapted to handle situations where the rewards are not immediately observable, but rather manifest as delayed feedback or indirect consequences? (12 marks)
9. a. Illustrate the key components of the Probably Approximately Correct (PAC) learning framework. (6 marks)
   b. Outline the primary objective of bandit algorithms in the context of reinforcement learning. (6 marks)
10. Discuss a real-world application where bandit algorithms have been successfully used, and explain the benefits of employing such algorithms in that context. (12 marks)
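Reference sketch for the Section-I bandit questions above (e.g., Q2 and Q6): a minimal UCB1 selection loop on Bernoulli arms. The five success probabilities, the horizon, and the sqrt(2 ln t / n) exploration bonus are illustrative assumptions chosen for this sketch, not values supplied by any question.

    import math
    import random

    def ucb1(arm_probs, horizon=10_000, seed=0):
        """Run UCB1 on Bernoulli arms; return per-arm pull counts and estimated means."""
        rng = random.Random(seed)
        k = len(arm_probs)
        counts = [0] * k          # times each arm was pulled
        means = [0.0] * k         # running average reward per arm
        for t in range(1, horizon + 1):
            if t <= k:            # pull every arm once first
                arm = t - 1
            else:                 # then pick the arm with the highest UCB index
                arm = max(range(k),
                          key=lambda a: means[a] + math.sqrt(2 * math.log(t) / counts[a]))
            reward = 1.0 if rng.random() < arm_probs[arm] else 0.0
            counts[arm] += 1
            means[arm] += (reward - means[arm]) / counts[arm]   # incremental mean update
        return counts, means

    # Example: five arms with assumed success probabilities.
    counts, means = ucb1([0.10, 0.25, 0.40, 0.55, 0.70])
    print(counts, [round(m, 3) for m in means])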
Section-II
11. Given a set of k arms with different reward distributions, apply the Median Elimination algorithm to identify the optimal arm based on the provided sample means. (12 marks)
12. Assess the efficiency of the Median Elimination algorithm compared to other advanced bandit algorithms for bandit problems with a large number of arms. (12 marks)
13. Evaluate the potential real-time applications of the Policy Gradient algorithm in various domains, and discuss the challenges it may face in certain scenarios. (12 marks)
14. Design an experiment to evaluate the performance of the Median Elimination algorithm on a simulated multi-armed bandit problem with different reward distributions. (12 marks)
15. Create a new variant of the Policy Gradient algorithm that incorporates a baseline technique to reduce variance in the policy gradient estimates. (12 marks)
16. How does the Policy Gradient algorithm handle continuous action spaces in bandit problems? What are some advantages of using policy gradient methods in such scenarios? (12 marks)
17. a. Analyze how the Policy Gradient algorithm can be adapted to handle continuous action spaces in bandit problems. (6 marks)
    b. Compare and contrast the Median Elimination algorithm and the Policy Gradient algorithm in terms of their strengths and weaknesses when applied to bandit problems. (6 marks)
18. Describe the concept of the exploration-exploitation trade-off in bandit problems. How does the Policy Gradient algorithm handle this trade-off? (12 marks)
19. Consider a real-world application where the reward distributions in a bandit problem change over time (non-stationary). How could you adapt the Median Elimination algorithm to cope with this dynamic environment? (12 marks)
20. a. Recall the key steps involved in the Median Elimination algorithm for bandit problems. (6 marks)
    b. Outline the two specific advanced bandit algorithms used to solve multi-armed bandit problems. (6 marks)

Section-III
21. Implement a basic RL algorithm to update the policy of an agent based on Q-learning. (12 marks)
22. Design a simple MDP for a robotic agent navigating through a grid-based environment with rewards and penalties. (12 marks)
23. Assess the strengths and weaknesses of using deep neural networks as function approximators in RL algorithms. (12 marks)
24. Critique the effectiveness of the reward function in shaping the behaviour of an RL agent in a complex environment. (12 marks)
25. Design an RL framework for a real-world problem of your choice, specifying the state space, action space, and reward function. (12 marks)
26. Devise a novel algorithm that combines elements of both model-based and model-free RL approaches. (12 marks)
27. Given a scenario, analyze the impact of changing the discount factor (γ) on the agent's decision-making process. (12 marks)
28. Design a simple MDP for a robotic agent navigating through a grid-based environment with rewards and penalties. (12 marks)
29. Describe the role of the reward function in RL and its importance in shaping agent behaviour. (12 marks)
30. a. Compare and contrast value iteration and policy iteration methods for solving MDPs in RL. (6 marks)
    b. Explain how reinforcement learning differs from supervised and unsupervised learning. (6 marks)
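Reference sketch for the Section-II policy gradient questions (e.g., Q15 and Q16): a gradient-bandit learner with a softmax policy and a running average-reward baseline for variance reduction. The arm probabilities, step size alpha, and number of steps are illustrative assumptions.

    import math
    import random

    def gradient_bandit(arm_probs, steps=5000, alpha=0.1, seed=0):
        """Softmax policy over arms, updated by REINFORCE with an average-reward baseline."""
        rng = random.Random(seed)
        k = len(arm_probs)
        prefs = [0.0] * k          # action preferences H(a)
        baseline = 0.0             # running average reward (variance-reduction baseline)
        for t in range(1, steps + 1):
            exps = [math.exp(h) for h in prefs]
            total = sum(exps)
            probs = [e / total for e in exps]              # softmax policy pi(a)
            arm = rng.choices(range(k), weights=probs)[0]
            reward = 1.0 if rng.random() < arm_probs[arm] else 0.0
            baseline += (reward - baseline) / t            # update the baseline
            for a in range(k):                             # gradient ascent on preferences
                indicator = 1.0 if a == arm else 0.0
                prefs[a] += alpha * (reward - baseline) * (indicator - probs[a])
        return prefs

    print(gradient_bandit([0.2, 0.5, 0.8]))

Subtracting the baseline leaves the expected gradient unchanged while shrinking the variance of each update, which is the effect Q15 asks to exploit.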
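Reference sketch for the Section-III questions on Q-learning in a grid world (e.g., Q21 and Q22): tabular epsilon-greedy Q-learning on an assumed 4x4 grid with a small step penalty and a terminal goal reward. The grid size, rewards, and hyperparameters are illustrative.

    import random

    # Assumed 4x4 grid: start at (0, 0), goal at (3, 3); -0.01 per step, +1.0 at the goal.
    ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

    def step(state, action):
        r, c = state
        dr, dc = ACTIONS[action]
        nr, nc = min(max(r + dr, 0), 3), min(max(c + dc, 0), 3)   # stay inside the grid
        done = (nr, nc) == (3, 3)
        return (nr, nc), (1.0 if done else -0.01), done

    def q_learning(episodes=2000, alpha=0.1, gamma=0.95, epsilon=0.1, seed=0):
        rng = random.Random(seed)
        Q = {(r, c): [0.0] * 4 for r in range(4) for c in range(4)}
        for _ in range(episodes):
            state, done = (0, 0), False
            while not done:
                if rng.random() < epsilon:                           # explore
                    action = rng.randrange(4)
                else:                                                # exploit
                    action = max(range(4), key=lambda a: Q[state][a])
                nxt, reward, done = step(state, action)
                target = reward + (0.0 if done else gamma * max(Q[nxt]))
                Q[state][action] += alpha * (target - Q[state][action])   # Q-learning update
                state = nxt
        return Q

    Q = q_learning()
    print(round(max(Q[(0, 0)]), 3))   # estimated value of the start state under the greedy policy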
Section-IV
31. Assess the effectiveness of Dynamic Programming methods for solving large-scale RL problems compared to other approaches, such as Monte Carlo methods. (12 marks)
32. Design a new RL algorithm that combines Dynamic Programming and Temporal Difference methods to address a specific challenge in a complex environment. (12 marks)
33. Create a novel RL scenario where the Bellman Optimality equation needs to be modified to accommodate additional constraints. (12 marks)
34. Analyze how the Bellman Optimality equation changes when the environment has stochastic transitions and rewards. (12 marks)
35. Apply Temporal Difference learning to update the value function for a specific state in an RL task. (12 marks)
36. Given a simple RL environment, demonstrate how you would apply Dynamic Programming methods to find the optimal value function. (12 marks)
37. a. How does the Bellman Optimality equation help in finding the optimal policy in RL problems? (6 marks)
    b. Explain the fundamental difference between Dynamic Programming and Temporal Difference methods for RL. (6 marks)
38. Compare the exploration-exploitation dilemma in Temporal Difference learning with the concept of "horizon" in Dynamic Programming. How do these two aspects impact the learning process and decision-making in RL? (12 marks)
39. How does the concept of the "Bellman backup" play a crucial role in both Dynamic Programming and Temporal Difference methods? Can you provide an example of how this backup process is applied in a specific RL scenario? (12 marks)
40. The Bellman Optimality equation is a fundamental concept in RL. How does it mathematically express the principle of optimality, and how is it used to find the optimal policy in a Markov Decision Process (MDP)? (12 marks)

Section-V
41. Design an RL agent to navigate a grid world using Fitted Q-learning with function approximation. (12 marks)
42. Implement the Deep Q-Network (DQN) algorithm to solve a continuous action space problem. (12 marks)
43. Develop a Policy Gradient algorithm to train a robotic arm to reach a target in a simulated environment. (12 marks)
44. Analyze the impact of using different function approximation architectures in Fitted Q-learning. (12 marks)
45. Assess the effectiveness of using Eligibility Traces for updating Q-values in a dynamic environment. (12 marks)
46. Evaluate the performance of Deep Q-Network (DQN) compared to Fitted Q-learning in a grid world scenario with a large state space. (12 marks)
47. Devise a novel function approximation method for handling continuous state spaces in RL. (12 marks)
48. a. Compare the advantages and disadvantages of Eligibility Traces and Function Approximation in RL. (6 marks)
    b. How does Fitted Q-learning leverage the concept of experience replay? (6 marks)
49. a. What are the main advantages and limitations of Fitted Q-learning compared to DQN? (6 marks)
    b. In which scenarios would you prefer to use Fitted Q-learning over DQN and vice versa? (6 marks)
50. How do Policy Gradient algorithms and Least Squares Methods handle the exploration-exploitation trade-off differently? (12 marks)
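Reference sketch for the Section-IV questions on Temporal Difference learning and the Bellman backup (e.g., Q35 and Q39): TD(0) value prediction for a uniformly random policy on the standard 5-state random walk. The chain, rewards, and step size are assumptions for illustration; each update is a sampled one-step Bellman backup.

    import random

    def td0_random_walk(episodes=5000, alpha=0.05, gamma=1.0, seed=0):
        """TD(0) prediction for a random policy on a 5-state random walk."""
        rng = random.Random(seed)
        V = [0.0] * 7           # states 0 and 6 are terminal, value fixed at 0
        for _ in range(episodes):
            s = 3               # start in the middle state
            while s not in (0, 6):
                s2 = s + rng.choice((-1, 1))             # move left or right with equal probability
                reward = 1.0 if s2 == 6 else 0.0         # +1 only for terminating on the right
                target = reward + gamma * (0.0 if s2 in (0, 6) else V[s2])   # one-step Bellman backup
                V[s] += alpha * (target - V[s])          # TD(0) update
                s = s2
        return [round(v, 2) for v in V[1:6]]

    print(td0_random_walk())   # true values for this chain are 1/6, 2/6, ..., 5/6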
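Reference sketch for the Section-V questions on Fitted Q-learning with function approximation (e.g., Q41 and Q44): batch fitted Q-iteration on an assumed 10-state chain, using one-hot linear features refit by least squares. All environment details, features, and hyperparameters are illustrative; swapping the one-hot features for polynomials, tile coding, or a neural network gives the alternative architectures Q44 asks about.

    import numpy as np

    # Assumed toy chain MDP: states 0..9, actions 0 (left) / 1 (right), +1 for reaching state 9.
    N_STATES, GAMMA = 10, 0.95

    def step(s, a):
        s2 = min(max(s + (1 if a == 1 else -1), 0), N_STATES - 1)
        done = s2 == N_STATES - 1
        return s2, (1.0 if done else 0.0), done

    def phi(s):
        f = np.zeros(N_STATES)      # one-hot state features; a richer architecture
        f[s] = 1.0                  # (polynomials, tiles, a neural net) would replace this
        return f

    def fitted_q(n_transitions=5000, iterations=60, seed=0):
        rng = np.random.default_rng(seed)
        batch, s = [], 0
        for _ in range(n_transitions):              # batch of transitions from a random policy
            a = int(rng.integers(2))
            s2, r, done = step(s, a)
            batch.append((s, a, r, s2, done))
            s = 0 if done else s2
        W = np.zeros((2, N_STATES))                 # one linear weight vector per action
        for _ in range(iterations):
            X, y = {0: [], 1: []}, {0: [], 1: []}
            for s, a, r, s2, done in batch:         # Bellman targets from the current Q estimate
                q_next = 0.0 if done else max(W[b] @ phi(s2) for b in (0, 1))
                X[a].append(phi(s))
                y[a].append(r + GAMMA * q_next)
            for a in (0, 1):                        # refit each action's weights by least squares
                W[a] = np.linalg.lstsq(np.array(X[a]), np.array(y[a]), rcond=None)[0]
        return W

    W = fitted_q()
    print([round(float(max(W[a] @ phi(s) for a in (0, 1))), 2) for s in range(N_STATES)])

Because the whole batch is reused at every refit, the sketch also illustrates the stored-experience idea behind Q48(b); a DQN replaces the least-squares fit with stochastic gradient steps on a neural network and samples minibatches from the same kind of replay buffer.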