Fundamentals of Reinforcement Learning: Learning Objectives
This document outlines the modules and lessons covered in a course on the fundamentals of reinforcement learning. Module 01 covers the k-armed bandit problem and explores action-value estimation methods, the exploration vs. exploitation tradeoff, and optimism in the face of uncertainty. Module 02 introduces Markov decision processes and defines goals, episodes, returns, and the modeling of continuing tasks. Module 03 discusses value functions, policies, and Bellman equations. Module 04 examines dynamic programming techniques, including policy evaluation, policy iteration, and generalized policy iteration.
Module 00: Welcome to the Course
- Understand the prerequisites, goals, and roadmap for the course
Module 01: The K-Armed Bandit Problem
Lesson 1: The K-Armed Bandit Problem
- Define reward
- Understand the temporal nature of the bandit problem
- Define k-armed bandit
- Define action-values (see the definition below)
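For reference, the action-value named in the last objective is typically defined as the expected reward given that an action is selected; in standard notation:

    q_*(a) \doteq \mathbb{E}[ R_t \mid A_t = a ]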
Lesson 2: What to Learn? Estimating Action Values
- Define action-value estimation methods
- Define exploration and exploitation
- Select actions greedily using an action-value function
- Define online learning
- Understand a simple online sample-average action-value estimation method
- Define the general online update equation (sketched after this list)
- Understand why we might use a constant stepsize in the case of non-stationarity
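As a rough illustration of the general online update equation, NewEstimate ← OldEstimate + StepSize · (Target − OldEstimate), here is a minimal Python sketch of the sample-average and constant-stepsize updates; the function names and default values are illustrative assumptions, not course code.

    def update_sample_average(q, n, reward):
        """Sample-average update: the 1/n stepsize makes q the mean of all rewards seen so far."""
        n += 1
        q += (reward - q) / n
        return q, n

    def update_constant_stepsize(q, reward, alpha=0.1):
        """Constant stepsize: recent rewards are weighted more heavily, which suits non-stationary problems."""
        return q + alpha * (reward - q)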
Lesson 3: Exploration vs. Exploitation Tradeoff
- Define epsilon-greedy (sketched below, together with UCB)
- Compare the short-term benefits of exploitation and the long-term benefits of exploration
- Understand optimistic initial values
- Describe the benefits of optimistic initial values for early exploration
- Explain the criticisms of optimistic initial values
- Describe the upper confidence bound action selection method
- Define optimism in the face of uncertainty
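A minimal Python sketch of the two action-selection rules named above (epsilon-greedy and upper confidence bound); the array-based inputs and constants are illustrative assumptions, not course code.

    import numpy as np

    def epsilon_greedy(q, epsilon=0.1):
        """With probability epsilon explore uniformly at random; otherwise exploit the greedy action."""
        if np.random.rand() < epsilon:
            return np.random.randint(len(q))
        return int(np.argmax(q))

    def ucb(q, counts, t, c=2.0):
        """Upper confidence bound: add an exploration bonus that is large for rarely tried actions."""
        bonus = c * np.sqrt(np.log(t + 1) / (counts + 1e-8))
        return int(np.argmax(q + bonus))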
Module 02: Markov Decision Processes
Lesson 1: Introduction to Markov Decision Processes
- Understand Markov Decision Processes, or MDPs
- Describe how the dynamics of an MDP are defined (written out below)
- Understand the graphical representation of a Markov Decision Process
- Explain how many diverse processes can be written in terms of the MDP framework
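For reference, the one-step dynamics mentioned above are usually written as a single function of four arguments; in standard MDP notation:

    p(s', r \mid s, a) \doteq \Pr\{ S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a \}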
Lesson 2: Goal of Reinforcement Learning
- Describe how rewards relate to the goal of an agent (see the return below)
- Understand episodes and identify episodic tasks
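For an episodic task that ends at a final time step T, the goal is formalized as maximizing the expected return, the sum of the rewards that follow time t:

    G_t \doteq R_{t+1} + R_{t+2} + \cdots + R_T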
Lesson 3: Continuing Tasks
- Formulate returns for continuing tasks using discounting (see the equation below)
- Describe how returns at successive time steps are related to each other
- Understand when to formalize a task as episodic or continuing
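For continuing tasks the return is discounted, and it satisfies a simple recursive relationship between successive time steps:

    G_t \doteq \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} = R_{t+1} + \gamma G_{t+1}, \qquad 0 \le \gamma < 1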
Module 03: Value Functions & Bellman Equations
Lesson 1: Policies and Value Functions
- Recognize that a policy is a distribution over actions for each possible state
- Describe the similarities and differences between stochastic and deterministic policies
- Identify the characteristics of a well-defined policy
- Generate examples of valid policies for a given MDP
- Describe the roles of state-value and action-value functions in reinforcement learning (defined below)
- Describe the relationship between value functions and policies
- Create examples of valid value functions for a given MDP
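In standard notation, the state-value and action-value functions of a policy \pi are the expected returns obtained when following \pi:

    v_\pi(s) \doteq \mathbb{E}_\pi[ G_t \mid S_t = s ], \qquad q_\pi(s, a) \doteq \mathbb{E}_\pi[ G_t \mid S_t = s, A_t = a ]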
Lesson 2: Bellman Equations
- Derive the Bellman equation for state-value functions (shown below)
- Derive the Bellman equation for action-value functions
- Understand how Bellman equations relate current and future values
- Use the Bellman equations to compute value functions
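For reference, the Bellman equation for the state-value function relates the value of a state to the values of its possible successor states:

    v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \bigl[ r + \gamma \, v_\pi(s') \bigr]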
Lesson 3: Optimality (Optimal Policies & Value Functions)
- Define an optimal policy
- Understand how a policy can be at least as good as every other policy in every state
- Identify an optimal policy for given MDPs
- Derive the Bellman optimality equation for state-value functions (shown below)
- Derive the Bellman optimality equation for action-value functions
- Understand how the Bellman optimality equations relate to the previously introduced Bellman equations
- Understand the connection between the optimal value function and optimal policies
- Verify the optimal value function for given MDPs
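The Bellman optimality equation for state values replaces the expectation over the policy's actions with a maximization over actions:

    v_*(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a) \bigl[ r + \gamma \, v_*(s') \bigr]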
Module 04: Dynamic Programming
Lesson 1: Policy Evaluation (Prediction)
- Understand the distinction between policy evaluation and control
- Explain the setting in which dynamic programming can be applied, as well as its limitations
- Outline the iterative policy evaluation algorithm for estimating state values under a given policy (sketched below)
- Apply iterative policy evaluation to compute value functions
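A minimal Python sketch of iterative policy evaluation on a small tabular MDP; the data layout (integer state/action indices, policy[s][a] probabilities, dynamics[s][a] as lists of (prob, next_state, reward) triples) is an illustrative assumption, not course code.

    import numpy as np

    def policy_evaluation(states, actions, policy, dynamics, gamma=0.9, theta=1e-6):
        """Sweep over all states, backing up expected one-step returns, until values stop changing."""
        V = np.zeros(len(states))
        while True:
            delta = 0.0
            for s in states:
                v_new = sum(
                    policy[s][a] * sum(p * (r + gamma * V[s2]) for p, s2, r in dynamics[s][a])
                    for a in actions
                )
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                return V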
Lesson 2: Policy Iteration (Control)
- Understand the policy improvement theorem
- Use a value function for a policy to produce a better policy for a given MDP
- Outline the policy iteration algorithm for finding the optimal policy (improvement step sketched below)
- Understand “the dance of policy and value”
- Apply policy iteration to compute optimal policies and optimal value functions
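A sketch of the greedy policy-improvement step that policy iteration alternates with policy evaluation (“the dance of policy and value”); same illustrative tabular layout as the sketch above.

    import numpy as np

    def greedy_policy(states, actions, dynamics, V, gamma=0.9):
        """One-step lookahead with the current value function, then act greedily in every state."""
        policy = {}
        for s in states:
            q = [sum(p * (r + gamma * V[s2]) for p, s2, r in dynamics[s][a]) for a in actions]
            best = int(np.argmax(q))
            policy[s] = {a: 1.0 if a == best else 0.0 for a in actions}
        return policy

    # Policy iteration alternates: evaluate the current policy to get V, improve greedily
    # with respect to V, and stop when the improved policy no longer changes.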
Lesson 3: Generalized Policy Iteration
- Understand the framework of generalized policy iteration
- Outline value iteration, an important example of generalized policy iteration (sketched below)
- Understand the distinction between synchronous and asynchronous dynamic programming methods
- Describe brute force search as an alternative method for searching for an optimal policy
- Describe Monte Carlo as an alternative method for learning a value function
- Understand the advantage of dynamic programming and “bootstrapping” over these alternative strategies for finding the optimal policy
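A compact sketch of value iteration, the example of generalized policy iteration named above; each sweep folds evaluation and improvement into a single max over actions (same illustrative tabular layout as the earlier sketches).

    import numpy as np

    def value_iteration(states, actions, dynamics, gamma=0.9, theta=1e-6):
        """Back up the best one-step return in every state until values converge, then act greedily."""
        V = np.zeros(len(states))
        while True:
            delta = 0.0
            for s in states:
                q = [sum(p * (r + gamma * V[s2]) for p, s2, r in dynamics[s][a]) for a in actions]
                v_new = max(q)
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                break
        # Extract a deterministic policy by acting greedily with respect to the converged V.
        policy = {
            s: int(np.argmax([sum(p * (r + gamma * V[s2]) for p, s2, r in dynamics[s][a])
                              for a in actions]))
            for s in states
        }
        return V, policy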