Reinforcement Learning
Amrinder Arora
The George Washington University
[Original version of these slides was created by Dan Klein and Pieter Abbeel for Intro to AI at UC Berkeley. http://ai.berkeley.edu]
Reinforcement Learning
[Diagram: the agent-environment loop. The agent takes actions a; the environment returns the next state s and a reward r.]
§ Basic idea:
§ Receive feedback in the form of rewards
§ Agent’s utility is defined by the reward function
§ Must (learn to) act so as to maximize expected rewards
§ All learning is based on observed samples of outcomes!
AI-4511/6511 GWU 2
Reinforcement Learning
§ Still assume a Markov decision process (MDP):
§ A set of states s ∈ S
§ A set of actions (per state) A
§ A model T(s,a,s')
§ A reward function R(s,a,s')
§ Still looking for a policy π(s)
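For concreteness, here is a minimal sketch of how these MDP ingredients might be stored as plain Python dictionaries. The tiny gridworld entries mirror the B/C/D example used later in the deck; the variable names and layout are illustrative assumptions, not anything prescribed by the slides.

```python
# Illustrative only: one way to store (S, A, T, R) and a fixed policy as dictionaries.
states = ["A", "B", "C", "D", "E"]
actions = {"A": ["exit"], "B": ["east"], "C": ["east"], "D": ["exit"], "E": ["north"]}

# T[(s, a)] -> list of (next_state, probability); R[(s, a, s')] -> immediate reward
T = {
    ("B", "east"): [("C", 1.0)],
    ("C", "east"): [("D", 0.75), ("A", 0.25)],
    ("E", "north"): [("C", 1.0)],
    ("D", "exit"): [("x", 1.0)],
    ("A", "exit"): [("x", 1.0)],
}
R = {
    ("B", "east", "C"): -1.0,
    ("C", "east", "D"): -1.0,
    ("C", "east", "A"): -1.0,
    ("D", "exit", "x"): +10.0,
    ("A", "exit", "x"): -10.0,
}

# A deterministic policy pi(s) maps each state to an action.
policy = {"B": "east", "C": "east", "D": "exit", "E": "north"}
```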
AI-4511/6511 GWU 3
Offline (MDPs) vs. Online (RL)
AI-4511/6511 GWU 4
Two Broad Categories
§ Model Based – We will learn the MDP model (T, R, …)
§ Model Free – We learn the Q, V values directly
AI-4511/6511 GWU 5
Model-Based Learning
§ Model-Based Idea:
§ Learn an approximate model based on experiences
§ Solve for values as if the learned model were correct
AI-4511/6511 GWU 6
Example: Model-Based Learning
Input Policy π (assume γ = 1):
[Grid: A above; B, C, D in a row; E below. The policy's arrows point B east, C east, E north, D exit.]

Observed Episodes (Training):
Episode 1: B, east, C, -1; C, east, D, -1; D, exit, x, +10
Episode 2: B, east, C, -1; C, east, D, -1; D, exit, x, +10
Episode 3: E, north, C, -1; C, east, D, -1; D, exit, x, +10
Episode 4: E, north, C, -1; C, east, A, -1; A, exit, x, -10

Learned Model:
T(s,a,s'): T(B, east, C) = 1.00; T(C, east, D) = 0.75; T(C, east, A) = 0.25; …
R(s,a,s'): R(B, east, C) = -1; R(C, east, D) = -1; R(D, exit, x) = +10; …
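As a sketch of the model-based idea, the learned model above can be reproduced by simple counting over the observed episodes. This is a minimal Python illustration (the episode encoding and names are my own, not course code):

```python
from collections import Counter, defaultdict

# Each episode is a list of (s, a, s', r) transitions, as in the example above.
episodes = [
    [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("E", "north", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("E", "north", "C", -1), ("C", "east", "A", -1), ("A", "exit", "x", -10)],
]

counts = defaultdict(Counter)   # counts[(s, a)][s'] = number of times observed
rewards = {}                    # observed reward for each (s, a, s')

for episode in episodes:
    for s, a, s2, r in episode:
        counts[(s, a)][s2] += 1
        rewards[(s, a, s2)] = r

# Normalize counts into estimated transition probabilities T_hat(s, a, s').
T_hat = {
    (s, a, s2): n / sum(c.values())
    for (s, a), c in counts.items()
    for s2, n in c.items()
}

print(T_hat[("C", "east", "D")])    # 0.75, matching the learned model above
print(rewards[("B", "east", "C")])  # -1
```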
AI-4511/6511 GWU 7
Model-Free Learning
§ A key mechanism for learning in MDP settings
§ Here we don't try to learn the T and R values; we learn the Q and V values directly.
§ Subtopics
§ Passive RL – Evaluate a fixed policy: learn the V/Q values for a given policy
§ Active RL – Learn the policy as well
§ Q-Learning – Learn the Q values directly, using an exponential-moving-average style of update
AI-4511/6511 GWU 8
Passive Reinforcement Learning
AI-4511/6511 GWU 9
Exponential Moving Average
§ Exponential moving average
§ The running interpolation update: x̄_n = (1 − α) · x̄_{n−1} + α · x_n
§ Forgets about the past (distant past values were wrong anyway)
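A one-line version of this update in Python (α = 0.1 here is an arbitrary choice for illustration):

```python
# Sketch of the running interpolation update.
def ema_update(old_estimate, sample, alpha=0.1):
    # Keep (1 - alpha) of the old estimate and blend in alpha of the new sample.
    return (1 - alpha) * old_estimate + alpha * sample

estimate = 0.0
for sample in [10, 10, 0, 10]:
    estimate = ema_update(estimate, sample)  # recent samples count more than old ones
```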
AI-4511/6511 GWU 10
Passive Reinforcement Learning
§ Simplified task: policy evaluation
§ Input: a fixed policy π(s)
§ You don’t know the transitions T(s,a,s’)
§ You don’t know the rewards R(s,a,s’)
§ Goal: learn the state values
§ In this case:
§ Learner is “along for the ride”
§ No choice about what actions to take
§ Just execute the policy and learn from experience
§ This is NOT offline planning! You actually take actions in the world.
AI-4511/6511 GWU 11
Direct Evaluation
§ Goal: Compute values for each state under π
§ Idea: Act according to π; every time you visit a state, write down what the sum of discounted rewards turned out to be, and average those samples
AI-4511/6511 GWU 12
Example: Direct Evaluation
Input Policy π (assume γ = 1):
[Same grid and policy as the previous example: A above; B, C, D in a row; E below.]

Observed Episodes (Training):
Episode 1: B, east, C, -1; C, east, D, -1; D, exit, x, +10
Episode 2: B, east, C, -1; C, east, D, -1; D, exit, x, +10
Episode 3: E, north, C, -1; C, east, D, -1; D, exit, x, +10
Episode 4: E, north, C, -1; C, east, A, -1; A, exit, x, -10

Output Values:
V(A) = -10, V(B) = +8, V(C) = +4, V(D) = +10, V(E) = -2
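The output values above can be reproduced by averaging the observed returns from each visited state. Here is a short sketch of direct evaluation over these four episodes (the episode encoding is my own, not the course's):

```python
from collections import defaultdict

gamma = 1.0
episodes = [
    [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("E", "north", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("E", "north", "C", -1), ("C", "east", "A", -1), ("A", "exit", "x", -10)],
]

returns = defaultdict(list)  # state -> list of observed returns from that state

for episode in episodes:
    g = 0.0
    # Walk backwards, accumulating the discounted return-to-go from each state.
    for s, a, s2, r in reversed(episode):
        g = r + gamma * g
        returns[s].append(g)

values = {s: sum(gs) / len(gs) for s, gs in returns.items()}
print(values)  # B: +8, C: +4, D: +10, E: -2, A: -10
```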
AI-4511/6511 GWU 13
Problems with Direct Evaluation
§ What's good about direct evaluation?
§ It's easy to understand
§ It doesn't require any knowledge of T, R
§ It eventually computes the correct average values, using just sample transitions
§ What's bad about it?
§ It wastes information about state connections (if B and E both go to C under this policy, how can their values be different?)
§ Each state must be learned separately
§ So, it takes a long time to learn
[Output values repeated from the previous slide: V(A) = -10, V(B) = +8, V(C) = +4, V(D) = +10, V(E) = -2]
AI-4511/6511 GWU 14
Why We Can’t Use Policy Evaluation?
§ Policy evaluation's Bellman update needs the model:
V^π_{k+1}(s) ← Σ_{s'} T(s, π(s), s') [ R(s, π(s), s') + γ V^π_k(s') ]
§ But in RL we don't know T or R, so we can't take this weighted average directly
§ Idea: Take samples of outcomes s' (by doing the action!) and average
sample_i = R(s, π(s), s'_i) + γ V^π_k(s'_i)
V^π_{k+1}(s) ← (1/n) Σ_i sample_i
[Diagram: from state s, taking action π(s) repeatedly yields sampled outcomes s'_1, s'_2, s'_3]
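A minimal sketch of this sample-based update in Python, assuming some environment loop supplies the observed (s', r) after executing π(s); the function name, α, and dictionary layout are illustrative assumptions:

```python
# Blend one observed sample of the look-ahead value into V[s] (no T or R needed).
def sample_based_update(V, s, s2, r, gamma=1.0, alpha=0.1):
    sample = r + gamma * V.get(s2, 0.0)            # one sample of R + gamma * V(s')
    V[s] = (1 - alpha) * V.get(s, 0.0) + alpha * sample  # running average over samples
    return V
```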
AI-4511/6511 GWU 17
Active Reinforcement Learning
§ Full reinforcement learning: optimal policies (like value iteration)
§ You don’t know the transitions T(s,a,s’)
§ You don’t know the rewards R(s,a,s’)
§ You choose the actions now
§ Goal: learn the optimal policy / values
§ In this case:
§ Learner makes choices!
§ Fundamental tradeoff: exploration vs. exploitation
§ This is NOT offline planning! You actually take actions in the world and
find out what happens…
AI-4511/6511 GWU 18
Q-Value Iteration
§ Value iteration: find successive (depth-limited) values
§ Start with V_0(s) = 0, which we know is right
§ Given V_k, calculate the depth k+1 values for all states:
V_{k+1}(s) ← max_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V_k(s') ]
§ But Q-values are more useful to learn, so compute them instead: start with Q_0(s, a) = 0, and given Q_k, calculate the depth k+1 Q-values for all Q-states:
Q_{k+1}(s, a) ← Σ_{s'} T(s, a, s') [ R(s, a, s') + γ max_{a'} Q_k(s', a') ]
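If the model were known, this update could be computed directly. Here is a sketch assuming the dictionary-based T and R layout from the earlier model-based example (names and default parameters are assumptions):

```python
# Sketch: Q-value iteration when the model IS known.
def q_value_iteration(states, actions, T, R, gamma=0.9, iterations=100):
    Q = {(s, a): 0.0 for s in states for a in actions.get(s, [])}  # Q_0 = 0
    for _ in range(iterations):
        newQ = {}
        for s in states:
            for a in actions.get(s, []):
                total = 0.0
                for s2, p in T.get((s, a), []):
                    # max over a' of Q_k(s', a'); 0 if s' is terminal/unknown
                    best_next = max((Q.get((s2, a2), 0.0) for a2 in actions.get(s2, [])),
                                    default=0.0)
                    total += p * (R.get((s, a, s2), 0.0) + gamma * best_next)
                newQ[(s, a)] = total
        Q = newQ
    return Q
```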
AI-4511/6511 GWU 19
Q-Learning
§ We'd like to do Q-value updates to each Q-state:
Q_{k+1}(s, a) ← Σ_{s'} T(s, a, s') [ R(s, a, s') + γ max_{a'} Q_k(s', a') ]
§ But we can't compute this without knowing T and R
§ Instead, learn Q(s, a) as you go: receive a sample (s, a, s', r) and fold it into a running average:
sample = R(s, a, s') + γ max_{a'} Q(s', a')
Q(s, a) ← (1 − α) Q(s, a) + α · sample
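In code, one Q-learning update for a single observed transition might look like this sketch (dictionary-based Q table; the α and γ values are placeholders):

```python
# Sketch: one tabular Q-learning update for an observed transition (s, a, r, s').
# Q maps (state, action) -> value; 'actions' lists the legal actions per state.
def q_learning_update(Q, actions, s, a, r, s2, alpha=0.5, gamma=1.0):
    sample = r + gamma * max((Q.get((s2, a2), 0.0) for a2 in actions.get(s2, [])),
                             default=0.0)
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * sample
    return Q
```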
AI-4511/6511 GWU 20
Q-Learning Properties
§ Amazing result: Q-learning converges to optimal policy -- even
if you’re acting suboptimally!
§ Caveats:
§ You have to explore enough
§ You have to eventually make the learning rate
small enough
§ … but not decrease it too quickly
§ Basically, in the limit, it doesn’t matter how you select actions (!)
AI-4511/6511 GWU 21
Exploration vs. Exploitation
AI-4511/6511 GWU 22
How to Explore?
§ Several schemes for forcing exploration
§ Simplest: random actions (ε-greedy)
§ Every time step, flip a coin
§ With (small) probability ε, act randomly
§ With (large) probability 1-ε, act on current policy
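A sketch of ε-greedy action selection over a dictionary-based Q table (the function and parameter names are illustrative):

```python
import random

def epsilon_greedy(Q, actions, s, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions[s])                          # small probability: act randomly
    return max(actions[s], key=lambda a: Q.get((s, a), 0.0))      # otherwise: act greedily
```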
AI-4511/6511 GWU 23
Exploration Functions
§ When to explore?
§ Random actions: explore a fixed amount
§ Better idea: explore areas whose badness is not (yet) established, eventually stop exploring
§ Exploration function
§ Takes a value estimate u and a visit count n, and returns an optimistic utility, e.g. f(u, n) = u + k/n
Regular Q-Update: Q(s, a) ← (1 − α) Q(s, a) + α [ R(s, a, s') + γ max_{a'} Q(s', a') ]
Modified Q-Update: Q(s, a) ← (1 − α) Q(s, a) + α [ R(s, a, s') + γ max_{a'} f(Q(s', a'), N(s', a')) ]
§ Note: this propagates the "bonus" back to states that lead to unknown states as well!
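A sketch of the exploration-function-modified update, using f(u, n) = u + k/(n + 1); the +1 is my own guard against division by zero for unvisited pairs, and k, the count table N, and all names are assumptions:

```python
K = 1.0  # exploration bonus weight (an arbitrary illustrative value)

def f(u, n):
    # Optimistic utility: value estimate plus a bonus that shrinks with visit count.
    return u + K / (n + 1)

def exploratory_q_update(Q, N, actions, s, a, r, s2, alpha=0.5, gamma=1.0):
    N[(s, a)] = N.get((s, a), 0) + 1
    # Modified update: bootstrap from f(Q, N) instead of Q alone.
    best = max((f(Q.get((s2, a2), 0.0), N.get((s2, a2), 0)) for a2 in actions.get(s2, [])),
               default=0.0)
    sample = r + gamma * best
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * sample
    return Q
```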
AI-4511/6511 GWU 24
Regret
§ Even if you learn the optimal policy, you still
make mistakes along the way!
§ Regret is a measure of your total mistake
cost: the difference between your
(expected) rewards, including youthful
suboptimality, and optimal (expected)
rewards
§ Minimizing regret goes beyond learning to
be optimal – it requires optimally learning to
be optimal
§ Example: random exploration and
exploration functions both end up optimal,
but random exploration has higher regret
AI-4511/6511 GWU 25
Generalizing Across States
§ Basic Q-Learning keeps a table of all q-values
AI-4511/6511 GWU 26
Example: Pacman
[Three Pacman screenshots.] Let's say we discover through experience that this state is bad. In naïve Q-learning, we know nothing about this state, or even this one!
AI-4511/6511 GWU 27
Feature-Based Representations
AI-4511/6511 GWU 28
Linear Value Functions
§ Using a feature representation, we can write a q function (or value function) for any state using a few weights:
V(s) = w_1 f_1(s) + w_2 f_2(s) + … + w_n f_n(s)
Q(s, a) = w_1 f_1(s, a) + w_2 f_2(s, a) + … + w_n f_n(s, a)
§ Advantage: our experience is summed up in a few powerful numbers
§ Disadvantage: states may share features but actually be very different in value!
AI-4511/6511 GWU 29
Approximate Q-Learning
§ Q-learning with linear Q-functions:
Q(s, a) = w_1 f_1(s, a) + … + w_n f_n(s, a)
transition = (s, a, r, s')
difference = [ r + γ max_{a'} Q(s', a') ] − Q(s, a)
Exact Q's: Q(s, a) ← Q(s, a) + α [difference]
Approximate Q's: w_i ← w_i + α [difference] f_i(s, a)
§ Intuitive interpretation:
§ Adjust weights of active features
§ E.g., if something unexpectedly bad happens, blame the features that were on: disprefer all states with that state's features
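A sketch of the weight update with a linear Q-function; a features(s, a) helper returning a dict of named feature values is assumed purely for illustration:

```python
# Sketch: approximate Q-learning with a linear Q-function Q(s, a) = w . f(s, a).
def q_value(w, feats):
    return sum(w.get(name, 0.0) * value for name, value in feats.items())

def approx_q_update(w, features, actions, s, a, r, s2, alpha=0.05, gamma=1.0):
    feats = features(s, a)
    best_next = max((q_value(w, features(s2, a2)) for a2 in actions.get(s2, [])),
                    default=0.0)
    difference = (r + gamma * best_next) - q_value(w, feats)
    # Adjust the weights of the features that were active in (s, a).
    for name, value in feats.items():
        w[name] = w.get(name, 0.0) + alpha * difference * value
    return w
```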
AI-4511/6511 GWU 31
Q-Learning and Least Squares
AI-4511/6511 GWU 32
Linear Approximation: Regression*
[Figure: least-squares regression. Left: one feature, with prediction ŷ = w_0 + w_1 f_1(x). Right: two features, with prediction ŷ = w_0 + w_1 f_1(x) + w_2 f_2(x).]
AI-4511/6511 GWU 33
Optimization: Least Squares*
[Figure: a fitted line with one observation; the vertical gap between the observation and the prediction is the error, or "residual".]
total error = Σ_i ( y_i − ŷ_i )² = Σ_i ( y_i − Σ_k w_k f_k(x_i) )²
AI-4511/6511 GWU 34
Minimizing Error*
Imagine we had only one point x, with features f(x), target value y, and weights w:
error(w) = ½ ( y − Σ_k w_k f_k(x) )²
∂ error(w) / ∂ w_m = − ( y − Σ_k w_k f_k(x) ) f_m(x)
w_m ← w_m + α ( y − Σ_k w_k f_k(x) ) f_m(x)
Approximate Q-update: w_m ← w_m + α [ r + γ max_{a'} Q(s', a') − Q(s, a) ] f_m(s, a), i.e. ("target" − "prediction") times the feature value
AI-4511/6511 GWU 35
Credit Assignment Problem
§ Not easy to identify the credit for each move in a game of Chess
§ If credit is only given at the end of the game, then:
§ Many good moves can get negative credit if the end result is a loss
§ Many bad moves can get positive credit if the end result is a win
§ Many, many games need to be played before learning really happens
§ One solution is to give rewards early on (Reward Shaping)
§ If we try to give rewards early on, then:
§ The agent will maximize those shaped rewards, not the actual outcome
AI-4511/6511 GWU 36
Summary
§ Introduction
§ What is Reinforcement Learning
§ Handling MDPs when we don't know the T and R functions
§ Two broad categories of Reinforcement Learning (RL)
§ Model Based - Simply try to learn the T and R values. Then, calculate Q, V as usual.
§ Model Free - Don't worry about the T and R values. Learn the Q, V values directly.
§ Q-Learning: an algorithm to learn Q values by trying actions and updating each Q value with an exponential-moving-average style update
AI-4511/6511 GWU 37
Conclusion
AI-4511/6511 GWU 38