Reinforcement Learning
Amrinder Arora
The George Washington University
[Original version of these slides was created by Dan Klein and Pieter Abbeel for Intro to AI at UC Berkeley. http://ai.berkeley.edu]
Reinforcement Learning
[Diagram: the agent-environment loop. The agent takes actions a; the environment returns the next state s and a reward r.]
§ Basic idea:
§ Receive feedback in the form of rewards
§ Agent’s utility is defined by the reward function
§ Must (learn to) act so as to maximize expected rewards
§ All learning is based on observed samples of outcomes!
AI-4511/6511 GWU 2
Reinforcement Learning
§ Still assume a Markov decision process (MDP):
§ A set of states s ∈ S
§ A set of actions (per state) A
§ A model T(s,a,s')
§ A reward function R(s,a,s')
§ Still looking for a policy π(s)
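For concreteness, here is a minimal sketch of how these MDP ingredients might be stored as plain Python dictionaries. The tiny gridworld entries mirror the B/C/D example used later in the deck; the variable names and layout are illustrative assumptions, not anything prescribed by the slides.

```python
# Illustrative only: one way to store (S, A, T, R) and a fixed policy as dictionaries.
states = ["A", "B", "C", "D", "E"]
actions = {"A": ["exit"], "B": ["east"], "C": ["east"], "D": ["exit"], "E": ["north"]}

# T[(s, a)] -> list of (next_state, probability); R[(s, a, s')] -> immediate reward
T = {
    ("B", "east"): [("C", 1.0)],
    ("C", "east"): [("D", 0.75), ("A", 0.25)],
    ("E", "north"): [("C", 1.0)],
    ("D", "exit"): [("x", 1.0)],
    ("A", "exit"): [("x", 1.0)],
}
R = {
    ("B", "east", "C"): -1.0,
    ("C", "east", "D"): -1.0,
    ("C", "east", "A"): -1.0,
    ("D", "exit", "x"): +10.0,
    ("A", "exit", "x"): -10.0,
}

# A deterministic policy pi(s) maps each state to an action.
policy = {"B": "east", "C": "east", "D": "exit", "E": "north"}
```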
AI-4511/6511 GWU 3
Offline (MDPs) vs. Online (RL)
AI-4511/6511 GWU 4
Two Broad Categories
§ Model Based – We will learn the MDP model (T, R, …)
§ Model Free – We learn the Q, V values directly
AI-4511/6511 GWU 5
Model-Based Learning
§ Model-Based Idea:
§ Learn an approximate model based on experiences
§ Solve for values as if the learned model were correct
AI-4511/6511 GWU 6
Example: Model-Based Learning
Input Policy π (assume γ = 1):
[Grid: A above; B, C, D in a row; E below. The policy's arrows point B east, C east, E north, D exit.]

Observed Episodes (Training):
Episode 1: B, east, C, -1; C, east, D, -1; D, exit, x, +10
Episode 2: B, east, C, -1; C, east, D, -1; D, exit, x, +10
Episode 3: E, north, C, -1; C, east, D, -1; D, exit, x, +10
Episode 4: E, north, C, -1; C, east, A, -1; A, exit, x, -10

Learned Model:
T(s,a,s'): T(B, east, C) = 1.00; T(C, east, D) = 0.75; T(C, east, A) = 0.25; …
R(s,a,s'): R(B, east, C) = -1; R(C, east, D) = -1; R(D, exit, x) = +10; …
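As a sketch of the model-based idea, the learned model above can be reproduced by simple counting over the observed episodes. This is a minimal Python illustration (the episode encoding and names are my own, not course code):

```python
from collections import Counter, defaultdict

# Each episode is a list of (s, a, s', r) transitions, as in the example above.
episodes = [
    [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("E", "north", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("E", "north", "C", -1), ("C", "east", "A", -1), ("A", "exit", "x", -10)],
]

counts = defaultdict(Counter)   # counts[(s, a)][s'] = number of times observed
rewards = {}                    # observed reward for each (s, a, s')

for episode in episodes:
    for s, a, s2, r in episode:
        counts[(s, a)][s2] += 1
        rewards[(s, a, s2)] = r

# Normalize counts into estimated transition probabilities T_hat(s, a, s').
T_hat = {
    (s, a, s2): n / sum(c.values())
    for (s, a), c in counts.items()
    for s2, n in c.items()
}

print(T_hat[("C", "east", "D")])    # 0.75, matching the learned model above
print(rewards[("B", "east", "C")])  # -1
```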
AI-4511/6511 GWU 7
Model-Free Learning
§ A key mechanism for learning in MDP settings
§ Here we don't try to learn the T and R values; we learn the Q and V values directly.
§ Subtopics
§ Passive RL – Evaluate a fixed policy: learn the V/Q values for a given policy
§ Active RL – Learn the policy as well
§ Q-Learning – Learn the Q values directly, using an exponential-moving-average style of update
AI-4511/6511 GWU 8
Passive Reinforcement Learning
AI-4511/6511 GWU 9
Exponential Moving Average
§ Exponential moving average
§ The running interpolation update: x̄_n = (1 − α) · x̄_{n−1} + α · x_n
§ Forgets about the past (distant past values were wrong anyway)
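A one-line version of this update in Python (α = 0.1 here is an arbitrary choice for illustration):

```python
# Sketch of the running interpolation update.
def ema_update(old_estimate, sample, alpha=0.1):
    # Keep (1 - alpha) of the old estimate and blend in alpha of the new sample.
    return (1 - alpha) * old_estimate + alpha * sample

estimate = 0.0
for sample in [10, 10, 0, 10]:
    estimate = ema_update(estimate, sample)  # recent samples count more than old ones
```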
AI-4511/6511 GWU 10
Passive Reinforcement Learning
§ Simplified task: policy evaluation
§ Input: a fixed policy π(s)
§ You don’t know the transitions T(s,a,s’)
§ You don’t know the rewards R(s,a,s’)
§ Goal: learn the state values
§ In this case:
§ Learner is “along for the ride”
§ No choice about what actions to take
§ Just execute the policy and learn from experience
§ This is NOT offline planning! You actually take actions in the world.
AI-4511/6511 GWU 11
Direct Evaluation
§ Goal: Compute values for each state under π
§ Idea: Act according to π; every time you visit a state, write down what the sum of discounted rewards turned out to be, and average those samples
AI-4511/6511 GWU 12
Example: Direct Evaluation
Input Policy π (assume γ = 1):
[Same grid and policy as the previous example: A above; B, C, D in a row; E below.]

Observed Episodes (Training):
Episode 1: B, east, C, -1; C, east, D, -1; D, exit, x, +10
Episode 2: B, east, C, -1; C, east, D, -1; D, exit, x, +10
Episode 3: E, north, C, -1; C, east, D, -1; D, exit, x, +10
Episode 4: E, north, C, -1; C, east, A, -1; A, exit, x, -10

Output Values:
V(A) = -10, V(B) = +8, V(C) = +4, V(D) = +10, V(E) = -2
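The output values above can be reproduced by averaging the observed returns from each visited state. Here is a short sketch of direct evaluation over these four episodes (the episode encoding is my own, not the course's):

```python
from collections import defaultdict

gamma = 1.0
episodes = [
    [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("E", "north", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("E", "north", "C", -1), ("C", "east", "A", -1), ("A", "exit", "x", -10)],
]

returns = defaultdict(list)  # state -> list of observed returns from that state

for episode in episodes:
    g = 0.0
    # Walk backwards, accumulating the discounted return-to-go from each state.
    for s, a, s2, r in reversed(episode):
        g = r + gamma * g
        returns[s].append(g)

values = {s: sum(gs) / len(gs) for s, gs in returns.items()}
print(values)  # B: +8, C: +4, D: +10, E: -2, A: -10
```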
AI-4511/6511 GWU 13
Problems with Direct Evaluation
§ What's good about direct evaluation?
§ It's easy to understand
§ It doesn't require any knowledge of T, R
§ It eventually computes the correct average values, using just sample transitions
§ What's bad about it?
§ It wastes information about state connections (if B and E both go to C under this policy, how can their values be different?)
§ Each state must be learned separately
§ So, it takes a long time to learn
[Output values repeated from the previous slide: V(A) = -10, V(B) = +8, V(C) = +4, V(D) = +10, V(E) = -2]
AI-4511/6511 GWU 14
Why We Can’t Use Policy Evaluation?
§ Policy evaluation's Bellman update needs the model:
V^π_{k+1}(s) ← Σ_{s'} T(s, π(s), s') [ R(s, π(s), s') + γ V^π_k(s') ]
§ But in RL we don't know T or R, so we can't take this weighted average directly
§ Idea: Take samples of outcomes s' (by doing the action!) and average
sample_i = R(s, π(s), s'_i) + γ V^π_k(s'_i)
V^π_{k+1}(s) ← (1/n) Σ_i sample_i
[Diagram: from state s, taking action π(s) repeatedly yields sampled outcomes s'_1, s'_2, s'_3]
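A minimal sketch of this sample-based update in Python, assuming some environment loop supplies the observed (s', r) after executing π(s); the function name, α, and dictionary layout are illustrative assumptions:

```python
# Blend one observed sample of the look-ahead value into V[s] (no T or R needed).
def sample_based_update(V, s, s2, r, gamma=1.0, alpha=0.1):
    sample = r + gamma * V.get(s2, 0.0)            # one sample of R + gamma * V(s')
    V[s] = (1 - alpha) * V.get(s, 0.0) + alpha * sample  # running average over samples
    return V
```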
AI-4511/6511 GWU 17
Active Reinforcement Learning
§ Full reinforcement learning: optimal policies (like value iteration)
§ You don’t know the transitions T(s,a,s’)
§ You don’t know the rewards R(s,a,s’)
§ You choose the actions now
§ Goal: learn the optimal policy / values
§ In this case:
§ Learner makes choices!
§ Fundamental tradeoff: exploration vs. exploitation
§ This is NOT offline planning! You actually take actions in the world and
find out what happens…
AI-4511/6511 GWU 18
Q-Value Iteration
§ Value iteration: find successive (depth-limited) values
§ Start with V_0(s) = 0, which we know is right
§ Given V_k, calculate the depth k+1 values for all states:
V_{k+1}(s) ← max_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V_k(s') ]
§ But Q-values are more useful to learn, so compute them instead: start with Q_0(s, a) = 0, and given Q_k, calculate the depth k+1 Q-values for all Q-states:
Q_{k+1}(s, a) ← Σ_{s'} T(s, a, s') [ R(s, a, s') + γ max_{a'} Q_k(s', a') ]
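If the model were known, this update could be computed directly. Here is a sketch assuming the dictionary-based T and R layout from the earlier model-based example (names and default parameters are assumptions):

```python
# Sketch: Q-value iteration when the model IS known.
def q_value_iteration(states, actions, T, R, gamma=0.9, iterations=100):
    Q = {(s, a): 0.0 for s in states for a in actions.get(s, [])}  # Q_0 = 0
    for _ in range(iterations):
        newQ = {}
        for s in states:
            for a in actions.get(s, []):
                total = 0.0
                for s2, p in T.get((s, a), []):
                    # max over a' of Q_k(s', a'); 0 if s' is terminal/unknown
                    best_next = max((Q.get((s2, a2), 0.0) for a2 in actions.get(s2, [])),
                                    default=0.0)
                    total += p * (R.get((s, a, s2), 0.0) + gamma * best_next)
                newQ[(s, a)] = total
        Q = newQ
    return Q
```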
AI-4511/6511 GWU 19
Q-Learning
§ We'd like to do Q-value updates to each Q-state:
Q_{k+1}(s, a) ← Σ_{s'} T(s, a, s') [ R(s, a, s') + γ max_{a'} Q_k(s', a') ]
§ But we can't compute this without knowing T and R
§ Instead, learn Q(s, a) as you go: receive a sample (s, a, s', r) and fold it into a running average:
sample = R(s, a, s') + γ max_{a'} Q(s', a')
Q(s, a) ← (1 − α) Q(s, a) + α · sample
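In code, one Q-learning update for a single observed transition might look like this sketch (dictionary-based Q table; the α and γ values are placeholders):

```python
# Sketch: one tabular Q-learning update for an observed transition (s, a, r, s').
# Q maps (state, action) -> value; 'actions' lists the legal actions per state.
def q_learning_update(Q, actions, s, a, r, s2, alpha=0.5, gamma=1.0):
    sample = r + gamma * max((Q.get((s2, a2), 0.0) for a2 in actions.get(s2, [])),
                             default=0.0)
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * sample
    return Q
```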
AI-4511/6511 GWU 20
Q-Learning Properties
§ Amazing result: Q-learning converges to optimal policy -- even
if you’re acting suboptimally!
§ Caveats:
§ You have to explore enough
§ You have to eventually make the learning rate
small enough
§ … but not decrease it too quickly
§ Basically, in the limit, it doesn’t matter how you select actions (!)
AI-4511/6511 GWU 21
Exploration vs. Exploitation
AI-4511/6511 GWU 22
How to Explore?
§ Several schemes for forcing exploration
§ Simplest: random actions (ε-greedy)
§ Every time step, flip a coin
§ With (small) probability ε, act randomly
§ With (large) probability 1-ε, act on current policy
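A sketch of ε-greedy action selection over a dictionary-based Q table (the function and parameter names are illustrative):

```python
import random

def epsilon_greedy(Q, actions, s, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions[s])                          # small probability: act randomly
    return max(actions[s], key=lambda a: Q.get((s, a), 0.0))      # otherwise: act greedily
```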
AI-4511/6511 GWU 23
Exploration Functions
§ When to explore?
§ Random actions: explore a fixed amount
§ Better idea: explore areas whose badness is not (yet) established, eventually stop exploring
§ Exploration function
§ Takes a value estimate u and a visit count n, and returns an optimistic utility, e.g. f(u, n) = u + k/n
Regular Q-Update: Q(s, a) ← (1 − α) Q(s, a) + α [ R(s, a, s') + γ max_{a'} Q(s', a') ]
Modified Q-Update: Q(s, a) ← (1 − α) Q(s, a) + α [ R(s, a, s') + γ max_{a'} f(Q(s', a'), N(s', a')) ]
§ Note: this propagates the "bonus" back to states that lead to unknown states as well!
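A sketch of the exploration-function-modified update, using f(u, n) = u + k/(n + 1); the +1 is my own guard against division by zero for unvisited pairs, and k, the count table N, and all names are assumptions:

```python
K = 1.0  # exploration bonus weight (an arbitrary illustrative value)

def f(u, n):
    # Optimistic utility: value estimate plus a bonus that shrinks with visit count.
    return u + K / (n + 1)

def exploratory_q_update(Q, N, actions, s, a, r, s2, alpha=0.5, gamma=1.0):
    N[(s, a)] = N.get((s, a), 0) + 1
    # Modified update: bootstrap from f(Q, N) instead of Q alone.
    best = max((f(Q.get((s2, a2), 0.0), N.get((s2, a2), 0)) for a2 in actions.get(s2, [])),
               default=0.0)
    sample = r + gamma * best
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * sample
    return Q
```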
AI-4511/6511 GWU 24
Regret
§ Even if you learn the optimal policy, you still
make mistakes along the way!
§ Regret is a measure of your total mistake
cost: the difference between your
(expected) rewards, including youthful
suboptimality, and optimal (expected)
rewards
§ Minimizing regret goes beyond learning to
be optimal – it requires optimally learning to
be optimal
§ Example: random exploration and
exploration functions both end up optimal,
but random exploration has higher regret
AI-4511/6511 GWU 25
Generalizing Across States
§ Basic Q-Learning keeps a table of all q-values
AI-4511/6511 GWU 26
Example: Pacman
[Three Pacman screenshots.] Let's say we discover through experience that this state is bad. In naïve Q-learning, we know nothing about this state, or even this one!
AI-4511/6511 GWU 27
Feature-Based Representations
AI-4511/6511 GWU 28
Linear Value Functions
§ Using a feature representation, we can write a q function (or value function) for any state using a few weights:
V(s) = w_1 f_1(s) + w_2 f_2(s) + … + w_n f_n(s)
Q(s, a) = w_1 f_1(s, a) + w_2 f_2(s, a) + … + w_n f_n(s, a)
§ Advantage: our experience is summed up in a few powerful numbers
§ Disadvantage: states may share features but actually be very different in value!
AI-4511/6511 GWU 29
Approximate Q-Learning
§ Q-learning with linear Q-functions:
Q(s, a) = w_1 f_1(s, a) + … + w_n f_n(s, a)
transition = (s, a, r, s')
difference = [ r + γ max_{a'} Q(s', a') ] − Q(s, a)
Exact Q's: Q(s, a) ← Q(s, a) + α [difference]
Approximate Q's: w_i ← w_i + α [difference] f_i(s, a)
§ Intuitive interpretation:
§ Adjust weights of active features
§ E.g., if something unexpectedly bad happens, blame the features that were on: disprefer all states with that state's features
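A sketch of the weight update with a linear Q-function; a features(s, a) helper returning a dict of named feature values is assumed purely for illustration:

```python
# Sketch: approximate Q-learning with a linear Q-function Q(s, a) = w . f(s, a).
def q_value(w, feats):
    return sum(w.get(name, 0.0) * value for name, value in feats.items())

def approx_q_update(w, features, actions, s, a, r, s2, alpha=0.05, gamma=1.0):
    feats = features(s, a)
    best_next = max((q_value(w, features(s2, a2)) for a2 in actions.get(s2, [])),
                    default=0.0)
    difference = (r + gamma * best_next) - q_value(w, feats)
    # Adjust the weights of the features that were active in (s, a).
    for name, value in feats.items():
        w[name] = w.get(name, 0.0) + alpha * difference * value
    return w
```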
AI-4511/6511 GWU 31
Q-Learning and Least Squares
AI-4511/6511 GWU 32
Linear Approximation: Regression*
[Figure: least-squares regression. Left: one feature, with prediction ŷ = w_0 + w_1 f_1(x). Right: two features, with prediction ŷ = w_0 + w_1 f_1(x) + w_2 f_2(x).]
AI-4511/6511 GWU 33
Optimization: Least Squares*
[Figure: a fitted line with one observation; the vertical gap between the observation and the prediction is the error, or "residual".]
total error = Σ_i ( y_i − ŷ_i )² = Σ_i ( y_i − Σ_k w_k f_k(x_i) )²
AI-4511/6511 GWU 34
Minimizing Error*
Imagine we had only one point x, with features f(x), target value y, and weights w:
error(w) = ½ ( y − Σ_k w_k f_k(x) )²
∂ error(w) / ∂ w_m = − ( y − Σ_k w_k f_k(x) ) f_m(x)
w_m ← w_m + α ( y − Σ_k w_k f_k(x) ) f_m(x)
Approximate Q-update: w_m ← w_m + α [ r + γ max_{a'} Q(s', a') − Q(s, a) ] f_m(s, a), i.e. ("target" − "prediction") times the feature value
AI-4511/6511 GWU 35
Credit Assignment Problem
§ Not easy to identify the credit for each move in a game of Chess
§ If credit is only given at the end of the game, then:
§ Many good moves can get negative credit if the end result is a loss
§ Many bad moves can get positive credit if the end result is a win
§ Many, many games need to be played before learning really happens
§ One solution is to give rewards early on (Reward Shaping)
§ If we try to give rewards early on, then:
§ The agent will maximize those shaped rewards, not the actual outcome
AI-4511/6511 GWU 36
Summary
§ Introduction
§ What is Reinforcement Learning
§ Handling MDPs when we don't know the T and R functions
§ Two broad categories of Reinforcement Learning (RL)
§ Model Based - Simply try to learn the T and R values. Then, calculate Q, V as usual.
§ Model Free - Don't worry about the T and R values. Learn the Q, V values directly.
§ Q-Learning: an algorithm to learn Q values by trying actions and updating each Q value with an exponential-moving-average style update
AI-4511/6511 GWU 37
Conclusion
AI-4511/6511 GWU 38