AI 11 Reinforcement Learning II
Reinforcement Learning II
Instructors: Dan Klein and Pieter Abbeel --- University of California, Berkeley
[These slides were created by Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley. All CS188 materials are available at http://ai.berkeley.edu.]
Reinforcement Learning
We still assume an MDP:
A set of states s ∈ S
A set of actions (per state) a ∈ A
A model T(s,a,s')
A reward function R(s,a,s')
Still looking for a policy π(s)
Evaluate a fixed policy π: PE on approx. MDP (model-based) or Value Learning (model-free)
Model-Free Learning
Model-free (temporal difference) learning
Experience the world through episodes: (s, a, r, s', a', r', s'', a'', …)
Update estimates after each transition (s, a, r, s')
Caveats:
You have to explore enough
You have to eventually make the learning rate small enough
… but not decrease it too quickly
Basically, in the limit, it doesn’t matter how you select actions (!)
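A minimal Python sketch (not from the slides) of the tabular temporal-difference Q-update performed after each transition; the constants ALPHA and GAMMA and the helper actions_in are illustrative assumptions, and per the caveats above ALPHA should be decayed over time in a real learner.

from collections import defaultdict

ALPHA = 0.1   # learning rate; should shrink over time, but not too quickly
GAMMA = 0.9   # discount factor

Q = defaultdict(float)   # Q[(state, action)] -> current estimate

def q_update(s, a, r, s_next, actions_in):
    """One temporal-difference update after observing the transition (s, a, r, s')."""
    sample = r + GAMMA * max((Q[(s_next, a2)] for a2 in actions_in(s_next)), default=0.0)
    Q[(s, a)] = (1 - ALPHA) * Q[(s, a)] + ALPHA * sample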
[Demo: Q-learning – auto – cliff grid (L11D1)]
Video of Demo Q-Learning Auto Cliff Grid
Exploration vs. Exploitation
How to Explore?
Several schemes for forcing exploration
Simplest: random actions (ε-greedy)
Every time step, flip a coin
With (small) probability ε, act randomly
With (large) probability 1-ε, act on current policy
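A minimal sketch of ε-greedy action selection, assuming a Q table keyed by (state, action); the function name and the default epsilon are illustrative choices, not from the slides.

import random

def epsilon_greedy(state, legal_actions, Q, epsilon=0.05):
    """With probability epsilon act randomly; otherwise act greedily w.r.t. the current Q-values."""
    if random.random() < epsilon:
        return random.choice(legal_actions)
    return max(legal_actions, key=lambda a: Q.get((state, a), 0.0))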
Exploration function
Takes a value estimate u and a visit count n, and returns an optimistic utility, e.g. f(u, n) = u + k/n
Regular Q-Update: Q(s,a) ← (1-α) Q(s,a) + α [R(s,a,s') + γ max_a' Q(s',a')]
Modified Q-Update: Q(s,a) ← (1-α) Q(s,a) + α [R(s,a,s') + γ max_a' f(Q(s',a'), N(s',a'))]
Note: this propagates the “bonus” back to states that lead to unknown states as well!
[Demo: exploration – Q-learning – crawler – exploration function (L11D4)]
Video of Demo Q-learning – Exploration Function – Crawler
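A sketch of the modified Q-update above using the exploration function f(u, n) = u + k/n; the constant K, the visit-count table N, and the +1 in the denominator (to handle never-visited pairs) are illustrative assumptions.

K, ALPHA, GAMMA = 2.0, 0.1, 0.9
Q = {}   # Q[(s, a)] -> value estimate
N = {}   # N[(s, a)] -> visit count

def f(u, n):
    """Optimistic utility: rarely-tried pairs look better than their current estimate."""
    return u + K / (n + 1)   # +1 is an assumption to avoid dividing by zero for unvisited pairs

def q_update_exploration(s, a, r, s_next, actions_in):
    """Q(s,a) <- (1-alpha) Q(s,a) + alpha [r + gamma max_a' f(Q(s',a'), N(s',a'))]"""
    N[(s, a)] = N.get((s, a), 0) + 1
    sample = r + GAMMA * max(
        (f(Q.get((s_next, a2), 0.0), N.get((s_next, a2), 0)) for a2 in actions_in(s_next)),
        default=0.0)
    Q[(s, a)] = (1 - ALPHA) * Q.get((s, a), 0.0) + ALPHA * sample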
Regret
Even if you learn the optimal policy, you still make mistakes along the way!
Regret is a measure of your total mistake cost: the difference between your (expected) rewards, including youthful suboptimality, and optimal (expected) rewards
Minimizing regret goes beyond learning to be optimal – it requires optimally learning to be optimal
Example: random exploration and exploration functions both end up optimal, but random exploration has higher regret
Approximate Q-Learning
Generalizing Across States
Basic Q-Learning keeps a table of all q-values
[demo – RL pacman]
Example: Pacman
Let's say we discover through experience that this state is bad:
In naïve q-learning, we know nothing about this state:
Or even this one!
Using a feature representation, we can write a q function (or value function) for any state using a few weights:
V(s) = w1 f1(s) + w2 f2(s) + … + wn fn(s)
Q(s,a) = w1 f1(s,a) + w2 f2(s,a) + … + wn fn(s,a)
Disadvantage: states may share features but actually be very different in value!
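A sketch of a linear (feature-based) Q-function as a weighted sum of features; the feature names and numbers below are hypothetical Pacman-style examples, not from the slides.

def q_value(weights, features):
    """Q(s,a) = sum_i w_i * f_i(s,a), where `features` maps feature name -> f_i(s,a)."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

# Hypothetical example:
weights  = {"dist-to-food": -0.5, "num-ghosts-1-step-away": -10.0, "bias": 1.0}
features = {"dist-to-food": 0.3,  "num-ghosts-1-step-away": 1.0,   "bias": 1.0}
print(q_value(weights, features))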
Approximate Q-Learning
For a transition (s, a, r, s'), define: difference = [r + γ max_a' Q(s',a')] − Q(s,a)
Exact Q's: Q(s,a) ← Q(s,a) + α [difference]
Approximate Q's: wi ← wi + α [difference] fi(s,a)
Intuitive interpretation:
Adjust weights of active features
E.g., if something unexpectedly bad happens, blame the features that were on: disprefer all states with that state's features
[Demo: approximate Q-learning pacman (L11D10)]
Video of Demo Approximate Q-Learning -- Pacman
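A sketch of the approximate Q-learning update above: compute the difference for a transition, then shift every active feature's weight by α · difference · fi(s,a). The helper names (feature_fn, actions_in) and constants are illustrative assumptions.

ALPHA, GAMMA = 0.05, 0.9

def approx_q_value(weights, feats):
    """Linear Q-value: sum_i w_i * f_i(s,a)."""
    return sum(weights.get(k, 0.0) * v for k, v in feats.items())

def approx_q_update(weights, feature_fn, s, a, r, s_next, actions_in):
    """wi <- wi + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)] * fi(s,a)"""
    feats = feature_fn(s, a)
    q_sa = approx_q_value(weights, feats)
    q_next = max((approx_q_value(weights, feature_fn(s_next, a2)) for a2 in actions_in(s_next)),
                 default=0.0)
    difference = (r + GAMMA * q_next) - q_sa
    for name, value in feats.items():
        weights[name] = weights.get(name, 0.0) + ALPHA * difference * value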
Q-Learning and Least Squares
Linear Approximation: Regression*
[Figure: linear regression with one feature and with two features]
Prediction: ŷ = w0 + w1 f1(x)
Prediction: ŷ = w0 + w1 f1(x) + w2 f2(x)
Optimization: Least Squares*
[Figure: observations y versus predictions ŷ; the gap between them is the error or “residual”]
total error = Σi (yi − ŷi)² = Σi (yi − Σk wk fk(xi))²
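A minimal numpy sketch of fitting weights by least squares; the feature matrix and target values below are made-up numbers, and the first column acts as a bias feature.

import numpy as np

F = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # rows: feature vectors f(x)
y = np.array([1.0, 2.1, 2.9, 4.2])                              # observed target values

w, *_ = np.linalg.lstsq(F, y, rcond=None)   # weights minimizing sum_i (y_i - w . f(x_i))^2
print(w, F @ w)                             # fitted weights and the resulting predictions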
Minimizing Error*
Imagine we had only one point x, with features f(x), target value y, and weights w:
error(w) = ½ (y − Σk wk fk(x))²
∂error(w) / ∂wm = −(y − Σk wk fk(x)) fm(x)
wm ← wm + α (y − Σk wk fk(x)) fm(x)
Approximate q update: wm ← wm + α [r + γ max_a' Q(s',a') − Q(s,a)] fm(s,a), where the “target” r + γ max_a' Q(s',a') plays the role of y and the “prediction” Q(s,a) plays the role of Σk wk fk(x)
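A sketch of one gradient step on that single-point squared error; it has exactly the shape of the approximate Q-update, with y as the target and w·f(x) as the prediction. The names and numbers are illustrative.

ALPHA = 0.1

def gradient_step(weights, feats, y):
    """wm <- wm + alpha * (y - prediction) * fm(x), which reduces (y - w.f(x))^2 / 2."""
    prediction = sum(weights.get(k, 0.0) * v for k, v in feats.items())
    for name, value in feats.items():
        weights[name] = weights.get(name, 0.0) + ALPHA * (y - prediction) * value
    return weights

print(gradient_step({"f1": 0.0, "bias": 0.0}, {"f1": 2.0, "bias": 1.0}, y=3.0))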
Overfitting: Why Limiting Capacity Can Help*
[Figure: a degree-15 polynomial fit oscillates wildly between the training points, illustrating overfitting]
Policy Search
Policy Search
Problem: often the feature-based policies that work well (win games, maximize utilities) aren't the ones that approximate V / Q best
E.g. your value functions from project 2 were probably horrible estimates of future rewards, but they still produced good decisions
Q-learning's priority: get Q-values close (modeling)
Action selection priority: get ordering of Q-values right (prediction)
We'll see this distinction between modeling and prediction again later in the course
Solution: learn policies that maximize rewards, not the values that predict them
Policy search: start with an ok solution (e.g. Q-learning) then fine-tune by hill climbing on feature weights
Policy Search
Simplest policy search:
Start with an initial linear value function or Q-function
Nudge each feature weight up and down and see if your policy is better than before (sketched in code below)
Problems:
How do we tell the policy got better?
Need to run many sample episodes!
If there are a lot of features, this can be impractical
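A sketch of the naive hill-climbing procedure above, assuming a hypothetical evaluate_policy helper that runs many sample episodes and returns an average reward; the step size and loop structure are illustrative and show why this is expensive with many features.

def hill_climb(weights, evaluate_policy, step=0.1, iterations=100):
    """Nudge each weight up and down; keep any change that makes the evaluated policy better."""
    best_score = evaluate_policy(weights)           # requires running many sample episodes
    for _ in range(iterations):
        for name in list(weights):
            for delta in (+step, -step):
                candidate = dict(weights)
                candidate[name] += delta
                score = evaluate_policy(candidate)  # expensive: more episodes per candidate
                if score > best_score:
                    weights, best_score = candidate, score
    return weights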