Lec 11
Reinforcement Learning II
o Caveats:
o You have to explore enough
o You have to eventually make the learning rate
small enough
o … but not decrease it too quickly
o Basically, in the limit, it doesn’t matter how you select actions (!)
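One common way to make the learning-rate caveats precise (an illustrative note, not from the slide): the sequence of learning rates $\alpha_t$ should shrink, but not too quickly, e.g. $\alpha_t = 1/t$:

    % Robbins-Monro-style step-size conditions (illustrative, not from the slide)
    \sum_{t=1}^{\infty} \alpha_t = \infty
    \qquad \text{and} \qquad
    \sum_{t=1}^{\infty} \alpha_t^2 < \infty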
o In this case:
o Learner makes choices!
o Fundamental tradeoff: exploration vs. exploitation
o This is NOT offline planning! You actually take actions in the world
and find out what happens…
Exploration vs. Exploitation
How to Explore?
o Several schemes for forcing exploration
o Simplest: random actions (ε-greedy)
o Every time step, flip a coin
o With (small) probability ε, act randomly
o With (large) probability 1−ε, act on current policy
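A minimal sketch of ε-greedy action selection (the q_values dictionary keyed by (state, action) and the argument names are illustrative assumptions, not the course code):

    import random

    def epsilon_greedy_action(state, legal_actions, q_values, epsilon=0.05):
        # With small probability epsilon, explore: pick a uniformly random action.
        if random.random() < epsilon:
            return random.choice(legal_actions)
        # Otherwise exploit: act greedily with respect to the current Q-values.
        return max(legal_actions, key=lambda a: q_values.get((state, a), 0.0))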
o Exploration function
o Takes a value estimate u and a visit count n, and
returns an optimistic utility, e.g. $f(u, n) = u + k/n$
Regular Q-Update: $Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\,[R(s,a,s') + \gamma \max_{a'} Q(s',a')]$
Modified Q-Update: $Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\,[R(s,a,s') + \gamma \max_{a'} f(Q(s',a'), N(s',a'))]$
o Note: this propagates the “bonus” back to states that lead to unknown states
as well! [Demo: exploration – Q-learning – crawler – exploration function (L10D4)]
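A hedged sketch of the modified update with a count-based bonus (the dict-based Q-table, the constant k, and the variant of f that avoids division by zero are illustrative assumptions, not the course code):

    def exploration_q_update(q, n_counts, s, a, r, s_prime, next_actions,
                             alpha=0.5, gamma=0.9, k=2.0):
        # Count this visit to (s, a).
        n_counts[(s, a)] = n_counts.get((s, a), 0) + 1

        # Exploration function: a variant of f(u, n) = u + k/n that stays
        # finite for never-visited pairs (n = 0).
        def f(u, n):
            return u + k / (n + 1)

        # Modified Q-update: back up optimistic utilities of the successor state.
        best_next = max((f(q.get((s_prime, a2), 0.0), n_counts.get((s_prime, a2), 0))
                         for a2 in next_actions), default=0.0)
        sample = r + gamma * best_next
        q[(s, a)] = (1 - alpha) * q.get((s, a), 0.0) + alpha * sample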
Video of Demo Q-learning – Exploration Function –
Crawler
Regret
o Even if you learn the optimal
policy, you still make mistakes
along the way!
o Regret is a measure of your total
mistake cost: the difference
between your (expected) rewards,
including youthful suboptimality,
and optimal (expected) rewards
o Minimizing regret goes beyond
learning to be optimal – it requires
optimally learning to be optimal
o Example: random exploration and
exploration functions both end up
optimal, but random exploration
has higher regret
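One illustrative way to formalize this (the notation is an assumption, not from the slide): write $r_t^*$ for the reward an optimal agent would collect at step $t$ and $r_t$ for the reward the learner actually collects; then the total regret after $T$ steps is

    % Total regret: cumulative gap between optimal and actual expected rewards
    \mathrm{Regret}_T = \sum_{t=1}^{T} \big( \mathbb{E}[r_t^*] - \mathbb{E}[r_t] \big)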
Reinforcement Learning -- Overview
o Passive Reinforcement Learning (= how to learn from experiences)
o Model-based Passive RL
o Learn the MDP model from experiences, then solve the MDP
o Model-free Passive RL
o Forego learning the MDP model, directly learn V or Q:
o Value learning – learns value of a fixed policy; 2 approaches: Direct Evaluation & TD Learning
o Q learning – learns Q values of the optimal policy (uses a Q version of TD Learning)
Approximate Q-Learning
o Q-learning with linear Q-functions: $Q(s,a) = w_1 f_1(s,a) + w_2 f_2(s,a) + \dots + w_n f_n(s,a)$
$\text{difference} = \big[ r + \gamma \max_{a'} Q(s',a') \big] - Q(s,a)$
Exact Q’s: $Q(s,a) \leftarrow Q(s,a) + \alpha\,[\text{difference}]$
Approximate Q’s: $w_i \leftarrow w_i + \alpha\,[\text{difference}]\,f_i(s,a)$
o Intuitive interpretation:
o Adjust weights of active features
o E.g., if something unexpectedly bad happens, blame the features that were
on: disprefer all states with that state’s features
[Demo: approximate Q-learning pacman (L11D8)]
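A compact sketch of the approximate Q-update above (the dict of weights and a feature_fn returning a feature-name → value dict are illustrative assumptions, not the course code):

    def approx_q_update(weights, feature_fn, s, a, r, s_prime, next_actions,
                        alpha=0.05, gamma=0.9):
        # Linear Q-function: Q(s,a) = sum_i w_i * f_i(s,a)
        def q(state, action):
            feats = feature_fn(state, action)
            return sum(weights.get(name, 0.0) * val for name, val in feats.items())

        # difference = [r + gamma * max_a' Q(s',a')] - Q(s,a)
        best_next = max((q(s_prime, a2) for a2 in next_actions), default=0.0)
        difference = (r + gamma * best_next) - q(s, a)

        # Approximate Q's: nudge the weight of every active feature by the difference.
        for name, val in feature_fn(s, a).items():
            weights[name] = weights.get(name, 0.0) + alpha * difference * val
        return weights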
Video of Demo Approximate Q-Learning --
Pacman
DeepMind Atari (©Two Minute Papers)
approximate Q-learning with neural nets
Q-Learning and Least Squares
Linear Approximation: Regression
[Figure: linear regression with one feature (left) and two features (right)]
Prediction: $\hat{y} = w_0 + w_1 f_1(x)$
Prediction: $\hat{y}_i = w_0 + w_1 f_1(x_i) + w_2 f_2(x_i)$
Optimization: Least Squares
$\text{total error} = \sum_i \big( y_i - \hat{y}_i \big)^2 = \sum_i \Big( y_i - \sum_k w_k f_k(x_i) \Big)^2$
[Figure: a fitted line, with the vertical gap between an observation $y_i$ and the prediction $\hat{y}_i$ labeled “error” or “residual”]
Minimizing Error
Imagine we had only one point x, with features f(x), target value y, and weights w:
$\text{error}(w) = \tfrac{1}{2} \Big( y - \sum_k w_k f_k(x) \Big)^2$
$\frac{\partial\, \text{error}(w)}{\partial w_m} = - \Big( y - \sum_k w_k f_k(x) \Big) f_m(x)$
$w_m \leftarrow w_m + \alpha \Big( y - \sum_k w_k f_k(x) \Big) f_m(x)$
Approximate Q-update explained:
$w_m \leftarrow w_m + \alpha \Big[ \underbrace{r + \gamma \max_{a} Q(s',a)}_{\text{“target”}} - \underbrace{Q(s,a)}_{\text{“prediction”}} \Big] f_m(s,a)$
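A tiny sketch of that one-point gradient step on a linear model (the dict layouts and names are illustrative assumptions):

    def one_point_update(weights, features, y, alpha=0.1):
        # weights and features are dicts keyed by feature name.
        y_hat = sum(weights.get(k, 0.0) * v for k, v in features.items())  # "prediction"
        residual = y - y_hat                                               # "target" - "prediction"
        # Gradient step on error(w) = 0.5 * (y - y_hat)^2:
        #   w_m <- w_m + alpha * (y - y_hat) * f_m(x)
        for k, v in features.items():
            weights[k] = weights.get(k, 0.0) + alpha * residual * v
        return weights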
Overfitting: Why Limiting Capacity Can Help
[Figure: data points fit with a degree 15 polynomial, illustrating overfitting]
Reinforcement Learning -- Overview
o Passive Reinforcement Learning (= how to learn from experiences)
o Model-based Passive RL
o Learn the MDP model from experiences, then solve the MDP
o Model-free Passive RL
o Forego learning the MDP model, directly learn V or Q:
o Value learning – learns value of a fixed policy; 2 approaches: Direct Evaluation & TD Learning
o Q learning – learns Q values of the optimal policy (uses a Q version of TD Learning)
o Active Reinforcement Learning (= agent also needs to decide how to collect experiences)
o Key challenges:
o How to efficiently explore?
o How to trade off exploration vs. exploitation
o Applies to both model-based and model-free RL. In CS188 we’ll cover it only in the context of Q-learning
o Approximate Reinforcement Learning (= to handle large state spaces)
o Approximate Q-Learning
o Policy Search
Policy Search
o Problem: often the feature-based policies that work well (win games, maximize
utilities) aren’t the ones that approximate V / Q best
o E.g. your value functions from project 2 were probably horrible estimates of future rewards,
but they still produced good decisions
o Q-learning’s priority: get Q-values close (modeling)
o Action selection priority: get ordering of Q-values right (prediction)
o We’ll see this distinction between modeling and prediction again later in the course
o Solution: learn policies that maximize rewards, not the values that predict them
o Policy search: start with an ok solution (e.g. Q-learning) then fine-tune by hill
climbing on feature weights
Policy Search
o Simplest policy search:
o Start with an initial linear value function or Q-function
o Nudge each feature weight up and down and see if your policy is better than
before
o Problems:
o How do we tell the policy got better?
o Need to run many sample episodes!
o If there are a lot of features, this can be impractical
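A hedged sketch of this simplest hill-climbing loop (evaluate_policy, the step size, and the iteration budget are illustrative placeholders; evaluating a candidate means running many sample episodes, which is the expensive part noted above):

    import random

    def hill_climb_weights(weights, evaluate_policy, step=0.1, iterations=100):
        # evaluate_policy(weights) -> average return over many sample episodes.
        weights = list(weights)
        best_score = evaluate_policy(weights)
        for _ in range(iterations):
            i = random.randrange(len(weights))          # pick one feature weight
            for delta in (step, -step):                 # nudge it up, then down
                candidate = list(weights)
                candidate[i] += delta
                score = evaluate_policy(candidate)
                if score > best_score:                  # keep the nudge only if the
                    weights, best_score = candidate, score  # policy actually got better
                    break
        return weights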