
CS 6511: Artificial Intelligence

Reinforcement Learning

Amrinder Arora
The George Washington University
[Original version of these slides was created by Dan Klein and Pieter Abbeel for Intro to AI at UC Berkeley. http://ai.berkeley.edu]
Reinforcement Learning

[Figure: the agent-environment loop. The agent observes a state s and a reward r from the environment, and sends back an action a.]

§ Basic idea:
§ Receive feedback in the form of rewards
§ Agent’s utility is defined by the reward function
§ Must (learn to) act so as to maximize expected rewards
§ All learning is based on observed samples of outcomes!
Reinforcement Learning
§ Still assume a Markov decision process (MDP):
§ A set of states s ∈ S
§ A set of actions (per state) A
§ A model T(s,a,s’)
§ A reward function R(s,a,s’)
§ Still looking for a policy π(s)

§ New twist: don’t know T or R


§ I.e. we don’t know which states are good or what the actions do
§ Must actually try actions and states out to learn

Offline (MDPs) vs. Online (RL)

[Images: “Offline Solution” vs. “Online Learning”]

Two Broad Categories
§ Model Based – We will learn the MDP model (T, R, …)
§ Model Free – We learn the Q, V values directly

Model-Based Learning
§ Model-Based Idea:
§ Learn an approximate model based on experiences
§ Solve for values as if the learned model were correct

§ Step 1: Learn empirical MDP model


§ Count outcomes s’ for each s, a
§ Normalize to give an estimate of T(s,a,s’)
§ Discover each R(s,a,s’) when we experience (s, a, s’)

§ Step 2: Solve the learned MDP


§ For example, use value iteration, as before

Example: Model-Based Learning
Input Policy π: [gridworld figure: A above C; B, C, D in a row; E below C; arrows show the fixed policy]

Observed Episodes (Training):
Episode 1: B, east, C, -1; C, east, D, -1; D, exit, x, +10
Episode 2: B, east, C, -1; C, east, D, -1; D, exit, x, +10
Episode 3: E, north, C, -1; C, east, D, -1; D, exit, x, +10
Episode 4: E, north, C, -1; C, east, A, -1; A, exit, x, -10

Learned Model:
T(B, east, C) = 1.00
T(C, east, D) = 0.75
T(C, east, A) = 0.25

R(B, east, C) = -1
R(C, east, D) = -1
R(D, exit, x) = +10

Assume: γ = 1

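To make Step 1 concrete, here is a minimal Python sketch (the episode format and variable names are my own, not from the slides) that estimates the transition model and rewards from the four training episodes above by counting and normalizing; Step 2 would then run value iteration on the learned model.

from collections import Counter, defaultdict

# Each episode is a list of (state, action, next_state, reward) transitions.
episodes = [
    [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("E", "north", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("E", "north", "C", -1), ("C", "east", "A", -1), ("A", "exit", "x", -10)],
]

counts = defaultdict(Counter)   # counts[(s, a)][s'] = how often s' followed (s, a)
R_hat = {}                      # R_hat[(s, a, s')] = observed reward (deterministic here)

for episode in episodes:
    for s, a, s2, r in episode:
        counts[(s, a)][s2] += 1
        R_hat[(s, a, s2)] = r

T_hat = {}                      # normalize counts into estimated probabilities
for (s, a), outcomes in counts.items():
    total = sum(outcomes.values())
    for s2, n in outcomes.items():
        T_hat[(s, a, s2)] = n / total

print(T_hat[("C", "east", "D")])    # 0.75, matching the learned model above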
Model-Free Learning
§ A key approach for learning in MDP settings
§ Here we don’t try to learn the T and R values; we learn the Q and V values directly.

§ Subtopics
§ Passive RL – evaluate a fixed policy, i.e., learn its V/Q values
§ Active RL – learn the policy as well
§ Q-Learning – learn the Q values by trying actions, using an exponential-moving-average style update
Passive Reinforcement Learning

Exponential Moving Average
§ Exponential moving average
§ The running interpolation update: x̄n = (1 - α) · x̄n-1 + α · xn

§ Makes recent samples more important: older samples are weighted by increasing powers of (1 - α)

§ Forgets about the past (distant past values were wrong anyway)

§ Decreasing learning rate (alpha) can give converging averages

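A tiny Python version of this update (running_avg, sample, and alpha are illustrative names, not from the slides):

def ema_update(running_avg, sample, alpha=0.1):
    """Exponential moving average: interpolate between the old estimate and the new sample."""
    return (1 - alpha) * running_avg + alpha * sample

estimate = 0.0
for sample in [4.0, 6.0, 5.0, 7.0]:      # made-up stream of samples
    estimate = ema_update(estimate, sample)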
Passive Reinforcement Learning
§ Simplified task: policy evaluation
§ Input: a fixed policy π(s)
§ You don’t know the transitions T(s,a,s’)
§ You don’t know the rewards R(s,a,s’)
§ Goal: learn the state values

§ In this case:
§ Learner is “along for the ride”
§ No choice about what actions to take
§ Just execute the policy and learn from experience
§ This is NOT offline planning! You actually take actions in the world.

Direct Evaluation
§ Goal: Compute values for each state under π

§ Idea: Average together observed sample values


§ Act according to π
§ Every time you visit a state, write down what the
sum of discounted rewards turned out to be
§ Average those samples

§ This is called direct evaluation

Example: Direct Evaluation
Input Policy π: [same gridworld and policy as in the model-based example]

Observed Episodes (Training):
Episode 1: B, east, C, -1; C, east, D, -1; D, exit, x, +10
Episode 2: B, east, C, -1; C, east, D, -1; D, exit, x, +10
Episode 3: E, north, C, -1; C, east, D, -1; D, exit, x, +10
Episode 4: E, north, C, -1; C, east, A, -1; A, exit, x, -10

Output Values:
V(A) = -10
V(B) = +8
V(C) = +4
V(D) = +10
V(E) = -2

Assume: γ = 1

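A minimal sketch of direct evaluation in Python, reusing the episodes list from the model-based sketch above (the backward pass and variable names are my own): it records the discounted return observed from each visited state and averages them.

from collections import defaultdict

gamma = 1.0
returns = defaultdict(list)        # state -> list of returns observed from that state onward

for episode in episodes:           # the episodes list defined in the model-based sketch
    G = 0.0
    for s, a, s2, r in reversed(episode):   # work backwards: return = reward + gamma * later return
        G = r + gamma * G
        returns[s].append(G)

V = {s: sum(gs) / len(gs) for s, gs in returns.items()}
# Gives V(A) = -10, V(B) = +8, V(C) = +4, V(D) = +10, V(E) = -2, as in the output values above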
Problems with Direct Evaluation
§ What’s good about direct evaluation?
§ It’s easy to understand
§ It doesn’t require any knowledge of T, R
§ It eventually computes the correct average values, using just sample transitions

§ What’s bad about it?
§ It wastes information about state connections
§ Each state must be learned separately
§ So, it takes a long time to learn

[Output values from the previous slide: V(A) = -10, V(B) = +8, V(C) = +4, V(D) = +10, V(E) = -2.
If B and E both go to C under this policy, how can their values be different?]

Why Can’t We Use Policy Evaluation?

§ Simplified Bellman updates calculate V for a fixed policy:
§ Each round, replace V with a one-step-look-ahead layer over V:
Vk+1(s) ← Σ_s’ T(s, π(s), s’) [ R(s, π(s), s’) + γ Vk(s’) ]

[Diagram: the one-step look-ahead tree from s through the chosen action π(s) to the successor states s’.]

§ This approach fully exploited the connections between the states

§ Unfortunately, we need T and R to do it!

§ Key question: how can we do this update to V without knowing T and R?

§ In other words, how do we take a weighted average without knowing the weights?
Sample-Based Policy Evaluation?
§ We want to improve our estimate of V by computing these averages:
Vk+1(s) ← Σ_s’ T(s, π(s), s’) [ R(s, π(s), s’) + γ Vk(s’) ]

§ Idea: Take samples of outcomes s’ (by doing the action!) and average:
sample_i = R(s, π(s), s_i’) + γ Vk(s_i’)
Vk+1(s) ← (1/n) Σ_i sample_i

[Diagram: from s, following π(s), we observe sampled successor states s_1’, s_2’, s_3’.]

Almost! But we can’t rewind time to get sample after sample from state s.
Active Reinforcement Learning

Active Reinforcement Learning
§ Full reinforcement learning: optimal policies (like value iteration)
§ You don’t know the transitions T(s,a,s’)
§ You don’t know the rewards R(s,a,s’)
§ You choose the actions now
§ Goal: learn the optimal policy / values

§ In this case:
§ Learner makes choices!
§ Fundamental tradeoff: exploration vs. exploitation
§ This is NOT offline planning! You actually take actions in the world and
find out what happens…

Q-Value Iteration
§ Value iteration: find successive (depth-limited) values
§ Start with V0(s) = 0, which we know is right
§ Given Vk, calculate the depth k+1 values for all states:
Vk+1(s) ← max_a Σ_s’ T(s,a,s’) [ R(s,a,s’) + γ Vk(s’) ]

§ But Q-values are more useful, so compute them instead

§ Start with Q0(s,a) = 0, which we know is right
§ Given Qk, calculate the depth k+1 q-values for all q-states:
Qk+1(s,a) ← Σ_s’ T(s,a,s’) [ R(s,a,s’) + γ max_a’ Qk(s’,a’) ]

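For reference, one round of this update can be written as a short Python function; the interfaces here (a states collection, an actions(s) function, T(s, a) returning (s’, probability) pairs, and R(s, a, s’)) are assumptions made for the sketch, not anything fixed by the slides.

def q_value_iteration_step(Q, states, actions, T, R, gamma):
    """Compute Q_{k+1}(s,a) = sum_{s'} T(s,a,s') * [R(s,a,s') + gamma * max_{a'} Q_k(s',a')]."""
    new_Q = {}
    for s in states:
        for a in actions(s):
            total = 0.0
            for s2, prob in T(s, a):
                best_next = max((Q.get((s2, a2), 0.0) for a2 in actions(s2)), default=0.0)
                total += prob * (R(s, a, s2) + gamma * best_next)
            new_Q[(s, a)] = total
    return new_Q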
Q-Learning
§ We’d like to do Q-value updates to each Q-state:
Qk+1(s,a) ← Σ_s’ T(s,a,s’) [ R(s,a,s’) + γ max_a’ Qk(s’,a’) ]

§ But we can’t compute this update without knowing T, R

§ Instead, compute the average as we go

§ Receive a sample transition (s,a,r,s’)
§ This sample suggests Q(s,a) ≈ r + γ max_a’ Q(s’,a’)

§ But we want to average over results from (s,a) (Why?)

§ So keep a running average:
Q(s,a) ← (1 - α) Q(s,a) + α [ r + γ max_a’ Q(s’,a’) ]

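The running-average update translates into a few lines of Python; the dictionary Q-table, the legal_actions helper, and the particular alpha and gamma values are illustrative choices rather than anything prescribed by the slides.

from collections import defaultdict

Q = defaultdict(float)            # Q[(state, action)], defaults to 0
alpha, gamma = 0.1, 0.9

def q_learning_update(s, a, r, s2, legal_actions):
    """Blend the old Q(s,a) with the sampled value r + gamma * max_a' Q(s',a')."""
    sample = r + gamma * max((Q[(s2, a2)] for a2 in legal_actions(s2)), default=0.0)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample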
Q-Learning Properties
§ Amazing result: Q-learning converges to optimal policy -- even
if you’re acting suboptimally!

§ This is called off-policy learning

§ Caveats:
§ You have to explore enough
§ You have to eventually make the learning rate
small enough
§ … but not decrease it too quickly
§ Basically, in the limit, it doesn’t matter how you select actions (!)
Exploration vs. Exploitation

How to Explore?
§ Several schemes for forcing exploration
§ Simplest: random actions (ε-greedy)
§ Every time step, flip a coin
§ With (small) probability ε, act randomly
§ With (large) probability 1-ε, act on current policy

§ Problems with random actions?

§ You do eventually explore the space, but keep thrashing around once learning is done
§ One solution: lower ε over time
§ Another solution: exploration functions

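A quick sketch of ε-greedy action selection in Python, assuming the Q table from the Q-learning sketch above and the same legal_actions helper (both are assumptions of the sketch):

import random

def epsilon_greedy_action(s, legal_actions, epsilon=0.1):
    """With probability epsilon act randomly; otherwise act greedily on the current Q values."""
    actions = legal_actions(s)
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])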
Exploration Functions
§ When to explore?
§ Random actions: explore a fixed amount
§ Better idea: explore areas whose badness is not (yet) established, eventually stop exploring

§ Exploration function
§ Takes a value estimate u and a visit count n, and returns an optimistic utility, e.g. f(u,n) = u + k/n

Regular Q-Update: Q(s,a) ← (1 - α) Q(s,a) + α [ R(s,a,s’) + γ max_a’ Q(s’,a’) ]

Modified Q-Update: Q(s,a) ← (1 - α) Q(s,a) + α [ R(s,a,s’) + γ max_a’ f(Q(s’,a’), N(s’,a’)) ]

§ Note: this propagates the “bonus” back to states that lead to unknown states as well!
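A sketch of the modified update in Python, reusing Q, alpha, gamma, and legal_actions from the earlier sketches and adding a visit-count table N; the bonus constant k is arbitrary, and k/(n+1) is used instead of k/n only to avoid dividing by zero for unvisited pairs.

from collections import defaultdict

N = defaultdict(int)              # N[(state, action)] = visit count

def exploration_f(u, n, k=1.0):
    """Optimistic utility: the fewer visits, the larger the exploration bonus."""
    return u + k / (n + 1)

def q_update_with_exploration(s, a, r, s2, legal_actions):
    N[(s, a)] += 1
    sample = r + gamma * max((exploration_f(Q[(s2, a2)], N[(s2, a2)])
                              for a2 in legal_actions(s2)), default=0.0)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample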
Regret
§ Even if you learn the optimal policy, you still
make mistakes along the way!
§ Regret is a measure of your total mistake
cost: the difference between your
(expected) rewards, including youthful
suboptimality, and optimal (expected)
rewards
§ Minimizing regret goes beyond learning to
be optimal – it requires optimally learning to
be optimal
§ Example: random exploration and
exploration functions both end up optimal,
but random exploration has higher regret
Generalizing Across States
§ Basic Q-Learning keeps a table of all q-values

§ In realistic situations, we cannot possibly learn


about every single state!
§ Too many states to visit them all in training
§ Too many states to hold the q-tables in memory

§ Instead, we want to generalize:


§ Learn about some small number of training states from
experience
§ Generalize that experience to new, similar situations
§ This is a fundamental idea in machine learning, and we’ll
see it over and over again

Example: Pacman
Let’s say we discover through experience that this state is bad. In naïve Q-learning, we know nothing about this state. Or even this one!

Feature-Based Representations

§ Solution: describe a state using a vector of


features (properties)
§ Features are functions from states to real numbers
(often 0/1) that capture important properties of the
state
§ Example features:
§ Distance to closest ghost
§ Distance to closest dot
§ Number of ghosts
§ 1 / (dist to dot)²
§ Is Pacman in a tunnel? (0/1)
§ …… etc.
§ Is it the exact state on this slide?
§ Can also describe a q-state (s, a) with features (e.g.
action moves closer to food)

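To illustrate what a feature function might look like, here is a hypothetical Pacman-style extractor in Python; the state methods (successor_position, closest_ghost_distance, and so on) are invented for the sketch and do not correspond to any real API.

def extract_features(state, action):
    """Map a (state, action) pair to a small dictionary of real-valued features."""
    next_pos = state.successor_position(action)                 # hypothetical state API
    return {
        "bias": 1.0,
        "dist-to-closest-ghost": state.closest_ghost_distance(next_pos),
        "inverse-dist-to-dot": 1.0 / (1.0 + state.closest_dot_distance(next_pos)),
        "num-ghosts": float(state.num_ghosts()),
    }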
Linear Value Functions

§ Using a feature representation, we can write a q function (or value function) for any state using a few weights:
V(s) = w1 f1(s) + w2 f2(s) + … + wn fn(s)
Q(s,a) = w1 f1(s,a) + w2 f2(s,a) + … + wn fn(s,a)

§ Advantage: our experience is summed up in a few powerful numbers

§ Disadvantage: states may share features but actually be very different in value!

Approximate Q-Learning

§ Q-learning with linear Q-functions:

Given a transition (s,a,r,s’), let difference = [ r + γ max_a’ Q(s’,a’) ] - Q(s,a)

Exact Q’s: Q(s,a) ← Q(s,a) + α [difference]

Approximate Q’s: wi ← wi + α [difference] fi(s,a)

§ Intuitive interpretation:
§ Adjust weights of active features
§ E.g., if something unexpectedly bad happens, blame the features that were on: disprefer all states with that state’s features

§ Formal justification: online least squares


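A sketch of the weight update in Python, reusing the hypothetical extract_features above plus alpha, gamma, and legal_actions from the earlier sketches; keeping the weights in a dictionary keyed by feature name is simply one convenient choice.

from collections import defaultdict

weights = defaultdict(float)      # one weight per feature name

def approx_q(s, a):
    """Linear Q-value: dot product of weights and features."""
    return sum(weights[name] * value for name, value in extract_features(s, a).items())

def approx_q_update(s, a, r, s2, legal_actions):
    best_next = max((approx_q(s2, a2) for a2 in legal_actions(s2)), default=0.0)
    difference = (r + gamma * best_next) - approx_q(s, a)
    for name, value in extract_features(s, a).items():
        weights[name] += alpha * difference * value   # credit or blame only the active features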
Example: Q-Pacman

Q-Learning and Least Squares

Linear Approximation: Regression*
[Two regression plots: fitting a linear prediction to observed points using one feature (a line) and using two features (a plane); each panel shows the resulting prediction.]

Optimization: Least Squares*

[Figure: a scatter plot with a fitted line; the vertical distance between an observation and its prediction is the error, or “residual”.]
Minimizing Error*
Imagine we had only one point x, with features f(x), target value y, and weights w. Minimizing the squared error in w leads to the update sketched below.

Approximate q update explained: the Q-learning “target” r + γ max_a’ Q(s’,a’) plays the role of y, and the current “prediction” Q(s,a) plays the role of the linear prediction w · f(x).
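Written out (this is the standard one-point least-squares gradient step; the notation matches the linear Q-function above):

error(w) = 1/2 ( y - Σ_k wk fk(x) )²
∂ error(w) / ∂ wm = - ( y - Σ_k wk fk(x) ) fm(x)
wm ← wm + α ( y - Σ_k wk fk(x) ) fm(x)

Substituting the Q-learning target y = r + γ max_a’ Q(s’,a’) and the prediction Q(s,a) = Σ_k wk fk(s,a) gives exactly the approximate Q-update:

wm ← wm + α [ r + γ max_a’ Q(s’,a’) - Q(s,a) ] fm(s,a)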
Credit Assignment Problem
§ It is not easy to identify the credit due to each individual move in a Chess game
§ If credit is only given at the end of the game, then:
§ Many good moves can get negative credit if the end result is a loss
§ Many bad moves can get positive credit if the end result is a win
§ Many, many games need to be played before learning really happens
§ One solution is to give rewards early on (reward shaping)
§ If we try to give rewards early on, then:
§ The agent will maximize those shaped rewards, not the actual outcome

Summary

§ Introduction
§ What is Reinforcement Learning
§ Handling MDPs, when we don't know T and R functions.
§ Two broad categories of Reinforcement Learning (RL)
§ Model Based - Simply try and learn T and R values. Then, calculate Q, V as usual.
§ Model Free - Don't worry about T and R values. Learn Q, V values directly.
§ Q-Learning: Algorithm to learn Q values by trying. Update Q value using something like exponential moving average

§ [A useful background technique - Exponential Moving Average]


§ Exploration vs. exploitation in RL
§ Quantify exploration vs. exploitation
§ How much exploration to do - how to make it "time" based (Like in case of simulated annealing)
§ How to make it time based for each state, action combination (Exploration can go down with time)
§ Advanced Topics
§ What is credit assignment problem in RL?
§ Is it more of a problem in case of episodic environment or non-episodic environments?
§ How we can use reward shaping (and what are the problems associated with it)?
§ [Not discussed in class] How can we make a generic technique for reward shaping that is not environment
based?

Conclusion

§ We’re done with Part I: Search and Planning!

§ We’ve seen how AI methods can solve


problems in:
§ Search
§ Constraint Satisfaction Problems
§ Games
§ Markov Decision Problems
§ Reinforcement Learning

§ Next up: Part II: Uncertainty and Learning!

