
Fall 2010 Graduate Course

Unit 5: Dynamic Learning
Chapter 6: MDP and Reinforcement Learning

Reinforcement Learning

October 12, 2010

Byoung-Tak Zhang
School of Computer Science and Engineering &
Cognitive Science and Brain Science Programs
Seoul National University
http://bi.snu.ac.kr/~btzhang/
Overview

• Motivating Applications
– Learning robots
– Web agents
• Markov Decision Processes
– MDP
– POMDP
• Reinforcement Learning
– Adaptive Dynamic Programming (ADP)
– Temporal Difference (TD) Learning
– TD-Q Learning
Motivating Applications

• Generalized model learning for reinforcement learning
on a humanoid robot:
http://www.youtube.com/watch?v=mRpX9DFCdwI
• Autonomous spider learns to walk forward by
reinforcement learning:
http://www.youtube.com/watch?v=RZf8fR1SmNY&feature=related
• Reinforcement learning for a robotic soccer goalkeeper:
http://www.youtube.com/watch?v=CIF2SBVY-J0&feature=related


WAIR: Reinforcement Learning Web Agent
for Personalized Information Filtering

[Figure: the WAIR agent loop. The agent observes state_i (user profile, document filtering), takes action_i (learn, modify profile), delivers filtered documents to the user, and receives reward_{i+1} as relevance feedback.]

Zhang, B.-T. and Seo, Y.-W., Applied Artificial Intelligence, 15(7):665-685, 2001
Reinforcement Learning is…

… learning from trial-and-error and reward
by interaction with an environment.
Markov Decision Processes (MDPs)

• A simple, general framework for modeling
sequential decision making in a stochastic
environment
• Early work by Bellman and Howard in the 1950s
• Currently a popular model in OR and AI
• Used to model problems in robotics, finance,
agriculture, etc.
Markov Decision Processes (MDPs)

• The decision-making agent starts in some state and
chooses an action according to its policy
• This determines an immediate reward and causes a
stochastic transition to the next state
• The aim is to find the policy that leads to the highest total
reward over T time steps (finite horizon)
• Because of the Markov property, the policy does not
have to remember previous states
The Formal Decision Problem - MDP

• Given <S, A, P, R, T, K>
– S is a finite state set (with start state s0)
– A is a finite action set
– P(s' | s, a) is a table of transition probabilities
– R(s, a, s') is a reward function
• Policy π(s, t) = a
• Is there a policy that yields total reward at least K
over finite horizon T?
• We will require that T < |S|
POMDP: Partially Observable MDP

• The agent’s observation is a function of the
current state
• Now you may need to remember previous
observations in order to act optimally
POMDP

• Given <S, A, P, R, Ω, O, T, K>


– Ω is a finite observation set
– O(o | s, a, s’) is a table of observation probabilities
• Policy π(o1, …, ot) = a
• Does there exist a policy that yields total
reward at least K over horizon T?
DEC-POMDP2

• Now two cooperating (decentralized) agents
with different observation functions
DEC-POMDP2

• Given <S, A, P, R, Ω1, Ω2, O, T, K>
– Ω1 and Ω2 are finite observation sets
– O(o1, o2 | s, a1, a2, s’) is a table of observation probabilities
• Local policies π1(o11, …, o1t) = a1, π2(o21, …, o2t) = a2
• Joint policy <π1, π2>
• Does there exist a joint policy that yields total reward at
least K over horizon T?

Note: DEC-POMDP1 = POMDP


DEC-MDP2

• DEC-POMDP2 with the requirement that the state is
uniquely determined by the tuple of observations:
J : Ω1 × Ω2 → S

Note: DEC-MDP1 = MDP


MDP

• Can use dynamic programming to “back up” values
from the end:

V(s) = max_{a∈A} ( R(s, a) + ∑_{s'∈S} P(s' | s, a) V(s') )
Elements of Reinforcement Learning

[Figure: a grid world with absorbing goal state G. Left panel, r(state, action): immediate reward values, 0 for every move except 100 for actions entering G. Right panel, V*(state): optimal state values, e.g. 100, 90, 81 for states one, two, and three steps from G.]

• Value function: maps states to state values

V(s) ≡ r(t) + γ r(t+1) + γ^2 r(t+2) + ...

Discount factor γ ∈ [0, 1) (here 0.9)
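As a quick worked check of the figure's numbers (assuming the γ = 0.9 grid described above): a state whose best action enters G directly has V* = 100; a state two steps away collects 0 now and 100 next, so V* = 0 + 0.9 · 100 = 90; a state three steps away has V* = 0 + 0.9 · 0 + 0.9^2 · 100 = 81.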


Elements of MDP

• A set S of states {s1, s2, s3, …}
• A set A of actions {a1, a2, a3, …}
• A transition function T: S × A → S (deterministic)
or T: S × A × S → [0, 1] (stochastic)
• A reward function R: S × A → Real
or R: S × A × S → Real
• A policy π: S → A (deterministic)
or π: S × A → [0, 1] (stochastic)
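As a concrete illustration, here is a minimal tabular encoding of these elements in Python. The 2-state, 2-action numbers are made up for illustration; the later sketches assume the same P[s, a, s'] / R array layout.

import numpy as np

n_states, n_actions = 2, 2

# Stochastic transition function T: S x A x S -> [0, 1], stored as P[s, a, s'].
P = np.zeros((n_states, n_actions, n_states))
P[0, 0] = [0.9, 0.1]   # in state 0, action 0: stay with prob. 0.9, move with prob. 0.1
P[0, 1] = [0.2, 0.8]
P[1, 0] = [0.0, 1.0]
P[1, 1] = [0.5, 0.5]

# Reward function R: S x A -> Real.
R = np.array([[0.0, 0.0],
              [1.0, 1.0]])

# Deterministic policy pi: S -> A (one action index per state).
pi = np.array([1, 0])

# Sanity check: each (s, a) row of P is a probability distribution over next states.
assert np.allclose(P.sum(axis=2), 1.0)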
Optimality Criteria

Suppose an agent receives a reward rt at time t. Then optimal
behaviour might:

• Maximise the sum of expected future rewards: ∑_t rt
• Maximise over a finite horizon: ∑_{t=0}^{T} rt
• Maximise over an infinite horizon: ∑_{t=0}^{∞} rt
• Maximise over a discounted infinite horizon: ∑_{t=0}^{∞} γ^t rt
• Maximise average reward: lim_{n→∞} (1/n) ∑_{t=1}^{n} rt
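To make the differences concrete, a small sketch (the reward sequence and γ are arbitrary illustration values):

rewards = [0.0, 1.0, 0.0, 2.0, 1.0]   # r_0 ... r_4, illustration only
gamma = 0.9

finite_horizon = sum(rewards)                                  # sum over horizon T = 4
discounted = sum(gamma**t * r for t, r in enumerate(rewards))  # truncated discounted sum
average = sum(rewards) / len(rewards)                          # average reward over n steps

print(finite_horizon, discounted, average)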
Examples of MDPs

• Goal-directed, Indefinite Horizon, Cost Minimization MDP
– <S, A, Pr, C, G, s0>
– Most often studied in planning community

• Infinite Horizon, Discounted Reward Maximization MDP
– <S, A, Pr, R, γ>
– Most often studied in reinforcement learning

• Goal-directed, Finite Horizon, Prob. Maximization MDP
– <S, A, Pr, G, s0, T>
– Also studied in planning community

• Oversubscription Planning: Non-absorbing goals, Reward Max. MDP
– <S, A, Pr, G, R, s0>
– Relatively recent model
Assumptions

• First-Order Markovian dynamics (history independence)
– Pr(St+1 | At, St, At-1, St-1, ..., S0) = Pr(St+1 | At, St)
– Next state only depends on current state and current action
• First-Order Markovian reward process
– Pr(Rt | At, St, At-1, St-1, ..., S0) = Pr(Rt | At, St)
– Reward only depends on current state and action
– As described earlier, we will assume reward is specified by a deterministic
function R(s), i.e. Pr(Rt = R(St) | At, St) = 1
• Stationary dynamics and reward
– Pr(St+1 | At, St) = Pr(Sk+1 | Ak, Sk) for all t, k
– The world dynamics do not depend on the absolute time
• Full observability
– Though we can’t predict exactly which state we will reach when we execute an
action, once it is realized, we know what it is
Policies (“Plans” for MDPs)

• Nonstationary policy
– π: S × T → A, where T is the non-negative integers
– π(s, t) is the action to do at state s with t stages-to-go
– What if we want to keep acting indefinitely?
• Stationary policy
– π: S → A
– π(s) is the action to do at state s (regardless of time)
– specifies a continuously reactive controller
• These policies assume or have the following properties:
– full observability
– history-independence
– deterministic action choice
Value of a Policy

• How good is a policy π?
• How do we measure “accumulated” reward?
• Value function V: S → Real associates a value with each
state (or each state and time for a non-stationary π)
• Vπ(s) denotes the value of policy π at state s
– Depends on the immediate reward, but also on what you achieve
subsequently by following π
– An optimal policy is one that is no worse than any other
policy at any state
• The goal of MDP planning is to compute an optimal
policy (the method depends on how we define value)
Finite-Horizon Value Functions

• We first consider maximizing total reward over a
finite horizon
• Assumes the agent has n time steps to live
• To act optimally, should the agent use a
stationary or non-stationary policy?
• Put another way:
– If you had only one week to live, would you act the
same way as if you had fifty years to live?
Finite Horizon Problems

• Value (utility) depends on stages-to-go
– hence so should the policy: nonstationary π(s, k)
• Vπ^k(s) is the k-stage-to-go value function for π
– expected total reward after executing π for k time steps

Vπ^k(s) = E[ ∑_{t=0}^{k} R^t | π, s ]
        = E[ ∑_{t=0}^{k} R(s^t) | a^t = π(s^t, k − t), s^0 = s ]

• Here R^t and s^t are random variables denoting the reward
received and the state at stage t, respectively
Computing Finite-Horizon Value

• Can use dynamic programming to compute Vπ^k(s)
– The Markov property is critical for this

(a) Vπ^0(s) = R(s), ∀s
(b) Vπ^k(s) = R(s) + ∑_{s'} T(s, π(s, k), s') · Vπ^{k−1}(s')
    (immediate reward plus the expected future payoff
     with k−1 stages to go)

[Figure: backup diagram. Applying π(s, k) in state s leads to successor states in V^{k−1} with probabilities 0.7 and 0.3, which are backed up from V^{k−1} into V^k.]

What is the time complexity?
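A minimal sketch of this backup in Python, assuming the P[s, a, s'] array layout from the earlier MDP sketch, a state-only reward vector R[s] as on this slide, and a nonstationary policy given as an integer array pi[k, s]:

import numpy as np

def finite_horizon_values(P, R, pi, K):
    """V[k, s] = k-stage-to-go value of the nonstationary policy pi.

    P: (S, A, S) transition table, R: (S,) state rewards,
    pi: (K+1, S) integer array, pi[k, s] = action with k stages to go (row 0 unused).
    """
    S = len(R)
    V = np.zeros((K + 1, S))
    V[0] = R                                     # (a) base case: V^0(s) = R(s)
    for k in range(1, K + 1):
        for s in range(S):
            a = pi[k, s]
            V[k, s] = R[s] + P[s, a] @ V[k - 1]  # (b) backup through V^{k-1}
    return V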
Bellman Backup

How can we compute the optimal V_{t+1}(s) given the optimal V_t?

[Figure: from state s, action a1 reaches s1 with probability 0.7 and s4 with probability 0.3; action a2 reaches s2 with probability 0.4 and s3 with probability 0.6. Compute the expectation over each action's successors in V_t, then take the max.]

V_{t+1}(s) = R(s) + max { 0.7 V_t(s1) + 0.3 V_t(s4),
                          0.4 V_t(s2) + 0.6 V_t(s3) }
Value Iteration: Finite Horizon Case

• The Markov property allows exploitation of the DP principle
for optimal policy construction
– no need to enumerate |A|^(Tn) possible policies
• Value iteration uses the Bellman backup:

V^0(s) = R(s), ∀s
V^k(s) = R(s) + max_a ∑_{s'} T(s, a, s') · V^{k−1}(s')
π*(s, k) = argmax_a ∑_{s'} T(s, a, s') · V^{k−1}(s')

V^k is the optimal k-stage-to-go value function
π*(s, k) is the optimal k-stage-to-go policy
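A sketch of this finite-horizon value iteration, again assuming the P[s, a, s'] layout and a state-only R[s]:

import numpy as np

def finite_horizon_value_iteration(P, R, K):
    """Return V[k, s] and the optimal k-stage-to-go policy pi[k, s]."""
    S = len(R)
    V = np.zeros((K + 1, S))
    pi = np.zeros((K + 1, S), dtype=int)
    V[0] = R                                     # V^0(s) = R(s)
    for k in range(1, K + 1):
        # Q[s, a] = expected future value of doing a in s with k-1 stages left
        Q = P @ V[k - 1]                         # shape (S, A)
        V[k] = R + Q.max(axis=1)                 # Bellman backup
        pi[k] = Q.argmax(axis=1)                 # pi*(s, k)
    return V, pi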
Value Iteration

[Figure: the backup diagram unrolled over stages V3 ← V2 ← V1 ← V0 for states s1–s4, with the same transition probabilities (0.7/0.3 and 0.4/0.6) at every stage. For example:]

V1(s4) = R(s4) + max { 0.7 V0(s1) + 0.3 V0(s4),
                       0.4 V0(s2) + 0.6 V0(s3) }
Value Iteration

[Figure: the same unrolled diagram, highlighting at each stage the maximizing action; π*(s4, t) is the argmax of the same bracketed terms used to compute V_t(s4).]
Discounted Infinite Horizon MDPs

• Defining value as total reward is problematic with
infinite horizons
– many or all policies have infinite expected reward
– some MDPs are OK (e.g., zero-cost absorbing states)
• “Trick”: introduce a discount factor 0 ≤ β < 1
– future rewards are discounted by β per time step

Vπ(s) = E[ ∑_{t=0}^{∞} β^t R^t | π, s ]

• Note: Vπ(s) ≤ E[ ∑_{t=0}^{∞} β^t R_max ] = R_max / (1 − β)

• Motivation: economic? failure prob? convenience?
Computing an Optimal Value Function

• Bellman equation for the optimal value function:

V*(s) = R(s) + β max_a ∑_{s'} T(s, a, s') V*(s')

– Bellman proved this is always true
• How can we compute the optimal value function?
– The MAX operator makes the system non-linear, so the problem is more
difficult than policy evaluation
• Notice that the optimal value function is a fixed-point of
the Bellman backup operator B
– B takes a value function as input and returns a new value function

B[V](s) = R(s) + β max_a ∑_{s'} T(s, a, s') V(s')
Value Iteration

• Can compute the optimal policy using value iteration,
just like finite-horizon problems (just include the
discount term):

V^0(s) = 0
V^k(s) = R(s) + β max_a ∑_{s'} T(s, a, s') V^{k−1}(s')

• Will converge to the optimal value function as k gets
large. Why?
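A sketch of the discounted version, iterating the Bellman backup operator B until successive value functions stop changing (same assumed P[s, a, s'] / R[s] layout; the tolerance is an arbitrary choice):

import numpy as np

def value_iteration(P, R, beta, tol=1e-8):
    """Iterate V <- B[V] until convergence; return V* and a greedy policy."""
    S = len(R)
    V = np.zeros(S)                        # V^0(s) = 0
    while True:
        Q = R[:, None] + beta * (P @ V)    # Q[s, a] = R(s) + beta * sum_s' T(s,a,s') V(s')
        V_new = Q.max(axis=1)              # B[V](s)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new

Because B is a contraction for β < 1, the loop terminates for any tolerance.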
Policy Evaluation

• Value equation for a fixed policy:

Vπ(s) = R(s) + β ∑_{s'} T(s, π(s), s') Vπ(s')

• How can we compute the value function for a
policy?
– we are given R and Pr
– it is a simple linear system with n variables (each variable is
the value of a state) and n constraints (one value equation
for each state)
– Use linear algebra (e.g. matrix inverse)
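A sketch of exact policy evaluation by solving the linear system (I − β Pπ) Vπ = R (same assumed array layout; np.linalg.solve is used rather than an explicit inverse):

import numpy as np

def evaluate_policy(P, R, pi, beta):
    """Solve Vpi = R + beta * Ppi Vpi exactly for a deterministic policy pi."""
    S = len(R)
    P_pi = P[np.arange(S), pi]                        # P_pi[s, s'] = T(s, pi(s), s')
    return np.linalg.solve(np.eye(S) - beta * P_pi, R)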
Policy Iteration

• Given a fixed policy, we can compute its value exactly:

Vπ(s) = R(s) + β ∑_{s'} T(s, π(s), s') Vπ(s')

• Policy iteration exploits this: it iterates steps of policy evaluation
and policy improvement

1. Choose a random policy π
2. Loop:
   (a) Evaluate Vπ
   (b) For each s in S, set π'(s) = argmax_a ∑_{s'} T(s, a, s') Vπ(s')   (policy improvement)
   (c) Replace π with π'
   Until no improving action is possible at any state
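A sketch of this loop, reusing the evaluate_policy helper from the previous sketch (a hypothetical helper name; same array assumptions):

import numpy as np

def policy_iteration(P, R, beta):
    S, A = P.shape[0], P.shape[1]
    pi = np.zeros(S, dtype=int)                  # arbitrary initial policy
    while True:
        V = evaluate_policy(P, R, pi, beta)      # (a) exact policy evaluation
        pi_new = (P @ V).argmax(axis=1)          # (b) greedy improvement
        if np.array_equal(pi_new, pi):           # (c) no improving action anywhere
            return pi, V
        pi = pi_new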
Policy Iteration Notes

• Each step of policy iteration is guaranteed to
strictly improve the policy at some state when
improvement is possible
• Convergence assured (Howard)
– intuitively: there are no local maxima in value space, and each
step must improve the policy's value; since there is a finite number of
policies, the iteration will converge to an optimal policy
• Gives the exact value of the optimal policy
Value Iteration vs. Policy Iteration

• Which is faster, VI or PI?
– It depends on the problem
• VI takes more iterations than PI, but PI requires
more time on each iteration
– PI must perform policy evaluation on each step, which
involves solving a linear system
• Complexity:
– There are at most exp(n) policies, so PI is no worse
than exponential time in the number of states
– Empirically, O(n) iterations are required
– Still no polynomial bound on the number of PI
iterations (open problem)!
Adaptive Dynamic Programming (ADP)

For each s, initialize V(s), P(s’|s,a) and R(s,a)
Initialize s to the current state that is perceived
Loop forever
{
  Select an action a and execute it (using the current model R and P):
    a = argmax_a ( R(s, a) + γ ∑_{s'} P(s' | s, a) V(s') )
  Receive the immediate reward r and observe the new state s’
  Use the transition tuple <s, a, s’, r> to update the model R and P
  For all states s, update V(s) using the update rule:
    V(s) = max_a ( R(s, a) + γ ∑_{s'} P(s' | s, a) V(s') )
  s = s’
}
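A minimal runnable sketch of this loop. The environment interface (env.reset, env.step) and the update_model helper are assumptions, not part of the slide; model learning itself is sketched on the next slide.

import numpy as np

def adp_agent(env, S, A, gamma, n_steps=10_000):
    V = np.zeros(S)
    P = np.full((S, A, S), 1.0 / S)      # initial (uniform) transition model
    R = np.zeros((S, A))                 # initial reward model
    s = env.reset()                      # assumed environment interface
    for _ in range(n_steps):
        a = int(np.argmax(R[s] + gamma * (P[s] @ V)))   # greedy w.r.t. current model
        s_next, r = env.step(a)                         # assumed to return (s', r)
        update_model(P, R, s, a, s_next, r)             # see the next slide's sketch
        V = np.max(R + gamma * (P @ V), axis=1)         # one sweep of value updates over all s
        s = s_next
    return V, P, R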
How to Learn the Model?

• Use the transition tuple <s, a, s’, r> to learn
P(s’|s,a) and R(s,a). That’s supervised learning!
– Since the agent observes every transition (s, a, s’, r)
directly, take (s, a) and s’ as an input-output example
of the transition probability function P(s’|s,a).
– Different supervised learning techniques can be used:
• Neural networks
• Decision trees
– Use r and P(s’|s,a) to learn R(s,a):
R(s, a) = ∑_{s'} P(s' | s, a) r
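The simplest instance of this is maximum-likelihood estimation from counts. A sketch of the update_model helper assumed in the ADP sketch above (the count array N and the reward sums are extra bookkeeping not shown on the slide; the sizes are illustrative):

import numpy as np

S, A = 10, 4                              # illustration sizes
N = np.zeros((S, A, S))                   # transition counts N(s, a, s')
R_sum = np.zeros((S, A))                  # accumulated rewards per (s, a)

def update_model(P, R, s, a, s_next, r):
    """Maximum-likelihood update of P(s'|s,a) and R(s,a) from one transition."""
    N[s, a, s_next] += 1
    R_sum[s, a] += r
    n = N[s, a].sum()
    P[s, a] = N[s, a] / n                 # empirical transition distribution
    R[s, a] = R_sum[s, a] / n             # empirical mean reward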
From ADP to TD Learning

• In each step, ADP updates V for all states s:
V(s) = max_a ( R(s, a) + γ ∑_{s'} P(s' | s, a) V(s') )
– This is intractable for large state spaces!
– It can be improved by prioritized sweeping.
• Temporal difference (TD) learning:
– Adjust the estimated utility value V of the current state
based on its immediate reward R(s) and the estimated
value of the next state s’:
V(s) = V(s) + α ( R(s) + γ V(s') − V(s) )
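A sketch of the TD(0) update applied along experienced transitions (the env interface and the random action choice are assumptions; a real agent would pair this with a proper exploration policy):

import numpy as np

def td0_evaluation(env, S, A, gamma=0.9, alpha=0.1, n_steps=10_000):
    V = np.zeros(S)
    s = env.reset()                           # assumed environment interface
    for _ in range(n_steps):
        a = np.random.randint(A)              # placeholder action selection
        s_next, r = env.step(a)               # assumed to return (s', r)
        V[s] += alpha * (r + gamma * V[s_next] - V[s])   # TD(0) update
        s = s_next
    return V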
TD-Q Learning

• Define the Q-value function:
V(s) = max_a Q(s, a)
Q(s, a) = R(s, a) + γ ∑_{s'} P(s' | s, a) V(s')
        = R(s, a) + γ ∑_{s'} P(s' | s, a) max_{a'} Q(s', a')
• Recall: TD learning
V(s) = V(s) + α ( R(s) + γ V(s') − V(s) )
• Key idea of TD-Q learning:
Q(s, a) = Q(s, a) + α ( r + γ max_{a'} Q(s', a') − Q(s, a) )
• Note: Delta rule (neural network learning)
w(t) = w(t) + α (d − o) x
TD-Q Learning Algorithm

For each pair (s, a), initialize Q(s, a)
Observe the current state s
Loop forever
{
  Select an action a and execute it:
    a = argmax_a Q(s, a)
  Receive the immediate reward r and observe the new state s’
  Update Q(s, a):
    Q(s, a) = Q(s, a) + α ( r + γ max_{a'} Q(s', a') − Q(s, a) )
  s = s’
}
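A minimal tabular sketch of this algorithm (the env interface is an assumption; an ε-greedy action choice is added, since always taking the argmax as written would never explore):

import numpy as np

def q_learning(env, S, A, gamma=0.9, alpha=0.1, epsilon=0.1, n_steps=100_000):
    Q = np.zeros((S, A))
    s = env.reset()                               # assumed environment interface
    for _ in range(n_steps):
        if np.random.rand() < epsilon:            # epsilon-greedy exploration
            a = np.random.randint(A)
        else:
            a = int(np.argmax(Q[s]))              # a = argmax_a Q(s, a)
        s_next, r = env.step(a)                   # assumed to return (s', r)
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])   # TD-Q update
        s = s_next
    return Q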
Generalization in RL

• So far we assumed that all the functions
learned by the agent (V, P, R, Q) are in
tabular form, i.e. it is possible to enumerate the
state and action spaces.
• Use generalization techniques to deal with
large state or action spaces (see the sketch below):
– Function approximation techniques
– Neural networks
– Other supervised/unsupervised learning methods
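For example, a sketch of the TD-Q update with a linear function approximator in place of the Q-table. The feature map is a made-up illustration, and semi-gradient updates of this form can diverge in some settings, so this is only a sketch:

import numpy as np

def features(s, a, S, A):
    """One-hot (s, a) features -- an illustrative stand-in for a real feature map."""
    x = np.zeros(S * A)
    x[s * A + a] = 1.0
    return x

def q_approx(w, s, a, S, A):
    return w @ features(s, a, S, A)

def td_q_update(w, s, a, r, s_next, S, A, gamma=0.9, alpha=0.1):
    """Delta-rule-style update: w <- w + alpha * (target - prediction) * x."""
    target = r + gamma * max(q_approx(w, s_next, b, S, A) for b in range(A))
    x = features(s, a, S, A)
    return w + alpha * (target - q_approx(w, s, a, S, A)) * x

With these one-hot features the update reduces exactly to the tabular rule above; a real approximator would use far fewer weights than states × actions.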
Alternatives

• Function approximation of the Q-table:
– Neural networks
– Decision trees
– Gradient descent methods

• Reinforcement learning variants:
– Relational reinforcement learning
– Hierarchical reinforcement learning
– Intrinsically motivated reinforcement learning
References and Further Reading

• Sutton, R., Barto, A. (2000) Reinforcement Learning:
An Introduction, The MIT Press.
http://www.cs.ualberta.ca/~sutton/book/the-book.html
• Kaelbling, L., Littman, M., Moore, A. (1996)
Reinforcement Learning: A Survey, Journal of
Artificial Intelligence Research, 4:237-285.
• Barto, A., Mahadevan, S. (2003) Recent Advances in
Hierarchical Reinforcement Learning, Discrete Event
Dynamic Systems: Theory and Applications, 13(4):41-77.