Markov Decision Processes & Reinforcement Learning: Megan Smith Lehigh University, Fall 2006

This document provides an overview of Markov decision processes (MDPs) and reinforcement learning (RL). It defines key concepts such as stochastic processes, the Markov property, Markov chains, and MDPs. MDPs extend Markov chains by adding actions and rewards. The goal of an MDP is to maximize cumulative rewards by finding the optimal policy. The document discusses solutions to MDPs using dynamic programming approaches such as value iteration and policy iteration, as well as Monte Carlo methods. It also introduces temporal-difference learning methods such as TD(0), SARSA, and Q-learning, which can learn directly from experience without a model of the environment. Finally, it provides an example of applying RL to scheduling problems for NASA's space shuttle payload processing.

Markov Decision Processes & Reinforcement Learning
Megan Smith
Lehigh University, Fall 2006
Outline
• Stochastic Process
• Markov Property
• Markov Chain
• Markov Decision Process
• Reinforcement Learning
• RL Techniques
• Example Applications
Stochastic Process
• Quick definition: a random process
• Often viewed as a collection of indexed random variables
• Useful to us: a set of states with probabilities of being in those states, indexed over time
• We'll deal with discrete stochastic processes
(Image: http://en.wikipedia.org/wiki/Image:AAMarkov.jpg)
Stochastic Process Example
• Classic: the random walk
• Start at state X0 at time t0
• At time ti, take a step Zi, where P(Zi = -1) = p and P(Zi = 1) = 1 - p
• At time ti, the state is Xi = X0 + Z1 + … + Zi
(Image: http://en.wikipedia.org/wiki/Image:Random_Walk_example.png)
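As an illustration, here is a minimal simulation of that walk; the step probability, horizon, and seed below are arbitrary choices for the sketch, not values from the slides.

```python
import random

def random_walk(p=0.5, steps=20, x0=0, seed=0):
    """Simulate the walk: each step Zi is -1 with probability p, +1 otherwise."""
    rng = random.Random(seed)
    x = x0
    path = [x]
    for _ in range(steps):
        z = -1 if rng.random() < p else 1   # P(Zi = -1) = p, P(Zi = 1) = 1 - p
        x += z                              # Xi = X0 + Z1 + ... + Zi
        path.append(x)
    return path

print(random_walk())
```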
Markov Property
• Also thought of as the "memoryless" property
• A stochastic process is said to have the Markov property if the probability of state Xn+1 having any given value depends only upon state Xn
• Very much depends on the description of the states
Markov Property Example
• Checkers:
  • Current state: the current configuration of the board
  • Contains all information needed for the transition to the next state
  • Thus, each configuration can be said to have the Markov property
Markov Chain
• A discrete-time stochastic process with the Markov property
• Industry example: Google's PageRank algorithm
  • A probability distribution representing the likelihood that random linking ends up on a given page
(Image: http://en.wikipedia.org/wiki/PageRank)
Markov Decision Process (MDP)
• A discrete-time stochastic control process
• An extension of Markov chains; the differences:
  • Addition of actions (choice)
  • Addition of rewards (motivation)
• If the actions are fixed, an MDP reduces to a Markov chain
Description of MDPs
• A tuple (S, A, P·(·,·), R(·)):
  • S -> state space
  • A -> action space
  • Pa(s, s') = Pr(st+1 = s' | st = s, at = a)
  • R(s) = immediate reward at state s
• The goal is to maximize some cumulative function of the rewards
• Finite MDPs have finite state and action spaces
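One minimal way to hold such a tuple in code (my own sketch, not something from the slides) is a dictionary keyed by state-action pairs. A reward is stored on every transition, which covers both the R(s) form used here and the per-transition rewards in the table two slides ahead.

```python
from typing import Dict, List, Tuple

State = str
Action = str

# transitions[(s, a)] = [(s_next, probability, reward), ...]
# i.e. one entry per possible outcome of taking action a in state s.
FiniteMDP = Dict[Tuple[State, Action], List[Tuple[State, float, float]]]
```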
Simple MDP Example
• Recycling-robot MDP
• The robot can search for a trashcan, wait for someone to bring it a trashcan, or go home and recharge its battery
• It has two energy levels – high and low
• Searching runs down the battery, waiting does not, and a depleted battery has a very low reward
Transition Probabilities

s = st   s' = st+1   a = at     Pa(s,s')   Ra(s,s')
high     high        search     α          Rsearch
high     low         search     1 - α      Rsearch
low      high        search     1 - β      -3
low      low         search     β          Rsearch
high     high        wait       1          Rwait
high     low         wait       0          Rwait
low      high        wait       0          Rwait
low      low         wait       1          Rwait
low      high        recharge   1          0
low      low         recharge   0          0
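Read into the FiniteMDP structure sketched earlier, the table becomes the dictionary below (zero-probability rows omitted). The numeric defaults for α, β, Rsearch, and Rwait are placeholders for illustration only, not values from the slides.

```python
def recycling_robot(alpha=0.9, beta=0.6, r_search=2.0, r_wait=1.0):
    """Transition table for the recycling robot, one entry per table row above.
    alpha, beta, r_search, r_wait are illustrative placeholder values."""
    return {
        ("high", "search"):   [("high", alpha,     r_search),
                               ("low",  1 - alpha, r_search)],
        ("low",  "search"):   [("high", 1 - beta,  -3.0),
                               ("low",  beta,      r_search)],
        ("high", "wait"):     [("high", 1.0,       r_wait)],
        ("low",  "wait"):     [("low",  1.0,       r_wait)],
        ("low",  "recharge"): [("high", 1.0,       0.0)],
    }
```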
Transition Graph
(Figure: transition graph for the recycling robot, drawn with state nodes and action nodes.)
Solution to an MDP = Policy π
• Gives the action to take from a given state, regardless of history
• Two arrays indexed by state:
  • V, the value function – the discounted sum of rewards, on average, from following the policy
  • π, an array giving the action to be taken in each state (the policy)
• The solution is computed by iterating two basic steps:
  1. V(s) := R(s) + γ ∑s' Pπ(s)(s, s') V(s')
  2. π(s) := argmaxa ∑s' Pa(s, s') V(s')
Variants
• The standard algorithms differ in how they order and interleave the two basic steps above:
  • Value Iteration
  • Policy Iteration
  • Modified Policy Iteration
  • Prioritized Sweeping
Value Iteration

V(s) := R(s) + γ maxa ∑s' Pa(s, s') V(s')

k   Vk(PU)   Vk(PF)   Vk(RU)   Vk(RF)
1   0        0        10       10
2   0        4.5      14.5     19
3   2.03     8.55     18.55    24.18
4   4.76     11.79    19.26    29.23
5   7.45     15.30    20.81    31.82
6   10.23    17.67    22.72    33.68
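A compact sketch of value iteration over the FiniteMDP dictionaries introduced above; because rewards sit on transitions in that structure, the update computed is V(s) := maxa ∑s' Pa(s,s')[Ra(s,s') + γ V(s')], which reduces to the slide's form when the reward depends only on the state.

```python
def value_iteration(mdp, gamma=0.9, theta=1e-6):
    """mdp: {(s, a): [(s_next, prob, reward), ...]}, as sketched earlier."""
    states = {s for (s, _a) in mdp}
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Back up V(s) with the best action available in s
            best = max(
                sum(p * (r + gamma * V[s_next]) for s_next, p, r in outcomes)
                for (s2, _a), outcomes in mdp.items() if s2 == s
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            break
    # Extract a greedy policy from the converged value function
    policy = {}
    for s in states:
        policy[s] = max(
            (a for (s2, a) in mdp if s2 == s),
            key=lambda a: sum(p * (r + gamma * V[s_next])
                              for s_next, p, r in mdp[(s, a)]),
        )
    return V, policy
```

For example, `V, pi = value_iteration(recycling_robot())` returns state values and a greedy policy for the robot above.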
Why So Interesting?
• If the transition probabilities are known, this becomes a straightforward computational problem, however…
• If the transition probabilities are unknown, then this is a problem for reinforcement learning
Typical Agent
• In reinforcement learning (RL), the agent observes a state and takes an action
• Afterward, the agent receives a reward
Mission: Optimize Reward
• Rewards are calculated in the environment
• Used to teach the agent how to reach a goal state
• Must signal what we ultimately want achieved, not necessarily subgoals
• May be discounted over time
• In general, we seek to maximize the expected return
Value Functions
• Vπ is a value function (How good is it to be in this state?)
• Vπ is the unique solution to its Bellman equation
• The Bellman equation expresses the relationship between the value of a state and the values of its successor states
(Figures: the state-value function for policy π and its Bellman equation)
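For reference, the state-value function and its Bellman equation, written in the notation of the transition table above and following Sutton & Barto, are:

Vπ(s) = Eπ[ rt+1 + γ rt+2 + γ² rt+3 + … | st = s ]

Vπ(s) = ∑a π(s, a) ∑s' Pa(s, s') [ Ra(s, s') + γ Vπ(s') ]   (Bellman equation for Vπ)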
Another Value Function
• Qπ(s, a) defines the value of taking action a in state s under policy π
• The expected return starting from s, taking action a, and thereafter following policy π
(Figures: the action-value function for policy π; backup diagrams for (a) Vπ and (b) Qπ)
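Similarly, the action-value function for policy π from Sutton & Barto, in the same notation:

Qπ(s, a) = Eπ[ rt+1 + γ rt+2 + … | st = s, at = a ] = ∑s' Pa(s, s') [ Ra(s, s') + γ Vπ(s') ]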


Dynamic Programming
• Classically, a collection of algorithms used to compute optimal policies given a perfect model of the environment as an MDP
• The classical view is not so useful in practice, since we rarely have a perfect model of the environment
• Provides the foundation for other methods
• Not practical for large problems
DP Continued…
• Use value functions to organize and structure the search for good policies
• Turn the Bellman equations into update rules
• Iterative policy evaluation using full backups
Policy Improvement
• When should we change the policy?
• If we pick a new action α from state s and thereafter follow the current policy, and Vπ' ≥ Vπ, then picking α from state s is a better policy overall
• This results from the policy improvement theorem
Policy Iteration
• Continue improving the policy π and recalculating Vπ
• A finite MDP has a finite number of policies, so convergence is guaranteed in a finite number of iterations
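A minimal sketch of that evaluate-then-improve loop, again over the FiniteMDP dictionaries assumed earlier; policy evaluation is done iteratively here rather than by an exact linear solve.

```python
def policy_iteration(mdp, gamma=0.9, theta=1e-6):
    """mdp: {(s, a): [(s_next, prob, reward), ...]}."""
    states = sorted({s for (s, _a) in mdp})
    actions = {s: [a for (s2, a) in mdp if s2 == s] for s in states}

    def q(s, a, V):
        # Expected one-step return of taking a in s, then following V
        return sum(p * (r + gamma * V[s_next]) for s_next, p, r in mdp[(s, a)])

    policy = {s: actions[s][0] for s in states}   # arbitrary initial policy
    V = {s: 0.0 for s in states}
    while True:
        # Policy evaluation: repeatedly back up V under the current policy
        while True:
            delta = 0.0
            for s in states:
                v_new = q(s, policy[s], V)
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                break
        # Policy improvement: make the policy greedy with respect to V
        stable = True
        for s in states:
            best = max(actions[s], key=lambda a: q(s, a, V))
            if best != policy[s]:
                policy[s] = best
                stable = False
        if stable:
            return policy, V
```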
Remember Value Iteration?
• Value iteration truncates policy iteration by combining one sweep of policy evaluation and one sweep of policy improvement in each of its sweeps
Monte Carlo Methods
• Requires only episodic experience – on-line or simulated
• Based on averaging sample returns
• Value estimates and policies are only changed at the end of each episode, not on a step-by-step basis
Policy Evaluation
• Compute average returns as the episode runs
• Two methods: first-visit and every-visit
• First-visit is the most widely studied
(Figure: the first-visit MC method)
Estimation of Action Values
• State values are not enough without a model – we need action values as well
• Qπ(s, a): the expected return when starting in state s, taking action a, and thereafter following policy π
• Exploration vs. exploitation
• Exploring starts
Example Monte Carlo Algorithm
(Figure: first-visit Monte Carlo control, assuming exploring starts)
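In the spirit of that algorithm, here is a hedged sketch of first-visit Monte Carlo control with exploring starts. The sample_episode(policy, s0, a0) function, which should return a finished episode as a list of (state, action, reward) triples, and the flat action list are assumptions of this sketch, not part of the slides.

```python
import random
from collections import defaultdict

def mc_exploring_starts(states, actions, sample_episode,
                        gamma=1.0, n_episodes=10000, seed=0):
    """First-visit Monte Carlo control with exploring starts (sketch)."""
    rng = random.Random(seed)
    Q = defaultdict(float)
    n_visits = defaultdict(int)
    policy = {s: actions[0] for s in states}

    for _ in range(n_episodes):
        # Exploring start: every state-action pair has a chance to begin an episode
        s0, a0 = rng.choice(states), rng.choice(actions)
        episode = sample_episode(policy, s0, a0)

        G = 0.0
        for t in range(len(episode) - 1, -1, -1):        # walk the episode backwards
            s, a, r = episode[t]
            G = gamma * G + r                            # return following time t
            first_visit = all((s, a) != (s2, a2) for s2, a2, _r in episode[:t])
            if first_visit:
                n_visits[(s, a)] += 1
                Q[(s, a)] += (G - Q[(s, a)]) / n_visits[(s, a)]   # running average
                policy[s] = max(actions, key=lambda act: Q[(s, act)])
    return Q, policy
```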


Another MC Algorithm
(Figure: on-line, first-visit, ε-greedy MC control without exploring starts)
Temporal-Difference Learning
• Central and novel to reinforcement learning
• Combines Monte Carlo and DP methods
• Can learn from experience without a model – like MC
• Updates estimates based on other learned estimates (bootstraps) – like DP
TD(0)
• The simplest TD method
• Uses a sample backup from a single successor state or state-action pair instead of the full backup of DP methods
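The TD(0) update from Sutton & Barto is V(st) ← V(st) + α[rt+1 + γ V(st+1) − V(st)]; the sketch below applies it with a tabular V. The env object with reset() and step() methods is an assumed interface for the example, not something defined in the slides.

```python
from collections import defaultdict

def td0_prediction(env, policy, alpha=0.1, gamma=0.9, n_episodes=1000):
    """Tabular TD(0) policy evaluation.
    Assumes env.reset() -> state and env.step(a) -> (next_state, reward, done)."""
    V = defaultdict(float)
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            # TD(0) update: move V(s) toward the one-step target r + γ V(s')
            V[s] += alpha * (r + gamma * V[s_next] - V[s])
            s = s_next
    return V
```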
SARSA – On-policy Control
• Named after the quintuple of events (st, at, rt+1, st+1, at+1)
• Continually estimate Qπ while changing π
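A tabular SARSA sketch using the same assumed env interface as above; the update Q(st, at) ← Q(st, at) + α[rt+1 + γ Q(st+1, at+1) − Q(st, at)] is the standard one from Sutton & Barto, and the ε-greedy behaviour policy is one common choice.

```python
import random
from collections import defaultdict

def sarsa(env, actions, alpha=0.1, gamma=0.9, epsilon=0.1,
          n_episodes=1000, seed=0):
    """On-policy SARSA control with an ε-greedy behaviour policy."""
    rng = random.Random(seed)
    Q = defaultdict(float)

    def eps_greedy(s):
        if rng.random() < epsilon:
            return rng.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(n_episodes):
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = eps_greedy(s_next)
            # Update uses the quintuple (s, a, r, s', a')
            target = r + (0.0 if done else gamma * Q[(s_next, a_next)])
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s_next, a_next
    return Q
```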
Q-Learning – Off-policy Control
• The learned action-value function Q directly approximates Q*, the optimal action-value function, independent of the policy being followed
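For contrast with SARSA, a tabular Q-learning sketch under the same assumed env interface; the backup uses maxa Q(st+1, a) regardless of which action the behaviour policy actually takes, which is what makes it off-policy.

```python
import random
from collections import defaultdict

def q_learning(env, actions, alpha=0.1, gamma=0.9, epsilon=0.1,
               n_episodes=1000, seed=0):
    """Off-policy Q-learning: behave ε-greedily, but back up the greedy value."""
    rng = random.Random(seed)
    Q = defaultdict(float)
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            # ε-greedy behaviour policy
            if rng.random() < epsilon:
                a = rng.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            s_next, r, done = env.step(a)
            # Target uses the max over actions, independent of the behaviour policy
            best_next = 0.0 if done else max(Q[(s_next, act)] for act in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```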
Case Study
• Job-shop scheduling
• Temporal and resource constraints
• Find constraint-satisfying schedules of short duration
• In its general form, NP-complete
NASA Space Shuttle Payload Processing Problem (SSPPP)
• Schedule the tasks required for installation and testing of shuttle cargo-bay payloads
• Typical problem: 2-6 shuttle missions, each requiring 34-164 tasks
• Zhang and Dietterich (1995, 1996; Zhang, 1996)
• First successful instance of RL applied in plan-space:
  • states = complete plans
  • actions = plan modifications
SSPPP – continued…
• States were entire schedules
• Two types of actions:
  • REASSIGN-POOL operators – reassign a resource to a different pool
  • MOVE operators – move a task to the first earlier or later time at which its resource constraints are satisfied
• Small negative reward for each step
• A resource dilation factor (RDF) formula rewards the final schedule's duration
Even More SSPPP…
• Used TD(λ) to learn the value function
• Actions selected by a decreasing-ε, ε-greedy policy with one-step lookahead
• Function approximation used multilayer neural networks
• Training generally took 10,000 episodes
• Each resulting network represented a different scheduling algorithm – not a schedule for a specific instance!
RL and CBR
• Example: CBR used to store various policies, and RL used to learn and modify those policies
  • Ashwin Ram and Juan Carlos Santamaría, 1993: autonomous robotic control
• Job-shop scheduling: RL used to repair schedules, CBR used to determine which repair to make
• Similar methods can be used for IDSS
References
• Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. The MIT Press, Cambridge, MA, 1998.
• Stochastic Processes, www.hanoivn.net
• http://en.wikipedia.org/wiki/PageRank
• http://en.wikipedia.org/wiki/Markov_decision_process
• Zeng, D. and Sycara, K. Using Case-Based Reasoning as a Reinforcement Learning Framework for Optimization with Changing Criteria, 1995.
