Markov Decision Processes & Reinforcement Learning: Megan Smith Lehigh University, Fall 2006

This document provides an overview of Markov decision processes (MDPs) and reinforcement learning (RL). It defines key concepts such as stochastic processes, the Markov property, Markov chains, and MDPs. MDPs extend Markov chains by adding actions and rewards. The goal of an MDP is to maximize cumulative rewards by finding the optimal policy. The document discusses solutions to MDPs using dynamic programming approaches such as value iteration and policy iteration, as well as Monte Carlo methods. It also introduces temporal-difference learning methods such as TD(0), SARSA, and Q-learning, which can learn directly from experience without a model of the environment. Finally, it provides an example of applying RL to scheduling problems for NASA's space shuttle payload processing.

Markov Decision Processes & Reinforcement Learning
Megan Smith
Lehigh University, Fall 2006
Outline
• Stochastic Process
• Markov Property
• Markov Chain
• Markov Decision Process
• Reinforcement Learning
• RL Techniques
• Example Applications
Stochastic Process
• Quick definition: a random process
• Often viewed as a collection of indexed random variables
• Useful to us: a set of states with probabilities of being in those states, indexed over time
• We'll deal with discrete stochastic processes
(Image: http://en.wikipedia.org/wiki/Image:AAMarkov.jpg)
Stochastic Process Example
• Classic: the random walk
• Start at state X0 at time t0
• At time ti, take a step Zi, where P(Zi = -1) = p and P(Zi = 1) = 1 - p
• At time ti, the state is Xi = X0 + Z1 + … + Zi
(Image: http://en.wikipedia.org/wiki/Image:Random_Walk_example.png)
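As an illustration, here is a minimal simulation of that walk; the step probability, horizon, and seed below are arbitrary choices for the sketch, not values from the slides.

```python
import random

def random_walk(p=0.5, steps=20, x0=0, seed=0):
    """Simulate the walk: each step Zi is -1 with probability p, +1 otherwise."""
    rng = random.Random(seed)
    x = x0
    path = [x]
    for _ in range(steps):
        z = -1 if rng.random() < p else 1   # P(Zi = -1) = p, P(Zi = 1) = 1 - p
        x += z                              # Xi = X0 + Z1 + ... + Zi
        path.append(x)
    return path

print(random_walk())
```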
Markov Property
• Also thought of as the "memoryless" property
• A stochastic process is said to have the Markov property if the probability of state Xn+1 having any given value depends only upon state Xn
• Very much depends on the description of the states
Markov Property Example
• Checkers:
  • Current state: the current configuration of the board
  • Contains all information needed for the transition to the next state
  • Thus, each configuration can be said to have the Markov property
Markov Chain
• A discrete-time stochastic process with the Markov property
• Industry example: Google's PageRank algorithm
  • A probability distribution representing the likelihood that random linking ends up on a given page
(Image: http://en.wikipedia.org/wiki/PageRank)
Markov Decision Process (MDP)
• A discrete-time stochastic control process
• An extension of Markov chains; the differences:
  • Addition of actions (choice)
  • Addition of rewards (motivation)
• If the actions are fixed, an MDP reduces to a Markov chain
Description of MDPs
• A tuple (S, A, P·(·,·), R(·)):
  • S -> state space
  • A -> action space
  • Pa(s, s') = Pr(st+1 = s' | st = s, at = a)
  • R(s) = immediate reward at state s
• The goal is to maximize some cumulative function of the rewards
• Finite MDPs have finite state and action spaces
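One minimal way to hold such a tuple in code (my own sketch, not something from the slides) is a dictionary keyed by state-action pairs. A reward is stored on every transition, which covers both the R(s) form used here and the per-transition rewards in the table two slides ahead.

```python
from typing import Dict, List, Tuple

State = str
Action = str

# transitions[(s, a)] = [(s_next, probability, reward), ...]
# i.e. one entry per possible outcome of taking action a in state s.
FiniteMDP = Dict[Tuple[State, Action], List[Tuple[State, float, float]]]
```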
Simple MDP Example
• Recycling-robot MDP
• The robot can search for a trashcan, wait for someone to bring it a trashcan, or go home and recharge its battery
• It has two energy levels – high and low
• Searching runs down the battery, waiting does not, and a depleted battery has a very low reward
Transition Probabilities

s = st   s' = st+1   a = at     Pa(s,s')   Ra(s,s')
high     high        search     α          Rsearch
high     low         search     1 - α      Rsearch
low      high        search     1 - β      -3
low      low         search     β          Rsearch
high     high        wait       1          Rwait
high     low         wait       0          Rwait
low      high        wait       0          Rwait
low      low         wait       1          Rwait
low      high        recharge   1          0
low      low         recharge   0          0
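Read into the FiniteMDP structure sketched earlier, the table becomes the dictionary below (zero-probability rows omitted). The numeric defaults for α, β, Rsearch, and Rwait are placeholders for illustration only, not values from the slides.

```python
def recycling_robot(alpha=0.9, beta=0.6, r_search=2.0, r_wait=1.0):
    """Transition table for the recycling robot, one entry per table row above.
    alpha, beta, r_search, r_wait are illustrative placeholder values."""
    return {
        ("high", "search"):   [("high", alpha,     r_search),
                               ("low",  1 - alpha, r_search)],
        ("low",  "search"):   [("high", 1 - beta,  -3.0),
                               ("low",  beta,      r_search)],
        ("high", "wait"):     [("high", 1.0,       r_wait)],
        ("low",  "wait"):     [("low",  1.0,       r_wait)],
        ("low",  "recharge"): [("high", 1.0,       0.0)],
    }
```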
Transition Graph
(Figure: transition graph for the recycling robot, drawn with state nodes and action nodes.)
Solution to an MDP = Policy π
• Gives the action to take from a given state, regardless of history
• Two arrays indexed by state:
  • V, the value function – the discounted sum of rewards, on average, from following the policy
  • π, an array giving the action to be taken in each state (the policy)
• The solution is computed by iterating two basic steps:
  1. V(s) := R(s) + γ ∑s' Pπ(s)(s, s') V(s')
  2. π(s) := argmaxa ∑s' Pa(s, s') V(s')
Variants
• The standard algorithms differ in how they order and interleave the two basic steps above:
  • Value Iteration
  • Policy Iteration
  • Modified Policy Iteration
  • Prioritized Sweeping
Value Iteration

V(s) := R(s) + γ maxa ∑s' Pa(s, s') V(s')

k   Vk(PU)   Vk(PF)   Vk(RU)   Vk(RF)
1   0        0        10       10
2   0        4.5      14.5     19
3   2.03     8.55     18.55    24.18
4   4.76     11.79    19.26    29.23
5   7.45     15.30    20.81    31.82
6   10.23    17.67    22.72    33.68
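A compact sketch of value iteration over the FiniteMDP dictionaries introduced above; because rewards sit on transitions in that structure, the update computed is V(s) := maxa ∑s' Pa(s,s')[Ra(s,s') + γ V(s')], which reduces to the slide's form when the reward depends only on the state.

```python
def value_iteration(mdp, gamma=0.9, theta=1e-6):
    """mdp: {(s, a): [(s_next, prob, reward), ...]}, as sketched earlier."""
    states = {s for (s, _a) in mdp}
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Back up V(s) with the best action available in s
            best = max(
                sum(p * (r + gamma * V[s_next]) for s_next, p, r in outcomes)
                for (s2, _a), outcomes in mdp.items() if s2 == s
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            break
    # Extract a greedy policy from the converged value function
    policy = {}
    for s in states:
        policy[s] = max(
            (a for (s2, a) in mdp if s2 == s),
            key=lambda a: sum(p * (r + gamma * V[s_next])
                              for s_next, p, r in mdp[(s, a)]),
        )
    return V, policy
```

For example, `V, pi = value_iteration(recycling_robot())` returns state values and a greedy policy for the robot above.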
Why So Interesting?
• If the transition probabilities are known, this becomes a straightforward computational problem, however…
• If the transition probabilities are unknown, then this is a problem for reinforcement learning
Typical Agent
• In reinforcement learning (RL), the agent observes a state and takes an action
• Afterward, the agent receives a reward
Mission: Optimize Reward
• Rewards are calculated in the environment
• Used to teach the agent how to reach a goal state
• Must signal what we ultimately want achieved, not necessarily subgoals
• May be discounted over time
• In general, we seek to maximize the expected return
Value Functions
• Vπ is a value function (How good is it to be in this state?)
• Vπ is the unique solution to its Bellman equation
• The Bellman equation expresses the relationship between the value of a state and the values of its successor states
(Figures: the state-value function for policy π and its Bellman equation)
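For reference, the state-value function and its Bellman equation, written in the notation of the transition table above and following Sutton & Barto, are:

Vπ(s) = Eπ[ rt+1 + γ rt+2 + γ² rt+3 + … | st = s ]

Vπ(s) = ∑a π(s, a) ∑s' Pa(s, s') [ Ra(s, s') + γ Vπ(s') ]   (Bellman equation for Vπ)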
Another Value Function
• Qπ(s, a) defines the value of taking action a in state s under policy π
• The expected return starting from s, taking action a, and thereafter following policy π
(Figures: the action-value function for policy π; backup diagrams for (a) Vπ and (b) Qπ)
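Similarly, the action-value function for policy π from Sutton & Barto, in the same notation:

Qπ(s, a) = Eπ[ rt+1 + γ rt+2 + … | st = s, at = a ] = ∑s' Pa(s, s') [ Ra(s, s') + γ Vπ(s') ]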


Dynamic Programming
• Classically, a collection of algorithms used to compute optimal policies given a perfect model of the environment as an MDP
• The classical view is not so useful in practice, since we rarely have a perfect model of the environment
• Provides the foundation for other methods
• Not practical for large problems
DP Continued…
• Use value functions to organize and structure the search for good policies
• Turn the Bellman equations into update rules
• Iterative policy evaluation using full backups
Policy Improvement
• When should we change the policy?
• If we pick a new action α from state s and thereafter follow the current policy, and Vπ' ≥ Vπ, then picking α from state s is a better policy overall
• This results from the policy improvement theorem
Policy Iteration
• Continue improving the policy π and recalculating Vπ
• A finite MDP has a finite number of policies, so convergence is guaranteed in a finite number of iterations
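A minimal sketch of that evaluate-then-improve loop, again over the FiniteMDP dictionaries assumed earlier; policy evaluation is done iteratively here rather than by an exact linear solve.

```python
def policy_iteration(mdp, gamma=0.9, theta=1e-6):
    """mdp: {(s, a): [(s_next, prob, reward), ...]}."""
    states = sorted({s for (s, _a) in mdp})
    actions = {s: [a for (s2, a) in mdp if s2 == s] for s in states}

    def q(s, a, V):
        # Expected one-step return of taking a in s, then following V
        return sum(p * (r + gamma * V[s_next]) for s_next, p, r in mdp[(s, a)])

    policy = {s: actions[s][0] for s in states}   # arbitrary initial policy
    V = {s: 0.0 for s in states}
    while True:
        # Policy evaluation: repeatedly back up V under the current policy
        while True:
            delta = 0.0
            for s in states:
                v_new = q(s, policy[s], V)
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                break
        # Policy improvement: make the policy greedy with respect to V
        stable = True
        for s in states:
            best = max(actions[s], key=lambda a: q(s, a, V))
            if best != policy[s]:
                policy[s] = best
                stable = False
        if stable:
            return policy, V
```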
Remember Value Iteration?
• Value iteration truncates policy iteration by combining one sweep of policy evaluation and one sweep of policy improvement in each of its sweeps
Monte Carlo Methods
• Requires only episodic experience – on-line or simulated
• Based on averaging sample returns
• Value estimates and policies are only changed at the end of each episode, not on a step-by-step basis
Policy Evaluation
• Compute average returns as the episode runs
• Two methods: first-visit and every-visit
• First-visit is the most widely studied
(Figure: the first-visit MC method)
Estimation of Action Values
• State values are not enough without a model – we need action values as well
• Qπ(s, a): the expected return when starting in state s, taking action a, and thereafter following policy π
• Exploration vs. exploitation
• Exploring starts
Example Monte Carlo Algorithm
(Figure: first-visit Monte Carlo control, assuming exploring starts)
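In the spirit of that algorithm, here is a hedged sketch of first-visit Monte Carlo control with exploring starts. The sample_episode(policy, s0, a0) function, which should return a finished episode as a list of (state, action, reward) triples, and the flat action list are assumptions of this sketch, not part of the slides.

```python
import random
from collections import defaultdict

def mc_exploring_starts(states, actions, sample_episode,
                        gamma=1.0, n_episodes=10000, seed=0):
    """First-visit Monte Carlo control with exploring starts (sketch)."""
    rng = random.Random(seed)
    Q = defaultdict(float)
    n_visits = defaultdict(int)
    policy = {s: actions[0] for s in states}

    for _ in range(n_episodes):
        # Exploring start: every state-action pair has a chance to begin an episode
        s0, a0 = rng.choice(states), rng.choice(actions)
        episode = sample_episode(policy, s0, a0)

        G = 0.0
        for t in range(len(episode) - 1, -1, -1):        # walk the episode backwards
            s, a, r = episode[t]
            G = gamma * G + r                            # return following time t
            first_visit = all((s, a) != (s2, a2) for s2, a2, _r in episode[:t])
            if first_visit:
                n_visits[(s, a)] += 1
                Q[(s, a)] += (G - Q[(s, a)]) / n_visits[(s, a)]   # running average
                policy[s] = max(actions, key=lambda act: Q[(s, act)])
    return Q, policy
```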


Another MC Algorithm
(Figure: on-line, first-visit, ε-greedy MC control without exploring starts)
Temporal-Difference Learning
• Central and novel to reinforcement learning
• Combines Monte Carlo and DP methods
• Can learn from experience without a model – like MC
• Updates estimates based on other learned estimates (bootstraps) – like DP
TD(0)
• The simplest TD method
• Uses a sample backup from a single successor state or state-action pair instead of the full backup of DP methods
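The TD(0) update from Sutton & Barto is V(st) ← V(st) + α[rt+1 + γ V(st+1) − V(st)]; the sketch below applies it with a tabular V. The env object with reset() and step() methods is an assumed interface for the example, not something defined in the slides.

```python
from collections import defaultdict

def td0_prediction(env, policy, alpha=0.1, gamma=0.9, n_episodes=1000):
    """Tabular TD(0) policy evaluation.
    Assumes env.reset() -> state and env.step(a) -> (next_state, reward, done)."""
    V = defaultdict(float)
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            # TD(0) update: move V(s) toward the one-step target r + γ V(s')
            V[s] += alpha * (r + gamma * V[s_next] - V[s])
            s = s_next
    return V
```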
SARSA – On-policy Control
• Named after the quintuple of events (st, at, rt+1, st+1, at+1)
• Continually estimate Qπ while changing π
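A tabular SARSA sketch using the same assumed env interface as above; the update Q(st, at) ← Q(st, at) + α[rt+1 + γ Q(st+1, at+1) − Q(st, at)] is the standard one from Sutton & Barto, and the ε-greedy behaviour policy is one common choice.

```python
import random
from collections import defaultdict

def sarsa(env, actions, alpha=0.1, gamma=0.9, epsilon=0.1,
          n_episodes=1000, seed=0):
    """On-policy SARSA control with an ε-greedy behaviour policy."""
    rng = random.Random(seed)
    Q = defaultdict(float)

    def eps_greedy(s):
        if rng.random() < epsilon:
            return rng.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(n_episodes):
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = eps_greedy(s_next)
            # Update uses the quintuple (s, a, r, s', a')
            target = r + (0.0 if done else gamma * Q[(s_next, a_next)])
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s_next, a_next
    return Q
```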
Q-Learning – Off-policy Control
• The learned action-value function Q directly approximates Q*, the optimal action-value function, independent of the policy being followed
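For contrast with SARSA, a tabular Q-learning sketch under the same assumed env interface; the backup uses maxa Q(st+1, a) regardless of which action the behaviour policy actually takes, which is what makes it off-policy.

```python
import random
from collections import defaultdict

def q_learning(env, actions, alpha=0.1, gamma=0.9, epsilon=0.1,
               n_episodes=1000, seed=0):
    """Off-policy Q-learning: behave ε-greedily, but back up the greedy value."""
    rng = random.Random(seed)
    Q = defaultdict(float)
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            # ε-greedy behaviour policy
            if rng.random() < epsilon:
                a = rng.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            s_next, r, done = env.step(a)
            # Target uses the max over actions, independent of the behaviour policy
            best_next = 0.0 if done else max(Q[(s_next, act)] for act in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```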
Case Study
• Job-shop scheduling
• Temporal and resource constraints
• Find constraint-satisfying schedules of short duration
• In its general form, NP-complete
NASA Space Shuttle Payload Processing Problem (SSPPP)
• Schedule the tasks required for installation and testing of shuttle cargo-bay payloads
• Typical problem: 2-6 shuttle missions, each requiring 34-164 tasks
• Zhang and Dietterich (1995, 1996; Zhang, 1996)
• First successful instance of RL applied in plan-space:
  • states = complete plans
  • actions = plan modifications
SSPPP – continued…
• States were entire schedules
• Two types of actions:
  • REASSIGN-POOL operators – reassign a resource to a different pool
  • MOVE operators – move a task to the first earlier or later time at which its resource constraints are satisfied
• Small negative reward for each step
• A resource dilation factor (RDF) formula rewards the final schedule's duration
Even More SSPPP…
• Used TD(λ) to learn the value function
• Actions selected by a decreasing-ε, ε-greedy policy with one-step lookahead
• Function approximation used multilayer neural networks
• Training generally took 10,000 episodes
• Each resulting network represented a different scheduling algorithm – not a schedule for a specific instance!
RL and CBR
• Example: CBR used to store various policies, and RL used to learn and modify those policies
  • Ashwin Ram and Juan Carlos Santamaría, 1993: autonomous robotic control
• Job-shop scheduling: RL used to repair schedules, CBR used to determine which repair to make
• Similar methods can be used for IDSS
References
• Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. The MIT Press, Cambridge, MA, 1998.
• Stochastic Processes, www.hanoivn.net
• http://en.wikipedia.org/wiki/PageRank
• http://en.wikipedia.org/wiki/Markov_decision_process
• Zeng, D. and Sycara, K. Using Case-Based Reasoning as a Reinforcement Learning Framework for Optimization with Changing Criteria, 1995.
