Markov Decision Processes & Reinforcement Learning: Megan Smith, Lehigh University, Fall 2006
Reinforcement Learning
Megan Smith
Lehigh University, Fall 2006
Outline
Stochastic Process
Markov Property
Markov Chain
Markov Decision Process
Reinforcement Learning
RL Techniques
Example Applications
Stochastic Process
Quick definition: a random process
Often viewed as a collection of indexed random variables
Useful to us: a set of states with probabilities of being in those states, indexed over time
We’ll deal with discrete stochastic processes
http://en.wikipedia.org/wiki/Image:AAMarkov.jpg
Stochastic Process Example
Classic: Random Walk
Start at state X0 at time t0
At time ti, move a step Zi where P(Zi = -1) = p and P(Zi = 1) = 1 - p
At time ti, state Xi = X0 + Z1 +…+ Zi
http://en.wikipedia.org/wiki/Image:Random_Walk_example.png
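A minimal Python sketch of this random walk; the choice p = 0.5 and the start X0 = 0 are illustrative assumptions, not from the slides:

```python
import random

def random_walk(steps, p=0.5, x0=0):
    """Simulate a simple random walk: at each step move -1 with
    probability p, otherwise move +1."""
    x = x0
    path = [x]
    for _ in range(steps):
        z = -1 if random.random() < p else 1   # step Z_i
        x += z                                 # X_i = X_0 + Z_1 + ... + Z_i
        path.append(x)
    return path

print(random_walk(10))
```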
Markov Property
Also thought of as the “memoryless” property
A stochastic process has the Markov property if the probability of state Xn+1 taking any given value depends only upon state Xn
Whether a process has the property depends very much on how its states are described
Markov Property Example
Checkers:
Current state: the current configuration of the board
Contains all information needed for the transition to the next state
Thus, each configuration can be said to have the Markov property
Markov Chain
Discrete-time stochastic process with the Markov property
Industry example: Google’s PageRank algorithm
Probability distribution representing the likelihood of random linking ending up on a given page
http://en.wikipedia.org/wiki/PageRank
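A hedged sketch of PageRank as a Markov chain computation: the three-page link graph and the 0.85 damping factor below are illustrative assumptions, not taken from the slides. Each iteration redistributes probability mass exactly as the random-surfer Markov chain would.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration sketch: rank[p] approximates the stationary
    probability of the random-surfer Markov chain landing on page p."""
    n = len(links)
    rank = {page: 1.0 / n for page in links}
    for _ in range(iterations):
        new_rank = {}
        for page in links:
            # Mass flowing in from pages that link here, plus the
            # "random jump" term that keeps the chain well behaved.
            incoming = sum(rank[q] / len(links[q])
                           for q in links if page in links[q])
            new_rank[page] = (1 - damping) / n + damping * incoming
        rank = new_rank
    return rank

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}   # hypothetical link graph
print(pagerank(links))
```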
Markov Decision Process (MDP)
Discrete-time stochastic control process
Extension of Markov chains
Differences:
Addition of actions (choice)
Addition of rewards (motivation)
(Figure: transition graph of state nodes and action nodes)
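A minimal sketch of how the pieces of an MDP might be written down; the two-state, two-action example, its transition probabilities, and its rewards are hypothetical, chosen only to illustrate the added actions and rewards.

```python
import random

# Hypothetical two-state, two-action MDP.
# P[a][s][s'] = probability of moving from s to s' under action a (the choice)
# R[s]        = reward received in state s (the motivation)
states  = ["s0", "s1"]
actions = ["stay", "go"]
P = {
    "stay": {"s0": {"s0": 0.9, "s1": 0.1}, "s1": {"s0": 0.1, "s1": 0.9}},
    "go":   {"s0": {"s0": 0.2, "s1": 0.8}, "s1": {"s0": 0.8, "s1": 0.2}},
}
R = {"s0": 0.0, "s1": 1.0}
gamma = 0.9   # discount factor

# Sample one transition: pick an action, then draw the next state.
s, a = "s0", "go"
s_next = random.choices(states, weights=[P[a][s][s2] for s2 in states])[0]
print("moved to", s_next, "reward", R[s_next])
```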
Solution to an MDP = Policy π
Gives the action to take from a given state, regardless of history
Two arrays indexed by state:
V is the value function, namely the discounted sum of rewards obtained on average by following the policy
π is an array of actions to be taken in each state (the policy)
Two basic steps, repeated until the values converge:
1. π(s) := argmax_a ∑s' Pa(s,s') V(s')
2. V(s) := R(s) + γ ∑s' Pπ(s)(s,s') V(s')
Variants of the algorithm differ in how these two steps are ordered and repeated (see the sketch below)
Bellman Equation
V is the solution to its Bellman equation (step 2)
Expresses the relationship between the value of a state and the values of its successor states
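A sketch of the two basic steps applied to a toy two-state MDP (the same hypothetical P, R, and γ as the earlier sketch): the loop simply alternates greedy policy improvement with the Bellman update until the values settle.

```python
# Hypothetical two-state MDP (same shape as the earlier sketch).
P = {
    "stay": {"s0": {"s0": 0.9, "s1": 0.1}, "s1": {"s0": 0.1, "s1": 0.9}},
    "go":   {"s0": {"s0": 0.2, "s1": 0.8}, "s1": {"s0": 0.8, "s1": 0.2}},
}
R = {"s0": 0.0, "s1": 1.0}
gamma = 0.9
states, actions = list(R), list(P)

V  = {s: 0.0 for s in states}          # value function
pi = {s: actions[0] for s in states}   # policy

for _ in range(100):   # repeat the two steps until V has (approximately) converged
    # Step 1: policy improvement, pi(s) := argmax_a sum_s' P_a(s,s') V(s')
    for s in states:
        pi[s] = max(actions,
                    key=lambda a: sum(P[a][s][s2] * V[s2] for s2 in states))
    # Step 2: Bellman update, V(s) := R(s) + gamma * sum_s' P_pi(s)(s,s') V(s')
    for s in states:
        V[s] = R[s] + gamma * sum(P[pi[s]][s][s2] * V[s2] for s2 in states)

print("policy:", pi)
print("values:", V)
```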
Another Value Function
Qπ defines the value of taking action a in state s under policy π
The expected return starting from s, taking action a, and thereafter following policy π
The action-value function for policy π
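A small sketch of this relationship under the reward convention R(s) used above, Qπ(s,a) = R(s) + γ ∑s' Pa(s,s') Vπ(s'); the dictionaries reuse the shape of the earlier hypothetical MDP, and the V values are placeholders rather than computed results.

```python
def q_from_v(P, R, V, gamma=0.9):
    """Action-value function: Q(s, a) = R(s) + gamma * sum_s' P_a(s, s') * V(s')."""
    return {(s, a): R[s] + gamma * sum(P[a][s][s2] * V[s2] for s2 in P[a][s])
            for a in P for s in R}

P = {"stay": {"s0": {"s0": 0.9, "s1": 0.1}, "s1": {"s0": 0.1, "s1": 0.9}},
     "go":   {"s0": {"s0": 0.2, "s1": 0.8}, "s1": {"s0": 0.8, "s1": 0.2}}}
R = {"s0": 0.0, "s1": 1.0}
V = {"s0": 7.0, "s1": 8.5}   # placeholder values for V_pi, for illustration only
print(q_from_v(P, R, V))
```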
Simplest TD method
Uses a sample backup from a single successor state or state-action pair instead of the full backup of DP methods
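A sketch of this sample backup, presumably TD(0): after each observed transition (s, r, s') the estimate is nudged toward the one-step target r + γV(s'). The two-state environment and the step size α = 0.1 are illustrative assumptions.

```python
import random

def step(s):
    """Hypothetical environment: next state is uniform over two states,
    with reward 1 for landing in s1."""
    s_next = random.choice(["s0", "s1"])
    return (1.0 if s_next == "s1" else 0.0), s_next

V = {"s0": 0.0, "s1": 0.0}
alpha, gamma = 0.1, 0.9
s = "s0"
for _ in range(1000):
    r, s_next = step(s)
    # TD(0) update: sample backup from the single successor state s_next
    V[s] += alpha * (r + gamma * V[s_next] - V[s])
    s = s_next
print(V)
```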
SARSA – On-policy Control