
Finite Markov Decision Process (MDP):

A Markov Decision Process (MDP) is a mathematical framework used to model
decision-making in environments where outcomes are partly random and partly
under the control of a decision-maker. An MDP provides a formal foundation for
many reinforcement learning algorithms.

Components of a Finite MDP

A finite MDP is defined by the following components:

1. States (S):
A finite set of states representing the environment's different situations.
Example: In a grid world, each cell (position) can be a state.

2. Actions (A):
A finite set of actions available to the agent in each state.
Example: In a grid world, the actions could be "up," "down," "left," and "right."

3. Transition Probability (P):
The probability of transitioning from one state to another, given an action. This
is represented as:
\[
P(s' | s, a) = \text{Probability of reaching state } s' \text{ from state } s \text{ by taking action } a.
\]
Example: Moving "up" in the grid world has an 80% chance of moving up, a
10% chance of staying in the same place, and a 10% chance of moving left.

4. Reward Function (R):
The reward received after transitioning from one state to another, given an
action. It is represented as:
\[
R(s, a, s') = \text{Reward received after taking action } a \text{ in state } s \text{ and moving to state } s'.
\]
Example: In a grid world, reaching a goal state might yield a reward of +10,
while all other transitions yield a reward of -1.

5. Discount Factor (γ):
A factor between 0 and 1 that represents the importance of future rewards. It
determines how much future rewards are worth compared to immediate rewards.
- If \( \gamma = 0 \), the agent is shortsighted and only cares about immediate
rewards.
- If \( \gamma = 1 \), the agent is farsighted and values future rewards as much as
immediate ones.

A small sketch of how these five components might be written down follows this list.
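Below is a minimal Python sketch of the components for a tiny grid world. The state
names, the 80/10/10 transition probabilities, and the +10/-1 rewards mirror the
examples above; everything else (the dictionary layout, the reward helper) is an
illustrative assumption, not a fixed convention.

# A toy 2x2 grid world written out as explicit finite-MDP components.
states = ["s0", "s1", "s2", "G"]             # "G" is the goal cell
actions = ["up", "down", "left", "right"]
gamma = 0.9                                  # discount factor (assumed value)

# P[(s, a)] is a list of (next_state, probability) pairs.
# Only a few entries are shown; a complete model would cover every (s, a) pair.
P = {
    ("s0", "up"): [("s2", 0.8), ("s0", 0.1), ("s1", 0.1)],
    ("s2", "right"): [("G", 0.8), ("s2", 0.2)],
}

# R[(s, a, s')] is the reward for that transition; unlisted transitions give -1.
R = {("s2", "right", "G"): 10.0}

def reward(s, a, s_next):
    return R.get((s, a, s_next), -1.0)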

Goal of MDP

The goal in an MDP is to find a policy \( \pi \) that maximizes the expected
cumulative reward over time. A policy is a function that maps states to actions,
\( \pi: S \rightarrow A \).

- Optimal Policy (\( \pi^* \)): The policy that yields the maximum expected
cumulative reward starting from any state \( s \). One way a policy might be
represented in code is sketched below.
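A minimal sketch of how a policy could be represented for the toy grid world above,
either deterministically or as a distribution \( \pi(a | s) \); the particular action
choices and probabilities are made up for illustration.

import random

# Deterministic policy: exactly one action per state.
policy_det = {"s0": "up", "s1": "left", "s2": "right", "G": "up"}

# Stochastic policy: policy_stoch[s][a] = probability of taking action a in state s.
policy_stoch = {
    "s0": {"up": 0.9, "right": 0.1},
    "s1": {"left": 1.0},
    "s2": {"right": 1.0},
    "G": {"up": 1.0},
}

def sample_action(pi, s):
    # Draw an action from a stochastic policy pi at state s.
    acts, probs = zip(*pi[s].items())
    return random.choices(acts, weights=probs, k=1)[0]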

Value Functions in MDPs

Value functions are used to estimate how good it is for an agent to be in a
particular state, or to perform a particular action in that state.

1. State Value Function (V):
The value of a state \( s \) under policy \( \pi \) is the expected cumulative
discounted reward starting from \( s \) and following policy \( \pi \):
\[
V^\pi(s) = \mathbb{E} \left[ \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t, s_{t+1}) \,\big|\, s_0 = s, \pi \right].
\]

2. Action Value Function (Q):
The value of taking action \( a \) in state \( s \) under policy \( \pi \) is the
expected cumulative discounted reward starting from \( s \), taking action \( a \),
and then following policy \( \pi \):
\[
Q^\pi(s, a) = \mathbb{E} \left[ \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t, s_{t+1}) \,\big|\, s_0 = s, a_0 = a, \pi \right].
\]
A small numeric example of the discounted sum inside these expectations is given
below.
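The discounted return inside both expectations can be computed directly from a
recorded sequence of rewards. The episode below is a made-up sample for the toy
grid world, not a real rollout.

# Discounted return G = r_1 + gamma*r_2 + gamma^2*r_3 + ... for one episode.
gamma = 0.9
episode_rewards = [-1.0, -1.0, 10.0]   # made-up rewards along one episode

def discounted_return(rewards, gamma):
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g              # work backwards: G_t = r_t + gamma * G_{t+1}
    return g

print(discounted_return(episode_rewards, gamma))   # -1 + 0.9*(-1) + 0.81*10 = 6.2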

Bellman Equations

The Bellman Equation provides a recursive decomposition for value functions,
expressing the value of a state in terms of the values of successor states.

1. Bellman Expectation Equation for \( V^\pi(s) \):
\[
V^\pi(s) = \sum_{a \in A} \pi(a | s) \sum_{s' \in S} P(s' | s, a) \left[ R(s, a, s') + \gamma V^\pi(s') \right].
\]

2. Bellman Expectation Equation for \( Q^\pi(s, a) \):
\[
Q^\pi(s, a) = \sum_{s' \in S} P(s' | s, a) \left[ R(s, a, s') + \gamma \sum_{a' \in A} \pi(a' | s') Q^\pi(s', a') \right].
\]
Applying the first of these recursions repeatedly is sketched below.
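A minimal sketch of iterative policy evaluation, which applies the Bellman
expectation equation for \( V^\pi \) as an update rule until the values stop
changing. It reuses the hypothetical data structures from the earlier sketches
(states, P, reward, policy_stoch); the tolerance parameter tol is also an assumption
of this sketch.

def policy_evaluation(states, P, reward, pi, gamma, tol=1e-6):
    # P[(s, a)] -> list of (next_state, prob); pi[s][a] -> probability of a in s.
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = 0.0
            for a, p_a in pi.get(s, {}).items():
                for s_next, p in P.get((s, a), []):
                    # Bellman expectation equation, used as an update.
                    v_new += p_a * p * (reward(s, a, s_next) + gamma * V[s_next])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V

# Example (using the hypothetical grid-world objects defined above):
# V = policy_evaluation(states, P, reward, policy_stoch, gamma)

This is the policy-evaluation step that Policy Iteration, mentioned in the next
section, alternates with a policy-improvement step.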

Solving MDPs

1. Dynamic Programming:
Methods like Policy Iteration and Value Iteration are used to compute the
optimal policy. These methods rely on the Bellman equations to iteratively
improve the value functions until convergence.

2. Monte Carlo Methods:
These methods use sample sequences of states, actions, and rewards to
estimate value functions based on the actual experience of the agent. They are
particularly useful when the model (transition probabilities) is unknown.

3. Temporal-Difference (TD) Learning:
TD methods combine ideas from dynamic programming and Monte Carlo methods.
Algorithms like Q-Learning and SARSA learn from incomplete episodes, updating
estimates after each observed transition; a one-step Q-Learning update is
sketched below.
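A minimal sketch of the tabular Q-Learning update for a single observed transition
(s, a, r, s'). The learning rate alpha, the default Q-table, and the example
transition are assumptions made for illustration.

from collections import defaultdict

Q = defaultdict(float)                  # Q[(s, a)] defaults to 0.0
alpha, gamma = 0.1, 0.9                 # learning rate and discount (assumed values)
actions = ["up", "down", "left", "right"]

def q_learning_update(s, a, r, s_next):
    # One tabular Q-Learning step: move Q(s, a) toward the TD target
    # r + gamma * max_a' Q(s', a'), by a fraction alpha of the TD error.
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    td_error = r + gamma * best_next - Q[(s, a)]
    Q[(s, a)] += alpha * td_error

# Example: one observed transition in the hypothetical grid world.
q_learning_update("s2", "right", 10.0, "G")

SARSA differs only in the target: it uses the action actually taken in s'
(r + gamma * Q(s', a')) instead of the max over actions.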

Summary

- A Finite MDP consists of states, actions, transition probabilities, rewards, and a
discount factor.
- The objective is to find an optimal policy that maximizes the expected
cumulative reward.
- Value functions (state-value and action-value) are crucial for evaluating the
desirability of states or actions.
- The Bellman equations provide a recursive formula to calculate these value
functions, which can be solved using methods like Dynamic Programming, Monte
Carlo, or Temporal-Difference Learning.
