RL Unit 2
Markov Decision Process : Introduction to RL terminology, Markov property, Markov chains, Markov
reward process (MRP). Introduction to and proof of Bellman equations for MRPs along with proof of
existence of solution to Bellman equations in MRP. Introduction to Markov decision process (MDP), state
and action value functions, Bellman expectation equations, optimality of value functions and policies,
Bellman optimality equations.
Prediction and Control by Dynamic Programming : Overview of dynamic programming for MDP,
definition and formulation of planning in MDPs, principle of optimality, iterative policy evaluation, policy
iteration, value iteration, Banach fixed point theorem, proof of contraction mapping property of Bellman
expectation and optimality operators, proof of convergence of policy evaluation and value iteration
algorithms, DP extensions.
Introduction to RL terminology
Reinforcement Learning (RL) is a subfield of machine learning that deals with agents
learning to make sequential decisions by interacting with an environment.
Key Concepts and Terminology:
The main characters of RL are the agent and the environment. The environment is the
world that the agent lives in and interacts with. At every step of interaction, the agent sees
a (possibly partial) observation of the state of the world, and then decides on an action to
take. The environment changes when the agent acts on it, but may also change on its own.
The agent also perceives a reward signal from the environment, a number that tells it how
good or bad the current world state is. The goal of the agent is to maximize its cumulative
reward, called return. Reinforcement learning methods are ways that the agent can learn
behaviors to achieve its goal.
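To make this loop concrete, here is a minimal sketch of the agent-environment interaction in Python, using a made-up toy environment; the class names, states, actions, and rewards below are invented purely for illustration.

```python
import random

# Hypothetical toy environment: the agent starts in state 0 and the episode
# ends when it reaches state 3. Moving "right" earns +1 reward, otherwise 0.
class ToyEnvironment:
    def __init__(self):
        self.state = 0                      # current state of the world

    def step(self, action):
        """Apply the agent's action and return (next_state, reward, done)."""
        if action == "right":
            self.state += 1
            reward = 1.0
        else:                               # "left" moves back toward 0
            self.state = max(0, self.state - 1)
            reward = 0.0
        done = (self.state == 3)            # goal reached -> episode ends
        return self.state, reward, done

class RandomAgent:
    def act(self, observation):
        """Policy: map the observed state to an action (here, at random)."""
        return random.choice(["left", "right"])

env, agent = ToyEnvironment(), RandomAgent()
state, episode_return, done = env.state, 0.0, False
while not done:
    action = agent.act(state)               # agent observes the state and acts
    state, reward, done = env.step(action)  # environment returns next state + reward
    episode_return += reward                # cumulative reward = the return
print("Return for this episode:", episode_return)
```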
Here, we’ll see some of the basic terminology used in Reinforcement Learning:
1. Agent: The entity (learning algorithm + policy) which interacts with the environment and
takes certain actions to get the maximum rewards. Ex: An Autonomous Car.
2. Environment E: It is the surroundings through which the agent moves. The environment
considers the action and the current state of the agent as the input and grants a reward for
the agent and the next state, and that is the output. Ex: City is the environment.
3. Action A: The action taken by the agent based on the state of the environment. Ex: Stop
the car when the signal turns Red.
4. Action Space: Finite set of all possible actions that the agent can take. Ex: Move forward,
turn left, turn right, accelerate, apply brakes, etc.
5. State S: State refers to the current situation returned by the environment. The state
contains all the useful information the agent needs to make the right action. Ex: In our
case of the Autonomous Car Reinforcement Learning Problem, the state would consist of
the obstacles in the city (environment), signages, terrain, etc.
6. State Space: The State Space is the set of all possible states our agent could take in order
to reach the goal. Ex: The Autonomous Car can take multiple routes to reach the same
destination.
7. Reward R: An immediate feedback given to an agent when it performs a specific action
or task. The reward can be positive or negative based on the action taken. Ex: The
Autonomous Car will be rewarded if it followed the traffic rules correctly and will be
penalized/negatively rewarded if it doesn’t follow or if it crashes somewhere.
8. Policy π(s): The strategy the agent applies to decide the next action based on the current
state. It is a mapping from states of the environment to the actions to be taken when in
those states: given the current state, the agent looks up that state in the table to find the
action it should pick (a small tabular sketch follows this list). Ex: the policy determining
the route the Autonomous Car takes to reach its destination.
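As a small tabular sketch of the policy idea in item 8, the lookup table below maps (hypothetical) states of the Autonomous Car to actions; the state and action names are invented.

```python
# Hypothetical tabular policy pi(s): a direct lookup from state to action.
policy = {
    "red_signal_ahead":    "stop",
    "green_signal_ahead":  "move_forward",
    "obstacle_on_left":    "turn_right",
    "at_destination":      "park",
}

current_state = "red_signal_ahead"
action = policy[current_state]   # look up the current state in the table
print(action)                    # -> stop
```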
Markov property
The Markov property, also known as the memoryless property, is a fundamental concept
in probability theory and statistics. It essentially states that the future evolution of a
system depends only on its present state, not on its past history. In simpler terms, what
happened before doesn't matter, only what's happening now determines what will happen
next. In other words, given the present, the future is conditionally independent of the
past.
Mathematically, a stochastic process is said to have the Markov property if and only if
the conditional probability distribution of its future states (conditional on both past and
present values) depends only on the present state. Let X_t represent a random variable
denoting the state of a system at time t. The Markov property can be stated as follows:
P(X_{t+1} = x_{t+1} | X_t = x_t, X_{t-1} = x_{t-1}, …, X_0 = x_0) = P(X_{t+1} = x_{t+1} | X_t = x_t)
This equation implies that the probability of transitioning to the next state X_{t+1} depends
only on the current state X_t and is independent of the entire history of states that
preceded it (X_{t-1}, X_{t-2}, …, X_0).
Intuitive explanation:
Imagine a weather forecasting model. If the Markov property holds, the model only needs
to consider the current weather conditions (temperature, pressure, etc.) to predict
tomorrow's weather. It doesn't need to know the entire history of past weather patterns,
just the current snapshot.
The Markov property is often visualized using a transition probability matrix. Suppose
the state space of the Markov chain is S = {s_1, s_2, …, s_n}, and the transition probability
from state s_i to state s_j at time t is denoted by P_{ij}(t). The Markov property is satisfied if:
P(X_{t+1} = s_j | X_t = s_i) = P_{ij}(t)
for all i and j, where P_{ij}(t) is the probability of transitioning from state s_i to state s_j at
time t.
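As a sketch of this notation, the following simulates a small homogeneous Markov chain from a transition matrix; the weather states and probabilities are made up for illustration.

```python
import random

# Hypothetical 2-state weather chain; states and probabilities are invented.
states = ["sunny", "rainy"]
# P[i][j] = probability of moving from states[i] to states[j] in one step.
P = [
    [0.8, 0.2],   # from "sunny"
    [0.4, 0.6],   # from "rainy"
]

def next_state(i):
    """Sample the next state index given only the current state index i
    (the Markov property: no history is needed)."""
    return random.choices(range(len(states)), weights=P[i])[0]

# Simulate a short trajectory starting from "sunny".
i = 0
trajectory = [states[i]]
for _ in range(10):
    i = next_state(i)
    trajectory.append(states[i])
print(" -> ".join(trajectory))
```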
Applications of the Markov property:
o Markov chains: These are discrete-time stochastic processes where the future state
depends only on the present state. They are used in various fields like modeling
financial markets, queuing systems, and biological sequences.
o Hidden Markov models: These are powerful tools for modeling systems with
hidden states, where we observe only partial information. They are used in speech
recognition, natural language processing, and signal processing.
Markov chains
A Markov chain is a mathematical model that describes a sequence of events or states in
which the probability of transitioning from one state to another depends only on the
current state and is not influenced by previous states. This "memoryless" property is what
makes Markov chains special.
Here are the key components and concepts associated with Markov chains:
1. States (S): Markov chains involve a set of distinct states that represent possible
conditions or situations within a system. These states can be discrete or continuous,
depending on the specific application.
2. Transitions (P): The transition probabilities describe the likelihood of moving from one
state to another in a single time step. Mathematically, the transition probabilities are
defined as P_{ij}, which represents the probability of transitioning from state i to state j.
3. Transition Matrix (P): A transition matrix, often denoted as P, is used to represent all the
transition probabilities in a Markov chain. It is a square matrix where each row
corresponds to a starting state, and each column corresponds to a destination state. The
entries in the matrix represent the transition probabilities.
4. Homogeneous Markov Chain: If the transition probabilities remain constant over time,
the Markov chain is considered homogeneous. In this case, the transition matrix P does
not change with time. This simplifies the analysis and modelling of the Markov chain.
5. Markov Property: The fundamental assumption in a Markov chain is that it follows the
Markov property, meaning that the probability of transitioning to a future state depends
solely on the current state and is independent of the sequence of states that led to the
current state.
6. State Space: The set of all possible states in a Markov chain is known as the state space.
7. Initial State Distribution (π): This represents the probabilities of starting in each state at
the beginning of the Markov chain. It's often represented as a probability vector π, where
π_i is the probability of starting in state i.
8. Stationary Distribution: A stationary distribution π is a probability distribution over the
states that remains unchanged by the transition. It satisfies the condition πP=π, where π is
a row vector, and P is the transition matrix. If the system is in the stationary distribution,
it will remain in the same distribution in subsequent time steps.
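A minimal sketch of finding a stationary distribution for a small (invented) transition matrix by repeatedly applying π ← πP until it stops changing:

```python
# Sketch: approximate the stationary distribution pi satisfying pi P = pi
# by repeatedly applying the transition matrix to an initial distribution.
# The 2-state transition matrix below is invented for illustration.
P = [
    [0.8, 0.2],
    [0.4, 0.6],
]

pi = [1.0, 0.0]                      # initial state distribution
for _ in range(1000):                # iterate pi <- pi P
    pi = [sum(pi[i] * P[i][j] for i in range(len(P))) for j in range(len(P[0]))]

print(pi)   # approximately [2/3, 1/3]; check: pi P = pi
```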
Introduction to and proof of Bellman equations for MRPs along with proof
of existence of solution to Bellman equations in MRP
Bellman equations for MRPs
The Bellman equations for a Markov Reward Process (MRP) provide a recursive
relationship between the value function of a state and the values of its successor states,
considering both the immediate reward and the expected discounted future rewards. The
Bellman equations are crucial for solving and analysing MRPs and play a fundamental
role in dynamic programming and reinforcement learning.
Bellman Expectation Equation for MRP:
The Bellman expectation equation for an MRP is expressed as follows for a state s in the
MRP:
V(s) = R(s) + γ ∑_{s′} P(s′ | s) V(s′)
Here:
V(s) is the value function for state s.
R(s) is the immediate reward in state s.
γ is the discount factor (a constant between 0 and 1).
P(s′ | s) is the transition probability from state s to state s′.
The summation is over all possible successor states s′.
This equation states that the value of a state is the sum of its immediate reward and the
expected discounted value of its successor states. It captures the recursive nature of the value
function in terms of both the current reward and the expected future rewards.
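Because this equation is linear in V, it can also be written in matrix form as V = R + γPV and solved directly as V = (I − γP)⁻¹R. Below is a minimal sketch on a made-up two-state MRP; the transition probabilities, rewards, and discount factor are invented.

```python
import numpy as np

# Hypothetical 2-state MRP (numbers invented for illustration).
P = np.array([[0.9, 0.1],     # P[s, s'] = transition probability s -> s'
              [0.5, 0.5]])
R = np.array([1.0, -2.0])     # R[s] = immediate reward in state s
gamma = 0.9                   # discount factor

# Bellman equation in matrix form: V = R + gamma * P V  =>  (I - gamma P) V = R
V = np.linalg.solve(np.eye(2) - gamma * P, R)
print(V)

# Sanity check: V satisfies the Bellman equation componentwise.
print(np.allclose(V, R + gamma * P @ V))   # -> True
```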
Proof of existence of solution to Bellman equations in MRP
Steps:
Define the space: Consider the set of all possible value functions (assignments of a value to
each state) as a space.
Operator T: Define an operator T that takes a value function V and returns a new value
function T(V) by applying the right-hand side of the Bellman equation, i.e.,
(T V)(s) = R(s) + γ ∑_{s′} P(s′ | s) V(s′).
Distance measure: Measure the "distance" between two value functions as the largest
difference between their values over all states (the max norm).
Contraction mapping: Show that T always brings two value functions closer together:
applying T shrinks the distance between them by a factor of at most γ < 1, so T is a
contraction mapping.
Banach Fixed-Point Theorem: A contraction mapping on a complete metric space has
exactly one fixed point, and repeatedly applying the mapping from any starting point
converges to that fixed point.
Fixed point: This fixed point is a value function that is unchanged by T, i.e., it satisfies
V = T(V), which is exactly the Bellman equation. Hence a solution to the Bellman
equation exists and is unique.
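As a numerical sketch of this argument, the code below applies the Bellman operator T(V) = R + γPV to two different starting value functions on an invented two-state MRP; the max-norm distance between them shrinks by a factor of at most γ each step, so both iterates converge to the same fixed point.

```python
import numpy as np

# Hypothetical 2-state MRP (numbers invented for illustration).
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
R = np.array([1.0, -2.0])
gamma = 0.9

def T(V):
    """Bellman expectation operator: (T V)(s) = R(s) + gamma * sum_s' P(s'|s) V(s')."""
    return R + gamma * P @ V

V1 = np.zeros(2)                    # two arbitrary starting value functions
V2 = np.array([100.0, -100.0])
for k in range(6):
    # max-norm distance contracts by a factor of at most gamma each step
    print(k, np.max(np.abs(V1 - V2)))
    V1, V2 = T(V1), T(V2)

# Iterating T from any start converges to the unique fixed point V = T(V),
# i.e. the solution of the Bellman equation.
```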
Proof of Bellman equations for MRPs
The Bellman equation expresses the value of a state as the sum of the immediate reward
and the expected discounted value of its successor states.
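Written out, one standard way to derive this, using the definition V(s) = E[G_t | S_t = s] and the return recursion G_t = R_{t+1} + γG_{t+1}, is:

```latex
\begin{align*}
V(s) &= \mathbb{E}\left[\, G_t \mid S_t = s \,\right] \\
     &= \mathbb{E}\left[\, R_{t+1} + \gamma G_{t+1} \mid S_t = s \,\right]
        && \text{(return recursion } G_t = R_{t+1} + \gamma G_{t+1}\text{)} \\
     &= R(s) + \gamma\, \mathbb{E}\left[\, G_{t+1} \mid S_t = s \,\right]
        && \text{(linearity of expectation)} \\
     &= R(s) + \gamma \sum_{s'} P(s' \mid s)\, \mathbb{E}\left[\, G_{t+1} \mid S_{t+1} = s' \,\right]
        && \text{(condition on } S_{t+1}\text{; Markov property)} \\
     &= R(s) + \gamma \sum_{s'} P(s' \mid s)\, V(s').
\end{align*}
```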
This completes the proof for the Bellman expectation equation for an MRP. It
demonstrates how the value of a state is recursively related to the values of its successor
states, considering both the immediate reward and the expected discounted future
rewards. The Bellman equation is a key tool for solving and analyzing Markov Reward
Processes.
Introduction to Markov decision process (MDP)
Markov decision processes (MDPs) are a mathematical framework for modeling
decision-making in situations where the outcomes are partly random and partly under the
control of a decision-maker. An MDP models an agent interacting with an environment
over a sequence of discrete time steps.
Here are the key components and concepts in the Markov Decision Processes:
States (S): The system or environment can be in different states. These states represent
the possible configurations or conditions of the system. The set of all possible states is
denoted as S.
Actions (A): At each state, the decision-maker (agent) can choose from a set of possible
actions. The set of all possible actions is denoted as A.
Transition Probabilities (P): When the agent takes an action in a particular state, there
are probabilities associated with transitioning to different states. The transition
probabilities are represented by the function P, which gives the probability of moving to
each state given the current state and action.
Rewards (R): Each state-action pair is associated with a numerical reward. The reward
function R defines the immediate reward the agent receives when taking a particular
action in a specific state.
Policy (π): A policy is a strategy or a set of rules that the agent uses to decide which
action to take in each state. It is represented by the policy function π, which maps states to
actions.
Value Function (V or Q): The value function represents the expected cumulative reward
the agent can achieve starting from a particular state and following a specific policy.
There are two types of value functions: the state value function V and the action value
function Q:
V^π(s) = E_π[G_t | S_t = s]   and   Q^π(s, a) = E_π[G_t | S_t = s, A_t = a]
Here, G_t is the return, which is the sum of immediate rewards and discounted future rewards:
G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + …
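As a sketch of how these components can be written down in code, here is a hypothetical two-state, two-action MDP with its transition probabilities P(s′|s,a), rewards R(s,a), discount factor, and a simple deterministic policy; all names and numbers are invented.

```python
# Hypothetical MDP with states S, actions A, transitions P, rewards R, discount gamma.
S = ["s0", "s1"]
A = ["stay", "go"]

# P[(s, a)] maps each possible next state s' to P(s' | s, a).
P = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "go"):   {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "go"):   {"s0": 0.9, "s1": 0.1},
}

# R[(s, a)] is the immediate reward for taking action a in state s.
R = {("s0", "stay"): 0.0, ("s0", "go"): -1.0,
     ("s1", "stay"): 2.0, ("s1", "go"): -1.0}

gamma = 0.9                       # discount factor

# A deterministic policy pi: state -> action.
pi = {"s0": "go", "s1": "stay"}
```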
State and action value functions
The Bellman Expectation Equations relate the value of a state (or of a state-action pair)
under a policy π to the values of its successors:
V^π(s) = ∑_a π(a | s) [ R(s, a) + γ ∑_{s′} P(s′ | s, a) V^π(s′) ]
Q^π(s, a) = R(s, a) + γ ∑_{s′} P(s′ | s, a) ∑_{a′} π(a′ | s′) Q^π(s′, a′)
The second equation expresses the value of taking a particular action in a state as the sum of the
immediate reward obtained by taking that action and the discounted expected value of the
next state-action pair. The expectation is taken over all possible outcomes of the next state
and the immediate reward.
Intuition:
The Bellman Expectation Equation captures the idea of breaking down the value of a
decision into the immediate reward and the expected future value. It enables the recursive
computation of values, allowing for the evaluation of policies and the determination of
optimal strategies in Markov Decision Processes.
In practice, these equations are often used iteratively to update the values of states and
actions until convergence.
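A minimal sketch of this iterative use of the Bellman expectation equation (iterative policy evaluation) on an invented toy MDP; everything below is made up for illustration.

```python
# Iterative policy evaluation: repeatedly apply the Bellman expectation update
#   V(s) <- R(s, pi(s)) + gamma * sum_s' P(s' | s, pi(s)) V(s')
# until the values stop changing. Toy MDP and policy are invented.
P = {("s0", "stay"): {"s0": 1.0}, ("s0", "go"): {"s0": 0.2, "s1": 0.8},
     ("s1", "stay"): {"s1": 1.0}, ("s1", "go"): {"s0": 0.9, "s1": 0.1}}
R = {("s0", "stay"): 0.0, ("s0", "go"): -1.0,
     ("s1", "stay"): 2.0, ("s1", "go"): -1.0}
gamma = 0.9
pi = {"s0": "go", "s1": "stay"}          # deterministic policy: state -> action

V = {"s0": 0.0, "s1": 0.0}               # start from an arbitrary value function
for _ in range(1000):
    new_V = {s: R[(s, pi[s])] + gamma * sum(p * V[s2] for s2, p in P[(s, pi[s])].items())
             for s in V}
    delta = max(abs(new_V[s] - V[s]) for s in V)
    V = new_V
    if delta < 1e-8:                     # converged to V under policy pi
        break
print(V)
```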
The Bellman Optimality Equation for State-Value Functions states that the value of a state
under the optimal policy is equal to the maximum, over all possible actions, of the expected
immediate reward plus the discounted (by γ) expected value of the state's successor states:
V*(s) = max_a [ R(s, a) + γ ∑_{s′} P(s′ | s, a) V*(s′) ]
Optimal Policy (π∗): The Optimal Policy, denoted as π∗, represents the best possible
strategy an agent can follow to maximize its expected cumulative reward in an MDP. It
specifies the optimal action to take in each state. The Optimal Policy is derived from the
Optimal Action-Value Function Q∗(s, a), which is defined as the maximum action value
achievable over all policies, Q∗(s, a) = max_π Q^π(s, a); the optimal policy then acts
greedily with respect to it: π∗(s) = argmax_a Q∗(s, a).
Bellman Optimality Equation for the Optimal Action-Value Function (Q∗): The
Bellman Optimality Equation for the Optimal Action-Value Function, denoted as Q∗(s, a),
describes the maximum expected cumulative reward an agent can obtain when starting
from a particular state s, taking a specific action a, and then following the best possible
policy. It relates the value of the current state-action pair to the values of its successor
state-action pairs under the optimal policy:
Q∗(s, a) = R(s, a) + γ ∑_{s′} P(s′ | s, a) max_{a′} Q∗(s′, a′)
The Bellman Optimality Equation for Action-Value Functions states that the value of a
state-action pair under the optimal policy is equal to the expected immediate reward plus
the maximum expected value of taking actions in the successor state s′, discounted by γ.
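A minimal sketch of the Bellman optimality equation in action: value iteration repeatedly applies the max over actions at every state until V converges to V*, and acting greedily with respect to the resulting action values gives π*. The toy MDP below is invented.

```python
# Value iteration: V(s) <- max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ],
# the Bellman optimality update, repeated until convergence. Toy MDP invented.
P = {("s0", "stay"): {"s0": 1.0}, ("s0", "go"): {"s0": 0.2, "s1": 0.8},
     ("s1", "stay"): {"s1": 1.0}, ("s1", "go"): {"s0": 0.9, "s1": 0.1}}
R = {("s0", "stay"): 0.0, ("s0", "go"): -1.0,
     ("s1", "stay"): 2.0, ("s1", "go"): -1.0}
gamma = 0.9
states, actions = ["s0", "s1"], ["stay", "go"]

def q(s, a, V):
    """Q(s,a) = R(s,a) + gamma * expected value of the next state under V."""
    return R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())

V = {s: 0.0 for s in states}
for _ in range(1000):
    new_V = {s: max(q(s, a, V) for a in actions) for s in states}   # optimality update
    delta = max(abs(new_V[s] - V[s]) for s in states)
    V = new_V
    if delta < 1e-8:                       # V has converged to (approximately) V*
        break

# The optimal policy is greedy with respect to the optimal action values.
pi_star = {s: max(actions, key=lambda a: q(s, a, V)) for s in states}
print(V, pi_star)
```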