Unit-4 of AI
MDP FORMULATION:
Reinforcement Learning:
Reinforcement Learning is a type of Machine
Learning. It allows machines and software agents to
automatically determine the ideal behavior within a
specific context in order to maximize their
performance. Simple reward feedback is required for
the agent to learn its behavior; this is known as the
reinforcement signal.
There are many different algorithms that tackle this
issue. In fact, Reinforcement Learning is defined by a
specific type of problem, and all its solutions are
classed as Reinforcement Learning algorithms. In this
problem, an agent must decide the best action to
select based on its current state. When this step is
repeated, the problem is known as a Markov Decision
Process.
A Markov Decision Process (MDP) model contains a
set of states, actions, a transition model and rewards.
Its main elements are described below:
What is a State?
A State is a set of tokens that represent every state
that the agent can be in.
What is a Model?
A Model (sometimes called Transition Model) gives
an action’s effect in a state. In particular, T(S, a, S’)
defines a transition T where being in state S and
taking an action ‘a’ takes us to state S’ (S and S’ may
be the same).
What is a Policy?
A Policy is a solution to the Markov Decision
Process. A policy is a mapping from S to a. It
indicates the action ‘a’ to be taken while in state S.
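For concreteness, here is a tiny sketch in Python of how these pieces might be represented; the states, actions, transition probabilities and rewards below are purely illustrative and not taken from any particular problem.

    # A hypothetical two-state MDP, written with plain dictionaries.
    states = ["S0", "S1"]
    actions = ["stay", "move"]

    # Transition model: T[s][a] maps each next state S' to P(S' | S, a)
    T = {
        "S0": {"stay": {"S0": 1.0}, "move": {"S1": 0.8, "S0": 0.2}},
        "S1": {"stay": {"S1": 1.0}, "move": {"S0": 0.8, "S1": 0.2}},
    }

    # Reward for taking action a in state S
    R = {("S0", "stay"): 0.0, ("S0", "move"): 1.0,
         ("S1", "stay"): 0.0, ("S1", "move"): 1.0}

    # A policy maps each state S to the action a to take in that state
    policy = {"S0": "move", "S1": "stay"}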
UTILITY THEORY:
Utility theory offers a framework for making
decisions in situations of uncertainty by assigning
utilities (values) to several possible outcomes. It is
very useful for modelling and optimising decision-
making processes that involve uncertain and
probabilistic outcomes.
In artificial intelligence (AI), utility theory aims to
represent and measure the choices and preferences
of an intelligent entity (agent). It can be applied in
various areas of AI, such as game theory,
reinforcement learning and decision making. A utility
function is used in utility theory to represent an
agent's preferences: it maps potential outcomes or
states to numerical values expressing how desirable
they are to the agent.
UTILITY FUNCTION:
In AI, a utility function is a mathematical
representation of an agent's preferences. It takes an
input (such as a state or an outcome) and returns a
numerical value, the utility, representing the agent's
satisfaction with or preference for that input, so that
decision-making can favor outcomes with higher
utility values.
It helps AI systems make decisions by quantifying
the desirability of different actions or states,
allowing the system to choose the option that
maximizes its expected utility.
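As a minimal sketch (the outcomes, probabilities and utility values below are made up for illustration), an agent can compare actions by their expected utility and pick the best one:

    # Hypothetical utilities assigned to possible outcomes
    utility = {"win": 100, "draw": 20, "lose": -50}

    # Each action leads to outcomes with some probability
    action_outcomes = {
        "aggressive": {"win": 0.5, "lose": 0.5},
        "cautious":   {"win": 0.2, "draw": 0.7, "lose": 0.1},
    }

    def expected_utility(action):
        # Expected utility = sum over outcomes of P(outcome) * U(outcome)
        return sum(p * utility[o] for o, p in action_outcomes[action].items())

    best_action = max(action_outcomes, key=expected_utility)
    print(best_action, expected_utility(best_action))

Here "aggressive" has expected utility 0.5*100 + 0.5*(-50) = 25, while "cautious" gives 0.2*100 + 0.7*20 + 0.1*(-50) = 29, so the agent chooses the cautious action.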
VALUE ITERATION:
Value Iteration is an algorithm used in
Reinforcement Learning and Markov Decision
Processes (MDPs) to compute the optimal policy
for an agent. It focuses on finding the optimal value
function, which assigns a value to each state,
representing the maximum cumulative reward an
agent can achieve starting from that state.
Here’s a step-by-step explanation of how it works (a
short code sketch follows the steps):
1. Initialize Value Function: Start with an arbitrary
value for each state (usually zero).
2. Iterative Update:
o For each state, compute the maximum
expected reward by considering all possible
actions and their outcomes (using the
transition probabilities and rewards).
o Update the value of the state based on this
computation.
3. Convergence:
o Repeat the updates until the values stabilize
(i.e., the difference between consecutive
updates is smaller than a predefined
threshold).
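A short Python sketch of value iteration, assuming the same dictionary-based MDP representation as in the earlier example (T[s][a] maps next states to probabilities and R[(s, a)] gives the immediate reward; gamma is the discount factor and theta the stopping threshold, both illustrative names):

    def value_iteration(states, actions, T, R, gamma=0.9, theta=1e-6):
        # 1. Initialize the value of every state to zero
        V = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:
                # 2. Bellman update: best expected return over all actions
                best = max(
                    R[(s, a)] + gamma * sum(p * V[s2] for s2, p in T[s][a].items())
                    for a in actions
                )
                delta = max(delta, abs(best - V[s]))
                V[s] = best
            # 3. Convergence: stop once no value changed by more than theta
            if delta < theta:
                return V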
POLICY ITERATION:
Policy Iteration is an algorithm used in
Reinforcement Learning and Markov Decision
Processes (MDPs) to find the optimal policy for an
agent. Unlike Value Iteration, which focuses on
directly refining the value function, Policy Iteration
alternates between two steps: policy evaluation
and policy improvement.
Here’s how it works (a code sketch follows the steps):
1. Policy Evaluation:
o Start with an initial policy (which can be
arbitrary).
o Evaluate how "good" this policy is by
calculating the value function for each
state, assuming the agent follows this
policy.
2. Policy Improvement:
o Using the value function from the policy
evaluation step, update the policy by
selecting the action in each state that leads
to the maximum expected reward.
o This creates a new, improved policy.
3. Repeat Until Convergence:
o Alternate between policy evaluation and
policy improvement until the policy stops
changing. At this point, you have the
optimal policy.
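A compact Python sketch of policy iteration under the same assumed MDP representation (the structure of the evaluation loop and the variable names are illustrative):

    def policy_iteration(states, actions, T, R, gamma=0.9, theta=1e-6):
        # Start from an arbitrary initial policy
        policy = {s: actions[0] for s in states}
        V = {s: 0.0 for s in states}
        while True:
            # 1. Policy evaluation: compute V for the current policy
            while True:
                delta = 0.0
                for s in states:
                    a = policy[s]
                    v = R[(s, a)] + gamma * sum(p * V[s2] for s2, p in T[s][a].items())
                    delta = max(delta, abs(v - V[s]))
                    V[s] = v
                if delta < theta:
                    break
            # 2. Policy improvement: pick the best action in each state
            stable = True
            for s in states:
                best_a = max(actions,
                             key=lambda a: R[(s, a)] + gamma *
                             sum(p * V[s2] for s2, p in T[s][a].items()))
                if best_a != policy[s]:
                    policy[s] = best_a
                    stable = False
            # 3. Stop when the policy no longer changes
            if stable:
                return policy, V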
PARTIALLY OBSERVABLE MDPS:
Partially Observable Markov Decision Processes
(POMDPs) are an extension of Markov Decision
Processes (MDPs) used to model decision-making in
situations where the agent doesn't have full visibility
or certainty about the environment's current state.
They are particularly useful in real-world scenarios
where an agent must act under uncertainty.
The agent's goal in a POMDP is to develop a policy
that determines the best action to take based on its
belief about the current state. This belief is
represented as a probability distribution over all
possible states, updated as new observations are
made.
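A minimal sketch of such a belief update in Python (essentially a Bayes filter over states), assuming hypothetical tables T[s][a] for transition probabilities and O[s][o] for observation probabilities:

    def update_belief(belief, action, observation, T, O):
        # belief: {state: probability}; T[s][a][s'] = P(s' | s, a); O[s'][o] = P(o | s')
        new_belief = {}
        for s2 in belief:
            # Predict: probability of reaching s2 after taking the action
            predicted = sum(belief[s] * T[s][action].get(s2, 0.0) for s in belief)
            # Correct: weight by how likely the observation is in s2
            new_belief[s2] = O[s2][observation] * predicted
        total = sum(new_belief.values())
        # Normalize so the belief is again a probability distribution
        # (assumes the observation is possible in at least one state)
        return {s: p / total for s, p in new_belief.items()}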
Key elements of a POMDP:
1. States: A set of all possible states the
environment can be in (similar to MDPs).
2. Actions: The set of actions the agent can take to
interact with the environment.
3. Transition Probabilities: Probabilities that
taking an action in a given state will lead to a
specific new state.
4. Observations: Instead of directly observing the
state, the agent receives observations that
provide partial information about the true state.
5. Observation Probabilities: The likelihood of
receiving a specific observation in a given state.
6. Rewards: Numerical values representing the
immediate benefit of taking an action in a
particular state.
Applications
POMDPs are widely used in areas where uncertainty
is inherent, such as:
• Robot Navigation: Robots may not have full
knowledge of their surroundings due to sensor
limitations.
• Medical Diagnosis: Doctors making decisions
based on incomplete or noisy patient data.
• Speech Recognition: Understanding spoken
words when the input contains ambiguity or
noise.