Markov Decision Process (MDP)

• A Markov Decision Process (MDP) is a mathematical framework used for modeling decision-making in environments where outcomes are partly random and partly under the control of a decision-maker. MDPs are widely used in reinforcement learning and operations research to solve sequential decision problems.
Components of an MDP

• An MDP is defined by the tuple (S, A, P, R, γ):
1. States (S): The set of all possible situations the agent can be in. Example: the location of a robot in a grid.
2. Actions (A): The set of all possible actions the agent can take in a state. Example: moving up, down, left, or right.
3. Transition Probabilities (P): The probability of moving from one state to another, given an action: P(s′ | s, a) = the probability of transitioning to state s′ from state s after taking action a.
4. Reward Function (R): The immediate reward received after transitioning from one state to another due to an action: R(s, a, s′) = the reward for transitioning from s to s′ using action a.
5. Discount Factor (γ): A value between 0 and 1 that determines the importance of future rewards:
• γ = 0: the agent considers only immediate rewards.
• γ close to 1: future rewards are weighted almost as heavily as immediate ones.
• The goal is to find a policy (π) that tells the agent the best action to take in each state to maximize the cumulative reward (expected return) over time.
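To make these components concrete, below is a minimal value-iteration sketch for a small hypothetical MDP; the three states, two actions, transition probabilities, rewards, and discount factor are all invented for illustration.

```python
import numpy as np

# Hypothetical MDP with 3 states and 2 actions (illustrative numbers only).
n_states, n_actions = 3, 2
# P[a][s][s'] = probability of moving from state s to s' under action a.
P = np.array([
    [[0.8, 0.2, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]],  # action 0
    [[0.9, 0.1, 0.0], [0.0, 0.9, 0.1], [0.0, 0.0, 1.0]],  # action 1
])
# R[a][s] = expected immediate reward for taking action a in state s.
R = np.array([
    [0.0, 0.0, 1.0],   # action 0
    [0.0, 0.5, 1.0],   # action 1
])
gamma = 0.9            # discount factor

# Value iteration: repeatedly apply the Bellman optimality update.
V = np.zeros(n_states)
for _ in range(1000):
    Q = R + gamma * (P @ V)     # Q[a][s] = R(s, a) + γ Σ_s' P(s'|s, a) V(s')
    V_new = Q.max(axis=0)       # best achievable value in each state
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

policy = Q.argmax(axis=0)       # greedy policy with respect to the converged values
print("State values:", V)
print("Policy (best action per state):", policy)
```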
Hidden Markov Model (HMM)
• The Hidden Markov Model (HMM) is a statistical model used to describe the probabilistic relationship between a sequence of observations and a sequence of hidden states. It is often used in situations where the underlying system or process that generates the observations is unknown or hidden, hence the name "Hidden Markov Model."
• It is used to predict future observations or classify
sequences, based on the underlying hidden process that
generates the data.
• An HMM consists of two types of variables: hidden states
and observations.
• The hidden states are the underlying variables that
generate the observed data, but they are not directly
observable.

• The observations are the variables that are measured and observed.
• The relationship between the hidden states and the observations is modeled using a probability distribution. The HMM captures this relationship with two sets of probabilities: the transition probabilities and the emission probabilities.
• The transition probabilities describe the probability
of transitioning from one hidden state to another.

• The emission probabilities describe the probability of observing an output given a hidden state.
Algorithm Steps
• Step 1: Define the state space and observation space
• The state space is the set of all possible hidden states, and
the observation space is the set of all possible observations.

• Step 2: Define the initial state distribution
• This is the probability distribution over the initial state.

• Step 3: Define the state transition probabilities
• These are the probabilities of transitioning from one hidden state to another; together they form the transition matrix.
• Step 4: Define the observation likelihoods
• These are the probabilities of generating each observation from each hidden state; together they form the emission matrix.
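Steps 1–4 amount to writing down a few arrays. Below is a minimal sketch for a hypothetical two-state weather HMM; the state names, observation symbols, and probabilities are invented for illustration.

```python
import numpy as np

# Step 1: state space and observation space (hypothetical example).
states = ["Rainy", "Sunny"]                 # hidden states
observations = ["walk", "shop", "clean"]    # observable symbols

# Step 2: initial state distribution, pi[i] = P(state_0 = i).
pi = np.array([0.6, 0.4])

# Step 3: transition matrix, A[i, j] = P(state_{t+1} = j | state_t = i).
A = np.array([
    [0.7, 0.3],   # Rainy -> Rainy, Rainy -> Sunny
    [0.4, 0.6],   # Sunny -> Rainy, Sunny -> Sunny
])

# Step 4: emission matrix, B[i, k] = P(observation_t = k | state_t = i).
B = np.array([
    [0.1, 0.4, 0.5],   # Rainy: walk, shop, clean
    [0.6, 0.3, 0.1],   # Sunny: walk, shop, clean
])

# pi and every row of A and B are probability distributions, so they sum to 1.
assert np.isclose(pi.sum(), 1)
assert np.allclose(A.sum(axis=1), 1) and np.allclose(B.sum(axis=1), 1)
```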

• Step 5: Train the model
• The state transition probabilities and observation likelihoods are estimated with the Baum-Welch algorithm, an expectation-maximization procedure built on the forward-backward algorithm. The parameters are updated iteratively until convergence.
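In practice the training step is usually delegated to a library. The sketch below assumes the third-party hmmlearn package (its CategoricalHMM class in recent releases) and an invented observation sequence.

```python
import numpy as np
from hmmlearn import hmm   # assumes hmmlearn is installed (pip install hmmlearn)

# Observed sequence encoded as integer symbols (0 = walk, 1 = shop, 2 = clean).
obs = np.array([0, 1, 2, 2, 1, 0, 0, 2, 1, 2]).reshape(-1, 1)

# Two hidden states; fit() runs Baum-Welch (EM) until convergence or n_iter is reached.
model = hmm.CategoricalHMM(n_components=2, n_iter=100, random_state=0)
model.fit(obs)

print("Estimated transition matrix:\n", model.transmat_)
print("Estimated emission matrix:\n", model.emissionprob_)
print("Estimated initial distribution:", model.startprob_)
```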
• Step 6: Decode the most likely sequence of
hidden states
• Given the observed data, the Viterbi algorithm is used
to compute the most likely sequence of hidden states.
This can be used to predict future observations, classify
sequences, or detect patterns in sequential data.
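The decoding step can also be written directly in a few lines of NumPy. Below is a compact Viterbi sketch, reusing the toy weather parameters from the earlier sketch (all numbers illustrative).

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Return the most likely hidden-state sequence for an observation sequence."""
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))           # delta[t, i]: best path probability ending in state i at t
    psi = np.zeros((T, N), dtype=int)  # psi[t, i]: predecessor of state i on that best path
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A   # scores[i, j] = delta[t-1, i] * A[i, j]
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    # Backtrack from the best final state.
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1]

# Toy parameters (same hypothetical weather HMM as above).
pi = np.array([0.6, 0.4])                          # Rainy, Sunny
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.1, 0.4, 0.5], [0.6, 0.3, 0.1]])
obs = [0, 1, 2]                                    # walk, shop, clean
print(viterbi(obs, pi, A, B))   # -> [1, 0, 0], i.e. Sunny, Rainy, Rainy
```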

• Step 7: Evaluate the model
• The performance of the HMM can be evaluated using various metrics, such as accuracy, precision, recall, or F1 score.
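If ground-truth hidden states are available for a labelled test sequence, the decoded states can be scored directly. A minimal sketch, assuming scikit-learn is available and using invented state sequences:

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical ground-truth and decoded hidden-state sequences.
true_states    = [1, 0, 0, 1, 1, 0]
decoded_states = [1, 0, 1, 1, 1, 0]

print("Accuracy:", accuracy_score(true_states, decoded_states))
print("F1 score:", f1_score(true_states, decoded_states))
```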
Exploration and Exploitation
• Exploration and exploitation are complementary strategies for building effective learning algorithms that can adapt and perform well in different environments.
Exploitation

Exploitation is a strategy of using the accumulated knowledge to make decisions that maximize the expected reward based on the present information.
1. Reward Maximization: Maximizing the immediate or short-term reward based on the current understanding of the environment is the main objective of exploitation. This means choosing the actions whose learned values the model predicts will yield the highest expected payoff.
2. Decision Efficiency: Exploitation can often make more efficient decisions by
concentrating on known high-reward actions, which lowers the computational and
temporal costs associated with exploration.

3. Risk Aversion: Exploitation inherently involves a lower level of risk as it relies on tried and tested actions, avoiding the uncertainty associated with less familiar options.
Techniques
• Greedy Algorithms: Greedy algorithms choose the locally optimal solution at each step without considering the potential impact on the overall solution. They are often efficient in terms of computation time; however, this approach may be suboptimal when short-term sacrifices are required to achieve the best global solution.

• Exploitation of Learned Policies: Reinforcement learning algorithms often act according to previously learned policies as a way of leveraging earlier gains. This means picking the action that produced high rewards in similar past experiences.

• Model-Based Methods: Model-based approaches take advantage of an underlying model of the environment and make decisions based on its predictive capabilities.
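As a concrete illustration of pure exploitation, a greedy agent simply picks the action with the highest learned value. A minimal sketch with an invented action-value (Q) table:

```python
import numpy as np

# Hypothetical learned action-value table: Q[state, action].
Q = np.array([
    [0.2, 0.8, 0.1],   # state 0
    [0.5, 0.3, 0.9],   # state 1
])

def greedy_action(state: int) -> int:
    """Pure exploitation: always take the action with the highest learned value."""
    return int(Q[state].argmax())

print(greedy_action(0))   # -> 1
print(greedy_action(1))   # -> 2
```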
Exploration
• Exploration is used to increase knowledge about an environment or model. The exploration process selects actions with uncertain outcomes to gather information about the possible states and rewards that the performed actions may lead to.
1. Information Gain: The main objective of exploration is to gather fresh data that improves the model's understanding of the environment. This involves exploring distinct regions of the state space or experimenting with actions whose outcomes are unknown.

2. Uncertainty Reduction: Reducing uncertainty in the model's estimates of the environment guides which actions are selected. For example, actions that have rarely been chosen in the past may be prioritized because their possible rewards are still uncertain.

3. State Space Coverage: In certain models, especially those with large or continuous state
spaces, exploration makes sure that enough different areas of the state space are visited to
prevent learning that is biased toward a small number of experiences.
Techniques
• Epsilon-Greedy Exploration: Epsilon-greedy algorithms unify the two behaviours (exploitation and exploration) by choosing a completely random action with probability epsilon while using the current best-known action with probability (1 - epsilon).
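A minimal epsilon-greedy action-selection sketch; the action values and epsilon below are invented for illustration.

```python
import random

EPSILON = 0.1   # exploration rate (hypothetical value)

# Hypothetical learned action values for a single state.
q_values = {"up": 0.2, "down": 0.7, "left": 0.1, "right": 0.4}

def epsilon_greedy(q_values: dict, epsilon: float) -> str:
    """With probability epsilon pick a random action, otherwise the best-known one."""
    if random.random() < epsilon:
        return random.choice(list(q_values))    # explore
    return max(q_values, key=q_values.get)      # exploit

print(epsilon_greedy(q_values, EPSILON))   # usually "down", occasionally a random action
```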

• Thompson Sampling: Thompson sampling uses a Bayesian approach to explore and exploit simultaneously. It maintains a probability distribution over the unknown parameters (for example, each action's reward probability), samples from that distribution, and acts on the sample, so actions are chosen roughly in proportion to how likely they are to be optimal, balancing exploration and exploitation.
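A minimal Thompson-sampling sketch for a hypothetical Bernoulli bandit; the true success probabilities below are invented and would be unknown to the agent in practice.

```python
import numpy as np

rng = np.random.default_rng(0)
true_probs = [0.3, 0.5, 0.7]   # hidden reward probabilities of 3 arms (illustrative)
successes = np.ones(3)          # Beta posterior parameters alpha (start from Beta(1, 1))
failures = np.ones(3)           # Beta posterior parameters beta

for _ in range(1000):
    # Sample a plausible reward probability for each arm from its posterior.
    samples = rng.beta(successes, failures)
    arm = int(samples.argmax())           # act greedily with respect to the sampled beliefs
    reward = rng.random() < true_probs[arm]
    # Update the chosen arm's posterior with the observed outcome.
    successes[arm] += reward
    failures[arm] += 1 - reward

print("Posterior mean per arm:", successes / (successes + failures))
# The best arm (index 2) ends up pulled most often, so its estimate is the sharpest.
```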
Balancing Exploitation and
Exploration
• Exploration-Exploitation Trade-off: The central idea is to understand the trade-off between the exploration and exploitation processes. Resources should be allocated to each depending on the current state of knowledge and the complexity of the learning task.

• Dynamic Parameter Tuning: The algorithm dynamically adjusts its exploration and exploitation parameters according to how the model is performing and how the environment's characteristics change, so that it adapts to the changing environment while still learning efficiently.
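One common form of dynamic tuning is to decay the exploration rate over time. A minimal sketch; the schedule constants are invented for illustration.

```python
# Exponentially decay epsilon from 1.0 (explore a lot) toward 0.05 (mostly exploit).
EPS_START, EPS_END, DECAY = 1.0, 0.05, 0.995   # hypothetical schedule constants

epsilon = EPS_START
for episode in range(2000):
    # ... run one episode of epsilon-greedy learning with the current epsilon ...
    epsilon = max(EPS_END, epsilon * DECAY)    # reduce exploration as knowledge accumulates

print(f"Final epsilon after 2000 episodes: {epsilon:.3f}")
```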
