What Are Probabilistic Machine Learning Models?

The document discusses probabilistic machine learning models, which provide probability distributions over classes for classification tasks, contrasting them with non-probabilistic models that only output the most likely class. It explains the advantages of probabilistic models, such as conveying uncertainty and confidence in predictions, and details the Naïve Bayes classifier, including its assumptions, advantages, disadvantages, and applications. Additionally, it covers objective functions used in machine learning, including loss functions for both probabilistic and non-probabilistic models.

Unit IV
Probabilistic Methods for Learning
Introduction
What are Probabilistic Machine Learning Models?

To understand what a probabilistic machine learning model is, consider a
classification problem with N classes. If the classification model (classifier) is
probabilistic, then for a given input it outputs a probability for each of the N
classes. In other words, a probabilistic classifier provides a probability
distribution over the N classes. Usually, the class with the highest probability is
then selected as the class to which the input data instance belongs.

However, logistic regression (a probabilistic binary classification technique
based on the Sigmoid function) can be considered a partial exception, as it outputs
the probability of one class only (usually Class 1). In the binary case, the
probability of Class 0 follows as 1 - probability of Class 1. When one sigmoid
output is used per label, however, the outputs are computed independently and need
not sum to 1. Because of this property, Logistic Regression is also useful in
Multi-Label Classification problems, where a single data point can have multiple
class labels.

Some examples of probabilistic models are Logistic Regression, Bayesian
Classifiers, Hidden Markov Models, and Neural Networks (with a Softmax output
layer).

If the model is Non-Probabilistic (Deterministic), it will usually output only the
most likely class that the input data instance belongs to. The Support Vector
Machine is a popular non-probabilistic classifier.

S. MANICKAM AP/CSE GRACE COLLEGE OF ENGINEERING, TUTICORIN



Let’s discuss an example to better understand probabilistic classifiers. Take the task
of classifying an image of an animal into five classes — {Dog, Cat, Deer, Lion,
Rabbit} as the problem. As input, we have an image (of a dog). For this example,
let’s consider that the classifier works well and provides correct/ acceptable results
for the particular input we are discussing. When the image is provided as the input to
the probabilistic classifier, it will provide an output such as (Dog (0.6), Cat (0.2),
Deer(0.1), Lion(0.04), Rabbit(0.06)). But, if the classifier is non-probabilistic, it will
only output “Dog”.
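
The difference between the two kinds of output can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the raw scores are hypothetical values chosen so that the softmax reproduces the distribution quoted above, and the function name `softmax` is our own, not part of any particular library.

```python
import math

def softmax(scores):
    """Turn raw class scores into a probability distribution."""
    shift = max(scores)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

classes = ["Dog", "Cat", "Deer", "Lion", "Rabbit"]

# Hypothetical raw scores: log-probabilities of the example distribution,
# so that softmax returns exactly (0.6, 0.2, 0.1, 0.04, 0.06).
scores = [math.log(p) for p in (0.6, 0.2, 0.1, 0.04, 0.06)]

probs = softmax(scores)                        # probabilistic output
predicted = classes[probs.index(max(probs))]   # non-probabilistic output

for name, p in zip(classes, probs):
    print(f"{name}: {p:.2f}")
print("Most likely class:", predicted)
```

A probabilistic classifier would report the whole list `probs`, while a non-probabilistic classifier would report only `predicted`.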

Why probabilistic ML models?

One of the major advantages of probabilistic models is that they convey the
uncertainty associated with their predictions. In other words, we can see how
confident a machine learning model is in its prediction. In the above example, if
the probabilistic classifier assigns a probability of 0.9 to the ‘Dog’ class instead
of 0.6, the classifier is more confident that the animal in the image is a dog.
Uncertainty and confidence are extremely useful in critical machine learning
applications such as disease diagnosis and autonomous driving. Probabilistic
outputs are also useful for numerous techniques related to Machine Learning, such
as Active Learning.

Objective Functions

To identify whether a particular model is probabilistic or not, we can look at
its Objective Function. In machine learning, we aim to optimize a model to excel at
a particular task. The objective function provides a value based on the model’s
outputs, so that optimization can be done by either maximizing or minimizing that
value. In Machine Learning, the goal is usually to minimize prediction error. So,
we define what is called a loss function as the objective function and try to
minimize it in the training phase of an ML model.

If we take a basic machine learning model such as Linear Regression, the
objective function is based on the squared error. The objective of the training is
to minimize the Mean Squared Error (MSE) or, equivalently, the Root Mean Squared
Error (RMSE) (Eq. 1). The intuition behind the Mean Squared Error is that the
loss/error created by a prediction for a particular data point is based on the
difference between the actual value and the predicted value (note that Linear
Regression addresses a regression problem, not a classification problem).

The loss created by a particular data point will be higher if the prediction given
by the model is significantly higher or lower than the actual value. The loss will be
lower when the predicted value is very close to the actual value. As you can see, the
objective function here is not based on probabilities, but on the squared difference
between the actual value and the predicted value.

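Eq. 1: $\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_{\mathrm{true},i} - y_{\mathrm{predict},i}\right)^{2}, \qquad \mathrm{RMSE} = \sqrt{\mathrm{MSE}}$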

Here, n indicates the number of data instances in the data set, y_true is the
correct/ true value and y_predict is the predicted value (by the linear regression
model).
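
As a small illustration of the squared-error objective described above, the sketch below computes the MSE and RMSE for a handful of values. The numbers are made up for the example, and the function name `mean_squared_error` is our own choice rather than a reference to any specific library.

```python
import math

def mean_squared_error(y_true, y_predict):
    """Average of the squared differences between true and predicted values."""
    n = len(y_true)
    return sum((t - p) ** 2 for t, p in zip(y_true, y_predict)) / n

# Made-up values for illustration only.
y_true = [3.0, 5.0, 7.0]
y_predict = [2.5, 5.0, 9.0]

mse = mean_squared_error(y_true, y_predict)
rmse = math.sqrt(mse)

print(f"MSE  = {mse:.4f}")
print(f"RMSE = {rmse:.4f}")
```

The third prediction is off by 2.0 and contributes 4.0 to the sum, while the first is off by 0.5 and contributes only 0.25, which shows how squaring penalizes large errors more heavily.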

When it comes to Support Vector Machines, the objective is to maximize the
margin, that is, the distance between the decision boundary and the closest data
points (the support vectors). This concept is also known as the ‘Large Margin
Intuition’. As you can see, in both Linear Regression and Support Vector Machines,
the objective functions are not based on probabilities. So, they can be considered
non-probabilistic models.

On the other hand, if we consider a neural network with a softmax output layer, the
loss function is usually defined using Cross-Entropy Loss (CE loss) (Eq. 2). Note that
we are considering a training dataset with n data points, so we take the average of
the losses of the individual data points as the CE loss of the dataset. Here, y_i
is the true label of data point i and p(y_i) is the predicted probability for the
class y_i (the probability, as assigned by the model, that this data point belongs
to the class y_i).

The intuition behind Cross-Entropy Loss is this: if the probabilistic model is able
to predict the correct class of a data point with high confidence, the loss will be
low. In the image classification example discussed above, if the model assigns a
probability of 1.0 to the class ‘Dog’ (which is the correct class), the loss due to
that prediction = -log(P(‘Dog’)) = -log(1.0) = 0. Instead, if the predicted
probability for the ‘Dog’ class is 0.8, the loss = -log(0.8) = 0.097. However, if the
model assigns a low probability to the correct class, like 0.3, the loss = -log(0.3)
= 0.523, which can be considered a significant loss. These values use base-10
logarithms; with the natural logarithm, which is more common in practice, the
corresponding losses are 0.223 and 1.204.

Eq. 2: $\mathrm{CE} = -\frac{1}{n}\sum_{i=1}^{n}\log p(y_i)$
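
To make the arithmetic above concrete, the following minimal sketch computes the per-example loss -log p for the three probabilities discussed (1.0, 0.8 and 0.3) and their average. The function name `cross_entropy` is illustrative only. The figures 0.097 and 0.523 quoted in the text correspond to base-10 logarithms, so both bases are shown.

```python
import math

def cross_entropy(true_class_probs, log=math.log):
    """Average of -log p(y_i), where p(y_i) is the probability the model
    assigned to the correct class of data point i."""
    n = len(true_class_probs)
    return -sum(log(p) for p in true_class_probs) / n

# Probabilities assigned to the correct class in the three cases from the text.
p_correct = [1.0, 0.8, 0.3]

losses_base10 = [-math.log10(p) for p in p_correct]
losses_natural = [-math.log(p) for p in p_correct]

for p, l10, ln in zip(p_correct, losses_base10, losses_natural):
    print(f"p = {p}: -log10(p) = {l10:.3f}, -ln(p) = {ln:.3f}")

# Dataset-level loss: the average of the individual losses (natural log).
ce = cross_entropy(p_correct)
print(f"CE over the three points = {ce:.3f}")
```

Whichever base is used, the ordering is the same: the more probability the model places on the correct class, the smaller the loss.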

In a binary classification model based on Logistic Regression, the loss function is
usually defined using the Binary Cross Entropy loss (BCE loss).

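Eq. 3: $\mathrm{BCE} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log p(s_i) + (1 - y_i)\log\left(1 - p(s_i)\right)\right]$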

Here y_i is the class label (1 for class 1, 0 for class 0) and p(s_i) is the
predicted probability of point i being in class 1, for each point i in the dataset.
N is the number of data points. Note that as this is a binary classification
problem, there are only two classes, class 1 and class 0.
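
A minimal sketch of the binary cross-entropy computation is given below. The labels and predicted probabilities are invented for illustration, and `binary_cross_entropy` is a hypothetical helper name, not a library function.

```python
import math

def binary_cross_entropy(labels, probs):
    """Average BCE loss; probs[i] is the predicted probability of class 1."""
    n = len(labels)
    total = 0.0
    for y, p in zip(labels, probs):
        # For y = 1 only log(p) counts; for y = 0 only log(1 - p) counts.
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / n

# Invented example: four data points with their true labels.
labels = [1, 0, 1, 0]
confident = [0.9, 0.1, 0.8, 0.2]   # mostly right and confident
uncertain = [0.6, 0.4, 0.5, 0.5]   # close to a coin flip

bce_confident = binary_cross_entropy(labels, confident)
bce_uncertain = binary_cross_entropy(labels, uncertain)

print(f"BCE (confident model) = {bce_confident:.3f}")
print(f"BCE (uncertain model) = {bce_uncertain:.3f}")
```

The confident and correct predictions give the lower loss, which is the behaviour the loss function is designed to reward.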

Naïve Bayes Classifier Algorithm

o The Naïve Bayes algorithm is a supervised learning algorithm, which is based
on Bayes' theorem and used for solving classification problems.
o It is mainly used in text classification, which involves high-dimensional
training datasets.
o The Naïve Bayes classifier is one of the simplest and most effective
classification algorithms; it helps in building fast machine learning models that
can make quick predictions.
o It is a probabilistic classifier, which means it predicts on the basis of the
probability of an object belonging to each class.
o Some popular applications of the Naïve Bayes algorithm are spam filtering,
sentiment analysis, and classifying articles.

Why is it called Naïve Bayes?

The name Naïve Bayes is composed of two words, Naïve and Bayes, which can
be described as:

o Naïve: It is called Naïve because it assumes that the occurrence of a certain
feature is independent of the occurrence of other features. For example, if a
fruit is identified on the basis of color, shape, and taste, then a red,
spherical, and sweet fruit is recognized as an apple. Each feature individually
contributes to identifying it as an apple, without depending on the others.
o Bayes: It is called Bayes because it depends on the principle of Bayes'
Theorem.

Bayes' Theorem:
o Bayes' theorem, also known as Bayes' Rule or Bayes' law, is used to
determine the probability of a hypothesis given prior knowledge. It depends on
conditional probability.
o The formula for Bayes' theorem is given as:

$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$

P(A|B) is Posterior probability: Probability of hypothesis A given the observed
event B.

P(B|A) is Likelihood probability: Probability of the evidence given that the
hypothesis is true.

P(A) is Prior Probability: Probability of hypothesis before observing the evidence.

P(B) is Marginal Probability: Probability of Evidence.

Working of Naïve Bayes' Classifier:


Stepwise Bayes Theorem

As a running example, suppose you have made a resolution to go for a run
regularly, and you have been keeping track of the weather outside (and whether you
went for a run) for the past two weeks:

Step 1 – Collect raw data

Next, you need to create a frequency table for each attribute of your dataset.

Step 2 – Convert data to a frequency table(s)

Then, for each frequency table, you will create a likelihood table.

Step 3 – Calculate prior probability and evidence

Step 4 – Apply probabilities to Bayes’ Theorem equation

Let’s say you want to find the probability that you go for a run given that it’s
sunny outside.

P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny) = 0.625 * 0.571 / 0.428 = 0.834
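
The calculation in Step 4 can be reproduced with a few lines of Python. This is only a sketch: the three input values are the rounded figures quoted in the line above, and the function name `bayes_posterior` is our own.

```python
def bayes_posterior(likelihood, prior, evidence):
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return likelihood * prior / evidence

p_sunny_given_yes = 0.625   # likelihood  P(Sunny|Yes)
p_yes = 0.571               # prior       P(Yes)
p_sunny = 0.428             # evidence    P(Sunny)

p_yes_given_sunny = bayes_posterior(p_sunny_given_yes, p_yes, p_sunny)
print(f"P(Yes|Sunny) = {p_yes_given_sunny:.3f}")
```

Because the inputs are rounded to three decimals, the result should be read as approximate.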

The Naïve Bayes Algorithm

Naïve Bayes assumes conditional independence between the input features. The
classifier separates data into different classes according to Bayes’ Theorem, but
assumes that all input features are independent of one another given the class.
Hence, the model is called naïve.

This assumption lets the joint likelihood factor into a product of per-feature
likelihoods. Because the denominator (the evidence) is the same for every class, it
can be dropped when comparing classes:

$P(y \mid x_1, \dots, x_n) \propto P(y)\prod_{i=1}^{n} P(x_i \mid y)$

Let’s understand this through our running resolution example:

Say you want to predict whether, given the following weather conditions on the
coming Wednesday, you should go for a run or sleep in:

Outlook: Rainy

Humidity: Normal

Wind: Weak

Run: ?

Likelihood of ‘Yes’ on Wednesday:

P(Outlook = Rainy|Yes) * P(Humidity = Normal|Yes) * P(Wind = Weak|Yes) * P(Yes)

= 1/8 * 1/8 * 9/9 * 8/14 = 0.0089

Likelihood of ‘No’ on Wednesday:

P(Outlook = Rainy|No) * P(Humidity = Normal|No) * P(Wind = Weak|No) * P(No)

= 3/6 * 3/6 * 2/5 * 6/14 = 0.0428

Now, to determine the probability of going for a run on Wednesday, you just need
to divide the likelihood of Yes by the sum of the likelihoods of Yes and No.

P(Yes) = 0.0089 / (0.0089 + 0.0428) = 0.172

Similarly, P(No) = 0.0428 / (0.0089 + 0.0428) = 0.828

According to your model, it looks like there’s an almost 83% probability that
you’re going to stay under the covers next Wednesday!
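
The normalization step above can be sketched as follows. The fractions are copied from the worked example as given; the variable names are illustrative, and exact fractions are used so that no rounding happens before the final division.

```python
from fractions import Fraction as F

# Unnormalized scores: product of the per-feature likelihoods and the prior,
# using the fractions quoted in the worked example.
score_yes = F(1, 8) * F(1, 8) * F(9, 9) * F(8, 14)
score_no = F(3, 6) * F(3, 6) * F(2, 5) * F(6, 14)

# Dividing by the sum turns the two scores into probabilities that add to 1.
total = score_yes + score_no
p_yes = score_yes / total   # exactly 5/29
p_no = score_no / total     # exactly 24/29

print(f"score(Yes) = {float(score_yes):.4f}, score(No) = {float(score_no):.4f}")
print(f"P(Yes) = {float(p_yes):.3f}, P(No) = {float(p_no):.3f}")

prediction = "Yes" if p_yes > p_no else "No"
print("Predicted class:", prediction)
```

Note that the predicted class would be the same without the division, since the larger score always yields the larger probability; the normalization is only needed when the probabilities themselves are of interest.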

This was just a fun example. Although Naïve Bayes IS used for weather
predictions, for advanced machine learning problems, the complexity of the
Bayesian classifier needs to be reduced for it to be practical. This is where the
naïve in Naïve Bayes comes in.
Advantages of Naïve Bayes Classifier:
o Naïve Bayes is one of the fastest and easiest ML algorithms for predicting the
class of a data point.
o It can be used for Binary as well as Multi-class Classifications.
o It often performs well in Multi-class predictions compared to other
algorithms.
o It is a popular choice for text classification problems.

Disadvantages of Naïve Bayes Classifier:


o Naive Bayes assumes that all features are independent or unrelated, so it
cannot learn the relationship between features.

Applications of Naïve Bayes Classifier:


o It is used for Credit Scoring.
o It is used in medical data classification.
o It can be used in real-time predictions because Naïve Bayes Classifier is an
eager learner.
o It is used in Text classification such as Spam filtering and Sentiment
analysis.

Types of Naïve Bayes Model:

There are three types of Naive Bayes Model, which are given below:

o Gaussian: The Gaussian model assumes that features follow a normal
distribution. This means that if predictors take continuous values instead of
discrete ones, the model assumes that these values are sampled from a
Gaussian distribution.
o Multinomial: The Multinomial Naïve Bayes classifier is used when the data
is multinomially distributed. It is primarily used for document classification
problems, that is, deciding which category (such as Sports, Politics,
Education, etc.) a particular document belongs to.
The classifier uses the frequency of words as the predictors.
o Bernoulli: The Bernoulli classifier works similarly to the Multinomial
classifier, but the predictor variables are independent Boolean variables,
such as whether a particular word is present in a document or not. This model
is also well known for document classification tasks.
Limitations of Naïve Bayes
• Assumes that all the features are independent, which is highly unlikely in
practical scenarios.
• The basic categorical form is unsuitable for continuous numerical data; such
features must be discretized or handled with a variant such as the Gaussian
model described above.
• The features used at prediction time must be the same attributes that were
used during training for the algorithm to make correct predictions.
• Encounters the ‘Zero Frequency’ problem: if a categorical variable has a
category in the test dataset that wasn’t included in the training dataset, the
model will assign it a probability of 0, which forces the whole product of
likelihoods to 0, so the model is unable to make a prediction. This problem can
be resolved using smoothing techniques such as Laplace (add-one) smoothing.
• Computationally expensive when used to classify a large number of items.
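
Although these notes do not cover smoothing in detail, the idea behind the simplest variant, Laplace (add-one) smoothing, can be sketched briefly. The counts below are hypothetical and the function name `smoothed_likelihood` is our own; the formula adds a pseudo-count `alpha` to every category so that no likelihood is ever exactly zero.

```python
def smoothed_likelihood(count, class_total, n_categories, alpha=1.0):
    """P(feature value | class) with add-alpha (Laplace) smoothing."""
    return (count + alpha) / (class_total + alpha * n_categories)

# Hypothetical training counts for one feature within one class:
# 8 examples in total, spread over 3 possible categories, one never seen.
counts = {"Sunny": 5, "Overcast": 3, "Rainy": 0}
class_total = sum(counts.values())

unsmoothed = {k: c / class_total for k, c in counts.items()}
smoothed = {
    k: smoothed_likelihood(c, class_total, len(counts))
    for k, c in counts.items()
}

for k in counts:
    print(f"{k}: unsmoothed = {unsmoothed[k]:.3f}, smoothed = {smoothed[k]:.3f}")
```

The unseen category moves from a likelihood of 0 to a small positive value, and the smoothed values still sum to 1 across the categories.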

Maximum Likelihood and Maximum A Posteriori (MAP)

Figure 1. Equation of the Bayes Rule: $P(H \mid e) = \frac{P(e \mid H)\,P(H)}{P(e)}$, or equivalently $P(\Theta \mid x) = \frac{P(x \mid \Theta)\,P(\Theta)}{P(x)}$

Here, x and e denote the given dataset (the evidence), and H and 𝚯 denote the
parameter (the hypothesis). In other words, x = e and H = 𝚯 in the above figure.

The image below explains the difference between the probability and the
likelihood.

Figure 2. from [1], probability vs likelihood

Probability vs Likelihood
You can estimate the probability of an event using the function that describes the
probability distribution and its parameters. For example, you can estimate the
outcome of a fair coin flip by using the Bernoulli distribution and the probability of
success 0.5. In this ideal case, you already know how the data is distributed.
But the real world is messy. Often you don’t know the exact parameter values, and
you may not even know the probability distribution that describes your specific use
case. Instead, you have to estimate the function and its parameters from the data.
The likelihood describes the relative evidence that the data has a particular
distribution and its associated parameters.
We can describe the likelihood as a function of an observed value of the data x and
the distribution’s unknown parameter θ:

$L(\theta \mid x) = P(x \mid \theta)$

In short, when estimating the probability, you go from a distribution and its
parameters to the event.

When estimating the likelihood, you go from the data to the distribution and its
parameters.

To make this more concrete, let’s calculate the likelihood for a coin flip.
Recall that a coin flip is a Bernoulli trial, which can be described by the following
function, where x = 1 denotes heads and x = 0 denotes tails:

$P(X = x) = p^{x}(1 - p)^{1 - x}, \quad x \in \{0, 1\}$

The probability p is a parameter of the function. To be consistent with the likelihood
notation, we write down the formula for the likelihood function with theta instead of
p:

$L(\theta \mid x) = \theta^{x}(1 - \theta)^{1 - x}$

Now, we need a hypothesis about the parameter theta. We assume that the coin is
fair, so the probability of obtaining heads is 0.5. This is our hypothesis A.
Let’s say we throw the coin 3 times. It comes up heads the first 2 times and tails
the last time. What is the likelihood of hypothesis A given the data?

First, we calculate the likelihood that hypothesis A is true and the coin is fair.
We plug our parameter and our outcomes into the probability function, once for each
toss, and multiply the results:

$L(\theta = 0.5 \mid HHT) = 0.5 \cdot 0.5 \cdot (1 - 0.5) = 0.125$

Likelihood Ratios
Once you’ve calculated the likelihood, you have a hypothesis that your data has a
specific set of parameters. The likelihood is your evidence for that hypothesis. To
pick the hypothesis with the maximum likelihood, you have to compare your
hypothesis to another by calculating the likelihood ratios.
Since your 3 coin tosses yielded two heads and one tail, you hypothesize that the
probability of getting heads is actually 2/3. This is your hypothesis B.
Let’s repeat the previous calculation for B, with a probability of 2/3 for the same
three coin tosses:

$L(\theta = 2/3 \mid HHT) = \frac{2}{3}\cdot\frac{2}{3}\cdot\frac{1}{3} = \frac{4}{27} \approx 0.148$

Given the evidence, hypothesis B seems more likely than hypothesis A.

In other words: given that 2 of our 3 coin tosses came up heads, it seems more
likely that the true probability of getting heads is 2/3. In fact, in the absence of
more data in the form of coin tosses, 2/3 is the most likely candidate for our true
parameter value. So hypothesis B gives us the maximum likelihood value.
We can express the relative likelihood of an outcome as a ratio of the likelihood for
our chosen parameter value θ to the maximum likelihood:

$R(\theta) = \frac{L(\theta \mid x)}{L(\hat{\theta} \mid x)}$

The relative likelihood that the coin is fair can be expressed as a ratio of the
likelihood that the true probability is 1/2 against the maximum likelihood that the
probability is 2/3:

$R(1/2) = \frac{(1/2)^{3}}{(2/3)^{2}(1/3)} = \frac{27}{32} \approx 0.84$

Dividing by the maximum value normalizes the likelihood to a scale with 1 as its
maximum. We can plot the different parameter values against their relative
likelihoods given the current data.
For three coin tosses with 2 heads, the plot would show the likelihood maximized
at 2/3.

What happens if we toss the coin a fourth time and it comes up tails? Now we’ve
had 2 heads and 2 tails, and the likelihood plot is maximized at 1/2.
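
The coin-toss calculations in this section can be checked numerically. The sketch below is illustrative only: it encodes heads as 1 and tails as 0, uses a hypothetical helper `bernoulli_likelihood`, and searches a simple grid of candidate values for theta rather than using any optimization library.

```python
def bernoulli_likelihood(theta, flips):
    """Likelihood of theta for a sequence of flips (1 = heads, 0 = tails)."""
    value = 1.0
    for x in flips:
        value *= theta if x == 1 else (1.0 - theta)
    return value

flips = [1, 1, 0]  # heads, heads, tails

like_a = bernoulli_likelihood(0.5, flips)     # hypothesis A: fair coin
like_b = bernoulli_likelihood(2 / 3, flips)   # hypothesis B: theta = 2/3
relative = like_a / like_b                    # relative likelihood of A

# Grid search over candidate parameter values between 0 and 1.
grid = [i / 1000 for i in range(1, 1000)]
best_three = max(grid, key=lambda t: bernoulli_likelihood(t, flips))
best_four = max(grid, key=lambda t: bernoulli_likelihood(t, flips + [0]))

print(f"L(A) = {like_a:.3f}, L(B) = {like_b:.3f}, ratio = {relative:.3f}")
print(f"Maximum after 3 tosses near theta = {best_three:.3f}")
print(f"Maximum after 4 tosses near theta = {best_four:.3f}")
```

The grid search plays the role of the likelihood plot: the peak sits near 2/3 after three tosses and moves to 1/2 once the fourth toss comes up tails.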

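Mathematically, we can denote maximum likelihood estimation as a function that
returns the theta maximizing the likelihood (the product form assumes the examples
are independent):

$\hat{\theta}_{ML} = \arg\max_{\theta} L(\theta \mid x) = \arg\max_{\theta} \prod_{i=1}^{n} P(x_i \mid \theta)$

Here x_1, ..., x_n are the examples drawn from the unknown data distribution that
we would like to approximate, and n is the number of examples.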
Log-Likelihood
For most practical applications, maximizing the log-likelihood is a better choice
because the logarithm reduces operations by one level: multiplications become
additions, powers become multiplications, and so on.

In computer-based implementations, this reduces the risk of numerical underflow
and generally makes the calculations simpler. Since the logarithm is monotonically
increasing, maximizing the log-likelihood is equivalent to maximizing the likelihood.
We distinguish the function for the log-likelihood from that of the likelihood by
using a lowercase l instead of a capital L.
The log-likelihood for n coin flips can be expressed in this formula:

$l(\theta \mid x) = \sum_{i=1}^{n}\left[x_i \log\theta + (1 - x_i)\log(1 - \theta)\right]$
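
The underflow problem mentioned above is easy to demonstrate. In the sketch below, the data set of 3000 flips is hypothetical, and `log_likelihood` is an illustrative helper. The last line uses the known closed-form result that the maximum likelihood estimate for a Bernoulli parameter is the fraction of heads.

```python
import math

def log_likelihood(theta, flips):
    """Log-likelihood of theta for a sequence of flips (1 = heads, 0 = tails)."""
    return sum(
        math.log(theta) if x == 1 else math.log(1.0 - theta) for x in flips
    )

# Hypothetical data: 3000 flips, of which 2000 are heads.
flips = [1, 1, 0] * 1000

# Multiplying 3000 probabilities directly underflows to exactly 0.0 ...
product = 1.0
for _ in flips:
    product *= 0.5   # likelihood of each flip under theta = 0.5

# ... while the sum of logarithms stays a perfectly ordinary number.
ll_fair = log_likelihood(0.5, flips)

# Closed-form maximum likelihood estimate for a Bernoulli parameter.
theta_mle = sum(flips) / len(flips)
ll_mle = log_likelihood(theta_mle, flips)

print(f"Direct product under theta = 0.5: {product}")
print(f"Log-likelihood under theta = 0.5: {ll_fair:.1f}")
print(f"MLE theta = {theta_mle:.3f}, log-likelihood = {ll_mle:.1f}")
```

With the raw product, every candidate parameter would look equally bad (all zeros), whereas the log-likelihood still ranks the candidates correctly.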

1. Maximum Likelihood Estimation

Maximum likelihood (ML) estimation is a method of estimating the parameters
of a probability distribution by maximizing a likelihood function.

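Therefore, the ML estimate is defined as below.

$\hat{\theta}_{ML} = \arg\max_{\theta} \prod_{i=1}^{n} P(x_i \mid \theta) = \arg\max_{\theta} \sum_{i=1}^{n} \log P(x_i \mid \theta)$

Figure 3. Derivation of the ML estimate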

In the case of a classification task with supervised learning, our dataset is
composed of pairs of data x and corresponding label y. This means the ML
estimation also needs to deal with the conditional probability of the model
(network) output y’ given the input data x.

$\hat{\theta}_{ML} = \arg\max_{\theta} \sum_{i=1}^{n} \log P(y_i \mid x_i; \theta)$

Figure 4. ML estimation for the conditional distribution

2. Maximum A Posteriori (MAP)

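An alternative estimator is the MAP estimator, which finds the parameter theta that
maximizes the posterior.

According to the Bayes rule, the posterior is proportional to the product of the
likelihood and the prior (the evidence in the denominator does not depend on
theta). The MAP estimator begins with this idea and is defined as below.

$\hat{\theta}_{MAP} = \arg\max_{\theta} P(\theta \mid x) = \arg\max_{\theta} P(x \mid \theta)\,P(\theta) = \arg\max_{\theta} \left[\log P(x \mid \theta) + \log P(\theta)\right]$

Figure 5. Derivation of the MAP estimate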

Just as ML can be generalized to the conditional probability distribution, so can
MAP.
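
The difference between the two estimators can be illustrated with the coin example used earlier. The sketch below rests on an assumption that is not part of these notes: it places a Beta(a, b) prior on theta, for which the MAP estimate has a well-known closed form (the mode of the Beta posterior, valid when the posterior parameters exceed 1). The function names are our own.

```python
def mle_bernoulli(heads, tails):
    """Maximum likelihood estimate of the heads probability."""
    return heads / (heads + tails)

def map_bernoulli(heads, tails, a, b):
    """MAP estimate under an assumed Beta(a, b) prior on theta.

    The posterior is Beta(a + heads, b + tails); its mode is returned.
    """
    return (heads + a - 1) / (heads + tails + a + b - 2)

heads, tails = 2, 1   # the three tosses from the likelihood example

theta_ml = mle_bernoulli(heads, tails)
theta_map_flat = map_bernoulli(heads, tails, a=1, b=1)   # uniform prior
theta_map_fair = map_bernoulli(heads, tails, a=5, b=5)   # prior favouring 0.5

print(f"ML estimate:                 {theta_ml:.3f}")
print(f"MAP with uniform prior:      {theta_map_flat:.3f}")
print(f"MAP with Beta(5, 5) prior:   {theta_map_fair:.3f}")
```

With a uniform prior the MAP estimate coincides with the ML estimate, while a prior concentrated around 0.5 pulls the estimate toward a fair coin when only three tosses have been observed.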

Bayesian Belief Network

A Bayesian belief network is a key computer technology for dealing with
probabilistic events and for solving problems that involve uncertainty. We can
define a Bayesian network as:

"A Bayesian network is a probabilistic graphical model which represents a set of
variables and their conditional dependencies using a directed acyclic graph."

It is also called a Bayes network, belief network, decision network, or Bayesian
model.

Bayesian networks are probabilistic, because these networks are built from
a probability distribution, and also use probability theory for prediction and
anomaly detection.

Real world applications are probabilistic in nature, and to represent the relationship
between multiple events, we need a Bayesian network. It can also be used in various
tasks including prediction, anomaly detection, diagnostics, automated insight,
reasoning, time series prediction, and decision making under uncertainty.

A Bayesian network can be used for building models from data and experts'
opinions, and it consists of two parts:

o Directed Acyclic Graph

o Table of conditional probabilities.

The generalized form of a Bayesian network that represents and solves decision
problems under uncertain knowledge is known as an Influence diagram.

A Bayesian network graph is made up of nodes and arcs (directed links),
where:

o Each node corresponds to a random variable, which can be continuous or
discrete.

o Arcs (directed arrows) represent causal relationships or conditional
dependencies between random variables. These directed links connect pairs of
nodes in the graph.
A link means that one node directly influences the other; if there is no
directed link between two nodes, neither directly influences the other.

o In the above diagram, A, B, C, and D are random variables represented by the
nodes of the network graph.

o If we are considering node B, which is connected with node A by a directed
arrow, then node A is called the parent of node B.

o Node C is independent of node A.

Note: The Bayesian network graph does not contain any directed cycle. Hence, it is
known as a directed acyclic graph or DAG.

The Bayesian network has mainly two components:

o Causal Component

o Actual numbers

Each node in the Bayesian network has a conditional probability distribution
P(Xi | Parents(Xi)), which determines the effect of the parents on that node.

A Bayesian network is based on the joint probability distribution and conditional
probability. So let's first understand the joint probability distribution:

Joint probability distribution:


If we have variables x1, x2, x3, ..., xn, then the probabilities of the different
combinations of x1, x2, x3, ..., xn are known as the joint probability distribution.

Using the chain rule, P[x1, x2, x3, ..., xn] can be written in terms of conditional
probabilities as follows:

= P[x1 | x2, x3, ..., xn] P[x2, x3, ..., xn]

= P[x1 | x2, x3, ..., xn] P[x2 | x3, ..., xn] ... P[xn-1 | xn] P[xn].

In a Bayesian network, each variable Xi is conditionally independent of its other
predecessors given its parents, so we can write:

P(Xi | Xi-1, ..., X1) = P(Xi | Parents(Xi))


Explanation of Bayesian network:
Let's understand the Bayesian network through an example by creating a directed
acyclic graph:

Example: Harry installed a new burglar alarm at his home to detect burglary. The
alarm responds reliably to a burglary but also responds to minor earthquakes.
Harry has two neighbors, David and Sophia, who have taken responsibility for
informing Harry at work when they hear the alarm. David always calls Harry when he
hears the alarm, but sometimes he confuses the phone ringing with the alarm and
calls then too. On the other hand, Sophia likes to listen to loud music, so
sometimes she fails to hear the alarm. Here we would like to compute a probability
involving the burglar alarm.

Problem:

Calculate the probability that the alarm has sounded, but neither a burglary
nor an earthquake has occurred, and both David and Sophia have called Harry.

Solution:

o The Bayesian network for the above problem is given below. The network
structure shows that Burglary and Earthquake are the parent nodes of Alarm and
directly affect the probability of the alarm going off, whereas David's and
Sophia's calls depend only on the alarm.

o The network represents our assumptions that the neighbors do not directly
perceive the burglary, do not notice the minor earthquake, and do not confer
before calling.

o The conditional distribution for each node is given as a conditional
probability table, or CPT.

o Each row in the CPT must sum to 1 because all the entries in the row
represent an exhaustive set of cases for the variable.

o In a CPT, a boolean variable with k boolean parents contains 2^k independently
specifiable probabilities. Hence, if there are two parents, the CPT will contain
4 probability values.

List of all events occurring in this network:

o Burglary (B)

o Earthquake(E)

o Alarm(A)

o David Calls(D)

o Sophia calls(S)

We can write the events of the problem statement in the form of a probability,
P[D, S, A, B, E], and rewrite it using the chain rule and the structure of the
network:

P[D, S, A, B, E] = P[D | S, A, B, E]. P[S, A, B, E]

= P[D | S, A, B, E]. P[S | A, B, E]. P[A, B, E]

= P[D | A]. P[S | A, B, E]. P[A, B, E]

= P[D | A]. P[S | A]. P[A | B, E]. P[B, E]

= P[D | A]. P[S | A]. P[A | B, E]. P[B | E]. P[E]

= P[D | A]. P[S | A]. P[A | B, E]. P[B]. P[E] (since Burglary and Earthquake have
no parents and are independent of each other)

Let's take the observed probability for the Burglary and earthquake component:

P(B= True) = 0.002, which is the probability of a burglary.

P(B= False)= 0.998, which is the probability of no burglary.

P(E= True)= 0.001, which is the probability of a minor earthquake.

P(E= False)= 0.999, which is the probability that an earthquake has not occurred.

We can provide the conditional probabilities as per the below tables:

Conditional probability table for Alarm A:

The conditional probability of Alarm A depends on Burglary and Earthquake:

B        E        P(A= True)    P(A= False)

True     True     0.94          0.06

True     False    0.95          0.05

False    True     0.31          0.69

False    False    0.001         0.999

Conditional probability table for David Calls:

The conditional probability that David calls depends on whether the alarm has
sounded.

A        P(D= True)    P(D= False)

True     0.91          0.09

False    0.05          0.95

Conditional probability table for Sophia Calls:

The conditional probability that Sophia calls depends on its parent node
"Alarm."

A        P(S= True)    P(S= False)

True     0.75          0.25

False    0.02          0.98

From the formula of joint distribution, we can write the problem statement in the
form of probability distribution:

P(S, D, A, ¬B, ¬E) = P (S|A) *P (D|A)*P (A|¬B ^ ¬E) *P (¬B) *P (¬E).

= 0.75* 0.91* 0.001* 0.998*0.999

= 0.00068045.

Hence, a Bayesian network can answer any query about the domain by using
Joint distribution.
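
The factorized joint distribution of the alarm network can be written out directly in code. The sketch below stores, for each node, the probability of the value True given its parents, using the numbers from the tables above; the dictionary and function names are illustrative only.

```python
from itertools import product

# Priors for the parentless nodes.
P_B = {True: 0.002, False: 0.998}
P_E = {True: 0.001, False: 0.999}

# P(A = True | B, E), P(D = True | A) and P(S = True | A) from the CPTs.
P_A_TRUE = {
    (True, True): 0.94,
    (True, False): 0.95,
    (False, True): 0.31,
    (False, False): 0.001,
}
P_D_TRUE = {True: 0.91, False: 0.05}
P_S_TRUE = {True: 0.75, False: 0.02}

def prob(p_true, value):
    """Probability of a boolean value, given the probability of True."""
    return p_true if value else 1.0 - p_true

def joint(d, s, a, b, e):
    """P(D, S, A, B, E) = P(D|A) P(S|A) P(A|B,E) P(B) P(E)."""
    return (
        prob(P_D_TRUE[a], d)
        * prob(P_S_TRUE[a], s)
        * prob(P_A_TRUE[(b, e)], a)
        * P_B[b]
        * P_E[e]
    )

# The query from the problem statement: alarm sounded, both neighbors called,
# no burglary and no earthquake.
p_query = joint(d=True, s=True, a=True, b=False, e=False)
print(f"P(S, D, A, not B, not E) = {p_query:.8f}")

# Sanity check: the joint probabilities of all 32 assignments sum to 1.
total = sum(joint(*values) for values in product([True, False], repeat=5))
print(f"Sum over all assignments = {total:.6f}")
```

Any other query about the five variables can be answered in the same way, by summing `joint` over the assignments that match the query.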

The semantics of Bayesian Network:

There are two ways to understand the semantics of the Bayesian network, which are
given below:

1. To understand the network as a representation of the joint probability
distribution.

This view is helpful for understanding how to construct the network.

2. To understand the network as an encoding of a collection of conditional
independence statements.

This view is helpful in designing inference procedures.

Challenge of Probabilistic Modeling

Probabilistic models can be challenging to design and use.

Most often, the problem is the lack of information about the domain required to fully
specify the conditional dependence between random variables. If available,
calculating the full conditional probability for an event can be impractical.

A common approach to addressing this challenge is to add some simplifying
assumptions, such as assuming that all random variables in the model are
conditionally independent. This is a drastic assumption, although it proves useful in
practice, providing the basis for the Naive Bayes classification algorithm.

An alternative approach is to develop a probabilistic model of a problem with some
conditional independence assumptions. This provides an intermediate approach
between a fully conditional model and a fully conditionally independent model.

Bayesian belief networks are one example of a probabilistic model where some
variables are conditionally independent.

Thus, Bayesian belief networks provide an intermediate approach that is less
constraining than the global assumption of conditional independence made by the
naive Bayes classifier, but more tractable than avoiding conditional independence
assumptions altogether.

Probability Density Function

Probability Density:
Assume a random variable x that has a probability distribution p(x). The
relationship between the outcomes of a random variable and their probabilities is
referred to as the probability density.
The problem is that we don’t always know the full probability distribution for a
random variable, because we usually observe only a small subset of its possible
outcomes. Estimating the density of the whole sample space from such a random
sample of observations is referred to as Probability Density Estimation.

Probability Density Function (PDF)


A PDF is a function that gives the probability of a continuous random variable
falling within a particular range of values, rather than taking on one exact value.
When fitted to a random sample, it describes how likely each range of values is for
the whole sample space.

By definition, if X is any continuous random variable, then the function f(x) is
called a probability density function if:

$P(a \le X \le b) = \int_a^b f(x)\,dx$, with $f(x) \ge 0$ for all x and $\int_{-\infty}^{\infty} f(x)\,dx = 1$

where,

a -> lower limit
b -> upper limit
X -> continuous random variable
f(x) -> probability density function
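
As a small numerical illustration, the probability of X falling between a and b is the area under f(x) between those limits. The sketch below approximates that area for a standard normal density with the trapezoidal rule; the helper names and the choice of a normal density are assumptions made for the example.

```python
import math


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def probability_between(pdf, a, b, steps=10_000):
    """Approximate the integral of pdf from a to b with the trapezoidal rule."""
    width = (b - a) / steps
    total = 0.5 * (pdf(a) + pdf(b))
    for i in range(1, steps):
        total += pdf(a + i * width)
    return total * width


# Probability that a standard normal variable lies between -1.96 and 1.96.
p = probability_between(normal_pdf, -1.96, 1.96)
print(round(p, 3))

# The total area under the density is (approximately) 1.
total_area = probability_between(normal_pdf, -8.0, 8.0)
print(round(total_area, 3))
```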


Steps Involved:
Step 1 - Create a histogram for the random set of observations to understand the
density of the random sample.

Step 2 - Create the probability density function and fit it on the random sample.
Observe how it fits the histogram plot.

Step 3 - Now iterate steps 1 and 2 in the following manner:

3.1 - Calculate the distribution parameters.
3.2 - Calculate the PDF for the random sample distribution.
3.3 - Observe the resulting PDF against the data.
3.4 - Transform the data until it best fits the distribution.

After fitting, the histograms of most random samples should match the
histogram plot of the whole population.
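
Step 1 can be sketched without any plotting library: count how many observations fall in each bin and divide by (number of samples × bin width), so that the bar areas sum to 1. The sample below is synthetic (drawn with an assumed mean of 50 and standard deviation of 5) and the function name is illustrative.

```python
import random


def histogram_density(samples, n_bins):
    """Return bin edges and per-bin density values (bar areas sum to 1)."""
    lo, hi = min(samples), max(samples)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in samples:
        # The maximum value is placed in the last bin.
        idx = min(int((x - lo) / width), n_bins - 1)
        counts[idx] += 1
    n = len(samples)
    edges = [lo + i * width for i in range(n_bins + 1)]
    densities = [c / (n * width) for c in counts]
    return edges, densities


random.seed(0)
samples = [random.gauss(50.0, 5.0) for _ in range(1000)]  # synthetic data
edges, densities = histogram_density(samples, n_bins=10)

for left, right, d in zip(edges, edges[1:], densities):
    print(f"[{left:6.2f}, {right:6.2f})  {'#' * int(d * 400)}")
```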

Density Estimation: It is the process of finding out the density of the whole
population by examining a random sample of data from that population. One of the
simplest ways to get a first density estimate is by using a histogram plot.

Parametric Density Estimation
A normal distribution has two parameters, the mean and the standard deviation. We
calculate the sample mean and standard deviation of the random sample taken from
the population and use them as the parameters of the estimated density. It is
termed ‘parametric’ because the relation between the observations and their
probability is fully determined by the values of these two parameters.


Now, it is important to understand that the mean and standard deviation of this
random sample will generally not be exactly the same as those of the whole
population, because the sample is small. A sample plot for parametric density
estimation is shown below.

PDF fitted over histogram plot with one peak value
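
A minimal sketch of parametric density estimation, assuming the data are roughly normal: estimate the mean and standard deviation from the sample and use them as the parameters of the fitted PDF. The sample here is synthetic and its true parameters (50 and 5) are assumptions for the example; `statistics.NormalDist` is part of the Python standard library.

```python
import random
from statistics import NormalDist

random.seed(1)
sample = [random.gauss(50.0, 5.0) for _ in range(1000)]  # synthetic sample

# Estimate the two parameters from the sample and build the fitted density.
fitted = NormalDist.from_samples(sample)
print("estimated mean:", round(fitted.mean, 2))
print("estimated standard deviation:", round(fitted.stdev, 2))

# Evaluate the fitted PDF at a few points.
for x in (40.0, 50.0, 60.0):
    print(x, round(fitted.pdf(x), 4))
```

The estimates are close to, but not exactly equal to, the parameters used to generate the data, which is the point made above about sample versus population values.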

Nonparametric Density Estimation


In some cases, the PDF may not fit the random sample because the sample doesn’t
follow a normal distribution (i.e. instead of one peak there are multiple peaks in
the graph). Here, instead of using distribution parameters like mean and standard
deviation, a particular algorithm is used to estimate the probability distribution.
Thus, it is known as ‘nonparametric density estimation’.
One of the most common nonparametric approaches is known as Kernel Density
Estimation (KDE). In this, the objective is to calculate the unknown density f_h(x)
using the equation given below:

$\hat{f}_h(x) = \frac{1}{n} \sum_{i=1}^{n} K_h(x - x_i) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)$

where,

K -> kernel (non-negative function)
h -> bandwidth (smoothing parameter, h > 0)
Kh -> scaled kernel, K_h(u) = K(u/h) / h
f_h(x) -> density (to calculate)
n -> no. of samples in random sample
x_i -> i-th observation in the random sample
A sample plot for nonparametric density estimation is given below.

PDF plot over sample histogram plot based on KDE
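
Kernel density estimation can be sketched in a few lines: place a kernel on every observation, add the kernels up and divide by n·h. The example below assumes a Gaussian kernel, a bandwidth of 0.5 and a small made-up sample with two clusters; all three are illustrative choices.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used here as the kernel K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kde(x, samples, h):
    """Kernel density estimate at x: (1 / (n * h)) * sum of K((x - x_i) / h)."""
    n = len(samples)
    return sum(gaussian_kernel((x - xi) / h) for xi in samples) / (n * h)


# Made-up sample with two clusters (two peaks).
samples = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]

for x in (1.0, 3.0, 5.0):
    print(x, round(kde(x, samples, h=0.5), 4))
```

The estimate is high near both clusters and low in the gap between them, which a single fitted normal curve could not capture. A smaller h gives a spikier estimate and a larger h a smoother one.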


Problems with Probability Distribution Estimation
Probability Distribution Estimation relies on finding the best PDF and determining
its parameters accurately. But the random data sample that we consider is usually
very small. Hence, it becomes very difficult to determine which probability
distribution function and which parameters to use. To tackle this problem,
Maximum Likelihood Estimation is used.

Maximum Likelihood Estimation


It is a method of determining the parameters (mean, standard deviation, etc.) of
the distribution assumed for the random sample data (for example, a normal
distribution), or in other words a method of finding the best-fitting PDF over the
random sample data. This is done by maximizing the likelihood function, so that the
fitted PDF makes the observed sample as probable as possible. Another way to look
at it is that MLE gives the mean and standard deviation under which the random
sample is most plausible, and these serve as estimates of the parameters of the
whole population.
NOTE: MLE treats every candidate PDF as a possible best-fitting curve and searches
among them. Hence, it can be a computationally expensive method.

Intuition:

Fig 1 : MLE Intuition


Fig 1 shows multiple attempts at fitting the PDF bell curve over the random sample
data. Red bell curves indicate poorly fitted PDFs and the green bell curve shows the
best-fitting PDF over the data. We obtained the optimum bell curve by checking
the values in the Maximum Likelihood Estimate plot corresponding to each PDF.

As observed in Fig 1, the red plots fit the data poorly, hence their ‘likelihood
estimate’ is also lower. The green PDF curve has the maximum likelihood estimate
as it fits the data best. This is how the maximum likelihood estimate method works.
Mathematics Involved
In the intuition, we discussed the role that Likelihood value plays in determining
the optimum PDF curve. Let us understand the math involved in MLE method.

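We calculate Likelihood based on conditional probabilities. Assuming the
observations are independent, see the equation given below.

$L = F(x_1, x_2, \dots, x_n; \theta) = P(X_1 = x_1, X_2 = x_2, \dots, X_n = x_n; \theta) = \prod_{i=1}^{n} F(x_i; \theta)$

where,

L -> Likelihood value
F -> Probability distribution function
P -> Probability
θ -> parameters of the distribution (e.g. mean and standard deviation)
X1, X2, ... Xn -> random sample of size n taken from the whole population.
x1, x2, ... xn -> values that these random sample (Xi) takes when determining the PDF.
Π -> product from 1 to n.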
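
The likelihood can be sketched for a normal distribution as follows. Because a product of many small densities underflows quickly, the sketch works with the log-likelihood (a sum of log-densities), which has its maximum at the same parameter values. The data values and function name are made up for the example; for a normal distribution the maximizing parameters are the sample mean and the population-style standard deviation (dividing by n).

```python
import math
from statistics import fmean, pstdev


def normal_log_likelihood(data, mu, sigma):
    """Sum over the sample of log f(x_i; mu, sigma) for a normal density f."""
    return sum(
        -0.5 * math.log(2.0 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2.0 * sigma ** 2)
        for x in data
    )


data = [4.8, 5.1, 5.3, 4.9, 5.6, 5.0, 4.7]  # made-up observations

# Closed-form maximum likelihood estimates for a normal distribution.
mu_hat = fmean(data)
sigma_hat = pstdev(data)
best = normal_log_likelihood(data, mu_hat, sigma_hat)

# Any other candidate curve has a lower (log-)likelihood.
for mu, sigma in [(mu_hat + 0.2, sigma_hat), (mu_hat, 1.5 * sigma_hat), (4.0, 1.0)]:
    print((round(mu, 2), round(sigma, 2)), round(normal_log_likelihood(data, mu, sigma), 3))
print("MLE:", (round(mu_hat, 2), round(sigma_hat, 2)), round(best, 3))
```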


Sequence Models

Sequence models are machine learning models that take sequences of data as input
or produce them as output. Sequential data includes text streams, audio clips,
video clips, time-series data, etc. Recurrent Neural Networks (RNNs) are a popular
class of models used for sequences.

Applications of Sequence Models


1. Speech recognition: In speech recognition, an audio clip is given as input and
the model has to generate its text transcript. Here both the input and output are
sequences of data.

2. Sentiment Classification: In sentiment classification, the opinions expressed in a
piece of text are categorized. Here the input is a sequence of words.

3. Video Activity Recognition: In video activity recognition, the model needs to
identify the activity in a video clip. A video clip is a sequence of video frames,
so here the input is a sequence of data.

Video Activity Recognition (Source: Author)

These examples show that there are different applications of sequence models.
Sometimes both the input and output are sequences; in other cases only the input or
only the output is a sequence. The recurrent neural network (RNN) is a popular
sequence model that performs well on sequential data.

Different Sequential Models


RNN and its Variants Based Models
RNN stands for Recurrent Neural Network, a deep learning architecture (a type of
artificial neural network) that is suited to sequential data processing. RNNs are
frequently used in Natural Language Processing (NLP). Because RNNs have
internal memory, they are especially useful for machine learning applications
that need sequential input. Time series data can also be forecast using RNNs.

The key difference between RNNs and conventional neural networks is that the
parameters (weights) in standard neural networks are not shared across positions
in the input, whereas in an RNN the same weights are shared over time. RNNs can
recall their prior inputs, whereas standard neural networks cannot. For
computation, an RNN uses information from earlier time steps.


The different tasks that can be handled using RNNs are as follows:

One-to-one

With one input and one output, this is the classic feed-forward neural network
architecture.

One-to-many

This is used for tasks such as image captioning. We have one fixed-size image as
input, and the output can be words or phrases of varying lengths.

Many-to-one

This is used for sentiment classification. A sequence of words, or even paragraphs
of words, is expected as input. The result can be a continuous-valued regression
output that represents the likelihood of a favourable attitude.


Many-to-many

This paradigm is suitable for machine translation, such as that seen on Google
Translate. The input could be a variable-length English sentence, and the output
could be a variable-length sentence in a different language. A second
many-to-many variant, with one output per input step, can be used for video
classification on a frame-by-frame basis.

Sequence Modelling

Sequence Modelling is the ability of a computer program to model, interpret, make
predictions about or generate any type of sequential data, such as audio, text etc.
For example, a computer program that can take a piece of text in English and
translate it to French is a Sequence Modelling program (because the type of data
being dealt with is text, which is sequential in nature). An AI algorithm called the
Recurrent Neural Network is a specialized form of the classic Artificial Neural
Network (Multi-Layer Perceptron) that is used to solve Sequence Modelling
problems. Recurrent Neural Networks are like Artificial Neural Networks that have
loops in them. This means that the activation of each neuron or cell depends not
only on the current input to it but also on its previous activation values.

The architecture of an RNN is also inspired by the human brain. As we read any
essay, we are able to interpret the sentence we are currently reading better
because of the information we gained from previous sentences of the essay.
Similarly, we can understand the conclusion of a novel only if we have read the
beginning and middle of the novel. The same logic follows for audio as well. On
a basic level, interpreting a certain part of a sequence requires information
gained from the previous parts of the sequence. Thus, in a human brain,
information that persists in our memory while interpreting sequential data is
vital in understanding each part of the sequence. Similarly, RNNs try to
incorporate this capacity for memory by updating something called the “state”
of their cells each time we move from one part of a sequence to another. The state
of a cell is basically the total information gained by it so far by reading the
sequence. So, the current state or knowledge of a cell in an RNN depends not only
on the current word or sentence it is reading, but also on all the other words or
sentences it has read before the current one. Hence the name Recurrent Neural
Network. (Classic ANNs do not have this mechanism of memory. An ANN neuron’s
current state depends only on the current input, as it discards information about
the previous inputs to the cell.)


• The first image above illustrates a recurrent neuron or cell. It is a simple
neuron that has a loop. It takes some input x and gives some output h.

• This neuron can be thought of as multiple copies of the same unit or cell
chained together. This is illustrated by the second image, which shows an
“unrolled” form of the recurrent neuron. Each copy or unit passes a
message (some information) to the next copy.

In Recurrent Neural Networks, there is a concept of time steps. This means that
the recurrent cells or units take inputs from a sequence one by one. Each step at
which the cell picks up an input is called a time step. For example, if we have a
sequence of words that form a sentence, such as “It’s a sunny day.”, our recurrent
cell will take the word “It’s” as its input at the first time step. It then stores
information about the word “It’s” in its memory and updates its state. Next, it takes
the word “a” as its second input at the second time step. It incorporates
information about the word “a” into its memory and updates its state once again. It
repeats the process until the last word. Therefore, the cell state at the 1st time step
depends only on the 1st input, the cell state at the 2nd time step depends on the 1st
and 2nd inputs, the cell state at the 3rd time step depends on the 1st, 2nd and 3rd
inputs and so on. In this way the cell continuously updates its memory as time
passes (similar to a human brain).

Relating the previous paragraph to the images above, we can say that $x_1$, $x_2$,
$x_3$ and so on are the inputs to the recurrent cell at the 1st, 2nd, 3rd and
subsequent time steps. At each time step, the recurrent cell updates its state based
on the current input, gives an output vector h and then moves on to the next time
step. This is demonstrated in the “unrolled” RNN diagram above.

Therefore, we need 2 separate weight matrices to calculate the current state of the
recurrent cell, and the same two matrices are reused at every time step. One matrix
W and another matrix U are used. Matrix W is multiplied by the current input and
matrix U is multiplied by the previous state of the cell (at the previous time step),
and the two products are added. A bias vector b can be added to the sum. Then, the
whole sum can be passed through an activation function f like ReLU, Tanh or
Sigmoid to form the new updated state of the cell (the activation function is used
to introduce non-linearity into the network so that it can fit more complex
functions). So, the update formula can be written as:

$h_t = f(W \cdot x_t + U \cdot h_{t-1} + b)$, where $h_t$ is the cell state at time
step t, $h_{t-1}$ is the cell state at the previous time step and $x_t$ is the cell
input at time step t.
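
The update described above (W times the current input, plus U times the previous state, plus a bias, passed through an activation) can be sketched in plain Python. Tanh is chosen as the activation, and the sizes, weight values and input sequence are arbitrary assumptions for illustration; in a real network the weights would be learned.

```python
import math


def rnn_step(x_t, h_prev, W, U, b):
    """One recurrent update: new state = tanh(W x_t + U h_prev + b)."""
    h_t = []
    for i in range(len(b)):
        total = b[i]
        total += sum(W[i][j] * x_t[j] for j in range(len(x_t)))        # input term
        total += sum(U[i][k] * h_prev[k] for k in range(len(h_prev)))  # memory term
        h_t.append(math.tanh(total))
    return h_t


# Hypothetical parameters: 2 inputs, 2 state units.
W = [[0.5, -0.2], [0.1, 0.3]]
U = [[0.4, 0.0], [-0.3, 0.2]]
b = [0.0, 0.1]

sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # x_1, x_2, x_3
h = [0.0, 0.0]                                   # initial state
states = []
for x_t in sequence:
    h = rnn_step(x_t, h, W, U, b)  # the same W, U and b are reused at every step
    states.append(h)
    print([round(v, 4) for v in h])
```

Because each state is computed from the previous one, the state after the third input depends on all three inputs.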


The RNN

Many such Recurrent Neurons stacked one on top of the other (which may include
some Densely Connected Layers at the end) form a Deep Recurrent Neural
Network or DRNN.

• A Deep Recurrent Neural Network. The outputs of the lower layers are
fed as inputs to the upper layers (at each time step). For example, in the
above figure, the output of the lowest layer at time step $t - 1$
is fed as input to the middle layer at the same time step $t - 1$.
With multiple recurrent units stacked one on top of the other, a DRNN
can learn more complex patterns in sequential data.


The outputs from one recurrent unit at each time step can be fed as input to the next
unit at the same time step. This forms a deep sequential model that can model a
larger range of more complex sequences than a single recurrent unit.

Long Term Dependencies

Recurrent Neural Networks face the problem of long term dependencies very often.
On many occasions in sequence modelling problems, we need information from
long ago to make predictions about the next term/s in a sequence. For example,
suppose we want to find the next word in the sentence “I grew up in Spain and I am
very familiar with the traditions and customs of …..”. To predict the next word
(which seems to be Spain), we need to have information about the word “Spain”,
which is just the 5th word in the sentence. But we need to predict the 17th word in
the sentence. This is a large time gap, and RNNs are prone to losing information
given to them many time steps back. In practice, plain RNNs struggle to capture
these long term dependencies.

Long Short Term Memory Networks

A special type of RNN called an LSTM Network was created to address the problem
of long term dependencies. The constituent cells of an LSTM network each have
their own system of gates that decide what information, and how much of it,
from the sequence (text or audio) is stored in the cell’s state and how much is
discarded at each time step. These gates regulate the state of the cell more
effectively and help the cell retain information that it gained long ago. These
systems of gates are parametrized by weight matrices and bias vectors, which are
trained using the backpropagation algorithm.


LSTM is a modification to the RNN hidden layer. LSTM has enabled RNNs to
remember their inputs over a long period of time. In LSTM, in addition to the
hidden state, a cell state is passed to the next time step.

Internal structure of basic RNN and LSTM unit (Source: stanford.edu)

LSTM can capture long-range dependencies. It can retain memory of previous
inputs for extended durations. There are 3 gates in an LSTM cell, and memory
manipulations in LSTM are done using these gates. Long short-term memory
(LSTM) utilizes gates to control the gradient propagation in the recurrent network’s
memory.

• Forget Gate: The forget gate removes information that is no longer useful
from the cell state.

• Input Gate: The input gate adds useful new information to the cell state.

• Output Gate: The output gate decides which part of the cell state is passed on
as the output (hidden state) at the current time step.


This gating mechanism of LSTM has allowed the network to learn the conditions
for when to forget, ignore, or keep information in the memory cell.
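
A minimal sketch of one LSTM step with scalar input, hidden state and cell state, assuming the commonly used formulation (sigmoid gates, tanh candidate value). The parameter names in the dictionary and all numeric values are hypothetical and hand-picked to show the gating effect; a trained network would learn them.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM update for scalar input, hidden state and cell state."""
    f = sigmoid(p["wf"] * x_t + p["uf"] * h_prev + p["bf"])    # forget gate
    i = sigmoid(p["wi"] * x_t + p["ui"] * h_prev + p["bi"])    # input gate
    o = sigmoid(p["wo"] * x_t + p["uo"] * h_prev + p["bo"])    # output gate
    g = math.tanh(p["wg"] * x_t + p["ug"] * h_prev + p["bg"])  # candidate value
    c_t = f * c_prev + i * g       # keep part of the old memory, add new information
    h_t = o * math.tanh(c_t)       # expose part of the memory as output
    return h_t, c_t


def make_params(bf, bi):
    """Hypothetical parameters: gates driven by their biases only."""
    p = {name: 0.0 for name in ("wf", "uf", "wi", "ui", "wo", "uo", "bo", "bg")}
    p.update({"wg": 1.0, "ug": 0.5, "bf": bf, "bi": bi})
    return p


# Forget gate almost fully open, input gate almost closed: memory is kept.
keep_params = make_params(bf=10.0, bi=-10.0)
h, c = 0.0, 1.0
for x_t in [0.3, -0.7, 0.1] * 10:   # 30 time steps
    h, c = lstm_step(x_t, h, c, keep_params)
print("cell state after 30 steps:", round(c, 4))

# Forget gate almost closed: the stored value is dropped in one step.
forget_params = make_params(bf=-10.0, bi=-10.0)
_, c_forgotten = lstm_step(0.3, 0.0, 1.0, forget_params)
print("cell state after forgetting:", round(c_forgotten, 4))
```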

Markov Models

Markov Chains appear in many areas: Physics, Genetics, Finance and of course
Data Science and Machine Learning. As a Data Scientist, you have probably seen
the word ‘Markov’ come up a few times in your research or general reading.
It is a quintessential statistical technique in Natural Language Processing and
Reinforcement Learning.

Markov Property

For any modelling process to be considered Markov/Markovian it has to satisfy
the Markov Property. This property states that the probability of the next state
depends only on the current state; everything before the current state is irrelevant.
In other words, the whole system is completely memoryless.

Mathematically this is written as:

$P(X_{n+1} = s_{n+1} \mid X_n = s_n, X_{n-1} = s_{n-1}, \dots, X_0 = s_0) = P(X_{n+1} = s_{n+1} \mid X_n = s_n)$

where n is the time step parameter and X is a random variable that takes
on a value in a given state space s. The state space refers to all the possible
outcomes of an event. For example, a coin flip has two values in its state space:
s = {Heads, Tails}, and the probability of transitioning from one state to the other
is 0.5.


Markov Chain
A process that satisfies the Markov Property is known as a Markov Process. If the
state space is finite and we use discrete time-steps, this process is known as a
Markov Chain. In other words, it is a sequence of random variables that take on
states in the given state space.
Here we consider time-homogeneous discrete-time Markov Chains, as they are the
easiest to work with and to build an intuition for. There also exist
time-inhomogeneous Markov Chains, where the transition probability between states
is not fixed and varies with time.

Shown below is an example Markov Chain with state space {A,B,C}. The
numbers on the arrows indicate the probability of transitioning between those two
states.


For example, if you want to go from state B to C, then this transition has a 20%
chance. Mathematically we are working out the following:

$P(X_{n+1} = C \mid X_n = B) = 0.2$

Probability Transition Matrix

We can simplify and generalize these transitions by constructing
a probability transition matrix for our given Markov Chain. The transition matrix
has rows i and columns j, so the (i, j) entry gives the probability of
transitioning from i to j:

$P_{ij} = P(X_{n+1} = j \mid X_n = i)$

The transition matrix for our above Markov Chain is:

So the entry in row B, column A tells us that the probability of transition from
B to A is 0.5. This agrees with the result we have in our Markov Chain diagram above.

Markov Chains are a series of transitions in a finite state space in discrete
time, where the probability of transition depends only on the current state. The
system is completely memoryless.

The Transition Matrix displays the probability of transitioning between
states in the state space. The Chapman-Kolmogorov Equations tell us how to
compute the multi-step transition probabilities.
Next, we discuss what happens to the Transition Matrix when we
take a large number of discrete time steps. In other words, we describe how
the Markov Chain develops as time tends to infinity.

Each cell of the Transition Matrix conveys the probability of transitioning from
state i to state j under the Markov Property. This matrix, however, is only for
one-step transitions. What if we wanted to go from i to j in two steps?

This problem is solved by the Chapman-Kolmogorov Equations, which tell us
that the two-step probabilities are simply the square of the Transition Matrix:

$P^{(2)} = P \cdot P = P^2$

If we wanted to calculate the three-step probabilities, we would then cube the
Transition Matrix. In general, for n steps:

$P^{(n)} = P^n$
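
Multi-step transition probabilities can be sketched with plain nested lists. The 3x3 matrix below is a hypothetical example chosen for illustration; it is not the matrix of the Markov Chain in the diagram above.

```python
def mat_mul(A, B):
    """Multiply two matrices given as nested lists."""
    return [
        [sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
        for i in range(len(A))
    ]


def mat_pow(P, n):
    """n-step transition matrix: P multiplied by itself n times."""
    size = len(P)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, P)
    return result


# Hypothetical one-step transition matrix (each row sums to 1).
P = [
    [0.5, 0.3, 0.2],
    [0.2, 0.6, 0.2],
    [0.1, 0.4, 0.5],
]

P2 = mat_pow(P, 2)
print("two-step probability of going from state 0 to state 1:", round(P2[0][1], 4))

P20 = mat_pow(P, 20)
for row in P20:
    print([round(v, 4) for v in row])
```

For this hypothetical matrix the rows of the 20-step matrix are nearly identical, which hints at the long-run behaviour discussed next.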


Now, what would happen as n becomes large? We will answer this next.
Stationary Distribution

As we progress through time, some states become more likely than others. Over the
long run, the distribution will reach an equilibrium with an associated probability
of being in each state. This is known as the Stationary Distribution.

It is called stationary because if you apply the Transition Matrix to this
distribution, the resultant distribution is the same as before:

$\pi P = \pi$

where π is some distribution written as a row vector with the number of columns
equal to the number of states in the state space, and P is the Transition Matrix.
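
One simple way to approximate a stationary distribution numerically is to start from any distribution and apply the Transition Matrix repeatedly until the distribution stops changing. The sketch below does this for a hypothetical 3x3 matrix (again, not the matrix from the diagram); this repeated-multiplication approach is an alternative to the eigenvalue method and is only assumed to converge for well-behaved chains such as this one.

```python
def apply_transition(dist, P):
    """Row vector times matrix: the distribution after one more time step."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


def stationary_distribution(P, iterations=200):
    """Approximate the stationary distribution by repeated application of P."""
    n = len(P)
    dist = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(iterations):
        dist = apply_transition(dist, P)
    return dist


# Hypothetical transition matrix (each row sums to 1).
P = [
    [0.5, 0.3, 0.2],
    [0.2, 0.6, 0.2],
    [0.1, 0.4, 0.5],
]

pi = stationary_distribution(P)
print([round(v, 4) for v in pi])

# Applying P once more leaves the distribution unchanged.
print([round(v, 4) for v in apply_transition(pi, P)])
```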

Eigenvalue Decomposition

Some people may recognise the above equation as π being a (left) eigenvector of P
with an eigenvalue of 1. This is indeed true, so we can solve it using eigenvalue
decomposition.

Let's work through our example Markov Chain above, which has a 3x3 Transition
Matrix. From our above Transition Matrix, we want to solve the following equation:

$\det(P - \lambda I) = 0$

where λ are the eigenvalues corresponding to the eigenvectors. Using the triangle
rule, this equals:

Therefore, our eigenvalues are 0, 1 and -0.5. Our solution is only valid
where the eigenvalue is equal to 1, so we will now use that to find the
corresponding eigenvector, which will be our stationary distribution:

Here, the subscripts refer to the probability of being in the corresponding
state when we have a stationary distribution. The above equation can be solved
through Gaussian elimination to arrive at the following result:

Normalising the above vector, our stationary distribution is then:

This means that in the long term we are equally likely to be in any of the three
states, i.e. each state has probability 1/3.


Hidden Markov Models

These appear in many facets of Data Science and Machine Learning,
particularly Natural Language Processing and Reinforcement Learning, so they are
well worth understanding.

Intuition and Example Model

In a regular Markov Chain we are able to see the states and their associated
transition probabilities. However, in a Hidden Markov Model (HMM), the Markov
Chain is hidden, but we can infer its properties through its observed states.
Note: The Hidden Markov Model is not a Markov Chain per se.
Let's go through an example to gain some understanding:
• If the weather is Sunny, I have a 90% chance of being happy
and 10% chance of being sad.

• If the weather is Rainy, I have a 30% chance of being happy
and 70% chance of being sad.


These associated probabilities of the observed states (Happy, Sad) are known as
the emission probabilities.

Now, let's say my friend wants to infer the weather from my mood. For a given
week, say, I am: Sad, Happy, Sad, Happy, Sad, Happy, Sad. Taking the most likely
weather for each mood on its own, my friend would infer the weather to have been:
Rainy, Sunny, Rainy, Sunny, Rainy, Sunny, Rainy. This is an intuitive approach;
however, weather is very unlikely to be that erratic. Therefore, we need to add the
transition probabilities between our hidden states.

The above plot is our Hidden Markov Model! We will now carry out some
basic calculations using our model!

Probability of Being Sunny?

What would be the probability that a random day is Sunny or Rainy? This
question is answered by the stationary distribution of the Markov Chain. This tells
us the probability of being in a given state in the long term, otherwise known as the
equilibrium of the Markov Chain.

The stationary distribution is a distribution such that, if you apply the Transition
Matrix P to it, the resultant distribution is the same as before:

$\pi P = \pi$

where π is the stationary distribution. This distribution can be derived
by finding the eigenvector of P which has an eigenvalue of 1. After applying
eigenvalue decomposition to our above HMM, we find the stationary distribution
to be {0.5, 0.5}. In other words, a random day is equally likely to be Sunny or
Rainy!

Evaluating Sequence Likelihood

How do we compute the probability of a certain sequence of hidden and observed
states occurring?

For example, let's say yesterday I was Happy and it was Sunny and today I
am Sad and it is also Sunny. What is the probability of this sequence?

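Mathematically, we want to calculate (with subscript 1 for yesterday and 2 for today):

$P(\text{Happy}_1, \text{Sunny}_1, \text{Sad}_2, \text{Sunny}_2)$

We can do this by brute force using the emission, transition and stationary
distribution probabilities that are shown and derived in our above HMM diagram.
We break this down into the following probabilities:

$P(\text{Sunny}_1) \cdot P(\text{Happy}_1 \mid \text{Sunny}_1) \cdot P(\text{Sunny}_2 \mid \text{Sunny}_1) \cdot P(\text{Sad}_2 \mid \text{Sunny}_2)$

So the probability of the above sequence is 0.056!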
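
The brute-force calculation can be sketched as a short function that multiplies the starting probability, the emission probabilities and the transition probabilities along the sequence. The emission values below follow the Sunny/Rainy bullet points earlier in this section and the starting distribution is the stationary distribution {0.5, 0.5}; the transition probabilities are assumed values, since the HMM diagram is not reproduced here, so the printed number is illustrative only and is not expected to match the figure quoted above.

```python
# Starting distribution (the stationary distribution stated in the text).
start_p = {"Sunny": 0.5, "Rainy": 0.5}

# ASSUMED transition probabilities between hidden states.
trans_p = {
    "Sunny": {"Sunny": 0.8, "Rainy": 0.2},
    "Rainy": {"Sunny": 0.2, "Rainy": 0.8},
}

# Emission probabilities as stated in the bullet points of this section.
emit_p = {
    "Sunny": {"Happy": 0.9, "Sad": 0.1},
    "Rainy": {"Happy": 0.3, "Sad": 0.7},
}


def joint_probability(hidden, observed):
    """Probability of a given hidden-state sequence together with its observations."""
    prob = start_p[hidden[0]] * emit_p[hidden[0]][observed[0]]
    for t in range(1, len(hidden)):
        prob *= trans_p[hidden[t - 1]][hidden[t]]  # move between hidden states
        prob *= emit_p[hidden[t]][observed[t]]     # emit the observed mood
    return prob


# Yesterday: Sunny and Happy. Today: Sunny and Sad.
p = joint_probability(["Sunny", "Sunny"], ["Happy", "Sad"])
print(round(p, 4))
```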

Those with a keen eye might have noticed that we have indirectly been using Bayes’
theorem in the above calculation!

Decoding The Most Likely Sequence

What is the most likely hidden state (weather) sequence that generates an observed
(mood) sequence?

This can be answered by simply computing the probability of every possible hidden
state combination and choosing the one with the highest probability. This amounts
to a Maximum Likelihood Estimate of the hidden sequence.

However, the number of combinations can quickly become very large. For N hidden
states and an observation sequence of T observations, we have (N^T) possible
combinations. In practice, N and T will be large, therefore it is not computationally
feasible to calculate every hidden state combination.


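To solve this complexity problem, we use the Viterbi Algorithm, which uses dynamic
programming and runs in the order of O(N²·T) instead of O(N^T). The closely
related Forward Algorithm uses the same idea to compute the likelihood of an
observation sequence.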
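
A compact sketch of the Viterbi Algorithm for the weather/mood example. For every time step and every hidden state it keeps only the best path ending in that state, so each step costs N² operations rather than enumerating all sequences. The emission values follow the bullet points earlier in this section and the starting distribution is {0.5, 0.5}; the transition probabilities are assumed values for illustration, since the HMM diagram is not reproduced here.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state path and its probability."""
    # best[t][s]: probability of the most likely path that ends in state s at step t
    best = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    back = [{}]  # back[t][s]: previous state on that best path
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, prob = max(
                ((r, best[t - 1][r] * trans_p[r][s]) for r in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = prob * emit_p[s][observations[t]]
            back[t][s] = prev
    # Trace the best path backwards from the most likely final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]


states = ["Sunny", "Rainy"]
start_p = {"Sunny": 0.5, "Rainy": 0.5}
trans_p = {  # ASSUMED transition probabilities
    "Sunny": {"Sunny": 0.8, "Rainy": 0.2},
    "Rainy": {"Sunny": 0.2, "Rainy": 0.8},
}
emit_p = {  # emission probabilities from the bullet points in the text
    "Sunny": {"Happy": 0.9, "Sad": 0.1},
    "Rainy": {"Happy": 0.3, "Sad": 0.7},
}

week = ["Sad", "Happy", "Sad", "Happy", "Sad", "Happy", "Sad"]
path, prob = viterbi(week, states, start_p, trans_p, emit_p)
print(path)
print(prob)
```

Because the transition probabilities penalise switching weather every day, the decoded path need not alternate in step with the moods, unlike the naive day-by-day guess.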
