Unit IV
Probabilistic Methods for Learning
Introduction
What are Probabilistic Machine Learning Models?
Let’s discuss an example to better understand probabilistic classifiers. Take the task
of classifying an image of an animal into five classes — {Dog, Cat, Deer, Lion,
Rabbit} as the problem. As input, we have an image (of a dog). For this example,
let’s consider that the classifier works well and provides correct/acceptable results
for the particular input we are discussing. When the image is provided as the input to
the probabilistic classifier, it will provide an output such as (Dog (0.6), Cat (0.2),
Deer (0.1), Lion (0.04), Rabbit (0.06)). But if the classifier is non-probabilistic, it will
only output “Dog”.
One of the major advantages of probabilistic models is that they provide an idea
about the uncertainty associated with predictions. In other words, we can get an
idea of how confident a machine learning model is in its prediction. If we consider
the above example, if the probabilistic classifier assigns a probability of 0.9 for ‘Dog’
class instead of 0.6, it means the classifier is more confident that the animal in the
image is a dog. These concepts related to uncertainty and confidence are
extremely useful when it comes to critical machine learning applications such as
disease diagnosis and autonomous driving. Also, probabilistic outcomes would be
useful for numerous techniques related to Machine Learning such as Active
Learning.
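To make this concrete, here is a minimal Python sketch contrasting the two kinds of output. The raw scores (logits) are invented for illustration; only the class names and the approximate probabilities come from the example above.

```python
import numpy as np

classes = ["Dog", "Cat", "Deer", "Lion", "Rabbit"]
# Hypothetical raw scores a classifier might produce for the dog image.
logits = np.array([2.1, 1.0, 0.3, -0.6, -0.2])

# A probabilistic classifier converts scores into a distribution with softmax.
probs = np.exp(logits) / np.exp(logits).sum()
print(dict(zip(classes, probs.round(2))))   # ~{'Dog': 0.6, 'Cat': 0.2, ...}

# A non-probabilistic classifier reports only the top label.
print(classes[int(np.argmax(logits))])      # 'Dog'
```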
Objective Functions
The loss created by a particular data point will be higher if the prediction given
by the model is significantly higher or lower than the actual value. The loss will be
less when the predicted value is very close to the actual value. As you can see, the
objective function here is not based on probabilities, but on the difference (absolute
difference) between the actual value and the predicted value.
$$L = \frac{1}{n}\sum_{i=1}^{n}\left|y_{true,i} - y_{predict,i}\right| \qquad \text{(Eq. 1)}$$
Here, n indicates the number of data instances in the data set, y_true is the
correct/ true value and y_predict is the predicted value (by the linear regression
model).
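As a quick sketch of Eq. 1 in code (the target and prediction values below are invented for illustration):

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])      # hypothetical true values
y_predict = np.array([2.5, 0.0, 2.0, 8.0])    # hypothetical model predictions

# Mean absolute error: average of |y_true - y_predict| over n data points (Eq. 1)
mae = np.mean(np.abs(y_true - y_predict))
print(mae)  # 0.5
```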
On the other hand, if we consider a neural network with a softmax output layer, the
loss function is usually defined using the Cross-Entropy Loss (CE loss) (Eq. 2). Note that
we are considering a training dataset with n data points, so we finally take the
average of the per-data-point losses as the CE loss of the dataset. Here, y_i
means the true label of data point i and p(y_i) means the predicted probability for
the class y_i (the probability that data point i belongs to class y_i, as assigned by
the model).
$$CE = -\frac{1}{n}\sum_{i=1}^{n}\log p(y_i) \qquad \text{(Eq. 2)}$$
For binary classification, the CE loss takes the Binary Cross-Entropy form:
$$BCE = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log p(s_i) + (1 - y_i)\log\left(1 - p(s_i)\right)\right] \qquad \text{(Eq. 3)}$$
Here y_i is the class label (1 if similar, 0 otherwise) and p(s_i) is the predicted
probability of point i being in class 1, for each point i in the dataset. N is the number
of data points. Note that as this is a binary classification problem, there are only two
classes, class 1 and class 0.
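A small sketch of Eq. 2 and Eq. 3 in code, using invented labels and predicted probabilities:

```python
import numpy as np

# Multi-class CE loss (Eq. 2): average negative log of the probability
# the model assigned to the true class of each data point.
p_true_class = np.array([0.6, 0.8, 0.3])    # hypothetical p(y_i) values
ce = -np.mean(np.log(p_true_class))

# Binary CE loss (Eq. 3): y_i is 1 or 0, p(s_i) is the predicted
# probability of class 1 for point i.
y = np.array([1, 0, 1, 1])                  # hypothetical binary labels
p = np.array([0.9, 0.2, 0.7, 0.6])          # hypothetical p(s_i)
bce = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
print(ce, bce)
```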
The Naïve Bayes algorithm is comprised of two words, Naïve and Bayes, which can
be described as:
Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to
determine the probability of a hypothesis with prior knowledge. It depends on
the conditional probability.
o The formula for Bayes' theorem is given as:
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
where P(A|B) is the posterior probability, P(B|A) is the likelihood, P(A) is the
prior probability, and P(B) is the marginal probability (evidence).
Let’s come back to the problem at hand. Looks like you’re very serious about your
resolution this time, given that you have been keeping track of the weather outside
for the past two weeks:
Next, you need to create a frequency table for each attribute of your dataset.
Then, for each frequency table, you will create a likelihood table.
Let’s say you want to focus on the likelihood that you go for a run given that it’s
sunny outside.
Naïve Bayes assumes conditional independence over the training dataset. The
classifier separates data into different classes according to Bayes’ Theorem, but
assumes that all input features within a class are independent of one another.
Hence, the model is called naïve.
This helps in simplifying the calculations by dropping the denominator from the
formula while assuming independence:
Say you want to predict if on the coming Wednesday, given the following weather
conditions, should you go for a run or sleep in:
Outlook: Rainy
Humidity: Normal
Wind: Weak
Run: ?
Now, to determine the probability of going for a run on Wednesday, you just need
to divide P(Yes) by the sum of the likelihoods of Yes and No.
According to your model, it looks like there’s an almost 83% probability that
you’re going to stay under the covers next Wednesday!
This was just a fun example. Although Naïve Bayes IS used for weather
predictions, for advanced machine learning problems, the complexity of the
Bayesian classifier needs to be reduced for it to be practical. This is where the
naïve in Naïve Bayes comes in.
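To tie the steps together, here is a hedged sketch of the whole procedure on a hypothetical weather log. The rows and counts below are invented, so the resulting probability will differ from the 83% above.

```python
# Hypothetical two-week log: (Outlook, Humidity, Wind) -> Run?
data = [
    ("Sunny", "High", "Weak", "No"),     ("Sunny", "High", "Strong", "No"),
    ("Rainy", "High", "Weak", "No"),     ("Sunny", "Normal", "Weak", "Yes"),
    ("Rainy", "Normal", "Weak", "Yes"),  ("Sunny", "Normal", "Strong", "Yes"),
    ("Rainy", "Normal", "Strong", "No"), ("Sunny", "High", "Weak", "Yes"),
]

def posterior(outlook, humidity, wind, label):
    rows = [r for r in data if r[3] == label]
    prior = len(rows) / len(data)
    # Naive independence assumption: multiply per-feature likelihoods.
    like = 1.0
    for i, value in enumerate((outlook, humidity, wind)):
        like *= sum(1 for r in rows if r[i] == value) / len(rows)
    return prior * like

yes = posterior("Rainy", "Normal", "Weak", "Yes")
no = posterior("Rainy", "Normal", "Weak", "No")
print("P(Run=Yes) =", yes / (yes + no))   # normalized over Yes and No
```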
Advantages of Naïve Bayes Classifier:
o Naïve Bayes is one of the fastest and easiest ML algorithms for predicting the
class of a dataset.
o It can be used for Binary as well as Multi-class Classifications.
o It performs well in Multi-class predictions as compared to the other
Algorithms.
o It is the most popular choice for text classification problems.
There are three types of Naïve Bayes model, which are given below:
o Gaussian: assumes that the features follow a normal distribution.
o Multinomial: used when the data is multinomially distributed, as in document
classification with word counts.
o Bernoulli: used when the features are binary/boolean variables.
In this case, x and e denote the given dataset, and H and Θ represent the
parameter (hypothesis). In other words, x = e and H = Θ in the figure above.
The image below explains the difference between the probability and the
likelihood.
Probability vs Likelihood
You can estimate a probability of an event using the function that describes the
probability distribution and its parameters. For example, you can estimate the
outcome of a fair coin flip by using the Bernoulli distribution and the probability of
success 0.5. In this ideal case, you already know how the data is distributed.
But the real world is messy. Often you don’t know the exact parameter values, and
you may not even know the probability distribution that describes your specific use
case. Instead, you have to estimate the function and its parameters from the data.
The likelihood describes the relative evidence that the data has a particular
distribution and its associated parameters.
We can describe the likelihood as a function of an observed value of the data x and
the distribution's unknown parameter θ.
In short, when estimating the probability, you go from a distribution and its
parameters to the event.
When estimating the likelihood, you go from the data to the distribution and its
parameters.
To make this more concrete, let’s calculate the likelihood for a coin flip.
Recall that a coin flip is a Bernoulli trial, which can be described by the following
function:
$$P(x; \theta) = \theta^{x}(1 - \theta)^{1 - x}, \quad x \in \{0, 1\}$$
where x = 1 denotes heads and θ is the probability of heads.
Now, we need a hypothesis about the parameter theta. We assume that the coin is
fair. The probability of obtaining heads is 0.5. This is our hypothesis A.
Let’s say we throw the coin 3 times. It comes up heads the first 2 times. The last
time it comes up tails. What is the likelihood of hypothesis A given the data?
First, we can calculate the relative likelihood that hypothesis A is true and the coin
is fair by plugging our parameter and our outcomes into the probability function:
$$L(\theta = 0.5; HHT) = 0.5 \times 0.5 \times 0.5 = 0.125$$
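A minimal sketch that checks this computation:

```python
def bernoulli_likelihood(theta, outcomes):
    """Likelihood of i.i.d. coin flips; outcome 1 = heads, 0 = tails."""
    like = 1.0
    for x in outcomes:
        like *= theta ** x * (1 - theta) ** (1 - x)
    return like

# Hypothesis A: fair coin (theta = 0.5), data = heads, heads, tails.
print(bernoulli_likelihood(0.5, [1, 1, 0]))  # 0.125
```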
Likelihood Ratios
Once you’ve calculated the likelihood, you have a hypothesis that your data has a
specific set of parameters. The likelihood is your evidence for that hypothesis. To
pick the hypothesis with the maximum likelihood, you have to compare your
hypothesis to another by calculating the likelihood ratios.
Since your 3 coin tosses yielded two heads and one tail, you hypothesize that the
probability of getting heads is actually 2/3. This is your hypothesis B.
Let’s repeat the previous calculation for B with a probability of 2/3 for the same
three coin tosses. Without going through every step of plugging the values into the
formula again, the result is
$$L(\theta = 2/3; HHT) = \tfrac{2}{3} \times \tfrac{2}{3} \times \tfrac{1}{3} = \tfrac{4}{27} \approx 0.148$$
In the absence of more data in the form of coin tosses, 2/3 is the most likely candidate
for our true parameter value. So hypothesis B gives us the maximum likelihood
value.
We can express the relative likelihood of an outcome as a ratio of the likelihood for
our chosen parameter value θ to the maximum likelihood.
The relative likelihood that the coin is fair can be expressed as a ratio of the
likelihood that the true probability is 1/2 against the maximum likelihood that the
probability is 2/3: 0.125 / 0.148 ≈ 0.84.
The maximum value division helps to normalize the likelihood to a scale with 1 as
its maximum likelihood. We can plot the different parameter values against their
relative likelihoods given the current data.
For three coin tosses with 2 heads, the plot would look like this with the likelihood
maximized at 2/3.
What happens if we toss the coin a fourth time and it comes up tails? Now
we’ve had 2 heads and 2 tails. Our likelihood plot now looks like this, with the
likelihood maximized at 1/2.
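A sketch that reproduces both likelihood plots numerically (assuming numpy and matplotlib are available):

```python
import numpy as np
import matplotlib.pyplot as plt

def relative_likelihood(heads, tails):
    theta = np.linspace(0.01, 0.99, 99)
    like = theta ** heads * (1 - theta) ** tails
    return theta, like / like.max()   # normalize so the maximum is 1

for h, t in [(2, 1), (2, 2)]:         # 2 heads 1 tail, then 2 heads 2 tails
    theta, rel = relative_likelihood(h, t)
    plt.plot(theta, rel, label=f"{h} heads, {t} tails")

plt.xlabel("parameter value θ")
plt.ylabel("relative likelihood")
plt.legend()
plt.show()   # the curves peak at 2/3 and at 1/2 respectively
```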
The variable x represents the examples drawn from the unknown data
distribution that we would like to approximate, and n the number of examples.
Log-Likelihood
For most practical applications, maximizing the log-likelihood is often a better
choice because the logarithm reduces each operation by one level: multiplications
become additions, powers become multiplications, and so on.
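A small sketch of why this also matters numerically: a long product of probabilities underflows to zero in floating point, while the sum of logs stays representable (the 1000 identical probabilities are invented):

```python
import numpy as np

probs = np.full(1000, 0.1)        # 1000 hypothetical per-example probabilities

product = np.prod(probs)          # 0.1**1000 underflows to exactly 0.0
log_sum = np.sum(np.log(probs))   # the sum of logs stays representable

print(product)    # 0.0
print(log_sum)    # -2302.585..., i.e. 1000 * log(0.1)
```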
In the case of a classification task with supervised learning, our dataset is
composed of pairs of data x and corresponding labels y. This means the ML
estimation also needs to deal with the conditional probability of the model (network)
output y′ given the input data x.
Maximum A Posteriori (MAP)
An alternative estimator is the MAP estimator, which finds the parameter theta that
maximizes the posterior.
According to the Bayes rule, the posterior can be decomposed into the product of
the likelihood and the prior. The MAP estimator begins with this idea and is defined
as:
$$\hat{\theta}_{MAP} = \arg\max_{\theta} \, p(\theta \mid x) = \arg\max_{\theta} \, p(x \mid \theta)\, p(\theta)$$
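A hedged sketch contrasting MLE and MAP on the earlier coin example. The Beta(5, 5) prior is an assumption introduced here to express a belief that the coin is near-fair; it is not from the text:

```python
import numpy as np

heads, tails = 2, 1                    # the three coin tosses from earlier
theta = np.linspace(0.001, 0.999, 999)

log_like = heads * np.log(theta) + tails * np.log(1 - theta)

# Hypothetical Beta(5, 5) prior expressing a belief that the coin is near-fair.
a, b = 5, 5
log_prior = (a - 1) * np.log(theta) + (b - 1) * np.log(1 - theta)

theta_mle = theta[np.argmax(log_like)]              # maximizes likelihood only
theta_map = theta[np.argmax(log_like + log_prior)]  # maximizes likelihood * prior

print(theta_mle)   # ~2/3
print(theta_map)   # pulled toward 0.5 by the prior: (2+5-1)/(3+10-2) = 6/11
```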
A Bayesian belief network is a key computer technology for dealing with probabilistic
events and for solving problems that involve uncertainty. We can define a Bayesian
network as: a probabilistic graphical model which represents a set of variables and
their conditional dependencies using a directed acyclic graph.
Bayesian networks are probabilistic, because these networks are built from
a probability distribution, and also use probability theory for prediction and
anomaly detection.
Real world applications are probabilistic in nature, and to represent the relationship
between multiple events, we need a Bayesian network. It can also be used in various
tasks including prediction, anomaly detection, diagnostics, automated insight,
reasoning, time series prediction, and decision making under uncertainty.
Bayesian networks can be used for building models from data and experts' opinions,
and a network consists of two parts:
o Causal component (the directed graph structure)
o Actual numbers (the conditional probability tables)
The generalized form of Bayesian network that represents and solves decision
problems under uncertain knowledge is known as an Influence diagram.
Note: The Bayesian network graph does not contain any cycles. Hence, it is
known as a directed acyclic graph, or DAG.
If we have variables x1, x2, x3, ..., xn, their joint probability distribution
P[x1, x2, x3, ..., xn] can be written, by the chain rule, as:
$$P[x_1, x_2, \ldots, x_n] = P[x_1 \mid x_2, \ldots, x_n]\, P[x_2 \mid x_3, \ldots, x_n] \cdots P[x_{n-1} \mid x_n]\, P[x_n]$$
In general, for each variable Xi we can write the equation as:
$$P(X_i \mid X_{i-1}, \ldots, X_1) = P(X_i \mid \text{Parents}(X_i))$$
Example: Harry installed a new burglar alarm at his home to detect burglary. The
alarm reliably responds to a burglary but also responds to minor
earthquakes. Harry has two neighbors, David and Sophia, who have taken the
responsibility to inform Harry at work when they hear the alarm. David always calls
Harry when he hears the alarm, but sometimes he gets confused with the phone
ringing and calls then too. On the other hand, Sophia likes to listen to loud
music, so sometimes she misses the alarm. Here we would like to compute
the probability of Burglary Alarm.
Problem:
Calculate the probability that the alarm has sounded, but neither a burglary
nor an earthquake has occurred, and both David and Sophia called
Harry.
Solution:
o The Bayesian network for the above problem is given below. The network
structure shows that Burglary and Earthquake are the parent nodes of the
Alarm and directly affect the probability of the alarm going off, whereas David's
and Sophia's calls depend only on the alarm probability.
o The network represents that David and Sophia do not directly perceive the
burglary, do not notice the minor earthquake, and do not confer
before calling.
o Each row in the CPT must sum to 1 because the entries in the table
represent an exhaustive set of cases for the variable.
o Burglary (B)
o Earthquake(E)
o Alarm(A)
o David Calls(D)
o Sophia calls(S)
We can write the events of the problem statement in the form of probabilities as
P[D, S, A, B, E], and we can rewrite the required probability using the joint
probability distribution:
$$P(S, D, A, \lnot B, \lnot E) = P(S \mid A)\, P(D \mid A)\, P(A \mid \lnot B \land \lnot E)\, P(\lnot B)\, P(\lnot E)$$
Let's take the observed probabilities for the Burglary and Earthquake components:
P(B = True) = 0.002 and P(B = False) = 0.998; P(E = True) = 0.001 and
P(E = False) = 0.999, which is the probability that an earthquake has not occurred.
The conditional probability that David will call depends on the probability of the
Alarm; in this example, P(D = True | A = True) = 0.91.
The conditional probability that Sophia calls depends on its parent node
"Alarm"; in this example, P(S = True | A = True) = 0.75.
From the formula of the joint distribution, we can write the problem statement in the
form of a probability distribution, using P(A | ¬B ∧ ¬E) = 0.001 from the alarm's CPT:
P(S, D, A, ¬B, ¬E) = P(S | A) · P(D | A) · P(A | ¬B ∧ ¬E) · P(¬B) · P(¬E)
= 0.75 × 0.91 × 0.001 × 0.998 × 0.999
= 0.00068045.
Hence, a Bayesian network can answer any query about the domain by using
Joint distribution.
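A minimal sketch of the same calculation in code, using the probability values from the worked example:

```python
# CPT entries used in the worked example above.
p_not_burglary   = 0.998   # P(not B)
p_not_earthquake = 0.999   # P(not E)
p_alarm          = 0.001   # P(A | not B, not E)
p_david_calls    = 0.91    # P(D | A)
p_sophia_calls   = 0.75    # P(S | A)

# The joint probability factorizes along the DAG:
# P(S, D, A, not B, not E) = P(S|A) P(D|A) P(A|not B, not E) P(not B) P(not E)
joint = (p_sophia_calls * p_david_calls * p_alarm
         * p_not_burglary * p_not_earthquake)
print(round(joint, 8))   # 0.00068045
```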
There are two ways to understand the semantics of a Bayesian network, which are
given below:
1. To understand the network as a representation of the joint probability distribution.
2. To understand the network as an encoding of a collection of conditional
independence statements.
Most often, the problem is the lack of information about the domain required to fully
specify the conditional dependence between random variables. If available,
calculating the full conditional probability for an event can be impractical.
Bayesian belief networks are one example of a probabilistic model where some
variables are conditionally independent.
Probability Density:
Assume a random variable x that has a probability distribution p(x). The
relationship between the outcomes of a random variable and its probability is
referred to as the probability density.
The problem is that we don’t always know the full probability distribution for a
random variable. This is because we only use a small subset of observations to
derive the outcome. This problem is referred to as Probability Density
Estimation as we use only a random sample of observations to find the general
density of the whole sample space.
Steps Involved:
Step 1 - Create a histogram for the random set of observations to understand the
density of the random sample.
Step 2 - Create the probability density function and fit it on the random sample.
Observe how it fits the histogram plot.
After fitting, the histograms of most of the different random samples should match
the histogram plot of the whole population.
Density Estimation: It is the process of finding out the density of the whole
population by examining a random sample of data from that population. One of the
best ways to achieve a density estimate is by using a histogram plot.
Parametric Density Estimation
A normal distribution has two given parameters, mean and standard deviation. We
calculate the sample mean and standard deviation of the random sample taken from
this population to estimate the density of the random sample. The reason it is
termed as ‘parametric’ is due to the fact that the relation between the observations
and its probability can be different based on the values of the two parameters.
Now, it is important to understand that the mean and standard deviation of this
random sample is not going to be the same as that of the whole population due to
its small size. A sample plot for parametric density estimation is shown below.
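A sketch of parametric density estimation following the two steps above (the sample is synthetic, and scipy/matplotlib are assumed available):

```python
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
sample = rng.normal(loc=50, scale=5, size=100)   # hypothetical random sample

# Parametric estimate: the sample mean and standard deviation define the PDF.
mu, sigma = sample.mean(), sample.std()

plt.hist(sample, bins=15, density=True, alpha=0.5)   # Step 1: histogram
x = np.linspace(sample.min(), sample.max(), 200)
plt.plot(x, norm.pdf(x, mu, sigma))                  # Step 2: fitted PDF
plt.show()
```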
Maximum Likelihood Estimation (MLE) is a method of fitting the
PDF over the random sample data. This is done by maximizing the likelihood
function so that the PDF fits over the random sample. Another way to look at it
is that MLE finds the mean and standard deviation for which the PDF of the
random sample is most similar to that of the whole population.
NOTE: MLE assumes that all candidate PDFs are equally likely to be the best-fitting
curve. Hence, it is a computationally expensive method.
Intuition:
As observed in Fig 1, the red plots poorly fit the normal distribution, hence
their ‘likelihood estimate’ is also lower. The green PDF curve has the maximum
likelihood estimate as it fits the data perfectly. This is how the maximum likelihood
estimate method works.
Mathematics Involved
In the intuition section, we discussed the role that the likelihood value plays in
determining the optimum PDF curve. Let us understand the math involved in the
MLE method.
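As a sketch of the math, for a normal model the θ that maximizes the log-likelihood works out to the sample mean and standard deviation; perturbing either parameter lowers the log-likelihood (the data below is synthetic):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
sample = rng.normal(loc=10, scale=2, size=500)   # synthetic observations

def log_likelihood(mu, sigma, data):
    # Sum of log N(x | mu, sigma) over all observations.
    return np.sum(norm.logpdf(data, loc=mu, scale=sigma))

# Closed-form Gaussian MLE: sample mean and (population-style) std.
mu_hat, sigma_hat = sample.mean(), sample.std(ddof=0)

# Any perturbed parameters give a lower log-likelihood.
print(log_likelihood(mu_hat, sigma_hat, sample))
print(log_likelihood(mu_hat + 0.5, sigma_hat, sample))   # strictly smaller
```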
Sequence Models
Sequence models are machine learning models that input or output sequences of
data. Sequential data includes text streams, audio clips, video clips, time-series data,
etc.
These examples show that there are different applications of sequence models:
sometimes both the input and output are sequences, while in others only the input
or only the output is a sequence. The Recurrent Neural Network (RNN) is a popular
sequence model that has shown efficient performance on sequential data.
One-to-one
With one input and one output, this is the classic feed-forward neural network
architecture.
One-to-many
With one input and a sequence of outputs, this architecture suits tasks such as
image captioning, where a single image yields a sentence.
Many-to-one
With a sequence of inputs and a single output, this architecture suits tasks such as
sentiment analysis, where a whole sentence is mapped to one label.
Many-to-many
This paradigm is suitable for machine translation, such as that seen in Google
Translate. The input could be a variable-length English sentence, and the output
could be a variable-length sentence in a different language. On a frame-by-frame
basis, the last many-to-many model can be utilized for video
classification.
Sequence Modelling
The architecture of an RNN is also inspired by the human brain. As we read any
essay, we are able to interpret the sentence we are currently reading better
because of the information we gained from previous sentences of the essay.
Similarly, we can understand the conclusion of a novel only if we have read the
beginning and middle of the novel. The same logic follows for audio as well.
• This neuron can be thought of as multiple copies of the same unit or cell
chained together. This is illustrated by the second image, which shows an
“unrolled” form of the recurrent neuron. Each copy or unit passes a
message (some information) to the next copy.
In Recurrent Neural Networks, there is a concept of time steps. This means that
the recurrent cells or units take inputs from a sequence one by one. Each step at
which the cell picks up an input is called a time step. For example, if we have a
sequence of words that form a sentence, such as “It’s a sunny day.”, our recurrent
cell will take the word “It’s” as its input at the first time step. Now it stores
information about the word “It’s” in its memory and updates its state. Next, it takes
the word “a” as its second input at the second time step. Now it incorporates
information about the word “a” into its memory and updates its state once again. It
repeats the process until the last word. Therefore, the cell state at the 1st time step
depends only on the 1st input, the cell state at the 2nd time step depends on the 1st
and 2nd inputs, the cell state at the third time step depends on the 1st, 2nd and 3rd
inputs and so on. In this way the cell continuously updates its memory as time
passes (similar to a human brain).
Referring to what you learnt in the previous paragraph and to the images above, we
can say that $x_1$, $x_2$, $x_3$ and so
on are the inputs to the recurrent cell at the 1st, 2nd, 3rd and so on time steps. At
each time step, the recurrent cell updates its state based on the current input, gives
an output vector h and then moves on to the next time step. This is demonstrated in
the “unrolled” RNN diagram above.
Therefore, we need 2 separate weight matrices at each time step to calculate the
current state of the recurrent cell. One matrix W and another matrix U are used.
Matrix W is multiplied by the current input and the matrix U is multiplied by the
previous state of the cell (at the previous time step) and the two products are added.
A bias vector b can be added to the sum. Then, the whole sum can be passed
through an activation function like ReLU, Tanh or Sigmoid to form the new
updated state of the cell (The activation function is used to introduce non-linearity
into the network so that it can fit more complex functions). So, the update formula
can be written as:
$h_t = \phi(W \cdot x_t + U \cdot h_{t-1} + b)$, where $h_t$ is the
cell state at time step t, $x_t$ is the cell input at time step t, and $\phi$ is the
activation function.
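A minimal numpy sketch of this update rule; the layer sizes, random weights and the tanh activation are assumptions for illustration:

```python
import numpy as np

input_size, hidden_size = 4, 3
rng = np.random.default_rng(0)

W = rng.normal(size=(hidden_size, input_size))   # multiplies the current input
U = rng.normal(size=(hidden_size, hidden_size))  # multiplies the previous state
b = np.zeros(hidden_size)                        # bias vector

def rnn_step(x_t, h_prev):
    # h_t = activation(W·x_t + U·h_{t-1} + b)
    return np.tanh(W @ x_t + U @ h_prev + b)

h = np.zeros(hidden_size)                        # initial state
sequence = rng.normal(size=(5, input_size))      # 5 time steps of made-up input
for x_t in sequence:                             # one input per time step
    h = rnn_step(x_t, h)
print(h)                                         # final state summarizes the sequence
```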
The RNN
Many such recurrent neurons stacked one on top of the other (possibly with
some densely connected layers at the end) form a Deep Recurrent Neural
Network, or DRNN.
• A Deep Recurrent Neural Network. The outputs of the lower layers are
fed as inputs to the upper layers (at each time step). For example, in the
above figure, the output of the lowest layer at time step t − 1
is fed as input at time step t − 1 in the middle layer.
With multiple recurrent units stacked one on top of the other, a DRNN
can learn more complex patterns in sequential data.
The outputs from one recurrent unit at each time step can be fed as input to the next
unit at the same time step. This forms a deep sequential model that can model a
larger range of more complex sequences than a single recurrent unit.
Recurrent Neural Networks face the problem of long term dependencies very often.
On many occasions, in sequence modelling problems we need information from
long ago to make predictions about the next term/s in a sequence. For example, if
we want to find the next word in the sentence “I grew up in Spain and I am very
familiar with the traditions and customs of …..”. To predict the next word (which
seems to be Spain), we need to have information about the word “Spain”, which is
just the 5th word in the sentence. But we need to predict the 17th word in the
sentence. This is a large time gap, and RNNs are prone to losing information given
to them many time steps back. RNNs are unable to capture these long-term
dependencies in practice.
A special type of RNN called an LSTM Network was created to solve the problem
of long term dependencies. The constituent cells of an LSTM network each have
their own system of gates that decide what information and how much information
from the sequence (text or audio) is stored in the cell’s state and how much is
discarded at each time step. These gates regulate the state of the cell more
effectively and help the cell retain information that it has gained long ago. These
systems of gates are parametrized by weight matrices and bias vectors. These
parameters are trained using the Back Propagation algorithm.
LSTM is a modification to the RNN hidden layer. LSTM has enabled RNNs to
remember its inputs over a long period of time. In LSTM in addition to the hidden
state, a cell state is passed to the next time step.
LSTM can capture long-range dependencies. It can have memory about previous
inputs for extended time durations. There are 3 gates in an LSTM cell. Memory
manipulations in LSTM are done using these gates. Long short-term memory
(LSTM) utilizes gates to control the gradient propagation in the recurrent network’s
memory.
• Forget Gate: the forget gate removes the information that is no longer useful
in the cell state.
• Input Gate: the input gate decides what new information from the current
input is added to the cell state.
• Output Gate: the output gate decides which part of the cell state is exposed
as the hidden state (output) at the current time step.
This gating mechanism of LSTM has allowed the network to learn the conditions
for when to forget, ignore, or keep information in the memory cell.
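A hedged Keras sketch of an LSTM-based sequence model; the dataset shape, layer sizes and training setup are invented for illustration:

```python
import numpy as np
import tensorflow as tf

# Hypothetical data: 100 sequences, 20 time steps, 8 features each.
x = np.random.rand(100, 20, 8).astype("float32")
y = np.random.randint(0, 2, size=(100, 1))

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, input_shape=(20, 8)),   # gated recurrent layer
    tf.keras.layers.Dense(1, activation="sigmoid"),  # binary prediction head
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(x, y, epochs=2, verbose=0)   # the gates learn what to keep or forget
```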
Markov Models
Markov Chains appear in many areas: Physics, Genetics, Finance and of course
in Data Science and Machine Learning. As a Data Scientist you probably would have
heard of the word ‘Markov’ come up a few times in your research or general reading.
It is a quintessential statistical technique in Natural Language Processing and
Reinforcement Learning.
Markov Property
$$P(X_{n+1} = x \mid X_n = x_n, \ldots, X_1 = x_1) = P(X_{n+1} = x \mid X_n = x_n)$$
Where n is the time step parameter and X is a random variable that takes
on a value in a given state space s. The state space refers to all the possible outcomes
of an event. For example, a coin flip has two values in its state space: s = {Heads,
Tails} and the probability of transitioning from one state to the other is 0.5.
Markov Chain
A process that uses the Markov Property is known as a Markov Process. If the state
space is finite and we use discrete time-steps this process is known as a Markov
Chain. In other words, it is a sequence of random variables that take on states in the
given state space.
In this article we will consider time-homogenous discrete-time Markov Chains as
they are the easiest to work with and build an intuition behind. There do exist time-
inhomogeneous Markov Chains where the transition probability between states is
not fixed and varies with time.
Shown below is an example Markov Chain with state space {A,B,C}. The
numbers on the arrows indicate the probability of transitioning between those two
states.
For example, if you want to go from state B to C, then this transition has a 20%
chance. Mathematically we are working out the following:
$$P(X_{n+1} = C \mid X_n = B) = 0.2$$
So the 1,1 entry tells us that the probability of transition from B to A is 0.5. This
agrees with the result we have in our Markov Chain diagram above.
Now, what would happen as n becomes large? We will answer this next.
Stationary Distribution
As we progress through time, some states become more likely to be occupied
than others. Over the long run, the distribution will reach an equilibrium with
an associated probability of being in each state. This is known as the Stationary
Distribution.
The reason it is stationary is because if you apply the Transition Matrix to this given
distribution, the resultant distribution is the same as before:
$$\pi P = \pi$$
Where π is some distribution expressed as a row vector with the number of columns
equal to the number of states in the state space, and P is the Transition Matrix.
Eigenvalue Decomposition
Some people may recognise the above equation as π being an eigenvector of P with
an eigenvalue of 1. This is indeed true, so we can solve it using eigenvalue
decomposition (spectral theorem).
Let's work through our example Markov Chain above which has a 3x3 Transition
Matrix. From our above Transition Matrix, we want to solve the following equation:
$$\det(P - \lambda I) = 0$$
Where λ are the eigenvalues corresponding to the eigenvectors. Using the triangle
rule, this equals:
$$\lambda^{3} - 0.5\lambda^{2} - 0.5\lambda = 0, \quad \text{i.e.} \quad \lambda(\lambda - 1)(\lambda + 0.5) = 0$$
Therefore, our eigenvalues are 0, 1 and −0.5. We know our solution is only valid
where the eigenvalue is equal to 1, so we will now use that to find the
corresponding eigenvector, which will be our stationary distribution:
$$\pi = \left(\tfrac{1}{3}, \tfrac{1}{3}, \tfrac{1}{3}\right)$$
This means in the long term we are equally likely to be in any of the three states.
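A sketch of the eigendecomposition approach in numpy. The transition matrix below is hypothetical (the article's own matrix is not reproduced here); the point is extracting the left eigenvector with eigenvalue 1 and normalizing it:

```python
import numpy as np

# Hypothetical 3x3 transition matrix over states {A, B, C}; rows sum to 1.
P = np.array([
    [0.2, 0.6, 0.2],
    [0.5, 0.3, 0.2],
    [0.1, 0.4, 0.5],
])

# pi P = pi means pi is a *left* eigenvector of P, i.e. an eigenvector of P.T.
eigvals, eigvecs = np.linalg.eig(P.T)
idx = np.argmin(np.abs(eigvals - 1))   # pick the eigenvalue closest to 1
pi = np.real(eigvecs[:, idx])
pi = pi / pi.sum()                     # normalize to a probability vector

print(pi)       # stationary distribution
print(pi @ P)   # identical to pi, confirming pi P = pi
```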
In a regular Markov Chain we are able to see the states and their associated transition
probabilities. However, in a Hidden Markov Model (HMM), the Markov Chain
is hidden, but we can infer its properties through its given observed states. Note: the
Hidden Markov Model is not a Markov Chain per se.
Let’s go through an example to gain some understanding:
• If the weather is Sunny, I have a 90% chance of being happy
and 10% chance of being sad.
These associated probabilities of the observed states (Happy, Sad) are known as
the emission probabilities.
Now, let’s say my friend wants to infer the weather from my mood. So, for a given
week, say, I am: Sad, Happy, Sad, Happy, Sad, Happy, Sad. Therefore, my friend
would have inferred the weather to have been: Rainy, Sunny, Rainy, Sunny, Rainy,
Sunny, Rainy. This is an intuitive approach, however weather is very unlikely to be
that erratic. Therefore, we need to add the transition probabilities between
our hidden states.
The above plot is our Hidden Markov Model! We will now carry out some
basic calculations using our model!
What would be the probability that a random day is Sunny or Rainy? Well this
question is answered by the stationary distribution of the Markov Chain. This tells
us the probability of being in a given state in the long term, otherwise known as the
equilibrium of the Markov Chain.
The stationary distribution is a given distribution that if you apply the Transition
Matrix, P, the resultant distribution is the same as before:
For example, let's say yesterday I was Happy and it was Sunny and today I
am Sad and it is also Sunny. What is the probability of this sequence?
We can do this by brute force using the emission, transition and stationary
distribution probabilities that are shown and derived in our above HMM diagram.
We break this down into the following probabilities:
Those with a keen eye might have noticed that we have indirectly been using Bayes’
theorem in the above calculation!
What is the most likely hidden state (weather) sequence that generates an observed
(mood) sequence?
This answer can be arrived at by simply computing all the possible hidden state
combinations and choosing the one with the highest probability. This is known as
Maximum Likelihood Estimation.
However, the number of combinations can quickly become very large. For N hidden
states and an observation sequence of T observations, we have N^T possible
combinations. In practice, N and T will be large, so it is not computationally
feasible to calculate every hidden state combination.
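A brute-force sketch over a short observation sequence; all probability tables below are hypothetical, chosen to be consistent with the Sunny/Rainy, Happy/Sad story above:

```python
from itertools import product

states = ["Sunny", "Rainy"]
start = {"Sunny": 0.5, "Rainy": 0.5}                 # hypothetical stationary guess
trans = {"Sunny": {"Sunny": 0.8, "Rainy": 0.2},      # hypothetical transitions
         "Rainy": {"Sunny": 0.4, "Rainy": 0.6}}
emit = {"Sunny": {"Happy": 0.9, "Sad": 0.1},         # hypothetical emissions
        "Rainy": {"Happy": 0.2, "Sad": 0.8}}

obs = ["Sad", "Happy", "Sad"]                        # observed moods

best_prob, best_path = 0.0, None
for path in product(states, repeat=len(obs)):        # all N**T combinations
    p = start[path[0]] * emit[path[0]][obs[0]]
    for t in range(1, len(obs)):
        p *= trans[path[t - 1]][path[t]] * emit[path[t]][obs[t]]
    if p > best_prob:
        best_prob, best_path = p, path

print(best_path, best_prob)   # most likely weather sequence for the moods
```

In practice, the Viterbi algorithm finds the same path with dynamic programming in O(N²T) time instead of enumerating all N^T combinations.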