ML Unit III

1. Bayes theorem is a method for calculating conditional probabilities and updating beliefs based on new evidence. It is widely used in machine learning for tasks like classification.
2. Bayes theorem provides a way to calculate the posterior probability of a hypothesis given observed data using the prior probability, likelihood, and evidence.
3. For concept learning problems, Bayes theorem can be used to determine the hypothesis with the highest posterior probability given the training data. The hypothesis that is consistent with the training data will have the highest posterior probability.


Bayes Theorem in Machine learning

Machine Learning is one of the fastest-emerging technologies in Artificial Intelligence. We live in the 21st century, which is driven by new technologies and gadgets, some of which are yet to be used and few of which have reached their full potential. Similarly, Machine Learning is a technology that is still in its developing phase. Many concepts make machine learning a better technology, such as supervised learning, unsupervised learning, reinforcement learning, perceptron models, neural networks, etc.

Introduction to Bayes Theorem in Machine Learning


Bayes theorem was given by the English statistician, philosopher, and Presbyterian minister Thomas Bayes in the 18th century. Bayes contributed to decision theory, which is used extensively in important areas of mathematics such as probability. Bayes theorem is also widely used in Machine Learning, where we need to predict classes precisely and accurately. An important concept derived from Bayes theorem, the Bayesian method, is used to calculate conditional probability in Machine Learning applications that include classification tasks. Further, a simplified version of Bayes theorem (Naïve Bayes classification) is also used to reduce computation time and the average cost of projects.

Bayes theorem is also known by other names, such as Bayes rule or Bayes law. Bayes theorem helps to determine the probability of an event under uncertain knowledge. It is used to calculate the probability of one event occurring when another event has already occurred, and it is the standard way to relate conditional probability and marginal probability.

In simple words, we can say that Bayes theorem helps us produce more accurate results.

Bayes Theorem is used to estimate the precision of values and provides a method for calculating conditional probability. Although it is a deceptively simple calculation, it lets us easily calculate the conditional probability of events where intuition often fails. Some data scientists assume that Bayes theorem is used mostly in the financial industry, but that is not the case. Besides finance, Bayes theorem is also extensively applied in health and medicine, research and survey work, the aeronautical sector, etc.

What is Bayes Theorem?


Bayes theorem is one of the most popular machine learning concepts; it helps to calculate the probability of one event occurring, under uncertain knowledge, given that another event has already occurred.

Bayes' theorem can be derived using the product rule and the conditional probability of event X with known event Y:
o According to the product rule, we can express the probability of event X with known event Y as follows:

1. P(X ∩ Y) = P(X|Y) P(Y)   {equation 1}


o Further, the probability of event Y with known event X:

1. P(X ∩ Y) = P(Y|X) P(X)   {equation 2}

Mathematically, Bayes theorem can be obtained by equating the right-hand sides of both equations (since both give P(X ∩ Y)). We get:

P(X|Y) = P(Y|X) P(X) / P(Y)

Here, X and Y are two events and P(Y) must be non-zero; the formula expresses the probability of X given Y in terms of the probability of Y given X.

The above equation is called Bayes Rule or Bayes Theorem.

o P(X|Y) is called the posterior, which is what we need to calculate. It is defined as the updated probability after considering the evidence.
o P(Y|X) is called the likelihood. It is the probability of the evidence given that the hypothesis is true.
o P(X) is called the prior probability, the probability of the hypothesis before considering the evidence.
o P(Y) is called the marginal probability. It is defined as the probability of the evidence irrespective of the hypothesis.

Hence, Bayes Theorem can be written as:

posterior = likelihood * prior / evidence
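As a small illustration of this rule, the sketch below plugs in made-up numbers for the prior, likelihood, and evidence; the values are assumptions purely for demonstration.

# Bayes rule: posterior = likelihood * prior / evidence
# The numbers below are illustrative only.

def posterior(likelihood, prior, evidence):
    """Compute P(X|Y) from P(Y|X), P(X) and P(Y)."""
    return likelihood * prior / evidence

p_y_given_x = 0.9   # likelihood P(Y|X), assumed
p_x = 0.2           # prior P(X), assumed
p_y = 0.3           # evidence (marginal) P(Y), assumed

print(posterior(p_y_given_x, p_x, p_y))   # ~0.6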

The Bayes theorem is a method for calculating a hypothesis's probability based on its prior probability, the probabilities of observing specific data given the hypothesis, and the observed data itself.
Example:
Bayes theorem calculates the probability of each possible hypothesis and outputs the most probable one.
Consider a medical diagnosis problem with two different hypotheses:
(1) the patient has a certain type of cancer
(2) the patient does not have cancer.
The information is based on a laboratory test with two possible outcomes: + (positive) and − (negative).
We already know that only 0.008 of the population (0.8%) has this disease. Furthermore, the laboratory test is merely an imperfect indicator of the disease.
The test gives a correct positive result in only 98 percent of the cases in which the disease is actually present, and a correct negative result in only 97 percent of the cases in which the disease is not present. In the other cases, the test returns the opposite result.

That is,

P(cancer) = 0.008, P(¬cancer) = 0.992
P(+|cancer) = 0.98, P(−|cancer) = 0.02
P(+|¬cancer) = 0.03, P(−|¬cancer) = 0.97

Suppose we come across a new patient for whom the lab test returns a positive result. Should we diagnose the patient as having cancer or not? The maximum a posteriori (MAP) hypothesis is found by comparing P(+|cancer)P(cancer) with P(+|¬cancer)P(¬cancer).
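A minimal sketch of the MAP comparison for this example, using the probabilities stated above (the variable names are ours):

# MAP hypothesis for the cancer-test example above.
p_cancer = 0.008
p_not_cancer = 1 - p_cancer           # 0.992
p_pos_given_cancer = 0.98
p_pos_given_not_cancer = 0.03         # 1 - 0.97

# Unnormalized posteriors P(+|h) * P(h)
score_cancer = p_pos_given_cancer * p_cancer              # ~0.00784
score_not_cancer = p_pos_given_not_cancer * p_not_cancer  # ~0.02976

h_map = "cancer" if score_cancer > score_not_cancer else "no cancer"
print(h_map)                                              # no cancer
print(score_cancer / (score_cancer + score_not_cancer))   # ~0.21 = P(cancer | +)

The positive test raises the probability of cancer from 0.008 to about 0.21, but the MAP hypothesis is still that the patient does not have cancer.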
Let's have a look at the problem of probability density estimation.

Given a sample of observations X = (x1, x2, x3, …, xn) from a domain, each observation is drawn independently from the domain with the same probability distribution (so-called independent and identically distributed, i.i.d., or close to it).

Density estimation entails choosing a probability distribution function and its parameters that best explain the joint probability distribution of the observed data X.

There are several approaches to tackling this problem, but two of the most
common are:

A Bayesian approach called Maximum a Posteriori (MAP) estimation.

A frequentist technique called Maximum Likelihood Estimation (MLE).
To find the maximum likelihood estimate in Bayesian learning, we maximize the likelihood of the data over the parameters:

θ_MLE = argmax over θ of p(D | θ)

where the small p indicates a probability density function. Maximum likelihood estimation (MLE) is a statistical technique for estimating the parameters of a probability distribution based on observed data. This is accomplished by maximizing a likelihood function such that the observed data is most probable under the assumed statistical model. The maximum likelihood estimate is the point in the parameter space that maximizes the likelihood function.
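As a concrete sketch of MLE, the maximum likelihood estimates of a Gaussian's mean and variance are the sample mean and the biased (divide-by-n) sample variance; the observations below are invented for illustration.

import numpy as np

# MLE for a Gaussian: the likelihood is maximized by the sample mean
# and the biased (divide-by-n) sample variance.
x = np.array([4.9, 5.1, 5.0, 4.8, 5.2])   # made-up observations

mu_mle = x.mean()
var_mle = ((x - mu_mle) ** 2).mean()       # note: /n, not /(n-1)

print(mu_mle, var_mle)   # ~5.0  ~0.02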
Bayes Theorem and Concept Learning:

The steps of the brute-force MAP learning algorithm are:

1. Given the training data, use Bayes theorem to calculate the posterior probability of each hypothesis; that is, compute the probability of each candidate hypothesis and determine which is the most probable.

2. Output the hypothesis hMAP with the highest posterior probability.

To perform this calculation, we need to know the values of P(h) and P(D/h). We choose these to be consistent with the following assumptions:

1. There is no noise in the training data D (i.e., di = c(xi)).

2. The hypothesis space H contains the target concept c.
3. We have no prior reason to believe that any one hypothesis is more probable than any other.

 Since we are assuming the training data to be noise-free, the probability of observing classification di given h is 1 if di = h(xi) and 0 if di ≠ h(xi). Therefore,

P(D/h) = 1 if h is consistent with D, and P(D/h) = 0 otherwise.

To put it another way, the probability of data D given hypothesis h is 1 if D agrees with h and 0 otherwise.

 Given no prior information about which hypothesis is more likely, it is fair to give each hypothesis h in H the same prior probability. Because we presume the target concept is contained in H, these prior probabilities must sum to 1, so P(h) = 1/|H| for every h in H.

Now, let’s consider two cases:

Case 1: h is inconsistent with the training data D.

Here, since we know that P(D/h) = 0 when h is inconsistent with D, we have

P(h/D) = (0 · P(h)) / P(D) = 0

That is, the posterior probability of a hypothesis inconsistent with D is zero.

Case 2: Consider a case where h is consistent with D. Since P(D/h) = 1 for h consistent with D, we have

P(h/D) = (1 · (1/|H|)) / P(D) = (1/|H|) / (| VSH,D | / |H|) = 1 / | VSH,D |

where VSH,D (the version space) is the subset of hypotheses from H that are consistent with D.

It is easy to verify that P(D) = | VSH,D | / | H | above, because the sum of P(h/D) over all hypotheses must be one and because the number of hypotheses from H consistent with D is, by definition, | VSH,D |. Alternatively, we can derive P(D) from the theorem of total probability and the fact that the hypotheses are mutually exclusive.

To summarize, Bayes theorem implies that under the assumed P(h) and P(D/h) the posterior probability P(h/D) is

P(h/D) = 1 / | VSH,D | if h is consistent with D, and P(h/D) = 0 otherwise,

where | VSH,D | is the number of hypotheses from H consistent with D.
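A toy sketch of this result, assuming a small invented hypothesis space of threshold functions and a noise-free data set: every hypothesis gets the uniform prior 1/|H|, and every hypothesis consistent with the data ends up with posterior 1/|VSH,D|.

# Brute-force Bayesian concept learning on a toy hypothesis space.
# Hypotheses are threshold functions h_t(x) = (x >= t); data are noise-free (invented).
data = [(1, False), (3, False), (6, True), (8, True)]   # (x, label)

hypotheses = {f"h_t={t}": (lambda x, t=t: x >= t) for t in range(0, 10)}
prior = 1.0 / len(hypotheses)                           # uniform P(h) = 1/|H|

# P(D|h) is 1 if h is consistent with every example, else 0.
consistent = [name for name, h in hypotheses.items()
              if all(h(x) == y for x, y in data)]

posterior = {name: (1.0 / len(consistent) if name in consistent else 0.0)
             for name in hypotheses}

print(consistent)            # thresholds 4, 5 and 6 are consistent with the data
print(posterior["h_t=5"])    # 1/3, i.e. 1 / |VS_H,D|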

The least squares and the maximum likelihood estimation methods

Two commonly used approaches to estimate population parameters from a random sample are the maximum likelihood estimation method (the default) and the least squares estimation method.
Maximum likelihood estimation method (MLE)
The likelihood function indicates how likely the observed sample is
as a function of possible parameter values. Therefore, maximizing
the likelihood function determines the parameters that are most
likely to produce the observed data. From a statistical point of view,
MLE is usually recommended for large samples because it is
versatile, applicable to most models and different types of data, and
produces the most precise estimates.
Least squares estimation method (LSE)
Least squares estimates are calculated by fitting the regression line that minimizes the sum of squared deviations from the data points (the least squares error). In reliability analysis, the line and the
data are plotted on a probability plot.
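A minimal least squares sketch with NumPy, fitting a straight line to invented points by minimizing the sum of squared residuals:

import numpy as np

# Least squares: fit y = a*x + b by minimizing the sum of squared residuals.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])     # invented data, roughly y = 2x

A = np.column_stack([x, np.ones_like(x)])   # design matrix [x, 1]
(a, b), residuals, _, _ = np.linalg.lstsq(A, y, rcond=None)

print(a, b)   # slope close to 2, intercept close to 0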

Why is MLE the default method?

For large, complete data sets, both the LSE method and the MLE method
provide consistent results. In reliability applications, data sets are
typically small or moderate in size. Extensive simulation studies show
that in small sample designs where there are only a few failures, the MLE
method is better than the LSE method.1 Thus, the default estimation
method in Minitab is MLE.

The advantages of the MLE method over the LSE method are as follows:

 The distribution parameter estimates are more precise.
 The estimated variance is smaller.
 Confidence intervals and tests for model parameters can be reliably calculated.
 The calculations use more of the information in the data.

When there are only a few failures because the data are heavily censored,
the MLE method uses the information in the entire data set, including the
censored values. The LSE method ignores the information in the censored
observations.1

Usually, the advantages of the MLE method outweigh the advantages of


the LSE method. The LSE method is easier to calculate by hand and
easier to program. The LSE method is also traditionally associated with
the use of probability plots to assess goodness-of-fit. However, the LSE
method can provide misleading results on a probability plot. Examples
exist where the points on a Weibull probability plot that uses the LSE
method fall along a line when the Weibull model is actually
inappropriate.1

MLE method with common shape or scale parameters


For the maximum likelihood method, Minitab uses the log likelihood
function. In this case, the log likelihood function of the model is the sum
of the individual log likelihood functions, with the same shape parameter
assumed in each individual log likelihood function. The resulting overall
log likelihood function is maximized to obtain the scale parameters
associated with each group and the common shape parameter.

LSE method with common shape or scale parameters


Minitab first calculates the y-coordinate and x-coordinate for each group.
Then, to obtain the LSE estimates, Minitab performs the following steps:

 Pools the x-coordinate data.
 Pools the y-coordinate data.
 Uses an indicator variable (or By variable) to identify the groups.
 Regresses the x-coordinates (response) against the predictors defined by all the y-coordinates (continuous predictor) and the indicator variable (categorical predictor).

Minimum Description Length (MDL)

The Minimum Description Length (MDL) principle is a model selection principle according to which the shortest description of the data is the best model. MDL methods learn through a data-compression perspective and are sometimes described as mathematical applications of Occam's razor. The MDL principle can be extended to other forms of inductive inference and learning, for example to estimation and sequential prediction, without explicitly identifying a single model of the data.
MDL has its origins mostly in information theory and has been further
developed within the general fields of statistics, theoretical computer
science and machine learning, and more narrowly computational learning
theory.
Historically, there are different, yet interrelated, usages of the definite
noun phrase "the minimum description length principle" that vary in what
is meant by description:

 Within Jorma Rissanen's theory of learning, a central concept of


information theory, models are statistical hypotheses and descriptions
are defined as universal codes.
 Rissanen's 1978[1] pragmatic first attempt to automatically derive short
descriptions, relates to the Bayesian Information Criterion (BIC).
 Within Algorithmic Information Theory, where the description length
of a data sequence is the length of the smallest program that outputs
that data set
Selecting the minimum length description of the available data as the best
model observes the principle identified as Occam's razor. Prior to the
advent of computer programming, generating such descriptions was the
intellectual labor of scientific theorists. It was far less formal than it has
become in the computer age. If two scientists had a theoretic disagreement,
they rarely could formally apply Occam's razor to choose between their
theories. They would have different data sets and possibly different
descriptive languages. Nevertheless, science advanced as Occam's razor
was an informal guide in deciding which model was best.
With the advent of formal languages and computer programming Occam's
razor was mathematically defined. Models of a given set of observations,
encoded as bits of data, could be created in the form of computer programs
that output that data. Occam's razor could then formally select the shortest
program, measured in bits of this algorithmic information, as the best
model.
To avoid confusion, note that there is nothing in the MDL principle that
implies a machine produced the program embodying the model. It can be
entirely the product of humans. The MDL principle applies regardless of
whether the description to be run on a computer is the product of humans,
machines or any combination thereof. The MDL principle
requires only that the shortest description, when executed, produce the
original data set without error.
MDL applies in machine learning when algorithms (machines) generate
descriptions. Learning occurs when an algorithm generates a shorter
description of the same data set.
The theoretic minimum description length of a data set, called
its Kolmogorov complexity, cannot, however, be computed. That is to say,
even if by random chance an algorithm generates the shortest program of
all that outputs the data set, an automated theorem prover cannot prove
there is no shorter such program. Nevertheless, given two programs that
output the dataset, the MDL principle selects the shorter of the two as
embodying the best model.
Example of Statistical MDL Learning
A coin is flipped 1000 times, and the numbers of heads and tails are
recorded. Consider two model classes:
The first is a code that represents outcomes with a 0 for heads or a 1 for
tails. This code represents the hypothesis that the coin is fair. The code
length according to this code is always exactly 1000 bits.
The second consists of all codes that are efficient for a coin with some
specific bias, representing the hypothesis that the coin is not fair. Say that
we observe 510 heads and 490 tails. Then the code length according to the
best code in the second model class is shorter than 1000 bits.
For this reason, a naive statistical method might choose the second model
as a better explanation for the data. However, an MDL approach would
construct a single code based on the hypothesis, instead of just using the
best one. This code could be the normalized maximum likelihood code or a
Bayesian code. If such a code is used, then the total codelength based on
the second model class would be larger than 1000 bits. Therefore, the
conclusion when following an MDL approach is inevitably that there is not
enough evidence to support the hypothesis of the biased coin, even though
the best element of the second model class provides better fit to the data.
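The comparison in this example can be made concrete. Under the fair-coin code each flip costs exactly 1 bit; under the best biased-coin code the data cost is roughly 1000·H(0.51) bits, but a two-part MDL code must also pay to describe the bias. The sketch below is illustrative, and the 0.5·log2(n) parameter cost is one common rough choice, not the only one.

import math

n, heads = 1000, 510
tails = n - heads
p = heads / n

# Code length under the fair-coin model: 1 bit per flip.
fair_bits = n * 1.0

# Code length of the data under the best coin with bias p (Shannon entropy bound),
# ignoring for the moment the cost of describing p itself.
biased_data_bits = -(heads * math.log2(p) + tails * math.log2(1 - p))

# A crude two-part MDL code also pays to describe the bias, e.g. ~0.5*log2(n) bits.
param_bits = 0.5 * math.log2(n)

print(fair_bits)                       # 1000.0
print(biased_data_bits)                # ~999.7
print(biased_data_bits + param_bits)   # ~1004.7, worse than the fair-coin model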
BAYES OPTIMAL CLASSIFIER
The Bayes Optimal Classifier is a probabilistic model that predicts the
most likely outcome for a new situation. The Bayes theorem is a method
for calculating a hypothesis’s probability based on its prior probability, the
probabilities of observing specific data given the hypothesis, and the seen
data itself. The Bayes Theorem, which provides a systematic means of
computing a conditional probability, is used to describe it. It’s also related
to Maximum a Posteriori (MAP), a probabilistic framework for
determining the most likely hypothesis for a training dataset.
Take a hypothesis space that has 3 hypotheses h1, h2, and h3.
The posterior probabilities of the hypotheses are as follows:
h1 -> 0.4
h2 -> 0.3
h3 -> 0.3
Hence, h1 is the MAP hypothesis. (MAP => max posterior)
Suppose a new instance x is encountered, which is classified negative by
h2 and h3 but positive by h1.
Taking all hypotheses into account, the probability that x is positive is .4
and the probability that it is negative is therefore .6.
The classification generated by the MAP hypothesis is different from the
most probable classification in this case which is negative.
If the new example's classification can be any value vj from a set V, then the probability P(vj/D) that the correct classification for the new instance is vj is simply

P(vj/D) = Σ over hi in H of P(vj/hi) P(hi/D)

The denominator is omitted since we’re only using this for comparison and
all the values of P(vj/D) will have the same denominator.

The value vj for which P(vj/D) is maximum is the best classification for the new instance.
A Bayes optimal classifier is a system that classifies new instances according to this equation. This strategy maximizes the probability that the new instance will be classified correctly.

Consider an example of Bayes optimal classification. Let there be 5 hypotheses h1 through h5, where F, L, and R denote three possible classifications (go forward, turn left, turn right):

P(hi/D)   P(F/hi)   P(L/hi)   P(R/hi)
0.4       1         0         0
0.2       0         1         0
0.1       0         0         1
0.1       0         1         0
0.2       0         1         0
The MAP hypothesis (h1, with posterior 0.4) therefore argues that the robot should proceed forward (F). Let's see what the Bayes optimal procedure suggests:

P(F/D) = 0.4, P(L/D) = 0.2 + 0.1 + 0.2 = 0.5, P(R/D) = 0.1

Thus, the Bayes optimal procedure recommends that the robot turn left.
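A short sketch of this calculation using the table above:

# Bayes optimal classification for the robot example above.
# Each row of the table: (P(hi|D), P(F|hi), P(L|hi), P(R|hi)).
table = [
    (0.4, 1, 0, 0),
    (0.2, 0, 1, 0),
    (0.1, 0, 0, 1),
    (0.1, 0, 1, 0),
    (0.2, 0, 1, 0),
]

labels = ["F", "L", "R"]
scores = {}
for k, label in enumerate(labels):
    # P(v|D) = sum over hypotheses of P(v|hi) * P(hi|D)
    scores[label] = sum(row[0] * row[k + 1] for row in table)

print(scores)                       # approximately {'F': 0.4, 'L': 0.5, 'R': 0.1}
print(max(scores, key=scores.get))  # 'L' -> turn left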

Gibbs algorithm
In statistical mechanics, the Gibbs algorithm, introduced by J. Willard
Gibbs in 1902, is a criterion for choosing a probability distribution for
the statistical ensemble of microstates of a thermodynamic system by
minimizing the average log probability
⟨ln p_i⟩ = Σ_i p_i ln p_i
subject to the probability distribution pi satisfying a set of constraints
(usually expectation values) corresponding to the
known macroscopic quantities.[1] In 1948, Claude Shannon interpreted the
negative of this quantity, which he called information entropy, as a
measure of the uncertainty in a probability distribution.[1] In 1957, E.T.
Jaynes realized that this quantity could be interpreted as missing
information about anything, and generalized the Gibbs algorithm to non-
equilibrium systems with the principle of maximum entropy and maximum
entropy thermodynamics.[1]
Physicists call the result of applying the Gibbs algorithm the Gibbs
distribution for the given constraints, most notably Gibbs's grand canonical
ensemble for open systems when the average energy and the average
number of particles are given.
This general result of the Gibbs algorithm is then a maximum entropy
probability distribution. Statisticians identify such distributions as
belonging to exponential families.
Naive Bayes Classifier

The Naive Bayes classifiers are a set of classification algorithms built on Bayes' theorem under the assumption that each pair of features being classified is independent of the others. Naive Bayes Classifier is the name for the group of algorithms that all work on this principle.

The naive Bayes classifier is useful for learning tasks in which each
instance x is represented by a set of attribute values and the target function
f(x) can take any value from a finite set V.

A set of training examples of the target function is provided, as well as a new instance specified by the tuple of attribute values (a1, a2, …, an).

The learner is asked to predict the target value for this new instance. The Bayesian approach is to assign the most probable target value, vMAP:

vMAP = argmax over vj in V of P(vj / a1, a2, …, an) = argmax over vj in V of P(a1, a2, …, an / vj) P(vj)

Under the naive assumption that the attribute values are conditionally independent given the target value, this becomes

vNB = argmax over vj in V of P(vj) Π over i of P(ai / vj)

where vNB stands for the Naive Bayes classifier's target value. To estimate each P(vj), simply count the number of times each target value vj appears in the training data.

The naive Bayes learning approach involves a learning step in which the various P(vj) and P(ai/vj) terms are estimated from their frequencies over the training data. The learned hypothesis is represented by the set of these estimates.
The basic Naive Bayes assumption is that each feature makes an independent and equal contribution to the outcome. Let's understand the Naive Bayes classifier better with the help of an example.

Let’s use the naive Bayes classifier to solve a problem we discussed during
our decision tree learning discussion: classifying days based on whether or
not someone will play tennis.

Table 3.2 (the standard PlayTennis training set) shows 14 training instances of the target concept PlayTennis, with the attributes Outlook, Temperature, Humidity, and Wind describing each day. To classify a novel instance, we use the naive Bayes classifier and the training data from this table.
First, the probabilities of the various target values are easily determined from their frequencies over the 14 training instances.

We can also estimate conditional probabilities in the same way; those for Wind = strong, for example, are obtained from the fraction of strong-wind days among the positive and the negative training examples.

Based on the probability estimates learned from the training data, the naive Bayes classifier assigns the target value PlayTennis = no to this new instance. Furthermore, given the observed attribute values, we can determine the conditional probability that the target value is no by normalizing the above quantities to sum to one.

For the current example, this probability is computed in the sketch below.
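A hedged sketch of the full computation, assuming the standard 14-example PlayTennis table from Mitchell and the usual new instance Outlook = sunny, Temperature = cool, Humidity = high, Wind = strong (the counts below are taken from that table and are assumptions here, since the table itself is not reproduced above):

# Naive Bayes on the PlayTennis data (counts assume Mitchell's standard table).
p_yes, p_no = 9 / 14, 5 / 14

# Conditional probabilities for the assumed new instance
# (sunny, cool, high humidity, strong wind), estimated from that table.
cond_yes = (2 / 9) * (3 / 9) * (3 / 9) * (3 / 9)
cond_no = (3 / 5) * (1 / 5) * (4 / 5) * (3 / 5)

score_yes = p_yes * cond_yes     # ~0.0053
score_no = p_no * cond_no        # ~0.0206

print("no" if score_no > score_yes else "yes")
print(score_no / (score_yes + score_no))   # ~0.795, normalized P(no | instance)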


Text Classification :
Text classification is a machine learning technique that assigns a set of
predefined categories to open-ended text. Text classifiers can be used to
organize, structure, and categorize pretty much any kind of text – from
documents, medical studies and files, and all over the web.

For example, news articles can be organized by topic; support tickets can be organized by urgency; chat conversations can be organized by language; brand mentions can be organized by sentiment; and so on.

Text classification is one of the fundamental tasks in natural language


processing with broad applications such as sentiment analysis, topic labeling,
spam detection, and intent detection.

Here’s an example of how it works:

“The user interface is quite straightforward and easy to use.”

A text classifier can take this phrase as an input, analyze its content, and then
automatically assign relevant tags, such as UI and Easy To Use.

Why is Text Classification Important?


It’s estimated that around 80% of all information is unstructured, with text
being one of the most common types of unstructured data. Because of the messy
nature of text, analyzing, understanding, organizing, and sorting through text
data is hard and time-consuming, so most companies fail to use it to its full
potential.

This is where text classification with machine learning comes in. Using text
classifiers, companies can automatically structure all manner of relevant text,
from emails, legal documents, social media, chatbots, surveys, and more in a
fast and cost-effective way. This allows companies to save time analyzing text
data, automate business processes, and make data-driven business decisions.
Once it’s trained with enough training samples, the machine learning model can
begin to make accurate predictions. The same feature extractor is used to
transform unseen text to feature sets, which can be fed into the classification
model to get predictions on tags (e.g., sports, politics):

Text classification with machine learning is usually much more accurate than
human-crafted rule systems, especially on complex NLP classification tasks.
Also, classifiers with machine learning are easier to maintain and you can
always tag new examples to learn new tasks.
Machine Learning Text Classification Algorithms
Some of the most popular text classification algorithms include the Naive Bayes
family of algorithms, support vector machines (SVM), and deep learning.

Why use machine learning text classification? Some of the top reasons:
 Scalability
Manually analyzing and organizing text is slow and much less accurate. Machine learning can automatically analyze millions of surveys, comments, emails, etc., at a fraction of the cost, often in just a few minutes. Text classification tools are scalable to any business need, large or small.
 Real-time analysis

There are critical situations that companies need to identify as soon as possible
and take immediate action (e.g., PR crises on social media). Machine learning
text classification can follow your brand mentions constantly and in real time,
so you'll identify critical information and be able to take action right away.

 Consistent criteria

Human annotators make mistakes when classifying text data due to distractions,
fatigue, and boredom, and human subjectivity creates inconsistent criteria.
Machine learning, on the other hand, applies the same lens and criteria to all
data and results. Once a text classification model is properly trained it performs
with unsurpassed accuracy.

You can perform text classification in two ways: manual or automatic.

Manual text classification involves a human annotator, who interprets the


content of text and categorizes it accordingly. This method can deliver good
results but it’s time-consuming and expensive.

Automatic text classification applies machine learning, natural language


processing (NLP), and other AI-guided techniques to automatically classify text
in a faster, more cost-effective, and more accurate manner.

In this guide, we’re going to focus on automatic text classification.

There are many approaches to automatic text classification, but they all fall
under three types of systems:

 Rule-based systems
 Machine learning-based systems
 Hybrid systems
Rule-based systems

Rule-based approaches classify text into organized groups by using a set of


handcrafted linguistic rules. These rules instruct the system to use semantically
relevant elements of a text to identify relevant categories based on its content.
Each rule consists of an antecedent or pattern and a predicted category.

Say that you want to classify news articles into two groups: Sports and Politics.
First, you’ll need to define two lists of words that characterize each group (e.g.,
words related to sports such as football, basketball, LeBron James, etc., and
words related to politics, such as Donald Trump, Hillary Clinton, Putin, etc.).

Next, when you want to classify a new incoming text, you’ll need to count the
number of sport-related words that appear in the text and do the same for
politics-related words. If the number of sports-related word appearances is
greater than the politics-related word count, then the text is classified as Sports
and vice versa.

For example, this rule-based system will classify the headline “When is LeBron
James' first game with the Lakers?” as Sports because it counted one sports-
related term (LeBron James) and it didn’t count any politics-related terms.

Rule-based systems are human comprehensible and can be improved over time.
But this approach has some disadvantages. For starters, these systems require
deep knowledge of the domain. They are also time-consuming, since generating
rules for a complex system can be quite challenging and usually requires a lot of
analysis and testing. Rule-based systems are also difficult to maintain and don’t
scale well given that adding new rules can affect the results of the pre-existing
rules.

Machine learning based systems

Instead of relying on manually crafted rules, machine learning text classification


learns to make classifications based on past observations. By using pre-labeled
examples as training data, machine learning algorithms can learn the different
associations between pieces of text, and that a particular output (i.e., tags) is
expected for a particular input (i.e., text). A “tag” is the pre-determined
classification or category that any given text could fall into.

The first step towards training a machine learning NLP classifier is feature
extraction: a method is used to transform each text into a numerical
representation in the form of a vector. One of the most frequently used
approaches is bag of words, where a vector represents the frequency of a word
in a predefined dictionary of words.
For example, if we have defined our dictionary to have the following words
{This, is, the, not, awesome, bad, basketball}, and we wanted to vectorize the
text “This is awesome,” we would have the following vector representation of
that text: (1, 1, 0, 0, 1, 0, 0).
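A minimal sketch of this vectorization step, using the dictionary and sentence from the example above (this variant records word presence as 0/1, which reproduces the vector shown):

# Bag-of-words vectorization, as in the example above.
dictionary = ["This", "is", "the", "not", "awesome", "bad", "basketball"]

def vectorize(text):
    tokens = text.split()
    return tuple(1 if word in tokens else 0 for word in dictionary)

print(vectorize("This is awesome"))   # (1, 1, 0, 0, 1, 0, 0)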
Then, the machine learning algorithm is fed with training data that consists of
pairs of feature sets (vectors for each text example) and tags
(e.g. sports, politics) to produce a classification model:
Hybrid Systems

Hybrid systems combine a machine learning-trained base classifier with a rule-


based system, used to further improve the results. These hybrid systems can be
easily fine-tuned by adding specific rules for those conflicting tags that haven’t
been correctly modeled by the base classifier.

Bayesian Belief Network

A Bayesian belief network is a useful way to represent probabilistic models and visualize them. Before we get into Bayesian networks, let us understand probabilistic models.

Probabilistic models determine the relationships between variables, which you can then use to calculate the probabilities of different combinations of their values. A Bayesian network is also called a Probabilistic Graphical Model (PGM).

For example, fully general conditional models need a massive amount of information and data to calculate all possible outcomes, and putting all those possibilities to experiment is difficult. Simplifying these probabilities over all the random variables proves to be effective.

Bayesian networks are visual probabilistic models that depict the conditional dependence of different variables in a graph. The gaps (missing edges) describe the conditional independencies in the graph. It is a powerful tool to visualize probabilities and to understand and analyze the relationships between random variables and the possibilities for different situations.

How to develop and use a Bayesian Belief Network

To build a Bayesian network, you have to ask yourself three questions:

 Variables: what are the random variables in my project?
 Conditional dependencies: what are the relationships between the variables, and are they independent or dependent?
 Probability distributions: how is the probability of each variable distributed in my project?

An expert can answer all these questions for you and even suggest a design for the Bayesian belief network model. Usually, experts define the architecture of such models, but you have to determine the probability distributions from the given data. The probability distributions and the graph structure can also be estimated from the data, but that is a complicated process.

You can use algorithms to estimate the graph and its parameters; for example, assume a Gaussian distribution for continuous random variables and estimate the distribution parameters from the data.

After the Bayesian Belief Network is ready for any domain, you can use it for
logical reasoning like getting answers to situational problems and making
decisions.

The reasoning is accomplished by interpretation done by the model for a given


problem or situation. For example, if the outcome for some events is known,
then the model automatically calculates all the probabilities of causes for the
events and other possible outcomes.

Python Example of Belief Network

Bayesian networks are commonly built and interpreted using the Python programming language.

You can refer to PyMC, a large library that provides a wide range of tools to build Bayesian networks and other graphical models. The current version of this library for Python 3 is PyMC3. It was created on top of the Theano mathematical computation library, which provides automatic differentiation.

Let us take three random variables: A, B, and C. A depends on B, and C depends on B.

We can define the conditional dependencies as:

 A is conditionally dependent upon B, as P(A|B)
 C is conditionally dependent upon B, as P(C|B)

We also know that C and A are conditionally independent of each other given B.

We can also define the conditional independencies as:

 A is conditionally independent of C given B: P(A|B, C) = P(A|B)
 C is conditionally independent of A given B: P(C|B, A) = P(C|B)

Notice that the independencies are stated in the presence of the conditioning variable: A is conditionally independent of C, but it remains conditionally dependent on B in the presence of C.

We can also express the independence of A from C, given B, by saying that A is unaffected by C and can be calculated from A given B alone:

P(A|C, B) = P(A|B)

You will see that B is not affected by A or C and has no parents, so its probability is simply the unconditional probability P(B).

We can also write the joint probability of A and C given B, for example:

P(A, C | B) = P(A|B) * P(C|B)

The model captures the joint probability P(A, B, C), which factorizes as:

P(A, B, C) = P(A|B) * P(C|B) * P(B)
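A small sketch of this factorization for binary variables; all of the probability tables below are invented for illustration and are not part of the example above.

# Joint probability P(A, B, C) = P(A|B) * P(C|B) * P(B) for binary variables.
# All probability tables below are invented for illustration.
p_b = {True: 0.6, False: 0.4}
p_a_given_b = {True: 0.7, False: 0.2}     # P(A=True | B)
p_c_given_b = {True: 0.1, False: 0.5}     # P(C=True | B)

def joint(a, b, c):
    pa = p_a_given_b[b] if a else 1 - p_a_given_b[b]
    pc = p_c_given_b[b] if c else 1 - p_c_given_b[b]
    return pa * pc * p_b[b]

# Sanity check: the joint distribution sums to 1 over all 8 assignments.
total = sum(joint(a, b, c) for a in (True, False)
            for b in (True, False) for c in (True, False))
print(joint(True, True, False))   # ~0.378 = 0.7 * 0.9 * 0.6
print(total)                      # ~1.0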

We can draw the graph as follows: each random variable is given a node, and the conditional relationships are drawn as directed edges between the nodes. The graph cannot be navigated in a cycle; loops are impossible when moving from one node to another along the edges.

The graph is useful even when you don't yet know the probability distributions of the variables.

How can we use Bayesian Networks in SNA?

In SNA, you try to decode and understand the structure of a social network. You can also assess the significance of individual nodes, but you may not know the outcome of the network's decision. This is where the Bayesian belief network comes into the picture; for example, you might model the significance of a node as being determined by its degree centrality and link centrality. This gives a basic graph for understanding how a Bayesian network is applied in Social Network Analysis.

Another example concerns friend groups: a social network contains groups with members, and some of those members may also be friends with one another. So there will be two kinds of nodes, friends in the group and members in the group, connected through a common group.

This use of Bayesian networks is quite new, and many scientists and experts are researching it.

There are many Bayesian belief network advantages and disadvantages. They
are listed below:

Advantages
There are a few advantages of Bayesian belief networks as it visualizes different
probabilities of the variables. Some of them are:

 Graphical and visual networks provide a model to visualize the structure of the
probabilities and develop designs for new models as well.
 Relationships determine the type of relationship and the presence or absence of
it between variables.
 Computations calculate complex probability problems efficiently.
 Bayesian networks can show whether a particular feature is taken into account in the decision-making process, and you can force the network to include that feature if necessary. This ensures that all known features are investigated when deciding on a problem.
 Bayesian Networks are more extensible than other networks and learning
methods. Adding a new piece in the network requires only a few probabilities
and a few edges in the graph. So, it is an excellent network for adding a new
piece of data to an existing probabilistic model.
 The graph of a Bayesian Network is useful. It is readable to both computers and
humans; both can interpret the information, unlike some networks like neural
networks, which humans can’t read.
Disadvantages
 The most significant disadvantage is that there is no universally acknowledged
method for constructing networks from data. There have been many
developments in this regard, but there hasn’t been a conqueror in a long time.
 The design of Bayesian Networks is hard to make compared to other networks.
It needs a lot of effort. Hence, only the person creating the network can exploit
causal influences. Neural networks are an advantage compared to this, as they
learn different patterns and aren’t limited to only the creator.
 The Bayesian network fails to define cyclic relationships—for example,
deflection of airplane wings and fluid pressure field around it. The deflection
depends on the pressure, and the pressure is dependent on the deflection. It is a
tightly coupled problem which this network fails to define and make decisions.
 The network is expensive to build.
 It performs poorly on high dimensional data.
 It is tough to interpret and requires copula functions to separate effects and causes.

Conclusion

Bayesian networks are used broadly in Artificial Intelligence. They are used in many tasks, such as filtering spam from your email account, and in creating the turbo codes used in 3G and 4G networks. They are used in image processing, where they help convert images into different digital formats. They also make a large contribution to medical science and biotechnology, for example in biomonitoring, where indicators are used to estimate the concentration of substances in body tissue. Bayesian networks also form the basis of gene regulatory networks. They have proven to be a useful and impactful model among many other approaches and are developing each day as engineers and experts work to make them more efficient.
Expectation-Maximization algorithm :

In real-world applications of machine learning, it is very common that many relevant features are available for learning but only a small subset of them are observable. For a variable that is sometimes observable and sometimes not, we can use the instances in which it is observed for learning and then predict its value in the instances where it is not observable.

The Expectation-Maximization (EM) algorithm can also be used for latent variables (variables that are not directly observable and are instead inferred from the values of other observed variables), in order to predict their values, provided that the general form of the probability distribution governing those latent variables is known to us. This algorithm underlies many unsupervised clustering algorithms in machine learning.
It was explained, proposed and given its name in a paper published in 1977 by
Arthur Dempster, Nan Laird, and Donald Rubin. It is used to find the local
maximum likelihood parameters of a statistical model in the cases where latent
variables are involved and the data is missing or incomplete.

Algorithm:
1. Given a set of incomplete data, consider a set of starting parameters.
2. Expectation step (E – step): Using the observed available data of the
dataset, estimate (guess) the values of the missing data.
3. Maximization step (M – step): Complete data generated after the
expectation (E) step is used in order to update the parameters.
4. Repeat step 2 and step 3 until convergence.

The essence of Expectation-Maximization algorithm is to use the available


observed data of the dataset to estimate the missing data and then using that
data to update the values of the parameters. Let us understand the EM
algorithm in detail.
 Initially, a set of initial values of the parameters are considered. A set of
incomplete observed data is given to the system with the assumption that
the observed data comes from a specific model.
 The next step is known as “Expectation” – step or E-step. In this step, we
use the observed data in order to estimate or guess the values of the missing
or incomplete data. It is basically used to update the variables.
 The next step is known as “Maximization”-step or M-step. In this step, we
use the complete data generated in the preceding “Expectation” – step in
order to update the values of the parameters. It is basically used to update
the hypothesis.
 Now, in the fourth step, it is checked whether the values are converging or
not, if yes, then stop otherwise repeat step-2 and step-3 i.e. “Expectation” –
step and “Maximization” – step until the convergence occurs.
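A compact sketch of EM for a one-dimensional mixture of two Gaussians, a common instance of the algorithm described above; the data, the initial parameters, and the fixed unit variance are all assumptions made to keep the example short.

import numpy as np

# EM for a 1-D mixture of two Gaussians (equal, fixed variance for brevity).
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 1, 200)])

mu = np.array([1.0, 4.0])      # initial guesses for the two means
pi = np.array([0.5, 0.5])      # initial mixing weights
var = 1.0                      # variance held fixed to keep the sketch short

for _ in range(50):
    # E-step: responsibility of each component for each point.
    dens = np.exp(-(data[:, None] - mu) ** 2 / (2 * var)) * pi
    resp = dens / dens.sum(axis=1, keepdims=True)

    # M-step: re-estimate means and mixing weights from the responsibilities.
    mu = (resp * data[:, None]).sum(axis=0) / resp.sum(axis=0)
    pi = resp.mean(axis=0)

print(mu)   # close to the true means 0 and 5
print(pi)   # close to 0.5, 0.5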
Flow chart for EM algorithm –

Usage of EM algorithm –
 It can be used to fill the missing data in a sample.
 It can be used as the basis of unsupervised learning of clusters.
 It can be used for the purpose of estimating the parameters of Hidden
Markov Model (HMM).
 It can be used for discovering the values of latent variables.
Advantages of EM algorithm –
 It is always guaranteed that likelihood will increase with each iteration.
 The E-step and M-step are often pretty easy for many problems in terms of
implementation.
 Solutions to the M-steps often exist in the closed form.
Disadvantages of EM algorithm –
 It has slow convergence.
 It makes convergence to the local optima only.
 It requires both the probabilities, forward and backward (numerical
optimization requires only forward probability).

Computational Learning Theory


Computational learning theory, or CoLT for short, is a field of study concerned
with the use of formal mathematical methods applied to learning systems.
It seeks to use the tools of theoretical computer science to quantify learning
problems. This includes characterizing the difficulty of learning specific tasks.

Computational learning theory may be thought of as an extension or sibling


of statistical learning theory, or SLT for short, that uses formal methods to
quantify learning algorithms.
 Computational Learning Theory (CoLT): Formal study of learning tasks.
 Statistical Learning Theory (SLT): Formal study of learning algorithms.

The focus in computational learning theory is typically on supervised learning


tasks. Formal analysis of real problems and real algorithms is very challenging.
As such, it is common to reduce the complexity of the analysis by focusing on
binary classification tasks and even simple binary rule-based systems. As such,
the practical application of the theorems may be limited or challenging to
interpret for real problems and algorithms.

As a machine learning practitioner, it can be useful to know about


computational learning theory and some of the main areas of investigation. The
field provides a useful grounding for what we are trying to achieve when fitting
models on data, and it may provide insight into the methods.

Probably Approximately Correct Learning

Rather than just studying different learning algorithms that happen to work
well, computational learning theory investigates general principles that can be
proved to hold for classes of learning algorithms.
Some relevant questions that we can ask about a theory of computational
learning include the following:

 Is the learner guaranteed to converge to the correct hypothesis as the


number of examples increases?

 How many examples are required to identify a concept?

 How much computation is required to identify a concept?


In general, the answer to the first question is “no,” unless it can be guaranteed
that the examples always eventually rule out all but the correct hypothesis. An
adversary out to trick the learner could choose examples that do not help
discriminate correct hypotheses from incorrect hypotheses. If an adversary
cannot be ruled out, a learner cannot guarantee to find a consistent hypothesis.
However, given randomly chosen examples, a learner that always chooses a
consistent hypothesis can get arbitrarily close to the correct concept. This
requires a notion of closeness and a specification of what is a randomly chosen
example.
Consider a learning algorithm that chooses a hypothesis consistent with all
of the training examples. Assume a probability distribution over possible
examples and that the training examples and the test examples are chosen from
the same distribution. The distribution does not have to be known. We will
prove a result that holds for all distributions.
The error of hypothesis h ∈ H on instance space I, written error(I, h), is defined to be the probability of choosing an element i of I such that h(i) ≠ Y(i), where h(i) is the predicted value of target variable Y on possible example i, and Y(i) is the actual value of Y on example i. That is,

error(I, h) = P(h(i) ≠ Y(i) | i ∈ I).

An agent typically does not know P or Y(i) for all i and, thus, does not actually know the error of a particular hypothesis.

Given ε > 0, hypothesis h is approximately correct if error(I, h) ≤ ε.
We make the following assumption.
Assumption 7.3.
The training and test examples are chosen independently from the same
probability distribution as the population.
It is still possible that the examples do not distinguish hypotheses that are far away from the concept; it is just very unlikely that they do not. A learner that chooses a hypothesis consistent with the training examples is probably approximately correct if, for an arbitrary number δ (0 < δ ≤ 1), the algorithm is not approximately correct in at most δ of the cases. That is, the hypothesis generated is approximately correct at least 1 − δ of the time.

Under the preceding assumption, for arbitrary ε and δ, we can guarantee that an algorithm that returns a consistent hypothesis will find a hypothesis with error less than ε in at least 1 − δ of the cases. Moreover, this result does not depend on the probability distribution.

Proposition 7.4.
Given Assumption 7.3, if a hypothesis is consistent with at least

(1/ε) (ln|H| + ln(1/δ))

training examples, it has error at most ε, at least 1 − δ of the time.
Proof.
Suppose ε > 0 and δ > 0 are given. Partition the hypothesis space H into

H0 = {h ∈ H : error(I, h) ≤ ε}
H1 = {h ∈ H : error(I, h) > ε}.

We want to guarantee that the learner does not choose an element of H1 in more than δ of the cases.

Suppose h ∈ H1. Then

P(h is wrong for a single example) ≥ ε
P(h is correct for a single example) ≤ 1 − ε
P(h is correct for m random examples) ≤ (1 − ε)^m.

Therefore,

P(H1 contains a hypothesis that is correct for m random examples)
  ≤ |H1| (1 − ε)^m
  ≤ |H| (1 − ε)^m
  ≤ |H| e^(−εm)

using the inequality (1 − ε) ≤ e^(−ε) if 0 ≤ ε ≤ 1.

If we ensure that |H| e^(−εm) ≤ δ, we guarantee that H1 does not contain a hypothesis that is correct for m examples in more than δ of the cases. So H0 contains all of the correct hypotheses in all but δ of the cases. Solving for m gives

m ≥ (1/ε) (ln|H| + ln(1/δ))

which proves the proposition. ∎


The number of examples required to guarantee this error bound is called the sample complexity. The number of examples required according to this proposition is a function of ε, δ, and the size of the hypothesis space.
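A quick sketch of evaluating this bound; the hypothesis-space size and the choices of ε and δ below are assumptions for illustration.

import math

def sample_complexity(h_size, epsilon, delta):
    """m >= (1/epsilon) * (ln|H| + ln(1/delta)), from the proposition above."""
    return math.ceil((math.log(h_size) + math.log(1 / delta)) / epsilon)

# Assumed numbers: |H| = 2**20 hypotheses, 5% error bound, 95% confidence.
print(sample_complexity(2 ** 20, epsilon=0.05, delta=0.05))   # 338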

K-Nearest Neighbor(KNN) Algorithm

o K-Nearest Neighbour is one of the simplest Machine Learning algorithms


based on Supervised Learning technique.
o K-NN algorithm assumes similarity between the new case/data and the available cases and puts the new case into the category that is most similar to the available categories.
o K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can easily be classified into a well-suited category by using the K-NN algorithm.
o K-NN algorithm can be used for Regression as well as for Classification
but mostly it is used for the Classification problems.
o K-NN is a non-parametric algorithm, which means it does not make
any assumption on underlying data.
o It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead it stores the dataset and, at the time of classification, performs an action on the dataset.
o KNN algorithm at the training phase just stores the dataset and when it
gets new data, then it classifies that data into a category that is much
similar to the new data.
o Example: Suppose we have an image of a creature that looks similar to a cat and a dog, and we want to know whether it is a cat or a dog. For this identification, we can use the KNN algorithm, as it works on a similarity measure. Our KNN model will find the features of the new image that are similar to the cat and dog images and, based on the most similar features, will put it in either the cat or the dog category.

Why do we need a K-NN Algorithm?

Suppose there are two categories, Category A and Category B, and we have a new data point x1. Which of these categories will this data point fall into? To solve this type of problem, we need a K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a particular data point. Consider the diagram below:
How does K-NN work?

The K-NN working can be explained on the basis of the below algorithm:

o Step-1: Select the number K of the neighbors


o Step-2: Calculate the Euclidean distance of K number of neighbors
o Step-3: Take the K nearest neighbors as per the calculated Euclidean
distance.
o Step-4: Among these k neighbors, count the number of the data points in
each category.
o Step-5: Assign the new data points to that category for which the number
of the neighbor is maximum.
o Step-6: Our model is ready.

Suppose we have a new data point and we need to put it in the required
category. Consider the below image:
o Firstly, we will choose the number of neighbors, so we will choose the
k=5.
o Next, we will calculate the Euclidean distance between the data points. The Euclidean distance is the distance between two points, which we have already studied in geometry. For two points (x1, y1) and (x2, y2) it can be calculated as d = sqrt((x2 − x1)² + (y2 − y1)²).

o By calculating the Euclidean distance we get the nearest neighbors: three nearest neighbors in category A and two nearest neighbors in category B. Consider the image below:
o As we can see the 3 nearest neighbors are from category A, hence this
new data point must belong to category A.

How to select the value of K in the K-NN Algorithm?

Below are some points to remember while selecting the value of K in the K-NN
algorithm:

o There is no particular way to determine the best value for "K", so we


need to try some values to find the best out of them. The most preferred
value for K is 5.
o A very low value for K, such as K=1 or K=2, can be noisy and lead to the effects of outliers in the model.
o Larger values for K are more robust to noise, but if K becomes too large the neighborhood may include many points from other categories.

Advantages of KNN Algorithm:


o It is simple to implement.
o It is robust to the noisy training data
o It can be more effective if the training data is large.

Disadvantages of KNN Algorithm:


o We always need to determine the value of K, which may be complex at times.
o The computation cost is high because of calculating the distance between
the data points for all the training samples.
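A minimal K-NN sketch following the steps above, with Euclidean distance and k = 5; the training points and labels are invented.

import math
from collections import Counter

# Toy training set: (x, y) points with class labels A or B (invented data).
train = [((1, 2), "A"), ((2, 3), "A"), ((3, 1), "A"),
         ((6, 5), "B"), ((7, 7), "B"), ((8, 6), "B")]

def knn_classify(query, k=5):
    # Step 2-3: compute Euclidean distances and take the k nearest neighbours.
    nearest = sorted(train, key=lambda item: math.dist(query, item[0]))[:k]
    # Step 4-5: majority vote among the k neighbours.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_classify((3, 3)))   # 'A': 3 of the 5 nearest neighbours are in category A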
o Linear Regression is a supervised learning algorithm used for computing linear relationships between input (X) and output (Y). The steps involved in ordinary linear regression are:
o Training phase: compute θ to minimize the cost J(θ) = Σ over i of (y(i) − θT x(i))².
o Predict output: for a given query point x, output θT x.

Because this algorithm fits a single straight line, it cannot be used for making predictions when there exists a non-linear relationship between X and Y. In such cases, locally weighted linear regression is used.

Locally Weighted Linear Regression:

Locally weighted linear regression is a non-parametric algorithm; that is, the model does not learn a fixed set of parameters as is done in ordinary linear regression. Rather, the parameters θ are computed individually for each query point x. While computing θ, a higher "preference" is given to the points in the training set lying in the vicinity of x than to the points lying far away from x.

The modified cost function is:

J(θ) = Σ over i of w(i) (y(i) − θT x(i))²

where w(i) is a non-negative "weight" associated with training point x(i). For x(i) lying closer to the query point x, the value of w(i) is large, while for x(i) lying far away from x the value of w(i) is small. A typical choice of w(i) is:

w(i) = exp(−(x(i) − x)² / (2τ²))

where τ is called the bandwidth parameter and controls the rate at which w(i) falls with distance from x. Clearly, if |x(i) − x| is small, w(i) is close to 1, and if |x(i) − x| is large, w(i) is close to 0. Thus, the training set points lying closer to the query point x contribute more to the cost than the points lying far away from x.

For example, consider a query point x = 5.0 and let x(1) and x(2) be two points in the training set such that x(1) = 4.9 and x(2) = 3.0. Using the formula above with τ = 0.5:

w(1) = exp(−(4.9 − 5.0)² / (2(0.5)²)) = 0.980
w(2) = exp(−(3.0 − 5.0)² / (2(0.5)²)) = 0.000335

Thus, the weights fall exponentially as the distance between x and x(i) increases, and so does the contribution of the prediction error for x(i) to the cost. Consequently, while computing θ, we focus more on reducing the error for the points lying closer to the query point (which have a larger value of w(i)).

Steps involved in locally weighted linear regression are:

Compute θ to minimize the weighted cost J(θ) above.
Predict output: for a given query point x, output θT x.
Points to remember:
 Locally weighted linear regression is a supervised learning algorithm.
 It is a non-parametric algorithm.
 There exists No training phase. All the work is done during the testing
phase/while making predictions.
 Locally weighted regression methods are a generalization of k-Nearest
Neighbour.
 In locally weighted regression, an explicit local approximation of the target function is constructed for each query instance.
 The local approximation is based on simple forms of the target function such as constant, linear, or quadratic functions, or on localized kernel functions.
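A short sketch of locally weighted linear regression using the Gaussian weights described above; the data set and the bandwidth value are invented.

import numpy as np

def lwlr_predict(x_query, X, y, tau=0.5):
    """Locally weighted linear regression prediction at a single query point."""
    A = np.column_stack([np.ones_like(X), X])          # design matrix [1, x]
    w = np.exp(-(X - x_query) ** 2 / (2 * tau ** 2))   # weights fall with distance
    W = np.diag(w)
    # Solve the weighted normal equations (A^T W A) theta = A^T W y.
    theta = np.linalg.solve(A.T @ W @ A, A.T @ W @ y)
    return theta[0] + theta[1] * x_query

# Invented non-linear data: y = sin(x) plus a little noise.
rng = np.random.default_rng(1)
X = np.linspace(0, 6, 100)
y = np.sin(X) + 0.1 * rng.normal(size=100)

print(lwlr_predict(2.0, X, y))   # close to sin(2.0) ~ 0.909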
 Radial Basis Kernel is a kernel function that is used in machine learning to find a non-linear classifier or regression line.
 What is a Kernel Function?
A kernel function is used to transform n-dimensional input into m-dimensional input, where m is much higher than n, and then compute the dot product in the higher-dimensional space efficiently. The main idea behind using a kernel is that a linear classifier or regression curve in the higher-dimensional space becomes a non-linear classifier or regression curve in the original lower dimensions.
 Mathematical definition of the Radial Basis Kernel:

K(x, x′) = exp(−||x − x′||² / (2σ²))

where x, x′ are vector points in any fixed-dimensional space and σ is a bandwidth parameter.


But if we expand the above exponential expression, it goes up to infinite powers of x and x′, since the expansion of e^x contains infinitely many terms with ever-higher powers of x; hence the kernel implicitly involves terms of infinite power in infinite dimensions.
If we apply an algorithm such as the perceptron algorithm or linear regression with this kernel, we are effectively applying the algorithm to the new infinite-dimensional data points we have created. Hence it will give a hyperplane in infinite dimensions, which yields a very strong non-linear classifier or regression curve after returning to our original dimensions: a polynomial of infinite power.

 So, although we are applying a linear classifier/regression, it will give a non-linear classifier or regression line that is a polynomial of infinite power. Being a polynomial of infinite power, the Radial Basis kernel is a very powerful kernel, which can give a curve fitting any complex dataset.
 Why is the Radial Basis Kernel so powerful?
The main motive of the kernel is to do calculations in a d-dimensional space with d > 1, so that we can get a quadratic, cubic, or higher-degree polynomial equation for our classification/regression line. Since the Radial Basis kernel uses an exponent, and the expansion of e^x gives a polynomial equation of infinite power, using this kernel makes our regression/classification line infinitely powerful too.
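A brief sketch of evaluating the RBF kernel for a pair of points (the kernel function itself, with a σ parameter, matches the definition above; the test points are arbitrary):

import numpy as np

def rbf_kernel(x, x_prime, sigma=1.0):
    """Radial basis function kernel exp(-||x - x'||^2 / (2 * sigma^2))."""
    diff = np.asarray(x) - np.asarray(x_prime)
    return np.exp(-np.dot(diff, diff) / (2 * sigma ** 2))

print(rbf_kernel([1.0, 2.0], [1.0, 2.0]))   # 1.0 for identical points
print(rbf_kernel([1.0, 2.0], [3.0, 4.0]))   # ~0.018, decays with distance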

(Figure: some complex datasets fitted easily using the RBF kernel.)

As we know, nearest neighbour classifiers store training tuples as points in Euclidean space. Case-Based Reasoning (CBR) classifiers, by contrast, use a database of problem solutions to solve new problems; they store the tuples or cases for problem-solving as complex symbolic descriptions. How does CBR work? When a new case arises to classify, a case-based reasoner will first check whether an identical training case exists. If one is found, the accompanying solution to that case is returned. If no identical case is found, the CBR will search for training cases with components that are similar to those of the new case. Conceptually, these training cases may be considered neighbours of the new case. If cases are represented as graphs, this involves searching for subgraphs that are similar to subgraphs within the new case. The CBR tries to combine the solutions of the neighbouring training cases to propose a solution for the new case. If incompatibilities arise with the individual solutions, then backtracking to search for other solutions may be necessary. The CBR may employ background knowledge and problem-solving strategies to propose a feasible solution. Applications of CBR include:
1. Problem resolution for customer service help desks, where cases describe
product-related diagnostic problems.
2. It is also applied to areas such as engineering and law, where cases are
either technical designs or legal rulings, respectively.
3. Medical educations, where patient case histories and treatments are used to
help diagnose and treat new patients.
Challenges with CBR
 Finding a good similarity metric (eg for matching subgraphs) and suitable
methods for combining solutions.
 Selecting salient features for indexing training cases and the development
of efficient indexing techniques.
The trade-off between accuracy and efficiency evolves as the number of stored cases becomes very large: as this number increases, the case-based reasoner becomes more intelligent, but after a certain point the system's efficiency will suffer, as the time required to search for and process relevant cases increases.
