ML Unit III
Bayes' theorem is also known by other names such as Bayes' rule or Bayes' law.
Bayes' theorem helps to determine the probability of an event using prior
knowledge. It is used to calculate the probability of one event occurring given
that another event has already occurred, and it is the standard way to relate
conditional probability and marginal probability.
In simple words, we can say that Bayes' theorem helps us produce more
accurate results.
Bayes' theorem can be derived using the product rule and the conditional probability of
event X with known event Y:
o According to the product rule, we can express the probability of event X
with known event Y as follows:
P(X ∩ Y) = P(X|Y) P(Y)
o Similarly, the probability of event Y with known event X is:
P(Y ∩ X) = P(Y|X) P(X)
Since the joint probability is symmetric, P(X ∩ Y) = P(Y ∩ X), equating the two
expressions and dividing by P(Y) gives Bayes' theorem.
That is,
P(X|Y) = P(Y|X) P(X) / P(Y)
Let’s say we come across a new patient who has a positive lab test result.
Should we give the patient a cancer diagnosis or not? Equation 6.2 can be
used to find the maximum a posteriori (MAP) hypothesis.
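To make the comparison concrete, here is a minimal Python sketch of the MAP decision for this scenario; the prior and likelihood values are illustrative assumptions, not figures given in the text.

# Minimal sketch of the MAP decision for the cancer / positive-test example.
# The prior and likelihood values below are illustrative assumptions.

p_cancer = 0.008                # assumed prior P(cancer)
p_not_cancer = 1 - p_cancer     # prior P(not cancer)
p_pos_given_cancer = 0.98       # assumed likelihood P(+ | cancer)
p_pos_given_not_cancer = 0.03   # assumed likelihood P(+ | not cancer)

# Unnormalized posteriors P(h | +) are proportional to P(+ | h) * P(h); the
# shared denominator P(+) can be ignored when we only compare hypotheses.
score_cancer = p_pos_given_cancer * p_cancer
score_not_cancer = p_pos_given_not_cancer * p_not_cancer

h_map = "cancer" if score_cancer > score_not_cancer else "not cancer"
print(score_cancer, score_not_cancer, h_map)
# With these numbers: 0.00784 vs 0.02976, so the MAP hypothesis is "not cancer".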
Let’s have a look at the Problem of Probability Density Estimation,
Given a sample of observations (X) from a domain (x1, x2, x3,…, xn),
each observation is taken independently from the domain with the same
probability distribution (so-called independent and identically distributed,
i.i.d., or close to it).
Density estimation entails choosing a probability distribution function and
its parameters that best explain the observed data’s joint probability
distribution (X).
There are several approaches to tackling this problem, but two of the most
common are maximum a posteriori (MAP) estimation and maximum likelihood
estimation (MLE):
1. MAP estimation: given the training data, Bayes' theorem determines the
posterior probability of each hypothesis. It calculates the posterior of each
conceivable hypothesis before determining which is the most likely.
2. MLE: the parameters are chosen so that they maximize the likelihood of the
observed data, without requiring a prior over hypotheses.
For large, complete data sets, both the LSE method and the MLE method
provide consistent results. In reliability applications, data sets are
typically small or moderate in size. Extensive simulation studies show
that in small sample designs where there are only a few failures, the MLE
method is better than the LSE method [1]. Thus, the default estimation
method in Minitab is MLE.
The advantages of the MLE method over the LSE method are as follows:
When there are only a few failures because the data are heavily censored,
the MLE method uses the information in the entire data set, including the
censored values. The LSE method ignores the information in the censored
observations [1].
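As a small illustration of maximum likelihood density estimation, the following Python sketch fits a Gaussian to an i.i.d. sample; the data are synthetic and the closed-form Gaussian MLE (sample mean and 1/n variance) is assumed.

import numpy as np

# Minimal sketch of MLE density estimation: fit a Gaussian to an i.i.d. sample.
# For the Gaussian, the MLE has a closed form: the sample mean and the
# (biased, 1/n) sample variance. The data below are purely illustrative.

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=500)     # observed sample X

mu_hat = x.mean()                         # MLE of the mean
sigma2_hat = np.mean((x - mu_hat) ** 2)   # MLE of the variance (divides by n)

# Average log-likelihood of the data under the fitted model.
log_lik = np.mean(-0.5 * np.log(2 * np.pi * sigma2_hat)
                  - (x - mu_hat) ** 2 / (2 * sigma2_hat))
print(mu_hat, sigma2_hat, log_lik)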
The denominator is omitted since we're only using this for comparison and
all the values of P(vj|D) will have the same denominator.
The value vj for which P(vj|D) is maximum is the best classification for
the new instance.
A Bayes optimal classifier is a system that classifies new instances according
to this equation. This strategy maximizes the probability that the new instance
will be classified correctly.
P(hi|D)    P(F|hi)    P(L|hi)    P(R|hi)
0.4        1          0          0
0.2        0          1          0
0.1        0          0          1
0.1        0          1          0
0.2        0          1          0
(Here F, L and R denote the forward, left and right actions from the example.)
The MAP hypothesis, therefore, suggests that the robot should proceed forward
(F). Let's see what the Bayes optimal procedure suggests.
Thus, the Bayes optimal procedure recommends the robot turn left.
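The following Python sketch reproduces that calculation from the table above; the action labels F, L and R are assumed from the surrounding discussion.

# Minimal sketch of the Bayes optimal decision for the robot example above.
# Each row pairs a hypothesis's posterior P(h|D) with the action it predicts.

posteriors = [0.4, 0.2, 0.1, 0.1, 0.2]
predictions = [  # P(F|h), P(L|h), P(R|h) for each hypothesis
    (1, 0, 0),
    (0, 1, 0),
    (0, 0, 1),
    (0, 1, 0),
    (0, 1, 0),
]

actions = ["F", "L", "R"]
# P(v|D) = sum over hypotheses of P(v|h) * P(h|D)
scores = [sum(p_h * pred[i] for p_h, pred in zip(posteriors, predictions))
          for i in range(len(actions))]
best = actions[scores.index(max(scores))]
print(dict(zip(actions, scores)), best)   # {'F': 0.4, 'L': 0.5, 'R': 0.1} -> 'L'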
Gibbs algorithm
In statistical mechanics, the Gibbs algorithm, introduced by J. Willard
Gibbs in 1902, is a criterion for choosing a probability distribution for
the statistical ensemble of microstates of a thermodynamic system by
minimizing the average log probability
⟨ln p_i⟩ = Σ_i p_i ln p_i
subject to the probability distribution pi satisfying a set of constraints
(usually expectation values) corresponding to the
known macroscopic quantities.[1] In 1948, Claude Shannon interpreted the
negative of this quantity, which he called information entropy, as a
measure of the uncertainty in a probability distribution.[1] In 1957, E.T.
Jaynes realized that this quantity could be interpreted as missing
information about anything, and generalized the Gibbs algorithm to non-
equilibrium systems with the principle of maximum entropy and maximum
entropy thermodynamics.[1]
Physicists call the result of applying the Gibbs algorithm the Gibbs
distribution for the given constraints, most notably Gibbs's grand canonical
ensemble for open systems when the average energy and the average
number of particles are given.
This general result of the Gibbs algorithm is then a maximum entropy
probability distribution. Statisticians identify such distributions as
belonging to exponential families.
Naive Bayes Classifier
The naive Bayes classifier is useful for learning tasks in which each
instance x is represented by a set of attribute values and the target function
f(x) can take any value from a finite set V.
The learner is given the task of predicting the target value for a new instance.
In the Bayesian strategy, the most probable target value, vMAP, is assigned to
the new instance.
Simply count the number of times each target value vj appears in the
training data to estimate each P(vj). The classifier then outputs
vNB = argmax over vj in V of P(vj) Π_i P(ai|vj)
where vNB stands for the Naive Bayes classifier's target value.
The naive Bayes learning approach includes a learning stage in which the
various P(vj) and P(ai|vj) terms are estimated from their frequencies in the
training data.
The learned hypothesis is represented by the set of these estimations.
The basic Naive Bayes assumption is that each feature makes an independent
and equal contribution to the outcome.
Let’s understand the concept of the Naive Bayes classifier better, with the
help of an example.
Let’s use the naive Bayes classifier to solve a problem we discussed during
our decision tree learning discussion: classifying days based on whether or
not someone will play tennis.
We can also estimate conditional probabilities in the same way. Those for
Wind = strong, for example, include
Based on the probability estimates learned from the training data, the naive
Bayes classifier assigns the target value PlayTennis = no to this new
instance. Furthermore, given the observed attribute values, we can
determine the conditional probability that the target value is no by
normalizing the above quantities so that they sum to one.
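A minimal Python sketch of this computation is given below; the probability estimates are the frequencies usually obtained from the standard 14-example PlayTennis table, which is assumed here because the table itself is not reproduced above.

# Sketch of the naive Bayes computation for the new instance
# (Outlook=sunny, Temperature=cool, Humidity=high, Wind=strong).
# The probability estimates below are assumed frequency counts from the
# standard 14-example PlayTennis table, which is not reproduced in this text.

priors = {"yes": 9 / 14, "no": 5 / 14}                       # P(vj)
cond = {  # P(ai | vj) for each observed attribute value
    "yes": {"sunny": 2 / 9, "cool": 3 / 9, "high": 3 / 9, "strong": 3 / 9},
    "no":  {"sunny": 3 / 5, "cool": 1 / 5, "high": 4 / 5, "strong": 3 / 5},
}

scores = {}
for v in priors:
    score = priors[v]
    for a in ("sunny", "cool", "high", "strong"):
        score *= cond[v][a]         # multiply the conditional probabilities
    scores[v] = score

v_nb = max(scores, key=scores.get)          # the naive Bayes classification
posterior_no = scores["no"] / sum(scores.values())   # normalize to sum to one
print(scores, v_nb, posterior_no)           # v_NB = 'no', posterior ~0.795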
For example, news articles can be organized by topic; support tickets can be
organized by urgency; chat conversations can be organized by language; brand
mentions can be organized by sentiment; and so on.
A text classifier can take such a phrase as input, analyze its content, and then
automatically assign relevant tags, such as UI and Easy To Use.
This is where text classification with machine learning comes in. Using text
classifiers, companies can automatically structure all manner of relevant text,
from emails, legal documents, social media, chatbots, surveys, and more in a
fast and cost-effective way. This allows companies to save time analyzing text
data, automate business processes, and make data-driven business decisions.
Once it’s trained with enough training samples, the machine learning model can
begin to make accurate predictions. The same feature extractor is used to
transform unseen text to feature sets, which can be fed into the classification
model to get predictions on tags (e.g., sports, politics):
Text classification with machine learning is usually much more accurate than
human-crafted rule systems, especially on complex NLP classification tasks.
Also, classifiers with machine learning are easier to maintain and you can
always tag new examples to learn new tasks.
Machine Learning Text Classification Algorithms
Some of the most popular text classification algorithms include the Naive Bayes
family of algorithms, support vector machines (SVM), and deep learning.
Why use machine learning text classification? Some of the top reasons:
Scalability
Manually analyzing and organizing is slow and much less accurate. Machine
learning can automatically analyze millions of surveys, comments, emails, etc.,
at a fraction of the cost, often in just a few minutes. Text classification tools are
scalable to any business needs, large or small.
Real-time analysis
There are critical situations that companies need to identify as soon as possible
and take immediate action (e.g., PR crises on social media). Machine learning
text classification can follow your brand mentions constantly and in real time,
so you'll identify critical information and be able to take action right away.
Consistent criteria
Human annotators make mistakes when classifying text data due to distractions,
fatigue, and boredom, and human subjectivity creates inconsistent criteria.
Machine learning, on the other hand, applies the same lens and criteria to all
data and results. Once a text classification model is properly trained it performs
with unsurpassed accuracy.
There are many approaches to automatic text classification, but they all fall
under three types of systems:
Rule-based systems
Machine learning-based systems
Hybrid systems
Rule-based systems
Say that you want to classify news articles into two groups: Sports and Politics.
First, you’ll need to define two lists of words that characterize each group (e.g.,
words related to sports such as football, basketball, LeBron James, etc., and
words related to politics, such as Donald Trump, Hillary Clinton, Putin, etc.).
Next, when you want to classify a new incoming text, you’ll need to count the
number of sport-related words that appear in the text and do the same for
politics-related words. If the number of sports-related word appearances is
greater than the politics-related word count, then the text is classified as Sports
and vice versa.
For example, this rule-based system will classify the headline “When is LeBron
James' first game with the Lakers?” as Sports because it counted one sports-
related term (LeBron James) and it didn’t count any politics-related terms.
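A minimal Python sketch of such a rule-based classifier is shown below; the word lists are illustrative assumptions rather than a complete vocabulary.

# Minimal sketch of the rule-based classifier described above.
# The word lists are illustrative assumptions, not an exhaustive vocabulary.

SPORTS_WORDS = {"football", "basketball", "lebron james", "game", "lakers"}
POLITICS_WORDS = {"donald trump", "hillary clinton", "putin", "election"}

def classify(text: str) -> str:
    text = text.lower()
    # Count how many words from each list appear in the text.
    sports_hits = sum(word in text for word in SPORTS_WORDS)
    politics_hits = sum(word in text for word in POLITICS_WORDS)
    return "Sports" if sports_hits >= politics_hits else "Politics"

print(classify("When is LeBron James' first game with the Lakers?"))  # Sports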
Rule-based systems are human comprehensible and can be improved over time.
But this approach has some disadvantages. For starters, these systems require
deep knowledge of the domain. They are also time-consuming, since generating
rules for a complex system can be quite challenging and usually requires a lot of
analysis and testing. Rule-based systems are also difficult to maintain and don’t
scale well given that adding new rules can affect the results of the pre-existing
rules.
The first step towards training a machine learning NLP classifier is feature
extraction: a method is used to transform each text into a numerical
representation in the form of a vector. One of the most frequently used
approaches is bag of words, where a vector represents the frequency of a word
in a predefined dictionary of words.
For example, if we have defined our dictionary to have the following words
{This, is, the, not, awesome, bad, basketball}, and we wanted to vectorize the
text “This is awesome,” we would have the following vector representation of
that text: (1, 1, 0, 0, 1, 0, 0).
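A minimal Python sketch of this bag-of-words vectorization, using the same dictionary and sentence, might look like this:

# Minimal bag-of-words sketch for the dictionary and sentence used above.

dictionary = ["This", "is", "the", "not", "awesome", "bad", "basketball"]

def vectorize(text: str) -> list[int]:
    # Count how often each dictionary word appears in the text (case-insensitive).
    tokens = text.lower().split()
    return [tokens.count(word.lower()) for word in dictionary]

print(vectorize("This is awesome"))   # [1, 1, 0, 0, 1, 0, 0]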
Then, the machine learning algorithm is fed with training data that consists of
pairs of feature sets (vectors for each text example) and tags
(e.g. sports, politics) to produce a classification model:
Hybrid Systems
Hybrid systems combine a machine learning-trained base classifier with a
rule-based system that is used to fine-tune the results.
Bayesian Belief Networks
Probabilistic models determine the relationship between variables, and you can
then calculate the probabilities of those variables taking different values. A
Bayesian network is also called a probabilistic graphical model (PGM).
Bayesian networks are visual probabilistic models that depict the
conditional dependence of different variables in a graph; the absence of an
edge between two variables encodes a conditional independence. They are a
powerful tool to visualize probabilities and to understand and analyze the
relationships between random variables and the possibilities for different situations.
You can use algorithms to estimate the distributions in the graph; for example,
you can assume a Gaussian distribution for continuous random variables and
estimate the distribution parameters from the data.
After the Bayesian Belief Network is ready for any domain, you can use it for
logical reasoning like getting answers to situational problems and making
decisions.
You can refer to PyMC, a large library that provides a wide range of tools to
build Bayesian networks as graphical models. The version of this library for
Python 3 is PyMC3. It was built on the Theano mathematical computation
library, which provides automatic differentiation.
P(A | C, B) = P(A | B)
Here B is not affected by A or C and has no parents, so its distribution is simply
the marginal P(B). Because A and C are each conditionally independent of the
other given B, we can also write the joint probability of A and C given B as:
P(A, C | B) = P(A | B) * P(C | B)
The model then factorizes the joint probability P(A, B, C) as:
P(A, B, C) = P(A | B) * P(C | B) * P(B)
The graph is useful even when you do not yet know the probability
distributions for the variables.
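As a concrete illustration, the following Python sketch evaluates the factorization P(A, B, C) = P(A|B) * P(C|B) * P(B) for a small network in which B is the parent of both A and C; all probability values are illustrative assumptions.

# Minimal sketch of the factorization P(A, B, C) = P(A|B) * P(C|B) * P(B)
# for a three-variable network where B is the parent of both A and C.
# All probability values are illustrative assumptions.

p_b = {True: 0.3, False: 0.7}             # P(B)
p_a_given_b = {True: 0.9, False: 0.2}     # P(A=True | B)
p_c_given_b = {True: 0.4, False: 0.6}     # P(C=True | B)

def joint(a: bool, b: bool, c: bool) -> float:
    pa = p_a_given_b[b] if a else 1 - p_a_given_b[b]
    pc = p_c_given_b[b] if c else 1 - p_c_given_b[b]
    return pa * pc * p_b[b]

# Example: P(A=True, B=True, C=False) = 0.9 * 0.6 * 0.3 = 0.162
print(joint(True, True, False))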
In SNA, you try to decode and understand the structure of a social network. You
can also gauge the significance of each node, but you do not know the outcome
of the network's decisions. This is where the Bayesian belief network comes
into the picture; for example, you may model the significance of a node as
being determined by its Degree Centrality and Link Centrality, as follows:
It is a basic graph to understand how Bayesian Network is applied in Social
Network Analysis.
Another example is for friend groups; for example, the groups in a social
network are members of the group, and some of them might be friends as well.
So, there will be two nodes – friends in the group and members in the group
connected in a common group.
This concept is very new in Bayesian networks, and many scientists and experts
are researching it.
There are many Bayesian belief network advantages and disadvantages. They
are listed below:
Advantages
There are a few advantages of Bayesian belief networks, as they visualize the
different probabilities of the variables. Some of them are:
Graphical and visual networks provide a model to visualize the structure of the
probabilities and to develop designs for new models as well.
Relationships capture the type of relationship between variables and whether a
relationship is present or absent.
Computations solve complex probability problems efficiently.
Bayesian networks can investigate and tell you whether a particular feature is
taken into account in the decision-making process, and can force that feature to
be included if necessary. The network thus ensures that all known features are
considered when deciding on a problem.
Bayesian Networks are more extensible than other networks and learning
methods. Adding a new piece in the network requires only a few probabilities
and a few edges in the graph. So, it is an excellent network for adding a new
piece of data to an existing probabilistic model.
The graph of a Bayesian Network is useful. It is readable to both computers and
humans; both can interpret the information, unlike some networks like neural
networks, which humans can’t read.
Disadvantages
The most significant disadvantage is that there is no universally acknowledged
method for constructing networks from data. There have been many
developments in this regard, but no single method has emerged as the clear winner.
The design of a Bayesian network is harder than that of other networks and
requires a lot of effort; as a result, often only the person who created the
network fully understands the causal influences it encodes. Neural networks
have an advantage here, as they learn patterns from data and are not limited to
the creator's knowledge.
The Bayesian network fails to define cyclic relationships—for example,
deflection of airplane wings and fluid pressure field around it. The deflection
depends on the pressure, and the pressure is dependent on the deflection. It is a
tightly coupled problem which this network fails to define and make decisions.
The network is expensive to build.
It performs poorly on high dimensional data.
It is tough to interpret and requires copula functions to separate effects and
causes.
Expectation-Maximization (EM) Algorithm
Algorithm:
1. Given a set of incomplete data, consider a set of starting parameters.
2. Expectation step (E – step): Using the observed available data of the
dataset, estimate (guess) the values of the missing data.
3. Maximization step (M – step): Complete data generated after the
expectation (E) step is used in order to update the parameters.
4. Repeat step 2 and step 3 until convergence.
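A minimal Python sketch of these four steps for a two-component, one-dimensional Gaussian mixture is given below; the data and starting parameters are illustrative assumptions.

import numpy as np

# Minimal sketch of the E-step / M-step loop for a two-component, 1-D
# Gaussian mixture. The data and starting parameters are illustrative.

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(5.0, 1.0, 200)])

# Step 1: starting parameters (mixing weights, means, variances).
w = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
var = np.array([1.0, 1.0])

def gaussian(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

for _ in range(50):
    # E-step: posterior responsibility of each component for each point.
    resp = w * gaussian(x[:, None], mu, var)
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: re-estimate parameters from the responsibilities.
    n_k = resp.sum(axis=0)
    w = n_k / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / n_k
    var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / n_k

print(w, mu, var)   # weights, means, and variances recovered by EM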
Usage of EM algorithm –
It can be used to fill the missing data in a sample.
It can be used as the basis of unsupervised learning of clusters.
It can be used for the purpose of estimating the parameters of Hidden
Markov Model (HMM).
It can be used for discovering the values of latent variables.
Advantages of EM algorithm –
It is always guaranteed that likelihood will increase with each iteration.
The E-step and M-step are often pretty easy for many problems in terms of
implementation.
Solutions to the M-steps often exist in the closed form.
Disadvantages of EM algorithm –
It has slow convergence.
It makes convergence to the local optima only.
It requires both the probabilities, forward and backward (numerical
optimization requires only forward probability).
Rather than just studying different learning algorithms that happen to work
well, computational learning theory investigates general principles that can be
proved to hold for classes of learning algorithms.
Some relevant questions that we can ask about a theory of computational
learning include the following:
An agent typically does not know P or Y(i) for all i and, thus, does not
actually know the error of a particular hypothesis.
Given ε > 0, hypothesis h is approximately correct if error(I, h) ≤ ε.
We make the following assumption.
Assumption 7.3.
The training and test examples are chosen independently from the same
probability distribution as the population.
It is still possible that the examples do not distinguish hypotheses that are
far away from the concept. It is just very unlikely that they do not. A learner
that chooses a hypothesis consistent with the training examples is probably
approximately correct if, for an arbitrary number δ (0 < δ ≤ 1), the
algorithm is not approximately correct in at most δ of the cases. That is, the
hypothesis generated is approximately correct at least 1 − δ of the time.
Under the preceding assumption, for arbitrary ε and δ, we can guarantee
that an algorithm that returns a consistent hypothesis will find a hypothesis with
error less than ε, in at least 1 − δ of the cases. Moreover, this result does not
depend on the probability distribution.
Proposition 7.4.
Given Assumption 7.3, if a hypothesis is consistent with at least
(1/ε) (ln |H| + ln (1/δ))
training examples, it has error at most ε, at least 1 − δ of the time.
Proof.
Suppose ε > 0 and δ > 0 are given. Partition the hypothesis space H into
H0 = {h ∈ H : error(I, h) ≤ ε}
H1 = {h ∈ H : error(I, h) > ε}.
We want to guarantee that the learner does not choose an element of H1 in
more than δ of the cases.
Suppose h ∈ H1; then
P(h is wrong for a single example) ≥ ε
P(h is correct for a single example) ≤ 1 − ε
P(h is correct for m random examples) ≤ (1 − ε)^m.
Therefore,
P(H1 contains a hypothesis that is correct for m random examples)
≤ |H1| (1 − ε)^m
≤ |H| (1 − ε)^m
≤ |H| e^(−εm)
using the inequality (1 − ε) ≤ e^(−ε) if 0 ≤ ε ≤ 1.
If we ensure that |H| e^(−εm) ≤ δ, we guarantee that H1 does not
contain a hypothesis that is correct for m examples in more than δ of the
cases. So H0 contains all of the correct hypotheses in all but δ of the cases.
Solving for m gives
m ≥ (1/ε) (ln |H| + ln (1/δ))
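The bound can be evaluated directly; the following Python sketch computes the required number of training examples for illustrative values of |H|, ε and δ.

import math

# Minimal sketch of the sample-complexity bound derived above:
# m >= (1/epsilon) * (ln|H| + ln(1/delta)).
# The values of |H|, epsilon and delta below are illustrative assumptions.

def pac_sample_bound(h_size: int, epsilon: float, delta: float) -> int:
    return math.ceil((math.log(h_size) + math.log(1 / delta)) / epsilon)

# Example: |H| = 2**10 hypotheses, epsilon = 0.1, delta = 0.05.
print(pac_sample_bound(2 ** 10, 0.1, 0.05))   # about 100 examples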
Suppose there are two categories, Category A and Category B, and we have a
new data point x1; to which of these categories does this data point belong? To
solve this type of problem, we need a K-NN algorithm. With the help of K-NN,
we can easily identify the category or class of a particular data point. Consider
the below diagram:
How does K-NN work?
The K-NN working can be explained on the basis of the below algorithm:
Suppose we have a new data point and we need to put it in the required
category. Consider the below image:
o Firstly, we will choose the number of neighbors; here we will choose
k = 5.
o Next, we will calculate the Euclidean distance between the data points.
The Euclidean distance is the distance between two points, which we
have already studied in geometry. For points (x1, y1) and (x2, y2) it can
be calculated as:
d = sqrt((x2 − x1)² + (y2 − y1)²)
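Putting the steps together, a minimal Python sketch of the full K-NN procedure might look like this; the training points are illustrative assumptions.

import math
from collections import Counter

# Minimal K-NN sketch: compute Euclidean distances, take the k = 5 nearest
# neighbours, and assign the majority category. The training points are
# illustrative assumptions.

train = [  # ((x, y), category)
    ((1.0, 2.0), "A"), ((1.5, 1.8), "A"), ((2.0, 2.2), "A"),
    ((6.0, 6.5), "B"), ((6.2, 5.9), "B"), ((5.8, 6.1), "B"), ((7.0, 7.2), "B"),
]

def knn_classify(query, k=5):
    # Euclidean distance between the query point and every training point.
    dists = sorted((math.dist(query, point), label) for point, label in train)
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

print(knn_classify((2.5, 2.5)))   # the nearest neighbours are mostly 'A'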
Below are some points to remember while selecting the value of K in the K-NN
algorithm:
As evident from the image below, this algorithm cannot be used for making
predictions when there exists a non-linear relationship between X and Y. In
such cases, locally weighted linear regression is used.
For example, consider a query point x0 = 5.0 and let x(1) and x(2) be two
points in the training set such that x(1) = 4.9 and x(2) = 3.0. Using the Gaussian
weighting formula w(i) = exp(−(x(i) − x0)² / (2τ²)) with a suitable bandwidth τ,
the nearby point x(1) receives a weight close to 1, while the distant point x(2)
receives a weight close to 0, so the local fit is dominated by points near the query.
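The following Python sketch illustrates locally weighted linear regression on synthetic non-linear data; the data, the Gaussian kernel weighting, and the bandwidth value tau = 0.5 are illustrative assumptions.

import numpy as np

# Minimal sketch of locally weighted linear regression: for a query point,
# weight each training example with a Gaussian kernel and solve a weighted
# least-squares problem. The data and bandwidth tau are illustrative.

rng = np.random.default_rng(2)
X = np.linspace(0, 10, 100)
y = np.sin(X) + rng.normal(0, 0.1, 100)          # a non-linear relationship

def lwr_predict(x0, tau=0.5):
    w = np.exp(-(X - x0) ** 2 / (2 * tau ** 2))   # kernel weights
    A = np.stack([np.ones_like(X), X], axis=1)    # design matrix [1, x]
    W = np.diag(w)
    # Solve (A^T W A) theta = A^T W y for the local line, then evaluate at x0.
    theta = np.linalg.solve(A.T @ W @ A, A.T @ W @ y)
    return theta[0] + theta[1] * x0

print(lwr_predict(5.0))   # close to sin(5.0) ≈ -0.96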
Radial Basis Kernel
The radial basis function (RBF) kernel behaves like a polynomial kernel of
infinite degree: expanding its exponential as a power series yields polynomial
terms of every order, so its implicit feature space is infinite-dimensional.
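For reference, a minimal Python sketch of the RBF kernel K(x, x') = exp(−||x − x'||² / (2σ²)) is shown below; the σ value is an illustrative assumption.

import numpy as np

# Minimal sketch of the radial basis function (RBF) kernel,
# K(x, x') = exp(-||x - x'||^2 / (2 * sigma^2)); sigma is illustrative.

def rbf_kernel(x, x_prime, sigma=1.0):
    diff = np.asarray(x) - np.asarray(x_prime)
    return np.exp(-np.dot(diff, diff) / (2 * sigma ** 2))

print(rbf_kernel([1.0, 2.0], [1.5, 1.0]))   # ~0.535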