
Bayes Theorem:

Bayes' theorem is named after the English statistician and philosopher Thomas Bayes, who
formulated it in the 18th century.
It is an important theorem in mathematics that is used to find the probability of an event based
on prior knowledge of conditions that might be related to that event.
Bayes' theorem is also known as Bayes' Rule or Bayes' Law. It is used to determine the
conditional probability of event A when event B has already happened.
The general statement of Bayes' theorem is: "The conditional probability of an event A, given the
occurrence of another event B, is equal to the product of the probability of B given A and the
probability of A, divided by the probability of event B." i.e.
P(A|B) = P(B|A)P(A) / P(B)
where,
P(A) and P(B) are the probabilities of events A and B,
P(A|B) is the probability of event A given that event B has occurred, and
P(B|A) is the probability of event B given that event A has occurred.
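
As an illustration (not part of the original notes), the rule translates directly into a small Python function; the numbers in the example below are hypothetical.

def bayes_rule(p_b_given_a, p_a, p_b):
    """Return P(A|B) = P(B|A) * P(A) / P(B)."""
    if p_b == 0:
        raise ValueError("P(B) must be non-zero")
    return p_b_given_a * p_a / p_b

# Hypothetical numbers: a test with P(positive|disease) = 0.95,
# prevalence P(disease) = 0.01, and overall positive rate P(positive) = 0.059.
print(bayes_rule(0.95, 0.01, 0.059))  # ~0.161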

Terms Related to Bayes Theorem:


Conditional Probability:
The probability of an event A given the occurrence of another event B is termed the conditional
probability. It is denoted P(A|B) and represents the probability of A when event B has already
happened.
Joint Probability:
The probability of two or more events occurring together is called the joint probability. For two
events A and B, the joint probability is denoted P(A∩B).
Random Variables:
Real-valued variables whose possible values are determined by the outcome of a random experiment
are called random variables.
Prior Probability:
The probability of event A before event B is observed, i.e. P(A).
Posterior Probability:
The probability of event A after event B has been observed, i.e. P(A|B).
Proof of Bayes Theorem:
The probability of two events A and B both happening, P(A∩B), is the probability of A, P(A),
times the probability of B given that A has occurred, P(B|A):
P(A∩B) = P(A)P(B|A) --------------(1)
On the other hand, the probability of A and B is equal to the probability of B times the
probability of A given B:
P(A∩B) = P(B)P(A|B) ---------------(2)
Equating the two yields
P(B)P(A|B) = P(A)P(B|A)
Thus
P(A|B) = P(B|A)P(A) / P(B)
This equation, known as Bayes' theorem, is the basis of statistical inference.
Example:
Three boxes labeled as A, B, and C, are present. Details of the boxes are:
 Box A contains 2 red and 3 black balls
 Box B contains 3 red and 1 black ball
 And box C contains 1 red ball and 4 black balls
All three boxes are identical and have an equal probability of being picked.
If a red ball is drawn, what is the probability that it came from box A?
Solution:
Let E denote the event that a red ball is picked, and let A, B, and C denote the events that the
ball came from the respective box. We need to calculate the conditional probability P(A|E).
The prior probabilities are P(A) = P(B) = P(C) = 1/3, since all boxes have an equal probability of
being picked.

P(E|A) = Number of red balls in box A / Total number of balls in box A = 2 / 5

Similarly, P(E|B) = 3 / 4 and P(E|C) = 1 / 5

Then evidence P(E) = P(E|A)*P(A) + P(E|B)*P(B) + P(E|C)*P(C)


= (2/5) * (1/3) + (3/4) * (1/3) + (1/5) * (1/3) = 0.45

Therefore, P(A|E) = P(E|A) * P(A) / P(E) = (2/5) * (1/3) / 0.45 ≈ 0.296
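
The same calculation can be verified with a short Python sketch (added here purely for illustration):

# Priors: each box is equally likely to be picked.
priors = {"A": 1/3, "B": 1/3, "C": 1/3}

# Likelihoods: probability of drawing a red ball from each box.
likelihoods = {"A": 2/5, "B": 3/4, "C": 1/5}

# Evidence: total probability of drawing a red ball, P(E).
p_red = sum(likelihoods[box] * priors[box] for box in priors)

# Posterior: probability the red ball came from box A, P(A|E).
p_a_given_red = likelihoods["A"] * priors["A"] / p_red
print(round(p_red, 2), round(p_a_given_red, 3))  # 0.45 0.296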

Naive Bayes Classifiers:


The Naïve Bayes algorithm is a supervised learning algorithm based on Bayes' theorem and used for
solving classification problems.
It is mainly used for text classification, where the training data is typically high-dimensional.
The Naïve Bayes classifier is one of the simplest and most effective classification algorithms and
helps in building fast machine learning models that can make quick predictions.
Naive Bayes classifiers are a collection of classification algorithms based on Bayes' theorem.
It is not a single algorithm but a family of algorithms that share a common principle: every pair
of features being classified is independent of each other, given the class.
To start with, consider a dataset that describes the weather conditions for playing a game of
tennis. Given the weather conditions, each tuple classifies the conditions as fit ("Yes") or
unfit ("No") for playing tennis.
Now, with regard to our dataset, we can apply Bayes' theorem in the following way:

P(y|X) = P(X|y) P(y) / P(X)

where y is the class variable and X is a dependent feature vector of size n:

X = (x1, x2, x3, ..., xn)

Just to be clear, an example of a feature vector and corresponding class variable is
(refer to the 1st row of the dataset):

X = (Sunny, Hot, High, Weak)

y = No
So basically, P(y|X) here means the probability of "not playing tennis" given that the weather
conditions are "sunny outlook", "hot temperature", "high humidity" and "weak wind".
Now it is time to apply the naive assumption to Bayes' theorem, which is independence among the
features. So we split the evidence into its independent parts.
If any two events A and B are independent, then
P(A,B) = P(A)P(B)
Hence, we reach the result:

P(y|x1, ..., xn) = P(x1|y) P(x2|y) ... P(xn|y) P(y) / ( P(x1) P(x2) ... P(xn) )

which can be expressed as:

P(y|x1, ..., xn) = P(y) * Π P(xi|y) / ( P(x1) P(x2) ... P(xn) ),  where the product Π runs over i = 1, ..., n

Now, as the denominator remains constant for a given input, we can remove that term:

P(y|x1, ..., xn) ∝ P(y) * Π P(xi|y)

Now, we need to create a classifier model. For this, we find the probability of the given set of
inputs for all possible values of the class variable y and pick the output with the maximum
probability. This can be expressed mathematically as:

y = argmax over y of P(y) * Π P(xi|y)
So, finally, we are left with the task of calculating P(y) and P(xi|y).
Note that P(y) is also called the class probability and P(xi|y) is called the conditional
probability.
The different naive Bayes classifiers differ mainly in the assumptions they make regarding the
distribution of P(xi|y).
Let us try to apply the above formula manually on our weather dataset.
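
The weather table itself is not reproduced here, so the sketch below uses a small, hypothetical sample of the usual play-tennis data to show how the counts give P(y) and P(xi|y) and how the argmax rule above picks a class. It is an illustration of the formula, not the exact table from these notes, and it uses plain counting with no smoothing.

from collections import Counter, defaultdict

# Hypothetical sample rows: (Outlook, Temperature, Humidity, Wind) -> Play
data = [
    (("Sunny", "Hot", "High", "Weak"), "No"),
    (("Sunny", "Hot", "High", "Strong"), "No"),
    (("Overcast", "Hot", "High", "Weak"), "Yes"),
    (("Rain", "Mild", "High", "Weak"), "Yes"),
    (("Rain", "Cool", "Normal", "Weak"), "Yes"),
    (("Rain", "Cool", "Normal", "Strong"), "No"),
    (("Overcast", "Cool", "Normal", "Strong"), "Yes"),
]

# Class counts, used for the class probabilities P(y).
class_counts = Counter(label for _, label in data)
total = len(data)

# Counts for the conditional probabilities P(xi | y).
feature_counts = defaultdict(Counter)  # (feature_index, label) -> Counter of values
for features, label in data:
    for i, value in enumerate(features):
        feature_counts[(i, label)][value] += 1

def predict(features):
    """Return the class y that maximizes P(y) * prod_i P(xi | y)."""
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # P(y)
        for i, value in enumerate(features):
            score *= feature_counts[(i, label)][value] / count  # P(xi | y), no smoothing
        scores[label] = score
    return max(scores, key=scores.get)

print(predict(("Sunny", "Cool", "High", "Strong")))  # "No" for this toy sample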

Advantages of Naïve Bayes Classifier:


 Naïve Bayes is one of the fastest and easiest ML algorithms for predicting the class of a dataset.
 It can be used for binary as well as multi-class classification.
 It performs well on multi-class predictions compared to many other algorithms.
 It is a popular choice for text classification problems.
Disadvantages of Naïve Bayes Classifier:
Naive Bayes assumes that all features are independent or unrelated, so it cannot learn
relationships between features.
Applications of Naïve Bayes Classifier:
 It is used for Credit Scoring.
 It is used in medical data classification.
 It can be used in real-time predictions because Naïve Bayes Classifier is an eager learner.
 It is used in Text classification such as Spam filtering and Sentiment analysis.
Types of Naïve Bayes Model:
There are three types of Naive Bayes Model, which are given below:
Gaussian: The Gaussian model assumes that the features follow a normal distribution. This means
that if the predictors take continuous values instead of discrete ones, the model assumes these
values are sampled from a Gaussian distribution.
Multinomial: The Multinomial Naïve Bayes classifier is used when the data follows a multinomial
distribution. It is primarily used for document classification problems, i.e. determining which
category a particular document belongs to, such as Sports, Politics, or Education.
The classifier uses the frequency of words as the predictors.
Bernoulli: The Bernoulli classifier works similarly to the Multinomial classifier, but the
predictor variables are independent Boolean variables, such as whether a particular word is
present in a document or not. This model is also popular for document classification tasks.
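
All three variants are implemented in scikit-learn (GaussianNB, MultinomialNB, BernoulliNB); the snippet below is a minimal sketch, and the tiny arrays in it are made up purely to show which input type suits each model.

import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

# Continuous features -> GaussianNB (toy data: 2 features, 2 classes)
X_cont = np.array([[1.0, 2.1], [0.9, 1.8], [3.2, 4.0], [3.0, 4.2]])
y = np.array([0, 0, 1, 1])
print(GaussianNB().fit(X_cont, y).predict([[1.1, 2.0]]))

# Word-count features -> MultinomialNB
X_counts = np.array([[2, 0, 1], [0, 3, 0], [1, 0, 2], [0, 2, 1]])
print(MultinomialNB().fit(X_counts, y).predict([[1, 0, 1]]))

# Binary presence/absence features -> BernoulliNB
X_bin = (X_counts > 0).astype(int)
print(BernoulliNB().fit(X_bin, y).predict([[1, 0, 1]]))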
Theorem 1 (Bayes Optimal Classifier).
The Bayes Optimal Classifier f(BO) achieves the minimal zero/one error of any deterministic
classifier.
This theorem assumes that you are comparing against deterministic classifiers. You can actually
prove a stronger result that f(BO) is optimal for randomized classifiers as well, but the proof is a bit
messier.
However, the intuition is the same: for a given x, f(BO) chooses the label with highest probability,
thus minimizing the probability that it makes an error.
Proof of Theorem 1. Consider some other classifier g that claims to be better than f(BO). Then,
there must be some x on which g(x) ≠ f(BO)(x). Fix such an x. Now, the probability that f(BO) makes
an error on this particular x is 1 − D(x, f(BO)(x)) and the probability that g makes an error on this x
is 1 − D(x, g(x)). But f(BO) was chosen in such a way as to maximize D(x, f(BO)(x)), so this must be at
least as large as D(x, g(x)). Thus, the probability that f(BO) errs on this particular x is no larger
than the probability that g errs on it. This applies to any x for which f(BO)(x) ≠ g(x), and therefore
f(BO) achieves zero/one error no larger than that of any g.
The Bayes error rate (or Bayes optimal error rate) is the error rate of the Bayes optimal classifier.
It is the best error rate you can ever hope to achieve on this classification problem (under
zero/one loss). The take-home message is that if someone gave you access to the data
distribution, forming an optimal classifier would be trivial. Unfortunately, no one gave you this
distribution, so we need to figure out ways of learning the mapping from x to y given only access
to a training set sampled from D, rather than D itself.
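
As a final sketch, assuming a small made-up discrete distribution D(x, y), the Bayes optimal classifier simply returns the most probable label for each x, and its Bayes error rate is the probability mass it cannot avoid getting wrong:

# Hypothetical joint distribution D(x, y) over two inputs and two labels.
D = {
    ("x1", "spam"): 0.30, ("x1", "ham"): 0.10,
    ("x2", "spam"): 0.15, ("x2", "ham"): 0.45,
}

labels = ["spam", "ham"]

def bayes_optimal(x):
    """Return the label y maximizing D(x, y), i.e. the Bayes optimal prediction."""
    return max(labels, key=lambda y: D.get((x, y), 0.0))

# Bayes error rate: total probability of the labels the optimal classifier gets wrong.
bayes_error = sum(p for (x, y), p in D.items() if y != bayes_optimal(x))
print(bayes_optimal("x1"), bayes_optimal("x2"), bayes_error)  # spam ham 0.25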
