
Bayesian Concept Learning

Machine Learning
INTRODUCTION
• The technique was derived from the work of the 18th century
mathematician Thomas Bayes.
• He developed the foundational mathematical principles, known as
Bayesian methods, which describe the probability of events, and
more importantly, how probabilities should be revised when there is
additional information available.
• Bayesian learning algorithms, like the naive Bayes classifier, are highly
practical approaches to certain types of learning problems as they can
calculate explicit probabilities for hypotheses.
APPLICATIONS
• Text-based classification such as spam or junk mail filtering, author
identification, or topic categorization.
• Medical diagnosis, such as identifying the probability that a new patient has a disease, given a set of observed symptoms.
• Network security, such as detecting intrusions or anomalies in computer networks.
BAYES’ THEOREM
• Concept Learning?
• Let us take an example of how a child starts to learn the meaning of new words, e.g. ‘ball’.
• Positive Examples
• Negative Examples
• Let us define a concept set C and a corresponding function f(k). We also
define f(k) = 1, when k is within the set C and f(k) = 0 otherwise. Our aim is
to learn the indicator function f that defines which elements are within the
set C.
• Using Bayes’ theorem, we will see how standard probability calculus can be used to describe the uncertainty about the function f, and we can validate the classification by feeding in positive examples.
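A minimal sketch of the indicator function f in Python (the concept set C and the example elements are illustrative assumptions, not from the slides):

# Concept set C: things the child would call a 'ball' (hypothetical examples).
C = {"tennis ball", "football", "cricket ball"}

def f(k):
    # Indicator function: 1 when k is within the set C, 0 otherwise.
    return 1 if k in C else 0

print(f("football"))  # 1 -> positive example
print(f("banana"))    # 0 -> negative example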
BAYES’ THEOREM
• Bayes’ probability rule is given as:

  P(A|B) = P(B|A) · P(A) / P(B)

where A and B are conditionally related events and P(A|B) denotes the probability of event A occurring when event B has already occurred.
• Let us assume that we have a training data set T in which we have noted some observed data. Our task is to determine the best hypothesis in the hypothesis space H by using the knowledge of T.
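A minimal sketch of Bayes’ rule as a small Python helper (the function name and the example numbers are illustrative assumptions):

def bayes_posterior(prior_a, likelihood_b_given_a, evidence_b):
    # P(A|B) = P(B|A) * P(A) / P(B)
    return likelihood_b_given_a * prior_a / evidence_b

# Example: P(A) = 0.01, P(B|A) = 0.9, P(B) = 0.05  ->  P(A|B) = 0.18
print(bayes_posterior(0.01, 0.9, 0.05))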
PRIOR (the probability before the evidence
is considered)
• The prior knowledge or belief about the probabilities of various
hypotheses in H is called Prior in context of Bayes’ theorem.
• For example, if we have to determine whether a particular type of
tumor is malignant for a patient, the prior knowledge of such tumors
becoming malignant can be used to validate our current hypothesis
and is a prior probability or simply called Prior.
• We will assume that P(h) is the initial probability of the hypothesis ‘h’ that the patient has a malignant tumour, based only on the background knowledge of such tumours, without considering the result of the malignancy test or the correctness of the test process (the so-called training data).
POSTERIOR (updated probability after the
evidence is considered)
• The probability that a particular hypothesis holds for a data set based on
the Prior is called the posterior probability or simply Posterior.
• In the above example, the probability of the hypothesis that the patient has a malignant tumour, once the result of the malignancy test (and the known correctness of that test) is taken into account, is a posterior probability.
• In our notation, we will say that we are interested in finding out P(h|T),
which means whether the hypothesis holds true given the observed
training data T.
• So, the prior probability P(h), which represents the probability of the
hypothesis independent of the training data (Prior), now gets refined with
the introduction of influence of the training data as P(h|T).
• According to Bayes’ theorem,

  P(h|T) = P(T|h) · P(h) / P(T)

which combines the prior P(h) and the likelihood P(T|h) to give the posterior P(h|T).
• From the above equation, we can deduce that P(h|T) increases as P(h) and P(T|h) increase and also as P(T) decreases.
Likelihood (probability of the evidence,
given the belief is true)
• The term "probability" refers to the possibility of something happening.
The term Likelihood refers to the process of determining the best data
distribution given a specific situation in the data.
• If every hypothesis in H is equally probable a priori, i.e. P(h_i) = P(h_j) for all h_i and h_j in H, then we can determine the most probable h from the likelihood P(T|h) alone.
• Similarly, P(T) is the prior probability that the training data will be observed
or, in this case, the probability of positive malignancy test results. We will
denote P(T|h) as the probability of observing data T in a space where ‘h’
holds true, which means the probability of the test results showing a
positive value when the tumour is actually malignant.
• Thus, P(T|h) is called the likelihood of data T given h, and any hypothesis that maximizes P(T|h) is called the maximum likelihood (ML) hypothesis, h_ML.
• See Figures 6.1 and 6.2 for the conceptual and mathematical representation of Bayes’ theorem and the relationship between Prior, Posterior and Likelihood.
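A minimal sketch contrasting the ML choice (maximize P(T|h)) with the MAP choice used on the following slides (maximize P(T|h) · P(h)); the hypothesis names and numbers are illustrative assumptions that anticipate the tumour example below:

# Assumed priors P(h) and likelihoods P(T|h) for two hypothetical hypotheses.
hypotheses = {
    "h1": {"prior": 0.005, "likelihood": 0.98},
    "h2": {"prior": 0.995, "likelihood": 0.03},
}

# ML hypothesis: maximizes the likelihood P(T|h), ignoring the prior.
h_ml = max(hypotheses, key=lambda h: hypotheses[h]["likelihood"])

# MAP hypothesis: maximizes P(T|h) * P(h); P(T) is a common factor and can be dropped.
h_map = max(hypotheses, key=lambda h: hypotheses[h]["likelihood"] * hypotheses[h]["prior"])

print(h_ml, h_map)  # h1 h2 -> a strong prior can overturn the likelihood-only choice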
EXAMPLE
• Malignancy identification in a particular patient’s tumor as an application
for Bayes rule.
• We will calculate how the prior knowledge of the percentage of cancer
cases in a sample population and probability of the test result being correct
influence the probability outcome of the correct diagnosis.
• We have two alternative hypotheses:
(1) a particular tumour is of malignant type and
(2) a particular tumour is of non-malignant type.
• The prior knowledge available is that only 0.5% of the population has this kind of tumour which is malignant.
• The laboratory report has some amount of incorrectness: it detects the malignancy when it is actually present with only 98% accuracy, and it correctly reports that malignancy is not present in only 97% of cases. This means the test misses a real malignant tumour in 2% of the cases, and raises a false alarm (predicts a malignancy that is actually not present) in 3% of the cases.
• Let us denote Malignant Tumour = MT, Positive Lab Test = PT,
Negative Lab Test = NT
• h1 = the particular tumour is of malignant type = MT in our example
• h2 = the particular tumour is of non-malignant type = !MT in our example

P(MT) = 0.005      P(!MT) = 0.995
P(PT|MT) = 0.98    P(NT|MT) = 0.02
P(NT|!MT) = 0.97   P(PT|!MT) = 0.03
• So, for the new patient, if the laboratory test report shows a positive result, let us see whether we should declare this as a malignancy case or not. Comparing the two hypotheses using Bayes’ theorem (P(PT) is common to both and can be ignored for the comparison):

  P(h1|PT) ∝ P(PT|MT) · P(MT) = 0.98 × 0.005 = 0.0049
  P(h2|PT) ∝ P(PT|!MT) · P(!MT) = 0.03 × 0.995 = 0.0299

• As P(h2|PT) is higher than P(h1|PT), it is clear that the hypothesis h2 has a higher probability of being true. So, h_MAP = h2 = !MT.
• This indicates that even though the positive test result raises the probability of malignancy far above its prior (normalizing the two quantities above gives P(h1|PT) ≈ 0.14 and P(h2|PT) ≈ 0.86), the probability of this patient not having malignancy is still higher, on the basis of the prior knowledge.
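A minimal sketch of this calculation in Python, using the numbers above (the variable names are illustrative):

# Prior and test characteristics from the example above.
p_mt = 0.005              # P(MT): prior probability of a malignant tumour
p_not_mt = 0.995          # P(!MT)
p_pt_given_mt = 0.98      # P(PT|MT): positive test when the tumour is malignant
p_pt_given_not_mt = 0.03  # P(PT|!MT): false alarm rate

# Unnormalized posteriors P(h|PT) ∝ P(PT|h) * P(h).
score_h1 = p_pt_given_mt * p_mt          # 0.0049
score_h2 = p_pt_given_not_mt * p_not_mt  # ~0.0299

# Normalize by P(PT) = score_h1 + score_h2 to get the true posteriors.
p_pt = score_h1 + score_h2
print(score_h1 / p_pt, score_h2 / p_pt)  # ~0.141 and ~0.859 -> h_MAP = !MT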
BAYES’ THEOREM AND CONCEPT LEARNING
• If we feed the machine with the training data, then it can calculate the posterior probability of each hypothesis and output the most probable hypothesis.
• This is also called brute-force Bayesian learning algorithm.
Brute-force Bayesian algorithm
• Let us assume that the learner considers a finite hypothesis space H
in which the learner will try to learn some target concept c:X → {0,1}
where X is the instance space corresponding to H.
• The sequence of training examples is {(x_1, t_1), (x_2, t_2), …, (x_m, t_m)}, where x_i is an instance of X and t_i is the target value of x_i, defined as t_i = c(x_i).
• We can assume that the sequence of instances {x_1, …, x_m} is held fixed; then the sequence of target values becomes T = {t_1, …, t_m}.
• For calculating the highest posterior probability, we can use Bayes’ theorem.
• Calculate the posterior probability of each hypothesis h in H:

  P(h|T) = P(T|h) · P(h) / P(T)

• Identify h_MAP, the hypothesis with the highest posterior probability:

  h_MAP = argmax over h in H of P(h|T)
• Let us try to connect the concept learning problem with the problem of identifying h_MAP.
• The choice of the probability distributions P(h) and P(T|h) encodes the prior knowledge we bring to the learning task.
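A minimal sketch of the brute-force MAP learner over a finite hypothesis space, assuming P(h) and P(T|h) are supplied as functions (all names below are illustrative):

def brute_force_map(hypotheses, prior, likelihood, T):
    # Unnormalized posteriors P(T|h) * P(h) for every h in H.
    scores = {h: likelihood(T, h) * prior(h) for h in hypotheses}
    # P(T) is obtained by total probability: the sum over all hypotheses.
    p_T = sum(scores.values())
    # Normalized posteriors P(h|T) and the MAP hypothesis.
    posterior = {h: s / p_T for h, s in scores.items()}
    h_map = max(posterior, key=posterior.get)
    return h_map, posterior

With the uniform prior and 0/1 likelihood described on the following slides, this reduces to picking any hypothesis that is consistent with T.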
• There are a few important assumptions to be made, as follows:
1. The training data or target sequence T is noise free, which means that it is a direct function of X only (i.e. t_i = c(x_i)).
2. The concept c lies within the hypothesis space H.
3. Each hypothesis is equally probable and independent of the others.
• On the basis of assumption 3, each hypothesis h within the space H has equal prior probability, and because of assumption 2, these prior probabilities sum up to 1. So, we can write

  P(h) = 1 / |H|   for each h in H
• P(T|h) is the probability of observing the target values T = {t_1, …, t_m} for the fixed set of instances {x_1, …, x_m} in a space where h holds true and describes the concept c correctly.
• Using assumption 1 mentioned above, we can say that the probability of data T given the hypothesis h is 1 if T is consistent with h, and 0 otherwise:

  P(T|h) = 1   if t_i = h(x_i) for every t_i in T
  P(T|h) = 0   otherwise
• Using Bayes’ theorem to identify the posterior probability of each hypothesis:

  P(h|T) = P(T|h) · P(h) / P(T)                                 (Eq. 1)

• For the cases when h is inconsistent with the training data T, using Eq. 1 we get

  P(h|T) = 0 · (1/|H|) / P(T) = 0

and when h is consistent with T,

  P(h|T) = 1 · (1/|H|) / P(T)
• Now, if we define the subset of the hypothesis space H which is consistent with T as H_T, then by using the total probability equation, we get

  P(T) = Σ_{h_i in H} P(T|h_i) · P(h_i)
       = Σ_{h_i in H_T} 1 · (1/|H|)  +  Σ_{h_i not in H_T} 0 · (1/|H|)
       = |H_T| / |H|

• This makes Eq. 1, for a hypothesis h consistent with T,

  P(h|T) = (1/|H|) / (|H_T| / |H|) = 1 / |H_T|
• So, with our set of assumptions about P(h) and P(T|h), we get the posterior probability P(h|T) as

  P(h|T) = 1 / |H_T|   if h is consistent with T
  P(h|T) = 0           otherwise

where |H_T| is the number of hypotheses from the space H which are consistent with the target data set T.
• The interpretation of this evaluation is that initially, each hypothesis
has equal probability and, as we introduce the training data, the
posterior probability of inconsistent hypotheses becomes zero and
the total probability that sums up to 1 is distributed equally among
the consistent hypotheses in the set. So, under this condition, each consistent hypothesis is a MAP hypothesis, with posterior probability 1/|H_T|.
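A minimal sketch of this result on a tiny, hypothetical hypothesis space: with a uniform prior and noise-free data, every hypothesis consistent with T gets posterior 1/|H_T| and the rest get 0 (the toy space and names are illustrative assumptions):

from itertools import product

# Toy instance space with 3 instances; H contains all 2**3 = 8 boolean labelings.
X = ["x1", "x2", "x3"]
H = [dict(zip(X, labels)) for labels in product([0, 1], repeat=3)]

# Noise-free training data T: observed target values for some of the instances.
T = {"x1": 1, "x2": 0}

def consistent(h, T):
    # P(T|h) is 1 when h agrees with every observed target value, else 0.
    return all(h[x] == t for x, t in T.items())

H_T = [h for h in H if consistent(h, T)]   # the consistent subset H_T
posterior = [1 / len(H_T) if consistent(h, T) else 0.0 for h in H]

print(len(H_T))    # 2 consistent hypotheses
print(posterior)   # each consistent hypothesis gets 0.5, all others 0.0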
Bayes optimal classifier
• What is the most probable classification of the new instance given
the training data?
• To illustrate the concept, let us assume three hypotheses h1, h2, and
h3 in the hypothesis space H. Let the posterior probability of these
hypotheses be 0.4, 0.3, and 0.3, respectively.
• There is a new instance x, which is classified as true by h1, but false
by h2 and h3.
• Then the most probable classification of the new instance x can be obtained by combining the predictions of all hypotheses, weighted by their corresponding posterior probabilities.
• By denoting a possible classification of the new instance as c_j from the set C, the probability P(c_j|T) that the correct classification for the new instance is c_j is

  P(c_j|T) = Σ_{h_i in H} P(c_j|h_i) · P(h_i|T)

• The optimal classification is the c_j for which P(c_j|T) is maximum:

  c_optimal = argmax over c_j in C of Σ_{h_i in H} P(c_j|h_i) · P(h_i|T)

• So, extending the above example:
• The set of possible outcomes for the new instance x is within the set C = {True, False}, and

  P(True|T) = 1 × 0.4 + 0 × 0.3 + 0 × 0.3 = 0.4
  P(False|T) = 0 × 0.4 + 1 × 0.3 + 1 × 0.3 = 0.6

so the Bayes optimal classification of the new instance x is False.
• This method maximizes the probability that the new instance is
classified correctly when the available training data, hypothesis space
and the prior probabilities of the hypotheses are known.
• This is thus also called the Bayes optimal classifier.
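A minimal sketch of this combination rule, using the three hypotheses and posterior probabilities from the example above (the dictionary structure and function name are illustrative assumptions):

# Posteriors P(h|T) and each hypothesis's classification of the new instance x.
posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
prediction = {"h1": True, "h2": False, "h3": False}

def bayes_optimal(posterior, prediction, classes=(True, False)):
    # For each class c, sum P(h|T) over the hypotheses that predict c,
    # then return the class with the maximum total.
    score = {c: sum(p for h, p in posterior.items() if prediction[h] == c)
             for c in classes}
    return max(score, key=score.get), score

print(bayes_optimal(posterior, prediction))
# (False, {True: 0.4, False: 0.6}) -> the Bayes optimal classification is False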
Naïve Bayes classifier
• Already Covered in Supervised Learning Chapter.
