Unit 3: Bayesian Concept Learning
• Let us assume that we have a training data set D in which we have recorded some observed data. Our task is to determine the best hypothesis in the hypothesis space H by using the knowledge of D.
Prior (knowledge)
• The prior knowledge or belief about the probabilities of various hypotheses in H
is called Prior in context of Bayes’ theorem.
• For example, if we have to determine whether a particular type of tumour is
malignant for a patient, the prior knowledge of such tumours becoming
malignant can be used to validate our current hypothesis and is a prior
probability or simply called Prior.
• We will assume that P(h) is the initial probability of the hypothesis ‘h’ that the patient has a malignant tumour, based only on background knowledge of such tumours and before considering the observed test data.
• P(T) is the prior probability that the training data will be observed or, in this case,
the probability of positive malignancy test results.
• We will denote P(T|h) as the probability of observing data T in a space where ‘h’
holds true, which means the probability of the test results showing a positive
value when the tumour is actually malignant.
Posterior
• The probability that a particular hypothesis holds for a data set based on
the Prior is called the posterior probability or simply Posterior.
• In the above example, the probability of the hypothesis that the patient
has a malignant tumour considering the Prior of correctness of the
malignancy test is a posterior probability.
• In our notation, we are interested in finding out P(h|T), the probability that the hypothesis holds true given the observed training data T. This is called the posterior probability, or simply the Posterior, in machine learning language.
• So, the prior probability P(h), which represents the probability of the
hypothesis independent of the training data (Prior), now gets refined with
the introduction of influence of the training data as P(h|T).
According to Bayes’ theorem
• The equation below combines the prior and the likelihood to give the posterior probability:
P(h|T) = P(T|h) P(h) / P(T)
• We can deduce that P(h|T) increases as P(h) and P(T|h) increase, and also as P(T) decreases.
• The simple explanation is that the more probable it is that T occurs independently of h, the less support T provides for h.
Bayes’ Theorem
• Goal: To determine the most probable hypothesis, given the data T plus
any initial knowledge about the prior probabilities of the various
hypotheses in H.
• Prior probability of h, P(h): it reflects any background knowledge we have
about the chance that h is a correct hypothesis (before having observed
the data).
• Prior probability of T, P(T): it reflects the probability that training data T
will be observed given no knowledge about which hypothesis h holds.
• Conditional Probability of observation T, P(T|h): it denotes the probability
of observing data T given some world in which hypothesis h holds.
Bayes’ Theorem
• Posterior probability of h, P(h|T): it represents the probability that h
holds given the observed training data T. It reflects our confidence
that h holds after we have seen the training data T and it is the
quantity that Machine Learning researchers are interested in.
• Bayes Theorem allows us to compute P(h|T):
P(h|T)=P(T|h)P(h)/P(T)
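• As a quick illustration, the same computation in Python (the numbers here are invented purely for the example):

def posterior(p_T_given_h, p_h, p_T):
    # Bayes' theorem: P(h|T) = P(T|h) * P(h) / P(T)
    return p_T_given_h * p_h / p_T

print(posterior(0.9, 0.1, 0.2))  # 0.45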
Maximum A Posteriori (MAP)
Hypothesis and Maximum Likelihood
• Goal: To find the most probable hypothesis h from a set of candidate
hypotheses H given the observed data T. This maximally probable
hypothesis is called the maximum a posteriori (MAP) hypothesis.
• MAP Hypothesis, hMAP = argmax h∈H P(h|T)
= argmax h∈H P(T|h) P(h) / P(T)
= argmax h∈H P(T|h) P(h)
• If every hypothesis in H is equally probable a priori, we only need to consider the likelihood of the data T given h, P(T|h). Then hMAP reduces to the Maximum Likelihood hypothesis,
hML = argmax h∈H P(T|h)
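• A minimal sketch of selecting hMAP and hML over a toy hypothesis space (the priors and likelihoods below are invented for illustration):

# Hypothetical priors P(h) and likelihoods P(T|h) for three candidate hypotheses
priors      = {"h1": 0.7, "h2": 0.2, "h3": 0.1}
likelihoods = {"h1": 0.1, "h2": 0.6, "h3": 0.9}

# hMAP maximizes P(T|h) * P(h); P(T) is the same for every h, so it is dropped
h_map = max(priors, key=lambda h: likelihoods[h] * priors[h])
# hML maximizes the likelihood P(T|h) alone (the uniform-prior special case)
h_ml = max(priors, key=lambda h: likelihoods[h])
print(h_map, h_ml)  # h2 h3 -- a non-uniform prior can change the winner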
Some Results from the Analysis of Learners in
a Bayesian Framework
• If P(h)=1/|H| and if P(T|h)=1 if T is consistent with h, and 0
otherwise, then every hypothesis in the version space resulting from
T is a MAP hypothesis.
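• To see why: P(T) = Σ h∈H P(T|h)P(h) = |VS|/|H|, where |VS| is the number of hypotheses consistent with T. So for any consistent h, P(h|T) = (1 × 1/|H|) / (|VS|/|H|) = 1/|VS|, the same maximal value for every member of the version space.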
• Under certain assumptions regarding noise in the data, minimizing
the mean squared error (what common neural nets do) corresponds
to computing the maximum likelihood hypothesis.
• When using a certain representation for hypotheses, choosing the
smallest hypotheses corresponds to choosing MAP hypotheses (An
attempt at justifying Occam’s razor)
Example
• We will calculate how the prior knowledge of the percentage of cancer
cases in a sample population and probability of the test result being correct
influence the probability outcome of the correct diagnosis.
• We have two alternative hypotheses:
• (1) a particular tumour is of malignant type and
• (2) a particular tumour is non-malignant type.
• The prior knowledge available is:
• only 0.5% of the population has this kind of tumour which is malignant,
• the laboratory report has some amount of incorrectness: it detects malignancy, when actually present, in only 98% of cases, and correctly reports that malignancy is absent in only 97% of cases.
• This means the test raises a false alarm, predicting malignancy that is not actually there, in 3% of the cases, and misses detecting a real malignant tumour in 2% of the cases.
Solution
• Let us denote Malignant Tumour = MT, Positive Lab Test = PT,
Negative Lab Test = NT
• h1 = the particular tumour is of malignant type = MT in our example
• h2 = the particular tumour is not malignant type = !MT in our example
• P(MT) = 0.005      P(!MT) = 0.995
• P(PT|MT) = 0.98    P(PT|!MT) = 0.03
• P(NT|MT) = 0.02    P(NT|!MT) = 0.97
Solution
• For the new patient, the laboratory test report shows a positive result. Let us see whether we should declare this a malignancy case or not.
Solution
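• Applying Bayes' theorem with the priors listed above (the two products are unnormalized posteriors; the common denominator P(PT) cancels when comparing them):
P(h1|PT) ∝ P(PT|MT) P(MT) = 0.98 × 0.005 = 0.0049
P(h2|PT) ∝ P(PT|!MT) P(!MT) = 0.03 × 0.995 = 0.0298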
• As P(h2|PT) is higher than P(h1|PT), the hypothesis h2 is more probable. So, hMAP = h2 = !MT.
• This indicates that even though the likelihood of a positive test result is much higher when the tumour is malignant, the probability that this patient does not have a malignant tumour is still higher, on the basis of the prior knowledge that such tumours are rare.
Naïve Bayesian Classification
• It is based on Bayes' theorem and is particularly suited when the dimensionality of the inputs is high. Parameter estimation for naïve Bayes models uses the method of maximum likelihood. In spite of its over-simplified assumptions, it often performs well in many complex real-world situations.
• Advantage: Requires a small amount of training data to estimate the
parameters
Naïve Bayesian Classification
• Derivation:
• D: a set of training tuples, where each tuple is an n-dimensional attribute vector X = (x1, x2, x3, …, xn)
• Let there be m classes: C1, C2, C3, …, Cm
• The naïve Bayes classifier predicts that X belongs to class Ci iff
P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i
• Maximum posteriori hypothesis:
P(Ci|X) = P(X|Ci) P(Ci) / P(X)
• Maximize P(X|Ci) P(Ci), as P(X) is constant across classes
Naïve Bayesian Classification
• Bayes classification:
P(C|X) ∝ P(X|C) P(C) = P(X1, …, Xn|C) P(C)
• Difficulty: learning the joint probability P(X1, …, Xn|C)
• MAP classification rule: assign x = (x1, …, xn) to class c* if
[P(x1|c*) ⋯ P(xn|c*)] P(c*) > [P(x1|c) ⋯ P(xn|c)] P(c), for all c ≠ c*, c = c1, …, cL
Naïve Bayesian Classification
• As the combined probability of the attributes that fully define the new instance, P(X), is the same for every class, it can be dropped when comparing classes.
• The naïve Bayes classifier makes a simple assumption that the attribute values are conditionally independent of each other given the target value. Applying this simplification, for a target value cj of an instance, the probability of observing the combination a1, a2, …, an is the product of the probabilities of the individual attributes:
P(a1, a2, …, an | cj) = P(a1|cj) P(a2|cj) ⋯ P(an|cj)
Example
• Learning phase: conditional probabilities

Outlook      Play=Yes   Play=No
Sunny        2/9        3/5
Overcast     4/9        0/5
Rain         3/9        2/5

Temperature  Play=Yes   Play=No
Hot          2/9        2/5
Mild         4/9        2/5
Cool         3/9        1/5

Humidity     Play=Yes   Play=No
High         3/9        4/5
Normal       6/9        1/5

Wind         Play=Yes   Play=No
Strong       3/9        3/5
Weak         6/9        2/5
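• These fractions are simple relative counts. As a sketch, the first entry can be reproduced in Python, assuming the classic 14-example PlayTennis dataset that these counts appear to come from (an assumption, since the slide shows only the resulting fractions):

# The classic 14-example PlayTennis dataset, as (Outlook, Temperature, Humidity, Wind, Play)
data = [
    ("Sunny", "Hot", "High", "Weak", "No"),          ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),      ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),       ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),      ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),    ("Rain", "Mild", "High", "Strong", "No"),
]

# P(Outlook=Sunny | Play=Yes) = count(Sunny and Yes) / count(Yes)
yes_rows = [row for row in data if row[4] == "Yes"]
sunny_yes = sum(row[0] == "Sunny" for row in yes_rows)
print(sunny_yes, "/", len(yes_rows))  # 2 / 9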
Example
• Test Phase
– Given a new instance,
x’=(Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
– Look up tables
P(Outlook=Sunny|Play=Yes) = 2/9        P(Outlook=Sunny|Play=No) = 3/5
P(Temperature=Cool|Play=Yes) = 3/9     P(Temperature=Cool|Play=No) = 1/5
P(Humidity=High|Play=Yes) = 3/9        P(Humidity=High|Play=No) = 4/5
P(Wind=Strong|Play=Yes) = 3/9          P(Wind=Strong|Play=No) = 3/5
P(Play=Yes) = 9/14                     P(Play=No) = 5/14
– MAP rule
P(Yes|x’): [P(Sunny|Yes) P(Cool|Yes) P(High|Yes) P(Strong|Yes)] P(Play=Yes) = 0.0053
P(No|x’): [P(Sunny|No) P(Cool|No) P(High|No) P(Strong|No)] P(Play=No) = 0.0206
Given the fact P(Yes|x’) < P(No|x’), we label x’ to be “No”.
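• The same arithmetic as a small Python check, with the probabilities hard-coded from the lookup tables above:

# Unnormalized posterior scores for x' = (Sunny, Cool, High, Strong)
p_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)
p_no = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)
print(round(p_yes, 4), round(p_no, 4))  # 0.0053 0.0206
print("Play =", "Yes" if p_yes > p_no else "No")  # Play = No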
Previous example
• X = (age = youth, income = medium, student = yes, credit_rating = fair)
• Will a person described by tuple X buy a computer?
Naïve Bayes algorithm
• Steps to implement:
1. Data Pre-processing step
2. Fitting Naive Bayes to the Training set
3. Predicting the test result
4. Test accuracy of the result (creation of confusion matrix)
5. Predict class for unknown data
1) Data Pre-processing step
Importing the libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
# Importing the dataset; selecting feature columns and labels by position (.iloc)
dataset = pd.read_csv('user_data.csv')
x = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 0)
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)
2) Fitting Naive Bayes to the Training Set
# Fitting Naive Bayes to the Training set
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(x_train, y_train)
classifier.score(x_test, y_test)
• We have used the GaussianNB classifier to fit the training dataset. We can also use other naïve Bayes classifiers as per our requirement (MultinomialNB / BernoulliNB).
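• For instance, a minimal sketch of swapping in BernoulliNB (assuming the same x_train/y_train as above; BernoulliNB expects binary features, so the scaled inputs are thresholded via binarize):

from sklearn.naive_bayes import BernoulliNB
classifier = BernoulliNB(binarize=0.0)  # binarize=0.0 turns each scaled feature into 0/1
classifier.fit(x_train, y_train)
classifier.score(x_test, y_test)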
3) Prediction of the test set result:
# Predicting the Test set results
y_pred = classifier.predict(x_test)
4) Creating Confusion Matrix:
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
(Figure: confusion matrix, with actual classes along the rows and predicted classes along the columns.)
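• A quick way to read the matrix (a sketch, assuming the binary labels used above; sklearn's confusion_matrix puts actual classes in rows and predicted classes in columns):

print(cm)  # rows = actual, columns = predicted
accuracy = (cm[0, 0] + cm[1, 1]) / cm.sum()  # correct predictions / all predictions
print(accuracy)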
Bayesian Networks
• A Bayesian network specifies a joint distribution in a structured form
• General form:
P(A, B, C) = P(C|A, B) P(A) P(B)
(Diagram: nodes A and B each with a directed edge into C.)
• Absolute independence (no edges between A, B, C): p(A,B,C) = p(A) p(B) p(C)
Examples of 3-way Bayesian Networks
• Conditionally independent effects (diagram: A → B, A → C):
p(A,B,C) = p(B|A) p(C|A) p(A)
B and C are conditionally independent given A
• Markov dependence (diagram: A → B → C):
p(A,B,C) = p(C|B) p(B|A) p(A)
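• A minimal sketch of using the Markov-dependence factorization numerically (the probability tables below are made up purely for illustration):

# Hypothetical tables for the chain A -> B -> C
p_A = {True: 0.3, False: 0.7}
p_B_given_A = {(True, True): 0.8, (True, False): 0.1,   # keyed (b, a)
               (False, True): 0.2, (False, False): 0.9}
p_C_given_B = {(True, True): 0.6, (True, False): 0.4,   # keyed (c, b)
               (False, True): 0.4, (False, False): 0.6}

def joint(a, b, c):
    # p(A,B,C) = p(C|B) p(B|A) p(A)
    return p_C_given_B[(c, b)] * p_B_given_A[(b, a)] * p_A[a]

# Sanity check: the eight joint entries sum to 1
print(sum(joint(a, b, c) for a in (True, False)
          for b in (True, False) for c in (True, False)))  # 1.0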
The Alarm Example
• What is P(B | M, J)?
• We can use the full joint distribution to answer this question
• Requires 2^5 = 32 probabilities (five binary variables)
• Can we use prior domain knowledge to come up with a Bayesian network that requires fewer probabilities?
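• As a sketch of the payoff, P(B | J, M) can be computed by enumerating the hidden variables. The CPT values below are the ones commonly used in textbook versions of this example (an assumption, since the slide does not list them):

# Conditional probability tables (B = Burglary, E = Earthquake, A = Alarm,
# J = JohnCalls, M = MaryCalls); 10 numbers instead of a 32-entry joint table.
# These values are assumed from the standard textbook example.
P_B = 0.001
P_E = 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,   # keyed (b, e)
       (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}                   # keyed by a
P_M = {True: 0.70, False: 0.01}

def joint(b, e, a, j, m):
    pb = P_B if b else 1 - P_B
    pe = P_E if e else 1 - P_E
    pa = P_A[(b, e)] if a else 1 - P_A[(b, e)]
    pj = P_J[a] if j else 1 - P_J[a]
    pm = P_M[a] if m else 1 - P_M[a]
    return pb * pe * pa * pj * pm

# P(B=true | J=true, M=true): enumerate E and A, then normalize
num = sum(joint(True, e, a, True, True) for e in (True, False) for a in (True, False))
den = sum(joint(b, e, a, True, True)
          for b in (True, False) for e in (True, False) for a in (True, False))
print(num / den)  # ~0.284 with these numbers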
Constructing a Bayesian Network: Step 1
• Order the variables in terms of causality (may be a
partial order)
• e.g., {E, B} -> {A} -> {J, M}
Bayesian learning
• Prior knowledge of the candidate hypotheses is combined with the observed data to arrive at the final probability of a hypothesis
• More flexible than other approaches, because each observed training example can influence the outcome of the hypothesis by increasing or decreasing its estimated probability
• Performs better than other methods when validating hypotheses that make probabilistic predictions
• It is possible to classify new instances by combining the predictions of
multiple hypotheses, weighted by their respective probabilities.
• They can be used to create a standard for the optimal decision against
which the performance of other methods can be measured