
Bayesian Concept Learning

Unit 3

AML-Prof. Minal Chauhan


Introduction
• Principles of probability for classification underpin an important class of
machine learning algorithms. In our practical life, our decisions are
affected by our prior knowledge or belief about an event.
• Thus, we may seriously consider an event that is otherwise very unlikely to
occur if we know that, in the past, the event had occurred whenever certain
other events were observed.
• The same concept is applied in machine learning using Bayes’
theorem.
Bayes’ Theorem
• Bayesian reasoning is applied to decision making and to inferential
statistics that deal with probability inference. It uses the knowledge of
prior events to predict future events.
• Example: Predicting the color of marbles in a basket.
Applications of Bayesian classifiers
• Text-based classification such as spam or junk mail filtering, author
identification, or topic categorization
• Medical diagnosis such as given the presence of a set of observed
symptoms during a disease, identifying the probability of new
patients having the disease
• Network security such as detecting illegal intrusion or anomaly in
computer networks
Can we use probability to classify?
• The world is a very uncertain place
• Almost 40 years of AI and ML dealing with uncertain domains
• Some researchers decided to employ ideas from probability to model
concepts
• Before saying more, let's go to the beginning…
Thomas Bayes
• Two main works:
• Divine Benevolence, or an Attempt to Prove That the Principal End of the
Divine Providence and Government is the Happiness of His Creatures (1731)
• An Introduction to the Doctrine of Fluxions, and a Defense of the
Mathematicians Against the Objections of the Author of the Analyst
(published anonymously in 1736)
• But we are especially interested in: An Essay Towards Solving a Problem in
the Doctrine of Chances (1764), which was actually published posthumously
by Richard Price
Where These Ideas Came From?
• Bayes built his theory upon several ideas
• Immanuel Kant (1724-1804)
• Copernican revolution: our understanding of the external world had its
foundations not merely in experience, but in both experience and a priori
concepts, thus offering a non-empiricist critique of rationalist philosophy.
• Isaac Newton (1643-1727)
• Universal gravitation
• three laws of motion which dominated the scientific view of the physical
universe for the next three centuries
What Was Bayes’ Point?
• Bayesian probability
• Notion of probability interpreted as partial belief rather than as
frequency
• Bayesian estimation
• Calculate the validity of a proposition
• On the basis of a prior estimate of its probability and new relevant
evidence. For example:
• Before Bayes, forward probability
• given a specified number of white and black balls in an urn, what is the probability of
drawing a black ball?
• Bayes turned his attention to the converse problem
• given that one or more balls have been drawn, what can be said about the
number of white and black balls in the urn?
Introduction
• Bayesian Decision Theory came long before Version Spaces, Decision Tree
Learning and Neural Networks. It was studied in the field of Statistical Theory and
more specifically, in the field of Pattern Recognition.
• Bayesian Decision Theory is at the basis of important learning schemes such as
the Naïve Bayes Classifier, Learning Bayesian Belief Networks and the EM
Algorithm.
• Bayes developed the foundational mathematical principles, known as Bayesian
methods, which describe the probability of events and, more importantly, how
probabilities should be revised when additional information becomes available.
• Bayesian Decision Theory is also useful as it provides a framework within which
many non-Bayesian classifiers can be studied
Introduction
• Bayesian classifiers use a simple idea that the training data are utilized to
calculate an observed probability of each class based on feature values.
• When the same classifier is used later for unclassified data, it uses the
observed probabilities to predict the most likely class for the new features.
• The application of the observations from the training data can also be
thought of as applying our prior knowledge or prior belief to the
probability of an outcome, so that it has a higher probability of matching
the actual or real-life outcome.
• This simple concept is used in Bayes’ rule and applied for training a
machine in machine learning terms.
Bayes’ Theorem
• Before we learn this, we should be clear about what is concept learning.
• Let us take an example of how a child starts to learn meaning of new
words, e.g. ‘ball’.
• The child is provided with positive examples of ‘objects’ which are ‘ball’. At
first, the child may be confused with many different colours, shapes and
sizes of the balls and may also get confused with some objects which look
similar to ball, like a balloon or a globe.
• The child’s parent continuously feeds her positive examples like ‘that is a
ball’, ‘this is a green ball’, ‘bring me that small ball’, etc.
• Negative examples are seldom used for such concept teaching, like
'this is a non-ball', but the parent may clear the child's confusion, when she
points to a balloon and calls it a ball, by saying 'that is not a ball'.
Bayes’ Theorem
• But it is observed that the learning is most influenced through
positive examples rather than through negative examples, and the
expectation is that the child will be able to identify the object ‘ball’
from a wide variety of objects and different types of balls kept
together once the concept of a ball is clear to her.
• We can extend this example to explain how we can expect machines
to learn through the feeding of positive examples, which forms the
basis for concept learning.
Bayes’ Theorem
• Let us define a concept set C and a corresponding function f(k). We also
define f(k) = 1, when k is within the set C and f(k) = 0 otherwise.
• Goal: Our aim is to learn the indicator function f that defines which
elements are within the set C. So, by using the function f, we will be able to
classify the element either inside or outside our concept set.
• We use standard probability calculus to determine the uncertainty about
the function f.
• Bayes' probability rule is given as:
P(h|D) = P(D|h) P(h) / P(D)
• Let us assume that we have a training data set D where we have noted
some observed data. Our task is to determine the best hypothesis in space
H by using the knowledge of D.
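
A tiny Python sketch of the indicator-function idea, using a made-up concept set C (the example objects are hypothetical, chosen only for illustration):

# A toy concept set C and its indicator function f
C = {"red ball", "green ball", "small ball"}

def f(k):
    # f(k) = 1 when k is within the set C, and f(k) = 0 otherwise
    return 1 if k in C else 0

print(f("green ball"))  # 1 -> inside the concept set
print(f("balloon"))     # 0 -> outside the concept set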
Prior (knowledge)
• The prior knowledge or belief about the probabilities of various hypotheses in H
is called Prior in context of Bayes’ theorem.
• For example, if we have to determine whether a particular type of tumour is
malignant for a patient, the prior knowledge of such tumours becoming
malignant can be used to validate our current hypothesis and is a prior
probability or simply called Prior.
• We will assume that P(h) is the initial probability of a hypothesis ‘h’ that the
patient has a malignant tumour based only on the malignancy test, without
considering the prior knowledge of the correctness of the test process
• P(T) is the prior probability that the training data will be observed or, in this case,
the probability of positive malignancy test results.
• We will denote P(T|h) as the probability of observing data T in a space where ‘h’
holds true, which means the probability of the test results showing a positive
value when the tumour is actually malignant.
Posterior
• The probability that a particular hypothesis holds for a data set based on
the Prior is called the posterior probability or simply Posterior.
• In the above example, the probability of the hypothesis that the patient
has a malignant tumour considering the Prior of correctness of the
malignancy test is a posterior probability.
• In our notation, we will say that we are interested in finding out P(h|T),
which means whether the hypothesis holds true given the observed
training data T. This is called the posterior probability or simply Posterior in
machine learning language.
• So, the prior probability P(h), which represents the probability of the
hypothesis independent of the training data (Prior), now gets refined with
the introduction of influence of the training data as P(h|T).
According to Bayes’ theorem
• The equation below relates the prior and the posterior probability:

P(h|T) = P(T|h) P(h) / P(T)
• We can deduce that P(h|T) increases as P(h) and P(T|h) increase, and also
as P(T) decreases.
• The simple explanation is that the more probable it is for T to occur
independently of h, the less support the occurrence of T lends to h.
Bayes’ Theorem
• Goal: To determine the most probable hypothesis, given the data T plus
any initial knowledge about the prior probabilities of the various
hypotheses in H.
• Prior probability of h, P(h): it reflects any background knowledge we have
about the chance that h is a correct hypothesis (before having observed
the data).
• Prior probability of T, P(T): it reflects the probability that training data T
will be observed given no knowledge about which hypothesis h holds.
• Conditional Probability of observation T, P(T|h): it denotes the probability
of observing data T given some world in which hypothesis h holds.
Bayes’ Theorem
• Posterior probability of h, P(h|T): it represents the probability that h
holds given the observed training data T. It reflects our confidence
that h holds after we have seen the training data T and it is the
quantity that Machine Learning researchers are interested in.
• Bayes Theorem allows us to compute P(h|T):

P(h|T)=P(T|h)P(h)/P(T)
Maximum A Posteriori (MAP)
Hypothesis and Maximum Likelihood
• Goal: To find the most probable hypothesis h from a set of candidate
hypotheses H given the observed data T. This maximally probable
hypothesis is called the maximum a posteriori (MAP) hypothesis.
• MAP Hypothesis, hMAP = argmax h∈H P(h|T)
                       = argmax h∈H P(T|h) P(h) / P(T)
                       = argmax h∈H P(T|h) P(h)
• If every hypothesis in H is equally probable a priori, we only need to
consider the likelihood of the data T given h, P(T|h). Then, hMAP becomes
the Maximum Likelihood hypothesis,
hML = argmax h∈H P(T|h)
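
A minimal sketch of the MAP vs. maximum likelihood distinction, using made-up priors P(h) and likelihoods P(T|h) for two hypothetical hypotheses:

# Hypothetical priors P(h) and likelihoods P(T|h) for two candidate hypotheses
priors = {"h1": 0.7, "h2": 0.3}
likelihoods = {"h1": 0.4, "h2": 0.9}   # P(T|h)

# MAP hypothesis: maximise P(T|h) * P(h)
h_map = max(priors, key=lambda h: likelihoods[h] * priors[h])
# ML hypothesis: maximise P(T|h) only (as if the priors were equal)
h_ml = max(priors, key=lambda h: likelihoods[h])

print(h_map)  # h1, since 0.4 * 0.7 = 0.28 > 0.9 * 0.3 = 0.27
print(h_ml)   # h2, since 0.9 > 0.4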
Some Results from the Analysis of Learners in
a Bayesian Framework
• If P(h)=1/|H| and if P(T|h)=1 if T is consistent with h, and 0
otherwise, then every hypothesis in the version space resulting from
T is a MAP hypothesis.
• Under certain assumptions regarding noise in the data, minimizing
the mean squared error (what common neural nets do) corresponds
to computing the maximum likelihood hypothesis.
• When using a certain representation for hypotheses, choosing the
smallest hypotheses corresponds to choosing MAP hypotheses (An
attempt at justifying Occam’s razor)
Example
• We will calculate how the prior knowledge of the percentage of cancer
cases in a sample population and probability of the test result being correct
influence the probability outcome of the correct diagnosis.
• We have two alternative hypotheses:
• (1) a particular tumour is of malignant type and
• (2) a particular tumour is non-malignant type.
• The priori available are—
• only 0.5% of the population has this kind of tumour which is malignant,
• the laboratory test is somewhat imperfect: it detects malignancy when it is
present with only 98% accuracy, and correctly reports that malignancy is
absent in only 97% of cases.
• This means the test predicted malignancy was present which actually was a false
alarm in 2% of the cases, and also missed detecting the real malignant tumour in 3%
of the cases.
Solution
• Let us denote Malignant Tumour = MT, Positive Lab Test = PT,
Negative Lab Test = NT
• h1 = the particular tumour is of malignant type = MT in our example
• h2 = the particular tumour is not malignant type = !MT in our example
• P(MT) = 0.005 P(!MT) = 0.995
• P(PT|MT) = 0.98 P(PT|!MT) = 0.02
• P(NT|!MT) = 0.97 P(NT|MT) = 0.03
Solution
• For the new patient, the laboratory test report shows a positive result.
Should we declare this a malignancy case or not?
P(h1|PT) ∝ P(PT|MT) P(MT) = 0.98 × 0.005 = 0.0049
P(h2|PT) ∝ P(PT|!MT) P(!MT) = 0.02 × 0.995 = 0.0199
Solution
• As P(h2 |PT) is higher than P(h1 |PT), it is clear that the hypothesis h2
has more probability of being true. So, hMAP = h2 = !MT.
• This indicates that even though a positive test result is far more likely
when the tumour is malignant than when it is not, the probability of this
patient not having malignancy is still higher on the basis of the prior
knowledge.
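
The same computation can be written as a short Python sketch using the numbers above:

# Priors and test characteristics from the example
P_MT = 0.005              # P(MT): prior probability of a malignant tumour
P_not_MT = 0.995          # P(!MT)
P_PT_given_MT = 0.98      # P(PT|MT): test positive when malignant
P_PT_given_not_MT = 0.02  # P(PT|!MT): false-alarm rate

# Unnormalised posteriors: P(h|PT) is proportional to P(PT|h) * P(h)
score_h1 = P_PT_given_MT * P_MT          # 0.0049
score_h2 = P_PT_given_not_MT * P_not_MT  # 0.0199

# Normalising by P(PT) = 0.0049 + 0.0199 gives the actual posteriors
P_PT = score_h1 + score_h2
print(score_h1 / P_PT)   # P(MT|PT)  is approximately 0.20
print(score_h2 / P_PT)   # P(!MT|PT) is approximately 0.80, so hMAP = h2 = !MT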
Naïve Bayesian Classification
• It is based on Bayes' theorem. It is particularly suited when the
dimensionality of the inputs is high. Parameter estimation for naïve
Bayes models uses the method of maximum likelihood. In spite of its over-
simplified assumptions, it often performs well in many complex real-world
situations.
• Advantage: Requires a small amount of training data to estimate the
parameters
Naïve Bayesian Classification
• Derivation:
• D : set of training tuples; each tuple is an n-dimensional attribute vector
X : (x1, x2, x3, …, xn)
• Let there be m classes: C1, C2, C3, …, Cm
• The Naïve Bayes classifier predicts that X belongs to class Ci iff
P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i
• Maximum a posteriori hypothesis:
P(Ci|X) = P(X|Ci) P(Ci) / P(X)
Maximize P(X|Ci) P(Ci), as P(X) is constant
Naïve Bayesian Classification
• Bayes classification
P(C|X) ∝ P(X|C) P(C) = P(X1, …, Xn|C) P(C)
Difficulty: learning the joint probability P(X1, …, Xn|C)
• Naïve Bayes classification
Making the assumption that all input attributes are conditionally independent
given the class:
P(X1, X2, …, Xn|C) = P(X1|X2, …, Xn; C) P(X2, …, Xn|C)
                   = P(X1|C) P(X2, …, Xn|C)
                   = P(X1|C) P(X2|C) … P(Xn|C)
• MAP classification rule: assign x = (x1, …, xn) to class c* if
[P(x1|c*) … P(xn|c*)] P(c*) > [P(x1|c) … P(xn|c)] P(c), for all c ≠ c*, c = c1, …, cL
Naïve Bayesian Classification
• Since the combined probability of the attributes defining the new instance,
P(a1, a2, …, an), is the same for every target value, it can be dropped from
the comparison.
• So, to get the most probable classification, we have to evaluate the two
terms P(a1, a2, …, an|ci) and P(ci):
cMAP = argmax ci∈C P(a1, a2, …, an|ci) P(ci)
• In a practical scenario, it is possible to calculate P(ci) by calculating
the frequency of each target value ci in the training data set. But
P(a1, a2, …, an|ci) cannot be estimated easily and needs a very high effort
of calculation.
• The Naïve Bayes classifier makes a simple assumption that the attribute
values are conditionally independent of each other given the target value.
So, applying this simplification, we can say that for a target value of an
instance, the probability of observing the combination a1, a2, …, an is the
product of the probabilities of the individual attributes, P(ai|cj).

• Thus we get the approach for the Naïve Bayes classifier as
cNB = argmax cj∈C P(cj) ∏i P(ai|cj)

• Here, we will be able to compute P(ai|cj) easily, as it has to be calculated
only for the number of distinct attribute values (ai) times the number of
distinct target values (cj), which is a much smaller set than the number of
possible attribute-value combinations.
Example
• Example: Play Tennis

Example
• Learning Phase- conditional probabilities:
Outlook    Play=Yes  Play=No     Temperature  Play=Yes  Play=No
Sunny        2/9       3/5       Hot            2/9       2/5
Overcast     4/9       0/5       Mild           4/9       2/5
Rain         3/9       2/5       Cool           3/9       1/5

Humidity   Play=Yes  Play=No     Wind         Play=Yes  Play=No
High         3/9       4/5       Strong         3/9       3/5
Normal       6/9       1/5       Weak           6/9       2/5

Prior probability: P(Play=Yes) = 9/14, P(Play=No) = 5/14
Example
• Test Phase
– Given a new instance,
x’=(Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
– Look up tables
P(Outlook=Sunny|Play=Yes) = 2/9 P(Outlook=Sunny|Play=No) = 3/5
P(Temperature=Cool|Play=Yes) = 3/9 P(Temperature=Cool|Play=No) = 1/5
P(Humidity=High|Play=Yes) = 3/9 P(Humidity=High|Play=No) = 4/5
P(Wind=Strong|Play=Yes) = 3/9 P(Wind=Strong|Play=No) = 3/5
P(Play=Yes) = 9/14 P(Play=No) = 5/14

– MAP rule
P(Yes|x’): [P(Sunny|Yes)P(Cool|Yes)P(High|Yes)P(Strong|Yes)]P(Play=Yes) =
0.0053
P(No|x’): [P(Sunny|No) P(Cool|No)P(High|No)P(Strong|No)]P(Play=No) =
0.0206
Given the fact P(Yes|x’) < P(No|x’), we label x’ to be “No”.
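
The test-phase calculation above can be reproduced directly from the learned lookup tables; a minimal Python sketch:

# Conditional probabilities from the learning phase
p_yes = {"Sunny": 2/9, "Cool": 3/9, "High": 3/9, "Strong": 3/9}
p_no  = {"Sunny": 3/5, "Cool": 1/5, "High": 4/5, "Strong": 3/5}
prior_yes, prior_no = 9/14, 5/14

x = ["Sunny", "Cool", "High", "Strong"]   # the new instance x'

score_yes, score_no = prior_yes, prior_no
for value in x:
    score_yes *= p_yes[value]
    score_no *= p_no[value]

print(round(score_yes, 4))  # 0.0053
print(round(score_no, 4))   # 0.0206
print("Yes" if score_yes > score_no else "No")  # "No"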
Previous example
• X = ( age= youth, income = medium, student = yes, credit_rating =
fair)
• Will a person belonging to tuple X buy a computer?
Previous example
Naïve Bayes algorithm
• Steps to implement:
1. Data Pre-processing step
2. Fitting Naive Bayes to the Training set
3. Predicting the test result
4. Test accuracy of the result (creation of confusion matrix)
5. Predict class for unknown data
1)Data Pre-processing step
Importing the libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
# Importing the dataset , Selecting data by row numbers (.iloc)
dataset = pd.read_csv('user_data.csv')
x = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 0)
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)
2) Fitting Naive Bayes to the Training Set
# Fitting Naive Bayes to the Training set
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(x_train, y_train)
classifier.score(x_test, y_test)
• We have used the GaussianNB classifier to fit the training dataset. We can
also use other Naïve Bayes variants as per our requirement
(MultinomialNB / BernoulliNB), as sketched below.
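
A small illustration with hypothetical count data (not the scaled user_data.csv features above, since MultinomialNB expects non-negative features):

# Sketch: choosing a different Naive Bayes variant for discrete features
import numpy as np
from sklearn.naive_bayes import MultinomialNB, BernoulliNB

X_counts = np.array([[2, 0, 1], [0, 3, 0], [1, 1, 4]])  # e.g. word counts per document
y_labels = np.array([0, 1, 0])

clf = MultinomialNB()   # suited to count features; use BernoulliNB() for binary features
clf.fit(X_counts, y_labels)
print(clf.predict(np.array([[1, 0, 2]])))   # predicts class 0 or 1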
3) Prediction of the test set result:
# Predicting the Test set results
y_pred = classifier.predict(x_test)
4) Creating Confusion Matrix:
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
The resulting confusion matrix (rows = actual class, columns = predicted class;
correct predictions lie on the diagonal):

             predicted 0   predicted 1
actual 0          65             3
actual 1           7            25

• 7 + 3 = 10 incorrect predictions, and 65 + 25 = 90 correct predictions.
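
Accuracy follows directly from the confusion matrix; a short continuation of the code above:

# Accuracy = correct predictions / all predictions
accuracy = (cm[0, 0] + cm[1, 1]) / cm.sum()
print(accuracy)   # 90 correct out of 100 predictions above -> 0.9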


5)Predict class for unknown data
# age, estimated salary (the same two features the classifier was trained on,
# scaled with the fitted StandardScaler; the user ID is not a feature)
test = sc.transform([[25, 33000]])
a = classifier.predict(test)
print(a)
# output: [0]
Brute-force Bayesian algorithm
• Brute force MAP concept learning:
• Calculate the posterior probability of each hypothesis h in H:
P(h|T) = P(T|h) P(h) / P(T)
• Identify the h with the highest posterior probability:
hMAP = argmax h∈H P(h|T)
Brute force MAP concept learning:
• Calculating the posterior probability P(h|T) for each hypothesis requires a
very high volume of computation when the hypothesis space H is large.
• While it is impractical for large hypothesis spaces, the algorithm is still
of interest because it provides a standard solution against which we may
judge the performance of other concept learning algorithms.
• This algorithm says that
P(h|T) = 1 / |VSH,T| if h is consistent with T, and P(h|T) = 0 otherwise
Brute force MAP concept learning:
• Here |VSH,T| is the number of hypotheses from the space H which are
consistent with the training data set T.
• The interpretation of this evaluation is that initially, each hypothesis
has equal probability.
• As we introduce the training data, the posterior probability of
inconsistent hypotheses becomes zero.
• The total probability that sums up to 1 is distributed equally among
the consistent hypotheses in the set.
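
A minimal sketch of brute-force MAP concept learning over a toy hypothesis space (the hypotheses and training examples below are made up purely for illustration):

# Toy hypothesis space: each hypothesis is the set of instances it labels positive
H = {
    "h1": {1, 2, 3},
    "h2": {1, 2},
    "h3": {2, 3, 4},
}
# Observed training data T: (instance, label) pairs
T = [(1, 1), (2, 1), (4, 0)]

def likelihood(h, data):
    # P(T|h) = 1 if h is consistent with every training example, 0 otherwise
    return 1.0 if all((x in H[h]) == bool(y) for x, y in data) else 0.0

prior = 1.0 / len(H)                       # uniform prior P(h) = 1/|H|
unnorm = {h: likelihood(h, T) * prior for h in H}
total = sum(unnorm.values())               # = P(T)
posterior = {h: p / total for h, p in unnorm.items()}

print(posterior)   # {'h1': 0.5, 'h2': 0.5, 'h3': 0.0}
# The inconsistent hypothesis h3 gets probability 0; the consistent ones share
# the probability equally (1/|VS|), and any of them is a MAP hypothesis.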
Consistent learners
• consistent learner: a learning algorithm that outputs a hypothesis
that commits zero errors over the training examples.
• Every consistent learner outputs a MAP hypothesis if we assume:
• A uniform prior probability distribution over H
• Deterministic, noise-free training data
• Example: Find-S outputs the maximally specific consistent hypothesis,
which is a MAP hypothesis.
Concept Learning
• Concept learning can be viewed as the task of searching
through a large space of hypotheses implicitly defined by the
hypothesis representation
• The goal of the concept learning search is to find the hypothesis
that best fits the training examples
Bayes Optimal Classifier
• One great advantage of Bayesian Decision Theory is that it
gives us a lower bound on the classification error that can be
obtained for a given problem.
• Bayes Optimal Classification: The most probable
classification of a new instance is obtained by combining the
predictions of all hypotheses, weighted by their posterior
probabilities:
argmax vj∈V Σ hi∈H P(vj|hi) P(hi|D)
where V is the set of all the values a classification can take and
vj is one possible such classification.
• Unfortunately, Bayes Optimal Classifier is usually too costly
to apply! ==> Naïve Bayes Classifier

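
A small sketch with made-up posteriors P(hi|D), reproducing the classic situation where the Bayes optimal classification disagrees with the single MAP hypothesis (each hypothesis here predicts one value with probability 1):

# Hypothetical posteriors P(h|D) for three hypotheses
posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
# The classification each hypothesis predicts for a new instance
prediction = {"h1": "+", "h2": "-", "h3": "-"}

# Bayes optimal classification: weight each value by the posteriors of the
# hypotheses that predict it, then take the argmax over values.
votes = {}
for h, p in posterior.items():
    v = prediction[h]
    votes[v] = votes.get(v, 0.0) + p

print(votes)                      # {'+': 0.4, '-': 0.6}
print(max(votes, key=votes.get))  # '-' although the MAP hypothesis h1 says '+'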
Bayesian Networks
• A Bayesian network specifies a joint distribution in a structured form

• Represent dependence/independence via a directed graph


• Nodes = random variables
• Edges = direct dependence

• Structure of the graph ⇒ conditional independence relations

• Requires that graph is acyclic (no directed cycles)

• Two components to a Bayesian network


• The graph structure (conditional independence assumptions)
• The numerical probabilities (for each variable given its parents)
Bayesian Networks

• General form:
P(X1, X2, …, XN) = ∏i P(Xi | parents(Xi))
(the full joint distribution on the left; the graph-structured approximation on the right)

Example of a simple Bayesian network
• Graph: A → C ← B (A and B are the parents of C)
• P(A, B, C) = P(C|A, B) P(A) P(B)

• Probability model has simple factored form


• Directed edges => direct dependence
• Absence of an edge => conditional independence

• Also known as belief networks, graphical models, causal networks


• Other formulations, e.g., undirected graphical models
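
A short sketch of the factored form for the A → C ← B network above; the probability tables are illustrative placeholders, not values from the slides:

# Illustrative CPTs for the network A -> C <- B (all variables binary)
P_A = {True: 0.3, False: 0.7}
P_B = {True: 0.6, False: 0.4}
P_C_given_AB = {            # P(C=True | A, B)
    (True, True): 0.9,
    (True, False): 0.5,
    (False, True): 0.4,
    (False, False): 0.1,
}

def joint(a, b, c):
    # P(A=a, B=b, C=c) = P(C|A,B) P(A) P(B): the graph-structured factorisation
    p_c = P_C_given_AB[(a, b)] if c else 1 - P_C_given_AB[(a, b)]
    return p_c * P_A[a] * P_B[b]

print(joint(True, False, True))   # 0.5 * 0.3 * 0.4 = 0.06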
Examples of 3-way Bayesian Networks

• Graph: A, B, C with no edges between them. Absolute independence:
p(A,B,C) = p(A) p(B) p(C)
Examples of 3-way Bayesian Networks
• Conditionally independent effects:
p(A,B,C) = p(B|A) p(C|A) p(A)
• Graph: B ← A → C
• B and C are conditionally independent given A
• e.g., A is a disease, and we model B and C as conditionally independent
symptoms given A
Examples of 3-way Bayesian Networks
• Independent causes:
p(A,B,C) = p(C|A,B) p(A) p(B)
• Graph: A → C ← B
• “Explaining away” effect: A and B are independent but become dependent
once C is known!
• (we’ll come back to this later)
Examples of 3-way Bayesian Networks

Graph: A → B → C. Markov dependence:
p(A,B,C) = p(C|B) p(B|A) p(A)
The Alarm Example

• You have a new burglar alarm installed


• It is reliable about detecting burglary, but responds to minor earthquakes
• Two neighbors (John, Mary) promise to call you at work when they hear the alarm
• John always calls when he hears the alarm, but confuses the alarm with the
phone ringing (and calls then also)
• Mary likes loud music and sometimes misses the alarm!
• Given evidence about who has and hasn’t called, estimate the probability of a
burglary
The Alarm Example

• Represent problem using 5 binary variables:


• B = a burglary occurs at your house
• E = an earthquake occurs at your house
• A = the alarm goes off
• J = John calls to report the alarm
• M = Mary calls to report the alarm

• What is P(B | M, J) ?
• We can use the full joint distribution to answer this question
• Requires 2^5 = 32 probabilities

• Can we use prior domain knowledge to come up with a Bayesian network that requires fewer
probabilities?
Constructing a Bayesian Network: Step 1
• Order the variables in terms of causality (may be a
partial order)
• e.g., {E, B} -> {A} -> {J, M}

• Use these assumptions to create the graph structure


of the Bayesian network
The Resulting Bayesian Network

network topology reflects causal knowledge


Constructing a Bayesian Network: Step 2
• Fill in conditional probability
tables (CPTs)
• One for each node
• 2^p entries, where p is the number of
parents

• Where do these probabilities


come from?
• Expert knowledge
• From data (relative frequency
estimates)
• Or a combination of both
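
Putting the two components together for the alarm network: a sketch that stores the CPTs and answers P(B | J, M) by enumerating the full joint. The CPT numbers are illustrative placeholders (the classic textbook values), not values given on these slides:

import itertools

# Illustrative CPTs (classic textbook values, used here only as placeholders)
P_B = 0.001
P_E = 0.002
P_A = {(True, True): 0.95, (True, False): 0.94, (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}   # P(J=True | A)
P_M = {True: 0.70, False: 0.01}   # P(M=True | A)

def joint(b, e, a, j, m):
    # P(B,E,A,J,M) as the product of each node's CPT given its parents
    p = (P_B if b else 1 - P_B) * (P_E if e else 1 - P_E)
    p *= P_A[(b, e)] if a else 1 - P_A[(b, e)]
    p *= P_J[a] if j else 1 - P_J[a]
    p *= P_M[a] if m else 1 - P_M[a]
    return p

# P(B=True | J=True, M=True): sum the joint over the hidden variables E and A
num = sum(joint(True, e, a, True, True) for e, a in itertools.product([True, False], repeat=2))
den = sum(joint(b, e, a, True, True) for b, e, a in itertools.product([True, False], repeat=3))
print(num / den)   # approximately 0.284 with these placeholder CPTs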
Representation in Bayesian Belief
Networks
(Figure: a belief network over the variables Storm, BusTourGroup, Lightning,
Campfire, Thunder and ForestFire)

• Associated with each node is a conditional probability table, which
specifies the conditional distribution for the variable given its immediate
parents in the graph
• Each node is asserted to be conditionally independent of its
non-descendants, given its immediate parents
Bayesian learning
• Prior knowledge of the candidate hypothesis is combined with the
observed data for arriving at the final probability of a hypothesis
• More flexible than other approaches because each observed training example
can influence the outcome of a hypothesis by increasing or decreasing
its estimated probability
• Perform better than the other methods while validating the hypotheses
that make probabilistic predictions
• It is possible to classify new instances by combining the predictions of
multiple hypotheses, weighted by their respective probabilities.
• They can be used to create a standard for the optimal decision against
which the performance of other methods can be measured
