UNIT 4 - Bayesian Learning
Unit 4
Bayesian Learning
By
Dr. G. Sunitha
Professor & BoS Chairperson
Department of CSE
1
Introduction
❖ Bayesian Learning provides a probabilistic approach to inference.
❖ It is based on the assumption that the quantities of interest are governed by probability distributions and that
optimal decisions can be made by reasoning about these probabilities together with observed data.
❖ Bayesian learning algorithms calculate explicit probabilities for hypotheses.
2
Features of Bayesian learning methods
❖ Each observed training example can incrementally decrease or increase the estimated probability that a
hypothesis is correct. This provides a more flexible approach to learning than algorithms that completely
eliminate a hypothesis if it is found to be inconsistent with any single example.
❖ Prior knowledge can be combined with observed data to determine the final probability of a hypothesis. In
Bayesian learning, prior knowledge is provided by asserting
(1) a prior probability for each candidate hypothesis, and
(2) a probability distribution over observed data for each possible hypothesis.
❖ Bayesian methods can accommodate hypotheses that make probabilistic predictions (e.g., hypotheses such as
"this pneumonia patient has a 93% chance of complete recovery").
❖ New instances can be classified by combining the predictions of multiple hypotheses, weighted by their
probabilities.
❖ They require initial knowledge of many probabilities. When these probabilities are not known in advance they
are often estimated based on background knowledge, previously available data, and assumptions about the
form of the underlying distributions.
❖ They can incur significant computational cost.
❖ They can provide a standard of optimal decision making against which other
practical methods can be measured.
3
Bayes Theorem
Terminology
❖ H is the hypothesis space; h is a hypothesis.
❖ D is the training dataset; d is a training sample.
❖ P(h) denotes the prior probability that hypothesis h holds, before training data is observed.
It reflects any background knowledge we have about the chance that h is a correct hypothesis. If we have
no such prior knowledge, then we might simply assign the same prior probability to each candidate
hypothesis.
❖ P(D) denotes the prior probability that training data D will be observed (i.e., the probability of D given no
knowledge about which hypothesis holds).
❖ P(D | h) denotes the probability of observing data D given hypothesis h holds.
❖ P(h | D) denotes the posterior probability that h holds given the observed training data D.
❖ Bayes Theorem
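In its standard form:
P(h | D) = P(D | h) P(h) / P(D)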
4
Maximum a Posteriori (MAP) Hypothesis
❖ For a given problem there can be multiple candidate hypotheses that are consistent with the data.
❖ The most (maximally) probable hypothesis given the data D is called the MAP hypothesis.
❖ Assuming that P(h) is the same for all hypotheses, P(D | h) is often called the likelihood of the data D given h, and
any hypothesis that maximizes P(D | h) is called a maximum likelihood (ML) hypothesis.
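In standard notation:
hMAP = argmax over h in H of P(h | D) = argmax over h in H of P(D | h) P(h)
(the denominator P(D) is dropped because it is constant across hypotheses)
and, when P(h) is the same for all hypotheses in H,
hML = argmax over h in H of P(D | h)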
5
Bayes Rule - Medical Diagnosis Problem
❖ Two alternative hypotheses:
(1) that the patient has a particular form of cancer. Hypothesis h = cancer
(2) that the patient does not have cancer. Hypothesis -h = No cancer
❖ The dataset contains patients' lab test results belonging to two classes: positive (+) and negative (−).
❖ Prior Knowledge: over the entire population of people only 0.8% have this disease.
P(h) = 0.008 P ( - h) = 0.992
❖ The test returns a correct positive result in only 98% of the cases in which the disease is actually present.
P( + | h) = 0.98 P( - | h) = 0.02
❖ A correct negative result in only 97% of the cases in which the disease is not present.
P( + | -h) = 0.03 P( - | -h) = 0.97
6
Bayes Rule - Medical Diagnosis Problem . . .
❖ Suppose we now observe a new patient for whom the lab test returns a positive result. Should we diagnose the
patient as having cancer or not? The maximum a posteriori hypothesis can be found as follows:
P(+ | h) P(h) = 0.98 × 0.008 = 0.0078
P(+ | -h) P(-h) = 0.03 × 0.992 = 0.0298
Hence hMAP = -h : the patient most probably does not have cancer; the positive lab test is more likely a false positive.
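A minimal Python sketch of this calculation, using only the numbers given above (variable names are illustrative):

# Posterior computation for the cancer-diagnosis example above.
p_cancer = 0.008              # P(h)    : prior probability of cancer
p_no_cancer = 0.992           # P(-h)   : prior probability of no cancer
p_pos_given_cancer = 0.98     # P(+ | h)
p_pos_given_no_cancer = 0.03  # P(+ | -h)

# Unnormalized posteriors for a positive test result
score_cancer = p_pos_given_cancer * p_cancer             # 0.0078
score_no_cancer = p_pos_given_no_cancer * p_no_cancer    # 0.0298

# Normalized posterior P(h | +)
p_cancer_given_pos = score_cancer / (score_cancer + score_no_cancer)
print(f"P(cancer | +) = {p_cancer_given_pos:.3f}")       # about 0.21
print("hMAP =", "cancer" if score_cancer > score_no_cancer else "no cancer")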
7
Bayes Theorem and Concept Learning
❖ Since Bayes theorem provides a principled way to calculate the posterior probability of each hypothesis given
the training data, we can use it as the basis for a straightforward learning algorithm that calculates the
probability for each possible hypothesis, then outputs the most probable.
8
Brute-Force Bayes Concept Learning
❖ D is the instance space (the set of training data samples). Each sample is represented as
<Xi , ti>, where Xi is a vector of independent variables and ti is the target variable.
❖ Let Hypothesis Space H be defined over Instance Space D.
❖ The task is to learn some target concept c : D → {0,1} i.e., to learn ti = c (Xi).
❖ Brute-Force MAP Learning Algorithm
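In its standard two-step form, the algorithm is:
1. For each hypothesis h in H, calculate the posterior probability P(h | D) = P(D | h) P(h) / P(D).
2. Output the hypothesis hMAP with the highest posterior probability: hMAP = argmax over h in H of P(h | D).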
9
Brute-Force Bayes Concept Learning . . . .
❖ Brute-Force MAP Learning Algorithm . . . .
• This algorithm may require significant computation, because it applies Bayes theorem to each
hypothesis in H to calculate P( h | D ). While this may prove impractical for large hypothesis
spaces, the algorithm is still of interest because it provides a standard against which we may judge
the performance of other concept learning algorithms.
• In order to specify a learning problem for the Brute-Force MAP Learning algorithm, we must specify values for
P(h) and P(D | h). We may choose the probability distributions P(h) and P(D | h) in any way in
order to describe our prior knowledge about the learning task. Here let us choose them to be consistent
with the following assumptions:
1. The training data D is noise free (i.e., ti = c(Xi)).
2. The target concept c is contained in the hypothesis space H.
3. We have no prior reason to believe that any hypothesis is more probable than any other.
10
Brute-Force Bayes Concept Learning . . . .
❖ Brute-Force MAP Learning Algorithm . . . .
• P(h) denotes the prior probability that hypothesis h holds, before training data is observed.
• How to choose value for P(h) –
o Given no prior knowledge that one hypothesis is more likely than another, it is reasonable to assign
the same prior probability to every hypothesis h in H.
o Furthermore, because we assume the target concept is contained in H we should require that these
prior probabilities sum to 1.
o Together these constraints imply that
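P(h) = 1 / |H|   for all h in H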
11
Brute-Force Bayes Concept Learning . . . .
❖ Brute-Force MAP Learning Algorithm . . . .
• P(D | h) denotes the probability of observing data D given hypothesis h holds.
• How to choose value for P(D | h) –
o Since noise-free training data is assumed, the probability of observing the target values in D given hypothesis h is:
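P(D | h) = 1   if ti = h(xi) for every <xi , ti> in D
P(D | h) = 0   otherwise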
12
Brute-Force Bayes Concept Learning . . . .
❖ Brute-Force MAP Learning Algorithm . . . .
Bayes theorem is used to compute the posterior probability P(h | D) of each hypothesis h given the observed
training data D, as follows.
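Case 1) Consider that h is inconsistent with the training data D. Then P(D | h) = 0, and hence P(h | D) = 0.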
13
Brute-Force Bayes Concept Learning . . . .
❖ Brute-Force MAP Learning Algorithm . . . .
Case 2) Consider that h is consistent with the training data D. Then P( D| h ) = 1.
Let VS_H,D be the subset of hypotheses from H that are consistent with D.
Then,
P(D) = | VS_H,D | / | H |
because the sum over all hypotheses of P(h | D) must be one, and the number of hypotheses from H consistent
with D is | VS_H,D |.
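Substituting these values into Bayes theorem, for every hypothesis h consistent with D:
P(h | D) = ( 1 × (1/|H|) ) / ( |VS_H,D| / |H| ) = 1 / |VS_H,D|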
14
Brute-Force Bayes Concept Learning . . . .
❖ Brute-Force MAP Learning Algorithm . . . .
15
MAP Hypotheses and Consistent Learners
16
MAP Hypotheses and Consistent Learners . . .
❖ For a given problem there can be multiple candidate hypotheses that are consistent with the data.
❖ The most (maximally) probable hypothesis given the data D is called the MAP (Maximum a Posteriori)
Hypothesis.
❖ A learning algorithm is a consistent learner provided it outputs a hypothesis that commits zero errors over the
training examples.
❖ Given the above analysis, it can be concluded that every consistent learner outputs a MAP hypothesis, if
• A uniform prior probability distribution over H is assumed and
• Deterministic, noise-free training data is assumed.
17
Normal (Gaussian) Distribution of Data
A Normal Distribution is a bell-shaped
distribution defined by the probability
density function
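In its standard form, with mean x̄ and standard deviation σ:
p(x) = ( 1 / (σ √(2π)) ) exp( −(x − x̄)² / (2σ²) )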
❖ About 68% of the values fall within one standard deviation of the mean, that is, between (x̄ − σ) and (x̄ + σ).
❖ About 95% of the values lie within two standard deviations of the mean, that is, between (x̄ − 2σ) and (x̄ + 2σ).
❖ About 99.7% of the values lie within three standard deviations of the mean, that is, between (x̄ − 3σ) and (x̄ + 3σ).
18
Maximum Likelihood and Least-squared Error Hypotheses
❖ Under certain assumptions any learning algorithm that minimizes the squared error between the output
hypothesis predictions and the training data will output a maximum likelihood hypothesis.
❖ Consider a set of training examples, where the target value of each example is corrupted by random noise
drawn according to a Normal probability distribution. More precisely, each training example is a pair of the form
(xi, ti), where ti = f(xi) + ei; here f(xi) is the noise-free value of the target function and ei is a random
variable representing the noise.
❖ The task of the learner is to output a maximum likelihood hypothesis, or, equivalently, a MAP hypothesis
assuming all hypotheses are equally probable a priori.
ti = f(xi)        (noise-free target value)
ti = f(xi) + ei   (observed, noise-corrupted target value)
19
Maximum Likelihood and Least-squared Error Hypotheses . . .
❖ In the case of continuous variables we cannot express P(D | h) by assigning a finite probability to each of the infinite
set of possible values for the random variable. Instead, we speak of a probability density for continuous
variables such as e and require that the integral of this probability density over all possible values be one.
❖ Lower case p is used to refer to the probability density function, to distinguish it from a finite probability P.
20
Maximum Likelihood and Least-squared Error Hypotheses . . .
❖ Given that the noise ei obeys a Normal distribution with zero mean, each ti must also obey a Normal distribution
centered around the true target value f(xi). Hence, under hypothesis h, the mean is µ = f(xi) = h(xi).
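A sketch of the standard derivation from this point:
hML = argmax over h in H of  p(D | h)
    = argmax over h in H of  Π i=1..m ( 1 / √(2πσ²) ) exp( −(ti − h(xi))² / (2σ²) )
    = argmax over h in H of  Σ i=1..m −(ti − h(xi))² / (2σ²)     (taking the natural log and dropping terms that do not depend on h)
    = argmin over h in H of  Σ i=1..m (ti − h(xi))²
That is, the maximum likelihood hypothesis is the one that minimizes the sum of squared errors between the observed
targets ti and the hypothesis predictions h(xi).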
21
Maximum Likelihood and Least-squared Error Hypotheses . . .
22
Maximum Likelihood and Least-squared Error Hypotheses . . .
23
Minimum Description Length Principle
❖ Recall Occam's razor, a popular inductive bias that can be summarized as "choose the shortest explanation for
the observed data."
24
Minimum Description Length Principle . . .
25
Minimum Description Length Principle . . .
❖ The Minimum Description Length (MDL) principle recommends choosing the hypothesis that minimizes the sum
of two description lengths: the description length of the hypothesis itself and the description length of the data
given the hypothesis.
❖ Assuming we use the codes C1 and C2 to represent the hypothesis and the data given the hypothesis, we can
state the MDL principle as:
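hMDL = argmin over h in H of  [ LC1(h) + LC2(D | h) ]
where LC(x) denotes the description length (in bits) of x under encoding C.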
26
Naïve Bayesian Classifier
27
Naïve Bayesian Classifier . . .
28
Naïve Bayesian Classifier . . .
29
Naïve Bayesian Classifier . . .
30
Naïve Bayesian Classifier . . .
31
Naïve Bayesian Classifier – Example
32
Naïve Bayesian Classifier – Example . . .
33
Naïve Bayesian Classifier – Example . . .
34
Case Study: Learning to Classify Text using Naïve Bayes
35
Naïve Bayes Algorithm for Learning and Classifying Text
36
Naïve Bayes Algorithm for Learning and Classifying Text . . .
37
Optimal Bayes Classifier
38
Optimal Bayes Classifier . . .
39
Optimal Bayes Classifier . . .
40
Optimal Bayes Classifier . . .
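The classification rule (standard form of Bayes optimal classification) is:
argmax over vj in V of  Σ over hi in H of  P(vj | hi) P(hi | D)
where V is the set of possible classification values and the new instance is assigned the value vj that maximizes this sum.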
Any system that classifies new instances according to the above equation is called a Bayes optimal classifier, or
Bayes optimal learner. No other classification method using the same hypothesis space and same prior knowledge
can outperform this method on average. This method maximizes the probability that the new instance is classified
correctly, given the available data, hypothesis space, and prior probabilities over the hypotheses.
41
Gibbs Algorithm
❖ Although the Bayes optimal classifier obtains the best performance that can be achieved from the given training
data, it can be quite costly to apply. The expense is due to the fact that it computes the posterior probability for
every hypothesis in H and then combines the predictions of each hypothesis to classify each new instance.
❖ Under certain conditions, classifying the next instance according to a hypothesis drawn at random from the
current version space (according to a uniform distribution) will have an expected error at most twice that of the
Bayes optimal classifier.
E[ error_Gibbs ] ≤ 2 E[ error_BayesOptimal ]
42
Bayes Classification Methods - Summary
43
Bayesian Belief Networks
❖ The naive Bayes classifier assumes conditional independence between variables given the value of the target
variable. This assumption dramatically reduces the complexity of learning the target function. However, in many
cases this conditional independence assumption is clearly overly restrictive.
❖ A Bayesian belief network describes the probability distribution governing a set of variables by specifying a set
of conditional independence assumptions along with a set of conditional probabilities.
❖ Bayesian belief networks allow stating conditional independence assumptions that apply to subsets of the
variables. Thus, Bayesian belief networks provide an intermediate approach that is less constraining than the
global assumption of conditional independence made by the naive Bayes classifier, but more tractable than
avoiding conditional independence assumptions altogether.
❖ In general, a Bayesian belief network describes the probability distribution over a set of variables.
44
Bayesian Belief Networks – Conditional Independence
❖ The Naive Bayes classifier assumes that the instance attribute A1 is conditionally independent of instance
attribute A2 given the target value V. This allows the naive Bayes classifier to calculate P(A1, A2 | V) as follows:
P(A1, A2 | V) = P(A1 | V) P(A2 | V)
(product rule of probability - A1 is conditionally independent of A2 given V)
❖ Let X, Y, and Z be three discrete-valued random variables. It can be said that X is conditionally independent of Y
given Z if the probability distribution governing X is independent of the value of Y given a value for Z; that is, if
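P(X | Y, Z) = P(X | Z)
(more precisely, P(X = xi | Y = yj, Z = zk) = P(X = xi | Z = zk) for all possible values xi, yj, zk)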
❖ This definition of conditional independence can be extended to sets of variables as well. It can be said that the
set of variables X1 . . . Xl is conditionally independent of the set of variables Y1 . . . Ym given the set of
variables Z1 . . . Zn, if
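P(X1 . . . Xl | Y1 . . . Ym, Z1 . . . Zn) = P(X1 . . . Xl | Z1 . . . Zn)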
45
Bayesian Belief Networks – Representation
❖ A Bayesian belief network (Bayesian network for short) represents the joint probability distribution for a set of
variables by specifying a set of conditional independence assumptions (represented by a directed acyclic graph),
together with sets of local conditional probabilities.
✓ Predecessors
✓ Immediate Predecessors
✓ Descendants
✓ Nondescendants
✓ A variable is conditionally independent of its nondescendants in the network given its immediate predecessors
in the network.
46
Bayesian Belief Networks – Representation . . .
47
Bayesian Belief Networks – Representation . . .
❖ A conditional probability table is given for each variable, describing the probability distribution for that variable
given the values of its immediate predecessors.
❖ The joint probability for any desired assignment of values (y1, . . . yn) to the tuple of network variables
(Y1 . . . Yn) can be computed by the formula:
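P(y1, . . . , yn) = Π i=1..n P(yi | Parents(Yi))
where Parents(Yi) denotes the set of immediate predecessors of Yi in the network.

As an illustration, a minimal Python sketch of this factorization for a hypothetical chain network
Storm → Lightning → Thunder (the CPT numbers below are made up for illustration, not taken from the slides):

# Hypothetical chain network: Storm -> Lightning -> Thunder.
# Each CPT gives P(child = True | parent value).
p_storm = 0.2                                         # P(Storm = True)
p_lightning_given_storm = {True: 0.7, False: 0.1}     # P(Lightning = True | Storm)
p_thunder_given_lightning = {True: 0.9, False: 0.05}  # P(Thunder = True | Lightning)

def joint(storm, lightning, thunder):
    """P(Storm, Lightning, Thunder) = P(Storm) * P(Lightning | Storm) * P(Thunder | Lightning)."""
    p = p_storm if storm else 1 - p_storm
    p *= p_lightning_given_storm[storm] if lightning else 1 - p_lightning_given_storm[storm]
    p *= p_thunder_given_lightning[lightning] if thunder else 1 - p_thunder_given_lightning[lightning]
    return p

print(joint(True, True, True))   # 0.2 * 0.7 * 0.9 = 0.126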
48
Bayesian Belief Networks – Example
Causal relations are captured by Bayesian Belief Networks
49
Bayesian Belief Networks – Problem
50
Bayesian Belief Networks – Problem . . .
51
Bayesian Belief Networks – Problem . . .
(Result of the worked computation: 0.00062)
52
Bayesian Belief Networks – Problem . . .
53
Learning in Bayesian Belief Networks
❖ Scenario 1: Given both the network structure and all variables observable: compute only
the CPT entries.
❖ Scenario 2: Network structure known, some variables hidden: gradient descent (greedy
hill-climbing) method, i.e., search for a solution along the steepest descent of a criterion
function.
• Weights are initialized to random probability values.
• At each iteration, it moves towards what appears to be the best solution at the
moment, without backtracking.
• Weights are updated at each iteration & converge to local optimum.
❖ Scenario 3: Network structure unknown, all variables observable: search through the
model space to reconstruct the network topology.
❖ Scenario 4: Unknown structure, all hidden variables: No good algorithms known for this
purpose.
54