
Bayesian Learning

• Provides practical learning algorithms


– Naïve Bayes learning
– Bayesian belief network learning
– Combine prior knowledge (prior probabilities)

• Provides foundations for machine learning


– Evaluating learning algorithms
– Guiding the design of new algorithms
– Learning from models: meta-learning
Bayesian Classification: Why?
• Probabilistic learning: Calculate explicit probabilities for
hypotheses; among the most practical approaches to certain
types of learning problems
• Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is correct.
Prior knowledge can be combined with observed data.
• Probabilistic prediction: Predict multiple hypotheses,
weighted by their probabilities
• Standard: Even when Bayesian methods are computationally
intractable, they can provide a standard of optimal decision
making against which other methods can be measured
• Difficulty:
– They typically require initial knowledge of many
probabilities.
– When these probabilities are not known in advance they are
often estimated based on background knowledge, previously
available data, and assumptions about the form of the
underlying distributions.
– There is a significant computational cost required to
determine the Bayes optimal hypothesis in the general case.
Basic Formulas for Probabilities

• Product Rule: probability P(A, B) of a conjunction of two events A and B:

      P(A, B) = P(A | B) P(B) = P(B | A) P(A)

• Sum Rule: probability of a disjunction of two events A and B:

      P(A ∨ B) = P(A) + P(B) - P(A, B)

• Theorem of Total Probability: if events A1, ..., An are mutually
exclusive with Σ(i=1 to n) P(Ai) = 1, then

      P(B) = Σ(i=1 to n) P(B | Ai) P(Ai)
Basic Approach
Bayes Rule:

      P(h | D) = P(D | h) P(h) / P(D)

• P(h) = prior probability of hypothesis h
• P(D) = prior probability of training data D
• P(h|D) = probability of h given D (posterior probability)
• P(D|h) = probability of D given h (likelihood of D given h)
The Goal of Bayesian Learning: find the most probable hypothesis given the
training data (Maximum A Posteriori hypothesis hMAP)

      hMAP = argmax(h ∈ H) P(h | D)
           = argmax(h ∈ H) P(D | h) P(h) / P(D)
           = argmax(h ∈ H) P(D | h) P(h)
An Example
Does patient have cancer or not?
A patient takes a lab test and the result comes back positive. The test
returns a correct positive result in only 98% of the cases in which the
disease is actually present, and a correct negative result in only 97% of the
cases in which the disease is not present. Furthermore, .008 of the entire
population have this cancer.
P(cancer) = .008,          P(¬cancer) = .992
P(+ | cancer) = .98,       P(- | cancer) = .02
P(+ | ¬cancer) = .03,      P(- | ¬cancer) = .97

P(cancer | +)  = P(+ | cancer) P(cancer) / P(+)
P(¬cancer | +) = P(+ | ¬cancer) P(¬cancer) / P(+)
MAP Learner

For each hypothesis h in H, calculate the posterior probability

      P(h | D) = P(D | h) P(h) / P(D)

Output the hypothesis hMAP with the highest posterior probability

      hMAP = argmax(h ∈ H) P(h | D)

Comments:
Computationally intensive
Provides a standard for judging the performance of
learning algorithms
Choosing P(h) and P(D|h) reflects our prior
knowledge about the learning task
MAP Learner
• Assumptions:
– The training data D is noise free (i.e., di = c(xi))
– The target concept c is contained in the hypothesis space H .
– We have no a priori reason to believe that any hypothesis is
more probable than any other.
MAP Learner
• Case 1: If h is not consistent with D, then P(D | h) = 0, so
      P(h | D) = 0 · P(h) / P(D) = 0

• Case 2: If h is consistent with D, then P(D | h) = 1, so
      P(h | D) = 1 · (1/|H|) / P(D) = 1 / |VS_H,D|
  where VS_H,D is the version space of H with respect to D
  (using P(D) = |VS_H,D| / |H| under the assumptions above).

MAP Learner
• MAP Hypotheses and Consistent Learners
– Given the above analysis, every hypothesis consistent
with D is a MAP hypothesis; we can conclude that
every consistent learner outputs a MAP hypothesis,
– if we assume a uniform prior probability distribution
over H (i.e., P(hi) = P(hj) for all i, j), and if we assume
deterministic, noise-free training data (i.e., P(D|h) = 1
if D and h are consistent, and 0 otherwise).
MAXIMUM LIKELIHOOD AND LEAST-
SQUARED ERROR HYPOTHESES

• Consider the problem of learning a continuous-valued


target function .
• Any learning algorithm that minimizes the squared error
between the output hypothesis predictions and the training
data will output a maximum likelihood hypothesis .
• The significance of this result is that it provides a
Bayesian justification for minimizing the sum of squared errors.
MAXIMUM LIKELIHOOD AND LEAST-
SQUARED ERROR HYPOTHESES
• Example: The problem faced by learner L is to learn an unknown target function f :
X -> R drawn from H.
• X is the instance space, R the set of real values, and H the hypothesis space,
with h : X -> R for each h in H.
• A set of m training examples is provided, where the target value of each example is
corrupted by random noise .
• <xi,di> where di=f(xi)+ei
• The task of the learner is to output a maximum likelihood hypothesis, or,
equivalently, a MAP hypothesis assuming all hypotheses are equally probable a
priori.
• The dashed line in the following figure corresponds to the hypothesis hML with
least-squared training error, hence the maximum likelihood hypothesis.
MAXIMUM LIKELIHOOD AND LEAST-
SQUARED ERROR HYPOTHESES
• Two basic concepts from probability theory are needed to show that a
hypothesis that minimizes the sum of squared errors
is also a maximum likelihood hypothesis for continuous-valued
target functions:
– Probability density: to discuss probabilities over continuous variables
we must introduce probability densities, since finite probabilities cannot be
assigned to individual real values.

– The random noise variable e is assumed to be generated by a Normal
probability distribution.
MAXIMUM LIKELIHOOD AND LEAST-
SQUARED ERROR HYPOTHESES

• Derivation of hML
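A sketch of the standard derivation, assuming the noise terms ei are drawn independently from a zero-mean Normal distribution with known variance σ²:

$$h_{ML} = \arg\max_{h \in H} p(D \mid h) = \arg\max_{h \in H} \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{1}{2\sigma^2}(d_i - h(x_i))^2}$$

Taking logarithms and discarding terms that do not depend on h:

$$h_{ML} = \arg\max_{h \in H} \sum_{i=1}^{m} -\frac{1}{2\sigma^2}\big(d_i - h(x_i)\big)^2 = \arg\min_{h \in H} \sum_{i=1}^{m} \big(d_i - h(x_i)\big)^2$$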
MAXIMUM LIKELIHOOD AND LEAST-
SQUARED ERROR HYPOTHESES
• The first term is independent of h, so it can be dropped.
• Limitations
– The above analysis considers noise only in the
target value of the training example and does
not consider noise in the attributes describing
the instances themselves.
MAXIMUM LIKELIHOOD HYPOTHESES FOR
PREDICTING PROBABILITIES
• Learn a nondeterministic (probabilistic) function f :
X ->{0,1}, which has two discrete output values.
• Ex:medical patient symptoms.
• We might wish to learn a neural network (or other real-valued
function approximator) whose output is the probability that f (x) =
1.
• Now, consider the target function f' : X -> [0, 1] such that f'(x) = P(f(x) = 1).
• What criterion should we optimize in order to find a maximum
likelihood hypothesis for f' in this setting?
MAXIMUM LIKELIHOOD HYPOTHESES FOR
PREDICTING PROBABILITIES
• The training data D is of the form D = {<x1, d1>, ..., <xm, dm>},
where di is the observed 0 or 1 value for f(xi).
• P(D | h) can be written as a product over the training examples.
• Since each xi is independent of h, applying the product rule gives the
likelihood sketched below.
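A sketch of the standard likelihood for this setting, treating h(xi) as the predicted probability that f(xi) = 1:

$$P(D \mid h) = \prod_{i=1}^{m} P(x_i, d_i \mid h) = \prod_{i=1}^{m} h(x_i)^{d_i}\,\big(1 - h(x_i)\big)^{1 - d_i}\, P(x_i)$$

so that, dropping the P(xi) terms that do not depend on h,

$$h_{ML} = \arg\max_{h \in H} \sum_{i=1}^{m} d_i \ln h(x_i) + (1 - d_i) \ln\big(1 - h(x_i)\big)$$

(the cross-entropy criterion).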


MAXIMUM LIKELIHOOD HYPOTHESES FOR
PREDICTING PROBABILITIES
• Gradient Search to Maximize Likelihood in a Neural Net

• Using a single layer of sigmoid units
• The final expression for the gradient and the resulting weight
update rule are sketched below.
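A sketch of the standard gradient and weight-update rule under the cross-entropy criterion above, where x_ijk denotes the k-th input to sigmoid unit j for training example i and η is a small learning-rate constant:

$$\frac{\partial \ln P(D \mid h)}{\partial w_{jk}} = \sum_{i=1}^{m} \big(d_i - h(x_i)\big)\, x_{ijk}, \qquad
w_{jk} \leftarrow w_{jk} + \eta \sum_{i=1}^{m} \big(d_i - h(x_i)\big)\, x_{ijk}$$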


Minimum description Length principle

• The Minimum Description Length principle is motivated by


interpreting the definition of hMAP in the light of basic concepts from
information theory .

• This can be interpreted as a statement that short hypotheses are


preferred .
• Problem: Designing a compact code to transmit messages
drawn at random .
• So assign shorter codes to messages that are more probable.
• The number of bits required to encode message i using code C is called
the description length of message i with respect to C.
• Shannon and Weaver showed that the optimal code uses -log2 pi bits to encode
message i, where pi is the probability of message i.
• Applying this to hMAP gives the expression below.
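In description-length terms, the definition of hMAP can be rewritten as:

$$h_{MAP} = \arg\max_{h \in H} P(D \mid h)\,P(h) = \arg\min_{h \in H} \big[-\log_2 P(h) - \log_2 P(D \mid h)\big]$$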

• Term 1 (-log2 P(h)) is the description length of h under the optimal code for the
hypothesis space H, and Term 2 (-log2 P(D|h)) is the description length of the
data D under its optimal code, assuming that both the
sender and receiver know the hypothesis h.
• The Minimum Description Length (MDL) principle
recommends choosing the hypothesis that minimizes the sum of
these two description lengths
• Example:
– We wish to apply the MDL principle to the problem of
learning decision trees from some training data.
– C1: an encoding of decision trees in which the description length
grows with the number of nodes and edges in the tree.
– C2:
• If the sequence of instances is already known to the receiver, we need
transmit only the classifications. If the training classifications are
identical to the predictions of the hypothesis, then there is
no need to transmit any information about these examples.
• In the case where some examples are misclassified by h,
then for each misclassification we need to encode a
message that identifies which example is misclassified.
• Thus the MDL principle provides a way of trading off
hypothesis complexity for the number of errors committed by
the hypothesis.
• It might select a shorter hypothesis that makes a few errors over
a longer hypothesis that perfectly classifies the training data.
• Conclusions
– If a representation of hypotheses is chosen so
that the size of hypothesis h is -log2 P(h), and
– if a representation for exceptions is chosen so that the
encoding length of D given h is equal to -log2 P(D|h),
then the MDL principle produces MAP hypotheses.
Bayes Optimal Classifier
• Question: Given new instance x, what is its most probable classification?
• hMAP(x) is not necessarily the most probable classification!
Example: Let P(h1|D) = .4, P(h2|D) = .3, P(h3|D) = .3
Given new data x, we have h1(x) = +, h2(x) = -, h3(x) = -
What is the most probable classification of x?
Bayes optimal classification:

      argmax(vj ∈ V) Σ(hi ∈ H) P(vj | hi) P(hi | D)

Example:
P(h1|D) = .4,  P(-|h1) = 0,  P(+|h1) = 1
P(h2|D) = .3,  P(-|h2) = 1,  P(+|h2) = 0
P(h3|D) = .3,  P(-|h3) = 1,  P(+|h3) = 0

      Σ(hi ∈ H) P(+ | hi) P(hi | D) = .4
      Σ(hi ∈ H) P(- | hi) P(hi | D) = .6

so the Bayes optimal classification of x is -.
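A minimal Python sketch of this weighted vote; the posteriors and predictions are taken from the example above, while the function itself is illustrative:

```python
from collections import defaultdict

def bayes_optimal_classify(posteriors, predictions):
    """posteriors: {h: P(h|D)}, predictions: {h: h(x)} for a fixed new instance x."""
    scores = defaultdict(float)
    for h, p_h in posteriors.items():
        # P(v|h) is 1 for the value h predicts and 0 otherwise,
        # so each hypothesis votes with weight P(h|D)
        scores[predictions[h]] += p_h
    return max(scores, key=scores.get), dict(scores)

posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
predictions = {"h1": "+", "h2": "-", "h3": "-"}
print(bayes_optimal_classify(posteriors, predictions))
# ('-', {'+': 0.4, '-': 0.6})
```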
Gibbs algorithm
• Choose a hypothesis h from H at random, according to the
posterior probability distribution over H.
• Use h to predict the classification of the next instance x.
• The expected misclassification error for the Gibbs algorithm
is at most twice the expected error of the Bayes optimal
classifier .
• This result has an interesting implication for the concept
learning problem .

Naïve Bayes Learner
Assume target function f: X -> V, where each instance x is described
by attributes <a1, a2, ..., an>. The most probable value of f(x) is:

      vMAP = argmax(vj ∈ V) P(vj | a1, a2, ..., an)
           = argmax(vj ∈ V) P(a1, a2, ..., an | vj) P(vj) / P(a1, a2, ..., an)
           = argmax(vj ∈ V) P(a1, a2, ..., an | vj) P(vj)

Naïve Bayes assumption (attributes are conditionally independent given the class):

      P(a1, a2, ..., an | vj) = Π(i) P(ai | vj)
Naive Bayesian Classifier (II)
• Given a training set, we can compute the probabilities

Outlook      P    N        Humidity   P    N
sunny        2/9  3/5      high       3/9  4/5
overcast     4/9  0        normal     6/9  1/5
rain         3/9  2/5
Temperature  P    N        Windy      P    N
hot          2/9  2/5      true       3/9  3/5
mild         4/9  2/5      false      6/9  2/5
cool         3/9  1/5
Bayesian classification
• The classification problem may be formalized
using a-posteriori probabilities:
• P(C|X) = prob. that the sample tuple
X=<x1,…,xk> is of class C.

• E.g. P(class=N | outlook=sunny,windy=true,…)

• Idea: assign to sample X the class label C such


that P(C|X) is maximal
Estimating a-posteriori probabilities
• Bayes theorem:
P(C|X) = P(X|C)·P(C) / P(X)
• P(X) is constant for all classes
• P(C) = relative freq of class C samples
• C such that P(C|X) is maximum =
C such that P(X|C)·P(C) is maximum
• Problem: computing P(X|C) directly is infeasible!
Naïve Bayesian Classification
• Naïve assumption: attribute independence
P(x1,…,xk|C) = P(x1|C)·…·P(xk|C)
• If the i-th attribute is categorical:
P(xi|C) is estimated as the relative frequency of samples
having value xi as the i-th attribute in class C
• If the i-th attribute is continuous:
P(xi|C) is estimated through a Gaussian density function
• Computationally easy in both cases
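A small sketch of these two estimates in Python (the helper functions are illustrative; the Gaussian case estimates the class-conditional mean and variance from the samples):

```python
import math

def categorical_estimate(values_in_class, xi):
    # Relative frequency of value xi among the class-C samples
    return values_in_class.count(xi) / len(values_in_class)

def gaussian_estimate(values_in_class, xi):
    # Gaussian density with mean and variance estimated from the class-C samples
    mu = sum(values_in_class) / len(values_in_class)
    var = sum((v - mu) ** 2 for v in values_in_class) / len(values_in_class)
    return math.exp(-(xi - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
```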
Play-tennis example: estimating P(xi|C)

Training data:
Outlook   Temperature  Humidity  Windy  Class
sunny     hot          high      false  N
sunny     hot          high      true   N
overcast  hot          high      false  P
rain      mild         high      false  P
rain      cool         normal    false  P
rain      cool         normal    true   N
overcast  cool         normal    true   P
sunny     mild         high      false  N
sunny     cool         normal    false  P
rain      mild         normal    false  P
sunny     mild         normal    true   P
overcast  mild         high      true   P
overcast  hot          normal    false  P
rain      mild         high      true   N

Class priors: P(p) = 9/14, P(n) = 5/14

outlook:     P(sunny|p) = 2/9     P(sunny|n) = 3/5
             P(overcast|p) = 4/9  P(overcast|n) = 0
             P(rain|p) = 3/9      P(rain|n) = 2/5
temperature: P(hot|p) = 2/9       P(hot|n) = 2/5
             P(mild|p) = 4/9      P(mild|n) = 2/5
             P(cool|p) = 3/9      P(cool|n) = 1/5
humidity:    P(high|p) = 3/9      P(high|n) = 4/5
             P(normal|p) = 6/9    P(normal|n) = 2/5
windy:       P(true|p) = 3/9      P(true|n) = 3/5
             P(false|p) = 6/9     P(false|n) = 2/5
Example : Naïve Bayes
Predict whether tennis is played on a day with the conditions <sunny, cool, high,
strong> (i.e., compute P(v | o=sunny, t=cool, h=high, w=strong)) using the following
training data:
Day Outlook Temperature Humidity Wind Play Tennis
1 Sunny Hot High Weak No
2 Sunny Hot High Strong No
3 Overcast Hot High Weak Yes
4 Rain Mild High Weak Yes
5 Rain Cool Normal Weak Yes
6 Rain Cool Normal Strong No
7 Overcast Cool Normal Strong Yes
8 Sunny Mild High Weak No
9 Sunny Cool Normal Weak Yes
10 Rain Mild Normal Weak Yes
11 Sunny Mild Normal Strong Yes
12 Overcast Mild High Strong Yes
13 Overcast Hot Normal Weak Yes
14 Rain Mild High Strong No
Each conditional probability is estimated as a frequency ratio, e.g.

      P(strong | yes) = (# days of playing tennis with strong wind) / (# days of playing tennis) = 3/9

We then have:

      P(y) · P(sun|y) · P(cool|y) · P(high|y) · P(strong|y) ≈ .005
      P(n) · P(sun|n) · P(cool|n) · P(high|n) · P(strong|n) ≈ .021

so naïve Bayes predicts Play Tennis = No.
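A runnable Python sketch of this calculation using the training table above (the function and variable names are illustrative):

```python
from collections import Counter, defaultdict

# (outlook, temperature, humidity, wind, play) rows from the table above
data = [
    ("sunny","hot","high","weak","no"), ("sunny","hot","high","strong","no"),
    ("overcast","hot","high","weak","yes"), ("rain","mild","high","weak","yes"),
    ("rain","cool","normal","weak","yes"), ("rain","cool","normal","strong","no"),
    ("overcast","cool","normal","strong","yes"), ("sunny","mild","high","weak","no"),
    ("sunny","cool","normal","weak","yes"), ("rain","mild","normal","weak","yes"),
    ("sunny","mild","normal","strong","yes"), ("overcast","mild","high","strong","yes"),
    ("overcast","hot","normal","weak","yes"), ("rain","mild","high","strong","no"),
]

priors = Counter(row[-1] for row in data)      # class counts
cond = defaultdict(Counter)                    # (attribute index, class) -> value counts
for row in data:
    for i, value in enumerate(row[:-1]):
        cond[(i, row[-1])][value] += 1

def naive_bayes(x):
    scores = {}
    for v, n_v in priors.items():
        score = n_v / len(data)                        # P(v)
        for i, value in enumerate(x):
            score *= cond[(i, v)][value] / n_v         # P(ai | v)
        scores[v] = score
    return max(scores, key=scores.get), scores

print(naive_bayes(("sunny", "cool", "high", "strong")))
# ('no', {'no': ~0.021, 'yes': ~0.005})
```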
The independence hypothesis…
• … makes computation possible
• … yields optimal classifiers when satisfied
• … but is seldom satisfied in practice, as attributes
(variables) are often correlated.
• Attempts to overcome this limitation:
– Bayesian networks, that combine Bayesian reasoning with
causal relationships between attributes
– Decision trees, that reason on one attribute at a time,
considering the most important attributes first
Naïve Bayes Algorithm
Naïve_Bayes_Learn (examples)
for each target value vj
estimate P(vj)
for each attribute value ai of each attribute a
estimate P(ai | vj )

Classify_New_Instance(x)
      vNB = argmax(vj ∈ V) P(vj) Π(ai ∈ x) P(ai | vj)

Typical m-estimate of P(ai | vj), used in particular when nc = 0:

      P(ai | vj) = (nc + m·p) / (n + m)

where
      n  = number of training examples with v = vj
      nc = number of those examples with a = ai
      p  = prior estimate for P(ai | vj)
      m  = weight given to the prior (equivalent sample size)
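A one-function sketch of the m-estimate in Python (the example numbers are made up for illustration):

```python
def m_estimate(n_c, n, p, m):
    """m-estimate of P(ai|vj): n_c matching examples out of n, prior p with weight m."""
    return (n_c + m * p) / (n + m)

# e.g. no observed matches (n_c = 0) out of 9 class examples,
# uniform prior p = 1/3 over three attribute values, weight m = 3
print(m_estimate(0, 9, 1/3, 3))  # 0.0833... instead of 0
```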
An example: Learning to classify the text
• We might wish to learn the target concept "electronic news
articles that I find interesting“.
• General setting:
– Consider an instance space X consisting of all possible text
documents .
– The task is to learn from these training examples to predict
the target value for subsequent text documents.
– The target values are like and dislike to indicate the two
classes.
An example: Learning to classify
the text
• The two main design issues involved in applying
the naive Bayes classifier to such text classification
problems are
– To decide how to represent an arbitrary text
document in terms of attribute values
– To decide how to estimate the probabilities
required by the naive Bayes classifier.
An example: Learning to classify
the text
• We define an attribute for each word position in
the document and define the value of that
attribute to be the word found in that position.
• Ex: a 111-word document is represented by 111 attribute values.
An example: Learning to classify
the text
• We can now apply the naive Bayes classifier.
• we are given a set of 700 training documents that a
friend has classified as dislike and another 300 she
has classified as like.
• A new document is then classified by choosing the target value (like or
dislike) that maximizes P(vj) Π(i) P(ai | vj).
An example: Learning to classify
the text
• The assumption required to apply the naïve Bayes classifier is that the word
probabilities for one text position are independent of the
words that occur in other positions.
• Although this assumption is inaccurate, the naive
Bayes learner performs remarkably well in many text
classification problems.
• To estimate the class-conditional probabilities we make a further
independence assumption, described on the next slide.
An example: Learning to classify
the text
• The attributes are independent and identically distributed, given the target
classification;
• that is, P(ai = wk | vj) = P(am = wk | vj) for all i, j, k, m.
• The estimate for P(wk | vj) is then given by the expression below.
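With nk, n and |Vocabulary| as defined on the following slides, the standard Laplace-smoothed estimate (a uniform-prior m-estimate) is:

$$P(w_k \mid v_j) = \frac{n_k + 1}{n + |Vocabulary|}$$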

• n =the total number of word positions in all training examples .


An example: Learning to classify
the text
• nk is the number of times word wk is found
among these n word positions.
• | Vocabulary | is the total number of distinct
words .
An example: Learning to classify
the text
• Experiment Results:
– A minor variant of this algorithm was applied to the problem of classifying usenet news
articles.
– The target classification for an article in this
case was the name of the usenet newsgroup in which the article appeared.
– 1,000 articles were collected from each newsgroup, forming a data set of 20,000
documents.
– The naive Bayes algorithm was then applied using two-thirds of these 20,000 documents
as training examples, and performance was measured over the remaining third.
– The accuracy achieved by the program was 89%
An example: Learning to classify
the text
• Similarly impressive results have been achieved by others applying
similar statistical learning approaches to text classification.
• Ex: NEWSWEEDER system
– program for reading netnews that allows the user to rate articles as he or she
reads them.
– uses these rated articles as training examples to learn to predict which subsequent
articles will be of interest to the user .
– NEWSWEEDER used its learned profile of user interests to
suggest the most highly rated new articles each day.
Bayesian Belief Networks

• Naïve Bayes assumption of conditional independence too restrictive


• But it is intractable without some such assumptions
• A Bayesian belief network (Bayesian net) describes conditional
independence among subsets of variables (attributes), combining prior
knowledge about dependencies among variables with observed
training data.
• Bayesian Net
– Node = variable
– Arc = dependency
– DAG, with the direction of an arc representing causality
– To each variable A with parents B1, ..., Bn there is attached a
conditional probability table P(A | B1, ..., Bn)
Bayesian Belief Networks
• A Bayesian belief network describes the probability distribution over a set
of variables.
• Consider an arbitrary set of random variables Y1, ..., Yn; the joint space of
the set of variables Y is the cross product V(Y1) x V(Y2) x ... x V(Yn),
where V(Yi) is the set of possible values of Yi.
• The probability distribution over this joint space is called the joint
probability distribution.
• Conditional independence in a Bayesian belief network is defined
as follows.
Bayesian Belief Networks
• Let X , Y, and Z be three discrete-valued random variables.
• We say that X is conditionally independent of Y given Z if

      P(X = xi | Y = yj, Z = zk) = P(X = xi | Z = zk)   for all xi, yj, zk.

• We commonly write the above expression in abbreviated
form as P(X | Y, Z) = P(X | Z).
• Representation
• Example
Bayesian Belief Networks
[Figure: a belief network with nodes Age, Occupation (Occ), Income, Buy and Interested in Insurance (Int); Age, Occ and Income point to Buy, and Buy points to Int.]
• Age, Occupation and Income determine whether the customer will buy this product.
• Given that the customer buys the product, whether there is interest in insurance is
independent of Age, Occupation and Income.
• P(Age, Occ, Inc, Buy, Int) = P(Age) P(Occ) P(Inc) P(Buy | Age, Occ, Inc) P(Int | Buy)
• Current state of the art: given the structure and probabilities, existing algorithms can
handle inference with categorical values and a limited representation of numerical values.
General Product Rule

      P(x1, ..., xn | M) = Π(i=1 to n) P(xi | Pai, M)
      where Pai = parents(xi)
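An illustrative Python sketch applying this product rule to the Age/Occ/Income/Buy/Int network shown earlier; the CPT numbers are invented for the example:

```python
# Hypothetical CPTs for a toy network: Age, Occ, Inc -> Buy -> Int
p_age = {"young": 0.3, "old": 0.7}
p_occ = {"tech": 0.4, "other": 0.6}
p_inc = {"high": 0.5, "low": 0.5}
p_buy = {  # P(Buy=yes | Age, Occ, Inc)
    ("old", "tech", "high"): 0.8, ("old", "tech", "low"): 0.5,
    ("old", "other", "high"): 0.6, ("old", "other", "low"): 0.3,
    ("young", "tech", "high"): 0.7, ("young", "tech", "low"): 0.4,
    ("young", "other", "high"): 0.5, ("young", "other", "low"): 0.2,
}
p_int = {"yes": 0.6, "no": 0.1}  # P(Int=yes | Buy)

def joint(age, occ, inc, buy, interested):
    # General product rule: multiply P(xi | parents(xi)) over all nodes
    pb = p_buy[(age, occ, inc)] if buy == "yes" else 1 - p_buy[(age, occ, inc)]
    pi = p_int[buy] if interested == "yes" else 1 - p_int[buy]
    return p_age[age] * p_occ[occ] * p_inc[inc] * pb * pi

print(joint("old", "tech", "high", "yes", "yes"))  # 0.7*0.4*0.5*0.8*0.6 = 0.0672
```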
Bayesian Belief Networks
• Inference
– Exact inference of probabilities in general for an
arbitrary Bayesian network is known to be NP-hard .
– Numerous methods have been proposed for
probabilistic inference in Bayesian networks .
– Monte Carlo methods provide approximate solutions
by randomly sampling the distributions of the
unobserved variables
Inference in Bayesian Networks
[Figure: a belief network over Age, Income, House Owner, Living Location, Newspaper Preference, Voting Pattern and EU.]
How likely are elderly rich people to buy the Sun?

      P(paper = Sun | Age > 60, Income > 60k)
Inference in Bayesian Networks

[Figure: the same belief network as above.]
How likely are elderly rich people who voted Labour to
buy the Daily Mail?

      P(paper = DM | Age > 60, Income > 60k, v = labour)
Bayesian Belief Networks
• Learning Bayesian Belief Networks
– Settings
• The network structure might be given in advance, or it might have to be
inferred from the training data.
• All the network variables might be directly observable in each training
example, or some might be unobservable.
– If the network structure is given in advance and the variables are fully
observable in the training examples, learning the conditional probability tables
is straightforward.
Bayesian Belief Networks
• If the network structure is given but only some of the variable
values are observable in the training data, the learning
problem is more difficult.
• Russell et al. (1995) propose a gradient ascent procedure
that learns the entries in the conditional probability tables.
• This procedure searches through a space of hypotheses that corresponds to
the set of all possible entries for the conditional probability
tables.
Bayesian Belief Networks
• Gradient Ascent Training of Bayesian Networks
– It maximizes P(D|h) by following the gradient of ln P(D|h) with
respect to the parameters that define the conditional probability
tables of the Bayesian network.

– wijk denotes the conditional probability that the network
variable Yi will take on the value yij,
given that its immediate parents Ui take on the values given by uik.
Bayesian Belief Networks
• Derivation
– Because the training examples d in D are drawn independently,
ln P(D|h) decomposes into a sum of ln P(d|h) over the examples,
so the gradient can be computed one example at a time.
Bayesian Belief Networks
• First we update each wijk by gradient ascent
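A sketch of the standard gradient and update step for this procedure (treat the details as a reconstruction), where η is a small learning-rate constant:

$$\frac{\partial \ln P(D \mid h)}{\partial w_{ijk}} = \sum_{d \in D} \frac{P(Y_i = y_{ij},\, U_i = u_{ik} \mid d)}{w_{ijk}}, \qquad
w_{ijk} \leftarrow w_{ijk} + \eta \sum_{d \in D} \frac{P(y_{ij}, u_{ik} \mid d)}{w_{ijk}}$$

followed by renormalizing the wijk so that each conditional probability table row sums to one and stays in [0, 1].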

• This algorithm is guaranteed only to find some local


optimum solution.
• An alternative to gradient ascent is the EM
algorithm which also finds locally maximum likelihood
solutions.
Bayesian Belief Networks
• Learning the Structure of Bayesian Networks
– A heuristic search algorithm called K2 for learning network structure is used
when the data is fully observable.
– It performs a greedy search that trades off network complexity for accuracy
over the training data.
– Constraint-based approaches to learning Bayesian network structure have also
been developed .
– These approaches infer independence and dependence relationships from the
data, and then use these relationships to construct Bayesian networks.


EM algorithm
• A widely used approach to learning in the presence of
unobserved variables.
• Consider a problem in which the data D is a set of instances
generated by a probability distribution that is a mixture of k
distinct Normal distributions.
• Each of the k Normal distributions has the same variance σ², and σ² is known.
• The learning task is to output a hypothesis h = (μ1, ..., μk)
that describes the means of each of the k distributions.
EM algorithm
• If k=2
– The EM algorithm first initializes the hypothesis to h =
(μ1,μ2),where μ1 and μ2 are
arbitrary initial values.
– It then iteratively re-estimates h by repeating the following two
steps until the procedure converges to a stationary value for h.
• Step 1: Calculate the expected value E[zij]of each hidden variable zij,
assuming the current hypothesis h = (μ1,μ2) holds.
• Step 2: Calculate a new maximum likelihood hypothesis h' = (μ1’,μ2’),
assuming the value taken on by each hidden variable zij is its expected
value E[zij ]
• E[zij] and the resulting maximum likelihood update for the means are
given by the expressions sketched below.
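A sketch of the standard E and M steps for the two-mean case, assuming the two mixture components are equally likely a priori:

$$E[z_{ij}] = \frac{\exp\!\big(-\frac{1}{2\sigma^2}(x_i - \mu_j)^2\big)}{\sum_{n=1}^{2} \exp\!\big(-\frac{1}{2\sigma^2}(x_i - \mu_n)^2\big)}, \qquad
\mu_j \leftarrow \frac{\sum_{i=1}^{m} E[z_{ij}]\, x_i}{\sum_{i=1}^{m} E[z_{ij}]}$$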
• General statement of EM Algorithm
– In general, let X = {x1, ..., xm} denote the
observed data in a set of m independently drawn
instances,
– let Z = {z1, ..., zm} denote the unobserved data in these
same instances, and let Y = X ∪ Z denote the full data.
– The EM algorithm searches for the maximum likelihood
hypothesis h' by seeking the h' that maximizes E[ln P(Y | h')].
• Derivation of the k Means Algorithm
– To apply EM we must derive an expression for Q(h | h')
that applies to our k-means problem.
– The logarithm of the probability, ln P(Y | h'), for all m
instances in the data is
• For any function f (z) that is a linear
function of z, the following equality holds.


• The second (maximization) step then finds
the values μ1', ..., μk' that maximize this Q
function.
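To make the two-mean procedure concrete, a compact NumPy sketch; the data and parameter values are invented for illustration, and the updates follow the E and M steps sketched above:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0
# Hypothetical data: a mixture of two Normals with unknown means
x = np.concatenate([rng.normal(-2, sigma, 200), rng.normal(3, sigma, 200)])

mu = np.array([0.0, 1.0])                  # arbitrary initial hypothesis h = (mu1, mu2)
for _ in range(50):
    # E-step: expected values of the hidden indicators z_ij
    d = np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2 * sigma ** 2))
    e_z = d / d.sum(axis=1, keepdims=True)
    # M-step: maximum likelihood means given the expected z_ij
    mu = (e_z * x[:, None]).sum(axis=0) / e_z.sum(axis=0)

print(mu)   # should approach the true means (about -2 and 3)
```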
