
Bayesian Learning

• Provides practical learning algorithms


– Naïve Bayes learning
– Bayesian belief network learning
– Combine prior knowledge (prior probabilities)

• Provides foundations for machine learning


– Evaluating learning algorithms
– Guiding the design of new algorithms
– Learning from models: meta-learning
Bayesian Classification: Why?
• Probabilistic learning: Calculate explicit probabilities for
hypotheses; among the most practical approaches to certain
types of learning problems
• Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is correct.
Prior knowledge can be combined with observed data.
• Probabilistic prediction: Predict multiple hypotheses,
weighted by their probabilities
• Standard: Even when Bayesian methods are computationally
intractable, they can provide a standard of optimal decision
making against which other methods can be measured
• Difficulty:
– They typically require initial knowledge of many
probabilities.
– When these probabilities are not known in advance they are
often estimated based on background knowledge, previously
available data, and assumptions about the form of the
underlying distributions.
– There is a significant computational cost required to
determine the Bayes optimal hypothesis in the general case.
Basic Formulas for Probabilities

• Product Rule: probability P(A, B) of a conjunction of two events A and B:

      P(A, B) = P(A | B) P(B) = P(B | A) P(A)

• Sum Rule: probability of a disjunction of two events A and B:

      P(A ∨ B) = P(A) + P(B) - P(A, B)

• Theorem of Total Probability: if events A1, ..., An are mutually
exclusive with Σ(i=1 to n) P(Ai) = 1, then

      P(B) = Σ(i=1 to n) P(B | Ai) P(Ai)
Basic Approach
Bayes Rule:

      P(h | D) = P(D | h) P(h) / P(D)

• P(h) = prior probability of hypothesis h
• P(D) = prior probability of training data D
• P(h|D) = probability of h given D (posterior probability)
• P(D|h) = probability of D given h (likelihood of D given h)
The Goal of Bayesian Learning: find the most probable hypothesis given the
training data (Maximum A Posteriori hypothesis hMAP)

      hMAP = argmax(h ∈ H) P(h | D)
           = argmax(h ∈ H) P(D | h) P(h) / P(D)
           = argmax(h ∈ H) P(D | h) P(h)
An Example
Does patient have cancer or not?
A patient takes a lab test and the result comes back positive. The test
returns a correct positive result in only 98% of the cases in which the
disease is actually present, and a correct negative result in only 97% of the
cases in which the disease is not present. Furthermore, .008 of the entire
population have this cancer.
P(cancer) = .008,          P(¬cancer) = .992
P(+ | cancer) = .98,       P(- | cancer) = .02
P(+ | ¬cancer) = .03,      P(- | ¬cancer) = .97

P(cancer | +)  = P(+ | cancer) P(cancer) / P(+)
P(¬cancer | +) = P(+ | ¬cancer) P(¬cancer) / P(+)
MAP Learner

For each hypothesis h in H, calculate the posterior probability

      P(h | D) = P(D | h) P(h) / P(D)

Output the hypothesis hMAP with the highest posterior probability

      hMAP = argmax(h ∈ H) P(h | D)

Comments:
Computationally intensive
Provides a standard for judging the performance of
learning algorithms
Choosing P(h) and P(D|h) reflects our prior
knowledge about the learning task
MAP Learner
• Assumptions:
– The training data D is noise free (i.e., di = c(xi))
– The target concept c is contained in the hypothesis space H .
– We have no a priori reason to believe that any hypothesis is
more probable than any other.
MAP Learner
• Case 1: If h is not consistent with D, then P(D | h) = 0, so
      P(h | D) = 0 · P(h) / P(D) = 0

• Case 2: If h is consistent with D, then P(D | h) = 1, so
      P(h | D) = 1 · (1/|H|) / P(D) = 1 / |VS_H,D|
  where VS_H,D is the version space of H with respect to D
  (using P(D) = |VS_H,D| / |H| under the assumptions above).

MAP Learner
• MAP Hypotheses and Consistent Learners
– Given the above analysis, every hypothesis consistent
with D is a MAP hypothesis; we can conclude that
every consistent learner outputs a MAP hypothesis,
– if we assume a uniform prior probability distribution
over H (i.e., P(hi) = P(hj) for all i, j), and if we assume
deterministic, noise-free training data (i.e., P(D|h) = 1
if D and h are consistent, and 0 otherwise).
MAXIMUM LIKELIHOOD AND LEAST-
SQUARED ERROR HYPOTHESES

• Consider the problem of learning a continuous-valued


target function .
• Any learning algorithm that minimizes the squared error
between the output hypothesis predictions and the training
data will output a maximum likelihood hypothesis .
• The significance of this result is that it provides a
Bayesian justification for minimizing the sum of squared errors.
MAXIMUM LIKELIHOOD AND LEAST-
SQUARED ERROR HYPOTHESES
• Example: The problem faced by learner L is to learn an unknown target function f :
X -> R drawn from H.
• X is the instance space, R the set of real values, and H the hypothesis space,
with h : X -> R for each h in H.
• A set of m training examples is provided, where the target value of each example is
corrupted by random noise .
• <xi,di> where di=f(xi)+ei
• The task of the learner is to output a maximum likelihood hypothesis, or,
equivalently, a MAP hypothesis assuming all hypotheses are equally probable a
priori.
• The dashed line in the following figure corresponds to the hypothesis hML with
least-squared training error, hence the maximum likelihood hypothesis.
MAXIMUM LIKELIHOOD AND LEAST-
SQUARED ERROR HYPOTHESES
• Two basic concepts from probability theory are needed to show that a
hypothesis that minimizes the sum of squared errors
is also a maximum likelihood hypothesis for continuous-valued
target functions:
– Probability density: to discuss probabilities over continuous variables
we must introduce probability densities, since finite probabilities cannot be
assigned to individual real values.

– The random noise variable e is assumed to be generated by a Normal
probability distribution.
MAXIMUM LIKELIHOOD AND LEAST-
SQUARED ERROR HYPOTHESES

• Derivation of hML
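A sketch of the standard derivation, assuming the noise terms ei are drawn independently from a zero-mean Normal distribution with known variance σ²:

$$h_{ML} = \arg\max_{h \in H} p(D \mid h) = \arg\max_{h \in H} \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{1}{2\sigma^2}(d_i - h(x_i))^2}$$

Taking logarithms and discarding terms that do not depend on h:

$$h_{ML} = \arg\max_{h \in H} \sum_{i=1}^{m} -\frac{1}{2\sigma^2}\big(d_i - h(x_i)\big)^2 = \arg\min_{h \in H} \sum_{i=1}^{m} \big(d_i - h(x_i)\big)^2$$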
MAXIMUM LIKELIHOOD AND LEAST-
SQUARED ERROR HYPOTHESES
• The first term is independent of h, so it can be dropped.
• Limitations
– The above analysis considers noise only in the
target value of the training example and does
not consider noise in the attributes describing
the instances themselves.
MAXIMUM LIKELIHOOD HYPOTHESES FOR
PREDICTING PROBABILITIES
• Learn a nondeterministic (probabilistic) function f :
X ->{0,1}, which has two discrete output values.
• Ex:medical patient symptoms.
• We might wish to learn a neural network (or other real-valued
function approximator) whose output is the probability that f (x) =
1.
• Now, consider the target function f' : X -> [0, 1] such that f'(x) = P(f(x) = 1).
• What criterion should we optimize in order to find a maximum
likelihood hypothesis for f' in this setting?
MAXIMUM LIKELIHOOD HYPOTHESES FOR
PREDICTING PROBABILITIES
• The training data D is of the form D = {<x1, d1>, ..., <xm, dm>},
where di is the observed 0 or 1 value for f(xi).
• P(D | h) can be written as a product over the training examples.
• Since each xi is independent of h, applying the product rule gives the
likelihood sketched below.
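A sketch of the standard likelihood for this setting, treating h(xi) as the predicted probability that f(xi) = 1:

$$P(D \mid h) = \prod_{i=1}^{m} P(x_i, d_i \mid h) = \prod_{i=1}^{m} h(x_i)^{d_i}\,\big(1 - h(x_i)\big)^{1 - d_i}\, P(x_i)$$

so that, dropping the P(xi) terms that do not depend on h,

$$h_{ML} = \arg\max_{h \in H} \sum_{i=1}^{m} d_i \ln h(x_i) + (1 - d_i) \ln\big(1 - h(x_i)\big)$$

(the cross-entropy criterion).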


MAXIMUM LIKELIHOOD HYPOTHESES FOR
PREDICTING PROBABILITIES
• Gradient Search to Maximize Likelihood in a Neural Net

• Using a single layer of sigmoid units
• The final expression for the gradient and the resulting weight
update rule are sketched below.
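A sketch of the standard gradient and weight-update rule under the cross-entropy criterion above, where x_ijk denotes the k-th input to sigmoid unit j for training example i and η is a small learning-rate constant:

$$\frac{\partial \ln P(D \mid h)}{\partial w_{jk}} = \sum_{i=1}^{m} \big(d_i - h(x_i)\big)\, x_{ijk}, \qquad
w_{jk} \leftarrow w_{jk} + \eta \sum_{i=1}^{m} \big(d_i - h(x_i)\big)\, x_{ijk}$$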


Minimum description Length principle

• The Minimum Description Length principle is motivated by


interpreting the definition of hMAP in the light of basic concepts from
information theory .

• This can be interpreted as a statement that short hypotheses are


preferred .
• Problem: Designing a compact code to transmit messages
drawn at random .
• So assign shorter codes to messages that are more probable.
• The number of bits required to encode message i using code C is called
the description length of message i with respect to C.
• Shannon and Weaver showed that the optimal code uses -log2 pi bits to encode
message i, where pi is the probability of message i.
• Applying this to hMAP gives the expression below.
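In description-length terms, the definition of hMAP can be rewritten as:

$$h_{MAP} = \arg\max_{h \in H} P(D \mid h)\,P(h) = \arg\min_{h \in H} \big[-\log_2 P(h) - \log_2 P(D \mid h)\big]$$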

• Term 1 (-log2 P(h)) is the description length of h under the optimal code for the
hypothesis space H, and Term 2 (-log2 P(D|h)) is the description length of the
data D under its optimal code, assuming that both the
sender and receiver know the hypothesis h.
• The Minimum Description Length (MDL) principle
recommends choosing the hypothesis that minimizes the sum of
these two description lengths
• Example:
– We wish to apply the MDL principle to the problem of
learning decision trees from some training data.
– C1: an encoding of decision trees in which the description length
grows with the number of nodes and edges in the tree.
– C2:
• If the sequence of instances is already known to the receiver, we need
transmit only the classifications. If the training classifications are
identical to the predictions of the hypothesis, then there is
no need to transmit any information about these examples.
• In the case where some examples are misclassified by h,
then for each misclassification we need to encode a
message that identifies which example is misclassified.
• Thus the MDL principle provides a way of trading off
hypothesis complexity for the number of errors committed by
the hypothesis.
• It might select a shorter hypothesis that makes a few errors over
a longer hypothesis that perfectly classifies the training data.
• Conclusions
– If a representation of hypotheses is chosen so
that the size of hypothesis h is -log2 P(h), and
– if a representation for exceptions is chosen so that the
encoding length of D given h is equal to -log2 P(D|h),
then the MDL principle produces MAP hypotheses.
Bayes Optimal Classifier
• Question: Given new instance x, what is its most probable classification?
• hMAP(x) is not necessarily the most probable classification!
Example: Let P(h1|D) = .4, P(h2|D) = .3, P(h3|D) = .3
Given new data x, we have h1(x) = +, h2(x) = -, h3(x) = -
What is the most probable classification of x?
Bayes optimal classification:

      argmax(vj ∈ V) Σ(hi ∈ H) P(vj | hi) P(hi | D)

Example:
P(h1|D) = .4,  P(-|h1) = 0,  P(+|h1) = 1
P(h2|D) = .3,  P(-|h2) = 1,  P(+|h2) = 0
P(h3|D) = .3,  P(-|h3) = 1,  P(+|h3) = 0

      Σ(hi ∈ H) P(+ | hi) P(hi | D) = .4
      Σ(hi ∈ H) P(- | hi) P(hi | D) = .6

so the Bayes optimal classification of x is -.
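A minimal Python sketch of this weighted vote; the posteriors and predictions are taken from the example above, while the function itself is illustrative:

```python
from collections import defaultdict

def bayes_optimal_classify(posteriors, predictions):
    """posteriors: {h: P(h|D)}, predictions: {h: h(x)} for a fixed new instance x."""
    scores = defaultdict(float)
    for h, p_h in posteriors.items():
        # P(v|h) is 1 for the value h predicts and 0 otherwise,
        # so each hypothesis votes with weight P(h|D)
        scores[predictions[h]] += p_h
    return max(scores, key=scores.get), dict(scores)

posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
predictions = {"h1": "+", "h2": "-", "h3": "-"}
print(bayes_optimal_classify(posteriors, predictions))
# ('-', {'+': 0.4, '-': 0.6})
```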
Gibbs algorithm
• Choose a hypothesis h from H at random, according to the
posterior probability distribution over H.
• Use h to predict the classification of the next instance x.
• The expected misclassification error for the Gibbs algorithm
is at most twice the expected error of the Bayes optimal
classifier .
• This result has an interesting implication for the concept
learning problem .

Naïve Bayes Learner
Assume target function f: X -> V, where each instance x is described
by attributes <a1, a2, ..., an>. The most probable value of f(x) is:

      vMAP = argmax(vj ∈ V) P(vj | a1, a2, ..., an)
           = argmax(vj ∈ V) P(a1, a2, ..., an | vj) P(vj) / P(a1, a2, ..., an)
           = argmax(vj ∈ V) P(a1, a2, ..., an | vj) P(vj)

Naïve Bayes assumption (attributes are conditionally independent given the class):

      P(a1, a2, ..., an | vj) = Π(i) P(ai | vj)
Naive Bayesian Classifier (II)
• Given a training set, we can compute the probabilities

Outlook      P    N        Humidity   P    N
sunny        2/9  3/5      high       3/9  4/5
overcast     4/9  0        normal     6/9  1/5
rain         3/9  2/5
Temperature  P    N        Windy      P    N
hot          2/9  2/5      true       3/9  3/5
mild         4/9  2/5      false      6/9  2/5
cool         3/9  1/5
Bayesian classification
• The classification problem may be formalized
using a-posteriori probabilities:
• P(C|X) = prob. that the sample tuple
X=<x1,…,xk> is of class C.

• E.g. P(class=N | outlook=sunny,windy=true,…)

• Idea: assign to sample X the class label C such


that P(C|X) is maximal
Estimating a-posteriori probabilities
• Bayes theorem:
P(C|X) = P(X|C)·P(C) / P(X)
• P(X) is constant for all classes
• P(C) = relative freq of class C samples
• C such that P(C|X) is maximum =
C such that P(X|C)·P(C) is maximum
• Problem: computing P(X|C) directly is infeasible!
Naïve Bayesian Classification
• Naïve assumption: attribute independence
P(x1,…,xk|C) = P(x1|C)·…·P(xk|C)
• If the i-th attribute is categorical:
P(xi|C) is estimated as the relative frequency of samples
having value xi as the i-th attribute in class C
• If the i-th attribute is continuous:
P(xi|C) is estimated through a Gaussian density function
• Computationally easy in both cases
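A small sketch of these two estimates in Python (the helper functions are illustrative; the Gaussian case estimates the class-conditional mean and variance from the samples):

```python
import math

def categorical_estimate(values_in_class, xi):
    # Relative frequency of value xi among the class-C samples
    return values_in_class.count(xi) / len(values_in_class)

def gaussian_estimate(values_in_class, xi):
    # Gaussian density with mean and variance estimated from the class-C samples
    mu = sum(values_in_class) / len(values_in_class)
    var = sum((v - mu) ** 2 for v in values_in_class) / len(values_in_class)
    return math.exp(-(xi - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
```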
Play-tennis example: estimating P(xi|C)

Training data:
Outlook   Temperature  Humidity  Windy  Class
sunny     hot          high      false  N
sunny     hot          high      true   N
overcast  hot          high      false  P
rain      mild         high      false  P
rain      cool         normal    false  P
rain      cool         normal    true   N
overcast  cool         normal    true   P
sunny     mild         high      false  N
sunny     cool         normal    false  P
rain      mild         normal    false  P
sunny     mild         normal    true   P
overcast  mild         high      true   P
overcast  hot          normal    false  P
rain      mild         high      true   N

Class priors: P(p) = 9/14, P(n) = 5/14

outlook:     P(sunny|p) = 2/9     P(sunny|n) = 3/5
             P(overcast|p) = 4/9  P(overcast|n) = 0
             P(rain|p) = 3/9      P(rain|n) = 2/5
temperature: P(hot|p) = 2/9       P(hot|n) = 2/5
             P(mild|p) = 4/9      P(mild|n) = 2/5
             P(cool|p) = 3/9      P(cool|n) = 1/5
humidity:    P(high|p) = 3/9      P(high|n) = 4/5
             P(normal|p) = 6/9    P(normal|n) = 2/5
windy:       P(true|p) = 3/9      P(true|n) = 3/5
             P(false|p) = 6/9     P(false|n) = 2/5
Example : Naïve Bayes
Predict whether tennis is played on a day with the conditions <sunny, cool, high,
strong> (i.e., compute P(v | o=sunny, t=cool, h=high, w=strong)) using the following
training data:
Day Outlook Temperature Humidity Wind Play Tennis
1 Sunny Hot High Weak No
2 Sunny Hot High Strong No
3 Overcast Hot High Weak Yes
4 Rain Mild High Weak Yes
5 Rain Cool Normal Weak Yes
6 Rain Cool Normal Strong No
7 Overcast Cool Normal Strong Yes
8 Sunny Mild High Weak No
9 Sunny Cool Normal Weak Yes
10 Rain Mild Normal Weak Yes
11 Sunny Mild Normal Strong Yes
12 Overcast Mild High Strong Yes
13 Overcast Hot Normal Weak Yes
14 Rain Mild High Strong No
Each conditional probability is estimated as a frequency ratio, e.g.

      P(strong | yes) = (# days of playing tennis with strong wind) / (# days of playing tennis) = 3/9

We then have:

      P(y) · P(sun|y) · P(cool|y) · P(high|y) · P(strong|y) ≈ .005
      P(n) · P(sun|n) · P(cool|n) · P(high|n) · P(strong|n) ≈ .021

so naïve Bayes predicts Play Tennis = No.
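A runnable Python sketch of this calculation using the training table above (the function and variable names are illustrative):

```python
from collections import Counter, defaultdict

# (outlook, temperature, humidity, wind, play) rows from the table above
data = [
    ("sunny","hot","high","weak","no"), ("sunny","hot","high","strong","no"),
    ("overcast","hot","high","weak","yes"), ("rain","mild","high","weak","yes"),
    ("rain","cool","normal","weak","yes"), ("rain","cool","normal","strong","no"),
    ("overcast","cool","normal","strong","yes"), ("sunny","mild","high","weak","no"),
    ("sunny","cool","normal","weak","yes"), ("rain","mild","normal","weak","yes"),
    ("sunny","mild","normal","strong","yes"), ("overcast","mild","high","strong","yes"),
    ("overcast","hot","normal","weak","yes"), ("rain","mild","high","strong","no"),
]

priors = Counter(row[-1] for row in data)      # class counts
cond = defaultdict(Counter)                    # (attribute index, class) -> value counts
for row in data:
    for i, value in enumerate(row[:-1]):
        cond[(i, row[-1])][value] += 1

def naive_bayes(x):
    scores = {}
    for v, n_v in priors.items():
        score = n_v / len(data)                        # P(v)
        for i, value in enumerate(x):
            score *= cond[(i, v)][value] / n_v         # P(ai | v)
        scores[v] = score
    return max(scores, key=scores.get), scores

print(naive_bayes(("sunny", "cool", "high", "strong")))
# ('no', {'no': ~0.021, 'yes': ~0.005})
```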
The independence hypothesis…
• … makes computation possible
• … yields optimal classifiers when satisfied
• … but is seldom satisfied in practice, as attributes
(variables) are often correlated.
• Attempts to overcome this limitation:
– Bayesian networks, that combine Bayesian reasoning with
causal relationships between attributes
– Decision trees, that reason on one attribute at a time,
considering the most important attributes first
Naïve Bayes Algorithm
Naïve_Bayes_Learn (examples)
for each target value vj
estimate P(vj)
for each attribute value ai of each attribute a
estimate P(ai | vj )

Classify_New_Instance(x)
      vNB = argmax(vj ∈ V) P(vj) Π(ai ∈ x) P(ai | vj)

Typical m-estimate of P(ai | vj), used in particular when nc = 0:

      P(ai | vj) = (nc + m·p) / (n + m)

where
      n  = number of training examples with v = vj
      nc = number of those examples with a = ai
      p  = prior estimate for P(ai | vj)
      m  = weight given to the prior (equivalent sample size)
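A one-function sketch of the m-estimate in Python (the example numbers are made up for illustration):

```python
def m_estimate(n_c, n, p, m):
    """m-estimate of P(ai|vj): n_c matching examples out of n, prior p with weight m."""
    return (n_c + m * p) / (n + m)

# e.g. no observed matches (n_c = 0) out of 9 class examples,
# uniform prior p = 1/3 over three attribute values, weight m = 3
print(m_estimate(0, 9, 1/3, 3))  # 0.0833... instead of 0
```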
An example: Learning to classify the text
• We might wish to learn the target concept "electronic news
articles that I find interesting“.
• General setting:
– Consider an instance space X consisting of all possible text
documents .
– The task is to learn from these training examples to predict
the target value for subsequent text documents.
– The target values are like and dislike to indicate the two
classes.
An example: Learning to classify
the text
• The two main design issues involved in applying
the naive Bayes classifier to such text classification
problems are
– To decide how to represent an arbitrary text
document in terms of attribute values
– To decide how to estimate the probabilities
required by the naive Bayes classifier.
An example: Learning to classify
the text
• We define an attribute for each word position in
the document and define the value of that
attribute to be the word found in that position.
• Ex: a 111-word document is represented by 111 attribute values.
An example: Learning to classify
the text
• We can now apply the naive Bayes classifier.
• we are given a set of 700 training documents that a
friend has classified as dislike and another 300 she
has classified as like.
• A new document is then classified by choosing the target value (like or
dislike) that maximizes P(vj) Π(i) P(ai | vj).
An example: Learning to classify
the text
• The assumption required to apply the naïve Bayes classifier is that the word
probabilities for one text position are independent of the
words that occur in other positions.
• Although this assumption is inaccurate, the naive
Bayes learner performs remarkably well in many text
classification problems.
• To estimate the class-conditional probabilities we make a further
independence assumption, described on the next slide.
An example: Learning to classify
the text
• The attributes are independent and identically distributed, given the target
classification;
• that is, P(ai = wk | vj) = P(am = wk | vj) for all i, j, k, m.
• The estimate for P(wk | vj) is then given by the expression below.
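With nk, n and |Vocabulary| as defined on the following slides, the standard Laplace-smoothed estimate (a uniform-prior m-estimate) is:

$$P(w_k \mid v_j) = \frac{n_k + 1}{n + |Vocabulary|}$$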

• n =the total number of word positions in all training examples .


An example: Learning to classify
the text
• nk is the number of times word wk is found
among these n word positions.
• | Vocabulary | is the total number of distinct
words .
An example: Learning to classify
the text
• Experiment Results:
– A minor variant of this algorithm was applied to the problem of classifying usenet news
articles.
– The target classification for an article in this
case was the name of the usenet newsgroup in which the article appeared.
– 1,000 articles were collected from each newsgroup, forming a data set of 20,000
documents.
– The naive Bayes algorithm was then applied using two-thirds of these 20,000 documents
as training examples, and performance was measured over the remaining third.
– The accuracy achieved by the program was 89%
An example: Learning to classify
the text
• Similarly impressive results have been achieved by others applying
similar statistical learning approaches to text classification.
• Ex: NEWSWEEDER system
– program for reading netnews that allows the user to rate articles as he or she
reads them.
– uses these rated articles as training examples to learn to predict which subsequent
articles will be of interest to the user .
– NEWSWEEDER used its learned profile of user interests to
suggest the most highly rated new articles each day.
Bayesian Belief Networks

• Naïve Bayes assumption of conditional independence too restrictive


• But it is intractable without some such assumptions
• A Bayesian belief network (Bayesian net) describes conditional
independence among subsets of variables (attributes), combining prior
knowledge about dependencies among variables with observed
training data.
• Bayesian Net
– Node = variable
– Arc = dependency
– DAG, with the direction of an arc representing causality
– To each variable A with parents B1, ..., Bn there is attached a
conditional probability table P(A | B1, ..., Bn)
Bayesian Belief Networks
• A Bayesian belief network describes the probability distribution over a set
of variables.
• Consider an arbitrary set of random variables Y1, ..., Yn; the joint space of
the set of variables Y is the cross product V(Y1) x V(Y2) x ... x V(Yn),
where V(Yi) is the set of possible values of Yi.
• The probability distribution over this joint space is called the joint
probability distribution.
• Conditional independence in a Bayesian belief network is defined
as follows.
Bayesian Belief Networks
• Let X , Y, and Z be three discrete-valued random variables.
• We say that X is conditionally independent of Y given Z if

      P(X = xi | Y = yj, Z = zk) = P(X = xi | Z = zk)   for all xi, yj, zk.

• We commonly write the above expression in abbreviated
form as P(X | Y, Z) = P(X | Z).
• Representation
• Example
Bayesian Belief Networks
[Figure: a belief network with nodes Age, Occupation (Occ), Income, Buy and Interested in Insurance (Int); Age, Occ and Income point to Buy, and Buy points to Int.]
• Age, Occupation and Income determine whether the customer will buy this product.
• Given that the customer buys the product, whether there is interest in insurance is
independent of Age, Occupation and Income.
• P(Age, Occ, Inc, Buy, Int) = P(Age) P(Occ) P(Inc) P(Buy | Age, Occ, Inc) P(Int | Buy)
• Current state of the art: given the structure and probabilities, existing algorithms can
handle inference with categorical values and a limited representation of numerical values.
General Product Rule

      P(x1, ..., xn | M) = Π(i=1 to n) P(xi | Pai, M)
      where Pai = parents(xi)
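An illustrative Python sketch applying this product rule to the Age/Occ/Income/Buy/Int network shown earlier; the CPT numbers are invented for the example:

```python
# Hypothetical CPTs for a toy network: Age, Occ, Inc -> Buy -> Int
p_age = {"young": 0.3, "old": 0.7}
p_occ = {"tech": 0.4, "other": 0.6}
p_inc = {"high": 0.5, "low": 0.5}
p_buy = {  # P(Buy=yes | Age, Occ, Inc)
    ("old", "tech", "high"): 0.8, ("old", "tech", "low"): 0.5,
    ("old", "other", "high"): 0.6, ("old", "other", "low"): 0.3,
    ("young", "tech", "high"): 0.7, ("young", "tech", "low"): 0.4,
    ("young", "other", "high"): 0.5, ("young", "other", "low"): 0.2,
}
p_int = {"yes": 0.6, "no": 0.1}  # P(Int=yes | Buy)

def joint(age, occ, inc, buy, interested):
    # General product rule: multiply P(xi | parents(xi)) over all nodes
    pb = p_buy[(age, occ, inc)] if buy == "yes" else 1 - p_buy[(age, occ, inc)]
    pi = p_int[buy] if interested == "yes" else 1 - p_int[buy]
    return p_age[age] * p_occ[occ] * p_inc[inc] * pb * pi

print(joint("old", "tech", "high", "yes", "yes"))  # 0.7*0.4*0.5*0.8*0.6 = 0.0672
```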
Bayesian Belief Networks
• Inference
– Exact inference of probabilities in general for an
arbitrary Bayesian network is known to be NP-hard .
– Numerous methods have been proposed for
probabilistic inference in Bayesian networks .
– Monte Carlo methods provide approximate solutions
by randomly sampling the distributions of the
unobserved variables
Inference in Bayesian Networks
[Figure: a belief network over Age, Income, House Owner, Living Location, Newspaper Preference, Voting Pattern and EU.]
How likely are elderly rich people to buy the Sun?

      P(paper = Sun | Age > 60, Income > 60k)
Inference in Bayesian Networks

[Figure: the same belief network as above.]
How likely are elderly rich people who voted Labour to
buy the Daily Mail?

      P(paper = DM | Age > 60, Income > 60k, v = labour)
Bayesian Belief Networks
• Learning Bayesian Belief Networks
– Settings
• The network structure might be given in advance, or it might have to be
inferred from the training data.
• All the network variables might be directly observable in each training
example, or some might be unobservable.
– If the network structure is given in advance and the variables are fully
observable in the training examples, learning the conditional probability tables
is straightforward.
Bayesian Belief Networks
• If the network structure is given but only some of the variable
values are observable in the training data, the learning
problem is more difficult.
• Russell et al. (1995) propose a gradient ascent procedure
that learns the entries in the conditional probability tables.
• This procedure searches through a space of hypotheses that corresponds to
the set of all possible entries for the conditional probability
tables.
Bayesian Belief Networks
• Gradient Ascent Training of Bayesian Networks
– It maximizes P(D|h) by following the gradient of ln P(D|h) with
respect to the parameters that define the conditional probability
tables of the Bayesian network.

– wijk denotes the conditional probability that the network
variable Yi will take on the value yij,
given that its immediate parents Ui take on the values given by uik.
Bayesian Belief Networks
• Derivation
– Because the training examples d in D are drawn independently,
ln P(D|h) decomposes into a sum of ln P(d|h) over the examples,
so the gradient can be computed one example at a time.
Bayesian Belief Networks
• First we update each wijk by gradient ascent
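A sketch of the standard gradient and update step for this procedure (treat the details as a reconstruction), where η is a small learning-rate constant:

$$\frac{\partial \ln P(D \mid h)}{\partial w_{ijk}} = \sum_{d \in D} \frac{P(Y_i = y_{ij},\, U_i = u_{ik} \mid d)}{w_{ijk}}, \qquad
w_{ijk} \leftarrow w_{ijk} + \eta \sum_{d \in D} \frac{P(y_{ij}, u_{ik} \mid d)}{w_{ijk}}$$

followed by renormalizing the wijk so that each conditional probability table row sums to one and stays in [0, 1].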

• This algorithm is guaranteed only to find some local


optimum solution.
• An alternative to gradient ascent is the EM
algorithm which also finds locally maximum likelihood
solutions.
Bayesian Belief Networks
• Learning the Structure of Bayesian Networks
– A heuristic search algorithm called K2 for learning network structure is used
when the data is fully observable.
– It performs a greedy search that trades off network complexity for accuracy
over the training data.
– Constraint-based approaches to learning Bayesian network structure have also
been developed .
– These approaches infer independence and dependence relationships from the
data, and then use these relationships to construct Bayesian networks.


EM algorithm
• A widely used approach to learning in the presence of
unobserved variables.
• Consider a problem in which the data D is a set of instances
generated by a probability distribution that is a mixture of k
distinct Normal distributions.
• Each of the k Normal distributions has the same variance σ², and σ² is known.
• The learning task is to output a hypothesis h = (μ1, ..., μk)
that describes the means of each of the k distributions.
EM algorithm
• If k=2
– The EM algorithm first initializes the hypothesis to h =
(μ1,μ2),where μ1 and μ2 are
arbitrary initial values.
– It then iteratively re-estimates h by repeating the following two
steps until the procedure converges to a stationary value for h.
• Step 1: Calculate the expected value E[zij]of each hidden variable zij,
assuming the current hypothesis h = (μ1,μ2) holds.
• Step 2: Calculate a new maximum likelihood hypothesis h' = (μ1’,μ2’),
assuming the value taken on by each hidden variable zij is its expected
value E[zij ]
• E[zij] and the resulting maximum likelihood update for the means are
given by the expressions sketched below.
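A sketch of the standard E and M steps for the two-mean case, assuming the two mixture components are equally likely a priori:

$$E[z_{ij}] = \frac{\exp\!\big(-\frac{1}{2\sigma^2}(x_i - \mu_j)^2\big)}{\sum_{n=1}^{2} \exp\!\big(-\frac{1}{2\sigma^2}(x_i - \mu_n)^2\big)}, \qquad
\mu_j \leftarrow \frac{\sum_{i=1}^{m} E[z_{ij}]\, x_i}{\sum_{i=1}^{m} E[z_{ij}]}$$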
• General statement of EM Algorithm
– In general, let X = {x1, ..., xm} denote the
observed data in a set of m independently drawn
instances,
– let Z = {z1, ..., zm} denote the unobserved data in these
same instances, and let Y = X ∪ Z denote the full data.
– The EM algorithm searches for the maximum likelihood
hypothesis h' by seeking the h' that maximizes E[ln P(Y | h')].
• Derivation of the k Means Algorithm
– To apply EM we must derive an expression for Q(h | h')
that applies to our k-means problem.
– The logarithm of the probability, ln P(Y | h'), for all m
instances in the data is
• For any function f (z) that is a linear
function of z, the following equality holds.


• The second (maximization) step then finds
the values μ1', ..., μk' that maximize this Q
function.
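To make the two-mean procedure concrete, a compact NumPy sketch; the data and parameter values are invented for illustration, and the updates follow the E and M steps sketched above:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0
# Hypothetical data: a mixture of two Normals with unknown means
x = np.concatenate([rng.normal(-2, sigma, 200), rng.normal(3, sigma, 200)])

mu = np.array([0.0, 1.0])                  # arbitrary initial hypothesis h = (mu1, mu2)
for _ in range(50):
    # E-step: expected values of the hidden indicators z_ij
    d = np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2 * sigma ** 2))
    e_z = d / d.sum(axis=1, keepdims=True)
    # M-step: maximum likelihood means given the expected z_ij
    mu = (e_z * x[:, None]).sum(axis=0) / e_z.sum(axis=0)

print(mu)   # should approach the true means (about -2 and 3)
```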
