Bayesian Decision Theory and Learning
Jayanta Mukhopadhyay
Dept. of Computer Science and Engg.
Example
n MAP approach
n P(+ve | Cancer) P(Cancer) = .98 x .008 = .0078
n P(+ve | ~Cancer) P(~Cancer) = .03 x .992 = .0298
n Hence, select h2: ~Cancer
n ML approach
n P(+ve | Cancer) = .98
n P(+ve | ~Cancer) = .03
n Hence, select h1: Cancer!!
n Prior has a very important role in making a decision!
Example (Contd.)
n P(data) and posterior probabilities
n P(+ve)?
n = P(+ve,Cancer) + P(+ve,~Cancer)
n = P(+ve|Cancer)P(Cancer) + P(+ve|~Cancer)P(~Cancer)
n = 0.0376
n P(-ve)?
n = 1 - P(+ve) = 1 - 0.0376 = 0.9624
n P(Cancer|+ve)?
n = 0.0078/0.0376 = 0.21
n P(~Cancer|+ve)?
n = 1 - 0.21 = 0.79
n s.d. = √(0.79 x 0.21) ≈ 0.41: provides a measure of confidence!
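The arithmetic above can be checked with a short sketch (a minimal illustration using only the numbers from the slides):

```python
# Posterior computation for the cancer test example (values from the slides).
p_cancer = 0.008                       # prior P(Cancer)
p_pos_given_cancer = 0.98              # likelihood P(+ve | Cancer)
p_pos_given_no_cancer = 0.03           # likelihood P(+ve | ~Cancer)

# Unnormalized posteriors (numerators of Bayes' rule)
num_cancer = p_pos_given_cancer * p_cancer               # ≈ 0.0078
num_no_cancer = p_pos_given_no_cancer * (1 - p_cancer)   # ≈ 0.0298

p_pos = num_cancer + num_no_cancer                       # P(+ve) ≈ 0.0376
print(round(num_cancer / p_pos, 2))                      # P(Cancer | +ve) ≈ 0.21
print(round(num_no_cancer / p_pos, 2))                   # P(~Cancer | +ve) ≈ 0.79
```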
Features of Bayesian Learning
n Flexible learning from each observed instance:
n each observation either increases or decreases the probability of a hypothesis being correct.
n Prior knowledge of hypothesis used.
n Inductive bias.
n Accommodates hypotheses with probabilistic
prediction.
n Each hypothesis in the version space of concept
learning will have a weight while taking a decision.
n Provides a framework of optimal decision making.
n Even when computation is intractable!
Concept learning under Bayesian framework
n P(D|h): Likelihood (no error assumed in the data D)
n = 1 if h is consistent with D, i.e., h is an element of the version space VS_H,D
n = 0, otherwise
n P(h): Prior
n Prior taken as a uniform distribution: P(h) = 1/|H|
n P(D): Marginal prob. of data
n = sum of P(D|h).P(h) over H = (1.|VS_H,D|)/|H| = |VS_H,D|/|H|
n P(h|D) = (P(D|h).P(h))/P(D)
n = 1/|VS_H,D|, if h is in VS_H,D; else 0.
Least mean squared error estimate as the ML hypothesis
n Target function: y = f(x); h: hypothesis
n y_i = h(x_i) + e_i, i = 1,2,..,n, with e_i ~ N(0, σ²), the same distribution at each observation
n MSE = \sum_{i=1}^{n} (y_i - h(x_i))^2
n Likelihood: P(D|h) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(y_i - h(x_i))^2}{2\sigma^2}}
n Log-likelihood: \log P(D|h) = n \ln\frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2\sigma^2}\sum_{i=1}^{n} (y_i - h(x_i))^2
n h_ML = \arg\min_{h \in H} \sum_{i=1}^{n} (y_i - h(x_i))^2
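A small sketch of the argument: with Gaussian noise of fixed variance, the candidate hypothesis with the smaller sum of squared errors always has the larger log-likelihood. The data and candidate hypotheses are illustrative:

```python
import math

xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.1, 1.1, 1.9, 3.2]        # y_i = f(x_i) + noise
sigma = 0.2

def log_likelihood(h, xs, ys, sigma):
    # log P(D|h) = n*ln(1/sqrt(2*pi*sigma^2)) - sum((y_i - h(x_i))^2) / (2*sigma^2)
    n = len(xs)
    sse = sum((y - h(x)) ** 2 for x, y in zip(xs, ys))
    return n * math.log(1.0 / math.sqrt(2 * math.pi * sigma ** 2)) - sse / (2 * sigma ** 2)

candidates = {"h(x)=x": lambda x: x, "h(x)=1.1x": lambda x: 1.1 * x}
for name, h in candidates.items():
    sse = sum((y - h(x)) ** 2 for x, y in zip(xs, ys))
    print(name, "SSE:", round(sse, 3),
          "log-lik:", round(log_likelihood(h, xs, ys, sigma), 2))
# The candidate with the smaller SSE has the larger log-likelihood.
```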
Minimum description length principle in Bayesian learning
h_MAP = \arg\max_{h \in H} P(D|h) P(h) = \arg\min_{h \in H} \{ -\log_2 P(D|h) - \log_2 P(h) \}
n i.e., prefer the hypothesis that minimizes the description length of the hypothesis plus the description length of the data given the hypothesis.
Bayes optimal classification
n Weight the prediction of every hypothesis by its posterior. Example with three hypotheses, P(h1|D)=0.4, P(h2|D)=0.3, P(h3|D)=0.3, where h1 predicts + and h2, h3 predict -:
n c(x) = argmax{ P(+|h1).P(h1|D) + P(+|h2).P(h2|D) + P(+|h3).P(h3|D),
                P(-|h1).P(h1|D) + P(-|h2).P(h2|D) + P(-|h3).P(h3|D) }
      = argmax{ 1 x 0.4 + 0 + 0, 0 + 1 x 0.3 + 1 x 0.3 } = argmax{0.4, 0.6}
      = -ve
n Requires exhaustive enumeration over all hypotheses!
Gibbs algorithm
n Instead of enumerating exhaustively,
n choose a hypothesis h randomly for an instance x, according to the posterior distribution P(h|D).
n Apply h on the instance x.
n Performs sub-optimally:
n expected error is at most twice that of the Bayes optimal classifier when the prior has a uniform distribution.
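A minimal sketch of the Gibbs step, with an illustrative posterior over three hypotheses (the same numbers as the Bayes optimal example above):

```python
# Gibbs algorithm: instead of weighting every hypothesis, sample one
# hypothesis from the posterior P(h|D) and use it to classify x.
import random

posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}      # P(h|D)
predictions = {"h1": "+", "h2": "-", "h3": "-"}    # label each h assigns to x

def gibbs_classify(x=None):
    h = random.choices(list(posterior), weights=list(posterior.values()), k=1)[0]
    return predictions[h]                          # apply the sampled h to x

print(gibbs_classify())   # "+" with prob. 0.4, "-" with prob. 0.6
```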
Bayesian Classification (Summary)
o Input: a training set of tuples and their associated class labels.
o Each tuple is represented by an n-D attribute vector X = (x1, x2, …, xn).
o Let there be m classes C1, C2, …, Cm.
o Derive the maximum a posteriori class, i.e., the one with maximal P(Ci|X).
o P(Ci|X) = P(X|Ci) P(Ci) / P(X)
o Since P(X) is constant for all classes, only P(Ci) P(X|Ci) needs to be maximized.
Discriminant functions
n Bayesian classifiers can be expressed in the framework of classification based on a set of discriminant functions gi(x).
n Rule:
n Assign Ci if gi(x) > gk(x), for all k ≠ i.
n Examples:
n gi(x) = P(Ci|x)
n gi(x) = P(x|Ci) P(Ci)
n For two classes, a single function suffices: g(x) = g1(x) - g2(x); assign C1 if g(x) > 0.
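A small sketch of the two-class rule g(x) = g1(x) - g2(x); the priors and likelihood functions are made up for illustration:

```python
# Two-class Bayesian classifier written with discriminant functions.
def g1(x, prior1=0.6, lik1=lambda x: 0.8 if x > 0 else 0.2):
    return lik1(x) * prior1          # g1(x) = P(x|C1) P(C1)

def g2(x, prior2=0.4, lik2=lambda x: 0.3 if x > 0 else 0.7):
    return lik2(x) * prior2          # g2(x) = P(x|C2) P(C2)

def classify(x):
    # Single function for two classes: assign C1 if g1(x) - g2(x) > 0.
    return "C1" if g1(x) - g2(x) > 0 else "C2"

print(classify(1.0), classify(-1.0))   # C1  C2
```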
Challenges in computing
Computation involved:
Assign Ci to X iff the probability P(Ci|X) is the highest among the P(Ck|X) for all the k classes:
i = argmax_k {P(Ck|X)} = argmax_k {P(X|Ck) P(Ck)}
Challenges:
o Prior knowledge of probabilities of classes.
o Probability distributions in multidimensional feature spaces: X ∈ X1 x X2 x X3 x … x Xn.
Adapted from hanj.cs.illinois.edu/bk3/bk3_slides/08ClassBasic.ppt
Naïve Bayes Classifier
To estimate P(xi|Ck):
For a categorical or discrete variable:
o fraction of times the value occurred in the class.
For a continuous variable:
o may use parametric modeling with a Gaussian distribution:
  g(x, μ, σ) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}}
o for a continuous attribute xk, its contribution to P(X|Ci) is g(xk, μ_Ci, σ_Ci).
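A minimal sketch of the Gaussian option: estimate the class-wise mean and standard deviation and plug them into g(x, μ, σ). The training values are illustrative:

```python
# Gaussian model for a continuous attribute in a naive Bayes classifier.
import math
import statistics

def gaussian(x, mu, sigma):
    # g(x, mu, sigma) = (1 / (sqrt(2*pi) * sigma)) * exp(-(x - mu)^2 / (2*sigma^2))
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

ages_yes = [30, 35, 38, 40, 42]          # attribute values observed in class "yes"
mu, sigma = statistics.mean(ages_yes), statistics.stdev(ages_yes)
print(round(gaussian(36, mu, sigma), 4)) # density of age = 36 under the class model
```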
An Example: Training Dataset
Classes: C1: buys_computer = 'yes'; C2: buys_computer = 'no'
Data to be classified: X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age     income  student credit_rating  buys_computer
<=30    high    no      fair           no
<=30    high    no      excellent      no
31…40   high    no      fair           yes
>40     medium  no      fair           yes
>40     low     yes     fair           yes
>40     low     yes     excellent      no
31…40   low     yes     excellent      yes
<=30    medium  no      fair           no
<=30    low     yes     fair           yes
>40     medium  yes     fair           yes
<=30    medium  yes     excellent      yes
31…40   medium  no      excellent      yes
31…40   high    yes     fair           yes
>40     medium  no      excellent      no

hanj.cs.illinois.edu/bk3/bk3_slides/08ClassBasic.ppt
Computation of class prior
(counting over the training table above)
P(buys_computer = 'yes') = 9/14 = 0.643
P(buys_computer = 'no')  = 5/14 = 0.357
Likelihood estimation: income = "medium"
(counting over the training table above)
P(income = medium | buys_computer = 'yes') = 4/9 = 0.444
P(income = medium | buys_computer = 'no')  = 2/5 = 0.4
Likelihood estimation: student = "yes"
(counting over the training table above)
P(student = yes | buys_computer = 'yes') = 6/9 = 0.667
P(student = yes | buys_computer = 'no')  = 1/5 = 0.2
Likelihood estimation: credit_rating = "fair"
(counting over the training table above)
P(credit_rating = fair | buys_computer = 'yes') = 6/9 = 0.667
P(credit_rating = fair | buys_computer = 'no')  = 2/5 = 0.4
Likelihood estimation: P(X|Ci)
(similarly, P(age <= 30 | 'yes') = 2/9 = 0.222 and P(age <= 30 | 'no') = 3/5 = 0.6)
P(X|buys_computer = 'yes') = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X|buys_computer = 'no')  = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
Estimation of posterior: P(Ci|X) and class assignment
P(Ci|X) ∝ P(X|Ci) P(Ci):
P(X|buys_computer = 'yes') x P(buys_computer = 'yes') = 0.044 x 0.643 = 0.028
P(X|buys_computer = 'no')  x P(buys_computer = 'no')  = 0.019 x 0.357 = 0.007
Therefore, X belongs to class ('buys_computer = yes').
hanj.cs.illinois.edu/bk3/bk3_slides/08ClassBasic.ppt
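The whole worked example can be reproduced by counting; a minimal sketch (attribute order follows X: age, income, student, credit_rating):

```python
# Naive Bayes on the buys_computer example: estimate the prior and the
# per-attribute likelihoods by counting, then compare P(X|Ci) P(Ci).
data = [  # (age, income, student, credit_rating, buys_computer)
    ("<=30", "high", "no", "fair", "no"),   ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),(">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),   (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),  (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"), (">40", "medium", "no", "excellent", "no"),
]
X = ("<=30", "medium", "yes", "fair")

scores = {}
for c in ("yes", "no"):
    rows = [r for r in data if r[4] == c]
    prior = len(rows) / len(data)                     # P(Ci)
    lik = 1.0
    for j, value in enumerate(X):                     # P(X|Ci) = product of P(x_j|Ci)
        lik *= sum(1 for r in rows if r[j] == value) / len(rows)
    scores[c] = prior * lik                           # proportional to P(Ci|X)
print(scores)     # yes ≈ 0.028, no ≈ 0.007  ->  X is classified as "yes"
```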
Avoiding Zero-Probability
n Naïve Bayes multiplies the per-attribute estimates:
  P(X|Ci) = \prod_{k=1}^{n} P(x_k|Ci) = P(x_1|Ci) x P(x_2|Ci) x … x P(x_n|Ci)
n If any P(x_k|Ci) is zero (a value never seen with the class), the whole product becomes zero; a Laplacian (add-one) correction to the counts avoids this.
n Limitations of the independence assumption:
n loss of accuracy, since in real life dependencies exist among variables.
n E.g., hospital patients: profile attributes such as age, family history, etc.
n Such dependencies are handled by Bayesian (belief) networks.
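A minimal sketch of the Laplacian (add-one) correction mentioned above; the counts in the example call are illustrative:

```python
# Laplace (add-one) correction: if some attribute value never occurs with a
# class, its raw estimate is 0 and the whole naive Bayes product collapses.
def laplace_estimate(count_value_in_class, count_class, n_distinct_values):
    # P(x_k = v | Ci) ≈ (count + 1) / (class size + number of distinct values)
    return (count_value_in_class + 1) / (count_class + n_distinct_values)

# e.g. a value never seen in a class of 10 samples, attribute with 3 levels:
print(laplace_estimate(0, 10, 3))   # 1/13 ≈ 0.077 instead of 0
```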
Bayesian Network
n A more general framework
n for modeling conditional dependencies.
n represents the interaction between variables in a
graph.
n composed of nodes, and arcs between the nodes.
n A node: a random variable, X, with the probability of the
random variable, P(X).
n A directed arc from X to Y: X influences Y with P(Y|X).
n A directed acyclic graph (DAG)
n No cycle.
n The topology is called the structure; P(X), P(Y|X), etc. are the parameters.
An example
n Bayesian network modeling: Rain (R) → Wet grass (W)
n Parameters: P(R) = 0.2, P(W|R) = 0.8, P(W|~R) = 0.3

  R    W    P(R,W)
  R    W    0.16
  ~R   W    0.24
  R    ~W   0.04
  ~R   ~W   0.56

n Marginal prob.: P(R) = 0.2 and P(W) = 0.4
n Knowing that the grass is wet increases P(R) from 0.2 to P(R|W) = 0.16/0.4 = 0.4.
n A directed edge may not imply causality.
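A short sketch of the same computation, using the parameters given on the slide:

```python
# Two-node network Rain -> Wet grass, with the slide's parameters.
p_R = 0.2
p_W_given_R = 0.8
p_W_given_notR = 0.3

# Joint distribution from the factorization P(R, W) = P(R) P(W|R)
joint = {
    (True, True):   p_R * p_W_given_R,                  # 0.16
    (True, False):  p_R * (1 - p_W_given_R),            # 0.04
    (False, True):  (1 - p_R) * p_W_given_notR,         # 0.24
    (False, False): (1 - p_R) * (1 - p_W_given_notR),   # 0.56
}
p_W = joint[(True, True)] + joint[(False, True)]   # marginal P(W) = 0.4
p_R_given_W = joint[(True, True)] / p_W            # P(R|W) = 0.16 / 0.4 = 0.4
print(p_W, p_R_given_W)
```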
Formation of a graphical
model
n Form a graph
n by adding nodes, and
n arcs between two nodes, if they are not
independent.
n X and Y are independent if knowing one does not change the probability of the other:
n P(Y|X) = P(Y)
n and also P(X|Y) = P(X),
n i.e., P(X,Y) = P(X)P(Y).
Conditional Independence
n Conditional independence between X and Y, given occurrence of a third event Z:
n P(X,Y|Z) = P(X|Z)P(Y|Z)
n Can also be written as P(X|Z) = P(X|Y,Z)
n Head to tail connection: Y → Z → X, with P(X,Y,Z) = P(Y) P(Z|Y) P(X|Z)
n Tail to tail connection: Z → X and Z → Y, with P(X,Y,Z) = P(Z) P(X|Z) P(Y|Z)
n In both cases, given Z, X and Y are conditionally independent.
Conditional Independence
o Z blocks the path from Y to X when its value is known:
o if Z is removed, there is no path between Y and X.
o Given Z, X and Y are independent: P(X,Y|Z) = P(X|Z)P(Y|Z).
o This holds both for the head to tail connection (Y → Z → X, P(X,Y,Z) = P(Y) P(Z|Y) P(X|Z)) and for the tail to tail connection (Z → X, Z → Y, P(X,Y,Z) = P(Z) P(X|Z) P(Y|Z)).
Conditional Independence
n For specifying joint probabilities, there is no need to specify values at all possible data points.
n Instead of 8 specifications, only 5 are needed: a significant saving for a large network.
n Head to tail: Cloudy → Rain → Wet Grass, with parameters P(C) (1 value), P(R|C) (2 values), P(W|R) (2 values).
n Tail to tail: Cloudy → Sprinkler and Cloudy → Rain, with parameters P(C) (1), P(S|C) (2), P(R|C) (2).
Inference / Diagnosis from conditional independence
n To compute probabilities of all possible combinations of other variables, given a value of a leaf node.
n Head to tail (Y → Z → X, P(X,Y,Z) = P(Y) P(Z|Y) P(X|Z)): knowing X, infer about Z and then Y.
n Tail to tail (Z → X, Z → Y, P(X,Y,Z) = P(Z) P(X|Z) P(Y|Z)): knowing X, infer about Z and then Y.
Head to head connection
n X → Z ← Y: X and Y are independent, but become dependent when Z is known.
n The path from X to Y is blocked if Z is not observed (independent); otherwise it is not blocked (dependent through Z).
n Parameters: P(X) (1), P(Y) (1), P(Z|X,Y) (4).
n P(X,Y,Z) = P(X) P(Y) P(Z|X,Y)
n P(X,Y) = P(X)P(Y), but P(X,Y|Z) ≠ P(X|Z) P(Y|Z), where P(X,Y|Z) = P(X,Y,Z)/P(Z)
n P(Z) = \sum_{X}\sum_{Y} P(X,Y,Z) = \sum_{X}\sum_{Y} P(X) P(Y) P(Z|X,Y)
Bayesian Networks: Larger graphs from simpler graphs
n Propagating implied conditional independency.
n Network: Cloudy → Sprinkler, Cloudy → Rain, Sprinkler → Wet grass, Rain → Wet grass.
n Parameters: P(C) (1), P(S|C) (2), P(R|C) (2), P(W|S,R) (4): instead of 16, only 9 parameters are needed for P(C,S,R,W).
n P(C,S,R,W) = P(C) P(R|C) P(S|C) P(W|S,R)
o Explicitly encodes independencies.
o Allows breaking down inference into calculation over small groups of variables.
o Propagated from evidence nodes to query nodes.
Computation on Bayesian
Network
n Given the values of any set of variables as evidence, infer the probabilities of any other set of variables.
n A probabilistic database
n a machine that can answer queries regarding
the values of random variables.
n the difference between unsupervised and
supervised learning becomes blurry.
Inference through Bayesian Networks
P(X_1, X_2, …, X_d) = \prod_{i=1}^{d} P(X_i | \text{parents of } X_i)
n Given any subset of Xi , calculate the probability
distribution of some other subset of Xi by
marginalizing over the joint.
n exponential number of joint prob. combinations.
n Not exploiting implied independencies
n Redundancy of computing joint prob. of the same subsets.
n Efficient computation through belief propagation.
n Can accommodate hidden variables
n Values not known, but estimated from dependency of observed
variables.
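A minimal sketch of brute-force inference on the cloudy/sprinkler/rain/wet-grass network from the earlier slides, marginalizing the factored joint; the CPT values are assumed for illustration (they are not given in the slides):

```python
# Brute-force inference by summing the factored joint over unobserved variables.
from itertools import product

p_C = 0.5
p_S_given_C = {True: 0.1, False: 0.5}          # assumed P(S=1 | C)
p_R_given_C = {True: 0.8, False: 0.2}          # assumed P(R=1 | C)
p_W_given_SR = {(True, True): 0.99, (True, False): 0.9,
                (False, True): 0.9, (False, False): 0.0}   # assumed P(W=1 | S, R)

def joint(c, s, r, w):
    # P(C,S,R,W) = P(C) P(S|C) P(R|C) P(W|S,R)
    p = p_C if c else 1 - p_C
    p *= p_S_given_C[c] if s else 1 - p_S_given_C[c]
    p *= p_R_given_C[c] if r else 1 - p_R_given_C[c]
    p *= p_W_given_SR[(s, r)] if w else 1 - p_W_given_SR[(s, r)]
    return p

# Query P(R=1 | W=1): marginalize C and S out of the joint.
num = sum(joint(c, s, True, True) for c, s in product([True, False], repeat=2))
den = sum(joint(c, s, r, True) for c, s, r in product([True, False], repeat=3))
print(num / den)
```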
Naïve Bayes Classifier: A special case
n P(x1,x2,..,xd,C) = P(C) P(x1|C) P(x2|C)..P(xd|C)
n P(C|x) = (P(C) P(x|C))/P(x), with P(x|C) = P(x1|C) P(x2|C)..P(xd|C)
n Structure: class node C (with P(C)) is the single parent of attribute nodes x1, x2, …, xd.
n Apply the Bayesian classification rule.
Losses and risks
n Expected risk of action a_i for input x: R(a_i|x) = \sum_k l_{ik} P(C_k|x), where l_{ik} is the loss of taking action a_i when the true class is C_k.
n Choose the action a_i which minimizes R(a_i|x).
A few cases
n 0/1 loss case:
  l_{ik} = 0 if i = k; 1 otherwise
n R(a_i|x) = \sum_k l_{ik} P(C_k|x) = \sum_{k \neq i} P(C_k|x) = 1 - P(C_i|x)
n Minimizing the risk = maximizing the posterior P(C_i|x).
A few cases
n Include rejection for doubtful cases of classification:
n an additional (K+1-th) action a_{K+1} for rejection, with loss
  l_{ik} = 0 if i = k; λ if i = K+1; 1 otherwise
n R(a_i|x) = 1 - P(C_i|x), for i ≠ K+1
n R(a_{K+1}|x) = \sum_{k=1}^{K} λ P(C_k|x) = λ
n Optimum classification rule:
  Choose a_i if P(C_i|x) is maximum among i = 1,2,..,K and > 1 - λ;
  else reject (no class assignment).
A few cases
n The rejection rule is meaningful only if 0 < λ < 1:
n if λ = 0, the threshold 1 - λ = 1 can never be exceeded: always reject.
n if λ = 1, the threshold is 0: always accept.
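A minimal sketch of the rule with rejection; the posterior values are illustrative:

```python
# Choose the class with the largest posterior if it exceeds 1 - lambda,
# otherwise reject.
def decide(posteriors, lam):
    best = max(posteriors, key=posteriors.get)
    return best if posteriors[best] > 1 - lam else "reject"

posteriors = {"C1": 0.55, "C2": 0.30, "C3": 0.15}
print(decide(posteriors, lam=0.5))   # C1      (0.55 > 0.5)
print(decide(posteriors, lam=0.3))   # reject  (0.55 <= 0.7)
print(decide(posteriors, lam=0.0))   # reject  (threshold is 1: always reject)
print(decide(posteriors, lam=1.0))   # C1      (threshold is 0: always accept)
```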
Generalization to utility theory
n Instead of loss consider gain Uik for
taking action ai at state k (here given by
class Ck).
n Expected utility:
  EU(a_i|x) = \sum_k U_{ik} P(C_k|x)
n Choose a_i if EU(a_i|x) is maximum over all actions a_k.
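A minimal sketch of expected-utility maximization; the utility matrix and posteriors are illustrative:

```python
# Choose the action with the highest expected utility EU(a_i|x) = sum_k U_ik P(C_k|x).
posteriors = [0.7, 0.3]                   # P(C1|x), P(C2|x)
U = [[100, -50],                          # utilities of action a1 in states C1, C2
     [-10,  80]]                          # utilities of action a2 in states C1, C2

eu = [sum(u * p for u, p in zip(row, posteriors)) for row in U]
best_action = max(range(len(U)), key=lambda i: eu[i])
print(eu, "-> choose a%d" % (best_action + 1))   # [55.0, 17.0] -> choose a1
```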
Mining association rules
n An association rule:
n an implication X → Y
n X: antecedent; Y: consequent
n An example: basket analysis for dependency between procurement of items X and Y.
n Three useful measures:
n Support(X,Y): P(X,Y)
n # of customers who bought X and Y / # of total customers.
n Confidence(X → Y): P(Y|X) = P(X,Y)/P(X)
n # of customers who bought X and Y / # of customers who bought X.
n Lift(X,Y) = P(X,Y)/(P(X).P(Y)) = P(Y|X)/P(Y)
Three measures of association rules
n Support(X,Y): P(X,Y)
n Confidence(X → Y): P(Y|X) = P(X,Y)/P(X)
n Lift(X,Y) = P(X,Y)/(P(X).P(Y)) = P(Y|X)/P(Y)
n Confidence indicates the strength of the rule:
n should be very high (close to 1),
n and significantly higher than P(Y).
n Support shows statistical significance:
n should cover a considerable number of transactions;
n insignificant support with high confidence is meaningless.
n For independent X and Y, Lift is close to 1.
n A ratio away from 1 shows dependency:
n Lift > 1: presence of X makes Y more likely; Lift < 1: X makes Y less likely.
Apriori algorithm
n To get association rules with high support and
confidence from a database.
n Possible to generalize association among more than 2
variables.
n E.g. X, Z → Y
n Two steps:
n Finding frequent item sets.
n those which have enough support.
n Converting them to rules with enough confidence.
n by splitting the items into two, as items in the antecedent and items
in the consequent.
Agrawal, R., H. Mannila, R. Srikant, H. Toivonen, and A. Verkamo. 1996. "Fast Discovery of Association Rules." In Advances in Knowledge Discovery and Data Mining, ed. U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 307–328. Cambridge, MA: MIT Press.
Apriori algorithm: Step 1
n Finding frequent item sets, that is, those
which have enough support.
n Start searching from combinations of lower cardinality, e.g. 1-item sets, then 2-item sets, …
n Remove candidate supersets that contain any combination not in the list of frequent lower-cardinality sets.
n If X is not frequent, do not search any combination containing X.
n Requires (n+1) passes to find the largest frequent n-itemset.
Apriori algorithm: Step 2
n Converting them to rules with enough confidence,
n by splitting the items into two, as items in the antecedent
and items in the consequent.
n For every itemset, split keeping all but 1 in
antecedent and 1 item in consequent.
n E.g. for k itemset, k-1 items in antecedent and 1 item in
consequent.
n Remove those rules, which fail the test of confidence.
n In every pass, reduce antecedent part and increase
consequent part.
n Rules with larger consequent part are more useful.
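A minimal sketch of both steps on a toy basket database (the transactions and thresholds are illustrative):

```python
# Apriori sketch: (1) grow frequent itemsets level by level, keeping only those
# with enough support; (2) split each frequent itemset into antecedent -> consequent
# and keep the rules whose confidence passes the threshold.
from itertools import combinations

baskets = [{"milk", "diapers", "beer"}, {"milk", "diapers"},
           {"milk", "bread"}, {"diapers", "beer"}, {"milk", "diapers", "bread"}]
min_support, min_confidence = 0.4, 0.7

def support(itemset):
    return sum(1 for b in baskets if itemset <= b) / len(baskets)

items = {frozenset([i]) for b in baskets for i in b}
frequent = []
level = {s for s in items if support(s) >= min_support}
while level:
    frequent.extend(level)
    # candidate (k+1)-itemsets from unions of frequent k-itemsets;
    # the support check below prunes the infrequent ones.
    candidates = {a | b for a in level for b in level if len(a | b) == len(a) + 1}
    level = {c for c in candidates if support(c) >= min_support}

for itemset in (s for s in frequent if len(s) > 1):
    for consequent in itemset:                       # all-but-one -> one splits
        antecedent = itemset - {consequent}
        conf = support(itemset) / support(antecedent)
        if conf >= min_confidence:
            lift = conf / support(frozenset([consequent]))
            print(set(antecedent), "->", consequent,
                  "support", support(itemset),
                  "confidence", round(conf, 2), "lift", round(lift, 2))
```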
Association and causality
n XàY indicates association, not
causality.
n There may be hidden variables acting in the process that have not been identified.
n E.g. association among {diapers, baby
food, and milk} may be established.
n Hidden variable: Baby at home.
Summary
n Bayesian inference:
n Compute P(Class|x).
n Decisions may be taken by modeling the risk or utility of an action (e.g., assigning the i-th class to a sample whose true class is C_k).
n Classification rules can be set under the framework of
discriminant functions.
n Bayesian inference is useful in establishing association
among variables.
n Compute support (P(X,Y)), confidence (P(Y|X)), and lift (P(X,Y)/(P(X).P(Y))).
n Useful rules have high support and confidence, and lift away from 1.