Bayesian Decision Theory and Learning
Jayanta Mukhopadhyay
Dept. of Computer Science and Engg.
Example
n MAP approach
n P(+ve | Cancer) P(Cancer) = .98 x .008 = .0078
n P(+ve | ~Cancer) P(~Cancer) = .03 x .992 = .0298
n Hence, select h2: ~Cancer
n ML approach
n P(+ve | Cancer) = .98
n P(+ve | ~Cancer) = .03
n Hence, select h1: Cancer!!
n Prior has a very important role in making a decision!
Example (Contd.)
n P(data) and posterior probabilities
n P(+ve)?
n = P(+ve,Cancer) + P(+ve,~Cancer)
n = P(+ve|Cancer)P(Cancer) + P(+ve|~Cancer)P(~Cancer)
n = 0.0376
n P(-ve)?
n = 1 - P(+ve) = 1 - 0.0376 = 0.9624
n P(Cancer|+ve)?
n = 0.0078/0.0376 = 0.21
n P(~Cancer|+ve)?
n = 1 - 0.21 = 0.79
n s.d. = √(0.79 x 0.21) ≈ 0.41: provides a measure of confidence!
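The arithmetic above can be checked with a short sketch (a minimal illustration using only the numbers from the slides):

```python
# Posterior computation for the cancer test example (values from the slides).
p_cancer = 0.008                       # prior P(Cancer)
p_pos_given_cancer = 0.98              # likelihood P(+ve | Cancer)
p_pos_given_no_cancer = 0.03           # likelihood P(+ve | ~Cancer)

# Unnormalized posteriors (numerators of Bayes' rule)
num_cancer = p_pos_given_cancer * p_cancer               # ≈ 0.0078
num_no_cancer = p_pos_given_no_cancer * (1 - p_cancer)   # ≈ 0.0298

p_pos = num_cancer + num_no_cancer                       # P(+ve) ≈ 0.0376
print(round(num_cancer / p_pos, 2))                      # P(Cancer | +ve) ≈ 0.21
print(round(num_no_cancer / p_pos, 2))                   # P(~Cancer | +ve) ≈ 0.79
```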
Features of Bayesian Learning
n Flexible learning from each observed instance:
n each observation either increases or decreases the probability of a hypothesis being correct.
n Prior knowledge of hypothesis used.
n Inductive bias.
n Accommodates hypotheses with probabilistic
prediction.
n Each hypothesis in the version space of concept
learning will have a weight while taking a decision.
n Provides a framework of optimal decision making.
n Even when computation is intractable!
Concept learning under Bayesian framework
n P(D|h): Likelihood (no error assumed in the data D)
n = 1 if h is consistent with D, i.e., h is an element of the version space VS_H,D
n = 0, otherwise
n P(h): Prior
n Prior taken as a uniform distribution: P(h) = 1/|H|
n P(D): Marginal prob. of data
n = sum of P(D|h).P(h) over H = (1.|VS_H,D|)/|H| = |VS_H,D|/|H|
n P(h|D) = (P(D|h).P(h))/P(D)
n = 1/|VS_H,D|, if h is in VS_H,D; else 0.
Least mean squared error estimate as the ML hypothesis
n Target function: y = f(x); h: hypothesis
n y_i = h(x_i) + e_i, i = 1,2,..,n, with e_i ~ N(0, σ²), the same distribution at each observation
n MSE = \sum_{i=1}^{n} (y_i - h(x_i))^2
n Likelihood: P(D|h) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(y_i - h(x_i))^2}{2\sigma^2}}
n Log-likelihood: \log P(D|h) = n \ln\frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2\sigma^2}\sum_{i=1}^{n} (y_i - h(x_i))^2
n h_ML = \arg\min_{h \in H} \sum_{i=1}^{n} (y_i - h(x_i))^2
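A small sketch of the argument: with Gaussian noise of fixed variance, the candidate hypothesis with the smaller sum of squared errors always has the larger log-likelihood. The data and candidate hypotheses are illustrative:

```python
import math

xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.1, 1.1, 1.9, 3.2]        # y_i = f(x_i) + noise
sigma = 0.2

def log_likelihood(h, xs, ys, sigma):
    # log P(D|h) = n*ln(1/sqrt(2*pi*sigma^2)) - sum((y_i - h(x_i))^2) / (2*sigma^2)
    n = len(xs)
    sse = sum((y - h(x)) ** 2 for x, y in zip(xs, ys))
    return n * math.log(1.0 / math.sqrt(2 * math.pi * sigma ** 2)) - sse / (2 * sigma ** 2)

candidates = {"h(x)=x": lambda x: x, "h(x)=1.1x": lambda x: 1.1 * x}
for name, h in candidates.items():
    sse = sum((y - h(x)) ** 2 for x, y in zip(xs, ys))
    print(name, "SSE:", round(sse, 3),
          "log-lik:", round(log_likelihood(h, xs, ys, sigma), 2))
# The candidate with the smaller SSE has the larger log-likelihood.
```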
Minimum description length principle in Bayesian learning
h_MAP = \arg\max_{h \in H} P(D|h) P(h) = \arg\min_{h \in H} \{ -\log_2 P(D|h) - \log_2 P(h) \}
n i.e., prefer the hypothesis that minimizes the description length of the hypothesis plus the description length of the data given the hypothesis.
Bayes optimal classification
n Weight the prediction of every hypothesis by its posterior. Example with three hypotheses, P(h1|D)=0.4, P(h2|D)=0.3, P(h3|D)=0.3, where h1 predicts + and h2, h3 predict -:
n c(x) = argmax{ P(+|h1).P(h1|D) + P(+|h2).P(h2|D) + P(+|h3).P(h3|D),
                P(-|h1).P(h1|D) + P(-|h2).P(h2|D) + P(-|h3).P(h3|D) }
      = argmax{ 1 x 0.4 + 0 + 0, 0 + 1 x 0.3 + 1 x 0.3 } = argmax{0.4, 0.6}
      = -ve
n Requires exhaustive enumeration over all hypotheses!
Gibbs algorithm
n Instead of enumerating exhaustively,
n choose a hypothesis h randomly for an instance x, according to the posterior distribution P(h|D).
n Apply h on the instance x.
n Performs sub-optimally:
n expected error is at most twice that of the Bayes optimal classifier when the prior has a uniform distribution.
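A minimal sketch of the Gibbs step, with an illustrative posterior over three hypotheses (the same numbers as the Bayes optimal example above):

```python
# Gibbs algorithm: instead of weighting every hypothesis, sample one
# hypothesis from the posterior P(h|D) and use it to classify x.
import random

posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}      # P(h|D)
predictions = {"h1": "+", "h2": "-", "h3": "-"}    # label each h assigns to x

def gibbs_classify(x=None):
    h = random.choices(list(posterior), weights=list(posterior.values()), k=1)[0]
    return predictions[h]                          # apply the sampled h to x

print(gibbs_classify())   # "+" with prob. 0.4, "-" with prob. 0.6
```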
Bayesian Classification (Summary)
o Input: a training set of tuples and their associated class labels.
o Each tuple is represented by an n-D attribute vector X = (x1, x2, …, xn).
o Let there be m classes C1, C2, …, Cm.
o Derive the maximum a posteriori class, i.e., the one with maximal P(Ci|X).
o P(Ci|X) = P(X|Ci) P(Ci) / P(X)
o Since P(X) is constant for all classes, only P(Ci) P(X|Ci) needs to be maximized.
Discriminant functions
n Bayesian classifiers can be expressed in the framework of classification based on a set of discriminant functions gi(x).
n Rule:
n Assign Ci if gi(x) > gk(x), for all k ≠ i.
n Examples:
n gi(x) = P(Ci|x)
n gi(x) = P(x|Ci) P(Ci)
n For two classes, a single function suffices: g(x) = g1(x) - g2(x); assign C1 if g(x) > 0.
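A small sketch of the two-class rule g(x) = g1(x) - g2(x); the priors and likelihood functions are made up for illustration:

```python
# Two-class Bayesian classifier written with discriminant functions.
def g1(x, prior1=0.6, lik1=lambda x: 0.8 if x > 0 else 0.2):
    return lik1(x) * prior1          # g1(x) = P(x|C1) P(C1)

def g2(x, prior2=0.4, lik2=lambda x: 0.3 if x > 0 else 0.7):
    return lik2(x) * prior2          # g2(x) = P(x|C2) P(C2)

def classify(x):
    # Single function for two classes: assign C1 if g1(x) - g2(x) > 0.
    return "C1" if g1(x) - g2(x) > 0 else "C2"

print(classify(1.0), classify(-1.0))   # C1  C2
```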
Challenges in computing
Computation involved:
Assign Ci to X iff the probability P(Ci|X) is the highest among the P(Ck|X) for all the k classes:
i = argmax_k {P(Ck|X)} = argmax_k {P(X|Ck) P(Ck)}
Challenges:
o Prior knowledge of probabilities of classes.
o Probability distributions in multidimensional feature spaces: X ∈ X1 x X2 x X3 x … x Xn.
Adapted from hanj.cs.illinois.edu/bk3/bk3_slides/08ClassBasic.ppt
Naïve Bayes Classifier
To estimate P(xi|Ck):
For a categorical or discrete variable:
o fraction of times the value occurred in the class.
For a continuous variable:
o may use parametric modeling with a Gaussian distribution:
  g(x, μ, σ) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}}
o for a continuous attribute xk, its contribution to P(X|Ci) is g(xk, μ_Ci, σ_Ci).
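A minimal sketch of the Gaussian option: estimate the class-wise mean and standard deviation and plug them into g(x, μ, σ). The training values are illustrative:

```python
# Gaussian model for a continuous attribute in a naive Bayes classifier.
import math
import statistics

def gaussian(x, mu, sigma):
    # g(x, mu, sigma) = (1 / (sqrt(2*pi) * sigma)) * exp(-(x - mu)^2 / (2*sigma^2))
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

ages_yes = [30, 35, 38, 40, 42]          # attribute values observed in class "yes"
mu, sigma = statistics.mean(ages_yes), statistics.stdev(ages_yes)
print(round(gaussian(36, mu, sigma), 4)) # density of age = 36 under the class model
```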
An Example: Training Dataset
Classes: C1: buys_computer = 'yes'; C2: buys_computer = 'no'
Data to be classified: X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age     income  student credit_rating  buys_computer
<=30    high    no      fair           no
<=30    high    no      excellent      no
31…40   high    no      fair           yes
>40     medium  no      fair           yes
>40     low     yes     fair           yes
>40     low     yes     excellent      no
31…40   low     yes     excellent      yes
<=30    medium  no      fair           no
<=30    low     yes     fair           yes
>40     medium  yes     fair           yes
<=30    medium  yes     excellent      yes
31…40   medium  no      excellent      yes
31…40   high    yes     fair           yes
>40     medium  no      excellent      no

hanj.cs.illinois.edu/bk3/bk3_slides/08ClassBasic.ppt
Computation of class prior
(counting over the training table above)
P(buys_computer = 'yes') = 9/14 = 0.643
P(buys_computer = 'no')  = 5/14 = 0.357
Likelihood estimation: income = "medium"
(counting over the training table above)
P(income = medium | buys_computer = 'yes') = 4/9 = 0.444
P(income = medium | buys_computer = 'no')  = 2/5 = 0.4
Likelihood estimation: student = "yes"
(counting over the training table above)
P(student = yes | buys_computer = 'yes') = 6/9 = 0.667
P(student = yes | buys_computer = 'no')  = 1/5 = 0.2
Likelihood estimation: credit_rating = "fair"
(counting over the training table above)
P(credit_rating = fair | buys_computer = 'yes') = 6/9 = 0.667
P(credit_rating = fair | buys_computer = 'no')  = 2/5 = 0.4
Likelihood estimation: P(X|Ci)
(similarly, P(age <= 30 | 'yes') = 2/9 = 0.222 and P(age <= 30 | 'no') = 3/5 = 0.6)
P(X|buys_computer = 'yes') = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X|buys_computer = 'no')  = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
Estimation of posterior: P(Ci|X) and class assignment
P(Ci|X) ∝ P(X|Ci) P(Ci):
P(X|buys_computer = 'yes') x P(buys_computer = 'yes') = 0.044 x 0.643 = 0.028
P(X|buys_computer = 'no')  x P(buys_computer = 'no')  = 0.019 x 0.357 = 0.007
Therefore, X belongs to class ('buys_computer = yes').
hanj.cs.illinois.edu/bk3/bk3_slides/08ClassBasic.ppt
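The whole worked example can be reproduced by counting; a minimal sketch (attribute order follows X: age, income, student, credit_rating):

```python
# Naive Bayes on the buys_computer example: estimate the prior and the
# per-attribute likelihoods by counting, then compare P(X|Ci) P(Ci).
data = [  # (age, income, student, credit_rating, buys_computer)
    ("<=30", "high", "no", "fair", "no"),   ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),(">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),   (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),  (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"), (">40", "medium", "no", "excellent", "no"),
]
X = ("<=30", "medium", "yes", "fair")

scores = {}
for c in ("yes", "no"):
    rows = [r for r in data if r[4] == c]
    prior = len(rows) / len(data)                     # P(Ci)
    lik = 1.0
    for j, value in enumerate(X):                     # P(X|Ci) = product of P(x_j|Ci)
        lik *= sum(1 for r in rows if r[j] == value) / len(rows)
    scores[c] = prior * lik                           # proportional to P(Ci|X)
print(scores)     # yes ≈ 0.028, no ≈ 0.007  ->  X is classified as "yes"
```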
Avoiding Zero-Probability
n Naïve Bayes multiplies the per-attribute estimates:
  P(X|Ci) = \prod_{k=1}^{n} P(x_k|Ci) = P(x_1|Ci) x P(x_2|Ci) x … x P(x_n|Ci)
n If any P(x_k|Ci) is zero (a value never seen with the class), the whole product becomes zero; a Laplacian (add-one) correction to the counts avoids this.
n Limitations of the independence assumption:
n loss of accuracy, since in real life dependencies exist among variables.
n E.g., hospital patients: profile attributes such as age, family history, etc.
n Such dependencies are handled by Bayesian (belief) networks.
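A minimal sketch of the Laplacian (add-one) correction mentioned above; the counts in the example call are illustrative:

```python
# Laplace (add-one) correction: if some attribute value never occurs with a
# class, its raw estimate is 0 and the whole naive Bayes product collapses.
def laplace_estimate(count_value_in_class, count_class, n_distinct_values):
    # P(x_k = v | Ci) ≈ (count + 1) / (class size + number of distinct values)
    return (count_value_in_class + 1) / (count_class + n_distinct_values)

# e.g. a value never seen in a class of 10 samples, attribute with 3 levels:
print(laplace_estimate(0, 10, 3))   # 1/13 ≈ 0.077 instead of 0
```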
Bayesian Network
n A more general framework
n for modeling conditional dependencies.
n represents the interaction between variables in a
graph.
n composed of nodes, and arcs between the nodes.
n A node: a random variable, X, with the probability of the
random variable, P(X).
n A directed arc from X to Y: X influences Y with P(Y|X).
n A directed acyclic graph (DAG)
n No cycle.
n The topology is called the structure; P(X), P(Y|X), etc. are the parameters.
An example
n Bayesian network modeling: Rain (R) → Wet grass (W)
n Parameters: P(R) = 0.2, P(W|R) = 0.8, P(W|~R) = 0.3

  R    W    P(R,W)
  R    W    0.16
  ~R   W    0.24
  R    ~W   0.04
  ~R   ~W   0.56

n Marginal prob.: P(R) = 0.2 and P(W) = 0.4
n Knowing that the grass is wet increases P(R) from 0.2 to P(R|W) = 0.16/0.4 = 0.4.
n A directed edge may not imply causality.
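A short sketch of the same computation, using the parameters given on the slide:

```python
# Two-node network Rain -> Wet grass, with the slide's parameters.
p_R = 0.2
p_W_given_R = 0.8
p_W_given_notR = 0.3

# Joint distribution from the factorization P(R, W) = P(R) P(W|R)
joint = {
    (True, True):   p_R * p_W_given_R,                  # 0.16
    (True, False):  p_R * (1 - p_W_given_R),            # 0.04
    (False, True):  (1 - p_R) * p_W_given_notR,         # 0.24
    (False, False): (1 - p_R) * (1 - p_W_given_notR),   # 0.56
}
p_W = joint[(True, True)] + joint[(False, True)]   # marginal P(W) = 0.4
p_R_given_W = joint[(True, True)] / p_W            # P(R|W) = 0.16 / 0.4 = 0.4
print(p_W, p_R_given_W)
```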
Formation of a graphical
model
n Form a graph
n by adding nodes, and
n arcs between two nodes, if they are not
independent.
n X and Y are independent if knowing one does not change the probability of the other:
n P(Y|X) = P(Y)
n and also P(X|Y) = P(X),
n i.e., P(X,Y) = P(X)P(Y).
Conditional Independence
n Conditional independence between X and Y, given occurrence of a third event Z:
n P(X,Y|Z) = P(X|Z)P(Y|Z)
n Can also be written as P(X|Z) = P(X|Y,Z)
n Head to tail connection: Y → Z → X, with P(X,Y,Z) = P(Y) P(Z|Y) P(X|Z)
n Tail to tail connection: Z → X and Z → Y, with P(X,Y,Z) = P(Z) P(X|Z) P(Y|Z)
n In both cases, given Z, X and Y are conditionally independent.
Conditional Independence
o Z blocks the path from Y to X when its value is known:
o if Z is removed, there is no path between Y and X.
o Given Z, X and Y are independent: P(X,Y|Z) = P(X|Z)P(Y|Z).
o This holds both for the head to tail connection (Y → Z → X, P(X,Y,Z) = P(Y) P(Z|Y) P(X|Z)) and for the tail to tail connection (Z → X, Z → Y, P(X,Y,Z) = P(Z) P(X|Z) P(Y|Z)).
Conditional Independence
n For specifying joint probabilities, there is no need to specify values at all possible data points.
n Instead of 8 specifications, only 5 are needed: a significant saving for a large network.
n Head to tail: Cloudy → Rain → Wet Grass, with parameters P(C) (1 value), P(R|C) (2 values), P(W|R) (2 values).
n Tail to tail: Cloudy → Sprinkler and Cloudy → Rain, with parameters P(C) (1), P(S|C) (2), P(R|C) (2).
Inference / Diagnosis from conditional independence
n To compute probabilities of all possible combinations of other variables, given a value of a leaf node.
n Head to tail (Y → Z → X, P(X,Y,Z) = P(Y) P(Z|Y) P(X|Z)): knowing X, infer about Z and then Y.
n Tail to tail (Z → X, Z → Y, P(X,Y,Z) = P(Z) P(X|Z) P(Y|Z)): knowing X, infer about Z and then Y.
Head to head connection
n X → Z ← Y: X and Y are independent, but become dependent when Z is known.
n The path from X to Y is blocked if Z is not observed (independent); otherwise it is not blocked (dependent through Z).
n Parameters: P(X) (1), P(Y) (1), P(Z|X,Y) (4).
n P(X,Y,Z) = P(X) P(Y) P(Z|X,Y)
n P(X,Y) = P(X)P(Y), but P(X,Y|Z) ≠ P(X|Z) P(Y|Z), where P(X,Y|Z) = P(X,Y,Z)/P(Z)
n P(Z) = \sum_{X}\sum_{Y} P(X,Y,Z) = \sum_{X}\sum_{Y} P(X) P(Y) P(Z|X,Y)
Bayesian Networks: Larger graphs from simpler graphs
n Propagating implied conditional independency.
n Network: Cloudy → Sprinkler, Cloudy → Rain, Sprinkler → Wet grass, Rain → Wet grass.
n Parameters: P(C) (1), P(S|C) (2), P(R|C) (2), P(W|S,R) (4): instead of 16, only 9 parameters are needed for P(C,S,R,W).
n P(C,S,R,W) = P(C) P(R|C) P(S|C) P(W|S,R)
o Explicitly encodes independencies.
o Allows breaking down inference into calculation over small groups of variables.
o Propagated from evidence nodes to query nodes.
Computation on Bayesian
Network
n Given the values of any set of variables as evidence, infer the probabilities of any other set of variables.
n A probabilistic database
n a machine that can answer queries regarding
the values of random variables.
n the difference between unsupervised and
supervised learning becomes blurry.
Inference through Bayesian Networks
P(X_1, X_2, …, X_d) = \prod_{i=1}^{d} P(X_i | \text{parents of } X_i)
n Given any subset of Xi , calculate the probability
distribution of some other subset of Xi by
marginalizing over the joint.
n exponential number of joint prob. combinations.
n Not exploiting implied independencies
n Redundancy of computing joint prob. of the same subsets.
n Efficient computation through belief propagation.
n Can accommodate hidden variables
n Values not known, but estimated from dependency of observed
variables.
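A minimal sketch of brute-force inference on the cloudy/sprinkler/rain/wet-grass network from the earlier slides, marginalizing the factored joint; the CPT values are assumed for illustration (they are not given in the slides):

```python
# Brute-force inference by summing the factored joint over unobserved variables.
from itertools import product

p_C = 0.5
p_S_given_C = {True: 0.1, False: 0.5}          # assumed P(S=1 | C)
p_R_given_C = {True: 0.8, False: 0.2}          # assumed P(R=1 | C)
p_W_given_SR = {(True, True): 0.99, (True, False): 0.9,
                (False, True): 0.9, (False, False): 0.0}   # assumed P(W=1 | S, R)

def joint(c, s, r, w):
    # P(C,S,R,W) = P(C) P(S|C) P(R|C) P(W|S,R)
    p = p_C if c else 1 - p_C
    p *= p_S_given_C[c] if s else 1 - p_S_given_C[c]
    p *= p_R_given_C[c] if r else 1 - p_R_given_C[c]
    p *= p_W_given_SR[(s, r)] if w else 1 - p_W_given_SR[(s, r)]
    return p

# Query P(R=1 | W=1): marginalize C and S out of the joint.
num = sum(joint(c, s, True, True) for c, s in product([True, False], repeat=2))
den = sum(joint(c, s, r, True) for c, s, r in product([True, False], repeat=3))
print(num / den)
```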
Naïve Bayes Classifier: A special case
n P(x1,x2,..,xd,C) = P(C) P(x1|C) P(x2|C)..P(xd|C)
n P(C|x) = (P(C) P(x|C))/P(x), with P(x|C) = P(x1|C) P(x2|C)..P(xd|C)
n Structure: class node C (with P(C)) is the single parent of attribute nodes x1, x2, …, xd.
n Apply the Bayesian classification rule.
Losses and risks
n Expected risk of action a_i for input x: R(a_i|x) = \sum_k l_{ik} P(C_k|x), where l_{ik} is the loss of taking action a_i when the true class is C_k.
n Choose the action a_i which minimizes R(a_i|x).
A few cases
n 0/1 loss case:
  l_{ik} = 0 if i = k; 1 otherwise
n R(a_i|x) = \sum_k l_{ik} P(C_k|x) = \sum_{k \neq i} P(C_k|x) = 1 - P(C_i|x)
n Minimizing the risk = maximizing the posterior P(C_i|x).
A few cases
n Include rejection for doubtful cases of classification:
n an additional (K+1-th) action a_{K+1} for rejection, with loss
  l_{ik} = 0 if i = k; λ if i = K+1; 1 otherwise
n R(a_i|x) = 1 - P(C_i|x), for i ≠ K+1
n R(a_{K+1}|x) = \sum_{k=1}^{K} λ P(C_k|x) = λ
n Optimum classification rule:
  Choose a_i if P(C_i|x) is maximum among i = 1,2,..,K and > 1 - λ;
  else reject (no class assignment).
A few cases
n The rejection rule is meaningful only if 0 < λ < 1:
n if λ = 0, the threshold 1 - λ = 1 can never be exceeded: always reject.
n if λ = 1, the threshold is 0: always accept.
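A minimal sketch of the rule with rejection; the posterior values are illustrative:

```python
# Choose the class with the largest posterior if it exceeds 1 - lambda,
# otherwise reject.
def decide(posteriors, lam):
    best = max(posteriors, key=posteriors.get)
    return best if posteriors[best] > 1 - lam else "reject"

posteriors = {"C1": 0.55, "C2": 0.30, "C3": 0.15}
print(decide(posteriors, lam=0.5))   # C1      (0.55 > 0.5)
print(decide(posteriors, lam=0.3))   # reject  (0.55 <= 0.7)
print(decide(posteriors, lam=0.0))   # reject  (threshold is 1: always reject)
print(decide(posteriors, lam=1.0))   # C1      (threshold is 0: always accept)
```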
Generalization to utility theory
n Instead of loss consider gain Uik for
taking action ai at state k (here given by
class Ck).
n Expected utility:
  EU(a_i|x) = \sum_k U_{ik} P(C_k|x)
n Choose a_i if EU(a_i|x) is maximum over all actions a_k.
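A minimal sketch of expected-utility maximization; the utility matrix and posteriors are illustrative:

```python
# Choose the action with the highest expected utility EU(a_i|x) = sum_k U_ik P(C_k|x).
posteriors = [0.7, 0.3]                   # P(C1|x), P(C2|x)
U = [[100, -50],                          # utilities of action a1 in states C1, C2
     [-10,  80]]                          # utilities of action a2 in states C1, C2

eu = [sum(u * p for u, p in zip(row, posteriors)) for row in U]
best_action = max(range(len(U)), key=lambda i: eu[i])
print(eu, "-> choose a%d" % (best_action + 1))   # [55.0, 17.0] -> choose a1
```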
Mining association rules
n An association rule:
n an implication X → Y
n X: antecedent; Y: consequent
n An example: basket analysis for dependency between procurement of items X and Y.
n Three useful measures:
n Support(X,Y): P(X,Y)
n # of customers who bought X and Y / # of total customers.
n Confidence(X → Y): P(Y|X) = P(X,Y)/P(X)
n # of customers who bought X and Y / # of customers who bought X.
n Lift(X,Y) = P(X,Y)/(P(X).P(Y)) = P(Y|X)/P(Y)
Three measures of association rules
n Support(X,Y): P(X,Y)
n Confidence(X → Y): P(Y|X) = P(X,Y)/P(X)
n Lift(X,Y) = P(X,Y)/(P(X).P(Y)) = P(Y|X)/P(Y)
n Confidence indicates the strength of the rule:
n should be very high (close to 1),
n and significantly higher than P(Y).
n Support shows statistical significance:
n should cover a considerable number of transactions;
n insignificant support with high confidence is meaningless.
n For independent X and Y, Lift is close to 1.
n A ratio away from 1 shows dependency:
n Lift > 1: presence of X makes Y more likely; Lift < 1: X makes Y less likely.
Apriori algorithm
n To get association rules with high support and
confidence from a database.
n Possible to generalize association among more than 2
variables.
n E.g. X, Z → Y
n Two steps:
n Finding frequent item sets.
n those which have enough support.
n Converting them to rules with enough confidence.
n by splitting the items into two, as items in the antecedent and items
in the consequent.
Agrawal, R., H. Mannila, R. Srikant, H. Toivonen, and A. Verkamo. 1996. "Fast Discovery of Association Rules." In Advances in Knowledge Discovery and Data Mining, ed. U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 307–328. Cambridge, MA: MIT Press.
Apriori algorithm: Step 1
n Finding frequent item sets, that is, those
which have enough support.
n Start searching from combinations of lower cardinality, e.g. 1-item sets, then 2-item sets, …
n Remove candidate supersets that contain any combination not in the list of frequent lower-cardinality sets.
n If X is not frequent, do not search any combination containing X.
n Requires (n+1) passes to find the largest frequent n-itemset.
Apriori algorithm: Step 2
n Converting them to rules with enough confidence,
n by splitting the items into two, as items in the antecedent
and items in the consequent.
n For every itemset, split keeping all but 1 in
antecedent and 1 item in consequent.
n E.g. for k itemset, k-1 items in antecedent and 1 item in
consequent.
n Remove those rules, which fail the test of confidence.
n In every pass, reduce antecedent part and increase
consequent part.
n Rules with larger consequent part are more useful.
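A minimal sketch of both steps on a toy basket database (the transactions and thresholds are illustrative):

```python
# Apriori sketch: (1) grow frequent itemsets level by level, keeping only those
# with enough support; (2) split each frequent itemset into antecedent -> consequent
# and keep the rules whose confidence passes the threshold.
from itertools import combinations

baskets = [{"milk", "diapers", "beer"}, {"milk", "diapers"},
           {"milk", "bread"}, {"diapers", "beer"}, {"milk", "diapers", "bread"}]
min_support, min_confidence = 0.4, 0.7

def support(itemset):
    return sum(1 for b in baskets if itemset <= b) / len(baskets)

items = {frozenset([i]) for b in baskets for i in b}
frequent = []
level = {s for s in items if support(s) >= min_support}
while level:
    frequent.extend(level)
    # candidate (k+1)-itemsets from unions of frequent k-itemsets;
    # the support check below prunes the infrequent ones.
    candidates = {a | b for a in level for b in level if len(a | b) == len(a) + 1}
    level = {c for c in candidates if support(c) >= min_support}

for itemset in (s for s in frequent if len(s) > 1):
    for consequent in itemset:                       # all-but-one -> one splits
        antecedent = itemset - {consequent}
        conf = support(itemset) / support(antecedent)
        if conf >= min_confidence:
            lift = conf / support(frozenset([consequent]))
            print(set(antecedent), "->", consequent,
                  "support", support(itemset),
                  "confidence", round(conf, 2), "lift", round(lift, 2))
```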
Association and causality
n XàY indicates association, not
causality.
n There may be hidden variables acting in the process that have not been identified.
n E.g. association among {diapers, baby
food, and milk} may be established.
n Hidden variable: Baby at home.
Summary
n Bayesian inference:
n Compute P(Class|x).
n Decisions may be taken by modeling the risk or utility of an action (e.g., assigning the i-th class to a sample whose true class is C_k).
n Classification rules can be set under the framework of
discriminant functions.
n Bayesian inference is useful in establishing association
among variables.
n Compute support (P(X,Y)), confidence (P(Y|X)), and lift (P(X,Y)/(P(X).P(Y))).
n Useful rules have high support and confidence, and lift away from 1.