IML Module 3
▶ Spam Classification
▶ Given an email, predict whether it is spam or not
▶ Medical Diagnosis
▶ Given a list of symptoms, predict whether a patient has disease X or not
▶ Weather
▶ Based on temperature, humidity, etc., predict whether it will rain tomorrow
Bayesian Classification
▶ Problem statement:
▶ Given features X1, X2, …, Xn
▶ Predict a label Y
Another Application
▶ Digit Recognition
▶ Given an image of a handwritten digit, the classifier predicts which digit it is (e.g., 5)
Bayes Theorem

$P(h \mid D) = \dfrac{P(D \mid h)\,P(h)}{P(D)}$

▶ P(h) = prior probability of hypothesis h
▶ P(D) = prior probability of the training data D
▶ P(h | D) = probability of h given D (posterior)
▶ P(D | h) = probability of D given h (likelihood)
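▶ A small worked example with illustrative numbers (not from the slides), in the spirit of the medical-diagnosis application above: suppose P(disease) = 0.008, P(+ | disease) = 0.98, and P(+ | no disease) = 0.03. Then

$P(\text{disease} \mid +) = \dfrac{P(+ \mid \text{disease})\,P(\text{disease})}{P(+)} = \dfrac{0.98 \times 0.008}{0.98 \times 0.008 + 0.03 \times 0.992} \approx \dfrac{0.0078}{0.0376} \approx 0.21$

so even after a positive test, the more probable hypothesis is still "no disease".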
Bayes Optimal Classifier
▶ The Bayes optimal classifier is a probabilistic model that makes the most probable prediction for
a new example, given the training dataset.
▶ This model is also referred to as the Bayes optimal learner, the Bayes classifier, Bayes optimal
decision boundary, or the Bayes optimal discriminant function.
▶ Bayes Classifier: Probabilistic model that makes the most probable prediction for new
examples.
What is the most probable classification of the new instance given the training data?
Choosing Hypotheses
$P(h \mid D) = \dfrac{P(D \mid h)\,P(h)}{P(D)}$

Generally we want the most probable hypothesis given the training data: the maximum a posteriori (MAP) hypothesis $h_{MAP}$:

$h_{MAP} = \arg\max_{h \in H} P(h \mid D) = \arg\max_{h \in H} \dfrac{P(D \mid h)\,P(h)}{P(D)} = \arg\max_{h \in H} P(D \mid h)\,P(h)$

If we assume $P(h_i) = P(h_j)$ for all $i, j$, we can simplify further and choose the maximum likelihood (ML) hypothesis:

$h_{ML} = \arg\max_{h_i \in H} P(D \mid h_i)$
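▶ A minimal Python sketch (illustrative, not from the slides) of choosing the MAP and ML hypotheses; likelihood and prior are assumed to be callables supplied by the caller:

# Pick the MAP hypothesis from the unnormalized scores P(D | h) * P(h);
# P(D) can be ignored because it is the same for every h.
def map_hypothesis(hypotheses, likelihood, prior):
    """hypotheses: iterable of h; likelihood(h) = P(D | h); prior(h) = P(h)."""
    return max(hypotheses, key=lambda h: likelihood(h) * prior(h))

# With equal priors P(h_i) = P(h_j), this reduces to the ML hypothesis:
def ml_hypothesis(hypotheses, likelihood):
    return max(hypotheses, key=likelihood)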
Bayes Optimal Classifier
Bayes optimal classification:

$\arg\max_{v_j \in V} \sum_{h_i \in H} P(v_j \mid h_i)\,P(h_i \mid D)$

Example:
$P(h_1 \mid D) = .4,\quad P(- \mid h_1) = 0,\quad P(+ \mid h_1) = 1$
$P(h_2 \mid D) = .3,\quad P(- \mid h_2) = 1,\quad P(+ \mid h_2) = 0$
$P(h_3 \mid D) = .3,\quad P(- \mid h_3) = 1,\quad P(+ \mid h_3) = 0$

therefore

$\sum_{h_i \in H} P(+ \mid h_i)\,P(h_i \mid D) = .4$
$\sum_{h_i \in H} P(- \mid h_i)\,P(h_i \mid D) = .6$

and

$\arg\max_{v_j \in V} \sum_{h_i \in H} P(v_j \mid h_i)\,P(h_i \mid D) = -$
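▶ The same computation as a small Python sketch, using the numbers from the example above:

# P(h | D) for each hypothesis, and P(v | h) for each label
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
pred = {"h1": {"+": 1, "-": 0},
        "h2": {"+": 0, "-": 1},
        "h3": {"+": 0, "-": 1}}

def bayes_optimal(labels=("+", "-")):
    # For each label v, sum P(v | h) * P(h | D) over all hypotheses, then take the argmax.
    score = {v: sum(pred[h][v] * posteriors[h] for h in posteriors) for v in labels}
    return max(score, key=score.get)

print(bayes_optimal())   # -> "-" (0.6 vs 0.4)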
The Bayes Classifier
▶ A good strategy is to predict the most probable class given the features: $\arg\max_{y} P(Y = y \mid X_1, \ldots, X_n)$
▶ So… how do we compute that?
The Bayes Classifier
▶ Use Bayes rule:
$P(Y \mid X_1, \ldots, X_n) = \dfrac{P(X_1, \ldots, X_n \mid Y)\,P(Y)}{P(X_1, \ldots, X_n)}$
(numerator: likelihood × prior; denominator: normalization constant)
▶ Why did this help? Because we can often specify how the features are "generated" by the class label.
ESTIMATING PROBABILITIES
The m-estimate of a probability is $\dfrac{n_c + m\,p}{n + m}$. Here, n and n_c are defined as before, p is our prior estimate of the probability we wish to determine, and m is a constant called the equivalent sample size, which determines how heavily to weight p relative to the observed data.
A typical method for choosing p is to assume uniform priors: if an attribute has k possible values, we set p = 1/k.
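▶ A minimal Python sketch of the m-estimate (the counts in the usage comment are made up for illustration, not from the slides):

def m_estimate(n_c, n, p, m):
    """m-estimate of a probability: (n_c + m*p) / (n + m).
    n_c: count of examples matching the condition
    n:   total number of examples
    p:   prior estimate (e.g. 1/k for an attribute with k values)
    m:   equivalent sample size (how heavily to weight the prior)
    """
    return (n_c + m * p) / (n + m)

# e.g. 3 of 14 examples of one class have a given attribute value, attribute has 3 values:
# m_estimate(3, 14, 1/3, m=3)   # = 4/17 ≈ 0.235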
New test data set
▶ Compute the likelihood of each class (e.g., class "no") for the new instance
▶ One common practice to handle numerical attribute values is to assume normal distributions for
numerical attributes.
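▶ A minimal Python sketch (illustrative, not from the slides) of a Gaussian class-conditional likelihood for one numeric attribute; the mean, standard deviation, and test value in the comment are made-up example numbers:

import math

def gaussian_likelihood(x, mean, std):
    """P(X = x | class), assuming the numeric attribute is normally distributed
    within the class, with mean and std estimated from the training data."""
    return (1.0 / (math.sqrt(2 * math.pi) * std)) * math.exp(-((x - mean) ** 2) / (2 * std ** 2))

# e.g. temperature = 66, class with estimated mean 73 and std 6.2:
# gaussian_likelihood(66, 73, 6.2)   # ≈ 0.034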
Naïve Bayes Assumption
▶ Assume the features are conditionally independent given the label: $P(X_1, \ldots, X_n \mid Y) = \prod_i P(X_i \mid Y)$
X1 X2 P(Y=0|X1,X2) P(Y=1|X1,X2)
0 0 1 0
0 1 0 1
1 0 0 1
1 1 1 0
▶ Actually, the Naïve Bayes assumption is almost never
true
▶ Still… Naïve Bayes often performs surprisingly well even when its assumptions do not
hold
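▶ A minimal Python sketch (illustrative, not from the slides) of a categorical Naïve Bayes classifier that combines the pieces above: class priors, per-attribute conditional probabilities smoothed with the m-estimate (uniform prior p = 1/k), and an argmax over classes:

from collections import Counter, defaultdict

class NaiveBayes:
    """Minimal categorical Naive Bayes sketch; not the slides' code."""

    def fit(self, X, y, m=1.0):
        self.m = m
        self.n = len(y)
        self.class_counts = Counter(y)                 # used for the priors P(Y=c)
        self.value_counts = defaultdict(Counter)       # (feature j, class c) -> value counts
        self.feature_values = defaultdict(set)         # feature j -> observed values
        for xi, yi in zip(X, y):
            for j, v in enumerate(xi):
                self.value_counts[(j, yi)][v] += 1
                self.feature_values[j].add(v)
        return self

    def predict(self, x):
        best_label, best_score = None, float("-inf")
        for c, nc in self.class_counts.items():
            score = nc / self.n                        # prior P(Y=c)
            for j, v in enumerate(x):
                k = len(self.feature_values[j])        # uniform prior p = 1/k over k values
                n_c = self.value_counts[(j, c)][v]     # count of value v within class c
                score *= (n_c + self.m * (1.0 / k)) / (nc + self.m)   # m-estimate of P(Xj=v | Y=c)
            if score > best_score:
                best_label, best_score = c, score
        return best_label

# e.g. nb = NaiveBayes().fit([["sunny", "hot"], ["rain", "cool"]], ["no", "yes"])
#      nb.predict(["sunny", "cool"])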
Bayesian Belief Networks
BBN – Conditional Independence
Inferences in Bayesian Belief Networks
Learning of Bayesian Networks
Learning Bayes Nets
Gradient Ascent for Bayes Nets
More on Learning Bayes Nets
Bayesian Belief Networks – Joint Probability
Bayesian Belief Networks – Marginal Probability (only a single condition given)
Bayesian Belief Networks – Inference for an uncertain event given evidence
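▶ A minimal Python sketch (not from the slides) of exact inference by enumeration on a tiny two-node network Rain → WetGrass; the CPT values are made up for illustration:

P_rain = {True: 0.2, False: 0.8}                          # P(Rain)
P_wet_given_rain = {True: 0.9, False: 0.1}                # P(WetGrass=True | Rain)

def p_rain_given_wet():
    # P(Rain=True | WetGrass=True) = P(Wet | Rain) P(Rain) / P(Wet)
    num = P_wet_given_rain[True] * P_rain[True]
    den = sum(P_wet_given_rain[r] * P_rain[r] for r in (True, False))
    return num / den                                      # = 0.18 / 0.26 ≈ 0.692

print(p_rain_given_wet())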
BBN – Advantages
• Intuitive, graphical, and efficient
• Accounts for sources of uncertainty
• Allows for information updating
• Models multiple interdependencies
• Models distributed & interacting systems
• Identifies critical components & cut sets
• Includes utility and decision nodes
BBN – Disadvantages
• Not ideally suited for computing small probabilities
• Practical limitations on the type of distributions and the form of statistical
dependence
• Computationally demanding for systems with a large number of random variables
• Exponential growth of computational effort with increased number of states
BBN – Applications
BBN – Summary
Expectation Maximization [EM] Algorithm
Expectation Maximization [EM] Algorithm – Finite Mixtures
Generating a Mixture of k Gaussians
EM for Estimating k Means
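▶ A minimal Python sketch (not from the slides) of EM for estimating the k = 2 means of a Gaussian mixture, assuming known equal variances and equal mixing weights:

import math, random

def em_two_means(xs, iters=50, sigma=1.0):
    """Estimate the two component means of a 1-D Gaussian mixture by EM."""
    mu = [min(xs), max(xs)]                           # crude initialization
    for _ in range(iters):
        # E-step: expected responsibilities E[z_ij] of each component for each point
        resp = []
        for x in xs:
            w = [math.exp(-(x - m) ** 2 / (2 * sigma ** 2)) for m in mu]
            s = sum(w)
            resp.append([wi / s for wi in w])
        # M-step: re-estimate each mean as a responsibility-weighted average
        for j in range(2):
            num = sum(r[j] * x for r, x in zip(resp, xs))
            den = sum(r[j] for r in resp)
            mu[j] = num / den
    return mu

# e.g. data drawn from two clusters around 0 and 5:
# em_two_means([random.gauss(0, 1) for _ in range(50)] + [random.gauss(5, 1) for _ in range(50)])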
General EM Problem
Extending the Mixture Model
General EM Method
Expectation Maximization [EM] Algorithm
Expectation Maximization [EM] Algorithm – Uses
Expectation Maximization [EM] Algorithm – Advantages
Expectation Maximization [EM] Algorithm – Disadvantages
Expectation Maximization [EM] Algorithm – Example
Ensemble Learning
Types of Ensemble Methods
1. Voting (Averaging)
2. Bootstrap aggregation (Bagging)
3. Random Forest
4. Boosting
5. Stacked Generalization (Blending)
Voting (Averaging)
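▶ A minimal sketch of hard (majority) voting using scikit-learn's VotingClassifier; the choice of base estimators and the iris data are illustrative assumptions:

from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
vote = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("nb", GaussianNB()),
                ("dt", DecisionTreeClassifier())],
    voting="hard")                       # majority vote over the three base models
vote.fit(X, y)
print(vote.predict(X[:5]))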
Bootstrap Aggregation (Bagging)
Random Forest
Boosting
Boosting
Stacked Generalization (Blending)
Stacked Generalization (Blending)
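▶ A minimal sketch of stacked generalization using scikit-learn's StackingClassifier; the base learners' predictions feed a logistic-regression blender (estimator choices and data are illustrative):

from sklearn.datasets import load_iris
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=50)),
                ("nb", GaussianNB())],
    final_estimator=LogisticRegression(max_iter=1000))   # blender trained on base predictions
stack.fit(X, y)
print(stack.predict(X[:5]))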
Note: Explore Pasting and Random Patches on your own (both are sampling variants of Bagging).