IML Module 3

The document discusses various Bayesian, ensemble, and probabilistic learning techniques, focusing on concepts like Bayes theorem, maximum likelihood, and the Bayes optimal classifier. It also covers practical applications such as spam classification, medical diagnosis, and weather prediction, along with methods like Naïve Bayes and Bayesian belief networks. Additionally, it introduces ensemble learning methods including voting, bagging, and boosting, highlighting their significance in machine learning.

B M S College of Engineering

Department of Machine Learning


UNIT 03
Bayesian, Ensemble and Probabilistic Learning Techniques/Models
Brute-Force Bayes Concept Learning
We can design a straightforward concept learning algorithm that outputs the maximum a posteriori hypothesis, based on Bayes theorem, as follows:
1. For each hypothesis h in H, calculate the posterior probability P(h | D) = P(D | h) P(h) / P(D)
2. Output the hypothesis hMAP with the highest posterior probability
Assumptions
First consider the case where h is inconsistent with the training data D.
Now consider the case where h is consistent with D.
Maximum Likelihood and Least Squared Error Hypothesis
Minimum Description Length Principle

▶ Consider the problem of designing a code to transmit messages drawn at random from a set D
▶ The probability of drawing the ith message is pi
▶ While transmitting, we want a code that minimizes the expected number of bits
▶ To do this, we should assign shorter codes to the more probable messages
▶ We denote the description length of message i with respect to code C as LC(i)
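A minimal Python sketch of this idea (the message probabilities are placeholders, not from the slides): an optimal code assigns message i roughly -log2(pi) bits, so more probable messages get shorter codes and the expected message length is minimized.

import math

# Illustrative message probabilities p_i (placeholders for this sketch)
probs = {"m1": 0.5, "m2": 0.25, "m3": 0.125, "m4": 0.125}

# An optimal code C assigns message i a length of about L_C(i) = -log2(p_i) bits,
# i.e. shorter codes for the more probable messages
code_lengths = {m: -math.log2(p) for m, p in probs.items()}

# Expected number of bits per message: sum_i p_i * L_C(i)
expected_bits = sum(p * code_lengths[m] for m, p in probs.items())

print(code_lengths)   # {'m1': 1.0, 'm2': 2.0, 'm3': 3.0, 'm4': 3.0}
print(expected_bits)  # 1.75 bits per message on average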
Things We’d Like to Do

▶ Spam Classification
▶ Given an email, predict whether it is spam or not

▶ Medical Diagnosis
▶ Given a list of symptoms, predict whether a patient has disease X or not

▶ Weather
▶ Based on temperature, humidity, etc… predict if it will rain tomorrow
Bayesian Classification

▶ Problem statement:
▶ Given features
X1,X2,…,Xn
▶ Predict a label Y
Another Application

▶ Digit Recognition
[Figure: an image of a handwritten digit is fed to a classifier, which outputs 5]

▶ X1,…,Xn ∈ {0,1} (black vs. white pixels)
▶ Y ∈ {5,6} (predict whether a digit is a 5 or a 6)
Provides practical learning algorithms:
• Naïve Bayes learning
• Bayesian belief network learning
• Combine prior knowledge (prior probabilities) with observed data
• Requires prior probabilities
Provides a useful conceptual framework:
• Provides a "gold standard" for evaluating other learning algorithms
• Additional insight into Occam's razor
Bayes Theorem

P(h | D) = P(D | h) P(h) / P(D)

▶ P(h) = prior probability of hypothesis h
▶ P(D) = prior probability of training data D
▶ P(h|D) = probability of h given D
▶ P(D|h) = probability of D given h
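A minimal numeric sketch in Python of applying the theorem; the probability values are placeholders for illustration, not values from the slides.

p_h = 0.3          # P(h): prior probability of hypothesis h
p_d_given_h = 0.8  # P(D|h): probability of the data D given h
p_d = 0.5          # P(D): prior probability of the training data

# Bayes theorem: P(h|D) = P(D|h) * P(h) / P(D)
p_h_given_d = p_d_given_h * p_h / p_d
print(p_h_given_d)  # 0.48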
Bayes Optimal Classifier

▶ The Bayes optimal classifier is a probabilistic model that makes the most probable prediction for
a new example, given the training dataset.

▶ This model is also referred to as the Bayes optimal learner, the Bayes classifier, Bayes optimal
decision boundary, or the Bayes optimal discriminant function.

▶ Bayes Classifier: Probabilistic model that makes the most probable prediction for new
examples.

What is the most probable classification of the new instance given the training data?
Choosing Hypotheses

P(h | D) = P(D | h) P(h) / P(D)

Generally we want the most probable hypothesis given the training data, the maximum a posteriori (MAP) hypothesis hMAP:

hMAP = argmax h∈H  P(h | D)
     = argmax h∈H  P(D | h) P(h) / P(D)
     = argmax h∈H  P(D | h) P(h)

If we assume P(hi) = P(hj) for all i, j, we can simplify further and choose the maximum likelihood (ML) hypothesis:

hML = argmax hi∈H  P(D | hi)
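A minimal Python sketch (with placeholder priors and likelihoods) of choosing hMAP and hML over a small hypothesis space; note that with unequal priors the two choices can disagree.

# Placeholder hypothesis space: prior P(h) and likelihood P(D|h) for each hypothesis
hypotheses = {
    "h1": {"prior": 0.5, "likelihood": 0.2},
    "h2": {"prior": 0.3, "likelihood": 0.6},
    "h3": {"prior": 0.2, "likelihood": 0.8},
}

# hMAP = argmax_h P(D|h) * P(h)   (P(D) is the same for every h, so it can be dropped)
h_map = max(hypotheses, key=lambda h: hypotheses[h]["likelihood"] * hypotheses[h]["prior"])

# hML = argmax_h P(D|h)           (equivalent to hMAP when all priors are equal)
h_ml = max(hypotheses, key=lambda h: hypotheses[h]["likelihood"])

print(h_map)  # h2 (0.6 * 0.3 = 0.18 is the largest posterior-proportional score)
print(h_ml)   # h3 (largest likelihood, 0.8)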
Bayes Optimal Classifier
Bayes optimal classification:

argmax vj∈V  Σ hi∈H  P(vj | hi) P(hi | D)

Example:
P(h1|D)=.4, P(-|h1)=0, P(+|h1)=1
P(h2|D)=.3, P(-|h2)=1, P(+|h2)=0
P(h3|D)=.3, P(-|h3)=1, P(+|h3)=0
therefore

Σ hi∈H  P(+ | hi) P(hi | D) = .4
Σ hi∈H  P(− | hi) P(hi | D) = .6

and

argmax vj∈V  Σ hi∈H  P(vj | hi) P(hi | D) = −
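A minimal Python sketch that reproduces the example above: each hypothesis votes for each class value weighted by its posterior, and the value with the largest weighted sum is the Bayes optimal classification.

# Posteriors P(hi|D) and per-hypothesis class probabilities P(vj|hi), taken from the example
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
class_probs = {
    "h1": {"+": 1.0, "-": 0.0},
    "h2": {"+": 0.0, "-": 1.0},
    "h3": {"+": 0.0, "-": 1.0},
}

# argmax over vj of  sum_hi P(vj|hi) * P(hi|D)
scores = {vj: sum(class_probs[h][vj] * posteriors[h] for h in posteriors) for vj in ("+", "-")}

print(scores)                       # {'+': 0.4, '-': 0.6}
print(max(scores, key=scores.get))  # '-' : the Bayes optimal classification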
The Bayes Classifier

▶ A good strategy is to predict the most probable class label Y given the observed features X1, …, Xn

▶ (for example: what is the probability that the image represents a 5, given its pixels?)

▶ So … how do we compute that?
The Bayes Classifier

▶ Use Bayes Rule!

P(Y | X1, …, Xn) = P(X1, …, Xn | Y) P(Y) / P(X1, …, Xn)

where P(X1, …, Xn | Y) is the likelihood, P(Y) is the prior, and P(X1, …, Xn) is the normalization constant.

▶ Why did this help? Well, we think that we might be able to specify how features are "generated" by the class label.
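A minimal Python sketch of this decomposition for the two-class digit example; the likelihood and prior values below are placeholders for illustration only.

# Y in {5, 6}; the likelihoods stand in for P(X1,...,Xn | Y) on the observed pixels
prior = {5: 0.5, 6: 0.5}           # P(Y)
likelihood = {5: 0.020, 6: 0.005}  # P(X1,...,Xn | Y): how likely class Y is to "generate" these pixels

# Numerator of Bayes rule per class, then divide by the normalization constant P(X1,...,Xn)
unnormalized = {y: likelihood[y] * prior[y] for y in prior}
evidence = sum(unnormalized.values())
posterior = {y: unnormalized[y] / evidence for y in unnormalized}

print(posterior)  # {5: 0.8, 6: 0.2} -> predict Y = 5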
ESTIMATING PROBABILITIES

Use the m-estimate of probability, defined as follows:

(nc + m·p) / (n + m)

Here, n and nc are defined as before, p is our prior estimate of the probability we wish to determine, and m is a constant called the equivalent sample size, which determines how heavily to weight p relative to the observed data.
One method for choosing p is to assume uniform priors: if an attribute has k possible values, we set p = 1/k.
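A minimal Python sketch of the m-estimate with illustrative counts, showing how it pulls a raw frequency of zero toward the uniform prior.

def m_estimate(n_c, n, p, m):
    # m-estimate of probability: (n_c + m * p) / (n + m)
    return (n_c + m * p) / (n + m)

# Illustrative: attribute with k = 3 possible values -> uniform prior p = 1/3,
# n_c = 0 matching examples out of n = 5, equivalent sample size m = 3
print(0 / 5)                      # raw frequency n_c / n = 0.0 (would zero out the Naive Bayes product)
print(m_estimate(0, 5, 1/3, 3))   # 0.125 -- smoothed toward the prior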
New test data set

▶ Assume today = (Sunny, Hot, Normal, False)

The probability of playing golf is given by:
P(Yes | today) = P(Sunny | Yes) P(Hot | Yes) P(Normal | Yes) P(False | Yes) P(Yes) / P(today)

and the probability of not playing golf is given by:
P(No | today) = P(Sunny | No) P(Hot | No) P(Normal | No) P(False | No) P(No) / P(today)

Since P(today) is common to both probabilities, we can ignore P(today) and compare the proportional probabilities:

▶ Likelihood of Yes
▶ Likelihood of No

▶ Since the likelihood of No is larger, the prediction is No
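A minimal Python sketch of this comparison; the prior and conditional probabilities below are placeholders, since the frequency table from the original slides did not carry over.

# Placeholder probabilities for today = (Sunny, Hot, Normal, False)
p_yes, cond_yes = 0.64, {"Sunny": 0.22, "Hot": 0.22, "Normal": 0.33, "False": 0.33}
p_no,  cond_no  = 0.36, {"Sunny": 0.60, "Hot": 0.40, "Normal": 0.20, "False": 0.40}

# Proportional probabilities: P(today) is common to both, so it is ignored
likelihood_yes = p_yes
for v in cond_yes.values():
    likelihood_yes *= v

likelihood_no = p_no
for v in cond_no.values():
    likelihood_no *= v

print(likelihood_yes, likelihood_no)                      # compare the two proportional scores
print("Yes" if likelihood_yes > likelihood_no else "No")  # "No" with these placeholder values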


The Naive Bayes Classifier for Data Sets with Numerical Attribute Values

▶ One common practice to handle numerical attribute values is to assume a normal (Gaussian) distribution for each numerical attribute.
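A minimal Python sketch of this practice (the per-class statistics are illustrative): estimate a mean and standard deviation per class, then use the normal density as P(x | class) inside the Naive Bayes product.

import math

def gaussian_pdf(x, mean, std):
    # Normal density, used as P(x | class) for a numerical attribute
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (math.sqrt(2 * math.pi) * std)

# Illustrative per-class (mean, std) for a 'temperature' attribute
stats = {"Yes": (73.0, 6.2), "No": (74.6, 7.9)}

x = 66.0
for label, (mean, std) in stats.items():
    print(label, gaussian_pdf(x, mean, std))  # plug these into the class-conditional product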
Naïve Bayes Assumption

▶ Recall the Naïve Bayes assumption:
▶ all features are independent given the class label Y
▶ Does this hold for the digit recognition problem?
Exclusive-OR Example

▶ An example where conditional independence fails:
▶ Y = XOR(X1, X2)

X1  X2  P(Y=0|X1,X2)  P(Y=1|X1,X2)
 0   0       1             0
 0   1       0             1
 1   0       0             1
 1   1       1             0

▶ Actually, the Naïve Bayes assumption is almost never true
▶ Still… Naïve Bayes often performs surprisingly well even when its assumptions do not hold
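A minimal Python sketch that checks the assumption directly on the XOR table: with the four inputs equally likely, P(X1, X2 | Y) does not equal P(X1 | Y) * P(X2 | Y).

from itertools import product

# All four (X1, X2) inputs equally likely; Y = XOR(X1, X2)
samples = [(x1, x2, x1 ^ x2) for x1, x2 in product([0, 1], repeat=2)]

def prob(cond):
    # Probability of a condition under the uniform distribution over the four samples
    return sum(1 for s in samples if cond(s)) / len(samples)

p_y1 = prob(lambda s: s[2] == 1)
p_joint = prob(lambda s: s[0] == 0 and s[1] == 1 and s[2] == 1) / p_y1  # P(X1=0, X2=1 | Y=1) = 0.5
p_x1 = prob(lambda s: s[0] == 0 and s[2] == 1) / p_y1                   # P(X1=0 | Y=1) = 0.5
p_x2 = prob(lambda s: s[1] == 1 and s[2] == 1) / p_y1                   # P(X2=1 | Y=1) = 0.5

print(p_joint, p_x1 * p_x2)  # 0.5 vs 0.25 -> conditional independence fails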
Bayesian Belief Networks
BBN – Conditional Independence
Bayesian Belief Networks
Inferences in Bayesian Belief Networks
Learning of Bayesian Networks
Learning Bayes Nets
Gradient Ascent for Bayes Nets
More on Learning Bayes Nets
Bayesian Belief Networks
Bayesian Belief Networks – Joint Probability
Bayesian Belief Networks – only a single condition given – Marginal Probability
Bayesian Belief Networks – Event not sure but given Evidences
BBN - Advantages
• Intuitive, graphical, and efficient
• Accounts for sources of uncertainty
• Allows for information updating
• Models multiple interdependencies
• Models distributed & interacting systems
• Identifies critical components & cut sets
• Includes utility and decision nodes
BBN - Disadvantages
• Not ideally suited for computing small probabilities
• Practical limitations on the type of distributions and the form of statistical dependence
• Computationally demanding for systems with a large number of random variables
• Exponential growth of computational effort with an increased number of states
BBN - Applications
BBN - Summary
Expectation Maximization [EM] Algorithm
Expectation Maximization [EM] Algorithm – Finite Mixture
Generating Mixture of k Gaussians
EM Estimating k-means
General EM Problem
Extending Mixture Model
General EM Method
Expectation Maximization [EM] Algorithm
Expectation Maximization [EM] Algorithm - Uses
Expectation Maximization [EM] Algorithm - Advantages
Expectation Maximization [EM] Algorithm - Disadvantages
Expectation Maximization [EM] Algorithm - Example
Ensemble Learning
Types of Ensemble Methods

1. Voting (Averaging) (see the sketch after this list)
2. Bootstrap aggregation (Bagging)
3. Random Forest
4. Boosting
5. Stacked Generalization (Blending)
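A minimal sketch of the first of these methods, hard voting, assuming scikit-learn is available; this is an illustration, not code from the slides.

from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Hard voting: each base classifier casts one vote, the majority class wins
vote = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("dt", DecisionTreeClassifier(random_state=0)),
                ("nb", GaussianNB())],
    voting="hard",
)
vote.fit(X, y)
print(vote.predict(X[:5]))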
Voting (Averaging)
Bootstrap Aggregation (Bagging)
Random Forest
Boosting
Stacked Generalization (Blending)
Note: Also explore Pasting and Random Patches (see the sketch below).
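A minimal scikit-learn sketch (an illustration, assuming scikit-learn is available) showing how one bagging estimator covers Bagging, Pasting, and Random Patches, depending on how training rows and features are sampled.

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Bagging: sample training rows WITH replacement
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                            max_samples=0.8, bootstrap=True, random_state=0)

# Pasting: sample training rows WITHOUT replacement
pasting = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                            max_samples=0.8, bootstrap=False, random_state=0)

# Random Patches: sample both rows and features for each base learner
patches = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                            max_samples=0.8, max_features=0.5,
                            bootstrap=True, bootstrap_features=True, random_state=0)

for name, model in [("bagging", bagging), ("pasting", pasting), ("random patches", patches)]:
    model.fit(X, y)
    print(name, model.score(X, y))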
