2MLIntrodpart 2
39
Dilemma
This person dropped their
ticket in the hallway.
Do you call out
“Excuse me, ma’am!”
or
“Excuse me, sir!”
You have to make a
guess.
40
Bayesian Inference
• Bayesian inference is a way to capture
common sense.
• It helps you use what you know to make
better guesses.
41
Conditional probabilities
P(A | B) is the probability of A, given B.
“If I know B is the case, what is the probability that A is also the case?”
P(A | B) is not the same as P(B | A).
42
Joint probabilities
● P(A and B), also written P(A,B), P(A with B), or P(A ∩ B)
● P(A,B) = P(A) * P(B|A)
● P(A,B,C) = P(A) * P(B|A) * P(C|A and B)
● P(B,A) = P(B) * P(A|B)
e.g. What is the probability that a person is both a woman and has short hair?
P(woman with short hair)
= P(woman) * P(short hair | woman)
= 0.5 * 0.5 = 0.25
● P(¬A | B) = 1 − P(A | B)
43
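The chain-rule arithmetic above can be checked with a minimal Python sketch (not part of the original slides; the numbers are the illustrative ones from this slide):

```python
# Minimal sketch of the chain rule for joint probabilities,
# using the illustrative numbers from the slide.
p_woman = 0.5                 # P(woman)
p_short_given_woman = 0.5     # P(short hair | woman)

# P(woman, short hair) = P(woman) * P(short hair | woman)
p_joint = p_woman * p_short_given_woman
print(p_joint)                  # 0.25

# Complement rule for conditional probabilities:
# P(not short hair | woman) = 1 - P(short hair | woman)
print(1 - p_short_given_woman)  # 0.5
```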
Marginal Probability
● It is either sunny or it’s rainy. Probability of a sunny day is 0.9. A sunny day
follows a sunny day with a probability of 0.8. A sunny day follows a rainy
day with probability of 0.6. What is the probability that Day 2 is Sunny?
P(D1=Sunny) = 0.9        P(D1=Rainy) = 0.1
P(D2=Sunny | D1=Sunny) = 0.8    P(D2=Rainy | D1=Sunny) = 0.2
P(D2=Sunny | D1=Rainy) = 0.6    P(D2=Rainy | D1=Rainy) = 0.4
P(D2=Sunny) = ?
Answer:
44
Marginal Probability
● It is either sunny or it’s rainy. Probability of a sunny day is 0.9. A sunny day
follows a sunny day with a probability of 0.8. A sunny day follows a rainy
day with probability of 0.6. What is the probability that Day 2 is Sunny?
P(D1=Sunny) = 0.9        P(D1=Rainy) = 0.1
P(D2=Sunny | D1=Sunny) = 0.8    P(D2=Rainy | D1=Sunny) = 0.2
P(D2=Sunny | D1=Rainy) = 0.6    P(D2=Rainy | D1=Rainy) = 0.4
P(D2=Sunny) = ?
Answer:
● P(D2=Sunny)=
P(D2=Sunny and D1=Sunny) +
P(D2=Sunny and D1=Rainy)
= P(D2=Sunny | D1=Sunny) * P(D1=Sunny) +
P(D2=Sunny | D1=Rainy)* P(D1=Rainy)
=0.8 * 0.9+ 0.6 * 0.1
=0.78
45
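A minimal Python sketch (not from the slides) of the same total-probability calculation:

```python
# Sketch: marginal probability of Day 2 being sunny via total probability.
p_d1 = {"Sunny": 0.9, "Rainy": 0.1}      # P(D1)
p_d2_sunny_given_d1 = {                  # P(D2=Sunny | D1)
    "Sunny": 0.8,
    "Rainy": 0.6,
}

# P(D2=Sunny) = sum over D1 of P(D2=Sunny | D1) * P(D1)
p_d2_sunny = sum(p_d2_sunny_given_d1[d1] * p_d1[d1] for d1 in p_d1)
print(p_d2_sunny)   # 0.78
```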
Marginal Probability
● It is either sunny or it’s rainy. Probability of a sunny day is 0.9. A sunny day
follows a sunny day with a probability of 0.8. A sunny day follows a rainy
day with probability of 0.6. What is the probability that Day 2 is Rainy?
P(D1=Sunny) = 0.9        P(D1=Rainy) = 0.1
P(D2=Sunny | D1=Sunny) = 0.8    P(D2=Rainy | D1=Sunny) = 0.2
P(D2=Sunny | D1=Rainy) = 0.6    P(D2=Rainy | D1=Rainy) = 0.4
P(D2=Rainy) = ?
Answer:
46
Marginal Probability
● It is either sunny or it’s rainy. Probability of a sunny day is 0.9. A sunny day
follows a sunny day with a probability of 0.8. A sunny day follows a rainy
day with probability of 0.6. What is the probability that Day 2 is Rainy?
P(D1=Sunny) = 0.9        P(D1=Rainy) = 0.1
P(D2=Sunny | D1=Sunny) = 0.8    P(D2=Rainy | D1=Sunny) = 0.2
P(D2=Sunny | D1=Rainy) = 0.6    P(D2=Rainy | D1=Rainy) = 0.4
P(D2=Rainy) = ?
Answer:
P(D2=Rainy) = P(D2=Rainy | D1=Sunny) * P(D1=Sunny) + P(D2=Rainy | D1=Rainy) * P(D1=Rainy)
= 0.2 * 0.9 + 0.4 * 0.1
= 0.22
47
Marginal Probability
● It is either sunny or it’s rainy. Probability of a sunny day is 0.9. A sunny day
follows a sunny day with a probability of 0.8. A sunny day follows a rainy
day with probability of 0.6. What is the probability that Day 3 is Sunny?
P(D1=Sunny) = 0.9        P(D1=Rainy) = 0.1
P(D2=Sunny | D1=Sunny) = 0.8    P(D2=Rainy | D1=Sunny) = 0.2
P(D2=Sunny | D1=Rainy) = 0.6    P(D2=Rainy | D1=Rainy) = 0.4
P(D3=Sunny) = ?
Answer: reuse the previous solution with D3 in place of D2 and D2 in place of D1 (the same transition probabilities apply between Day 2 and Day 3):
P(D3=Sunny) = 0.8 * P(D2=Sunny) + 0.6 * P(D2=Rainy) = 0.8 * 0.78 + 0.6 * 0.22 = 0.756
48
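A small Python sketch (assuming, as the hint implies, that the same transition probabilities apply between every pair of consecutive days) that iterates the calculation out to Day 3:

```python
# Sketch: iterate the sunny/rainy update to get Day 3 (and any later day).
P_SUNNY_AFTER_SUNNY = 0.8   # P(next=Sunny | today=Sunny)
P_SUNNY_AFTER_RAINY = 0.6   # P(next=Sunny | today=Rainy)

def next_day_sunny(p_sunny_today):
    """Total probability: P(next=Sunny) from P(today=Sunny)."""
    p_rainy_today = 1.0 - p_sunny_today
    return (P_SUNNY_AFTER_SUNNY * p_sunny_today
            + P_SUNNY_AFTER_RAINY * p_rainy_today)

p = 0.9                      # P(D1=Sunny)
for day in range(2, 4):      # compute Day 2 and Day 3
    p = next_day_sunny(p)
    print(f"P(D{day}=Sunny) = {p:.3f}")
# P(D2=Sunny) = 0.780
# P(D3=Sunny) = 0.756
```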
Theorem of Total Probability
• If events A1, …, An are mutually exclusive with Σ(i=1..n) P(Ai) = 1, then
P(B) = Σ(i=1..n) P(B | Ai) * P(Ai)
49
50
Bayes’ Theorem
51
52
Bayes Rule example
• There is a specific type of cancer that occurs in 1% of the population. The probability of a
test coming out positive given that one has cancer is 0.9, and the probability of this test
coming out negative given that one doesn't have cancer is 0.2. What is the probability
that a person has this cancer given that they just received a positive test?
• Answer:
Given: P(C) = 0.01, P(+ve | C) = 0.9, P(-ve | ¬C) = 0.2
Derived: P(¬C) = 0.99, P(-ve | C) = 0.1, P(+ve | ¬C) = 0.8
Find: P(C | +ve) = ?
53
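A minimal Python sketch (not from the slides) that finishes the calculation with Bayes' rule; the derived quantities follow from the given ones by the complement rule:

```python
# Sketch: Bayes' rule for the cancer-test example on this slide.
p_c = 0.01                      # P(C): prior probability of cancer
p_pos_given_c = 0.9             # P(+ve | C)
p_neg_given_not_c = 0.2         # P(-ve | not C)

p_not_c = 1 - p_c                           # 0.99
p_pos_given_not_c = 1 - p_neg_given_not_c   # P(+ve | not C) = 0.8

# P(+ve) by total probability, then Bayes' rule
p_pos = p_pos_given_c * p_c + p_pos_given_not_c * p_not_c
p_c_given_pos = p_pos_given_c * p_c / p_pos
print(round(p_c_given_pos, 4))  # ~0.0112
```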
Example 2
Does patient have cancer or not?
A patient takes a lab test and the result comes back positive.
The test returns a correct positive result in only 98% of the cases
in which the disease is actually present, and a correct negative
result in only 97% of the cases in which the disease is not
present. Furthermore, .008 of the entire population have this
cancer.
P(cancer) = 0.008,        P(¬cancer) = 0.992
P(+ | cancer) = 0.98,     P(− | cancer) = 0.02
P(+ | ¬cancer) = 0.03,    P(− | ¬cancer) = 0.97

P(cancer | +) = P(+ | cancer) * P(cancer) / P(+)
P(¬cancer | +) = P(+ | ¬cancer) * P(¬cancer) / P(+)
54
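A small Python sketch (not from the slides) that evaluates the two Bayes-rule expressions above and normalises them by P(+):

```python
# Sketch: the second cancer example, computed with Bayes' rule.
p_cancer = 0.008
p_not_cancer = 0.992
p_pos_given_cancer = 0.98
p_pos_given_not_cancer = 0.03

# Unnormalised posteriors (numerators of Bayes' rule)
num_cancer = p_pos_given_cancer * p_cancer              # 0.00784
num_not_cancer = p_pos_given_not_cancer * p_not_cancer  # 0.02976

# P(+) is the sum of the numerators, so normalising gives the posteriors
p_pos = num_cancer + num_not_cancer
print(round(num_cancer / p_pos, 3))      # P(cancer | +)     ~0.21
print(round(num_not_cancer / p_pos, 3))  # P(not cancer | +) ~0.79
```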
Bayesian Classification: Why?
• Probabilistic learning: Calculate explicit probabilities for
hypothesis, among the most practical approaches to
certain types of learning problems
• Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is
correct. Prior knowledge can be combined with
observed data.
• Probabilistic prediction: Predict multiple hypotheses,
weighted by their probabilities
• Standard: Even when Bayesian methods are
computationally intractable, they can provide a standard
of optimal decision making against which other methods
can be measured.
55
56
Approaches in Dealing with
Uncertainty
Quantitative (numerically oriented) approaches:
• Bayes' rule
• Certainty factors
• Dempster–Shafer theory
• Fuzzy sets
Symbolic approaches:
• Non-monotonic reasoning
• Cohen's Theory of Endorsements
• Fox's semantic systems
57
58
Naïve Bayesian Classifier
● Let D = the training set of tuples, where each tuple is represented by
an n-dimensional vector X = (x1, x2, x3, …, xn)
59
Bayesian Theorem
• Given training data D, the posterior probability of a
hypothesis h, P(h|D), follows from Bayes' theorem:
P(h | D) = P(D | h) * P(h) / P(D)
• MAP (maximum a posteriori) hypothesis:
h_MAP = argmax over h in H of P(h | D) = argmax over h in H of P(D | h) * P(h)
• Practical difficulty: requires initial knowledge of
many probabilities, and significant computational
cost
60
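As an illustration only (the hypothesis space and numbers below are made up, mirroring the earlier cancer-test example), the MAP hypothesis can be picked out like this:

```python
# Sketch: the MAP hypothesis is the one maximising P(D | h) * P(h).
# The two hypotheses and their numbers are illustrative, not from the slides
# (they mirror the cancer-test example: D = a positive test result).
prior = {"cancer": 0.008, "no cancer": 0.992}      # P(h)
likelihood = {"cancer": 0.98, "no cancer": 0.03}   # P(D | h)

h_map = max(prior, key=lambda h: likelihood[h] * prior[h])
print(h_map)   # "no cancer" -- its product 0.0298 beats 0.0078
```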
Estimating a-posteriori
probabilities
• Bayes theorem:
P(C|X) = P(X|C)·P(C) / P(X)
• P(X) is constant for all classes
• P(C) = relative freq of class C samples
• C such that P(C|X) is maximum =
C such that P(X|C)·P(C) is maximum
61
P(h | D) = P(D | h) * P(h) / P(D)
Output the hypothesis h_MAP with the highest posterior probability:
h_MAP = argmax over h in H of P(h | D)
Comments:
Computationally intensive
Provides a standard for judging the performance
of learning algorithms
Choosing P(h) and P(D|h) reflects our prior
knowledge about the learning task
62
Bayesian classification
• The classification problem may be formalized
using a-posteriori probabilities:
• P(C|X) = prob. that the sample tuple
X=<x1,…,xk> is of class C.
63
Bayesian classification
• The classification problem may be
formalized using a-posteriori probabilities:
• P(C|X) = prob. that the sample tuple
X=<x1,…,xk> is of class C.
• E.g. P(class=N |
outlook=sunny,windy=true,…)
64
Naïve Bayesian Classification
• Naïve assumption: attribute independence
P(x1,…,xk|C) = P(x1|C)·…·P(xk|C)
• If i-th attribute is categorical:
P(xi|C) is estimated as the relative freq of
samples having value xi as i-th attribute in class
C
• If i-th attribute is continuous:
P(xi|C) is estimated through a Gaussian density
function
• Computationally easy in both cases
65
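A rough Python sketch of the two per-attribute estimates described above; the helper names and toy numbers are illustrative, not from the slides:

```python
import math

# Sketch: the two per-attribute likelihood estimates used by naive Bayes.

def categorical_likelihood(values_in_class, x):
    """P(x_i | C) as the relative frequency of x among class-C samples."""
    return values_in_class.count(x) / len(values_in_class)

def gaussian_likelihood(values_in_class, x):
    """P(x_i | C) from a Gaussian fitted to the class-C samples."""
    mu = sum(values_in_class) / len(values_in_class)
    var = sum((v - mu) ** 2 for v in values_in_class) / len(values_in_class)
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Toy usage (made-up numbers):
print(categorical_likelihood(["sunny", "rain", "rain"], "rain"))  # 2/3
print(gaussian_likelihood([20.0, 22.0, 24.0], 21.0))              # density value
```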
[Figure: likelihood tables]
66
[Table: the 14-example weather training data with columns Day, Outlook, Temperature, Humidity, Wind, Play ball]
67
68
weather example: classifying X
• An unseen sample X = <rain, hot, high, weak>
• P(X|yes)·P(yes) =
P(rain|yes)·P(hot|yes)·P(high|yes)·P(weak|yes)·P(yes)
= 3/9 · 2/9 · 3/9 · 6/9 · 9/14 = 0.010582
• P(X|no)·P(no) =
P(rain|no)·P(hot|no)·P(high|no)·P(weak|no)·P(no)
= 2/5 · 2/5 · 4/5 · 2/5 · 5/14 = 0.018286
• Since 0.018286 > 0.010582, X is classified as Play ball = no
69
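A minimal Python sketch (not from the slides) that reproduces the two scores using the likelihoods and priors quoted above:

```python
# Sketch: naive Bayes scores for X = <rain, hot, high, weak>,
# using the likelihoods and priors quoted on the slide.
likelihoods = {
    "yes": {"rain": 3/9, "hot": 2/9, "high": 3/9, "weak": 6/9},
    "no":  {"rain": 2/5, "hot": 2/5, "high": 4/5, "weak": 2/5},
}
priors = {"yes": 9/14, "no": 5/14}

x = ["rain", "hot", "high", "weak"]
scores = {}
for c in priors:
    score = priors[c]
    for value in x:
        score *= likelihoods[c][value]
    scores[c] = score

print(scores)                       # {'yes': ~0.0106, 'no': ~0.0183}
print(max(scores, key=scores.get))  # 'no'
```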
70
Probability distributions
Height of adults in cm
[Figure: histogram of the height distribution, with a probability between 0.01 and 0.28 for each bin: <150, 150–160, 160–170, 170–180, 180–190, 190–200, 200–210, >210]
71
Source: https://www.nature.com/articles/s41598-018-33413-y
72
Laplacian correction/estimator
• What if we encounter probability values of zero?
• It will result in probability product coming out to be zero.
• There is a simple trick to avoid this problem. We can assume
that our training database, D, is so large that adding one to
each count that we need would only make a negligible
difference in the estimated probability value, yet would
conveniently avoid the case of probability values of zero.
• This technique for probability estimation is known as the
Laplacian correction or Laplace estimator, named after
Pierre Laplace, a French mathematician who lived from 1749
to 1827.
• If we have, say, q counts to each of which we add one, then
we must remember to add q to the corresponding
denominator used in the probability calculation.
73
Laplacian correction/estimator
• Suppose for some training database D with 1000 tuples, we
have this situation:
– 0 tuples with income = low
– 990 tuples with income = medium
– 10 tuples with income = high
• Then the probabilities of these events are:
0/1000 = 0          990/1000 = 0.990          10/1000 = 0.010
• Using the Laplacian correction, we pretend that we have one
more tuple for each income-value pair:
1/1003 ≈ 0.001      991/1003 ≈ 0.988          11/1003 ≈ 0.011
• Hence the corrected probability estimates are close to their
uncorrected counterparts, yet the zero probability value is
avoided.
74
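A small Python sketch (not from the slides) reproducing this example with and without the Laplacian correction:

```python
# Sketch: Laplacian correction for the income counts on this slide.
counts = {"low": 0, "medium": 990, "high": 10}   # q = 3 income values
total = sum(counts.values())                     # 1000
q = len(counts)

plain = {v: c / total for v, c in counts.items()}
smoothed = {v: (c + 1) / (total + q) for v, c in counts.items()}

print(plain)     # {'low': 0.0, 'medium': 0.99, 'high': 0.01}
print(smoothed)  # {'low': ~0.001, 'medium': ~0.988, 'high': ~0.011}
```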
Advantages of Bayesian Learning
• Unlike many other models of supervised learning, the
naive Bayes classifier can handle missing data where
not all features are observed; the agent conditions on the
features that are observed.
• Naive Bayes is optimal when the conditional independence
assumption holds.
• A naive Bayes model gives a direct way to assess the
weights and allows for missing data.
• It does, however, assume that the Xi are
independent given Y, which may not hold in practice.
• A linear regression model trained, for example, with
gradient descent can take dependencies into account,
but does not handle missing data.
75
76
Applications of Bayes Classifier
77
78
Probability of a bigram w1 and w2
79
● Bigram example
● Trigram example
80
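The worked bigram/trigram figures from these slides are not reproduced here; as a rough sketch, bigram probabilities are commonly estimated from corpus counts as P(w2 | w1) = count(w1 w2) / count(w1). The tiny corpus below is made up for illustration:

```python
from collections import Counter

# Rough sketch: estimating bigram probabilities P(w2 | w1) from counts,
# P(w2 | w1) = count(w1 w2) / count(w1).  The tiny corpus is made up.
corpus = "the cat sat on the mat and the cat slept".split()

unigram = Counter(corpus)
bigram = Counter(zip(corpus, corpus[1:]))

def p_bigram(w1, w2):
    return bigram[(w1, w2)] / unigram[w1]

print(p_bigram("the", "cat"))   # 2/3: "the" appears 3 times, "the cat" twice
```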
Predicting words based on Shakespeare
● What is the probability of seeing "the", given that we've just seen "of"?
81
82
Few more applications
● It is widely used in spam filtering (identifying spam e-mail) and other text-classification tasks
83
84
The independence hypothesis…
• makes computation possible
• yields optimal classifiers when satisfied
• but is seldom satisfied in practice, as attributes
(variables) are often correlated.
• Attempts to overcome this limitation:
– Bayesian networks, that combine Bayesian reasoning
with causal relationships between attributes
– Decision trees, that reason on one attribute at a
time, considering the most important attributes first
85
86
Bayesian Belief Network
• A directed acyclic probabilistic graphical model that captures
dependence among the attributes
• Bayesian Net
– Nodes: Variable/Attributes/Class
– Directed edges: Causality/dependency
– Absence of edge: independence
– Network structure: domain knowledge
– Joint probabilities: from data
87
Example
• Find P(R|S) = ?      (ans = P(R) = 0.01)
• Find P(R|H,S) = ?    (0.0142)
• Find P(R|H) = ?      (0.97)
• Sunny and Raise are evidence variables; Happy is the query variable.
• Hint: apply Bayes' rule
88
Example of Bayesian Belief Network
[Figure: example Bayesian belief network; the per-node conditional probability tables contain 1, 1, 4, 2, and 2 entries]
89
Bayesian Reasoning
• Given the evidence of who has or has not called, we would
like to estimate the probability of a burglary.
• In probabilistic inference, the output is not a single number
for each query variable; rather, it is a joint probability
distribution over the query variables.
• We call this the posterior distribution of one or more query
variables, given the evidence.
• It is the probability distribution of one or more query
variables given the values of evidence variables:
90
Bayesian Reasoning
• We call this the posterior distribution of one or more query
variables Qi, given the evidence variables Ei:
P(Q1, Q2, … | E1=e1, E2=e2, …, En=en)
• Another question that can be answered is: out of all possible
values for all the query variables, which combination of
values has the highest probability?
argmax over q1, q2, … of P(Q1=q1, Q2=q2, … | E1=e1, E2=e2, …)
91
Bayesian Reasoning
• One great thing about Bayes nets is that we are not
restricted to reasoning in only one direction: we can reason in
the causal direction or against the causal flow (diagnostically).
92
Bayesian Inferencing
• Find probability that alarm has sounded, but neither a
burglary nor an earthquake has occurred, and both John
and Mary call
• Find P(J,M,A,~B,~E)=?
93
CPT
• Find probability that alarm has sounded, but neither a
burglary nor an earthquake has occurred, and both John
and Mary call
• Find P(J,M,A,~B,~E)=?
Answer:
= P(J|A) · P(M|A) · P(A|~B,~E) · P(~B) · P(~E)
= 0.90 * 0.70 * 0.001 * 0.999 * 0.998
≈ 0.00063
94
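A minimal Python sketch (not from the slides) that multiplies out the CPT entries quoted above:

```python
# Sketch: P(J, M, A, ~B, ~E) from the burglary-network CPT entries on the slide.
p_not_b = 0.999                 # P(~B)
p_not_e = 0.998                 # P(~E)
p_a_given_not_b_not_e = 0.001   # P(A | ~B, ~E)
p_j_given_a = 0.90              # P(J | A)
p_m_given_a = 0.70              # P(M | A)

p = p_j_given_a * p_m_given_a * p_a_given_not_b_not_e * p_not_b * p_not_e
print(round(p, 5))              # 0.00063 (exactly 0.000628...)
```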
Advantage of Bayesian networks
● P(B,E,A,J,M) = P(B) * P(E) * P(A|B,E) * P(J|A) * P(M|A)
● B and E have no incoming arrows, so they each have an
unconditional probability distribution, P(B) and P(E).
● A has two incoming arrows, so its probability is
conditioned on B and E, giving us P(A|B,E).
● J and M are both conditioned on A, giving us P(J|A) and
P(M|A).
● So the definition of the joint distribution for this setup,
P(B,E,A,J,M), is based on the factors above, and gives us
one really BIG advantage.
● A full joint distribution over any five binary
random variables requires 2^5 − 1 = 31 probability values,
while our Bayes network only requires 10 probability
values.
95
96
Representing the full joint distribution
source: Russell & Norvig book, page 513
97
98
Advantage of Bayesian networks
99
100
Example
101
102
Training Bayesian Belief Networks
• Once the network topology and the observable variables are
known, a gradient descent strategy can be used to learn the
CPT entries.
103
Clustering
[Figure: data points plotted along the axes income, education, and age]
104
Clustering
[Figure: the same income / education / age data, now grouped into clusters]
105
Clustering
• Clustering is also called "grouping"
• Intuitively, you want to assign the same label to data
points that are close to each other
• Thus, clustering algorithms rely on a distance metric
between data points
106
What is Cluster Analysis?
• Cluster: a collection of data objects
– Similar to one another within the same cluster
– Dissimilar to the objects in other clusters
• Cluster analysis
– Grouping a set of data objects into clusters
• Clustering is unsupervised classification: no predefined
classes
• Typical applications
– As a stand-alone tool to get insight into data distribution
– As a preprocessing step for other algorithms
107
Issues
• Is the desired number of clusters given?
• Finding the "best" clusters
• Are the clusters semantically meaningful?
108
Cluster Analysis
• Finding groups of objects such that
– the objects in a group will be similar (or related) to
one another and
– different from (or unrelated to) the objects in other
groups
109
Example
• Suppose we have 4 types of medicines and each has two attributes
(pH and weight index). Our goal is to group these objects into K=2
groups of medicine.
[Table/plot: the four medicines A–D plotted by their two attribute values; C = (4, 3), D = (5, 4)]
110
Another Example: Text
• Each document is a vector
– e.g., <100110...> contains words 1,4,5,...
• Clusters contain “similar” documents
• Useful for understanding, searching
documents
[Figure: document clusters labelled sports, international, news, and business]
111
• Information retrieval
• text mining
• Web analysis
• marketing
• medical diagnostic
112
K-means Algorithm
• Given the cluster number K, the K-means algorithm
is carried out in three steps after initialisation:
Initialisation: set seed points (randomly selected as
means of clusters)
1)Assign each object to the cluster of the nearest seed
point measured with a specific distance metric
2)Compute new seed points as the centroids of the
clusters of the current partition (the centroid is the
centre, i.e., mean point, of the cluster)
3) Go back to Step 1); stop when there are no new
assignments (i.e., the membership of each cluster no longer
changes) — see the sketch below
113
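A minimal Python sketch of this loop (a simple illustration assuming Euclidean distance, not a reference implementation); the usage example applies it to the class-problem data on the next slide:

```python
# Minimal K-means sketch (Euclidean distance), following the three steps above.
def kmeans(points, seeds):
    centers = [list(s) for s in seeds]
    while True:
        # 1) assign each point to the cluster of the nearest center
        clusters = [[] for _ in centers]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[d.index(min(d))].append(p)
        # 2) recompute centers as the centroids of the current clusters
        new_centers = [
            [sum(col) / len(cl) for col in zip(*cl)] if cl else c
            for cl, c in zip(clusters, centers)
        ]
        # 3) stop when assignments (hence centers) no longer change
        if new_centers == centers:
            return centers, clusters
        centers = new_centers

# Usage on the class-problem data, seeded with A and C:
data = [(1, 1), (1, 0), (0, 2), (2, 4), (3, 5)]   # A, B, C, D, E
centers, clusters = kmeans(data, [(1, 1), (0, 2)])
print(centers)   # roughly [[0.67, 1.0], [2.5, 4.5]]
```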
Class Problem
Training example    x1    x2
A 1 1
B 1 0
C 0 2
D 2 4
E 3 5
• Let k=2, meaning we are interested in two clusters
• Let A and C be randomly selected as the initial means of the 2
clusters.
114
Class Problem
Mean/Center    Distance from center 1    Distance from center 2
Class Problem
Mean/Center    Distance from center 1    Distance from center 2
116
Class Problem
Mean/Center    Distance from center 1 {1, 0.5}    Distance from center 2 {1.7, 3.7}
117
Class Problem
Mean/Center    Distance from center 1 {0.7, 1}    Distance from center 2 {2.5, 4.5}
A
B
C
D
E
118
Exercise 1
For the medicine data set, use K-means with the Manhattan distance
metric for clustering analysis by setting K=2 and initialising seeds as
C1 = A and C2 = C. Answer three questions as follows:
1. How many steps are required for convergence?
2. What are memberships of two clusters after convergence?
3. What are centroids of two clusters after convergence?
[Table/plot: the medicine data set; B = (2, 1), C = (4, 3), D = (5, 4)]
119
Exercise 2: K means
120
Drawback of K-means
• K-means fails for data that are not linearly separable.
• The K-means algorithm is sensitive to outliers,
– since an object with an extremely large value may substantially
distort the distribution of the data.
• There are other limitations as well – in particular, a need to reduce the
cost of repeatedly calculating distances to centroids.
• K-Medoids: instead of taking the mean value of the objects
in a cluster as a reference point, a medoid can be used,
– which is the most centrally located object in a cluster.
• Another solution is the more advanced kernel K-means clustering
algorithm, which is based on the idea of projecting the data onto a high-
dimensional kernel space and then performing K-means
clustering there.
121
Summary
• The K-means algorithm is a simple yet popular
method for clustering analysis
• Its performance is determined by the initialisation and
by the choice of an appropriate distance measure
• There are several variants of K-means to
overcome its weaknesses
– K-Medoids: resistance to noise and/or outliers
– K-Modes: extension to categorical data clustering
analysis
– CLARA: extension to deal with large data sets
– Mixture models (EM algorithm): handling uncertainty of
clusters
122