ML Introduction, Part 2


Bayesian Learning

• Provides practical learning algorithms
  – Naïve Bayes learning
  – Bayesian belief network learning
  – Combine prior knowledge (prior probabilities)
• Provides foundations for machine learning
  – Evaluating learning algorithms
  – Guiding the design of new algorithms
  – Learning from models: meta-learning

39

Dilemma
This person dropped their ticket in the hallway. Do you call out “Excuse me, ma’am!” or “Excuse me, sir!”? You have to make a guess.

Bayesian inference is a way to make guesses about what your data mean, based on sometimes very little data.

40

Bayesian Inference
• Bayesian inference is a way to capture
common sense.
• It helps you use what you know to make
better guesses.

41

Conditional probabilities
P(A | B) is the probability of A, given B.
“If I know B is the case, what is the probability that A is also the case?”
P(A | B) is not the same as P(B | A).

P(cute | puppy) is not the same as P(puppy | cute)


If I know the thing I’m holding is a puppy, what is the probability that it is cute?
If I know the thing I’m holding is cute, what is the probability that it is a puppy?

42

Joint probabilities
● P(A and B) or P(A,B) or P(A with B) or P(A ∩ B)
● P(A,B)=P(A)*P(B|A)
● P(A,B,C)=P(A)*P(B|A)*P(C|A and B)
● P(B,A)=P(B)* P(A|B)

e.g. What is the probability that a person is both a woman and has short
hair?
P(woman with short hair)
= P(woman) * P(short hair | woman)
= .5 * .5 = .25
● P(¬A | B) = 1 − P(A | B)

43
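As a quick check of these identities, here is a minimal Python sketch (numbers taken from the short-hair example above) that applies the chain rule and the complement rule:

```python
# Chain rule: P(A, B) = P(A) * P(B | A)
p_woman = 0.5                    # P(woman)
p_short_given_woman = 0.5        # P(short hair | woman)

p_woman_with_short_hair = p_woman * p_short_given_woman
print(p_woman_with_short_hair)   # 0.25

# Complement rule: P(not A | B) = 1 - P(A | B)
p_not_short_given_woman = 1 - p_short_given_woman
print(p_not_short_given_woman)   # 0.5
```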

Marginal Probability
● It is either sunny or it’s rainy. Probability of a sunny day is 0.9. A sunny day
follows a sunny day with a probability of 0.8. A sunny day follows a rainy
day with probability of 0.6. What is the probability that Day 2 is Sunny?
P(D1=Sunny)=0.9, P(D1=Rainy)=0.1
P(D2=Sunny | D1=Sunny)=0.8, P(D2=Rainy | D1=Sunny)=0.2
P(D2=Sunny | D1=Rainy)=0.6, P(D2=Rainy | D1=Rainy)=0.4
P(D2=Sunny)=?
Answer:

44

Marginal Probability
● It is either sunny or it’s rainy. Probability of a sunny day is 0.9. A sunny day
follows a sunny day with a probability of 0.8. A sunny day follows a rainy
day with probability of 0.6. What is the probability that Day 2 is Sunny?
P(D1=Sunny)=0.9, P(D1=Rainy)=0.1
P(D2=Sunny | D1=Sunny)=0.8, P(D2=Rainy | D1=Sunny)=0.2
P(D2=Sunny | D1=Rainy)=0.6, P(D2=Rainy | D1=Rainy)=0.4
P(D2=Sunny)=?
Answer:
● P(D2=Sunny)=
P(D2=Sunny and D1=Sunny) +
P(D2=Sunny and D1=Rainy)
= P(D2=Sunny | D1=Sunny) * P(D1=Sunny) +
P(D2=Sunny | D1=Rainy)* P(D1=Rainy)
=0.8 * 0.9+ 0.6 * 0.1
=0.78

45
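A minimal Python sketch of the same total-probability computation (values from the slide):

```python
# Day 1 prior
p_d1_sunny = 0.9
p_d1_rainy = 1 - p_d1_sunny              # 0.1

# Transition probabilities
p_sunny_given_sunny = 0.8                # P(D2=Sunny | D1=Sunny)
p_sunny_given_rainy = 0.6                # P(D2=Sunny | D1=Rainy)

# Marginalise over Day 1: P(D2=Sunny) = sum over d1 of P(D2=Sunny | D1=d1) * P(D1=d1)
p_d2_sunny = (p_sunny_given_sunny * p_d1_sunny
              + p_sunny_given_rainy * p_d1_rainy)
print(p_d2_sunny)        # 0.78
print(1 - p_d2_sunny)    # 0.22, i.e. P(D2=Rainy), worked out on the following slides
```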

Marginal Probability
● It is either sunny or it’s rainy. Probability of a sunny day is 0.9. A sunny day
follows a sunny day with a probability of 0.8. A sunny day follows a rainy
day with probability of 0.6. What is the probability that Day 2 is Rainy?
P(D1=Sunny)=0.9, P(D1=Rainy)=0.1
P(D2=Sunny | D1=Sunny)=0.8, P(D2=Rainy | D1=Sunny)=0.2
P(D2=Sunny | D1=Rainy)=0.6, P(D2=Rainy | D1=Rainy)=0.4
P(D2=Rainy)=?
Answer:

46

Marginal Probability
● It is either sunny or it’s rainy. Probability of a sunny day is 0.9. A sunny day
follows a sunny day with a probability of 0.8. A sunny day follows a rainy
day with probability of 0.6. What is the probability that Day 2 is Rainy?
P(D1=Sunny)=0.9, P(D1=Rainy)=0.1
P(D2=Sunny | D1=Sunny)=0.8, P(D2=Rainy | D1=Sunny)=0.2
P(D2=Sunny | D1=Rainy)=0.6, P(D2=Rainy | D1=Rainy)=0.4
P(D2=Rainy)=?

Answer: P(D2=Rainy) = P(D2=Rainy | D1=Sunny)·P(D1=Sunny) + P(D2=Rainy | D1=Rainy)·P(D1=Rainy) = 0.2·0.9 + 0.4·0.1 = 0.22

47

Marginal Probability
● It is either sunny or it’s rainy. Probability of a sunny day is 0.9. A sunny day
follows a sunny day with a probability of 0.8. A sunny day follows a rainy
day with probability of 0.6. What is the probability that Day 3 is Sunny?

P(D1=Sunny)=0.9, P(D1=Rainy)=0.1
P(D2=Sunny | D1=Sunny)=0.8, P(D2=Rainy | D1=Sunny)=0.2
P(D2=Sunny | D1=Rainy)=0.6, P(D2=Rainy | D1=Rainy)=0.4
P(D3=Sunny)=?
Answer: apply the same total-probability argument one step forward, conditioning D3 on D2 in place of D2 on D1.

48
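Carrying that substitution out with the Day-2 values already computed (P(D2=Sunny) = 0.78, P(D2=Rainy) = 0.22):

P(D3=Sunny) = P(D3=Sunny | D2=Sunny)·P(D2=Sunny) + P(D3=Sunny | D2=Rainy)·P(D2=Rainy) = 0.8·0.78 + 0.6·0.22 = 0.756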

Theorem of Total Probability
• If events A1, …, An are mutually exclusive with Σ_{i=1}^{n} P(Ai) = 1, then

P(B) = Σ_{i=1}^{n} P(B | Ai) P(Ai)

49

Product Rule: the probability P(A,B) of a conjunction of two events A and B:

P(A, B) = P(A | B) P(B) = P(B | A) P(A)

● The two joint probabilities of the same two events are always equal:
● P(B,A) = P(A,B)
● P(B)·P(A|B) = P(A)·P(B|A)
⇒ P(A | B) = P(B|A) · P(A) / P(B)
⇒ P(A | B) = P(A,B) / P(B)

50

Bayes’ Theorem

P(A | B) = P(B | A) P(A) / P(B)

51

Properties of Bayes Rule


• Bayes' rule revises old or assumed probabilities, called prior probabilities, in the light of additional information made available by experiment or past records, to derive a set of new probabilities known as posterior probabilities.
• Combines prior knowledge and observed data: the prior probability of a hypothesis is multiplied by the probability of the observed data given the hypothesis.
• Probabilistic hypothesis: outputs not only a classification, but a probability distribution over all classes

52

Bayes Rule example
• There is a specific type of cancer which exists in 1% of the population. The probability of the test coming out positive given that one has cancer is 0.9, and the probability of the test coming out positive given that one does not have cancer is 0.2. What is the probability that a person has this cancer given that he just received a positive test?
• Answer:
P(C)=0.01, P(¬C)=0.99
P(+ve | C)=0.9, P(−ve | C)=0.1
P(+ve | ¬C)=0.2, P(−ve | ¬C)=0.8
P(C | +ve)=?

P(+ve) = P(+ve | C) P(C) + P(+ve | ¬C) P(¬C) = 0.9·0.01 + 0.2·0.99 = 0.207

P(C | +ve) = P(+ve | C) P(C) / P(+ve) = 0.9·0.01 / 0.207 ≈ 0.043, i.e. about 4.3%: even after a positive test, the probability of having the cancer is still small, because the disease is rare and the test produces many false positives.

53
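The same calculation as a minimal Python sketch (values as given above):

```python
p_cancer = 0.01             # P(C): prior probability of the cancer
p_pos_given_cancer = 0.9    # P(+ve | C)
p_pos_given_healthy = 0.2   # P(+ve | not C), the false-positive rate

# Total probability of a positive test
p_pos = (p_pos_given_cancer * p_cancer
         + p_pos_given_healthy * (1 - p_cancer))          # 0.207

# Bayes' rule: P(C | +ve) = P(+ve | C) P(C) / P(+ve)
p_cancer_given_pos = p_pos_given_cancer * p_cancer / p_pos
print(round(p_pos, 3), round(p_cancer_given_pos, 3))      # 0.207 0.043
```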

Example 2
Does patient have cancer or not?
A patient takes a lab test and the result comes back positive.
The test returns a correct positive result in only 98% of the cases
in which the disease is actually present, and a correct negative
result in only 97% of the cases in which the disease is not
present. Furthermore, .008 of the entire population have this
cancer.
P(cancer) = 0.008,  P(¬cancer) = 0.992
P(+ | cancer) = 0.98,  P(− | cancer) = 0.02
P(+ | ¬cancer) = 0.03,  P(− | ¬cancer) = 0.97

P(cancer | +) = P(+ | cancer) P(cancer) / P(+)
P(¬cancer | +) = P(+ | ¬cancer) P(¬cancer) / P(+)

54
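Completing the arithmetic that the slide leaves to the reader: P(+|cancer)·P(cancer) = 0.98·0.008 ≈ 0.0078 and P(+|¬cancer)·P(¬cancer) = 0.03·0.992 ≈ 0.0298, so P(cancer|+) = 0.0078 / (0.0078 + 0.0298) ≈ 0.21 and P(¬cancer|+) ≈ 0.79; even with a positive test, the more probable (MAP) hypothesis is ¬cancer.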

Bayesian Classification: Why?
• Probabilistic learning: Calculate explicit probabilities for
hypothesis, among the most practical approaches to
certain types of learning problems
• Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is
correct. Prior knowledge can be combined with
observed data.
• Probabilistic prediction: Predict multiple hypotheses,
weighted by their probabilities
• Standard: Even when Bayesian methods are
computationally intractable, they can provide a standard
of optimal decision making against which other methods
can be measured.

55

Reasoning Under Uncertainty


Many different types of errors can contribute to
uncertainty.
1. data might be missing or unavailable
2. data might be ambiguous or unreliable due to
measurement errors
3. the representation of data may be imprecise or
inconsistent
4. data may just be user's best guess (random)
5. data may be based on defaults, and defaults
may have exceptions

56

Approaches in Dealing with Uncertainty
Quantitative (numerically oriented) approaches:
• Bayes' rule
• Certainty factors
• Dempster-Shafer theory
• Fuzzy sets
Symbolic approaches:
• Non-monotonic reasoning
• Cohen's theory of endorsements
• Fox's semantic systems

57

Naïve Bayesian Classifier


● It is based on Bayes' theorem with an independence assumption between predictors.

● A naïve Bayesian model is easy to build, with no


complicated iterative parameter estimation which makes it
particularly useful for very large datasets.

● Despite its simplicity, it often does surprisingly well and is


widely used because it often outperforms more
sophisticated classification methods.

58

Naïve Bayesian Classifier
● Let D=training set of tuples, each tuple is represented by
n-dimensional vector X=(x1, x2, x3….xn)

● Let there be m classes, C1, C2, ….Cm

● Given a tuple X, naïve Bayesian classifier predicts that


tuple X belongs to class Ci

iff P(Ci|X) > P(Cj|X) for 1≤j ≤ m, j ≠i

i.e. maximize P(Ci|X); since P(X) is the same for all classes, this is equivalent to maximizing P(X|Ci) P(Ci)

59

Bayesian Theorem
• Given training data D, posteriori probability of a
hypothesis h, P(h|D) follows the Bayes theorem
P(h | D) = P(D | h) P(h) / P(D)
• MAP (maximum a posteriori) hypothesis:
h_MAP = argmax_{h∈H} P(h | D) = argmax_{h∈H} P(D | h) P(h)
• Practical difficulty: require initial knowledge of
many probabilities, significant computational
cost

60

Estimating a-posteriori probabilities
• Bayes theorem:
P(C|X) = P(X|C)·P(C) / P(X)
• P(X) is constant for all classes
• P(C) = relative freq of class C samples
• C such that P(C|X) is maximum =
C such that P(X|C)·P(C) is maximum

61

MAP (Maximum A Posteriori hypothesis) Learner


For each hypothesis h in H, calculate the posterior probability

P(h | D) = P(D | h) P(h) / P(D)

Output the hypothesis h_MAP with the highest posterior probability:

h_MAP = argmax_{h∈H} P(h | D)

Comments:
• Computationally intensive
• Provides a standard for judging the performance of learning algorithms
• Choosing P(h) and P(D|h) reflects our prior knowledge about the learning task

62

Bayesian classification
• The classification problem may be formalized
using a-posteriori probabilities:
• P(C|X) = prob. that the sample tuple
X=<x1,…,xk> is of class C.

• E.g. P(class=N | outlook=sunny, windy=true,…)

• Idea: assign to sample X the class label C such


that P(C|X) is maximal

63


Naïve Bayesian Classification
• Naïve assumption: attribute independence
P(x1,…,xk|C) = P(x1|C)·…·P(xk|C)
• If i-th attribute is categorical:
P(xi|C) is estimated as the relative freq of
samples having value xi as i-th attribute in class
C
• If i-th attribute is continuous:
P(xi|C) is estimated through a Gaussian density
function
• Computationally easy in both cases

65

Likelihood (figure slide; figure not reproduced here)

66

Day Outlook Temperature Humidity Wind Play ball

D1 Sunny Hot High Weak No


D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No

67

Naive Bayesian Classifier (II)


• Given a training set, we can compute the
probabilities
Outlook Y N Humidity Y N
sunny 2/9 3/5 high 3/9 4/5
overcast 4/9 0 normal 6/9 1/5
rain 3/9 2/5
Temperature Y N Windy Y N
hot 2/9 2/5 weak 6/9 2/5
mild 4/9 2/5 strong 3/9 3/5
cool 3/9 1/5

68

Weather example: classifying X
• An unseen sample X = <rain, hot, high, weak>

• P(X|yes)·P(yes) = P(rain|yes)·P(hot|yes)·P(high|yes)·P(weak|yes)·P(yes) = 3/9 · 2/9 · 3/9 · 6/9 · 9/14 = 0.010582
• P(X|no)·P(no) = P(rain|no)·P(hot|no)·P(high|no)·P(weak|no)·P(no) = 2/5 · 2/5 · 4/5 · 2/5 · 5/14 = 0.018286

• Sample X is classified in class no (don’t play)

69
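A minimal Python sketch of this naïve Bayes computation over the play-tennis table; the relative frequencies are the ones tabulated on the previous slides, and the dictionary layout is just an illustrative choice:

```python
# Priors and class-conditional relative frequencies from the 14-row play-tennis table
priors = {"yes": 9 / 14, "no": 5 / 14}
likelihoods = {
    "yes": {"rain": 3 / 9, "hot": 2 / 9, "high": 3 / 9, "weak": 6 / 9},
    "no":  {"rain": 2 / 5, "hot": 2 / 5, "high": 4 / 5, "weak": 2 / 5},
}

x = ["rain", "hot", "high", "weak"]      # unseen sample X

scores = {}
for c in priors:
    score = priors[c]
    for value in x:
        score *= likelihoods[c][value]   # naive independence assumption
    scores[c] = score

print(scores)                       # {'yes': ~0.0106, 'no': ~0.0183}
print(max(scores, key=scores.get))  # 'no' -> don't play
```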

Bayesian Learning for continuous data


source: Han and Kamber, Data Mining book

• To reduce computation in evaluating P(Xj|Ci), naive assumption of class-


conditional independence is made. This presumes that attributes’ values are
conditionally independent of one another, given the class label of the tuple.

70
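The slide's formula figure is not reproduced above, so the following is a small sketch of the usual Gaussian treatment of a continuous attribute described in Han and Kamber: estimate the class mean and standard deviation from the training tuples and use the normal density as P(xk | Ci). The attribute values below are made-up illustrative numbers, not data from the book:

```python
import math

def gaussian(x, mu, sigma):
    # Normal density g(x, mu, sigma), used as the class-conditional likelihood
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# Hypothetical training values of a continuous attribute (e.g. age) within one class
ages_in_class = [25, 28, 31, 35, 41]
mu = sum(ages_in_class) / len(ages_in_class)
sigma = math.sqrt(sum((a - mu) ** 2 for a in ages_in_class) / len(ages_in_class))

# P(age = 30 | this class), plugged into the naive Bayes product like any other factor
print(gaussian(30, mu, sigma))
```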

Probability distributions

(Figure: histogram of the heights of adults in cm, showing the probability of each 10 cm height range from below 150 cm to above 210 cm, with values between 0.01 and 0.28.)

71

Probability density function

Source: https://www.nature.com/articles/s41598-018-33413-y

72

Laplacian correction/estimator
• What if we encounter probability values of zero?
• It will result in probability product coming out to be zero.
• There is a simple trick to avoid this problem. We can assume
that our training database, D, is so large that adding one to
each count that we need would only make a negligible
difference in the estimated probability value, yet would
conveniently avoid the case of probability values of zero.
• This technique for probability estimation is known as the
Laplacian correction or Laplace estimator, named after
Pierre Laplace, a French mathematician who lived from 1749
to 1827.
• If we have, say, q counts to which we each add one, then
we must remember to add q to the corresponding
denominator used in the probability calculation.

73

Laplacian correction/estimator
• Suppose for some training database D with 1000 tuples, we
have this situation:
– 0 tuples with income = low
– 990 tuples with income = medium
– 10 tuples with income = high
• Then the probabilities of these events are:
0/1000 = 0, 990/1000 = 0.990, 10/1000 = 0.010
• Using the Laplacian correction, we pretend that we have one
more tuple for each income-value pair:
1/1003 = 0.001, 991/1003 = 0.988, 11/1003 = 0.011
• Hence the corrected probability estimates are close to their
uncorrected counterparts, yet the zero probability value is
avoided.
74
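A short Python sketch of the Laplacian correction on this income example (counts as on the slide):

```python
counts = {"low": 0, "medium": 990, "high": 10}   # 1000 training tuples in total
total = sum(counts.values())

# Uncorrected estimates: the zero count produces a zero probability
uncorrected = {k: v / total for k, v in counts.items()}

# Laplacian correction: add 1 to each of the q = 3 counts and add q to the denominator
q = len(counts)
corrected = {k: (v + 1) / (total + q) for k, v in counts.items()}

print(uncorrected)  # {'low': 0.0, 'medium': 0.99, 'high': 0.01}
print(corrected)    # {'low': ~0.001, 'medium': ~0.988, 'high': ~0.011}
```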

Advantages of Bayesian Learning
• Unlike many other models of supervised learning, the
naive Bayes classifier can handle missing data where
not all features are observed; the agent conditions on the
features that are observed.
• Naive Bayes is optimal when the independence assumption holds
• A naive Bayes model gives a direct way to assess the
weights and allows for missing data.
• It however makes the assumption that the Xi are
independent given Y, which may not hold practically.
• A linear regression model trained, for example, with
gradient descent can take into account dependencies,
but does not work for missing data.

75

Bayes Rule: Summary


• Robust to isolated noise points
• Handle missing values by ignoring the
instance during probability estimate
calculations
• Robust to irrelevant attributes
• Independence assumption may not hold
for some attributes

76

Applications of Bayes Classifier

77

Probability of a bigram w1 and w2


● P(w1, w2) denotes the probability of the word pair w1 and w2
occurring in sequence (e.g., of the, say unto, etc.)
● If we make one minor simplifying assumption, then the formula
is exactly the same as for a single word:

● In fact, this method extends to n-grams of any size:

● A sequence of two words (e.g., of the) is called a bigram


● A three-word sequence (e.g., sound of the) is called a trigram.
● The general term n-gram means ‘sequence of length n’.

78


Predicting the next word


● According to the definition of conditional probability, the probability of seeing word wn given the previous words w1, w2, ..., wn−1 is:
P(wn | w1, ..., wn−1) = P(w1, ..., wn) / P(w1, ..., wn−1)

● Bigram example

● Trigram example

● This method is called Maximum Likelihood Estimation (MLE).

80
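A minimal Python sketch of bigram MLE; the tiny corpus and function name here are illustrative only, not the Shakespeare data used on the next slide:

```python
from collections import Counter

tokens = "the sound of the sea and the sound of the wind".split()

unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

def p_next(word, prev):
    # MLE estimate: P(word | prev) = C(prev, word) / C(prev)
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(p_next("sound", "the"))  # 2/4 = 0.5
print(p_next("sea", "the"))    # 1/4 = 0.25
```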

Predicting words based on Shakespeare
● What is the probability of seeing the, given that we’ve just
seen of ?

● What is the probability of seeing king, given that we’ve


just seen the?

81

Applications of Naive Bayes Algorithms


• Real-time prediction: Naive Bayes is an eager learning classifier and it is very fast, so it can be used for making predictions in real time.

• Multi-class prediction: the algorithm is also well suited to multi-class prediction, since it gives the probability of each of the possible classes of the target variable.

• Text classification: Naive Bayes classifiers are widely used in text classification (because of good results on multi-class problems and the independence assumption) and often achieve higher success rates than other algorithms.

82

Few more applications
● It is widely used in Spam filtering (identify spam e-mail) and

● Sentiment Analysis (in social media analysis, to identify positive and


negative customer sentiments)

● Recommendation systems: a Naive Bayes classifier combined with collaborative filtering builds a recommendation system that uses machine learning and data mining techniques to filter unseen information and predict whether a user would like a given resource or not.

83

Bayesian Belief Networks


Source: AI book, 3rd Edition, Russel and Norvig

84

The independence hypothesis…
• makes computation possible
• yields optimal classifiers when satisfied
• but is seldom satisfied in practice, as attributes
(variables) are often correlated.
• Attempts to overcome this limitation:
– Bayesian networks, that combine Bayesian reasoning
with causal relationships between attributes
– Decision trees, that reason on one attribute at a time, considering the most important attributes first

85

Bayesian Belief Networks


• Naïve Bayes assumption of conditional independence too
restrictive

• But it is intractable without some such assumptions

• A Bayesian belief network (Bayesian net) describes conditional independence among subsets of variables (attributes), combining prior knowledge about dependencies among variables with observed training data.

86

Bayesian Belief Network
• A directed acyclic probabilistic graphical model that captures
dependence among the attributes
• Bayesian Net
– Nodes: Variable/Attributes/Class
– Directed edges: Causality/dependency
– Absence of edge: independence
– Network structure: domain knowledge
– Joint probabilities: from data

• To each variable A with parents B1, …., Bn there is attached a conditional probability table P(A | B1, …., Bn)

87

Example

• The probability of being happy
  – What makes me happy is when the weather is sunny or if I get a salary raise in my job.
  (Sunny and Raise are evidence variables; Happy is the query variable.)
• Find P(R|S) = ?   (ans = P(R) = 0.01)
• Find P(R|H,S) = ?   (0.0142)
• Find P(R|H) = ?   (0.97)
• Hint: apply Bayes' rule

88

Example of Bayesian Belief Network

– B, E are evidence variables
– J, M are query variables
– Anything that is neither query nor evidence is called a hidden variable.

(Number of independent CPT entries per node: 1 for B, 1 for E, 4 for A, 2 for J, 2 for M.)

P(J,M,A,B,E) = P(B) * P(E) * P(A|B,E) * P(J|A) * P(M|A)


Source: Russel Norvig AI book

89

Bayesian Reasoning
• Given the evidence of who has or has not called, we would
like to estimate the probability of a burglary.
• In probabilistic inference, the output is not a single number
for each of query variables, rather it is a joint probability
distribution over the query variables.
• We call it posterior distribution of one or more query
variables, given the evidence.
• It is the probability distribution of one or more query
variables given the values of evidence variables:

90

Bayesian Reasoning
• We call it posterior distribution of one or more query
variables Qi, given the evidence variables Ei :
P(Q1, Q2,….|E1=e1, E2=e2……En=en)
• Another question that can be answered, out of all possible
values for all the query variables, which combination of
values has the highest probability:
argmax P(Q1=q1, Q2=q2,….|E1=e1, E2=e2……)

91

Bayesian Reasoning
• One great thing about Bayes nets is that we are not restricted to reasoning in only one direction: we can reason in the causal direction or against the causal flow.

– For example we could have J & M be the evidence


variables, B & E be the query variables.

92

Bayesian Inferencing
• Find probability that alarm has sounded, but neither a
burglary nor an earthquake has occurred, and both John
and Mary call
• Find P(J,M,A,~B,~E)=?

93

CPT
• Find probability that alarm has sounded, but neither a
burglary nor an earthquake has occurred, and both John
and Mary call
• Find P(J,M,A,~B,~E)=?
Answer:
=P(J|A).P(M|A).P(A|~B,~E).P(~B).P(~E)
=0.90*0.70*0.001*0.999*0.998
=0.00062

• This way once such a network is established, it can be


used to answer any query about the domain.

94
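The same query as a short Python sketch, using the CPT values from the calculation above (P(¬B)=0.999 and P(¬E)=0.998 correspond to P(B)=0.001 and P(E)=0.002 in the standard Russell and Norvig alarm network):

```python
# CPT entries of the burglary/earthquake alarm network
p_b = 0.001                     # P(B)
p_e = 0.002                     # P(E)
p_a_given_not_b_not_e = 0.001   # P(A | ~B, ~E)
p_j_given_a = 0.90              # P(J | A)
p_m_given_a = 0.70              # P(M | A)

# Factorisation given by the network structure:
# P(J, M, A, ~B, ~E) = P(J|A) P(M|A) P(A|~B,~E) P(~B) P(~E)
p = (p_j_given_a * p_m_given_a * p_a_given_not_b_not_e
     * (1 - p_b) * (1 - p_e))
print(p)   # ~0.000628, i.e. the 0.00062 on the slide after rounding
```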

Advantage of Bayesian networks
● P(B,E,A,J,M) = P(B)*P(E)*P(A|B,E)*P(J|A)*P(M|A)
● B and E have no incoming arrows, so they have prior probability distributions P(B) and P(E).
● A has two incoming arrows, so its probability is conditioned on B and E, giving us P(A|B,E).
● J and M are both conditioned on A, giving us P(J|A) and P(M|A).
● So the definition of this setup for the joint distribution, P(B,E,A,J,M), is based on the factors above, and gives us one really BIG advantage.
● We know that the joint distribution over any five binary random variables requires 2^5 − 1 = 31 probability values, while our Bayes network only requires 10 probability values.

95

Advantage of Bayesian networks


● we would only need 10 probability values, compared to 31
for an unstructured non-graph method.
● It might not seem like such a difference, but when scaling
to a larger and more complex problem, the compactness
of the network leads to a representation that scales
significantly better to large networks.
● This is a key reason why Bayes Networks are being used
so extensively.

P(J,M,A,B,E) = P(J|A) * P(M|A)* P(A|B,E) * P(B) * P(E)


Source: Russel Norvig AI book

96

Representing the full joint distribution
source: Russel & Norvig Book, page 513

97


Bayesian Belief Networks


• If an arc is drawn from a node Y to Z,
– Y is parent or immediate predecessor of Z, and Z is a descendent
of Y.

• Each variable is conditionally independent of its non-


descendants in the graph, given its parents.
– E.g. Having lung cancer is influenced by a person’s family history
of lung cancer, as well as whether or not the person is a smoker.

• Once we know outcome of variable Lung Cancer, then


variables Family history and Smoker do not provide any
additional information regarding positive x-ray.

100


Training Bayesian Belief Networks


• There are 2 issues in learning of BBN:
– network topology and
– CPT entries

• Network topology can be learnt from training data using


several existing algorithms, given observable variables.

• Human experts usually have a good grasp of the direct conditional dependencies that hold in the domain under analysis, which helps in network design.

102

Training Bayesian Belief Networks
• Once network topology and observable variables are
known, gradient descent strategy can be used to learn the
CPT entries.

• Learning starts with random initial values for the probability (CPT) entries.

• Apply gradient descent to learn CPT entries by minimizing


error. (Gradient descent to be discussed with topic ANN)

103

Clustering

(Figure: data points described by income, education, and age, shown first unlabelled and then grouped into clusters.)

Clustering
• Clustering is alternatively called "grouping".
• Intuitively, we want to assign the same label to data points that are close to each other.
• Thus, clustering algorithms rely on a distance metric between data points.

106

What is Cluster Analysis?
• Cluster: a collection of data objects
– Similar to one another within the same cluster
– Dissimilar to the objects in other clusters
• Cluster analysis
– Grouping a set of data objects into clusters
• Clustering is unsupervised classification: no predefined
classes
• Typical applications
– As a stand-alone tool to get insight into data distribution
– As a preprocessing step for other algorithms

107

Issues
• Is the desired number of clusters given?
• Finding the "best" clusters
• Are the clusters semantically meaningful?

108

Cluster Analysis
• Finding groups of objects such that
– the objects in a group will be similar (or related) to
one another and
– different from (or unrelated to) the objects in other
groups

109

Example
• Suppose we have 4 types of medicines and each has two attributes (pH and weight index). Our goal is to group these objects into K=2 groups of medicine.

Medicine   Weight Index   pH
A          1              1
B          2              1
C          4              3
D          5              4

(Figure: the four medicines plotted by weight index and pH; A and B lie close together, as do C and D.)

110

Another Example: Text
• Each document is a vector
– e.g., <100110...> contains words 1,4,5,...
• Clusters contain “similar” documents
• Useful for understanding, searching
documents

(Figure: example document clusters with labels such as sports, international, news, business.)

111

Where to use clustering?

• Information retrieval
• Text mining
• Web analysis
• Marketing
• Medical diagnostics

112

K-means Algorithm
• Given the cluster number K, the K-means algorithm is carried out in three steps after initialisation:
Initialisation: set seed points (randomly selected as the means of the clusters)
1) Assign each object to the cluster of the nearest seed point, measured with a specific distance metric
2) Compute new seed points as the centroids of the clusters of the current partition (the centroid is the centre, i.e. mean point, of the cluster)
3) Go back to step 1); stop when there are no more new assignments (i.e. the membership of each cluster no longer changes)

113

Class Problem
Training Examples    x1    x2

A 1 1
B 1 0
C 0 2
D 2 4
E 3 5
• Let k=2, means we are interested in two clusters
• Let A and C are randomly selected as the means of 2
clusters.

114

Class Problem
Mean/Center    Distance from center 1    Distance from center 2

A (C1) 0 (C1) 1.4


B 1 (C1) 2.2
C (C2) 1.4 0 (C2)
D 3.2 2.8 (C2)
E 4.47 4.2 (C2)
• Find distance between each observation and all
centers/cluster means.
• Assign each observation to the cluster having the closest
mean.
• Recalculate the cluster means.
115

Class Problem
Mean/Center    Distance from center 1    Distance from center 2

A (C1) 0 (C1) 1.4


B 1 (C1) 2.2
C (C2) 1.4 0 (C2)
D 3.2 2.8 (C2)
E 4.47 4.2 (C2)
• Recalculate the cluster means.
• C1 = {A,B} and C2 = {C,D,E}
• New mean of cluster C1 = {(1+1 )/2, (0+1)/2} = {1,0.5}
• New mean of cluster 2 = {(0+2+3)/3,(2+4+5)/3} = {1.7, 3.7}

116

Class Problem
Mean/Center    Distance from center 1 {1, 0.5}    Distance from center 2 {1.7, 3.7}

A 0.5 (C1) 2.7


B 0.5 (C1) 3.7
C 1.8 (C1) 2.4
D 3.6 0.5 (C2)
E 4.9 1.9 (C2)
• Recalculate the cluster means.
• C1 = {A,B,C} and C2 = {D,E}
• New mean of cluster C1 = {(0+1+1 )/3, (0+1+2)/3} = {0.7,1}
• New mean of cluster 2 = {(2+3)/2,(4+5)/2} = {2.5, 4.5}

117

Class Problem
Mean/Center    Distance from center 1 {0.7, 1}    Distance from center 2 {2.5, 4.5}

A
B
C
D
E

• The algorithm converges when recalculating the distances and reassigning cases to clusters results in no change.

118
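A compact K-means sketch (plain Python, Euclidean distance) that reproduces the class problem above, starting from A and C as the initial means:

```python
def kmeans(points, means, max_iter=100):
    # Plain K-means with Euclidean distance; returns final means and clusters
    dist = lambda p, q: ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5
    for _ in range(max_iter):
        # Step 1: assign each point to the nearest mean
        clusters = [[] for _ in means]
        for p in points:
            nearest = min(range(len(means)), key=lambda i: dist(p, means[i]))
            clusters[nearest].append(p)
        # Step 2: recompute the means as the cluster centroids
        new_means = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            for c in clusters
        ]
        if new_means == means:      # Step 3: stop when nothing changes
            break
        means = new_means
    return means, clusters

data = {"A": (1, 1), "B": (1, 0), "C": (0, 2), "D": (2, 4), "E": (3, 5)}
means, clusters = kmeans(list(data.values()), [data["A"], data["C"]])
print(means)     # roughly [(0.67, 1.0), (2.5, 4.5)]
print(clusters)  # [[(1, 1), (1, 0), (0, 2)], [(2, 4), (3, 5)]]
```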

Exercise 1
For the medicine data set, use K-means with the Manhattan distance
metric for clustering analysis by setting K=2 and initialising seeds as
C1 = A and C2 = C. Answer three questions as follows:
1. How many steps are required for convergence?
2. What are memberships of two clusters after convergence?
3. What are centroids of two clusters after convergence?

Medicine   Weight Index   pH
A          1              1
B          2              1
C          4              3
D          5              4
119


Exercise 2: K means

120

Drawback of K-means
• K-means fails for non-linearly separable data.
• The k-means algorithm is sensitive to outliers !
– Since an object with an extremely large value may substantially
distort the distribution of the data.
• There are other limitations – still a need for reducing costs
of calculating distances to centroids.
• K-Medoids: Instead of taking the mean value of the object
in a cluster as a reference point, medoids can be used,
– Which is the most centrally located object in a cluster.
• One solution is the more advanced Kernel K-means clustering algorithm, which is based on the idea of projecting the data onto a high-dimensional kernel space and then performing K-means clustering there.

121

Summary
• K-means algorithm is a simple yet popular
method for clustering analysis
• Its performance is determined by initialisation and
appropriate distance measure
• There are several variants of K-means to
overcome its weaknesses
– K-Medoids: resistance to noise and/or outliers
– K-Modes: extension to categorical data clustering
analysis
– CLARA: extension to deal with large data sets
– Mixture models (EM algorithm): handling uncertainty of
clusters

122
