
Machine Learning

PHDS-634

Topic 2:
Classification,
Clustering and
Bayesian Theory

Wondwossen Mulugeta (PhD)


1
Email: [email protected]
Topic Coverage
Topic 2: Classification, Clustering and Bayesian Theory
Sub-Topics:
⚫ Classification
⚫ K-Nearest Neighbors
⚫ Clustering
⚫ Bayesian Networks

2
Classification and Clustering
Modeling from past data is a crucial tool to help
experts and physicians make more informed
decisions.
Machine learning models provide evidence about
how things happen, which experts can use
to better understand a public health
situation and design appropriate interventions.
Understanding the value of modeling in the
field of health science is therefore beneficial.

3
Basic Algorithms
⚫ Classification: also called supervised
learning, maps data into predefined groups or classes
to support the prediction process
⚫ Clustering: also called unsupervised
learning, groups similar data together into clusters.
⚫ It is used to find appropriate groupings of elements for
a set of data.
⚫ Unlike classification, clustering is a kind of
undirected knowledge discovery or unsupervised
learning; i.e., there is no target field, and the
relationships among the data are identified by a
bottom-up approach.
4
Supervised vs. Unsupervised Learning
⚫ Supervised learning (classification)
⚫ Supervision: The training data (observations, measurements,
etc.) are accompanied by labels indicating the class of the
observations
⚫ New data is classified based on the training set

⚫ Unsupervised learning (clustering)


⚫ The class labels of the training data are unknown
⚫ Given a set of measurements, observations, etc., the aim is to
establish the existence of classes or clusters in the data

5
CLASSIFICATION
⚫ Given a collection of records (training set), each record
contains a set of attributes, one of the attributes is the class or
the label attribute.
⚫ Goal: Find a model for class attribute as a function of the values
of other attributes.
⚫ Previously unseen records should be assigned a class as
accurately as possible. A test set is used to determine the
accuracy of the model.
⚫ Usually, the data set is divided into training and test sets,
⚫ For example:
⚫ One can use a classification model to predict whether a patient is
HIV +Ve or -Ve
⚫ Classification can also be used to categorize the effectiveness of a
drug as “High”, “Moderate” or “Low”
6
CLASSIFICATION: A TWO-STEP
PROCESS
1. Model construction: describing a set of predetermined classes
⚫ Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute
⚫ The set of tuples used for model construction is called the training
set
⚫ The model is represented as classification rules, decision trees,
or mathematical formulae
2. Model usage: for classifying new or unseen instances into one
of the class values or labels.
⚫ Estimate accuracy of the model
⚫ The known label of each test instance is compared with the classification predicted by
the model
⚫ Accuracy rate is the percentage of test set samples that are correctly classified
by the model
⚫ The test set should be different from the training set
⚫ If the accuracy is acceptable, use the model to classify new data (a minimal
sketch of this two-step process follows below)
7
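As an illustration of the two steps (model construction on a training set, then accuracy estimation on a held-out test set), here is a minimal sketch in Python. It assumes scikit-learn is available and uses a synthetic dataset; the choice of a decision tree classifier is only for illustration, not something prescribed by the slides.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic data standing in for a labeled health dataset
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# The test set must be different from the training set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Step 1: model construction on the training set
model = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2: model usage -- compare known test labels with the model's predictions
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))
```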
Illustrating Classification Task

8
Instance-Based Methods
⚫ Instance-based learning:
⚫ Store training examples and delay the processing
(“lazy evaluation”) until a new instance must be
classified
⚫ Typical approaches
⚫ k-nearest neighbor approach
⚫ Instances represented as points in a Euclidean
space.
⚫ Locally weighted regression
⚫ Constructs local approximation
⚫ Case-based reasoning
⚫ Uses symbolic representations and
knowledge-based inference
9
Different Learning Methods
⚫ Eager Learning
⚫ Learning = acquiring an explicit structure of a
classifier on the whole training set;
⚫ Classification = an instance gets a classification
using the explicit structure of the classifier.

⚫ Instance-Based Learning (Lazy Learning)


⚫ Learning = storing all training instances
⚫ Classification = an instance gets a classification
equal to the classification of the nearest instances
to the new instance.
10
Classification methods
⚫ Goal: Predict class Ci = f(x1, x2, .. xn)

⚫ There are various classification methods. Popular
classification techniques include the following:
⚫ K-nearest neighbor (Instance Based)
⚫ Bayesian network: a probabilistic model
⚫ Decision tree classifier: divide decision space into
piecewise constant regions.
⚫ Neural networks: partition by non-linear boundaries

11
K-Nearest Neighbors

12
K-Nearest Neighbors
⚫ K-nearest neighbor is a supervised learning algorithm in which a
new query instance is classified based on the majority category
of its K nearest neighbors.
⚫ The purpose of this algorithm is to classify a new object
based on attributes and training samples: (xi, f(xi)), i = 1..N.
⚫ Given a query point Q, we find the K objects
(training points) closest to the query point.
⚫ The decision is made using a majority vote among the classes of the K
objects.
⚫ The K-nearest neighbor algorithm uses this neighborhood classification as
the prediction value of the new query instance.
⚫ It uses a distance measure from the query instance to the
training samples to determine the K nearest neighbors.
13
How to compute….
K-Nearest Neighbor (KNN) Algorithm?
1. Determine parameter K = number of nearest neighbors
2. Calculate the distance between the query-instance and
all the training samples
⚫ we can use Euclidean distance or any other similarity
measure
3. Sort the distances and take the K training samples with the
smallest distances as the nearest neighbors
4. Gather the categories (class labels) of these nearest neighbors
5. Use the simple majority of the neighbors' categories
as the prediction value of the query instance
⚫ Any ties can be broken at random or by a justified rule (see the sketch below).
14
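A minimal sketch of these five steps in plain Python (no external libraries; the training data shown anticipates the medical-tissue example on the following slides, and K = 3 is just a placeholder):

```python
from collections import Counter

def knn_predict(query, training_samples, k=3):
    """Classify `query` by majority vote among its k nearest training samples."""
    # Step 2: squared Euclidean distance from the query to every training sample
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    distances = [(sq_dist(query, x), label) for x, label in training_samples]
    # Step 3: sort the distances and keep the K nearest neighbors
    neighbors = sorted(distances)[:k]
    # Steps 4-5: majority vote over the neighbors' class labels
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Example usage: (acid durability, strength) -> Good/Bad
train = [((7, 7), "Bad"), ((7, 4), "Bad"), ((3, 4), "Good"), ((1, 4), "Good")]
print(knn_predict((3, 7), train, k=3))  # prints "Good"
```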
K Nearest Neighbors: Key issues
The key issues involved in training a KNN model include:
⚫ Setting the variable K (Number of nearest neighbors)
⚫ The number of nearest neighbors (K) should be chosen based on cross
validation over a number of K settings.
⚫ k = 1 is a good baseline model to benchmark against.
⚫ A good rule-of-thumb is that K should be less than or equal to the
square root of the total number of training patterns.

⚫ Setting the type of distance measure


⚫ We need a measure of distance in order to know who are the
neighbors
⚫ Assume that we have T attributes for the learning problem. Then
one example point x has elements xt ∈ ℜ, t=1,…T.
⚫ The distance between two points xi and xj is often defined as the
Euclidean distance:
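In the notation above, this is

$$d(x_i, x_j) = \sqrt{\sum_{t=1}^{T} (x_{it} - x_{jt})^2}$$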
15
Similarity/Dissimilarity Measures

16
Example
⚫ We have data from people and an objective test with two
attributes (acid durability and strength) to classify whether a
special medical tissue is good or not. Here are four training
samples, where X1 = Acid Durability (seconds), X2 = Strength (kg/m2),
and Y = Classification:

X1 (seconds)   X2 (kg/m2)   Y
7              7            Bad
7              4            Bad
3              4            Good
1              4            Good

⚫ Now the factory produces a new medical tissue that passes the
laboratory test with X1 = 3 and X2 = 7.
⚫ Without undertaking another expensive review, can we guess the
goodness of the new tissue? Use squared Euclidean distance
for the similarity measurement, and use K = 3.
17
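For reference, working through the numbers: the squared Euclidean distances from the query (X1 = 3, X2 = 7) to the four training samples are 16 + 0 = 16 for (7, 7, Bad), 16 + 9 = 25 for (7, 4, Bad), 0 + 9 = 9 for (3, 4, Good) and 4 + 9 = 13 for (1, 4, Good). The K = 3 nearest neighbors are those at distances 9, 13 and 16, i.e. Good, Good and Bad, so the majority vote classifies the new tissue as Good.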
18
Training data Exercise

Test instance

19
KNNs: advantages & Disadvantages
⚫ Advantage
⚫ Simple
⚫ Powerful
⚫ Requires no training time (prediction done online)
⚫ Nonparametric architecture
⚫ Disadvantage: Difficulties with k-nearest
neighbor
⚫ Memory intensive:
⚫ training examples should always be available
⚫ Classification/estimation is slow
⚫ Have to calculate the distance of the test case from all
training cases
⚫ There may be irrelevant attributes amongst the attributes
– the curse of dimensionality
20
Bayesian Learning

21
Why Bayesian Classification?
⚫ Provides practical learning algorithms
⚫ Probabilistic learning: Calculate explicit probabilities for
hypothesis. E.g. Naïve Bayes
⚫ Prior knowledge and observed data can be combined
⚫ Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is correct.
⚫ It is a generative (model based) approach, which offers a
useful conceptual framework
⚫ Probabilistic prediction: Predict multiple hypotheses, weighted
by their probabilities.
E.g. sequences could also be classified, based on a probabilistic model
specification
⚫ Any kind of object can be classified, based on a probabilistic
model specification
22
CONDITIONAL PROBABILITY
⚫ Probability : How likely is it that an event will happen?
⚫ Sample Space S
⚫ Events A and C are subsets of S

⚫ P(C | A): the probability that event C occurs given that event
A has already occurred.

Example of conditional probability:

⚫ There are 2 baskets. B1 has 2 red balls and 5 blue balls. B2
has 4 red balls and 3 blue balls.
⚫ What is the probability of picking a red ball from basket 1?
P(red ball | basket 1) = 2/7
P(red ball | basket 2) = 4/7
⚫ What about P(basket2 | red ball) ?
23
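Assuming each basket is equally likely to be chosen (P(B1) = P(B2) = 1/2, an assumption not stated above), Bayes' theorem gives

$$P(B_2 \mid \text{red}) = \frac{P(\text{red} \mid B_2)\,P(B_2)}{P(\text{red} \mid B_1)\,P(B_1) + P(\text{red} \mid B_2)\,P(B_2)} = \frac{\tfrac{4}{7}\cdot\tfrac{1}{2}}{\tfrac{2}{7}\cdot\tfrac{1}{2} + \tfrac{4}{7}\cdot\tfrac{1}{2}} = \frac{2}{3}$$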
Bayes Classifier
⚫ A probabilistic framework for solving classification problems
⚫ Bayes theorem:

Therefore:
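In standard notation, for a class C and observed evidence A, the theorem states

$$P(C \mid A) = \frac{P(A \mid C)\,P(C)}{P(A)}$$

so the classifier chooses the class C with the largest posterior P(C | A); since P(A) is the same for every class, this is equivalent to maximizing P(A | C) P(C).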

24
Bayes Theorem
⚫ Example: A medical cancer diagnosis problem. There are 2 possible outcomes of
a diagnosis: +ve, -ve.
We know 0.8% of world population has cancer. Test gives correct +ve result 98% of the
time and gives correct –ve result 97% of the time. If a patient’s test returns +ve,
should we diagnose the patient as having cancer?
P(cancer) = 0.008 p(no-cancer) = 0.992
P(+ve|cancer) = 0.98 P(-ve|cancer) = 0.02
P(+ve|no-cancer) = 0.03 P(-ve|no-cancer) = 0.97

Using Bayes Formula:


⚫ P(cancer|+ve) = P(+ve|cancer) x P(cancer) / P(+ve)
= (0.98 x 0.008) / P(+ve) = 0.0078 / P(+ve)
⚫ P(no-cancer|+ve) = P(+ve|no-cancer) x P(no-cancer) / P(+ve)
= (0.03 x 0.992) / P(+ve) = 0.0298 / P(+ve)
Since 0.0298 > 0.0078, the patient most likely does not have cancer. P(+ve) is a common
denominator, so it does not affect the comparison.
25
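For reference, the common denominator can also be computed explicitly, turning the two scores into proper probabilities:

$$P(+ve) = 0.98 \times 0.008 + 0.03 \times 0.992 \approx 0.0376, \qquad P(\text{cancer} \mid +ve) \approx \frac{0.0078}{0.0376} \approx 0.21$$

So even after a positive test, the probability of cancer is only about 21%.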
Bayes Classifier
⚫ Example of Bayes Theorem
⚫ A doctor knows that meningitis causes stiff neck 50% of the time.
Prior probability of any patient having meningitis is 1/50. Prior
probability of any patient having stiff neck is 1/20. If a patient has
stiff neck, what’s the probability he/she has meningitis?
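With S = stiff neck and M = meningitis, the numbers given plug directly into Bayes' theorem:

$$P(M \mid S) = \frac{P(S \mid M)\,P(M)}{P(S)} = \frac{0.5 \times \tfrac{1}{50}}{\tfrac{1}{20}} = \frac{0.01}{0.05} = 0.2$$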

26
General Bayes Theorem
⚫ Consider each attribute & class label as random variables
⚫ Given a record with attributes (A1, A2,…,An)
⚫ Goal is to predict class C
⚫ Specifically, we want to find the value of C that maximizes P(C|
A1, A2,…,An )
⚫ Can we estimate P(C| A1, A2,…,An ) directly from data?
⚫ Approach: compute the posterior probability P(C | A1, A2, …, An)
for all values of C using the Bayes theorem

⚫ Choose value of C that maximizes: P(C | A1, A2, …, An)


⚫ Equivalent to choosing value of C that maximizes
P(A1, A2, …, An|C) P(C)
⚫ How to estimate P(A1, A2, …, An | C )?
27
Naïve Bayes Classifier
⚫ Assume independence among the attributes Ai when the class
is given:
⚫ P(A1, A2, …, An | Cj) = P(A1| Cj) P(A2| Cj) … P(An| Cj)

⚫ Can estimate P(Ai | Cj) for all Ai and Cj.

⚫ A new point is classified to Cj if P(Cj) Π P(Ai| Cj) is maximal.

28
Example. ‘Play Tennis’ data
• Suppose that someone who loves tennis has a free afternoon and is
deciding whether or not to go and play tennis. How would you help?
✔ Based on the following training data, predict whether this player will play
tennis.

29
Naive Bayes Classifier
⚫ Given a training set, we can compute the probabilities

P(Play) = 9/14
P(NoPlay) = 5/14

These conditional probabilities are the learned knowledge
30


Play-tennis example
⚫ Based on the model created, predict Play Tennis or Not for the
following unseen sample
(Outlook=Sunny, Temperature=Cool, Humidity=High,
Wind=Strong)

Working:

• Another example: what if the following test instance is given?
X = <Outlook=rain, Temperature=hot, Humidity=high, Wind=weak>
31
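A minimal Naïve Bayes sketch for this example in Python. The table of 14 training examples below is the standard play-tennis dataset commonly used with this example (it matches P(Play) = 9/14 and P(NoPlay) = 5/14); it is assumed here rather than copied from the slides.

```python
from collections import Counter

# (Outlook, Temperature, Humidity, Wind) -> PlayTennis (assumed standard dataset)
data = [
    (("Sunny", "Hot", "High", "Weak"), "No"),          (("Sunny", "Hot", "High", "Strong"), "No"),
    (("Overcast", "Hot", "High", "Weak"), "Yes"),      (("Rain", "Mild", "High", "Weak"), "Yes"),
    (("Rain", "Cool", "Normal", "Weak"), "Yes"),       (("Rain", "Cool", "Normal", "Strong"), "No"),
    (("Overcast", "Cool", "Normal", "Strong"), "Yes"), (("Sunny", "Mild", "High", "Weak"), "No"),
    (("Sunny", "Cool", "Normal", "Weak"), "Yes"),      (("Rain", "Mild", "Normal", "Weak"), "Yes"),
    (("Sunny", "Mild", "Normal", "Strong"), "Yes"),    (("Overcast", "Mild", "High", "Strong"), "Yes"),
    (("Overcast", "Hot", "Normal", "Weak"), "Yes"),    (("Rain", "Mild", "High", "Strong"), "No"),
]

def naive_bayes_scores(query):
    """Return P(C) times the product of P(Ai | C) for each class C."""
    class_counts = Counter(label for _, label in data)
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / len(data)                     # prior P(C)
        for i, value in enumerate(query):
            n_match = sum(1 for x, label in data if label == c and x[i] == value)
            score *= n_match / n_c                  # conditional P(Ai | C)
        scores[c] = score
    return scores

print(naive_bayes_scores(("Sunny", "Cool", "High", "Strong")))  # "No" wins (~0.021 vs ~0.005)
print(naive_bayes_scores(("Rain", "Hot", "High", "Weak")))      # compare the two scores
```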
Exercise: Naïve Bayes Classifier

A: attributes
M: mammals
N: non-mammals

P(A|M)P(M) >
P(A|N)P(N)
=> Mammals
32
Naive Bayesian Classifier
⚫ Advantages
⚫ Easy to implement
⚫ Good results obtained in most of the cases
⚫ Robust to isolated noise points
⚫ Handle missing values by ignoring the instance during probability
estimate calculations (the probability will be zero)
⚫ Robust to irrelevant attributes
⚫ Disadvantages
⚫ Class conditional independence assumption may not hold for some
attributes, therefore loss of accuracy
⚫ Practically dependencies exist among variables
⚫ E.g. hospitals: patients: profile: age, family history, etc. symptoms: fever,
cough etc. Disease: lung cancer, diabetes, etc.
⚫ Dependencies among these cannot be modeled by Naïve Bayesian classifier
⚫ How to deal with these dependencies?
⚫ Bayesian Belief Networks
33
Clustering

34
Paper Review
Journal URL:
http://www.jmlr.org/
Clustering
• Clustering is a data mining (machine learning) technique that
finds similarities between data according to the characteristics
found in the data & groups similar data objects into one cluster

⚫ Given a set of points, with a notion of distance between points, group the
points into some number of clusters, so that members of a cluster are in
some sense as close to each other as possible.
⚫ While data points in the same cluster are similar, those in separate
clusters are dissimilar to one another.
(figure: scattered points grouped into clusters)
36
Example: Clustering
⚫ The example below demonstrates the clustering of padlocks
of the same kind. There are a total of 10 padlocks, which
vary in color, size, shape, etc.

⚫ How many possible clusters of padlocks can be identified?


⚫ There are three different types of padlocks; or
⚫ There are two kinds of colors; or
⚫ There are two sizes of padlocks
⚫ The padlocks of the same kind might be clustered into a group, as
shown below, by type.

37
38
What is Cluster Analysis?
⚫ Cluster: a collection of data objects
⚫ Similar to one another within the same cluster
⚫ Dissimilar to the objects in other clusters
⚫ Cluster analysis
⚫ Finding similarities between data according to the
characteristics found in the data and grouping similar data
objects into clusters
⚫ Unsupervised learning: no predefined classes
⚫ Typical applications
⚫ As a stand-alone tool to get insight into data distribution
⚫ As a preprocessing step for other algorithms like
information retrieval
39
Examples of Clustering Applications
⚫ Marketing: Help marketers discover distinct groups in their customer
bases, and then use this knowledge to develop targeted marketing programs

⚫ Land use: Identification of areas of similar land use in an earth observation
database

⚫ Insurance: Identifying groups of motor insurance policy holders with a
high average claim cost

⚫ City-planning: Identifying groups of houses according to their house type,
value, and geographical location

⚫ Earthquake studies: Observed earthquake epicenters should be
clustered along continental faults
40
Quality: What Is Good Clustering?
⚫ The quality of a clustering result depends on both the similarity
measure used by the method and its implementation
⚫ Key requirement of clustering: need a good measure of similarity
between instances.
⚫ A good clustering method will produce high quality clusters with
⚫ high intra-class similarity (intra-cluster distances are minimized)
⚫ low inter-class similarity (inter-cluster distances are maximized)
41
Overview of clustering

● Feature Selection
● identifying the most effective subset of the original
features to use in clustering
● Feature Extraction
● transformations of the input features to produce new
salient features.
● Inter-Pattern Similarity
● measured by a distance function defined on pairs of
patterns.
● Grouping
● methods to group similar patterns in the same cluster
43
Evaluation
⚫ Intra-cluster cohesion (compactness):
⚫ Cohesion measures how near the data points in a
cluster are to the cluster centroid.
⚫ Sum of squared error (SSE) is a commonly used
measure.
⚫ Inter-cluster separation (isolation):
⚫ Separation means that different cluster centroids
should be far away from one another.
⚫ In most applications, expert judgments are still
the key.
44
Cluster Evaluation: Hard Problem
⚫ The quality of a clustering is very hard to evaluate
because
⚫ We do not know the correct clusters/classes

• Some methods are used:


⚫ Direct evaluation (using either User inspection or Ground
Truth)
⚫ Indirect Evaluation

⚫ User inspection
⚫ Study centroids of the cluster, and spreads of data items in each
cluster
⚫ For a text document clustering problem, one can read some
documents in each cluster to see if they are about similar
topics, and judge the quality of the clustering algorithm employed.
45
Measuring clustering validity
Internal Index:
• Validate without external information
• Compare clusterings with different numbers of clusters
• Solve for the number of clusters

External Index (Reality):
• Validate against ground truth
• Compare two clusterings (how similar they are)
46
Internal Measures: SSE
⚫ Internal Index: Used to measure the goodness of a clustering
structure without reference to external information
⚫ Example: SSE
⚫ SSE is good for comparing two clusterings or two clusters
(average SSE).
⚫ Can also be used to estimate the number of clusters
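In the usual notation, with K clusters $C_i$ and cluster centroids $m_i$, the SSE is

$$SSE = \sum_{i=1}^{K} \sum_{x \in C_i} \operatorname{dist}(x, m_i)^2$$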

47
Estimating the “right” number of clusters
⚫ Typical approach: find a “knee” in an internal measure
curve.

⚫ Desirable property:
⚫ some clustering algorithm does not require the number of
clusters to be specified (e.g., DBSCAN)
⚫ Discussion topic:
⚫ why not the k that minimizes the SSE?
48
Internal Measures: Cohesion and Separation
⚫ Cluster Cohesion: Measures how closely related are
objects within a single cluster
⚫ Cluster Separation: Measure how distinct or
well-separated a cluster is from other clusters
⚫ Example: Squared Error
⚫ Cohesion is measured by the within cluster sum of squares
(SSE)
Ideally this should be small

⚫ Separation is measured by the between-cluster sum of squares
Ideally this should be large
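In symbols, with cluster means $m_i$, cluster sizes $|C_i|$ and overall mean $m$, the usual definitions are

$$WSS = \sum_{i} \sum_{x \in C_i} (x - m_i)^2, \qquad BSS = \sum_{i} |C_i| \, (m - m_i)^2$$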

49
Internal Measures: Cohesion and Separation
⚫ A proximity graph based approach can also be used for
cohesion and separation.
⚫ Cluster cohesion is the sum of the weight of all links within
a cluster.
⚫ Cluster separation is the sum of the weights between
nodes in the cluster and nodes outside the cluster.

(figure: cohesion links within a cluster vs. separation links between clusters)
50
External Measure: Ground Truth
⚫ We use some labeled data (for classification) to check if
the clustering or grouping is acceptable
⚫ Assumption:
⚫ Each class in the labeled classification data is a cluster.
⚫ After clustering, a confusion matrix is constructed.
From the matrix, we compute various measures:
entropy, purity, precision, recall and F-score.
⚫ Let the classes in the data D be C = (c1, c2, …, ck). The clustering
method produces k clusters, which divide D into k disjoint
subsets, D1, D2, …, Dk.
51
Evaluation of Cluster Quality using Purity
⚫ Quality measured by its ability to discover some or all of
the hidden patterns or latent classes in gold standard data
⚫ Assesses a clustering with respect to ground truth …
⚫ This requires labeled data
⚫ Assume documents with C gold standard classes, while
our clustering algorithms produce K clusters, ω1, ω2, …,
ωK with ni members
⚫ Simple measure: Purity, the ratio between the dominant
class in the cluster πi and the size of cluster ωi

⚫ Others are entropy of classes in clusters (or mutual
information between classes and clusters)
52
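The purity of a cluster $\omega_i$ of size $n_i$ with respect to classes $c_j$, and the overall purity over $N$ points, are

$$\text{purity}(\omega_i) = \frac{1}{n_i} \max_j |\omega_i \cap c_j|, \qquad \text{purity} = \frac{1}{N} \sum_i \max_j |\omega_i \cap c_j|$$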
Good and bad clustering (figure: two example clusterings of the same data)

Good clustering:
Purity: (0.94, 0.81, 0.85) – overall 0.86
Precision: (0.94, 0.81, 0.85) – overall 0.86
Recall: (0.9, 0.85, 0.85) – overall 0.87

Bad clustering:
Purity: (0.38, 0.38, 0.38) – overall 0.38
Precision: (0.38, 0.38, 0.38) – overall 0.38
Recall: (0.35, 0.42, 0.38) – overall 0.39
56
Purity example 2

(figure: three clusters of red, blue and green points)
Cluster I      Cluster II      Cluster III

• Assume that we cluster three categories of data items (those
colored red, blue and green) into three clusters, as shown
in the figure above. Calculate purity to measure the quality of
each cluster.
Cluster I: Purity = 1/6 (max(5, 1, 0)) = 5/6 = 83%
Cluster II: Purity = 1/6 (max(1, 4, 1)) = 4/6 = 67%
Cluster III: Purity = 1/5 (max(2, 0, 3)) = 3/5 = 60%
57
Exercise: Cluster Quality Measure
⚫ Assume we have a text collection D of 900 documents from
three topics (or three classes). Science, Sports, and Politics.
Each class has 300 documents. Each document in D is labeled
with one of the topics/classes. We use this collection to
perform clustering to find three clusters, and the result is
shown in the following table:
⚫ Measure the effectiveness of the clustering algorithm using purity
and entropy.
Cluster Science Sports Politics Purity Entropy

1 250 20 10
2 20 180 80
3 30 100 210
Total 300 300 300
58
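A short sketch that works this exercise; the counts are taken from the table above, and the entropy computed here is the per-cluster class entropy in bits:

```python
import math

# cluster -> document counts for (Science, Sports, Politics)
clusters = {1: [250, 20, 10], 2: [20, 180, 80], 3: [30, 100, 210]}
total = sum(sum(counts) for counts in clusters.values())   # 900 documents

dominant_total = 0
for k, counts in clusters.items():
    n = sum(counts)
    purity = max(counts) / n
    entropy = -sum((c / n) * math.log2(c / n) for c in counts if c > 0)
    dominant_total += max(counts)
    print(f"Cluster {k}: purity = {purity:.3f}, entropy = {entropy:.3f}")

print("Overall purity:", round(dominant_total / total, 3))  # (250+180+210)/900 ~= 0.711
```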
A remark about ground truth evaluation
⚫ Commonly used to compare different clustering
algorithms.
⚫ A real-life data set for clustering has no class labels.
⚫ Thus although an algorithm may perform very well on some
labeled data sets, no guarantee that it will perform well on
the actual application data at hand.
⚫ The fact that it performs well on some labeled data sets
does give us some confidence in the quality of the
algorithm.
⚫ This evaluation method is said to be based on
external data or information.
59
Indirect Evaluation
⚫ In some applications, clustering is not the primary task, but
used to help perform another task.
⚫ We can use the performance on the primary task to compare
clustering methods.
⚫ For instance, consider designing a recommender system whose
primary task is to provide recommendations on book
purchasing to online shoppers.
⚫ If we can cluster books according to their features, we might be
able to provide better recommendations.
⚫ We can evaluate different clustering algorithms based
on how well they help with the recommendation task.
⚫ Here, we assume that the recommendation can be reliably
evaluated.
60
Similarity/Dissimilarity Measures
⚫ Each clustering problem is based on some
kind of distance “farness” or “nearness”
measurement between data points.
⚫ Clustering task performs some kind of
comparison:
⚫ Among the members of the same cluster and
⚫ Among members of different clusters
⚫ Thus, distances are normally used to measure
the similarity or dissimilarity between two
data objects
61
How to Measure Similarity/Dissimilarity?
⚫ Popular similarity measure is: Minkowski distance:
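A standard way to write the Minkowski distance is

$$\operatorname{dis}(X, Y) = \left( \sum_{i=1}^{n} |x_i - y_i|^q \right)^{1/q}$$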

where X = (x1, x2, …, xn) and Y = (y1, y2, …, yn) are two
n-dimensional data objects; n is size of vector attributes of the
data object; q= 1,2,3,…

⚫ If q = 1, dis is Manhattan distance

62
Similarity/Dissimilarity Measures

63
Similarity & Dissimilarity Between Objects
⚫ If q = 2, dis is Euclidean distance:

⚫ Cosine Similarity
⚫ If X and Y are two vector attributes of data objects, then
cosine similarity measure is given by:

where ∙ indicates the vector dot product and ||X|| denotes the length (norm) of vector X
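Written out, the two measures are

$$\operatorname{dis}(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}, \qquad \cos(X, Y) = \frac{X \cdot Y}{\lVert X \rVert \, \lVert Y \rVert}$$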

64
Example: Similarity measure
⚫ Ex: Find the similarity between documents 1 and 2.
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
d1∙d2 = 5*3+0*0+3*2+0*0+2*1+0*1+0*0+2*1+0*0+0*1 = 25
||d1||= (5*5+0*0+3*3+0*0+2*2+0*0+0*0+2*2+0*0+0*0)½ =(42)½ = 6.481

||d2||= (3*3+0*0+2*2+0*0+1*1+1*1+0*0+1*1+0*0+1*1)½ = (17)½ = 4.12
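Completing the calculation:

$$\cos(d_1, d_2) = \frac{25}{6.481 \times 4.12} \approx 0.94$$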

65
The need for representative
⚫ Key problem: as you build clusters, more than one
instance might be put into the same cluster. How do
you represent the location of each cluster? The
representative of the cluster will be used to tell which pair
of clusters is closest.
⚫ All the members of the cluster cannot be used as its
representative. Thus it is important to pick a single
representative of each cluster.
⚫ For each cluster, assign a centroid (the point closest to all other
points within the cluster) = the average of its points.
• One can then measure inter-cluster distances by the distances
between these cluster representatives.
66
Major Clustering Approaches
⚫ Partitioning clustering approach:
⚫ Construct various partitions as per the given number of clusters
⚫ Typical methods:
⚫ distance-based: K-means clustering
⚫ model-based: expectation maximization (EM) clustering.

⚫ Hierarchical clustering approach:


⚫ Create a hierarchical decomposition of the set of data (or
objects) using some criterion
⚫ Typical methods:
⚫ agglomerative vs. divisive clustering
⚫ single link vs. complete link vs. average link clustering

67
Major Clustering Approaches (I)
⚫ Partitioning approach:
⚫ Construct various partitions and then evaluate them by
some criterion, e.g., minimizing the sum of square
errors
⚫ Typical methods: k-means, k-medoids, CLARANS
⚫ Hierarchical approach:
⚫ Create a hierarchical decomposition of the set of data
(or objects) using some criterion
⚫ Typical methods: Diana, Agnes, BIRCH, ROCK,
CAMELEON
⚫ Density-based approach:
⚫ Based on connectivity and density functions
⚫ Typical methods: DBSCAN, OPTICS, DenClue
68
Major Clustering Approaches (II)
⚫ Grid-based approach:
⚫ based on a multiple-level granularity structure
⚫ Typical methods: STING, WaveCluster, CLIQUE
⚫ Model-based:
⚫ A model is hypothesized for each of the clusters, and the method tries to find the best
fit of the data to each model
⚫ Typical methods: EM, SOM, COBWEB
⚫ Frequent pattern-based:
⚫ Based on the analysis of frequent patterns
⚫ Typical methods: pCluster
⚫ User-guided or constraint-based:
⚫ Clustering by considering user-specified or application-specific
constraints
⚫ Typical methods: COD (obstacles), constrained clustering

69
Partitioning Algorithms: Basic Concept
⚫ Partitioning method: Construct a partition of a database D
of n objects into a set of k clusters such that the sum of
squared distances is minimized
⚫ Given a k, find a partition of k clusters that optimizes
the chosen partitioning criterion
⚫ Global optimal: exhaustively enumerate all partitions
⚫ Heuristic methods: k-means and k-medoids algorithms
⚫ k-means: Each cluster is represented by the center of the
cluster
⚫ K is the number of clusters to partition the dataset
⚫ Means refers to the average location of members of a particular cluster
⚫ k-medoids or PAM (Partition Around Medoids):
⚫ Each cluster is represented by one of the objects in the
cluster
70
The K-Means Clustering Algorithm
▪ Given k (number of clusters), the k-means algorithm is
implemented as follows:
1. Select K cluster points randomly as initial centroids
2. Compute similarity between each instance and each
cluster
3. Assign each instance to the cluster with the nearest
seed point
4. Re-compute the centroid of each of the K clusters of the
current partition (the centroid is the center, i.e., mean
point, of the cluster)
5. Repeat steps 2-4 until the centroids no longer change (see the sketch below)

71
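A minimal sketch of these steps in plain Python. Note that it uses squared Euclidean distance and random initial centroids, whereas the worked example on the following slides uses Manhattan distance and fixed initial centers; the data below is just a placeholder taken from that example.

```python
import random

def kmeans(points, k, max_iter=100):
    """Basic k-means: returns (centroids, list of clusters)."""
    centroids = random.sample(points, k)                 # step 1: random initial centroids
    for _ in range(max_iter):
        # steps 2-3: assign each point to the nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # step 4: re-compute each centroid as the mean of its cluster
        new_centroids = [
            tuple(sum(vals) / len(c) for vals in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:                   # step 5: centroids stable -> stop
            break
        centroids = new_centroids
    return centroids, clusters

pts = [(2, 10), (2, 5), (8, 4), (5, 8), (7, 5), (6, 4), (1, 2), (4, 9)]
print(kmeans(pts, k=3))
```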
The K-Means Clustering Method
⚫ Example (K = 2): arbitrarily choose K objects as the initial cluster
centers; assign each object to the most similar center; update the
cluster means; reassign objects; and repeat the update/reassign
steps until the means no longer change.
(figure: scatter plots illustrating the assign, update and reassign iterations)

72
Example Problem
⚫ Cluster the following eight points (with (x, y)
representing locations) into three clusters :
A1(2, 10) A2(2, 5) A3(8, 4) A4(5, 8)
A5(7, 5) A6(6, 4) A7(1, 2) A8(4, 9).
⚫ Assume that the randomly selected initial cluster centers are:
A1(2, 10), A3(8, 4) and A7(1, 2).
⚫ The distance function between two points Aj=(x1, y1)
and Ci=(x2, y2) is defined as the Manhattan distance:
dis(Aj, Ci) = |x2 – x1| + |y2 – y1|
⚫ Use k-means algorithm to find optimal centroids to
group the given data into three clusters.
73
Iteration 1
⚫ Starting from point A1(2, 10) calculate the distance to each of
the three means, by using the distance function:
dis (A1, mean1) = |2 – 2| + |10 – 10| = 0 + 0 = 0
dis(A1, mean2) = |8 – 2| + |4 – 10| = 6 + 6 = 12
dis(A1, mean3) = |1 – 2| + |2 – 10| = 1 + 8 = 9
⚫ Fill these values in the table & decide which cluster should the point (2,
10) be placed in? The one, where the point has the shortest distance to the
mean – i.e. mean 1 (cluster 1), since the distance is 0.
⚫ Next go to the second point A2(2, 5) and calculate the distance:
dis(A2, mean1) = |2 – 2| + |10 – 5| = 0 + 5 = 5
dis(A2, mean2) = |8 – 2| + |4 – 5| = 6 + 1 = 7
dis(A2, mean3) = |1 – 2| + |2 – 5| = 1 + 3 = 4
⚫ So, we fill in these values in the table and assign the point (2, 5) to
cluster 3 since mean 3 is the shortest distance from A2.
⚫ Analogically, we fill in the rest of the table, and place each point
in one of the clusters
74
Iteration 1
First we list all points in the first column of the table below. The initial cluster
centers - centroids, are (2, 10), (8,4) and (1, 2) - chosen randomly.
The Table shows the distance of each data points (instances) from the chosen
centroids. The last column shows the cluster the instance should be assigned
to based on its distance from the centroids.

Data Points    Cluster 1 with     Cluster 2 with     Cluster 3 with     Cluster
               centroid (2, 10)   centroid (8, 4)    centroid (1, 2)
A1 (2, 10) 0 12 9 1
A2 (2, 5) 5 7 4 3
A3 (8, 4) 12 0 9 2
A4 (5, 8) 5 7 10 1
A5 (7, 5) 10 2 9 2
A6 (6, 4) 10 2 7 2
A7 (1, 2) 9 9 0 3
A8 (4, 9) 3 9 10 1
75
Iteration 1
⚫ Next, we need to re-compute the new cluster centers. We do so, by
taking the mean of all points in each cluster.
⚫ For Cluster 1, we have three points (A1, A4, A8) and needs to take
average of them as new centroid, i.e.
((2+5+4)/3, (10+8+9)/3) = (3.67, 9) Cluster 1
⚫ For Cluster 2, we have three points (A3, A5, A6). The new centroid
is:
((8+7+6)/3, (4+5+4)/3 ) = (7, 4.33) Cluster 2
⚫ For Cluster 3, we have two points (A2, A7). The new centroid is:
( (2+1)/2, (5+2)/2 ) = (1.5, 3.5) Cluster 3
⚫ Since the centroids changed in Iteration 1, we go to the next iteration
(epoch 2) using the new means we computed.
⚫ The iterations continue until the centroids do not
change anymore.
76
Iteration 2
⚫ After the 2nd epoch the results would be:
cluster 1: {A1,A4,A8} with new centroid=(3.67,9);
cluster 2: {A3,A5,A6} with new centroid = (7,4.33);
cluster 3: {A2,A7} with new centroid=(1.5,3.5)
⚫ Using the new centroid compute cluster members again.
Data Points    Cluster 1 with       Cluster 2 with       Cluster 3 with       Cluster
               centroid (3.67, 9)   centroid (7, 4.33)   centroid (1.5, 3.5)
A1 (2, 10) 1.67+1=2.67 5+5.67=10.67 0.5+6.5=7 1
A2 (2, 5)
A3 (8, 4)
A4 (5, 8)
A5 (7, 5)
A6 (6, 4)
A7 (1, 2)
A8 (4, 9)
77
Final results
⚫ Finally, in the 2nd epoch there is no change in the membership of the
clusters or in the centroids, so the algorithm stops.
⚫ The result of the clustering is shown in the following figure
(figure: the three final clusters, Cluster 1, Cluster 2 and Cluster 3)
78
Comments on the K-Means Method
⚫ Strength: Simple assumptions, and relatively efficient: O(tkn),
where n is the number of objects, k is the number of
clusters, and t is the number of iterations. Normally, k, t << n.
⚫ Weakness
⚫ Applicable only when mean is defined and K, the number
of clusters, specified in advance
⚫ One can use hierarchical clustering to overcome this weakness

⚫ Unable to handle noisy data & outliers, since an object
with an extremely large value may distort the distribution
of the data by changing the centroid substantially.
⚫ K-Medoids: Instead of taking the mean value of the object in a
cluster as a reference point, medoids can be used, which is the
most centrally located object in a cluster.
⚫ How to find the medoid?
79
Hierarchical Clustering
⚫ As compared to partitioning algorithm,
in hierarchical clustering the data are
not partitioned into a particular cluster.
⚫ Instead, a series of partitions takes place,
which may run from cluster of a single
element to a single cluster containing all
objects
• Produces a set of nested clusters
organized as a hierarchical tree (visualized as a dendrogram).
– Hierarchical clustering outputs a hierarchy, a
structure that is more informative than the
unstructured set of clusters returned by partitioning
clustering.
– Can be visualized as a dendrogram: a tree-like
diagram that records the sequences of merges or
splits
80
Example of hierarchical clustering

81
Hierarchical clustering

(figure: the same dendrogram cut at different heights, giving K = 1 up to K = 8 clusters)
Two main types of hierarchical clustering
⚫ Agglomerative: it is a Bottom Up clustering technique
⚫ Start with all sample units in n clusters of size 1.
⚫ Then, at each step of the algorithm, the pair of clusters with the shortest distance
are combined into a single cluster.
⚫ The algorithm stops when all sample units are grouped into one cluster of size n.
⚫ Divisive: it is a Top Down clustering technique
⚫ Start with all sample units in a single cluster of size n.
⚫ Then, at each step of the algorithm, a cluster is partitioned into a pair of clusters,
selected to maximize the distance between them.
⚫ The algorithm stops when sample units are partitioned into n clusters of size 1.

(figure: objects a, b, c, d, e; agglomerative clustering merges them step by step
(a, b -> ab; d, e -> de; c, de -> cde; ab, cde -> abcde over steps 0 to 4), while
divisive clustering runs the same sequence in reverse, from step 4 back to step 0)
83
Agglomerative Clustering Algorithm
⚫ More popular hierarchical clustering
technique
⚫ Basic algorithm is straightforward
1. Let each data point be a separate cluster
2. Repeat
1. Compute the proximity matrix
2. Merge the two closest clusters
3. Update the proximity matrix
3. Until only a single cluster remains

⚫ The key operation is the computation of the
proximity of two clusters

84
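A minimal sketch using SciPy's hierarchical clustering routines (assuming SciPy is available). The points are those of the exercise later in this topic, and "single" can be swapped for "complete" or "average" linkage, which are discussed in the slides that follow.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Each point starts as its own cluster
X = np.array([[4, 4], [8, 4], [15, 8], [24, 4], [24, 12]])

# Repeatedly merge the two closest clusters (single-link, Manhattan distance)
Z = linkage(X, method="single", metric="cityblock")
print(Z)                                        # each row: the two clusters merged + distance

# Cut the hierarchy to obtain, for example, 2 flat clusters
print(fcluster(Z, t=2, criterion="maxclust"))
```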
Example
Problem: clustering analysis with agglomerative
algorithm

(figure: the data matrix and the Euclidean distance matrix computed from it)


120
Example
Merge two closest clusters (iteration 1)
In our case, the distance between D and F is the smallest of
all. Thus put D and F in one cluster and compute their
average as their centroid

121
Example
Update distance matrix (iteration 1)

122
Example
Merge two closest clusters (iteration 2)

123
Example
Update distance matrix (iteration 2)

124
Example
Merge two closest clusters/update distance matrix
(iteration 3)

125
Example
Merge two closest clusters/update distance matrix
(iteration 4)

126
Example
• Final result (meeting termination condition)

127
Example
Dendrogram tree representation
1. In the beginning we have 6 clusters: A, B, C, D, E and F
2. We merge clusters D and F into cluster (D, F) at distance 0.50
3. We merge cluster A and cluster B into (A, B) at distance 0.71
4. We merge clusters E and (D, F) into ((D, F), E) at distance 1.00
5. We merge clusters ((D, F), E) and C into (((D, F), E), C) at distance 1.41
6. We merge clusters (((D, F), E), C) and (A, B) into ((((D, F), E), C), (A, B)) at distance 2.50
7. The last cluster contains all the objects, thus concluding the computation
(figure: the corresponding dendrogram, with the merge distance ("lifetime") on the vertical axis and the objects on the horizontal axis)
128
Linkage in Hierarchical Clustering
■ Distance measures are widely used for clustering, but
how can they be used to measure the distance between a
data item and a cluster, or between two clusters?
■ One method is to treat a data point as a cluster with a
single item, so our only problem is to define a
linkage method between clusters
■ Single Link
■ Complete Link
■ Average Link
Single Linkage Algorithm:

● When an algorithm uses the minimum distance
dmin(Ci, Cj) to measure the distance between
clusters, it is sometimes called a
nearest-neighbor clustering algorithm.
● Moreover, if the clustering process is
terminated when the distance between the nearest
clusters exceeds an arbitrary threshold, it is
called a single-linkage algorithm.
Single Linkage
■ The minimum of all pairwise distances between points
in the two clusters
■ Tends to produce long, “loose” clusters
Complete Linkage Algorithm:
● When an algorithm uses the maximum-distance
dmax(Ci,Cj), to measure the distance between clusters, it
is sometimes called a farthest-neighbor clustering
algorithm.
● If the clustering process is terminated when the
maximum distance between nearest clusters
exceeds an arbitrary threshold, it is called a
complete-linkage algorithm.
● The distance between two clusters is determined by
the most distant nodes in two clusters.
Complete Linkage
■ The maximum of all pairwise distances between points in
the two clusters
■ Tends to produce very tight clusters
Average Linkage Algorithm:
● When an algorithm uses the average distance
davg(Ci, Cj) to measure the distance between clusters,
and the clustering process is terminated when the
average distance between clusters exceeds an
arbitrary threshold, it is called an
average-linkage algorithm.
● The distance between two clusters is determined by
the average of the distance between all nodes in two
clusters.
Average Linkage
■ In Average linkage clustering, the distance between two
clusters is defined as the average of distances between all
pairs of objects, where each pair is made up of one object
from each group.
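In symbols, for clusters $C_i$ and $C_j$ the three linkage criteria are

$$d_{\min}(C_i, C_j) = \min_{p \in C_i,\, p' \in C_j} \lVert p - p' \rVert, \quad d_{\max}(C_i, C_j) = \max_{p \in C_i,\, p' \in C_j} \lVert p - p' \rVert, \quad d_{avg}(C_i, C_j) = \frac{1}{|C_i|\,|C_j|} \sum_{p \in C_i} \sum_{p' \in C_j} \lVert p - p' \rVert$$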
Exercise
⚫ Perform an agglomerative clustering of five samples using two
features X and Y. Calculate the Manhattan distance between each
pair of samples to measure their similarity.

Data item X Y
1 4 4
2 8 4
3 15 8
4 24 4
5 24 12

101
Proximity Matrix: First epoch
1 2 3 4 5

1 = (4,4) - 4 15 20 28
2= (8,4) 4 - 11 16 24
3=(15,8) 15 11 - 13 13
4=(24,4) 20 16 13 - 8
5=(24,12) 28 24 13 8 -

102
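The first-epoch proximity matrix above can be reproduced with a few lines of Python (a sketch; the coordinates are those from the exercise table):

```python
samples = {1: (4, 4), 2: (8, 4), 3: (15, 8), 4: (24, 4), 5: (24, 12)}

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

# Print the pairwise Manhattan distances (the proximity matrix)
for i, p in samples.items():
    print(i, [manhattan(p, q) for q in samples.values()])

# The smallest off-diagonal entry (4, between samples 1 and 2)
# identifies the first pair of clusters to merge.
```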
Exercise: Hierarchical clustering
⚫ Given the following 8 samples having two attributes,
A1=(2,10), A2=(2,5), A3=(8,4), A4=(5,8),
A5=(7,5), A6=(6,4), A7=(1,2), A8=(4,9).

⚫ Construct dendrograms using the agglomerative
clustering algorithm. Use Manhattan distance as the
similarity measure.
⚫ Compare the dendrogram constructed using the
agglomerative clustering algorithm with divisive
clustering.
⚫ Also try single-link and complete-link
agglomerative clustering on the above
data.
103
Remarks on Hierarchical Clustering
⚫ Strength
⚫ Do not have to assume any particular number of
clusters
⚫ Any desired number of clusters can be obtained by
‘cutting’ the dendrogram at the proper level

⚫ They may correspond to meaningful taxonomies


⚫ Example in biological sciences (e.g., animal kingdom,
phylogeny reconstruction, …)

⚫ Weakness
⚫ Do not scale well: time complexity of at least O(n2),
where n is the total number of data objects
⚫ If there are many data points (examples), the algorithm
can become very slow
104
Semi Supervised Learning
Use a small number of labeled examples to help label a large amount of
cheap unlabeled data.
Basic idea: similar examples should be given the same
classification.
Typical example:
web page classification: an unlimited amount of cheap
unlabeled data is available, while labeling is expensive.

133
Why Semi-Supervised
Because people want better performance at a cheaper cost.
unlabeled data is cheap
labeled data can be hard to get and expensive
human annotation is time-consuming, boring and error-prone
labels may require experts
labels may require special devices

How to Semi-Supervise
Use both labeled and unlabeled data to build better
learners than using either one alone.

134
Semi-supervised learning

□ Label propagation
□ Transductive learning
□ Co-training
□ Active learning

135
End of Topic 2

108
