CLUSTERING METHODS: ILLUSTRATIONS
By Dr Francis Musembi
1. Basic Definitions
• Data Mining (DM): extraction of patterns, rules, and trends from
large structured data sets (i.e. data with a particular
organization).
• Text Mining does the same, but on unstructured data (text).
• Clustering: a DM technique used to group data items with
similar content into clusters.
– Produces groups such that elements within a group are more
similar to each other than to elements of other groups.
2. Motivations of Clustering
• Business: finding clusters of similar customers
• Clustering the results of a web search engine
• Clustering in pattern recognition
• Image processing
• Clustering in different MIS applications
• Medicine
• Etc.
Clustering makes data:
• more understandable;
• easier and faster to search;
• more efficient to process.
3. The Vector Space Model - VSM
Matrix representation
• Instances/documents are represented as a matrix.
• A row represents an instance/document.
• A column represents an attribute.
(Rehurek 2011)
VSM equivalent
• Documents are represented as points in space.
• A dimension represents an attribute (e.g. x, y).
• A point represents an instance/document
(e.g. 3 instances: (20,50), (50,85), (70,90)).
Example (structured data)

Age   Pressure
20    50
50    85
70    90

Matrix representation: the 3x2 matrix whose rows are
(20, 50), (50, 85), (70, 90).
The VSM: a space of dimension 2 with data points
(20, 50), (50, 85), (70, 90).
Example (text data)
3 documents, with identifiable key terms that form the basis for
distinguishing the documents:
D1: Eating fruit improves health.
D2: Give your infant fruit regularly, for the infant to have good health.
D3: Regular exercise improves your health.
Term dictionary: T1: fruit, T2: health, T3: infant, T4: exercise.
Term-document matrix A (rows = terms T1-T4, columns = documents D1-D3):

      D1  D2  D3
T1     1   1   0
T2     1   1   1
T3     0   1   0
T4     0   0   1

The VSM: a space of dimension 4 with points (1,1,0,0), (1,1,1,0),
(0,1,0,1) (the columns of A, one point per document).
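A minimal sketch that rebuilds this matrix (scikit-learn assumed; any term-counting code would do). It produces the document-term matrix, i.e. the transpose of A, whose rows are exactly the three VSM points:

```python
# Build a binary document-term matrix over a fixed term dictionary.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Eating fruit improves health.",
        "Give your infant fruit regularly, for the infant to have good health.",
        "Regular exercise improves your health."]
vec = CountVectorizer(vocabulary=["fruit", "health", "infant", "exercise"],
                      binary=True)   # 1 if the term occurs, else 0
print(vec.fit_transform(docs).toarray())
# [[1 1 0 0]
#  [1 1 1 0]
#  [0 1 0 1]]
```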
Clustering VSM Points
• Most clustering algorithms cluster text documents by converting
the documents into the VSM, then clustering the points.
• However, the approaches differ in exactly how similar points
(documents) are identified.
• Various approaches: centroid-based, probability-based,
density-based, grid-based.
Ning (2005, p. 19), Aggrawal (2013, p. 1), etc.
• Key objectives that should be met: accuracy, efficiency,
scalability, robustness. Neethu & Surendran (2013),
Chandra & Anuradha (2011), Rai (2010), Shah (2012, p. 34), etc.
Euclidean Distance

dist(p, q) = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2}

where n is the number of dimensions (attributes), and p_k and q_k
are, respectively, the kth attributes (components) of data objects
p and q.
Example
The distance between the first 2 points in the above example
(D1 and D2) is ((1-1)^2 + (1-1)^2 + (1-0)^2 + (0-0)^2)^{1/2} = 1.
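The same computation as a short sketch (NumPy assumed), using the D1 and D2 vectors from the text example above:

```python
import numpy as np

p = np.array([1, 1, 0, 0])   # D1
q = np.array([1, 1, 1, 0])   # D2
dist = np.sqrt(np.sum((p - q) ** 2))   # Euclidean distance
print(dist)   # 1.0
```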
4. Centroid-based Approach
• Measure the Euclidean distance between documents and
their cluster centers to determine clusters.
• E.g. KMeans (default parameter value: K = no. of clusters = 2)
1. Choose the number of clusters, k.
2. Randomly determine k cluster centers (centroids).
3. Repeat the following until no object moves (i.e. no object
changes clusters):
(i) Determine the distance of each object to all centroids.
(ii) Assign each point to the nearest centroid.
(iii) Re-compute the new cluster centers.
(A minimal sketch of these steps follows below.)
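A minimal from-scratch sketch of the steps above (NumPy assumed; real implementations such as scikit-learn's KMeans add smarter initialization and handle empty clusters, which this sketch does not):

```python
import numpy as np

def kmeans(X, k=2, seed=0, max_iter=100):
    rng = np.random.default_rng(seed)
    # Step 2: pick k data points at random as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3(i): distance of every point to every centroid.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # Step 3(ii): assign each point to its nearest centroid.
        labels = d.argmin(axis=1)
        # Step 3(iii): re-compute each center as the mean of its cluster.
        new_centroids = np.array([X[labels == j].mean(axis=0)
                                  for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break   # no centroid moved, so no object changes clusters
        centroids = new_centroids
    return labels, centroids
```

E.g. kmeans(np.array([[20, 50], [50, 85], [70, 90]], float), k=2) groups (50,85) and (70,90) together regardless of the random initialization, since they are by far the closest pair.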
Ex: Possible first 3 loops of KMeans (on 2D data), with K = 2.
(Step-by-step scatter-plot illustrations omitted; k0 and k1 are the
centroids, c0 and c1 the resulting clusters.)
1.1: Select 2 centroids randomly.
1.2: Assign points to the nearest centroids (thus 2 clusters).
2.1: Compute new centroids (they don't have to be data points).
2.2: Assign each point to the nearest centroid.
3.1: Compute new centroids.
3.2: Assign each point to the nearest centroid.
Repeat the loop until no object changes its cluster.
KMeans Advantages
• Simple (straightforward logic).
• Efficient.
• Scalable.
KMeans Limitations
• Parameter dependence
– Forces the data into a particular number of clusters.
– E.g. the following data will be forced into 2 clusters (the
default k value), however many natural groups it has.
(Scatter plot omitted: x and y ranging over 0-30.)
– The user can't know the number of expected clusters in
unknown data. (A short illustration follows below.)
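A hypothetical illustration of this parameter dependence (scikit-learn assumed; the data are made up): 3 well-separated groups, but k is fixed at 2, so KMeans must merge two of the natural groups into one cluster.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
groups = [rng.normal(loc=c, scale=0.8, size=(20, 2))
          for c in [(5, 5), (25, 5), (15, 25)]]      # 3 natural clusters
X = np.vstack(groups)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(np.unique(labels))   # [0 1]: only 2 labels for 3 natural groups
```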
KMeans Limitations (cont.)
• Can't cluster outliers: every point is forced to be in a cluster,
so an outlier is absorbed into its nearest cluster.
(Illustration omitted.)
• Poor at clustering irregularly shaped data.
E.g. for irregularly shaped data (2 natural groups/classes), KMeans
may produce wrong clusters that cut across the true groups.
(Illustrations omitted.)
5. Density-based Approach
• Find clusters by differentiating regions in terms of the relative
density (compactness/concentration/number of objects per unit area)
of the points in them. Thus, a cluster is a dense region of points,
and the regions adjacent to a cluster contain points of lower
density.
• E.g. DBSCAN
Finds all neighbor (VSM) points within distance (radius) Eps of a
starting point p, and either forms a cluster (if the number of
neighbors is at least MinPts) or considers p noise (an outlier),
before moving on to the neighbors and repeating.
Ex: Possible first few steps of DBSCAN, with Eps = 1cm and
MinPts = 5. (Step-by-step illustrations omitted: the Eps-neighborhood
of p is scanned, a cluster forms (drawn as a black circle), and the
cluster then grows as each neighbor is scanned in turn.)
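A compact from-scratch sketch of this procedure (NumPy assumed; it precomputes all pairwise distances, so it is O(n^2) and for illustration only). A point's Eps-neighborhood is taken to include the point itself.

```python
import numpy as np

def dbscan(X, eps=1.0, min_pts=5):
    """Label each VSM point with a cluster id; -1 marks noise (outliers)."""
    n = len(X)
    labels = np.full(n, -1)                 # -1 = noise until proven otherwise
    visited = np.zeros(n, dtype=bool)
    # Pairwise Euclidean distances; neighbors[i] = points within Eps of i.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = [np.flatnonzero(d[i] <= eps) for i in range(n)]
    cluster = 0
    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        if len(neighbors[i]) < min_pts:     # too few neighbors:
            continue                        # leave i as noise (for now)
        labels[i] = cluster                 # i is a core point: new cluster
        seeds = list(neighbors[i])
        while seeds:                        # grow the cluster outward
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster         # neighbor joins the cluster
            if not visited[j]:
                visited[j] = True
                if len(neighbors[j]) >= min_pts:
                    seeds.extend(neighbors[j])  # j is also core: expand on
        cluster += 1
    return labels
```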
DBSCAN Strengths
• Handles outliers well: an outlier is detected as noise rather than
forced into a cluster, while sensible clusters are still formed
(provided appropriate Eps and MinPts).
• Clusters irregularly shaped data well (due to its scanning manner),
as in the illustration above.
DBSCAN Weaknesses
• Parameter dependence (Eps, MinPts)
Groups/classes whose points are much sparser than Eps allows won't
be detected. E.g. on the same data:
– Eps=1, MinPts=6: clusters are formed.
– Eps=1, MinPts>=8: no clusters are formed.
– Eps=0.5, MinPts=6: no clusters are formed.
– Very large Eps (e.g. Eps=20), MinPts=6: the 2 distinct groups are
not identifiable (they merge into one cluster).
(Illustrations omitted.)
6. Other Algorithms
• Hierarchical approach
Agglomerative (bottom-up) type: start with each VSM point as a
cluster, then recursively merge the 2 nearest clusters into one
cluster. This repeats until one cluster (of all points) is left,
forming a tree hierarchy. E.g. HierarchicalClusterer (agglomerative).
Divisive (top-down) type: start with all the points as one cluster,
then recursively divide each cluster into two (such that each keeps
its nearest points), until each cluster contains only one point.
(Illustrations omitted: possible first 3 steps of divisive
clustering.)
NB: The Euclidean formula measures which points (clusters) are
nearest. (A sketch of agglomerative clustering follows below.)
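A minimal sketch of the agglomerative type on the earlier 2-D example, using scikit-learn's AgglomerativeClustering as a stand-in for the slide's HierarchicalClusterer (an assumption; label values are arbitrary):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[20, 50], [50, 85], [70, 90]], dtype=float)
# Single linkage: repeatedly merge the 2 nearest clusters, here
# stopping once 2 clusters remain instead of building the full tree.
agg = AgglomerativeClustering(n_clusters=2, linkage="single")
print(agg.fit_predict(X))   # (50,85) and (70,90) end up together
```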
Other Algorithms
• Probability-based: find the probability with which a data point
belongs to a cluster.
The data are regarded as drawn from certain probability
distributions, and the area around the mean of a distribution
constitutes a natural cluster.
E.g. EM (Expectation-Maximization).
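A minimal sketch of this idea (scikit-learn assumed; the two Gaussian groups are made up for illustration): EM fits a mixture of Gaussians, and each point then has a probability of belonging to each cluster.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal((5, 5), 1.0, size=(30, 2)),
               rng.normal((20, 18), 1.0, size=(30, 2))])

gm = GaussianMixture(n_components=2, random_state=0).fit(X)  # fitted by EM
print(gm.predict(X[:3]))         # hard assignments (most likely cluster)
print(gm.predict_proba(X[:3]))   # soft: probability per cluster
```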
• Grid-based: quantize the VSM into cells forming a grid structure;
contiguous dense cells form a cluster. E.g. CLIQUE, STING.
E.g. using cell density = 5, a block of contiguous dense cells forms
a cluster, while scattered points do not. (Illustrations omitted.)
• Strength: very efficient.
• Limitations: parameter dependence (cell size, cell density); poor
on irregular shapes; cell-adjacency problem (no diagonal adjacency).
(A sketch follows below.)
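A minimal sketch of the grid-based idea (NumPy and SciPy assumed; CLIQUE and STING are considerably more sophisticated): points are quantized into cells, sparse cells are discarded, and contiguous dense cells are joined into clusters. SciPy's label joins edge-adjacent cells only, which reproduces the "no diagonal" adjacency limitation noted above.

```python
import numpy as np
from scipy.ndimage import label

def grid_cluster(X, cell_size=1.0, min_density=5):
    ij = np.floor(X / cell_size).astype(int)   # cell index of each point
    ij -= ij.min(axis=0)                       # shift to non-negative indices
    counts = np.zeros(ij.max(axis=0) + 1, dtype=int)
    np.add.at(counts, tuple(ij.T), 1)          # number of points per cell
    dense = counts >= min_density              # keep only dense cells
    # label() connects edge-adjacent cells only (no diagonal adjacency),
    # so each connected block of dense cells becomes one cluster.
    cell_cluster, _ = label(dense)
    return cell_cluster[tuple(ij.T)] - 1       # -1 = point in a sparse cell
```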
EXERCISES
Exercise 1: Identify classes (groups).

Index  Data
0      7, 6
1      5, 22
2      3, 19
3      10, 0
4      2, 17
5      22, 5
6      17, 7
7      4, 4
8      21, 6
9      20, 2
10     2, 20

(Scatter plot omitted.)
Exercise 1: Possible classes

Index  Data    Class
0      7, 6    0
1      5, 22   2
2      3, 19   2
3      10, 0   0
4      2, 17   2
5      22, 5   1
6      17, 7   1
7      4, 4    0
8      21, 6   1
9      20, 2   1
10     2, 20   2

(Scatter plot omitted: class 0 lies near the origin, class 1 to the
lower right, class 2 to the upper left.)
Exercise 1: The data in arff format
% relation size (11, 3, 3)
@relation self0
@attribute x real
@attribute y real
@attribute class {0,1,2}
@data
7,6,0
5,22,2
3,19,2
10,0,0
2,17,2
22,5,1
17,7,1
4,4,0
21,6,1
20,2,1
2,20,2
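A possible check of this exercise (scikit-learn assumed). Cluster numbering is arbitrary, so the labels may be a permutation of the classes listed above.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[7, 6], [5, 22], [3, 19], [10, 0], [2, 17], [22, 5],
              [17, 7], [4, 4], [21, 6], [20, 2], [2, 20]], dtype=float)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(labels)   # the 3 groups above should emerge, up to relabeling
```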
Exercise 2: Identify classes (groups) and outliers.

Index  Data
0      3, 3
1      4, 4
2      5, 4
3      15, 20
4      16, 17
5      17, 16
6      18, 18
7      4, 5
8      20, 3
9      2, 18
10     20, 18
11     5, 1

(Scatter plot omitted.)
Identified: classes 0 and 1; outlier class: -1

Index  Data    Class
0      3, 3    0
1      4, 4    0
2      5, 4    0
3      15, 20  1
4      16, 17  1
5      17, 16  1
6      18, 18  1
7      4, 5    0
8      20, 3   -1
9      2, 18   -1
10     20, 18  1
11     5, 1    0

(Scatter plot omitted: class 0 lies near the origin, class 1 to the
upper right, and the two outliers apart from both.)
Data in arff format
% relation0, -1 means outlier, size (12,3,2)
@relation self0
@attribute x real
@attribute y real
@attribute class {0,1,-1}
@data
3,3,0
4,4,0
5,4,0
15,20,1
16,17,1
17,16,1
18,18,1
4,5,0
20,3,-1
2,18,-1
20,18,1
5,1,0
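A possible check of this exercise (scikit-learn assumed; Eps and MinPts here are chosen by eye for this data). Note that sklearn's DBSCAN also labels noise as -1, matching the outlier convention used above.

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[3, 3], [4, 4], [5, 4], [15, 20], [16, 17], [17, 16],
              [18, 18], [4, 5], [20, 3], [2, 18], [20, 18], [5, 1]],
             dtype=float)
labels = DBSCAN(eps=3.5, min_samples=3).fit_predict(X)
print(labels)   # points 8 and 9 should come out as -1 (outliers)
```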
References
• Agrawal, R, Imielinski, T & Swami, A 1993, 'Database Mining: A Performance Perspective', IEEE Transactions on Knowledge and Data Engineering, vol. 5, no. 6, pp. 914-925.
• Geraci, F 2008, Fast Clustering for Web Information Retrieval, PhD Thesis, Università degli Studi di Siena.
• Hao, Z 2012, 'A New Text Clustering Method Based on KGA', Journal of Software, vol. 7, no. 5, pp. 1-5.
• Lasek, P 2011, Efficient Density-Based Clustering, PhD Thesis, Warsaw University of Technology.
• Li, Y 2007, High Performance Text Document Clustering, PhD Thesis, Wright State University.
• Liu, J 2006, New Approaches for Clustering High Dimensional Data, PhD Thesis, University of North Carolina.
• Ng, R & Han, J 2002, 'CLARANS: A Method for Clustering Objects for Spatial Data Mining', IEEE Transactions on Knowledge and Data Engineering, vol. 14, no. 5.
• Ning, W 2005, Textmining and Organization in Large Corpus, MSc Thesis, Technical University of Denmark (DTU).
• Punitha, S & Punithavalli, M 2012, 'A Comparative Study to Find a Suitable Method for Text Document Clustering', IJCSNS International Journal of Computer Science and Network Security, vol. 12, no. 10.
• Rai, P 2010, 'A Survey of Clustering Techniques', International Journal of Computer Applications, vol. 7, no. 12.
• Rama, B, Jayashree, P & Jiwani, S 2010, 'A Survey on Clustering: Current Status and Challenging Issues', International Journal on Computer Science and Engineering, vol. 2, no. 9.
• Rehurek, R 2011, Scalability of Semantic Analysis in Natural Language Processing, PhD Thesis, Masaryk University.
• Rosell, M 2009, Clustering Exploration: Swedish Text Representation and Clustering Results Unraveled, PhD Thesis, Stockholm, Sweden.
• Velmurugan, T & Santhanam, T 2011, 'A Comparative Analysis between K-Medoids and Fuzzy C-Means Clustering Algorithms for Statistically Distributed Data Points', Journal of Theoretical and Applied Information Technology, vol. 27, no. 1.