Data Clustering with R
Yanchang Zhao
http://www.RDataMining.com
July 2019
Contents
Introduction
Data Clustering with R
The Iris Dataset
Partitioning Clustering
The k-Means Clustering
The k-Medoids Clustering
Hierarchical Clustering
Density-Based Clustering
Cluster Validation
Exercises
What is Data Clustering?
▶ Data clustering is the process of grouping objects into clusters, so that objects within a cluster are similar to one another and dissimilar to objects in other clusters.
▶ It is unsupervised learning: no predefined class labels are used.
Data Clustering with R †
▶ Partitioning Methods
  ▶ k-means clustering: stats::kmeans() ∗ and fpc::kmeansruns()
  ▶ k-medoids clustering: cluster::pam() and fpc::pamk()
▶ Hierarchical Methods
  ▶ Divisive hierarchical clustering: DIANA, cluster::diana()
  ▶ Agglomerative hierarchical clustering: cluster::agnes(), stats::hclust()
▶ Density-Based Methods
  ▶ DBSCAN: fpc::dbscan()
▶ Cluster Validation
  ▶ Packages clValid, cclust and NbClust
∗ package name::function name()
† Chapter 6 - Clustering, in R and Data Mining: Examples and Case Studies.
http://www.rdatamining.com/docs/RDataMining-book.pdf
The Iris Dataset - I ‡
▶ The iris dataset consists of 50 samples from each of three species of iris flowers: setosa, versicolor and virginica.
▶ Four features were measured for each sample: the length and the width of the sepals and petals, in centimetres.
‡ https://archive.ics.uci.edu/ml/datasets/Iris
The Iris Dataset - II
Below we have a look at the structure of the dataset with str().
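The call and its standard output are sketched below (values are those of the stock iris data shipped with R; output abbreviated to the first ten values of each variable).
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...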
The Iris Dataset - III
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
Partitioning clustering - I
▶ Partitioning the data into k groups first, and then trying to improve the quality of the clustering by moving objects from one group to another
▶ k-means [Alsabti et al., 1998, Macqueen, 1967]: randomly selects k objects as cluster centers, assigns the other objects to their nearest centers, and then improves the clustering by iteratively updating the centers and reassigning the objects to the new centers
▶ k-medoids: a variation of k-means in which the medoid (i.e., the object closest to the center of a cluster), instead of the centroid, is used to represent a cluster; the related k-modes algorithm [Huang, 1998] extends the idea to categorical data
▶ PAM and CLARA [Kaufman and Rousseeuw, 1990]
▶ CLARANS [Ng and Han, 1994]
Partitioning clustering - II
k-Means Algorithm
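The k-means algorithm alternates between assigning every object to its nearest cluster center and recomputing each center as the mean of its cluster, until the centers stabilise. A minimal sketch in R (for illustration only; the function name my.kmeans is made up, there is no empty-cluster handling, and stats::kmeans() should be used in practice):

my.kmeans <- function(x, k, max.iter = 100) {
  x <- as.matrix(x)
  # step 1: randomly select k objects as the initial cluster centers
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  for (iter in 1:max.iter) {
    # step 2: assign every object to its nearest center (Euclidean distance)
    d <- as.matrix(dist(rbind(centers, x)))[-(1:k), 1:k]
    cluster <- max.col(-d)
    # step 3: update each center to the mean of the objects assigned to it
    new.centers <- apply(x, 2, function(col) tapply(col, cluster, mean))
    # step 4: stop when the centers no longer move
    if (all(abs(new.centers - centers) < 1e-9)) break
    centers <- new.centers
  }
  list(cluster = cluster, centers = centers)
}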
k-Means Algorithm - Criterion Function
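Assuming the standard formulation, k-means minimises the within-cluster sum of squares over clusters $C_1, \dots, C_k$:

$$ J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2 $$

where $\mu_j$ is the mean (centroid) of cluster $C_j$. Each of the two steps of the algorithm (reassignment and center update) can only decrease $J$, so the algorithm converges to a local minimum.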
k-means clustering
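The output below is produced by a call of the following shape (a sketch; the random seed used for the slides is not recoverable, so cluster numbering and the exact assignments may differ between runs):

# remove the class label, keeping the four numeric variables
iris2 <- iris
iris2$Species <- NULL
# k-means clustering with 3 clusters
kmeans.result <- kmeans(iris2, 3)
kmeans.result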
## K-means clustering with 3 clusters of sizes 38, 50, 62
##
## Cluster means:
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 6.850000 3.073684 5.742105 2.071053
## 2 5.006000 3.428000 1.462000 0.246000
## 3 5.901613 2.748387 4.393548 1.433871
##
## Clustering vector:
## [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2...
## [31] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 1 3 3 3 3...
## [61] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3...
## [91] 3 3 3 3 3 3 3 3 3 3 1 3 1 1 1 1 3 1 1 1 1 1 1 3 3 1 1...
## [121] 1 3 1 3 1 1 3 3 1 1 1 1 1 3 1 1 1 1 3 1 1 1 3 1 1 1 3...
##
## Within cluster sum of squares by cluster:
## [1] 23.87947 15.15100 39.82097
## (between_SS / total_SS = 88.4 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"...
## [5] "tot.withinss" "betweenss" "size" "iter" ...
## [9] "ifault" 14 / 62
Results of k-Means Clustering
table(iris$Species, kmeans.result$cluster)
##
## 1 2 3
## setosa 0 50 0
## versicolor 2 0 48
## virginica 36 0 14
All 50 “setosa” cases form cluster 2 on their own, while “versicolor” and “virginica” are partly mixed between clusters 1 and 3.
plot(iris2[, c("Sepal.Length", "Sepal.Width")],
col = kmeans.result$cluster)
points(kmeans.result$centers[, c("Sepal.Length", "Sepal.Width")],
col = 1:3, pch = 8, cex=2) # plot cluster centers
[Figure: scatter plot of Sepal.Width against Sepal.Length, with points coloured by cluster and the cluster centers marked by asterisks]
k-means clustering with estimation of k and multiple initialisations
▶ Function kmeansruns() in package fpc [Hennig, 2014] calls kmeans() many times with random initialisations, and estimates the number of clusters with the average silhouette width or the Calinski-Harabasz index.
library(fpc)
kmeansruns.result <- kmeansruns(iris2)
kmeansruns.result
## K-means clustering with 3 clusters of sizes 62, 50, 38
##
## Cluster means:
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 5.901613 2.748387 4.393548 1.433871
## 2 5.006000 3.428000 1.462000 0.246000
## 3 6.850000 3.073684 5.742105 2.071053
##
## Clustering vector:
## [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2...
## [31] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 3 1 1 1 1...
## [61] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1...
## [91] 1 1 1 1 1 1 1 1 1 1 3 1 3 3 3 3 1 3 3 3 3 3 3 1 1 3 3...
## [121] 3 1 3 1 3 3 1 1 3 3 3 3 3 1 3 3 3 3 1 3 3 3 1 3 3 3 1...
##
## Within cluster sum of squares by cluster:
## [1] 39.82097 15.15100 23.87947
## (between_SS / total_SS = 88.4 %)
##
## Available components:
The k-Medoids Clustering
▶ Difference from k-means: in the k-means algorithm a cluster is represented by its center, while in k-medoids clustering it is represented by the object closest to the center of the cluster (the medoid).
▶ More robust than k-means in the presence of outliers.
▶ PAM (Partitioning Around Medoids) is a classic algorithm for k-medoids clustering.
▶ CLARA enhances PAM by drawing multiple samples of the data, applying PAM to each sample and then returning the best clustering; it performs better than PAM on larger data.
▶ Functions pam() and clara() are provided in package cluster [Maechler et al., 2016].
▶ Function pamk() in package fpc does not require the user to choose k.
Clustering with pam()
## clustering with PAM
library(cluster)
# group into 3 clusters
pam.result <- pam(iris2, 3)
# check against actual class label
table(pam.result$clustering, iris$Species)
##
## setosa versicolor virginica
## 1 50 0 0
## 2 0 48 14
## 3 0 2 36
Three clusters:
▶ Cluster 1 is species “setosa” and is well separated from the other two.
▶ Cluster 2 is mainly composed of “versicolor”, plus some cases from “virginica”.
▶ The majority of cluster 3 are “virginica”, with two cases from “versicolor”.
plot(pam.result)
[Figure: the left panel is a clusplot and the right panel the silhouette plot of pam.result; silhouette widths are 0.80 for cluster 1 (50 cases), 0.42 for cluster 2 (62 cases) and 0.45 for cluster 3 (38 cases)]
▶ The left chart is a two-dimensional “clusplot” (clustering plot) of the three clusters, and the lines show the distances between clusters.
▶ The right chart shows their silhouettes. A large s_i (close to 1) suggests that the corresponding observations are very well clustered, a small s_i (around 0) means that the observation lies between two clusters, and observations with a negative s_i are probably placed in the wrong cluster.
▶ The silhouette width of cluster 1 is 0.80, which means it is well clustered and separated from the other clusters. The other two have relatively low silhouette widths (0.42 and 0.45), and they overlap somewhat with each other.
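For reference, the silhouette of observation $i$ is defined (standard formulation) as

$$ s_i = \frac{b_i - a_i}{\max(a_i, b_i)} $$

where $a_i$ is the average dissimilarity between $i$ and all other observations of its own cluster, and $b_i$ is the smallest average dissimilarity between $i$ and the observations of any other single cluster.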
Clustering with pamk()
library(fpc)
pamk.result <- pamk(iris2)
# number of clusters
pamk.result$nc
## [1] 2
Two clusters:
▶ “setosa”
▶ a mixture of “versicolor” and “virginica”
plot(pamk.result)
[Figure: clusplot and silhouette plot of pamk.result; silhouette widths are 0.81 for cluster 1 (51 cases) and 0.62 for cluster 2 (99 cases)]
Results of Clustering
▶ pamk() returns two clusters: one of “setosa” and one merging “versicolor” and “virginica”, whereas pam() with k = 3 recovers the three species reasonably well.
▶ Which result is better depends on the target problem; in a real clustering task the class labels would not be available to guide the choice of k.
Hierarchical Clustering - I
▶ Hierarchical clustering produces a hierarchy of clusters, represented as a tree (dendrogram), rather than a single partitioning of the data.
▶ The number of clusters k is not required in advance; a clustering is obtained by cutting the dendrogram at a chosen level.
Hierarchical Clustering - II
Hierarchical Clustering Algorithms
▶ Agglomerative (bottom-up): start with each observation as its own cluster and repeatedly merge the two closest clusters, e.g., AGNES and hclust()
▶ Divisive (top-down): start with a single cluster containing all observations and repeatedly split a cluster into two, e.g., DIANA
Hierarchical Clustering - Distance Between Clusters
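Standard definitions of the distance between two clusters $C_1$ and $C_2$, given a dissimilarity $d(x, y)$ between observations (these correspond to the single, complete and average linkage methods of hclust()):

$$ d_{\min}(C_1, C_2) = \min_{x \in C_1,\, y \in C_2} d(x, y) \quad \text{(single linkage)} $$
$$ d_{\max}(C_1, C_2) = \max_{x \in C_1,\, y \in C_2} d(x, y) \quad \text{(complete linkage)} $$
$$ d_{\mathrm{avg}}(C_1, C_2) = \frac{1}{|C_1|\,|C_2|} \sum_{x \in C_1} \sum_{y \in C_2} d(x, y) \quad \text{(average linkage)} $$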
Hierarchical Clustering of the iris Data
## hierarchical clustering
set.seed(2835)
# draw a sample of 40 records from the iris data, so that the
# clustering plot will not be overcrowded
idx <- sample(1:dim(iris)[1], 40)
iris3 <- iris[idx, ]
# remove class label
iris3$Species <- NULL
# hierarchical clustering
hc <- hclust(dist(iris3), method = "ave")
# plot clusters
plot(hc, hang = -1, labels = iris$Species[idx])
# cut tree into 3 clusters
rect.hclust(hc, k = 3)
# get cluster IDs
groups <- cutree(hc, k = 3)
[Figure: “Cluster Dendrogram” of the 40 sampled records, built with hclust (*, "average") on dist(iris3), with leaves labelled by species and three rectangles marking the clusters; “setosa” forms a clean branch, while a few “virginica” cases fall into the mostly-“versicolor” branch]
Agglomeration Methods of hclust
▶ The method argument of hclust() determines how the distance between clusters is measured when merging: "ward.D", "ward.D2", "single", "complete", "average" (UPGMA), "mcquitty" (WPGMA), "median" (WPGMC) or "centroid" (UPGMC).
▶ The method = "ave" used above is an abbreviation of "average".
DIANA
▶ DIANA [Kaufman and Rousseeuw, 1990]: divisive hierarchical clustering
▶ Constructs a hierarchy of clusterings, starting with one large cluster containing all observations.
▶ Divides clusters until each cluster contains only a single observation.
▶ At each stage, the cluster with the largest diameter is selected. (The diameter of a cluster is the largest dissimilarity between any two of its observations.)
▶ To divide the selected cluster, the algorithm first looks for its most disparate observation (i.e., the one with the largest average dissimilarity to the other observations in the selected cluster). This observation initiates the “splinter group”. In subsequent steps, the algorithm reassigns observations that are closer to the “splinter group” than to the “old party”. The result is a division of the selected cluster into two new clusters.
DIANA
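The dendrogram below is produced by a call of this shape (a sketch inferred from the plot labels “diana(x = iris3)” and the divisive coefficient shown):

library(cluster)
# divisive hierarchical clustering of the 40 sampled records
dv <- diana(iris3)
# plot the dendrogram (plot 1 would be the banner plot)
plot(dv, which.plots = 2, labels = iris$Species[idx])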
[Figure: “Dendrogram of diana(x = iris3)”, Divisive Coefficient = 0.93, with leaves labelled by species; “setosa” splits off first, while “versicolor” and “virginica” are partly mixed in the remaining branches]
Density-Based Clustering
▶ Groups objects into one cluster if they are connected to one another by a densely populated area
▶ Can discover clusters of arbitrary shape, and objects in sparse areas are treated as noise (outliers)
DBSCAN [Ester et al., 1996]
▶ Two key parameters: eps, the radius defining the neighbourhood of an object, and MinPts, the minimum number of points required within that radius
▶ An object with at least MinPts points in its eps-neighbourhood is a core (seed) point; an object within eps of a core point that is not itself a core point is a border point; all remaining objects are noise
▶ A cluster is grown by connecting core points that lie within eps of one another, together with their border points
Density-based Clustering of the iris data
## Density-based Clustering
library(fpc)
iris2 <- iris[-5] # remove class tags
ds <- dbscan(iris2, eps = 0.42, MinPts = 5)
ds
## dbscan Pts=150 MinPts=5 eps=0.42
## 0 1 2 3
## border 29 6 10 12
## seed 0 42 27 24
## total 29 48 37 36
Density-based Clustering of the iris data
▶ Clusters 1 to 3: the three identified clusters
▶ Cluster 0: noise or outliers, i.e., objects that are not assigned to any cluster
▶ In the table above, “seed” counts the core points and “border” the border points of each cluster
plot(ds, iris2)
[Figure: pairs plot of the four variables of iris2, with points coloured and shaped by the clusters found by DBSCAN]
plot(ds, iris2[, c(1, 4)])
[Figure: scatter plot of Petal.Width against Sepal.Length, with points coloured and shaped by cluster]
plotcluster(iris2, ds$cluster)
[Figure: projection of iris2 onto the first two discriminant coordinates (dc 1, dc 2), with each point printed as its cluster ID; noise points are printed as 0]
Prediction with Clustering Model
▶ Label new data based on their similarity to the clusters
▶ Draw a sample of 10 objects from iris and add small noise to them, to make a new dataset for labelling
▶ The random noise is generated from a uniform distribution with function runif().
## cluster prediction
# create a new dataset for labeling
set.seed(435)
idx <- sample(1:nrow(iris), 10)
# remove class labels
new.data <- iris[idx,-5]
# add random noise
new.data <- new.data + matrix(runif(10*4, min=0, max=0.2),
nrow=10, ncol=4)
# label new data
pred <- predict(ds, iris2, new.data)
Results of Prediction
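The predicted labels can be checked against the species of the sampled records with a cross-tabulation (a sketch; the resulting table depends on the noise added and is not reproduced here):

# compare predicted cluster labels with the actual species
table(pred, iris$Species[idx])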
plot(iris2[, c(1, 4)], col = 1 + ds$cluster)
points(new.data[, c(1, 4)], pch = "+", col = 1 + pred, cex = 3)
[Figure: Petal.Width against Sepal.Length with points coloured by DBSCAN cluster (black for noise); the 10 new objects are plotted as “+” in the colours of their predicted clusters]
Cluster Validation
▶ External validation: compare the clustering with externally supplied class labels, e.g., with the (adjusted) Rand index
▶ Internal validation: judge the compactness and separation of the clusters from the data alone, e.g., with silhouette width or the Dunn index
▶ R packages for cluster validation include clValid, cclust and NbClust; function cluster.stats() in package fpc computes many validation statistics
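As an example of internal validation, a sketch using the average silhouette width (assuming the iris2 and kmeans.result objects from the k-means section are still available):

library(cluster)
# silhouette of the k-means clustering, based on Euclidean distances
sil <- silhouette(kmeans.result$cluster, dist(iris2))
summary(sil)  # average silhouette width, overall and per cluster
# plot(sil)   # draw the silhouette plot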
Further Readings - Clustering
▶ A brief overview of various approaches for clustering
Yanchang Zhao, et al. “Data Clustering.” In Ferraggine et al. (Eds.), Handbook of Research on Innovations in Database Technologies and Applications, Feb 2009. http://yanchang.rdatamining.com/publications/Overview-of-Data-Clustering.pdf
▶ Cluster Analysis & Evaluation Measures
https://en.wikipedia.org/wiki/Cluster_analysis
▶ Detailed reviews of algorithms for data clustering
Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: a review. ACM Computing Surveys, 31(3), 264-323.
Berkhin, P. (2002). Survey of Clustering Data Mining Techniques. Accrue Software, San Jose, CA, USA. http://citeseer.ist.psu.edu/berkhin02survey.html
▶ A comprehensive textbook on data mining
Han, J., & Kamber, M. (2000). Data mining: concepts and techniques. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.
Further Readings - Clustering with R
▶ Data Mining Algorithms In R: Clustering
https://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Clustering
▶ Data Mining Algorithms In R: k-Means Clustering
https://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Clustering/K-Means
▶ Data Mining Algorithms In R: k-Medoids Clustering
https://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Clustering/Partitioning_Around_Medoids_(PAM)
▶ Data Mining Algorithms In R: Hierarchical Clustering
https://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Clustering/Hierarchical_Clustering
▶ Data Mining Algorithms In R: Density-Based Clustering
https://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Clustering/Density-Based_Clustering
Exercise - I
Clustering cars based on road test data
▶ mtcars: the Motor Trend Car Road Tests data, comprising fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models) [R Core Team, 2015]
▶ A data frame with 32 observations on 11 variables:
1. mpg: fuel consumption (Miles/gallon)
2. cyl: Number of cylinders
3. disp: Displacement (cu.in.)
4. hp: Gross horsepower
5. drat: Rear axle ratio
6. wt: Weight (1000 lbs)
7. qsec: 1/4 mile time
8. vs: V engine or straight engine
9. am: Transmission (0 = automatic, 1 = manual)
10. gear: Number of forward gears
11. carb: Number of carburetors
Exercise - II
Clustering the states of the US
▶ state.x77: statistics of the 50 states of the US [R Core Team, 2015]
▶ A matrix with 50 rows and 8 columns:
1. Population: population estimate as of July 1, 1975
2. Income: per capita income (1974)
3. Illiteracy: illiteracy (1970, percent of population)
4. Life Exp: life expectancy in years (1969–71)
5. Murder: murder and non-negligent manslaughter rate per
100,000 population (1976)
6. HS Grad: percent high-school graduates (1970)
7. Frost: mean number of days with minimum temperature below
freezing (1931–1960) in capital or large city
8. Area: land area in square miles
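A possible starting point for this exercise (a sketch; df and hc are made-up names): the columns of state.x77 are on very different scales, so standardise them before computing distances.

# standardise the columns, then cluster the states hierarchically
df <- scale(state.x77)
hc <- hclust(dist(df), method = "ave")
plot(hc, hang = -1, cex = 0.7)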
Exercise - Questions
Online Resources
▶ Chapter 6: Clustering, in R and Data Mining: Examples and Case Studies
http://www.rdatamining.com/docs/RDataMining-book.pdf
▶ RDataMining website: http://www.rdatamining.com
The End
Thanks!
Email: yanchang(at)RDataMining.com
Twitter: @RDataMining
How to Cite This Work
▶ Citation
Yanchang Zhao. R and Data Mining: Examples and Case Studies. ISBN 978-0-12-396963-7, December 2012. Academic Press, Elsevier. 256 pages. URL: http://www.rdatamining.com/docs/RDataMining-book.pdf.
▶ BibTeX
@BOOK{Zhao2012R,
title = {R and Data Mining: Examples and Case Studies},
publisher = {Academic Press, Elsevier},
year = {2012},
author = {Yanchang Zhao},
pages = {256},
month = {December},
isbn = {978-0-12-396963-7},
keywords = {R, data mining},
url = {http://www.rdatamining.com/docs/RDataMining-book.pdf}
}
References I
Alsabti, K., Ranka, S., and Singh, V. (1998).
An efficient k-means clustering algorithm.
In Proc. the First Workshop on High Performance Data Mining, Orlando, Florida.
Ester, M., Kriegel, H.-P., Sander, J., and Xu, X. (1996).
A density-based algorithm for discovering clusters in large spatial databases with noise.
In Proc. the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), pages 226-231.
References II
Hennig, C. (2014).
fpc: Flexible procedures for clustering.
R package version 2.1-9.
Huang, Z. (1998).
Extensions to the k-means algorithm for clustering large data sets with categorical values.
Data Mining and Knowledge Discovery, 2(3):283-304.
Kaufman, L. and Rousseeuw, P. J. (1990).
Finding Groups in Data: An Introduction to Cluster Analysis.
John Wiley & Sons.
Macqueen, J. B. (1967).
Some methods of classification and analysis of multivariate observations.
In the Fifth Berkeley Symposium on Mathematical Statistics and Probability.
Maechler, M., Rousseeuw, P., Struyf, A., Hubert, M., and Hornik, K. (2016).
cluster: Cluster Analysis Basics and Extensions.
R package version 2.0.4 — For new features, see the ’Changelog’ file (in the package source).
References III
Ng, R. T. and Han, J. (1994).
Efficient and effective clustering methods for spatial data mining.
In Proc. the 20th International Conference on Very Large Data Bases (VLDB ’94), pages 144-155.
R Core Team (2015).
R: A Language and Environment for Statistical Computing.
R Foundation for Statistical Computing, Vienna, Austria.