
Data Clustering with R

Yanchang Zhao
http://www.RDataMining.com

R and Data Mining Course


Beijing University of Posts and Telecommunications,
Beijing, China

July 2019

1 / 62
Contents
Introduction
Data Clustering with R
The Iris Dataset

Partitioning Clustering
The k-Means Clustering
The k-Medoids Clustering

Hierarchical Clustering

Density-Based Clustering

Cluster Validation

Further Readings and Online Resources

Exercises

2 / 62
What is Data Clustering?

I Data clustering partitions data into groups, where the data in
the same group are similar to one another and the data from
different groups are dissimilar [Han and Kamber, 2000].
I It segments data into clusters so that the intra-cluster
similarity is maximized and the inter-cluster similarity is
minimized.
I The groups obtained form a partition of the data, which can be
used for customer segmentation, document categorization, etc.

3 / 62

Data Clustering with R

I Partitioning Methods
I k-means clustering: stats::kmeans() ∗ and
fpc::kmeansruns()
I k-medoids clustering: cluster::pam() and fpc::pamk()
I Hierarchical Methods
I Divisive hierarchical clustering: DIANA, cluster::diana(),
I Agglomerative hierarchical clustering: cluster::agnes(),
stats::hclust()
I Density-based Methods
I DBSCAN: fpc::dbscan()
I Cluster Validation
I Packages clValid, cclust, NbClust


∗ package name::function name()

Chapter 6 - Clustering, in R and Data Mining: Examples and Case Studies.
http://www.rdatamining.com/docs/RDataMining-book.pdf
4 / 62
The Iris Dataset - I

The iris dataset [Frank and Asuncion, 2010] consists of 50
samples from each of three classes of iris flowers. There are five
attributes in the dataset:
I sepal length in cm,
I sepal width in cm,
I petal length in cm,
I petal width in cm, and
I class: Iris Setosa, Iris Versicolour, and Iris Virginica.
A detailed description of the dataset can be found at the UCI
Machine Learning Repository‡.


‡ https://archive.ics.uci.edu/ml/datasets/Iris
5 / 62
The Iris Dataset - II
Below we have a look at the structure of the dataset with str().

## the IRIS dataset


str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0...
## $ Species : Factor w/ 3 levels "setosa","versicolor",....

I 150 observations (records, or rows) and 5 variables (or columns)
I The first four variables are numeric.
I The last one, Species, is categorical (called a “factor” in R)
and has three levels.

6 / 62
The Iris Dataset - III

summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Wid...
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0....
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0....
## Median :5.800 Median :3.000 Median :4.350 Median :1....
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1....
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1....
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2....
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##

7 / 62
Contents
Introduction
Data Clustering with R
The Iris Dataset

Partitioning Clustering
The k-Means Clustering
The k-Medoids Clustering

Hierarchical Clustering

Density-Based Clustering

Cluster Validation

Further Readings and Online Resources

Exercises

8 / 62
Partitioning clustering - I
I Partition the data into k groups first, then improve the quality
of the clustering by moving objects from one group to another.
I k-means [Alsabti et al., 1998, Macqueen, 1967]: randomly
selects k objects as cluster centers and assigns other objects
to the nearest cluster centers, and then improves the
clustering by iteratively updating the cluster centers and
reassigning the objects to the new centers.
I k-medoids [Huang, 1998]: a variation of k-means for
categorical data, where the medoid (i.e., the object closest to
the center), instead of the centroid, is used to represent a
cluster.
I PAM and CLARA [Kaufman and Rousseeuw, 1990]
I CLARANS [Ng and Han, 1994]

9 / 62
Partitioning clustering - II

I The result of partitioning clustering depends on the selection
of initial cluster centers, and it may converge to a local
optimum instead of a global one. (Improvement: run k-means
multiple times with different initial centers and then choose
the best clustering result; see the sketch below.)
I Tends to produce sphere-shaped clusters with similar sizes
I Sensitive to outliers
I Non-trivial to choose an appropriate value for k
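One way to apply the improvement mentioned above is the nstart argument of
stats::kmeans(). The following is a minimal sketch (not part of the original
slides), using the numeric iris attributes as in the later examples:

## run k-means with a single random start and with 25 random starts,
## keeping the best result (lowest total within-cluster sum of squares)
iris2 <- iris[, -5]                              # numeric attributes only
set.seed(8953)
km1  <- kmeans(iris2, centers = 3)               # single start
km25 <- kmeans(iris2, centers = 3, nstart = 25)  # best of 25 starts
c(single = km1$tot.withinss, multi = km25$tot.withinss)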

10 / 62
k-Means Algorithm

I k-means: a classic partitioning method for clustering


I First, it selects k objects from the dataset, each of which
initially represents a cluster center.
I Each object is assigned to the cluster to which it is most
similar, based on the distance between the object and the
cluster center.
I The means of clusters are computed as the new cluster
centers.
I The process iterates until the criterion function converges.

11 / 62
k-Means Algorithm - Criterion Function

A typical criterion function is the squared-error criterion, defined as

E = \sum_{i=1}^{k} \sum_{p \in C_i} \| p - m_i \|^2 ,   (1)

where E is the sum of squared errors, p is a point, and m_i is the
center of cluster C_i.
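The criterion E in (1) is what kmeans() returns as tot.withinss. A small
sketch (not from the original slides) that recomputes it by hand:

## recompute the squared-error criterion from a k-means result
iris2 <- iris[, -5]
set.seed(8953)
km <- kmeans(iris2, 3)
E <- sum(sapply(1:3, function(i) {
  ci <- as.matrix(iris2[km$cluster == i, ])    # points p in cluster C_i
  sum(sweep(ci, 2, km$centers[i, ])^2)         # sum of ||p - m_i||^2
}))
all.equal(E, km$tot.withinss)                  # should be TRUE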

12 / 62
k-means clustering

## k-means clustering
## set a seed for random number generation to make the results reproducible
set.seed(8953)
## make a copy of the iris data
iris2 <- iris
## remove the class label, Species
iris2$Species <- NULL
## run k-means clustering to find 3 clusters
kmeans.result <- kmeans(iris2, 3)
## print the clustering result
kmeans.result

13 / 62
## K-means clustering with 3 clusters of sizes 38, 50, 62
##
## Cluster means:
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 6.850000 3.073684 5.742105 2.071053
## 2 5.006000 3.428000 1.462000 0.246000
## 3 5.901613 2.748387 4.393548 1.433871
##
## Clustering vector:
## [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2...
## [31] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 1 3 3 3 3...
## [61] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3...
## [91] 3 3 3 3 3 3 3 3 3 3 1 3 1 1 1 1 3 1 1 1 1 1 1 3 3 1 1...
## [121] 1 3 1 3 1 1 3 3 1 1 1 1 1 3 1 1 1 1 3 1 1 1 3 1 1 1 3...
##
## Within cluster sum of squares by cluster:
## [1] 23.87947 15.15100 39.82097
## (between_SS / total_SS = 88.4 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"...
## [5] "tot.withinss" "betweenss" "size" "iter" ...
## [9] "ifault" 14 / 62
Results of k-Means Clustering

Check clustering result against class labels (Species)

table(iris$Species, kmeans.result$cluster)
##
## 1 2 3
## setosa 0 50 0
## versicolor 2 0 48
## virginica 36 0 14

I Class “setosa” can be easily separated from the other clusters.
I Classes “versicolor” and “virginica” overlap with each other to
a small degree (see the sketch below for a numeric comparison).
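The agreement can also be quantified with the corrected (adjusted) Rand
index; a sketch, assuming package fpc is available and using its
cluster.stats() function for illustration:

## compare the k-means clustering with the species labels
library(fpc)
cluster.stats(dist(iris2), kmeans.result$cluster,
              as.integer(iris$Species))$corrected.rand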

15 / 62
plot(iris2[, c("Sepal.Length", "Sepal.Width")],
col = kmeans.result$cluster)
points(kmeans.result$centers[, c("Sepal.Length", "Sepal.Width")],
col = 1:3, pch = 8, cex=2) # plot cluster centers

[Figure: scatter plot of Sepal.Width against Sepal.Length, with points
coloured by cluster and the three cluster centers marked by asterisks]
16 / 62
k-means clustering with estimating k and initialisations

I kmeansruns() in package fpc [Hennig, 2014]


I calls kmeans() to perform k-means clustering
I initializes the k-means algorithm several times with random
points from the data set as means
I estimates the number of clusters by the Calinski-Harabasz index
or the average silhouette width (see the sketch below)
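A brief sketch of that interface (not from the original slides), assuming the
krange and criterion arguments and the bestk component of fpc::kmeansruns();
see ?kmeansruns for the exact defaults:

## estimate the number of clusters with the average silhouette width ("asw")
## and with the Calinski-Harabasz index ("ch")
library(fpc)
asw <- kmeansruns(iris2, krange = 2:10, criterion = "asw")
ch  <- kmeansruns(iris2, krange = 2:10, criterion = "ch")
c(asw = asw$bestk, ch = ch$bestk)   # estimated numbers of clusters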

17 / 62
library(fpc)
kmeansruns.result <- kmeansruns(iris2)
kmeansruns.result
## K-means clustering with 3 clusters of sizes 62, 50, 38
##
## Cluster means:
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 5.901613 2.748387 4.393548 1.433871
## 2 5.006000 3.428000 1.462000 0.246000
## 3 6.850000 3.073684 5.742105 2.071053
##
## Clustering vector:
## [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2...
## [31] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 3 1 1 1 1...
## [61] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1...
## [91] 1 1 1 1 1 1 1 1 1 1 3 1 3 3 3 3 1 3 3 3 3 3 3 1 1 3 3...
## [121] 3 1 3 1 3 3 1 1 3 3 3 3 3 1 3 3 3 3 1 3 3 3 1 3 3 3 1...
##
## Within cluster sum of squares by cluster:
## [1] 39.82097 15.15100 23.87947
## (between_SS / total_SS = 88.4 %)
##
## Available components:
18 / 62
The k-Medoids Clustering
I Difference from k-means: in k-means a cluster is represented by
its center, while in k-medoids it is represented by the object
closest to the center of the cluster.
I More robust than k-means in the presence of outliers
I PAM (Partitioning Around Medoids) is a classic algorithm for
k-medoids clustering.
I The CLARA algorithm enhances PAM by drawing multiple samples of
the data, applying PAM to each sample and then returning the
best clustering. It performs better than PAM on larger data.
I Functions pam() and clara() in package cluster
[Maechler et al., 2016]
I Function pamk() in package fpc does not require the user to
choose k.

19 / 62
Clustering with pam()
## clustering with PAM
library(cluster)
# group into 3 clusters
pam.result <- pam(iris2, 3)
# check against actual class label
table(pam.result$clustering, iris$Species)
##
## setosa versicolor virginica
## 1 50 0 0
## 2 0 48 14
## 3 0 2 36

Three clusters:
I Cluster 1 is species “setosa” and is well separated from the
other two.
I Cluster 2 is mainly composed of “versicolor”, plus some cases
from “virginica”.
I The majority of cluster 3 are “virginica”, with two cases from
“versicolor”.
20 / 62
plot(pam.result)

[Figure: output of plot(pam.result). Left: 2-dimensional clusplot of
pam(x = iris2, k = 3); the two components explain 95.81% of the point
variability. Right: silhouette plot of the 3 clusters (n = 150), with
cluster sizes and average silhouette widths 1: 50 | 0.80, 2: 62 | 0.42,
3: 38 | 0.45; overall average silhouette width 0.55]
21 / 62
I The left chart is a 2-dimensional “clusplot” (clustering plot)
of the three clusters, and the lines show the distances between
clusters.
I The right chart shows their silhouettes. A large s_i (close to 1)
suggests that the corresponding observations are very well
clustered, a small s_i (around 0) means that the observation
lies between two clusters, and observations with a negative s_i
are probably placed in the wrong cluster.
I The silhouette width of cluster 1 is 0.80, which means it is well
clustered and separated from the other clusters. The other two
have relatively low silhouette widths (0.42 and 0.45), and they
overlap with each other to some degree.

22 / 62
Clustering with pamk()

library(fpc)
pamk.result <- pamk(iris2)
# number of clusters
pamk.result$nc
## [1] 2

# check clustering against actual class label


table(pamk.result$pamobject$clustering, iris$Species)
##
## setosa versicolor virginica
## 1 50 1 0
## 2 0 49 50

Two clusters:
I “setosa”
I a mixture of “versicolor” and “virginica”

23 / 62
plot(pamk.result)

[Figure: output of plot(pamk.result). Left: clusplot of
pam(x = sdata, k = k, diss = diss); the two components explain 95.81% of the
point variability. Right: silhouette plot of the 2 clusters (n = 150), with
cluster sizes and average silhouette widths 1: 51 | 0.81, 2: 99 | 0.62;
overall average silhouette width 0.69]
24 / 62
Results of Clustering

I In this example, the result of pam() seems better, because it
identifies three clusters, corresponding to the three species.
I Note that we cheated by setting k = 3 when using pam(), since
the number of species is already known to us.

25 / 62
Contents
Introduction
Data Clustering with R
The Iris Dataset

Partitioning Clustering
The k-Means Clustering
The k-Medoids Clustering

Hierarchical Clustering

Density-Based Clustering

Cluster Validation

Further Readings and Online Resources

Exercises

26 / 62
Hierarchical Clustering - I

I With the hierarchical clustering approach, a hierarchical
decomposition of the data is built in either a bottom-up
(agglomerative) or a top-down (divisive) way.
I Generally a dendrogram is generated, and a user may choose to
cut it at a certain level to obtain the clusters.

27 / 62
Hierarchical Clustering - II

28 / 62
Hierarchical Clustering Algorithms

I With agglomerative clustering, every single object starts as its
own cluster, and then the two nearest clusters are iteratively
merged into bigger clusters until the expected number of
clusters is obtained or only one cluster is left.
I AGNES [Kaufman and Rousseeuw, 1990]
I Divisive clustering works the opposite way: it puts all objects
in a single cluster and then divides the cluster into smaller
and smaller ones.
I DIANA [Kaufman and Rousseeuw, 1990]
I BIRCH [Zhang et al., 1996]
I CURE [Guha et al., 1998]
I ROCK [Guha et al., 1999]
I Chameleon [Karypis et al., 1999]

29 / 62
Hierarchical Clustering - Distance Between Clusters

In hierarchical clustering, there are four common ways to measure
the distance between two clusters (illustrated by the small sketch
below the list):
I Centroid distance is the distance between the centroids of the
two clusters.
I Average distance is the average of the distances between
every pair of objects, one from each cluster.
I Single-link distance, a.k.a. minimum distance, is the distance
between the two nearest objects from the two clusters.
I Complete-link distance, a.k.a. maximum distance, is the
distance between the two objects, one from each cluster, that
are farthest from each other.
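A small, self-contained illustration (not part of the original slides) that
computes the four distances directly for two ad-hoc groups of iris records:

## two small ad-hoc clusters, numeric attributes only
A <- iris[1:5, 1:4]     # five setosa rows as cluster A
B <- iris[51:55, 1:4]   # five versicolor rows as cluster B
## pairwise distances between objects of A and objects of B
cross <- as.matrix(dist(rbind(A, B)))[1:5, 6:10]
c(centroid = as.numeric(dist(rbind(colMeans(A), colMeans(B)))),
  average  = mean(cross),
  single   = min(cross),
  complete = max(cross))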

30 / 62
Hierarchical Clustering of the iris Data

## hierarchical clustering
set.seed(2835)
# draw a sample of 40 records from the iris data, so that the
# clustering plot will not be over crowded
idx <- sample(1:dim(iris)[1], 40)
iris3 <- iris[idx, ]
# remove class label
iris3$Species <- NULL
# hierarchical clustering
hc <- hclust(dist(iris3), method = "ave")
# plot clusters
plot(hc, hang = -1, labels = iris$Species[idx])
# cut tree into 3 clusters
rect.hclust(hc, k = 3)
# get cluster IDs
groups <- cutree(hc, k = 3)

31 / 62
[Figure: Cluster Dendrogram produced by plot(hc) for hclust(dist(iris3),
method = "average"), with leaves labelled by species and the three clusters
from rect.hclust(hc, k = 3) marked by rectangles]
32 / 62
Agglomeration Methods of hclust

hclust(d, method = "complete", members = NULL)


I method = "ward.D" or "ward.D2": Ward’s minimum
variance method aims at finding compact, spherical
clusters [R Core Team, 2015].
I method = "complete": complete-link distance; finds similar
clusters.
I method = "single": single-link distance; adopts a “friends
of friends” clustering strategy.
I method = "average": average distance
I method = "centroid": centroid distance
I method = "median":
I method = "mcquitty":

33 / 62
DIANA
I DIANA [Kaufman and Rousseeuw, 1990]: divisive hierarchical
clustering
I Constructs a hierarchy of clusterings, starting with one large
cluster containing all observations.
I Divides clusters until each cluster contains only a single
observation.
I At each stage, the cluster with the largest diameter is
selected. (The diameter of a cluster is the largest dissimilarity
between any two of its observations.)
I To divide the selected cluster, the algorithm first looks for its
most disparate observation (i.e., the one with the largest average
dissimilarity to the other observations in the selected cluster).
This observation initiates the “splinter group”. In subsequent
steps, the algorithm reassigns observations that are closer to
the “splinter group” than to the “old party”. The result is a
division of the selected cluster into two new clusters.
34 / 62
DIANA

## clustering with DIANA


library(cluster)
diana.result <- diana(iris3)

plot(diana.result, which.plots = 2, labels = iris$Species[idx])
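To obtain a flat clustering from the divisive hierarchy, one option (a
sketch, assuming the diana result can be converted with as.hclust(), as
supported by the cluster package) is to cut the tree:

## cut the DIANA hierarchy into 3 clusters and compare with the species
groups <- cutree(as.hclust(diana.result), k = 3)
table(groups, iris$Species[idx])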

35 / 62
[Figure: Dendrogram of diana(x = iris3), with leaves labelled by species;
Divisive Coefficient = 0.93]
36 / 62
Contents
Introduction
Data Clustering with R
The Iris Dataset

Partitioning Clustering
The k-Means Clustering
The k-Medoids Clustering

Hierarchical Clustering

Density-Based Clustering

Cluster Validation

Further Readings and Online Resources

Exercises

37 / 62
Density-Based Clustering

I The rationale of density-based clustering is that a cluster is
composed of well-connected dense regions, while objects in
sparse areas are treated as noise.
I DBSCAN is a typical density-based clustering algorithm,
which works by expanding clusters to their dense
neighborhood [Ester et al., 1996].
I Other density-based clustering techniques:
OPTICS [Ankerst et al., 1999] and
DENCLUE [Hinneburg and Keim, 1998]
I The advantage of density-based clustering is that it can filter
out noise and find clusters of arbitrary shapes (as long as they
are composed of connected dense regions).

38 / 62
DBSCAN [Ester et al., 1996]

I Groups objects into one cluster if they are connected to one
another by a densely populated area
I The dbscan() function in package fpc provides density-based
clustering for numeric data.
I Two key parameters in DBSCAN (a heuristic for choosing eps is
sketched at the end of this slide):
I eps: reachability distance, which defines the size of the
neighborhood; and
I MinPts: minimum number of points.
I If the number of points in the neighborhood of point α is no
less than MinPts, then α is a dense point. All the points in its
neighborhood are density-reachable from α and are put into
the same cluster as α.
I Can discover clusters with various shapes and sizes
I Insensitive to noise
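A common heuristic for choosing eps, sketched below with base R only (not
part of the original slides): plot the sorted distances to each point's
MinPts-th nearest neighbour and look for a “knee”.

## k-nearest-neighbour distance plot for choosing eps
iris2 <- iris[, -5]
minPts <- 5
d <- as.matrix(dist(iris2))
## distance from each point to its minPts-th nearest neighbour
## (the first sorted value is the point itself, at distance 0)
kdist <- apply(d, 1, function(x) sort(x)[minPts + 1])
plot(sort(kdist), type = "l", xlab = "points sorted by distance",
     ylab = paste0(minPts, "-NN distance"))
abline(h = 0.42, lty = 2)   # the eps value used on the next slide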

39 / 62
Density-based Clustering of the iris data

## Density-based Clustering
library(fpc)
iris2 <- iris[-5] # remove class tags
ds <- dbscan(iris2, eps = 0.42, MinPts = 5)
ds
## dbscan Pts=150 MinPts=5 eps=0.42
## 0 1 2 3
## border 29 6 10 12
## seed 0 42 27 24
## total 29 48 37 36

40 / 62
Density-based Clustering of the iris data

# compare clusters with actual class labels


table(ds$cluster, iris$Species)
##
## setosa versicolor virginica
## 0 2 10 17
## 1 48 0 0
## 2 0 37 0
## 3 0 3 33

I 1 to 3: identified clusters
I 0: noise or outliers, i.e., objects that are not assigned to any
cluster

41 / 62
plot(ds, iris2)

[Figure: pairs plot of the four iris attributes (Sepal.Length, Sepal.Width,
Petal.Length, Petal.Width) produced by plot(ds, iris2), with points marked
by cluster]
42 / 62
plot(ds, iris2[, c(1, 4)])

[Figure: scatter plot of Petal.Width against Sepal.Length produced by
plot(ds, iris2[, c(1, 4)]), with points marked by cluster]
43 / 62
plotcluster(iris2, ds$cluster)

[Figure: discriminant projection plot produced by plotcluster(iris2,
ds$cluster), with points labelled by cluster number (0 = noise) and axes
dc 1 and dc 2]
44 / 62
Prediction with Clustering Model
I Label new data based on their similarity to the clusters
I Draw a sample of 10 objects from iris and add small noise
to them to make a new dataset for labeling
I Random noise is generated from a uniform distribution
using function runif().

## cluster prediction
# create a new dataset for labeling
set.seed(435)
idx <- sample(1:nrow(iris), 10)
# remove class labels
new.data <- iris[idx,-5]
# add random noise
new.data <- new.data + matrix(runif(10*4, min=0, max=0.2),
nrow=10, ncol=4)
# label new data
pred <- predict(ds, iris2, new.data)

45 / 62
Results of Prediction

table(pred, iris$Species[idx]) # check cluster labels


##
## pred setosa versicolor virginica
## 0 0 0 1
## 1 3 0 0
## 2 0 3 0
## 3 0 1 2

Eight (= 3 + 3 + 2) of the 10 objects are assigned the correct
class labels.

46 / 62
plot(iris2[, c(1, 4)], col = 1 + ds$cluster)
points(new.data[, c(1, 4)], pch = "+", col = 1 + pred, cex = 3)

[Figure: scatter plot of Petal.Width against Sepal.Length with points
coloured by cluster (0 = noise in black) and the 10 new, labelled objects
marked by large “+” symbols coloured by their predicted cluster]
47 / 62
Contents
Introduction
Data Clustering with R
The Iris Dataset

Partitioning Clustering
The k-Means Clustering
The k-Medoids Clustering

Hierarchical Clustering

Density-Based Clustering

Cluster Validation

Further Readings and Online Resources

Exercises

48 / 62
Cluster Validation

I silhouette(): compute or extract silhouette information (cluster)
I cluster.stats(): compute several cluster validity statistics
from a clustering and a dissimilarity matrix (fpc)
I clValid(): calculate validation measures for a given set of
clustering algorithms and numbers of clusters (clValid)
I clustIndex(): calculate the values of several clustering
indexes, which can be independently used to determine the
number of clusters existing in a data set (cclust)
I NbClust(): provide 30 indices for cluster validation and for
determining the number of clusters (NbClust)
(the first two are illustrated in the sketch below)
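A minimal sketch of the first two functions (assuming the kmeans.result and
iris2 objects from the earlier k-means slides are still in the workspace):

## internal validation of the k-means result
library(cluster)
library(fpc)
d <- dist(iris2)                        # dissimilarity matrix
sil <- silhouette(kmeans.result$cluster, d)
summary(sil)                            # per-cluster and average silhouette widths
# plot(sil)                             # silhouette plot
cs <- cluster.stats(d, kmeans.result$cluster)
c(avg.silwidth = cs$avg.silwidth, dunn = cs$dunn)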

49 / 62
Contents
Introduction
Data Clustering with R
The Iris Dataset

Partitioning Clustering
The k-Means Clustering
The k-Medoids Clustering

Hierarchical Clustering

Density-Based Clustering

Cluster Validation

Further Readings and Online Resources

Exercises

50 / 62
Further Readings - Clustering
I A brief overview of various approaches to clustering
Yanchang Zhao, et al. ”Data Clustering.” In Ferraggine et al. (Eds.), Handbook
of Research on Innovations in Database Technologies and Applications, Feb
2009. http://yanchang.rdatamining.com/publications/
Overview-of-Data-Clustering.pdf
I Cluster Analysis & Evaluation Measures
https://en.wikipedia.org/wiki/Cluster_analysis
I Detailed review of algorithms for data clustering
Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: a review.
ACM Computing Surveys, 31(3), 264-323.
Berkhin, P. (2002). Survey of Clustering Data Mining Techniques. Accrue
Software, San Jose, CA, USA.
http://citeseer.ist.psu.edu/berkhin02survey.html.
I A comprehensive textbook on data mining
Han, J., & Kamber, M. (2000). Data mining: concepts and techniques. San
Francisco, CA, USA: Morgan Kaufmann Publishers Inc.
51 / 62
Further Readings - Clustering with R
I Data Mining Algorithms In R: Clustering
https://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Clustering
I Data Mining Algorithms In R: k-Means Clustering
https://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Clustering/K-Means
I Data Mining Algorithms In R: k-Medoids Clustering
https://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Clustering/Partitioning_Around_Medoids_(PAM)
I Data Mining Algorithms In R: Hierarchical Clustering
https://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Clustering/Hierarchical_Clustering
I Data Mining Algorithms In R: Density-Based Clustering
https://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Clustering/Density-Based_Clustering

52 / 62
Contents
Introduction
Data Clustering with R
The Iris Dataset

Partitioning Clustering
The k-Means Clustering
The k-Medoids Clustering

Hierarchical Clustering

Density-Based Clustering

Cluster Validation

Further Readings and Online Resources

Exercises

53 / 62
Exercise - I
Clustering cars based on road test data
I mtcars: the Motor Trend Car Road Tests data, comprising
fuel consumption and 10 aspects of automobile design and
performance for 32 automobiles (1973–74
models) [R Core Team, 2015]
I A data frame with 32 observations on 11 variables:
1. mpg: fuel consumption (Miles/gallon)
2. cyl: Number of cylinders
3. disp: Displacement (cu.in.)
4. hp: Gross horsepower
5. drat: Rear axle ratio
6. wt: Weight (1000 lbs)
7. qsec: 1/4 mile time
8. vs: V engine or straight engine
9. am: Transmission (0 = automatic, 1 = manual)
10. gear: Number of forward gears
11. carb: Number of carburetors

54 / 62
Exercise - II

To cluster states of US
I state.x77: statistics of the 50 states of
US [R Core Team, 2015]
I a matrix with 50 rows and 8 columns
1. Population: population estimate as of July 1, 1975
2. Income: per capita income (1974)
3. Illiteracy: illiteracy (1970, percent of population)
4. Life Exp: life expectancy in years (1969–71)
5. Murder: murder and non-negligent manslaughter rate per
100,000 population (1976)
6. HS Grad: percent high-school graduates (1970)
7. Frost: mean number of days with minimum temperature below
freezing (1931–1960) in capital or large city
8. Area: land area in square miles

55 / 62
Exercise - Questions

I Which attributes to use?


I Are the attributes on the same scale? (see the sketch at the end
of this slide)
I Which clustering techniques to use?
I Which clustering algorithms to use?
I How many clusters to find?
I Are the clustering results good or not?
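One possible starting point for both exercises (a sketch, not part of the
original exercises): standardise the attributes with scale() before
clustering, since they are on very different scales, then apply a method
from this chapter.

## scale the mtcars variables and try k-means with multiple starts
data(mtcars)
cars.scaled <- scale(mtcars)           # zero mean, unit variance per column
set.seed(123)
km <- kmeans(cars.scaled, centers = 3, nstart = 25)
km$size                                # cluster sizes
## the same idea applies to state.x77, e.g. hclust(dist(scale(state.x77)))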

56 / 62
Online Resources

I Book titled R and Data Mining: Examples and Case Studies
http://www.rdatamining.com/docs/RDataMining-book.pdf
I R Reference Card for Data Mining
http://www.rdatamining.com/docs/RDataMining-reference-card.pdf
I Free online courses and documents
http://www.rdatamining.com/resources/
I RDataMining Group on LinkedIn (27,000+ members)
http://group.rdatamining.com
I Twitter (3,300+ followers)
@RDataMining

57 / 62
The End

Thanks!
Email: yanchang(at)RDataMining.com
Twitter: @RDataMining
58 / 62
How to Cite This Work

I Citation
Yanchang Zhao. R and Data Mining: Examples and Case Studies. ISBN
978-0-12-396963-7, December 2012. Academic Press, Elsevier. 256
pages. URL: http://www.rdatamining.com/docs/RDataMining-book.pdf.
I BibTex
@BOOK{Zhao2012R,
title = {R and Data Mining: Examples and Case Studies},
publisher = {Academic Press, Elsevier},
year = {2012},
author = {Yanchang Zhao},
pages = {256},
month = {December},
isbn = {978-0-123-96963-7},
keywords = {R, data mining},
url = {http://www.rdatamining.com/docs/RDataMining-book.pdf}
}

59 / 62
References I
Alsabti, K., Ranka, S., and Singh, V. (1998).
An efficient k-means clustering algorithm.
In Proc. the First Workshop on High Performance Data Mining, Orlando, Florida.

Ankerst, M., Breunig, M. M., Kriegel, H.-P., and Sander, J. (1999).


OPTICS: ordering points to identify the clustering structure.
In SIGMOD ’99: Proceedings of the 1999 ACM SIGMOD international conference on Management of data,
pages 49–60, New York, NY, USA. ACM Press.

Ester, M., Kriegel, H.-P., Sander, J., and Xu, X. (1996).


A density-based algorithm for discovering clusters in large spatial databases with noise.
In KDD, pages 226–231.

Frank, A. and Asuncion, A. (2010).


UCI Machine Learning Repository. University of California, Irvine, School of Information and Computer Sciences.
http://archive.ics.uci.edu/ml.

Guha, S., Rastogi, R., and Shim, K. (1998).


CURE: an efficient clustering algorithm for large databases.
In SIGMOD ’98: Proceedings of the 1998 ACM SIGMOD international conference on Management of data,
pages 73–84, New York, NY, USA. ACM Press.

Guha, S., Rastogi, R., and Shim, K. (1999).


ROCK: A robust clustering algorithm for categorical attributes.
In Proceedings of the 15th International Conference on Data Engineering, 23-26 March 1999, Sydney,
Austrialia, pages 512–521. IEEE Computer Society.

Han, J. and Kamber, M. (2000).


Data Mining: Concepts and Techniques.
Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.

60 / 62
References II
Hennig, C. (2014).
fpc: Flexible procedures for clustering.
R package version 2.1-9.

Hinneburg, A. and Keim, D. A. (1998).


An efficient approach to clustering in large multimedia databases with noise.
In KDD, pages 58–65.
DENCLUE.
Huang, Z. (1998).
Extensions to the k-means algorithm for clustering large data sets with categorical values.
Data Min. Knowl. Discov., 2(3):283–304.

Karypis, G., Han, E.-H., and Kumar, V. (1999).


Chameleon: hierarchical clustering using dynamic modeling.
Computer, 32(8):68–75.

Kaufman, L. and Rousseeuw, P. J. (1990).


Finding Groups in Data: An Introduction to Cluster Analysis.
Wiley Series in Probability and Mathematical Statistics. Applied Probability and Statistics, New York:
Wiley, 1990.

Macqueen, J. B. (1967).
Some methods of classification and analysis of multivariate observations.
In the Fifth Berkeley Symposium on Mathematical Statistics and Probability.

Maechler, M., Rousseeuw, P., Struyf, A., Hubert, M., and Hornik, K. (2016).
cluster: Cluster Analysis Basics and Extensions.
R package version 2.0.4 — For new features, see the ’Changelog’ file (in the package source).

61 / 62
References III

Ng, R. T. and Han, J. (1994).


Efficient and effective clustering methods for spatial data mining.
In VLDB ’94: Proceedings of the 20th International Conference on Very Large Data Bases, pages 144–155,
San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

R Core Team (2015).


R: A Language and Environment for Statistical Computing.
R Foundation for Statistical Computing, Vienna, Austria.

Zhang, T., Ramakrishnan, R., and Livny, M. (1996).


BIRCH: an efficient data clustering method for very large databases.
In SIGMOD ’96: Proceedings of the 1996 ACM SIGMOD international conference on Management of data,
pages 103–114, New York, NY, USA. ACM Press.

62 / 62
