Silhouette (Clustering): Method


Silhouette (clustering)

From Wikipedia, the free encyclopedia

Silhouette refers to a method of interpretation and validation of clusters of data. The technique
provides a succinct graphical representation of how well each object lies within its cluster. It was first
described by Peter J. Rousseeuw in 1986.[1]

Method
Assume the data have been clustered via any technique, such as k-means, into $k$ clusters. For each datum $i$, let $a(i)$ be the average dissimilarity of $i$ with all other data within the same cluster. Any measure of dissimilarity can be used, but distance measures are the most common. We can interpret $a(i)$ as how well $i$ is assigned to its cluster (the smaller the value, the better the assignment). We then define the average dissimilarity of point $i$ to a cluster $C$ as the average of the distances from $i$ to the points in $C$.

Let $b(i)$ be the lowest average dissimilarity of $i$ to any other cluster of which $i$ is not a member. The cluster with this lowest average dissimilarity is said to be the "neighbouring cluster" of $i$ because it is the next best fit cluster for point $i$. We now define:

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$

which can be written as:

$$s(i) = \begin{cases} 1 - a(i)/b(i), & \text{if } a(i) < b(i) \\ 0, & \text{if } a(i) = b(i) \\ b(i)/a(i) - 1, & \text{if } a(i) > b(i) \end{cases}$$

From the above definition it is clear that $-1 \le s(i) \le 1$.

For $s(i)$ to be close to 1 we require $a(i) \ll b(i)$. As $a(i)$ is a measure of how dissimilar $i$ is to its own cluster, a small value means it is well matched. Furthermore, a large $b(i)$ implies that $i$ is badly matched to its neighbouring cluster. Thus an $s(i)$ close to one means that the datum is appropriately clustered. If $s(i)$ is close to negative one, then by the same logic we see that $i$ would be more appropriately placed in its neighbouring cluster. An $s(i)$ near zero means that the datum is on the border of two natural clusters.

The average $s(i)$ over all data of a cluster is a measure of how tightly grouped all the data in the cluster are. Thus the average $s(i)$ over all data of the entire dataset is a measure of how appropriately the data has been clustered. If there are too many or too few clusters, as may occur when a poor choice of $k$ is used in the k-means algorithm, some of the clusters will typically display much narrower silhouettes than the rest. Thus silhouette plots and averages may be used to determine the natural number of clusters within a dataset.
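To make the definition concrete, here is a minimal sketch in Python (using numpy, Euclidean distance, and a hypothetical toy dataset, all of which are assumptions for illustration) that computes $a(i)$, $b(i)$, and $s(i)$ directly from the formulas above; it is not an optimized implementation.

import numpy as np

def silhouette_values(X, labels):
    """Compute s(i) for every point straight from the definition (assumes >= 2 clusters)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n = len(X)
    # Pairwise Euclidean distances; any dissimilarity measure could be used here.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    s = np.zeros(n)
    for i in range(n):
        same = (labels == labels[i]) & (np.arange(n) != i)
        if not same.any():
            s[i] = 0.0  # conventional value for a singleton cluster
            continue
        a = dist[i, same].mean()                      # a(i): own-cluster dissimilarity
        b = min(dist[i, labels == c].mean()           # b(i): neighbouring-cluster dissimilarity
                for c in np.unique(labels) if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s

# Hypothetical toy dataset with an obvious two-cluster structure.
X = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
labels = [0, 0, 0, 1, 1, 1]
print(silhouette_values(X, labels).mean())  # average silhouette of the whole dataset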
The book is done! All 822 pages of the third edition of Data Mining Techniques for Marketing, Sales, and
Customer Relationship Management will be hitting bookstore shelves later this month or you can order it
now. To celebrate, I am returning to the blog.
One of the areas where Gordon and I have added a lot of new material is clustering. In this post, I want to
share a nice measure of cluster goodness first described by Peter Rousseeuw in 1986. Intuitively, good
clusters have the property that cluster members are close to each other and far from members of other
clusters. That is what is captured by a cluster's silhouette.
To calculate a cluster's silhouette, first calculate the average distance within the cluster. Each cluster
member has its own average distance from all other members of the same cluster. This is
its dissimilarity from its cluster. Cluster members with low dissimilarity are comfortably within the cluster
to which they have been assigned. The average dissimilarity for a cluster is a measure of how compact it is.
The average distance to fellow cluster members is then compared to the average distance to members of
the neighboring cluster, the next best cluster for that point. Note that two members of the same cluster
may have different neighboring clusters, and for points that are close to the boundary between two
clusters, the two dissimilarity scores may be nearly equal. The pictures below show this process for one point (17, 27).

A point's silhouette compares its dissimilarity to its own cluster with its dissimilarity to its nearest
neighboring cluster: the difference between the two, divided by the larger of the two. The typical range of
the score is from zero, when a record is right on the boundary of two clusters, to one, when it is identical
to the other records in its own cluster. In theory, the silhouette score can go from negative one to one. A
negative value means that the record is more similar to the records of its neighboring cluster than to other
members of its own cluster. To see how this could happen, imagine forming clusters using an
agglomerative algorithm and single-linkage distance. Single-linkage says the distance from a point to a
cluster is the distance to the nearest member of that cluster. Suppose the data consists of many records
with the value 32 and many others with the value 64, along with a scattering of records with values from
32 to 50. In the first step, all the records at distance zero are combined into two tight clusters. In the next
step, records a distance of one away are combined, causing some 33s to be added to the left cluster,
followed by 34s, 35s, and so on. Eventually, the left cluster will swallow records that would feel happier in
the right cluster.
The silhouette score for an entire cluster is calculated as the average of the silhouette scores of its
members. This measures how similar the cluster members are to one another. The silhouette of the entire
dataset is the average of the silhouette scores of all the individual records. This is a measure of how
appropriately the data has been clustered. What is nice about this measure is that it can be applied at
several levels: to the dataset as a whole to judge the clustering, to individual clusters to determine which
ones are not very good, and to individual records to determine which members do not fit their cluster very
well. The silhouette can be used to choose an appropriate value for k in k-means by trying each value of k
in the acceptable range and choosing the one that yields the best silhouette, as sketched below. It can also
be used to compare clusterings produced by different random seeds.
The final picture shows the silhouette scores for the three clusters in the example.
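As a quick illustration of using the silhouette to choose k, here is a minimal sketch using scikit-learn (assumed to be available); the random dataset X and the range of candidate values are placeholders for real data and a real search range.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.rand(500, 4)  # placeholder for the real data

best_k, best_score = None, -1.0
for k in range(2, 11):  # the silhouette needs at least two clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels)  # average silhouette over all records
    if score > best_score:
        best_k, best_score = k, score

print(best_k, best_score)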

The K-Means Clustering Algorithm


There are many good introductions to k-means clustering available, including our book Data Mining
Techniques for Marketing, Sales, and Customer Relationship Management. The Google presentation mentioned
above provides a very brief introduction.
Let's review the k-means clustering algorithm. Given a data set where all the columns are numeric, the
algorithm for k-means clustering is basically the following:
(1) Start with k cluster centers (chosen randomly or according to some specific procedure).
(2) Assign each row in the data to its nearest cluster center.
(3) Re-calculate each cluster center as the "average" of the rows assigned to it in step (2).
(4) Repeat, until the cluster centers no longer change or some other stopping criterion has been met.
In the end, the k-means algorithm "colors" all the rows in the data set, so similar rows have the same
color.
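As a concrete illustration of those four steps, here is a minimal k-means sketch in Python with numpy; the random initialization, Euclidean distance, iteration cap, and toy data are assumptions made for brevity.

import numpy as np

def kmeans(data, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # (1) Start with k cluster centers chosen randomly from the rows.
    centers = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(max_iter):
        # (2) Assign each row to its nearest cluster center.
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # (3) Re-calculate each center as the average of the rows assigned to it.
        new_centers = np.array([data[labels == j].mean(axis=0) if (labels == j).any()
                                else centers[j] for j in range(k)])
        # (4) Repeat until the centers no longer change.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

data = np.random.rand(200, 3)          # placeholder data
centers, labels = kmeans(data, k=4)    # labels "color" each row by cluster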

K-Means in a Parallel World


To run this algorithm, it seems, at first, as though all the rows assigned to each cluster in Step (2) need to
be brought together to recalculate the cluster centers.
However, this is not true. K-Means clustering is an example of an embarrassingly parallel algorithm,
meaning that it is very well suited to parallel implementations. In fact, it is quite adaptable to both
SQL and MapReduce, with efficient algorithms. By "efficient", I mean that large amounts of data do not
need to be shipped between processors and that the processors need only minimal amounts of communication. It
is true that the entire data set does need to be read by the processors for each iteration of the algorithm,
but each row only needs to be read by one processor.
A parallel version of the k-means algorithm was incorporated into the Darwin data mining package,
developed by Thinking Machines Corporation in the early 1990s. I do not know if this was the first parallel
implementation of the algorithm. Darwin was later purchased by Oracle, and became the basis for Oracle
Data Mining.
How does the parallel version work? The data can be partitioned among multiple processors (or streams
or threads). Each processor can read the previous iteration's cluster centers and assign the rows on the
processor to clusters. Each processor then calculates new centers for its portion of the data. Each actual cluster center
(for the data across all processors) is then the weighted average of the centers on each processor.
In other words, the rows of data do not need to be combined globally. They can be combined locally, with
the reduced set of results combined across all processors. In fact, MapReduce even contains a "combine"
method for just this type of algorithm.
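Here is a minimal sketch of that local-then-global combination in Python (the two-partition layout, function names, and random data are assumptions); the point is that only small per-cluster summaries cross processor boundaries, and the global centers are the weighted average of the local results.

import numpy as np

def local_stats(rows, centers):
    """Run on each processor: assign its rows and return per-cluster sums and counts."""
    dists = np.linalg.norm(rows[:, None, :] - centers[None, :, :], axis=-1)
    labels = dists.argmin(axis=1)
    k, dim = centers.shape
    sums, counts = np.zeros((k, dim)), np.zeros(k)
    for j in range(k):
        members = rows[labels == j]
        sums[j] = members.sum(axis=0)
        counts[j] = len(members)
    return sums, counts

def global_centers(partials, old_centers):
    """Combine the small per-processor summaries into new global centers."""
    total_sums = sum(s for s, _ in partials)
    total_counts = sum(c for _, c in partials)
    new = old_centers.copy()
    nonempty = total_counts > 0
    new[nonempty] = total_sums[nonempty] / total_counts[nonempty, None]
    return new

# Two "processors", each holding one partition of the data.
centers = np.random.rand(3, 2)
partitions = [np.random.rand(100, 2), np.random.rand(120, 2)]
centers = global_centers([local_stats(p, centers) for p in partitions], centers)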
All that remains is figuring out how to handle the cluster center information. Let us postulate a shared file
that has the centroids as calculated for each processor. This file contains:

The iteration number.

The cluster id.

The cluster coordinates.

The number of rows assigned to the cluster.

This is the centroid file. An iteration through the algorithm is going to add another set of rows to this file.
This information is the only information that needs to be communicated globally.
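For concreteness, a centroid file along those lines might be as simple as the following sketch (the CSV layout, file name, and field order are assumptions, not a prescribed format): one row per iteration and cluster, holding the iteration number, the cluster id, the coordinates, and the row count.

import csv

# Append one row per (iteration, cluster): iteration, cluster_id, coordinates..., row count.
with open("centroids.csv", "a", newline="") as f:
    csv.writer(f).writerow([3, 0, 1.25, 7.50, 0.10, 4213])

def read_centers(path, iteration):
    """Return {cluster_id: (coordinates, row_count)} for the requested iteration."""
    centers = {}
    with open(path) as f:
        for it, cid, *rest in csv.reader(f):
            if int(it) == iteration:
                centers[int(cid)] = ([float(x) for x in rest[:-1]], int(rest[-1]))
    return centers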
There are two ways to do this in the MapReduce framework. The first uses map, combine, and reduce. The
second only uses map and reduce.

K-Means Using Map, Combine, Reduce


Before beginning, a file accessible to all processors is created that contains the initial centers for all
clusters. This file holds the cluster centers for each iteration.
The Map function reads this file to get the centers from the last finished iteration. It then reads the input
rows (the data) and calculates the distance to each center. For each row, it produces an output pair with:

key -- cluster id;

value -- coordinates of row.

Now, this is a lot of data, so we use a Combine function to reduce the size before sending it to Reduce. The
Combine function calculates the average of the coordinates for each cluster id, along with the number of
records. This is simple, and it produces one record of output for each cluster:

key is cluster id;

value is number of records and average values of the coordinates.

The amount of data now is the number of clusters times the number of processors times the size of the
information needed to define each cluster. This is small relative to the data size.
The Reduce function (and one of these is probably sufficient for this problem regardless of data size and
the number of Maps) calculates the weighted average of its input. Its output should be written to a file,
and contain:

the iteration number;

the cluster id;

the cluster center coordinates;

the size of the cluster.

The iteration process can then continue.
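A minimal sketch of that map/combine/reduce decomposition in plain Python follows; the function names, the in-memory driver standing in for the framework's shuffle, and Euclidean distance are all assumptions, and a real job would plug these into Hadoop, mrjob, or a similar framework.

import numpy as np

def map_fn(row, centers):
    # Emit (cluster id, row coordinates) for the nearest center.
    cid = min(centers, key=lambda c: np.linalg.norm(row - centers[c]))
    return cid, row

def combine_fn(cid, rows):
    # Local reduction: one record per cluster per processor (count and average).
    rows = np.array(rows)
    return cid, (len(rows), rows.mean(axis=0))

def reduce_fn(cid, partials, iteration):
    # Weighted average of the per-processor partial averages.
    total = sum(n for n, _ in partials)
    center = sum(n * avg for n, avg in partials) / total
    return iteration + 1, cid, center, total  # one row for the centroid file

# Tiny driver standing in for the framework: one iteration on one "processor".
centers = {0: np.array([0.0, 0.0]), 1: np.array([10.0, 10.0])}
data = np.random.rand(50, 2) * 10
grouped = {}
for row in data:
    cid, coords = map_fn(row, centers)
    grouped.setdefault(cid, []).append(coords)
combined = [combine_fn(cid, rows) for cid, rows in grouped.items()]
new_rows = [reduce_fn(cid, [part], iteration=3) for cid, part in combined]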

K-Means Using Just Map and Reduce


Using just Map and Reduce, it is possible to do the same thing. In this case, the Map and Combine
functions described above are combined into a single function.
So, the Map function does the following:

Initializes itself with the cluster centers from the previous iteration;

Keeps information about each cluster in memory. This information is the total number of records
assigned to the cluster in the processor and the total of each coordinate.

For each record, it updates the information in memory.

Once all its records are processed, it outputs the same key-value pairs that the Combine function described above would have produced.

The Reduce function is the same as above.
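A sketch of folding the combine step into the mapper, again in plain Python with hypothetical names; a real implementation would hook the constructor, map, and close methods into the framework's setup, map, and cleanup callbacks.

import numpy as np

class KMeansMapper:
    def __init__(self, centers):
        # Initialize with the previous iteration's centers (a dict of numpy arrays);
        # keep per-cluster running counts and coordinate totals in memory.
        self.centers = centers
        self.counts = {cid: 0 for cid in centers}
        self.totals = {cid: np.zeros_like(c) for cid, c in centers.items()}

    def map(self, row):
        # Update the in-memory information for the nearest cluster.
        cid = min(self.centers, key=lambda c: np.linalg.norm(row - self.centers[c]))
        self.counts[cid] += 1
        self.totals[cid] += row

    def close(self):
        # Emit one (cluster id, (count, average)) pair per cluster, in the same
        # format as the Combine output above; the Reduce function is unchanged.
        return [(cid, (n, self.totals[cid] / n))
                for cid, n in self.counts.items() if n > 0]

centers = {0: np.array([0.0, 0.0]), 1: np.array([10.0, 10.0])}
mapper = KMeansMapper(centers)
for row in np.random.rand(50, 2) * 10:
    mapper.map(row)
pairs = mapper.close()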

K-Means Using SQL


Of course, one of my purposes in discussing MapReduce has been to understand whether and how it is
more powerful than SQL. For fifteen years, databases have been the only data-parallel application readily
available. The parallelism is hidden underneath the SQL language, so many people using SQL do not fully
appreciate the power they are using.

An iteration of k-means looks like:


SELECT @iteration + 1, cluster_id,
       AVERAGE(a.data) AS center
FROM (SELECT d.data, cc.cluster_id,
             ROW_NUMBER() OVER (PARTITION BY d.data
                                ORDER BY DISTANCE(d.data, cc.center)
                               ) AS ranking
      FROM data d CROSS JOIN
           (SELECT *
            FROM cluster_centers cc
            WHERE iteration = @iteration) cc
     ) a
WHERE ranking = 1
GROUP BY cluster_id
This code assumes the existence of functions or code for the AVERAGE() and DISTANCE() functions.
These are placeholders for the correct functions. Also, it uses analytic functions. (If you are not familiar
with these, I recommend my book Data Analysis Using SQL and Excel.)
The efficiency of the SQL code is determined, to a large extent, by the analytic function that ranks all the
cluster centers. We hope that a powerful parallel engine will recognize that the data is all in one place, and
hence that this function will be quite efficient.

A Final Note About K-Means Clustering


The K-Means clustering algorithm does require reading through all the data for each iteration through the
algorithm. In general, it tends to converge rather quickly (tens of iterations), so this may not be an issue.
Also, the I/O for reading the data can all be local I/O, rather than sending large amounts of data through
the network.
For most purposes, if you are dealing with a really big dataset, you can sample it down to a fraction of its
original size to get reasonable clusters. If you are not satisfied with this method, then sample the data,
find the centers of the clusters, and then use these to initialize the centers for the overall data. This will
probably reduce the number of iterations through the entire data to less than 10 (one pass for the sample,
a handful for the final clustering).
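A hedged sketch of that sample-then-initialize approach using scikit-learn (assumed to be available; the dataset, sample size, and k are placeholders): cluster a sample first, then seed the full run with the sample's centers.

import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(200_000, 5)   # placeholder for the full dataset
k = 8

# Cluster a sample first, then use its centers to initialize the full run.
sample = X[np.random.default_rng(0).choice(len(X), size=20_000, replace=False)]
seed_centers = KMeans(n_clusters=k, n_init=10, random_state=0).fit(sample).cluster_centers_

final = KMeans(n_clusters=k, init=seed_centers, n_init=1).fit(X)  # few iterations needed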
When running the algorithm on very large amounts of data, numeric overflow is a very real issue. This is
another reason why clustering locally, taking averages, and then taking the weighted average globally is
beneficial -- and why sampling is a good way to begin.
Also, before clustering, it is a good idea to standardize numeric variables (subtract the average and divide
by the standard deviation).
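For example, a minimal standardization step, here using scikit-learn's StandardScaler (an assumption; plain numpy works just as well):

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.random.rand(1000, 5) * [1, 10, 100, 1000, 10000]  # columns on very different scales

# Subtract each column's average and divide by its standard deviation.
X_std = StandardScaler().fit_transform(X)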

Decision trees and clustering


When choosing between decision trees and clustering, remember that decision trees are themselves a
clustering method. The leaves of a decision tree contain clusters of records that are similar to one another
and dissimilar from records in other leaves. The difference between the clusters found with a decision tree
and the clusters found using other methods such as K-means, agglomerative algorithms, or self-organizing maps is that decision trees are directed, while the other techniques I mentioned are undirected.
Decision trees are appropriate when there is a target variable for which all records in a cluster should have
a similar value. Records in a cluster will also be similar in other ways since they are all described by the

same set of rules, but the target variable drives the process. People often use undirected clustering
techniques when a directed technique would be more appropriate. In your case, I think you made the
correct choice because you can easily come up with a target variable, such as the percentage of
cancellations, alterations, and no-shows in a market.
You can make a model set that has one row per market. One column, the target, will be the percentage of
reservations that get changed or cancelled. The other columns will contain everything you know about the
market: number of flights, number of connections, ratio of business to leisure travelers, number of
carriers, ratio of transit passengers to origin or destination passengers, percentage of same day bookings,
same week bookings, same month bookings, and whatever else comes to mind. A decision tree will
produce some leaves with trustworthy bookings and some with untrustworthy bookings, and the paths
from the root to these leaves will be descriptions of the clusters.
