Silhouette (Clustering): Method
Silhouette refers to a method of interpretation and validation of clusters of data. The technique
provides a succinct graphical representation of how well each object lies within its cluster. It was first
described by Peter J. Rousseeuw in 1987.[1]
Method
Assume the data have been clustered via any technique, such as k-means, into $k$ clusters. For each datum $i$, let $a(i)$ be the average dissimilarity of $i$ with all other data within the same cluster. Any measure of dissimilarity can be used, but distance measures are the most common. We can interpret $a(i)$ as how well $i$ is assigned to its cluster (the smaller the value, the better the assignment). We then define the average dissimilarity of point $i$ to a cluster $C$ as the average of the distances from $i$ to the points in $C$.

Let $b(i)$ be the lowest average dissimilarity of $i$ to any other cluster of which $i$ is not a member. The cluster with this lowest average dissimilarity is said to be the "neighbouring cluster" of $i$ because it is the next best fit cluster for point $i$. We now define:

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$

For $s(i)$ to be close to 1 we require $a(i) \ll b(i)$. As $a(i)$ is a measure of how dissimilar $i$ is to its own cluster, a small value means it is well matched. Furthermore, a large $b(i)$ implies that $i$ is badly matched to its neighbouring cluster. The average $s(i)$ over all data of a cluster is a measure of how tightly grouped all the data in the cluster are, so silhouette plots and averages may be used to determine the natural number of clusters within a dataset.
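The definition above translates directly into a short computation. The following is a minimal sketch in Python, assuming Euclidean distance as the dissimilarity and using NumPy only; the function name silhouette_values is ours, not from any library:

```python
import numpy as np

def silhouette_values(X, labels):
    """Compute s(i) = (b(i) - a(i)) / max(a(i), b(i)) for every point,
    using Euclidean distance as the dissimilarity measure."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all points.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    s = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        if same.sum() == 1:
            continue  # convention: a singleton cluster gets s(i) = 0
        # a(i): average distance to the *other* members of i's own cluster
        # (dists[i, i] is zero, so divide by the count minus one).
        a = dists[i, same].sum() / (same.sum() - 1)
        # b(i): lowest average distance to the points of any other cluster.
        b = min(dists[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s
```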
The book is done! All 822 pages of the third edition of Data Mining Techniques for Marketing, Sales, and
Customer Relationship Management will be hitting bookstore shelves later this month or you can order it
now. To celebrate, I am returning to the blog.
One of the areas where Gordon and I have added a lot of new material is clustering. In this post, I want to
share a nice measure of cluster goodness first described by Peter Rousseeuw in 1987. Intuitively, good
clusters have the property that cluster members are close to each other and far from members of other
clusters. That is what is captured by a cluster's silhouette.
To calculate a cluster's silhouette, first calculate the average distance within the cluster. Each cluster member has its own average distance from all other members of the same cluster. This is its dissimilarity from its cluster. Cluster members with low dissimilarity are comfortably within the cluster to which they have been assigned. The average dissimilarity for a cluster is a measure of how compact it is.
The average distance to fellow cluster members is then compared to the average distance to members of the neighboring cluster, the cluster that is the next best fit for the point. Note that two members of the same cluster may have different neighboring clusters; for points that are close to the boundary between two clusters, the two dissimilarity scores may be nearly equal. The pictures below show this process for one point (17, 27).
A point's silhouette compares its average dissimilarity a to its own cluster with its average dissimilarity b to its nearest neighboring cluster: s = (b - a) / max(a, b). The typical range of the score is from zero, when a record is right on the boundary of two clusters, to one, when it is identical to the other records in its own cluster. A record with a = 2 and b = 6, for example, has a silhouette of (6 - 2) / 6, or about 0.67. In theory, the silhouette score can go from negative one to one. A negative value means that the record is more similar to the records of its neighboring cluster than to other members of its own cluster. To see how this could happen, imagine forming clusters
using an agglomerative algorithm and single-linkage distance. Single-linkage says the distance from a
point to a cluster is the distance to the nearest member of that cluster. Suppose the data consists of many records with the value 32 and many others with the value 64, along with a scattering of records with values from 32 to 50. In the first step, all the records at distance zero are combined into two tight clusters. In the next step, records at distance one are combined, causing some 33s to be added to the left cluster, followed by 34s, 35s, and so on. Eventually, the left cluster will swallow records that would feel happier in the
right cluster.
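Here is a short sketch of that failure mode, assuming SciPy and scikit-learn are available; the values mirror the example above:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import silhouette_samples

# Many records at 32 and at 64, plus a chain of values from 33 to 50.
values = np.concatenate([
    np.full(100, 32.0),
    np.full(100, 64.0),
    np.arange(33.0, 51.0),   # 33, 34, ..., 50
])
X = values.reshape(-1, 1)

# Single-linkage agglomerative clustering, cut into two clusters.
labels = fcluster(linkage(X, method='single'), t=2, criterion='maxclust')

# Chaining pulls the whole 33..50 run into the "32" cluster, so the
# records near 50, which are closer to 64, get negative silhouettes.
s = silhouette_samples(X, labels)
print("lowest silhouette:", round(s.min(), 3))
```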
The silhouette score for an entire cluster is calculated as the average of the silhouette scores of its
members. This measures the degree of similarity of cluster members. The silhouette of the entire dataset
is the average of the silhouette scores of all the individual records. This is a measure of how appropriately
the data has been
clustered. What is nice about this measure is that it can be applied at the level of the dataset to determine which clusterings are not very good and at the level of a cluster to determine which members do not fit in
very well. The silhouette can be used to choose an appropriate value for k in k-means by trying each value
of
k in the acceptable range and choosing the one that yields the best silhouette. It can also be used to
compare clusters produced by different random seeds.
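Here is a minimal sketch of that search, assuming scikit-learn's KMeans and silhouette_score; the helper name choose_k is our own:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def choose_k(X, k_range=range(2, 11), random_state=0):
    """Try each k and keep the one whose k-means clustering has the
    best average silhouette. (The default range is our own choice.)"""
    scores = {}
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10,
                        random_state=random_state).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    best = max(scores, key=scores.get)
    return best, scores
```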
The final picture shows the silhouette scores for the three clusters in the example.
This is the centroid file. An iteration through the algorithm is going to add another set of rows to this file.
This information is the only information that needs to be communicated globally.
There are two ways to do this in the MapReduce framework. The first uses map, combine, and reduce. The
second only uses map and reduce.
Now, this is a lot of data, so we use a Combine function to reduce the size before sending it to Reduce. The
Combine function calculates the average of the coordinates for each cluster id, along with the number of
records. This is simple, and it produces one record of output for each cluster, keyed by the cluster id.
The amount of data now is the number of clusters times the number of processors times the size of the
information needed to define each cluster. This is small relative to the data size.
The Reduce function (and one of these is probably sufficient for this problem regardless of data size and the number of Maps) calculates the weighted average of its input. Its output should be written to a file, and contain the cluster id and the new cluster center.
The Map function initializes itself with the cluster centers from the previous iteration and keeps information about each cluster in memory: the total number of records assigned to the cluster in the processor and the total of each coordinate. It then outputs the key-value pairs for the Combine function described above.
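Here is a minimal single-process sketch of these three steps; the function names and record layout are our own illustration, not any particular framework's API. (The text above has Map itself keep running totals; for brevity, that bookkeeping is folded into the combine step here.)

```python
import numpy as np

def map_records(points, centers):
    """Map: assign each record to its nearest center and emit
    (cluster_id, coordinates) pairs for the Combine step."""
    centers = np.asarray(centers, dtype=float)
    for p in np.asarray(points, dtype=float):
        cid = int(np.argmin(np.linalg.norm(centers - p, axis=1)))
        yield cid, p

def combine(pairs):
    """Combine: one record per cluster id, holding the local average of
    the coordinates and the number of records that went into it."""
    sums, counts = {}, {}
    for cid, p in pairs:
        sums[cid] = sums.get(cid, 0.0) + p
        counts[cid] = counts.get(cid, 0) + 1
    for cid in sums:
        yield cid, (counts[cid], sums[cid] / counts[cid])

def reduce_centers(combined):
    """Reduce: the weighted average of the per-processor averages gives
    the new centers, one row per cluster for the centroid file."""
    counts, weighted = {}, {}
    for cid, (n, mean) in combined:
        counts[cid] = counts.get(cid, 0) + n
        weighted[cid] = weighted.get(cid, 0.0) + n * mean
    return {cid: weighted[cid] / counts[cid] for cid in counts}

# One iteration: simulate two "processors", each combining locally.
centers = np.array([[0.0, 0.0], [10.0, 10.0]])
chunk1 = [[0.5, 0.2], [9.5, 10.1]]
chunk2 = [[0.1, -0.3], [10.2, 9.8]]
combined = list(combine(map_records(chunk1, centers))) + \
           list(combine(map_records(chunk2, centers)))
print(reduce_centers(combined))
```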
Directed clustering follows the same set of rules, but the target variable drives the process. People often use undirected clustering techniques when a directed technique would be more appropriate. In your case, I think you made the correct choice because you can easily come up with a target variable, such as the percentage of cancellations, alterations, and no-shows in a market.
You can make a model set that has one row per market. One column, the target, will be the percentage of
reservations that get changed or cancelled. The other columns will contain everything you know about the
market--number of flights, number of connections, ratio of business to leisure travelers, number of
carriers, ratio of transit passengers to origin or destination passengers, percentage of same day bookings,
same week bookings, same month bookings, and whatever else comes to mind. A decision tree will produce some leaves with trustworthy bookings and some with untrustworthy bookings, and the paths from the root to these leaves will be descriptions of the clusters.
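Here is a sketch of that model set and tree, assuming scikit-learn; every column name and value below is invented purely for illustration:

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor, export_text

# Hypothetical model set: one row per market. The target, pct_changed,
# is the share of reservations that get changed or cancelled.
markets = pd.DataFrame({
    "n_flights":      [120, 45, 300, 80, 60, 210],
    "n_connections":  [10, 2, 40, 5, 3, 25],
    "business_ratio": [0.6, 0.2, 0.7, 0.3, 0.25, 0.65],
    "same_day_share": [0.15, 0.05, 0.25, 0.08, 0.06, 0.2],
    "pct_changed":    [0.30, 0.10, 0.35, 0.12, 0.11, 0.32],  # the target
})

features = markets.drop(columns="pct_changed")
tree = DecisionTreeRegressor(max_depth=3, min_samples_leaf=2, random_state=0)
tree.fit(features, markets["pct_changed"])

# Each root-to-leaf path is a description of one "cluster" of markets.
print(export_text(tree, feature_names=list(features.columns)))
```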