Introduction to Cluster Analysis
Unit 2 : Chapter 2
Contents
• Classification vs. Clustering
• Clustering
• Types of data in cluster analysis
• Clustering Methods
• Partitioning Methods
• Hierarchical Methods
• Density-Based Methods
Classification vs. Clustering
Clustering
• What Is Cluster Analysis?
• Cluster analysis or simply clustering is the process of partitioning a set of data
objects (or observations) into subsets. Each subset is a cluster, such that objects in
a cluster are similar to one another, yet dissimilar to objects in other clusters.
• The set of clusters resulting from a cluster analysis can be referred to as a
clustering.
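• For example, the “similar within a cluster, dissimilar across clusters” idea can be made concrete with Euclidean distance. A minimal NumPy sketch on two small hand-made groups of points (the data here are hypothetical):

```python
import numpy as np

# Two hand-made groups of 2-D points (hypothetical illustration data).
cluster_a = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1]])
cluster_b = np.array([[5.0, 5.0], [5.1, 4.8], [4.9, 5.2]])

def avg_pairwise_dist(p, q):
    # Mean Euclidean distance between every point of p and every point of q.
    return np.mean(np.linalg.norm(p[:, None, :] - q[None, :, :], axis=-1))

print("within cluster A :", avg_pairwise_dist(cluster_a, cluster_a))  # small
print("between A and B  :", avg_pairwise_dist(cluster_a, cluster_b))  # large
```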
Scalability:
• Many clustering algorithms work well on small data sets containing fewer than
several hundred data objects; however, a large database may contain millions or
even billions of objects, particularly in Web search scenarios.
• Clustering on only a sample of a given large data set may lead to biased results.
Therefore, highly scalable clustering algorithms are needed.
Partitioning methods:
• Given a set of n objects, a partitioning method constructs k partitions of the data, where each partition represents a cluster and k ≤ n. That is, it divides the data into k groups such that each group must contain at least one object.
• In other words, partitioning methods conduct one-level partitioning on data sets.
• The basic partitioning methods typically adopt exclusive cluster separation. That is, each object must belong to exactly one group.
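• As a sketch of exclusive one-level partitioning, the snippet below runs k-means (a well-known partitioning method, discussed next) from scikit-learn on synthetic data and checks the two properties above: every object gets exactly one label, and no group is empty.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100, centers=3, random_state=0)  # n = 100 objects
k = 3                                                        # k <= n clusters

labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)

# Exclusive cluster separation: each object belongs to exactly one group...
assert labels.shape == (100,)
# ...and each group contains at least one object.
assert all(np.sum(labels == c) >= 1 for c in range(k))
```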
A Centroid-Based Technique: k-Means
Problem
• Refer class notes
How can we make the k-means algorithm more scalable?
• One approach to making the k-means method more efficient on large data sets is
to use a good-sized set of samples in clustering.
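• One concrete realization of this sampling idea is mini-batch k-means, which refines the centroids using small random samples rather than the whole data set. A sketch with scikit-learn's MiniBatchKMeans on synthetic data (the slides do not prescribe this particular variant):

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

# A larger synthetic data set standing in for a "large database".
X, _ = make_blobs(n_samples=100_000, centers=5, random_state=0)

# Each iteration updates the centroids from a random mini-batch of
# 1024 objects instead of all 100,000 objects.
mbk = MiniBatchKMeans(n_clusters=5, batch_size=1024, n_init=3, random_state=0)
labels = mbk.fit_predict(X)
print(mbk.cluster_centers_)
```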
• By contrast, the k-medoids method uses actual representative objects (medoids) instead of means. For large values of n and k, computing the medoids becomes very costly, much more costly than the k-means method.
Problems
• Refer class notes
How can we scale up the k-medoids method?
• To deal with larger data sets, a sampling-based method called CLARA (Clustering
LARge Applications) can be used.
• Instead of taking the whole data set into consideration, CLARA uses a random
sample of the data set. The PAM algorithm is then applied to compute the best
medoids from the sample.
• Ideally, the sample should closely represent the original data set. In many cases, a
large sample works well if it is created so that each object has equal probability of
being selected into the sample.
• The representative objects (medoids) chosen will likely be similar to those that would have been chosen from the whole data set. CLARA builds clusterings from multiple random samples and returns the best clustering as the output.
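• The outer loop of CLARA can be sketched as follows. This is a simplified illustration on synthetic data: pam_on_sample below is a random-restart stand-in for the real PAM algorithm (which improves medoids by iterative swapping), but the CLARA structure, clustering several samples and keeping the medoids that score best on the whole data set, is as described above.

```python
import numpy as np

rng = np.random.default_rng(0)

def cost(X, medoid_points):
    # Total distance from each object in X to its nearest medoid.
    d = np.linalg.norm(X[:, None, :] - medoid_points[None, :, :], axis=-1)
    return d.min(axis=1).sum()

def pam_on_sample(sample, k, restarts=50):
    # Stand-in for PAM: best of several random medoid sets drawn from
    # the sample (real PAM refines medoids with iterative swaps).
    best, best_cost = None, np.inf
    for _ in range(restarts):
        cand = sample[rng.choice(len(sample), size=k, replace=False)]
        c = cost(sample, cand)
        if c < best_cost:
            best, best_cost = cand, c
    return best

def clara(X, k, n_samples=5, sample_size=40):
    # CLARA: run PAM on several random samples and keep the medoids
    # that give the lowest cost on the WHOLE data set.
    best, best_cost = None, np.inf
    for _ in range(n_samples):
        sample = X[rng.choice(len(X), size=sample_size, replace=False)]
        medoids = pam_on_sample(sample, k)
        c = cost(X, medoids)            # evaluate on the full data set
        if c < best_cost:
            best, best_cost = medoids, c
    return best, best_cost

# Synthetic data: three diagonal groups of points.
X = rng.normal(size=(500, 2)) + rng.choice([-4.0, 0.0, 4.0], size=(500, 1))
medoids, c = clara(X, k=3)
print("medoids:\n", medoids, "\ntotal cost:", c)
```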
Hierarchical methods:
• The divisive approach, also called the top-down approach, starts with all the objects in the same cluster. In each successive iteration, a cluster is split into smaller clusters, until eventually each object is in its own cluster or a termination condition holds.
• Hierarchical clustering methods can be distance-based or density- and continuity-based.
• Hierarchical methods suffer from the fact that once a step (a merge or split) is done, it can never be undone. This rigidity is useful in that it leads to smaller computation costs by not having to worry about a combinatorial number of different choices.
• Such techniques cannot correct erroneous decisions; however, methods for improving the quality of hierarchical clustering have been proposed.
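• The bottom-up (agglomerative) counterpart is available directly in scikit-learn; the linkage parameter selects the inter-cluster distance measure. A brief sketch on synthetic data:

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=30, centers=3, random_state=1)

# Bottom-up merging; 'linkage' chooses how inter-cluster distance is
# measured: 'single' (minimum), 'complete' (maximum), 'average', or 'ward'.
agg = AgglomerativeClustering(n_clusters=3, linkage="complete")
print(agg.fit_predict(X))
```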
Example
• DIANA: all the objects are used to form one initial cluster. The cluster is split according to some principle, such as the maximum Euclidean distance between the closest neighboring objects in the cluster.
• A tree structure called a dendrogram is commonly used to represent the process of
hierarchical clustering.
• It shows how objects are grouped together (in an agglomerative method) or partitioned (in a divisive method).
• In a dendrogram for five objects, l = 0 shows the five objects as singleton clusters at level 0. At l = 1, objects a and b are grouped together to form the first cluster, and they stay together at all subsequent levels.
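• A dendrogram like the one described above can be drawn with SciPy. In this sketch the five 2-D points (hypothetical stand-ins for objects a–e) are chosen so that a and b are closest and therefore merge first:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# Five hypothetical 2-D objects a-e; a and b are closest, so they are
# merged first (at the lowest level of the dendrogram).
X = np.array([[0.0, 0.0], [0.3, 0.1], [3.0, 3.0], [3.2, 2.8], [6.0, 0.5]])

Z = linkage(X, method="single")   # agglomerative merge sequence
dendrogram(Z, labels=["a", "b", "c", "d", "e"])
plt.show()
```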
Challenges and Solutions
Density-based methods:
• Here the general idea is to continue growing a given cluster as long as the density
(number of objects or data points) in the “neighborhood” exceeds some threshold.
• For example, for each data point within a given cluster, the neighborhood of a
given radius has to contain at least a minimum number of points.
• Such a method can be used to filter out noise or outliers and discover clusters of
arbitrary shape.
• Density-based methods can divide a set of objects into multiple exclusive
clusters, or a hierarchy of clusters.
• Typically, density-based methods consider exclusive clusters only, and do not
consider fuzzy clusters.
• Moreover, density-based methods can be extended from full space to subspace
clustering.
• Density-based algorithms require two parameters: the minimum number of points needed to form a cluster (MinPts) and a radius threshold (ε) that defines the neighborhood of each point.
• The commonly used density-based clustering algorithm DBSCAN groups data points that are close together and can discover clusters of arbitrary shape.
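• A short DBSCAN sketch with scikit-learn shows both parameters in action on synthetic moon-shaped data (eps plays the role of the radius ε, min_samples that of MinPts):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved half-moons: clusters of arbitrary, non-spherical shape.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps: radius defining each point's neighborhood;
# min_samples: minimum number of points required to form a dense region.
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Points labeled -1 fell in no dense region: they are treated as noise.
print("clusters:", len(set(labels) - {-1}), "| noise points:", (labels == -1).sum())
```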
Advantages
• Density-based clustering algorithms can effectively handle noise and outliers in
the dataset, making them robust in such scenarios.
• These algorithms can identify clusters of arbitrary shapes and sizes, unlike clustering algorithms that assume specific cluster forms.
• They don’t require prior knowledge of the number of clusters, making them more
flexible and versatile.
• They can efficiently process large datasets and handle high-dimensional data.
Disadvantages
• The performance of density-based clustering algorithms depends heavily on the choice of parameters, such as ε and MinPts, which can be challenging to tune (one common heuristic is sketched after this list).
• These algorithms may not be suitable for datasets with low-density regions or
evenly distributed data points.
• They can be computationally expensive and time-consuming, especially for large
datasets with complex structures.
• Density-based clustering can struggle to identify clusters of varying densities or scales.
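• For the parameter-tuning difficulty, one common heuristic is the k-distance plot: sort every point's distance to its MinPts-th nearest neighbor and pick ε near the “knee” of the curve. A sketch (reading off the knee remains a manual judgment call):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors
import matplotlib.pyplot as plt

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

min_pts = 5
# Distance from each point to its min_pts-th nearest neighbor.
nn = NearestNeighbors(n_neighbors=min_pts).fit(X)
dist, _ = nn.kneighbors(X)
k_dist = np.sort(dist[:, -1])

# The "knee" of this curve is a reasonable starting value for eps.
plt.plot(k_dist)
plt.xlabel("points (sorted)")
plt.ylabel(f"distance to {min_pts}-th nearest neighbor")
plt.show()
```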
Summary of methods
Evaluation of Clustering
• Assessing clustering tendency. In this task, for a given data set, we assess whether a nonrandom structure exists in the data. Blindly applying a clustering method on a data set will return clusters; however, the clusters mined may be misleading.
• Clustering analysis on a data set is meaningful only when there is a nonrandom
structure in the data.
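• One way to quantify clustering tendency is the Hopkins statistic, which compares nearest-neighbor distances of uniformly generated points against those of sampled real points: values near 0.5 suggest random data, values approaching 1 suggest clusterable structure. A minimal sketch (conventions for this statistic vary slightly between texts):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def hopkins(X, m=50, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    nn = NearestNeighbors().fit(X)

    # u: NN distances from m uniform points in the data's bounding box.
    U = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))
    u = nn.kneighbors(U, n_neighbors=1)[0].sum()

    # w: distances from m sampled real points to their nearest OTHER point
    # (index 0 is the point itself at distance 0).
    idx = rng.choice(n, size=m, replace=False)
    w = nn.kneighbors(X[idx], n_neighbors=2)[0][:, 1].sum()

    return u / (u + w)   # ~0.5 for random data, -> 1 for clustered data

rng = np.random.default_rng(1)
clustered = np.vstack([rng.normal(0, 0.3, (100, 2)), rng.normal(5, 0.3, (100, 2))])
uniform = rng.uniform(0, 5, (200, 2))
print("clustered:", hopkins(clustered), "uniform:", hopkins(uniform))
```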
• Determining the number of clusters in a data set. A few algorithms, such as k-means, require the number of clusters in a data set as a parameter. Moreover, the number of clusters can be regarded as an interesting and important summary statistic of a data set.
• Therefore, it is desirable to estimate this number even before a clustering
algorithm is used to derive detailed clusters.
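• One simple estimate is the elbow method: run k-means for a range of k values and look for the point where the within-cluster sum of squares stops dropping sharply. A sketch (reading off the elbow is again a judgment call):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

# inertia_ = within-cluster sum of squared distances; it always decreases
# as k grows, but the drop flattens past the "natural" number of clusters.
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))
```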
• Measuring clustering quality. After applying a clustering method on a data set,
we want to assess how good the resulting clusters are. A number of measures can
be used.
• Some methods measure how well the clusters fit the data set, while others
measure how well the clusters match the ground truth, if such truth is available.
• There are also measures that score clusterings and thus can compare two sets of clustering results on the same data set.
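• Both kinds of measure are available in scikit-learn: the silhouette coefficient judges how well the clusters fit the data without any labels (intrinsic), while the adjusted Rand index compares a clustering against ground truth when such truth exists (extrinsic). A brief sketch on synthetic data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, adjusted_rand_score

X, y_true = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Intrinsic: how well the clusters fit the data (no ground truth needed).
print("silhouette:", silhouette_score(X, labels))

# Extrinsic: agreement with ground truth, when such truth is available.
print("adjusted Rand index:", adjusted_rand_score(y_true, labels))
```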
THANK YOU