
Introduction to Cluster Analysis
Unit 2: Chapter 2
Contents
• Classification v/s Clustering
• Clustering
• Types of data in cluster analysis
• Clustering Methods
• Partitioning Methods
• Hierarchical Methods
• Density-Based Methods
Classification v/s Clustering
Clustering
• What Is Cluster Analysis?
• Cluster analysis, or simply clustering, is the process of partitioning a set of data objects (or observations) into subsets. Each subset is a cluster, such that objects in a cluster are similar to one another, yet dissimilar to objects in other clusters.
• The set of clusters resulting from a cluster analysis can be referred to as a clustering.
• It is a process of grouping a set of data objects into multiple groups or clusters so that objects within a cluster have high similarity, but are very dissimilar to objects in other clusters.
Applications
• Clustering analysis is broadly used in many applications such as market research,
pattern recognition, data analysis, and image processing.
• Clustering can also help marketers discover distinct groups in their customer base, and characterize these groups based on their purchasing patterns.
• In the field of biology, it can be used to derive plant and animal taxonomies, categorize genes with similar functionalities, and gain insight into structures inherent in populations.
• Clustering is also used in outlier detection applications such as detection of credit
card fraud.
• Clustering also helps in classifying documents on the web for information
discovery.
• Clustering also helps in identification of areas of similar land use in an earth
observation database. It also helps in the identification of groups of houses in a
city according to house type, value, and geographic location.
• As a data mining function, cluster analysis serves as a tool to gain insight into the distribution of data and to observe the characteristics of each cluster.
Requirements for Cluster Analysis

Scalability:
• Many clustering algorithms work well on small data sets containing fewer than
several hundred data objects; however, a large database may contain millions or
even billions of objects, particularly in Web search scenarios.
• Clustering on only a sample of a given large data set may lead to biased results.
Therefore, highly scalable clustering algorithms are needed.

Ability to deal with different types of attributes:
• Many algorithms are designed to cluster numeric (interval-based) data. However, applications may require clustering other data types, such as binary, nominal (categorical), and ordinal data, or mixtures of these data types.
Requirements for domain knowledge to determine input parameters:
• Many clustering algorithms require users to provide domain knowledge in the
form of input parameters such as the desired number of clusters. Consequently,
the clustering results may be sensitive to such parameters.
• Parameters are often hard to determine, especially for high-dimensional data sets where users have yet to develop a deep understanding of their data.

Ability to deal with noisy data:
• Most real-world data sets contain outliers and/or missing, unknown, or erroneous data. Clustering algorithms can be sensitive to such noise and may produce poor-quality clusters. Therefore, we need clustering methods that are robust to noise.
Discovery of clusters with arbitrary shape:
• Many clustering algorithms determine clusters based on Euclidean or Manhattan
distance measures. Algorithms based on such distance measures tend to find
spherical clusters with similar size and density. It is important to develop
algorithms that can detect clusters of arbitrary shape.
Incremental clustering and insensitivity to input order:
• In many applications, incremental updates (representing newer data) may arrive at
any time. Some clustering algorithms cannot incorporate incremental updates into
existing clustering structures and, instead, have to recompute a new clustering
from scratch.
• Clustering algorithms may also be sensitive to the input data order. Incremental
clustering algorithms and algorithms that are insensitive to the input order are
needed.
• Clustering algorithms typically operate on either of the following two data
structures:
• – Data matrix
• – Dissimilarity matrix
Data Matrix
Dissimilarity Matrix
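• The sketch below (illustrative values, not from the slides) builds a small data matrix of n objects by p attributes and derives the corresponding n × n dissimilarity matrix with Euclidean distance, assuming NumPy and SciPy are available.

```python
# A minimal sketch: a data matrix (objects x attributes) and the symmetric
# dissimilarity matrix computed from it with Euclidean distance.
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Data matrix: 4 objects described by 2 numeric (interval-scaled) attributes.
X = np.array([[1.0, 2.0],
              [1.5, 1.8],
              [8.0, 8.0],
              [9.0, 11.0]])

# Dissimilarity matrix: zero diagonal, entry (i, j) is the Euclidean distance
# between objects i and j; squareform expands the condensed pdist output to n x n.
D = squareform(pdist(X, metric="euclidean"))
print(D.round(2))
```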
Types of Data in Cluster Analysis
• Dissimilarity can be computed for the following variable types (a short example follows this list):
• – Interval-scaled (numeric) variables
• – Binary variables
• – Categorical (nominal) variables
• – Ordinal variables
• – Ratio variables
• – Mixed-type variables
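• The short example below (a hedged sketch with made-up toy vectors) shows dissimilarity for two of the non-numeric types listed above: Jaccard dissimilarity for binary variables and simple matching (Hamming) dissimilarity for nominal variables.

```python
# Dissimilarity for binary and nominal attributes using SciPy's distance helpers.
import numpy as np
from scipy.spatial.distance import jaccard, hamming

# Binary variables (e.g., attribute present / absent) for two objects:
# Jaccard ignores 0/0 matches, which suits asymmetric binary attributes.
a = np.array([1, 0, 1, 1, 0], dtype=bool)
b = np.array([1, 1, 0, 1, 0], dtype=bool)
print("Jaccard dissimilarity:", jaccard(a, b))

# Nominal (categorical) variables: fraction of attributes that disagree.
p = np.array(["red", "round", "small"])
q = np.array(["red", "square", "small"])
print("Simple matching dissimilarity:", hamming(p, q))
```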
Ratio Scaled
Clustering Methods

Partitioning methods:
• Given a set of n objects, a partitioning method constructs k partitions of the data, where each partition represents a cluster and k ≤ n. That is, it divides the data into k groups such that each group must contain at least one object.
• In other words, partitioning methods conduct one-level partitioning on data sets.
• The basic partitioning methods typically adopt exclusive cluster separation. That is, each object must belong to exactly one group.
A Centroid-Based Technique: k-Means
Problem
• Refer class notes
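• Since the worked problem is left to the class notes, the following is only a minimal sketch of the centroid-based (k-means) iteration itself, assuming numeric data in a NumPy array; the toy points and k = 2 are illustrative choices.

```python
# Lloyd's iteration for k-means: assign each object to its nearest centroid,
# then move each centroid to the mean of its assigned objects, until stable.
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # k random objects
    for _ in range(n_iter):
        # Assignment step: each object joins the cluster of its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its cluster
        # (an empty cluster simply keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
print(kmeans(X, k=2))
```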
How can we make the k-means algorithm more scalable?
• One approach to making the k-means method more efficient on large data sets is to use a good-sized set of samples in clustering (this sampling idea is sketched below).
• Another is to employ a filtering approach that uses a spatial hierarchical data index to save costs when computing means.
• A third approach explores the micro-clustering idea, which first groups nearby objects into "micro-clusters" and then performs k-means clustering on the micro-clusters.
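• One practical illustration of the sampling idea (a hedged sketch, not the exact filtering or micro-clustering methods above) is scikit-learn's MiniBatchKMeans, which updates centroids from small random batches rather than the full data set on every iteration.

```python
# Mini-batch k-means on a larger synthetic data set.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
# 30,000 points drawn around three well-separated centers.
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(10_000, 2)) for c in (0, 5, 10)])

mbk = MiniBatchKMeans(n_clusters=3, batch_size=1024, n_init=3, random_state=0)
labels = mbk.fit_predict(X)
print(mbk.cluster_centers_)
```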
What Is the Problem of the K-Means Method?
The K-Medoids Clustering Method: A Representative Object-Based Technique
1. Initialize: select k random points out of the n data points as the medoids.
2. Associate each data point with the closest medoid using any common distance metric.
3. While the cost decreases: for each medoid m and for each data point o that is not a medoid:
4. Swap m and o, associate each data point with the closest medoid, and recompute the cost.
5. If the total cost is more than that in the previous step, undo the swap.
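• A hedged, brute-force sketch of the swap-based steps above (PAM-style) is given below; it is only suitable for small n, since every pass tries on the order of k(n − k) candidate swaps.

```python
# k-medoids by repeated medoid/non-medoid swaps that are kept only when they
# lower the total distance of all points to their closest medoid.
import numpy as np

def kmedoids(X, k, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    medoids = list(rng.choice(n, size=k, replace=False))       # step 1: random medoids

    def cost(meds):
        # Steps 2/4: associate each point with its closest medoid, sum the distances.
        return D[:, meds].min(axis=1).sum()

    best = cost(medoids)
    improved = True
    while improved:                          # step 3: repeat while the cost decreases
        improved = False
        for i in range(k):
            for o in range(n):
                if o in medoids:
                    continue
                trial = medoids.copy()
                trial[i] = o                 # step 4: try swapping medoid i with o
                c = cost(trial)
                if c < best:                 # step 5: keep the swap only if it helps
                    best, medoids, improved = c, trial, True
    labels = D[:, medoids].argmin(axis=1)
    return medoids, labels

X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [30.0, 30.0]])   # last point is an outlier
print(kmedoids(X, k=2))
```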
Which method is more robust: k-means or k-medoids?
• The k-medoids method is more robust than k-means in the presence of noise and outliers because a medoid is less influenced by outliers or other extreme values than a mean.
• However, the complexity of each iteration in the k-medoids algorithm is O(k(n − k)²).
• For large values of n and k, such computation becomes very costly, and much more costly than the k-means method.
Problems
• Refer class notes
How can we scale up the k-medoids method?
• To deal with larger data sets, a sampling-based method called CLARA (Clustering
LARge Applications) can be used.

• Instead of taking the whole data set into consideration, CLARA uses a random
sample of the data set. The PAM algorithm is then applied to compute the best
medoids from the sample.

• Ideally, the sample should closely represent the original data set. In many cases, a
large sample works well if it is created so that each object has equal probability of
being selected into the sample.
• The representative objects (medoids) chosen will likely be similar to those that would have been chosen from the whole data set. CLARA builds clusterings from multiple random samples and returns the best clustering as the output (a sketch of this sampling scheme follows).
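• The sketch below illustrates the CLARA idea under the same assumptions: it reuses the kmedoids() sketch defined earlier, runs it on random samples only, scores each candidate medoid set against the full data set, and keeps the best; the sample size and number of samples are illustrative defaults, not prescribed values.

```python
# CLARA-style sampling wrapper around the earlier kmedoids() sketch.
import numpy as np

def clara(X, k, n_samples=5, sample_size=40, seed=0):
    rng = np.random.default_rng(seed)
    D_full = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    best_medoids, best_cost = None, np.inf
    for _ in range(n_samples):
        # Run the expensive medoid search on a random sample only.
        idx = rng.choice(len(X), size=min(sample_size, len(X)), replace=False)
        sample_medoids, _ = kmedoids(X[idx], k, seed=int(rng.integers(1_000_000)))
        medoids = idx[np.asarray(sample_medoids)]        # map back to full-data indices
        cost = D_full[:, medoids].min(axis=1).sum()      # score on the whole data set
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    labels = D_full[:, best_medoids].argmin(axis=1)
    return best_medoids, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in (0, 6)])
medoids, labels = clara(X, k=2)
print("chosen medoid indices:", medoids)
```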
Hierarchical methods:
• A hierarchical method creates a hierarchical decomposition of the given set of data objects.
• A hierarchical method can be classified as being either agglomerative or divisive,
based on how the hierarchical decomposition is formed.
• The agglomerative approach, also called the bottom-up approach, starts with each
object forming a separate group. It successively merges the objects or groups
close to one another, until all the groups are merged into one (the topmost level of
the hierarchy), or a termination condition holds.

• The divisive approach, also called the top-down approach, starts with all the
objects in the same cluster. In each successive iteration, a cluster is split into
smaller clusters, until eventually each object is in one cluster, or a termination
condition holds.
• Hierarchical clustering methods can be distance-based or density- and continuity-based.
• Hierarchical methods suffer from the fact that once a step (merge or split) is
done, it can never be undone. This rigidity is useful in that it leads to smaller
computation costs by not having to worry about a combinatorial number of
different choices.
• Such techniques cannot correct erroneous decisions; however, methods for
improving the quality of hierarchical clustering have been proposed.
Example
Distance measure: distance between the clusters
DIANA: All the objects are used to form one initial cluster. The cluster is split according to some principle, such as the maximum Euclidean distance between the closest neighboring objects in the cluster.
• A tree structure called a dendrogram is commonly used to represent the process of
hierarchical clustering.
• It shows how objects are grouped together (in an agglomerative method) or
partitioned (in a divisive method)
• A dendrogram for the five objects in the example: level l = 0 shows the five objects as singleton clusters. At l = 1, objects a and b are grouped together to form the first cluster, and they stay together at all subsequent levels (a plotting sketch follows).
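• A minimal plotting sketch (illustrative coordinates; labels a through e chosen to mirror the five-object example) of agglomerative clustering and its dendrogram with SciPy and Matplotlib:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Five objects in 2-D; a and b are closest, so they merge first (level l = 1).
X = np.array([[1.0, 1.0],   # a
              [1.2, 1.1],   # b
              [4.0, 4.0],   # c
              [4.2, 4.1],   # d
              [9.0, 9.0]])  # e

Z = linkage(X, method="single", metric="euclidean")  # bottom-up (agglomerative) merges
dendrogram(Z, labels=["a", "b", "c", "d", "e"])
plt.ylabel("merge distance")
plt.show()
```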
Challenges and Solutions
Density-based methods:

• Here the general idea is to continue growing a given cluster as long as the density
(number of objects or data points) in the “neighborhood” exceeds some threshold.
• For example, for each data point within a given cluster, the neighborhood of a
given radius has to contain at least a minimum number of points.
• Such a method can be used to filter out noise or outliers and discover clusters of
arbitrary shape.
• Density-based methods can divide a set of objects into multiple exclusive
clusters, or a hierarchy of clusters.
• Typically, density-based methods consider exclusive clusters only, and do not
consider fuzzy clusters.
• Moreover, density-based methods can be extended from full space to subspace
clustering.
• Density-based algorithms require two parameters: the minimum number of points needed to form a cluster (MinPts) and the radius threshold (ε) that defines the neighborhood of every point.
• DBSCAN, a commonly used density-based clustering algorithm, groups data points that are close together and can discover clusters of arbitrary shape (a short sketch follows).
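• The sketch below runs DBSCAN on synthetic data; eps is the neighborhood radius and min_samples is the minimum point count (MinPts), and the values shown are illustrative rather than recommended settings.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
cluster1 = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(50, 2))
cluster2 = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(50, 2))
noise = rng.uniform(low=-2, high=7, size=(10, 2))      # scattered outliers
X = np.vstack([cluster1, cluster2, noise])

db = DBSCAN(eps=0.5, min_samples=5).fit(X)
print("labels found:", set(db.labels_))                # label -1 marks noise points
```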
Advantages
• Density-based clustering algorithms can effectively handle noise and outliers in
the dataset, making them robust in such scenarios.
• These algorithms can identify clusters of arbitrary shapes and sizes, unlike clustering algorithms that assume specific cluster forms.
• They don’t require prior knowledge of the number of clusters, making them more
flexible and versatile.
• They can efficiently process large datasets and handle high-dimensional data.
Disadvantages
• The performance of density-based clustering algorithms is highly dependent on
the choice of parameters, such as ε and MinPts, which can be challenging to tune.
• These algorithms may not be suitable for datasets with low-density regions or
evenly distributed data points.
• They can be computationally expensive and time-consuming, especially for large
datasets with complex structures.
• Density-based clustering can struggle to identify clusters of varying densities or scales.
Summary of methods
Evaluation of Clustering
• Assessing clustering tendency. In this task, for a given data set, we assess
whether a nonrandom structure exists in the data. Blindly applying a clustering
method on a data set will return clusters; however, the clusters mined may be
misleading.
• Clustering analysis on a data set is meaningful only when there is a nonrandom
structure in the data.
• Determining the number of clusters in a data set. A few algorithms, such as k-means, require the number of clusters in a data set as a parameter. Moreover, the number of clusters can be regarded as an interesting and important summary statistic of a data set.
• Therefore, it is desirable to estimate this number even before a clustering
algorithm is used to derive detailed clusters.
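• One common heuristic for this estimate, sketched below with illustrative synthetic data, is the "elbow" of the k-means within-cluster sum of squares as k grows; it is only a guide, not a definitive rule.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(100, 2)) for c in (0, 4, 8)])

for k in range(1, 7):
    inertia = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(f"k={k}: within-cluster sum of squares = {inertia:.1f}")
# The k at which the curve bends sharply (here around k = 3) suggests the cluster count.
```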
• Measuring clustering quality. After applying a clustering method on a data set,
we want to assess how good the resulting clusters are. A number of measures can
be used.
• Some methods measure how well the clusters fit the data set, while others
measure how well the clusters match the ground truth, if such truth is available.
• There are also measures that score clustering and thus can compare two sets of
clustering results on the same data set.
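• The sketch below (synthetic data, illustrative only) computes one measure of each kind: the silhouette coefficient, which scores how well the clusters fit the data without any ground truth, and the adjusted Rand index, which scores agreement with ground-truth labels when they are available.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, adjusted_rand_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(100, 2)) for c in (0, 4, 8)])
true_labels = np.repeat([0, 1, 2], 100)                  # ground truth for the toy data

pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("silhouette:", silhouette_score(X, pred))                        # internal measure
print("adjusted Rand index:", adjusted_rand_score(true_labels, pred))  # external measure
```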
THANK YOU
