
Introduction to Cluster Analysis
Unit 2: Chapter 2
Contents
• Classification v/s Clustering
• Clustering
• Types of data in cluster analysis
• Clustering Methods
• Partitioning Methods
• Hierarchical Methods
• Density-Based Methods
Classification v/s Clustering
Clustering
• What Is Cluster Analysis?
• Cluster analysis, or simply clustering, is the process of partitioning a set of data objects (or observations) into subsets. Each subset is a cluster, such that objects in a cluster are similar to one another, yet dissimilar to objects in other clusters.
• The set of clusters resulting from a cluster analysis can be referred to as a clustering.
• It is a process of grouping a set of data objects into multiple groups or clusters so that objects within a cluster have high similarity, but are very dissimilar to objects in other clusters.
Applications
• Clustering analysis is broadly used in many applications such as market research,
pattern recognition, data analysis, and image processing.
• Clustering can also help marketers discover distinct groups in their customer base, and characterize these groups based on their purchasing patterns.
• In the field of biology, it can be used to derive plant and animal taxonomies, categorize genes with similar functionalities, and gain insight into structures inherent in populations.
• Clustering is also used in outlier detection applications such as detection of credit
card fraud.
• Clustering also helps in classifying documents on the web for information
discovery.
• Clustering also helps in identification of areas of similar land use in an earth
observation database. It also helps in the identification of groups of houses in a
city according to house type, value, and geographic location.
• As a data mining function, cluster analysis serves as a tool to gain insight into the distribution of data and to observe the characteristics of each cluster.
Requirements for Cluster Analysis

Scalability:
• Many clustering algorithms work well on small data sets containing fewer than
several hundred data objects; however, a large database may contain millions or
even billions of objects, particularly in Web search scenarios.
• Clustering on only a sample of a given large data set may lead to biased results.
Therefore, highly scalable clustering algorithms are needed.

Ability to deal with different types of attributes:
• Many algorithms are designed to cluster numeric (interval-based) data. However, applications may require clustering other data types, such as binary, nominal (categorical), and ordinal data, or mixtures of these data types.
Requirements for domain knowledge to determine input parameters:
• Many clustering algorithms require users to provide domain knowledge in the
form of input parameters such as the desired number of clusters. Consequently,
the clustering results may be sensitive to such parameters.
• Parameters are often hard to determine, especially for high-dimensional data sets where users have yet to develop a deep understanding of their data.

Ability to deal with noisy data:
• Most real-world data sets contain outliers and/or missing, unknown, or erroneous data. Clustering algorithms can be sensitive to such noise and may produce poor-quality clusters. Therefore, we need clustering methods that are robust to noise.
Discovery of clusters with arbitrary shape:
• Many clustering algorithms determine clusters based on Euclidean or Manhattan
distance measures. Algorithms based on such distance measures tend to find
spherical clusters with similar size and density. It is important to develop
algorithms that can detect clusters of arbitrary shape.
Incremental clustering and insensitivity to input order:
• In many applications, incremental updates (representing newer data) may arrive at
any time. Some clustering algorithms cannot incorporate incremental updates into
existing clustering structures and, instead, have to recompute a new clustering
from scratch.
• Clustering algorithms may also be sensitive to the input data order. Incremental
clustering algorithms and algorithms that are insensitive to the input order are
needed.
• Clustering algorithms typically operate on either of the following two data
structures:
• – Data matrix
• – Dissimilarity matrix
Data Matrix
Dissimilarity Matrix
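• The sketch below (illustrative values, not from the slides) builds a small data matrix of n objects by p attributes and derives the corresponding n × n dissimilarity matrix with Euclidean distance, assuming NumPy and SciPy are available.

```python
# A minimal sketch: a data matrix (objects x attributes) and the symmetric
# dissimilarity matrix computed from it with Euclidean distance.
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Data matrix: 4 objects described by 2 numeric (interval-scaled) attributes.
X = np.array([[1.0, 2.0],
              [1.5, 1.8],
              [8.0, 8.0],
              [9.0, 11.0]])

# Dissimilarity matrix: zero diagonal, entry (i, j) is the Euclidean distance
# between objects i and j; squareform expands the condensed pdist output to n x n.
D = squareform(pdist(X, metric="euclidean"))
print(D.round(2))
```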
Types of Data in Cluster Analysis
• Dissimilarity can be computed for the following variable types (a short example follows this list):
• – Interval-scaled (numeric) variables
• – Binary variables
• – Categorical (nominal) variables
• – Ordinal variables
• – Ratio variables
• – Mixed-type variables
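• The short example below (a hedged sketch with made-up toy vectors) shows dissimilarity for two of the non-numeric types listed above: Jaccard dissimilarity for binary variables and simple matching (Hamming) dissimilarity for nominal variables.

```python
# Dissimilarity for binary and nominal attributes using SciPy's distance helpers.
import numpy as np
from scipy.spatial.distance import jaccard, hamming

# Binary variables (e.g., attribute present / absent) for two objects:
# Jaccard ignores 0/0 matches, which suits asymmetric binary attributes.
a = np.array([1, 0, 1, 1, 0], dtype=bool)
b = np.array([1, 1, 0, 1, 0], dtype=bool)
print("Jaccard dissimilarity:", jaccard(a, b))

# Nominal (categorical) variables: fraction of attributes that disagree.
p = np.array(["red", "round", "small"])
q = np.array(["red", "square", "small"])
print("Simple matching dissimilarity:", hamming(p, q))
```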
Ratio Scaled
Clustering Methods

Partitioning methods:
• Given a set of n objects, a partitioning method constructs k partitions of the data, where each partition represents a cluster and k ≤ n. That is, it divides the data into k groups such that each group must contain at least one object.
• In other words, partitioning methods conduct one-level partitioning on data sets.
• The basic partitioning methods typically adopt exclusive cluster separation. That is, each object must belong to exactly one group.
A Centroid-Based Technique: k-Means
Problem
• Refer class notes
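• Since the worked problem is left to the class notes, the following is only a minimal sketch of the centroid-based (k-means) iteration itself, assuming numeric data in a NumPy array; the toy points and k = 2 are illustrative choices.

```python
# Lloyd's iteration for k-means: assign each object to its nearest centroid,
# then move each centroid to the mean of its assigned objects, until stable.
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # k random objects
    for _ in range(n_iter):
        # Assignment step: each object joins the cluster of its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its cluster
        # (an empty cluster simply keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
print(kmeans(X, k=2))
```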
How can we make the k-means algorithm more scalable?
• One approach to making the k-means method more efficient on large data sets is to use a good-sized set of samples in clustering (this sampling idea is sketched below).
• Another is to employ a filtering approach that uses a spatial hierarchical data index to save costs when computing means.
• A third approach explores the micro-clustering idea, which first groups nearby objects into "micro-clusters" and then performs k-means clustering on the micro-clusters.
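• One practical illustration of the sampling idea (a hedged sketch, not the exact filtering or micro-clustering methods above) is scikit-learn's MiniBatchKMeans, which updates centroids from small random batches rather than the full data set on every iteration.

```python
# Mini-batch k-means on a larger synthetic data set.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
# 30,000 points drawn around three well-separated centers.
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(10_000, 2)) for c in (0, 5, 10)])

mbk = MiniBatchKMeans(n_clusters=3, batch_size=1024, n_init=3, random_state=0)
labels = mbk.fit_predict(X)
print(mbk.cluster_centers_)
```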
What Is the Problem of the K-Means Method?
The K-Medoids Clustering Method: A Representative Object-Based Technique
1. Initialize: select k random points out of the n data points as the medoids.
2. Associate each data point with the closest medoid using any common distance metric.
3. While the cost decreases: for each medoid m and for each data point o that is not a medoid:
4. Swap m and o, associate each data point with the closest medoid, and recompute the cost.
5. If the total cost is more than that in the previous step, undo the swap.
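• A hedged, brute-force sketch of the swap-based steps above (PAM-style) is given below; it is only suitable for small n, since every pass tries on the order of k(n − k) candidate swaps.

```python
# k-medoids by repeated medoid/non-medoid swaps that are kept only when they
# lower the total distance of all points to their closest medoid.
import numpy as np

def kmedoids(X, k, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    medoids = list(rng.choice(n, size=k, replace=False))       # step 1: random medoids

    def cost(meds):
        # Steps 2/4: associate each point with its closest medoid, sum the distances.
        return D[:, meds].min(axis=1).sum()

    best = cost(medoids)
    improved = True
    while improved:                          # step 3: repeat while the cost decreases
        improved = False
        for i in range(k):
            for o in range(n):
                if o in medoids:
                    continue
                trial = medoids.copy()
                trial[i] = o                 # step 4: try swapping medoid i with o
                c = cost(trial)
                if c < best:                 # step 5: keep the swap only if it helps
                    best, medoids, improved = c, trial, True
    labels = D[:, medoids].argmin(axis=1)
    return medoids, labels

X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [30.0, 30.0]])   # last point is an outlier
print(kmedoids(X, k=2))
```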
Which method is more robust: k-means or k-medoids?
• The k-medoids method is more robust than k-means in the presence of noise and outliers because a medoid is less influenced by outliers or other extreme values than a mean.
• However, the complexity of each iteration in the k-medoids algorithm is O(k(n − k)²).
• For large values of n and k, such computation becomes very costly, and much more costly than the k-means method.
Problems
• Refer class notes
How can we scale up the k-medoids method?
• To deal with larger data sets, a sampling-based method called CLARA (Clustering
LARge Applications) can be used.

• Instead of taking the whole data set into consideration, CLARA uses a random
sample of the data set. The PAM algorithm is then applied to compute the best
medoids from the sample.

• Ideally, the sample should closely represent the original data set. In many cases, a
large sample works well if it is created so that each object has equal probability of
being selected into the sample.
• The representative objects (medoids) chosen will likely be similar to those that would have been chosen from the whole data set. CLARA builds clusterings from multiple random samples and returns the best clustering as the output (a sketch of this sampling scheme follows).
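• The sketch below illustrates the CLARA idea under the same assumptions: it reuses the kmedoids() sketch defined earlier, runs it on random samples only, scores each candidate medoid set against the full data set, and keeps the best; the sample size and number of samples are illustrative defaults, not prescribed values.

```python
# CLARA-style sampling wrapper around the earlier kmedoids() sketch.
import numpy as np

def clara(X, k, n_samples=5, sample_size=40, seed=0):
    rng = np.random.default_rng(seed)
    D_full = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    best_medoids, best_cost = None, np.inf
    for _ in range(n_samples):
        # Run the expensive medoid search on a random sample only.
        idx = rng.choice(len(X), size=min(sample_size, len(X)), replace=False)
        sample_medoids, _ = kmedoids(X[idx], k, seed=int(rng.integers(1_000_000)))
        medoids = idx[np.asarray(sample_medoids)]        # map back to full-data indices
        cost = D_full[:, medoids].min(axis=1).sum()      # score on the whole data set
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    labels = D_full[:, best_medoids].argmin(axis=1)
    return best_medoids, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in (0, 6)])
medoids, labels = clara(X, k=2)
print("chosen medoid indices:", medoids)
```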
Hierarchical methods:
• A hierarchical method creates a hierarchical decomposition of the given set of data objects.
• A hierarchical method can be classified as being either agglomerative or divisive,
based on how the hierarchical decomposition is formed.
• The agglomerative approach, also called the bottom-up approach, starts with each
object forming a separate group. It successively merges the objects or groups
close to one another, until all the groups are merged into one (the topmost level of
the hierarchy), or a termination condition holds.

• The divisive approach, also called the top-down approach, starts with all the
objects in the same cluster. In each successive iteration, a cluster is split into
smaller clusters, until eventually each object is in one cluster, or a termination
condition holds.
• Hierarchical clustering methods can be distance-based or density- and continuity-based.
• Hierarchical methods suffer from the fact that once a step (merge or split) is
done, it can never be undone. This rigidity is useful in that it leads to smaller
computation costs by not having to worry about a combinatorial number of
different choices.
• Such techniques cannot correct erroneous decisions; however, methods for
improving the quality of hierarchical clustering have been proposed.
Example
Distance measure: distance between the clusters
DIANA: All the objects are used to form one initial cluster. The cluster is split according to some principle, such as the maximum Euclidean distance between the closest neighboring objects in the cluster.
• A tree structure called a dendrogram is commonly used to represent the process of
hierarchical clustering.
• It shows how objects are grouped together (in an agglomerative method) or
partitioned (in a divisive method)
• A dendrogram for the five objects in the example: level l = 0 shows the five objects as singleton clusters. At l = 1, objects a and b are grouped together to form the first cluster, and they stay together at all subsequent levels (a plotting sketch follows).
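• A minimal plotting sketch (illustrative coordinates; labels a through e chosen to mirror the five-object example) of agglomerative clustering and its dendrogram with SciPy and Matplotlib:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Five objects in 2-D; a and b are closest, so they merge first (level l = 1).
X = np.array([[1.0, 1.0],   # a
              [1.2, 1.1],   # b
              [4.0, 4.0],   # c
              [4.2, 4.1],   # d
              [9.0, 9.0]])  # e

Z = linkage(X, method="single", metric="euclidean")  # bottom-up (agglomerative) merges
dendrogram(Z, labels=["a", "b", "c", "d", "e"])
plt.ylabel("merge distance")
plt.show()
```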
Challenges and Solutions
Density-based methods:

• Here the general idea is to continue growing a given cluster as long as the density
(number of objects or data points) in the “neighborhood” exceeds some threshold.
• For example, for each data point within a given cluster, the neighborhood of a
given radius has to contain at least a minimum number of points.
• Such a method can be used to filter out noise or outliers and discover clusters of
arbitrary shape.
• Density-based methods can divide a set of objects into multiple exclusive
clusters, or a hierarchy of clusters.
• Typically, density-based methods consider exclusive clusters only, and do not
consider fuzzy clusters.
• Moreover, density-based methods can be extended from full space to subspace
clustering.
• Density-based algorithms require two parameters: the minimum number of points needed to form a cluster (MinPts) and the radius threshold (ε) that defines the neighborhood of every point.
• DBSCAN, a commonly used density-based clustering algorithm, groups data points that are close together and can discover clusters of arbitrary shape (a short sketch follows).
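• The sketch below runs DBSCAN on synthetic data; eps is the neighborhood radius and min_samples is the minimum point count (MinPts), and the values shown are illustrative rather than recommended settings.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
cluster1 = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(50, 2))
cluster2 = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(50, 2))
noise = rng.uniform(low=-2, high=7, size=(10, 2))      # scattered outliers
X = np.vstack([cluster1, cluster2, noise])

db = DBSCAN(eps=0.5, min_samples=5).fit(X)
print("labels found:", set(db.labels_))                # label -1 marks noise points
```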
Advantages
• Density-based clustering algorithms can effectively handle noise and outliers in
the dataset, making them robust in such scenarios.
• These algorithms can identify clusters of arbitrary shapes and sizes, unlike clustering algorithms that assume specific cluster forms.
• They don’t require prior knowledge of the number of clusters, making them more
flexible and versatile.
• They can efficiently process large datasets and handle high-dimensional data.
Disadvantages
• The performance of density-based clustering algorithms is highly dependent on
the choice of parameters, such as ε and MinPts, which can be challenging to tune.
• These algorithms may not be suitable for datasets with low-density regions or
evenly distributed data points.
• They can be computationally expensive and time-consuming, especially for large
datasets with complex structures.
• Density-based clustering can struggle to identify clusters of varying densities or scales.
Summary of methods
Evaluation of Clustering
• Assessing clustering tendency. In this task, for a given data set, we assess
whether a nonrandom structure exists in the data. Blindly applying a clustering
method on a data set will return clusters; however, the clusters mined may be
misleading.
• Clustering analysis on a data set is meaningful only when there is a nonrandom
structure in the data.
• Determining the number of clusters in a data set. A few algorithms, such as k-means, require the number of clusters in a data set as a parameter. Moreover, the number of clusters can be regarded as an interesting and important summary statistic of a data set.
• Therefore, it is desirable to estimate this number even before a clustering
algorithm is used to derive detailed clusters.
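• One common heuristic for this estimate, sketched below with illustrative synthetic data, is the "elbow" of the k-means within-cluster sum of squares as k grows; it is only a guide, not a definitive rule.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(100, 2)) for c in (0, 4, 8)])

for k in range(1, 7):
    inertia = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(f"k={k}: within-cluster sum of squares = {inertia:.1f}")
# The k at which the curve bends sharply (here around k = 3) suggests the cluster count.
```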
• Measuring clustering quality. After applying a clustering method on a data set,
we want to assess how good the resulting clusters are. A number of measures can
be used.
• Some methods measure how well the clusters fit the data set, while others
measure how well the clusters match the ground truth, if such truth is available.
• There are also measures that score clustering and thus can compare two sets of
clustering results on the same data set.
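• The sketch below (synthetic data, illustrative only) computes one measure of each kind: the silhouette coefficient, which scores how well the clusters fit the data without any ground truth, and the adjusted Rand index, which scores agreement with ground-truth labels when they are available.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, adjusted_rand_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(100, 2)) for c in (0, 4, 8)])
true_labels = np.repeat([0, 1, 2], 100)                  # ground truth for the toy data

pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("silhouette:", silhouette_score(X, pred))                        # internal measure
print("adjusted Rand index:", adjusted_rand_score(true_labels, pred))  # external measure
```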
THANK YOU
