
Cluster Analysis: Similarity Measures, Hierarchical vs. Non-hierarchical Clustering, and Interpretation of Results

Submitted By
Hiba UK
Cluster analysis
Cluster analysis is a group of multivariate techniques whose primary purpose is to group
objects based on the characteristics they possess.

It has been referred to as Q analysis, typology construction, classification analysis, and numerical taxonomy, owing to the use of clustering methods in such diverse disciplines.

It is a means of grouping records based upon attributes that make them similar. The objects
within clusters will be close together when plotted geometrically, and different clusters will
be far apart.
Common roles of cluster analysis
1) Data Reduction

A researcher may be faced with a large number of observations that are meaningless unless
classified into manageable groups.

Cluster analysis can perform this data reduction procedure objectively by reducing the
information from an entire population or sample to information about specific groups.

2) Hypothesis Generation

Cluster analysis is also useful when a researcher wishes to develop hypotheses concerning the
nature of the data or to examine previously stated hypotheses.
Objectives of cluster analysis
Grouping similar data points: The core aim is to identify and group data points based on their
similarities. This means finding groups within your data where the members within each group
are more alike than members from different groups.

Uncovering hidden patterns and structures: By grouping similar data points, you can reveal
underlying patterns and structures that might not be readily apparent when looking at the data
individually. This can lead to new insights and understanding of the data.

Anomaly detection: Outliers or anomalies can be identified by their distance or dissimilarity to other data points within a cluster. This can be useful for fraud detection, system monitoring, and quality control.

Dimensionality reduction: In some cases, datasets can be large and contain many features.
Cluster analysis can help reduce the dimensionality of your data by grouping similar
features together. This can make it easier to visualize and analyze the data, and it can also
improve the performance of other machine learning algorithms.

Data exploration and segmentation: Cluster analysis is a great tool for exploring and
segmenting your data. It allows you to identify different subsets within your data, which
can be helpful for further analysis, targeted marketing campaigns, personalized
recommendations, or resource allocation.
How does cluster analysis work?
The primary objective of cluster analysis is to define the structure of the data by placing the
most similar observations into groups. To accomplish this task, we must address three basic
questions:

▪ How do we measure similarity?
▪ How do we form clusters?
▪ How many groups do we form?
Example

Data for seven respondents (A-G) on two clustering variables:

Clustering variable    A   B   C   D   E   F   G
Variable 1             3   4   4   2   6   7   6
Variable 2             2   5   7   7   6   7   4
1. Measuring similarity

Proximity matrix of Euclidean distances between observations:

Observation      A      B      C      D      E      F      G
A                -
B            3.162      -
C            5.099  2.000      -
D            5.099  2.828  2.000      -
E            5.000  2.236  2.236  4.123      -
F            6.403  3.606  3.000  5.000  1.414      -
G            3.606  2.236  3.606  5.000  2.000  3.162      -
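
As a minimal sketch of how this proximity matrix could be computed (Python with NumPy and SciPy is assumed here; the seminar itself does not prescribe any software):

# Sketch: computing the Euclidean proximity matrix for the example data.
# Assumes NumPy and SciPy; variable names are illustrative, not from the seminar.
import numpy as np
from scipy.spatial.distance import pdist, squareform

labels = list("ABCDEFG")
# Rows are respondents A-G; columns are the two clustering variables.
X = np.array([[3, 2], [4, 5], [4, 7], [2, 7], [6, 6], [7, 7], [6, 4]])

D = squareform(pdist(X, metric="euclidean"))  # 7 x 7 symmetric distance matrix

# Print the lower triangle, mirroring the proximity matrix above.
for i, name in enumerate(labels):
    print(name, " ".join(f"{D[i, j]:.3f}" for j in range(i)))
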
2. Forming clusters

Using a rule: "Identify the two most similar (closest) observations not already in the same cluster and combine them."

We apply this rule repeatedly to generate a number of cluster solutions, starting with each
observation as its own “cluster” and then combining two clusters at a time until all observations
are in a single cluster.

This process is termed a hierarchical procedure because it moves in a stepwise fashion to form an
entire range of cluster solutions.

It is also an agglomerative method because clusters are formed by combining existing clusters.
Agglomerative hierarchical clustering process

       Agglomeration process                          Cluster solution
Step   Minimum distance between   Observation   Cluster membership       Number of   Overall similarity measure
       unclustered observations   pair                                   clusters    (average within-cluster distance)
Initial solution                                 (A)(B)(C)(D)(E)(F)(G)   7           0
1      1.414                      E-F            (A)(B)(C)(D)(E-F)(G)    6           1.414
2      2.000                      E-G            (A)(B)(C)(D)(E-F-G)     5           2.192
3      2.000                      C-D            (A)(B)(C-D)(E-F-G)      4           2.144
4      2.000                      B-C            (A)(B-C-D)(E-F-G)       3           2.234
5      2.236                      B-E            (A)(B-C-D-E-F-G)        2           2.896
6      3.162                      A-B            (A-B-C-D-E-F-G)         1           3.420
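
A minimal sketch of how this agglomeration schedule could be reproduced with SciPy's hierarchical clustering routines; single linkage is assumed because the closest-pair rule above corresponds to it, so the merge distances should match the minimum-distance column of the table:

# Sketch: reproducing the agglomeration schedule with single-linkage clustering.
# Assumes the 7 x 2 data array X from the earlier sketch.
from scipy.cluster.hierarchy import linkage

Z = linkage(X, method="single", metric="euclidean")
# Each row of Z describes one merge:
# [index of cluster 1, index of cluster 2, merge distance, size of new cluster].
for step, (i, j, dist, size) in enumerate(Z, start=1):
    print(f"Step {step}: merge {int(i)} and {int(j)} at distance {dist:.3f} "
          f"(new cluster size {int(size)})")
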


The hierarchical clustering process can be portrayed graphically in several ways.

First, because the process is hierarchical, the clustering process can be shown as a series of nested groupings. This representation, however, can depict the proximity of the observations for only two or three clustering variables, since it relies on a scatterplot or three-dimensional graph.

A more common approach is a dendrogram, which represents the clustering process in a tree-like
graph.

The horizontal axis represents the agglomeration coefficient, in this instance the distance used in
joining clusters.

This approach is particularly useful in identifying outliers. It also depicts the relative size of the various clusters, although it becomes unwieldy when the number of observations increases.
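
As an illustrative sketch, the dendrogram described above could be drawn from the same linkage result (this assumes the X, labels, and Z variables from the earlier sketches and that Matplotlib is available):

# Sketch: drawing the dendrogram for the linkage result Z from the previous sketch.
# orientation="right" places the joining distance on the horizontal axis,
# matching the description in the text.
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram

dendrogram(Z, labels=labels, orientation="right")
plt.xlabel("Distance at which clusters are joined")
plt.tight_layout()
plt.show()
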
3. Determining the Number of Clusters in the Final Solution
To select a final cluster solution, we examine the proportionate changes in the homogeneity measure to identify
large increases indicative of merging dissimilar clusters:

Step   Overall similarity   Difference in similarity   Percentage increase in
       measure              measure to next step       heterogeneity to next stage
1      1.414                 .778                      55.0
2      2.192                -.048                      NI
3      2.144                 .090                       4.2
4      2.234                 .662                      29.6
5      2.896                 .524                      18.1
6      3.420

When we first join two observations (step 1), we establish the minimum heterogeneity level within clusters, in this case 1.414.

In Step 2 we see a substantial increase in heterogeneity from Step 1, but in the next two steps (3 and 4) the overall measure does not change substantially, which indicates that we are forming other clusters with essentially the same heterogeneity as the existing clusters.
When we get to step 5, which combines the two three-member clusters, we see a large increase (.662
or 29.6%). This change indicates that joining these two clusters resulted in a single cluster that was
markedly less homogeneous. As a result, we would consider the three-cluster solution of step 4 much
better than the two-cluster solution found in step 5.

We can also see that in step 6 the overall measure again increased markedly, indicating when this
single observation was joined at the last step, it substantially changed the cluster homogeneity. Given
the rather unique profile of this observation (observation A) compared to the others, it might best be
designated as a member of the entropy group, those observations that are outliers and independent of
the existing clusters.
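
A minimal sketch of this stopping-rule reasoning in code; the overall similarity values are copied from the table above, whereas in practice they would be taken from the clustering output:

# Sketch: computing step-to-step changes in the overall similarity measure.
# The values below are copied from the table (average within-cluster distance
# for steps 1-6); a large percentage jump signals that dissimilar clusters
# were merged, so the solution just before the jump is a natural stopping point.
overall = [1.414, 2.192, 2.144, 2.234, 2.896, 3.420]

for step in range(1, len(overall)):
    diff = overall[step] - overall[step - 1]
    pct = 100 * diff / overall[step - 1]
    print(f"Step {step} -> step {step + 1}: difference {diff:+.3f} ({pct:+.1f}%)")
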
Hierarchical cluster analysis
Hierarchical cluster analysis is a data mining technique used to group similar data points
together.

It creates a hierarchy of clusters, where clusters at lower levels are nested within clusters at
higher levels.

This hierarchy is visualized using a tree-like structure called a dendrogram.

Hierarchical clustering does not require the user to specify the number of clusters in advance,
unlike some other clustering algorithms.

Hierarchical cluster analysis provides a flexible and intuitive approach to clustering data,
allowing for the exploration of the underlying structure of the data and the identification of
meaningful groups.
Basic types of hierarchical cluster analysis

There are two main approaches in hierarchical cluster analysis. They are:

❖ Agglomerative algorithm
❖ Divisive algorithm
Agglomerative algorithm

In this approach, each data point starts as its own cluster, and pairs of clusters are
successively merged based on their similarity.

At each step, the two closest clusters are combined into a single cluster, resulting in a
hierarchy of clusters.

This process continues until all data points belong to a single cluster or until a stopping
criterion is met.
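
As a hedged sketch of an agglomerative run that returns a flat cluster assignment (scikit-learn is an illustrative choice; the number of clusters and the average-linkage method are assumptions, not part of the seminar):

# Sketch: agglomerative clustering that returns one flat label per observation.
# n_clusters=3 and linkage="average" are illustrative assumptions.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[3, 2], [4, 5], [4, 7], [2, 7], [6, 6], [7, 7], [6, 4]])

model = AgglomerativeClustering(n_clusters=3, linkage="average")
print(model.fit_predict(X))  # cluster label for each of the seven observations
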
Divisive algorithm
In contrast to agglomerative clustering, divisive hierarchical clustering begins with all data
points in a single cluster and then divides the data into smaller clusters.

At each step, the algorithm splits a cluster into two clusters that are maximally dissimilar.

This process continues recursively until each data point is in its own cluster or until a
stopping criterion is met.
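
Off-the-shelf divisive implementations are less common than agglomerative ones. As a minimal sketch under that caveat, a bisecting strategy approximates the idea by repeatedly splitting one cluster in two with 2-means; splitting the largest cluster is an illustrative criterion, and other criteria (such as highest within-cluster variance) are equally valid:

# Sketch: a simple divisive-style (bisecting) procedure.
# Start with all points in one cluster and repeatedly split the largest cluster
# in two with k-means until k clusters remain. Splitting the largest cluster is
# an illustrative choice; other split criteria are possible.
import numpy as np
from sklearn.cluster import KMeans

def bisecting_clusters(X, k):
    clusters = [np.arange(len(X))]  # indices of all points, as one cluster
    while len(clusters) < k:
        largest = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        idx = clusters.pop(largest)
        split = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X[idx])
        clusters.append(idx[split.labels_ == 0])
        clusters.append(idx[split.labels_ == 1])
    return clusters

X = np.array([[3, 2], [4, 5], [4, 7], [2, 7], [6, 6], [7, 7], [6, 4]])
print(bisecting_clusters(X, 3))  # three groups of observation indices
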
Advantages of hierarchical clustering
Hierarchy of clusters: Hierarchical clustering produces a tree-like structure called a dendrogram,
which provides a visual representation of the relationships between clusters. This hierarchical structure
allows for the exploration of the data at different levels of granularity.

Easy to interpret: The hierarchical structure produced by hierarchical clustering provides insight into the underlying structure of the data and can help identify meaningful clusters and subclusters.

No need to specify the number of clusters in advance: Unlike some other clustering algorithms,
hierarchical clustering does not require the user to specify the number of clusters in advance. The
dendrogram allows users to choose the number of clusters based on their interpretation of the data.

Can handle different data types: It can be applied to both numerical and categorical data, given an appropriate distance or similarity measure.
Disadvantages of hierarchical cluster analysis

Computational complexity: Hierarchical clustering algorithms can be computationally intensive, especially for large datasets. The time and memory requirements increase with the number of data points, making hierarchical clustering less practical for very large datasets.

Inability to undo previous merges: Once clusters are merged in hierarchical clustering, it is not
possible to undo these merges. This lack of flexibility can be a limitation when refining or adjusting
the clustering results.

Difficulty in determining the number of clusters: While hierarchical clustering does not require
users to specify the number of clusters in advance, determining the optimal number of clusters from
the dendrogram can be subjective and challenging, especially for complex datasets with overlapping
clusters.
Non-hierarchical cluster analysis
Non-hierarchical cluster analysis, also known as partitioning clustering, takes a different approach to grouping data points compared to hierarchical clustering.

Non-hierarchical clustering aims to directly classify data points into a predefined number of
clusters (k) based on their similarities.

It doesn't create a hierarchical structure like dendrograms in hierarchical clustering.

Non-hierarchical clustering methods offer flexibility in specifying the number of clusters and
can be effective for a wide range of data types and structures.
K-means clustering
One of the most popular non-hierarchical clustering algorithms is K-means clustering.

This popular algorithm starts by randomly selecting k data points as initial cluster centers
(centroids).

Then, each data point is assigned to the cluster with the nearest centroid based on a chosen
distance metric (e.g., Euclidean distance).

Once all points are assigned, the centroids are recalculated based on the average of the points in
their respective clusters.

This process of reassigning data points and recalculating centroids iterates until a stopping
criterion is met, such as minimal change in centroid positions or reaching a maximum number of
iterations.
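
A minimal sketch of the procedure just described, using scikit-learn's KMeans on the seminar's example data; choosing k = 2 is an assumption made only for illustration:

# Sketch: K-means on the example data; k=2 is an illustrative choice.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[3, 2], [4, 5], [4, 7], [2, 7], [6, 6], [7, 7], [6, 4]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("Labels:    ", km.labels_)           # cluster assignment of each observation
print("Centroids: ", km.cluster_centers_)  # recalculated centroid positions
print("Iterations:", km.n_iter_)           # stops once centroids stabilize
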
Advantages of non-hierarchical clustering
Scalability: Non-hierarchical clustering algorithms, such as K-means and Gaussian mixture models (GMM), are often more scalable and computationally efficient than hierarchical clustering methods. They can handle large datasets more effectively, making them suitable for big data applications.

Explicit number of clusters: Non-hierarchical clustering requires the user to specify the number
of clusters in advance, providing a clear and explicit outcome. This can be advantageous when
the desired number of clusters is known or when the data naturally partitions into a specific
number of groups.

Fast convergence: Algorithms like K-means converge quickly, especially for well-separated
clusters, leading to efficient clustering results in a relatively short amount of time.
Disadvantages of non-hierarchical cluster analysis
Dependence on initializations: Non-hierarchical clustering algorithms are sensitive to initializations,
particularly K-means. Different initializations can lead to different clustering results, and finding the
optimal initialization can be challenging, especially for high-dimensional data.

Difficulty with non-linear and non-spherical clusters: Non-hierarchical clustering algorithms like
K-means assume that clusters are spherical and may struggle with non-linear or non-spherical clusters.
They can produce suboptimal results when clusters have complex shapes or overlap significantly.

Need for predefined number of clusters: One of the main disadvantages of non-hierarchical
clustering is the need to specify the number of clusters in advance. This requirement can be problematic
when the optimal number of clusters is unknown or when the data does not naturally partition into a
specific number of groups.
Interpretation of results
Cluster Profiles: Examine the profiles or characteristics of each cluster to understand their distinguishing features. This may involve analyzing the mean or median values of the variables within each cluster or visualizing the data distribution within clusters using histograms, box plots, or other graphical methods (a brief code sketch follows at the end of this section).

Cluster Centroids or Representatives: For centroid-based clustering algorithms like K-means, examine the centroids of each cluster to understand their representative values. Centroids represent the average or central tendency of data points within a cluster and can provide insights into the typical characteristics of each cluster.

Cluster Size and Density: Analyze the size and density of each cluster to understand its prevalence
and compactness. Clusters with a large number of data points and high density are more homogeneous
and well-defined, while clusters with fewer data points and lower density may be more heterogeneous
or less distinct.
Cluster Separation: Assess the separation between clusters to determine how distinct they are from each
other. Clusters that are well-separated in feature space are more clearly delineated, while clusters that overlap
or are close together may be less distinct or more ambiguous.

Validation Metrics: Use validation metrics such as the silhouette score, Davies-Bouldin index, or Dunn index to quantitatively evaluate the quality of the clustering results. Higher silhouette and Dunn index values, and lower Davies-Bouldin values, indicate better-defined and more internally homogeneous clusters (see the sketch at the end of this section).

Visualization: Visualize the clusters and their relationships using scatter plots, heatmaps, or
multidimensional scaling (MDS) plots. Visualization techniques can help uncover patterns, trends, and
relationships between clusters and variables that may not be apparent from numerical summaries alone.

Interpretability: Ensure that the clusters are interpretable and meaningful in the context of the problem
domain. Consider whether the identified clusters can be explained and understood in terms of underlying
patterns, relationships, or phenomena in the data.
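
As a minimal, illustrative sketch covering a few of these steps (cluster profiles, cluster sizes, and a silhouette score), assuming the data matrix X and the fitted km model from the K-means sketch earlier and that pandas is available:

# Sketch: interpreting a clustering result.
# Assumes the data matrix X and fitted model km from the K-means sketch above.
import pandas as pd
from sklearn.metrics import silhouette_score

df = pd.DataFrame(X, columns=["Variable 1", "Variable 2"])
df["cluster"] = km.labels_

# Cluster profiles: mean of each clustering variable per cluster.
print(df.groupby("cluster").mean())

# Cluster sizes: how many observations fall in each cluster.
print(df["cluster"].value_counts().sort_index())

# Validation: silhouette score (higher values indicate better-separated clusters).
print("Silhouette:", silhouette_score(X, km.labels_))
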
THANK YOU
