
Cluster Analysis: Similarity Measures, Hierarchical vs. Non-hierarchical Clustering, and Interpretation of Results

Submitted By
Hiba UK
Cluster analysis
Cluster analysis is a group of multivariate techniques whose primary purpose is to group
objects based on the characteristics they possess.

It has been referred to as Q analysis, typology construction, classification analysis, and numerical taxonomy, owing to the use of clustering methods in such diverse disciplines.

It is a means of grouping records based upon attributes that make them similar. The objects
within clusters will be close together when plotted geometrically, and different clusters will
be far apart.
Common roles of cluster analysis
1) Data Reduction

A researcher may be faced with a large number of observations that are meaningless unless
classified into manageable groups.

Cluster analysis can perform this data reduction procedure objectively by reducing the
information from an entire population or sample to information about specific groups.

2) Hypothesis Generation

Cluster analysis is also useful when a researcher wishes to develop hypotheses concerning the
nature of the data or to examine previously stated hypotheses.
Objectives of cluster analysis
Grouping similar data points: The core aim is to identify and group data points based on their
similarities. This means finding groups within your data where the members within each group
are more alike than members from different groups.

Uncovering hidden patterns and structures: By grouping similar data points, you can reveal
underlying patterns and structures that might not be readily apparent when looking at the data
individually. This can lead to new insights and understanding of the data.

Anomaly detection: Outliers or anomalies can be identified by their distance or dissimilarity to other data points within a cluster. This can be useful for fraud detection, system monitoring, and quality control.

Dimensionality reduction: In some cases, datasets can be large and contain many features.
Cluster analysis can help reduce the dimensionality of your data by grouping similar
features together. This can make it easier to visualize and analyze the data, and it can also
improve the performance of other machine learning algorithms.

Data exploration and segmentation: Cluster analysis is a great tool for exploring and
segmenting your data. It allows you to identify different subsets within your data, which
can be helpful for further analysis, targeted marketing campaigns, personalized
recommendations, or resource allocation.
How does cluster analysis work?
The primary objective of cluster analysis is to define the structure of the data by placing the
most similar observations into groups. To accomplish this task, we must address three basic
questions:

▪ How do we measure similarity?
▪ How do we form clusters?
▪ How many groups do we form?
Example

Data for seven respondents (A-G) on two clustering variables:

Clustering variable    A   B   C   D   E   F   G
Variable 1             3   4   4   2   6   7   6
Variable 2             2   5   7   7   6   7   4
1. Measuring similarity

Proximity matrix of Euclidean distances between observations:

Observation      A      B      C      D      E      F      G
A                -
B            3.162      -
C            5.099  2.000      -
D            5.099  2.828  2.000      -
E            5.000  2.236  2.236  4.123      -
F            6.403  3.606  3.000  5.000  1.414      -
G            3.606  2.236  3.606  5.000  2.000  3.162      -
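
As a minimal sketch of how this proximity matrix could be computed (Python with NumPy and SciPy is assumed here; the seminar itself does not prescribe any software):

# Sketch: computing the Euclidean proximity matrix for the example data.
# Assumes NumPy and SciPy; variable names are illustrative, not from the seminar.
import numpy as np
from scipy.spatial.distance import pdist, squareform

labels = list("ABCDEFG")
# Rows are respondents A-G; columns are the two clustering variables.
X = np.array([[3, 2], [4, 5], [4, 7], [2, 7], [6, 6], [7, 7], [6, 4]])

D = squareform(pdist(X, metric="euclidean"))  # 7 x 7 symmetric distance matrix

# Print the lower triangle, mirroring the proximity matrix above.
for i, name in enumerate(labels):
    print(name, " ".join(f"{D[i, j]:.3f}" for j in range(i)))
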
2. Forming clusters

Using a rule: "Identify the two most similar (closest) observations not already in the same cluster and combine them."

We apply this rule repeatedly to generate a number of cluster solutions, starting with each
observation as its own “cluster” and then combining two clusters at a time until all observations
are in a single cluster.

This process is termed a hierarchical procedure because it moves in a stepwise fashion to form an
entire range of cluster solutions.

It is also an agglomerative method because clusters are formed by combining existing clusters.
Agglomerative hierarchical clustering process

       Agglomeration process                          Cluster solution
Step   Minimum distance between   Observation   Cluster membership       Number of   Overall similarity measure
       unclustered observations   pair                                   clusters    (average within-cluster distance)
Initial solution                                 (A)(B)(C)(D)(E)(F)(G)   7           0
1      1.414                      E-F            (A)(B)(C)(D)(E-F)(G)    6           1.414
2      2.000                      E-G            (A)(B)(C)(D)(E-F-G)     5           2.192
3      2.000                      C-D            (A)(B)(C-D)(E-F-G)      4           2.144
4      2.000                      B-C            (A)(B-C-D)(E-F-G)       3           2.234
5      2.236                      B-E            (A)(B-C-D-E-F-G)        2           2.896
6      3.162                      A-B            (A-B-C-D-E-F-G)         1           3.420
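
A minimal sketch of how this agglomeration schedule could be reproduced with SciPy's hierarchical clustering routines; single linkage is assumed because the closest-pair rule above corresponds to it, so the merge distances should match the minimum-distance column of the table:

# Sketch: reproducing the agglomeration schedule with single-linkage clustering.
# Assumes the 7 x 2 data array X from the earlier sketch.
from scipy.cluster.hierarchy import linkage

Z = linkage(X, method="single", metric="euclidean")
# Each row of Z describes one merge:
# [index of cluster 1, index of cluster 2, merge distance, size of new cluster].
for step, (i, j, dist, size) in enumerate(Z, start=1):
    print(f"Step {step}: merge {int(i)} and {int(j)} at distance {dist:.3f} "
          f"(new cluster size {int(size)})")
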


The hierarchical clustering process can be portrayed graphically in several ways.

First, because the process is hierarchical, the clustering process can be shown as a series of nested groupings. This representation, however, can depict the proximity of the observations for only two or three clustering variables, since it relies on a scatterplot or three-dimensional graph.

A more common approach is a dendrogram, which represents the clustering process in a tree-like
graph.

The horizontal axis represents the agglomeration coefficient, in this instance the distance used in
joining clusters.

This approach is particularly useful in identifying outliers. It also depicts the relative size of the various clusters, although it becomes unwieldy when the number of observations increases.
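
As an illustrative sketch, the dendrogram described above could be drawn from the same linkage result (this assumes the X, labels, and Z variables from the earlier sketches and that Matplotlib is available):

# Sketch: drawing the dendrogram for the linkage result Z from the previous sketch.
# orientation="right" places the joining distance on the horizontal axis,
# matching the description in the text.
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram

dendrogram(Z, labels=labels, orientation="right")
plt.xlabel("Distance at which clusters are joined")
plt.tight_layout()
plt.show()
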
3. Determining the Number of Clusters in the Final Solution
To select a final cluster solution, we examine the proportionate changes in the homogeneity measure to identify
large increases indicative of merging dissimilar clusters:

Step   Overall similarity   Difference in similarity   Percentage increase in
       measure              measure to next step       heterogeneity to next stage
1      1.414                 .778                      55.0
2      2.192                -.048                      NI
3      2.144                 .090                       4.2
4      2.234                 .662                      29.6
5      2.896                 .524                      18.1
6      3.420

When we first join two observations (step 1), we establish the minimum heterogeneity level within clusters, in this case 1.414.

In Step 2 we see a substantial increase in heterogeneity from Step 1, but in the next two steps (3 and 4) the overall measure does not change substantially, which indicates that we are forming other clusters with essentially the same heterogeneity as the existing clusters.
When we get to step 5, which combines the two three-member clusters, we see a large increase (.662
or 29.6%). This change indicates that joining these two clusters resulted in a single cluster that was
markedly less homogeneous. As a result, we would consider the three-cluster solution of step 4 much
better than the two-cluster solution found in step 5.

We can also see that in step 6 the overall measure again increased markedly, indicating when this
single observation was joined at the last step, it substantially changed the cluster homogeneity. Given
the rather unique profile of this observation (observation A) compared to the others, it might best be
designated as a member of the entropy group, those observations that are outliers and independent of
the existing clusters.
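
A minimal sketch of this stopping-rule reasoning in code; the overall similarity values are copied from the table above, whereas in practice they would be taken from the clustering output:

# Sketch: computing step-to-step changes in the overall similarity measure.
# The values below are copied from the table (average within-cluster distance
# for steps 1-6); a large percentage jump signals that dissimilar clusters
# were merged, so the solution just before the jump is a natural stopping point.
overall = [1.414, 2.192, 2.144, 2.234, 2.896, 3.420]

for step in range(1, len(overall)):
    diff = overall[step] - overall[step - 1]
    pct = 100 * diff / overall[step - 1]
    print(f"Step {step} -> step {step + 1}: difference {diff:+.3f} ({pct:+.1f}%)")
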
Hierarchical cluster analysis
Hierarchical cluster analysis is a data mining technique used to group similar data points
together.

It creates a hierarchy of clusters, where clusters at lower levels are nested within clusters at
higher levels.

This hierarchy is visualized using a tree-like structure called a dendrogram.

Hierarchical clustering does not require the user to specify the number of clusters in advance,
unlike some other clustering algorithms.

Hierarchical cluster analysis provides a flexible and intuitive approach to clustering data,
allowing for the exploration of the underlying structure of the data and the identification of
meaningful groups.
Basic types of hierarchical cluster analysis

There are two main approaches in hierarchical cluster analysis. They are:

❖ Agglomerative algorithm
❖ Divisive algorithm
Agglomerative algorithm

In this approach, each data point starts as its own cluster, and pairs of clusters are
successively merged based on their similarity.

At each step, the two closest clusters are combined into a single cluster, resulting in a
hierarchy of clusters.

This process continues until all data points belong to a single cluster or until a stopping
criterion is met.
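
As a hedged sketch of an agglomerative run that returns a flat cluster assignment (scikit-learn is an illustrative choice; the number of clusters and the average-linkage method are assumptions, not part of the seminar):

# Sketch: agglomerative clustering that returns one flat label per observation.
# n_clusters=3 and linkage="average" are illustrative assumptions.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[3, 2], [4, 5], [4, 7], [2, 7], [6, 6], [7, 7], [6, 4]])

model = AgglomerativeClustering(n_clusters=3, linkage="average")
print(model.fit_predict(X))  # cluster label for each of the seven observations
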
Divisive algorithm
In contrast to agglomerative clustering, divisive hierarchical clustering begins with all data
points in a single cluster and then divides the data into smaller clusters.

At each step, the algorithm splits a cluster into two clusters that are maximally dissimilar.

This process continues recursively until each data point is in its own cluster or until a
stopping criterion is met.
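
Off-the-shelf divisive implementations are less common than agglomerative ones. As a minimal sketch under that caveat, a bisecting strategy approximates the idea by repeatedly splitting one cluster in two with 2-means; splitting the largest cluster is an illustrative criterion, and other criteria (such as highest within-cluster variance) are equally valid:

# Sketch: a simple divisive-style (bisecting) procedure.
# Start with all points in one cluster and repeatedly split the largest cluster
# in two with k-means until k clusters remain. Splitting the largest cluster is
# an illustrative choice; other split criteria are possible.
import numpy as np
from sklearn.cluster import KMeans

def bisecting_clusters(X, k):
    clusters = [np.arange(len(X))]  # indices of all points, as one cluster
    while len(clusters) < k:
        largest = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        idx = clusters.pop(largest)
        split = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X[idx])
        clusters.append(idx[split.labels_ == 0])
        clusters.append(idx[split.labels_ == 1])
    return clusters

X = np.array([[3, 2], [4, 5], [4, 7], [2, 7], [6, 6], [7, 7], [6, 4]])
print(bisecting_clusters(X, 3))  # three groups of observation indices
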
Advantages of hierarchical clustering
Hierarchy of clusters: Hierarchical clustering produces a tree-like structure called a dendrogram,
which provides a visual representation of the relationships between clusters. This hierarchical structure
allows for the exploration of the data at different levels of granularity.

Easy to interpret: The hierarchical structure produced by hierarchical clustering provides insight into the underlying structure of the data and can help identify meaningful clusters and subclusters.

No need to specify the number of clusters in advance: Unlike some other clustering algorithms,
hierarchical clustering does not require the user to specify the number of clusters in advance. The
dendrogram allows users to choose the number of clusters based on their interpretation of the data.

Can handle different data types: It can be applied to both numerical and categorical data, given an appropriate distance or similarity measure.
Disadvantages of hierarchical cluster analysis

Computational complexity: Hierarchical clustering algorithms can be computationally intensive, especially for large datasets. The time and memory requirements increase with the number of data points, making hierarchical clustering less practical for very large datasets.

Inability to undo previous merges: Once clusters are merged in hierarchical clustering, it is not
possible to undo these merges. This lack of flexibility can be a limitation when refining or adjusting
the clustering results.

Difficulty in determining the number of clusters: While hierarchical clustering does not require
users to specify the number of clusters in advance, determining the optimal number of clusters from
the dendrogram can be subjective and challenging, especially for complex datasets with overlapping
clusters.
Non-hierarchical cluster analysis
Non-hierarchical cluster analysis, also known as partitioning clustering, takes a different approach to grouping data points compared to hierarchical clustering.

Non-hierarchical clustering aims to directly classify data points into a predefined number of
clusters (k) based on their similarities.

It doesn't create a hierarchical structure like dendrograms in hierarchical clustering.

Non-hierarchical clustering methods offer flexibility in specifying the number of clusters and
can be effective for a wide range of data types and structures.
K-means clustering
One of the most popular non-hierarchical clustering algorithms is K-means clustering.

This popular algorithm starts by randomly selecting k data points as initial cluster centers
(centroids).

Then, each data point is assigned to the cluster with the nearest centroid based on a chosen
distance metric (e.g., Euclidean distance).

Once all points are assigned, the centroids are recalculated based on the average of the points in
their respective clusters.

This process of reassigning data points and recalculating centroids iterates until a stopping
criterion is met, such as minimal change in centroid positions or reaching a maximum number of
iterations.
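
A minimal sketch of the procedure just described, using scikit-learn's KMeans on the seminar's example data; choosing k = 2 is an assumption made only for illustration:

# Sketch: K-means on the example data; k=2 is an illustrative choice.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[3, 2], [4, 5], [4, 7], [2, 7], [6, 6], [7, 7], [6, 4]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("Labels:    ", km.labels_)           # cluster assignment of each observation
print("Centroids: ", km.cluster_centers_)  # recalculated centroid positions
print("Iterations:", km.n_iter_)           # stops once centroids stabilize
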
Advantages of non-hierarchical clustering
Scalability: Non-hierarchical clustering algorithms, such as K-means and Gaussian mixture models (GMM), are often more scalable and computationally efficient than hierarchical clustering methods. They can handle large datasets more effectively, making them suitable for big data applications.

Explicit number of clusters: Non-hierarchical clustering requires the user to specify the number
of clusters in advance, providing a clear and explicit outcome. This can be advantageous when
the desired number of clusters is known or when the data naturally partitions into a specific
number of groups.

Fast convergence: Algorithms like K-means converge quickly, especially for well-separated
clusters, leading to efficient clustering results in a relatively short amount of time.
Disadvantages of non-hierarchical cluster analysis
Dependence on initializations: Non-hierarchical clustering algorithms are sensitive to initializations,
particularly K-means. Different initializations can lead to different clustering results, and finding the
optimal initialization can be challenging, especially for high-dimensional data.

Difficulty with non-linear and non-spherical clusters: Non-hierarchical clustering algorithms like
K-means assume that clusters are spherical and may struggle with non-linear or non-spherical clusters.
They can produce suboptimal results when clusters have complex shapes or overlap significantly.

Need for predefined number of clusters: One of the main disadvantages of non-hierarchical
clustering is the need to specify the number of clusters in advance. This requirement can be problematic
when the optimal number of clusters is unknown or when the data does not naturally partition into a
specific number of groups.
Interpretation of results
Cluster Profiles: Examine the profiles or characteristics of each cluster to understand their distinguishing features. This may involve analyzing the mean or median values of the variables within each cluster or visualizing the data distribution within clusters using histograms, box plots, or other graphical methods (a brief code sketch follows at the end of this section).

Cluster Centroids or Representatives: For centroid-based clustering algorithms like K-means, examine the centroids of each cluster to understand their representative values. Centroids represent the average or central tendency of data points within a cluster and can provide insights into the typical characteristics of each cluster.

Cluster Size and Density: Analyze the size and density of each cluster to understand its prevalence
and compactness. Clusters with a large number of data points and high density are more homogeneous
and well-defined, while clusters with fewer data points and lower density may be more heterogeneous
or less distinct.
Cluster Separation: Assess the separation between clusters to determine how distinct they are from each
other. Clusters that are well-separated in feature space are more clearly delineated, while clusters that overlap
or are close together may be less distinct or more ambiguous.

Validation Metrics: Use validation metrics such as the silhouette score, Davies-Bouldin index, or Dunn index to quantitatively evaluate the quality of the clustering results. Higher silhouette and Dunn index values, and lower Davies-Bouldin values, indicate better-defined and more internally homogeneous clusters (see the sketch at the end of this section).

Visualization: Visualize the clusters and their relationships using scatter plots, heatmaps, or
multidimensional scaling (MDS) plots. Visualization techniques can help uncover patterns, trends, and
relationships between clusters and variables that may not be apparent from numerical summaries alone.

Interpretability: Ensure that the clusters are interpretable and meaningful in the context of the problem
domain. Consider whether the identified clusters can be explained and understood in terms of underlying
patterns, relationships, or phenomena in the data.
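
As a minimal, illustrative sketch covering a few of these steps (cluster profiles, cluster sizes, and a silhouette score), assuming the data matrix X and the fitted km model from the K-means sketch earlier and that pandas is available:

# Sketch: interpreting a clustering result.
# Assumes the data matrix X and fitted model km from the K-means sketch above.
import pandas as pd
from sklearn.metrics import silhouette_score

df = pd.DataFrame(X, columns=["Variable 1", "Variable 2"])
df["cluster"] = km.labels_

# Cluster profiles: mean of each clustering variable per cluster.
print(df.groupby("cluster").mean())

# Cluster sizes: how many observations fall in each cluster.
print(df["cluster"].value_counts().sort_index())

# Validation: silhouette score (higher values indicate better-separated clusters).
print("Silhouette:", silhouette_score(X, km.labels_))
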
THANK YOU
