Data Mining Project - Clustering - State Wise Health Income

The document discusses applying hierarchical and k-means clustering techniques to identify optimal clusters in sample data. It covers data exploration, outlier treatment, scaling, determining optimal clusters using dendrograms and elbow/silhouette scores, and describing cluster profiles. The conclusion discusses imputing missing values, outlier effects, scaling effects, and opportunities to further segment data and generate insights from clusters.


Data Mining

Clustering Project
Priyanka Sharma
Index

Question 1
Question 2
Question 3
Question 4
Question 5
Part 2: Clustering: Read the data and do exploratory data analysis. Describe the data briefly. (Check the null values, data types, shape, etc.)

The following steps were performed for EDA:

-Head: inspect the first few rows

-Info: check column data types and non-null counts

-Shape and dtypes

-Duplicate and null checks

The data contains no duplicate rows and no null values.
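The EDA steps above can be sketched as follows. The DataFrame here is a small toy stand-in with hypothetical column names, since the actual state-wise health/income columns are not shown in this report:

```python
import pandas as pd

# Toy stand-in for the state-wise health/income data (illustrative values only)
df = pd.DataFrame({
    "state": ["A", "B", "C", "D"],
    "health_index": [120, 130, 125, 140],
    "per_capita_income": [55000, 62000, 58000, 71000],
})

print(df.head())                 # first rows
df.info()                        # dtypes and non-null counts
print(df.shape)                  # (rows, columns)
print(df.duplicated().sum())     # duplicate rows -> 0 here
print(df.isnull().sum().sum())   # total null values -> 0 here
```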


Part 2: Clustering: Do you think scaling is necessary for clustering in this case? Justify.

Possible approaches for reducing noise:

1. Treating outliers using the IQR method.

2. Treating outliers using the z-score method.

3. Using the EDA results to segment the data into two or more parts and then applying the k-means algorithm to each part separately.

[Figure: boxplots of the variables before and after outlier treatment]
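A minimal sketch of the IQR capping approach mentioned above, using a toy series with one obvious outlier (the values are illustrative, not from the project data):

```python
import pandas as pd

# Toy numeric column with one extreme value
s = pd.Series([10, 12, 11, 13, 12, 11, 95])

# IQR method: cap values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
s_capped = s.clip(lower=lower, upper=upper)

# The z-score alternative would instead drop or cap rows where
# abs((s - s.mean()) / s.std()) exceeds a threshold such as 3.
print(s_capped.max())  # the 95 is capped down to the upper fence
```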


Part 2: Clustering: Apply hierarchical clustering to scaled data. Identify the number of optimum
clusters using Dendrogram and briefly describe them.

We used scikit-learn’s StandardScaler to perform z-score scaling.

Scaling the variables is important for clustering because it equalizes their influence on the distance computations. If there is a wide discrepancy in the ranges of the variables, cluster formation will be dominated by the variables with the largest ranges.
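A sketch of the StandardScaler step on toy two-column data (income in tens of thousands vs. a health index in the low hundreds, illustrating the range discrepancy):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: [per_capita_income, health_index] with very different scales
X = np.array([[55000.0, 120.0],
              [62000.0, 130.0],
              [58000.0, 125.0],
              [71000.0, 140.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Each column now has mean ~0 and unit standard deviation,
# so income no longer dominates the Euclidean distances.
print(X_scaled.mean(axis=0).round(6))
print(X_scaled.std(axis=0).round(6))
```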

We performed hierarchical clustering by constructing a dendrogram using Ward linkage with Euclidean distance. The dendrogram was created using SciPy's cluster hierarchy module.


In a dendrogram, each branch is called a clade, and the terminal end of each clade is called a leaf. The arrangement of the clades shows which leaves are most similar to each other, and the height of the branching points indicates how similar or different they are: the greater the height, the greater the difference. A 3-cluster solution is also plausible, but we chose 4 clusters from the dendrogram for this project.
Part 2: Clustering: Apply K-Means clustering on scaled data and determine
optimum clusters. Apply elbow curve and find the silhouette score.

Number of clusters: 4

Both hierarchical clustering and K-means clustering were performed. For K-means, we used the elbow plot and the silhouette score to identify the optimum number of clusters, whereas for hierarchical clustering a dendrogram was drawn. The hierarchical method gave 5 clusters, while K-means gave 5 clusters (using the elbow plot) and 4 clusters (using the silhouette score).
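A sketch of the elbow/silhouette procedure on the same kind of toy four-group data (not the project data), showing how the two diagnostics are computed side by side:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Toy "scaled" data with four well-separated groups
X = np.vstack([rng.normal(loc, 0.3, size=(20, 2))
               for loc in ([0, 0], [4, 0], [0, 4], [4, 4])])

inertias, sil = {}, {}
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_               # elbow plot: look for the bend
    sil[k] = silhouette_score(X, km.labels_)  # higher is better

best_k = max(sil, key=sil.get)
print(best_k)  # 4 for this toy data
```

Plotting `inertias` against k and looking for the bend gives the elbow choice; taking the k with the highest silhouette score gives the silhouette choice, and the two need not agree, as this project found.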
Part 2: Clustering: Describe the profiles of the clusters defined. Recommend different priority-based actions to be taken for the different clusters on the basis of their vulnerability according to their economic and health conditions.
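Cluster profiles of this kind are typically built by averaging each feature within each cluster. A hypothetical sketch with made-up cluster labels and feature values (the column names and numbers are illustrative, not the project's results):

```python
import pandas as pd

# Hypothetical cluster assignments and features (illustrative values only)
df = pd.DataFrame({
    "cluster": [1, 1, 2, 2, 3, 3, 4, 4],
    "per_capita_income": [30, 32, 55, 58, 75, 78, 95, 99],
    "health_index": [40, 42, 55, 53, 70, 72, 88, 90],
})

# Cluster profile: mean of each feature per cluster, most vulnerable first
profile = df.groupby("cluster").mean().sort_values("per_capita_income")
print(profile)
# Clusters with low income AND a low health index are the most vulnerable
# and would receive the highest-priority interventions.
```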
Conclusion
In this project:

We learned to impute missing values using a different approach, i.e. custom formulae.

We discussed the effect of outliers on the quality of the cluster profiles.

We discussed scaling and its effect on the performance of the algorithm.

We discussed that clusters need to be revisited if there is too much similarity, or overlap, among them.

What more could be done?

You can divide the data first, then segment each part using clustering.

You can dig deeper into the clusters and generate more insights.
