Business Report Data Mining

The document discusses clustering health and economic data from states in a country. It performs exploratory data analysis on the dataset, which has no missing values and consists of health indices, per capita income, and GDP variables. Both hierarchical and K-means clustering are applied to identify optimal clusters. Both methods indicate 3 clusters as the best solution. The clusters represent high, medium, and low GDP per capita areas. Priority actions are recommended based on vulnerability, with lower GDP areas needing more support for health and economic conditions.

Uploaded by

shorya

Business Report: Data Mining Project

By- Shorya Goel

Problem 1: Clustering
Problem Statement: The dataset given is about the Health and Economic conditions in different States of a
country. The task is to group the States based on how similar their situations are, so that these groups can be
provided to the government and appropriate measures can be taken to improve their Health and Economic conditions.

Data Dictionary
1. States: names of States
2. Health_indeces1: A composite index rolls several related measures (indicators) into a single score that
provides a summary of how the health system is performing in the State.
3. Health_indeces2: A composite index rolls several related measures (indicators) into a single score that
provides a summary of how the health system is performing in certain areas of the States.
4. Per_capita_income: Per capita income (PCI) measures the average income earned per person in a given
area (city, region, country, etc.) in a specified year. It is calculated by dividing the area's total income by its
total population.
5. GDP: GDP provides an economic snapshot of a country/state, used to estimate the size of an economy and
growth rate.

1.1. Read the data and do exploratory data analysis. Describe the data briefly.
(Check the null values, Data types, shape, EDA, etc.)
Read the dataset- State_wise_Health_income-1 (1).csv

Two variables, “Unnamed: 0” and “States”, serve only as identifiers in the dataset and are not required in the
clustering process. Hence, they can be dropped.
After dropping these variables-
Info of the Dataset-

There are 4 variables and 297 records.


No missing records were found in the initial analysis.
All the variables are of integer type.

Shape of the Dataset: (297, 4)


This shows the total number of rows = 297 and total number of columns = 4.

Checking Missing Values

There are no missing values present in the dataset.
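
The checks described above can be sketched with pandas as follows. This is only an illustration: the small DataFrame is a made-up stand-in for the real CSV file, and the variable names `raw`, `df` and `missing` are assumptions, not the code used in the report.

```python
import pandas as pd

# Made-up stand-in for State_wise_Health_income-1 (1).csv (values are illustrative only).
# With the real file this would be: raw = pd.read_csv("State_wise_Health_income-1 (1).csv")
raw = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 3],
    "States": ["A", "B", "C", "D"],
    "Health_indeces1": [417, 383, 377, 1130],
    "Health_indices2": [66, 56, 54, 84],
    "Per_capita_income": [564, 682, 716, 2078],
    "GDP": [1823, 3938, 2324, 5233],
})

# Drop the two identifier columns that are not needed for clustering
df = raw.drop(columns=["Unnamed: 0", "States"])

print(df.shape)        # (rows, columns)
print(df.dtypes)       # data type of every variable
missing = df.isnull().sum()
print(missing)         # missing values per column
print(df.describe())   # descriptive statistics
```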

Summary of the Dataset

This gives us the descriptive statistics of the data, such as count, mean, standard deviation and the 5 point summary.
Univariate Analysis-

Skewness
GDP 0.829665
Per_capita_income 0.823113
Health_indeces1 0.715371
Health_indices2 -0.173803
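
A skewness table like the one above can be obtained with the pandas `skew` method. The sketch below uses two made-up columns (`right_tail` and `left_tail` are hypothetical names) only to show how the sign of the result relates to the direction of the tail.

```python
import pandas as pd

# Made-up columns: one with a long right tail, one with a long left tail
demo = pd.DataFrame({
    "right_tail": [1, 2, 2, 3, 3, 3, 20],
    "left_tail": [1, 18, 19, 19, 20, 20, 20],
})

# Positive values indicate right skew, negative values indicate left skew
skewness = demo.skew().sort_values(ascending=False)
print(skewness)
```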

From the above graphs it can be observed that:
- All the variables except Health_indices2 are right skewed.
- Health_indices2 is negatively (left) skewed.
- There are outliers present in “Health_indeces1” and “Per_capita_income”.
- The data points in all the variables are distributed in a broadly similar fashion.
Multivariate Analysis-

Pairplot-

Covariance Matrix-
Correlation Matrix-

Heatmap-

From the above it is evident that there is multi-collinearity present in the data.
Highest correlation is between “Health_indeces1” and “GDP”.
Outliers Check/Treatment-

Using boxplots

No. of outliers in Health_indeces1: 2


No. of outliers in Per_capita_income: 1

Outlier Treatment- Instead of dropping the outlying records, which causes data loss, we will define a custom function:
if a value in a particular column is greater than the allowed max value (the upper limit for that column), then that
max value is assigned to it. The same logic applies to the min value. This is known as min-max substitution.
Now, there are no outliers present in the dataset.
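
A minimal sketch of such a capping function is given below. The report does not state how the min and max limits were chosen, so the boxplot whisker rule (1.5 times the interquartile range beyond the quartiles) is an assumption here, and `cap_outliers` is a hypothetical name.

```python
import pandas as pd

def cap_outliers(col: pd.Series) -> pd.Series:
    """Replace values beyond the whisker limits with the limits themselves."""
    q1, q3 = col.quantile(0.25), col.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr   # assumed min limit (lower boxplot whisker)
    upper = q3 + 1.5 * iqr   # assumed max limit (upper boxplot whisker)
    return col.clip(lower=lower, upper=upper)

# Made-up column with one obvious outlier (95)
demo = pd.Series([10, 12, 13, 14, 15, 16, 18, 95], name="Health_indeces1")
capped = cap_outliers(demo)
print(capped.tolist())
```

No record is removed: the series keeps its length and only the extreme value is pulled back to the limit.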

1.2. Do you think scaling is necessary for clustering in this case? Justify

Yes, scaling is necessary. Clustering algorithms such as K-means need feature scaling before the data is fed to
the algorithm: since these techniques use Euclidean distance, attributes measured on larger scales would otherwise
dominate the distance calculation.
The above dataset consists of attributes with different units of measurement, so scaling brings all the variables
onto a common, comparable range.
We will use z-score scaling here, after which each variable has mean = 0 and standard deviation = 1.
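
As an illustration, z-score scaling can be written directly in pandas. The values below are made up and the population standard deviation (`ddof=0`) is an assumption; the report does not show which scaler was used.

```python
import pandas as pd

# Made-up values on very different scales
df = pd.DataFrame({
    "Health_indices2": [66, 56, 54, 84, 70],
    "GDP": [1823, 3938, 2324, 5233, 4100],
})

# z = (x - mean) / standard deviation, computed column by column
scaled = (df - df.mean()) / df.std(ddof=0)
print(scaled.round(3))
```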

Prior to Scaling-
After Scaling-

Now the data lies in a comparable range, roughly between -1.5 and 3.


1.3. Apply hierarchical clustering to scaled data. Identify the number of
optimum clusters using Dendrogram and briefly describe them.
There are different linkage methods for hierarchical clustering; in this dataset we will use the “Average” and “Ward” linkage methods.
Average Linkage-
In this method the distances between every pair of observations, one from each of the two clusters, are added up and
divided by the number of pairs to get an average inter-cluster distance.
Average-linkage and complete-linkage are two of the most popular linkage criteria in hierarchical clustering.

To make the dendrogram clearer, we will truncate it.

P= 10

P= 25
From the above dendrogram it is clear that 3 clusters should be formed.
We will use the fcluster function to create the clusters.
The 3 clusters (labelled 1, 2 and 3) are stored in a new column named “cluster-3” in the dataset.
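
The workflow above (linkage, truncated dendrogram, then fcluster) can be sketched with SciPy as follows. The data are three synthetic, well separated groups standing in for the scaled dataset, so the resulting counts are not those of the report; `no_plot=True` is used only so that the sketch runs without a plotting backend.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Synthetic stand-in for the scaled data: three separated groups of 20 points, 4 variables
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(20, 4)) for c in (-2.0, 0.0, 2.0)])

# Average linkage on Euclidean distances
Z = linkage(X, method="average")

# Truncated dendrogram showing only the last p = 10 merged clusters
tree = dendrogram(Z, truncate_mode="lastp", p=10, no_plot=True)

# Cut the tree into at most 3 clusters, labelled 1, 2 and 3
labels = fcluster(Z, t=3, criterion="maxclust")
print(np.bincount(labels)[1:])   # cluster frequencies
```

Changing `method="average"` to `method="ward"` gives the Ward linkage variant with the same remaining steps.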

Cluster Frequency-

Cluster Profiles-
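
Frequency and profile tables of this kind are typically built with `value_counts` and `groupby`. The sketch below uses made-up values; only the column name “cluster-3” comes from the report.

```python
import pandas as pd

# Made-up records with cluster labels already assigned
df = pd.DataFrame({
    "Per_capita_income": [2000, 2100, 500, 600, 550, 1200],
    "GDP": [5000, 5200, 1000, 1200, 1100, 3000],
    "cluster-3": [1, 1, 2, 2, 2, 3],
})

# Cluster frequency: number of records in each cluster
freq = df["cluster-3"].value_counts().sort_index()

# Cluster profile: mean of every variable within each cluster
profile = df.groupby("cluster-3").mean()
profile["freq"] = freq
print(profile)
```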

Cluster Visualization for Average Linkage-


Ward Linkage-
In this method the linkage function describing the distance between two clusters is computed as the increase in the
"error sum of squares" (ESS) after fusing two clusters into a single cluster.
Ward’s method chooses the successive steps in order to minimize the increase in ESS at each step.
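
To make the ESS idea concrete, the small example below computes the increase in ESS when two made-up clusters are fused, and checks it against the equivalent closed form n_a * n_b / (n_a + n_b) times the squared distance between the two centroids. The helper name `ess` is hypothetical.

```python
import numpy as np

def ess(points: np.ndarray) -> float:
    """Error sum of squares: squared distances of the points to their own centroid."""
    return float(((points - points.mean(axis=0)) ** 2).sum())

# Two made-up clusters of two points each
a = np.array([[0.0, 0.0], [0.0, 2.0]])
b = np.array([[4.0, 0.0], [4.0, 2.0]])

# Increase in ESS caused by fusing a and b into one cluster
increase = ess(np.vstack([a, b])) - ess(a) - ess(b)

# Closed form of the same quantity
gap = a.mean(axis=0) - b.mean(axis=0)
closed_form = len(a) * len(b) / (len(a) + len(b)) * float(gap @ gap)
print(increase, closed_form)
```

At each step Ward's method fuses the pair of clusters for which this increase is smallest.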

To make the dendrogram clearer, we will truncate it.

P= 10

P= 25
From the above dendrogram it is clear that 3 clusters should be formed.
We will use the fcluster function to create the clusters.
The 3 clusters (labelled 1, 2 and 3) are stored in a new column named “cluster-3” in the dataset.

Cluster Frequency-

Cluster Profiles-

Cluster Visualization for Ward Linkage-


Observations-
- The cluster mean values differ between Average linkage and Ward linkage, and the cluster frequencies vary
considerably.
- We will prefer Ward linkage for this dataset as it performed well here.
- Based on the above dendrograms, the 3 cluster solution seems to be the best fit.
- The three group cluster solution gives a pattern based on high, medium and low GDP per capita areas.

1.4. Apply K-Means clustering on scaled data and determine optimum
clusters. Apply elbow curve and find the silhouette score.
K-Means Clustering- This is an iterative method of partitioning the data into K predefined, distinct, non-overlapping
subgroups, also known as clusters, in which each data point belongs to exactly one group. It tries to make the data
points within a cluster as similar as possible while keeping the different clusters as far apart as possible.
Working steps of the k-means algorithm-
- Specify the number of clusters K.
- Initialize the centroids by first shuffling the dataset and then randomly selecting K data points for the centroids
without replacement.
- Keep iterating the following steps until there is no change to the centroids, i.e. the assignment of data points to
clusters isn’t changing:
   - Compute the squared distance between each data point and all centroids.
   - Assign each data point to the closest cluster (centroid).
   - Compute the centroids for the clusters by taking the average of all the data points that belong to each
cluster.
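
The working steps listed above can be sketched in plain NumPy as follows. This is a teaching sketch under simple assumptions (random initial centroids, a fixed iteration cap), not the library implementation used for the results in this report; the function name `kmeans` and its parameters are hypothetical.

```python
import numpy as np

def kmeans(X, k, seed=0, max_iter=100):
    rng = np.random.default_rng(seed)
    # Initialize: pick k distinct data points as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Squared distance from every point to every centroid
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        # Assign each point to its closest centroid
        labels = d2.argmin(axis=1)
        # Recompute each centroid as the mean of its points (keep it if the cluster is empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Inertia: sum of squared distances to the closest centroid (from the last assignment step)
    inertia = float(d2.min(axis=1).sum())
    return labels, centroids, inertia

# Made-up data with two obvious groups
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
labels, centroids, inertia = kmeans(X, k=2)
print(labels, inertia)
```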

Now performing K-Means Elbow Method for K= (1 to 10)

The inertia for K= 1 to K=10.

Elbow Curve-
Insights- From the above graph the optimal number of clusters will be either 3 or 4. We will go forward with 3 clusters.
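
The inertia values behind an elbow curve can be collected with scikit-learn as sketched below. The data are three synthetic groups standing in for the scaled dataset, and `n_init=10` and `random_state=1` are assumed settings, so the numbers will differ from those in the report.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the scaled data: three separated groups
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(20, 4)) for c in (-2.0, 0.0, 2.0)])

# Fit K-means for K = 1 to 10 and record the inertia (within-cluster sum of squares)
inertias = []
for k in range(1, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X)
    inertias.append(model.inertia_)

for k, value in zip(range(1, 11), inertias):
    print(k, round(value, 2))
```

Plotting `inertias` against K gives the elbow curve; the elbow is the point after which adding clusters reduces the inertia only slightly.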

Creating 3 clusters using Kmeans and adding them to the original dataset.

Cluster Visualization for Kmeans –

Silhouette Method- In this we compute the silhouette coefficient for each data point. It measures how close the point
is to its own cluster compared with the other clusters.
Silhouette Score for K=3 - 0.5340151343712788

Scores for cluster K=2 to K=10:

[0.5282573570427488,
0.5340151343712788,
0.5524561729411546,
0.5208181010553294,
0.5337141912655894,
0.5557534218887419,
0.5342932176693953,
0.5083265323516991,
0.5145381754982109]

Graph plot using Silhouette Score-


Now, adding Silhouette width to the K-Mean dataset-
Silhouette width ranges from -1 to +1, with values close to +1 indicating that a data point is well matched to its cluster.
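
A sketch of how the silhouette score for each K and the per-point silhouette widths can be computed with scikit-learn is shown below. It again uses synthetic groups in place of the scaled dataset, with assumed settings `n_init=10` and `random_state=1`, so the scores are not the ones listed above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

# Synthetic stand-in for the scaled data: three separated groups
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(20, 4)) for c in (-2.0, 0.0, 2.0)])

# Average silhouette score for K = 2 to 10
labels_by_k = {}
scores = {}
for k in range(2, 11):
    labels_by_k[k] = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    scores[k] = silhouette_score(X, labels_by_k[k])

best_k = max(scores, key=scores.get)

# Silhouette width of every individual point for the 3 cluster solution
widths = silhouette_samples(X, labels_by_k[3])
print(best_k, round(scores[3], 3), round(widths.min(), 3))
```

The overall silhouette score is simply the mean of the individual silhouette widths.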

Optimal Clusters-
We will now check the outputs for both the 3 cluster and the 4 cluster solutions and choose the optimal one.

3 Cluster Solution-

Cluster Frequency-
Cluster Profiles-

4 Cluster Solution-

Cluster Frequency-

Cluster Profiles-

Observations - Based on the above cluster solutions, the 3 cluster solution seems to be the best fit as it differentiates
the 3 clusters as-
- High GDP per capita area
- Medium GDP per capita area
- Low GDP per capita area

1.5. Describe cluster profiles for the clusters defined. Recommend different
priority based actions that need to be taken for different clusters on the basis
of their vulnerability situations according to their Economic and Health
Conditions.

Our main objective was to divide the data into an optimal number of clusters.
From both hierarchical clustering and K-means clustering, we get 3 as the optimal number of clusters.

Insights from all the above clustering methods-

3 group cluster via Kmeans-

Here, Cluster 1 = Low GDP per capita
Cluster 2 = Medium GDP per capita
Cluster 3 = High GDP per capita

3 group cluster via hierarchical clustering-

Here, Cluster 1 = High GDP per capita
Cluster 2 = Low GDP per capita
Cluster 3 = Medium GDP per capita

Cluster Group Profiles-

Cluster 1: High GDP per capita Areas
   - These are the areas which have the highest growth rate.
   - The health and economic conditions in these areas are excellent.
   - Per capita income in these areas is very high.
Cluster 2: Low GDP per capita Areas
   - These are the areas which have very low growth rate.
   - The health and economic conditions are not good in these areas.
   - Per capita income in these areas is very low.
Cluster 3: Medium GDP per capita Areas
   - These are the areas which have an average growth rate.
   - The health and economic conditions in these areas are adequate.
   - Per capita income in these areas is average.

Recommendations for each cluster profile.


The main features that affect the Health and Economic conditions are workforce and productivity. The higher these
attributes, the higher the GDP per capita and thus the better the Health and Economic conditions.
Cluster 1: High GDP per capita Areas

- Maintaining the growth in productivity and the size of the workforce will keep the Health and Economic
conditions high.

Cluster 2: Low GDP per capita Areas

- Large scale industries need to be opened in these areas.
- More employment opportunities should be created to increase the size of the workforce and thus increase
productivity.
These measures will help improve the Health and Economic conditions in these areas.

Cluster 3: Medium GDP per capita Areas

- More new businesses in these areas will help their growth and development.
- Cutting tax rates will also help these areas grow.
