Business Report Data Mining
Problem 1: Clustering
Problem Statement: The given dataset describes the health and economic conditions of different States of a
country. The task is to group the States by how similar their situations are, so that these groups can be given to the
government and appropriate measures can be taken to improve their health and economic conditions.
Data Dictionary
1. States: names of States
2. Health_indeces1: A composite index rolls several related measures (indicators) into a single score that
provides a summary of how the health system is performing in the State.
3. Health_indeces2: A composite index rolls several related measures (indicators) into a single score that
provides a summary of how the health system is performing in certain areas of the States.
4. Per_capita_income: Per capita income (PCI) measures the average income earned per person in a given
area (city, region, country, etc.) in a specified year. It is calculated by dividing the area's total income by its
total population.
5. GDP: GDP provides an economic snapshot of a country/state, used to estimate the size of an economy and
growth rate.
1.1. Read the data and do exploratory data analysis. Describe the data briefly.
(Check the null values, Data types, shape, EDA, etc.)
Read the dataset- State_wise_Health_income-1 (1).csv
There are two variables, “Unnamed: 0” and States, that only identify the rows of the dataset and are not required in
the clustering process. Hence, these can be dropped.
After dropping these variables-
Info of the Dataset-
This gives us the descriptive statistics of the data, such as count, mean, frequency and the five-number summary.
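The checks described above (shape, data types, null values, descriptive statistics) can be sketched with pandas as follows. The rows here are hypothetical stand-in values so that the snippet runs on its own; the report itself reads State_wise_Health_income-1 (1).csv, and the column names are assumed to match the ones shown in the skewness output.

```python
import pandas as pd

# Hypothetical stand-in rows (not the real data); the report loads the CSV file instead.
df = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 3],
    "States": ["A", "B", "C", "D"],
    "Health_indeces1": [400, 380, 2900, 3100],
    "Health_indices2": [60, 65, 110, 120],
    "Per_capita_income": [550, 700, 2000, 2400],
    "GDP": [8000, 6500, 30000, 35000],
})

# Drop the identifier columns, which carry no information for clustering.
features = df.drop(columns=["Unnamed: 0", "States"])

print(features.shape)           # (rows, columns)
print(features.dtypes)          # data type of each column
print(features.isnull().sum())  # number of null values per column
print(features.describe().T)    # count, mean, std and five-number summary
```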
Univariate Analysis-
Skewness
GDP 0.829665
Per_capita_income 0.823113
Health_indeces1 0.715371
Health_indices2 -0.173803
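A skewness table like the one above can be produced with `DataFrame.skew()`. The sketch below uses two small hypothetical columns (not the real data) to show how a long right tail gives a positive value and a long left tail gives a negative one.

```python
import pandas as pd

# Hypothetical values chosen only to illustrate the sign of the skewness.
values = pd.DataFrame({
    "GDP": [1, 2, 3, 4, 20],                # long right tail -> positive skew
    "Health_indices2": [1, 17, 18, 19, 20]  # long left tail  -> negative skew
})

# Skewness per column, largest first, as in the listing above.
skewness = values.skew().sort_values(ascending=False)
print(skewness)
```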
Pairplot-
Covariance Matrix-
Correlation Matrix-
Heatmap-
From the above, it is evident that multicollinearity is present in the data.
The highest correlation is between “Health_indeces1” and “GDP”.
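As a rough sketch of how the most strongly correlated pair can be read off a correlation matrix programmatically, the snippet below builds a hypothetical dataset in which one column is deliberately constructed to follow GDP. The data and the strength of the relationship are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
gdp = rng.normal(size=50)

# Hypothetical data: Health_indeces1 is built to track GDP, Health_indices2 is independent.
toy = pd.DataFrame({
    "GDP": gdp,
    "Health_indeces1": 0.9 * gdp + rng.normal(scale=0.2, size=50),
    "Health_indices2": rng.normal(size=50),
})

corr = toy.corr()

# Keep only the upper triangle so each pair appears once and the diagonal is ignored.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

top_pair = pairs.abs().idxmax()
print(top_pair, round(pairs.loc[top_pair], 3))
```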
Outliers Check/Treatment-
Using boxplots
Outlier Treatment- Instead of removing the outlying values, which causes data loss, we will define a custom function: if a
value in a particular column is greater than the max value for that column, then the max value is assigned to it. The same
logic applies to the min value. This is known as min-max substitution.
Now, there are no outliers present in the dataset.
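A minimal sketch of such a capping function is given below. The report does not state how the max and min values are defined; this example assumes they are the boxplot whiskers, that is Q1 - 1.5 × IQR and Q3 + 1.5 × IQR. The function name `cap_outliers` and the sample values are hypothetical.

```python
import pandas as pd

def cap_outliers(col: pd.Series, k: float = 1.5) -> pd.Series:
    """Cap values outside the assumed boxplot whiskers at the whisker values."""
    q1, q3 = col.quantile(0.25), col.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Values above `upper` become `upper`; values below `lower` become `lower`.
    return col.clip(lower=lower, upper=upper)

# Hypothetical column with one obvious outlier (95).
sample = pd.Series([10, 12, 11, 13, 12, 95], name="GDP")
capped = cap_outliers(sample)
print(capped.tolist())
```

Because the values are capped rather than dropped, the number of rows stays the same.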
1.2. Do you think scaling is necessary for clustering in this case? Justify
Yes, scaling is necessary. Clustering algorithms such as K-means rely on Euclidean distance, so an attribute
measured on a larger scale would dominate the distance calculation and therefore the clusters.
The above dataset consists of attributes with different units of measurement and very different ranges. Scaling
brings them onto a common scale so that each attribute contributes comparably to the distance.
We will use z-score scaling here, after which each attribute has mean = 0 and standard deviation = 1.
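The z-score transformation can be sketched directly in pandas as below, using hypothetical values. The population standard deviation (`ddof=0`) is assumed here, which is also what `scipy.stats.zscore` and scikit-learn's `StandardScaler` use by default.

```python
import pandas as pd

# Hypothetical values on very different scales.
raw = pd.DataFrame({
    "Per_capita_income": [500.0, 1000.0, 1500.0],
    "GDP": [10000.0, 20000.0, 60000.0],
})

# z = (x - mean) / standard deviation, computed column by column.
scaled = (raw - raw.mean()) / raw.std(ddof=0)

print(scaled.round(3))
print(scaled.mean().round(6))       # approximately 0 for every column
print(scaled.std(ddof=0).round(6))  # 1 for every column
```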
Prior to Scaling-
After Scaling-
Dendrogram (p = 10)-
Dendrogram (p = 25)-
From the above dendrograms, it is clear that 3 clusters should be formed.
We will use the fcluster function to create the clusters.
The 3 clusters are labelled 1, 2 and 3, and the labels are stored in a new column named “cluster-3” in the dataset.
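The hierarchical clustering steps can be sketched with SciPy as follows. The report does not say which linkage method was used, so Ward linkage is an assumption here, as is the reading of p as the `p` argument of a truncated dendrogram. The data are three hypothetical, well-separated groups of points.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(42)

# Hypothetical scaled data: three well-separated groups of 10 points each.
centers = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
X = np.vstack([c + rng.normal(scale=0.3, size=(10, 2)) for c in centers])

# Build the merge tree (Ward linkage is an assumption).
Z = linkage(X, method="ward")

# Truncated dendrogram showing only the last p merged clusters;
# no_plot=True returns the layout without drawing anything.
tree = dendrogram(Z, truncate_mode="lastp", p=10, no_plot=True)

# Cut the tree into at most 3 clusters, labelled 1, 2 and 3.
labels = fcluster(Z, t=3, criterion="maxclust")
print(np.bincount(labels)[1:])  # cluster frequencies
```

The `labels` array can then be assigned to a new column of the original DataFrame, such as the “cluster-3” column mentioned above.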
Cluster Frequency-
Cluster Profiles-
Elbow Curve-
Insights- From the above graph, the optimal number of clusters is either 3 or 4. We will go forward with 3 clusters.
We create 3 clusters using K-means and add the cluster labels to the original dataset.
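An elbow curve is built from the within-cluster sum of squares (`inertia_` in scikit-learn) for a range of cluster counts. The sketch below uses hypothetical data with three clear groups; the range of k and the `random_state` are arbitrary choices, not the report's settings.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)

# Hypothetical scaled data: three well-separated groups of 20 points each.
centers = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
X = np.vstack([c + rng.normal(scale=0.3, size=(20, 2)) for c in centers])

# Within-cluster sum of squares for k = 1..6; the "elbow" is where it stops dropping sharply.
wss = []
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X)
    wss.append(model.inertia_)

for k, value in enumerate(wss, start=1):
    print(k, round(value, 2))

# Final model with the chosen number of clusters; labels_ holds one label per row.
final = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X)
cluster_labels = final.labels_
```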
Silhouette Method- In this method we compute the silhouette coefficient for each data point. It measures how close
the point is to its own cluster compared with the nearest other cluster, and ranges from -1 to 1.
Silhouette Score - 0.5340151343712788
Silhouette scores computed for a range of cluster counts-
[0.5282573570427488,
0.5340151343712788,
0.5524561729411546,
0.5208181010553294,
0.5337141912655894,
0.5557534218887419,
0.5342932176693953,
0.5083265323516991,
0.5145381754982109]
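A list of silhouette scores like the one above can be obtained by fitting K-means for several values of k and calling `silhouette_score` on each result. For a single point, the coefficient is (b - a) / max(a, b), where a is the mean distance to points in its own cluster and b is the mean distance to points in the nearest other cluster. The sketch below uses hypothetical data and an arbitrary range of k.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)

# Hypothetical scaled data: three well-separated groups of 20 points each.
centers = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
X = np.vstack([c + rng.normal(scale=0.3, size=(20, 2)) for c in centers])

# Average silhouette score for each candidate number of clusters (k must be at least 2).
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```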
Optimal Clusters-
We will now check the outputs for both the 3-cluster and 4-cluster solutions and choose the better one.
3 Cluster Solution-
Cluster Frequency-
Cluster Profiles-
4 Cluster Solution-
Cluster Frequency-
Cluster Profiles-
Observations - Based on the above cluster solutions, the 3-cluster solution seems to be the best fit, as it differentiates
the 3 clusters as-
High GDP per capita area
Medium GDP per capita area
Low GDP per capita area
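Cluster frequencies and cluster profiles of the kind referred to above are typically computed with a group-by on the cluster label. The sketch below uses hypothetical values and a hypothetical column name `cluster`; it is an illustration, not the report's actual output.

```python
import pandas as pd

# Hypothetical data with cluster labels already assigned.
toy = pd.DataFrame({
    "Per_capita_income": [500, 600, 1500, 1700, 3000, 3400],
    "GDP": [6000, 8000, 15000, 17000, 30000, 34000],
    "cluster": [1, 1, 2, 2, 3, 3],
})

# Cluster profile: mean of every attribute within each cluster.
profile = toy.groupby("cluster").mean()

# Cluster frequency: number of rows in each cluster.
profile["freq"] = toy["cluster"].value_counts().sort_index()

# Ordering by GDP makes the low / medium / high groups easy to read.
ranked = profile.sort_values("GDP")
print(ranked)
```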
1.5. Describe cluster profiles for the clusters defined. Recommend different
priority-based actions that need to be taken for different clusters on the basis
of their vulnerability situations according to their Economic and Health
Conditions.
Our main objective was to divide the data into an optimal number of clusters.
From both hierarchical clustering and K-means clustering, we get 3 as the optimal number of clusters.
Maintaining the growth in productivity and the size of the workforce will keep the Health and Economic
conditions high.
In these areas, more new businesses will help growth and development.
Cutting tax rates will also help these areas in growing.