Business Report Data Mining

The document discusses clustering health and economic data from states in a country. It performs exploratory data analysis on the dataset, which has no missing values and consists of health indices, per capita income, and GDP variables. Both hierarchical and K-means clustering are applied to identify optimal clusters. Both methods indicate 3 clusters as the best solution. The clusters represent high, medium, and low GDP per capita areas. Priority actions are recommended based on vulnerability, with lower GDP areas needing more support for health and economic conditions.

Uploaded by

shorya

Business Report: Data Mining Project

By- Shorya Goel

Problem 1: Clustering
Problem Statement: The dataset given is about the health and economic conditions in different States of a
country. The goal is to group the States based on how similar their situations are, and to provide these groups
to the government so that appropriate measures can be taken to improve their health and economic conditions.

Data Dictionary
1. States: names of States
2. Health_indeces1: A composite index rolls several related measures (indicators) into a single score that
provides a summary of how the health system is performing in the State.
3. Health_indices2: A composite index rolls several related measures (indicators) into a single score that
provides a summary of how the health system is performing in certain areas of the States.
4. Per_capita_income: Per capita income (PCI) measures the average income earned per person in a given
area (city, region, country, etc.) in a specified year. It is calculated by dividing the area's total income by its
total population.
5. GDP: GDP provides an economic snapshot of a country/state, used to estimate the size of an economy and
growth rate.

1.1. Read the data and do exploratory data analysis. Describe the data briefly.
(Check the null values, Data types, shape, EDA, etc.)
Read the dataset- State_wise_Health_income-1 (1).csv

The two variables “Unnamed: 0” and “States” serve only as identifiers in the dataset and are not required in the
clustering process. Hence, they can be dropped.
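The loading and dropping step could look roughly like the sketch below. The real CSV is not available here, so a tiny in-memory stand-in with made-up values is used; only the column names follow the report.

```python
import io
import pandas as pd

# Toy stand-in for "State_wise_Health_income-1 (1).csv"; all values are made up.
toy_csv = io.StringIO(
    "Unnamed: 0,States,Health_indeces1,Health_indices2,Per_capita_income,GDP\n"
    "0,StateA,417,66,564,1823\n"
    "1,StateB,383,58,720,3673\n"
    "2,StateC,377,57,1010,2123\n"
)
df = pd.read_csv(toy_csv)

# Drop the identifier columns, which carry no information for clustering.
df_num = df.drop(columns=["Unnamed: 0", "States"])
print(df_num.shape)
```

With the real file, `pd.read_csv` would be given the file path instead of the in-memory buffer.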
After dropping these variables-
Info of the Dataset-

There are 4 variables and 297 records.

No missing record based on initial analysis.
All the variables are integer type variables.

Shape of the Dataset: (297, 4)

This shows the total number of rows = 297 and total number of columns = 4.

Checking Missing Values

There are no missing values present in the dataset.

Summary of the Dataset

This gives us the descriptive statistics of the data, such as count, mean, standard deviation and the five-point summary.
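The checks described above (info, shape, missing values, summary) map onto a few pandas calls. This is a minimal sketch on made-up values, not the report's actual output.

```python
import pandas as pd

# Made-up values standing in for the four numeric columns of the real data.
df_num = pd.DataFrame({
    "Health_indeces1": [417, 383, 377, 2500],
    "Health_indices2": [66, 58, 57, 120],
    "Per_capita_income": [564, 720, 1010, 2100],
    "GDP": [1823, 3673, 2123, 15000],
})

df_num.info()                    # column names, dtypes, non-null counts
print(df_num.shape)              # (rows, columns)
missing = df_num.isnull().sum()  # number of missing values per column
summary = df_num.describe()      # count, mean, std, min, quartiles, max
print(missing)
print(summary)
```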
Univariate Analysis-

Skewness
GDP 0.829665
Per_capita_income 0.823113
Health_indeces1 0.715371
Health_indices2 -0.173803

From the above graphs it can be observed that:

- All the variables except Health_indices2 are right skewed.
- Health_indices2 is negatively (left) skewed.
- There are outliers present in “Health_indeces1” and “Per_capita_income”.
- The data points in all the variables are distributed in a broadly similar fashion.
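Skewness values like the ones listed above can be obtained with `DataFrame.skew()`. The sketch below uses two made-up columns, one with a long right tail and one with a long left tail, purely to show how the sign is read.

```python
import pandas as pd

# Made-up values: "GDP" has a long right tail, "Health_indices2" a long left tail.
df_num = pd.DataFrame({
    "GDP": [1, 2, 2, 3, 3, 3, 20],
    "Health_indices2": [20, 19, 19, 18, 18, 18, 1],
})

skew = df_num.skew().sort_values(ascending=False)
print(skew)  # positive = right skewed, negative = left skewed
```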
Multivariate Analysis-

Pairplot-

Covariance Matrix-
Correlation Matrix-

Heatmap-

From the above it is evident that there is multi-collinearity present in the data.
Highest correlation is between “Health_indeces1” and “GDP”.
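One way to read the most correlated pair off a correlation matrix is sketched below. The values are made up and were chosen so that the same pair comes out on top; they are not the report's data.

```python
import numpy as np
import pandas as pd

# Made-up values; only the column names follow the report.
df_num = pd.DataFrame({
    "Health_indeces1": [10, 20, 30, 40, 50],
    "Health_indices2": [5, 3, 6, 2, 4],
    "Per_capita_income": [50, 80, 60, 95, 70],
    "GDP": [100, 210, 290, 405, 500],
})

corr = df_num.corr()  # Pearson correlation matrix

# Keep only the upper triangle so that each pair is listed once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

top_pair = pairs.abs().idxmax()
print(top_pair, round(pairs[top_pair], 3))
```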
Outliers Check/Treatment-

Using boxplots

No. of outliers in Health_indeces1: 2

No. of outliers in Per_capita_income: 1

Outlier Treatment- Instead of dropping the outlying records, which would cause data loss, we will define a custom
function: if a value in a column is greater than the column's upper limit (max value), it is replaced by that limit.
The same logic applies to the lower limit (min value). This is known as min-max substitution (capping).
Now, there are no outliers present in the dataset.
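A capping function of this kind might be written as below. The report does not state how the limits were computed; this sketch assumes the usual boxplot whiskers (Q1 - 1.5*IQR and Q3 + 1.5*IQR), and `cap_outliers` is a hypothetical name.

```python
import pandas as pd

def cap_outliers(col: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values to the boxplot whiskers (assumed limits: Q1 - k*IQR, Q3 + k*IQR)."""
    col = col.astype(float)
    q1, q3 = col.quantile(0.25), col.quantile(0.75)
    iqr = q3 - q1
    return col.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Made-up column with one obvious outlier.
s = pd.Series([10, 12, 11, 13, 12, 100], name="Per_capita_income")
capped = cap_outliers(s)
print(capped.tolist())
```

No rows are removed: the outlying value is pulled back to the upper whisker and all other values are unchanged.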

1.2. Do you think scaling is necessary for clustering in this case? Justify

Yes, scaling is necessary. Clustering algorithms such as K-means need feature scaling before the data is fed to
the algorithm: because these techniques use Euclidean distance, attributes measured in larger units would
otherwise dominate the distance.
The dataset consists of attributes with different units of measurement and different magnitudes, so scaling
brings them onto a common scale with comparable ranges.
We will use z-score scaling here, after which each variable has mean = 0 and standard deviation = 1.
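Z-score scaling can be sketched with scikit-learn's `StandardScaler`; whether the report used this class or another z-score routine is not stated, and the values below are made up.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up values on very different scales.
df_num = pd.DataFrame({
    "Health_indices2": [66.0, 58.0, 57.0, 120.0],
    "GDP": [1823.0, 3673.0, 2123.0, 15000.0],
})

# Each column becomes (x - mean) / std, using the population standard deviation.
scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(df_num), columns=df_num.columns)
print(scaled.round(3))
```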

Prior to Scaling-
After Scaling-

After scaling, the values of all variables lie in a comparable range, roughly between -1.5 and 3.

1.3. Apply hierarchical clustering to scaled data. Identify the number of
optimum clusters using Dendrogram and briefly describe them.
There are different methods of hierarchical clustering. For this dataset we will use the “Average” and “Ward” linkage methods.
Average Linkage-
In this method, the distances between every pair of observations, one from each of the two clusters, are added up
and divided by the number of pairs to get an average inter-cluster distance.
Average linkage and complete linkage are among the most popular linkage criteria in hierarchical clustering.

To make the dendrogram clearer, we truncate it to show only the last p merged clusters.

P= 10

P= 25
From the above dendrogram it is clear that 3 clusters should be formed.
We will use the fcluster function (from scipy.cluster.hierarchy) to create the clusters.
We create 3 clusters (labelled 1, 2 and 3) and store the labels in a new column named “cluster-3” in the dataset.
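The average-linkage workflow above can be sketched with SciPy as follows. The data here are three synthetic, well-separated blobs rather than the scaled State data, and the dendrogram is computed without plotting.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Synthetic stand-in for the scaled data: three well-separated 2-D blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
X = np.vstack([c + 0.3 * rng.standard_normal((20, 2)) for c in centers])

# Average linkage on Euclidean distances (SciPy's default metric).
Z = linkage(X, method="average")

# Truncated dendrogram showing only the last p = 10 merged clusters.
tree = dendrogram(Z, truncate_mode="lastp", p=10, no_plot=True)

# Cut the tree into at most 3 clusters; labels start at 1.
labels = fcluster(Z, t=3, criterion="maxclust")
print(np.bincount(labels)[1:])
```

Dropping `no_plot=True` draws the dendrogram when matplotlib is available; the labels can then be stored in a column such as “cluster-3”.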

Cluster Frequency-

Cluster Profiles-

Cluster Visualization for Average Linkage-

Ward Linkage-
In this method the linkage function describing the distance between two clusters is computed as the increase in the
"error sum of squares" (ESS) after fusing two clusters into a single cluster.
Ward’s method chooses the successive steps in order to minimize the increase in ESS at each step.
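The ESS idea can be checked on a tiny example. The sketch below computes the increase in ESS for each merge by hand and compares it with SciPy's Ward linkage, whose merge height equals the square root of twice that increase. The three points and the helper `ess` are made up for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

def ess(points):
    """Error sum of squares: squared distances of points to their own centroid."""
    points = np.asarray(points, dtype=float)
    return float(((points - points.mean(axis=0)) ** 2).sum())

# Three made-up points on a line.
X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0]])
Z = linkage(X, method="ward")

# Increase in ESS caused by each merge, computed by hand.
first = ess(X[[0, 1]]) - ess(X[[0]]) - ess(X[[1]])  # merge the two nearest points
second = ess(X) - ess(X[[0, 1]]) - ess(X[[2]])      # then merge in the far point

print(first, second)
print(Z[:, 2] ** 2 / 2)  # the same increases, recovered from the merge heights
```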

To make the dendrogram clearer, we truncate it to show only the last p merged clusters.

P= 10

Cluster Frequency-

Cluster Profiles-

Cluster Visualization for Ward Linkage-

Observations-
- The cluster means differ between average linkage and Ward linkage, and the cluster frequencies vary considerably
between the two methods.
- We prefer Ward linkage for this dataset, as it performed better here.
- Based on the dendrograms above, a 3-cluster solution seems to be the best fit.
- The 3-cluster solution gives a pattern of high, medium and low GDP per capita areas.

1.4. Apply K-Means clustering on scaled data and determine optimum
clusters. Apply elbow curve and find the silhouette score.
K-Means Clustering- This is an iterative method that partitions the data into K predefined, distinct, non-overlapping
subgroups, also known as clusters, so that each data point belongs to exactly one group. It tries to make the data
points within a cluster as similar as possible while keeping different clusters as far apart as possible.
Working steps of the k-means algorithm-
- Specify the number of clusters K.
- Initialize the centroids by first shuffling the dataset and then randomly selecting K data points as centroids,
without replacement.
- Repeat the following steps until the centroids no longer change, i.e. the assignment of data points to clusters
stops changing:
  - Compute the squared distance between each data point and every centroid.
  - Assign each data point to the closest cluster (centroid).
  - Recompute each centroid as the average of all the data points that belong to its cluster.
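The steps above can be sketched in plain NumPy. This is an illustrative implementation under simple assumptions (random initial points, a fixed iteration cap, synthetic data), not the library routine used for the results in this report.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means: random initial points, then assign/update until stable."""
    rng = np.random.default_rng(seed)
    # Pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared distance from every point to every centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        # Assign each point to its closest centroid.
        labels = d2.argmin(axis=1)
        # Recompute centroids; an empty cluster keeps its old centroid.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving
        centroids = new_centroids
    return labels, centroids

# Synthetic stand-in data: three 2-D blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
X = np.vstack([c + 0.3 * rng.standard_normal((20, 2)) for c in centers])

labels, centroids = kmeans(X, k=3)
print(np.bincount(labels, minlength=3))
```

Because the start is random, a single run can settle in a poor local optimum; library implementations typically rerun from several initializations and keep the best result.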

Now performing K-Means Elbow Method for K= (1 to 10)

The inertia for K= 1 to K=10.

Elbow Curve-
Insights- From the above graph, the optimal number of clusters is either 3 or 4. We will go forward with 3 clusters.
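An elbow computation of this kind can be sketched with scikit-learn's `KMeans`, whose `inertia_` attribute holds the within-cluster sum of squares. The data are synthetic blobs, and `n_init` and `random_state` are assumed settings.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the scaled data: three well-separated 2-D blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
X = np.vstack([c + 0.3 * rng.standard_normal((20, 2)) for c in centers])

# Within-cluster sum of squares (inertia) for K = 1 to 10.
inertias = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X)
    inertias.append(km.inertia_)

for k, wss in zip(range(1, 11), inertias):
    print(k, round(wss, 2))
```

Plotting `inertias` against K gives the elbow curve; the elbow is the K after which the drop in inertia becomes small.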

Creating 3 clusters using Kmeans and adding them to the original dataset.
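Fitting the 3-cluster model and attaching its labels might look like this sketch. The data are synthetic, and the column name `Clus_kmeans` is a hypothetical choice.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for the scaled data: three well-separated 2-D blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
X = np.vstack([c + 0.3 * rng.standard_normal((20, 2)) for c in centers])
df = pd.DataFrame(X, columns=["x1", "x2"])

# Fit K-means with 3 clusters and store the label of each row.
km = KMeans(n_clusters=3, n_init=10, random_state=1)
df["Clus_kmeans"] = km.fit_predict(X)
print(df["Clus_kmeans"].value_counts().sort_index())
```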

Cluster Visualization for Kmeans –

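Silhouette Method- Here we compute the silhouette coefficient for each data point. It measures how close the point
is to its own cluster compared with the nearest other cluster: s = (b - a) / max(a, b), where a is the mean distance
to the other points in its own cluster and b is the mean distance to the points in the nearest other cluster.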
Silhouette Score - 0.5340151343712788

Scores for cluster K=2 to K=10:

[0.5282573570427488,
0.5340151343712788,
0.5524561729411546,
0.5208181010553294,
0.5337141912655894,
0.5557534218887419,
0.5342932176693953,
0.5083265323516991,
0.5145381754982109]
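A list of scores like the one above can be produced by a loop over K, sketched here with scikit-learn's `silhouette_score` on synthetic blobs. The scores it prints are for the toy data only and are unrelated to the values listed above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled data: three well-separated 2-D blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
X = np.vstack([c + 0.3 * rng.standard_normal((20, 2)) for c in centers])

# Average silhouette score for K = 2 to 10 (undefined for K = 1).
ks = list(range(2, 11))
scores = []
for k in ks:
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    scores.append(silhouette_score(X, labels))

best_k = ks[int(np.argmax(scores))]
print(best_k, [round(s, 3) for s in scores])
```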

Graph plot using Silhouette Score-

Now, adding Silhouette width to the K-Mean dataset-
Silhouette width ranges from -1 to +1; values close to +1 indicate that a data point is well matched to its own cluster.
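Per-point silhouette widths can be obtained with scikit-learn's `silhouette_samples`; their mean is the overall silhouette score. The sketch below uses synthetic data and a hypothetical column name `sil_width`.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

# Synthetic stand-in for the scaled data: three well-separated 2-D blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
X = np.vstack([c + 0.3 * rng.standard_normal((20, 2)) for c in centers])
df = pd.DataFrame(X, columns=["x1", "x2"])

labels = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X)

# One silhouette width per row, stored next to the data.
df["sil_width"] = silhouette_samples(X, labels)
overall = silhouette_score(X, labels)
print(round(overall, 3), round(df["sil_width"].min(), 3))
```

A negative minimum width would flag points that sit closer to another cluster than to their own.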

Optimal Clusters-
We will now check the outputs of both the 3-cluster and the 4-cluster solutions and choose the better one.

3 Cluster Solution-

Cluster Frequency-
Cluster Profiles-

4 Cluster Solution-

Cluster Frequency-

Cluster Profiles-
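Cluster frequency and profile tables of this kind are typically built with `value_counts` and `groupby`. The sketch below uses made-up values and a hypothetical label column named `cluster`.

```python
import pandas as pd

# Made-up values with a hypothetical cluster label per row.
df = pd.DataFrame({
    "GDP": [100, 120, 500, 520, 900, 950],
    "Per_capita_income": [10, 14, 40, 44, 80, 90],
    "cluster": [1, 1, 2, 2, 3, 3],
})

# Cluster frequency: number of rows in each cluster.
freq = df["cluster"].value_counts().sort_index()

# Cluster profile: mean of every variable within each cluster.
profile = df.groupby("cluster").mean()
profile["freq"] = freq
print(profile)
```

Comparing the per-cluster means is what allows the clusters to be named, for example as low, medium and high GDP groups.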

Observations - Based on the above cluster solutions, the 3-cluster solution seems to be the best fit, as it
differentiates the 3 clusters as-
- High GDP per capita area
- Medium GDP per capita area
- Low GDP per capita area

1.5. Describe cluster profiles for the clusters defined. Recommend different
priority based actions that need to be taken for different clusters on the basis
of their vulnerability situations according to their Economic and Health
Conditions.

Our main objective was to divide the data into an optimal number of clusters.
From both hierarchical clustering and K-means clustering, we get 3 as the optimal number of clusters.

Insights from all the above clustering methods-

3 group cluster via Kmeans-

Here, Cluster 1 = Low GDP per capita

Cluster 2 = Medium GDP per capita
Cluster 3 = High GDP per capita

3 group cluster via hierarchical clustering-

Here, Cluster 1 = High GDP per capita

Cluster 2 = Low GDP per capita
Cluster 3 = Medium GDP per capita

Cluster Group Profiles-

Cluster 1: High GDP per capita Areas

- These are the areas which have the highest growth rate.
- The health and economic conditions in these areas are excellent.
- Per capita income in these areas is very high.

Cluster 2: Low GDP per capita Areas

- These are the areas which have a very low growth rate.
- The health and economic conditions are not good in these areas.
- Per capita income in these areas is very low.

Cluster 3: Medium GDP per capita Areas

- These are the areas which have an average growth rate.
- The health and economic conditions in these areas are adequate.
- Per capita income in these areas is average.

Recommendations for each cluster profile.

The main factors that affect health and economic conditions are the size of the workforce and its productivity. The
higher these are, the higher the GDP per capita and, in turn, the better the health and economic conditions.

Cluster 1: High GDP per capita Areas

- Maintaining the growth in productivity and the size of the workforce will keep the health and economic
conditions high.

Cluster 2: Low GDP per capita Areas

- Large-scale industries need to be opened in these areas.
- More employment opportunities should be created to increase the size of the workforce and thus
productivity.

These measures will help improve the health and economic conditions in these areas.

Cluster 3: Medium GDP per capita Areas

- More new businesses will help the growth and development of these areas.
- Cutting tax rates will also help these areas grow.
