Project Report Data Mining

Data Mining Research

Project Report

WATER POTABILITY
Cluster Analysis

Shreya Singh
220700
BA Programme (CA+Maths)
AGENDA

K-Means Clustering
Agglomerative Hierarchical Clustering
DBSCAN
ABOUT DATASET - WATER POTABILITY

Access to safe drinking-water is essential to health, a basic human right and a component of
effective policy for health protection. This is important as a health and development issue at a
national, regional and local level.

The dataset is a labelled and numeric dataset and has the following columns of
information :

pH value
Hardness
Chloramines
Sulfate
Conductivity
Organic Carbons
Trihalomethanes
Turbidity
Potability (label)
** We drop the label column (Potability) before performing
clustering, since clustering is unsupervised.
DATASET
PREPROCESSING
NULL VALUES
Replacing NULL values with
mean of their respective
columns.
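
A minimal sketch of this step, assuming the data is held in a pandas DataFrame. The column names and values below are illustrative stand-ins, not the actual dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the water-quality table (names and values are assumptions).
df = pd.DataFrame({
    "ph": [7.0, np.nan, 6.5, 8.1],
    "Sulfate": [330.0, 310.0, np.nan, 350.0],
})

# Replace each missing value with the mean of its own column.
df_filled = df.fillna(df.mean(numeric_only=True))
```

Each column mean is computed from the non-missing entries only, so the imputed value does not shift the column mean.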
OUTLIERS
Checking the presence of
outliers through plotting boxplot
of each column.

Almost all columns have some outliers.
OUTLIERS
Removing outliers of all columns
through Inter-Quartile Range
Method.
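
The IQR rule can be sketched as follows. The helper `remove_outliers_iqr`, the conventional multiplier of 1.5, and the toy values are assumptions for illustration; the report does not state the exact threshold used.

```python
import pandas as pd

def remove_outliers_iqr(df: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Keep only rows whose every column lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # A row survives only if it is inside the fences for all columns.
    inside = ((df >= lower) & (df <= upper)).all(axis=1)
    return df[inside]

# Illustrative data: the 900.0 hardness reading is an obvious outlier.
toy = pd.DataFrame({
    "Hardness": [190.0, 200.0, 205.0, 210.0, 900.0],
    "Turbidity": [3.8, 4.0, 4.1, 4.2, 4.3],
})
cleaned = remove_outliers_iqr(toy)
```

Because a row is dropped as soon as any single column is out of range, applying the rule to all columns at once can remove a noticeable share of the rows.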
MINMAX SCALER
Scaling the dataset between 0
and 1 through MinMaxScaler.
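
A short sketch of the scaling step with scikit-learn's `MinMaxScaler`; the input array is made-up data standing in for the cleaned dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two illustrative feature columns on very different scales.
X = np.array([[6.5, 180.0],
              [7.0, 200.0],
              [8.5, 260.0]])

# Each column is mapped to [0, 1] via (x - min) / (max - min).
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
```

Scaling matters here because K-Means, agglomerative clustering and DBSCAN all rely on distances, which would otherwise be dominated by the columns with the largest numeric range.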
K-MEANS CLUSTERING
INERTIA
Calculating inertia (the within-cluster sum of squared
distances), using all columns of the dataset, for a range
of k values in order to find the optimum value of k for
K-Means Clustering.
ELBOW METHOD
The Elbow Point gives the value
of K=2.
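
The elbow search can be sketched as below. The synthetic two-blob data, the range of k from 1 to 8, and the `random_state` values are assumptions chosen for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data with two well-separated groups (not the real dataset).
X, _ = make_blobs(n_samples=300, centers=[[0.0] * 5, [10.0] * 5], random_state=0)

# Fit K-Means for each candidate k and record the inertia
# (sum of squared distances of samples to their nearest centroid).
inertias = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_
```

Plotting `inertias` against k gives the elbow curve; the elbow is the k after which further increases yield only small reductions in inertia.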
K-MEANS
CLUSTERING
Applying K-Means Clustering with
k=2 after dropping the label
column.
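
A minimal sketch of this step, assuming a DataFrame whose label column is named `Potability`. The feature columns and data below are synthetic placeholders.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in: three feature columns plus a label column.
features, truth = make_blobs(n_samples=200, centers=2, n_features=3, random_state=1)
df = pd.DataFrame(features, columns=["ph", "Hardness", "Turbidity"])
df["Potability"] = truth

# Drop the label so that clustering sees only the features.
X = df.drop(columns=["Potability"])

# k=2 follows the elbow result reported above.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(X)
```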
PRINCIPAL
COMPONENT
ANALYSIS
Because the dataset has many
columns, we apply PCA to reduce
it to two dimensions so that the
clusters can be shown on a 2-D
scatter plot.
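
A sketch of the projection with scikit-learn's `PCA`, using synthetic nine-column data as a placeholder for the scaled dataset.

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Synthetic stand-in with nine feature columns (an assumption for illustration).
X, _ = make_blobs(n_samples=200, centers=2, n_features=9, random_state=0)

# Project onto the two directions of greatest variance.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

# Fraction of the total variance retained by the 2-D view.
explained = pca.explained_variance_ratio_.sum()
```

The two columns of `X_2d` can serve as the x and y coordinates of the scatter plot, coloured by cluster label. PCA is used only for display here; the clustering itself runs on the full feature set.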
VISUALIZING
K-MEANS CLUSTER
The dataset is divided into 2
clusters, which are shown in
the scatter plot.
AGGLOMERATIVE
HIERARCHICAL CLUSTERING
DENDROGRAM
Visualizing the dendrogram for
the dataset.
AGGLOMERATIVE
CLUSTERING
Visualizing the scatter plots for
2, 3 and 4 clusters.
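
The dendrogram and the 2-, 3- and 4-cluster solutions can be sketched as follows. Ward linkage and the synthetic data are assumptions; the report does not say which linkage was used.

```python
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Small synthetic stand-in so the tree stays readable.
X, _ = make_blobs(n_samples=60, centers=3, n_features=4, random_state=0)

# Build the merge tree bottom-up; each row of Z records one merge.
Z = linkage(X, method="ward")

# no_plot=True returns the tree layout without drawing it.
tree = dendrogram(Z, no_plot=True)

# Cut the hierarchy at 2, 3 and 4 clusters for comparison.
labels_by_k = {
    k: AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(X)
    for k in (2, 3, 4)
}
```

Dropping `no_plot=True` draws the dendrogram with matplotlib. Long vertical gaps between successive merges suggest natural places to cut the tree.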
SILHOUETTE
SCORE
Finding the optimum number of
clusters by comparing the
silhouette score for each
candidate number of clusters.
In this case, 2 is the optimum
number of clusters.
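
A hedged sketch of the silhouette comparison. The candidate range of 2 to 6 clusters and the synthetic two-blob data are assumptions for illustration.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data with two well-separated groups.
X, _ = make_blobs(n_samples=200, centers=[[0.0] * 5, [10.0] * 5], random_state=0)

# Score each candidate cluster count; the silhouette needs at least 2 clusters.
scores = {}
for k in range(2, 7):
    labels = AgglomerativeClustering(n_clusters=k).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The candidate with the highest mean silhouette is taken as the optimum.
best_k = max(scores, key=scores.get)
```

The silhouette score lies between -1 and 1; values near 1 indicate compact, well-separated clusters.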
DBSCAN CLUSTERING
NEAREST
NEIGHBOUR
minPts = 2 * dimensions
Finding the distance of each point
to its k-th nearest neighbour (k = minPts)
and visualizing the sorted distances.
KNEE LOCATOR
Locating the Knee Point through
Knee Locator in order to find the
value of eps.

Eps=0.61 here
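
The k-distance curve and a knee estimate can be sketched as below. The report uses a Knee Locator; the max-gap heuristic here is a simplified stand-in for it, and the synthetic data is an assumption, so the resulting eps will not match 0.61.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# Synthetic stand-in data with three feature columns.
X, _ = make_blobs(n_samples=300, centers=[[0, 0, 0], [8, 8, 8]],
                  cluster_std=0.5, random_state=0)

# Rule of thumb from the slide: minPts = 2 * number of dimensions.
min_pts = 2 * X.shape[1]

# Querying the training data returns each point as its own nearest
# neighbour (distance 0), matching DBSCAN, which counts the point itself.
nn = NearestNeighbors(n_neighbors=min_pts).fit(X)
distances, _ = nn.kneighbors(X)
k_distances = np.sort(distances[:, -1])

# Simplified knee estimate (stand-in for a knee locator): rescale both axes
# to [0, 1] and pick the point where the curve lies farthest below the diagonal.
x = np.linspace(0.0, 1.0, len(k_distances))
y = (k_distances - k_distances[0]) / (k_distances[-1] - k_distances[0])
knee_index = int(np.argmax(x - y))
eps = float(k_distances[knee_index])
```

Points to the right of the knee have unusually distant neighbours, so an eps near the knee separates dense regions from likely noise.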
DBSCAN CLUSTERING
Labelling the clusters and examining their composition (the number of points in each cluster and the number marked as noise).
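
A minimal sketch of the labelling step. The data is synthetic and `eps=1.0` is chosen for that toy data only; on the scaled water dataset the report's own value (eps = 0.61) would be used instead.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Synthetic stand-in data with three feature columns.
X, _ = make_blobs(n_samples=300, centers=[[0, 0, 0], [8, 8, 8]],
                  cluster_std=0.5, random_state=0)

# min_samples follows the 2 * dimensions rule of thumb.
db = DBSCAN(eps=1.0, min_samples=2 * X.shape[1]).fit(X)
labels = db.labels_

# Composition: how many points fall in each cluster; label -1 marks noise.
unique, counts = np.unique(labels, return_counts=True)
composition = dict(zip(unique.tolist(), counts.tolist()))

# Number of real clusters, excluding the noise label.
n_clusters = len(set(labels.tolist())) - (1 if -1 in labels else 0)
```

Unlike K-Means, DBSCAN is not told the number of clusters; it follows from eps and minPts, and some points may be left unassigned as noise.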
VISUALIZING DBSCAN CLUSTERING
Visualizing the clusters formed by DBSCAN after reducing the dimensions with
Principal Component Analysis.
The dataset is divided into 2 clusters, which are shown on the scatter plot.
THANK YOU!
