Chapter 1 Introduction
Raw data may be gathered in large quantities from many different disciplines, but this
data is meaningless unless it is properly processed to yield useful knowledge. In this
project, we concentrate on clustering, one of the key data mining techniques.
In unsupervised learning, the goal is often to find hidden structures or clusters in the data
or to uncover relationships between variables. By extracting meaningful patterns,
unsupervised learning algorithms can help with tasks such as data exploration, data pre-
processing, anomaly detection, and feature engineering. Clustering is one of the most
widely used unsupervised learning approaches. The most popular clustering method is
K-means. However, K-means is sensitive to outliers and occasionally creates empty
clusters. We propose a novel method, a Genetic K-means algorithm, to address these
issues. In this study, the Genetic K-means clustering approach is the main focus.
1. Clustering: Clustering algorithms group similar data points together based on their
characteristics or proximity in the feature space. The objective is to identify
natural clusters within the data without any prior knowledge of the class labels.
Common clustering algorithms include k-means, hierarchical clustering, and
DBSCAN.
2. Dimensionality reduction: Dimensionality reduction techniques aim to reduce the
number of variables or features in the data while preserving important
information. These methods are useful when dealing with high-dimensional data
by simplifying it and allowing for easier visualization and analysis. Principal
Component Analysis (PCA) and t-SNE (t-Distributed Stochastic Neighbor
Embedding) are popular dimensionality reduction algorithms.
Unsupervised learning plays a crucial role in various fields, including data mining,
pattern recognition, recommender systems, and exploratory data analysis, as it helps
uncover underlying structures and insights in unlabelled datasets.
1.2.1 Clustering:
Clustering is an unsupervised method that automatically groups items with similar
properties together [5]; statisticians also refer to it as classification. Objects are
grouped based on similarity: items with similar properties fall into one cluster and
items with dissimilar properties into another. Similarity is measured using the
Manhattan distance or the Euclidean distance. The main goal of clustering is high
intra-cluster similarity (low intra-cluster distance) and low inter-cluster similarity
(high inter-cluster distance).
1.3 Types of Clustering
Partitioning clustering
Hierarchical clustering
Fuzzy clustering
Density Based clustering
Model based clustering
1.3.2 Hierarchical Clustering:
Hierarchical clustering is also known as HCA (hierarchical cluster analysis). It builds
clusters by successively grouping related items, and the number of clusters k does not
need to be pre-specified [7]. The outcome of hierarchical clustering is a tree-based
structure (a dendrogram). It is primarily of two types:
Agglomerative Clustering
Divisive Clustering
The following distances are used to determine how similar two clusters are:
1. Euclidean Distance
2. Manhattan Distance
3. Edit Distance
4. Hamming Distance
The Euclidean distance is the straight-line distance between two points; it is also
known as the L2 norm [8]. For two points $x = (x_1, \ldots, x_n)$ and
$y = (y_1, \ldots, y_n)$, the formula to calculate the Euclidean distance is:
$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
The Manhattan distance, also called city block distance or the L1 norm, is measured
along axes at right angles [8]. It is the sum of the absolute differences of the
coordinates. The Manhattan distance formula is as follows:
$$d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$$
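To make the two measures concrete, here is a minimal Python sketch (plain functions, no particular library assumed):

```python
import math

def euclidean_distance(p, q):
    # Straight-line (L2) distance between points p and q.
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def manhattan_distance(p, q):
    # City-block (L1) distance: sum of absolute coordinate differences.
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

print(euclidean_distance((1, 2), (4, 6)))  # 5.0
print(manhattan_distance((1, 2), (4, 6)))  # 7
```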
1.5 Techniques to find the optimum number of Clusters
1.5.1 Elbow Method:
The Elbow method helps find the right number of clusters and aids in interpreting and
validating consistency within clusters. This approach examines the percentage of
variance explained as a function of k [8]. The value of k should be set so that the
addition of more clusters does not significantly improve the modelling of the data.
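A minimal sketch of the Elbow approach using scikit-learn (assuming the Iris data used later in this report); the inertia_ attribute is K-means' within-cluster sum of squares:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data
# The "elbow" is the k beyond which adding clusters barely reduces inertia.
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 2))
```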
Figure 1.2 Evaluation Graph of Elbow Method (x-axis: Number of Clusters K)
1.5.2 Silhouette Method:
The Silhouette method measures how well each point fits its assigned cluster. The
silhouette value is
$$T(k) = \frac{n(k) - x(k)}{\max\{n(k),\, x(k)\}}$$
where, in the standard silhouette formulation, x(k) is the mean distance from a point
to the other points of its own cluster and n(k) is its mean distance to the points of
the nearest neighbouring cluster.
Figure 1.3 Evaluation Graph of Silhouette Method [8] (x-axis: Number of Clusters K)
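A corresponding sketch for the Silhouette method, using scikit-learn's silhouette_score (Iris data assumed); the k that maximises the mean score is preferred:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

X = load_iris().data
# Mean silhouette coefficient per k; values near 1 indicate well-separated clusters.
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
```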
1.5.3 Gap Statistic:
$$\mathrm{Gap}_n(k) = \frac{1}{B} \sum_{b=1}^{B} \log(W_{kb}) - \log(W_k)$$
It contrasts the total within intra-cluster variation for different values of k with the
values that would be expected under the data's null reference distribution. The ideal
number of clusters is the value of k that maximises the gap statistic.
1.5.4 Davies Bouldin Index:
The Davies-Bouldin index evaluates the quality of clusters; in its standard form it
averages, over all clusters, the worst-case ratio of within-cluster scatter to
between-centroid separation:
$$DB = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \frac{D_i + D_j}{d(M_i, M_j)}$$
where
$M_i$ - the centroid of cluster $C_i$
$T_j$ - the size of the cluster
$D_j$ - the measure of validity (scatter) of the cluster
A lower Davies-Bouldin index indicates better clustering.
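The index is available directly in scikit-learn; a minimal sketch (Iris data assumed):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import davies_bouldin_score

X = load_iris().data
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
# Lower values indicate compact, well-separated clusters.
print(davies_bouldin_score(X, labels))
```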
1.6.1 Flowchart of standard K-means:
The limitations of the K-Means algorithm are as follows:
It requires advance knowledge of the data, and a domain expert is needed to
choose the requisite value of k.
The final result is strongly affected by the initial selection of centroids and by outliers.
Reordering the data yields a different result for the same dataset.
Only convex-shaped clusters are formed.
Consider, for example, the dataset {2, 3, 4, 10, 11, 12, 20, 25, 30} with K = 2.
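A minimal sketch of running K-means on this small one-dimensional dataset (scikit-learn assumed; the centre values in the comment are what one would expect, not guaranteed output):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([2, 3, 4, 10, 11, 12, 20, 25, 30], dtype=float).reshape(-1, 1)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster index assigned to each point
print(km.cluster_centers_)  # e.g. one centre near 7, the other near 25
```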
1.7 Genetic Algorithm:
The genetic algorithm is a technique inspired by biological evolution and natural
selection for tackling constrained and unconstrained optimization problems [10]. It
repeatedly modifies a population of individual solutions: at every step, the genetic
algorithm randomly chooses individuals from the current population to be parents, which
are used to create the offspring for the next generation. Genetic algorithms may be
used to tackle many different optimization challenges. The steps are as follows:
1. Initialization of Population
2. Finding Fitness value
3. Selection of chromosomes
4. Crossover
5. Mutation
1. Initialization:
In the initialization step, a population is defined as a set of individuals. An
individual is defined by a set of variables known as genes. In a genetic algorithm,
strings are used to describe an individual's set of genes, and binary values are
commonly used to encode the genes in a chromosome.
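A minimal sketch of such a binary initialization; the population size and chromosome length are illustrative, chosen to match the six 10-bit chromosomes used in the examples below:

```python
import random

POP_SIZE = 6    # number of chromosomes in the population (illustrative)
CHROM_LEN = 10  # number of binary genes per chromosome (illustrative)

def init_population():
    # Each individual is a string of binary genes, as described above.
    return [[random.randint(0, 1) for _ in range(CHROM_LEN)]
            for _ in range(POP_SIZE)]

for chrom in init_population():
    print(chrom, "F =", sum(chrom))  # MaxOne-style fitness: count of ones
```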
Figure 1.7 Genetic Algorithm Chromosome and Population [10]
2. Fitness Function: The fitness function measures the ability of an individual to
compete with the other individuals. It assigns a fitness score to each individual; the
probability that an individual is selected for reproduction depends on its fitness
value. The formula for the fitness value depends on the type of problem. For
clustering, the most commonly used fitness function is
$$S_{max} = \max\left(\frac{F(S_{INTER})}{F(S_{INTRA})}\right)$$
where Smax is the fitness function obtained by dividing the total inter-cluster
distance by the total intra-cluster distance.
3. Selection: The idea of the selection phase is to prefer the fittest individuals and
let their genes move onward to the next generation. Pairs of individuals are selected
according to their fitness scores; individuals with large fitness values have a greater
chance of being selected for reproduction.
Table 1.1 Chromosomes S1 and S3 before crossover:
S1 1 1 1 1 0 1 0 1 0 1 F=7
S3 1 1 1 0 1 1 0 1 0 1 F=7
Table 1.2 Result of crossover on S1 and S3:
S1 1 1 1 0 1 1 0 1 0 1 F=7
S3 1 1 1 1 0 1 0 1 0 1 F=7
Tables 1.1 and 1.2 show the crossover operation and its result for chromosomes S1 and
S3. The crossover operator is applied to produce new offspring.
Table 1.3 Chromosomes S1 and S3 before mutation:
S1 1 1 1 0 1 1 0 1 0 1 F=7
S3 1 1 1 1 0 1 0 1 0 1 F=7
Table 1.4 Result of mutation on S1 and S3:
S1 1 1 1 0 1 1 0 1 0 1 F=7
S3 1 1 1 1 0 1 0 1 0 1 F=7
Tables 1.3 and 1.4 show the mutation of chromosomes S1 and S3. In mutation, bits are
flipped with a low probability, which can increase the fitness of a particular
chromosome.
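A minimal sketch of the two operators used in these tables, two-point crossover and bit-flip mutation (the cut points and flips are random, so outputs vary between runs):

```python
import random

def two_point_crossover(p1, p2):
    # Swap the gene segment between two random cut points.
    a, b = sorted(random.sample(range(1, len(p1)), 2))
    return p1[:a] + p2[a:b] + p1[b:], p2[:a] + p1[a:b] + p2[b:]

def bit_flip_mutation(chrom, pm=0.1):
    # Flip each bit independently with a small probability pm.
    return [1 - g if random.random() < pm else g for g in chrom]

S1 = [1, 1, 1, 1, 0, 1, 0, 1, 0, 1]  # F = 7
S3 = [1, 1, 1, 0, 1, 1, 0, 1, 0, 1]  # F = 7
c1, c2 = two_point_crossover(S1, S3)
print(c1, c2)
print(bit_flip_mutation(c1))
```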
1.7.1 Flow chart of genetic algorithm:
Chapter 2 Literature Survey
2.1 Clustering
Clustering is an unsupervised technique that automatically creates clusters of objects
having similar features [1]. It is also termed classification by statisticians. In
clustering, grouping is done based on the similarity of objects: objects with similar
properties are grouped into one cluster and objects with dissimilar properties into
another. Similarity is evaluated using the Euclidean distance or the Manhattan
distance. The main aim of clustering is to obtain high intra-cluster similarity and low
inter-cluster similarity. The different types of clustering techniques include
partitioning clustering, hierarchical clustering, fuzzy clustering and density-based
clustering. This chapter presents a literature survey of clustering techniques for
analyzing big data, along with a tabular comparison of these techniques.
There are distinct kinds of partitioning clustering techniques. The most popular
partitioning clustering strategy is K-means clustering, presented by MacQueen in 1967.
In K-means clustering, every data point is grouped based on similarity. The K-means
strategy is sensitive to outliers, and in some cases it forms empty clusters with large
datasets. We propose another method, a hybrid K-means with genetic algorithm, to
remove these drawbacks of K-means clustering [11].
2.3 Unsupervised learning technique: Clustering
A huge amount of data is generated by the information industry and other repositories.
This data is useless until it is processed and useful knowledge is extracted from it.
Researchers recognize two elementary objectives of data mining: description and
prediction. Different techniques are used for extracting useful information from large
amounts of data. Clustering is the most commonly used partitioning method for mining
large datasets. Among partitioning methods, K-means clustering is the most efficient,
but it is sensitive to outliers and sometimes produces empty clusters with large
datasets. Clustering using optimization techniques such as genetic algorithms, decision
trees and neural networks removes these problems of partitioning methods [1].
S. Bandyopadhyay et al. [4] designed a genetic clustering algorithm, called GKA, that
classifies the pixels of a satellite image. This GKA clustering algorithm is applied
when the number of clusters is known a priori and the clusters are crisp in nature. In
this paper a genetic algorithm is used to search for cluster centers. A floating-point
representation of chromosomes is used because it is a more natural and appropriate form
for encoding the cluster centers. The major drawback of this paper is that the
algorithm is applied only to datasets whose k is already known.
M. Jain et al. [7] proposed a K-means with genetic algorithm for enhancing stock
prediction. In this paper they enhance stock prediction and market analysis using
K-means with a genetic algorithm for finding the cluster centroids. Chi-square
similarity is used for determining the accuracy, and the result obtained has higher
accuracy than K-means. The drawback of the proposed algorithm is that it applies only
to matrix-represented datasets.
C. Ordonez [9] presented two variants of the K-means clustering algorithm to cluster
binary data streams: incremental K-means and scalable K-means. The proposed variants
give high-quality clusters and are less sensitive than standard K-means. The proposed
incremental and scalable K-means are compared with the existing K-means in terms of
accuracy, confusion matrix and error rate, and both give higher accuracy than the
existing K-means.
Mor et al. [11] proposed a genetic algorithm approach for clustering and compared the
results with the K-means algorithm. In this approach the fitness is calculated on the
basis of intra-cluster and inter-cluster similarity measures. The proposed algorithm
achieves low intra-cluster distance and high inter-cluster distance and also removes
the drawback of getting trapped in local optima. Its drawback is that the GA does not
find the value of K (the number of clusters); K is chosen randomly in this algorithm.
K. Dharmendra et al. [12] presented an efficient K-means clustering algorithm. The main
objective of the K-means clustering algorithm is to divide the dataset into K clusters,
where K is predefined or selected by the analyst. The main aim of the K-means
clustering used in this paper is to minimize the within-cluster sum of squares.
E. O. Hartono et al. [14] proposed an algorithm for determining cluster centroids using
a genetic algorithm. Determining the initial cluster centroids with a genetic algorithm
provides better results than random initialization. The fitness value is calculated
using the MSE (mean square error): fitness is 1 divided by the MSE, so the lower the
MSE, the higher the performance. The drawback of the proposed algorithm is that the
evaluation of clusters is not done and the value of K is not known a priori.
P. Vats et al. [17] discussed a comparative analysis of various clustering techniques
using genetic and K-means algorithms. They use the sample Iris dataset to perform
different clustering techniques, i.e. the K-means algorithm, incremental K-means and
Fuzzy C-means. The code is implemented using MATLAB and Weka. The paper finds that
Fuzzy C-means gives better results compared to the K-means algorithm.
K. Kim et al. [22] proposed a recommender system using a genetic algorithm and K-means
for the online shopping market. In this technique, called GA K-means, the initial seed
is optimized by a genetic algorithm for an online shopping recommender system. The
results are compared with existing algorithms and show that the proposed algorithm
improves segmentation performance.
2.4 Conclusion
In this chapter a literature review and comparative analysis of recent techniques for
clustering big data was presented. The various clustering techniques for analyzing big
data were compared and their merits and demerits presented.
Chapter 3 System Design and Development
Both the K-means and hybrid Genetic K-means techniques are attractive in terms of
accuracy and computation time, but there are some shortcomings in both these
techniques. The K-means algorithm executes fast, but it cannot handle arbitrarily
shaped (non-convex) clusters, is sensitive to outliers, and can form empty clusters on
large datasets. The hybrid Genetic K-means algorithm can handle noise as well as
arbitrarily shaped clusters, but it takes more computation time and is more complex
than K-Means.
The K-means algorithm has also been combined with the cuckoo search algorithm to remove
the shortcomings of the existing K-means algorithm, but that approach does not handle
large datasets. In another technique, K-means is modified to incremental K-means,
which generates better results than K-means for numeric datasets.
3.3 Objectives
The objectives of the project are as follows:
To study the existing K-means clustering technique in combination with the Genetic
algorithm.
To introduce an efficient clustering technique that removes the drawbacks of the
existing clustering algorithm.
To implement and validate the proposed technique on the Iris dataset.
3.5 Proposed Hybrid Technique
The proposed method is a hybrid technique based on K-Means and the Genetic Algorithm
that combines the benefits of both [20]. The genetic algorithm is used to determine the
initial cluster centroids; after applying crossover and mutation, the result is passed
to K-Means for clustering. Finally, after clustering, the Davies-Bouldin index is used
for the evaluation of the clusters.
Fig 3.1 shows the procedure of the proposed hybrid technique.
Step 1: Collect the data from the UCI machine learning repository [42] for the
formation of clusters.
Step 2: Normalize the data (done here using Python IDLE), as normalized data is easier
to handle.
Step 3: Apply the genetic algorithm to find the initial cluster centroids; after
crossover and mutation, pass the result to K-means for the formation of clusters.
Step 4: After the formation of clusters, use the Davies-Bouldin index for the
evaluation of the clusters.
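Before the step-by-step details, a compact sketch of the whole pipeline may help. It is only an illustration: it assumes scikit-learn and the Iris data, uses made-up GA parameters, and simplifies selection to rank-based truncation rather than the roulette wheel described below:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import davies_bouldin_score
from sklearn.preprocessing import MinMaxScaler

K, POP_SIZE, GENERATIONS = 3, 4, 20                   # illustrative GA parameters

X = MinMaxScaler().fit_transform(load_iris().data)    # Step 2: normalization

def fitness(centroids):
    # Ratio of total inter- to total intra-cluster distance (higher is fitter).
    labels = np.argmin(((X[:, None, :] - centroids) ** 2).sum(axis=2), axis=1)
    intra = sum(np.linalg.norm(X[labels == i] - centroids[i], axis=1).sum()
                for i in range(K)) + 1e-9
    inter = sum(np.linalg.norm(centroids[i] - centroids[j])
                for i in range(K) for j in range(i + 1, K))
    return inter / intra

rng = np.random.default_rng(0)
# Step 3a: each chromosome encodes K candidate centroids drawn from the data.
population = [X[rng.choice(len(X), K, replace=False)] for _ in range(POP_SIZE)]
for _ in range(GENERATIONS):
    population.sort(key=fitness, reverse=True)
    child = 0.6 * population[0] + 0.4 * population[1]          # arithmetic crossover
    child[rng.integers(K)] += rng.normal(0, 0.05, X.shape[1])  # mutation tweak
    population[-1] = child                                     # replace the weakest

# Step 3b: the fittest centroids seed K-means, which refines the clusters.
best = max(population, key=fitness)
labels = KMeans(n_clusters=K, init=best, n_init=1).fit_predict(X)
print("Davies-Bouldin index:", davies_bouldin_score(X, labels))  # Step 4
```

The GA supplies only the initial centroids; the final partition is still produced by K-means, and the Davies-Bouldin index judges the result.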
Step 2: Fitness function
Calculate the fitness of each chromosome based on the fitness function; the fitness
function depends mainly on the problem. The main aim of clustering is low intra-cluster
distance and high inter-cluster distance, so the most commonly used fitness function is:
$$F_{max} = \max\left(\frac{S(D_{INTER})}{S(D_{INTRA})}\right)$$
where Fmax is the fitness function obtained by dividing the total inter-cluster
distance by the total intra-cluster distance.
Step 3: Selection Phase
The selection of chromosomes is done based on the fitness value. The chromosomes are
selected using the roulette wheel selection method or rank-based selection; chromosomes
with high fitness values have a high probability of being selected first.
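A minimal sketch of roulette wheel selection, where the chance of picking a chromosome is proportional to its share of the total fitness (the example chromosomes are taken from the MaxOne walkthrough later in this chapter):

```python
import random

def roulette_wheel_select(population, fitnesses):
    # Selection probability is proportional to each individual's fitness share.
    pick = random.uniform(0, sum(fitnesses))
    running = 0.0
    for chrom, fit in zip(population, fitnesses):
        running += fit
        if running >= pick:
            return chrom
    return population[-1]  # numerical safety net

S1 = [1, 1, 1, 1, 0, 1, 0, 1, 0, 1]  # F = 7
S2 = [0, 1, 1, 1, 0, 0, 0, 1, 0, 1]  # F = 5
parent = roulette_wheel_select([S1, S2], [7, 5])  # S1 wins ~7/12 of the time
print(parent)
```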
Step 4: Crossover
Crossover is applied to chromosomes to produce new offspring. Pairs of the selected
chromosomes are paired at random to produce new offspring; crossover is simply the
reproduction of new chromosomes.
Step 5: Mutation
Mutation is used to maintain the genetic diversity of the population. After crossover,
mutation is applied as a random tweak to a particular chromosome. Methods for mutation
include:
Bit Manipulation
Random Resetting
Swap Mutation
Scramble Mutation
Inversion Mutation
Figure 3.2 Process of Genetic Algorithm [21]
Fig 3.2 shows the process of the Genetic algorithm. First, chromosomes are evaluated
using the fitness function and selected based on roulette wheel selection. After
selection, crossover is applied, and lastly the result of crossover is passed to
mutation to produce new offspring.
Table 3.1 Initial population of the MaxOne problem:
S1 1 1 1 1 0 1 0 1 0 1 F=7
S2 0 1 1 1 0 0 0 1 0 1 F=5
S3 1 1 1 0 1 1 0 1 0 1 F=7
S4 0 1 0 0 0 1 0 0 1 1 F=4
S5 1 1 1 0 1 1 1 1 0 1 F=8
S6 0 1 0 0 1 1 0 1 0 0 F=3
Table 3.1 shows the initial chromosomes of the MaxOne problem and the fitness value of
each chromosome. The fitness value is the number of ones occurring in a particular
chromosome. After calculating the fitness values, the chromosomes are arranged
according to fitness and selection is done using roulette wheel selection. The total
number of ones is 34 out of 60 bits. Our aim is to increase the number of ones, so we
use the genetic algorithm to optimize the MaxOne problem.
Table 3.2 Arrangement of chromosomes based on fitness value:
S1 1 1 1 1 0 1 0 1 0 1 F=7
S3 1 1 1 0 1 1 0 1 0 1 F=7
S5 1 1 1 0 1 1 1 1 0 1 F=8
S2 0 1 1 1 0 0 0 1 0 1 F=5
S4 0 1 0 0 0 1 0 0 1 1 F=4
S6 0 1 0 0 1 1 0 1 0 0 F=3
Table 3.2 represents the arrangement of chromosomes based on fitness value. We randomly
select chromosomes using roulette wheel selection for crossover and mutation.
2.2 Randomly select chromosomes for crossover and mutation to produce new offspring.
Table 3.3 Chromosomes S1 and S3 selected for crossover:
S1 1 1 1 1 0 1 0 1 0 1 F=7
S3 1 1 1 0 1 1 0 1 0 1 F=7
Table 3.3 represents the chromosomes S1 and S3 selected for crossover. Two-point
crossover is performed on S1 and S3 to increase the number of ones.
Table 3.4 Crossover result of S1 and S3:
S1 1 1 1 0 1 1 0 1 0 1 F=7
S3 1 1 1 1 0 1 0 1 0 1 F=7
Table 3.4 represents the crossover result of S1 and S3; after applying crossover, new
offspring are obtained.
Table 3.5 represents the chromosomes S2 and S4 selected for crossover. Two-point
crossover is applied to increase the number of ones.
S4 0 1 1 1 0 0 0 1 0 1 F=5
Table 3.6 represents the crossover result of S2 and S4; after applying crossover on S2
and S4, the number of ones increases.
S6 0 1 0 0 1 1 0 1 0 0 F=3
Table 3.7 represents the chromosomes S5 and S6 selected for crossover. Two-point
crossover is applied to increase the number of ones.
S6 0 1 1 0 1 1 1 1 0 F=8
Table 3.8 represents the crossover result of S5 and S6; after applying crossover on S5
and S6, the number of ones increases.
Step 3: Perform mutation on the crossover results to obtain a better result.
3.1 Mutation changes particular bits of the chromosomes to increase the number of ones.
Table 3.9 Mutation result of chromosomes
S1 1 1 1 1 0 1 0 1 0 1 F=8
S3 1 1 1 0 1 1 0 1 0 1 F=8
S5 1 1 1 0 1 1 1 1 0 1 F=5
S2 0 1 1 1 0 0 0 1 0 1 F=5
S4 0 1 0 0 0 1 0 0 1 1 F=5
S6 0 1 0 0 1 1 0 1 0 0 F=8
Table 3.9 shows the result of mutation on chromosomes S1 to S6. After applying mutation
on S1 to S6, the total number of ones increases from 34 to 39.
Conclusion: The optimization algorithm gives better results when the number of
iterations is larger, so we increase the number of iterations to get better results.
This algorithm focuses on the random initial seed selection problem of K-means
clustering. The genetic algorithm is used to select optimal initial centroids for
better results. The steps of Genetic K-means (GAKM) are as follows:
Fig 3.3 shows the flowchart of the proposed Genetic K-means clustering. First, the
population is initialized in the form of strings and chromosomes are selected based on
roulette wheel selection. After selection, two-point crossover is applied to produce
new offspring, and the result is passed to mutation. After mutation, the final
chromosomes are passed to K-means to perform clustering. After clustering, the clusters
are evaluated using the Davies-Bouldin index; a lower Davies-Bouldin index indicates
better clustering.
The distance measure used is the Euclidean distance:
$$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$$
where p and q are points of the dataset and d is the distance between the data points.
Formula for finding the intra-cluster distance:
$$D_q = \frac{1}{m(m-1)} \sum_{\substack{x_i, x_j \in C_q \\ i \neq j}} d(x_i, x_j)$$
Formula for finding the inter-cluster distance:
$$D_{q.r} = \frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} d(x_i, x_j)$$
where $D_{q.r}$ is the distance between the neighbouring clusters, $x_i$ and $x_j$ are
the points of the clusters, and m and n are the numbers of data points in the
neighbouring clusters.
$$F_{max} = \max\left(\frac{S(D_{INTER})}{S(D_{INTRA})}\right)$$
where Fmax is the fitness function obtained by dividing the total inter-cluster
distance by the total intra-cluster distance.
v) Crossover Operator:
After selection using rank-based selection, the next step is to produce offspring; the
mainly used operation is crossover. Crossover operators are applied to maintain genetic
diversity, which is a crucial ingredient of the process of evolution. Different types
of crossover operator are used depending on the requirement, such as single-point
crossover, two-point crossover and multiple-point crossover; single-point crossover
gives better results for integer and real datasets. The crossover used here is the
arithmetic crossover with rate α:
Offspring 1 = (α × parent 1) + ((1 − α) × parent 2)
vi) Mutation:
Mutation is the genetic operator used to preserve genetic diversity from one generation
to the next. There are different mutation operators, such as the bitwise, uniform,
non-uniform and Gaussian operators. The uniform mutation operator is used for real and
integer datasets as it gives better results. Mutation is applied to obtain better
accuracy; the number of mutated genes is (Pm × pop size × u), where Pm is the mutation
probability and u is the chromosome length.
The clusters are evaluated using the Davies-Bouldin index; a lower value indicates that
the clustering is good:
$$DB = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \frac{S_i + S_j}{d(A_i, A_j)}$$
where $A_i$ is the centroid of cluster $C_i$, $T_i$ is the size of the cluster and
$S_i$ is the measure of validity (scatter) of the cluster.
Step 1: The sample Iris dataset is used for performing the Genetic K-means algorithm.
The Iris dataset consists of three species (Setosa, Versicolor and Virginica) and four
dimensions: sepal length, sepal width, petal length and petal width.
Table 3.10 shows the Iris data used for Genetic K-means clustering; the subset used
here has three dimensions: sepal length, sepal width and petal length. The genetic
algorithm is applied to the Iris dataset to find accuracy, precision, recall and
sensitivity, and the Davies-Bouldin index is used for the evaluation of the clusters.
The first 15 rows of the Iris dataset are used to perform Genetic K-means clustering.
Table 3.11 represents the normalized Iris dataset obtained by applying normalization to
the unnormalized dataset. The normalized dataset gives accurate results and is easier
to handle.
Step 3: After normalization of dataset:
Assume pop size = 4 and k = 3.
The number of rows selected is X = (pop size × k), i.e. 4 × 3 = 12 rows are selected
from the dataset. Let Y be the 12 indices returned from the dataset:
Y = [4, 12, 14, 7, 9, 1, 2, 3, 5]
The rows corresponding to the first three indices represent the first chromosome, and
the first index represents the first centroid.
Table 3.12 Selected row indices and chromosomes
Column No. | Selected row indices | Chromosome (three centroids)
1 | {4, 12, 14} | {0.083, 0.45, 0.084}, {0.138, 0.583, 0.101}, {0, 0.416, 0.169}
2 | {7, 9, 1} | {0.083, 0.583, 0.0677}, {0.0277, 0.375, 0.0677}, {0.2222, 0.625, 0.0677}
3 | {2, 3, 5} | {0.1666, 0.41666, 0.0677}, {0.1111, 0.5, 0.050}, {0.19444, 0.666, 0.0677}
4 | {13, 11, 15} | {0.1388, 0.41666, 0.0677}, {0.3055, 0.7083, 0.084}, {0.41666, 0.8333, 0.0338}
Table 3.12 shows the selected row indices and chromosomes of the dataset. In column
No. 1, the 1st centroid (C1) {0.083, 0.45, 0.084} represents the 1st cluster, the 2nd
centroid (C2) {0.138, 0.583, 0.101} represents the 2nd cluster, and the 3rd centroid
(C3) {0, 0.416, 0.169} represents the 3rd cluster.
Step 5: Calculate the Euclidean distance of point 1 {0.2222, 0.625, 0.0677} of the
dataset from cluster 1 {0.083, 0.45, 0.084}:
Distance from cluster 1 to object 1 = 0.0163
Distance from cluster 2 to object 1 = 1.50
Distance from cluster 3 to object 1 = 1.66
Table 3.13 represents the calculated distances from cluster 1, cluster 2 and cluster 3,
along with the assignment of clusters.
Step 6: After calculating all the distances, the cluster assignment of the 15 points is:
1 1 1 1 1 1 1 1 1 1 3 2 3 3 2
The selected chromosomes are crossed over to produce new offspring. The crossover rate
α is taken as 0.6, and the offspring produced are given below.
Generation of offspring:
Offspring 1 = (α × parent 1) + ((1 − α) × parent 2)
Offspring 1: 0.02055, 0.1530, 0.1770, 0.1754, 0.1694, 0.20000, 0.01374, 0.0131, 0.0230
Offspring 2: 0.02877, 0.2235, 0.2655, 0.2493, 0.2306, 0.2867, 0.0137, 0.0212, 0.0212
Offspring 1 and offspring 2 are obtained by applying crossover to selected row indices
2 and 3.
Parent 3 is selected from the table of selected row indices and chromosomes to maintain
genetic diversity. Mutation is simply a random tweak to a particular chromosome; here
it is applied to the 5th gene of parent 3, changing its value from 0.9726 to 0.7982.
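A minimal sketch of this arithmetic crossover and single-gene mutation (the parent values below are placeholders, not the actual rows of Table 3.12):

```python
def arithmetic_crossover(parent1, parent2, alpha=0.6):
    # Blend two real-valued parent chromosomes into two offspring.
    off1 = [alpha * a + (1 - alpha) * b for a, b in zip(parent1, parent2)]
    off2 = [alpha * b + (1 - alpha) * a for a, b in zip(parent1, parent2)]
    return off1, off2

def mutate(chromosome, index, new_value):
    # Random tweak: overwrite a single gene of the chromosome.
    mutated = list(chromosome)
    mutated[index] = new_value
    return mutated

p1 = [0.083, 0.583, 0.0677, 0.0277, 0.9726]  # placeholder parent values
p2 = [0.1666, 0.41666, 0.0677, 0.1111, 0.5]
o1, o2 = arithmetic_crossover(p1, p2, alpha=0.6)
print(o1, o2)
print(mutate(p1, 4, 0.7982))  # cf. the 5th-gene tweak described above
```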
Chapter 4 Experiments and Result Analysis
This chapter of the project report focuses on the implementation of the proposed hybrid
Genetic K-means algorithm using Python and MATLAB.
The Iris dataset, collected from the UCI Machine Learning Repository [42], is used to
implement the proposed approach. The approach is implemented in Python using Google
Colab.
I. Code of K-means Clustering:
Fig 4.1 shows the code of K-means clustering on the Iris dataset with K = 3. The code
predicts clusters, compares them with the actual class labels, and evaluates the
accuracy; the accuracy obtained from K-means clustering is 45%.
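The code figure itself is not reproduced here; a minimal sketch consistent with the description (scikit-learn K-means with k = 3 on Iris, accuracy measured against the true species) might look like this:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(iris.data)

# Cluster IDs are arbitrary, so map each cluster to its majority true class
# before comparing predicted clusters against the actual species labels.
mapped = np.zeros_like(labels)
for c in range(3):
    mask = labels == c
    mapped[mask] = np.bincount(iris.target[mask]).argmax()
print("accuracy:", (mapped == iris.target).mean())
```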
Centroid Update Results:
The centroid update shown in the figure generates a new centroid for cluster 1.
The experiments performed on these algorithms use several parameters: the confusion
matrix, the Davies-Bouldin index value, accuracy, recall, precision, sensitivity and
specificity.
Both algorithms, K-Means and Genetic K-means, are tested on the Iris dataset to
calculate accuracy. Fig 4.3 shows the confusion matrix obtained by K-means on the Iris
dataset.
                Predicted Class
                  0    1    2
Actual Class  0  50    0    0
              1   0   10   34
              2   0   36   20
Fig 4.3 shows the confusion matrix obtained from K-means clustering. Accuracy, recall,
precision, sensitivity and specificity are calculated by comparing the actual and
predicted results.
4.2.2 Confusion Matrix of Genetic K-means Clustering
Both algorithms, K-Means and Genetic K-means, are tested on the Iris dataset to
calculate accuracy. Fig 4.4 shows the confusion matrix obtained by the Genetic K-means
algorithm on the Iris dataset.
                Predicted Class
                  0    1    2
Actual Class  0  50    0    0
              1   0   48    2
              2   0   14   36
Fig 4.4 shows the confusion matrix obtained from Genetic K-means clustering. Accuracy,
recall, precision, sensitivity and specificity are calculated by comparing the actual
and predicted results.
Precision (class 1): TP / total predicted as class 1 = 48/62 = 77%
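These metrics follow directly from the matrix; a short sketch of the arithmetic:

```python
import numpy as np

# Confusion matrix from Fig 4.4 (rows: actual class, columns: predicted class).
cm = np.array([[50,  0,  0],
               [ 0, 48,  2],
               [ 0, 14, 36]])

print("accuracy:", np.trace(cm) / cm.sum())  # (50 + 48 + 36) / 150
for c in range(3):
    tp = cm[c, c]
    print(f"class {c}: precision = {tp / cm[:, c].sum():.2f}, "
          f"recall = {tp / cm[c, :].sum():.2f}")  # class 1 precision: 48/62 ≈ 0.77
```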
4.2.3 Test for Performance of Accuracy
Both the K-Means algorithm and the proposed approach are tested on the Iris dataset to
calculate the accuracy performance. Accuracy evaluation is needed to maintain the
quality of the clusters. The formula for calculating accuracy is:
$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$
where accuracy is obtained by dividing the sum of TP (true positives) and TN (true
negatives) by the total of TP, TN, FP (false positives) and FN (false negatives).
The accuracy obtained from K-means and Genetic K-means shows that the proposed
algorithm gives better accuracy than K-means clustering. The accuracy obtained from all
four datasets using K-means and Genetic K-means is shown in Table 4.1.
Table 4.1 shows the accuracy obtained from the K-means and Genetic K-means algorithms;
the Genetic K-means algorithm gives higher accuracy than the K-means algorithm.
4.2.4 Calculation of intra-cluster distance using K-means and the proposed algorithm
The intra-cluster distance is the distance between data points in the same cluster. The
objective of both K-means and the proposed algorithm is to minimize the intra-cluster
distance. The formula for finding the intra-cluster distance is:
$$D_q = \frac{1}{m(m-1)} \sum_{\substack{x_i, x_j \in C_q \\ i \neq j}} d(x_i, x_j)$$
where m is the number of elements, $x_i$ and $x_j$ are data points of the same cluster
and $D_q$ is the (average) distance between data points in the same cluster.
The intra-cluster distances obtained from K-means and the proposed algorithm are shown
in Table 4.2 and Table 4.3.
Table 4.2: Intra-cluster distance using the K-means algorithm
Cluster | Intra-cluster distance
1 | 5.927
2 | 1.0245
3 | 1.086
Table 4.2 shows the intra-cluster distance using the K-means algorithm, obtained by
calculating the distance between data points in the same cluster. The distances
obtained by the K-means algorithm are larger than those of the Genetic algorithm.
Table 4.3: Intra-cluster distance using the proposed algorithm
Cluster | Intra-cluster distance
2 | 0.0284
3 | 0.0861
Table 4.3 shows the intra-cluster distance using the proposed algorithm, obtained by
calculating the distance between data points in the same cluster. The distances
obtained by the proposed algorithm are smaller than those of the K-means algorithm.
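A minimal sketch of both measures, matching the averaged forms of the formulas in this section and the next (the two-dimensional points are illustrative):

```python
import numpy as np

def intra_cluster_distance(points):
    # Average pairwise distance within one cluster (to be minimized).
    m = len(points)
    if m < 2:
        return 0.0
    total = sum(np.linalg.norm(points[i] - points[j])
                for i in range(m) for j in range(i + 1, m))
    return 2 * total / (m * (m - 1))

def inter_cluster_distance(cluster_q, cluster_r):
    # Average distance between points of two different clusters (to be maximized).
    total = sum(np.linalg.norm(p - q) for p in cluster_q for q in cluster_r)
    return total / (len(cluster_q) * len(cluster_r))

a = np.array([[0.1, 0.2], [0.2, 0.3]])
b = np.array([[0.8, 0.9], [0.9, 0.7]])
print(intra_cluster_distance(a), inter_cluster_distance(a, b))
```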
4.2.5 Calculation of inter-cluster distance using K-means and the proposed algorithm
The inter-cluster distance is the distance between data points of different clusters.
The main objective of both K-means and the proposed algorithm is to maximize the
inter-cluster distance. The inter-cluster distances obtained from K-means and the
proposed algorithm are shown in the tables below.
Formula for finding the inter-cluster distance:
$$D_{q.r} = \frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} d(x_i, x_j)$$
where m and n are the numbers of elements in the $q$-th and $r$-th clusters, $x_i$ and
$x_j$ are the elements of the clusters, and $D_{q.r}$ is the inter-cluster distance.
The total inter-cluster distance $D_{INTER}$ is the sum of $D_{q.r}$ over all pairs of
clusters.
Table 4.4: Inter-cluster distance using the K-means algorithm
Cluster | Inter-cluster distance
1 | 1.3585
2 | 1.3203
3 | 0.3323
Table 4.4 shows the inter-cluster distance using the K-means algorithm, i.e. the
distance between data points of one cluster and another. The calculated values are
smaller than those of the proposed algorithm.
Table 4.5: Inter-cluster distance using the proposed algorithm
Cluster | Inter-cluster distance
2 | 1.4505
3 | 0.5823
Table 4.5 shows the inter-cluster distance using the proposed algorithm, i.e. the
distance between data points of one cluster and another. The calculated values are
higher than those of the K-means algorithm.
Chapter 5 Conclusion and Future Scope
This chapter discusses the conclusions of the work done in the project and ends with a
clear vision of the future directions that can be taken.
5.1 Conclusion
This project introduces big data and provides background on the various clustering
techniques used to analyze it, together with a comparative analysis of these
techniques. A hybrid approach based on Genetic K-Means for effective grouping of large
datasets is proposed and developed in Python. The experimental results gathered show
that the proposed approach is more accurate than K-Means when tested on the dataset.
5.2 Limitations
The proposed approach still has some of the following limitations.
The proposed approach still requires the value of K as input; the Davies-Bouldin
index is used to evaluate the resulting clusters rather than to determine the number
of clusters automatically.
The proposed approach can be applied only to datasets that have numerical values or
attributes.
Since the initial number of clusters is needed as input, in future the approach can
be enhanced by finding the optimal number of clusters automatically, so that the
desired number of clusters is formed without user input.
In future, this approach can also be applied to a particular real-time application
area by addressing the issues involved, and to datasets with categorical attributes.
References
ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge
Discovery, pp. 12-19, 2003.
[9] "Genetic Algorithm," [Online]. Available: https://towardsdatascience.com/introduction-to-genetic-algorithms-including-example-code-e396e98d8bf3 [Accessed: 13 Jan 2016].
[10] M. Mor, P. Gupta and P. Sharma, "A Genetic Algorithm Approach for Clustering," International Journal of Engineering & Computer Science, vol. 3, no. 6, 2014.
[11] K. Dharmendra and K. Sharma, "Genetic k-Means clustering algorithm for mixed numeric and categorical data sets," International Journal of Artificial Intelligence & Applications, vol. 1, no. 2, pp. 23-28, 2010.
[12] K. Shahroudi and S. Biabani, "Variable selection in clustering for market segmentation using genetic algorithms," Interdisciplinary Journal of Contemporary Research in Business, vol. 3, no. 6, pp. 333-341, 2011.
[13] E. O. Hartono and D. Abdullah, "Determining a Cluster Centroid of K-Means Clustering Using Genetic Algorithm," International Journal of Computer Science and Software Engineering (IJCSSE), vol. 4, no. 6, 2015.
[14] D. X. Chang, X. D. Zhang and C. W. Zheng, "A genetic algorithm with gene rearrangement for K-means clustering," Pattern Recognition, vol. 42, no. 7, pp. 1210-1222, 2009.
[15] R. Lletí, M. C. Ortiz, L. A. Sarabia and M. S. Sánchez, "Selecting variables for k-means cluster analysis by using a genetic algorithm that optimises the silhouettes," Analytica Chimica Acta, vol. 515, no. 1, pp. 87-100, 2004.
[16] P. Vats, M. Mandot and A. Gosain, "A Comparative Analysis of Various Cluster Detection Techniques for Data Mining," in Proc. International Conference on Electronic Systems, Signal Processing and Computing Technologies (ICESC), IEEE, 2014.
[17] I. B. Saida, K. Nadjet and B. Omar, "A new algorithm for data clustering based on cuckoo search optimization," in Genetic and Evolutionary Computing, Springer, pp. 55-64, 2014.
[18] A. K. Jain, M. N. Murty and P. J. Flynn, "Data clustering: a review," ACM Computing Surveys (CSUR), vol. 31, no. 3, pp. 264-323, 1999.
[19] Y. Lu et al., "FGKA: A fast genetic k-means clustering algorithm," in Proc. 2004 ACM Symposium on Applied Computing, ACM, 2004.
[20] A. Likas, N. Vlassis and J. J. Verbeek, "The global k-means clustering algorithm," Pattern Recognition, vol. 36, no. 2, pp. 451-461, 2003.
[21] K. Kim and H. Ahn, "A recommender system using GA K-means clustering in an online shopping market," Expert Systems with Applications, vol. 34, no. 2, pp. 1200-1209, 2008.
[22] G. P. Babu and M. N. Murty, "A near-optimal initial seed value selection in k-means algorithm using a genetic algorithm," Pattern Recognition Letters, vol. 14, no. 10, pp. 763-769, 1993.
[23] D. K. Roy and L. K. Sharma, "Genetic k-Means clustering algorithm for mixed numeric and categorical data sets," International Journal of Artificial Intelligence & Applications, vol. 1, no. 2, pp. 23-28, 2010.
[24] M. E. Celebi, H. A. Kingravi and P. A. Vela, "A comparative study of efficient initialization methods for the k-means clustering algorithm," Expert Systems with Applications, vol. 40, no. 1, pp. 200-210, 2013.