Chapter 1 Introduction
Raw data can be gathered in large quantities from many different disciplines, but this data is meaningless unless it is properly analysed to yield useful knowledge. In this project, we concentrate on clustering, one of the key data mining techniques.
In unsupervised learning, the goal is often to find hidden structures or clusters in the data or to uncover relationships between variables. By extracting meaningful patterns, unsupervised learning algorithms can help with tasks such as data exploration, data pre-processing, anomaly detection, and feature engineering. Clustering is among the most successful of these unsupervised learning strategies. The most widely used clustering method is K-means; however, K-means is sensitive to outliers and occasionally creates empty clusters. We propose a novel method to address these issues: a Genetic K-means algorithm, which is the main focus of this study.
1. Clustering: Clustering algorithms group similar data points together based on their characteristics or proximity in the feature space. The objective is to identify natural clusters within the data without any prior knowledge of the class labels. Common clustering algorithms include K-means, hierarchical clustering, and DBSCAN.
2. Dimensionality reduction: Dimensionality reduction techniques aim to reduce the number of variables or features in the data while preserving important information. These methods are useful when dealing with high-dimensional data, simplifying it and allowing for easier visualization and analysis. Principal Component Analysis (PCA) and t-SNE (t-Distributed Stochastic Neighbor Embedding) are popular dimensionality reduction algorithms. (A brief code sketch of both techniques follows this list.)
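As an illustration of the two families above, the following minimal Python sketch runs both on the Iris data. It assumes the scikit-learn library is available; the class names and parameters are standard scikit-learn, not code from this project.

from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

X = load_iris().data  # 150 samples, 4 features; class labels are not used

# Clustering: group the points into 3 clusters by proximity in feature space
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Dimensionality reduction: project the 4-D data to 2-D, preserving variance
X_2d = PCA(n_components=2).fit_transform(X)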
Unsupervised learning plays a crucial role in various fields, including data mining,
pattern recognition, recommender systems, and exploratory data analysis, as it helps
uncover underlying structures and insights in unlabelled datasets.
1.2.1 Clustering:
It is an unsupervised method that uses an automated procedure to group items with similar properties together [5]. Statistics professionals also refer to it as categorization. Clustering groups things based on similarities: items with similar qualities fall into one cluster, while items with different properties fall into another. Similarity is measured using the Manhattan distance or the Euclidean distance. The main goal of clustering is high intra-cluster similarity and low inter-cluster similarity, i.e., low distances within a cluster and high distances between clusters.
Figure 1.1 Inter- and intra-cluster similarities [5]
4. Hamming Distance
The Euclidean distance is the straight-line distance between two points; it is also known as the L2 norm [8]. For two points on a plane with coordinates (x1, y1) and (x2, y2), the Euclidean distance is calculated as:
d = √((x2 − x1)² + (y2 − y1)²)
Manhattan distance is also called city-block distance or the L1 norm. It is measured along axes at right angles [8]: rather than the straight-line (Pythagorean) distance, it sums the absolute horizontal and vertical differences. The Manhattan distance formula is as follows:
d = |x2 − x1| + |y2 − y1|
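Both distance measures can be written directly from the formulas above; a small sketch in plain Python (the coordinate pairs are arbitrary examples):

import math

def euclidean(p, q):
    # L2 norm: square root of the summed squared coordinate differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    # L1 norm: sum of absolute coordinate differences along right-angle axes
    return sum(abs(a - b) for a, b in zip(p, q))

print(euclidean((1, 2), (4, 6)))  # 5.0
print(manhattan((1, 2), (4, 6)))  # 7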
1.4.3 Edit Distance:
When the dataset consists of strings, Edit Distance is used to find similarities between strings. The edit distance between two strings is the minimum number of operations needed to transform one string into the other [8]. It is used in bioinformatics to determine the similarity between DNA sequences. The operations used in Edit Distance include insertion and deletion.
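A minimal sketch of edit distance using the classic dynamic-programming formulation (this is the standard Levenshtein algorithm, which also allows substitutions; it is illustrative, not this project's code):

def edit_distance(s, t):
    # d[i][j] is the minimum number of insertions, deletions, and
    # substitutions needed to turn s[:i] into t[:j]
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                           # delete all of s[:i]
    for j in range(n + 1):
        d[0][j] = j                           # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[m][n]

print(edit_distance("GATTACA", "GCATGCU"))  # 4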
Figure 1.3 Evaluation graph of the Silhouette method [8]
Gap_n(k) = (1/B) Σ_{b=1}^{B} log(W_kb) − log(W_k)
It compares the total intra-cluster variation for different values of k with the values that would be expected under the data's null reference distribution. The ideal number of clusters is the value of k that maximises the gap statistic.
1.5.3 Davies Bouldin Index:
Here M_i is the centroid of cluster C_i, T_j is the size of cluster j, and D_j is the measure of validity of the cluster. A lower Davies Bouldin index indicates better clustering.
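The score can be computed directly once clusters are formed; a minimal sketch assuming scikit-learn, whose davies_bouldin_score implements this index:

from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

X = load_iris().data
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(davies_bouldin_score(X, labels))  # lower is better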
1.5.4 Dunn Index:
J.C. Dunn introduced the Dunn index in 1974. It is a metric for evaluating clusters: it relates the separation between clusters to the compactness within them [9]. A higher Dunn index value indicates better clustering. A primary disadvantage of the Dunn index is its computational cost, which grows with the number of clusters and the dimensionality of the data.
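A hand-rolled sketch of the usual Dunn index definition (minimum inter-cluster separation divided by maximum cluster diameter); it also makes the computational cost visible, since every pair of points is compared:

import numpy as np
from itertools import combinations

def dunn_index(X, labels):
    clusters = [X[labels == k] for k in np.unique(labels)]
    # diameter: largest pairwise distance within one cluster
    diameters = [max(np.linalg.norm(a - b) for a in c for b in c) for c in clusters]
    # separation: smallest distance between points of two different clusters
    separations = [min(np.linalg.norm(a - b) for a in c1 for b in c2)
                   for c1, c2 in combinations(clusters, 2)]
    return min(separations) / max(diameters)  # higher is better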
9
1.6.1 Flowchart of standard K-means:
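The loop the flowchart depicts can be sketched in a few lines of NumPy; this is a minimal illustrative implementation of standard K-means, not the project's code (note the guard that keeps an empty cluster's old centroid):

import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]  # 1. pick k initial centroids
    for _ in range(iters):
        # 2. assign every point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. recompute each centroid as the mean of its assigned points
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):  # 4. stop when centroids are stable
            break
        centroids = new
    return labels, centroids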
1. Initialization:
In the initialization phase, the population is defined as a set of individuals, and genes are the set of parameters that define an individual. Genetic algorithms employ strings to characterise an individual's genes: the genes on a chromosome are encoded as binary values.
Figure 1.8 Genetic algorithm: chromosome and population [10]
2. Fitness:
S_max = max(F(S_INTER) / F(S_INTRA))
where S_max is the fitness function obtained by dividing the total inter-cluster distance by the total intra-cluster distance.
3. Selection: The idea behind the selection step is to favour the fittest individuals and pass their genes on to the following generation. Pairs of individuals are matched up according to their fitness scores: large fitness values increase the likelihood that an individual will be chosen for reproduction. A sketch of one common selection operator follows.
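One common way to implement this is roulette-wheel selection, where the sampling probability is proportional to fitness; a minimal sketch (the report does not specify the exact selection operator used):

import random

def select_pair(population, fitness):
    # fitter individuals are proportionally more likely to be drawn;
    # sampling is with replacement, so a strong individual can be picked twice
    return random.choices(population, weights=fitness, k=2)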
4. Crossover: The crossover stage is the core of a genetic algorithm. New offspring are produced by crossing pairs of parents: a random gene position is chosen as the crossover point, and the parents exchange their genes beyond it. Table 1.1 displays the crossing of chromosomes S1 and S3, and Table 1.2 displays the results of the crossover.
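A minimal sketch of single-point crossover in the spirit of Tables 1.1 and 1.2 (chromosomes are sequences of genes; the crossover point is chosen at random):

import random

def single_point_crossover(s1, s3):
    point = random.randint(1, len(s1) - 1)   # random crossover point
    # the parents exchange their gene tails beyond the crossover point
    return s1[:point] + s3[point:], s3[:point] + s1[point:]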
1.7.1 Flowchart of genetic algorithm:
Chapter 2 Literature Survey
2.1 Clustering
Clustering uses an automated procedure to group items with comparable attributes together [1] in an unsupervised manner. Statistics professionals also refer to it as categorization. Clustering groups things based on similarities: items with similar qualities fall into one cluster, while items with different properties fall into another. Similarity is measured using the Manhattan distance or the Euclidean distance. The main goal of clustering is high intra-cluster similarity and low inter-cluster similarity. Various clustering methods exist, including partitioning clustering, hierarchical clustering, fuzzy clustering, and density-based clustering. In this chapter, a review of the literature on clustering methods for large data analysis is offered, along with a tabular comparison of various methods.
There are various forms of partitioning clustering approaches. One of the most popular is K-means clustering, put forward by MacQueen in 1967. In K-means clustering, each data point is grouped according to its similarity to the others. The K-means strategy is susceptible to outliers, and with large datasets it sometimes creates empty clusters. In this work we provide another approach (hybrid K-means plus genetic algorithm) to alleviate these drawbacks of K-means clustering [11].
2.3 Unsupervised learning technique: Clustering
Massive amounts of data are produced by the information sector and other sources. This data is worthless until it is processed and relevant knowledge is extracted from it. The two fundamental goals of data mining that researchers examine are description and prediction. Various methods are used to extract valuable information from vast amounts of data. The most popular partitioning technique for mining huge datasets is clustering. The most effective approach for partitioning is K-means clustering; however, it can yield empty clusters with big datasets and is susceptible to outliers. Evolutionary algorithms, decision trees, and neural networks are employed as optimisation tools to address these issues with partitioning strategies [1].
Description and prediction are two fundamental goals of data mining, according to researchers. While description focuses on identifying patterns that explain the given information and presenting them for human interpretation, prediction makes use of existing variables in the database to forecast future values of interest. The relative importance of description and prediction varies with the core technique and the application. These goals may be achieved by a number of data mining approaches, including clustering, association rule mining, and classification, employing tools such as genetic algorithms, decision trees, neural networks, and machine learning [2].
S. Bandyopadhyay et al. [4] developed a genetic algorithm for clustering known as KGA that classifies the pixels of a satellite picture. The KGA clustering algorithm is used when the number of clusters is predetermined and the clusters are distinct in nature. In this paper, cluster centres are found using a genetic algorithm, and they are encoded using a floating-point representation of the chromosomes because this is a more natural and suitable form. The primary drawback of this work is that the method is applied only to datasets where K is already established.
M. Jain et al. [7] developed a K-means plus genetic algorithm for improving stock prediction. In this study, the cluster centroids are found using a genetic algorithm to improve stock forecasting and market analysis. Chi-square proximity is used to gauge accuracy, and the outcome is more accurate than K-means. The limitation of the suggested approach is that it can only be used with datasets that are represented as matrices.
C. Ordonez [9] presented two distinct ways to cluster binary data streams using the K-means clustering technique, employing incremental K-means and scalable K-means as its variants. In comparison to K-means, the suggested variants provide high-quality clusters while being less sensitive. The accuracy, confusion matrix, and error rate of the suggested incremental and scalable K-means are compared to those of the existing K-means; both provide higher accuracy than conventional K-means.
Mori et al. [11] offered a genetic algorithm strategy for clustering and contrasted the outcomes with the K-means method. In the suggested method, both intra-cluster and inter-cluster similarity measurements are used to determine fitness. The suggested method eliminates the disadvantage of local optima and achieves small intra-cluster distances as well as substantial inter-cluster distances. As K is chosen at random in this approach, the GA algorithm is unable to determine the actual value of K (the number of clusters).
K. Dharmendra et al. [12] provided an effective K-means clustering approach. The K-means clustering algorithm's primary goal is to partition the dataset into K clusters, where K is either specified or chosen at random by the analyst. The major objective of the K-means clustering procedure employed in this study is to minimise the within-cluster sum of squares.
improved when the MSE is lower. The suggested algorithm's flaw is that the clusters are not evaluated and the correct number of clusters K is not known a priori.
P. Vats et al. [17] compared different clustering methods utilising the genetic and K-means algorithms. The study applies various clustering methods, including the K-means algorithm, incremental K-means, and fuzzy C-means, to the sample Iris dataset. The code is implemented using Weka and Matlab. This study finds that fuzzy C-means performs better than the K-means method.
K. Kim et al. [22] presented K-means combined with a genetic algorithm as a recommender system for the online retail sector. In this method, termed GA K-means, the initial seed for an online commerce recommender system is optimised using a genetic algorithm. Results of the proposed algorithm are compared to those of existing algorithms, and the comparison makes clear that the suggested algorithm performs segmentation more effectively.
2.4 Conclusion
This chapter presented a survey of the literature and a comparative study of the most recent big data clustering methods. The various clustering methods for large data analysis were contrasted, and their benefits and drawbacks were discussed.
Chapter 3 System Design and Development
3.1 Problem Statement
Big data has taken the market by storm in the modern era. Designing and creating novel clustering strategies is one of several challenges in big data analysis. When analysing large data, clustering techniques are applied to form clusters of comparable items that are beneficial for the commercial sector, weather forecasting, and other areas. The clustering problem is characterised by the absence of prior knowledge about the provided dataset. Additionally, the selection of input parameters used in these methods, including the number of clusters and the number of nearest neighbours, makes the clustering problem more difficult: choosing these values incorrectly produces poor clustering results. The accuracy of these methods also suffers when the dataset comprises clusters with a variety of complicated shapes, densities, and sizes, or contains noise and outliers. Among partitioning algorithms, K-means is the most powerful. K-means performs well when the total number of clusters is known in advance, but it is sensitive to outliers and, for huge datasets, it can produce empty clusters. Using K-means in conjunction with the genetic algorithm can solve these problems. In this study, a combined genetic and K-means clustering approach is used to address the problems with K-means clustering.
3.2 Research Gap
The business and scientific environments of today generate a great deal of data. It is challenging to gather and analyse vast amounts of data, since both their volume and complexity are rising. According to the literature review, there are many methods for analysing enormous datasets; however, these methods are inefficient since they do not offer comprehensive answers. A few of them are fast but give up cluster quality, and the opposite is also true. Much effort has gone into increasing the effectiveness of K-means clustering and genetic algorithms in order to find high-quality clusters in a short amount of time, but both of these methods still have certain drawbacks.
The K-means method runs quickly, but it is sensitive to outliers, cannot handle arbitrarily shaped clusters, and forms empty clusters for big datasets. The proposed hybrid Genetic K-means method is more complex and time-consuming than K-means, but it can manage noise and arbitrarily shaped clusters.
To address the drawbacks of the current K-means method, the cuckoo search algorithm has been used; nevertheless, it cannot handle huge datasets.
In a different method, K-means is changed to incremental K-means, which produces better results for numeric datasets than K-means.
A further genetic algorithm-based method optimises K-means clustering: the genetic method is used to determine the ideal value of K, but it takes more time to run than other algorithms.
3.3 Objectives
The objectives of the project are as follows:
To study the existing K-means clustering method in conjunction with the genetic algorithm.
To eliminate the flaws in the current clustering process by introducing a more effective clustering technique.
To build and evaluate the suggested approach on the Iris dataset.
3.4 Research Methodology
Python, a language well suited to data manipulation, is utilised in our research. The datasets needed for the study are obtained from the UCI ML repository; the Iris dataset and the Wine quality dataset are employed for analysing the suggested approach. In the first step, Python is used to normalise the data, and the normalised data is then utilised for clustering. In the second step, a sample dataset is used to evaluate the plain K-means algorithm, and a confusion matrix is used to estimate its accuracy. In the third step, a sample dataset is used to evaluate the improved K-means approach (the new method), and the results are then passed to K-means to determine the accuracy of the predictions. In the final phase, the Davies Bouldin score is used to evaluate the clusters.
makes handling the data easier.
Step 3: In the third phase, the initial cluster centroids are located using a genetic method; following mutation and crossover, the result is transferred to K-means to create clusters.
Step 4: In the last phase, after cluster creation, the Davies Bouldin score is employed for cluster assessment.
F_max = max(S(D_INTER) / S(D_INTRA))
where F_max is the fitness function obtained by dividing the overall inter-cluster distance by the overall intra-cluster distance.
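A sketch of this fitness criterion in Python. The report does not reproduce the exact distance totals it uses, so this version makes a common choice: intra-cluster distance as the summed point-to-centroid distances, and inter-cluster distance as the summed distances between centroid pairs:

import numpy as np
from itertools import combinations

def fitness(X, labels):
    clusters = [X[labels == k] for k in np.unique(labels)]
    centroids = [c.mean(axis=0) for c in clusters]
    # total intra-cluster distance: each point to its own centroid
    intra = sum(np.linalg.norm(c - m, axis=1).sum()
                for c, m in zip(clusters, centroids))
    # total inter-cluster distance: between every pair of centroids
    inter = sum(np.linalg.norm(m1 - m2) for m1, m2 in combinations(centroids, 2))
    return inter / intra  # to be maximised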
Chapter 4 Experiments and Result Analysis
This part of the project report concentrates on utilising Python and Google Colab to develop the suggested hybrid Genetic K-means method. Four distinct datasets, gathered from the UCI ML repository [42], are utilised to put the suggested strategy into practice: the Iris data, wine quality data, cancer data, and sales data. The dimensionality of each of these datasets varies. The suggested strategy is implemented in Python using Google Colab.
Code of K-means Clustering:
Fig 4.1 demonstrates the K-means clustering algorithm for the Iris dataset. It takes the Iris data, illustrates K-means clustering with a K value of 3, and forecasts how accurate K-means will be. The accuracy on the Iris dataset is assessed by comparing the predicted clusters to the actual class clusters. K-means clustering yielded an accuracy of 45%.
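The listing in Fig 4.1 is not reproduced here; the experiment can be sketched as follows, assuming scikit-learn. Because cluster IDs are arbitrary, each cluster is mapped to its majority true class before the comparison (the reported 45% depends on the run's initialisation and mapping details):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score

iris = load_iris()
pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(iris.data)

# map each cluster ID to the most frequent actual class inside it
mapped = np.empty_like(pred)
for c in np.unique(pred):
    mapped[pred == c] = np.bincount(iris.target[pred == c]).argmax()
print(accuracy_score(iris.target, mapped))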
This means that the clusters contain 49, 49, and 52 members, respectively. The Update Centroid step, as given in Fig 5.1, generates a new centroid for cluster 1.
4.2.1 Confusion Matrix of K-means Clustering
The Iris dataset is used to assess the accuracy of the two algorithms, K-Means and Genetic K-Means. The confusion matrix produced by each method on the Iris dataset is shown.
Fig 4.3 displays the confusion matrix obtained from K-means clustering. Accuracy, recall, precision, sensitivity, and specificity are calculated by comparing the actual and predicted results.
Accuracy = (TP + TN) / (TP + TN + FP + FN) = (50 + 10 + 20) / (50 + 10 + 20 + 36 + 34) = 80%
Precision = TP / predicted positives = 10/46 = 21%
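A sketch of how such a confusion matrix and the derived metrics can be produced, assuming scikit-learn (the label arrays below are illustrative stand-ins for the actual and mapped predicted classes):

from sklearn.metrics import confusion_matrix, precision_score

y_true = [0, 0, 1, 1, 2, 2]   # actual classes (illustrative)
y_pred = [0, 0, 1, 2, 2, 2]   # predicted clusters mapped to classes
print(confusion_matrix(y_true, y_pred))
print(precision_score(y_true, y_pred, average=None))  # per-class precision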
4.2.2 Confusion Matrix of Genetic K-means Clustering
The Iris dataset is used to assess the accuracy of the two algorithms, K-Means and Genetic K-Means. The confusion matrix produced by each method on the Iris dataset is shown in Table 5.6.
Fig 4.4 demonstrates the confusion matrix produced by Genetic K-means clustering. By comparing the actual and anticipated outcomes, the metrics of accuracy, recall, precision, sensitivity, and specificity are computed.
Accuracy = (TP + TN) / (TP + TN + FP + FN) = (50 + 10 + 20) / (50 + 10 + 20 + 36 + 34) = 89%
Misclassification rate = (FP + FN) / Total = (0 + 0 + 0 + 34 + 0 + 36) / 150 = 10%
Precision = TP / predicted positives = 10/46 = 77%
4.2.3 Test for Performance of Accuracy
To determine the accuracy performance, the K-Means method and the suggested technique are evaluated on four datasets: the Iris dataset, the wine dataset, the cancer dataset, and the sales dataset. Accuracy review ensures that the quality of the clusters is maintained. The formula for calculating accuracy is:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
where accuracy is calculated by dividing the sum of true positives and true negatives by the total count, and TP, TN, FP, and FN denote True Positives, True Negatives, False Positives, and False Negatives, respectively.
Table 4.1 shows the accuracy obtained from the K-means and Genetic K-means algorithms. The table shows that the Genetic algorithm gives higher accuracy than the K-means algorithm.
4.2.4 Calculation of Intra-cluster Distance using K-means and the Proposed Algorithm
The distance that exists between data points in the same cluster is known as the intra-cluster distance. The goal of K-means and of the suggested algorithm is to reduce the intra-cluster distance. The following formula calculates the intra-cluster distance, where m is the number of elements, x_i and x_j are cluster data points, and D_q is the distance between the cluster's data points.
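The report's formula image is not reproduced here; one common form of D_q is the average pairwise distance between the m points of a cluster, sketched below (an illustrative definition, not necessarily the exact one used):

import numpy as np

def intra_cluster_distance(cluster):
    m = len(cluster)
    # sum the distance over every unordered pair of points in the cluster
    total = sum(np.linalg.norm(cluster[i] - cluster[j])
                for i in range(m) for j in range(i + 1, m))
    return 2 * total / (m * (m - 1))  # average over m(m-1)/2 pairs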
Tables 4.2 and 4.3 display the intra-cluster distance derived from K-means and from the suggested method.
Table 4.2: Intra-cluster distance using K-means algorithm
Table 4.2 demonstrates the intra-cluster distance table for the K-means algorithm. The distance that exists between data points in the same cluster is calculated to give the intra-cluster distance. The K-means algorithm produces a greater distance than the Genetic Algorithm.
Table 4.3 demonstrates the intra-cluster distance table for the suggested algorithm. The distance that exists between data points in the same cluster is calculated to give the intra-cluster distance. The K-means algorithm produces a greater distance than the Genetic Algorithm.
Here m and n are the numbers of elements in the q-th and r-th clusters, x_i and x_j are the elements in the clusters, and the inter-cluster distance between data points is represented by D_INTER(q, r).
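Analogously, a common form of the inter-cluster distance D_INTER(q, r) averages the distances between the m points of cluster q and the n points of cluster r; a sketch under that assumption (again illustrative, not the report's exact formula):

import numpy as np

def inter_cluster_distance(cluster_q, cluster_r):
    m, n = len(cluster_q), len(cluster_r)
    total = sum(np.linalg.norm(x_i - x_j)
                for x_i in cluster_q for x_j in cluster_r)
    return total / (m * n)  # average over all m*n cross-cluster pairs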
Table 4.4 provides the inter-cluster distance table using the K-means algorithm. The separation between data points in one cluster and another cluster is displayed in the table. The computed values are smaller than those obtained using the suggested algorithm.
Table 4.5 demonstrates the inter-cluster distance table using the suggested algorithm. The separation between data points in one cluster and another cluster is displayed in the table. The determined values exceed those of the K-means algorithm.
Chapter 5 Conclusion and Future Scope
This chapter concludes the project and closes with a perspective on its potential future course.
5.1 Conclusions
5.2 Limitations
The following restrictions still apply to the suggested strategy:
Even when the data points are dispersed, the suggested method still needs to know the value of K. However, once the Davies Bouldin score is known, the number of initial or intended clusters to use as input can be determined.
The suggested method is only applicable to datasets with numerical values or attributes.
5.3 Future Scope
The suggested method requires initial cluster values as input; in the future, the method may be improved by determining the optimum way to form the ideal number of clusters.
In the future, this method may also be applied to datasets with categorical properties and can be utilised in specific real-time application areas by resolving the challenges involved.
References
[11] D.K. Roy and L.K. Sharma, "Genetic k-Means clustering algorithm for mixed numeric and categorical data sets," International Journal of Artificial Intelligence & Applications 1.2 (2010): 23-28.
[12] K. Shahroudi and S. Biabani, "Variable selection in clustering for market segmentation using genetic algorithms," Interdisciplinary Journal of Contemporary Research in Business 3.6 (2011): 333-341.
[13] E.O. Hartono and D. Abdullah, "Determining a Cluster Centroid of K-Means Clustering Using Genetic Algorithm," International Journal of Computer Science and Software Engineering (IJCSSE) 4.6 (2015).
[14] D.X. Chang, X.D. Zhang, and C.W. Zheng, "A genetic algorithm with gene rearrangement for K-means clustering," Pattern Recognition 42.7 (2009): 1210-1222.
[15] R. Lletí, M.C. Ortiz, L.A. Sarabia, and M.S. Sánchez, "Selecting variables for k-means cluster analysis by using a genetic algorithm that optimises the silhouettes," Analytica Chimica Acta 515.1 (2004): 87-100.
[16] P. Vats, M. Mandot, and A. Gosain, "A Comparative Analysis of Various Cluster Detection Techniques for Data Mining," Electronic Systems, Signal Processing and Computing Technologies (ICESC), 2014 International Conference on. IEEE, 2014.
[17] I.B. Saida, K. Nadjet, and B. Omar, "A new algorithm for data clustering based on cuckoo search optimization," Genetic and Evolutionary Computing, pp. 55-64. Springer, 2014.
[18] A.K. Jain, M.N. Murty, and P.J. Flynn, "Data clustering: a review," ACM Computing Surveys (CSUR) 31.3 (1999): 264-323.
[19] Y. Lu et al., "FGKA: A fast genetic k-means clustering algorithm," Proceedings of the 2004 ACM Symposium on Applied Computing. ACM, 2004.
[20] A. Likas, N. Vlassis, and J.J. Verbeek, "The global k-means clustering algorithm," Pattern Recognition 36.2 (2003): 451-461.
[21] K. Kim and H. Ahn, "A recommender system using GA K-means clustering in an online shopping market," Expert Systems with Applications 34.2 (2008): 1200-1209.
[22] G.P. Babu and M.N. Murty, "A near-optimal initial seed value selection in K-means algorithm using a genetic algorithm," Pattern Recognition Letters 14.10 (1993): 763-769.
[23] D.K. Roy and L.K. Sharma, "Genetic k-Means clustering algorithm for mixed numeric and categorical data sets," International Journal of Artificial Intelligence & Applications 1.2 (2010): 23-28.
[24] M.E. Celebi, H.A. Kingravi, and P.A. Vela, "A comparative study of efficient initialization methods for the k-means clustering algorithm," Expert Systems with Applications 40.1 (2013): 200-210.