Chapter 1 Introduction
Raw data can be gathered in large quantities from many different disciplines, but this data is meaningless unless it is properly analysed to yield useful knowledge. In this project, we concentrate on clustering, one of the key data mining techniques.
In unsupervised learning, the goal is often to find hidden structures or clusters in the data or to uncover relationships between variables. By extracting meaningful patterns, unsupervised learning algorithms can help with tasks such as data exploration, data pre-processing, anomaly detection, and feature engineering. Clustering is among the most successful of these unsupervised learning strategies. The most widely used clustering method is K-means; however, K-means is sensitive to outliers and occasionally creates empty clusters. We propose a novel method to address these issues: a Genetic K-means algorithm, which is the main focus of this study.
1. Clustering: Clustering algorithms group similar data points together based on their characteristics or proximity in the feature space. The objective is to identify natural clusters within the data without any prior knowledge of the class labels. Common clustering algorithms include K-means, hierarchical clustering, and DBSCAN.
2. Dimensionality reduction: Dimensionality reduction techniques aim to reduce the number of variables or features in the data while preserving important information. These methods are useful when dealing with high-dimensional data, simplifying it and allowing for easier visualization and analysis. Principal Component Analysis (PCA) and t-SNE (t-Distributed Stochastic Neighbor Embedding) are popular dimensionality reduction algorithms. (A brief code sketch of both techniques follows this list.)
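As an illustration of the two families above, the following minimal Python sketch runs both on the Iris data. It assumes the scikit-learn library is available; the class names and parameters are standard scikit-learn, not code from this project.

from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

X = load_iris().data  # 150 samples, 4 features; class labels are not used

# Clustering: group the points into 3 clusters by proximity in feature space
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Dimensionality reduction: project the 4-D data to 2-D, preserving variance
X_2d = PCA(n_components=2).fit_transform(X)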
Unsupervised learning plays a crucial role in various fields, including data mining,
pattern recognition, recommender systems, and exploratory data analysis, as it helps
uncover underlying structures and insights in unlabelled datasets.
1.2.1 Clustering:
It is an unsupervised method that uses an automated procedure to group items with similar properties together [5]. Statistics professionals also refer to it as categorization. Clustering groups things based on similarities: items with similar qualities fall into one cluster, while items with different properties fall into another. Similarity is measured using the Manhattan distance or the Euclidean distance. The main goal of clustering is high intra-cluster similarity and low inter-cluster similarity, i.e., low distances within a cluster and high distances between clusters.
Figure 1.1 Inter- and intra-cluster similarities [5]
4. Hamming Distance
The Euclidean distance is the straight-line distance between two points; it is also known as the L2 norm [8]. For two points on a plane with coordinates (x1, y1) and (x2, y2), the Euclidean distance is calculated as:
d = √((x2 − x1)² + (y2 − y1)²)
Manhattan distance is also called city-block distance or the L1 norm. It is measured along axes at right angles [8]: rather than the straight-line (Pythagorean) distance, it sums the absolute horizontal and vertical differences. The Manhattan distance formula is as follows:
d = |x2 − x1| + |y2 − y1|
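Both distance measures can be written directly from the formulas above; a small sketch in plain Python (the coordinate pairs are arbitrary examples):

import math

def euclidean(p, q):
    # L2 norm: square root of the summed squared coordinate differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    # L1 norm: sum of absolute coordinate differences along right-angle axes
    return sum(abs(a - b) for a, b in zip(p, q))

print(euclidean((1, 2), (4, 6)))  # 5.0
print(manhattan((1, 2), (4, 6)))  # 7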
1.4.3 Edit Distance:
When the dataset consists of strings, Edit Distance is used to find similarities between strings. The edit distance between two strings is the minimum number of operations needed to transform one string into the other [8]. It is used in bioinformatics to determine the similarity between DNA sequences. The operations used in Edit Distance include insertion and deletion.
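A minimal sketch of edit distance using the classic dynamic-programming formulation (this is the standard Levenshtein algorithm, which also allows substitutions; it is illustrative, not this project's code):

def edit_distance(s, t):
    # d[i][j] is the minimum number of insertions, deletions, and
    # substitutions needed to turn s[:i] into t[:j]
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                           # delete all of s[:i]
    for j in range(n + 1):
        d[0][j] = j                           # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[m][n]

print(edit_distance("GATTACA", "GCATGCU"))  # 4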
Figure 1.3 Evaluation graph of the Silhouette method [8]
Gap_n(k) = (1/B) Σ_{b=1}^{B} log(W_kb) − log(W_k)
It compares the total intra-cluster variation for different values of k with the values that would be expected under the data's null reference distribution. The ideal number of clusters is the value of k that maximises the gap statistic.
1.5.3 Davies Bouldin Index:
Here M_i is the centroid of cluster C_i, T_j is the size of cluster j, and D_j is the measure of validity of the cluster. A lower Davies Bouldin index indicates better clustering.
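The score can be computed directly once clusters are formed; a minimal sketch assuming scikit-learn, whose davies_bouldin_score implements this index:

from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

X = load_iris().data
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(davies_bouldin_score(X, labels))  # lower is better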
1.5.4 Dunn Index:
J.C. Dunn introduced the Dunn index in 1974. It is a metric for evaluating clusters: it relates the separation between clusters to the compactness within them [9]. A higher Dunn index value indicates better clustering. A primary disadvantage of the Dunn index is its computational cost, which grows with the number of clusters and the dimensionality of the data.
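A hand-rolled sketch of the usual Dunn index definition (minimum inter-cluster separation divided by maximum cluster diameter); it also makes the computational cost visible, since every pair of points is compared:

import numpy as np
from itertools import combinations

def dunn_index(X, labels):
    clusters = [X[labels == k] for k in np.unique(labels)]
    # diameter: largest pairwise distance within one cluster
    diameters = [max(np.linalg.norm(a - b) for a in c for b in c) for c in clusters]
    # separation: smallest distance between points of two different clusters
    separations = [min(np.linalg.norm(a - b) for a in c1 for b in c2)
                   for c1, c2 in combinations(clusters, 2)]
    return min(separations) / max(diameters)  # higher is better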
9
1.6.1 Flowchart of standard K-means:
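The loop the flowchart depicts can be sketched in a few lines of NumPy; this is a minimal illustrative implementation of standard K-means, not the project's code (note the guard that keeps an empty cluster's old centroid):

import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]  # 1. pick k initial centroids
    for _ in range(iters):
        # 2. assign every point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. recompute each centroid as the mean of its assigned points
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):  # 4. stop when centroids are stable
            break
        centroids = new
    return labels, centroids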
1. Initialization:
In the initialization phase, the population is defined as a set of individuals, and genes are the set of parameters that define an individual. Genetic algorithms employ strings to characterise an individual's genes: the genes on a chromosome are encoded as binary values.
Figure 1.8 Genetic algorithm: chromosome and population [10]
2. Fitness:
S_max = max(F(S_INTER) / F(S_INTRA))
where S_max is the fitness function obtained by dividing the total inter-cluster distance by the total intra-cluster distance.
3. Selection: The idea behind the selection step is to favour the fittest individuals and pass their genes on to the following generation. Pairs of individuals are matched up according to their fitness scores: large fitness values increase the likelihood that an individual will be chosen for reproduction. A sketch of one common selection operator follows.
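One common way to implement this is roulette-wheel selection, where the sampling probability is proportional to fitness; a minimal sketch (the report does not specify the exact selection operator used):

import random

def select_pair(population, fitness):
    # fitter individuals are proportionally more likely to be drawn;
    # sampling is with replacement, so a strong individual can be picked twice
    return random.choices(population, weights=fitness, k=2)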
4. Crossover: The crossover stage is the core of a genetic algorithm. New offspring are produced by crossing pairs of parents: a random gene position is chosen as the crossover point, and the parents exchange their genes beyond it. Table 1.1 displays the crossing of chromosomes S1 and S3, and Table 1.2 displays the results of the crossover.
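A minimal sketch of single-point crossover in the spirit of Tables 1.1 and 1.2 (chromosomes are sequences of genes; the crossover point is chosen at random):

import random

def single_point_crossover(s1, s3):
    point = random.randint(1, len(s1) - 1)   # random crossover point
    # the parents exchange their gene tails beyond the crossover point
    return s1[:point] + s3[point:], s3[:point] + s1[point:]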
1.7.1 Flowchart of genetic algorithm:
Chapter 2 Literature Survey
2.1 Clustering
Clustering uses an automated procedure to group items with comparable attributes together [1] in an unsupervised manner. Statistics professionals also refer to it as categorization. Clustering groups things based on similarities: items with similar qualities fall into one cluster, while items with different properties fall into another. Similarity is measured using the Manhattan distance or the Euclidean distance. The main goal of clustering is high intra-cluster similarity and low inter-cluster similarity. Various clustering methods exist, including partitioning clustering, hierarchical clustering, fuzzy clustering, and density-based clustering. In this chapter, a review of the literature on clustering methods for large data analysis is offered, along with a tabular comparison of various methods.
There are various forms of partitioning clustering approaches. One of the most popular is K-means clustering, put forward by MacQueen in 1967. In K-means clustering, each data point is grouped according to its similarity to the others. The K-means strategy is susceptible to outliers, and with large datasets it sometimes creates empty clusters. In this work we provide another approach (hybrid K-means plus genetic algorithm) to alleviate these drawbacks of K-means clustering [11].
2.3 Unsupervised learning technique: Clustering
Massive amounts of data are produced by the information sector and other sources. This data is worthless until it is processed and relevant knowledge is extracted from it. The two fundamental goals of data mining that researchers examine are description and prediction. Various methods are used to extract valuable information from vast amounts of data. The most popular partitioning technique for mining huge datasets is clustering. The most effective approach for partitioning is K-means clustering; however, it can yield empty clusters with big datasets and is susceptible to outliers. Evolutionary algorithms, decision trees, and neural networks are employed as optimisation tools to address these issues with partitioning strategies [1].
Description and prediction are two fundamental goals of data mining, according to researchers. While description focuses on identifying patterns that explain the given information and presenting them for human interpretation, prediction makes use of existing variables in the database to forecast future values of interest. The relative importance of description and prediction varies with the core technique and the application. These goals may be achieved by a number of data mining approaches, including clustering, association rule mining, and classification, employing tools such as genetic algorithms, decision trees, neural networks, and machine learning [2].
S. Bandyopadhyay et al. [4] developed a genetic algorithm for clustering known as KGA that classifies the pixels of a satellite picture. The KGA clustering algorithm is used when the number of clusters is predetermined and the clusters are distinct in nature. In this paper, cluster centres are found using a genetic algorithm, and they are encoded using a floating-point representation of the chromosomes because this is a more natural and suitable form. The primary drawback of this work is that the method is applied only to datasets where K is already established.
M. Jain et al. [7] developed a K-means plus genetic algorithm for improving stock prediction. In this study, the cluster centroids are found using a genetic algorithm to improve stock forecasting and market analysis. Chi-square proximity is used to gauge accuracy, and the outcome is more accurate than K-means. The limitation of the suggested approach is that it can only be used with datasets that are represented as matrices.
C. Ordonez [9] presented two distinct ways to cluster binary data streams using the K-means clustering technique, employing incremental K-means and scalable K-means as its variants. In comparison to K-means, the suggested variants provide high-quality clusters while being less sensitive. The accuracy, confusion matrix, and error rate of the suggested incremental and scalable K-means are compared to those of the existing K-means; both provide higher accuracy than conventional K-means.
Mori et al. [11] offered a genetic algorithm strategy for clustering and contrasted the outcomes with the K-means method. In the suggested method, both intra-cluster and inter-cluster similarity measurements are used to determine fitness. The suggested method eliminates the disadvantage of local optima and achieves small intra-cluster distances as well as substantial inter-cluster distances. As K is chosen at random in this approach, the GA algorithm is unable to determine the actual value of K (the number of clusters).
K. Dharmendra et al. [12] provided an effective K-means clustering approach. The K-means clustering algorithm's primary goal is to partition the dataset into K clusters, where K is either specified or chosen at random by the analyst. The major objective of the K-means clustering procedure employed in this study is to minimise the within-cluster sum of squares.
improved when the MSE is lower. The suggested algorithm's flaw is that the clusters are not evaluated and the correct number of clusters K is not known a priori.
P. Vats et al. [17] compared different clustering methods utilising the genetic and K-means algorithms. The study applies various clustering methods, including the K-means algorithm, incremental K-means, and fuzzy C-means, to the sample Iris dataset. The code is implemented using Weka and Matlab. This study finds that fuzzy C-means performs better than the K-means method.
K. Kim et al. [22] presented K-means combined with a genetic algorithm as a recommender system for the online retail sector. In this method, termed GA K-means, the initial seed for an online commerce recommender system is optimised using a genetic algorithm. Results of the proposed algorithm are compared to those of existing algorithms, and the comparison makes clear that the suggested algorithm performs segmentation more effectively.
2.4 Conclusion
This chapter presented a survey of the literature and a comparative study of the most recent big data clustering methods. The various clustering methods for large data analysis were contrasted, and their benefits and drawbacks were discussed.
Chapter 3 System Design and Development
3.1 Problem Statement
Big data has taken the market by storm in the modern era. Designing and creating novel clustering strategies is one of several challenges in big data analysis. When analysing large data, clustering techniques are applied to form clusters of comparable items that are beneficial for the commercial sector, weather forecasting, and other areas. The clustering problem is characterised by the absence of prior knowledge about the provided dataset. Additionally, the selection of input parameters used in these methods, including the number of clusters and the number of nearest neighbours, makes the clustering problem more difficult: choosing these values incorrectly produces poor clustering results. The accuracy of these methods also suffers when the dataset comprises clusters with a variety of complicated shapes, densities, and sizes, or contains noise and outliers. Among partitioning algorithms, K-means is the most powerful. K-means performs well when the total number of clusters is known in advance, but it is sensitive to outliers and, for huge datasets, it can produce empty clusters. Using K-means in conjunction with the genetic algorithm can solve these problems. In this study, a combined genetic and K-means clustering approach is used to address the problems with K-means clustering.
3.2 Research Gap
The business and scientific environments of today generate a great deal of data. It is challenging to gather and analyse vast amounts of data, since both their volume and complexity are rising. According to the literature review, there are many methods for analysing enormous datasets; however, these methods are inefficient since they do not offer comprehensive answers. A few of them are fast but give up cluster quality, and the opposite is also true. Much effort has gone into increasing the effectiveness of K-means clustering and genetic algorithms in order to find high-quality clusters in a short amount of time, but both of these methods still have certain drawbacks.
The K-means method runs quickly, but it is sensitive to outliers, cannot handle arbitrarily shaped clusters, and forms empty clusters for big datasets. The proposed hybrid Genetic K-means method is more complex and time-consuming than K-means, but it can manage noise and arbitrarily shaped clusters.
To address the drawbacks of the current K-means method, the cuckoo search algorithm has been used; nevertheless, it cannot handle huge datasets.
In a different method, K-means is changed to incremental K-means, which produces better results for numeric datasets than K-means.
A further genetic algorithm-based method optimises K-means clustering: the genetic method is used to determine the ideal value of K, but it takes more time to run than other algorithms.
3.3 Objectives
The objectives of the project are as follows:
To study the existing K-means clustering method in conjunction with the genetic algorithm.
To eliminate the flaws in the current clustering process by introducing a more effective clustering technique.
To build and evaluate the suggested approach on the Iris dataset.
3.4 Research Methodology
Python, a language well suited to data manipulation, is utilised in our research. The datasets needed for the study are obtained from the UCI ML repository; the Iris dataset and the Wine quality dataset are employed for analysing the suggested approach. In the first step, Python is used to normalise the data, and the normalised data is then utilised for clustering. In the second step, a sample dataset is used to evaluate the plain K-means algorithm, and a confusion matrix is used to estimate its accuracy. In the third step, a sample dataset is used to evaluate the improved K-means approach (the new method), and the results are then passed to K-means to determine the accuracy of the predictions. In the final phase, the Davies Bouldin score is used to evaluate the clusters.
makes handling the data easier.
Step 3: In the third phase, the initial cluster centroids are located using a genetic method; following mutation and crossover, the result is transferred to K-means to create clusters.
Step 4: In the last phase, after cluster creation, the Davies Bouldin score is employed for cluster assessment.
F_max = max(S(D_INTER) / S(D_INTRA))
where F_max is the fitness function obtained by dividing the overall inter-cluster distance by the overall intra-cluster distance.
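A sketch of this fitness criterion in Python. The report does not reproduce the exact distance totals it uses, so this version makes a common choice: intra-cluster distance as the summed point-to-centroid distances, and inter-cluster distance as the summed distances between centroid pairs:

import numpy as np
from itertools import combinations

def fitness(X, labels):
    clusters = [X[labels == k] for k in np.unique(labels)]
    centroids = [c.mean(axis=0) for c in clusters]
    # total intra-cluster distance: each point to its own centroid
    intra = sum(np.linalg.norm(c - m, axis=1).sum()
                for c, m in zip(clusters, centroids))
    # total inter-cluster distance: between every pair of centroids
    inter = sum(np.linalg.norm(m1 - m2) for m1, m2 in combinations(centroids, 2))
    return inter / intra  # to be maximised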
Chapter 4 Experiments and Result Analysis
This part of the project report concentrates on utilising Python and Google Colab to develop the suggested hybrid Genetic K-means method. Four distinct datasets, gathered from the UCI ML repository [42], are utilised to put the suggested strategy into practice: the Iris data, wine quality data, cancer data, and sales data. The dimensionality of each of these datasets varies. The suggested strategy is implemented in Python using Google Colab.
Code of K-means Clustering:
Fig 4.1 demonstrates the K-means clustering algorithm for the Iris dataset. It takes the Iris data, illustrates K-means clustering with a K value of 3, and forecasts how accurate K-means will be. The accuracy on the Iris dataset is assessed by comparing the predicted clusters to the actual class clusters. K-means clustering yielded an accuracy of 45%.
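The listing in Fig 4.1 is not reproduced here; the experiment can be sketched as follows, assuming scikit-learn. Because cluster IDs are arbitrary, each cluster is mapped to its majority true class before the comparison (the reported 45% depends on the run's initialisation and mapping details):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score

iris = load_iris()
pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(iris.data)

# map each cluster ID to the most frequent actual class inside it
mapped = np.empty_like(pred)
for c in np.unique(pred):
    mapped[pred == c] = np.bincount(iris.target[pred == c]).argmax()
print(accuracy_score(iris.target, mapped))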
This means that the clusters contain 49, 49, and 52 members, respectively. The Update Centroid step, as given in Fig 5.1, generates a new centroid for cluster 1.
4.2.1 Confusion Matrix of K-means Clustering
The Iris dataset is used to assess the accuracy of the two algorithms, K-Means and Genetic K-Means. The confusion matrix produced by each method on the Iris dataset is shown.
Fig 4.3 displays the confusion matrix obtained from K-means clustering. Accuracy, recall, precision, sensitivity, and specificity are calculated by comparing the actual and predicted results.
Accuracy = (TP + TN) / (TP + TN + FP + FN) = (50 + 10 + 20) / (50 + 10 + 20 + 36 + 34) = 80%
Precision = TP / predicted positives = 10/46 = 21%
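A sketch of how such a confusion matrix and the derived metrics can be produced, assuming scikit-learn (the label arrays below are illustrative stand-ins for the actual and mapped predicted classes):

from sklearn.metrics import confusion_matrix, precision_score

y_true = [0, 0, 1, 1, 2, 2]   # actual classes (illustrative)
y_pred = [0, 0, 1, 2, 2, 2]   # predicted clusters mapped to classes
print(confusion_matrix(y_true, y_pred))
print(precision_score(y_true, y_pred, average=None))  # per-class precision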
4.2.2 Confusion Matrix of Genetic K-means Clustering
The Iris dataset is used to assess the accuracy of the two algorithms, K-Means and Genetic K-Means. The confusion matrix produced by each method on the Iris dataset is shown in Table 5.6.
Fig 4.4 demonstrates the confusion matrix produced by Genetic K-means clustering. By comparing the actual and anticipated outcomes, the metrics of accuracy, recall, precision, sensitivity, and specificity are computed.
Accuracy = (TP + TN) / (TP + TN + FP + FN) = (50 + 10 + 20) / (50 + 10 + 20 + 36 + 34) = 89%
Misclassification rate = (FP + FN) / Total = (0 + 0 + 0 + 34 + 0 + 36) / 150 = 10%
Precision = TP / predicted positives = 10/46 = 77%
4.2.3 Test for Performance of Accuracy
To determine the accuracy performance, the K-Means method and the suggested technique are evaluated on four datasets: the Iris dataset, the wine dataset, the cancer dataset, and the sales dataset. Accuracy review ensures that the quality of the clusters is maintained. The formula for calculating accuracy is:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
where accuracy is calculated by dividing the sum of true positives and true negatives by the total count, and TP, TN, FP, and FN denote True Positives, True Negatives, False Positives, and False Negatives, respectively.
Table 4.1 shows the accuracy obtained from the K-means and Genetic K-means algorithms. The table shows that the Genetic algorithm gives higher accuracy than the K-means algorithm.
4.2.4 Calculation of Intra-cluster Distance using K-means and the Proposed Algorithm
The distance that exists between data points in the same cluster is known as the intra-cluster distance. The goal of K-means and of the suggested algorithm is to reduce the intra-cluster distance. The following formula calculates the intra-cluster distance, where m is the number of elements, x_i and x_j are cluster data points, and D_q is the distance between the cluster's data points.
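The report's formula image is not reproduced here; one common form of D_q is the average pairwise distance between the m points of a cluster, sketched below (an illustrative definition, not necessarily the exact one used):

import numpy as np

def intra_cluster_distance(cluster):
    m = len(cluster)
    # sum the distance over every unordered pair of points in the cluster
    total = sum(np.linalg.norm(cluster[i] - cluster[j])
                for i in range(m) for j in range(i + 1, m))
    return 2 * total / (m * (m - 1))  # average over m(m-1)/2 pairs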
Tables 4.2 and 4.3 display the intra-cluster distance derived from K-means and from the suggested method.
Table 4.2: Intra-cluster distance using K-means algorithm
Table 4.2 demonstrates the intra-cluster distance table for the K-means algorithm. The distance that exists between data points in the same cluster is calculated to give the intra-cluster distance. The K-means algorithm produces a greater distance than the Genetic Algorithm.
Table 4.3 demonstrates the intra-cluster distance table for the suggested algorithm. The distance that exists between data points in the same cluster is calculated to give the intra-cluster distance. The K-means algorithm produces a greater distance than the Genetic Algorithm.
Here m and n are the numbers of elements in the q-th and r-th clusters, x_i and x_j are the elements in the clusters, and the inter-cluster distance between data points is represented by D_INTER(q, r).
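Analogously, a common form of the inter-cluster distance D_INTER(q, r) averages the distances between the m points of cluster q and the n points of cluster r; a sketch under that assumption (again illustrative, not the report's exact formula):

import numpy as np

def inter_cluster_distance(cluster_q, cluster_r):
    m, n = len(cluster_q), len(cluster_r)
    total = sum(np.linalg.norm(x_i - x_j)
                for x_i in cluster_q for x_j in cluster_r)
    return total / (m * n)  # average over all m*n cross-cluster pairs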
Table 4.4 provides the inter-cluster distance table using the K-means algorithm. The separation between data points in one cluster and another cluster is displayed in the table. The computed values are smaller than those obtained using the suggested algorithm.
Table 4.5 demonstrates the inter-cluster distance table using the suggested algorithm. The separation between data points in one cluster and another cluster is displayed in the table. The determined values exceed those of the K-means algorithm.
Chapter 5 Conclusion and Future Scope
This chapter concludes the project and closes with a perspective on its potential future course.
5.1 Conclusions
5.2 Limitations
The following restrictions still apply to the suggested strategy:
Even when the data points are dispersed, the suggested method still needs to know the value of K. However, once the Davies Bouldin score is known, the number of initial or intended clusters to use as input can be determined.
The suggested method is only applicable to datasets with numerical values or attributes.
5.3 Future Scope
The suggested method requires initial cluster values as input; in the future, the method may be improved by determining the optimum way to form the ideal number of clusters.
In the future, this method may also be applied to datasets with categorical properties and can be utilised in specific real-time application areas by resolving the challenges involved.
References
[11] D.K. Roy and L.K. Sharma, "Genetic k-Means clustering algorithm for mixed numeric and categorical data sets," International Journal of Artificial Intelligence & Applications 1.2 (2010): 23-28.
[12] K. Shahroudi and S. Biabani, "Variable selection in clustering for market segmentation using genetic algorithms," Interdisciplinary Journal of Contemporary Research in Business 3.6 (2011): 333-341.
[13] E.O. Hartono and D. Abdullah, "Determining a Cluster Centroid of K-Means Clustering Using Genetic Algorithm," International Journal of Computer Science and Software Engineering (IJCSSE) 4.6 (2015).
[14] D.X. Chang, X.D. Zhang, and C.W. Zheng, "A genetic algorithm with gene rearrangement for K-means clustering," Pattern Recognition 42.7 (2009): 1210-1222.
[15] R. Lletí, M.C. Ortiz, L.A. Sarabia, and M.S. Sánchez, "Selecting variables for k-means cluster analysis by using a genetic algorithm that optimises the silhouettes," Analytica Chimica Acta 515.1 (2004): 87-100.
[16] P. Vats, M. Mandot, and A. Gosain, "A Comparative Analysis of Various Cluster Detection Techniques for Data Mining," Electronic Systems, Signal Processing and Computing Technologies (ICESC), 2014 International Conference on. IEEE, 2014.
[17] I.B. Saida, K. Nadjet, and B. Omar, "A new algorithm for data clustering based on cuckoo search optimization," Genetic and Evolutionary Computing, pp. 55-64. Springer, 2014.
[18] A.K. Jain, M.N. Murty, and P.J. Flynn, "Data clustering: a review," ACM Computing Surveys (CSUR) 31.3 (1999): 264-323.
[19] Y. Lu et al., "FGKA: A fast genetic k-means clustering algorithm," Proceedings of the 2004 ACM Symposium on Applied Computing. ACM, 2004.
[20] A. Likas, N. Vlassis, and J.J. Verbeek, "The global k-means clustering algorithm," Pattern Recognition 36.2 (2003): 451-461.
[21] K. Kim and H. Ahn, "A recommender system using GA K-means clustering in an online shopping market," Expert Systems with Applications 34.2 (2008): 1200-1209.
[22] G.P. Babu and M.N. Murty, "A near-optimal initial seed value selection in K-means algorithm using a genetic algorithm," Pattern Recognition Letters 14.10 (1993): 763-769.
[23] D.K. Roy and L.K. Sharma, "Genetic k-Means clustering algorithm for mixed numeric and categorical data sets," International Journal of Artificial Intelligence & Applications 1.2 (2010): 23-28.
[24] M.E. Celebi, H.A. Kingravi, and P.A. Vela, "A comparative study of efficient initialization methods for the k-means clustering algorithm," Expert Systems with Applications 40.1 (2013): 200-210.