
Imperial Journal of Interdisciplinary Research (IJIR)
Vol-3, Issue-5, 2017
ISSN: 2454-1362, http://www.onlinejournal.in

A Review of Self Optimal Clustering Technique and Data Mining Approach

Akanksha Garg¹ & Dr. Shiv K. Sahu²
¹M.Tech Scholar, CSE, LNCTE, Bhopal, India
²Prof. & Head, Dept. of Computer Science & Engineering, LNCTE, Bhopal, India

Abstract: Clusters are by nature collections of similar objects. Each group or cluster is homogeneous, i.e., objects belonging to the same group are similar to each other. Also, each group or cluster should be different from other clusters, i.e., objects belonging to one cluster should be different from the objects of other clusters. Clustering is the process of grouping similar objects. This paper presents a brief literature survey of optimal clustering techniques and related approaches. Optimization techniques can also improve the efficiency of clustering techniques.

Keywords: Clustering, PSO, SOC, Data Mining, FCM, K-Means.

I. INTRODUCTION

The terms data mining, patent mining, text mining and visualization are employed for the processing of documents. This paper attempts to explain these terms and why "data mining" was chosen for the title of the study [7]. Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner. Clustering is a division of data into groups of similar objects. For over forty years, clustering, one of the best-known data mining techniques, has been extensively studied and applied in numerous applications. In clustering, the whole data set is divided into sub-groups based on some similarity measure, and each sub-group forms a cluster. Numerous clustering algorithms have been reported in the literature for clustering data efficiently. They can be classified as nearest-neighbour clustering, fuzzy clustering, partitional clustering, hierarchical clustering, artificial neural network based clustering, statistical clustering, density-based clustering, and so on. Although many classes of algorithms exist, partitional clustering algorithms and hierarchical clustering algorithms attract the most attention from researchers. Generally, hierarchical clustering algorithms produce a satisfactory level of clustering performance [8].

Document clustering has been investigated for use in a number of different areas of text mining and information retrieval. It is regarded as a major technology for intelligent unsupervised categorization of textual content of any kind, e.g. news articles, web pages, learning objects, electronic books and even textual metadata. Document clustering groups similar documents into coherent clusters, while documents that are different are separated into different clusters [7]. The quality of document clustering in both centralized and decentralized environments can be improved by using an advanced clustering framework.

Different approaches to data stream clustering have been proposed since the 1990s. Among these, incremental approaches have emerged as a solution for incoming data arriving at high rates, which limits the possibility of storing all the samples and reduces the time available for processing them. The main techniques include CluStream, ClusTree and DenStream. These approaches use a two-phase scheme: the raw data stream is processed in real time to produce summary data, and this summary data is then used offline to generate the clusters [6].
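As a rough illustration of this two-phase scheme (not the exact CluStream, ClusTree or DenStream procedures), the sketch below keeps simple online micro-cluster summaries and then clusters their centres offline with a weighted k-means; the class names, radius and swarm of parameters are illustrative assumptions only.

```python
import numpy as np

class MicroCluster:
    """Running summary (count, linear sum, squared sum) of nearby points."""
    def __init__(self, point):
        self.n, self.ls, self.ss = 1, point.copy(), point ** 2  # ss could support radius estimates

    def absorb(self, point):
        self.n += 1
        self.ls += point
        self.ss += point ** 2

    @property
    def centre(self):
        return self.ls / self.n

def online_phase(stream, radius=1.0):
    """Phase 1: absorb each arriving point into the nearest micro-cluster."""
    micro = []
    for x in stream:
        if micro:
            d = [np.linalg.norm(x - m.centre) for m in micro]
            i = int(np.argmin(d))
            if d[i] <= radius:
                micro[i].absorb(x)
                continue
        micro.append(MicroCluster(x))
    return micro

def offline_phase(micro, k=3, iters=20):
    """Phase 2: cluster the micro-cluster centres with a weighted k-means."""
    centres = np.array([m.centre for m in micro])
    weights = np.array([m.n for m in micro], dtype=float)
    c = centres[np.random.choice(len(centres), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((centres[:, None, :] - c[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            mask = labels == j
            if mask.any():
                c[j] = np.average(centres[mask], axis=0, weights=weights[mask])
    return c

# Usage sketch:
# stream = (row for row in np.random.randn(1000, 2))
# final_centres = offline_phase(online_phase(stream))
```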
This survey focuses on clustering in data mining. Data mining adds to clustering the complications of very large datasets with many attributes of different types [12]. This imposes unique computational requirements on the relevant clustering algorithms. A variety of algorithms have recently emerged that meet these requirements and have been successfully applied to real-life data mining problems. Some real-life data mining problems involve learning classifiers from imbalanced data, in which one of the classes, called the minority class, contains far fewer examples than the other classes, called the majority classes.


Clustering, or unsupervised learning, is one of the most important fields of machine learning; it splits the data into groups of similar objects, helping in the extraction or summarization of new information. It is used in a variety of fields such as statistics, pattern recognition and data mining. This research focuses on the application of clustering in data mining. Clustering is one of the major areas of data mining and plays a vital role in answering the challenges of the IT industry [7].

Data clustering is one of the most important and popular data analysis techniques; it classifies an unlabeled dataset into clusters of similar objects. Each cluster consists of objects that are similar within the cluster and dissimilar to objects of other clusters. Clustering has been applied in many areas such as web mining, text mining, image processing, stock prediction, signal processing, biology and other fields of science and engineering [3].

In recent years, meta-heuristic algorithms have been widely used to solve clustering problems [3]. From an optimization perspective, clustering can be formally considered a particular kind of NP-hard grouping problem; algorithms of this type search for an optimal solution to the clustering problem while reducing the risk of becoming trapped in local optima.

Data mining techniques such as clustering and classification methods have been deployed by researchers to better differentiate between the patterns of clustered and classified data. Many studies have been conducted based on FCM and K-means techniques [4]. Inspired by the effectiveness of applying such techniques to the design of fire flame detection systems, K-medoids is employed, which has some advantages over K-means and FCM [20].

This paper is organized into six sections. An overview of some of the existing clustering algorithms is given in Section II. The clustering technique and data mining approach are explained in detail in Section III, which also includes a brief discussion of the various data mining measures used for review and comparison in this paper. Section IV discusses the problem formulation for cluster and feature selection and for finding the optimality between clusters. Section V discusses our proposed method for cluster validation on multi-view data using an optimization method, namely particle swarm optimization. Finally, the conclusions are drawn in Section VI.

II. RELATED WORK

In this section we discuss the surveyed literature, cited by author name and reference number.

Nishchal K. Verma and Abhishek Roy [1] discussed a proposed clustering technique equipped with major changes and modifications relative to previous versions of the algorithm. SOC is compared with some widely used clustering techniques such as K-means, fuzzy C-means, Expectation-Maximization, and K-medoids. The proposed technique is also compared with IMC and its last updated version. The quantitative and qualitative performances of all these well-known clustering techniques are presented and compared with the aid of case studies and examples on various benchmark validation indices. SOC has been evaluated via cluster compactness within itself and separation from other clusters.

Pavel Berkhin [2] describes clustering as the division of data into groups of similar objects. It disregards some details in exchange for data simplification. Informally, clustering can be viewed as data modeling that concisely summarizes the data, and it therefore relates to many disciplines, from statistics to numerical analysis. Clustering plays an important role in a broad range of applications, from information retrieval to CRM. Such applications usually deal with large datasets and many attributes. The survey concentrates on clustering algorithms from a data mining perspective, and its goal is to provide a comprehensive review of different clustering techniques in data mining.

K. A. Abdul Nazeer and M. P. Sebastian [3] propose a method for making the k-means algorithm more effective and efficient, so as to get better clustering with reduced complexity. Conventional database querying methods are inadequate for extracting useful information from huge data banks. Cluster analysis is one of the major data analysis methods, and the k-means clustering algorithm is widely used in many practical applications. But the original k-means algorithm is computationally expensive, and the quality of the resulting clusters depends heavily on the selection of the initial centroids.
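The sensitivity to initial centroids noted in [3] is easy to demonstrate. The sketch below implements a plain Lloyd-style k-means (not the improved algorithm of [3]) and runs it with several random seeds; the final within-cluster sum of squared errors typically differs between runs. Data and parameter choices are placeholders.

```python
import numpy as np

def kmeans(X, k, seed, iters=50):
    """Plain Lloyd's k-means with random initial centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        # Recompute centroids; keep the old one if a cluster goes empty.
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    sse = ((X - centroids[labels]) ** 2).sum()  # within-cluster sum of squares
    return labels, centroids, sse

# Different random initializations can converge to different local optima.
X = np.vstack([np.random.randn(100, 2) + c for c in ([0, 0], [5, 5], [0, 5])])
for seed in range(5):
    print(seed, round(float(kmeans(X, k=3, seed=seed)[2]), 2))
```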


Hae-Sang Park and Chi-Hyuck Jun [4] propose a new algorithm for K-medoids clustering which runs like the K-means algorithm, and test several methods for selecting the initial medoids. The proposed algorithm calculates the distance matrix once and uses it for finding new medoids at every iterative step. To evaluate the proposed algorithm, real and artificial data sets are used and the results are compared with those of other algorithms in terms of the adjusted Rand index. Experimental results show that the proposed algorithm takes significantly less computation time, with performance comparable to partitioning around medoids (PAM).
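A minimal sketch in the spirit of this idea is shown below: the pairwise distance matrix is computed once, and each iteration re-selects as medoid the cluster member minimizing the total within-cluster distance. The initialization and update details of [4] are not reproduced; function and parameter names are illustrative.

```python
import numpy as np

def simple_kmedoids(X, k, iters=30, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    # The distance matrix is computed once and reused in every iteration.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    medoids = rng.choice(n, k, replace=False)
    for _ in range(iters):
        labels = D[:, medoids].argmin(axis=1)        # assign to nearest medoid
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members) == 0:
                continue
            # New medoid: the member minimizing total distance within its cluster.
            within = D[np.ix_(members, members)].sum(axis=1)
            new_medoids[j] = members[within.argmin()]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return medoids, D[:, medoids].argmin(axis=1)

# Usage sketch: medoids, labels = simple_kmedoids(np.random.randn(200, 2), k=3)
```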
Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman and Angela Y. Wu [5] consider the question of whether there exists a simple and practical approximation algorithm for k-means clustering. They present a local improvement heuristic based on swapping centers in and out, and prove that it yields a (9 + ε)-approximation algorithm. They also present an example showing that any approach based on performing a fixed number of swaps achieves an approximation factor of at least (9 − ε) in all sufficiently high dimensions.

In [6], a rule mining algorithm based on particle swarm optimization is introduced into text classification, and the text classification model Text PSO-Miner is established. In Text PSO-Miner, each particle corresponds to a path producing a classification rule; a rule is a line connecting attribute nodes and a class node. Each attribute node appears at most once, and every rule must have a class node. An attribute node corresponds to a text characteristic value. The intelligent swarm algorithm, Particle Swarm Optimization (PSO), is thereby introduced into the field of text classification. The Text PSO-Miner model based on PSO is constructed and tested on a Chinese text set, and the results show that it can be applied well to Chinese text classification.

The survey in [7] presents the state of the art of research devoted to Evolutionary Approaches (EAs) for clustering, exemplified by a diversity of evolutionary computations. The survey provides a nomenclature that highlights some aspects that are very important in the context of evolutionary data clustering. The paper examines the clustering trade-offs branched out with wide-ranging Multi-Objective Evolutionary Approaches (MOEAs). Finally, the study addresses the potential challenges of MOEA design and data clustering, along with conclusions and recommendations for novices and researchers, by positioning the most promising paths of future research. The survey organizes the developments witnessed in the past three decades for EA-based metaheuristics that solve multi-objective optimization problems (MOPs) and obtain high-quality solutions in a single run.

The paper in [8] presents a Pharo implementation of an iterative and semi-automatic method for clustering. The method proposes, to an end user, clusters that are based on domain information and structural information. It has been applied in an industrial project of architecture migration, and the authors show that it helps engineers to cluster software elements into domain concepts. The clustering achieves 56% precision and 79% recall after the automated part in a high-level clustering; a deeper clustering gives 51% precision and 52% recall.

The main objective of the Possibilistic Fuzzy C-Means (PFCM) method proposed in [9] is to determine the precise number of clusters and to interpret them efficiently. PFCM is a good clustering algorithm for classification tests because it can give more importance to typicality or membership values. PFCM is a hybridization of PCM and FCM that often avoids various problems of PCM, FCM and FPCM. The research is developed on the sample dataset 'lung'. Existing work in this area does not deal exclusively with cancer genes; using the modified possibilistic fuzzy c-means algorithm, matches with cancer genes can be found more effectively. Matlab is used to implement the algorithm, and the accuracy on the dataset can be assessed with different training sets. The possibilistic fuzzy c-means algorithm provided better results in identifying the cancer gene. To evaluate the feasibility of the PFCM clustering approach, the researchers carried out an experimental analysis.
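For orientation, a minimal sketch of the standard fuzzy c-means (FCM) updates that PFCM builds on is given below; the possibilistic typicality term that distinguishes PFCM [20] is deliberately omitted, so this is only a baseline illustration, not the method of [9].

```python
import numpy as np

def fcm(X, c, m=2.0, iters=50, eps=1e-9):
    """Standard fuzzy c-means: alternate membership and centre updates."""
    rng = np.random.default_rng(0)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)          # memberships sum to 1 per point
    for _ in range(iters):
        Um = U ** m
        centres = (Um.T @ X) / Um.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=-1) + eps
        # u_ik = 1 / sum_j (d_ik / d_ij)^(2/(m-1))
        U = 1.0 / ((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1))).sum(axis=2)
    return U, centres

# Usage sketch: memberships, centres = fcm(np.random.randn(300, 4), c=3)
```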
The research in [10] aims to develop an analysis view for the Integral System for Primary Health Care (SIAPS), employing clustering techniques, specifically partitional algorithms, with the goal of analysing the clinical information of patients; to this end, knowledge is extracted from a data warehouse fed by the repository of electronic medical records.


The solution was developed using Java 1.6 as the programming language, JBoss 4.2 as the application server, and Eclipse 3.4 as the integrated development environment; the Java Enterprise Edition 5.0 platform was used throughout the process. As a result, a system was implemented to facilitate the understanding of the generated models and to support the process of making clinical decisions.

III. DATA MINING APPROACH

Data mining methods are often used to detect patterns in a large set of data. These patterns are then used to identify future instances in a similar type of data. A number of data mining techniques have been tried for identifying new malicious binaries [11]: three learning algorithms were used to train a set of classifiers on some publicly available malicious and benign executables. The authors compared their algorithms to a traditional signature-based method and reported a higher detection rate for each of their algorithms. However, their algorithms also resulted in higher false positive rates when compared to the signature-based method.

Data mining is a procedure for discovering patterns in large datasets. Text mining is a sub-field of data mining that analyses large collections of documents. The prime challenge in this field is the amount of electronic text documents available, which grows exponentially and calls for effective methods to handle these documents. It is also infeasible to centralize all the documents from multiple sites at a single location for processing. Nowadays these document datasets are growing tremendously, which is often referred to as Big Data. The analysis problems on these datasets are referred to as the curse of dimensionality, since the data are often highly dimensional [7].

IV. PROBLEM STATEMENT

For the purpose of self optimal data clustering, various machine learning algorithms are applied, such as clustering, weighted clustering, and regression [1]. Two of the most critical and well generalized problems of multi-category data are newly evolved features and concept drift. Since multi-category data forms a fast and continuous event stream, it is assumed to have infinite length. Therefore, it is difficult to store and use all the historical data for training. The most commonly adopted alternative is an incremental learning technique, and several incremental learners have been proposed to address this problem.

A variety of techniques have also been proposed in the literature for addressing concept drift in multi-category data clustering. However, there are two other significant characteristics of multi-category data, concept evolution and feature evolution, that are ignored by most of the existing techniques [10]. Concept evolution occurs when new classes evolve in the data. In the re-categorization process, some important problems arise in cluster-oriented multi-category data clustering. These problems are given below:

- Multi-category data clustering suffers from multiple feature evaluation.
- Selection of the nearest cluster.
- Diversity of the feature selection process.
- Boundary value of a cluster.
- Efficient value of a cluster.

V. PROPOSED METHODOLOGY

Particle Swarm Optimization (PSO) is an iterative optimization technique originally proposed by Eberhart and Kennedy (1995), motivated by social behaviour, particularly the flocking of birds [4]. It is modelled by multi-dimensional particles, in which each individual particle is regarded as a potential solution to an optimization problem and is able to move toward the best position with respect to a fitness function. PSO is an evolutionary computation technique. Similar to the genetic algorithm, PSO is a population-based optimization tool. Particles representing potential solutions move through a D-dimensional search space [10]. Each particle moves in the direction of its own best solution and of the global best position discovered by any particle in the swarm, calculating its own velocity and updating its position in each iteration [10].

The self optimal clustering technique faces a problem of index generation and validation of data control. For the validation of data, a swarm-based optimization technique is used. The family of swarm intelligence methods gives a better optimal value of the index for the process of cluster generation. The remainder of this section discusses partitional clustering, particle swarm optimization, the proposed algorithm and the proposed model.

PROPOSED ALGORITHM

This section presents the proposed algorithm based on partitional clustering and particle swarm optimization. Particle swarm optimization gives the optimal number of clusters and validates the centre points and data.


Step 1: Initialization. The velocity and position of all particles are randomly set within pre-defined ranges.

Step 2: Velocity updating. At each iteration, the velocities of all particles are updated according to

v_i ← w·v_i + c1·R1·(p_i,best − p_i) + c2·R2·(g_best − p_i)    ...(1)

where p_i and v_i are the position and velocity of particle i, respectively; p_i,best and g_best are the best positions found so far by particle i and by the entire population, respectively; w is a parameter controlling the dynamics of flight; R1 and R2 are random variables in the range [0, 1]; and c1 and c2 are factors controlling the relative weighting of the corresponding terms. The random variables give PSO its ability to perform a stochastic search.

Step 3: Position updating. The positions of all particles are updated according to

p_i ← p_i + v_i    ...(2)

After updating, p_i should be checked and limited to the allowed range.

Step 4: Memory updating. Update p_i,best and g_best when the corresponding condition is met:

p_i,best ← p_i if f(p_i) < f(p_i,best)
g_best ← p_i if f(p_i) < f(g_best)    ...(3)

where f(x) is the objective function to be optimized; the update as written assumes a minimization problem.

Step 5: Stopping condition. The algorithm repeats Steps 2 to 4 until certain stopping conditions are met, such as a pre-defined number of iterations. Once stopped, the algorithm reports the values of g_best and f(g_best) as its solution.
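A compact sketch of these five steps is given below. It is a generic PSO loop for a minimization problem, not the exact SOC validation procedure proposed in this paper; the sphere objective, swarm size and coefficients are placeholder choices.

```python
import numpy as np

def pso(f, dim, n_particles=30, iters=100, w=0.7, c1=1.5, c2=1.5,
        lo=-5.0, hi=5.0, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: random initial positions and velocities within pre-defined ranges.
    p = rng.uniform(lo, hi, (n_particles, dim))
    v = rng.uniform(-(hi - lo), hi - lo, (n_particles, dim))
    p_best = p.copy()
    p_best_val = np.apply_along_axis(f, 1, p)
    g_best = p_best[p_best_val.argmin()].copy()

    for _ in range(iters):
        r1 = rng.random((n_particles, dim))
        r2 = rng.random((n_particles, dim))
        # Step 2: velocity update, Eq. (1).
        v = w * v + c1 * r1 * (p_best - p) + c2 * r2 * (g_best - p)
        # Step 3: position update, Eq. (2), clipped to the allowed range.
        p = np.clip(p + v, lo, hi)
        # Step 4: memory update, Eq. (3).
        vals = np.apply_along_axis(f, 1, p)
        improved = vals < p_best_val
        p_best[improved], p_best_val[improved] = p[improved], vals[improved]
        g_best = p_best[p_best_val.argmin()].copy()
    # Step 5: after the iteration budget, report g_best and f(g_best).
    return g_best, f(g_best)

# Usage with a placeholder sphere objective:
best, best_val = pso(lambda x: float((x ** 2).sum()), dim=5)
```

In the setting discussed here, the objective f would correspond to a cluster validity index evaluated on candidate cluster centres, so that the swarm searches for good centre placements.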
PROPOSED MODEL

Figure 1: The proposed working model of the PSO technique.

VI. CONCLUSION AND FUTURE WORK

This paper presents a review of clustering techniques, the self optimal clustering technique and other optimization techniques based on swarm intelligence algorithms. Swarm intelligence algorithms play a major role in seed selection and centre optimization for clustering techniques. The diversity and applicability of evolutionary algorithms increase the possibility of automatic cluster generation. This paper also discusses heuristic-based functions such as the particle swarm optimization algorithm, which works with a single fitness constraint. In future work we will implement self optimal clustering with swarm intelligence techniques such as particle swarm optimization.

REFERENCES

[1] Nishchal K. Verma, Abhishek Roy, "Self-Optimal Clustering Technique Using Optimized Threshold Function", IEEE Systems Journal, IEEE, 2013, pp. 1-14.

Threshold Function” IEEE SYSTEMS JOURNAL,


IEEE 2013. Pp 1-14. [13] C.L. Sun, J.C. Zeng, J.S. Pan “An improved
vector particle swarm optimization for constrained
[2] Pavel Berkhin “A Survey of Clustering Data optimization problems”, Information Scieces, Vol.
Mining Techniques” Pp 1-59. 181, 2011. Pp. 1153–1163.

[3] K. A. Abdul Nazeer, M. P. Sebastian [14] Mathew, Juby, and R. Vijayakumar.


“Improving the Accuracy and Efficiency of the k- "Scalableparallel clustering approach for large data
means Clustering Algorithm” WCE 2009. Pp 1-6. usinggenetic possibilistic fuzzy c-means
algorithm", 2014 IEEEInternational Conference on
[4] Hae-Sang Park, Chi-Hyuck Jun “A simple and Computational Intelligence and Computing
fast algorithm for K-medoids clustering” Expert Research,2014.
Systems with Applications, 2009. Pp 3336–3341.
[15] RM Suresh, K Dinakaran, P Valarmathie,
[5] Tapas Kanungo, David M. Mount, Nathan S. “Model based modified k-means clustering for
Netanyahu, Christine D. Piatko, Ruth Silverman, microarray data”, International Conference on
Angela Y. Wu “A local search approximation Information Management and Engineering, Vol.13,
algorithm for k-means clustering” Elsevier B.V. pp 271-273, 2009, IEEE.
All rights reserved, 2004. Pp 89-112.
[16] C. Escudero et al., "Classification of Gene
[6] LUO Xin “Chinese Text Classification Based Expression Profiles: Comparison of k-means and
on Particle Swarm Optimization” 4th National expectation maximization algorithms", IEEE
Conference on Electrical, Electronics and Computer Society, 2008, pp. 831-836.
Computer Engineering, NCEECE 2015, Pp 53-59.
[17] Mansoori, G. Eghbal, “FRBC: A fuzzy rule-
[7] Ramachandra Rao Kurada, Dr. K Karteeka based clustering algorithm”, IEEE Transactions on
Pavan, Dr. AV Dattareya Rao “A PRELIMINARY Fuzzy Systems, Vol.19, No.5, pp.960–971, 2011.
SURVEY ON OPTIMIZED MULTIOBJECTIVE
METAHEURISTIC METHODS FOR DATA [18] Singh Vijendra, Kelkar Ashwini, Sahoo
CLUSTERING USING EVOLUTIONARY Laxman, “An effective clustering algorithm for
APPROACHES” International Journal of data mining”, Proc. of the 2010 International
Computer Science & Information Technology Conference on Data Storage and Data Engineering,
(IJCSIT) Vol 5, No 5, October 2013. Pp 58-78. pp.250–253, 2010.

[8] Brice Govin, Arnaud Monegier, du Sorbier, [19] Yu Jin, Qian Feng, Qi Rongbin,
Nicolas Anquetil, Stephane Ducasse “Clustering “Improvement of stochastic particle swarm
technique for conceptual clusters” HAL, 2016. Pp optimization by succession strategy”,
1-7. Communications of the Systemics and Informatics
World Network, Vol.3, pp.155–159, 2008.
[9] Thomas Scaria, Gifty Stephen, Juby Mathew
“Gene Expression Data Analysis using Fuzzy C- [20] N.R. Pal, K. Pal, J.C. Bezdek et al., “A
means Clustering Technique” International Journal possibilistic fuzzy C-Means clustering algorithm”,
of Computer Applications (0975 – 8887) Volume IEEE Trans. Fuzzy Systems, Vol.13, No.4, pp.517–
135, 2016. Pp 33-36. 530, 2005.

[10] A. J. O. Reyes, A. O. Garcia, and Y. L. Mue [21] Lv Zehua, Jin Hai, Yuan Pingpeng, Zou
“System for Processing and Analysis of Deqing, “A fuzzy clustering algorithm for interval-
Information Using Clustering Technique” IEEE valued data based on Gauss distribution functions”,
LATIN AMERICA TRANSACTIONS, VOL. 12, Acta Electronica Sinica, Vol.38, No.2, pp.295–300,
2014. Pp 364-372. 2010.

[11] Jiawei Han M. K, Data Mining Concepts and


Techniques, Morgan Kaufmann Publishers, An
Imprint of Elsevier, 2006.

[12] Margaret H. Dunham, Data Mining-


Introductory and Advanced Concepts, Pearson
Education, 2006.

