Abstract: Clusters are, by nature, collections of similar objects. Each group or cluster is homogeneous, i.e., objects belonging to the same group are similar to each other. Each cluster should also differ from the other clusters, i.e., objects belonging to one cluster should be different from the objects of other clusters. Clustering is the process of grouping similar objects. This paper presents a brief literature survey of optimal clustering techniques and related methods. Optimization techniques can also yield better results in terms of efficiency for clustering techniques.

Keywords: - Clustering, PSO, SOC, Data Mining, FCM, K-Means.

I. INTRODUCTION

The terms data mining, patent mining, text mining and visualization are employed for the processing of documents. This section explains these terms and why “data mining” was chosen for the title of the study [7]. Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner. Clustering is a division of data into groups of similar objects. Clustering, one of the best-known data mining techniques, has been studied extensively for about 40 years and applied in numerous applications. In clustering, the whole data set is divided into sub-groups based on some similarity, and each sub-group is a cluster. Numerous clustering algorithms have been reported in the literature for clustering data efficiently. They can be classified as nearest neighbor clustering, fuzzy clustering, partitional clustering, hierarchical clustering, artificial neural network-based clustering, statistical clustering algorithms, density-based clustering algorithms, etc. Although a variety of algorithm classes exists, partitional clustering algorithms and hierarchical clustering algorithms have attracted the most attention from researchers. Generally, hierarchical clustering algorithms produce a satisfactory level of clustering performance [8].

Document clustering has been investigated for use in a number of different areas of text mining and information retrieval. It is regarded as a major technology for intelligent unsupervised categorization of content in text form of any kind, e.g. news articles, web pages, learning objects, electronic books, even textual metadata. Document clustering groups similar documents to form a coherent cluster, while documents that are different are separated into different clusters [7]. The quality of document clustering in both centralized and decentralized environments can be improved by using an advanced clustering framework.

Different approaches to data stream clustering have been proposed since the 1990s. Among these, incremental approaches have emerged as a solution for incoming data arriving at high rates. A high arrival rate limits the possibility of storing all the samples and reduces the time available for processing them. The main techniques include CluStream, ClusTree and DenStream. These approaches use a two-phase scheme, which consists of processing the raw data stream in real time to produce summary data and then using this summary data offline to generate the clusters [6].

This survey focuses on clustering in data mining. Data mining adds to clustering the complications of very large datasets with very many attributes of different types [12]. This imposes unique computational requirements on relevant clustering algorithms. A variety of algorithms have recently emerged that meet these requirements and were successfully applied to real-life data mining problems. Some real-life data mining problems involve learning classifiers from imbalanced data, which means that one of the classes, called the minority class, includes a much smaller number of examples than the other classes, called majority classes.

... multi-view data using an optimization method such as particle swarm optimization. Finally, the conclusions are drawn in Section VI.
Hae-Sang Park and Chi-Hyuck Jun [4] propose a new algorithm for K-medoids clustering which runs like the K-means algorithm, and test several methods for selecting initial medoids. The proposed algorithm calculates the distance matrix once and uses it for finding new medoids at every iterative step. To evaluate the proposed algorithm, they use some real and artificial data sets and compare with the results of other algorithms in terms of the adjusted Rand index. Experimental results show that the proposed algorithm takes significantly less computation time, with performance comparable to partitioning around medoids.

Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman and Angela Y. Wu [5] consider the question of whether there exists a simple and practical approximation algorithm for k-means clustering. They present a local improvement heuristic based on swapping centers in and out. They prove that this yields a (9 + ε)-approximation algorithm. They present an example showing that any approach based on performing a fixed number of swaps achieves an approximation factor of at least (9 − ε) in all sufficiently high dimensions.

[6] In this paper, a rule mining algorithm based on particle swarm optimization is introduced into text classification, and a text classification model, Text PSO-Miner, is established. In Text PSO-Miner, each particle corresponds to a path, producing a classification rule. A rule is a line connecting attribute nodes and a class node. Each attribute node appears at most once, and each rule must have a class node. An attribute node corresponds to a text characteristic value. The swarm intelligence algorithm Particle Swarm Optimization (PSO) is thus introduced into the field of text classification. The Text PSO-Miner model is constructed and tested on a Chinese text set. The results show that Text PSO-Miner can be applied well to Chinese text classification.

[7] This survey presents the state of the art of research devoted to Evolutionary Approaches (EAs) for clustering, exemplified with a diversity of evolutionary computations. The survey provides a nomenclature that highlights some aspects that are very important in the context of evolutionary data clustering. The paper examines the clustering trade-offs addressed by a wide range of Multi-Objective Evolutionary Approaches (MOEAs). Finally, the study addresses the potential challenges of MOEA design and data clustering, along with conclusions and recommendations for novices and researchers, by identifying the most promising paths of future research. The survey organizes the developments witnessed in the past three decades in EA-based metaheuristics for solving multi-objective optimization problems (MOPs) and for making significant progress in finding high-quality solutions in a single run.

[8] This paper presents a Pharo implementation of an iterative and semi-automatic method for clustering. The method proposes, to an end user, clusters that are based on domain information and structural information. It has been applied in an industrial architecture migration project. The authors show that this method helps engineers to cluster software elements into domain concepts. The clustering gives 56% precision and 79% recall after the automated part in a high-level clustering. A deeper clustering gives 51% precision and 52% recall.

[9] The main objective of the proposed Possibilistic Fuzzy C-Means (PFCM) method is to determine the precise number of clusters and interpret them efficiently. PFCM is a good clustering algorithm for classification tests because it can give more importance to typicalities or membership values. PFCM is a hybridization of PCM and FCM that often avoids various problems of PCM, FCM and FPCM. The entire study is based on the sample dataset ‘lung’. Existing research in this area does not work exclusively with cancer genes. The Modified Possibilistic fuzzy c-means algorithm was found to match cancer genes better. “Matlab” is used to implement the algorithm. The accuracy on the dataset may be assessed using different training sets. The Possibilistic fuzzy c-means algorithm provided better results in identifying the cancer gene. To evaluate the feasibility of the PFCM clustering approach, the researchers carried out an experimental analysis.

[10] This research aims to develop an analysis view for the Integral System for Primary Health Care (SIAPS), employing clustering techniques, specifically partitional algorithms, with the goal of completing an analysis of the clinical information of patients, since this enables the extraction of knowledge from a data warehouse powered by the repository of electronic medical
records. The solution was developed using Java 1.6 as the programming language, JBoss 4.2 as the application server and Eclipse 3.4 as the Integrated Development Environment. The Java Enterprise Edition 5.0 platform was used during the whole process. As a result, a system was implemented to facilitate the understanding of the generated models and to support clinical decision making.

III. DATA MINING APPROACH

Data mining methods are often used to detect patterns in a large set of data. These patterns are then used to identify future instances in a similar type of data. The authors of [11] experimented with a number of data mining techniques to identify new malicious binaries. They used three learning algorithms to train a set of classifiers on some publicly available malicious and benign executables. They compared their algorithms to a traditional signature-based method and reported a higher detection rate for each of their algorithms. However, their algorithms also resulted in higher false positive rates when compared to the signature-based method.

Data mining is a procedure to discover patterns from large datasets. Text mining is a sub-field of data mining that analyses large collections of documents. The prime challenge in this field is that the amount of available electronic text documents increases exponentially, which calls for effective methods to handle these documents. It is also infeasible to centralize all the documents from multiple sites in a single location for processing. Nowadays these document datasets are growing tremendously and are often referred to as Big Data. The problems of analysing these datasets are referred to as the curse of dimensionality, since the data are often high-dimensional [7].

IV. PROBLEM STATEMENT

For the purpose of self-optimal data clustering, various machine learning algorithms are applied, such as clustering, weighted clustering, and regression [1]. Two of the most critical and well generalized problems of multi-category data are its newly evolved features and concept drift. Since multi-category data arrives as a fast and continuous stream, it is assumed to have infinite length. Therefore, it is difficult to store and use all the historical data for training. The most obvious alternative is an incremental learning technique. Several incremental learners have been proposed to address this problem.

A variety of techniques have also been proposed in the literature for addressing concept drift in multi-category data clustering. However, there are two other significant characteristics of multi-category data, namely concept evolution and feature evolution, that are ignored by most of the existing techniques [10]. Concept evolution occurs when new classes evolve in the data. In the re-categorization process we found some important problems in cluster-oriented multi-category data clustering. These problems are given below.

- Multi-category data clustering suffers from multiple feature evaluation.
- Selection of cluster from nearest.
- Diversity of feature selection process.
- Boundary value of cluster.
- Efficient value of cluster.

V. PROPOSED METHODOLOGY

Particle Swarm Optimization (PSO) is an iterative optimization technique originally proposed by Eberhart and Kennedy (1995), motivated by the social behaviour of organisms, particularly bird flocking [4]. It is modelled by multi-dimensional particles, in which each individual particle is regarded as a potential solution to an optimization problem and is able to move toward the best position with respect to a fitness function. PSO is an evolutionary computation technique. Similar to the genetic algorithm, PSO is a population-based optimization tool. Particles representing potential problem solutions move through a D-dimensional search space [10]. Each particle moves in the direction of its own best solution and the global best position discovered by any particle in the swarm. Each particle calculates its own velocity and updates its position in each iteration [10].

The self-optimal clustering technique faces a problem of index generation and validation of data control. A swarm-based optimization technique is used for the validation of data. The family of swarm intelligence algorithms gives a better optimal index value for the process of cluster generation. The remainder of this section discusses partition clustering, particle swarm optimization, the proposed algorithm and the proposed model.

PROPOSED ALGORITHM

This section discusses the proposed algorithm based on partition clustering and particle swarm optimization. Particle swarm optimization gives the optimal number of clusters and validates the center points and data.
$$p_i = p_i + v_i \qquad (2)$$

After updating, $p_i$ should be checked and limited to the allowed range.
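
The position update and range check described above can be illustrated with a minimal Python sketch. It is not the proposed algorithm itself. The velocity rule used here (inertia weight `w`, acceleration coefficients `c1` and `c2`) is the commonly used canonical PSO form and is an assumption, since the velocity equation is not given in this text. The function name `pso_minimize`, the parameter values and the `sphere` test function are likewise hypothetical choices made for illustration.

```python
import random


def pso_minimize(fitness, dim, bounds, n_particles=20, n_iters=100,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize `fitness` inside a box; returns (best_position, best_value)."""
    rng = random.Random(seed)
    lo, hi = bounds
    # Random initial positions inside the allowed range, zero initial velocities.
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    # Personal bests and the global best found so far.
    pbest = [p[:] for p in pos]
    pbest_val = [fitness(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Assumed canonical velocity update: pull toward the particle's
                # own best position and toward the swarm's global best position.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Position update p_i = p_i + v_i, then limit to the allowed range.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = fitness(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val


def sphere(x):
    """Hypothetical test fitness: sum of squares, minimized at the origin."""
    return sum(v * v for v in x)


best, best_val = pso_minimize(sphere, dim=2, bounds=(-5.0, 5.0))
```

For clustering, one possible encoding (again an assumption, not taken from this paper) is to let each particle hold a set of candidate cluster centers and to use a within-cluster distance measure as the fitness function.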
REFERENCES:-
[8] Brice Govin, Arnaud Monegier du Sorbier, Nicolas Anquetil, Stephane Ducasse, “Clustering technique for conceptual clusters”, HAL, pp.1-7, 2016.

[9] Thomas Scaria, Gifty Stephen, Juby Mathew, “Gene Expression Data Analysis using Fuzzy C-means Clustering Technique”, International Journal of Computer Applications (0975 – 8887), Vol.135, pp.33-36, 2016.

[10] A. J. O. Reyes, A. O. Garcia, and Y. L. Mue, “System for Processing and Analysis of Information Using Clustering Technique”, IEEE Latin America Transactions, Vol.12, pp.364-372, 2014.

[19] Yu Jin, Qian Feng, Qi Rongbin, “Improvement of stochastic particle swarm optimization by succession strategy”, Communications of the Systemics and Informatics World Network, Vol.3, pp.155–159, 2008.

[20] N.R. Pal, K. Pal, J.C. Bezdek et al., “A possibilistic fuzzy C-Means clustering algorithm”, IEEE Trans. Fuzzy Systems, Vol.13, No.4, pp.517–530, 2005.

[21] Lv Zehua, Jin Hai, Yuan Pingpeng, Zou Deqing, “A fuzzy clustering algorithm for interval-valued data based on Gauss distribution functions”, Acta Electronica Sinica, Vol.38, No.2, pp.295–300, 2010.