

International Journal of Scientific Research in Computer Science and Engineering
Review Paper
Vol.6, Special Issue.1, pp.19-22, January (2018), E-ISSN: 2320-7639

A Review: Design and Development of Novel Techniques for Clustering and Classification of Data
R.S. Walse (1), G.D. Kurundkar (2), P.U. Bhalchandra (3)
(1) School of Computational Sciences, S.R.T.M. University (Research Centre), Nanded
(2) Department of Computer Science, Shri Gurubuddhi Swami Mahavidyalaya, S.R.T.M. University, Nanded
(3) School of Computational Sciences, S.R.T.M.U., Nanded, Maharashtra State, India

Abstract: The distribution of data can be obtained by clustering the data. In this work we observe the characteristics of selected clusters and make a further study of particular clusters. Cluster analysis also generally acts as a preprocessing step for other data mining operations. Consequently, cluster analysis has become a very active research topic in data mining. Data mining is a new technology, developing alongside databases and artificial intelligence. It is the process of extracting credible, novel, effective and understandable patterns from a database. Cluster analysis is an important data mining method used to discover data segmentation and pattern information. With the development of data mining methods, different types of clustering techniques have been established. This review studies clustering methods from the perspective of statistics and, based on statistical theory, makes an effort to combine statistical methods with machine learning algorithms, as well as to introduce the best existing r-statistical software, including factor analysis, correspondence analysis and functional data analysis, into data mining. The present study is undertaken to develop a data mining workflow that uses clustering and classification of data, solves clustering problems and extracts association rules. It uses a suitable proximity measure and, in addition, selects the optimal clustering model to solve clustering problems.

Keywords: ISODATA, SRIDHCR, r-solution, K-means


I. INTRODUCTION

In the data-mining world, clustering and classification are two types of methods. Both can be used to organize objects into groups that share one or more features. The difference between them is that clustering is an unsupervised learning technique that groups similar objects based on their features, while classification is a supervised learning technique. Abundant data carries abundant problems. Data mining involves various techniques and plays a very important role in today's and tomorrow's scenarios. This becomes possible with the use of clustering and classification, which make the data more robust, meaningful and usable with the least wastage.

The major problems we see with huge data are:
a) Data mining algorithms need to be more efficient and scalable to extract information from the huge amount of data in databases.
b) Dealing with datasets that require distributed approaches.
c) Mining information from heterogeneous databases and global information systems.
d) Processing of large, complex and unstructured data into a structured format.

Data clustering is an exploratory as well as descriptive data analysis technique that has gained a lot of attention, e.g., in statistics, data mining and pattern recognition. It is an explorative way to investigate multivariate data sets that may contain various types of data. Such data sets differ from each other in size, in terms of the number of objects and dimensions, or in the data types they contain. Data clustering belongs to the core techniques of data mining, in which one focuses on large data sets with unknown underlying structure. The intention of this report is to be an introduction to specific parts of this methodology, called cluster analysis. Partitioning-based clustering techniques are flexible methods based on iterative relocation of data points between clusters. The quality of the solutions is measured by a clustering criterion. At each iteration, the iterative relocation algorithms reduce the value of the criterion function until convergence. By changing the clustering criterion, it is possible to construct robust clustering methods that are less sensitive to incorrect and missing data values.

Datasets Used in the Experiments for clustering algorithms:
To study the performance of the clustering algorithms presented in the previous Section, we applied those algorithms to a benchmark consisting of the datasets listed

© 2018, IJSRCSE All Rights Reserved 19


Int. J. Sci. Res. in Computer Science and Engineering Vol.6(1), Jan 2018, E-ISSN: 2320-7639

in Table: The first 11 datasets were obtained from the University of California at Irvine Machine Learning Repository [23]. The last two datasets, which we named Complex9 and Oval10, are two-dimensional spatial datasets whose examples are distributed in different shapes. These two 2D datasets were obtained from the authors of [18], and some of them seem to be similar to proprietary datasets used in [15]. These datasets are used mainly for visualization purposes presented later in this Section.

Table: Datasets used in the benchmark

Clustering is a process of partitioning a set of data (or objects) into a set of meaningful sub-classes, called clusters.

Clustering has wide applications in
1. Economic Science (especially market research).
2. Document classification.
3. Pattern Recognition.
4. Spatial Data Analysis, e.g., creating thematic maps in GIS by clustering feature spaces.

The Iterative Self-Organizing Data Analysis Technique (ISODATA) uses a set of rule-of-thumb procedures that have been incorporated into an iterative classification algorithm. The ISODATA algorithm is a modification of the k-means clustering algorithm. It was developed by Geoffrey H. Ball and David J. Hall, working at the Stanford Research Institute in Menlo Park, CA. They published their findings in a technical report entitled: ISODATA, a novel method of data analysis and pattern classification.

Single Representative Insertion/Deletion Steepest Descent Hill Climbing with Randomized Restart (SRIDHCR):
This greedy algorithm starts by randomly selecting some examples from the dataset as the initial set of representatives. Clusters are then created by assigning examples to the cluster of their closest representative. Starting from this randomly generated set of representatives, the algorithm tries to improve the quality of the clustering by adding a single non-representative example to the set of representatives as well as by removing a single representative from the set of representatives. The algorithm terminates if the solution quality (measured by q(X)) does not show any improvement. Moreover, we assume that the algorithm is run r times (r is an input parameter), starting from a randomly generated initial set of representatives each time, and reporting the best of the r solutions as its final result. The pseudo-code of algorithm SRIDHCR that was used for the evaluation of supervised clustering is given in Figure. It must be noted that the number of clusters k is not fixed for SRIDHCR; the algorithm searches for "good" values of k.

Although the intuitive idea behind cluster analysis is simple, the successful completion of the tasks presumes a large number of correct decisions and choices from several alternatives. Anderberg [5] states that there appear to be at least nine major elements in a cluster analysis study before the results can be attained. Because current real-world data sets contain missing values as well, we complete this element list with the data presentation and missing data strategy:
1. Data presentation.
2. Choice of objects.
3. Choice of variables.
4. What to cluster: data units or variables.
5. Normalization of variables.
6. Choice of (dis)similarity measures.
7. Choice of clustering criterion (objective function).
8. Choice of missing data strategy.
9. Algorithms and computer implementation (and their reliability, e.g., convergence).
10. Number of clusters.
11. Interpretation of results.

These are the most significant parts of the general clustering process. Jain et al. [71] suggest that the strategies used in data collection, data representation, normalization and cluster validity are as important as the clustering strategy itself. According to Hastie et al. [58, p.459], the choice of the best (dis)similarity measure is even more important than the choice of clustering algorithm. This list could also be completed by validation of the resulting cluster solution [70, 4]. Validation is, on the other hand, closely related to the estimation of the number of clusters and to the interpretation of results.

Many different methods and algorithms are available:
1. For numeric and symbolic data
2. Exclusive vs. overlapping
3. Crisp vs. soft computing paradigms
4. Hierarchical vs. flat (non-hierarchical)
5. Access to all data or incremental learning
6. Semi-supervised mode

CONCLUSIONS:


Unlike other types of data, text data include many features. In this research we design and develop novel feature extraction techniques using clustering methods. First, clustering techniques for text documents, such as agglomerative, divisive and distributive clustering, are studied, together with novel feature clustering algorithms for text classification. One application is news: electronic news articles are generated very frequently and manual classification of these articles is very difficult, so computerized methods are useful in this case. This application is known as text filtering. Following this, digital libraries are studied: a variety of supervised methods may be used for document organization in domains like digital libraries, web collections, and scientific literature.

It is proposed to use SPMF, an open-source data mining library written in Java, specialized in pattern mining and distributed under the GPL v3 license. It provides implementations of 133 data mining algorithms for:
• association rule mining,
• sequential pattern mining,
• sequential rule mining,
• sequence prediction,
• periodic pattern mining,
• high-utility pattern mining,
• clustering and classification,
• time-series mining.

The research work aims to improve the efficiency of classification and clustering algorithms by applying them to data sets, and to improve the performance of traditional algorithms like K-Means by presenting a hybrid approach. Such an approach can deal preferably with small convex datasets, reduces the error rate and achieves better output accuracy.

The Scope of Study:

The purpose of classification and clustering algorithms is to make sense of, and extract value from, large sets of structured and unstructured data. If we are working with huge volumes of unstructured data, it only makes sense to try to partition the data into some logical groupings before attempting to analyze it. Clustering and classification allow us to take a sweeping glimpse of the data all together, and then form some logical structures based on what we find there before going deeper into the analysis. In their simplest form, clusters are sets of data points that share similar attributes, and clustering algorithms are the methods that group these data points into different clusters based on their similarity. Clustering algorithms are used for disease classification in medical science, but we also see them used for customer classification in marketing research and environmental health risk assessment. There are different clustering techniques, depending on how the dataset is to be divided.

Hierarchical: Algorithms create separate sets of nested clusters, each at its own hierarchical level. Partitional: Algorithms create just a single set of clusters.

Table: Traditional, Semi-Supervised, and supervised clustering

REFERENCES:

[1] A perusal of big data classification and Hadoop technology, International Transaction of Electrical and Computer Engineers System, 2017, Vol. 4, No. 1, pp. 26-38.
[2] http://www.cs.put.poznan.pl/jstefanowski/sed/DM-7clusteringnew.pdf
[3] http://www.ijarcsms.com/docs/paper/volume2/issue12/V2I12-0095.pdf
[4] M. and Heckerman, D., "An experimental comparison of several clustering and initialization methods," Technical Report MSR-TR-98-06, Microsoft Research, Redmond, WA, February 1998.
[6] Mrs. Bharati M. Ramageri, "Data Mining Techniques and Applications," Indian Journal of Computer Science and Engineering, Vol. 1, No. 4, pp. 301-305, 2010.
[7] Karimella Vikram and Niraj Upadhyaya, "Data Mining Tools and Techniques: a review," Computer Engineering and Intelligent Systems, Vol. 2, No. 8, pp. 31-39, 2011.
[8] Usama Fayyad, G. Piatetsky-Shapiro, and Padhraic Smyth, "Knowledge discovery and data mining: Towards a unifying framework," Proceedings of the International Conference on Knowledge Discovery and Data Mining, pp. 82-88, 1996.
[9] A Novel Method for Text and Non-Text Segmentation in Document Images, International Conference on Communication and Signal Processing, April 3-5, 2013, India.
[10] Novel Techniques on Feature Clustering Algorithms for Text Classification, IJCST Vol. 4, Issue Spl-4, Oct-Dec 2013, ISSN: 2229-4333 (Print).
[11] A Novel Method for Text and Non-Text Segmentation in Document Images, International Conference on Communication and Signal Processing, April 3-5, 2013, 978-1-4673-4866-9/13, 2013 IEEE.


ANALYTICAL FRAMEWORK OF THE STUDY

The detailed process is explained below:

Fig. 1: The Initial Processing
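
As a minimal illustration of the partitioning step that a workflow of this kind relies on, the sketch below runs K-means by iterative relocation and repeats it from several random starts, keeping the solution with the lowest value of the clustering criterion. It is not the framework of the study: the function names, the squared Euclidean proximity measure, the sum-of-squared-errors criterion and the toy data are all assumptions made for this example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance, the proximity measure assumed here."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centers):
    """Return, for each point, the index of its closest center."""
    return [
        min(range(len(centers)), key=lambda j: squared_distance(p, centers[j]))
        for p in points
    ]


def sse(points, centers, labels):
    """Clustering criterion: sum of squared distances to assigned centers."""
    return sum(squared_distance(p, centers[l]) for p, l in zip(points, labels))


def kmeans(points, k, rng, max_iter=100):
    """One K-means run starting from k randomly chosen data points."""
    centers = rng.sample(points, k)
    labels = assign(points, centers)
    for _ in range(max_iter):
        # Relocation step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
        new_labels = assign(points, centers)
        if new_labels == labels:  # no point changed cluster: converged
            break
        labels = new_labels
    return centers, labels, sse(points, centers, labels)


def best_of_restarts(points, k, restarts=10, seed=0):
    """Repeat K-means from random starts and keep the lowest-criterion run."""
    rng = random.Random(seed)
    runs = [kmeans(points, k, rng) for _ in range(restarts)]
    return min(runs, key=lambda run: run[2])


# Toy data (hypothetical): two well-separated groups in the plane.
data = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centers, labels, criterion = best_of_restarts(data, k=2)
```

Swapping `squared_distance` for another proximity measure, or `sse` for another criterion, changes which partition is preferred. The number of clusters `k` is fixed in advance in this sketch, unlike in SRIDHCR, which searches over it.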
