An Efficient Clustering Method to Find Similarity between the Documents
(An ISO 3297: 2007 Certified Organization) Vol.2, Special Issue 1, March 2014
ABSTRACT: Data mining is the process of extracting, or mining, knowledge from large amounts of data. Clustering is a data mining technique used to group similar data items. The TF-IDF approach is used to calculate the weight of each cluster, and a ranking method is used to rank the documents within a cluster. In existing work, the similarity between a pair of objects is measured from a single viewpoint or from multiple viewpoints, with the values calculated using cosine similarity. In contrast, the proposed method is based on correlation similarity and uses the HAC algorithm for clustering the documents. Correlation similarity is used to calculate the similarity between every pair of documents in a cluster. The HAC algorithm groups the clusters level by level, and a ranking technique ranks each cluster according to the content of its documents; finally, the most relevant data is grouped and the cluster results are displayed.
I. INTRODUCTION
In recent years, an increasing number of data sets have become available. Data mining is the practice of automatically searching large stores of data to discover patterns and trends that go beyond simple analysis. Data mining uses sophisticated mathematical algorithms to segment the data and evaluate the probability of future events; it is also known as Knowledge Discovery in Data. Data mining, the extraction of hidden predictive information from large databases, is a powerful technology with great potential to help companies focus on the most important information in their data warehouses. Data mining tools predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. The automated, prospective analyses offered by data mining move beyond the analyses of past events provided by the retrospective tools typical of decision support systems.
1.1 Clustering
Clustering is one of the most interesting and important topics in data mining. The aim of clustering is to find intrinsic structures in data and organize them into meaningful subgroups for further study and analysis. Many clustering algorithms are published every year; they are proposed for very different research fields and are developed using entirely different techniques and approaches. Nevertheless, according to a recent study, more than half a century after it was introduced, the simple k-means algorithm still remains one of the best data mining algorithms, and it is the partitional clustering algorithm most frequently used in practice. Another recent scientific discussion states that k-means is the favorite algorithm that practitioners in the related fields choose to use.
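As a concrete illustration of the partitional approach mentioned above, the following is a minimal k-means sketch (the data, the function name `kmeans`, and the parameter choices are illustrative, not from the paper): the algorithm alternates between assigning each point to its nearest centroid and recomputing each centroid as the mean of its cluster.

```python
import random

def kmeans(points, k, iters=20):
    """Minimal k-means: alternate assignment and centroid-update steps."""
    centroids = random.sample(points, k)
    for _ in range(iters):
        # Assignment: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2
                                            for a, b in zip(p, centroids[i])))
            clusters[nearest].append(p)
        # Update: move each non-empty cluster's centroid to its mean.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = tuple(sum(dim) / len(members)
                                     for dim in zip(*members))
    return centroids, clusters

points = [(1.0, 1.0), (1.2, 0.8), (8.0, 8.0), (7.8, 8.3)]
centroids, clusters = kmeans(points, 2)
```

Because the initial centroids are sampled at random, different runs may converge to different partitions; this sensitivity to initialization is a well-known property of k-means.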
The internal structure of the data is found and organized into meaningful groups. The existing system greedily picks the next frequent item set for the next cluster, so the clustering result depends on the order in which the item sets are picked. Data in k-means-related fields are processed, and cosine similarity is used to find the dissimilar document objects in a cluster. The existing system proposes a multi-viewpoint algorithm to move a dissimilar document object from one cluster to another. A second similarity measure assesses the similarity between the dissimilar document object and the document objects of the other cluster groups.
i. Cosine Similarity
ii. Increment Mining
Note that these bounds apply for any number of dimensions, and Cosine similarity is most commonly used in high-
dimensional positive spaces. For example, in Information Retrieval and text mining, each term is notionally assigned a
different dimension and a document is characterised by a vector where the value of each dimension corresponds to the
number of times that term appears in the document.
Cosine similarity gives a useful measure of how similar two documents are likely to be in terms of their subject matter. One of the reasons for the popularity of cosine similarity is that it is very efficient to evaluate, especially for sparse vectors.
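The term-weight vectors and the cosine measure described above can be sketched as follows. This is an illustrative implementation (the toy corpus and the names `tfidf_vectors` and `cosine` are hypothetical, not from the paper), combining TF-IDF weighting with cosine similarity over the resulting vectors.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build TF-IDF weight vectors for a small corpus (illustrative sketch)."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted(set(t for doc in tokenized for t in doc))
    n = len(docs)
    # df[t]: number of documents containing term t (for the IDF factor).
    df = {t: sum(1 for doc in tokenized if t in doc) for t in vocab}
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)
        # Weight = raw term frequency * log inverse document frequency.
        vectors.append([tf[t] * math.log(n / df[t]) for t in vocab])
    return vectors, vocab

def cosine(u, v):
    """Cosine similarity: dot product over the product of Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

docs = ["data mining finds patterns",
        "clustering groups similar documents",
        "data mining and clustering of documents"]
vecs, vocab = tfidf_vectors(docs)
```

In this toy corpus, the first and third documents share the terms "data" and "mining", so their cosine similarity is higher than that of the first and second documents, which share no terms at all.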
The existing approach is the Multi-Viewpoint Similarity (MVS) algorithm. A matrix is generated using MVS, and by building this matrix the similarity between documents can be identified. Using multiple viewpoints, a more informative assessment of similarity can be achieved; theoretical analysis and an empirical study are conducted to support this claim. Two criterion functions, IR and IV, are proposed for document clustering based on this new measure. Comparisons are made with several well-known clustering algorithms that use other popular similarity measures on various document collections to verify the quality of the resulting clusters.
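One common reading of the multi-viewpoint idea is that the similarity of two documents is averaged over viewpoints taken from documents outside their cluster, using the dot product of the difference vectors from each viewpoint. The sketch below follows that reading; it is an assumption-laden illustration (the name `mvs_similarity` and the example vectors are invented here), not the paper's IR/IV criterion functions.

```python
def mvs_similarity(di, dj, viewpoints):
    """Simplified multi-viewpoint similarity: average, over each outside
    document dh acting as a viewpoint, the dot product of the difference
    vectors (di - dh) and (dj - dh)."""
    def sub(u, v):
        return [a - b for a, b in zip(u, v)]
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    total = sum(dot(sub(di, dh), sub(dj, dh)) for dh in viewpoints)
    return total / len(viewpoints)

# Two nearby documents judged from one viewpoint elsewhere in the space.
score = mvs_similarity([1.0, 0.0], [0.9, 0.1], [[0.0, 1.0]])
```

Intuitively, two documents that point in the same direction when seen from an outside viewpoint receive a high score, while documents on opposite sides of the viewpoint receive a low or negative one.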
We propose a new method to group documents into clusters. Cosine similarity is used to find the dissimilar document objects in a cluster, and the similarity measures depend on text mining. A multi-viewpoint algorithm moves a dissimilar document object from one cluster to another. Correlation similarity measures the similarity between the dissimilar document object and the document objects of the other cluster groups, while multi-viewpoint-based similarity calculation is used for measuring the similarity between data objects. With the proposed similarity measure, the Hierarchical Agglomerative Clustering (HAC) algorithm is implemented to form the document groups. From the clustered objects, document retrieval can be done based on a query: the query is preprocessed and then matched against the documents in the clusters. Ranking is provided for the clusters with respect to the query-matching result, so the cluster most relevant to the query is retrieved with this approach.
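The correlation similarity the proposed method relies on can plausibly be read as Pearson correlation between document weight vectors; the following is a minimal sketch under that assumption (the function name `correlation_similarity` is illustrative, not from the paper).

```python
import math

def correlation_similarity(u, v):
    """Pearson correlation between two document weight vectors:
    center each vector on its mean, then take the cosine of the
    centered vectors. Ranges from -1 to 1."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    du = [a - mu for a in u]
    dv = [b - mv for b in v]
    num = sum(a * b for a, b in zip(du, dv))
    den = (math.sqrt(sum(a * a for a in du)) *
           math.sqrt(sum(b * b for b in dv)))
    return num / den if den else 0.0
```

Unlike plain cosine similarity, correlation is invariant to adding a constant to every component of a vector, since each vector is centered before comparison.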
HAC stands for Hierarchical Agglomerative Clustering, an algorithm that forms a tree-like structure. It is a hierarchical cluster analysis method that follows a “bottom-up” approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
Single-link
Similarity of the closest pair of points, i.e. the most cosine-similar.
Complete-link
Similarity of the “furthest” points, i.e. the least cosine-similar.
Centroid
Similarity of the clusters whose centroids (centers of gravity) are the most cosine-similar.
Average-link
Average cosine similarity between all pairs of elements.
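The “bottom-up” merging and the average-link criterion above can be sketched as follows. The function name `hac`, the toy vectors, and the `target_clusters` stopping condition are illustrative choices made here, not the paper's implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    den = (math.sqrt(sum(a * a for a in u)) *
           math.sqrt(sum(b * b for b in v)))
    return dot / den if den else 0.0

def hac(vectors, sim, target_clusters=1):
    """Bottom-up agglomerative clustering with average-link similarity:
    start with singletons and repeatedly merge the most similar pair of
    clusters until target_clusters remain."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > target_clusters:
        best, pair = None, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average-link: mean pairwise similarity between clusters.
                s = sum(sim(vectors[i], vectors[j])
                        for i in clusters[a] for j in clusters[b])
                s /= len(clusters[a]) * len(clusters[b])
                if best is None or s > best:
                    best, pair = s, (a, b)
        a, b = pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Two tight groups of direction-similar vectors.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
result = hac(vectors, cosine, target_clusters=2)
```

Swapping the average-link computation for a max (single-link) or min (complete-link) over the pairwise similarities yields the other linkage criteria listed above.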
IV. CONCLUSION
In this work, new techniques, HAC and correlation similarity, are used on any type of text document to display the most relevant documents of the clusters. The correlation similarity measure and the HAC algorithm make similarity computation and document retrieval more accurate than cosine similarity with the MVS algorithm. Cluster weight is calculated using the TF-IDF weighting approach, and a document ranking method is used to rank the documents within each cluster. In this work, a study is made of the domain knowledge, and a literature survey is conducted in the area of clustering techniques and algorithms. The design of the proposed system is prepared to solve the problems in the existing system.