
ISSN (Online): 2320-9801

ISSN (Print): 2320-9798

International Journal of Innovative Research in Computer and Communication Engineering

(An ISO 3297: 2007 Certified Organization) Vol.2, Special Issue 1, March 2014

Proceedings of International Conference On Global Innovations In Computing Technology (ICGICT’14)


Organized by
Department of CSE, JayShriram Group of Institutions, Tirupur, Tamilnadu, India on 6th & 7th March 2014

An Efficient Clustering Method To Find Similarity Between The Documents
Kalaivendhan.K1, Sumathi.P2
PG Student, Dept. of CSE, KSR Institute for Engineering and Technology, Tiruchengode, Namakkal, TamilNadu1
Assistant Professor, Dept. of CSE, KSR Institute for Engineering and Technology, Tiruchengode, Namakkal, TamilNadu2

ABSTRACT: Data mining is the process of extracting, or mining, knowledge from large amounts of data. Clustering is a data mining technique used to group similar data items. The TF-IDF approach is used to calculate the weight of each cluster, and a ranking method is used to rank the documents within a cluster. In existing work, the similarity between pairs of objects is measured from a single viewpoint or from multiple viewpoints, with the values calculated using cosine similarity. In contrast, the proposed method is based on correlation similarity and uses the HAC algorithm for clustering the documents. Correlation similarity is used to calculate the similarity between every pair of documents in a cluster. The HAC algorithm groups clusters level by level, and a ranking technique ranks each cluster according to the content of its documents; finally, the most relevant data is grouped and the cluster results are displayed.

I. INTRODUCTION

In recent years, an increasing number of data sets have become available. Data mining is the practice of automatically searching large stores of data to discover patterns and trends that go beyond simple analysis. It uses sophisticated mathematical algorithms to segment the data and evaluate the probability of future events, and is also known as Knowledge Discovery in Data. Data mining, the extraction of hidden predictive information from large databases, is a powerful technology with great potential to help companies focus on the most important information in their data warehouses. Data mining tools predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. The automated, prospective analyses offered by data mining move beyond the analyses of past events provided by the retrospective tools typical of decision support systems.

1.1 Clustering

Clustering is one of the most interesting and important topics in data mining. The aim of clustering is to find intrinsic structures in data and organize them into meaningful subgroups for further study and analysis. Many clustering algorithms are published every year.

They are proposed for very distinct research fields and are developed using quite different techniques and approaches. Nevertheless, according to a recent study, more than half a century after it was introduced, the simple k-means algorithm still remains one of the best data mining algorithms. It is the most frequently used partitional clustering algorithm in practice, and another recent scientific discussion states that k-means is the algorithm that practitioners in the related fields most often choose to use.

Copyright @ IJIRCCE www.ijircce.com 2532


k-means has more than a few basic drawbacks, such as sensitivity to initialization and to cluster size, and its performance can be worse than that of other state-of-the-art algorithms in many domains. In spite of this, its simplicity, understandability, and scalability are the reasons for its tremendous popularity. An algorithm with adequate performance and usability in most application scenarios can be preferable to one with better performance in some cases but limited usage due to high complexity. While offering reasonable results, k-means is fast and easy to combine with other methods in larger systems.
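As a point of reference for the discussion above, the basic k-means loop can be sketched as follows (a minimal illustration only, not the implementation used in this work):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means: random initial centroids, then alternate between
    assigning each point to its nearest centroid and recomputing means."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Nearest centroid by squared Euclidean distance.
            idx = min(range(k),
                      key=lambda i: sum((a - b) ** 2
                                        for a, b in zip(p, centroids[i])))
            clusters[idx].append(p)
        # Recompute each centroid as the mean of its assigned points;
        # keep the old centroid if a cluster ends up empty.
        centroids = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl
                     else centroids[i]
                     for i, cl in enumerate(clusters)]
    return centroids, clusters
```

The sensitivity to initialization mentioned above shows up in the `seed`: different random starting centroids can yield different final clusters.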

II. EXISTING SYSTEM

The internal structure of the data is found and the data are organized into meaningful groups. The existing system greedily picks the next frequent item set for the next cluster, so the clustering result depends on the order in which the item sets are picked. Data from fields related to the k-means method are processed. Cosine similarity is used to find the dissimilar document objects in a cluster. The existing system applies a multi-viewpoint algorithm to move a dissimilar document object from one cluster to another; a second similarity measure compares the dissimilar document object with the document objects of the other cluster groups.

i. Cosine Similarity
ii. Incremental Mining

2.1 COSINE SIMILARITY


It is a measure of similarity between two vectors of an inner product space that measures the cosine of the angle between them. The cosine of 0° is 1, and it is less than 1 for any other angle. It is thus a judgement of orientation and not magnitude: two vectors with the same orientation have a cosine similarity of 1, two vectors at 90° have a similarity of 0, and two diametrically opposed vectors have a similarity of -1, independent of their magnitude. Cosine similarity is particularly used in positive space, where the outcome is neatly bounded in [0, 1].

Note that these bounds apply for any number of dimensions, and Cosine similarity is most commonly used in high-
dimensional positive spaces. For example, in Information Retrieval and text mining, each term is notionally assigned a
different dimension and a document is characterised by a vector where the value of each dimension corresponds to the
number of times that term appears in the document.

Cosine similarity gives a useful measure of how similar two documents are likely to be in terms of their subject matter. One of the reasons for its popularity is that it is very efficient to evaluate, especially for sparse vectors.
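For two term-count vectors as described above, the measure can be computed directly; a small self-contained sketch:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Two toy documents over a four-term vocabulary (raw term counts).
doc1 = [2, 1, 0, 1]
doc2 = [1, 1, 1, 0]
print(round(cosine_similarity(doc1, doc2), 3))  # -> 0.707
```

Because term counts are non-negative, the result stays in [0, 1], as noted above.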

2.2 INCREMENTAL MINING

This is the Multi-Viewpoint Similarity (MVS) algorithm. A similarity matrix is generated using MVS; by building this matrix, the similarity between documents can be identified. Using multiple viewpoints, a more informative assessment of similarity can be achieved, and theoretical analysis and an empirical study have been conducted to support this claim. Two criterion functions for document clustering, IR and IV, are proposed based on this new measure. Comparisons are made with several well-known clustering algorithms that use other popular similarity measures on various document collections to verify the similarity of the clusters.
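The multi-viewpoint idea can be sketched roughly as follows. This is an illustrative assumption, not the exact MVS formulation (which differs in weighting and in how viewpoint documents are selected): each outside document dh acts as an origin, and di and dj are compared via the cosine between their difference vectors from dh, averaged over all viewpoints.

```python
import math

def mvs_similarity(di, dj, viewpoints):
    """Average, over each outside document dh, of the cosine between the
    difference vectors (di - dh) and (dj - dh)."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0

    total = 0.0
    for dh in viewpoints:
        da = [x - h for x, h in zip(di, dh)]
        db = [y - h for y, h in zip(dj, dh)]
        total += cos(da, db)
    return total / len(viewpoints)
```

With the single viewpoint at the origin, this reduces to ordinary cosine similarity; additional viewpoints are what make the assessment "more informative".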



III. PROPOSED SYSTEM

We propose a new method to group the documents into clusters. Cosine similarity is used to find the dissimilar document objects in a cluster; the similarity measures depend on text mining. A multi-viewpoint algorithm moves a dissimilar document object from one cluster to another, and correlation similarity measures the similarity between the dissimilar document object and the document objects of the other cluster groups. Multi-viewpoint-based similarity calculation is used for measuring the similarity between data objects. With the proposed similarity measure, the Hierarchical Agglomerative Clustering (HAC) algorithm is implemented to form the document groups. From the clustered objects, document retrieval can be done based on a query: the query is preprocessed and then matched against the documents in the clusters. A rank is assigned to each cluster with respect to the query matching result, and the most relevant cluster for the query is retrieved with this approach.

3.1 CORRELATION SIMILARITY

Correlation similarity is the combination of:

(i) distance covariance
(ii) distance variance

Here, the similarity is computed between each pair of documents in the cluster, which improves the accuracy of the document cluster similarity.
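No formula is given here; reading "distance covariance combined with distance variance" as the standard distance correlation, dCor(x, y) = dCov(x, y) / sqrt(dVar(x) · dVar(y)), a plain-Python sketch might look like this (an assumption, applied to two feature vectors treated as 1-D samples):

```python
import math

def _double_center(x):
    """Double-centered pairwise distance matrix of a 1-D sample."""
    n = len(x)
    d = [[abs(x[i] - x[j]) for j in range(n)] for i in range(n)]
    row = [sum(r) / n for r in d]
    col = [sum(d[i][j] for i in range(n)) / n for j in range(n)]
    tot = sum(row) / n
    return [[d[i][j] - row[i] - col[j] + tot for j in range(n)]
            for i in range(n)]

def distance_correlation(x, y):
    """dCor(x, y) = dCov(x, y) / sqrt(dVar(x) * dVar(y))."""
    n = len(x)
    a, b = _double_center(x), _double_center(y)
    dcov2 = sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / n ** 2
    dvarx = sum(v * v for r in a for v in r) / n ** 2
    dvary = sum(v * v for r in b for v in r) / n ** 2
    if dvarx == 0 or dvary == 0:
        return 0.0
    return math.sqrt(dcov2 / math.sqrt(dvarx * dvary))
```

Unlike cosine similarity, distance correlation is 0 only when the samples are independent, which is one possible motivation for preferring it here.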

3.2 HAC ALGORITHM

HAC stands for Hierarchical Agglomerative Clustering, an algorithm that forms a tree-like structure. It is a hierarchical cluster analysis method that follows a "bottom-up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy. Common linkage criteria are:

 Single-link: similarity of the most cosine-similar pair of points
 Complete-link: similarity of the "furthest" points, i.e. the least cosine-similar pair
 Centroid: clusters whose centroids (centers of gravity) are the most cosine-similar
 Average-link: average cosine similarity between pairs of elements
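The bottom-up merging above can be sketched for the single-link case as follows (illustrative only; no pseudocode is given here, and the stopping threshold is an assumed parameter):

```python
def hac_single_link(items, sim, threshold):
    """Bottom-up clustering: start with singleton clusters and repeatedly
    merge the pair with the highest single-link similarity, stopping when
    no pair is more similar than `threshold`."""
    clusters = [[i] for i in range(len(items))]
    while len(clusters) > 1:
        best, bi, bj = None, 0, 1
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single-link: similarity of the closest cross-cluster pair.
                s = max(sim(items[a], items[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or s > best:
                    best, bi, bj = s, i, j
        if best < threshold:
            break
        clusters[bi] = clusters[bi] + clusters[bj]
        del clusters[bj]
    return clusters
```

Swapping `max` for `min` over the cross-cluster pairs gives complete-link; averaging gives average-link.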

IV. CONCLUSION

In this work, new techniques, HAC and correlation similarity, are applied to any type of text document to display the most relevant documents of the clusters. The correlation similarity measure and the HAC algorithm make similarity computation and document retrieval more accurate than cosine similarity and the MVS algorithm. Cluster weight is calculated using the weighted TF-IDF approach, and a document ranking method is used to rank the documents within each cluster. In this study, the domain knowledge is examined and a literature survey is conducted in the area of clustering techniques and algorithms. The design of the proposed system is prepared to solve the problems in the existing system.



REFERENCES

1. Duc Thang Nguyen and Chee Keong Chan (2012) 'Clustering with Multiviewpoint-Based Similarity Measure', IEEE Trans. Knowledge and Data Eng., vol. 24, no. 6.

2. Banerjee, A. and Sra, S. (2005) 'Clustering on the Unit Hypersphere Using Von Mises-Fisher Distributions', J. Machine Learning Research, vol. 6, pp. 1345-1382.

3. Charu Aggarwal and Cheng Xiang (2005) 'A Survey of Text Clustering Algorithms', Proc. SIAM Int'l Conf. Data Mining Workshop Clustering Algorithms and Its Applications.

4. Dhillon, I.S. (2001) 'Co-Clustering Documents and Words Using Bipartite Spectral Graph Partitioning', Proc. Seventh ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD), pp. 269-274.

5. Ding, C. and Simon, H. (2001) 'A Min-Max Cut Algorithm for Graph Partitioning and Data Clustering', Proc. IEEE Int'l Conf. Data Mining (ICDM), pp. 107-114.

6. Friedman, J. and Meulman, J. (2004) 'Clustering Objects on Subsets of Attributes', J. Royal Statistical Soc. Series B Statistical Methodology, vol. 66, no. 4, pp. 815-839.

7. Ghosh, J. and Zhong, S. (2003) 'A Comparative Study of Generative Models for Document Clustering', Proc. SIAM Int'l Conf. Data Mining Workshop Clustering High Dimensional Data and Its Applications.

8. Ienco, D. and Meo, R. (2009) 'Context-Based Distance Learning for Categorical Data Clustering', Proc. Eighth Int'l Symp. Intelligent Data Analysis (IDA), pp. 83-94.

9. Lakkaraju, P. and Speretta, M. (2008) 'Document Similarity Based on Concept Tree Distance', Proc. 19th ACM Conf. Hypertext and Hypermedia, pp. 127-132.

10. Leela Prasad, V. and Simmi Cintre, B. (2012) 'Analysis of Novel Multi-Viewpoint Similarity Measures', Int'l Journal of Engineering Research and Applications, ISSN: 2248-962, vol. 2, issue 4, pp. 409-420.

11. Merugu, S. and Ghosh, J. (2005) 'Clustering with Bregman Divergences', J. Machine Learning Research, vol. 6, pp. 1705-1749.

12. Modha, D. and Dhillon, I. (2001) 'Concept Decompositions for Large Sparse Text Data Using Clustering', Machine Learning, vol. 42, nos. 1/2, pp. 143-175.

13. Nowak, R.D. and Castro, R.M. (2011) 'Likelihood Based Hierarchical Clustering', IEEE Trans. Knowledge and Data Eng., vol. 20, no. 9, pp. 1217-1229.

14. Pelillo, M. (2009) 'What Is a Cluster? Perspectives from Game Theory', Proc. NIPS Workshop Clustering Theory.

15. Xu, W. and Gong, Y. (2003) 'Document Clustering Based on Non-Negative Matrix Factorization', Proc. 26th Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval, pp. 267-273.

16. Yi Wang and Yiping Ke (2008) 'A Model-Based Approach to Attributed Graph Clustering', Proc. Second Int'l Conf. Autonomous Agents (AGENTS '98), pp. 408-415.

17. Zarrinkalam, F. and Kahani, M. (2012) 'A New Metric for Measuring Relatedness of Scientific Papers Based on Non-Textual Features', Intelligent Information Management, vol. 4, pp. 99-107.

18. Zhao, Y. and Karypis, G. (2004) 'Empirical and Theoretical Comparisons of Selected Criterion Functions for Document Clustering', Machine Learning, vol. 55, no. 3, pp. 311-331.

