
International Journal of Computer Science and Information Security (IJCSIS),

Vol. 14, No. 3, March 2016

Clustering of Hub and Authority Web Documents for Information Retrieval

Kavita Kanathey, Computer Science, Barkatullah University, Bhopal, MP, India
R. S. Thakur, Department of Computer Applications, Maulana Azad National Institute of Technology (MANIT), Bhopal, MP, India
Shailesh Jaloree, Department of Applied Mathematics, SATI, Vidisha, MP, India

Abstract— Due to the exponential growth of the World Wide Web (or simply the Web), finding and ranking relevant web documents has become an extremely challenging task. When a user tries to retrieve relevant, high-quality information from the Web, the ranking of the search results for the query plays an important role. Ranking provides an ordered list of web documents so that users can easily navigate through the search results and find the information they need. To rank these web documents, many ranking algorithms (PageRank, HITS, Weighted PageRank) have been proposed, based upon factors such as citation analysis, content similarity and annotations. However, the ranking mechanism of these algorithms presents the user with a set of non-classified web documents for a query. In this paper, we propose a link-based clustering approach to cluster search results returned from a link-based web search engine. By filtering out some irrelevant pages, our approach classifies relevant web pages into most relevant, relevant and irrelevant groups to facilitate users' accessing and browsing. To increase relevancy accuracy, the K-Means clustering algorithm is used. Preliminary evaluations are conducted to examine its effectiveness. The results show that clustering web search results through link analysis is promising. This paper also outlines various page ranking algorithms.

Keywords— World Wide Web, search engine, information retrieval, PageRank, HITS, Weighted PageRank, link analysis.

I. INTRODUCTION

The World Wide Web is a famous and interactive way to disseminate information nowadays. The Web is the largest information repository for knowledge reference: a huge, semi-structured, dynamic, heterogeneous and broadly distributed global information service center [5]. Finding the most relevant, highest-quality web pages for users' queries has become increasingly difficult. Researchers have observed that most of the web documents collected by a web spider are not relevant to the user's query. This inconveniences the user, who must filter the irrelevant information out of the search results, wasting time. For these reasons, a cluster search engine provides a way to find information by returning a set of classified web pages.

An important class of search engines that offer search results based on hypertext links between sites can be termed Link-Based Search Engines. Rather than ranking results by keywords or by the content of the web documents, sites are ranked based on the quality and quantity of the other web sites linked to them. In this system, the user submits a query to the meta-search engine, which searches for results relevant to the query. The set of results retrieved from the web search engine is formed into a meta-directory tree. This tree structure helps the user retrieve information with high relevancy.

The relevancy of a web page can be obtained by considering the number of in-links and out-links present in that page. When a web page has many out-links to relevant pages, it can be considered a central page. All other web pages are compared with this central page for similarity, and the most similar pages are grouped together. This grouping of the most similar pages is known as clustering. Clustering can be done with different algorithms, such as hierarchical, K-Means and partitioning methods.

The simplest unsupervised learning algorithm that solves the clustering problem is the K-Means algorithm. It is a simple and easy way to classify a given data set into a certain number of clusters. When documents are clustered [9] using the K-Means algorithm, each cluster contains more similar documents, which increases the relevancy rate of the search results. When a user issues a query after this clustering process, they get only the most relevant cluster matching the request and none of the irrelevant pages. This increases the efficiency of the search results and reduces computational time and search space.

The paper is organized as follows. Section II is an assessment of previous related work on link analysis and clustering in the web domain. In Section III, we describe the existing system. Subsequently, in Section IV, we describe our proposed approach in detail. In Section V, we conclude the paper with some discussion.


II. RELATED WORK

In order to retrieve more relevant documents, various link analysis algorithms have been proposed. Three important algorithms, PageRank, Weighted PageRank and Hypertext Induced Topic Search (HITS), are discussed below in detail and compared.

A. PageRank Algorithm

PageRank [1] is the link analysis algorithm that was developed by S. Brin and L. Page during their Ph.D. at Stanford University, based on citation analysis. The algorithm is used by the famous search engine Google. PageRank applies citation analysis to web search by treating incoming links as citations to web pages. It is based on the idea that if "important" pages link to a page, then that page's links to other pages are also to be considered "important". PageRank considers the back links of a page in deciding its rank score: if the sum of the ranks of its back links is large, the page is given a large rank. PageRank thus provides a more refined way to compute the importance or relevance of a web page than simply counting the number of pages linking to it. A back link coming from an important page is given a higher weighting than back links coming from unimportant pages. Put simply, a link from one page to another may be considered a vote; however, not only the number of votes a page receives is considered important, but also the importance or relevance of the pages that cast those votes.

Assume an arbitrary page A has pages T1 to Tn pointing to it (in-links). PageRank can be calculated by the following Eq. (1):

PR(A) = (1-d) + d[PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)]    (1)

where PR(A) is the PageRank of page A; PR(Ti), for i = 1...n, is the PageRank of page Ti which links to page A; C(Ti), for i = 1...n, is the number of outbound links on page Ti; and d is a damping factor, usually set to 0.85.

Consider a small web consisting of three web pages P, Q and R as shown in Fig. 1.

Figure 1: Hyperlink structure of the web pages (P links to Q and R; Q links to R; R links to P and Q)

The PageRank for pages P, Q and R is calculated manually using Eq. (1). Let us assume an initial PageRank of 1.0 and set the damping factor d to 0.85:

PR(P) = (1-d) + d[PR(R)/C(R)]
      = (1-0.85) + 0.85(1/2)
      = 0.15 + 0.425
      = 0.575                                   (1a)
PR(Q) = (1-d) + d[PR(P)/C(P) + PR(R)/C(R)]
      = (1-0.85) + 0.85[0.575/2 + 1/2]
      = 0.819                                   (1b)
PR(R) = (1-d) + d[PR(P)/C(P) + PR(Q)/C(Q)]
      = (1-0.85) + 0.85[0.575/2 + 0.819/1]
      = 1.091                                   (1c)

Do the second iteration by taking the PageRank values from (1a), (1b) and (1c):

PR(P) = (1-d) + d[PR(R)/C(R)]
      = 0.15 + 0.85[1.091/2]
      = 0.614                                   (2a)
PR(Q) = (1-d) + d[PR(P)/C(P) + PR(R)/C(R)]
      = 0.15 + 0.85[0.614/2 + 1.091/2]
      = 0.875                                   (2b)
PR(R) = (1-d) + d[PR(P)/C(P) + PR(Q)/C(Q)]
      = 0.15 + 0.85[0.614/2 + 0.875/1]
      = 1.155                                   (2c)

Do the third iteration by taking the PageRank values from (2a), (2b) and (2c):

PR(P) = (1-d) + d[PR(R)/C(R)]
      = 0.15 + 0.85[1.155/2]
      = 0.641                                   (3a)
PR(Q) = (1-d) + d[PR(P)/C(P) + PR(R)/C(R)]
      = 0.15 + 0.85[0.641/2 + 1.155/2]
      = 0.913                                   (3b)
PR(R) = (1-d) + d[PR(P)/C(P) + PR(Q)/C(Q)]
      = 0.15 + 0.85[0.641/2 + 0.913/1]
      = 1.198                                   (3c)

After many more iterations of this calculation, the PageRank values arrive at those shown in Table 1.

Table 1: Iterative calculation of PageRank

Iteration   PR(P)   PR(Q)   PR(R)
0           1.000   1.000   1.000
1           0.575   0.819   1.091
2           0.614   0.875   1.155
3           0.641   0.913   1.198
...         ...     ...     ...
15          0.701   0.999   1.297
16          0.701   0.999   1.297

For a small set of pages the computation is easy, but for a Web of billions of pages it becomes far more demanding, so the link analysis itself becomes the central concern in PageRank. As Table 1 shows, PR(R) > PR(Q) > PR(P). After iteration 15 the values stabilize: the PageRank has converged to within a reasonable tolerance.
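The iteration above is easy to script. The following is a minimal sketch (our illustration, not code from the paper) of the same computation on the Fig. 1 graph; it updates the pages in place, exactly as the hand calculation does, and converges to the values of Table 1. The graph encoding and names are our own.

```python
# Minimal sketch of the iterative PageRank computation of Eq. (1) on the
# three-page web of Fig. 1, with in-place updates as in the hand calculation.

def pagerank(out_links, d=0.85, iterations=16):
    """Iterate PR(A) = (1-d) + d * sum(PR(T)/C(T)) over the in-links T of A."""
    pages = list(out_links)
    pr = {p: 1.0 for p in pages}   # initial PageRank of 1.0, as in the text
    for _ in range(iterations):
        for page in pages:
            # back links of `page`: every page T whose out-links include it
            backlinks = [t for t in pages if page in out_links[t]]
            pr[page] = (1 - d) + d * sum(pr[t] / len(out_links[t]) for t in backlinks)
    return pr

# P links to Q and R; Q links to R; R links to P and Q (Fig. 1)
print(pagerank({"P": ["Q", "R"], "Q": ["R"], "R": ["P", "Q"]}))
# -> roughly {'P': 0.701, 'Q': 1.000, 'R': 1.298}, matching Table 1
```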


B. Weighted PageRank Algorithm

The Weighted PageRank algorithm (WPR) [4] was proposed by Wenpu Xing and Ali Ghorbani as an extension of the original PageRank algorithm. It assigns larger rank values to more important (popular) pages instead of dividing the rank value of a page evenly among its outlink pages: each outlink page gets a value proportional to its popularity, measured by its numbers of inlinks and outlinks. The popularity contributions from inlinks and outlinks are recorded as Win(v,u) and Wout(v,u), respectively.

Win(v,u) is the weight of link (v,u), calculated from the number of inlinks of page u and the number of inlinks of all reference pages of page v:

Win(v,u) = Iu / Σp∈R(v) Ip                      (1)

where Iu and Ip represent the number of inlinks of page u and page p, respectively, and R(v) denotes the reference page list of page v.

Wout(v,u) is the weight of link (v,u), calculated from the number of outlinks of page u and the number of outlinks of all reference pages of page v:

Wout(v,u) = Ou / Σp∈R(v) Op                     (2)

where Ou and Op represent the number of outlinks of page u and page p, respectively.

Considering the importance of pages, the original PageRank formula is modified as

WPR(u) = (1-d) + d Σv∈B(u) WPR(v) Win(v,u) Wout(v,u)      (3)

Using the same hyperlink structure as shown in Fig. 1, the WPR equations for pages P, Q and R are as follows:

WPR(P) = (1-d) + d[WPR(R) Win(R,P) Wout(R,P)]                               (1a)
WPR(Q) = (1-d) + d[WPR(P) Win(P,Q) Wout(P,Q) + WPR(R) Win(R,Q) Wout(R,Q)]   (1b)
WPR(R) = (1-d) + d[WPR(P) Win(P,R) Wout(P,R) + WPR(Q) Win(Q,R) Wout(Q,R)]   (1c)

Let us again assume an initial rank of 1.0 and set the damping factor d to 0.85. The inlink and outlink weights for page P are calculated as follows:

Win(R,P)  = IP / (IP + IQ) = 1/(1+2) = 1/3      (1.1a)
Wout(R,P) = OP / (OP + OQ) = 2/(2+1) = 2/3      (1.1b)

Substituting the values of (1.1a) and (1.1b) in (1a) gives the WPR for page P:

WPR(P) = 0.15 + 0.85[1 × 1/3 × 2/3] = 0.338     (2a)

The inlink and outlink weights for page Q are calculated as follows:

Win(P,Q)  = IQ / (IQ + IR) = 2/(2+2) = 1/2      (2.1a)
Wout(P,Q) = OQ / (OQ + OR) = 1/(1+2) = 1/3      (2.1b)
Win(R,Q)  = IQ / (IQ + IP) = 2/(2+1) = 2/3      (2.1c)
Wout(R,Q) = OQ / (OQ + OP) = 1/(1+2) = 1/3      (2.1d)

Substituting the values of (2.1a)-(2.1d) in (1b) gives the WPR for page Q:

WPR(Q) = 0.15 + 0.85[0.338 × 1/2 × 1/3 + 1 × 2/3 × 1/3] = 0.386     (2b)

The inlink and outlink weights for page R are calculated as follows:

Win(P,R)  = IR / (IR + IQ) = 2/(2+2) = 1/2      (3.1a)
Wout(P,R) = OR / (OQ + OR) = 2/(1+2) = 2/3      (3.1b)
Win(Q,R)  = IR / (IR + IP) = 2/(2+1) = 2/3      (3.1c)
Wout(Q,R) = OR / (OP + OR) = 2/(2+2) = 1/2      (3.1d)

Substituting these values in (1c) gives the WPR for page R:

WPR(R) = 0.15 + 0.85[0.338 × 1/2 × 2/3 + 0.386 × 2/3 × 1/2] ≈ 0.354     (2c)

After many more iterations of this calculation, the Weighted PageRank values arrive at those shown in Table 2.

Table 2: Iterative calculation of Weighted PageRank

Iteration   WPR(P)   WPR(Q)   WPR(R)
0           1.000    1.000    1.000
1           0.338    0.386    0.354
2           0.217    0.248    0.282
3           0.203    0.232    0.273
4           0.201    0.231    0.272
5           0.201    0.230    0.272
6           0.201    0.230    0.272

As Table 2 shows, WPR(R) > WPR(Q) > WPR(P), and the ranking stabilizes in fewer iterations than plain PageRank.
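As a companion to the hand calculation, here is a minimal Python sketch (ours, not the authors' code) of Eqs. (1)-(3) on the Fig. 1 graph; it reproduces the trajectory of Table 2.

```python
# Minimal sketch of Weighted PageRank (Eqs. 1-3) on the three-page web of
# Fig. 1, with in-place updates as in the hand calculation. Names are ours.

def weighted_pagerank(out_links, d=0.85, iterations=6):
    pages = list(out_links)
    I = {p: sum(p in out_links[v] for v in pages) for p in pages}  # inlink counts
    O = {p: len(out_links[p]) for p in pages}                      # outlink counts

    def w_in(v, u):   # Eq. (1): Iu over the inlinks of all reference pages of v
        return I[u] / sum(I[p] for p in out_links[v])

    def w_out(v, u):  # Eq. (2): Ou over the outlinks of all reference pages of v
        return O[u] / sum(O[p] for p in out_links[v])

    wpr = {p: 1.0 for p in pages}
    for _ in range(iterations):
        for u in pages:  # Eq. (3), summed over the pages v that link to u
            wpr[u] = (1 - d) + d * sum(wpr[v] * w_in(v, u) * w_out(v, u)
                                       for v in pages if u in out_links[v])
    return wpr

print(weighted_pagerank({"P": ["Q", "R"], "Q": ["R"], "R": ["P", "Q"]}))
# -> about {'P': 0.201, 'Q': 0.230, 'R': 0.272}, as in Table 2
```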
C. Hypertext Induced Topic Search (HITS) Algorithm

The HITS algorithm was proposed by Kleinberg in 1999. Kleinberg identifies two different forms of web pages, called hubs and authorities. Authorities are pages having important content. Hubs are pages that act as resource lists, guiding users to authorities. Thus, a good hub page for a subject points to many authoritative pages on that subject, and a good authority page is pointed to by many good hub pages on the same subject. Hubs and authorities and their calculation are shown in Fig. 2. Kleinberg notes that a page may be a good hub and a good authority at the same time. This circular relationship leads to the definition of an iterative algorithm called Hyperlink Induced Topic Search (HITS) [6].

The HITS algorithm treats the WWW as a directed graph G(V,E), where V is a set of vertices representing pages and E is a set of edges that correspond to links.

Figure 2: Hubs and authorities (hub pages on the left point to authority pages on the right)


Since a good authority is pointed to by many good hubs and a good hub points to many good authorities, this mutually reinforcing relationship can be represented as:

xp = Σ q:(q,p)∈E yq                             (1)
yp = Σ q:(p,q)∈E xq                             (2)

where xp is the authority weight of web document p, yp is its hub weight, and E is the set of links (edges). By iteratively updating the authority and hub weights of every web document using Eqs. (1) and (2), and sorting the web documents in decreasing order of their authority and hub weights respectively, we obtain the authorities and hubs of the topic.
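A minimal Python sketch of this iteration follows (ours; the paper gives no code). The per-round normalization, which keeps the weights bounded, follows Kleinberg's formulation; the paper itself leaves it implicit. The graph and names are illustrative.

```python
# Minimal sketch of the HITS update rules of Eqs. (1) and (2) on the Fig. 1
# graph. The sum-of-squares normalization per round is Kleinberg's; the paper
# leaves it implicit.

from math import sqrt

def hits(out_links, iterations=20):
    pages = list(out_links)
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Eq. (1): x_p is the sum of hub weights y_q over edges (q, p)
        auth = {p: sum(hub[q] for q in pages if p in out_links[q]) for p in pages}
        # Eq. (2): y_p is the sum of authority weights x_q over edges (p, q)
        hub = {p: sum(auth[q] for q in out_links[p]) for p in pages}
        for scores in (auth, hub):  # normalize so the weights stay bounded
            norm = sqrt(sum(s * s for s in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return auth, hub

auth, hub = hits({"P": ["Q", "R"], "Q": ["R"], "R": ["P", "Q"]})
print(max(auth, key=auth.get), max(hub, key=hub.get))  # top authority, top hub
```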
III. EXISTING SYSTEM

Normally, a web search engine receives a query from the user and returns a list of web documents. The search results may be displayed based on content similarity, relevancy of keywords, hyperlink structure and web server logs. Conventional search engines provide users with a list of non-classified web documents ordered by their ranking algorithm. However, these search results are sometimes far from the user's satisfaction.

To provide more relevant web documents to users, an Intelligent Cluster Search Engine (ICSE) [8] was developed. This system gives the user a set of taxonomic web pages in response to a query and filters out the irrelevant pages. Fig. 3 shows the process of ICSE. The user's query is given to the meta-search engine, and a clustered document set is then created based on the given knowledge base and the clustering algorithm of ICSE. The CA-ICSE [8] algorithm is used to cluster the web pages, which increases the relevancy of search results and reduces computation time. The algorithm executes in two steps: compute the similarity, then cluster the pages based on similarity. The ICSE system consists of four modules: meta-search engine, meta-directory tree, web pages clustering and topic generation [8].

• Meta-search engine
This module uses information extraction technology to parse the web pages and analyze the HTML tags. A stemmer is used to discard common morphological and inflectional endings and a stop-word list to discard worthless words, after which the web pages are converted to a unified format (a sketch of this preprocessing is given at the end of this section).

• Meta-directory tree
In order to cluster the returned web pages rapidly, ICSE proposes a novel clustering algorithm which uses the meta-directory tree as its knowledge base, reducing the computation time required for clustering and enhancing the quality of the clustering results.

• Web pages clustering
Traditional clustering and classification technologies classify data without a knowledge base and take a lot of computation time to produce classified results. To avoid this problem, ICSE uses a directory-tree approach which can not only cluster the web pages quickly but also assign a meaningful label to each group of classified results.

• Topic generation
This module assumes that the words at the beginning and at the end of a web page are more important than those in the middle.

Figure 3: Design of Intelligent Cluster Search Engine (the user's query goes to the meta-search engine, which retrieves web pages; the meta-directory tree and topic generation modules cluster and label them; the result is returned to the user)
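The meta-search engine's preprocessing can be pictured with a short sketch. This is our illustration only: the paper names neither its stemmer nor its stop-word list, so the suffix rules and STOP_WORDS set below are stand-ins.

```python
# Illustrative sketch of the meta-search engine module's preprocessing:
# strip HTML tags, drop stop words, crudely stem what remains. The suffix
# rules and stop-word set are stand-ins; the paper does not specify them.

import re

STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "for", "on"}
SUFFIXES = ("ing", "ed", "ly", "s")  # naive stand-in for a real stemmer

def stem(token):
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(html):
    text = re.sub(r"<[^>]+>", " ", html)          # discard the HTML tags
    tokens = re.findall(r"[a-z]+", text.lower())  # unify to lowercase words
    return [stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("<p>Clustering of linked web pages</p>"))
# -> ['cluster', 'link', 'web', 'page']
```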
IV. PROPOSED SYSTEM

In the proposed system, the K-Means clustering algorithm is used for information retrieval. K-Means clustering is more efficient both at improving the relevancy rate of search results and at saving computation time. The relevancy rate of CA-ICSE suffers because its similarity check between documents uses TF-IDF, which depends only on the contents: only the number of occurrences of a given word is compared in each document. A given word may have a very low occurrence frequency in some documents and a very high occurrence frequency in others, so the ranked display may give less similar documents the highest priority and more similar documents the least priority.
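To make the critique concrete, the following snippet (ours, using scikit-learn on a made-up corpus) scores document similarity purely from word-occurrence statistics, which is exactly the content-only comparison described above; link structure plays no part.

```python
# Content-only TF-IDF similarity, the comparison CA-ICSE relies on: documents
# are scored by shared word statistics alone, ignoring hyperlink structure.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "web page ranking with link analysis",
    "ranking web pages by analyzing links",
    "recipes for baking bread at home",
]
tfidf = TfidfVectorizer().fit_transform(docs)
print(cosine_similarity(tfidf[0], tfidf[1])[0, 0])  # > 0: shares 'web', 'ranking'
print(cosine_similarity(tfidf[0], tfidf[2])[0, 0])  # 0.0: no shared terms
```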
Least similar documents shown with high priority lead to user dissatisfaction, so the relevancy rate of the documents must be increased to satisfy the users' needs. An efficient way to improve the relevancy rate is the K-Means clustering algorithm [8].

In the proposed system, the K-Means clustering algorithm groups the hub and authority web documents based on a threshold given to each cluster and a similarity measure. Based on each cluster's threshold value, documents are selected or discarded. After the clustering process, when a user issues a query, only the cluster with the highest threshold is displayed to the user. This increases the relevancy rate and reduces the search space and processing time.

Fig. 4 shows the design of the proposed work. The proposed system operates as follows (a sketch of the clustering step follows this list):

• The user enters a query on the search engine interface.
• Hub and authority documents are retrieved for the query.
• A threshold value is decided, and the similarity of each web document is computed for relevancy by considering the weights of the attributes in a data object.
• Once the weights are calculated, a threshold value is assigned to each cluster. According to the threshold values, the documents are grouped into most relevant, relevant and irrelevant clusters: a document whose weight matches a cluster's centroid is assigned to that cluster, and those that do not fit are discarded from it. The process is repeated until all the obtained results are clustered.
• The IR system then returns only the most relevant documents to the user for the query.
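The clustering step can be sketched as follows. This is our reading of the description above, not the authors' code: each document is reduced to a link-based feature vector (here, simply its hub and authority scores), K-Means with k = 3 forms the most relevant / relevant / irrelevant groups, and the cluster whose centroid scores highest is returned. The feature choice and the centroid-norm ranking are illustrative assumptions.

```python
# Sketch of the proposed clustering step under our assumptions: (hub, authority)
# feature vectors, K-Means with k=3, and the largest-centroid cluster as our
# stand-in for the "cluster with the highest threshold".

import numpy as np
from sklearn.cluster import KMeans

def most_relevant_cluster(docs, hub_auth_features, k=3, seed=0):
    X = np.asarray(hub_auth_features, dtype=float)
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
    best = int(np.argmax(np.linalg.norm(km.cluster_centers_, axis=1)))
    return [doc for doc, label in zip(docs, km.labels_) if label == best]

docs = ["d1", "d2", "d3", "d4", "d5", "d6"]
features = [(0.90, 0.85), (0.85, 0.90),   # strong hubs/authorities
            (0.45, 0.40), (0.40, 0.50),   # middling pages
            (0.10, 0.05), (0.05, 0.10)]   # weakly linked pages
print(most_relevant_cluster(docs, features))  # -> ['d1', 'd2']
```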
Figure 4: Design of the proposed system (the search engine retrieves hub and authority web pages for the user's query; the similarity of the pages is measured; the K-Means clustering algorithm groups the search results; and the IR system returns the most relevant documents, separating the relevant from the irrelevant ones)

V. CONCLUSION

In this paper, an approach for clustering hub and authority web documents has been proposed in which the similarity between documents is compared by considering the attribute properties of a data object (web document), not just the contents of a document. All the documents are compared, and the resultant clusters are formed using the K-Means clustering algorithm, which significantly improves the relevancy rate and reduces processing time and search space.

REFERENCES

[1] S. Brin and L. Page, "The anatomy of a large-scale hypertextual Web search engine", Computer Networks and ISDN Systems, 30(1-7):107-117, 1998.
[2] Preeti Chopra and Md. Ataullah, "A Survey on Improving the Efficiency of Different Web Structure Mining Algorithms", International Journal of Engineering and Advanced Technology (IJEAT), ISSN: 2249-8958, Volume 2, Issue 3, February 2013.
[3] Laxmi Choudhary and Bhawani Shankar Burdak, "Role of Ranking Algorithms for Information Retrieval", International Journal of Artificial Intelligence & Applications (IJAIA), Vol. 3, No. 4, pages 203-220, July 2012.
[4] Wenpu Xing and Ali Ghorbani, "Weighted PageRank Algorithm", Proc. of the Second Annual Conference on Communication Networks and Services Research, IEEE, 2004.
[5] Raymond Kosala and Hendrik Blockeel, "Web Mining Research: A Survey", ACM SIGKDD Explorations Newsletter, Volume 2, June 2000.
[6] J. M. Kleinberg, "Authoritative sources in a hyperlinked environment", Journal of the ACM, 46(5):604-632, September 1999.
[7] Dushyant Rathod, "A Review on Web Mining", International Journal of Engineering Research and Technology (IJERT), Vol. 1, Issue 2, pages 21-25, 2012.
[8] M. Sathya, J. Jayanthi and N. Basker, "Link Based K-Means Clustering Algorithm for Information Retrieval", International Conference on Recent Trends in Information Technology (ICRTIT), IEEE, MIT, Anna University, Chennai, June 3-5, 2011.
[9] M. Steinbach, G. Karypis and V. Kumar, "A Comparison of Document Clustering Techniques", Proc. KDD-2000 Workshop on Text Mining, Aug. 2000.
[10] Yitong Wang and Masaru Kitsuregawa, "Use Link-based Clustering to Improve Web Search Results", IEEE, 2002.