Clustering of Hub and Authority Web Docu
Abstract- Due to the exponential growth of the World Wide Web (or simply the Web), finding and ranking relevant web documents has become an extremely challenging task. When a user tries to retrieve relevant, high-quality information from the Web, the ranking of search results for the user's query plays an important role. Ranking provides an ordered list of web documents so that users can easily navigate through the search results and find the information they need. Many ranking algorithms (PageRank, HITS, Weighted PageRank) have been proposed to rank these web documents, based on factors such as citation analysis, content similarity and annotations. However, the ranking mechanism of these algorithms gives the user a set of unclassified web documents for a query. In this paper, we propose a link-based clustering approach to cluster the search results returned from a link-based web search engine. By filtering out some irrelevant pages, our approach classifies the retrieved web pages into most relevant, relevant and irrelevant groups to facilitate users' access and browsing. To increase relevancy accuracy, the K-Means clustering algorithm is used. Preliminary evaluations are conducted to examine its effectiveness. The results show that clustering web search results through link analysis is promising. This paper also outlines various page ranking algorithms.

Keywords - World Wide Web, search engine, information retrieval, PageRank, HITS, Weighted PageRank, link analysis.

I. INTRODUCTION

The World Wide Web is a popular and interactive way to disseminate information. The Web is the largest information repository for knowledge reference. It is a huge, semi-structured, dynamic, heterogeneous and broadly distributed global information service center [5]. Finding the highest-quality web pages relevant to a user's query is becoming increasingly difficult. Most of the web documents collected by a web spider are not relevant to the user's query. Filtering the irrelevant information out of these search results is inconvenient for the user and wastes time. For these reasons, a cluster search engine provides a way to find information by returning a set of classified web pages.

An important class of search engines, which offer search results based on the hypertext links between sites, can be termed link-based search engines. Rather than providing results based on keywords or the content of the web documents, they rank sites based on the quality and quantity of the other web sites linked to them. In this system, the user submits a query to the meta-search engine. The meta-search engine searches for results relevant to the user's query. The set of results retrieved from the web search engines is organized as a meta-directory tree. This tree structure helps the user to retrieve information with high relevancy.

The relevancy of a web page can be obtained by considering the number of in-links and out-links present in that page. When a web page has a large number of out-links to relevant pages, it can be considered a central page. All other web pages are compared with this central page for similarity, and the most similar pages are grouped together. This grouping of the most similar pages is known as clustering. Clustering can be done with different algorithms such as hierarchical, K-Means, partitioning, etc.

K-Means is one of the simplest unsupervised learning algorithms that solve the clustering problem. It is a simple and easy way to classify a given data set into a certain number of clusters.

When the documents are clustered [9] using the K-Means algorithm, each cluster contains more similar documents, which increases the relevancy rate of search results. When a user submits a query after this clustering process, they get only the most relevant cluster that matches the request, and none of the irrelevant pages. This increases the efficiency of search results and reduces computational time and search space.

The paper is organized as follows. Section II reviews previous related work on link analysis and clustering in the web domain. Section III describes the existing system. Section IV describes our proposed approach in detail. Section V concludes the paper with some discussion.
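As a rough illustration of the link counting and K-Means grouping described in this introduction, the following Python sketch counts in-links and out-links on a toy link graph and groups the pages into clusters. The toy graph, the use of (in-links, out-links) as the feature vector, the hand-picked initial centroids and all function names are assumptions made for illustration only; they are not taken from the proposed system.

```python
from collections import defaultdict


def link_counts(edges):
    """Return {page: (in_links, out_links)} for a list of (source, target) links."""
    counts = defaultdict(lambda: [0, 0])
    for src, dst in edges:
        counts[src][1] += 1  # src gains one out-link
        counts[dst][0] += 1  # dst gains one in-link
    return {page: tuple(c) for page, c in counts.items()}


def squared_distance(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, centroids, max_iter=100):
    """Plain K-Means (Lloyd's algorithm) on a dict {name: feature_tuple}.

    `centroids` holds the initial centroids; they are fixed by hand here so
    that the run is reproducible. Returns (assignment, centroids).
    """
    centroids = [tuple(c) for c in centroids]
    assignment = {}
    for _ in range(max_iter):
        # Assignment step: each page joins its nearest centroid.
        new_assignment = {
            name: min(range(len(centroids)),
                      key=lambda i: squared_distance(p, centroids[i]))
            for name, p in points.items()
        }
        if new_assignment == assignment:
            break  # no page changed cluster, so the algorithm has converged
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for i in range(len(centroids)):
            members = [points[n] for n, c in assignment.items() if c == i]
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return assignment, centroids


# Hypothetical toy link graph: (source page, target page).
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("e", "b"),
         ("e", "c"), ("f", "b"), ("d", "c"), ("d", "b")]
features = link_counts(edges)
assignment, centroids = k_means(features, centroids=[(0, 3), (3, 0)])
print(assignment)
```

On this toy graph the pages with mostly out-links (a, d, e, f) end up in one cluster and the pages with mostly in-links (b, c) in the other. A production system would more likely rely on a library implementation with randomized initialization; the hand-written loop is used here only to keep the two K-Means steps visible.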
https://sites.google.com/site/ijcsis/
ISSN 1947-5500
International Journal of Computer Science and Information Security (IJCSIS),
Vol. 14, No. 3, March 2016
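The HITS scheme [6], which supplies the hub and authority weights discussed next, can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the toy link graph, the fixed iteration count and the function names are hypothetical, and the L2 normalization after each round follows Kleinberg's formulation [6] rather than anything stated explicitly in this paper.

```python
import math


def normalize(weights):
    """Scale a {page: weight} dict to unit L2 norm (left unchanged if all zero)."""
    norm = math.sqrt(sum(v * v for v in weights.values())) or 1.0
    return {p: v / norm for p, v in weights.items()}


def hits(edges, iterations=50):
    """Iterate the hub/authority updates on a list of (source, target) links.

    Returns (authority, hub) dicts keyed by page.
    """
    pages = {p for edge in edges for p in edge}
    hub = {p: 1.0 for p in pages}
    authority = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: sum the hub weights of the pages linking to p.
        new_auth = {p: 0.0 for p in pages}
        for q, p in edges:
            new_auth[p] += hub[q]
        # Hub update: sum the authority weights of the pages that p links to.
        new_hub = {p: 0.0 for p in pages}
        for p, q in edges:
            new_hub[p] += new_auth[q]
        # Normalization (from [6]) keeps the weights bounded across rounds.
        authority = normalize(new_auth)
        hub = normalize(new_hub)
    return authority, hub


# Hypothetical toy graph: three hub-like pages pointing at two authorities.
edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h2", "a2"), ("h3", "a1")]
authority, hub = hits(edges)

# Sort pages in decreasing order of authority and hub weight respectively.
top_authorities = sorted(authority, key=authority.get, reverse=True)
top_hubs = sorted(hub, key=hub.get, reverse=True)
print(top_authorities[0], top_hubs[:2])
```

In this toy graph, a1 receives links from all three hubs and therefore obtains the largest authority weight, while h1 and h2, which point to both authorities, obtain equal hub weights that exceed that of h3.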
Since a good authority is pointed to by many good hubs and a good hub points to many good authorities, this mutually reinforcing relationship can be represented as:

$$x_p = \sum_{q:(q,p)\in E} y_q \qquad (1)$$

$$y_p = \sum_{q:(p,q)\in E} x_q \qquad (2)$$

where x_p is the authority weight of web document p, y_p is its hub weight, and E is the set of links (edges). By iteratively updating the authority and hub weights of every web document using Eq. (1) and (2), and sorting the web documents in decreasing order of their authority and hub weights respectively, we obtain the authorities and hubs of the topic.

III. EXISTING SYSTEM

Normally, a web search engine receives a query from the user and returns a list of web documents. The web search results may be ordered based on content similarity, keyword relevancy, hyperlink structure and web server logs. Conventional search engines provide users with a list of unclassified web documents based on their ranking algorithm. However, these search results are sometimes far from satisfying the user.

To provide more relevant web documents that satisfy users' needs, an Intelligent Cluster Search Engine (ICSE) [8] was developed. This system provides the user with a set of taxonomic web pages in response to a query and filters out the irrelevant pages. Fig. 3 shows the process of ICSE.

In this system, the user's query is given to the meta-search engine. Then the clustered document set is created based on the given knowledge base and the clustering algorithm of ICSE. The CA-ICSE [8] algorithm is used to cluster the web pages, which increases the relevancy of search results and reduces the computation time. This algorithm is executed in two steps: compute the similarity, then cluster the pages based on that similarity. The ICSE system consists of four modules: meta-search engine, meta-directory tree, web pages clustering and topic generation [8].

• Meta-search engine
This module uses information extraction technology to parse the web pages and analyze the HTML tags. A stemmer is used to discard common morphological and inflectional endings, and a stop-word list is used to discard uninformative words; the web pages are then converted to a unified format.

• Meta-directory tree
To cluster the returned web pages rapidly, ICSE uses a novel clustering algorithm that takes the meta-directory tree as its knowledge base, reducing the computation time required for clustering and enhancing the quality of the clustering results.

• Web pages clustering
Traditional clustering and classification technologies classify data without a knowledge base, so it takes a lot of computation time to find classified results. To avoid this problem, ICSE uses a directory-tree approach which can not only cluster the web pages quickly but also assign a meaningful label to each group of classified results.

• Topic generation
This module assumes that the words at the beginning and at the end of a web page are more important than those in the middle.

Figure 3: Design of Intelligent Cluster Search Engine (components shown: user's query, meta-search engine, web pages, meta-directory tree, topic generation, result returned to the user)

IV. PROPOSED SYSTEM

In the proposed system, the K-Means clustering algorithm is used for information retrieval. K-Means clustering is more efficient at improving the relevancy rate of search results and also saves computation time. The relevancy rate of CA-ICSE is reduced because its similarity check between documents uses TF-IDF, which depends only on the contents, i.e. only the number of occurrences of a given word is compared in each document. In some documents the given word may have a very low occurrence frequency, and in others a very high one. The documents are displayed in ranked order, so less similar documents may receive the highest priority while more similar documents receive the lowest.

Less similar documents with high priority may leave the user's needs unsatisfied, so the relevancy rate of the documents must be increased. An efficient way to improve the relevancy rate is to use the K-Means clustering algorithm [8].

In the proposed system, the K-Means clustering algorithm groups the hub and authority web documents based on the threshold given to each cluster and on a similarity measure. Based on the threshold value of each cluster, documents are selected and the others are discarded. After the clustering process, when a user submits a query, only the cluster with the highest threshold is displayed to the user. This increases the relevancy rate and reduces the search space and
processing time. Fig. 4 shows the design of the proposed work. The proposed system works as follows:

• The user enters a query in the search engine interface.
• The hub and authority documents for the query are retrieved.
• The threshold value is decided, and the similarity of each web document is computed for relevancy by considering the weights of the attributes in a data object.
• Once the weights are calculated, a threshold value is assigned to each cluster. According to the threshold values, the documents are clustered into most relevant, relevant and irrelevant clusters. A document whose weight agrees with the cluster centroid is assigned to that cluster, and documents that do not are discarded from it. The process is repeated until all the obtained results are clustered.
• The IR system then returns only the most relevant documents to the user for the query.

Figure 4: Design of the proposed work (components shown: search engine, hub and authority web pages, IR system)

V. CONCLUSION

In this paper, an approach for clustering hub and authority web documents has been proposed, in which the similarity between documents is compared by considering the attribute properties of a data object (web document), not just the contents of a document. All the documents are compared and the resulting clusters are formed using the K-Means clustering algorithm, which significantly improves the relevancy rate, processing time and search space.

REFERENCES

[1] S. Brin and L. Page, “The anatomy of a large-scale hypertextual Web search engine”, Computer Networks and ISDN Systems, 30(1–7):107–117, 1998.

[2] Preeti Chopra and Md. Ataullah, “A Survey on Improving the Efficiency of Different Web Structure Mining Algorithms”, International Journal of Engineering and Advanced Technology (IJEAT), ISSN: 2249 – 8958, Volume-2, Issue-3, February 2013.

[3] Laxmi Choudhary and Bhawani Shankar Burdak, “Role of Ranking Algorithms for Information Retrieval”, International Journal of Artificial Intelligence & Applications (IJAIA), Vol. 3, No. 4, pages 203-220, July 2012.

[4] Wenpu Xing and Ali Ghorbani, “Weighted PageRank Algorithm”, Proc. of the Second Annual Conference on Communication Networks and Services Research, IEEE, 2004.

[5] Raymond Kosala and Hendrik Blockeel, “Web Mining Research: A Survey”, ACM SIGKDD Explorations Newsletter, Volume 2, June 2000.

[6] J. M. Kleinberg, “Authoritative sources in a hyperlinked environment”, Journal of the ACM, 46(5):604–632, September 1999.