0% found this document useful (0 votes)
62 views4 pages

Pagerank Prediction

Web page prefetching techniques are used to address the access latency problem of the Internet. A novel PageRank-like algorithm is proposed for conducting web page prediction. Two biasing factors are adopted to personalize PageRank, so that it favors the pages that are more important to users.

Uploaded by

Mohan Patsariya
Copyright
© Attribution Non-Commercial (BY-NC)
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
62 views4 pages

Pagerank Prediction

Web page prefetching techniques are used to address the access latency problem of the Internet. A novel PageRank-like algorithm is proposed for conducting web page prediction. Two biasing factors are adopted to personalize PageRank, so that it favors the pages that are more important to users.

Uploaded by

Mohan Patsariya
Copyright
© Attribution Non-Commercial (BY-NC)
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 4

2007 IEEE/WIC/ACM International Conference on Web Intelligence

Personalized PageRank for Web Page Prediction Based on Access

Time-Length and Frequency

Yong Zhen Guo, Kotagiri Ramamohanarao and Laurence A. F. Park


Dept. of Computer Science and Software Engineering,
The University of Melbourne 3010, Australia
{yzguo, rao, lapark}@csse.unimelb.edu.au

Abstract their subsequent accesses (implying that the prefetching


method has predicted the users’ actions poorly), the lim-
Web page prefetching techniques are used to address ited network bandwidth and server resources will not be
the access latency problem of the Internet. To perform used efficiently, and hence worsen the access delay
successful prefetching, we must be able to predict the next problem. Therefore, the success of the prefetching
set of pages that will be accessed by users. The PageRank method relies mainly on the prediction accuracy.
algorithm used by Google is able to compute the popu- In this paper, we propose a novel PageRank-like al-
larity of a set of Web pages based on their link structure. gorithm to conduct more accurate web page prediction.
In this paper, a novel PageRank-like algorithm is pro- PageRank [1] is a Web page popularity measure used in
posed for conducting Web page prediction. Two biasing the Google search engine. Amongst a set of Web pages,
factors are adopted to personalize PageRank, so that it popularity is equivalent to the likelihood of future access
favors the pages that are more important to users. One of the page, therefore the PageRank value of a Web page
factor is the length of time spent on visiting a page and the can be used to measure the likelihood of the page being
other is the frequency that a page was visited. The ex- downloaded. In this paper, we use a novel approach to
periments conducted show that using these two factors conduct web page prediction by employing a personal-
simultaneously to bias PageRank results in more accurate ized PageRank algorithm. We use both the frequency
Web page prediction than other methods that use only one that a web page is accessed and the time-length spent on
of these two factors. visiting it as the biasing factors to personalize PageRank
so that it favors the pages that are more important to us-
1. Introduction ers.

Over the past decade the World Wide Web has 2. Brief review of PageRank
evolved from being an information source to a global
hub for business and socializing. However, according to The whole Web can be considered as a directed graph
a recent study [2], many users connect to the Internet us- with N nodes representing the N Web pages and the edges
ing low bandwidth connections (e.g. dial-up or wireless representing the links between them. Let Fv be the set of
connections). Usually many of these users need to spend pages page v links to and PR(v) the ranking value of page
a large amount of time waiting for the Web page con- v, then a link v → u will contribute PR(v)/|Fv| units of
tents to be downloaded through their narrow bandwidth page v’s ranking value to page u, where |Fv| is the num-
connections. Although the access latency is alleviated for ber of the pages in Fv. The PageRank of a Web page is
broadband users today, it can also be decreased signifi- the sum of the ranking values of its backlinks, and the
cantly. computation of it is a recursive process. In order to limit
Using a Web page prefetching technique, we can effi- the effect of rank sinks and guarantee the recursive
ciently tackle this access latency problem. However, if computation of PageRank is able to converge to a certain
most prefetched web pages are not visited by the users in value, a damping factor (1- α ) is used ( α is a small

0-7695-3026-5/07 $25.00 © 2007 IEEE 687


DOI 10.1109/WI.2007.58

Authorized licensed use limited to: UNIVERSITY OF MELBOURNE. Downloaded on October 22, 2008 at 23:49 from IEEE Xplore. Restrictions apply.
number which is usually set to 0.15 [1]). Furthermore, consecutively. If we consider all user sessions as
st
for pages which have no outlinks we add a link to all 1 -order Markov Chains (in this case, the next page to be
other pages in the Web graph, in this way, ranks which visited by a user only depends on the page the user is
are lost due to pages with no outlinks can be redistrib- visiting currently), then tu is the sum of the weights of
uted uniformly to all other pages. Let Bu represent the set edges that point to node u. Let Bu be the set of pages that
of pages pointing to page u, we have the recursive defi- point to page u, we have the equation:
nition of PageRank:
α PR (v) (1)
t = u t ∑ ( w,u )
(2)
PR i +1 (u) = + (1 − α ) × ∑ i w∈Bu
N v∈ B u | Fv |
From the definition of t(v,u) we can see that if more pre-
vious users follow the path v → u and stay on page u for
3. The proposed approach: TFPR a longer time, the value of t(v,u) will be larger, thus t(v,u)
covers both information of access time-length and access
In this section we will illustrate our novel PageR- frequency of a page u.
ank-like approach in detail. In order to include access frequency and accessing
time-length of a page to conduct the computation of our
3.1. Access time-length and frequency-based personalized PageRank algorithm TFPR, we adopt t(v,u) as
PageRank (TFPR) the biasing factor. When distributing its ranking value to
its outlinks, page v will now propagate:
In our approach, the ranking system resides on the t ( v ,u )
server side and all the information that is required for the
computation of our personalized PageRank algorithm ∑t
w∈Fv
( v , w)

can be derived from the website’s Web logs. We assume


that these Web logs are preprocessed and all user ses- units of its importance to page u in a non-uniform way,
sions are identified. The access frequency of a page u, where Fv is the set of pages that page v points to.
and the number of times page v was visited right after
We also guarantee the website’s directed graph G is
page u can be obtained simply by counting the number of
strongly connected so that the calculation of TFPR can
times page u appears and the number of times pages u
converge to a certain value by including the damping
and v appear consecutively in all user sessions respec-
factor (1- α ). Then we eliminate all dangling pages from
tively. Moreover, if page u and v were visited consecu-
G by adding a link to all other pages in G for pages with
tively in a user session, and ru and rv are the times the
no outlinks.
user requested them respectively, then rv-ru is the ap-
In accordance with the definition of PageRank, we
proximate time-length the user spent on visiting page u
have the definition of our access time-length and fre-
in this session. In the case that page u is the last page of a
quency-based PageRank (TFPR) as follows:
user session, for which the access duration cannot be
calculated, we can compute the average time-length tu TFPRi (v) × t( v, u )
TFPR i +1 (u) = α × + (1 − α ) × ∑
spent on page u from all user sessions as its access ∑ tv
v∈G
v∈ Bu ∑t
w∈Fv
( v, w)
(3)
time-length. Finally, if a user opened a Web page but did
not read it immediately (for example, the user went out
Where Fv is the set of pages that page v links to and Bu
after he opened the Web page), this will result in a sig-
is the set of pages that link to page u.
nificantly long but inaccurate access time-length for this
Web page. Therefore, when the access time-length spent
3.2. Usage context of TFPR
on a Web page by a user exceeds the average time-length
spent on this page by a large percentage, we use this av-
By using our approach, we bias the PageRank to give
erage access time-length as the user’s access time-length
a higher ranking to the pages that are visited more fre-
on this Web page.
quently and for a longer time by previous users. If we
Let us consider a website as a directed graph G,
use TFPR on the whole directed graph G of a website,
where the nodes represent the web pages and the edges
we can obtain the global ranking of all pages in this
represent the links. Both nodes and edges carry weights,
website.
the weight tu on node u is the total time-length all previ-
However, the calculation of the global ranking of all
ous users spent on browsing page u, while the weight t(v,u)
pages will consume huge computation power and proc-
on edge v → u represents the sum of the time–lengths
essing time. In order to implement Web page prediction
spent on visiting page u when page v and u were visited
for a current user, we should not apply TFPR to the

688

Authorized licensed use limited to: UNIVERSITY OF MELBOURNE. Downloaded on October 22, 2008 at 23:49 from IEEE Xplore. Restrictions apply.
whole directed graph G but only to a subgraph of it. In the total time-length spent on visiting a page as the bias-
st
the 1 -order Markov Chain scenario, the construction of ing factor. It is based on the following equation:
the subgraph for a current user can be described as fol-
tu TPRi (v) × tu
lows: we first extract the navigational path a user has fol- TPR i +1 (u) = α × + (1 − α ) × ∑
lowed, and expand it by including all the pages the last ∑ tv
v∈G
v∈Bu ∑ tw
w∈ Fv
(4)
page links to directly in G, and the weighted links be-
tween them. The length of the path taken into considera- Where tu is the total time-length spent on visiting page
tion when expanding the subgraph depends on the order u by all previous users, Bu is the set of pages that have
of the Markov Model we use. We then continuously links to page u, and Fv is the set of pages that page v
conduct the same process for the pages newly included links to.
in the subgraph until we reach a predefined depth. Then The second contrasting algorithm is UPR (Us-
we apply our TFPR algorithm to this subgraph to con- age-based PageRank) [5], which uses only the access
duct a “local” ranking computation for the pages in the frequency of a Web page as the biasing factor. UPR can
subgraph and provide the current user with a Web page be calculated by the formula below:
prediction list in descending order based on these Web UPRi (v) × w( v,u )
wu
pages’ ranking values. For more detailed information on UPRi +1 (u) = α × + (1 − α ) × ∑
the construction of a subgraph and its relevant theories, ∑ v
w
v∈G
v∈Bu ∑ w(v, k )
k ∈Fv
(5)
readers can refer to [5].
Where wu is the number of times page u was visited
4. Experiment and w(v,u) is the number of times page u was visited right
after page v by previous users.
In this section we demonstrate our experiments. Two
contrasting algorithms are used to compare with our 4.3. Evaluation methods
TFPR algorithm by using three different evaluation
measures. We selected the 10 most popular paths the previous
users followed from the training data set, and each path
4.1. Experimental data set was expanded to construct the corresponding subgraph
according to the sessions in the training data set. We
We conducted our experiment using the Web logs of chose the most popular paths because using these most
the Department of Computer Science and Software En- accessed paths allowed us to provide a better representa-
gineering website [4], at the University of Melbourne. tion of the typical navigational behaviors of Web users
We obtained the Web logs of a two-week period in Sep- than those paths that are with low access frequencies.
tember 2006 and used the Web logs from 01/Sep./2006 Then we applied the three algorithms (TPR, UPR and
to 07/Sep./2006 as the training data set, and the Web logs TFPR) to each subgraph to rank the pages it included,
from 08/Sep./2006 to 14/Sep./2006 as the testing data and for each algorithm, we obtained a set of the top-n
set. We filtered the records and only reserved the hits ranked pages as the most probable “next” pages for each
requesting Web pages (such as *.htm, *.html, *.shtml path. Following this, we derived a set of the top-n most
and *.asp). When identifying user sessions, we set the frequent “next” pages for each of the same path from the
session timeout to 30 minutes, with a maximum of 50 testing data set. Finally, for every most popular path, we
pageviews per session. After filtering out the hits by respectively compared the result sets calculated by these
Web crawlers, the training data set contained 23,132 re- 3 algorithms with the result set obtained from the testing
cords and 9,404 sessions, while the testing data set con- data set.
tained 21,933 records and 9,723 sessions. We used 3 measures, OSim , KSim and MRR, when
comparing two sets r1 and r2, each of size n. The first
4.2. Contrasting algorithms method is OSim(r1,r2), which represents the degree of
overlap between the top-n pages of r1 and r2:
We introduced two ranking methods to compare with | r1 ∩ r2 | (6)
OSim(r1 , r2 ) =
our proposed TFPR algorithm, which uses both the n
length of time spent on visiting a page and the frequency The second method we used is KSim(r1,r2) [3], which
that a page was visited as the biasing factors to personal- focuses on indicating the degree to which the relative
ize PageRank. The first is TPR (Time-length-based ordering of the top-n pages of two sets, r1 and r2, are in
PageRank), a PageRank-style algorithm which uses only agreement. The values of KSim(r1,r2) that are closer to 1

689

Authorized licensed use limited to: UNIVERSITY OF MELBOURNE. Downloaded on October 22, 2008 at 23:49 from IEEE Xplore. Restrictions apply.
indicate closer agreement. Let U be the union of the tively. Therefore, we can confirm that adopting both the
pages in r1 and r2. Let r1’ extend r1 by adding U-r1 at its frequency a Web page is accessed and the time-length
end, and let r2’ be defined analogously. We can define spent on visiting it as the biasing factors simultaneously
KSim(r1,r2) as follows: can improve the accuracy of Web page prediction.
| (u, v) : r1 ' , r2 ' agree on order of (u, v), u ≠ v | (7) Accuracies for top-5 prediction
KSim(r1 , r2 ) =
| U | ×(| U | -1) 0.7
0.6
Where the numerator denotes the number of pairwise 0.5
agreements of elements between r1’ and r2’. 0.4
TPR
UPR
The last measure we used to evaluate the performance 0.3
TFPR
0.2
of these 3 algorithms is the mean reciprocal rank (MRR).
0.1
Given an ordered list of predicted pages (these pages are 0
in a descending order according to their ranking values) OSim KSim MRR

and the “actual page” a user accesses in his subsequent


visit, the reciprocal rank is calculated as 1/p where p is Figure 2. Average OSim, KSim and MRR
the position of the “actual page” in this ordered list, and values for top-5 prediction
it is set to zero if the “actual page” is not in this list. The
MRR value of an algorithm that is closer to 1 denotes a 5. Conclusion and future works
better performance.
In this paper, we used PageRank algorithm in the
4.4. Experimental results context of Web page prediction. Two factors (the
time-length spent on visiting a Web page and the fre-
We used these 3 algorithms (TPR, UPR and TFPR) to quency the Web page is accessed) were used to bias
obtain the sets of the top-3 and top-5 most highly-ranked PageRank so that it favors the pages that were visited for
“next” pages for the 10 most popular paths we chose a longer time and more frequently than others. The posi-
from the training data set. Then for each of the 10 most tive experimental results show that our approach can
popular path we compared these sets of the top-3 and produce more accurate page prediction than the methods
top-5 pages with the sets of the top-3 and top-5 most that use only one of these two biasing factors.
frequent “next” pages derived from the testing data set However, in our experimental setup, we only used the
st
respectively by using the aforementioned evaluation 1 -order Markov Chain model, which is “memoryless”,
measures. Figure 1 and 2 show the average OSim, KSim to calculate the weights of edges in the directed graph of
and MRR values for the top-3 and top-5 predictions of a website and to expand the subgraph for a user’s current
these 3 different algorithms. navigational path. In our future work, we will take into
Accuracies for top-3 prediction account using higher order Markov Chain models to im-
0.6
prove the prediction accuracy and also applying the pro-
0.5 posed approach to other data sets to evaluate its reliabil-
0.4
TPR
ity and performance.
0.3 UPR
TFPR
0.2 6. References
0.1
0 [1] Larry Page. PageRank: Bringing Order to the Web. Tech-
OSim KSim MRR
nical report, Stanford digital libraries, 1997.
Figure 1. Average OSim, KSim and MRR [2] The China Internet Network Information Center.
http://www.cnnic.com.cn/, April 2007.
values for top-3 prediction
[3] T. Haveliwala. “Topic-Sensitive PageRank”. In Proceed-
As depicted in Figure 1 and 2, TFPR outperforms ings of WWW2002 Conference, Hawaii USA, May 2002.
TPR and UPR significantly in all OSim, KSim and MRR [4] The Department of Computer Science and Software Engi-
values. In the top-3 prediction, TFPR outperforms TPR neering, the University of Melbourne, Australia.
by 23.1%, 23.4% and 52.6% for OSim, KSim and MRR http://www.cs.mu.oz.au, April 2007.
respectively. As to UPR, TFPR outperforms it by 23.1%, [5] M. Eirinaki and M. Vazirgiannis, “Usage-based PageRank
24.4% and 93.3% for different evaluation methods. In th
for Web Personalization”. In Proceedings of the 5 IEEE In-
the top-5 prediction, the corresponding percentages are ternational Conference on Data Mining (ICDM’05), Lousiana,
20.0%, 5.0%, 40.6% and 14.3%, 10.8%, 84.8% respec- November 2005.

690

Authorized licensed use limited to: UNIVERSITY OF MELBOURNE. Downloaded on October 22, 2008 at 23:49 from IEEE Xplore. Restrictions apply.

You might also like