Pagerank Prediction
Over the past decade the World Wide Web has evolved from an information source into a global hub for business and socializing. However, according to a recent study [2], many users connect to the Internet using low-bandwidth connections (e.g. dial-up or wireless connections). Many of these users spend a large amount of time waiting for Web page contents to be downloaded through their narrow-bandwidth connections. Although access latency is lower for broadband users today, it can still be reduced significantly.

Using a Web page prefetching technique, we can efficiently tackle this access latency problem. However, if most prefetched web pages are not visited by the users in

2. Brief review of PageRank

The whole Web can be considered as a directed graph with N nodes representing the N Web pages and the edges representing the links between them. Let Fv be the set of pages page v links to and PR(v) the ranking value of page v; then a link v → u will contribute PR(v)/|Fv| units of page v’s ranking value to page u, where |Fv| is the number of pages in Fv. The PageRank of a Web page is the sum of the ranking values contributed by its backlinks, and its computation is a recursive process. In order to limit the effect of rank sinks and guarantee that the recursive computation of PageRank converges to a certain value, a damping factor (1 − α) is used (α is a small
Authorized licensed use limited to: UNIVERSITY OF MELBOURNE. Downloaded on October 22, 2008 at 23:49 from IEEE Xplore. Restrictions apply.
number which is usually set to 0.15 [1]). Furthermore, for pages which have no outlinks we add a link to all other pages in the Web graph; in this way, rank that would be lost due to pages with no outlinks is redistributed uniformly to all other pages. Let Bu represent the set of pages pointing to page u; we then have the recursive definition of PageRank:

$$PR_{i+1}(u) = \frac{\alpha}{N} + (1 - \alpha) \times \sum_{v \in B_u} \frac{PR_i(v)}{|F_v|} \tag{1}$$

3. The proposed approach: TFPR

In this section we will illustrate our novel PageRank-like approach in detail.

3.1. Access time-length and frequency-based PageRank (TFPR)

In our approach, the ranking system resides on the server side and all the information that is required for the computation of our personalized PageRank algorithm

consecutively. If we consider all user sessions as 1st-order Markov Chains (in this case, the next page to be visited by a user depends only on the page the user is currently visiting), then tu is the sum of the weights of the edges that point to node u. Let Bu be the set of pages that point to page u; we have the equation:

$$t_u = \sum_{w \in B_u} t_{(w,u)} \tag{2}$$

From the definition of t(v,u) we can see that if more previous users follow the path v → u and stay on page u for a longer time, the value of t(v,u) will be larger; thus t(v,u) captures both the access time-length and the access frequency of page u.

In order to include the access frequency and access time-length of a page in the computation of our personalized PageRank algorithm TFPR, we adopt t(v,u) as the biasing factor. When distributing its ranking value to its outlinks, page v will now propagate:

$$\frac{t_{(v,u)}}{\sum_{w \in F_v} t_{(v,w)}}$$
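The two quantities used above, the incoming total tu of equation (2) and the share of rank that page v passes to each outlink, can be illustrated with a minimal Python sketch. It assumes the edge weights t(v,u) are already available as a dictionary keyed by (v, u) pairs; how those weights are derived from the logs is not shown in this excerpt, and the function names `propagation_shares` and `incoming_totals` are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def propagation_shares(t):
    """Return t(v,u) / sum over w in Fv of t(v,w) for every weighted edge.

    `t` is an assumed input: a dict mapping (v, u) -> edge weight t(v,u).
    Fv is taken to be the set of targets that appear in `t` for source v.
    """
    out_total = defaultdict(float)
    for (v, _u), weight in t.items():
        out_total[v] += weight
    # Skip sources whose outgoing weights sum to zero to avoid division by zero.
    return {
        (v, u): weight / out_total[v]
        for (v, u), weight in t.items()
        if out_total[v] > 0
    }


def incoming_totals(t):
    """Return tu, the sum of t(w,u) over all pages w pointing to u (equation 2)."""
    totals = defaultdict(float)
    for (_w, u), weight in t.items():
        totals[u] += weight
    return dict(totals)
```

By construction, the shares leaving any page with positive outgoing weight sum to 1, so the page's ranking value is split among its outlinks rather than created or lost.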
whole directed graph G but only to a subgraph of it. In the 1st-order Markov Chain scenario, the construction of the subgraph for a current user can be described as follows: we first extract the navigational path a user has followed, and expand it by including all the pages the last page links to directly in G, and the weighted links between them. The length of the path taken into consideration when expanding the subgraph depends on the order of the Markov Model we use. We then repeat the same process for the pages newly included in the subgraph until we reach a predefined depth. Then we apply our TFPR algorithm to this subgraph to conduct a “local” ranking computation for the pages in the subgraph and provide the current user with a Web page prediction list in descending order of these Web pages’ ranking values. For more detailed information on the construction of a subgraph and its relevant theories, readers can refer to [5].

4. Experiment

In this section we describe our experiments. Two contrasting algorithms are compared with our TFPR algorithm using three different evaluation measures.

4.1. Experimental data set

We conducted our experiment using the Web logs of the Department of Computer Science and Software Engineering website [4], at the University of Melbourne. We obtained the Web logs of a two-week period in September 2006 and used the Web logs from 01/Sep./2006 to 07/Sep./2006 as the training data set, and the Web logs from 08/Sep./2006 to 14/Sep./2006 as the testing data set. We filtered the records and retained only the hits requesting Web pages (such as *.htm, *.html, *.shtml and *.asp). When identifying user sessions, we set the session timeout to 30 minutes, with a maximum of 50 pageviews per session. After filtering out the hits by Web crawlers, the training data set contained 23,132 records and 9,404 sessions, while the testing data set contained 21,933 records and 9,723 sessions.

4.2. Contrasting algorithms

We introduced two ranking methods to compare with our proposed TFPR algorithm, which uses both the length of time spent on visiting a page and the frequency with which a page was visited as the biasing factors to personalize PageRank. The first is TPR (Time-length-based PageRank), a PageRank-style algorithm which uses only the total time-length spent on visiting a page as the biasing factor. It is based on the following equation:

$$TPR_{i+1}(u) = \alpha \times \frac{t_u}{\sum_{v \in G} t_v} + (1 - \alpha) \times \sum_{v \in B_u} \frac{TPR_i(v) \times t_u}{\sum_{w \in F_v} t_w} \tag{4}$$

where tu is the total time-length spent on visiting page u by all previous users, Bu is the set of pages that have links to page u, and Fv is the set of pages that page v links to.

The second contrasting algorithm is UPR (Usage-based PageRank) [5], which uses only the access frequency of a Web page as the biasing factor. UPR can be calculated by the formula below:

$$UPR_{i+1}(u) = \alpha \times \frac{w_u}{\sum_{v \in G} w_v} + (1 - \alpha) \times \sum_{v \in B_u} \frac{UPR_i(v) \times w_{(v,u)}}{\sum_{k \in F_v} w_{(v,k)}} \tag{5}$$

where wu is the number of times page u was visited and w(v,u) is the number of times page u was visited right after page v by previous users.

4.3. Evaluation methods

We selected the 10 most popular paths the previous users followed from the training data set, and each path was expanded to construct the corresponding subgraph according to the sessions in the training data set. We chose the most popular paths because these most accessed paths represent the typical navigational behaviors of Web users better than paths with low access frequencies. Then we applied the three algorithms (TPR, UPR and TFPR) to each subgraph to rank the pages it included, and for each algorithm, we obtained a set of the top-n ranked pages as the most probable “next” pages for each path. Following this, we derived a set of the top-n most frequent “next” pages for each of the same paths from the testing data set. Finally, for each of the most popular paths, we compared the result sets calculated by these 3 algorithms with the result set obtained from the testing data set.

We used 3 measures, OSim, KSim and MRR, when comparing two sets r1 and r2, each of size n. The first method is OSim(r1,r2), which represents the degree of overlap between the top-n pages of r1 and r2:

$$OSim(r_1, r_2) = \frac{|r_1 \cap r_2|}{n} \tag{6}$$

The second method we used is KSim(r1,r2) [3], which indicates the degree to which the relative orderings of the top-n pages of the two sets, r1 and r2, are in agreement. The values of KSim(r1,r2) that are closer to 1
indicate closer agreement. Let U be the union of the pages in r1 and r2. Let r1’ extend r1 by adding U-r1 at its end, and let r2’ be defined analogously. We can define KSim(r1,r2) as follows:

$$KSim(r_1, r_2) = \frac{|\{(u, v) : r_1', r_2' \text{ agree on order of } (u, v),\ u \neq v\}|}{|U| \times (|U| - 1)} \tag{7}$$

where the numerator denotes the number of pairwise agreements of elements between r1’ and r2’.

The last measure we used to evaluate the performance of these 3 algorithms is the mean reciprocal rank (MRR). Given an ordered list of predicted pages (these pages are in a descending order according to their ranking values)

tively. Therefore, we can confirm that adopting both the frequency a Web page is accessed and the time-length spent on visiting it as the biasing factors simultaneously can improve the accuracy of Web page prediction.

[Figure: Accuracies for top-5 prediction. Bar chart comparing TPR, UPR and TFPR on the OSim, KSim and MRR measures; vertical axis from 0 to 0.7.]
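The OSim and KSim measures of Section 4.3 (equations 6 and 7) can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the evaluation code used in the experiments: the names `osim`, `ksim` and `_extend` are hypothetical, each list is assumed to contain no duplicate pages, and the pages appended to r1 (those in U-r1) are assumed to be added in the order in which they occur in r2, since the text does not specify how the appended pages are ordered among themselves.

```python
from itertools import permutations


def osim(r1, r2):
    """Degree of overlap between two top-n lists of equal size n (equation 6)."""
    return len(set(r1) & set(r2)) / len(r1)


def _extend(r, other):
    """Append the pages of `other` that are missing from `r` at the end of `r`.

    Assumption: appended pages keep the order they have in `other`.
    """
    seen = set(r)
    return list(r) + [page for page in other if page not in seen]


def ksim(r1, r2):
    """Fraction of ordered pairs (u, v), u != v, on whose order r1' and r2' agree."""
    pos1 = {page: i for i, page in enumerate(_extend(r1, r2))}
    pos2 = {page: i for i, page in enumerate(_extend(r2, r1))}
    pages = list(pos1)  # the union U
    if len(pages) < 2:
        return 1.0  # no pairs to disagree on
    agreements = sum(
        1
        for u, v in permutations(pages, 2)
        if (pos1[u] < pos1[v]) == (pos2[u] < pos2[v])
    )
    return agreements / (len(pages) * (len(pages) - 1))
```

For example, with r1 = ["a", "b", "c"] and r2 = ["b", "a", "d"], two of the three pages overlap, so OSim is 2/3. Under the ordering assumption above, the extended lists are [a, b, c, d] and [b, a, d, c], which disagree only on the pairs (a, b) and (c, d), so KSim is 8/12 = 2/3.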