Week 7 - Link Based Ranking

This document discusses how hyperlinks between web documents can be used for information retrieval. It makes two key points: 1. The text surrounding a hyperlink (anchor text) provides useful context about the target document and can be considered an endorsement. This anchor text can help with retrieval tasks. 2. The link structure between documents, including the number of incoming links to a page, can indicate a page's importance and relevance. Models like the random surfer model that simulate a user randomly clicking links can analyze the link structure and provide ranked scores to pages. This helps address issues like link spamming.


1

In addition to their textual content, Web documents also contain hyperlinks. A hyperlink can be exploited for information retrieval in two ways:
1. The link is embedded in text that typically contains relevant information about the content of the document the link points to. Thus, this text can complement the content of the referred document.
2. The link can also be considered an endorsement of the referenced document by the author of the referring document. Thus, the link can be used as a signal for the quality and importance of the referred document.

2
Anchor text corresponds to the text surrounding the link, not only the text contained in the link tag itself (in the example, the text in the link tag would simply be “here”). The anchor text can contain valuable information on the referred page and can thus be helpful in retrieval.

3
This example illustrates the use of anchor text in retrieval. A home page is often very visual and contains little relevant text content. If we consider a home page such as the EPFL home page, we will probably find many pages pointing to it that characterize EPFL very well, such as pages mentioning topics related to research and technology transfer.

Assume that a malicious Internet user creates a fake EPFL home page. The chances that such a page is referenced by reputable organizations, such as the SNF, are very low. On the other hand, pages listing spam pages might point to such a page and reveal its true character. These pages would probably also mention terminology related to spam pages or blacklists, and such text can give indications about the true character of the spam page.

In addition, links to the EPFL home page indicate that this page is more important than other, less referenced pages, such as the pages containing the EPFL regulations.

4
One of the risks of including anchor text is that it makes pages spammable. Malicious users could create spam pages that point to web pages and try to associate them with content that serves their interests (e.g., raise the perceived quality of preferred pages by adding links, or lower the perceived quality of undesired pages by attaching negative anchor text). That this is a real phenomenon can be inferred from statistics on the in-degree distribution of Web pages.

The figure shows a standard log-log representation of the in-degree vs. the frequency of pages. Normally this relationship should follow a power law, which appears in a log-log representation as a linear relationship. In real Web data, we see that this power law is violated and that certain in-degree levels are over-represented. This can be attributed to link spamming, which creates moderate numbers of additional links to Web pages.

This is of course only one example of spamming techniques, and Web search
engines are in a continuous “battle” against this and other forms of spam.

5
In order to fight link spamming, the anchor text from pages with poor reputation can be given lower weight. We will later introduce methods for ranking pages based on the hyperlink graph, which is one way to evaluate the reputation of a page.

Another method to fight link spamming, aimed at avoiding self-promotion, is to give lower weights to links within the same site (nepotism = promoting your own family members).

6
The use of links in order to evaluate the quality of information sources has a long
tradition, specifically in science. The discipline of bibliometry is fully devoted to
the problem of evaluating the quality of research through citation analysis.
Different ideas can be exploited to that end:
- The frequency of citations to a paper, indicating how popular or visible it is
- Co-citation analysis in order to identify researchers working in related
disciplines
- Analysis of the authority of sources of scientific publications, e.g., journals,
publishers, conferences. This measure can then in turn be used to weight the
relevance of publications.
All these ideas can also be exploited for any other document collections that
have references, in particular, for Web document collections with hyperlinks.

7
When retrieving documents from the Web, the link structure bears important information on the relevance of documents. A document that is referred to more often by other documents through hyperlinks is likely to be of higher interest and therefore higher relevance. A possibility to rank documents is therefore to consider the number of incoming links. Considering the number of incoming links makes it possible to distinguish documents that would otherwise be ranked similarly when relying on text-based relevance ranking alone.

However, the importance of the link sources can also differ. Therefore, not only counting the number of incoming links, but also weighting the links by the relevance of the documents that contain them, can help to better assess the quality of a document. The same reasoning then applies, in turn, to evaluating the relevance of the documents pointing to the source of the link, and so forth.

Unlike in scientific publishing, references on the Web are not reliable, and therefore simple link counting might not be appropriate. The phenomenon of link spamming started in 1998, when search engines began to consider links for ranking. Link farms are groups of websites that are heavily linked to one another in order to boost their ranking.

8
We introduce now an approach for link-based scoring that considers not only the
absolute count of links, but also the quality of the link source. The basic idea is to
consider a random walker that visits Web pages following the hyperlinks. At each
page the random walker would select randomly among the hyperlinks of the page
with uniform probability and move to the next page. When the random walker
runs for a long time, it will visit every page with a given probability, which we can
consider as a score for ranking the page. This score can be used to control the
impact of the outgoing links of a page on the ranking of other pages.

One consequence of this model is that pages with few in-links are visited relatively infrequently. Since link farms and spam pages usually do not have many links pointing to them, the expectation is that this approach reduces their impact on ranking.

On the other hand, popular pages with many incoming links will have a higher
impact on ranking, as they have a higher score.

9
We provide a formal description of the random walker model. The model is a
Markov chain, a discrete-time stochastic process in which at each time-step a
random choice is made.

We assign to each page a visiting probability P(pi). The probability that a page pi is visited then depends on the visiting probabilities of the pages that have a hyperlink to it. For each source page of a hyperlink, its visiting probability is split evenly among all its outgoing links. This formulation of the process results in a recursive equation, whose solution is the steady state of the process.
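
As a sketch, this recursion can be written as follows (the notation deg_out(pj) for the number of outgoing links of page pj is ours, introduced here only for illustration):

$$P(p_i) \;=\; \sum_{p_j:\, p_j \rightarrow p_i} \frac{P(p_j)}{\deg_{\mathrm{out}}(p_j)}$$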

10
In order to determine the solution to the recursive equation on the probabilities of
a random walker to visit a page, we define a transition probability matrix R, which
captures the probability of transitioning from one page to another. We also
require that the probabilities of visiting a page add up to 1. With this formulation
of the problem, the long-term visiting probabilities become the Eigenvector of
matrix R. More precisely, they are the Eigenvector with the largest Eigenvalue.
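
In formula form, this amounts to the fixed-point problem sketched below (our restatement of the slide's formulation):

$$p = R\,p, \qquad \sum_{i=1}^{N} p_i = 1,$$

so the steady-state vector p is the eigenvector of R associated with the largest eigenvalue.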

11
This example illustrates the computation of the probabilities for visiting a specific
Web page. The values C(pi) correspond to the transition probabilities. They can
be derived from the link matrix. The link matrix is defined as Lij=1 if there is a link
from pj to pi. The link matrix is normalized by the outdegree, by dividing the
values in the columns by the sum of the values found in the column, resulting in
matrix R. The probability of a random walker visiting a node is then obtained from
the Eigenvector of this matrix.
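
A minimal numerical sketch of this construction is given below; the three-page link matrix is hypothetical and not the graph from the slide.

```python
import numpy as np

# Hypothetical link matrix: L[i, j] = 1 if page j links to page i.
L = np.array([[0, 0, 1],
              [1, 0, 1],
              [1, 1, 0]], dtype=float)

# Normalize each column by the out-degree of the corresponding page,
# which yields the transition probability matrix R.
out_degree = L.sum(axis=0)
R = L / out_degree

# The long-term visiting probabilities are the eigenvector of R
# associated with the largest eigenvalue, rescaled to sum to 1.
eigenvalues, eigenvectors = np.linalg.eig(R)
p = np.real(eigenvectors[:, np.argmax(np.real(eigenvalues))])
p = p / p.sum()
print(p)  # approximately [0.22, 0.33, 0.44] for this toy graph
```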

12
This example illustrates a problem with the random walker as we have formulated it: the existence of dead ends. We see that there exists a node p3 that is a "sink of rank". Any random walk ends up in this sink, and therefore the other nodes do not receive any ranking weight. Consequently, the sink itself does not receive any weight either, since the weight it receives from the other nodes is not replenished. Therefore, the only solution to the equation p = Rp is the zero vector.

13
A practical problem with the random walker is the fact that there exist Web pages
that have no outgoing links. Thus, the random walker would get stuck. To
address this problem, the concept of teleporting is introduced, where the random
walker jumps to a randomly selected Web page with a given probability. If the
random walker arrives at a dead end, it will then always jump to a randomly
selected page.

Another problem is posed by pages that have no incoming links: they would never be reached by the random walker, and the weight that they could provide to other pages would not be considered. This problem is also addressed by teleporting.

14
We now give the formal specification of the random walker with teleporting. At each step, the random walker makes a random jump with probability 1-q, and any of the N pages is reached by such a jump with the same probability. Therefore, an additional term (1-q)/N is added to the probability of reaching a given page. Reformulating the equation for the probabilities in matrix form results in adding an N×N matrix E with all entries equal to 1/N. This is equivalent to saying that with probability 1/N a transition between any pair of nodes (including a transition from a node to itself) is performed. Since the vector p has norm 1, i.e., the sum of its components is exactly 1, E·p = e, where e is the vector with all entries 1/N. Based on this property, an alternative formulation of the equation can be given. The method described is called PageRank and is used by Google for Web ranking. By modifying the values of the matrix E, a priori knowledge about the relative importance of pages can also be added to the ranking algorithm.
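
A sketch of this formulation (with q the probability of following a link, 1-q the teleporting probability, and e the vector with all entries 1/N, as described above):

$$P(p_i) \;=\; q \sum_{p_j:\, p_j \rightarrow p_i} \frac{P(p_j)}{\deg_{\mathrm{out}}(p_j)} \;+\; \frac{1-q}{N}$$

$$p \;=\; \bigl(qR + (1-q)E\bigr)\,p \;=\; qRp + (1-q)\,e$$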

15
With the modification of rank computation using a source of rank, we obtain for
our example a non-trivial ranking which appears to match intuition about the
relative importance of the pages in the graph well.

16
For the practical computation of the PageRank ranking an iterative approach can
be used. The vector e is used to add a source of rank. It can uniformly distribute
weights to all pages, but it could also incorporate pre-existing knowledge on the
importance of pages and bias the ranking towards them. The vector can also be
used as initial probability distribution.
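
A minimal sketch of such an iterative computation is given below; the function name, the tolerance, and the choice q = 0.85 are assumptions made for illustration, not values taken from the slides.

```python
import numpy as np

def pagerank(R, q=0.85, eps=1e-9, max_iter=100):
    """Iterative PageRank on the column-normalized transition matrix R.
    Assumes dead ends have already been handled (no all-zero columns)."""
    n = R.shape[0]
    e = np.full(n, 1.0 / n)   # source of rank; could encode a priori page importance
    p = e.copy()              # the same vector also serves as the initial distribution
    for _ in range(max_iter):
        p_next = q * R.dot(p) + (1 - q) * e
        if np.abs(p_next - p).sum() < eps:
            return p_next
        p = p_next
    return p
```

Calling pagerank(R) on the matrix R from the earlier example returns a probability vector that sums to 1 and can be used directly as ranking scores.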

17
These are the top documents from the PageRank ranking of all Web pages at
ETHZ (Data from 2001). It is interesting to see that documents related to Java
documentation receive high ranking values. This is related to the fact that these
documents have many internal cross-references.

18
PageRank is used as one metric to rank result documents in Google. At its core, Google uses text retrieval methods to retrieve relevant documents and then applies PageRank to create a more appropriate ranking. Google also uses many other methods to improve ranking, today largely based on personal information collected from users, such as search history and pages visited. The details of the ranking methods are trade secrets of the Web search engine providers.

Building a Web search engine requires solving several additional problems beyond providing a ranking system. Efficient Web crawling requires algorithms that can traverse the Web while avoiding redundant accesses to pages, as well as techniques for managing large link databases.

19
20
21
The basic idea of HITS is not to apply a single measure for the link-based relevance of a document, but to distinguish two different roles documents can play. Hub pages are pages that provide references to high-quality pages, whereas authority pages are the high-quality pages themselves. The method was conceived for understanding a larger topic in general and obtaining an overview of the essential content related to a given topic. It can nevertheless also be used as an alternative ranking model for Web search that provides a more refined quality evaluation of Web pages.

22
Hub-authority ranking is, like PageRank, based on a quantitative analysis of the link structure. Unlike PageRank, two different measures are considered: the number of incoming links as a measure of authority, and the number of links pointing to authorities as a measure of the quality of a hub. The example shows how, in this way, authoritative pages, such as university home pages, can be distinguished from hub pages, such as portal sites referencing universities.

23
As in PageRank, the approach is to let the ranking value of the page from which a hyperlink emanates determine the weight of the influence this hyperlink has on the page it points to. This results directly in a recursive formulation of the ranking values for hub and authority weights. Note that the equations of the HITS method presented here differ in subtle ways from those of PageRank (a formula sketch follows this list):
1. The weights are not split among the outgoing links; instead, each link transfers the whole hub or authority weight of the originating page.
2. Since the weights are not split, the ranking values need to be normalized.
3. The normalization uses the L2 norm, and not the L1 norm as in PageRank.
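
A sketch of these update rules in formula form (using a for the vector of authority weights, h for the vector of hub weights, and L for the link matrix as defined earlier; the exact notation on the slide may differ):

$$a \;\leftarrow\; \frac{L\,h}{\lVert L\,h \rVert_2}, \qquad h \;\leftarrow\; \frac{L^t\,a}{\lVert L^t\,a \rVert_2}$$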

24
Similarly, as for PageRank, the equations can be solved using iteration. Here we
show a possible realization of such an iterative computation, using uniformly
distributed weights for initialization.
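
A minimal sketch of such an iteration is given below, with L defined as before (L[i, j] = 1 if page j links to page i); the function name and tolerance are our assumptions.

```python
import numpy as np

def hits(L, eps=1e-9, max_iter=100):
    """Iterative HITS on link matrix L (L[i, j] = 1 if page j links to page i)."""
    n = L.shape[0]
    a = np.full(n, 1.0 / np.sqrt(n))  # uniform initial authority weights
    h = np.full(n, 1.0 / np.sqrt(n))  # uniform initial hub weights
    for _ in range(max_iter):
        a_next = L.dot(h)                  # authority: sum of hub weights of pages linking in
        a_next /= np.linalg.norm(a_next)   # L2 normalization
        h_next = L.T.dot(a_next)           # hub: sum of authority weights of pages linked to
        h_next /= np.linalg.norm(h_next)
        if np.abs(a_next - a).sum() + np.abs(h_next - h).sum() < eps:
            return a_next, h_next
        a, h = a_next, h_next
    return a, h
```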

25
When formulating the HITS equations in matrix form, using the link matrix L, we see that the authority and hub weights correspond to the Eigenvectors of the matrices LL^t and L^tL, respectively. This shows that the iterative computation with normalization of the hub and authority values will converge to the principal Eigenvectors of the matrices LL^t and L^tL.

26
27
28
One possible application of HITS is to compute the ranking on the complete Web graph, as is done with PageRank. Another way to use it (and this is how it was initially conceived) is to apply it in the context of a given query, to rerank the results by promoting results with high authority and hub values. In order to perform this operation, first all results for a query are retrieved (using a standard text retrieval model). Then the neighboring pages (either pointing to a result page or referred to by a result page) are added to the set of pages, which is then called the base set. HITS is then computed on the base set. This makes sense, since in this way we consider both referred and referring pages for the relevant documents, which helps to identify both hubs and authorities.

29
HITS suffers from similar potential problems related to the manipulation of the link
structure through link spamming as PageRank. In addition, when performing a broad
topic search and computing a base set for analysis, topic drift may occur, e.g., through
the introduction of off-topic hubs. This is a problem that is similar to the issues of topic
drift in pseudo-relevance feedback that we have observed earlier.

Both HITS and PageRank are examples of social network analysis algorithms. We will later introduce other types of algorithms for this purpose, aimed at community detection.

For an efficient implementation, link-based ranking algorithms require an efficient representation of the Web graph. This is the topic that we will explore next.

30
