Week 7 - Link Based Ranking

This document discusses how hyperlinks between web documents can be used for information retrieval. It makes two key points: 1. The text surrounding a hyperlink (anchor text) provides useful context about the target document and can be considered an endorsement. This anchor text can help with retrieval tasks. 2. The link structure between documents, including the number of incoming links to a page, can indicate a page's importance and relevance. Models like the random surfer model that simulate a user randomly clicking links can analyze the link structure and provide ranked scores to pages. This helps address issues like link spamming.


1

In addition to their textual content, Web documents also contain hyperlinks. A hyperlink can be exploited for information retrieval in two ways:
1. The link is embedded in text that typically contains relevant information about the content of the document the link points to. Thus, this text can complement the content of the referred document.
2. The link can also be considered an endorsement of the referenced document by the author of the referring document. Thus, the link can be used as a signal for the quality and importance of the referred document.

2
Anchor text corresponds to the text surrounding the link, not only the text contained in the link tag itself (in the example, the text in the link tag would simply be “here”). The anchor text can contain valuable information on the referred page and can thus be helpful in retrieval.

3
This example illustrates the use of anchor text in retrieval. A home page is often very visual and contains little relevant text content. If we consider a home page such as the EPFL home page, we will probably find many pages pointing to it that characterize EPFL very well, such as pages mentioning topics related to research and technology transfer.

Assume that a malicious Internet user creates a fake EPFL home page. The chances that such a page is referenced by reputable organizations, such as the SNF, are very low. On the other hand, pages listing spam pages might point to such a page and reveal its true character. These pages would probably also mention terminology related to spam pages or blacklists, and such text can give indications about the true character of the spam page.

In addition, links to the EPFL home page indicate that this page is more important than other, less referenced pages, such as the pages containing the EPFL regulations.

4
One of the risks of including anchor text is that it makes pages spammable. Malicious users could create spam pages that point to web pages and try to associate them with content that serves their interests (e.g., raise the perceived quality of preferred pages by adding links, or lower the perceived quality of undesired pages by attaching negative anchor text). That this is a real phenomenon can be inferred from statistics on the in-degree distribution of Web pages.

The figure shows a standard log-log representation of the in-degree vs. the frequency of pages. Normally this relationship should follow a power law, which appears in a log-log representation as a linear relationship. In real Web data, we see that this power law is violated and that certain in-degree levels are over-represented. This can be attributed to link spamming, which creates moderate numbers of additional links to Web pages.

This is of course only one example of spamming techniques, and Web search
engines are in a continuous “battle” against this and other forms of spam.

5
In order to fight link spamming, the anchor text from pages with poor reputation can be given lower weight. We will later introduce methods for ranking pages based on the hyperlink graph, which is one way to evaluate the reputation of a page.

Another method to fight link spamming, aimed at avoiding self-promotion, is to give lower weights to links within the same site (nepotism = promoting your own family members).

6
The use of links in order to evaluate the quality of information sources has a long
tradition, specifically in science. The discipline of bibliometry is fully devoted to
the problem of evaluating the quality of research through citation analysis.
Different ideas can be exploited to that end:
- The frequency of citations to a paper, indicating how popular or visible it is
- Co-citation analysis in order to identify researchers working in related
disciplines
- Analysis of the authority of sources of scientific publications, e.g., journals,
publishers, conferences. This measure can then in turn be used to weight the
relevance of publications.
All these ideas can also be exploited for any other document collections that
have references, in particular, for Web document collections with hyperlinks.

7
When retrieving documents from the Web, the link structure bears important information on the relevance of documents. A document that is referred to more often by other documents through hyperlinks is likely to be of higher interest and therefore higher relevance. A possibility to rank documents is therefore to consider the number of incoming links. Considering the number of incoming links makes it possible to distinguish documents that would otherwise be ranked similarly when relying on text-based relevance ranking alone.

However, the importance of the link sources can also differ. Therefore, not only counting the number of incoming links, but also weighting the links by the relevance of the documents that contain them, can help to better assess the quality of a document. The same reasoning then applies, in turn, to evaluating the relevance of the documents pointing to the source of the link, and so forth.

Unlike in scientific publishing, references on the Web are not reliable, and therefore simple link counting might not be appropriate. The phenomenon of link spamming started in 1998, when search engines began to consider links for ranking. Link farms are groups of websites that are heavily linked to one another in order to boost their ranking.

8
We introduce now an approach for link-based scoring that considers not only the
absolute count of links, but also the quality of the link source. The basic idea is to
consider a random walker that visits Web pages following the hyperlinks. At each
page the random walker would select randomly among the hyperlinks of the page
with uniform probability and move to the next page. When the random walker
runs for a long time, it will visit every page with a given probability, which we can
consider as a score for ranking the page. This score can be used to control the
impact of the outgoing links of a page on the ranking of other pages.

One consequence of this model is that pages with few in-links are visited relatively infrequently. Since link farms and spam pages usually do not have many links pointing to them, the expectation is that this approach reduces their impact on ranking.

On the other hand, popular pages with many incoming links will have a higher
impact on ranking, as they have a higher score.

9
We provide a formal description of the random walker model. The model is a
Markov chain, a discrete-time stochastic process in which at each time-step a
random choice is made.

We assign to each page a visiting probability P(pi). The probability that a page pi is visited then depends on the visiting probabilities of the pages that have a hyperlink to it. For each source page of a hyperlink, its visiting probability is split evenly among all its outgoing links. This formulation of the process results in a recursive equation, whose solution is the steady state of the process.
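
As a sketch, this recursion can be written as follows (the notation deg_out(pj) for the number of outgoing links of page pj is ours, introduced here only for illustration):

$$P(p_i) \;=\; \sum_{p_j:\, p_j \rightarrow p_i} \frac{P(p_j)}{\deg_{\mathrm{out}}(p_j)}$$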

10
In order to determine the solution to the recursive equation on the probabilities of
a random walker to visit a page, we define a transition probability matrix R, which
captures the probability of transitioning from one page to another. We also
require that the probabilities of visiting a page add up to 1. With this formulation
of the problem, the long-term visiting probabilities become the Eigenvector of
matrix R. More precisely, they are the Eigenvector with the largest Eigenvalue.
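
In formula form, this amounts to the fixed-point problem sketched below (our restatement of the slide's formulation):

$$p = R\,p, \qquad \sum_{i=1}^{N} p_i = 1,$$

so the steady-state vector p is the eigenvector of R associated with the largest eigenvalue.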

11
This example illustrates the computation of the probabilities for visiting a specific
Web page. The values C(pi) correspond to the transition probabilities. They can
be derived from the link matrix. The link matrix is defined as Lij=1 if there is a link
from pj to pi. The link matrix is normalized by the outdegree, by dividing the
values in the columns by the sum of the values found in the column, resulting in
matrix R. The probability of a random walker visiting a node is then obtained from
the Eigenvector of this matrix.
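
A minimal numerical sketch of this construction is given below; the three-page link matrix is hypothetical and not the graph from the slide.

```python
import numpy as np

# Hypothetical link matrix: L[i, j] = 1 if page j links to page i.
L = np.array([[0, 0, 1],
              [1, 0, 1],
              [1, 1, 0]], dtype=float)

# Normalize each column by the out-degree of the corresponding page,
# which yields the transition probability matrix R.
out_degree = L.sum(axis=0)
R = L / out_degree

# The long-term visiting probabilities are the eigenvector of R
# associated with the largest eigenvalue, rescaled to sum to 1.
eigenvalues, eigenvectors = np.linalg.eig(R)
p = np.real(eigenvectors[:, np.argmax(np.real(eigenvalues))])
p = p / p.sum()
print(p)  # approximately [0.22, 0.33, 0.44] for this toy graph
```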

12
This example illustrates a problem with the random walker as we have formulated it: the existence of dead ends. We see that there exists a node p3 that is a "sink of rank". Any random walk ends up in this sink, and therefore the other nodes do not receive any ranking weight. Consequently, the sink itself does not receive any weight either, since the weight it receives from the other nodes is not replenished. Therefore, the only solution to the equation p = Rp is the zero vector.

13
A practical problem with the random walker is the fact that there exist Web pages
that have no outgoing links. Thus, the random walker would get stuck. To
address this problem, the concept of teleporting is introduced, where the random
walker jumps to a randomly selected Web page with a given probability. If the
random walker arrives at a dead end, it will then always jump to a randomly
selected page.

Another problem is posed by pages that have no incoming links: they would never be reached by the random walker, and the weight that they could provide to other pages would not be considered. This problem is also addressed by teleporting.

14
We now give the formal specification of the random walker with teleporting. At each step, the random walker makes a random jump with probability 1-q, and any of the N pages is reached by such a jump with the same probability. Therefore, an additional term (1-q)/N is added to the probability of reaching a given page. Reformulating the equation for the probabilities in matrix form results in adding an N×N matrix E with all entries equal to 1/N. This is equivalent to saying that with probability 1/N a transition between any pair of nodes (including a transition from a node to itself) is performed. Since the vector p has norm 1, i.e., the sum of its components is exactly 1, E·p = e, where e is the vector with all entries 1/N. Based on this property, an alternative formulation of the equation can be given. The method described is called PageRank and is used by Google for Web ranking. By modifying the values of the matrix E, a priori knowledge about the relative importance of pages can also be added to the ranking algorithm.
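
A sketch of this formulation (with q the probability of following a link, 1-q the teleporting probability, and e the vector with all entries 1/N, as described above):

$$P(p_i) \;=\; q \sum_{p_j:\, p_j \rightarrow p_i} \frac{P(p_j)}{\deg_{\mathrm{out}}(p_j)} \;+\; \frac{1-q}{N}$$

$$p \;=\; \bigl(qR + (1-q)E\bigr)\,p \;=\; qRp + (1-q)\,e$$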

15
With the modification of rank computation using a source of rank, we obtain for
our example a non-trivial ranking which appears to match intuition about the
relative importance of the pages in the graph well.

16
For the practical computation of the PageRank ranking an iterative approach can
be used. The vector e is used to add a source of rank. It can uniformly distribute
weights to all pages, but it could also incorporate pre-existing knowledge on the
importance of pages and bias the ranking towards them. The vector can also be
used as initial probability distribution.
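
A minimal sketch of such an iterative computation is given below; the function name, the tolerance, and the choice q = 0.85 are assumptions made for illustration, not values taken from the slides.

```python
import numpy as np

def pagerank(R, q=0.85, eps=1e-9, max_iter=100):
    """Iterative PageRank on the column-normalized transition matrix R.
    Assumes dead ends have already been handled (no all-zero columns)."""
    n = R.shape[0]
    e = np.full(n, 1.0 / n)   # source of rank; could encode a priori page importance
    p = e.copy()              # the same vector also serves as the initial distribution
    for _ in range(max_iter):
        p_next = q * R.dot(p) + (1 - q) * e
        if np.abs(p_next - p).sum() < eps:
            return p_next
        p = p_next
    return p
```

Calling pagerank(R) on the matrix R from the earlier example returns a probability vector that sums to 1 and can be used directly as ranking scores.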

17
These are the top documents from the PageRank ranking of all Web pages at
ETHZ (Data from 2001). It is interesting to see that documents related to Java
documentation receive high ranking values. This is related to the fact that these
documents have many internal cross-references.

18
PageRank is used as one metric to rank result documents in Google. At its core, Google uses text retrieval methods to retrieve relevant documents and then applies PageRank to create a more appropriate ranking. Google also uses many other methods to improve ranking, today largely based on personal information collected from users, such as search history and pages visited. The details of the ranking methods are trade secrets of the Web search engine providers.

Building a Web search engine requires solving several additional problems beyond providing a ranking system. Efficient Web crawling requires algorithms that can traverse the Web while avoiding redundant accesses to pages, as well as techniques for managing large link databases.

19
20
21
The basic idea of HITS is not to apply a single measure for the link-based relevance of a document, but to distinguish two different roles documents can play. Hub pages are pages that provide references to high-quality pages, whereas authority pages are the high-quality pages themselves. The method was conceived for understanding a larger topic in general and obtaining an overview of the essential content related to a given topic. It can nevertheless also be used as an alternative ranking model for Web search that provides a more refined quality evaluation of Web pages.

22
Hub-authority ranking is, like PageRank, based on a quantitative analysis of the link structure. Unlike PageRank, two different measures are considered: the number of incoming links as a measure of authority, and the number of links pointing to authorities as a measure of the quality of a hub. The example shows how, in this way, authoritative pages, such as university home pages, can be distinguished from hub pages, such as portal sites referencing universities.

23
As in PageRank, the approach is to let the ranking value of the page from which a hyperlink emanates determine the weight of the influence this hyperlink has on the page it points to. This results directly in a recursive formulation of the ranking values for hub and authority weights. Note that the equations of the HITS method presented here differ in subtle ways from those of PageRank (a formula sketch follows this list):
1. The weights are not split among the outgoing links; instead, each link transfers the whole hub or authority weight of the originating page.
2. Since the weights are not split, the ranking values need to be normalized.
3. The normalization uses the L2 norm, and not the L1 norm as in PageRank.
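
A sketch of these update rules in formula form (using a for the vector of authority weights, h for the vector of hub weights, and L for the link matrix as defined earlier; the exact notation on the slide may differ):

$$a \;\leftarrow\; \frac{L\,h}{\lVert L\,h \rVert_2}, \qquad h \;\leftarrow\; \frac{L^t\,a}{\lVert L^t\,a \rVert_2}$$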

24
Similarly, as for PageRank, the equations can be solved using iteration. Here we
show a possible realization of such an iterative computation, using uniformly
distributed weights for initialization.
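
A minimal sketch of such an iteration is given below, with L defined as before (L[i, j] = 1 if page j links to page i); the function name and tolerance are our assumptions.

```python
import numpy as np

def hits(L, eps=1e-9, max_iter=100):
    """Iterative HITS on link matrix L (L[i, j] = 1 if page j links to page i)."""
    n = L.shape[0]
    a = np.full(n, 1.0 / np.sqrt(n))  # uniform initial authority weights
    h = np.full(n, 1.0 / np.sqrt(n))  # uniform initial hub weights
    for _ in range(max_iter):
        a_next = L.dot(h)                  # authority: sum of hub weights of pages linking in
        a_next /= np.linalg.norm(a_next)   # L2 normalization
        h_next = L.T.dot(a_next)           # hub: sum of authority weights of pages linked to
        h_next /= np.linalg.norm(h_next)
        if np.abs(a_next - a).sum() + np.abs(h_next - h).sum() < eps:
            return a_next, h_next
        a, h = a_next, h_next
    return a, h
```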

25
When formulating the HITS equations in matrix form, using the link matrix L, we see that the authority and hub weights correspond to the Eigenvectors of the matrices LL^t and L^tL, respectively. This shows that the iterative computation with normalization of the hub and authority values will converge to the principal Eigenvectors of the matrices LL^t and L^tL.

26
27
28
One possible application of HITS is to compute the ranking on the complete Web graph, as is done with PageRank. Another way to use it (and this is how it was initially conceived) is to apply it in the context of a given query, to rerank the results by promoting results with high authority and hub values. In order to perform this operation, first all results for a query are retrieved (using a standard text retrieval model). Then the neighboring pages (either pointing to a result page or referred to by a result page) are added to the set of pages, which is then called the base set. HITS is then computed on the base set. This makes sense, since in this way we consider both referred and referring pages for the relevant documents, which helps to identify both hubs and authorities.

29
HITS suffers from similar potential problems related to the manipulation of the link
structure through link spamming as PageRank. In addition, when performing a broad
topic search and computing a base set for analysis, topic drift may occur, e.g., through
the introduction of off-topic hubs. This is a problem that is similar to the issues of topic
drift in pseudo-relevance feedback that we have observed earlier.

Both HITS and PageRank are examples of social network analysis algorithms. We will later introduce other types of algorithms for this purpose, aimed at community detection.

For an efficient implementation, link-based ranking algorithms require an efficient representation of the Web graph. This is the topic that we will explore next.

30
