
INFORMATION RETRIEVAL SYSTEM AND

THE PAGERANK ALGORITHM


OUTLINE
 Information retrieval system
 Data retrieval versus information retrieval
 Basic concepts of information retrieval
 Retrieval process

 Classical models of information retrieval


 Boolean model
 Vector model
 Probabilistic model

 Web information retrieval


 Features of Google’s search system
 Google’s architecture

 A brief analysis of PageRank algorithm


 PageRank versus HITS algorithm
WHAT IS INFORMATION RETRIEVAL?
 Information retrieval (IR) deals with the representation, storage,
organization of, and access to information items[1].

 The user must first translate their information need into a query that can
be processed by the IR system.

 The key goal of an IR system is to retrieve information which might be
useful or relevant to the user.
DATA VERSUS INFORMATION RETRIEVAL
 Data retrieval determines which documents of a collection contain the
keywords in the user query; information retrieval retrieves information
about a subject rather than data which satisfies a given query.

 In data retrieval, all objects which satisfy clearly defined conditions
are retrieved; an IR system somehow 'interprets' the contents of the
documents in a collection and ranks them according to a degree of
relevance to the user query.

 In data retrieval, a single erroneous object means total failure; in
information retrieval, the retrieved objects might be inaccurate and small
errors are ignored.

 In data retrieval, the data has a well-defined structure and semantics;
in information retrieval, the data is natural language text which is not
always well structured and could be semantically ambiguous.
BASIC CONCEPTS OF IR
The effective retrieval of relevant information is directly affected by:

 User task – the task of the user might be:
 Information or data retrieval
 Browsing
 Filtering

Figure 1: User tasks in an IR system [1]

 Logical view – the index terms that represent a document can be
extracted in two ways:
 Full text
 Index terms

Figure 2: Text operations for Index Term Logical View [1]


RETRIEVAL PROCESS
Step 1: Before the retrieval process can even be initiated, it is necessary
to define the text database. This is usually done by the manager of the
database, who specifies:
(a) the documents to be used
(b) the text operations to be performed
(c) the text model

Step 2: Once the logical view of the documents is defined, the database
manager builds an index of the text. An index is a critical data structure
because it allows fast searching over large volumes of data (e.g., an
inverted file).
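
As a rough illustration of an inverted file, here is a minimal sketch (the toy documents and names are illustrative, not taken from the original text) that maps each term to the set of document ids containing it:

from collections import defaultdict

def build_inverted_index(docs):
    # docs is assumed to be a dict {doc_id: text}; a lowercase split stands
    # in for the text operations performed on the logical view.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

docs = {1: "information retrieval system", 2: "the pagerank algorithm"}
index = build_inverted_index(docs)
print(index["retrieval"])   # {1}

Looking up a term is then a single dictionary access, which is what makes searching large collections fast.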
Figure 3: Retrieval process [1]
Step 3: The user then specifies an information need, which is parsed and
transformed by the same text operations that were applied to the text.
Query operations are applied to produce the actual query, which is then
processed against the index to obtain the retrieved documents. Fast query
processing is made possible by the index structure previously built.

Step 4: Before being sent to the user, the retrieved documents are ranked
according to a likelihood of relevance.

Step 5: The user then examines the set of ranked documents in search of
useful information. At this point, the user might pinpoint a subset of the
documents seen as definitely of interest and initiate a user feedback
cycle [1].
IR MODELS
 The central problem regarding IR systems is the issue of predicting which
documents are relevant and which are not.
 A ranking algorithm operates according to basic premises regarding the
notion of document relevance.
 The IR model adopted determines the predictions of what is relevant and
what is not.

Figure 4: Classification of the various IR models [1]


FORMAL DEFINITION OF IR
An information retrieval model is a quadruple
{D, Q, F, R(qi, dj)}
where:
 D is a set composed of logical views (or representations) for the
documents in the collection.
 Q is a set composed of logical views (or representations) for the user
information needs (called queries).
 F is a framework for modeling document representations, queries, and
their relationships.
 R(qi, dj) is a ranking function which associates a real number with a
query qi ∈ Q and a document representation dj ∈ D. Such a ranking
defines an ordering among the documents with regard to the query qi.
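
To make the quadruple concrete, here is a minimal sketch (the types and the trivial scoring function are illustrative assumptions, not from the source): D and Q are whatever representations the framework F prescribes, and R maps a query-document pair to a real number.

from typing import Callable, List

Doc = List[str]        # a document represented by its index terms
Query = List[str]      # a query represented by its index terms
RankingFn = Callable[[Query, Doc], float]

def overlap_score(q: Query, d: Doc) -> float:
    # A trivial R(q, d): the number of query terms present in the document.
    return float(len(set(q) & set(d)))

print(overlap_score(["pagerank", "algorithm"], ["the", "pagerank", "algorithm"]))  # 2.0

Each classical model below corresponds to a different choice of representations and ranking function.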
CLASSICAL MODEL
 Classic models in IR consider that each document is described by a set
of representative keywords called index terms, which are used to index
and summarize the document contents.
 However, distinct index terms have varying relevance when used to
describe the document contents.
 In these models, this effect is captured through the assignment of
numerical weights to each index term of a document.

 The main classical models are:


 Boolean Model
 Vector Model
 Probabilistic Model
STRUCTURED MODEL
 Retrieval models which combine information on text content with
information on the document structure are called structured text retrieval
models [1].

 There are two models for structured text retrieval:-


 Non-overlapping lists model
 Proximal nodes model

Figure 5: List structure for (a) Non-overlapping lists model (b) Proximal nodes model [1]
BROWSING MODEL

 Browsing is a process of retrieving information whose main objectives


are not clearly defined in the beginning and whose purpose might
change during the interaction with the system.

 For browsing, there are 3 models :-


 Flat model
 Structure guided model
 Hypertext model
BOOLEAN MODEL
 The Boolean model is a simple retrieval model based on set theory and
Boolean algebra.

 The queries are specified as Boolean expressions which have precise


semantics.

 The Boolean model considers that index terms are present or absent in
a document. As a result, the index term weights are assumed to be all
binary, i.e., wi,j ∈ {0, 1}.

 A query q is composed of index terms linked by three connectives: not,


and, or.

 A query is essentially a conventional Boolean expression which can be
represented as a disjunction of conjunctive vectors, i.e., in disjunctive
normal form (DNF). The binary weighted vectors are called the conjunctive
components of qdnf.
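
Because index term weights are binary, Boolean retrieval reduces to set operations over an inverted index. A minimal sketch (the toy index and query are illustrative only):

# Inverted index: term -> set of document ids containing it.
index = {
    "information": {1, 2, 4},
    "retrieval":   {1, 4},
    "pagerank":    {3, 4},
}

# Query: information AND (retrieval OR pagerank) AND NOT pagerank
result = (index["information"] & (index["retrieval"] | index["pagerank"])) - index["pagerank"]
print(result)   # {1}

Every document either satisfies the expression or it does not; there is no ranking, which is exactly the binary decision criterion criticized below.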
ADVANTAGES:-
1. The clean formalism behind the model.
2. Its simplicity.

DISADVANTAGES:-
1. Its retrieval strategy is based on a binary decision criterion and
behaves more as a data retrieval model.
2. The exact matching may lead to retrieval of too few or too many
documents.
3. It is not simple to translate an information need into a Boolean
expression.
4. The Boolean expressions actually formulated by users are often
quite simple.

APPLICATIONS:-
Commercial document database systems
VECTOR MODEL
 The vector model was proposed by Salton and McGill.

 This model applies a partial matching strategy by assigning
non-binary weights to index terms in queries and in documents.

 These term weights are ultimately used to compute the degree of
similarity between each document stored in the system and the user
query (a small cosine-similarity sketch follows below).

 In the vector model,
 the weight wi,j associated with a pair (index term, document) is
positive and non-binary;
 the index terms in the query are also weighted.
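
As a rough sketch of the similarity computation (raw term frequencies are used instead of tf-idf weights purely to keep the example short; the names are illustrative):

import math
from collections import Counter

def cosine_similarity(query_terms, doc_terms):
    # Cosine of the angle between the term-frequency vectors of query and document.
    q, d = Counter(query_terms), Counter(doc_terms)
    dot = sum(q[t] * d[t] for t in q)
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0

doc = "the pagerank algorithm ranks pages".split()
print(cosine_similarity("pagerank algorithm".split(), doc))   # about 0.63

Documents are then sorted by this score, which gives the partial matching and ranking behaviour described above.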
 ADVANTAGES:
 Its term-weighting scheme improves retrieval performance.

 Its partial matching strategy allows retrieval of documents that


approximate the query conditions.

 Its cosine ranking formula sorts the documents according to their


degree of similarity to the query.

 It is a simple and resilient ranking strategy.

 DISADVANTAGE:
 Index terms are assumed to be mutually independent.
PROBABILISTIC MODEL
 The classic probabilistic model was introduced in 1976 by Robertson and
Sparck Jones.

 The probabilistic model attempts to capture the IR problem within a


probabilistic framework.

 BASIC IDEA: Given a user query, there is a set of documents which
contains exactly the relevant documents, referred to as the ideal answer
set. Given a description of this ideal answer set, we retrieve exactly
the documents that satisfy it.

 Thus, the querying process is a process of specifying the properties of
the ideal answer set.
Assumption (Probabilistic Principle):
'Given a user query q and a document dj in the collection, the
probabilistic model tries to estimate the probability that the user will
find the document dj relevant.
 The model assumes that this probability of relevance depends on the
query and the document representations only.
 Further, the model assumes that there is a subset of all documents
which the user prefers as the answer set for the query q. This ideal
answer set, labeled R, should maximize the overall probability of
relevance to the user.
 Documents in the set R are predicted to be relevant to the query;
documents not in this set are predicted to be non-relevant.'

 This assumption does not state explicitly:
 how to compute the probabilities of relevance
 the sample space over which these probabilities are defined
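
For reference, the way this is usually made operational in the classic formulation (a standard result stated here for completeness, not taken from this document) is to rank documents by the odds of relevance; with Bayes' rule and the term-independence assumption over binary weights, the ranking reduces to a sum of term weights:

% sim(d_j, q) ranks documents by the odds that d_j is relevant to q
\mathrm{sim}(d_j, q) \;\propto\;
  \sum_{i} w_{i,q}\, w_{i,j}
  \left(
    \log \frac{P(k_i \mid R)}{1 - P(k_i \mid R)}
    + \log \frac{1 - P(k_i \mid \bar{R})}{P(k_i \mid \bar{R})}
  \right)

Here P(k_i | R) is the probability that term k_i appears in a relevant document and P(k_i | R̄) the probability that it appears in a non-relevant one; these must be estimated, which is the practical difficulty noted below.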
ADVANTAGES:-
 The documents are ranked in decreasing order of their probability of
being relevant.

DISADVANTAGES:-
 There is a need to guess the initial separation of documents into
relevant and non-relevant sets.

 It does not take into account the frequency with which an index term
occurs inside a document.

 The adoption of the independence assumption for index terms.


COMPARISON OF THE CLASSICAL MODELS

BOOLEAN MODEL
 It evaluates queries by evaluating a Boolean expression.
 Weights are binary; a document is either relevant or irrelevant.
 It is simple to evaluate based on the query and the document.
 Performance is not that good.

VECTOR MODEL
 It uses the concept of index term weights and partial matching to match
a document to a query.
 Index terms are weighted, so a ranking is created based on these weights
(using similarity).
 It is more complex than the binary case, as the index term weighting
needs to be done.
 Performance is considered to be optimal.

PROBABILISTIC MODEL
 It evaluates queries using the ideal answer set and probabilistic index
term weights.
 Weights are binary; initially a document either belongs to the ideal set
or is considered irrelevant.
 This is the most complex model, since neither the weights nor the ideal
set is initially defined.
 Performance is proved to be optimal; however, in practice it may become
impractical.
WEB IR VERSUS TRADITIONAL IR
Modeling for the Web differs from modeling for traditional document
collections for the following reasons:

o Web is huge

o Dynamic nature of Web

o Web is self organized

o Web growth is fast

o Web is hyperlinked
GOOGLE SEARCH ENGINE
 Google, the most popular search engine, came into existence in 1998.

 It was developed by Sergey Brin and Lawrence Page as a solution for


the problem of Web information retrieval.

 DESIGN GOALS OF GOOGLE


 Improved search quality
 Academic search engine
 Usage
 Architecture
HOW GOOGLE SEARCH WORKS
 STEP 1: CRAWLING

 STEP 2: COMPRESSING

 STEP 3: INDEXING

 STEP 4: PAGERANK CALCULATION

 STEP 5: SORTING

 STEP 6: SEARCHING
GOOGLE SYSTEM FEATURES

1. ANCHOR TEXT-Google associates the text of the link with 2 things:

 The page that the link is on

 The page the link points to

2. THE PAGERANK ALGORITHM- PageRank extends the idea of


citations by not counting links from all pages equally and by
normalizing by the number of links on a page.

 We assume page A has pages T1, T2, ..., Tn which point to it (i.e., are
citations). The parameter d is a damping factor which can be set
between 0 and 1; it is usually set to 0.85. Also, C(A) is defined as the
number of links going out of page A. The PageRank of page A is then
given as follows:
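
PR(A) = (1 - d) + d ( PR(T1)/C(T1) + PR(T2)/C(T2) + ... + PR(Tn)/C(Tn) )

(This is the formula as stated in Brin and Page's original paper; a commonly used variant divides the (1 - d) term by the total number of pages so that the PageRanks sum to 1.)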
 ADVANTAGES OF USING PAGERANK ALGORITHM-

 Random Surfer Model is used as the intuitive justification of


PageRank.

 Pages that are well cited from many places around the Web are
worth looking at. Also, pages that have perhaps only one citation
from a well known site are also generally worth looking at.

 ADVANTAGES OF USING ANCHOR TEXT-


 Anchors often provide more accurate descriptions of Web pages than the
pages themselves.

 Anchors may exist for documents which cannot be indexed by a


text-based search engine, such as images, programs, and
databases.
MATHEMATICS OF PAGERANK
 The PageRank Thesis: A page is important if it is pointed to by other
important pages.
 ORIGINAL FORMULA – The PageRank of a page Pi, denoted r(Pi), is the sum
of the PageRanks of all pages pointing to Pi, each divided by the number
of outlinks of that page:

r(Pi) = sum over Pj in B_Pi of r(Pj) / |Pj|

where B_Pi is the set of pages pointing to Pi and |Pj| is the number of
outlinks from page Pj.
Figure 6: Example of PageRank calculation on web pages

The problem is that the PageRanks of pages inlinking to page Pi are


unknown. So, an iterative procedure was used.
 INITIAL ASSUMPTION: In the beginning, all pages have an equal PageRank
of 1/n, where n is the number of pages in Google's index of the Web. So,
the iterative formula is:

r(k+1)(Pi) = sum over Pj in B_Pi of r(k)(Pj) / |Pj|

 Collecting the ranks in a row vector π(k)T, this can also be written as:

π(k+1)T = π(k)T H

where H is the row-normalized hyperlink matrix such that Hij = 1/|Pi| if
there is a link from page Pi to page Pj, and Hij = 0 otherwise.

 OBSERVATIONS:
 Each iteration of the equation involves one vector-matrix
multiplication, which generally requires O(n²) computation, where n is
the dimension of the n×n matrix H.

 H is very sparse, because most web pages link to only a handful of
other pages. Hence, each iteration requires only O(nnz(H)) computation,
where nnz(H) is the number of nonzeros in H; since the average number of
outlinks per page is a small constant, this reduces to O(n) effort.

 The iterative method is the classical power method applied to the
matrix H.

 H looks a lot like a stochastic transition probability matrix for a
Markov chain. The dangling nodes of the network, those nodes with no
outlinks, create zero rows in the matrix. All the other rows, which
correspond to the non-dangling nodes, are stochastic. Thus, H is called
sub-stochastic [2].
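
To illustrate why the sparsity matters, here is a small sketch of one iteration as a sparse vector-matrix product (the 4-page link structure is made up for illustration; scipy is assumed to be available):

import numpy as np
from scipy.sparse import csr_matrix

# Toy web of 4 pages; links are (source, target) pairs. Page 3 has no
# outlinks, so it is a dangling node and row 3 of H will be all zeros.
links = [(0, 1), (0, 2), (1, 2), (2, 0)]
n = 4
out_degree = np.bincount([s for s, _ in links], minlength=n)

rows = [s for s, _ in links]
cols = [t for _, t in links]
vals = [1.0 / out_degree[s] for s, _ in links]
H = csr_matrix((vals, (rows, cols)), shape=(n, n))   # row-normalized, sub-stochastic

pi = np.full(n, 1.0 / n)       # initial PageRank vector: 1/n for every page
pi_next = H.T @ pi             # one iteration, pi(k+1)T = pi(k)T H; cost is O(nnz(H))
print(pi_next, pi_next.sum())  # the sum is below 1: rank 'leaks' at the dangling node

The leaking rank at the dangling node is exactly the problem that the stochasticity adjustment below repairs.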
 PROBLEMS WITH THE ITERATIVE PROCESS-
1. Problem of rank sinks:
 Rank sinks are pages that accumulate more and more PageRank at each
iteration.
 This can be exploited for search engine optimization (SEO) through
link farming.
 Ranking nodes by their PageRank values is difficult when a majority of
the nodes are tied with PageRank 0.
 It is preferable for all PageRanks to be positive.

Figure 7: (a) Rank Sink (b) Cycle

2. Problem of cycles:
 In a cycle, page 1 points only to page 2 and vice versa, which creates
an infinite loop.
 The iterates will not converge no matter how long the process is run,
since π(k)T will flip-flop between two values indefinitely.
ADJUSTMENTS TO THE MODEL-
So, to counter the problems, Brin and Page made use of the Random
Surfer Model.

 Imagine a web surfer who bounces along randomly following the hyperlink
structure of the Web; when he arrives at a page with several outlinks,
he chooses one at random, hyperlinks to this new page, and continues
this random decision process indefinitely.
 In the long run, the proportion of time the random surfer spends on a
given page is a measure of the relative importance of that page.
 Unfortunately, this random surfer encounters some problems. He gets
caught whenever he enters a dangling node, e.g., PDF files, image files,
data tables, etc. [3]

 To fix this, Brin and Page define their first adjustment, which we call
the stochasticity adjustment: the zero rows (0T) of H are replaced with
1/n eT, thereby making the matrix stochastic. Now, the random surfer can
hyperlink to any page at random after reaching a dangling node. The
resulting stochastic matrix is called S:

S = H + a (1/n eT)

where a is the dangling node vector (ai = 1 if page i is a dangling node,
and 0 otherwise).

 This adjustment guarantees that S is stochastic, but it alone cannot
guarantee the desired convergence results. So a primitivity adjustment
was made to make the matrix irreducible and aperiodic (so that a unique
PageRank vector exists).

 When the random surfer abandons the hyperlink method by entering a new
destination directly, he "teleports" to the new page, where he begins
hyperlink surfing again, until the next teleportation, and so on.
 To model this activity mathematically, Brin and Page invented a new
matrix G, such that

G = α S + (1 - α) (1/n) eeT

where
α is the teleportation (damping) factor, with α ∈ (0, 1),
G is called the Google matrix, and
E = (1/n) eeT is the teleportation matrix.

 The teleporting is random because E is uniform, meaning the surfer is
equally likely, when teleporting, to jump to any page.
 So, Google's adjusted PageRank method is:

π(k+1)T = π(k)T G

which is simply the power method applied to G.
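
A short sketch of this power iteration on a toy graph (it forms the dense G explicitly, which is only feasible for tiny examples; a real implementation would keep H sparse and apply the adjustments implicitly):

import numpy as np

# The 4-page toy graph from the earlier sketch; page 3 is a dangling node.
H = np.array([
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],   # dangling node: all-zero row
])
n = H.shape[0]
alpha = 0.85

# Stochasticity adjustment: S = H + a (1/n eT).
a = (H.sum(axis=1) == 0).astype(float)      # dangling node indicator vector
S = H + np.outer(a, np.full(n, 1.0 / n))

# Primitivity adjustment: G = alpha*S + (1 - alpha)*(1/n) eeT.
G = alpha * S + (1 - alpha) / n

# Power method: pi(k+1)T = pi(k)T G, starting from the uniform vector.
pi = np.full(n, 1.0 / n)
for _ in range(100):
    pi = pi @ G
print(pi, pi.sum())   # the PageRank vector; its entries sum to 1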


HITS ALGORITHM
 HITS (Hypertext Induced Topic Search) was invented by Jon Kleinberg in
1998 and uses the Web's hyperlink structure to score pages.
 HITS produces two popularity scores per page and is query-dependent.
HITS thinks of web pages as authorities and hubs.

 An authority is a page with many inlinks, and a hub is a page with
many outlinks.
 The main criterion of HITS is: good authorities are pointed to by good
hubs, and good hubs point to good authorities.

 Every page i has both an authority score xi and a hub score yi. If E is
the set of all directed edges in the web graph, then, for k = 1, 2, 3, ...,

xi(k) = sum of yj(k-1) over all pages j with (j, i) in E
yi(k) = sum of xj(k) over all pages j with (i, j) in E

given that each page has somehow been assigned an initial authority
score xi(0) and hub score yi(0).
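
In matrix terms, with L the adjacency matrix of the neighbourhood graph, the update is x = LT y followed by y = L x, normalized at each step. A minimal sketch on a made-up graph (the adjacency matrix and iteration count are illustrative):

import numpy as np

# L[i, j] = 1 if page i links to page j (the query's neighbourhood graph).
L = np.array([
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 0],
], dtype=float)

x = np.ones(L.shape[0])   # authority scores
y = np.ones(L.shape[0])   # hub scores
for _ in range(50):
    x = L.T @ y                 # authority = sum of hub scores of pages pointing to it
    y = L @ x                   # hub = sum of authority scores of pages it points to
    x /= np.linalg.norm(x)      # normalize so the iterates converge
    y /= np.linalg.norm(y)
print("authorities:", x)
print("hubs:", y)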
HITS VERSUS PAGERANK

Scoring criteria
 HITS: good authorities are pointed to by good hubs, and good hubs point
to good authorities.
 PageRank: a web page is important if it is pointed to by other important
pages.

Number of scores
 HITS: dual rankings, (a) one with the most authoritative documents
related to the query and (b) one with the most "hubby" documents.
 PageRank: presents only one score.

Query independence
 HITS: the score is calculated after building the neighbourhood graph for
the query, so it is query-dependent.
 PageRank: the score is query-independent.

Resilience to spamming
 HITS: susceptible to spamming, since the addition of pages slightly
affects the ranking.
 PageRank: since PageRank is able to isolate spam, it is resilient to
spamming.
FUTURE WORK
 Creating spam-resistant ranking algorithms:
 The first proposal considers each page one at a time and asks, "What
proportion of this page's outlinking pages point back to it?" If this
value exceeds a threshold, we can suspect the presence of a link farm
(a small sketch of this check follows below).
 The second proposal is to build a score that is the "opposite" of
PageRank, called BadRank, for each page. The actual ranking would then
be done using the difference of these two quantities.
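
A rough sketch of the reciprocal-link check described above (the graph representation and the threshold of 0.8 are illustrative assumptions, not values proposed in the text):

def reciprocal_link_ratio(page, outlinks):
    # Proportion of the pages that `page` links to which link back to it.
    # `outlinks` is a dict {page: set of pages it links to}.
    targets = outlinks.get(page, set())
    if not targets:
        return 0.0
    back = sum(1 for t in targets if page in outlinks.get(t, set()))
    return back / len(targets)

outlinks = {"a": {"b", "c"}, "b": {"a"}, "c": {"a"}}
ratio = reciprocal_link_ratio("a", outlinks)
print(ratio, "possible link farm" if ratio > 0.8 else "looks normal")   # 1.0 possible link farm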

 Intelligent agents:
 An intelligent agent is a software robot designed to retrieve specific
information automatically. Ranking systems need to account for such
crawlers in a way that does not cause privacy issues.
