Information Retrieval System and The PageRank Algorithm
The user must first translate this information need into a query which can
be processed by the IR system.
Data retrieval: the data has a well-defined structure and semantics.
Information retrieval: the data is natural-language text, which is not always
well structured and can be semantically ambiguous.
BASIC CONCEPTS OF IR
The effective retrieval of relevant information is directly affected by the
user task and the logical view of the documents adopted by the retrieval
system [1].
Step 2: Once the logical view of the documents is defined, the database
manager builds an index of the text. An index is a critical data structure
because it allows fast searching over large volumes of data (e.g., an
inverted file).
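As a rough sketch of the idea (the toy corpus and whitespace tokenization below are illustrative, not what any production system does), an inverted file maps each term to the set of documents that contain it, so a lookup touches only the postings for the query terms rather than every document:

from collections import defaultdict

docs = {
    1: "information retrieval finds relevant documents",
    2: "the pagerank algorithm ranks web pages",
    3: "an inverted index enables fast retrieval",
}

# Build the inverted index: term -> set of doc ids (postings)
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

# Looking up a term is now a dictionary access, not a scan of all documents
print(sorted(index["retrieval"]))   # -> [1, 3]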
Figure 3: Retrieval Process [1]
Step 3: The user then specifies an information need, which is parsed and
transformed by the same text operations applied to the text. Next, query
operations are applied to produce the actual query, which is then
processed to obtain the retrieved documents. Fast query processing is
made possible by the index structure built previously.
Step 4: Before being sent to the user, the retrieved documents are ranked
according to a likelihood of relevance.
Step 5: The user then examines the set of ranked documents in search of
useful information. At this point, the user might pinpoint a subset
of the documents seen as definitely of interest and initiate a user
feedback cycle [1].
IR MODELS
The central problem regarding IR systems is the issue of predicting which
documents are relevant and which are not.
A ranking algorithm operates according to basic premises regarding the
notion of document relevance.
The IR model adopted determines the predictions of what is relevant and
what is not.
Figure 5: List structure for (a) Non-overlapping lists model (b) Proximal nodes model [1]
BOOLEAN MODEL
The Boolean model considers that index terms are present or absent in
a document. As a result, the index term weights are assumed to be all
binary, i.e., wi,j ∈ {0, 1}.
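A minimal sketch of Boolean retrieval over such binary term assignments (the postings sets below are made up for illustration): queries are answered with set operations, which is why the output is an unranked set of documents:

# Postings sets for three illustrative terms (doc ids are hypothetical)
postings = {
    "retrieval": {1, 3},
    "index": {3},
    "pagerank": {2},
}

# Boolean AND -> intersection, OR -> union, NOT -> difference
print(postings["retrieval"] & postings["index"])     # AND: {3}
print(postings["retrieval"] | postings["pagerank"])  # OR:  {1, 2, 3}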
DISADVANTAGES:-
1. Its retrieval strategy is based on a binary decision criterion and
behaves more as a data retrieval model.
2. The exact matching may lead to retrieval of too few or too many
documents.
3. It is not simple to translate an information need into a Boolean
expression.
4. The Boolean expressions actually formulated by users are often
quite simple.
APPLICATIONS:-
Commercial document database systems
VECTOR MODEL
The vector model was proposed by Gerard Salton and Michael McGill.
DISADVANTAGE:
Index terms are assumed to be mutually independent.
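A small sketch of the ranking idea behind the vector model, assuming plain term-frequency weights over a three-word vocabulary (real systems typically use tf-idf weighting): documents and the query are vectors, and documents are ranked by cosine similarity:

import math

def cosine(u, v):
    # Cosine of the angle between two term-weight vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Term-frequency vectors over the vocabulary (retrieval, index, pagerank)
doc_vectors = {1: [2, 0, 0], 2: [0, 0, 3], 3: [1, 2, 0]}
query = [1, 1, 0]  # query: "retrieval index"

# Rank documents by decreasing similarity to the query
ranking = sorted(doc_vectors, key=lambda d: cosine(doc_vectors[d], query),
                 reverse=True)
print(ranking)  # -> [3, 1, 2]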
PROBABILISTIC MODEL
The classic probabilistic model was introduced in 1976 by Robertson and
Sparck Jones.
DISADVANTAGES:-
There is a need to guess the initial separation of documents into
relevant and non-relevant sets.
It does not take into account the frequency with which an index term
occurs inside a document.
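To illustrate how that initial guess is usually made, the sketch below uses the common textbook starting assumptions (P(term | relevant) = 0.5 and P(term | non-relevant) ≈ ni/N); the collection sizes are made up:

import math

N = 1000   # total documents in the collection (illustrative)
n_i = 50   # documents containing term i (illustrative)

# Initial guesses: P(term | relevant) = 0.5, P(term | non-relevant) = n_i / N
p = 0.5
q = n_i / N

# Robertson/Sparck Jones term weight; refined as relevance feedback arrives
weight = math.log((p * (1 - q)) / (q * (1 - p)))
print(round(weight, 3))  # -> 2.944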
Comparing the three classic models:
Boolean model: weights are binary; a document is either relevant or irrelevant.
Vector model: index terms are weighted, so a ranking is created based on these
weights (using similarity).
Probabilistic model: weights are binary; initially, a document either belongs
to the ideal answer set or is considered irrelevant.
o The Web is huge.
o The Web is hyperlinked.
GOOGLE SEARCH ENGINE
Google, the most popular search engine, came into existence in 1998.
STEP 2: COMPRESSING
STEP 3: INDEXING
STEP 5: SORTING
STEP 6: SEARCHING
GOOGLE SYSTEM FEATURES
Pages that are well cited from many places around the Web are
worth looking at. Pages that have perhaps only one citation
from a well-known site are also generally worth looking at.
2. Problem of Cycles-
In a page cycle, page 1 points only to page 2 and vice versa,
which creates an infinite loop or cycle.
The iterates will not converge no matter how long the process
is run, since π(k)T will flip-flop indefinitely.
ADJUSTMENTS TO THE MODEL-
So, to counter the problems, Brin and Page made use of the Random
Surfer Model.
To fix this, Brin and Page define their first adjustment, which we call the
stochasticity adjustment: the 0T (all-zero) rows of H are replaced with
(1/n)eT, thereby making H stochastic. Now the random surfer can
hyperlink to any page at random. The resulting stochastic matrix is called S.
So,
S = H + a(1/n eT),
where a is the dangling-node vector: ai = 1 if page i has no outlinks, and
0 otherwise.
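A hedged power-iteration sketch of these adjustments (the 3-page link structure is made up, and the damping factor 0.85 is the commonly quoted choice, not mandated by the text above): dangling rows of H are patched to form S, damping makes the iteration converge, and the iterates settle on a PageRank vector instead of flip-flopping:

import numpy as np

# Hyperlink matrix H for a toy 3-page web: row i spreads page i's vote
# over its outlinks; page 2 is dangling (all-zero row).
H = np.array([[0.0, 0.5, 0.5],
              [0.0, 0.0, 0.0],
              [1.0, 0.0, 0.0]])
n = H.shape[0]

# Stochasticity adjustment: S = H + a (1/n) eT, where a flags dangling rows
a = (H.sum(axis=1) == 0).astype(float).reshape(-1, 1)
S = H + a @ (np.ones((1, n)) / n)

# Damping (the second, primitivity adjustment) so the iteration converges
alpha = 0.85
G = alpha * S + (1 - alpha) * np.ones((n, n)) / n

# Power iteration: pi(k)T = pi(k-1)T G, starting from the uniform vector
pi = np.ones(n) / n
for _ in range(100):
    pi = pi @ G
print(pi.round(3))  # PageRank vector; entries sum to 1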
Given that each page has somehow been assigned an initial authority
score x(0) and hub score y(0), HITS iteratively refines these scores:
x(k) = LT y(k-1) and y(k) = L x(k), where L is the adjacency matrix of
the Web graph.
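A minimal sketch of the HITS iteration from that starting point (the adjacency matrix L and the uniform initial scores are illustrative): authority scores are refined as x(k) = LT y(k-1) and hub scores as y(k) = L x(k), with normalization at each step:

import numpy as np

# Adjacency matrix L of a toy 3-page graph: L[i, j] = 1 if page i links to j
L = np.array([[0, 1, 1],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)
n = L.shape[0]

# Initial authority and hub scores x(0), y(0)
x = np.ones(n) / n
y = np.ones(n) / n

for _ in range(50):
    x = L.T @ y                 # good authorities are pointed to by good hubs
    y = L @ x                   # good hubs point to good authorities
    x /= np.linalg.norm(x, 1)   # normalize so the scores stay bounded
    y /= np.linalg.norm(y, 1)

print(x.round(3), y.round(3))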
HITS VERSUS PAGERANK
Scoring criteria:
HITS: good authorities are pointed to by good hubs, and good hubs point to
good authorities.
PageRank: a webpage is important if it is pointed to by other important pages.
Intelligent Agent-
An intelligent agent is a software robot designed to retrieve specific
information automatically. Such crawlers must be designed so that they
do not cause privacy issues.