IRS Module 2
IR Models
CHAPTER 2
Syllabus
Modeling: Taxonomy of Information Retrieval Models, Retrieval: Formal
Characteristics of IR models, Classic Information Retrieval, Alternative
Set Theoretic models, Probabilistic Models, Structured text retrieval
Models, models for Browsing.
Self-learning Topics: Terrier
2.1 INTRODUCTION
(1) Boolean: Documents and queries are represented as sets of index terms
in the Boolean model. As a result, we describe the model as set theoretic.
(2) Vector: Documents and queries are represented as vectors in a
t-dimensional space in the vector model. As a result, we describe the
model as algebraic.
(3) Probabilistic: The framework for modeling document and query
representations in the probabilistic model is based on probability theory.
As a result, we refer to the model as probabilistic, as its name suggests.
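The set-theoretic view of the Boolean model can be illustrated with a minimal sketch. The toy collection, the document ids, and the helper `matching` are hypothetical and chosen only for illustration; a query is evaluated with ordinary set operations.

```python
# Hypothetical toy collection: each document is the set of its index terms.
docs = {
    "d1": {"information", "retrieval", "model"},
    "d2": {"boolean", "model"},
    "d3": {"information", "boolean", "retrieval"},
}

def matching(term):
    """Return the ids of the documents whose term set contains `term`."""
    return {doc_id for doc_id, terms in docs.items() if term in terms}

# Query: information AND (NOT model), evaluated with set operations.
all_ids = set(docs)
boolean_answer = matching("information") & (all_ids - matching("model"))
print(boolean_answer)
```

A document either satisfies the query or it does not, so the answer is an unranked set.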
For each type of classic model (i.e., set-theoretic, algebraic, and
probabilistic), alternative modeling paradigms have been proposed over the
years.
We make a distinction between the fuzzy and extended Boolean models
when it comes to alternative set-theoretic models.
[Figure: taxonomy of IR models. Recoverable labels: Algebraic (Generalized
Vector, Latent Semantic Indexing, Neural Networks); Probabilistic (Inference
Network, Belief Network); Structured Models (Non-overlapping Lists, Proximal
Nodes); Browsing (Flat, Structure Guided, Hypertext)]
Q is a set composed of logical views (or representations) for the user
information needs. Such representations are called queries.
F is a framework for modeling document representations, queries, and
their relationships.
GQ. Explain Classic Information Retrieval.
Basic Concepts
Each document is described by a group of representative keywords
called index terms.
An index term is simply a word from the document whose semantics
makes it easier to recall the document's core ideas.
Index terms are used to index and summarise the contents of the document.
For an index term which does not appear in the document text, $w_{i,j} = 0$.
With the document $d_j$ is associated an index term vector $\vec{d_j}$
represented by $\vec{d_j} = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})$. Further, let
$g_i$ be a function that returns the weight associated with the index term
$k_i$ in any t-dimensional vector (i.e., $g_i(\vec{d_j}) = w_{i,j}$).
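As a small illustration of the index term vector and the function that returns the weight of a given term, consider the following sketch. The vocabulary, the weights, and the helper name `g` are hypothetical values chosen for the example.

```python
# Hypothetical vocabulary k_1 .. k_t with t = 3.
index_terms = ["information", "retrieval", "model"]

# Index term vector of one document d_j: (w_1j, w_2j, w_3j).
# A weight of 0.0 means the term does not appear in the document text.
d_j = [0.5, 0.8, 0.0]

def g(i, vector):
    """Return the weight of index term k_i (1-based) in a t-dimensional vector."""
    return vector[i - 1]

for i, term in enumerate(index_terms, start=1):
    print(term, g(i, d_j))
```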
2.5.1 Boolean Model
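$$sim(d_j, q) = \frac{\vec{d_j} \cdot \vec{q}}{|\vec{d_j}| \times |\vec{q}|} = \frac{\sum_{i=1}^{t} w_{i,j} \times w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^2} \times \sqrt{\sum_{i=1}^{t} w_{i,q}^2}}$$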
where $|\vec{d_j}|$ and $|\vec{q}|$ are the norms of the document and query
vectors. The factor $|\vec{q}|$ does not affect the ranking (i.e., the
ordering of the documents) because it is the same for all documents. The
factor $|\vec{d_j}|$ provides a normalization in the space of the documents.
Weights are given by $w_{i,j} = f_{i,j} \times \log \frac{N}{n_i}$
(3) Its cosine ranking formula sorts the documents according to their degree
of similarity to the query.
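The weighting and cosine ranking of the vector model can be sketched as follows. This is a minimal illustration under stated assumptions: the three documents are hypothetical, the raw term count is used as the frequency factor, the natural logarithm is used, and the query is weighted with the same scheme as the documents.

```python
import math
from collections import Counter

# Hypothetical toy collection (already tokenised).
docs = {
    "d1": "information retrieval system".split(),
    "d2": "database system design".split(),
    "d3": "information theory".split(),
}
N = len(docs)
# n_i: number of documents that contain term k_i.
doc_freq = Counter(term for tokens in docs.values() for term in set(tokens))

def tfidf(tokens):
    """Weight each known term as f * log(N / n_i); raw counts are an assumption."""
    counts = Counter(tokens)
    return {t: f * math.log(N / doc_freq[t]) for t, f in counts.items() if t in doc_freq}

def cosine(u, v):
    """Cosine of the angle between two sparse weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

query = tfidf("information retrieval".split())
ranking = sorted(docs, key=lambda d: cosine(tfidf(docs[d]), query), reverse=True)
print(ranking)
```

Unlike the Boolean model, a document that matches the query only partially (here `d3`) still receives a non-zero score and a place in the ranking.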
All index term weights are binary, i.e., $w_{i,j} \in \{0,1\}$ and $w_{i,q} \in \{0,1\}$.
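Let $P(R \mid d_j)$ be the probability that the document $d_j$ is relevant to
the query $q$.
Let $P(\bar{R} \mid d_j)$ be the probability that the document $d_j$ is
non-relevant to the query $q$.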
The similarity $sim(d_j, q)$ of the document $d_j$ to the query $q$ is defined
as the ratio
$$sim(d_j, q) = \frac{P(R \mid d_j)}{P(\bar{R} \mid d_j)}$$
Using Bayes' rule,
$$sim(d_j, q) = \frac{P(d_j \mid R) \times P(R)}{P(d_j \mid \bar{R}) \times P(\bar{R})}$$
$P(d_j \mid R)$ stands for the probability of randomly selecting the document
$d_j$ from the set $R$ of relevant documents.
$P(R)$ stands for the probability that a document randomly selected from
the entire collection is relevant.
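The Bayes' rule step can be spelled out as a short derivation. Here $P(d_j)$, the probability of randomly selecting $d_j$ from the entire collection, is a symbol introduced only for this sketch; it cancels in the ratio.

```latex
\begin{aligned}
P(R \mid d_j) &= \frac{P(d_j \mid R)\, P(R)}{P(d_j)}, \qquad
P(\bar{R} \mid d_j) = \frac{P(d_j \mid \bar{R})\, P(\bar{R})}{P(d_j)} \\[4pt]
sim(d_j, q) &= \frac{P(R \mid d_j)}{P(\bar{R} \mid d_j)}
             = \frac{P(d_j \mid R)\, P(R)}{P(d_j \mid \bar{R})\, P(\bar{R})}
\end{aligned}
```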
descriptions that are only loosely related to the actual semantic contents
on fuzzy theory.
Basics of Fuzzy Set Theory
Fuzzy set theory is an extension of classical set theory.
Elements have a varying degree of membership. A logic based on two
truth values, True and False, is sometimes insufficient when describing
human reasoning.
Fuzzy Logic uses the whole interval between 0 (false) and 1 (true) to
describe human reasoning.
A fuzzy set A is characterised by a membership function $\mu_A$ such that
the function maps every element of the universe of discourse X to the
interval [0, 1].
The mapping is written as: $\mu_A(x): X \to [0, 1]$.
Fuzzy Logic is capable of handling inherently imprecise (vague, inexact,
rough, or inaccurate) concepts.
A fuzzy set is defined as follows: if X is a universe of discourse and x is
a particular element of X, then a fuzzy set A defined on X can be
written as a collection of ordered pairs $A = \{(x, \mu_A(x)),\ x \in X\}$.
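A fuzzy set written as a collection of ordered pairs maps naturally onto a dictionary from elements to membership degrees. The sketch below is only illustrative: the universe `X`, the set `tall`, and its degrees are hypothetical values.

```python
# Hypothetical universe of discourse X and fuzzy set "tall" defined on it.
X = ["x1", "x2", "x3"]
tall = {"x1": 0.2, "x2": 0.7, "x3": 1.0}  # membership degree of each element

def membership(fuzzy_set, x):
    """Return the degree of membership of x; elements not listed have degree 0."""
    return fuzzy_set.get(x, 0.0)

# The fuzzy set as a collection of ordered pairs (x, degree of membership).
pairs = [(x, membership(tall, x)) for x in X]
print(pairs)
```

A classical (crisp) set is the special case in which every degree is either 0 or 1.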
Example
Let X = {g1, g2, g3, g4, g5} be the reference set of students.
The main idea is to supplement the query's index terms with related
terms (obtained from a thesaurus) so that the user query can retrieve
more relevant documents.
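$$c_{i,l} = \frac{n_{i,l}}{n_i + n_l - n_{i,l}}$$
where $n_i$ is the number of documents which contain the term $k_i$, $n_l$ is
the number of documents which contain the term $k_l$, and $n_{i,l}$ is the
number of documents which contain both terms.
$$\mu_{i,j} = 1 - \prod_{k_l \in d_j} (1 - c_{i,l})$$
which computes an algebraic sum over all terms in document $d_j$.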
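The term-term correlation and the resulting degree of membership of a document in the fuzzy set of a term can be sketched in a few lines. The collection and the function names `correlation` and `membership` are hypothetical; the arithmetic assumes the correlation is the number of documents containing both terms divided by the number containing either, and that the membership is one minus the product of (1 - correlation) over the document's terms.

```python
# Hypothetical toy collection: each document is the set of its index terms.
docs = {
    "d1": {"car", "engine"},
    "d2": {"car", "automobile"},
    "d3": {"automobile", "engine", "wheel"},
}

def correlation(ki, kl):
    """Documents containing both terms divided by documents containing either."""
    n_i = sum(ki in terms for terms in docs.values())
    n_l = sum(kl in terms for terms in docs.values())
    n_il = sum(ki in terms and kl in terms for terms in docs.values())
    denom = n_i + n_l - n_il
    return n_il / denom if denom else 0.0

def membership(ki, doc_id):
    """Degree to which doc_id belongs to the fuzzy set associated with ki."""
    product = 1.0
    for kl in docs[doc_id]:
        product *= 1.0 - correlation(ki, kl)
    return 1.0 - product

print(membership("car", "d1"), membership("car", "d3"))
```

Document `d3` does not contain "car", yet it receives a non-zero membership because its terms co-occur with "car" elsewhere in the collection.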
document structure.
A structured text retrieval system looks for all the documents that match
the search criteria, which is why the retrieval task is not associated with
any notion of relevance.
The current models for structured text retrieval are data retrieval models
rather than information retrieval models.
The retrieval system could search for documents that match the query
conditions only partially.
The position in the text of a string of words that matches the user query is
referred to as the "term match point."
e.g., user query: ['information retrieval system']
If this appears at 3 positions in document dj, then there are 3 match points.
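Finding match points amounts to locating every position at which the query phrase occurs in the document text. The sketch below assumes whitespace tokenisation and word-offset positions; the function name `match_points` and the sample text are hypothetical.

```python
def match_points(text, query):
    """Return the word positions at which the query phrase starts in the text."""
    words = text.lower().split()
    phrase = query.lower().split()
    return [
        i
        for i in range(len(words) - len(phrase) + 1)
        if words[i:i + len(phrase)] == phrase
    ]

# Hypothetical document text containing the phrase twice.
doc = ("an information retrieval system ranks documents and "
       "every information retrieval system needs an index")
points = match_points(doc, "information retrieval system")
print(points, len(points))
```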
Fig. 2.7.1 Structure of text documents through different indexing
[Figure labels: L2 sections, L3 subsections, L4 subsubsections, list]
Implementation
This model was proposed by Navarro and Baeza-Yates.
The basic idea is to define a strict hierarchical index over the text.
L2: sections
L3: subsections
L4: subsubsections
Features
Searches for the sections, the subsections, and the sub-subsections that
contain the word "Everest".
keywords
These keywords could be added to the original query for query
expansion; this process is called relevance feedback. It helps in the
retrieval of more relevant documents.
Users can also explore a single document in a flat manner (like a web
page).
Drawback
On a given page, the user may have no indication of the context in which
the page sits. For example, a user who opens a book at a random page
might not know which chapter that page belongs to.
A history map to identify classes recently visited
Display occurrences (of terms) by showing the structures in a
Hypertext
A high-level interactive navigational structure allows users to browse
text non-sequentially.
A node could be a chapter in a book, a section in an article, or a
web page.
Drawbacks of Hypertext
(1) Lost in hyperspace: the user will lose track of the organizational
structure of the hypertext when it is large.
Q.3 What are the characteristics of relevance feedback?
Ans.
(1) It shields the user from the details of the query reformulation process.
(2) It breaks down the whole searching task into a sequence of small steps
emphasize others.
Ans.
(1) Assumption of vector space model:
(2) The degree of matching can be used to rank-order documents
(3) This rank-ordering corresponds to how well a document satisfies a
user's information needs
Q.7 What are the three classic models in information retrieval system?
Ans.
(1) Boolean model
Ans.
q - user query, dj - representation only
R - ideal answer set, relevant to the query
Ans.
Advantages
(1) Documents are ranked in decreasing order of their probability of being relevant
Disadvantages
(1) Need to guess the initial separation of documents into relevant and
non-relevant sets.
(2) All weights are binary
(3) The adoption of the independence assumption for index terms
(4) Need to guess initial estimates for $P(k_i \mid R)$
(5) Method does not take into account tf and idf factors
Chapter Ends...