
IRS Module 2


MODULE I

IR Models
CHAPTER 2

Syllabus
Modeling: Taxonomy of Information Retrieval Models, Retrieval: Formal
Characteristics of IR models, Classic Information Retrieval, Alternative
Set Theoretic models, Probabilistic Models, Structured text retrieval
Models, models for Browsing.
Self-learning Topics: Terrier

2.1 INTRODUCTION

What do you mean by information retrieval models?

Information retrieval's (IR) objective is to give users the documents they need to satisfy their informational needs.
We use the term "document" in a broad sense to refer to both textual and non-textual information, including multimedia items.
Index terms are typically used by traditional information retrieval systems to index and retrieve documents. A keyword (or combination of related terms) with a distinct meaning is known as an index term (usually a noun).
The semantics of the documents and of the user information need can be naturally expressed through sets of index terms.
This method is simple to implement, but retrieved documents are often irrelevant because a lot of the semantics is lost when we replace a document's text with a set of words.

The main problem in information retrieval is judging which documents are relevant and which are non-relevant.
Information Retrieval System (MU-Sem.7-1T) (IRModels) Pg.no.(2-2)
Information retrieval systems use ranking algorithms to determine which documents are relevant and which are not.
The predictions of what is relevant and what is not are based on the adopted IR model.

2.2 A TAXONOMY OF INFORMATION RETRIEVAL MODELS

GQ. What are the three classic models in information retrieval system?
GQ. Explain the taxonomy of information retrieval with a classification diagram.

The three classic models in information retrieval:

(1) Boolean: Documents and queries are represented as sets of index terms in the Boolean model. As a result, we describe the model as set theoretic.
(2) Vector: Documents and queries are represented as vectors in a t-dimensional space in the vector model. As a result, we describe the model as algebraic.
(3) Probabilistic: The framework for modeling document and query representations in the probabilistic model is based on probability theory. As a result, we refer to the model as probabilistic, as its name suggests.

For each sort of traditional model (i.e., set-theoretic, algebraic, and probabilistic), alternative modeling paradigms have been put forward over the years.
We make a distinction between the fuzzy and extended Boolean models when it comes to alternative set-theoretic models.
We differentiate the generalized vector, latent semantic indexing, and neural network models as alternative algebraic models.
We distinguish between the inference network and belief network models when referring to alternative probabilistic models.
We distinguish between the non-overlapping lists model and the proximal nodes model for structured text retrieval.
A taxonomy of these information retrieval models is shown in Fig. 2.2.1.

(New Syll. wefacademic year 22-23) (M7-87) Tech-Neo Publications


User task
    Retrieval: Ad hoc, Filtering
    Browsing
Classic Models: Boolean, Vector, Probabilistic
    Set Theoretic: Fuzzy, Extended Boolean
    Algebraic: Generalized Vector, Latent Semantic Indexing, Neural Networks
    Probabilistic: Inference Network, Belief Network
Structured Models: Non-overlapping Lists, Proximal Nodes
Browsing: Flat, Structure Guided, Hypertext

Fig. 2.2.1: A taxonomy of Information Retrieval Models

As discussed in chapter 1, the logical view of the documents (whole text, collection of index words, etc.), the IR model (Boolean, vector, probabilistic, etc.), and the user tasks (retrieval, browsing) are orthogonal features of a retrieval system.
Thus, even though some models are better suited for one user task than another, the same IR model can be utilized with various document logical views to carry out various user tasks, as shown in Fig. 2.2.2.

                          Logical view of documents
User task   | Index terms      | Full text        | Full text + Structure
------------+------------------+------------------+----------------------
Retrieval   | Classic          | Classic          | Structured
            | Set theoretic    | Set theoretic    |
            | Algebraic        | Algebraic        |
            | Probabilistic    | Probabilistic    |
Browsing    | Flat             | Flat             | Structure guided
            |                  | Hypertext        | Hypertext

Fig. 2.2.2: Retrieval models most frequently associated with distinct
combinations of a document logical view and a user task


2.3 RETRIEVAL: AD HOC AND FILTERING

GQ. Define Ad hoc retrieval and Filtering.

Ad hoc retrieval: In a traditional information retrieval system, the collection of documents remains largely static while new queries are submitted to the system.

Filtering: Queries remain relatively static while new documents enter (and leave) the system. In filtering, a user profile is created according to the user's preferences.
The incoming documents are then compared to this profile in an effort to identify any that might be of interest to this specific user.
This method can be used, for instance, to choose news articles from among the many that are broadcast each day.
Ranking of the filtered documents is not provided.
A set of keywords is used to create the user profile.
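
The comparison of incoming documents against a keyword profile can be sketched as follows. This is a minimal illustration rather than a real filtering system: the function names, the `min_hits` threshold, and the sample headlines are assumptions made for the example, and matching is done on whole lowercase words only.

```python
def matches_profile(document_text, profile_keywords, min_hits=1):
    """Return True if the document mentions at least `min_hits` profile keywords."""
    words = set(document_text.lower().split())
    hits = sum(1 for keyword in profile_keywords if keyword.lower() in words)
    return hits >= min_hits


def filter_stream(documents, profile_keywords, min_hits=1):
    """Keep the incoming documents that match the (static) user profile.

    The result preserves arrival order; no ranking is applied.
    """
    return [doc for doc in documents
            if matches_profile(doc, profile_keywords, min_hits)]


# Hypothetical profile and incoming news items.
profile = {"cricket", "football"}
incoming = [
    "India wins the cricket series",
    "Stock markets close higher",
    "Football transfer window opens",
]
selected = filter_stream(incoming, profile)
print(selected)
```

The profile stays fixed while the `incoming` list changes, which is the reverse of ad hoc retrieval, where the collection stays fixed and the queries change.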
2.4 A FORMAL CHARACTERIZATION OF IR MODELS

GQ. Illustrate formal characterization of IR Model.

The formal characterization of IR Model is as follows:

Definition: An information retrieval model is a quadruple $[D, Q, F, R(q_i, d_j)]$ where
(1) $D$ is a set composed of logical views (or representations) for the documents in the collection.
(2) $Q$ is a set composed of logical views (or representations) for the user information needs. Such representations are called queries.
(3) $F$ is a framework for modeling document representations, queries, and their relationships.
(4) $R(q_i, d_j)$ is a ranking function which associates a real number with a query $q_i \in Q$ and a document representation $d_j \in D$. Such ranking defines an ordering among the documents with regard to the query $q_i$.


2.5 CLASSIC INFORMATION RETRIEVAL

GQ. Explain Classic Information Retrieval.

In this section, we briefly present the three classic models in information retrieval, namely the Boolean, the vector, and the probabilistic models.

Basic Concepts
Each document is described by a group of representative keywords called index terms.
An index term is simply a word from the document whose semantics makes it easier to recall the document's core ideas.
Index terms are used to index and summarise the contents of the document. Nouns are preferred as index terms.
When used to describe the contents of a document, various index terms have differing degrees of importance.
Each index term in a document is given a numerical weight in order to represent this effect.
Let $k_i$ be an index term, $d_j$ be a document, and $w_{i,j} \geq 0$ be a weight associated with the pair $(k_i, d_j)$. This weight quantifies the importance of the index term for describing the document semantic contents.

Definition: Let $t$ be the number of index terms in the system and $k_i$ be a generic index term. $K = \{k_1, \ldots, k_t\}$ is the set of all index terms. A weight $w_{i,j} > 0$ is associated with each index term $k_i$ of a document $d_j$. For an index term which does not appear in the document text, $w_{i,j} = 0$. With the document $d_j$ is associated an index term vector $\vec{d}_j$ represented by $\vec{d}_j = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})$. Further, let $g_i$ be a function that returns the weight associated with the index term $k_i$ in any t-dimensional vector (i.e., $g_i(\vec{d}_j) = w_{i,j}$).
2.5.1 Boolean Model

GQ. What is the basis for the Boolean model?
GQ. What are the advantages and disadvantages of the Boolean model?

The Boolean model is a simple retrieval model based on set theory and Boolean algebra.


D5 = {K4, K5, K6, K7, K8}
D6 = {K1, K2, K3, K4}

Query: $K1 \wedge (K2 \vee \neg K3)$, e.g. documents containing K1 and (K2 or (not K3))
Answer:
{D1, D2, D4, D6} $\cap$ ({D1, D2, D3, D6} $\cup$ {D3, D5}) = {D1, D2, D6}
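Definition: For the Boolean model, the index term weight variables are all binary, i.e., $w_{i,j} \in \{0, 1\}$. A query $q$ is a conventional Boolean expression. Let $\vec{q}_{dnf}$ be the disjunctive normal form for the query $q$. Further, let $\vec{q}_{cc}$ be any of the conjunctive components of $\vec{q}_{dnf}$. The similarity of a document $d_j$ to the query $q$ is defined as

$$sim(d_j, q) = \begin{cases} 1 & \text{if } \exists\, \vec{q}_{cc} \mid (\vec{q}_{cc} \in \vec{q}_{dnf}) \wedge (\forall k_i,\; g_i(\vec{d}_j) = g_i(\vec{q}_{cc})) \\ 0 & \text{otherwise} \end{cases}$$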

If $sim(d_j, q) = 1$ then the Boolean model predicts that the document $d_j$ is relevant to the query $q$ (it might not be). Otherwise, the prediction is that the document is not relevant.
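
The query evaluation and the similarity definition can be tried out with Python sets. This is only a sketch: D5 and D6 are taken from the example in the text, while the contents of D1 to D4 do not survive in the text and are assumptions chosen to be consistent with the answer sets shown there. The names `containing`, `sim`, `Q_DNF`, and `QUERY_TERMS` are illustrative.

```python
# Index terms per document. D5 and D6 come from the text;
# D1..D4 are ASSUMED so that the answer sets in the example hold.
docs = {
    "D1": {"K1", "K2", "K3", "K4", "K5"},   # assumed
    "D2": {"K1", "K2", "K3", "K4"},         # assumed
    "D3": {"K2", "K4", "K6", "K8"},         # assumed
    "D4": {"K1", "K3", "K5", "K7"},         # assumed
    "D5": {"K4", "K5", "K6", "K7", "K8"},
    "D6": {"K1", "K2", "K3", "K4"},
}
all_docs = set(docs)


def containing(term):
    """Set of document names whose index terms include `term`."""
    return {name for name, terms in docs.items() if term in terms}


# Query K1 AND (K2 OR NOT K3), evaluated with set operations.
answer = containing("K1") & (containing("K2") | (all_docs - containing("K3")))

# The same query in disjunctive normal form over (K1, K2, K3):
# (1,1,1) OR (1,1,0) OR (1,0,0). Each tuple is one conjunctive component.
QUERY_TERMS = ("K1", "K2", "K3")
Q_DNF = [(1, 1, 1), (1, 1, 0), (1, 0, 0)]


def sim(doc_terms):
    """Return 1 if the document's binary pattern equals some conjunctive component."""
    pattern = tuple(int(term in doc_terms) for term in QUERY_TERMS)
    return 1 if pattern in Q_DNF else 0


relevant = {name for name, terms in docs.items() if sim(terms) == 1}
print(sorted(answer), sorted(relevant))
```

Both routes select the same documents, and every selected document gets the same score of 1, which is why the model cannot rank its answer set.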

Advantages of the Boolean Model

(1) It is the simplest model and is based on sets.
(2) Easy to understand and implement.
(3) It only retrieves exact matches.
(4) It gives the user a sense of control over the system.
(5) Boolean retrieval was adopted by many commercial bibliographic systems.
(6) Boolean queries are akin to database queries.

Disadvantages of the Boolean Model
(1) The model's similarity function is Boolean. Hence, there are no partial matches. This can be annoying for the users.
(2) The information need has to be translated into a Boolean expression, which most users find awkward.
(3) In this model, the choice of Boolean operators has much more influence than a critical word.
(4) The Boolean queries formulated by the users are most often too simplistic.
(5) As a result, the Boolean model frequently returns either too few or too many documents in response to the user query.
(6) The query language is expressive, but it is complicated too.
(7) No ranking of retrieved documents (absence of a grading scale).
(8) It is not possible to assign a degree of relevance.
2.5.2 Vector Model

GQ. Define the Vector Model with relevant mathematical equations.


GQ What are the assumptions of vector space model?
GQ. What are the Parameters in calculating a weight for a document
term or query term?
GQ. How can you calculate tf and idf in the vector model?

The vector model suggests a framework that allows for partial matching because it acknowledges that using binary weights is too restrictive.
It assigns non-binary weights to index terms in queries and documents.
The degree of similarity between each document stored in the system and the user query is calculated using these term weights.
By ordering the retrieved documents in decreasing order of this degree of similarity, the vector model also considers documents that match the query terms only partially.
In comparison to the Boolean model, the ranked document answer set is significantly more precise (in the sense that it better satisfies the user's information need).

Definition: For the vector model, the weight $w_{i,j}$ associated with a pair $(k_i, d_j)$ is positive and non-binary. Further, the index terms in the query are also weighted. Let $w_{i,q}$ be the weight associated with the pair $[k_i, q]$, where $w_{i,q} \geq 0$. Then, the query vector $\vec{q}$ is defined as $\vec{q} = (w_{1,q}, w_{2,q}, \ldots, w_{t,q})$ where $t$ is the total number of index terms in the system. As before, the vector for a document $d_j$ is represented by $\vec{d}_j = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})$.


GQ. What is cosine similarity?
GQ. Define term frequency.
GQ. Define inverse document frequency.

The vector model proposes to evaluate the degree of similarity of the document $d_j$ with regard to the query $q$ as the correlation between the vectors $\vec{d}_j$ and $\vec{q}$.
For instance, this correlation can be quantified by the cosine of the angle between these two vectors, as shown in Fig. 2.5.1. That is,

$$sim(d_j, q) = \frac{\vec{d}_j \cdot \vec{q}}{|\vec{d}_j| \times |\vec{q}|} = \frac{\sum_{i=1}^{t} w_{i,j} \times w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^2} \times \sqrt{\sum_{i=1}^{t} w_{i,q}^2}}$$

where $|\vec{d}_j|$ and $|\vec{q}|$ are the norms of the document and query vectors. The factor $|\vec{q}|$ does not affect the ranking (i.e., the ordering of the documents) because it is the same for all documents. The factor $|\vec{d}_j|$ provides a normalization in the space of the documents.

Fig. 2.5.1: The cosine of $\theta$ is adopted as $sim(d_j, q)$.
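
The cosine formula translates directly into a few lines of Python. The sketch below assumes documents and queries are given as plain lists of term weights of equal length; the function name, the zero-vector convention, and the sample weights are illustrative choices, not part of the model's definition.

```python
import math


def cosine_similarity(doc_weights, query_weights):
    """Cosine of the angle between a document vector and a query vector."""
    dot = sum(w_d * w_q for w_d, w_q in zip(doc_weights, query_weights))
    norm_d = math.sqrt(sum(w * w for w in doc_weights))
    norm_q = math.sqrt(sum(w * w for w in query_weights))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot / (norm_d * norm_q)


# Hypothetical weights over three index terms.
documents = {
    "d1": [0.5, 0.8, 0.0],
    "d2": [0.0, 0.2, 0.9],
}
query = [1.0, 1.0, 0.0]

# Rank documents in decreasing order of similarity to the query.
ranking = sorted(documents,
                 key=lambda name: cosine_similarity(documents[name], query),
                 reverse=True)
print(ranking)
```

Here `d2` shares only one weakly weighted term with the query, so it is still retrieved (a partial match) but ranked below `d1`.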

The vector model measures the intra-cluster similarity by calculating the raw frequency of a term $k_i$ within a document $d_j$.
Such term frequency is usually referred to as the tf factor and provides one measure of how well that term describes the document contents (i.e., intra-document characterization).
The inverse of the frequency of a term $k_i$ among the documents in the collection is used to measure the inter-cluster dissimilarity. This factor is known as the inverse document frequency or the idf factor.


Definition: Let $N$ be the total number of documents in the system and $n_i$ be the number of documents in which the index term $k_i$ appears. Let $freq_{i,j}$ be the raw frequency of term $k_i$ in the document $d_j$ (i.e., the number of times the term $k_i$ is mentioned in the text of the document $d_j$). Then, the normalized frequency $f_{i,j}$ of term $k_i$ in document $d_j$ is given by

$$f_{i,j} = \frac{freq_{i,j}}{\max_l freq_{l,j}}$$

where the maximum is computed over all terms which are mentioned in the text of the document $d_j$. If the term $k_i$ does not appear in the document $d_j$ then $f_{i,j} = 0$. Further, let $idf_i$, the inverse document frequency for $k_i$, be given by

$$idf_i = \log \frac{N}{n_i}$$

Weights are given by

$$w_{i,j} = f_{i,j} \times \log \frac{N}{n_i}$$

Such term weighting schemes are called tf-idf schemes.

The Vector Model Example

Let's consider that a collection includes 10,000 documents.
The term A appears 20 times in a particular document.
The maximum appearance of any term in this document is 50.
The term A appears in 2,000 of the collection documents.
$f_{i,j} = freq_{i,j} / \max_l freq_{l,j} = 20/50 = 0.4$
$idf_i = \log_2(N/n_i) = \log_2(10{,}000/2{,}000) = \log_2(5) = 2.32$
$w_{i,j} = f_{i,j} \times \log_2(N/n_i) = 0.4 \times 2.32 = 0.93$
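
The arithmetic of this example can be checked with a short script. The sketch assumes a base-2 logarithm, since $\log_2(5) \approx 2.32$ is the idf value the example reports; the function names are illustrative.

```python
import math


def normalized_tf(freq, max_freq):
    """Raw frequency of the term divided by the largest raw frequency in the document."""
    return freq / max_freq


def idf(n_docs, docs_with_term, base=2):
    """Inverse document frequency; base 2 is assumed to match the worked example."""
    return math.log(n_docs / docs_with_term, base)


f_ij = normalized_tf(20, 50)       # 20 occurrences, most frequent term has 50
idf_i = idf(10_000, 2_000)         # term appears in 2,000 of 10,000 documents
w_ij = f_ij * idf_i                # tf-idf weight

print(round(f_ij, 2), round(idf_i, 2), round(w_ij, 2))
```

A different logarithm base rescales every idf value by the same constant, so it changes the weights but not the relative ordering of terms by idf.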
GQ. What are the advantages and disadvantages of the Vector Model?

Advantages of Vector Space Model

(1) Its term-weighting scheme improves the quality of the answer set and retrieval performance.
(2) Its partial matching strategy allows retrieval of documents that approximate the query conditions.
(3) Its cosine ranking formula sorts the documents according to their degree of similarity to the query.


Disadvantage of Vector Space Model

(1) Index terms are assumed to be mutually independent.

2.5.3 Probabilistic Model

GQ. What are the Fundamental assumptions for probabilistic principle?
GQ. Write the advantages and disadvantages of probabilistic model.
The probabilistic model is an effort to frame the information retrieval problem within a probabilistic framework.
The probabilistic model tries to estimate the probability that the user will find the document $d_j$ relevant, and ranks documents by the ratio
P($d_j$ relevant to $q$) / P($d_j$ non-relevant to $q$)
It is useful for deriving the ranking functions used by search engines and web search engines to rank matching documents according to their relevance to a given search query.
This model is used to calculate the probability that a document, $d_j$, will be relevant to a given query, $q$.
The model assumes that this probability of relevance depends only on the query and document representations.
Given a query $q$, there exists a subset of the documents, R, which are relevant to $q$, but membership of R is uncertain.
Users start with information needs, which they translate into query representations. Similarly, there are documents, which are converted into document representations. Given only a query, an IR system has an uncertain understanding of the information need.
So IR is an uncertain process, because of:
o The translation of an information need into a query
o The conversion of documents into index terms
o The mismatch between query terms and index terms
Probability theory provides a principled foundation for such reasoning under uncertainty. This model estimates how likely a document is to be relevant to an information need.
Since documents can be relevant or non-relevant, we can estimate the probability of a term t appearing in a relevant document, P(t | R = 1).

Probabilistic methods are one of the oldest but also one of the currently
hottest topics in Information Retrieval.

For Probabilistic model

GQ. How can you find the similarity between a document and a query in the probabilistic principle using Bayes' rule?

All index term weights are binary, i.e., $w_{i,j} \in \{0,1\}$, $w_{i,q} \in \{0,1\}$.
Let $R$ be the set of documents known to be relevant to query $q$.
Let $\bar{R}$ be the complement of $R$.
Let $P(R \mid d_j)$ be the probability that the document $d_j$ is relevant to the query $q$.
Let $P(\bar{R} \mid d_j)$ be the probability that the document $d_j$ is non-relevant to the query $q$.
The similarity $sim(d_j, q)$ of the document $d_j$ to the query $q$ is defined as the ratio

$$sim(d_j, q) = \frac{P(R \mid d_j)}{P(\bar{R} \mid d_j)}$$

Using Bayes' rule,

$$sim(d_j, q) = \frac{P(d_j \mid R) \times P(R)}{P(d_j \mid \bar{R}) \times P(\bar{R})}$$

$P(d_j \mid R)$ stands for the probability of randomly selecting the document $d_j$ from the set $R$ of relevant documents.
$P(R)$ stands for the probability that a document randomly selected from the entire collection is relevant.
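
The text stops at the Bayes' rule form of the ratio. One common way to turn it into a computable score, sketched below, is to assume that index terms are independent and to sum, over the terms shared by query and document, the weight $\log\frac{P(k_i \mid R)}{1 - P(k_i \mid R)} + \log\frac{1 - P(k_i \mid \bar{R})}{P(k_i \mid \bar{R})}$. The initial guesses used here, $P(k_i \mid R) = 0.5$ and $P(k_i \mid \bar{R}) = n_i / N$, are a frequently used starting point and are assumptions of this sketch, as are all function names and the sample data.

```python
import math


def term_weight(p_rel, p_nonrel):
    """Log-odds weight of one index term (requires 0 < p < 1 for both inputs)."""
    return math.log(p_rel / (1 - p_rel)) + math.log((1 - p_nonrel) / p_nonrel)


def initial_estimates(n_i, n_docs):
    """ASSUMED initial guesses: P(k_i|R) = 0.5 and P(k_i|not R) = n_i / N."""
    return 0.5, n_i / n_docs


def score(doc_terms, query_terms, doc_freq, n_docs):
    """Sum the term weights over terms present in both query and document."""
    total = 0.0
    for term in query_terms & doc_terms:
        p_rel, p_nonrel = initial_estimates(doc_freq[term], n_docs)
        total += term_weight(p_rel, p_nonrel)
    return total


# Hypothetical collection statistics: N = 100 documents.
N = 100
doc_freq = {"retrieval": 10, "model": 50}   # n_i for each index term
query = {"retrieval", "model"}
d1 = {"retrieval", "model", "index"}
d2 = {"model", "browsing"}

print(score(d1, query, doc_freq, N), score(d2, query, doc_freq, N))
```

The factor $P(R) / P(\bar{R})$ is the same for every document, so dropping it, as this sketch does, leaves the ranking unchanged.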

Advantage of Probabilistic Model

(1) Documents are ranked in decreasing order of probability of relevance.

Disadvantages of Probabilistic Model
(1) Need to guess initial estimates for $P(k_i \mid R)$.


2.6 ALTERNATIVE SET THEORETIC MODELS


GQ. Discuss alternative set theoretic models.

In this section, we discuss two alternative set theoretic models, namely
the fuzzy set model and the extended Boolean model.

2.6.1 Fuzzy Set Model

GQ. Explain fuzzy set model.
GQ. Write basics of fuzzy set theory.

When documents and queries are represented by sets of keywords, the resulting descriptions are only loosely related to the actual semantic contents of the corresponding documents and queries.
As a result, the match between a document and the query terms is only approximate (or vague).
This can be represented mathematically by assuming that each query term defines a fuzzy set and that each document has a degree of membership (often smaller than 1) in this set.
This interpretation provides the foundation for many models of IR based on fuzzy theory.
Basics of Fuzzy Set Theory
Fuzzy set theory is an extension of classical set theory.
Elements have a varying degree of membership. A logic based on the two truth values True and False is sometimes insufficient when describing human reasoning.
Fuzzy logic uses the whole interval between 0 (false) and 1 (true) to describe human reasoning.
A fuzzy set is any set that allows its members to have different degrees of membership, given by a membership function with values in the interval [0, 1].
Fuzzy logic is derived from fuzzy set theory.
Many degrees of membership (between 0 and 1) are allowed.
Thus a membership function $\mu_A(x)$ is associated with a fuzzy set A such that the function maps every element of the universe of discourse X to the interval [0, 1].
The mapping is written as: $\mu_A(x): X \rightarrow [0, 1]$.
Fuzzy logic is capable of handling inherently imprecise (vague, inexact, rough, or inaccurate) concepts.
A fuzzy set is defined as follows: If X is a universe of discourse and x is a particular element of X, then a fuzzy set A defined on X can be written as a collection of ordered pairs $A = \{(x, \mu_A(x)),\; x \in X\}$.

GQ. Define membership function.
GQ. Explain fuzzy information retrieval.

Example
Let X = {g1, g2, g3, g4, g5} be the reference set of students.
Let A be the fuzzy set of "smart" students, where "smart" is a fuzzy term.

A = {(g1, 0.4), (g2, 0.5), (g3, 1), (g4, 0.9), (g5, 0.8)}

Here A indicates that the smartness of g1 is 0.4, and so on.
Membership Function: The membership function fully defines the fuzzy set. A membership function provides a measure of the degree of similarity of an element to a fuzzy set.

Fuzzy Information Retrieval

The main idea is to supplement the query's index terms with related terms (obtained from a thesaurus) so that the user query can retrieve more relevant documents.
A thesaurus can be created by building a term-term correlation matrix (referred to as a keyword connection matrix) whose rows and columns are associated with the index terms in the document collection. In this matrix C, a normalized correlation factor $c_{i,l}$ between two terms $k_i$ and $k_l$ can be defined by

$$c_{i,l} = \frac{n_{i,l}}{n_i + n_l - n_{i,l}}$$

where $n_i$ is the number of documents which contain the term $k_i$, $n_l$ is the number of documents which contain the term $k_l$, and $n_{i,l}$ is the number of documents which contain both terms.
This correlation defines a fuzzy set associated with each index term $k_i$. In this fuzzy set, a document $d_j$ has a degree of membership $\mu_{i,j}$ computed as

$$\mu_{i,j} = 1 - \prod_{k_l \in d_j} (1 - c_{i,l})$$

which computes an algebraic sum over all terms in document $d_j$.
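
The correlation factor and the degree of membership can be computed directly from document term sets. The following is a small sketch under stated assumptions: documents are represented as Python sets of index terms, the three sample documents are invented for illustration, and the function names are not from the text.

```python
def correlation(docs, term_i, term_l):
    """Normalized correlation c_{i,l} = n_{i,l} / (n_i + n_l - n_{i,l})."""
    n_i = sum(1 for d in docs if term_i in d)
    n_l = sum(1 for d in docs if term_l in d)
    n_il = sum(1 for d in docs if term_i in d and term_l in d)
    denominator = n_i + n_l - n_il
    return n_il / denominator if denominator else 0.0


def membership(docs, term_i, doc):
    """Degree of membership of `doc` in the fuzzy set of `term_i`:
    1 minus the product of (1 - c_{i,l}) over the terms k_l of the document."""
    product = 1.0
    for term_l in doc:
        product *= 1.0 - correlation(docs, term_i, term_l)
    return 1.0 - product


# Hypothetical collection of three documents.
docs = [
    {"car", "engine"},
    {"car", "road"},
    {"engine", "oil"},
]

print(membership(docs, "car", docs[0]))   # document contains "car" itself
print(membership(docs, "car", docs[2]))   # related only through "engine"
```

A document that contains the term itself gets membership 1 because $c_{i,i} = 1$, while the third document, which never mentions "car", still receives a non-zero membership through the correlated term "engine".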

2.6.2 Extended Boolean Model

GQ. Discuss extended Boolean model.

In the Boolean model, there is no provision for term weighting and no ranking of the answer set is generated.
As a result, the size of the output might be too large or too small.
An alternative strategy is to add the capabilities of term weighting and partial matching to the Boolean model. With this method, it is possible to combine vector model properties with Boolean query formulations.
The extended Boolean model was introduced in 1983 by Salton, Fox, and Wu.
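
The text does not give the similarity formulas of the extended Boolean model. A commonly cited formulation is the p-norm model, sketched below as an illustration: for a disjunctive query the score is $\left(\frac{x_1^p + \cdots + x_m^p}{m}\right)^{1/p}$ and for a conjunctive query it is $1 - \left(\frac{(1-x_1)^p + \cdots + (1-x_m)^p}{m}\right)^{1/p}$, where each $x_i$ is a term weight in [0, 1]. The function names and the choice of p = 2 as default are assumptions of this sketch.

```python
def p_norm_or(weights, p=2.0):
    """Score of a document for the query (k_1 OR ... OR k_m).

    `weights` holds the document's weights in [0, 1] for the query terms
    and must be non-empty.
    """
    m = len(weights)
    return (sum(w ** p for w in weights) / m) ** (1.0 / p)


def p_norm_and(weights, p=2.0):
    """Score of a document for the query (k_1 AND ... AND k_m)."""
    m = len(weights)
    return 1.0 - (sum((1.0 - w) ** p for w in weights) / m) ** (1.0 / p)


# A document that contains only one of two query terms (weights 1 and 0).
partial = [1.0, 0.0]
print(round(p_norm_or(partial), 4), round(p_norm_and(partial), 4))

# A document that contains both terms with full weight.
full = [1.0, 1.0]
print(p_norm_or(full), p_norm_and(full))
```

With p = 1 both functions reduce to the average of the weights, which behaves like a vector-style score; as p grows, the scores approach the strict Boolean OR (maximum) and AND (minimum).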

2.7 STRUCTURED TEXT RETRIEVAL MODELS

GQ. Explain Structured text retrieval models.

Think about a user who has a strong visual memory. A user of this type might remember that the particular document in which he is interested has a page where the phrase "Nuclear Blast" occurs in italics in the text around a Figure whose label contains the word "earth".
This query may be phrased as ['Nuclear Blast' and 'earth'] in a traditional information retrieval approach, which would return all pages containing both strings. But it is clear that this user does not want as many documents as this answer provides.
In this scenario, the user wants to make his query more precise by using a richer expression, like
same-page (near ('Nuclear Blast', Figure (label ('earth'))))


which conveys the details in his visual recollection.
Structured text retrieval models are retrieval models that incorporate information on both the text content and the document structure.
A structured text retrieval system looks for all the documents that match the search criteria, which is why the retrieval task is not associated with any notion of relevance.
The current models for structured text retrieval are therefore data retrieval models rather than information retrieval models.
However, a retrieval system could search for documents that match the query conditions only partially.
The position in the text of a string of words that matches the user query is referred to as a "match point".
e.g. user query: ['information retrieval system']
If this string appears at 3 positions in document $d_j$, then there are 3 match points.
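
Finding match points can be illustrated with a simple scan over the words of a document. This sketch assumes that positions are counted in words starting at 0 and that matching ignores case; a real system might use character offsets instead. The function name and the sample text are illustrative.

```python
def match_points(text, query):
    """Return the word positions at which the query phrase starts in the text."""
    words = text.lower().split()
    phrase = query.lower().split()
    if not phrase:
        return []
    last_start = len(words) - len(phrase)
    return [start for start in range(last_start + 1)
            if words[start:start + len(phrase)] == phrase]


# Hypothetical document text.
document = ("an information retrieval system ranks documents "
            "every information retrieval system uses an index")

positions = match_points(document, "information retrieval system")
print(positions, len(positions))
```

Each returned position is one match point, so the length of the list is the number of match points for the query in that document.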

2.7.1 Model based on Non-overlapping lists

GQ. Explain non-overlapping lists with the help of an example.

Each document's whole text is divided into a list of non-overlapping text sections.
Multiple lists are generated, as there are various ways to break a text into non-overlapping sections. For example:

(1) A list for chapters
(2) A list for sections
(3) A list for subsections

These lists are kept as separate and distinct data structures.
A single inverted file is built, with an entry for each structural element, to allow searching for both index terms and text regions. Fig. 2.7.1 shows an example of different lists.

[Figure: four lists over the same text. L1: chapters, L2: sections, L3: subsections, L4: subsubsections]

Fig. 2.7.1: Structure of text documents through different indexing lists
Implementation

A single inverted file is built, in which each structural component stands as an entry in the index.
Each entry has a list of text regions as a list of occurrences.
Such a list could be easily merged with the traditional inverted file.
Example types of queries
Select a region that contains a given word
Select a region A which does not contain any other region B
Select a region not contained within any other region
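
The first query type can be sketched with regions stored as (start, end) word positions. Everything concrete here is an assumption made for illustration: the region boundaries, the word positions of the term, and the function names. Each list is kept as a separate structure, and regions within one list must not overlap.

```python
# Two separate lists of regions over the same text, as (start, end) positions.
chapters = [(0, 99), (100, 199)]
sections = [(0, 49), (50, 99), (100, 149), (150, 199)]

# Inverted-file entry for a word: the positions where it occurs (hypothetical).
occurrences = {"retrieval": [12, 160]}


def is_non_overlapping(regions):
    """Check that no two regions of one list share a text position."""
    ordered = sorted(regions)
    return all(prev_end < start
               for (_, prev_end), (start, _) in zip(ordered, ordered[1:]))


def regions_containing(regions, positions):
    """Select the regions of one list that contain at least one occurrence."""
    return [(start, end) for start, end in regions
            if any(start <= p <= end for p in positions)]


hits = regions_containing(sections, occurrences["retrieval"])
print(hits)
```

Running the same selection against `chapters` instead of `sections` answers the query at a different structural level without changing the word index.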

2.7.2 Model Based on Proximal Nodes

GQ. Discuss model based on proximal nodes.

This model was proposed by Navarro and Baeza-Yates.
The basic idea is to define a strict hierarchical index over the text. This enriches the previous model, which uses flat lists.
It allows the definition of independent hierarchical (non-flat) indexing structures over the same text of the document.
Every indexing structure is made up of nodes, which are chapters, sections, paragraphs, pages, and lines.
Each node is associated with a text region.
If the user query refers to different hierarchies, the compiled answer is formed by nodes which all come from only one of them.
This type of model allows us to formulate more complex queries than the model based on non-overlapping lists.
Only nearby (proximal) nodes are looked at, for faster query processing.
Fig. 2.7.2 shows a hierarchical indexing structure of four levels and an inverted list for the word 'Everest'.

[Figure: a four-level hierarchy (L1: chapter, L2: sections, L3: subsections, L4: subsubsections) with an inverted list for 'Everest' holding the entries 10, 256, 48.304]

Fig. 2.7.2: Hierarchical indexing structure

Features

One node might be contained within another node.
But two nodes of the same hierarchy cannot overlap.
The inverted list for words complements the hierarchical index.

The query language allows:
(1) Regular expressions, to search for strings
(2) Reference to structural components by name
(3) Combination of these

An example query: [(*section) with ('Everest')]
It searches for the sections, the subsections, and the sub-subsections that contain the word "Everest".

The model is a compromise between expressiveness and efficiency.
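
The example query can be imitated on a small hierarchy of nodes. This sketch is a naive traversal that visits every node, so it does not implement the optimization of looking only at proximal nodes. The region boundaries are assumptions for the example, the word positions 10 and 256 are borrowed from the inverted list in Fig. 2.7.2, and the `Node` class and function names are illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str                 # "chapter", "section", "subsection", ...
    start: int                # first text position of the region
    end: int                  # last text position of the region
    children: list = field(default_factory=list)


def walk(node):
    """Yield the node and all of its descendants in pre-order."""
    yield node
    for child in node.children:
        yield from walk(child)


def sections_with(root, positions):
    """Nodes whose kind ends in 'section' and whose region contains the word."""
    return [n for n in walk(root)
            if n.kind.endswith("section")
            and any(n.start <= p <= n.end for p in positions)]


# Hypothetical hierarchy: nodes nest, nodes of the same level never overlap.
chapter = Node("chapter", 0, 299, [
    Node("section", 0, 149, [
        Node("subsection", 0, 74, [Node("subsubsection", 0, 30)]),
        Node("subsection", 75, 149),
    ]),
    Node("section", 150, 299),
])

everest_positions = [10, 256]
result = [(n.kind, n.start, n.end) for n in sections_with(chapter, everest_positions)]
print(result)
```

The chapter node is skipped because its name does not match the `*section` pattern, even though its region contains both occurrences.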

2.8 MODELS FOR BROWSING

Sometimes the user is interested in spending some time exploring the documents, looking for interesting references, instead of searching with a specific query.
Users have goals to pursue in both cases.
But the goal of a searching task is clearer in the user's mind than the goal of a browsing task.

Types of Browsing

GQ. What are different types of browsing?

(1) Flat Browsing
Documents are represented as dots in a (two-dimensional) plane or as elements in a (single-dimension) list.
The user then glances here and there, looking for information within the documents visited.
The user looks for correlations among neighboring documents or for keywords.
These keywords could be added to the original query for query expansion; this process is called relevance feedback. It helps in the retrieval of more relevant documents.
Users can also explore a single document in a flat manner (like a web page).

Drawback
On a given page, the user may not have an indication of the context where he is. For example, if a user opens a book on a random page, he might not know which chapter that page is in.

(2) Structure Guided Browsing

Documents are organized in a structure such as a directory to help users in browsing.
Directories are hierarchies of classes that group documents covering related topics.
These hierarchies of classes have been used to classify document collections. E.g. "Yahoo!" provides a hierarchical directory.
The user performs a structure guided type of browsing.
The same idea can be applied to a single document:
o Chapter level, section level, etc.
o The last level is the text itself (flat).
o A good UI is needed for keeping track of the context in a focused manner, e.g. Adobe Acrobat PDF files.
Additional facilities are provided when searching, such as:
o A history map to identify classes recently visited
o Display of occurrences (of terms) by showing the structures in a global context, in addition to the text positions


(3) The Hypertext Model
The fundamental concept related to the task of writing is the notion of sequencing.
A sequenced organizational structure lies underneath most written text.
The reader should not expect to fully understand the message conveyed by the writer by randomly reading pieces of text here and there.
Sometimes, we cannot capture the information even through sequential reading of the whole text.
For example, a book about "the history of the wars" is organized chronologically, but the user might be interested in the wars fought by a particular army or country. In such a case the user will have a tough time finding the information he is interested in, because the contents are organized sequentially.
In these situations, one possible solution is to rewrite the book, but there is no point in rewriting the book.
Another solution is to define a new structure to organize the contents, which can be achieved through the design of hypertext.

Hypertext
A hypertext is a high-level interactive navigational structure that allows users to browse text non-sequentially.
It consists of nodes (text regions) correlated by directed links in a graph structure.
A node could be a chapter in a book, a section in an article, or a web page.
Links are attached to specific strings inside the nodes.
The process of navigating the hypertext can be understood as a traversal of a directed graph.
Hypertexts provide the basis for HTML (Hyper Text Markup Language) and HTTP (Hypertext Transfer Protocol).
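
Navigation as a traversal of a directed graph can be illustrated with a breadth-first walk over a link table. The node names and links below are invented for the example, and breadth-first order is just one possible traversal; a real reader follows whichever links look interesting.

```python
from collections import deque

# Hypothetical hypertext: each node maps to the nodes its links point to.
links = {
    "home": ["chapter1", "chapter2"],
    "chapter1": ["chapter2", "glossary"],
    "chapter2": ["glossary"],
    "glossary": ["home"],
}


def navigate(links, start):
    """Return the nodes reachable from `start`, in breadth-first visiting order."""
    visited = [start]
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for target in links.get(node, []):
            if target not in visited:
                visited.append(target)
                queue.append(target)
    return visited


print(navigate(links, "home"))
```

Because links are directed, the set of reachable nodes depends on the starting node, and cycles such as `glossary` pointing back to `home` are common, which is why the traversal has to remember what it has already visited.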

Drawbacks of Hypertext

(1) Lost in hyperspace: the user will lose track of the organizational structure of the hypertext when it is large.
A hypertext map shows where the user is at all times (graphical user interface design).
(2) The user is restricted to the intended flow of information previously conceived by the hypertext designer.
The design should take into account the needs of potential users.
Analyzing the requirements before starting implementation of the hypertext is required.
(3) During hypertext navigation, the user might find it difficult to orient himself. Guiding tools can help in navigation (hypertext map).


Short Questions and Answers

Q.1 What do you mean by information retrieval models?
Ans.
A retrieval model can be a description of either the computational process or the human process of retrieval: the process of choosing documents for retrieval; the process by which information needs are first articulated and then refined.

Q.2 What is cosine similarity?
Ans.
This metric is frequently used to determine the similarity between two documents, or between a document and a query. It is the cosine of the angle between their term-weight vectors, so the more highly weighted terms the two have in common, the higher the similarity.


Q.3 What are the characteristics of relevance feedback?
Ans.
(1) It shields the user from the details of the query reformulation process.
(2) It breaks down the whole searching task into a sequence of small steps which are easier to grasp.
(3) It provides a controlled process designed to emphasize some terms and de-emphasize others.

Q.4 What are the assumptions of vector space model?
Ans.
Assumptions of the vector space model:
(1) The degree of matching can be used to rank-order documents.
(2) This rank-ordering corresponds to how well a document satisfies a user's information need.

Q.5 What are the disadvantages of Boolean model?
Ans.
(1) It is not simple to translate an information need into a Boolean expression.
(2) Exact matching may lead to retrieval of too many documents.
(3) The retrieved documents are not ranked.
(4) The model does not use term weights.

Q.6 Define term frequency.
Ans.
Term frequency: Frequency of occurrence of a query keyword in a document.

Q.7 What are the three classic models in information retrieval system?
Ans.
(1) Boolean model

(2) Vector Space model


(3) Probabilistic model


Q.8 What is the basis for Boolean model?
Ans.
Simple model based on set theory and Boolean algebra
(1) Documents are sets of terms.
(2) Queries are specified as Boolean expressions on terms.

Q.9 What are the disadvantages of Boolean model?
Ans.
(1) Exact matching may retrieve too few or too many documents.
(2) Difficult to rank output, although some documents are more important than others.
(3) Hard to translate a query into a Boolean expression.
(4) All terms are equally weighted.
(5) More like data retrieval than information retrieval.
(6) No notion of partial matching.

Q.10 What are the Fundamental assumptions for probabilistic principle?
Ans.
$q$: user query, $d_j$: document in the collection
The model assumes that relevance depends on the query and the document representation only.
$R$: ideal answer set, relevant to the query
$\bar{R}$: complement of the ideal answer set, non-relevant to the query
The similarity to the query, i.e. the probabilistic ranking, is computed as the ratio
Ratio = P($d_j$ relevant to $q$) / P($d_j$ non-relevant to $q$)
This ranking minimizes the probability of an erroneous judgment.

Q.11 Write the advantages and disadvantages of probabilistic model.
Ans.

Advantages
(1) Documents are ranked in decreasing order of their probability of relevance.

Disadvantages
(1) Need to guess the initial separation of documents into relevant and non-relevant sets.
(2) All weights are binary.
(3) The adoption of the independence assumption for index terms.
(4) Need to guess initial estimates for $P(k_i \mid R)$.
(5) The method does not take into account tf and idf factors.

Q.12 Why might Classic IR lead to poor retrieval?
Ans.
(1) The user information need is more related to concepts and ideas than to the index terms on which classic IR relies.
(2) Unrelated documents might be included in the answer set.
(3) Relevant documents that do not contain at least one index term are not retrieved.
(4) Reasoning: retrieval based on index terms is vague and noisy.

Chapter Ends...
