
i) Ad Hoc Retrieval:
The documents in the collection remain relatively static while
new queries are submitted to the system.

ii) Filtering:
The queries remain relatively static while new documents come
into the system.

Classic IR model:
Each document is described by a set of representative
keywords called index terms. Numerical weights are assigned
to index terms to capture how important each term is for
describing the document's content.
Three classic models: Boolean, vector, probabilistic.
Boolean Model:
The Boolean retrieval model is a model for information
retrieval in which we can pose any query which is in
the form of a Boolean expression of terms, that is, in
which terms are combined with the operators AND,
OR, and NOT. The model views each document as just
a set of words. Based on a binary decision criterion
without any notion of a grading scale. Boolean
expressions have precise semantics.
Vector Model
Assign non-binary weights to index terms in queries
and in documents. Compute the similarity between
documents and query. More precise than Boolean
model.
Probabilistic Model
The probabilistic model tries to estimate the probability
that the user will find the document dj relevant, using the
ratio
P(dj relevant to q) / P(dj nonrelevant to q)
Given a user query q, and the ideal answer set R of the
relevant documents, the problem is to specify the
properties of this set. Assumption (probabilistic
principle): the probability of relevance depends on the
query and document representations only; the ideal answer
set R should maximize the overall probability of
relevance.

Basic Concepts
 Each document is represented by a set of representative
keywords or index terms
 Index term:
In a restricted sense: it is a keyword that has
some meaning on its own; usually plays the role of
a noun
In a more general form: it is any word that appears in
a document
 Let t be the number of index terms in the document
collection and ki be a generic index term. Then:
 The vocabulary V = {k1, . . . , kt} is the set of all
distinct index terms in the collection.

The Term-Document Matrix


 The occurrence of a term ti in a document dj
establishes a relation between ti and dj
 A term-document relation between ti and dj can be
quantified by the frequency of the term in the
document
 In matrix form, this can be written as M = (fi,j),
where each element fi,j stands for the frequency of
term ti in document dj
 Logical view of a document: from full text to a set of
index terms

Logical view of a document
The document is represented not by its full text but by a
set of index terms.

• Documents in a collection are frequently represented through a
set of index terms or keywords. Keywords are extracted from the
document; they may be derived automatically or generated by a
specialist, and they provide a logical view of the document.
• This can be accomplished through the elimination of stop words
(articles and connectives), stemming (reducing distinct words
to their common grammatical root), and identification of noun
groups (which eliminates adjectives, adverbs and verbs).
Further compression might be employed. These operations are
called text operations.
• Stop words
– Removing stop words reduces the set of representative
keywords drawn from a large collection.
– Some examples of stop words are: "a", "and", "but", "how",
"or", and "what."
– For example, if you were to search for "What is a
motherboard?" on Computer Hope, the search engine would
only look for the term "motherboard". The removal of stop
words usually improves IR effectiveness.
• Noun groups
– Identify the noun groups in the text
– This eliminates the adjectives, adverbs and verbs
• Stemming
– Reduces distinct words to their common grammatical root
– Achieved by removing some word endings
A stemmer for English, for example, should identify the string "cats"
(and possibly "catlike", "catty", etc.) as based on the root "cat", and
"stems", "stemmer", "stemming", "stemmed" as based on "stem". A
stemming algorithm reduces the words "fishing", "fished", and
"fisher" to the root word "fish". On the other hand, "argue", "argued",
"argues", "arguing", and "argus" reduce to the stem "argu", while
"argument" and "arguments" reduce to the stem "argument".

• Reason for stemming
– Different word forms may bear similar meaning (e.g.
search, searching): stemming creates a "standard"
representation for them.

• Finally, further compression may be employed. These "text
operations" are used to extract the index terms; they reduce the
complexity of the document representation and allow moving the
logical view from that of full text to that of a set of index
terms, as the sketch below illustrates.
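To make the idea concrete, here is a minimal Python sketch of these text
operations. The stop-word list is the small example set above, and
simple_stem is a toy suffix stripper standing in for a real stemmer such
as Porter's; neither is meant as a production pipeline.

import re

# Illustrative stop-word list; a real system would use a much larger one.
STOP_WORDS = {"a", "an", "and", "but", "how", "is", "or", "the", "what"}

# Toy suffix-stripping rules standing in for a real stemmer (e.g. Porter).
SUFFIXES = ["ing", "ed", "es", "s"]

def simple_stem(word):
    """Strip the first matching suffix; a crude stand-in for stemming."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def index_terms(text):
    """Apply text operations: tokenize, drop stop words, stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [simple_stem(t) for t in tokens if t not in STOP_WORDS]

print(index_terms("What is a motherboard?"))    # ['motherboard']
print(index_terms("fishing fished and fisher")) # ['fish', 'fish', 'fisher']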

2.2 Boolean Retrieval Models


Definition : The Boolean retrieval model is a model for
information retrieval in which the query is in the form
of a Boolean expression of terms, combined with the
operators AND, OR, and NOT. The model views each
document as just a set of words.
 Simple model based on set theory and Boolean algebra
 The Boolean model predicts that each document is
either relevant or non-relevant

Example:
A fat book which many people own is Shakespeare's Collected
Works.

Problem: To determine which plays of Shakespeare contain
the words Brutus AND Caesar AND NOT Calpurnia.

Method1 : Using Grep


The simplest form of document retrieval is for a computer to
do the linear scan through documents. This process is
commonly referred to as grepping through text, after the
Unix command grep. Grepping through text can be a very
effective process, especially given the speed of modern
computers, and often allows useful possibilities for wildcard
pattern matching through the use of regular expressions.

To go beyond simple querying of modest collections, we need:


1. To process large document collections quickly.
2. To allow more flexible matching operations.
For example, it is impractical to perform the query
"Romans NEAR countrymen" with grep, where NEAR
might be defined as "within 5 words" or "within the
same sentence".
3. To allow ranked retrieval: in many cases we want the
best answer to an information need among many documents
that contain certain words. The way to avoid linearly scanning
the texts for each query is to index documents in advance.

Method2: Using Boolean Retrieval Model


 The Boolean retrieval model is a model for information
retrieval in which we can pose any query which is in the
form of a Boolean expression of terms, that is, in which
terms are combined with the operators AND, OR, and
NOT. The model views each document as just a set of
words.
Terms are the indexed units. We have a vector for each term,
which shows the documents it appears in, or a vector for each
document, showing the terms that occur in it. The result is a
binary term-document incidence matrix, as shown below.

            Antony and   Julius   The       Hamlet   Othello   Macbeth
            Cleopatra    Caesar   Tempest
Antony          1           1        0         0        0         1
Brutus          1           1        0         1        0         0
Caesar          1           1        0         1        1         1
Calpurnia       0           1        0         0        0         0
Cleopatra       1           0        0         0        0         0
Mercy           1           0        1         1        1         1
Worser          1           0        1         1        1         0

A term-document incidence matrix. Matrix element (t,


d) is 1 if the play in column d contains the word in row
t, and is 0 otherwise.

 To answer the query Brutus AND Caesar AND NOT


Calpurnia, we take the vectors for Brutus, Caesar and
Calpurnia, complement the last, and then do a
bitwise AND:
110100 AND 110111 AND 101111 = 100100

Solution: Antony and Cleopatra and Hamlet

Results from Shakespeare for the query Brutus AND Caesar


AND NOT Calpurnia.
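The same query can be answered with a few lines of Python over the
incidence vectors above; v_and and v_not are our own helper names for this
sketch, not functions from any IR library.

# Binary incidence vectors from the matrix above, one bit per play, in the
# order: Antony and Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth.
PLAYS = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

INCIDENCE = {
    "Brutus":    [1, 1, 0, 1, 0, 0],
    "Caesar":    [1, 1, 0, 1, 1, 1],
    "Calpurnia": [0, 1, 0, 0, 0, 0],
}

def v_and(a, b):
    return [x & y for x, y in zip(a, b)]

def v_not(a):
    return [1 - x for x in a]

# Brutus AND Caesar AND NOT Calpurnia
result = v_and(v_and(INCIDENCE["Brutus"], INCIDENCE["Caesar"]),
               v_not(INCIDENCE["Calpurnia"]))

print(result)  # [1, 0, 0, 1, 0, 0]  -> 100100
print([play for play, bit in zip(PLAYS, result) if bit])
# ['Antony and Cleopatra', 'Hamlet']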

Drawbacks of the Boolean Model


 Retrieval based on binary decision criteria with no notion of
partial matching
 No ranking of the documents is provided (absence of a
grading scale)
 Information need has to be translated into a Boolean
expression, which most users find awkward
 The Boolean queries formulated by the users are most often
too simplistic
 The model frequently returns either too few or too many
documents in response to a user query

2.3 Term weighting


A search engine should return, in order, the documents most
likely to be useful to the searcher. Ordering documents with
respect to a query is called ranking.

Term-Document Incidence Matrix

A Boolean model only records term presence or absence. Instead,
we assign a score – say in [0, 1] – to each document, measuring
how well the document and query "match". For the one-term query
BRUTUS, the score is 1 if the term is present in the document
and 0 otherwise; more appearances of the term in the document
should give a higher score.
            Antony and   Julius   The       Hamlet   Othello   Macbeth
            Cleopatra    Caesar   Tempest
Antony          1           1        0         0        0         1
Brutus          1           1        0         1        0         0
Caesar          1           1        0         1        1         1
Calpurnia       0           1        0         0        0         0
Cleopatra       1           0        0         0        0         0
Mercy           1           0        1         1        1         1
Worser          1           0        1         1        1         0

Each document is represented by a binary vector in {0,1}^|V|.

Term Frequency tf
One of the weighting schemes is term frequency, denoted tft,d,
with the subscripts denoting the term and the document in order.
The term frequency TF(t, d) of term t in document d is the number
of times that t occurs in d.
Example: a term-document count matrix. Unlike the binary
incidence matrix, we would like to give more weight to documents
that contain a term several times as opposed to ones that contain
it only once. To do this we need term frequency information: the
number of times a term occurs in a document. We assign a score to
represent the number of occurrences, as sketched below.
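A minimal sketch of computing raw term frequencies with Python's standard
library; the whitespace tokenization is a simplifying assumption.

from collections import Counter

def term_frequencies(document_text):
    """Raw term frequency TF(t, d): count of each term in the document."""
    tokens = document_text.lower().split()
    return Counter(tokens)

tf = term_frequencies("car insurance auto insurance")
print(tf["insurance"])  # 2
print(tf["car"])        # 1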

How to use tf for query-document match scores?

Raw term frequency is not what we want. A document with 10
occurrences of the term is more relevant than a document with 1
occurrence of the term, but not 10 times more relevant. We
therefore use log-frequency weighting.

Log-Frequency Weighting
The log-frequency weight of term t in document d is calculated as

    wt,d = 1 + log10(tft,d)   if tft,d > 0
    wt,d = 0                  otherwise

tft,d → wt,d : 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.
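The weighting can be written as a one-line Python function (base-10
logarithm, as in the mapping above):

import math

def log_tf_weight(tf):
    """Log-frequency weight: 1 + log10(tf) for tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0

for tf in (0, 1, 2, 10, 1000):
    print(tf, round(log_tf_weight(tf), 1))
# 0 0, 1 1.0, 2 1.3, 10 2.0, 1000 4.0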
Document Frequency & Collection Frequency
Document frequency DF(t): the number of documents
in the collection that contain a term t
Collection frequency CF(t): the total number of
occurrences of a term t in the collection

Rare terms are more informative than frequent terms; to capture
this we use document frequency (df).

Example: rare word ARACHNOCENTRIC
A document containing this term is very likely to be relevant to
the query ARACHNOCENTRIC, so we want a high weight for rare terms
like ARACHNOCENTRIC.

Example: common word THE
A document containing this term can be about anything, so we want
a very low weight for common terms like THE.

Example: "insurance" vs "try"
Document frequency is the more meaningful measure: the few
documents that contain "insurance" should get a higher boost for
a query on "insurance" than the many documents containing "try"
get for a query on "try".
Inverse Document Frequency (idf Weight)
It estimates the rarity of a term in the whole document
collection. idft is an inverse measure of the informativeness of
t. dft is the document frequency of t: the number of documents
that contain t, with dft <= N. The informativeness idf (inverse
document frequency) of t is:

    idft = log10(N / dft)

log(N/dft) is used instead of N/dft to dampen the effect of idf.

N: the total number of documents in the collection (for
example, 806,791 documents)
• IDF(t) is high if t is a rare term
• IDF(t) is likely low if t is a frequent term
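A corresponding sketch for idf; the df values and collection size below
are illustrative, not taken from a real collection.

import math

def idf(df_t, n_docs):
    """Inverse document frequency: log10(N / df_t)."""
    return math.log10(n_docs / df_t)

N = 1_000_000
print(round(idf(1_000, N), 1))    # 3.0  (rare term, high weight)
print(round(idf(100_000, N), 1))  # 1.0  (frequent term, lower weight)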

Effect of idf on Ranking


Does idf have an effect on ranking for one-term queries, like
IPHONE? No: idf affects ranking only for queries with more than
one term. For the query CAPRICIOUS PERSON, idf puts more weight
on CAPRICIOUS than on PERSON, since CAPRICIOUS is the rarer term.

TF-IDF Weighting
The tf-idf weight of a term is the product of its tf weight and
its idf weight; it is the best known weighting scheme in
information retrieval. TF(t, d) measures the importance of a term
t in document d, and IDF(t) measures the importance of a term t
in the whole collection of documents.
TF-IDF weighting: putting TF and IDF together

    TFIDF(t, d) = TF(t, d) x IDF(t)

or, if log tf is used,

    wt,d = (1 + log10 tft,d) x log10(N / dft)

 High if t occurs many times in a small number of
documents, i.e., highly discriminative in those documents
 Not high if t appears infrequently in a document, or is
frequent in many documents, i.e., not discriminative
 Low if t occurs in almost all documents, i.e., no
discrimination at all
Simple Query-Document Score
 Scoring finds whether or not a query term is present in a
zone (zones are document features whose content can be
arbitrary free text; examples: title, abstract) within a
document.
 If the query contains one or more terms, the score for a
document-query pair is the sum over terms t appearing in
both q and d:

    Score(q, d) = sum over terms t in both q and d of tf-idft,d

 The score is 0 if none of the query terms is present in the
document. A sketch of this scoring appears below.
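Putting tf, idf and the overlap score together, a rough sketch (the df
dictionary and N below are illustrative assumptions, and tfidf_score is
our own helper name, not a standard API):

import math
from collections import Counter

def tfidf_score(query, document, df, n_docs):
    """Score(q, d) = sum over shared terms of (1 + log10 tf) * log10(N / df)."""
    q_terms = set(query.lower().split())
    tf = Counter(document.lower().split())
    score = 0.0
    for t in q_terms:
        if tf[t] > 0 and t in df:
            w_tf = 1 + math.log10(tf[t])
            w_idf = math.log10(n_docs / df[t])
            score += w_tf * w_idf
    return score

# Hypothetical document frequencies for a 1,000,000-document collection.
df = {"best": 50_000, "car": 10_000, "insurance": 1_000, "auto": 5_000}
print(round(tfidf_score("best car insurance",
                        "car insurance auto insurance", df, 1_000_000), 2))
# 5.9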

Document Length Normalization


 Document sizes might vary widely
 This is a problem because longer documents are more likely to
be retrieved by a given query
 To compensate for this undesired effect, we can divide the
rank of each document by its length
 This procedure consistently leads to better ranking, and it is
called document length normalization

2.4 The Vector Model
 Boolean matching and binary weights is too limiting
 The vector model proposes a framework in which partial
matching is possible
 This is accomplished by assigning non-binary weights to index
terms in queries and in documents
 Term weights are used to compute a degree of similarity
between a query and each document
 The documents are ranked in decreasing order of their degree of
similarity

 The weight wi,j associated with a pair (ki, dj) is positive and
non-binary
 The index terms are assumed to be all mutually independent
 They are represented as unit vectors of a t-dimensional space (t
is the total number of index terms)
 The representations of document dj and query q are
t-dimensional vectors given by

    dj = (w1,j, w2,j, ..., wt,j)
    q  = (w1,q, w2,q, ..., wt,q)

The similarity between a document dj and a query q is the cosine
of the angle between their vectors:

    sim(dj, q) = (dj . q) / (|dj| x |q|)
               = sum_i (wi,j x wi,q) / (sqrt(sum_i wi,j^2) x sqrt(sum_i wi,q^2))

 Weights in the Vector model are basically tf-idf weights
 These equations should only be applied for values of term
frequency greater than zero
 If the term frequency is zero, the respective weight is also zero.

Cosine Similarity: measure the similarity between the document
and the query using the cosine of their vector representations,
as in the short sketch below.
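A short sketch of cosine similarity between two weight vectors (plain
Python; in the vector model these would be tf-idf weight vectors over the
vocabulary):

import math

def cosine_similarity(v, w):
    """Cosine of the angle between two weight vectors of equal length."""
    dot = sum(a * b for a, b in zip(v, w))
    norm_v = math.sqrt(sum(a * a for a in v))
    norm_w = math.sqrt(sum(b * b for b in w))
    if norm_v == 0 or norm_w == 0:
        return 0.0
    return dot / (norm_v * norm_w)

print(cosine_similarity([1.0, 2.0, 0.0], [1.0, 0.0, 3.0]))  # 0.1414...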

Example 1: To find the similarity between 3 novels

Three novels are taken for discussion, namely
o SaS: Sense and Sensibility
o PaP: Pride and Prejudice
o WH: Wuthering Heights
For each novel, raw term frequencies tft are computed, then
converted to log-frequency weights, and finally length-normalized.
Using the length-normalized weights:
cos(SaS,PaP) ≈ 0.789*0.832 + 0.515*0.555 + 0.335*0 + 0*0 ≈ 0.94
cos(SaS,WH) ≈ 0.79    cos(PaP,WH) ≈ 0.69
So cos(SaS,PaP) > cos(*,WH): Sense and Sensibility and Pride and
Prejudice are the most similar pair.
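Because the vectors are already length-normalized, the cosine reduces to a
dot product; a quick check using the SaS and PaP weights quoted above
(each of the four components corresponds to one of the example's terms):

# Length-normalized weight vectors for the two novels, taken from the
# computation above (one component per term in the example).
sas = [0.789, 0.515, 0.335, 0.0]
pap = [0.832, 0.555, 0.0,   0.0]

# For unit-length vectors, cosine similarity reduces to the dot product.
cos_sas_pap = sum(a * b for a, b in zip(sas, pap))
print(round(cos_sas_pap, 2))  # 0.94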

Example 2:

We often use different weightings for queries (q) and
documents (d).
Notation: ddd.qqq
Example: lnc.ltn
Document: l - logarithmic tf,
          n - no idf weighting,
          c - cosine normalization
Query:    l - logarithmic tf,
          t - idf,
          n - no cosine normalization

Example query: "best car insurance"
Example document: "car insurance auto insurance"
N = 10,000,000, with the df column given in the question itself.

Final similarity score between query and document:
sum of wq,i * wd,i = 0 + 0 + 1.04 + 2.04 = 3.08

Example 3: standard weighting scheme (lnc.ltc)

Document: l - logarithmic tf,
          n - no idf weighting,
          c - cosine normalization
Query:    l - logarithmic tf,
          t - idf,
          c - cosine normalization

Document: car insurance auto insurance
Query: best car insurance
N = 1,000,000

Term      |             Query                      |       Document           | Prod
          | tf-raw  tf-wt   df     idf  wt   n'lize| tf-raw  tf-wt  wt  n'lize|
auto      |   0      0     5000    2.3  0     0    |   1      1     1    0.52 |  0
best      |   1      1    50000    1.3  1.3   0.34 |   0      0     0    0    |  0
car       |   1      1    10000    2.0  2.0   0.52 |   1      1     1    0.52 |  0.27
insurance |   1      1     1000    3.0  3.0   0.78 |   2      1.3   1.3  0.68 |  0.53

(Document tf-wt is the log weight 1 + log10 tf.)

Score = 0+0+0.27+0.53 = 0.8

Since only one document is given in the above example, only one score is
calculated. If n documents were given, n scores would be calculated and
the documents ranked in decreasing order of score, as in the sketch below.
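The lnc.ltc computation above can be reproduced in a short Python sketch;
the document, query, df values and N are exactly those given in the
example, while the helper names are ours.

import math
from collections import Counter

def log_tf(tf):
    return 1 + math.log10(tf) if tf > 0 else 0.0

def normalize(weights):
    """Cosine (length) normalization of a term -> weight mapping."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: (w / norm if norm else 0.0) for t, w in weights.items()}

def lnc_ltc_score(query, document, df, n_docs):
    """lnc.ltc: document uses log tf + cosine norm; query adds idf."""
    d_tf = Counter(document.lower().split())
    q_tf = Counter(query.lower().split())
    d_w = normalize({t: log_tf(c) for t, c in d_tf.items()})
    q_w = normalize({t: log_tf(c) * math.log10(n_docs / df[t])
                     for t, c in q_tf.items()})
    return sum(q_w[t] * d_w.get(t, 0.0) for t in q_w)

df = {"auto": 5_000, "best": 50_000, "car": 10_000, "insurance": 1_000}
score = lnc_ltc_score("best car insurance",
                      "car insurance auto insurance", df, 1_000_000)
print(round(score, 1))  # 0.8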

Advantages:
 term-weighting improves quality of the answer set
 partial matching allows retrieval of docs that approximate the
query conditions
 cosine ranking formula sorts documents according to a degree of
similarity to the query
 document length normalization is naturally built into the ranking
Disadvantages:
It assumes independence of index terms

2.5 Probabilistic Model

The probabilistic model captures the IR problem using a
probabilistic framework.
Given a user query, there is an ideal answer set for this
query. Given a description of this ideal answer set, we
could retrieve the relevant documents. Querying is thus seen
as a specification of the properties of this ideal answer
set.

An initial set of documents is retrieved somehow. The
user inspects these documents looking for the relevant ones (in
practice, only the top 10-20 need to be inspected). The IR
system uses this information to refine the description of
the ideal answer set. By repeating this process, it is
expected that the description of the ideal answer set will
improve.

Probabilistic Ranking Principle

The probabilistic model tries to estimate the probability
that a document will be relevant to a user query. It
assumes that this probability depends on the query and
document representations only. The ideal answer set,
referred to as R, should maximize the probability of
relevance.

Idea: Given a user query q, and the ideal answer set R


of the relevant documents, the problem is to specify
the properties for this set
– Assumption (probabilistic principle): the
probability of relevance depends on the query and
document representations only; ideal answer set R
should maximize the overall probability of
relevance
– The probabilistic model tries to estimate the
probability that the user will find the document
dj relevant with ratio
P(dj relevant to q)/P(dj nonrelevant to q)

Given a query q, there exists a subset R of the documents
which are relevant to q.
But membership of R is uncertain. A probabilistic retrieval
model ranks documents in decreasing order of their probability
of relevance to the information need: P(R | q, di)

Why probabilities in IR?


Users come with information needs, which they translate
into query representations. Similarly, there are
documents, which are converted into document
representations. Given only a query, an IR system has
an uncertain understanding of the information need. So
IR is an uncertain process, because:
 an information need must be translated into a query representation
 documents must be converted into index terms
 query terms and index terms may mismatch
Probability theory provides a principled foundation for
such reasoning under uncertainty. This model estimates
how likely a document is to be relevant to an information
need.

The Ranking

Probabilistic IR - Need to Estimate

1. Find measurable statistics (tf, df ,document length) that
affect judgments about document relevance
2. Combine these statistics to estimate the probability of
document relevance
3. Order documents by decreasing estimated probability of
relevance P(R|d, q)
4. Assume that the relevance of each document is
independent of the relevance of other documents

Term Incidence Contingency Table

Let pt = P(xt = 1 | R = 1, q) be the probability of a term
appearing in a document relevant to the query, and
let ut = P(xt = 1 | R = 0, q) be the probability of a term
appearing in a non-relevant document.
Documents are ranked using the log odds ratio ct for each term
in the query:

    ct = log [ (pt / (1 - pt)) / (ut / (1 - ut)) ]

pt/(1 - pt) → odds of the term appearing if the document
is relevant
ut/(1 - ut) → odds of the term appearing if the document
is irrelevant
ct = 0 → the term has equal odds of appearing in relevant and
irrelevant documents
ct positive → higher odds of appearing in relevant documents
ct negative → higher odds of appearing in irrelevant documents
ct functions as a term weight.
Retrieval status value for document d (summing over the query
terms that appear in d):

    RSVd = sum over terms t in both q and d of ct
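A small sketch of this ranking; the pt and ut values below are made-up
illustrative estimates, and base-10 logarithms are used (any base gives
the same ranking).

import math

def term_weight(p_t, u_t):
    """BIM term weight c_t = log odds ratio of the term for relevance."""
    return math.log10((p_t / (1 - p_t)) / (u_t / (1 - u_t)))

def rsv(query_terms, doc_terms, p, u):
    """Retrieval status value: sum of c_t over query terms present in the doc."""
    return sum(term_weight(p[t], u[t])
               for t in query_terms if t in doc_terms)

# Illustrative probability estimates for two query terms.
p = {"capricious": 0.8, "person": 0.5}   # P(term | relevant)
u = {"capricious": 0.1, "person": 0.4}   # P(term | non-relevant)

score = rsv({"capricious", "person"}, {"capricious", "person", "the"}, p, u)
print(round(score, 2))  # 1.73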

Example for the BIR model

• Assume a query q containing two terms, x1 and x2. Table