
Information Retrieval and Storage

Chapter Five
IR models
Target Group – IT 3rd-year students

Injibara, Ethiopia
IR Models - Basic Concepts
Word evidence:
 IR systems usually adopt index terms to index and retrieve documents
 Each document is represented by a set of representative keywords or index terms (called a Bag of Words)
 An index term is a document word that is useful for recalling the document's main themes
Not all terms are equally useful for representing the document contents:
 Less frequent terms allow identifying a narrower set of documents
 But no ordering information is attached to the Bag of Words identified from the document collection
....continued
One central problem regarding IR systems is the issue of predicting
the degree of relevance of documents for a given query
 Such a decision is usually dependent on a ranking algorithm
which attempts to establish a simple ordering of the documents
retrieved
 Documents appearing at the top of this ordering are considered to be more likely to be relevant
Thus ranking algorithms are at the core of IR systems
 The IR models determine the predictions of what is relevant and
what is not, based on the notion of relevance implemented by
the system
....continued
After preprocessing, N distinct terms (Bag of words) remain which
are unique terms that form the VOCABULARY
• Let
– ki be an index term i & dj be a document j
– K = (k1, k2, …, kN) is the set of all index terms
• Each term, i, in a document or query j, is given a real-valued weight,
wij.
– wij is the weight associated with the pair (ki, dj). If wij = 0, the term ki does not belong to document dj
The weight wij quantifies the importance of the index term for describing the document contents
• vec(dj) = (w1j, w2j, …, wNj) is the weighted vector associated with the document dj
Mapping Documents & Queries
Represent both documents and queries as N-dimensional vectors in
a term-document matrix, which shows occurrence of terms in the
document collection or query
– E.g.  dj = (t1,j, t2,j, …, tN,j);  qk = (t1,k, t2,k, …, tN,k)
• An entry in the matrix corresponds to the “weight” of a term in the
document; zero means the term doesn’t exist in the document.

        T1   T2   ....  TN
  D1    w11  w12  ...   w1N
  D2    w21  w22  ...   w2N
  :      :    :          :
  DM    wM1  wM2  ...   wMN
  Qi    wi1  wi2  ...   wiN

 The document collection is mapped to a term-by-document matrix
 Each document (and the query) is viewed as a vector in a multidimensional space
 Nearby vectors are related
 Normalize for vector length to avoid the effect of document length
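As a rough illustration of this mapping, the sketch below builds a small term-by-document count matrix in Python; the toy documents, the query string, and the helper name to_vector are illustrative assumptions, not content from the chapter.

```python
from collections import Counter

# Toy collection; the document texts and names are illustrative only.
docs = {
    "D1": "information retrieval models rank documents",
    "D2": "boolean retrieval returns unranked documents",
}
query = "ranked retrieval"

# Vocabulary: the N distinct terms found in the collection and the query.
vocab = sorted({t for text in list(docs.values()) + [query] for t in text.split()})

def to_vector(text):
    """Map a document (or query) to an N-dimensional vector of raw term counts."""
    counts = Counter(text.split())
    return [counts.get(term, 0) for term in vocab]

matrix = {name: to_vector(text) for name, text in docs.items()}
matrix["Q"] = to_vector(query)          # the query is treated as a short document

print(vocab)
for name, vec in matrix.items():
    print(name, vec)
```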
Weighting Terms in Vector Space
The importance of the index terms is represented by weights
associated to them
Problem: what weight should we assign to show the importance of an index term for describing the document/query contents?
Solution 1: Binary weights: t=1 if presence, 0 otherwise
– Similarity: number of terms in common
Problem: Not all terms equally interesting
– E.g. the vs. dog vs. cat
Solution: Replace binary weights with non-binary weights

  dj = (w1,j, w2,j, …, wN,j);  qk = (w1,k, w2,k, …, wN,k)
How to evaluate Models?
We need to investigate what procedures they follow and what techniques they use:
 Are they using binary or non-binary weighting to measure the importance of terms in documents?
 Are they using similarity measurements?
 Are they applying partial matching?
 Are they performing exact matching or best matching for document retrieval?
 Do they provide any ranking mechanism?
The Boolean Model
Boolean model is a simple model based on set theory
• The Boolean model imposes a binary criterion for deciding
relevance
Terms are either present or absent. Thus, wij ∈ {0, 1}
 sim(q, dj) = 1 if the document satisfies the Boolean query, 0 otherwise

        T1   T2   ....  TN
  D1    w11  w12  ...   w1N
  D2    w21  w22  ...   w2N
  :      :    :          :
  DM    wM1  wM2  ...   wMN

Note that no weights are assigned in between 0 and 1; the only values used are 0 and 1.
The Boolean Model:
A Boolean query expression consists of keywords connected by AND, OR, and NOT, with brackets used to indicate scope.
Example

• Generate the relevant documents retrieved by the Boolean model for the query:

    q = k1 ∧ (k2 ∨ ¬k3)

[Venn diagram: three overlapping circles for k1, k2 and k3, with documents d1–d7 placed in the regions of the diagram]
The Boolean Model: Example
 Given the following determine documents retrieved by the
Boolean model based IR system
• Index Terms: K1, …,K8.
• Documents:
1. D1 = {K1, K2, K3, K4, K5}
2. D2 = {K1, K2, K3, K4}
3. D3 = {K2, K4, K6, K8}
4. D4 = {K1, K3, K5, K7}
5. D5 = {K4, K5, K6, K7, K8}
6. D6 = {K1, K2, K3, K4}
• Query: K1 ∧ (K2 ∨ ¬K3)

• Answer: {D1, D2, D4, D6} ∩ ({D1, D2, D3, D6} ∪ {D3, D5})
        = {D1, D2, D6}
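A minimal sketch of how this Boolean evaluation could be carried out with set operations in Python; the document sets are taken from the example above, while the helper name having is an illustrative assumption.

```python
# Documents from the example, each represented as its set of index terms.
docs = {
    "D1": {"K1", "K2", "K3", "K4", "K5"},
    "D2": {"K1", "K2", "K3", "K4"},
    "D3": {"K2", "K4", "K6", "K8"},
    "D4": {"K1", "K3", "K5", "K7"},
    "D5": {"K4", "K5", "K6", "K7", "K8"},
    "D6": {"K1", "K2", "K3", "K4"},
}
all_docs = set(docs)

def having(term):
    """Set of documents whose bag of words contains the given index term."""
    return {d for d, terms in docs.items() if term in terms}

# q = K1 AND (K2 OR NOT K3), evaluated as intersection, union and complement.
answer = having("K1") & (having("K2") | (all_docs - having("K3")))
print(sorted(answer))   # ['D1', 'D2', 'D6']
```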
Exercise
Given the following four documents with the following contents:

– D1 = “computer information retrieval”

– D2 = “computer retrieval”

– D3 = “information”

– D4 = “computer information”

• What are the relevant documents retrieved for the queries:

– Q1 = “information ∧ retrieval”

– Q2 = “information ∧ ¬computer”
Drawbacks of the Boolean Model
• Exact-match only, no partial matches

• Retrieved documents are not ranked

• All terms are equally important

• Boolean operator usage has much more influence on the result than a critical word
Vector-Space Model....
This is the most commonly used strategy for measuring relevance
of documents for a given query. This is because,
 Use of binary weights is too limiting
 Non-binary weights provide consideration for partial matches
These term weights are used to compute a degree of similarity
between a query and each document
 Ranked set of documents provides for better matching
The idea behind VSM is that
 the meaning of a document is conveyed by the words used in
that document.
Vector-Space Model
To find relevant documents for a given query:

• First, map documents and queries into the term-document vector space. Note that queries are treated as short documents.

• Second, represent queries and documents as weighted vectors, wij, in the vector space. There are different weighting techniques; the most widely used one computes tf*idf for each term.

• Third, use a similarity measure to rank documents by the closeness of their vectors to the query. Closeness is determined by a similarity score calculation.
Vector-Space Model.....
• A collection of n documents and query can be represented in the
vector space model by a term-document matrix.
–An entry in the matrix corresponds to the “weight” of a term in
the document;
–zero means the term has no significance in the document or it simply doesn’t exist in the document. Otherwise, wij > 0 whenever ki ∈ dj.

        T1   T2   ....  TN
  D1    w11  w12  ...   w1N
  D2    w21  w22  ...   w2N
  :      :    :          :
  DM    wM1  wM2  ...   wMN
Computing weights
How to compute weights for term i in document j and query q; wij
and wiq ?

A good weight must take into account two effects:


– Quantification of intra-document contents (similarity)
• tf factor, the term frequency within a document
– Quantification of inter-documents separation (dissimilarity)
• idf factor, the inverse document frequency
As a result of which most IR systems are using tf*idf weighting
technique:
wij = tf(i,j) * idf(i)
Computing weights....
Let,
 N be the total number of documents in the collection
 ni be the number of documents which contain ti
 freq(i,j) raw frequency of ti within dj
A normalized tf(i,j) factor is given by
tf(i,j) = freq(i,j) / max(freq(k,j))
 where the maximum is computed over all terms which occur
within the document dj
The idf factor is computed as
idf(i) = log (N/ni)
 The log is used to make the values of tf and idf comparable. It
can also be interpreted as the amount of information associated
with the term ti.
A normalized tf*idf weight is given by:
wij = [freq(i,j) / max(freq(k,j))] * log(N/ni)
Example: Computing weights
Query: a user's query is typically treated as a document and is also tf-idf weighted.
The vector space model is usually as good as the known
ranking alternatives. It is also simple and fast to compute.
A collection includes 10,000 documents
 The term A appears 20 times in a particular document
 The maximum appearance of any term in this document is 50
 The term A appears in 2,000 of the collection documents.
Compute TF*IDF weight?
 tf(i,j) = freq(i,j) / max(freq(k,j)) = 20/50 = 0.4
 idf(i) = log2(N/ni) = log2(10,000/2,000) = log2(5) = 2.32
 wij = tf(i,j) * idf(i) = 0.4 * 2.32 = 0.928
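A small sketch of this weighting in Python, assuming base-2 logarithms (which is what makes log(5) come out as 2.32 above); the function name tf_idf and its parameter names are illustrative.

```python
import math

def tf_idf(freq_ij, max_freq_j, N, n_i, log_base=2):
    """Normalized tf * idf weight; base-2 logs assumed, matching idf = log2(5) = 2.32."""
    tf = freq_ij / max_freq_j            # tf(i,j) = freq(i,j) / max_k freq(k,j)
    idf = math.log(N / n_i, log_base)    # idf(i) = log(N / n_i)
    return tf * idf

# The worked example above: term A, 20 occurrences, max frequency 50,
# 10,000 documents in the collection, 2,000 of them containing the term.
w = tf_idf(freq_ij=20, max_freq_j=50, N=10_000, n_i=2_000)
print(round(w, 3))   # ~0.929 (the slide rounds idf to 2.32 and reports 0.928)
```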
Similarity Measure

• A similarity measure is a function that computes the degree of similarity between two vectors.

• Using a similarity measure between the query and each document, it is possible to rank the retrieved documents in the order of presumed relevance.
Similarity Measures

 Dot Product (simple matching):  |Q ∩ D|

 Dice's Coefficient:             2 |Q ∩ D| / (|Q| + |D|)

 Jaccard's Coefficient:          |Q ∩ D| / |Q ∪ D|

 Cosine Coefficient:             |Q ∩ D| / (|Q|^(1/2) |D|^(1/2))

 Overlap Coefficient:            |Q ∩ D| / min(|Q|, |D|)
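These set-based coefficients are easy to express directly over term sets; the sketch below is a minimal Python illustration, and the example query/document term sets are assumptions for demonstration only.

```python
def dot(q, d):      return len(q & d)                                   # |Q ∩ D|
def dice(q, d):     return 2 * len(q & d) / (len(q) + len(d))
def jaccard(q, d):  return len(q & d) / len(q | d)
def cosine(q, d):   return len(q & d) / (len(q) ** 0.5 * len(d) ** 0.5)
def overlap(q, d):  return len(q & d) / min(len(q), len(d))

Q = {"gold", "silver", "truck"}                 # illustrative term sets
D = {"shipment", "gold", "arrived", "truck"}
for name, fn in [("dot", dot), ("dice", dice), ("jaccard", jaccard),
                 ("cosine", cosine), ("overlap", overlap)]:
    print(name, round(fn(Q, D), 3))
```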
Similarity Measure
• sim(q, dj) = cos(θ), where θ is the angle between the document vector dj and the query vector q

    sim(dj, q) = (dj · q) / (|dj| |q|)
               = Σi=1..n (wi,j * wi,q) / ( sqrt(Σi=1..n wi,j^2) * sqrt(Σi=1..n wi,q^2) )

• Since wij ≥ 0 and wiq ≥ 0, 0 ≤ sim(q, dj) ≤ 1

• A document is retrieved even if it matches the query terms only partially
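A minimal Python sketch of this cosine formula over weighted vectors; the two-term vectors used to exercise it are the Q, D1 and D2 values from the next slide.

```python
import math

def cosine_sim(d, q):
    """sim(d, q) = (d · q) / (|d| |q|) over weighted term vectors of equal length."""
    num = sum(wd * wq for wd, wq in zip(d, q))
    den = math.sqrt(sum(w * w for w in d)) * math.sqrt(sum(w * w for w in q))
    return num / den if den else 0.0

# Two-term example from the following slide: Q = (0.4, 0.8), D1 = (0.8, 0.3), D2 = (0.2, 0.7)
Q, D1, D2 = [0.4, 0.8], [0.8, 0.3], [0.2, 0.7]
print(round(cosine_sim(D1, Q), 2))   # ~0.73
print(round(cosine_sim(D2, Q), 2))   # ~0.98
```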
Vector Space with Term Weights and Cosine
Matching

Di = (di1, wdi1; di2, wdi2; …; dit, wdit)
Q  = (qi1, wqi1; qi2, wqi2; …; qit, wqit)

    sim(Q, Di) = Σj=1..t (wqj * wdij) / sqrt( Σj=1..t (wqj)^2 * Σj=1..t (wdij)^2 )

[Figure: Q = (0.4, 0.8), D1 = (0.8, 0.3) and D2 = (0.2, 0.7) plotted in a two-dimensional space with axes Term A and Term B]

    sim(Q, D2) = [(0.4 × 0.2) + (0.8 × 0.7)] / sqrt{ [(0.4)^2 + (0.8)^2] × [(0.2)^2 + (0.7)^2] }
               = 0.64 / 0.65 ≈ 0.98

    sim(Q, D1) = [(0.4 × 0.8) + (0.8 × 0.3)] / sqrt{ [(0.4)^2 + (0.8)^2] × [(0.8)^2 + (0.3)^2] }
               = 0.56 / 0.76 ≈ 0.73
Vector-Space Model: Example
• Suppose user query for: Q = “gold silver truck”. The database
collection consists of three documents with the following content.
D1: “Shipment of gold damaged in a fire”
D2: “Delivery of silver arrived in a silver truck”
D3: “Shipment of gold arrived in a truck”
• Show retrieval results in ranked order?
1. Assume that full text terms are used during indexing, without
removing common terms, stop words, & also no terms are stemmed.
2. Assume that content-bearing terms are selected during indexing
3. Also compare your result with or without normalizing term
frequency
Vector-Space Model: Example
Terms | Counts (tf): Q D1 D2 D3 | DF | IDF | Wi = tf*idf: Q D1 D2 D3
a 0 1 1 1 3 0 0 0 0 0
arrived 0 0 1 1 2 0.176 0 0 0.176 0.176
damaged 0 1 0 0 1 0.477 0 0.477 0 0
delivery 0 0 1 0 1 0.477 0 0 0.477 0
fire 0 1 0 0 1 0.477 0 0.477 0 0
gold 1 1 0 1 2 0.176 0.176 0.176 0 0.176
in 0 1 1 1 3 0 0 0 0 0
of 0 1 1 1 3 0 0 0 0 0
silver 1 0 2 0 1 0.477 0.477 0 0.954 0
shipment 0 1 0 1 2 0.176 0 0.176 0 0.176
truck 1 0 1 1 2 0.176 0.176 0 0.176 0.176
Vector-Space Model
Terms Q D1 D2 D3
a 0 0 0 0
arrived 0 0 0.176 0.176
damaged 0 0.477 0 0
delivery 0 0 0.477 0
fire 0 0.477 0 0
gold 0.176 0.176 0 0.176
in 0 0 0 0
of 0 0 0 0
silver 0.477 0 0.954 0
shipment 0 0.176 0 0.176
truck 0.176 0 0.176 0.176
Vector-Space Model: Example
• Compute the similarity using the cosine measure, sim(q, di)
• First, for each document and the query, compute all vector lengths (zero terms ignored):

  |d1| = sqrt(0.477^2 + 0.477^2 + 0.176^2 + 0.176^2) = sqrt(0.517)  = 0.719
  |d2| = sqrt(0.176^2 + 0.477^2 + 0.954^2 + 0.176^2) = sqrt(1.2001) = 1.095
  |d3| = sqrt(0.176^2 + 0.176^2 + 0.176^2 + 0.176^2) = sqrt(0.124)  = 0.352
  |q|  = sqrt(0.176^2 + 0.477^2 + 0.176^2)           = sqrt(0.2896) = 0.538

• Next, compute the dot products (zero products ignored):

  q · d1 = 0.176*0.176 = 0.0310
  q · d2 = 0.954*0.477 + 0.176*0.176 = 0.4862
  q · d3 = 0.176*0.176 + 0.176*0.176 = 0.0620
Vector-Space Model: Example
Now, compute the similarity scores:
 Sim(q,d1) = (0.0310) / (0.538*0.719) = 0.0801
 Sim(q,d2) = (0.4862) / (0.538*1.095) = 0.8246
 Sim(q,d3) = (0.0620) / (0.538*0.352) = 0.3271

Finally, we sort and rank the documents in descending order of similarity score:
 Rank 1: Doc 2 = 0.8246
 Rank 2: Doc 3 = 0.3271
 Rank 3: Doc 1 = 0.0801
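The whole worked example can be reproduced end to end; the sketch below is a rough Python version under the same assumptions as the tables above (raw term frequency times a base-10 idf, no stop-word removal or stemming). Helper names such as weights and cosine are illustrative.

```python
import math
from collections import Counter

docs = {
    "D1": "shipment of gold damaged in a fire",
    "D2": "delivery of silver arrived in a silver truck",
    "D3": "shipment of gold arrived in a truck",
}
query = "gold silver truck"

N = len(docs)
counts = {d: Counter(text.split()) for d, text in docs.items()}
q_counts = Counter(query.split())
vocab = sorted(set(q_counts) | {t for c in counts.values() for t in c})

# idf(i) = log10(N / df_i); weights are raw tf * idf, as in the tables above.
df = {t: sum(1 for c in counts.values() if t in c) for t in vocab}
idf = {t: math.log10(N / df[t]) if df[t] else 0.0 for t in vocab}

def weights(c):
    """Weighted vector for a document or query given its term counts."""
    return {t: c.get(t, 0) * idf[t] for t in vocab}

def cosine(a, b):
    num = sum(a[t] * b[t] for t in vocab)
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

qw = weights(q_counts)
for d, c in sorted(counts.items(), key=lambda kv: -cosine(weights(kv[1]), qw)):
    print(d, round(cosine(weights(c), qw), 4))   # D2 ~0.82, D3 ~0.33, D1 ~0.08
```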
Vector-Space Model
• Advantages:
• term-weighting improves the quality of the answer set, since results are presented in ranked order
• partial matching allows retrieval of documents that
approximate the query conditions
• cosine ranking formula sorts documents according to
degree of similarity to the query

• Disadvantages:
• assumes independence of index terms (??)
Probabilistic Model
• IR is an uncertain process
–Mapping Information need to Query is not perfect
–Mapping Documents to index terms is a logical representation
–Query terms and index terms mostly mismatch

• This situation leads to several statistical approaches: probability theory,


fuzzy logic, theory of evidence, etc.
• The probabilistic retrieval model is a rigorous formal model that attempts to
predict the probability that a given document will be relevant to a given
query, P(R | q, di)
– Use probability to estimate the “odds” of relevance of a query to a
document.
– It relies on accurate estimates of probabilities
Probabilistic model…
Asks the question: what is the probability that the user will see relevant
information if they read this document?
– P(rel | di): the probability of relevance after reading di
– How likely is the user to get relevant information from reading this document?
– A high probability means the user is more likely to get relevant information
A probabilistic retrieval model
– Ranks documents in decreasing order of probability of relevance to the
  user's information need
– Calculates P(rel | di) for each document and ranks accordingly
Probability Ranking Principle
You have a collection of Documents
– User issues a query
– A Set of documents needs to be returned
– Intuitively, want the “best” document to be first, second best -
second, etc…
– We need a formal way to judge the “goodness” of documents with
respect to queries.
Probability ranking principle: if a reference retrieval system's
response to each request is a ranking of the documents in the
collection in order of decreasing probability of relevance… the
overall effectiveness of the system to its user will be the best that is
obtainable.
Difficulties
Evidence is based on a lossy representation
– Evaluate probability of relevance based on occurrence of terms
in query and documents

– Start with an initial estimate, refine through relevance feedback
Computing the probabilities exactly according to the model is intractable
– Make some simplifying assumptions
Probabilistic Model definitions
• Let D be a document in the collection.
– dj = (t1,j, t2,j, …, tt,j), ti,j ∈ {0, 1}
• terms occurrences are boolean (not counts)
• query q is represented similarly
• Let R represent the set of relevant documents with respect to a given
query and let NR represent the set of irrelevant documents.
– P(R | dj) is probability that dj is relevant,
– P(NR | dj) is probability that dj is irrelevant
• Need to find p(R| D) - probability that a retrieved document D is
relevant.
• Similarity function – the ratio of the probability of relevance to the probability of non-relevance:

    p(R | D)  = p(D | R) p(R) / p(D)
    p(NR | D) = p(D | NR) p(NR) / p(D)

  If p(R | D) > p(NR | D) then D is relevant, otherwise D is not relevant
Bayes’ Theorem: Application in IR
• Goal: want to estimate the probability that a document D is
relevant to a given query.

    p(R | D) = p(R) p(D | R) / p(D)
             = p(R) p(D | R) / [ p(R) p(D | R) + p(NR) p(D | NR) ]

• It is easier to estimate the log odds of the probability of relevance:

    log O(R | D) = log [ p(R | D) / p(NR | D) ]
                 = log [ p(R) p(D | R) / ( p(NR) p(D | NR) ) ]

    where p(NR | D) = 1 - p(R | D)
Probabilistic Models
Most probabilistic models based on combining probabilities of
relevance and non-relevance of individual terms
– Probability that a term will appear in a relevant document
– Probability that the term will not appear in a non‐relevant
document
These probabilities are estimated based on counting term
appearances in document descriptions
• Retrieval Status Value (RSV)
 – D is a vector of binary term occurrences
 – We assume that terms occur independently of each other
Principles surrounding weights
• Independence Assumptions

– I1: The distribution of terms in relevant documents is independent


and their distribution in all documents is independent.
– I2: The distribution of terms in relevant documents is independent
and their distribution in non-relevant documents is independent.
• Ordering Principles
– O1: Probable relevance is based only on the presence of search
terms in the documents.
– O2: Probable relevance is based on both the presence of search
terms in documents and their absence from documents.
Computing term probabilities
• Initially, there are no retrieved documents
 – R is completely unknown
 – Assume P(ti | R) is constant (usually 0.5)
 – Assume P(ti | NR) is approximated by the distribution of ti across the collection, i.e. IDF

• This can be used to compute an initial ranking, using IDF as the basic term weight
Probabilistic Model Example
d Document vectors <tfd,t>
col day eat hot lot nin old pea por pot
1 1 1 1 1
2 1 1 1
3 1 1 1
4 1 1 1
5 1 1
6 1 1
wt 0.26 0.56 0.56 0.26 0.56 0.56 0.56 0.0 0.0 0.26

• q1 = eat
• q2 = porridge
• q3 = hot porridge
• q4 = eat nine day old porridge
Improving the Ranking
• Now, suppose
 – we have shown the initial ranking to the user
 – the user has labeled some of the documents as relevant ("relevance feedback")

• We now have
 – N documents in the collection, R of which are known to be relevant
 – ni documents containing ti, ri of which are relevant
Improving Term Weight Estimates
N=the total number of documents in the collection
n= the total number of documents that contain term ti
R=the total number of relevant documents retrieved
r=the total number of relevant documents retrieved that contain term ti
Document relevance contingency table for term ti:

                              Relevant docs   Non-relevant docs   Total
  Docs containing ti          r               n - r               n
  Docs not containing ti      R - r           N - R - (n - r)     N - n
  Total                       R               N - R               N
Compute Term Weight: Robertson-Sparck Jones Weights
• Retrospective formulation
  – the ratio of the odds of a relevant document having the term (i.e., the ratio of relevant documents having the term to those not having it) to the odds of a non-relevant document having the term:

    w = log [ ( r / (R - r) ) / ( (n - r) / (N - n - R + r) ) ]

• Predictive formulation
  – to guarantee that the denominator is never zero, 0.5 is added to all numerators and denominators:

    w(1) = log [ ( (r + 0.5) / (R - r + 0.5) ) / ( (n - r + 0.5) / (N - n - R + r + 0.5) ) ]
         = log [ (r + 0.5)(N - n - R + r + 0.5) / ( (n - r + 0.5)(R - r + 0.5) ) ]
Relevance weighted Example

d    Document vectors <tfd,t>
     col day eat hot lot nin old pea por pot   Relevance
1 1 1 1 1 NR
2 1 1 1 R
3 1 1 1 NR
4 1 1 1 NR
5 1 1 NR
6 1 1 NR
wt   -0.33 0.00 0.00 -0.33 0.00 0.00 0.00 0.62 0.62 0.95

• q3 = hot porridge
• Document 2 is relevant
Probabilistic Retrieval Example
• D1: “Cost of paper is up.” (relevant)

• D2: “Cost of jellybeans is up.” (not relevant)

• D3: “Salaries of CEO’s are up.” (not relevant)

• D4: “Paper: CEO’s labor cost up.” (relevance unknown, to be scored)


Probabilistic Retrieval Example
cost paper Jellybean salary CEO labor up
D1 1 1 0 0 0 0 1
D2 1 0 1 0 0 0 1
D3 0 0 0 1 1 0 1
D4 1 1 0 0 1 1 1
Wij 0.477 1.176 -0.477 -0.477 -0.477 0.222 -0.222
• D1 = 0.477 + 1.176 - 0.222 = 1.431
• D2 = 0.477 - 0.477 - 0.222 = -0.222
• D3 = -0.477 - 0.477 - 0.222 = -1.176
• D4 = 0.477 + 1.176 - 0.477 + 0.222 - 0.222 = 1.176
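A rough Python check of this example, assuming base-10 logarithms and treating D1-D3 as the judged set (N = 3, R = 1); it reproduces the Wij row above and then scores each document, including the unjudged D4, by summing the weights of its terms (its retrieval status value).

```python
import math

# Judged documents with their terms and relevance (D1 relevant; D2, D3 not relevant).
train = {
    "D1": ({"cost", "paper", "up"}, True),
    "D2": ({"cost", "jellybean", "up"}, False),
    "D3": ({"salary", "ceo", "up"}, False),
}
d4 = {"cost", "paper", "ceo", "labor", "up"}            # the document to be scored
terms = ["cost", "paper", "jellybean", "salary", "ceo", "labor", "up"]

N = len(train)                                          # 3 judged documents
R = sum(1 for _, rel in train.values() if rel)          # 1 of them relevant

def weight(t):
    """Robertson-Sparck Jones weight of term t, estimated from the judged documents."""
    n = sum(1 for doc_terms, _ in train.values() if t in doc_terms)
    r = sum(1 for doc_terms, rel in train.values() if rel and t in doc_terms)
    num = (r + 0.5) * (N - n - R + r + 0.5)
    den = (n - r + 0.5) * (R - r + 0.5)
    return math.log10(num / den)

w = {t: round(weight(t), 3) for t in terms}
print(w)                       # cost 0.477, paper 1.176, jellybean -0.477, ..., up -0.222

def rsv(doc_terms):
    """Retrieval status value: sum of the weights of the terms present in the document."""
    return round(sum(w[t] for t in terms if t in doc_terms), 3)

for name, (doc_terms, _) in train.items():
    print(name, rsv(doc_terms))
print("D4", rsv(d4))           # 1.176: ranked below D1 but well above D2 and D3
```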
Probabilistic model
• Probabilistic model uses probability theory to model the uncertainty
in the retrieval process
– Assumptions are made explicit
– Term weight without relevance information is IDF
• Relevance feedback can improve the ranking by giving better term
probability estimates
• Advantages of probabilistic model over vector‐space
– Strong theoretical basis
– Since the base is probability theory, it is very well understood
– Easy to extend
• Disadvantages
– Models are often complicated
– No term frequency weighting
• Which is better: vector‐space or probabilistic?
– Both are approximately as good as each other
– Depends on collection, query, and other factors