
IR models

IR Models - Basic Concepts

• Word evidence:
 IR systems usually adopt index terms to index and
retrieve documents
 Each document is represented by a set of representative
keywords or index terms (called Bag of Words)
• An index term is a document word useful for remembering the document's main themes
• Not all terms are equally useful for representing
the document contents:
 less frequent terms allow identifying a narrower set of
documents
• But no ordering information is attached to the Bag of Words
identified from the document collection.
IR Models - Basic Concepts
•One central problem regarding IR systems is
the issue of predicting the degree of relevance
of documents for a given query
•Such a decision is usually dependent on a
ranking algorithm which attempts to
establish a simple ordering of the
documents retrieved
•Documents appearing at the top of this
ordering are considered to be more likely to
be relevant
•Thus ranking algorithms are at the core of IR
systems
•The IR models determine the predictions of what is relevant and what is not, based on the notion of relevance implemented by the system
IR Models - Basic Concepts
• After preprocessing, N distinct terms (Bag of words) remain
which are unique terms that form the VOCABULARY
• Let
– ki be an index term i & dj be a document j
– K = (k1, k2, …, kN) is the set of all index terms
• Each term, i, in a document or query j, is given a real-valued
weight, wij.
– wij is a weight associated with (ki,dj). If wij = 0 , it
indicates that term does not belong to document dj
• The weight wij quantifies the importance of the index term
for describing the document contents
• vec(dj) = (w1j, w2j, …, wNj) is a weighted vector associated with the document dj
Mapping Documents &
Queries
• Represent both documents and queries as N-dimensional
vectors in a term-document matrix, which shows
occurrence of terms in the document collection or query
– E.g. dj = (t1,j, t2,j, …, tN,j); qk = (t1,k, t2,k, …, tN,k)
• An entry in the matrix corresponds to the “weight” of a term in the document; zero means the term doesn’t exist in the document.

       T1    T2    …   TN
 D1    w11   w12   …   w1N
 D2    w21   w22   …   w2N
 :     :     :         :
 DM    wM1   wM2   …   wMN
 Qi    wi1   wi2   …   wiN

– Document collection is mapped to a term-by-document matrix
– View each document (and query) as a vector in multidimensional space
• Nearby vectors are related
– Normalize for vector length to avoid the effect of document length
Weighting Terms in Vector
Space
• The importance of the index terms is represented by
weights associated to them
• Problem: to show the importance of the index term for
describing the document/query contents, what weight we
can assign?
• Solution 1: Binary weights: wij = 1 if term i is present in document j, 0 otherwise
– Similarity: number of terms in common
• Problem: Not all terms equally interesting
– E.g. the vs. dog vs. cat
• Solution: Replace binary weights with non-binary weights
dj = (w1,j, w2,j, …, wN,j); qk = (w1,k, w2,k, …, wN,k)
How to evaluate Models?
• We need to investigate what procedures they follow and what techniques they use:
– Are they using binary or non-binary weighting for measuring the importance of terms in documents?
– Are they using similarity measurements?
– Are they applying partial matching?
– Are they performing Exact matching or Best
matching for document retrieval?
– Any Ranking mechanism?
The Boolean Model
•Boolean model is a simple model based on
set theory
•The Boolean model imposes a binary
criterion for deciding relevance
•Terms are either present or absent. Thus, wij ∈ {0,1}
•sim(q,dj) = 1 if the document satisfies the boolean query, 0 otherwise
– Note that no weights are assigned in-between 0 and 1; only the values 0 or 1 are used

       T1    T2    …   TN
 D1    w11   w12   …   w1N
 D2    w21   w22   …   w2N
 :     :     :         :
 DM    wM1   wM2   …   wMN
The Boolean Model: Example
•Generate the relevant documents
retrieved by the Boolean model for the
query :
q = k1  (k2  k3)
k2
k1
d7
d2 d6
d4 d5
d3
d1

k3
The Boolean Model: Example
• Given the following determine documents retrieved by the
Boolean model based IR system
• Index Terms: K1, …,K8.
• Documents:
1. D1 = {K1, K2, K3, K4, K5}
2. D2 = {K1, K2, K3, K4}
3. D3 = {K2, K4, K6, K8}
4. D4 = {K1, K3, K5, K7}
5. D5 = {K4, K5, K6, K7, K8}
6. D6 = {K1, K2, K3, K4}
• Query: K1 ∧ (K2 ∨ ¬K3)
• Answer: {D1, D2, D4, D6} ∩ ({D1, D2, D3, D6} ∪ {D3, D5}) = {D1, D2, D6}
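To make the set operations concrete, here is a minimal Python sketch of Boolean retrieval over this small collection (the dictionary layout and the helper function are illustrative, not part of the slides):

```python
# Minimal Boolean retrieval: each document is modelled as a set of index terms.
docs = {
    "D1": {"K1", "K2", "K3", "K4", "K5"},
    "D2": {"K1", "K2", "K3", "K4"},
    "D3": {"K2", "K4", "K6", "K8"},
    "D4": {"K1", "K3", "K5", "K7"},
    "D5": {"K4", "K5", "K6", "K7", "K8"},
    "D6": {"K1", "K2", "K3", "K4"},
}

def containing(term):
    """Documents whose bag of words contains `term`."""
    return {d for d, terms in docs.items() if term in terms}

all_docs = set(docs)

# Query: K1 AND (K2 OR NOT K3)
answer = containing("K1") & (containing("K2") | (all_docs - containing("K3")))
print(sorted(answer))  # ['D1', 'D2', 'D6']
```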
The Boolean Model: Further Example
Given the following three documents, Construct Term –
document matrix and find the relevant documents
retrieved by the Boolean model for given query
• D1: “Shipment of gold damaged in a fire”
• D2: “Delivery of silver arrived in a silver truck”
• D3: “Shipment of gold arrived in a truck”
• Query: “gold silver truck”
Table below shows the document–term (ti) matrix (to be filled in):

       arrive  damage  deliver  fire  gold  silver  ship  truck
 D1
 D2
 D3

• Find the relevant documents for the queries:
(a) gold delivery
(b) ship gold
(c) silver truck
Exercise
Given the following four documents with the following
contents:
– D1 = “computer information retrieval”
– D2 = “computer retrieval”
– D3 = “information”
– D4 = “computer information”
• What are the relevant documents retrieved for the queries:
– Q1 = “information ∧ retrieval”
– Q2 = “information ∧ ¬computer”
Drawbacks of the Boolean Model
•Retrieval based on binary decision criteria with no
notion of partial matching
•No ranking of the documents is provided (absence of
a grading scale)
•Information need has to be translated into a Boolean
expression which most users find awkward
•The Boolean queries formulated by the users are most
often too simplistic
As a consequence, the Boolean model frequently
returns either too few or too many documents in
response to a user query
Vector-Space Model
• This is the most commonly used strategy for measuring
relevance of documents for a given query. This is
because,
• Use of binary weights is too limiting
• Non-binary weights provide consideration for partial
matches
• These term weights are used to compute a degree of
similarity between a query and each document
• Ranked set of documents provides for better matching
• The idea behind VSM is that
• the meaning of a document is conveyed by the words
used in that document
Vector-Space Model
To find relevant documents for a given query,
• First, map documents and queries into term-document vector space.
Note that queries are treated as short documents
• Second, in the vector space, queries and documents are represented as weighted vectors of terms, wij
There are different weighting techniques; the most widely used one is computing tf*idf for each term
• Third, a similarity measure is used to rank documents by the closeness of their vectors to the query.
Closeness is determined by a similarity score calculation
Term-document matrix.
• A collection of n documents and query can be represented
in the vector space model by a term-document matrix.
– An entry in the matrix corresponds to the “weight” of a term in
the document;
– zero means the term has no significance in the document or it simply doesn’t exist in the document; otherwise, wij > 0 whenever ki ∈ dj
       T1    T2    …   TN
 D1    w11   w12   …   w1N
 D2    w21   w22   …   w2N
 :     :     :         :
 DM    wM1   wM2   …   wMN
Computing weights
• How to compute weights for term i in document j and
query q; wij and wiq ?
• A good weight must take into account two effects:
– Quantification of intra-document contents (similarity)
• tf factor, the term frequency within a document
– Quantification of inter-documents separation (dissimilarity)
• idf factor, the inverse document frequency
• As a result, most IR systems use the tf*idf weighting technique:
wij = tf(i,j) * idf(i)
Computing weights
• Let,
 N be the total number of documents in the collection
 ni be the number of documents which contain ti
 freq(i,j) raw frequency of ti within dj
• A normalized tf(i,j) factor is given by
tf(i,j) = freq(i,j) / max(freq(k,j))
 where the maximum is computed over all terms which occur
within the document dj
• The idf factor is computed as
idf(i) = log (N/ni)
 The log is used to make the values of tf and idf
comparable. It can also be interpreted as the amount of
information associated with the term ti.
• A normalized tf*idf weight is given by:
wij = [freq(i,j) / maxk freq(k,j)] * log(N/ni)
Example: Computing weights
• A collection includes 10,000 documents
The term A appears 20 times in a particular
document
 The maximum appearance of any term in this
document is 50
 The term A appears in 2,000 of the collection
documents.
• Compute TF*IDF weight?
 tf(i,j) = freq(i,j) / max(freq(k,j)) = 20/50 = 0.4
 idf(i) = log(N/ni) = log2(10,000/2,000) = log2(5) = 2.32
 wij = tf(i,j) * idf(i) = 0.4 * 2.32 = 0.928
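A small Python sketch of this computation (using a base-2 logarithm, which reproduces the 2.32 figure above; the function name is illustrative):

```python
import math

def tf_idf(freq_ij, max_freq_j, N, n_i):
    """Normalized tf*idf weight: (freq / max_freq) * log2(N / n_i)."""
    tf = freq_ij / max_freq_j      # normalized term frequency
    idf = math.log2(N / n_i)       # inverse document frequency
    return tf * idf

# Term A: 20 occurrences, max term frequency 50, N = 10,000 docs, term in 2,000 docs
print(round(tf_idf(20, 50, 10_000, 2_000), 3))  # ≈ 0.929 (the slide rounds idf to 2.32)
```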
Similarity Measure
• A similarity measure is a function that computes the
degree of similarity between two vectors.
• Using a similarity measure between the query and each document:
– It is possible to rank the retrieved documents in
the order of presumed relevance.
– It is possible to enforce a certain threshold so that
we can control the size of the retrieved set of
documents.
Similarity Measures
|QD| Dot Product (Simple matching)
|QD|
2 Dice’s Coefficient
|Q|| D|
|QD|
Jaccard’s Coefficient
|QD|
|QD|
1 1 Cosine Coefficient
| Q | | D |
2 2

|QD|
min(| Q |, | D |) Overlap Coefficient
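A quick Python sketch of these set-based coefficients, treating the query and the document as sets of terms (the function name and example sets are illustrative):

```python
import math

def coefficients(Q, D):
    """Set-based similarity coefficients between query terms Q and document terms D."""
    common = len(Q & D)
    return {
        "dot":     common,                              # simple matching
        "dice":    2 * common / (len(Q) + len(D)),
        "jaccard": common / len(Q | D),
        "cosine":  common / math.sqrt(len(Q) * len(D)),
        "overlap": common / min(len(Q), len(D)),
    }

print(coefficients({"gold", "silver", "truck"},
                   {"shipment", "gold", "arrived", "truck"}))
```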
Similarity Measure
•Sim(q,dj) = cos() j
dj


q
 

n i
d j q wi , j wi ,q
sim(d j , q )     i 1

i 1 w i 1 i,q
n n
dj q 2
i, j w 2

•Since wij > 0 and wiq > 0, 0 <= sim(q,dj)


<=1
•A document is retrieved even if it matches the
query terms only partially
Vector Space with Term Weights and
Cosine Matching
Di = (di1, wdi1; di2, wdi2; …; dit, wdit)
Q = (qi1, wqi1; qi2, wqi2; …; qit, wqit)

sim(Q, Di) = Σj (wqj · wdij) / ( sqrt(Σj wqj²) · sqrt(Σj wdij²) )

(2-D example in the Term A / Term B plane)
Q = (0.4, 0.8)    D1 = (0.8, 0.3)    D2 = (0.2, 0.7)

sim(Q, D2) = (0.4·0.2 + 0.8·0.7) / sqrt[ (0.4² + 0.8²) · (0.2² + 0.7²) ]
           = 0.64 / sqrt(0.42) ≈ 0.98

sim(Q, D1) = (0.4·0.8 + 0.8·0.3) / sqrt[ (0.4² + 0.8²) · (0.8² + 0.3²) ]
           = 0.56 / sqrt(0.58) ≈ 0.74
Vector-Space Model: Example
• Suppose the user query is Q = “gold silver truck”. The database collection consists of three documents with the following content.
D1: “Shipment of gold damaged in a fire”
D2: “Delivery of silver arrived in a silver truck”
D3: “Shipment of gold arrived in a truck”
• Show retrieval results in ranked order?
1.Assume that full text terms are used during indexing,
without removing common terms, stop words, & also no
terms are stemmed.
2.Assume that content-bearing terms are selected during
indexing
3.Also compare your result with or without normalizing
term frequency
Vector-Space Model: Example
Counts (TF) and weights Wi = TF*IDF (IDF = log10(3/DF)):

Terms      Q  D1  D2  D3   DF   IDF     Q      D1     D2     D3
a          0  1   1   1    3    0       0      0      0      0
arrived    0  0   1   1    2    0.176   0      0      0.176  0.176
damaged    0  1   0   0    1    0.477   0      0.477  0      0
delivery   0  0   1   0    1    0.477   0      0      0.477  0
fire       0  1   0   0    1    0.477   0      0.477  0      0
gold       1  1   0   1    2    0.176   0.176  0.176  0      0.176
in         0  1   1   1    3    0       0      0      0      0
of         0  1   1   1    3    0       0      0      0      0
silver     1  0   2   0    1    0.477   0.477  0      0.954  0
shipment   0  1   0   1    2    0.176   0      0.176  0      0.176
truck      1  0   1   1    2    0.176   0.176  0      0.176  0.176
Vector-Space Model
Terms Q D1 D2 D3
a 0 0 0 0
arrived 0 0 0.176 0.176
damaged 0 0.477 0 0
delivery 0 0 0.477 0
fire 0 0.477 0 0
gold 0.176 0.176 0 0.176
in 0 0 0 0
of 0 0 0 0
silver 0.477 0 0.954 0
shipment 0 0.176 0 0.176
truck 0.176 0 0.176 0.176
Vector-Space Model: Example
•Compute similarity using cosine Sim(q,dj)
• First, for each document and the query, compute all vector lengths (zero terms ignored):
|d1| = sqrt(0.477² + 0.477² + 0.176² + 0.176²) = sqrt(0.517) = 0.719
|d2| = sqrt(0.176² + 0.477² + 0.954² + 0.176²) = sqrt(1.2001) = 1.095
|d3| = sqrt(0.176² + 0.176² + 0.176² + 0.176²) = sqrt(0.124) = 0.352
|q| = sqrt(0.176² + 0.477² + 0.176²) = sqrt(0.2896) = 0.538
• Next, compute dot products (zero products ignored)
Q*d1= 0.176*0.176 = 0.0310
Q*d2 = 0.954*0.477 + 0.176 *0.176 = 0.4862
Q*d3 = 0.176*0.176 + 0.176*0.176 = 0.0620
Vector-Space Model:
Example
Now, compute similarity score
Sim(q,d1) = 0.0310 / (0.538*0.719) = 0.0801
Sim(q,d2) = 0.4862 / (0.538*1.095) = 0.8246
Sim(q,d3) = 0.0620 / (0.538*0.352) = 0.3271
Finally, we sort and rank documents in descending
order according to the similarity scores
Rank 1: Doc 2 = 0.8246
Rank 2: Doc 3 = 0.3271
Rank 3: Doc 1 = 0.0801
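The whole pipeline above can be reproduced with a short Python sketch (the document strings, dictionary layout, and helper names are illustrative; IDF uses log10 as in the weight table):

```python
import math
from collections import Counter

docs = {
    "D1": "shipment of gold damaged in a fire",
    "D2": "delivery of silver arrived in a silver truck",
    "D3": "shipment of gold arrived in a truck",
}
query = "gold silver truck"

# Raw term frequencies (no stop-word removal, stemming, or TF normalization)
tf = {d: Counter(text.split()) for d, text in docs.items()}
tf["Q"] = Counter(query.split())

vocab = set().union(*(set(text.split()) for text in docs.values()))
N = len(docs)
df = {t: sum(1 for d in docs if t in tf[d]) for t in vocab}
idf = {t: math.log10(N / df[t]) for t in vocab}

def weights(name):
    """TF*IDF weight vector for a document or the query."""
    return {t: tf[name][t] * idf[t] for t in vocab}

def cosine(w1, w2):
    dot = sum(w1[t] * w2[t] for t in vocab)
    norm1 = math.sqrt(sum(v * v for v in w1.values()))
    norm2 = math.sqrt(sum(v * v for v in w2.values()))
    return dot / (norm1 * norm2)

wq = weights("Q")
for d, s in sorted(((d, cosine(wq, weights(d))) for d in docs), key=lambda kv: -kv[1]):
    print(d, round(s, 2))  # D2 ≈ 0.82, D3 ≈ 0.33, D1 ≈ 0.08
```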
• Exercise: using normalized TF, rank the documents using the cosine similarity measure. Hint: normalize the TF of term i in doc j by the maximum term frequency in doc j
Vector-Space Model
• Advantages:
• term-weighting improves quality of the answer set
since it displays in ranked order
• partial matching allows retrieval of documents that
approximate the query conditions
• cosine ranking formula sorts documents according
to degree of similarity to the query
• Disadvantages:
• assumes independence of index terms (??)
Exercise 1
Suppose the database collection consists of the following
documents.
c1: Human machine interface for Lab ABC computer
applications
c2: A survey of user opinion of computer system response time
c3: The EPS user interface management system
c4: System and human system engineering testing of EPS
c5: Relation of user-perceived response time to error measure
Query:
Find documents relevant to "human computer interaction"
Exercise 2
• Consider these documents:
Doc 1 breakthrough drug for schizophrenia
Doc 2 new schizophrenia drug
Doc 3 new approach for treatment of schizophrenia
Doc 4 new hopes for schizophrenia patients
–Draw the term-document incidence matrix for this document
collection.
–Draw the inverted index representation for this collection.
• For the document collection shown above, what are the returned results for the queries:
–schizophrenia AND drug
–for AND NOT(drug OR approach)
Probabilistic model
• Asks the question: what is the probability that the user will see relevant information if they read this document?
– P(rel | di): probability of relevance after reading di
– How likely is the user to get relevant information from reading this document?
– A high probability means the user is more likely to get relevant information.
• A Probabilistic retrieval model
– Rank documents in decreasing order of probability of relevance to the user's information need
– Calculate P(rel|di) for each document and rank
Probabilistic Model definitions
• Let D be a document in the collection.
– dj = (t1,j, t2,j, …, tt,j), ti,j ∈ {0,1}
• terms occurrences are boolean (not counts)
• query q is represented similarly
• Let R represent the set of relevant documents with respect to a given
query and let NR represent the set of irrelevant documents.
– P(R | dj) is probability that dj is relevant,
– P(NR | dj) is probability that dj is irrelevant
• Need to find p(R| D) - probability that a retrieved document D is
relevant.
• Similarity function (Bayes’ rule):
p(R | D) = p(D | R) p(R) / p(D)
p(NR | D) = p(D | NR) p(NR) / p(D)
– Rank by the ratio of the probability of relevance to the probability of non-relevance
– If p(R | D) > p(NR | D) then D is relevant, otherwise D is not relevant
Bayes’ Theorem: Application in IR
• Goal: want to estimate the probability that a document D
is relevant to a given query.
p(R | D) = p(R) p(D | R) / p(D) = p(R) p(D | R) / [ p(R) p(D | R) + p(NR) p(D | NR) ]

• It is easier to estimate the log odds of the probability of relevance:

log O(R | D) = log [ p(R | D) / p(NR | D) ] = log [ p(R) p(D | R) / ( p(NR) p(D | NR) ) ]

where p(NR | D) = 1 − p(R | D)
Probabilistic Models
• Most probabilistic models based on combining probabilities of
relevance and non-relevance of individual terms
– Probability that a term will appear in a relevant document
– Probability that the term will not appear in a non‐relevant
document
• These probabilities are estimated based on counting term
appearances in document descriptions
• Retrieval Status Value (rsv)
D is a vector of binary term
occurrences
We assume that terms occur
independently of each other
Principles surrounding weights
• Independence Assumptions
– I1: The distribution of terms in relevant documents is
independent and their distribution in all documents is
independent.
– I2: The distribution of terms in relevant documents is
independent and their distribution in non-relevant documents
is independent.
• Ordering Principles
– O1: Probable relevance is based only on the presence of search
terms in the documents.
– O2: Probable relevance is based on both the presence of
search terms in documents and their absence from documents.
Computing term probabilities
• Initially, there are no retrieved documents
– R is completely unknown
– Assume P(ti|R) is constant (usually 0.5)
– Assume P(ti|NR) approximated by distribution of ti
across collection – IDF
• This can be used to compute an initial ranking using IDF as the basic term weight
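Under these initial assumptions (P(ti|R) = 0.5 and P(ti|NR) ≈ ni/N), the term weight reduces to an IDF-like quantity; a minimal sketch of that estimate (the function name is illustrative):

```python
import math

def initial_term_weight(N, n_i):
    """Initial probabilistic term weight with no relevance information.

    Assumes P(ti|R) = 0.5 and P(ti|NR) = n_i / N, so the relevance odds
    ratio reduces to log((N - n_i) / n_i), an IDF-like weight.
    """
    p_rel, p_nonrel = 0.5, n_i / N
    return math.log((p_rel * (1 - p_nonrel)) / (p_nonrel * (1 - p_rel)))

# e.g. a term appearing in 2,000 of 10,000 documents
print(round(initial_term_weight(10_000, 2_000), 3))  # log(4) ≈ 1.386
```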
Probabilistic Model Example
d Document vectors <tfd,t>
col day eat hot lot nin old pea por pot
1 1 1 1 1
2 1 1 1
3 1 1 1
4 1 1 1
5 1 1
6 1 1
wt 0.26 0.56 0.56 0.26 0.56 0.56 0.56 0.0 0.0 0.26
• q1 = eat
• q2 = porridge
• q3 = hot porridge
• q4 = eat nine day old porridge
Improving the Ranking
• Now, suppose
– we have shown the initial ranking to the user
– the user has labeled some of the documents as
relevant ("relevance feedback")
• We now have
– N documents in coll, R are known relevant
– ni documents containing ti, ri are relevant
Improving Term Weight Estimates
N=the total number of documents in the collection
n= the total number of documents that contain term ti
R=the total number of relevant documents retrieved
r=the total number of relevant documents retrieved that contain term t i
Document Relevance (contingency table for term ti)

                               Relevant docs   Non-relevant docs    Total
Docs including term ti         r               n − r                n
Docs excluding term ti         R − r           N − R − (n − r)      N − n
Total                          R               N − R                N
Compute Term Weight: Robertson-Sparck Jones Weights
• Retrospective formulation
– Ratio of the odds of a relevant document having the term (i.e., ratio of relevant documents having the term to not having the term) to the odds of a non-relevant document having the term (i.e., ratio of non-relevant documents having the term to not having the term):

w = log [ ( r / (R − r) ) / ( (n − r) / (N − n − R + r) ) ]

• Predictive formulation
– To guarantee that the denominator is never zero, add 0.5 to all numerators and denominators:

w(1) = log [ ( (r + 0.5) / (R − r + 0.5) ) / ( (n − r + 0.5) / (N − n − R + r + 0.5) ) ]
     = log [ (r + 0.5)(N − n − R + r + 0.5) / ( (n − r + 0.5)(R − r + 0.5) ) ]
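A minimal Python sketch of the predictive Robertson-Sparck Jones weight (the function name and example numbers are illustrative):

```python
import math

def rsj_weight(N, R, n, r):
    """Predictive Robertson-Sparck Jones term weight.

    N: documents in the collection, R: known relevant documents,
    n: documents containing the term, r: relevant documents containing the term.
    """
    return math.log(((r + 0.5) * (N - n - R + r + 0.5)) /
                    ((n - r + 0.5) * (R - r + 0.5)))

# e.g. N=6 documents, R=1 known relevant, a term in n=2 docs, r=1 of them relevant
print(round(rsj_weight(N=6, R=1, n=2, r=1), 2))  # log(9) ≈ 2.2
```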
Relevance weighted Example
d Document vectors <tfd,t>
col day eat hot lot nin old pea por pot Relev
ance
1 1 1 1 1 NR
2 1 1 1 R
3 1 1 1 NR
4 1 1 1 NR
5 1 1 NR
6 1 1 NR
wt -0.33 0.00 0.00 -0.33 0.00 0.00 0.00 0.62 0.62 0.95
• q3 = hot porridge
• Document 2 is relevant
Probabilistic Retrieval Example
• D1: “Cost of paper is up.” (relevant)
• D2: “Cost of jellybeans is up.” (not relevant)
• D3: “Salaries of CEO’s are up.” (not relevant)
• D4: “Paper: CEO’s labor cost up.” (not relevant)
Probabilistic Retrieval Example
cost paper Jellybean salary CEO labor up
D1 1 1 0 0 0 0 1
D2 1 0 1 0 0 0 1
D3 0 0 0 1 1 0 1
D4 1 1 0 0 1 1 1
Wij 0.477 1.176 -0.477 -0.477 -0.477 0.222 -0.222
• D1 = 0.477 + 1.176 + (−0.222) = 1.431
• D2 = 0.477 + (−0.477) + (−0.222) = −0.222
• D3 = (−0.477) + (−0.477) + (−0.222) = −1.176
• D4 = 0.477 + 1.176 + (−0.477) + 0.222 + (−0.222) = 1.176
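This scoring step, summing the weights of the terms each document contains, can be sketched as follows (the weight values are taken from the table above; the dictionary layout is illustrative):

```python
# Score each document by summing the weights of the terms it contains.
weights = {"cost": 0.477, "paper": 1.176, "jellybean": -0.477,
           "salary": -0.477, "ceo": -0.477, "labor": 0.222, "up": -0.222}

docs = {
    "D1": {"cost", "paper", "up"},
    "D2": {"cost", "jellybean", "up"},
    "D3": {"salary", "ceo", "up"},
    "D4": {"cost", "paper", "ceo", "labor", "up"},
}

scores = {d: sum(weights[t] for t in terms) for d, terms in docs.items()}
for d, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(d, round(s, 3))  # D1 1.431, D4 1.176, D2 -0.222, D3 -1.176
```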
Exercise
• Consider the collection below. The collection has 5 documents and
each document is described by two terms. The initial guess of
relevance to a particular query Q is as given in the table below.
Assuming the query Q has a total of 2 relevant documents in this
collection solve the following questions
Document T1 T2 Relevance
D1 1 1 R
D2 0 1 NR
D3 1 0 NR
D4 1 0 R
D5 0 1 NR
• Using the probabilistic term weighting formula, calculate the new weight for each of the query terms in Q
• Rank the documents according to their probability of relevance
with the new query
Probabilistic model
• Probabilistic model uses probability theory to model the
uncertainty in the retrieval process
– Assumptions are made explicit
– Term weight without relevance information is IDF
• Relevance feedback can improve the ranking by giving better
term probability estimates
• Advantages of probabilistic model over vector ‐space
– Strong theoretical basis
– Since the base is probability theory, it is very well understood
– Easy to extend
• Disadvantages
– Models are often complicated
– No term frequency weighting
• Which is better: vector‐space or probabilistic?
– Both are approximately as good as each other
– Depends on collection, query, and other factors