5 IR Models
IR Models - Basic Concepts
Word evidence: Bag of words
IR systems usually adopt index terms to index and retrieve
documents
Each document is represented by a set of representative
keywords or index terms (called Bag of Words)
An index term is a word from a document that is useful for
remembering the document's main themes
Not all terms are equally useful for representing the document
contents:
less frequent terms allow identifying a narrower set of
documents
But no ordering information is attached to the bag of words
identified from the document collection (see the sketch below).
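To make this concrete, here is a minimal Python sketch of a bag-of-words representation; the small stopword list is an illustrative assumption, and the sample text is document c2 from the example collection later in this section.

```python
# Minimal bag-of-words sketch: a document becomes an unordered set of index terms.
# The small stopword list below is an illustrative assumption.
STOPWORDS = {"the", "of", "a", "to", "and", "for"}

def bag_of_words(text):
    """Lowercase, split on whitespace, strip punctuation, drop stopwords.
    Returning a set discards all ordering information."""
    return {token.strip(".,:").lower() for token in text.split()} - STOPWORDS

doc = "A survey of user opinion of computer system response time"
print(bag_of_words(doc))
# e.g. {'survey', 'user', 'opinion', 'computer', 'system', 'response', 'time'}
```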
IR Models - Basic Concepts
• One central problem regarding IR systems is the
issue of predicting which documents are relevant and
which are not
• Such a decision is usually dependent on a ranking
algorithm which attempts to establish a simple
ordering of the documents retrieved
• Documents appearing at the top of this ordering
are considered more likely to be relevant
• Thus ranking algorithms are at the core of IR systems
• The IR models determine the predictions of what is
relevant and what is not, based on the notion of
relevance implemented by the system
IR Models - Basic Concepts
• After preprocessing, N distinct terms remain; these
unique terms form the VOCABULARY
• Let
– ki be index term i and dj be document j
– K = {k1, k2, …, kN} be the set of all index terms
• Each term, i, in a document or query j, is given a real-
valued weight, wij.
– wij is a weight associated with (ki, dj). If wij = 0, it
indicates that the term does not belong to document dj
• The weight wij quantifies the importance of the index term
for describing the document contents
• vec(dj) = (w1j, w2j, …, wNj) is a weighted vector
associated with the document dj
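As a minimal sketch of this representation, assuming an illustrative five-term vocabulary and hand-picked weights (neither comes from the slides):

```python
# A document d_j represented as a weighted vector over the vocabulary K = (k1, ..., kN).
# Vocabulary and weights below are illustrative assumptions.
vocabulary = ["human", "interface", "computer", "user", "system"]

# vec(d_j) = (w_1j, w_2j, ..., w_Nj); w_ij = 0 means term k_i is absent from d_j.
d_j = [0.0, 0.5, 0.8, 0.0, 0.3]

for k_i, w_ij in zip(vocabulary, d_j):
    print(f"w({k_i}, d_j) = {w_ij}")
```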
Mapping Documents & Queries
Represent both documents and queries as N-dimensional
vectors in a term-document matrix, which shows the
occurrence of terms in the document collection or query:
dj = (t1,j, t2,j, …, tN,j);   qk = (t1,k, t2,k, …, tN,k)
An entry in the matrix corresponds to the “weight” of a
term in the document;
– The document collection is mapped to a term-by-document matrix
– The documents are viewed as vectors in multidimensional space
  • “Nearby” vectors are related
– Normalize the weight as usual for vector length to avoid the effect of document length

       T1    T2    …    TN
D1     w11   w12   …    w1N
D2     w21   w22   …    w2N
:      :     :          :
DM     wM1   wM2   …    wMN
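The sketch below builds such a term-by-document matrix for a two-document toy collection (documents c1 and c2 from the example later in this section) and length-normalizes each document vector; using raw term counts as weights is an assumption made only for illustration.

```python
import math

# Toy collection; raw term counts serve as weights purely for illustration.
docs = {
    "D1": "human machine interface for lab abc computer applications",
    "D2": "a survey of user opinion of computer system response time",
}

# Vocabulary: the N distinct terms of the collection.
vocabulary = sorted({t for text in docs.values() for t in text.split()})

def doc_vector(text):
    """Term-count vector over the vocabulary, normalized to unit length
    so that document length does not dominate the comparison."""
    tokens = text.split()
    counts = [tokens.count(term) for term in vocabulary]
    length = math.sqrt(sum(c * c for c in counts)) or 1.0
    return [c / length for c in counts]

matrix = {name: doc_vector(text) for name, text in docs.items()}
for name, vec in matrix.items():
    print(name, [round(w, 2) for w in vec])
```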
Weighting Terms in Vector Space
The importance of the index terms is represented by
weights associated with them
Problem: to show the importance of the index term for
describing the document/query contents, what weight can
we assign?
Solution 1: Binary weights: t = 1 if the term is present, 0 otherwise
Similarity: number of terms in common
Problem: Not all terms are equally interesting
E.g. the vs. dog vs. cat
Solution: Replace binary weights with non-binary weights
dj = (w1,j, w2,j, …, wN,j);   qk = (w1,k, w2,k, …, wN,k)
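A small sketch contrasting the two weighting schemes; the example document, query, and the non-binary weights are assumptions used only to illustrate the point.

```python
# Binary weights: a term is either present (1) or absent (0), so the
# similarity reduces to the number of index terms shared by d_j and q.
def binary_overlap(doc_terms, query_terms):
    return len(set(doc_terms) & set(query_terms))

doc   = ["computer", "system", "response", "time"]
query = ["computer", "response"]
print(binary_overlap(doc, query))   # 2 terms in common

# With binary weights "the" counts as much as "dog" or "cat".
# Non-binary weights let uninformative terms contribute less (values are illustrative).
weights = {"the": 0.01, "dog": 0.90, "cat": 0.85}
print(weights)
```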
The Boolean Model
• Boolean model is a simple model based on set theory
• The Boolean model imposes a binary criterion for
deciding relevance
• Terms are either present or absent. Thus,
wij ∈ {0, 1}
• sim(q, dj) = 1 if the document satisfies the Boolean query, 0 otherwise
- Note that no weights in between 0 and 1 are assigned; only the values 0 or 1 can be used

       T1    T2    …    TN
D1     w11   w12   …    w1N
D2     w21   w22   …    w2N
:      :     :          :
DM     wM1   wM2   …    wMN
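A minimal sketch of Boolean matching, assuming a hypothetical binary incidence matrix and an example query; sim(q, dj) is 1 exactly when the document satisfies the query.

```python
# Boolean model: w_ij is restricted to {0, 1}.
# The incidence rows and the example query below are hypothetical.
docs = {
    "D1": {"k1": 1, "k2": 0, "k3": 1},
    "D2": {"k1": 1, "k2": 1, "k3": 0},
    "D3": {"k1": 0, "k2": 1, "k3": 1},
}

def sim(query, d):
    """Returns 1 if the document's binary term row satisfies the query, else 0."""
    return 1 if query(d) else 0

# Example Boolean query: k2 AND NOT k3
query = lambda d: d["k2"] and not d["k3"]

for name, d in docs.items():
    print(name, sim(query, d))
```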
The Boolean Model: Example
• Generate the relevant documents retrieved by
the Boolean model for the query:
q = k1 ∧ (k2 ∨ k3)
[Venn diagram: documents d1–d7 distributed over the three index-term sets k1, k2, and k3]
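Because the Boolean model is set-theoretic, the same query can be answered with set operations over each term's set of documents; the document-to-term assignments below are hypothetical and not read off the diagram.

```python
# Posting sets: which of d1..d7 contain each index term (hypothetical assignments).
postings = {
    "k1": {"d2", "d4", "d5", "d6"},
    "k2": {"d1", "d2", "d6", "d7"},
    "k3": {"d3", "d4", "d5", "d6", "d7"},
}

# q = k1 AND (k2 OR k3): intersect k1's documents with the union of k2's and k3's.
answer = postings["k1"] & (postings["k2"] | postings["k3"])
print(sorted(answer))   # documents satisfying the query
```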
The Boolean Model: Example
• Given the following, determine the documents retrieved by a
Boolean-model-based IR system
• Index terms: k1, …, k8
• Documents:
• Cosine similarity between the weighted document vector dj and the query vector q:
  sim(dj, q) = cos(θ) = (dj · q) / (|dj| × |q|)
             = Σi=1..n wi,j · wi,k / ( √(Σi=1..n wi,j²) × √(Σi=1..n wi,k²) )
• Disadvantages:
  – assumes independence of index terms (??)
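A short sketch of this cosine computation in plain Python; the example weight vectors are assumptions.

```python
import math

def cosine(d, q):
    """Cosine of the angle between a document weight vector d and a query vector q."""
    dot = sum(w_d * w_q for w_d, w_q in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    return dot / (norm_d * norm_q) if norm_d and norm_q else 0.0

d_j = [0.5, 0.8, 0.3]   # illustrative document weights w_i,j
q_k = [0.0, 1.0, 1.0]   # illustrative query weights w_i,k
print(round(cosine(d_j, q_k), 3))
```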
Suppose the document collection consists of the following
documents.
c1: Human machine interface for Lab ABC computer
applications
c2: A survey of user opinion of computer system response time
c3: The EPS user interface management system
c4: System and human system engineering testing of EPS
c5: Relation of user-perceived response time to error measure
m1: The generation of random, binary, unordered trees
m2: The intersection graph of paths in trees
m3: Graph minors: Widths of trees and well-quasi-ordering
m4: Graph minors: A survey
Query:
Find documents relevant to "human computer"
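One way to run this example, assuming the query is interpreted as the Boolean conjunction human AND computer over the nine titles above (a sketch, not an official answer):

```python
# Boolean retrieval for the query "human computer", read as: human AND computer.
collection = {
    "c1": "Human machine interface for Lab ABC computer applications",
    "c2": "A survey of user opinion of computer system response time",
    "c3": "The EPS user interface management system",
    "c4": "System and human system engineering testing of EPS",
    "c5": "Relation of user-perceived response time to error measure",
    "m1": "The generation of random, binary, unordered trees",
    "m2": "The intersection graph of paths in trees",
    "m3": "Graph minors: Widths of trees and well-quasi-ordering",
    "m4": "Graph minors: A survey",
}

def index_terms(text):
    return {t.strip(".,:").lower() for t in text.split()}

hits = [doc for doc, text in collection.items()
        if {"human", "computer"} <= index_terms(text)]
print(hits)   # documents containing both query terms
```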
Exercises
Given the following documents, rank the documents
according to their relevance to the query using the
cosine similarity, Euclidean distance, and inner
product measures.
docID words in document
1 Taipei Taiwan
2 Macao Taiwan Shanghai
3 Japan Sapporo
4 Sapporo Osaka Taiwan
Query: Taiwan Sapporo
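A sketch for working through the exercise, assuming simple binary term weights over the vocabulary of these four documents; it prints all three measures per document rather than a fixed ranking.

```python
import math

docs = {
    1: "Taipei Taiwan",
    2: "Macao Taiwan Shanghai",
    3: "Japan Sapporo",
    4: "Sapporo Osaka Taiwan",
}
query = "Taiwan Sapporo"

vocab = sorted({t for text in docs.values() for t in text.split()})

def binary_vector(text):
    """Binary weights: 1 if the term occurs in the text, 0 otherwise."""
    present = set(text.split())
    return [1 if term in present else 0 for term in vocab]

q = binary_vector(query)
for doc_id, text in docs.items():
    d = binary_vector(text)
    inner = sum(wd * wq for wd, wq in zip(d, q))
    euclid = math.sqrt(sum((wd - wq) ** 2 for wd, wq in zip(d, q)))
    cos = inner / (math.sqrt(sum(w * w for w in d)) * math.sqrt(sum(w * w for w in q)))
    print(doc_id, {"inner": inner, "euclidean": round(euclid, 3), "cosine": round(cos, 3)})
```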