
Information Storage and Retrieval

Chapter Three
Term weighting and similarity
measures
Target Group – IT 3rd year students

Injibara, Ethiopia
Terms
• Terms are usually stems. Terms can also be phrases, such as
  "Computer Science", "World Wide Web", etc.
• Documents and queries are represented as vectors or "bags of words"
  (BOW).
  – Each vector holds a place for every term in the collection.
  – Position 1 corresponds to term 1, position 2 to term 2, ..., position n
    to term n.

      Di = (wdi1, wdi2, ..., wdin)
      Q  = (wq1, wq2, ..., wqn)

  – w = 0 if a term is absent.
  – wdi1 is the weight of term 1 in document di.
• Documents are represented by binary weights or non-binary weighted
  vectors of terms.

Document Collection
• A collection of n documents can be represented in the vector space
  model by a term-document matrix.
• An entry in the matrix corresponds to the "weight" of a term in the
  document; zero means the term has no significance in the document or
  it simply doesn't exist in the document.

  A term-document matrix:

        T1    T2    ...   Tt
  D1    w11   w21   ...   wt1
  D2    w12   w22   ...   wt2
  :     :     :          :
  Dn    w1n   w2n   ...   wtn

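Below is a minimal Python sketch of building such a term-document matrix from raw counts (the toy documents and variable names are illustrative, not from the chapter); thresholding the counts at zero gives the binary weights discussed on the next slide.

```python
# A minimal sketch (not from the slides): a term-document count matrix
# for a toy collection. Documents and names here are made up.
from collections import Counter

docs = {
    "D1": "information retrieval stores and retrieves information",
    "D2": "term weighting improves retrieval",
    "D3": "vector space model weighting",
}

# Vocabulary = every term that occurs anywhere in the collection.
vocabulary = sorted({t for text in docs.values() for t in text.split()})

# matrix[d][t] = raw frequency of term t in document d (0 if absent).
matrix = {d: Counter(text.split()) for d, text in docs.items()}

print("      " + "  ".join(f"{t[:6]:>6}" for t in vocabulary))
for d in docs:
    print(f"{d:<6}" + "  ".join(f"{matrix[d].get(t, 0):>6}" for t in vocabulary))

# Binary weights (next slide): 1 if the term appears at all, else 0.
binary = {d: {t: 1 if matrix[d].get(t, 0) > 0 else 0 for t in vocabulary}
          for d in docs}
```
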
Binary Weights
• Only the presence (1) or absence (0) of a term is included in the vector.
• The binary formula gives every word that appears in a document equal
  relevance.
• It can be useful when frequency is not important.

  docs   t1   t2   t3
  D1      1    0    1
  D2      1    0    0
  D3      0    1    1
  D4      1    0    0
  D5      1    1    1
  D6      1    1    0
  D7      0    1    0
  D8      0    1    0
  D9      0    0    1
  D10     0    1    1
  D11     1    0    1

• Binary Weights Formula:

      wij = 1   if freqij > 0
      wij = 0   if freqij = 0

Why use term weighting?
• Term weighting improves the quality of the answer set.
  – Term weighting enables ranking of retrieved documents, so that the
    best-matching documents are ordered at the top as they are more
    relevant than others.
• Binary weights are too limiting.
  – Terms are either present or absent.
  – They do not allow ordering documents according to their level of
    relevance for a given query.
• Non-binary weights allow us to model partial matching.
  – Partial matching allows retrieval of documents that approximate the
    query.

Term Weighting: Term Frequency (TF)
• TF (term frequency): count the number of times a term occurs in a
  document.
      fij = frequency of term i in document j
• The more times a term t occurs in document d, the more likely it is that
  t is relevant to the document, i.e. more indicative of the topic.
  – If used alone, it favors common words and long documents.
  – It gives too much credit to words that appear more frequently.
• We may want to normalize term frequency (tf), e.g. by the maximum
  frequency:
      tfij = fij / max{fij}

  docs   t1   t2   t3
  D1      2    0    3
  D2      1    0    0
  D3      0    4    7
  D4      3    0    0
  D5      1    6    3
  D6      3    5    0
  D7      0    8    0
  D8      0   10    0
  D9      0    0    1
  D10     0    3    5
  D11     4    0    1

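As a rough illustration, the max-frequency normalization can be computed per document as follows; the helper below is a sketch (an assumption, not from the slides) applied to rows of the table above.

```python
# A small sketch: max-frequency normalized TF, tf_ij = f_ij / max{f_ij},
# where the maximum is taken over the same document's raw counts.
def normalized_tf(freqs):
    """Divide each raw count by the largest count in the document."""
    max_f = max(freqs)
    return [f / max_f if max_f > 0 else 0.0 for f in freqs]

print(normalized_tf([2, 0, 3]))   # D1 -> [0.667, 0.0, 1.0]
print(normalized_tf([1, 6, 3]))   # D5 -> [0.167, 1.0, 0.5]
```
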
Document Normalization
• Long documents have an unfair advantage:
  – They use a lot of terms,
    • so they get more matches than short documents;
  – and they use the same words repeatedly,
    • so they have much higher term frequencies.
• Normalization seeks to remove these effects:
  – related somehow to maximum term frequency,
  – but also sensitive to the number of terms.
• If we don't normalize, short documents may not be recognized as
  relevant.

… (cont.)
• Term normalization is the process of converting multiple terms into a
  single term for indexing and retrieval. By normalizing various terms into
  one, you increase the consistency of the search results. There are
  several common classes of term normalizations:
  1. Spelling variants, such as customise => customize
  2. Compound terms, such as data base => database
  3. Spelling corrections, such as febuary => february
  4. Aliases or name variations, such as MS Word => Microsoft Word

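A minimal sketch of this idea, assuming normalization is implemented as a simple lookup table applied before indexing and before query processing (the map and function below are illustrative, not from the chapter):

```python
# Illustrative only: term normalization as a canonical-form lookup table.
NORMALIZATION_MAP = {
    "customise": "customize",    # spelling variant
    "data base": "database",     # compound term
    "febuary": "february",       # spelling correction
    "ms word": "microsoft word", # alias / name variation
}

def normalize_term(term: str) -> str:
    """Map a raw term to its canonical indexing form (identity if unknown)."""
    return NORMALIZATION_MAP.get(term.lower(), term.lower())

print(normalize_term("Febuary"))    # -> february
print(normalize_term("retrieval"))  # -> retrieval (unchanged)
```
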
Problems with term frequency
• We need a mechanism for reducing the effect of terms that occur too
  often in the collection to be meaningful for relevance/meaning
  determination.
• Scale down the term weight of terms with high collection frequency.
  – Reduce the tf weight of a term by a factor that grows with the
    collection frequency.
• More common for this purpose is document frequency:
  – how many documents in the collection contain the term.
• Collection frequency and document frequency behave differently.

Document Frequency
• Document frequency is defined as the number of documents in the
  collection that contain a term.
      DF = document frequency
  – Count the frequency considering the whole collection of documents.
  – The less frequently a term appears in the whole collection, the more
    discriminating it is.

      dfi = document frequency of term i
          = number of documents containing term i

Inverse Document Frequency (IDF)
• IDF measures the rarity of a term in the collection; it is a measure of the
  general importance of the term.
  – It inverts the document frequency.
• It diminishes the weight of terms that occur very frequently in the
  collection and increases the weight of terms that occur rarely.
  – Gives full weight to terms that occur in one document only.
  – Gives lowest weight to terms that occur in all documents.
  – Terms that appear in many different documents are less indicative of
    the overall topic.

      idfi = inverse document frequency of term i
           = log2(N / dfi)      (N: total number of documents)

Inverse Document Frequency
• E.g.: given a collection of 1000 documents and the document
  frequencies below, compute the IDF for each word.

  Word     N      DF     IDF
  the      1000   1000   0
  some     1000   100    3.322
  car      1000   10     6.644
  merge    1000   1      9.966

• IDF provides high values for rare words and low values for common
  words.
• IDF is an indication of a term's discrimination power.
• The log is used to dampen the effect relative to tf.
• What is the difference between document frequency and collection
  (corpus) frequency?

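The table can be reproduced with a few lines of Python; this is a sketch (not from the slides) assuming idfi = log2(N / dfi) as defined above.

```python
# A small sketch reproducing the IDF values in the table above.
import math

def idf(n_docs: int, df: int) -> float:
    """Inverse document frequency with a base-2 logarithm."""
    return math.log2(n_docs / df)

for word, df in [("the", 1000), ("some", 100), ("car", 10), ("merge", 1)]:
    print(f"{word:>6}: idf = {idf(1000, df):.3f}")
# the: 0.000, some: 3.322, car: 6.644, merge: 9.966
```
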
TF*IDF Weighting
• The most used term-weighting scheme is tf*idf:

      wij = tfij × idfi = tfij × log2(N / dfi)

• A term occurring frequently in the document but rarely in the rest of the
  collection is given high weight.
  – The tf-idf value for a term will always be greater than or equal to zero.
• Experimentally, tf*idf has been found to work well.
  – It is often used in the vector space model together with cosine
    similarity to determine the similarity between two documents.

TF*IDF Weighting (cont.)
• When does TF*IDF register a high weight?
  – When a term t occurs many times within a small number of documents.
  – The highest tf*idf for a term shows it has a high term frequency (in the
    given document) and a low document frequency (in the whole
    collection of documents);
  – the weights hence tend to filter out common terms,
  – thus lending high discriminating power to those documents.
• The lowest TF*IDF is registered when the term occurs in virtually all
  documents.

Computing TF-IDF: An Example
• Assume the collection contains 10,000 documents and statistical
  analysis shows that the document frequencies (DF) of three terms are:
  A(50), B(1300), C(250). The term frequencies (TF) of these terms in a
  given document are: A(3), B(2), C(1). Compute TF*IDF for each term.

      where idfi = log2(N / dfi)   (N: total number of documents)

  A: tf = 3/3 = 1.00;  idf = log2(10000/50)   = 7.644;  tf*idf = 7.644
  B: tf = 2/3 = 0.67;  idf = log2(10000/1300) = 2.943;  tf*idf = 1.962
  C: tf = 1/3 = 0.33;  idf = log2(10000/250)  = 5.322;  tf*idf = 1.774

• The query vector is typically treated as a document and is also tf-idf
  weighted.
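
The A, B and C values above can be checked with a short sketch (not from the slides), assuming tf is normalized by the maximum raw frequency in the document, here 3.

```python
# A minimal sketch verifying the worked example: tf is max-normalized
# within the document, idf uses a base-2 logarithm.
import math

N = 10_000  # documents in the collection

def tf_idf(freq: int, max_freq: int, df: int) -> float:
    tf = freq / max_freq        # normalized term frequency
    idf = math.log2(N / df)     # inverse document frequency
    return tf * idf

for term, freq, df in [("A", 3, 50), ("B", 2, 1300), ("C", 1, 250)]:
    print(term, round(tf_idf(freq, max_freq=3, df=df), 3))
# A 7.644, B 1.962, C 1.774
```
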
More Example
• Consider a document containing 100 words in which the word cow
  appears 3 times. Assume we have 10 million documents and cow
  appears in one thousand of these.
  – The term frequency (TF) for cow:
        3/100 = 0.03
  – The inverse document frequency (IDF) is
        log2(10,000,000 / 1,000) = 13.288
  – The TF*IDF score is the product of these:
        0.03 × 13.288 ≈ 0.399

Exercise
• Let C  = number of times a given word appears in a document;
      TW = total number of words in a document;
      TD = total number of documents in a corpus; and
      DF = total number of documents containing a given word.
• Compute the TF, IDF and TF*IDF score for each term.

  Word       C    TW   TD   DF   TF      IDF         TFIDF
  airplane   5    46   3    1    5/46    log2(3/1)
  blue       1    46   3    1
  chair      7    46   3    3
  computer   3    46   3    1
  forest     2    46   3    1
  justice    7    46   3    3
  love       2    46   3    1
  might      2    46   3    1
  perl       5    46   3    2
  rose       6    46   3    3
  shoe       4    46   3    1
  thesis     2    46   3    2

• Recall: the highest tf*idf for a term shows it has a high term frequency
  (in the given document) and a low document frequency (in the whole
  collection of documents).

Similarity Measure
• We now have vectors for all documents in the collection and a vector for
  the query; how do we compute similarity?
• A similarity measure is a function that computes the degree of similarity
  or distance between a document vector and a query vector.
• Using a similarity measure between the query and each document:
  – It is possible to rank the retrieved documents in the order of presumed
    relevance.
  – It is possible to enforce a certain threshold so that the size of the
    retrieved set can be controlled.

  [Figure: document vectors D1 and D2 and query vector Q plotted in the
  term space t1, t2, t3, with angles θ1 and θ2 between the query and each
  document.]

Intuition

  [Figure: documents d1–d5 and angles θ, φ plotted in the term space
  t1, t2, t3.]

• Postulate: documents that are "close together" in the vector space talk
  about the same things and are more similar to each other than to others.

Similarity Measure
• A similarity measure attempts to compute the distance between a
  document vector wj and the query vector wq.
  – The assumption here is that documents whose vectors are close to the
    query vector are more relevant to the query than documents whose
    vectors are farther away from the query vector.
• Desiderata for proximity:
  1. If d1 is near d2, then d2 is near d1.
  2. If d1 is near d2, and d2 is near d3, then d1 is not far from d3.
  3. No document is closer to d than d itself.
  – Sometimes it is a good idea to determine the maximum possible
    similarity as the "distance" between a document d and itself.

Similarity Measure: Techniques
• Euclidean distance
  – The most common distance measure. Euclidean distance examines the
    root of squared differences between the coordinates of a pair of
    document and query vectors.
• Dot product
  – The dot product is also known as the scalar product or inner product.
  – It is defined as the sum of the products of the corresponding
    components of the query and document vectors.
• Cosine similarity (or normalized inner product)
  – It projects document and query vectors into a term space and
    calculates the cosine of the angle between them.

Euclidean distance
• Similarity between the vectors for document dj and query q can be
  computed as:

      sim(dj, q) = |dj − q| = sqrt( Σi=1..n (wij − wiq)² )

  where wij is the weight of term i in document j and wiq is the weight of
  term i in the query.
• Example: determine the Euclidean distance between the document
  vector (0, 3, 2, 1, 10) and the query vector (2, 7, 1, 0, 0). A 0 means the
  corresponding term is not found in the document or query.

      sqrt((0−2)² + (3−7)² + (2−1)² + (1−0)² + (10−0)²) = sqrt(122) ≈ 11.05

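A minimal sketch of this computation (illustrative code, not from the slides):

```python
# Euclidean distance between the document and query vectors above.
import math

def euclidean_distance(doc, query):
    """Square root of the sum of squared component-wise differences."""
    return math.sqrt(sum((wd - wq) ** 2 for wd, wq in zip(doc, query)))

print(euclidean_distance([0, 3, 2, 1, 10], [2, 7, 1, 0, 0]))  # ≈ 11.05
```
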
Inner Product
• Similarity between the vectors for document dj and query q can be
  computed as the vector inner product:

      sim(dj, q) = dj • q = Σi=1..n wij · wiq

  where wij is the weight of term i in document j and wiq is the weight of
  term i in the query q.
• For binary vectors, the inner product is the number of matched query
  terms in the document (the size of the intersection).
• For weighted term vectors, it is the sum of the products of the weights of
  the matched terms.

Properties of Inner Product
• Favors long documents with a large number of unique terms.
  – Again, the issue of normalization.
• Measures how many terms matched, but not how many terms are not
  matched.

Inner Product -- Examples
• Binary weights:
  – Size of vector = size of vocabulary = 7

         Retrieval  Database  Term  Computer  Text  Manage  Data
    D        1         1       1       0       1      1      0
    Q        1         0       1       0       0      1      1

    sim(D, Q) = 3

• Term weighted:

         Retrieval  Database  Architecture
    D1       2         3          5
    D2       3         7          1
    Q        1         0          2

    sim(D1, Q) = 2*1 + 3*0 + 5*2 = 12
    sim(D2, Q) = 3*1 + 7*0 + 1*2 = 5
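
The two weighted scores can be verified with a short sketch (illustrative code, not from the slides):

```python
# Weighted inner product for the D1, D2 and Q vectors above.
def inner_product(doc, query):
    """Sum of the products of matching term weights."""
    return sum(wd * wq for wd, wq in zip(doc, query))

Q = [1, 0, 2]
print(inner_product([2, 3, 5], Q))  # D1 -> 12
print(inner_product([3, 7, 1], Q))  # D2 -> 5
```
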
Inner Product: Example 1

       k1   k2   k3   q • dj
  d1    1    0    1     2
  d2    1    0    0     1
  d3    0    1    1     2
  d4    1    0    0     1
  d5    1    1    1     3
  d6    1    1    0     2
  d7    0    1    0     1

  q     1    1    1

  [Figure: documents d1–d7 plotted in the term space k1, k2, k3.]

Inner Product: Exercise

       k1   k2   k3   q • dj
  d1    1    0    1     ?
  d2    1    0    0     ?
  d3    0    1    1     ?
  d4    1    0    0     ?
  d5    1    1    1     ?
  d6    1    1    0     ?
  d7    0    1    0     ?

  q     1    2    3

  [Figure: documents d1–d7 plotted in the term space k1, k2, k3.]

Cosine similarity
• Measures the similarity between two vectors as the cosine of the angle
  between them.
• For a document dj and query q:

      sim(dj, q) = (dj • q) / (|dj| |q|)
                 = Σi=1..n (wij · wiq) / ( sqrt(Σi=1..n wij²) · sqrt(Σi=1..n wiq²) )

• Or, for two documents dj and dk:

      sim(dj, dk) = (dj • dk) / (|dj| |dk|)
                  = Σi=1..n (wij · wik) / ( sqrt(Σi=1..n wij²) · sqrt(Σi=1..n wik²) )

• The denominator involves the lengths of the vectors, where

      |dj| = sqrt( Σi=1..n wij² )

• So the cosine measure is also known as the normalized inner product.

Example: Computing Cosine Similarity
• Say we have the query vector Q = (0.4, 0.8) and the document
  D1 = (0.2, 0.7). Compute their similarity using the cosine measure.

      sim(Q, D1) = ((0.4 × 0.2) + (0.8 × 0.7))
                   / ( sqrt(0.4² + 0.8²) × sqrt(0.2² + 0.7²) )
                 = 0.64 / sqrt(0.8 × 0.53)
                 = 0.64 / 0.651 ≈ 0.98

Example: Computing Cosine Similarity
• Say we have two documents in our corpus, D1 = (0.8, 0.3) and
  D2 = (0.2, 0.7). Given the query vector Q = (0.4, 0.8), determine which
  document is the most relevant one for the query.

      cos θ1 = sim(Q, D1) ≈ 0.73
      cos θ2 = sim(Q, D2) ≈ 0.98

• Since cos θ2 > cos θ1, D2 is more relevant to the query than D1.

  [Figure: Q, D1 and D2 plotted in a two-dimensional term space, with
  angles θ1 and θ2 between the query and each document.]

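Both cosine values can be reproduced with a short sketch (illustrative code, not from the slides):

```python
# Cosine similarity for Q = (0.4, 0.8), D1 = (0.8, 0.3), D2 = (0.2, 0.7).
import math

def cosine_similarity(a, b):
    """Inner product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

Q = (0.4, 0.8)
print(cosine_similarity(Q, (0.8, 0.3)))  # D1 -> ≈ 0.73
print(cosine_similarity(Q, (0.2, 0.7)))  # D2 -> ≈ 0.98
```
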
Example
• Given three documents D1, D2 and D3 with the corresponding TF*IDF
  weights below, which documents are most similar to each other under
  the three measures (Euclidean distance, inner product and cosine)?

  Terms       D1      D2      D3
  affection   0.996   0.993   0.847
  jealous     0.087   0.120   0.466
  gossip      0.017   0.000   0.254

Cosine Similarity vs. Inner Product
• Cosine similarity measures the cosine of the angle between two vectors:
  – the inner product normalized by the vector lengths.

      CosSim(dj, q) = (dj • q) / (|dj| · |q|)
                    = Σi=1..t (wij × wiq)
                      / ( sqrt(Σi=1..t wij²) × sqrt(Σi=1..t wiq²) )

      InnerProduct(dj, q) = dj • q

• Example:
      D1 = 2T1 + 3T2 + 5T3    CosSim(D1, Q) = 10 / sqrt((4+9+25)(0+0+4)) ≈ 0.81
      D2 = 3T1 + 7T2 + 1T3    CosSim(D2, Q) = 2 / sqrt((9+49+1)(0+0+4)) ≈ 0.13
      Q  = 0T1 + 0T2 + 2T3

• D1 is 6 times better than D2 using cosine similarity, but only 5 times
  better using the inner product.

Exercises
 A database collection consists of 1 million documents, of which
200,000 contain the term holiday while 250,000 contain the term
season. A document repeats holiday 7 times and season 5 times. It
is known that holiday is repeated more than any other term in the
document. Calculate the weight of both terms in this document
using three different term weight methods. Try with

(i) normalized and unnormalized TF;

(ii) TF*IDF based on normalized and unnormalized TF
