
Information Storage and Retrieval

Chapter Three
Term weighting and similarity
measures
Target Group – IT 3rd year students

Injibara, Ethiopia
Terms
• Terms are usually stems. Terms can also be phrases, such as
  "Computer Science", "World Wide Web", etc.
• Documents and queries are represented as vectors or "bags of words"
  (BOW).
  – Each vector holds a place for every term in the collection.
  – Position 1 corresponds to term 1, position 2 to term 2, ..., position n
    to term n.

      Di = (wdi1, wdi2, ..., wdin)
      Q  = (wq1, wq2, ..., wqn)

  – w = 0 if a term is absent.
  – wdi1 is the weight of term 1 in document di.
• Documents are represented by binary weights or non-binary weighted
  vectors of terms.

Document Collection
• A collection of n documents can be represented in the vector space
  model by a term-document matrix.
• An entry in the matrix corresponds to the "weight" of a term in the
  document; zero means the term has no significance in the document or
  it simply doesn't exist in the document.

  A term-document matrix:

        T1    T2    ...   Tt
  D1    w11   w21   ...   wt1
  D2    w12   w22   ...   wt2
  :     :     :          :
  Dn    w1n   w2n   ...   wtn

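Below is a minimal Python sketch of building such a term-document matrix from raw counts (the toy documents and variable names are illustrative, not from the chapter); thresholding the counts at zero gives the binary weights discussed on the next slide.

```python
# A minimal sketch (not from the slides): a term-document count matrix
# for a toy collection. Documents and names here are made up.
from collections import Counter

docs = {
    "D1": "information retrieval stores and retrieves information",
    "D2": "term weighting improves retrieval",
    "D3": "vector space model weighting",
}

# Vocabulary = every term that occurs anywhere in the collection.
vocabulary = sorted({t for text in docs.values() for t in text.split()})

# matrix[d][t] = raw frequency of term t in document d (0 if absent).
matrix = {d: Counter(text.split()) for d, text in docs.items()}

print("      " + "  ".join(f"{t[:6]:>6}" for t in vocabulary))
for d in docs:
    print(f"{d:<6}" + "  ".join(f"{matrix[d].get(t, 0):>6}" for t in vocabulary))

# Binary weights (next slide): 1 if the term appears at all, else 0.
binary = {d: {t: 1 if matrix[d].get(t, 0) > 0 else 0 for t in vocabulary}
          for d in docs}
```
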
Binary Weights
• Only the presence (1) or absence (0) of a term is included in the vector.
• The binary formula gives every word that appears in a document equal
  relevance.
• It can be useful when frequency is not important.

  docs   t1   t2   t3
  D1      1    0    1
  D2      1    0    0
  D3      0    1    1
  D4      1    0    0
  D5      1    1    1
  D6      1    1    0
  D7      0    1    0
  D8      0    1    0
  D9      0    0    1
  D10     0    1    1
  D11     1    0    1

• Binary Weights Formula:

      wij = 1   if freqij > 0
      wij = 0   if freqij = 0

Why use term weighting?
• Term weighting improves the quality of the answer set.
  – Term weighting enables ranking of retrieved documents, so that the
    best-matching documents are ordered at the top as they are more
    relevant than others.
• Binary weights are too limiting.
  – Terms are either present or absent.
  – They do not allow ordering documents according to their level of
    relevance for a given query.
• Non-binary weights allow us to model partial matching.
  – Partial matching allows retrieval of documents that approximate the
    query.

Term Weighting: Term Frequency (TF)
• TF (term frequency): count the number of times a term occurs in a
  document.
      fij = frequency of term i in document j
• The more times a term t occurs in document d, the more likely it is that
  t is relevant to the document, i.e. more indicative of the topic.
  – If used alone, it favors common words and long documents.
  – It gives too much credit to words that appear more frequently.
• We may want to normalize term frequency (tf), e.g. by the maximum
  frequency:
      tfij = fij / max{fij}

  docs   t1   t2   t3
  D1      2    0    3
  D2      1    0    0
  D3      0    4    7
  D4      3    0    0
  D5      1    6    3
  D6      3    5    0
  D7      0    8    0
  D8      0   10    0
  D9      0    0    1
  D10     0    3    5
  D11     4    0    1

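As a rough illustration, the max-frequency normalization can be computed per document as follows; the helper below is a sketch (an assumption, not from the slides) applied to rows of the table above.

```python
# A small sketch: max-frequency normalized TF, tf_ij = f_ij / max{f_ij},
# where the maximum is taken over the same document's raw counts.
def normalized_tf(freqs):
    """Divide each raw count by the largest count in the document."""
    max_f = max(freqs)
    return [f / max_f if max_f > 0 else 0.0 for f in freqs]

print(normalized_tf([2, 0, 3]))   # D1 -> [0.667, 0.0, 1.0]
print(normalized_tf([1, 6, 3]))   # D5 -> [0.167, 1.0, 0.5]
```
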
Document Normalization
• Long documents have an unfair advantage:
  – They use a lot of terms,
    • so they get more matches than short documents;
  – and they use the same words repeatedly,
    • so they have much higher term frequencies.
• Normalization seeks to remove these effects:
  – related somehow to maximum term frequency,
  – but also sensitive to the number of terms.
• If we don't normalize, short documents may not be recognized as
  relevant.

… (cont.)
• Term normalization is the process of converting multiple terms into a
  single term for indexing and retrieval. By normalizing various terms into
  one, you increase the consistency of the search results. There are
  several common classes of term normalizations:
  1. Spelling variants, such as customise => customize
  2. Compound terms, such as data base => database
  3. Spelling corrections, such as febuary => february
  4. Aliases or name variations, such as MS Word => Microsoft Word

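A minimal sketch of this idea, assuming normalization is implemented as a simple lookup table applied before indexing and before query processing (the map and function below are illustrative, not from the chapter):

```python
# Illustrative only: term normalization as a canonical-form lookup table.
NORMALIZATION_MAP = {
    "customise": "customize",    # spelling variant
    "data base": "database",     # compound term
    "febuary": "february",       # spelling correction
    "ms word": "microsoft word", # alias / name variation
}

def normalize_term(term: str) -> str:
    """Map a raw term to its canonical indexing form (identity if unknown)."""
    return NORMALIZATION_MAP.get(term.lower(), term.lower())

print(normalize_term("Febuary"))    # -> february
print(normalize_term("retrieval"))  # -> retrieval (unchanged)
```
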
Problems with term frequency
• We need a mechanism for reducing the effect of terms that occur too
  often in the collection to be meaningful for relevance/meaning
  determination.
• Scale down the term weight of terms with high collection frequency.
  – Reduce the tf weight of a term by a factor that grows with the
    collection frequency.
• More common for this purpose is document frequency:
  – how many documents in the collection contain the term.
• Collection frequency and document frequency behave differently.

Document Frequency
• Document frequency is defined as the number of documents in the
  collection that contain a term.
      DF = document frequency
  – Count the frequency considering the whole collection of documents.
  – The less frequently a term appears in the whole collection, the more
    discriminating it is.

      dfi = document frequency of term i
          = number of documents containing term i

Inverse Document Frequency (IDF)
• IDF measures the rarity of a term in the collection; it is a measure of the
  general importance of the term.
  – It inverts the document frequency.
• It diminishes the weight of terms that occur very frequently in the
  collection and increases the weight of terms that occur rarely.
  – Gives full weight to terms that occur in one document only.
  – Gives lowest weight to terms that occur in all documents.
  – Terms that appear in many different documents are less indicative of
    the overall topic.

      idfi = inverse document frequency of term i
           = log2(N / dfi)      (N: total number of documents)

Inverse Document Frequency
• E.g.: given a collection of 1000 documents and the document
  frequencies below, compute the IDF for each word.

  Word     N      DF     IDF
  the      1000   1000   0
  some     1000   100    3.322
  car      1000   10     6.644
  merge    1000   1      9.966

• IDF provides high values for rare words and low values for common
  words.
• IDF is an indication of a term's discrimination power.
• The log is used to dampen the effect relative to tf.
• What is the difference between document frequency and collection
  (corpus) frequency?

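The table can be reproduced with a few lines of Python; this is a sketch (not from the slides) assuming idfi = log2(N / dfi) as defined above.

```python
# A small sketch reproducing the IDF values in the table above.
import math

def idf(n_docs: int, df: int) -> float:
    """Inverse document frequency with a base-2 logarithm."""
    return math.log2(n_docs / df)

for word, df in [("the", 1000), ("some", 100), ("car", 10), ("merge", 1)]:
    print(f"{word:>6}: idf = {idf(1000, df):.3f}")
# the: 0.000, some: 3.322, car: 6.644, merge: 9.966
```
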
TF*IDF Weighting
• The most used term-weighting scheme is tf*idf:

      wij = tfij × idfi = tfij × log2(N / dfi)

• A term occurring frequently in the document but rarely in the rest of the
  collection is given high weight.
  – The tf-idf value for a term will always be greater than or equal to zero.
• Experimentally, tf*idf has been found to work well.
  – It is often used in the vector space model together with cosine
    similarity to determine the similarity between two documents.

TF*IDF Weighting (cont.)
• When does TF*IDF register a high weight?
  – When a term t occurs many times within a small number of documents.
  – The highest tf*idf for a term shows it has a high term frequency (in the
    given document) and a low document frequency (in the whole
    collection of documents);
  – the weights hence tend to filter out common terms,
  – thus lending high discriminating power to those documents.
• The lowest TF*IDF is registered when the term occurs in virtually all
  documents.

Computing TF-IDF: An Example
• Assume the collection contains 10,000 documents and statistical
  analysis shows that the document frequencies (DF) of three terms are:
  A(50), B(1300), C(250). The term frequencies (TF) of these terms in a
  given document are: A(3), B(2), C(1). Compute TF*IDF for each term.

      where idfi = log2(N / dfi)   (N: total number of documents)

  A: tf = 3/3 = 1.00;  idf = log2(10000/50)   = 7.644;  tf*idf = 7.644
  B: tf = 2/3 = 0.67;  idf = log2(10000/1300) = 2.943;  tf*idf = 1.962
  C: tf = 1/3 = 0.33;  idf = log2(10000/250)  = 5.322;  tf*idf = 1.774

• The query vector is typically treated as a document and is also tf-idf
  weighted.
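
The A, B and C values above can be checked with a short sketch (not from the slides), assuming tf is normalized by the maximum raw frequency in the document, here 3.

```python
# A minimal sketch verifying the worked example: tf is max-normalized
# within the document, idf uses a base-2 logarithm.
import math

N = 10_000  # documents in the collection

def tf_idf(freq: int, max_freq: int, df: int) -> float:
    tf = freq / max_freq        # normalized term frequency
    idf = math.log2(N / df)     # inverse document frequency
    return tf * idf

for term, freq, df in [("A", 3, 50), ("B", 2, 1300), ("C", 1, 250)]:
    print(term, round(tf_idf(freq, max_freq=3, df=df), 3))
# A 7.644, B 1.962, C 1.774
```
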
More Example
• Consider a document containing 100 words in which the word cow
  appears 3 times. Assume we have 10 million documents and cow
  appears in one thousand of these.
  – The term frequency (TF) for cow:
        3/100 = 0.03
  – The inverse document frequency (IDF) is
        log2(10,000,000 / 1,000) = 13.288
  – The TF*IDF score is the product of these:
        0.03 × 13.288 ≈ 0.399

Exercise
• Let C  = number of times a given word appears in a document;
      TW = total number of words in a document;
      TD = total number of documents in a corpus; and
      DF = total number of documents containing a given word.
• Compute the TF, IDF and TF*IDF score for each term.

  Word       C    TW   TD   DF   TF      IDF         TFIDF
  airplane   5    46   3    1    5/46    log2(3/1)
  blue       1    46   3    1
  chair      7    46   3    3
  computer   3    46   3    1
  forest     2    46   3    1
  justice    7    46   3    3
  love       2    46   3    1
  might      2    46   3    1
  perl       5    46   3    2
  rose       6    46   3    3
  shoe       4    46   3    1
  thesis     2    46   3    2

• Recall: the highest tf*idf for a term shows it has a high term frequency
  (in the given document) and a low document frequency (in the whole
  collection of documents).

Similarity Measure
• We now have vectors for all documents in the collection and a vector for
  the query; how do we compute similarity?
• A similarity measure is a function that computes the degree of similarity
  or distance between a document vector and a query vector.
• Using a similarity measure between the query and each document:
  – It is possible to rank the retrieved documents in the order of presumed
    relevance.
  – It is possible to enforce a certain threshold so that the size of the
    retrieved set can be controlled.

  [Figure: document vectors D1 and D2 and query vector Q plotted in the
  term space t1, t2, t3, with angles θ1 and θ2 between the query and each
  document.]

Intuition

  [Figure: documents d1–d5 and angles θ, φ plotted in the term space
  t1, t2, t3.]

• Postulate: documents that are "close together" in the vector space talk
  about the same things and are more similar to each other than to others.

Similarity Measure
• A similarity measure attempts to compute the distance between a
  document vector wj and the query vector wq.
  – The assumption here is that documents whose vectors are close to the
    query vector are more relevant to the query than documents whose
    vectors are farther away from the query vector.
• Desiderata for proximity:
  1. If d1 is near d2, then d2 is near d1.
  2. If d1 is near d2, and d2 is near d3, then d1 is not far from d3.
  3. No document is closer to d than d itself.
  – Sometimes it is a good idea to determine the maximum possible
    similarity as the "distance" between a document d and itself.

Similarity Measure: Techniques
• Euclidean distance
  – The most common distance measure. Euclidean distance examines the
    root of squared differences between the coordinates of a pair of
    document and query vectors.
• Dot product
  – The dot product is also known as the scalar product or inner product.
  – It is defined as the sum of the products of the corresponding
    components of the query and document vectors.
• Cosine similarity (or normalized inner product)
  – It projects document and query vectors into a term space and
    calculates the cosine of the angle between them.

Euclidean distance
• Similarity between the vectors for document dj and query q can be
  computed as:

      sim(dj, q) = |dj − q| = sqrt( Σi=1..n (wij − wiq)² )

  where wij is the weight of term i in document j and wiq is the weight of
  term i in the query.
• Example: determine the Euclidean distance between the document
  vector (0, 3, 2, 1, 10) and the query vector (2, 7, 1, 0, 0). A 0 means the
  corresponding term is not found in the document or query.

      sqrt((0−2)² + (3−7)² + (2−1)² + (1−0)² + (10−0)²) = sqrt(122) ≈ 11.05

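A minimal sketch of this computation (illustrative code, not from the slides):

```python
# Euclidean distance between the document and query vectors above.
import math

def euclidean_distance(doc, query):
    """Square root of the sum of squared component-wise differences."""
    return math.sqrt(sum((wd - wq) ** 2 for wd, wq in zip(doc, query)))

print(euclidean_distance([0, 3, 2, 1, 10], [2, 7, 1, 0, 0]))  # ≈ 11.05
```
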
Inner Product
• Similarity between the vectors for document dj and query q can be
  computed as the vector inner product:

      sim(dj, q) = dj • q = Σi=1..n wij · wiq

  where wij is the weight of term i in document j and wiq is the weight of
  term i in the query q.
• For binary vectors, the inner product is the number of matched query
  terms in the document (the size of the intersection).
• For weighted term vectors, it is the sum of the products of the weights of
  the matched terms.

Properties of Inner Product
• Favors long documents with a large number of unique terms.
  – Again, the issue of normalization.
• Measures how many terms matched, but not how many terms are not
  matched.

Inner Product -- Examples
• Binary weights:
  – Size of vector = size of vocabulary = 7

         Retrieval  Database  Term  Computer  Text  Manage  Data
    D        1         1       1       0       1      1      0
    Q        1         0       1       0       0      1      1

    sim(D, Q) = 3

• Term weighted:

         Retrieval  Database  Architecture
    D1       2         3          5
    D2       3         7          1
    Q        1         0          2

    sim(D1, Q) = 2*1 + 3*0 + 5*2 = 12
    sim(D2, Q) = 3*1 + 7*0 + 1*2 = 5
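
The two weighted scores can be verified with a short sketch (illustrative code, not from the slides):

```python
# Weighted inner product for the D1, D2 and Q vectors above.
def inner_product(doc, query):
    """Sum of the products of matching term weights."""
    return sum(wd * wq for wd, wq in zip(doc, query))

Q = [1, 0, 2]
print(inner_product([2, 3, 5], Q))  # D1 -> 12
print(inner_product([3, 7, 1], Q))  # D2 -> 5
```
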
Inner Product: Example 1

       k1   k2   k3   q • dj
  d1    1    0    1     2
  d2    1    0    0     1
  d3    0    1    1     2
  d4    1    0    0     1
  d5    1    1    1     3
  d6    1    1    0     2
  d7    0    1    0     1

  q     1    1    1

  [Figure: documents d1–d7 plotted in the term space k1, k2, k3.]

Inner Product: Exercise

       k1   k2   k3   q • dj
  d1    1    0    1     ?
  d2    1    0    0     ?
  d3    0    1    1     ?
  d4    1    0    0     ?
  d5    1    1    1     ?
  d6    1    1    0     ?
  d7    0    1    0     ?

  q     1    2    3

  [Figure: documents d1–d7 plotted in the term space k1, k2, k3.]

Cosine similarity
• Measures the similarity between two vectors as the cosine of the angle
  between them.
• For a document dj and query q:

      sim(dj, q) = (dj • q) / (|dj| |q|)
                 = Σi=1..n (wij · wiq) / ( sqrt(Σi=1..n wij²) · sqrt(Σi=1..n wiq²) )

• Or, for two documents dj and dk:

      sim(dj, dk) = (dj • dk) / (|dj| |dk|)
                  = Σi=1..n (wij · wik) / ( sqrt(Σi=1..n wij²) · sqrt(Σi=1..n wik²) )

• The denominator involves the lengths of the vectors, where

      |dj| = sqrt( Σi=1..n wij² )

• So the cosine measure is also known as the normalized inner product.

Example: Computing Cosine Similarity
• Say we have the query vector Q = (0.4, 0.8) and the document
  D1 = (0.2, 0.7). Compute their similarity using the cosine measure.

      sim(Q, D1) = ((0.4 × 0.2) + (0.8 × 0.7))
                   / ( sqrt(0.4² + 0.8²) × sqrt(0.2² + 0.7²) )
                 = 0.64 / sqrt(0.8 × 0.53)
                 = 0.64 / 0.651 ≈ 0.98

Example: Computing Cosine Similarity
• Say we have two documents in our corpus, D1 = (0.8, 0.3) and
  D2 = (0.2, 0.7). Given the query vector Q = (0.4, 0.8), determine which
  document is the most relevant one for the query.

      cos θ1 = sim(Q, D1) ≈ 0.73
      cos θ2 = sim(Q, D2) ≈ 0.98

• Since cos θ2 > cos θ1, D2 is more relevant to the query than D1.

  [Figure: Q, D1 and D2 plotted in a two-dimensional term space, with
  angles θ1 and θ2 between the query and each document.]

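Both cosine values can be reproduced with a short sketch (illustrative code, not from the slides):

```python
# Cosine similarity for Q = (0.4, 0.8), D1 = (0.8, 0.3), D2 = (0.2, 0.7).
import math

def cosine_similarity(a, b):
    """Inner product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

Q = (0.4, 0.8)
print(cosine_similarity(Q, (0.8, 0.3)))  # D1 -> ≈ 0.73
print(cosine_similarity(Q, (0.2, 0.7)))  # D2 -> ≈ 0.98
```
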
Example
• Given three documents D1, D2 and D3 with the corresponding TF*IDF
  weights below, which documents are most similar to each other under
  the three measures (Euclidean distance, inner product and cosine)?

  Terms       D1      D2      D3
  affection   0.996   0.993   0.847
  jealous     0.087   0.120   0.466
  gossip      0.017   0.000   0.254

Cosine Similarity vs. Inner Product
• Cosine similarity measures the cosine of the angle between two vectors:
  – the inner product normalized by the vector lengths.

      CosSim(dj, q) = (dj • q) / (|dj| · |q|)
                    = Σi=1..t (wij × wiq)
                      / ( sqrt(Σi=1..t wij²) × sqrt(Σi=1..t wiq²) )

      InnerProduct(dj, q) = dj • q

• Example:
      D1 = 2T1 + 3T2 + 5T3    CosSim(D1, Q) = 10 / sqrt((4+9+25)(0+0+4)) ≈ 0.81
      D2 = 3T1 + 7T2 + 1T3    CosSim(D2, Q) = 2 / sqrt((9+49+1)(0+0+4)) ≈ 0.13
      Q  = 0T1 + 0T2 + 2T3

• D1 is 6 times better than D2 using cosine similarity, but only 5 times
  better using the inner product.

Exercises
 A database collection consists of 1 million documents, of which
200,000 contain the term holiday while 250,000 contain the term
season. A document repeats holiday 7 times and season 5 times. It
is known that holiday is repeated more than any other term in the
document. Calculate the weight of both terms in this document
using three different term weight methods. Try with

(i) normalized and unnormalized TF;

(ii) TF*IDF based on normalized and unnormalized TF
