3 Term Weighting


Introduction to Information Storage and Retrieval

Chapter Three: Term Weighting and Similarity Measures
Terms
• Terms are usually stems. Terms can also be phrases, such as "Computer Science", "World Wide Web", etc.
• Documents and queries are represented as vectors or "bags of words" (BOW).
• Each vector holds a place for every term in the collection: position 1 corresponds to term 1, position 2 to term 2, ..., position n to term n.

  $D_i = (w_{d_i1}, w_{d_i2}, \ldots, w_{d_in})$
  $Q = (w_{q1}, w_{q2}, \ldots, w_{qn})$,   where $w_{ij} = 0$ if a term is absent

• Documents are represented by binary weights or non-binary weighted vectors of terms.
Document Collection
• A collection of n documents can be represented in the vector space model by a term-document matrix.
• An entry in the matrix corresponds to the "weight" of a term in the document; zero means the term has no significance in the document or it simply doesn't exist in the document.

        T1    T2    ...   Tt
  D1    w11   w21   ...   wt1
  D2    w12   w22   ...   wt2
  :     :     :           :
  Dn    w1n   w2n   ...   wtn
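As an illustrative sketch (not part of the original slides), a raw-count term-document matrix could be built from already-tokenized documents roughly as follows; the function name and the toy collection are made up for the example:

    from collections import Counter

    def term_document_matrix(docs, vocabulary):
        # docs: list of tokenized documents (lists of terms)
        # vocabulary: ordered list of all terms in the collection
        # entry [i][j] = raw frequency of vocabulary term i in document j (0 if absent)
        index = {term: i for i, term in enumerate(vocabulary)}
        matrix = [[0] * len(docs) for _ in vocabulary]
        for j, doc in enumerate(docs):
            for term, count in Counter(doc).items():
                if term in index:
                    matrix[index[term]][j] = count
        return matrix

    docs = [["computer", "science", "web"], ["web", "web", "data"]]
    vocab = sorted({t for d in docs for t in d})
    print(term_document_matrix(docs, vocab))
    # vocab = ['computer', 'data', 'science', 'web'] -> [[1, 0], [0, 1], [1, 0], [1, 2]]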
Binary Weights
• Only the presence (1) or absence (0) of a term is included in the vector.
• The binary formula gives every word that appears in a document equal relevance.
• It can be useful when frequency is not important.
• Binary weights formula:

  $w_{ij} = \begin{cases} 1 & \text{if } freq_{ij} > 0 \\ 0 & \text{if } freq_{ij} = 0 \end{cases}$

  docs   t1  t2  t3
  D1      1   0   1
  D2      1   0   0
  D3      0   1   1
  D4      1   0   0
  D5      1   1   1
  D6      1   1   0
  D7      0   1   0
  D8      0   1   0
  D9      0   0   1
  D10     0   1   1
  D11     1   0   1
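A minimal sketch of the binary weighting above (the helper name is illustrative):

    def binary_weights(counts):
        # counts: term-document matrix of raw frequencies
        # returns 1 where the term occurs in the document, 0 otherwise
        return [[1 if freq > 0 else 0 for freq in row] for row in counts]

    print(binary_weights([[2, 0, 3], [0, 4, 0]]))   # [[1, 0, 1], [0, 1, 0]]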
Why use term weighting?
• Binary weights are too limiting.
  • Terms are either present or absent.
  • Binary weights do NOT allow ordering documents according to their level of relevance for a given query.
• Non-binary weights allow modeling of partial matching.
  • Partial matching allows retrieval of documents that approximate the query.
• Term weighting improves the quality of the answer set.
  • Term weighting enables ranking of retrieved documents, such that the best-matching documents are ordered at the top, as they are more relevant than the others.
Term Weighting: Term Frequency (TF)
• TF (term frequency): count the number of times a term occurs in a document.
  f_ij = frequency of term i in doc j
• The more times a term t occurs in document d, the more likely it is that t is relevant to the document, i.e. more indicative of the topic.
• If used alone, it favors common words and long documents.
• It gives too much credit to words that appear more frequently.
• We may want to normalize term frequency (tf) across the entire document:

  $tf_{ij} = \dfrac{f_{ij}}{\max_i(f_{ij})}$   (divide by the largest term frequency in document j)

  docs   t1  t2  t3
  D1      2   0   3
  D2      1   0   0
  D3      0   4   7
  D4      3   0   0
  D5      1   6   3
  D6      3   5   0
  D7      0   8   0
  D8      0  10   0
  D9      0   0   1
  D10     0   3   5
  D11     4   0   1
  Q       1   2   3
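A short sketch of max-tf normalization for a single document (the dict-based representation is just for illustration):

    def normalized_tf(freqs):
        # freqs: dict of term -> raw frequency for one document
        # tf_ij = f_ij / max(f_j), i.e. divide by the largest frequency in the document
        max_f = max(freqs.values())
        return {term: f / max_f for term, f in freqs.items()}

    print(normalized_tf({"t1": 2, "t3": 3}))   # D1 above: {'t1': 0.666..., 't3': 1.0}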
Document Normalization
• Long documents have an unfair advantage:
  • They use a lot of terms, so they get more matches than short documents.
  • And they use the same words repeatedly, so they have large term frequencies.
• Normalization seeks to remove these effects:
  • Related somehow to the maximum term frequency,
  • but also sensitive to the number of terms.
• What would happen if term frequencies in a document were not normalized?
  • Short documents might not be recognized as relevant.
Problems with term frequency
• We need a mechanism to reduce the effect of terms that occur too often in the collection to be meaningful for relevance/meaning determination.
  • Scale down the term weight of terms with high collection frequency.
  • Reduce the tf weight of a term by a factor that grows with its collection frequency.
• More common for this purpose is document frequency:
  • how many documents in the collection contain the term.
• Collection frequency (cf) and document frequency (df) behave differently: a term can occur many times overall (high cf) yet be concentrated in only a few documents (low df).
Document Frequency
• Document frequency is defined as the number of documents in the collection that contain a term.

  DF = document frequency

• Count the frequency considering the whole collection of documents.
• The less frequently a term appears in the whole collection, the more discriminating it is.

  df_i = document frequency of term i
       = number of documents containing term i
Inverse Document Frequency (IDF)
• IDF measures how rare a term is in the collection.
• IDF is a measure of the general importance of a term.
• It inverts the document frequency.
• It diminishes the weight of terms that occur very frequently in the collection and increases the weight of terms that occur rarely.
  • Gives full weight to terms that occur in one document only.
  • Gives lowest weight to terms that occur in all documents.
  • Terms that appear in many different documents are less indicative of the overall topic.

  idf_i = inverse document frequency of term i
        = log2(N / df_i)     (N: total number of documents)
Inverse Document Frequency
• E.g.: given a collection of 1000 documents and the document frequency of each word, compute the IDF of each word.

  Word    N      DF     IDF
  the     1000   1000   0
  some    1000   100    3.322
  car     1000   10     6.644
  merge   1000   1      9.966

• IDF provides high values for rare words and low values for common words.
• IDF is an indication of a term's discrimination power.
• The log is used to dampen the effect relative to tf.
• What is the difference between document frequency and corpus (collection) frequency?
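A quick Python check of the table above (the helper name idf is illustrative only):

    import math

    def idf(N, df):
        # idf_i = log2(N / df_i)
        return math.log2(N / df)

    for word, df_value in [("the", 1000), ("some", 100), ("car", 10), ("merge", 1)]:
        print(word, round(idf(1000, df_value), 3))
    # the 0.0, some 3.322, car 6.644, merge 9.966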
TF*IDF Weighting
• The most widely used term-weighting scheme is tf*idf:

  w_ij = tf_ij × idf_i = tf_ij × log2(N / df_i)

• A term occurring frequently in the document but rarely in the rest of the collection is given high weight.
• The tf*idf value for a term will always be greater than or equal to zero.
• Experimentally, tf*idf has been found to work well.
• It is often used in the vector space model together with cosine similarity to determine the similarity between two documents.
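A minimal sketch of tf*idf over a small tokenized collection, assuming the max-tf normalization used on the earlier slides (the function name and data layout are illustrative):

    import math
    from collections import Counter

    def tf_idf(docs):
        # docs: dict mapping document id -> list of tokens
        # w_ij = (f_ij / max_f_j) * log2(N / df_i)
        N = len(docs)
        counts = {d: Counter(tokens) for d, tokens in docs.items()}
        df = Counter(term for c in counts.values() for term in c)
        weights = {}
        for d, c in counts.items():
            max_f = max(c.values()) if c else 1
            weights[d] = {t: (f / max_f) * math.log2(N / df[t]) for t, f in c.items()}
        return weights

Terms that occur in every document get idf = log2(N/N) = 0 and therefore zero weight, which is exactly the filtering effect described above.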
TF*IDF weighting
• When does TF*IDF register a high weight?
  • When a term t occurs many times within a small number of documents.
  • The highest tf*idf for a term indicates a high term frequency (in the given document) and a low document frequency (in the whole collection of documents);
  • the weights hence tend to filter out common terms, giving high discriminating power to the documents that contain the rarer terms.
• Lower TF*IDF is registered when the term occurs fewer times in a document, or occurs in many documents,
  • thus offering a less pronounced relevance signal.
• The lowest TF*IDF is registered when the term occurs in virtually all documents.
Computing TF-IDF: An Example
Assume the collection contains 10,000 documents and statistical analysis shows that the document frequencies (DF) of three terms are: A(50), B(1300), C(250). The term frequencies (TF) of these terms in a given document are: A(3), B(2), C(1). Compute TF*IDF for each term.

  A: tf = 3/3 = 1.00;  idf = log2(10000/50)   = 7.644;  tf*idf = 7.644
  B: tf = 2/3 = 0.67;  idf = log2(10000/1300) = 2.943;  tf*idf = 1.962
  C: tf = 1/3 = 0.33;  idf = log2(10000/250)  = 5.322;  tf*idf = 1.774

• The query vector is typically treated as a document and is also tf*idf weighted.
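The example can be checked with a few lines of Python (the numbers are taken from the slide; the loop is just an illustration):

    import math

    for term, f, df in [("A", 3, 50), ("B", 2, 1300), ("C", 1, 250)]:
        tf = f / 3                      # normalized by the maximum frequency in the document (3)
        idf = math.log2(10000 / df)
        print(term, round(tf, 2), round(idf, 3), round(tf * idf, 3))
    # A 1.0 7.644 7.644
    # B 0.67 2.943 1.962
    # C 0.33 5.322 1.774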
Another Example
• Consider a document containing 100 words wherein the word cow appears 3 times. Now, assume we have 10 million documents and cow appears in one thousand of these.
• The term frequency (TF) for cow:
  3/100 = 0.03
• The inverse document frequency is:
  log2(10,000,000 / 1,000) = log2(10,000) ≈ 13.288
• The TF*IDF score is the product of these frequencies: 0.03 × 13.288 ≈ 0.397
Exercise
• Let C = number of times a given word appears in a document;
• TW = total number of words in a document;
• TD = total number of documents in a corpus; and
• DF = total number of documents containing a given word.
• Compute the TF, IDF and TF*IDF score for each term.

  Word       C   TW   TD   DF   TF   IDF   TF*IDF
  airplane   5   46   3    1
  blue       1   46   3    1
  chair      7   46   3    3
  computer   3   46   3    1
  forest     2   46   3    1
  justice    7   46   3    3
  love       2   46   3    1
  might      2   46   3    1
  perl       5   46   3    2
  rose       6   46   3    3
  shoe       4   46   3    1
  thesis     2   46   3    2
Concluding remarks
• Suppose, from a set of English documents, we wish to determine which ones are the most relevant to the query "the brown cow."
• A simple way to start out is by eliminating documents that do not contain all three words "the," "brown," and "cow," but this still leaves many documents.
• To further distinguish them, we might count the number of times each term occurs in each document and sum them all together;
  • the number of times a term occurs in a document is called its TF. However, because the term "the" is so common, this will tend to incorrectly emphasize documents which happen to use the word "the" more often, without giving enough weight to the more meaningful terms "brown" and "cow".
  • Also, the term "the" is not a good keyword for distinguishing relevant from non-relevant documents, whereas terms like "brown" and "cow", which occur rarely, are good keywords for distinguishing relevant documents from the non-relevant ones.
Concluding remarks
• Hence IDF is incorporated, which diminishes the weight of terms that occur very frequently in the collection and increases the weight of terms that occur rarely.
• This leads to the use of TF*IDF as a better weighting technique.
• On top of that, we apply similarity measures to calculate the distance between document i and query j.
• There are a number of similarity measures; the most common are:
  • Euclidean distance, inner (dot) product, cosine similarity, Dice similarity, Jaccard similarity, etc.
Similarity Measure
• We now have vectors for all documents in the collection and a vector for the query; how do we compute similarity?
• A similarity measure is a function that computes the degree of similarity or distance between a document vector and a query vector.
• Using a similarity measure between the query and each document:
  • It is possible to rank the retrieved documents in the order of presumed relevance.
  • It is possible to enforce a certain threshold so that the size of the retrieved set can be controlled.

[Figure: document vectors D1 and D2 and query vector Q in the space spanned by terms t1, t2, t3, with angles θ1 and θ2 between the query and each document]
Intuition

[Figure: document vectors d1–d5 in the space spanned by terms t1, t2, t3, with angles θ and φ between pairs of vectors]

• Postulate: Documents that are "close together" in the vector space talk about the same things and are more similar than others.
Similarity Measure
Desiderata for proximity:
1. If d1 is near d2, then d2 is near d1.
2. If d1 is near d2, and d2 is near d3, then d1 is not far from d3.
3. No document is closer to d than d itself.
• Sometimes it is a good idea to determine the maximum possible similarity as the "distance" between a document d and itself.
• A similarity measure attempts to compute the distance between a document vector wj and a query vector wq.
• The assumption here is that documents whose vectors are close to the query vector are more relevant to the query than documents whose vectors are farther away from it.
Similarity Measure: Techniques
• Euclidean distance
  • The most familiar distance measure. Euclidean distance takes the square root of the sum of squared differences between the coordinates of a pair of document and query term vectors.
• Dot product
  • The dot product is also known as the scalar product or inner product.
  • It is computed as the sum of the products of the corresponding weights in the query and document vectors.
• Cosine similarity
  • Also called the normalized inner product.
  • It projects the document and query vectors into the term space and calculates the cosine of the angle between them.
Euclidean distance
• The distance between the vectors for document dj and query q can be computed as:

  $sim(d_j, q) = |d_j - q| = \sqrt{\sum_{i=1}^{n} (w_{ij} - w_{iq})^2}$

  where w_ij is the weight of term i in document j and w_iq is the weight of term i in the query. (Note that this is a distance: smaller values mean more similar.)

• Example: Determine the Euclidean distance between the document vector (0, 3, 2, 1, 10) and the query vector (2, 7, 1, 0, 0); a 0 means the corresponding term is not found in the document or query.

  $\sqrt{(0-2)^2 + (3-7)^2 + (2-1)^2 + (1-0)^2 + (10-0)^2} = \sqrt{122} \approx 11.05$
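A minimal sketch of the computation above (function name is illustrative):

    import math

    def euclidean_distance(doc, query):
        # square root of the sum of squared coordinate differences; smaller means more similar
        return math.sqrt(sum((wd - wq) ** 2 for wd, wq in zip(doc, query)))

    print(round(euclidean_distance([0, 3, 2, 1, 10], [2, 7, 1, 0, 0]), 2))   # 11.05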
Inner Product
• Similarity between the vectors for document dj and query q can be computed as the vector inner product:

  $sim(d_j, q) = \vec{d_j} \cdot \vec{q} = \sum_{i=1}^{n} w_{ij} \cdot w_{iq}$

  where w_ij is the weight of term i in document j and w_iq is the weight of term i in query q.

• For binary vectors, the inner product is the number of matched query terms in the document (the size of the intersection).
• For weighted term vectors, it is the sum of the products of the weights of the matched terms.
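A short sketch of the inner product; the two sample calls reproduce numbers that appear in the examples on the following slides:

    def inner_product(doc, query):
        # sum of the products of matching term weights;
        # with binary vectors this counts the query terms present in the document
        return sum(wd * wq for wd, wq in zip(doc, query))

    print(inner_product([2, 3, 5], [1, 0, 2]))                             # 12 (weighted example)
    print(inner_product([1, 1, 1, 0, 1, 1, 0], [1, 0, 1, 0, 0, 1, 1]))     # 3  (binary example)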
Properties of Inner Product
• It favors long documents with a large number of unique terms.
  • Again, the issue of normalization needs to be considered.
• It measures how many terms matched, but not how many terms are not matched. The unmatched terms have value zero, as either the document vector or the query vector has a weight of ZERO for them.
Inner Product -- Examples
• Binary weights; size of vector = size of vocabulary = 7:

       Retrieval  Database  Term  Computer  Text  Manage  Data
  D1       1         1       1       0        1      1      0
  D2       0         1       0       0        0      1      0
  Q        1         0       1       0        0      1      1

  sim(D1, Q) = 1*1 + 1*0 + 1*1 + 0*0 + 1*0 + 1*1 + 0*1 = 3
  sim(D2, Q) = 0*1 + 1*0 + 0*1 + 0*0 + 0*0 + 1*1 + 0*1 = 1

• Term weighted:

       Retrieval  Database  Architecture
  D1       2         3          5
  D2       3         7          1
  Q        1         0          2

  sim(D1, Q) = 2*1 + 3*0 + 5*2 = 12
  sim(D2, Q) = 3*1 + 7*0 + 1*2 = 5
Inner Product: Example 1

[Figure: documents d1–d7 plotted in the space of index terms k1, k2, k3]

       k1  k2  k3   q · dj
  d1    1   0   1     2
  d2    1   0   0     1
  d3    0   1   1     2
  d4    1   0   0     1
  d5    1   1   1     3
  d6    1   1   0     2
  d7    0   1   0     1
  q     1   1   1
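A minimal sketch that reproduces the q · dj column above and ranks the documents by score (variable and function names are illustrative):

    docs = {"d1": [1, 0, 1], "d2": [1, 0, 0], "d3": [0, 1, 1], "d4": [1, 0, 0],
            "d5": [1, 1, 1], "d6": [1, 1, 0], "d7": [0, 1, 0]}
    q = [1, 1, 1]

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    # rank the documents by their inner product with the query, highest first
    for d in sorted(docs, key=lambda name: dot(docs[name], q), reverse=True):
        print(d, dot(docs[d], q))
    # d5 -> 3; d1, d3, d6 -> 2; d2, d4, d7 -> 1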
Inner Product: Exercise

[Figure: documents d1–d7 plotted in the space of index terms k1, k2, k3]

       k1  k2  k3   q · dj
  d1    1   0   1     ?
  d2    1   0   0     ?
  d3    0   1   1     ?
  d4    1   0   0     ?
  d5    1   1   1     ?
  d6    1   1   0     ?
  d7    0   1   0     ?
  q     1   2   3
Cosine similarity
• Measures the similarity between dj and q as the cosine of the angle between them:

  $sim(d_j, q) = \frac{\vec{d_j} \cdot \vec{q}}{|\vec{d_j}| \, |\vec{q}|} = \frac{\sum_{i=1}^{n} w_{i,j} \, w_{i,q}}{\sqrt{\sum_{i=1}^{n} w_{i,j}^2} \; \sqrt{\sum_{i=1}^{n} w_{i,q}^2}}$

• Or, between two documents dj and dk:

  $sim(d_j, d_k) = \frac{\vec{d_j} \cdot \vec{d_k}}{|\vec{d_j}| \, |\vec{d_k}|} = \frac{\sum_{i=1}^{n} w_{i,j} \, w_{i,k}}{\sqrt{\sum_{i=1}^{n} w_{i,j}^2} \; \sqrt{\sum_{i=1}^{n} w_{i,k}^2}}$

• The denominator involves the lengths of the vectors.
• The cosine measure is also known as the normalized inner product.

  Length: $|\vec{d_j}| = \sqrt{\sum_{i=1}^{n} w_{i,j}^2}$
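A minimal sketch of cosine similarity (the zero-vector guard is an added convenience, not from the slides):

    import math

    def cosine_similarity(a, b):
        # (a . b) / (|a| * |b|); returns 0.0 if either vector is all zeros
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    print(round(cosine_similarity([0.2, 0.7], [0.4, 0.8]), 2))   # 0.98, as in the example below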
Example: Computing Cosine Similarity
• Say we have query vector Q = (0.4, 0.8) and document D1 = (0.2, 0.7). Compute their similarity using the cosine measure.

  $sim(Q, D_1) = \frac{(0.4 \times 0.2) + (0.8 \times 0.7)}{\sqrt{(0.4)^2 + (0.8)^2} \times \sqrt{(0.2)^2 + (0.7)^2}} = \frac{0.64}{\sqrt{0.8} \times \sqrt{0.53}} = \frac{0.64}{0.65} \approx 0.98$
Example: Computing Cosine Similarity
• Say we have two documents in our corpus, D1 = (0.8, 0.3) and D2 = (0.2, 0.7). Given query vector Q = (0.4, 0.8), determine which document is the most relevant to the query.

  cos θ1 = sim(Q, D1) ≈ 0.74
  cos θ2 = sim(Q, D2) ≈ 0.98

[Figure: Q, D1 and D2 plotted in a two-dimensional term space; θ1 is the angle between Q and D1, θ2 the angle between Q and D2]

• D2 is the more relevant document: the angle between Q and D2 is smaller, so its cosine is larger (0.98 > 0.74).
Example
• Given three documents D1, D2 and D3 with the corresponding TF*IDF weights below, which documents are most similar under each of the three measures (Euclidean distance, inner product, cosine similarity)?

  Terms      D1   D2   D3
  affection   2    3    1
  jealous     0    2    4
  gossip      1    0    2
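One way to set this up, as a sketch (the pairwise loop and helper names are illustrative; run it to compare the pairs):

    import math
    from itertools import combinations

    docs = {"D1": [2, 0, 1], "D2": [3, 2, 0], "D3": [1, 4, 2]}   # (affection, jealous, gossip)

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def euclidean(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def cosine(a, b):
        return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

    for (n1, v1), (n2, v2) in combinations(docs.items(), 2):
        print(n1, n2, round(euclidean(v1, v2), 2), dot(v1, v2), round(cosine(v1, v2), 2))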
Cosine Similarity vs. Inner Product
• Cosine similarity measures the cosine of the angle between two vectors: the inner product normalized by the vector lengths.

  $CosSim(d_j, q) = \frac{\vec{d_j} \cdot \vec{q}}{|\vec{d_j}| \, |\vec{q}|} = \frac{\sum_{i=1}^{t} (w_{ij} \cdot w_{iq})}{\sqrt{\sum_{i=1}^{t} w_{ij}^2} \; \sqrt{\sum_{i=1}^{t} w_{iq}^2}}$

  $InnerProduct(d_j, q) = \vec{d_j} \cdot \vec{q}$

• Example:
  D1 = 2T1 + 3T2 + 5T3    CosSim(D1, Q) = 10 / $\sqrt{(4+9+25)(0+0+4)}$ = 0.81
  D2 = 3T1 + 7T2 + 1T3    CosSim(D2, Q) = 2 / $\sqrt{(9+49+1)(0+0+4)}$ = 0.13
  Q  = 0T1 + 0T2 + 2T3

• D1 is 6 times better than D2 using cosine similarity, but only 5 times better using the inner product (10 vs. 2).
Exercises
• A database collection consists of 1 million documents, of which 200,000 contain the term holiday while 250,000 contain the term season. A document repeats holiday 7 times and season 5 times. It is known that holiday is repeated more than any other term in the document. Calculate the weight of both terms in this document using the following term-weighting methods:
  (i) normalized and unnormalized TF;
  (ii) TF*IDF based on normalized and unnormalized TF.
