3 Term Weighting
Terms
Terms are usually stems. Terms can also be phrases, such as
“Computer Science”, “World Wide Web”, etc.
Documents and queries are represented as vectors or “bags of
words” (BOW).
Each vector holds a place for every term in the collection.
Position 1 corresponds to term 1, position 2 to term 2, ..., and position n to term n.
Di = (wdi1, wdi2, ..., wdin)
Q = (wq1, wq2, ..., wqn)
wij = 0 if a term is absent.
      T1    T2   ...  Tt
D1    w11   w21  ...  wt1
D2    w12   w22  ...  wt2
...
Dn    w1n   w2n  ...  wtn
Binary Weights
• Only the presence (1) or absence (0) of a term is included in the vector.
• The binary formula gives every word that appears in a document equal relevance.
• It can be useful when frequency is not important.
• Binary Weights Formula:
  wij = 1 if freqij > 0
  wij = 0 if freqij = 0

  docs   t1  t2  t3
  D1     1   0   1
  D2     1   0   0
  D3     0   1   1
  D4     1   0   0
  D5     1   1   1
  D6     1   1   0
  D7     0   1   0
  D8     0   1   0
  D9     0   0   1
  D10    0   1   1
  D11    1   0   1
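As a rough illustration (not from the slides; the toy documents and vocabulary below are invented), a binary term-document matrix can be built like this:

```python
# Minimal sketch: binary term weighting for a toy collection.
# wij = 1 if freqij > 0, else 0.

docs = {
    "D1": "information retrieval systems",
    "D2": "information systems",
    "D3": "retrieval models",
}

# Vocabulary: every distinct term in the collection.
vocab = sorted({term for text in docs.values() for term in text.split()})

# Binary weight: 1 if the term occurs in the document, 0 otherwise.
binary_matrix = {
    doc_id: [1 if term in text.split() else 0 for term in vocab]
    for doc_id, text in docs.items()
}

print(vocab)
for doc_id, row in binary_matrix.items():
    print(doc_id, row)
```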
Why use term weighting?
Binary weights are too limiting: terms are either present or absent.
They do NOT allow us to rank documents by their level of relevance for a given query.
DF = document frequency
Inverse Document Frequency (IDF)
IDF measures how rare a term is in the collection.
The IDF is a measure of the general importance of the term; it inverts the document frequency.
It diminishes the weight of terms that occur very frequently in
the collection and increases the weight of terms that occur
rarely.
Gives full weight to terms that occur in one document
only.
Gives lowest weight to terms that occur in all
documents.
Terms that appear in many different documents are less indicative of
overall topic.
idfi = inverse document frequency of term i = log2(N / dfi),
where N is the total number of documents and dfi is the number of documents containing term i.
Inverse Document Frequency
• E.g.: given a collection of 1,000 documents and the document frequencies below, compute the IDF of each word.
Word N DF IDF
the 1000 1000 0
some 1000 100 3.322
car 1000 10 6.644
merge 1000 1 9.966
• IDF provides high values for rare words and low values
for common words.
• IDF is an indication of a term’s discrimination power.
• The log is used to dampen the effect relative to tf.
• What is the difference between document frequency and corpus (collection) frequency?
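A small sketch (not part of the slides) that reproduces the IDF values in the table above, using idf = log2(N / df):

```python
import math

N = 1000  # total number of documents in the collection

# Document frequencies taken from the table above.
df = {"the": 1000, "some": 100, "car": 10, "merge": 1}

for word, dfi in df.items():
    idf = math.log2(N / dfi)
    print(f"{word:6s} df={dfi:5d} idf={idf:.3f}")
# Expected output: the 0.000, some 3.322, car 6.644, merge 9.966
```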
TF*IDF Weighting
The most widely used term-weighting scheme is tf*idf:
wij = tfij * idfi = tfij * log2(N / dfi)
A term occurring frequently in the document but rarely
in the rest of the collection is given high weight.
The tf-idf value for a term will always be greater than
or equal to zero.
Experimentally, tf*idf has been found to work well.
It is often used in the vector space model together
with cosine similarity to determine the similarity
between two documents.
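As a hedged sketch of the scheme above (the function name and the max-frequency tf normalization are my choices; the slide only fixes idf = log2(N/dfi)):

```python
import math

def tf_idf(freq_ij: float, max_freq_j: float, df_i: int, N: int) -> float:
    """tf*idf weight of term i in document j.

    tf is normalized by the frequency of the most frequent term in the
    document (the normalization used in the worked example further below);
    idf = log2(N / df_i).
    """
    tf = freq_ij / max_freq_j
    idf = math.log2(N / df_i)
    return tf * idf
```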
TF*IDF weighting
When does TF*IDF register a high weight?
when a term t occurs many times within a small number of
documents
A term gets its highest tf*idf when it has a high term frequency (in the given document)
and a low document frequency (in the whole collection of documents);
the weights hence tend to filter out common terms, giving such terms high discriminating power.
A lower TF*IDF is registered when the term occurs fewer times in a document, or occurs in many
documents, and thus offers a less pronounced relevance signal.
Lowest TF*IDF is registered when the term occurs in virtually all
documents
Computing TF-IDF: An Example
Assume the collection contains 10,000 documents and statistical analysis shows that the
document frequencies (DF) of three terms are A(50), B(1300), C(250), and that the term
frequencies (TF) of these terms in a given document are A(3), B(2), C(1). Compute TF*IDF
for each term.
A: tf = 3/3=1.00; idf = log2(10000/50) = 7.644; tf*idf = 7.644
B: tf = 2/3=0.67; idf = log2(10000/1300) = 2.943; tf*idf = 1.962
C: tf = 1/3=0.33; idf = log2(10000/250) = 5.322; tf*idf = 1.774
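The same numbers can be reproduced in a few lines (the max-frequency normalization tf = freq/3 matches the 3/3, 2/3, 1/3 values above):

```python
import math

N = 10_000
max_freq = 3  # frequency of the most frequent of the three terms in the document
for term, freq, df in [("A", 3, 50), ("B", 2, 1300), ("C", 1, 250)]:
    w = (freq / max_freq) * math.log2(N / df)
    print(f"{term}: tf*idf = {w:.3f}")
# Expected: A ≈ 7.644, B ≈ 1.962, C ≈ 1.774
```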
More Example
Consider a document containing 100 words in which the word cow appears 3 times. Now assume
we have 10 million documents and cow appears in one thousand of these. Then tf = 3/100 = 0.03
(term count over document length), idf = log2(10,000,000 / 1,000) ≈ 13.29, and tf*idf ≈ 0.40.
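The same arithmetic in code (assuming the count-over-length tf used in this example):

```python
import math

tf = 3 / 100                           # cow occurs 3 times in a 100-word document
idf = math.log2(10_000_000 / 1_000)    # cow occurs in 1,000 of 10 million documents
print(round(tf * idf, 2))              # ≈ 0.40
```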
Exercise
• Let C = number of times a given word appears in a document;
• TW = total number of words in a document;
• TD = total number of documents in a corpus; and
• DF = total number of documents containing a given word.
• Compute the TF, IDF and TF*IDF score for each term in the table below (a sketch follows the table).

  Word       C   TW   TD   DF   TF   IDF   TFIDF
  airplane   5   46   3    1
  blue       1   46   3    1
  chair      7   46   3    3
  computer   3   46   3    1
  forest     2   46   3    1
  justice    7   46   3    3
  love       2   46   3    1
  might      2   46   3    1
  perl       5   46   3    2
  rose       6   46   3    3
  shoe       4   46   3    1
  thesis     2   46   3    2
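One possible way to fill in the table programmatically (a sketch; it assumes tf = C / TW and idf = log2(TD / DF), which is one reasonable reading of the exercise):

```python
import math

TW, TD = 46, 3  # total words in the document, total documents in the corpus

# word: (C, DF) taken from the exercise table.
counts_and_df = {
    "airplane": (5, 1), "blue": (1, 1), "chair": (7, 3), "computer": (3, 1),
    "forest": (2, 1), "justice": (7, 3), "love": (2, 1), "might": (2, 1),
    "perl": (5, 2), "rose": (6, 3), "shoe": (4, 1), "thesis": (2, 2),
}

for word, (c, df) in counts_and_df.items():
    tf = c / TW
    idf = math.log2(TD / df)
    print(f"{word:9s} TF={tf:.3f} IDF={idf:.3f} TFIDF={tf * idf:.3f}")
```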
Concluding remarks
Suppose from a set of English documents, we wish to determine which ones
are the most relevant to the query "the brown cow."
A simple way to start out is by eliminating documents that do not contain all
three words "the," "brown," and "cow," but this still leaves many documents.
To further distinguish them, we might count the number of times each term
occurs in each document and sum them all together;
the number of times a term occurs in a document is called its TF. However, because
the term "the" is so common, this will tend to incorrectly emphasize documents
which happen to use the word "the" more, without giving enough weight to the
more meaningful terms "brown" and "cow".
Also, the term "the" is not a good keyword to distinguish relevant from non-relevant
documents, while terms like "brown" and "cow" that occur rarely are good keywords
to distinguish relevant documents from the non-relevant ones.
Concluding remarks
Hence IDF is incorporated which diminishes the weight of
terms that occur very frequently in the collection and increases
the weight of terms that occur rarely.
This leads to the use of TF*IDF as a better weighting technique.
[Figure: documents d1, d3, d4 and d5 represented as vectors in the space of index terms t1 and t2, with angles θ and φ between them]
Euclidean Distance
dist(dj, q) = sqrt( Σi (wij − wiq)² )
where wij is the weight of term i in document j and wiq is the weight of term i in the query.
• Example: Determine the Euclidean distance between the document vector (0, 3, 2, 1, 10) and
the query vector (2, 7, 1, 0, 0). A weight of 0 means the corresponding term is not found in
the document or query.
dist = sqrt( (0 − 2)² + (3 − 7)² + (2 − 1)² + (1 − 0)² + (10 − 0)² ) = sqrt(122) ≈ 11.05
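A quick check of the example in code (a sketch using plain Python, not tied to any IR library):

```python
import math

doc = [0, 3, 2, 1, 10]
query = [2, 7, 1, 0, 0]

dist = math.sqrt(sum((d - q) ** 2 for d, q in zip(doc, query)))
print(round(dist, 2))  # 11.05
```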
Inner Product
Similarity between the vectors for document dj and query q can be computed as the vector
inner product:
sim(dj, q) = dj • q = Σi=1..n wij * wiq
where wij is the weight of term i in document j and wiq is the weight of
term i in the query q
For binary vectors, the inner product is the number of matched
query terms in the document (size of intersection).
For weighted term vectors, it is the sum of the products of the
weights of the matched terms.
Properties of Inner Product
Inner Product -- Examples
Binary weights; size of vector = size of vocabulary = 7.

      Retrieval  Database  Term  Computer  Text  Manage  Data
  D1  1          1         1     0         1     1       0
  D2  0          1         0     0         0     1       0
  Q   1          0         1     0         0     1       1

sim(D1, Q) = 1*1 + 1*0 + 1*1 + 0*0 + 1*0 + 1*1 + 0*1 = 3
sim(D2, Q) = 0*1 + 1*0 + 0*1 + 0*0 + 0*0 + 1*1 + 0*1 = 1
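A small sketch of the computation above (vectors copied from the table; for binary vectors the inner product simply counts the matched query terms):

```python
D1 = [1, 1, 1, 0, 1, 1, 0]  # Retrieval, Database, Term, Computer, Text, Manage, Data
D2 = [0, 1, 0, 0, 0, 1, 0]
Q  = [1, 0, 1, 0, 0, 1, 1]

def inner_product(d, q):
    # Sum of element-wise products of the term weights.
    return sum(wd * wq for wd, wq in zip(d, q))

print(inner_product(D1, Q))  # 3
print(inner_product(D2, Q))  # 1
```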
A second example, with binary document weights and query q = (1, 1, 1):

      k1  k2  k3   q•dj
  d1  1   0   1    2
  d2  1   0   0    1
  d3  0   1   1    2
  d4  1   0   0    1
  d5  1   1   1    3
  d6  1   1   0    2
  d7  0   1   0    1
  q   1   1   1
Inner Product: Exercise
[Figure: documents d1–d7 plotted as vectors in the space of index terms k1, k2, k3]
Compute q•dj for each document, given query q = (1, 2, 3):

      k1  k2  k3   q•dj
  d1  1   0   1    ?
  d2  1   0   0    ?
  d3  0   1   1    ?
  d4  1   0   0    ?
  d5  1   1   1    ?
  d6  1   1   0    ?
  d7  0   1   0    ?
  q   1   2   3
Cosine similarity
Measures the similarity between dj and q by the cosine of the angle θ between them.
sim(dj, q) = (dj • q) / (|dj| * |q|)
           = Σi=1..n wi,j * wi,q / ( sqrt(Σi=1..n wi,j²) * sqrt(Σi=1..n wi,q²) )
Or, between two documents:
sim(dj, dk) = (dj • dk) / (|dj| * |dk|)
            = Σi=1..n wi,j * wi,k / ( sqrt(Σi=1..n wi,j²) * sqrt(Σi=1..n wi,k²) )
[Figure: query Q and document vectors D1, D2; cos θ1 = 0.74, cos θ2 = 0.98]
  Terms      D1  D2  D3
  affection  2   3   1
  jealous    0   2   4
  gossip     1   0   2
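A sketch applying the cosine formula to the document vectors in the table (the pairwise document similarities below are computed only to illustrate the formula; they are not given on the slide):

```python
import math

# Term weights from the table: (affection, jealous, gossip)
D1 = [2, 0, 1]
D2 = [3, 2, 0]
D3 = [1, 4, 2]

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

print(round(cosine(D1, D2), 3))  # ≈ 0.744
print(round(cosine(D1, D3), 3))  # ≈ 0.390
print(round(cosine(D2, D3), 3))  # ≈ 0.666
```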
Cosine Similarity vs. Inner Product
Cosine similarity measures the cosine of the angle between two vectors;
it is the inner product normalized by the vector lengths.
CosSim(dj, q) = (dj • q) / (|dj| * |q|) = Σi=1..t (wij * wiq) / ( sqrt(Σi=1..t wij²) * sqrt(Σi=1..t wiq²) )
InnerProduct(dj, q) = dj • q
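A brief sketch contrasting the two measures (the vectors are invented for the illustration): doubling a document vector doubles its inner product with the query but leaves its cosine similarity unchanged.

```python
import math

def inner_product(d, q):
    return sum(a * b for a, b in zip(d, q))

def cos_sim(d, q):
    return inner_product(d, q) / (
        math.sqrt(sum(a * a for a in d)) * math.sqrt(sum(b * b for b in q))
    )

q  = [1, 1, 0]
d  = [2, 3, 1]
d2 = [4, 6, 2]  # same direction as d, but twice as long

print(inner_product(d, q), inner_product(d2, q))          # 5 10  -> grows with document length
print(round(cos_sim(d, q), 3), round(cos_sim(d2, q), 3))  # 0.945 0.945 -> unchanged
```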