Chapter Three
Term Weighting and Similarity Measures
Target Group – IT 3rd year students
Injibara, Ethiopia
Terms
• Terms are usually stems. Terms can also be phrases, such as
“Computer Science”, “World Wide Web”, etc.
• Documents and queries are represented as vectors or “bags of words”
(BOW).
– Each vector holds a place for every term in the collection.
– Position 1 corresponds to term 1, position 2 to term 2, position n
to term n.
D_i = (w_di,1, w_di,2, ..., w_di,n)
Q = (w_q,1, w_q,2, ..., w_q,n)
– w = 0 if a term is absent from the document or query.
– w_di,1 is the weight of term 1 in document D_i.
• Documents are represented by binary-weighted or non-binary (real-valued) weighted vectors of terms.
Document Collection
A collection of n documents can be represented in the vector space
model by a term-document matrix.
An entry in the matrix corresponds to the “weight” of a term in the
document; zero means the term has no significance in the
document or it simply doesn’t exist in the document.
A term-document matrix:

      T1    T2    ...   Tt
D1    w11   w21   ...   wt1
D2    w12   w22   ...   wt2
:     :     :           :
Dn    w1n   w2n   ...   wtn
Binary Weights
Only the presence (1) or absence (0) of a term is recorded in the vector.
The binary formula gives every word that appears in a document equal relevance.
It can be useful when frequency is not important.

Binary Weights Formula:
  w_ij = 1 if freq_ij > 0
  w_ij = 0 if freq_ij = 0

Example: binary term-document matrix

docs  t1  t2  t3
D1    1   0   1
D2    1   0   0
D3    0   1   1
D4    1   0   0
D5    1   1   1
D6    1   1   0
D7    0   1   0
D8    0   1   0
D9    0   0   1
D10   0   1   1
D11   1   0   1
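As a concrete illustration, here is a minimal Python sketch of building a binary term-document matrix; the three toy documents are invented for illustration and are not part of the slides.

```python
# A toy collection; the document names and texts are assumptions for illustration.
docs = {
    "D1": "information retrieval system",
    "D2": "information system",
    "D3": "retrieval of web documents",
}

# Vocabulary = set of all terms that occur anywhere in the collection.
vocab = sorted({term for text in docs.values() for term in text.split()})

# Binary weight: 1 if the term occurs in the document, 0 otherwise.
binary_matrix = {
    name: [1 if term in text.split() else 0 for term in vocab]
    for name, text in docs.items()
}

print(vocab)
for name, vector in binary_matrix.items():
    print(name, vector)
```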
Why use term weighting?
Term weighting improves the quality of the answer set.
– Term weighting enables ranking of retrieved documents, so that the best-matching documents are ordered at the top, as they are more relevant than the others.
Binary weights are too limiting.
– Terms are either present or absent.
– They do not allow ordering documents according to their level of relevance for a given query.
Term Weighting: Term Frequency (TF)
TF (term frequency): count the number of times a term occurs in a document.
  f_ij = frequency of term i in document j
The more times a term t occurs in document d, the more likely it is that t is relevant to the document, i.e. more indicative of the topic.
– If used alone, it favors common words and long documents.
– It gives too much credit to words that appear more frequently.
We may want to normalize term frequency (tf) by the maximum frequency:
  tf_ij = f_ij / max{f_ij}

Example: term-document matrix of raw frequencies

docs  t1  t2  t3
D1    2   0   3
D2    1   0   0
D3    0   4   7
D4    3   0   0
D5    1   6   3
D6    3   5   0
D7    0   8   0
D8    0   10  0
D9    0   0   1
D10   0   3   5
D11   4   0   1
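A minimal sketch of maximum-frequency normalization in Python, assuming the maximum is taken within each document (the slide's formula does not spell this out); the rows are taken from the frequency matrix above.

```python
# Raw term frequencies for a few rows of the matrix above.
raw_tf = {
    "D1": [2, 0, 3],
    "D3": [0, 4, 7],
    "D5": [1, 6, 3],
}

# tf_ij = f_ij / max frequency in the same document (assumption: per-document max).
normalized_tf = {
    doc: [f / max(freqs) for f in freqs]
    for doc, freqs in raw_tf.items()
}

for doc, vec in normalized_tf.items():
    print(doc, [round(v, 2) for v in vec])
# D1 [0.67, 0.0, 1.0]   D3 [0.0, 0.57, 1.0]   D5 [0.17, 1.0, 0.5]
```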
Document Normalization
Long documents have an unfair advantage:
– They use a lot of terms, so they get more matches than short documents.
– And they use the same words repeatedly, so they have much higher term frequencies.
Normalization seeks to remove these effects:
– It is related somehow to the maximum term frequency.
– But it is also sensitive to the number of terms.
If we don't normalize, short documents may not be recognized as relevant.
Term Normalization
Term normalization is the process of converting multiple terms into a single term for indexing and retrieval. By normalizing various terms into one, you increase the consistency of the search results. There are several common classes of term normalizations (a small sketch follows the list):
1. Spelling variants, such as customise => customize
2. Compound terms, such as data base => database
3. Spelling corrections, such as febuary => february
4. Aliases or name variations, such as MS Word => Microsoft Word
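A minimal sketch of dictionary-based term normalization; the mapping entries and the helper function normalize_term are illustrative assumptions built from the four classes above.

```python
# Illustrative normalization map covering the four classes listed above.
NORMALIZATION_MAP = {
    "customise": "customize",      # spelling variant
    "data base": "database",       # compound term
    "febuary": "february",         # spelling correction
    "ms word": "microsoft word",   # alias / name variation
}

def normalize_term(term: str) -> str:
    """Map a term to its normalized form; unknown terms pass through unchanged."""
    return NORMALIZATION_MAP.get(term.lower(), term.lower())

print(normalize_term("Customise"))   # -> customize
print(normalize_term("MS Word"))     # -> microsoft word
print(normalize_term("retrieval"))   # -> retrieval (unchanged)
```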
Problems with term frequency
We need a mechanism for reducing the effect of terms that occur too often in the collection to be meaningful for determining relevance.
Scale down the weight of terms with high collection frequency:
– Reduce the tf weight of a term by a factor that grows with the collection frequency.
More common for this purpose is document frequency (DF):
– how many documents in the collection contain the term.
Inverse Document Frequency
• idf_i = log2(N / df_i), where N is the total number of documents in the collection and df_i is the number of documents containing term i.
• E.g.: given a collection of 1,000 documents and the document frequencies below, compute the IDF for each word:

Word    N     DF    IDF
the     1000  1000  0
some    1000  100   3.322
car     1000  10    6.644
merge   1000  1     9.966

• IDF provides high values for rare words and low values for common words.
• IDF is an indication of a term's discrimination power.
• The log is used to dampen the effect relative to tf.
• Note the difference between document frequency and collection (corpus) frequency.
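A short Python check that reproduces the IDF table above, assuming idf = log2(N / df) (base 2 matches the values 3.322, 6.644 and 9.966).

```python
import math

N = 1000  # total number of documents in the collection
document_frequency = {"the": 1000, "some": 100, "car": 10, "merge": 1}

for word, df in document_frequency.items():
    idf = math.log2(N / df)  # assumed formula: idf = log2(N / df)
    print(f"{word:6s} df={df:5d} idf={idf:.3f}")
# the 0.000, some 3.322, car 6.644, merge 9.966
```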
TF*IDF Weighting
• The most widely used term-weighting scheme is tf*idf:
  w_ij = tf_ij × idf_i = tf_ij × log2(N / df_i)
• A term occurring frequently in the document but rarely in the rest of the collection is given a high weight.
– The tf-idf value for a term will always be greater than or equal to zero.
TF*IDF weighting
When does TF*IDF register a high weight?
When a term t occurs many times within a small number of documents.
– The highest tf*idf for a term arises when it has a high term frequency (in the given document) and a low document frequency (in the whole collection of documents);
– the weights hence tend to filter out common terms,
– thus lending high discriminating power to those terms.
• The lowest TF*IDF is registered when the term occurs in virtually all documents.
Computing TF-IDF: An Example
Assume the collection contains 10,000 documents and statistical analysis shows that the document frequencies (DF) of three terms are: A(50), B(1300), C(250). The term frequencies (TF) of these terms are: A(3), B(2), C(1). Compute TF*IDF for each term.
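A sketch of the computation, assuming the same weighting w = tf × log2(N / df) used earlier in the chapter.

```python
import math

N = 10_000  # documents in the collection
terms = {   # term: (tf in the document, df in the collection)
    "A": (3, 50),
    "B": (2, 1300),
    "C": (1, 250),
}

for term, (tf, df) in terms.items():
    idf = math.log2(N / df)
    print(f"{term}: tf={tf}, idf={idf:.3f}, tf*idf={tf * idf:.3f}")
# A: idf=7.644, tf*idf=22.932   B: idf=2.943, tf*idf=5.887   C: idf=5.322, tf*idf=5.322
```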
Another exercise: given the word statistics below for one document in a corpus, compute the TF, IDF and TF*IDF score for each term, where:
– TW = total number of words in the document,
– TD = total number of documents in the corpus, and
– DF = total number of documents containing the given word.

Word       Freq   TW   TD   DF
blue       1      46   3    1
chair      7      46   3    3
computer   3      46   3    1
forest     2      46   3    1
justice    7      46   3    3
love       2      46   3    1
might      2      46   3    1
perl       5      46   3    2
rose       6      46   3    3
shoe       4      46   3    1
thesis     2      46   3    2

(A sketch that computes these scores follows the table.)
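A possible way to compute the requested scores, assuming tf = freq / TW and idf = log2(TD / DF); the slides may intend a different normalization, so treat this as a sketch.

```python
import math

TW, TD = 46, 3              # words in the document, documents in the corpus
words = {                   # word: (frequency in the document, document frequency)
    "blue": (1, 1), "chair": (7, 3), "computer": (3, 1), "forest": (2, 1),
    "justice": (7, 3), "love": (2, 1), "might": (2, 1), "perl": (5, 2),
    "rose": (6, 3), "shoe": (4, 1), "thesis": (2, 2),
}

for word, (freq, df) in words.items():
    tf = freq / TW          # assumption: tf normalized by document length
    idf = math.log2(TD / df)
    print(f"{word:9s} tf={tf:.3f} idf={idf:.3f} tf*idf={tf * idf:.3f}")
```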
Similarity Measure
We now have vectors for all documents in the collection and a vector for the query; how do we compute their similarity?
A similarity measure is a function that computes the degree of similarity (or distance) between a document vector and a query vector.
Using a similarity measure between the query and each document:
1. It is possible to rank the retrieved documents in the order of presumed relevance.
2. It is possible to enforce a certain threshold so that the size of the retrieved set can be controlled.
(Figure: document vectors D1 and D2 and query vector Q plotted in a space spanned by terms t1, t2 and t3; the angle between Q and each document indicates similarity.)
Intuition
(Figure: documents d1–d5 plotted as vectors in a space spanned by terms t1, t2 and t3; the angles θ and φ between vectors illustrate how closeness of direction captures similarity.)
Similarity Measure: Techniques
• Euclidean distance
• Dot product
• Cosine similarity
The dot product is also known as the scalar product or inner product.
Cosine similarity projects the document and query vectors into a term space and calculates the cosine of the angle between them.
Euclidean distance
The Euclidean distance between the vectors for document d_j and query q can be computed as:

  sim(d_j, q) = |d_j − q| = sqrt( Σ_{i=1..n} (w_ij − w_iq)^2 )

where w_ij is the weight of term i in document j and w_iq is the weight of term i in the query.

Example: Determine the Euclidean distance between the document vector (0, 3, 2, 1, 10) and the query vector (2, 7, 1, 0, 0). A weight of 0 means the corresponding term is not found in the document or query.

  sqrt( (0−2)^2 + (3−7)^2 + (2−1)^2 + (1−0)^2 + (10−0)^2 ) = sqrt(122) ≈ 11.05
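The same calculation in a short Python sketch:

```python
import math

doc   = [0, 3, 2, 1, 10]
query = [2, 7, 1, 0, 0]

# Square root of the sum of squared differences between corresponding weights.
distance = math.sqrt(sum((wd - wq) ** 2 for wd, wq in zip(doc, query)))
print(round(distance, 2))   # 11.05 -> smaller distance means more similar
```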
Inner Product
Similarity between the vectors for document d_j and query q can be computed as the vector inner product:

  sim(d_j, q) = d_j • q = Σ_{i=1..n} w_ij · w_iq

• It measures how many terms matched, but not how many terms are not matched.
Inner Product -- Examples
• Binary weights:
– Size of vector = size of vocabulary = 7

     Retrieval  Database  Term  Computer  Text  Manage  Data
D    1          1         1     0         1     1       0
Q    1          0         1     0         0     1       1

  sim(D, Q) = 3

• Term weighted:

     Retrieval  Database  Architecture
D1   2          3         5
D2   3          7         1
Q    1          0         2
q    1          1         1
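A small sketch that computes the inner-product scores for the term-weighted example (both query rows are included):

```python
def inner_product(doc, query):
    """Sum of pairwise products of the term weights."""
    return sum(wd * wq for wd, wq in zip(doc, query))

D1, D2 = [2, 3, 5], [3, 7, 1]
Q, q = [1, 0, 2], [1, 1, 1]

print(inner_product(D1, Q), inner_product(D2, Q))  # 12 5  -> D1 ranks higher for Q
print(inner_product(D1, q), inner_product(D2, q))  # 10 11 -> D2 ranks higher for q
```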
Inner Product: Exercise
Compute the inner-product score of each document d_j against the query q (a checking sketch follows the table).

      k1  k2  k3  sim(d_j, q)
d1    1   0   1   ?
d2    1   0   0   ?
d3    0   1   1   ?
d4    1   0   0   ?
d5    1   1   1   ?
d6    1   1   0   ?
d7    0   1   0   ?
q     1   2   3

(Figure: the documents d1–d7 positioned among the index terms k1, k2 and k3.)
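A sketch that fills in the "?" column, useful for checking your own answers:

```python
docs = {
    "d1": [1, 0, 1], "d2": [1, 0, 0], "d3": [0, 1, 1], "d4": [1, 0, 0],
    "d5": [1, 1, 1], "d6": [1, 1, 0], "d7": [0, 1, 0],
}
q = [1, 2, 3]

for name, vec in docs.items():
    score = sum(w * wq for w, wq in zip(vec, q))  # inner product with the query
    print(name, score)
# d1=4, d2=1, d3=5, d4=1, d5=6, d6=3, d7=2 -> d5 scores highest
```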
Cosine similarity
• Measures the similarity between two vectors (e.g. document d_j and query q) by the cosine of the angle between them:

  sim(d_j, q) = (d_j • q) / (|d_j| · |q|) = Σ_{i=1..n} w_ij · w_iq / ( sqrt(Σ_{i=1..n} w_ij^2) · sqrt(Σ_{i=1..n} w_iq^2) )

• Or, between two documents d_j and d_k:

  sim(d_j, d_k) = (d_j • d_k) / (|d_j| · |d_k|) = Σ_{i=1..n} w_ij · w_ik / ( sqrt(Σ_{i=1..n} w_ij^2) · sqrt(Σ_{i=1..n} w_ik^2) )
Example: Computing Cosine Similarity
Say we have two documents in our corpus, D1 = (0.8, 0.3) and D2 = (0.2, 0.7). Given the query vector Q = (0.4, 0.8), determine which document is the most relevant one for the query.

  cos θ1 = cosine(Q, D1) ≈ 0.74
  cos θ2 = cosine(Q, D2) ≈ 0.98

D2 has the larger cosine (smaller angle to Q), so it is ranked as the more relevant document. (A verification sketch follows the table below.)

A second example, with length-normalized (unit) document vectors over three terms:

Terms      D1     D2     D3
affection  0.996  0.993  0.847
jealous    0.087  0.120  0.466
gossip     0.017  0.000  0.254
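A short sketch verifying the D1/D2 example; note the first value comes out as about 0.73, and the slide's 0.74 presumably reflects rounding of intermediate values.

```python
import math

def cosine_sim(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

D1, D2, Q = [0.8, 0.3], [0.2, 0.7], [0.4, 0.8]
print(round(cosine_sim(Q, D1), 2))  # 0.73 (the slide's 0.74 likely rounds intermediate values)
print(round(cosine_sim(Q, D2), 2))  # 0.98 -> D2 is the more relevant document
```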
Cosine Similarity vs. Inner Product
Cosine similarity measures the cosine of the angle between two vectors; it is the inner product normalized by the vector lengths:

  CosSim(d_j, q) = (d_j • q) / (|d_j| · |q|) = Σ_{i=1..t} w_ij · w_iq / sqrt( Σ_{i=1..t} w_ij^2 · Σ_{i=1..t} w_iq^2 )

  InnerProduct(d_j, q) = d_j • q

Unlike the raw inner product, the cosine measure is not biased toward long vectors with large weights (a small comparison sketch follows).
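A small comparison sketch; the vectors below are invented so that the two measures rank the documents differently.

```python
import math

def inner(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return inner(a, b) / (math.sqrt(inner(a, a)) * math.sqrt(inner(b, b)))

q       = [1, 1, 0]
d_long  = [10, 0, 10]   # long vector, overlaps the query on one term only
d_short = [1, 1, 0]     # short vector, points in the same direction as the query

print(inner(d_long, q), inner(d_short, q))                         # 10 2   -> inner product prefers d_long
print(round(cosine(d_long, q), 2), round(cosine(d_short, q), 2))   # 0.5 1.0 -> cosine prefers d_short
```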