Chapter 3: Term Weighting
The terms of a document are not equally useful for describing the document contents.
• Some properties of an index term are useful for evaluating the importance of that term in a document.
– For instance, a word that appears in every document of a collection is completely useless for retrieval: it cannot distinguish one document from another.
Term Weighting
1. Binary Weights
2. Term Frequency (TF) Weights
3. Inverse Document Frequency (IDF)
4. TF*IDF Weighting
1. Binary Weights
• Only the presence (1) or absence (0) of a term is recorded in the vector.
• The binary formula gives every word that appears in a document equal relevance.
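As a minimal sketch of binary weighting (the two documents and the vocabulary below are made-up illustrations, not from the notes):

```python
# Binary weighting: 1 if the term appears in the document, 0 otherwise.
docs = {
    "d1": "information retrieval systems rank documents",
    "d2": "term weighting improves retrieval quality",
}
# Vocabulary = all distinct terms in the collection, in sorted order.
vocab = sorted({w for text in docs.values() for w in text.split()})

binary_vectors = {
    name: [1 if term in text.split() else 0 for term in vocab]
    for name, text in docs.items()
}

for name, vec in binary_vectors.items():
    print(name, vec)
```

Note how every present term gets the same weight 1, regardless of how often it occurs.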
2. Term Frequency (TF) Weights
• TF (term frequency): count the number of times a term occurs in a document.
fij = frequency of term i in document j
• The more times a term t occurs in document d, the more likely it is that t is relevant to the document, i.e. more indicative of its topic.
– If used alone, TF favors common words and long documents.
– It gives too much credit to words that appear frequently.
– It does not allow documents to be ordered by their level of relevance for a given query.
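Raw term frequency fij is just a count per document; a small sketch (the document text is an invented example):

```python
from collections import Counter

# f_ij: number of times term i occurs in document j.
doc = "to be or not to be"
tf = Counter(doc.split())

print(tf)  # 'to' and 'be' occur twice, 'or' and 'not' once
```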
• Non-binary weights allow us to model partial matching.
– Partial matching allows retrieval of documents that approximate the query.
• Term weighting supports best-match retrieval, which improves the quality of the answer set.
– Term weighting enables ranking of retrieved documents, so that the best-matching documents are ordered at the top, as they are more relevant than the others.
TF Normalization
• Raw counts favor long documents, so the term frequency is usually normalized. A common choice is to divide each count by the frequency of the most frequent term in the document:
tfij = fij / maxk fkj
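A sketch of one common normalization, dividing each raw count by the largest count in the document (an assumption here, since the original formula is not shown in the notes; dividing by document length is another common variant):

```python
from collections import Counter

# Normalized TF: divide each raw count by the largest count in the
# document, so the most frequent term gets weight 1.0 regardless of
# document length.
doc = "to be or not to be"
counts = Counter(doc.split())
max_f = max(counts.values())

norm_tf = {term: f / max_f for term, f in counts.items()}
print(norm_tf)
```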
Problems with Term Frequency
• We need a mechanism for attenuating (reducing) the effect of terms that occur so often in the collection that they are not meaningful for determining relevance.
• Collection frequency and document frequency behave differently:
– The less frequently a term appears in the whole collection, the more discriminating it is.
3. Inverse Document Frequency (IDF)
• idfi = inverse document frequency of term i:
idfi = log(N / dfi)
where N is the total number of documents in the collection and dfi is the number of documents that contain term i.
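A small sketch of idfi = log(N / dfi); base-10 logarithm is assumed here (the base only rescales the weights):

```python
import math

# idf_i = log(N / df_i): N = total documents, df_i = documents
# containing term i. Base 10 is an assumption.
def idf(N, df):
    return math.log10(N / df)

print(idf(10_000, 10))      # rare term -> high weight (3.0)
print(idf(10_000, 10_000))  # term in every document -> weight 0, useless
```

A term that occurs in every document gets idf = log(1) = 0, formalizing the earlier observation that such a term is useless for retrieval.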
4. TF*IDF Weighting
• A good weight must take into account two effects:
– Quantification of intra-document content (similarity)
• the tf factor: the term frequency within a document
– Quantification of inter-document separation (dissimilarity)
• the idf factor: the inverse document frequency
• As a result, the most widely used term-weighting technique in IR systems is tf*idf weighting:
wij = tfij × log(N / dfi)
A lower TF*IDF is registered when the term occurs fewer times in a document, or occurs in many documents (virtually all documents):
– thus offering a less pronounced relevance signal.
I. Calculate the term frequency (TF): count the number of times the word occurs in the document.
II. Calculate the inverse document frequency (IDF): take the logarithm of the total number of documents divided by the number of documents containing the word.
• TF-IDF is used to determine how important a word is within a single document of a collection.
Computing TF-IDF: An Example
• Assume the collection contains 10,000 documents, and statistical analysis shows that the document frequencies (DF) of three terms are:
DFA = 50, DFB = 1,300, DFC = 250
• The term frequencies (TF) of these terms in a given document are:
TFA = 3, TFB = 2, TFC = 1
• Compute TF*IDF for each term.
Another Example
• Consider a document containing 100 words in which the word computer appears 3 times.
• Now assume we have 10,000,000 documents and computer appears in 1,000 of these.
• Calculate TF-IDF.
– The term frequency (TF) for computer:
3/100 = 0.03
– The inverse document frequency (IDF, base-10 log):
log(10,000,000 / 1,000) = log(10,000) = 4
– TF-IDF = 0.03 × 4 = 0.12
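The same computation, step by step (base-10 log, which yields the IDF of 4 in this example):

```python
import math

# TF-IDF for "computer": 3 occurrences in a 100-word document,
# appearing in 1,000 out of 10,000,000 documents.
tf = 3 / 100                           # 0.03
idf = math.log10(10_000_000 / 1_000)   # log10(10,000) = 4.0
tfidf = tf * idf

print(tfidf)  # 0.12
```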
Similarity Measure
• A similarity measure is a function that computes the degree of similarity (or distance) between a document vector and a query vector.
• Using a similarity measure between the query and each document:
– it is possible to rank the retrieved documents in order of presumed relevance;
– it is possible to enforce a threshold so that the size of the retrieved set can be controlled.
Similarity/Dissimilarity Measures
1. Euclidean distance
– The most common distance measure. Euclidean distance takes the square root of the summed squared differences between the coordinates of the document vector and the query vector.
2. Inner product (dot product)
– The dot product is the sum of the products of the corresponding term weights of the query and document vectors.
3. Cosine similarity (or normalized inner product)
– It projects the document and query vectors into the term space and calculates the cosine of the angle between them.
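The three measures can be sketched over simple term-weight vectors (the example query and document vectors are made up):

```python
import math

def euclidean(q, d):
    # Square root of summed squared coordinate differences.
    return math.sqrt(sum((qi - di) ** 2 for qi, di in zip(q, d)))

def inner_product(q, d):
    # Sum of products of corresponding term weights.
    return sum(qi * di for qi, di in zip(q, d))

def cosine(q, d):
    # Inner product normalized by the vector magnitudes.
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    return inner_product(q, d) / (norm_q * norm_d)

q = [1, 1, 0]
d = [2, 2, 0]   # same direction as q, but "longer"

print(euclidean(q, d))      # sqrt(2): nonzero distance
print(inner_product(q, d))  # 4: grows with document length
print(cosine(q, d))         # ~1.0: identical direction, maximum similarity
```

The contrast between the inner product and the cosine on the same pair of vectors previews the length problem discussed next: the cosine judges q and d identical in direction, while the raw inner product rewards the longer vector.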
Inner Product
• What is more relevant to a query?
– A 50-word document which contains 3 of the query terms?
• The inner product doesn't account for the fact that documents have widely varying lengths.
• It measures how many terms matched, but not how many terms are not matched.