
Information Storage and Retrieval (ISR)

Chapter 3: Term Weighting and Similarity Measures


Introduction (Basic Concepts)
• Each document is represented by a set of representative keywords or index terms, as we learned in Chapter 2.
• Documents and queries are represented as vectors or "bags of words" (BOW) – unordered words with their frequencies.
• Bag – a set that allows multiple occurrences of the same element.
• An index term is a word or group of consecutive words in a document.
• Index terms are usually stems.
• Terms can also be phrases, such as "Computer Science" or "World Wide Web".

Term Weighting
The terms of a document are not equally useful for describing the document contents.

•That is why we applied the text operations covered in Chapter 2.

•There are properties of an index term that are useful for evaluating the importance of the term in a document.
–For instance, a word that appears in every document of a collection is useless for retrieval tasks.

–Refer back to the text operations in Chapter 2.

Term Weighting
1. Binary Weights
2. Term Frequency (TF) Weights
3. Inverse Document Frequency (IDF)
4. TF*IDF Weighting
1. Binary Weights
• Only the presence (1) or absence (0) of a term is included in the vector

• The binary formula gives every word that appears in a document equal relevance.

• It can be useful when frequency is not important.

• It does not enable ranking of retrieved documents (see the sketch below).
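
A minimal sketch of binary weighting in Python; the two-document corpus and its vocabulary are made up for illustration, not taken from the notes:

    docs = [
        "information retrieval system",
        "retrieval of stored information",
    ]
    # Vocabulary: every distinct term in the collection, in a fixed order.
    vocab = sorted({term for doc in docs for term in doc.split()})

    # Binary weight: 1 if term i is present in document j, 0 otherwise.
    vectors = [[1 if term in doc.split() else 0 for term in vocab] for doc in docs]

    for doc, vec in zip(docs, vectors):
        print(vec, "<-", doc)

Note that both documents share several 1s but the vectors carry no information about how often a term occurred, which is why binary weights cannot rank documents.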

2. Term Frequency (TF) Weights
•TF (term frequency) – count the number of times a term occurs in a document:
f_ij = frequency of term i in document j
•The more times a term t occurs in document d, the more likely it is that t is relevant to the document, i.e. more indicative of its topic.
–If used alone, it favors common words and long documents.

–It gives too much credit to words that appear more frequently.

•May want to normalize term frequency (tf)
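
A minimal sketch of counting raw term frequencies, using a made-up one-sentence document:

    from collections import Counter

    # f_ij: how often each term i occurs in this document j.
    doc = "computer science is the science of computation"
    tf = Counter(doc.split())
    print(tf["science"])   # -> 2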

Why use term weighting?


•Binary weights are too limiting:
–Terms are either present or absent (1 or 0).

–They do not allow ordering documents by their degree of relevance to a given query.

•Non-binary weights allow us to model partial matching.
–Partial matching allows retrieval of documents that approximate the query.

•Term weighting supports best-match retrieval, which improves the quality of the answer set.

–Term weighting enables ranking of retrieved documents, so that the best-matching documents are ordered at the top, as they are more relevant than the others.

TF Normalization

•Long documents have an unfair advantage:


–They use a lot of terms
•So they get more matches than short documents
–And they use the same words repeatedly
•So they have much higher term frequencies
•Normalization seeks to remove these effects:
–Related somehow to maximum term frequency
–But also sensitive to the number of terms
•If we don't normalize, short documents may not be recognized as relevant.

TF Normalization

•A common normalization divides each raw term count by the maximum term frequency in the document (this is the scheme used in the worked example later in this chapter):

tf_ij = f_ij / max_k(f_kj)

Problems with Term Frequency
•We need a mechanism for attenuating (reducing) the effect of terms that occur so often in the collection that they are not meaningful for relevance determination.

•Scale down the weight of terms with high collection frequency


–Reduce the tf weight of a term by a factor that grows with the collection frequency.
•More common for this purpose is document frequency
–how many documents in the collection contain the term

•Collection frequency and document frequency can behave quite differently for the same term.

3. Inverse Document Frequency (IDF)


•Document frequency is defined to be the number of documents in the collection that contain a
term
DF = document frequency
–Count the frequency considering the whole collection of documents.

–The less frequently a term appears in the whole collection, the more discriminating it is.

df_i = document frequency of term i = number of documents containing term i

Inverse Document Frequency (IDF)


•IDF measures the rarity of a term in the collection; it is a measure of the general importance of the term.
–Inverts the document frequency
•It diminishes (reduces) the weight of terms that occur very frequently in the collection and
increases the weight of terms that occur rarely
–Gives full weight to terms that occur in one document only
–Gives zero weight to terms that occur in all documents
–Terms that appear in many different documents are less indicative of overall topic.

idf_i = inverse document frequency of term i = log2(N / df_i),

where N = total number of documents in the collection


Example: given a collection of 1,000 documents and the document frequency of each word, compute the IDF of each word.
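
A sketch of this computation in Python; since the original table of document frequencies did not survive, the per-word DF values below are hypothetical stand-ins:

    import math

    N = 1000                                              # total documents in the collection
    df = {"the": 1000, "computer": 50, "retrieval": 10}   # hypothetical DF values

    idf = {term: math.log2(N / n) for term, n in df.items()}
    print(idf)   # "the" -> 0.0 (occurs in every document); rarer terms score higher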

4. TF*IDF Weighting
•A good weight must take into account two effects:
–Quantification of intra-document contents (similarity)
•tf factor, the term frequency within a document
–Quantification of inter-documents separation (dissimilarity)
•idf factor, the inverse document frequency
•As a result, the most widely used term-weighting scheme in IR systems is the tf*idf technique:

w_ij = tf_ij * idf_i = tf_ij * log2(N / df_i)


•A term occurring frequently in the document but rarely in the rest of the collection is given high
weight
–The tf*idf value for a term will always be greater than or equal to zero
TF*IDF weighting
•When does TF*IDF register a high weight?
–When a term t occurs many times within a small number of documents: a high tf*idf means a high term frequency (in the given document) and a low document frequency (in the whole collection of documents).
–The weights hence tend to filter out common terms, thus lending high discriminating power to those documents.

•A lower TF*IDF is registered when the term occurs fewer times in a document, or occurs in many documents (virtually all documents)
–Thus offering a less pronounced relevance signal.

How is TF-IDF calculated?


I. Calculate the term frequency (TF) in each document: iterate over each document and count how often each word appears.

II. Calculate the inverse document frequency (IDF): take the logarithm (base 2) of the total number of documents divided by the number of documents containing the word.

III. Calculate TF-IDF: multiply TF and IDF together.

•TF-IDF is used to determine how important a word is within a single document of a collection.
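
The three steps fit in a short Python sketch; the three toy documents are invented for illustration, and raw counts are normalized by the maximum term frequency in each document, matching the worked example below:

    import math
    from collections import Counter

    docs = ["computer science", "computer networks", "information retrieval"]
    N = len(docs)
    tokenized = [d.split() for d in docs]

    # Step II: document frequency df_i, then idf_i = log2(N / df_i).
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log2(N / n) for term, n in df.items()}

    # Steps I and III: normalized tf, then w_ij = tf_ij * idf_i.
    for tokens in tokenized:
        counts = Counter(tokens)
        max_f = max(counts.values())
        print({t: (f / max_f) * idf[t] for t, f in counts.items()})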
Computing TF-IDF: An Example
•Assume collection contains 10,000 documents and statistical analysis shows that document
frequencies (DF) of three terms are:
•DFA = 50, DFB =1300, DFC = 250
•And also term frequencies (TF) of these terms are:
•TFA = 3, TFB =2, TFC =1
•Compute TF*IDF for each term (tf is normalized by the maximum term frequency in the document, here 3):

A: tf = 3/3 = 1.00; idf = log2(10,000/50) = 7.644; tf*idf = 7.644
B: tf = 2/3 = 0.67; idf = log2(10,000/1,300) = 2.943; tf*idf = 1.962
C: tf = 1/3 = 0.33; idf = log2(10,000/250) = 5.322; tf*idf = 1.774

•A query is also treated as a short document and is tf-idf weighted in the same way.

More Example
•Consider a document containing 100 words wherein the word computer appears 3 times

•Now, assume we have 10,000,000 documents and computer appears in 1,000 of these.

•Calculate the TF-IDF:
–The term frequency (TF) for computer, here normalized by document length:
3/100 = 0.03
–The inverse document frequency is

log2(10,000,000 / 1,000) = log2(10,000) = 13.288

–The TF*IDF score is the product of these: 0.03 * 13.288 ≈ 0.3986
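
The arithmetic is easy to check directly:

    import math

    tf = 3 / 100
    idf = math.log2(10_000_000 / 1_000)
    print(tf * idf)   # -> 0.3986...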

Similarity Measure
•A similarity measure is a function that computes the degree of similarity (or distance) between a document vector and a query vector.
•Using a similarity measure between the query and each document:
–It is possible to rank the retrieved documents in the order of presumed relevance
–It is possible to enforce a certain threshold so that the size of the retrieved set can be controlled

Similarity/Dissimilarity Measures
1. Euclidean distance
–The most common distance measure. Euclidean distance takes the square root of the sum of squared differences between the coordinates of the document and query vectors.

2. Inner product (dot product)

–The inner product is also known as the scalar product.

–The dot product is computed as the sum of the products of the corresponding components of the query and document vectors.
3. Cosine similarity (or normalized inner product)
–It projects document and query vectors into a term space and calculates the cosine of the angle between them (see the sketch below).
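
A minimal sketch of all three measures, assuming the query and document are already represented as equal-length weight vectors (the two vectors below are made up):

    import math

    def euclidean(q, d):
        # Square root of summed squared coordinate differences.
        return math.sqrt(sum((qi - di) ** 2 for qi, di in zip(q, d)))

    def inner_product(q, d):
        # Sum of products of corresponding components.
        return sum(qi * di for qi, di in zip(q, d))

    def cosine(q, d):
        # Inner product normalized by the vector magnitudes.
        norm = math.sqrt(inner_product(q, q)) * math.sqrt(inner_product(d, d))
        return inner_product(q, d) / norm if norm else 0.0

    q, d = [1.0, 0.0, 2.0], [0.5, 1.0, 1.0]
    print(euclidean(q, d), inner_product(q, d), cosine(q, d))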

Inner Product
•What is more relevant to a query?
–A 50-word document which contains 3 of the query terms?

–A 100-word document which contains 3 of the query-terms?


•All things being equal, longer documents are more likely to have the query-terms

•The inner-product doesn’t account for the fact that documents have widely varying lengths

•It measures how many terms matched, but not how many terms failed to match.

•So, the inner-product favors long documents

• So the cosine measure is also known as the normalized inner product


• Ranges from 0 to 1 (for non-negative term weights)
– equals 1 if the vectors point in the same direction

– equals 0 if the angle between them is 90 degrees
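
To see why normalization matters, here is a small sketch in which a "longer" document (the same vector scaled by two) doubles its inner-product score while its cosine score is unchanged; the vectors are invented for the example:

    import math

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def cos(a, b):
        return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

    query = [1.0, 1.0, 1.0]
    short_doc = [1.0, 2.0, 0.0]
    long_doc = [2.0 * x for x in short_doc]   # same direction, twice the magnitude

    print(dot(query, short_doc), dot(query, long_doc))   # 3.0 vs 6.0
    print(cos(query, short_doc), cos(query, long_doc))   # identical: ~0.7746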
