Term Weighting
Chapter Three
Objectives
Motivation
Parameters playing an important role in weight computation are:
Cont…
Remark:
Most weighting functions rely on the distribution pattern of the terms in the text to be indexed and/or in a reference collection, and use statistics to compute the weights.
The weight of an index term is usually a numerical value.
Term weights have a value of zero or above; in the case of normalized weights they vary between 0 and 1, with values closer to one indicating very important index terms and values closer to zero indicating very weak terms.
Term weighting is a crucial part of an automatic information retrieval system.
Cont…
In conventional retrieval systems, a term is either used to identify a given item (in which case it is assumed to carry weight 1) or it is not (in which case it is assumed to carry weight 0).
Such a system is called a binary system and has proved to be of limited value.
For example, no distinction is made within the set of retrieved documents: all retrieved documents are considered to be equally important for the query.
Cont…
Term-weighting-based systems (e.g., systems that use statistical weighting schemes) are designed to overcome these shortcomings.
This is done by assigning numerical values to each of the index terms in a query or a document, reflecting their relative importance.
A term with a high weight is assumed to be very relevant to the document or query.
A term with a low weight, on the other hand, indicates little relevance to the content of the document or query.
Cont…
These importance indicator values (weights) can then be used to define a function that measures the similarity, or closeness, between a query and documents.
Some of the term weighting schemes (or
functions or methods) suggested are:
Early proposal to term weighting
It is worth considering an early proposal for term weighting (the assignment of importance indicators). It basically follows this procedure:
Calculate the frequency of each unique term in each document of a given collection of n documents: the frequency of term k in document i, or FREQik.
Determine the total collection frequency TOTFREQk for each term by summing the frequencies of each unique term across all n documents:
TOTFREQk = SUM(i=1 to n) FREQik
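The procedure above can be sketched in a few lines of code. This is a minimal illustration; the two toy documents and their counts are made up for the example.

```python
from collections import Counter

# Toy collection (hypothetical): per-document term frequencies FREQik
docs = [
    Counter({"holiday": 2, "season": 1}),
    Counter({"holiday": 1, "travel": 3}),
]

# TOTFREQk: sum each term's frequency across all n documents
totfreq = Counter()
for freqs in docs:
    totfreq.update(freqs)

print(totfreq["holiday"])  # 3
```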
Motivation:
The frequency of occurrence of a term is a useful indication of its relative importance in describing (defining) a document.
In other words, term importance is related to its frequency of occurrence.
If term A is mentioned more often than term B, then the document is more about A than about B (assuming A and B are content-bearing terms).
Cont…
Supporting idea:
"Authors tend to repeat certain words as they advance or vary the argument on an aspect of the subject."
One such measure assumes that the value, importance, or weight of a term assigned to a document is simply proportional to the term frequency (i.e., the frequency of occurrence of that particular term in that particular document).
Thus the assumption here is that the more frequently a term occurs in a document, the more likely it is to be of value in describing the content of the document.
Do you agree with this?
Cont…
Accordingly, the weight of term k in document i, denoted by wik, might be determined by
wik = FREQik
where FREQik is the frequency of term k in document i.
Cont…
Remarks:
It is a simple count of the number of occurrences of a term in a particular document (or query).
It is a measure of term density in a document.
Despite its weaknesses, experiments have shown that this method gives better results than Boolean (binary) systems.
The basic idea is to differentiate terms within a document.
Problems with Term Frequency (tf) Weights
Such a weighting system sometimes does not perform as expected, especially in cases where the high-frequency words are equally distributed throughout the collection.
Since it does not take into account the role of term k in any document other than document i, it does not consider the importance of term k in the collection.
This simple measure is not normalized to account for variance in the length of documents:
A one-page document with 10 mentions of A is "more about A" than a 100-page document with 20 mentions of A.
Used alone, it favors common words and long documents. How?
Solutions to the Problems with Term Frequency (tf) Weights
Divide each frequency count by the length of the document, in words (length normalization); or
Divide each frequency count by the maximum frequency count of any term in the document (frequency normalization).
In the latter case the normalized frequency fik is used instead of FREQik.
Cont…
The normalized tf is given by
fik = FREQik / max_m(FREQim)
where max_m(FREQim) is the maximum frequency of any term m in document i.
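Frequency normalization can be sketched as follows; the example counts are the ones used in Exercise 2 later in the chapter (holiday appears 7 times, season 5 times).

```python
def normalized_tf(freqs):
    """Divide each raw count by the maximum count of any term in the document."""
    max_freq = max(freqs.values())
    return {term: count / max_freq for term, count in freqs.items()}

print(normalized_tf({"holiday": 7, "season": 5}))
# holiday -> 7/7 = 1.0, season -> 5/7 ≈ 0.71
```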
Inverse Document Frequency (IDF)
weights
Also called collection frequency weighting, it was introduced by Karen Spärck Jones, another pioneering figure in Information Retrieval.
According to this measure, the importance of a term in a document is weighted by the number of documents in the collection that contain the term.
The basic idea here is to differentiate terms in queries.
Accordingly, the assumption is:
If a term occurs in many of the documents in the collection, then it does not serve well as a document identifier and should be given a low importance (weight) as a potential index term.
Cont…
Assuming that term k occurs in at least one document (dk ≠ 0), a possible measure of the inverse document frequency is defined by
wk = log2(N / dk)
where
N is the total number of documents in the collection,
dk is the number of documents in which term k occurs, and
wk is the weight assigned to term k.
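The idf formula above is a one-liner in code; the numbers used in the example are taken from Exercise 2 later in the chapter (1,000,000 documents, 200,000 of which contain holiday).

```python
import math

def idf(N, dk):
    """Inverse document frequency: log2 of (collection size / document frequency)."""
    return math.log2(N / dk)

print(round(idf(1_000_000, 200_000), 2))  # log2(5) ≈ 2.32
```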
Cont…
That is, the weight of a term is the logarithm (base 2) of the number of documents in the collection divided by the number of documents in the collection that contain the term.
The log is used to make the values of tf and idf comparable.
It can be interpreted as the amount of information associated with term k.
IDF measures the rarity of a term across the whole document collection.
According to this measure, the more a term t occurs throughout all documents, the more poorly that term discriminates between documents.
As the collection frequency of a term decreases, its weight increases: emphasis is on terms exhibiting the lowest document frequency.
Cont…
Term importance is inversely proportional to the total number of documents to which the term is assigned; the measure is biased toward terms appearing in a smaller number of documents or items.
Problems with IDF weights
It identifies a term that appears in many documents as not very useful for distinguishing relevant documents from non-relevant ones.
But this function does not take into account the frequency of a term within a given document (i.e., FREQik).
That is, a term that occurs in only a few documents of a collection, and only a small number of times within those documents, still receives a high IDF weight, even though it may not be an important term for the author.
Solution to the Problem of IDF Weights
The composite measure (tf*idf)
This is a measure that combines term frequency and inverse document frequency.
Rationale for this approach:
A high occurrence frequency in a particular document indicates that the term carries a great deal of importance in that document.
A low overall collection frequency (the number of documents in the collection to which the term is assigned) indicates at the same time that the importance of the term in the remainder of the collection is relatively small, so that the term can actually distinguish the documents to which it is assigned from the remainder of the collection.
Thus, such a term can be considered of potentially greater importance for retrieval purposes.
Cont…
tf*idf: WEIGHTik = wik = (FREQik / max_m(FREQim)) * log2(N / dk)
According to this function
The weight of term k in a given document i increases as the frequency of the term in the document (FREQik) increases, but decreases as the document frequency dk increases.
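Putting the two factors together, the composite weight can be sketched as below; the example values are the 'season' figures from Exercise 2 (5 occurrences, maximum term frequency 7, 1,000,000 documents of which 250,000 contain the term).

```python
import math

def tf_idf(freq, max_freq, N, dk):
    """Composite weight: normalized tf times base-2 idf."""
    return (freq / max_freq) * math.log2(N / dk)

print(round(tf_idf(5, 7, 1_000_000, 250_000), 2))  # (5/7) * log2(4) ≈ 1.43
```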
Computing TF-IDF: An Example
Solution
Exercise - 1
A document contains, and only contains, the phrase "being Ethiopian and not being Ethiopian". Suppose every word is indexed.
The document collection contains 1,000 documents, and every word has an equal document frequency of 100.
What is the weight of each term according to the tf.idf weighting formula, using a normalized (frequency) term weight?
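One way to check your answer is to compute it directly; this is a sketch, and the whitespace tokenization and case-folding choices are assumptions.

```python
import math

phrase = "being ethiopian and not being ethiopian".split()
counts = {t: phrase.count(t) for t in set(phrase)}  # being:2, ethiopian:2, and:1, not:1
max_freq = max(counts.values())                     # 2
N, dk = 1000, 100                                   # every term occurs in 100 of 1000 docs

weights = {t: (c / max_freq) * math.log2(N / dk) for t, c in counts.items()}
# 'being', 'ethiopian' -> 1.0 * log2(10) ≈ 3.32
# 'and', 'not'         -> 0.5 * log2(10) ≈ 1.66
```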
Exercise-2
A database collection consists of 1 million documents, of which 200,000 contain the term holiday while 250,000 contain the term season.
A document repeats holiday 7 times and season 5 times. It is known that holiday is repeated more often than any other term in the document.
Calculate the weight of both terms in this document using three different term-weighting methods.
Solution
w(holiday):
tf = 7/7 = 1
idf = log2(1,000,000 / 200,000) = log2(5) = 2.32
tf*idf = 1 * 2.32 = 2.32
w(season):
tf = 5/7 = 0.71
idf = log2(1,000,000 / 250,000) = log2(4) = 2
tf*idf = 0.71 * 2 = 1.42
More Example (length normalization)
Consider a document containing 100 words in which the word computer appears 3 times. Now, assume we have 10,000,000 documents and computer appears in 1,000 of these.
The term frequency (tf) for computer: 3/100 = 0.03
INFORMATION = -log2(p)
where p is the probability of occurrence of the word.
Examples
Finding the information content of the following terms gives:
Term T1, if it occurs once in every 10,000 words of the collection: using the above formula, the result is -log2(1/10,000) ≈ 13.29.
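The information-content formula above can be checked directly in code; this is a minimal sketch using the T1 example from the slide.

```python
import math

def information(p):
    """Information content (in bits) of an event with occurrence probability p."""
    return -math.log2(p)

print(information(1 / 10_000))  # log2(10000) ≈ 13.29
```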
Cont…
The above values can be regarded as a measure of
reduced uncertainty
Extending the idea
Suppose we have t terms selected to represent a document.
Let pk be the probability of each term; then the average information content (i.e., the average reduction in uncertainty about the document) is given by Shannon's formula:
AVERAGE INFORMATION = -SUM(k=1 to t) pk * log2(pk)
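Shannon's average-information formula is easy to sketch in code; the four equally likely terms in the example are an illustration of the maximum-entropy case discussed next.

```python
import math

def average_information(probs):
    """Shannon entropy: average information over the term probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Four equally likely terms: the maximum, log2(4) = 2 bits
print(average_information([0.25, 0.25, 0.25, 0.25]))  # 2.0
```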
Example (Average information content)
Cont…
It is known that the average information is maximized when the occurrence probabilities of the terms are all equal to 1/t for t distinct terms; the maximum value is then log2(t).
Noise
Noise: informally, a disturbance in doing something.
Cont…
The noise of an index term k, Nk or NOISEk, for a collection of N documents can be defined by analogy to Shannon's information measure and is given by
NOISEk = SUM(i=1 to N) (FREQik / TOTFREQk) * log2(TOTFREQk / FREQik)
Cont…
Nk, or NOISEk, is a function that measures the noise of the index term k for a collection of N documents, relating the noise to the spread of the index term throughout the document collection.
Cont…
For a perfectly even distribution, when a term occurs an identical number of times in every document of the collection, the noise is maximized.
In general, the more even the distribution (i.e., the term appears roughly equally in all documents, or it is a non-specific term), the higher the noise.
Example
If a term k occurs exactly once in each document (a perfectly even distribution, i.e., FREQik = 1 for i = 1, 2, ..., N), the noise of term k is NOISEk = log2(N), which is the maximum noise.
On the other hand, the noise of a term k which appears in only one document, with frequency equal to TOTFREQk, is zero.
(If a term appears in only one document with its total frequency, then the noise is zero.)
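Both boundary cases in the example can be verified with a short sketch of the noise formula; the frequency vectors are made up for illustration.

```python
import math

def noise(freqs):
    """Noise of a term given its frequency in each document (zero entries skipped)."""
    tot = sum(freqs)
    return sum((f / tot) * math.log2(tot / f) for f in freqs if f > 0)

print(noise([1, 1, 1, 1]))  # perfectly even over 4 docs: log2(4) = 2.0
print(noise([8, 0, 0, 0]))  # all occurrences in one doc: 0.0
```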
Cont…
If the noise is at its maximum, or the term does not occur in a document, the weight of the term is zero: the term does not discriminate between the documents.
Thus there is a relation between noise and term specificity.
Broad, non-specific terms tend to have a more even distribution across the documents of a collection, and hence high noise.
An inverse function of the noise might be used as a possible measure of term value (term importance).
One such function is known as the SIGNAL of term k.
Signal
Amount of information carried by a term.
The signal of term k is defined as
SIGNALk = log2(TOTFREQk) - NOISEk
For the maximum-noise case previously discussed (where each FREQik equals 1), the SIGNAL equals 0: TOTFREQk in that case equals N, so log2(TOTFREQk) = log2(N) = NOISEk.
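The signal formula and the zero-signal boundary case can be sketched as follows; the frequency vectors are made up for illustration.

```python
import math

def signal(freqs):
    """Signal of a term: log2(TOTFREQ) minus the term's noise."""
    tot = sum(freqs)
    nk = sum((f / tot) * math.log2(tot / f) for f in freqs if f > 0)
    return math.log2(tot) - nk

print(signal([1, 1, 1, 1]))  # maximum-noise case: 0.0
print(signal([4, 0, 0, 0]))  # all in one document: log2(4) = 2.0
```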
Cont…
A possible term-weighting measure of this type, analogous to the tf*idf weighting function, is to compute the signal as discussed and then multiply it by the appropriate FREQik:
wik = FREQik * SIGNALk
Term Discrimination Value (TDV)
(The Discrimination Model)
As discussed, the major use of indexing is to identify sets of documents that are relevant to the user's information need.
The term discrimination value is proposed to measure the degree to which the use of a term helps to distinguish (or discriminate) the documents from each other.
Cont…
This requires first a way to measure the similarity of two documents: the dot product, cosine, Dice, Jaccard, or overlap coefficient similarity measures.
Consider a collection D of n documents, each indexed by a set of t terms.
A particular document in D can be represented by a vector
di = (wi1, wi2, ..., wit)
where wij represents the weight, or degree of importance, of the j-th term in document i.
The wij may be weighted according to their importance using one of the term weighting schemes.
Cont…
In the way described above, each document may be represented by a single point whose position is specified by the location where the corresponding document vector touches the surface of the sphere.
Indexing Structures
The last step before retrieval.
How Do Current Search Engines Index?
Search engines build indexes using a web crawler, which gathers each page on the Web for indexing.
The pages are then organized with the help of the selected indexing structure.
Once the pages are indexed, the local copy of each page is discarded, unless it is stored in a cache.
Cont…
Some search engines, such as Google, AltaVista, Excite, HotBot, InfoSeek, and Lycos, index pages automatically.
Building Index file
A good index file maps each keyword Ki to the set of documents Di that contain the keyword.
An index file usually keeps its index terms in sorted order; the sort order of the terms provides an ordering for the physical file.
Cont…
An index file contains a list of search terms organized for associative look-up, i.e., to answer a user's query.
Once the index file is constructed, it becomes easy to answer in which documents a specified search term appears.
Where there are several occurrences, the index file also makes it possible to know the positions within each document where the term appears.
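A minimal positional inverted index along these lines can be sketched as below; the two toy documents, whitespace tokenization, and case-folding are assumptions for the example.

```python
def build_index(docs):
    """Map each term to {doc_id: [positions]}, keeping index terms in sorted order."""
    index = {}
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index.setdefault(term, {}).setdefault(doc_id, []).append(pos)
    return dict(sorted(index.items()))

docs = {1: "holiday season sale", 2: "holiday travel"}
index = build_index(docs)
print(index["holiday"])  # {1: [0], 2: [0]}
```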
Next on
Modeling Modern IR Systems
But remember the two key issues in IR:
Organizing (indexing): building the index
Retrieval (searching): responding to the query
Quiz
Describe the four basic weighting mechanisms: their assumptions, how they work, and their limitations.
End
Questions, comments, and reflections