
Identifying Term Importance

(Term Weighting)
Chapter Three

(assigning importance levels to index terms)

1
Objectives

 Understand why query terms and document terms are
weighted in modern information retrieval
 Describe the various methods for weighting terms
 Discuss the problems with existing term
weighting methods in information retrieval
 Write programs to implement the various term
weighting schemes

2
Motivation

 The indexing process so far has generated a set of
natural language index terms as a representation
of the text

 Although these terms belong to the general class of
content words, they are not equally important with
regard to the content of the text

 Not all terms have an equal capacity to tell us
about the semantics of the document/item
3
Cont…
 Thus, the assignment of an importance indicator, or term
weighting, is important.

 It enhances the performance of an information
retrieval system by improving precision.

 But what do you think is its impact on recall?

4
Parameters playing an important
role in weight computation
 The index term itself
 The text to be indexed: length of the text and number of
different terms in the text
 The relationship between the index term and the
text to be indexed: frequency of the term in the text
 The relationship between index terms and the document
collection (the term's frequency in the whole collection)

5
Cont…
 Remark:
 Most weighting functions rely on the distribution pattern of
the terms in the text to be indexed and/or in a reference
collection and use statistics to compute the weights.
 The weight of an index term is usually a numerical value.
 Term weights have a value of zero or above; in the case of
normalized weights they vary between 0 and 1,
 with values closer to one indicating very important index
terms and
 values closer to zero very weak terms.
 Term weighting is a crucial part of an automatic information
retrieval system.
6
Cont…
 In conventional retrieval systems,
a term is either used to identify a given item (in
which case it is assumed to carry weight 1) or it is
not (in which case it is assumed to carry a weight
of 0).
 Such a system is called a binary system and has
proved to be of limited usefulness.
 For example, no distinction is made within the set of
retrieved documents.
 That is, all retrieved documents are considered to be
equally important for the query
7
Cont…
 Term weighting based systems (e.g., systems that
use statistical weighting schemes) are designed to
overcome these shortcomings
 This is done by assigning a numerical value to
each of the index terms in a query or a document,
reflecting their relative importance
 A term with a high weight is assumed to be very
relevant to the document or query
 A term with a low weight, on the other hand, indicates
little relevance to the content of the document or
query
8
Cont…
 These importance indicator values (weights) can then be used to
define a function which measures the similarity, or closeness,
between a query and documents

 This in turn makes it possible to retrieve documents in
decreasing order of query-document similarity, the most similar
(presumably relevant) being retrieved first
 Two difficulties in this respect:
 The difficulty of deciding how to allocate the weights
 The difficulty of actually calculating the values which are to
be used (computation intensive)

9
Some of the term weighting schemes (or
functions, or methods) suggested are:

 The term frequency (tf) weights
 The inverse document frequency (IDF, or collection
frequency) weights
 The composite weight (tf*idf)
 The signal-noise ratio

10
Early proposal to term weighting
 It is important to consider an early proposal for term weighting
(assignment of importance indicators). It follows basically the
following procedure (a sketch appears after this list):
 Calculate the frequency of each unique term in each
document of a given collection of n documents: the frequency
of term k in document i, or FREQ_ik
 Determine the total collection frequency TOTFREQ_k for each term by
summing the frequencies of each unique term across all n documents:
 TOTFREQ_k = Σ_{i=1..n} FREQ_ik
 Arrange the words in decreasing order of their collection
frequency. Decide on some threshold value and remove all words
with a collection frequency above this threshold.
 Do the same for low-frequency words.
 The remaining medium-frequency words are now used for
assignment to the documents as index terms
11
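
A minimal Python sketch of this early procedure, assuming tokenized documents and hypothetical cutoff values:

    from collections import Counter

    def medium_frequency_terms(docs, low_cutoff, high_cutoff):
        """Keep only terms whose total collection frequency TOTFREQ_k
        falls between low_cutoff and high_cutoff (hypothetical values)."""
        # FREQ_ik: per-document term frequencies
        per_doc = [Counter(doc) for doc in docs]
        # TOTFREQ_k: sum of FREQ_ik over all n documents
        totfreq = Counter()
        for freqs in per_doc:
            totfreq.update(freqs)
        # Discard high- and low-frequency words; keep the middle band
        return {t for t, f in totfreq.items() if low_cutoff <= f <= high_cutoff}

    docs = [["term", "weighting", "term"], ["weighting", "is", "useful"]]
    print(medium_frequency_terms(docs, low_cutoff=2, high_cutoff=2))  # {'term', 'weighting'}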
Term frequency (tf) weights

 Motivation:
 The frequency of occurrence of a term is a useful
indication of its relative importance in describing
(defining) a document
 In other words, term importance is related to its
frequency of occurrence.
 If term A is mentioned more than term B, then the
document is more about A than about B
 (assuming A and B to be content bearing terms)

12
Cont…
 Supporting idea:
 “Authors tend to repeat certain words as they
advance or vary the argument on an aspect of the
subject”
 One such measure assumes that the value, importance, or
weight, of a term assigned to a document is simply
proportional to the term frequency (i.e., the frequency of
occurrence of that particular term in that particular document)
 Thus the assumption here is that,
 The more frequently a term occurs in a document the
more likely it is to be of value in describing the
content of the document.
 Do you agree with this?
13
Cont…
 Accordingly, the weight of term k in document i,
denoted by w_ik, might be determined by

w_ik = FREQ_ik

Where FREQ_ik is the frequency of term k in document i

14
Cont…
 Remarks:
 It is a simple count of the number of occurrences of a
term in a particular document (or query).
 It is a measure of term density in a document.
 Despite all its weaknesses, experiments have shown that this
method gives better results than Boolean (binary) systems.
 The basic idea is to differentiate terms in a document

15
Problems with Term frequency (tf)
weights
 Such a weighting system sometimes does not perform as
expected, especially in cases where the high frequency words
are equally distributed throughout the collection
 Since it does not take into account the role of term k in any
document other than document i, it doesn't consider the
importance of term k in the collection.
 This simple measure is not normalized to account for variance
in the length of documents
 A one-page document with 10 mentions of A is “more about
A” than a 100-page document with 20 mentions of A
 Used alone, it favors common words and long documents
 HOW??????
16
Solutions to the Problems with
Term frequency (tf) weights
 Divide each frequency count by the length of the
document, in terms of words (length normalization)
 Divide each frequency count by the maximum frequency
count of any term in the document (frequency
normalization).
 In this case the normalized frequency f_ij is used
instead of FREQ_ij

17
Cont…
 The normalized tf is given by

f_ij = FREQ_ij / max_m(FREQ_im)

Where

f_ij is the normalized frequency of term j in document i

max_m(FREQ_im) is the maximum frequency of any term in document i

18
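
A small sketch of max-frequency-normalized tf in Python, assuming a document is given as a list of tokens:

    from collections import Counter

    def normalized_tf(doc_tokens):
        """f_ij = FREQ_ij / max_m(FREQ_im) for every term in one document."""
        freq = Counter(doc_tokens)      # raw FREQ_ij counts
        max_freq = max(freq.values())   # max_m(FREQ_im)
        return {term: count / max_freq for term, count in freq.items()}

    doc = ["holiday", "season", "holiday", "travel"]
    print(normalized_tf(doc))  # holiday -> 1.0, season -> 0.5, travel -> 0.5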
Inverse Document Frequency (IDF)
weights
 Also called collection frequency,
 introduced by Karen Spärck Jones, another personality in
Information Retrieval.
 According to this measure, the importance of a term in a
document is measured (weighted) by the number of
documents in a collection that contain the term.
 The basic idea here is to differentiate terms in queries.
Accordingly the assumption is:
 If a term occurs in many of the documents in the
collection then it doesn’t serve well as a document
identifier and should be given a low importance (weight)
as a potential index term.
19
Cont…
 Assuming that term k occurs in at least one
document (d_k ≠ 0), a possible measure of the inverse
document frequency is defined by

w_k = log2(N / d_k)

Where,
N is the total number of documents in the collection
d_k is the number of documents in which term k occurs
w_k is the weight assigned to term k

20
Cont…
 That is, the weight of a term in a document is the logarithm of the
number of documents in the collection divided by the number of
documents in the collection that contain the term (with 2 as the
base of the logarithm)
 The log is used to make the values of tf and idf comparable.
 It can be interpreted as the amount of information associated with
term k
 IDF measures the rarity of a term across the whole document
collection.
 According to this measure, i.e., IDF, the more a term t occurs
throughout all documents, the more poorly that term t
discriminates between documents.
 As the collection frequency of a term decreases, its weight increases.
 Emphasis is on terms exhibiting the lowest document frequency.
21
Cont…
 Term importance is inversely proportional to the total
number of documents to which each term is assigned,
biased towards terms appearing in a smaller number of
documents or items

 Tests with the IDF scheme show that it consistently
produces substantial performance improvements
compared to unweighted (binary) systems

22
Problems with IDF weights
 It identifies a term that appears in many documents as
not very useful for distinguishing relevant documents
from non-relevant ones
 But this function does not take into account the
frequency of a term in a given document (i.e.,
FREQik ).
 That is, a term may occur in only a few documents of a
collection and, at the same time, only a small number of
times in each of those documents; IDF alone would still give
it a high weight, even though authors tend to repeat the
terms that are truly important
23
Solution to the problem of IDF
weights

 Term importance, or weights, should combine two
measurements:
 Direct proportion to the frequency of the term in
a document (which quantifies how well the term
describes the document's content)
 Inverse proportion to the number of documents in
the collection in which the term appears
 This takes us to the next, and best, term importance
measure: the composite measure (tf*idf)

24
The composite measure (tf*idf)
 Is a measure that combines term frequency and inverse document
frequency
 Rationale for this approach:
 A high occurrence frequency in a particular document
indicates that the term carries a great deal of importance in
that document.
 A low overall collection frequency (the number of documents in the
collection to which the term is assigned) indicates at the same time
that the importance of the term in the remainder of the collection
is relatively small, so that the term can actually distinguish the
documents to which it is assigned from the remainder of the
collection.
 Thus, such a term can be considered as being of potentially greater
importance for retrieval purposes

25
Cont…

 This scheme assigns a weight to each term (vocabulary
word) in a given document by
 combining term frequency and inverse document
frequency. It is defined by

tf*idf: WEIGHT_ik = w_ik = FREQ_ik * (log2 N - log2 d_k)

or, using the frequency-normalized tf,

WEIGHT_ik = w_ik = (FREQ_ik / max_j(FREQ_ij)) * (log2 N - log2 d_k)
26
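
A sketch of the frequency-normalized form in Python, assuming document frequencies d_k are supplied in a dictionary:

    import math
    from collections import Counter

    def tf_idf(doc_tokens, df, n_docs):
        """w_ik = (FREQ_ik / max FREQ) * (log2 N - log2 d_k)."""
        freq = Counter(doc_tokens)
        max_freq = max(freq.values())
        return {
            term: (count / max_freq) * (math.log2(n_docs) - math.log2(df[term]))
            for term, count in freq.items()
        }

    df = {"holiday": 200_000, "season": 250_000}  # document frequencies d_k
    doc = ["holiday"] * 7 + ["season"] * 5
    print(tf_idf(doc, df, n_docs=1_000_000))  # holiday ≈ 2.32, season ≈ 1.43 (cf. Exercise-2)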
According to this function
 Weight of term k in a given document i would
increase as the frequency of the term in the
document (FREQik ) increases but decreases as the
document frequency dk increases

 Example on the composite measure (tf*idf):


 Consider the following scenario in identifying the
importance of the terms in the sample collection

27
Computing TF-IDF: An Example

 Assume the collection contains 10,000 documents and
statistical analysis shows that the document frequencies
(DF) of three terms are: A(50), B(1300), C(250). The
term frequencies (TF) of these terms in a given
document are: A(3), B(2), C(1).

 Compute TF*IDF for each term.

28
Solution

A: tf = 3/3 = 1.00; idf = log2(10000/50) = 7.644; tf*idf = 7.644
B: tf = 2/3 = 0.67; idf = log2(10000/1300) = 2.943; tf*idf = 1.962
C: tf = 1/3 = 0.33; idf = log2(10000/250) = 5.322; tf*idf = 1.774

29
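
The same numbers can be checked mechanically; a quick sanity-check script (not part of the original slides):

    import math

    for term, tf, df in [("A", 3 / 3, 50), ("B", 2 / 3, 1300), ("C", 1 / 3, 250)]:
        idf = math.log2(10_000 / df)
        print(term, round(tf, 2), round(idf, 3), round(tf * idf, 3))
    # A 1.0 7.644 7.644
    # B 0.67 2.943 1.962
    # C 0.33 5.322 1.774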
Exercise - 1
 A document contains, and only contains, the phrase
“being Ethiopian and not being Ethiopian”. Suppose
every word is indexed.
 The document collection contains 1000 documents
and every word has equal document frequency of 100.
 What is the weight of each term according to the
tf.idf weighting formula using a normalized
(frequency) term weight?

30
Exercise-2
 A database collection consists of 1 million documents,
of which 200,000 contain the term holiday while
250,000 contain the term season.
 A document repeats holiday 7 times and season 5
times. It is known that holiday is repeated more than
any other term in the document.
 Calculate the weight of both terms in this document
using three different term weight methods

31
Solution

 W (holiday)
 tf = 7/7 = 1
 idf = log2(1000000/200000) = log2(5) = 2.32
 tf*idf = 2.32

 W (season)
 tf = 5/7 = 0.71
 idf = log2(1000000/250000) = log2(4) = 2
 tf*idf = 0.71 * 2 = 1.42

32
More Example (length normalization)
Consider a document containing 100 words wherein the
word computer appears 3 times. Now, assume we have
10,000,000 documents and computer appears in 1,000
of these.
 The term frequency (TF) for computer:
3/100 = 0.03

 The inverse document frequency is
log2(10,000,000 / 1,000) = 13.288

 The TF*IDF score is the product of these frequencies:
0.03 * 13.288 = 0.399
33
Exercise (use length normalization for tf)
• Let C = number of times a given word appears in a document;
• TW = total number of words in a document;
• TD = total number of documents in the corpus; and
• DF = total number of documents containing a given word.
• Compute the TF, IDF and TF*IDF score for each term
(a sketch that fills in the empty columns follows the table).

Word      C  TW  TD  DF  TF  IDF  TFIDF
airplane  5  46  3   1
blue      1  46  3   1
chair     7  46  3   3
computer  3  46  3   1
forest    2  46  3   1
justice   7  46  3   3
love      2  46  3   1
might     2  46  3   1
perl      5  46  3   2
rose      6  46  3   3
shoe      4  46  3   1
thesis    2  46  3   2
34
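
A sketch that fills in the three empty columns, using C/TW for the length-normalized TF and log2(TD/DF) for the IDF (both taken from the definitions above):

    import math

    rows = [("airplane", 5, 1), ("blue", 1, 1), ("chair", 7, 3),
            ("computer", 3, 1), ("forest", 2, 1), ("justice", 7, 3),
            ("love", 2, 1), ("might", 2, 1), ("perl", 5, 2),
            ("rose", 6, 3), ("shoe", 4, 1), ("thesis", 2, 2)]
    TW, TD = 46, 3  # total words per document, total documents

    for word, c, df in rows:
        tf = c / TW               # length-normalized term frequency
        idf = math.log2(TD / df)  # inverse document frequency
        print(f"{word:10s} {tf:.3f} {idf:.3f} {tf * idf:.4f}")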
The Signal-Noise Ratio Approach
 Is an approach to term weighting based on Shannon's
Information Theory, which states that the information content
of a message, or term, is measured as an inverse function of
the probability of occurrence of the terms in a given text.
 As a result, the higher the probability of a word in a document
(e.g., "the"), the lower the message (or the information it
contains)
 Accordingly, the information content of a word or term is
measured using the formula

INFORMATION = -log2(p),

Where
p is the probability of occurrence of the word
35
Examples
 Finding the information content of the following terms gives
the following results:
 "T1", if it occurs once in every 10,000 words of the
collection: using the above formula, the result is
13.288

 "T2", if it occurs once in every 10 terms: the result is
3.322

 (That is, the probability of occurrence increases from 0.0001
to 0.1)

36
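
Both values can be checked with a couple of lines of Python:

    import math

    for p in (1 / 10_000, 1 / 10):
        print(p, -math.log2(p))  # 13.288 and 3.322 bits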
Cont…
 The above values can be regarded as a measure of
reduced uncertainty

 Accordingly, from the above example again, knowing the
term T2 reduces uncertainty only slightly, while T1 reduces
the uncertainty by 13.288

 In other words, the information content decreases from
13.288 to 3.322, and it makes little sense to look for T2 to
identify the document (that is why it has the low value,
3.322)

37
Extending the idea
 Suppose we have t terms selected to
represent a document.
 Let p_k be the probability of each term; then the average
information content (i.e., the average reduction in uncertainty
about the document) is given by Shannon's formula

AVERAGE INFORMATION = - Σ_{k=1..t} p_k * log2(p_k)

38
Example (Average information content)

 If the terms ALPHA, BETA, GAMMA and DELTA are
expected to occur with the probabilities 0.5, 0.2,
0.2, and 0.1 respectively, find the average
information

 Using the above formula the result is about 1.76,
and this is the average reduction of uncertainty
that we will achieve by selecting any term from
the above.

39
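
Verifying the average with Shannon's formula:

    import math

    probs = [0.5, 0.2, 0.2, 0.1]
    avg_info = -sum(p * math.log2(p) for p in probs)
    print(round(avg_info, 2))  # 1.76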
Cont…
 It is known that the average information is maximized
when the occurrence probabilities of the terms are all
equal to 1/t for t distinct terms, and the maximum
value is -log2(1/t) = log2(t)

 For example, if all the above 4 (four) terms are
expected to occur equally, that means one-fourth of
the time, the average information value will be
log2(4) = 2 instead of 1.76, which is the maximum value expected.

40
Noise

 Salton assumes that a complete picture of term
behavior can be obtained by considering the
frequency characteristics of each term not only in the
particular document whose term weights are currently
under consideration but also in all other documents
in the collection

 One such measure, which is derived from Shannon's
communication theory, is the Signal-Noise Ratio

41
Cont…
 Noise
 Disturbance in doing something
 Of terms, in our case (in IR)
 Noisy term vs. good term

42
Cont…
 The noise of an index term k, N_k or NOISE_k, for a
collection of N documents can be defined by
analogy to Shannon's information measure and is
given by

N_k = NOISE_k = Σ_{i=1..N} (FREQ_ik / TOTFREQ_k) * log2(TOTFREQ_k / FREQ_ik)

43
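
A sketch of the noise computation in Python, assuming the per-document frequencies of term k are given as a list (documents where the term is absent contribute nothing to the sum):

    import math

    def noise(freqs):
        """NOISE_k = sum_i (FREQ_ik / TOTFREQ_k) * log2(TOTFREQ_k / FREQ_ik)."""
        totfreq = sum(freqs)
        return sum((f / totfreq) * math.log2(totfreq / f) for f in freqs if f > 0)

    print(noise([1, 1, 1, 1]))  # even spread over N=4 docs -> log2(4) = 2.0
    print(noise([10]))          # concentrated in one document -> 0.0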
Cont…

 N_k or NOISE_k
 Is a function that measures the noise of the index
term k for a collection of N documents and
relates the noise to the spread of the index term
throughout the document collection

 This measure of noise varies inversely with the
“concentration” of a term in the document
collection

44
Cont…
 For a perfectly even distribution, when a term
occurs an identical number of times in every
document of the collection, the noise is maximized
 If we have an even distribution (i.e., if the term
appears almost equally in all documents, or it is a
non-specific term), the noise is higher

 That is, the term is not useful for selecting
documents, and thus its noise is maximized

45
Example
 If a term k occurs exactly once in each document
(perfectly even distribution, i.e., FREQ_ik = 1 for
i = 1, 2, ..., N),
 the noise of term k = NOISE_k = log2(N), which is the
maximum noise
 On the other hand, the noise of a term k which
appears in only one document with
frequency = TOTFREQ_k will be zero
(If a term appears in only one document with its total
frequency, then the noise is zero)

46
Cont…
 If the noise is at its maximum, or the term does not exist in a
document, the weight of the term is zero; the term does not
discriminate between the documents
 Thus there is a relation between noise and term
specificity.
 Broad, nonspecific terms tend to have a more even
distribution across the documents of a collection, and hence
high noise
 That is, broad nonspecific terms have high noise
 An inverse function of the noise might be used as a possible
function of term value (a measure of term importance)
 One such function is known as the SIGNAL of term k
47
Signal
 Amount of information

 The signal of a term k, S_k or SIGNAL_k, is defined by

SIGNAL_k = log2(TOTFREQ_k) - NOISE_k

48
Cont…
 For the maximum-noise case previously discussed
(where each FREQ_ik is equal to 1), the SIGNAL is equal
to 0, because TOTFREQ_k in that case equals N, so
log2(TOTFREQ_k) - NOISE_k = log2(N) - log2(N) = 0

 On the other hand, when the term occurs in only one
document (NOISE_k = 0), the maximum signal of the term
is obtained:

SIGNAL_k = S_k = log2(TOTFREQ_k)

49
Cont…

 In principle, it is possible to rank the index words
extracted from the documents of a collection in
decreasing order of signal value

 Alternatively, the importance, or weight, of term k
in a document i can be computed as a composite
function taking into account FREQ_ik as well as
SIGNAL_k.

50
Cont…
 A possible measure of this type, analogous to the earlier
term weighting expression

w_ik = FREQ_ik * (log2 N - log2 d_k),

replaces the idf factor with the signal:

w_ik = FREQ_ik * SIGNAL_k = FREQ_ik * S_k

51
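
Building on the noise function sketched earlier, the signal and the resulting weights might be computed as follows:

    import math

    def signal(freqs):
        """SIGNAL_k = log2(TOTFREQ_k) - NOISE_k."""
        totfreq = sum(freqs)
        nk = sum((f / totfreq) * math.log2(totfreq / f) for f in freqs if f > 0)
        return math.log2(totfreq) - nk

    freqs_k = [4, 0, 1, 0]  # FREQ_ik of term k in each of four documents
    w = [f * signal(freqs_k) for f in freqs_k]  # w_ik = FREQ_ik * SIGNAL_k
    print(round(signal(freqs_k), 3), [round(x, 3) for x in w])  # 1.6 [6.4, 0.0, 1.6, 0.0]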
Cont…
 That is, compute the signal following the approach
discussed and then multiply it by the appropriate
FREQ_ik

 This is the signal-noise ratio approach

 It has been suggested that even the simple ranking in order
of decreasing signal is sufficient to indicate the
“better” index terms

52
Term Discrimination Value (TDV)
(The Discrimination Model)
 As discussed, the major use of indexing is to identify
sets of documents that are relevant to the user's
information need

 Thus, an index term should provide a means of
separating documents into two sets:
 Those to be retrieved and
 Those to be ignored

53
Cont…
 The term discrimination value is proposed to measure
the degree to which the use of a term will help to
distinguish (or discriminate) the documents from each
other

 It is an elegant mathematical model for the indexing
process and provides facilities for
 Selecting index terms and
 Weighting index terms
 It is designed to rate a given index term in accordance
with its usefulness as a discriminator among the
documents of a collection

 That is, how well the terms are able to discriminate the
documents of a collection from each other
54
Cont…
 It takes as its base the idea of document space
configuration
 where a document may be considered as a vector,
the elements of which are the weights of every indexing
term used to describe the document, and
 where retrieval is considered as a vector matching
operation between a query vector (defined, similarly,
as a vector the elements of which are the weights of
the terms used to describe the query) and a document
vector

55
Cont…
 Requires first a way to measure the similarity of two documents
 Dot product, Cosine, Dice, Jaccard or overlapping coefficient
similarity measures
 Consider a collection D of n documents each indexed by a set of t
terms
 A particular document in D can be represented by a vector as
di=(wi1, wi2, …, wit)
 Where
 wij represents the weight, or degree of importance, of the j-th
term in document i
 The wij may be weighted according to their importance using
one of the term weighting schemes
56
Cont…

 These vectors (vectors d_i) can then be represented as
belonging to a t-dimensional vector space
A document is thus a vector in this space
(multidimensional space)
 The main assumption here is that we can talk about
the density of documents and the compactness of the space

57
Cont…
 In the way described above, each document may be
represented by a single point whose position is
specified by the location where the corresponding
document vector touches the surface of the sphere

 It is then possible to compute a similarity coefficient,
S(di, dj), between any two documents by comparing
the corresponding vector pairs

 The similarity coefficient S(di, dj) reflects the degree
of similarity or closeness between the two documents
58
Overall procedure for computing DV
of a term
 Compute the average inter document similarity in the
collection using some appropriate function
 Next, the term k being evaluated is removed from the
index vocabulary and the same average document
similarity is computed
 The DV for the term is then computed as
DV_k = (average similarity without k) - (average similarity with k)
 DV_k gives the difference in space density before and after
the assignment of term k as an index term
 A higher value is better because including the keyword
will result in better information retrieval (a sketch of this
procedure follows below)
59
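
A sketch of the procedure in Python, using average pairwise cosine similarity as the (hypothetical) similarity function over simple bag-of-words vectors:

    import math
    from itertools import combinations

    def cosine(a, b):
        dot = sum(a.get(t, 0) * b.get(t, 0) for t in set(a) | set(b))
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    def avg_similarity(docs):
        pairs = list(combinations(docs, 2))
        return sum(cosine(a, b) for a, b in pairs) / len(pairs)

    def discrimination_value(docs, term):
        """DV_k = avg similarity without term k - avg similarity with term k."""
        without = [{t: w for t, w in d.items() if t != term} for d in docs]
        return avg_similarity(without) - avg_similarity(docs)

    docs = [{"ir": 2, "index": 1}, {"ir": 1, "query": 2}, {"ir": 1, "rank": 1}]
    print(round(discrimination_value(docs, "ir"), 3))
    # -0.45: "ir" occurs in every document, so it is a poor discriminator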
Calculating the DV of a Term

 For each potential index term, a discrimination value,
DISCVALUE_k or DV_k, can be computed as the difference
in space density before and after assignment of that
term
 The greater the difference in space densities, the
more the space will be spread out after assignment of
a given term, and therefore the better this
term will function as a discriminator

60
61
Indexing structures
Last thing before retrieval

62
How Current Search Engines
index?
 Search engines build indexes using a web crawler,
which gathers each page on the Web for indexing.
 The pages are then organized with the help of the
selected indexing structure.
 Once the pages are indexed, the local copy of each
page is discarded, unless it is stored in a cache.

63
Cont…
 Some search engines, such as Google, AltaVista,
Excite, HotBot, InfoSeek and Lycos, automatically index
pages.

 Others, such as Yahoo, Magellan, Galaxy and the WWW
Virtual Library, index semi-automatically; that means
there is partial human involvement during indexing.

64
Building Index file

 Given a collection of text documents, they can be
described by a set of representative keywords called
index terms.
 An index file is therefore a file
consisting of a list of index terms and a link to one or
more documents that contain each index term, as shown in
the following Figure.

65
Cont…
 A good index file maps each keyword Ki to the set of
documents Di that contain the keyword.
 The index file usually keeps index terms in sorted order.
 The sort order of the terms in the index file provides
an order on the physical file.

66
Cont…
 An index file contains a list of search terms that are
organized for associative look-up, i.e., to answer a
user's query.
 Once the index file is constructed, it becomes easy to
answer in which documents a specified search term
appears.
 Where there are several occurrences, the index file
also makes it possible to know the positions at which,
within each document, the term appears.

67
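
A minimal sketch of a positional inverted index of the kind described, assuming tokenized documents keyed by document ID:

    from collections import defaultdict

    def build_index(docs):
        """Map each term -> {doc_id: [positions]} for associative look-up."""
        index = defaultdict(dict)
        for doc_id, tokens in docs.items():
            for pos, term in enumerate(tokens):
                index[term].setdefault(doc_id, []).append(pos)
        return dict(sorted(index.items()))  # keep index terms in sorted order

    docs = {1: ["term", "weighting", "term"], 2: ["inverted", "index"]}
    idx = build_index(docs)
    print(idx["term"])  # {1: [0, 2]}: the documents and positions for "term"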
Cont…

 For organizing an index file for a collection of
documents, there are various techniques available.
 The common indexing structures are the sequential file
and the inverted file.

68
Next on
 Modeling Modern IR Systems
 But remember the two key issues in IR:
 Organizing (indexing) - building the index
 Retrieval (searching) - responding to queries

69
Quiz
 Describe the four basic weighting mechanisms
 Assumptions, how they work, and limitations

70
End
Questions, comments and
reflections

71
