
Identifying Term Importance

(Term Weighting)
Chapter Three

(assigning importance levels to index terms)

1
Objectives

 Understand why query terms and document terms are
weighted in modern information retrieval
 Describe the various methods for weighting terms
 Discuss the problems with existing term
weighting methods in information retrieval
 Write programs to implement the various term
weighting schemes

2
Motivation

 The indexing process so far has generated a set of
natural language index terms as a representation
of the text

 Although these terms belong to the general class of
content words, they are not equally important with
regard to the content of the text

 Not all terms have an equal capacity to tell us
about the semantics of the document/item
3
Cont…
 Thus, the assignment of an importance indicator, or term
weighting, is important.

 It enhances the performance of an information
retrieval system by improving precision.

 But what do you think is its impact on recall?

4
Parameters playing an important
role in weight computation
 The index term itself
 The text to be indexed: length of the text and number of
different terms in the text
 The relationship between the index term and the
text to be indexed: frequency of the term in the text
 The relationship between index terms and the document
collection (the term's frequency in the whole collection)

5
Cont…
 Remark:
 Most weighting functions rely on the distribution pattern of
the terms in the text to be indexed and/or in a reference
collection and use statistics to compute the weights.
 The weight of an index term is usually a numerical value.
 Term weights have a value of zero or above; in the case of
normalized weights they vary between 0 and 1,
 with values closer to one indicating very important index
terms and
 values closer to zero very weak terms.
 Term weighting is a crucial part of an automatic information
retrieval system.
6
Cont…
 In conventional retrieval systems,
a term is either used to identify a given item (in
which case it is assumed to carry weight 1) or it is
not (in which case it is assumed to carry a weight
of 0).
 Such a system is called a binary system and has
proved to be of limited usefulness.
 For example, no distinction is made within the set of
retrieved documents.
 That is, all retrieved documents are considered to be
equally important for the query
7
Cont…
 Term weighting based systems (e.g., systems that
use statistical weighting schemes) are designed to
overcome these shortcomings
 This is done by assigning a numerical value to
each of the index terms in a query or a document,
reflecting their relative importance
 A term with a high weight is assumed to be very
relevant to the document or query
 A term with a low weight, on the other hand, indicates
little relevance to the content of the document or
query
8
Cont…
 These importance indicator values (weights) can then be used to
define a function which measures the similarity, or closeness,
between a query and documents

 This in turn makes it possible to retrieve documents in
decreasing order of query-document similarity, the most similar
(presumably relevant) being retrieved first
 Two difficulties in this respect:
 The difficulty of deciding how to allocate the weights
 The difficulty of actually calculating the values which are to
be used (computation intensive)

9
Some of the term weighting schemes (or
functions, or methods) suggested are:

 The term frequency (tf) weights
 The inverse document frequency (IDF, or collection
frequency) weights
 The composite weight (tf*idf)
 The signal-noise ratio

10
Early proposal to term weighting
 It is important to consider an early proposal for term weighting
(assignment of importance indicators). It follows basically the
following procedure (a sketch appears after this list):
 Calculate the frequency of each unique term in each
document of a given collection of n documents: the frequency
of term k in document i, or FREQ_ik
 Determine the total collection frequency TOTFREQ_k for each term by
summing the frequencies of each unique term across all n documents:
 TOTFREQ_k = Σ_{i=1..n} FREQ_ik
 Arrange the words in decreasing order of their collection
frequency. Decide on some threshold value and remove all words
with a collection frequency above this threshold.
 Do the same for low-frequency words.
 The remaining medium-frequency words are now used for
assignment to the documents as index terms
11
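
A minimal Python sketch of this early procedure, assuming tokenized documents and hypothetical cutoff values:

    from collections import Counter

    def medium_frequency_terms(docs, low_cutoff, high_cutoff):
        """Keep only terms whose total collection frequency TOTFREQ_k
        falls between low_cutoff and high_cutoff (hypothetical values)."""
        # FREQ_ik: per-document term frequencies
        per_doc = [Counter(doc) for doc in docs]
        # TOTFREQ_k: sum of FREQ_ik over all n documents
        totfreq = Counter()
        for freqs in per_doc:
            totfreq.update(freqs)
        # Discard high- and low-frequency words; keep the middle band
        return {t for t, f in totfreq.items() if low_cutoff <= f <= high_cutoff}

    docs = [["term", "weighting", "term"], ["weighting", "is", "useful"]]
    print(medium_frequency_terms(docs, low_cutoff=2, high_cutoff=2))  # {'term', 'weighting'}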
Term frequency (tf) weights

 Motivation:
 The frequency of occurrence of a term is a useful
indication of its relative importance in describing
(defining) a document
 In other words, term importance is related to its
frequency of occurrence.
 If term A is mentioned more than term B, then the
document is more about A than about B
 (assuming A and B to be content bearing terms)

12
Cont…
 Supporting idea:
 “Authors tend to repeat certain words as they
advance or vary the argument on an aspect of the
subject”
 One such measure assumes that the value, importance, or
weight, of a term assigned to a document is simply
proportional to the term frequency (i.e., the frequency of
occurrence of that particular term in that particular document)
 Thus the assumption here is that,
 The more frequently a term occurs in a document the
more likely it is to be of value in describing the
content of the document.
 Do you agree with this?
13
Cont…
 Accordingly, the weight of term k in document i,
denoted by w_ik, might be determined by

w_ik = FREQ_ik

Where FREQ_ik is the frequency of term k in document i

14
Cont…
 Remarks:
 It is a simple count of the number of occurrences of a
term in a particular document (or query).
 It is a measure of term density in a document.
 Despite all its weaknesses, experiments have shown that this
method gives better results than Boolean (binary) systems.
 The basic idea is to differentiate terms in a document

15
Problems with Term frequency (tf)
weights
 Such a weighting system sometimes does not perform as
expected, especially in cases where the high frequency words
are equally distributed throughout the collection
 Since it does not take into account the role of term k in any
document other than document i, it doesn't consider the
importance of term k in the collection.
 This simple measure is not normalized to account for variance
in the length of documents
 A one-page document with 10 mentions of A is “more about
A” than a 100-page document with 20 mentions of A
 Used alone, it favors common words and long documents
 HOW??????
16
Solutions to the Problems with
Term frequency (tf) weights
 Divide each frequency count by the length of the
document, in terms of words (length normalization)
 Divide each frequency count by the maximum frequency
count of any term in the document (frequency
normalization).
 In this case the normalized frequency f_ij is used
instead of FREQ_ij

17
Cont…
 The normalized tf is given by

f_ij = FREQ_ij / max_m(FREQ_im)

Where

f_ij is the normalized frequency of term j in document i

max_m(FREQ_im) is the maximum frequency of any term in document i

18
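
A small sketch of max-frequency-normalized tf in Python, assuming a document is given as a list of tokens:

    from collections import Counter

    def normalized_tf(doc_tokens):
        """f_ij = FREQ_ij / max_m(FREQ_im) for every term in one document."""
        freq = Counter(doc_tokens)      # raw FREQ_ij counts
        max_freq = max(freq.values())   # max_m(FREQ_im)
        return {term: count / max_freq for term, count in freq.items()}

    doc = ["holiday", "season", "holiday", "travel"]
    print(normalized_tf(doc))  # holiday -> 1.0, season -> 0.5, travel -> 0.5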
Inverse Document Frequency (IDF)
weights
 Also called collection frequency,
 introduced by Karen Spärck Jones, another personality in
Information Retrieval.
 According to this measure, the importance of a term in a
document is measured (weighted) by the number of
documents in a collection that contain the term.
 The basic idea here is to differentiate terms in queries.
Accordingly the assumption is:
 If a term occurs in many of the documents in the
collection then it doesn’t serve well as a document
identifier and should be given a low importance (weight)
as a potential index term.
19
Cont…
 Assuming that term k occurs in at least one
document (d_k ≠ 0), a possible measure of the inverse
document frequency is defined by

w_k = log2(N / d_k)

Where,
N is the total number of documents in the collection
d_k is the number of documents in which term k occurs
w_k is the weight assigned to term k

20
Cont…
 That is, the weight of a term in a document is the logarithm of the
number of documents in the collection divided by the number of
documents in the collection that contain the term (with 2 as the
base of the logarithm)
 The log is used to make the values of tf and idf comparable.
 It can be interpreted as the amount of information associated with
term k
 IDF measures the rarity of a term across the whole document
collection.
 According to this measure, i.e., IDF, the more a term t occurs
throughout all documents, the more poorly that term t
discriminates between documents.
 As the collection frequency of a term decreases, its weight increases.
 Emphasis is on terms exhibiting the lowest document frequency.
21
Cont…
 Term importance is inversely proportional to the total
number of documents to which each term is assigned,
biased towards terms appearing in a smaller number of
documents or items

 Tests with the IDF scheme show that it consistently
produces substantial performance improvements
compared to unweighted (binary) systems

22
Problems with IDF weights
 It identifies a term that appears in many documents as
not very useful for distinguishing relevant documents
from non-relevant ones
 But this function does not take into account the
frequency of a term in a given document (i.e.,
FREQik ).
 That is, a term may occur in only a few documents of a
collection and, at the same time, only a small number of
times in each of those documents; IDF alone would still give
it a high weight, even though authors tend to repeat the
terms that are truly important
23
Solution to the problem of IDF
weights

 Term importance, or weights, should combine two
measurements:
 Direct proportion to the frequency of the term in
a document (which quantifies how well the term
describes the document's content)
 Inverse proportion to the number of documents in
the collection in which the term appears
 This takes us to the next, and best, term importance
measure: the composite measure (tf*idf)

24
The composite measure (tf*idf)
 Is a measure that combines term frequency and inverse document
frequency
 Rationale for this approach:
 A high occurrence frequency in a particular document
indicates that the term carries a great deal of importance in
that document.
 A low overall collection frequency (the number of documents in the
collection to which the term is assigned) indicates at the same time
that the importance of the term in the remainder of the collection
is relatively small, so that the term can actually distinguish the
documents to which it is assigned from the remainder of the
collection.
 Thus, such a term can be considered as being of potentially greater
importance for retrieval purposes

25
Cont…

 This scheme assigns a weight to each term (vocabulary
word) in a given document by
 combining term frequency and inverse document
frequency. It is defined by

tf*idf: WEIGHT_ik = w_ik = FREQ_ik * (log2 N - log2 d_k)

or, using the frequency-normalized tf,

WEIGHT_ik = w_ik = (FREQ_ik / max_j(FREQ_ij)) * (log2 N - log2 d_k)
26
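
A sketch of the frequency-normalized form in Python, assuming document frequencies d_k are supplied in a dictionary:

    import math
    from collections import Counter

    def tf_idf(doc_tokens, df, n_docs):
        """w_ik = (FREQ_ik / max FREQ) * (log2 N - log2 d_k)."""
        freq = Counter(doc_tokens)
        max_freq = max(freq.values())
        return {
            term: (count / max_freq) * (math.log2(n_docs) - math.log2(df[term]))
            for term, count in freq.items()
        }

    df = {"holiday": 200_000, "season": 250_000}  # document frequencies d_k
    doc = ["holiday"] * 7 + ["season"] * 5
    print(tf_idf(doc, df, n_docs=1_000_000))  # holiday ≈ 2.32, season ≈ 1.43 (cf. Exercise-2)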
According to this function
 Weight of term k in a given document i would
increase as the frequency of the term in the
document (FREQik ) increases but decreases as the
document frequency dk increases

 Example on the composite measure (tf*idf):


 Consider the following scenario in identifying the
importance of the terms in the sample collection

27
Computing TF-IDF: An Example

 Assume the collection contains 10,000 documents and
statistical analysis shows that the document frequencies
(DF) of three terms are: A(50), B(1300), C(250). The
term frequencies (TF) of these terms in a given
document are: A(3), B(2), C(1).

 Compute TF*IDF for each term.

28
Solution

A: tf = 3/3 = 1.00; idf = log2(10000/50) = 7.644; tf*idf = 7.644
B: tf = 2/3 = 0.67; idf = log2(10000/1300) = 2.943; tf*idf = 1.962
C: tf = 1/3 = 0.33; idf = log2(10000/250) = 5.322; tf*idf = 1.774

29
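
The same numbers can be checked mechanically; a quick sanity-check script (not part of the original slides):

    import math

    for term, tf, df in [("A", 3 / 3, 50), ("B", 2 / 3, 1300), ("C", 1 / 3, 250)]:
        idf = math.log2(10_000 / df)
        print(term, round(tf, 2), round(idf, 3), round(tf * idf, 3))
    # A 1.0 7.644 7.644
    # B 0.67 2.943 1.962
    # C 0.33 5.322 1.774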
Exercise - 1
 A document contains, and only contains, the phrase
“being Ethiopian and not being Ethiopian”. Suppose
every word is indexed.
 The document collection contains 1000 documents
and every word has equal document frequency of 100.
 What is the weight of each term according to the
tf.idf weighting formula using a normalized
(frequency) term weight?

30
Exercise-2
 A database collection consists of 1 million documents,
of which 200,000 contain the term holiday while
250,000 contain the term season.
 A document repeats holiday 7 times and season 5
times. It is known that holiday is repeated more than
any other term in the document.
 Calculate the weight of both terms in this document
using three different term weight methods

31
Solution

 W (holiday)
 tf = 7/7 = 1
 idf = log2(1000000/200000) = log2(5) = 2.32
 tf*idf = 2.32

 W (season)
 tf = 5/7 = 0.71
 idf = log2(1000000/250000) = log2(4) = 2
 tf*idf = 0.71 * 2 = 1.42

32
More Example (length normalization)
Consider a document containing 100 words wherein the
word computer appears 3 times. Now, assume we have
10,000,000 documents and computer appears in 1,000
of these.
 The term frequency (TF) for computer:
3/100 = 0.03

 The inverse document frequency is
log2(10,000,000 / 1,000) = 13.288

 The TF*IDF score is the product of these frequencies:
0.03 * 13.288 = 0.399
33
Exercise (use length normalization for tf)
• Let C = number of times a given word appears in a document;
• TW = total number of words in a document;
• TD = total number of documents in the corpus; and
• DF = total number of documents containing a given word.
• Compute the TF, IDF and TF*IDF score for each term
(a sketch that fills in the empty columns follows the table).

Word      C  TW  TD  DF  TF  IDF  TFIDF
airplane  5  46  3   1
blue      1  46  3   1
chair     7  46  3   3
computer  3  46  3   1
forest    2  46  3   1
justice   7  46  3   3
love      2  46  3   1
might     2  46  3   1
perl      5  46  3   2
rose      6  46  3   3
shoe      4  46  3   1
thesis    2  46  3   2
34
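
A sketch that fills in the three empty columns, using C/TW for the length-normalized TF and log2(TD/DF) for the IDF (both taken from the definitions above):

    import math

    rows = [("airplane", 5, 1), ("blue", 1, 1), ("chair", 7, 3),
            ("computer", 3, 1), ("forest", 2, 1), ("justice", 7, 3),
            ("love", 2, 1), ("might", 2, 1), ("perl", 5, 2),
            ("rose", 6, 3), ("shoe", 4, 1), ("thesis", 2, 2)]
    TW, TD = 46, 3  # total words per document, total documents

    for word, c, df in rows:
        tf = c / TW               # length-normalized term frequency
        idf = math.log2(TD / df)  # inverse document frequency
        print(f"{word:10s} {tf:.3f} {idf:.3f} {tf * idf:.4f}")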
The Signal-Noise Ratio Approach
 Is an approach to term weighting based on Shannon's
Information Theory, which states that the information content
of a message, or term, is measured as an inverse function of
the probability of occurrence of the terms in a given text.
 As a result, the higher the probability of a word in a document
(e.g., "the"), the lower the message (or the information it
contains)
 Accordingly, the information content of a word or term is
measured using the formula

INFORMATION = -log2(p),

Where
p is the probability of occurrence of the word
35
Examples
 Finding the information content of the following terms gives
the following results:
 "T1", if it occurs once in every 10,000 words of the
collection: using the above formula, the result is
13.288

 "T2", if it occurs once in every 10 terms: the result is
3.322

 (That is, the probability of occurrence increases from 0.0001
to 0.1)

36
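
Both values can be checked with a couple of lines of Python:

    import math

    for p in (1 / 10_000, 1 / 10):
        print(p, -math.log2(p))  # 13.288 and 3.322 bits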
Cont…
 The above values can be regarded as a measure of
reduced uncertainty

 Accordingly, from the above example again, knowing the
term T2 reduces uncertainty only slightly, while T1 reduces
the uncertainty by 13.288

 In other words, the information content decreases from
13.288 to 3.322, and it makes little sense to look for T2 to
identify the document (that is why it has the low value,
3.322)

37
Extending the idea
 Suppose we have t terms selected to
represent a document.
 Let p_k be the probability of each term; then the average
information content (i.e., the average reduction in uncertainty
about the document) is given by Shannon's formula

AVERAGE INFORMATION = - Σ_{k=1..t} p_k * log2(p_k)

38
Example (Average information content)

 If the terms ALPHA, BETA, GAMMA and DELTA are
expected to occur with the probabilities 0.5, 0.2,
0.2, and 0.1 respectively, find the average
information

 Using the above formula the result is about 1.76,
and this is the average reduction of uncertainty
that we will achieve by selecting any term from
the above.

39
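
Verifying the average with Shannon's formula:

    import math

    probs = [0.5, 0.2, 0.2, 0.1]
    avg_info = -sum(p * math.log2(p) for p in probs)
    print(round(avg_info, 2))  # 1.76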
Cont…
 It is known that the average information is maximized
when the occurrence probabilities of the terms are all
equal to 1/t for t distinct terms, and the maximum
value is -log2(1/t) = log2(t)

 For example, if all the above 4 (four) terms are
expected to occur equally, that means one-fourth of
the time, the average information value will be
log2(4) = 2 instead of 1.76, which is the maximum value expected.

40
Noise

 Salton assumes that a complete picture of term
behavior can be obtained by considering the
frequency characteristics of each term not only in the
particular document whose term weights are currently
under consideration but also in all other documents
in the collection

 One such measure, which is derived from Shannon's
communication theory, is the Signal-Noise Ratio

41
Cont…
 Noise
 Disturbance in doing something
 Of terms, in our case (in IR)
 Noisy term vs. good term

42
Cont…
 The noise of an index term k, N_k or NOISE_k, for a
collection of N documents can be defined by
analogy to Shannon's information measure and is
given by

N_k = NOISE_k = Σ_{i=1..N} (FREQ_ik / TOTFREQ_k) * log2(TOTFREQ_k / FREQ_ik)

43
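
A sketch of the noise computation in Python, assuming the per-document frequencies of term k are given as a list (documents where the term is absent contribute nothing to the sum):

    import math

    def noise(freqs):
        """NOISE_k = sum_i (FREQ_ik / TOTFREQ_k) * log2(TOTFREQ_k / FREQ_ik)."""
        totfreq = sum(freqs)
        return sum((f / totfreq) * math.log2(totfreq / f) for f in freqs if f > 0)

    print(noise([1, 1, 1, 1]))  # even spread over N=4 docs -> log2(4) = 2.0
    print(noise([10]))          # concentrated in one document -> 0.0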
Cont…

 N_k or NOISE_k
 Is a function that measures the noise of the index
term k for a collection of N documents and
relates the noise to the spread of the index term
throughout the document collection

 This measure of noise varies inversely with the
“concentration” of a term in the document
collection

44
Cont…
 For a perfectly even distribution, when a term
occurs an identical number of times in every
document of the collection, the noise is maximized
 If we have an even distribution (i.e., if the term
appears almost equally in all documents, or it is a
non-specific term), the noise is higher

 That is, the term is not useful for selecting
documents, and thus its noise is maximized

45
Example
 If a term k occurs exactly once in each document
(perfectly even distribution, i.e., FREQ_ik = 1 for
i = 1, 2, ..., N),
 the noise of term k = NOISE_k = log2(N), which is the
maximum noise
 On the other hand, the noise of a term k which
appears in only one document with
frequency = TOTFREQ_k will be zero
(If a term appears in only one document with its total
frequency, then the noise is zero)

46
Cont…
 If the noise is at its maximum, or the term does not exist in a
document, the weight of the term is zero; the term does not
discriminate between the documents
 Thus there is a relation between noise and term
specificity.
 Broad, nonspecific terms tend to have a more even
distribution across the documents of a collection, and hence
high noise
 That is, broad nonspecific terms have high noise
 An inverse function of the noise might be used as a possible
function of term value (a measure of term importance)
 One such function is known as the SIGNAL of term k
47
Signal
 Amount of information

 The signal of a term k, S_k or SIGNAL_k, is defined by

SIGNAL_k = log2(TOTFREQ_k) - NOISE_k

48
Cont…
 For the maximum-noise case previously discussed
(where each FREQ_ik is equal to 1), the SIGNAL is equal
to 0, because TOTFREQ_k in that case equals N, so
log2(TOTFREQ_k) - NOISE_k = log2(N) - log2(N) = 0

 On the other hand, when the term occurs in only one
document (NOISE_k = 0), the maximum signal of the term
is obtained:

SIGNAL_k = S_k = log2(TOTFREQ_k)

49
Cont…

 In principle, it is possible to rank the index words
extracted from the documents of a collection in
decreasing order of signal value

 Alternatively, the importance, or weight, of term k
in a document i can be computed as a composite
function taking into account FREQ_ik as well as
SIGNAL_k.

50
Cont…
 A possible measure of this type, analogous to the earlier
term weighting expression

w_ik = FREQ_ik * (log2 N - log2 d_k),

replaces the idf factor with the signal:

w_ik = FREQ_ik * SIGNAL_k = FREQ_ik * S_k

51
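
Building on the noise function sketched earlier, the signal and the resulting weights might be computed as follows:

    import math

    def signal(freqs):
        """SIGNAL_k = log2(TOTFREQ_k) - NOISE_k."""
        totfreq = sum(freqs)
        nk = sum((f / totfreq) * math.log2(totfreq / f) for f in freqs if f > 0)
        return math.log2(totfreq) - nk

    freqs_k = [4, 0, 1, 0]  # FREQ_ik of term k in each of four documents
    w = [f * signal(freqs_k) for f in freqs_k]  # w_ik = FREQ_ik * SIGNAL_k
    print(round(signal(freqs_k), 3), [round(x, 3) for x in w])  # 1.6 [6.4, 0.0, 1.6, 0.0]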
Cont…
 That is, compute the signal following the approach
discussed and then multiply it by the appropriate
FREQ_ik

 This is the signal-noise ratio approach

 It has been suggested that even the simple ranking in order
of decreasing signal is sufficient to indicate the
“better” index terms

52
Term Discrimination Value (TDV)
(The Discrimination Model)
 As discussed, the major use of indexing is to identify
sets of documents that are relevant to the user's
information need

 Thus, an index term should provide a means of
separating documents into two sets:
 Those to be retrieved and
 Those to be ignored

53
Cont…
 The term discrimination value is proposed to measure
the degree to which the use of a term will help to
distinguish (or discriminate) the documents from each
other

 It is an elegant mathematical model for the indexing
process and provides facilities for
 Selecting index terms and
 Weighting index terms
 It is designed to rate a given index term in accordance
with its usefulness as a discriminator among the
documents of a collection

 That is, how well the terms are able to discriminate the
documents of a collection from each other
54
Cont…
 It takes as its base the idea of document space
configuration
 where a document may be considered as a vector,
the elements of which are the weights of every indexing
term used to describe the document, and
 where retrieval is considered as a vector matching
operation between a query vector (defined, similarly,
as a vector the elements of which are the weights of
the terms used to describe the query) and a document
vector

55
Cont…
 Requires first a way to measure the similarity of two documents
 Dot product, Cosine, Dice, Jaccard or overlapping coefficient
similarity measures
 Consider a collection D of n documents each indexed by a set of t
terms
 A particular document in D can be represented by a vector as
di=(wi1, wi2, …, wit)
 Where
 wij represents the weight, or degree of importance, of the j-th
term in document i
 The wij may be weighted according to their importance using
one of the term weighting schemes
56
Cont…

 These vectors (vectors d_i) can then be represented as
belonging to a t-dimensional vector space
A document is thus a vector in this space
(multidimensional space)
 The main assumption here is that we can talk about
the density of documents and the compactness of the space

57
Cont…
 In the way described above, each document may be
represented by a single point whose position is
specified by the location where the corresponding
document vector touches the surface of the sphere

 It is then possible to compute a similarity coefficient,
S(di, dj), between any two documents by comparing
the corresponding vector pairs

 The similarity coefficient S(di, dj) reflects the degree
of similarity or closeness between the two documents
58
Overall procedure for computing DV
of a term
 Compute the average inter document similarity in the
collection using some appropriate function
 Next, the term k being evaluated is removed from the
index vocabulary and the same average document
similarity is computed
 The DV for the term is then computed as
DV_k = (average similarity without k) - (average similarity with k)
 DV_k gives the difference in space density before and after
the assignment of term k as an index term
 A higher value is better because including the keyword
will result in better information retrieval (a sketch of this
procedure follows below)
59
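
A sketch of the procedure in Python, using average pairwise cosine similarity as the (hypothetical) similarity function over simple bag-of-words vectors:

    import math
    from itertools import combinations

    def cosine(a, b):
        dot = sum(a.get(t, 0) * b.get(t, 0) for t in set(a) | set(b))
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    def avg_similarity(docs):
        pairs = list(combinations(docs, 2))
        return sum(cosine(a, b) for a, b in pairs) / len(pairs)

    def discrimination_value(docs, term):
        """DV_k = avg similarity without term k - avg similarity with term k."""
        without = [{t: w for t, w in d.items() if t != term} for d in docs]
        return avg_similarity(without) - avg_similarity(docs)

    docs = [{"ir": 2, "index": 1}, {"ir": 1, "query": 2}, {"ir": 1, "rank": 1}]
    print(round(discrimination_value(docs, "ir"), 3))
    # -0.45: "ir" occurs in every document, so it is a poor discriminator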
Calculating the DV of a Term

 For each potential index term, a discrimination value,
DISCVALUE_k or DV_k, can be computed as the difference
in space density before and after assignment of that
term
 The greater the difference in space densities, the
more the space will be spread out after assignment of
a given term, and therefore the better this
term will function as a discriminator

60
61
Indexing structures
Last thing before retrieval

62
How Current Search Engines
index?
 Search engines build indexes using a web crawler,
which gathers each page on the Web for indexing.
 The pages are then organized with the help of the
selected indexing structure.
 Once the pages are indexed, the local copy of each
page is discarded, unless it is stored in a cache.

63
Cont…
 Some search engines, such as Google, AltaVista,
Excite, HotBot, InfoSeek and Lycos, automatically index
pages.

 Others, such as Yahoo, Magellan, Galaxy and the WWW
Virtual Library, index semi-automatically; that means
there is partial human involvement during indexing.

64
Building Index file

 Given a collection of text documents, they can be
described by a set of representative keywords called
index terms.
 An index file is therefore a file
consisting of a list of index terms and a link to one or
more documents that contain each index term, as shown in
the following Figure.

65
Cont…
 A good index file maps each keyword Ki to the set of
documents Di that contain the keyword.
 The index file usually keeps index terms in sorted order.
 The sort order of the terms in the index file provides
an order on the physical file.

66
Cont…
 An index file contains a list of search terms that are
organized for associative look-up, i.e., to answer a
user's query.
 Once the index file is constructed, it becomes easy to
answer in which documents a specified search term
appears.
 Where there are several occurrences, the index file
also makes it possible to know the positions at which,
within each document, the term appears.

67
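
A minimal sketch of a positional inverted index of the kind described, assuming tokenized documents keyed by document ID:

    from collections import defaultdict

    def build_index(docs):
        """Map each term -> {doc_id: [positions]} for associative look-up."""
        index = defaultdict(dict)
        for doc_id, tokens in docs.items():
            for pos, term in enumerate(tokens):
                index[term].setdefault(doc_id, []).append(pos)
        return dict(sorted(index.items()))  # keep index terms in sorted order

    docs = {1: ["term", "weighting", "term"], 2: ["inverted", "index"]}
    idx = build_index(docs)
    print(idx["term"])  # {1: [0, 2]}: the documents and positions for "term"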
Cont…

 For organizing an index file for a collection of
documents, there are various techniques available.
 The common indexing structures are the sequential file
and the inverted file.

68
Next on
 Modeling Modern IR Systems
 But remember the two key issues in IR:
 Organizing (indexing) - building the index
 Retrieval (searching) - responding to queries

69
Quiz
 Describe the four basic weighting mechanisms
 Assumptions, how they work, and limitations

70
End
Questions, comments and
reflections

71
