
Information Storage and Retrieval (ISR)

Chapter 3: Term Weighting and Similarity Measures


Introduction (Basic Concepts)
• Each document is represented by a set of representative keywords or index terms, as we learned in Chapter 2.
• Documents and queries are represented as vectors or "bags of words" (BOW) – unordered words with their frequencies.
• Bag – a set that allows multiple occurrences of the same element.
• An index term is a word or group of consecutive words in a document.
• Index terms are usually stems.
• Terms can also be phrases, such as "Computer Science" or "World Wide Web".

Term Weighting
The terms of a document are not equally useful for describing the document contents.

•That is why we applied the text operations covered in Chapter 2.

•There are properties of an index term that are useful for evaluating the importance of the term in a document.
–For instance, a word that appears in every document of a collection is useless for retrieval tasks.

–Refer back to the text operations in Chapter 2.

Term Weighting
1. Binary Weights
2. Term Frequency (TF) Weights
3. Inverse Document Frequency (IDF)
4. TF*IDF Weighting
1. Binary Weights
• Only the presence (1) or absence (0) of a term is included in the vector

• The binary formula gives every word that appears in a document equal relevance.

• It can be useful when frequency is not important.

• It does not enable ranking of retrieved documents (see the sketch below).
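
A minimal sketch of binary weighting in Python; the two-document corpus and its vocabulary are made up for illustration, not taken from the notes:

    docs = [
        "information retrieval system",
        "retrieval of stored information",
    ]
    # Vocabulary: every distinct term in the collection, in a fixed order.
    vocab = sorted({term for doc in docs for term in doc.split()})

    # Binary weight: 1 if term i is present in document j, 0 otherwise.
    vectors = [[1 if term in doc.split() else 0 for term in vocab] for doc in docs]

    for doc, vec in zip(docs, vectors):
        print(vec, "<-", doc)

Note that both documents share several 1s but the vectors carry no information about how often a term occurred, which is why binary weights cannot rank documents.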

2. Term Frequency (TF) Weights
•TF (term frequency) – count the number of times a term occurs in a document:
f_ij = frequency of term i in document j
•The more times a term t occurs in document d, the more likely it is that t is relevant to the document, i.e. more indicative of its topic.
–If used alone, it favors common words and long documents.

–It gives too much credit to words that appear more frequently.

•May want to normalize term frequency (tf)
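
A minimal sketch of counting raw term frequencies, using a made-up one-sentence document:

    from collections import Counter

    # f_ij: how often each term i occurs in this document j.
    doc = "computer science is the science of computation"
    tf = Counter(doc.split())
    print(tf["science"])   # -> 2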

Why use term weighting?


•Binary weights are too limiting:
–Terms are either present or absent (1 or 0).

–They do not allow ordering documents by their degree of relevance to a given query.

•Non-binary weights allow us to model partial matching.
–Partial matching allows retrieval of documents that approximate the query.

•Term weighting supports best-match retrieval, which improves the quality of the answer set.

–Term weighting enables ranking of retrieved documents, so that the best-matching documents are ordered at the top, as they are more relevant than the others.

TF Normalization

•Long documents have an unfair advantage:


–They use a lot of terms
•So they get more matches than short documents
–And they use the same words repeatedly
•So they have much higher term frequencies
•Normalization seeks to remove these effects:
–Related somehow to maximum term frequency
–But also sensitive to the number of terms
•If we don't normalize, short documents may not be recognized as relevant.

TF Normalization

•A common normalization divides each raw term count by the maximum term frequency in the document (this is the scheme used in the worked example later in this chapter):

tf_ij = f_ij / max_k(f_kj)

Problems with Term Frequency
•We need a mechanism for attenuating (reducing) the effect of terms that occur so often in the collection that they are not meaningful for relevance determination.

•Scale down the weight of terms with high collection frequency


–Reduce the tf weight of a term by a factor that grows with the collection frequency.
•More common for this purpose is document frequency
–how many documents in the collection contain the term

•Collection frequency and document frequency can behave quite differently for the same term.

3. Inverse Document Frequency (IDF)


•Document frequency is defined to be the number of documents in the collection that contain a
term
DF = document frequency
–Count the frequency considering the whole collection of documents.

–The less frequently a term appears in the whole collection, the more discriminating it is.

df_i = document frequency of term i = number of documents containing term i

Inverse Document Frequency (IDF)


•IDF measures the rarity of a term in the collection; it is a measure of the general importance of the term.
–Inverts the document frequency
•It diminishes (reduces) the weight of terms that occur very frequently in the collection and
increases the weight of terms that occur rarely
–Gives full weight to terms that occur in one document only
–Gives zero weight to terms that occur in all documents
–Terms that appear in many different documents are less indicative of overall topic.

idf_i = inverse document frequency of term i = log2(N / df_i),

where N = total number of documents in the collection


Example: given a collection of 1,000 documents and the document frequency of each word, compute the IDF of each word.
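
A sketch of this computation in Python; since the original table of document frequencies did not survive, the per-word DF values below are hypothetical stand-ins:

    import math

    N = 1000                                              # total documents in the collection
    df = {"the": 1000, "computer": 50, "retrieval": 10}   # hypothetical DF values

    idf = {term: math.log2(N / n) for term, n in df.items()}
    print(idf)   # "the" -> 0.0 (occurs in every document); rarer terms score higher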

4. TF*IDF Weighting
•A good weight must take into account two effects:
–Quantification of intra-document contents (similarity)
•tf factor, the term frequency within a document
–Quantification of inter-documents separation (dissimilarity)
•idf factor, the inverse document frequency
•As a result, the most widely used term-weighting scheme in IR systems is the tf*idf technique:

w_ij = tf_ij * idf_i = tf_ij * log2(N / df_i)


•A term occurring frequently in the document but rarely in the rest of the collection is given high
weight
–The tf*idf value for a term will always be greater than or equal to zero
TF*IDF weighting
•When does TF*IDF register a high weight?
–When a term t occurs many times within a small number of documents: a high tf*idf means a high term frequency (in the given document) and a low document frequency (in the whole collection of documents).
–The weights hence tend to filter out common terms, thus lending high discriminating power to those documents.

•A lower TF*IDF is registered when the term occurs fewer times in a document, or occurs in many documents (virtually all documents)
–Thus offering a less pronounced relevance signal.

How is TF-IDF calculated?


I. Calculate the term frequency (TF) in each document: iterate over each document and count how often each word appears.

II. Calculate the inverse document frequency (IDF): take the logarithm (base 2) of the total number of documents divided by the number of documents containing the word.

III. Calculate TF-IDF: multiply TF and IDF together.

•TF-IDF is used to determine how important a word is within a single document of a collection.
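
The three steps fit in a short Python sketch; the three toy documents are invented for illustration, and raw counts are normalized by the maximum term frequency in each document, matching the worked example below:

    import math
    from collections import Counter

    docs = ["computer science", "computer networks", "information retrieval"]
    N = len(docs)
    tokenized = [d.split() for d in docs]

    # Step II: document frequency df_i, then idf_i = log2(N / df_i).
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log2(N / n) for term, n in df.items()}

    # Steps I and III: normalized tf, then w_ij = tf_ij * idf_i.
    for tokens in tokenized:
        counts = Counter(tokens)
        max_f = max(counts.values())
        print({t: (f / max_f) * idf[t] for t, f in counts.items()})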
Computing TF-IDF: An Example
•Assume collection contains 10,000 documents and statistical analysis shows that document
frequencies (DF) of three terms are:
•DFA = 50, DFB =1300, DFC = 250
•And also term frequencies (TF) of these terms are:
•TFA = 3, TFB =2, TFC =1
•Compute TF*IDF for each term (tf is normalized by the maximum term frequency in the document, here 3):

A: tf = 3/3 = 1.00; idf = log2(10,000/50) = 7.644; tf*idf = 7.644
B: tf = 2/3 = 0.67; idf = log2(10,000/1,300) = 2.943; tf*idf = 1.962
C: tf = 1/3 = 0.33; idf = log2(10,000/250) = 5.322; tf*idf = 1.774

•A query is also treated as a short document and is tf-idf weighted in the same way.

More Example
•Consider a document containing 100 words wherein the word computer appears 3 times

•Now, assume we have 10,000,000 documents and computer appears in 1,000 of these.

•Calculate the TF-IDF:
–The term frequency (TF) for computer, here normalized by document length:
3/100 = 0.03
–The inverse document frequency is

log2(10,000,000 / 1,000) = log2(10,000) = 13.288

–The TF*IDF score is the product of these: 0.03 * 13.288 ≈ 0.3986
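
The arithmetic is easy to check directly:

    import math

    tf = 3 / 100
    idf = math.log2(10_000_000 / 1_000)
    print(tf * idf)   # -> 0.3986...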

Similarity Measure
•A similarity measure is a function that computes the degree of similarity (or distance) between a document vector and a query vector.
•Using a similarity measure between the query and each document:
–It is possible to rank the retrieved documents in the order of presumed relevance
–It is possible to enforce a certain threshold so that the size of the retrieved set can be controlled

Similarity/Dissimilarity Measures
1. Euclidean distance
–The most common distance measure. Euclidean distance takes the square root of the sum of squared differences between the coordinates of the document and query vectors.

2. Inner product (dot product)

–The inner product is also known as the scalar product.

–The dot product is computed as the sum of the products of the corresponding components of the query and document vectors.
3. Cosine similarity (or normalized inner product)
–It projects document and query vectors into a term space and calculates the cosine of the angle between them (see the sketch below).
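
A minimal sketch of all three measures, assuming the query and document are already represented as equal-length weight vectors (the two vectors below are made up):

    import math

    def euclidean(q, d):
        # Square root of summed squared coordinate differences.
        return math.sqrt(sum((qi - di) ** 2 for qi, di in zip(q, d)))

    def inner_product(q, d):
        # Sum of products of corresponding components.
        return sum(qi * di for qi, di in zip(q, d))

    def cosine(q, d):
        # Inner product normalized by the vector magnitudes.
        norm = math.sqrt(inner_product(q, q)) * math.sqrt(inner_product(d, d))
        return inner_product(q, d) / norm if norm else 0.0

    q, d = [1.0, 0.0, 2.0], [0.5, 1.0, 1.0]
    print(euclidean(q, d), inner_product(q, d), cosine(q, d))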

Inner Product
•What is more relevant to a query?
–A 50-word document which contains 3 of the query terms?

–A 100-word document which contains 3 of the query-terms?


•All things being equal, longer documents are more likely to have the query-terms

•The inner-product doesn’t account for the fact that documents have widely varying lengths

•It measures how many terms matched, but not how many terms failed to match.

•So, the inner-product favors long documents

• So the cosine measure is also known as the normalized inner product


• Ranges from 0 to 1 (for non-negative term weights)
– equals 1 if the vectors point in the same direction

– equals 0 if the angle between them is 90 degrees
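
To see why normalization matters, here is a small sketch in which a "longer" document (the same vector scaled by two) doubles its inner-product score while its cosine score is unchanged; the vectors are invented for the example:

    import math

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def cos(a, b):
        return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

    query = [1.0, 1.0, 1.0]
    short_doc = [1.0, 2.0, 0.0]
    long_doc = [2.0 * x for x in short_doc]   # same direction, twice the magnitude

    print(dot(query, short_doc), dot(query, long_doc))   # 3.0 vs 6.0
    print(cos(query, short_doc), cos(query, long_doc))   # identical: ~0.7746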
