Lecture 10 - Term Frequency
Term Frequency
• In Natural Language Processing (NLP), Term Frequency (TF) is a key concept used to
evaluate the importance of a word within a document.
• It's a measure of how frequently a term appears in a document relative to the total number
of terms in that document.
• To score a word's importance, TF is combined with a second metric into TF-IDF, computed by multiplying the two:
• Term Frequency (TF): how many times a word appears in a document.
• Inverse Document Frequency (IDF): the inverse document frequency of the word across a
collection of documents. Rare words have high scores, common words have low scores.
• Term frequency measures how common a word is; inverse document frequency (IDF) measures how unique or rare a word is.
• TF-IDF has many uses, such as in information retrieval, text analysis, keyword
extraction, and as a way of obtaining numeric features from text for machine learning
algorithms.
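As a minimal sketch, the two metrics can be written directly from their definitions (plain Python, whitespace tokenization, natural log; the tiny corpus below is made up for illustration):

```python
import math

# TF: how often the term occurs in one document, normalized by length.
def term_frequency(term, doc_tokens):
    return doc_tokens.count(term) / len(doc_tokens)

# IDF: log of (total documents / documents containing the term).
# Rare words score high, common words score low.
def inverse_document_frequency(term, documents):
    containing = sum(1 for doc in documents if term in doc)
    return math.log(len(documents) / containing)

docs = [
    "the sky is blue".split(),
    "the sun is bright".split(),
    "dogs chase cats".split(),
]
print(term_frequency("the", docs[0]))           # 1/4 = 0.25
print(inverse_document_frequency("the", docs))  # in 2 of 3 docs: log(3/2)
print(inverse_document_frequency("dogs", docs)) # in 1 of 3 docs: log(3)
```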
TF-IDF origin
• TF-IDF was first designed for document search and information retrieval, where a
query is run and the system has to find the most relevant documents.
• Suppose the query is the text “The bug”. The system would give each document a
higher score proportionally to the frequencies of the query words found in the
document, weighting more rare words like “bug” with respect to common words
like “the”.
How to compute TF-IDF
• Suppose we are looking for documents using the query Q and our database is
composed of the documents D1, D2, and D3.
• Q: The cat.
• D1: The cat is on the mat.
• D2: My dog and cat are the best.
• D3: The locals are playing.
• There are several ways of calculating TF, with the simplest being a raw count of
instances a word appears in a document.
• We’ll compute the TF scores using the ratio of the count of instances over the
length of the document.
• As a conclusion, when performing the query “The cat” over the collection of documents D1, D2, and D3, the ranked results would be D1, D2, D3: IDF makes the common word “the” worthless, only “cat” contributes, and it is relatively more frequent in the shorter D1 than in D2, while D3 scores zero.
• TF-IDF gives us a way to associate each word in a document with a number that represents how relevant that word is in that document. Such numbers can then be used as features in machine learning models.
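The query scores can be computed in a few lines; a sketch assuming lowercase whitespace tokenization, length-normalized TF, and natural-log IDF:

```python
import math

query = "the cat".split()
docs = {
    "D1": "the cat is on the mat".split(),
    "D2": "my dog and cat are the best".split(),
    "D3": "the locals are playing".split(),
}

def tf(term, tokens):
    # Count of the term over the document length.
    return tokens.count(term) / len(tokens)

def idf(term):
    # log of (total documents / documents containing the term).
    containing = sum(1 for d in docs.values() if term in d)
    return math.log(len(docs) / containing)

# Score each document as the sum of TF-IDF over the query terms.
scores = {
    name: sum(tf(t, tokens) * idf(t) for t in query)
    for name, tokens in docs.items()
}
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(name, round(score, 3))  # D1 0.068, D2 0.058, D3 0.0
```

“the” appears in all three documents, so its IDF is log(3/3) = 0 and only “cat” contributes to the scores.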
Example 2
• Consider a document containing 100 words in which the word apple appears 5
times. The term frequency (i.e., TF) for apple is then (5 / 100) = 0.05.
• Now, assume we have 10 million documents and the word apple appears in one
thousand of these. Then, the inverse document frequency (i.e., IDF) is calculated
as log(10,000,000 / 1,000) = log(10,000) = 4, using the base-10 logarithm.
• Thus, the TF-IDF weight is the product of these quantities: 0.05 * 4 = 0.20.
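The arithmetic can be checked directly; note this example uses a base-10 logarithm (the implementation steps later use the natural log, which only rescales the scores):

```python
import math

tf = 5 / 100                          # "apple" appears 5 times in 100 words
idf = math.log10(10_000_000 / 1_000)  # base-10 log of 10,000 = 4.0
tf_idf = tf * idf
print(tf, idf, tf_idf)  # 0.05 4.0 0.2
```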
Implementation of TF-IDF
• The implementation of TF-IDF consists of the following nine steps.
• Prerequisites: Python 3, the NLTK library, and a Python IDE.
1. Tokenize the sentences
2. Create a frequency matrix of the words in each sentence,
where each sentence is the key and the value is a dictionary of word frequencies.
3. Calculate Term Frequency and generate a matrix
We’ll find the Term Frequency for each word in a paragraph.
Now, remember the definition of TF,
TF(t) = (Number of times term t appears in a document) / (Total number of terms in
the document)
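Steps 1–3 can be sketched as follows (the prerequisites mention NLTK's tokenizers; plain string splitting is used here so the snippet stays self-contained):

```python
# Steps 1-3 on a small example paragraph.
text = ("The cat is on the mat. "
        "My dog and cat are the best. "
        "The locals are playing.")

# 1. Tokenize the sentences.
sentences = [s.strip() for s in text.split(".") if s.strip()]

# 2. Frequency matrix: sentence -> {word: raw count}.
freq_matrix = {}
for sent in sentences:
    counts = {}
    for word in sent.lower().split():
        counts[word] = counts.get(word, 0) + 1
    freq_matrix[sent] = counts

# 3. TF matrix: divide each count by the sentence length.
tf_matrix = {
    sent: {w: c / sum(counts.values()) for w, c in counts.items()}
    for sent, counts in freq_matrix.items()
}
print(tf_matrix["The cat is on the mat"]["the"])  # 2/6 ≈ 0.333
```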
4. Create a table of documents per word
• This is again a simple table that helps in calculating the IDF matrix.
• We calculate how many sentences contain each word; call this the documents-per-word matrix.
5. Calculate IDF and generate a matrix
• We’ll find the IDF for each word in a paragraph.
• Now, remember the definition of IDF,
• IDF(t) = log_e(Total number of documents / Number of documents with term t in
it)
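Steps 4–5 in the same style, treating each sentence as a “document” (a common simplification in summarization; the sentences below are illustrative):

```python
import math

sentences = [
    "the cat is on the mat",
    "my dog and cat are the best",
    "the locals are playing",
]

# 4. Documents-per-word table: how many sentences contain each word.
doc_per_word = {}
for sent in sentences:
    for word in set(sent.split()):
        doc_per_word[word] = doc_per_word.get(word, 0) + 1

# 5. IDF matrix: log_e(total sentences / sentences containing the word).
total = len(sentences)
idf_matrix = {
    sent: {w: math.log(total / doc_per_word[w]) for w in set(sent.split())}
    for sent in sentences
}
print(doc_per_word["cat"])                        # 2
print(round(idf_matrix[sentences[0]]["cat"], 3))  # log(3/2) ≈ 0.405
```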
6. Calculate TF-IDF and generate a matrix
• Now that we have both matrices, the next step is easy.
• The TF-IDF score is simply the two metrics multiplied together.
• In simple terms, we multiply the values from both matrices cell by cell to generate a new matrix.
7. Score the sentences
• How sentences are scored differs between algorithms. Here, we use the TF-IDF
scores of the words in a sentence to weight the sentence.
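Steps 6–7 can be sketched with toy matrices (the values below are made up for illustration; in the full pipeline they come from steps 3 and 5):

```python
# Toy TF and IDF matrices for two sentences, keyed by sentence id.
tf_matrix = {
    "s1": {"cat": 0.17, "mat": 0.17, "the": 0.33},
    "s2": {"cat": 0.14, "dog": 0.14, "the": 0.14},
}
idf_matrix = {
    "s1": {"cat": 0.41, "mat": 1.10, "the": 0.0},
    "s2": {"cat": 0.41, "dog": 1.10, "the": 0.0},
}

# 6. TF-IDF matrix: element-wise product of the two matrices.
tf_idf_matrix = {
    sent: {w: tf_matrix[sent][w] * idf_matrix[sent][w] for w in tf_matrix[sent]}
    for sent in tf_matrix
}

# 7. Sentence score: mean TF-IDF over the words in the sentence.
scores = {
    sent: sum(ws.values()) / len(ws)
    for sent, ws in tf_idf_matrix.items()
}
print(scores)
```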
8. Find the threshold
• As with any summarization algorithm, there are different ways to calculate
a threshold value. Here we use the average sentence score.
9. Generate the summary
• Algorithm: select a sentence for the summary if its score is greater
than the average score.
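Steps 8–9 then reduce to a few lines (the sentences and scores below are hypothetical):

```python
# Hypothetical sentence scores, as produced by step 7.
sentence_scores = {
    "The cat sat on the mat.": 0.09,
    "It was a sunny day.": 0.03,
    "Cats love mats.": 0.12,
}

# 8. Threshold: the average sentence score.
threshold = sum(sentence_scores.values()) / len(sentence_scores)

# 9. Summary: keep the sentences scoring above the average.
summary = " ".join(
    sent for sent, score in sentence_scores.items() if score > threshold
)
print(round(threshold, 2))  # 0.08
print(summary)
```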
References
• https://towardsdatascience.com/text-summarization-using-tf-idf-e64a0644ace3