Lecture 10 - Term Frequency

Term Frequency in NLP

Term Frequency
• In Natural Language Processing (NLP), Term Frequency (TF) is a key concept used to
evaluate the importance of a word within a document.
• It measures how frequently a term appears in a document relative to the total number
of terms in that document.
• TF is usually combined with a second metric, and the two are multiplied to form the
TF-IDF score:
• Term Frequency (TF): how many times a word appears in a document.
• Inverse Document Frequency (IDF): how rare the word is across a collection of
documents. Rare words have high scores, common words have low scores.
• In short, term frequency measures how common a word is within a document, while
inverse document frequency (IDF) measures how unique or rare it is across documents.
• TF-IDF has many uses, such as in information retrieval, text analysis, keyword
extraction, and as a way of obtaining numeric features from text for machine learning
algorithms.
TF-IDF origin
• TF-IDF was first designed for document search and information retrieval, where a
query is run and the system has to find the most relevant documents.
• Suppose the query is the text “The bug”. The system gives each document a score
proportional to the frequencies of the query words it contains, weighting rare words
like “bug” more heavily than common words like “the”.
How to compute TF-IDF
• Suppose we are looking for documents using the query Q and our database is
composed of the documents D1, D2, and D3.

• Q: The cat.
• D1: The cat is on the mat.
• D2: My dog and cat are the best.
• D3: The locals are playing.
How to compute TF-IDF
• There are several ways of calculating TF, the simplest being a raw count of the
number of times a word appears in a document.
• We’ll compute the TF scores as the ratio of the count of occurrences to the
length of the document.

• TF(word, document) = (number of occurrences of the word in the document) /
(number of words in the document)
How to compute TF-IDF
• Let’s compute the TF scores of the words “the” and “cat” (i.e. the query words)
with respect to the documents D1, D2, and D3.
• TF(“the”, D1) = 2/6 = 0.33
• TF(“the”, D2) = 1/7 = 0.14
• TF(“the”, D3) = 1/4 = 0.25
• TF(“cat”, D1) = 1/6 = 0.17
• TF(“cat”, D2) = 1/7 = 0.14
• TF(“cat”, D3) = 0/4 = 0
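
A minimal Python sketch of this TF computation over the toy corpus (the tokenization,
lowercasing and stripping the final period, is an assumption made so that “The” and
“the” count as the same term, matching the numbers above):

# Term frequency over the toy corpus from the slides.
documents = {
    "D1": "The cat is on the mat.",
    "D2": "My dog and cat are the best.",
    "D3": "The locals are playing.",
}

def tokenize(text):
    # Lowercase and strip the period so "The" and "the" are one term.
    return text.lower().replace(".", "").split()

def tf(word, document):
    tokens = tokenize(document)
    return tokens.count(word) / len(tokens)

for name, doc in documents.items():
    for word in ("the", "cat"):
        print(f"TF({word!r}, {name}) = {tf(word, doc):.2f}")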
How to compute TF-IDF
• IDF can be calculated by taking the total number of documents, dividing it by the
number of documents that contain the word, and taking the logarithm of the ratio. If
the word is very common and appears in (nearly) all documents, this number approaches
0; the rarer the word, the larger its IDF.

• IDF(word) = log(number of documents / number of documents that contain the word)

• Let’s compute the IDF scores of the words “the” and “cat” (using base-10
logarithms, as the values below show).
• IDF(“the”) = log(3/3) = log(1) = 0
• IDF(“cat”) = log(3/2) ≈ 0.18
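
The same values can be reproduced in Python; note that log(3/2) = 0.18 implies a
base-10 logarithm:

import math

documents = {
    "D1": "the cat is on the mat",
    "D2": "my dog and cat are the best",
    "D3": "the locals are playing",
}

def idf(word, docs):
    # Count the documents in which the word occurs at least once
    # (assumes the word appears in at least one document).
    containing = sum(1 for doc in docs.values() if word in doc.split())
    return math.log10(len(docs) / containing)

print(f"IDF('the') = {idf('the', documents):.2f}")  # log10(3/3) = 0.00
print(f"IDF('cat') = {idf('cat', documents):.2f}")  # log10(3/2) = 0.18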
How to compute TF-IDF
• Multiplying TF and IDF gives the TF-IDF score of a word in a document. The
higher the score, the more relevant that word is in that particular document.
• TF-IDF(word, document) = TF(word, document) * IDF(word)
• Let’s compute the TF-IDF scores of the words “the” and “cat”.
• TF-IDF(“the”, D1) = 0.33 * 0 = 0
• TF-IDF(“the”, D2) = 0.14 * 0 = 0
• TF-IDF(“the”, D3) = 0.25 * 0 = 0
• TF-IDF(“cat”, D1) = 0.17 * 0.18 = 0.0306
• TF-IDF(“cat”, D2) = 0.14 * 0.18 = 0.0252
• TF-IDF(“cat”, D3) = 0 * 0.18 = 0
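
Combining the two functions reproduces the table above (a sketch; the exact products
differ slightly in the later decimal places because the slides multiply rounded
intermediate values):

import math

documents = {
    "D1": "The cat is on the mat.",
    "D2": "My dog and cat are the best.",
    "D3": "The locals are playing.",
}

def tokenize(text):
    return text.lower().replace(".", "").split()

def tf(word, document):
    tokens = tokenize(document)
    return tokens.count(word) / len(tokens)

def idf(word, docs):
    containing = sum(1 for doc in docs.values() if word in tokenize(doc))
    return math.log10(len(docs) / containing)

for name, doc in documents.items():
    for word in ("the", "cat"):
        score = tf(word, doc) * idf(word, documents)
        print(f"TF-IDF({word!r}, {name}) = {score:.4f}")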
How to compute TF-IDF
• The next step is to use a ranking function to order the documents according to the
TF-IDF scores of their words. We can use the average TF-IDF word scores over
each document to get the ranking of D1, D2, and D3 with respect to the query Q.
• Average TF-IDF of D1 = (0 + 0.0306) / 2 = 0.0153
• Average TF-IDF of D2 = (0 + 0.0252) / 2 = 0.0126
• Average TF-IDF of D3 = (0 + 0) / 2 = 0
• Notice that the word “the” does not contribute to the TF-IDF score of any
document. This is because “the” appears in all of the documents, so it carries no
discriminative information and is effectively treated as irrelevant.
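
Ranking is a small extension of the previous sketch: average the TF-IDF scores of the
query words per document and sort (helper functions repeated so the snippet runs on
its own):

import math

documents = {
    "D1": "The cat is on the mat.",
    "D2": "My dog and cat are the best.",
    "D3": "The locals are playing.",
}

def tokenize(text):
    return text.lower().replace(".", "").split()

def tf(word, document):
    tokens = tokenize(document)
    return tokens.count(word) / len(tokens)

def idf(word, docs):
    containing = sum(1 for doc in docs.values() if word in tokenize(doc))
    return math.log10(len(docs) / containing)

def rank(query, docs):
    words = tokenize(query)
    scores = {
        name: sum(tf(w, doc) * idf(w, docs) for w in words) / len(words)
        for name, doc in docs.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

for name, score in rank("The cat", documents):
    print(f"{name}: {score:.4f}")  # D1 first, then D2, then D3 with score 0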
How to compute TF-IDF
• There are better-performing ranking functions in the literature, such as Okapi
BM25.

• In conclusion, when performing the query “The cat” over the collection of
documents D1, D2, and D3, the ranked results would be:

• D1: The cat is on the mat.
• D2: My dog and cat are the best.
• D3: The locals are playing.
The use of TF-IDF in Machine Learning
• TF-IDF is often used to transform text into a vector of numbers, a process known
as text vectorization, where the numbers in the vector represent the content of
the text.

• TF-IDF gives us a way to associate each word in a document with a number that
represents how relevant that word is in that document. These numbers can then be
used as features for machine learning models.
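
In practice, a library usually performs this vectorization; below is a sketch with
scikit-learn’s TfidfVectorizer (an extra dependency not mentioned in the slides; note
that its defaults differ from the formulas above, using a smoothed natural-log IDF and
L2-normalizing each row):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "The cat is on the mat.",
    "My dog and cat are the best.",
    "The locals are playing.",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)          # sparse matrix, one row per document
print(vectorizer.get_feature_names_out())   # the learned vocabulary
print(X.toarray().round(2))                 # dense TF-IDF feature vectors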
Example 2
• Consider a document containing 100 words in which the word apple appears 5
times. The term frequency (TF) for apple is then 5 / 100 = 0.05.

• Now, assume we have 10 million documents and the word apple appears in one
thousand of these. Then, the inverse document frequency (i.e., IDF) is calculated
as log(10,000,000 / 1,000) = 4.

• Thus, the TF-IDF weight is the product of these quantities: 0.05 * 4 = 0.20.
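
A quick check of this arithmetic in Python (base-10 logarithm, as in the earlier
slides):

import math

tf = 5 / 100                            # "apple" appears 5 times in 100 words
idf = math.log10(10_000_000 / 1_000)    # log10(10**4) = 4
print(tf * idf)                         # 0.2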
Implementation of TF-IDF
• The implementation of TF-IDF consists of the following nine steps.
• Prerequisites: Python 3, the NLTK library, and a Python IDE.
• Note: in this text-summarization setting, each sentence of the paragraph is
treated as a “document”.
1. Tokenize the sentences.
2. Create the frequency matrix of the words in each sentence,
where each sentence is the key and the value is a dictionary of word frequencies.
3. Calculate Term Frequency and generate a matrix.
We’ll find the Term Frequency for each word in the paragraph (see the sketch below).
Remember the definition of TF:
TF(t) = (Number of times term t appears in a document) / (Total number of terms in
the document)
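
A minimal sketch of steps 1–3 with NLTK (the paragraph text and variable names are
illustrative; nltk.download("punkt") must be run once to fetch the tokenizer models):

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

# nltk.download("punkt")  # uncomment on first run

text = ("The cat is on the mat. My dog and cat are the best. "
        "The locals are playing.")

# Step 1: tokenize the paragraph into sentences.
sentences = sent_tokenize(text)

# Step 2: frequency matrix -- sentence -> {word: count}.
freq_matrix = {}
for sent in sentences:
    counts = {}
    for word in word_tokenize(sent.lower()):
        counts[word] = counts.get(word, 0) + 1
    freq_matrix[sent] = counts

# Step 3: TF matrix -- divide each count by the sentence's total term count.
tf_matrix = {
    sent: {word: count / sum(counts.values()) for word, count in counts.items()}
    for sent, counts in freq_matrix.items()
}
print(tf_matrix[sentences[0]])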
Implementation of TF-IDF
4. Create a table of documents per word.
• This is again a simple table, which helps in calculating the IDF matrix.
• We calculate how many sentences contain each word; let’s call this the
documents-per-word matrix.
5. Calculate IDF and generate a matrix.
• We’ll find the IDF for each word in the paragraph.
• Remember the definition of IDF:
• IDF(t) = log(Total number of documents / Number of documents with term t in it)
(any log base works as long as it is used consistently; the earlier slides use base 10)
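
A sketch of steps 4–5, starting from a small hypothetical frequency matrix of the kind
built in step 2 (base-10 logarithm, consistent with the earlier slides):

import math

# Hypothetical frequency matrix: sentence -> {word: count}.
freq_matrix = {
    "S1": {"the": 2, "cat": 1, "mat": 1},
    "S2": {"dog": 1, "cat": 1, "the": 1},
    "S3": {"the": 1, "locals": 1},
}

# Step 4: documents-per-word table -- how many sentences contain each word.
doc_per_word = {}
for counts in freq_matrix.values():
    for word in counts:
        doc_per_word[word] = doc_per_word.get(word, 0) + 1

# Step 5: IDF matrix -- sentence -> {word: idf}.
total_docs = len(freq_matrix)
idf_matrix = {
    sent: {word: math.log10(total_docs / doc_per_word[word]) for word in counts}
    for sent, counts in freq_matrix.items()
}
print(round(idf_matrix["S1"]["cat"], 2))  # log10(3/2) = 0.18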
Implementation of TF-IDF
6. Calculate TF-IDF and generate a matrix.
• Now that we have both matrices, the next step is easy.
• The TF-IDF score is simply the product of the two metrics.
• In simple terms, we multiply the values from both matrices element-wise to
generate a new matrix, as sketched below.
7. Score the sentences.
• Sentence scoring differs between algorithms. Here, we use the average TF-IDF
score of the words in a sentence as the sentence’s weight.
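
A sketch of steps 6–7 over hypothetical TF and IDF matrices of the kind produced by
the previous steps:

# Hypothetical matrices: sentence -> {word: value}.
tf_matrix = {
    "S1": {"the": 0.33, "cat": 0.17},
    "S2": {"dog": 0.14, "cat": 0.14},
}
idf_matrix = {
    "S1": {"the": 0.00, "cat": 0.18},
    "S2": {"dog": 0.48, "cat": 0.18},
}

# Step 6: TF-IDF matrix -- element-wise product of the two matrices.
tfidf_matrix = {
    sent: {word: tf_matrix[sent][word] * idf_matrix[sent][word]
           for word in tf_matrix[sent]}
    for sent in tf_matrix
}

# Step 7: score each sentence by the average TF-IDF of its words.
sentence_scores = {
    sent: sum(scores.values()) / len(scores)
    for sent, scores in tfidf_matrix.items()
}
print(sentence_scores)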
Implementation of TF-IDF
8. Find the threshold.
• As with any summarization algorithm, there are different ways to calculate the
threshold value. Here, we use the average sentence score.
9. Generate the summary.
• Algorithm: select a sentence for the summary if its score is greater than the
average score (see the sketch below).
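
A sketch of steps 8–9 with hypothetical sentence scores from step 7:

# Hypothetical sentence scores from step 7.
sentence_scores = {
    "The cat is on the mat.": 0.015,
    "My dog and cat are the best.": 0.046,
    "The locals are playing.": 0.000,
}

# Step 8: threshold = average sentence score.
threshold = sum(sentence_scores.values()) / len(sentence_scores)

# Step 9: keep sentences scoring above the threshold, in original order.
summary = " ".join(
    sent for sent, score in sentence_scores.items() if score > threshold
)
print(summary)  # -> "My dog and cat are the best."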