Lecture 10 - Term Frequency
Term Frequency
• In Natural Language Processing (NLP), Term Frequency (TF) is a key concept used to
evaluate the importance of a word within a document.
• It's a measure of how frequently a term appears in a document relative to the total number
of terms in that document.
• To score a word's importance, TF is combined with a second metric into TF-IDF, computed by multiplying the two:
• Term Frequency (TF): how many times a word appears in a document.
• Inverse Document Frequency (IDF): the inverse document frequency of the word across a
collection of documents. Rare words have high scores, common words have low scores.
• Term frequency measures how common a word is; inverse document frequency (IDF) measures how unique or rare a word is.
• TF-IDF has many uses, such as in information retrieval, text analysis, keyword
extraction, and as a way of obtaining numeric features from text for machine learning
algorithms.
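As a minimal sketch, the two metrics can be written directly from their definitions (plain Python, whitespace tokenization, natural log; the tiny corpus below is made up for illustration):

```python
import math

# TF: how often the term occurs in one document, normalized by length.
def term_frequency(term, doc_tokens):
    return doc_tokens.count(term) / len(doc_tokens)

# IDF: log of (total documents / documents containing the term).
# Rare words score high, common words score low.
def inverse_document_frequency(term, documents):
    containing = sum(1 for doc in documents if term in doc)
    return math.log(len(documents) / containing)

docs = [
    "the sky is blue".split(),
    "the sun is bright".split(),
    "dogs chase cats".split(),
]
print(term_frequency("the", docs[0]))           # 1/4 = 0.25
print(inverse_document_frequency("the", docs))  # in 2 of 3 docs: log(3/2)
print(inverse_document_frequency("dogs", docs)) # in 1 of 3 docs: log(3)
```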
TF-IDF origin
• TF-IDF was first designed for document search and information retrieval, where a
query is run and the system has to find the most relevant documents.
• Suppose the query is the text “The bug”. The system would give each document a
higher score proportionally to the frequencies of the query words found in the
document, weighting more rare words like “bug” with respect to common words
like “the”.
How to compute TF-IDF
• Suppose we are looking for documents using the query Q and our database is
composed of the documents D1, D2, and D3.
• Q: The cat.
• D1: The cat is on the mat.
• D2: My dog and cat are the best.
• D3: The locals are playing.
• There are several ways of calculating TF, with the simplest being a raw count of
instances a word appears in a document.
• We’ll compute the TF scores using the ratio of the count of instances over the
length of the document.
• As a conclusion, when performing the query “The cat” over the collection of documents D1, D2, and D3, the ranked results would be D1, D2, D3: IDF makes the common word “the” worthless, only “cat” contributes, and it is relatively more frequent in the shorter D1 than in D2, while D3 scores zero.
• TF-IDF gives us a way to associate each word in a document with a number that represents how relevant that word is in that document. Such numbers can then be used as features in machine learning models.
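The query scores can be computed in a few lines; a sketch assuming lowercase whitespace tokenization, length-normalized TF, and natural-log IDF:

```python
import math

query = "the cat".split()
docs = {
    "D1": "the cat is on the mat".split(),
    "D2": "my dog and cat are the best".split(),
    "D3": "the locals are playing".split(),
}

def tf(term, tokens):
    # Count of the term over the document length.
    return tokens.count(term) / len(tokens)

def idf(term):
    # log of (total documents / documents containing the term).
    containing = sum(1 for d in docs.values() if term in d)
    return math.log(len(docs) / containing)

# Score each document as the sum of TF-IDF over the query terms.
scores = {
    name: sum(tf(t, tokens) * idf(t) for t in query)
    for name, tokens in docs.items()
}
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(name, round(score, 3))  # D1 0.068, D2 0.058, D3 0.0
```

“the” appears in all three documents, so its IDF is log(3/3) = 0 and only “cat” contributes to the scores.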
Example 2
• Consider a document containing 100 words in which the word apple appears 5
times. The term frequency (i.e., TF) for apple is then (5 / 100) = 0.05.
• Now, assume we have 10 million documents and the word apple appears in one
thousand of these. Then, the inverse document frequency (i.e., IDF) is calculated
as log(10,000,000 / 1,000) = log(10,000) = 4, using the base-10 logarithm.
• Thus, the TF-IDF weight is the product of these quantities: 0.05 * 4 = 0.20.
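The arithmetic can be checked directly; note this example uses a base-10 logarithm (the implementation steps later use the natural log, which only rescales the scores):

```python
import math

tf = 5 / 100                          # "apple" appears 5 times in 100 words
idf = math.log10(10_000_000 / 1_000)  # base-10 log of 10,000 = 4.0
tf_idf = tf * idf
print(tf, idf, tf_idf)  # 0.05 4.0 0.2
```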
Implementation of TF-IDF
• The implementation of TF-IDF consists of the following nine steps.
• Prerequisites: Python 3, the NLTK library, and a Python IDE.
1. Tokenize the sentences
2. Create a frequency matrix of the words in each sentence,
where each sentence is the key and the value is a dictionary of word frequencies.
3. Calculate Term Frequency and generate a matrix
We’ll find the Term Frequency for each word in a paragraph.
Now, remember the definition of TF,
TF(t) = (Number of times term t appears in a document) / (Total number of terms in
the document)
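Steps 1–3 can be sketched as follows (the prerequisites mention NLTK's tokenizers; plain string splitting is used here so the snippet stays self-contained):

```python
# Steps 1-3 on a small example paragraph.
text = ("The cat is on the mat. "
        "My dog and cat are the best. "
        "The locals are playing.")

# 1. Tokenize the sentences.
sentences = [s.strip() for s in text.split(".") if s.strip()]

# 2. Frequency matrix: sentence -> {word: raw count}.
freq_matrix = {}
for sent in sentences:
    counts = {}
    for word in sent.lower().split():
        counts[word] = counts.get(word, 0) + 1
    freq_matrix[sent] = counts

# 3. TF matrix: divide each count by the sentence length.
tf_matrix = {
    sent: {w: c / sum(counts.values()) for w, c in counts.items()}
    for sent, counts in freq_matrix.items()
}
print(tf_matrix["The cat is on the mat"]["the"])  # 2/6 ≈ 0.333
```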
4. Create a table of documents per word
• This is again a simple table that helps in calculating the IDF matrix.
• We calculate how many sentences contain each word; call this the documents-per-word matrix.
5. Calculate IDF and generate a matrix
• We’ll find the IDF for each word in a paragraph.
• Now, remember the definition of IDF,
• IDF(t) = log_e(Total number of documents / Number of documents with term t in
it)
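Steps 4–5 in the same style, treating each sentence as a “document” (a common simplification in summarization; the sentences below are illustrative):

```python
import math

sentences = [
    "the cat is on the mat",
    "my dog and cat are the best",
    "the locals are playing",
]

# 4. Documents-per-word table: how many sentences contain each word.
doc_per_word = {}
for sent in sentences:
    for word in set(sent.split()):
        doc_per_word[word] = doc_per_word.get(word, 0) + 1

# 5. IDF matrix: log_e(total sentences / sentences containing the word).
total = len(sentences)
idf_matrix = {
    sent: {w: math.log(total / doc_per_word[w]) for w in set(sent.split())}
    for sent in sentences
}
print(doc_per_word["cat"])                        # 2
print(round(idf_matrix[sentences[0]]["cat"], 3))  # log(3/2) ≈ 0.405
```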
6. Calculate TF-IDF and generate a matrix
• Now that we have both matrices, the next step is easy.
• The TF-IDF score is simply the two metrics multiplied together.
• In simple terms, we multiply the values from both matrices cell by cell to generate a new matrix.
7. Score the sentences
• How sentences are scored differs between algorithms. Here, we use the TF-IDF
scores of the words in a sentence to weight the sentence.
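Steps 6–7 can be sketched with toy matrices (the values below are made up for illustration; in the full pipeline they come from steps 3 and 5):

```python
# Toy TF and IDF matrices for two sentences, keyed by sentence id.
tf_matrix = {
    "s1": {"cat": 0.17, "mat": 0.17, "the": 0.33},
    "s2": {"cat": 0.14, "dog": 0.14, "the": 0.14},
}
idf_matrix = {
    "s1": {"cat": 0.41, "mat": 1.10, "the": 0.0},
    "s2": {"cat": 0.41, "dog": 1.10, "the": 0.0},
}

# 6. TF-IDF matrix: element-wise product of the two matrices.
tf_idf_matrix = {
    sent: {w: tf_matrix[sent][w] * idf_matrix[sent][w] for w in tf_matrix[sent]}
    for sent in tf_matrix
}

# 7. Sentence score: mean TF-IDF over the words in the sentence.
scores = {
    sent: sum(ws.values()) / len(ws)
    for sent, ws in tf_idf_matrix.items()
}
print(scores)
```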
8. Find the threshold
• As with any summarization algorithm, there are different ways to calculate
a threshold value. Here we use the average sentence score.
9. Generate the summary
• Algorithm: select a sentence for the summary if its score is greater
than the average score.
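Steps 8–9 then reduce to a few lines (the sentences and scores below are hypothetical):

```python
# Hypothetical sentence scores, as produced by step 7.
sentence_scores = {
    "The cat sat on the mat.": 0.09,
    "It was a sunny day.": 0.03,
    "Cats love mats.": 0.12,
}

# 8. Threshold: the average sentence score.
threshold = sum(sentence_scores.values()) / len(sentence_scores)

# 9. Summary: keep the sentences scoring above the average.
summary = " ".join(
    sent for sent, score in sentence_scores.items() if score > threshold
)
print(round(threshold, 2))  # 0.08
print(summary)
```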
References
• https://towardsdatascience.com/text-summarization-using-tf-idf-e64a0644ace3