TF IDF Vectorizer
Term Frequency (TF): the frequency of a word in each document in the corpus. It is
the ratio of the number of times the word appears in a document to the total number
of words in that document, so it increases as the word occurs more often within the
document. Each document has its own TF value for each word.
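The ratio described above can be written compactly. The symbols are assumed here for illustration: $n_{i,j}$ is the number of times word $i$ appears in document $j$, and the denominator sums the counts of all words in document $j$, which is the document's total word count.

```latex
\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}
```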
Inverse Document Frequency (IDF): used to weight words by how rare they are across
all documents in the corpus. Words that occur in few documents of the corpus have a
high IDF score. It is given by the equation below.
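A common unsmoothed form, consistent with the worked example later in this section (where words that appear in every document score zero), is sketched here. The symbols are assumptions: $N$ is the number of documents in the corpus and $\mathrm{df}_i$ is the number of documents that contain word $i$. The text does not specify the logarithm base.

```latex
\mathrm{idf}_i = \log\left(\frac{N}{\mathrm{df}_i}\right)
```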
Combining these two, we get the TF-IDF score (w) for a word in a document in
the corpus. It is the product of TF and IDF:
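Written out with an unsmoothed IDF, the product takes the form below. The symbols are assumed for illustration: $w_{i,j}$ is the score of word $i$ in document $j$, $\mathrm{tf}_{i,j}$ is the term frequency of word $i$ in document $j$, $N$ is the number of documents in the corpus, and $\mathrm{df}_i$ is the number of documents containing word $i$.

```latex
w_{i,j} = \mathrm{tf}_{i,j} \times \log\left(\frac{N}{\mathrm{df}_i}\right)
```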
Real-life Example:
Suppose we have a search engine and somebody searches for “Coke”. The search engine
will return all documents containing the word “Coke”. However, some documents may
contain the word “Coke” more frequently than others. In this case, TF-IDF can be used
to figure out, based on which distinctive words score highly on the page, whether a
page titled “COKE” is about: a) Coca-Cola. b) Cocaine. c) A solid, carbon-rich residue
derived from the distillation of crude oil. d) A county in Texas.
Mathematical Simulation:
There are two documents in a corpus: Text A and Text B. We will use them to create a
TF-IDF matrix.
Text A: "The quick brown fox jumps over the lazy dog"
Text B: "The dog is lazy and the fox is quick"
The table below shows the values of TF for A and B, IDF, and TFIDF for A and B.
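The values for this example can be recomputed with a short sketch in plain Python. It assumes lowercase whitespace tokenization, the unsmoothed IDF form log(N / df), and the natural logarithm; the function and variable names are illustrative and do not come from the text. A different log base rescales the non-zero scores but leaves the zeros unchanged.

```python
import math
from collections import Counter

text_a = "The quick brown fox jumps over the lazy dog"
text_b = "The dog is lazy and the fox is quick"


def tokenize(text):
    # Lowercase and split on whitespace; sufficient for these punctuation-free sentences.
    return text.lower().split()


def term_frequency(tokens):
    # TF = occurrences of the word / total number of words in the document.
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}


def inverse_document_frequency(docs):
    # IDF = log(N / df), unsmoothed. Natural log is an assumption.
    n_docs = len(docs)
    vocabulary = set().union(*docs)
    return {
        word: math.log(n_docs / sum(1 for doc in docs if word in doc))
        for word in vocabulary
    }


docs = [tokenize(text_a), tokenize(text_b)]
tf_a, tf_b = term_frequency(docs[0]), term_frequency(docs[1])
idf = inverse_document_frequency(docs)

# TF-IDF is the product of TF and IDF; a word absent from a document has TF = 0.
tfidf_a = {word: tf_a.get(word, 0.0) * idf[word] for word in idf}
tfidf_b = {word: tf_b.get(word, 0.0) * idf[word] for word in idf}

print(f"{'word':<8}{'TF A':>8}{'TF B':>8}{'IDF':>8}{'TFIDF A':>10}{'TFIDF B':>10}")
for word in sorted(idf):
    print(
        f"{word:<8}{tf_a.get(word, 0.0):>8.3f}{tf_b.get(word, 0.0):>8.3f}"
        f"{idf[word]:>8.3f}{tfidf_a[word]:>10.3f}{tfidf_b[word]:>10.3f}"
    )
```

Library implementations may differ from this sketch: scikit-learn's TfidfVectorizer, for example, smooths the IDF and L2-normalizes each row by default, so its scores for words shared by all documents are not zero.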
From the above table we can see that the TF-IDF of the words common to both documents
(“the”, “quick”, “fox”, “lazy”, “dog”) is zero, which shows they are not significant.
On the other hand, the TF-IDF of “brown”, “jumps”, “over”, “is”, and “and” is
non-zero. These words have more significance.