TF-IDF

TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure used to evaluate the importance of a word in a document relative to a collection of documents. It combines term frequency (TF), which counts how often a word appears in a document, and inverse document frequency (IDF), which assesses how common or rare a word is across documents. TF-IDF is widely used in text vectorization for machine learning, allowing words to be represented as numerical features based on their relevance.


 Term Frequency (TF): how many times a word appears in a document.

 Inverse Document Frequency (IDF): how rare the word is across a collection of documents. Rare words get high scores; common words get low scores.

Understanding TF-IDF (Term Frequency-Inverse Document Frequency)

TF-IDF stands for Term Frequency-Inverse Document Frequency. It can be defined as a calculation of how relevant a word in a series or corpus is to a document. The relevance increases proportionally with the number of times the word appears in the document, but is offset by the frequency of the word in the corpus (data set).

Terminology:

 Term Frequency: In a document d, the frequency represents the number of instances of a given term t. A term becomes more relevant the more often it appears in the text, which is intuitive. Since the ordering of terms is not significant, we can use a vector to describe the text in the bag-of-words model: for each distinct term in the document, there is an entry whose value is the term frequency.

The weight of a term that occurs in a document is simply proportional to the term frequency.

tf(t,d) = count of t in d / number of words in d
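As an illustration, this formula translates into a few lines of Python (the tf helper and the whitespace tokenization are simplifications of my own, not a standard API):

```python
def tf(term, document):
    # Relative frequency: occurrences of the term / total words in the document.
    words = document.lower().split()
    return words.count(term.lower()) / len(words)

print(round(tf("cat", "the cat is on the mat"), 2))  # 0.17
```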

 Document Frequency: This measures the importance of a term across the whole corpus, and is very similar to TF. The only difference is that TF is the frequency counter for a term t in a single document d, while DF is the count of documents in the collection N that contain the term t. In other words, DF is the number of documents in which the word is present.

df(t) = number of documents in which t occurs
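A matching sketch for document frequency over a list of raw strings (again a hypothetical helper; real text would need proper tokenization and punctuation handling):

```python
def df(term, documents):
    # Number of documents in which the term occurs at least once.
    return sum(1 for doc in documents if term.lower() in doc.lower().split())

docs = ["the cat is on the mat", "my dog and cat are the best", "the locals are playing"]
print(df("cat", docs))  # 2
```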

 Inverse Document Frequency: Mainly, this measures how informative a word is. The key aim of a search is to locate the documents that best match the query. Since TF considers all terms equally significant, term frequencies alone cannot be used to measure the weight of a term in a document. First, find the document frequency of a term t by counting the number of documents containing the term:

df(t) = N(t)

where

df(t) = Document frequency of a term t

N(t) = Number of documents containing the term t

Term frequency counts the instances of a term in a single document only, whereas document frequency counts the separate documents in which the term appears, so it depends on the entire corpus. Now let's look at the definition of inverse document frequency. The IDF of a term is the number of documents in the corpus divided by the document frequency of the term.

idf(t) = N / df(t) = N / N(t)

A more common word is supposed to be considered less significant, but the raw ratio N/df(t) grows too harshly. We therefore take the logarithm of the inverse document frequency (base 10 here, to stay consistent with the worked example below). So the idf of the term t becomes:

idf(t) = log(N/ df(t))
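Continuing the sketch, the idf formula follows directly (base-10 logarithm, matching the worked example below; this assumes the term occurs in at least one document, otherwise the division is undefined):

```python
import math

def idf(term, documents):
    # log(N / df(t)): total documents over documents containing the term.
    n_containing = sum(1 for doc in documents if term.lower() in doc.lower().split())
    return math.log10(len(documents) / n_containing)

docs = ["the cat is on the mat", "my dog and cat are the best", "the locals are playing"]
print(round(idf("cat", docs), 2))  # 0.18
```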

 Computation: tf-idf is one of the best metrics to determine how significant a term is to a document in a series or corpus. It is a weighting scheme that assigns each word in a document a weight based on its term frequency (tf) and inverse document frequency (idf). The words with higher weights are deemed more significant. A minimal code sketch of the full computation follows below.
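Here is the sketch promised above: a minimal, self-contained tf-idf computation assuming lowercased, whitespace-tokenized text (an illustration of the formulas, not a production implementation):

```python
import math

def tf_idf(term, document, documents):
    term = term.lower()
    words = document.lower().split()
    # tf: relative frequency of the term in this document.
    tf = words.count(term) / len(words)
    # idf: log of (total documents / documents containing the term).
    n_containing = sum(1 for doc in documents if term in doc.lower().split())
    idf = math.log10(len(documents) / n_containing)
    return tf * idf

docs = ["the cat is on the mat", "my dog and cat are the best", "the locals are playing"]
# ~0.0293; the hand calculation below gives 0.0306 because it rounds tf and idf first.
print(round(tf_idf("cat", docs[0], docs), 4))
```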

How to compute TF-IDF

Suppose we are looking for documents using the query Q and our
database is composed of the documents D1, D2, and D3.

 Q: The cat.

 D1: The cat is on the mat.

 D2: My dog and cat are the best.

 D3: The locals are playing.


There are several ways of calculating TF, the simplest being a raw count of the number of times a word appears in a document. We'll compute the TF scores using the ratio of the count of instances to the length of the document.

TF(word, document) = “number of occurrences of the word in the document” / “number of words in the document”

Let’s compute the TF scores of the words “the” and “cat” (i.e. the query
words) with respect to the documents D1, D2, and D3.

TF(“the”, D1) = 2/6 = 0.33

TF(“the”, D2) = 1/7 = 0.14

TF(“the”, D3) = 1/4 = 0.25

TF(“cat”, D1) = 1/6 = 0.17

TF(“cat”, D2) = 1/7 = 0.14

TF(“cat”, D3) = 0/4 = 0
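These TF scores can be reproduced in a few lines (punctuation dropped and text lowercased for simplicity):

```python
docs = {
    "D1": "the cat is on the mat",
    "D2": "my dog and cat are the best",
    "D3": "the locals are playing",
}
for name, text in docs.items():
    words = text.split()
    for query_word in ("the", "cat"):
        # Raw count of the query word divided by the document length.
        print(name, query_word, round(words.count(query_word) / len(words), 2))
```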

IDF can be calculated by taking the total number of documents, dividing it by the number of documents that contain the word, and taking the logarithm of the result. If the word is very common and appears in every document, this number approaches 0; the rarer the word, the larger the IDF.

IDF(word) = log(number of documents / number of documents that contain the word)

Let’s compute the IDF scores of the words “the” and “cat”.

IDF(“the”) = log(3/3) = log(1) = 0

IDF(“cat”) = log(3/2) = 0.18
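A quick check with base-10 logarithms confirms these values:

```python
import math

print(math.log10(3 / 3))            # 0.0  -> IDF("the"), appears in all 3 documents
print(round(math.log10(3 / 2), 2))  # 0.18 -> IDF("cat"), appears in 2 of 3 documents
```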

Multiplying TF and IDF gives the TF-IDF score of a word in a document. The
higher the score, the more relevant that word is in that particular
document.

TF-IDF(word, document) = TF(word, document) * IDF(word)

Let’s compute the TF-IDF scores of the words “the” and “cat”.

TF-IDF(“the”, D1) = 0.33 * 0 = 0

TF-IDF(“the”, D2) = 0.14 * 0 = 0

TF-IDF(“the”, D3) = 0.25 * 0 = 0

TF-IDF(“cat”, D1) = 0.17 * 0.18= 0.0306

TF-IDF(“cat”, D2) = 0.14 * 0.18 = 0.0252


TF-IDF(“cat”, D3) = 0 * 0 = 0

The next step is to use a ranking function to order the documents according to the TF-IDF scores of their words. We can use the average TF-IDF score of the query words over each document to get the ranking of D1, D2, and D3 with respect to the query Q.

Average TF-IDF of D1 = (0 + 0.0306) / 2 = 0.0153

Average TF-IDF of D2 = (0 + 0.0252) / 2 = 0.0126

Average TF-IDF of D3 = (0 + 0) / 2 = 0

Notice that the word “the” does not contribute to the TF-IDF score of any document. This is because “the” appears in all of the documents, so it is considered a non-relevant word.
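The whole pipeline, from TF-IDF scores to the average-score ranking, can be sketched end to end under the same simplifying assumptions (exact products differ slightly from the hand calculation, which rounds tf and idf before multiplying):

```python
import math

docs = {
    "D1": "the cat is on the mat",
    "D2": "my dog and cat are the best",
    "D3": "the locals are playing",
}
query = ["the", "cat"]
corpus = list(docs.values())

def tf_idf(term, words, corpus):
    tf = words.count(term) / len(words)
    n_containing = sum(1 for doc in corpus if term in doc.split())
    idf = math.log10(len(corpus) / n_containing)
    return tf * idf

# Average the TF-IDF scores of the query words in each document.
scores = {
    name: sum(tf_idf(q, text.split(), corpus) for q in query) / len(query)
    for name, text in docs.items()
}

# Rank documents by descending average score: D1, then D2, then D3.
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(name, round(score, 4))
```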

There are better-performing ranking functions in the literature, such as Okapi BM25.

In conclusion, when performing the query “The cat” over the collection of documents D1, D2, and D3, the ranked results would be:

1. D1: The cat is on the mat.

2. D2: My dog and cat are the best.

3. D3: The locals are playing.

The use of TF-IDF in Machine Learning

TF-IDF is often used to transform text into a vector of numbers, otherwise known as text vectorization, where the numbers of the vector are meant to represent the content of the text.

TF-IDF gives us a way to associate each word in a document with a number that represents how relevant that word is in that document. Such numbers can then be used as features of machine learning models.
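In practice you would rarely hand-roll this. A common choice is scikit-learn's TfidfVectorizer; note that its scores will not match the hand calculation above, since by default it uses a smoothed IDF and L2-normalizes each document vector:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "The cat is on the mat.",
    "My dog and cat are the best.",
    "The locals are playing.",
]

vectorizer = TfidfVectorizer()          # defaults: smoothed idf, l2 normalization
X = vectorizer.fit_transform(docs)      # sparse matrix, one row per document

print(vectorizer.get_feature_names_out())  # vocabulary learned from the corpus
print(X.toarray().round(2))                # the TF-IDF feature matrix
```

Each row of X is a document's feature vector and can be fed directly to a classifier or clustering model.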
