TF Idf

The document outlines the process of calculating TF-IDF for a corpus of four documents, detailing steps for computing Term Frequency (TF) and Inverse Document Frequency (IDF). It identifies words with the highest TF-IDF values, such as 'Transforming' and 'World', and constructs a document vector table based on these values. Additionally, it presents a practice exercise involving a smaller corpus of three text documents.

TF-IDF (Term Frequency-Inverse Document Frequency)

Consider the following corpus of four documents:


Document 1: "Data science is transforming the world."
Document 2: "Machine learning is a subset of data science."
Document 3: "Deep learning and AI are advancing rapidly."
Document 4: "AI and machine learning are reshaping industries."
a. Step-by-step, calculate the TF-IDF (Term Frequency-Inverse Document Frequency) for the
given corpus and identify the word(s) with the highest value.
b. Construct a document vector table based on the TF-IDF values for the given corpus.

Answer:
Step 1: Create the Term Frequency (TF) Table
The formula for TF is:
TF(t, d) = (Number of times term t appears in document d) / (Total number of words in document d)

Let's list out all the unique words in the corpus:

Word
Data
Science
Is
Transforming
The
World
Machine
Learning
A
Subset
Of
Deep
And
AI
Are
Advancing
Rapidly
Reshaping
Industries
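
The word list above can be reproduced with a few lines of Python. This is a minimal sketch, assuming words are lowercased and punctuation is dropped; the helper names `tokenize` and `build_vocabulary` are hypothetical and introduced only for illustration.

```python
import re

# The four documents from the corpus above.
corpus = [
    "Data science is transforming the world.",
    "Machine learning is a subset of data science.",
    "Deep learning and AI are advancing rapidly.",
    "AI and machine learning are reshaping industries.",
]


def tokenize(text):
    """Lowercase the text and return its alphabetic words (assumed tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(docs):
    """Return the unique words of the corpus in order of first appearance."""
    vocab = []
    for doc in docs:
        for word in tokenize(doc):
            if word not in vocab:
                vocab.append(word)
    return vocab


vocabulary = build_vocabulary(corpus)
print(len(vocabulary), vocabulary)
```

Keeping the words in order of first appearance makes the output line up with the list above; a sorted vocabulary would work equally well for the later steps.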

Now, we count word occurrences and calculate term frequencies.


TF Calculation for Each Document
• Document 1: "Data science is transforming the world."
o Total words: 6
o TF values:
▪ TF(Data) = 1/6 = 0.1667
▪ TF(Science) = 1/6 = 0.1667
▪ TF(Is) = 1/6 = 0.1667
▪ TF(Transforming) = 1/6 = 0.1667
▪ TF(The) = 1/6 = 0.1667
▪ TF(World) = 1/6 = 0.1667
• Document 2: "Machine learning is a subset of data science."
o Total words: 8
o TF values:
▪ TF(Machine) = 1/8 = 0.125
▪ TF(Learning) = 1/8 = 0.125
▪ TF(Is) = 1/8 = 0.125
▪ TF(A) = 1/8 = 0.125
▪ TF(Subset) = 1/8 = 0.125
▪ TF(Of) = 1/8 = 0.125
▪ TF(Data) = 1/8 = 0.125
▪ TF(Science) = 1/8 = 0.125
• Document 3: "Deep learning and AI are advancing rapidly."
o Total words: 7
o TF values:
▪ TF(Deep) = 1/7 = 0.1429
▪ TF(Learning) = 1/7 = 0.1429
▪ TF(And) = 1/7 = 0.1429
▪ TF(AI) = 1/7 = 0.1429
▪ TF(Are) = 1/7 = 0.1429
▪ TF(Advancing) = 1/7 = 0.1429
▪ TF(Rapidly) = 1/7 = 0.1429
• Document 4: "AI and machine learning are reshaping industries."
o Total words: 7
o TF values:
▪ TF(AI) = 1/7 = 0.1429
▪ TF(And) = 1/7 = 0.1429
▪ TF(Machine) = 1/7 = 0.1429
▪ TF(Learning) = 1/7 = 0.1429
▪ TF(Are) = 1/7 = 0.1429
▪ TF(Reshaping) = 1/7 = 0.1429
▪ TF(Industries) = 1/7 = 0.1429
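
The per-document counting can be checked mechanically. The following is a minimal sketch, assuming TF is the raw count of a word divided by the total number of words in the document, with lowercasing and punctuation removal; `tokenize` and `term_frequencies` are hypothetical helper names.

```python
import re
from collections import Counter

corpus = [
    "Data science is transforming the world.",
    "Machine learning is a subset of data science.",
    "Deep learning and AI are advancing rapidly.",
    "AI and machine learning are reshaping industries.",
]


def tokenize(text):
    """Lowercase the text and return its alphabetic words (assumed tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def term_frequencies(text):
    """Return {word: count / total words} for a single document."""
    words = tokenize(text)
    counts = Counter(words)
    return {word: count / len(words) for word, count in counts.items()}


tf_per_doc = [term_frequencies(doc) for doc in corpus]

for number, (doc, tf) in enumerate(zip(corpus, tf_per_doc), start=1):
    print(f"Document {number}: total words = {len(tokenize(doc))}")
    for word, value in tf.items():
        print(f"  TF({word}) = {value:.4f}")
```

Printing the total word count next to the TF values makes it easy to spot a miscounted document, since every word here occurs once and its TF is simply 1 divided by that total.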

Step 2: Compute Inverse Document Frequency (IDF)

The formula for IDF is:

IDF(t) = log(N / DF(t))

where:

• N = 4 (total number of documents)
• DF(t) = number of documents that contain the term t
• log is the natural logarithm (for example, log(4/2) = 0.693)

Let's calculate IDF for each word:

Word           DF (Number of Docs)   IDF = log(4/DF)
Data           2                     log(4/2) = 0.693
Science        2                     log(4/2) = 0.693
Is             2                     log(4/2) = 0.693
Transforming   1                     log(4/1) = 1.386
The            1                     log(4/1) = 1.386
World          1                     log(4/1) = 1.386
Machine        2                     log(4/2) = 0.693
Learning       3                     log(4/3) = 0.288
A              1                     log(4/1) = 1.386
Subset         1                     log(4/1) = 1.386
Of             1                     log(4/1) = 1.386
Deep           1                     log(4/1) = 1.386
And            2                     log(4/2) = 0.693
AI             2                     log(4/2) = 0.693
Are            2                     log(4/2) = 0.693
Advancing      1                     log(4/1) = 1.386
Rapidly        1                     log(4/1) = 1.386
Reshaping      1                     log(4/1) = 1.386
Industries     1                     log(4/1) = 1.386
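
The DF and IDF columns can be recomputed as follows. This is a minimal sketch, assuming the natural logarithm and no smoothing term, which is what the values in the table imply; `tokenize` and `inverse_document_frequencies` are hypothetical helper names.

```python
import math
import re

corpus = [
    "Data science is transforming the world.",
    "Machine learning is a subset of data science.",
    "Deep learning and AI are advancing rapidly.",
    "AI and machine learning are reshaping industries.",
]


def tokenize(text):
    """Lowercase the text and return its alphabetic words (assumed tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def inverse_document_frequencies(docs):
    """Return ({word: DF}, {word: IDF}) with IDF = ln(N / DF)."""
    n_docs = len(docs)
    # A set per document, so a word counts once per document it appears in.
    word_sets = [set(tokenize(doc)) for doc in docs]
    df = {}
    for words in word_sets:
        for word in words:
            df[word] = df.get(word, 0) + 1
    idf = {word: math.log(n_docs / count) for word, count in df.items()}
    return df, idf


df, idf = inverse_document_frequencies(corpus)
for word in sorted(idf):
    print(f"{word:<14}DF = {df[word]}   IDF = {idf[word]:.3f}")
```

Libraries often use a smoothed variant of IDF, so their numbers will generally differ from this unsmoothed textbook form.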


Step 3: Compute TF-IDF
TF-IDF(t, d) = TF(t, d) × IDF(t)
Now we compute the values. The word with the highest TF-IDF is the one with the largest product
of TF and IDF.
Words that appear in only one document have the largest IDF (1.386), so the maximum occurs where
such a word also has the largest TF. Document 1 is the shortest document (6 words, TF = 0.1667), giving:
TF-IDF = 0.1667 × 1.386 = 0.231
Single-document words in the longer documents score lower: 0.125 × 1.386 = 0.173 in Document 2
(8 words) and 0.1429 × 1.386 = 0.198 in Documents 3 and 4 (7 words each).
The words with the highest TF-IDF score are:
• Transforming
• The
• World
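
The search for the highest score can be sketched in Python. This assumes TF = count / total words in the document and IDF = ln(N / DF); the function name `tf_idf_scores` and the variables `best` and `top_words` are hypothetical, chosen for illustration.

```python
import math
import re
from collections import Counter

corpus = [
    "Data science is transforming the world.",
    "Machine learning is a subset of data science.",
    "Deep learning and AI are advancing rapidly.",
    "AI and machine learning are reshaping industries.",
]


def tokenize(text):
    """Lowercase the text and return its alphabetic words (assumed tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def tf_idf_scores(docs):
    """Return one {word: TF-IDF} dictionary per document."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(docs)
    # DF(t): number of documents that contain t at least once.
    df = Counter(word for words in tokenized for word in set(words))
    scores = []
    for words in tokenized:
        counts = Counter(words)
        scores.append({
            word: (count / len(words)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return scores


scores = tf_idf_scores(corpus)

# Highest TF-IDF value anywhere in the corpus, and the words that reach it.
best = max(value for doc in scores for value in doc.values())
top_words = sorted(
    {word for doc in scores for word, value in doc.items()
     if math.isclose(value, best)}
)
print(f"highest TF-IDF = {best:.3f}: {top_words}")
```

Because TF depends on document length, words with the same IDF can still receive different scores in documents of different lengths.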
Step 4: Construct Document Vector Table
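We construct a table where each row represents a word in the corpus and each column represents a
document, filled with TF-IDF values. Each column is the vector for that document. The values use
TF = 1/6, 1/8, 1/7 and 1/7 for Documents 1 to 4 (6, 8, 7 and 7 words respectively).

Word           D1      D2      D3      D4
Data           0.116   0.087   0       0
Science        0.116   0.087   0       0
Is             0.116   0.087   0       0
Transforming   0.231   0       0       0
The            0.231   0       0       0
World          0.231   0       0       0
Machine        0       0.087   0       0.099
Learning       0       0.036   0.041   0.041
A              0       0.173   0       0
Subset         0       0.173   0       0
Of             0       0.173   0       0
Deep           0       0       0.198   0
And            0       0       0.099   0.099
AI             0       0       0.099   0.099
Are            0       0       0.099   0.099
Advancing      0       0       0.198   0
Rapidly        0       0       0.198   0
Reshaping      0       0       0       0.198
Industries     0       0       0       0.198

Thus, Transforming, The, and World have the highest TF-IDF (0.231), all in Document 1.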
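
A document vector table of this kind can be generated programmatically. The following is a minimal sketch, assuming TF = count / total words, IDF = ln(N / DF), and a vocabulary ordered by first appearance; `tokenize` and `document_vector` are hypothetical helper names.

```python
import math
import re
from collections import Counter

corpus = [
    "Data science is transforming the world.",
    "Machine learning is a subset of data science.",
    "Deep learning and AI are advancing rapidly.",
    "AI and machine learning are reshaping industries.",
]


def tokenize(text):
    """Lowercase the text and return its alphabetic words (assumed tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


tokenized = [tokenize(doc) for doc in corpus]

# Vocabulary in order of first appearance (dict keys preserve insertion order).
vocab = list(dict.fromkeys(word for words in tokenized for word in words))

# DF(t): number of documents that contain t at least once.
df = Counter(word for words in tokenized for word in set(words))


def document_vector(words):
    """Return the TF-IDF vector of one document over the whole vocabulary."""
    counts = Counter(words)  # a missing word counts as 0, so its entry is 0.0
    return [
        (counts[word] / len(words)) * math.log(len(corpus) / df[word])
        for word in vocab
    ]


vectors = [document_vector(words) for words in tokenized]

# Print one row per word and one column per document.
header = "".join(f"{'D' + str(i):>8}" for i in range(1, len(corpus) + 1))
print(f"{'Word':<14}{header}")
for column, word in enumerate(vocab):
    row = "".join(f"{vector[column]:>8.3f}" for vector in vectors)
    print(f"{word:<14}{row}")
```

Each entry of `vectors` is one document's vector; the printing loop only transposes it so that words run down the rows.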

Questions for Practice:


Consider a small corpus consisting of three Text documents:
Text Doc 1: "The cat sat on the mat."
Text Doc 2: "The dog chased the cat."
Text Doc 3: "The cat and the dog played together."
Calculate TF-IDF.
