
PORTFOLIO-AI-NLP - Document Vector

Uploaded by

Arnav Shinde

Step-by-step approach for the Bag of Words algorithm

To create the Document Vector Table and the Inverse Document Frequency


Corpus:

Document 1: Johny Johny, Yes Papa
Document 2: Eating sugar? No Papa
Document 3: Telling lies? No Papa
Document 4: Open your mouth, Ha! Ha! Ha!

Solution:
Step 1: Text Normalisation (collect data and preprocess)

The corpus has 4 documents of 1 sentence each. After text normalisation (converting to lowercase and removing punctuation), the text becomes:

Document 1: [johny, johny, yes, papa]
Document 2: [eating, sugar, no, papa]
Document 3: [telling, lies, no, papa]
Document 4: [open, your, mouth, ha, ha, ha]
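The normalisation in Step 1 can be sketched in Python as below. This is only a minimal illustration: the function name normalise and the rule "keep runs of letters a to z" are assumptions made for this example, and real preprocessing often also removes stop words or applies stemming.

```python
import re

# The four documents of the corpus, as given above.
corpus = [
    "Johny Johny, Yes Papa",
    "Eating sugar? No Papa",
    "Telling lies? No Papa",
    "Open your mouth, Ha! Ha! Ha!",
]

def normalise(text):
    """Lowercase the text and return its words, dropping punctuation."""
    # Assumption: a "word" is any run of the letters a-z.
    return re.findall(r"[a-z]+", text.lower())

normalised_docs = [normalise(doc) for doc in corpus]

for number, tokens in enumerate(normalised_docs, start=1):
    print(f"Document {number}: {tokens}")
```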

Step 2: Create Dictionary

List every unique word of the normalised corpus once, in the order in which it first appears.

Dictionary:
johny  yes  papa  eating  sugar  no
telling  lies  open  your  mouth  ha
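One possible way to build this dictionary in Python is shown below. The helper name build_dictionary is made up for this sketch, and keeping the words in order of first appearance is an assumption chosen so that the result matches the dictionary listed above.

```python
# Normalised documents from Step 1.
docs = [
    ["johny", "johny", "yes", "papa"],
    ["eating", "sugar", "no", "papa"],
    ["telling", "lies", "no", "papa"],
    ["open", "your", "mouth", "ha", "ha", "ha"],
]

def build_dictionary(documents):
    """Return each unique word once, in order of first appearance."""
    vocabulary = []
    seen = set()
    for tokens in documents:
        for word in tokens:
            if word not in seen:
                seen.add(word)
                vocabulary.append(word)
    return vocabulary

dictionary = build_dictionary(docs)
print(dictionary)
print(len(dictionary), "unique words")
```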
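Step 3: Create Document Vector

For Document 1, write under each dictionary word the number of times it occurs in the document, and 0 if it does not occur.

Word   johny  yes  papa  eating  sugar  no  telling  lies  open  your  mouth  ha
Doc 1  2      1    1     0       0      0   0        0     0     0     0      0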

Step 4: Repeat for all documents

Word   johny  yes  papa  eating  sugar  no  telling  lies  open  your  mouth  ha
Doc 1  2      1    1     0       0      0   0        0     0     0     0      0
Doc 2  0      0    1     1       1      1   0        0     0     0     0      0
Doc 3  0      0    1     0       0      1   1        1     0     0     0      0
Doc 4  0      0    0     0       0      0   0        0     1     1     1      3

This gives us the Document Vector Table for the corpus.

Term frequency is the number of times a word occurs in one document. It can be read
directly from the Document Vector Table, because each row of that table records the
frequency of every dictionary word in that document.
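Steps 3 and 4 can be sketched in Python with collections.Counter from the standard library. The function name document_vector is an assumption for this example; each row it produces holds the term frequencies of one document.

```python
from collections import Counter

dictionary = ["johny", "yes", "papa", "eating", "sugar", "no",
              "telling", "lies", "open", "your", "mouth", "ha"]

docs = [
    ["johny", "johny", "yes", "papa"],
    ["eating", "sugar", "no", "papa"],
    ["telling", "lies", "no", "papa"],
    ["open", "your", "mouth", "ha", "ha", "ha"],
]

def document_vector(tokens, vocabulary):
    """Count how often each dictionary word occurs in one document."""
    counts = Counter(tokens)  # a Counter returns 0 for words it has not seen
    return [counts[word] for word in vocabulary]

# One row per document: this is the Document Vector Table.
table = [document_vector(tokens, dictionary) for tokens in docs]

for number, row in enumerate(table, start=1):
    print(f"Doc {number}", row)
```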

Step 5: Document Frequency

Using the Document Vector Table, record the number of documents in which each word
occurs (Document Frequency Table):

Word   johny  yes  papa  eating  sugar  no  telling  lies  open  your  mouth  ha
DF     1      1    3     1       1      2   1        1     1     1     1      1

Document frequency: the number of documents in which a word occurs, irrespective of
how many times it occurs in each of those documents. For example, "ha" occurs 3 times
in Document 4 but in no other document, so its document frequency is 1.
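As a rough sketch, the document frequency can be computed from the Document Vector Table by counting, for each column, the rows with a non-zero entry. The variable names below are chosen for this example only.

```python
dictionary = ["johny", "yes", "papa", "eating", "sugar", "no",
              "telling", "lies", "open", "your", "mouth", "ha"]

# Document Vector Table from Step 4 (one row per document).
table = [
    [2, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 3],
]

# A document counts once for a word if the word occurs in it at all,
# no matter how many times (so "ha" with count 3 still adds only 1).
document_frequency = [
    sum(1 for row in table if row[column] > 0)
    for column in range(len(dictionary))
]

for word, df in zip(dictionary, document_frequency):
    print(word, df)
```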
Step 6: Inverse Document Frequency

For the Inverse Document Frequency table, put the total number of documents in the
numerator and the document frequency in the denominator. Here the total number of
documents is 4, so the inverse document frequency becomes:

Word   johny  yes  papa  eating  sugar  no  telling  lies  open  your  mouth  ha
IDF    4/1    4/1  4/3   4/1     4/1    4/2 4/1      4/1   4/1   4/1   4/1    4/1
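The ratio in Step 6 can be sketched in Python using fractions.Fraction from the standard library, which keeps values such as 4/3 exact. This is an illustration of the plain ratio described above, not a full weighting scheme; note that Fraction reduces automatically, so 4/2 is stored as 2 and 4/1 as 4.

```python
from fractions import Fraction

dictionary = ["johny", "yes", "papa", "eating", "sugar", "no",
              "telling", "lies", "open", "your", "mouth", "ha"]

# Document frequencies from Step 5, in dictionary order.
document_frequency = [1, 1, 3, 1, 1, 2, 1, 1, 1, 1, 1, 1]

total_documents = 4

# Total number of documents in the numerator, document frequency in the denominator.
inverse_document_frequency = {
    word: Fraction(total_documents, df)
    for word, df in zip(dictionary, document_frequency)
}

for word, value in inverse_document_frequency.items():
    print(word, value)
```

Words that occur in many documents, such as "papa", get a smaller value than words that occur in only one document.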
