Natural Language Processing
• Ex:
• Document 1: Aman and Anil are stressed.
• Document 2: Aman went to a therapist.
• Document 3: Anil went to download a health chatbot.
Bag of Words (BoW)
• Step 1: Text Normalisation - Collect data and pre-process it.
• After normalisation:
• Document 1: [aman, and, anil, are, stressed]
• Document 2: [aman, went, to, a, therapist]
• Document 3: [anil, went, to, download, a, health, chatbot]
• Step 2: Create Dictionary - List every unique word that occurs across the
three documents.
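Steps 1 and 2 can be sketched in a few lines of Python. This is a minimal illustration, not a full pre-processing pipeline: it assumes that normalisation here means only lowercasing, removing punctuation and splitting on whitespace, and the helper name normalise is made up for this example.

```python
import string

documents = [
    "Aman and Anil are stressed.",
    "Aman went to a therapist.",
    "Anil went to download a health chatbot.",
]

def normalise(text):
    """Lowercase the text, drop punctuation, and split on whitespace."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

# Step 1: one list of tokens per document.
tokenised = [normalise(doc) for doc in documents]

# Step 2: dictionary of unique words, in order of first appearance.
dictionary = []
for tokens in tokenised:
    for word in tokens:
        if word not in dictionary:
            dictionary.append(word)

print(tokenised)
print(dictionary)
```

The dictionary keeps each word once, however many documents it appears in, so the three documents yield 12 entries.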
• Step 3: Create document vector.
            aman  and  anil  are  stressed  went  to  a  therapist  download  health  chatbot
Document 1  1     1    1     1    1         0     0   0  0          0         0       0
Document 2  1     0    0     0    0         1     1   1  1          0         0       0
Document 3  0     0    1     0    0         1     1   1  0          1         1       1
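The document vectors can be reproduced by counting, for each dictionary word, how often it occurs in a document. The sketch below assumes the normalised documents and the dictionary from the earlier steps are already available as Python lists. The function name document_vector is illustrative only.

```python
tokenised = [
    ["aman", "and", "anil", "are", "stressed"],
    ["aman", "went", "to", "a", "therapist"],
    ["anil", "went", "to", "download", "a", "health", "chatbot"],
]
dictionary = ["aman", "and", "anil", "are", "stressed", "went", "to", "a",
              "therapist", "download", "health", "chatbot"]

def document_vector(tokens, dictionary):
    """Count how many times each dictionary word occurs in one document."""
    return [tokens.count(word) for word in dictionary]

vectors = [document_vector(tokens, dictionary) for tokens in tokenised]
for vector in vectors:
    print(vector)
```

Every vector has the same length as the dictionary, so documents of different lengths become directly comparable.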
• Step 4:
• Term Frequency (TF): The number of times a word occurs in one document.
• Document Frequency (DF): The number of documents in which the word
occurs, irrespective of how many times it occurs in those documents.
• Inverse Document Frequency (IDF): The total number of documents divided
by the document frequency of the word.
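Term frequency of each word in each document:
            aman  and  anil  are  stressed  went  to  a  therapist  download  health  chatbot
Document 1  1     1    1     1    1         0     0   0  0          0         0       0
Document 2  1     0    0     0    0         1     1   1  1          0         0       0
Document 3  0     0    1     0    0         1     1   1  0          1         1       1
Inverse document frequency of each word (3 documents in total):
            aman  and  anil  are  stressed  went  to   a    therapist  download  health  chatbot
IDF         3/2   3/1  3/2   3/1  3/1       3/2   3/2  3/2  3/1        3/1       3/1     3/1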
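As a rough check on the IDF row, the document frequency and IDF of each word can be computed directly from the normalised documents. This sketch takes IDF to be the total number of documents divided by the document frequency, and it uses Python's fractions module only so that the results print as 3/2 rather than 1.5. Note that anil occurs in Documents 1 and 3, so its IDF comes out as 3/2.

```python
from fractions import Fraction

tokenised = [
    ["aman", "and", "anil", "are", "stressed"],
    ["aman", "went", "to", "a", "therapist"],
    ["anil", "went", "to", "download", "a", "health", "chatbot"],
]
dictionary = ["aman", "and", "anil", "are", "stressed", "went", "to", "a",
              "therapist", "download", "health", "chatbot"]

total_documents = len(tokenised)

# Document frequency: number of documents that contain the word at least once.
document_frequency = {
    word: sum(1 for tokens in tokenised if word in tokens)
    for word in dictionary
}

# IDF: total number of documents divided by the document frequency.
idf = {
    word: Fraction(total_documents, df)
    for word, df in document_frequency.items()
}

for word in dictionary:
    print(word, document_frequency[word], idf[word])
```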
TFIDF(W) = TF(W) x log(IDF(W)), where log is taken to base 10
Word       Document 1  Document 2  Document 3
aman       1*log(3/2)  1*log(3/2)  0*log(3/2)
and        1*log(3)    0*log(3)    0*log(3)
anil       1*log(3/2)  0*log(3/2)  1*log(3/2)
are        1*log(3)    0*log(3)    0*log(3)
stressed   1*log(3)    0*log(3)    0*log(3)
went       0*log(3/2)  1*log(3/2)  1*log(3/2)
to         0*log(3/2)  1*log(3/2)  1*log(3/2)
a          0*log(3/2)  1*log(3/2)  1*log(3/2)
therapist  0*log(3)    1*log(3)    0*log(3)
download   0*log(3)    0*log(3)    1*log(3)
health     0*log(3)    0*log(3)    1*log(3)
chatbot    0*log(3)    0*log(3)    1*log(3)
            aman   and    anil   are    stressed  went  to  a  therapist  download  health  chatbot
Document 1  0.176  0.477  0.176  0.477  0.477     0     0   0  0          0         0       0
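The TFIDF values can be computed for all three documents with a short script. This is a sketch under one assumption: the logarithm is base 10, inferred from the values 0.176 (log of 3/2) and 0.477 (log of 3) shown above. Results are rounded to three decimal places.

```python
import math

tokenised = [
    ["aman", "and", "anil", "are", "stressed"],
    ["aman", "went", "to", "a", "therapist"],
    ["anil", "went", "to", "download", "a", "health", "chatbot"],
]
dictionary = ["aman", "and", "anil", "are", "stressed", "went", "to", "a",
              "therapist", "download", "health", "chatbot"]

total_documents = len(tokenised)

# Document frequency: number of documents that contain the word.
document_frequency = {
    word: sum(1 for tokens in tokenised if word in tokens)
    for word in dictionary
}

def tfidf_vector(tokens):
    """TFIDF(W) = TF(W) x log10(total documents / document frequency of W)."""
    return [
        round(tokens.count(word)
              * math.log10(total_documents / document_frequency[word]), 3)
        for word in dictionary
    ]

tfidf = [tfidf_vector(tokens) for tokens in tokenised]
for row in tfidf:
    print(row)
```

Words that occur in fewer documents (IDF of 3) score higher than words shared by two documents (IDF of 3/2), and a word that occurred in every document would score log(1) = 0.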
Document Classification:
Helps in classifying the type and genre of a document.
Topic Modelling:
Helps in predicting the topic for a corpus.
Information Retrieval System:
Helps in extracting the important information from a corpus.
Stop word filtering:
Helps in removing unnecessary words from a text body.