Natural Language Processing

• Branch of AI that enables computers to process human language
in the form of text or voice data and mimic human conversation.
• Applications:
• Automatic Summarization
• Sentiment Analysis
• Text Classification
• Virtual Assistants
[Figure: panels titled Underfitting, Perfect fit, Overfitting]


Chatbots
• Script Bots
• Smart Bots
Data Processing
• Text Normalisation cleans up the textual data so that its
complexity is lower than that of the original data.
• The textual data from all the documents taken together is known
as the corpus.
• Sentence Segmentation: the whole corpus is divided into
sentences. Each sentence is treated as a separate piece of data, so
the corpus is reduced to a list of sentences.
• Tokenisation: Each sentence is then further divided into tokens.
A token is any word, number or special character occurring in a
sentence.
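The two steps above, sentence segmentation and tokenisation, can be sketched with regular expressions. This is a minimal illustration, not a production tokenizer: the function names `segment_sentences` and `tokenise` are hypothetical, and it assumes that every sentence ends with ".", "!" or "?" followed by whitespace.

```python
import re

def segment_sentences(corpus: str) -> list[str]:
    # Assumption: a sentence ends at ".", "!" or "?" followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", corpus.strip())
    return [p for p in parts if p]

def tokenise(sentence: str) -> list[str]:
    # A token is a run of word characters (word or number)
    # or a single special character.
    return re.findall(r"\w+|[^\w\s]", sentence)

corpus = "Aman and Anil are stressed. Aman went to a therapist."
sentences = segment_sentences(corpus)
tokens = [tokenise(s) for s in sentences]
print(sentences)
print(tokens)
```

Abbreviations such as "Dr." would be split incorrectly by this rule, which is why real pipelines usually rely on a trained sentence splitter.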
• Removing Stopwords, Special Characters and Numbers: tokens
which are not necessary are removed from the token list.
• Stopwords are words which occur very frequently in the corpus
but do not add any value to it. Ex: a, an, and, are, as, for, it, is, into...
• Converting text to a common case: After stopword removal, we
convert the whole text to the same case, preferably lower case.
• Stemming: The remaining words are reduced to their root
words. In other words, stemming is the process in which the
affixes of words are removed and the words are converted to
their base form.
• Lemmatization: Like stemming, lemmatization removes affixes to
reduce a word to its base form. The difference is that the word
obtained in lemmatization (known as the lemma) is always a
meaningful one, whereas a stem need not be.
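The remaining normalisation steps (stopword removal, case conversion, stemming and lemmatization) can be sketched as follows. Everything here is a simplified assumption for illustration: `STOPWORDS` uses only the example words listed above, `SUFFIXES` is a tiny hypothetical suffix list rather than a real stemming algorithm, and `LEMMAS` is a hypothetical lookup table standing in for a proper lemmatizer.

```python
# Stopwords taken from the example list in the text (not a complete list).
STOPWORDS = {"a", "an", "and", "are", "as", "for", "it", "is", "into"}
# Hypothetical suffix list for a naive stemmer.
SUFFIXES = ("ing", "ed", "es", "s")
# Hypothetical lemma lookup; a real lemmatizer uses a full dictionary.
LEMMAS = {"studies": "study", "stressed": "stress"}

def remove_stopwords(tokens):
    # Drop stopwords, special characters and numbers.
    return [t for t in tokens if t.isalpha() and t.lower() not in STOPWORDS]

def naive_stem(word):
    # Strip the first matching suffix, keeping at least three letters.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def lemmatise(word):
    # Return the dictionary form if known, else the word unchanged.
    return LEMMAS.get(word, word)

tokens = ["Anil", "studies", "and", "is", "stressed", "!", "42"]
cleaned = [t.lower() for t in remove_stopwords(tokens)]
stems = [naive_stem(w) for w in cleaned]
lemmas = [lemmatise(w) for w in cleaned]
print(cleaned)
print(stems)
print(lemmas)
```

For "studies" the naive stemmer yields "studi", which is not a word, while the lemma lookup yields "study". That is the difference between the two techniques in miniature.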
Bag of Words (BoW)
• Bag of Words is a Natural Language Processing model which
helps in extracting features from text for use in machine learning
algorithms. In bag of words, we count the occurrences of each
word and construct the vocabulary for the corpus.
• Steps to implement BoW:
• Text Normalisation: Collect data and pre-process it.
• Create Dictionary: Make a list of all the unique words occurring in
the corpus (the vocabulary).
• Create document vectors: For each document in the corpus, find
out how many times each word from the vocabulary has occurred.
• Create document vectors for all the documents.

• Ex:
• Document 1: Aman and Anil are stressed.
• Document 2: Aman went to a therapist.
• Document 3: Anil went to download a health chatbot.
• Step 1: Text Normalisation - Collect data and pre-process it.
• After normalisation-
• Document 1: [aman, and, anil, are, stressed]
• Document 2: [aman, went, to, a, therapist]
• Document 3: [anil, went, to, download, a, health, chatbot]

• Step 2: Create Dictionary - List all the unique words which occur
across the three documents:
aman, and, anil, are, stressed, went, to, a, therapist, download, health, chatbot
• Step 3: Create document vector.
            aman  and  anil  are  stressed  went  to  a  therapist  download  health  chatbot
Document 1  1     1    1     1    1         0     0   0  0          0         0       0
Document 2  1     0    0     0    0         1     1   1  1          0         0       0
Document 3  0     0    1     0    0         1     1   1  0          1         1       1
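Steps 2 and 3 can be reproduced in a few lines. This is a minimal sketch that starts from the normalised token lists of the example; the helper names `build_vocabulary` and `document_vector` are hypothetical.

```python
# Normalised documents from the example (Step 1 output).
documents = [
    ["aman", "and", "anil", "are", "stressed"],
    ["aman", "went", "to", "a", "therapist"],
    ["anil", "went", "to", "download", "a", "health", "chatbot"],
]

def build_vocabulary(docs):
    # Collect each unique word once, in order of first appearance.
    vocab = []
    for doc in docs:
        for word in doc:
            if word not in vocab:
                vocab.append(word)
    return vocab

def document_vector(doc, vocab):
    # Count how many times each vocabulary word occurs in the document.
    return [doc.count(word) for word in vocab]

vocabulary = build_vocabulary(documents)
vectors = [document_vector(doc, vocabulary) for doc in documents]
print(vocabulary)
for vector in vectors:
    print(vector)
```

Keeping the vocabulary in first-appearance order makes the printed vectors line up with the columns of the table above.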

• Step 4:
• Term Frequency(TF): Frequency of a word in one document.
• Inverse Document Frequency (IDF): Document Frequency (DF) is the
number of documents in which the word occurs, irrespective of how
many times it has occurred in those documents. IDF is the total
number of documents divided by the document frequency of the word.
     aman  and  anil  are  stressed  went  to   a    therapist  download  health  chatbot
DF   2     1    2     1    1         2     2    2    1          1         1       1
IDF  3/2   3/1  3/2   3/1  3/1       3/2   3/2  3/2  3/1        3/1       3/1     3/1
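Document frequency and IDF for the example can be checked with a short sketch. The function names are hypothetical, and `fractions.Fraction` is used only so that the results print as exact ratios such as 3/2.

```python
from fractions import Fraction

# Normalised documents from the example.
documents = [
    ["aman", "and", "anil", "are", "stressed"],
    ["aman", "went", "to", "a", "therapist"],
    ["anil", "went", "to", "download", "a", "health", "chatbot"],
]

def document_frequency(word, docs):
    # Number of documents containing the word, however often it occurs.
    return sum(1 for doc in docs if word in doc)

def inverse_document_frequency(word, docs):
    # Total number of documents divided by the document frequency.
    return Fraction(len(docs), document_frequency(word, docs))

for word in ["aman", "and", "anil", "went", "chatbot"]:
    print(word, document_frequency(word, documents),
          inverse_document_frequency(word, documents))
```

Note that "anil" occurs in Documents 1 and 3, so its document frequency is 2 and its IDF is 3/2.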
TFIDF = TF(W) x log(IDF(W)), where log is taken to base 10.
Word       Document 1          Document 2          Document 3
aman       1*log(3/2) = 0.176  1*log(3/2) = 0.176  0*log(3/2) = 0
and        1*log(3) = 0.477    0*log(3) = 0        0*log(3) = 0
anil       1*log(3/2) = 0.176  0*log(3/2) = 0      1*log(3/2) = 0.176
are        1*log(3) = 0.477    0*log(3) = 0        0*log(3) = 0
stressed   1*log(3) = 0.477    0*log(3) = 0        0*log(3) = 0
went       0*log(3/2) = 0      1*log(3/2) = 0.176  1*log(3/2) = 0.176
to         0*log(3/2) = 0      1*log(3/2) = 0.176  1*log(3/2) = 0.176
a          0*log(3/2) = 0      1*log(3/2) = 0.176  1*log(3/2) = 0.176
therapist  0*log(3) = 0        1*log(3) = 0.477    0*log(3) = 0
download   0*log(3) = 0        0*log(3) = 0        1*log(3) = 0.477
health     0*log(3) = 0        0*log(3) = 0        1*log(3) = 0.477
chatbot    0*log(3) = 0        0*log(3) = 0        1*log(3) = 0.477
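The TFIDF values for the example can be reproduced with the sketch below. It assumes a base-10 logarithm (which matches the values 0.176 and 0.477) and rounds to three decimals; the function name `tfidf_vector` is hypothetical.

```python
import math

# Normalised documents from the example.
documents = [
    ["aman", "and", "anil", "are", "stressed"],
    ["aman", "went", "to", "a", "therapist"],
    ["anil", "went", "to", "download", "a", "health", "chatbot"],
]

# Unique words in order of first appearance.
vocabulary = list(dict.fromkeys(w for doc in documents for w in doc))

def tfidf_vector(doc, vocab, docs):
    values = []
    for word in vocab:
        tf = doc.count(word)                      # term frequency
        df = sum(1 for d in docs if word in d)    # document frequency
        idf = len(docs) / df                      # inverse document frequency
        values.append(round(tf * math.log10(idf), 3))
    return values

for doc in documents:
    print(tfidf_vector(doc, vocabulary, documents))
```

Words that occur in fewer documents (such as "therapist") receive a higher value than words shared across documents (such as "aman"), and a word that occurred in every document would score 0 because log(1) = 0.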
Applications of TFIDF

• Document Classification: Helps in classifying the type and genre
of a document.
• Topic Modelling: Helps in predicting the topic for a corpus.
• Information Retrieval System: To extract the important
information out of a corpus.
• Stop word filtering: Helps in removing the unnecessary words out
of a text body.
