
Unit 6
Natural Language Processing (NLP)
What is NLP?
• NLP enables interaction between machines and humans through the natural languages that humans use, allowing communication between machine and human.
• It deals with programming computers to process and analyse large amounts of natural language data.
• It accepts anything spoken or typed, converts it into digital signals, and generates a useful response.
• It helps machines understand, analyse, manipulate and interpret human language.
• It is used in all industries where human interaction is involved, such as enquiry handling, education, counselling, customer support, crime detection and consultation.
Applications of NLP:
• Chatbots
• Virtual Assistants
• Social Media Monitoring
• Speech Recognition
• Machine Translation
• Sentiment Analysis
• Text Classification
• Education & Training
• Text Extraction
• Health Care
• Automatic Summarization
Chatbots:

Script bot
• Script bots are easy to make and less interactive.
• Script bots have limited functionality.
• These chatbots are easy to integrate into a messaging platform.
• E.g. the customer care section (these bots answer some basic queries and connect the user to a human once they are unable to handle the query).

Smart bot
• Smart bots are flexible, powerful and more interactive.
• Smart bots have wide functionality.
• Smart bots learn by themselves with more data.
• E.g. Google Assistant, Cortana, Siri etc. (they can manage to handle a customer's query).
Human Language vs Computer Language:
• How can humans "talk to" (instruct) computers?
Multiple Meanings of Words
• To understand this, consider the following sentences:
1. His face turned red after he found out that he had taken the wrong bag.
   What does this mean? Is he ashamed because he took another person's bag instead of his own? Or is he angry because he did not manage to steal the bag he had been targeting?
2. His face turns red after consuming the medicine.
   Is he having an allergic reaction? Or is he unable to bear the taste of the medicine?
Perfect Syntax but no Meaning
• Sometimes a statement has perfectly correct syntax but no meaning at all.
• Example: Chickens feed extravagantly while the moon drinks tea.
• This statement is grammatically correct, but does it make any sense? In human language, a perfect balance of syntax and semantics is needed for good understanding.
Data Processing
• Having seen some of the complications in human languages above, let us now see how Natural Language Processing makes it possible for machines to understand and speak natural languages just like humans.
• Since the language of computers is numerical, the very first step is to convert our language into numbers. This conversion takes a few steps, the first of which is Text Normalisation.
• Text Normalisation helps clean up the textual data so that its complexity is lower than that of the actual data.
Text Normalisation
1. Sentence Segmentation
2. Tokenisation
3. Removal of Stopwords
4. Converting text to a common case
5. Stemming
6. Lemmatization
Text Normalisation
• A series of steps that normalise the text to a lower level.
• The process of downsizing and simplifying text to make it suitable for machine processing.
• Unnecessary pieces are removed from the text, and the text is broken into simpler tokens that are later converted into numerical form.
• Processing is done on text collected from multiple documents and sources.
• The textual data collected from multiple sources is known as a corpus. (A corpus can be defined as a collection of text documents; it can be thought of as a bunch of text files in a directory.)
https://www.youtube.com/watch?v=2HVe6rYID2I
1. Sentence Segmentation
• The whole corpus is broken down into simple sentences.
• The paragraph is split into small sentences wherever there is sentence-ending punctuation (each sentence is then treated as a separate piece of data to be processed).
Example:
• Before Sentence Segmentation
• “You want to see the dreams with close eyes and achieve them?
They’ll remain dreams, look for AIMs and your eyes have to stay open
for a change to be seen.”
• After Sentence Segmentation
1. You want to see the dreams with close eyes and achieve them?
2. They'll remain dreams, look for AIMs and your eyes have to stay open for a change to be seen.
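The punctuation-based split described above can be sketched in pure Python with a regular expression. This is a minimal illustration, not a library API; `segment_sentences` is a name chosen here for the example.

```python
import re

def segment_sentences(text):
    """Split text into sentences at sentence-ending punctuation (. ! ?)."""
    # Split after ., ! or ? followed by whitespace, keeping the punctuation
    # attached to its sentence; drop any empty pieces.
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

corpus = ("You want to see the dreams with close eyes and achieve them? "
          "They'll remain dreams, look for AIMs and your eyes have to stay "
          "open for a change to be seen.")

for i, sentence in enumerate(segment_sentences(corpus), start=1):
    print(i, sentence)
```

Real segmenters also handle abbreviations like "Dr." or "e.g.", which this simple rule would wrongly split on.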
2. Tokenisation
• After segmenting the sentences, each sentence is further divided into individual text pieces called tokens (this is also called word tokenisation or word segmentation).
• A token can be a word, number or special character in a sentence.
• In most languages words are separated by spaces, but splitting on spaces alone is not always enough (punctuation may be attached to words).

Text: The cat sat on the mat.
Tokens: 'The', 'cat', 'sat', 'on', 'the', 'mat', '.'
Under Tokenisation, every word, number, and special character is
considered separately and each of them is now a separate token.
Tokenising "You want to see the dreams with close eyes and achieve them?" gives the tokens:
'You', 'want', 'to', 'see', 'the', 'dreams', 'with', 'close', 'eyes', 'and', 'achieve', 'them', '?'
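A minimal tokeniser along these lines can be written with one regular expression; the function name `tokenise` is chosen for this sketch.

```python
import re

def tokenise(sentence):
    """Split a sentence into word/number tokens and single punctuation tokens."""
    # \w+ matches runs of letters/digits; [^\w\s] matches one punctuation mark,
    # so the '.' at the end of a sentence becomes its own token.
    return re.findall(r"\w+|[^\w\s]", sentence)

print(tokenise("The cat sat on the mat."))
print(tokenise("You want to see the dreams with close eyes and achieve them?"))
```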
3. Removal of Stopwords, Special Characters & Numbers
• Stopwords are grammatical words that occur frequently in a corpus but do not add any value or essence to the information. E.g. a, an, and, are, as, for, it, is, into, in, if, on, such, the, there, to.
• Stopwords are removed to make it easier for the computer to focus on the important, meaningful words.
• Whether special characters and/or numbers are removed depends on the type of corpus we are working on.
• For example, if you are working on a document containing email IDs, you might not want to remove special characters and numbers, whereas in other textual data, if these characters do not make sense, you can remove them along with the stopwords.
Stopwords: Example
1. You want to see the dreams with close eyes and achieve them?
   • The removed words would be: to, the, and, ?
2. The outcome would be: You want see dreams with close eyes achieve them
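Stopword removal is a simple filter over the token list. A sketch using the stopword list given above (`remove_stopwords` is a name chosen for this example):

```python
# Stopword list from the slide above (lowercased for comparison).
STOPWORDS = {"a", "an", "and", "are", "as", "for", "it", "is", "into",
             "in", "if", "on", "or", "such", "the", "there", "to"}

def remove_stopwords(tokens):
    """Drop stopwords and stray punctuation tokens, keeping word order."""
    return [t for t in tokens if t.lower() not in STOPWORDS and t.isalnum()]

tokens = ["You", "want", "to", "see", "the", "dreams", "with", "close",
          "eyes", "and", "achieve", "them", "?"]
print(remove_stopwords(tokens))
# ['You', 'want', 'see', 'dreams', 'with', 'close', 'eyes', 'achieve', 'them']
```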
4. Converting text to a common case
• As the name suggests, the whole text is converted into the same case, preferably lower case. This ensures that the machine's case sensitivity does not treat the same word as two different words just because of a difference in case.
5. Stemming & Lemmatization
Stemming
• Definition: Stemming is a technique used to extract the base form of a word (the stem) by removing affixes from it. It is like cutting a tree's branches down to its stem.
• This is done because a single word may have different forms, e.g. eat, eats, eating.
• The stemmed words (the words we get after removing the affixes) might not be meaningful.
• NOTE: Some stems end up having no meaning; for example, "happiness" after stemming becomes "happi".
Example:

Word    | Affix | Stem
healing | ing   | heal
dreams  | s     | dream
studies | es    | studi
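A toy stemmer that only strips the suffixes shown in the table can be sketched in a few lines. This is a deliberate simplification: real stemmers such as the Porter stemmer apply many more rules (which is how "happiness" ends up as "happi").

```python
def stem(word):
    """Crude stemmer: strip a common suffix without checking that
    the result is a real word (so 'studies' -> 'studi')."""
    for suffix in ("ing", "es", "s"):  # try longer suffixes first
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

for w in ("healing", "dreams", "studies"):
    print(w, "->", stem(w))
```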
Lemmatization
• Definition: In lemmatization, affixes are removed and words are converted to their base form (known as the lemma), which is always a meaningful word.
• It takes longer to execute than stemming.
• Example:

Word     | Affix | Lemma
healing  | ing   | heal
dreams   | s     | dream
studies  | es    | study
studying | ing   | study


Difference between Stemming and Lemmatization

Stemming                                     | Lemmatization
1. The stemmed word might not be meaningful. | 1. The lemma is a meaningful word.
2. Caring → Car                              | 2. Caring → Care
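The Caring → Car vs Caring → Care difference can be demonstrated with a sketch: suffix stripping for stemming versus a dictionary lookup for lemmatization. The tiny hand-made `LEMMAS` table stands in for the large dictionaries (plus part-of-speech information) that real lemmatizers use.

```python
# Hand-made lemma lookup for this example only; real lemmatizers
# (e.g. WordNet-based ones) consult full dictionaries.
LEMMAS = {"caring": "care", "studies": "study",
          "studying": "study", "dreams": "dream"}

def lemmatize(word):
    """Return the dictionary lemma, falling back to the lowercased word."""
    return LEMMAS.get(word.lower(), word.lower())

def crude_stem(word):
    """Blind suffix stripping: may produce a non-word."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

print(crude_stem("caring"), "vs", lemmatize("caring"))
# car vs care
```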
Feature Extraction from Text
• After text normalisation, the features of the text have to be extracted before processing can start.
• For this, the text has to be converted into a suitable numeric form using algorithms such as Bag of Words and Term Frequency–Inverse Document Frequency.
Bag of Words (BoW) – an algorithm that transforms tokens into a set of features
• In Bag of Words, we record the occurrences of each word and construct the vocabulary for the corpus.
• Bag of Words creates a set of vectors containing the count of word occurrences in each document (review).
• The algorithm returns the unique words of the corpus and their occurrences in it.
• The BoW algorithm is not concerned with the sequence of words.

The Bag of Words gives us two things:
• Vocabulary – the unique words identified in the corpus.
• Frequency – the number of occurrences of each word.

https://www.youtube.com/watch?v=c0Tk8KEHBFc
Steps of the Bag of Words algorithm
1. Text Normalisation: remove all punctuation and unnecessary symbols, and convert the entire text to lowercase.
2. Create Vocabulary (Dictionary): make a list of all the unique words occurring in the corpus.
3. Text Vectorisation: create document vectors with a separate column for each word, where each row corresponds to one document (review). For each document in the corpus, count how many times each word has occurred.
Step 1: Collecting data and pre-processing it.

Raw Data:
• Document 1: Aman and Anil are stressed
• Document 2: Aman went to a therapist
• Document 3: Anil went to download a health chatbot

Processed Data:
• Document 1: [aman, and, anil, are, stressed]
• Document 2: [aman, went, to, a, therapist]
• Document 3: [anil, went, to, download, a, health, chatbot]

Note: No tokens have been removed in the stopword-removal step. This is because we have very little data, and since the frequency of all the words is almost the same, no word can be said to have less value than another.
Step 2: Create Dictionary
A dictionary in NLP is a list of all the unique words occurring in the corpus. If a word is repeated across different documents, it is written only once when creating the dictionary.

Dictionary:
aman, and, anil, are, stressed, went, to, a, therapist, download, health, chatbot
Step 3: Create a document vector
How to make a document vector table?
• The vocabulary is written in the top row. Then, for each word in the document, if it matches the vocabulary, put a 1 under it. If the same word appears again, increment the previous value by 1. If the word does not occur in that document, put a 0 under it.

Document vector for Document 1:

aman | and | anil | are | stressed | went | to | a | therapist | download | health | chatbot
1    | 1   | 1    | 1   | 1        | 0    | 0  | 0 | 0         | 0        | 0      | 0
Step 4: Create the document vector table for all documents

aman | and | anil | are | stressed | went | to | a | therapist | download | health | chatbot
1    | 1   | 1    | 1   | 1        | 0    | 0  | 0 | 0         | 0        | 0      | 0
1    | 0   | 0    | 0   | 0        | 1    | 1  | 1 | 1         | 0        | 0      | 0
0    | 0   | 1    | 0   | 0        | 1    | 1  | 1 | 0         | 1        | 1      | 1
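The four steps above can be reproduced with a short pure-Python sketch (no NLP library is needed for a corpus this small; variable names are chosen for the example):

```python
docs = [
    "Aman and Anil are stressed",
    "Aman went to a therapist",
    "Anil went to download a health chatbot",
]

# Step 1: normalise — lowercase and split on spaces
# (this corpus contains no punctuation to strip).
tokenised = [d.lower().split() for d in docs]

# Step 2: build the dictionary — unique words in order of first appearance.
vocab = []
for tokens in tokenised:
    for t in tokens:
        if t not in vocab:
            vocab.append(t)

# Steps 3–4: one count vector per document.
vectors = [[tokens.count(word) for word in vocab] for tokens in tokenised]

print(vocab)
for v in vectors:
    print(v)
```

The printed rows match the document vector table above.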
TFIDF
• TFIDF stands for Term Frequency – Inverse Document Frequency.

Term Frequency
• Term frequency is the frequency of a word within one document; it is exactly the document vector table created above.

aman | and | anil | are | stressed | went | to | a | therapist | download | health | chatbot
1    | 1   | 1    | 1   | 1        | 0    | 0  | 0 | 0         | 0        | 0      | 0
1    | 0   | 0    | 0   | 0        | 1    | 1  | 1 | 1         | 0        | 0      | 0
0    | 0   | 1    | 0   | 0        | 1    | 1  | 1 | 0         | 1        | 1      | 1
Inverse Document Frequency

DF (Document Frequency)
• Document frequency is the number of documents in which a word occurs.

aman | and | anil | are | stressed | went | to | a | therapist | download | health | chatbot
2    | 1   | 2    | 1   | 1        | 2    | 2  | 2 | 1         | 1        | 1      | 1

IDF (Inverse Document Frequency)
• Definition: For inverse document frequency, the document frequency goes in the denominator while the total number of documents is the numerator.

aman | and | anil | are | stressed | went | to  | a   | therapist | download | health | chatbot
3/2  | 3/1 | 3/2  | 3/1 | 3/1      | 3/2  | 3/2 | 3/2 | 3/1       | 3/1      | 3/1    | 3/1
Formula of TFIDF
• The formula of TFIDF for any word W is:
• TFIDF(W) = TF(W) * log( IDF(W) )

Applying the formula to each cell (the TF value times the log of the IDF value):

aman       | and      | anil       | are      | stressed | went       | to         | a          | therapist | download | health   | chatbot
1*log(3/2) | 1*log(3) | 1*log(3/2) | 1*log(3) | 1*log(3) | 0*log(3/2) | 0*log(3/2) | 0*log(3/2) | 0*log(3)  | 0*log(3) | 0*log(3) | 0*log(3)
1*log(3/2) | 0*log(3) | 0*log(3/2) | 0*log(3) | 0*log(3) | 1*log(3/2) | 1*log(3/2) | 1*log(3/2) | 1*log(3)  | 0*log(3) | 0*log(3) | 0*log(3)
0*log(3/2) | 0*log(3) | 1*log(3/2) | 0*log(3) | 0*log(3) | 1*log(3/2) | 1*log(3/2) | 1*log(3/2) | 0*log(3)  | 1*log(3) | 1*log(3) | 1*log(3)

Evaluating with log to base 10 (log(3/2) ≈ 0.176, log(3) ≈ 0.477):

aman  | and   | anil  | are   | stressed | went  | to    | a     | therapist | download | health | chatbot
0.176 | 0.477 | 0.176 | 0.477 | 0.477    | 0     | 0     | 0     | 0         | 0        | 0      | 0
0.176 | 0     | 0     | 0     | 0        | 0.176 | 0.176 | 0.176 | 0.477     | 0        | 0      | 0
0     | 0     | 0.176 | 0     | 0        | 0.176 | 0.176 | 0.176 | 0         | 0.477    | 0.477  | 0.477
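The whole calculation can be checked with a short Python sketch: compute DF from the term-frequency table, then apply TF * log10(total documents / DF) to every cell. Variable names are chosen for the example.

```python
import math

vocab = ["aman", "and", "anil", "are", "stressed", "went",
         "to", "a", "therapist", "download", "health", "chatbot"]

# Term-frequency table from the Bag of Words step (one row per document).
tf = [
    [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1],
]
n_docs = len(tf)

# Document frequency: in how many documents each word occurs.
df = [sum(1 for row in tf if row[j] > 0) for j in range(len(vocab))]

# TFIDF(W) = TF(W) * log10(n_docs / DF(W)), rounded to 3 decimal places.
tfidf = [[round(row[j] * math.log10(n_docs / df[j]), 3)
          for j in range(len(vocab))]
         for row in tf]

for row in tfidf:
    print(row)
```

Using base-10 logarithms reproduces the 0.176 and 0.477 values in the table above.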
