Unit 6 Natural Language Processing
Natural Language Processing: a sub-field of AI
Lesson Objectives
• Introduction to Natural Language Processing
• Applications of Natural Language Processing
• Revisiting the AI Project Cycle
• Introduction to Chatbots (Activity)
• Human Language vs Computer Language
• Text Processing
• Data Processing
• Bag of Words
The third domain (Natural Language Processing)
1. NLP takes in the data of natural languages which humans use in their daily lives and operates on it.
2. NLP is the sub-field of AI that is focused on enabling computers to understand and process human languages.
3. It is concerned with the interactions between computers and human (natural) languages, in particular with how to program computers to process and analyse large amounts of natural language data.
This chapter is all about demystifying the Natural
Language Processing domain and understanding how it
works.
Before we get deeper into NLP, let us experience it with the help of this game:
Identify the Mystery Animal
http://bit.ly/iai4yma
1. Try to identify the Mystery Animal by asking the machine 20 Yes or No questions. Were you able to guess the animal?
2. If yes, in how many questions were you able to guess it?
3. If no, how many times did you try playing this game?
4. What, according to you, was the task of the machine?
Mystery Animal is a new spin on the classic 20-questions game. The computer pretends to be an animal, and you have to guess what it is using your voice. Ask any yes-or-no question you want, like "Do you have feathers?" or "Do you sleep at night?"
5. Were there any challenges that you faced while playing this game? If yes, list them down.
In this game the human is the guesser, and that is the fun challenge because there are so many things a person might ask. The game uses Dialogflow with Actions on Google, which uses machine learning to understand your dialogue. Dialogflow is a tool which allows the computer to understand language without you having to write any code. Generative text techniques are used to create the Mystery Animal's responses, and a knowledge graph handles the questions throughout.
6. What approach must one follow to win this game?
The first question you ask is very important. It must lead you closer to the answer regardless of whether the answer is yes or no. It allows us to ascertain useful information quickly, which lets us ask relevant questions that quickly lead to the right answer.
Applications of NLP
• Automatic Summarization
Revisiting the AI Project Cycle

Step 1: Problem Scoping
We need to bridge the gap between a person who needs help and the psychiatrist.
Step 2: Data Acquisition
To understand the sentiments of people, we need to collect their conversational data so the machine can interpret the words that they use and understand their meaning.
Step 3: Data Exploration
The text is normalised through various steps and the vocabulary is lowered to a minimum, since the machine does not require grammatically correct statements but only the essence of them.
Step 4: Modelling
Once the text has been normalised, it is then fed to an NLP-based AI model.
Step 5: Evaluation
The trained model is then evaluated, and its accuracy is generated on the basis of the relevance of the answers which the machine gives to the user's responses.
Problem Scoping
CBT (Cognitive Behavioural Therapy) is a technique used by most therapists to cure patients of stress and depression. But it has been observed that people do not wish to seek the help of a psychiatrist willingly. They try to avoid such interactions as much as possible. Let us look at various factors around this problem through the 4Ws problem canvas.

Who Canvas – Who has the problem?
What Canvas – What is the nature of the problem?
Where Canvas – Where does the problem arise?
Why Canvas – Why do you think it is a problem worth solving?
Chatbots
One of the most common applications of Natural Language Processing is a chatbot. Some popular chatbots you can try are:
• CleverBot: https://www.cleverbot.com/
• Jabberwacky: http://www.jabberwacky.com/
• Haptik: https://haptik.ai/contact-us
• Rose: http://ec2-54-215-197-164.us-west-1.compute.amazonaws.com/speech.php
• Ochatbot: https://www.ometrics.com/blog/list-of-fun-
Types of Chatbots
There are two types of chatbots: Script-bots and Smart-bots. Script-bots are scripted, or in other words traditional, chatbots, while Smart-bots are AI-powered and have more knowledge.
Examples:
The Story Speaker can be considered as a script-bot, as in that activity we used to create a script around which the interactive story revolved. As soon as the machine got triggered by the person, it used to follow the script and answer accordingly. Other examples of script-bots include the bots which are deployed in the customer care sections of various companies. Their job is to answer some basic queries that they are coded for, and to connect the user to human executives once they are unable to handle the conversation.
Human Language
Humans communicate through language, which we process all the time. Our brain keeps on processing the sounds that it hears around itself and tries to make sense out of them all the time.
The sound reaches the brain through a long channel. As a person speaks, the sound travels from his mouth and goes to the listener's eardrum. The sound striking the eardrum is converted into neuron impulses, gets transported to the brain and then gets processed. After processing the signal, the brain gains an understanding of its meaning. If it is clear, the signal gets stored. Otherwise, the listener asks the speaker for clarity. This is how human languages are processed by humans.

Computer Language
The computer understands the language of numbers. Everything that is sent to the machine has to be converted to numbers. And while typing, if a single mistake is made, the computer throws an error and does not process that part. The communications made by machines are very basic and simple.

Now, if we want the machine to understand our language, how should this happen?
What are the possible difficulties a machine would face in processing natural language?

1. Arrangement of the words and meaning: This is the issue related to the syntax of the language. Syntax refers to the grammatical structure of a sentence.
2. Analogy with programming language: Different syntax, same semantics: 2 + 3 = 3 + 2.
3. Multiple meanings of a word: A word can have multiple meanings, and the meanings fit into the statement according to its context.
4. Perfect syntax, no meaning: A statement can be grammatically correct, but does it make any sense?
Arrangement of the words and meaning
There are rules in human language. There are nouns, verbs, adverbs, adjectives. A
word can be a noun at one time and an adjective some other time. There are rules
to provide structure to a language.
This is the issue related to the syntax of the language. Syntax refers to the
grammatical structure of a sentence. When the structure is present, we can start
interpreting the message. Now we also want the computer to do this. One way to do this is to use part-of-speech tagging. This allows the computer to identify the different parts of speech.
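As a rough illustration (an assumption on tooling, since the chapter does not name any library), part-of-speech tagging can be tried with NLTK, whose tagger labels each token with a tag such as noun, verb or determiner:

    # Minimal part-of-speech tagging sketch using NLTK (assumed library; resource
    # names can vary slightly between NLTK versions).
    import nltk

    nltk.download("punkt", quiet=True)                        # tokenizer models
    nltk.download("averaged_perceptron_tagger", quiet=True)   # POS tagger model

    sentence = "Aman went to a therapist."
    tokens = nltk.word_tokenize(sentence)   # split the sentence into tokens
    print(nltk.pos_tag(tokens))             # attach a part-of-speech tag to each token
    # Roughly: [('Aman', 'NNP'), ('went', 'VBD'), ('to', 'TO'),
    #           ('a', 'DT'), ('therapist', 'NN'), ('.', '.')]

Here NNP marks a proper noun, VBD a past-tense verb and DT a determiner, which is exactly the kind of structural information the machine needs before it can interpret a sentence.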
Besides the matter of arrangement, there’s also meaning behind the language we
use. Human communication is complex. There are multiple characteristics of the
human language that might be easy for a human to understand but extremely
difficult for a computer to understand.
Analogy with programming language:
Consider two statements that are written differently but mean the same thing, for example 2 + 3 and 3 + 2: the way they are written is different, but their meaning is the same, that is, 5.
Now consider statements that have the same syntax but different meanings: in Python 2.7 the statement 3/2 would result in 1, while in Python 3 it would give an output of 1.5.
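The original code snippets are not reproduced in this excerpt, so the sketch below is an assumption that uses 2 + 3 versus 3 + 2 for the first case and 3/2 for the second:

    # Different syntax, same semantics: both expressions evaluate to 5.
    a = 2 + 3
    b = 3 + 2
    print(a, b)      # 5 5

    # Same syntax, different semantics: "/" behaves differently across versions.
    print(3 / 2)     # Python 3: true division, prints 1.5
    print(3 // 2)    # floor division, prints 1 -- what Python 2.7 returned for 3/2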
Multiple Meanings of a word
Let’s consider these three sentences:
His face turned red after he found out that he took the wrong bag.
What does this mean? Is he feeling ashamed because he took another person’s bag instead of
his? Is he feeling angry because he did not manage to steal the bag that he has been
targeting?
The red car zoomed past his nose.
Probably talking about the color of the car
His face turns red after consuming the medicine.
Is he having an allergic reaction? Or is he not able to bear the taste of that medicine?
Here we can see that context is important. We understand a sentence almost intuitively,
depending on our history of using the language, and the memories that have been built within.
In all three sentences, the word red has been used in three different ways which according to
the context of the statement changes its meaning completely. Thus, in natural language, it is
important to understand that a word can have multiple meanings and the meanings fit into
the statement according to the context of it.
Perfect Syntax, no Meaning
Sometimes, a statement can have a perfectly correct syntax but not mean anything. For example, a sentence such as "Chickens feed extravagantly while the moon drinks tea" is grammatically correct, yet it conveys no sensible meaning.
Text Normalisation
a. Sentence Segmentation
b. Tokenisation
c. Removing Stopwords, Special Characters and Numbers
d. Converting text to a common case
e. Stemming
f. Lemmatization
Text Normalisation
In Text Normalisation, we undergo several steps
to normalise the text to a lower level. The whole
textual data from all the documents altogether is
known as corpus.
Step 1: Sentence Segmentation
The corpus is divided into sentences.
Step 2: Tokenisation
Each sentence is further divided into tokens.
Step 3: Removing Stopwords, Special Characters and Numbers
Tokens which are unnecessary are removed from the token list.
Step 4: Converting text to a common case
The whole text is converted into a similar case.
Sentence Segmentation
Under sentence segmentation, the whole corpus is divided into sentences. Each
sentence is taken as a different data so now the whole corpus gets reduced to
sentences.
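As a hedged illustration (the chapter itself points to online tools in the DIY section rather than to any library), sentence segmentation can be tried with NLTK's sentence tokenizer:

    # Splitting a small corpus into sentences with NLTK (assumed library).
    import nltk
    nltk.download("punkt", quiet=True)   # sentence tokenizer models

    corpus = ("Aman and Anil are stressed. Aman went to a therapist. "
              "Anil went to download a health chatbot.")
    for sentence in nltk.sent_tokenize(corpus):   # each sentence becomes one data item
        print(sentence)
    # Aman and Anil are stressed.
    # Aman went to a therapist.
    # Anil went to download a health chatbot.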
Tokenisation
After segmenting the sentences, each sentence is then further divided into tokens.
A token is a term used for any word, number or special character occurring in a sentence. Under tokenisation, every word, number and special character is considered separately, and each of them becomes a separate token.
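A similar hedged sketch for tokenisation, again assuming NLTK; note how the full stop comes out as its own token:

    # Splitting one sentence into tokens with NLTK (assumed library).
    import nltk
    nltk.download("punkt", quiet=True)

    sentence = "Aman went to a therapist."
    tokens = nltk.word_tokenize(sentence)   # every word and special character is a token
    print(tokens)
    # ['Aman', 'went', 'to', 'a', 'therapist', '.']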
Removing Stopwords, Special Characters and
Numbers
In this step, the tokens which are not necessary are removed from the token list. What
can be the possible words which we might not require?
Stopwords are the words which occur very frequently in the corpus but do not add any
value to it. Humans use grammar to make their sentences meaningful for the other
person to understand. But grammatical words do not add any essence to the
information which is to be transmitted through the statement hence they come under
stopwords. Some examples of stopwords are: a, an, and, are, for, is, of, the, to.
These words occur the most in any given corpus but talk very little or
nothing about the context or the meaning of it. Hence, to make it easier
for the computer to focus on meaningful terms, these words are removed.
Along with these words, a lot of times our corpus might have special
characters and/or numbers. Now it depends on the type of corpus that we
are working on whether we should keep them in it or not. For example, if
you are working on a document containing email IDs, then you might not
want to remove the special characters and numbers whereas in some other
textual data if these characters do not make sense, then you can remove
them along with the stopwords.
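A small hedged sketch of this step, assuming NLTK's built-in English stopword list (any stopword list would work); whether numbers and special characters are dropped is a choice that depends on the corpus, as noted above:

    # Removing stopwords, special characters and numbers with NLTK (assumed library).
    import nltk
    nltk.download("stopwords", quiet=True)
    from nltk.corpus import stopwords

    tokens = ["Anil", "went", "to", "download", "a", "health", "chatbot", "in", "2023", "!"]
    stop_words = set(stopwords.words("english"))     # frequent words like "to", "a", "in"

    cleaned = [t for t in tokens
               if t.isalpha()                        # drops numbers and special characters
               and t.lower() not in stop_words]      # drops stopwords
    print(cleaned)
    # ['Anil', 'went', 'download', 'health', 'chatbot']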
Converting text to a common case
After the stopwords removal, we convert the whole text into a similar case, preferably lower case. This ensures that the case-sensitivity of the machine does not consider the same words as different just because of different cases.
Here in this example, all the 6 forms of "hello" would be converted to lower case and hence would be treated as the same word by the machine.
Stemming
In this step, the remaining words are reduced to their root words. In other words, stemming is the
process in which the affixes of words are removed and the words are converted to their base form.
Note that in stemming, the stemmed words (words which we get after removing the
affixes) might not be meaningful. Here in this example as you can see: healed, healing
and healer all were reduced to heal but studies was reduced to studi after the affix
removal which is not a meaningful word. Stemming does not take into account if the
stemmed word is meaningful or not. It just removes the affixes hence it is faster.
Lemmatization
Stemming and lemmatization both are alternative processes to each other as the role of
both the processes is same – removal of affixes. But the difference between both of them
is that in lemmatization, the word we get after affix removal (also known as lemma) is a
meaningful one. Lemmatization makes sure that lemma is a word with meaning and
hence it takes a longer time to execute than stemming.
As you can see in the same example, the output for studies after affix removal has
become study instead of studi.
Difference between stemming and lemmatization can be summarized by this example:
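Since that example is not reproduced here, the following hedged sketch (assuming NLTK's Porter stemmer and WordNet lemmatizer) shows the contrast on the same words used above:

    # Stemming vs lemmatization with NLTK (assumed tools).
    import nltk
    nltk.download("wordnet", quiet=True)
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    words = ["healed", "healing", "studies"]
    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    print([stemmer.stem(w) for w in words])                    # ['heal', 'heal', 'studi']
    print([lemmatizer.lemmatize(w, pos="v") for w in words])   # ['heal', 'heal', 'study']

The stemmer simply chops off the affixes and produces "studi", which is not a dictionary word, while the slower lemmatizer returns the meaningful lemma "study".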
With this we have normalised our text to tokens which are the simplest form of
words present in the corpus. Now it is time to convert the tokens into numbers. For
this, we would use the Bag of Words algorithm
Bag of Words
Bag of Words is a Natural Language Processing model which helps in extracting
features out of the text which can be helpful in machine learning algorithms. In bag of
words, we get the occurrences of each word and construct the vocabulary for the corpus.
This image gives us a brief overview about how bag of words works. Let us assume that the text on the left
in this image is the normalised corpus which we have got after going through all the steps of text
processing. Now, as we put this text into the bag of words algorithm, the algorithm returns to us the unique
words out of the corpus and their occurrences in it. As you can see at the right, it shows us a list of words
appearing in the corpus and the numbers corresponding to it shows how many times the word has occurred
in the text body.
Bag of Words
Thus, we can say that the bag of words gives us two things:
1.A vocabulary of words for the corpus
2.The frequency of these words (number of times it has occurred in the whole corpus).
Here calling this algorithm “bag” of words symbolises that the sequence of sentences
or tokens does not matter in this case as all we need are the unique words and their
frequency in it.
The step-by-step approach to implement the bag of words algorithm:
1. Text Normalisation: Collect data and pre-process it.
2. Create Dictionary: Make a list of all the unique words occurring in the corpus (the vocabulary).
3. Create document vectors: For each document in the corpus, find out how many times each word from the unique list of words has occurred.
4. Create document vectors for all the documents.
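These steps are also automated by common libraries; for instance, here is a hedged sketch using scikit-learn's CountVectorizer (an assumption, since the chapter builds the table by hand) on a tiny made-up corpus, before the worked example below:

    # Bag of words via scikit-learn's CountVectorizer (assumed library).
    from sklearn.feature_extraction.text import CountVectorizer

    documents = ["welcome to ai learning", "ai learning is fun"]

    vectorizer = CountVectorizer()                # builds the vocabulary and counts words
    vectors = vectorizer.fit_transform(documents)

    print(vectorizer.get_feature_names_out())     # the dictionary, in alphabetical order
    # ['ai' 'fun' 'is' 'learning' 'to' 'welcome']
    print(vectors.toarray())                      # one document vector per row
    # [[1 0 0 1 1 1]
    #  [1 1 1 1 0 0]]
    # Note: by default, tokens shorter than two characters (such as "a") are ignored.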
For Example
Step 1: Collecting data and pre-processing it.
Document 1: Aman and Anil are stressed
Document 2: Aman went to a therapist
Document 3: Anil went to download a health
chatbot
Here are three documents having one sentence each. After text normalisation, the text
becomes:
Document 1: [aman, and, anil, are, stressed]
Document 2: [aman, went, to, a, therapist]
Document 3: [anil, went, to, download, a,
health, chatbot]
Note that no tokens have been removed in the stopwords removal step. It is because
we have very little data and since the frequency of all the words is almost the same, no
word can be said to have lesser value than the other.
Step 2: Create Dictionary
Go through all the documents and create a dictionary, i.e., list down all the unique words which occur across the three documents.
Dictionary: aman, and, anil, are, stressed, went, to, a, therapist, download, health, chatbot
Step 3: Create document vector
Since in the first document we have the words aman, and, anil, are, stressed, all these words get a value of 1 and the rest of the words get a value of 0.
Step 4: Repeat for all documents
Same exercise has to be done for all the documents. Hence, the table
becomes:
In this table, the header row contains the vocabulary of the corpus and
three rows correspond to three different documents. Take a look at this
table and analyse the positioning of 0s and 1s in it.
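Since the table itself is not reproduced here, this small plain-Python sketch rebuilds it from the three normalised documents above:

    # Rebuilding the document vector table for the example corpus.
    docs = {
        "Document 1": ["aman", "and", "anil", "are", "stressed"],
        "Document 2": ["aman", "went", "to", "a", "therapist"],
        "Document 3": ["anil", "went", "to", "download", "a", "health", "chatbot"],
    }

    # Step 2: the dictionary -- unique words in order of first appearance.
    vocabulary = []
    for tokens in docs.values():
        for token in tokens:
            if token not in vocabulary:
                vocabulary.append(token)

    # Steps 3 and 4: one row of counts per document.
    print(vocabulary)
    for name, tokens in docs.items():
        print(name, [tokens.count(word) for word in vocabulary])
    # Document 1 [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
    # Document 2 [1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0]
    # Document 3 [0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1]

The list printed first is the vocabulary (header row), and each following row is one document vector, matching the table described above.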
Finally, this gives us the document vector table for our corpus. But these raw counts alone do not tell us how valuable each word is to the corpus. This leads us to the final step of the algorithm: TFIDF.
TFIDF: Term Frequency & Inverse Document Frequency
Suppose you have a book. Which characters or words do you think
would occur the most in it?
Bag of words algorithm gives us the frequency of words in each
document we have in our corpus. It gives us an idea that if the word is
occurring more in a document, its value is more for that document. For
example, if I have a document on air pollution, air and pollution would
be the words which occur many times in it. And these words are
valuable too as they give us some context around the document. But
let us suppose we have 10 documents, and all of them talk about different issues. One is on women empowerment, another is on unemployment, and so on. Do you think air and pollution would still be among the most occurring words in the whole corpus? If not, then which words do you think would have the highest frequency in all of them?
And, this, is, the, etc. are the words which occur the
most in almost all the documents. But these words do
not talk about the corpus at all. Though they are
important for humans as they make the
statements understandable to us, for the machine
they are a complete waste as they do not provide us
with any information regarding the corpus. Hence,
these are termed as stopwords and are mostly
removed at the pre-processing stage only.
Take a look at this graph. It is a plot of occurrence
of words versus their value. As you can see, if the
words have highest occurrence in all the documents
of the corpus, they are said to have negligible
value hence they are termed as stop words. These
words are mostly removed at the pre-
processing stage only. Now as we move ahead from the stopwords, the occurrence level drops drastically, and the words which have adequate occurrence in the corpus are said to have some amount of value; these are termed as frequent words. These words mostly talk about the document's subject, and their occurrence in the corpus is adequate. Then as the occurrence of
words drops further, the value of such words rises.
These words are termed as rare or valuable words.
These words occur the least but add the most value
to the corpus. Hence, when we look at the text, we
take frequent and rare words into consideration.
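TFIDF itself is introduced after this excerpt, so the sketch below is only an assumption based on the common formulation, where a word's value in a document is its term frequency multiplied by log(total documents / documents containing the word):

    # A common TFIDF formulation (assumed; the chapter's exact formula is not shown here).
    import math

    docs = [
        ["aman", "and", "anil", "are", "stressed"],
        ["aman", "went", "to", "a", "therapist"],
        ["anil", "went", "to", "download", "a", "health", "chatbot"],
    ]

    def tfidf(word, doc, docs):
        tf = doc.count(word)                       # term frequency in this document
        df = sum(1 for d in docs if word in d)     # number of documents containing the word
        return tf * math.log10(len(docs) / df)     # rarer words get a higher weight

    for i, doc in enumerate(docs, start=1):
        scores = {w: round(tfidf(w, doc, docs), 3) for w in doc}
        print("Document", i, scores)

A word that appeared in every document would get log(3/3) = 0, echoing why stopwords carry negligible value, while rare words score the highest.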
DIY – Do It Yourself!
Here is a corpus for you to challenge yourself with the given tasks. Use the knowledge you have gained
in the above sections and try completing the whole exercise by yourself.
The Corpus
Document 1: We can use health chatbots for treating stress.
Document 2: We can use NLP to create chatbots and we will be making health chatbots now!
Document 3: Health Chatbots cannot replace human counsellors now.
Accomplish the following challenges on the basis of the corpus given above. You can use
the tools available online for these challenges. Link for each tool is given below:
1.Sentence Segmentation: https://tinyurl.com/y36hd92n
2. Tokenisation: https://text-processing.com/demo/tokenize/
3. Stopwords removal: https://demos.datasciencedojo.com/demo/stopwords/
4. Lowercase conversion: https://caseconverter.com/
5. Stemming: http://textanalysisonline.com/nltk-porter-stemmer
6. Lemmatisation: http://textanalysisonline.com/spacy-word-lemmatize
7. Bag of Words: Create a document vector table for all documents.
8. Generate TFIDF values for all the words.
9. Find the words having the highest value.
10. Find the words having the least value.