Natural Language Processing
Natural Language Processing, or NLP, is the sub-field of AI focused on enabling computers to understand and process human languages. It deals with how a computer program performs tasks such as speech recognition, translation, and the analysis and extraction of large amounts of natural language data, so that a successful interaction can occur between machines and humans and the desired output is produced.
NLP takes data from the natural language that humans speak and use in daily life and operates on it.
Process Involved in NLP
The process of understanding human language is quite difficult for a machine. This process is divided into
five major steps. These steps are:
• Lexical Analysis: In this step, an AI machine identifies and analyses the structure of words in speech and converts them into text. The text is then divided into paragraphs, sentences and words by the lexical analyser.
• Syntactic Analysis: In this step, the converted words and sentences are arranged according to the grammar of the language. This arrangement shows the relation between words.
• Semantic Analysis: In this step, the semantic analyser checks the text for meaningfulness and
draws the meaning from the text.
• Discourse Integration: In this step, meaning of the sentence is drawn according to the meanings of
the preceding sentence and the succeeding sentence.
• Pragmatic Analysis: The actual meaning of the sentence is rechecked in this process.
Try it yourself:
• Identify the mystery animal: Animal Mystery (gamestolearnenglish.com)
• Rose: http://ec2-54-215-197-164.us-west-1.compute.amazonaws.com/speech.php
• Ochatbot: https://www.ometrics.com/blog/list-of-fun-chatbots/
• Chat means "to make a conversation"; bot means "to make a task automated". Chatbot, short for "Chat Robot" (also known as a conversational agent), is an AI-enabled computer program that communicates with the user in natural human language, through voice or text, in mobile apps, websites, messaging platforms, etc.
• A chatbot can also be thought of as an application that automates your tasks, like saying good morning when you wake up, telling you the news daily, helping you choose a less congested route to school, or ordering a coffee on your way back home.
• These chatbots need not be downloaded onto your computer or phone. They work like a service made available by different companies on a digital platform and do not need any updates or occupy any memory or disk space.
• When you do live chats on websites, you are probably dealing with chatbots. Companies across the globe use chatbots to reach a wide range of customers and cater to their customised needs. For example, suppose you visit an online showroom of party apparel and are confused about what to buy: a short dress, simple formal wear, a glittery gown and so on. A chatbot can handle this situation easily and help you make your choice more smoothly based on your likes and preferences.
• Virtual assistants like Google Assistant, Cortana, Siri, etc. are considered smarter versions of chatbots, as they handle conversations as well as other tasks in a smart and intelligent way, much like a human would.
Some chatbots are scripted, in other words traditional chatbots, while others are AI-powered and have more knowledge.
There are two types of chatbots around us: Script-bots and Smart-bots.
• Script-bot
• Script-bots are simple chatbots with limited functionality. They are scripted for a particular task and follow a predetermined route to get the work done. Since they work around a script, they are easy to build and need little or no programming skill.
• These chatbots are best suited for straightforward interactions that make life easy at home, such as switching off lights on command or ordering food online. Some companies use them in customer care to provide an interactive Frequently Asked Questions service; if the chatbot is unable to handle a query, it connects the customer directly to a customer care employee. A minimal sketch of such a bot follows.
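The sketch below is a minimal, hypothetical illustration of a script-bot: a fixed table of keywords and canned replies, with a fallback that hands the conversation over to a human. The intents and replies are illustrative assumptions, not part of any real product.

```python
# Minimal sketch of a script-bot: a fixed script of keyword -> canned reply.
# All intents and replies here are illustrative assumptions.
SCRIPT = {
    "lights off": "Okay, switching off the lights.",
    "order food": "Sure, opening the food-ordering menu.",
    "opening hours": "We are open from 9 am to 6 pm, Monday to Saturday.",
}

FALLBACK = "Sorry, I cannot handle that. Connecting you to a customer care executive..."

def script_bot(user_message: str) -> str:
    """Return the scripted reply whose keyword appears in the message."""
    text = user_message.lower()
    for keyword, reply in SCRIPT.items():
        if keyword in text:
            return reply
    return FALLBACK  # predetermined route exhausted -> hand over to a human

if __name__ == "__main__":
    print(script_bot("Please switch the lights off"))   # scripted reply
    print(script_bot("What is the meaning of life?"))   # fallback to a human
```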
• Smart-bot
• Smart-bots are flexible, powerful, AI-based models with wider functionality. They support machine learning algorithms that let the machine learn from experience, and they simulate human-like interactions with users.
• These chatbots require a lot of programming and work with larger databases and other resources that help the model understand the context of an interaction.
• Assistants such as Google Assistant, Alexa, Cortana and Siri can be regarded as smart-bots, since they not only handle conversations but also manage other tasks, which makes them smarter.
Human Language Vs Computer Language
• Human language is the language people use to interact with those around them, in languages such as English, Hindi or Spanish. In English, for example, we use nouns, pronouns, verbs, adjectives and other parts of speech to build proper sentences for spoken or written conversation. Combined with past experience and knowledge, this produces an intelligent conversation between two people. Over time human languages evolve and develop, and new words are added to the dictionary based on their usage in specific places.
• A human language is made up of syntax and semantics, also known as the grammar of the language, which governs how words are arranged in the proper order to make a meaningful sentence.
• Our brain keeps on processing the sounds that it hears around itself and tries to make sense out of them
all the time. Even in the classroom, as the teacher delivers the session, our brain is continuously processing
everything and storing it in some place. Also, while this is happening, when your friend whispers
something, the focus of your brain automatically shifts from the teacher's speech to your friend's
conversation. So now, the brain is processing both the sounds but is prioritizing the one on which our
interest lies.
• The sound reaches the brain through a long channel. As a person speaks, the sound travels from their mouth to the listener's eardrum. The sound striking the eardrum is converted into a nerve impulse, transported to the brain and then processed. After processing the signal, the brain works out its meaning; if it is clear, the signal gets stored, otherwise the listener asks the speaker for clarification. This is how humans process human language.
• On the other hand, the computer understands the language of numbers. Everything that is sent to the
machine has to be converted to numbers. And while typing, if a single mistake is made, the computer
throws an error and does not process that part. The communications made by the machines are very
basic and simple.
• Computer language is a language used by programmers to develop computer programs, which help humans interact with an electronic device, the computer. The Central Processing Unit is the brain of the computer; it interprets the instructions given to the computer. The response of the computer is the output or the execution of the instruction.
• The computer interacts in the form of binary language, the language of 0s and 1s. This language has its disadvantages: it is quite difficult for humans to interact with a computer using binary numbers. Keeping in mind this nature of interaction, other programming languages that are closer to human language are now available, for example Java, C++ and Python.
• Every computer language has its own syntax and semantics which need to be followed strictly; otherwise the computer gives an error and does not process the task further.
Syntax and Semantics of a language
Arrangement of the words and meaning
• There are rules in human language. There are nouns, verbs, adverbs, adjectives. A word can be
a noun at one time and an adjective some other time. There are rules to provide structure to a
language.
• Syntax refers to the grammatical structure of a sentence. Once the structure is identified, we can start interpreting the message. One way a computer does this is part-of-speech tagging, which allows it to identify the different parts of speech in a sentence (see the sketch after this list).
• Human communication is complex. There are multiple characteristics of the human language
that might be easy for a human to understand but extremely difficult for a computer to
understand.
• The proper arrangement of words in a sentence or statement makes up the syntax of a language. Semantics refers to the meaning of a word in a sentence, which helps in interpreting the message conveyed by the complete structure of words.
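As a hedged illustration of part-of-speech tagging, the sketch below uses NLTK's tokenizer and tagger on one of the example sentences used later in this chapter. It assumes NLTK and its 'punkt' and tagger resources have been downloaded; the exact tags may vary slightly by NLTK version.

```python
# Sketch: part-of-speech tagging with NLTK.
# Assumes nltk.download("punkt") and nltk.download("averaged_perceptron_tagger").
import nltk

sentence = "Aman went to a therapist."
tokens = nltk.word_tokenize(sentence)   # split the sentence into word tokens
tags = nltk.pos_tag(tokens)             # tag each token with its part of speech

print(tags)
# Typical output (tags may vary slightly by NLTK version):
# [('Aman', 'NNP'), ('went', 'VBD'), ('to', 'TO'),
#  ('a', 'DT'), ('therapist', 'NN'), ('.', '.')]
```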
Analogy with programming language: just as a program is valid only if it follows the syntax and semantics of its programming language, a human sentence has both structure and meaning, but human language is far more flexible and ambiguous. These are some of the challenges we face if we try to teach computers how to understand and interact in human language.
• As the language of computers is numerical, the very first step is to convert our language into numbers. This conversion takes a few steps, the first of which is Text Normalisation. Since human languages are complex, we first need to simplify them to make understanding possible.
• Text Normalisation
• In Text Normalisation, we go through several steps to normalise the text to a lower level. The whole textual data from all the documents taken together is known as the corpus.
• Text Normalisation is the process of cleaning textual data by converting the text to a standard form whose complexity is lower than that of the actual data. It is considered the pre-processing stage of NLP, as it is the first thing to do before the actual data processing begins, and it helps in reducing the complexity of the language. Slang, short forms, misspellings, abbreviations, characters with special meanings, etc. are converted into a canonical form during Text Normalisation, as in the sketch below.
• For example: 'b4', 'beefor', 'bifare' to 'before'; 'gr8', 'grt' to 'great'.
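A minimal sketch of this canonicalisation step, using a small hand-made lookup table; the table entries are illustrative assumptions, not a standard resource.

```python
# Sketch: mapping slang / misspellings to a canonical form.
# The lookup table below is an illustrative assumption.
CANONICAL = {
    "b4": "before", "beefor": "before", "bifare": "before",
    "gr8": "great", "grt": "great",
}

def normalise(tokens):
    """Replace each token with its canonical form if one is known."""
    return [CANONICAL.get(tok.lower(), tok) for tok in tokens]

print(normalise(["I", "was", "gr8", "b4", "lunch"]))
# ['I', 'was', 'great', 'before', 'lunch']
```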
• Sentence Segmentation
Under sentence segmentation, the whole corpus is divided into sentences. Most human languages use punctuation marks to mark sentence boundaries, and this feature helps bring the big data set down to a simpler level of processing. After this step, each sentence is treated as a separate piece of data, so the whole corpus is reduced to sentences.
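A hedged sketch of sentence segmentation with NLTK's sentence tokenizer. It assumes the 'punkt' resource has been downloaded; the small corpus is chosen to be consistent with the example documents this chapter's tables refer to.

```python
# Sketch: sentence segmentation with NLTK (requires nltk.download("punkt")).
from nltk.tokenize import sent_tokenize

corpus = ("Aman and Anil are stressed. "
          "Aman went to a therapist. "
          "Anil went to download a health chatbot.")
sentences = sent_tokenize(corpus)   # split the corpus at sentence boundaries
for s in sentences:
    print(s)
# Aman and Anil are stressed.
# Aman went to a therapist.
# Anil went to download a health chatbot.
```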
• Tokenisation
After segmenting the sentences, each sentence is further divided into tokens. A token is any word, number or special character occurring in a sentence. Tokenisation works mainly by finding word boundaries, i.e. where one word ends and the next begins; in English, the space between two words is an important word-boundary marker. Under tokenisation, every word, number and special character is considered separately, and each of them becomes a separate token.
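A sketch of tokenisation on one of those sentences, again with NLTK; note that punctuation comes out as its own token.

```python
# Sketch: tokenisation with NLTK (requires nltk.download("punkt")).
from nltk.tokenize import word_tokenize

sentence = "Aman and Anil are stressed."
tokens = word_tokenize(sentence)   # split at word boundaries; '.' becomes its own token
print(tokens)
# ['Aman', 'and', 'Anil', 'are', 'stressed', '.']
```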
Removing Stopwords, Special Characters and Numbers
• In this step, the tokens which are not necessary are removed from the token
list. Stopwords are the words which occur very frequently in the corpus
but do not add any value to it.
• Humans use grammar to make their sentences meaningful for the other
person to understand. But grammatical words do not add any essence to the
information which is to be transmitted through the statement hence they
come under stopwords. These words occur the most in any given corpus but
talk very little or nothing about the context or the meaning of it. Hence, to
make it easier for the computer to focus on meaningful terms, these words are
removed.
• Along with these words, our corpus might often contain special characters and/or numbers such as #, %, &, @, etc. Whether to keep them depends on the type of corpus we are working on. For example, in a document containing email IDs you might not want to remove special characters and numbers, whereas in other textual data, if these characters do not add meaning, they can be removed along with the stopwords, as in the sketch below.
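A hedged sketch of stopword and special-character removal using NLTK's English stopword list (assumes nltk.download("stopwords")); whether to drop digits and symbols depends on your corpus, as noted above.

```python
# Sketch: removing stopwords, special characters and numbers with NLTK
# (requires nltk.download("stopwords")).
from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words("english"))

def clean_tokens(tokens):
    """Keep only alphabetic, non-stopword tokens (lower-cased for comparison)."""
    kept = []
    for tok in tokens:
        if not tok.isalpha():          # drop numbers and special characters
            continue
        if tok.lower() in STOPWORDS:   # drop frequent grammatical words
            continue
        kept.append(tok)
    return kept

print(clean_tokens(["Aman", "and", "Anil", "are", "stressed", ".", "24", "#"]))
# ['Aman', 'Anil', 'stressed']
```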
Converting text to a common case
• After stopword removal, we convert the whole text to the same case, preferably lower case. This ensures that the case-sensitivity of the machine does not treat the same word as different tokens just because of a difference in case, and the program does not become case sensitive.
After affix removal, the output for 'studies' becomes 'study' (with lemmatisation) rather than 'studi' (which stemming gives). With this, we have normalised our text into tokens, the simplest form of the words present in the corpus. Now it is time to convert the tokens into numbers, and the Bag of Words algorithm is used for this.
DATA PROCESSING STEPS
• Text Normalisation: the pre-processing stage of NLP, in which the text is normalised to a lower level.
• Stemming/Lemmatisation: affixes are removed from words and the words are converted to their base form (the sketch below contrasts the two).
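A hedged sketch contrasting NLTK's Porter stemmer with its WordNet lemmatiser on the 'studies' example (lemmatisation assumes nltk.download("wordnet")).

```python
# Sketch: stemming vs lemmatisation with NLTK
# (lemmatisation requires nltk.download("wordnet")).
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

word = "studies"
print(stemmer.stem(word))                   # 'studi'  -> crude affix chopping
print(lemmatizer.lemmatize(word, pos="n"))  # 'study'  -> a valid dictionary word
```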
Note that even though some words are repeated in different documents, they are all
written just once as while creating the dictionary, we create the list of unique words.
Step 3: Create document vector
• In this step, the vocabulary is written in the top row. Then, for each word in a document, if it matches the vocabulary, put a 1 under it; if the same word appears again, increment the previous value by 1; and if the word does not occur in that document, put a 0 under it.
Since the first document contains the words aman, and, anil, are and stressed, all of these words get a value of 1 and the rest of the words get a value of 0.
In this table, the header row contains the vocabulary of the corpus and three rows
correspond to three different documents.
Finally, this gives us the document vector table for our corpus (a sketch of this step follows). But the tokens have still not been converted into numbers that capture their importance; this leads us to the final step of our algorithm: TFIDF.
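A minimal sketch of the Bag of Words document-vector step. The three documents are a reconstruction chosen to be consistent with the words and document frequencies that the chapter's tables refer to; the code simply counts vocabulary words per document.

```python
# Sketch: Bag of Words -- build a vocabulary and a document vector table.
# The documents are an assumption consistent with the chapter's example tables.
documents = [
    "aman and anil are stressed",
    "aman went to a therapist",
    "anil went to download a health chatbot",
]

tokenised = [doc.split() for doc in documents]

# Collect the vocabulary of unique words (each word listed just once).
vocabulary = sorted({word for doc in tokenised for word in doc})

# For each document, count how often each vocabulary word occurs.
document_vectors = [[doc.count(word) for word in vocabulary] for doc in tokenised]

print(vocabulary)
for row in document_vectors:
    print(row)
```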
The graph used to represent the text in a corpus is a plot of the occurrence of words versus their value. Words that occur the most across all the documents of the corpus have negligible value; these are the stopwords, and they are mostly removed at the pre-processing stage. As we move beyond the stopwords, the occurrence level drops drastically; words with adequate occurrence in the corpus carry some value and are termed frequent words, and these mostly talk about the document's subject. As the occurrence of words drops further, their value rises; these are the rare or valuable words, which occur the least but add the most value to the corpus. Hence, when we look at the text, we take the frequent and the rare words into consideration.
• TFIDF: Term Frequency & Inverse Document Frequency
• This method is considered better than the Bag of Words algorithm because BoW gives only the numeric vector of each word in a document, whereas TFIDF, through its numeric value, gives the importance of each word in the document, i.e. it helps us identify the value of each word.
• TFIDF was introduced as a statistical measure of the important words in a document. Each word in a document is given a numeric value.
• Bag of words algorithm gives us the frequency of words in each document we have in our corpus. It gives us an idea
that if the word is occurring more in a document, its value is more for that document. For example, if I have a
document on air pollution, air and pollution would be the words which occur many times in it. And these words are
valuable too as they give us some context around the document. But let us suppose we have 10 documents and all of
them talk about different issues. One is on women empowerment, the other is on unemployment and so on. Do you
think air and pollution would still be one of the most occurring words in the whole corpus? If not, then which words
do you think would have the highest frequency in all of them?
• And, this, is, the, etc. are the words which occur the most in almost all the documents. But these words do not talk
about the corpus at all. Though they are important for humans as they make the statements understandable to us, for
the machine they are a complete waste as they do not provide us with any information regarding the corpus. Hence,
these are termed as stopwords and are mostly removed at the pre-processing stage only.
• Term Frequency
• Term frequency is the frequency of a word in one document.
• Term frequency can easily be found from the document vector table as in
that table we mention the frequency of each word of the vocabulary in each
document.
Document Frequency
Document Frequency is the number of documents in which the word occurs
irrespective of how many times it has occurred in those documents.
Here, you can see that the document frequency of 'aman', 'anil', 'went', 'to' and 'a' is 2, as they occur in two documents; the rest occur in just one document, so their document frequency is 1.
For inverse document frequency, we put the document frequency in the denominator and the total number of documents in the numerator. Here the total number of documents is 3, so the inverse document frequency becomes 3 divided by each word's document frequency, i.e. 3/2 for the five words above and 3/1 for the rest; the sketch below works this out.
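A short sketch of this IDF calculation for two words of the example corpus, under the assumption (consistent with the worked example later in this section) that the log is taken to base 10.

```python
# Sketch: inverse document frequency for the example corpus (log base 10 assumed).
import math

total_documents = 3
document_frequency = {"aman": 2, "and": 1}   # two words from the example vocabulary

for word, df in document_frequency.items():
    ratio = total_documents / df   # the chapter's IDF value, N / DF
    value = math.log10(ratio)      # log of that ratio, as used in TFIDF
    print(f"{word:5s} IDF = 3/{df} = {ratio:.2f}, log(IDF) = {value:.3f}")
```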
• Multiply the IDF values to the TF values. TF values are for each document
while the IDF values are for the whole corpus. Hence, we need to multiply
the IDF values to each row of the document vector table.
• The IDF value for 'aman' is the same in each row, and the same pattern holds for all the words of the vocabulary. After calculating all the values, we get the TFIDF table.
• Finally, the words have been converted to numbers. These numbers are the value of each word for each document. Here you can see that, because we have only a small amount of data, words like 'are' and 'and' also get a high value. But as a word's document frequency increases (so its IDF ratio falls towards 1), the value of that word decreases. For example:
• Total number of documents: 10
• Number of documents in which 'and' occurs: 10. Therefore, IDF(and) = 10/10 = 1, and log(1) = 0, so the value of 'and' becomes 0.
• On the other hand, the number of documents in which 'pollution' occurs is 3, so IDF(pollution) = 10/3 = 3.3333..., and log(3.3333) = 0.522, which shows that the word 'pollution' has considerable value in the corpus. The sketch below reproduces this calculation.
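A minimal sketch of the full TFIDF calculation, reproducing the 'and' vs 'pollution' numbers above. Log base 10 is assumed, and the term frequencies passed in are illustrative placeholders.

```python
# Sketch: TFIDF(word, document) = TF(word, document) * log10(N / DF(word)).
# Term frequencies below are illustrative; N and DF match the example above.
import math

def tfidf(tf, total_documents, document_frequency):
    return tf * math.log10(total_documents / document_frequency)

N = 10                                                        # total number of documents
print(tfidf(tf=5, total_documents=N, document_frequency=10))  # 'and': 5 * log(1) = 0.0
print(tfidf(tf=5, total_documents=N, document_frequency=3))   # 'pollution': 5 * 0.522 = 2.61
```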
Summary
• Words that occur in all the documents with high term frequencies have the least numeric values and are
considered to be the stopwords.
• For a word to have high TFIDF value, the word needs to have a high term frequency but less document
frequency which shows that the word is important for one document but is not a common word for all
documents.
• These values help the computer understand which words are to be considered while processing the
natural language. The higher the value, the more important the word is for a given corpus.
Applications of TFIDF
• TFIDF is commonly used in the Natural Language Processing domain, for example for judging how important a word is to a document in information retrieval, for document classification and topic modelling, and for filtering out stopwords.
DIY — Do It Yourself !
• The Corpus
• Document 1: We can use health chatbots for treating stress.
• Document 2: We can use NLP to create chatbots and we will be making health chatbots now!
• Document 3: Health Chatbots cannot replace human counsellors now. Yay >< !! @1nteLA!4Y
• Accomplish the following challenges on the basis of the corpus given above. You can use the tools available online for
these challenges.
• Sentence Segmentation: https://tinyurl.com/y36hd92n
• Tokenisation: https://text-processing.com/demo/tokenize/
• Stopwords removal: https://demos.datasciencedojo.com/demo/stopwords/
• Lowercase conversion: https://caseconverter.com/
• Stemming: http://textanalysisonline.com/nltk-porter-stemmer
• Lemmatisation: http://textanalysisonline.com/spacy-word-lemmatize
• Bag of Words: Create a document vector table for all documents.
• Generate TFIDF values for all the words.
• Find the words having highest value and the least value
NLTK
• The Natural Language Toolkit (NLTK) is one of the most commonly used open-source NLP toolkits. It is made up of Python libraries and is used for building programs that help in the synthesis and statistical analysis of human language. Its text-processing libraries handle tokenisation, parsing, classification, stemming, tagging and semantic reasoning; a brief usage sketch follows.
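A compact, hedged sketch tying the earlier steps together with NLTK. It assumes the 'punkt', 'stopwords' and tagger resources have been downloaded once; the sample text is illustrative.

```python
# Sketch: a small NLTK pipeline -- tokenise, tag, remove stopwords, stem.
# Assumes nltk.download("punkt"), nltk.download("stopwords") and
# nltk.download("averaged_perceptron_tagger") have been run once.
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

STOP = set(stopwords.words("english"))
stemmer = PorterStemmer()

text = "Health chatbots are helping people manage stress."
tokens = nltk.word_tokenize(text)                                # tokenisation
tags = nltk.pos_tag(tokens)                                      # part-of-speech tagging
content = [t for t in tokens if t.isalpha() and t.lower() not in STOP]
stems = [stemmer.stem(t.lower()) for t in content]               # stemming

print(tags)
print(stems)   # e.g. ['health', 'chatbot', 'help', 'peopl', 'manag', 'stress']
```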
At a glance
• Natural Language Processing or NLP is the subset of Artificial Intelligence that deals with how computers through a program will perform tasks like speech
recognition, translation, large amounts of Natural language data analysis and extraction so that a successful interaction can occur between the machines and
the humans to give the desired output.
• The process of understanding human language is quite difficult for a machine. This process is divided into five major steps.
• Automatic Text Summarization is the process of creating the most meaningful and relevant summary of voluminous texts from multiple resources.
• Sentiment and emotion analysis application of NLP is very significant as it helps business organizations gain insights on consumers and do a competitive
comparison and make necessary adjustments in the business strategy development.
• Text classification is defined as classifying the unstructured text into groups or categories.
• Chatbot is an AI enabled computer program that communicates with the user in Natural Human Language either through voice or text used in mobile apps,
websites, messages etc.
• Script-bots are simple chatbots with limited functionality that are scripted for a task and follow a predetermined route to get the work done.
• Smart-bots are flexible, powerful, AI-based models with wider functionality; they support machine learning algorithms that let the machine learn from experience.
• Making a computer understand a natural language is a complex process. First, we need to understand that humans interact using alphabets and sentences
and machines interact using numbers.
• Bag of Words is a simple and important technique used in Natural Language Processing for extracting features from the textual data.
• TF-IDF method is considered better than the Bag of Words algorithm because BoW gives the numeric vector of each word in the document but TFIDF through its
numeric value gives us the importance of each word in the document.
• The Natural Language Toolkit (NLTK) is one of the most commonly used open-source NLP toolkits. It is made up of Python libraries and is used for building programs that help in the synthesis and statistical analysis of human language processing.