
Natural Language Processing

Natural Language Processing, or NLP, is the sub-field of AI that is focused on enabling computers to understand and process human languages.
In NLP, we can break down the process of understanding English for a model into a number of small pieces.
Example
A usual interaction between machines and humans using
Natural Language Processing could go as follows:
 Humans talk to the computer
 The computer captures the audio
 There is an audio to text conversion
 Text data is processed
 Processed data is converted back to audio
 The computer plays the audio file and responds to humans
Applications of Natural Language Processing
1. Chatbots
 Chatbots are a form of artificial intelligence that is
programmed to interact with humans in such a way that
they sound like humans themselves.
 Chatbots are created using Natural Language Processing and Machine Learning, which means that they understand the complexities of the English language, find the actual meaning of a sentence, and also learn from their conversations with humans, becoming better with time.
 Chatbots work in two simple steps.
 First, they identify the meaning of the question
asked and collect all the data from the user that may
be required to answer the question.
 Then they answer the question appropriately.
 Examples of ChatBots :
o Mitsuku Bot: https://www.pandorabots.com/mitsuku/
o CleverBot: https://www.cleverbot.com/
o Jabberwacky: http://www.jabberwacky.com/
o Haptik: https://haptik.ai/contact-us
o Rose: http://ec2-54-215-197-164.us-west-1.compute.amazonaws.com/speech.php

Types of Chatbots
1) Script Bot
2) Smart Bot

Difference between Script Bot and Smart Bot


SCRIPT BOT                                  SMART BOT
Script bots are easy to make                Smart bots are flexible and powerful
Script bots work around a script            Smart bots work on bigger databases
which is programmed in them                 and other resources directly
Mostly they are free and are easy to        Smart bots learn with more data
integrate to a messaging platform
No or little language processing            Coding is required to take this up
skill is needed                             on board
Limited functionality                       Wide functionality


2. Autocomplete in Search Engines :
 Have you noticed that search engines tend to guess what
you are typing and automatically complete your
sentences?
 For example, on typing “game” in Google, you may get further suggestions for “game of thrones”, “game of life” or, if you are interested in maths, “game theory”. All these suggestions are provided using autocomplete, which uses Natural Language Processing to guess what you want to ask.
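A toy sketch of the idea behind autocomplete: rank known queries that start with what the user has typed, most frequent first. Real search engines use far richer models; the query log below is made up purely for illustration.

```python
# Hypothetical query log: query -> how many times it was searched
query_log = {"game of thrones": 50, "game theory": 20, "game of life": 30}

def autocomplete(prefix, log):
    # Keep queries beginning with the typed prefix, most popular first
    matches = [q for q in log if q.startswith(prefix)]
    return sorted(matches, key=lambda q: -log[q])

print(autocomplete("game", query_log))
# ['game of thrones', 'game of life', 'game theory']
```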
3. Voice Assistants
 These days voice assistants are all the rage! Whether it’s
Siri, Alexa, or Google Assistant, almost everyone uses one
of these to make calls, place reminders, schedule
meetings, set alarms, surf the internet, etc. These voice
assistants have made life much easier. But how do they
work?
 They use a complex combination of speech recognition,
natural language understanding, and natural language
processing to understand what humans are saying and
then act on it.
4. Language Translator
 Want to translate a text from English to Hindi but don’t
know Hindi? Well, Google Translate is the tool for you!
While it’s not exactly 100% accurate, it is still a great tool
to convert text from one language to another.
 Google Translate and other translation tools use sequence-to-sequence modeling, a technique in Natural Language Processing. It allows the algorithm to convert a sequence of words from one language into another, which is translation.
5. Sentiment Analysis
 Almost all the world is on social media these days! And
companies can use sentiment analysis to understand
how a particular type of user feels about a particular
topic, product, etc.
 They can use Natural Language Processing,
computational linguistics, text analysis, etc. to
understand the general sentiment of the users for their
products and services and find out if the sentiment is
good, bad, or neutral.
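A minimal sketch of lexicon-based sentiment analysis: count words from small positive and negative word lists (hand-made here, purely for illustration) and label the text good, bad, or neutral. Real sentiment systems use much larger lexicons or trained models.

```python
# Tiny illustrative word lists, not a standard lexicon
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "poor", "sad"}

def sentiment(text):
    words = text.lower().split()
    # Positive words add 1 to the score, negative words subtract 1
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "good"
    if score < 0:
        return "bad"
    return "neutral"

print(sentiment("I love this product, it is great"))   # good
print(sentiment("Terrible service and poor quality"))  # bad
```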
6. Automatic Summarization
 Automatic summarization is useful for gathering data
from social media and other online sources, as well as for
summarizing the meaning of documents and other
written materials.
7. Text classification
 Text classification enables you to classify a document
and organise it to make it easier to find the information
you need or to carry out certain tasks. Spam screening in
email is one example of how text categorization is used.
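Spam screening can be sketched as the simplest possible text classifier: flag a message if it contains any word from a spam word list. The word list below is assumed for illustration; real spam filters learn such signals from labelled data.

```python
# Hypothetical spam indicator words, for illustration only
SPAM_WORDS = {"winner", "lottery", "free", "prize", "urgent"}

def classify(message):
    words = set(message.lower().split())
    # Classify as spam if any spam word appears in the message
    return "spam" if words & SPAM_WORDS else "not spam"

print(classify("You are a lottery winner, claim your prize"))  # spam
print(classify("Meeting moved to 3 pm today"))                 # not spam
```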
TEXT NORMALISATION
The process of converting a text into a canonical (standard) form is known as text normalisation.
For instance, the canonical form of the word “good” can be created from the words “gooood” and “gud”.
Sentence Segmentation
Under sentence segmentation, the whole corpus is divided
into sentences. Each sentence is taken as a different data so
now the whole corpus gets reduced to sentences.
Tokenisation
 After sentence segmentation, each sentence is further divided into tokens.
 Any word, number, or special character that appears in a
sentence is referred to as a token.
 Tokenization treats each word, integer, and special
character as a separate entity and creates a token for
each of them.
Removing Stopwords, Special Characters and Numbers
 In this step, the tokens which are not necessary are
removed from the token list.
 “Stopwords are words that are used frequently in a
corpus but provide nothing useful.”
 Stopwords include a, an, and, or, for, it, is, are, to, into,
on, there etc.
 These words occur the most in any given corpus but tell us very little or nothing about its context or meaning. Hence, to make it easier for the computer to focus on meaningful terms, these words are removed.
Converting text to a common case
 After eliminating the stopwords, we convert the whole
text into a similar case, preferably lower case.
 Python, PYTHON, python, PYThon etc. all are converted
into lowercase ‘python’.
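The normalisation steps above (sentence segmentation, tokenisation, stopword removal, lowercasing) can be sketched in plain Python. This is a minimal illustration: the stopword list is a small sample from the section above, not a standard one.

```python
import re

# Small illustrative stopword list (sampled from the notes above)
STOPWORDS = {"a", "an", "and", "or", "for", "it", "is", "are", "to", "into", "on", "there"}

def normalise(corpus):
    # Sentence segmentation: split the corpus at ., ! and ?
    sentences = [s.strip() for s in re.split(r"[.!?]", corpus) if s.strip()]
    result = []
    for sentence in sentences:
        # Tokenisation: every word becomes a separate token
        tokens = re.findall(r"\w+", sentence)
        # Convert to a common (lower) case, then remove stopwords
        tokens = [t.lower() for t in tokens if t.lower() not in STOPWORDS]
        result.append(tokens)
    return result

print(normalise("Aman and Anil are stressed. Aman went to a therapist."))
# [['aman', 'anil', 'stressed'], ['aman', 'went', 'therapist']]
```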

Stemming: It is the process in which the affixes of words are removed and the words are converted to their base form. The stem obtained may not be a meaningful word in itself.

Word         Affix   Stem
Tries        -es     Tri
Trying       -ing    Try
Sweetening   -ing    Sweeten
Sweetened    -ed     Sweeten
Sweetener    -er     Sweeten

Lemmatization
 Stemming and lemmatization are alternate techniques to
one another because they both function to remove
affixes.
 But the difference between both of them is that in
lemmatization, the word we get after affix removal (also
known as lemma) is a meaningful one.
Word         Affix   Lemma
Tries        -es     Try
Trying       -ing    Try
Sweetened    -ed     Sweeten
Sweetening   -ing    Sweeten
Sweetener    -er     Sweeten
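The difference between the two techniques can be illustrated with a naive suffix-stripping stemmer and a tiny hand-made lemma lookup table. Both are rough sketches for illustration; real systems (e.g. NLTK's PorterStemmer and WordNetLemmatizer) use far more sophisticated rules.

```python
def naive_stem(word):
    # Blindly strip a common affix; the result may not be a real word
    for suffix in ("ing", "ed", "er", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Lemmatisation maps a word to a meaningful base form (the lemma);
# this lookup table is hand-made from the examples above
LEMMAS = {"tries": "try", "trying": "try",
          "sweetened": "sweeten", "sweetening": "sweeten", "sweetener": "sweeten"}

def lemmatise(word):
    return LEMMAS.get(word, word)

print(naive_stem("tries"), lemmatise("tries"))    # tri try  (stem is not a word)
print(naive_stem("trying"), lemmatise("trying"))  # try try
```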
BAG OF WORDS
Bag of Words is an algorithm of Natural Language
processing. In bag of words, we get the occurrences of each
word and construct the vocabulary for the corpus.

The output of bag of words is a list of the words appearing in the corpus, with a number against each word showing how many times that word has occurred in the text body.
Thus, we can say that the bag of words gives us two things:
• A vocabulary of words for the corpus
• The frequency of these words (number of times it has
occurred in the whole corpus).
Here is the step-by-step approach to implement the bag of words algorithm:
Text Normalisation: Collect data and pre-process it
Create Dictionary: Make a list of all the unique words
occurring in the corpus. (Vocabulary)
Create document vectors: For each document in the corpus,
find out how many times the word from the unique list of
words has occurred.
EXAMPLE
Create document vectors for all the documents. Let us go through all the steps
with an example:

Step 1: Collecting data and pre-processing it.


Document 1: Aman and Anil are stressed
Document 2: Aman went to a therapist
Document 3: Anil went to download a health chatbot

Here are three documents having one sentence each. After text normalisation,
the text becomes:

Document 1: [aman, and, anil, are, stressed]


Document 2: [aman, went, to, a, therapist]
Document 3: [anil, went, to, download, a, health, chatbot]

NOTE: Note that no tokens have been removed in the stop words removal step.
It is because we have very little data and since the frequency of all the words is
almost the same, no word can be said to have lesser value than the other.

Step 2: Create Dictionary : Go through all the steps and create a dictionary
i.e., list down all the words which occur in all three documents:

Aman  And  Anil  Are  Stressed  Went  To  A  Therapist  Download  Health  Chatbot

Step 3: Create document vector : In this step, the vocabulary is written in the
top row. Now, for each word in the document, if it matches with the vocabulary,
put a 1 under it.

If the same word appears again, increment the previous value by 1. And if the
word does not occur in that document, put a 0 under it.
Document 1:
Aman And Anil Are Stressed Went To A Therapist Download Health chatbot
1 1 1 1 1 0 0 0 0 0 0 0
Since in the first document, we have words: aman, and, anil, are, stressed. So,
all these words get a value of 1 and the rest of the words get a 0 value.

Document 2:
Aman And Anil Are Stressed Went To A Therapist Download Health chatbot
1 0 0 0 0 1 1 1 1 0 0 0

Document 3:
Aman And Anil Are Stressed Went To A Therapist Download Health chatbot
0 0 1 0 0 1 1 1 0 1 1 1

Combined Table
Aman And Anil Are Stressed Went To A Therapist Download Health chatbot
1 1 1 1 1 0 0 0 0 0 0 0
1 0 0 0 0 1 1 1 1 0 0 0
0 0 1 0 0 1 1 1 0 1 1 1
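The three steps above can be sketched in plain Python using the same three documents; it reproduces the combined table.

```python
# Step 1: the normalised documents from the example above
docs = [
    ["aman", "and", "anil", "are", "stressed"],
    ["aman", "went", "to", "a", "therapist"],
    ["anil", "went", "to", "download", "a", "health", "chatbot"],
]

# Step 2: the dictionary (vocabulary) is every unique word, in first-seen order
vocab = []
for doc in docs:
    for word in doc:
        if word not in vocab:
            vocab.append(word)

# Step 3: one frequency vector per document
vectors = [[doc.count(word) for word in vocab] for doc in docs]

print(vocab)
for v in vectors:
    print(v)
# Row 1: [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0] -- matches Document 1 above
```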
TFIDF: Term Frequency & Inverse Document Frequency
TFIDF helps in identifying the value for each word.

Term frequency (TF) is the frequency of a word in one document. Term frequency can easily be found from the document vector table, as in that table we mention the frequency of each word of the vocabulary in each document.

TF = Number of times the word occurs in a document


Aman And Anil Are Stressed Went To A Therapist Download Health chatbot
1 1 1 1 1 0 0 0 0 0 0 0
1 0 0 0 0 1 1 1 1 0 0 0
0 0 1 0 0 1 1 1 0 1 1 1

Inverse Document Frequency

Document Frequency is the number of documents in which the word occurs, irrespective of how many times it has occurred in those documents.
Talking about Inverse Document Frequency (IDF), it is calculated by dividing the total number of documents in the corpus by the number of documents that contain the word.

Finally, the formula of TFIDF for any word W becomes:

TFIDF(W) = TF(W) * log(IDF(W))

Here, log is to the base of 10.

Note: Use a calculator to calculate the exact value for each word.

Finally, the words have been converted to numbers. These numbers are the TFIDF values of each word for each document. Here, you can see that since we have a small amount of data, even words like ‘are’ and ‘and’ have a high value. But the more documents a word occurs in, the lower its value becomes.
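The TFIDF formula above (TF multiplied by log base 10 of the IDF ratio) can be sketched for the same three documents:

```python
import math

# The normalised documents from the bag of words example
docs = [
    ["aman", "and", "anil", "are", "stressed"],
    ["aman", "went", "to", "a", "therapist"],
    ["anil", "went", "to", "download", "a", "health", "chatbot"],
]

def tfidf(word, doc, docs):
    tf = doc.count(word)                    # term frequency in this document
    df = sum(1 for d in docs if word in d)  # document frequency
    # IDF ratio = total documents / documents containing the word
    return tf * math.log10(len(docs) / df)

# 'and' occurs in only 1 of 3 documents -> 1 * log10(3) ~ 0.477
print(round(tfidf("and", docs[0], docs), 3))
# 'aman' occurs in 2 of 3 documents -> 1 * log10(1.5) ~ 0.176 (more common, lower value)
print(round(tfidf("aman", docs[0], docs), 3))
```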
