
Natural Language Processing
Natural Language Processing, or NLP, is the sub-field of AI focused on enabling computers to understand and process human languages. It deals with how computers, through a program, perform tasks like speech recognition, translation, and the analysis and extraction of large amounts of natural language data, so that a successful interaction can occur between machines and humans and the desired output is produced.
NLP takes data from the natural language that humans speak and use in daily life and operates on it.
Process Involved in NLP
The process of understanding human language is quite difficult for a machine. This process is divided into
five major steps. These steps are:
• Lexical Analysis: In this step, an AI machine identifies and analyses the structure of words in speech and converts them into text. This text is then broken into paragraphs, sentences and words by the lexical analyser.
• Syntactic Analysis: In this step, the converted words and sentences are arranged according to the grammar of the language. This arrangement shows the relation between words.
• Semantic Analysis: In this step, the semantic analyser checks the text for meaningfulness and
draws the meaning from the text.
• Discourse Integration: In this step, meaning of the sentence is drawn according to the meanings of
the preceding sentence and the succeeding sentence.
• Pragmatic Analysis: The actual meaning of the sentence is rechecked in this process.
Identify the mystery animal: Animal Mystery (gamestolearnenglish.com)

Were you able to guess the animal?


If yes, in how many questions were you able to guess it?
If no, how many times did you try playing this game?
What according to you was the task of the machine?
Were there any challenges that you faced while playing this game? If yes, list them down.
What approach must one follow to win this game?
Applications of Natural Language Processing

Since Artificial Intelligence is nowadays becoming an integral part of our lives, its applications are very commonly used by the majority of people in their daily lives. Here are some of the applications of Natural Language Processing which are used in real-life scenarios:
Automatic Summarization
Information overload is a real problem when we need to
access a specific, important piece of information from a
huge knowledge base. Automatic summarization is relevant not only for summarizing the meaning of documents and information, but also for understanding the emotional meanings within the information, such as when collecting data from social media.
Automatic Text Summarization is the process of creating
the most meaningful and relevant summary of voluminous
texts from multiple resources.
• Google News, Blogspot, the Inshorts app and many other apps dealing with data summarization work by using machine learning algorithms that help in producing short and relevant summaries from scattered resources by identifying the important sections in huge textual data. There are two different ways of creating automatic text summarization:
• Extractive summarization: In this the selected text, phrases, sentences or
sections are picked up from the scattered resources and joined appropriately
to form a concise summary.
• Abstractive Summarization: In this, the summary is created by interpreting
the text from multiple resources using advanced NLP techniques. This new
summary may or may not have text, phrases or sentences from the original
documents.
Sentiment Analysis
The goal of sentiment analysis is to identify sentiment among several
posts or even in the same post where emotion is not always explicitly
expressed. Companies use this to identify opinions and sentiment online
to help them understand what customers think about their products and
services. Beyond determining simple polarity, sentiment analysis
understands sentiment in context to help better understand what's behind
an expressed opinion, which can be extremely relevant in understanding
and driving purchasing decisions.
Sentiment Analysis is the process of analyzing digital text to determine whether the emotional tone of the message is positive, negative or neutral.
Text classification
Text classification is defined as classifying the
unstructured text into groups or categories. Text
classification makes it possible to assign predefined
categories to a document and organize it to help
you find the information you need or simplify some
activities.
For example, an application of text categorization is spam filtering in email. Articles can be organized by topic, chat conversations by language, and there are many more uses of text classification.
Virtual Assistants
Nowadays Google Assistant, Cortana, Siri, Alexa, etc. have become an integral part of our lives. Not only can we talk to them, but they also have the ability to make our lives easier. By accessing our data, they can help us keep notes of our tasks, make calls for us, send messages and a lot more. With the help of speech recognition, these assistants can not only detect our speech but also make sense of it. According to recent research, a lot more advancements are expected in this field in the near future.
Chatbots

Intelligent chatbots are offering personalised assistance to customers. Analysts predict that the use of chatbots will grow five times in a year.
Natural Language Processing: Getting Started

Natural Language Processing is all about how machines try to understand and interpret human language and operate accordingly.
How can Natural Language Processing be used to solve the problems around us?
Revisiting the AI Project Cycle
Scenario
• The world is competitive nowadays. People face competition in even the tiniest tasks
and are expected to give their best at every point in time. When people are unable to
meet these expectations, they get stressed and could even go into depression. We get to
hear a lot of cases where people are depressed due to reasons like peer pressure,
studies, family issues, relationships, etc. and they eventually get into something that is
bad for them as well as for others. So, to overcome this, cognitive behavioural therapy
(CBT) is considered to be one of the best methods to address stress as it is easy to
implement on people and also gives good results. This therapy includes understanding
the behaviour and mindset of a person in their normal life. With the help of CBT,
therapists help people overcome their stress and live a happy life.
Problem Scoping

• CBT is a technique used by most therapists to help patients overcome stress and depression. But it has been observed that people do not wish to seek the help of a psychiatrist willingly. They try to avoid such interactions as much as possible. Thus, there is a need to bridge the gap between a person who needs help and the psychiatrist. Let us look at various factors around this problem through the 4Ws problem canvas.
• Who Canvas - Who has the Problem?
• The “Who” block helps you in analysing the people getting affected directly or indirectly by the problem. Under this, you find out who the ‘Stakeholders’ of this problem are and what you know about them. Stakeholders are the people who face this problem and would benefit from the solution.
• What Canvas - What is the Nature of the Problem?
Under the “What” block, you need to look into what you have on hand. At this stage, you need to determine the nature of the problem. What is the problem and how do you know that it is a problem? Under this block, you also gather evidence to prove that the problem you have selected actually exists. Newspaper articles, media announcements, etc. are some examples.
• Where Canvas - Where does the Problem arise?
Having established who is associated with the problem and what the problem actually is, you now need to focus on the context/situation/location of the problem. This block will help you look into the situation in which the problem arises, the context of it, and the locations where it is prominent.
• Why Canvas - Why do you think it is a Problem worth Solving?
Thus, in the “Why” canvas, think about the benefits which the stakeholders would get from the solution and how it would benefit them as well as society.
Data Acquisition
• To understand the sentiments of people, we need to collect their conversational data so the machine can interpret the words that they use and understand their meaning, i.e., in this scenario we need to collect data of people who are going through stress.
• Such data can be collected from various means:
1. Surveys
2. Observing the therapist's sessions
3. Databases available on the internet
4. Interviews, etc.
Data Exploration

Once the textual data has been collected, it needs to be processed and cleaned so that an easier version can be sent to the machine. Thus, the text is normalised through various steps and is reduced to a minimum vocabulary, since the machine does not require grammatically correct statements but the essence of them.
Modelling

Once the text has been normalised, it is then fed to an NLP based AI model. Note that in NLP, modelling requires data pre-processing only, after which the data is fed to the machine. Depending upon the type of chatbot we try to make, there are a lot of AI models available which help us build the foundation of our project.
Evaluation

The trained model is then evaluated, and the accuracy for the same is generated on the basis of the relevance of the answers which the machine gives to the user's responses. To understand the efficiency of the model, the answers suggested by the chatbot are compared to the actual answers.
Chatbots
Chatbot is one of the most common applications of Natural Language Processing. There are a lot of chatbots available and many of them use the same approach as we used in the scenario above. Some links of chatbots are given below to give a clear picture of how NLP can turn out to be a miracle in the world of Artificial Intelligence.
Mitsuku Bot: https://www.pandorabots.com/mitsuku/
CleverBot: https://www.cleverbot.com/
Jabberwacky: http://www.jabberwacky.com/
Haptik: https://haptik.ai/contact-us
Rose: http://ec2-54-215-197-164.us-west-1.compute.amazonaws.com/speech.php
Ochatbot: https://www.ometrics.com/blog/list-of-fun-chatbots/
• Chat means “to make a conversation”; bot means “to make a task automated”. Chatbot, short for “Chat Robot” (also known as a conversational agent), is an AI enabled computer program that communicates with the user in natural human language, either through voice or text, and is used in mobile apps, websites, messages, etc.
• A chatbot can be defined as an application that automates your tasks, like saying good morning when you wake up, telling you the news on a daily basis, helping you choose a less congested route to your school, or ordering a coffee for you on your way back home.
• These chatbots need not be downloaded onto your computers or your phones. They work like a service made available by different companies on a digital platform and do not need any updates or occupy any memory or disk space.
• When you do live chats on websites, you are probably dealing with chatbots. Almost all the companies across the globe are using chatbots to reach a wide range of customers and to cater to their customised needs. For example, you visit an online showroom of party apparel. You are confused about what to buy: a short dress, simple formals, a glittery gown, etc. This difficult situation can be easily handled through chatbots. They will help you make your choice more smoothly depending on your likes and preferences.
• Some of the virtual assistants like Google Assistant, Cortana, Siri, etc. are considered as smarter versions of
chatbots as they handle the conversations as well as other tasks in a smart and intelligent way just like a human
with a brain.
Some chatbots are scripted, or in other words are traditional chatbots, while others are AI-powered and have more knowledge.
There are 2 types of chatbots around us: Script- bot and Smart-bot.
• Script-bot
• Script-bots are simple chatbots with limited functionalities. They are scripted based on a task and follow a predetermined route to get the work done. Since they work around a script, they are easy to make and need little or no programming skill.
• These types of chatbots are best suited for straightforward interactions that make life easier at home, like switching off lights on command, ordering food online, etc. Some companies use them as customer care services for providing interactive Frequently Asked Questions (FAQ) services to customers, and if these chatbots are unable to handle certain queries, they connect a company's customer care employee directly to the customer.
• Smart-bot
• Smart-bots are flexible, powerful, AI based models that have wider functionalities and support machine learning algorithms that make a machine learn from experience. They simulate human-like interactions with the users.
• These chatbots require a lot of programming and work on bigger databases and other resources to help the model understand the context of interactions.
• All the assistants like Google Assistant, Alexa, Cortana, Siri, etc. can be taken as smart-bots as not only can they handle conversations but they can also manage other tasks, which makes them smarter.
Human Language Vs Computer Language
• Human language is the language used by humans to interact with the people around them. They interact using different languages like English, Hindi, Spanish, etc. For example, in English we use nouns, pronouns, verbs, adjectives and many more parts of speech to make a proper sentence used in speech or textual conversation. This conversation, applied with their experience and knowledge of the past, forms an intellectual conversation between two sensible people. Over time, these human languages evolve and develop, and new words are added to the dictionary based on their usage in specific places.
• A language is made up of syntax and semantics, also known as the grammar of the language, which determines how words are arranged in the proper order to make a meaningful sentence.
• Our brain keeps on processing the sounds that it hears around itself and tries to make sense out of them
all the time. Even in the classroom, as the teacher delivers the session, our brain is continuously processing
everything and storing it in some place. Also, while this is happening, when your friend whispers
something, the focus of your brain automatically shifts from the teacher's speech to your friend's
conversation. So now, the brain is processing both the sounds but is prioritizing the one on which our
interest lies.
• The sound reaches the brain through a long channel. As a person speaks, the sound travels from his mouth and goes to the listener's eardrum. The sound striking the eardrum is converted into a neural impulse, gets transported to the brain and then gets processed. After processing the signal, the brain gains an understanding of its meaning. If it is clear, the signal gets stored. Otherwise, the listener asks the speaker for clarity. This is how human languages are processed by humans.
Human Language VS Computer Language
• On the other hand, the computer understands the language of numbers. Everything that is sent to the
machine has to be converted to numbers. And while typing, if a single mistake is made, the computer
throws an error and does not process that part. The communications made by the machines are very
basic and simple.
• Computer language is a language used by programmers to develop a computer program which helps humans interact with an electronic device, the computer. The Central Processing Unit (CPU) is the brain of the computer and it interprets the instructions given to the computer. The response of the computer is in terms of the output or the execution of the instruction.
• The computer interacts in the form of a binary language, the language of 0's and 1's. This language has its own disadvantages: it is quite difficult for humans to interact using binary numbers. So, keeping in mind this nature of interaction with the computer, other programming languages are now available that are closer to human language. Some examples of computer languages are Java, C++, Python, etc.
• Every computer language has its own syntax and semantics which needs to be followed strictly
otherwise the computer gives an error and further processing of the task will not be done.
Syntax and Semantics of a language
Arrangement of the words and meaning
• There are rules in human language. There are nouns, verbs, adverbs and adjectives. A word can be a noun at one time and an adjective at some other time. There are rules to provide structure to a language.
• Syntax refers to the grammatical structure of a sentence. When the structure is present, we can start interpreting the message. One way the computer does this is to use part-of-speech tagging, which allows the computer to identify the different parts of speech.
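As a small illustration, here is a minimal sketch of part-of-speech tagging using the NLTK library (assuming NLTK is installed and its tokenizer and tagger resources have been downloaded; the sentence is an example of our own):

```python
import nltk
# One-time downloads, if the resources are not already present:
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

sentence = "The chatbot helps stressed people relax."
tokens = nltk.word_tokenize(sentence)   # split the sentence into word tokens
tagged = nltk.pos_tag(tokens)           # attach a part-of-speech tag to each token
print(tagged)                           # e.g. [('The', 'DT'), ('chatbot', 'NN'), ...]
```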
• Human communication is complex. There are multiple characteristics of the human language
that might be easy for a human to understand but extremely difficult for a computer to
understand.
• The proper arrangement of words in a sentence or a statement makes up the syntax of a language. Semantics refers to the meaning of a word in a sentence, which helps in interpreting the proper message of the complete structure of words.
Analogy with programming language:
These are some of the challenges we might have to face if we try to teach computers how to understand and interact in human language.

Different syntax, same semantics:
2 + 3 = 3 + 2 (Here the way these statements are written is different, but their meaning is the same, that is 5.)

Different semantics, same syntax:
3/2 (Python 2.7) vs 3/2 (Python 3)
Here the statements written have the same syntax but their meanings are different. In Python 2.7, this statement would result in 1, while in Python 3, it would give an output of 1.5.
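A quick sketch in Python 3 makes this difference visible; the // operator reproduces the old integer-division behaviour:

```python
# Same syntax, different semantics: division in Python
print(3 / 2)    # Python 3 true division -> 1.5
print(3 // 2)   # floor division, the Python 2.7 behaviour of 3/2 -> 1
```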
Multiple Meanings of a word
• English is one of the most widely used natural languages. It is a language where a word can have multiple meanings, and the meaning that fits is chosen according to the context of the statement.
• His face turned red after he found out that he took the wrong bag
• His face turns red after consuming the medicine
• Here the context is important. We understand a sentence almost intuitively, depending on our history of using the
language, and the memories that have been built within. In all these sentences, the word red has been used in
different ways which according to the context of the statement changes its meaning completely. Thus, in natural
language, it is important to understand that a word can have multiple meanings and the meanings fit into the
statement according to the context of it.
• Another example: His future is very bright./ Today the Sun is very bright
• In the above sentences the word bright is playing a different role. This kind of situation can be easily handled by
humans using their intellectual power and through their language skills. Teaching a computer to understand and
interact in human language is a very challenging job.
• Perfect Syntax, no Meaning
• Sometimes, a statement can have a perfectly correct syntax but it does not mean
anything.
• For example: Chickens feed lavishly while the moon drinks tea.
• This statement is correct grammatically but does not make any sense. In Human
language, a perfect balance of syntax and semantics is important for better
understanding.

How does Natural Language Processing overcome these challenges?

• Data Processing
• To make the machine learn and process a sentence in terms of numbers, we first need to follow the pre-processing stage of NLP.
• As the language of computers is numerical, the very first step is to convert our language to numbers. This conversion takes a few steps to happen. The first step is Text Normalisation. Since human languages are complex, we need to first of all simplify them in order to make sure that understanding becomes possible.
• Text Normalisation
• In Text Normalisation, we undergo several steps to normalise the text to a lower level. The whole textual data from all the documents altogether is known as the corpus.
• This is the process of cleaning the textual data by converting the text in such a way that it comes down to a level where its complexity is lower than that of the actual data, or into a standard form. It is considered the pre-processing stage of NLP, as this is the first thing to do before we begin the actual data processing. It helps in reducing the complexity of the language. Slang, short forms, misspelled words, abbreviations, special characters, etc. need to be converted into a canonical form through Text Normalisation.
• For example: B4, beefor, bifare → before; gr8, grt → great
• Sentence Segmentation
Under sentence segmentation, the whole corpus is divided into sentences. Most of the human languages used across the world have punctuation marks to mark the boundaries of a sentence, so this feature helps bring down the complexity of a big data set to a lower, less complicated level for data processing. After this, each sentence is treated as a separate piece of data, so the whole corpus gets reduced to sentences.
• Tokenisation
After segmenting the sentences, each sentence is then further divided into tokens. Token is the term used for any word, number or special character occurring in a sentence. This process is done mainly by finding the boundaries of a word, i.e., where one word ends and the other begins. In English, a space between two words is an important word boundary detector. So under tokenisation, every word, number and special character is considered separately and each of them is now a separate token.
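A minimal sketch of sentence segmentation and tokenisation using NLTK (assuming the library and its tokenizer data are installed; the corpus is built from the example documents used later in this chapter):

```python
import nltk
# nltk.download('punkt')   # one-time download of the tokenizer models, if needed

corpus = "Aman went to a therapist. Anil went to download a health chatbot."

sentences = nltk.sent_tokenize(corpus)                 # sentence segmentation
tokens = [nltk.word_tokenize(s) for s in sentences]    # tokenisation of each sentence

print(sentences)   # ['Aman went to a therapist.', 'Anil went to download a health chatbot.']
print(tokens)      # [['Aman', 'went', 'to', 'a', 'therapist', '.'], ['Anil', ...]]
```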
Removing Stopwords, Special Characters and Numbers
• In this step, the tokens which are not necessary are removed from the token
list. Stopwords are the words which occur very frequently in the corpus
but do not add any value to it.
• Humans use grammar to make their sentences meaningful for the other person to understand. But grammatical words do not add any essence to the information which is to be transmitted through the statement; hence they come under stopwords. These words occur the most in any given corpus but say very little or nothing about its context or meaning. Hence, to make it easier for the computer to focus on meaningful terms, these words are removed.
• Along with these words, a lot of times our corpus might have special
characters and/or numbers like #%&@ etc. Now it depends on the type of
corpus that we are working on whether we should keep them in it or not.
For example, if you are working on a document containing email IDs, then
you might not want to remove the special characters and numbers whereas
in some other textual data if these characters do not make sense, then you
can remove them along with the stopwords.
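A hedged sketch of stopword and special-character removal using NLTK's English stopword list (assuming the 'stopwords' corpus is downloaded; the token list is an example of our own):

```python
from nltk.corpus import stopwords
# nltk.download('stopwords')   # one-time download, if needed

tokens = ['anil', 'went', 'to', 'download', 'a', 'health', 'chatbot', '!', '#']
stop_words = set(stopwords.words('english'))

# Keep tokens that are not stopwords and that are plain words or numbers;
# whether to drop special characters depends on the corpus, as noted above.
filtered = [t for t in tokens if t not in stop_words and t.isalnum()]
print(filtered)    # ['anil', 'went', 'download', 'health', 'chatbot']
```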
Converting text to a common case
• After stopword removal, we convert the whole text into a common case, preferably lower case. This ensures that the case-sensitivity of the machine does not treat the same word as different words just because of different cases.
• This is a very important step, as we want the same word in different cases to be taken as one token so that the program does not become case sensitive.
• For example, all the different forms of hello (HELLO, Hello, heLLo, hello, etc.) would be converted to lower case and hence would be treated as the same word by the machine.
Stemming
• In this step, the remaining words are reduced to their root words. Stemming is the process in which the affixes of words are removed and the words are converted to their base form. This process helps in normalising the text into its root form, but the disadvantage is that it removes the affixes irrespective of whether the resulting base word is a meaningful word or not. It just removes the affixes, and hence it is faster.
Before stemming, some of the base words with affixes are: increases, reserved, planning, programming, engaging, flier.
After stemming, the base words are: increas, reserv, plann, programm, engag, fl.
Some of the above words after stemming do not make any sense and are not considered base words.
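A minimal stemming sketch with NLTK's Porter stemmer (the word list follows the example above; note that the exact stems a particular stemmer produces may differ slightly from the ones shown):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ['increases', 'reserved', 'planning', 'programming', 'engaging', 'flier']

for word in words:
    print(word, '->', stemmer.stem(word))
# The affixes are simply chopped off, so some stems (for example 'increas',
# 'reserv') are not meaningful words themselves.
```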
Lemmatization
• Stemming and lemmatization are alternative processes to each other, as the role of both is the same: removal of affixes. But the difference between them is that in lemmatization, the word we get after affix removal (also known as the lemma) is a meaningful one. Lemmatization makes sure that the lemma is a word with meaning, and hence it takes longer to execute than stemming.
The output for studies after affix removal becomes study instead of studi.
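A minimal lemmatisation sketch with NLTK's WordNet lemmatizer (assuming the 'wordnet' corpus is downloaded):

```python
from nltk.stem import WordNetLemmatizer
# nltk.download('wordnet')   # one-time download, if needed

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('studies'))            # 'study'   (a meaningful base word)
print(lemmatizer.lemmatize('increases'))          # 'increase'
print(lemmatizer.lemmatize('engaging', pos='v'))  # 'engage'  (treated as a verb)
```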
With this we have normalised our text to tokens which are the simplest form of
words present in the corpus. Now it is time to convert the tokens into
numbers. Bag of Words algorithm is used for this.
DATA PROCESSING STEPS
Pre-processing stage of NLP: Text Normalisation - normalise the text to a lower level
Sentence Segmentation - the whole corpus is divided into sentences.
Tokenisation - each sentence is then further divided into tokens.
Removing Stopwords, Special Characters and Numbers
Converting text to a common case
Stemming/Lemmatization - affixes of words are removed and the words are converted to their base form
Convert the text to numbers - Bag of Words algorithm and Term Frequency and Inverse Document Frequency (TFIDF)
Techniques of Natural Language Processing
There are many techniques and tools used in NLP for extracting information, but the ones given below are most commonly used:
• Bag of Words
• Term Frequency and Inverse Document Frequency (TFIDF)
• NLTK
Bag of Words
After the process of text normalisation, the corpus is converted into a normalised corpus, which is just a collection of meaningful words with no sequence.
Bag of Words is an NLP model which helps in extracting features out of the text which can be helpful in machine learning algorithms. It converts text sentences into numeric vectors by returning the unique words along with their number of occurrences. In bag of words, we get the occurrences of each word and construct the vocabulary for the corpus.
• The normalised corpus, which we get after going through all the steps of text processing, is put into the bag of words algorithm. The algorithm returns to us the unique words in the corpus and their occurrences in it, i.e., a list of the words appearing in the corpus and, against each word, the number of times it has occurred in the text body.
• This algorithm is named as Bag of Words because it contains meaningful words (also
known as Tokens) scattered in a dataset just like a bag full of words scattered with no
specific order.
• Thus, we can say that the bag of words gives us two things:
➢A vocabulary of words for the corpus
➢The frequency/occurrence of these words
• “Bag” of words algorithm symbolises that the sequence of sentences or tokens does
not matter in this case as all we need are the unique words and their frequency in it.
Step-by-step approach to implement bag of words algorithm:
• Text Normalisation: Collect data and pre-process it. The collection of data is
processed to get normalised corpus.
• Create Dictionary: Make a list of all the unique words occurring in the normalized
corpus. (Vocabulary)
• Create document vectors: For each document in the corpus, find out how many times
the word from the unique list of words has occurred.
• Create document vectors for all the documents. Repeat Step 3 for all documents in the
corpus to create a "Document Vector Table".
Example
• Step 1: Collecting data and pre-processing it.
• Document 1: Aman and Anil are stressed
• Document 2: Aman went to a therapist
• Document 3: Anil went to download a health chatbot
Here are three documents having one sentence each.
After text normalisation, the text becomes:
• Document 1: [aman, and, anil, are, stressed]
• Document 2: [aman, went, to, a, therapist]
• Document 3: [anil, went, to, download, a, health, chatbot]
Note that no tokens have been removed in the stopwords removal step. It is because we have very little data
and since the frequency of all the words is almost the same, no word can be said to have lesser value than the
other.
• Step 2: Create Dictionary
• Go through all the steps and create a dictionary i.e., list down all the words
which occur in all three documents:
• Dictionary:
aman, and, anil, are, stressed, went, to, a, therapist, download, health, chatbot

Note that even though some words are repeated in different documents, they are all
written just once as while creating the dictionary, we create the list of unique words.
Step 3: Create document vector
• In this step, the vocabulary is written in the top row. Now, for each word in the
document, if it matches with the vocabulary, put a 1 under it. If the same word
appears again, increment the previous value by 1. And if the word does not occur in
that document, put a 0 under it.

Since in the first document we have the words aman, and, anil, are and stressed, all these words get a value of 1 and the rest of the words get a value of 0.
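The resulting document vector table for the corpus is:
aman, and, anil, are, stressed, went, to, a, therapist, download, health, chatbot
Document 1: 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0
Document 2: 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0
Document 3: 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1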
In this table, the header row contains the vocabulary of the corpus and three rows
correspond to three different documents.
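A minimal Python sketch of these steps for the same three documents (pure Python, no external library assumed):

```python
# Bag of words: build the vocabulary and the document vector table
documents = [
    ['aman', 'and', 'anil', 'are', 'stressed'],
    ['aman', 'went', 'to', 'a', 'therapist'],
    ['anil', 'went', 'to', 'download', 'a', 'health', 'chatbot'],
]

# Step 2: create the dictionary of unique words (the vocabulary)
vocabulary = []
for doc in documents:
    for word in doc:
        if word not in vocabulary:
            vocabulary.append(word)

# Step 3: create a document vector for each document
document_vectors = [[doc.count(word) for word in vocabulary] for doc in documents]

print(vocabulary)
for vector in document_vectors:
    print(vector)
```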
Finally, this gives us the document vector table for our corpus. But the tokens have still not been converted to numbers that reflect their value in the corpus. This leads us to the final step of our algorithm: TFIDF.
The text in the corpus can be represented as a plot of the occurrence of words versus their value. If the words have the highest occurrence in
all the documents of the corpus, they are said to
have negligible value hence they are termed as stop
words. These words are mostly removed at the pre-
processing stage only. Now as we move ahead from
the stopwords, the occurrence level drops drastically
and the words which have adequate occurrence in
the corpus are said to have some amount of value
and are termed as frequent words. These words
mostly talk about the document's subject and their
occurrence is adequate in the corpus. Then as the
occurrence of words drops further, the value of
such words rises. These words are termed as rare or
valuable words. These words occur the least but add
the most value to the corpus. Hence, when we look
at the text, we take frequent and rare words into
consideration.
• TFIDF: Term Frequency & Inverse Document Frequency
• This method is considered better than the Bag of Words algorithm because BoW gives the numeric vector of each word in the document, but TFIDF, through its numeric value, gives us the importance of each word in the document, or helps us in identifying the value of each word.
• TFIDF was introduced as a statistical measure of important words in a document. Each word in a document is given a
numeric value.
• Bag of words algorithm gives us the frequency of words in each document we have in our corpus. It gives us an idea
that if the word is occurring more in a document, its value is more for that document. For example, if I have a
document on air pollution, air and pollution would be the words which occur many times in it. And these words are
valuable too as they give us some context around the document. But let us suppose we have 10 documents and all of
them talk about different issues. One is on women empowerment, the other is on unemployment and so on. Do you
think air and pollution would still be one of the most occurring words in the whole corpus? If not, then which words
do you think would have the highest frequency in all of them?
• And, this, is, the, etc. are the words which occur the most in almost all the documents. But these words do not talk
about the corpus at all. Though they are important for humans as they make the statements understandable to us, for
the machine they are a complete waste as they do not provide us with any information regarding the corpus. Hence,
these are termed as stopwords and are mostly removed at the pre-processing stage only.
• Term Frequency
• Term frequency is the frequency of a word in one document.
• Term frequency can easily be found from the document vector table as in
that table we mention the frequency of each word of the vocabulary in each
document.
Document Frequency
Document Frequency is the number of documents in which the word occurs
irrespective of how many times it has occurred in those documents.

It is shown below using the above example:


Inverse Document Frequency
Inverse Document Frequency is obtained when document frequency is in the
denominator and the total number of documents is the numerator.
It is shown below using the above example:
Therefore, the formula of TFIDF for any word W becomes:
TFIDF(W) = TF(W) * log(IDF(W))
Here, log is to the base of 10.

Here, you can see that the document frequency of ‘aman’, ‘anil’, ‘went’, ‘to’ and ‘a’ is 2 as they have occurred in two documents. The rest of them occur in just one document; hence the document frequency for them is one.
Talking about inverse document frequency, we need to put the document frequency in the denominator while the total number of documents is the numerator. Here, the total number of documents is 3, hence the inverse document frequency becomes:
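aman = 3/2, anil = 3/2, went = 3/2, to = 3/2, a = 3/2; and, are, stressed, therapist, download, health, chatbot = 3/1 each.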
• Multiply the IDF values with the TF values. TF values are for each document while the IDF values are for the whole corpus. Hence, we need to multiply the IDF values with each row of the document vector table.
• The IDF value for ‘aman’ is the same in each row, and a similar pattern is followed for all the words of the vocabulary. After calculating all the values, we get:
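(values rounded to three decimal places)
aman, and, anil, are, stressed, went, to, a, therapist, download, health, chatbot
Document 1: 0.176, 0.477, 0.176, 0.477, 0.477, 0, 0, 0, 0, 0, 0, 0
Document 2: 0.176, 0, 0, 0, 0, 0.176, 0.176, 0.176, 0.477, 0, 0, 0
Document 3: 0, 0, 0.176, 0, 0, 0.176, 0.176, 0.176, 0, 0.477, 0.477, 0.477

A minimal sketch that applies the formula TFIDF(W) = TF(W) * log(IDF(W)) to the document vector table built earlier (pure Python; log to the base 10, as above):

```python
import math

vocabulary = ['aman', 'and', 'anil', 'are', 'stressed', 'went', 'to', 'a',
              'therapist', 'download', 'health', 'chatbot']
# Term frequencies: the document vector table from the bag of words step
tf = [
    [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1],
]
total_docs = len(tf)

# Document frequency: the number of documents in which each word occurs
df = [sum(1 for row in tf if row[i] > 0) for i in range(len(vocabulary))]

# TFIDF(W) = TF(W) * log10(total number of documents / document frequency of W)
tfidf = [[row[i] * math.log10(total_docs / df[i]) for i in range(len(vocabulary))]
         for row in tf]

for row in tfidf:
    print([round(value, 3) for value in row])
```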

• Finally, the words have been converted to numbers. These numbers are the value of each word for each document. Here, you can see that since we have a small amount of data, words like ‘are’ and ‘and’ also have a high value. But as the document frequency of a word increases, its IDF and hence its value decreases. That is, for example:
• Total number of documents: 10
• Number of documents in which ‘and’ occurs: 10. Therefore, IDF(and) = 10/10 = 1
• Which means: log(1) = 0. Hence, the value of ‘and’ becomes 0.
• On the other hand, number of documents in which ‘pollution’ occurs: 3. IDF(pollution) = 10/3 = 3.3333...
• Which means: log(3.3333) = 0.522, which shows that the word ‘pollution’ has considerable value in the corpus.
Summary
• Words that occur in all the documents with high term frequencies have the least numeric values and are
considered to be the stopwords.
• For a word to have high TFIDF value, the word needs to have a high term frequency but less document
frequency which shows that the word is important for one document but is not a common word for all
documents.
• These values help the computer understand which words are to be considered while processing the
natural language. The higher the value, the more important the word is for a given corpus.

Applications of TFIDF
• TFIDF is commonly used in the Natural Language Processing domain.
DIY — Do It Yourself !
• The Corpus
• Document 1: We can use health chatbots for treating stress.
• Document 2: We can use NLP to create chatbots and we will be making health chatbots now!
• Document 3: Health Chatbots cannot replace human counsellors now. Yay >< !! @1nteLA!4Y
• Accomplish the following challenges on the basis of the corpus given above. You can use the tools available online for these challenges.
• Sentence Segmentation: https://tinyurl.com/y36hd92n
• Tokenisation: https://text-processing.com/demo/tokenize/
• Stopwords removal: https://demos.datasciencedojo.com/demo/stopwords/
• Lowercase conversion: https://caseconverter.com/
• Stemming: http://textanalysisonline.com/nltk-porter-stemmer
• Lemmatisation: http://textanalysisonline.com/spacy-word-lemmatize
• Bag of Words: Create a document vector table for all documents.
• Generate TFIDF values for all the words.
• Find the words having the highest value and the least value.
NLTK
• The Natural Language Toolkit (NLTK) is one of the most commonly used open-source NLP toolkits. It is made up of Python libraries and is used for building programs that help in the synthesis and statistical analysis of human language. Its text processing libraries perform tokenization, parsing, classification, stemming, tagging and semantic reasoning.
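A short setup sketch for trying the examples in this chapter (assuming a standard Python environment; resource names may vary slightly between NLTK versions):

```python
# Install NLTK once from the command line:  pip install nltk
import nltk

# One-time downloads of the data packages used in the earlier examples
nltk.download('punkt')                       # sentence and word tokenizer models
nltk.download('stopwords')                   # stopword lists
nltk.download('wordnet')                     # lexical database for lemmatisation
nltk.download('averaged_perceptron_tagger')  # part-of-speech tagger model
```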
At a glance
• Natural Language Processing or NLP is the subset of Artificial Intelligence that deals with how computers through a program will perform tasks like speech
recognition, translation, large amounts of Natural language data analysis and extraction so that a successful interaction can occur between the machines and
the humans to give the desired output.
• The process of understanding human language is quite difficult for a machine. This process is divided into five major steps.
• Automatic Text Summarization is the process of creating the most meaningful and relevant summary of voluminous texts from multiple resources.
• Sentiment and emotion analysis application of NLP is very significant as it helps business organizations gain insights on consumers and do a competitive
comparison and make necessary adjustments in the business strategy development.
• Text classification is defined as classifying the unstructured text into groups or categories.
• Chatbot is an AI enabled computer program that communicates with the user in Natural Human Language either through voice or text used in mobile apps,
websites, messages etc.
• Script-bots are simple chatbots with limited functionalities that are scripted based on a task and follow a predetermined route to get the work done.
• Smart-bots are flexible, powerful, AI based models that have wider functionalities and support machine learning algorithms that make a machine learn from experience.
• Making a computer understand a natural language is a complex process. First, we need to understand that humans interact using alphabets and sentences
and machines interact using numbers.
• Bag of Words is a simple and important technique used in Natural Language Processing for extracting features from the textual data.
• TF-IDF method is considered better than the Bag of Words algorithm because BoW gives the numeric vector of each word in the document but TFIDF through its
numeric value gives us the importance of each word in the document.

• The Natural Language Toolkit (NLTK) is one of the most commonly used open-source NLP toolkits, made up of Python libraries and used for building programs that help in the synthesis and statistical analysis of human language.
