Grade 10 Unit 6 - Natural Language Processing

Unit 6: Natural Language Processing

▪Introduction to NLP
▪Applications of NLP
▪Revisiting AI Project Cycle
▪Human Language vs Computer Language
▪Data Processing
▪Bag of Words
What is NLP?
Natural Language Processing (NLP) is a domain of Artificial Intelligence that focuses on the ability of
a computer to understand human language (command) as spoken or written and to give an output by
processing it.

NLP is also a field of study that deals with understanding, interpreting, and manipulating human spoken languages using computers.

Since most of the significant information is written down in natural languages such as English, French and German, NLP helps computers communicate with humans in their own languages and perform other language-related tasks.

In conclusion, NLP makes it possible for computers to read text, hear speech, interpret it, understand the sentiment, and identify the important parts of a text or speech.
Applications of NLP
Automatic Summarization
▪Summarizing the meaning of documents and information
▪Extracting the key emotional information from the text to understand reactions (social media)
News articles, blogs, books, reports, etc. are generally very large documents, and it is quite difficult for a human to read such lengthy documents to extract the basic information in them. Text summarization is used to address this problem. With the help of NLP, a short, concise and meaningful summary can be generated from these documents, which helps a human understand the given document in a short time by reading only its summary.
Sentiment Analysis
▪ Identify sentiments and emotions from one or more posts
▪ Companies use it to identify opinions and sentiments to get feedback
▪ Can be Positive, Negative or Neutral
Customer reviews form a major part of our data. A review can be about any product, website, article, or movie. Sentiment analysis can be used to analyse these reviews: with its help, we can classify each customer review as positive or negative.
Text classification
▪ Assign predefined categories to a document and organize it to help
you find the information you need or simplify some activities.
Eg: Spam filtering in email.
Virtual Assistants
▪By accessing our data, they can help us in keeping notes of our tasks, making calls for us, sending messages, and a lot more. Eg: setting up an alarm, booking a salon appointment, playing your favourite playlist, etc.
▪With speech recognition, these assistants can not only detect our speech but can also make sense of it.
▪A lot more advancements are expected in this field in the near future
Eg: Google Assistant, Cortana, Siri, Alexa, etc
Chatbots
A chatbot is one of the most widely used NLP applications. Chat means "to make conversation"; bot refers to an automated task; chatbot is the short form of "Chat Robot". Nowadays we see chatbots on almost every website, giving automatic responses to customer queries. The major advantage of chatbots is that they respond within seconds and help customers get basic information. Analysts predict that the use of chatbots will grow around five times within a year.
Revisiting the AI Project Cycle
NLP can be used to solve problems around you. Let us now revisit each stage of the AI Project Cycle and understand how we can develop a project using Natural Language Processing for the given scenario:
The world is full of competition in all aspects of life. People face competition in even the tiniest tasks
and are expected to give their best at every point in time. When people are unable to meet these
expectations, they get stressed and could even go into depression.
We get to hear a lot of cases where people are depressed due to reasons like
peer pressure, studies, family issues, relationships, etc. and they eventually
get into something that is bad for them as well as for others. So, to overcome
this, cognitive behavioural therapy (CBT) is considered to be one of the best
methods to address stress as it is easy to implement on people and also gives
good results.

With the help of CBT, therapists help people overcome their stress and live a
happy life.
Stage 1: Problem Scoping
CBT is a technique used by most therapists to help patients overcome stress and depression. However, it has been observed that people do not willingly seek the help of a psychiatrist.
4Ws problem canvas for this scenario
Who Canvas – Who has the problem?
Who are the stakeholders? People who suffer from stress and are on the onset of depression.
What do we know about them? People who are going through stress are reluctant to consult a psychiatrist.
What Canvas – What is the nature of the problem?
What is the problem? People who need help are reluctant to consult a psychiatrist and hence suffer miserably.
How do you know it is a problem? Studies around mental stress and depression are available on various authentic sources.
Where Canvas – Where does the problem arise?
What is the context/situation in which the stakeholders experience this problem? When they are going through a stressful period of time due to some unpleasant experiences.
Why Canvas – Why do you think it is a problem worth solving?
What would be of key value to the stakeholders?
▪ People get a platform where they can talk and vent out their feelings anonymously.
▪ People get a medium that can interact with them, apply primitive CBT on them and suggest help whenever needed.
How would it improve their situation?
▪ People would be able to vent out their stress.
▪ They would consider going to a psychiatrist whenever required.
The problem statement template for this scenario goes as follows, and it leads to the goal of our project:

Our (Who?): People undergoing stress
Have a problem of (What?): Not being able to share their feelings
While (Where?): They need help in venting out their emotions
An ideal solution would be (Why?): To provide them a platform to share their thoughts anonymously and suggest help whenever required
The goal of our project therefore becomes:
Create a chatbot which can interact with people, help them vent out their feelings and take them through primitive CBT.
Stage 2: Data Acquisition
After filling in the 4Ws problem canvas, the next step is to acquire data. In this scenario, we need to collect data of people who are going through stress. This can be collected in the following ways:
• Surveys and interviews of people in various stages of life, conducted in online and offline modes.
• Observations from a therapist's clinic.
• Web scraping: data collected from the web about people who are looking for assistance in stress management and ways to vent out.
Stage 3: Data Exploration
Once the data is collected, it needs to be cleaned and processed before it can be fed into the machine. Hence the collected data goes through multiple steps where it is explored and made fit for testing the model.

In the above scenario, once the data from the therapist's clinic and the interviews is collected, it is filtered to keep only the records of people who are going through stress.
Stage 4: Data Modelling
Once the data is processed and ready for the machine, it is time to make a model for the project. On the basis of the chatbot to be made, a suitable AI project model is prepared to achieve the AI project goals.

Stage 5: Data Evaluation


The model is now tested with the testing data. It is evaluated for the accuracy of the answers which the machine gives to the user's responses. The AI model is then evaluated and compared to see its efficiency.

As shown in the diagram below, the blue line represents the model's output while the green one is the actual (true) output, along with the data samples.
In the first one the model’s output does not match the true function at all. Hence the model is said to be
underfitting and its accuracy is lower.
In the second one, the model’s performance matches well with the true function which states that the model has
optimum accuracy and the model is called a perfect fit.
In the third case, model performance is trying to cover all the data samples even if they are out of alignment to
the true function. This model is said to be overfitting and this too has a lower accuracy.
Once the model is evaluated thoroughly, it is then deployed in the form of an app which people can use easily.
Mitsuku Bot : https://www.pandorabots.com/mitsuku/
Clever Bot : https://www.cleverbot.com/
Jabberwacky : http://www.jabberwacky.com/
Haptik: https://haptik.ai/contact-us
Rose: http://ec2-54-215-197-164.us-west-1.compute.amazonaws.com/speech.php
Ochatbot : https://www.ometrics.com/blog/list-of-fun-chatbots/
There are 2 types of chatbots around us: Script-bot and Smart-bot
Script-bot | Smart-bot
Script bots are easy to make | Smart bots are flexible and powerful
Script bots work around a script which is programmed into them | Smart bots work on bigger databases and other resources directly
Mostly they are free and are easy to integrate into a messaging platform | Smart bots learn with more data
No or little language processing skills are needed | Coding is required to take them on board
Limited functionality | Wide functionality
Example: Google story speaker, customer care bots | Example: Google Assistant, Alexa, Cortana, Siri
Human Language vs Computer Language
Human Language | Computer Language
It is the language used by humans to interact with the people around them | It is the language used by programmers to develop computer programs
Humans interact using different human languages | Computers interact using binary language, in the form of 0s and 1s
Human language includes nouns, verbs, adverbs and adjectives | Computer language includes syntax and semantic formats
Human language ignores mistakes; in human language, a perfect balance of syntax and semantics is important for better understanding | Computer language throws an error and does not process that part if a single mistake is made
Examples of human languages are English, Hindi, French, German and Spanish | Examples of computer languages are Python, Java, C and C++
Syntax
▪Syntax refers to the grammatical structure of a sentence.
▪The proper arrangement of words in a sentence or a statement makes the syntax of a language.
Semantics
▪Semantics refers to meaning of a sentence.
▪Semantics helps in interpreting the proper message of the complete structure of words.
Example:
Different syntax, same semantics: 2 + 3 = 3 + 2. Whichever way it is written, the answer is 5.
Different semantics, same syntax: 3/2 in Python 2.7 ≠ 3/2 in Python 3.
Here we have the same syntax but the meanings are different: in Python 2.7 this statement would result in 1, while in Python 3 it would give an output of 1.5.
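As a quick illustration of the division example above (shown in Python 3 syntax, which is the version commonly used today):

```python
# Same-looking expression, different semantics across Python versions.
print(3 / 2)    # 1.5 -> true division in Python 3
print(3 // 2)   # 1   -> floor division, matching what Python 2.7 returned for 3/2
```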
What is a Corpus?
We will be working on collections of written text, that is, on text from multiple documents. The term used for the whole textual data from all the documents taken together is corpus.
What is a Token?
Token is the term used for any word, number or special character occurring in a sentence.
Multiple Meanings of a word
English is a widely used natural language. It is a language in which a word can have multiple meanings, and the meaning that fits the statement depends on its context.
For example:
1.His face turned red after he found out that he took the wrong bag
2.The red car zoomed past his nose
3.His face turns red after consuming the medicine

i) His future is very bright
ii) Today the sun is very bright
In the above sentences, the same word is used with different meanings. Humans use their intellect and language skills to handle such situations.

Making a machine understand human language and interact in it is a very challenging task; NLP makes it possible.
How does Natural Language Processing do this magic?
Data Processing
Humans interact with each other very easily. For us, the natural languages that we use are so
convenient that we speak them easily and understand them well too. But for computers, our
languages are very complex.

Since we know that the language of computers is numerical, the very first step that comes to mind is to convert our language into numbers. This conversion happens in a few steps.
Pre-processing of data
The first step in making an NLP model is the pre-processing of data. The text data that we have is in raw form and can contain many errors along with a lot of undesirable text, because of which the model will not give accurate results. To get better outcomes, it is necessary to pre-process the data and make it easier to understand and analyse.
Text preprocessing is an essential step in natural language processing (NLP) that
involves cleaning and transforming unstructured text data to prepare it for analysis.
The steps to perform pre-processing of data in NLP include:
▪ Text Normalisation
▪ Sentence Segmentation
▪ Tokenization
▪ Removing Stop Words
▪ Converting text to a common case
▪ Stemming
▪ Lemmatization
Text Normalisation
Text Normalisation helps in cleaning up the textual data in such a way that it comes down to a level where its complexity is lower than that of the actual data. Simplifying human language so that it can be easily understood by computers is called Text Normalisation.

Example (words and their canonical forms):
Word       Canonical form
Gud        Good
B4         Before
btw        By the way
Ty         Thank you
2mrow      Tomorrow
Gr8, grt   Great
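A minimal sketch of how such a normalisation step could be implemented; the slang_map dictionary below is a hypothetical lookup table built from the examples in the table above, not part of any standard library.

```python
# Hypothetical lookup table mapping short forms to their canonical forms.
slang_map = {
    "gud": "good", "b4": "before", "btw": "by the way",
    "ty": "thank you", "2mrow": "tomorrow", "gr8": "great", "grt": "great",
}

def normalise(text):
    # Replace every known short form with its canonical form, word by word.
    return " ".join(slang_map.get(word.lower(), word) for word in text.split())

print(normalise("Gud work btw see you 2mrow"))
# -> good work by the way see you tomorrow
```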
Sentence Segmentation:
You first need to break the entire document down into its constituent sentences. You can do this by segmenting the document at punctuation marks like full stops. In this way, the whole corpus gets reduced to sentences.

Sentence Segmentation: https://tinyurl.com/y36hd92n


Tokenisation:
In this step, we break our text data down into the smallest units, called tokens. Generally, our dataset consists of long paragraphs which are made up of many lines, and the lines are made up of words. Since it is difficult to analyse long paragraphs directly, we first decompose the paragraphs into separate lines and then decompose the lines into words. This is called tokenisation, and each word is called a token. Tokenisation treats each word, number, and special character as a separate entity and creates a token for each of them.

Tokenisation: https://text-processing.com/demo/tokenize/
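A minimal sketch of both steps using the NLTK library (an assumption here; the linked demos above use their own tools). It assumes NLTK is installed and the 'punkt' tokenizer data has been downloaded.

```python
import nltk
nltk.download("punkt", quiet=True)  # one-time download of the sentence tokenizer data

from nltk.tokenize import sent_tokenize, word_tokenize

corpus = "This is my 1st experience of Text Mining. I have learnt new techniques in this."

sentences = sent_tokenize(corpus)               # sentence segmentation
tokens = [word_tokenize(s) for s in sentences]  # tokenisation of each sentence

print(sentences)
print(tokens)
```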
Removing Stop Words:
In this step, the tokens which are not necessary are removed from the token list.
You can make the learning process faster by getting rid of non-essential words which add little meaning to the statement and are only there to make it sound more cohesive. Words such as was, is, the, a, an, and, or, for, it are called stop words and can be removed.

Stopwords removal: https://demos.datasciencedojo.com/demo/stopwords/
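A minimal sketch of stop-word removal, again assuming NLTK and its 'stopwords' corpus are available; the token list is taken from the exercise corpus used later in this unit.

```python
import nltk
nltk.download("stopwords", quiet=True)
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))
tokens = ["this", "is", "my", "1st", "experience", "of", "text", "mining"]

# Keep only the tokens that are not in the stop-word list.
filtered = [t for t in tokens if t.lower() not in stop_words]
print(filtered)   # ['1st', 'experience', 'text', 'mining']
```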

Converting text to a common case
After eliminating the stop words, we convert the whole text to a single case, preferably lower case. This makes sure that the machine's case-sensitivity does not treat the same word as two different words merely because of varied case usage.
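In code this step is a one-liner; a small illustrative sketch:

```python
tokens = ["Text", "Mining", "Learnt", "New", "Techniques"]

# Convert every token to lower case so that "Text", "TEXT" and "text" are treated alike.
tokens = [t.lower() for t in tokens]
print(tokens)   # ['text', 'mining', 'learnt', 'new', 'techniques']
```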
Stemming:
Stemming is the process of obtaining the word stem of a word. A word stem is the base part of a word to which affixes are added to form new words.
Stemming is a technique used to reduce an inflected word down to its word stem. For example, the words "programming", "programmer" and "programs" can all be reduced down to the common word stem "program". In other words, "program" can be used to represent the three inflected words above.
Stemming: http://textanalysisonline.com/nltk-porter-stemmer
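A minimal sketch using NLTK's PorterStemmer (one of several available stemmers; exact stems can differ slightly between stemmer implementations):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["programming", "programs", "studies", "studied"]:
    print(word, "->", stemmer.stem(word))
# programming -> program, programs -> program, studies -> studi, studied -> studi
```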

Lemmatisation:
Lemmatisation is the process of obtaining the root stem (lemma) of a word. The root stem is the base form of a word that is present in the dictionary and from which the word is derived. You can also identify the base word for different forms of a word based on tense, mood, gender, etc.
Lemmatisation: http://textanalysisonline.com/spacy-word-lemmatize
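A minimal sketch using NLTK's WordNetLemmatizer (an assumption; the linked demo uses spaCy). It needs the 'wordnet' corpus, and passing the part of speech helps it pick the correct dictionary form:

```python
import nltk
nltk.download("wordnet", quiet=True)
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("studies"))           # study (treated as a noun by default)
print(lemmatizer.lemmatize("caring", pos="v"))   # care
print(lemmatizer.lemmatize("studied", pos="v"))  # study
```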
Difference between Stemming and Lemmatization:
Stemming | Lemmatization
Stemming is a process that removes the last few characters from a word, often leading to incorrect meanings and spellings | Lemmatization considers the context and converts the word to its meaningful base form, which is called the lemma
Stemming chops off word endings without considering linguistic context, making it computationally faster | Lemmatization analyses word forms to determine the base or dictionary form, which takes more processing time
Eg: for cared, caring, care, stemming would return "car", which is not the right meaning | Eg: for cared, caring, care, lemmatization would return "care", which gives the right meaning
For studies, studied, study, stemming would return "studi", which is not the right meaning | For studies, studied, study, lemmatization would return "study", which gives the right meaning

Application of stemming and lemmatization


Stemming and lemmatization are used in natural language processing tasks such as information retrieval, text
mining, sentiment analysis, and search engines to reduce words to their base or root forms for better analysis and
understanding.
1. Perform the given operation:

Word      Operation        Output
Knives    Stemming
Caring    Stemming
Caring    Lemmatization
Healing   Lemmatization
Taking    Stemming
Taking    Lemmatization
Studies   Stemming
Studies   Lemmatization

2.Perform Data processing on the following corpus


This is my 1st experience of Text Mining. I have Learnt New Techniques in this.
1.Sentence Segmentation
2.Tokenization
3.Stop words removal
4.Lower case conversion
There are many techniques used in NLP for extracting information, but the three given below are the most commonly used:
1. Bag of Words (BoW)
2. Term Frequency and Inverse Document Frequency (TFIDF)
3. Natural Language Toolkit (NLTK)
Bag of Words
Bag of Words is a Natural Language Processing model which helps in extracting features out of text that can then be used by machine learning algorithms. It is used for finding the frequency of words in a text sample. In Bag of Words, we count the occurrences of each word and construct the vocabulary for the corpus.

The bag of words gives us two things:


▪ A vocabulary of words for the corpus
▪ The frequency of these words (number of times it has occurred in the whole corpus).
The algorithm is called a "bag" of words because the sequence of the sentences or tokens does not matter; all we need are the unique words and their frequencies.
Step-by-step approach to implement the Bag of Words algorithm:
1. Text Normalisation: Collecting data and pre-processing it
2. Create Dictionary: Making a list of all the unique words occurring in the corpus. (Vocabulary)
3. Create document vectors: For each document in the corpus, find out how many times the word from
the unique list of words has occurred.
4. Create document vectors for all the documents.
Example: Step 1: Collecting data and pre-processing it.
Raw Data                                            Processed Data (text after normalisation)
Document 1: Aman and Anil are stressed              Document 1: [aman, and, anil, are, stressed]
Document 2: Aman went to a therapist                Document 2: [aman, went, to, a, therapist]
Document 3: Anil went to download a health chatbot  Document 3: [anil, went, to, download, a, health, chatbot]
Note that no tokens have been removed in the stopwords removal step. It is because we have
very little data and since the frequency of all the words is almost the same, no word can be
said to have lesser value than the other.
Step 2: Create Dictionary
Dictionary in NLP means a list of all the unique words occurring in the corpus. If some words are
repeated in different documents, they are all written just once while creating the dictionary.

Step 3: Create document vector


The document vector contains the frequency of each word of the vocabulary in a particular document. In the document vector table, the vocabulary is written in the top row.
▪ Now, for each word in the document, if it matches the vocabulary, put a 1 under it.
▪ If the same word appears again, increment the previous value by 1.
▪ And if the word does not occur in that document, put a 0 under it.

Since in the first document, we have words: aman, and, anil, are, stressed. So, all these words get a
value of 1 and rest of the words get a 0 value.
Step 4: Creating a document vector table for all documents
Same exercise has to be done for all the documents. Hence, the table becomes:

In this table, the header row contains the vocabulary of the corpus and three rows correspond to
three different documents.
Finally, this gives us the document vector table for our corpus. However, these raw counts do not yet tell us how valuable each token is. This leads us to the final step of our algorithm: TFIDF.
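Before moving on, here is a minimal sketch of the four Bag of Words steps above in plain Python (no external library assumed), using the three normalised documents from the example:

```python
# Step 1: the documents are assumed to be already normalised and tokenised.
documents = [
    ["aman", "and", "anil", "are", "stressed"],
    ["aman", "went", "to", "a", "therapist"],
    ["anil", "went", "to", "download", "a", "health", "chatbot"],
]

# Step 2: create the dictionary (vocabulary) of unique words in the corpus.
vocabulary = []
for doc in documents:
    for word in doc:
        if word not in vocabulary:
            vocabulary.append(word)

# Steps 3 and 4: create a document vector for every document by counting how
# many times each vocabulary word occurs in that document.
document_vectors = [[doc.count(word) for word in vocabulary] for doc in documents]

print(vocabulary)
for vector in document_vectors:
    print(vector)
```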
Perform the step-by-step approach to implement the BoW (Bag of Words) algorithm on the following documents.
[Hint: i) Text Normalisation ii) Create Dictionary iii) Create a Document Vector]
Question 1:
Document 1 : Rohan and Riya are twins.
Document 2 : Rohan lives with his uncle in Chicago.
Document 3 : Riya lives with her parents in Dubai.

Question 2:
Document 1 : I like math.
Document 2 : I also like science.
Document 3 : Math and science are good scoring subject.
TFIDF:
TFIDF stands for Term Frequency & Inverse Document Frequency
Term Frequency
▪ Term frequency is the frequency of a word in one document.
▪ Term frequency can easily be found in the document vector table

Document Frequency
The number of documents in which the word occurs irrespective of how many times it has occurred in
those documents is called document frequency

From the table, we can observe the following:


▪Document frequency of ‘aman’, ‘anil’, ‘went’, ‘to’ and ‘a’ is 2 as they have occurred in two documents.
▪Rest of them occurred in just one document hence the document frequency for them is one.
Inverse Document Frequency
In the case of inverse document frequency, we need to put the document frequency in the denominator
while the total number of documents is the numerator.
Inverse Document Frequency = Total number of documents / Document frequency

Formula of TFIDF for any word W becomes TFIDF(W) = TF(W) * log( IDF(W) )

Now let us multiply the IDF values with the TF values. Note that the TF values are for each document while the IDF values are for the whole corpus. Hence, we need to multiply the IDF values with each row of the document vector table.
After calculating all the values, we get:

Finally, the words have been converted to numbers. These numbers are the values of each word for each document. Here you can see that since we have a small amount of data, even words like 'are' and 'and' have a high value.
As a word occurs in more documents, its IDF ratio, and hence its value, decreases. For example, IDF(aman) = 3/2 = 1.5 and log(1.5) = 0.176, which shows that the word 'aman' still has considerable value in the corpus.
Summarising the concept, we can say that:
1. Words that occur in all the documents with high term frequencies have the least values and are
considered to be the stopwords.
2. For a word to have high TFIDF value, the word needs to have a high term frequency but less
document frequency which shows that the word is important for one document but is not a common
word for all documents.
3. These values help the computer understand which words are to be considered while processing the
natural language. The higher the value, the more important the word is for a given corpus.
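A minimal sketch of the TFIDF calculation summarised above, in plain Python. It reuses the example corpus and applies TFIDF(W) = TF(W) * log(total documents / document frequency of W), with the log taken to base 10 as in the worked example:

```python
import math

documents = [
    ["aman", "and", "anil", "are", "stressed"],
    ["aman", "went", "to", "a", "therapist"],
    ["anil", "went", "to", "download", "a", "health", "chatbot"],
]

vocabulary = sorted({word for doc in documents for word in doc})
total_docs = len(documents)

# Document frequency: the number of documents in which each word occurs at least once.
doc_freq = {w: sum(1 for doc in documents if w in doc) for w in vocabulary}

# TFIDF per document: term frequency multiplied by log10(total documents / document frequency).
for doc in documents:
    tfidf = {w: doc.count(w) * math.log10(total_docs / doc_freq[w]) for w in vocabulary}
    print({w: round(value, 3) for w, value in tfidf.items() if value > 0})
```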
Applications of TFIDF
TFIDF is commonly used in the Natural Language Processing domain. The TF-IDF value is used to rank the significance of terms in a document and is often employed in tasks like information retrieval, text classification, and search engines.
Through a step-by-step process, calculate TFIDF for the documents below.
Question 1:
Document 1: It is better to learn Data Science.
Document 2: It is best to learn Data Science.
Document 3: It is excellent to learn Data Science.

Question 2:
Document 1: Jack and Jill went up the hill
Document 2: Jack fell down
Document 3: Jill broke down
1. What is NLP and list any 4 applications of NLP
2. List down the difference between Script-bot and Smart-bot.
3. Distinguish between human language and computer language
4. Differentiate between Stemming and Lemmatization
5. What is BoW? What is its use?
6. What are the different stages followed to implement the Bag of Words algorithm?
