
UNIT 6: NATURAL LANGUAGE PROCESSING

1. Define NLP.
NLP is a subfield of Linguistics, Computer Science, Information Engineering,
and Artificial Intelligence concerned with the interactions between computers
and human (natural) languages, in particular how to program computers to
process and analyse large amounts of natural language data.
2. Explain the Applications of Natural Language Processing.
I. Automatic Summarization
Automatic summarization is relevant not only for summarizing the meaning of documents and information, but also for understanding the emotional meaning within the information, such as when collecting data from social media.
E.g. an overview of a news item or blog post
II. Sentiment Analysis:
The goal of sentiment analysis is to identify sentiment among several
posts or even in the same post where emotion is not always explicitly
expressed.
Companies use Natural Language Processing applications, such as
sentiment analysis, to identify opinions and sentiment online to help
them understand what customers think about their products and
services.

III. Text classification:
Text classification makes it possible to assign predefined categories to
a document and organize it to help you find the information you need or
simplify some activities. For example, an application of text
categorization is spam filtering in email.
IV. Virtual Assistants:
By accessing our data, virtual assistants can help us keep notes of our tasks, make calls for us, send messages and much more. With the help of speech recognition, these assistants can not only detect our speech but also make sense of it.
E.g. Alexa, Cortana, Siri, Google Assistant
3. What is a Chatbot?
A chatbot is a computer program that is designed to simulate human conversation through voice commands, text chats, or both. E.g. Mitsuku Bot, Jabberwacky, CleverBot, Haptik, Rose, Ochatbot, etc.
4. Differentiate between a script-bot and a smart-bot.
5. Differentiate between Human Language and Computer Language.
Humans communicate through language, which we process all the time. Our brain keeps processing the sounds it hears around it and tries to make sense of them. On the other hand, the computer understands the language of numbers: everything that is sent to the machine has to be converted to numbers, and while typing, if a single mistake is made, the computer throws an error and does not process that part. The communications made by machines are very basic and simple.

6. Give an example of each of the following: multiple meanings of a word; perfect syntax, no meaning.
Example of multiple meanings of a word:
His face turns red after consuming the medicine.
Meaning - Is he having an allergic reaction? Or is he not able to bear the taste of that medicine?

Example of perfect syntax, no meaning:
Chickens feed extravagantly while the moon drinks tea.
This statement is grammatically correct, but it does not make any sense. In human language, a perfect balance of syntax and semantics is important for better understanding.
7. Define POS.
Part-of-speech (POS) tagging is a popular Natural Language Processing task in which the words in a text (corpus) are categorized according to a particular part of speech, depending on the definition of the word and its context.
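As a quick illustration, here is a minimal POS-tagging sketch using the NLTK library (an assumption; the notes do not prescribe a tool). The tagger data must be downloaded once with nltk.download.

```python
# Minimal POS-tagging sketch using NLTK (assumed library, not prescribed by the notes).
# One-time setup: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
import nltk

sentence = "The quick brown fox jumps over the lazy dog"
tokens = nltk.word_tokenize(sentence)   # split the sentence into word tokens
tags = nltk.pos_tag(tokens)             # assign a part-of-speech tag to each token
print(tags)
# Roughly: [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'), ('jumps', 'VBZ'), ...]
```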

8. What are the steps of Text Normalization? Explain them in brief.
In Text Normalization, we undergo several steps to normalize the text to a lower level.

1. Sentence Segmentation –
Under sentence segmentation, the whole corpus is divided into
sentences. Each sentence is taken as a different data so now the whole
corpus gets reduced to sentences.
2. Tokenisation - After segmenting the sentences, each sentence is further divided into tokens. A token is any word, number or special character occurring in a sentence. Under tokenisation, every word, number and special character is considered separately, and each of them is now a separate token.
3. Removing Stop words, Special Characters and Numbers - In this
step, the tokens which are not necessary are removed from the token list.
Stop words are the words which occur very frequently in the corpus but
do not add any value to it.
4. Converting text to a common case - After stop word removal, we convert the whole text into a common case, preferably lower case. This ensures that the machine does not treat the same words as different just because of a difference in case.
5. Stemming -In this step, the remaining words are reduced to their root
words. In other words, stemming is the process in which the affixes of
words are removed and the words are converted to their base form.
6. Lemmatization - In lemmatization, the word we get after affix removal (also known as the lemma) is a meaningful one. With this, we have normalized our text to tokens, which are the simplest form of words present in the corpus. Now it is time to convert the tokens into numbers; for this, we use the Bag of Words algorithm.
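The steps above can be tried out with a small script. The sketch below is only illustrative: it assumes the NLTK library (with its punkt, stopwords and wordnet data already downloaded), and the sample sentence is made up.

```python
# Illustrative text-normalization sketch (assumes NLTK with 'punkt', 'stopwords'
# and 'wordnet' data already downloaded; any equivalent tool could be used instead).
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

corpus = "My brother loves math. He also likes reading 2 books on science."

# 1. Sentence segmentation: split the corpus into sentences
sentences = nltk.sent_tokenize(corpus)

# 2. Tokenisation: split each sentence into tokens
tokens = [tok for sent in sentences for tok in nltk.word_tokenize(sent)]

# 3. Remove stop words, special characters and numbers
stop_words = set(stopwords.words("english"))
tokens = [t for t in tokens if t.isalpha() and t.lower() not in stop_words]

# 4. Convert the remaining tokens to a common (lower) case
tokens = [t.lower() for t in tokens]

# 5. Stemming: crude affix removal, may produce non-words
stemmer = PorterStemmer()
print([stemmer.stem(t) for t in tokens])

# 6. Lemmatization: affix removal that keeps a meaningful dictionary word (lemma)
lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(t) for t in tokens])
```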
9. Rahul has been given the task of text normalisation. Help him
normalise the text in the segmented sentences given below:
Document 1: My brother loves math and science.
Document 2: My brother likes to read books on science and listen to rock
music.
Step 1: Tokenisation
[My, brother, loves, math, and, science, likes, to, read, books, on, science,
and, listen, to, rock, music]
Step 2: Removing stop words
[My, brother, loves, math, science, likes, read, books, listen, rock, music]
Step 3: Converting text to common case
[my, brother, loves, math, science, likes, read, books, listen, rock, music]
Step 4: Stemming/Lemmatization
[my, brother, love, math, science, like, read, book, listen, rock, music]
10. Define Bag of Words.
Bag of Words is a Natural Language Processing model which helps in
extracting features out of the text which can be helpful in machine learning
algorithms. In bag of words, we get the occurrences of each word and
construct the vocabulary for the corpus.
11. Describe the steps to implement bag of words algorithm.
Step-by-step approach to implement bag of words algorithm:
1. Text Normalisation: Collect data and pre-process it
2. Create Dictionary: Make a list of all the unique words occurring in the
corpus. (Vocabulary)
3. Create document vectors: For each document in the corpus, find out how many times each word from the unique list of words (the vocabulary) has occurred.
4. Create document vectors for all the documents.
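A minimal sketch of these four steps in plain Python is given below; the two sample documents are made up for illustration and assumed to be already normalized.

```python
# Minimal Bag of Words sketch following the four steps above (plain Python).
# The documents are illustrative and assumed to be already normalized.
docs = [
    ["cats", "like", "milk"],
    ["dogs", "like", "bones", "and", "milk"],
]

# Step 2: create the dictionary (vocabulary) of unique words in the corpus
vocabulary = []
for doc in docs:
    for word in doc:
        if word not in vocabulary:
            vocabulary.append(word)

# Steps 3 and 4: create a document vector for every document
vectors = [[doc.count(word) for word in vocabulary] for doc in docs]

print(vocabulary)  # ['cats', 'like', 'milk', 'dogs', 'bones', 'and']
print(vectors)     # [[1, 1, 1, 0, 0, 0], [0, 1, 1, 1, 1, 1]]
```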
12. Create a document vector table from the following documents by
implementing all the four steps of Bag of words model.
Document 1: Aman and Anil are stressed
Document 2: Aman went to a therapist
Document 3: Anil went to download a health chatbot
Solution:
Step 1: Collecting data and pre-processing it.
Document 1: [aman, and, anil, are, stressed]
Document 2: [aman, went, to, a, therapist]
Document 3: [anil, went, to, download, a, health, chatbot]
Step 2: Create Dictionary (list of all the unique words in the corpus)
[aman, and, anil, are, stressed, went, to, a, therapist, download, health, chatbot]

Step 3: Create a document vector (Document 1)

aman  and  anil  are  stressed  went  to  a  therapist  download  health  chatbot
  1    1    1     1      1       0    0   0      0         0        0        0

Step 4: Repeat for all the documents

              aman  and  anil  are  stressed  went  to  a  therapist  download  health  chatbot
Document 1     1    1    1     1      1       0    0   0      0         0        0        0
Document 2     1    0    0     0      0       1    1   1      1         0        0        0
Document 3     0    0    1     0      0       1    1   1      0         1        1        1
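If a library is available, the same table can be cross-checked with scikit-learn's CountVectorizer (an assumed dependency; the manual steps above need no library). Note that CountVectorizer sorts its vocabulary alphabetically, so the column order differs from the table above.

```python
# Optional cross-check of the Bag of Words table using scikit-learn (assumed dependency).
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "Aman and Anil are stressed",
    "Aman went to a therapist",
    "Anil went to download a health chatbot",
]

# The custom token_pattern keeps one-letter words such as "a",
# which the default pattern would drop.
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
matrix = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())  # vocabulary, sorted alphabetically (recent scikit-learn)
print(matrix.toarray())                    # one document vector per row
```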

13. Define the following terms:
Document Vector Table - a table containing the frequency of each word of the vocabulary in each document.
Term Frequency - the frequency of a word in one document.
Document Frequency - the number of documents in which a word occurs, irrespective of how many times it occurs in those documents.
Inverse Document Frequency - for inverse document frequency, the document frequency is put in the denominator and the total number of documents in the numerator:
Inverse Document Frequency = total number of documents / document frequency

14. Explain the relation between occurrence and value of a word.
As shown in the graph, the occurrence and the value of a word are inversely proportional. The words which occur most (like stop words) have negligible value. As the occurrence of words drops, the value of such words rises. These words are termed rare or valuable words; they occur the least but add the most value to the corpus.
15. Classify each of the images according to how well the model’s output
matches the data samples:

Here, the red dashed line is model’s output while the blue crosses are actual
data samples.
1. The model's output does not match the true function at all. Hence the model is said to be underfitting and its accuracy is lower.
2. In the second case, the model tries to cover all the data samples, even those that are out of alignment with the true function. This model is said to be overfitting, and it too has lower accuracy.
3. In the third case, the model's output matches the true function well, which shows that the model has optimum accuracy; such a model is called a perfect fit.
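The three situations can be reproduced with a tiny experiment. The sketch below is an illustration only (made-up data, numpy assumed): it fits polynomials of different degrees to noisy samples of a known curve.

```python
# Illustrative under/over/good-fit sketch using numpy (assumed library, made-up data).
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)  # noisy samples of a known curve

for degree, label in [(1, "underfitting"), (10, "overfitting"), (3, "good fit")]:
    coeffs = np.polyfit(x, y, degree)        # fit a polynomial of the given degree
    y_hat = np.polyval(coeffs, x)
    train_error = np.mean((y - y_hat) ** 2)  # error on the training points only
    print(f"degree {degree:2d} ({label}): training error = {train_error:.3f}")

# The degree-10 curve drives the training error lowest by chasing the noise,
# the degree-1 line cannot follow the curve at all, and degree 3 balances both.
```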
16. Calculate Term frequency, Document frequency and inverse document
frequency for the given corpus and mention the word(s) having highest
value.
Document 1: We are going to Mumbai
Document 2: Mumbai is a famous place.
Document 3: We are going to a famous place.
Document 4: I am famous in Mumbai.
Solution:
Term Frequency:

              we  are  going  to  mumbai  is  a  famous  place  i  am  in
Document 1     1   1     1    1     1     0   0    0      0    0   0   0
Document 2     0   0     0    0     1     1   1    1      1    0   0   0
Document 3     1   1     1    1     0     0   1    1      1    0   0   0
Document 4     0   0     0    0     1     0   0    1      0    1   1   1

Document Frequency:

we  are  going  to  mumbai  is  a  famous  place  i  am  in
 2   2     2    2     3     1   2    3      2    1   1   1

Inverse Document Frequency (total number of documents = 4):

 we   are  going   to   mumbai  is    a   famous  place   i    am   in
4/2   4/2   4/2   4/2    4/3   4/1   4/2   4/3     4/2   4/1  4/1  4/1

The words having highest value are – Mumbai, Famous
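For completeness, the document frequency and inverse document frequency above can be recomputed with a few lines of plain Python (illustrative sketch, using the notes' definition IDF = total documents / document frequency):

```python
# Recompute document frequency and IDF for the corpus above (plain Python sketch,
# using the definition from question 13: IDF = total documents / document frequency).
docs = [
    "We are going to Mumbai",
    "Mumbai is a famous place.",
    "We are going to a famous place.",
    "I am famous in Mumbai.",
]
tokenized = [d.lower().replace(".", "").split() for d in docs]

vocabulary = sorted({w for doc in tokenized for w in doc})

doc_frequency = {w: sum(1 for doc in tokenized if w in doc) for w in vocabulary}
inverse_df = {w: f"{len(docs)}/{df}" for w, df in doc_frequency.items()}

print(doc_frequency)  # e.g. 'mumbai': 3, 'famous': 3, 'we': 2, 'is': 1, ...
print(inverse_df)     # e.g. 'mumbai': '4/3', 'famous': '4/3', 'we': '4/2', 'is': '4/1', ...
```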
