Unit 6 - NLP Notes
1. Define NLP.
NLP is a subfield of Linguistics, Computer Science, Information Engineering,
and Artificial Intelligence concerned with the interactions between computers
and human (natural) languages, in particular how to program computers to
process and analyse large amounts of natural language data.
2. Explain the Applications of Natural Language Processing.
I. Automatic Summarization
Automatic summarization is relevant not only for summarizing the meaning of
documents and information, but also for understanding the emotional meaning
within the information, such as when collecting data from social media.
E.g. an overview of a news item or blog post
II. Sentiment Analysis:
The goal of sentiment analysis is to identify sentiment among several
posts or even in the same post where emotion is not always explicitly
expressed.
Companies use Natural Language Processing applications, such as
sentiment analysis, to identify opinions and sentiment online to help
them understand what customers think about their products and
services.
Steps of Text Normalisation:
1. Sentence Segmentation -
Under sentence segmentation, the whole corpus is divided into sentences.
Each sentence is treated as a separate piece of data, so the whole corpus
is reduced to a list of sentences.
2. Tokenisation - After segmenting the sentences, each sentence is then
further divided into tokens.
Token is a term used for any word, number or special character occurring
in a sentence.
Under tokenisation, every word, number and special character is considered
separately, and each of them becomes a separate token.
3. Removing Stop words, Special Characters and Numbers - In this
step, the tokens which are not necessary are removed from the token list.
Stop words are the words which occur very frequently in the corpus but
do not add any value to it.
4. Converting text to a common case - After stop word removal, we convert
the whole text into the same case, preferably lower case. This ensures that
the machine does not treat the same word as two different words merely
because of a difference in case.
5. Stemming - In this step, the remaining words are reduced to their root
words. In other words, stemming is the process in which the affixes of
words are removed and the words are converted to their base form. The
stemmed word may not always be a meaningful word.
6. Lemmatization - In lemmatization, the word we get after affix removal
(also known as the lemma) is a meaningful word. With this, we have
normalised our text into tokens, which are the simplest forms of the words
present in the corpus. A short code sketch of this whole pipeline is given
below.
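The pipeline above can be tried out in Python with the NLTK library. The
snippet below is only a minimal sketch: it assumes nltk is installed and that
the punkt (or punkt_tab in newer NLTK versions), stopwords and wordnet
resources can be downloaded, and it uses a small made-up corpus.

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("punkt", quiet=True)      # sentence/word tokenizer models
nltk.download("stopwords", quiet=True)  # list of English stop words
nltk.download("wordnet", quiet=True)    # dictionary used by the lemmatizer

corpus = "My brother loves math and science. He likes to read books on science."

# Step 1: Sentence segmentation - divide the corpus into sentences.
sentences = nltk.sent_tokenize(corpus)

# Step 2: Tokenisation - divide each sentence into tokens.
tokens = [tok for sent in sentences for tok in nltk.word_tokenize(sent)]

# Step 3: Remove stop words, special characters and numbers.
stop_words = set(stopwords.words("english"))
tokens = [t for t in tokens if t.isalpha() and t.lower() not in stop_words]

# Step 4: Convert all tokens to a common (lower) case.
tokens = [t.lower() for t in tokens]

# Steps 5/6: Reduce each word to its root form (lemmatization shown here).
lemmatizer = WordNetLemmatizer()
tokens = [lemmatizer.lemmatize(t) for t in tokens]

print(tokens)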
Now it is time to convert the tokens into numbers. For this, we would use
the Bag of Words algorithm.
9. Rahul has been given the task of text normalisation. Help him
normalise the text in the segmented sentences given below:
Document 1: My brother loves math and science.
Document 2: My brother likes to read books on science and listen to rock
music.
Step 1: Tokenisation
Document 1: [My, brother, loves, math, and, science]
Document 2: [My, brother, likes, to, read, books, on, science, and, listen,
to, rock, music]
Step 2: Removing stop words (and, to, on are removed)
Document 1: [My, brother, loves, math, science]
Document 2: [My, brother, likes, read, books, science, listen, rock, music]
Step 3: Converting text to common case
Document 1: [my, brother, loves, math, science]
Document 2: [my, brother, likes, read, books, science, listen, rock, music]
Step 4: Stemming/Lemmatization
Document 1: [my, brother, love, math, science]
Document 2: [my, brother, like, read, book, science, listen, rock, music]
(The difference between stemming and lemmatization in this last step is
illustrated in the short code sketch below.)
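The choice between stemming and lemmatization in Step 4 can be seen by
running both on the same tokens. This is a small illustrative sketch,
assuming nltk and its wordnet resource are available.

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["loves", "likes", "books", "studies"]:
    # The stem may not be a meaningful word; the lemma always is
    # (e.g. "studies" stems to "studi" but lemmatizes to "study").
    print(word, "->", stemmer.stem(word), "|", lemmatizer.lemmatize(word))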
10. Define Bag of Words.
Bag of Words is a Natural Language Processing model which helps in
extracting features out of text that can then be used in machine learning
algorithms. In Bag of Words, we record the occurrences of each word and
construct the vocabulary for the corpus.
11. Describe the steps to implement bag of words algorithm.
Step-by-step approach to implement bag of words algorithm:
1. Text Normalisation: Collect data and pre-process it
2. Create Dictionary: Make a list of all the unique words occurring in the
corpus. (Vocabulary)
3. Create document vectors: For each document in the corpus, find out how
many times each word from the unique list of words (the vocabulary) occurs
in it.
4. Create document vectors for all the documents. (A plain-Python sketch of
these four steps is given below.)
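The four steps can be written directly in plain Python. The snippet below is
only a sketch: the three short documents are made up for illustration, and
text normalisation is reduced to lower-casing and splitting on spaces to keep
it brief.

# Step 1: Text normalisation (kept very simple here).
documents = [
    "we love reading books",
    "we love science",
    "science books are fun",
]
normalised = [doc.lower().split() for doc in documents]

# Step 2: Create the dictionary - the list of unique words in the corpus.
vocabulary = sorted({word for doc in normalised for word in doc})

# Steps 3 and 4: Create a document vector for every document by counting
# how many times each word of the dictionary occurs in that document.
vectors = [[doc.count(word) for word in vocabulary] for doc in normalised]

print(vocabulary)
for vector in vectors:
    print(vector)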
12. Create a document vector table from the following documents by
implementing all the four steps of Bag of words model.
Document 1: Aman and Anil are stressed
Document 2: Aman went to a therapist
Document 3: Anil went to download a health chatbot
Solution:
Step 1: Collecting data and pre-processing it.
Document 1: [aman, and, anil, are, stressed]
Document 2: [aman, went, to, a, therapist]
Document 3: [anil, went, to, download, a, health, chatbot]
Step 2: Create Dictionary (the list of all unique words in the corpus):
[aman, and, anil, are, stressed, went, to, a, therapist, download, health,
chatbot]
Steps 3 and 4: Create document vectors by counting, for each document, how
many times every word of the dictionary occurs in it:

Word        Doc 1  Doc 2  Doc 3
aman          1      1      0
and           1      0      0
anil          1      0      1
are           1      0      0
stressed      1      0      0
went          0      1      1
to            0      1      1
a             0      1      1
therapist     0      1      0
download      0      0      1
health        0      0      1
chatbot       0      0      1

(The same vectors can be cross-checked with the short scikit-learn sketch
below.)
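For comparison, the same document vectors can be produced with scikit-learn's
CountVectorizer. This is an illustrative sketch and assumes the scikit-learn
package is installed; note that CountVectorizer lists the vocabulary in
alphabetical order, so the columns appear in a different order than in the
table above.

from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "Aman and Anil are stressed",
    "Aman went to a therapist",
    "Anil went to download a health chatbot",
]

# Widen the default token pattern so one-letter words such as "a" are kept.
vectorizer = CountVectorizer(lowercase=True, token_pattern=r"(?u)\b\w+\b")
matrix = vectorizer.fit_transform(documents)

print(vectorizer.get_feature_names_out())  # the dictionary (alphabetical)
print(matrix.toarray())                    # one document vector per row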
In the graphs being compared here, the red dashed line is the model's output
while the blue crosses are the actual data samples.
1. In the first case, the model's output does not match the true function at
all. Hence the model is said to be underfitting and its accuracy is lower.
2. In the second case, the model tries to cover all the data samples, even
those that are out of alignment with the true function. This model is said
to be overfitting, and it too has lower accuracy.
3. In the third case, the model's output matches the true function well,
which means the model has optimum accuracy; such a model is called a
perfect fit. (A small numerical illustration of these three cases is given
below.)
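These three situations can be reproduced with a small experiment: fitting
polynomials of different degrees to noisy samples of a known function. The
sketch below uses NumPy and is purely illustrative; the true function, noise
level and polynomial degrees are assumptions, not taken from the notes.

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
y_true = np.sin(2 * np.pi * x)                    # the "true function"
y = y_true + rng.normal(0, 0.2, size=x.shape)     # noisy data samples

for degree in (1, 3, 12):
    coeffs = np.polyfit(x, y, degree)             # fit a polynomial model
    y_fit = np.polyval(coeffs, x)                 # the model's output
    train_error = np.mean((y_fit - y) ** 2)       # error on the noisy samples
    true_error = np.mean((y_fit - y_true) ** 2)   # error vs the true function
    # Degree 1 underfits (high error everywhere); degree 12 has the lowest
    # training error because it also fits the noise (overfitting); degree 3
    # usually follows the true function most closely (a good fit).
    print(f"degree {degree:2d}: train error {train_error:.3f}, "
          f"error vs true function {true_error:.3f}")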
16. Calculate Term frequency, Document frequency and inverse document
frequency for the given corpus and mention the word(s) having highest
value.
Document 1: We are going to Mumbai
Document 2: Mumbai is a famous place.
Document 3: We are going to a famous place.
Document 4: I am famous in Mumbai.
Solution:
(Tokens are converted to lower case and the full stops are removed.)

Term Frequency (number of times each word occurs in each document):

Word      Doc 1  Doc 2  Doc 3  Doc 4
we          1      0      1      0
are         1      0      1      0
going       1      0      1      0
to          1      0      1      0
mumbai      1      1      0      1
is          0      1      0      0
a           0      1      1      0
famous      0      1      1      1
place       0      1      1      0
i           0      0      0      1
am          0      0      0      1
in          0      0      0      1

Document Frequency (number of documents in which the word occurs):

we: 2, are: 2, going: 2, to: 2, mumbai: 3, is: 1, a: 2, famous: 3, place: 2,
i: 1, am: 1, in: 1

Inverse Document Frequency (IDF = log(total number of documents / document
frequency) = log(4/DF)):

we: log(4/2), are: log(4/2), going: log(4/2), to: log(4/2), mumbai: log(4/3),
is: log(4/1), a: log(4/2), famous: log(4/3), place: log(4/2), i: log(4/1),
am: log(4/1), in: log(4/1)

The words "mumbai" and "famous" have the highest document frequency (each
occurs in 3 of the 4 documents), and therefore the lowest inverse document
frequency, log(4/3).
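The same three quantities can be computed with a short plain-Python sketch
(log base 10 is assumed here; the notes do not specify the base).

import math

documents = [
    "We are going to Mumbai",
    "Mumbai is a famous place.",
    "We are going to a famous place.",
    "I am famous in Mumbai.",
]

# Normalise: lower-case, strip full stops and split into words.
docs = [d.lower().replace(".", "").split() for d in documents]
vocabulary = sorted({word for doc in docs for word in doc})

# Term frequency: occurrences of each word in each document.
tf = {w: [doc.count(w) for doc in docs] for w in vocabulary}

# Document frequency: number of documents in which the word occurs.
df = {w: sum(1 for doc in docs if w in doc) for w in vocabulary}

# Inverse document frequency: log10(total documents / document frequency).
idf = {w: math.log10(len(docs) / df[w]) for w in vocabulary}

for w in vocabulary:
    print(f"{w:<8} TF={tf[w]}  DF={df[w]}  IDF={idf[w]:.3f}")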