Part B Notes
TOPICS:
1. Evaluation : What is evaluation?, Model Evaluation Terminologies, Confusion matrix
2. Text Normalisation technique used in NLP and popular NLP model - Bag-of-Words
Data Processing:
• Since the language of computers is numerical, the very first step that comes to our mind is to convert our language into numbers.
• This conversion takes a few steps. The first step is Text Normalisation.
Text Normalisation:
• The process of converting or normalising the text to a lower level, i.e. reducing the complexity of the actual text so that the computer can understand it, is known as Text Normalisation.
Here we will be working on the text from multiple documents.
NLP uses a hierarchy to determine which groups of words and sentences belong to each other.
Corpora → Corpus → Document → Sentence → Token (largest to smallest)
Steps of Text Normalisation:-
Step 1: Sentence Segmentation
• Under sentence segmentation, the whole corpus is divided into sentences.
Example:
Step 2: Tokenisation
• After segmenting the sentences, each sentence is further divided into tokens.
• Tokenisation is the process of dividing the sentences into tokens.
• Under tokenisation, every word, number and special character is considered separately, and each of them becomes a separate token. After this step we get a list of tokens.
Example:
Step 3: Removing Stopwords, Special Characters and Numbers
• Stopwords: Stopwords are the words that occur very frequently in the corpus but do not add any value to it.
• Humans use grammar to make their sentences meaningful for the other person to understand. But grammatical words do not add any essence to the information which is to be transmitted through the statement; hence they come under stopwords.
• Every language has its own set of stopwords. In English some of the stopwords are: a, an, and, are, as, for, it, is, into, in, if, on, or, such, the, there, to, etc.
• In this step, the tokens which are not necessary are removed from the token list. These words are removed to make it easier for the computer to focus on meaningful terms.
• Along with these words, a lot of times our corpus might also contain special characters and/or numbers, which are removed in the same step.
Example:
After removing the stopword 'with' and the special character '.' from the above list:
['Humans', 'interact', 'each', 'other', 'very', 'easily']
Note: - If you are working on a document containing email IDs, then you might not want to remove
the special characters and numbers.
Hence, when we look at the text, we take frequent and rare words into consideration and remove the stopwords.
Step 4: Converting Text to a Common Case
• After removing the stopwords, the whole text is converted to the same case, preferably lower case, so that the machine does not treat the same word differently just because of its case:
['humans', 'interact', 'each', 'other', 'very', 'easily']
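Below is a minimal, illustrative Python sketch of Steps 3 and 4 for the example above; the token list and the tiny stopword and special-character sets are assumptions made just for this demonstration, not a full stopword list.

```python
# Illustrative sketch of stopword removal (Step 3) and case conversion (Step 4).
# The small stopword and special-character sets below are assumptions for this example.
tokens = ['Humans', 'interact', 'with', 'each', 'other', 'very', 'easily', '.']
stopwords = {'a', 'an', 'and', 'are', 'as', 'for', 'in', 'into', 'is', 'it', 'on', 'or', 'the', 'to', 'with'}
special_characters = {'.', ',', '?', '!'}

# Step 3: drop stopwords and special characters from the token list
filtered = [t for t in tokens if t not in stopwords and t not in special_characters]
print(filtered)   # ['Humans', 'interact', 'each', 'other', 'very', 'easily']

# Step 4: convert the remaining tokens to a common (lower) case
lowered = [t.lower() for t in filtered]
print(lowered)    # ['humans', 'interact', 'each', 'other', 'very', 'easily']
```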
Step 5: Stemming
Stemming is a technique used to extract the base form of the words by removing affixes from
them. It is just like cutting down the branches of a tree to its stems.
Example:
Note: - In stemming, the stemmed words (the words we get after removing the affixes) might not be meaningful.
Lemmatization:
• Stemming and lemmatization are alternative processes to each other, as the role of both is the same: removal of affixes.
• Lemmatization makes sure that the lemma (the word left after removing the affixes) is a meaningful word, and hence it takes a longer time to execute than stemming.
Example:
With this we have normalised our text to tokens which are the simplest form of words present
in the corpus. This process is done throughout the corpus for all the documents.
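To tie the steps together, here is a hedged sketch of the whole pipeline using the NLTK library. It assumes NLTK is installed and its tokenizer, stopword and WordNet data have been downloaded, and the sample document is made up for illustration.

```python
# Hedged sketch of the text normalisation pipeline with NLTK (assumed installed).
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('punkt')       # sentence/word tokenizer data
nltk.download('stopwords')   # standard English stopword list
nltk.download('wordnet')     # dictionary used by the lemmatizer

document = "Humans interact with each other very easily. Machines need numbers to understand us."

# Step 1: Sentence Segmentation
sentences = sent_tokenize(document)

# Step 2: Tokenisation
tokens = [tok for sent in sentences for tok in word_tokenize(sent)]

# Step 3: Removal of stopwords, special characters and numbers
stop_words = set(stopwords.words('english'))
tokens = [t for t in tokens if t.lower() not in stop_words and t.isalpha()]

# Step 4: Converting text to a common (lower) case
tokens = [t.lower() for t in tokens]

# Step 5: Stemming (or Lemmatization as the meaningful alternative)
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print([stemmer.stem(t) for t in tokens])
print([lemmatizer.lemmatize(t) for t in tokens])
```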
Stemming Vs. Lemmatisation:
• Stemming: The stemmed words might not be meaningful.
• Lemmatisation: The lemma word is a meaningful (dictionary) word.
Example: see the sketch below.
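A small hedged example of the difference, again using NLTK; the word "studies" is chosen for illustration and is not from these notes.

```python
# Contrast between stemming and lemmatization on one word (illustrative).
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('wordnet')   # dictionary data needed by the lemmatizer

word = "studies"
print(PorterStemmer().stem(word))           # 'studi' -> stemmed form, not a meaningful word
print(WordNetLemmatizer().lemmatize(word))  # 'study' -> lemma, a meaningful dictionary word
```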
Our goal is to convert natural language into numbers so that the computer can understand it. With text normalisation we have reduced the complexity of the text. Now it is time to convert the tokens into numbers. For this, we would use the Bag of Words algorithm.
• Bag of Words is an NLP model which extracts features from the text in the form of the unique words present in it and their occurrences.
• Thus, we can say that the bag of words gives us two things:
1. A vocabulary of words for the corpus
2. The frequency of these words (the number of times each has occurred in the whole corpus).
Ques. Why is it called a "bag" of words?
Ans. Calling this algorithm a "bag" of words symbolises that the sequence of the sentences or tokens does not matter in this case; all we need are the unique words and their frequency.
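A tiny illustrative check of this idea: two different orderings of the same tokens produce the same word counts. The token lists below are assumptions made for illustration.

```python
# The "bag" ignores order: different orderings give the same word counts.
from collections import Counter

bag1 = Counter(["aman", "went", "to", "a", "therapist"])
bag2 = Counter(["therapist", "a", "to", "went", "aman"])
print(bag1 == bag2)   # True - only the unique words and their frequencies matter
```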
Bag of Words (BoW) algorithm:
Step 1: Text Normalisation
Step 2: Create Dictionary
Step 3: Create Document Vector
Step 4: Create the document vector for all the documents
Note: - No tokens have been removed in the stopwords removal step. It is because we have very little
data and since the frequency of all the words is almost the same, no word can be said to have lesser
value than the other.
• Dictionary in NLP means a list of all the unique words occurring in the corpus.
• If some words are repeated in different documents, they are all written just once while
creating the dictionary.
Dictionary:
Note: - If some words are repeated in different documents, they are written just once; while creating the dictionary, we create a list of unique words.
Step 2: Creating a Document Vector
• The document Vector contains the frequency of each word of the vocabulary in a particular
document.
• In the document vector, the vocabulary is written in the top row.
o Now, for each word in the document, if it matches the vocabulary, put a 1 under it.
o If the same word appears again, increment the previous value by 1.
o And if the word does not occur in that document, put a 0 under it.
           | aman | and | anil | are | stressed | went | download | health | chatbot | therapist | a | to
Document 1 |  1   |  1  |  1   |  1  |    1     |  0   |    0     |   0    |    0    |     0     | 0 | 0
Document 2 |  1   |  0  |  0   |  0  |    0     |  1   |    0     |   0    |    0    |     1     | 1 | 1
Document 3 |  0   |  0  |  1   |  0  |    0     |  1   |    1     |   1    |    1    |     0     | 1 | 1
In this table, the header row contains the vocabulary of the corpus and three rows
correspond to three different documents.
Finally, this gives us the document vector table for our corpus.
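The table above can be reproduced with a short plain-Python sketch. The three token lists below are reconstructed from the table (their internal word order is an assumption, which does not matter for Bag of Words), and the dictionary order may differ from the header row.

```python
# Plain-Python sketch of the Bag of Words steps for the corpus behind the table above.
documents = [
    ["aman", "and", "anil", "are", "stressed"],                    # Document 1
    ["aman", "went", "to", "a", "therapist"],                      # Document 2
    ["anil", "went", "to", "download", "a", "health", "chatbot"],  # Document 3
]

# Create the dictionary: every unique word of the corpus, written once
dictionary = []
for doc in documents:
    for word in doc:
        if word not in dictionary:
            dictionary.append(word)
print(dictionary)

# Create the document vector for every document
for doc in documents:
    vector = [doc.count(word) for word in dictionary]
    print(vector)
```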
Note: - We have still not achieved our goal, i.e. converting the tokens to numbers. For this we have to apply another algorithm after this, called TFIDF. It stands for Term Frequency & Inverse Document Frequency. (This part is optional in our syllabus.)
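TFIDF is optional here, but as a pointer, a commonly used form of the formula is tfidf(word, document) = TF(word, document) × log(N / DF(word)), where N is the number of documents and DF is how many documents contain the word. A hedged sketch, reusing the documents above (the base-10 logarithm is an assumption):

```python
# Hedged TFIDF sketch using the common TF * log(N / DF) formula (base-10 log assumed).
import math

documents = [
    ["aman", "and", "anil", "are", "stressed"],
    ["aman", "went", "to", "a", "therapist"],
    ["anil", "went", "to", "download", "a", "health", "chatbot"],
]
N = len(documents)

def tfidf(word, doc):
    tf = doc.count(word)                         # term frequency in this document
    df = sum(1 for d in documents if word in d)  # number of documents containing the word
    return tf * math.log10(N / df)

print(tfidf("aman", documents[0]))     # appears in 2 of 3 documents -> smaller weight
print(tfidf("chatbot", documents[2]))  # appears in only 1 document  -> larger weight
```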
MCQs :
Q1. With reference to NLP, consider the following plot of occurrence of words versus their value (plot not reproduced here):
a) Tokens
b) Popular words
c) Stop Words
d) Pronouns
Q2. __________ is a term used for any word, number or special character occurring in a sentence. (Token/Punctuators)
Q3. __________ is the process of converting the stemmed words to their proper form to keep them meaningful.
a) Stemming b) Tokenization c) Lemmatization d) Segmentation
Q4. Bag of words Algorithm is used to extract features of text in the corpus.
d) None of these
Q8. Manav is learning to create an NLP based AI model. He has a corpus which contains several email IDs. While normalising the text, which of the following should he not perform on the text?
c) Segmentation d) Lemmatisation
           | aman | and | anil | are | stressed | went | download | health | chatbot | therapist | a | to
Document 1 |  1   |  1  |  1   |  1  |    1     |  0   |    0     |   0    |    0    |     0     | 0 | 0
Document 2 |  1   |  0  |  0   |  0  |    0     |  1   |    0     |   0    |    0    |     1     | 1 | 1
Document 3 |  0   |  0  |  1   |  0  |    0     |  1   |    1     |   1    |    1    |     0     | 1 | 1
What is the term frequency of the word “aman”?
a) 3 b) 1 c) 2 d) 0
Q11. A corpus contains 12 documents. How many document vectors will be there for that corpus?
a) 12 b) 1 c) 24 d) 1/12
Q1.
Assertion: Text normalization is an essential preprocessing step in Natural Language Processing (NLP).
Reasoning: Text normalization helps standardize text data by converting it to a common format, making
it easier to analyze and process.
Options:
A. Assertion is true, and reasoning is true, and the reasoning is the correct explanation of
the assertion.
B. Assertion is true, and reasoning is true, but the reasoning is not the correct explanation of the
assertion.
Q2.
Assertion: In the bag of words (BoW) model, word order is not considered.
Reasoning: BoW represents text as a collection of individual word occurrences without considering their
sequence.
Options:
A. Assertion is true, and reasoning is true, and the reasoning is the correct explanation of
the assertion.
B. Assertion is true, and reasoning is true, but the reasoning is not the correct explanation of the
assertion.
Q3.
Assertion: In text normalization, stemming and lemmatization are two common techniques used to reduce
words to their base or root form.
Reasoning: Stemming is a more aggressive technique than lemmatization and may result in non-dictionary words.
Options:
A. Assertion is true, and reasoning is true, and the reasoning is the correct explanation of the
assertion.
B. Assertion is true, and reasoning is true, but the reasoning is not the correct explanation
of the assertion.
Q4.
Assertion: When we look at the text, we take frequent and rare words into consideration and remove the stop words.
Reasoning: The words which have the highest occurrence in all the documents of the corpus are said to have negligible value, hence they are termed frequent words.
Options:
A. Assertion is true, and reasoning is true, and the reasoning is the correct explanation of the
assertion.
B. Assertion is true, and reasoning is true, but the reasoning is not the correct explanation of the
assertion.
Q2. In the Bag of Words algorithm, the sequence of the words matters. (True/False)
Q3. Text Normalisation converts the text into numerical form. (True/False)
Q4. Dictionary in NLP means a list of all the unique words occurring in the corpus.
(True/False)
SHORT ANSWER TYPE QUESTIONS:
Q1. What will be the results of conversion of the terms 'happily' and 'travelled' in the process of stemming and lemmatization?
Ans.
Q2. What is the importance of eliminating stop words and special characters during Text Normalization
process?
Ans.
• The frequency of stop words is higher in comparison to the other words present in the corpus.
• They add unnecessary processing effort without adding any value to the meaning.
Q3. What is Tokenization? Count how many tokens are present in the following statement:
“I find that the harder I work, the more luck I seem to have.”
Ans. Tokenization is the process of dividing a sentence into tokens, where every word, number and special character is considered a separate token. Tokenising the given statement:
[ "I" ,"find", "that" , "the" , "harder" , "I" , "work" , ",", "the" , "more" , "luck" , "I" , "seem" , "to" ,
"have" , "." ]
So, there are a total of 16 tokens in the given statement after tokenization.
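As a hedged cross-check, NLTK's word_tokenize (assuming NLTK and its tokenizer data are available) splits the statement into the same 16 tokens:

```python
# Cross-check of the token count with NLTK's word_tokenize (assumed installed).
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')

statement = "I find that the harder I work, the more luck I seem to have."
tokens = word_tokenize(statement)
print(tokens)        # includes ',' and '.' as separate tokens
print(len(tokens))   # 16
```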
Q4. Kiara, a beginner in the field of NLP is trying to understand the process of Stemming. Help her in
filling up the following table by suggesting appropriate affixes and stem of the words mentioned there:
S. No. | Word     | Affixes | Stem
i.     | Tries    |         |
ii.    | Learning |         |
Ans.
Q5. Explain the following picture which depicts one of the processes on NLP. Also mention the purpose
which will be achieved by this process.
Ans.
Purpose: In Text Normalization, we undergo several steps to normalize the text to a lower level. After
the removal of stop words, we convert the whole text into a similar case, preferably lower case. This
ensures that the case-sensitivity of the machine does not make it treat the same words as different just because of their different cases.
Q6. What is Document Frequency?
Ans.
Ans.
• Example:
o Consider the following document vector table
Ans.
Q9. Does the vocabulary of a corpus remain the same before and after text normalization? Why?
Ans.
No, the vocabulary of a corpus does not remain the same before and after text normalization. Reasons are
–
● During normalisation the text is taken through various steps and reduced to a minimal vocabulary, since the machine does not need grammatically correct statements but only the essence of the text.
● In stemming the affixes of words are removed and the words are converted to their base
form.
Q1. What do you mean by Bag of words in NLP? Differentiate between stemming and lemmatization.
Explain with the help of an example.
Q2. Create a document vector table from the following documents by implementing all the four steps of
Bag of words model. Also depict the outcome of each step.
Ans.
Document Vector 2:
           | Sameera | Sanya | classmate | like | dance | love | study | mathematics
Document 2 |    1    |   1   |     0     |  1   |   1   |  1   |   1   |      1
Step 4: Creating Document Vector Table
Q3. Apply all the four steps of the Bag of Words model of NLP on the above given document and generate the output.
Q4. Perform all of Text Normalization operation on the following text and write reduced text at the end
of operation. Show output of each step during the process.
Ans.
Step 1: Sentence Segmentation
Output: - "Player in the ground managed the situation beautifully otherwise it."
Step 2: Tokenization
Output:- [“Player” , “in” , “the” , “ground” , “Managed” , “the” , “situation” , “beautifully” ,
“otherwise” , “it” ]
Document 1: Akash and Ajay are best friends.
Document 2: Akash likes to play football but Ajay prefers to play online games.
Ans.
Step 1. Tokenisation
[Akash, and, Ajay, are, best, friends, Akash, likes, to, play, football, but, Ajay,
prefers, to, play, online, games]
Step 2. Removal of stopwords, special characters and numbers
[Akash, Ajay, best, friends, Akash, likes, play, football, Ajay, prefers, play, online, games]
Step 3. Converting text to a common case
[akash, ajay, best, friends, akash, likes, play, football, ajay, prefers, play, online, games]
Step 4. Stemming/Lemmatization
[akash, ajay, best, friend, akash, like, play, football, ajay, prefer, play, online, game]
TOPIC: EVALUATION
What is evaluation? Model Evaluation Terminologies, Confusion matrix
What is evaluation?
Evaluation is the stage in which we test the reliability of an AI model by feeding test data into it and comparing the model's output (prediction) with the actual answer (reality).
There are various new terminologies which come into the picture when we work on evaluating our model. Let's explore them through a case study.
Overview: This model takes the MRI scan image of a patient and detects whether that patient has cancer or not. It also detects the cancer area in the scanned image if the person has cancer.
There exist two conditions which we need to ponder upon: Prediction and Reality.
Now let us look at various combinations that we can have with these two conditions.
(True Positive)
Here, we can see in the MRI scan image of a patient that the patient has a brain tumor. The model
predicts a Yes which means the patient has cancer. The Prediction matches with the Reality. Hence, this
condition is termed as True Positive.
(True Negative)
Here, we can see in the MRI scan image of a patient that the patient doesn't have a brain tumor. The model predicts a No. The Prediction matches with the Reality. Hence, this condition is termed as True Negative.
Case 3: Does the patient have cancer?
(False Positive)
Here, we can see in the MRI scan image of a patient that the patient doesn't have a brain tumor. But the model predicts a Yes. The Prediction does not match with the Reality. Hence, this condition is termed as False Positive. It is also known as Type 1 error.
(False Negative)
Here, we can see in the MRI scan image of a patient that the patient has a brain tumor. But the model predicts a No. The Prediction does not match with the Reality. Hence, this condition is termed as False Negative. It is also known as Type 2 error.
Summary:
Confusion Matrix:
• It is just a representation of the above parameters in a matrix format. Better visualisation is always good :)
• In each of the cells (TP, FN, FP and TN) we write a number that represents the count of that particular case found in the test dataset after comparing the prediction with the reality.
• The confusion matrix is a method of recording the result of comparisons between the prediction and the reality, based on the results of the test dataset for an AI model, in a 2x2 matrix.
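As an illustrative aside (not part of the prescribed material), a confusion matrix can also be produced with scikit-learn; the reality and prediction lists below are made-up values.

```python
# Hedged sketch: building a confusion matrix with scikit-learn (assumed installed).
from sklearn.metrics import confusion_matrix

reality    = [1, 0, 1, 1, 0, 0, 1, 0]   # 1 = Yes (has cancer), 0 = No  (made-up data)
prediction = [1, 0, 0, 1, 1, 0, 1, 0]   # the model's output for the same cases

# With labels=[1, 0]: rows are Reality (Yes, No), columns are Prediction (Yes, No)
matrix = confusion_matrix(reality, prediction, labels=[1, 0])
print(matrix)
# [[TP FN]
#  [FP TN]]
```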
Example:
1. Accuracy: The most commonly used metric to judge a model. It is the measure of how many times the model gives the correct (True) prediction out of the total possible cases.
   Accuracy = (TP + TN) / (TP + TN + FP + FN)
2. Precision: The percentage of true positive instances out of the total predicted positive instances. Here the denominator is everything the model predicted as positive from the whole dataset. Take it as finding out 'how often the model is right when it says it is right'.
   Precision = TP / (TP + FP)
3. Recall: The percentage of true positive instances out of the total actual positive instances. Therefore the denominator (TP + FN) here is the actual number of positive instances present in the dataset. Take it as finding out 'out of all the actually positive cases, how many the model caught and how many it missed'.
   Recall = TP / (TP + FN)
4. F1 Score: The harmonic mean of precision and recall. It takes the contribution of both, so the higher the F1 score, the better.
   F1 Score = 2 × (Precision × Recall) / (Precision + Recall)
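A short sketch putting the four formulas together; the confusion-matrix counts are made up for illustration.

```python
# The four evaluation metrics computed from confusion-matrix counts (illustrative values).
TP, TN, FP, FN = 60, 25, 5, 10

accuracy  = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)
f1_score  = 2 * precision * recall / (precision + recall)

print(round(accuracy, 2))    # 0.85
print(round(precision, 2))   # 0.92
print(round(recall, 2))      # 0.86
print(round(f1_score, 2))    # 0.89
```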
MCQs :
Q1. Raunak was learning the conditions that make up the confusion matrix. He came across a scenario
in which the machine that was supposed to predict an animal was always predicting not an animal. What
is this condition called?
Q2. The output given by the AI machine is known as ________ (Prediction/ Reality)
Q3. _____________ is used to record the result of comparison between the prediction and reality. It is
not an evaluation metric but a record which can help in evaluation.
a) Prediction
b) Reality
c) Accuracy
d) Confusion Matrix
a) Problem Scoping
b) Evaluation
c) Modelling
d) Data Acquisition
Q6. Which of the following statements is true for the term Evaluation?
a) Prediction is the input given to the machine to receive the expected result of the reality.
c) The prediction is the output which is given by the machine and the reality is the real scenario in
which the prediction has been made.
Q8. When the model predicts No for a situation and the Prediction does not match with the Reality, the condition is termed as
a) True Positive
b) False Positive
c) True Negative
d) False Negative
a) Prediction ≠ Reality
b) Prediction = 1
c) Reality = 0
d) Reality = Yes
Q10. When we use the training data for testing the model, the model is said to be
a) Over fitting
b) Under fitting
c) Perfectly fit
d) None of these
Ans.
Ans.
Prediction and Reality are the two parameters considered for Evaluation of a model.
• Overfitting is "the production of an analysis that corresponds too closely or exactly to a particular
set of data, and may therefore fail to fit additional data or predict future observations reliably".
• Models that use the training dataset during testing will always give the correct output. This is known as overfitting.
Q8. Let us assume that for the training of a cancer detection model there are a total of 100 data points in our training dataset of cancer patients, and out of those 4% of the patients actually have cancer. Now, after the completion of training, it has been seen that the numbers of true positive and true negative cases are 2 and 95 respectively.
Ans.
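The question text above appears truncated; assuming it asks for the remaining confusion-matrix cells and the evaluation metrics, here is a hedged worked sketch.

```python
# Hedged working for Q8: 100 patients, 4% actually have cancer, TP = 2, TN = 95.
total = 100
actual_positive = int(0.04 * total)     # 4 patients actually have cancer

TP, TN = 2, 95
FN = actual_positive - TP               # 4 - 2 = 2 (cancer patients the model missed)
FP = total - TP - TN - FN               # 100 - 2 - 95 - 2 = 1 (healthy patient flagged as cancer)

accuracy  = (TP + TN) / total                               # 0.97
precision = TP / (TP + FP)                                  # 2/3 ≈ 0.67
recall    = TP / (TP + FN)                                  # 2/4 = 0.50
f1_score  = 2 * precision * recall / (precision + recall)   # ≈ 0.57
print(FP, FN, accuracy, round(precision, 2), recall, round(f1_score, 2))
```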
Ans.
Moving towards deploying the model in the real world, we test it in as many ways as possible. The stage
of testing the models is known as EVALUATION.
Q1. What is a confusion matrix? Explain in detail with the help of an example.
Q2. Imagine that you have come up with an AI based prediction model which has been deployed on the
roads to check traffic jams. Now, the objective of the model is to predict whether there will be a traffic
jam or not. Now, to understand the efficiency of this model, we need to check if the predictions which
it makes are correct or not. Thus, there exist two conditions which we need to ponder upon: Prediction
and Reality. Traffic jams have become a common part of our lives nowadays. Living in an urban area
means you have to face traffic each and every time you get out on the road. Mostly, school students opt
for buses to go to school. Many times, the bus gets late due to such jams and the students are not able
to reach their school on time.
Considering all the possible situations, make a Confusion Matrix for the above situation.
Ans.
Prediction: Yes, Reality: Yes → True Positive
Prediction: No, Reality: No → True Negative
Prediction: Yes, Reality: No → False Positive
Prediction: No, Reality: Yes → False Negative
(ii) (Note: For calculating Precision, Recall and F1 score, we need not multiply the formula by 100, as all these parameters need to range between 0 and 1.)
(1 mark for total number of cases; 1 mark each for the calculation of precision, recall and F1 score)
-------------x------------------------------------------x---------------------------------------------x------------------