Part B Notes

CLASS X

ARTIFICIAL INTELLIGENCE (417) STUDY MATERIAL

TOPICS:
1. Evaluation : What is evaluation?, Model Evaluation Terminologies, Confusion matrix
2. Text Normalisation technique used in NLP and popular NLP model - Bag-of-Words

TOPIC: TEXT NORMALISATION (NLP)

Data Processing:
• Since we all know that the language of computers is Numerical, the very first step that
comes to our mind is to convert our language to numbers.
• This conversion takes a few steps to happen. The first step to it is Text Normalisation.
Text Normalisation:
• The process of converting or normalising the text to a lower level i.e. reducing the
complexity of the actual text for the computer to understand is known as Text
Normalisation.
Here we will be working on the text from multiple documents.

Terminologies related to text:-


1. Token:- It is the smallest unit of text. A “Token” is a term used for any word, number or
special character occurring in a sentence.
For example, in the sentence ‘AI is a skill subject.’ the tokens are ‘AI’, ‘is’, ‘a’, ‘skill’,
‘subject’, ‘.’
2. Sentence:- A sentence is a collection of tokens ending with full stop(.)
E.g. “AI is a skill subject.”
3. Document:- A document is a collection of sentences which is also known as a paragraph.
E.g. “AI is a skill subject. It is a newly introduced subject by CBSE.”
4. Corpus:- The term used for the whole textual data from all the documents altogether is
known as corpus. E.g.
Document 1: “AI is a skill subject. It is a newly introduced subject by CBSE.”
Document 2: “AI is a very scoring subject. If you study with focus you can easily get
100/100. Best of luck!”
5. Corpora:- A group of corpus is called Corpora.

NLP uses a hierarchy to determine which groups of words and sentences belong to each other.

Corpora

Corpus

Document

Sentence

Token
Steps of Text Normalisation:-

Step 1: Sentence Segmentation
Step 2: Tokenisation
Step 3: Stopword Removal
Step 4: Converting Text to Common Case
Step 5: Stemming
Step 6: Lemmatisation

Step 1: Sentence Segmentation


• Under sentence segmentation, the whole corpus is divided into sentences.
• Each sentence is treated as a separate unit of data, so the whole corpus gets reduced to a set of sentences.

Example:

BEFORE SEGMENTATION:
“Humans interact with each other very easily. For us, the natural languages that we use are so
convenient that we speak them easily and understand them well too.”

AFTER SEGMENTATION:
Sentence 1: “Humans interact with each other very easily.”
Sentence 2: “For us, the natural languages that we use are so convenient that we speak them
easily and understand them well too.”
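The notes do not prescribe any tool for this step. As a minimal Python sketch (an illustration only, assuming a simple full-stop split is good enough for this example corpus):

# Minimal sentence segmentation sketch in plain Python.
# Real NLP toolkits also handle abbreviations and other edge cases.
corpus = ("Humans interact with each other very easily. "
          "For us, the natural languages that we use are so convenient "
          "that we speak them easily and understand them well too.")

# Split on the full stop and re-attach it to each sentence.
sentences = [s.strip() + "." for s in corpus.split(".") if s.strip()]

for i, sentence in enumerate(sentences, start=1):
    print(f"Sentence {i}: {sentence}")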

Step 2: Tokenisation

• After segmenting the sentences, each sentence is then further divided into tokens.
• It is the process of dividing the sentences into tokens.
• Under tokenisation, every word, number and special character is considered separately
and each of them is now a separate token. After this step we get a List of tokens.

Example:

Sentence 1: “Humans interact with each other very easily.”

After tokenization : [‘Humans’,’interact’,’with’,’each’,’other’,’very’,’easily’,’.’]
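As a hedged illustration (the notes do not name a tokenizer; the regular expression below is just one simple way to separate words from punctuation):

import re

sentence = "Humans interact with each other very easily."

# Keep runs of word characters as tokens and punctuation marks as separate tokens.
tokens = re.findall(r"\w+|[^\w\s]", sentence)

print(tokens)
# ['Humans', 'interact', 'with', 'each', 'other', 'very', 'easily', '.']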

Step 3: Stopword Removal

• Stopwords: Stopwords are the words that occur very frequently in the corpus but do not
add any value to it.
• Humans use grammar to make their sentences meaningful for the other person to
understand. But grammatical words do not add any essence to the information which is
to be transmitted through the statement hence they come under stopwords.
• Every language has its own set of stopwords. In English, some of the stopwords are: a,
an, and, are, as, for, it, is, into, in, if, on, or, such, the, there, to, etc.
• In this step, the tokens which are not necessary are removed from the token list. To
make it easier for the computer to focus on meaningful terms, these words are removed.
• Along with these words, a lot of times our corpus might have special characters and/or
numbers.

Example:

After tokenization : [‘Humans’,’interact’,’with’,’each’,’other’,’very’,’easily’,’.’]

After removing the stopword ‘with’ and the special character ‘.’ from the above list:

[‘Humans’,’interact’,’each’,’other’,’very’,’easily’]

Note: - If you are working on a document containing email IDs, then you might not want to remove
the special characters and numbers.
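A minimal sketch of this step, assuming the small hand-written stopword set below (taken from the list in these notes, plus ‘with’); real projects usually use a library list such as NLTK's stopwords corpus, which is not something these notes require:

# Small illustrative stopword set; a library list would normally be used instead.
stopwords = {"a", "an", "and", "are", "as", "for", "it", "is", "into",
             "in", "if", "on", "or", "such", "the", "there", "to", "with"}

tokens = ['Humans', 'interact', 'with', 'each', 'other', 'very', 'easily', '.']

# Drop stopwords (case-insensitively) and bare punctuation tokens.
filtered = [t for t in tokens if t.lower() not in stopwords and t.isalnum()]

print(filtered)   # ['Humans', 'interact', 'each', 'other', 'very', 'easily']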

The figure referred to here (not reproduced) is a plot of the occurrence of words versus their value. As the plot shows:

If words have the highest occurrence in all the documents of the corpus, they are said to have
negligible value and hence are termed stop words. These words are mostly removed at the
pre-processing stage itself.
As we move ahead from the stopwords, the occurrence level drops drastically, and the words
which have adequate occurrence in the corpus are said to have some amount of value; they are
termed frequent words. These words mostly talk about the document's subject, and their
occurrence is adequate in the corpus.
Then, as the occurrence of words drops further, the value of such words rises. These words are
termed rare or valuable words. They occur the least but add the most value to the corpus.

Hence, when we look at the text, we take the frequent and rare words into consideration and
remove the stopwords.

Step 4: Converting text to common case

• We convert the whole text into a similar case, preferably lowercase.


• This ensures that the case sensitivity of the machine does not consider the same words
as different just because of different cases.

Example: After converting all tokens to lower case:

[‘humans’,’interact’,’each’,’other’,’very’,’easily’]
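A one-line Python sketch of this step on the running example (purely illustrative):

tokens = ['Humans', 'interact', 'each', 'other', 'very', 'easily']

# Convert every token to lower case so 'Humans' and 'humans' are treated as the same word.
lower_tokens = [t.lower() for t in tokens]

print(lower_tokens)   # ['humans', 'interact', 'each', 'other', 'very', 'easily']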

Step 5: Stemming

Stemming is a technique used to extract the base form of the words by removing affixes from
them. It is just like cutting down the branches of a tree to its stems.

Example:

(As the original figure highlights, the stemmed words are not always meaningful.)
After stemming: [‘human’,’interact’,’each’,’other’,’very’,’easi’]

Note: - In stemming, the stemmed words (the words we get after removing the affixes) might not
be meaningful.
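As a hedged sketch: the notes do not prescribe a stemmer, so the example below assumes NLTK's PorterStemmer (an external package, installed with pip install nltk). The exact stems it produces may differ slightly from the illustration above, but it shows the same behaviour: affixes are chopped off and the results are not always dictionary words.

from nltk.stem import PorterStemmer   # assumes the nltk package is installed

stemmer = PorterStemmer()
tokens = ['humans', 'interact', 'each', 'other', 'very', 'easily']

# Stemming chops affixes off mechanically; the output may not be a real word.
stems = [stemmer.stem(t) for t in tokens]
print(stems)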

Step 6:- Lemmatisation

• Stemming and lemmatisation are alternative processes to each other, as both play the same
role: removal of affixes.
• Lemmatisation makes sure that the lemma (the word obtained after removing the affixes) is a
meaningful word, and hence it takes a longer time to execute than stemming.
Example:

(As the original figure highlights, all the root words are now meaningful.)

After lemmatisation: [‘human’,’interact’,’each’,’other’,’very’,’easy’]


Note: - In lemmatisation, the lemma (the word we get after removing the affixes) will always
be a meaningful word.
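A hedged sketch using NLTK's WordNetLemmatizer (an assumed library choice, not mandated by these notes). Real lemmatizers need the part of speech, so the example uses a few unambiguous verbs rather than the full token list above; the point it demonstrates is that, unlike stems, every lemma is a dictionary word.

import nltk
from nltk.stem import WordNetLemmatizer   # assumes the nltk package is installed

nltk.download('wordnet', quiet=True)      # one-time download of the lemma dictionary

lemmatizer = WordNetLemmatizer()

# pos='v' tells the lemmatizer to treat each token as a verb.
print(lemmatizer.lemmatize('studies', pos='v'))   # study
print(lemmatizer.lemmatize('smiling', pos='v'))   # smile
print(lemmatizer.lemmatize('went', pos='v'))      # go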

With this we have normalised our text to tokens which are the simplest form of words present
in the corpus. This process is done throughout the corpus for all the documents.
Stemming Vs. Lemmatisation:

Stemming                                           Lemmatisation
The stemmed words might not be meaningful.         The lemma is always a meaningful word.
E.g. ‘easily’ → ‘easi’                             E.g. ‘easily’ → ‘easy’

BAG OF WORDS MODEL (NLP)

Our goal is to convert the natural languages to numbers so that the computer can understand.
With text normalisation we have reduced the complexity of the text. Now it is time to convert
the tokens into numbers. For this, we would use the Bag of Words algorithm.

Bag of Words (BoW):


• It is a Natural Language Processing model which helps in extracting features out of the
text which can be helpful in machine learning algorithms.
• In bag of words, we get the occurrences of each word and construct the vocabulary for
the corpus.
• The BoW algorithm returns to us the unique words out of the corpus and their
occurrences in it.

• Thus, we can say that the bag of words gives us two things:
1. A vocabulary of words for the corpus
2. The frequency of these words (number of times it has occurred in the whole
corpus).
Ques. Why is it called a “bag” of words?
Ans. Calling this algorithm a “bag” of words symbolises that the sequence of the sentences or tokens
does not matter in this case, as all we need are the unique words and their frequency in the corpus.
Bag of Words(BoW) algorithm:

Step 1: Text Normalisation
Step 2: Create Dictionary
Step 3: Create Document Vector
Step 4: Create the document vectors for all the documents

Here is the step-by-step approach to implement bag of words algorithm:

1. Text Normalisation: Collect data and pre-process it


2. Create Dictionary: Make a list of all the unique words occurring in the corpus.
(Vocabulary)
3. Create document vectors: For each document in the corpus, find out how many times the
word from the unique list of words has occurred.
4. Create document vectors for all the documents.

Example: Let us understand the algorithm with an example.


Step 1: Collecting data and text normalisation

Raw Data Processed Data After Normalisation


Document 1: “Aman and Anil are stressed.” [ “aman” , “and” , “anil” , “are” , “stressed”]
Document 2: “Aman went to a therapist.” [ “aman” , “went” , “to” , “a” , “therapist”]
Document 3: “Anil went to download a health [ “anil”, “went” , “to” , “download” , “a” ,
chatbot.” “health” , “chatbot” ]

Note: - No tokens have been removed in the stopwords removal step. It is because we have very little
data and since the frequency of all the words is almost the same, no word can be said to have lesser
value than the other.

Step 2: Creating Dictionary

• Dictionary in NLP means a list of all the unique words occurring in the corpus.
• If some words are repeated in different documents, they are all written just once while
creating the dictionary.

Dictionary:

[“aman”, “and”, “anil”, “are”, “stressed”, “went”, “download”, “health”, “chatbot”, “therapist”, “a”, “to”]

Note: - If some words are repeated in different documents, they are still written just once; while
creating the dictionary, we create a list of unique words.
Step 3: Creating a Document Vector

• The document Vector contains the frequency of each word of the vocabulary in a particular
document.
• In the document vector, the vocabulary is written in the top row.
o Now, for each word in the document, if it matches the vocabulary, put a 1 under it.
o If the same word appears again, increment the previous value by 1.
o And if the word does not occur in that document, put a 0 under it.

             aman  and  anil  are  stressed  went  download  health  chatbot  therapist  a   to
Document 1:   1     1     1    1      1        0       0        0        0         0      0   0

Step 4: Creating a Document Vector Table for all documents

• Now we select another normalised document from the corpus.
• Then we create the document vector for it as in Step 3.
• We repeat the same process for all the documents in the corpus and get the Document Vector
Table.

             aman  and  anil  are  stressed  went  download  health  chatbot  therapist  a   to
Document 1:   1     1     1    1      1        0       0        0        0         0      0   0
Document 2:   1     0     0    0      0        1       0        0        0         1      1   1
Document 3:   0     0     1    0      0        1       1        1        1         0      1   1

In this table, the header row contains the vocabulary of the corpus and three rows
correspond to three different documents.

Finally, this gives us the document vector table for our corpus.
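The whole Bag of Words procedure can be written in a few lines of plain Python. The sketch below is an illustration of the steps above (scikit-learn's CountVectorizer does the same job, but that library is an assumption, not something the notes require):

# Step 1 (text normalisation) is assumed to be done already.
documents = [
    ["aman", "and", "anil", "are", "stressed"],
    ["aman", "went", "to", "a", "therapist"],
    ["anil", "went", "to", "download", "a", "health", "chatbot"],
]

# Step 2: create the dictionary (vocabulary) of unique words.
vocabulary = []
for doc in documents:
    for word in doc:
        if word not in vocabulary:
            vocabulary.append(word)

# Steps 3 and 4: create one document vector per document.
vectors = [[doc.count(word) for word in vocabulary] for doc in documents]

print(vocabulary)
for i, vec in enumerate(vectors, start=1):
    print("Document", i, ":", vec)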

Note: - We have still not achieved our goal, i.e. converting the tokens to numbers. For this we have to
apply another algorithm after this, called TFIDF. It stands for Term Frequency & Inverse Document
Frequency. (This part is optional in our syllabus.)

SAMPLE QUESTIONS WITH ANSWERS

TOPIC: TEXT NORMALIZATION AND BAG OF WORDS

• MCQs:

Q1. With reference to NLP, consider the following plot of occurrence of words versus their value
(the graph itself is not reproduced here). In the given graph, X represents:

a) Tokens

b) Popular words

c) Stop Words

d) Pronouns

Q2. Token is a term used for any word, number or special character occurring in a sentence.

(Token/Punctuators)

Q3. Lemmatization is the process of converting the stemmed words to their proper form to keep them
meaningful.
a) Stemming b) Tokenization c) Lemmatization d) Segmentation

Q4. Bag of words Algorithm is used to extract features of text in the corpus.

a) Bag of words b) Corpus of words c) Bag of corpus d) Term frequency


Q5. Simplification of human language in order to be easily understood by the computer is called?

a) Sentence Planning b) Text Normalisation c) Text realization d) None of these

Q6. What does TF-IDF stands for?

a) Total Frequency – Inverse Document Frequency

b) Total Frequency – Invertible Document Frequency

c) Term Frequency - Inverse Document Frequency

d) None of these

Q7. The term sentence segmentation is

a) The whole corpus is divided into sentences

b) To undergo several steps to normalize the text to a lower level

c) In which each sentence is then further divided into tokens

d) The process in which affixes of words are removed.

Q8. Manav is learning to create an NLP based AI model. He has a corpus which contains several
email IDs. Now, while normalising the text, which of the following should he not perform on the
text?

a) Removing stop words b) Removing Special Characters

c) Segmentation d) Lemmatisation

Q9. While Normalising Text, Manav used the following words:


‘crying’ , ‘cry’ , ‘cried’ , ‘smiling’ , ‘smile’ , ‘smiled’ , ‘smiles’

And the program returned:

‘cry’ , ‘cry’ , ‘cryi’ , ‘smili’ , ‘smile’ , ‘smile’ , ‘smile’

Which of the following techniques did he use in his program?

a) Stemming b) Bag of Words c) Lemmatisation d) Tokenisation


Q10. Consider the following document vector table:

             aman  and  anil  are  stressed  went  download  health  chatbot  therapist  a   to
Document 1:   1     1     1    1      1        0       0        0        0         0      0   0
Document 2:   1     0     0    0      0        1       0        0        0         1      1   1
Document 3:   0     0     1    0      0        1       1        1        1         0      1   1
What is the term frequency of the word “aman”?

a) 3 b) 1 c) 2 d) 0

Q11. A corpus contains 12 documents. How many document vectors will be there for that corpus?
a) 12 b) 1 c) 24 d) 1/12

• ASSERTION-REASONING TYPE QUESTIONS:

Q1.

Assertion: Text normalization is an essential preprocessing step in Natural Language Processing (NLP).

Reasoning: Text normalization helps standardize text data by converting it to a common format, making
it easier to analyze and process.

Options:

A. Assertion is true, and reasoning is true, and the reasoning is the correct explanation of
the assertion.

B. Assertion is true, and reasoning is true, but the reasoning is not the correct explanation of the
assertion.

C. Assertion is false, but reasoning is true.

D. Assertion is true, but reasoning is false.

Q2.

Assertion: In the bag of words (BoW) model, word order is not considered.

Reasoning: BoW represents text as a collection of individual word occurrences without considering their
sequence.

Options:

A. Assertion is true, and reasoning is true, and the reasoning is the correct explanation of
the assertion.

B. Assertion is true, and reasoning is true, but the reasoning is not the correct explanation of the
assertion.

C. Assertion is false, but reasoning is true.

D. Assertion is true, but reasoning is false.


Q3.

Assertion: In text normalization, stemming and lemmatization are two common techniques used to reduce
words to their base or root form.

Reasoning: Stemming is a more aggressive technique than lemmatization and may result in nondictionary
words.

Options:

A. Assertion is true, and reasoning is true, and the reasoning is the correct explanation of the
assertion.
B. Assertion is true, and reasoning is true, but the reasoning is not the correct explanation
of the assertion.

C. Assertion is false, but reasoning is true.

D. Assertion is true, but reasoning is false.

Q4.

Assertion: When we look at the text, we take frequent and rare words into consideration and remove the
stopwords.

Reasoning: The words which have highest occurrence in all the documents of the corpus, they are said to
have negligible value hence they are termed as frequent words.

Options:

A. Assertion is true, and reasoning is true, and the reasoning is the correct explanation of the
assertion.

B. Assertion is true, and reasoning is true, but the reasoning is not the correct explanation of the
assertion.

C. Assertion is false, but reasoning is true.

D. Assertion is true, but reasoning is false.

• True-False TYPE QUESTIONS:

Q1. “Converting text to a common case” is a step in Text Normalisation. (True/False)

Q2. In the Bag of Words algorithm, the sequence of the words does matter. (True/False)

Q3. Text Normalisation converts the text into numerical form. (True/False)
Q4. Dictionary in NLP means a list of all the unique words occurring in the corpus.
(True/False)
• SHORT ANSWER TYPE QUESTIONS:

Q1. What will be the results of conversion of the term, ‘happily’ and ‘travelled’ in the process of
stemming and lemmatization?
Ans.

Token After Stemming After Lemmatisation


‘happily’ ‘happi’ ‘happy’
‘travelled’ ‘travell’ ‘travel’

Q2. What is the importance of eliminating stop words and special characters during Text Normalization
process?

Ans.

• The frequency of stop words is higher in comparison to other words present in the corpus.
• They add unnecessary processing effort.

Hence, they need to be eliminated.

Q3. What is Tokenization? Count how many tokens are present in the following statement:
“I find that the harder I work, the more luck I seem to have.”

Ans. Tokenization:

o It is the process of dividing the sentences into tokens.


o Under tokenization, every word, number and special character is considered
separately and each of them is now a separate token.
o After this step we get a List of tokens.
In the statement, "I find that the harder I work, the more luck I seem to have," if we apply tokenization,
we break down the text into individual tokens or words based on spaces and punctuation and put it in a
list. Here's the tokenized version of the statement:

[ "I" ,"find", "that" , "the" , "harder" , "I" , "work" , ",", "the" , "more" , "luck" , "I" , "seem" , "to" ,
"have" , "." ]

So, there are a total of 16 tokens in the given statement after tokenization.
Q4. Kiara, a beginner in the field of NLP is trying to understand the process of Stemming. Help her in
filling up the following table by suggesting appropriate affixes and stem of the words mentioned there:
S. No.   Word        Affixes   Stem
i.       Tries
ii.      Learning

Ans.

S. No.   Word        Affixes   Stem
i.       Tries       ‘s’       ‘Trie’
ii.      Learning    ‘ing’     ‘Learn’

Q5. Explain the following picture which depicts one of the processes on NLP. Also mention the purpose
which will be achieved by this process.

Ans.

It is the process of “converting to common case” in text normalization.

Purpose: In Text Normalization, we undergo several steps to normalize the text to a lower level. After
the removal of stop words, we convert the whole text into a similar case, preferably lower case. This
ensures that the case-sensitivity of the machine does not consider same words as different just because of
different cases.
Q6. What is Document Frequency?

Ans.

• It is also written as DF.


• Document Frequency is the number of documents in which the word occurs irrespective of how
many times it has occurred in those documents.
• Example:
o Consider the following document vector table

             aman  and  anil  are  stressed  went  to
Document 1:   1     1     1    1      1        0    0
Document 2:   1     0     0    0      0        1    1
Document 3:   0     0     1    0      0        1    1
Here, Document Frequency (“aman”) = 1 + 1 + 0 = 2,
Document Frequency (“and”) = 1 + 0 + 0 = 1, etc.

                       aman  and  anil  are  stressed  went  to
Document Frequency:     2     1     2    1      1        2    2

Q7. What is Inverse Document Frequency?

Ans.

• It is also written as IDF.
• In the case of inverse document frequency, we put the document frequency in the
denominator while the total number of documents is the numerator.
• So,
IDF (token) = Total number of documents / Number of documents in which the token occurs

• Example:
o Consider the following document vector table

With Total Number of Documents = 3

             aman  and  anil  are  stressed  went  to
Document 1:   1     1     1    1      1        0    0
Document 2:   1     0     0    0      0        1    1
Document 3:   0     0     1    0      0        1    1

Here, Document Frequency (“aman”) = 2, so IDF (“aman”) = 3/2

Document Frequency (“and”) = 1, so IDF (“and”) = 3/1 = 3

Q8. What is TFIDF? Write its formula.

Ans.

• Term frequency–inverse document frequency, is a numerical statistic that is intended to reflect


how important a word is to a document in a collection or corpus.
• The formula of TFIDF for any word W becomes:

TFIDF (W) = TF (W) * log (IDF (W))
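A minimal sketch of this formula applied to the smaller document vector table used in Q6 and Q7 above. The logarithm base is not specified in the notes, so base 10 is assumed here:

import math

vocabulary  = ["aman", "and", "anil", "are", "stressed", "went", "to"]
doc_vectors = [
    [1, 1, 1, 1, 1, 0, 0],   # Document 1
    [1, 0, 0, 0, 0, 1, 1],   # Document 2
    [0, 0, 1, 0, 0, 1, 1],   # Document 3
]
total_docs = len(doc_vectors)

for j, word in enumerate(vocabulary):
    df  = sum(1 for vec in doc_vectors if vec[j] > 0)   # document frequency
    idf = total_docs / df                               # inverse document frequency
    for i, vec in enumerate(doc_vectors, start=1):
        tf    = vec[j]                                  # term frequency in this document
        tfidf = tf * math.log10(idf)                    # TFIDF(W) = TF(W) * log(IDF(W))
        print(f"TFIDF({word!r}, Document {i}) = {tfidf:.3f}")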

Q9. Does the vocabulary of a corpus remain the same before and after text normalization? Why?

Ans.

No, the vocabulary of a corpus does not remain the same before and after text normalization. Reasons are

● In normalization the text is normalized through various steps and is lowered to minimum
vocabulary since the machine does not require grammatically correct statements but the
essence of it.

● In normalization Stop words, Special Characters and Numbers are removed.

● In stemming the affixes of words are removed and the words are converted to their base
form.

So, after normalization, we get the reduced vocabulary.


Q10. Differentiate between Human Language and Computer Language.

Ans.

• Human language is made up of letters, words and sentences, depending on the language, whereas a
machine/computer understands only the language of numbers (binary 0s and 1s), so everything sent to
the machine has to be converted to numbers.
• It is very easy for humans to process and communicate in natural languages like English, Hindi, etc.,
whereas for machines, understanding and generating natural language is a very complex process.
• Our brain keeps on processing the sounds it hears around itself and tries to make sense of them all
the time, whereas a computer uses NLP techniques like Text Normalisation and Bag of Words to
convert the text to numbers before it can process it.

• LONG ANSWER TYPE/CASE STUDY BASED QUESTIONS:

Q1. What do you mean by Bag of words in NLP? Differentiate between stemming and lemmatization.
Explain with the help of an example.

Ans. (Answer is in the notes above)

Q2. Create a document vector table from the following documents by implementing all the four steps of
Bag of words model. Also depict the outcome of each step.

Document 1: Sameera and Sanya are classmates.

Document 2: Sameera likes dancing but Sanya loves to study mathematics.

Ans.

Step 1: Text Normalization

Document 1: [ ‘Sameera’, ‘Sanya’, ‘classmate’]

Document 2: [‘Sameera’, ‘like’, ‘dance’, ‘Sanya’, ‘love’, ‘study’, ‘mathematics’]

Step 2: Creating Dictionary


[‘Sameera’ , ‘Sanya’ , ‘classmate’ , ‘like’ , ‘dance’ , ‘love’ , ‘study’ , ‘mathematics’ ]
Step 3: Creating document vectors

Document Vector 1:

             Sameera  Sanya  classmate  like  dance  love  study  mathematics
Document 1:     1       1        1       0     0      0     0         0

Document Vector 2:

             Sameera  Sanya  classmate  like  dance  love  study  mathematics
Document 2:     1       1        0       1     1      1     1         1

Step 4: Creating Document Vector Table

             Sameera  Sanya  classmate  like  dance  love  study  mathematics
Document 1:     1       1        1       0     0      0     0         0
Document 2:     1       1        0       1     1      1     1         1

Q3. Consider the text of the following documents:

Document 1: Sahil likes to play cricket.

Document 2: Sajal likes cricket too

Document 3: Sajal also likes to play basketball.

Apply all the four steps of Bag of words model of NLP on the above given document and generate the
output.

Ans. [Similar to Q2. Try to do this on your own.]

Q4. Perform all of Text Normalization operation on the following text and write reduced text at the end
of operation. Show output of each step during the process.

“Player in the ground managed the situation beautifully otherwise it.”

Ans.

Step 1: Sentence Segmentation

Since there is only one sentence, the output is the sentence itself.

Output: - “Player in the ground managed the situation beautifully otherwise it.”

Step 2: Tokenization
Output:- [“Player” , “in” , “the” , “ground” , “managed” , “the” , “situation” , “beautifully” ,
“otherwise” , “it” ]

Step 3: Eliminating stop words

Output: - [“Player”, “ground”, “managed”, “situation”, “beautifully”, “otherwise”]

Step 4: Converting text to a common case

Output: - [“player” , “ground” , “managed” , “situation” , “beautifully” , “otherwise”]

Step 5: Stemming and Lemmatization

Output: - [“player” , “ground” , “manage” , “situation” , “beautiful” , “other”]


Q5. Samiksha, a student of class X was exploring the Natural Language Processing domain. She got
stuck while performing the text normalization. Help her to normalize the text on the segmented sentences
given below:

Document 1: Akash and Ajay are best friends.

Document 2: Akash likes to play football but Ajay prefers to play online games.

Ans.

Step 1. Tokenisation

[Akash, and, Ajay, are, best, friends, Akash, likes, to, play, football, but, Ajay,
prefers, to, play, online, games]

Step 2. Removal of stopwords

[Akash, Ajay, best, friends, Akash, likes, play, football, Ajay, prefers, play, online,
games]

Step 3. Converting text to a common case

[akash, ajay, best, friends, akash, likes, play, football, ajay, prefers, play, online,
games]

Step 4. Stemming/Lemmatization

[akash, ajay, best, friend, akash, like, play, football, ajay, prefer, play, online, game]

TOPIC: EVALUATION
What is evaluation? Model Evaluation Terminologies, Confusion matrix

What is evaluation?

• Evaluation is the process of understanding the reliability of any AI model, based on


outputs by feeding test dataset into the model and comparing with actual answers.
• There can be different Evaluation techniques, depending on the type and purpose of the
model.
Note: - If at the time of evaluation we use the data we used to build the model i.e.
Training Data = Testing Data,
then it will always predict the correct label for any point in the training set.

This is known as overfitting.


Model Evaluation Terminologies:

There are various new terminologies which come into the picture when we work on evaluating our model.
Let’s explore them through a case study.

Case study 1:- Cancer Detection Model

Overview: This model takes the MRI scan image of a patient and detects whether that
patient has cancer or not. It also detects the cancerous area in the scanned image if the person
has cancer.

(Figure: On a multiparametric MRI scan of a patient's prostate, a cancer-suspicious area (red) is
highlighted by an AI model developed by Dr. Turkbey. Credit: Courtesy of Stephanie Harmon, Ph.D.)

There exist two conditions which we need to ponder upon: Prediction and Reality.

• The prediction is the output which is given by the machine.


• The reality is the real scenario at the time when the prediction has been made.

Now let us look at various combinations that we can have with these two conditions.

Case 1: Does the patient have cancer? (True Positive)

Here, we can see in the MRI scan image that the patient does have cancer. The model predicts a Yes,
which means the patient has cancer. The Prediction matches the Reality. Hence, this condition is
termed True Positive.

Case 2: Does the patient have cancer? (True Negative)

Here, we can see in the MRI scan image that the patient does not have cancer. The model predicts a
No. The Prediction matches the Reality. Hence, this condition is termed True Negative.

Case 3: Does the patient have cancer? (False Positive)

Here, we can see in the MRI scan image that the patient does not have cancer, but the model predicts
a Yes. The Prediction does not match the Reality. Hence, this condition is termed False Positive. It is
also known as a Type 1 error.

Case 4: Does the patient have cancer? (False Negative)

Here, we can see in the MRI scan image that the patient does have cancer, but the model predicts a
No. The Prediction does not match the Reality. Hence, this condition is termed False Negative. It is
also known as a Type 2 error.
Summary:

Prediction = Yes, Reality = Yes → True Positive (TP)
Prediction = No,  Reality = No  → True Negative (TN)
Prediction = Yes, Reality = No  → False Positive (FP, Type 1 error)
Prediction = No,  Reality = Yes → False Negative (FN, Type 2 error)

Confusion Matrix:

• It’s just a representation of the above parameters in a matrix format. Better visualization is
always good :)
• In each of the cells (like TP, FN, FP and TN) we write a number that represents the count of the
particular cases found in the test data set after comparing the prediction with the reality.

• It's not an evaluation metric rather it helps in evaluation.

• A confusion matrix is a method of recording the result of comparisons between the prediction
and the reality, based on the results of the test dataset for an AI model, in a 2×2 matrix.
Example:
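The original example image is not reproduced here. As a hedged illustration, the sketch below builds the four counts of a confusion matrix from two hypothetical lists of labels (both the data and the Yes/No encoding are assumptions made only for this example):

# Hypothetical reality and prediction labels for eight test cases.
reality    = ["Yes", "No", "Yes", "No", "No", "Yes", "No", "No"]
prediction = ["Yes", "No", "No",  "No", "Yes", "Yes", "No", "No"]

# Count each of the four prediction/reality combinations.
tp = sum(1 for r, p in zip(reality, prediction) if r == "Yes" and p == "Yes")
tn = sum(1 for r, p in zip(reality, prediction) if r == "No"  and p == "No")
fp = sum(1 for r, p in zip(reality, prediction) if r == "No"  and p == "Yes")
fn = sum(1 for r, p in zip(reality, prediction) if r == "Yes" and p == "No")

print("                  Reality: Yes   Reality: No")
print(f"Prediction: Yes   TP = {tp}         FP = {fp}")
print(f"Prediction: No    FN = {fn}         TN = {tn}")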

Evaluation Methods or KPIs:


The different evaluation methods or KPIs (Key Performance Indicators) for classification
models are:
• Accuracy
• Precision
• Recall/Sensitivity/True Positive Rate
• F1 Score

1. Accuracy: The most commonly used metric to judge a model. It measures how many times the
model gives the correct/True prediction out of the total possible cases.
   Accuracy = (TP + TN) / (TP + TN + FP + FN)
2. Precision: The percentage of true positive instances out of the total predicted positive instances.
Here the denominator is everything the model predicted as positive from the whole dataset. Take it
as finding out ‘how often the model is right when it says it is right’.
   Precision = TP / (TP + FP)
3. Recall: The percentage of true positive instances out of the total actual positive instances.
Therefore the denominator (TP + FN) is the actual number of positive instances present in the
dataset. Take it as finding out ‘how many of the actual right ones the model missed when it showed
the right ones’.
   Recall = TP / (TP + FN)
4. F1 Score: The harmonic mean of precision and recall. It takes the contribution of both, so the
higher the F1 score, the better.
   F1 Score = 2 × Precision × Recall / (Precision + Recall)
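A short sketch that turns the four formulas above into code, reusing the TP/TN/FP/FN values from the long-answer question at the end of these notes (results are rounded to two decimal places, so they may differ slightly from the one-decimal rounding used in that answer key):

def evaluate(tp, tn, fp, fn):
    # Formulas as given in these notes (accuracy is the standard definition).
    accuracy  = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall    = tp / (tp + fn)
    f1_score  = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1_score

acc, prec, rec, f1 = evaluate(tp=60, tn=10, fp=25, fn=5)
print(f"Accuracy={acc:.2f}  Precision={prec:.2f}  Recall={rec:.2f}  F1={f1:.2f}")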

SAMPLE QUESTIONS WITH ANSWERS

TOPIC: What is evaluation? Model Evaluation Terminologies, Confusion matrix

• MCQs:

Q1. Raunak was learning the conditions that make up the confusion matrix. He came across a scenario
in which the machine that was supposed to predict an animal was always predicting not an animal. What
is this condition called?

(a) False Positive

(b) True Positive

(c) False Negative

(d) True Negative

Q2. The output given by the AI machine is known as ________ (Prediction/ Reality)

Q3. _____________ is used to record the result of comparison between the prediction and reality. It is
not an evaluation metric but a record which can help in evaluation.

(a) Reliability Matrix

(b) Comparison Matrix

(c) Confusion Matrix

(d) None of these


Q4. Prediction and Reality can be easily mapped together with the help of:

a) Prediction

b) Reality

c) Accuracy
d) Confusion Matrix

Q5. _________________ is the last stage of the AI project Life cycle.

a) Problem Scoping

b) Evaluation

c) Modelling

d) Data Acquisition

Q6. Which of the following statements is true for the term Evaluation?

a) Helps in classifying the type and genre of a document.

b) It helps in predicting the topic for a corpus.

c) Helps in understanding the reliability of any AI model

d) Process to extract the important information out of a corpus.

Q7. Differentiate between Prediction and Reality.

a) Prediction is the input given to the machine to receive the expected result of the reality.

b) Prediction is the output given to match the reality.

c) The prediction is the output which is given by the machine and the reality is the real scenario in
which the prediction has been made.

d) Prediction and reality both can be used interchangeably.

Q8. When the model predicts No for a situation and the Prediction does not match the Reality, the
condition is termed as

a) True Positive

b) False Positive

c) True Negative

d) False Negative

Q9. For a True Positive which of the following is true:

a) Prediction ≠ Reality

b) Prediction = 1

c) Reality = 0

d) Reality = Yes
Q10. When we use the training data for testing the model, the model is said to be

a) Overfitting

b) Underfitting

c) Perfectly fit

d) None of these

• SHORT ANSWER TYPE QUESTIONS:

Q1. What is purpose of Evaluation stage? Discuss briefly.

Ans.

• To check the reliability of the model


• To calculate the accuracy
• To compare it with other models

Q2. Which two parameters are considered for Evaluation of a model?

Ans.

Prediction and Reality are the two parameters considered for Evaluation of a model.

• The “Prediction” is the output which is given by the machine


• The “Reality” is the real scenario when the prediction has been made

Q3. What is True Positive?

• The predicted value matches the actual value


• The actual value was positive and the model predicted a positive value

Q4. What is True Negative?

• The predicted value matches the actual value


• The actual value was negative and the model predicted a negative value

Q5. What is False Positive?

• The predicted value was falsely predicted


• The actual value was negative but the model predicted a positive value
• Also known as the Type 1 error

Q6. What is False Negative?

• The predicted value was falsely predicted


• The actual value was positive but the model predicted a negative value
• Also known as the Type 2 error
Q7. What is meant by Overfitting of Data?

• Overfitting is "the production of an analysis that corresponds too closely or exactly to a particular
set of data, and may therefore fail to fit additional data or predict future observations reliably".
• Models that use the training dataset during testing will always give correct outputs. This is
known as overfitting.

Q8. Let us assume that for training of a cancer detection model there are total 100 data points in our
training dataset of cancer patients and out of that 4% patients actually have cancer. Now after the
completion of training it has been seen that the number of true positive and true negative cases are 2
and 95 respectively.

Draw the confusion matrix.

Ans.
Total cases = 100; actual cancer patients = 4% of 100 = 4.
Given TP = 2 and TN = 95, so FN = 4 − 2 = 2 and FP = 100 − (2 + 95 + 2) = 1.

                  Reality: Yes   Reality: No
Prediction: Yes   TP = 2         FP = 1
Prediction: No    FN = 2         TN = 95

Q9. Define Evaluation.

Ans.

Moving towards deploying the model in the real world, we test it in as many ways as possible. The stage
of testing the models is known as EVALUATION.

• LONG ANSWER TYPE/CASE STUDY BASED QUESTIONS:

Q1. What is a confusion matrix? Explain in detail with the help of an example.

Ans. [can be found in the notes above]

Q2. Imagine that you have come up with an AI based prediction model which has been deployed on the
roads to check traffic jams. Now, the objective of the model is to predict whether there will be a traffic
jam or not. Now, to understand the efficiency of this model, we need to check if the predictions which
it makes are correct or not. Thus, there exist two conditions which we need to ponder upon: Prediction
and Reality. Traffic jams have become a common part of our lives nowadays. Living in an urban area
means you have to face traffic each and every time you get out on the road. Mostly, school students opt
for buses to go to school. Many times, the bus gets late due to such jams and the students are not able
to reach their school on time.

Considering all the possible situations make a Confusion Matrix for the above situation.

Ans.

Case 1: Is there a traffic Jam?

Prediction: Yes Reality: Yes

True Positive

Case 2: Is there a traffic Jam?

Prediction: No Reality: No

True Negative

Case 3: Is there a traffic Jam?

Prediction: Yes Reality: No

False Positive

Case 4: Is there a traffic Jam?

Prediction: No Reality: Yes


False Negative
Q3.
Ans. (i) TP=60, TN=10, FP=25, FN=5

60+25+5+10=100 total cases have been performed

(ii) (Note: For calculating Precision, Recall and F1 score, we need not multiply the formula by 100 as all
these parameters need to range between 0 to 1)

Precision =TP/(TP+FP)=60/(60+25) = 60/85 =0.7

Recall=TP/(TP+FN)=60/(60+5) = 60/65 = 0.92

F1 Score=2*Precision*Recall/ (Precision + Recall)=2*0.7*0.92/ (0.7+0.92) = 0.79

(1 mark for total number of cases; 1 mark each for the calculation of precision, recall and F1 score)

-------------x------------------------------------------x---------------------------------------------x------------------
