Notes On Module 3 - Pattern Recognition
PECCS702B
WorkBook
Semester - 7
Prof. Bavrabi Ghosh
Prototype methods seek a minimal subset of samples that can serve as a distillation or
condensed view of a data set. As the size of modern data sets grows, being able to present a
domain specialist with a short list of “representative” samples chosen from the data set is of
increasing interpretative value.
Model prototyping is the phase in the pattern recognition model development lifecycle where
data scientists iterate towards the best-performing models through data loading, cleansing,
preparation, feature engineering, model training, tuning and scoring, so that the resulting
model can be used in a production environment to meet a business need. On the data side, this
experimental and iterative phase is where data scientists gather all the domain knowledge
from SMEs, explore the univariate data distributions and relationships between features
and possible target labels, and establish relationships among multiple features. On the
model side, data scientists explore different modelling options based upon the identified
business use case, as well as requirements for interpretability and metrics for evaluating the
performance of the models.
The various decisions made during model prototyping contribute to the end performance of
AI applications. Further optimizing and automating the model prototyping experience for
rapid iteration enables data scientists to become efficient in terms of time taken,
infrastructural resources used and the number of experiments required, thereby accelerating
the entire AI application development lifecycle.
The Naïve Bayes algorithm is a supervised learning algorithm based on Bayes' theorem and
used for solving classification problems.
It is mainly used in text classification with high-dimensional training datasets.
The Naïve Bayes Classifier is one of the simplest and most effective classification
algorithms, helping to build fast machine learning models that can make quick predictions.
It is a probabilistic classifier, which means it predicts on the basis of the probability of an
object.
Some popular applications of the Naïve Bayes algorithm are spam filtering, sentiment
analysis, and classifying articles.
The name Naïve Bayes is made up of two words, Naïve and Bayes, which can be described as:
Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is
independent of the occurrence of other features. For example, if a fruit is identified on the
basis of colour, shape, and taste, then a red, spherical, and sweet fruit is recognized as an
apple. Each feature individually contributes to identifying it as an apple, without depending
on the others.
Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.
Bayes theorem
Bayes' theorem is also known as Bayes' Rule or Bayes' law. It is used to determine the
probability of a hypothesis with prior knowledge, and it depends on conditional probability.
The formula for Bayes' theorem is given as:

P(A|B) = P(B|A) * P(A) / P(B)

Where,
P(A|B) is Posterior probability: probability of hypothesis A given the observed evidence B.
P(B|A) is Likelihood probability: probability of the evidence given that the hypothesis is true.
P(A) is Prior probability: probability of the hypothesis before observing the evidence.
P(B) is Marginal probability: probability of the evidence.
Working of Naïve Bayes' Classifier can be understood with the help of the below example:
Suppose we have a dataset of weather conditions and a corresponding target variable "Play".
Using this dataset, we need to decide whether or not to play on a particular day according to
the weather conditions. To solve this problem, we follow the steps below:
1. Convert the given dataset into frequency tables.
2. Generate Likelihood table by finding the probabilities of given features.
3. Now, use Bayes theorem to calculate the posterior probability.
Problem: If the weather is sunny, should the player play or not?
Day   Outlook    Play
0     Rainy      Yes
1     Sunny      Yes
2     Overcast   Yes
3     Overcast   Yes
4     Sunny      No
5     Rainy      Yes
6     Sunny      Yes
7     Overcast   Yes
8     Rainy      No
9     Sunny      No
10    Sunny      Yes
11    Rainy      No
12    Overcast   Yes
13    Overcast   Yes
Frequency table for the Outlook feature:

Weather    Yes   No
Overcast   5     0
Rainy      2     2
Sunny      3     2
Total      10    4

Likelihood table for the Outlook feature:

Weather    No            Yes            P(Weather)
Overcast   0             5              5/14 = 0.35
Rainy      2             2              4/14 = 0.29
Sunny      2             3              5/14 = 0.35
All        4/14 = 0.29   10/14 = 0.71
Applying Bayes' theorem:

P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny)
P(Sunny|Yes) = 3/10 = 0.30
P(Sunny) = 0.35
P(Yes) = 0.71
So P(Yes|Sunny) = 0.30 * 0.71 / 0.35 = 0.60

P(No|Sunny) = P(Sunny|No) * P(No) / P(Sunny)
P(Sunny|No) = 2/4 = 0.50
P(No) = 0.29
P(Sunny) = 0.35
So P(No|Sunny) = 0.50 * 0.29 / 0.35 = 0.41

Since P(Yes|Sunny) > P(No|Sunny), the prediction for a sunny day is that the player can play
the game.
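The same calculation can be reproduced in a few lines of plain Python. This is a minimal
sketch (no libraries) that recomputes both posteriors directly from the 14-row dataset above:

data = [("Rainy", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"), ("Overcast", "Yes"),
        ("Sunny", "No"), ("Rainy", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"),
        ("Rainy", "No"), ("Sunny", "No"), ("Sunny", "Yes"), ("Rainy", "No"),
        ("Overcast", "Yes"), ("Overcast", "Yes")]

n = len(data)
p_sunny = sum(1 for o, _ in data if o == "Sunny") / n           # evidence P(Sunny)
for label in ("Yes", "No"):
    rows = [o for o, p in data if p == label]
    prior = len(rows) / n                                       # prior P(label)
    likelihood = rows.count("Sunny") / len(rows)                # P(Sunny|label)
    print(label, round(likelihood * prior / p_sunny, 2))        # posterior P(label|Sunny)
# Prints Yes 0.6 and No 0.4 (the worked example gets 0.41 because it rounds
# the intermediate probabilities first). "Yes" wins either way.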
Note - https://www.analyticsvidhya.com/blog/2022/10/frequently-asked-interview-
questions-on-naive-bayes-classifier/
Advantages of Naïve Bayes Classifier:
o Naïve Bayes is one of the fastest and easiest ML algorithms for predicting the class of a
dataset.
o It can be used for binary as well as multi-class classification.
o It performs well in multi-class predictions compared to other algorithms.
o It is the most popular choice for text classification problems.
Disadvantages of Naïve Bayes Classifier:
o Naïve Bayes assumes that all features are independent or unrelated, so it cannot learn the
relationship between features.
There are three types of Naive Bayes Model, which are given below:
o Gaussian: The Gaussian model assumes that features follow a normal distribution.
This means if predictors take continuous values instead of discrete, then the model
assumes that these values are sampled from the Gaussian distribution.
o Multinomial: The Multinomial Naïve Bayes classifier is used when the data is
multinomially distributed. It is primarily used for document classification problems, i.e.,
deciding which category a particular document belongs to, such as Sports, Politics,
Education, etc.
The classifier uses the frequency of words for the predictors.
o Bernoulli: The Bernoulli classifier works similarly to the Multinomial classifier, but the
predictor variables are independent Boolean variables, such as whether a particular word is
present in a document or not. This model is also well known for document classification
tasks.
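As a hedged illustration of the three variants, here is a minimal scikit-learn sketch; the tiny
arrays are invented purely for demonstration and are not part of the original notes:

import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

y = np.array([0, 0, 1, 1])

# Gaussian: continuous features, assumed normally distributed within each class.
X_cont = np.array([[1.0, 2.1], [1.2, 1.9], [3.8, 4.0], [4.1, 3.9]])
print(GaussianNB().fit(X_cont, y).predict([[1.1, 2.0]]))

# Multinomial: features are word counts per document.
X_counts = np.array([[2, 0, 1], [3, 1, 0], [0, 4, 2], [1, 3, 3]])
print(MultinomialNB().fit(X_counts, y).predict([[2, 0, 0]]))

# Bernoulli: features are binary word presence/absence indicators.
X_bin = (X_counts > 0).astype(int)
print(BernoulliNB().fit(X_bin, y).predict([[1, 0, 0]]))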
Steps to Implement the Naïve Bayes Algorithm (Spam Filtering Example):
Spam e-mail has become a big problem on the internet. Spam wastes time, storage space
and communication bandwidth, and the problem has been growing for years. According to
recent statistics, 40% of all e-mails are spam, which amounts to about 15.4 billion e-mails
per day and costs internet users about $355 million per year. Knowledge engineering and
machine learning are the two general approaches used in e-mail filtering. In the knowledge
engineering approach, a set of rules has to be specified according to which e-mails are
categorized as spam or ham.
The machine learning approach is more efficient than the knowledge engineering approach
because it does not require specifying any rules. Instead, it uses a set of training samples:
pre-classified e-mail messages. A specific algorithm is then used to learn the classification
rules from these messages. The machine learning approach has been widely studied, and
many algorithms can be used in e-mail filtering, including Naive Bayes, support vector
machines, neural networks, K-nearest neighbour, rough sets and artificial immune systems.
Naive Bayes works on dependent events: the probability of an event occurring in the future
can be estimated from previous occurrences of the same event. This technique can be used
to classify spam e-mails, and word probabilities play the main role here. If some words occur
often in spam but rarely in ham, then an incoming e-mail containing them is probably spam.
The Naive Bayes classifier has become a very popular method in e-mail filtering. Every word
has a certain probability of occurring in spam or ham e-mail in the filter's database. If the
combined word probabilities exceed a certain limit, the filter marks the e-mail as one
category or the other. Here, only two categories are necessary: spam or ham.
The statistic we are most interested in for a token T is its spamminess (spam rating),
calculated as follows:

S[T] = CSpam(T) / (CSpam(T) + CHam(T))

where CSpam(T) and CHam(T) are the numbers of spam and ham messages containing token
T, respectively. To classify a message M with tokens {T1, ..., TN}, one needs to combine the
individual tokens' spamminess values to evaluate the overall message spamminess. A simple
way to make classifications is to calculate the product of the individual tokens' spamminess,

S[M] = Π S[Ti]

and compare it with the product of the individual tokens' hamminess,

H[M] = Π (1 - S[Ti])

The message is considered spam if the overall spamminess product S[M] is larger than the
hamminess product H[M].
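A minimal sketch of this scoring scheme in plain Python follows; the token counts in the
counts table are hypothetical training statistics, invented only to make the example run:

from math import prod

def spamminess(c_spam, c_ham):
    # S[T] = CSpam(T) / (CSpam(T) + CHam(T))
    return c_spam / (c_spam + c_ham)

# token -> (number of spam messages containing it, number of ham messages)
counts = {"free": (40, 5), "meeting": (2, 30), "winner": (25, 1)}

def classify(tokens):
    s = [spamminess(*counts[t]) for t in tokens if t in counts]
    s_m = prod(s)                     # overall spamminess  S[M] = product of S[Ti]
    h_m = prod(1 - x for x in s)      # overall hamminess   H[M] = product of (1 - S[Ti])
    return "spam" if s_m > h_m else "ham"

print(classify(["free", "winner"]))   # -> spam
print(classify(["meeting"]))          # -> ham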
The Naïve Bayes classifier operates in two stages:
Training Stage.
Testing Stage.
In the training stage, Naïve Bayes builds a lookup table that stores all the probabilities the
algorithm needs for predicting the result. In the testing stage, given a test point, the
algorithm fetches the stored probabilities from the lookup table and uses them to predict
the result.
Decision Tree is a Supervised learning technique that can be used for both classification and
Regression problems, but mostly it is preferred for solving Classification problems. It is a
tree-structured classifier, where internal nodes represent the features of a dataset,
branches represent the decision rules and each leaf node represents the outcome.
In a Decision tree, there are two nodes, which are the Decision Node and Leaf Node.
Decision nodes are used to make any decision and have multiple branches, whereas Leaf
nodes are the output of those decisions and do not contain any further branches.
The decisions or the test are performed on the basis of features of the given dataset.
It is called a decision tree because, similar to a tree, it starts with the root node, which
expands on further branches and constructs a tree-like structure.
In order to build a tree, we use the CART algorithm, which stands for Classification and
Regression Tree algorithm.
A decision tree simply asks a question and, based on the answer (Yes/No), further splits the
tree into subtrees.
There are various algorithms in Machine learning, so choosing the best algorithm for the
given dataset and problem is the main point to remember while creating a machine learning
model. Below are the two reasons for using the Decision tree:
Decision Trees usually mimic human thinking ability while making a decision, so they are
easy to understand.
The logic behind the decision tree can be easily understood because it shows a tree-like
structure.
In a decision tree, for predicting the class of the given dataset, the algorithm starts from the
root node of the tree. This algorithm compares the values of root attribute with the record
(real dataset) attribute and, based on the comparison, follows the branch and jumps to the
next node.
For the next node, the algorithm again compares the attribute value with the other sub-
nodes and moves further. It continues this process until it reaches a leaf node of the tree.
The complete process can be better understood using the below algorithm:
o Step-1: Begin the tree with the root node, say S, which contains the complete
dataset.
o Step-2: Find the best attribute in the dataset using an Attribute Selection Measure
(ASM).
o Step-3: Divide S into subsets that contain the possible values of the best attribute.
o Step-4: Generate the decision tree node that contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset created
in Step-3. Continue this process until a stage is reached where the nodes cannot be
classified further; such final nodes are called leaf nodes.
Example: Suppose there is a candidate who has a job offer and wants to decide whether he
should accept the offer or not. To solve this problem, the decision tree starts with the
root node (the Salary attribute, selected by ASM). The root node splits further into the next
decision node (Distance from the office) and one leaf node based on the corresponding labels.
The next decision node further splits into one decision node (Cab facility) and one leaf node.
Finally, the decision node splits into two leaf nodes (Accepted offer and Declined offer).
[Diagram: decision tree for the job-offer example, with Salary at the root, followed by
Distance from the office and Cab facility, ending in Accepted/Declined offer leaves.]
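A hedged sketch of fitting such a tree with scikit-learn's CART implementation on the
weather dataset from the Naïve Bayes section is shown below; the integer encoding
Rainy=0, Sunny=1, Overcast=2 is an illustrative choice, not part of the original notes:

from sklearn.tree import DecisionTreeClassifier, export_text

# Outlook encoded as Rainy=0, Sunny=1, Overcast=2 (illustrative encoding).
X = [[0], [1], [2], [2], [1], [0], [1], [2], [0], [1], [1], [0], [2], [2]]
y = ["Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "No", "No",
     "Yes", "No", "Yes", "Yes"]

tree = DecisionTreeClassifier(criterion="gini")      # CART uses Gini by default
tree.fit(X, y)
print(export_text(tree, feature_names=["Outlook"]))  # the learned binary splits
print(tree.predict([[1]]))  # prediction for a Sunny day -> ['Yes'],
                            # agreeing with the Naive Bayes example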
Attribute Selection Measures
While implementing a Decision tree, the main issue that arises is how to select the best
attribute for the root node and for the sub-nodes. To solve such problems there is a
technique called the Attribute Selection Measure, or ASM. Using this measure, we can
easily select the best attribute for the nodes of the tree. There are two popular ASM
techniques:
o Information Gain
o Gini Index
1. Information Gain:
Information gain measures the change in entropy after the dataset is split on an attribute. It
can be calculated using the formula:

Information Gain = Entropy(S) - [(Weighted Avg) * Entropy(each feature)]

Entropy, the measure of impurity in a set of samples, is defined for a two-class problem as:

Entropy(S) = -P(yes) * log2 P(yes) - P(no) * log2 P(no)

Where,
S = the total set of samples
P(yes) = probability of yes
P(no) = probability of no
2. Gini Index:
o Gini index is a measure of impurity or purity used while creating a decision tree in
the CART(Classification and Regression Tree) algorithm.
o An attribute with the low Gini index should be preferred as compared to the high
Gini index.
o It only creates binary splits, and the CART algorithm uses the Gini index to create
binary splits.
o Gini index can be calculated using the below formula:

Gini Index = 1 - Σj (Pj)^2

where Pj is the proportion of samples belonging to class j at the node.
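Both measures are easy to compute by hand; the following plain-Python sketch evaluates
them on the Outlook feature of the weather dataset used earlier:

from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(feature, labels):
    # Entropy of the parent minus the weighted entropy of each subset.
    gain = entropy(labels)
    for value in set(feature):
        subset = [l for f, l in zip(feature, labels) if f == value]
        gain -= (len(subset) / len(labels)) * entropy(subset)
    return gain

outlook = ["Rainy", "Sunny", "Overcast", "Overcast", "Sunny", "Rainy", "Sunny",
           "Overcast", "Rainy", "Sunny", "Sunny", "Rainy", "Overcast", "Overcast"]
play = ["Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "No", "No",
        "Yes", "No", "Yes", "Yes"]

print(information_gain(outlook, play))   # information gain of splitting on Outlook
print(gini(play))                        # Gini index of the unsplit node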
Pruning is the process of deleting unnecessary nodes from a tree in order to get the
optimal decision tree.
A too-large tree increases the risk of overfitting, while a small tree may not capture all the
important features of the dataset. A technique that decreases the size of the learning tree
without reducing accuracy is known as pruning. There are mainly two types of tree pruning
techniques used:
o Cost Complexity Pruning
o Reduced Error Pruning
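Cost complexity pruning is exposed directly in scikit-learn through the ccp_alpha
parameter. The sketch below, using the built-in iris dataset and an illustrative alpha value,
shows how it shrinks a fully grown tree:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
unpruned = DecisionTreeClassifier(random_state=0).fit(X, y)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.02).fit(X, y)
# The pruned tree has fewer nodes while keeping most of the accuracy.
print(unpruned.tree_.node_count, "->", pruned.tree_.node_count)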
Linear Discriminant Analysis (LDA) is one of the commonly used dimensionality reduction
techniques in machine learning, applied to classification problems with two or more classes.
It is also known as Normal Discriminant Analysis (NDA) or Discriminant Function Analysis (DFA).
It can be used to project features from a higher-dimensional space into a lower-dimensional
space, reducing resource and dimensionality costs. In this topic, "Linear Discriminant
Analysis (LDA) in machine learning", we will discuss the LDA algorithm for classification
predictive modelling problems, the limitations of logistic regression, the representation of
the LDA model, how to make predictions with LDA, how to prepare data for LDA, extensions
to LDA and much more. So, let's start with a quick introduction to Linear Discriminant
Analysis (LDA) in machine learning.
Linear Discriminant Analysis is one of the most popular dimensionality reduction techniques
used for supervised classification problems in machine learning. It is also used as a pre-
processing step in machine learning and pattern classification applications.
Whenever there is a requirement to separate two or more classes with multiple features
efficiently, the LDA model is the most common technique used to solve such classification
problems. For example, suppose we have two classes with multiple features and need to
separate them efficiently; if we classify them using a single feature, the classes may overlap.
To overcome this overlapping issue in the classification process, we must keep increasing
the number of features.
Example:
Let's assume we have to classify two different classes having two sets of data points in a 2-
dimensional plane.
[Image: two classes of data points scattered in a 2-D plane.]
It may be impossible to draw a straight line in the 2-D plane that separates these data
points efficiently, but using Linear Discriminant Analysis we can reduce the 2-D plane to a
1-D line. Using this technique, we can also maximize the separability between multiple
classes.
Let's consider an example where we have two classes in a 2-D plane with an X-Y axis, and
we need to classify them efficiently. As we have already seen, LDA enables us to draw a
straight line that can completely separate the two classes of data points. Here, LDA uses the
X-Y data to create a new axis, separating the classes with a straight line and projecting the
data onto that new axis.
Hence, we can maximize the separation between these classes and reduce the 2-D plane
into 1-D.
To create the new axis, Linear Discriminant Analysis uses the following criteria:
o Maximize the distance between the means of the two classes.
o Minimize the variation (scatter) within each individual class.
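A hedged sketch of this 2-D to 1-D projection with scikit-learn follows; the two Gaussian
blobs are synthetic data invented purely for the illustration:

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
class0 = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))  # synthetic class 0
class1 = rng.normal(loc=[3.0, 2.0], scale=0.5, size=(50, 2))  # synthetic class 1
X = np.vstack([class0, class1])
y = np.array([0] * 50 + [1] * 50)

lda = LinearDiscriminantAnalysis(n_components=1)  # two classes -> at most 1 new axis
X_1d = lda.fit_transform(X, y)                    # project 2-D points onto the new axis
print(X_1d.shape)       # (100, 1): each point now lies on a single axis
print(lda.score(X, y))  # training accuracy; well-separated blobs give ~1.0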
Why LDA?
o Logistic Regression is one of the most popular classification algorithms. It performs
well for binary classification but falls short for multi-class classification problems
with well-separated classes, which LDA handles quite efficiently.
o LDA can also be used in data pre-processing to reduce the number of features, just
as PCA, which reduces the computing cost significantly.
o LDA is also used in face detection algorithms. In Fisherfaces, LDA is used to extract
useful data from different faces. Coupled with eigenfaces, it produces effective
results.
Drawbacks of Linear Discriminant Analysis (LDA)
Although LDA is specifically used to solve supervised classification problems for two or
more classes, which is not possible using logistic regression, it fails in cases where the
means of the distributions are shared. In such cases, LDA cannot create a new axis that
makes both classes linearly separable.
To overcome such problems, we use non-linear discriminant analysis in machine learning.
Extension to Linear Discriminant Analysis (LDA)
Linear Discriminant analysis is one of the most simple and effective methods to solve
classification problems in machine learning. It has so many extensions and variations as
follows:
1. Quadratic Discriminant Analysis (QDA): for multiple input variables, each class
deploys its own estimate of variance.
2. Flexible Discriminant Analysis (FDA): used when non-linear combinations of inputs
are involved, such as splines.
3. Regularized Discriminant Analysis (RDA): introduces regularization into the estimate
of the variance (actually covariance), moderating the influence of different variables
on LDA.
Real-world Applications of LDA
Some of the common real-world applications of Linear discriminant Analysis are given
below:
o Face Recognition
Face recognition is a popular application of computer vision, where each face is
represented as a combination of a large number of pixel values. In this case, LDA is used
to reduce the number of features to a manageable number before the classification
process. It generates a new template in which each dimension consists of a linear
combination of pixel values. If the linear combination is generated using Fisher's linear
discriminant, the result is called a Fisherface.
o Medical
In the medical field, LDA is used to classify a patient's disease as mild, moderate, or
severe on the basis of various parameters of the patient's health and the ongoing medical
treatment. This classification helps doctors in either increasing or decreasing the pace of
the treatment.
o Customer Identification
In customer identification, LDA is used to identify and select the features that
characterize the group of customers most likely to purchase a specific product, for
example in a shopping mall.
o For Predictions
LDA can also be used for making predictions, and hence in decision making. For example,
"will you buy this product?" yields a predicted result in one of two possible classes:
buying or not buying.
o In Learning
Nowadays, robots are being trained to learn and talk in order to simulate human
behaviour, which can also be treated as a classification problem. In this case, LDA builds
similar groups on the basis of different parameters, including pitch, frequency, sound,
tune, etc.
Difference between Linear Discriminant Analysis and PCA
Below are some basic differences between LDA and PCA:
o PCA is an unsupervised algorithm that does not care about classes and labels and
only aims to find the principal components to maximize the variance in the given
dataset. At the same time, LDA is a supervised algorithm that aims to find the linear
discriminants to represent the axes that maximize separation between different
classes of data.
o LDA is much more suitable for multi-class classification tasks than PCA. However,
PCA is assumed to perform comparatively well for small sample sizes.
o Both LDA and PCA are used as dimensionality reduction techniques; when they are
combined, PCA is typically applied first, followed by LDA.
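The contrast can be seen in code. This hedged sketch reduces the built-in iris dataset
(4 features, 3 classes) to two dimensions with each method; the dataset choice is
illustrative:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

X_pca = PCA(n_components=2).fit_transform(X)   # unsupervised: ignores the labels y
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)  # supervised: uses y
print(X_pca.shape, X_lda.shape)  # both map 4 features down to 2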