
UNIT 3

3.1 Classification in Machine Learning


Classification is a supervised learning task where the goal is to predict the category
or class of a given input data point. In classification problems, the output is a discrete
label, rather than a continuous value. Classification is widely used in various
applications such as spam detection, medical diagnosis, image recognition, and
sentiment analysis.
Key Concepts in Classification:
1. Supervised Learning:
o Classification is a type of supervised learning where the model is trained on labeled
data.
o The labeled data consists of input features (independent variables) and corresponding
class labels (dependent variable).
2. Class Labels:
o The class labels are discrete and can be either binary (e.g., spam/not spam) or multi-
class (e.g., types of animals).
3. Training and Testing:
o The dataset is divided into a training set and a testing set.
o The model is trained on the training set and evaluated on the testing set.
4. Classification Algorithms:
o Logistic Regression: Used for binary classification problems.
o Decision Trees: Can handle both binary and multi-class classification.
o Random Forest: An ensemble of decision trees, used for both binary and multi-class
classification.
o Support Vector Machines (SVM): Effective for high-dimensional spaces and can
handle both binary and multi-class classification.
o Naive Bayes: Based on the Bayes theorem, used for binary and multi-class
classification.
o K-Nearest Neighbors (KNN): A non-parametric algorithm used for both binary and
multi-class classification.
o Neural Networks: Can handle complex classification problems with non-linear
decision boundaries.
5. Evaluation Metrics:
o Accuracy: The proportion of correctly classified instances.
o Precision: The proportion of true positive instances among the predicted positive
instances.
o Recall (Sensitivity): The proportion of true positive instances among all actual
positive instances.
o F1 Score: The harmonic mean of precision and recall.
o Confusion Matrix: A table that summarizes the performance of a classification
model.
o ROC Curve: A graphical plot that illustrates the diagnostic ability of a binary
classifier.
o AUC (Area Under the Curve): A measure of the ROC curve's quality.
6. Overfitting and Underfitting:
o Overfitting: The model performs well on the training data but poorly on unseen data.
o Underfitting: The model fails to capture the underlying pattern in the data.
7. Regularization:
o Techniques like L1 and L2 regularization are used to prevent overfitting by penalizing
large weights.
8. Cross-Validation:
o Techniques like k-fold cross-validation are used to assess the model's performance
and generalize it to an independent dataset.
Example of Classification:
Suppose you have a dataset of patients with features like age, blood pressure, and
cholesterol levels, and the corresponding class label indicates whether the patient has
heart disease or not. A classification model can be trained on this dataset to predict
whether a new patient has heart disease based on their features.
3.1.1 Steps in Classification:
1. Data Collection: Gather a dataset with input features and corresponding class labels.
2. Data Preprocessing: Clean and preprocess the data (e.g., handle missing values, normalize
features).
3. Feature Selection: Select the most relevant features for the classification task.
4. Model Training: Train a classification model on the training data.
5. Model Evaluation: Evaluate the model's performance on the testing data using appropriate
metrics.
6. Hyperparameter Tuning: Fine-tune the model's hyperparameters to improve performance.
7. Deployment: Deploy the model for real-world predictions. (A minimal end-to-end sketch of these steps follows.)
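These steps can be tied together in a minimal sketch. The code below assumes scikit-learn is available and uses a synthetic dataset (made-up age, blood pressure, and cholesterol values) as a stand-in for the heart-disease example; it illustrates the workflow, not a prescribed implementation.

```python
# Minimal end-to-end classification sketch (synthetic, hypothetical data).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

rng = np.random.default_rng(0)

# Steps 1-2: data collection and preprocessing (synthetic age, blood pressure, cholesterol)
X = rng.normal(loc=[55, 130, 220], scale=[10, 15, 30], size=(500, 3))
y = (0.03 * X[:, 0] + 0.02 * X[:, 1] + 0.01 * X[:, 2] + rng.normal(0, 1, 500) > 6.5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler().fit(X_train)            # normalize the features
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Step 4: model training
model = LogisticRegression().fit(X_train, y_train)

# Step 5: model evaluation on the held-out test set
y_pred = model.predict(X_test)
print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("f1 score :", f1_score(y_test, y_pred))
```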
Conclusion:
Classification is a fundamental task in machine learning with a wide range of
applications. Understanding the different algorithms, evaluation metrics, and
techniques for improving model performance is crucial for effectively solving
classification problems.
3.2 K-Nearest Neighbor (KNN) Algorithm for Machine Learning
o K-Nearest Neighbour is one of the simplest Machine Learning algorithms based on
Supervised Learning technique.
o The K-NN algorithm assumes similarity between the new case/data and the available cases and puts the new case into the category that is most similar to the existing categories.
o The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can easily be classified into a well-suited category using the K-NN algorithm.
o The K-NN algorithm can be used for Regression as well as for Classification, but it is mostly used for Classification problems.
o K-NN is a non-parametric algorithm, which means it does not make any assumption on
underlying data.
o It is also called a lazy learner algorithm because it does not learn from the training set
immediately instead it stores the dataset and at the time of classification, it performs an action
on the dataset.
o KNN algorithm at the training phase just stores the dataset and when it gets new data, then it
classifies that data into a category that is much similar to the new data.
o Example: Suppose we have an image of a creature that looks similar to both a cat and a dog, and we want to know whether it is a cat or a dog. For this identification, we can use the KNN algorithm, as it works on a similarity measure. Our KNN model will compare the features of the new image with those of cat and dog images and, based on the most similar features, will put it in either the cat or the dog category.

3.2.1 Why do we need a K-NN Algorithm?


Suppose there are two categories, i.e., Category A and Category B, and we have a new data point x1. To which of these categories does this data point belong? To solve this type of problem, we need a K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a particular data point. Consider the below diagram:
3.2.2 How does K-NN work?
The K-NN working can be explained on the basis of the below algorithm:
o Step-1: Select the number K of the neighbors
o Step-2: Calculate the Euclidean distance of K number of neighbors
o Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
o Step-4: Among these k neighbors, count the number of the data points in each category.
o Step-5: Assign the new data points to that category for which the number of the neighbor is
maximum.
o Step-6: Our model is ready.
Suppose we have a new data point and we need to put it in the required category.
Consider the below image:
o Firstly, we will choose the number of neighbors, so we will choose the k=5.
o Next, we will calculate the Euclidean distance between the data points. The Euclidean distance is the distance between two points, which we have already studied in geometry. For two points (x1, y1) and (x2, y2) it can be calculated as:
d = √((x2 − x1)² + (y2 − y1)²)
o By calculating the Euclidean distances we get the nearest neighbours: three nearest neighbours in category A and two nearest neighbours in category B. Consider the below image:

o As we can see, the 3 nearest neighbours are from category A; hence this new data point must belong to category A. (A minimal code sketch of these steps follows.)
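A minimal sketch of these steps, assuming a small set of 2-D training points with hypothetical coordinates and k = 5:

```python
# Small K-NN sketch: compute Euclidean distances, take the k nearest
# neighbours, and assign the majority category.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    # Step 2: Euclidean distance from the new point to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step 3: take the k nearest neighbours
    nearest = np.argsort(distances)[:k]
    # Steps 4-5: count categories among the neighbours and pick the majority
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]

# toy data: category 'A' clustered near (1, 1), category 'B' near (4, 4)
X_train = np.array([[1, 1], [1, 2], [2, 1], [2, 2], [4, 4], [4, 5], [5, 4]])
y_train = np.array(['A', 'A', 'A', 'A', 'B', 'B', 'B'])

print(knn_predict(X_train, y_train, np.array([1.5, 1.5])))  # expected: 'A'
```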
How to select the value of K in the K-NN Algorithm?
Below are some points to remember while selecting the value of K in the K-NN
algorithm:
o There is no particular way to determine the best value for "K", so we need to try several values to find the best among them. A commonly used starting value for K is 5.
o A very low value for K, such as K=1 or K=2, can be noisy and make the model sensitive to outliers.
o Large values of K reduce the effect of noise, but they can blur the class boundaries and increase computation.
Advantages of KNN Algorithm:
o It is simple to implement.
o It is robust to noisy training data.
o It can be more effective if the training data is large.
Disadvantages of KNN Algorithm:
o The value of K always needs to be determined, which can be complex at times.
o The computation cost is high because the distance to every training sample must be calculated for each new data point.
3.3 Random Forest Algorithm
Random Forest is a popular machine learning algorithm that belongs to the supervised learning
technique. It can be used for both Classification and Regression problems in ML. It is based on the
concept of ensemble learning, which is a process of combining multiple classifiers to solve a
complex problem and to improve the performance of the model.

As the name suggests, "Random Forest is a classifier that contains a number of decision trees
on various subsets of the given dataset and takes the average to improve the predictive accuracy
of that dataset." Instead of relying on one decision tree, the random forest takes the prediction
from each tree and based on the majority votes of predictions, and it predicts the final output.

A greater number of trees in the forest leads to higher accuracy and helps prevent the problem of overfitting.

The below diagram explains the working of the Random Forest algorithm:

Note: To better understand the Random Forest Algorithm, you should have knowledge of the
Decision Tree Algorithm.

Assumptions for Random Forest

Since the random forest combines multiple trees to predict the class of the dataset, it is possible
that some decision trees may predict the correct output, while others may not. But together, all the
trees predict the correct output. Therefore, below are two assumptions for a better Random forest
classifier:
o There should be some actual values in the feature variable of the dataset so that the classifier
can predict accurate results rather than a guessed result.
o The predictions from each tree must have very low correlations.

3.3.1 Why use Random Forest?

Below are some points that explain why we should use the Random Forest algorithm:

o It takes less training time as compared to other algorithms.


o It predicts output with high accuracy; even for large datasets it runs efficiently.
o It can also maintain accuracy when a large proportion of data is missing.

3.3.2 How does Random Forest algorithm work?

Random Forest works in two phases: the first is to create the random forest by combining N decision trees, and the second is to make predictions using the trees created in the first phase.

The Working process can be explained in the below steps and diagram:

Step-1: Select random K data points from the training set.

Step-2: Build the decision trees associated with the selected data points (Subsets).

Step-3: Choose the number N for decision trees that you want to build.

Step-4: Repeat Step 1 & 2.

Step-5: For new data points, find the predictions of each decision tree, and assign the new data
points to the category that wins the majority votes.

The working of the algorithm can be better understood by the below example:

Example: Suppose there is a dataset that contains multiple fruit images. So, this dataset is given to
the Random forest classifier. The dataset is divided into subsets and given to each decision tree.
During the training phase, each decision tree produces a prediction result, and when a new data
point occurs, then based on the majority of results, the Random Forest classifier predicts the final
decision. Consider the below image:
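As a complement to the example above, here is a minimal sketch of the two-phase working, assuming scikit-learn; the Iris dataset is used only as a stand-in for the fruit-image dataset.

```python
# Random Forest sketch: phase 1 builds trees on bootstrap subsets,
# phase 2 aggregates their votes for the final prediction.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Phase 1: build N decision trees on random subsets of the data
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# Phase 2: each tree votes; the majority vote is the final prediction
print("test accuracy:", forest.score(X_test, y_test))
print("predicted class of first test sample:", forest.predict(X_test[:1])[0])
```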
Applications of Random Forest

There are mainly four sectors where Random Forest is mostly used:

1. Banking: Banking sector mostly uses this algorithm for the identification of loan risk.
2. Medicine: With the help of this algorithm, disease trends and risks of the disease can be
identified.
3. Land Use: We can identify the areas of similar land use by this algorithm.
4. Marketing: Marketing trends can be identified using this algorithm.

Advantages of Random Forest

o Random Forest is capable of performing both Classification and Regression tasks.


o It is capable of handling large datasets with high dimensionality.
o It enhances the accuracy of the model and prevents the overfitting issue.

Disadvantages of Random Forest

o Although random forest can be used for both classification and regression tasks, it is not as well suited to regression tasks.
Comparison: Random Forest vs. Other ML Algorithms

• Ensemble Approach
  Random Forest: Utilizes an ensemble of decision trees, combining their outputs for predictions, fostering robustness and accuracy.
  Other ML Algorithms: Typically rely on a single model (e.g., linear regression, support vector machine) without the ensemble approach, potentially leading to less resilience against noise.

• Overfitting Resistance
  Random Forest: Resistant to overfitting due to the aggregation of diverse decision trees, preventing memorization of training data.
  Other ML Algorithms: Some algorithms may be prone to overfitting, especially when dealing with complex datasets, as they may excessively adapt to training noise.

• Handling of Missing Data
  Random Forest: Exhibits resilience in handling missing values by leveraging available features for predictions, contributing to practicality in real-world scenarios.
  Other ML Algorithms: Other algorithms may require imputation or elimination of missing data, potentially impacting model training and performance.

• Variable Importance
  Random Forest: Provides a built-in mechanism for assessing variable importance, aiding in feature selection and interpretation of influential factors.
  Other ML Algorithms: Many algorithms may lack an explicit feature importance assessment, making it challenging to identify crucial variables for predictions.

• Parallelization Potential
  Random Forest: Capitalizes on parallelization, enabling the simultaneous training of decision trees, resulting in faster computation for large datasets.
  Other ML Algorithms: Some algorithms may have limited parallelization capabilities, potentially leading to longer training times for extensive datasets.

Key Features of Random Forest


Some of the key features of Random Forest are discussed below:
1. High Predictive Accuracy: Imagine Random Forest as a team of decision-making wizards.
Each wizard (decision tree) looks at a part of the problem, and together, they weave their
insights into a powerful prediction tapestry. This teamwork often results in a more accurate
model than what a single wizard could achieve.
2. Resistance to Overfitting: Random Forest is like a cool-headed mentor guiding its
apprentices (decision trees). Instead of letting each apprentice memorize every detail of their
training, it encourages a more well-rounded understanding. This approach helps prevent
getting too caught up with the training data which makes the model less prone to overfitting.
3. Large Datasets Handling: Dealing with a mountain of data? Random Forest tackles it like a
seasoned explorer with a team of helpers (decision trees). Each helper takes on a part of the
dataset, ensuring that the expedition is not only thorough but also surprisingly quick.
4. Variable Importance Assessment: Think of Random Forest as a detective at a crime scene,
figuring out which clues (features) matter the most. It assesses the importance of each clue in
solving the case, helping you focus on the key elements that drive predictions.
5. Built-in Cross-Validation: Random Forest is like having a personal coach that keeps you in
check. As it trains each decision tree, it also sets aside a secret group of cases (out-of-bag) for
testing. This built-in validation ensures your model doesn’t just ace the training but also
performs well on new challenges.
6. Handling Missing Values: Life is full of uncertainties, just like datasets with missing values.
Random Forest is the friend who adapts to the situation, making predictions using the
information available. It doesn’t get flustered by missing pieces; instead, it focuses on what it
can confidently tell us.
7. Parallelization for Speed: Random Forest is your time-saving buddy. Picture each decision
tree as a worker tackling a piece of a puzzle simultaneously. This parallel approach taps into
the power of modern tech, making the whole process faster and more efficient for handling
large-scale projects.
Random Forest vs. Other Machine Learning Algorithms
Some of the key differences are summarized in the comparison above.
Applications of Random Forest in Real-World Scenarios
Some of the widely used real-world applications of Random Forest are discussed below:
1. Finance Wizard: Imagine Random Forest as our financial superhero, diving into the world of
credit scoring. Its mission? To determine if you’re a credit superhero or, well, not so much.
With a knack for handling financial data and sidestepping overfitting issues, it’s like having a
guardian angel for robust risk assessments.
2. Health Detective: In healthcare, Random Forest turns into a medical Sherlock Holmes.
Armed with the ability to decode medical jargon, patient records, and test results, it’s not just
predicting outcomes; it’s practically assisting doctors in solving the mysteries of patient
health.
3. Environmental Guardian: Out in nature, Random Forest transforms into an environmental
superhero. With the power to decipher satellite images and brave noisy data, it becomes the
go-to hero for tasks like tracking land cover changes and safeguarding against potential
deforestation, standing as the protector of our green spaces.
4. Digital Bodyguard: In the digital realm, Random Forest becomes our vigilant guardian
against online trickery. It’s like a cyber-sleuth, analyzing our digital footsteps for any hint of
suspicious activity. Its ensemble approach is akin to having a team of cyber-detectives,
spotting subtle deviations that scream “fraud alert!” It’s not just protecting our online
transactions; it’s our digital bodyguard.
Preparing Data for Random Forest Modeling
For Random Forest modeling, some key steps of data preparation are discussed below (a short preprocessing sketch follows this list):
• Handling Missing Values: Begin by addressing any missing values in the dataset.
Techniques like imputation or removal of instances with missing values ensure a complete
and reliable input for Random Forest.
• Encoding Categorical Variables: Random Forest requires numerical inputs, so categorical
variables need to be encoded. Techniques like one-hot encoding or label encoding transform
categorical features into a format suitable for the algorithm.
• Scaling and Normalization: While Random Forest is not sensitive to feature scaling,
normalizing numerical features can still contribute to a more efficient training process and
improved convergence.
• Feature Selection: Assess the importance of features within the dataset. Random Forest
inherently provides a feature importance score, aiding in the selection of relevant features for
model training.
• Addressing Imbalanced Data: If dealing with imbalanced classes, implement techniques
like adjusting class weights or employing resampling methods to ensure a balanced
representation during training.
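One possible way to wire these preparation steps together is sketched below, assuming scikit-learn and pandas; the column names and values are hypothetical.

```python
# Preprocessing sketch: impute missing values, one-hot encode categoricals,
# and address class imbalance via class weights, all feeding a Random Forest.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    "age":    [25, 32, np.nan, 45, 51],
    "income": [40_000, 52_000, 61_000, np.nan, 80_000],
    "city":   ["Delhi", "Mumbai", "Delhi", np.nan, "Chennai"],
    "label":  [0, 1, 0, 1, 1],
})

numeric, categorical = ["age", "income"], ["city"]

preprocess = ColumnTransformer([
    # handle missing numeric values by imputing the median
    ("num", SimpleImputer(strategy="median"), numeric),
    # impute the most frequent category, then one-hot encode
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
])

model = Pipeline([
    ("prep", preprocess),
    # class_weight="balanced" helps with imbalanced classes
    ("rf", RandomForestClassifier(n_estimators=50, class_weight="balanced", random_state=0)),
])

model.fit(df[numeric + categorical], df["label"])
print(model.predict(df[numeric + categorical]))
```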

Overcoming Challenges in Random Forest Modeling


To use the Random Forest algorithm efficiently in real-world applications, we need to overcome some potential challenges, which are discussed below:
• Addressing Overfitting: Taming the tendency of individual decision trees to overfit remains
a challenge. Strategies like tuning hyperparameters, adjusting tree depth and implementing
feature selection techniques are crucial for striking the right balance between complexity and
generalization.
• Optimizing Computational Resources: Random Forest’s efficiency in handling large
datasets can sometimes be a double-edged sword, demanding substantial computational
resources. Implementing parallelization techniques and exploring optimized algorithms are
key steps in overcoming computational challenges and ensuring scalability.
• Dealing with imbalanced data: When imbalanced datasets are encountered, where one class is significantly more frequent than the other, the random forest may skew toward the majority class. Reducing this bias involves strategies like adjusting class weights, oversampling the minority class, or using specialized algorithms for imbalanced situations.
• Interpreting complex models: Although random forests provide strong predictions, interpretation of the model's decision process can be complicated by its ensemble nature. Methods such as feature importance analysis, partial dependence plots, and model-agnostic interpretability methods are used to improve model interpretation.
• Handling noisy data: The resilience of random forests to noisy data is a strength, but can still
be a challenge in high-noise situations. Careful data preprocessing techniques, outlier
identification, and feature engineering are required to ensure model accuracy and reliability.
• Managing Memory Usage: As Random Forest constructs numerous decision trees during
training, managing memory usage becomes critical. Fine-tuning parameters like the number
of trees, tree depth, and the size of feature subsets can help strike a balance between model
performance and memory efficiency.

3.4 SUPPORT VECTOR MACHINE


Support vector machine is a supervised learning method used for classification and regression problems. It is widely favoured because it produces notable accuracy with less computation power. It is mostly used in classification problems. We have three types of learning: supervised, unsupervised, and reinforcement learning. A support vector machine is a discriminative classifier formally defined by a separating hyperplane. Given labeled training data, the algorithm outputs the best hyperplane, which classifies new examples. In two-dimensional space, this hyperplane is a line splitting the plane into two parts, with each class lying on either side. The intention of the support vector machine algorithm is to find a hyperplane in an N-dimensional space that distinctly classifies the data points.
Support Vector Machine (SVM) is a supervised machine learning algorithm that can be used for
classification and regression tasks. The main idea behind SVM is to find the best boundary (or
hyperplane) that separates the data into different classes.
In the case of classification, an SVM algorithm finds the best boundary that separates the data into
different classes. The boundary is chosen in such a way that it maximizes the margin, which is the
distance between the boundary and the closest data points from each class. These closest data points
are called support vectors.
SVMs can also be used for non-linear classification by using a technique called the kernel trick. The
kernel trick maps the input data into a higher-dimensional space where the data becomes linearly
separable. Common kernels include the radial basis function (RBF) and the polynomial kernel.
SVMs can also be used for regression tasks by allowing for some of the data points to be within the
margin, rather than on the boundary. This allows for a more flexible boundary and can lead to better
predictions.
SVMs have several advantages, such as the ability to handle high-dimensional data and the ability to
perform well with small datasets. They also have the ability to model non-linear decision boundaries,
which can be very useful in many applications. However, SVMs can be sensitive to the choice of
kernel, and they can be computationally expensive when the dataset is large.
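As a brief illustration of the ideas above, the sketch below assumes scikit-learn and fits both a linear and an RBF-kernel SVM (the kernel trick) on a small synthetic dataset.

```python
# SVM sketch: the classifier finds the maximum-margin boundary; the support
# vectors are the training points closest to that boundary.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_svm = SVC(kernel="linear", C=1.0).fit(X_train, y_train)
rbf_svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)  # kernel trick

print("linear kernel accuracy:", linear_svm.score(X_test, y_test))
print("RBF kernel accuracy   :", rbf_svm.score(X_test, y_test))
print("support vectors per class (linear):", linear_svm.n_support_)
```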
Advantages of support vector machine:
• Support vector machine works comparatively well when there is a clear margin of separation between classes.
• It is effective in high-dimensional spaces.
• It is effective in cases where the number of dimensions is greater than the number of samples.
• Support vector machine is relatively memory efficient. Support Vector Machine (SVM) is a powerful supervised machine learning algorithm with several advantages. Some of the main advantages of SVM include:
• Handling high-dimensional data: SVMs are effective in handling high-dimensional data,
which is common in many applications such as image and text classification.
• Handling small datasets: SVMs can perform well with small datasets, as they only require
a small number of support vectors to define the boundary.
• Modeling non-linear decision boundaries: SVMs can model non-linear decision
boundaries by using the kernel trick, which maps the data into a higher-dimensional space
where the data becomes linearly separable.
• Robustness to noise: SVMs are robust to noise in the data, as the decision boundary is
determined by the support vectors, which are the closest data points to the boundary.
• Generalization: SVMs have good generalization performance, which means that they are
able to classify new, unseen data well.
• Versatility: SVMs can be used for both classification and regression tasks, and it can be
applied to a wide range of applications such as natural language processing, computer
vision, and bioinformatics.
• Sparse solution: SVMs have sparse solutions, which means that they only use a subset of
the training data to make predictions. This makes the algorithm more efficient and less
prone to overfitting.
• Regularization: SVMs can be regularized, which means that the algorithm can be
modified to avoid overfitting.
Disadvantages of support vector machine:
• The support vector machine algorithm is not well suited for large datasets.
• It does not perform very well when the dataset has more noise, i.e. when the target classes are overlapping.
• In cases where the number of features for each data point exceeds the number of training samples, the support vector machine will underperform.
• As the support vector classifier works by placing data points above and below the classifying hyperplane, there is no probabilistic explanation for the classification. Support Vector Machine (SVM) is a powerful supervised machine learning algorithm, but it also has some limitations and disadvantages. Some of the main disadvantages of SVM include:
• Computationally expensive: SVMs can be computationally expensive for large datasets,
as the algorithm requires solving a quadratic optimization problem.
• Choice of kernel: The choice of kernel can greatly affect the performance of an SVM,
and it can be difficult to determine the best kernel for a given dataset.
• Sensitivity to the choice of parameters: SVMs can be sensitive to the choice of
parameters, such as the regularization parameter, and it can be difficult to determine the
optimal parameter values for a given dataset.
• Memory-intensive: SVMs can be memory-intensive, as the algorithm requires storing the
kernel matrix, which can be large for large datasets.
• Limited to two-class problems: SVMs are primarily used for two-class problems,
although multi-class problems can be solved by using one-versus-one or one-versus-all
strategies.
• Lack of probabilistic interpretation: SVMs do not provide a probabilistic interpretation
of the decision boundary, which can be a disadvantage in some applications.
• Not suitable for large datasets with many features: SVMs can be very slow and can
consume a lot of memory when the dataset has many features.
• Not suitable for datasets with missing values: SVMs requires complete datasets, with no
missing values, it can not handle missing values.
Applications of support vector machine:
1. Face detection – It is used for detecting faces according to the classifier and model.
2. Text and hypertext categorization – The categorization technique is used to find important or required information for organizing text.
3. Classification of images – It is also used for grouping images by comparing pieces of information and acting accordingly.
4. Bioinformatics – It is also used in medical science, for example in laboratory and DNA research.
5. Handwriting recognition – It is used for handwriting recognition.
6. Protein fold and remote homology detection – It is used for classifying proteins into functional and structural classes given their amino acid sequences. It is one of the problems in bioinformatics.
7. Generalized predictive control (GPC) – It is also used for generalized predictive control, which relies on predictive control using a multilayer feed-forward network as the plant's linear model.
8. Support Vector Machine (SVM) is a type of supervised machine learning algorithm that
can be used for classification and regression tasks. The idea behind SVM is to find the
best boundary (or hyperplane) that separates the different classes of data. This boundary
is chosen in such a way that it maximizes the margin, which is the distance between the
boundary and the closest data points from each class, also known as support vectors.
9. In the case of classification, the goal is to find a boundary that separates the different
classes of data as well as possible. The input data is plotted in a high-dimensional space
(with as many dimensions as the number of features), and the SVM algorithm finds the
best boundary that separates the classes.
10. In the case of regression, the goal is to find the best hyperplane that fits the data. Similar
to classification, the data is plotted in a high-dimensional space, but instead of trying to
separate the classes, the algorithm is trying to fit the data with the best hyperplane.
11. One of the main advantages of SVM is that it works well in high-dimensional spaces and it is relatively memory efficient. It is also able to handle non-linearly separable data by transforming it into a higher-dimensional space where it becomes linearly separable; this is done by using the kernel trick.
12. SVMs are not limited to linear boundaries; they can also handle non-linear boundaries by using kernel functions that map the input data into a higher-dimensional space where it becomes linearly separable. The most commonly used kernel functions are linear, polynomial and radial basis functions (RBF).
13. SVMs are popular in various applications such as image classification, natural language
processing, bioinformatics, and more.
14. Facial Expression Classification – Support vector machines (SVMs) is a binary
classification technique. The face Expression Classification model determines the precise
face expression by modeling differences between two facial images. Validation
techniques include the leave-one-out methods and the K-fold test methods.
15. Speech Recognition – The transcription of speech into text is called speech
recognition. Mel Frequency Cepstral Coefficients (MFCC)-based features are used to
train Support Vector Machines (SVM), which are used for figuring out speech. Speech
recognition is a challenging classification problem that is categorized using a variety of
mathematical techniques, including support vector machines, pattern recognition
techniques, etc.

Sample Question: Write a short note on any two: (Paper 2023)


a. Support Vector machine (SVM)
b. Given a data set and set of machine learning algorithms, how to choose an
appropriate algorithm.
c. Logistic Regression
(a) Support Vector Machine (SVM): - SVM is one of the most popular Supervised Learning
algorithms, which is used for Classification as well as Regression problems. However, primarily, it
is used for Classification problems in Machine Learning.
The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes so that we can easily put the new data point
in the correct category in the future. This best decision boundary is called a hyperplane.
The dimension of the hyperplane depends upon the number of features. If the number of
input features is two, then the hyperplane is just a line. If the number of input features is
three, then the hyperplane becomes a 2-D plane. SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed Support Vector Machine.
The followings are important concepts in SVM −
• Support Vectors − Data points that are closest to the hyperplane are called support vectors. The separating line will be defined with the help of these data points.
• Hyperplane − As we can see in the above diagram, it is a decision plane or space that separates a set of objects having different classes.

Supervised Learning Algorithm | Unsupervised Learning Algorithm | Reinforcement Learning Algorithm
Regression | K-Means Clustering | Monte Carlo
Random Forest | KNN | Markov Decision Process
Decision Tree | SVD | Q-Learning

• Margin − It may be defined as the gap between two lines at the closest data points of different classes. It can be calculated as the perpendicular distance from the line to the support vectors. A large margin is considered a good margin and a small margin is considered a bad margin.
SVM can be of two types:
• Linear SVM: Linear SVM is used for linearly separable data, which means if a dataset can
be classified into two classes by using a single straight line, then such data is termed as
linearly separable data, and classifier is used called as Linear SVM classifier.
• Non-linear SVM: Non-Linear SVM is used for non-linearly separated data, which means if a
dataset cannot be classified by using a straight line, then such data is termed as non-linear
data and classifier used is called as Non-linear SVM classifier.
The main goal of SVM is to divide the datasets into classes by finding a maximum margin hyperplane (MMH), and it can be done in the following two steps −
• First, SVM will generate hyperplanes iteratively that segregate the classes in the best way.
• Then, it will choose the hyperplane that separates the classes correctly.
SVMs are used in text categorization, image classification, handwriting recognition and in the sciences.
(b) Given a data set and a set of machine learning algorithms, how to choose an appropriate algorithm.
The machine learning algorithm can be broadly classified into three groups.
1. Supervised learning model: Trained with the dataset which consists of labels for both Input
and output.
2. Unsupervised learning model: Trained with the dataset that has input features but does not
have labels for the output.
3. Reinforcement learning model: Trains itself based on a set of actions and makes the decision.
The factors we need to consider while categorizing and solving the problem are:
• Knowledge of Data: The data’s structure and complexity help dictate the right algorithm
• Accuracy Requirements: Different questions demand different degrees of accuracy, which
influences algorithm selection
• Processing Speed: Algorithm choice may depend on the time constraints in place for a given
analysis
• Variables: The unique features considered while training the model for the optimal result
and accuracy help determine the right algorithm.
• Parameters: Factors such as the number of iterations directly relate to the training time needed when generating output.
The first step is to analyze the data and observe the patterns and any hidden insights by visualizing the data. The insights from data visualization will help in making an initial decision on which algorithm to choose for solving the given problem.
Another important factor in deciding the algorithm is speed and accuracy. There is a tradeoff between speed and accuracy. If accuracy is not vital, perhaps when estimating a value, we can reduce the processing time, thus increasing the execution speed.
Another factor that plays an important role in choosing the correct algorithm is the
number and types of parameters we pass while training the model. These parameters may
include iteration cycle, splitting train and test datasets, error tolerance, and more. The
training time for a given model is directly proportional to the number of parameters
included. When multiple parameters are required to train the model, SVM (Support
Vector Machine) algorithms work well.
Answer (c) Logistic Regression
• Logistic regression is one of the most popular Machine Learning algorithms, which comes
under the Supervised Learning technique. It is used for predicting the categorical dependent
variable using a given set of independent variables.
• Logistic regression predicts the output of a categorical dependent variable. Therefore the
outcome must be a categorical or discrete value. It can be either Yes or No, 0 or 1, true or
False, etc. but instead of giving the exact value as 0 and 1, it gives the probabilistic values
which lie between 0 and 1.
• Logistic Regression is very similar to Linear Regression except for how they are used.
Linear Regression is used for solving Regression problems, whereas Logistic regression is
used for solving the classification problems.
• In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic
function, which predicts two maximum values (0 or 1).
• The curve from the logistic function indicates the likelihood of something such as whether
the cells are cancerous or not, a mouse is obese or not based on its weight, etc.
• Logistic Regression is a significant machine learning algorithm because it has the ability to
provide probabilities and classify new data using continuous and discrete datasets.
• Logistic Regression can be used to classify the observations using different types of data and
can easily determine the most effective variables used for the classification. The value of the
logistic regression must be between 0 and 1, which cannot go beyond this limit, so it forms a
curve like the "S" form. The S-form curve is called the Sigmoid function or the logistic
function.
On the basis of the categories, Logistic Regression can be classified into three types (a minimal sketch of the sigmoid function follows this list):
• Binomial: In binomial Logistic regression, there can be only two possible types of the
dependent variables, such as 0 or 1, Pass or Fail, etc.
• Multinomial: In multinomial Logistic regression, there can be 3 or more possible unordered
types of the dependent variable, such as "cat", "dogs", or "sheep"
• Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types of
dependent variables, such as "low", "Medium", or "High".
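A minimal sketch of the "S"-shaped sigmoid (logistic) function described above; the weight and bias values are hypothetical and chosen only for illustration.

```python
# Sigmoid sketch: map a real-valued linear score to a probability in (0, 1),
# then threshold it at 0.5 to obtain a binary class label.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b = 1.5, -3.0                          # hypothetical weight and bias
for x in [0.0, 1.0, 2.0, 3.0, 4.0]:
    p = sigmoid(w * x + b)                # probability that y = 1
    label = 1 if p >= 0.5 else 0          # threshold the probability at 0.5
    print(f"x={x:.1f}  P(y=1)={p:.3f}  predicted class={label}")
```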

3.5 Decision Tree in Machine Learning


A decision tree in machine learning is a versatile, interpretable algorithm used for predictive
modelling. It structures decisions based on input data, making it suitable for both classification and
regression tasks. This article delves into the components, terminologies, construction, and advantages
of decision trees, exploring their applications and learning algorithms.
Disadvantages of Decision Trees
• Overfitting: Decision trees can easily overfit the training data,
especially if they are deep with many nodes.
• Instability: Small variations in the data can result in a completely
different tree being generated.
• Bias towards Features with More Levels: Features with more levels
can dominate the tree structure.

Question: Explain the decision tree algorithm with an example. (Paper 2023)

Decision tree is a technique for data classification in data mining. As an


example, we can consider a decision on whether the conditions are
suitable to play or not. The various attributes which can be used to
make the decision are Outlook, Temperature, Humidity, Windy as
shown in the figure 3.1.

Figure 3.1 Dataset for deciding whether to play or not


As we can see in our dataset, 14 instances are given to make the decision.
The nodes for each attribute can be created. The possibility of playing
or not with respect to all the attributes is represented in the figure 3.2. If
we see the attribute “outlook”, it has the value “sunny” in the dataset 5
times, out of which 2 result in ‘yes’ for outcome “Play” and 3 result in
‘no’ for outcome “Play”. We have to make a decision on which node to
be used as root node. Different measures are used to make this decision.
Figure 3.2 All possibilities of playing or not with respect to (a) Outlook,
(b) Temperature, (c) Humidity & (d) Windy
We have to make a decision on which node to be used as root node. We
take entropy as a measures to make this decision.
ID3 algorithm
ID3 (Iterative Dichotomiser) uses the concept of information gain to
make the decision. Information gain is based on entropy (information
value), whose unit is bits.
Entropy
For any element X, entropy can be calculated using the formula:
Entropy (P1, P2) = -(P1*log2P1)-(P2*log2P2)
= -(P1log2P1) – (1-P1)log2(1-P1)
where P1 and P2 denote the fractions of X belonging to the first and second class.
Entropy of outlook = Entropy(9/14, 5/14) = -(9/14)log2(9/14) - (5/14)log2(5/14) = 0.940 bits
Entropy of sunny = Entropy(2/5, 3/5) = -(2/5)log2(2/5) - (3/5)log2(3/5) = 0.971 bits
Information Gain
Information Gain = Entropy of parent node – Entropy of child nodes
Info-gain of outlook = entropy of outlook – [ (5/14)*entropy of sunny + (4/14)*entropy of overcast + (5/14)*entropy of rainy ]
= 0.940 – [ (5/14)*0.971 + (4/14)*0 + (5/14)*0.971 ]
= 0.940 – 0.693 = 0.247
For the given dataset:
gain(outlook) = 0.247 bits,
gain(temperature) = 0.029 bits,
gain(humidity) = 0.152 bits,
gain(windy) = 0.048 bits.
Therefore, we select “outlook” with maximum gain value as the root of
the tree.
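The entropy and information-gain numbers above can be reproduced with a short sketch, assuming the standard 14-instance play/not-play counts (9 'yes', 5 'no'; outlook = sunny, overcast, rainy).

```python
# Reproduce entropy(dataset), entropy(sunny) and gain(outlook) from the text.
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

parent = entropy([9, 5])                                  # entropy of the full dataset
sunny, overcast, rainy = entropy([2, 3]), entropy([4, 0]), entropy([3, 2])

gain_outlook = parent - (5/14) * sunny - (4/14) * overcast - (5/14) * rainy
print(f"entropy(dataset) = {parent:.3f}")                 # ~0.940
print(f"entropy(sunny)   = {sunny:.3f}")                  # ~0.971
print(f"gain(outlook)    = {gain_outlook:.3f}")           # ~0.247
```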

Figure 3.3 Possibilities of selection of next node after Sunny


Now we have 3 branches for this root (sunny, overcast, rainy) as shown
in figure 3.3. We have to select next node for each of these branches.
Each branch can point to either nodes temperature, humidity or windy.

Calculating the gain for each attribute within the branch "sunny":

gain(temperature) = 0.571 bits
gain(humidity) = 0.971 bits
gain(windy) = 0.020 bits

Hence, next node for branch sunny is “humidity”.


The process terminates when all leaf nodes are pure, i.e. contain either all 'yes' or all 'no' instances.

3.6 Iterative Dichotomiser3 Algorithm


ID3 or Iterative Dichotomiser3 Algorithm is used in machine learning for
building decision trees from a given dataset. It was developed in 1986 by Ross
Quinlan. It is a greedy algorithm that builds a decision tree by recursively
partitioning the data set into smaller and smaller subsets until all data points in each
subset belong to the same class. It employs a top-down approach, recursively
selecting features to split the dataset based on information gain.
The ID3 (Iterative Dichotomiser 3) algorithm is a classic decision tree algorithm used for classification tasks. ID3 deals primarily with categorical attributes, which means that it can efficiently handle features with a discrete set of values. This property makes it suitable for problems where the input features are categorical rather than continuous. One of the strengths of ID3 is its ability to generate interpretable decision trees. The resulting tree structure is easily understood and visualized, providing insight into the decision-making process. However, ID3 can be sensitive to noisy data and prone to overfitting, capturing details in the training data that may not generalize to new, unseen data.
How the ID3 algorithm works
The ID3 algorithm works by building a decision tree, which is a hierarchical
structure that classifies data points into different categories and splits the dataset
into smaller subsets based on the values of the features in the dataset. The ID3
algorithm then selects the feature that provides the most information about the
target variable. The decision tree is built top-down, starting with the root node,
which represents the entire dataset. At each node, the ID3 algorithm selects the
attribute that provides the most information gain about the target variable. The
attribute with the highest information gain is the one that best separates the data
points into different categories.
ID3 metrics
The ID3 algorithm utilizes metrics related to information theory, particularly
entropy and information gain, to make decisions during the tree-building process.
Information Gain and Attribute Selection
The ID3 algorithm uses a measure of impurity, such as entropy or Gini impurity,
to calculate the information gain of each attribute. Entropy is a measure of
disorder in a dataset. A dataset with high entropy is a dataset where the data
points are evenly distributed across the different categories. A dataset with low
entropy is a dataset where the data points are concentrated in one or a few
categories.
H(S) = Σ −Pi · log2(Pi)
• where Pi represents the fraction of the samples in S belonging to class i.
• S – the current dataset.
• i – the set of classes in S.
If entropy is low, data is well understood; if high, more information is needed.
Preprocessing data before using ID3 can enhance accuracy. In sum, ID3 seeks to
reduce uncertainty and make informed decisions by picking attributes that offer
the most insight in a dataset.
Information gain assesses how much valuable information an attribute can
provide. We select the attribute with the highest information gain, which signifies
its potential to contribute the most to understanding the data. If information gain
is high, it implies that the attribute offers a significant insight. ID3 acts like an
investigator, making choices that maximize the information gain in each step.
This approach aims to minimize uncertainty and make well-informed decisions,
which can be further enhanced by preprocessing the data.
IG(S, A) = H(S) – Σv ( |Sv| / |S| ) × H(Sv)
• where |S| is the total number of instances in the dataset.
• |Sv| is the number of instances for which attribute A has value v.
• H(S) is the entropy of the dataset.

3.6.1 What are the steps in the ID3 algorithm?


1. Determine the entropy of the overall dataset using the class distribution.
2. For each feature:
• Calculate the entropy for its categorical values.
• Assess the information gain for each unique categorical value of the feature.
3. Choose the feature that generates the highest information gain.
4. Iteratively apply all the above steps to build the decision tree structure (see the sketch after this list).
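A compact sketch of these steps is given below; it is a simplified illustration (not the full ID3 specification) and uses a tiny hypothetical dataset.

```python
# Recursive ID3 sketch: pick the attribute with the highest information gain,
# split on it, and recurse until each subset is pure.
from math import log2
from collections import Counter

def entropy(rows, target):
    counts = Counter(r[target] for r in rows)
    total = len(rows)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def info_gain(rows, attr, target):
    total = len(rows)
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r for r in rows if r[attr] == value]
        remainder += len(subset) / total * entropy(subset, target)
    return entropy(rows, target) - remainder

def id3(rows, attrs, target):
    labels = {r[target] for r in rows}
    if len(labels) == 1:                      # pure node -> leaf
        return labels.pop()
    if not attrs:                             # no attributes left -> majority leaf
        return Counter(r[target] for r in rows).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, a, target))
    tree = {best: {}}
    for value in {r[best] for r in rows}:
        subset = [r for r in rows if r[best] == value]
        tree[best][value] = id3(subset, [a for a in attrs if a != best], target)
    return tree

data = [
    {"outlook": "sunny", "windy": "false", "play": "no"},
    {"outlook": "sunny", "windy": "true", "play": "no"},
    {"outlook": "overcast", "windy": "false", "play": "yes"},
    {"outlook": "rainy", "windy": "false", "play": "yes"},
    {"outlook": "rainy", "windy": "true", "play": "no"},
]
print(id3(data, ["outlook", "windy"], "play"))
```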
Advantages of ID3
• Simple and easy to understand.
• Requires little training data.
• Works well with data that has discrete attributes; continuous attributes must first be discretized.
Disadvantages of ID3
• Can lead to overfitting.
• May not be effective with data with many attributes.
Applications of ID3
1. Fraud detection: ID3 can be used to develop models that can detect
fraudulent transactions or activities.
2. Medical diagnosis: ID3 can be used to develop models that can diagnose
diseases or medical conditions.
3. Customer segmentation: ID3 can be used to segment customers into
different groups based on their demographics, purchase history, or other
factors.
4. Risk assessment: ID3 can be used to assess risk in a variety of different
areas, such as insurance, finance, and healthcare.
5. Recommendation systems: ID3 can be used to develop recommendation
systems that can recommend products, services, or content to users based
on their past behavior or preferences.
• Amazon uses ID3 to recommend products to its customers.
• Netflix uses ID3 to recommend movies and TV shows to its users.
• Spotify uses ID3 to recommend songs and playlists to its users.
• LendingClub uses ID3 to assess the risk of approving loans to
borrowers.
• Healthcare organizations use ID3 to diagnose diseases, predict
patient outcomes, and develop personalized treatment plans.
3.7 Bayes theorem
Bayes' theorem is a fundamental concept in probability theory that plays a crucial role in
various machine learning algorithms, especially in the fields of Bayesian statistics and
probabilistic modelling. It provides a way to update probabilities based on new evidence or
information. In the context of machine learning, Bayes' theorem is often used in Bayesian
inference and probabilistic models.
The theorem can be mathematically expressed as:
P(A|B) = ( P(B|A) × P(A) ) / P(B)
where
• P(A∣B) is the posterior probability of event A given event B.
• P(B∣A) is the likelihood of event B given event A.
• P(A) is the prior probability of event A.
• P(B) is the total probability of event B.
In the context of modeling hypotheses, Bayes' theorem allows us to infer our belief in a
hypothesis based on new data. We start with a prior belief in the hypothesis, represented
by P(A), and then update this belief based on how likely the data are to be observed under
the hypothesis, represented by P(B∣A). The posterior probability P(A∣B) represents our
updated belief in the hypothesis after considering the data.
Key Terms Related to Bayes Theorem
1. Likelihood(P(B∣A)):
• Represents the probability of observing the given evidence (features)
given that the class is true.
• In the Naive Bayes algorithm, a key assumption is that features are conditionally independent given the class label, which greatly simplifies computing this likelihood. Naive Bayes also works well with discrete features.
2. Prior Probability (P(A)):
• In machine learning, this represents the probability of a particular
class before considering any features.
• It is estimated from the training data.
3. Evidence Probability( P(B) ):
• This is the probability of observing the given evidence (features).
• It serves as a normalization factor and is often calculated as the sum
of the joint probabilities over all possible classes.
4. Posterior Probability( P(A∣B) ):
• This is the updated probability of the class given the observed
features.
• It is what we are trying to predict or infer in a classification task.
Now, to utilise this in terms of machine learning we use the Naive Bayes Classifier but in
order to understand how precisely this classifier works we must first understand the maths
behind it.
Applications of Bayes Theorem in Machine learning
1. Naive Bayes Classifier
The Naive Bayes classifier is a simple probabilistic classifier based on applying Bayes'
theorem with a strong (naive) independence assumption between the features. It is widely
used for text classification, spam filtering, and other tasks involving high-dimensional data.
Despite its simplicity, the Naive Bayes classifier often performs well in practice and is
computationally efficient.
How it works?
• Assumption of Independence: The "naive" assumption in Naive Bayes is that
the presence of a particular feature in a class is independent of the presence of
any other feature, given the class. This is a strong assumption and may not hold
true in real-world data, but it simplifies the calculation and often works well in
practice.
• Calculating Class Probabilities: Given a set of features x1, x2, ..., xn, the Naive Bayes classifier calculates the probability of each class Ck given the features using Bayes' theorem:
P(Ck | x1, x2, ..., xn) = P(Ck) × P(x1 | Ck) × P(x2 | Ck) × ... × P(xn | Ck) / P(x1, x2, ..., xn)
o the denominator P(x1,x2,...,xn) is the same for all classes and can be ignored
for the purpose of comparison.
• Classification Decision: The classifier selects the class Ck with the highest probability as the predicted class for the given set of features. (A minimal code sketch follows.)
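A minimal sketch of this decision rule, assuming scikit-learn's Gaussian Naive Bayes and using the Wine dataset purely for illustration:

```python
# Naive Bayes sketch: the predicted class is the one with the highest
# posterior probability P(Ck | x).
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

nb = GaussianNB().fit(X_train, y_train)

print("test accuracy:", nb.score(X_test, y_test))
# posterior P(Ck | x) for the first test sample; the prediction is the argmax
print("posteriors     :", nb.predict_proba(X_test[:1]).round(3))
print("predicted class:", nb.predict(X_test[:1])[0])
```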
2. Bayes optimal classifier
The Bayes optimal classifier is a theoretical concept in machine learning that
represents the best possible classifier for a given problem. It is based on Bayes'
theorem, which describes how to update probabilities based on new evidence.
In the context of classification, the Bayes optimal classifier assigns the class label that
has the highest posterior probability given the input features. Mathematically, this can
be expressed as:
ŷ = argmax_y P(y | x)
where ŷ is the predicted class label, y is a class label, x is the input feature vector,
and P(y∣x) is the posterior probability of class y given the input features.
3. Bayesian Optimization
Bayesian optimization is a powerful technique for global optimization of expensive-
to-evaluate functions. To choose which point to assess next, a probabilistic model of
the objective function—typically based on a Gaussian process—is constructed.
Bayesian optimization finds the best answer fast and requires few evaluations by
intelligently searching the search space and iteratively improving the model. Because
of this, it is especially well-suited for activities like machine learning model
hyperparameter tweaking, where each assessment may be computationally costly.
4. Bayesian Belief Networks
Bayesian Belief Networks (BBNs), also known as Bayesian networks, are
probabilistic graphical models that represent a set of random variables and their
conditional dependencies using a directed acyclic graph (DAG). The graph's edges
show the relationships between the nodes, which each represent a random variable.
BBNs are employed for modeling uncertainty and generating probabilistic
conclusions regarding the network's variables. They may be used to provide answers
to queries like "What is the most likely explanation for the observed data?" and
"What is the probability of variable A given the evidence of variable B?"

BBNs are extensively utilized in several domains, including as risk analysis,


diagnostic systems, and decision-making. They are useful tools for reasoning under
uncertainty because they provide a graphical and understandable representation of
complicated probabilistic connections between variables.
Bayesian Belief Network
A Bayesian Belief Network (BBN), also known as a Bayesian Network, Belief Network, or Directed
Acyclic Graphical Model (DAG), is a probabilistic graphical model used to represent a set of random
variables and their conditional dependencies via a directed acyclic graph (DAG). BBNs are crucial in
machine learning for dealing with uncertainty and making predictions based on probabilistic
relationships among variables.
Key Concepts
1. Graphical Representation:
o BBNs use a directed acyclic graph (DAG) to represent the relationships between
variables. Each node in the graph represents a random variable, and the edges
represent the conditional dependencies between these variables.
o The graph structure allows for a compact representation of the joint probability
distribution over the variables, making it easier to understand and compute
probabilities.
2. Conditional Probability Tables (CPTs):
o Each node in a BBN has a conditional probability table (CPT) that specifies the
probability distribution of the node given its parent nodes.
o For nodes without parents (root nodes), the CPT represents the marginal probability
distribution of the node.
3. Inference:
o Inference in BBNs involves computing the posterior probabilities of variables given
evidence. This can be done using various algorithms such as variable elimination,
belief propagation, and Monte Carlo methods.
o Inference allows for making predictions and updating beliefs based on new evidence,
which is essential for decision-making under uncertainty.
4. Learning:
o Learning in BBNs involves estimating the parameters of the network (CPTs) from
data. This can be done using methods like maximum likelihood estimation (MLE) or
Bayesian parameter estimation.
o Structure learning involves determining the optimal graph structure that best
represents the dependencies in the data.
Applications
• Classification: BBNs are used in classification tasks where the goal is to predict the class
label of an instance based on its features.
• Diagnosis: In medical diagnosis, BBNs can model the relationships between symptoms and
diseases to aid in diagnosis.
• Prediction: BBNs can be used to predict future events based on historical data, such as
weather forecasting or financial market predictions.
• Decision Making: BBNs can support decision-making processes by providing a framework
to model and reason about uncertain information.
Advantages
• Expressiveness: BBNs can represent complex probabilistic relationships and dependencies
among variables.
• Efficiency: The graphical structure allows for efficient computation of probabilities and
inference.
• Interpretability: The graphical representation makes it easier to understand the relationships
and dependencies among variables.
Challenges
• Complexity: Learning the structure and parameters of a BBN can be computationally
intensive, especially for large and complex networks.
• Data Requirements: Accurate learning of BBNs often requires a large amount of data to
estimate the parameters reliably.
Example
Consider a simple BBN for predicting whether a person will go to the park based on the weather
conditions. The variables might include "Weather" (Sunny, Rainy), "Temperature" (Hot, Cold), and
"Go to Park" (Yes, No). The BBN would represent the dependencies between these variables, and
inference could be used to predict the probability of going to the park given the weather and
temperature.
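A minimal sketch of this park example in plain Python is shown below; the CPT values are invented for illustration, and the posterior is computed by brute-force enumeration over the joint distribution rather than by an optimized inference algorithm.

# Minimal Bayesian network sketch: Weather -> Go to Park <- Temperature
# (illustrative CPT values; a real network would learn these from data)

# Marginal probabilities for the root nodes
P_weather = {"Sunny": 0.7, "Rainy": 0.3}
P_temp = {"Hot": 0.4, "Cold": 0.6}

# CPT for "Go to Park = Yes" given its parents (Weather, Temperature)
P_park_yes = {
    ("Sunny", "Hot"): 0.9,
    ("Sunny", "Cold"): 0.7,
    ("Rainy", "Hot"): 0.3,
    ("Rainy", "Cold"): 0.1,
}

def joint(weather, temp, park):
    """Joint probability P(Weather, Temperature, GoToPark) from the CPTs."""
    p_yes = P_park_yes[(weather, temp)]
    p_park = p_yes if park == "Yes" else 1.0 - p_yes
    return P_weather[weather] * P_temp[temp] * p_park

# Inference by enumeration: P(Go to Park = Yes | Weather = Sunny)
evidence_weather = "Sunny"
numerator = sum(joint(evidence_weather, t, "Yes") for t in P_temp)
denominator = sum(joint(evidence_weather, t, p) for t in P_temp for p in ("Yes", "No"))
print("P(Park=Yes | Weather=Sunny) =", numerator / denominator)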
Conclusion
Bayesian Belief Networks are powerful tools in machine learning for modeling and reasoning about
uncertain information. They provide a structured and efficient way to represent probabilistic
relationships and make predictions based on these relationships. Understanding BBNs is essential for
anyone working with probabilistic models and decision-making under uncertainty.

3.7 EM algorithm
The Expectation-Maximization (EM) algorithm is a fundamental iterative method used in
machine learning for parameter estimation in probabilistic models, especially when dealing
with incomplete or missing data. It is widely used in various applications, including Hidden
Markov Models (HMM), Latent Dirichlet Allocation (LDA), and Gaussian Mixture Models
(GMM).
Key Concepts
1. Incomplete Data and Hidden Variables:
o The EM algorithm is particularly useful when the data is incomplete or
contains hidden variables. Incomplete data means that some of the data points
are missing or unobserved. Hidden variables are latent variables that are not
directly observable but influence the observed data.
2. Two-Step Process:
o The EM algorithm consists of two main steps: the Expectation (E) step and the
Maximization (M) step.
▪ E-step: In this step, the algorithm computes the expected value of the
log-likelihood function, given the current estimate of the parameters
and the observed data. This involves calculating the posterior
probabilities of the hidden variables.
▪ M-step: In this step, the algorithm maximizes the expected log-
likelihood found in the E-step to find new parameter estimates. This
step updates the parameters to maximize the likelihood of the data
given the current estimates of the hidden variables.
3. Iterative Nature:
o The EM algorithm iterates between the E-step and M-step until convergence.
Convergence is typically determined when the change in the log-likelihood
between iterations falls below a certain threshold.
4. Mathematical Formulation:
o Let X be the observed data and Z the hidden variables. The goal is to maximize the log-likelihood log P(X | θ), where θ denotes the parameters of the model.
o The EM algorithm alternates between:
▪ E-step: compute Q(θ | θ^(t)) = E_{Z | X, θ^(t)} [ log P(X, Z | θ) ]
▪ M-step: update the parameters: θ^(t+1) = argmax_θ Q(θ | θ^(t))

Applications
1. Gaussian Mixture Models (GMM):
o GMMs are used to model data as a mixture of several Gaussian distributions.
The EM algorithm is used to estimate the parameters of these distributions,
including the means, variances, and mixing coefficients.
2. Hidden Markov Models (HMM):
o HMMs are used in speech recognition, natural language processing, and
bioinformatics. The EM algorithm is used to estimate the transition and
emission probabilities of the HMM.
3. Latent Dirichlet Allocation (LDA):
o LDA is used for topic modeling in text data. The EM algorithm is used to
estimate the topic distributions and word distributions.

Advantages
• Handling Missing Data: The EM algorithm can handle missing data by treating the
missing values as hidden variables.
• Convergence: The algorithm is guaranteed to converge to a local optimum, although
it may not always find the global optimum.
• Flexibility: The EM algorithm can be applied to a wide range of models and
problems.

Challenges
• Convergence to Local Optima: The EM algorithm may converge to a local optimum
rather than the global optimum.
• Computational Complexity: The algorithm can be computationally expensive,
especially for large datasets and complex models.

Example: Gaussian Mixture Model (GMM)

Consider a dataset that is a mixture of several Gaussian distributions. The goal is to estimate
the parameters of these distributions. The EM algorithm proceeds as follows:
1. Initialization: Randomly initialize the parameters of the Gaussian distributions.
2. E-step: Compute the posterior probabilities of each data point belonging to each
Gaussian distribution.
3. M-step: Update the parameters of the Gaussian distributions based on the posterior
probabilities.
4. Iteration: Repeat the E-step and M-step until convergence.
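The following NumPy sketch walks through these four steps for a one-dimensional, two-component mixture; the synthetic data, initial guesses, and convergence threshold are all arbitrary choices made for illustration.

import numpy as np

rng = np.random.default_rng(0)
# Synthetic 1-D data drawn from two Gaussians (the "true" parameters are unknown to EM)
data = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.5, 200)])

K = 2
# 1. Initialization: rough starting guesses for weights, means, variances
weights = np.full(K, 1.0 / K)
means = np.array([-1.0, 1.0])
variances = np.array([1.0, 1.0])

def gaussian_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

log_likelihood_old = -np.inf
for iteration in range(200):
    # 2. E-step: posterior probability (responsibility) of each component for each point
    dens = np.stack([w * gaussian_pdf(data, m, v)
                     for w, m, v in zip(weights, means, variances)], axis=1)
    resp = dens / dens.sum(axis=1, keepdims=True)

    # 3. M-step: re-estimate weights, means, variances from the responsibilities
    Nk = resp.sum(axis=0)
    weights = Nk / len(data)
    means = (resp * data[:, None]).sum(axis=0) / Nk
    variances = (resp * (data[:, None] - means) ** 2).sum(axis=0) / Nk

    # 4. Iterate until the log-likelihood stops improving
    log_likelihood = np.log(dens.sum(axis=1)).sum()
    if abs(log_likelihood - log_likelihood_old) < 1e-6:
        break
    log_likelihood_old = log_likelihood

print("weights:", weights, "means:", means, "variances:", variances)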

Question :
Explain Bayesian estimation and maximum likelihood estimation in generative
learning. (paper 2023)

Bayesian estimation and maximum likelihood estimation are two common approaches
used in statistical inference to estimate the parameters of a statistical model.
Bayesian Estimation: - It is a probabilistic approach to parameter estimation. It uses
the prior information about the parameters and updates them based on the observed
data using Bayes’s theorem. Bayesian estimation provides a posterior distribution for
the parameters, which represents the updated belief about the parameter values after
considering the data. The posterior distribution combines the prior distribution and the
likelihood function, and it provides a complete representation of the uncertainty
associated with the parameter estimates. It allows for the calculation of credible
intervals or intervals of uncertainty for the parameters.
Given as
P(A|B) = (P(B|A)* P(A)) / P(B)

Here, we find the probability of event A given that B is true, where P(A) and P(B) are the independent probabilities of events A and B.

P(A) = Prior probability. This is the probability of an event before we take any new piece of information into consideration.

P(B) is referred to as the evidence: how likely an observation of B is given our prior beliefs about A.

P(B|A) is referred to as the likelihood function. It tells how likely each observation of B is for a fixed A.

P(A|B) = Posterior probability. This is the probability of an event after some evidence has been observed.
Bayesian estimation treats the parameters as random variables, and it provides a
framework for incorporating prior knowledge or expert opinions into the analysis. It
also allows for model comparison and hypothesis testing within the Bayesian
framework.
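As a small worked sketch (with invented counts), suppose we estimate p, the probability that an e-mail is spam, using a Beta prior; because the Beta distribution is conjugate to the Bernoulli likelihood, the posterior is available in closed form.

# Bayesian estimation sketch: Beta prior + Bernoulli data -> Beta posterior
# (the counts below are invented purely for illustration)

# Prior belief about p = P(spam): Beta(alpha, beta)
alpha_prior, beta_prior = 2.0, 2.0        # weak prior centred on 0.5

# Observed data: 100 e-mails, 30 of them spam
spam, not_spam = 30, 70

# Conjugate update: posterior is Beta(alpha + spam, beta + not_spam)
alpha_post = alpha_prior + spam
beta_post = beta_prior + not_spam

posterior_mean = alpha_post / (alpha_post + beta_post)
print("Posterior mean of p(spam):", round(posterior_mean, 3))   # ~0.308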
Maximum Likelihood Estimation (MLE): - Maximum likelihood estimation is a
frequentist approach to parameter estimation. It aims to find the parameter values that
maximize the likelihood function, which is a measure of how well the observed data
fits the model. The likelihood function is constructed based on the assumed
probability distribution of the data. In MLE, the parameter estimates are chosen to
maximize the probability of observing the given data under the assumed model. This
estimation method provides point estimates of the parameters, which are single values
that are considered the best estimates given the observed data. MLE has desirable
properties, such as consistency and asymptotic efficiency. However, it does not
provide any information about the uncertainty associated with the parameter
estimates. Additionally, it assumes that the true parameter values are fixed and
unknown.
Likelihood function
The objective is to maximise the probability of observing the data points jointly under an assumed probability distribution. This is formally stated as

P(X | θ)

Here, θ is the unknown parameter. This may also be written as P(X; θ) or P(x1, x2, x3, ..., xn; θ).

The likelihood function is commonly denoted with L:

L(X; θ)

Since the aim is to find the parameters that maximise the likelihood function, we seek the maximum of L(X; θ) over θ.

Assuming independent observations, the joint probability is restated as a product of conditional probabilities for every observation given the distribution parameters:

L(X | θ) = ∏ (i = 1 to n) P(xi | θ)

Log of likelihood
Taking the product of all these conditional probabilities is cumbersome, so to make it easier we take the natural log on both sides:

ln L(X | θ) = ln( ∏ (i = 1 to n) P(xi | θ) )

which becomes

ln L(X | θ) = ∑ (i = 1 to n) ln P(xi | θ)

MLE is an optimisation technique that can be applied to various machine learning models such as logistic regression, linear regression, etc.
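A minimal numerical sketch of this idea for a Bernoulli (coin-flip) model is given below; the observations are invented, and the grid search is only meant to show that the log-likelihood peaks at the sample proportion, which is the closed-form MLE.

import numpy as np

# Observed coin flips (1 = heads); invented data for illustration
x = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])

# Log-likelihood of a Bernoulli(theta) model: sum_i ln P(x_i | theta)
thetas = np.linspace(0.01, 0.99, 99)
log_lik = np.array([np.sum(x * np.log(t) + (1 - x) * np.log(1 - t)) for t in thetas])

theta_mle = thetas[np.argmax(log_lik)]
print("MLE of theta:", theta_mle)          # close to the sample mean
print("Sample mean :", x.mean())           # 0.7, the closed-form MLE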
3.8 Bagging and Boosting.
Bagging (bootstrap aggregating) is used when the variance of a classifier, such as a decision tree, needs to be decreased. The idea is to draw several subsets of the training data at random with replacement and train a separate decision tree on each subset. We end up with an ensemble of many models. This method is more reliable than a single decision tree classifier since it combines (averages) the predictions from all the trees.
The bagging procedure works as follows (see the sketch below):
1. Start with a basic training dataset containing n examples.
2. Create m subsets of the data. Each subset is formed by sampling N points from the original training set with replacement, which means a given data point may be sampled multiple times.
3. Train one weak learner separately on each subset of data. These models are homogeneous, i.e. all of the same type.
4. Every model makes a forecast.
5. Combine all the predictions into one forecast, using either maximum (majority) voting or averaging.
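The sketch below mirrors these steps using scikit-learn decision trees as the homogeneous weak learners; the synthetic dataset and the number of bootstrap subsets are arbitrary choices for illustration.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
n_models = 25
models = []
for _ in range(n_models):
    # Bootstrap subset: sample N points with replacement from the training set
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier().fit(X_train[idx], y_train[idx])
    models.append(tree)

# Combine the forecasts by majority (maximum) voting
votes = np.stack([m.predict(X_test) for m in models])      # shape (n_models, n_test)
y_pred = (votes.mean(axis=0) >= 0.5).astype(int)
print("Bagged accuracy:", (y_pred == y_test).mean())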
Boosting builds a set of predictors sequentially. Early learners fit simple models to the data, and the errors they make are then examined. At each stage, the next tree is fitted with the aim of correcting the mistakes of the previous one. When a hypothesis misclassifies an input, that input's weight is increased, which makes it more likely that the subsequent hypothesis will classify it correctly. In this way, a collection of weak learners is combined into a better-performing model.
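scikit-learn's AdaBoostClassifier implements this kind of reweighting; the short sketch below (again on synthetic data) is only meant to show the typical call pattern, not a tuned model.

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each new weak learner focuses on the examples the previous ones misclassified
booster = AdaBoostClassifier(n_estimators=50, random_state=0)
booster.fit(X_train, y_train)
print("Boosted accuracy:", booster.score(X_test, y_test))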

CLASSIFICATION MODEL EVALUATION AND SELECTION
In machine learning, classification refers to predicting the label of an observation. We'll cover some of the most widely used classification measures, namely accuracy, precision, recall, F-1 score, ROC curve, and AUC. We'll also compare the two most commonly confused metrics: precision and recall.
Binary Classification
Binary classification is a subset of classification problems, where we only have
two possible labels. Generally speaking, a yes/no question or a setting with 0-1
outcome can be modeled as a binary classification problem. For instance, a well-
known problem is predicting whether an e-mail is spam or not.
Given two classes, we can talk in terms of positive and negative samples. In this case,
we say a given e-mail is spam (positive) or not spam (negative).
For a given sample observation, the actual class is either positive or negative.
Similarly, the predicted class is also either positive or negative. We can visualize the outcome of a binary classification model using a confusion matrix, which tabulates the predicted classes against the actual classes.

There are four categories associated with the actual and predicted class of an
observation:
1. True Positive (TP): Both the actual and predicted values of the given observation are
positive.
2. False Positive (FP): The given observation is negative but the predicted value is
positive.
3. True Negative (TN): Both the actual and predicted values of the given observation
are negative.
4. False Negative (FN): The given observation is predicted to be negative, despite, in
fact, being positive.
The diagonal of the confusion matrix presents the correct predictions. Obviously, we
want the majority of our predictions to reside here.
FPs and FNs are classification errors. In statistics, FPs are referred to as Type I
errors, and FNs are referred to as Type II errors. In some cases, Type II errors are
dangerous and not tolerable.
For example, if our classifier is predicting if there is a fire in the house, a Type I error
is a false alarm.
On the other hand, a Type II error means that the house is burning down and the fire
department isn’t aware.

Binary Classification Measures
Suppose we have a simple binary classification case in which the actual positive and negative samples are distributed over a rectangular region, while our classifier marks the samples inside a circle as positive and the rest as negative.
How did our classifier do?
Many metrics are proposed and being used to evaluate the performance of a
classifier. Depending on the problem and goal, we must select the relevant metric
to observe and report.
In this section, we'll try to understand what some popular classification measures represent and which one to use for different problems.

a). Accuracy
The simplest and most straightforward classification metric is accuracy. Accuracy measures the fraction of correctly classified observations. The formula is:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Thinking in terms of the confusion matrix, we sometimes describe accuracy as "the diagonal over all samples". Simply put, it is the complement of the error rate. An accuracy of 90% means that out of 100 observations, 90 samples are classified correctly.
Although 90% accuracy sounds very promising at first, using the accuracy measure in
imbalanced datasets might be misleading.
Again recall the spam e-mail example. Consider out of every 100 e-mails received, 90
of them are spam. In this case, labeling each e-mail as spam would lead to 90%
accuracy with an empty inbox. Think about all the important e-mails we might be
missing.
Similarly, in the case of fraud detection, only a tiny fraction of transactions are
fraudulent. If our classifier were to mark every single case as not fraudulent, we’d still
have close to 100% accuracy.
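A tiny numerical check of this point, with made-up labels, shows how a do-nothing classifier can still look accurate:

import numpy as np

# 100 e-mails, 90 of them spam (1) -- imbalanced labels, as in the text
y_true = np.array([1] * 90 + [0] * 10)
y_pred = np.ones_like(y_true)           # lazy classifier: mark every e-mail as spam

print("Accuracy:", (y_true == y_pred).mean())   # 0.9, yet every legitimate e-mail is lost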
b). Precision and Recall
Precision and recall are two important metrics used to evaluate the performance of
classification models in machine learning. They are particularly useful for imbalanced
datasets where one class has significantly fewer instances than the other.
Precision is a measure of how many of the positive predictions made by a classifier
were correct. It is defined as the ratio of true positives (TP) to the total number of
positive predictions (TP + FP). In other words, precision measures the proportion of
true positives among all positive predictions.
Precision = TP / (TP + FP)

Recall, on the other hand, is a measure of how many of the actual positive instances
were correctly identified by the classifier. It is defined as the ratio of true positives (TP)
to the total number of actual positive instances (TP + FN). In other words, recall
measures the proportion of true positives among all actual positive instances.
Recall = TP / (TP + FN)

To understand precision and recall, consider the problem of detecting spam emails. A
classifier may label an email as spam (positive prediction) or not spam (negative
prediction). The actual label of the email can be either spam or not spam. If the email
is actually spam and the classifier correctly labels it as spam, then it is a true positive.
If the email is not spam but the classifier incorrectly labels it as spam, then it is a false
positive. If the email is actually spam but the classifier incorrectly labels it as not spam,
then it is a false negative. Finally, if the email is not spam and the classifier correctly
labels it as not spam, then it is a true negative.
In this scenario, precision measures the proportion of spam emails that were correctly
identified as spam by the classifier. A high precision indicates that the classifier is
correctly identifying most of the spam emails and is not labeling many legitimate emails
as spam. On the other hand, recall measures the proportion of all spam emails that were
correctly identified by the classifier. A high recall indicates that the classifier is
correctly identifying most of the spam emails, even if it is labeling some legitimate
emails as spam.
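Continuing the spam scenario with assumed confusion-matrix counts, these quantities (together with the F-1 score defined in the next subsection) can be computed directly:

# Assumed confusion-matrix counts for a spam classifier (illustrative only)
TP, FP, FN, TN = 40, 10, 20, 130

precision = TP / (TP + FP)                      # 0.80
recall = TP / (TP + FN)                         # ~0.67
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")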

c). F-1 Score
Usually, having balanced recall and precision is more important than having one type of error very low. That's why we generally take both precision and recall into consideration. Several classification measures have been developed for this purpose. For example, the F-1 score is the harmonic mean of precision and recall. It gives equal importance to Type I and Type II errors. The calculation is:

F1 = 2 * (Precision * Recall) / (Precision + Recall)
When the dataset labels are evenly distributed, accuracy gives meaningful results. But
if the dataset is imbalanced, like the spam e-mail example, we should prefer the F-1
score.
Another point to keep in mind is that accuracy gives more importance to TPs and TNs, while the F-1 score focuses on FNs and FPs.
d). ROC Curve and AUC
A well-known method to visualize the classification performance is a ROC
curve (receiver operating characteristic curve). The plot shows the classifier’s
success for different threshold values.
In order to plot the ROC curve, we need to calculate the True Positive Rate (TPR) and the False Positive Rate (FPR), where:

TPR = TP / (TP + FN) and FPR = FP / (FP + TN)
To better understand what this graph stands for, first imagine setting the classifier threshold to 0. This leads to labeling every observation as positive. Thus, we can conclude that lowering the threshold marks more items as positive. Having more observations classified as positive results in more TPs and FPs.
Conversely, setting the threshold value to 1 causes every observation to be marked as negative, so the algorithm will label more observations as negative. Hence, increasing the threshold value causes both the TP and FP counts to decrease, leading to lower TPR and FPR.
For each different threshold value, we get different TP and FP rates. Plotting the result
gives us the ROC curve. But how does the ROC curve help us find out the success of
a classifier?
In this case, AUC helps us. It stands for “area under the curve”. Simply put, AUC is
the area under the ROC curve. In case we have a better classification for each
threshold value, the area grows. A perfect classification leads to an AUC of 1.0.
On the contrary, worse classifier performance reduces the area. AUC is often used as
a classifier comparison metric for machine learning problems.
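A short sketch of obtaining the ROC points and the AUC with scikit-learn is given below; the labels and scores are invented for illustration.

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Invented ground-truth labels and classifier scores (probability of the positive class)
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.5, 0.7, 0.6, 0.3])

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # one (FPR, TPR) point per threshold
print("AUC:", roc_auc_score(y_true, y_score))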
Multi-Class Classification
When there are more than two labels available for a classification problem, we
call it multiclass classification. Measuring the performance of a multiclass classifier
is very similar to the binary case.
Suppose a certain five-class classifier produces a confusion matrix over 127 samples in total. Now let's see how well the classifier performed.
Recall that accuracy is the percentage of correctly classified samples, which reside on the diagonal. To calculate accuracy, we just divide the number of correctly classified samples by the total number of samples.
In this example, the classifier correctly classifies almost half of the samples. Given that we have five labels, this is much better than random guessing.
Calculating the precision and recall is a little trickier than the binary class case. We
can’t talk about the overall precision or recall of a classifier for multiclass
classification. Instead, we calculate precision and recall for each class separately.
Class by class, we can categorize each element into one of TP, TN, FP, or FN. Hence,
we can rewrite the precision formula for a given class i as:

Precision_i = TP_i / (TP_i + FP_i)

Similarly, the recall formula can be rewritten as:

Recall_i = TP_i / (TP_i + FN_i)
Correspondingly, we can calculate precision and recall values for the other classes.
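A brief sketch of computing these per-class values with scikit-learn (average=None returns one score per class); the labels and predictions below are invented.

import numpy as np
from sklearn.metrics import precision_score, recall_score

# Invented multiclass ground truth and predictions with three labels
y_true = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 0])
y_pred = np.array([0, 1, 1, 1, 2, 2, 2, 0, 2, 0])

print("per-class precision:", precision_score(y_true, y_pred, average=None))
print("per-class recall   :", recall_score(y_true, y_pred, average=None))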
In this section, we have investigated how to evaluate a classifier depending on the problem domain and the dataset's label distribution. Then, starting with accuracy, precision, and recall, we have covered some of the most well-known performance measures.
Precision and recall are two crucial metrics, each minimizing a different type of
statistical error. F-1 Score is another metric for measuring classifier success, and it
considers both Type I and Type II errors.
We have briefly explained the ROC curve and AUC measure to compare classifiers.
Before concluding, we have discussed some differences between binary and
multiclass classification problems and how to adapt the elementary metrics to the
multiclass case.
