UNIT 3 - Final
o As we can see the 3 nearest neighbors are from category A, hence this new data point must
belong to category A.
How to select the value of K in the K-NN Algorithm?
Below are some points to remember while selecting the value of K in the K-NN
algorithm:
o There is no particular way to determine the best value for "K", so we need to try some values
to find the best out of them. The most preferred value for K is 5.
o A very low value for K such as K=1 or K=2, can be noisy and lead to the effects of outliers in
the model.
o Large values for K reduce the effect of noise, but if K is too large the neighbourhood may include many points from other categories, blurring the class boundaries and increasing computation.
Advantages of KNN Algorithm:
o It is simple to implement.
o It is robust to noisy training data.
o It can be more effective if the training data is large.
Disadvantages of KNN Algorithm:
o The value of K always needs to be determined, which may be complex at times.
o The computation cost is high because the distance between the new data point and all the training samples must be calculated.
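As a rough illustration (not part of the original notes), the sketch below tries several K values with scikit-learn's KNeighborsClassifier; the iris dataset, the train/test split and the candidate K values are arbitrary, assumed choices.

```python
# A minimal sketch of choosing K for K-NN by comparing held-out accuracy.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

for k in (1, 3, 5, 7, 9):
    knn = KNeighborsClassifier(n_neighbors=k)      # K neighbours used for the majority vote
    knn.fit(X_train, y_train)
    print(k, knn.score(X_test, y_test))            # accuracy on held-out data for this K
```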
3.3 Random Forest Algorithm
Random Forest is a popular machine learning algorithm that belongs to the supervised learning
technique. It can be used for both Classification and Regression problems in ML. It is based on the
concept of ensemble learning, which is a process of combining multiple classifiers to solve a
complex problem and to improve the performance of the model.
As the name suggests, "Random Forest is a classifier that contains a number of decision trees
on various subsets of the given dataset and takes the average to improve the predictive accuracy
of that dataset." Instead of relying on one decision tree, the random forest takes the prediction from each tree and, based on the majority vote of the predictions, produces the final output.
A greater number of trees in the forest leads to higher accuracy and helps prevent the problem of overfitting.
The below diagram explains the working of the Random Forest algorithm:
Note: To better understand the Random Forest Algorithm, you should have knowledge of the
Decision Tree Algorithm.
Since the random forest combines multiple trees to predict the class of the dataset, it is possible
that some decision trees may predict the correct output, while others may not. But together, all the
trees predict the correct output. Therefore, below are two assumptions for a better Random forest
classifier:
o There should be some actual values in the feature variable of the dataset so that the classifier
can predict accurate results rather than a guessed result.
o The predictions from each tree must have very low correlations.
Random Forest works in two phases: the first is to create the random forest by combining N decision trees, and the second is to make predictions from the trees created in the first phase.
The Working process can be explained in the below steps and diagram:
Step-1: Select random K data points from the training set.
Step-2: Build the decision trees associated with the selected data points (subsets).
Step-3: Choose the number N of decision trees that you want to build.
Step-4: Repeat Step-1 and Step-2 until N trees have been built.
Step-5: For a new data point, find the prediction of each decision tree, and assign the new data point to the category that wins the majority of votes.
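A minimal scikit-learn sketch of these two phases (an assumed example: the iris dataset, the 100-tree setting and the train/test split are illustrative choices, not from the notes):

```python
# Phase 1: build N decision trees on random subsets; Phase 2: predict by majority vote.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)  # N = 100 trees
forest.fit(X_train, y_train)            # phase 1: build the trees on bootstrap subsets
print(forest.predict(X_test[:5]))       # phase 2: each tree votes; the majority wins
print(forest.score(X_test, y_test))     # overall accuracy of the forest
```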
The working of the algorithm can be better understood by the below example:
Example: Suppose there is a dataset that contains multiple fruit images. So, this dataset is given to
the Random forest classifier. The dataset is divided into subsets and given to each decision tree.
During the training phase, each decision tree produces a prediction result, and when a new data
point occurs, then based on the majority of results, the Random Forest classifier predicts the final
decision. Consider the below image:
Applications of Random Forest
There are mainly four sectors where Random Forest is mostly used:
1. Banking: The banking sector mostly uses this algorithm for the identification of loan risk.
2. Medicine: With the help of this algorithm, disease trends and risks of the disease can be
identified.
3. Land Use: We can identify the areas of similar land use by this algorithm.
4. Marketing: Marketing trends can be identified using this algorithm.
o Although Random Forest can be used for both classification and regression tasks, it is less suitable for regression tasks.
Feature-by-feature comparison of Random Forest with other ML algorithms:

Feature: Ensemble Approach
o Random Forest: Utilizes an ensemble of decision trees, combining their outputs for predictions, fostering robustness and accuracy.
o Other ML Algorithms: Typically rely on a single model (e.g., linear regression, support vector machine) without the ensemble approach, potentially leading to less resilience against noise.

Feature: Overfitting Resistance
o Random Forest: Resistant to overfitting due to the aggregation of diverse decision trees, preventing memorization of the training data.
o Other ML Algorithms: Some algorithms may be prone to overfitting, especially when dealing with complex datasets, as they may excessively adapt to training noise.

Feature: Handling of Missing Data
o Random Forest: Exhibits resilience in handling missing values by leveraging available features for predictions, contributing to practicality in real-world scenarios.
o Other ML Algorithms: May require imputation or elimination of missing data, potentially impacting model training and performance.

Feature: Variable Importance
o Random Forest: Provides a built-in measure of variable importance, indicating which features contribute most to predictions.
o Other ML Algorithms: Many algorithms lack such a built-in measure, making it challenging to identify crucial variables for predictions.

Feature: Parallelization Potential
o Random Forest: Capitalizes on parallelization, enabling the simultaneous training of decision trees, resulting in faster computation for large datasets.
o Other ML Algorithms: Some algorithms may have limited parallelization capabilities, potentially leading to longer training times for extensive datasets.
Supervised Learning Algorithm: Random Forest
Unsupervised Learning Algorithm: KNN
Reinforcement Learning Algorithm: Markov Decision Process
• Margin − It may be defined as the gap between the two lines at the closest data points of different classes. It can be calculated as the perpendicular distance from the line to the support vectors. A large margin is considered a good margin and a small margin is considered a bad margin.
SVM can be of two types:
• Linear SVM: Linear SVM is used for linearly separable data, which means if a dataset can
be classified into two classes by using a single straight line, then such data is termed as
linearly separable data, and classifier is used called as Linear SVM classifier.
• Non-linear SVM: Non-Linear SVM is used for non-linearly separated data, which means if a
dataset cannot be classified by using a straight line, then such data is termed as non-linear
data and classifier used is called as Non-linear SVM classifier.
The main goal of SVM is to divide the datasets into classes to find a maximum marginal
hyperplane (MMH) and it can be done in the following two steps −
• First, SVM will iteratively generate hyperplanes that segregate the classes in the best way.
• Then, it will choose the hyperplane that separates the classes correctly.
SVMs are used in text categorization, image classification, handwriting recognition, and in the sciences.
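A small illustrative sketch (assumed example, not from the notes) contrasting a Linear SVM and a Non-linear (RBF-kernel) SVM on data that is not linearly separable; the make_moons dataset and settings are arbitrary choices:

```python
# Linear vs. non-linear SVM on a dataset that a single straight line cannot separate.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)   # not linearly separable
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_svm = SVC(kernel="linear").fit(X_train, y_train)       # Linear SVM classifier
rbf_svm = SVC(kernel="rbf").fit(X_train, y_train)             # Non-linear SVM classifier
print("linear:", linear_svm.score(X_test, y_test))
print("rbf:   ", rbf_svm.score(X_test, y_test))               # usually much higher here
```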
Given a data set and a set of machine learning algorithms, how do we choose an appropriate algorithm?
The machine learning algorithm can be broadly classified into three groups.
1. Supervised learning model: Trained with the dataset which consists of labels for both Input
and output.
2. Unsupervised learning model: Trained with the dataset that has input features but does not
have labels for the output.
3. Reinforcement learning model: Trains itself by taking actions and learning from the feedback (rewards) it receives, and makes decisions based on that experience.
The factors we need to consider while categorizing and solving the problem are:
• Knowledge of Data: The data’s structure and complexity help dictate the right algorithm
• Accuracy Requirements: Different questions demand different degrees of accuracy, which
influences algorithm selection
• Processing Speed: Algorithm choice may depend on the time constraints in place for a given
analysis
• Variables: The unique features considered while training the model for the optimal result
and accuracy help determine the right algorithm.
• Parameters: Factors such as the number of iterations directly relate to the training time needed when generating output.
The first step is to analyze the data and observe patterns and any hidden insights by visualizing it. The insights from data visualization will help in making an initial decision about which algorithm to choose for solving the given problem.
Another important factor in deciding the algorithm is the tradeoff between speed and accuracy. If accuracy is not vital, for example when only a rough estimate of a value is needed, we can reduce the processing time and thus increase the execution speed.
Another factor that plays an important role in choosing the correct algorithm is the
number and types of parameters we pass while training the model. These parameters may
include iteration cycle, splitting train and test datasets, error tolerance, and more. The
training time for a given model is directly proportional to the number of parameters
included. When multiple parameters are required to train the model, SVM (Support
Vector Machine) algorithms work well.
Answer (c) Logistic Regression
• Logistic regression is one of the most popular Machine Learning algorithms, which comes
under the Supervised Learning technique. It is used for predicting the categorical dependent
variable using a given set of independent variables.
• Logistic regression predicts the output of a categorical dependent variable. Therefore the
outcome must be a categorical or discrete value. It can be either Yes or No, 0 or 1, true or
False, etc. but instead of giving the exact value as 0 and 1, it gives the probabilistic values
which lie between 0 and 1.
• Logistic Regression is quite similar to Linear Regression except in how they are used.
Linear Regression is used for solving Regression problems, whereas Logistic regression is
used for solving the classification problems.
• In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic
function, which predicts two maximum values (0 or 1).
• The curve from the logistic function indicates the likelihood of something such as whether
the cells are cancerous or not, a mouse is obese or not based on its weight, etc.
• Logistic Regression is a significant machine learning algorithm because it has the ability to
provide probabilities and classify new data using continuous and discrete datasets.
• Logistic Regression can be used to classify the observations using different types of data and
can easily determine the most effective variables used for the classification. The output of logistic regression must lie between 0 and 1 and cannot go beyond this limit, so it forms a curve like the "S" shape. The S-shaped curve is called the sigmoid function or the logistic function.
On the basis of the categories, Logistic Regression can be classified into three types:
• Binomial: In binomial logistic regression, there can be only two possible types of the dependent variable, such as 0 or 1, Pass or Fail, etc.
• Multinomial: In multinomial logistic regression, there can be 3 or more possible unordered types of the dependent variable, such as "cat", "dog", or "sheep".
• Ordinal: In ordinal logistic regression, there can be 3 or more possible ordered types of the dependent variable, such as "low", "medium", or "high".
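For illustration, a minimal sketch (assumed example: the breast-cancer dataset, the 0.5 threshold and the solver settings are arbitrary choices) of binomial logistic regression: the model returns probabilistic values between 0 and 1 via the sigmoid function, which are then thresholded into a discrete 0/1 outcome.

```python
# Logistic regression: sigmoid of a linear score gives a probability, thresholded to 0/1.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def sigmoid(z):
    """The S-shaped logistic function that squashes any value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
probs = clf.predict_proba(X_test[:5])[:, 1]          # probabilistic values in (0, 1)
same = sigmoid(clf.decision_function(X_test[:5]))    # sigmoid of the linear score: same values
labels = (probs >= 0.5).astype(int)                  # discrete outcome: 0 or 1
print(probs, same, labels)
```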
Bayes' theorem is stated as:
P(A∣B) = (P(B∣A) * P(A)) / P(B)
where
• P(A∣B) is the posterior probability of event A given event B.
• P(B∣A) is the likelihood of event B given event A.
• P(A) is the prior probability of event A.
• P(B) is the total probability of event B.
In the context of modeling hypotheses, Bayes' theorem allows us to infer our belief in a
hypothesis based on new data. We start with a prior belief in the hypothesis, represented
by P(A), and then update this belief based on how likely the data are to be observed under
the hypothesis, represented by P(B∣A). The posterior probability P(A∣B) represents our
updated belief in the hypothesis after considering the data.
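A small worked example of this belief update, with purely illustrative (assumed) numbers for the prior and likelihoods:

```python
# Updating belief in a hypothesis A after observing evidence B with Bayes' theorem.
p_A = 0.01             # prior P(A): belief in the hypothesis before seeing data
p_B_given_A = 0.95     # likelihood P(B|A): probability of the evidence if A is true
p_B_given_notA = 0.05  # probability of the evidence if A is false

p_B = p_B_given_A * p_A + p_B_given_notA * (1 - p_A)   # total probability P(B)
p_A_given_B = p_B_given_A * p_A / p_B                   # posterior P(A|B)
print(round(p_A_given_B, 3))   # ~0.161: the updated belief after considering the data
```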
Key Terms Related to Bayes Theorem
1. Likelihood(P(B∣A)):
• Represents the probability of observing the given evidence (features)
given that the class is true.
• In the Naive Bayes algorithm, a key assumption is that features are conditionally independent given the class label, which greatly simplifies computing this likelihood. In practice, Naive Bayes often works best with discrete features.
2. Prior Probability (P(A)):
• In machine learning, this represents the probability of a particular
class before considering any features.
• It is estimated from the training data.
3. Evidence Probability( P(B) ):
• This is the probability of observing the given evidence (features).
• It serves as a normalization factor and is often calculated as the sum
of the joint probabilities over all possible classes.
4. Posterior Probability( P(A∣B) ):
• This is the updated probability of the class given the observed
features.
• It is what we are trying to predict or infer in a classification task.
Now, to utilise this in machine learning we use the Naive Bayes classifier, but in order to understand precisely how this classifier works we must first understand the maths behind it.
Applications of Bayes Theorem in Machine learning
1. Naive Bayes Classifier
The Naive Bayes classifier is a simple probabilistic classifier based on applying Bayes'
theorem with a strong (naive) independence assumption between the features. It is widely
used for text classification, spam filtering, and other tasks involving high-dimensional data.
Despite its simplicity, the Naive Bayes classifier often performs well in practice and is
computationally efficient.
How it works?
• Assumption of Independence: The "naive" assumption in Naive Bayes is that
the presence of a particular feature in a class is independent of the presence of
any other feature, given the class. This is a strong assumption and may not hold
true in real-world data, but it simplifies the calculation and often works well in
practice.
• Calculating Class Probabilities: Given a set of features x1, x2, ..., xn, the Naive Bayes classifier calculates the probability of each class Ck given the features using Bayes' theorem:
P(Ck ∣ x1, x2, ..., xn) = P(Ck) * P(x1 ∣ Ck) * P(x2 ∣ Ck) * ... * P(xn ∣ Ck) / P(x1, x2, ..., xn)
o The denominator P(x1, x2, ..., xn) is the same for all classes and can be ignored for the purpose of comparison.
• Classification Decision: The classifier selects the class Ck with the highest probability as the
predicted class for the given set of features.
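A minimal sketch (assumed example: GaussianNB on the iris dataset, an arbitrary choice) of this decision rule with scikit-learn:

```python
# Naive Bayes: pick the class with the highest posterior probability given the features.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

nb = GaussianNB().fit(X_train, y_train)
posteriors = nb.predict_proba(X_test[:3])   # P(Ck | x) for each class Ck
print(posteriors)
print(posteriors.argmax(axis=1))            # classification decision: argmax over classes
print(nb.predict(X_test[:3]))               # the same decision via predict()
```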
2. Bayes optimal classifier
The Bayes optimal classifier is a theoretical concept in machine learning that
represents the best possible classifier for a given problem. It is based on Bayes'
theorem, which describes how to update probabilities based on new evidence.
In the context of classification, the Bayes optimal classifier assigns the class label that
has the highest posterior probability given the input features. Mathematically, this can
be expressed as:
ŷ = argmax_y P(y∣x)
where ŷ is the predicted class label, y is a class label, x is the input feature vector,
and P(y∣x) is the posterior probability of class y given the input features.
3. Bayesian Optimization
Bayesian optimization is a powerful technique for global optimization of expensive-
to-evaluate functions. To choose which point to assess next, a probabilistic model of
the objective function—typically based on a Gaussian process—is constructed.
Bayesian optimization finds a good solution quickly and requires few evaluations by intelligently searching the search space and iteratively improving the model. Because of this, it is especially well suited to tasks such as machine learning model hyperparameter tuning, where each evaluation may be computationally costly.
4. Bayesian Belief Networks
Bayesian Belief Networks (BBNs), also known as Bayesian networks, are
probabilistic graphical models that represent a set of random variables and their
conditional dependencies using a directed acyclic graph (DAG). Each node represents a random variable, and the graph's edges show the dependencies between the nodes.
BBNs are employed for modeling uncertainty and generating probabilistic
conclusions regarding the network's variables. They may be used to provide answers
to queries like "What is the most likely explanation for the observed data?" and
"What is the probability of variable A given the evidence of variable B?"
3.7 EM algorithm
The Expectation-Maximization (EM) algorithm is a fundamental iterative method used in
machine learning for parameter estimation in probabilistic models, especially when dealing
with incomplete or missing data. It is widely used in various applications, including Hidden
Markov Models (HMM), Latent Dirichlet Allocation (LDA), and Gaussian Mixture Models
(GMM).
Key Concepts
1. Incomplete Data and Hidden Variables:
o The EM algorithm is particularly useful when the data is incomplete or
contains hidden variables. Incomplete data means that some of the data points
are missing or unobserved. Hidden variables are latent variables that are not
directly observable but influence the observed data.
2. Two-Step Process:
o The EM algorithm consists of two main steps: the Expectation (E) step and the
Maximization (M) step.
▪ E-step: In this step, the algorithm computes the expected value of the
log-likelihood function, given the current estimate of the parameters
and the observed data. This involves calculating the posterior
probabilities of the hidden variables.
▪ M-step: In this step, the algorithm maximizes the expected log-
likelihood found in the E-step to find new parameter estimates. This
step updates the parameters to maximize the likelihood of the data
given the current estimates of the hidden variables.
3. Iterative Nature:
o The EM algorithm iterates between the E-step and M-step until convergence.
Convergence is typically determined when the change in the log-likelihood
between iterations falls below a certain threshold.
4. Mathematical Formulation:
o Let X be the observed data and Z be the hidden variables. The goal is to maximize the log-likelihood function log P(X∣θ), where θ are the parameters of the model.
o The EM algorithm alternates between:
▪ E-step: Compute Q(θ ∣ θ(t)) = E[ log P(X, Z ∣ θ) ], where the expectation is taken over Z given X and the current parameter estimate θ(t).
▪ M-step: Update the parameters: θ(t+1) = argmax_θ Q(θ ∣ θ(t))
Applications
1. Gaussian Mixture Models (GMM):
o GMMs are used to model data as a mixture of several Gaussian distributions.
The EM algorithm is used to estimate the parameters of these distributions,
including the means, variances, and mixing coefficients.
2. Hidden Markov Models (HMM):
o HMMs are used in speech recognition, natural language processing, and
bioinformatics. The EM algorithm is used to estimate the transition and
emission probabilities of the HMM.
3. Latent Dirichlet Allocation (LDA):
o LDA is used for topic modeling in text data. The EM algorithm is used to
estimate the topic distributions and word distributions.
Advantages
• Handling Missing Data: The EM algorithm can handle missing data by treating the
missing values as hidden variables.
• Convergence: The algorithm is guaranteed to converge to a local optimum, although
it may not always find the global optimum.
• Flexibility: The EM algorithm can be applied to a wide range of models and
problems.
Challenges
• Convergence to Local Optima: The EM algorithm may converge to a local optimum
rather than the global optimum.
• Computational Complexity: The algorithm can be computationally expensive,
especially for large datasets and complex models.
Consider a dataset that is a mixture of several Gaussian distributions. The goal is to estimate
the parameters of these distributions. The EM algorithm proceeds as follows:
1. Initialization: Randomly initialize the parameters of the Gaussian distributions.
2. E-step: Compute the posterior probabilities of each data point belonging to each
Gaussian distribution.
3. M-step: Update the parameters of the Gaussian distributions based on the posterior
probabilities.
4. Iteration: Repeat the E-step and M-step until convergence.
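A from-scratch sketch of these four steps for a one-dimensional mixture of two Gaussians; the synthetic data, initial values and iteration count are assumed purely for illustration:

```python
# EM for a 1-D two-component Gaussian mixture: Initialization, E-step, M-step, Iteration.
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 200)])  # synthetic data

def normal_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# 1. Initialization: rough starting values for means, variances and mixing coefficients
mu, var, pi = np.array([-1.0, 1.0]), np.array([1.0, 1.0]), np.array([0.5, 0.5])

for _ in range(100):
    # 2. E-step: posterior probability (responsibility) of each component for each point
    dens = pi * np.stack([normal_pdf(x, mu[k], var[k]) for k in range(2)], axis=1)
    resp = dens / dens.sum(axis=1, keepdims=True)
    # 3. M-step: update the parameters from the responsibilities
    Nk = resp.sum(axis=0)
    mu = (resp * x[:, None]).sum(axis=0) / Nk
    var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
    pi = Nk / len(x)
    # 4. Iteration: repeat until the parameters (or log-likelihood) stop changing

print(mu, var, pi)   # the means should end up close to the true values -2 and 3
```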
Question :
Explain Bayesian estimation and maximum likelihood estimation in generative
learning. (paper 2023)
Bayesian estimation and maximum likelihood estimation are two common approaches
used in statistical inference to estimate the parameters of a statistical model.
Bayesian Estimation: - It is a probabilistic approach to parameter estimation. It uses
the prior information about the parameters and updates them based on the observed
data using Bayes’s theorem. Bayesian estimation provides a posterior distribution for
the parameters, which represents the updated belief about the parameter values after
considering the data. The posterior distribution combines the prior distribution and the
likelihood function, and it provides a complete representation of the uncertainty
associated with the parameter estimates. It allows for the calculation of credible
intervals or intervals of uncertainty for the parameters.
Given as
P(A|B) = (P(B|A)* P(A)) / P(B)
Here, we find the probability of event A given that event B is true. P(A) and P(B) are the marginal probabilities of events A and B.
P(A) = Prior Probability. This is the probability of any event before we take into
consideration any new piece of information.
P(A|B) = Posterior Probability. This is the probability of an event after some event
has already occurred.
Bayesian estimation treats the parameters as random variables, and it provides a
framework for incorporating prior knowledge or expert opinions into the analysis. It
also allows for model comparison and hypothesis testing within the Bayesian
framework.
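A small illustrative sketch of Bayesian estimation (an assumed example using the standard Beta prior for a coin's bias): the prior is updated to a posterior distribution, from which a credible interval can be read off.

```python
# Bayesian estimation with a conjugate prior: Beta prior + binomial data -> Beta posterior.
from scipy import stats

a_prior, b_prior = 2, 2          # prior belief about the parameter (Beta distribution)
heads, tails = 7, 3              # observed data (assumed counts)

a_post, b_post = a_prior + heads, b_prior + tails   # posterior is Beta(a + heads, b + tails)
posterior = stats.beta(a_post, b_post)
print(posterior.mean())          # a point summary of the posterior belief
print(posterior.interval(0.95))  # 95% credible interval expressing the remaining uncertainty
```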
Maximum Likelihood Estimation (MLE): - Maximum likelihood estimation is a
frequentist approach to parameter estimation. It aims to find the parameter values that
maximize the likelihood function, which is a measure of how well the observed data
fits the model. The likelihood function is constructed based on the assumed
probability distribution of the data. In MLE, the parameter estimates are chosen to
maximize the probability of observing the given data under the assumed model. This
estimation method provides point estimates of the parameters, which are single values
that are considered the best estimates given the observed data. MLE has desirable
properties, such as consistency and asymptotic efficiency. However, it does not
provide any information about the uncertainty associated with the parameter
estimates. Additionally, it assumes that the true parameter values are fixed and
unknown.
Likelihood function
The objective is to maximise the probability of observing the data points under the assumed probability distribution, i.e. their joint probability. The likelihood is formally stated as
L(X; θ) = P(X ∣ θ) = ∏(i = 1 to n) P(xi ∣ θ)
and the aim is to find the parameters θ that maximise this likelihood function.
Log of likelihood
Taking the product of all these conditional probabilities is a lot of work, so to make it easier we take the natural log on both sides:
ln L(X ∣ θ) = ln ∏(i = 1 to n) P(xi ∣ θ)
which becomes
ln L(X ∣ θ) = ∑(i = 1 to n) ln P(xi ∣ θ)
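A minimal sketch (assumed example: Gaussian data with a known scale, an arbitrary candidate grid) showing that maximising the summed log-likelihood over candidate means recovers approximately the sample mean:

```python
# MLE by maximizing the summed log-likelihood ln L = sum of ln P(xi | mu).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=2.0, size=1000)   # data from a Gaussian with unknown mean

def log_likelihood(mu):
    return np.sum(stats.norm.logpdf(x, loc=mu, scale=2.0))   # sum of ln P(xi | mu)

candidates = np.linspace(3, 7, 401)
mle = candidates[np.argmax([log_likelihood(m) for m in candidates])]
print(mle, x.mean())    # the grid-search MLE agrees closely with the sample mean
```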
There are four categories associated with the actual and predicted class of an
observation:
1. True Positive (TP): Both the actual and predicted values of the given observation are
positive.
2. False Positive (FP): The given observation is negative but the predicted value is
positive.
3. True Negative (TN): Both the actual and predicted values of the given observation
are negative.
4. False Negative (FN): The given observation is predicted to be negative, despite, in
fact, being positive.
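A minimal sketch (with assumed, made-up labels) showing how these four counts are read off a confusion matrix with scikit-learn:

```python
# Reading TP, FP, TN, FN from a binary confusion matrix.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP:", tp, "FP:", fp, "TN:", tn, "FN:", fn)   # the diagonal (TP, TN) holds the correct predictions
```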
The diagonal of the confusion matrix presents the correct predictions. Obviously, we
want the majority of our predictions to reside here.
FPs and FNs are classification errors. In statistics, FPs are referred to as Type I
errors, and FNs are referred to as Type II errors. In some cases, Type II errors are
dangerous and not tolerable.
For example, if our classifier is predicting if there is a fire in the house, a Type I error
is a false alarm.
On the other hand, a Type II error means that the house is burning down and the fire
department isn’t aware.
a). Accuracy
The most simple and straightforward classification metric is accuracy. Accuracy
measures the fraction of correctly classified observations. The formula is:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Thinking in terms of the confusion matrix, we sometimes say "the diagonal over all samples". Simply put, it is one minus the error rate. An accuracy of 90% means that out of 100 observations, 90 samples are classified correctly.
Although 90% accuracy sounds very promising at first, using the accuracy measure in
imbalanced datasets might be misleading.
Again recall the spam e-mail example. Consider out of every 100 e-mails received, 90
of them are spam. In this case, labeling each e-mail as spam would lead to 90%
accuracy with an empty inbox. Think about all the important e-mails we might be
missing.
Similarly, in the case of fraud detection, only a tiny fraction of transactions are
fraudulent. If our classifier were to mark every single case as not fraudulent, we’d still
have close to 100% accuracy.
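A small illustrative computation (with assumed, synthetic labels) of how a do-nothing classifier can still reach very high accuracy on imbalanced data:

```python
# Accuracy can mislead on imbalanced data: never flagging fraud still scores ~99%.
from sklearn.metrics import accuracy_score

y_true = [0] * 990 + [1] * 10      # 990 legitimate transactions, 10 fraudulent ones
y_pred = [0] * 1000                # a classifier that never flags fraud

print(accuracy_score(y_true, y_pred))   # 0.99, yet every fraud case is missed
```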
b). Precision and Recall
Precision and recall are two important metrics used to evaluate the performance of
classification models in machine learning. They are particularly useful for imbalanced
datasets where one class has significantly fewer instances than the other.
Precision is a measure of how many of the positive predictions made by a classifier
were correct. It is defined as the ratio of true positives (TP) to the total number of
positive predictions (TP + FP). In other words, precision measures the proportion of
true positives among all positive predictions.
Precision = TP / (TP + FP)
Recall, on the other hand, is a measure of how many of the actual positive instances
were correctly identified by the classifier. It is defined as the ratio of true positives (TP)
to the total number of actual positive instances (TP + FN). In other words, recall
measures the proportion of true positives among all actual positive instances.
Recall = TP / (TP + FN)
To understand precision and recall, consider the problem of detecting spam emails. A
classifier may label an email as spam (positive prediction) or not spam (negative
prediction). The actual label of the email can be either spam or not spam. If the email
is actually spam and the classifier correctly labels it as spam, then it is a true positive.
If the email is not spam but the classifier incorrectly labels it as spam, then it is a false
positive. If the email is actually spam but the classifier incorrectly labels it as not spam,
then it is a false negative. Finally, if the email is not spam and the classifier correctly
labels it as not spam, then it is a true negative.
In this scenario, precision measures the proportion of spam emails that were correctly
identified as spam by the classifier. A high precision indicates that the classifier is
correctly identifying most of the spam emails and is not labeling many legitimate emails
as spam. On the other hand, recall measures the proportion of all spam emails that were
correctly identified by the classifier. A high recall indicates that the classifier is
correctly identifying most of the spam emails, even if it is labeling some legitimate
emails as spam.
c). F-1 Score
The F-1 score is the harmonic mean of precision and recall:
F-1 = 2 * (Precision * Recall) / (Precision + Recall)
When the dataset labels are evenly distributed, accuracy gives meaningful results. But if the dataset is imbalanced, like the spam e-mail example, we should prefer the F-1 score.
Another point to keep in mind is that accuracy gives more weight to TPs and TNs, whereas the F-1 score also takes FNs and FPs into account.
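A minimal sketch (with assumed labels) computing precision, recall and the F-1 score with scikit-learn:

```python
# Precision, recall and the F-1 score (their harmonic mean) on a small binary example.
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

print("precision:", precision_score(y_true, y_pred))   # TP / (TP + FP) = 4/5
print("recall:   ", recall_score(y_true, y_pred))       # TP / (TP + FN) = 4/5
print("F-1:      ", f1_score(y_true, y_pred))           # harmonic mean of the two
```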
d). ROC Curve and AUC
A well-known method to visualize the classification performance is a ROC
curve (receiver operating characteristic curve). The plot shows the classifier’s
success for different threshold values.
In order to plot the ROC curve, we need to calculate the True Positive Rate (TPR) and the False Positive Rate (FPR), where:
TPR = TP / (TP + FN)
FPR = FP / (FP + TN)
To better understand what this graph stands for, first imagine setting the classifier threshold to 0. This will lead to labeling every observation as positive. Thus, we can conclude that lowering the threshold marks more items as positive. Having more observations classified as positive results in having more TPs and FPs.
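A minimal sketch (assumed example: the breast-cancer dataset and a logistic-regression scorer are illustrative choices) of computing the ROC curve and the AUC from predicted probabilities at many thresholds:

```python
# ROC curve: TPR and FPR at each threshold; AUC summarizes the curve in one number.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]            # predicted probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, scores)    # FPR and TPR at each threshold value
print("AUC:", roc_auc_score(y_test, scores))
```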
In a multiclass example with, say, five labels, a classifier that correctly classifies almost half of the samples is still much better than the random case, which would be correct only about one time in five.
Calculating precision and recall is a little trickier than in the binary-class case. We can't talk about the overall precision or recall of a classifier for multiclass classification. Instead, we calculate precision and recall for each class separately. Class by class, we can categorize each element into one of TP, TN, FP, or FN. Hence, for a given class k we can rewrite the precision formula as:
Precision(k) = TP(k) / (TP(k) + FP(k))
Similarly, the recall formula can be rewritten as:
Recall(k) = TP(k) / (TP(k) + FN(k))
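A minimal sketch (with assumed labels) of these per-class metrics: each class is treated one-vs-rest and precision and recall are reported class by class.

```python
# Per-class precision and recall for a multiclass problem.
from sklearn.metrics import precision_score, recall_score, classification_report

y_true = [0, 1, 2, 2, 1, 0, 1, 2, 0, 2]
y_pred = [0, 2, 2, 2, 1, 0, 1, 0, 0, 2]

print(precision_score(y_true, y_pred, average=None))   # one precision value per class
print(recall_score(y_true, y_pred, average=None))      # one recall value per class
print(classification_report(y_true, y_pred))           # both metrics, class by class
```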