Unit-3
• SYLLABUS:
• In our practical life, our decisions are affected by our prior knowledge or belief
about an event.
• The key reasons why Bayes' Theorem is important in machine learning are:
Introduction
• 1. Uncertainty Estimation
• Machine learning models often have uncertainty due to incomplete data, noise,
or model limitations.
• 2. Updating Beliefs
• A core strength of Bayes' Theorem is its ability to update beliefs based on new
evidence.
• In machine learning, models can start with some prior assumptions (priors)
and improve predictions as new data (likelihood) becomes available.
Introduction
• Here, new evidence (the data) updates our belief in the hypothesis (model
parameters), making the model adaptable and data-driven.
• Many machine learning tasks benefit from not just predicting outcomes but
understanding the likelihood of those predictions.
• 5. Bayesian Networks
• The hypothesis space represents all possible concepts the model can
consider, and the learning algorithm's task is to find the best hypothesis
that fits the training data.
• Bayes' Theorem states that P(A|B) = P(B|A) · P(A) / P(B),
• where A and B are conditionally related events and P(A|B) denotes the
probability of event A occurring when event B has already occurred.
• 4. Evidence (P(Data))
• Consider the working of an email spam filter. Historically, 80% of the emails are
not spam and 20% are spam. We know that 90% of spam emails contain the
word "offer," and 5% of non-spam emails contain the word "offer."
• Suppose, if we receive an email that contains the word "offer," what is the
probability that it is spam?
• Solution:
• P(Word 'offer' | Not Spam) = 0.05 (Probability that the word "offer" appears in
non-spam).
• We need to find P(Spam | Word 'offer'), the probability that the email is spam
given that it contains the word "offer".
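• Completing the calculation with the figures stated above (prior of 0.2 for spam, likelihoods of 0.9 and 0.05):
• P(Spam | 'offer') = [P('offer' | Spam) × P(Spam)] / [P('offer' | Spam) × P(Spam) + P('offer' | Not Spam) × P(Not Spam)]
• = (0.9 × 0.2) / (0.9 × 0.2 + 0.05 × 0.8) = 0.18 / 0.22 ≈ 0.818
• So the probability that the email is spam, given that it contains the word "offer", is about 81.8%.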
• Solution
• We need to find the probability that an item is actually defective given that it has
been classified as defective by the inspection system, denoted as P(D ∣ T)
Example 2:
• D = the event that an item is defective.
• If the model predicts rain, what is the probability that it will actually
rain?
Example 3:
Bayes’ Theorem : Maximum A Posteriori (MAP) hypothesis
• We will assume that P(h) is the initial probability of hypothesis ‘h’, called the
prior probability.
• P(T) is the prior probability that the training data will be observed.
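• Written out for reference, Bayes' rule for the posterior of a hypothesis and the resulting MAP criterion are:
• P(h|T) = P(T|h) · P(h) / P(T)
• hMAP = argmax over h of P(h|T) = argmax over h of P(T|h) · P(h)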
• From the above equation, we can deduce that P(h|T) increases as P(h)
and P(T|h) increase, and also as P(T) decreases.
• Naïve Bayes classifier is among the most successful known algorithms for
learning to classify text documents.
• In text classification, the features are typically the individual words (or tokens)
present in the document.
• A common way to represent text is the Bag of Words (BoW) model, in which
each document is represented by the words it contains, ignoring word order.
• For instance, in the sentence "I love machine learning," the words "I,"
"love," "machine," and "learning" are the features.
• This means that the presence of one word in the document does not affect
the presence of another word, which simplifies the computation of the
likelihood:
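• Written out, the conditional-independence assumption gives the factorized likelihood used by Naïve Bayes:
• P(w1, w2, …, wn | Class) = P(w1 | Class) × P(w2 | Class) × … × P(wn | Class)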
Working of Naïve Bayes classifier
• During training, the Naive Bayes classifier calculates the following:
• The prior probability P(Class): the probability of each class in the training
data.
• The likelihood P(Word | Class): the probability of each word occurring in
documents of that class.
• Classification
• For a new document, the Naive Bayes classifier computes the posterior
probability for each class using the trained priors and likelihoods.
• The class with the highest posterior probability is chosen as the predicted class.
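• A minimal sketch of this training-and-classification workflow using scikit-learn (the library choice and the toy documents are illustrative assumptions, not taken from the slides):

```python
# Naive Bayes text classification with a Bag of Words representation.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training documents and their class labels (illustrative only).
train_docs = ["free offer win money", "limited offer click now",
              "meeting schedule for monday", "project report attached"]
train_labels = ["spam", "spam", "not spam", "not spam"]

# Bag of Words: convert each document into a vector of word counts.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)

# Training estimates the priors P(Class) and the likelihoods P(word | Class).
clf = MultinomialNB()
clf.fit(X_train, train_labels)

# Classification: the class with the highest posterior probability is chosen.
X_new = vectorizer.transform(["special offer just for you"])
print(clf.predict(X_new))        # predicted class label
print(clf.predict_proba(X_new))  # posterior probability for each class
```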
Applications of Naïve Bayes classifier
• Spam filtering:
• Server-side email filters such as SpamAssassin and SpamBayes make use of
Bayesian spam filtering techniques.
Applications of Naïve Bayes classifier
• Hybrid Recommender System:
• It allocates user utterances into nice, nasty, and neutral classes, labelled
as +1, −1, and 0, respectively.
Advantages of Naïve Bayes classifier
1. Simplicity and Ease of Implementation: The algorithm is
straightforward to implement and easy to understand.
2. Speed and Efficiency: It is very fast in terms of both training and
prediction, making it suitable for real-time applications.
3. Scalability: Naïve Bayes scales linearly with the number of features and
data points.
• However, apart from the Naïve Bayes classifier, there are several other
algorithms for classification.
• In supervised learning, the labelled training data provide the basis for
learning.
• In the other case, that is, for price prediction, we are trying to predict a
numerical value and not a class.
CLASSIFICATION MODEL
• When we are trying to predict a categorical or nominal variable, the
problem is known as a classification problem.
1. Image classification.
2. Disease prediction.
4. Prediction of natural calamity such as earthquake, flood, etc.
2. Identification of Required Data: On the basis of the problem identified above, the
required data set that precisely represents the identified problem needs to be
identified/evaluated.
• 3. Data Pre-processing: This step ensures that the data is ready to be fed into
the machine learning algorithm.
CLASSIFICATION LEARNING STEPS
4. Definition of Training Data Set: Before starting the analysis, the user
should decide what kind of data set is to be used as a training set.
• Thus, a set of data inputs (X) and corresponding outputs (Y) is gathered
either from human experts or from experimental analysis.
5. Algorithm Selection: On the basis of various parameters, the best algorithm
for a given problem is chosen.
CLASSIFICATION LEARNING STEPS
6. Training: The learning algorithm identified in the previous step is run on the
gathered training set for further fine tuning.
7. Evaluation with the Test Data Set: The test data set is run on the trained
algorithm, and its performance is measured.
• Common classification algorithms include:
1. k-Nearest Neighbour (kNN)
2. Decision tree
3. Random forest
4. Support Vector Machine (SVM)
• As a part of the kNN algorithm, the unknown and unlabelled data which
comes for a prediction problem is judged on the basis of the training data
set elements which are similar to the unknown element.
k-Nearest Neighbour (kNN)
• Hence, the class label of the unknown element is assigned on the basis of
the class labels of the similar training data set elements.
1. What is the basis of this similarity or when can we say that two data elements
are similar ?
2. How many similar elements should be considered for deciding the class label
of each test data element ?
• 1. One common practice is to set k equal to the square root of the number of
training records.
• 2. Another practice is to test several k values on a variety of test data sets and
choose the one that delivers the best performance.
• 3. Choose a larger value of k, but apply a weighted voting process in which the
vote of close neighbours is considered more influential than the vote of distant
neighbours.
k-Nearest Neighbour (kNN) Algorithm
1. Input:
• Training data set: A labeled set of data points are used to "train" the
model.
• Test data set: A new, unlabeled set of data points for which we want to
predict the class label.
2. Steps:
• Find the K nearest training data points (i.e., the K training points
that have the smallest distance from the test point).
k-Nearest Neighbour (kNN) Algorithm
• If K = 1: assign the class label of the single nearest training point to the test
point; if K > 1, assign the class label that holds the majority among the K
nearest neighbours.
3. End:
Sepal Length | Sepal Width | Species
5.1 | 3.5 | Setosa
4.9 | 3.0 | Setosa
7.0 | 3.2 | Versicolor
6.4 | 3.2 | Versicolor
5.5 | 2.3 | Versicolor
5.0 | 3.4 | ??
Example on KNN
• Solution :
1. Calculate the Euclidean distance between the test data point and each of the
training data points:
• Distance = √((x1 − x2)² + (y1 − y2)²)
• where x1, y1 represent the Sepal Length and Sepal Width of the test point, and
x2, y2 represent the Sepal Length and Sepal Width of the training data points.
• Let's assume K = 3. The three nearest neighbors to the test point (5.0, 3.4) are:
(5.1, 3.5) Setosa at distance ≈ 0.14, (4.9, 3.0) Setosa at ≈ 0.41, and (5.5, 2.3)
Versicolor at ≈ 1.21.
4. Majority voting:
• Out of the three nearest neighbors, two belong to the species Setosa and one
belongs to Versicolor.
• Therefore, based on majority voting, we assign the test data point to the Setosa
class.
Example on KNN
• Thus, the test data point with Sepal Length = 5.0 and Sepal Width = 3.4 would
be classified as Setosa.
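• A minimal sketch reproducing the worked example above with scikit-learn (the library choice is an assumption; the data comes from the table given earlier):

```python
# kNN classification of the unknown sample from the worked example above.
from sklearn.neighbors import KNeighborsClassifier

# Training data: [Sepal Length, Sepal Width] with the species label.
X_train = [[5.1, 3.5], [4.9, 3.0], [7.0, 3.2], [6.4, 3.2], [5.5, 2.3]]
y_train = ["Setosa", "Setosa", "Versicolor", "Versicolor", "Versicolor"]

# K = 3 nearest neighbours; Euclidean distance is the default metric.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# Classify the unknown point (5.0, 3.4) by majority vote of its 3 neighbours.
print(knn.predict([[5.0, 3.4]]))  # expected output: ['Setosa']
```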
• Unlike other algorithms, kNN does not build a model in the training phase.
Instead, it stores the entire training dataset and performs computations (such as
distance calculation) only when a prediction is requested.
• The liking pattern may be revealed from past purchases or browsing history
and the similar items are identified using the kNN algorithm.
• In a Decision tree, there are two nodes, which are the Decision Node
and Leaf Node.
Decision Tree
• Decision nodes are used to make decisions and have multiple branches, whereas
leaf nodes represent the outcomes of those decisions and do not branch further,
which gives the model its tree-like structure.
• The ID3 algorithm is specifically designed for building decision trees from a
given dataset.
• Its primary objective is to construct a tree that best explains the relationship
between attributes in the data and their corresponding class labels.
• The following steps are followed during the working of decision tree :
Decision Tree
1. Selecting the Best Attribute
❑ ID3 employs the concept of entropy and information gain to
determine the attribute that best separates the data.
❑ Entropy measures the impurity or randomness in the dataset.
❑ The algorithm calculates the entropy of each attribute and selects
the one that results in the most significant information gain when
used for splitting the data.
Decision Tree
2. Creating Tree Nodes
• The chosen attribute is used to split the dataset into subsets based on its distinct
values.
• For each subset, ID3 recurses to find the next best attribute to further partition
the data, forming branches and new nodes accordingly.
3. Stopping Criteria
• The recursion continues until one of the stopping criteria is met, such as when all
instances in a branch belong to the same class or when all attributes have been
used for splitting.
4. Handling Missing Values
• ID3 can handle missing attribute values by employing various strategies like
mean/mode substitution for the attribute or using majority class values.
Decision Tree
5. Tree Pruning: Pruning removes branches that contribute little to classification
accuracy, which helps to reduce overfitting.
1. Entropy
2. Choose the Best Attribute: The attribute that provides the most
information gain is chosen for splitting.
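• For reference, the standard ID3 definitions of entropy and information gain (consistent with the description above) are:
• Entropy(S) = − Σ pi · log2(pi), summed over the classes i, where pi is the proportion of examples in S belonging to class i.
• Gain(S, A) = Entropy(S) − Σ (|Sv| / |S|) · Entropy(Sv), summed over the values v of attribute A, where Sv is the subset of S for which A = v.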
Advantages of using Decision tree
1. Easy to Understand and Interpret: Decision trees are easy to visualize and
understand. They can be easily explained to non-technical stakeholders.
2. Requires Little Data Preparation: Decision trees do not require normalization
of data or scaling of variables.
3. Handles Both Numerical and Categorical Data: Decision trees can handle both
types of data.
4. Can work well both with small and large training data sets.
Weaknesses of decision tree
1. Decision tree models are often biased towards features having a larger
number of possible values, i.e. levels.
2. This model gets overfitted or underfitted quite easily.
3. Decision trees are prone to errors in classification problems with many
classes and relatively small number of training examples.
4. A decision tree can be computationally expensive to train.
5. Large trees are complex to understand.
Random forest model
• Random forest is an ensemble classifier, i.e. a combining classifier that
uses and combines many decision tree classifiers.
2. It predicts output with high accuracy, and it runs efficiently even for large
datasets.
2. Use the best split principle on these ‘m’ randomly selected features to
calculate the node ‘d’.
3. Keep splitting the nodes to child nodes till the tree is grown to
the maximum possible extent.
Random forest model
4. Final class assignment is done on the basis of the majority votes from
the ‘n’ trees.
• Out-of-bag (OOB) error in random forest
• In random forests, as we have seen, each tree is constructed using a
different bootstrap sample from the original data.
• The samples left out of the bootstrap and not used in the construction of
the i-th tree can be used to measure the performance of the model.
• At the end of the run, predictions for each such sample evaluated each
time are tallied, and the final prediction for that sample is obtained by
taking a vote.
Random forest model
• The total error rate of predictions for such samples is termed the out-
of-bag (OOB) error rate.
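• A minimal sketch of a random forest with the out-of-bag estimate in scikit-learn (the library, dataset, and parameter values are illustrative assumptions):

```python
# Random forest classifier with out-of-bag (OOB) error estimation.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# Each of the n trees is grown on a bootstrap sample, using a random
# subset of features ('sqrt' of the total) at every split.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                            bootstrap=True, oob_score=True, random_state=42)
rf.fit(X, y)

# Each sample is predicted only by the trees that did not see it during
# training; the tallied votes give the OOB accuracy and hence the OOB error.
print("OOB accuracy  :", rf.oob_score_)
print("OOB error rate:", 1 - rf.oob_score_)
```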
• New examples (i.e. new instances) are then mapped into that same space,
and the class of each new instance is predicted based on which side of the
separating hyperplane it falls.
• In the overall training process, the SVM algorithm analyses input data
and identifies a surface in the multi-dimensional feature space called the
hyperplane.
• There may be many possible hyperplanes, and one of the challenges with
the SVM model is to find the optimal hyperplane.
Support vector machines
• Support Vectors in the SVM
• Support vectors are the data points (representing the classes) that lie closest to
the identified hyperplane; they are the critical elements of the data set.
• For a three-dimensional feature space (data set having three features and a
class variable), hyperplane is a two-dimensional subspace or a simple plane.
Support vector machines
• Mathematically, in a two-dimensional space, a hyperplane can be defined by
the equation:
• w1·x1 + w2·x2 + b = 0, which is nothing but an equation of a straight line.
• For an N-dimensional space, a hyperplane can be defined by the equation:
• w·x + b = 0, OR equivalently w1·x1 + w2·x2 + … + wN·xN + b = 0.
• The distance between the hyperplane and the nearest data points is known as
the margin.
Identifying the correct hyperplane in SVM
• Let us examine a few examples to identify
which hyperplanes will result in the best
classification.
• Example 1 : As depicted in Figure, we have
three hyperplanes: A, B, and C.
• Now, we need to identify the correct hyperplane
which better segregates the two classes
represented by the triangles and circles.
• hyperplane ‘A’ has performed this task quite
well.
Identifying the correct hyperplane in SVM
• Example 2 : As depicted in Figure (a) and (b), we
have three hyperplanes: A, B, and C.
• This is to ensure that all the data instances that belong to one class
fall above one hyperplane and all the data instances belonging to
the other class fall below the other hyperplane.
Identifying the MMH for non-linearly separable data
• For identifying MMH in non-linearly
separable data we have to use a slack
variable ξ, which provides some soft
margin for data instances in one class
that fall on the wrong side of the
hyperplane as shown in Figure.
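• For reference, in the standard soft-margin formulation (C is the penalty parameter, not introduced in the slides), the SVM minimizes ½·‖w‖² + C·Σ ξi subject to yi·(w·xi + b) ≥ 1 − ξi and ξi ≥ 0, so instances on the wrong side of the margin are penalized in proportion to their slack.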
• SVM has another technique, called the kernel trick, to deal with non-linearly
separable data, as shown in Figure.
Identifying the MMH for non-linearly separable data
• In the process, the kernel trick applies mathematical functions that convert
linearly non-separable data into linearly separable data in a higher-dimensional
space. These functions are called kernels.
• Some of the common kernel functions for transforming from a lower dimension
‘i’ to a higher dimension ‘j’ used by different SVM implementations are as
follows:
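• The slide's list is not reproduced above; for reference, the kernel functions most commonly provided by SVM implementations (with d, σ, κ and c denoting the usual kernel parameters) are:
• Linear kernel: K(xi, xj) = xi · xj
• Polynomial kernel: K(xi, xj) = (xi · xj + 1)^d
• Radial basis function (RBF / Gaussian) kernel: K(xi, xj) = exp(−‖xi − xj‖² / (2σ²))
• Sigmoid kernel: K(xi, xj) = tanh(κ·(xi · xj) + c)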
Strengths and weaknesses of SVM
• Strengths:
1. SVM can be used for both classification and regression.
• Weaknesses:
1. The SVM model is very complex – almost like a black box when it
deals with a high-dimensional data set. Hence, it is very difficult
and close to impossible to understand the model in such cases.
2. It is slow for a large dataset, i.e. a data set with either a large
number of features or a large number of instances.
Application of SVM
1. SVM is most effective when it is used for binary classification,
i.e. for solving a machine learning problem with two classes.
forecast, etc.
• The most popular and simplest algorithm is simple linear regression. This
model roots from the statistical concept of fitting a straight line and the
least squares method.
• EXAMPLE OF REGRESSION
1. Simple linear regression
2. Multiple linear regression
3. Polynomial regression
5. Logistic regression
1. Simple Linear Regression
• Recall, straight lines can be defined in slope-intercept form as Y = a + bX,
• where ‘a’ and ‘b’ are the intercept and slope of the straight line, respectively.
• Find the slope of the graph where the lower point on the line is
represented as (−3, −2) and the higher point on the line is
represented as (2, 2).
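• Solution (completing the arithmetic with the two points given above): Slope = Rise/Run = (2 − (−2)) / (2 − (−3)) = 4/5 = 0.8.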
1. Slope of the simple linear regression model
• There can be two types of slopes in a linear regression model: positive
slope and negative slope.
• Different types of regression lines based on the type of slope include :
1. Linear positive slope
2. Curve linear positive slope
3. Linear negative slope
4. Curve linear negative slope
Types of regression lines
1. Linear positive slope
Slope = Rise/Run
= (Y2 − Y1) / (X2 − X1)
• Slope = Rise/Run
= (Y2 − Y1 ) / (X2 − X1 )
• Slope for a variable (X) may vary between two graphs, but it will always be negative;
hence, the above graphs are called graphs with curve linear negative slope.
Types of regression lines
• No relationship graph
• A random sample of 15 students in that class was selected, and the data is
given below:
• We need to predict the value of Y for any given X.
Example of Simple Linear Regression:
• A scatter plot is shown in Figure to explore the relationship between the
independent variable (internal marks) mapped to X-axis and dependent
variable (external marks) mapped to Y-axis.
• If we know the values of ‘a’ and ‘b’, then it is easy to predict the value of Y
for any given X.
• The corresponding value of ‘a’ calculated using the above value of ‘b’ is :
Example of Simple Linear Regression:
Example of Simple Linear Regression:
Xi (Week) | Yi (Sales in Thousands)
1 | 1.2
2 | 1.8
3 | 2.6
4 | 3.2
5 | 3.8
Example of Simple Linear Regression:
• STEPS:
• The simple linear regression equation is Y = a0 + a1*x + e,
• where a0 is the intercept, a1 is the slope of the regression line, and e is the
error term.
Example of Simple Linear Regression:
• How to calculate a0 and a1 :
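• The slide's derivation is not reproduced above; as a reference, the standard least-squares estimates are a1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² and a0 = ȳ − a1·x̄. A minimal Python sketch applying them to the weekly-sales data in the table above (the week-6 prediction is an illustrative assumption):

```python
# Least-squares fit of Y = a0 + a1*x for the week/sales data shown above.
x = [1, 2, 3, 4, 5]
y = [1.2, 1.8, 2.6, 3.2, 3.8]

n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n

# a1 = sum((xi - x_mean) * (yi - y_mean)) / sum((xi - x_mean) ** 2)
a1 = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) \
     / sum((xi - x_mean) ** 2 for xi in x)
a0 = y_mean - a1 * x_mean  # intercept

print(f"Y = {a0:.2f} + {a1:.2f} * x")              # gives Y = 0.54 + 0.66 * x
print("Predicted sales for week 6:", a0 + a1 * 6)  # illustrative prediction
```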
• From the real estate price prediction problem involving dependent variable
(property price) and independent variable (Area of property).
• Multiple regression for estimating the equation when there are ‘n’ predictor
variables is as follows: Y = a0 + a1·X1 + a2·X2 + … + an·Xn.
• While finding the best fit line, we can fit either a polynomial or
curvilinear regression.
2. Multiple Linear Regression
• Consider an example where the five weeks’ sales data (in thousands) is given as
shown in the table. Apply the Multiple Linear Regression technique to predict
the weekly sales for X1 = 6 and X2 = 9.
X1 (Product 1 Sales) | X2 (Product 2 Sales) | Weekly Sales
1 | 4 | 1
2 | 5 | 6
3 | 8 | 8
4 | 2 | 12
2. Multiple Linear Regression
• The Multiple Linear Regression model for two variables x1 and x2 is given as
follows: y = a0 + a1·x1 + a2·x2.
• Apply multiple Linear Regression for the values given in the table, where
weekly sales along with sales products x1 and x2 are provided.
2. Multiple Linear Regression
• Use the matrix approach for finding the Multiple Linear Regression coefficients:
construct a design matrix with a column of 1’s for the intercept.
• Hence, the constructed model equation predicts the y value given the variables
x1 and x2; for X1 = 6 and X2 = 9, the predicted weekly sales = 22.039.
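• A minimal sketch of the matrix (normal-equations) approach in Python using numpy (the library choice is an assumption; the numerical result depends on the exact rows used in the slides, so this is shown only to illustrate the procedure):

```python
# Matrix approach to multiple linear regression: y = a0 + a1*x1 + a2*x2.
import numpy as np

# Product sales (x1, x2) and weekly sales (y) from the table above.
x1 = np.array([1, 2, 3, 4])
x2 = np.array([4, 5, 8, 2])
y = np.array([1, 6, 8, 12])

# Design matrix with a column of 1's for the intercept.
X = np.column_stack([np.ones_like(x1), x1, x2])

# Least-squares solution, equivalent to solving (X^T X) a = X^T y.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
a0, a1, a2 = coef
print(f"y = {a0:.3f} + {a1:.3f}*x1 + {a2:.3f}*x2")

# Predict weekly sales for x1 = 6, x2 = 9.
print("Prediction:", a0 + a1 * 6 + a2 * 9)
```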
Main Problems in Regression Analysis
• In multiple regression, there are two primary problems: multicollinearity
and heteroskedasticity.
• 1. Multicollinearity
• Multicollinearity occurs when two or more independent variables are highly
correlated with each other. This violates one of the assumptions of linear
regression and can lead to inefficient estimates, potentially making hypothesis
tests invalid.
• 2. Heteroskedasticity
• Heteroskedasticity occurs when the variance of the error terms is not constant
across observations. Possible remedies include:
• Transform the dependent variable (e.g., taking the log or square root).
• Try weighted least squares regression, which gives less weight to observations
with high variance.
Improving Accuracy of the Linear Regression Model
• To improve the accuracy of a linear regression model, consider the
following strategies:
• Feature Engineering:
• Include Relevant Features: Ensure that all important and relevant
variables are included in the model.
• 3. Handle Outliers:
• Detect and address outliers in the dataset, as they can disproportionately influence the
regression coefficients.
• 4. Address Multicollinearity:
• If independent variables are highly correlated, use techniques like Variance Inflation Factor
(VIF) to detect multicollinearity and drop or combine correlated features.
Improving Accuracy of the Linear Regression Model
• 5. Transform Variables:
• Use transformations (e.g., log, square root) to linearize relationships between features and
the target variable or stabilize variance.
• 6. Regularization:
• Apply Lasso (L1) or Ridge (L2) regression to penalize overly complex models and reduce the
likelihood of overfitting.
• 7. Cross-Validation:
• Use k-fold cross-validation to evaluate model performance on unseen data and ensure
robustness.
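• A minimal sketch combining points 6 and 7 with scikit-learn (the library, the synthetic data, and the alpha value are illustrative assumptions):

```python
# Ridge (L2) regularization evaluated with k-fold cross-validation.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression data standing in for a real dataset.
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# alpha controls the strength of the L2 penalty (larger = stronger shrinkage).
model = Ridge(alpha=1.0)

# 5-fold cross-validation: the average R^2 on held-out folds estimates
# how well the regularized model generalizes to unseen data.
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print("Mean cross-validated R^2:", scores.mean())
```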