ML 2nd Unit
SUB: ML
FACULTY: S. DURGADEVI
Unit II
Regression and Classification Models
1) Linear Regression
2) Multivariate Regression
3) Linear Classification
4) Logistic Regression
5) Principal Component Analysis
6) Linear Discriminant Analysis
7) Multiclass Classification
8) Naïve-Bayes
9) Bayesian Networks
10) Support Vector Machines
Unit II
Regression and Classification Models
Regression and classification are two fundamental types of supervised machine
learning models used in predictive analytics. They serve different purposes and
are applied in various domains depending on the nature of the data and the
problem at hand.
1. Regression Models:
Regression models are used when the target variable (the variable you're trying
to predict) is continuous. These models predict a continuous value, such as a
price, temperature, or sales volume. Common types of regression models include:
Linear Regression: This is one of the simplest regression algorithms where the
relationship between the input variables and the target variable is assumed to be
linear. It tries to fit a straight line to the data.
Polynomial Regression: In cases where the relationship between the variables
isn't linear, polynomial regression fits a polynomial function to the data.
Ridge Regression, Lasso Regression, Elastic Net: These are regularization
techniques used to prevent overfitting in regression models.
Support Vector Regression (SVR): It's an extension of support vector machines
for regression tasks.
Decision Tree Regression: Decision trees can also be used for regression tasks,
where the target variable is predicted by learning simple decision rules inferred
from the data features.
Example: Suppose we want to do weather forecasting; for this, we will use a regression algorithm. In weather prediction, the model is trained on past data, and once training is completed, it can easily predict the temperature for future days.
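As an illustration, the sketch below fits a simple linear regression model with scikit-learn; the experience/salary numbers are made up purely for demonstration and are not from the text.

```python
# A minimal linear regression sketch using scikit-learn (illustrative data).
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data: years of experience (x) vs. salary in lakhs (y).
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([3.0, 4.5, 6.1, 7.4, 9.0])

model = LinearRegression()
model.fit(X, y)                        # learns slope (a1) and intercept (a0)

print("slope a1:", model.coef_[0])
print("intercept a0:", model.intercept_)
print("prediction for x = 6:", model.predict([[6]])[0])
```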
2. Classification Models:
Classification models are used when the target variable is categorical, i.e., it
belongs to a finite set of classes. These models predict the class labels for new
instances. Common types of classification models include:
Logistic Regression: Despite its name, logistic regression is a classification
algorithm used to model the probability of a binary outcome. It's commonly used
for binary classification problems.
Decision Trees: Decision trees can be used for both regression and classification
tasks. In classification, they split the data based on features to create distinct
classes.
Random Forest: Random forest is an ensemble learning method that builds
multiple decision trees and merges their predictions to improve accuracy and
prevent overfitting.
Support Vector Machines (SVM): SVMs can be used for both regression and
classification tasks. In classification, they try to find the hyperplane that best
separates the classes.
K-Nearest Neighbors (KNN): KNN is a simple algorithm that stores all
available cases and classifies new cases based on a similarity measure (e.g.,
distance functions).
Example: The best example to understand the Classification problem is Email
Spam Detection. The model is trained on the basis of millions of emails on
different parameters, and whenever it receives a new email, it identifies whether
the email is spam or not. If the email is spam, then it is moved to the Spam folder.
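As a rough illustration of this idea, the sketch below trains a toy spam classifier on a tiny bag-of-words corpus; the example emails and labels are invented here and only stand in for a real training set.

```python
# A toy spam-vs-not-spam classification sketch (tiny illustrative corpus).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = ["win a free prize now", "meeting agenda for monday",
          "free lottery winner claim prize", "project report attached"]
labels = ["spam", "not spam", "spam", "not spam"]

vec = CountVectorizer()
X = vec.fit_transform(emails)          # bag-of-words feature matrix

clf = MultinomialNB()
clf.fit(X, labels)

# Classify a new, unseen email.
print(clf.predict(vec.transform(["claim your free prize"])))  # likely 'spam'
```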
Difference between Regression and Classification
Regression Algorithm: The task of the regression algorithm is to map the input value (x) with the continuous output variable (y).
Classification Algorithm: The task of the classification algorithm is to map the input value (x) with the discrete output variable (y).
Cost Function:
For Linear Regression, we use the Mean Squared Error (MSE) cost function, which is the average of the squared differences between the actual and predicted values:
MSE = (1/N) Σ (Yi − (a1xi + a0))²
Where,
N = Total number of observations
Yi = Actual value
(a1xi + a0) = Predicted value
Residuals:
The distance between the actual value and the predicted value is called the residual. If the observed points are far from the regression line, the residuals will be high, and so the cost function will be high. If the scatter points are close to the regression line, the residuals will be small and hence the cost function will be small.
Gradient Descent:
• Gradient descent is used to minimize the MSE by calculating the gradient
of the cost function.
• A regression model uses gradient descent to update the coefficients of the
line by reducing the cost function.
• This is done by randomly selecting initial coefficient values and then iteratively updating them until the cost function reaches its minimum (a minimal sketch follows below).
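A minimal sketch of this idea, assuming the simple line y = a1x + a0 and the MSE cost defined above; the data values and learning rate are invented for illustration.

```python
# Gradient descent sketch for simple linear regression (minimising MSE).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 4.5, 6.1, 7.4, 9.0])

a1, a0 = 0.0, 0.0          # start from arbitrary coefficient values
lr = 0.01                  # learning rate
N = len(x)

for _ in range(5000):
    y_pred = a1 * x + a0
    error = y_pred - y
    # Gradients of the MSE cost with respect to a1 and a0.
    grad_a1 = (2.0 / N) * np.sum(error * x)
    grad_a0 = (2.0 / N) * np.sum(error)
    a1 -= lr * grad_a1
    a0 -= lr * grad_a0

print("a1 (slope):", a1, "a0 (intercept):", a0)
```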
Model Performance:
The goodness of fit determines how well the regression line fits the set of observations. The process of finding the best model out of various candidate models is called optimization. It can be achieved by the below method:
1. R-squared method:
❖ R-squared is a statistical method that determines the goodness of fit.
❖ It measures the strength of the relationship between the dependent and
independent variables on a scale of 0-100%.
❖ A high value of R-squared indicates a small difference between the predicted values and the actual values and hence represents a good model.
❖ It is also called the coefficient of determination, or the coefficient of multiple determination for multiple regression.
It can be calculated from the below formula:
R-squared = Explained variation / Total variation = 1 − (Sum of squared residuals / Total sum of squares)
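As a small illustration of the formula, the sketch below computes R-squared directly from the residual and total sums of squares; the actual and predicted values here are made up for demonstration.

```python
# Computing R-squared from residual and total sums of squares (illustrative values).
import numpy as np

y_actual = np.array([3.0, 4.5, 6.1, 7.4, 9.0])
y_pred   = np.array([3.1, 4.6, 6.0, 7.5, 8.8])

ss_res = np.sum((y_actual - y_pred) ** 2)               # residual sum of squares
ss_tot = np.sum((y_actual - np.mean(y_actual)) ** 2)    # total sum of squares

r_squared = 1 - ss_res / ss_tot
print("R-squared:", r_squared)     # close to 1 => good fit
```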
3. AUC-ROC curve:
The ROC curve plots the true positive rate against the false positive rate at different classification thresholds, and the Area Under the Curve (AUC) measures how well a classification model separates the classes.
4. Logistic Regression
Logistic Regression is a classification algorithm that predicts the probability of a categorical (usually binary) outcome. The Logistic Regression equation can be obtained from the Linear Regression equation. The mathematical steps to get the Logistic Regression equation are given below:
o We know the equation of the straight line can be written as:
y = b0 + b1x1 + b2x2 + ... + bnxn
o In Logistic Regression y can be between 0 and 1 only, so for this let's divide the above equation by (1 − y):
y / (1 − y); which is 0 for y = 0 and infinity for y = 1
o But we need a range between −infinity and +infinity, so taking the logarithm of the equation gives the Logistic Regression equation:
log[y / (1 − y)] = b0 + b1x1 + b2x2 + ... + bnxn
On the basis of the categories, Logistic Regression can be classified into three types:
o Binomial: the dependent variable can have only two possible types, such as 0 or 1, Pass or Fail, etc.
o Multinomial: the dependent variable can have three or more possible unordered types, such as "cat", "dog", or "sheep".
o Ordinal: the dependent variable can have three or more possible ordered types, such as "low", "medium", or "high".
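As an illustration of the equation above, the sketch below applies the sigmoid (logistic) function to a linear combination b0 + b1x; the coefficients are hypothetical, not fitted from any dataset.

```python
# Sketch of the logistic (sigmoid) transformation that maps the linear
# combination b0 + b1*x to a probability between 0 and 1.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

b0, b1 = -4.0, 1.5                 # hypothetical coefficients
x = np.array([1.0, 2.0, 3.0, 4.0])

probability = sigmoid(b0 + b1 * x)             # values strictly between 0 and 1
predicted_class = (probability >= 0.5).astype(int)   # threshold at 0.5
print(probability, predicted_class)
```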
5. Principal Component Analysis (PCA)
Principal Component Analysis (PCA) is an unsupervised dimensionality reduction technique. Suppose you have a dataset with several features or dimensions. For instance,
let's say you have a dataset that contains information about houses, such as size
(in square feet), number of bedrooms, number of bathrooms, and price. Each
house in the dataset is represented as a point in a high-dimensional space, with
each feature being one dimension.
PCA aims to find a new set of dimensions, called principal components, that
capture the most significant variability in the data. These principal components
are linear combinations of the original features.
PCA Steps:
First, we need to standardize our dataset to ensure that each variable has
a mean of 0 and a standard deviation of 1.
Next, we compute the covariance matrix of the standardized data and find its eigenvalues and eigenvectors. For a square matrix A, if
AX = λX
for some scalar value λ, then λ is known as an eigenvalue of matrix A and X is known as the eigenvector of matrix A for the corresponding eigenvalue. It can also be written as:
(A − λI)X = 0
where I is the identity matrix of the same shape as matrix A. The above condition holds for a non-zero X only if (A − λI) is non-invertible (i.e., a singular matrix). That means,
det(A − λI) = 0
From the above equation, we can find the eigenvalues λ, and the corresponding eigenvector can then be found from (A − λI)X = 0. The eigenvectors associated with the largest eigenvalues become the principal components.
In that example, after performing PCA, you might find that the first principal
component captures most of the variability in the dataset. This component could
represent a combination of features that correlate strongly with the overall size or
value of the houses. Subsequent principal components might capture additional,
smaller sources of variance, such as specific characteristics like the number of
bedrooms versus the number of bathrooms.
PCA can be useful for various tasks, such as visualization, noise reduction,
feature extraction, and speeding up machine learning algorithms by reducing the
dimensionality of the data.
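A minimal PCA sketch on hypothetical house data (size, bedrooms, bathrooms, price), using scikit-learn's StandardScaler and PCA; the numbers below are invented purely for illustration.

```python
# PCA sketch: standardize, then project onto the top 2 principal components.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Hypothetical houses: [size (sq ft), bedrooms, bathrooms, price].
X = np.array([[1400, 3, 2, 250000],
              [1600, 3, 2, 290000],
              [ 900, 2, 1, 150000],
              [2100, 4, 3, 400000]], dtype=float)

X_std = StandardScaler().fit_transform(X)    # each feature: mean 0, std 1

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_std)         # project onto 2 principal components

print("explained variance ratio:", pca.explained_variance_ratio_)
print(X_reduced)
```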
Applications of PCA:
PCA is commonly applied for image compression, exploratory data visualization, noise filtering, and as a dimensionality-reduction preprocessing step before training other machine learning models.
6. Linear Discriminant Analysis (LDA)
Linear Discriminant Analysis (LDA) is a dimensionality reduction technique used for supervised classification problems. For example, suppose we have two classes and we need to separate them efficiently.
Classes can have multiple features. Using only a single feature to classify them
may result in some overlapping as shown in the below figure. So, we will keep
on increasing the number of features for proper classification.
Example:
Let's assume we have to classify two different classes having two sets of data points in a 2-dimensional plane, as shown in the below image:
It is impossible to draw a straight line in a 2-D plane that can separate these data points efficiently, but using Linear Discriminant Analysis we can reduce the 2-D plane to a 1-D line. Using this technique, we can also maximize the separability between multiple classes.
Let's consider an example where we have two classes in a 2-D plane having an X-Y axis, and we need to classify them efficiently. As we have already seen in the above example, LDA enables us to draw a straight line that can completely separate the two classes of data points. Here, LDA uses the X-Y axes to create a new axis, separating the classes with a straight line and projecting the data onto this new axis.
Hence, we can maximize the separation between these classes and reduce the 2-
D plane into 1-D.
To create a new axis, Linear Discriminant Analysis uses the following two criteria:
o Maximize the distance between the means of the two classes.
o Minimize the variation (scatter) within each individual class.
Using the above two conditions, LDA generates a new axis in such a way that it
can maximize the distance between the means of the two classes and minimizes
the variation within each class.
In other words, we can say that the new axis will increase the separation between
the data points of the two classes and plot them onto the new axis.
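A minimal sketch of this projection using scikit-learn's LinearDiscriminantAnalysis; the 2-D points and class labels below are invented for illustration.

```python
# LDA sketch: projecting two 2-D classes onto a single discriminant axis.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Hypothetical 2-D points belonging to two classes (0 and 1).
X = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 8]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

lda = LinearDiscriminantAnalysis(n_components=1)
X_1d = lda.fit_transform(X, y)      # 2-D data reduced to 1-D (the new axis)

print(X_1d.ravel())                 # projected values on the new axis
print(lda.predict([[4, 4]]))        # class label for a new point
```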
7. Multiclass Classification
In multiclass classification, the target variable (dependent variable) has more than two categories. For example, consider the Iris dataset, which contains three species of Iris plants: Virginica, Setosa, and Versicolor. Each species corresponds to a different class.
Common strategies for handling a multiclass problem include one-vs-rest, one-vs-one, and error-correcting output codes (multilabel classification, in which a single sample may belong to several classes at once, is a related but distinct setting).
Error-Correcting Output Codes: each class is represented by a binary code (a row of 0s and 1s) in a code book. At fitting time, one binary classifier per bit in the code book is fitted. At prediction time, the classifiers are used to project new points into the class space, and the class closest to the points is chosen.
A code size greater than 1 will require more classifiers than one-vs-the-rest. In this case, some classifiers will in theory correct for the mistakes made by other classifiers, hence the name "error-correcting". In practice, however, this may not happen, as classifier mistakes will typically be correlated. The error-correcting output codes have a similar effect to bagging. A minimal scikit-learn example is sketched below.
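The sketch below uses scikit-learn's OutputCodeClassifier on the Iris dataset; the choice of LogisticRegression as the base estimator and code_size=2 are illustrative assumptions, not prescriptions from the text.

```python
# Error-correcting output codes sketch on the Iris dataset (scikit-learn).
from sklearn.datasets import load_iris
from sklearn.multiclass import OutputCodeClassifier
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# code_size > 1 uses more binary classifiers than one-vs-the-rest.
ecoc = OutputCodeClassifier(LogisticRegression(max_iter=1000),
                            code_size=2, random_state=0)
ecoc.fit(X, y)

print(ecoc.predict(X[:5]))   # predicted species for the first five samples
```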
Performance Metrics:
Multiclass classifiers are commonly evaluated with metrics such as accuracy, the confusion matrix, and per-class precision, recall, and F1-score.
8. Naïve Bayes
The Naïve Bayes algorithm is a supervised learning algorithm based on Bayes' theorem and used for solving classification problems; it assumes that the features are independent of each other.
Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, and it is used to determine the probability of a hypothesis with prior knowledge. It depends on conditional probability.
o The formula for Bayes' theorem is given as:
P(A|B) = P(B|A) * P(A) / P(B)
Where,
P(A|B) is the Posterior probability: the probability of hypothesis A given the observed event B.
P(B|A) is the Likelihood probability: the probability of the evidence B given that hypothesis A is true.
P(A) is the Prior probability: the probability of the hypothesis before observing the evidence.
P(B) is the Marginal probability: the probability of the evidence.
Problem: If the weather is sunny, should the Player play or not?
Solution: To solve this, first consider the below dataset:
Outlook Play
0 Rainy Yes
1 Sunny Yes
2 Overcast Yes
3 Overcast Yes
4 Sunny No
5 Rainy Yes
6 Sunny Yes
7 Overcast Yes
8 Rainy No
9 Sunny No
10 Sunny Yes
11 Rainy No
12 Overcast Yes
13 Overcast Yes
Frequency table for the Weather conditions:
Weather    Yes   No
Overcast   5     0
Rainy      2     2
Sunny      3     2
Total      10    4
Likelihood table of the Weather conditions:
Weather    No            Yes
Overcast   0             5             5/14 = 0.35
Rainy      2             2             4/14 = 0.29
Sunny      2             3             5/14 = 0.35
All        4/14 = 0.29   10/14 = 0.71
Applying Bayes' theorem:
P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny)
P(Sunny|Yes) = 3/10 = 0.30
P(Sunny) = 0.35
P(Yes) = 0.71
So P(Yes|Sunny) = 0.30 * 0.71 / 0.35 = 0.60
P(No|Sunny) = P(Sunny|No) * P(No) / P(Sunny)
P(Sunny|No) = 2/4 = 0.50
P(No) = 0.29
P(Sunny) = 0.35
So P(No|Sunny) = 0.50 * 0.29 / 0.35 = 0.41
Since P(Yes|Sunny) > P(No|Sunny), the Player can play the game on a sunny day. The same computation is sketched in Python below.
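A short Python sketch reproducing the calculation above, using the counts taken directly from the frequency table.

```python
# Reproducing the 'play on a sunny day?' calculation from the tables above.
p_sunny     = 5 / 14        # P(Sunny)
p_yes, p_no = 10 / 14, 4 / 14
p_sunny_yes = 3 / 10        # P(Sunny | Yes)
p_sunny_no  = 2 / 4         # P(Sunny | No)

p_yes_sunny = p_sunny_yes * p_yes / p_sunny   # Bayes' theorem
p_no_sunny  = p_sunny_no  * p_no  / p_sunny

# ~0.60 vs ~0.40 (~0.41 when using the rounded values quoted in the text).
print(round(p_yes_sunny, 2), round(p_no_sunny, 2))
```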
9. Bayesian Networks
➢ A Bayesian belief network is a key computer technology for dealing with probabilistic events and for solving problems that involve uncertainty. We can define a Bayesian network as:
➢ "A Bayesian network is a probabilistic graphical model which represents a
set of variables and their conditional dependencies using a directed acyclic
graph."
➢ It is also called a Bayes network, belief network, decision network,
or Bayesian model.
➢ Bayesian networks are probabilistic, because these networks are built from
a probability distribution, and also use probability theory for prediction
and anomaly detection.
Bayesian networks can be used for building models from data and experts' opinions, and they consist of two parts:
o Directed Acyclic Graph
o Table of conditional probabilities
The generalized form of a Bayesian network that represents and solves decision problems under uncertain knowledge is known as an Influence diagram.
A Bayesian network has mainly two components:
o Causal Component
o Actual numbers
If we have variables x1, x2, x3, ....., xn, then the probabilities of the different combinations of x1, x2, x3, ..., xn are known as the Joint probability distribution.
P[x1, x2, x3, ....., xn] can be written in the following way in terms of the joint probability distribution:
= P[x1 | x2, x3, ....., xn] P[x2, x3, ....., xn]
= P[x1 | x2, x3, ....., xn] P[x2 | x3, ....., xn] .... P[xn−1 | xn] P[xn]
In general, for each variable Xi in a Bayesian network we can write:
P(Xi | Xi−1, ....., X1) = P(Xi | Parents(Xi))
Example: Harry has installed a new burglar alarm at his home. The alarm reliably responds to a burglary, but it also sometimes responds to minor earthquakes. Harry has two neighbours, David and Sophia, who have taken the responsibility to call Harry at work when they hear the alarm. We want to calculate the probability that the alarm has sounded, but there is neither a burglary nor an earthquake, and both David and Sophia have called Harry.
Solution:
o The Bayesian network for the above problem is given below. The network structure shows that Burglary and Earthquake are the parent nodes of the Alarm and directly affect the probability of the alarm going off, whereas David's and Sophia's calls depend only on the alarm probability.
o The network also encodes our assumptions: the neighbours do not directly perceive the burglary, do not notice minor earthquakes, and do not confer with each other before calling.
o The conditional distributions for each node are given as conditional
probabilities table or CPT.
o Each row in the CPT must sum to 1 because all the entries in the table represent an exhaustive set of cases for the variable.
o In a CPT, a Boolean variable with k Boolean parents contains 2^k probabilities. Hence, if there are two parents, the CPT will contain 4 probability values.
List of all events occurring in this network:
o Burglary (B)
o Earthquake(E)
o Alarm(A)
o David Calls(D)
o Sophia calls(S)
We can rewrite the above probability statement P[D, S, A, B, E] using the joint probability distribution:
P[D, S, A, B, E] = P[D | S, A, B, E] . P[S, A, B, E]
= P[D | S, A, B, E] . P[S | A, B, E] . P[A, B, E]
= P[D | A] . P[S | A] . P[A | B, E] . P[B] . P[E]
(since David's and Sophia's calls depend only on the Alarm, and Burglary and Earthquake are independent of each other).
The prior probabilities are P(B = True) = 0.002 and P(E = True) = 0.001, so P(B = False) = 0.998 and P(E = False) = 0.999, which is the probability that an earthquake has not occurred.
Conditional probability table for Alarm A: the probability of the alarm going off depends on Burglary and Earthquake; in particular, P(A = True | B = False, E = False) = 0.001.
The conditional probability that David will call depends on the probability of the Alarm; from its CPT, P(D = True | A = True) = 0.91.
The conditional probability that Sophia calls depends on its parent node "Alarm"; from its CPT, P(S = True | A = True) = 0.75.
Using the joint distribution, the probability that the alarm has sounded with neither a burglary nor an earthquake, and that both David and Sophia have called, is:
P(S, D, A, ¬B, ¬E) = P(S | A) * P(D | A) * P(A | ¬B, ¬E) * P(¬B) * P(¬E)
= 0.75 * 0.91 * 0.001 * 0.998 * 0.999
= 0.00068045.
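The same joint probability, reproduced as a short Python sketch using the CPT values quoted above.

```python
# Reproducing the joint probability P[D, S, A, not B, not E] from the CPT values.
p_not_b = 0.998        # P(Burglary = False)
p_not_e = 0.999        # P(Earthquake = False)
p_a     = 0.001        # P(Alarm = True | no burglary, no earthquake)
p_d     = 0.91         # P(David calls | Alarm)
p_s     = 0.75         # P(Sophia calls | Alarm)

joint = p_d * p_s * p_a * p_not_b * p_not_e
print(joint)           # ~0.00068045
```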
10. Support Vector Machine (SVM)
Support Vector Machine (SVM) is a supervised learning algorithm used for both classification and regression problems, though it is primarily used for classification.
• SVMs can be used for a variety of tasks, such as text classification, image classification, spam detection, face detection, and anomaly detection tasks.
• The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-dimensional space into classes so that we can easily put the new data point in the correct category in the future. This best decision boundary is called a hyperplane.
• SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed a Support Vector Machine.
Example:
SVM can be understood with the example that we have used in the KNN classifier. Suppose we see a strange cat that also has some features of dogs; if we want a model that can accurately identify whether it is a cat or a dog, such a model can be created by using the SVM algorithm. We will first train our model with lots of images of cats and dogs so that it can learn the different features of cats and dogs, and then we test it with this strange creature. Since the support vector machine creates a decision boundary between these two classes (cat and dog) and chooses the extreme cases (support vectors), it will see the extreme cases of cat and dog. On the basis of the support vectors, it will classify it as a cat. Consider the below diagram:
Types of SVM
o Linear SVM: Linear SVM is used for linearly separable data, which means that if a dataset can be classified into two classes by using a single straight line, then such data is termed linearly separable data, and the classifier used is called a Linear SVM classifier.
o Non-linear SVM: Non-linear SVM is used for non-linearly separable data, which means that if a dataset cannot be classified by using a straight line, then such data is termed non-linear data, and the classifier used is called a Non-linear SVM classifier.
Hyperplane:
The dimensions of the hyperplane depend on the number of features present in the dataset, which means that if there are 2 features (as shown in the image), then the hyperplane will be a straight line, and if there are 3 features, then the hyperplane will be a 2-dimensional plane.
We always create the hyperplane that has the maximum margin, i.e., the maximum distance between the hyperplane and the nearest data points of each class.
Support Vectors:
The data points or vectors that are closest to the hyperplane and which affect the position of the hyperplane are termed Support Vectors. Since these vectors support the hyperplane, they are called support vectors.
Linear SVM: for linearly separable data, the SVM algorithm finds the straight line (hyperplane) with the maximum margin that separates the two classes; the points closest to this line are the support vectors.
Non-Linear SVM:
If data is linearly arranged, then we can separate it by using a straight line, but for
non-linear data, we cannot draw a single straight line. Consider the below image:
So to separate these data points, we need to add one more dimension. For linear
data, we have used two dimensions x and y, so for non-linear data, we will add a
third dimension z. It can be calculated as:
z = x² + y²
By adding the third dimension, the sample space will become as below image:
So now, SVM will divide the datasets into classes in the following way. Consider
the below image:
Since we are in 3-D space, the decision boundary looks like a plane parallel to the x-axis. If we convert it back to 2-D space with z = 1, it becomes a circle of radius 1, as shown in the below image:
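Finally, a minimal SVM sketch with scikit-learn, using an RBF kernel on synthetic ring-shaped data that mimics the z = x² + y² idea above; the data is generated only for illustration.

```python
# SVM sketch: an RBF kernel handles data whose class depends on x^2 + y^2.
import numpy as np
from sklearn.svm import SVC

# Synthetic 2-D data: class 1 if the point lies inside a circle around the origin.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1.5).astype(int)

clf = SVC(kernel="rbf")        # non-linear decision boundary
clf.fit(X, y)

print("training accuracy:", clf.score(X, y))
print("number of support vectors:", len(clf.support_vectors_))
```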