
Module 2

Statistical Decision Theory, Bayesian Learning (ML, MAP, Bayes estimates, Conjugate
priors), Linear Regression, Ridge Regression, Lasso, Principal Component Analysis,
Partial Least Squares.
Statistical Decision Theory
Statistical decision theory: Decision theory is the science of making optimal decisions in the face of uncertainty. Statistical decision theory is concerned with making decisions in the presence of statistical knowledge (data), which sheds light on some of the uncertainties involved in the decision problem.

o Statistical decision theory may be defined as a body of methods that help the decision-maker select the best course of action from among several alternatives.
o The ideal of decision theory is to make choices rational by reducing them to a kind of routine calculation.
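As a small illustration of the idea (not from the notes), the sketch below chooses the action with the lowest expected loss under an assumed posterior over two states of nature; the probabilities and loss values are made up for the example.

```python
import numpy as np

# Hypothetical decision problem: two states of nature, two possible actions.
# posterior[s] is the (assumed) probability of state s given the observed data.
posterior = np.array([0.7, 0.3])

# loss[a, s] is the loss of taking action a when the true state is s (made up).
loss = np.array([[0.0, 10.0],
                 [5.0,  1.0]])

expected_loss = loss @ posterior              # expected loss of each action
best_action = int(np.argmin(expected_loss))   # action with the smallest expected loss
print(expected_loss, best_action)
```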

Linear Regression
Regression
Regression: The main goal of regression is to construct an efficient model that predicts the dependent attribute from a set of independent attribute variables. A regression problem is one in which the output variable is a real or continuous value, e.g. salary, weight, area, etc.

Types of Regression

1. Linear Regression
2. Polynomial Regression
3. Support Vector Regression
4. Decision Tree Regression
5. Random Forest Regression

Linear Regression
Linear Regression: Linear regression is a machine learning algorithm based on supervised learning. It performs a regression task. It is a statistical method used for predictive analysis. Linear regression makes predictions for continuous/real or numeric variables such as sales, salary, age, product price, etc.

o The linear regression algorithm models a linear relationship between a dependent variable (y) and one or more independent variables (x), hence the name linear regression.
o Since linear regression models a linear relationship, it finds how the value of the dependent variable changes according to the value of the independent variable.
o The linear regression model provides a sloped straight line representing the relationship between the variables.

Mathematically, we can represent a linear regression as:

y = a0 + a1x + ε

Here,

o y = dependent variable / labels of the data (target variable)
o x = independent variable / input training data (predictor variable)
o a0 = intercept of the line (gives an additional degree of freedom)
o a1 = linear regression coefficient (scale factor applied to each input value)
o ε = random error
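As a minimal sketch (not part of the notes), the code below estimates a1 and a0 by ordinary least squares with NumPy; the toy x and y arrays are assumed purely for illustration.

```python
import numpy as np

# Toy data, assumed for illustration: y is roughly 2x + 1 plus noise.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.1, 10.8])

# Closed-form least-squares estimates of the slope a1 and intercept a0.
a1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a0 = y.mean() - a1 * x.mean()

y_pred = a0 + a1 * x          # predictions from the fitted line
print(a0, a1)
```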

Linear Regression Line


Linear Regression Line: A straight line showing the relationship between the dependent and independent variables is called a regression line. A regression line can show two types of relationship:

o Positive Linear Relationship: If the dependent variable increases on the Y-axis as the independent variable increases on the X-axis, the relationship is termed a positive linear relationship.
o Negative Linear Relationship: If the dependent variable decreases on the Y-axis as the independent variable increases on the X-axis, the relationship is called a negative linear relationship.

Types of Linear Regression


Linear regression can be further divided into two types of algorithm:

o Simple Linear Regression: If a single independent variable is used to predict


the value of a numerical dependent variable, then such a Linear Regression
algorithm is called Simple Linear Regression.
o Multiple Linear Regression: If more than one independent variable is used to
predict the value of a numerical dependent variable, then such a Linear
Regression algorithm is called Multiple Linear Regression.
Finding the best fit line
When working with linear regression, our main goal is to find the best fit line, which means that the error between the predicted values and the actual values should be minimized. The best fit line has the least error.

Different values for the weights or coefficients of the line (a0, a1) give different regression lines, so we need to calculate the best values for a0 and a1 to find the best fit line. To do this, we use a cost function.

Cost function
o Different values for the weights or coefficients of the line (a0, a1) give different regression lines, and the cost function is used to estimate the values of the coefficients for the best fit line.
o The cost function optimizes the regression coefficients or weights. It measures how well a linear regression model is performing.
o We can use the cost function to find the accuracy of the mapping function, which maps the input variable to the output variable. This mapping function is also known as the hypothesis function.
o In seeking the best-fit regression line, the model aims to predict y such that the error between the predicted value and the true value is minimum. It is therefore very important to update the a0 and a1 values to reach the values that minimize the error between the predicted y value and the true y value.

Mean Squared Error (MSE): For linear regression, we use the Mean Squared Error (MSE) cost function, which is the average of the squared errors between the predicted values and the actual values. It can be written as:

MSE = (1/N) Σ (yi − (a1xi + a0))²

Where,

o N = total number of observations/data points
o yi = actual value
o (a1xi + a0) = predicted value

Residuals: The distance between an actual value and the corresponding predicted value is called a residual. If the observed points are far from the regression line, the residuals will be high and so the cost function will be high. If the scatter points are close to the regression line, the residuals will be small and hence so will the cost function.
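A short sketch (with assumed toy arrays) of computing the residuals and the MSE cost for one candidate pair (a0, a1):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.9, 5.1, 7.0, 9.2])     # actual values, toy data

a0, a1 = 1.0, 2.0                      # candidate intercept and slope
y_pred = a0 + a1 * x                   # predicted values
residuals = y - y_pred                 # actual minus predicted
mse = np.mean(residuals ** 2)          # average squared error (the cost)
print(residuals, mse)
```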

Gradient Descent
o Gradient descent is used to minimize the MSE by calculating the gradient of the cost function.
o A regression model uses gradient descent to update the coefficients of the line by reducing the cost function.
o It starts with randomly selected values of the coefficients and then iteratively updates these values to reach the minimum of the cost function.
o To update a0 and a1, we take gradients from the cost function. To find these gradients, we take partial derivatives of the cost function with respect to a0 and a1.
o Alpha is the learning rate, a hyper-parameter that you must specify. A smaller learning rate moves closer to the minimum but takes more time; a larger learning rate converges sooner but risks overshooting the minimum.
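A minimal gradient-descent sketch for the MSE cost above; the learning rate and iteration count below are illustrative choices, not values prescribed by the notes.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])   # toy data: y = 2x + 1

a0, a1 = 0.0, 0.0        # initial coefficients
alpha = 0.01             # learning rate (hyper-parameter)

for _ in range(5000):
    y_pred = a0 + a1 * x
    error = y_pred - y
    # Partial derivatives of the MSE with respect to a0 and a1.
    grad_a0 = 2.0 * np.mean(error)
    grad_a1 = 2.0 * np.mean(error * x)
    a0 -= alpha * grad_a0
    a1 -= alpha * grad_a1

print(a0, a1)   # approaches the least-squares solution (about 1 and 2)
```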

Advantages and Disadvantages of Linear Regression

Advantages:

o Linear regression performs exceptionally well for linearly separable data.
o It is easier to implement and interpret, and efficient to train.
o It handles over-fitting fairly well using dimensionality reduction techniques, regularization, and cross-validation.
o It allows extrapolation beyond a specific data set.

Disadvantages:

o It assumes linearity between the dependent and independent variables.
o It is often quite prone to noise and over-fitting.
o Linear regression is quite sensitive to outliers.
o It is prone to multicollinearity.
Model Performance
The goodness of fit determines how well the regression line fits the set of observations. The process of finding the best model out of various models is called optimization. It can be assessed by the method below.

R-squared method

o R-squared is a statistical measure that determines the goodness of fit.
o It measures the strength of the relationship between the dependent and independent variables on a scale of 0 to 100%.
o A high value of R-squared indicates a smaller difference between the predicted values and the actual values and hence represents a good model.
o It is also called the coefficient of determination, or the coefficient of multiple determination for multiple regression.
o It can be calculated from the formula below:

R² = 1 − (Residual sum of squares / Total sum of squares) = 1 − Σ(yi − ŷi)² / Σ(yi − ȳ)²
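A small sketch (toy arrays assumed) of computing R-squared from actual and predicted values:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.2, 6.9, 9.1])   # predictions from some fitted line

ss_res = np.sum((y_true - y_pred) ** 2)           # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
r_squared = 1.0 - ss_res / ss_tot
print(r_squared)
```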

Assumptions of Linear Regression


Below are some important assumptions of linear regression. These are formal checks to perform while building a linear regression model, which ensure that we get the best possible result from the given dataset.

o Linear relationship between the features and target: Linear regression assumes a linear relationship between the dependent and independent variables.
o Small or no multicollinearity between the features: Multicollinearity means high correlation between the independent variables. Due to multicollinearity, it may be difficult to find the true relationship between the predictors and the target variable; in other words, it is difficult to determine which predictor variable is affecting the target variable and which is not. So the model assumes either little or no multicollinearity between the features or independent variables.
o Homoscedasticity assumption: Homoscedasticity is the situation in which the error term has the same variance for all values of the independent variables. With homoscedasticity, there should be no clear pattern in the distribution of points in the scatter plot.
o Normal distribution of error terms: Linear regression assumes that the error terms follow a normal distribution. If the error terms are not normally distributed, confidence intervals become either too wide or too narrow, which may cause difficulties in estimating the coefficients. This can be checked using a q-q plot: if the plot shows a straight line without large deviations, the errors are normally distributed.
o No autocorrelation: The linear regression model assumes no autocorrelation in the error terms. If there is any correlation in the error terms, it will drastically reduce the accuracy of the model. Autocorrelation usually occurs when there is a dependency between residual errors.
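A brief sketch of two of these checks (residual normality via a q-q plot, and multicollinearity via a correlation matrix of the features); the residuals and feature matrix below are placeholders for values taken from a fitted model.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Placeholder inputs: in practice, use the residuals of your fitted model
# and your actual feature matrix X.
rng = np.random.default_rng(0)
residuals = rng.normal(size=100)
X = rng.normal(size=(100, 3))

# Q-Q plot: points lying close to the straight line suggest normally
# distributed error terms.
stats.probplot(residuals, dist="norm", plot=plt)
plt.show()

# Correlation matrix of the features: off-diagonal values near +1 or -1
# indicate multicollinearity between independent variables.
print(np.corrcoef(X, rowvar=False))
```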

Linear Regression Use Cases


o Sales forecasting.
o Risk analysis.
o Housing applications, to predict prices and other factors.
o Finance applications, to predict stock prices, evaluate investments, etc.

Ridge Regression & Lasso Regression

Regularization

Regularization: Regularization is a technique to prevent the model from over-fitting by adding extra information to it. It mainly regularizes or shrinks the coefficients of the features toward zero.

o In simple words, in the regularization technique we reduce the magnitude of the coefficients while keeping the same number of features. Hence, it maintains accuracy as well as generalizes the model.
o Regularization is used to reduce errors by fitting the function appropriately on the given training set and to avoid over-fitting.
How does Regularization Work
Regularization works by adding a penalty or complexity term to the complex model.
Let's consider the simple linear regression equation:

y = β0 + β1x1 + β2x2 + β3x3 + ⋯ + βnxn + b

In the above equation, y represents the value to be predicted,

x1, x2, …, xn are the features for y,

β1, …, βn are the weights or magnitudes attached to the features, β0 is the bias of the model, and b represents the intercept.

Linear regression models try to optimize the coefficients and the intercept to minimize the cost function. The loss function for linear regression is called the RSS or residual sum of squares:

RSS = Σ (yi − ŷi)², where ŷi is the predicted value from the equation above.

We then add a penalty term to this loss function and optimize the parameters so that the model can predict the value of y accurately while keeping the coefficients small.

Techniques of Regularization
There are mainly two types of regularization techniques, which are given below:

o Ridge Regression / L2 Regularization


o Lasso Regression / L1 Regularization

L1 & L2 Regularization
L1 regularization adds a penalty equal to the absolute value of the magnitude of the coefficients. This type of regularization can result in sparse models with few coefficients: some coefficients may become exactly zero and be eliminated from the model. Larger penalties result in coefficient values closer to zero (ideal for producing simpler models). L2 regularization, on the other hand, does not eliminate coefficients or produce sparse models. Thus, Lasso regression is easier to interpret than Ridge regression.
Ridge Regression

o Ridge regression is a model tuning method used to analyse data that suffers from multicollinearity. When multicollinearity occurs, the least-squares estimates are unbiased but their variances are large, which results in predicted values being far from the actual values. Ridge regression shrinks the parameters and is therefore used to counter multicollinearity.
o Ridge regression is a type of linear regression in which a small amount of bias is introduced so that we can get better long-term predictions.
o Ridge regression is a regularization technique used to reduce the complexity of the model. It is also called L2 regularization.
o In this technique, the cost function is altered by adding a penalty term to it. The amount of bias added to the model is called the ridge regression penalty. It is calculated by multiplying lambda by the squared weight of each individual feature.
o Ridge regression adds the "squared magnitude" of the coefficients as the penalty term to the loss function (L).
o The equation for the cost function in ridge regression is:

Cost = Σ (yi − ŷi)² + λ Σ βj²

o In the above equation, the penalty term regularizes the coefficients of the model, and hence ridge regression reduces the magnitudes of the coefficients, which decreases the complexity of the model.
o As we can see from the above equation, as the value of λ tends to zero, the equation becomes the cost function of the linear regression model. Hence, for very small values of λ, the model will resemble the linear regression model.
o A general linear or polynomial regression will fail if there is high collinearity between the independent variables; ridge regression can be used to solve such problems.
o It also helps when we have more parameters than samples.
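A minimal sketch using scikit-learn's Ridge (assuming scikit-learn is available; the alpha parameter plays the role of λ and its value here is arbitrary):

```python
import numpy as np
from sklearn.linear_model import Ridge

# Toy data with two nearly collinear features, assumed for illustration.
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + 0.01 * rng.normal(size=100)      # almost identical to x1
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=0.1, size=100)

model = Ridge(alpha=1.0)                   # alpha is the L2 penalty strength (λ)
model.fit(X, y)
print(model.coef_, model.intercept_)       # coefficients are shrunk toward zero
```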
Lasso Regression
o Lasso regression is another regularization technique used to reduce the complexity of the model. It stands for Least Absolute Shrinkage and Selection Operator.
o It is similar to ridge regression except that the penalty term contains the absolute values of the weights instead of their squares.
o Since it takes absolute values, it can shrink coefficients all the way to 0, whereas ridge regression can only shrink them close to 0.
o Lasso regression adds the "absolute value of magnitude" of the coefficients as the penalty term to the loss function (L).
o This type of regression is well suited to models showing high levels of multicollinearity, or when you want to automate certain parts of model selection, such as variable selection/parameter elimination.
o It is also called L1 regularization. The equation for the cost function of lasso regression is:

Cost = Σ (yi − ŷi)² + λ Σ |βj|

o Some of the features are completely neglected (given zero weight) in this technique.
o Hence, lasso regression can help us reduce over-fitting in the model as well as perform feature selection.
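A minimal sketch using scikit-learn's Lasso, again with assumed toy data and an arbitrary alpha (the L1 penalty strength):

```python
import numpy as np
from sklearn.linear_model import Lasso

# Toy data, assumed: only the first feature actually drives y.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = 4.0 * X[:, 0] + rng.normal(scale=0.1, size=100)

model = Lasso(alpha=0.1)     # alpha is the L1 penalty strength (λ)
model.fit(X, y)
print(model.coef_)           # irrelevant coefficients are driven to exactly 0
```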

Note: During Regularization the output function (y_hat) does not change. The
change is only in the loss function.

Differences between Ridge and Lasso Regression

o Ridge regression is mostly used to reduce the overfitting in the model, and it
includes all the features present in the model. It reduces the complexity of the
model by shrinking the coefficients.
o Lasso regression helps to reduce over-fitting in the model and also performs feature selection.
Principal Component Analysis
Principal Component Analysis: Principal Component Analysis (PCA) is an unsupervised learning algorithm that is used for dimensionality reduction in machine learning. It is a statistical process that converts observations of correlated features into a set of linearly uncorrelated features with the help of an orthogonal transformation. These new transformed features are called the principal components.

o Statistically, PCA finds lines, planes and hyper-planes in the K-dimensional space that approximate the data as well as possible in the least-squares sense. A line or plane that is the least-squares approximation of a set of data points makes the variance of the coordinates on the line or plane as large as possible.
o PCA tries to find a lower-dimensional surface onto which to project the high-dimensional data.
o It is a feature extraction technique, so it retains the important variables and drops the least important ones.

Some common terms used in PCA algorithm:

o Dimensionality: The number of features or variables present in the given dataset; more simply, the number of columns in the dataset.
o Correlation: How strongly two variables are related to each other, i.e. if one changes, the other variable also changes. The correlation value ranges from -1 to +1: -1 occurs if the variables are inversely proportional to each other, and +1 indicates that the variables are directly proportional to each other.
o Orthogonal: Variables are not correlated with each other, and hence the correlation between a pair of variables is zero.
o Eigenvectors: If M is a square matrix and v is a non-zero vector, then v is an eigenvector of M if Mv is a scalar multiple of v.
o Covariance Matrix: A matrix containing the covariances between pairs of variables is called the covariance matrix.

Principal Components

Principal Components: The transformed new features, i.e. the output of PCA, are the principal components. The number of these PCs is either equal to or less than the number of original features in the dataset. Some properties of the principal components are given below:

o Each principal component must be a linear combination of the original features.
o The components are orthogonal, i.e., the correlation between any pair of components is zero.
o The importance of each component decreases when going from 1 to n: the 1st PC has the most importance and the nth PC the least.

Steps for PCA algorithm


1. Getting the dataset: Firstly, we take the input dataset and divide it into two subparts X and Y, where X is the training set and Y is the validation set.
2. Representing data in a structure: Next we represent our dataset in a structure, such as the two-dimensional matrix of the independent variables X. Here each row corresponds to a data item and each column corresponds to a feature. The number of columns is the dimensionality of the dataset.
3. Standardizing the data: In this step, we standardize our dataset. In a given column, features with high variance would otherwise be treated as more important than features with lower variance. If the importance of features should be independent of their variance, we divide each data item in a column by the standard deviation of that column. We name the resulting matrix Z.
4. Calculating the covariance of Z: To calculate the covariance of Z, we take the matrix Z and transpose it. After transposing, we multiply it by Z. The output matrix is the covariance matrix of Z.
5. Calculating the eigenvalues and eigenvectors: Now we calculate the eigenvalues and eigenvectors of the resulting covariance matrix. The eigenvectors of the covariance matrix are the directions of the axes with the highest information (variance), and the corresponding eigenvalues give the amount of variance along those directions.
6. Sorting the eigenvectors: In this step, we take all the eigenvalues and sort them in decreasing order, i.e. from largest to smallest, and simultaneously sort the eigenvectors accordingly in a matrix P. The resulting matrix is named P*.
7. Calculating the new features, or principal components: Here we calculate the new features. To do this, we multiply Z by the P* matrix. In the resulting matrix Z*, each observation is a linear combination of the original features, and the columns of Z* are independent of each other.
8. Removing less important features from the new dataset: Once the new feature set is obtained, we decide what to keep and what to remove; that is, we keep only the relevant or important features in the new dataset, and the unimportant features are removed.
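The following NumPy sketch walks through steps 3 to 7 on an assumed toy data matrix (the original notes do not include code, so the data and details here are illustrative):

```python
import numpy as np

# Toy data matrix: rows are observations, columns are features (assumed).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
X[:, 2] = X[:, 0] + 0.1 * rng.normal(size=50)   # make one feature correlated

# Step 3: standardize each column.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 4: covariance matrix of Z (Z transposed times Z, scaled).
cov = (Z.T @ Z) / (Z.shape[0] - 1)

# Steps 5-6: eigen-decomposition, sorted by decreasing eigenvalue.
eigvals, eigvecs = np.linalg.eigh(cov)          # eigh: covariance matrix is symmetric
order = np.argsort(eigvals)[::-1]
eigvals, P_star = eigvals[order], eigvecs[:, order]

# Steps 7-8: project onto the top-k principal components.
k = 2
Z_star = Z @ P_star[:, :k]                      # new, uncorrelated features
print(eigvals, Z_star.shape)
```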

Applications of Principal Component Analysis


o PCA is mainly used as a dimensionality reduction technique in various AI applications such as computer vision, image compression, etc.
o It can also be used for finding hidden patterns when data has high dimensionality. Some fields where PCA is used are finance, data mining, psychology, etc.
o It is used to find the inter-relations between variables in the data.
o It is used to interpret and visualize data.
o As the number of variables decreases, further analysis becomes simpler.
o It is often used to visualize genetic distance and relatedness between populations.

Principal Axis Method: PCA searches for a linear combination of variables that extracts the maximum variance from the variables. Once this is done, it removes that component and searches for another linear combination that explains the maximum proportion of the remaining variance, which leads to orthogonal factors. In this method, we analyze the total variance.

Partial Least Squares


Partial Least Squares (PLS): Partial least squares (PLS) regression is a technique that
reduces the predictors to a smaller set of uncorrelated components and performs
least squares regression on these components, instead of on the original data. PLS
regression is especially useful when your predictors are highly collinear, or when you
have more predictors than observations and ordinary least-squares regression either
produces coefficients with high standard errors or fails completely.
o PLS does not assume that the predictors are fixed, unlike multiple regression. This means that the predictors can be measured with error, making PLS more robust to measurement uncertainty.
o PLS combines features of principal components analysis and multiple regression. It first extracts a set of latent factors that explain as much of the covariance as possible between the independent and dependent variables. A regression step then predicts the values of the dependent variables using the decomposition of the independent variables.
o In PLS, components are selected based on how much variance they explain in the predictors and between the predictors and the response(s).
o PLS is a predictive technique that is an alternative to ordinary least squares (OLS) regression, canonical correlation, or structural equation modelling.
o PLS regression is primarily used in the chemical, drug, food, and plastic industries.
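A minimal sketch using scikit-learn's PLSRegression (assuming scikit-learn is available; the toy data and the choice of two latent components are illustrative):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

# Toy data: more predictors than informative directions, assumed for illustration.
rng = np.random.default_rng(2)
X = rng.normal(size=(40, 10))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=40)

pls = PLSRegression(n_components=2)        # number of latent components to extract
pls.fit(X, y)
y_pred = pls.predict(X)
print(pls.x_scores_.shape, y_pred.shape)   # latent scores and predictions
```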
