ML - Module 2
Statistical Decision Theory, Bayesian Learning (ML, MAP, Bayes estimates, Conjugate
priors), Linear Regression, Ridge Regression, Lasso, Principal Component Analysis,
Partial Least Squares.
Statistical Decision Theory
Statistical decision theory: Decision theory is the science of making optimal decisions in the face of uncertainty. Statistical decision theory is concerned with making decisions in the presence of statistical knowledge (data), which sheds light on some of the uncertainties involved in the decision problem.
Linear Regression
Regression
Regression: The main goal of regression is to construct an efficient model that predicts the dependent attribute from a set of attribute variables. A regression problem is one in which the output variable is a real or continuous value, e.g. salary, weight, area, etc.
Types of Regression
1. Linear Regression
2. Polynomial Regression
3. Support Vector Regression
4. Decision Tree Regression
5. Random Forest Regression
Linear Regression
Linear Regression: Linear Regression is a machine learning algorithm based on supervised learning. It performs a regression task. It is a statistical method used for predictive analysis. Linear regression makes predictions for continuous/real or numeric variables such as sales, salary, age, product price, etc.
y = a0 + a1x + ε
Here,
y = dependent (target) variable, x = independent (input) variable, a0 = intercept of the line, a1 = linear regression coefficient (slope), and ε = random error.
Different values of the weights or coefficients of the line (a0, a1) give different regression lines, so we need to calculate the best values of a0 and a1 to find the best-fit line; to calculate these we use the cost function.
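A minimal sketch of finding the best-fit line with scikit-learn's LinearRegression; the library choice and the toy data are assumptions for illustration, not part of the notes:
```python
# Fit y = a0 + a1*x by ordinary least squares (illustrative data).
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([[1], [2], [3], [4], [5]])            # single feature, shape (n_samples, 1)
y = np.array([30000, 35000, 41000, 44000, 50000])  # e.g. salary values (made up)

model = LinearRegression().fit(x, y)
a0 = model.intercept_    # intercept a0
a1 = model.coef_[0]      # slope a1
print(f"best-fit line: y = {a0:.2f} + {a1:.2f} * x")
print("prediction for x = 6:", model.predict([[6]])[0])
```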
Cost function
o The different values for the weights or coefficients of the line (a0, a1) give different regression lines, and the cost function is used to estimate the values of the coefficients for the best-fit line.
o The cost function optimizes the regression coefficients or weights. It measures how well a linear regression model is performing.
o We can use the cost function to find the accuracy of the mapping function, which maps the input variable to the output variable. This mapping function is also known as the Hypothesis function.
o By achieving the best-fit regression line, the model aims to predict the Y value such that the error difference between the predicted value and the true value is minimum. So, it is very important to update the a0 and a1 values to reach the best values that minimize the error between the predicted Y value and the true Y value.
Mean Squared Error (MSE): For Linear Regression, we use the Mean Squared Error (MSE) cost function, which is the average of the squared errors between the predicted values and the actual values. It can be written as:
MSE = (1/N) Σi=1..N (yi − (a0 + a1xi))²
Where,
N = total number of observations, yi = actual value of the i-th observation, and (a0 + a1xi) = predicted value for the i-th observation.
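A small NumPy sketch of the MSE cost, using made-up arrays for the actual and predicted values:
```python
import numpy as np

def mse(y_true, y_pred):
    """Mean Squared Error: average of the squared differences."""
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([3.0, 5.0, 7.0])   # actual values (illustrative)
y_pred = np.array([2.5, 5.5, 6.0])   # predicted values (illustrative)
print(mse(y_true, y_pred))           # (0.25 + 0.25 + 1.0) / 3 = 0.5
```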
Gradient Descent
o Gradient descent is used to minimize the MSE by calculating the gradient of
the cost function.
o A regression model uses gradient descent to update the coefficients of the
line by reducing the cost function.
o It is done by starting from a random selection of coefficient values and then iteratively updating them to reach the minimum of the cost function.
o To update a0 and a1, we take gradients from the cost function. To find these gradients, we take partial derivatives with respect to a0 and a1, as in the sketch below.
o Alpha is the learning rate, a hyper-parameter that you must specify. A smaller learning rate moves closer to the minimum but takes more time; a larger learning rate converges sooner but risks overshooting the minimum value.
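A minimal gradient-descent sketch for the simple linear model y = a0 + a1*x, minimizing the MSE; the data, learning rate, and iteration count are illustrative assumptions:
```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.1])   # roughly y = 1 + 2x (made-up data)

a0, a1 = 0.0, 0.0   # start from arbitrary coefficient values
alpha = 0.01        # learning rate (hyper-parameter)
n = len(x)

for _ in range(5000):
    y_pred = a0 + a1 * x
    error = y_pred - y
    # partial derivatives of the MSE with respect to a0 and a1
    grad_a0 = (2.0 / n) * np.sum(error)
    grad_a1 = (2.0 / n) * np.sum(error * x)
    a0 -= alpha * grad_a0
    a1 -= alpha * grad_a1

print(f"a0 = {a0:.3f}, a1 = {a1:.3f}")   # expect roughly a0 close to 1, a1 close to 2
```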
Advantages: Easy to implement and interpret, and efficient to train.
Disadvantages: It is often quite prone to noise and over-fitting.
R-squared method
R-squared is a goodness-of-fit measure for regression: it is the proportion of the variance in the dependent variable that is explained by the model, and a value closer to 1 indicates a better fit.
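A short sketch of computing R-squared by hand, assuming hypothetical arrays y_true and y_pred:
```python
import numpy as np

def r_squared(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)           # residual sum of squares
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

y_true = np.array([3.0, 5.0, 7.0, 9.0])   # actual values (illustrative)
y_pred = np.array([2.8, 5.1, 7.3, 8.9])   # predicted values (illustrative)
print(r_squared(y_true, y_pred))          # close to 1 => good fit
```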
Regularization
y = β0 + β1x1 + β2x2 + β3x3 + ⋯ + βnxn + ε
Here x1, …, xn are the features, β1, …, βn are the weights (coefficients) attached to them, β0 is the intercept (bias), and ε is the error term. Linear regression models try to optimize β0, β1, …, βn to minimize the cost function. The cost function for the linear model is the loss function called RSS, or the Residual Sum of Squares:
RSS = Σi (yi − ŷi)²
where ŷi is the value predicted by the model for the i-th observation. Regularization adds a penalty term to this loss function and then optimizes the parameters, so that the model predicts the value of Y accurately without overfitting.
Techniques of Regularization
There are mainly two types of regularization techniques, which are given below:
L1 & L2 Regularization
L1 regularization adds a penalty that is equal to the absolute value of the magnitude of the coefficients. This regularization type can result in sparse models with few coefficients: some coefficients may become exactly zero and get eliminated from the model. Larger penalties result in coefficient values that are closer to zero (ideal for producing simpler models). On the other hand, L2 regularization does not eliminate coefficients or produce sparse models. Thus, Lasso regression (L1) is easier to interpret than Ridge regression (L2).
Ridge Regression
o Ridge regression is a model tuning method that is used to analyse any data that suffers from multicollinearity. When multicollinearity occurs, the least-squares estimates are unbiased but their variances are large, which results in predicted values being far away from the actual values. Ridge regression shrinks the parameters and is therefore used to counter multicollinearity.
o Ridge regression is one of the types of linear regression in which a small
amount of bias is introduced so that we can get better long-term predictions.
o Ridge regression is a regularization technique, which is used to reduce the
complexity of the model. It is also called as L2 regularization.
o In this technique, the cost function is altered by adding a penalty term to it. The amount of bias added to the model is called the Ridge Regression penalty. We can calculate it by multiplying lambda with the squared weight of each individual feature.
o Ridge regression adds “squared magnitude” of coefficient as penalty term to
the loss function(L).
o The equation for the cost function in ridge regression will be:
  Cost = Σi (yi − ŷi)² + λ Σj (βj)²
o In the above equation, the penalty term regularizes the coefficients of the
model, and hence ridge regression reduces the amplitudes of the coefficients
that decreases the complexity of the model.
o As we can see from the above equation, if the value of λ tends to zero, the equation becomes the cost function of the plain linear regression model. Hence, for a very small value of λ, the model will resemble the linear regression model.
o A general linear or polynomial regression will fail if there is high collinearity
between the independent variables, so to solve such problems, Ridge
regression can be used.
o It helps to solve the problems if we have more parameters than samples.
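A minimal Ridge regression sketch with scikit-learn, where the alpha parameter plays the role of λ; the synthetic, nearly collinear data is an assumption chosen only to show the shrinking of coefficients:
```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
X[:, 2] = X[:, 0] + 0.01 * rng.normal(size=50)   # make two features nearly collinear
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(size=50)

print("OLS coefficients:  ", LinearRegression().fit(X, y).coef_)  # unstable under collinearity
print("Ridge coefficients:", Ridge(alpha=1.0).fit(X, y).coef_)    # shrunk, more stable
```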
Lasso Regression
o Lasso regression is another regularization technique to reduce the complexity of the model. It stands for Least Absolute Shrinkage and Selection Operator.
o It is similar to the Ridge Regression except that the penalty term contains only
the absolute weights instead of a square of weights.
o Since it takes absolute values, hence, it can shrink the slope to 0, whereas
Ridge Regression can only shrink it near to 0.
o Lasso Regression adds “absolute value of magnitude” of coefficient as penalty
term to the loss function (L).
o This particular type of regression is well-suited for models showing high levels
of multicollinearity or when you want to automate certain parts of model
selection, like variable selection/parameter elimination.
o It is also called L1 regularization. The equation for the cost function of Lasso regression will be:
  Cost = Σi (yi − ŷi)² + λ Σj |βj|
o In this technique some of the features are neglected completely, i.e., their coefficients are set to zero for model evaluation.
o Hence, Lasso regression can help us to reduce overfitting in the model as well as perform feature selection.
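A minimal Lasso sketch with scikit-learn; the data and the penalty strength alpha (playing the role of λ) are made up for illustration, to show coefficients being driven exactly to zero:
```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = 4 * X[:, 0] + 2 * X[:, 1] + rng.normal(size=100)  # only the first two features matter

lasso = Lasso(alpha=0.5).fit(X, y)
print(lasso.coef_)   # coefficients of the irrelevant features become exactly 0
```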
Note: During Regularization the output function (y_hat) does not change. The
change is only in the loss function.
o Ridge regression is mostly used to reduce the overfitting in the model, and it
includes all the features present in the model. It reduces the complexity of the
model by shrinking the coefficients.
o Lasso regression helps to reduce the overfitting in the model as well as
feature selection.
Principal Component Analysis
Principal Component Analysis: Principal Component Analysis is an unsupervised learning algorithm that is used for dimensionality reduction in machine learning. It is a statistical process that converts the observations of correlated features into a set of linearly uncorrelated features with the help of an orthogonal transformation. These new transformed features are called the Principal Components.
Principal Components
Principal Components: The transformed new features, i.e., the output of PCA, are the Principal Components. The number of these PCs is either equal to or less than the number of original features present in the dataset. Some properties of these principal components are given below:
o The principal component must be the linear combination of the original
features.
o These components are orthogonal, i.e., the correlation between any pair of components is zero.
o The importance of each component decreases when going from 1 to n; that is, PC 1 has the most importance and PC n has the least importance.
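A small PCA sketch with scikit-learn; the synthetic correlated data and the choice of n_components=2 are assumptions for illustration:
```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
t = rng.normal(size=100)
# three correlated original features (made-up data)
X = np.column_stack([t, 2 * t + 0.1 * rng.normal(size=100), rng.normal(size=100)])

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)        # the new, linearly uncorrelated features (PCs)
print(pca.explained_variance_ratio_)    # importance decreases from PC 1 to PC 2
```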