Unit - III
Regression
A regression problem is when the output variable is a real or continuous value, such as “salary” or “weight”.
Many different models can be used; the simplest is linear regression, which tries to fit the data with the best hyperplane (a straight line in two dimensions) passing as close as possible to the points.
Regression analysis is a statistical process for estimating the relationships between a dependent (criterion) variable and one or more independent variables (predictors). It explains how the criterion variable changes in relation to changes in selected predictors; more precisely, it estimates the conditional expectation of the criterion given the predictors, i.e. the average value of the dependent variable when the independent variables are varied. Three major uses for regression analysis are determining the strength of predictors, forecasting an effect, and trend forecasting.
Types of Regression –
Linear regression
Logistic regression
Polynomial regression
Stepwise regression
Ridge regression
Lasso regression
ElasticNet regression
Linear regression is used for predictive analysis. It is a linear approach for modeling the relationship between the criterion (the scalar response) and one or more predictors (explanatory variables). Linear regression focuses on the conditional probability distribution of the response given the values of the predictors. For linear regression, there is a danger of overfitting. The formula for simple linear regression is:
Y' = bX + A
where Y' is the estimated dependent variable score, A is the constant (intercept), b is the regression coefficient, and X is the score on the independent variable.
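As a quick illustration of the formula above, the short Python sketch below fits a simple linear regression with scikit-learn; the data values are made up for the example and are not from the text.

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical scores on the independent (X) and dependent (Y) variables
X = np.array([[1], [2], [3], [4], [5]])
Y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

model = LinearRegression().fit(X, Y)
print("b (regression coefficient):", model.coef_[0])
print("A (constant):", model.intercept_)
print("Y' for X = 6:", model.predict([[6]])[0])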
Logistic regression is used when the dependent variable is dichotomous. It estimates the parameters of a logistic model and is a form of binomial regression. Logistic regression is used for data with two possible criterion values, modeling the relationship between those outcomes and the predictors. The equation for logistic regression is:
p = 1 / (1 + e^-(b0 + b1X1 + b2X2 + ... + bkXk))
where p is the probability that the criterion takes the value 1, b0 is the constant, and X1, ..., Xk are the k independent (X) variables. In ordinal logistic regression, the threshold coefficient is different for every level of the ordered dependent variable, and each coefficient gives the cumulative probability up to that level.
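A minimal sketch of logistic regression on a dichotomous outcome, again with made-up data; the predictor values and class labels are assumptions used only to illustrate the model.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical predictor and a dichotomous (0/1) criterion
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression().fit(X, y)
b0, b1 = model.intercept_[0], model.coef_[0][0]
print("b0 (constant):", b0, "b1:", b1)
# P(y = 1 | X) = 1 / (1 + e^-(b0 + b1*X))
print("P(y = 1 | X = 3.5):", model.predict_proba([[3.5]])[0, 1])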
Polynomial regression is used for curvilinear data and is fitted with the method of least squares. The goal of regression analysis is to model the expected value of a dependent variable y in terms of the independent variable x. The equation for polynomial regression of degree n is:
y = β0 + β1x + β2x^2 + ... + βnx^n + ε
where ε is an unobserved random error with mean zero conditioned on the scalar variable x. In the first-degree (linear) special case of this model, each unit increase in the value of x increases the conditional expectation of y by β1 units.
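For curvilinear data, a polynomial can be fitted by least squares; the sketch below uses numpy.polyfit on assumed, roughly quadratic data.

import numpy as np

# Hypothetical, roughly quadratic data
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.0, 1.8, 4.9, 9.2, 16.1, 24.8])

# Fit y = b0 + b1*x + b2*x^2 by least squares
coeffs = np.polyfit(x, y, deg=2)              # returned highest degree first
print("b2, b1, b0:", coeffs)
print("Predicted y at x = 6:", np.polyval(coeffs, 6.0))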
Stepwise regression fits a regression model in which the choice of predictor variables is carried out automatically. At each step, a variable is added to or removed from the set of explanatory variables. The approaches to stepwise regression are forward selection, backward elimination, and bidirectional elimination; a forward-selection sketch is shown below. The standardized coefficient used in stepwise regression is:
βj = bj (Sxj / Sy)
where Sy and Sxj are the standard deviations of the dependent variable and the corresponding jth independent variable.
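Forward selection can be sketched in a few lines: starting from no predictors, at each step add the candidate that most improves the fit, and stop when the improvement becomes negligible. The data, the R^2 criterion, and the 0.01 stopping threshold below are assumptions for illustration.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                     # four candidate predictors
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(scale=0.5, size=100)

selected, remaining, best_r2 = [], list(range(X.shape[1])), 0.0
while remaining:
    # R^2 of the model obtained by adding each remaining predictor
    scores = {j: LinearRegression().fit(X[:, selected + [j]], y)
                                   .score(X[:, selected + [j]], y)
              for j in remaining}
    j_best = max(scores, key=scores.get)
    if scores[j_best] - best_r2 < 0.01:           # stop when the gain is negligible
        break
    selected.append(j_best)
    remaining.remove(j_best)
    best_r2 = scores[j_best]

print("Selected predictors:", selected, "R^2:", round(best_r2, 3))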
Ridge regression is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates are unbiased, but their variances are large, so they may be far from the true value. By adding a degree of bias to the regression estimates, ridge regression reduces the standard errors. The ridge estimate minimizes:
Σ (Y - Xβ)^2 + λ Σ β^2
where β is the coefficient, Y is the response variable, X holds the predictors, and λ ≥ 0 is the shrinkage (penalty) parameter.
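A short sketch of ridge regression on deliberately collinear predictors; the data and the penalty value alpha = 1.0 are assumptions.

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
x1 = rng.normal(size=50)
x2 = x1 + rng.normal(scale=0.01, size=50)     # nearly collinear with x1
X = np.column_stack([x1, x2])
y = 2 * x1 + rng.normal(scale=0.1, size=50)

# alpha is the shrinkage penalty (lambda in the formula above)
model = Ridge(alpha=1.0).fit(X, y)
print("Ridge coefficients:", model.coef_)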
Lasso regression is a regression analysis method that performs both variable selection and regularization. Lasso regression uses soft thresholding and selects only a subset of the provided covariates for use in the final model. The lasso objective is:
minimize RSS + α Σ |βj|
Here, α (alpha) works similarly to the ridge penalty and provides a trade-off between balancing RSS and the magnitude of the coefficients. Like ridge, α can take various values.
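A sketch of lasso regression showing the variable-selection effect: coefficients of irrelevant covariates are driven to exactly zero. The data and alpha = 0.1 are illustrative assumptions.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))                 # five candidate covariates
y = 4 * X[:, 0] + 2 * X[:, 3] + rng.normal(scale=0.5, size=100)

# alpha trades off RSS against the magnitude of the coefficients
model = Lasso(alpha=0.1).fit(X, y)
print("Lasso coefficients:", model.coef_)     # the irrelevant ones shrink to 0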
ElasticNet regression is a regularized regression method that linearly combines the penalties of the lasso and ridge methods. ElasticNet regression is used for support vector machines, metric learning, and portfolio optimization. The penalty function is given by:
λ1 Σ |βj| + λ2 Σ βj^2
Use of the lasso penalty alone has several limitations. For example, in the "large p, small n" case (high-dimensional data with few examples), the lasso selects at most n variables before it saturates; combining it with the quadratic ridge penalty removes this limitation.
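A sketch of elastic net regression; l1_ratio mixes the lasso (L1) and ridge (L2) penalties, and the data and parameter values are assumptions.

import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 1] - X[:, 4] + rng.normal(scale=0.5, size=100)

# alpha sets the overall penalty strength; l1_ratio = 0.5 weights L1 and L2 equally
model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print("Elastic net coefficients:", model.coef_)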
BLUE property assumptions
• B-BEST
• L-LINEAR
• U-UNBIASED
• E-ESTIMATOR
An estimator is BLUE if the following hold:
1. It is linear (Regression model)
2. It is unbiased
3. It is an efficient estimator (an unbiased estimator with the least variance)
LINEARITY
• An estimator is said to be a linear estimator of (β) if it is a linear function of the sample
observations
UNBIASEDNESS
• A desirable property of a distribution of estimates is that its mean equals the true mean of the
variables being estimated
• Formally, an estimator is an unbiased estimator if the expected value of its sampling distribution equals the true value of the population parameter.
• We also write this as follows: E(β̂) = β
If this is not the case, we say that the estimator is biased:
Bias = E(β̂) - β
MINIMUM VARIANCE
• Just as we want the mean of the sampling distribution to be centered around the true population value, it is desirable for the sampling distribution to be as narrow (or precise) as possible.
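As a rough illustration of unbiasedness (and of the spread that minimum variance refers to), the small simulation below repeatedly draws samples, computes the OLS slope, and checks that the estimates average close to the true value; the true slope, sample size, and number of replications are assumptions.

import numpy as np

rng = np.random.default_rng(4)
true_beta = 2.0
estimates = []
for _ in range(2000):
    x = rng.normal(size=30)
    y = true_beta * x + rng.normal(size=30)
    estimates.append(np.sum(x * y) / np.sum(x * x))   # OLS slope (no-intercept model)

estimates = np.array(estimates)
print("Mean of the estimates:", estimates.mean())      # close to the true value 2.0
print("Bias:", estimates.mean() - true_beta)           # close to 0
print("Variance of the estimates:", estimates.var())   # width of the sampling distribution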
To understand the least-squares regression method, let's get familiar with the concepts involved in formulating the line of best fit.
Regression analysis makes use of mathematical methods such as least squares to obtain a definite relationship between the predictor variable(s) and the target variable. The least-squares method is one of the most effective ways to draw the line of best fit. It is based on the idea that the squares of the errors obtained must be minimized to the greatest possible extent, hence the name least squares method.
If we were to plot the best-fit line that depicts the sales of a company over a period of time, the line would lie as close as possible to all of the scattered data points. This is what an ideal best-fit line looks like.
Let's see how to calculate the line using least squares regression. The line of best fit is given by the simple equation
y = mx + c
which represents a straight line through two-dimensional data, i.e. the x-axis and y-axis. To better understand this, let's break down the equation:
y: dependent variable
m: the slope of the line
x: independent variable
c: y-intercept
So the aim is to calculate the values of slope, y-intercept and substitute the corresponding ‘x’ values in the
equation in order to derive the value of the dependent variable.
Step 1: Calculate the slope 'm':
m = (N Σxy - Σx Σy) / (N Σx² - (Σx)²)
Step 2: Compute the y-intercept (the value of y at the point where the line crosses the y-axis):
c = (Σy - m Σx) / N
Now let’s look at an example and see how you can use the least-squares regression method to compute the
line of best fit.
Let us use the concept of least squares regression to find the line of best fit for the above data.
Step 1: Calculate the slope 'm' by using the formula given above. After you substitute the respective values, m = 1.518 approximately.
Step 2: Compute the y-intercept value in the same way; this gives c = 0.305 approximately.
Once you substitute the values into y = mx + c, the line of best fit looks like this:
y = 1.518x + 0.305
Let's construct a graph that represents the y = mx + c line of best fit. Now Tom can use the above equation to estimate how many T-shirts priced at $8 he can sell at the retail shop. Substituting x = 8 gives y ≈ 12.45, i.e. roughly 12 to 13 T-shirts. That's how simple it is to make predictions using Linear Regression.
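The data table for this example is not shown above, so the following sketch uses hypothetical (price, units sold) pairs chosen to reproduce the slope and intercept quoted above; it applies the Step 1 and Step 2 formulas directly and then predicts sales at a price of $8.

import numpy as np

# Hypothetical data (the original table is not reproduced in the text)
x = np.array([2.0, 3.0, 5.0, 7.0, 9.0])       # T-shirt price
y = np.array([4.0, 5.0, 7.0, 10.0, 15.0])     # units sold

n = len(x)
m = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x * x) - np.sum(x) ** 2)
c = (np.sum(y) - m * np.sum(x)) / n

print("m:", round(m, 3), "c:", round(c, 3))    # roughly 1.518 and 0.305
print("Predicted sales at price 8:", round(m * 8 + c, 2))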
Now let’s try to understand based on what factors can we confirm that the above line is the line of best fit.
The least squares regression method works by minimizing the sum of the squares of the errors, hence the name least squares. Basically, the distance between each data point and the line of best fit (the error) must be made as small as possible. This is the basic idea behind the least squares regression method.
A few things to keep in mind before implementing the least squares regression method are:
• The data must be free of outliers, because outliers can lead to a biased and misleading line of best fit.
• The line of best fit can be drawn iteratively until you get a line with the minimum possible sum of squared errors.
• This method works well even with non-linear data, since the same least-squares idea can be used to fit curvilinear (e.g. polynomial) models.
Technically, the difference between the actual value of ‘y’ and the predicted value of ‘y’ is called the
Residual (denotes the error).