
UNIT - III

Regression

A regression problem is one where the output variable is a real or continuous value, such as "salary" or "weight". Many different models can be used; the simplest is linear regression, which tries to fit the data with the best hyperplane through the points.

Regression analysis is a statistical process for estimating the relationship between a dependent (criterion) variable and one or more independent variables (predictors). Regression analysis explains changes in the criterion in relation to changes in selected predictors. It models the conditional expectation of the criterion given the predictors, i.e. the average value of the dependent variable when the independent variables are set to particular values. Three major uses of regression analysis are determining the strength of predictors, forecasting an effect, and trend forecasting.

Types of Regression –

 Linear regression
 Logistic regression
 Polynomial regression
 Stepwise regression
 Ridge regression
 Lasso regression
 ElasticNet regression

Linear regression is used for predictive analysis. It is a linear approach for modeling the relationship between the criterion (the scalar response) and one or more predictors (explanatory variables). Linear regression focuses on the conditional probability distribution of the response given the values of the predictors. With linear regression there is a danger of overfitting, especially when many predictors are used. The formula for simple linear regression is:

Y' = bX + A

where Y' = estimated (predicted) dependent variable score, A = constant (intercept), b = regression coefficient (slope), and X = score on the independent variable.
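As a quick illustration of the Y' = bX + A form, here is a minimal sketch using scikit-learn's LinearRegression on a small made-up dataset (the numbers are hypothetical, not taken from the text):

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: X is the single predictor, y the continuous response.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.1, 4.3, 6.2, 8.4, 10.1])

model = LinearRegression()
model.fit(X, y)

# b (regression coefficient) and A (constant) from Y' = bX + A
print("b =", model.coef_[0])
print("A =", model.intercept_)
print("prediction for X = 6:", model.predict([[6.0]])[0])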

Logistic regression is used when the dependent variable is dichotomous. It estimates the parameters of a logistic model and is a form of binomial regression. Logistic regression is used for data with two possible outcomes (criterions) and models the relationship between those criterions and the predictors. The equation for the linear predictor in logistic regression is:

z = b0 + b1X1 + b2X2 + ... + bkXk

where b0 is the constant (intercept) and X1, ..., Xk are the k independent variables with coefficients b1, ..., bk. In ordinal logistic regression, the threshold coefficient is different for each level of the ordered dependent variable, and the coefficients give the cumulative probability of each level of the dependent variable.
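As a sketch of this in practice, the example below fits a logistic model with scikit-learn on two made-up predictors; the fitted intercept and coefficients correspond to b0, b1, and b2 in the equation above:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: two predictors and a dichotomous (0/1) outcome.
X = np.array([[1.0, 0.5], [2.0, 1.0], [3.0, 1.5],
              [4.0, 2.5], [5.0, 3.0], [6.0, 4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression()
clf.fit(X, y)

# z = b0 + b1*X1 + b2*X2; the model passes z through the logistic (sigmoid) function.
print("b0 =", clf.intercept_[0])
print("b1, b2 =", clf.coef_[0])
print("P(y = 1) for [3.5, 2.0]:", clf.predict_proba([[3.5, 2.0]])[0, 1])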

Polynomial regression is used for curvilinear data and is fitted with the method of least squares. The goal of regression analysis is to model the expected value of a dependent variable y in terms of the independent variable x. The equation for a polynomial regression of degree n is:

y = β0 + β1x + β2x^2 + ... + βnx^n + ε

where ε is an unobserved random error with mean zero conditioned on the scalar variable x. In the degree-1 (simple linear) case, for each unit increase in the value of x the conditional expectation of y increases by β1 units.
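To make the least-squares fit of a polynomial concrete, here is a small sketch using NumPy's polyfit on made-up curvilinear data (degree 2 is an arbitrary choice, purely for illustration):

import numpy as np

# Hypothetical curvilinear data, roughly y = 1 + 2x + 0.5x^2 plus noise.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 3.4, 7.1, 11.4, 17.2, 23.9])

# Fit a degree-2 polynomial by least squares; coefficients come back
# highest power first: [b2, b1, b0].
coeffs = np.polyfit(x, y, deg=2)
print("b2, b1, b0 =", coeffs)

# Conditional expectation of y at a new x.
print("predicted y at x = 6:", np.polyval(coeffs, 6.0))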

Stepwise regression is used to automatically select the predictive variables in a regression model. At each step a variable is added to or removed from the set of explanatory variables. The approaches to stepwise regression are forward selection, backward elimination, and bidirectional elimination. The standardized coefficient used to compare predictors in stepwise regression is

b_j(std) = b_j * (S_xj / S_y)

where S_y and S_xj are the standard deviations of the dependent variable and the corresponding j-th independent variable.
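The forward-selection variant can be sketched with scikit-learn's SequentialFeatureSelector; the dataset below is synthetic, and selecting exactly three features is an arbitrary choice for illustration:

from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic data: 8 candidate predictors, only 3 of which are informative.
X, y = make_regression(n_samples=100, n_features=8, n_informative=3,
                       noise=5.0, random_state=0)

# Forward selection: start with no variables and add one at a time.
selector = SequentialFeatureSelector(LinearRegression(),
                                     n_features_to_select=3,
                                     direction="forward")
selector.fit(X, y)
print("selected predictors:", selector.get_support(indices=True))

# direction="backward" would instead start from all variables and remove them one by one.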

Ridge regression is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates are unbiased, but their variances are large, so they may be far from the true values. By adding a degree of bias to the regression estimates, ridge regression reduces the standard errors. The objective minimized by ridge regression is

Objective = RSS + λ * (sum of squared coefficients)

where:

β are the coefficients

X = independent variable = feature = attribute = predictor

λ is the regularization penalty

Y = response variable
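A brief sketch of ridge regression under multicollinearity, using scikit-learn's Ridge on made-up data with two nearly identical predictors (alpha plays the role of the λ penalty):

import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical data with two almost perfectly collinear predictors.
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = x1 + rng.normal(scale=0.01, size=50)   # near-copy of x1 -> multicollinearity
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=0.5, size=50)

# Larger alpha (the lambda penalty) shrinks the coefficients more and
# stabilizes estimates that ordinary least squares would inflate.
ridge = Ridge(alpha=1.0)
ridge.fit(X, y)
print("ridge coefficients:", ridge.coef_)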

Lasso regression is a regression analysis method that performs both variable selection and regularization. It uses soft thresholding and selects only a subset of the provided covariates for use in the final model. The objective minimized by lasso regression is

Objective = RSS + α * (sum of absolute values of coefficients)

Here α (alpha) works similarly to the penalty in ridge regression and provides a trade-off between balancing RSS and the magnitude of the coefficients. As with ridge, α can take various values. Briefly:

1. α = 0: same coefficients as simple linear regression
2. α = ∞: all coefficients are zero (same logic as before)
3. 0 < α < ∞: coefficients are between 0 and those of simple linear regression
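The sparsity produced by this penalty can be seen in a short sketch; the data are synthetic and alpha = 1.0 is an arbitrary illustrative value:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic data: 10 candidate predictors, only 3 of which are informative.
X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=2.0, random_state=1)

ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)   # alpha is the L1 penalty weight

# Soft thresholding drives many coefficients exactly to zero, so only a
# subset of the covariates remains in the final model.
print("OLS nonzero coefficients:  ", int(np.sum(ols.coef_ != 0)))
print("Lasso nonzero coefficients:", int(np.sum(lasso.coef_ != 0)))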

ElasticNet regression is a regularized regression method that linearly combines the penalties of the lasso and ridge methods. It is used in support vector machines, metric learning, and portfolio optimization. The penalty function is given by:

Penalty = λ1 * (sum of absolute values of coefficients) + λ2 * (sum of squared coefficients)

The lasso penalty on its own has several limitations. For example, in the "large p, small n" case (high-dimensional data with few examples), the lasso selects at most n variables before it saturates; combining it with the ridge penalty removes this limitation.
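A minimal sketch of the combined penalty with scikit-learn's ElasticNet, using a synthetic "large p, small n" dataset (the alpha and l1_ratio values are arbitrary):

from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

# Synthetic "large p, small n" setting: more predictors than samples.
X, y = make_regression(n_samples=30, n_features=100, n_informative=5,
                       noise=1.0, random_state=2)

# l1_ratio mixes the two penalties: 1.0 is pure lasso, 0.0 is pure ridge.
enet = ElasticNet(alpha=0.5, l1_ratio=0.5)
enet.fit(X, y)
print("nonzero coefficients:", int((enet.coef_ != 0).sum()))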
BLUE property assumptions
• B - Best
• L - Linear
• U - Unbiased
• E - Estimator
An estimator is BLUE if the following hold:
1. It is linear (a linear function of the data, as in the regression model)
2. It is unbiased
3. It is an efficient estimator (the unbiased estimator with the least variance)

LINEARITY
• An estimator of β is said to be a linear estimator if it is a linear function of the sample observations.
• The sample mean is a linear estimator because it is a linear function of the X values.

UNBIASEDNESS
• A desirable property of a distribution of estimates is that its mean equals the true mean of the variable being estimated.
• Formally, an estimator is unbiased if the expected value of its sampling distribution equals the true population value.
• We also write this as follows:

E(β̂) = β

If this is not the case, we say that the estimator is biased:

Bias = E(β̂) − β
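A quick simulation with made-up numbers illustrates the idea: the sample mean is a linear estimator of the population mean, and its average over many samples comes out very close to the true value, i.e. its bias is approximately zero:

import numpy as np

rng = np.random.default_rng(0)
true_mean = 5.0

# Draw many samples and record the sample mean of each one.
estimates = [rng.normal(loc=true_mean, scale=2.0, size=20).mean()
             for _ in range(10_000)]

# Bias = E(estimator) - true value; it should be close to zero.
print("average estimate:", np.mean(estimates))
print("estimated bias:  ", np.mean(estimates) - true_mean)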
MINIMUM VARIANCE
• Just as we want the mean of the sampling distribution to be centered around the true population value, it is desirable for the sampling distribution to be as narrow (i.e. as precise) as possible. Centering around "the truth" but with high variability might be of very little use.
• One way of narrowing the sampling distribution is to increase the sample size.
What is the Least Squares Regression Method?
The least-squares regression method is a technique commonly used in regression analysis. It is a mathematical method for finding the best-fit line that represents the relationship between an independent and a dependent variable.

To understand the least-squares regression method, let's get familiar with the concepts involved in formulating the line of best fit.

What is the Line of Best Fit?


The line of best fit is drawn to represent the relationship between two or more variables. To be more specific, the best-fit line is drawn across a scatter plot of data points in order to represent the relationship between those data points.

Regression analysis makes use of mathematical methods such as least squares to obtain a definite relationship between the predictor variable(s) and the target variable. The least-squares method is one of the most effective ways to draw the line of best fit. It is based on the idea that the squares of the errors obtained must be minimized as far as possible, hence the name least squares method.

If we were to plot the best-fit line that depicts the sales of a company over a period of time, it would look something like this:

Notice that the line is as close as possible to all the scattered data points. This is what an ideal best-fit line looks like.

Let’s see how to calculate the line using the Least Squares Regression.

Steps to calculate the Line of Best Fit


To start constructing the line that best depicts the relationship between the variables in the data, the equation used is:

y = mx + c

This is the simple equation of a straight line in two-dimensional data, i.e. on the x-axis and y-axis. To better understand it, let's break down the equation:

 y: dependent variable
 m: the slope of the line
 x: independent variable
 c: y-intercept

So the aim is to calculate the values of the slope and the y-intercept, and then substitute the corresponding 'x' values into the equation in order to derive the value of the dependent variable.

Let’s see how this can be done.

As an assumption, let’s consider that there are ‘n’ data points.


Step 1: Calculate the slope 'm' using the following formula:

m = (n Σ(xy) − Σx Σy) / (n Σ(x²) − (Σx)²)

Step 2: Compute the y-intercept 'c' (the value of y at the point where the line crosses the y-axis):

c = (Σy − m Σx) / n

Step 3: Substitute the values of m and c into the final equation y = mx + c.

Simple, isn’t it?
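Putting the three steps together, here is a small sketch with plain NumPy; the (x, y) points are made up for illustration and are not the T-shirt data used in the example that follows:

import numpy as np

# Hypothetical (x, y) data points.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.9])
n = len(x)

# Step 1: slope m = (n*Σxy - Σx*Σy) / (n*Σx² - (Σx)²)
m = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x ** 2) - np.sum(x) ** 2)

# Step 2: intercept c = (Σy - m*Σx) / n
c = (np.sum(y) - m * np.sum(x)) / n

# Step 3: use y = mx + c to predict
print(f"m = {m:.3f}, c = {c:.3f}")
print("predicted y at x = 6:", m * 6 + c)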

Now let’s look at an example and see how you can use the least-squares regression method to compute the
line of best fit.

Least Squares Regression Example


Consider an example. Tom, the owner of a retail shop, recorded the price of different T-shirts versus the number of T-shirts sold at his shop over a period of one week.

He tabulated this as shown below:

Let us use the concept of least squares regression to find the line of best fit for the above data.
Step 1: Calculate the slope 'm' using the formula given above. After you substitute the respective values, m ≈ 1.518.

Step 2: Compute the y-intercept value. After you substitute the respective values, c ≈ 0.305.

Step 3: Substitute the values in the final equation

y = 1.518x + 0.305

Once you substitute the values, it should look something like this:

Let’s construct a graph that represents the y=mx + c line of best fit:
Now Tom can use the above equation to estimate how many T-shirts priced at $8 he can sell at the retail shop.

y = 1.518 × 8 + 0.305 = 12.45 T-shirts

Rounded up to whole units, this comes to about 13 T-shirts! That's how simple it is to make predictions using linear regression.
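As a quick arithmetic check of that prediction, using only the fitted values m = 1.518 and c = 0.305 from the example:

import math

m, c = 1.518, 0.305
price = 8
predicted = m * price + c
print(predicted)             # 12.449 -> about 12.45 T-shirts
print(math.ceil(predicted))  # 13 when rounded up to whole T-shirts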

Now let’s try to understand based on what factors can we confirm that the above line is the line of best fit.

The least squares regression method works by making the sum of the squared errors as small as possible, hence the name least squares. Basically, the distance between the line of best fit and each data point (the error) must be minimized as much as possible. This is the basic idea behind the least squares regression method.

A few things to keep in mind before implementing the least squares regression method:

 The data should be free of outliers, because outliers can lead to a biased and misleading line of best fit.
 The line of best fit can be refined iteratively until you obtain the line with the minimum possible sum of squared errors.
 The method also extends to curvilinear data, for example via the polynomial regression described earlier.
 Technically, the difference between the actual value of 'y' and the predicted value of 'y' is called the residual (it denotes the error).
