7 Regression Techniques
Linear and logistic regression are usually the first algorithms people learn in data science. Due to their popularity, a lot of analysts even end up thinking that they are the only forms of regression. Those who are slightly more involved think that they are the most important among all forms of regression analysis. The truth is that there are innumerable forms of regression that can be performed. Each form has its own importance and a specific condition where it is best suited to apply.
Contents
1. What is Regression Analysis?
2. Why do we use Regression Analysis?
3. What are the types of Regressions?
a. Linear Regression
b. Logistic Regression
c. Polynomial Regression
d. Stepwise Regression
e. Ridge Regression
f. Lasso Regression
g. ElasticNet Regression
4. How to select the right Regression Model?
What is Regression Analysis?
Regression analysis is a form of predictive modelling technique which investigates the relationship between a dependent variable (target) and independent variable(s) (predictors). This technique is used for forecasting, time series modelling and finding the causal effect relationship between variables. For example, the relationship between rash driving and the number of road accidents by a driver is best studied through regression.
Regression analysis is an important tool for modelling and analyzing data. Here, we fit a curve / line to the data points in such a manner that the distances of the data points from the curve or line are minimized. I'll explain this in more detail in the coming sections.
Why do we use Regression Analysis?
As mentioned above, regression analysis estimates the relationship between two
or more variables. Let’s understand this with an easy example:
Let's say you want to estimate the growth in sales of a company based on current economic conditions. You have recent company data which indicates that the growth in sales is around two and a half times the growth in the economy. For instance, if the economy is expected to grow by 2%, sales growth can be estimated at around 5%. Using this insight, we can predict future sales of the company based on current & past information.
There are multiple benefits of using regression analysis. They are as follows:
• It indicates the significant relationships between the dependent variable and the independent variables.
• It indicates the strength of the impact of multiple independent variables on a dependent variable.
What are the types of Regressions?
There are various kinds of regression techniques available for making predictions. These techniques are mostly driven by three metrics: the number of independent variables, the type of dependent variable, and the shape of the regression line. For the creative ones, you can even cook up new regressions if you feel the need to use a combination of the parameters above, which people haven't used before. But before you start that, let us understand the most commonly used regressions:
1. Linear Regression
It is one of the most widely known modeling techniques. In linear regression, the dependent variable is continuous and the relationship between the dependent and independent variable(s) is assumed to be linear, represented by the equation y = a + b*x + e, where a is the intercept, b is the slope of the line and e is the error term. The difference between simple linear regression and multiple linear regression is that multiple linear regression has more than one independent variable, whereas simple linear regression has only one. Now, the question is: "How do we obtain the best fit line?" This is accomplished with the Least Square Method, which fits the line by minimizing the sum of the squared vertical distances between the data points and the line.
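As a quick illustration, here is a minimal sketch of a least-squares fit with scikit-learn (the data is made up for demonstration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: y is roughly 2.5*x plus noise (values made up for illustration)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(50, 1))
y = 2.5 * x.ravel() + 1.0 + rng.normal(scale=1.0, size=50)

# Ordinary least squares: minimizes the sum of squared residuals
model = LinearRegression().fit(x, y)
print("intercept (a):", model.intercept_)
print("slope (b):", model.coef_[0])
print("prediction at x = 4:", model.predict([[4.0]])[0])
```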
Important Points:
• There must be a linear relationship between the independent and dependent variables.
• Multiple regression suffers from multicollinearity, autocorrelation and heteroskedasticity.
• Linear regression is very sensitive to outliers, which can badly affect the regression line and the forecasted values.
• In the case of multiple independent variables, we can go with forward selection, backward elimination or a stepwise approach to select the most significant variables.
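2. Logistic Regression
Logistic regression is used when the dependent variable is binary (0/1, True/False, Yes/No) in nature. Rather than fitting the outcome directly, it models the log odds of the event of interest:
ln(p / (1 - p)) = a + b1*x1 + b2*x2 + ...
where p is the probability of the event. The coefficients are estimated by maximum likelihood rather than least squares, and no linear relationship between the dependent and independent variables is required.

As a minimal sketch (the pass/fail data below is made up for demonstration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy binary outcome: pass/fail as a function of hours studied
hours = np.array([[0.5], [1.0], [1.5], [2.0], [2.5],
                  [3.0], [3.5], [4.0], [4.5], [5.0]])
passed = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])

# Fit the log odds of passing as a linear function of hours
clf = LogisticRegression().fit(hours, passed)
print("P(pass | 2.5 hours):", clf.predict_proba([[2.5]])[0, 1])
```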
3. Polynomial Regression
A regression equation is a polynomial regression equation if the power of the independent variable is greater than 1, for example:
y = a + b*x^2
In this regression technique, the best fit line is not a straight line. It is rather a curve that fits the data points.
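As a minimal sketch (the quadratic toy data is made up), a polynomial fit can be obtained by expanding the input into polynomial terms and then running ordinary least squares:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Toy data following a quadratic trend plus noise
rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 60).reshape(-1, 1)
y = 1.0 + 0.5 * x.ravel() ** 2 + rng.normal(scale=0.3, size=60)

# Degree-2 polynomial expansion followed by ordinary least squares
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(x, y)
print("prediction at x = 2:", model.predict([[2.0]])[0])
```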
Important Points:
• While there might be a temptation to fit a higher degree polynomial to get lower error, this can result in over-fitting. Always plot the relationship to see the fit, and focus on making sure that the curve matches the nature of the problem.
• Especially look out for the curve towards the ends and see whether those shapes and trends make sense. Higher degree polynomials can end up producing weird results on extrapolation.
4. Stepwise Regression
This form of regression is used when we deal with multiple independent variables and the selection of variables is done automatically, by observing statistical values like R-square, t-stats and the AIC metric to discern significant variables. Stepwise regression basically fits the regression model by adding/dropping covariates one at a time based on a specified criterion. Some of the most commonly used stepwise regression methods are listed below:
• Standard stepwise regression does two things: it adds and removes predictors as needed at each step.
• Forward selection starts with the most significant predictor in the model and adds a variable at each step.
• Backward elimination starts with all predictors in the model and removes the least significant variable at each step.
The aim of this modeling technique is to maximize the prediction power with the minimum number of predictor variables. It is one of the methods for handling higher dimensionality of a data set. A sketch of the procedure is shown below.
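As a minimal sketch, scikit-learn's SequentialFeatureSelector implements the same add/drop mechanics, although it uses a cross-validated score as its criterion rather than t-stats or AIC (the synthetic data here is made up for demonstration):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic data: 10 candidate predictors, only 3 of which are informative
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

# Forward selection: start empty and add the predictor that improves the
# cross-validated score most at each step, until 3 are selected
selector = SequentialFeatureSelector(LinearRegression(),
                                     n_features_to_select=3,
                                     direction="forward")
selector.fit(X, y)
print("selected columns:", np.flatnonzero(selector.get_support()))
```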
5. Ridge Regression
Ridge regression is used when the data suffers from multicollinearity, i.e. the independent variables are highly correlated. Recall the linear equation:
y = a + b*x
This equation also has an error term. The complete equation becomes:
y = a + b*x + e, [error term is the value needed to correct for a prediction error between the observed and predicted value]
y = a + b1*x1 + b2*x2 + ... + e, for multiple independent variables.
Ridge regression solves the multicollinearity problem by adding a penalty to the least squares objective:
minimize: Σ(y - ŷ)^2 + λ * Σβ^2
In this equation, we have two components. The first one is the least square term and the other one is lambda times the summation of β^2 (beta square), where β is the coefficient. This is added to the least square term in order to shrink the parameters so that they have very low variance.
Important Points:
• The assumptions of this regression are the same as for least squares regression, except that normality is not required.
• It shrinks the value of the coefficients but never reaches zero, so it performs no feature selection.
• This is a regularization method and uses L2 regularization.
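As a minimal sketch of the shrinkage effect (the correlated toy data is made up; scikit-learn's alpha plays the role of λ):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Two nearly identical predictors to mimic multicollinearity
rng = np.random.default_rng(2)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.01, size=100)
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=0.5, size=100)

# Plain least squares yields unstable, inflated coefficients here,
# while the L2 penalty shrinks them towards stable values
print("OLS coefficients:  ", LinearRegression().fit(X, y).coef_)
print("Ridge coefficients:", Ridge(alpha=1.0).fit(X, y).coef_)
```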
6. Lasso Regression
Similar to ridge regression, Lasso (Least Absolute Shrinkage and Selection Operator) also penalizes the size of the regression coefficients, but it uses absolute values instead of squares:
minimize: Σ(y - ŷ)^2 + λ * Σ|β|
This L1 penalty is capable of shrinking some coefficients to exactly zero, which makes lasso useful for feature selection as well as for reducing variance.
Important Points:
• The assumptions of this regression are the same as for least squares regression, except that normality is not required.
• It shrinks coefficients to exactly zero, which certainly helps in feature selection.
• This is a regularization method and uses L1 regularization.
• If a group of predictors is highly correlated, lasso picks only one of them and shrinks the others to zero.
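As a minimal sketch of the feature-selection effect (synthetic data, made up for demonstration):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data with many irrelevant predictors
X, y = make_regression(n_samples=200, n_features=15, n_informative=3,
                       noise=5.0, random_state=0)

# The L1 penalty drives the coefficients of irrelevant predictors to exactly zero
lasso = Lasso(alpha=1.0).fit(X, y)
print("non-zero coefficients:", np.flatnonzero(lasso.coef_))
```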
7. ElasticNet Regression
ElasticNet is a hybrid of lasso and ridge regression: it is trained with both L1 and L2 regularization. It is useful when there are multiple correlated features; lasso is likely to pick one of them at random, while elastic net is likely to pick the whole group.
Important Points:
• It encourages a group effect in the case of highly correlated variables.
• There is no limit on the number of selected variables.
• It can suffer from double shrinkage.
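As a minimal sketch (again on made-up synthetic data), scikit-learn's ElasticNet exposes the balance between the two penalties through l1_ratio:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

X, y = make_regression(n_samples=200, n_features=15, n_informative=3,
                       noise=5.0, random_state=0)

# l1_ratio balances the penalties: 1.0 is pure lasso, 0.0 is pure ridge
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)
print("non-zero coefficients:", np.flatnonzero(enet.coef_))
```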
Beyond these 7 most commonly used regression techniques, you can also look
at other models like Bayesian, Ecological and Robust regression.
How to select the right Regression Model?
Life is usually simple when you know only one or two techniques. One of the training institutes I know of tells its students: if the outcome is continuous, apply linear regression; if it is binary, use logistic regression! However, the higher the number of options available at our disposal, the more difficult it becomes to choose the right one. A similar case happens with regression models.