Everything You Need To Know About Linear Regression
K KAVITA MALI
24 May, 2024 • 19 min read
Introduction
Linear Regression, a foundational algorithm in data science, plays a pivotal role
in predicting continuous outcomes. This guide provides an in-depth
exploration of Linear Regression, covering its principles, applications, and
implementation in Python on a real-world dataset. From understanding simple
and multiple linear regression to unveiling its significance, limitations, and
practical use cases, this article serves as a comprehensive resource for both
beginners and practitioners. Join us on this journey through the intricacies of
linear regression, offering insights into its workings and hands-on application.
This article is part of the Data Science Blogathon, delivering valuable knowledge for data enthusiasts.
Learning Objectives
Understand the principles and applications of linear regression.
Differentiate between simple and multiple linear regression.
Learn how to implement linear regression in Python.
Grasp the concept of gradient descent and its use in optimizing linear
regression.
Explore evaluation metrics for assessing linear regression models.
Recognize the assumptions and potential pitfalls, such as overfitting and
multicollinearity, in linear regression.
What is Linear Regression?
Linear regression predicts the relationship between two variables by assuming
a linear connection between the independent and dependent variables. It
seeks the optimal line that minimizes the sum of squared differences between
predicted and actual values. Applied in various domains like economics and
finance, this method analyzes and forecasts data trends. It can extend to multiple linear regression, involving several independent variables, and to logistic regression, which is suitable for binary classification problems.
The graph above presents the linear relationship between the output(y) and
predictor(X) variables. The blue line is referred to as the best-fit straight line.
Based on the given data points, we attempt to plot a line that fits the points
the best.
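Such a plot can be reproduced in a few lines of Python. Below is a minimal sketch using synthetic data; the data and coefficients are made up for illustration, and np.polyfit stands in for the least-squares fit described in the next section:

# Plotting data points and the best-fit straight line (synthetic, illustrative data)
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, 50)
y = 2.0 * X + 1.0 + rng.normal(0, 2, 50)  # true line y = 2x + 1, plus noise

# np.polyfit with degree 1 performs an ordinary least-squares line fit
slope, intercept = np.polyfit(X, y, 1)

plt.scatter(X, y, label='data points')
plt.plot(X, slope * X + intercept, color='blue', label='best-fit line')
plt.xlabel('X (predictor)')
plt.ylabel('y (output)')
plt.legend()
plt.show()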
Simple Regression Calculation
To calculate the best-fit line, linear regression uses the traditional slope-intercept form given below,

Yi = β0 + β1Xi

where Yi = dependent variable, β0 = constant/intercept, β1 = slope/coefficient, and Xi = independent variable.
This algorithm explains the linear relationship between the dependent (output) variable y and the independent (predictor) variable X using the straight line Y = β0 + β1X.
But how does linear regression find out which is the best-fit line?

The goal of the linear regression algorithm is to get the best values for β0 and β1 to find the best-fit line. The best-fit line is the line that has the least error, which means the error between predicted values and actual values should be minimal.
Random Error(Residuals)
In regression, the difference between the observed value of the dependent variable (yi) and the predicted value (ŷi) is called the residual,

εi = yi − ŷi
where ŷi = β0 + β1Xi
What is the Best Fit Line?
In simple terms, the best-fit line is a line that fits the given scatter plot in the
best way. Mathematically, the best-fit line is obtained by minimizing the
Residual Sum of Squares (RSS).
Using the MSE (Mean Squared Error) function, we update the values of β0 and β1 such that the MSE value settles at its minimum. These parameters can be determined using the gradient descent method, such that the value of the cost function is at its minimum.
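Concretely, using the standard definitions with ŷi = β0 + β1xi,

RSS = Σi (yi − ŷi)²

MSE = (1/n) · Σi (yi − ŷi)² = RSS / n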
Gradient Descent Example
Let’s take an example to understand this. Imagine a U-shaped pit. You are
standing at the uppermost point in the pit, and your motive is to reach the
bottom of the pit. Suppose there is a treasure at the bottom of the pit, and you
can only take a discrete number of steps to reach the bottom. If you opt to take one small step at a time, you will eventually reach the bottom of the pit, but this will take a longer time. If you decide to take larger steps each time, you may reach the bottom sooner, but there is a chance that you overshoot the bottom of the pit and don't even land near it. In the gradient descent algorithm, the size of the steps you take can be considered as the learning rate, and this decides how fast the algorithm converges to the minima.
To update β0 and β1, we take gradients from the cost function. To find these gradients, we take partial derivatives of the cost function with respect to β0 and β1.
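Taking J to be the MSE cost function, the standard gradients and update rules are,

∂J/∂β0 = −(2/n) · Σi (yi − ŷi)

∂J/∂β1 = −(2/n) · Σi (yi − ŷi) · xi

β0 := β0 − α · ∂J/∂β0

β1 := β1 − α · ∂J/∂β1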
We need to minimize the cost function J. One of the ways to achieve this is to apply the batch gradient descent algorithm. In batch gradient descent, the values are updated at each iteration over the entire training set (the last two equations above show the update step). The partial derivatives are the gradients, and they are used to update the values of β0 and β1. Alpha (α) is the learning rate.
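As a concrete illustration, here is a minimal batch gradient descent sketch for simple linear regression. The synthetic data, learning rate, and iteration count are all illustrative assumptions, not values from the article:

# Batch gradient descent for simple linear regression on synthetic data
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2.0 * x + 1.0 + rng.normal(0, 1, 100)  # true line: y = 2x + 1, plus noise

b0, b1 = 0.0, 0.0  # initial guesses for intercept and slope
alpha = 0.01       # learning rate
n = len(x)

for _ in range(5000):
    y_pred = b0 + b1 * x
    # Partial derivatives of the MSE cost with respect to b0 and b1
    grad_b0 = -(2 / n) * np.sum(y - y_pred)
    grad_b1 = -(2 / n) * np.sum((y - y_pred) * x)
    # Move against the gradient, scaled by the learning rate
    b0 -= alpha * grad_b0
    b1 -= alpha * grad_b1

print(b0, b1)  # should approach the true intercept (1.0) and slope (2.0)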
Total Sum of Squares (TSS) is defined as the sum of squared deviations of the data points from the mean of the response variable. Mathematically, TSS is,
TSS = Σi (yi − ȳ)²

where ȳ is the mean of the sample response values.
R-squared measures the proportion of variance in the response that the model explains: R² = 1 − RSS/TSS. The closer R² is to 1, the better the fit.
To make this estimate unbiased, one has to divide the sum of the squared residuals by the degrees of freedom rather than the total number of data points in the model. This term is then called the Residual Standard Error (RSE). Mathematically, it can be represented as,

RSE = √( RSS / (n − 2) )

where n − 2 is the degrees of freedom for simple linear regression.
R-squared is a better measure than RMSE, because the value of Root Mean Squared Error depends on the units of the variables (i.e., it is not a normalized measure) and can change with a change in the units of the variables.
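To make the relationship between these metrics concrete, here is a small sketch computing them on toy arrays; the values are made up for illustration:

# Computing RSS, TSS, R-squared, RMSE, and RSE on toy data
import numpy as np

y_actual = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.3, 6.9, 9.2])
n = len(y_actual)

rss = np.sum((y_actual - y_pred) ** 2)
tss = np.sum((y_actual - y_actual.mean()) ** 2)

r_squared = 1 - rss / tss
rmse = np.sqrt(np.mean((y_actual - y_pred) ** 2))
rse = np.sqrt(rss / (n - 2))  # divide by degrees of freedom, n - 2

print(r_squared, rmse, rse)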
Assumptions of Linear Regression

1. Linear relationship: There should be a linear relationship between the independent variable(s) and the dependent variable.
2. Independence of residuals: The error terms should not be dependent on one another (as in time-series data, where the next value depends on the previous one). There should be no correlation between the residual terms; the presence of such correlation is known as autocorrelation. There should not be any visible patterns in the error terms.
3. Normal distribution of residuals: The error terms should be normally distributed.

4. Equal variance of residuals: The error terms must have constant variance. This phenomenon is known as homoscedasticity. The presence of non-constant variance in the error terms is referred to as heteroscedasticity. Generally, non-constant variance arises in the presence of outliers or extreme leverage values.
Hypothesis in Linear Regression
Once you have fitted a straight line on the data, you need to ask, "Is this straight line a significant fit for the data?" or "Does the beta coefficient explain the variance in the plotted data?" This is where the idea of hypothesis testing on the beta coefficient comes in. The null and alternative hypotheses in this case are:
H0: β1 = 0
HA: β1 ≠ 0
To test this hypothesis, we use a t-test; the test statistic for the beta coefficient is given by,

t = β̂1 / SE(β̂1)

where SE(β̂1) is the standard error of the estimated slope.
Multicollinearity
As multicollinearity makes it difficult to determine which variable is contributing to the prediction of the response variable, it can lead one to conclude incorrectly about the effects of a variable on the target variable. Though it does not affect the precision of the model's predictions, it is essential to properly detect and deal with the multicollinearity present in the model, as the random removal of any of these correlated variables can cause the coefficient values to swing wildly and even change signs.
Multicollinearity can be detected using the following methods.
Pairwise Correlations: Checking the pairwise correlations between different pairs of independent variables can provide useful insights for detecting multicollinearity.

Variance Inflation Factor (VIF): Pairwise correlations may not always be useful, since it is possible that no single variable can completely explain some other variable, yet several variables combined could do so. To check for these sorts of relationships between variables, one can use the VIF. The VIF quantifies the relationship of one independent variable with all the other independent variables, and is given by,

VIFi = 1 / (1 − Ri²)

where Ri² is the R-squared obtained by regressing the i-th independent variable on all the others.
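Both detection methods can be sketched in a few lines. Here `X` is assumed to be a pandas DataFrame containing only the predictor columns; the name is hypothetical, not defined in the article up to this point:

# Detecting multicollinearity: pairwise correlations and VIF
# `X` is assumed to be a DataFrame of predictor columns only
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Pairwise correlations between the independent variables
print(X.corr())

# VIF for each predictor; a constant is added so each auxiliary
# regression includes an intercept
X_const = sm.add_constant(X)
vif = pd.DataFrame({
    "feature": X_const.columns,
    "VIF": [variance_inflation_factor(X_const.values, i)
            for i in range(X_const.shape[1])],
})
print(vif)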
Overfitting
When a model learns every pattern and noise in the data to such an extent
that it affects the performance of the model on the unseen future dataset, it is
referred to as overfitting. The model fits the data so well that it interprets
noise as patterns in the data.
When a model has low bias and high variance, it ends up memorizing the data, causing overfitting. Overfitting causes the model to become specific rather than generic. This usually leads to high training accuracy and very low test accuracy.
Detecting overfitting is useful, but it doesn’t solve the actual problem. There
are several ways to prevent overfitting, which are stated below:
Cross-validation
If the training data is too small, add more relevant and clean data.
If the training data is too large, do some feature selection and remove unnecessary features.
Regularization
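As a brief illustration of two of the remedies above, here is a hedged sklearn sketch; `X_train` is assumed to be a 2-D feature matrix and `y_train` its target vector, and alpha=1.0 is an illustrative default, not a tuned choice:

# Cross-validation and ridge (L2) regularization with sklearn
# `X_train` (2-D) and `y_train` are assumed to exist from a prior split
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# 5-fold cross-validated R-squared scores for a plain linear model
lin = LinearRegression()
print(cross_val_score(lin, X_train, y_train, cv=5, scoring="r2"))

# Ridge regression shrinks the coefficients toward zero
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)
print(ridge.coef_, ridge.intercept_)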
Underfitting
Underfitting is not discussed as often as overfitting. When a model fails to learn from the training dataset and is also unable to generalize to the test dataset, it is referred to as underfitting. This type of problem can be detected very easily by the performance metrics.

When a model has high bias and low variance, it ends up not generalizing the data, causing underfitting. It is unable to find the hidden underlying patterns in the data. This usually leads to low training accuracy and very low test accuracy. The ways to prevent underfitting are stated below,
Increase the model complexity
Increase the number of features in the training data
Remove noise from the data.
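For the first remedy, one common way to increase model complexity while staying within the linear regression framework is to add polynomial features. A minimal sketch, again assuming a 2-D `X_train` and a `y_train`, with degree=2 as an illustrative choice:

# Adding polynomial features to increase model complexity
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# degree=2 adds squared and interaction terms to the feature set
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X_train, y_train)
print(model.score(X_train, y_train))  # R-squared on the training data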
Step 3: Visualization
Let us plot the scatter plot for the target variable vs. the predictor variables in a single plot to get some intuition. We will also plot a heatmap for all the variables,
#Importing seaborn library for visualizations
import seaborn as sns
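A minimal sketch of these plots, assuming the data sits in a DataFrame named `advertising` with columns 'TV', 'Radio', 'Newspaper', and 'Sales' (the 'Radio' and 'Newspaper' column names are assumptions based on a typical advertising dataset):

# Scatter plots of Sales against each predictor, plus a correlation heatmap
# The DataFrame name `advertising` and some column names are assumed
import matplotlib.pyplot as plt

sns.pairplot(advertising, x_vars=['TV', 'Radio', 'Newspaper'],
             y_vars=['Sales'], height=4, kind='scatter')
plt.show()

sns.heatmap(advertising.corr(), annot=True, cmap='YlGnBu')
plt.show()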
From the scatter plot and the heatmap, we can observe that 'Sales' and 'TV' have a higher correlation than the others, because 'TV' shows a linear pattern in the scatter plot and has a correlation of about 0.9 in the heatmap.
You can go ahead and play with the visualizations and can find out interesting
insights from the data.
Step 4: Performing Simple Linear Regression
Here, as the TV and Sales have a higher correlation we will perform the simple
linear regression for these variables.
We can use sklearn or statsmodels to apply linear regression; here, we will go ahead with statsmodels.

We first assign the feature variable, `TV` in this case, to the variable `X`, and the response variable, `Sales`, to the variable `y`.
X = advertising['TV']
y = advertising['Sales']
After assigning the variables, you need to split them into training and testing sets. You'll perform this by importing train_test_split from the sklearn.model_selection library. It is usually good practice to keep 70% of the data in your train dataset and the remaining 30% in your test dataset.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, test_size=0.3)
In this way, you can split the data into train and test sets.
One can check the shapes of train and test sets with the following code,
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
By default, the statsmodels library fits a line on the dataset that passes through the origin. In order to have an intercept, you need to manually add a constant using the add_constant attribute of statsmodels. Once you've added the constant to your X_train dataset, you can go ahead and fit a regression line using the OLS (Ordinary Least Squares) attribute of statsmodels, as shown below,
# Import statsmodels, needed for add_constant and OLS
import statsmodels.api as sm

# Add a constant to get an intercept
X_train_sm = sm.add_constant(X_train)

# Fit the regression line using 'OLS'
lr = sm.OLS(y_train, X_train_sm).fit()
One can see the values of betas using the following code,
# Print the parameters, i.e. the intercept and the slope of the fitted regression line
lr.params
Here, 6.948 is the intercept, and 0.0545 is the slope for the variable TV.
Now, let's see the evaluation metrics for this linear regression. You can simply view the summary using the following code,
#Performing a summary operation lists out all different parameters of the regression
print(lr.summary())
Summary
As you can see, this code gives you a brief summary of the linear regression.
Here are some key statistics from the summary:
1. The coefficient for TV is 0.054, with a very low p-value. The coefficient is statistically significant, so the association is not purely by chance.
2. R-squared is 0.816, meaning that 81.6% of the variance in `Sales` is explained by `TV`. This is a decent R-squared value.
3. The F-statistic has a very low p-value (practically zero), meaning that the model fit is statistically significant and the explained variance isn't purely by chance.
Step 5: Performing predictions on the test set
Now that you have fitted a regression line on your train dataset, it is time to make some predictions on the test data. For this, you first need to add a constant to the X_test data, as you did for X_train, and then you can simply go ahead and predict the y values corresponding to X_test using the predict attribute of the fitted regression line.
# Add a constant to X_test
X_test_sm = sm.add_constant(X_test)
# Predict the y values corresponding to X_test_sm
y_pred = lr.predict(X_test_sm)
You can see the predicted values with the following code,
y_pred.head()
To check how well the values are predicted on the test data we will check
some evaluation metrics using sklearn library.
# Importing libraries
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

# RMSE value
print("RMSE: ", np.sqrt(mean_squared_error(y_test, y_pred)))

# R-squared value
print("R-squared: ", r2_score(y_test, y_pred))
We are getting a decent score for both train and test sets.
Apart from `statsmodels`, there is another package namely `sklearn` that can
be used to perform linear regression. We will use the `linear_model` library
from `sklearn` to build the model. Since we have already performed a train-
test split, we don’t need to do it again.
There's one small step that we need to add, though. When there's only a single feature, we need to reshape it into a two-dimensional array so that the sklearn fit can be performed successfully. The code is given below,
# Reshape the 1-D feature into a 2-D array of shape (n_samples, 1)
X_train_lm = X_train.values.reshape(-1, 1)
X_test_lm = X_test.values.reshape(-1, 1)
One can check the changed shapes of the above arrays,

print(X_train_lm.shape)
print(X_test_lm.shape)
To get the intercept and slope values with sklearn, first fit a LinearRegression model on the reshaped data, then read its attributes,

from sklearn.linear_model import LinearRegression

# Fit the model on the reshaped training data
lr = LinearRegression()
lr.fit(X_train_lm, y_train)

# Get intercept and slope
print(lr.intercept_)
print(lr.coef_)
Conclusion
This is how we can perform simple linear regression.
In conclusion, Linear Regression is a cornerstone of data science, providing a robust framework for predicting continuous outcomes. As we unravel its intricacies and applications, it becomes evident that Linear Regression is a versatile tool with widespread implications. This article has served as a comprehensive guide, from its role in modeling relationships to real-world implementation in Python.
For those eager to delve deeper into the world of data science and machine
learning, Analytics Vidhya’s AI & ML BlackBelt+ program offers an immersive
learning experience. Elevate your skills and navigate the evolving landscape of
data science with mentorship and hands-on projects. Join BB+ today and
unlock the next level in your data science journey!
Key Takeaways
Linear regression predicts relationships between variables by fitting a line
that minimizes prediction errors.
Simple linear regression involves one predictor and one outcome variable,
while multiple linear regression includes several predictors.
The cost function, often minimized using gradient descent, determines the
best-fit line in linear regression.
Evaluation metrics like R-squared and RMSE measure the model's performance and fit.
Assumptions such as linearity, independence, normal distribution, and
constant variance of residuals are crucial for valid regression analysis.
Proper feature selection and validation techniques help mitigate overfitting
and multicollinearity in regression models.
The media shown in this article are not owned by Analytics Vidhya and are
used at the Author’s discretion.
K KAVITA MALI
24 May 2024
A Mathematics student turned Data Scientist. I am an aspiring data scientist who aims at learning all the necessary concepts of Data Science in detail. I am passionate about Data Science, with knowledge of data manipulation, data visualization, data analysis, EDA, Machine Learning, etc., which helps in finding valuable insights from data.