Linear Regression Report
2019-2020
INTRODUCTION
Numerical methods are techniques that formulate mathematical problems so that they can be solved with arithmetic operations, producing numerical results instead of closed-form results [1]. Although numerical methods are diverse in nature, they share one common characteristic: they inevitably require large numbers of repetitive arithmetic calculations [1]. It is therefore little wonder that the role of numerical methods in engineering problem solving has increased dramatically in recent years with the advent of fast, powerful digital computers [1].
Today, computers and numerical methods provide an alternative for such complicated calculations. Using computer power to obtain solutions directly, you can approach these calculations without recourse to simplifying assumptions or time-intensive techniques. Although analytical solutions are still extremely valuable both for problem solving and for providing insight, numerical methods represent alternatives that greatly enlarge your capabilities to confront and solve problems. As a result, more time is available for the use of your creative skills. Thus, more emphasis can be placed on problem formulation and solution interpretation and the incorporation of total system, or "holistic," awareness [1].
Since the late 1940s the widespread availability of digital computers has led to a veritable explosion in the use and development of numerical methods. At first, this growth was somewhat limited by the cost of access to large mainframe computers, and, consequently, many engineers continued to use simple analytical approaches in a significant portion of their work. Needless to say, the recent evolution of inexpensive personal computers has given us ready access to powerful computational capabilities [1].
We live in the era of vast quantities of data, powerful computers and artificial intelligence [2]. Data
science and machine learning are driving the identification of photos, the development of autonomous
vehicles, decisions in the financial and energy sectors, developments in medicine, social networks and
more. Linear regression is an important part of this [2].
There are many applications of numerical methods in computer science; among the most widely used techniques are mathematical modeling and problem solving, interpolation, optimization, and linear regression. Linear regression is one of the basic techniques in numerical methods and machine learning. Whether you're looking to do math, machine learning, or scientific computing, there's a fair chance you'll need it [2].
Linear Regression
Regression analysis is one of the main areas in statistics and machine learning [3]. There are plenty of
regression methods. Linear regression is one of them. The term regression is used when attempting to find
the relation between variables [3]. The relationship is used in Machine Learning, and in statistical modeling,
to predict the outcome of future events.
Regression analysis is used to estimate the relationship between a dependent variable and one or more
independent variables. This technique is widely applied to predict outputs, forecast data, analyze time series, and find causal dependencies between variables. There are
several types of regression techniques at hand based on the number of independent variables, the
dimensionality of the regression line, and the type of dependent variable. Out of these, the two most popular
regression techniques are linear regression and logistic regression.
Researchers use regression to indicate the strength of the impact of multiple independent variables on a
dependent variable on different scales. Regression has numerous applications. For example, consider a data
set consisting of weather information recorded over the past few decades. Using that data, we could forecast
weather for the next couple of years. Regression is also widely used in organizations and businesses to
assess risk and growth based on previously recorded data.
Linear regression is perhaps one of the most important and widely used regression techniques. It is among the simplest regression methods [4], and one of its key benefits is the ease of interpreting results. Linear regression uses the relationship between the data points to draw a straight line through them [4]. This line can then be used to predict future values. Linear regression is essentially a linear method for modeling the relationship between the independent variables and the dependent variable [4]. In other words, given a scatter plot with some points on it, the aim of linear regression is to find a line that is as close as possible to all of the points.
Applications of Linear Regression
There are many applications of linear regression, for example in machine learning, trend estimation, and economics. Linear regression is among the most common supervised machine learning algorithms because of its simplicity and the fact that it has been around for a long time [4]. Trend estimation also makes extensive use of linear regression, since regression predicts results with a continuous output [4]. In economics, many quantities are predicted using linear regression, such as labor demand and supply, consumption spending, and so on [4].
Applications of linear regression in machine learning:
Linear regression is one of the most commonly used algorithms in machine learning, and it helps businesses prepare for a volatile and dynamic environment. Machine learning needs to be supervised for computers to use their time and effort effectively and efficiently, and linear regression is one of the main ways to do so.
Simply put, machines need to be supervised in order to learn new things effectively. The great ability of machines is that they can learn about a problem and execute solutions seamlessly, which greatly reduces human error.
Linear regression is also used to find the relationship between forecasts and variables. A task is performed based on a dependent variable by analyzing the impact of an independent variable on it. Those proficient in a programming language such as Python can use the scikit-learn library to import a linear regression model, or create their own custom algorithm, before applying it to the machines. This means that linear regression is highly customizable and easy to learn. Organizations across the world are investing heavily in linear regression training for their employees in order to prepare the workforce for the future.
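As a small, purely illustrative sketch of this point (the numbers below are made up), importing and fitting scikit-learn's linear regression model can be as short as:

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])   # one independent variable
y = np.array([2.1, 4.0, 6.2, 7.9])           # dependent variable

model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)         # estimated intercept and slope
print(model.predict([[5.0]]))                # predict an unseen value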
The main benefits of linear regression in machine learning are as follows.
Forecasting
A main advantage of using a linear regression model in machine learning is the ability to forecast trends and make feasible predictions. Data scientists can use these predictions to make further deductions based on machine learning [5]. It is quick, efficient, and accurate, predominantly because machines process large volumes of data with minimal human intervention [5]. Once the algorithm is established, the process of learning becomes simple.
Beneficial to small businesses
By altering one or two variables, machines can understand the impact on sales. Since deploying linear regression is cost-effective, it is greatly advantageous to small businesses, which can use it to make short-term and long-term sales forecasts [5]. This means that small businesses can plan their resources well and create a growth trajectory for themselves [5]. They will also be able to understand the market and its preferences and learn about supply and demand [5].
Preparing Strategies
Since machine learning enables prediction, one of the biggest advantages of a linear regression model is the ability to prepare a strategy for a given situation well in advance and to analyze various outcomes [5]. Meaningful information can be derived from the regression-based forecasts, helping companies plan strategically and make executive decisions [5].
Types of Linear Regression
Linear regression is generally classified into three types: simple linear regression, multiple linear regression, and polynomial regression.
1. Simple Linear Regression
Simple linear regression has only one predictor variable and one dependent variable. The general equation for linear regression is $Y_i = \beta_0 + \beta_1 X_i$. If there is only one predictor available, the model is known as simple linear regression (SLR).
The equation for SLR is $Y_i = \beta_0 + \beta_1 X_{1i}$, where $\beta_1$ is the coefficient of the variable $X_1$ and $\beta_0$ is the intercept.
While executing the prediction, there is an error term associated with the equation:
$Y_i = \beta_0 + \beta_1 X_{1i} + \varepsilon$, where $\varepsilon$ is the error term associated with each predicted value.
The goal of the SLR model is to find the estimated values of $\beta_1$ and $\beta_0$ that keep the error term ($\varepsilon$) to a minimum.
Case study: implementing Simple Linear Regression in Python
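As a minimal sketch of such an implementation (on made-up data), the closed-form least-squares estimates of β0 and β1 can be computed directly in Python:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])      # predictor (made-up values)
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])      # response (made-up values)

# Closed-form least-squares estimates for simple linear regression
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()

print(beta0, beta1)                          # fitted intercept and slope
print(beta0 + beta1 * 6.0)                   # prediction for a new x value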
2. Multiple Linear Regression
Multiple linear regression (MLR) has more than one predictor variable:
$Y_i = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \dots + \beta_n X_n$
where $\beta_1$ is the coefficient for variable $X_1$, $\beta_2$ the coefficient for $X_2$, $\beta_3$ the coefficient for $X_3$, and so on, and $\beta_0$ is the intercept (constant term). While doing the prediction, there is an error term associated with the equation:
$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \dots + \varepsilon_i$, where $\varepsilon_i$ is the error term associated with each predicted value.
The goal of the MLR model is to find the estimated values of $\beta_0, \beta_1, \beta_2, \beta_3, \dots$ that keep the error term ($\varepsilon_i$) to a minimum.
Broadly speaking, supervised machine learning algorithms are classified into two types: regression, used to predict a continuous variable, and classification, used to predict a discrete variable.
Assumptions for Multiple Linear Regression
1. Linearity: There should be a linear relationship between the dependent and independent variables.
2. Multicollinearity: There should not be a high correlation between two or more independent variables. Multicollinearity can be checked using a correlation matrix, Tolerance, or the Variance Inflation Factor (VIF); a short VIF sketch is given after this list.
3. Homoscedasticity: If the variance of the errors is constant across the independent variables, the errors are said to be homoscedastic; the residuals should satisfy this. A plot of standardized residuals versus predicted values is commonly used to check homoscedasticity, and the Breusch-Pagan and White tests are well-known formal tests for it. (Q-Q plots, in contrast, are mainly used to check the normality of the residuals.)
4. Multivariate Normality: Residuals should be normally distributed.
5. Categorical Data: Any categorical data present should be converted into dummy variables.
6. Minimum records: There should be at least 20 records per independent variable.
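As referenced in the multicollinearity assumption above, here is a short sketch (the DataFrame of independent variables is assumed) of computing the VIF of each predictor with statsmodels:

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(X: pd.DataFrame) -> pd.DataFrame:
    # VIF is computed against a design matrix that includes an intercept;
    # values above roughly 5-10 are commonly read as problematic multicollinearity.
    exog = sm.add_constant(X)
    return pd.DataFrame({
        "variable": X.columns,
        "VIF": [variance_inflation_factor(exog.values, i + 1) for i in range(X.shape[1])],
    })

# Example usage (column names are assumptions): print(vif_table(df[["cylinders", "horsepower", "weight"]]))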
Mathematical formulation of Multiple Linear Regression
In linear regression, we try to find a linear relationship between the independent and dependent variables by fitting a linear equation to the data.
The equation of a straight line is $Y = mx + c$, where $m$ is the slope and $c$ is the intercept.
In linear regression we are trying to find the best values of $m$ and $c$ for the dependent variable $Y$ and the independent variable $x$. We consider many candidate lines, take the line that gives the least possible error, and use its $m$ and $c$ values to predict $y$.
The same concept is used in multiple linear regression, where we have several independent variables $x_1, x_2, x_3, \dots, x_n$ and the equation becomes
$Y = m_1 x_1 + m_2 x_2 + m_3 x_3 + \dots + m_n x_n + c$.
This equation no longer describes a line but a hyperplane in multiple dimensions.
Model Evaluation:
The model can be evaluated using the following metrics.
Mean absolute error (MAE): the mean of the absolute values of the errors,
$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert$.
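A tiny numerical check of this formula on made-up numbers, compared against scikit-learn's implementation:

import numpy as np
from sklearn.metrics import mean_absolute_error

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

mae_manual = np.mean(np.abs(y_true - y_pred))
print(mae_manual, mean_absolute_error(y_true, y_pred))   # both print 0.75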
Applications
1. The effect of an independent variable on the dependent variable can be quantified.
2. It is used to predict trends.
3. It is used to find how much change can be expected in the dependent variable for a given change in an independent variable.
Case study: implementing Multiple Linear Regression in Python (the worked example appears in the section "Multiple Linear Regression With scikit-learn" below)
3. Polynomial Regression
Polynomial regression fits a non-linear relationship: the dependent variable is modeled as an nth-degree polynomial of the independent variable, while the model remains linear in its coefficients.
Equation of polynomial regression:
$Y_i = \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + \beta_3 X_i^3 + \beta_4 X_i^4 + \beta_5 X_i^5 + \dots + \beta_n X_i^n + \varepsilon \quad (i = 1, 2, 3, \dots, n)$
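A short sketch of polynomial regression in Python (made-up data; the degree of 3 is an arbitrary illustrative choice): expand the input into polynomial features, then fit an ordinary linear model on them.

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50).reshape(-1, 1)
y = 1.0 + 2.0 * x.ravel() - 0.3 * x.ravel() ** 2 + rng.normal(0.0, 1.0, 50)

poly_model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
poly_model.fit(x, y)
print(poly_model.predict([[4.5]]))           # prediction from the fitted polynomial curve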
Underfitting and Overfitting
When we fit a model, we try to find the optimised, best-fit line, which can describe the impact of the
change in the independent variable on the change in the dependent variable by keeping the error term
minimum. While fitting the model, there can be two events which will lead to the bad performance of the
model. These events are
• Underfitting
• Overfitting
Underfitting
Underfitting is the condition in which the model cannot fit the data well enough. An under-fitted model has low accuracy because it is unable to capture the relationship, trend or pattern in the training data. Underfitting can be avoided by using more data or by optimising the parameters of the model.
Overfitting
Overfitting is the opposite case: the model predicts very well on training data but is not able to predict well on test or validation data. The main reason for overfitting is that the model memorises the training data and is unable to generalise to unseen data. Overfitting can be reduced by doing feature selection or by using regularisation techniques.
For the simple linear regression case study, a scatter plot of the data shows a clear negative linear relationship between horsepower and miles per gallon (mpg): as horsepower increases, mpg decreases.
Now, let's perform the simple linear regression.
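A sketch of how such a fit could be produced with statsmodels; the data file name is an assumption, and the column names "horsepower" and "mpg" are taken from the discussion above:

import numpy as np
import pandas as pd
import statsmodels.api as sm

data = pd.read_csv("auto-mpg.csv")           # assumed file name
X = sm.add_constant(data["horsepower"])      # add the intercept column
slr = sm.OLS(data["mpg"], X).fit()

print(slr.params)                            # intercept and slope (the report obtains ~39.94 and ~-0.16)
print(slr.rsquared)                          # R-squared
print(np.sqrt(np.mean(slr.resid ** 2)))      # RMSE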
From the output of the SLR model, the equation of the best fit line is
mpg = 39.94 + (-0.16) * horsepower
Comparing this with the SLR model equation $Y_i = \beta_0 + \beta_1 X_i$ gives $\beta_0 = 39.94$ and $\beta_1 = -0.16$.
Now, check the model's relevance by looking at its R² and RMSE values.
The R² and RMSE (root mean square error) values are 0.6059 and 4.89 respectively. This means that about 60% of the variance in mpg is explained by horsepower. For a simple linear regression model this result is acceptable, but not especially good, since other variables such as cylinders and acceleration could also have an effect. The RMSE value is reasonably low. Let's check how the line fits the data.
From the graph, we can infer that the best fit line is able to explain the effect of horsepower on mpg.
Multiple Linear Regression With scikit-learn
Since the data is already loaded, we can move on to multiple linear regression. The data set has 5 independent variables and 1 dependent variable (mpg).
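A sketch of this fit with scikit-learn (again, the data file name is an assumption; the feature names match the variables listed in the fitted equation below):

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

data = pd.read_csv("auto-mpg.csv")           # assumed file name
features = ["cylinders", "displacement", "horsepower", "weight", "acceleration"]
X = data[features].values
y = data["mpg"].values

mlr = LinearRegression().fit(X, y)
pred = mlr.predict(X)

print("intercept:", mlr.intercept_)
print("coefficients:", dict(zip(features, mlr.coef_)))
print("R2:", r2_score(y, pred))                          # the report obtains about 0.707
print("RMSE:", np.sqrt(mean_squared_error(y, pred)))     # about 4.21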
The best fit line for multiple linear regression is
mpg = 46.26 - 0.4*cylinders - 8.313e-05*displacement - 0.045*horsepower - 0.01*weight - 0.03*acceleration
Comparing this with $Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_3 X_{i3} + \dots$ gives
$\beta_0$ (intercept) = 46.26, $\beta_1$ = -0.4, $\beta_2$ = -8.313e-05, $\beta_3$ = -0.045, $\beta_4$ = -0.01, $\beta_5$ = -0.03.
Now, let's check the R² and RMSE values.
The R² and RMSE (root mean square error) values are 0.707 and 4.21 respectively. This means that about 71% of the variance in mpg is explained by the predictors, which indicates a good model. Compared with the simple linear regression results, R² is higher and RMSE is lower, which means that adding more variables to the model improved its performance. In general, the higher the R² and the lower the RMSE, the better the model.
Multiple Linear Regression: implementation using Python. Let us take a small data set and try building a model using Python.
The data set contains the variables Temp and Sel; we are trying to predict Sel (the dependent variable) from the independent variable Temp. We first check the assumptions on this data set.
1. Check for Linearity
A scatter plot of Sel against Temp shows a linear relationship between the two variables.
2. Check for Multicollinearity
General Linear Least Squares
Nonlinear Regression
The two main metrics we use to evaluate linear regression models are accuracy and error. For a model
to be highly accurate with minimum error, we need to achieve low bias and low variance. We partition the
data into training and testing data sets to keep bias in check and ensure accuracy.
A DEEP DIVE INTO LINEAR REGRESSION
Before we build a supervised machine learning model, all we have is data comprising inputs and outputs. To estimate the dependency between them using linear regression, we pick two random initial values for the model parameters: the slope m (often called the weight) and the intercept c (the bias). We then take a tuple from the data set, feed the input value into the equation y = mx + c, and predict a new value. Finally, we calculate the loss incurred by the predicted value using a loss function.
The values of m and c are picked randomly, but they must be updated to minimize the error. We therefore use the loss function as a metric to evaluate the model, and our goal is to obtain the line that best reduces the error. The most common loss function is the mean squared error, mathematically represented as
$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$.
If we did not square the error, the positive and negative errors would cancel each other out.
When we train the model to find the ideal values of m and c, different values yield different errors. Out of all the candidate values, there will be one point where the error is minimized, and the parameters corresponding to that point give the optimal solution. This is where gradient descent comes into the picture.
Gradient descent is an optimization algorithm that finds the values of the parameters (coefficients) of a function f that minimize a cost function. The learning rate defines the rate at which the parameters are updated: it controls how strongly we adjust the weights with respect to the loss gradient. The lower its value, the smaller the steps we take down the slope as the weights are updated.
Both the m and c values are updated as follows:
$m \leftarrow m - \alpha \frac{\partial \mathrm{MSE}}{\partial m}, \qquad c \leftarrow c - \alpha \frac{\partial \mathrm{MSE}}{\partial c}$,
where $\frac{\partial \mathrm{MSE}}{\partial m} = -\frac{2}{n}\sum_{i=1}^{n} x_i (y_i - \hat{y}_i)$, $\frac{\partial \mathrm{MSE}}{\partial c} = -\frac{2}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)$, and $\alpha$ is the learning rate.
Once the model is trained and achieves a minimum error, we can fix the values of m and c. Ultimately, the best fit line is the line that passes as close as possible to all the data points. Linear regression relies on the following assumptions:
1. Linearity: The relationship between independent variables and the mean of the dependent variable
is linear.
2. Homoscedasticity: The variance of the residuals should be constant.
3. Independence: Observations are independent of each other.
4. Normality: For any fixed value of an independent variable, the dependent variable is normally
distributed.
As such, linear regression was developed in the field of statistics and is studied as a model for
understanding the relationship between input and output numerical variables, but has been borrowed by
machine learning. It is both a statistical algorithm and a machine learning algorithm.
Linear regression is an attractive model because its representation is so simple: a linear equation that combines a specific set of input values (x), the solution of which is the predicted output for that set of input values (y). As such, both the input values (x) and the output value (y) are numeric.
Linear Regression Learning the Model
Learning a linear regression model means estimating the values of the coefficients used in the
representation with the data that we have available.
In this section we will take a brief look at four techniques to prepare a linear regression model. This is
not enough information to implement them from scratch, but enough to get a flavor of the computation and
trade-offs involved.
There are many more techniques because the model is so well studied. Take note of Ordinary Least
Squares because it is the most common method used in general. Also take note of Gradient Descent as it is
the most common technique taught in machine learning classes.
1. Simple Linear Regression
With simple linear regression when we have a single input, we can use statistics to estimate the
coefficients. This requires that you calculate statistical properties from the data such as means, standard
deviations, correlations and covariance. All of the data must be available to traverse and calculate statistics.
2. Ordinary Least Squares
When we have more than one input we can use Ordinary Least Squares to estimate the values of the
coefficients. The Ordinary Least Squares procedure seeks to minimize the sum of the squared residuals.
This means that given a regression line through the data we calculate the distance from each data point to
the regression line, square it, and sum all of the squared errors together. This is the quantity that ordinary
least squares seeks to minimize. This approach treats the data as a matrix and uses linear algebra operations
to estimate the optimal values for the coefficients. It means that all of the data must be available and you
must have enough memory to fit the data and perform matrix operations. It is unusual to implement the
Ordinary Least Squares procedure yourself unless as an exercise in linear algebra. It is more likely that you
will call a procedure in a linear algebra library. This procedure is very fast to calculate.
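As a sketch of what "calling a procedure in a linear algebra library" can look like in Python (made-up data), the coefficients can be obtained with a least-squares solver:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                # two input variables
y = 3.0 + 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Stack a column of ones so the first coefficient becomes the intercept,
# then solve min ||X_design @ b - y||^2.
X_design = np.column_stack([np.ones(len(X)), X])
coef, residuals, rank, sv = np.linalg.lstsq(X_design, y, rcond=None)
print(coef)                                  # approximately [3.0, 1.5, -2.0]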
3. Gradient Descent
When there are one or more inputs you can use a process of optimizing the values of the coefficients by
iteratively minimizing the error of the model on your training data. This operation is called Gradient
Descent and works by starting with random values for each coefficient. The sum of the squared errors is
calculated for each pair of input and output values. A learning rate is used as a scale factor and the
coefficients are updated in the direction towards minimizing the error. The process is repeated until a
minimum sum squared error is achieved or no further improvement is possible. When using this method,
you must select a learning rate (alpha) parameter that determines the size of the improvement step to take
on each iteration of the procedure. Gradient descent is often taught using a linear regression model because
it is relatively straightforward to understand. In practice, it is useful when you have a very large dataset
either in the number of rows or the number of columns that may not fit into memory.
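A small from-scratch sketch of this procedure for a single input (made-up data; the learning rate and iteration count are arbitrary illustrative choices):

import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 10.0, 200)
y = 4.0 + 2.5 * x + rng.normal(scale=1.0, size=200)

m, c = rng.normal(), rng.normal()            # random starting values for the coefficients
alpha = 0.01                                 # learning rate
for _ in range(5000):
    y_hat = m * x + c
    grad_m = -2.0 * np.mean(x * (y - y_hat))   # gradient of the mean squared error w.r.t. m
    grad_c = -2.0 * np.mean(y - y_hat)         # gradient w.r.t. c
    m -= alpha * grad_m
    c -= alpha * grad_c

print(m, c)                                  # should approach the true values 2.5 and 4.0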
4. Regularization
There are extensions of the training of the linear model called regularization methods. These seek both to minimize the sum of the squared errors of the model on the training data (using ordinary least squares) and to reduce the complexity of the model (for example, the number or the absolute size of all the coefficients in the model). Two popular examples of regularization procedures for linear regression are:
• Lasso Regression: where Ordinary Least Squares is modified to also minimize the absolute sum of the coefficients (called L1 regularization).
• Ridge Regression: where Ordinary Least Squares is modified to also minimize the sum of the squared coefficients (called L2 regularization).
These methods are effective to use when there is collinearity in your input values and ordinary least
squares would overfit the training data.
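A short sketch of both variants with scikit-learn (made-up data with a deliberately collinear column; the alpha values are arbitrary illustrative penalty strengths):

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
X[:, 4] = X[:, 0] + 0.01 * rng.normal(size=100)        # nearly collinear with the first column
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty: can shrink some coefficients to exactly zero
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks coefficients towards zero
print(lasso.coef_)
print(ridge.coef_)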
Preparing Data for Linear Regression
Linear regression has been studied at great length, and there is a lot of literature on how your data must be structured to make the best use of the model.
As such, there is a lot of sophistication in these requirements and expectations, which can be intimidating. In practice, you can use the following rules more as rules of thumb when using Ordinary Least Squares regression, the most common implementation of linear regression.
• Linear Assumption. Linear regression assumes that the relationship between your input and output
is linear. It does not support anything else. This may be obvious, but it is good to remember when
you have a lot of attributes. You may need to transform data to make the relationship linear (e.g. log
transform for an exponential relationship).
• Remove Noise. Linear regression assumes that your input and output variables are not noisy.
Consider using data cleaning operations that let you better expose and clarify the signal in your data.
This is most important for the output variable and you want to remove outliers in the output variable
(y) if possible.
• Remove Collinearity. Linear regression will over-fit your data when you have highly correlated
input variables. Consider calculating pairwise correlations for your input data and removing the most
correlated.
• Gaussian Distributions. Linear regression will make more reliable predictions if your input and output variables have a Gaussian distribution. You may get some benefit from using transforms (e.g. log or Box-Cox) on your variables to make their distribution more Gaussian-looking.
• Rescale Inputs: Linear regression will often make more reliable predictions if you rescale input
variables using standardization or normalization.
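A brief sketch of two of these steps in Python (a made-up DataFrame; the column names are purely illustrative): a log transform for a skewed variable and standardization of the inputs.

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
df = pd.DataFrame({
    "income": rng.lognormal(mean=10.0, sigma=1.0, size=500),   # skewed, roughly exponential-looking
    "age": rng.normal(40.0, 12.0, size=500),
})

df["log_income"] = np.log(df["income"])      # log transform makes the distribution more Gaussian

scaler = StandardScaler()                    # rescale inputs to zero mean and unit variance
X_scaled = scaler.fit_transform(df[["log_income", "age"]])
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))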
Advantages and disadvantages of Linear Regression
Advantages:
1. Linear regression performs well when the relationship in the data is close to linear. We can use it to find the nature of the relationship among the variables.
2. Linear regression is easy to implement and interpret, and very efficient to train.
3. Linear regression handles over-fitting reasonably well using dimensionality reduction techniques, regularization, and cross-validation.
4. Its most common use is to predict results for a given data set.
5. It can also be used to extrapolate beyond a specific data set.
Disadvantages:
1. It assumes linearity between the dependent and independent variables.
2. It is often quite prone to noise and overfitting.
3. Linear regression is quite sensitive to outliers.
4. It is prone to multicollinearity.
Linear Regression Use Cases
• Sales Forecasting
• Risk Analysis
• Housing applications, to predict prices and other factors
• Finance applications, to predict stock prices, evaluate investments, etc.
The basic idea behind linear regression is to find the relationship between the dependent and independent
variables. It is used to get the best fitting line that would predict the outcome with the least error. We can
use linear regression in simple real-life situations, like predicting the SAT scores with regard to the number
of hours of study and other decisive factors.
Bibliography
[1] S. C. Chapra and R. P. Canale, Numerical Methods for Engineers, New York: McGraw-Hill Education, 2015.
[2] M. Stojiljkovic, "Linear Regression in Python," Real Python, 2012-2020. [Online]. Available: https://realpython.com/linear-regression-in-python/. [Accessed 6 May 2020].
[3] "Machine Learning - Linear Regression," w3schools, [Online]. Available: https://www.w3schools.com/python/python_ml_linear_regression.asp. [Accessed 6 May 2020].
[4] A. Adib, "Basics of linear regression," Medium, 6 November 2018. [Online]. Available: https://medium.com/datadriveninvestor/basics-of-linear-regression-9b529aeaa0a5. [Accessed 6 May 2020].
[5] Imarticus, "Linear Regression and Its Applications in Machine Learning," Imarticus, 19 March 2019. [Online]. Available: https://blog.imarticus.org/linear-regression-and-its-applications-in-machine-learning/. [Accessed 9 May 2020].