Additional Material - Linear Regression

Linear Regression

Prerequisite:
Statistics- mean, median, mode, variance and standard deviation, correlation, and
covariance.
Exploratory Data Analysis- Data distribution, Scatter plot, correlation matrix, Heat map.

Objectives:

• Understand what linear regression is and the motivation behind it.
• Understand the best fit line and the residuals of regression.
• Use the least squares method to find the best fit line of regression.
• Use the gradient descent method to find the best fit line of regression.

Linear Regression
Linear regression is a way to identify a relationship between two or more variables and to use this relationship to predict the value of one variable for given value(s) of the other variable(s). Linear regression assumes that the relationship between the variables can be modeled by a linear equation, i.e. the equation of a line. The variable used in the prediction is termed the independent/explanatory/regressor variable, while the predicted variable is termed the dependent/target/response variable. Linear regression assumes that the independent variables are related linearly to the response variable.
y = c + mx

In machine learning and regression literature the above equation is used in the form:

y = w_0 + w_1 x

where w0 is the intercept on the y-axis, w1 is the slope of the line, x is the explanatory variable and y is the response variable.
Motivational Examples

1. Suppose we are given sales data on house prices. For each house we have complete information on its plot size and the price at which the house was sold. Can we use this information to predict the price of a house for a given plot size? The problem can be

modeled as linear regression with plot size (x) as the explanatory variable and house price (y) as the response variable.

HousePrice(y) = w_0 + w_1 · PlotSize
2. Consider a scenario where we are given medical data about some patients. The data contains each patient's blood pressure along with his/her age. Can we use this information to predict the blood pressure of a patient of a given age? This problem is modeled as a regression problem with age as the explanatory variable and blood pressure as the response variable.

BloodPressure(y) = w_0 + w_1 · Age

3. Next consider a problem where we need to predict the price of a used car. The sale price of a used car depends on many attributes, some of which may be mileage (km/litre), model (Maruti, Hyundai, Honda, Toyota, Tata) and segment (Small, Medium, Luxury). In this scenario the sale price is the response or target variable, and it depends on mileage, model and segment (the explanatory variables). This problem is still modeled as a linear regression problem, but it belongs to multiple linear regression, as more than one explanatory variable is involved in the prediction of the target variable.

SalePrice(y) = w_0 + w_1 · Mileage + w_2 · Model + w_3 · Segment

In real scenarios, we rarely have just one explanatory variable, so we use multiple linear regression rather than simple linear regression. However, here we take an example of simple linear regression to understand the fundamentals of regression.

Example: Consider a toy example where we are interested in finding the effect of study hours per day on examination grades, and in predicting the grade of a student for given study hours. We have sample data for six students giving their grades and total study hours per day.
From the given data we get an idea that study hours per day and grades have a positive relationship, so one can say that a student who spends more hours studying per day is likely to get better grades in his/her examination.
The scatter plot of the given data is shown in Figure-1. A scatter plot is a useful tool for judging the strength of the relationship between two variables. If the scatter plot does not suggest any relationship, then fitting a linear model to the data is probably not useful. A valuable measure to quantify the relationship between two variables is the correlation coefficient.
The correlation coefficient takes values between -1 and 1 to indicate the strength of the relationship. -1 (minus one) indicates a strong negative relationship, where an increase in one variable results in a decrease in the other variable. 1 (plus one) indicates a strong positive relationship, where an increase in one variable results in an increase in the other variable too. 0 (zero) shows no correlation between the two variables.
From the scatter plot shown in Figure-1, we get some intuition that there is a positive effect of study hours per day on exam grades.

Table-1

Study Hours per day   Grades
6                     7
5                     5
4                     2
7                     9
8                     6
2                     3

Figure-1: Scatter graph of study hours per day vs. grades
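As a quick check of this intuition, here is a minimal sketch (assuming Python with NumPy and Matplotlib; the variable names are our own) that computes the correlation coefficient for the Table-1 data and draws a scatter plot like Figure-1:

```python
import numpy as np
import matplotlib.pyplot as plt

# Table-1 data: study hours per day (x) and grades (y)
x = np.array([6, 5, 4, 7, 8, 2])
y = np.array([7, 5, 2, 9, 6, 3])

# Pearson correlation coefficient: values near +1 indicate a strong positive relationship
r = np.corrcoef(x, y)[0, 1]
print(f"correlation coefficient r = {r:.3f}")

# Scatter plot corresponding to Figure-1
plt.scatter(x, y)
plt.xlabel("Study Hours per day")
plt.ylabel("Grades")
plt.title("Study hours vs. grades")
plt.show()
```

For this data the coefficient comes out to roughly 0.76, consistent with the positive relationship suggested by the scatter plot.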

To fit the given data we can draw multiple lines, out of which one will be our best fit line. Let the equation of the linear model be given by:

y(Grades) = w_0 + w_1 · x(Study Hours per day)

Now we need to define the criteria for the best fit line.
Any line we might come up with has some fixed intercept w0 and slope w1. This line may pass through some of the data points but cannot cover all of them. In our example we are given six data points; let us label these points (x_1, y_1), (x_2, y_2), (x_3, y_3), (x_4, y_4), (x_5, y_5) and (x_6, y_6), with values (6, 7), (5, 5), (4, 2), (7, 9), (8, 6) and (2, 3). For any given point x_i the prediction yhat_i is given by:

yhat_i = w_0 + w_1 x_i

Unless the line passes through (x_i, y_i), the value of yhat_i differs from the observed value y_i. The difference between the two values is denoted as the error or residual of the regression:

e_i = y_i − yhat_i

The best line is the line which minimizes the sum of the squared errors:

Σ e_i² = (y_1 − yhat_1)² + (y_2 − yhat_2)² + ……… + (y_n − yhat_n)²

The following graphs illustrate the process of finding the best line of regression.

Figure-2: (a) Fit a line of equation y = x. (b) Fit a line of equation y = 1.667 + 0.667x. (c) Fit a line of equation y = 0.457 + 0.914x. (d) Combine all three lines and choose the best fit line with minimum residual.
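The comparison in Figure-2(d) can also be made numerically. The sketch below (Python/NumPy assumed; the helper name sse is ours) computes the sum of squared residuals for each of the three candidate lines; the line y = 0.457 + 0.914x should give the smallest value:

```python
import numpy as np

x = np.array([6, 5, 4, 7, 8, 2])
y = np.array([7, 5, 2, 9, 6, 3])

def sse(w0, w1):
    """Sum of squared residuals for the line yhat = w0 + w1 * x."""
    residuals = y - (w0 + w1 * x)
    return np.sum(residuals ** 2)

# The three candidate lines from Figure-2
candidates = {
    "y = x":              (0.0,   1.0),
    "y = 1.667 + 0.667x": (1.667, 0.667),
    "y = 0.457 + 0.914x": (0.457, 0.914),
}

for label, (w0, w1) in candidates.items():
    print(f"{label}: SSE = {sse(w0, w1):.3f}")
```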

Methods to Find the Best Fit Line - We can use two different methods to find the best fit line of regression.
1. Principle of Least Squares.
2. Gradient Descent.

Least Squares - Let the equation of the regression line of y on x be:

y = w_0 + w_1 x

According to the least squares principle, the equations to estimate the values of w0 and w1 are:
Σ_{i=1}^{n} y_i = n·w_0 + w_1 Σ_{i=1}^{n} x_i        ……… (1)

Σ_{i=1}^{n} x_i y_i = w_0 Σ_{i=1}^{n} x_i + w_1 Σ_{i=1}^{n} x_i²        ………. (2)

Dividing equation (1) by n we get,


(1/n) Σ_{i=1}^{n} y_i = w_0 + (w_1/n) Σ_{i=1}^{n} x_i

ȳ = w_0 + w_1 x̄        ………. (3)
Thus we can say that the line of regression always passes through the point (x̄, ȳ).
Now we need to estimate the values of w0 and w1. We know,
cov(x, y) = (1/n) Σ_i x_i y_i − x̄ ȳ   ⟹   (1/n) Σ_i x_i y_i = cov(x, y) + x̄ ȳ        ……….. (4)

Also,
var(x) = (1/n) Σ_{i=1}^{n} x_i² − x̄²   ⟹   (1/n) Σ_{i=1}^{n} x_i² = var(x) + x̄²        ……….. (5)

Dividing equation (2) by n and using equations (4) and (5),


cov(x, y) + x̄ ȳ = w_0 x̄ + w_1 (var(x) + x̄²)        …….. (6)

By solving equations (3) and (6) we get,

w_1 = cov(x, y) / var(x)        ……… (7)

and

w_0 = ȳ − w_1 x̄        ……… (8)
The straight line y = w_0 + w_1 x which satisfies the least squares condition, i.e. for which the error E{(y − (w_0 + w_1 x))²} is minimum with respect to variations in w0 and w1, is called the line of regression of y on x. Let us use these equations to estimate the best fit line for the data given in Table-1.
To estimate w0 and w1 we need to find the covariance between x and y, cov(x, y), the variance of x, var(x), and the means of the x and y variables (x̄ and ȳ). For the given data we get,
x̄ = (6 + 5 + 4 + 7 + 8 + 2) / 6 = 5.333
ȳ = (7 + 5 + 2 + 9 + 6 + 3) / 6 = 5.333
cov(x, y) = 3.5556
var(x) = 3.8889
When we substitute these values in equations (7) and (8) we get,

w_1 = 0.9143 and w_0 = 0.4571

which are exactly the coefficients shown in Figure-2(c) for the line y = 0.457 + 0.914x, the line which gives the minimum residual among all the lines.
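These closed-form estimates can be reproduced in a few lines. The following is a minimal sketch (Python/NumPy assumed), using the population covariance and variance exactly as in equations (4), (5), (7) and (8):

```python
import numpy as np

x = np.array([6, 5, 4, 7, 8, 2])
y = np.array([7, 5, 2, 9, 6, 3])

# Population covariance and variance, matching equations (4) and (5)
cov_xy = np.mean(x * y) - np.mean(x) * np.mean(y)
var_x = np.mean(x ** 2) - np.mean(x) ** 2

# Equations (7) and (8): slope and intercept of the least squares line
w1 = cov_xy / var_x
w0 = np.mean(y) - w1 * np.mean(x)

print(f"w0 = {w0:.4f}, w1 = {w1:.4f}")  # expected: roughly 0.4571 and 0.9143
```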
Performance metrics for least squares regression - Performance metrics are the way to quantify and compare the efficiency of any machine learning model. Least squares regression uses the R² (R-squared) and R²_adj (adjusted R-squared) metrics to measure the performance of a regression model; R²_adj (adjusted R-squared) is used with multiple linear regression. Both of these metrics denote how well the selected independent variable(s) explain the variation of the response variable. The equations for R² (R-squared) and R²_adj (adjusted R-squared) are given by:

R² = 1 − [ (1/n) Σ_{i=1}^{n} (y_i − yhat_i)² ] / [ (1/n) Σ_{i=1}^{n} (y_i − ȳ)² ]

The numerator term gives the average of the squared residuals and the denominator gives the variance of the y (response) values. A small value of R², or equivalently a high mean residual error, denotes a poor model.


R²_adj = 1 − (1 − R²)(n − 1) / (n − k − 1)

where n is the total number of observations in the data and k is the number of explanatory variables. R²_adj (adjusted R-squared) is a slight improvement over R² (R-squared), obtained by adding a correction term to it. The problem with R² (R-squared) is that it increases as the number of terms in the model increases, irrespective of whether the added terms contribute significantly to the prediction or not. On the contrary, the value of R²_adj (adjusted R-squared) improves only if useful terms are added to the model. The relation between R² (R-squared) and R²_adj (adjusted R-squared) is:

R²_adj ≤ R²
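As an illustration, here is a small sketch (Python/NumPy assumed; the function names are our own) that computes R² and adjusted R² from actual and predicted values:

```python
import numpy as np

def r_squared(y, y_hat):
    """R^2 = 1 - (mean squared residual) / (variance of y)."""
    ss_res = np.mean((y - y_hat) ** 2)
    ss_tot = np.mean((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

def adjusted_r_squared(y, y_hat, k):
    """Adjusted R^2 for a model with k explanatory variables."""
    n = len(y)
    r2 = r_squared(y, y_hat)
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

# Example with the best fit line for the Table-1 data
x = np.array([6, 5, 4, 7, 8, 2])
y = np.array([7, 5, 2, 9, 6, 3])
y_hat = 0.4571 + 0.9143 * x
print(r_squared(y, y_hat), adjusted_r_squared(y, y_hat, k=1))
```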

Gradient Descent - Let the equation of the regression line of y on x be:

y = w_0 + w_1 x

This straight line tries to approximate the relationship between x and y for the given set of data. By varying the values of w0 and w1 we can find the best fit line. From the above discussion we know that the best fit line is the one which minimizes the total error in prediction. The gradient descent method defines a cost function of the parameters w0 and w1 and uses a systematic approach to optimize the values of these parameters so as to minimize the cost function. Let us dive into the mathematics of the algorithm.
Let the model be defined as:

y = w_0 + w_1 x
Now define the cost function of gradient descent as the mean squared error of prediction:
cost(w_0, w_1) = (1/n) Σ_{i=1}^{n} (y_i − yhat_i)² = (1/n) Σ_{i=1}^{n} (y_i − w_0 − w_1 x_i)²

The cost function includes two parameters, w0 and w1, which control its value. As we know, derivatives give us the rate of change of one variable with respect to another, so we can use partial derivatives to find the impact of each individual parameter on the cost function.
The principle of gradient descent is that we always make progress in the direction in which the partial derivatives with respect to w0 and w1 are steepest. If the derivatives with respect to the parameters become zero or very small, this indicates either a maximum or a minimum on the surface of the cost function. The process of gradient descent starts with a random initialization of w0 and w1. Every iteration of gradient descent moves towards the optimal values of the w0 and w1 parameters, which give the minimum cost function value. The following figure illustrates the process of optimization.

Figure 3: Gradient Descent Iteration

Gradient descent works in the following steps:

1. Randomly initialize the parameters.
2. Calculate the partial derivatives of the cost function with respect to each parameter (the gradients).
3. Update the parameters in the opposite direction of the gradients.
4. Repeat steps 2 and 3 until the maximum number of iterations is reached or the minimum cost function value is achieved.
Partial derivatives:
We have,
error = cost(w_0, w_1) = (1/n) Σ_{i=1}^{n} (y_i − w_0 − w_1 x_i)²

Partial derivatives w.r.t. w0 and w1:

∂cost(w_0, w_1)/∂w_0 = (1/n) Σ_{i=1}^{n} (y_i − w_0 − w_1 x_i)(−2) = (−2/n) Σ_{i=1}^{n} error_i

∂cost(w_0, w_1)/∂w_1 = (1/n) Σ_{i=1}^{n} (y_i − w_0 − w_1 x_i)(−2 x_i) = (−2/n) Σ_{i=1}^{n} error_i · x_i

Parameter updates:
w_0 = w_0 − lrate · ∂cost(w_0, w_1)/∂w_0

w_1 = w_1 − lrate · ∂cost(w_0, w_1)/∂w_1

lrate is the learning rate which controls the step size of parameter update.
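Putting the gradients and the update rule together, the whole procedure can be sketched as follows (Python/NumPy assumed; the function name and default arguments are our own choices, not prescribed by the text):

```python
import numpy as np

def gradient_descent(x, y, lrate=0.01, n_iters=1000):
    """Fit y = w0 + w1*x by gradient descent on the mean squared error cost."""
    w0, w1 = 0.0, 0.0                              # step 1: initialize parameters
    n = len(x)
    for _ in range(n_iters):                       # step 4: repeat
        error = y - (w0 + w1 * x)                  # residuals for the current parameters
        grad_w0 = (-2.0 / n) * np.sum(error)       # step 2: d cost / d w0
        grad_w1 = (-2.0 / n) * np.sum(error * x)   # step 2: d cost / d w1
        w0 -= lrate * grad_w0                      # step 3: move against the gradients
        w1 -= lrate * grad_w1
    return w0, w1
```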
Let’s run it on our example:
X: 6 5 4 7 8 2
y: 7 5 2 9 6 3

Let's initialize both coefficients w0 and w1 with 0.0:

w_0 = 0.0
w_1 = 0.0
Iteration #1:
yhat_i = 0.0 + 0.0 · x_i
Calculate gradients:
∂cost(w_0, w_1)/∂w_0 = (−2/6)(7 + 5 + 2 + 9 + 6 + 3) = −10.6667

∂cost(w_0, w_1)/∂w_1 = (−2/6)(7·6 + 5·5 + 2·4 + 9·7 + 6·8 + 3·2) = −64
Update parameters (lrate = 0.01):

w_0 = w_0 − lrate · ∂cost(w_0, w_1)/∂w_0 = 0.0 − 0.01·(−10.6667) = 0.1067
w_1 = w_1 − lrate · ∂cost(w_0, w_1)/∂w_1 = 0.0 − 0.01·(−64) = 0.64
Iteration #2:
yhat_i = 0.1067 + 0.64 · x_i
Calculate gradients:

∂cost(w_0, w_1)/∂w_0 = (−2/6)(3.0533 + 1.6933 − 0.6667 + 4.4133 + 0.7733 + 1.6133) = −3.6266

∂cost(w_0, w_1)/∂w_1 = (−2/6)(3.0533·6 + 1.6933·5 − 0.6667·4 + 4.4133·7 + 0.7733·8 + 1.6133·2) = −21.475

Update parameters (lrate = 0.01):

w_0 = w_0 − lrate · ∂cost(w_0, w_1)/∂w_0 = 0.1067 − 0.01·(−3.6266) = 0.14296
w_1 = w_1 − lrate · ∂cost(w_0, w_1)/∂w_1 = 0.64 − 0.01·(−21.475) = 0.8547

Similarly, further iterations of gradient descent are performed until the minimum value of the cost function is achieved or a fixed number of iterations is reached.
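The hand computation above matches what the gradient_descent sketch given earlier produces, and longer runs approach the least squares solution found before. For example (an illustration only; the iteration counts are arbitrary):

```python
import numpy as np

x = np.array([6, 5, 4, 7, 8, 2])
y = np.array([7, 5, 2, 9, 6, 3])

# Two iterations reproduce the hand-computed values above (about 0.1430 and 0.8547)
print(gradient_descent(x, y, lrate=0.01, n_iters=2))

# A longer run approaches the least squares solution w0 ≈ 0.457, w1 ≈ 0.914
print(gradient_descent(x, y, lrate=0.01, n_iters=10000))
```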

Multiple Linear Regression: Till now we have discussed the case of linear regression with just one explanatory variable. But in real scenarios the target variable depends on multiple explanatory variables, which need to be catered for while developing the linear regression model. The model is expressed as:

y = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3 + ……… + w_n x_n

where x_1, x_2, x_3, …, x_n are the explanatory variables and y is the target variable.
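For completeness, here is a hedged sketch of fitting such a model by ordinary least squares with NumPy (the feature matrix and target values are made up purely for illustration; categorical attributes such as model or segment would first have to be encoded numerically):

```python
import numpy as np

# Illustrative data only: each row is [mileage, encoded model, encoded segment]
X = np.array([[18.0, 1, 0],
              [14.5, 2, 1],
              [12.0, 3, 2],
              [16.0, 1, 0],
              [10.5, 4, 2]], dtype=float)
y = np.array([3.2, 4.5, 7.8, 3.9, 9.1])   # sale price (arbitrary units)

# Prepend a column of ones so the intercept w0 is estimated together with w1..wn
X_design = np.column_stack([np.ones(len(X)), X])

# Least squares solution of y ≈ X_design @ w
w, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print("intercept w0:", w[0], "coefficients w1..wn:", w[1:])
```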
Evaluation of a Linear Regression model - Evaluation helps to judge how well any machine learning model would perform on test data. Fundamentally, three types of evaluation metrics are used to evaluate a linear regression model.
- R² measure (discussed with the least squares method)
- Mean Absolute Error (MAE)
- Root Mean Square Error (RMSE)
Mean Absolute Error (MAE) - Mean Absolute Error is the average of the absolute differences between the actual and predicted values of the target variable:

MAE = (1/n) Σ_{i=1}^{n} |y_i − yhat_i|

Root Mean Square Error (RMSE) - defined as:

RMSE = √[ (1/n) Σ_{i=1}^{n} (y_i − yhat_i)² ]
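Both metrics are straightforward to compute; here is a minimal sketch in Python/NumPy (the function names are our own):

```python
import numpy as np

def mae(y, y_hat):
    """Mean Absolute Error: average absolute difference between actual and predicted values."""
    return np.mean(np.abs(y - y_hat))

def rmse(y, y_hat):
    """Root Mean Square Error: square root of the average squared difference."""
    return np.sqrt(np.mean((y - y_hat) ** 2))

# Example with the Table-1 data and its least squares fit
x = np.array([6, 5, 4, 7, 8, 2])
y = np.array([7, 5, 2, 9, 6, 3])
y_hat = 0.4571 + 0.9143 * x
print("MAE :", mae(y, y_hat))
print("RMSE:", rmse(y, y_hat))
```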

Pros and cons of Linear Regression:
Pros - Linear regression models are very simple and easy to implement. They are among the most interpretable models.
Cons - Linear regression models are strongly affected by the presence of outliers in the training data. They assume a linear relationship between the target and explanatory variables, which is sometimes not true.

********

