Note 13 - Linear Regression

IT3011 - Theory & Practices in Statistical Modelling

Regression Analysis

H.M. Samadhi Chathuranga Rathnayake


MSc, PgDip, BSc
Covariance of Two Numerical Variables

Covariance measures the joint variation of two numerical variables; it is a measure of the relationship between them.

$$\mathrm{COV}(X, Y) = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})$$
Correlation of Two Numerical Variables

Correlation is also a measure of the relationship between two numerical variables. It lies between -1 and +1.
$$\mathrm{CORR}(X, Y) = \frac{\mathrm{COV}(X, Y)}{\sqrt{V(X)\,V(Y)}}$$
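As a quick illustration, here is a minimal Python sketch that computes both quantities directly from these definitions (the height/weight numbers are made up for illustration):

```python
import numpy as np

# Hypothetical sample: heights (cm) and weights (kg)
x = np.array([160.0, 165.0, 170.0, 175.0, 180.0])
y = np.array([55.0, 60.0, 63.0, 70.0, 74.0])
n = len(x)

# COV(X, Y) = (1/n) * sum((X_i - X_bar) * (Y_i - Y_bar))
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / n

# CORR(X, Y) = COV(X, Y) / sqrt(V(X) * V(Y)), using the same 1/n convention
var_x = np.sum((x - x.mean()) ** 2) / n
var_y = np.sum((y - y.mean()) ** 2) / n
corr_xy = cov_xy / np.sqrt(var_x * var_y)

print(f"COV(X, Y)  = {cov_xy:.3f}")
print(f"CORR(X, Y) = {corr_xy:.3f}")  # always lies between -1 and +1
```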
Correlation vs. Causation

A correlation between variables, however, does not automatically mean that the change in one variable is the
cause of the change in the values of the other variable. Causation indicates that one event is the result of the
occurrence of the other event; i.e. there is a causal relationship between the two events.
Linear Regression

Linear regression is perhaps one of the most well known and well understood algorithms in statistics and
machine learning.
Linear regression is a linear model, i.e. a model that assumes a linear relationship between the input variables (x) and the single output variable (y). More specifically, it assumes that y can be calculated from a linear combination of the input variables (x).

When there is a single input variable (x), the method is referred to as Simple Linear Regression. When there
are multiple input variables, literature from statistics often refers to the method as Multiple Linear Regression.

Different techniques can be used to prepare or train the linear regression equation from data, the most common of which is called Ordinary Least Squares. It is therefore common to refer to a model prepared this way as Ordinary Least Squares Linear Regression, or simply Least Squares Regression.
Simple Linear Regression

Assume that we want to study the relationship between the height and the weight of Sri Lankan university students. Our objective is to fit a model to predict weight from the height of the students.
Since the entire population cannot be accessed, a sample will be taken, and the model will be created.

Let’s plot the sample data.


Simple Linear Regression

Simple linear regression is useful for finding the relationship between two variables: a predictor (independent) variable and a response (dependent) variable, both quantitative. For example, the relationship between height and weight.
Simple Linear Regression

Infinitely many candidate lines can be drawn through the data.


Simple Linear Regression

The equation for this model for the population is as follows,


$$y = \beta_0 + \beta_1 x + \varepsilon$$
The values $\beta_0$ and $\beta_1$ must be chosen so that they minimize the error. If the sum of squared errors is taken as the metric to evaluate the model, then the goal is to obtain the line that minimizes this error.
$$\text{Sum of Squares of Error } (SSE) = \sum_{i=1}^{n} (\text{Actual Output}_i - \text{Predicted Output}_i)^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

This is also called the Residual Sum of Squares (RSS). By minimizing this SSE, we can obtain the following parameter estimates. The significance of these parameters can also be assessed using hypothesis testing.

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$
Then the estimated model is,
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$$
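As a minimal sketch, the closed-form estimates can be computed directly; the height/weight numbers below are made up:

```python
import numpy as np

# Hypothetical sample: heights (cm) and weights (kg)
x = np.array([160.0, 165.0, 170.0, 175.0, 180.0])
y = np.array([55.0, 60.0, 63.0, 70.0, 74.0])

# Closed-form least-squares estimates from the formulas above
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0_hat = y.mean() - beta1_hat * x.mean()

# Fitted values and the sum of squared errors (SSE / RSS)
y_hat = beta0_hat + beta1_hat * x
sse = np.sum((y - y_hat) ** 2)

print(f"y_hat = {beta0_hat:.3f} + {beta1_hat:.3f} x,  SSE = {sse:.3f}")
```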
Simple Linear Regression

Best Fit Line

The core idea is to obtain the line that best fits the data. The best fit line is the one for which the total prediction error, taken over all data points, is as small as possible.
Multiple Linear Regression

The equation for this model is as follows,


$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \cdots + \beta_k x_k + \varepsilon$$
Here we have $k$ predictor variables. By minimizing the SSE, we can obtain the parameter estimates here as well. These estimates can be represented as a vector; the parameter vector $\hat{\beta}$ can be obtained through,

$$\hat{\beta} = (X^T X)^{-1} X^T Y$$

Here,

$$X = \begin{pmatrix} X_{11} & X_{12} & \cdots & X_{1k} \\ X_{21} & X_{22} & \cdots & X_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ X_{n1} & X_{n2} & \cdots & X_{nk} \end{pmatrix}$$

where column $j$ holds the values of predictor $X_j$.
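A minimal numpy sketch of the normal equation with made-up data. Note that a column of ones is prepended to X so that the first entry of the estimated vector is the intercept $\hat{\beta}_0$ (the matrix above shows only the predictor columns):

```python
import numpy as np

# Hypothetical data: n = 5 observations, k = 2 predictors
X_raw = np.array([[1.0, 2.0],
                  [2.0, 1.0],
                  [3.0, 4.0],
                  [4.0, 3.0],
                  [5.0, 6.0]])
Y = np.array([6.0, 7.0, 13.0, 14.0, 20.0])

# Prepend a column of ones so the first coefficient is the intercept
X = np.column_stack([np.ones(len(Y)), X_raw])

# Normal equation: beta_hat = (X^T X)^{-1} X^T Y.
# np.linalg.solve is numerically safer than forming the inverse explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

print("beta_hat =", beta_hat)  # [intercept, beta_1, beta_2]
```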
Multiple Linear Regression

How is this model visualized? Consider an example with two predictor variables and one response: here Income is the response variable, and Seniority and Years of Education are the independent variables.

Higher dimensions cannot be visualized easily; there are several advanced techniques for visualizing them.
Qualitative Predictors (Categorical Independent Variables)

Categorical variables cannot be added to the model in the same way as numerical variables.


Ex: University Year variable (First, Second, Third, Fourth)

Year      First   Second   Third   Fourth
First       1       0        0       0
Second      0       1        0       0
Third       0       0        1       0
Fourth      0       0        0       1

These are called Dummy Variables.


Qualitative Predictors (Categorical Independent Variables)

Since one category can be represented as the all-zero case of the other dummy variables, one dummy variable can be removed. The category it represented is called the reference level.
Consider First as the reference level.

Year      First   Second   Third   Fourth
First       1       0        0       0
Second      0       1        0       0
Third       0       0        1       0
Fourth      0       0        0       1

Including all four dummy variables would make them perfectly collinear; this pitfall is known as the Dummy Variable Trap, and removing the reference level avoids it.


Qualitative Predictors (Categorical Independent Variables)

With First as the reference level, the dummy variables reduce to:

Year      Second   Third   Fourth
First       0        0       0
Second      1        0       0
Third       0        1       0
Fourth      0        0       1

Dropping one dummy variable in this way avoids the Dummy Variable Trap.


Dummy Variables

To represent categorical variables in a linear regression model, we use dummy variables. Consider the following examples.
Ex: Gender

Gender    Dummy Variable (G)
Male              1
Female            0

Ex: Temperature (High, Medium, Low)

Temperature   Dummy Variable (T1)   Dummy Variable (T2)
High                  1                     0
Medium                0                     1
Low                   0                     0

Ex: Colour (Red, Green, Yellow, Blue)

Colour    Dummy Variable (C1)   Dummy Variable (C2)   Dummy Variable (C3)
Red               1                     0                     0
Green             0                     1                     0
Yellow            0                     0                     1
Blue              0                     0                     0
Dummy Variables

Now assume that a model is to be fitted with a numerical variable (X) and the categorical variable Colour discussed above.

Colour    Dummy Variable (C1)   Dummy Variable (C2)   Dummy Variable (C3)
Red               1                     0                     0
Green             0                     1                     0
Yellow            0                     0                     1
Blue              0                     0                     0

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 c_1 + \hat{\beta}_3 c_2 + \hat{\beta}_4 c_3$$
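As a sketch of how this encoding is produced in practice, the following uses pandas.get_dummies on a made-up data frame; with drop_first=True, Blue (the first level alphabetically) is dropped and becomes the reference level, avoiding the dummy variable trap:

```python
import pandas as pd

# Hypothetical dataset with one numerical variable and the Colour variable
df = pd.DataFrame({
    "x1":     [1.2, 3.4, 2.2, 5.1, 4.0],
    "Colour": ["Red", "Green", "Yellow", "Blue", "Red"],
})

# One-hot encode Colour; drop_first=True removes one level (here Blue)
# so it serves as the reference level
encoded = pd.get_dummies(df, columns=["Colour"], drop_first=True, dtype=int)
print(encoded)
```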


R Squared Value (Coefficient of Determination)

$$\text{Total Sum of Squares } (TSS) = \sum_{i=1}^{n} (y_i - \bar{y})^2$$

$$\text{Sum of Squares of Error } (SSE) = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

$$\text{Coefficient of Determination} = R^2 = \frac{TSS - SSE}{TSS}$$
The R squared value gives the fraction of the variation in the response that is explained by the estimated model; in simple terms, it tells how much of the variability in the data the model captures.

For the simple linear regression case, $R^2 = (\text{Correlation})^2$.
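A short sketch, continuing the made-up height/weight example, that computes R squared from TSS and SSE and confirms it equals the squared correlation in the simple case:

```python
import numpy as np

x = np.array([160.0, 165.0, 170.0, 175.0, 180.0])
y = np.array([55.0, 60.0, 63.0, 70.0, 74.0])

# Fit the simple linear regression as before
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0_hat = y.mean() - beta1_hat * x.mean()
y_hat = beta0_hat + beta1_hat * x

tss = np.sum((y - y.mean()) ** 2)  # total sum of squares
sse = np.sum((y - y_hat) ** 2)     # sum of squared errors
r_squared = (tss - sse) / tss

# In simple linear regression, R^2 equals the squared correlation
corr = np.corrcoef(x, y)[0, 1]
print(f"R^2 = {r_squared:.4f},  correlation^2 = {corr ** 2:.4f}")
```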


Hypothesis Testing for Simple Linear Regression

For each sample drawn from the population, different estimates are obtained, so these estimates are random variables. Their variances are:

$$V(\hat{\beta}_0) = \frac{SSE}{n-2}\left(\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\right)$$

$$V(\hat{\beta}_1) = \frac{SSE}{n-2} \cdot \frac{1}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$$
Hypothesis Testing for Simple Linear Regression

The following hypotheses can be tested.

$H_0$: There is no relationship between $X$ and $Y$ $(\beta_1 = 0)$

$H_1$: There is a relationship between $X$ and $Y$ $(\beta_1 \neq 0)$

Here the test statistic is,

$$t_{cal} = \frac{\hat{\beta}_1 - 0}{SE(\hat{\beta}_1)} \sim t_{n-2}$$
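A sketch of this test with made-up data, assuming scipy is available for the t distribution:

```python
import numpy as np
from scipy import stats

x = np.array([160.0, 165.0, 170.0, 175.0, 180.0])
y = np.array([55.0, 60.0, 63.0, 70.0, 74.0])
n = len(x)

beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0_hat = y.mean() - beta1_hat * x.mean()
sse = np.sum((y - (beta0_hat + beta1_hat * x)) ** 2)

# SE(beta1_hat) = sqrt( (SSE / (n - 2)) / sum((x_i - x_bar)^2) )
se_beta1 = np.sqrt(sse / (n - 2) / np.sum((x - x.mean()) ** 2))

t_cal = (beta1_hat - 0) / se_beta1
p_value = 2 * stats.t.sf(abs(t_cal), df=n - 2)  # two-sided p-value

print(f"t = {t_cal:.3f},  p = {p_value:.4f}")  # reject H0 if p < 0.05
```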
Hypothesis Testing for Multiple Linear Regression

The following hypotheses can be tested.

$$H_0: \beta_1 = \beta_2 = \beta_3 = \cdots = \beta_k = 0$$

$H_1$: At least one $\beta_j$ is non-zero, for $j = 1, 2, \ldots, k$


The ANOVA table here is:

Source of variation   Degrees of freedom   Mean square               F
Regression            k                    MSR = (TSS - SSE) / k     F_stat = MSR / MSE
Error                 n - k - 1            MSE = SSE / (n - k - 1)
Total                 n - 1
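A sketch of the F statistic computed from TSS and SSE, using made-up, slightly noisy data with k = 2 predictors:

```python
import numpy as np
from scipy import stats

# Hypothetical data: n = 5 observations, k = 2 predictors
X_raw = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 6.0]])
Y = np.array([6.2, 6.9, 12.7, 14.3, 19.8])
n, k = X_raw.shape

# Fit by the normal equation (intercept column included)
X = np.column_stack([np.ones(n), X_raw])
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
Y_hat = X @ beta_hat

tss = np.sum((Y - Y.mean()) ** 2)
sse = np.sum((Y - Y_hat) ** 2)

msr = (tss - sse) / k       # regression mean square
mse = sse / (n - k - 1)     # error mean square
f_stat = msr / mse
p_value = stats.f.sf(f_stat, k, n - k - 1)

print(f"F = {f_stat:.3f},  p = {p_value:.4f}")
```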
Linear Regression Assumptions

There are mainly four assumptions associated with a linear regression model (linearity, homoscedasticity, independence, and normality), together with the requirement that the independent variables are not strongly correlated with one another. Each can be checked with a standard diagnostic (a combined check is sketched below):
• Linearity: the relationship between X and the mean of Y is linear. Checked with scatter plots and partial regression plots.
• Multicollinearity: correlation among the independent variables should be low. Checked with the Variance Inflation Factor (VIF).
• Homoscedasticity: the variance of the residuals is constant. Checked with a residuals vs. fitted values plot.
• Independence: the observations are independent of each other. Checked with a residuals vs. fitted values plot.
• Normality: the residuals are normally distributed. Checked with the Shapiro–Wilk test.
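A minimal diagnostic sketch on synthetic data, assuming statsmodels and scipy are available; it computes the VIF for each predictor and runs the Shapiro–Wilk test on the residuals (homoscedasticity and independence are usually judged visually from a residuals vs. fitted plot):

```python
import numpy as np
from scipy.stats import shapiro
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Synthetic data: x2 is mildly correlated with x1
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = 0.5 * x1 + rng.normal(size=50)
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(size=50)

# Fit by the normal equation (intercept column included)
X = np.column_stack([np.ones(50), x1, x2])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
residuals = y - X @ beta_hat

# Multicollinearity: VIF per predictor (values well above 5-10 signal trouble)
for j, name in [(1, "x1"), (2, "x2")]:
    print(f"VIF({name}) = {variance_inflation_factor(X, j):.2f}")

# Normality of residuals: small p-value suggests non-normality
stat, p = shapiro(residuals)
print(f"Shapiro-Wilk: W = {stat:.3f}, p = {p:.3f}")

# Homoscedasticity / independence: inspect a residuals vs. fitted plot,
# e.g. plt.scatter(X @ beta_hat, residuals) with matplotlib
```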
Predictions

After fitting the model, the goal is to predict the response for new values of the independent variables.
Prediction Error

There is always a prediction error: the predicted value $\hat{y}$ differs from the actual value $y$ because of the random error term $\varepsilon$ and the estimation error in the fitted coefficients.
Model Evaluation – Validation Set Approach

Split the dataset into two sets,


• Training dataset (generally 80% of the data, but this can be changed)
• Testing dataset (the rest of the data)

Train the model using the training dataset and then check the accuracy of the model using the testing dataset. The MSE of the regression model is calculated on the test data to evaluate the model.

$$\text{Mean Squared Error} = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n}$$
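A sketch of the validation set approach using scikit-learn (assumed available) on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic dataset: 100 observations, 2 predictors
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = 3.0 + 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=100)

# 80/20 train/test split (the ratio can be changed)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train on the training set, evaluate on the held-out test set
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)  # sum((y_i - y_hat_i)^2) / n
print(f"Test MSE = {mse:.3f}")
```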
