Note 13 - Linear Regression
Regression Analysis
Covariance measures the joint variation of two numerical variables; it quantifies the relationship between them.
$$COV(X, Y) = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})$$
Correlation of Two Numerical Variables
Correlation is also a measure of the relationship between two numerical variables. Its value always lies between -1 and +1.
$$CORR(X, Y) = \frac{COV(X, Y)}{\sqrt{V(X)\,V(Y)}}$$
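As a quick illustration, here is a minimal Python sketch of both formulas; the data arrays are made-up values, not from these notes.

```python
# Covariance and correlation from first principles; data are made-up.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

n = len(x)
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / n   # the 1/n covariance above
corr_xy = cov_xy / (x.std() * y.std())                 # np.std defaults to the 1/n form

print(cov_xy, corr_xy)
print(np.corrcoef(x, y)[0, 1])  # NumPy's built-in Pearson correlation agrees
```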
Correlation vs. Causation
A correlation between variables, however, does not automatically mean that the change in one variable is the
cause of the change in the values of the other variable. Causation indicates that one event is the result of the
occurrence of the other event; i.e. there is a causal relationship between the two events.
Linear Regression
Linear regression is perhaps one of the most well known and well understood algorithms in statistics and
machine learning.
Linear regression is a linear model, e.g. a model that assumes a linear relationship between the input variables
(x) and the single output variable (y). More specifically, that y can be calculated from a linear combination of
the input variables (x).
When there is a single input variable (x), the method is referred to as Simple Linear Regression. When there
are multiple input variables, literature from statistics often refers to the method as Multiple Linear Regression.
Different techniques can be used to prepare or train the linear regression equation from data, the most
common of which is called Ordinary Least Squares. It is common to therefore refer to a model prepared this
way as Ordinary Least Squares Linear Regression or just Least Squares Regression.
Simple Linear Regression
Assume that we want to examine the relationship between the height and the weight of Sri Lankan university students; our objective is to fit a model that predicts weight from height. Since the entire population cannot be accessed, a sample is taken and the model is built from it.
Simple linear regression is useful for finding the relationship between two variables: a predictor (independent) variable and a response (dependent) variable, where the response is quantitative. For example, the relationship between height and weight.
The line is fitted by minimizing the Sum of Squared Errors (SSE),

$$SSE = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2,$$

which is also called the Residual Sum of Squares (RSS). Minimizing the SSE yields the following parameter estimates; the significance of these parameters can also be assessed using hypothesis testing.
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}$$
Then the estimated model is,
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$$
The core idea is to obtain the line that best fits the data: the best-fit line is the one for which the total prediction error, over all data points, is as small as possible.
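A minimal NumPy sketch of these estimators, using made-up height and weight values in place of a real student sample:

```python
# Simple linear regression fitted from the closed-form estimators; data are made-up.
import numpy as np

height = np.array([150.0, 160.0, 165.0, 170.0, 180.0])  # x, in cm
weight = np.array([50.0, 56.0, 63.0, 66.0, 74.0])       # y, in kg

x_bar, y_bar = height.mean(), weight.mean()
beta1_hat = np.sum((height - x_bar) * (weight - y_bar)) / np.sum((height - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

y_hat = beta0_hat + beta1_hat * height  # fitted (predicted) weights
print(beta0_hat, beta1_hat)
```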
Multiple Linear Regression
For multiple input variables, the least squares estimate of the coefficient vector has the closed matrix form

$$\hat{\beta} = (X^T X)^{-1} X^T Y$$
Here $X$ is the $n \times k$ matrix of observations on the predictors $X_1, X_2, \dots, X_k$:

$$X = \begin{pmatrix} X_{11} & X_{12} & \cdots & X_{1k} \\ X_{21} & X_{22} & \cdots & X_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ X_{n1} & X_{n2} & \cdots & X_{nk} \end{pmatrix}$$
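A minimal NumPy sketch of the matrix formula; the data are made-up, and the leading column of ones (for the intercept) is an assumption here, since the matrix above shows only the predictor columns:

```python
# Multiple regression coefficients via the normal equations; data are made-up.
import numpy as np

X = np.column_stack([np.ones(5),                    # intercept column (assumption)
                     [1.0, 2.0, 3.0, 4.0, 5.0],     # X1
                     [2.0, 1.0, 4.0, 3.0, 5.0]])    # X2
Y = np.array([3.1, 4.0, 7.2, 7.9, 10.8])

beta_hat = np.linalg.inv(X.T @ X) @ X.T @ Y              # (X'X)^(-1) X'Y
beta_hat_stable, *_ = np.linalg.lstsq(X, Y, rcond=None)  # numerically safer in practice
print(beta_hat, beta_hat_stable)
```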
How is this model visualized? Consider an example with two predictors and one response, where Income is the response variable and Seniority and Years of Education are the independent variables; the fitted model is then a plane over the two predictor axes. Higher dimensions cannot be visualized easily, although several advanced techniques exist for visualizing them.
Qualitative Predictors (Categorical Independent Variables)
To represent categorical variables in a linear regression model, we use dummy variables. Consider the following examples.

Ex: Gender (Female, Male)

Female → 0
Male → 1

Ex: Temperature (High, Medium, Low)

High → 1 0
Medium → 0 1
Low → 0 0

A categorical variable with $m$ levels needs only $m - 1$ dummy variables: since one level can be represented as the case where all the dummy variables are 0, that level can be removed from the coding. It is called the reference level. In the Temperature example above, Low is the reference level.

Ex: Colour (Red, Green, Yellow, Blue)

Following the same pattern, with Blue as the reference level:

Red → 1 0 0
Green → 0 1 0
Yellow → 0 0 1
Blue → 0 0 0

Now assume that a model is to be fitted with a numerical variable ($X$) and the categorical variable Colour discussed above. The model is then

$$y = \beta_0 + \beta_1 X + \beta_2 D_{Red} + \beta_3 D_{Green} + \beta_4 D_{Yellow} + \varepsilon$$

where $D_{Red}$, $D_{Green}$ and $D_{Yellow}$ are the dummy variables for Colour.
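A minimal sketch of dummy coding using pandas; the data frame and values are made-up for illustration:

```python
# Dummy coding of a categorical variable with pandas; the data are made-up.
import pandas as pd

df = pd.DataFrame({
    "X": [10.0, 12.0, 9.0, 15.0, 11.0],
    "Colour": ["Red", "Green", "Yellow", "Blue", "Red"],
})

# drop_first=True removes one level, leaving m - 1 dummy columns; pandas drops
# the first level alphabetically (Blue here), which becomes the reference level.
dummies = pd.get_dummies(df["Colour"], prefix="Colour", drop_first=True)
design = pd.concat([df[["X"]], dummies], axis=1)
print(design)
```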
To measure how well the estimated model fits the data, define the following quantities.

$$\text{Total Sum of Squares:}\quad TSS = \sum_{i=1}^{n}(y_i - \bar{y})^2$$

$$\text{Sum of Squared Errors:}\quad SSE = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$

$$\text{Coefficient of Determination:}\quad R^2 = \frac{TSS - SSE}{TSS}$$
This $R^2$ value gives the fraction of the variation in the response that is explained by the estimated model; in simple terms, it tells how much of the variability in the data is captured by the model.
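A minimal sketch of the $R^2$ computation; the observed and fitted values are made-up for illustration:

```python
# TSS, SSE and R-squared; y and y_hat are made-up illustrative values.
import numpy as np

y = np.array([50.0, 56.0, 63.0, 66.0, 74.0])        # observed responses
y_hat = np.array([50.6, 55.8, 61.0, 66.2, 76.4])    # fitted values from some model

tss = np.sum((y - y.mean()) ** 2)
sse = np.sum((y - y_hat) ** 2)
r_squared = (tss - sse) / tss
print(r_squared)
```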
Different samples from the population yield different estimates, so these estimates are themselves random variables.
$$V(\hat{\beta}_0) = \frac{SSE}{n-2}\left(\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\right)$$

$$V(\hat{\beta}_1) = \frac{SSE}{n-2} \cdot \frac{1}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$$
Hypothesis Testing for Simple Linear Regression
To test $H_0: \beta_1 = 0$ against $H_1: \beta_1 \neq 0$, the test statistic is

$$t_{cal} = \frac{\hat{\beta}_1 - 0}{SE(\hat{\beta}_1)} \sim t_{n-2}$$

where $SE(\hat{\beta}_1) = \sqrt{V(\hat{\beta}_1)}$.
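A minimal sketch of this t-test with NumPy and SciPy, reusing the earlier made-up data; the standard error comes from the $V(\hat{\beta}_1)$ formula above:

```python
# t-test for the slope of a simple linear regression; data are made-up.
import numpy as np
from scipy import stats

x = np.array([150.0, 160.0, 165.0, 170.0, 180.0])
y = np.array([50.0, 56.0, 63.0, 66.0, 74.0])

n = len(x)
sxx = np.sum((x - x.mean()) ** 2)
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
beta0 = y.mean() - beta1 * x.mean()
sse = np.sum((y - (beta0 + beta1 * x)) ** 2)

se_beta1 = np.sqrt((sse / (n - 2)) / sxx)        # SE(beta1) = sqrt(V(beta1))
t_cal = (beta1 - 0.0) / se_beta1
p_value = 2 * stats.t.sf(abs(t_cal), df=n - 2)   # two-sided p-value under t_{n-2}
print(t_cal, p_value)
```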
Hypothesis Testing for Multiple Linear Regression
The overall significance of the model is tested using

$$H_0: \beta_1 = \beta_2 = \beta_3 = \cdots = \beta_k = 0 \quad \text{vs.} \quad H_1: \text{at least one } \beta_j \neq 0$$
The test is carried out using the following ANOVA table.

Source of variation | Degrees of freedom    | Mean square              | F
--------------------|-----------------------|--------------------------|--------------------
Regression          | k                     | MSR = (TSS - SSE) / k    | F_stat = MSR / MSE
Error               | n - k - 1             | MSE = SSE / (n - k - 1)  |
Total               | n - k - 1 + k = n - 1 |                          |

Under $H_0$, $F_{stat}$ follows an $F$ distribution with $k$ and $n - k - 1$ degrees of freedom, so large values of $F_{stat}$ lead to rejecting $H_0$.
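A minimal sketch of the overall F-test using the statsmodels OLS API; the data here are simulated purely for illustration:

```python
# Overall F-test for a multiple regression; data are simulated.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))                      # k = 2 predictors
y = 1.0 + 2.0 * X[:, 0] + rng.normal(size=30)

model = sm.OLS(y, sm.add_constant(X)).fit()
print(model.fvalue, model.f_pvalue)               # F statistic and its p-value
print(model.summary())                            # ANOVA-style output, t-tests, R-squared
```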
Linear Regression Assumptions
There are five main assumptions and checks associated with a linear regression model; the diagnostic for each is listed beneath it, and a combined code sketch follows this list:
• Linearity: The relationship between X and the mean of Y is linear.
  • Checked with scatter plots and partial regression plots.
• No multicollinearity: The independent variables should not be highly correlated with one another.
  • Checked with the Variance Inflation Factor (VIF).
• Homoscedasticity: The residuals have homogeneous (constant) variance.
  • Checked with a residuals vs. fitted values plot.
• Independence: Observations are independent of each other.
  • Checked with a residuals vs. fitted values plot.
• Normality: The residuals are normally distributed.
  • Checked with the Shapiro–Wilk test.
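A minimal combined sketch of these diagnostics, assuming statsmodels, SciPy and matplotlib are available; the data are simulated purely for illustration:

```python
# Assumption checks: VIF, Shapiro-Wilk, residuals vs fitted; data are simulated.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from scipy.stats import shapiro
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
X = sm.add_constant(rng.normal(size=(50, 2)))     # constant + 2 predictors
y = 1.0 + X[:, 1] + rng.normal(size=50)
fit = sm.OLS(y, X).fit()

# Multicollinearity: VIF for each predictor column (skipping the constant)
print("VIF:", [variance_inflation_factor(X, i) for i in range(1, X.shape[1])])

# Normality of residuals: Shapiro-Wilk test (small p-value suggests non-normality)
print("Shapiro-Wilk:", shapiro(fit.resid))

# Homoscedasticity / independence: residuals vs fitted values plot
plt.scatter(fit.fittedvalues, fit.resid)
plt.xlabel("Fitted values"); plt.ylabel("Residuals")
plt.show()
```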
Predictions
After fitting the model, the goal is to predict the response for new values of the independent variables.
Prediction Error

The error in such a prediction has two parts: a reducible part, due to how well the estimated model approximates the true relationship, and an irreducible part, due to the random error term $\varepsilon$.
Model Evaluation – Validation Set Approach
Train the model using the training dataset, then check its accuracy using the testing dataset: the MSE of the regression model is calculated on the test data and used to evaluate the model.
$$\text{Mean Squared Error} = \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n}$$
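A minimal sketch of the validation set approach using scikit-learn; the data are simulated purely for illustration:

```python
# Validation-set evaluation: fit on a train split, compute MSE on a test split.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=100)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print("Test MSE:", mean_squared_error(y_test, model.predict(X_test)))
```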