Lecture_Regression Analysis and Correlation
Regression Analysis and Correlation
Introduction
• Suppose that we need to optimize the flow of traffic on a
certain main road in Kigali city. A transportation engineer
needs to look at the key factors affecting the traffic flow,
such as the density of vehicles (number of vehicles per
kilometer) and the average travel speed on that main road.
• To perform the task, data should be collected over several
days during the peak hours. The engineer can then develop a
mathematical model that predicts travel speed from the vehicle
density, which enables better traffic management. This is
where “regression analysis” comes in.
Regression Analysis
• Def: Regression analysis is the part of statistics that
investigates the relationship between two or more variables
related in a nondeterministic fashion.
• In particular, a simple linear regression model is used to
model the relationship between two variables (dependent
and independent variables).
• The dependent variable is the outcome variable we aim
to predict, while the independent variable is the predictor
variable that influences the dependent variable.
Regression Analysis
The simple linear regression model is of the form:
\[ Y = \beta_0 + \beta_1 x + \epsilon \]
where $\beta_0$ and $\beta_1$ are the parameters to be determined and $\epsilon$ is
the error term, which is assumed to be normally distributed
with $E(\epsilon) = 0$ and $V(\epsilon) = \sigma^2$.
Without $\epsilon$, any observed pair $(x, y)$ would correspond to a
point falling exactly on the line $y = \beta_0 + \beta_1 x$, called the true
(or population) regression line. The inclusion of the error term
allows $(x, y)$ to fall either above or below the true regression
line.
Model Parameters Estimation
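The point estimates of $\beta_0$ and $\beta_1$, denoted by $\hat{\beta}_0$ and $\hat{\beta}_1$, are called the
least squares estimates. The estimated regression line or least squares
line is then the line whose equation is
\[ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x \]
where
\[ \hat{\beta}_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \frac{S_{xy}}{S_{xx}} \]
\[ \hat{\beta}_0 = \frac{\sum y_i - \hat{\beta}_1 \sum x_i}{n} = \bar{y} - \hat{\beta}_1 \bar{x} \]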
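The least squares formulas above can be sketched in a few lines of Python. This is a minimal illustration, not part of the lecture: the function name `least_squares_fit` and the four check points are made up, and the sums use the deviation form $S_{xy}/S_{xx}$.

```python
def least_squares_fit(xs, ys):
    """Return (beta0_hat, beta1_hat) for the fitted line y = beta0 + beta1 * x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long samples with at least 2 points")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # S_xy and S_xx as defined in the slope formula.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    if s_xx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    beta1_hat = s_xy / s_xx
    beta0_hat = y_bar - beta1_hat * x_bar
    return beta0_hat, beta1_hat


# Sanity check on made-up points that lie exactly on y = 1 + 2x.
b0, b1 = least_squares_fit([0, 1, 2, 3], [1, 3, 5, 7])
print(b0, b1)  # 1.0 2.0
```

Because the check points lie exactly on a line, the fit recovers the intercept and slope with no residual error.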
Example
The following data were collected on a certain main road during
the peak hours. The data give the vehicle density (x, in
vehicles per kilometer) and the corresponding average travel
speed (y, in km/h). Write down the linear regression model that
represents the data.
Vehicle Density (x)   10   20   30   40   50   60
Travel Speed (y)      70   65   58   50   45   40
Example
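Sol: (the step-by-step working will be performed in the class session)
\[ \hat{\beta}_1 = \frac{\sum x_i y_i - \left(\sum x_i\right)\left(\sum y_i\right)/n}{\sum x_i^2 - \left(\sum x_i\right)^2/n} = \frac{-1090}{1750} = -0.62286 \]
\[ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 54.66667 - (-0.62286)(35) = 76.46667 \]
\[ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x = 76.46667 - 0.62286x \]
This model can also be used to predict travel speed for different vehicle
densities.
For instance, at $x = 30$ vehicles/km,
\[ \hat{y} = 76.46667 - 0.62286(30) = 57.78 \text{ km/h} \]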
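The arithmetic in this solution can be reproduced with a short Python sketch. It uses the computational (shortcut) sums shown above on the example data; the variable names and the helper `predict_speed` are illustrative choices, not part of the lecture.

```python
# Data from the example: vehicle density (vehicles/km) and travel speed (km/h).
density = [10, 20, 30, 40, 50, 60]
speed = [70, 65, 58, 50, 45, 40]
n = len(density)

sum_x = sum(density)
sum_y = sum(speed)
sum_xy = sum(x * y for x, y in zip(density, speed))
sum_xx = sum(x * x for x in density)

# Shortcut forms of S_xy and S_xx.
s_xy = sum_xy - sum_x * sum_y / n   # -1090.0
s_xx = sum_xx - sum_x ** 2 / n      # 1750.0

beta1_hat = s_xy / s_xx
beta0_hat = sum_y / n - beta1_hat * sum_x / n


def predict_speed(x):
    """Predicted travel speed (km/h) at vehicle density x (vehicles/km)."""
    return beta0_hat + beta1_hat * x


print(round(beta1_hat, 5), round(beta0_hat, 5), round(predict_speed(30), 2))
# -0.62286 76.46667 57.78
```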
Pearson’s Correlation Coefficient
There are many situations in which the objective in studying the joint
behavior of two variables is to see whether they are related, rather than
to use one to predict the value of the other.
\[ r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}} \]
Pearson’s Correlation Coefficient
$r \in [-1, 1]$
If $r = 1$: Perfect positive correlation
If $r = -1$: Perfect negative correlation
If $r = 0$: No linear correlation
When we are dealing with population data, the previous formula
for the correlation coefficient (r) becomes:
\[ r = \frac{Cov(X, Y)}{\sigma_X \sigma_Y} \]
where
\[ Cov(X, Y) = \frac{\sum_{i=1}^{N} (X_i - \mu_X)(Y_i - \mu_Y)}{N} \]
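As a quick check, the population formula can be expanded. This assumes that $\sigma_X$ and $\sigma_Y$ are the population standard deviations with divisor $N$, which is not stated explicitly above. Under that assumption the factors of $N$ cancel, so the expression has the same algebraic form as the sample formula, with $\mu_X$ and $\mu_Y$ in place of $\bar{x}$ and $\bar{y}$:

```latex
\[
r = \frac{\frac{1}{N}\sum_{i=1}^{N}(X_i-\mu_X)(Y_i-\mu_Y)}
         {\sqrt{\frac{1}{N}\sum_{i=1}^{N}(X_i-\mu_X)^2}\,
          \sqrt{\frac{1}{N}\sum_{i=1}^{N}(Y_i-\mu_Y)^2}}
  = \frac{\sum_{i=1}^{N}(X_i-\mu_X)(Y_i-\mu_Y)}
         {\sqrt{\sum_{i=1}^{N}(X_i-\mu_X)^2\,\sum_{i=1}^{N}(Y_i-\mu_Y)^2}}
\]
```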
Pearson’s Correlation Coefficient
Using our previous example with vehicle density and the
corresponding average travel speed, the correlation coefficient is
\[ r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}} = \frac{-1090}{\sqrt{1195833}} = -0.99676 \]
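The value of r for the traffic data can be checked with a small Python sketch. The function name `pearson_r` is an illustrative choice; the sums follow the sample formula given earlier.

```python
import math


def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equally long samples."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


density = [10, 20, 30, 40, 50, 60]   # vehicles/km
speed = [70, 65, 58, 50, 45, 40]     # km/h

r = pearson_r(density, speed)
print(round(r, 5))  # -0.99676
```

A value this close to −1 indicates a strong negative linear relationship: travel speed falls almost linearly as vehicle density rises.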
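The standard errors of the least squares estimates are:
\[ SE(\hat{\beta}_1) = \sqrt{\frac{MSE}{\sum (x_i - \bar{x})^2}} \]
\[ SE(\hat{\beta}_0) = \sqrt{MSE \left( \frac{1}{n} + \frac{\bar{x}^2}{\sum (x_i - \bar{x})^2} \right)} \]
Where,
MSE: Mean Square Error, given by
\[ MSE = \frac{\sum (y_i - \hat{y}_i)^2}{n - 2} \]
and $\hat{y}_i$ are the predicted values of y.
We know that the confidence interval for a coefficient is
\[ CI = \hat{\beta} \pm t_{\alpha/2,\, n-2} \cdot SE(\hat{\beta}) \]
For $\alpha = 0.05$ and $n = 6$:
\[ t_{0.05/2,\, 6-2} = t_{0.025,\, 4} = 2.776 \]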
Example
The confidence interval for 𝛽0 is given by:
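A minimal Python sketch of this calculation for the traffic data is given here. It uses the standard error and MSE formulas above and the tabulated value $t_{0.025,4} = 2.776$. The variable names and the helper `confidence_interval` are illustrative, and the numbers it prints are computed here rather than taken from the lecture.

```python
import math

density = [10, 20, 30, 40, 50, 60]   # x, vehicles/km
speed = [70, 65, 58, 50, 45, 40]     # y, km/h
n = len(density)

x_bar = sum(density) / n
y_bar = sum(speed) / n
s_xx = sum((x - x_bar) ** 2 for x in density)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(density, speed))

beta1_hat = s_xy / s_xx
beta0_hat = y_bar - beta1_hat * x_bar

# Mean square error: squared residuals divided by n - 2 degrees of freedom.
sse = sum((y - (beta0_hat + beta1_hat * x)) ** 2 for x, y in zip(density, speed))
mse = sse / (n - 2)

se_beta1 = math.sqrt(mse / s_xx)
se_beta0 = math.sqrt(mse * (1 / n + x_bar ** 2 / s_xx))

T_CRIT = 2.776  # t_{0.025, 4}, read from a t-table


def confidence_interval(estimate, std_error, t_crit=T_CRIT):
    """Return (lower, upper) bounds: estimate +/- t_crit * std_error."""
    margin = t_crit * std_error
    return estimate - margin, estimate + margin


ci_beta0 = confidence_interval(beta0_hat, se_beta0)
ci_beta1 = confidence_interval(beta1_hat, se_beta1)
print("95% CI for beta0:", tuple(round(v, 2) for v in ci_beta0))
print("95% CI for beta1:", tuple(round(v, 4) for v in ci_beta1))
```

With these data the interval for $\beta_0$ comes out at roughly (73.75, 79.18). The interval for $\beta_1$ lies entirely below zero, which is consistent with the strong negative relationship in the data.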
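Auto-Correlation and Cross-Correlation
Auto-correlation measures the relationship between a
variable and a lagged copy of itself. At lag k:
\[ r_{xx}(k) = \frac{\sum_{t=1}^{n-k} (x_t - \bar{x})(x_{t-k} - \bar{x})}{\sum_{t=1}^{n} (x_t - \bar{x})^2} \]
Cross-correlation measures the relationship between
two different variables at the same time or at different
lags.
\[ r_{xy}(k) = \frac{\sum_{t=1}^{n-k} (x_t - \bar{x})(y_{t-k} - \bar{y})}{\sqrt{\sum_{t=1}^{n} (x_t - \bar{x})^2 \cdot \sum_{t=1}^{n} (y_{t-k} - \bar{y})^2}} \]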
Auto-Correlation and Cross-Correlation
Example: Using our previous example data, calculate the auto-correlation and
cross-correlation at lag k = 0 (i.e., at the same time point).
Vehicle Density (x)   10   20   30   40   50   60
Travel Speed (y)      70   65   58   50   45   40
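\[ r_{xx}(k) = \frac{\sum_{t=1}^{n-k} (x_t - \bar{x})(x_{t-k} - \bar{x})}{\sum_{t=1}^{n} (x_t - \bar{x})^2} \]
At k = 0,
\[ r_{xx}(0) = \frac{\sum_{t=1}^{n} (x_t - \bar{x})(x_t - \bar{x})}{\sum_{t=1}^{n} (x_t - \bar{x})^2} = \frac{\sum_{t=1}^{n} (x_t - \bar{x})^2}{\sum_{t=1}^{n} (x_t - \bar{x})^2} = 1 \]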
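\[ r_{xy}(k) = \frac{\sum_{t=1}^{n-k} (x_t - \bar{x})(y_{t-k} - \bar{y})}{\sqrt{\sum_{t=1}^{n} (x_t - \bar{x})^2 \cdot \sum_{t=1}^{n} (y_{t-k} - \bar{y})^2}} \]
At k = 0, we have
\[ r_{xy}(0) = \frac{\sum_{t=1}^{n} (x_t - \bar{x})(y_t - \bar{y})}{\sqrt{\sum_{t=1}^{n} (x_t - \bar{x})^2 \cdot \sum_{t=1}^{n} (y_t - \bar{y})^2}} = \frac{-1090}{1093.542} = -0.9967 \]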
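The lag-0 results above can be checked with a short Python sketch. Two assumptions are made for lags k > 0 and are not taken from the lecture: the numerator pairs each $x_t$ with the y value k steps earlier, using only index pairs that exist in the sample, and the denominator uses the full-sample sums of squares. The function names are illustrative.

```python
import math


def cross_correlation(xs, ys, k=0):
    """Lag-k sample cross-correlation of two equally long series.

    Assumption: x_t is paired with y_{t-k} for the index pairs that exist,
    and the denominator uses full-sample sums of squares.
    """
    n = len(xs)
    if len(ys) != n:
        raise ValueError("series must have the same length")
    if not 0 <= k < n:
        raise ValueError("lag k must satisfy 0 <= k < n")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    numerator = sum((xs[t] - x_bar) * (ys[t - k] - y_bar) for t in range(k, n))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return numerator / math.sqrt(s_xx * s_yy)


def autocorrelation(xs, k=0):
    """Lag-k autocorrelation: the cross-correlation of a series with itself."""
    return cross_correlation(xs, xs, k)


density = [10, 20, 30, 40, 50, 60]
speed = [70, 65, 58, 50, 45, 40]

print(autocorrelation(density, 0))                   # 1.0
print(round(cross_correlation(density, speed, 0), 4))  # -0.9968
```

At lag 0 the cross-correlation reduces to Pearson's r, so it agrees with the value −0.99676 found earlier (which the lecture truncates to −0.9967).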
Where,
$Y$: Dependent variable
$X_i$: Independent variables
15 3.0 1800 32
12 2.0 1500 30
10 1.8 1200 28
8 1.6 800 29
Multiple Linear Regression Analysis
The coefficients can easily be obtained by using software. Here is the Excel output
(the step-by-step procedure will be performed in the class session).
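For readers without Excel, the following is a rough Python sketch of what such software computes. It assumes the coefficients are found by solving the normal equations $(X^T X)\,b = X^T y$, which is one common approach and not necessarily Excel's internal method. The function names and the small data set are made up for illustration and are not the lecture's data.

```python
def solve_linear_system(a, b):
    """Solve a * coef = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are not modified.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: predictors may be collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    coef = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(m[i][j] * coef[j] for j in range(i + 1, n))
        coef[i] = (m[i][n] - known) / m[i][i]
    return coef


def multiple_regression(rows, ys):
    """Least squares coefficients [b0, b1, ..., bk] via the normal equations."""
    design = [[1.0] + list(r) for r in rows]  # leading 1.0 is the intercept column
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(p)]
    return solve_linear_system(xtx, xty)


# Made-up data generated exactly from y = 1 + 2*x1 + 3*x2 (illustration only).
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
ys = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]

coefs = multiple_regression(rows, ys)
print([round(c, 6) for c in coefs])
```

Because the made-up responses follow the generating equation exactly, the fitted coefficients should recover the intercept 1 and the slopes 2 and 3 up to floating-point error.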