Lecture_Regression Analysis and Correlation

Uploaded by

marvinmugisha955
Unit III: Regression Analysis and Correlation
Introduction
• Suppose that we need to optimize the flow of traffic on a
certain main road in Kigali city. A transportation engineer
needs to look at the key factors affecting the traffic flow
like the density of vehicles (number of vehicles per
kilometer) and the average travel speed on that main road.
• To perform the task, data should be collected over several
days during peak hours. A mathematical model is then developed
to predict travel speed from vehicle density, which enables
better traffic management. This is where “regression analysis”
comes in.
Regression Analysis
• Def: Regression analysis is the part of statistics that
investigates the relationship between two or more variables
related in a nondeterministic fashion.
• In particular, a simple linear regression model is used to
model the relationship between two variables (dependent
and independent variables).
• The dependent variable is the outcome variable we aim
to predict, while the independent variable is the predictor
variable that influences the dependent variable.
Regression Analysis
The simple linear regression model is of the form:
$$Y = \beta_0 + \beta_1 x + \epsilon$$
where $\beta_0$ and $\beta_1$ are the parameters to be determined and $\epsilon$ is
the error term, which is assumed to be normally distributed
with $E(\epsilon) = 0$ and $V(\epsilon) = \sigma^2$.
Without $\epsilon$, any observed pair $(x, y)$ would correspond to a
point falling exactly on the line $y = \beta_0 + \beta_1 x$, called the true
(or population) regression line. The inclusion of the error term
allows $(x, y)$ to fall either above or below the true regression
line.
Model Parameters Estimation
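The point estimates of $\beta_0$ and $\beta_1$, denoted by $\hat{\beta}_0$ and $\hat{\beta}_1$, are called the
least squares estimates. The estimated regression line, or least squares
line, is then the line whose equation is
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$$
where
$$\hat{\beta}_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \frac{S_{xy}}{S_{xx}}$$
with $S_{xy} = \sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n}$ and $S_{xx} = \sum x_i^2 - \frac{(\sum x_i)^2}{n}$, and
$$\hat{\beta}_0 = \frac{\sum y_i - \hat{\beta}_1 \sum x_i}{n} = \bar{y} - \hat{\beta}_1 \bar{x}$$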
Example
The following data were collected on a certain main road during
peak hours. The data give the vehicle density (x, in
vehicles per kilometer) and the corresponding average travel
speed (y, in km/h). Write down the linear regression model that
represents the data.

Vehicle density (x)   10   20   30   40   50   60
Travel speed (y)      70   65   58   50   45   40
Example
Sol: (The step-by-step working will be done in the class session.)

$$\hat{\beta}_1 = \frac{\sum x_i y_i - (\sum x_i)(\sum y_i)/n}{\sum x_i^2 - (\sum x_i)^2/n} = \frac{-1090}{1750} = -0.62286$$
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 54.66667 - (-0.62286)(35) = 76.46667$$

Thus the linear regression model is given by:

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x = 76.46667 - 0.62286x$$

This model can also be used to predict travel speed for different vehicle
densities.
For instance, at $x = 30$ vehicles/km,
$$\hat{y} = 76.46667 - 0.62286(30) = 57.78 \text{ km/h}$$
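The least-squares computation for this example can be sketched in a few lines of Python. This is only an illustrative sketch using the standard library; the function name `fit_line` and the variable names are assumptions made here, not part of the lecture.

```python
# Least-squares fit of travel speed on vehicle density (data from the example).
x = [10, 20, 30, 40, 50, 60]   # vehicle density, vehicles/km
y = [70, 65, 58, 50, 45, 40]   # average travel speed, km/h


def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line (hypothetical helper)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    # S_xy and S_xx in their computational (shortcut) form
    s_xy = sum(a * b for a, b in zip(xs, ys)) - sum_x * sum_y / n
    s_xx = sum(a * a for a in xs) - sum_x ** 2 / n
    slope = s_xy / s_xx
    intercept = sum_y / n - slope * sum_x / n
    return intercept, slope


b0, b1 = fit_line(x, y)
speed_at_30 = b0 + b1 * 30     # predicted speed at 30 vehicles/km

print(f"slope = {b1:.5f}, intercept = {b0:.5f}")
print(f"predicted speed at 30 vehicles/km = {speed_at_30:.2f} km/h")
```

Run on the example data, this should reproduce the slope of about -0.62286, the intercept of about 76.46667, and the predicted speed of about 57.78 km/h.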
Pearson’s Correlation Coefficient
There are many situations in which the objective in studying the joint
behavior of two variables is to see whether they are related, rather than
to use one to predict the value of the other.

Pearson’s correlation coefficient ($r$), or simply the correlation
coefficient, measures how strongly two variables x and y are linearly
related in a sample.
For n pairs $(x_1, y_1), \ldots, (x_n, y_n)$, the correlation coefficient is given by:

$$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$$
Pearson’s Correlation Coefficient
$r \in [-1, 1]$:
If $r = 1$: Perfect positive correlation
If $r = -1$: Perfect negative correlation
If $r = 0$: No linear correlation
When we are dealing with population data, the previous formula
for the correlation coefficient ($r$) becomes:

$$r = \frac{Cov(X, Y)}{\sigma_X \sigma_Y}$$

where $Cov(X, Y) = \frac{1}{N} \sum_{i=1}^{N} (X_i - \mu_X)(Y_i - \mu_Y)$.
Pearson’s Correlation Coefficient
By using our previous example with vehicle density and the
corresponding average travel speed, the correlation coefficient is

$$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}} = \frac{-1090}{\sqrt{1195833}} = -0.99676$$

The correlation coefficient is very close to -1, which
indicates a strong negative linear relationship.
This makes sense since, even from inspecting the original data, the
average travel speed decreases as vehicle density increases. We
can say that vehicle density and average travel speed are
negatively (inversely) related.
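As a quick check of the value above, the sample correlation coefficient can be computed directly from its definition. The sketch below uses only the Python standard library; `pearson_r` is a hypothetical helper name chosen for illustration.

```python
import math

x = [10, 20, 30, 40, 50, 60]   # vehicle density, vehicles/km
y = [70, 65, 58, 50, 45, 40]   # average travel speed, km/h


def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient (hypothetical helper)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    s_xx = sum((a - mean_x) ** 2 for a in xs)
    s_yy = sum((b - mean_y) ** 2 for b in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


r = pearson_r(x, y)
print(f"r = {r:.5f}")
```

For the example data this should give a value of about -0.99676, in line with the hand calculation.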
Confidence Intervals for Regression Coefficients
A confidence interval for a regression coefficient estimates the range of
values within which the true coefficient is likely to lie, with a specified
level of confidence (like 99%, 95%, 90%).
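For our simple linear regression model, $Y = \beta_0 + \beta_1 x + \epsilon$, which is
estimated as $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$, we calculate the confidence intervals as
follows:
$$CI = \hat{\beta} \pm t_{\alpha/2,\,n-2} \cdot SE(\hat{\beta})$$
where
$\hat{\beta}$: The estimated regression coefficient ($\hat{\beta}_0$ or $\hat{\beta}_1$)
$t_{\alpha/2,\,n-2}$: The critical value obtained from the t-distribution with $n-2$ degrees of freedom
$SE(\hat{\beta})$: The standard error of the coefficient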
Confidence Intervals for Regression Coefficients
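The standard error (SE) is calculated as follows:
For the slope ($\hat{\beta}_1$):

$$SE(\hat{\beta}_1) = \sqrt{\frac{MSE}{\sum (x_i - \bar{x})^2}}$$

For the intercept ($\hat{\beta}_0$):

$$SE(\hat{\beta}_0) = \sqrt{MSE \left( \frac{1}{n} + \frac{\bar{x}^2}{\sum (x_i - \bar{x})^2} \right)}$$

where
MSE: the mean square error, given by $MSE = \frac{\sum (y_i - \hat{y}_i)^2}{n-2}$, where $\hat{y}_i$ are the predicted values of y,
n: the number of data points.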

Example
Using our previous example data, calculate the 95% confidence
intervals for the regression coefficients.

Vehicle density (x)   10   20   30   40   50   60
Travel speed (y)      70   65   58   50   45   40

Sol: (The step-by-step working will be done in the class session.)

The regression equation was estimated to be:
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x = 76.46667 - 0.62286x$$

We know that $CI = \hat{\beta} \pm t_{\alpha/2,\,n-2} \cdot SE(\hat{\beta})$, and
$$t_{0.05/2,\,6-2} = t_{0.025,\,4} = 2.776$$
Example
The confidence interval for $\beta_0$ is given by:

$$76.46667 \pm 2.776(0.9785) = (73.750,\ 79.183)$$

The confidence interval for $\beta_1$ is given by:

$$-0.62286 \pm 2.776(0.0251) = (-0.693,\ -0.553)$$

The confidence intervals and regression coefficients can be easily
estimated using software such as Excel or Python.
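As one possible illustration in Python, the sketch below recomputes the mean square error, the standard errors, and both 95% intervals for the traffic data. It uses only the standard library, so the critical value 2.776 is hard-coded from the t-table value quoted in this example rather than computed; all variable names are assumptions made here.

```python
import math

x = [10, 20, 30, 40, 50, 60]   # vehicle density, vehicles/km
y = [70, 65, 58, 50, 45, 40]   # average travel speed, km/h
T_CRIT = 2.776                 # t_{0.025, 4}, taken from the lecture (not computed here)

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
s_xx = sum((xi - mean_x) ** 2 for xi in x)
s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

# Least-squares estimates of slope and intercept
b1 = s_xy / s_xx
b0 = mean_y - b1 * mean_x

# Mean square error with n - 2 degrees of freedom
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
mse = sse / (n - 2)

# Standard errors of the slope and the intercept
se_b1 = math.sqrt(mse / s_xx)
se_b0 = math.sqrt(mse * (1 / n + mean_x ** 2 / s_xx))

ci_b0 = (b0 - T_CRIT * se_b0, b0 + T_CRIT * se_b0)
ci_b1 = (b1 - T_CRIT * se_b1, b1 + T_CRIT * se_b1)

print(f"MSE = {mse:.4f}, SE(b0) = {se_b0:.4f}, SE(b1) = {se_b1:.4f}")
print(f"95% CI for b0: ({ci_b0[0]:.3f}, {ci_b0[1]:.3f})")
print(f"95% CI for b1: ({ci_b1[0]:.3f}, {ci_b1[1]:.3f})")
```

The standard errors should come out close to 0.9785 and 0.0251, and the intervals close to (73.750, 79.183) and (-0.693, -0.553). For other sample sizes or confidence levels the critical value would have to be looked up again or obtained from a statistics library.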


Example
Here is the output generated by Excel (Step-by-step to be
performed in class session)
Auto-Correlation and Cross-Correlation
Auto-correlation measures the correlation of a variable
with itself at different time lags (k) and is most useful for
time-series data. For instance, we can compare $x_t$ and
$x_{t+k}$. It is calculated using the following formula:

$$r_{xx}(k) = \frac{\sum_{t=1}^{n-k} (x_t - \bar{x})(x_{t+k} - \bar{x})}{\sum_{t=1}^{n} (x_t - \bar{x})^2}$$
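A minimal Python sketch of the auto-correlation at lag k is given below. It pairs $x_t$ with $x_{t+k}$ for t = 1, ..., n-k, as in the comparison described above, and it assumes that the vehicle-density readings from the earlier example are ordered in time. The function name `autocorr` is hypothetical.

```python
def autocorr(series, k):
    """Sample auto-correlation of `series` at lag k, for 0 <= k < len(series)."""
    n = len(series)
    mean = sum(series) / n
    # Numerator: products of deviations for the n - k pairs (x_t, x_{t+k})
    num = sum((series[t] - mean) * (series[t + k] - mean) for t in range(n - k))
    # Denominator: total sum of squared deviations over the whole series
    den = sum((value - mean) ** 2 for value in series)
    return num / den


x = [10, 20, 30, 40, 50, 60]   # vehicle density, assumed ordered in time
r_xx_0 = autocorr(x, 0)
print(f"r_xx(0) = {r_xx_0:.4f}")
```

At lag 0 the numerator and denominator coincide, so the result is 1 for any non-constant series.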
Auto-Correlation and Cross-Correlation
Cross-correlation measures the relationship between
two different variables at the same time or at different
lags.

The cross-correlation between x and y is calculated
using the following formula:

$$r_{xy}(k) = \frac{\sum_{t=1}^{n-k} (x_t - \bar{x})(y_{t+k} - \bar{y})}{\sqrt{\sum_{t=1}^{n} (x_t - \bar{x})^2 \cdot \sum_{t=1}^{n} (y_t - \bar{y})^2}}$$
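The cross-correlation can be sketched in Python in the same way. The version below pairs $x_t$ with $y_{t+k}$, which is an assumption about the lag convention (the same one used for the auto-correlation), and normalises by the square root of the product of the two sums of squared deviations. The function name `crosscorr` is hypothetical.

```python
import math


def crosscorr(xs, ys, k):
    """Sample cross-correlation between xs and ys at lag k (assumed convention: x_t with y_{t+k})."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Numerator: products of deviations for the n - k pairs (x_t, y_{t+k})
    num = sum((xs[t] - mean_x) * (ys[t + k] - mean_y) for t in range(n - k))
    # Denominator: square root of the product of the two sums of squares
    den = math.sqrt(
        sum((a - mean_x) ** 2 for a in xs) * sum((b - mean_y) ** 2 for b in ys)
    )
    return num / den


x = [10, 20, 30, 40, 50, 60]   # vehicle density, vehicles/km
y = [70, 65, 58, 50, 45, 40]   # average travel speed, km/h
r_xy_0 = crosscorr(x, y, 0)
print(f"r_xy(0) = {r_xy_0:.4f}")
```

At lag 0 this reduces to Pearson's correlation coefficient, so for the traffic data it should give about -0.9967.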
Auto-Correlation and Cross-Correlation
Example: Using our previous example data, calculate the auto-correlation and
cross-correlation at lag k = 0 (i.e., at the same time point).

Vehicle density (x)   10   20   30   40   50   60
Travel speed (y)      70   65   58   50   45   40

Sol: The formula for the auto-correlation is:

$$r_{xx}(k) = \frac{\sum_{t=1}^{n-k} (x_t - \bar{x})(x_{t+k} - \bar{x})}{\sum_{t=1}^{n} (x_t - \bar{x})^2}$$

At k = 0,
$$r_{xx}(0) = \frac{\sum_{t=1}^{n} (x_t - \bar{x})(x_t - \bar{x})}{\sum_{t=1}^{n} (x_t - \bar{x})^2} = \frac{\sum_{t=1}^{n} (x_t - \bar{x})^2}{\sum_{t=1}^{n} (x_t - \bar{x})^2} = 1$$

An auto-correlation of 1 confirms that a time series is perfectly correlated
with itself at the same time point.
Auto-Correlation and Cross-Correlation
The cross-correlation is given by:

$$r_{xy}(k) = \frac{\sum_{t=1}^{n-k} (x_t - \bar{x})(y_{t+k} - \bar{y})}{\sqrt{\sum_{t=1}^{n} (x_t - \bar{x})^2 \cdot \sum_{t=1}^{n} (y_t - \bar{y})^2}}$$

At k = 0, we have

$$r_{xy}(0) = \frac{\sum_{t=1}^{n} (x_t - \bar{x})(y_t - \bar{y})}{\sqrt{\sum_{t=1}^{n} (x_t - \bar{x})^2 \cdot \sum_{t=1}^{n} (y_t - \bar{y})^2}} = \frac{-1090}{1093.542} = -0.9967$$

(The step-by-step working will be done in the class session.)

The cross-correlation $r_{xy}(0) = -0.9967$ indicates a strong negative
correlation between vehicle density and travel speed at the same time point.

Exercise (HW): Calculate the auto- and cross-correlations at lag k = 1.


Multiple Linear Regression Analysis
• Imagine that you want to estimate power consumption
based on temperature, time of day, and household size.
• Suppose also that you want to predict the strength of concrete
based on cement content, water-to-cement ratio, and curing time.
• In these cases, we have one dependent variable and more than
one independent variable.
• To model the relationship between the dependent variable and
independent variables, we use the “multiple linear regression
model”.
Multiple Linear Regression Analysis
The multiple linear regression model is expressed as
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + \varepsilon$$

where
Y: Dependent variable
$X_i$: Independent variables
$\beta_0$: Intercept (value of Y when all X’s are zero)
$\beta_i$, $i = 1, 2, \ldots, k$: Regression coefficients
$\varepsilon$: Error term (difference between observed and predicted Y)


Multiple Linear Regression Analysis
Example: Consider the following data on the fuel
efficiency of vehicles. Write down the regression
equation that predicts the fuel efficiency.

Fuel Efficiency   Engine Size   Vehicle Weight   Tire Pressure
15                3.0           1800             32
12                2.0           1500             30
10                1.8           1200             28
8                 1.6           800              29
Multiple Linear Regression Analysis
The coefficients can easily be obtained by using software. Here is the Excel output
(step-by-step working to be done in the class session).

The regression model, with coefficients rounded, will be

$$\hat{Y} = -2.43 + 1.25\,(\text{engine size}) + 0.0048\,(\text{vehicle weight}) + 0.16\,(\text{tire pressure})$$
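The coefficients for this small data set can also be checked in Python. The sketch below relies on a special feature of the example: there are four observations and four coefficients, so the least-squares fit passes through every data point and the coefficients solve a 4×4 linear system exactly. With more observations than coefficients one would instead solve the normal equations or use a statistics library. The `solve` helper is a hypothetical, standard-library-only Gaussian elimination written for illustration.

```python
# Each row: [1 (intercept), engine size, vehicle weight, tire pressure]
X = [
    [1.0, 3.0, 1800.0, 32.0],
    [1.0, 2.0, 1500.0, 30.0],
    [1.0, 1.8, 1200.0, 28.0],
    [1.0, 1.6, 800.0, 29.0],
]
y = [15.0, 12.0, 10.0, 8.0]    # fuel efficiency


def solve(a, b):
    """Solve the square system a * coef = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        # Move the row with the largest entry in this column into pivot position
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution
    coef = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][c] * coef[c] for c in range(i + 1, n))
        coef[i] = (m[i][n] - tail) / m[i][i]
    return coef


b0, b1, b2, b3 = solve(X, y)
print(f"intercept = {b0:.4f}, engine = {b1:.4f}, weight = {b2:.5f}, pressure = {b3:.4f}")
```

Because the fit is exact here, there are no residual degrees of freedom, so standard errors and confidence intervals cannot be estimated from these four rows alone.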
Exercise
The article “How to optimize and control the Wire
Bonding Process: Part II” described an experiment carried
out to assess the impact of the variables force (gm), power
(mW), temperature (°C), and time (msec) on ball
bond shear strength (gm). The data in the following slide
were generated to be consistent with the information given
in the article. Write down the regression model that
fits the data.
Exercise
