Topic - 7 - Regression Analysis
Statistics
Topic 7:
Introduction to Regression Analysis
Relationships Between Random Variables
• Typically we have a sample of the data and don’t know the joint
PDF that the data comes from.
Correlation Coefficient
• We estimate $\rho$ from sample data with the sample coefficient of correlation $\hat{\rho}$:

$$\hat{\rho} = \frac{s_{XY}}{s_X s_Y} = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2}\,\sqrt{\sum (Y_i - \bar{Y})^2}}$$
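As a sketch, the sample correlation formula above can be computed directly in Python (the data values here are made up for illustration, not the lecture's data):

```python
from math import sqrt

def sample_corr(x, y):
    """Sample correlation coefficient rho_hat of two equal-length lists."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / sqrt(s_xx * s_yy)

print(sample_corr([1, 2, 3], [6, 4, 2]))  # -1.0 (perfect negative relationship)
```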
Scatterplot of Birth Rate and GNP Growth Rate
[Scatterplot: birth rate (vertical axis, 0–60) against GNP growth rate (horizontal axis, −1 to 8), showing a negative relationship.]
Correlation Example
• We can apply the previous formula for the sample correlation coefficient
to derive:
$\hat{\rho} = -0.824$
Interpretation
– There is a strong negative relationship between the birth rate and
GNP growth.
– Countries that have higher economic growth are expected to have
lower birth rates.
Correlation versus Causality
Correlation does not imply causality
– High correlation between two variables does not mean that one
of the variables causes the other.
Regression Analysis
Steps in Regression Analysis
6. Diagnostic Checking – making sure that the model fits the data well
8. Prediction (forecasting)
The Regression Line
• The sample estimates $\hat{\alpha}$, $\hat{\beta}$ define the position of the regression line.

[Scatterplot: birth rate (Y, 0–60) against growth rate (X, −1 to 8) with the fitted regression line.]
Estimation of the Model Parameters
• How do we obtain the parameters of a linear regression model?
– There are several methods, such as least squares, maximum likelihood and the method of moments.
– In this unit we will use a method called Ordinary Least Squares (OLS).
– OLS provides the solution for $\alpha$ and $\beta$ that minimises the squared (vertical) distances from the data points to the fitted line.
[Scatterplot: birth rate (Y, 0–60) against growth rate (X, −1 to 8) with the fitted line and a vertical distance labelled "error $u_i$".]
Estimation of the Model Parameters
– The points on the fitted regression line are computed as $\hat{Y}_i = \hat{\alpha} + \hat{\beta} X_i$
– Therefore the residuals are $\hat{u}_i = Y_i - \hat{\alpha} - \hat{\beta} X_i = Y_i - \hat{Y}_i$
Regression Parameter Formulas
• OLS estimators
– The solution to the OLS minimisation problem presented on the previous page is as follows:

$$\hat{\beta} = \frac{s_{XY}}{s_X^2} = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}, \qquad \hat{\alpha} = \bar{Y} - \hat{\beta}\bar{X}$$
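The OLS formulas above can be sketched as a small Python function (the x and y values below are a made-up toy dataset, not the lecture data):

```python
def ols(x, y):
    """Ordinary least squares for a simple linear regression y = a + b*x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    beta = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
         / sum((xi - x_bar) ** 2 for xi in x)
    alpha = y_bar - beta * x_bar
    return alpha, beta

alpha, beta = ols([1, 2, 3, 4], [2, 4, 6, 8])   # points lying on the line y = 2x
print(alpha, beta)  # 0.0 2.0
```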
Birth rate/GNP Example

1. $\hat{\beta} = \dfrac{12 \times 1{,}139.70 - 40.2 \times 380}{12 \times 184.04 - 40.2^2} = -2.70$

2. $\hat{\alpha} = \dfrac{380}{12} - (-2.70) \times \dfrac{40.2}{12} = 40.71$

• Interpretation:
– A unit increase in the GNP growth rate (e.g. from 3% to 4%) will lower the birth rate by 2.7 births per 1,000 people on average (e.g. from 32.61 to 29.91).
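Assuming the summary statistics reconstructed from this slide ($n = 12$, $\sum X_i = 40.2$, $\sum X_i^2 = 184.04$, $\sum Y_i = 380$, $\sum X_i Y_i = 1{,}139.70$), the estimates can be checked numerically:

```python
# Summary statistics as reconstructed from the slide (assumed values)
n = 12
sum_x, sum_x2 = 40.2, 184.04        # GNP growth rate X
sum_y, sum_xy = 380, 1_139.70       # birth rate Y and the cross-product

beta_hat = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
alpha_hat = sum_y / n - beta_hat * sum_x / n   # Y_bar - beta_hat * X_bar

print(round(beta_hat, 2), round(alpha_hat, 2))  # -2.7 40.71
```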
The difference between $u_i$ and $\hat{u}_i$
• Note that there is a difference between the true residuals $u_i$ and the estimated residuals $\hat{u}_i$:
• $u_i = Y_i - \alpha - \beta X_i$
• $\hat{u}_i = Y_i - \hat{\alpha} - \hat{\beta} X_i$
Measuring Goodness of Fit
• Having calculated the regression line we now ask whether it
provides a good fit for the data.
– Do the observations tend to lie close to, or far away from the line?
• The total sum of squares (TSS) is the total variation of $Y$ around its mean value: $TSS = \sum (Y_i - \bar{Y})^2$
• The explained sum of squares (ESS) is the part of the TSS explained by the linear regression: $ESS = \sum (\hat{Y}_i - \bar{Y})^2$
• The residual (or unexplained) sum of squares (RSS) is the sum of squared errors between the actual values of $Y$ and the regression line: $RSS = \sum \hat{u}_i^2$
• Note that $TSS = ESS + RSS$, so that $\dfrac{ESS}{TSS} = 1 - \dfrac{RSS}{TSS}$
Goodness of Fit:
Explained and Unexplained components of Total error
• Hill, R.C., et al., 2011, Principles of Econometrics, 4th ed., p. 136.
$R^2$ – Coefficient of Determination

$$R^2 = \frac{\sum (\hat{Y}_i - \bar{Y})^2}{\sum (Y_i - \bar{Y})^2} = \frac{\widehat{\mathrm{Var}}(\hat{Y})}{\widehat{\mathrm{Var}}(Y)} = \hat{\rho}^2(Y, \hat{Y})$$
$R^2$ example
• Compute $R^2$ for the birth rate – GNP growth example
– $TSS = RSS + ESS$
– $TSS = \sum (Y_i - \bar{Y})^2 = \sum Y_i^2 - n\bar{Y}^2 = 12{,}564 - 12 \times 31.67^2 = 530.67$
– $RSS = \sum (Y_i - \hat{Y}_i)^2 = \sum (Y_i^2 - 2 Y_i \hat{Y}_i + \hat{Y}_i^2) = \sum Y_i^2 - \sum Y_i \hat{Y}_i = \sum Y_i^2 - \hat{\alpha} \sum Y_i - \hat{\beta} \sum X_i Y_i = 12{,}564 - 40.71 \times 380 - (-2.7) \times 1{,}139.7 = 170.75$
• Since $\sum Y_i \hat{Y}_i = \sum [\hat{Y}_i + \hat{u}_i]\hat{Y}_i = \sum \hat{Y}_i^2$
– $ESS = 530.67 - 170.75 = 359.92$

$$R^2 = \frac{ESS}{TSS} = \frac{359.92}{530.67} = 0.678$$

• Interpretation:
– About 67.8% of the total variation in the birth rates around the mean value is explained by the variation in the GNP growth rates.
– The correlation between the values of $Y$ and the estimated regression line $\hat{Y}$ is $\hat{\rho}(Y, \hat{Y}) = \sqrt{R^2} = 0.823$. Note: $\hat{\rho}(Y, \hat{Y})$ is always positive, as can be seen from $\mathrm{Cov}(Y, \hat{Y}) = \mathrm{Var}(\hat{Y})$.
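The $R^2$ decomposition can be re-derived from the summary statistics quoted on the slide. Note that plugging in the rounded coefficients 40.71 and −2.70 gives RSS ≈ 171.4 rather than the slide's 170.75, a small rounding artefact:

```python
# Summary statistics and rounded OLS estimates quoted in the lecture
n, sum_y, sum_y2, sum_xy = 12, 380, 12_564, 1_139.70
alpha_hat, beta_hat = 40.71, -2.70
y_bar = sum_y / n

tss = sum_y2 - n * y_bar ** 2                        # total sum of squares
rss = sum_y2 - alpha_hat * sum_y - beta_hat * sum_xy # residual sum of squares
ess = tss - rss                                      # explained sum of squares
r2 = ess / tss

print(round(tss, 2), round(r2, 3))  # approx 530.67 and approx 0.677
```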
Inference in Regression Analysis
• So far, we have used OLS as a way of finding the “best” values for the
parameters 𝛼 and 𝛽 given a dataset.
– For this we only needed calculus to differentiate the sum of squared residuals and find
its minimum.
• In order to work out the standard errors and distributions of $\hat{\alpha}$ and $\hat{\beta}$ we will need to make some assumptions about the true but unobservable residuals $u_i$.
– Note: we don’t need to make any assumptions regarding the sample residuals $\hat{u}_i$ since we can compute them.
Classical Linear Regression Model
(CLRM) Assumptions
1. $E(u_i) = 0$ for all $i$
2. $\mathrm{Var}(u_i) = \sigma_u^2$ for all $i$
3. $\mathrm{Cov}(u_i, u_j) = 0$ for $i \neq j$
4. $X$ is fixed in repeated samples $\Rightarrow E(u_i X_i) = X_i\,E(u_i) = 0 \Rightarrow \mathrm{Cov}(u_i, X_i) = 0$
5. $u_i \sim N(0, \sigma_u^2)$
Properties of the OLS estimators under the
CLRM assumptions
1. Under assumptions 1–4, the OLS estimators $\hat{\alpha}$ and $\hat{\beta}$ are Best Linear Unbiased Estimators (BLUE)
– Unbiased means $E(\hat{\alpha}) = \alpha$, $E(\hat{\beta}) = \beta$
– Linear means that we can write $\hat{\alpha} = c_1 Y_1 + c_2 Y_2 + \cdots + c_n Y_n$ and $\hat{\beta} = c_1^* Y_1 + c_2^* Y_2 + \cdots + c_n^* Y_n$ for some constants $c_i$ and $c_i^*$.
2. $\hat{\alpha}$ and $\hat{\beta}$ are consistent, i.e. $\hat{\alpha} \xrightarrow{p} \alpha$, $\hat{\beta} \xrightarrow{p} \beta$
– i.e. the estimates converge towards the true values in a very large sample.
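Unbiasedness can be illustrated with a small simulation sketch: hold the $X$'s fixed (CLRM assumption 4), redraw the errors many times, and average the OLS slope estimates. The true values $\alpha = 2$, $\beta = -1.5$ below are made up for illustration:

```python
import random

random.seed(0)
x = [0.5 * i for i in range(1, 21)]          # X's held fixed across samples
x_bar = sum(x) / len(x)
sxx = sum((xi - x_bar) ** 2 for xi in x)

slopes = []
for _ in range(5000):
    # redraw the errors u_i ~ N(0, 1); true line is y = 2.0 - 1.5 x
    y = [2.0 - 1.5 * xi + random.gauss(0, 1) for xi in x]
    y_bar = sum(y) / len(y)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    slopes.append(b)

print(round(sum(slopes) / len(slopes), 2))   # close to the true slope -1.5
```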
OLS Standard Errors
• Any set of regression estimates ($\hat{\alpha}$ and $\hat{\beta}$) is specific to the sample used in its estimation, so the estimates are random variables
– We need to find their variances (standard errors) and their PDFs
• Given the 5 CLRM assumptions we can show that $\hat{\alpha}$ and $\hat{\beta}$ are normally distributed
– If we know the true $\sigma_u^2$, the variance of the error term, then the sampling distributions are as follows:

$$\hat{\beta} \sim N\!\left(\beta,\ \frac{\sigma_u^2}{\sum (X_i - \bar{X})^2}\right), \qquad \hat{\alpha} \sim N\!\left(\alpha,\ \frac{\sigma_u^2 \sum X_i^2}{n \sum (X_i - \bar{X})^2}\right)$$
Estimating the Error Variance $\sigma_u^2$
• Usually we don’t know the true $\sigma_u^2$ and need to estimate it from the data together with $\alpha$ and $\beta$.
– If we observed the true $u_i$ we could estimate $\sigma_u^2$ from the sample as:

$$\hat{\sigma}_u^2 = \frac{1}{n} \sum (u_i - \bar{u})^2$$

– However, the problem is that we don’t know $u_i$ because we don’t know the true $\alpha$ and $\beta$.
– We have $\hat{\alpha}$ and $\hat{\beta}$, so we can compute $\hat{u}_i$ and use it to estimate $\sigma_u^2$ as follows:
• $\hat{\sigma}_u^2 = \frac{1}{n} \sum \hat{u}_i^2$; this estimator is consistent for $\sigma_u^2$
• If we want an estimator which is both unbiased and consistent we use

$$\hat{\sigma}_u^2 = \frac{1}{n-2} \sum \hat{u}_i^2$$
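A quick numeric sketch of the two estimators, using the slide's RSS = 170.75 and n = 12:

```python
rss, n = 170.75, 12                # slide values: residual sum of squares, sample size

sigma2_consistent = rss / n        # divide by n: biased in small samples, but consistent
sigma2_unbiased = rss / (n - 2)    # divide by n - 2: unbiased and consistent

print(round(sigma2_consistent, 3), round(sigma2_unbiased, 3))  # 14.229 17.075
```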
Standard Errors for $\hat{\alpha}$ and $\hat{\beta}$
• Based on the assumption that the true errors $u_i$ are normally distributed (CLRM assumption 5) and uncorrelated with each other, we can derive sample variances for $\hat{\alpha}$ and $\hat{\beta}$:

$$\widehat{\mathrm{Var}}(\hat{\beta}) = \frac{\hat{\sigma}_u^2}{\sum (X_i - \bar{X})^2}, \qquad \widehat{\mathrm{Var}}(\hat{\alpha}) = \frac{\hat{\sigma}_u^2 \sum X_i^2}{n \sum (X_i - \bar{X})^2}$$
The Distribution of $\hat{\alpha}$ and $\hat{\beta}$
• Under the normality assumption about $u_i$ it is possible to show that when using $\hat{\sigma}_u^2$ instead of $\sigma_u^2$:

$$\frac{\hat{\alpha} - \alpha}{\sqrt{\widehat{\mathrm{Var}}(\hat{\alpha})}} \sim t_{n-2}, \qquad \frac{\hat{\beta} - \beta}{\sqrt{\widehat{\mathrm{Var}}(\hat{\beta})}} \sim t_{n-2}$$
Birth Rate and GNP Growth Rate Example
• $\hat{\sigma}_u^2 = \dfrac{RSS}{n-2} = \dfrac{\sum \hat{u}_i^2}{n-2} = \dfrac{170.75}{10} = 17.075$
• $\widehat{\mathrm{Var}}(\hat{\beta}) = \dfrac{\hat{\sigma}_u^2}{\sum (X_i - \bar{X})^2} = \dfrac{17.075}{49.35} = 0.346$
– $SE(\hat{\beta}) = \sqrt{0.346} = 0.588$
– Since $t_{10} = 2.228$ at the 5% level
– The 95% CI is $-2.7 \pm 2.228 \times 0.588 = [-4.01, -1.39]$
• $\widehat{\mathrm{Var}}(\hat{\alpha}) = \dfrac{\hat{\sigma}_u^2 \sum X_i^2}{n \sum (X_i - \bar{X})^2} = 5.304 \Rightarrow SE(\hat{\alpha}) = 2.303$
– 95% CI $= 40.71 \pm 2.228 \times 2.303 = [35.57, 45.84]$
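The standard errors and confidence intervals can be re-computed from the quoted quantities; the values $\sum (X_i - \bar{X})^2 = 49.35$ and $\sum X_i^2 = 184.04$ below are reconstructed from the slide's numbers, so treat them as assumptions:

```python
from math import sqrt

sigma2_hat = 17.075          # estimated error variance from the slide
sxx = 49.35                  # sum of (X_i - X_bar)^2 (reconstructed)
n, sum_x2 = 12, 184.04       # sample size and sum of X_i^2 (reconstructed)
t_crit = 2.228               # t_10 critical value at the 5% level

se_beta = sqrt(sigma2_hat / sxx)
se_alpha = sqrt(sigma2_hat * sum_x2 / (n * sxx))
ci_beta = (-2.70 - t_crit * se_beta, -2.70 + t_crit * se_beta)
ci_alpha = (40.71 - t_crit * se_alpha, 40.71 + t_crit * se_alpha)

print(round(se_beta, 3), round(se_alpha, 3))  # approx 0.588 and approx 2.304
```

The $\hat{\alpha}$ standard error comes out as 2.304 here versus the slide's 2.303, a rounding difference in the intermediate variance.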
Testing the significance of
individual coefficient estimates
• We can now also test hypotheses in the usual manner.
– To test for a zero slope coefficient, i.e. that $X$ has no statistically significant impact on $Y$:

$$H_0: \beta = 0 \quad \{\text{X has no impact on Y}\}$$

$$t = \frac{-2.70 - 0}{0.588} = -4.59$$
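The t statistic is a one-line computation given the estimate and its standard error from the previous slide:

```python
beta_hat, se_beta = -2.70, 0.588   # slope estimate and its standard error

t_stat = (beta_hat - 0) / se_beta  # test statistic for H0: beta = 0
print(round(t_stat, 2))            # -4.59
```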
Example: Using Gretl
• Note that these estimates of $\hat{\alpha}$ and $\hat{\beta}$ are identical to those we computed earlier using the formulas.
Please watch the recording in iLearn
on introduction to Gretl.