Topic - 7 - Regression Analysis

This document provides an introduction to regression analysis. It discusses relationships between random variables and how regression can be used to estimate conditional means from sample data. Key points covered include: 1. Regression measures the effect of explanatory variables on a dependent variable and determines if the relationship is statistically significant. 2. The steps in regression analysis include formulating a hypothesis, specifying a model, collecting data, specifying an econometric model with an error term, estimating parameters, diagnostic checking, hypothesis testing, and prediction. 3. The regression line shows the estimated relationship between the dependent and explanatory variable based on the estimated intercept and slope parameters from the sample data.


Econ6034: Econometrics and Business Statistics

Topic 7:
Introduction to Regression Analysis

1
Relationships Between Random Variables

• In some situations we may know the joint distribution of two random
  variables Y and X, which we call the bivariate distribution.
  – Bivariate distributions can be specified for discrete or continuous random
    variables.
  – E.g. if X and Y are discrete random variables, then p(x, y) = P(X = x and
    Y = y) is the joint probability that X = x and Y = y, and p( ) is the joint
    probability distribution.
  – In the case of continuous variables, we may know the joint probability
    density function.
• From the joint distribution we can compute expected values and variances, as
  well as the covariance and correlation, which capture the extent of linear
  dependence between X and Y.
• From the joint distribution we can also compute the conditional distribution
  of Y|X.
• From the conditional distribution of Y|X we can compute the conditional mean
  of Y|X
  – i.e. E(Y|X)
  – i.e. given a value of X, we can make a prediction of the value of Y.
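As a sketch of the last point: when the joint distribution is known, the conditional mean E(Y|X = x) can be computed directly from it. The discrete joint pmf below is invented purely for illustration.

```python
# Hypothetical discrete joint pmf p(x, y) = P(X=x and Y=y); the numbers
# are made up for illustration and sum to 1.
joint = {
    (0, 10): 0.2, (0, 20): 0.1,
    (1, 10): 0.1, (1, 20): 0.6,
}

def conditional_mean(joint, x):
    """E[Y | X = x] = sum over y of y * P(Y=y | X=x)."""
    px = sum(p for (xi, _), p in joint.items() if xi == x)  # marginal P(X=x)
    return sum(y * p / px for (xi, y), p in joint.items() if xi == x)

print(conditional_mean(joint, 0))  # E[Y | X=0]
print(conditional_mean(joint, 1))  # E[Y | X=1]
```

Given a value of X, the function returns the prediction of Y that the slide describes; regression will later estimate this quantity from sample data instead of a known pmf.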
2
Relationships Between Random Variables

• The above discussion is based on the assumption that we know the true joint
  or conditional distribution.
• However, this is often not the case.
• Typically we have a sample of the data and don't know the joint PDF that the
  data come from.
• In the remaining part of the course we will look at ways of measuring the
  relationship between random variables from a sample of data.
• For example, we will learn to use linear regression analysis to estimate
  E(Y|X) from sample data.
3
Correlation Coefficient
• Recall that we have already learned another method of
measuring the linear dependence between two variables, the
correlation coefficient.
– It will be useful to recall this as we move into our explanation of linear
regression.

• The correlation coefficient ρ (rho) measures the strength of the linear
  relationship between two variables, X and Y. Moreover, it indicates the
  degree to which the variation in X is related to the variation in Y.
• ρ always takes a value between -1 and +1:
  – ρ = +1: perfect positive linear relationship;
  – ρ = -1: perfect negative linear relationship; and,
  – ρ = 0: no linear relationship (although there could be a different sort
    of relationship between the variables).

4
Correlation Coefficient
• We estimate ρ from sample data with the sample correlation coefficient ρ̂:

  ρ̂ = [ (1/n) Σ(Xᵢ − X̄)(Yᵢ − Ȳ) ] / √[ (1/n) Σ(Xᵢ − X̄)² · (1/n) Σ(Yᵢ − Ȳ)² ]

    = Σ(Xᵢ − X̄)(Yᵢ − Ȳ) / √[ Σ(Xᵢ − X̄)² Σ(Yᵢ − Ȳ)² ]

• Alternatively, if we have already calculated the sample covariance and
  sample variances, we can derive the sample correlation coefficient as:

  ρ̂ = s_XY / (s_X s_Y)

5
Correlation Example

• Compute sample correlation between the birth rates and GNP


growth
Country Birth rate GNP growth
Y X
Brazil 30 5.1
Colombia 29 3.2
Costa Rica 30 3
India 35 1.4
Mexico 36 3.8
Peru 36 1
Philippines 34 2.8
Senegal 48 -0.3
South Korea 24 6.9
Sri Lanka 27 2.5
Taiwan 21 6.2
Thailand 30 4.6
Birth rates: Births per 1000 population 1981
Growth rates: per capita per annum, average 1961-1981
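The sample correlation for this table can be computed directly from the formula above. This is a minimal sketch using only the data shown (the 1/n scaling cancels between numerator and denominator, so plain sums suffice):

```python
# Sample correlation between birth rate (Y) and GNP growth (X),
# using the computational form of the correlation formula.
Y = [30, 29, 30, 35, 36, 36, 34, 48, 24, 27, 21, 30]
X = [5.1, 3.2, 3, 1.4, 3.8, 1, 2.8, -0.3, 6.9, 2.5, 6.2, 4.6]
n = len(Y)

# Centered cross-product and sums of squares (common 1/n factors cancel).
sxy = sum(x * y for x, y in zip(X, Y)) - sum(X) * sum(Y) / n
sxx = sum(x * x for x in X) - sum(X) ** 2 / n
syy = sum(y * y for y in Y) - sum(Y) ** 2 / n

rho_hat = sxy / (sxx * syy) ** 0.5
print(round(rho_hat, 3))  # -0.824
```

This reproduces the ρ̂ = −0.824 reported on the following slides.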

6
Scatterplot of Birth Rate and GNP Growth Rate
[Figure: scatterplot of birth rate (vertical axis, 0–60) against growth rate
(horizontal axis, −1 to 8) for the 12 countries.]

– Note: the correlation appears to be negative and quite high, since we can
  easily imagine a negatively sloped trend line with all the observations
  quite close to the line. Also note we have only 12 observations, so we
  can't be too confident in our conclusion.

7
Correlation Example

• We can apply the previous formula for the sample correlation coefficient
  to derive:
  ρ̂ = −0.824
Interpretation
– There is a strong negative relationship between the birth rate and
GNP growth.
– Countries that have higher economic growth are expected to have
lower birth rates.

8
Correlation versus Causality
Correlation does not imply causality
– High correlation between two variables does not mean that one
of the variables causes the other.

– We cannot use correlation to distinguish between the four possible
  causal relationships:
• 𝑋 causes 𝑌
• 𝑌 causes 𝑋
• 𝑋 and 𝑌 are jointly caused
• another variable 𝑍 causes both 𝑋 and 𝑌

9
Regression Analysis

• Regression is probably the most important tool in econometrics.
  – Regression measures
    • the effect of an explanatory (independent) variable X upon a
      dependent variable Y, and
    • whether this relationship is statistically significant.
  – In a multivariate regression we allow for several explanatory variables
    X₁, X₂, …, Xₖ to impact the dependent variable Y.
  – The most common form of regression analysis is linear regression,
    where a straight line is fitted through the data.
  – In a regression analysis we must assume the direction of causation,
    usually from theory or common sense.
    • N.B. Regression analysis will still not provide proof of causation.

10
Steps in Regression Analysis

1. Formulate a statement of theory (hypothesis).
2. Specify an economic model between a dependent variable Y and an
   explanatory variable X.
   – For instance, specify a linear relationship between Y and X:
     Yᵢ = α + βXᵢ
3. Collect data on Y and X with the objective of obtaining sample estimates
   α̂, β̂ of the true population parameters α, β.
4. Specify an econometric model by adding an error term:
     Yᵢ = α + βXᵢ + uᵢ
   – uᵢ is an error term, also called the residual, which may represent
     ― systematic determinants of Y (other than X) that were left out
     ― random influences which cannot be modelled
     ― errors in measurement of Yᵢ
     ― we typically assume that E(uᵢ) = 0
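The econometric model in step 4 can be illustrated with a small simulation. The parameter values and the error distribution below are assumptions chosen purely for illustration, not estimates from data:

```python
# Simulate the econometric model Y_i = alpha + beta*X_i + u_i.
# alpha = 40, beta = -2.7 and the N(0, 4^2) errors are illustrative choices.
import random

random.seed(0)
alpha, beta = 40.0, -2.7
n = 1000

X = [random.uniform(0, 7) for _ in range(n)]
u = [random.gauss(0, 4) for _ in range(n)]       # error term with E[u] = 0
Y = [alpha + beta * x + e for x, e in zip(X, u)]

# In a large simulated sample the mean of u should be close to E[u] = 0.
print(round(sum(u) / n, 2))
```

Each simulated Y is the systematic part α + βX plus a random draw of the error term, which is exactly the decomposition the econometric model asserts.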

11
Steps in Regression Analysis

5. Estimation of the parameters of the model (𝛼 and 𝛽)

6. Diagnostic Checking – making sure that the model fits the data well

7. Test hypotheses (theories)

8. Prediction (forecasting)

12
The Regression Line
• The sample estimates α̂, β̂ define the position of the regression line.

  [Figure: scatterplot of birth rate (Y) against growth rate (X) with the
  fitted regression line Ŷᵢ = α̂ + β̂Xᵢ; the Y intercept is α̂ and the
  slope is β̂.]

13
Estimation of the Model Parameters
• How do we obtain parameters from a linear regression model?
  – There are several methods, such as least squares, maximum likelihood,
    and the method of moments.
  – In this unit we will use a method called Ordinary Least Squares (OLS).
  – OLS provides a solution for α and β which minimises the squared
    (vertical) distances from the data points to the fitted line.

  [Figure: scatterplot of birth rate (Y) against growth rate (X) with the
  fitted line; the vertical distance from each observation to the line is
  the error uᵢ.]

14
Estimation of the Model Parameters
• The points on the regression line are computed as
  – Ŷᵢ = α + βXᵢ
  – Therefore the errors are uᵢ = Yᵢ − α − βXᵢ = Yᵢ − Ŷᵢ
– Minimising the sum of squared errors may be written as

  min(α,β) Σ(Yᵢ − Ŷᵢ)² ⇔ min(α,β) Σ(Yᵢ − α − βXᵢ)² ⇔ min(α,β) Σuᵢ²

– The solutions (formulas) for α and β can be derived using calculus by
  differentiating the above sum with respect to α and β, setting the
  derivatives to zero and solving for α and β. These solutions are then
  called α̂ and β̂.

15
Regression Parameter Formulas
• OLS estimators
  – The solution to the OLS minimisation problem presented on the previous
    page is as follows:

    β̂ = s_XY / s_X² = Σ(Xᵢ − X̄)(Yᵢ − Ȳ) / Σ(Xᵢ − X̄)²

      = [ nΣXᵢYᵢ − ΣXᵢΣYᵢ ] / [ nΣXᵢ² − (ΣXᵢ)² ]

    α̂ = Ȳ − β̂X̄
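Applied to the birth rate/GNP growth data from the correlation example, the formulas above can be sketched in a few lines:

```python
# OLS slope and intercept via the computational formulas, using the
# birth rate (Y) / GNP growth (X) data from the earlier example.
Y = [30, 29, 30, 35, 36, 36, 34, 48, 24, 27, 21, 30]
X = [5.1, 3.2, 3, 1.4, 3.8, 1, 2.8, -0.3, 6.9, 2.5, 6.2, 4.6]
n = len(Y)

# beta_hat = (n*Sum(XY) - Sum(X)*Sum(Y)) / (n*Sum(X^2) - Sum(X)^2)
beta_hat = (n * sum(x * y for x, y in zip(X, Y)) - sum(X) * sum(Y)) / \
           (n * sum(x * x for x in X) - sum(X) ** 2)
# alpha_hat = Ybar - beta_hat * Xbar
alpha_hat = sum(Y) / n - beta_hat * sum(X) / n

print(round(beta_hat, 2), round(alpha_hat, 2))  # -2.7 40.71
```

This matches the estimates derived by hand on the next slide.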

16
Birth rate/ GNP Example
1. β̂ = (12×1,139.7 − 40.2×380) / (12×184.04 − 40.2²) = −2.70
2. α̂ = 380/12 − (−2.70)×(40.2/12) = 40.71

• Therefore the regression line is Ŷᵢ = 40.71 − 2.70Xᵢ

• Interpretation:
  – A unit increase in the GNP growth rate (e.g. from 3% to 4%) will lower
    the birth rate by 2.7 births per 1000 people on average (e.g. from
    32.61 to 29.91).
  – The intercept of 40.71 may be interpreted as the birth rate in a
    country with zero GNP growth, i.e. Xᵢ = 0.

17
The Difference Between uᵢ and ûᵢ
• Note that there is a difference between the true residuals uᵢ and the
  estimated residuals ûᵢ:
  • uᵢ = Yᵢ − α − βXᵢ
  • ûᵢ = Yᵢ − α̂ − β̂Xᵢ
• At times, when we set out a theory, we need to make assumptions about the
  distribution or other properties of the residuals.
  – Theoretical assumptions are made about the true residuals uᵢ.
  – Typically, we then test the estimated residuals ûᵢ to see if they
    behave in accordance with the theory.

18
Measuring Goodness of Fit
• Having calculated the regression line we now ask whether it
provides a good fit for the data.
– Do the observations tend to lie close to, or far away from the line?

– The coefficient of determination 𝑅 ( is a measure of goodness of fit.

• The total sum of squares (TSS) is the total variation of Y around its mean
  value: TSS = Σ(Yᵢ − Ȳ)²

• The explained sum of squares (ESS) is the part of the TSS explained by the
  linear regression: ESS = Σ(Ŷᵢ − Ȳ)²

• The residual (or unexplained) sum of squares (RSS) is the sum of squared
  errors between the actual values of Y and the regression line: RSS = Σûᵢ²

• Note that TSS = ESS + RSS, so that ESS/TSS = 1 − RSS/TSS

19
Goodness of Fit:
Explained and Unexplained components of Total error

• Source: Hill, R.C., et al., 2011, Principles of Econometrics, 4th ed., p. 136.

20
R² - Coefficient of Determination

• R², the coefficient of determination, measures how much of the total
  variation in Y is explained by the regression:

  R² = ESS/TSS = Σ(Ŷᵢ − Ȳ)² / Σ(Yᵢ − Ȳ)²

• Equivalently,

  R² = [ (1/n)Σ(Ŷᵢ − Ȳ)² ] / [ (1/n)Σ(Yᵢ − Ȳ)² ] = Var(Ŷ)/Var(Y) = ρ̂(Ŷ, Y)²

21
R² example
• Compute R² for the birth rate – GNP growth example
  – TSS = ESS + RSS

  – TSS = Σ(Yᵢ − Ȳ)² = ΣYᵢ² − nȲ² = 12,564 − 12×31.67² = 530.67

  – RSS = Σ(Yᵢ − Ŷᵢ)² = Σ(Yᵢ² − 2YᵢŶᵢ + Ŷᵢ²) = ΣYᵢ² − ΣYᵢŶᵢ
        = ΣYᵢ² − α̂ΣYᵢ − β̂ΣXᵢYᵢ = 12,564 − 40.71×380 − (−2.7)×1,139.7 = 170.75
    • since ΣYᵢŶᵢ = Σ(Ŷᵢ + ûᵢ)Ŷᵢ = ΣŶᵢ²

  – ESS = 530.67 − 170.75 = 359.92

  R² = ESS/TSS = 359.92/530.67 = 0.678

• Interpretation:
  – About 67.8% of the total variation in the birth rates from the mean value
    is explained by the variation in the GNP growth rates.
  – The correlation between the values of Y and the estimated regression line
    Ŷ is ρ̂(Ŷ, Y) = √R² = √0.678 = 0.823. Note: ρ̂(Ŷ, Y) is always positive,
    as can be seen from Cov(Ŷ, Y) = Var(Ŷ).
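The whole decomposition can be reproduced from the raw data. A minimal sketch (re-estimating α̂ and β̂ in full precision so that TSS = ESS + RSS holds exactly):

```python
# TSS, RSS, ESS and R^2 for the birth rate / GNP growth regression.
Y = [30, 29, 30, 35, 36, 36, 34, 48, 24, 27, 21, 30]
X = [5.1, 3.2, 3, 1.4, 3.8, 1, 2.8, -0.3, 6.9, 2.5, 6.2, 4.6]
n = len(Y)

# OLS estimates (same formulas as on the earlier slide).
beta_hat = (n * sum(x * y for x, y in zip(X, Y)) - sum(X) * sum(Y)) / \
           (n * sum(x * x for x in X) - sum(X) ** 2)
alpha_hat = sum(Y) / n - beta_hat * sum(X) / n

ybar = sum(Y) / n
yhat = [alpha_hat + beta_hat * x for x in X]

tss = sum((y - ybar) ** 2 for y in Y)               # total variation
rss = sum((y - yh) ** 2 for y, yh in zip(Y, yhat))  # unexplained part
ess = sum((yh - ybar) ** 2 for yh in yhat)          # explained part
r2 = ess / tss

print(round(tss, 2), round(rss, 2), round(r2, 3))  # 530.67 170.75 0.678
```

The decomposition TSS = ESS + RSS holds to floating-point precision, and R² agrees with the hand calculation above.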

22
Inference in Regression Analysis
• So far, we have used OLS as a way of finding the "best" values for the
  parameters α and β given a dataset.
  – For this we only needed calculus to differentiate the sum of squared
    residuals and find its minimum.
• In order to work out the standard errors and distributions of α̂ and β̂ we
  will need to make some assumptions about the true but unobservable
  residuals uᵢ.
  – Note: we don't need to make any assumptions regarding the sample
    residuals ûᵢ, since we can compute them.
• We need the standard errors and distributions so that we can say something
  about the level of confidence in, or statistical significance of, our
  estimates α̂ and β̂.

23
Classical Linear Regression Model
(CLRM) Assumptions

• CLRM conditions are summarized in the following set of assumptions that
  are made about the true residuals uᵢ:

  1. E(uᵢ) = 0 for all i
  2. Var(uᵢ) = σᵤ² for all i
  3. Cov(uᵢ, uⱼ) = 0 for i ≠ j
  4. X is fixed in repeated samples ⇒ E(uᵢXᵢ) = XᵢE(uᵢ) = 0 ⇒ Cov(uᵢ, Xᵢ) = 0
  5. uᵢ ~ N(0, σᵤ²)

24
Properties of the OLS estimators under the
CLRM assumptions
1. Under assumptions 1-4, the OLS estimators α̂ and β̂ are
   Best Linear Unbiased Estimators (BLUE)
   – Unbiased means E(α̂) = α, E(β̂) = β
   – Best refers to the precision with which we estimate α and β: among
     linear unbiased estimators OLS has the smallest variance, thus
     Var(α̂_OLS) ≤ Var(α̂_any other LUE), Var(β̂_OLS) ≤ Var(β̂_any other LUE)
   – Linear means that we can write α̂ = c₁Y₁ + c₂Y₂ + ⋯ + cₙYₙ and
     β̂ = c₁*Y₁ + c₂*Y₂ + ⋯ + cₙ*Yₙ for some constants cᵢ and cᵢ*
2. α̂ and β̂ are consistent, i.e. α̂ →ᵖ α, β̂ →ᵖ β
   – i.e. the estimates converge towards the true values in a very large
     sample.

25
OLS Standard Errors
• Any set of regression estimates (α̂ and β̂) is specific to the sample used
  in its estimation → they are random variables
  – We need to find their variances (standard errors) and their PDFs
• Given the 5 CLRM assumptions we can show that α̂ and β̂ are normally
  distributed
  – If we know the true σᵤ², the variance of the error term, then the
    sampling distributions are as follows:

    β̂ ~ N( β, σᵤ² / Σ(Xᵢ − X̄)² )

    α̂ ~ N( α, σᵤ²ΣXᵢ² / [nΣ(Xᵢ − X̄)²] )

26
Estimating the Error Variance σᵤ²
• Usually we don't know the true σᵤ² and need to estimate it from the data
  together with α and β.
  – If we observed the true uᵢ we could estimate σᵤ² from the sample as:

    σ̂ᵤ² = (1/n) Σ(uᵢ − ū)²

  – However, the problem is that we don't know uᵢ because we don't know the
    true α and β.
  – We have α̂ and β̂, so we can compute ûᵢ and use it to estimate σᵤ² as
    follows:
    • σ̂ᵤ² = (1/n) Σûᵢ² — this estimator is consistent for σᵤ²
    • If we want an estimator which is both unbiased and consistent we use

      σ̂ᵤ² = (1/(n−2)) Σûᵢ²
27
Standard Errors for α̂ and β̂
• Based on the assumption that the true errors uᵢ are normally distributed
  (CLRM assumption 5) and uncorrelated with each other, we can derive
  sample variances for α̂ and β̂:

  Var(β̂) = σ̂ᵤ² / Σ(Xᵢ − X̄)²

  Var(α̂) = σ̂ᵤ²ΣXᵢ² / [nΣ(Xᵢ − X̄)²]

28
The Distribution of α̂ and β̂
• Under the normality assumption about uᵢ it is possible to show that, when
  using σ̂ᵤ² instead of σᵤ²:

  t-statistic = (α̂ − α) / √Var(α̂) ~ t₍n−2₎

  t-statistic = (β̂ − β) / √Var(β̂) ~ t₍n−2₎

29
Birth Rate and GNP Growth Rate Example

• σ̂ᵤ² = Σûᵢ²/(n−2) = RSS/(n−2) = 170.75/10 = 17.075

• Var(β̂) = σ̂ᵤ² / Σ(Xᵢ − X̄)² = 17.075/49.37 = 0.346
  – SE(β̂) = √0.346 = 0.588
  – Since t₁₀ = 2.228 at the 5% level,
  – the 95% CI is −2.7 ± 2.228×0.588 = [−4.01, −1.39]

• Var(α̂) = σ̂ᵤ²ΣXᵢ² / [nΣ(Xᵢ − X̄)²] = 5.304 ⇒ SE(α̂) = 2.303
  – 95% CI = 40.71 ± 2.228×2.303 = [35.57, 45.84]
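These standard errors and confidence intervals can be recomputed from the raw data. A sketch, with the critical value t₁₀ = 2.228 taken from the slides rather than computed:

```python
# Standard errors and a 95% CI for the birth rate / GNP growth regression.
Y = [30, 29, 30, 35, 36, 36, 34, 48, 24, 27, 21, 30]
X = [5.1, 3.2, 3, 1.4, 3.8, 1, 2.8, -0.3, 6.9, 2.5, 6.2, 4.6]
n = len(Y)

beta_hat = (n * sum(x * y for x, y in zip(X, Y)) - sum(X) * sum(Y)) / \
           (n * sum(x * x for x in X) - sum(X) ** 2)
alpha_hat = sum(Y) / n - beta_hat * sum(X) / n

resid = [y - alpha_hat - beta_hat * x for x, y in zip(X, Y)]
sigma2_hat = sum(e * e for e in resid) / (n - 2)   # unbiased error variance
xbar = sum(X) / n
sxx = sum((x - xbar) ** 2 for x in X)              # Sum (X_i - Xbar)^2

se_beta = (sigma2_hat / sxx) ** 0.5
se_alpha = (sigma2_hat * sum(x * x for x in X) / (n * sxx)) ** 0.5

t_crit = 2.228                                     # t_10, 5% two-sided
ci_beta = (beta_hat - t_crit * se_beta, beta_hat + t_crit * se_beta)

print(round(sigma2_hat, 3), round(se_beta, 3), round(se_alpha, 3))  # 17.075 0.588 2.303
print([round(b, 2) for b in ci_beta])                               # [-4.01, -1.39]
```

The output matches the hand calculations above to rounding.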

30
Testing the significance of
individual coefficient estimates
• We can now also test hypotheses in the usual manner.
  – To test for a zero slope coefficient, i.e. X does not impact Y with
    statistical significance:

    H₀: β = 0   { X has no impact on Y }

    H₁: β ≠ 0   { X has some impact on Y }

    t = (−2.70 − 0)/0.588 = −4.59

  – Since |t| exceeds the critical value at the 5% level, t₁₀ = 2.228,
    H₀ is rejected.
  – Conclusion: there is evidence that X impacts Y at the 5% significance
    level.
    • In other words: X is an important variable in explaining the
      variations in Y.
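The test can be sketched numerically using the estimate and standard error from the previous slides:

```python
# t-test of H0: beta = 0 against H1: beta != 0, using the estimate and
# standard error computed earlier (t_10 critical value from the slides).
beta_hat, se_beta = -2.70, 0.588

t_stat = (beta_hat - 0) / se_beta   # hypothesised value 0 under H0
t_crit = 2.228                      # t_10, 5% two-sided critical value
reject = abs(t_stat) > t_crit

print(round(t_stat, 2), reject)     # -4.59 True
```

Since |−4.59| > 2.228, H₀ is rejected at the 5% level, as stated above.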

31
Example: Using Gretl
• Note that these are identical estimates of α̂ and β̂ to those we made
  earlier using the formulas.

32
Please watch the recording in iLearn
on introduction to Gretl.

33
