
Chapter 3 - Linear Regression

1. The document discusses linear regression models, which aim to model the relationship between a dependent variable and one or more independent variables. 2. It provides a brief history of regression analysis and its origins in modeling astronomical observations. Regression analysis allows estimating relationships based on economic theory but does not necessarily imply causation. 3. The key aspects of linear regression covered are the linear relationship between variables, the use of least squares to estimate unknown parameters, and the distinction between deterministic and stochastic relationships.

Uploaded by

Flora Fong

CHAPTER 3

Linear Regression Model
Learning Objectives
• By the end of this lecture students will:
– understand the simple linear regression model
– understand the logic behind the method of
ordinary least squares (OLS) estimation
• The following notation will be used:
– y is the dependent or explained variable.
– X1, X2, ..., Xk are the explanatory variables.

Evan Lau for EBQ2074 (CHAP 3)


History of regression
• The earliest form of regression was the method of least squares (French: méthode des
moindres carrés), which was published by Legendre in 1805 and by Gauss in 1809.

• Legendre and Gauss both applied the method to the problem of determining, from
astronomical observations, the orbits of bodies about the Sun. Gauss published a further
development of the theory of least squares in 1821, including a version of the Gauss-
Markov Theorem.

• The term "regression" was coined by Francis Galton to describe a biological
phenomenon: the heights of descendants of tall ancestors tend to regress down towards
a normal average (a phenomenon also known as regression toward the mean).

• His work was later extended by Udny Yule and Karl Pearson to a more general statistical
context. In the work of Yule and Pearson, the joint distribution of the response and
explanatory variables is assumed to be Gaussian.

• Since then, regression methods have remained an area of active research and are used in
numerous applied research areas.


The Nature
• Regression analysis is one of the most commonly used tools in
econometric work. It allows econometricians to make quantitative
estimates of economic relationships based on economic theory.
• It is concerned with the study of the dependence of one variable,
the dependent variable (e.g. salary), on one or more independent
or explanatory variables (e.g. education, sex, age, occupation),
expressed through numerical estimates.
• When there is only one independent or explanatory variable, we
have simple regression. In the more usual case of more than one
independent or explanatory variable we have multiple regression.



Aims of Regression
• To examine whether there exists a
significant relationship between any of the
x’s and y.
• To analyze the effects of changing one or
more of the x’s on y.
• To forecast the value of y for a given set of
x’s.



• The explained and explanatory variables are referred to by different
names in the literature.


Regression vs. causation
• Although regression analysis deals with the dependence of one
variable on other variables, it does not necessarily imply
causation. Regression analysis can only show that a statistical
relationship exists. We cannot infer that one variable causes
another. It must be noted that statistical relationship alone
cannot logically imply causation. Any inference about causal
relationship must be justified by a reasonable theoretical
relationship.
• Example: Statistical tests established that the more a
person smokes, the greater the probability of developing
lung cancer. However, the statistical analysis did not prove that smoking
causes lung cancer. It was the medical investigations that
verified that smoking causes lung cancer.

Regression vs. Correlation
• In regression analysis, our primary interest is to estimate or
to predict one variable based on other variables, rather than
just to measure the strength of linear association, as in
correlation analysis.
• We differentiate the dependent variable from the explanatory
variables in regression analysis, BUT there is no
distinction between the dependent and explanatory variables
in correlation analysis.
• In regression analysis, the dependent variable is assumed to be
random or stochastic, that is, it has a probability distribution,
while the explanatory variables are assumed to have fixed
values (nonstochastic). In correlation analysis, both variables
are assumed to be random.
• In correlation analysis, the two variables are treated
symmetrically: there is no distinction
between the dependent and the explanatory variable.
• For example, we may use correlation analysis to
find the correlation between education level and
salary for Bachelor of Economics graduates. After
all, the correlation between education level and
salary is the same as the correlation between salary and
education level.
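The symmetry of correlation, and the asymmetry of regression, can be checked numerically. The sketch below is only an illustration: the education and salary figures are hypothetical (they are not from the lecture), and the helper names `correlation` and `ols_slope` are introduced here for the example.

```python
from statistics import mean


def correlation(a, b):
    """Pearson correlation coefficient between two equal-length samples."""
    a_bar, b_bar = mean(a), mean(b)
    s_ab = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    s_aa = sum((ai - a_bar) ** 2 for ai in a)
    s_bb = sum((bi - b_bar) ** 2 for bi in b)
    return s_ab / (s_aa * s_bb) ** 0.5


def ols_slope(x, y):
    """Slope from regressing y (dependent) on x (explanatory)."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    return s_xy / s_xx


# Hypothetical data: years of education and monthly salary.
education = [12, 14, 16, 16, 18, 20]
salary = [2100, 2500, 3200, 3000, 3900, 4100]

# Correlation does not care which variable comes first.
r_es = correlation(education, salary)
r_se = correlation(salary, education)

# Regression does: swapping the roles gives a different slope.
slope_salary_on_education = ols_slope(education, salary)
slope_education_on_salary = ols_slope(salary, education)

print("corr(education, salary):", r_es)
print("corr(salary, education):", r_se)
print("slope, salary on education:", slope_salary_on_education)
print("slope, education on salary:", slope_education_on_salary)
```

The two correlations coincide, while the two slopes differ; their product equals the squared correlation, which is one way to see that the two regressions are related but not interchangeable.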



Specifying the Model
• Econometricians use regression analysis to analyze their
econometric models.
• For example: The demand function of a good is expressed as
$D_x = b_0 + b_1 P_x + b_2 Y + b_3 P_z$.
• This model represents an exact or deterministic relationship
between the explained variable (on the left-hand side of the
equation) and the explanatory variables. Such models are called
deterministic or mathematical models.
• We need to extend the deterministic model by including a random
term that measures the error of the deterministic component. This gives the
econometric model $D_x = b_0 + b_1 P_x + b_2 Y + b_3 P_z + \varepsilon$.



Example: Linear Regression Analysis
• To study the relationship between two variables,
consumption (Y) and disposable income (X), we plot the data in a
scatter diagram.


Figure 1: Consumption (Y) versus Disposable Income (X)
[Scatter diagram: consumption Y on the vertical axis (100 to 170) against disposable income X on the horizontal axis (110 to 200).]

• One of the most commonly used techniques for determining the best-fitting
line is the least squares method.
• The resulting line is known as the least squares line or
regression line, and the equation that represents this
(linear) regression line is called the linear regression model.

Linear refers to
(1) Linearity in the variables
• the dependent variable is a linear function of the independent
variables
$Y = \alpha + \beta X + \varepsilon$ (linear)
$Y = \alpha + \beta X^2 + \varepsilon$ (non-linear)
$Y = \alpha + \beta \frac{1}{X} + \varepsilon$ (non-linear)

(2) Linearity in the parameters
• the dependent variable is a linear function of the parameters
$Y = \beta_0 + \beta_1 X + \varepsilon$ (linear)
$Y = \beta_0 + \beta_1^2 X + \varepsilon$ (not linear)
• In linear regression analysis, the equation to be estimated must be
linear in the coefficients. A function Y = f(X) is said to be linear
in a coefficient, say $\beta_1$, if $\beta_1$ appears with a power of 1 only
(rather than a power of 2 or higher) and is not multiplied or divided by
any other coefficient. Separately, an equation may or may not be linear in the
variables.
• A function Y = f(X) is said to be linear in the variable X if X appears
with a power of 1 only
and is not multiplied or divided by any other variable (for
example, $X \times Z$ or $X / Z$, where Z is another variable).
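Because only linearity in the coefficients is required, a model such as Y = α + βX² can still be estimated by least squares once X² is treated as the regressor. The sketch below illustrates this under stated assumptions: the data are generated exactly from α = 3 and β = 2 with no error term, and the helper `ols` and the variable `z` are names introduced here for the example.

```python
from statistics import mean


def ols(x, y):
    """Closed-form least squares estimates (intercept, slope) of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical data generated from Y = 3 + 2 * X^2 (no error term).
x = [1, 2, 3, 4, 5]
y = [3 + 2 * xi ** 2 for xi in x]

# The model is non-linear in X but linear in the coefficients,
# so we regress Y on the transformed variable Z = X^2.
z = [xi ** 2 for xi in x]
alpha_hat, beta_hat = ols(z, y)

print("alpha_hat:", alpha_hat)
print("beta_hat:", beta_hat)
```

A model such as Y = β0 + β1²X cannot be handled this way, because no transformation of the variables makes it linear in β1.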

• Simple regression (one independent variable):
$Y = \beta_0 + \beta_1 X_1 + \varepsilon$
• Multiple regression (more than one independent variable):
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon$
- The simple regression model is a single-equation model
because only one equation is involved. It states that the
dependent variable Y is a function of X, the independent
variable.
- In multiple regression analysis, the $\beta$'s are the coefficients
to be estimated using regression analysis. The $\beta$'s are thus known as
regression coefficients.
- In Figure 1, the exact or deterministic general relationship
between aggregate consumption expenditures Y and aggregate
disposable income X can be written as $Y_i = \beta_0 + \beta_1 X_i$, where i
refers to each year in time series analysis (as with the data in
Table 1). $\beta_0$ and $\beta_1$ are unknown constants called
parameters. Parameter $\beta_0$ is the constant or Y intercept, while $\beta_1$
measures $\Delta Y / \Delta X$.


• $\beta_1$ is the slope coefficient, which indicates the amount that Y will
change when X changes by one unit. Mathematically, $\beta_1 = \Delta Y / \Delta X$.
• For a linear model, the slope is constant over the entire function.

Deterministic and Stochastic Relationships
• There are two types of relationships:
– deterministic or exact relationships, which give a unique value of y
for each given value of x;
– stochastic or statistical relationships,
which do not give unique values of y for
given values of x.
• Statistical relations are specified in
probabilistic terms.
• Regression uses statistical models.



Deterministic Relationship
• Take the following model:
y = 2500 + 100x
• The values for y can be exactly determined
for given values of x.


Stochastic Relationship
• Suppose the model is specified as:
y = 2500 + 100x + u
• Because y is now defined in probabilistic terms, it
cannot be exactly determined for given
values of x.
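The contrast between the two specifications can be seen by evaluating each model several times at the same x. This is a minimal sketch, assuming (purely for illustration) that u is drawn from a normal distribution with mean 0 and standard deviation 50; the lecture does not specify a distribution for u, and the function names are hypothetical.

```python
import random


def deterministic(x):
    """y = 2500 + 100x: one x always gives the same y."""
    return 2500 + 100 * x


def stochastic(x, rng, sigma=50.0):
    """y = 2500 + 100x + u, with u ~ Normal(0, sigma) as an assumption."""
    u = rng.gauss(0.0, sigma)
    return 2500 + 100 * x + u


rng = random.Random(0)  # fixed seed so the run is reproducible
x = 10

det_values = [deterministic(x) for _ in range(5)]
sto_values = [stochastic(x, rng) for _ in range(5)]

print("deterministic:", det_values)
print("stochastic:   ", [round(v, 1) for v in sto_values])
```

The deterministic model returns 3500 every time, while the stochastic model returns values scattered around 3500.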



Why add an error term?
The importance of Stochastic Error Term
• As pointed out earlier, the relationship between economic
variables is unlikely to be exact because of stochastic error.
Stochastic error is also known as random error, disturbance, or
simply error term.
• A (random) disturbance or error must be included in the exact
relationships postulated by economic theory and mathematical
economics in order to make them stochastic (in order to reflect the
fact that in the real world, economic relationships among economic
variables are inexact).



• To account for the inexact relationship between economic variables, an
error term must be included in our regression equation:
$Y = \beta_0 + \beta_1 X + \varepsilon$
where $\varepsilon$ stands for the error term.
• The equation above is made up of a deterministic component ($\beta_0 + \beta_1 X$)
and a stochastic component ($\varepsilon$). The deterministic component
determines the expected value of Y.
• This component tells us exactly by how much a change in X will be
reflected in the change in Y. However, in the real world, it is unlikely
that variation in Y is solely explained by variation in X. There is some
variation in Y that cannot be explained by the model.
• Econometricians admit the existence of such unexplained variation by
explicitly including a stochastic component in the model.


• The inclusion of a (random) disturbance or error term (with
well-defined probabilistic properties) is required in
regression analysis for several important reasons.
1. Since the purpose of theory is to generalize and simplify,
economic relationships usually include only the most
important forces at work. This means that numerous
other variables with slight and irregular effects are not
included. The error term can be viewed as representing
the net effect of this large number of small and irregular
forces at work.
2. The inclusion of the error term can be justified in order to
take into consideration the net effect of possible errors in
measuring the dependent variable, or variable being
explained.
3. Since human behavior usually differs in a random way
under identical circumstances, the disturbance or error
term can be used to capture this inherently random
human behavior.
4. The error term thus allows for individual random
deviations from the exact and deterministic
relationships postulated by economic theory and
mathematical economics.


The Model
– True Model (Population Regression Function)
$Y_i = \alpha + \beta X_i + u_i$
– $Y_i$ is the ith observation of the dependent variable
– $X_i$ is the ith observation of the independent or explanatory
variable
– $u_i$ is the ith observation of the error or disturbance term
– $\alpha$ and $\beta$ are unknown parameters (coefficients)
– i = 1, ..., n observations


• Estimated model (Sample Regression Function)
$Y_i = \hat{\alpha} + \hat{\beta} X_i + \hat{u}_i$
• where
– $\hat{\alpha}$ and $\hat{\beta}$ are the estimators of the coefficients
(parameters)
– $\hat{u}_i$ is the ith estimated residual
• Objective: estimate the PRF on the basis of the SRF


• In terms of the SRF, the observed $Y_i$ can be
written as $Y_i = \hat{Y}_i + \hat{u}_i$
• In terms of the PRF, $Y_i$ can be written as
$Y_i = E(Y_i \mid X_i) + u_i$
• Given that the SRF is an approximation to the
PRF, can we make this approximation as
close as possible?


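The Model
[Figure: Y plotted against X, showing the SRF $\hat{Y}_i = \hat{\alpha} + \hat{\beta} X_i$ and the PRF $E(Y \mid X_i) = \alpha + \beta X_i$. At a given $X_i$, the observed $Y_i$ lies a distance $u_i$ from $E(Y \mid X_i)$ on the PRF and a distance $\hat{u}_i$ from the fitted value $\hat{Y}_i$ on the SRF.]

Ordinary Least Squares (OLS)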
• OLS obtains the estimators $\hat{\alpha}$ and $\hat{\beta}$ by
minimising the sum of squared residuals
(RSS) with respect to $\hat{\alpha}$ and $\hat{\beta}$:
$RSS = \sum_{i=1}^{N} \hat{u}_i^2 = \sum_{i=1}^{N} (Y_i - \hat{\alpha} - \hat{\beta} X_i)^2$
• Partially differentiate RSS w.r.t. the
coefficients


Ordinary Least Squares (OLS)
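• The first order conditions are:
• (1) $\frac{\partial RSS}{\partial \hat{\alpha}} = -2 \sum_{i=1}^{N} (Y_i - \hat{\alpha} - \hat{\beta} X_i) = 0$
• (2) $\frac{\partial RSS}{\partial \hat{\beta}} = -2 \sum_{i=1}^{N} (Y_i - \hat{\alpha} - \hat{\beta} X_i)(X_i) = 0$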
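The algebra that leads from conditions (1) and (2) to the estimators is not spelled out in the slides. The following is a sketch of the intermediate steps, using only the symbols already defined ($\bar{X}$ and $\bar{Y}$ denote the sample means):

```latex
\begin{align*}
\text{From (1):}\quad
  & \sum_{i=1}^{N} Y_i - N\hat{\alpha} - \hat{\beta}\sum_{i=1}^{N} X_i = 0
    \;\Longrightarrow\; \hat{\alpha} = \bar{Y} - \hat{\beta}\bar{X} \\
\text{Substitute into (2):}\quad
  & \sum_{i=1}^{N} (Y_i - \bar{Y})X_i
    - \hat{\beta}\sum_{i=1}^{N} (X_i - \bar{X})X_i = 0 \\
\text{Since } \sum_{i=1}^{N}(Y_i - \bar{Y}) = 0
  \text{ and } \sum_{i=1}^{N}(X_i - \bar{X}) = 0:\quad
  & \sum_{i=1}^{N} (Y_i - \bar{Y})X_i = \sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y}), \\
  & \sum_{i=1}^{N} (X_i - \bar{X})X_i = \sum_{i=1}^{N} (X_i - \bar{X})^2 \\
\text{Hence:}\quad
  & \hat{\beta} = \frac{\sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})}
                       {\sum_{i=1}^{N} (X_i - \bar{X})^2}
\end{align*}
```

The last step requires that the $X_i$ are not all equal, so that the denominator is not zero.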



Ordinary Least Squares (OLS)
• Solving (1) and (2) simultaneously gives
$\hat{\beta} = \frac{\sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{N} (X_i - \bar{X})^2}$
• and $\hat{\alpha} = \bar{Y} - \hat{\beta} \bar{X}$
• where $\bar{Y}$ and $\bar{X}$ are the sample means
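These two formulas translate directly into code. The sketch below is illustrative only: the income and consumption figures are hypothetical (the slides' own data table is not reproduced here) and `ols_fit` is a name introduced for the example. It also checks that the resulting residuals satisfy the two first order conditions.

```python
from statistics import mean


def ols_fit(x, y):
    """Return (alpha_hat, beta_hat) from the closed-form OLS solution."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    beta_hat = s_xy / s_xx
    alpha_hat = y_bar - beta_hat * x_bar
    return alpha_hat, beta_hat


# Hypothetical sample: disposable income (X) and consumption (Y).
x = [110, 130, 150, 170, 190]
y = [102, 118, 135, 148, 166]

alpha_hat, beta_hat = ols_fit(x, y)
residuals = [yi - alpha_hat - beta_hat * xi for xi, yi in zip(x, y)]

# First order conditions: both sums should be zero up to rounding error.
foc_1 = sum(residuals)
foc_2 = sum(r * xi for r, xi in zip(residuals, x))

print("alpha_hat:", alpha_hat)
print("beta_hat:", beta_hat)
print("sum of residuals:", foc_1)
print("sum of residuals times X:", foc_2)
```

Condition (1) says the residuals sum to zero; condition (2) says they are uncorrelated with X in the sample.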
Supplementary of Chap3



Example: Weekly Food Expenditures
Y = dollars spent each week on food items.
X = consumer’s family weekly income.

The relationship between X and the expected value
of Y, given X, might be linear:
$E(Y \mid X_i) = f(X_i) = \beta_1 + \beta_2 X_i$
This means that each conditional mean $E(Y \mid X_i)$ is a
function of $X_i$. This equation is known as
the population regression function (PRF).

Stochastic Specification of PRF
Given any income level $X_i$, a family’s consumption is
clustered around the average of all families at that $X_i$, that
is, around its conditional expectation, $E(Y \mid X_i)$.
The deviation of any individual $Y_i$ is:

$u_i = Y_i - E(Y \mid X_i)$

or $Y_i = E(Y \mid X_i) + u_i$
or $Y_i = \beta_1 + \beta_2 X_i + u_i$
where $u_i$ is the stochastic error or stochastic disturbance.


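For example:
given X = 80 dollars, the individual consumption levels are
$Y_1 = 55 = \beta_1 + \beta_2 (80) + u_1$
$Y_2 = 60 = \beta_1 + \beta_2 (80) + u_2$
$Y_3 = 65 = \beta_1 + \beta_2 (80) + u_3$
$Y_4 = 70 = \beta_1 + \beta_2 (80) + u_4$
$Y_5 = 75 = \beta_1 + \beta_2 (80) + u_5$

Estimated average:
$\hat{Y}_1 = 65 = \hat{\beta}_1 + \hat{\beta}_2 (80)$
$\hat{Y}_2 = 65 = \hat{\beta}_1 + \hat{\beta}_2 (80)$
$\hat{Y}_3 = 65 = \hat{\beta}_1 + \hat{\beta}_2 (80)$
$\hat{Y}_4 = 65 = \hat{\beta}_1 + \hat{\beta}_2 (80)$
$\hat{Y}_5 = 65 = \hat{\beta}_1 + \hat{\beta}_2 (80)$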
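The arithmetic behind this example can be reproduced in a few lines. The sketch assumes that the five families listed are all the observations at the 80-dollar income level, so that their average plays the role of the conditional mean; the variable names are introduced here for illustration.

```python
from statistics import mean

# Consumption of the five families observed at income X = 80.
consumption_at_80 = [55, 60, 65, 70, 75]

# The average at this income level: every family gets the same fitted value.
conditional_mean = mean(consumption_at_80)

# Each family's deviation from that average.
deviations = [y - conditional_mean for y in consumption_at_80]

print("average consumption at X = 80:", conditional_mean)
print("deviations:", deviations)
print("sum of deviations:", sum(deviations))
```

The deviations are -10, -5, 0, 5 and 10, and they sum to zero, as deviations from a mean always do.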
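[Figure: Y plotted against x at $x_1, x_2, x_3, x_4$, showing the SRF $\hat{Y} = \hat{\beta}_1 + \hat{\beta}_2 x$ and the PRF $E(Y \mid x) = \beta_1 + \beta_2 x$. At $x_2$, the observed $Y_2$ lies a distance $u_2$ from $E(Y \mid x_2)$ on the PRF and a distance $\hat{u}_2$ from $\hat{Y}_2$ on the SRF.]
The relationship among $Y_i$, $u_i$ and the true regression line.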
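SRF:
$\hat{Y}_i = \hat{\beta}_1 + \hat{\beta}_2 X_i$
or $Y_i = \hat{\beta}_1 + \hat{\beta}_2 X_i + \hat{u}_i$, where $\hat{u}_i$ is the residual
or $Y_i = b_1 + b_2 X_i + e_i$

PRF:
$E(Y \mid X_i) = \beta_1 + \beta_2 X_i$
$Y_i = \beta_1 + \beta_2 X_i + u_i$, where $u_i$ is the error term or disturbance

$\hat{Y}_i$ = estimator of $Y_i$, that is, of $E(Y \mid X_i)$
$\hat{\beta}_i$ or $b_i$ = estimator of $\beta_i$

Ordinary Least Squares (OLS) Method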

$Y_i = \beta_1 + \beta_2 X_i + u_i$
$u_i = Y_i - \beta_1 - \beta_2 X_i$

Minimize the error sum of squared deviations:

$\sum_{i=1}^{n} u_i^2 = \sum_{i=1}^{n} (Y_i - \beta_1 - \beta_2 X_i)^2 = f(\beta_1, \beta_2)$


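Minimize w.r.t. $\beta_1$ and $\beta_2$:
$f(\beta_1, \beta_2) = \sum_{i=1}^{n} (Y_i - \beta_1 - \beta_2 X_i)^2 = f(\cdot)$

$\frac{\partial f(\cdot)}{\partial \beta_1} = -2 \sum (Y_i - \beta_1 - \beta_2 X_i)$

$\frac{\partial f(\cdot)}{\partial \beta_2} = -2 \sum X_i (Y_i - \beta_1 - \beta_2 X_i)$
Set each of these two derivatives equal to zero and
solve these two equations for the two unknowns: $\beta_1$ and $\beta_2$.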
$\frac{\partial f(\cdot)}{\partial \beta_1} = -2 \sum (Y_i - b_1 - b_2 X_i) = 0$

$\frac{\partial f(\cdot)}{\partial \beta_2} = -2 \sum X_i (Y_i - b_1 - b_2 X_i) = 0$
When these two derivatives are set to zero,
$\beta_1$ and $\beta_2$ become $b_1$ and $b_2$ because they no longer
represent just any value of $\beta_1$ and $\beta_2$ but the special
values that correspond to the minimum of $f(\cdot)$.
$-2 \sum (Y_i - b_1 - b_2 X_i) = 0$
$-2 \sum X_i (Y_i - b_1 - b_2 X_i) = 0$

$\sum Y_i - n b_1 - b_2 \sum X_i = 0$
$\sum X_i Y_i - b_1 \sum X_i - b_2 \sum X_i^2 = 0$

$n b_1 + b_2 \sum X_i = \sum Y_i$
$b_1 \sum X_i + b_2 \sum X_i^2 = \sum X_i Y_i$
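In matrix form:
$\begin{bmatrix} n & \sum X_i \\ \sum X_i & \sum X_i^2 \end{bmatrix} \begin{bmatrix} b_1 \\ b_2 \end{bmatrix} = \begin{bmatrix} \sum Y_i \\ \sum X_i Y_i \end{bmatrix}$
Solve for the two unknowns:

$b_2 = \frac{n \sum X_i Y_i - \sum X_i \sum Y_i}{n \sum X_i^2 - (\sum X_i)^2} = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2} = \frac{\sum xy}{\sum x^2}$
where $x = X_i - \bar{X}$ and $y = Y_i - \bar{Y}$ denote deviations from the sample means.

$b_1 = \bar{Y} - b_2 \bar{X}$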
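The two normal equations form a 2-by-2 linear system, which can be solved with Cramer's rule. The sketch below does that using raw sums and then confirms that the result agrees with the deviation-from-means formula. The data and the function name `solve_normal_equations` are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
from statistics import mean


def solve_normal_equations(x, y):
    """Solve n*b1 + b2*Sx = Sy and b1*Sx + b2*Sxx = Sxy by Cramer's rule."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_xx = sum(xi * xi for xi in x)
    det = n * sum_xx - sum_x ** 2
    if det == 0:
        raise ValueError("X has no variation, so the slope is not identified")
    b1 = (sum_y * sum_xx - sum_x * sum_xy) / det
    b2 = (n * sum_xy - sum_x * sum_y) / det
    return b1, b2


# Hypothetical sample of five observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b1, b2 = solve_normal_equations(x, y)

# Cross-check against the deviation form of the same estimators.
x_bar, y_bar = mean(x), mean(y)
b2_dev = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
b1_dev = y_bar - b2_dev * x_bar

print("b1, b2 from normal equations:", b1, b2)
print("b1, b2 from deviation form:  ", b1_dev, b2_dev)
```

For these numbers the sums are n = 5, ΣX = 15, ΣY = 20, ΣXY = 66 and ΣX² = 55, which give b2 = 30/50 = 0.6 and b1 = 110/50 = 2.2.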
[Figure: observations $Y_1, \dots, Y_4$ at $x_1, \dots, x_4$ with the fitted OLS line $\hat{Y} = b_1 + b_2 X$ and an alternative line $\hat{Y}^* = b_1^* + b_2^* X$, whose residuals are $\hat{u}_1^*, \dots, \hat{u}_4^*$.]

Why is the SRF the best one?
The sum of squared residuals from any other line $\hat{Y}^*$ is larger.
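This claim can be checked numerically by comparing the residual sum of squares of the OLS line with that of nearby alternative lines. The sketch is an illustration, not a proof: the data are hypothetical and only a small grid of alternative intercepts and slopes is tried.

```python
from statistics import mean


def rss(b1, b2, x, y):
    """Sum of squared residuals for the line y = b1 + b2 * x."""
    return sum((yi - b1 - b2 * xi) ** 2 for xi, yi in zip(x, y))


def ols(x, y):
    """Closed-form OLS intercept and slope."""
    x_bar, y_bar = mean(x), mean(y)
    b2 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - b2 * x_bar, b2


# Hypothetical sample of four observations.
x = [1, 2, 3, 4]
y = [2.0, 2.9, 4.2, 4.8]

b1, b2 = ols(x, y)
best = rss(b1, b2, x, y)

# Alternative lines: shift the intercept and/or slope away from OLS.
shifts = [(d1, d2)
          for d1 in (-0.5, 0.0, 0.5)
          for d2 in (-0.2, 0.0, 0.2)
          if (d1, d2) != (0.0, 0.0)]
alternative_rss = [rss(b1 + d1, b2 + d2, x, y) for d1, d2 in shifts]
all_worse = all(value > best for value in alternative_rss)

print("RSS of OLS line:", best)
print("smallest RSS among alternatives:", min(alternative_rss))
print("every alternative is worse:", all_worse)
```

Every shifted line has a larger sum of squared residuals than the OLS line, which is what defines the least squares line.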
Prediction
Estimated regression equation:
$\hat{y}_t = 4 + 1.5 x_t$

$x_t$ = years of experience
$\hat{y}_t$ = predicted wage rate

If $x_t = 2$ years, then $\hat{y}_t = 7.00$ dollars per hour.
If $x_t = 3$ years, then $\hat{y}_t = 8.50$ dollars per hour.
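Prediction amounts to plugging a value of x into the estimated equation. A minimal sketch, using the intercept 4 and slope 1.5 from the estimated equation above; the function name `predict_wage` is introduced here for illustration.

```python
def predict_wage(years_experience, intercept=4.0, slope=1.5):
    """Predicted hourly wage from the estimated regression equation."""
    return intercept + slope * years_experience


# Predictions for the two experience levels used in the example.
predictions = {years: predict_wage(years) for years in (2, 3)}

for years, wage in predictions.items():
    print(f"{years} years of experience: {wage:.2f} dollars per hour")
```

Each additional year of experience raises the predicted wage by the slope, 1.50 dollars per hour.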