
The Linear Regression Model
Felix Wisnu Handoyo
PR-EPS
Badan Riset dan Inovasi Nasional
Scatter Plot

[Figure: scatterplot of wage (vertical axis) against educ (horizontal axis)]

What is the graph?
It is a scatterplot of Wage against Education.
What is a scatterplot?
The scatterplot associates Wage with Education in a two-dimensional space.
Scatter Plot with Estimated Regression Function

[Figure: scatterplot of wage against educ with the fitted regression line]

• What are the mechanics of fitting a linear regression?
• Use the ordinary least squares (OLS) method.
• What is the criterion for obtaining an OLS estimate?
• For the data at hand, we would like to minimize the sum of squared prediction errors.
• Construct a probabilistic model to describe the properties of the fitted line and the corresponding predictions.
• We assume that Wage is linearly related to an exogenously determined variable Educ.
The Linear Regression
Wagei = β0 + ui
β̂0 is the average value of Wage.

Wagei = β0 + β1 female + ui
β̂1 is the difference in average wage for females in comparison to males.
The Linear Regression

The regression t-test for the significance of the coefficient on female is identical to the t-test for the difference in means between the female group and the male group: same estimated mean difference, same t-statistic, and same p-value.
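A minimal Stata sketch of this equivalence (assuming the 526-observation wage data with variables wage and female are already in memory):

    * OLS of wage on the female dummy: the t-test on the female coefficient
    regress wage female

    * Two-sample t-test of wage by group (equal variances, as OLS assumes)
    ttest wage, by(female)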
The Linear Regression

[Figure: left panel, scatterplot of wage against female; right panel, scatterplot of wage against female with the fitted values]
Dummy Trap
• The dummy variable trap refers to the problem that not all categories can be included in the regression; one category needs to be left out, which is called the base or reference category.
• For example, male and female cannot both be included in the regression because of perfect collinearity.
• In Stata, if we include both the male and female variables in the regression, one of them will be omitted.

              (1)          (2)         (3)
              reg_female   reg_male    reg_female_male
VARIABLES     wage         wage        wage
female        -2.512***
              (0.303)
male                       2.512***    2.512***
                           (0.303)     (0.303)
o.female                               -
Constant      7.099***     4.588***    4.588***
              (0.210)      (0.219)     (0.219)
Observations  526          526         526
R-squared     0.116        0.116       0.116
Standard errors in parentheses
*** p<0.01, ** p<0.05, * p<0.1
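A minimal Stata sketch of the trap, assuming the data contain the female dummy (male is generated here only for illustration):

    * Generate the complementary dummy
    generate male = 1 - female

    * female, male, and the constant are perfectly collinear;
    * Stata drops one of the dummies automatically (shown as o.female above)
    regress wage female male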
The Linear Regression
Wagei = β0 + β1 Educi

[Figure: scatterplot of wage against educ with the fitted values]

What is β0?
What is β1?
Which one is the intercept of the line and which one is the slope?
- Is there a straight line that can perfectly fit all the points?
- How do we compromise?
To allow for deviations from the straight line, we add an error term:
Wagei = β0 + β1 Educi + Ɛi
The Linear Regression
Are there any restrictions on the error term?
We assume that the error term associated with each observation is independent of the others and is identically distributed with a mean of zero and a constant variance:

E(Ɛi) = 0, Var(Ɛi) = σ², i = 1, 2, 3, …, n = 526

How do we predict Wagei given Educi?
For simplicity, denote Wagei as yi and Educi as xi.
The Linear Regression
What is the expectation of yi given xi = x∗, that is, E(yi | xi = x∗)?

First we substitute for yi according to our statistical assumption:

E(yi | xi = x∗) = E(β0 + β1 xi + Ɛi | xi = x∗)

Are β0, β1 and xi constant?
The Linear Regression
We can take β0 + β1 xi out of the expectation operator because they are deterministic:

E(β0 + β1 xi + Ɛi | xi = x∗) = β0 + β1 x∗ + E(Ɛi | xi = x∗)

The statistical assumption regarding Ɛi says that it is independent of xi, so the expectation of Ɛi does not depend on xi:

E(Ɛi | xi = x∗) = E(Ɛi) = 0

It follows that

E(yi | xi = x∗) = β0 + β1 x∗
The Linear Regression
• This is also called the regression function of y.
How do we estimate β0 and β1?
Find β0 and β1 such that they minimize the sum of squared residuals.
The Linear Regression

We want to find the values of β0 and β1 that minimize the sum of squared residuals.
What are residuals?
What are fitted values?
Given a set of estimates of β0 and β1, β0 = β̂0 and β1 = β̂1, what are the fitted values of y?
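For reference, minimizing the sum of squared residuals gives the standard closed-form OLS estimates (textbook OLS algebra, stated here for completeness):

\min_{\beta_0,\beta_1} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2,
\qquad
\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}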
The Linear Regression

The variable ŷi represents the fitted value of yi when the disturbance term is set to its expectation of zero.
Is the fitted value ŷi the same as the actual value of yi?
We define the difference between the actual value yi and the fitted value ŷi as the residual (or in-sample prediction error): ûi = yi − ŷi.
The Linear Regression
Wagei = β0 + β1 Educi + Ɛi
In generic notation: yi = β0 + β1 xi + Ɛi
Fitted value: ŷi = β̂0 + β̂1 xi
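A short Stata sketch of fitted values and residuals for the wage–education regression (the variable names wage_hat and uhat are illustrative):

    * OLS fit of wage on educ
    regress wage educ

    * Fitted values: wage_hat = b0_hat + b1_hat * educ
    predict wage_hat, xb

    * Residuals: uhat = wage - wage_hat
    predict uhat, residuals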
Interaction terms with indicator variable
• Interaction terms for the variables female and married can be constructed in two different ways (see the Stata sketch below).
1. Include female, married, and female ∗ married in the regression.
2. Create four categories: female*single, male*single, female*married, and male*married, and include three of them in the regression model (the fourth, omitted category serves as the base/reference category).
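A sketch of the two specifications in Stata, assuming the female and married dummies are in memory (the generated variable names mirror the tables below):

    * Way 1: female, married, and their interaction
    generate femaleXmarried = female * married
    regress wage female married femaleXmarried

    * Way 2: four mutually exclusive categories, include three of them
    generate female_single = female * (1 - married)
    generate female_married = female * married
    generate male_married = (1 - female) * married
    regress wage female_single female_married male_married    // base category: male_single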
Interaction terms with indicator variable

female male single married female_single male_single female_married male_married


0 1 0 1 0 0 0 1
0 1 0 1 0 0 0 1
0 1 0 1 0 0 0 1
0 1 0 1 0 0 0 1
0 1 0 1 0 0 0 1
0 1 0 1 0 0 0 1
0 1 0 1 0 0 0 1
0 1 1 0 0 1 0 0
0 1 0 1 0 0 0 1
0 1 1 0 0 1 0 0
Interaction terms with indicator variable

                 (1)           (2)           (3)
                 dummy_trap    malesingle    female_married_femalexmarried
VARIABLES        wage          wage          wage
female_single    -0.556        -0.556
                 (0.474)       (0.474)
o.male_single    -
female_married   -0.602        -0.602
                 (0.464)       (0.464)
male_married     2.815***      2.815***
                 (0.436)       (0.436)
female                                       -0.556
                                             (0.474)
married                                      2.815***
                                             (0.436)
femaleXmarried                               -2.861***
                                             (0.608)
Constant         5.168***      5.168***      5.168***
                 (0.361)       (0.361)       (0.361)
Observations     526           526           526
R-squared        0.181         0.181         0.181
Standard errors in parentheses
*** p<0.01, ** p<0.05, * p<0.1
Several related indicator variables
• A regression with several related indicator variables needs to have one reference/base category left out. Coefficients are interpreted with respect to the base category.
• Dummies for region: northcentral, south, west.
• If east is the base category: east = 1 − northcentral − south − west.
• wage = β0 + β1 Educ + β2 northcentral + β3 south + β4 west + u
• Alternatively, wage = β0 + β1 Educ + β2 northcentral + β3 south + β4 east + u, where the base category is west (see the Stata sketch below).
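A Stata sketch of the two base-category choices, assuming the dummies northcen, south, and west are in the data (east is generated from the other three):

    * east is implied by the other three region dummies
    generate east = 1 - northcen - south - west

    * Base category east: include northcen, south, west
    regress wage northcen south west

    * Base category west: include northcen, south, east
    regress wage northcen south east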
Several related indicator variables

              (1)          (2)
              east_base    west_base
VARIABLES     wage         wage
northcen      -0.659       -0.903*
              (0.465)      (0.504)
south         -0.983**     -1.226***
              (0.432)      (0.473)
west          0.244
              (0.515)
east                       -0.244
                           (0.515)
Constant      6.370***     6.613***
              (0.338)      (0.389)
Observations  526          526
R-squared     0.017        0.017
Standard errors in parentheses
*** p<0.01, ** p<0.05, * p<0.1

Variable    Mean     SD
Northcen    5.710    3.325
South       5.387    3.099
West        6.613    4.217
East        6.370    4.371
Several related indicator variables

              (1)        (2)        (3)         (4)
VARIABLES     model1     model1     model2      model2
northcen      -0.659     -0.903*    -0.664      -1.006**
              (0.465)    (0.504)    (0.426)     (0.462)
south         -0.983**   -1.226***  -0.598      -0.941**
              (0.432)    (0.473)    (0.397)     (0.434)
west          0.244                 0.342
              (0.515)               (0.473)
east                     -0.244                 -0.342
                         (0.515)                (0.473)
educ                                0.535***    0.535***
                                    (0.0534)    (0.0534)
Constant      6.370***   6.613***   -0.503      -0.160
              (0.338)    (0.389)    (0.753)     (0.765)
Observations  526        526        526         526
R-squared     0.017      0.017      0.176       0.176
Standard errors in parentheses
*** p<0.01, ** p<0.05, * p<0.1

Variable    Mean     SD
Northcen    5.710    3.325
South       5.387    3.099
West        6.613    4.217
East        6.370    4.371
Indicator variable in regression
• Regression model: wage = β0 + β1 educ + u
• This model has the same intercept β̂0 and the same slope β̂1 on education for females and males.
• Regression model: wage = β0 + β1 educ + δ0 female + u
• The coefficient δ0 is the effect of female on wage.
• Same slope for females and males = β̂1
• Intercept for males = β̂0
• Intercept for females = β̂0 + δ̂0
• (A Stata sketch of both models follows below.)
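A minimal Stata sketch of the two specifications (same slope, intercept shifted by the female dummy), assuming wage, educ, and female are in memory:

    * Common intercept and slope for both groups
    regress wage educ

    * Intercept shift: males' intercept is _b[_cons], females' is _b[_cons] + _b[female]
    regress wage educ female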
Indicator variable in regression

              (1)        (2)            (3)
                         female_dummy   male_dummy
VARIABLES     wage       wage           wage
educ          0.541***   0.506***       0.506***
              (0.0532)   (0.0504)       (0.0504)
female                   -2.273***
                         (0.279)
male                                    2.273***
                                        (0.279)
Constant      -0.905     0.623          -1.651**
              (0.685)    (0.673)        (0.652)
Observations  526        526            526
R-squared     0.165      0.259          0.259
Standard errors in parentheses
*** p<0.01, ** p<0.05, * p<0.1

Regression 2: Females have $2.27 lower wages than males.
Intercept for males = 0.62; intercept for females = 0.62 − 2.27 = −1.65.

Regression 3: Males have $2.27 higher wages than females. Same magnitude and significance but opposite sign.
Intercept for females = −1.65; intercept for males = −1.65 + 2.27 = 0.62.

The slope on education is 0.51: one additional year of education is associated with a $0.51 increase in wage.
Same slope for females and males.
Indicator variable in regression

[Figure: left panel, scatterplot of wage against educ with a single fitted line; right panel, scatterplot of wage against educ with separate fitted lines for females and males]

wage = β0 + β1 educ + u: same intercept (−0.91) and same slope (0.54) for females and males.
wage = β0 + β1 educ + δ0 female + u: same slope (0.51), different intercepts for females (−1.65) and males (0.62).
The line for females is 2.27 lower than the line for males.
Interaction terms with non-indicator variable
• Regression model:
• wage = β0 + β1 educ + δ0 female + δ1 female ∗ educ + u
• This model has different slopes on education and different intercepts for females and males (see the Stata sketch below).
• Slope for males = β̂1
• Slope for females = β̂1 + δ̂1
• Intercept for males = β̂0
• Intercept for females = β̂0 + δ̂0
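A Stata sketch of the interaction specification and the equivalent group-by-group estimation, assuming wage, educ, and female are in memory:

    * Interaction between the female dummy and education
    generate femaleXeduc = female * educ
    regress wage educ female femaleXeduc

    * Equivalent comparison: estimate the model separately for each group
    regress wage educ if female == 1
    regress wage educ if female == 0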
Model with interaction term versus two separate models

              (1)                (2)                (3)
              interaction_term   model for female   model for male
VARIABLES     wage               wage               wage
educ          0.539***           0.453***           0.539***
              (0.0642)           (0.0580)           (0.0774)
female        -1.199
              (1.325)
femaleXeduc   -0.0860
              (0.104)
Constant      0.200              -0.998             0.200
              (0.844)            (0.729)            (1.016)
Observations  526                252                274
R-squared     0.260              0.197              0.152
Standard errors in parentheses
*** p<0.01, ** p<0.05, * p<0.1

Slope for males = β̂1 = 0.54
Slope for females = β̂1 + δ̂1 = 0.54 − 0.09 = 0.45
Intercept for males = β̂0 = 0.20
Intercept for females = β̂0 + δ̂0 = 0.20 − 1.20 = −1.00

The model with female and female*educ has the same coefficients as the two separate models for females and males.
The coefficients on female and on the interaction term between female and education are not significant, so the intercepts and the slopes of the returns to education are not significantly different for females and males.
Model with interaction term versus two separate models

[Figure: scatterplot of wage against educ with separate fitted lines for females and males]

              (1)                (2)                (3)
              interaction_term   model for female   model for male
VARIABLES     wage               wage               wage
educ          0.539***           0.453***           0.539***
              (0.0642)           (0.0580)           (0.0774)
female        -1.199
              (1.325)
femaleXeduc   -0.0860
              (0.104)
Constant      0.200              -0.998             0.200
              (0.844)            (0.729)            (1.016)
Observations  526                252                274
R-squared     0.260              0.197              0.152
Standard errors in parentheses
*** p<0.01, ** p<0.05, * p<0.1

Different intercepts and different slopes.
F-test for differences across groups
• F-test of whether the returns to education, experience, and tenure are the same for males and females.
• Unrestricted regression model:
  wage = β0 + β1 educ + β2 exper + β3 tenure + δ0 female + δ1 female∗educ + δ2 female∗exper + δ3 female∗tenure + u
• H0: δ0 = 0 and δ1 = 0 and δ2 = 0 and δ3 = 0; Ha: δ0 ≠ 0 or δ1 ≠ 0 or δ2 ≠ 0 or δ3 ≠ 0; q = 4 (restrictions)
• Restricted regression model:
  wage = α0 + α1 educ + α2 exper + α3 tenure + e
• F critical value (4, 518) = 2.39 < F-stat, p-value < 0.05
• The coefficients on female, female*educ, female*exper, and female*tenure are jointly significant (a Stata sketch of the test follows below).
• Females have significantly different wages than males.
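A Stata sketch of the joint test, assuming femaleXeduc has been generated as above and the remaining interactions are generated here:

    * Unrestricted model with the female dummy and its interactions
    generate femaleXexper = female * exper
    generate femaleXtenure = female * tenure
    regress wage educ exper tenure female femaleXeduc femaleXexper femaleXtenure

    * Joint F-test of the four restrictions (q = 4)
    test female femaleXeduc femaleXexper femaleXtenure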
              (1)          (2)
VARIABLES     restricted   unrestricted
educ          0.599***     0.677***
              (0.0513)     (0.0636)
exper         0.0223*      0.0538***
              (0.0121)     (0.0170)
tenure        0.169***     0.159***
              (0.0216)     (0.0256)
female                     2.075
                           (1.404)
femaleXeduc                -0.214**
                           (0.0997)
femaleXexper               -0.0459**
                           (0.0230)
femaleXtenure              -0.102**
                           (0.0461)
Constant      -2.873***    -3.531***
              (0.729)      (0.948)
Observations  526          526
R-squared     0.306        0.386
Standard errors in parentheses
*** p<0.01, ** p<0.05, * p<0.1

• Using t-tests, the coefficient on female is not significant, but the coefficients on female*educ, female*exper, and female*tenure are individually significant.
• Using the F-test, these four coefficients are jointly significant.
• Different wages for females and males.
Chow test for differences across groups
• The Chow test is an F-test for significantly different coefficients in two models estimated on two different groups. Instead of one unrestricted model, two separate models are estimated, one for each group.
• Regression model for females:
  wage = β0 + β1 educ + β2 exper + β3 tenure + u
• Regression model for males:
  wage = α0 + α1 educ + α2 exper + α3 tenure + e
• H0: β0 = α0 and β1 = α1 and β2 = α2 and β3 = α3
• Ha: β0 ≠ α0 or β1 ≠ α1 or β2 ≠ α2 or β3 ≠ α3
• Restricted regression model with both groups in one model:
  wage = γ0 + γ1 educ + γ2 exper + γ3 tenure + v
Chow test
• After estimating the models:
• Obtain SSR_r (sum of squared residuals) for the restricted model.
• Calculate SSR_1 and SSR_2 for the models for females and males.
• Compute the F-statistic: F = [(SSR_r − (SSR_1 + SSR_2)) / (k + 1)] / [(SSR_1 + SSR_2) / (n − 2(k + 1))], where k + 1 = 4 and n − 2(k + 1) = 526 − 8 = 518.
• F critical value (4, 518) = 2.39 < F-stat, p-value < 0.05
• The coefficients on the intercept, educ, exper, and tenure are significantly different for females and males.
• The Chow test is equivalent to the F-test.
• For the F-test, SSR_ur = SSR_1 + SSR_2 (see the Stata sketch below).
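A Stata sketch of the Chow computation from the three sums of squared residuals (e(rss) after regress); the scalar names are illustrative, and 4 and 518 come from k + 1 = 4 and n − 2(k + 1) = 518:

    * Restricted (pooled) model
    regress wage educ exper tenure
    scalar ssr_r = e(rss)

    * Separate models for females and males
    regress wage educ exper tenure if female == 1
    scalar ssr_1 = e(rss)
    regress wage educ exper tenure if female == 0
    scalar ssr_2 = e(rss)

    * Chow F-statistic and the 5% critical value F(4, 518)
    scalar F_chow = ((ssr_r - (ssr_1 + ssr_2)) / 4) / ((ssr_1 + ssr_2) / 518)
    display F_chow
    display invFtail(4, 518, 0.05)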
Chow test

              (1)                    (2)                 (3)
              Restricted model       Model if female=1   Model if female=0
              with male and female
VARIABLES     wage                   wage                wage
educ          0.599***               0.463***            0.677***
              (0.0513)               (0.0594)            (0.0744)
exper         0.0223*                0.00785             0.0538***
              (0.0121)               (0.0119)            (0.0199)
tenure        0.169***               0.0568*             0.159***
              (0.0216)               (0.0296)            (0.0300)
Constant      -2.873***              -1.456*             -3.531***
              (0.729)                (0.801)             (1.110)
Observations  526                    252                 274
R-squared     0.306                  0.217               0.336
Standard errors in parentheses
*** p<0.01, ** p<0.05, * p<0.1

Chow test for significant differences in coefficients between females and males. The coefficients are jointly significantly different for females and males, so two separate models for females and males should be estimated.
Thank You
