CH - 02 - Simple Linear Regression - TQT
“Regression analysis is concerned with the study of the dependence of one variable, the dependent variable, on one or more other variables, the explanatory variables, with a view to estimating and/or predicting the (population) mean or average value of the former in terms of the known or fixed (in repeated sampling) values of the latter” [2, p. 18].
Terminology:
Dependent variable: also called the explained variable, response variable, predicted variable, or regressand.
Independent variable: also called the explanatory variable, control variable, predictor, or regressor.
Error term: also called the unobservables, disturbance, or white noise.
Simple linear regression: graphical presentation of the coefficients
(Cont)
A constant slope indicates that a one-unit change in X has the same effect
on Y regardless of X's initial value.
Simple linear regression (Cont): Y = β₀ + β₁X + u, where u is the error term, representing all other unobserved factors other than X that affect Y.
The goal of linear regression is to estimate the population mean of the
dependent variable on the basis of the known values of the independent
variable(s).
That is, to estimate E(Y|X), which is known as the conditional expectation function or
the population linear regression function.
The conditional mean: E(Y|X) = β₀ + β₁X
(Cont)
This assumption means that u, which represents other unobservable factors, does not
have a systematic effect on Y. This assumption is critical for causal analysis; it cannot
be tested statistically and has to be argued from economic theory.
Note that x₁, x₂, x₃, … here refer to values of xᵢ, and not to different variables.
(Cont)
E(u|x) = 0: this is a strong assumption required for ceteris paribus analysis;
it cannot be tested statistically and has to be argued from economic theory.
Examples of factors contained in u: ability (in a wage–education regression); land quality (in a crop yield–fertilizer regression).
[Figure: actual values of y scattered around the fitted sample linear regression function ŷ = β̂₀ + β̂₁x.]
The average income for all household heads with 12 years of education is E(Income|edu = 12) = 815 + 53×12 = 1,451.
Note: but it would be false to interpret this as saying that every household head with 12 years of education will earn 1,451.
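As a minimal sketch (the coefficients 815 and 53 are taken from the slide's fitted line), the conditional-mean prediction is a direct calculation — it returns the predicted average for the group, not any individual's income:

```python
# Conditional-mean prediction from the slide's fitted line:
# predicted average income = 815 + 53 * years of education.
def predicted_income(edu):
    """Predicted average income for household heads with `edu` years of education."""
    return 815 + 53 * edu

print(predicted_income(12))  # 1451: the group average, not every individual's income
```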
The sample linear regression function (SRF): ŷᵢ = β̂₀ + β̂₁xᵢ
ŷ is called "Y-hat" or "Y-cap";
β̂₀ is the estimator of β₀;
β̂₁ is the estimator of β₁;
ûᵢ is the estimator of uᵢ, which is also known as the residual: ûᵢ = yᵢ − ŷᵢ.
           Population                                   Sample
Model      yᵢ = β₀ + β₁xᵢ + uᵢ                          yᵢ = β̂₀ + β̂₁xᵢ + ûᵢ
           Population linear regression model (PRM)     Sample linear regression model (SRM)
Function   E(y|x) = β₀ + β₁x                            ŷ = β̂₀ + β̂₁x
           Population linear regression function (PRF)  Sample linear regression function (SRF)
What is OLS?
Answer: ordinary least squares (OLS) chooses β̂₀ and β̂₁ to minimize the sum of squared residuals:
min over (β̂₀, β̂₁) of Σᵢ₌₁ⁿ (yᵢ − β̂₀ − β̂₁xᵢ)²
Deriving the OLS estimators
β̂₁ = Σᵢ(xᵢ − x̄)(yᵢ − ȳ) / Σᵢ(xᵢ − x̄)²
β̂₀ = ȳ − β̂₁x̄
These sample functions are the OLS estimators for β₁ and β₀.
Regressions in Stata are simple: to run the regression of y on x,
just type: reg y x
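As an illustrative sketch of these formulas (using the 11-household income/food-consumption data that appears later in this chapter), the OLS estimates can also be computed "by hand":

```python
# OLS by hand: b1 = S_xy / S_xx, b0 = y_bar - b1 * x_bar.
# Data: the 11-household income / food-consumption example from this chapter.
xs = [3000, 8208, 3613, 4624, 4751, 5151, 5884, 2696, 2485, 8860, 1436]  # income
ys = [995, 2900, 1450, 1460, 510, 760, 1005, 100, 912, 570, 512]         # consumption

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)

b1 = s_xy / s_xx           # slope estimate
b0 = y_bar - b1 * x_bar    # intercept estimate
print(b0, b1)              # b0 roughly 356.07, b1 roughly 0.1431
```

These numbers match the fitted values tabulated in the "fitted values and residuals" example below.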
Estimators vs. estimates
An estimator, also referred to as a "sample" statistic, is a
rule, formula, or procedure that explains how to estimate
the population parameter from the sample data.
An estimate is a specific numerical value generated by the
estimator in an application.
Estimator: β̂₀ = ȳ − β̂₁x̄ and β̂₁ = Σ(xᵢ − x̄)(yᵢ − ȳ)/Σ(xᵢ − x̄)² (formulas)
Estimate: 815, 53 (specific numbers obtained from a sample)
How do we fit the regression line ŷᵢ = β̂₀ + β̂₁xᵢ to the data?
Answer: by choosing the line that minimizes the sum of squared residuals, ûᵢ = yᵢ − ŷᵢ.
[Figure: observations with yᵢ above the fitted line have positive residuals; observations with yᵢ below the fitted line have negative residuals.]
Difference between the residual and the error
The error uᵢ = yᵢ − (β₀ + β₁xᵢ) is unobservable, while the residual ûᵢ = yᵢ − ŷᵢ can be computed from the sample.
2.3. Interpretation
β̂₁: the slope, which indicates the average amount by which y changes when x increases by one unit, holding other factors in the model constant.
β̂₀: the constant or intercept, which shows the average value of the dependent variable when the independent variable is set to zero.
β̂₁ > 0: a positive association between x and y
β̂₁ < 0: a negative association between x and y
β̂₁ = 0: no association between x and y
Meat consumption and income in Lao Cai (Cont)
[Figure: scatter of monthly meat consumption per household (1,000 VND) against monthly household income per capita (1,000 VND), with the fitted regression line.]
The intercept = −856 when setting the independent variable (education) to zero (zero years of education).
But the data show that education has a smallest value of 6 (6 years of education).
Therefore, we cannot interpret the intercept because it is outside the range of the study data.
The intercept β̂₀ absorbs the bias for the regression model.
(Cont)
[Figure: regression lines under E(u) < 0, E(u) = 0, and E(u) > 0; a nonzero mean of u shifts the intercept (βₐ vs. β_b) while the slope is unaffected.]
2.4. Properties of the OLS estimator: fitted values and residuals
Fitted regression: ŷ = 356.0667 + 0.1431x; for example, ŷ₁ = 356.0667 + 0.1431×3000 = 785.42
Obs   xᵢ (income)   yᵢ (consumption)   ŷᵢ (predicted)   ûᵢ = yᵢ − ŷᵢ (residual)
1 3000 995 785.423017 209.577
2 8208 2900 1530.78546 1369.215
3 3613 1450 873.15481 576.8452
4 4624 1460 1017.84787 442.1521
5 4751 510 1036.02395 -526.024
6 5151 760 1093.27145 -333.271
7 5884 1005 1198.17749 -193.177
8 2696 100 741.914917 -641.915
9 2485 912 711.716861 200.2831
10 8860 570 1624.09889 -1054.1
11 1436 512 561.585292 -49.5853
Some algebraic properties of OLS
From the first-order conditions of OLS, we have some algebraic properties of OLS:
1. The sum and mean of the residuals always equal zero: Σᵢûᵢ = 0.
2. The average of the predicted values is equal to the average of the actual values: mean(ŷ) = ȳ.
3. The sample covariance between the regressor and the residuals is zero: Cov(x, û) = 0.
In the example data: Cov(xᵢ, ûᵢ) = 0.00000
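These three properties can be verified numerically. A minimal sketch on the 11-household example data (residuals sum to zero only up to floating-point error):

```python
# Verify the three algebraic properties of OLS on the 11-household example.
xs = [3000, 8208, 3613, 4624, 4751, 5151, 5884, 2696, 2485, 8860, 1436]
ys = [995, 2900, 1450, 1460, 510, 760, 1005, 100, 912, 570, 512]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum((x - x_bar) ** 2 for x in xs)
b0 = y_bar - b1 * x_bar

fitted = [b0 + b1 * x for x in xs]
resid = [y - f for y, f in zip(ys, fitted)]

sum_resid = sum(resid)                                             # property 1: ~ 0
mean_fitted = sum(fitted) / n                                      # property 2: equals y_bar
cov_x_resid = sum((x - x_bar) * u for x, u in zip(xs, resid)) / n  # property 3: ~ 0
print(sum_resid, mean_fitted - y_bar, cov_x_resid)
```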
Decomposition of total variation: TSS = ESS + RSS
ȳ   yᵢ   ŷᵢ   (yᵢ − ȳ)²   (ŷᵢ − ȳ)²   ûᵢ   ûᵢ²
1015.818182 995 785.4230167 433.3966942 53081.93213 209.576983 43922.51
1015.818182 2900 1530.785464 3550141.124 265191.3019 1369.214536 1874748
1015.818182 1450 873.1548101 188513.8512 20352.83762 576.845190 332750.4
1015.818182 1460 1017.847866 197297.4876 4.119617482 442.152134 195498.5
1015.818182 510 1036.023947 255852.0331 408.2729503 -526.023947 276701.2
1015.818182 760 1093.271447 65442.94215 5999.008273 -333.271447 111069.9
1015.818182 1005 1198.17749 117.0330579 33254.91739 -193.177490 37317.54
1015.818182 100 741.9149168 838722.9421 75022.99858 -641.914917 412054.8
1015.818182 912 711.7168607 10778.21488 92477.61353 200.283139 40113.34
1015.818182 570 1624.098889 198753.8512 370005.4186 -1054.098889 1111124
R-Squared=ESS/TSS 0.201825363
Goodness-of-fit measure (R-squared)
TSS = Σ(yᵢ − ȳ)²: the total sum of squares; ESS = Σ(ŷᵢ − ȳ)²: the explained sum of squares; RSS = Σûᵢ²: the residual sum of squares, where TSS = ESS + RSS.
R² = ESS/TSS = 1 − RSS/TSS
TSS represents the total variation in the dependent variable; ESS represents the variation explained by the regression; RSS represents the variation not explained by the regression.
N = 11; R² = 0.202.
The regression explains 20.2% of the total variation in meat consumption.
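A sketch of the decomposition for the same 11-household data, confirming the R² reported in the table above:

```python
# R-squared via the TSS = ESS + RSS decomposition, 11-household example.
xs = [3000, 8208, 3613, 4624, 4751, 5151, 5884, 2696, 2485, 8860, 1436]
ys = [995, 2900, 1450, 1460, 510, 760, 1005, 100, 912, 570, 512]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum((x - x_bar) ** 2 for x in xs)
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * x for x in xs]

tss = sum((y - y_bar) ** 2 for y in ys)              # total variation
ess = sum((f - y_bar) ** 2 for f in fitted)          # explained variation
rss = sum((y - f) ** 2 for y, f in zip(ys, fitted))  # unexplained variation

r2 = ess / tss    # identical to 1 - rss/tss
print(round(r2, 4))
```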
Change in the measurement unit of the independent variable X:
If X is divided by a constant c, the intercept β̂₀ is unchanged and the slope coefficient becomes c·β̂₁.
%Δwage ≈ (100·β̂₁)·Δeduc if β̂₁ is very small.
We multiply β̂₁ by 100 to obtain the percentage change in wage given one extra year of education.
Important notes:
Exact percentage change = exp(β̂₁×1) − 1 = exp(0.0436) − 1 = 0.04456445 ≈ 4.46%
R² = 0.16: shows that educ explains about 16% of the variation in log(wage) (not wage)
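The approximate-versus-exact calculation above can be reproduced directly (b1 = 0.0436 is the slope implied by the slide's exact figure):

```python
import math

# Approximate vs. exact percentage change in wage per extra year of
# education in a log-level model, log(wage) = b0 + b1*educ.
b1 = 0.0436
approx_pct = 100 * b1                  # approximation: 100 * b1
exact_pct = 100 * (math.exp(b1) - 1)   # exact percentage change
print(round(approx_pct, 2), round(exact_pct, 2))
```

The two agree closely only because b1 is small; for large coefficients the exact formula should be used.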
Functional form
(Cont)
2.6. Standard assumptions for the simple linear regression model
This implies that a one-unit change in X has the same effect on Y regardless of X's initial value.
E(Y|X) is a linear function of X, and the regression curve is a straight line.
From now on, the term "linear regression" means linearity in the parameters (β).
Both regression models are linear in parameters.
(Cont)
English = β₀ + β₁·income + u        Math = β₀ + β₁·income² + u
Stata command: curvefit english income, function(1)    Stata command: curvefit math income, function(4)
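To illustrate linearity in parameters, here is a sketch with toy data (not from the slides): a model that is quadratic in x, y = β₀ + β₁x² + u, is still linear in the parameters, so ordinary OLS applies after transforming the regressor to z = x².

```python
# Toy data generated with b0 = 2, b1 = 3 and no noise: y = 2 + 3*x^2.
xs = [1, 2, 3, 4, 5]
ys = [2 + 3 * x ** 2 for x in xs]

zs = [x ** 2 for x in xs]        # transformed regressor z = x^2
n = len(zs)
z_bar, y_bar = sum(zs) / n, sum(ys) / n
b1 = sum((z - z_bar) * (y - y_bar) for z, y in zip(zs, ys)) / sum((z - z_bar) ** 2 for z in zs)
b0 = y_bar - b1 * z_bar
print(b0, b1)  # prints 2.0 3.0 — OLS recovers the true parameters
```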
(Cont)
(Cont)
Assumption SLR.4 (Zero conditional mean):
(Cont)
We have already discussed this crucial assumption:
given any value of the explanatory variable, the expected value of the error
term is zero.
In other words: the value of the independent variable (X) must contain no
information about the mean of the unobservables (u).
Questions:
Does Cov(x, u) = 0 imply that E(u|x) = 0?
Cov(x, u) = 0 or Corr(x, u) = 0 does not imply that E(u|x) = 0 (although E(u|x) = 0 does imply Cov(x, u) = 0).
Assumption SLR.5 (Homoskedasticity): Var(u|x) = σ²
(Cont)
Heteroskedasticity: wage variation around the mean tends to increase with education levels.
Wage = β₀ + β₁·edu
Stata commands:
reg wage edu
hettest

Homoskedasticity: variation in log wage around the mean is relatively constant across all education levels.
Log_Wage = β₀ + β₁·edu
Stata commands:
gen Log_Wage = ln(wage)
reg Log_Wage edu
hettest
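Stata's hettest implements the Breusch-Pagan/Cook-Weisberg test. As a rough sketch of the idea behind it (not a reproduction of Stata's output): regress the squared OLS residuals on x and form the LM statistic n·R² of that auxiliary regression, to be compared with a chi-squared(1) critical value. Illustrated here on the chapter's 11-household data:

```python
# Breusch-Pagan-style check: auxiliary regression of u^2 on x, LM = n * R^2.
xs = [3000, 8208, 3613, 4624, 4751, 5151, 5884, 2696, 2485, 8860, 1436]
ys = [995, 2900, 1450, 1460, 510, 760, 1005, 100, 912, 570, 512]

def ols(x, y):
    """Simple-regression OLS: returns (b0, b1, residuals, R^2)."""
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    b1 = sum((a - xb) * (b - yb) for a, b in zip(x, y)) / sum((a - xb) ** 2 for a in x)
    b0 = yb - b1 * xb
    res = [b - (b0 + b1 * a) for a, b in zip(x, y)]
    r2 = 1 - sum(e ** 2 for e in res) / sum((b - yb) ** 2 for b in y)
    return b0, b1, res, r2

# Main regression, then the auxiliary regression of squared residuals on x.
_, _, resid, _ = ols(xs, ys)
u2 = [e ** 2 for e in resid]
_, _, _, r2_aux = ols(xs, u2)
lm = len(xs) * r2_aux   # compare with the chi-squared(1) critical value
print(round(lm, 3))
```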
Gauss-Markov assumptions of the Simple Linear
Regression (SLR)
Under SLR.1–SLR.4, β̂ⱼ is unbiased: E(β̂ⱼ) = βⱼ.
Under SLR.5 (in addition to SLR.1–SLR.4), the OLS estimator β̂ⱼ has the smallest variance among all linear unbiased estimators.
Under SLR.1–SLR.5, the OLS estimator β̂ⱼ for βⱼ is the best linear unbiased estimator (BLUE).
What happens if the assumptions SLR.1–SLR.4 are satisfied but the assumption SLR.5 is not?
2.7. Mean and variances of the OLS estimator
Interpretation of unbiasedness
Under SLR.1–SLR.4, β̂ⱼ is unbiased: E(β̂ⱼ) = βⱼ.
Unbiasedness does not imply that, in a given sample, our estimated parameters would
equal the exact true values of the population parameters.
In a given sample, estimates may be larger (β̂₀ > β₀; β̂₁ > β₁) or smaller (β̂₀ < β₀; β̂₁ < β₁) than
the true values.
Instead, unbiasedness should be interpreted as: if we drew repeated random samples and computed the estimates in each sample, the estimates would average out to the true parameter values.
Estimating the variance of the error term: σ̂² = Σûᵢ² / (n − 2)
σ̂ measures the average distance between the observed values and the regression line (the fitted values).
Note: under assumptions SLR.1–SLR.5, E(σ̂²) = σ², i.e. σ̂² is an unbiased estimator of the error variance.
(Cont)
σ̂ = √(Σûᵢ²/(n − 2)) = 702.19972 in the 11-household example.
Note:
The standard error of the regression (σ̂) has the same unit as the
dependent variable.
σ̂ also has other names: the standard error of the estimate and the root
mean squared error (Root MSE).
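A sketch of this calculation on the 11-household example, reproducing the value quoted above:

```python
import math

# Standard error of the regression (Root MSE): sigma_hat = sqrt(RSS / (n - 2)).
xs = [3000, 8208, 3613, 4624, 4751, 5151, 5884, 2696, 2485, 8860, 1436]
ys = [995, 2900, 1450, 1460, 510, 760, 1005, 100, 912, 570, 512]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum((x - x_bar) ** 2 for x in xs)
b0 = y_bar - b1 * x_bar

rss = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
sigma2_hat = rss / (n - 2)       # unbiased estimate of the error variance
ser = math.sqrt(sigma2_hat)      # same unit as the dependent variable
print(round(ser, 5))             # about 702.2
```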
Variances and standard errors for regression coefficients
Var(β̂₁) = σ² / Σᵢ(xᵢ − x̄)²
Var(β̂₀) = σ² · (Σᵢxᵢ²/n) / Σᵢ(xᵢ − x̄)²
Replacing σ² with its estimate σ̂² = Σûᵢ²/(n − 2) and taking square roots gives the standard errors:
se(β̂₁) = σ̂ / √(Σᵢ(xᵢ − x̄)²)
se(β̂₀) = σ̂ · √((Σᵢxᵢ²/n) / Σᵢ(xᵢ − x̄)²)
Note:
Standard errors are the estimated standard deviations of the regression coefficients.
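A sketch applying these formulas to the 11-household example (the resulting values are computed here, not quoted from the slides):

```python
import math

# Standard errors of the OLS coefficients:
# se(b1) = sigma_hat / sqrt(S_xx)
# se(b0) = sigma_hat * sqrt((sum(x^2)/n) / S_xx)
xs = [3000, 8208, 3613, 4624, 4751, 5151, 5884, 2696, 2485, 8860, 1436]
ys = [995, 2900, 1450, 1460, 510, 760, 1005, 100, 912, 570, 512]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
s_xx = sum((x - x_bar) ** 2 for x in xs)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / s_xx
b0 = y_bar - b1 * x_bar

rss = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
sigma_hat = math.sqrt(rss / (n - 2))   # standard error of the regression

se_b1 = sigma_hat / math.sqrt(s_xx)
se_b0 = sigma_hat * math.sqrt(sum(x ** 2 for x in xs) / n / s_xx)
print(round(se_b0, 3), round(se_b1, 5))
```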
Properties of the mean and variances
[Figure: sampling distribution of the OLS estimator β̂ centered on the true value β, illustrating E(β̂) = β.]
Exercise:
1. Is the residual the error term? Explain
2. Why do the sum and mean of the residuals always equal zero?
3. What happens to the sum and mean of the residuals if we exclude the
intercept from the OLS model?
4. What happens to the OLS estimator if the sample is not randomly selected
from a population?
5. What happens to a simple linear regression model if the value of the
explanatory variable is similar for all observations?
6. Suppose our model satisfies SLR assumptions 1–4 but suffers from
heteroskedasticity. In this case, are our estimates biased? What is the
consequence of the heteroskedasticity?
7. Comment on the statement that a model with a high R-squared shows a
strongly causal relationship.
8. Which model violates the assumptions of OLS?
(1) …
(2) …
(3) …
Exercises
9. Let Qd denote the quantity of a given product, and let P denote the price of that product. A simple
model is presented that connects quantity demanded to price: 𝑄𝑑 = 𝛽 + 𝛽 𝑃 + 𝑢
(i) What possible factors are contained in 𝑢? Is it likely that these will be related to price?
(ii) Will a simple regression analysis show the ceteris paribus effect of price on quantity demanded? Explain.
10. The following table contains monthly meat consumption per household (thousand VND) and monthly household
income per capita (thousand VND) for 20 households.
meat income list meat income
1 1390 5031 11 1770 4365
2 1320 6491 12 1620 4727
3 2900 4900 13 1460 5067
4 790 3267 14 650 5094
5 1600 5164 15 995 3000
6 2400 3260 16 2900 8208
7 1310 4847 17 1450 3613
8 1690 8395 18 1460 4624
9 1880 6625 19 510 4751
10 1205 2394 20 760 5151
(i) Estimate the relationship between the dependent variable (meat consumption) and the independent variable (household income
per capita) using an OLS regression model. Comment on the link between the two variables. What is the meaning of the intercept and
slope coefficients?
(ii) How much higher is the level of meat consumption predicted to be if the monthly income per capita is increased by 200 thousand
VND?
(iii) Is it true to say that, given a one-million-VND increase in household income per capita, the value of meat consumption
increases by the same amount for all households?
(iv) Calculate the fitted values of the dependent variable and the residuals. Do the sum and mean of the residuals equal zero? What is
the average of the fitted values and of the observed values of the dependent variable?
(v) Please interpret the R-squared. How much of the variation in meat consumption is unexplained by the regression?
(vi) Calculate the standard errors of the regression coefficients and the standard error of the regression. What is the unit of measurement of
the standard error of the regression?
11. Using a simple linear regression model, a researcher investigates the dependence of the monthly wage (in thousand VND) on
the number of years of education among wage workers in Hanoi in 2018.
(i) What is the average predicted wage when education equals zero?
(ii) How much does the monthly wage increase if the number of years of education increases from 12 to 16 years?
(iii) Does this model infer a causal relationship between wage and education?
(iv) What percentage of the variance in wages is explained by education?
12. A sample of 11 households with their income and food consumption is given in the table.
Income (thousand VND/person/month)    Food consumption (thousand VND/person/month)
3000 995
8208 2900
3613 1450
4624 1460
4751 510
5151 760
5884 1005
2696 100
2485 912
8860 570
1436 512
Using the OLS estimator, estimate the relationship between the dependent variable (food consumption) and the explanatory variable
(income):
foodᵢ = β̂₀ + β̂₁·incomeᵢ + ûᵢ
(i) Using the regression result, please report the marginal propensity to consume food (MPCF).
(iii) What is the MPCF if the regression model excludes the intercept: foodᵢ = β̂₁·incomeᵢ + ûᵢ? (Note: please use "constant
is zero" in Excel.)
(iv) Using the result from the model without intercept, calculate the fitted values of the dependent variable and the residuals. Do
the sum and mean of the residuals equal zero? What is the average of the fitted values and the observed values of the dependent
variables?
(v) Does the exclusion of the intercept from the model cause bias? Explain.