2 Simple Regression Model
Ani Katchova
Terminology
Regression model: $y = \beta_0 + \beta_1 x + u$
y is the dependent variable, x is the independent variable (one independent variable for a simple regression), u is the error term, and $\beta_0$ and $\beta_1$ are parameters.
Simple regression model example

| Dependent variable y ($) | Indep. variable x (experience) | Predicted value $\hat{y} = 20 + 0.5x$ | Residual $\hat{u} = y - \hat{y}$ |
|---|---|---|---|
| 21 | 2 | 20 + 0.5(2) = 21 | 21 − 21 = 0 |
| 21 | 1 | 20 + 0.5(1) = 20.5 | 21 − 20.5 = 0.5 |

[Figure: "Simple regression: actual and predicted values" — scatter of the data points (y in $, x in years of experience) with the fitted line.]
Simple regression: actual values, predicted values, and residuals
The regression line fits as well as possible through the data points.
Interpretation of coefficients
$$\hat{\beta}_1 = \frac{\Delta y}{\Delta x} = \frac{\text{change in } y}{\text{change in } x}$$
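For example, using the fitted line $\hat{y} = 20 + 0.5x$ from the earlier example slide:

```latex
% If x increases by 2 units, the predicted y increases by:
\[
\Delta \hat{y} = \hat{\beta}_1 \, \Delta x = 0.5 \times 2 = 1
\]
```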
Population regression function
Population regression function:
$$E(y \mid x) = E(\beta_0 + \beta_1 x + u \mid x) = \beta_0 + \beta_1 x + E(u \mid x) = \beta_0 + \beta_1 x$$
if $E(u \mid x) = 0$ (this assumption is called zero conditional mean).
For the population, the average value of the dependent variable can be expressed as a linear function of the independent variable.
Population regression function
• The population regression function shows the relationship between y and x for the population.
Population regression function
• For individuals with a particular x, the average value of y is $E(y \mid x) = \beta_0 + \beta_1 x$.
• Note that $x_1, x_2, x_3$ here refer to values of $x_i$, not to different variables.
Derivation of the OLS estimates
• For a regression model: $y = \beta_0 + \beta_1 x + u$
• We need to estimate the regression equation $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$ and find the coefficients $\hat{\beta}_0$ and $\hat{\beta}_1$ by looking at the residuals
• $\hat{u} = y - \hat{y} = y - \hat{\beta}_0 - \hat{\beta}_1 x$
• Obtain a random sample of data with n observations $(x_i, y_i)$, where $i = 1, \dots, n$ indexes the observations
• The goal is to obtain as good a fit as possible of the estimated regression equation
Derivation of the OLS estimates
• Minimize the sum of squared residuals:
$$\min_{\hat{\beta}_0, \hat{\beta}_1} \sum_{i=1}^{n} \hat{u}_i^2 = \sum_{i=1}^{n} \left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\right)^2$$
We obtain the OLS coefficients:
$$\hat{\beta}_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \frac{cov(x, y)}{var(x)} \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$
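A minimal sketch of these formulas in Python (the data below is illustrative, not from the slides; numpy assumed available):

```python
import numpy as np

# Illustrative sample data (any (x, y) pairs work)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([20.5, 21.0, 21.6, 21.9, 22.6])

# OLS slope: sum of cross-deviations over sum of squared deviations of x
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

# OLS intercept: the regression line passes through the point of means
beta0_hat = y.mean() - beta1_hat * x.mean()

print(f"beta0_hat = {beta0_hat:.3f}, beta1_hat = {beta1_hat:.3f}")
```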
OLS properties
$$\bar{y} = \hat{\beta}_0 + \hat{\beta}_1 \bar{x}$$
The sample averages of the dependent and independent variables are on the regression line.
$$\sum_{i=1}^{n} \hat{u}_i = 0$$
The residuals sum up to zero (note that we minimized the sum of squared residuals).
$$\sum_{i=1}^{n} x_i \hat{u}_i = 0$$
The covariance between the independent variable and the residuals is zero.
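A quick numerical check of all three properties, using the same illustrative data as the sketch above:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([20.5, 21.0, 21.6, 21.9, 22.6])

beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0_hat = y.mean() - beta1_hat * x.mean()
u_hat = y - (beta0_hat + beta1_hat * x)  # residuals

# Property 1: the point of means lies on the regression line
assert np.isclose(y.mean(), beta0_hat + beta1_hat * x.mean())
# Property 2: the residuals sum to zero
assert np.isclose(u_hat.sum(), 0)
# Property 3: zero sample covariance between x and the residuals
assert np.isclose((x * u_hat).sum(), 0)
```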
Simple regression example: CEO's salary
Simple regression model explaining how return on equity (roe) affects CEO's salary.
Regression model
$$salary = \beta_0 + \beta_1 \, roe + u$$
Residuals
$$\hat{u} = salary - \widehat{salary}$$
We estimate the regression model to find the coefficients.
$\hat{\beta}_1$ measures the change in the CEO's salary associated with a one-unit increase in roe, holding other factors fixed.
Estimated equation and interpretation
• Estimated equation:
$$\widehat{salary} = \hat{\beta}_0 + \hat{\beta}_1 \, roe = 963.191 + 18.501 \, roe$$
• Salary is measured in thousand dollars; ROE (return on equity) is measured in %.
• $\hat{\beta}_1$ measures the change in the CEO's salary associated with a one-unit increase in roe, holding other factors fixed.
• Interpretation of $\hat{\beta}_1$: the CEO's salary increases by $18,501 for each 1% increase in ROE.
• Interpretation of $\hat{\beta}_0$: if the ROE is zero, the CEO's salary is $963,191.
Stata output for simple regression
. regress salary roe

| Variable | Coefficient (std. err.) |
|---|---|
| roe | 18.50* (11.12) |
| Constant | 963.2*** (213.2) |
| Observations | 209 |
| R-squared | 0.013 |

Standard errors in parentheses; stars denote significance levels.
Regression line for the sample vs. population regression function for the population
Estimated regression
[Figure: two panels plotting 1990 salary (0–15000, thousands $) against return on equity, 88-90 avg (0–60), showing the fitted regression line with true values, predicted values, and residuals marked.]
Actual, predicted values, and residuals

| roe | salary | predicted $\widehat{salary} = 963.191 + 18.501\, roe$ | residual $\hat{u} = salary - \widehat{salary}$ |
|---|---|---|---|
| 14.1 | 1095 | 1224 | -129 |
| 10.9 | 1001 | 1165 | -164 |
| 23.5 | 1122 | 1398 | -276 |
| 5.9 | 578 | 1072 | -494 |
| 13.8 | 1368 | 1219 | 149 |
| 20 | 1145 | 1333 | -188 |
| 16.4 | 1078 | 1267 | -189 |
| 16.3 | 1094 | 1265 | -171 |
| 10.5 | 1237 | 1157 | 80 |
| 26.3 | 833 | 1450 | -617 |

The mean salary is 1,281 ($1,281,000). The mean predicted salary is also 1,281. The mean of the residuals is zero.
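A sketch reproducing this table in Python from the ten observations shown (pandas assumed available):

```python
import pandas as pd

# Ten observations from the slide (roe in %, salary in thousand $)
df = pd.DataFrame({
    "roe":    [14.1, 10.9, 23.5, 5.9, 13.8, 20.0, 16.4, 16.3, 10.5, 26.3],
    "salary": [1095, 1001, 1122, 578, 1368, 1145, 1078, 1094, 1237, 833],
})

# Predicted values from the estimated equation and the implied residuals
df["salary_hat"] = 963.191 + 18.501 * df["roe"]
df["u_hat"] = df["salary"] - df["salary_hat"]

print(df.round(0))
# Note: the residuals average to zero over the full 209-observation
# estimation sample, not necessarily over this 10-row excerpt.
print("mean residual in excerpt:", round(df["u_hat"].mean(), 1))
```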
Simple regression example: wage
Simple regression model explaining how education affects wages for workers.
Regression model
$$wage = \beta_0 + \beta_1 \, educ + u$$
Residuals
$$\hat{u} = wage - \widehat{wage}$$
We estimate the regression model to find the coefficients.
$\hat{\beta}_1$ measures the change in wage associated with one more year of education, holding other factors fixed.
Estimated equation and interpretation
• Estimated equation:
$$\widehat{wage} = \hat{\beta}_0 + \hat{\beta}_1 \, educ = -0.90 + 0.54 \, educ$$
• Wage is measured in $/hour. Education is measured in years.
• $\hat{\beta}_1$ measures the change in a person's wage associated with one additional year of education, holding other factors fixed.
• Interpretation of $\hat{\beta}_1$: the hourly wage increases by $0.54 for each additional year of education.
• Interpretation of $\hat{\beta}_0$: if education is zero, a person's wage is -$0.90 (but no one in the sample has zero education).
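For example, the predicted hourly wage for a worker with 12 years of education is:

```latex
\[
\widehat{wage} = -0.90 + 0.54 \times 12 = 5.58 \ \text{\$/hour}
\]
```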
Stata output for simple regression
. reg wage educ
Variations
$$SST = \sum_{i=1}^{n} (y_i - \bar{y})^2 \qquad SSE = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 \qquad SSR = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} \hat{u}_i^2$$
SST = SSE + SSR
• SST is the total sum of squares and measures the total variation in the dependent variable
• SSE is the explained sum of squares and measures the variation explained by the regression
• SSR is the residual sum of squares and measures the variation not explained by the regression
Note: some call SSE the error sum of squares and SSR the regression sum of squares, where R and E are confusingly reversed.
Goodness of fit measure
R-squared
• $R^2 = SSE/SST = 1 - SSR/SST$
• R-squared is the explained sum of squares divided by the total sum of squares.
• R-squared is a goodness of fit measure. It measures the proportion of total variation that is explained by the regression.
• An R-squared of 0.7 is interpreted as 70% of the variation being explained by the regression, with the rest due to error.
• An R-squared greater than 0.25 is considered a good fit.
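A minimal sketch of the decomposition and R-squared in Python, reusing the illustrative data from the earlier sketch:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([20.5, 21.0, 21.6, 21.9, 22.6])

beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0_hat = y.mean() - beta1_hat * x.mean()
y_hat = beta0_hat + beta1_hat * x

sst = np.sum((y - y.mean()) ** 2)      # total variation
sse = np.sum((y_hat - y.mean()) ** 2)  # explained variation
ssr = np.sum((y - y_hat) ** 2)         # unexplained variation

assert np.isclose(sst, sse + ssr)      # SST = SSE + SSR
r_squared = sse / sst                  # equivalently 1 - ssr / sst
print(f"R-squared = {r_squared:.3f}")
```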
R-squared calculated
. reg wage educ
Log-linear form (also called semi-log)
• Linear regression model: $y = \beta_0 + \beta_1 x + u$
• Log-linear form: $\log(y) = \beta_0 + \beta_1 x + u$
• Instead of the dependent variable, use the log of the dependent variable.
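In this form the slope has an approximate percentage interpretation (a standard result, consistent with the interpretations on the following slides):

```latex
\[
\%\Delta y \approx 100 \cdot \beta_1 \cdot \Delta x
\]
```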
Linear vs log-linear form
[Figure: two panels plotting wage (left, 0–25) and log(wage) (right, -1 to 3) against educ (0–20), each with its fitted line.]
Linear form: wage increases by $0.54 for each additional year of education.
Log-linear form: wage increases by 8.2% for each additional year of education.
Example of data with logs

| Salary (thousand dollars) | lsalary | Sales (million dollars) | lsales |
|---|---|---|---|
| 1095 | 7.0 | 27595 | 10.2 |
| 1001 | 6.9 | 9958 | 9.2 |
| 1122 | 7.0 | 6126 | 8.7 |
| 578 | 6.4 | 16246 | 9.7 |
| 1368 | 7.2 | 21783 | 10.0 |
| 1145 | 7.0 | 6021 | 8.7 |
| 1078 | 7.0 | 2267 | 7.7 |
| 1094 | 7.0 | 2967 | 8.0 |
| 1237 | 7.1 | 4570 | 8.4 |
| 833 | 6.7 | 2830 | 7.9 |

Note that one unit is a thousand dollars for salary and a million dollars for sales.
Linear vs log-log form
[Figure: left panel plots 1990 salary, thousands $ (0–15000) against 1990 firm sales, millions $ (0–100000) with fitted values; right panel plots natural log of salary (5–10) against natural log of sales (4–12) with fitted values.]
Linear form: salary on sales. Log-log form: log salary on log sales.
Log-linear vs linear-log form
[Figure: left panel plots natural log of salary against sales with fitted values; right panel plots 1990 salary, thousands $ against the log of sales with fitted values.]
Log-linear form: log salary on sales. Linear-log form: salary on log sales.
Interpretation of coefficients

| | Linear | Log-log | Log-linear | Linear-log |
|---|---|---|---|---|
| Dependent variable | salary | lsalary | lsalary | salary |

Linear form: salary increases by 0.155 thousand dollars ($155) for each additional one million dollars in sales.
Log-log form: salary increases by 0.25% for every 1% increase in sales.
Log-linear form: salary increases by 0.0015% (= 0.000015 × 100) for each additional one million dollar increase in sales.
Linear-log form: salary increases by 2.629 (= 262.9/100) thousand dollars for each additional 1% increase in sales.
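The general interpretation rules behind these four statements (standard results, where $\beta_1$ is the slope coefficient):

```latex
\begin{align*}
\text{Linear:}\quad     & y = \beta_0 + \beta_1 x             & \Delta y &= \beta_1 \,\Delta x \\
\text{Log-log:}\quad    & \log(y) = \beta_0 + \beta_1 \log(x) & \%\Delta y &\approx \beta_1 \,\%\Delta x \\
\text{Log-linear:}\quad & \log(y) = \beta_0 + \beta_1 x       & \%\Delta y &\approx 100\,\beta_1 \,\Delta x \\
\text{Linear-log:}\quad & y = \beta_0 + \beta_1 \log(x)       & \Delta y &\approx (\beta_1/100)\,\%\Delta x
\end{align*}
```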
Review questions
1. Define regression model, estimated equation, and residuals.
2. What method is used to obtain the coefficients?
3. What are the OLS properties?
4. How is R-squared defined and what does it measure?
5. By taking logs of the variables, how does the interpretation of coefficients change?
Gauss-Markov assumptions
• The Gauss-Markov assumptions are standard assumptions for the linear regression model:
1. Linearity in parameters
2. Random sampling
3. No perfect collinearity (for simple regression: there is sample variation in the independent variable)
4. Exogeneity or zero conditional mean – regressors are not correlated with the error term
5. Homoscedasticity – the variance of the error term is constant
Assumption 1: linearity in parameters
$$y = \beta_0 + \beta_1 x + u$$
Assumption 2: random sampling
$$(x_i, y_i), \text{ where } i = 1, \dots, n$$
• The data are a random sample drawn from the population.
• Each observation follows the population equation $y = \beta_0 + \beta_1 x + u$.
• Example: data on workers (y = wage, x = education).
• The population is all workers in the U.S. (150 million).
• The sample is the workers selected for the study (1,000).
• Drawing randomly from the population means each worker has an equal probability of being selected.
• For example, if young workers are oversampled, this will not be a random/representative sample.
Assumption 3: no perfect collinearity
$$\sum_{i=1}^{n} (x_i - \bar{x})^2 > 0$$
The independent variable must vary in the sample (nonzero sample variance of x).
Assumption 4: zero conditional mean (exogeneity)
$$E(u_i \mid x_i) = 0$$
• The expected value of the error term u given the independent variable x is zero.
• The expected value of the error must not differ based on the values of the independent variable.
• The errors must average out to zero for each value of x.
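An informal diagnostic sketch in the spirit of this assumption, using simulated data where the assumption holds by construction (not a formal test):

```python
import numpy as np

rng = np.random.default_rng(0)
educ = rng.integers(10, 19, size=2000).astype(float)  # years of education
u = rng.normal(0, 2, size=2000)                       # error with mean 0 given educ
wage = -0.90 + 0.54 * educ + u                        # population equation

# Fit OLS and compute residuals
b1 = np.sum((educ - educ.mean()) * (wage - wage.mean())) / np.sum((educ - educ.mean()) ** 2)
b0 = wage.mean() - b1 * educ.mean()
u_hat = wage - (b0 + b1 * educ)

# The residuals average to roughly zero at each education level
for level in np.unique(educ):
    print(int(level), round(u_hat[educ == level].mean(), 3))
```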
Example of zero conditional mean
Regression model
$$wage = \beta_0 + \beta_1 \, educ + u$$
Example of exogeneity vs endogeneity
[Figure: two panels plotting residuals against educ (10–18). Left panel (exogeneity, zero conditional mean): residuals centered on zero at every education level. Right panel (endogeneity, conditional mean not zero): modified residuals (uhat_modified) that rise with education.]
E(u|x) = 0: the error term has the same mean given education. E(u|x) > 0: ability/error is higher when education is higher.
Unbiasedness of the OLS estimators
• Gauss-Markov assumptions 1-4 (linearity, random sampling, no perfect collinearity, and zero conditional mean) lead to the unbiasedness of the OLS estimators:
$$E(\hat{\beta}_0) = \beta_0 \quad \text{and} \quad E(\hat{\beta}_1) = \beta_1$$
• The expected values of the sample coefficients $\hat{\beta}$ are the population parameters $\beta$.
• If we estimate the regression model with many random samples, the average of these coefficients will be the population parameter.
• For a given sample, the coefficients may be very different from the population parameters.
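A Monte Carlo sketch of this idea in Python (the population parameters $\beta_0 = 1$ and $\beta_1 = 2$ are chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
beta0, beta1 = 1.0, 2.0  # population parameters (illustrative)

estimates = []
for _ in range(5000):  # estimate on many random samples
    x = rng.normal(5, 2, size=100)
    u = rng.normal(0, 1, size=100)  # E(u|x) = 0 holds by construction
    y = beta0 + beta1 * x + u
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    estimates.append(b1)

# The average of the estimated slopes is close to the population slope
print(np.mean(estimates))  # approximately 2.0
```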
Assumption 5: homoscedasticity
• Homoscedasticity: $var(u_i \mid x_i) = \sigma^2$
• The variance of the error term $u$ must not differ with the independent variable $x$.
• Heteroscedasticity, $var(u_i \mid x_i) \neq \sigma^2$, is when the variance of the error term $u$ is not constant for each $x$.
Homoscedasticity vs heteroscedasticity
Homoscedasticity: $var(u \mid x) = \sigma^2$. Heteroscedasticity: $var(u \mid x) \neq \sigma^2$.
[Figure: two panels plotting residuals against educ. Left panel (homoscedasticity): residual spread is constant across education levels. Right panel (heteroscedasticity): residual spread widens as education increases.]
Unbiasedness of the error variance
We can estimate the variance of the error term as:
$$\hat{\sigma}^2 = \frac{1}{n-2} \sum_{i=1}^{n} \hat{u}_i^2$$
• The degrees of freedom (n-k-1) are corrected for the number of independent variables, k = 1.
Variances of the OLS estimators
$$var(\hat{\beta}_1) = \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{\sigma^2}{SST_x}$$
$$se(\hat{\beta}_1) = \sqrt{\widehat{var}(\hat{\beta}_1)} = \sqrt{\frac{\hat{\sigma}^2}{SST_x}}$$
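A sketch of these two estimators in Python, again with the illustrative data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([20.5, 21.0, 21.6, 21.9, 22.6])
n = len(x)

sst_x = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / sst_x
b0 = y.mean() - b1 * x.mean()
u_hat = y - (b0 + b1 * x)

sigma2_hat = np.sum(u_hat ** 2) / (n - 2)  # error variance, df = n - 2
se_b1 = np.sqrt(sigma2_hat / sst_x)        # standard error of the slope
print(f"sigma2_hat = {sigma2_hat:.4f}, se(b1) = {se_b1:.4f}")
```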
Review questions
1. List and explain the 5 Gauss-Markov assumptions.
2. Which assumptions are needed for the unbiasedness of the coefficients?
3. Which assumptions are needed to calculate the variance of the OLS coefficients?
4. Is it possible to have zero conditional mean but heteroscedasticity?