Dummy Variable Regression
REGRESSION WITH DUMMY, DICHOTOMOUS OR INDICATOR VARIABLES
Learning objectives
• Understand the role of dummy variables to represent qualitative explanatory variables and use them in regression.
• Test for differences between the categories of a qualitative variable.
• Calculate and interpret confidence intervals and prediction intervals, to allow inferences about the regression coefficients.
• Explain the role of the assumptions on the OLS estimators.
• Describe common violations of the assumptions and offer remedies.
Example: Programmer Salary Survey
As an extension of the problem involving the computer programmer salary survey, suppose that management also believes that the annual salary is related to whether the individual has a graduate degree in computer science or a related field.
The years of experience, the score on the programmer aptitude test, whether the individual has a relevant graduate degree, and the annual salary ($000s) for each of the 20 sampled programmers are shown below.

Exper. (Yrs.)   Test Score   Degr.   Salary ($000s)
 4               78          No      24.0
 7              100          Yes     43.0
 1               86          No      23.7
 5               82          Yes     34.3
 8               86          Yes     35.8
10               84          Yes     38.0
 0               75          No      22.2
 1               80          No      23.1
 6               83          No      30.0
 6               91          Yes     33.0
 9               88          Yes     38.0
 2               73          No      26.6
10               75          Yes     36.2
 5               81          No      31.6
 6               74          No      29.0
 8               87          Yes     34.0
 4               79          No      30.1
 6               94          Yes     33.9
 3               70          No      28.2
 3               89          No      30.0
12/7/2023
ŷ = b0 + b1x1 + b2x2 + b3x3
where:
ŷ = annual salary ($1000)
x1 = years of experience
x2 = score on programmer aptitude test
x3 = 0 if individual does not have a graduate degree
     1 if individual does have a graduate degree
x3 is a dummy variable

ANOVA Output
Analysis of Variance
SOURCE           DF   SS         MS        F       P
Regression        3   507.8960   169.299   29.48   0.000
Residual Error   16    91.8895     5.743
Total            19   599.7855

R² = 507.896/599.7855 = .8468 (previously, R Square = .8342)
Ra² = 1 − (1 − .8468) × (20 − 1)/(20 − 3 − 1) = .8181 (previously, Adjusted R Square = .815)
• For the sake of simplicity, consider a model containing one quantitative explanatory variable and one dummy variable.
y = β0 + β1x1 + β2d + ε
• Conducting a standard ordinary least squares (OLS) regression will yield an estimated equation of
ŷ = b0 + b1x1 + b2d.
• For a given x1 and d = 0, we compute ŷ as
ŷ = b0 + b1x1 + b2(0) = b0 + b1x1.
• Similarly, when d = 1,
ŷ = b0 + b1x1 + b2(1) = (b0 + b2) + b1x1.
• The dummy variable allows a shift in the intercept term, enabling us to use a single regression equation to represent both categories of the qualitative variable.
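The intercept shift can be seen numerically. This is a small sketch on hypothetical, noise-free data generated from y = 20 + 2x1 + 5d (the numbers are invented for illustration and do not come from the slides), assuming NumPy is available. One regression recovers two parallel lines, one for each category.

```python
import numpy as np

# Hypothetical, noise-free data: y = 20 + 2*x1 + 5*d (illustration only).
x1 = np.array([1, 2, 3, 4, 5, 1, 2, 3, 4, 5], dtype=float)
d = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1], dtype=float)
y = 20 + 2 * x1 + 5 * d

# Single OLS regression of y on an intercept, x1 and the dummy d.
X = np.column_stack([np.ones_like(x1), x1, d])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, b1, b2 = coef

# The same slope b1 applies to both categories; only the intercept differs.
intercept_d0 = b0          # line for d = 0: y_hat = b0 + b1*x1
intercept_d1 = b0 + b2     # line for d = 1: y_hat = (b0 + b2) + b1*x1

print("d = 0: intercept", round(intercept_d0, 3), "slope", round(b1, 3))
print("d = 1: intercept", round(intercept_d1, 3), "slope", round(b1, 3))
```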
Graphically, the dummy variable shifts the intercept of the regression line: the lines for d = 0 and d = 1 are parallel, with intercepts b0 and b0 + b2.
• Example: Evidence of gender pay discrimination?
– The introductory case has two qualitative variables, gender and age group. To measure the impact of gender and age on salary, we need to create two dummy variables.
Let d1 = 1 if the professor is male; 0 if female.
Let d2 = 1 if the professor is 60 or over; 0 if under 60.
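The two definitions translate directly into code. Below is a minimal sketch; the function name `make_dummies` and the sample records are hypothetical and used only to illustrate the encoding.

```python
def make_dummies(gender, age):
    """Encode gender and age as the two dummy variables defined above.

    d1 = 1 if the professor is male, 0 if female.
    d2 = 1 if the professor is 60 or over, 0 if under 60.
    """
    d1 = 1 if gender == "male" else 0
    d2 = 1 if age >= 60 else 0
    return d1, d2


# Hypothetical records (gender, age), for illustration only.
professors = [("male", 45), ("female", 62), ("male", 60), ("female", 38)]
dummies = [make_dummies(gender, age) for gender, age in professors]
print(dummies)
```

Each qualitative variable with two categories needs one dummy; the category coded 0 (female, under 60) serves as the reference group.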
Interval estimates for the response variable
LO: Calculate and interpret confidence intervals and prediction intervals, to allow inferences about the regression coefficients.
• Once we have developed a regression model, we often want to use it to make predictions.
• In the academic salary example, what salary would we predict for a male professor with 10 years of experience? Inserting these values into our estimated regression equation, we find:
Salary(predicted) = ŷ = 54.011 + 1.503(10) + 18.541(1) + 5.772(0) = 87.554, that is, $87,554.
• But this is only a point estimate and ignores sampling error. We can also provide interval estimates.
• We will develop two types of interval estimates regarding y:
– A confidence interval for the expected value of y
– A prediction interval for an individual value of y.
• It is common to refer to the first as a confidence interval and the second as a prediction interval.
• The point estimate of E(y⁰) is just the ŷ value.
ŷ⁰ = b0 + b1x1⁰ + b2x2⁰ + … + bkxk⁰
• The confidence interval, as always, includes the point estimate, plus or minus the margin of error.
ŷ⁰ ± tα/2,df se(ŷ⁰)
• The term se(ŷ⁰) is the standard error of the prediction. Though difficult to compute by hand if there is more than one explanatory variable in the model, we will develop a procedure to compute it with a statistical package.
• Many statistics programs will compute confidence intervals, but Excel's data analysis tools do not.
• Here is a method you can use instead. Shift the value of each explanatory variable in your data set by the value of interest for that variable:
x1* = x1 − x1⁰, x2* = x2 − x2⁰, …, xk* = xk − xk⁰
• When we estimate this modified regression, the resulting estimate of the intercept and its standard error equal ŷ⁰ and se(ŷ⁰), respectively.
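The shifting method can be checked numerically. The sketch below uses hypothetical data (one quantitative variable and one dummy, invented for illustration) and a small helper `ols`, which is an assumption of this example rather than anything from the slides; NumPy is assumed to be available. It compares the intercept row of the shifted regression with the prediction and its standard error computed directly from the matrix formula.

```python
import numpy as np


def ols(X, y):
    """Return OLS coefficients, their standard errors, s^2 and (X'X)^-1."""
    n, p = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    b = xtx_inv @ X.T @ y
    resid = y - X @ b
    s2 = (resid @ resid) / (n - p)          # squared standard error of the estimate
    se = np.sqrt(s2 * np.diag(xtx_inv))     # standard errors of the coefficients
    return b, se, s2, xtx_inv


# Hypothetical data, for illustration only.
x1 = np.array([2, 5, 7, 10, 12, 15, 18, 20], dtype=float)
d = np.array([0, 1, 0, 1, 1, 0, 1, 0], dtype=float)
y = np.array([58.0, 79.5, 63.0, 88.0, 90.5, 77.0, 99.0, 85.5])

# Values of interest for the explanatory variables.
x1_0, d_0 = 10.0, 1.0

# Direct calculation from the original regression.
X = np.column_stack([np.ones_like(x1), x1, d])
b, se, s2, xtx_inv = ols(X, y)
x0 = np.array([1.0, x1_0, d_0])
y0_direct = x0 @ b
se_direct = np.sqrt(s2 * (x0 @ xtx_inv @ x0))

# Shifted regression: subtract the value of interest from each variable.
X_shift = np.column_stack([np.ones_like(x1), x1 - x1_0, d - d_0])
b_shift, se_shift, _, _ = ols(X_shift, y)

print("direct : y0_hat =", round(y0_direct, 4), " se =", round(se_direct, 4))
print("shifted: intercept =", round(b_shift[0], 4), " se =", round(se_shift[0], 4))
```

Shifting changes only the intercept; the slope estimates are the same in both regressions, which is why the intercept row alone carries the information needed for the interval.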
• Example: Evidence of gender pay discrimination?
– In the academic salary example, we first shift the data by our hypothesised values.
– Estimating the modified regression now reveals the confidence interval.
• Example:
– To summarise, after shifting the explanatory variables, the intercept row in the regression output gives us all the information we need. The 95% confidence interval is given in the same row.
– A 95% confidence interval for the mean salary of men with 10 years of experience:
ŷ⁰ ± tα/2,df se(ŷ⁰) = 87.406 ± 2.023 × 2.868 = 87.406 ± 5.802.
– With 95% confidence, we can state that the mean salary of all male professors with 10 years of experience falls between $81,604 and $93,208.
• If we want to compute an interval with a different confidence level, we simply need to find the correct tα/2,df statistic and insert the intercept and standard error of the intercept from the same regression, or alternatively, specify a different confidence level in Excel's Regression dialog box.
• The formula for the prediction interval is
ŷ⁰ ± tα/2,df √((se(ŷ⁰))² + se²)
where se is the standard error of the estimate.
• The point estimate and the standard error of the prediction are computed using the same technique as for the confidence interval.
• Now we need to include the standard error of the estimate in the margin of error calculation.
• Example: Prediction interval for salary
– For the introductory case, to compute the prediction interval for a man with 10 years of experience, we simply insert the appropriate values from the previous example, plus the standard error of the estimate, 9.133.
87.406 ± 2.023 × √(2.868² + 9.133²) = [68.044, 106.768]
– With 95% confidence, we can state that the salary of a male professor with 10 years of experience falls between $68,044 and $106,768.
– Remember that the prediction interval is an interval estimate for one man with this experience, while the confidence interval pertains to the average of all men with this much experience.
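The two margins of error can be compared side by side. The sketch below plugs in the values quoted in the example (point estimate 87.406, t statistic 2.023, standard error of the prediction 2.868, standard error of the estimate 9.133); the variable names are illustrative. Because the inputs are rounded, the results agree with the figures above only to within rounding in the last digits.

```python
import math

# Values quoted in the example (salary in $000s).
y0_hat = 87.406    # point estimate: intercept of the shifted regression
t_crit = 2.023     # t statistic for a 95% interval
se_y0 = 2.868      # standard error of the prediction, se(y0_hat)
se_est = 9.133     # standard error of the estimate

# Confidence interval for the mean salary: margin uses se(y0_hat) only.
ci_margin = t_crit * se_y0
conf_int = (y0_hat - ci_margin, y0_hat + ci_margin)

# Prediction interval for one individual: margin also includes se_est.
pi_margin = t_crit * math.sqrt(se_y0 ** 2 + se_est ** 2)
pred_int = (y0_hat - pi_margin, y0_hat + pi_margin)

print("confidence interval:", [round(v, 3) for v in conf_int])
print("prediction interval:", [round(v, 3) for v in pred_int])
```

The prediction interval is always the wider of the two, since its margin adds the variability of an individual observation to the uncertainty about the mean.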
1. The model given by y = β0 + β1x1 + … + βkxk + ε is linear in the parameters β0, β1, …, βk.
2. Conditional on x1, …, xk, E(ε) = 0, thus E(y) = β0 + β1x1 + … + βkxk.
3. There is no exact linear relationship among the explanatory variables (i.e. no perfect multicollinearity).
6. The error term ε is not correlated with any of the predictors x1, …, xk. In other words, no relevant explanatory variables have been excluded.
7. The error term ε is normally distributed. This assumption allows us to do hypothesis testing. If the errors are not normally distributed, our tests may not be valid.