Dummy Variable Regression

12/7/2023

REGRESSION WITH DUMMY, DICHOTOMOUS OR INDICATOR VARIABLES

Learning objectives
• Understand the role of dummy variables to represent qualitative explanatory variables and use them in regression.
• Test for differences between the categories of a qualitative variable.
• Calculate and interpret confidence intervals and prediction intervals, to allow inferences about the regression coefficients.
• Explain the role of the assumptions on the OLS estimators.
• Describe common violations of the assumptions and offer remedies.


Categorical Independent Variables

In many situations we must work with categorical independent variables such as gender (male, female), method of payment (cash, check, credit card), etc.
For example, x2 might represent gender, where x2 = 0 indicates male and x2 = 1 indicates female.
In this case, x2 is called a dummy or indicator variable.

Example: Programmer Salary Survey

A software firm collected data for a sample of 20 computer programmers. A suggestion was made that regression analysis could be used to determine if salary was related to the years of experience and the score on the firm's Programmer Aptitude Test.
The years of experience, the score on the aptitude test and the corresponding annual salary ($1000s) were recorded for the sample of 20 programmers.


Categorical Independent Variables

Example: Programmer Salary Survey (continued)
As an extension of the problem involving the computer programmer salary survey, suppose that management also believes that the annual salary is related to whether the individual has a graduate degree in computer science or a related field.
The years of experience, the score on the programmer aptitude test, whether the individual has a relevant graduate degree, and the annual salary ($000s) for each of the 20 sampled programmers are shown below.

Exper. (Yrs.)  Test Score  Degr.  Salary ($000s)
 4              78         No     24.0
 7             100         Yes    43.0
 1              86         No     23.7
 5              82         Yes    34.3
 8              86         Yes    35.8
10              84         Yes    38.0
 0              75         No     22.2
 1              80         No     23.1
 6              83         No     30.0
 6              91         Yes    33.0
 9              88         Yes    38.0
 2              73         No     26.6
10              75         Yes    36.2
 5              81         No     31.6
 6              74         No     29.0
 8              87         Yes    34.0
 4              79         No     30.1
 6              94         Yes    33.9
 3              70         No     28.2
 3              89         No     30.0


Estimated Regression Equation

ŷ = b0 + b1x1 + b2x2 + b3x3

where:
ŷ = annual salary ($1000s)
x1 = years of experience
x2 = score on programmer aptitude test
x3 = 1 if the individual has a graduate degree; 0 otherwise

x3 is a dummy variable.

ANOVA Output

Analysis of Variance
SOURCE          DF    SS        MS       F      P
Regression       3    507.8960  169.299  29.48  0.000
Residual Error  16     91.8895    5.743
Total           19    599.7855

Previously (without the dummy variable), R Square = .8342 and Adjusted R Square = .815.

With the dummy variable:
R² = 507.896/599.7855 = .8468
Adjusted R²: Ra² = 1 − (1 − .8468) × (20 − 1)/(20 − 3 − 1) = .8181
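The output above can be reproduced from the 20-observation data set. A minimal sketch using NumPy's least-squares solver (the original analysis was presumably run in a package such as Minitab or Excel):

```python
import numpy as np

# The 20-programmer sample: experience (yrs), test score, degree dummy, salary ($000s)
exper  = [4, 7, 1, 5, 8, 10, 0, 1, 6, 6, 9, 2, 10, 5, 6, 8, 4, 6, 3, 3]
score  = [78, 100, 86, 82, 86, 84, 75, 80, 83, 91, 88, 73, 75, 81, 74, 87, 79, 94, 70, 89]
degree = [0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0]  # 1 = graduate degree
salary = [24.0, 43.0, 23.7, 34.3, 35.8, 38.0, 22.2, 23.1, 30.0, 33.0,
          38.0, 26.6, 36.2, 31.6, 29.0, 34.0, 30.1, 33.9, 28.2, 30.0]

X = np.column_stack([np.ones(20), exper, score, degree])
y = np.array(salary)

# OLS: b minimises the sum of squared residuals
b, *_ = np.linalg.lstsq(X, y, rcond=None)

sse = ((y - X @ b) ** 2).sum()
sst = ((y - y.mean()) ** 2).sum()
r2 = 1 - sse / sst
adj_r2 = 1 - (1 - r2) * (20 - 1) / (20 - 3 - 1)
print(b.round(3), round(r2, 4), round(adj_r2, 4))
```

The printed estimates should agree with the Regression Equation Output slide (roughly 7.945, 1.148, 0.197 and 2.280) and with R² = .8468, Adjusted R² = .8181.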


Categorical Independent Variables

Regression Equation Output

Predictor     Coef   SE Coef  T      p
Constant      7.945  7.382    1.076  0.298
Experience    1.148  0.298    3.856  0.001
Test Score    0.197  0.090    2.191  0.044
Grad. Degr.   2.280  1.987    1.148  0.268   ← not significant

Dummy, Dichotomous or Indicator Variables

• Qualitative explanatory variable with two categories
• Qualitative explanatory variable with multiple categories


More Complex Categorical Variables

If a categorical variable has k levels, k − 1 dummy variables are required, with each dummy variable coded as 0 or 1.
For example, a variable with levels A, B and C could be represented by x1 and x2 values of (0, 0) for A, (1, 0) for B, and (0, 1) for C.
Care must be taken in defining and interpreting the dummy variables.

For example, a variable indicating level of education could be represented by x1 and x2 values as follows:

Highest Degree   x1   x2
Bachelor's*       0    0
Master's          1    0
Ph.D.             0    1

*: baseline indicator
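The k − 1 coding can be generated automatically. A sketch with pandas (assuming it is available), where drop_first=True drops the alphabetically first category, making Bachelor's the baseline:

```python
import pandas as pd

degrees = pd.Series(["Bachelor's", "Master's", "Ph.D.", "Master's", "Bachelor's"])

# 3 levels -> 2 dummy variables; the dropped level (Bachelor's) is the baseline,
# represented by a row of zeros in both dummy columns
dummies = pd.get_dummies(degrees, drop_first=True)
print(dummies)
```

Each row of `dummies` has at most one nonzero entry, matching the table above.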



Example: Is there evidence of gender pay discrimination?

• Worldwide studies have documented gender differences in wages and found that female academics received lower pay than their male colleagues.
• Numerous studies have focused on salary differences between men and women, indigenous and non-indigenous, and young and old Australians.
• Joanna Smith works in human resources at a large university.
• After the release of the latest Australian Bureau of Statistics data, the university asked her to test for both gender and age discrimination in salaries.
• She gathers data on 42 professors, including the salary, experience, gender and age of each.
• Using this data set, Joanna hopes to:
– determine whether there is evidence of gender discrimination in salaries
– determine whether there is evidence of age discrimination in salaries.

Dummy variables

LO: Understand the role of dummy variables to represent qualitative explanatory variables and use them in regression.
• Previously, all the variables used in regression applications were quantitative.
• In empirical work it is common to have some variables that are qualitative: the values represent categories that may have no implied ordering.
• We can include these factors in a regression through the use of dummy variables.
• A dummy variable for a qualitative variable with two categories assigns a value of 1 for one of the categories and a value of 0 for the other.
• For example, suppose we are interested in teen behaviour. We might first define a dummy variable d with the following structure:
Let d = 1 if age is between 13 and 19
and d = 0 if age is anything else.
• This allows us to capture the role of being a teenager in a regression model and quantify its impact.

Dummy variables

• For the sake of simplicity, consider a model containing one quantitative explanatory variable and one dummy variable:
y = b0 + b1x1 + b2d + e
• Conducting a standard ordinary least squares (OLS) regression will yield an estimated equation of
ŷ = b0 + b1x1 + b2d.
• For a given x1 and d = 0, we compute ŷ as
ŷ = b0 + b1x1 + b2(0) = b0 + b1x1.
• Similarly, when d = 1,
ŷ = b0 + b1x1 + b2(1) = (b0 + b2) + b1x1.
• The dummy variable allows a shift in the intercept term, enabling us to use a single regression equation to represent both categories of the qualitative variable.
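A tiny numeric sketch (with made-up coefficients, not from any slide) makes the intercept shift concrete: at every value of x1, the two fitted lines differ by exactly b2.

```python
def y_hat(x1, d, b0=10.0, b1=2.0, b2=5.0):
    """Fitted value for one quantitative variable plus one dummy (illustrative coefficients)."""
    return b0 + b1 * x1 + b2 * d

# The gap between the d = 1 and d = 0 lines is b2 at any x1:
# two parallel lines with a shifted intercept
for x1 in (0.0, 3.0, 10.0):
    print(x1, y_hat(x1, 1) - y_hat(x1, 0))  # gap is always 5.0 (= b2)
```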


3
12/7/2023

Dummy variables

Graphically, we can see how the dummy variable shifts the intercept of the regression line.

• Example: Evidence of gender pay discrimination?
– The introductory case has two qualitative variables, gender and age group. To measure the impact of gender and age on salary, we need to create two dummy variables.
Let d1 = 1 if the professor is male; 0 if female.
Let d2 = 1 if the professor is 60 or over; 0 if under 60.


Dummy variables

• Example:
– The estimated equation is
ŷ = 54.011 + 1.503x + 18.541d1 + 5.772d2
– The difference in salary between a male and a female professor is captured in the coefficient of d1. A male professor, on average, makes $18,541 more than a female professor with comparable experience.
– The age coefficient, though statistically insignificant in this case, would have a similar interpretation.

Qualitative variables with two categories

LO: Test for differences between the categories of a qualitative variable.
• The statistical tests discussed earlier remain valid for dummy variables as well.
• We can perform a t test for individual significance, form a confidence interval using the parameter estimate and its standard error, and conduct a partial F test for joint significance.
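These tests use nothing beyond the reported coefficient, its standard error and the residual degrees of freedom. For instance, the graduate-degree dummy in the programmer-salary output earlier (coefficient 2.280, standard error 1.987, 16 df) can be checked with SciPy (assuming it is installed):

```python
from scipy import stats

coef, se, df = 2.280, 1.987, 16     # values from the programmer-salary output

t_stat = coef / se                  # t test for H0: the coefficient is zero
p_value = 2 * stats.t.sf(abs(t_stat), df)
print(round(t_stat, 3), round(p_value, 3))  # close to the reported p = 0.268
```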


Qualitative variables with two categories

• Example: Evidence of gender pay discrimination?
– Is there a gender effect in the salary study?
H0: b2 = 0 (males and females are paid the same)
HA: b2 ≠ 0 (there is a difference due to gender)
– Given a value of the tdf test statistic of 4.86 and a p-value of approximately 0.00, we reject the null hypothesis and conclude that the gender dummy variable is significant.
• For the age coefficient, tdf is 0.94 and the p-value is 0.36, so we do not reject the null hypothesis. The evidence suggests that professors over 60 do not have significantly different salaries compared with those under 60.

Qualitative variables with multiple categories

• Sometimes a qualitative variable may be described by more than two categories.
• In such cases we use multiple dummy variables to capture the effect of the variable.
– For example, suppose we divide the mode of transport used by commuters into three categories: public transport, driving and park-and-ride.
– We then define two dummy variables, d1 and d2, where d1 equals 1 to denote public transport and 0 otherwise, and d2 equals 1 to denote driving and 0 otherwise. Park-and-ride is captured when both d1 and d2 equal 0.


Qualitative variables with multiple categories

• Our regression model for the mode of transport example would then be
y = b0 + b1x + b2d1 + b3d2 + e
and the estimated equation would be
ŷ = b0 + b1x + b2d1 + b3d2.
• Given the intercept term, we exclude one of the dummy variables from the regression.
• The excluded variable represents the reference category (baseline indicator) against which the others are assessed.
• If we included as many dummy variables as there are categories, this would create perfect multicollinearity in the data, and such a model cannot be estimated.
• So, we include one fewer dummy variable than the number of categories of the qualitative variable.


Interval estimates for the response variable

LO: Calculate and interpret confidence intervals and prediction intervals, to allow inferences about the regression coefficients.
• Once we have developed a regression model, we often want to use it to make predictions.
• In the academic salary example, what salary would we predict for a male professor with 10 years of experience? Inserting these values into our estimated regression equation, we find:
Salary(predicted) = ŷ = 54.011 + 1.503(10) + 18.541(1) + 5.772(0) = 87.554, that is, $87,554.
• But this is only a point estimate and ignores sampling error. We can also provide interval estimates.
• We will develop two types of interval estimates regarding y:
– a confidence interval for the expected value of y
– a prediction interval for an individual value of y.
• It is common to refer to the first as a confidence interval and the second as a prediction interval.


Interval estimates for the response variable

• The point estimate of E(y0) is just the ŷ value:
ŷ0 = b0 + b1x10 + b2x20 + … + bkxk0
• The confidence interval, as always, includes the point estimate plus or minus the margin of error:
ŷ0 ± ta/2,df se(ŷ0)
• The term se(ŷ0) is the standard error of the prediction. Though difficult to compute by hand if there is more than one explanatory variable in the model, we will develop a procedure to compute it with a statistical package.
• Many statistics programs will compute confidence intervals, but Excel's data analysis tools do not.
• Here is a method you can use instead. Shift the value of each explanatory variable in your data set by the value of interest for that variable:
x1* = x1 − x10, x2* = x2 − x20, …, xk* = xk − xk0
• When we estimate this modified regression, the resulting estimate of the intercept and its standard error equal ŷ0 and se(ŷ0), respectively.
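Here is a sketch of that shifting trick with simulated data (any real data set would do): after shifting each explanatory variable by the value of interest, the intercept of the re-estimated regression equals the prediction ŷ0, and the intercept's standard error is se(ŷ0).

```python
import numpy as np

def ols(X, y):
    """OLS estimates and standard errors via the normal equations."""
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    resid = y - X @ b
    s2 = resid @ resid / (X.shape[0] - X.shape[1])  # estimated error variance
    se = np.sqrt(s2 * np.diag(XtX_inv))
    return b, se

rng = np.random.default_rng(0)
n = 50
x1 = rng.uniform(0, 10, n)
d = rng.integers(0, 2, n).astype(float)
y = 5 + 2 * x1 + 3 * d + rng.normal(0, 1, n)

# Point of interest: x1 = 4, d = 1
x10, d0 = 4.0, 1.0

# Shift each explanatory variable by the value of interest
Xs = np.column_stack([np.ones(n), x1 - x10, d - d0])
b_s, se_s = ols(Xs, y)

# Check: the shifted intercept equals the prediction from the original regression
X = np.column_stack([np.ones(n), x1, d])
b, _ = ols(X, y)
yhat0 = b @ np.array([1.0, x10, d0])
print(b_s[0], yhat0, se_s[0])
```

The slopes are unchanged by the shift; only the intercept (and its standard error) is reinterpreted.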


Interval estimates for the response variable

• Example: Evidence of gender pay discrimination?
– In the academic salary example, we first shift the data by our hypothesised values.
– Estimating the modified regression then reveals the confidence interval.


Interval estimates for the response variable

• Example:
– To summarise, after shifting the explanatory variables, the intercept row in the regression output gives us all the information we need. The 95% confidence interval is given in the same row.
– A 95% confidence interval for the salary of a man with 10 years of experience:
ŷ0 ± ta/2,df se(ŷ0) = 87.406 ± 2.023 × 2.869 = 87.406 ± 5.802.
– With 95% confidence, we can state that the mean salary of all male professors with 10 years of experience falls between $81,603 and $93,209.
• If we want to compute an interval with a different confidence level, we simply need to find the correct ta/2,df statistic and insert the intercept and the standard error of the intercept from the same regression, or alternatively, specify a different confidence level in Excel's Regression dialog box.
• The formula for the prediction interval is
ŷ0 ± ta/2,df √( se(ŷ0)² + se² )
where se is the standard error of the estimate.


Interval estimates for the response variable

• The point estimate and the standard error of the prediction are computed using the same technique as for the confidence interval.
• Now we need to include the standard error of the estimate in the margin of error calculation.
• Example: Prediction interval for salary
– For the introductory case, to compute the prediction interval for a man with 10 years of experience, we simply insert the appropriate values from the previous example, plus the standard error of the estimate, 9.133:
87.406 ± 2.023 × √(2.868² + 9.133²) = (68.044, 106.768)
– With 95% confidence, we can state that the salary of a male professor with 10 years of experience falls between $68,044 and $106,768.
– Remember that the prediction interval is an interval estimate for one man with this experience, while the confidence interval pertains to the average of all men with this much experience.
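The arithmetic of the prediction interval can be checked in a few lines; the numeric inputs below are taken from the slide (t ≈ 2.023 for 95% with df = 42 − 3 − 1 = 38), and the result agrees with the slide's interval to within rounding.

```python
import math

yhat0 = 87.406     # point estimate ($000s)
se_yhat0 = 2.868   # standard error of the prediction
se_est = 9.133     # standard error of the estimate
t_crit = 2.023     # t value used on the slide (df = 38)

# The prediction interval widens the margin with the standard error of the estimate
margin = t_crit * math.sqrt(se_yhat0**2 + se_est**2)
lower, upper = yhat0 - margin, yhat0 + margin
print(round(lower, 2), round(upper, 2))
```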


Model assumptions and common violations

LO: Explain the role of the assumptions on the OLS estimators.
• The statistical properties of the OLS estimator, as well as the validity of the testing procedures, depend on a number of assumptions.
1. The model given by y = b0 + b1x1 + … + bkxk + e is linear in the parameters b0, b1, …, bk.
2. Conditional on x1, …, xk, E(e) = 0, thus E(y) = b0 + b1x1 + … + bkxk.
3. There is no exact linear relationship among the explanatory variables (i.e. no perfect multicollinearity).
4. The variance of the error term e is the same for all x1, …, xk values. In other words, observations do not have a changing variability.
5. The error term e is uncorrelated across observations. In other words, observations are not correlated.
6. The error term e is not correlated with any of the predictors x1, …, xk. In other words, there are no relevant explanatory variables excluded.
7. The error term e is normally distributed. This assumption allows us to do hypothesis testing. If normality does not hold, our tests may not be valid.


Model assumptions and common violations

• The true error terms e cannot be observed because they exist only in the population. We can, however, look at the residuals, e = y − ŷ, where ŷ = b0 + b1x1 + b2x2 + … + bkxk, for each observation.
• It is common to plot the residuals on the vertical axis and an explanatory variable on the horizontal axis.
• When estimating a regression in Excel, the dialog box that opens when you select Data > Data Analysis > Regression allows you to choose the Residuals and Residual Plots options.

LO: Describe common violations of the assumptions and offer remedies.

Multicollinearity
• Perfect multicollinearity exists when two or more x variables exhibit an exact linear relationship.
• For example, suppose the x data include total cost, fixed cost and variable cost.
• Other data sets may have a great degree of multicollinearity that is not perfect but still strong.


Multicollinearity
• In these cases we may see a high R² but individually insignificant explanatory variables. Additional non-intuitive results may also be indicative.
• A sample correlation between explanatory variables greater than 0.80 or less than −0.80 suggests severe multicollinearity.
• A good remedy may be simply to drop one of the collinear variables if we can justify it as redundant.
• Alternatively, we could increase our sample size.
• Another option would be to try to transform our variables so that they are no longer collinear.
• Finally, especially if we are interested only in maintaining a high predictive power, it may make sense to do nothing.
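A quick way to screen for the problem is a correlation matrix of the explanatory variables. A sketch with simulated cost data (hypothetical numbers) in which total cost is, by construction, fixed plus variable cost:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
fixed = rng.uniform(10, 20, n)
variable = rng.uniform(30, 60, n)
total = fixed + variable            # exact linear combination -> perfect multicollinearity

X = np.column_stack([fixed, variable, total])

# Pairwise correlations; |r| > 0.80 between explanatory variables flags severe multicollinearity
corr = np.corrcoef(X, rowvar=False)
print(corr.round(2))

# Perfect multicollinearity makes the design matrix rank-deficient,
# so the model with all three variables cannot be estimated
rank = np.linalg.matrix_rank(np.column_stack([np.ones(n), X]))
print(rank)  # 3, not 4: one column is redundant
```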


Changing variability
• The variance of the error term changes for different values of at least one explanatory variable.
• Informal residual plots can gauge heteroscedasticity. (The slide shows a residual plot for a model where none of the assumptions has been violated.)
• Heteroscedasticity results in inefficient estimators, and the hypothesis tests for significance are no longer valid.
• To get around the second problem, some researchers use OLS estimates along with corrected standard errors, called White's standard errors. Many statistical packages have this option available; unfortunately, the current version of Excel does not.


Correlated observations
• We assume that the error term is uncorrelated across observations when obtaining OLS estimates, but this often breaks down in time-series data.
• In this example, we predict sales at a sushi restaurant. (The slide shows a plot of the residuals against time.)

Excluded variables
• Endogeneity in the regression model refers to the error term being correlated with the explanatory variables. This commonly occurs due to an omitted explanatory variable.
• For example, a person's salary may be highly correlated with that person's innate ability. But since we cannot include it, ability gets incorporated in the error term. If we try to predict salary by years of education, which may also be correlated with innate ability, then we have an endogeneity problem.
• Remedies are not easily accessible using Excel.
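The omitted-variable problem can be illustrated with a small simulation (hypothetical data, not from the case study): ability affects both education and salary, so leaving it out biases the education coefficient upward.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000
ability = rng.normal(0, 1, n)                    # unobserved by the researcher
educ = 12 + ability + rng.normal(0, 1, n)        # education is correlated with ability
salary = 20 + 2 * educ + 3 * ability + rng.normal(0, 1, n)

# Regress salary on education alone: ability is absorbed into the error term,
# which is now correlated with educ (endogeneity)
X = np.column_stack([np.ones(n), educ])
b_short, *_ = np.linalg.lstsq(X, salary, rcond=None)

# Including ability removes the bias
X_full = np.column_stack([np.ones(n), educ, ability])
b_full, *_ = np.linalg.lstsq(X_full, salary, rcond=None)

print(b_short[1], b_full[1])  # short-regression slope sits well above the true value 2
```

With this design the short regression's slope converges to 2 + 3 × cov(educ, ability)/var(educ) = 3.5, not the true 2.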


Excluded variables
• Endogeneity will result in biased estimators, and so is quite a serious problem. Unfortunately, it is difficult to fix.
• Most commonly, we try to find an instrumental variable. Discussion of the instrumental variable approach is beyond the scope of the text.
