Dummy Variable Regression Models: Dichotomous Variables
-Yogita Yadav
In all the regression models discussed so far, the dependent and explanatory variables were
quantitative in nature. But this may not always be the case. There are many occasions when the
explanatory variables are qualitative in nature, for example gender, religion, color, nationality, etc.
Qualitative variables of this kind enter a regression as Dummy Variables (also known as indicator,
categorical, binary, or dichotomous variables).
A qualitative variable is quantified with the help of an artificial variable that takes the values
0 (zero) and 1 (one), where 0 indicates the absence of an attribute and 1 indicates the presence of
that attribute.
Suppose we have a regression model in which Y (annual expenditure on food) depends on one
qualitative variable, 'Gender':
Yi = B1 + B2 Di + ui (1.1)
where
Di = 0, for males
Di = 1, for females
Regression models that contain only dummy explanatory variables, such as (1.1), are called
analysis-of-variance (ANOVA) models.
Model (1.1) is similar to the two-variable regression model we have discussed, except that the
explanatory variable is qualitative in nature (Di instead of Xi). Since Di remains fixed from sample to
sample (just like Xi), and assuming ui satisfies the usual assumptions of the CLRM, the OLS method
can be used to estimate the parameters of model (1.1).
Now,
E( Yi / Di = 0 ) = B1
E( Yi / Di = 1 ) = B1 + B2
So, B2 measures by how much the average annual food expenditure of females differs from that of
males.
Since there is no continuous regression line, it is not appropriate to call B2 the slope coefficient. B2, in
this case, is called the differential intercept term.
The OLS estimators are
b2 = ∑ di yi / ∑ di^2
and,
b1 = Y̅ – b2 D̅
where di = Di – D̅ and yi = Yi – Y̅
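The formulas above can be sketched in a few lines of Python. The data below are hypothetical (they are not the figures used in these notes), and the function name ols_on_dummy is only illustrative. The sketch checks that b1 equals the mean of the base category (males) and b1 + b2 equals the mean of the other category (females).

```python
# Hypothetical data: annual food expenditure (Y) and a gender dummy (D).
# D = 0 for males (base category), D = 1 for females.
Y = [3200.0, 2900.0, 3500.0, 3100.0, 2600.0, 2800.0, 2500.0, 2700.0]
D = [0, 0, 0, 0, 1, 1, 1, 1]


def mean(values):
    return sum(values) / len(values)


def ols_on_dummy(Y, D):
    """Return (b1, b2) using the deviation-form OLS formulas."""
    y_bar, d_bar = mean(Y), mean(D)
    d = [Di - d_bar for Di in D]  # di = Di - D-bar
    y = [Yi - y_bar for Yi in Y]  # yi = Yi - Y-bar
    b2 = sum(di * yi for di, yi in zip(d, y)) / sum(di ** 2 for di in d)
    b1 = y_bar - b2 * d_bar
    return b1, b2


b1, b2 = ols_on_dummy(Y, D)
mean_male = mean([Yi for Yi, Di in zip(Y, D) if Di == 0])
mean_female = mean([Yi for Yi, Di in zip(Y, D) if Di == 1])

print("b1      =", b1, " male mean   =", mean_male)
print("b1 + b2 =", b1 + b2, " female mean =", mean_female)
```

In this sketch b2 is simply the difference between the two group means, which is why it is read as a differential intercept rather than a slope.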
* The category for which dummy takes the value 0 is known as the Benchmark/ Base/ Reference
Category. For this particular example, Male is the base category.
r2 = 0.1890
H0: B2 = 0
Failing to reject H0 means that statistically there is no significant difference between average
annual expenditure on food by males and females.
What if we change the base category (i.e. we swap the meaning of 0 and 1)?
Suppose now,
Di = 0, for females
Di = 1, for males
r2 = 0.1890
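Therefore, changing the base category does not affect the substance of the regression results: r2 is
unchanged, and only the intercept and the sign of the differential intercept change.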
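A small numerical check of this point, using hypothetical data (not the figures from these notes) and an illustrative helper named fit: the same Y is regressed first on a dummy with males as the base category and then on a dummy with females as the base category.

```python
# Hypothetical data: annual food expenditure for 4 males followed by 4 females.
Y = [3200.0, 2900.0, 3500.0, 3100.0, 2600.0, 2800.0, 2500.0, 2700.0]
D_base_male = [0, 0, 0, 0, 1, 1, 1, 1]        # 0 = male, 1 = female
D_base_female = [1 - d for d in D_base_male]  # 0 = female, 1 = male


def fit(Y, D):
    """Two-variable OLS of Y on D; returns (b1, b2, r2)."""
    n = len(Y)
    y_bar = sum(Y) / n
    d_bar = sum(D) / n
    s_dy = sum((d - d_bar) * (y - y_bar) for d, y in zip(D, Y))
    s_dd = sum((d - d_bar) ** 2 for d in D)
    s_yy = sum((y - y_bar) ** 2 for y in Y)
    b2 = s_dy / s_dd
    b1 = y_bar - b2 * d_bar
    r2 = s_dy ** 2 / (s_dd * s_yy)
    return b1, b2, r2


b1_m, b2_m, r2_m = fit(Y, D_base_male)
b1_f, b2_f, r2_f = fit(Y, D_base_female)

print("base = male  :", b1_m, b2_m, r2_m)
print("base = female:", b1_f, b2_f, r2_f)
```

The intercept becomes the mean of the new base category and the differential intercept changes sign, while r2 is identical in both runs.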
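What if we introduce two different dummy variables for the two categories, i.e. one for males and
one for females?
Yi = B1 + B2 D2i + B3 D3i + ui
This model cannot be estimated, because one of the assumptions of the CLRM is that there is no
perfect multicollinearity. For this model, however,
D2i + D3i = 1
for every observation, so the two dummies add up exactly to the intercept column. This situation is
known as the dummy variable trap.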
GENERAL RULE: If a model has a common intercept, B1, and if a qualitative variable has M categories
then introduce only M-1 dummy variables. {For each dummy variable, base category should remain
the same}
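The collinearity problem behind this rule can be seen numerically. The sketch below uses a hypothetical sample of six observations and assumes, purely for illustration, that D2 marks males and D3 marks females. It builds the cross-product matrix X'X for the regressors (intercept, D2, D3) and shows that its determinant is zero, so the normal equations have no unique solution. Dropping one dummy removes the problem.

```python
# Hypothetical coding (an assumption for illustration):
# D2 = 1 for males, D3 = 1 for females.
D2 = [1, 1, 1, 0, 0, 0]
D3 = [0, 0, 0, 1, 1, 1]


def cross_product(X):
    """Return X'X for a list of rows."""
    k = len(X[0])
    return [[sum(row[i] * row[j] for row in X) for j in range(k)]
            for i in range(k)]


def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def det3(m):
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


# Intercept plus BOTH dummies: the columns satisfy D2 + D3 = intercept.
X_trap = [[1, d2, d3] for d2, d3 in zip(D2, D3)]
det_trap = det3(cross_product(X_trap))

# Intercept plus M - 1 = 1 dummy: no exact linear dependence.
X_ok = [[1, d2] for d2 in D2]
det_ok = det2(cross_product(X_ok))

print("det(X'X) with both dummies:", det_trap)
print("det(X'X) with one dummy   :", det_ok)
```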
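Exercise: Suppose the dummy is rescaled so that Di’ = 2Di (i.e. it is coded 0 and 2 instead of 0 and 1).
Check how b2’ will change. What would happen to SE(b2’) and the t-ratio?
{HINT: Since Di’ = 2Di, this is the same as the change-of-scale questions we have done for Xi}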
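One way to check the answer to this exercise numerically is sketched below. The data are hypothetical and the helper slope_se_t is only an illustrative name. It fits the two-variable model once with D and once with D' = 2D, and reports the slope estimate, its standard error, and the t-ratio in each case.

```python
import math

# Hypothetical data: annual food expenditure and a 0/1 gender dummy.
Y = [3200.0, 2900.0, 3500.0, 3100.0, 2600.0, 2800.0, 2500.0, 2700.0]
D = [0, 0, 0, 0, 1, 1, 1, 1]
D_scaled = [2 * d for d in D]  # D' = 2D, i.e. coded 0 and 2


def slope_se_t(Y, X):
    """Two-variable OLS: return (b2, se(b2), t-ratio of b2)."""
    n = len(Y)
    y_bar = sum(Y) / n
    x_bar = sum(X) / n
    s_xx = sum((x - x_bar) ** 2 for x in X)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))
    b2 = s_xy / s_xx
    b1 = y_bar - b2 * x_bar
    rss = sum((y - b1 - b2 * x) ** 2 for x, y in zip(X, Y))
    sigma2_hat = rss / (n - 2)          # estimate of the error variance
    se_b2 = math.sqrt(sigma2_hat / s_xx)
    return b2, se_b2, b2 / se_b2


b2, se_b2, t = slope_se_t(Y, D)
b2_s, se_b2_s, t_s = slope_se_t(Y, D_scaled)

print("D  : b2 =", b2, " se =", se_b2, " t =", t)
print("D' : b2 =", b2_s, " se =", se_b2_s, " t =", t_s)
```

In this sketch the coefficient and its standard error are both halved when the dummy is doubled, so the t-ratio is unchanged.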
ANOVA regression models, although useful, are not so common in the field of economics. In most
economic research a regression model contains a combination of qualitative and quantitative
variables. Such regression models (containing both types of variables) are called
analysis-of-covariance (ANCOVA) models.
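Let
Yi = B1 + B2 Di + B3 Xi + ui (1.2)
where Xi is after-tax income and
Di = 1 for females
Di = 0 for males
The estimated model has
R2 = 0.9268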
For (1.2), B2 is statistically significant. This means that, once after-tax income is taken into
account, gender has an influence on food expenditure and there is a significant difference between
male and female expenditure on food.
As after-tax income increases, expenditure on food increases (which makes sense).
In model (1.1), we were committing a specification error, i.e. the omission of a relevant explanatory
variable.
B2: If we keep after-tax income constant, the mean food expenditure of females is less than
that of males by $228.98.
B3: If after-tax income increases by $1, mean food expenditure increases by about $0.06 (6 cents),
keeping the influence of gender constant.
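The additive model (1.2) can be sketched in Python without a regression library. The data are hypothetical, and the function name fit_additive is illustrative. Rather than inverting X'X, the sketch uses an equivalent computation: the common slope b3 is the pooled within-group slope, and each group's intercept then makes that group's residuals average to zero. The final lines verify that the resulting residuals satisfy the OLS normal equations.

```python
# Hypothetical data: X = after-tax income, Y = food expenditure,
# D = 0 for males, 1 for females.
X = [12000.0, 20000.0, 28000.0, 36000.0, 11000.0, 19000.0, 27000.0, 35000.0]
Y = [2100.0, 2700.0, 3300.0, 3800.0, 1800.0, 2300.0, 2900.0, 3500.0]
D = [0, 0, 0, 0, 1, 1, 1, 1]


def mean(values):
    return sum(values) / len(values)


def fit_additive(Y, D, X):
    """OLS for Yi = B1 + B2*Di + B3*Xi + ui via within-group deviations."""
    groups = {0: [], 1: []}
    for y, d, x in zip(Y, D, X):
        groups[d].append((x, y))

    num = den = 0.0
    group_means = {}
    for d, points in groups.items():
        x_bar = mean([x for x, _ in points])
        y_bar = mean([y for _, y in points])
        group_means[d] = (x_bar, y_bar)
        num += sum((x - x_bar) * (y - y_bar) for x, y in points)
        den += sum((x - x_bar) ** 2 for x, _ in points)

    b3 = num / den                                   # common slope
    b1 = group_means[0][1] - b3 * group_means[0][0]  # intercept, males
    b2 = (group_means[1][1] - b3 * group_means[1][0]) - b1  # differential intercept
    return b1, b2, b3


b1, b2, b3 = fit_additive(Y, D, X)
residuals = [y - (b1 + b2 * d + b3 * x) for y, d, x in zip(Y, D, X)]

print("b1 =", b1, " b2 =", b2, " b3 =", b3)
# Normal equations: residuals are orthogonal to the intercept, D and X.
print(sum(residuals),
      sum(d * e for d, e in zip(D, residuals)),
      sum(x * e for x, e in zip(X, residuals)))
```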
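For model (1.2), we can have two different regression equations for the two categories:
E(Yi / Di = 0, Xi ) = B1 + B3 Xi (males)
E(Yi / Di = 1, Xi ) = (B1 + B2) + B3 Xi (females)
The two lines have the same slope but different intercepts. To let the slope differ as well, we add
the interaction term Di Xi: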
Yi = B1 + B2 Di + B3 Xi + B4 (Di Xi ) + ui (1.3)
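For (1.3),
E(Yi / Di = 0, Xi ) = B1 + B3 Xi
E(Yi / Di = 1, Xi ) = (B1 + B2) + (B3 + B4) Xi
Here B2 is the differential intercept and B4 is the differential slope coefficient.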
Notice: When we add the dummy in additive form (as we did in model 1.2), we look at differences in
the intercepts of the two categories, and when we add the dummy in multiplicative / interactive
form (as in model 1.3), we look at differences in the slopes of the two categories.
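A minimal sketch of model (1.3), again on hypothetical data with illustrative names. Because (1.3) lets both the intercept and the slope differ by category, its OLS estimates can be recovered from two separate two-variable regressions, one per category: B1 and B3 come from the base category, and B2 and B4 are the differences in intercept and slope. The last lines confirm that the residuals satisfy the normal equations of (1.3).

```python
# Hypothetical data: X = after-tax income, Y = food expenditure,
# D = 0 for males, 1 for females.
X = [12000.0, 20000.0, 28000.0, 36000.0, 11000.0, 19000.0, 27000.0, 35000.0]
Y = [2100.0, 2700.0, 3300.0, 3800.0, 1800.0, 2300.0, 2900.0, 3500.0]
D = [0, 0, 0, 0, 1, 1, 1, 1]


def simple_ols(points):
    """Intercept and slope of a two-variable regression on (x, y) pairs."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in points)
             / sum((x - x_bar) ** 2 for x, _ in points))
    return y_bar - slope * x_bar, slope


males = [(x, y) for x, y, d in zip(X, Y, D) if d == 0]
females = [(x, y) for x, y, d in zip(X, Y, D) if d == 1]

a_m, s_m = simple_ols(males)    # base category: intercept B1, slope B3
a_f, s_f = simple_ols(females)  # intercept B1 + B2, slope B3 + B4

b1, b3 = a_m, s_m
b2 = a_f - a_m  # differential intercept
b4 = s_f - s_m  # differential slope

residuals = [y - (b1 + b2 * d + b3 * x + b4 * d * x)
             for y, d, x in zip(Y, D, X)]

print("b1 =", b1, " b2 =", b2, " b3 =", b3, " b4 =", b4)
# Residuals are orthogonal to the intercept, D, X and D*X.
print(sum(residuals),
      sum(d * e for d, e in zip(D, residuals)),
      sum(x * e for x, e in zip(X, residuals)),
      sum(d * x * e for d, x, e in zip(D, X, residuals)))
```

Testing whether b2 and b4 are significant still requires their standard errors from the pooled regression, which this sketch does not compute.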
R2 = 0.93
Di is statistically insignificant
R2 has increased only marginally (the small increase is due to the addition of an explanatory
variable)
CONCLUSION: Model 1.2 is better than Model 1.3. We are committing a specification error in
model 1.3, i.e. the inclusion of an unnecessary variable.
Therefore, model 1.2 seems to be the most relevant among the three models discussed so
far.
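To summarize, testing B2 and B4 in model (1.3) can lead to four outcomes:
1) H0 : B2 = 0 Reject
H0 : B4 = 0 Reject
(the two categories differ in both intercept and slope)
2) H0 : B2 = 0 Reject
H0 : B4 = 0 Don’t Reject
(different intercepts, same slope)
3) H0 : B2 = 0 Don’t Reject
H0 : B4 = 0 Reject
(same intercept, different slopes)
4) H0 : B2 = 0 Don’t Reject
H0 : B4 = 0 Don’t Reject
(the two categories share the same regression line)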