Chapter 7, Dummy Variable
1. A dummy variable takes only the values 1 and 0. The numbers 1 and 0 have no numerical
(quantitative) meaning; they are labels that represent groups. In short, a
dummy variable is categorical (qualitative).
(a) For instance, we may have a sample (or population) that includes both females
and males. Then a dummy variable can be defined as D = 1 for female and D = 0
for male. Such a dummy variable divides the sample into two subsamples (or two
sub-populations): one for females and one for males.
(b) A dummy variable follows a Bernoulli distribution, which is characterized
by the parameter p:
D = 1, with probability p
D = 0, with probability 1 − p (1)
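2. Consider the regression of Y on the dummy variable D
Y = β0 + β1 D + u (2)
Assuming E(u|D) = 0, taking the expectation of Y conditional on D gives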
E(Y |D = 0) = β0 (4)
E(Y |D = 1) = β0 + β1 (5)
and
β0 = E(Y |D = 0) (6)
β1 = E(Y |D = 1) − E(Y |D = 0) (7)
3. The sample mean is the estimate of the population mean, so the estimated coefficients
in (2) have the following interpretation:
β̂0 = ȳD=0
β̂1 = ȳD=1 − ȳD=0
where ȳD=0 denotes the average Y in the sub-sample for which D = 0, and ȳD=1 denotes
the average Y in the sub-sample for which D = 1. Equation (2) provides a simple way
to carry out a comparison-of-means test (or two-sample t test) between the two groups.
The null hypothesis of the two-sample t test says that there is no difference between the two
groups:
H0 : β1 = 0
This hypothesis is rejected at the 5% level when the p-value for β̂1 is less than 0.05.
4. For example, let Y be wage, with D = 1 for female and D = 0 for male. Then consider
the regression
wage = β0 + β1 D + u,
where β̂0 is the average male wage, and β̂1 equals the average female wage minus
the average male wage. The two average wages are significantly different if β̂1 is significant.
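The link between regression (2) and the two-sample t test can be checked directly in Stata. The following is a minimal sketch, assuming the data file 311_wage1.dta from the example section (with variables wage and female) is in the working directory. It is an illustration, not part of the course do file.

```stata
* Sketch: regression on a dummy versus the two-sample t test
* Assumes 311_wage1.dta (variables wage, female) is in the working directory
use 311_wage1.dta, clear

* regression (2): the slope on female is the difference in average wages
regress wage female

* classical two-sample t test with equal variances (Stata's default)
ttest wage, by(female)
```

Stata's ttest reports the difference as mean(group 0) minus mean(group 1), so its t statistic has the opposite sign to the t statistic on female in the regression, while the magnitude and the two-sided p-value are the same.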
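Now add a second regressor X to the dummy-variable regression:
Y = β0 + β1 D + β2 X + u (10)
Assuming E(u|D, X) = 0, it follows that
E(Y |D = 0, X) = β0 + β2 X
E(Y |D = 1, X) = β0 + β1 + β2 X
β1 = E(Y |D = 1, X) − E(Y |D = 0, X)
so β1 measures the difference in mean Y between the two groups, holding X constant (or given
the same level of X). For instance, if X is edu(cation), in the regression
wage = β0 + β1 D + β2 edu + u,
β1 equals the average female wage minus the average male wage, given the same level of
education.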
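To see how holding education constant changes the estimated gap, the two regressions can be run side by side. This is a small sketch under the assumption that 311_wage1.dta is in the working directory; the education variable in that file is named educ.

```stata
* Sketch: gender gap with and without holding education constant
* Assumes 311_wage1.dta (variables wage, female, educ) is in the working directory
use 311_wage1.dta, clear

* raw difference in average wages, as in regression (2)
regress wage female
display "raw gap: " _b[female]

* difference in average wages at the same level of educ, as in regression (10)
regress wage female educ
display "gap given the same educ: " _b[female]
```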
7. From chapter 6 we know that an interaction term can be used to allow the marginal effect of X to
depend on another regressor. The regression with both the dummy and the interaction term
of the dummy and X is
Y = β0 + β1 D + β2 X + β3 (X ∗ D) + u (16)
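8. Note that regression (16) contains the same information as two separate regressions
of Y on X, one using the subsample D = 0 and one using the subsample D = 1. Setting D = 0
in (16) gives intercept β0 and slope β2; setting D = 1 gives intercept β0 + β1 and slope β2 + β3.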
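The claim that regression (16) carries the same information as two group-wise regressions can be verified numerically. The sketch below assumes 311_wage1.dta is in the working directory; the variable name fe for the interaction term follows the course do file.

```stata
* Sketch: pooled regression (16) versus separate regressions by group
* Assumes 311_wage1.dta (variables wage, female, educ) is in the working directory
use 311_wage1.dta, clear

* interaction term of the dummy and educ
gen fe = female*educ
regress wage female educ fe

* implied intercept and slope for the D = 1 (female) group
lincom _cons + female
lincom educ + fe

* separate regressions for D = 0 and D = 1, for comparison
bysort female: regress wage educ
```

The two lincom estimates should match the intercept and slope of the female regression, and _cons and educ from the pooled regression should match the male regression. The standard errors generally differ, because the pooled regression uses a single error variance for both groups.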
10. Suppose we have two subsamples, one for females and one for males. We want to estimate
the effect of education on wage. We have two options. Option 1 is to run two separate
regressions, one for females and one for males. Option 2 is to pool (merge) the two
subsamples and run just one regression. Which option is better?
(a) Essentially, this problem is about whether the relationship between education and
wage depends on gender.
(b) To answer this question, we pool the two subsamples and run regression (16).
The point is that we need to use the dummy variable and the interaction term. The null
hypothesis is that gender does not matter, so
β1 = β3 = 0 (18)
We can use an F test (called the Chow test in this context) for this hypothesis.
i. If the p-value is less than 0.05, H0 is rejected, so gender matters. We need to keep
the dummy and interaction term in (16). That means running two separate
regressions, one for females and one for males, is the better idea.
ii. If the p-value is greater than 0.05, H0 is not rejected, so there is no evidence that
gender matters. We can drop the dummy and interaction term from (16). That means
running one regression of Y on X using both subsamples is the better idea.
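The F statistic behind the Chow test can also be computed by hand from the restricted and unrestricted sums of squared residuals, F = [(SSR_r − SSR_ur)/2] / [SSR_ur/df_ur], where 2 is the number of restrictions in (18). The sketch below assumes 311_wage1.dta is in the working directory; the scalar names are made up for this illustration.

```stata
* Sketch: Chow test computed by hand and with the test command
* Assumes 311_wage1.dta (variables wage, female, educ) is in the working directory
use 311_wage1.dta, clear
gen fe = female*educ

* restricted model: gender does not matter (beta1 = beta3 = 0)
quietly regress wage educ
scalar ssr_r = e(rss)

* unrestricted model: regression (16)
quietly regress wage female educ fe
scalar ssr_ur = e(rss)
scalar df_ur = e(df_r)

* F statistic with 2 restrictions, and its p-value
scalar F_chow = ((ssr_r - ssr_ur)/2) / (ssr_ur/df_ur)
display "F = " F_chow
display "p-value = " Ftail(2, df_ur, F_chow)

* built-in version; uses the most recent regression, which is (16)
test female fe
```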
11. What if we have information about gender and marital status? Option one is to define
two dummy variables as
D1 = 1 if female, D1 = 0 if male (19)
D2 = 1 if married, D2 = 0 if unmarried (20)
and use them to run the regression
Y = β0 + β1 D1 + β2 D2 + u (21)
For this regression we can show
E(Y ) = β0, if D1 = 0, D2 = 0
E(Y ) = β0 + β1, if D1 = 1, D2 = 0
E(Y ) = β0 + β2, if D1 = 0, D2 = 1
E(Y ) = β0 + β1 + β2, if D1 = 1, D2 = 1
12. In order to relax the no-interaction restriction, we can define four dummy variables
(because we have four groups of people) as
E1 = 1 if female and married, 0 otherwise
E2 = 1 if female and unmarried, 0 otherwise
E3 = 1 if male and married, 0 otherwise
E4 = 1 if male and unmarried, 0 otherwise
and run a regression using only three of them
Y = β0 + β1 E1 + β2 E2 + β3 E3 + u (23)
13. Exercise: Please show that regression (23) does not impose the no-interaction restriction.
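One way to look at this numerically, as a complement to the algebra, is to estimate regression (23) and compare the marriage gap for females with the marriage gap for males. The sketch below assumes 311_wage1.dta is in the working directory; the lower-case names e1, e2, e3 are chosen here to mirror E1, E2, E3.

```stata
* Sketch: group dummies as in regression (23); male and unmarried is the base group
* Assumes 311_wage1.dta (variables wage, female, married) is in the working directory
use 311_wage1.dta, clear
gen e1 = (female == 1 & married == 1)
gen e2 = (female == 1 & married == 0)
gen e3 = (female == 0 & married == 1)
regress wage e1 e2 e3

* marriage gap for females (e1 - e2) minus marriage gap for males (e3)
lincom e1 - e2 - e3

* regression (21) for comparison: one marriage gap shared by both genders
regress wage female married
```

If the lincom estimate differs from zero, the marriage gap is not the same for the two genders, which regression (21) cannot capture.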
Note that X has no numerical meaning, so it is qualitative. The numbers 1, 2 and 3 are used
here only to label three categories; the number 2 does not mean twice as much as 1. Because the
variable is qualitative, we need to translate it into a set of dummy variables
F1 = 1 if using bus, 0 otherwise
F2 = 1 if using subway, 0 otherwise
F3 = 1 if driving car, 0 otherwise
When running the regression, we do not use X (since it has no numerical meaning). Instead
we use two of the three dummy variables defined above.
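In Stata, the translation of a categorical variable into dummies can be done with tabulate and its generate() option, or with factor-variable notation. The sketch below uses simulated data. The coding 1 = bus, 2 = subway, 3 = car, the outcome y, and all numbers in it are assumptions made up for illustration.

```stata
* Sketch with simulated data: a three-category variable turned into dummies
* Hypothetical coding: x = 1 (bus), 2 (subway), 3 (car); y is a made-up outcome
clear
set seed 12345
set obs 300
gen x = ceil(3*runiform())
gen y = 10 + 2*(x == 2) + 5*(x == 3) + rnormal()

* creates the dummies f1, f2, f3, one per category of x
tabulate x, generate(f)

* use two of the three dummies; the omitted category (x = 1) is the base group
regress y f2 f3

* the same regression using factor-variable notation
regress y i.x
```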
For an ordinal variable we only know the ranking. The numbers have no numerical meaning.
In fact, we can replace the number 3 with any number greater than 2 (to maintain the
ordering). Because an ordinal variable is qualitative, we need to translate it into a set of
dummy variables. We cannot use an ordinal variable directly in a regression.
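The point about relabeling can be illustrated with simulated data: recoding the top category from 3 to 10 keeps the ordering but changes the regression that uses the variable directly, while the dummy-variable regression has the same fit under either coding. The variable names and numbers below are hypothetical.

```stata
* Sketch with simulated data: an ordinal variable under two order-preserving codings
clear
set seed 2024
set obs 300
gen grade = ceil(3*runiform())
gen y = 1 + 0.5*(grade == 2) + 3*(grade == 3) + rnormal()

* relabel the top category: 10 instead of 3 keeps the ordering
gen grade_alt = grade
replace grade_alt = 10 if grade == 3

* using the codes directly: the results depend on the arbitrary labels
regress y grade
regress y grade_alt

* using dummies: the coefficients and fit are the same under either coding
regress y i.grade
regress y i.grade_alt
```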
Example: Chapter 7
1. We use the data file 311_wage1.dta, downloadable at my webpage. See example 7.1 in
the textbook for details.
2. For the first observation, wage = 3.1, educ = 11, female = 1 (so the person is female), and
married = 0 (so she is unmarried). The variables female and married are both dummy variables, for
which the values 1 and 0 have no quantitative meaning.
3. The command tab tabulates the proportion (probability) of each value of a dummy variable. In this
case 52.09 percent of the observations are male (female=0), and 47.91 percent are female.
4. Next we run regression (2), i.e., we regress wage on the dummy variable female. The estimated
intercept β̂0 = ȳD=0 = 7.099489 is the average male wage. The estimated slope
β̂1 = ȳD=1 − ȳD=0 = −2.51183 is the average female wage minus the average male wage. In
this example females earn less than males on average since β̂1 is negative. The p-value for β̂1 is
less than 0.05, so we reject the null hypothesis that the average female wage equals the average male wage. In
other words, the two average wages differ significantly.
5. Alternatively, we can summarize wage separately for females and males. The commands
are
sort female
by female: sum wage
On average a male earns 7.099489, and a female earns 4.587659. The difference is
4.587659 − 7.099489 = −2.51183, which is the same as β̂1 reported by regression (2).
This finding confirms that β̂1 = ȳD=1 − ȳD=0.
6. Next we run regression (16), where the interaction term is fe = female*educ. The command
is reg wage female educ fe.
(a) The estimated intercept is β̂0 = .2004963. It measures the average male wage
when educ = 0.
(b) β̂1 = −1.198523. It measures the average female wage when educ = 0 minus
the average male wage when educ = 0. In other words, when educ = 0, the predicted average female wage is
.2004963 + (−1.198523) = −.9980267. This number is not very meaningful since
in this sample no female has zero education (two males have zero educ, and you
can see them using the command list if educ==0).
(c) β̂2 = .539476. So the average male wage rises by .539476 when educ rises by 1 unit.
(d) β̂3 = −.085999. So the average female wage rises by .539476 + (−.085999) = .453477 when
educ rises by 1 unit.
(e) The null hypothesis that the relationship between wage and educ does not depend
on gender (i.e., there is NO difference in regression functions between females and males)
can be formulated as
H0 : β1 = β3 = 0.
The F test for a difference in regression functions across groups is called the Chow test.
The Stata command to conduct the Chow test is test female fe. The output shows
F = 33.51, p-value < 0.05, so we reject the null hypothesis. That means there
IS a difference in regression functions between females and males. In other words, the
relationship between wage and educ depends on gender.
(f) Note that β̂1 and β̂3 are individually insignificant (the p-values are 0.366 and 0.407,
respectively), whereas the Chow test indicates that they are jointly significant.
The lesson is that focusing only on individual coefficients can be misleading.
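Because the gap at educ = 0 in (b) is not very meaningful, it can help to evaluate the female-male gap at a level of education that actually occurs in the sample. Under regression (16) the gap at a given educ is β1 + β3·educ. The sketch below assumes 311_wage1.dta is in the working directory; the values 12 and 16 are illustrative choices, not taken from the notes.

```stata
* Sketch: female-male gap in predicted wage at chosen education levels
* Assumes 311_wage1.dta (variables wage, female, educ) is in the working directory
use 311_wage1.dta, clear
gen fe = female*educ
regress wage female educ fe

* gap at educ = 12: beta1 + 12*beta3, with its standard error and p-value
lincom female + 12*fe

* gap at educ = 16: beta1 + 16*beta3
lincom female + 16*fe
```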
7. Because the relationship between wage and educ depends on gender, we can run two
separate (group-wise) regressions, one using females and one using males. The Stata
command is by female: reg wage educ. The coefficients in the male regression
are the same as β̂0 and β̂2 reported by the pooled regression (16). The female
results can also be derived from the pooled regression (16). In other words, the female
intercept equals β̂0 + β̂1 and the female slope equals β̂2 + β̂3.
The pooled regression (16) has one big advantage over group-wise regressions: we can
run the Chow test based on (16).
8. Finally, we show how to define a set of dummy variables to represent multiple
categories of gender and marital status. In theory we should define four dummies since
there are four groups. But, aware of the dummy variable trap, we define only three. The
group for which we do not define a dummy is the base group. In this example, the base
group is unmarried males. The three dummy variables are
D1 = 1 if male and married, 0 otherwise
D2 = 1 if female and unmarried, 0 otherwise
D3 = 1 if female and married, 0 otherwise
and the regression is
wage = β0 + β1 D1 + β2 D2 + β3 D3 + u
(a) β̂0 = 5.168023. It measures the average wage for unmarried male, the base group.
(b) β̂1 = 2.815009, so on average a married male earns 2.815009 more than an unmarried male.
(So marriage enhances a male’s market value.)
(c) β̂2 = −.5564399, so on average an unmarried female earns .5564399 less than an unmarried
male. (So there is discrimination against females.)
(d) β̂3 = −.6021142, so on average a married female earns .6021142 less than an unmarried male.
(e) Because β̂3 − β̂2 = −.6021142 − (−.5564399) < 0, marriage decreases a female’s
market value.
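The interpretation of the four coefficients can be cross-checked against the average wage in each of the four groups. The sketch below assumes 311_wage1.dta is in the working directory and defines d1, d2, d3 as in the course do file.

```stata
* Sketch: group means implied by the regression on d1, d2, d3
* Assumes 311_wage1.dta (variables wage, female, married) is in the working directory
use 311_wage1.dta, clear
gen d1 = (female == 0 & married == 1)
gen d2 = (female == 1 & married == 0)
gen d3 = (female == 1 & married == 1)
regress wage d1 d2 d3

* average wage implied for each group
display "unmarried male:   " _b[_cons]
display "married male:     " _b[_cons] + _b[d1]
display "unmarried female: " _b[_cons] + _b[d2]
display "married female:   " _b[_cons] + _b[d3]

* average wage in each gender-by-marital-status cell, for comparison
tabulate female married, summarize(wage) nostandard nofreq
```

The four displayed values should coincide with the four cell means in the table, since the regression contains one parameter per group.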
9. Exercise: Show that a female is discriminated against more when she is married than when
she is unmarried. Hint: compute β̂3 − β̂1.
Do File
* Do file for dummy variable (chapter 7)
set more off
clear
capture log close
cd "I:\311"
log using 311log.txt, text replace
use 311_wage1.dta, clear
* show first 5 observations
list wage educ female married in 1/5
* tabulate female
tab female
* run regression using dummy
reg wage female
* compare the means for male and female
sort female
by female: sum wage
* run regression using dummy and interaction term
gen fe = female*educ
reg wage female educ fe
* chow test
test female fe
* run separate regressions for male and female
by female: reg wage educ
* multiple category
gen d1 = 0
replace d1 = 1 if female == 0 & married ==1
gen d2 = 0
replace d2 = 1 if female == 1 & married ==0
gen d3 = 0
replace d3 = 1 if female == 1 & married ==1
reg wage d1 d2 d3
log close