Chapter 10: Econometrics - Dummy Variable Model
In general, the explanatory variables in a regression analysis are assumed to be quantitative in nature. For example, variables like temperature, distance and age are quantitative in the sense that they are recorded on a well-defined scale. In many applications, however, the variables cannot be measured on a well-defined scale and are qualitative in nature.
For example, variables like sex (male or female), colour (black or white), nationality, and employment status (employed or unemployed) are defined on a nominal scale and have no natural scale of measurement. Such variables usually indicate the presence or absence of a "quality" or attribute, e.g., employed or unemployed, graduate or non-graduate, smoker or non-smoker, yes or no, acceptance or rejection. They can be quantified by artificially constructing variables that take the values 1 and 0, where "1" usually indicates the presence of the attribute and "0" its absence. For example, "1" may indicate that a person is male and "0" that the person is female; similarly, "1" may indicate that a person is employed and "0" that the person is unemployed.
Such variables classify the data into mutually exclusive categories. They are called indicator variables or dummy variables.
Usually, a dummy variable takes the values 0 and 1 to identify the mutually exclusive classes of the explanatory variable. For example,

D = 1 if the person is male,
    0 if the person is female;

D = 1 if the person is employed,
    0 if the person is unemployed.
Here we use the notation D in place of X to denote a dummy variable. The choice of 1 and 0 to identify a category is arbitrary; for example, in the examples above one could equally set D = 1 for a female and D = 0 for a male.
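Constructing such 0/1 indicators is straightforward in code. A minimal sketch in plain Python (the labels and the helper name are illustrative):

```python
# Encode a qualitative variable as a 0/1 dummy variable.
def make_dummy(values, present_label):
    """Return 1 where the attribute equals present_label, else 0."""
    return [1 if v == present_label else 0 for v in values]

sex = ["male", "female", "female", "male"]
D = make_dummy(sex, "male")
print(D)  # [1, 0, 0, 1]
```

Swapping the roles of the categories, `make_dummy(sex, "female")`, simply interchanges the 0s and 1s, which illustrates that the choice of coding is arbitrary.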
In a given regression model, qualitative and quantitative variables may also occur together, i.e., some variables may be qualitative and others quantitative. Such models can be handled within the framework of regression analysis, and the usual tools of regression analysis apply in the case of dummy variables.
Example:
Consider the following model with x1 as a quantitative variable and D2 as a dummy variable:

y = β0 + β1 x1 + β2 D2 + ε,  E(ε) = 0, Var(ε) = σ²,

D2 = 0 if the observation belongs to group A,
     1 if the observation belongs to group B.
The interpretation of the results is important. We proceed as follows.

If D2 = 0, then
y = β0 + β1 x1 + β2·0 + ε = β0 + β1 x1 + ε,
E(y | D2 = 0) = β0 + β1 x1.

If D2 = 1, then
y = β0 + β1 x1 + β2·1 + ε = (β0 + β2) + β1 x1 + ε,
E(y | D2 = 1) = (β0 + β2) + β1 x1.

The quantities E(y | D2 = 0) and E(y | D2 = 1) are the average responses when an observation belongs to group A and group B, respectively. Thus

β2 = E(y | D2 = 1) − E(y | D2 = 0)

measures the difference between the average responses of the two groups.
Graphically, this appears as in the following figure: two parallel regression lines with the same variance σ².
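The parallel-lines model can be checked numerically. Below is a minimal pure-Python least squares sketch on noise-free illustrative data, so the fitted coefficients recover the chosen values (β0 = 2, β1 = 0.5, β2 = 3) exactly; all numbers are made up for illustration:

```python
# Fit y = b0 + b1*x1 + b2*D2 by ordinary least squares.
def solve(A, b):
    """Solve A m = b by Gauss-Jordan elimination (A small, well conditioned)."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * v for a, v in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def ols(X, y):
    """Least squares via the normal equations X'X b = X'y."""
    k = len(X[0])
    XtX = [[sum(row[j] * row[m] for row in X) for m in range(k)] for j in range(k)]
    Xty = [sum(row[j] * yi for row, yi in zip(X, y)) for j in range(k)]
    return solve(XtX, Xty)

# Group A (D2 = 0) and group B (D2 = 1): same slope, shifted intercept.
x1 = [1, 2, 3, 4, 1, 2, 3, 4]
D2 = [0, 0, 0, 0, 1, 1, 1, 1]
y = [2 + 0.5 * x + 3 * d for x, d in zip(x1, D2)]   # b0=2, b1=0.5, b2=3
X = [[1, x, d] for x, d in zip(x1, D2)]
b0, b1, b2 = ols(X, y)
print(round(b0, 6), round(b1, 6), round(b2, 6))  # 2.0 0.5 3.0
```

Here b2 is exactly the vertical distance between the two parallel lines, matching the interpretation of β2 above.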
If there are three explanatory variables in the model with two dummy variables D2 and D3, then they describe three levels, e.g., groups A, B and C. The levels of the dummy variables are as follows:
1. D2 = 0, D3 = 0 if the observation is from group A,
2. D2 = 1, D3 = 0 if the observation is from group B,
3. D2 = 0, D3 = 1 if the observation is from group C.
The corresponding regression model is

y = β0 + β1 x1 + β2 D2 + β3 D3 + ε,  E(ε) = 0, Var(ε) = σ².
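The three-level coding above can be sketched as a small mapping, with group A as the base category (labels are illustrative):

```python
# Encode a three-level factor (groups A, B, C) with two dummy variables.
def two_dummies(group):
    """Map 'A'/'B'/'C' to (D2, D3), with A as the base category."""
    return {"A": (0, 0), "B": (1, 0), "C": (0, 1)}[group]

groups = ["A", "B", "C", "B"]
print([two_dummies(g) for g in groups])
# [(0, 0), (1, 0), (0, 1), (1, 0)]
```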
Consider the following examples to understand how to define such dummy variables and how they can be
handled.
Example:
Suppose y denotes the monthly salary of a person and D denotes whether the person is a graduate or a non-graduate. The model is

y = β0 + β1 D + ε,  E(ε) = 0, Var(ε) = σ².
With n observations, the model is

yi = β0 + β1 Di + εi,  i = 1, 2, ..., n,

E(yi | Di = 0) = β0,
E(yi | Di = 1) = β0 + β1,
β1 = E(yi | Di = 1) − E(yi | Di = 0).
Thus
- β0 measures the mean salary of a non-graduate;
- β1 measures the difference in the mean salaries of a graduate and a non-graduate person.
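For this simple model, the least squares estimates are just the group means, so they can be computed directly. A sketch with illustrative salary figures:

```python
# OLS in the model y = b0 + b1*D reduces to group means:
# b0_hat = mean salary of non-graduates, b1_hat = difference of means.
def group_mean(y, D, level):
    vals = [yi for yi, di in zip(y, D) if di == level]
    return sum(vals) / len(vals)

salary = [300, 320, 310, 500, 520, 510]   # illustrative monthly salaries
grad   = [0,   0,   0,   1,   1,   1]     # D = 1 for graduate

b0_hat = group_mean(salary, grad, 0)              # mean non-graduate salary
b1_hat = group_mean(salary, grad, 1) - b0_hat     # graduate premium
print(b0_hat, b1_hat)  # 310.0 200.0
```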
Now consider the same model with two dummy variables defined in the following way:

Di1 = 1 if the person is a graduate,
      0 if the person is a non-graduate;

Di2 = 1 if the person is a non-graduate,
      0 if the person is a graduate.
The model with n observations is

yi = β0 + β1 Di1 + β2 Di2 + εi,  E(εi) = 0, Var(εi) = σ², i = 1, 2, ..., n.

Then we have
1. E(yi | Di1 = 1, Di2 = 0) = β0 + β1 : average salary of a graduate,
2. E(yi | Di1 = 0, Di2 = 1) = β0 + β2 : average salary of a non-graduate.

Notice that Di1 + Di2 = 1 for every person, since each person is either a graduate or a non-graduate; the two dummy columns therefore add up exactly to the intercept column.
So exact multicollinearity is present in such cases: the rank of the matrix of explanatory variables falls short by 1, β0, β1 and β2 are indeterminate, and the least squares method breaks down. Introducing a dummy variable for each category may look natural, but together with the intercept it leads to these serious consequences. This is known as the dummy variable trap.
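The trap can be seen numerically: with an intercept plus a dummy for each of the two categories, the columns of X satisfy 1 = Di1 + Di2, so X'X is singular and the normal equations have no unique solution. A small sketch with illustrative data:

```python
# Demonstrate the dummy variable trap: X'X is singular when the model
# contains an intercept and one dummy per category.
def det3(M):
    """Determinant of a 3x3 matrix by cofactor expansion."""
    a, b, c = M[0]; d, e, f = M[1]; g, h, i = M[2]
    return a*(e*i - f*h) - b*(d*i - f*g) + c*(d*h - e*g)

D1 = [1, 1, 0, 0]                       # graduate
D2 = [0, 0, 1, 1]                       # non-graduate (= 1 - D1)
X = [[1, d1, d2] for d1, d2 in zip(D1, D2)]

XtX = [[sum(row[j]*row[k] for row in X) for k in range(3)] for j in range(3)]
print(det3(XtX))  # 0 -> X'X is singular, least squares breaks down
```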
When the intercept term is dropped, however, β1 and β2 have proper interpretations as the average salaries of a graduate and a non-graduate person, respectively. The parameters can then be estimated using the ordinary least squares principle, and standard procedures for drawing inferences can be used.

Rule: When the explanatory variable leads to a classification into m mutually exclusive categories, use (m − 1) dummy variables for its representation. Alternatively, use m dummy variables and drop the intercept term.
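Under the second option in the rule (m dummies, no intercept), the dummy columns are orthogonal, so each OLS coefficient reduces to the corresponding group's mean response: the coefficient for dummy j is (Σ Dij yi) / (Σ Dij²). A minimal sketch with illustrative salaries:

```python
# With no intercept and one dummy per category, each OLS coefficient
# equals its group's mean response (the dummy columns are orthogonal).
salary = [300, 320, 310, 500, 520, 510]   # illustrative monthly salaries
D1 = [0, 0, 0, 1, 1, 1]                   # graduate
D2 = [1, 1, 1, 0, 0, 0]                   # non-graduate

b1 = sum(d*y for d, y in zip(D1, salary)) / sum(d*d for d in D1)
b2 = sum(d*y for d, y in zip(D2, salary)) / sum(d*d for d in D2)
print(b1, b2)  # 510.0 310.0 -> mean graduate and non-graduate salaries
```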
Consider now a model in which the dummy variable also interacts with the quantitative variable:

yi = β0 + β1 xi1 + β2 Di2 + β3 xi1 Di2 + εi,

where yi is the salary of the ith person, xi1 is a quantitative variable, and Di2 indicates the group. Then

E(yi | Di2 = 0) = β0 + β1 xi1 + β2·0 + β3 xi1·0 = β0 + β1 xi1,
E(yi | Di2 = 1) = β0 + β1 xi1 + β2·1 + β3 xi1·1 = (β0 + β2) + (β1 + β3) xi1.
Thus
- β2 reflects the change in the intercept term associated with the change in the group of the person, i.e., when the group changes from A to B;
- β3 reflects the change in the slope associated with the change in the group of the person, i.e., when the group changes from A to B.
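Because the interaction model fits a separate line within each group, β2 and β3 can be recovered by fitting the two group lines and differencing their intercepts and slopes. A sketch on noise-free illustrative data:

```python
# Recover the intercept shift (beta2) and slope shift (beta3) by fitting
# a simple regression line within each group and differencing.
def line_fit(xs, ys):
    """Closed-form simple-regression intercept and slope."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    slope = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
             / sum((x - xbar) ** 2 for x in xs))
    return ybar - slope * xbar, slope

x = [1, 2, 3, 4]
yA = [2 + 0.5 * xi for xi in x]     # group A: intercept 2, slope 0.5
yB = [5 + 1.5 * xi for xi in x]     # group B: intercept 5, slope 1.5

a0, a1 = line_fit(x, yA)
c0, c1 = line_fit(x, yB)
b2 = c0 - a0   # intercept change -> beta2
b3 = c1 - a1   # slope change     -> beta3
print(b2, b3)  # 3.0 1.0
```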
Tests of hypotheses become convenient with dummy variables. For example, if we want to test whether the two regression models are identical, the hypotheses are

H0: β2 = β3 = 0,
H1: β2 ≠ 0 and/or β3 ≠ 0.

Acceptance of H0 indicates that a single model is sufficient to explain the relationship.
In another example, if the objective is to test that the two models differ with respect to their intercepts only and have the same slope, then the hypotheses are

H0: β3 = 0,
H1: β3 ≠ 0.
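The first test above can be carried out with a partial F statistic comparing the full model (separate lines per group, which is what the interaction model fits) against the reduced model (one pooled line). A sketch on illustrative data, with the statistic compared informally against the 5% critical value of the F(2, 4) distribution (about 6.94):

```python
# Partial F test of H0: beta2 = beta3 = 0 via residual sums of squares.
# Full model = a separate line in each group; reduced model = pooled line.
def line_fit(xs, ys):
    """Closed-form simple-regression intercept and slope."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    slope = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
             / sum((x - xbar) ** 2 for x in xs))
    return ybar - slope * xbar, slope

def sse(xs, ys):
    """Residual sum of squares of a fitted simple-regression line."""
    b0, b1 = line_fit(xs, ys)
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

xA, yA = [1, 2, 3, 4], [2.1, 2.9, 3.6, 4.1]       # group A (illustrative)
xB, yB = [1, 2, 3, 4], [6.4, 8.1, 9.4, 11.2]      # group B (illustrative)

sse_full = sse(xA, yA) + sse(xB, yB)   # 4 parameters fitted
sse_red  = sse(xA + xB, yA + yB)       # 2 parameters fitted
n, q = 8, 2                            # q = number of restrictions
F = ((sse_red - sse_full) / q) / (sse_full / (n - 4))
print(F > 6.94)  # True -> reject H0: the two regressions differ
```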
A dummy variable can also be used to represent a grouped quantitative variable such as age. Since it is difficult to collect data on individual ages, this helps in easy collection of data. A disadvantage is that some loss of information occurs. For example, if the ages in years are 2, 3, 4, 5, 6, 7 and the dummy variable is defined as

Di = 1 if the age of the ith person is 5 years or more,
     0 if the age of the ith person is less than 5 years,

then these values become 0, 0, 0, 1, 1, 1. Now, looking at the value 1, one cannot determine whether it corresponds to age 5, 6 or 7 years.
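The loss of information is easy to see in code: distinct ages collapse onto the same dummy value, so the mapping cannot be inverted. A minimal sketch with the ages above:

```python
# Grouping a quantitative variable (age) into a dummy loses information.
ages = [2, 3, 4, 5, 6, 7]
D = [1 if a >= 5 else 0 for a in ages]
print(D)  # [0, 0, 0, 1, 1, 1]
# From D = 1 alone, ages 5, 6 and 7 are indistinguishable.
```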
Moreover, if a quantitative explanatory variable is grouped into m categories, then (m − 1) parameters are required, whereas if the original variable is used as such, only one parameter is required. Treating a quantitative variable as qualitative thus increases the complexity of the model. The degrees of freedom for error are also reduced, which can affect the inferences if the data set is small; in large data sets, such effects may be negligible.
The use of dummy variables does not require any assumption about the functional form of the relationship between the study and explanatory variables.