Chapter 1: Dummy Variable Regression
In previous chapters, both the dependent and independent variables in our multiple regression
models were quantitative in nature (e.g., hourly wage rate, years of education, GDP, prices,
and costs). However, some variables are essentially qualitative, or nominal scale, in nature, such
as sex, race, color, religion, the industry of a firm (manufacturing, retail, etc.), and the regions of
Ethiopia. For example, holding all other factors constant, female workers are found to earn less
than their male counterparts.
Since such variables usually indicate the presence or absence of a “quality” or an attribute, such
as male or female, black or white, or college graduate or not, they are essentially nominal scale
variables. One way we could “quantify” such attributes is by constructing artificial variables that
take on values of 1 or 0, with 1 indicating the presence (or possession) of the attribute and 0
indicating its absence. For example, 1 may indicate that a person is female and 0 that the person
is male; or 1 may indicate that a person is a college graduate and 0 that the person is not, and so
on. Variables that assume such 0 and 1 values are called dummy variables. Such variables are
thus essentially a device to classify data into mutually exclusive categories such as male or
female.
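As a minimal sketch (the small data frame below is made up for illustration), pandas can construct such 0/1 dummies from a qualitative column; dropping one category keeps a base group, which matters for the dummy variable trap discussed next:

```python
# Sketch: building 0/1 dummy variables from a qualitative column.
import pandas as pd

df = pd.DataFrame({"sex":  ["female", "male", "female", "male"],
                   "wage": [4.2, 6.1, 3.9, 5.8]})

# drop_first=True keeps m - 1 dummies, so one category remains as the base
# group and the dummy variable trap (see below) is avoided.
dummies = pd.get_dummies(df["sex"], prefix="sex", drop_first=True, dtype=int)
df = pd.concat([df, dummies], axis=1)
print(df)   # sex_male = 1 for males, 0 for females; female is the base group
```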
Note that although they are easy to incorporate into regression models, one must use dummy
variables carefully. In particular:
1. When we have a dummy variable for each category or group and also an intercept in our
model, we have a case of perfect collinearity, that is, an exact linear relationship among
the independent variables: the sum of all the dummy variables is one. Hence, if a
qualitative variable has m categories, introduce only (m − 1) dummy variables; otherwise
we fall into what is known as the dummy variable trap, that is, the situation of perfect
collinearity or perfect multicollinearity. This rule also applies if we have more than one
qualitative variable in the model: for each qualitative regressor, the number of dummy
variables introduced must be one less than the number of categories of that variable.
2. The category for which no dummy variable is assigned is known as the base,
benchmark, control, comparison, reference, or omitted category, and all comparisons
are made in relation to it; it is the category that is omitted and against which the other
dummy coefficients are assessed.
3. The intercept value (β1) represents the mean value of the benchmark category.
4. The coefficients attached to the dummy variables are known as differential intercept
coefficients, because they tell by how much the intercept of the category that receives the
value 1 differs from the intercept of the benchmark category.
5. If a qualitative variable has more than one category, the choice of the benchmark
category is strictly up to the researcher.
To illustrate, consider the model

$$wage = \beta_1 + \delta D + \beta_2\, edu + u_i$$

In this model only two observed factors affect the wage rate: gender and education. Since
D = 1 when the person is female and D = 0 when the person is male, the parameter δ has the
following interpretation: δ is the difference in hourly wage between females and males, given
the same amount of education. Thus, the coefficient δ determines whether there is discrimination
against women: if δ < 0, then, for the same level of other factors, women earn less than men on
average.
Note that
1. In our model the base group is male (D = 0), and hence the interpretation of the
coefficient of the dummy is made against the base group. If the coefficient is less than
zero, females are paid less than their male counterparts for the same level of education;
if it is positive, females are paid more than males.
2. In any application, it does not matter which group we choose as the base; only the
interpretation of the coefficients changes.
Some researchers prefer to drop the overall intercept in the model and to include a dummy
variable for each group. The equation would then be

$$wage = \gamma_m\, male + \gamma_f\, female + \beta_2\, edu + u_i,$$

where the intercept for men is γm and the intercept for women is γf. There is no dummy variable
trap in this case because we do not have an overall intercept. However, this formulation has little
advantage: testing for a difference in the intercepts is more difficult, and there is no generally
agreed upon way to compute R-squared in regressions without an intercept. Therefore, we will
always include an overall intercept for the base group.
Question: Is the difference between female and male earnings statistically significant, or due to
chance? We need to test that!
In general, suppose the simple regression model takes the form:

$$Y = \beta_1 + \delta D + \beta_2 X_2 + u_i$$

When D = 1 the model becomes:

$$Y = (\beta_1 + \delta) + \beta_2 X_2 + u_i,$$

and when D = 0 it becomes:

$$Y = \beta_1 + \beta_2 X_2 + u_i.$$

Thus, given the zero mean assumption (i.e., $E(u_i) = 0$), the mean of Y is $(\beta_1 + \delta) + \beta_2 X_2$
for the group with D = 1 and $\beta_1 + \beta_2 X_2$ for the base group: two parallel lines whose
intercepts differ by δ.

[Figure: two parallel regression lines, one panel for δ > 0 and one for δ < 0.]
Given the assumptions of the classical linear regression model, a model with one or more dummy
variables can be estimated using the OLS estimation method. Once the model is estimated, we
have to test whether the coefficients on the dummies are statistically significant or not.
Suppose we have the model with one dummy variable. The hypotheses are:

$$H_0: \delta = 0 \qquad H_1: \delta \neq 0$$

We can test this using the usual t-test:

$$t = \frac{\hat{\delta}}{SE(\hat{\delta})} \sim t_{(n-k)}$$

Decision Rule:
Reject the null hypothesis if $|t_{cal}| > t_{(n-k),\,\alpha/2}$. Rejection means that the attribute has a
statistically significant effect on the dependent variable.
Numerical example. Suppose the wage equation is re-estimated with educ, exper, and tenure as
controls, and the estimated coefficient on the female dummy D is −1.81, with a negative estimated
intercept. The negative intercept—the intercept for men, in this case—is meaningless. The coefficient on D
is interesting, because it measures the average difference in hourly wage between a woman and a
man, given the same levels of educ, exper, and tenure. If we take a woman and a man with the
same levels of education, experience, and tenure, the woman earns, on average, $1.81 less per
hour than the man.
It is important to remember that, because we have performed multiple regression and controlled
for educ, exper, and tenure, the $1.81 wage differential cannot be explained by different average
levels of education, experience, or tenure between men and women. We can conclude that the
differential of $1.81 is due to gender or factors associated with gender that we have not
controlled for in the regression.
Is this wage differential statistically significant? The usual t-test gives:

$$t = \frac{-1.81}{0.26} = -6.96$$

Using the rule of thumb, since |t| > 2, we reject the null hypothesis, and hence the wage
differential is statistically significant.
Now, suppose all non-dummy explanatory variables are dropped from our model. Then the result
becomes:

$$\widehat{wage} = \underset{(0.21)}{7.10} - \underset{(0.30)}{2.51}\,D, \qquad n = 526, \quad R^2 = 0.116,$$

where D = 1 for females and D = 0 for males (standard errors in parentheses).
Interpretations of OLS estimates:
1. The intercept is the average wage for men in the sample (when D = 0). Thus, on average,
males earn $7.10 per hour.
2. The coefficient on D is the difference in the average wage between females and males.
Thus, the average wage for females in the sample is 7.10 - 2.51 = 4.59, or $4.59 per hour.
Comparing the mean wage of males and females, the mean wage rate of males is higher by $2.51
per hour. Generally, simple regression on a constant and a dummy variable is a straightforward
way to compare the means of two groups. Since t = -8.37, the difference is statistically
significant. For the usual t test to be valid, we must assume that the homoskedasticity assumption
holds, which means that the population variance in wages for men is the same as that for women.
The estimated wage differential between men and women is larger in the simple regression model
than in the multiple regression model because the simple regression does not control for
differences in education, experience, and tenure. The multiple regression model gives a more
reliable estimate of the ceteris paribus gender wage gap; even so, it still indicates a very large
differential.
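The following minimal sketch (simulated data, so the estimates only roughly echo the text's numbers) shows that regressing wage on a constant and a dummy reproduces the comparison of the two group means:

```python
# Sketch: regression on a constant and a dummy compares two group means.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 526
female = rng.integers(0, 2, size=n)                     # D = 1 if female
wage = 7.10 - 2.51 * female + rng.normal(0, 3.5, size=n)

res = sm.OLS(wage, sm.add_constant(female)).fit()
print(res.params)      # intercept ~ male mean wage; slope ~ female-male gap
print(res.tvalues[1])  # t statistic on D: do the two means differ?
print(wage[female == 0].mean(), wage[female == 1].mean())  # the raw group means
```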
Choices of Individuals and or Economic Units
In many cases, dummy independent variables reflect choices of individuals or other economic
units (as opposed to something predetermined, such as gender). In such situations, the matter of
causality is again a central issue. In the following example, we would like to know whether
personal computer ownership causes a higher college grade point average.
The dummy variable PC equals one if a student owns a personal computer and zero otherwise.
There are various reasons PC ownership might have an effect on colGPA (college GPA). A
student’s work might be of higher quality if it is done on a computer, and time can be saved by
not having to wait at a computer lab. Of course, a student might be more inclined to play
computer games or use the internet if he or she owns a PC, so it is not obvious that the
coefficient of PC is positive. The variables hsGPA (high school GPA) and ACT (achievement
test score) are used as controls: it could be that stronger students, as measured by high school
GPA and ACT scores, are more likely to own computers. We control for these factors because we
would like to know the average effect on colGPA if a student is picked at random and given a
personal computer.
The estimated model yields a coefficient of about 0.157 on PC, with a standard error of 0.057.
This implies that a student who owns a PC has a predicted GPA about .16 point higher
than a comparable student without a PC (remember, both colGPA and hsGPA are on a four-point
scale). The effect is also statistically significant, with

$$t = \frac{0.157}{0.057} = 2.75.$$
Interpreting Coefficients of Dummy Explanatory Variables When the Dependent Variable Is ln y
Suppose the dependent variable appears in logarithmic form, with one or more dummy
variables among the independent variables. The coefficient of a dummy variable then has a
percentage interpretation. More specifically, when the dependent variable is in log form (i.e., ln y),
the coefficient on a dummy variable, multiplied by 100, is interpreted as the approximate percentage
difference in y, holding all other factors fixed. When the coefficient on a dummy variable
suggests a large proportionate change in y, the exact percentage difference should be computed
instead.
Let us re-estimate the wage equation using log(wage) as the dependent variable and adding
quadratics in exper and tenure. The estimated coefficient on the female dummy is −0.297, and the
exact percentage difference is given by:

$$e^{-0.297} - 1 \approx -0.257.$$

This more accurate estimate implies that the wage rate of females is, on average, about 25.7%
below the wage rate of males.
Generally, if β2 is the coefficient of a dummy variable, say X2, when log(y) is the dependent
variable, the exact percentage difference in predicted y when X2 = 1 versus when X2 = 0 is
given by:

$$100 \cdot \left[e^{\hat{\beta}_2} - 1\right].$$

The sign of β2 can be positive or negative, and the formula applies in either case.
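A quick numerical check of this formula, using the estimate from above:

```python
# Sketch: approximate vs. exact percentage interpretation of a dummy
# coefficient in a log(y) regression.
import math

beta = -0.297                       # coefficient on the female dummy (from the text)
approx = 100 * beta                 # rough reading: about -29.7%
exact = 100 * (math.exp(beta) - 1)  # exact difference: about -25.7%
print(f"approximate: {approx:.1f}%, exact: {exact:.1f}%")
```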
Regression with only dummy explanatory variables: suppose we regress the salaries of public
school teachers on region dummies for the Oromia region (D2) and the Amhara region (D3), with
the Addis Ababa region as the base category (so all explanatory variables are dummies):

$$Y_i = \beta_1 + \beta_2 D_{2i} + \beta_3 D_{3i} + u_i$$

In other words, the mean salary of public school teachers in the Addis Ababa region is given by
the intercept, β1, in the multiple regression, and the “slope” coefficients β2 and β3 tell by how
much the mean salaries of teachers in the Oromia region and in the Amhara region differ from
the mean salary of teachers in the Addis Ababa region. But are these differences statistically
significant? Suppose the results based on our multiple regression model are as follows:
As these regression results show, the mean salary of teachers in Addis Ababa is about Birr
26,158; that of teachers in the Oromia region is lower by about Birr 1,734; and that of teachers in
the Amhara region is lower by about Birr 3,265.
The actual mean salaries in the two regions can easily be obtained by adding these differential
salaries to the mean salary of teachers in the Addis Ababa region, as indicated above. Thus, the
mean salary in the Oromia region is Birr 24,424 (= 26,158 − 1,734) and the mean salary in the
Amhara region is Birr 22,893 (= 26,158 − 3,265).
However, how do we know that these mean salaries are statistically different from the mean
salary of teachers in the Addis Ababa region, the comparison category? That is easy enough. All
we have to do is to find out if each of the “slope” coefficients is statistically significant. As can
be seen from this regression, the estimated slope coefficient for the Oromia region is not
statistically significant, as its p value is about 23 percent, whereas that of the Amhara region is
statistically significant, as the p value is only about 3.5 percent. Therefore, the overall conclusion
is that statistically the mean salaries of public school teachers in the Addis Ababa region and
Oromia region are about the same but the mean salary of teachers in the Amhara region is
statistically significantly lower by about Birr 3,265.
Note that the dummy variables simply point out the differences, if they exist; they do not
suggest the reasons for the differences.
In the second example, let us combine both quantitative and qualitative explanatory variables.
Suppose we estimate a model that allows for wage differences among four groups: married
men, married women, single men, and single women. To do this, we must select a base group;
we choose single men. Then we define dummy variables for each of the remaining groups;
call these marrmale, marrfem, and singfem. The results of the model are given as:
In order to interpret the coefficients on the dummy variables, we must remember that the base
group is single males. Thus, the estimates on the three dummy variables measure the
proportionate difference in wage relative to single males. For example, married men are
estimated to earn about 21.3% more than single men, holding levels of education, experience,
and tenure fixed. A married woman, on the other hand, earns a predicted 19.8% less than a single
man with the same levels of the other variables.
$$Y_i = \alpha_1 + \alpha_2 D_{2i} + \alpha_3 D_{3i} + \beta\, edu_i + u_i$$

where: Y = hourly wage in dollars
edu = education (years of schooling)
D2 = 1 if female, 0 otherwise
D3 = 1 if nonwhite, 0 otherwise
In this model, gender and race are qualitative regressors and education is a quantitative regressor.
Implicitly, this model assumes that the differential effect of the gender dummy (D2) is constant
across the two categories of race, and that the differential effect of the race dummy (D3) is
constant across the two sexes. That is to say, if the mean salary is higher for males than for
females, this is so whether they are nonwhite or not. Likewise, if, say, nonwhites have lower
mean wages, this is so whether they are female or male.
In many applications such an assumption may be unsound: a nonwhite female may earn lower
wages than a nonwhite male. In other words, there may be interaction between the two
qualitative variables D2 and D3, so their effect on mean Y may not be simply additive as in the
equation above but interactive, as in the following model:

$$Y_i = \alpha_1 + \alpha_2 D_{2i} + \alpha_3 D_{3i} + \alpha_4 (D_{2i} D_{3i}) + \beta\, edu_i + u_i$$
Assuming that the error term has zero mean, taking expectations for each group shows that the
mean hourly wage of nonwhite females differs from that of nonwhite males by α2 + α4, so the
interaction coefficient α4 is the additional differential that comes from being both female and
nonwhite. If, for instance, all three differential dummy coefficients are negative, this would imply
that nonwhite female workers earn much lower mean hourly wages than white female or nonwhite
male workers, as compared with the base category, which in the present example is white males.
Numerical example on Average Hourly Earnings in Relation to Education, Gender and Race
Now test the statistical significance of the differential intercept coefficients. The t-values indicate
that the differential intercept coefficients are statistically significantly different from zero, and
education has a strong positive effect on hourly wage. Our estimation results show that, ceteris
paribus, the average hourly earnings of females are lower by about Birr 2.36 compared to their
male counterparts, and the average hourly earnings of nonwhite workers are lower by about
Birr 1.73 compared to their white counterparts.
Now consider the case of interacting dummy variables. As one can see, the two additive
dummies are still statistically significant, but the interactive dummy is not at the conventional
5 percent level: the actual p-value of the interaction dummy is about 8 percent, so it is
statistically significant only at the 10% level of significance.
Interpretation: holding the level of education constant, if you add the three dummy coefficients
you obtain −1.964 (= −2.3605 − 1.7327 + 2.1289), which means that the mean hourly wage of
nonwhite female workers is lower by about Birr 1.96; this lies between −2.3605 (the gender
difference alone) and −1.7327 (the race difference alone).
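A minimal sketch (simulated data, with coefficients seeded to mimic the text's magnitudes) of estimating additive and interactive gender/race dummies with statsmodels' formula interface:

```python
# Sketch: additive dummies plus their interaction in an OLS wage equation.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 500
df = pd.DataFrame({
    "female":   rng.integers(0, 2, n),
    "nonwhite": rng.integers(0, 2, n),
    "edu":      rng.integers(8, 17, n),
})
df["wage"] = (10 - 2.36 * df.female - 1.73 * df.nonwhite
              + 2.13 * df.female * df.nonwhite
              + 0.8 * df.edu + rng.normal(0, 2, n))

# female:nonwhite adds the interaction dummy D2*D3 to the additive terms.
res = smf.ols("wage ~ female + nonwhite + female:nonwhite + edu", data=df).fit()
print(res.summary().tables[1])
```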
Generally speaking, regressions with dummy independent variables can exhibit four different
possibilities:
1. The slopes of the regression lines are the same but their intercept terms are different.
Such a case is known as parallel regression functions.
2. Both the intercept terms and the slope coefficients of two regressions are different. Such
a case is known as dissimilar regression functions.
3. Both intercept terms and slopes are the same for two regression lines. Such a case is
known as the coincident regression functions.
$$wage = \beta_1 + \beta_{3m}\, edu \qquad \text{(male)}$$
$$wage = (\beta_1 + \beta_2) + \beta_{3f}\, edu \qquad \text{(female)}$$

Coincidence requires $\beta_{3m} = \beta_{3f}$ and $\beta_2 = 0$, so that both lines share the
intercept $\beta_1$.
4. Two regression lines have the same intercept terms but they differ in their slopes. Such a
case is known as concurrent regression lines.
So far, we have considered dummy variables to allow for the changes in the intercept term while
keeping slope coefficients fixed/constant. However, there are also occasions for interacting
dummy variables with explanatory variables that are not dummy variables to allow for
differences in slope coefficients and the intercept terms.
Suppose that we wish to test whether the return to education is the same for males and females,
allowing for a constant wage differential between males and females (a differential for which we
have already found evidence). For simplicity, we include only education and gender in the
model. What kind of model allows for a constant wage differential as well as different returns to
education? Consider the model:
$$\log(wage) = \beta_1 + \beta_2 D_2 + \beta_3\, edu + \beta_4 (D_2 \cdot edu) + u_i,$$

where D2 = 1 for females and 0 for males. If we set D2 = 0, the intercept for males is β1 and the
slope on education for males is β3. If we set D2 = 1, the intercept for females is β1 + β2 and the
slope is β3 + β4. Therefore, β2 measures the difference in intercepts between females and males,
and β4 measures the difference in the return to education between females and males (i.e., the
difference in the slopes of the female and male regression functions). Two of the four possible
sign patterns for β2 and β4 are presented as follows:

(a) β2 < 0 and β4 < 0;  (b) β2 < 0 and β4 > 0.
Graph (a) shows the case where the intercept for females is below that for males. This means that
females earn less than males at all levels of education, and the gap increases as education gets
larger. In graph (b), the intercept for females is below that for males, but the slope on education
is larger for females. This means that females earn less than males at low levels of education, but
the gap narrows as education increases. At some point, a female earns more than a male, given
the same levels of education.
This model can be estimated using the OLS method of estimation. To apply OLS, we write the
model with an explicit interaction term between D2 and edu, as above. The parameters can then
be estimated from the regression of log(wage) on D2, edu, and D2·edu. Obtaining the interaction
term is easy in any regression package. Note that D2·edu is zero for any male in the sample and
equal to the level of education for any female in the sample.
An important hypothesis is that the return to education is the same for females and males. In
terms of our model, this is H0: β4 = 0, which means that the slope of log(wage) with respect to
edu is the same for males and females. Note that this hypothesis puts no restrictions on the
difference in intercepts, β2: a wage differential between males and females is allowed under this
null, but it must be the same at all levels of education. We are also interested in the hypothesis
that average wages are identical for males and females who have the same level of education;
this requires that β2 and β4 both be zero under the null, which calls for a joint F test.
Numerical example
The fitted model for our log wage function yields an estimated return to education for males of
8.2% and a coefficient on the interaction term D2·edu of −0.0056, with standard error 0.0131.
That is, when the education of a male rises by one year, his wage rises, on average, by 8.2%,
keeping other factors constant. For females, the estimated return to education is
0.082 − 0.0056 = 0.0764, or about 7.6%: when the education of a female rises by one year, her
wage rises, on average, by 7.6%, keeping other factors constant. The difference, −0.56 percentage
points (just over half a percentage point less for females), is neither economically large nor
statistically significant: the t-statistic is

$$t = \frac{-0.0056}{0.0131} = -0.43.$$

Thus, we conclude that there is no evidence against the hypothesis that the return to education is
the same for males and females.
Because there is no one in the sample with even close to zero years of education, it is not
surprising that we have a difficult time estimating the differential at educ = 0 (nor is the
differential at zero years of education very informative). More interesting would be to estimate
the gender differential at, say, the average education level in the sample (about 12.5). To do this,
we would replace female·educ with female·(educ − 12.5) and rerun the regression; this changes
only the coefficient on female and its standard error. If we compute the F statistic for
H0: β2 = 0, β4 = 0, we obtain F = 34.33, which is a huge value for an F random variable with
numerator df = 2 and denominator df = 518: the p-value is zero to four decimal places. In the end,
we prefer the model that allows for a constant wage differential between women and men.
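A sketch of the interaction specification and the joint F test just described, on simulated data (the variable names and seeded coefficients are illustrative, not the text's actual data):

```python
# Sketch: dummy-slope interaction plus a joint F test on the gender terms.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 526
df = pd.DataFrame({"female": rng.integers(0, 2, n),
                   "educ":   rng.integers(8, 19, n)})
df["fem_educ"] = df.female * df.educ            # the interaction D2*edu
df["lwage"] = (0.4 - 0.23 * df.female + 0.082 * df.educ
               - 0.0056 * df.fem_educ + rng.normal(0, 0.4, n))

res = smf.ols("lwage ~ female + educ + fem_educ", data=df).fit()
print(res.params)
# Joint test H0: beta2 = 0 and beta4 = 0 (no gender difference at all):
print(res.f_test("female = 0, fem_educ = 0"))
```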
1.3. Dummy as Dependent Variable
So far, we have considered dummy variables as right-hand side (independent) variables. In all
our models up to now, the dependent variable y has had quantitative meaning (for example, y is
a Birr amount, a test score, a percent, or the log of these). What happens if we want to use the
multiple regression model to explain a qualitative event, such as:
➢ participating in labor force or not
➢ willing to pay for improved environmental quality or not
➢ using contraceptives or not
➢ voting for a given election or not, etc
In this case, our dependent variable takes on only two values, zero and one (i.e., it is itself a
dummy variable). In other words, the regressand is a binary, or dichotomous, variable. For
instance, if our dependent variable is the decision to participate in the labor force, the response
variable is 1 = participates in the labor force, 0 = does not participate. Such binary variables can
be analyzed with general probability models, which can be binomial or multinomial. In this
section we consider the binomial model.
Now, given that our dependent variable is a dummy variable, there are several methods to analyze
such regression models. Three approaches to developing a probability model for a binary
response variable are the linear probability model (LPM), the logit model, and the probit model.
The LPM is simply the usual linear regression:

$$y = \beta_1 + \beta_2 x_2 + \beta_3 x_3 + \dots + \beta_k x_k + u$$

Assuming that $E(u) = 0$, then $E(y|x) = \beta_1 + \beta_2 x_2 + \beta_3 x_3 + \dots + \beta_k x_k$,
where x is shorthand for all of the explanatory variables (both quantitative and qualitative) in the
model.
The key point is that when y is a binary variable taking on the values zero and one, it is always
true that $P(y = 1|x) = E(y|x)$: the probability of “success”—that is, the probability that y = 1—is
the same as the expected value of y. Thus, the response probability is linear in the parameters,
which is why this model is called the linear probability model.
Now, if Pi is the probability that Yi = 1 (that is, the event occurs) and (1 − Pi) the probability that
Yi = 0 (that is, the event does not occur), the variable Yi has the following probability
distribution: Yi = 1 with probability Pi, and Yi = 0 with probability (1 − Pi). That is, Yi follows
the Bernoulli probability distribution, and its mathematical expectation is
$E(Y_i) = 1 \cdot P_i + 0 \cdot (1 - P_i) = P_i$.
In general, the expectation of a Bernoulli random variable is the probability that the random
variable equals 1. In passing note that if there are n independent trials, each with a probability p
of success and probability (1 − p) of failure, and X of these trials represent the number of
successes, then X is said to follow the binomial distribution. The mean of the binomial
distribution is np and its variance is np(1 − p). The term success is defined in the context of the
problem.
Since the probability Pi must lie between 0 and 1, we have the restriction 0 ≤ E(Yi | Xi) ≤ 1. That
is, the conditional expectation (or conditional probability) must lie between 0 and 1.
Interpretation of the slope coefficients in the LPM
Since y can take on only two values (0 and 1), βj cannot be interpreted as the change in y given
a one-unit increase in xj, holding all other factors fixed: y either changes from zero to one or
from one to zero. Nevertheless, the βj still have useful interpretations. In the LPM, βj measures
the change in the probability of success when xj changes, holding other factors fixed:

$$\Delta p = \beta_j\, \Delta x_j.$$

This equation gives the marginal effect of xj on Y. The multiple regression model thus allows us
to estimate the effect of various explanatory variables on qualitative events.
The predicted value of y, ŷ, is the predicted probability of success. Therefore, β̂1 is the
predicted probability of success when each xj is set to zero, which may or may not be
interesting. The slope coefficient β̂2 measures the predicted change in the probability of
success when x2 increases by one unit (for a quantitative independent variable).
In order to correctly interpret a linear probability model, we must know what constitutes a
“success.” Thus, it is a good idea to give the dependent variable a name that describes the event y
=1.
Numerical Example
Suppose inlf (“in the labor force”) is a binary variable indicating labor force participation by a
married woman during a given year: inlf =1 if the woman reports working for a wage outside the
home at some point during the year, and zero otherwise. We assume that labor force participation
depends on other sources of income, including husband’s earnings (nwifeinc), years of education
(educ), past years of labor market experience (exper), age, number of children less than six years
old (kidslt6), and number of kids between 6 and 18 years of age (kidsge6).
The estimated linear probability model, based on a sample of 753 married women (428 of whom
were in the labor force), is given as:
Using the usual t statistics, all variables in this estimated model except kidsge6 are statistically
significant, and all of the significant variables have the effects we would expect based on
economic theory.
In order to interpret the estimates, we must remember that a change in an independent variable
changes the probability that inlf = 1. For example, the coefficient on educ means that, keeping
other factors fixed, another year of education increases the probability of labor force participation
by .038. If we take this equation literally, 10 more years of education increases the probability of
being in the labor force by .038(10) = .38, which is a large increase in a probability.
Test of the significance of the coefficient on education: $t = 0.038/0.007 = 5.43$. Since |t| > 2, we
reject the claim that education has no effect on the probability of women's labor force
participation.
The coefficient on nwifeinc implies that if Δnwifeinc = 10, the probability that a woman is in the
labor force falls by .034. Experience has been entered as a quadratic to allow past experience to
have a diminishing effect on the labor force participation probability. Holding the other factors
fixed, the estimated change in the probability for a one-unit increase in exper is approximately
0.039 − 2(0.0006)·exper = 0.039 − 0.0012·exper.
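A sketch of fitting an LPM by OLS on simulated data shaped like this example (the variable names follow the text; the numbers will not match). Robust standard errors are used because, as discussed below, the LPM error is heteroskedastic:

```python
# Sketch: a linear probability model estimated by OLS with robust SEs.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 753
df = pd.DataFrame({
    "nwifeinc": rng.normal(20, 10, n),
    "educ":     rng.integers(5, 18, n),
    "exper":    rng.integers(0, 30, n),
    "age":      rng.integers(25, 60, n),
    "kidslt6":  rng.integers(0, 3, n),
})
p = np.clip(0.2 - 0.0034 * df.nwifeinc + 0.038 * df.educ
            + 0.012 * df.exper - 0.004 * df.age - 0.26 * df.kidslt6,
            0.02, 0.98)
df["inlf"] = rng.binomial(1, p)

lpm = smf.ols("inlf ~ nwifeinc + educ + exper + I(exper**2) + age + kidslt6",
              data=df).fit(cov_type="HC1")   # robust SEs for heteroskedasticity
print(lpm.params)
# Fitted values can stray outside [0, 1] -- one of the LPM's problems:
print((lpm.fittedvalues < 0).sum(), (lpm.fittedvalues > 1).sum())
```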
There are a number of problems associated with using the LPM. Some of these problems are:
1. Non-Normality of the Error Term
Although OLS point estimation does not require the normality assumption for the
disturbances, statistical inference (interval estimation and hypothesis testing) requires
that the disturbance term be normally distributed. However, the assumption of
normality is not tenable for the LPM because, like Yi, the disturbances take only two
values: when Yi = 1, ui = 1 − E(Yi | Xi), and when Yi = 0, ui = −E(Yi | Xi); that is,
they also follow the Bernoulli distribution. This implies that the disturbance terms
cannot be assumed to be normally distributed. The violation of the normality
assumption has serious effects on statistical inference.
2. The Variance of the Error Term is not Homoscedastic
In the LPM, the disturbance terms are not homoscedastic. As statistical theory shows,
for a Bernoulli distribution the theoretical mean and variance are, respectively, p and
p(1 − p), where p is the probability of success (i.e., something happening), showing that
the variance is a function of the mean; hence the error variance is heteroscedastic. The
variance of the error term is given by:

$$var(u_i) = E(Y_i|X_i)\,[1 - E(Y_i|X_i)] = P_i(1 - P_i),$$

which ultimately depends on the values of X and hence is not homoscedastic. In the
presence of heteroscedasticity, the OLS estimators, although unbiased, are not efficient;
that is, they do not have minimum variance.
Since the variance of ui depends on E(Yi | Xi), one way to resolve the heteroscedasticity problem
is weighted least squares:
Step 1. Run the OLS regression despite the heteroscedasticity problem and obtain Ŷi, the
estimate of the true E(Yi | Xi). Then obtain ŵi = Ŷi(1 − Ŷi), the estimate of wi.
Step 2. Use the estimated wi to transform the data (divide through by √ŵi) and estimate the
transformed equation by OLS (i.e., weighted least squares).
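Continuing the previous sketch, the two-step WLS fix might look like this (reusing the `lpm` fit and `df` from above):

```python
# Sketch: feasible weighted least squares for the LPM.
import numpy as np
import statsmodels.formula.api as smf

yhat = np.clip(lpm.fittedvalues, 0.01, 0.99)   # trim so weights stay positive
w = yhat * (1 - yhat)                          # estimated Var(u_i) = Yhat(1 - Yhat)

wls = smf.wls("inlf ~ nwifeinc + educ + exper + I(exper**2) + age + kidslt6",
              data=df, weights=1.0 / w).fit()
print(wls.params)
```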
3. Non-fulfillment of 0 ≤ E(y|x) ≤ 1, i.e., 0 ≤ p ≤ 1
Since E(Yi | X) in the linear probability model measures the conditional probability of
the event Y occurring given X, it must necessarily lie between 0 and 1. Although this is
true a priori, there is no guarantee that Ŷi, the estimator of E(Yi | Xi), will necessarily
fulfill this restriction, and this is the real problem with OLS estimation of the LPM.
There are two ways of finding out whether the estimated Ŷi lie between 0 and 1. One is
to estimate the LPM by the usual OLS method and check whether the estimated Ŷi lie
between 0 and 1: if some are less than 0 (that is, negative), Ŷi is set to zero for those
cases; if some are greater than 1, they are set to 1. The second procedure is to devise an
estimating technique that guarantees that the estimated conditional probabilities Ŷi lie
between 0 and 1. The logit and probit models discussed later guarantee that the
estimated probabilities lie between the logical limits 0 and 1.
4. A probability cannot be linearly related to the independent variables for all their
possible values.
That is, the impact of a regressor on the dependent variable when the regressor increases
from, say, 2 to 3 is not necessarily the same as when it increases from 6 to 7, yet the LPM
imposes a constant marginal effect.
The logit and probit models refer to a dependent variable whose range of values is substantively
restricted. The starting point is a probability model with two features:
1. As Xi increases, Pi = P(y = 1|x) increases but never steps outside the 0–1 interval.
2. The relationship between Pi and Xi is nonlinear, that is, “Pi approaches zero at slower
and slower rates as Xi gets small (or X approaches −∞) and approaches one at slower and
slower rates as Xi gets very large (or X approaches +∞).”
Symbolically,

$$\lim_{x \to +\infty} P(y = 1|x) = 1 \qquad \text{and} \qquad \lim_{x \to -\infty} P(y = 1|x) = 0.$$

[Figure: S-shaped curve of P(y = 1|x) rising from 0 to 1 as x increases.]
In both cases (symbols and graph), the probability lies between zero and one as Xi varies over
(−∞, +∞), and the graph is sigmoid, or S-shaped, which is the shape of the cumulative
distribution function (CDF) of a probability density function (PDF). However, since the CDF of
any continuous distribution is S-shaped, which CDF should we use? The commonly used CDFs
are the cumulative logistic distribution and the cumulative normal distribution. The
cumulative logistic distribution gives rise to the logit model, and the cumulative normal
distribution gives rise to the probit model.
$$P(y = 1|x) = G(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k) = G(x\beta),$$

where G is a function taking values strictly between zero and one: 0 < G(z) < 1 for all real z.
One such nonlinear function that keeps the probabilities between 0 and 1 is the logit function.
What is the particular form of the logit model?
$$G(z) = p_i = \frac{1}{1 + e^{-z}} = \frac{e^z}{1 + e^z},$$

where $z = \beta_1 + \beta_2 x_2 + \beta_3 x_3 + \dots + \beta_k x_k$. As z varies over (−∞, +∞), Pi stays
between zero and one, and Pi is nonlinearly related to z (i.e., to the Xi), thus satisfying the two
requirements mentioned above. However, it seems that in satisfying these requirements we have
created an estimation problem, because Pi is nonlinear both in the variables (X) and in the
parameters (the β's), as can be seen clearly from our model. This means that we cannot use the
OLS estimation procedure to estimate the parameters.
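A minimal sketch of the logistic CDF, just to verify that it maps the whole real line into (0, 1):

```python
# Sketch: the logistic CDF G(z) = 1/(1 + exp(-z)).
import numpy as np

def G(z):
    """Cumulative logistic distribution function."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-6, 6, 7)
print(np.round(G(z), 3))   # climbs from ~0.002 to ~0.998, S-shaped
print(G(z) + G(-z))        # symmetry: G(z) + G(-z) = 1 for every z
```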
Estimation Problem with the Logit Model: the Method of Maximum Likelihood
The logistic function was invented in the 19th century (by Verhulst, 1804–1849) for the
description of population growth (Cramer, 2003). Consider the model
$$y_i = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k + u_i = x\beta + u_i.$$

Suppose pi is the probability that yi = 1 and (1 − pi) the probability that yi = 0. In order to
construct the likelihood function, we note that the contribution of the i-th observation can be
written as:

$$p_i^{y_i}(1 - p_i)^{1 - y_i}$$
In the case of random sampling, where all observations are sampled independently (the binomial
setting), the likelihood function is simply the product of the individual contributions:

$$L = p_1^{y_1}(1-p_1)^{1-y_1} \cdot p_2^{y_2}(1-p_2)^{1-y_2} \cdots p_n^{y_n}(1-p_n)^{1-y_n}
  = \prod_{i=1}^{n} p_i^{y_i}(1-p_i)^{1-y_i}$$
The technique of maximum likelihood requires that we choose those values of the parameters
which maximize the likelihood function given above. In practice, we maximize the logarithm of
the likelihood function:

$$\ln L = \sum_i \left[ y_i \ln p_i + (1 - y_i)\ln(1 - p_i) \right]
       = \sum_i y_i \ln\!\left(\frac{p_i}{1 - p_i}\right) + \sum_i \ln(1 - p_i).$$

For the logit model,

$$1 - p_i = \frac{1}{1 + e^{x\beta}} \qquad \text{and} \qquad
\ln\!\left(\frac{p_i}{1 - p_i}\right) = x\beta.$$

Substituting these into the last expression gives the log-likelihood as an explicit function of the β's.
Thus, in the MLE method our objective is to maximize the logarithm of the likelihood function,
choosing the values of the unknown parameters so that the probability of observing the given Y's
is as high as possible. For this purpose, we differentiate the log-likelihood partially with respect
to each unknown parameter, set the resulting expressions to zero, and solve them.
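A sketch of this estimation idea on simulated data: write the negative log-likelihood for the logit and let a numerical optimizer solve the first-order conditions:

```python
# Sketch: logit estimation by maximizing the Bernoulli log-likelihood.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
n = 1000
beta_true = np.array([-1.0, 0.8])
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # [1, x]
y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta_true)))

def neg_loglik(beta):
    """Negative Bernoulli log-likelihood for the logit model."""
    p = 1 / (1 + np.exp(-X @ beta))
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

fit = minimize(neg_loglik, x0=np.zeros(2), method="BFGS")
print(fit.x)   # ML estimates; close to beta_true in a sample this size
```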
For the probit model, the G function is the standard normal cumulative distribution function
(CDF), expressed as an integral:

$$G(z) = \Phi(z) = \int_{-\infty}^{z} \phi(v)\, dv,$$

where φ is the standard normal density,

$$\phi(v) = (2\pi)^{-1/2} e^{-v^2/2} = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}v^2}.$$

Then G(·) becomes:

$$G(z) = \Phi(z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}v^2}\, dv.$$

This again ensures that the response probability lies between 0 and 1, and it leads to the probit
model.
The MLE procedure for the probit model is similar to that for the logit model, except that the
probit model uses the normal CDF rather than the logistic CDF. The resulting expressions are
rather complicated under the probit model, but the general idea is the same, and hence we do not
pursue the estimation details here.
THE ODDS RATIO: THE CASE OF THE LOGIT MODEL
We know that

$$p_i = \frac{1}{1 + e^{-z}} = \frac{e^z}{1 + e^z},$$

which is the probability that the event occurs, and that the probability that the event does not
occur is

$$1 - p_i = \frac{1}{1 + e^z}.$$

Then the odds ratio in favor of the occurrence of the event is:

$$\frac{p_i}{1 - p_i} = e^z.$$
Now, by taking the natural log of the odds ratio, we get:

$$L_i = \ln\!\left(\frac{p_i}{1 - p_i}\right) = z_i = x\beta.$$

That is, L, the log of the odds ratio, is linear both in the variables (X) and in the parameters (β).
L is called the logit, and hence the name logit model for the model given above.
Note that
1. As P goes from 0 to 1 (i.e., as Z varies from −∞ to +∞), the logit L goes from −∞ to
+∞. That is, although the probabilities (of necessity) lie between 0 and 1, the logits
are not so bounded.
2. Although L is linear in X, the probabilities themselves are not. This property contrasts
with the LPM, where the probabilities increase linearly with X.
3. If L, the logit, is positive, it means that when the value of the regressor(s) increases,
the odds that the regressand equals 1 (meaning some event of interest happens)
increase. If L is negative, the odds that the regressand equals 1 decrease as the value
of X increases.
4. Whereas the LPM assumes that Pi is linearly related to Xi, the logit model assumes
that the log of the odds ratio is linearly related to Xi.
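A short sketch (simulated data): exponentiating fitted logit coefficients turns them into odds ratios, the scale on which the model is linear:

```python
# Sketch: from logit coefficients (log odds) to odds ratios.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=500))
y = rng.binomial(1, 1 / (1 + np.exp(-(-0.5 + 0.8 * X[:, 1]))))

res = sm.Logit(y, X).fit(disp=0)
print(res.params)           # beta: change in the log odds per unit of x
print(np.exp(res.params))   # e^beta: multiplicative change in the odds
```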
Derivation of the above two models using a latent variable model
The latent variable approach is a common way of specifying discrete choice models. Both the
probit and logit models can be interpreted as latent variable models. Let y* be an unobserved, or
latent, variable given by

$$y^* = \beta_0 + x\beta + e, \qquad y = 1[y^* > 0],$$

where 1[·] is the indicator function:

$$y = \begin{cases} 1 & \text{if } y^* > 0 \\ 0 & \text{otherwise (i.e., } y^* \le 0). \end{cases}$$

That is, we observe only the outcome: y = 1 if the event occurs and y = 0 if it does not. Such
unobserved variables are known as latent, or underlying, variables. If e has the standard logistic
distribution we obtain the logit model; if e is standard normal we obtain the probit model. In
either case, with G the CDF of e,

$$P(y = 1|x) = P(y^* > 0 \mid x) = P[e > -(\beta_0 + x\beta) \mid x]
= 1 - G[-(\beta_0 + x\beta)] = G(\beta_0 + x\beta),$$

or, in alternative notation, $P(y = 1|x) = 1 - F(-x_i'\beta) = F(x_i'\beta)$, where the last equality
uses the symmetry of the logistic and normal distributions about zero.
1.3.4. Interpreting Logit and Probit Model Estimates
Given the logit model

$$p_i = \frac{1}{1 + e^{-z}} = \frac{e^z}{1 + e^z},$$

the effect of a change in z on p, denoted f(z), is

$$f(z) = \frac{dp}{dz} = \frac{e^{-z}}{(1 + e^{-z})^2},$$

which is a function of all the regressors through $z = x\beta$.
The slope coefficients of logit or probit models give the signs of the partial effects of each xj on
the response probability, and the statistical significance of xj is determined by whether we can
reject H0: βj = 0 at a sufficiently small significance level. These slope coefficients can also be
interpreted in terms of the odds ratio, or the log odds ratio, in favor of the occurrence of the
event.
Question: How do we get the marginal effects of regressors in the model on the estimated
probability of success? In order to answer this question, we need to compute the marginal
effects.
MEs measure the impact of the regressors (the x's) on the response probability, specifically on
the probability of success [i.e., P(y = 1|x)]. However, the MEs are not straightforward because of
the nonlinear function G(·), and we compute them using calculus.
We know that

$$p_i = \frac{1}{1+e^{-z}} = \frac{e^z}{1+e^z}, \qquad
f(z) = \frac{dp}{dz} = \frac{e^{-z}}{(1+e^{-z})^2}, \qquad z = x\beta.$$
If xj is a continuous variable, its marginal effect is obtained from the partial derivative

$$\frac{\partial p(x)}{\partial x_j} = f(x\beta)\,\beta_j, \qquad \text{where } f(z) \equiv \frac{dG(z)}{dz}.$$

Since the logistic and normal CDFs are strictly increasing, f(z) > 0 for all z. This tells us that the
partial effect of xj on p(x) depends on x through the positive quantity f(xβ), which means that the
partial effect always has the same sign as βj.
Since the marginal effects depend on the values of the regressors, the usual practice is to compute
each marginal effect at the means of the independent variables: evaluate f(z) at z̄ = x̄β and
multiply it by the estimated coefficient from the logit or probit model. Both the sign and the
magnitude can then be interpreted: the magnitude indicates the size of the increase or decrease in
the probability of success, and the sign shows the direction of that change.
What is the ME if the regressor is binary? In this case we cannot use the mean value of Xi
(which is the fraction of ones in the sample): plugging averages into binary variables produces
an effect that does not correspond to any particular individual. If xk is a discrete variable, we
instead estimate the change in the predicted probability as xk goes from zero to one.
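A sketch (simulated data) of both conventions—marginal effects at the means of the regressors and averaged over the sample—using statsmodels' `get_margeff`, which automates the f(z)·β computation:

```python
# Sketch: marginal effects for a logit, at the mean and overall.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
X = sm.add_constant(rng.normal(size=(800, 1)))
y = rng.binomial(1, 1 / (1 + np.exp(-(X @ np.array([-0.4, 1.0])))))

res = sm.Logit(y, X).fit(disp=0)
print(res.get_margeff(at="mean").summary())      # f(xbar'b) * b_j
print(res.get_margeff(at="overall").summary())   # average marginal effect
```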
Testing the Statistical Significance of Each Slope Coefficient
The procedure of testing the significance of each coefficient in the LDV model is the same as in
the case of the usual OLS. However, the z statistics in the stata output are approximation to t
statistics in the OLS. Note that this z has nothing to do with the Z-score/variable.
Testing Overall Statistical Significance of the Model
The LR test is based on the same concept as the F test in a linear model: it rests on the difference
between the log-likelihoods of the unrestricted and restricted models. The idea is this. Because
the MLE maximizes the log-likelihood function, dropping variables generally leads to a
smaller—or at least no larger—log-likelihood. (This is similar to the fact that the R-squared never
increases when variables are dropped from a regression.) The question is whether the fall in the
log-likelihood is large enough to conclude that the dropped variables are important. We can make
this decision once we have a test statistic and a set of critical values. The likelihood ratio statistic
is

$$LR = 2\,(L_{ur} - L_r),$$

where Lur is the log-likelihood value for the unrestricted model and Lr the log-likelihood value
for the restricted model. Because Lur ≥ Lr, LR is nonnegative and usually strictly positive. In
computing the LR statistic, it is important to remember that Lur and Lr can each be negative; this
does not change the way LR is computed, and we must preserve the negative signs. Under the
null hypothesis, LR has an asymptotic chi-squared distribution with degrees of freedom equal to
the number of restrictions.
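A sketch (simulated data) of the LR test against an intercept-only restricted model; the statistic is compared with a chi-squared distribution whose degrees of freedom equal the number of excluded variables:

```python
# Sketch: likelihood ratio test for overall significance of a logit.
import numpy as np
from scipy.stats import chi2
import statsmodels.api as sm

rng = np.random.default_rng(2)
X = sm.add_constant(rng.normal(size=(600, 2)))
y = rng.binomial(1, 1 / (1 + np.exp(-(X @ np.array([0.2, 0.9, -0.7])))))

unrestricted = sm.Logit(y, X).fit(disp=0)
restricted = sm.Logit(y, X[:, [0]]).fit(disp=0)   # constant only

LR = 2 * (unrestricted.llf - restricted.llf)
print(LR, chi2.sf(LR, df=2))   # statistic and its asymptotic p-value
```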
Contrary to the linear regression model, there is no single measure of goodness-of-fit in binary
response (choice) models. Often, goodness-of-fit measures are implicitly or explicitly based on a
comparison with a model that contains only a constant as explanatory variable.
Let log L1 denote the maximum of the log-likelihood of the model of interest, and let log L0
denote the maximum of the log-likelihood when all parameters except the intercept are set to
zero. Clearly, log L1 ≥ log L0 (indeed, a formal likelihood ratio test can be based on the
difference between the two values). The larger the difference between the two log-likelihood
values, the more the extended model adds to the very restrictive intercept-only model. A first
goodness-of-fit measure is defined as

$$\text{Pseudo } R^2 = 1 - \frac{1}{1 + 2(\log L_1 - \log L_0)/N},$$

where N denotes the number of observations. A second measure is

$$\text{McFadden } R^2 = 1 - \frac{\log L_1}{\log L_0},$$

sometimes referred to as the likelihood ratio index. Because the log-likelihood is a sum of log
probabilities, it follows that log L0 ≤ log L1 < 0. If all estimated slope coefficients are equal to
zero, we have log L0 = log L1, so both R-squareds are equal to zero. If the model were able to
generate estimated probabilities corresponding exactly to the observed values (that is, p̂i = yi for
all i), all probabilities in the log-likelihood would equal one, so the log-likelihood would be
exactly zero; consequently, the upper limit of both measures is 1.
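Reusing the two fits from the previous sketch, both measures are one-liners (statsmodels reports McFadden's version as `prsquared`):

```python
# Sketch: pseudo R-squared measures from the two logit log-likelihoods.
N = len(y)
mcfadden = 1 - unrestricted.llf / restricted.llf
pseudo = 1 - 1 / (1 + 2 * (unrestricted.llf - restricted.llf) / N)
print(mcfadden, pseudo)        # both lie in [0, 1)
print(unrestricted.prsquared)  # statsmodels' built-in McFadden R-squared
```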
To recap: how can we explain the impact of the regressors (the x's) on the response probability,
specifically on the probability of success [i.e., P(y = 1|x)]? The MEs are not straightforward
because of the nonlinear function G(·), and we compute them using calculus.
$$\frac{\partial p(x)}{\partial x_j} = g(\beta_0 + x\beta)\,\beta_j, \qquad
\text{where } g(z) \equiv \frac{dG(z)}{dz}. \tag{11}$$

Here G is the CDF of a continuous random variable and g its PDF. G(·) is strictly increasing, so
g(z) > 0 for all z. Eq. (11) tells us that the partial effect of xj on p(x) depends on x through the
positive quantity g(β0 + xβ), which means that the partial effect always has the same sign as βj.
What is the ME if the regressor is binary, say x1? It is the discrete difference

$$G(\beta_0 + \beta_1 + \beta_2 x_2 + \dots + \beta_k x_k)
 - G(\beta_0 + \beta_2 x_2 + \dots + \beta_k x_k). \tag{12}$$

Note that x1 is 1 in the first term of eq. (12) and 0 in the second term. The coefficient β1 alone
gives only the sign of the effect, not its magnitude; the magnitude is the difference in (12).
Example: suppose y is an employment indicator and the regressor is a location dummy (e.g.,
urban vs. rural). The parameter estimate on the location dummy captures the effect of location on
the employment probability; note that the sign of the parameter estimate alone is sufficient to tell
whether being in an urban area has a positive or a negative effect on that probability. We can use
the difference in eq. (12) for other kinds of discrete variables as well (such as the number of
children in a given household).
If xk denotes such a variable, then the effect on the employment probability of xk going from
ck to ck + 1 is:

$$G(\beta_0 + \beta_1 x_1 + \dots + \beta_k (c_k + 1)) - G(\beta_0 + \beta_1 x_1 + \dots + \beta_k c_k).$$
Other standard functional forms can be included among the regressors (e.g., polynomials of
different orders or logarithms). Example: in the model

$$P(y = 1|z) = G(\beta_0 + \beta_1 z_1 + \beta_2 z_1^2 + \beta_3 \log z_2),$$

the ME of z1 on P(y = 1|z) is

$$\frac{\partial P(y = 1|z)}{\partial z_1} = g(\beta_0 + x\beta)(\beta_1 + 2\beta_2 z_1), \tag{15}$$

and the ME of z2 is

$$\frac{\partial P(y = 1|z)}{\partial z_2} = g(\beta_0 + x\beta)(\beta_3 / z_2), \tag{16}$$

where $x\beta$ is shorthand for the index $\beta_1 z_1 + \beta_2 z_1^2 + \beta_3 \log z_2$. Thus

$$g(\beta_0 + x\beta)(\beta_3 / 100)$$

is the approximate change in the response probability when z2 increases by 1%.
Computing Marginal Effects using Stata 9
(Type mfx after estimating the equation.)
Note: interactions among regressors (including those between discrete and continuous variables)
are handled similarly.
Estimation
We have different ways of generating estimators, viz. the method of moments, least squares, and
maximum likelihood estimation (MLE). All of the discrete choice models discussed above are
estimated using the MLE technique. To estimate the LPM we can use OLS, or in some cases
WLS (weighted least squares); but because P(y = 1|x) is nonlinear in the parameters for the logit
and probit models, OLS cannot be used to estimate them.
Assume that we have a random sample of size n. To obtain the ML estimator conditional on the
regressors, we need the density of yi given xi. This is:
$$f(y_i \mid x_i; \beta) = [G(x_i\beta)]^{y_i}\,[1 - G(x_i\beta)]^{1 - y_i}, \qquad y_i \in \{0, 1\}. \tag{18}$$

The log-likelihood contribution of observation i—a function of the parameters and the data
(xi, yi)—is obtained by taking the log of (18):

$$\ell_i(\beta) = y_i \log G(x_i\beta) + (1 - y_i)\log[1 - G(x_i\beta)]. \tag{19}$$

The log-likelihood for a sample of size n is obtained by summing (19) across all observations:

$$L(\beta) = \sum_{i=1}^{n} \ell_i(\beta).$$

Differentiating this function with respect to the parameters gives the first-order conditions;
solving them for the parameters of interest yields the ML estimates.
The nonlinear nature of the maximization problem means that we cannot write closed-form
formulas for the logit or probit ML estimates; they must be computed numerically. However,
under general conditions the MLE is:
- consistent
- asymptotically normal
- asymptotically efficient.
Dummy Variable Regression
Usually, data are classified according to categories. Often, such categories are qualitative in
nature and specify unique characteristics of the members of the observations. For instance, if we
have data on the outputs and inputs of firms, each producing unit may be classified as large or
small. If we have data on the wages of workers in a firm, we may characterize the recipients of
these wages as male or female, and so on. We may then hypothesize that large firms are more
productive than small ones, or that female workers earn less than their male counterparts. Note
that what we are saying here is that, given the same attributes on the right-hand side of the
equation, the left-hand side will take a larger or smaller value depending on the category of the
firm or worker. What we are proposing is the following. Suppose the general model we specify is:
$$Y_i = \beta_1 + \beta_2 X_i + u_i$$

with the following group-specific intercepts:

$$Y_i = \beta_1 + \beta_2 X_i + u_i \qquad \text{(small firms / female workers)}$$
$$Y_i = (\beta_1 + \delta) + \beta_2 X_i + u_i \qquad \text{(large firms / male workers)}$$

[Figure: two parallel lines in the (X, Y) plane, the higher one for large firms/male workers, the
lower one for small firms/female workers.]
This is one of the main areas where the notion of dummy variables is used. To illustrate this, let
Di = 0 for small firms / female workers
Di = 1 for large firms / male workers
Say we have n1 small firms/female workers and n2 large firms/male workers in our sample. The
total number of observations in our sample is, therefore, equal to n = n1 + n2. Then our model
reduces to:

$$Y_i = \beta_1 + \beta_2 X_i + u_i, \qquad i = 1, 2, \dots, n_1$$
$$Y_i = \beta_1 + \delta + \beta_2 X_i + u_i, \qquad i = n_1+1, n_1+2, \dots, n$$

or, combining the two,

$$Y_i = \beta_1 + \delta D_i + \beta_2 X_i + u_i, \qquad i = 1, 2, \dots, n.$$
To estimate β1, β2, and δ we regress Y on X and D. Suppose we ignored this difference and ran
the regression without the dummy variable: we would then have biased estimators, because we
would have omitted a relevant variable. To see this, observe the following figure.

[Figure: scatter plot of the two groups with a single regression line that ignores the categories,
lying between the separate lines for large firms/male workers and small firms/female workers.]
Notice that the parameter estimates are different when we use the dummy compared to the case
where we do not. Now, given that the model should include a dummy variable, we have:

$$Y_i = \beta_1 + \delta D_i + \beta_2 X_i + u_i$$
We minimize the sum of squared residuals with respect to the parameters in order to obtain the
OLS estimators, i.e.,

$$\min \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n}\left(Y_i - \hat\beta_1 - \hat\delta D_i - \hat\beta_2 X_i\right)^2.$$

Setting the partial derivative with respect to $\hat\beta_1$ to zero:

$$\frac{\partial \sum e_i^2}{\partial \hat\beta_1}
= -2\sum_{i=1}^{n}\left(Y_i - \hat\beta_1 - \hat\delta D_i - \hat\beta_2 X_i\right) = 0
\;\Rightarrow\; \sum_{i=1}^{n} Y_i - n\hat\beta_1 - n_2\hat\delta - \hat\beta_2\sum_{i=1}^{n} X_i = 0
\;\Rightarrow\; \bar Y = \hat\beta_1 + \frac{n_2}{n}\hat\delta + \hat\beta_2 \bar X. \quad \text{(EQ1)}$$

Setting the partial derivative with respect to $\hat\beta_2$ to zero:

$$\frac{\partial \sum e_i^2}{\partial \hat\beta_2}
= -2\sum_{i=1}^{n}\left(Y_i - \hat\beta_1 - \hat\delta D_i - \hat\beta_2 X_i\right)X_i = 0
\;\Rightarrow\; \sum_{i=1}^{n} X_iY_i - \hat\beta_1\sum_{i=1}^{n} X_i
  - \hat\delta\sum_{i=n_1+1}^{n} X_i - \hat\beta_2\sum_{i=1}^{n} X_i^2 = 0. \quad \text{(EQ2)}$$

Setting the partial derivative with respect to $\hat\delta$ to zero (note that $D_i = D_i^2$ and that
$D_i$ picks out the observations of group 2):

$$\frac{\partial \sum e_i^2}{\partial \hat\delta}
= -2\sum_{i=1}^{n}\left(Y_i - \hat\beta_1 - \hat\delta D_i - \hat\beta_2 X_i\right)D_i = 0
\;\Rightarrow\; \sum_{i=n_1+1}^{n} Y_i - n_2\hat\beta_1 - n_2\hat\delta
  - \hat\beta_2\sum_{i=n_1+1}^{n} X_i = 0
\;\Rightarrow\; \hat\beta_1 + \hat\delta = \bar Y_2 - \hat\beta_2\bar X_2, \quad \text{(EQ3)}$$

where $(\bar X_2, \bar Y_2)$ are the means of group 2. Substituting EQ3 into EQ1 and using
$n\bar X = n_1\bar X_1 + n_2\bar X_2$ and $n\bar Y = n_1\bar Y_1 + n_2\bar Y_2$ gives

$$\hat\beta_1 = \bar Y_1 - \hat\beta_2\bar X_1 \quad \text{(EQ4)}$$

and, from EQ3,

$$\hat\delta = (\bar Y_2 - \bar Y_1) - \hat\beta_2(\bar X_2 - \bar X_1). \quad \text{(EQ5)}$$

Substituting the results in EQ4 and EQ5 into EQ2 yields:

$$\hat\beta_2 = \frac{\displaystyle\sum_{i=1}^{n_1}\left(Y_{1i} - \bar Y_1\right)\left(X_{1i} - \bar X_1\right)
 + \sum_{i=n_1+1}^{n}\left(Y_{2i} - \bar Y_2\right)\left(X_{2i} - \bar X_2\right)}
{\displaystyle\sum_{i=1}^{n_1}\left(X_{1i} - \bar X_1\right)^2
 + \sum_{i=n_1+1}^{n}\left(X_{2i} - \bar X_2\right)^2},$$

that is, the slope computed from the pooled within-group variation. We can then use a t test to
check whether the intercepts are different or equal. Note that if we accept H0: δ = 0, it implies
that the dummy variable is not necessary in our model, i.e., categorization of the data is not
necessary. On the other hand, if H0 is rejected, meaning that δ is significantly different from zero
in the statistical sense, then our categorization is correct, and there are two different regression
equations, not one, in our model.
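A numerical sanity check (simulated data): the pooled-slope formula and EQ5 reproduce the OLS estimates from regressing Y on a constant, D, and X.

```python
# Sketch: verifying the derived estimators against a direct OLS fit.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n1, n2 = 60, 80
D = np.r_[np.zeros(n1), np.ones(n2)]
X = rng.normal(10, 2, n1 + n2)
Y = 1.0 + 2.0 * D + 0.5 * X + rng.normal(0, 1, n1 + n2)

res = sm.OLS(Y, sm.add_constant(np.column_stack([D, X]))).fit()
b1, d, b2 = res.params

X1, Y1 = X[D == 0], Y[D == 0]
X2, Y2 = X[D == 1], Y[D == 1]
b2_hand = ((np.sum((Y1 - Y1.mean()) * (X1 - X1.mean()))
            + np.sum((Y2 - Y2.mean()) * (X2 - X2.mean())))
           / (np.sum((X1 - X1.mean())**2) + np.sum((X2 - X2.mean())**2)))
print(b2, b2_hand)                                                      # agree
print(d, (Y2.mean() - Y1.mean()) - b2_hand * (X2.mean() - X1.mean()))   # EQ5
```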
We can also introduce a separate dummy variable for group 1 and model our regression as
follows:

$$Y_i = \alpha_1 + \beta_2 X_i + u_i \qquad \text{(small firms / female workers)}$$
$$Y_i = \alpha_2 + \beta_2 X_i + u_i \qquad \text{(large firms / male workers)}$$

In this formulation $\alpha_2 = \beta_1 + \delta$. Combining the two equations gives us:

$$Y_i = \alpha_1 Z_i + \alpha_2 D_i + \beta_2 X_i + u_i, \qquad i = 1, 2, \dots, n,$$

where $Z_i = 1$ and $D_i = 0$ for group 1, and $Z_i = 0$ and $D_i = 1$ for group 2.
Note that in this formulation we do not have a constant (intercept) term. In fact, introducing an
intercept into the last model would lead to what is known as the dummy variable trap, because in

$$Y_i = \beta_0 + \alpha_1 Z_i + \alpha_2 D_i + \beta_2 X_i + u_i, \qquad i = 1, 2, \dots, n,$$

we have $Z_i + D_i \equiv 1$, i.e., col(1) = col(2) + col(3) in the data matrix. This is what is known
as perfect multicollinearity, which we shall treat shortly. However, the issue of dummy variables
(categories in the data) can be generalized to the case of more than two categories. Think about
it, and try to formulate a model with multiple (say, three) categories.