Chapter 4
4.3.2 Regression on one quantitative variable and one
qualitative variable with two classes, or categories
Consider the model:
Yi = α1 + α2Di + βXi + ui    (2)
Where: Yi = annual salary of a college professor
Xi = years of teaching experience
Di = 1 if male
   = 0 otherwise
Model (2) contains one quantitative variable
(years of teaching experience) and one qualitative
variable (gender) that has two classes (or levels,
classifications, or categories), namely, male and
female.
What is the meaning of this equation?
Yi = α1 + α2Di + βXi + ui    (2)
Assuming, as usual, that E(ui) = 0, we see that:
o Mean salary of a female college professor:
E(Yi | Xi, Di = 0) = α1 + βXi    (3)
o Mean salary of a male college professor:
E(Yi | Xi, Di = 1) = (α1 + α2) + βXi    (4)
o Geometrically, we have the situation shown in fig below
(for illustration, it is assumed that α1 > 0).
Model (2) postulates that the male and female college professors’ salary
functions in relation to years of teaching experience have the same slope
(β) but different intercepts.
In other words, it is assumed that the level of the male professors’ mean
salary differs from that of the female professors’ mean salary (by α2),
but the rate of change in mean annual salary with years of experience is the
same for both sexes.
Example
1. Assume the following regression result from a model of the form
given by equation (2), with Y being the hourly wage
rate, D a dummy for men, and X a variable for years of
schooling. The dependent variable is expressed in US
dollars ($). Standard errors are given in parentheses:
Ŷ = 338.5 - 165.5 D + 59.6 X
    (244.7)  (81.6)    (17.04)
A. Find the slope and the intercept of the fitted line for each value
of the dummy variable (male and female).
B. Find the difference in average hourly wages between
male and female heads of household.
C. Interpret the estimated coefficients and the model result.
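As a hedged aid for parts A to C, the sketch below reads the intercepts, the common slope, the male-female gap, and the t-ratios straight off the reported estimates. The helper name `fitted_line` is hypothetical, and the t-ratios are simply coefficient divided by standard error; no data beyond the numbers quoted in the example are used.

```python
# Reported estimates from the example: Y-hat = 338.5 - 165.5*D + 59.6*X
coef = {"const": 338.5, "D": -165.5, "X": 59.6}
se = {"const": 244.7, "D": 81.6, "X": 17.04}


def fitted_line(d):
    """Return (intercept, slope) of the fitted wage line for dummy value d."""
    return coef["const"] + coef["D"] * d, coef["X"]


female_intercept, female_slope = fitted_line(0)  # D = 0 (base group)
male_intercept, male_slope = fitted_line(1)      # D = 1

# The gap between the two parallel lines equals the coefficient on D.
wage_gap = male_intercept - female_intercept

# t-ratio = estimate / standard error, for each coefficient.
t_ratios = {name: coef[name] / se[name] for name in coef}

print("female line:", female_intercept, female_slope)
print("male line:  ", male_intercept, male_slope)
print("gap (male - female):", wage_gap)
print("t-ratios:", {k: round(v, 2) for k, v in t_ratios.items()})
```

The two lines share the slope 59.6 and differ only in the intercept, which is the pattern model (2) imposes.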
Regression on one quantitative variable and one qualitative
variable with more than two classes
Suppose that, on the basis of the cross-sectional data, we
want to regress the annual expenditure on health care by an
individual on the income and education of the individual.
Since the variable education is qualitative in nature,
suppose we consider three mutually exclusive levels of
education: less than high school, high school, and college.
Now, unlike the previous case, we have more than two
categories of the qualitative variable education.
Therefore, following the rule that the number of dummies
be one less than the number of categories of the variable
(m-1), we should introduce two dummies to take care of the
three levels of education.
Assuming that the three educational groups have a common slope
but different intercepts in the regression of annual expenditure on
health care on annual income, we can use the following model:
Yi = α1 + α2D2i + α3D3i + βXi + ui    (5)
Where Yi = annual expenditure on health care
Xi = annual income
D2i = 1 if high school education
    = 0 otherwise
D3i = 1 if college education
    = 0 otherwise
• Note: The intercept α1 will reflect “less than high school
education” category as the base category.
• The differential intercepts α2 and α3 tell by how much the
intercepts of the other two categories differ from the intercept of the
base category, which can be readily checked as follows:
• Assuming E(ui ) = 0 , we obtain
E(Yi | D2 = 0, D3 = 0, Xi ) = α1 + βXi
E(Yi | D2 = 1, D3 = 0, Xi ) = (α1 + α2 ) + βXi
E(Yi | D2 = 0, D3 = 1, Xi ) = (α1 + α3 ) + βXi
which are, respectively, the mean health care expenditure functions for
the three levels of education, namely, less than high school, high
school, and college. Geometrically, the situation is shown in fig 1.2
(for illustrative purposes it is assumed that α3 > α2 ).
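A minimal sketch of how the two dummies in model (5) can be built from a three-level education variable and how they shift the intercept. The five observations and the parameter values (`alpha1`, `alpha2`, `alpha3`, `beta`) are hypothetical, chosen only so that α3 > α2; they are not estimates from any data.

```python
import numpy as np

# Hypothetical education levels for five individuals (illustrative only).
education = np.array(
    ["less_than_hs", "high_school", "college", "high_school", "college"]
)

# m = 3 categories, so m - 1 = 2 dummies; "less_than_hs" is the base category.
D2 = (education == "high_school").astype(int)
D3 = (education == "college").astype(int)

# Hypothetical parameter values, chosen only so that alpha3 > alpha2.
alpha1, alpha2, alpha3, beta = 100.0, 50.0, 120.0, 0.05

# Intercept of each individual's mean expenditure function in model (5).
intercepts = alpha1 + alpha2 * D2 + alpha3 * D3

for level, a in zip(education, intercepts):
    print(f"{level:>12}: E(Y | X) = {a:.0f} + {beta} * X")
```

Every group gets the same slope `beta`; only the intercept moves, by α2 or α3 relative to the base category.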
Illustrative Example
Regression on one quantitative variable and two qualitative variables, with
two categories
The technique of dummy variables can be easily extended to
handle more than one qualitative variable.
Let us revert to the college professors’ salary regression, but
now assume that, in addition to years of teaching
experience and sex, the skin color of the teacher is also an
important determinant of salary.
For simplicity, assume that color has two categories, black and
white, and that sex has two categories, male and
female. We can now write:
Yi = α1 + α2D2i + α3D3i + βXi + ui    (6)
Where Yi = annual salary
Xi = years of teaching experience
D2i = 1 if male
    = 0 otherwise
D3i = 1 if white
    = 0 otherwise
Notice that each of the two qualitative variables, sex and color, has
two categories and hence needs one dummy variable for each.
Note also that the omitted, or base, category now is “black female
professor.” Assuming E(ui) = 0, we can obtain the following
regressions from equation (6):
Mean salary for black female professor:
E(Yi | D2 = 0, D3 = 0, Xi) = α1 + βXi
Mean salary for black male professor:
E(Yi | D2 = 1, D3 = 0, Xi) = (α1 + α2) + βXi
Mean salary for white female professor:
E(Yi | D2 = 0, D3 = 1, Xi) = (α1 + α3) + βXi
Mean salary for white male professor:
E(Yi | D2 = 1, D3 = 1, Xi) = (α1 + α2 + α3) + βXi
Example
1. Suppose we run the regression of Y on the
four explanatory variables and a constant:
o Y = 2736 + 12598D1 + 10969D2 + 5.197X1 + 10.562X2
o Where Y is the price of the house.
o D1 = 1 if the house has a driveway, 0 if it does not.
o D2 = 1 if the house has a recreation room, 0 otherwise.
o X1 is the size of the garden and X2 is land rent.
Required: Calculate the expected value of the house if it has
no driveway and no recreation room, a driveway only, a
recreation room only, or both a driveway and a recreation
room, ceteris paribus. Then interpret the
coefficients of all explanatory variables.
Solution
I. If the house has no driveway (D1 = 0) and no recreation
room (D2 = 0), its expected value is Y = 2736 + 5.197X1 + 10.562X2, so the intercept is $2736.
II. If the house has a driveway, its value will be, ceteris
paribus, $12598 more.
III. If the house has a recreation room, its value will be, ceteris
paribus, $10969 more.
IV. If the house has a driveway and a recreation room, its value
will be, ceteris paribus, 12598 + 10969 = $23567 more.
V. Increasing the size of the garden by 1 square foot will
increase the price of the house by $5.197, whether or not the
house has a driveway or a recreation room.
VI. If land rent increases by one birr, the price of the house
will rise by about $10.56, ceteris paribus.
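The solution can be checked with a short sketch that evaluates the fitted equation for each combination of the two dummies. The function name `expected_price` and the values chosen for garden size and land rent are hypothetical; any fixed values give the same differences, which is what "ceteris paribus" means here.

```python
def expected_price(d1, d2, x1, x2):
    """Fitted house price from the example equation."""
    return 2736 + 12598 * d1 + 10969 * d2 + 5.197 * x1 + 10.562 * x2


# Hypothetical garden size and land rent, held fixed across comparisons.
garden, rent = 1000.0, 50.0

base = expected_price(0, 0, garden, rent)           # no driveway, no rec room
driveway_only = expected_price(1, 0, garden, rent)
rec_room_only = expected_price(0, 1, garden, rent)
both = expected_price(1, 1, garden, rent)

print("driveway premium:", round(driveway_only - base, 3))
print("rec room premium:", round(rec_room_only - base, 3))
print("both premium:    ", round(both - base, 3))
print("price with X1 = X2 = 0 and no features:", expected_price(0, 0, 0, 0))
```

The premiums do not depend on the values picked for `garden` and `rent`, because the model has no interaction terms.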
4.3.5 Dummy variable Trap
First, if the regression contains a constant term,
the number of dummy variables must be one less than
the number of classes of each qualitative variable.
If dummies for all categories of a qualitative variable are
included together with the intercept, there will be perfect
multicollinearity and estimation of the regression will be impossible.
This is called the dummy variable trap.
The dummy variable trap occurs when the
dummy variables created by one-hot encoding are
perfectly collinear: with one dummy per category, the dummies
sum to one for every observation and so duplicate the intercept.
This means that one variable can be predicted exactly from
the others, so the regression coefficients cannot be
uniquely estimated or interpreted.
There are two ways to avoid the dummy variable trap.
First, introduce as many dummy variables as the number
of categories of that variable and omit the intercept term
from the model: Yi = β1D1i + β2D2i + β3D3i + ui
Second, include the intercept term and introduce only (m-1)
dummies, where m is the number of categories of the qualitative
variable. The omitted category becomes the base group, and the
coefficient attached to each dummy variable must always be
interpreted in relation to the base, or reference, group.
For example, suppose we want to look at the effect of location (Addis
Ababa, Hawassa, Arba Minch) on a person's salary in thousands
of Birr (Y). If Arba Minch is dropped as the base category, then:
Multiple Regression Model: Y = β0 + β1D1 + β2D2 + e
where D1 and D2 are the dummies for the two remaining locations.
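The trap can be seen numerically by checking the rank of the design matrix. The sketch below uses a hypothetical sample of six people and assumes D1 marks Addis Ababa and D2 marks Hawassa, with Arba Minch as the base; it relies only on NumPy's `matrix_rank`.

```python
import numpy as np

# Hypothetical sample of six people by location (illustrative only).
location = np.array(["Addis Ababa", "Hawassa", "Arba Minch",
                     "Addis Ababa", "Hawassa", "Arba Minch"])

d_addis = (location == "Addis Ababa").astype(float)
d_hawassa = (location == "Hawassa").astype(float)
d_arba = (location == "Arba Minch").astype(float)
const = np.ones(len(location))

# Trap: intercept plus all three dummies (the dummies sum to the constant).
X_trap = np.column_stack([const, d_addis, d_hawassa, d_arba])

# Remedy 1: intercept plus m - 1 = 2 dummies, Arba Minch as base.
X_base = np.column_stack([const, d_addis, d_hawassa])

# Remedy 2: all three dummies, no intercept.
X_noconst = np.column_stack([d_addis, d_hawassa, d_arba])

for name, X in [("trap", X_trap), ("base group", X_base),
                ("no intercept", X_noconst)]:
    rank = np.linalg.matrix_rank(X)
    print(f"{name:>12}: columns = {X.shape[1]}, rank = {rank}")
```

A rank below the number of columns means X'X is singular, so OLS has no unique solution; both remedies restore full column rank.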
To distinguish the two categories, male and female,
we have introduced only one dummy variable Di . For if
Di = 1 always denotes a male, when Di = 0 we know that
it is a female since there are only two possible outcomes.
Hence, one dummy variable suffices to distinguish
two categories.
The general rule is this: If a qualitative variable has
“m” categories, introduce only “m-1” dummy variables.
In the above example, sex has two categories, and
hence we introduced only a single dummy variable. If this
rule is not followed, we shall fall into what might be
called the dummy variable trap, that is, the situation
of perfect multicollinearity.
4.3.6 ANOVA and ANCOVA MODELS
1. ANOVA stands for Analysis of Variance. It is a regression
model in which the dependent variable is quantitative in
nature, but all the explanatory variables are qualitative in
nature (dummies).
There are two major types of ANOVA models:
ANOVA model with one qualitative variable
ANOVA model with two qualitative variables
2. ANCOVA stands for Analysis of Covariance. It is a regression
model that contains a mixture of qualitative and quantitative
explanatory variables.
NB. The interpretation of the dummy variables remains the same in
both ANOVA and ANCOVA models.
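As an illustration of an ANOVA model with one qualitative variable, the sketch below regresses a made-up salary series on a single dummy and compares the OLS estimates with the group means. The data are hypothetical, and `numpy.linalg.lstsq` stands in for whatever regression routine is used in practice.

```python
import numpy as np

# Hypothetical salaries (in thousands) and a dummy (1 = male, 0 = female).
y = np.array([20.0, 22.0, 24.0, 30.0, 32.0, 34.0])
d = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

# ANOVA model: Y = alpha1 + alpha2 * D + u, estimated by OLS.
X = np.column_stack([np.ones_like(d), d])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
alpha1_hat, alpha2_hat = coef

mean_base = y[d == 0].mean()    # mean of the base (D = 0) group
mean_other = y[d == 1].mean()   # mean of the D = 1 group

print("intercept:", round(alpha1_hat, 6), "base-group mean:", mean_base)
print("dummy coefficient:", round(alpha2_hat, 6),
      "difference in means:", mean_other - mean_base)
```

The intercept reproduces the base-group mean and the dummy coefficient reproduces the difference in group means, which is the interpretation carried over to ANCOVA once quantitative regressors are held constant.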
4.4 Dummy as Dependent Variable
A qualitative response model describes situations in which the
dependent variable in a regression equation represents a
discrete choice and takes only a limited number of values. Put
differently, it is a model for a dependent variable whose range of values is
substantively restricted.
On occasion, the variable that we are trying to explain may be
discrete rather than continuous.
Models that involve such variables go by several names:
Qualitative response models
Discrete choice models
Categorical dependent variable models
Dummy dependent variable models
Dichotomous dependent variable models
Limited dependent variable models
If the dependent variable of the model is a dummy, the usual
OLS technique is generally no longer appropriate. Instead, the maximum
likelihood estimation technique is typically used, because when the
dependent variable is a dummy, the objective is to estimate the
probability of something happening for given values of the
regressors.
In regression analysis, we often face a qualitative
response (dependent) variable of the “yes” or “no” type.
Discrete choice models dealing with such binary responses
are called binary choice models.
At this juncture, it is important to distinguish between:
Binary choices: the dependent variable can take two values.
Multiple choices: the dependent variable can take more than two
values.
Multinomial (unordered) choices: for example, work as a teacher, a clerk,
a self-employed person, a professional, or a factory worker.
Multinomial ordered choices: for example, strongly agree, agree,
neutral, disagree.
There are several types of such models. Some of them include:
Linear Probability Model (LPM)
Probit model
Logit model
Tobit (censored regression) model
Heckman two-stage model, etc.
Technically, it is possible to estimate binary choice models
using OLS.
A linear model for binary choices estimated by OLS is
called the linear probability model (LPM).
The primary objective in categorical response models is to
explain how observations fall into each category.
• For example, in the labor market case we may wish to
explain the labor force participation decision of a woman by
linking the dependent variable to explanatory variables like
age, education, marital status, etc.
Basic framework of binary models
4.4.1 The Linear Probability Model
It is a multiple regression model with a dependent variable that is
binary rather than continuous.
The term linear probability model comes from the fact that the right-hand
side of the equation is linear.
Because the dependent variable Y is binary, the population regression
function corresponds to the probability that the dependent variable
equals 1 given the explanatory variables, Xs, i.e.
E(Y | X1, …, Xk) = P(Y = 1 | X1, …, Xk) = β0 + β1X1 + … + βkXk
Interpreting the coefficients of an LPM
β1 is the change in the probability that Y = 1 associated with a unit change in
X1, holding the other regressors constant, i.e.
β1 = ΔP(Y = 1 | X1, …, Xk) / ΔX1
The regression coefficients in the LPM are estimated by
OLS.
The usual (heteroskedasticity-robust) OLS standard errors can
be used to construct confidence intervals and hypothesis
tests.
Let Pi be the probability that Yi = 1 (probability of success);
then 1 - Pi is the probability that Yi = 0 (probability of failure).
Therefore:
Yi      Probability
0       1 - Pi
1       Pi
Simple Linear Probability Model (LPM)
Y = β0 + β1X + ε, so that P(Y=1∣X) = β0 + β1X
Probability of Being Approved for a Loan
• P(Loan Approved=1∣Income) = β0 + β1⋅Income
• Let’s say after estimating the model using data, you get:
β0 = 0.1, β1 = 0.04
• So, the model becomes: P(Loan Approved) = 0.1 + 0.04⋅Income,
with Income measured in thousands of dollars
• Interpretation of Coefficients
• Intercept (β0 = 0.1):
• A person with $0 income has a 10% chance of getting approved
(probably only theoretical; banks rarely approve applicants with $0 income).
• Slope (β1 = 0.04):
• For every $1,000 increase in income, the probability of being approved
increases by 4 percentage points.
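A small sketch of the loan example, assuming (as the slope interpretation implies) that income is measured in thousands of dollars. The function name `lpm_probability` and the income values tried are hypothetical. It also shows that a linear probability function can produce fitted values above 1 for large enough incomes.

```python
def lpm_probability(income_thousands, b0=0.1, b1=0.04):
    """Fitted approval probability from the example LPM."""
    return b0 + b1 * income_thousands


# Hypothetical incomes, in thousands of dollars.
for income in (0, 10, 20, 25):
    p = lpm_probability(income)
    flag = "  <- outside the 0-1 interval" if not 0.0 <= p <= 1.0 else ""
    print(f"income = {income:>2}k: fitted probability = {p:.2f}{flag}")

# The effect of +$1,000 is the same 0.04 at every income level.
step_low = lpm_probability(1) - lpm_probability(0)
step_high = lpm_probability(21) - lpm_probability(20)
print("step at low income:", round(step_low, 4),
      "step at high income:", round(step_high, 4))
```

The constant step is what makes the LPM easy to interpret, and the out-of-range fitted value is one reason nonlinear probability models are often preferred.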
Advantages of the linear probability model
It is easy to estimate, and its results are easy to interpret.
Drawbacks of LPM
I. The partial effect of any explanatory variable is constant.
The dependent variable is discrete, while the independent
variables are a combination of discrete and continuous
variables.
II. The disturbances are not normally distributed. Because the
dependent variable Yᵢ assumes only two values (0 or 1),
the disturbances also take only two values for given values of
the regressors; that is, the error term
follows the Bernoulli distribution. As a result, uᵢ is
not normally distributed.
[Figure: LPM, Observed vs. Predicted. Observed Y (0 or 1) and predicted probabilities (yhat) plotted against X.]
A more suitable probability model has the following two features:
As Xi increases, Pi = P(Y = 1 | Xi) increases but never steps outside
the 0-1 interval.
The relationship between Pi and Xi is nonlinear, that is, “one
which approaches zero at slower and slower rates as Xi gets
small and approaches one at slower and slower rates as Xi gets
very large”.
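One function with both features is the logistic CDF, which underlies the logit model. The sketch below is only an illustration: the index z stands for a linear combination such as β0 + β1Xi, and the grid of z values is arbitrary.

```python
import math


def logistic(z):
    """Logistic CDF: maps any real index z into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


zs = [-6, -3, -1, 0, 1, 3, 6]  # arbitrary grid of index values
probs = [logistic(z) for z in zs]

for z, p in zip(zs, probs):
    print(f"z = {z:>2}: P = {p:.4f}")

# Increments shrink in the tails: the curve flattens near 0 and near 1.
step_middle = logistic(1) - logistic(0)
step_tail = logistic(6) - logistic(5)
print("step near the middle:", round(step_middle, 4))
print("step in the upper tail:", round(step_tail, 4))
```

The probabilities rise with z but stay strictly between 0 and 1, and the step in the tail is much smaller than the step near the middle.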
In the logit model, the coefficient β measures the change
in the log-odds for a unit change in a covariate.
Merits of Logit Model
• Logit analysis produces statistically sound results. By
allowing for the transformation of a dichotomous
dependent variable to a continuous variable ranging
from - ∞ to + ∞, the problem of out-of-range estimates
is avoided.
The logit analysis provides results that can be easily interpreted, and
the method is simple to apply.
It gives parameter estimates that are asymptotically consistent,
efficient, and normal, so that the analogue of the regression t-test can
be applied.
Demerits of Logit Model
Difference between Logit and LPM
In the LPM, the slope coefficient measures the marginal effect
of a unit change in the explanatory variable on the probability
of the outcome, holding other variables constant.
In the logit model, the marginal effect of a unit change in the
explanatory variable depends not only on the coefficient of that
variable but also on the level of probability from which the
change is measured.
That level of probability, in turn, depends on the values of all the explanatory
variables in the model.
The LPM assumes that Pi is linearly related to Xi, whereas the
logit model assumes that the log of the odds ratio is linearly related to
Xi.
How to interpret coefficients in both models?
In the logit model, β cannot be interpreted as a simple slope as in
ordinary regression, because the rate at which the curve
ascends or descends changes according to the value of
x. In other words, the change is not constant as in
ordinary regression.
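To make the point concrete, the sketch below evaluates the logit marginal effect, which for a single regressor is β·P·(1 - P), at a few values of x. The coefficient values `b0` and `b1` are hypothetical, and the function names are illustrative rather than taken from any library.

```python
import math


def logit_prob(x, b0, b1):
    """P(Y = 1 | x) under a logit model with index b0 + b1 * x."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))


def logit_marginal_effect(x, b0, b1):
    """dP/dx = b1 * P * (1 - P): depends on where x sits on the curve."""
    p = logit_prob(x, b0, b1)
    return b1 * p * (1.0 - p)


b0, b1 = 0.0, 1.0  # hypothetical coefficients, for illustration only
effects = {x: logit_marginal_effect(x, b0, b1) for x in (-2, 0, 2)}

for x, me in effects.items():
    print(f"x = {x:>2}: P = {logit_prob(x, b0, b1):.3f}, "
          f"marginal effect = {me:.3f}")
```

With the same β, the effect on the probability is largest where P is near 0.5 and shrinks toward the tails, whereas the LPM slope would be the same at every x.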