Assignment 2 Microeconometrics
Students:
Chervakova, Polina (731001)
Flores Castellanos, Brandon Ignacio (653334)
Mongelli, Nahuel (747159)
Stoop, Louise (617803)
Question 1.
Begin by describing the dataset using the relevant panel data commands. For each
variable discuss the different variation components and identify which variables are
time-variant and time-invariant.
Commands.
xtset pid wave
xtdescribe
xtsum
The “xtdescribe” command gives a general picture of the composition of the data set.
The panel contains 7,126 individuals observed over 17 waves.
Regarding the number of observations per individual, 75% of the sample has at least 3
observations, 50% has at least 7 observations, and 25% has at least 13 observations.
In the column “pattern”, a 1 indicates that the individual was observed in that wave
and a dot indicates that the individual was not. In this sample, 758 people answered
only in the first wave, 534 answered only in the first and second waves, and so on.
Only 601 people answered in all 17 waves. This is a first indication that the sample
may suffer from attrition, which we will need to test.
The “xtsum” command summarizes all the variables in the sample. First, the variables
“pid” and “wave” only identify the units and the moment in time at which the data was
collected, respectively. Second, to distinguish between the kinds of variables in the
sample, we mark the time-invariant variables in red and the time-variant ones in green.
The first group consists of two variables: gender (defined in the dataset as “male”)
and ethnicity. These are time-invariant because their within variation is zero. In
contrast, the time-variant variables have within variation greater than zero: age,
marital status (mstatus), education years (edyears), income, and childbirth (child_birth).
In addition, the sample has four categorical variables, two of them binary: marital
status and ethnicity have four categories each, while gender and childbirth have only
two. The marital status variable reveals that most observations in the sample belong to
people who have never been married (83.61% of the total). The ethnicity variable is
less concentrated, although one category still accounts for more than half of the
observations (Non-Black / Non-Hispanic, 51.43%).
Question 2.
Estimate the effect of childbirth on Log(income) using a Pooled OLS estimator and
controlling for age, male, education years, marital status categories, and ethnicity
categories.
Since we are required to estimate the effect of marital status on income, we will
generate dummy variables for mstatus.
Commands:
generate lnincome=log(income)
tab mstatus, gen(dmstatus)
reg lnincome child_birth age male edyears dmstatus2 dmstatus3 dmstatus4 i.ethnicity
display 100*(exp(_b[child_birth])-1)
display 100*(exp(_b[dmstatus2])-1)
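Applying the log-level formula 100*(exp(b)-1) to the estimated coefficient of
child_birth (−0.0795), childbirth is associated with a 7.6% lower income, keeping all
things constant (ceteris paribus). The effect is significant at our chosen 5% level
(the p-value is 0.000 and the 95% confidence interval does not contain zero).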
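As a worked check of the log-level interpretation, the percentage effect implied by the Pooled OLS coefficient reported in the log (−0.0794501) can be written out as follows. The symbol $\hat\beta$ is our shorthand for that coefficient, and the approximation $100\hat\beta$ is shown only for comparison.

```latex
\%\Delta\,\text{income}
  = 100\left(e^{\hat\beta} - 1\right)
  = 100\left(e^{-0.0794501} - 1\right) \approx -7.64,
\qquad \text{whereas} \qquad 100\,\hat\beta \approx -7.95 .
```

The gap between the two numbers is small here, but it grows with the size of the coefficient, which is why the exact formula is used throughout.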
b) Interpret the estimated coefficient for being married (sign, magnitude, and
significance)
As already mentioned, we created dummies for mstatus so that we can estimate the
effect of marital status on income. The reference category is “Never married”;
dmstatus2 indicates “Married”, dmstatus3 indicates “Separated or divorced”, and
dmstatus4 indicates “Widowed”.
As we have a log-level model, we apply the formula mentioned in 2a to get a more
precise value and a correct interpretation of the magnitude. Compared to people
who never married, married people’s income is 85.68% higher, ceteris paribus. The
coefficient is positive and significant at our chosen 5% level (the p-value is
0.000). This difference is presumably due to their income being a combined
household income.
c) Under what conditions will the Pooled OLS estimates of childbirth be unbiased
and efficient? Do you think these conditions hold? Discuss with one concrete
example for each condition.
Two conditions need to be satisfied for the Pooled OLS estimates of childbirth
to be unbiased and efficient: i) the error term is uncorrelated with the
independent variable (childbirth), which is needed for unbiasedness, and ii)
there is no serial correlation in the errors, which is needed for efficiency.
We think that these two conditions do not hold. Consider, for example, an
unobserved fertility problem. This unobserved factor directly affects our
independent variable (the probability of childbirth) and, if it also affects
income (for instance through health), it ends up in the error term and the
estimated coefficient is biased. In the same sense, this unobserved factor is
persistent, so it affects the errors of the same person at different points in
time, resulting in serial correlation in the errors.
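The two conditions can be made explicit with the usual error-components notation. The sketch below assumes the standard decomposition of the Pooled OLS error into unobserved heterogeneity 𝛼𝑖 and an idiosyncratic shock 𝜀𝑖𝑡 that is uncorrelated over time and with 𝛼𝑖; the symbols $v_{it}$, $x_{it}$ and $\rho$ are our notation, not the assignment's.

```latex
\begin{align*}
\ln(\text{income}_{it}) &= x_{it}\beta + v_{it}, \qquad v_{it} = \alpha_i + \varepsilon_{it} \\
\text{(i) unbiasedness:}\quad & E[v_{it} \mid x_{it}] = 0 \\
\text{(ii) no serial correlation:}\quad & \operatorname{Corr}(v_{it}, v_{is}) = 0 \quad \text{for } t \neq s \\
\text{but in general:}\quad & \operatorname{Corr}(v_{it}, v_{is})
   = \frac{\sigma_\alpha^2}{\sigma_\alpha^2 + \sigma_\varepsilon^2} = \rho > 0
   \quad \text{whenever } \sigma_\alpha^2 > 0
\end{align*}
```

With the RE variance estimates reported in the log (sigma_u = .4411512, sigma_e = 1.1765995), this correlation is about 0.12, which is the value Stata reports as rho.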
Question 3.
Use a fixed effects (FE) estimator to study the effect of childbirth on Log(income)
controlling for age, male, education years, marital status categories, and ethnicity
categories.
Commands:
xtset pid wave
xtreg lnincome child_birth age male edyears dmstatus2 dmstatus3 dmstatus4
i.ethnicity, fe
display 100*(exp(_b[child_birth])-1)
Table 7 (continued). Fixed effects log-level regression
Stata omits the variables that do not change over time, that is, the variables
that show no within variation (male and the ethnicity categories). This happens
because the FE model uses only within variation, which is also why the results
differ from those obtained with Pooled OLS.
The FE model controls for all time-invariant characteristics of individuals, so
variables that do not vary across time (such as gender or ethnicity) are absorbed
by the unit-specific effect, and it is neither necessary nor possible to include
them explicitly in the regression.
Moreover, after the within transformation these variables are equal to 0 for
every observation, so they are perfectly collinear with the individual fixed
effects. Perfect collinearity occurs when one variable is an exact linear
combination of other variables, which makes it impossible for the regression to
identify the separate effect of that variable. Stata automatically detects this
and omits the collinear variables.
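A minimal sketch of why time-invariant regressors drop out, using a hypothetical six-row toy panel (not the assignment data). The names mean_male, mean_cb, male_within and cb_within are ours, and the manual demeaning only mimics what the within transformation does inside xtreg, fe.

```stata
* Hypothetical toy panel: 2 individuals, 3 waves each (not the assignment data)
clear
input pid wave male child_birth
1 1 1 0
1 2 1 1
1 3 1 0
2 1 0 0
2 2 0 0
2 3 0 1
end

* Individual-specific means of each regressor
bysort pid: egen mean_male = mean(male)
bysort pid: egen mean_cb   = mean(child_birth)

* Within (demeaned) versions: deviation from the individual's own mean
gen male_within = male - mean_male
gen cb_within   = child_birth - mean_cb

* male_within is zero for everyone, so it carries no information;
* cb_within still varies, so its coefficient can be estimated
summarize male_within cb_within
```

The demeaned gender variable has zero variance, which is the situation Stata flags with "omitted because of collinearity" in the FE output.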
c) In which situation would you prefer to estimate the effect of childbirth using a
fixed effects estimation compared to a pooled OLS estimation? Provide a
concrete example based on the research objective of this assignment.
If we believe that there are unit-specific, time-invariant factors that affect both
the likelihood of childbirth and income, then using fixed effects is preferable.
Pooled OLS assumes that there are no omitted variables that are correlated with
both the independent variables (like childbirth) and the dependent variable
(income). However, this assumption is often unrealistic in the context of social
science research, especially when individual-specific characteristics play a
role.
Question 4.
Use a random effects (RE) estimator to study the effect of childbirth on Log(income)
controlling for age, male, education years, all indicator variables of marital status, and
all indicator variables of ethnicity.
Commands.
xtset pid wave
xtreg lnincome child_birth age male edyears dmstatus2 dmstatus3 dmstatus4
i.ethnicity, re
display 100*(exp(_b[child_birth])-1)
display 1-[([e(sigma_e)]^2)/(17*([e(sigma_u)]^2)+([e(sigma_e)]^2))]^(1/2)
a) Compare the coefficients for childbirth of the RE with those of the Pooled OLS
and FE.
Now, let’s compare the magnitudes of the coefficients for childbirth in our
models. In absolute value, the coefficient is 0.060 in RE, 0.054 in FE, and
0.079 in Pooled OLS. The difference between the RE and FE coefficients stems
from the fact that RE uses not only within variation but also between variation.
The larger effect in Pooled OLS comes from its inability to account for
unobserved unit-specific effects: since Pooled OLS does not control for
unobserved heterogeneity, the childbirth effect may be biased by the omission of
these factors. This would result in an overestimated negative impact of
childbirth on income compared to the RE model, which partially controls for
unobserved heterogeneity.
One of the most important advantages of RE estimation is that it uses not only
within variation, as FE estimation does, but also between variation, as Pooled
OLS does. Moreover, RE is the better tool when there is serial correlation in
the errors, which seriously affects the efficiency of Pooled OLS.
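The sense in which RE sits between the two other estimators can be written as a quasi-demeaned regression. This is a sketch in our own notation ($\theta$ for the demeaning factor, bars for individual means, $v_{it}$ for the composite error, with the constant absorbed in $x_{it}$), under the standard RE assumptions.

```latex
\begin{align*}
y_{it} - \theta\,\bar{y}_i &= (x_{it} - \theta\,\bar{x}_i)\beta + (v_{it} - \theta\,\bar{v}_i),
\qquad 0 \le \theta \le 1 \\
\theta = 0 &: \ \text{Pooled OLS (no demeaning)} \\
\theta = 1 &: \ \text{FE (full within transformation)}
\end{align*}
```

Values of $\theta$ strictly between 0 and 1 keep part of the between variation, which is where the efficiency gain of RE relative to FE comes from.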
c) What is the value of the demeaning factor in the RE model? Based on this, is the
RE closer to FE or Pooled OLS? Explain why.
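The demeaning factor is
$\theta = 1 - \sqrt{\sigma_\varepsilon^2 / (T\sigma_\alpha^2 + \sigma_\varepsilon^2)}$,
where $T = 17$ is the number of waves, $\sigma_\alpha^2$ is the variance of
the unit-specific variation, and $\sigma_\varepsilon^2$ is the idiosyncratic shock
variance. Applying the formula, we obtain a value of 0.457. Since this value is
slightly below 0.5, the demeaning factor is marginally closer to zero (Pooled
OLS) than to one (FE), so the RE model is closer to Pooled OLS.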
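The calculation can be reproduced without re-estimating the model, as sketched below with the variance components copied from the RE output in the log. The scalar names s_u, s_e, theta17 and theta_T are ours, and the loop over smaller values of T is a hypothetical check added for illustration, not part of the assignment.

```stata
* Variance components copied from the RE output in the log
scalar s_u = .4411512      // sigma_u
scalar s_e = 1.1765995     // sigma_e

* Demeaning factor with T = 17, as in the assignment command
scalar theta17 = 1 - sqrt(s_e^2/(17*s_u^2 + s_e^2))
display "theta (T = 17) = " %9.7f theta17

* Hypothetical check: the factor for individuals observed in fewer waves
foreach T of numlist 1 3 7 13 17 {
    scalar theta_T = 1 - sqrt(s_e^2/(`T'*s_u^2 + s_e^2))
    display "T = `T'   theta = " %6.4f theta_T
}
```

The first display reproduces the 0.457 reported above. Because the panel is unbalanced, T = 17 is the maximum number of waves; the factor shrinks as T falls, so individuals observed in only a few waves are treated even more like Pooled OLS.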
Question 5.
Without doing any analysis:
a) Compare the assumptions for Pooled OLS, FE and RE. Theoretically discuss in
which cases you would prefer each method.
To use Pooled OLS we assume that the zero conditional mean assumption holds
and that there is no serial correlation. If these assumptions hold, Pooled OLS
exploits all the variation in the data and is the best linear unbiased
estimator, so we prefer Pooled OLS over FE and RE.
If the coefficients of the time-varying variables differ significantly between
RE and FE, this implies that time-invariant characteristics do matter. Then RE
is likely to be biased, and FE is more appropriate because it accounts for all
time-invariant characteristics.
b) Using the practical context of the assignment, give an example situation in the
form of a DAG in which Pooled OLS would be the preferred method and discuss
why the necessary assumptions hold in that example. Do the same for FE and
RE.
In this case, we assume that both assumptions hold, for different reasons. First,
if we can argue that all relevant variables that affect income are captured in
our dataset, then the model has no problem with unobserved heterogeneity (𝛼𝑖)
or with the idiosyncratic shock (𝜀𝑖𝑡). Second, there is no serial correlation if
the errors are uncorrelated across time. For example, if we can support the idea
that “motivation” does not affect income systematically, then we do not have
serial correlation.
The RE regression model, like Pooled OLS, uses both between and within
variation. The most important difference from Pooled OLS is that RE
quasi-demeans the data, which removes the serial correlation problem and makes
it more efficient. Furthermore, for the RE estimator to be unbiased we need, as
in the POLS model, the zero conditional mean assumption to hold (the entire
error is uncorrelated with the variable of interest). If there is serial
correlation, however, we prefer RE to POLS because RE fixes this problem and
yields a more efficient estimator. Using the same example, suppose that we
cannot rule out that “motivation” affects income. First, we would not opt for
the within estimators because they use only within variation (and here the
problem is serial correlation). Second, the model we can use to deal with the
serial correlation problem is the RE model.
Figure 2. RE DAG
Finally, the main difference between FE and RE or POLS is that FE uses only
within variation to estimate the coefficients, so that only the idiosyncratic
error (𝜀𝑖𝑡) remains. In this sense, we prefer the FE model when the unobserved
heterogeneity (𝛼𝑖) is correlated with the variable of interest. For instance,
if “genetic material” systematically affects both income and the probability of
having a child, we prefer the FE model to the others because it removes the
unobserved heterogeneity problem.
Figure 3. FE DAG
Question 6.
Estimate a Correlated Random Effects (CRE) model on the effect of childbirth on
Log(income) controlling for age, male, education years, all indicator variables of
marital status, and all indicator variables of ethnicity.
Commands:
xtset pid wave
bysort pid: egen av_age=mean(age)
bysort pid: egen av_child_birth=mean(child_birth)
bysort pid: egen av_edyears=mean(edyears)
bysort pid: egen av_dmstatus2=mean(dmstatus2)
bysort pid: egen av_dmstatus3=mean(dmstatus3)
bysort pid: egen av_dmstatus4=mean(dmstatus4)
xtreg lnincome child_birth age male edyears dmstatus2 dmstatus3 dmstatus4
i.ethnicity av_child_birth av_age av_edyears av_dmstatus2 av_dmstatus3
av_dmstatus4
display 100*(exp(_b[child_birth])-1)
test av_age=av_child_birth=av_edyears=av_dmstatus2=av_dmstatus3=av_dmstatus4=0
Table 9 (continued). Correlated Random Effects (CRE) log-level regression
The CRE estimator accounts for potential endogeneity between the unobserved
individual-specific effects and the independent variables.
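The idea behind the CRE specification can be sketched as follows. The notation is ours: $x_{it}$ collects the time-varying regressors (child_birth, age, edyears and the marital status dummies), $z_i$ the time-invariant ones (male, ethnicity), and $\bar{x}_i$ the individual averages that the commands above store in the av_ variables.

```latex
\begin{align*}
\alpha_i &= \bar{x}_i\gamma + a_i, \qquad E[a_i \mid x_{i1}, \dots, x_{iT}] = 0 \\
\ln(\text{income}_{it}) &= x_{it}\beta + z_i\delta + \bar{x}_i\gamma + a_i + \varepsilon_{it} \\
H_0 &: \ \gamma = 0 \quad \text{(the RE exogeneity assumption holds)}
\end{align*}
```

Under this sketch, the joint test of the av_ variables is a test of $\gamma = 0$, and $\delta$ remains estimable because the time-invariant variables are not differenced away.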
b) What is one advantage of the CRE estimator in comparison to the FE?
The CRE estimator retains the between-individual variation, unlike FE, which
only focuses on the within-individual variation.
c) Compare the estimated coefficient for childbirth of the CRE with those of the RE
and FE. Explain why the estimated coefficients are (not) similar?
The coefficients for childbirth in the CRE and FE models are exactly the same
(−0.0545 / −0.0545). This similarity arises because both models account for the
correlation between the unobserved individual-specific effects and the
independent variables (in this case, childbirth). CRE includes the averages of
the time-varying variables to control for potential endogeneity, which aligns it
more closely with the FE model.
On the other hand, the coefficient in the RE model is slightly more negative at
−0.0599, indicating a 5.81% decrease in income due to childbirth, keeping all
things constant (ceteris paribus), which is larger than the estimate in the CRE
and FE models (5.30%). This effect is significant at our chosen 5% level (the
p-value is 0.001 and the 95% confidence interval does not contain zero).
This difference occurs because the RE model assumes exogeneity (i.e., that the
unobserved individual-specific effects are uncorrelated with the independent
variables). In this case, however, childbirth is likely correlated with the
unobserved individual-specific effects. Because the RE model does not account
for this correlation, its estimate is slightly biased, which leads to a more
negative coefficient compared to the CRE and FE models.
In our case, the CRE estimates indicate that several of the average variables
(e.g., av_edyears, av_age) are significant at the 5% level, and the joint test
of the six averages rejects the null hypothesis that they are all zero
(chi2(6) = 111.11, p-value = 0.0000). This suggests that the time-varying
variables are correlated with the unobserved individual-specific effects, which
means that the assumption of exogeneity, on which the RE model relies, is
likely violated.
Based on the CRE estimates, exogeneity does not hold, and therefore the RE
model is biased. Between the CRE and FE models, we would recommend
choosing the CRE model for the following reasons:
The CRE model allows us to include and estimate the effects of time invariant
variables like gender and ethnicity. In our case, these variables are likely
important to our analysis (since gender plays a key role in evaluating the impact
of childbirth). The FE model cannot estimate the effects of time-invariant
variables because it eliminates them through demeaning.
• Addressing Endogeneity:
• Efficiency:
The CRE model is generally more efficient than the FE model because it uses
both within-individual and between-individual variation. FE only uses within-
individual variation, which can reduce efficiency, especially when time-
invariant variables are relevant.
Question 7.
Implement the Hausman test based on the models above.
a) How does the Hausman test contribute to the estimator decision-making
process?
The Hausman test helps decide between the FE and RE models. The test checks
whether the RE estimator is consistent by comparing it with the FE estimator.
The null hypothesis (𝐻0 ) of the test states that the differences between the
coefficients of the FE and RE models are not systematic (i.e., the RE model is
consistent and efficient). The alternative hypothesis (𝐻1 ) is that the differences
between the coefficients are systematic, implying that the RE model is
inconsistent and biased.
If the test rejects the null hypothesis, we conclude that the RE model is
inconsistent and should not be used. Instead, we choose the FE model, which
is consistent but less efficient. If the test fails to reject the null, we conclude that
RE is both consistent and more efficient, making it the preferred estimator.
b) What do you find? In your answer, make sure to state the null hypothesis of the
test, whether you reject the test, and the significance level used.
Commands:
xtset pid wave
est clear
xtreg lnincome child_birth age male edyears dmstatus2 dmstatus3 dmstatus4
i.ethnicity, fe
estimates store fe
xtreg lnincome child_birth age male edyears dmstatus2 dmstatus3 dmstatus4
i.ethnicity, re
estimates store re
hausman fe re
Table 10. Hausman test
The null hypothesis (𝐻0 ) of the test states that the differences between the
coefficients of the FE and RE models are not systematic (i.e., the RE model is
consistent and efficient). The alternative hypothesis (𝐻1 ) is that the differences
between the coefficients are systematic, implying that the RE model is
inconsistent and biased.
The test statistic is chi2(6) = 141.94 with a p-value of 0.0000, so we reject
the null hypothesis at the 5% significance level. The significant Hausman test
result suggests that the assumptions behind the RE model (i.e., no correlation
between the individual effects and the regressors) do not hold. In this case,
the FE model should be preferred because it is consistent, even though it may be
less efficient than RE.
Since the test shows systematic differences between the FE and RE estimates,
this suggests that endogeneity is present (i.e., the unobserved individual
effects are correlated with the explanatory variables). Thus, exogeneity is
unlikely to hold, and the RE estimator would be biased. The FE estimator remains
consistent in the presence of such correlation, making it the appropriate choice
for this analysis.
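The rejection decision can be checked by hand from the statistic in the log, as in the sketch below. The scalar names chi2_stat, hdf, pval and crit are ours; chi2tail() and invchi2tail() are Stata's built-in chi-squared functions, and the 5% level is the one used throughout the assignment.

```stata
* Values copied from the Hausman output in the log
scalar chi2_stat = 141.94
scalar hdf       = 6

* Upper-tail p-value and the 5% critical value of a chi2(6) distribution
scalar pval = chi2tail(hdf, chi2_stat)
scalar crit = invchi2tail(hdf, 0.05)

display "p-value           = " %12.4e pval
display "5% critical value = " %6.3f crit
display "reject H0 at 5%   = " (chi2_stat > crit)
```

The statistic is far above the critical value, which is consistent with the Prob > chi2 = 0.0000 line in the log.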
Question 8.
In the previous analysis, you have assumed that the effect of childbirth is the same for
males and females.
a) How likely do you think that this is the case? Discuss without any further
analysis.
It is unlikely that the effect of childbirth on income is the same for males and
females. Childbirth typically has a much larger negative impact on women's
income due to caregiving responsibilities, societal expectations, and possible
career interruptions. For men, childbirth may not have a negative effect and
could even have a positive impact as men might feel pressure to increase their
work effort to support the family financially.
Commands:
gen child_birth_male=child_birth*male
xtreg lnincome child_birth age male child_birth_male edyears dmstatus2
dmstatus3 dmstatus4 i.ethnicity, fe
display 100*(exp(_b[child_birth])-1)
display 100*(exp(_b[child_birth_male]+_b[child_birth])-1)
test child_birth_male
Table 11. FE log-level regression testing for different effects of childbirth
according to gender
The coefficient for childbirth is −0.339 and is statistically significant at our
chosen 5% level (p-value = 0.000). Because the interaction term child_birth_male
is zero for females, this coefficient is the effect for females: having a child
is associated with a 28.76% reduction in income (ceteris paribus). The main
effect of male itself is omitted due to collinearity with the fixed effects.
This suggests that childbirth increases male income by approximately 6.27%,
holding all else constant (ceteris paribus).
These results suggest that the effect of childbirth on income differs significantly
between males and females, with females earning considerably less income
compared to males (ceteris paribus).
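The two gender-specific effects follow from the interaction specification, sketched here in our own notation: $\beta_1$ is the coefficient of child_birth and $\beta_2$ the coefficient of child_birth_male. Only the rounded value of $\beta_1$ quoted above is used, so the result differs slightly from the 28.76% obtained with the unrounded coefficient.

```latex
\begin{align*}
\ln(\text{income}_{it}) &= \beta_1\,\text{child\_birth}_{it}
  + \beta_2\,(\text{child\_birth}_{it} \times \text{male}_i) + \dots + \alpha_i + \varepsilon_{it} \\
\text{females } (\text{male}_i = 0) &: \quad 100\,(e^{\beta_1} - 1) = 100\,(e^{-0.339} - 1) \approx -28.8 \\
\text{males } (\text{male}_i = 1) &: \quad 100\,(e^{\beta_1 + \beta_2} - 1)
\end{align*}
```

The test of child_birth_male is therefore a test of $\beta_2 = 0$, that is, of whether the childbirth effect is the same for both genders.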
Question 9.
Finally, evaluate your data again and discuss whether there is attrition or not in your
sample. Based on your preferred model, is it likely that there is attrition bias? What do
you conclude? How does this impact your conclusions?
Commands:
xtset pid wave
xtdescribe
* auxiliary variable equal to 1 for every observation; used to check whether the
* individual is observed in the next wave
gen sample=1
* nextwave is 1 if sample is equal to 1 in the following wave and 0 if the value
* for sample is missing in the next wave
gen nextwave=F.sample==1
xtreg lnincome nextwave child_birth age male edyears dmstatus2 dmstatus3
dmstatus4 i.ethnicity, fe
display 100*(exp(_b[nextwave])-1)
The output of “xtset pid wave” already hints at attrition in our sample: the first
line describes the panel as “unbalanced”, which means that we do not have data for
every unit in all time periods. The “xtdescribe” command then gives the distribution
of time periods and participation patterns. In the first row, for instance, we see
that 758 people were observed for one period only and then dropped out.
We start by creating an auxiliary variable “sample”, one for everyone, which is going to
be used to indicate if sample is not missing for the next wave. Then, we create
“nextwave”, that takes value 1, if sample=1 in the following wave, and 0 otherwise.
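How the indicator behaves can be seen on a hypothetical toy panel (three individuals, at most three waves; not the assignment data). The sketch reuses the two gen commands from above unchanged.

```stata
* Hypothetical unbalanced toy panel (not the assignment data)
clear
input pid wave
1 1
1 2
1 3
2 1
2 2
3 1
end
xtset pid wave

* Same construction as in the assignment
gen sample = 1
gen nextwave = F.sample == 1

* F.sample is missing when the individual is not observed in wave + 1,
* so the comparison evaluates to 0 in that case
list pid wave nextwave, sepby(pid)
```

Note that individual 1 is coded 0 in wave 3 although no later wave exists in the toy panel. By the same logic, observations from wave 17 in the assignment data are coded 0 even for people who never dropped out, which may be worth keeping in mind when reading the nextwave coefficient.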
We find that the coefficient for “nextwave” is 0.059 (p-value = 0.000), which is
statistically significant at our chosen 5% level. This indicates that there is likely
attrition bias. Applying the formula for the log-level model, individuals who remain
in the survey for the following wave tend to have a 6.1% higher income than those who
drop out in the next wave (ceteris paribus). Therefore, since individuals with higher
income are more likely to remain in the panel and those with lower income are more
likely to drop out, the sample becomes skewed towards higher-income individuals over
time.
This bias could affect the validity of any conclusions drawn from the data, particularly
if the analysis does not account for the fact that lower-income individuals are more
likely to drop out. As a result, income-related estimates may be upwardly biased,
overestimating the true average income of the population.
STATA log file
--------------------------------------------------------------------------------------------
-------------------------------------------------
name: <unnamed>
log: /Users/polinacervakova/Desktop/Erasmus/Applied Microeconometrics/Assignment 2/Childbirth.log
log type: text
opened on: 1 Oct 2024, 19:18:11
.
. * Q1. Panel data description
. describe
. xtdescribe
758 10.64 10.64 | 1................
601 8.43 19.07 | 11111111111111111
534 7.49 26.56 | 11...............
449 6.30 32.87 | 111..............
347 4.87 37.74 | 11111............
343 4.81 42.55 | 1111.............
260 3.65 46.20 | 111111...........
214 3.00 49.20 | 1111111..........
198 2.78 51.98 | 11111111.........
3422 48.02 100.00 | (other patterns)
---------------------------+-------------------
7126 100.00 | XXXXXXXXXXXXXXXXX
. xtsum
. tab mstatus
Separated or divorced | 1,413 2.53 99.95
Widowed | 28 0.05 100.00
----------------------+-----------------------------------
Total | 55,874 100.00
. tab ethnicity
.
. * Q2. Pooled OLS
. generate lnincome=log(income)
. reg lnincome child_birth age male edyears dmstatus2 dmstatus3 dmstatus4 i.ethnicity
--------------------------------------------------------------------------------------------
lnincome | Coefficient Std. err. t P>|t| [95% conf. interval]
---------------------------+----------------------------------------------------------------
child_birth | -.0794501 .0184009 -4.32 0.000 -.1155159 -.0433843
age | .1262801 .0014485 87.18 0.000 .123441 .1291192
male | .5999785 .0126318 47.50 0.000 .5752201 .6247369
edyears | .145754 .0026827 54.33 0.000 .1404959 .151012
dmstatus2 | .6188734 .0184764 33.50 0.000 .5826594 .6550873
dmstatus3 | .0766986 .0375023 2.05 0.041 .0031939 .1502032
dmstatus4 | .0647223 .2531113 0.26 0.798 -.4313775 .560822
|
ethnicity |
Hispanic | .3460341 .0165731 20.88 0.000 .3135507 .3785176
Mixed Race (Non-Hispanic) | .0541551 .0588592 0.92 0.358 -.0612093 .1695194
Non-Black / Non-Hispanic | .4285689 .0139208 30.79 0.000 .4012841 .4558537
|
_cons | 2.587026 .0323465 79.98 0.000 2.523627 2.650425
--------------------------------------------------------------------------------------------
. display 100*(exp(_b[child_birth])-1)
-7.6375916
. display 100*(exp(_b[dmstatus2])-1)
85.68349
.
.
. * Q3. Fixed effects
. xtset pid wave
. xtreg lnincome child_birth age male edyears dmstatus2 dmstatus3 dmstatus4 i.ethnicity, fe
note: male omitted because of collinearity.
note: 2.ethnicity omitted because of collinearity.
note: 3.ethnicity omitted because of collinearity.
note: 4.ethnicity omitted because of collinearity.
--------------------------------------------------------------------------------------------
lnincome | Coefficient Std. err. t P>|t| [95% conf. interval]
---------------------------+----------------------------------------------------------------
child_birth | -.0544722 .0179502 -3.03 0.002 -.0896548 -.0192896
age | .1259277 .0016685 75.47 0.000 .1226575 .1291979
male | 0 (omitted)
edyears | .1609736 .0038254 42.08 0.000 .1534758 .1684713
dmstatus2 | .4178146 .0222754 18.76 0.000 .3741545 .4614746
dmstatus3 | -.0111624 .0431511 -0.26 0.796 -.0957392 .0734143
dmstatus4 | .1574529 .2860796 0.55 0.582 -.4032667 .7181725
|
ethnicity |
Hispanic | 0 (omitted)
Mixed Race (Non-Hispanic) | 0 (omitted)
Non-Black / Non-Hispanic | 0 (omitted)
|
_cons | 3.153631 .0377618 83.51 0.000 3.079618 3.227645
---------------------------+----------------------------------------------------------------
sigma_u | .81739968
sigma_e | 1.1765995
rho | .3255215 (fraction of variance due to u_i)
--------------------------------------------------------------------------------------------
F test that all u_i=0: F(7125, 48742) = 3.89 Prob > F = 0.0000
. display 100*(exp(_b[child_birth])-1)
-5.3015153
.
. * Q4. Random effects
. xtset pid wave
. xtreg lnincome child_birth age male edyears dmstatus2 dmstatus3 dmstatus4 i.ethnicity, re
--------------------------------------------------------------------------------------------
lnincome | Coefficient Std. err. z P>|z| [95% conf. interval]
---------------------------+----------------------------------------------------------------
child_birth | -.0598945 .0173605 -3.45 0.001 -.0939204 -.0258686
age | .1272876 .0014948 85.16 0.000 .124358 .1302173
male | .5698314 .0167349 34.05 0.000 .5370316 .6026312
edyears | .1509147 .0030581 49.35 0.000 .1449209 .1569085
dmstatus2 | .5107319 .019666 25.97 0.000 .4721873 .5492766
dmstatus3 | .0454112 .0391588 1.16 0.246 -.0313386 .122161
dmstatus4 | .0855097 .2618702 0.33 0.744 -.4277464 .5987658
|
ethnicity |
Hispanic | .3177513 .0230481 13.79 0.000 .2725779 .3629247
Mixed Race (Non-Hispanic) | .0588821 .0820328 0.72 0.473 -.1018992 .2196634
Non-Black / Non-Hispanic | .3832821 .0191953 19.97 0.000 .3456599 .4209042
|
_cons | 2.559046 .035206 72.69 0.000 2.490044 2.628049
---------------------------+----------------------------------------------------------------
sigma_u | .4411512
sigma_e | 1.1765995
rho | .1232516 (fraction of variance due to u_i)
--------------------------------------------------------------------------------------------
. display 100*(exp(_b[child_birth])-1)
-5.8136148
. display 1-[([e(sigma_e)]^2)/(17*([e(sigma_u)]^2)+([e(sigma_e)]^2))]^(1/2)
.45686071
.
.
. * Q6. CRE
. xtset pid wave
.
. bysort pid: egen av_age=mean(age)
.
. xtreg lnincome child_birth age male edyears dmstatus2 dmstatus3 dmstatus4 i.ethnicity av_child_birth av_age av_edyears av_dmstatus2 av_dmstatus3 av_dmstatus4
--------------------------------------------------------------------------------------------
lnincome | Coefficient Std. err. z P>|z| [95% conf. interval]
---------------------------+----------------------------------------------------------------
child_birth | -.0544722 .0183181 -2.97 0.003 -.0903749 -.0185695
age | .1259277 .0017027 73.96 0.000 .1225906 .1292649
male | .5514907 .0179655 30.70 0.000 .516279 .5867024
edyears | .1609736 .0039038 41.24 0.000 .1533224 .1686248
dmstatus2 | .4178146 .0227319 18.38 0.000 .3732609 .4623683
dmstatus3 | -.0111624 .0440355 -0.25 0.800 -.0974703 .0751455
dmstatus4 | .1574529 .2919425 0.54 0.590 -.4147438 .7296496
|
ethnicity |
Hispanic | .294627 .0232277 12.68 0.000 .2491015 .3401526
Mixed Race (Non-Hispanic) | .0624566 .0820323 0.76 0.446 -.0983238 .223237
Non-Black / Non-Hispanic | .3718733 .0196553 18.92 0.000 .3333496 .4103971
|
av_child_birth | -.116747 .0580712 -2.01 0.044 -.2305644 -.0029296
av_age | .009513 .0038225 2.49 0.013 .0020211 .0170049
av_edyears | -.031524 .0064149 -4.91 0.000 -.044097 -.0189511
av_dmstatus2 | .4106768 .0473467 8.67 0.000 .3178789 .5034747
av_dmstatus3 | -.005703 .1021948 -0.06 0.955 -.2060011 .1945951
av_dmstatus4 | -.5879614 .659662 -0.89 0.373 -1.880875 .7049524
_cons | 2.635625 .062808 41.96 0.000 2.512523 2.758726
---------------------------+----------------------------------------------------------------
sigma_u | .4411512
sigma_e | 1.1765995
rho | .1232516 (fraction of variance due to u_i)
--------------------------------------------------------------------------------------------
.
. display 100*(exp(_b[child_birth])-1)
-5.3015153
.
. test av_age=av_child_birth=av_edyears=av_dmstatus2=av_dmstatus3=av_dmstatus4=0
( 1) - av_child_birth + av_age = 0
( 2) av_age - av_edyears = 0
( 3) av_age - av_dmstatus2 = 0
( 4) av_age - av_dmstatus3 = 0
( 5) av_age - av_dmstatus4 = 0
( 6) av_age = 0
chi2( 6) = 111.11
Prob > chi2 = 0.0000
.
. * Q7. Hausman
. * H0: There are no systematic differences between the random effects and fixed effects coefficients => Random effects is chosen because it is more efficient
. * H1: There are systematic differences between fixed effects and random effects => Fixed effects is preferred because random effects is biased
. xtset pid wave
. est clear
.
. xtreg lnincome child_birth age male edyears dmstatus2 dmstatus3 dmstatus4 i.ethnicity, fe
note: male omitted because of collinearity.
note: 2.ethnicity omitted because of collinearity.
note: 3.ethnicity omitted because of collinearity.
note: 4.ethnicity omitted because of collinearity.
Group variable: pid Number of groups = 7,126
--------------------------------------------------------------------------------------------
lnincome | Coefficient Std. err. t P>|t| [95% conf. interval]
---------------------------+----------------------------------------------------------------
child_birth | -.0544722 .0179502 -3.03 0.002 -.0896548 -.0192896
age | .1259277 .0016685 75.47 0.000 .1226575 .1291979
male | 0 (omitted)
edyears | .1609736 .0038254 42.08 0.000 .1534758 .1684713
dmstatus2 | .4178146 .0222754 18.76 0.000 .3741545 .4614746
dmstatus3 | -.0111624 .0431511 -0.26 0.796 -.0957392 .0734143
dmstatus4 | .1574529 .2860796 0.55 0.582 -.4032667 .7181725
|
ethnicity |
Hispanic | 0 (omitted)
Mixed Race (Non-Hispanic) | 0 (omitted)
Non-Black / Non-Hispanic | 0 (omitted)
|
_cons | 3.153631 .0377618 83.51 0.000 3.079618 3.227645
---------------------------+----------------------------------------------------------------
sigma_u | .81739968
sigma_e | 1.1765995
rho | .3255215 (fraction of variance due to u_i)
--------------------------------------------------------------------------------------------
F test that all u_i=0: F(7125, 48742) = 3.89 Prob > F = 0.0000
. estimates store fe
.
. xtreg lnincome child_birth age male edyears dmstatus2 dmstatus3 dmstatus4 i.ethnicity, re
--------------------------------------------------------------------------------------------
lnincome | Coefficient Std. err. z P>|z| [95% conf. interval]
---------------------------+----------------------------------------------------------------
child_birth | -.0598945 .0173605 -3.45 0.001 -.0939204 -.0258686
age | .1272876 .0014948 85.16 0.000 .124358 .1302173
male | .5698314 .0167349 34.05 0.000 .5370316 .6026312
edyears | .1509147 .0030581 49.35 0.000 .1449209 .1569085
dmstatus2 | .5107319 .019666 25.97 0.000 .4721873 .5492766
dmstatus3 | .0454112 .0391588 1.16 0.246 -.0313386 .122161
dmstatus4 | .0855097 .2618702 0.33 0.744 -.4277464 .5987658
|
ethnicity |
Hispanic | .3177513 .0230481 13.79 0.000 .2725779 .3629247
Mixed Race (Non-Hispanic) | .0588821 .0820328 0.72 0.473 -.1018992 .2196634
Non-Black / Non-Hispanic | .3832821 .0191953 19.97 0.000 .3456599 .4209042
|
_cons | 2.559046 .035206 72.69 0.000 2.490044 2.628049
---------------------------+----------------------------------------------------------------
sigma_u | .4411512
sigma_e | 1.1765995
rho | .1232516 (fraction of variance due to u_i)
--------------------------------------------------------------------------------------------
. estimates store re
.
. hausman fe re
chi2(6) = (b-B)'[(V_b-V_B)^(-1)](b-B)
= 141.94
Prob > chi2 = 0.0000
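As a rough check on the reported Hausman result, the p-value can be recomputed from the statistic alone. The sketch below is illustrative only: the scalar names are hypothetical, and the statistic and degrees of freedom are typed in by hand from the output above rather than taken from stored results.

```stata
* Values copied by hand from the hausman output above (hypothetical scalar names)
scalar h_stat = 141.94      // Hausman chi-squared statistic
scalar h_df   = 6           // degrees of freedom reported by hausman

* Upper-tail probability of a chi-squared(6) variable at 141.94
scalar h_p = chi2tail(h_df, h_stat)
display "p-value = " %12.0g h_p

* Critical value at the 5% level, for comparison with the statistic
display "5% critical value = " %6.3f invchi2tail(h_df, 0.05)
```

Because the statistic lies far above the 5% critical value, H0 is rejected, which is consistent with keeping the fixed effects estimates.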
.
.
. * Q8. Gender and childbirth
. gen child_birth_male=child_birth*male
. xtreg lnincome child_birth age male child_birth_male edyears dmstatus2 dmstatus3 dmstatus4 i.ethnicity, fe
note: male omitted because of collinearity.
note: 2.ethnicity omitted because of collinearity.
note: 3.ethnicity omitted because of collinearity.
note: 4.ethnicity omitted because of collinearity.
R-squared: Obs per group:
Within = 0.3398 min = 1
Between = 0.5022 avg = 7.8
Overall = 0.3762 max = 17
--------------------------------------------------------------------------------------------
lnincome | Coefficient Std. err. t P>|t| [95% conf. interval]
---------------------------+----------------------------------------------------------------
child_birth | -.3391592 .0330189 -10.27 0.000 -.4038766 -.2744418
age | .124866 .0016699 74.77 0.000 .121593 .128139
male | 0 (omitted)
child_birth_male | .3999343 .0389499 10.27 0.000 .323592 .4762767
edyears | .1621751 .0038231 42.42 0.000 .1546818 .1696683
dmstatus2 | .4181992 .0222516 18.79 0.000 .3745857 .4618126
dmstatus3 | -.012054 .0431051 -0.28 0.780 -.0965405 .0724325
dmstatus4 | .1307534 .2857855 0.46 0.647 -.4293897 .6908966
|
ethnicity |
Hispanic | 0 (omitted)
Mixed Race (Non-Hispanic) | 0 (omitted)
Non-Black / Non-Hispanic | 0 (omitted)
|
_cons | 3.162842 .037732 83.82 0.000 3.088887 3.236798
---------------------------+----------------------------------------------------------------
sigma_u | .81140777
sigma_e | 1.1753411
rho | .32276672 (fraction of variance due to u_i)
--------------------------------------------------------------------------------------------
F test that all u_i=0: F(7125, 48741) = 3.79 Prob > F = 0.0000
. display 100*(exp(_b[child_birth])-1)
-28.763096
. display 100*(exp(_b[child_birth_male]+_b[child_birth])-1)
6.2659921
. test child_birth_male
( 1) child_birth_male = 0
F( 1, 48741) = 105.43
Prob > F = 0.0000
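The two display commands above convert log-point coefficients into percentage effects. A minimal sketch of that arithmetic follows, using coefficients typed in by hand from the Q8 table; the scalar names are hypothetical and not part of the original do-file.

```stata
* Coefficients copied by hand from the Q8 fixed effects table (hypothetical scalar names)
scalar b_birth      = -.3391592   // child_birth: effect for women (male == 0)
scalar b_birth_male =  .3999343   // child_birth_male: additional effect for men

* Percentage change implied by a log-point coefficient b: 100*(exp(b) - 1)
scalar pct_women = 100*(exp(b_birth) - 1)
scalar pct_men   = 100*(exp(b_birth + b_birth_male) - 1)

display "women: " %9.4f pct_women
display "men:   " %9.4f pct_men

* With the xtreg results still in memory, lincom would also report a standard
* error for the combined coefficient for men:
*     lincom child_birth + child_birth_male
```

For men the relevant coefficient is the sum of the main effect and the interaction, which is why the second display command adds the two before exponentiating.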
.
. * Q9. Attrition
. xtset pid wave
. xtdescribe
.
. * Indicator of being available in following wave
. * Objective: test whether participation in the following wave is correlated with the outcome
. gen sample=1 // auxiliary variable equal to 1 for every observation; used to check whether the unit is observed in the next wave
.
. gen nextwave=F.sample==1 // equals 1 if the unit is observed in the following wave and 0 if sample is missing in the next wave
. xtreg lnincome nextwave child_birth age male edyears dmstatus2 dmstatus3 dmstatus4 i.ethnicity, fe
note: male omitted because of collinearity.
note: 2.ethnicity omitted because of collinearity.
note: 3.ethnicity omitted because of collinearity.
note: 4.ethnicity omitted because of collinearity.
--------------------------------------------------------------------------------------------
lnincome | Coefficient Std. err. t P>|t| [95% conf. interval]
---------------------------+----------------------------------------------------------------
nextwave | .0592868 .014481 4.09 0.000 .0309038 .0876699
child_birth | -.0562282 .0179524 -3.13 0.002 -.0914152 -.0210413
age | .127547 .0017144 74.40 0.000 .1241867 .1309073
male | 0 (omitted)
edyears | .1610698 .0038248 42.11 0.000 .1535731 .1685664
dmstatus2 | .4161596 .0222755 18.68 0.000 .3724994 .4598198
dmstatus3 | -.0114637 .0431442 -0.27 0.790 -.0960269 .0730995
dmstatus4 | .1518819 .2860366 0.53 0.595 -.4087535 .7125172
|
ethnicity |
Hispanic | 0 (omitted)
Mixed Race (Non-Hispanic) | 0 (omitted)
Non-Black / Non-Hispanic | 0 (omitted)
|
_cons | 3.070286 .0428942 71.58 0.000 2.986213 3.154359
---------------------------+----------------------------------------------------------------
sigma_u | .81510967
sigma_e | 1.1764093
rho | .32436163 (fraction of variance due to u_i)
--------------------------------------------------------------------------------------------
F test that all u_i=0: F(7125, 48741) = 3.87 Prob > F = 0.0000
. display 100*(exp(_b[nextwave])-1)
6.1079567
.
end of do-file
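One caveat with nextwave as generated in Q9 is that F.sample is also missing in the final wave, where no following wave exists, so those observations are coded 0 even though nobody could have dropped out at that point. A possible refinement, sketched here on a small made-up panel (the data and the scalar name lastwave are hypothetical, not taken from the assignment), sets the indicator to missing in the last wave:

```stata
* Toy unbalanced panel, made up for illustration only
clear
input pid wave
1 1
1 2
1 3
2 1
2 2
3 1
3 3
end
xtset pid wave

gen byte sample = 1
* 1 if the unit is observed in the following wave, 0 otherwise
gen byte nextwave = (F.sample == 1)

* The last wave has no following wave, so the indicator is undefined there
summarize wave, meanonly
scalar lastwave = r(max)
replace nextwave = . if wave == lastwave

list pid wave nextwave, sepby(pid)
```

With the indicator set to missing, xtreg drops the final-wave observations from the attrition regression automatically, so the coefficient on nextwave is identified only from waves in which dropping out was possible.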