Assignment 2 Microeconometrics

Uploaded by

Nahuel Mongelli

Group number: 3

Students:
Chervakova, Polina (731001)
Flores Castellanos, Brandon Ignacio (653334)
Mongelli, Nahuel (747159)
Stoop, Louise (617803)

Assignment 2: Panel Data Analysis

Question 1.
Begin by describing the dataset using the relevant panel data commands. For each
variable discuss the different variation components and identify which variables are
time-variant and time-invariant.

Commands.
xtset pid wave
xtdescribe
xtsum

Table 1. General description of the panel data sample

The “xtdescribe” command gives a general picture of the composition of the data set.
The panel contains 7,126 individuals observed over 17 waves (waves 2 to 18). The
distribution of the number of observations per individual shows that 75% of the sample
has at least 3 observations, 50% has at least 7, and 25% has at least 13. In the
“pattern” column, a 1 indicates that the individual was observed in that wave and a dot
indicates that they were not. In this sample, 758 people answered only in the first
wave, 534 answered only in the first and second waves, and so on. Only 601 people
answered in all 17 waves. This is a first indication of a possible attrition problem in
the sample, which we will need to test.

Table 2. Summary of the panel data sample

The “xtsum” command summarizes all the variables in the sample and splits their
variation into overall, between, and within components. First, “pid” and “wave” only
identify the units and the moment in time at which the data were collected,
respectively. Second, to distinguish between the types of variables, Table 2 marks the
time-invariant variables in red and the time-variant ones in green. The time-invariant
group consists of two variables: gender (defined in the dataset as “male”) and
ethnicity. They are time-invariant because their within standard deviation is zero. In
contrast, the time-variant variables have within standard deviations greater than zero:
age, marital status (mstatus), education years (edyears), income, and childbirth
(child_birth).
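
The decomposition that “xtsum” reports can be written out explicitly. The following is a sketch of the standard definitions, where the individual mean and the overall mean are notation introduced here for illustration:

```latex
\[
x_{it} \;=\; \bar{x} \;+\; \underbrace{(\bar{x}_i - \bar{x})}_{\text{between}} \;+\; \underbrace{(x_{it} - \bar{x}_i)}_{\text{within}}
\]
\[
x_{it} = \bar{x}_i \ \text{for all } t
\;\Longrightarrow\; x_{it} - \bar{x}_i = 0
\;\Longrightarrow\; \text{within std. dev.} = 0
\]
```

Stata reports the within component as the deviation from the individual mean plus the overall mean, which is why the within minimum and maximum of “male” and “ethnicity” in the log both equal the overall mean.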

Table 3. Summary of the marital status categorical variable

Table 4. Summary of the ethnicity categorical variable

In addition, the sample has four categorical variables: marital status and ethnicity,
each with four categories, and gender and childbirth, which are binary. The marital
status variable shows that most observations correspond to people who have never been
married (83.61% of the total). Ethnicity is less concentrated, although one category
still accounts for more than half of the observations (Non-Black / Non-Hispanic,
51.43%).

Question 2.
Estimate the effect of childbirth on Log(income) using a Pooled OLS estimator and
controlling for age, male, education years, marital status categories, and ethnicity
categories.

Since we need to control for the marital status categories (and to interpret the
coefficient for being married in part b), we generate dummy variables for mstatus.

Commands:
generate lnincome=log(income)
tab mstatus, gen(dmstatus)
reg lnincome child_birth age male edyears dmstatus2 dmstatus3 dmstatus4 i.ethnicity
display 100*(exp(_b[child_birth])-1)
display 100*(exp(_b[dmstatus2])-1)

Table 5. Pooled OLS log-level regression

a) Interpret the estimated coefficient for childbirth (sign, magnitude, and
significance)

As we have a log-level model, we use the formula 100*(exp(_b[child_birth])-1) to
obtain the exact percentage effect of childbirth on income. After applying it, we find
that a childbirth in the past year is associated with a decrease in income of about
7.6%, keeping all else constant (ceteris paribus). The effect is significant at our
chosen 5% level (the p-value is 0.000 and the 95% confidence interval does not contain
zero).
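
As a worked check of this conversion, take the Pooled OLS coefficient rounded to −0.079 (the magnitude quoted in Question 4a). The rounding is an assumption of this sketch, so the last digit may differ slightly from Stata's output:

```latex
\[
100\,\bigl(e^{\hat\beta} - 1\bigr) = 100\,\bigl(e^{-0.079} - 1\bigr) \approx 100\,(0.9240 - 1) = -7.6\%
\]
\[
\text{naive reading: } 100\,\hat\beta = -7.9\%
\]
```

For small coefficients the two readings are close; the exact formula matters more as the coefficient grows in absolute value.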

b) Interpret the estimated coefficient for being married (sign, magnitude, and
significance)

Table 6. Marital status coefficients in the log-level regression

As already mentioned, we created dummies for mstatus so that we can estimate the
effect of marital status on income. The reference category is “Never married”;
dmstatus2 indicates “Married”, dmstatus3 “Separated or divorced”, and dmstatus4
“Widowed”.

As we have a log-level model, we apply the formula from 2a to obtain the exact
percentage effect. Compared to people who never married, married people’s income is
85.68% higher, ceteris paribus, and the difference is significant at our chosen 5%
level. This difference is presumably due to their income being a combined household
income.
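
Working backwards from the reported 85.68% shows why the exact formula is needed here. The implied coefficient below is back-calculated from that percentage, not read from Table 6, so it should be treated as an approximation:

```latex
\[
100\,\bigl(e^{\hat\beta_{\text{married}}} - 1\bigr) = 85.68
\;\Longrightarrow\;
\hat\beta_{\text{married}} = \ln(1.8568) \approx 0.619
\]
\[
100\,\hat\beta_{\text{married}} \approx 61.9\% \quad \text{versus the exact } 85.68\%
\]
```

With a coefficient this large, reading the coefficient directly as a percentage would understate the effect by more than 20 percentage points.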

c) Under what conditions will the Pooled OLS estimates of childbirth be unbiased
and efficient? Do you think these conditions hold? Discuss with one concrete
example for each condition.

Two conditions need to be satisfied for the Pooled OLS estimates of childbirth to be
unbiased and efficient: i) the error term is uncorrelated with the independent
variable (childbirth), and ii) there is no serial correlation in the errors.

We think that neither condition holds. Consider fertility problems, which we do not
observe. This unobserved factor directly affects our independent variable and, if it
is also related to income (for example through health), it ends up in the error term
and biases the estimated coefficient. Moreover, because this factor persists over
time, it affects the errors of the same individual in every wave, resulting in serial
correlation in the errors.
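
If serial correlation within individuals is a concern, one common way to keep the Pooled OLS point estimates while making inference robust to it is to cluster the standard errors by individual. This is a minimal sketch, not part of the original assignment commands; it assumes lnincome and the dmstatus dummies were generated as above:

```stata
* Pooled OLS with standard errors clustered on the individual identifier.
* The coefficients are the same as in the plain regression; only the
* standard errors (and so the p-values and confidence intervals) change.
reg lnincome child_birth age male edyears dmstatus2 dmstatus3 dmstatus4 ///
    i.ethnicity, vce(cluster pid)
```

Clustering addresses inference only; it does not remove bias from unobserved factors that are correlated with childbirth.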

Question 3.
Use a fixed effects (FE) estimator to study the effect of childbirth on Log(income)
controlling for age, male, education years, marital status categories, and ethnicity
categories.

Commands:
xtset pid wave
xtreg lnincome child_birth age male edyears dmstatus2 dmstatus3 dmstatus4 ///
    i.ethnicity, fe
display 100*(exp(_b[child_birth])-1)

Table 7. Fixed effects log-level regression

a) Interpret the estimated coefficient for childbirth (sign, magnitude, and
significance)

Command. display 100*(exp(_b[child_birth])-1)

As we still have a log-level model, we use the formula 100*(exp(_b[child_birth])-1)
to obtain the exact percentage effect and the correct interpretation. The result
means that having a child reduces income by around 5.3%, keeping all else constant
(ceteris paribus). This effect is significant at our chosen 5% level (the p-value is
0.002 and the 95% confidence interval does not contain zero).

b) Why are some of the variables omitted by STATA?

STATA omits the variables that do not change over time, that is, the variables with
no within variation. This happens because the FE model uses only within variation to
estimate the coefficients, unlike Pooled OLS.

The FE model controls for all time-invariant characteristics of individuals, so
variables that do not vary across time (such as gender or ethnicity) are absorbed by
the unit-specific effect and their coefficients cannot be estimated separately.

Put differently, after the within transformation these variables are equal to zero
for every observation, so they are perfectly collinear with the individual fixed
effects. Perfect collinearity occurs when one variable is an exact linear combination
of other variables, which makes it impossible for the regression to identify the
effect of that variable. Stata automatically detects this and omits the collinear
variables.
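
The within transformation behind this can be sketched as follows, writing x for a time-varying regressor, z for a time-invariant one, and α for the individual effect (generic notation, not taken from the assignment output):

```latex
\[
y_{it} = \beta x_{it} + \gamma z_i + \alpha_i + \varepsilon_{it}
\]
\[
y_{it} - \bar{y}_i
= \beta\,(x_{it} - \bar{x}_i)
+ \gamma\,\underbrace{(z_i - \bar{z}_i)}_{=\,0}
+ \underbrace{(\alpha_i - \bar{\alpha}_i)}_{=\,0}
+ (\varepsilon_{it} - \bar{\varepsilon}_i)
\]
```

Both the individual effect and the time-invariant regressor drop out, so its coefficient is not identified. This is what Stata reports as an omitted variable.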

c) In which situation would you prefer to estimate the effect of childbirth using a
fixed effects estimation compared to a pooled OLS estimation? Provide a
concrete example based on the research objective of this assignment.

If we believe that there are unit-specific, time-invariant factors that affect both
the likelihood of childbirth and income, then using fixed effects is preferable.

Pooled OLS assumes that there are no omitted variables that are correlated with
both the independent variables (like childbirth) and the dependent variable
(income). However, this assumption is often unrealistic in the context of social
science research, especially when individual-specific characteristics play a
role.

Imagine we are studying the effect of childbirth on income among working women.
Characteristics such as ambition, family background, and career orientation are
likely to affect both the decision to have children and income levels. These factors
are usually unobserved in the dataset but remain constant over time for each
individual. If these factors are correlated with the timing of childbirth, they will
confound the results if we use Pooled OLS.

Question 4.
Use a random effects (RE) estimator to study the effect of childbirth on Log(income)
controlling for age, male, education years, all indicator variables of marital status, and
all indicator variables of ethnicity.

Commands.
xtset pid wave
xtreg lnincome child_birth age male edyears dmstatus2 dmstatus3 dmstatus4 ///
    i.ethnicity, re
display 100*(exp(_b[child_birth])-1)
display 1-sqrt(e(sigma_e)^2/(17*e(sigma_u)^2+e(sigma_e)^2))

Table 8. Random effects log-level regression

a) Compare the coefficients for childbirth of the RE with those of the Pooled OLS
and FE.

After using the 100*(exp(_b[child_birth])-1) formula, the result is −5.81, which
means that having a child lowers income by 5.81% (ceteris paribus). This effect is
significant at our chosen 5% level (the p-value is 0.001 and the 95% confidence
interval does not contain zero).

Now, let’s compare the magnitudes of the coefficients for childbirth in our models.
The absolute value of the coefficient is 0.060 in RE, 0.054 in FE, and 0.079 in
Pooled OLS. The difference between the RE and FE coefficients stems from the fact
that RE uses not only within variation but also between variation. The larger effect
in Pooled OLS comes from its inability to account for unobserved unit-specific
effects: since Pooled OLS does not control for unobserved heterogeneity, the
childbirth effect may be biased by the omission of these factors. This would result
in an overestimated negative impact of childbirth on income compared to the RE model,
which partially controls for unobserved heterogeneity.
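
Converting the three coefficients to percentage effects puts them on the same scale. The sketch below uses the rounded coefficients quoted in this report (−0.079 for Pooled OLS, −0.0545 for FE, −0.0599 for RE), so the last digit can differ from Stata's unrounded output:

```latex
\[
\begin{aligned}
\text{Pooled OLS:}\quad & 100\,(e^{-0.079} - 1) \approx -7.6\% \\
\text{FE:}\quad & 100\,(e^{-0.0545} - 1) \approx -5.3\% \\
\text{RE:}\quad & 100\,(e^{-0.0599} - 1) \approx -5.8\%
\end{aligned}
\]
```

The RE estimate sits between the other two, which is what one would expect from an estimator that mixes within and between variation.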

b) What is one advantage of a RE estimation with regards to a Pooled OLS and a
Fixed Effects estimation?

One of the most important advantages is that RE estimation uses not only within
variation, as FE does, but also between variation, as Pooled OLS does. Moreover, RE
is a better tool when there is serial correlation, which seriously affects the
efficiency of Pooled OLS.

c) What is the value of the demeaning factor in the RE model? Based on this, is the
RE closer to FE or Pooled OLS? Explain why.

To calculate the demeaning factor in the RE model, $\theta$, we use the following
formula:

$\theta = 1 - \left[ \frac{\sigma_\varepsilon^2}{T \sigma_\alpha^2 + \sigma_\varepsilon^2} \right]^{0.5}$

where $T$ is the maximum number of waves, $\sigma_\alpha^2$ is the unit-specific
variance, and $\sigma_\varepsilon^2$ is the idiosyncratic shock variance. Applying
the formula, we obtain a value of 0.457. Since $\theta = 0$ corresponds to Pooled OLS
and $\theta = 1$ corresponds to FE, a value slightly below 0.5 means that the RE
model is marginally closer to Pooled OLS.
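
With an unbalanced panel, the quasi-demeaning factor differs across individuals because it depends on how many waves each person is observed. As a hedged sketch (not part of the original commands), the theta option of xtreg, re asks Stata to report it. The manual calculation is kept for comparison and fixes T at 17, an assumption that applies only to individuals observed in every wave:

```stata
* Assumes lnincome and dmstatus2-dmstatus4 were created as in Question 2.
xtset pid wave

* Random effects with theta reported; in unbalanced panels Stata shows a
* summary of theta across individuals rather than a single number.
xtreg lnincome child_birth age male edyears dmstatus2 dmstatus3 dmstatus4 ///
    i.ethnicity, re theta

* Manual value for a person observed in all 17 waves.
display 1 - sqrt(e(sigma_e)^2 / (17*e(sigma_u)^2 + e(sigma_e)^2))
```

Individuals observed in fewer waves have a smaller factor, which moves the RE estimator further towards Pooled OLS for them.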

Question 5.
Without doing any analysis:
a) Compare the assumptions for Pooled OLS, FE and RE. Theoretically discuss in
which cases you would prefer each method.

To use Pooled OLS we assume that the zero conditional mean assumption holds for the
entire error term and that there is no serial correlation. If these assumptions hold,
Pooled OLS exploits all the variation in the data and is the best linear unbiased
estimator, so we prefer Pooled OLS over FE and RE.

To use FE we assume strict exogeneity of the idiosyncratic error, while the
unit-specific effect is allowed to be correlated with the independent variables. For
RE we additionally assume that the unit-specific effect is uncorrelated with the
independent variables. For both we assume no perfect multicollinearity.

If the coefficients of the time-varying variables differ significantly between FE and
RE, this implies that time-invariant unobserved characteristics matter. Then RE is
likely to be biased, and FE is more appropriate because it accounts for all
time-invariant characteristics.

When these differences are not significant, this implies that time-invariant
unobserved characteristics do not matter. Then RE is likely to be unbiased, and RE is
more appropriate since it is also more efficient.
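
These assumptions can be lined up against a common model. The following is a sketch in generic textbook notation (the conditioning sets are written out here as an illustration, not quoted from the course material):

```latex
\[
y_{it} = x_{it}'\beta + \alpha_i + \varepsilon_{it}
\]
\[
\begin{aligned}
\text{Pooled OLS:}\quad & E[\alpha_i + \varepsilon_{it} \mid x_{it}] = 0,
  \ \text{no serial correlation in } \alpha_i + \varepsilon_{it} \\
\text{RE:}\quad & E[\varepsilon_{it} \mid x_{i1},\dots,x_{iT}, \alpha_i] = 0,
  \ E[\alpha_i \mid x_{i1},\dots,x_{iT}] = E[\alpha_i] \\
\text{FE:}\quad & E[\varepsilon_{it} \mid x_{i1},\dots,x_{iT}, \alpha_i] = 0,
  \ \alpha_i \text{ may be correlated with } x_{it}
\end{aligned}
\]
```

Read from top to bottom, each estimator relaxes an assumption of the previous one, at the cost of using less of the variation in the data.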

b) Using the practical context of the assignment, give an example situation in the
form of a DAG in which Pooled OLS would be the preferred method and discuss
why the necessary assumptions hold in that example. Do the same for FE and
RE.

According to Figure 1, if we can assume exogeneity (neither part of the error term,
𝛼𝑖 nor 𝜀𝑖𝑡, is correlated with the variable of interest, in this case childbirth)
and no serial correlation (the time-variant part of the error is not influenced by
its own values in past waves), then Pooled OLS is the best estimator.

In this case, we assume that both assumptions hold for different reasons. First, if
we can sustain that all relevant variables that affect income are captured in our
dataset, then the model has a problem neither with unobserved heterogeneity (𝛼𝑖) nor
with the idiosyncratic shock (𝜀𝑖𝑡). Second, there is no serial correlation if the
errors are uncorrelated across time. For example, if we can support the idea that
“motivation” does not affect income systematically, then we do not have serial
correlation.

Figure 1. Pooled OLS DAG

The RE regression model, like Pooled OLS, uses both between and within variation.
The most important difference with respect to Pooled OLS is that RE quasi-demeans
the data, which removes the serial correlation caused by the unit-specific effect and
makes the estimator more efficient. To have an unbiased RE estimator, as in the POLS
model, we need to assume that the zero conditional mean holds (the entire error is
uncorrelated with the variable of interest). If there is serial correlation, however,
we prefer RE to POLS because RE fixes this problem and gives a more efficient
estimator. Using the same example, suppose we cannot rule out that “motivation”
affects income, while it remains unrelated to childbirth. First, we would not opt for
a within estimator, because it only uses within variation and discards the between
variation. Second, among the estimators that use both sources of variation, RE is the
one that deals with the serial correlation problem.

Figure 2. RE DAG

Finally, the main difference between FE and RE or POLS is that FE only uses within
variation to estimate the coefficients. We therefore prefer the FE model when the
unobserved heterogeneity (𝛼𝑖) is correlated with the variable of interest. For
instance, if there is “genetic material” that systematically affects both income and
the possibility of having a child, we prefer the FE model to the others because it
removes the time-invariant unobserved heterogeneity.

Figure 3. FE DAG

Question 6.
Estimate a Correlated Random Effects (CRE) model on the effect of childbirth on
Log(income) controlling for age, male, education years, all indicator variables of
marital status, and all indicator variables of ethnicity.

Commands:
xtset pid wave
bysort pid: egen av_age=mean(age)
bysort pid: egen av_child_birth=mean(child_birth)
bysort pid: egen av_edyears=mean(edyears)
bysort pid: egen av_dmstatus2=mean(dmstatus2)
bysort pid: egen av_dmstatus3=mean(dmstatus3)
bysort pid: egen av_dmstatus4=mean(dmstatus4)
xtreg lnincome child_birth age male edyears dmstatus2 dmstatus3 dmstatus4 ///
    i.ethnicity av_child_birth av_age av_edyears av_dmstatus2 av_dmstatus3 ///
    av_dmstatus4, re
display 100*(exp(_b[child_birth])-1)
test av_age=av_child_birth=av_edyears=av_dmstatus2=av_dmstatus3=av_dmstatus4=0

Table 9. Correlated Random Effects (CRE) log-level regression


a) What is one advantage of the CRE estimator in comparison to the RE?

The CRE estimator accounts for potential endogeneity between the unobserved
individual-specific effects and the independent variables.

In the RE model there is an implicit assumption that the individual-specific effects
are uncorrelated with the independent variables, which might lead to biased estimates
if this assumption is violated.

CRE addresses this by including the individual-level means (averages) of the
time-varying variables in the model. These averages control for the possible
correlation between the unobserved unit-specific effects and the independent
variables, reducing bias and making the CRE more robust than the standard RE model
when endogeneity is present.
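
The idea can be sketched with the usual Mundlak-style formulation, in which the individual averages (the av_ variables in the commands above) enter as extra regressors. The symbols γ, δ, and a are generic notation introduced here:

```latex
\[
\alpha_i = \bar{x}_i'\gamma + a_i, \qquad E[a_i \mid x_{i1},\dots,x_{iT}] = 0
\]
\[
y_{it} = x_{it}'\beta + \bar{x}_i'\gamma + z_i'\delta + a_i + \varepsilon_{it}
\]
\[
H_0:\ \gamma = 0 \quad \text{(the RE assumption holds)}
\]
```

The joint test on the averages in the commands above is a test of this null hypothesis.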

b) What is one advantage of the CRE estimator in comparison to the FE?

The CRE estimator retains the between-individual variation, unlike FE, which
only focuses on the within-individual variation.

The FE model eliminates time-invariant variables (like gender or ethnicity) through
demeaning, which can be a drawback if one wants to study their effects. In contrast,
CRE allows us to estimate coefficients for time-invariant variables (like gender,
ethnicity, etc.) while still partially accounting for unobserved individual-specific
effects.

Additionally, CRE is more efficient than FE if the assumptions hold because it uses
both within and between variation in the data.

c) Compare the estimated coefficient for childbirth of the CRE with those of the RE
and FE. Explain why the estimated coefficients are (not) similar?

The coefficients for childbirth in the CRE and FE models are exactly the same
(−0.0545 / −0.0545). This similarity arises because both models account for the
correlation between the unobserved individual-specific effects and the
independent variables (in this case, childbirth). CRE includes the averages of
the time-varying variables to control for potential endogeneity, which aligns it
more closely with the FE model.

On the other hand, the coefficient in the RE model is slightly more negative at
−0.0599, indicating a 5.81% decrease in income due to childbirth which is
larger than the estimate in the CRE and FE models, keeping all things constant
(ceteris paribus). This effect is significant at our taken 5% level (because the p-
value is 0.003 and because the 95% confidence interval does not contain zero).

This difference occurs because the RE model assumes exogeneity (i.e., that the
unobserved individual-specific effects are uncorrelated with the independent
variables). However, in this case, the correlation between childbirth and
unobserved individual-specific effects likely exists, and thus, the RE estimate is
biased.

The RE model does not control for endogeneity, leading to slightly biased estimates.
It does not properly account for the correlation between the individual-specific
effects and childbirth, leading to a more negative coefficient compared to the CRE
and FE models.

d) Based on your CRE estimates, is exogeneity likely to hold? Which estimator
should you choose?

The CRE model is specifically designed to check for endogeneity by including the
within-individual averages of the time-varying variables as regressors. These
averages help account for potential correlation between the unobserved
individual-specific effects and the independent variables.

In our case, the CRE estimates indicate that several of the average variables
(e.g., av_edyears, av_age) are significant at the 5% level, which suggests that
these time-varying variables are correlated with the unobserved individual-
specific effects. This means that the assumption of exogeneity, which the RE
model relies on, is likely violated.

Based on the CRE estimates, exogeneity does not hold, and therefore the RE
model is biased. Between the CRE and FE models, we would recommend
choosing the CRE model for the following reasons:

• Inclusion of Time-Invariant Variables:

The CRE model allows us to include and estimate the effects of time invariant
variables like gender and ethnicity. In our case, these variables are likely
important to our analysis (since gender plays a key role in evaluating the impact
of childbirth). The FE model cannot estimate the effects of time-invariant
variables because it eliminates them through demeaning.

• Addressing Endogeneity:

The CRE model incorporates individual averages of time-varying variables, which helps
control for potential endogeneity. Based on the significance of the averages in our
results, the CRE model has effectively handled endogeneity, making it a reliable
estimator.

• Efficiency:

The CRE model is generally more efficient than the FE model because it uses
both within-individual and between-individual variation. FE only uses within-
individual variation, which can reduce efficiency, especially when time-
invariant variables are relevant.

Question 7.
Implement the Hausman test based on the models above.
a) How does the Hausman test contribute to the estimator decision-making
process?

The Hausman test helps decide between the FE and RE models. The test checks
whether the RE estimator is consistent by comparing it with the FE estimator.
The null hypothesis (𝐻0 ) of the test states that the differences between the
coefficients of the FE and RE models are not systematic (i.e., the RE model is
consistent and efficient). The alternative hypothesis (𝐻1 ) is that the differences
between the coefficients are systematic, implying that the RE model is
inconsistent and biased.

If the test rejects the null hypothesis, we conclude that the RE model is
inconsistent and should not be used. Instead, we choose the FE model, which
is consistent but less efficient. If the test fails to reject the null, we conclude that
RE is both consistent and more efficient, making it the preferred estimator.
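
For reference, the test statistic compares the two coefficient vectors, weighted by the difference in their estimated variances. This is the standard textbook form, sketched here with k denoting the number of coefficients that both models estimate:

```latex
\[
H = (\hat\beta_{FE} - \hat\beta_{RE})'\,
\bigl[\widehat{\mathrm{Var}}(\hat\beta_{FE}) - \widehat{\mathrm{Var}}(\hat\beta_{RE})\bigr]^{-1}
(\hat\beta_{FE} - \hat\beta_{RE})
\;\sim\; \chi^2_k \quad \text{under } H_0
\]
```

A large value of the statistic means that the FE and RE coefficients are too far apart to be explained by sampling noise alone.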

b) What do you find? In your answer, make sure to state the null hypothesis of the
test, whether you reject the test, and the significance level used.

Commands:
xtset pid wave
est clear
xtreg lnincome child_birth age male edyears dmstatus2 dmstatus3 dmstatus4 ///
    i.ethnicity, fe
estimates store fe
xtreg lnincome child_birth age male edyears dmstatus2 dmstatus3 dmstatus4 ///
    i.ethnicity, re
estimates store re
estimates store re
hausman fe re

Table 10. Hausman test

Chi-squared (𝜒 2 ) = 141.94 with a p-value of 0.000.

The null hypothesis (𝐻0 ) of the test states that the differences between the
coefficients of the FE and RE models are not systematic (i.e., the RE model is
consistent and efficient). The alternative hypothesis (𝐻1 ) is that the differences
between the coefficients are systematic, implying that the RE model is
inconsistent and biased.

The test is statistically significant at any conventional significance level (e.g.,
1%, 5%, or 10%). Having used the 5% significance level ourselves, we reject the null
hypothesis that the differences between the coefficients of the FE and RE models are
not systematic.

The significant Hausman test result suggests that the assumptions behind the
RE model (i.e., no correlation between the individual effects and the regressors)
do not hold. In this case, the FE model should be preferred because it is
consistent, even though it may be less efficient than RE.

c) Based on your estimates, is exogeneity likely to hold? Which estimator should
you choose?

Since the test shows systematic differences between the FE and RE estimates, this
suggests that endogeneity is present (i.e., the unobserved individual effects are
correlated with the explanatory variables). Thus, exogeneity is unlikely to hold.
Therefore, FE is the preferred estimator in this case because the RE estimator would
be biased due to the violation of the exogeneity assumption. The FE estimator remains
consistent in the presence of such correlation, making it the appropriate choice for
this analysis.
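
As a possible robustness check, the Hausman test can be rerun with both variance matrices based on the same disturbance variance estimate, which helps when the difference of the two matrices is not positive definite. This is a sketch that goes beyond the original commands; the sigmamore option is used on the assumption that such a check is wanted here:

```stata
* Re-estimate and store both models, as in part b).
xtset pid wave
xtreg lnincome child_birth age male edyears dmstatus2 dmstatus3 dmstatus4 ///
    i.ethnicity, fe
estimates store fe
xtreg lnincome child_birth age male edyears dmstatus2 dmstatus3 dmstatus4 ///
    i.ethnicity, re
estimates store re

* Hausman test using the disturbance variance from the efficient (RE) model
* for both covariance matrices.
hausman fe re, sigmamore
```

If the conclusion matches the one in Table 10, the choice of FE does not hinge on how the variance difference is computed.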

Question 8.
In the previous analysis, you have assumed that the effect of childbirth is the same for
male and females.
a) How likely do you think that this is the case? Discuss without any further
analysis.

It is unlikely that the effect of childbirth on income is the same for males and
females. Childbirth typically has a much larger negative impact on women's
income due to caregiving responsibilities, societal expectations, and possible
career interruptions. For men, childbirth may not have a negative effect and
could even have a positive impact as men might feel pressure to increase their
work effort to support the family financially.

b) Based on your preferred model, test if the effect of childbirth on Log(income) is
the same for males and females. Explain what you conclude.

Commands:
gen child_birth_male=child_birth*male
xtreg lnincome child_birth age male child_birth_male edyears dmstatus2 ///
    dmstatus3 dmstatus4 i.ethnicity, fe
display 100*(exp(_b[child_birth])-1)
display 100*(exp(_b[child_birth_male]+_b[child_birth])-1)
test child_birth_male

Table 11. FE log-level regression testing for different effects of childbirth
according to gender

The coefficient for childbirth is −0.339 and is statistically significant at our
chosen 5% level (p-value = 0.000). Because the interaction term is included, this
coefficient is the effect for females (male = 0): having a child is associated with
a 28.76% reduction in income (ceteris paribus).

The coefficient for the interaction term “child_birth_male” is 0.399 and is
statistically significant at our chosen 5% level (p-value = 0.000). This implies that
the effect of childbirth on income is significantly less negative for males than for
females.

For males, the net effect of childbirth on income is:

−0.339 + 0.399 = 0.060

This suggests that childbirth increases male income by approximately 6.27%, holding
all else constant (ceteris paribus).

The variable male is omitted due to collinearity. This is expected in a FE model
because gender is time-invariant, meaning it doesn't change within individuals over
time.

For females, childbirth is associated with a significant decrease in income, whereas
for males childbirth is associated with a slight increase in income, as evidenced by
the positive interaction term that more than offsets the negative childbirth
coefficient.

In addition, we tested whether the coefficient for our interaction term,
“child_birth_male”, is equal to 0. The results of the test lead us to reject the null
hypothesis that this coefficient is equal to 0.

Table 12. Test for child_birth_male = 0

These results suggest that the effect of childbirth on income differs significantly
between males and females: childbirth considerably reduces income for females, but
not for males (ceteris paribus).
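
The net effect for males is a sum of two coefficients, so its own standard error and significance are not visible in the regression table. A hedged sketch of how it could be obtained after the FE regression, using lincom and nlcom (these two commands are an addition, not part of the original analysis):

```stata
* FE regression with the childbirth-by-male interaction, as above.
xtreg lnincome child_birth age male child_birth_male edyears dmstatus2 ///
    dmstatus3 dmstatus4 i.ethnicity, fe

* Net effect of childbirth for males in log points, with a test of
* H0: _b[child_birth] + _b[child_birth_male] = 0.
lincom child_birth + child_birth_male

* The same net effect expressed as a percentage change, with a
* delta-method standard error.
nlcom 100*(exp(_b[child_birth] + _b[child_birth_male]) - 1)
```

This shows whether the small positive effect for males is itself distinguishable from zero, which the test on the interaction term alone does not answer.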

Question 9.
Finally, evaluate your data again and discuss whether there is attrition or not in your
sample. Based on your preferred model, is it likely that there is attrition bias? What do
you conclude? How does this impact your conclusions?

Commands:
xtset pid wave
xtdescribe
* Auxiliary variable that is one for everyone; used to see whether sample is
* not missing in the next wave
gen sample=1
* nextwave is 1 if sample equals 1 in the following wave and 0 if sample is
* missing in the following wave
gen nextwave=F.sample==1

xtreg lnincome nextwave child_birth age male edyears dmstatus2 dmstatus3 ///
    dmstatus4 i.ethnicity, fe
display 100*(exp(_b[nextwave])-1)

The output of “xtset pid wave” already indicates whether there may be attrition in
our sample. The first line contains the word “unbalanced”, which hints at attrition
because it implies that we do not have data for every unit in all time periods. The
“xtdescribe” command then gives the distribution of time periods and participation
patterns. In the first row, for instance, 758 people were observed for one period and
then dropped out.

Table 13. Waves pattern frequencies

To test whether attrition is random, we created a variable “nextwave” that indicates
whether the participant was observed in the following wave. Our goal is to assess
whether remaining in the survey is correlated with the outcome.

We start by creating an auxiliary variable “sample”, equal to one for everyone, which
is used to indicate whether sample is missing in the next wave. Then, we create
“nextwave”, which takes the value 1 if sample=1 in the following wave, and 0
otherwise.

We find that the coefficient for “nextwave” is 0.059 (p-value = 0.000), which is
statistically significant at our chosen 5% level. This indicates that there is likely
attrition bias. Applying the formula for the log-level model, individuals who remain
in the survey for the following wave tend to have 6.1% higher income than those who
drop out in the next wave (ceteris paribus). Therefore, since individuals with higher
income are more likely to remain in the panel and those with lower income are more
likely to drop out, the sample becomes skewed towards higher-income individuals over
time.

Table 14. Attrition test

This bias could affect the validity of any conclusions drawn from the data, particularly
if the analysis does not account for the fact that lower-income individuals are more
likely to drop out. As a result, income-related estimates may be upwardly biased,
overestimating the true average income of the population.
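
One caveat worth checking: in the last wave (wave 18) there is no following wave, so the indicator is mechanically 0 for everyone observed then, even though nobody has dropped out. A sketch of a robustness check under that reading, using the hypothetical variable names in_panel and stays_next so that the variables created above are not overwritten:

```stata
xtset pid wave

* Indicator equal to 1 for every observed person-wave.
gen byte in_panel = 1

* 1 if the person is also observed in the following wave, 0 otherwise.
gen byte stays_next = (F.in_panel == 1)

* Same attrition regression, excluding the final wave, where the indicator
* is 0 by construction rather than because of dropout.
xtreg lnincome stays_next child_birth age male edyears dmstatus2 dmstatus3 ///
    dmstatus4 i.ethnicity if wave < 18, fe
display 100*(exp(_b[stays_next])-1)
```

If the coefficient stays positive and significant, the attrition conclusion does not depend on how the last wave is treated.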

STATA log file
--------------------------------------------------------------------------------------------
name: <unnamed>
log: /Users/polinacervakova/Desktop/Erasmus/Applied Microeconometrics/Assignment 2/Childbirth.log
log type: text
opened on: 1 Oct 2024, 19:18:11

.
. * Q1. Panel data description
. describe

Contains data from /Users/polinacervakova/Desktop/Erasmus/Applied Microeconometrics/Assignment 2/Childbirth.dta
Observations: 55,874
Variables: 9 22 Sep 2023 15:50
--------------------------------------------------------------------------------------------
Variable      Storage   Display   Value
name          type      format    label        Variable label
--------------------------------------------------------------------------------------------
pid           int       %12.0g                 Individual identifier
male          byte      %14.0g    gen          Male
ethnicity     byte      %25.0g    vlR1482600   Ethnicity
age           byte      %8.0g                  Age
mstatus       byte      %21.0g    mstat        Marital status
edyears       byte      %24.0g                 Education years
income        double    %25.0g                 Income
child_birth   byte      %9.0g                  Childbirth in past year
wave          byte      %9.0g                  Wave
--------------------------------------------------------------------------------------------
Sorted by: pid age

. xtset pid wave

Panel variable: pid (unbalanced)


Time variable: wave, 2 to 18, but with gaps
Delta: 1 unit

. xtdescribe

pid: 2, 4, ..., 9022 n = 7126


wave: 2, 3, ..., 18 T = 17
Delta(wave) = 1 unit
Span(wave) = 17 periods
(pid*wave uniquely identifies each observation)

Distribution of T_i:   min      5%     25%     50%     75%     95%     max
                         1       1       3       7      13      17      17

Freq. Percent Cum. | Pattern


---------------------------+-------------------

758 10.64 10.64 | 1................
601 8.43 19.07 | 11111111111111111
534 7.49 26.56 | 11...............
449 6.30 32.87 | 111..............
347 4.87 37.74 | 11111............
343 4.81 42.55 | 1111.............
260 3.65 46.20 | 111111...........
214 3.00 49.20 | 1111111..........
198 2.78 51.98 | 11111111.........
3422 48.02 100.00 | (other patterns)
---------------------------+-------------------
7126 100.00 | XXXXXXXXXXXXXXXXX

. xtsum

Variable         |      Mean   Std. dev.        Min        Max |    Observations
-----------------+---------------------------------------------+----------------
pid      overall |  4512.136   2609.298           2       9022 |     N =   55874
         between |             2606.401           2       9022 |     n =    7126
         within  |                    0    4512.136   4512.136 | T-bar = 7.84086
                 |                                             |
male     overall |  .6954755   .4602099           0          1 |     N =   55874
         between |             .4954768           0          1 |     n =    7126
         within  |                    0    .6954755   .6954755 | T-bar = 7.84086
                 |                                             |
ethnic~y overall |  2.778305    1.31181           1          4 |     N =   55874
         between |             1.312902           1          4 |     n =    7126
         within  |                    0    2.778305   2.778305 | T-bar = 7.84086
                 |                                             |
age      overall |   21.9753   5.260259          13         38 |     N =   55874
         between |             3.434158          13         36 |     n =    7126
         within  |             4.301764     9.54673    35.9753 | T-bar = 7.84086
                 |                                             |
mstatus  overall |  .1901779   .4556302           0          3 |     N =   55874
         between |             .2736693           0        2.5 |     n =    7126
         within  |              .327793   -2.309822   2.990178 | T-bar = 7.84086
                 |                                             |
edyears  overall |  11.90888   2.532601           0         20 |     N =   55874
         between |             1.939292         2.5         20 |     n =    7126
         within  |             1.663535   -.8244491   19.67359 | T-bar = 7.84086
                 |                                             |
income   overall |  10984.55   20284.89         .65     235884 |     N =   55874
         between |             10492.25         6.5   151934.3 |     n =    7126
         within  |             16209.18   -90203.99   221033.3 | T-bar = 7.84086
                 |                                             |
child_~h overall |   .114615   .3185596           0          1 |     N =   55874
         between |             .1818743           0          1 |     n =    7126
         within  |             .2841539   -.7425278   1.055791 | T-bar = 7.84086
                 |                                             |
wave     overall |  7.850628   4.705235           2         18 |     N =   55874
         between |             3.007591           2         18 |     n =    7126
         within  |             3.958799   -3.577943   19.10063 | T-bar = 7.84086

. tab mstatus

Marital status | Freq. Percent Cum.


----------------------+-----------------------------------
Never married | 46,717 83.61 83.61
Married | 7,716 13.81 97.42

Separated or divorced | 1,413 2.53 99.95
Widowed | 28 0.05 100.00
----------------------+-----------------------------------
Total | 55,874 100.00

. tab ethnicity

Ethnicity | Freq. Percent Cum.


--------------------------+-----------------------------------
Black | 14,519 25.99 25.99
Hispanic | 12,084 21.63 47.61
Mixed Race (Non-Hispanic) | 536 0.96 48.57
Non-Black / Non-Hispanic | 28,735 51.43 100.00
--------------------------+-----------------------------------
Total | 55,874 100.00

.
. * Q2. Pooled OLS
. generate lnincome=log(income)

. tab mstatus, gen(dmstatus)

Marital status | Freq. Percent Cum.


----------------------+-----------------------------------
Never married | 46,717 83.61 83.61
Married | 7,716 13.81 97.42
Separated or divorced | 1,413 2.53 99.95
Widowed | 28 0.05 100.00
----------------------+-----------------------------------
Total | 55,874 100.00

. reg lnincome child_birth age male edyears dmstatus2 dmstatus3 dmstatus4 i.ethnicity

Source | SS df MS Number of obs = 55,874


-------------+---------------------------------- F(10, 55863) = 3864.88
Model | 69070.4646 10 6907.04646 Prob > F = 0.0000
Residual | 99834.5017 55,863 1.78713105 R-squared = 0.4089
-------------+---------------------------------- Adj R-squared = 0.4088
Total | 168904.966 55,873 3.02301588 Root MSE = 1.3368

--------------------------------------------------------------------------------------------
lnincome | Coefficient Std. err. t P>|t| [95% conf. interval]
---------------------------+----------------------------------------------------------------
child_birth | -.0794501 .0184009 -4.32 0.000 -.1155159 -.0433843
age | .1262801 .0014485 87.18 0.000 .123441 .1291192
male | .5999785 .0126318 47.50 0.000 .5752201 .6247369
edyears | .145754 .0026827 54.33 0.000 .1404959 .151012
dmstatus2 | .6188734 .0184764 33.50 0.000 .5826594 .6550873
dmstatus3 | .0766986 .0375023 2.05 0.041 .0031939 .1502032
dmstatus4 | .0647223 .2531113 0.26 0.798 -.4313775 .560822
|
ethnicity |
Hispanic | .3460341 .0165731 20.88 0.000 .3135507 .3785176
Mixed Race (Non-Hispanic) | .0541551 .0588592 0.92 0.358 -.0612093 .1695194
Non-Black / Non-Hispanic | .4285689 .0139208 30.79 0.000 .4012841 .4558537
|

_cons | 2.587026 .0323465 79.98 0.000 2.523627 2.650425
--------------------------------------------------------------------------------------------

. display 100*(exp(_b[child_birth])-1)
-7.6375916

. display 100*(exp(_b[dmstatus2])-1)
85.68349

.
.
. * Q3. Fixed effects
. xtset pid wave

Panel variable: pid (unbalanced)


Time variable: wave, 2 to 18, but with gaps
Delta: 1 unit

. xtreg lnincome child_birth age male edyears dmstatus2 dmstatus3 dmstatus4 i.ethnicity, fe
note: male omitted because of collinearity.
note: 2.ethnicity omitted because of collinearity.
note: 3.ethnicity omitted because of collinearity.
note: 4.ethnicity omitted because of collinearity.

Fixed-effects (within) regression Number of obs = 55,874


Group variable: pid Number of groups = 7,126

R-squared: Obs per group:


Within = 0.3383 min = 1
Between = 0.4945 avg = 7.8
Overall = 0.3713 max = 17

F(6, 48742) = 4154.09


corr(u_i, Xb) = 0.0706 Prob > F = 0.0000

--------------------------------------------------------------------------------------------
lnincome | Coefficient Std. err. t P>|t| [95% conf. interval]
---------------------------+----------------------------------------------------------------
child_birth | -.0544722 .0179502 -3.03 0.002 -.0896548 -.0192896
age | .1259277 .0016685 75.47 0.000 .1226575 .1291979
male | 0 (omitted)
edyears | .1609736 .0038254 42.08 0.000 .1534758 .1684713
dmstatus2 | .4178146 .0222754 18.76 0.000 .3741545 .4614746
dmstatus3 | -.0111624 .0431511 -0.26 0.796 -.0957392 .0734143
dmstatus4 | .1574529 .2860796 0.55 0.582 -.4032667 .7181725
|
ethnicity |
Hispanic | 0 (omitted)
Mixed Race (Non-Hispanic) | 0 (omitted)
Non-Black / Non-Hispanic | 0 (omitted)
|
_cons | 3.153631 .0377618 83.51 0.000 3.079618 3.227645
---------------------------+----------------------------------------------------------------
sigma_u | .81739968
sigma_e | 1.1765995
rho | .3255215 (fraction of variance due to u_i)

--------------------------------------------------------------------------------------------
F test that all u_i=0: F(7125, 48742) = 3.89 Prob > F = 0.0000

. display 100*(exp(_b[child_birth])-1)
-5.3015153

.
. * Q4. Random effects
. xtset pid wave

Panel variable: pid (unbalanced)


Time variable: wave, 2 to 18, but with gaps
Delta: 1 unit

. xtreg lnincome child_birth age male edyears dmstatus2 dmstatus3 dmstatus4 i.ethnicity, re

Random-effects GLS regression Number of obs = 55,874


Group variable: pid Number of groups = 7,126

R-squared: Obs per group:


Within = 0.3380 min = 1
Between = 0.5546 avg = 7.8
Overall = 0.4084 max = 17

Wald chi2(10) = 34633.21


corr(u_i, X) = 0 (assumed) Prob > chi2 = 0.0000

--------------------------------------------------------------------------------------------
lnincome | Coefficient Std. err. z P>|z| [95% conf. interval]
---------------------------+----------------------------------------------------------------
child_birth | -.0598945 .0173605 -3.45 0.001 -.0939204 -.0258686
age | .1272876 .0014948 85.16 0.000 .124358 .1302173
male | .5698314 .0167349 34.05 0.000 .5370316 .6026312
edyears | .1509147 .0030581 49.35 0.000 .1449209 .1569085
dmstatus2 | .5107319 .019666 25.97 0.000 .4721873 .5492766
dmstatus3 | .0454112 .0391588 1.16 0.246 -.0313386 .122161
dmstatus4 | .0855097 .2618702 0.33 0.744 -.4277464 .5987658
|
ethnicity |
Hispanic | .3177513 .0230481 13.79 0.000 .2725779 .3629247
Mixed Race (Non-Hispanic) | .0588821 .0820328 0.72 0.473 -.1018992 .2196634
Non-Black / Non-Hispanic | .3832821 .0191953 19.97 0.000 .3456599 .4209042
|
_cons | 2.559046 .035206 72.69 0.000 2.490044 2.628049
---------------------------+----------------------------------------------------------------
sigma_u | .4411512
sigma_e | 1.1765995
rho | .1232516 (fraction of variance due to u_i)
--------------------------------------------------------------------------------------------

. display 100*(exp(_b[child_birth])-1)
-5.8136148

. display 1-[([e(sigma_e)]^2)/(17*([e(sigma_u)]^2)+([e(sigma_e)]^2))]^(1/2)
.45686071

.
.
. * Q6. CRE
. xtset pid wave

Panel variable: pid (unbalanced)


Time variable: wave, 2 to 18, but with gaps
Delta: 1 unit

.
. bysort pid: egen av_age=mean(age)

. bysort pid: egen av_child_birth=mean(child_birth)

. bysort pid: egen av_edyears=mean(edyears)

. bysort pid: egen av_dmstatus2=mean(dmstatus2)

. bysort pid: egen av_dmstatus3=mean(dmstatus3)

. bysort pid: egen av_dmstatus4=mean(dmstatus4)

.
. xtreg lnincome child_birth age male edyears dmstatus2 dmstatus3 dmstatus4 i.ethnicity av_child_birth av_age av_edyears av_dmstatus2 av_dmstatus3 av_dmstatus4

Random-effects GLS regression Number of obs = 55,874


Group variable: pid Number of groups = 7,126

R-squared: Obs per group:


Within = 0.3383 min = 1
Between = 0.5575 avg = 7.8
Overall = 0.4107 max = 17

Wald chi2(16) = 34809.49


corr(u_i, X) = 0 (assumed) Prob > chi2 = 0.0000

--------------------------------------------------------------------------------------------
lnincome | Coefficient Std. err. z P>|z| [95% conf. interval]
---------------------------+----------------------------------------------------------------
child_birth | -.0544722 .0183181 -2.97 0.003 -.0903749 -.0185695
age | .1259277 .0017027 73.96 0.000 .1225906 .1292649
male | .5514907 .0179655 30.70 0.000 .516279 .5867024
edyears | .1609736 .0039038 41.24 0.000 .1533224 .1686248
dmstatus2 | .4178146 .0227319 18.38 0.000 .3732609 .4623683
dmstatus3 | -.0111624 .0440355 -0.25 0.800 -.0974703 .0751455
dmstatus4 | .1574529 .2919425 0.54 0.590 -.4147438 .7296496
|
ethnicity |
Hispanic | .294627 .0232277 12.68 0.000 .2491015 .3401526
Mixed Race (Non-Hispanic) | .0624566 .0820323 0.76 0.446 -.0983238 .223237

Non-Black / Non-Hispanic | .3718733 .0196553 18.92 0.000 .3333496 .4103971
|
av_child_birth | -.116747 .0580712 -2.01 0.044 -.2305644 -.0029296
av_age | .009513 .0038225 2.49 0.013 .0020211 .0170049
av_edyears | -.031524 .0064149 -4.91 0.000 -.044097 -.0189511
av_dmstatus2 | .4106768 .0473467 8.67 0.000 .3178789 .5034747
av_dmstatus3 | -.005703 .1021948 -0.06 0.955 -.2060011 .1945951
av_dmstatus4 | -.5879614 .659662 -0.89 0.373 -1.880875 .7049524
_cons | 2.635625 .062808 41.96 0.000 2.512523 2.758726
---------------------------+----------------------------------------------------------------
sigma_u | .4411512
sigma_e | 1.1765995
rho | .1232516 (fraction of variance due to u_i)
--------------------------------------------------------------------------------------------

.
. display 100*(exp(_b[child_birth])-1)
-5.3015153

.
. test av_age=av_child_birth=av_edyears=av_dmstatus2=av_dmstatus3=av_dmstatus4=0

( 1) - av_child_birth + av_age = 0
( 2) av_age - av_edyears = 0
( 3) av_age - av_dmstatus2 = 0
( 4) av_age - av_dmstatus3 = 0
( 5) av_age - av_dmstatus4 = 0
( 6) av_age = 0

chi2( 6) = 111.11
Prob > chi2 = 0.0000

.
. * Q7. Hausman
. * H0: There are no systematic differences between the random effects and fixed effects coefficients => Random effects is chosen because it is more efficient
. * H1: There are systematic differences between fixed effects and random effects => Fixed effects is preferred because random effects is biased
. xtset pid wave

Panel variable: pid (unbalanced)


Time variable: wave, 2 to 18, but with gaps
Delta: 1 unit

. est clear

.
. xtreg lnincome child_birth age male edyears dmstatus2 dmstatus3 dmstatus4 i.ethnicity, fe
note: male omitted because of collinearity.
note: 2.ethnicity omitted because of collinearity.
note: 3.ethnicity omitted because of collinearity.
note: 4.ethnicity omitted because of collinearity.

Fixed-effects (within) regression Number of obs = 55,874

Group variable: pid Number of groups = 7,126

R-squared: Obs per group:


Within = 0.3383 min = 1
Between = 0.4945 avg = 7.8
Overall = 0.3713 max = 17

F(6, 48742) = 4154.09


corr(u_i, Xb) = 0.0706 Prob > F = 0.0000

--------------------------------------------------------------------------------------------
lnincome | Coefficient Std. err. t P>|t| [95% conf. interval]
---------------------------+----------------------------------------------------------------
child_birth | -.0544722 .0179502 -3.03 0.002 -.0896548 -.0192896
age | .1259277 .0016685 75.47 0.000 .1226575 .1291979
male | 0 (omitted)
edyears | .1609736 .0038254 42.08 0.000 .1534758 .1684713
dmstatus2 | .4178146 .0222754 18.76 0.000 .3741545 .4614746
dmstatus3 | -.0111624 .0431511 -0.26 0.796 -.0957392 .0734143
dmstatus4 | .1574529 .2860796 0.55 0.582 -.4032667 .7181725
|
ethnicity |
Hispanic | 0 (omitted)
Mixed Race (Non-Hispanic) | 0 (omitted)
Non-Black / Non-Hispanic | 0 (omitted)
|
_cons | 3.153631 .0377618 83.51 0.000 3.079618 3.227645
---------------------------+----------------------------------------------------------------
sigma_u | .81739968
sigma_e | 1.1765995
rho | .3255215 (fraction of variance due to u_i)
--------------------------------------------------------------------------------------------
F test that all u_i=0: F(7125, 48742) = 3.89 Prob > F = 0.0000

. estimates store fe

.
. xtreg lnincome child_birth age male edyears dmstatus2 dmstatus3 dmstatus4 i.ethnicity, re

Random-effects GLS regression Number of obs = 55,874


Group variable: pid Number of groups = 7,126

R-squared: Obs per group:


Within = 0.3380 min = 1
Between = 0.5546 avg = 7.8
Overall = 0.4084 max = 17

Wald chi2(10) = 34633.21


corr(u_i, X) = 0 (assumed) Prob > chi2 = 0.0000

--------------------------------------------------------------------------------------------
lnincome | Coefficient Std. err. z P>|z| [95% conf. interval]
---------------------------+----------------------------------------------------------------
child_birth | -.0598945 .0173605 -3.45 0.001 -.0939204 -.0258686
age | .1272876 .0014948 85.16 0.000 .124358 .1302173
male | .5698314 .0167349 34.05 0.000 .5370316 .6026312

edyears | .1509147 .0030581 49.35 0.000 .1449209 .1569085
dmstatus2 | .5107319 .019666 25.97 0.000 .4721873 .5492766
dmstatus3 | .0454112 .0391588 1.16 0.246 -.0313386 .122161
dmstatus4 | .0855097 .2618702 0.33 0.744 -.4277464 .5987658
|
ethnicity |
Hispanic | .3177513 .0230481 13.79 0.000 .2725779 .3629247
Mixed Race (Non-Hispanic) | .0588821 .0820328 0.72 0.473 -.1018992 .2196634
Non-Black / Non-Hispanic | .3832821 .0191953 19.97 0.000 .3456599 .4209042
|
_cons | 2.559046 .035206 72.69 0.000 2.490044 2.628049
---------------------------+----------------------------------------------------------------
sigma_u | .4411512
sigma_e | 1.1765995
rho | .1232516 (fraction of variance due to u_i)
--------------------------------------------------------------------------------------------

. estimates store re

.
. hausman fe re

---- Coefficients ----


| (b) (B) (b-B) sqrt(diag(V_b-V_B))
| fe re Difference Std. err.
-------------+----------------------------------------------------------------
child_birth | -.0544722 -.0598945 .0054224 .0045632
age | .1259277 .1272876 -.0013599 .0007413
edyears | .1609736 .1509147 .0100589 .0022981
dmstatus2 | .4178146 .5107319 -.0929174 .0104614
dmstatus3 | -.0111624 .0454112 -.0565736 .0181276
dmstatus4 | .1574529 .0855097 .0719432 .1151762
------------------------------------------------------------------------------
b = Consistent under H0 and Ha; obtained from xtreg.
B = Inconsistent under Ha, efficient under H0; obtained from xtreg.

Test of H0: Difference in coefficients not systematic

chi2(6) = (b-B)'[(V_b-V_B)^(-1)](b-B)
= 141.94
Prob > chi2 = 0.0000

.
.
. * Q8. Gender and childbirth
. gen child_birth_male=child_birth*male

. xtreg lnincome child_birth age male child_birth_male edyears dmstatus2 dmstatus3 dmstatus4 i.ethnicity, fe
note: male omitted because of collinearity.
note: 2.ethnicity omitted because of collinearity.
note: 3.ethnicity omitted because of collinearity.
note: 4.ethnicity omitted because of collinearity.

Fixed-effects (within) regression Number of obs = 55,874


Group variable: pid Number of groups = 7,126

R-squared: Obs per group:
Within = 0.3398 min = 1
Between = 0.5022 avg = 7.8
Overall = 0.3762 max = 17

F(7, 48741) = 3583.34


corr(u_i, Xb) = 0.0746 Prob > F = 0.0000

--------------------------------------------------------------------------------------------
lnincome | Coefficient Std. err. t P>|t| [95% conf. interval]
---------------------------+----------------------------------------------------------------
child_birth | -.3391592 .0330189 -10.27 0.000 -.4038766 -.2744418
age | .124866 .0016699 74.77 0.000 .121593 .128139
male | 0 (omitted)
child_birth_male | .3999343 .0389499 10.27 0.000 .323592 .4762767
edyears | .1621751 .0038231 42.42 0.000 .1546818 .1696683
dmstatus2 | .4181992 .0222516 18.79 0.000 .3745857 .4618126
dmstatus3 | -.012054 .0431051 -0.28 0.780 -.0965405 .0724325
dmstatus4 | .1307534 .2857855 0.46 0.647 -.4293897 .6908966
|
ethnicity |
Hispanic | 0 (omitted)
Mixed Race (Non-Hispanic) | 0 (omitted)
Non-Black / Non-Hispanic | 0 (omitted)
|
_cons | 3.162842 .037732 83.82 0.000 3.088887 3.236798
---------------------------+----------------------------------------------------------------
sigma_u | .81140777
sigma_e | 1.1753411
rho | .32276672 (fraction of variance due to u_i)
--------------------------------------------------------------------------------------------
F test that all u_i=0: F(7125, 48741) = 3.79 Prob > F = 0.0000

. display 100*(exp(_b[child_birth])-1)
-28.763096

. display 100*(exp(_b[child_birth_male]+_b[child_birth])-1)
6.2659921

. test child_birth_male

( 1) child_birth_male = 0

F( 1, 48741) = 105.43
Prob > F = 0.0000

.
. * Q9. Attrition
. xtset pid wave

Panel variable: pid (unbalanced)


Time variable: wave, 2 to 18, but with gaps
Delta: 1 unit

. xtdescribe

pid: 2, 4, ..., 9022 n = 7126


wave: 2, 3, ..., 18 T = 17
Delta(wave) = 1 unit
Span(wave) = 17 periods
(pid*wave uniquely identifies each observation)

Distribution of T_i:   min      5%     25%     50%     75%     95%     max
                         1       1       3       7      13      17      17

Freq. Percent Cum. | Pattern


---------------------------+-------------------
758 10.64 10.64 | 1................
601 8.43 19.07 | 11111111111111111
534 7.49 26.56 | 11...............
449 6.30 32.87 | 111..............
347 4.87 37.74 | 11111............
343 4.81 42.55 | 1111.............
260 3.65 46.20 | 111111...........
214 3.00 49.20 | 1111111..........
198 2.78 51.98 | 11111111.........
3422 48.02 100.00 | (other patterns)
---------------------------+-------------------
7126 100.00 | XXXXXXXXXXXXXXXXX

.
. * Indicator of being available in following wave
. * Objective: Estimate whether the number of participations in the survey is correlated with the outcome
. gen sample=1 // creating an auxiliary variable that is one for everyone. This will be used to see whether sample is not missing for the next wave

.
. gen nextwave=F.sample==1 // creating a variable that is 1 if sample is equal to 1 in the following wave and 0 if the value for sample is missing in the next wave

. xtreg lnincome nextwave child_birth age male edyears dmstatus2 dmstatus3 dmstatus4 i.ethnicity, fe
note: male omitted because of collinearity.
note: 2.ethnicity omitted because of collinearity.
note: 3.ethnicity omitted because of collinearity.
note: 4.ethnicity omitted because of collinearity.

Fixed-effects (within) regression Number of obs = 55,874


Group variable: pid Number of groups = 7,126

R-squared: Obs per group:


Within = 0.3386 min = 1
Between = 0.4954 avg = 7.8
Overall = 0.3720 max = 17

F(7, 48741) = 3564.19


corr(u_i, Xb) = 0.0654 Prob > F = 0.0000

--------------------------------------------------------------------------------------------
lnincome | Coefficient Std. err. t P>|t| [95% conf. interval]
---------------------------+----------------------------------------------------------------
nextwave | .0592868 .014481 4.09 0.000 .0309038 .0876699
child_birth | -.0562282 .0179524 -3.13 0.002 -.0914152 -.0210413
age | .127547 .0017144 74.40 0.000 .1241867 .1309073
male | 0 (omitted)
edyears | .1610698 .0038248 42.11 0.000 .1535731 .1685664
dmstatus2 | .4161596 .0222755 18.68 0.000 .3724994 .4598198
dmstatus3 | -.0114637 .0431442 -0.27 0.790 -.0960269 .0730995
dmstatus4 | .1518819 .2860366 0.53 0.595 -.4087535 .7125172
|
ethnicity |
Hispanic | 0 (omitted)
Mixed Race (Non-Hispanic) | 0 (omitted)
Non-Black / Non-Hispanic | 0 (omitted)
|
_cons | 3.070286 .0428942 71.58 0.000 2.986213 3.154359
---------------------------+----------------------------------------------------------------
sigma_u | .81510967
sigma_e | 1.1764093
rho | .32436163 (fraction of variance due to u_i)
--------------------------------------------------------------------------------------------
F test that all u_i=0: F(7125, 48741) = 3.87 Prob > F = 0.0000

. display 100*(exp(_b[nextwave])-1)
6.1079567

.
end of do-file
