Panel Data Analysis
Panel data, also called longitudinal data or cross-sectional time series data, are data in which the same entities (panels), such as people, firms, or countries, are observed at multiple time points. The National Longitudinal Survey is an example of panel data: a sample of people was followed over the years. General Social Survey data, on the other hand, are not longitudinal even though people were surveyed in multiple years, because the respondents are not necessarily the same each year.

On this page we use an example dataset, nlswork, that comes with Stata 11 to illustrate how to analyze panel data. Let's open nlswork.dta. Give the command:

. use http://www.stata-press.com/data/r11/nlswork.dta

Alternatively, you can use the menu at the top. Select File -> Example Datasets -> Stata 11 manual datasets -> Longitudinal-Data/Panel-Data Reference Manual [XT]. Then click "use" next to nlswork.dta.

First, you need to tell Stata that your dataset is panel data. You do so by giving Stata the names of the panel variable and the time variable. Both variables need to be numeric. In this example dataset, the panel variable is idcode and the time variable is year. At the command prompt, type:

. xtset idcode year

Alternatively, you may use the menus. Select Statistics -> Longitudinal/panel data -> Setup and Utilities -> Declare dataset to be panel data.
The xtset output reports the panel variable idcode as unbalanced, which means that the panels are not all observed for the same set of years. The output also tells us that year has gaps. You do not need to do anything special about this imbalance before proceeding to the analysis. If you have a string panel variable, such as a name, you can create a numeric code for it with the encode command. If you have a string date, it needs to be converted into a Stata date format. For information about Stata's date variable formats, see the Dates in Stata page.
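As a minimal sketch of these conversions, the toy dataset below is made up for illustration only: the variables firm, datestr, and sales are hypothetical and are not part of nlswork. It encodes a string panel variable and extracts a numeric year from a string date before calling xtset.

```stata
* Hypothetical toy data: firm, datestr and sales are illustrative names only
clear
input str10 firm str10 datestr sales
"Acme" "2020-01-01" 10
"Acme" "2021-01-01" 12
"Bolt" "2020-01-01" 7
"Bolt" "2021-01-01" 9
end

* Create a numeric, value-labelled code from the string panel variable
encode firm, generate(firm_id)

* Convert the string date to a Stata daily date, then keep only the year
generate year = year(date(datestr, "YMD"))

* Declare the panel structure using the numeric variables
xtset firm_id year
```

encode assigns the codes in alphabetical order of the strings and keeps the original names as value labels, so listings remain readable.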
Assume that there are effects other than tenure that differ among people but are constant over time, such as personality, and that are not included in the model but influence income. Then the model will be:

LogWage = intercept + b1TenureForEachPanel&Time + b2UnobservedCharacteristicsForEachPanel + ErrorForEachPanel&Time

where UnobservedCharacteristics varies from person to person but does not change over time. So this equation is the same as:

LogWage = interceptForEachPanel + b1TenureForEachPanel&Time + ErrorForEachPanel&Time

in which interceptForEachPanel absorbs the intercept and the b2UnobservedCharacteristicsForEachPanel term of the previous equation. For sample data, the second equation becomes:

EstimatedLogWage = PanelFixedEffects + b1TenureForEachPanel&Time

To run this fixed effects regression in Stata, give the command:

. xtreg ln_wage tenure, fe

If you are using menus, select Statistics -> Longitudinal/panel data -> Linear models -> Linear regression (FE, RE, PA, BE).
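One way to see why the per-person intercept drops out is to subtract each person's own mean from ln_wage and tenure and regress the demeaned variables on each other. The sketch below assumes the nlswork data from the URL used on this page; the names mean_lnw, mean_ten, dm_lnw, and dm_ten are illustrative. The slope on dm_ten should match the tenure coefficient from xtreg with the fe option, while the standard errors differ because regress does not account for the estimated person means.

```stata
use http://www.stata-press.com/data/r11/nlswork.dta, clear
xtset idcode year

* Keep only observations used by the model
keep if !missing(ln_wage, tenure)

* Person-specific means (illustrative variable names)
bysort idcode: egen double mean_lnw = mean(ln_wage)
bysort idcode: egen double mean_ten = mean(tenure)

* Deviations from each person's own mean
generate double dm_lnw = ln_wage - mean_lnw
generate double dm_ten = tenure - mean_ten

* Regression on demeaned data; store the slope for comparison
regress dm_lnw dm_ten
scalar b_demeaned = _b[dm_ten]

* Fixed effects regression for comparison
xtreg ln_wage tenure, fe
display "demeaned slope = " b_demeaned "   xtreg, fe slope = " _b[tenure]
```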
The output shows that this is a fixed-effects regression with idcode as the group variable. There are 28,101 observations in total and 4,699 groups (persons in this case). The number of observations per group, that is, the number of years observed per person, ranges from 1 to 15. Plugging the coefficients into the above model, we have:
EstimatedLogWage = 1.57 + 0.034TenureForEachPanel&Time

This is equivalent to including n-1 dummy variables in the model, where n is the total number of panels in your data. Or it may be more intuitive to exclude the constant and include n dummy variables. The dummy variables absorb the variation between panels that is constant over time. In this dataset it is not practical to create a dummy variable for each person, as there are close to 5,000 people. But if you have a small number of panels, you obtain the same coefficient by running a regular regression with dummy variables. We'll examine this in the time fixed effects section, because we have 15 years, which is more manageable than 5,000 panels.
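The equivalence can also be checked on a small slice of the data. The sketch below keeps only persons with idcode of 50 or less, an arbitrary cutoff chosen for illustration so that the number of dummies stays small, and uses factor-variable notation (i.idcode, available from Stata 11) instead of hand-made dummies. Under these assumptions the tenure coefficient should be the same in both regressions.

```stata
use http://www.stata-press.com/data/r11/nlswork.dta, clear

* Arbitrary small subsample, for illustration only
keep if idcode <= 50 & !missing(tenure)
xtset idcode year

* Entity fixed effects regression on the subsample
xtreg ln_wage tenure, fe
scalar b_fe = _b[tenure]

* Ordinary regression with one dummy per person (base person omitted)
regress ln_wage tenure i.idcode
display "xtreg, fe slope = " b_fe "   dummy-variable slope = " _b[tenure]
```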
Plugging in these coefficients, we have:

EstimatedLogWage = 1.55 + 0.039TenureForEachPanel&Time

As in the case of entity fixed effects, you can include t-1 dummy variables in the model, where t is the total number of years in the data. The tabulate command with the generate option creates a dummy variable for each year. Using an asterisk (*) with yr, you can include yr1 through yr15. If you include a constant, the constant takes the effect of the year that is omitted. Here we purposefully exclude the constant. In areg, the year effects are absorbed and a single constant is reported, whereas in the dummy-variable version you get an intercept for each year.

. tabulate year, generate(yr)
. regress ln_wage tenure yr*, hascons
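As a cross-check, the sketch below runs the absorbed version and a dummy-variable version side by side. It assumes the nlswork data from the URL used on this page and uses factor-variable notation (i.year, available from Stata 11), which builds the year dummies on the fly and omits a base year, so a constant is kept here. The coefficient and standard error of tenure should agree across the two commands.

```stata
use http://www.stata-press.com/data/r11/nlswork.dta, clear

* Time fixed effects with the year effects absorbed
areg ln_wage tenure, absorb(year)
scalar b_areg  = _b[tenure]
scalar se_areg = _se[tenure]

* Same model with explicit year dummies (base year omitted, constant kept)
regress ln_wage tenure i.year
display "areg: b = " b_areg ", se = " se_areg
display "regress: b = " _b[tenure] ", se = " _se[tenure]
```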
Notice that the coefficient, standard error, and t-value of tenure are the same as in the areg results. In the time fixed effects model, we assumed that the slope for tenure is the same for all years but the intercept differs. If you think that not only the intercept but also the effect of tenure differs from year to year, you can run a separate regression for each year, although this is a bit of a digression from time fixed effects.

. sort year
. by year: regress ln_wage tenure
From the menu, select Statistics -> Postestimation -> Tests -> Hausman specification test
The Hausman test tests the null hypothesis that the coefficients estimated by the efficient random effects estimator are the same as the ones estimated by the consistent fixed effects estimator. If they are (that is, if the P-value is not statistically significant), it is safe to use random effects. If you get a statistically significant P-value, however, you should use fixed effects. In this example, the P-value is statistically significant, so fixed effects is more appropriate.
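From the command line, the test can be sketched as follows, assuming the same ln_wage and tenure model used on this page. The names fe_est and re_est are illustrative labels for the stored estimates. The consistent estimator (fixed effects) is listed first and the efficient one (random effects) second.

```stata
use http://www.stata-press.com/data/r11/nlswork.dta, clear
xtset idcode year

* Fit and store the fixed effects model
xtreg ln_wage tenure, fe
estimates store fe_est

* Fit and store the random effects model
xtreg ln_wage tenure, re
estimates store re_est

* Hausman test: consistent estimator first, efficient estimator second
hausman fe_est re_est
```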
Applying the between effects regression model is equivalent to taking the mean of each variable in the model for each panel across time and running a regression on the collapsed dataset of means. In this dataset, some people's tenure information is missing. The xt commands automatically exclude observations with missing values from the computations. collapse, however, skips missing values one variable at a time rather than dropping the whole observation, so the mean of ln_wage would still include years in which tenure is missing. We therefore need to remove the cases with missing tenure before collapsing.

. drop if tenure == .
. sort idcode
. collapse (mean) meanLnWage=ln_wage meanTenure=tenure, by(idcode)
. regress meanLnWage meanTenure
The number of observations in the second regression matches the number of groups in the xt regression. You can see that it produces results identical to the xt results above.
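The comparison can be reproduced in one run. The sketch below assumes the nlswork data from the URL used on this page and wraps the collapse step in preserve and restore so the original panel data are still in memory afterwards. The scalar names b_be and b_means are illustrative.

```stata
use http://www.stata-press.com/data/r11/nlswork.dta, clear
xtset idcode year

* Between effects regression
xtreg ln_wage tenure, be
scalar b_be = _b[tenure]

* Same model by hand: regress person means on person means
preserve
drop if missing(tenure)
collapse (mean) meanLnWage=ln_wage meanTenure=tenure, by(idcode)
regress meanLnWage meanTenure
scalar b_means = _b[meanTenure]
restore

display "xtreg, be slope = " b_be "   regression on means slope = " b_means
```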