Papke Wooldridge 1996
Author(s): Leslie E. Papke and Jeffrey M. Wooldridge
Source: Journal of Applied Econometrics, Vol. 11, No. 6 (Nov. - Dec., 1996), pp. 619-632
Published by: John Wiley & Sons
Stable URL: http://www.jstor.org/stable/2285155
JOURNAL OF APPLIED ECONOMETRICS, VOL. 11, 619-632 (1996)
ECONOMETRIC METHODS FOR FRACTIONAL RESPONSE VARIABLES WITH AN APPLICATION TO 401(K) PLAN PARTICIPATION RATES
SUMMARY
We develop attractive functional forms and simple quasi-likelihood estimation methods for regression
models with a fractional dependent variable. Compared with log-odds type procedures, there is no
difficulty in recovering the regression function for the fractional variable, and there is no need to use ad
hoc transformations to handle data at the extreme values of zero and one. We also offer some new, robust
specification tests by nesting the logit or probit function in a more general functional form. We apply these
methods to a data set of employee participation rates in 401(k) pension plans.
1. INTRODUCTION
Fractional response variables arise naturally in many economic settings. The fraction of total
weekly hours spent working, the proportion of income spent on charitable contributions, and
participation rates in voluntary pension plans are just a few examples of economic variables
bounded between zero and one. The bounded nature of such variables and the possibility of
observing values at the boundaries raise interesting functional form and inference issues. In this
paper we specify and analyse a class of functional forms with satisfying econometric properties.
We also synthesize and expand on the generalized linear models (GLM) literature from statistics
and the quasi-likelihood literature from econometrics to obtain robust methods for estimation
and inference with fractional response variables.
We apply the methods to estimate a model of employee participation rates in 401(k) pension
plans. The key explanatory variable of interest is the plan's 'match rate,' the rate at which a firm
matches a dollar of employee contributions. The empirical work extends that of Papke (1995),
who studied this problem using linear spline methods. Spline methods are flexible, but they do
not ensure that predicted values lie in the unit interval.
To illustrate the methodological issues that arise with fractional dependent variables, suppose
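that a variable $y$, $0 \le y \le 1$, is to be explained by a $1 \times K$ vector of explanatory variables
$x \equiv (x_1, x_2, \ldots, x_K)$, with the convention that $x_1 \equiv 1$. The population model
$$E(y \mid x) = \beta_1 + \beta_2 x_2 + \cdots + \beta_K x_K \equiv x\beta \tag{1}$$
where $\beta$ is a $K \times 1$ vector, rarely provides the best description of $E(y \mid x)$. Because $y$ is bounded
between zero and one, the effect of any particular $x_j$ cannot be constant throughout the range of $x$, and the predicted
values from an OLS regression can never be guaranteed to lie in the unit interval. Thus, the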
drawbacks of linear models for fractional data are analogous to the drawbacks of the linear
probability model for binary data.
The most common alternative to equation (1) has been to model the log-odds ratio as a linear
function. If $y$ is strictly between zero and one then a linear model for the log-odds ratio is
$$E(\log[y/(1-y)] \mid x) = x\beta \tag{2}$$
new specification tests are offered in Section 3, and Section 4 contains the empirical application
relating 401(k) plan participation rates to the plan's matching rate and other plan characteristics.
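from the maximization problem
$$\max_{b} \sum_{i=1}^{N} l_i(b)$$
Define $\hat G_i \equiv G(x_i\hat\beta)$ and $\hat g_i \equiv g(x_i\hat\beta)$, where $g(z) \equiv dG(z)/dz$, and let
$$\hat A \equiv \sum_{i=1}^{N} \frac{\hat g_i^2\, x_i' x_i}{\hat G_i (1 - \hat G_i)} \tag{7}$$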
The standard error of $\hat\beta_j$ reported from standard binary response analysis (regardless of the
nature of $y_i$) would be obtained as the square root of the $j$th diagonal element of $\hat A^{-1}$. Under
equation (4) only, this is not a consistent estimator of the true asymptotic standard error; we
also need the outer product of the score. Let $\hat u_i \equiv y_i - G(x_i\hat\beta)$ be the residuals (deviations
between $y_i$ and its estimated conditional expectation), and define
$$\hat B \equiv \sum_{i=1}^{N} \frac{\hat u_i^2\, \hat g_i^2\, x_i' x_i}{[\hat G_i (1 - \hat G_i)]^2} \tag{8}$$
Then a valid estimate of the asymptotic variance of $\hat\beta$ is
$$\hat A^{-1} \hat B \hat A^{-1} \tag{9}$$
The standard errors are obtained as the square roots of the diagonal elements of equation (9); see
GMT (1984) and Wooldridge (1991b) for general treatments.
Interestingly, the robust standard errors from equation (9) in the context of ordinary logit
and probit are computed almost routinely by certain statistics and econometrics packages,
such as STATA® and SST®. Unfortunately, the packages with which we are familiar
automatically transform the dependent variable used in logit or probit into a binary variable
before estimation, or do not allow non-binary variables at all (STATA® and SST® fall into
the first category). With the minor change of allowing for fractional $y$ in so-called binary
response analysis, standard software packages could be used to estimate the parameters in
equation (4) and to perform asymptotically valid inference. Alternatively, programming the
estimator in a language such as GAUSS®, as we do for our application in Section 4, is fairly
straightforward.
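If the GLM assumption (6) is maintained in addition to (4) then $\sigma^2$ is consistently estimated
by
$$\hat\sigma^2 \equiv (N - K)^{-1} \sum_{i=1}^{N} \tilde u_i^2 \tag{10}$$
where $\tilde u_i \equiv \hat u_i / [\hat G_i (1 - \hat G_i)]^{1/2}$ are the weighted residuals of equation (11).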
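As a rough illustration of equations (7) to (10), the sketch below fits a logit conditional mean by Bernoulli quasi-likelihood using Newton's method, then computes $\hat A^{-1}$, the sandwich estimate $\hat A^{-1}\hat B\hat A^{-1}$, and $\hat\sigma^2$. It is not the authors' GAUSS code: the function names, the restriction to an intercept plus one regressor, and the toy data are all assumptions made for illustration. For the logistic $G$ we have $g = G(1-G)$, which is why the Newton weights simplify.

```python
import math

def logistic(z):
    """Logistic function G(z)."""
    return 1.0 / (1.0 + math.exp(-z))

def inv2(m):
    """Inverse of a 2 x 2 matrix stored as nested sequences."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matmul2(p, q):
    """Product of two 2 x 2 matrices."""
    return [[sum(p[i][k] * q[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def fit_fractional_logit(x, y, max_iter=100, tol=1e-12):
    """Bernoulli QMLE of E(y|x) = G(b0 + b1*x) by Newton's method."""
    b0 = b1 = 0.0
    for _ in range(max_iter):
        s0 = s1 = a00 = a01 = a11 = 0.0
        for xi, yi in zip(x, y):
            G = logistic(b0 + b1 * xi)
            w = G * (1.0 - G)          # logit case: g^2 / [G(1-G)] = G(1-G)
            s0 += yi - G               # score with respect to the intercept
            s1 += (yi - G) * xi        # score with respect to the slope
            a00 += w
            a01 += w * xi
            a11 += w * xi * xi
        a_inv = inv2([[a00, a01], [a01, a11]])
        d0 = a_inv[0][0] * s0 + a_inv[0][1] * s1
        d1 = a_inv[1][0] * s0 + a_inv[1][1] * s1
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1

def variance_pieces(x, y, b0, b1):
    """Return (A_inv, sandwich, sigma2) following equations (7)-(10)."""
    n, k = len(y), 2
    A = [[0.0, 0.0], [0.0, 0.0]]
    B = [[0.0, 0.0], [0.0, 0.0]]
    pearson = 0.0
    for xi, yi in zip(x, y):
        G = logistic(b0 + b1 * xi)
        g = G * (1.0 - G)
        u = yi - G
        row = (1.0, xi)                # x_i with the convention x_1 = 1
        for r in range(2):
            for c in range(2):
                A[r][c] += g * g * row[r] * row[c] / (G * (1.0 - G))
                B[r][c] += u * u * g * g * row[r] * row[c] / (G * (1.0 - G)) ** 2
        pearson += u * u / (G * (1.0 - G))   # squared weighted residual
    A_inv = inv2(A)
    return A_inv, matmul2(matmul2(A_inv, B), A_inv), pearson / (n - k)

# Toy data (hypothetical); note that y = 1 needs no special treatment.
x = [0.0, 0.25, 0.5, 0.75, 1.0, 1.5, 2.0, 3.0]
y = [0.55, 0.70, 0.62, 0.85, 0.80, 1.00, 0.90, 1.00]
b0, b1 = fit_fractional_logit(x, y)
A_inv, V, sigma2 = variance_pieces(x, y, b0, b1)
print("beta:", b0, b1)
print("robust se:", [math.sqrt(V[j][j]) for j in range(2)])
print("GLM se:", [math.sqrt(sigma2 * A_inv[j][j]) for j in range(2)])
```

The robust standard errors come from the sandwich form and require only the conditional mean to be correct; the GLM standard errors additionally rely on the variance assumption (6).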
3. SPECIFICATION TESTING
Specification testing in this framework can be carried out by applying the results of Wooldridge
(1991a,b). We discuss two forms of the test. The first is valid under equations (4) and (6); these
are non-robust tests because they maintain the GLM variance assumption. The second, robust
form of the test requires only equation (4).
We focus primarily on Lagrange multiplier or score tests that nest $E(y \mid x) = G(x\beta)$ within a
more general model. Let $m(x, z, \beta, \gamma)$ be a model for $E(y \mid x, z)$, where $z$ is a $1 \times J$ vector of
additional variables; the elements of $z$ can be non-linear functions of $x$ (in which case
$E(y \mid x) = E(y \mid x, z)$), or variables not functionally related to $x$, or both. The vector $\gamma$ is a $Q \times 1$
vector of additional parameters. The null is assumed to be $H_0\colon \gamma = \gamma_0$ for a specified vector $\gamma_0$
(often $\gamma_0 = 0$). Then, by definition,
$$G(x\beta) \equiv m(x, z, \beta, \gamma_0) \tag{12}$$
Given the estimates under the null, $\hat\beta$, define the $1 \times K$ vector $\nabla_\beta \hat m_i \equiv \partial m(x_i, z_i, \hat\beta, \gamma_0)/\partial\beta = \hat g_i x_i$
and the $1 \times Q$ vector $\nabla_\gamma \hat m_i \equiv \partial m(x_i, z_i, \hat\beta, \gamma_0)/\partial\gamma$; these are the gradients of the regression
function with respect to $\beta$ and $\gamma$, respectively, evaluated under the null hypothesis. Define the
weighted residuals $\tilde u_i$ as in equation (11) and the weighted gradients as
$$\nabla_\beta \tilde m_i \equiv \nabla_\beta \hat m_i / [\hat G_i (1 - \hat G_i)]^{1/2} = \hat g_i x_i / [\hat G_i (1 - \hat G_i)]^{1/2} \tag{13}$$
$$\nabla_\gamma \tilde m_i \equiv \nabla_\gamma \hat m_i / [\hat G_i (1 - \hat G_i)]^{1/2} \tag{14}$$
As in equation (11), the weights are proportional to the inverse of the estimated nominal
standard deviation (see equation (6)). A valid test of $H_0\colon \gamma = \gamma_0$ depends on what is maintained
under the null hypothesis. Under the assumptions
$$E(y_i \mid x_i, z_i) = G(x_i\beta) \tag{15}$$
and
$$\mathrm{Var}(y_i \mid x_i, z_i) = \sigma^2 G(x_i\beta)[1 - G(x_i\beta)] \tag{16}$$
a valid statistic is obtained as $N R_u^2$ from the OLS regression
$$\tilde u_i \ \text{on}\ \nabla_\beta \tilde m_i,\ \nabla_\gamma \tilde m_i, \qquad i = 1, 2, \ldots, N \tag{17}$$
where $R_u^2$ is the constant-unadjusted r-squared. Under equations (15) and (16), $N R_u^2$ is
distributed asymptotically as $\chi^2_Q$; see Wooldridge (1991a).
For binary choice models, Engle (1984) and Davidson and MacKinnon (1984) suggest a test
based on regression (17) for logit and probit. Gurmu and Trivedi (1993) present results for a
class of models that allows testing the logit function against a more general index function. But
for fractional dependent variables it is important to use the $N R_u^2$ form rather than the explained
sum of squares form suggested in Davidson and MacKinnon (1984): the latter test requires
$\sigma^2 = 1$, which is always the case for binary response variables but is too restrictive for fractional
response variables. Alternatively, as in Gurmu and Trivedi (1993), each term in regression (17)
can be divided by $\hat\sigma$ and then the explained sum of squares can be used. This is essentially the
same as the $N R_u^2$ statistic (although they will differ if $\hat\sigma$ is estimated with the degrees-of-freedom
adjustment in equation (10)).
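The $N R_u^2$ statistic from regression (17) can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: the function names are hypothetical, the inputs (residuals, fitted values, densities and gradients) are assumed to come from a QMLE fit under the null, and the small least-squares problem is solved through the normal equations with plain Gaussian elimination.

```python
import math

def solve_linear(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = m[r][n] - sum(m[r][c] * z[c] for c in range(r + 1, n))
        z[r] = s / m[r][r]
    return z

def lm_statistic(u_hat, G_hat, g_hat, x, grad_gamma):
    """N times the uncentered R-squared from regression (17).

    u_hat, G_hat, g_hat: residuals, fitted means and densities under the null.
    x: list of rows x_i (each including the constant).
    grad_gamma: list of rows with the gradient of m with respect to gamma.
    """
    n = len(u_hat)
    rows, u_tilde = [], []
    for u, G, g, xi, dgi in zip(u_hat, G_hat, g_hat, x, grad_gamma):
        w = math.sqrt(G * (1.0 - G))          # nominal standard deviation
        u_tilde.append(u / w)                 # weighted residual
        rows.append([g * v / w for v in xi] + [d / w for d in dgi])
    p = len(rows[0])
    xtx = [[sum(r[j] * r[k] for r in rows) for k in range(p)] for j in range(p)]
    xty = [sum(r[j] * ut for r, ut in zip(rows, u_tilde)) for j in range(p)]
    coef = solve_linear(xtx, xty)
    ssr = sum((ut - sum(c * v for c, v in zip(coef, r))) ** 2
              for r, ut in zip(rows, u_tilde))
    sst_uncentered = sum(ut ** 2 for ut in u_tilde)
    return n * (1.0 - ssr / sst_uncentered)
```

The returned value would be compared with a chi-square critical value with $Q$ degrees of freedom. Because the r-squared is uncentered, no degrees-of-freedom or mean adjustment is applied.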
It is often useful to have a likelihood-based statistic, especially for testing exclusion
restrictions. Under the same two assumptions (15) and (16), a quasi-likelihood ratio (QLR)
statistic has a limiting chi-square distribution. Let $\mathcal{L}_N(\hat\beta, \gamma_0)$ denote the log-likelihood
evaluated under the null, and let $\mathcal{L}_N(\ddot\beta, \ddot\gamma)$ denote the log-likelihood from the unrestricted
model (that is, the Bernoulli log-likelihood with $m(x_i, z_i, \beta, \gamma)$ used in place of $G(x_i\beta)$).
Further, define $\ddot m_i \equiv m(x_i, z_i, \ddot\beta, \ddot\gamma)$, and let the variance estimator based on the unrestricted
estimates be
$$\ddot\sigma^2 \equiv (N - K - Q)^{-1} \sum_{i=1}^{N} \frac{(y_i - \ddot m_i)^2}{\ddot m_i (1 - \ddot m_i)} \tag{18}$$
Table I. Summary statistics

Full sample (number of observations = 4734)

Variable       Mean       Standard deviation   Minimum   Maximum
PRATE          0.869      0.167                0.023     1
MRATE          0.746      0.844                0.011     5
EMPLOYMENT     4621.01    16299.64             53        443040
AGE            13.14      9.63                 4         76
SOLE           0.415      0.493                0         1
where G(.) is the logistic function. (The GAUSS® code used for the estimation and testing is
available on request from the authors.) The partial effect of MRATE on $E(\mathrm{PRATE} \mid x)$ is
$\partial E(\mathrm{PRATE} \mid x)/\partial \mathrm{MRATE}$, or, for specification (23), $g(x\beta)\beta_2$, where $g(z) \equiv dG(z)/dz = \exp(z)/[1 + \exp(z)]^2$.
Because $g(z) \to 0$ as $z \to \infty$, the marginal effect falls to zero as MRATE becomes
large, holding other variables fixed.
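A short worked identity may make the shape of this partial effect easier to see. It uses only the logistic $G$ and the coefficient $\beta_2$ on MRATE already defined above; the bound assumes $\beta_2 > 0$.

```latex
\[
g(z) = \frac{e^{z}}{(1+e^{z})^{2}}
     = \frac{e^{z}}{1+e^{z}} \cdot \frac{1}{1+e^{z}}
     = G(z)\,[1-G(z)],
\]
\[
\frac{\partial E(\mathrm{PRATE}\mid x)}{\partial \mathrm{MRATE}}
  = \beta_2\, G(x\beta)\,[1-G(x\beta)],
\qquad
0 < G(x\beta)\,[1-G(x\beta)] \le \tfrac14 .
\]
```

The effect is therefore largest, at $\beta_2/4$, where the predicted participation rate is one half, and it shrinks toward zero as the predicted rate approaches one.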
Column (2) of Table II contains the results of estimating equation (23). The variable MRATE
is highly statistically significant and, with the exception of SOLE (which is still not significant),
the directions of effects of all other variables are the same as in the linear model. Unlike the
linear model, the RESET statistic reveals no misspecification in equation (23); the p-value for
the robust statistic is 0.676, and it is even larger for the non-robust statistic. Based on this
RESET analog, equation (23) appears to capture the non-linear relationship between PRATE
and the explanatory variables for MRATE < 1.
There is other evidence that equation (23) fits better than (22). Table II also contains an
r-squared for each model, which in either case is defined as $1 - \mathrm{SSR}/\mathrm{SST}$, where SST is the total
sum of squares of the $y_i$. The SSRs, reported in Table II, are based on the unweighted residuals,
$\hat u_i \equiv y_i - \hat y_i$ for OLS and QMLE. Thus, the r-squareds are comparable across any model for
$E(\mathrm{PRATE} \mid x)$ and for any estimation methods. From Table II we see that the r-squared from the
logit model is about 6% higher than the r-squared for the linear model. Also, while OLS chooses
$\hat\beta$ to maximize the r-squared over all linear functions of $x$, the logit QMLE does not maximize
r-squared given the logit functional form; yet the logit model has a higher r-squared than the
linear model. Since we are only modelling the conditional expectation, with other features of the
conditional distribution left unspecified, the r-squared is the most appropriate goodness-of-fit
measure.
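The goodness-of-fit measure described here can be computed the same way for any model of the conditional mean. The sketch below assumes SST is taken about the sample mean of $y$ (the usual convention); the function name and the numbers are hypothetical.

```python
def r_squared(y, fitted):
    """1 - SSR/SST based on unweighted residuals y_i - fitted_i."""
    y_bar = sum(y) / len(y)
    ssr = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    sst = sum((yi - y_bar) ** 2 for yi in y)   # total sum of squares about the mean
    return 1.0 - ssr / sst

# Hypothetical outcomes and fitted values from two competing models.
y = [0.55, 0.70, 0.62, 0.85, 0.80, 1.00]
fitted_linear = [0.58, 0.64, 0.70, 0.76, 0.82, 0.94]
fitted_logit = [0.57, 0.65, 0.71, 0.78, 0.83, 0.93]
print(r_squared(y, fitted_linear), r_squared(y, fitted_logit))
```

Because the same unweighted residuals are used in both cases, the two numbers can be compared directly even though the models were estimated by different methods.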
Before directly comparing estimates of the response functions and the marginal effects, some
other comments are worth making about Table II. First, each method comes with an SER
(standard error of the regression). These SERs are the estimates of $\sigma$ for the different models,
and thus are not directly comparable. For OLS, $\hat\sigma^2$ is based on the unweighted OLS residuals,
while for QMLE, $\hat\sigma^2$ is based on the weighted residuals; see equation (11). Because $\hat\sigma = 0.438$
for the QMLE, this implies that the usual logit standard errors obtained from the inverse of the
Hessian, $\hat A^{-1}$, are over twice as large as the GLM standard errors that are obtained as the
square roots of the diagonal elements of $\hat\sigma^2 \hat A^{-1}$. The latter (smaller) standard errors are the
appropriate ones under the GLM assumption (6) because they do not assume that $\sigma^2 = 1$. PRATE
is underdispersed ($\sigma^2 < 1$) relative to the Bernoulli variance ($\sigma^2 = 1$).
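The "over twice as large" statement follows directly from the reported SER. The subscripted labels below are introduced only for this illustration.

```latex
\[
\mathrm{se}_{\mathrm{GLM}}(\hat\beta_j)
  = \hat\sigma \,\bigl[\hat A^{-1}\bigr]_{jj}^{1/2}
  = 0.438 \times \mathrm{se}_{\mathrm{logit}}(\hat\beta_j),
\qquad
\frac{\mathrm{se}_{\mathrm{logit}}(\hat\beta_j)}{\mathrm{se}_{\mathrm{GLM}}(\hat\beta_j)}
  = \frac{1}{0.438} \approx 2.28,
\qquad
\hat\sigma^2 = 0.438^2 \approx 0.192 .
\]
```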
We now turn to a direct comparison of the linear and logistic models. To compare the
estimated response functions and marginal effects, we need to choose values for MRATE, EMP,
AGE, and SOLE. Because most 401(k) plans are accompanied by other pension plans, we set
SOLE = 0. We also set AGE at roughly its sample average, AGE = 13. To gauge the differences
across firms of different sizes we choose three firm sizes: small (EMP = 200), average
(EMP = 4620), and large (EMP = 100,000). The estimated relationships between $E(\mathrm{PRATE} \mid x)$
and MRATE for the three different firm sizes are graphed in Figure 1. Interestingly, for a small
firm the linear and logistic predictions are most different at high match rates; for the average
sized firm, the difference is largest at low match rates; and for a large firm the largest difference
is at a match rate between 0.5 and 0.75.
As is seen from Table II, the marginal effect of MRATE on $E(\mathrm{PRATE} \mid x)$ for the linear model
is 0.156 for any value of $x$. For the logistic model, we set SOLE = 0, AGE = 13, and
EMP = 4,620, and compute the estimated partial effect at three different match rates:
MRATE = 0, MRATE = 0.50, and MRATE = 1.0. The estimated derivatives are 0.288, 0.197,
and 0.118, respectively, which illustrates the diminishing marginal effect as MRATE increases.
Perhaps not surprisingly, the marginal effect estimated from the linear model is bracketed by the
low and high estimates from the non-linear model. The differences in the estimated marginal
effects are not trivial; for example, the non-linear model predicts an increase in participation of
approximately 2.9 percentage points in moving from a zero match rate to MRATE = 0.10, rather
than the 1.6 percentage point increase obtained from the linear model. Similarly, at high match
rates the marginal effect from increasing the match rate is estimated to be lower in the non-linear
model.
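The reported partial effects can be cross-checked with a few lines of arithmetic. The sketch below is a consistency check, not the authors' calculation. It takes the MRATE coefficient of 1.390 quoted for equation (23) elsewhere in the text and the reported effect 0.288 at MRATE = 0, and backs out the implied index from $g = G(1-G)$. Choosing the root with $G > 1/2$ is an assumption, motivated by the high average participation rate. All inputs are rounded, so agreement is only approximate.

```python
import math

BETA_MRATE = 1.390    # coefficient on MRATE in equation (23), as quoted in the text
PE_AT_ZERO = 0.288    # reported partial effect at MRATE = 0
LINEAR_SLOPE = 0.156  # reported marginal effect in the linear model

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

# g = G(1 - G) = PE / beta; take the root with G > 1/2 (assumption).
g0 = PE_AT_ZERO / BETA_MRATE
G0 = 0.5 * (1.0 + math.sqrt(1.0 - 4.0 * g0))
index0 = math.log(G0 / (1.0 - G0))   # implied x*beta at MRATE = 0

def partial_effect(mrate):
    """Partial effect of MRATE at a given match rate, other regressors fixed."""
    G = logistic(index0 + BETA_MRATE * mrate)
    return BETA_MRATE * G * (1.0 - G)

for m in (0.0, 0.5, 1.0):
    print(m, round(partial_effect(m), 3))

# Derivative-based change in participation (percentage points) for a 0.10 rise.
step = 0.10
print("logit:", 100 * PE_AT_ZERO * step)
print("linear:", 100 * LINEAR_SLOPE * step)
```

The values at MRATE = 0.50 and 1.0 come out close to the reported 0.197 and 0.118. Multiplying each derivative by 0.10 gives figures consistent with the 2.9 and 1.6 percentage points quoted above, although the text does not say whether the derivative or the exact change in fitted values was used.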
One way to try to salvage the linear model is to use a more flexible functional form in the
match rate. A popular functional form that allows a diminishing marginal effect is a quadratic.
Column (3) contains estimates of the linear model that includes a quadratic in MRATE. The
squared term is marginally significant (robust t-statistic = -1.98), and this does give a
diminishing marginal effect. But even with this additional regressor the model in column (3)
does not fit as well as the logistic model without the quadratic term (the r-squared for the linear
model with the quadratic term is only 0.144). Further, the rejection of the model by RESET is
almost as strong as it was without the quadratic. Thus, we conclude that simply adding MRATE$^2$
to equation (22) is not sufficient. (The spline approach used by Papke (1995) is more effective in
capturing a diminishing effect in this application, but the coefficients are more difficult to
interpret.)

Figure 1. PRATE versus MRATE for various firm sizes: (a) EMP = 200; (b) EMP = 4620; (c) EMP = 100,000
When MRATE$^2$ is added to equation (23) it turns out to be insignificant. Thus, the logistic
functional form, with the term linear in MRATE, appears to be enough to capture the
diminishing effect, at least for MRATE < 1. This is a useful lesson: a significant quadratic term
in a linear model might be indicating that an entirely different, more parsimonious, functional
form can provide a better fit. Model (23) is clearly the preferred specification thus far.
As another test of model (23), we interact log(EMP) with each of MRATE, AGE, AGE$^2$, and
SOLE and test for exclusion of these four interactions using the LM and QLR tests discussed in
Section 3. This is similar in spirit to a Chow test where the sample is split based on firm size, but
here we do not need to make an arbitrary choice about where to split the sample. The LM statistic
is 16.52, the robust LM statistic is 14.41, and the QLR statistic, computed from equation (19), is
15.78 (unrestricted log-likelihood $\mathcal{L}_{ur} = -1547.33$, restricted log-likelihood $\mathcal{L}_{r} = -1548.84$, and $\ddot\sigma^2 = 0.1914$). The associated p-value for the robust LM
statistic is 0.006, which rejects equation (23) at the 1% significance level. Thus, equation (23)
apparently misses some non-linearities, although the significance level is not very small given the
large sample size (compare the p-value for RESET in the linear model).
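The reported QLR value and p-value can be reproduced from the numbers in this paragraph. Equation (19) does not appear in this excerpt, so the sketch assumes it takes the form of twice the difference in log-likelihoods divided by the unrestricted variance estimate, and that the larger log-likelihood is the unrestricted one. With four excluded interactions the reference distribution is chi-square with 4 degrees of freedom, whose upper tail has a simple closed form.

```python
import math

def chi2_sf_4df(stat):
    """Upper-tail probability of a chi-square variate with 4 degrees of freedom.

    For even degrees of freedom the tail has a closed form; with 4 df it is
    exp(-s/2) * (1 + s/2).
    """
    return math.exp(-stat / 2.0) * (1.0 + stat / 2.0)

loglik_unrestricted = -1547.33
loglik_restricted = -1548.84
sigma2_unrestricted = 0.1914

# Assumed form of equation (19): scaled likelihood-ratio statistic.
qlr = 2.0 * (loglik_unrestricted - loglik_restricted) / sigma2_unrestricted
print("QLR:", round(qlr, 2))

for name, stat in (("LM", 16.52), ("robust LM", 14.41), ("QLR", qlr)):
    print(name, round(stat, 2), round(chi2_sf_4df(stat), 4))
```

Under these assumptions the computed QLR rounds to the reported 15.78, and the tail probability for the robust LM statistic is about 0.006, matching the text.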
From a practical perspective, the story about the relationship between expected PRATE and
MRATE does not change: the t-statistic on the term log(EMP)·MRATE is only -1.27 (the
robust t-statistic is -1.13). In fact, when log(EMP)·MRATE is dropped from the more general
model, the coefficient on MRATE becomes 1.396, which is a trivial change from 1.390, the
estimate from equation (23). The most significant interaction term is log(EMP)·SOLE, with a
t-statistic of -3.48 (robust t-statistic = -3.47). We report only equation (23) because of its
simplicity and because it captures the economically important relationship between PRATE and
MRATE. The full set of results is available on request from the authors.
The basic story does not change when we estimate the models over the entire sample. One
notable difference is that a quadratic term in MRATE is now significant in equation (23),
reflecting a faster diminishing effect at high match rates. Table III presents the same models as
Table II, now estimated over the full sample. First consider the models without MRATE$^2$. The
discrepancy in r-squareds between equations (23) and (22) is even greater than before, but
RESET now rejects both equations, although the logistic model is rejected less strongly. In
columns (3) and (4) we put MRATE$^2$ into each equation. Model (22) is still soundly rejected,
whereas (23) with MRATE$^2$ passes the RESET test with a p-value above 0.50. For the full
sample, it seems that a quadratic in MRATE (or some other way to capture additional
non-linearities) is needed to provide a reasonable fit.
Putting MRATE$^2$ into equation (23) has the usual drawback for quadratics: it implies an eventual
negative marginal effect. In this case, the marginal effect becomes negative at a match rate of about
2.51. This is a high value for MRATE, but there are some match rates this large in the full sample.
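The turning point can be derived in one line. The label $\beta_3$ for the coefficient on MRATE$^2$ is introduced here for illustration only; the estimated coefficients themselves are not reproduced in this excerpt.

```latex
\[
\frac{\partial E(\mathrm{PRATE}\mid x)}{\partial \mathrm{MRATE}}
  = g(x\beta)\,\bigl(\beta_2 + 2\beta_3\,\mathrm{MRATE}\bigr),
\qquad g(x\beta) > 0,
\]
\[
\beta_2 + 2\beta_3\,\mathrm{MRATE}^{*} = 0
\;\Longrightarrow\;
\mathrm{MRATE}^{*} = -\frac{\beta_2}{2\beta_3} .
\]
```

Since $g(\cdot)$ is strictly positive, the sign of the effect depends only on the bracketed term. A turning point of about 2.51 therefore corresponds to $\hat\beta_2/\hat\beta_3 \approx -5.02$.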
5. CONCLUSION
The functional forms offered in this paper are viable alternatives to linear models that use either
y or the log-odds ratio of y as the dependent variable. No special data adjustments are needed for
the extreme values of zero and one, and the conditional expectation of y given the explanatory
variables is estimated directly. The quasi-likelihood method we propose is fully robust and
relatively efficient under the GLM assumption (6). The empirical application to 401(k) plan
participation rates illustrates the usefulness of these methods: while a linear model to explain the
fraction of participants is strongly rejected, the logistic conditional mean specification is not.
Methods for fractional dependent variables have many applications in economics. For
example, Hausman and Leonard (1994) have recently applied the methods suggested here to
estimate a model for Nielsen ratings for telecasts of NBA basketball games.
ACKNOWLEDGEMENTS
We are grateful to John Mullahy and two anonymous referees for helpful comments. The second
author would like to thank the Alfred P. Sloan Foundation for financial support.
REFERENCES
Davidson, R. and J. G. MacKinnon (1984), 'Convenient specification tests for logit and probit models',
Journal of Econometrics, 24, 241-262.
Duan, N. (1983), 'Smearing estimate: a nonparametric retransformation method', Journal of the American
Statistical Association, 78, 605-610.
Engle, R. F. (1984), 'Wald, likelihood ratio, and Lagrange multiplier statistics in econometrics', in Z.
Griliches and M. D. Intriligator (eds), Handbook of Econometrics, Volume 2, 776-828, North-Holland,
Amsterdam.
Gourieroux, C., A. Monfort and A. Trognon (1984), 'Pseudo-maximum likelihood methods: theory',
Econometrica, 52, 681-700.
Gurmu, S. and P. K. Trivedi (1993), 'Variable augmentation specification tests in the exponential family',
Econometric Theory, 9, 94-113.
Hausman, J. A. and G. K. Leonard (1994), 'Superstars in the NBA: economic value and policy', MIT
Department of Economics Working Paper No. 95-2.
Maddala, G. S. (1983), Limited Dependent and Qualitative Variables in Econometrics, Cambridge
University Press, Cambridge.
McCullagh, P. and J. A. Nelder (1989), Generalized Linear Models, 2nd edition, Chapman and Hall, New York.
Mullahy, J. (1990), 'Regression models and transformations for beta-distributed outcomes', mimeo,
Trinity College Department of Economics.
Papke, L. E. (1995), 'Participation in and contributions to 401(k) pension plans: evidence from plan data',
Journal of Human Resources, 30, 311-325.
Papke, L. E. and J. M. Wooldridge (1993), 'Econometric methods for fractional response variables with
an application to 401(k) plan participation rates', National Bureau of Economic Research Technical
Working Paper No. 147.
Ramsey, J. B. (1969), 'Tests for specification errors in classical linear least squares regression analysis',
Journal of the Royal Statistical Society, Series B, 31, 350-371.
Wooldridge, J. M. (1991a), 'On the application of robust, regression-based diagnostics to models of
conditional means and conditional variances', Journal of Econometrics, 47, 5-46.
Wooldridge, J. M. (1991b), 'Specification testing and quasi-maximum likelihood estimation', Journal of
Econometrics, 48, 29-55.