oaxaca: Blinder-Oaxaca Decomposition in R
Marek Hlavac
Social Policy Institute, Bratislava, Slovakia
Abstract
This article introduces the R package oaxaca to perform the Blinder-Oaxaca decomposition,
a statistical method that decomposes the gap in mean outcomes across two
groups into a portion that is due to differences in group characteristics and a portion that
cannot be explained by such differences. Although this method has been most widely
used to study gender- and race-based discrimination in the labor market, Blinder-Oaxaca
decompositions can be applied to explain differences in any continuous outcome across
any two groups. The oaxaca package implements all the most commonly used variants of
the Blinder-Oaxaca decomposition for linear regression models, calculates bootstrapped
standard errors for its estimates, and allows users to visualize the decomposition results.
If you use the oaxaca package in your research, please do not forget to include a citation:
1. Introduction
In this article, I introduce the R package oaxaca to estimate Blinder-Oaxaca decompositions
for linear regression models. The Blinder-Oaxaca decomposition is a statistical method that
decomposes differences in mean outcomes across two groups into a part that is due to group
differences in the levels of explanatory variables and a part that is due to differential
magnitudes of regression coefficients.
The Blinder-Oaxaca decomposition originated and has been widely used in the study of labor
market discrimination (Blinder 1973; Oaxaca 1973). Economists and sociologists have, for
instance, used it to decompose wage and earnings differences based on gender (e.g., Stanley and
Jarrell 1998; Weichselbaumer and Winter-Ebmer 2005) and race (e.g., Darity, Guilkey, and
Winfrey 1996; Kim 2010). Although Blinder-Oaxaca decompositions have been a mainstay of
empirical research on discrimination, they can be, in principle, applied to explain differences
in any continuous outcome across any two groups. Researchers have, for instance, used them to
examine the assimilation of immigrants (LaLonde and Topel 1992), school enrollment rates
(Borooah and Iyer 2006), health insurance coverage (Bustamante, Fang, Rizzo, and Ortega
2009), the prevalence of smoking (Bauer, Göhlmann, and Sinning 2007), or even local hunting
lease rates (Munn and Hussain 2010).
Several software implementations of the Blinder-Oaxaca decomposition are already available.
2 oaxaca: Blinder-Oaxaca Decomposition in R
These include modules oaxaca (Jann 2008) and decomp (Watson 2010) for Stata (Stata-
Corp 2017) that estimate the decomposition for linear regression models. In addition, Stata
modules fairlie (Jann 2006) and nldecompose (Sinning, Hahn, and Bauer 2008) implement
the decomposition for a large variety of non-linear models using methods proposed in Fairlie
(2005), Bauer and Sinning (2008) and Bauer and Sinning (2010). A SAS (SAS Institute 2017)
implementation of the Blinder-Oaxaca decomposition for non-linear models is also available
(Fairlie 2013).
The oaxaca package is the first Blinder-Oaxaca decomposition package for the R statistical
programming language (R Core Team 2017). It implements several types of the decomposition
for linear regression models, and obtains point estimates of all decomposition components
using the same estimation procedures as the Stata module oaxaca (Jann 2008). Standard
errors are calculated using a non-parametric bootstrapping approach (Efron 1979). Unlike
any other existing software implementation of the Blinder-Oaxaca decomposition, oaxaca
enables users to generate elegant bar graph visualizations of all decomposition results.
The package is available free of charge, and can be installed from the Comprehensive R Archive
Network (CRAN) (2017) in the usual way:
R> install.packages("oaxaca")
In the next section, I give a brief description of the Blinder-Oaxaca decomposition method.
I then provide an overview of the oaxaca package’s features in Section 3. In Section 4, I
showcase them on an empirical example that examines the wage gap between native and
foreign-born Hispanic workers in metropolitan Chicago. Section 5 concludes.
2. Blinder-Oaxaca decomposition
This section provides an overview of the Blinder-Oaxaca decomposition. It is by no means
intended to be exhaustive, and primarily aims to give readers an understanding of the
estimation procedures that the oaxaca package implements. Readers who are interested in a more
comprehensive and rigorous treatment of the statistical method can refer to the excellent
overview in Jann (2008), whose notation I follow with only a few minor adjustments.
The aim of the Blinder-Oaxaca decomposition is to explain how much of the difference in
mean outcomes across two groups is due to group differences in the levels of explanatory
variables, and how much is due to differences in the magnitude of regression coefficients
(Oaxaca 1973; Blinder 1973). I will label the two groups as Group A and Group B. The mean
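outcome difference to be explained (∆Ȳ) is simply the difference of the mean outcomes for
observations in Group A and Group B, denoted as Ȳ_A and Ȳ_B, respectively:
\[
\Delta\bar{Y} = \bar{Y}_A - \bar{Y}_B \tag{1}
\]
In a linear regression that includes an intercept, each group's mean outcome equals its vector
of mean explanatory variables multiplied by its estimated coefficient vector, so the difference
can be rewritten as:
\[
\Delta\bar{Y} = \bar{X}_A'\hat{\beta}_A - \bar{X}_B'\hat{\beta}_B \tag{2}
\]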
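This expression can, in turn, be written as the sum of the following three terms:
\[
\Delta\bar{Y} = \underbrace{(\bar{X}_A - \bar{X}_B)'\hat{\beta}_B}_{\text{endowments}} + \underbrace{\bar{X}_B'(\hat{\beta}_A - \hat{\beta}_B)}_{\text{coefficients}} + \underbrace{(\bar{X}_A - \bar{X}_B)'(\hat{\beta}_A - \hat{\beta}_B)}_{\text{interaction}} \tag{3}
\]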
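Each term can be broken down further into the contributions of individual explanatory
variables. The interaction term, for instance, is a sum of variable-specific products:
\[
\underbrace{(\bar{X}_A - \bar{X}_B)'(\hat{\beta}_A - \hat{\beta}_B)}_{\text{interaction}} = \underbrace{(\bar{X}_{1A} - \bar{X}_{1B})(\hat{\beta}_{1A} - \hat{\beta}_{1B})}_{\text{variable 1}} + \underbrace{(\bar{X}_{2A} - \bar{X}_{2B})(\hat{\beta}_{2A} - \hat{\beta}_{2B})}_{\text{variable 2}} + \dots \tag{6}
\]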
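The threefold decomposition in Equation 3, and the variable-by-variable breakdown of its interaction term in Equation 6, can be reproduced by hand with lm(). The following is only a minimal sketch under stated assumptions: the data are simulated, and the names d, g, x1, x2, threefold and interaction_by_var are hypothetical and not part of the oaxaca package.

```r
# Simulated data (hypothetical): g = 0 marks Group A, g = 1 marks Group B
set.seed(1)
n  <- 500
g  <- rbinom(n, 1, 0.5)
x1 <- rnorm(n, mean = 2 - 0.5 * g)
x2 <- rnorm(n, mean = 1 + 0.3 * g)
y  <- 1 + (1 - 0.2 * g) * x1 + 0.5 * x2 + rnorm(n)
d  <- data.frame(y, x1, x2, g)

# Separate OLS regressions for the two groups
fit_A  <- lm(y ~ x1 + x2, data = d[d$g == 0, ])
fit_B  <- lm(y ~ x1 + x2, data = d[d$g == 1, ])
beta_A <- coef(fit_A)
beta_B <- coef(fit_B)

# Mean regressor vectors (the intercept column has mean 1)
xbar_A <- colMeans(model.matrix(fit_A))
xbar_B <- colMeans(model.matrix(fit_B))

# Overall threefold decomposition (Equation 3)
threefold <- c(
  endowments   = sum((xbar_A - xbar_B) * beta_B),
  coefficients = sum(xbar_B * (beta_A - beta_B)),
  interaction  = sum((xbar_A - xbar_B) * (beta_A - beta_B))
)

# Variable-by-variable interaction terms (Equation 6)
interaction_by_var <- (xbar_A - xbar_B) * (beta_A - beta_B)

# The three components add up to the raw gap in mean outcomes
gap <- mean(d$y[d$g == 0]) - mean(d$y[d$g == 1])
print(threefold)
print(all.equal(sum(threefold), gap))
```

Because both regressions include an intercept, the fitted group means coincide with the observed ones, which is why the components sum to the gap up to floating-point error.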
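The twofold decomposition is instead expressed relative to a reference coefficient vector β̂_R:
\[
\Delta\bar{Y} = \underbrace{(\bar{X}_A - \bar{X}_B)'\hat{\beta}_R}_{\text{explained}} + \underbrace{\underbrace{\bar{X}_A'(\hat{\beta}_A - \hat{\beta}_R)}_{\text{unexplained A}} + \underbrace{\bar{X}_B'(\hat{\beta}_R - \hat{\beta}_B)}_{\text{unexplained B}}}_{\text{unexplained}} \tag{7}
\]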
As Equation 7 shows, the twofold decomposition divides the difference in mean outcomes into
a portion that is explained by cross-group differences in the explanatory variables, and a part
that remains unexplained by these differences.
The unexplained portion of the mean outcome gap has often been attributed to discrimination,
but may also result from the influence of unobserved variables. It can be further decomposed
into two sub-components, labeled “unexplained A” and “unexplained B” above. If one
interprets the reference coefficient vector to be non-discriminatory, these sub-components measure
the part of the mean difference in outcomes that originates from discrimination in favor of
Group A and the part that comes from discrimination against Group B, respectively.
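The twofold decomposition in Equation 7 can be sketched in the same hand-computed fashion. The example below is an illustration on simulated data, not the package's own code; the function twofold() and all variable names are hypothetical. It uses the Group A and the Group B coefficients in turn as the reference vector.

```r
# Simulated data (hypothetical): g = 0 marks Group A, g = 1 marks Group B
set.seed(1)
n  <- 500
g  <- rbinom(n, 1, 0.5)
x1 <- rnorm(n, mean = 2 - 0.5 * g)
y  <- 1 + (1 - 0.2 * g) * x1 + rnorm(n)
d  <- data.frame(y, x1, g)

fit_A  <- lm(y ~ x1, data = d[d$g == 0, ])
fit_B  <- lm(y ~ x1, data = d[d$g == 1, ])
beta_A <- coef(fit_A)
beta_B <- coef(fit_B)
xbar_A <- colMeans(model.matrix(fit_A))
xbar_B <- colMeans(model.matrix(fit_B))

# Twofold decomposition for a given reference coefficient vector beta_R
twofold <- function(beta_R) {
  c(explained     = sum((xbar_A - xbar_B) * beta_R),
    unexplained_A = sum(xbar_A * (beta_A - beta_R)),
    unexplained_B = sum(xbar_B * (beta_R - beta_B)))
}

res_A <- twofold(beta_A)  # Group A coefficients as the reference
res_B <- twofold(beta_B)  # Group B coefficients as the reference

gap <- mean(d$y[d$g == 0]) - mean(d$y[d$g == 1])
print(rbind(res_A, res_B))
```

When one group's own coefficients serve as the reference, the corresponding unexplained sub-component is zero by construction, so the whole unexplained part is assigned to the other group.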
The choice of the reference coefficients is generally up to the researcher. In the literature
on labor market discrimination, it is often assumed that only one of the two groups faces
discrimination – for instance, that only women or members of ethnic minorities are
discriminated against. In such cases, the reference coefficients will simply be the coefficients from a
regression on observations from one of the groups: either β̂ R = β̂ A or β̂ R = β̂ B .
Some researchers have instead used a weighted average of β̂ A and β̂ B as the set of reference
coefficients. Reimers (1983), for example, proposes giving equal weight to coefficients from
regressions on Group A and Group B observations:
\[
\hat{\beta}_R = 0.5\,\hat{\beta}_A + 0.5\,\hat{\beta}_B \tag{11}
\]
Cotton (1988) suggests weighting the coefficients by the proportion of observations in the
corresponding group:
\[
\hat{\beta}_R = \frac{n_A}{n_A + n_B}\,\hat{\beta}_A + \frac{n_B}{n_A + n_B}\,\hat{\beta}_B \tag{12}
\]
Other researchers still have advocated the use of coefficient estimates from a regression that
pools observations from both Groups A and B, and includes (Jann 2008) or does not include
(Neumark 1988) the group indicator variable as an additional regressor. The oaxaca package
estimates results for all of the aforementioned choices of β̂ R , and also enables users to specify
their own custom weights for β̂ A and β̂ B to construct a weighted average-based set of reference
coefficients.
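The alternative choices of reference coefficients described above can be compared side by side. The sketch below builds them by hand on simulated data; it is an illustration under assumptions (hypothetical variable names, a single covariate) and is not taken from the package source. For the pooled regression that includes the group indicator, the coefficient on the indicator itself is dropped from the reference vector.

```r
# Simulated data (hypothetical): g = 0 marks Group A, g = 1 marks Group B
set.seed(1)
n  <- 500
g  <- rbinom(n, 1, 0.5)
x1 <- rnorm(n, mean = 2 - 0.5 * g)
y  <- 1 + (1 - 0.2 * g) * x1 + rnorm(n)
d  <- data.frame(y, x1, g)

fit_A  <- lm(y ~ x1, data = d[d$g == 0, ])
fit_B  <- lm(y ~ x1, data = d[d$g == 1, ])
beta_A <- coef(fit_A)
beta_B <- coef(fit_B)
xbar_A <- colMeans(model.matrix(fit_A))
xbar_B <- colMeans(model.matrix(fit_B))
n_A <- sum(d$g == 0)
n_B <- sum(d$g == 1)

# Candidate reference coefficient vectors
beta_R <- list(
  reimers = 0.5 * beta_A + 0.5 * beta_B,                  # equal weights
  cotton  = (n_A * beta_A + n_B * beta_B) / (n_A + n_B),  # group-size weights
  neumark = coef(lm(y ~ x1, data = d)),                   # pooled, no indicator
  jann    = coef(lm(y ~ x1 + g, data = d))[names(beta_A)] # pooled, with indicator
)

explained   <- sapply(beta_R, function(b) sum((xbar_A - xbar_B) * b))
unexplained <- sapply(beta_R, function(b)
  sum(xbar_A * (beta_A - b)) + sum(xbar_B * (b - beta_B)))

gap <- mean(d$y[d$g == 0]) - mean(d$y[d$g == 1])
print(cbind(explained, unexplained))
```

Whatever reference vector is chosen, the explained and unexplained parts sum to the same gap; only the split between them changes.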
where Di , such that i = 1, . . . , k − 1, are indicator variables that represent individual levels
of a categorical variable. Category k is the omitted baseline.
To ensure that the Blinder-Oaxaca decomposition results are invariant to the user’s choice of
the omitted baseline category, oaxaca implements a procedure proposed by Gardeazabal and
Ugidos (2004). More specifically, the package transforms the above regression model into:
where the new regression coefficients on the indicator variables are calculated by adding or
subtracting an adjustment amount a to/from the original coefficients. The adjustment amount
a is simply the sum of the original dummy coefficients β divided by k, the total number of
categories:
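\[
a = \frac{\sum_{j=1}^{k-1} \beta_j}{k} \tag{15}
\]
The adjustment amount is then added to the original intercept β_0:
\[
\tilde{\beta}_0 = \beta_0 + a \tag{16}
\]
and subtracted from each dummy coefficient, where the coefficient on the omitted category k
is taken to be zero (β_k = 0):
\[
\tilde{\beta}_i = \beta_i - a \tag{17}
\]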
for i = 1, . . . , k. The adjusted coefficients (β̃), as well as the results of detailed variable-by-
variable Blinder-Oaxaca decompositions, will remain the same regardless of the researcher’s
choice of the omitted category k.
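The invariance property can be checked numerically. The sketch below applies the adjustment in Equations 15 to 17 to a simulated three-level categorical variable, refitting the model with each level in turn as the omitted baseline. The data, the factor edu and the helper normalize() are hypothetical and serve only as an illustration.

```r
# Simulated data (hypothetical) with a three-level categorical variable
set.seed(2)
n       <- 300
edu     <- factor(sample(c("low", "mid", "high"), n, replace = TRUE),
                  levels = c("low", "mid", "high"))
x       <- rnorm(n)
effects <- c(low = 0, mid = 1, high = 2.5)
y       <- 1 + 0.5 * x + unname(effects[as.character(edu)]) + rnorm(n)

# Fit the model with a chosen omitted baseline and apply the adjustment
normalize <- function(baseline) {
  f   <- relevel(edu, ref = baseline)
  b   <- coef(lm(y ~ x + f))
  k   <- nlevels(f)
  # Dummy coefficients, with zero for the omitted baseline category
  dummies <- setNames(rep(0, k), levels(f))
  others  <- setdiff(levels(f), baseline)
  dummies[others] <- b[paste0("f", others)]
  a <- sum(dummies) / k                         # Equation 15
  c("(Intercept)" = unname(b["(Intercept)"]) + a,  # Equation 16
    dummies[levels(edu)] - a)                   # Equation 17
}

adj_low  <- normalize("low")
adj_mid  <- normalize("mid")
adj_high <- normalize("high")
print(rbind(adj_low, adj_mid, adj_high))
```

The three rows printed at the end agree up to floating-point error, and the adjusted dummy coefficients sum to zero, which is the sense in which the normalized coefficients no longer depend on the omitted category.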
1. R resamples are randomly sampled with replacement from the relevant set of observa-
tions.
2. Decomposition estimates are calculated for each of the R resamples from Step 1.
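The two steps listed above can be sketched in base R. This is an illustration under assumptions rather than the package's implementation: the data are simulated, rows are resampled from the pooled data set, only the endowments component is tracked, and the final line assumes that the standard error is taken to be the standard deviation of the resampled estimates. The helper endowments() and all variable names are hypothetical.

```r
# Simulated data (hypothetical): g = 0 marks Group A, g = 1 marks Group B
set.seed(3)
n <- 400
g <- rbinom(n, 1, 0.5)
x <- rnorm(n, mean = 1 - 0.5 * g)
y <- 2 + (1 - 0.3 * g) * x + rnorm(n)
d <- data.frame(y, x, g)

# Endowments component of the threefold decomposition for a data set
endowments <- function(data) {
  fit_A  <- lm(y ~ x, data = data[data$g == 0, ])
  fit_B  <- lm(y ~ x, data = data[data$g == 1, ])
  xbar_A <- colMeans(model.matrix(fit_A))
  xbar_B <- colMeans(model.matrix(fit_B))
  sum((xbar_A - xbar_B) * coef(fit_B))
}

R <- 200  # number of bootstrap resamples

# Steps 1 and 2: resample rows with replacement, then re-estimate
boot_est <- replicate(R, {
  idx <- sample(nrow(d), replace = TRUE)
  endowments(d[idx, ])
})

# Assumed final step: standard deviation of the R estimates
se_endowments <- sd(boot_est)
print(c(estimate = endowments(d), se = se_endowments))
```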
y ~ x1 + x2 + x3 + ... | z
If the regression model contains dummies that represent a categorical variable (d1, d2, d3,
etc.), these can be specified by adding another part to the formula:
y ~ x1 + x2 + x3 + ... | z | d1 + d2 + d3 + ...
When categorical variable dummies are specified, the oaxaca() function will automatically
adjust estimates to be invariant with respect to the user’s choice of the omitted baseline
category.
If the user does not include any other arguments, oaxaca() will estimate the Blinder-Oaxaca
decompositions – both threefold and twofold – based on Ordinary Least Squares regressions
(estimated via the standard lm() function), and will calculate standard errors based on 100
bootstrapping replicates. By default, oaxaca() estimates the twofold decomposition with
Group A coefficients, Group B coefficients, their equally weighted average (Reimers 1983), a
weighted average that reflects the number of observations in Groups A and B (Cotton 1988),
as well as with pooled coefficients – both including and excluding the group indicator variable
(Neumark 1988; Jann 2008) – as the set of reference coefficients.
These defaults can, however, easily be changed. Users can use the argument group.weights
to specify additional relative weights of Group A and Group B coefficients in the estimation of
the twofold decomposition. They can also choose, via the R argument, how many
bootstrapping resamples should be drawn to calculate the standard errors. Last but not least, users can
use a different regression function (argument reg.fun) to estimate the regression coefficients
used in the decompositions. Note that, if a non-linear function such as glm() is chosen, the
decomposition will be based on the linear systematic component – usually associated with
the estimation of the corresponding latent variable – of the regression method.
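As an illustration of these arguments, the call below overrides two of the defaults. It is a sketch that assumes the oaxaca package is installed and uses the chicago data set that ships with it; the object name results_custom, the extra weights 0.25 and 0.75, and the choice of 50 replicates are arbitrary and only for demonstration.

```r
library("oaxaca")
data("chicago")

# Three-part formula: outcome ~ covariates | group indicator | dummies to adjust
results_custom <- oaxaca(
  ln.real.wage ~ age + female + LTHS + some.college + college + advanced.degree |
    foreign.born |
    LTHS + some.college + college + advanced.degree,
  data = chicago,
  group.weights = c(0.25, 0.75),  # additional relative weights for Group A
  R = 50                          # fewer bootstrap resamples than the default 100
)

# The number of resamples is stored in the returned object
results_custom$R
```

Leaving reg.fun at its default keeps the estimation on lm(); a different regression function would be supplied through the same argument.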
The function oaxaca() returns an object of class "oaxaca", which can then be passed on to
the plot() method to obtain a bar graph visualization of the Blinder-Oaxaca decomposition
results. The object contains lists named threefold and twofold which contain the results
of the threefold and twofold decompositions, respectively. In addition, the object stores the
regression coefficients used in the decomposition (component beta), the number of
observations in each group that were used in the analysis (n), the number of bootstrapping replicates
(R), the regression objects generated during the analysis (reg), as well as the mean values of
both the dependent variable (y) and the explanatory variables (x).
R> data("chicago")
The chicago data frame contains information about the demographic characteristics and
labor market outcomes of 712 employed Hispanic workers in the Chicago metropolitan area. It
is a subset of the 2013 Current Population Survey (CPS) Outgoing Rotation Groups (ORG)
data set (Center for Economic and Policy Research 2014). These data have been used
extensively in labor economics research (e.g., Holzer and Hlavac 2014).
I am interested in decomposing the wage gap between native and foreign-born workers. The
wage gap could be due to group differences in the level of wage determinants such as age,
gender or education. Alternatively, the gap could arise from a differential effect of these
determinants on native and immigrant workers’ wages. I call the oaxaca() function to estimate
the relative magnitudes of these channels’ influence:
As the formula argument indicates, the outcome variable in this decomposition is real.wage,
the worker’s real wage denominated in 2013 U.S. dollars. The values of the dependent
variable were obtained by exponentiating the natural logarithm of the workers’ real wages
(contained in the provided ln.real.wage variable):
The linear regression model includes covariates that account for the workers’ age, gender and
education. LTHS (“less than high school”), some.college, college and advanced.degree are
indicator variables that denote the highest level of education an individual has achieved. A
high school education is the omitted baseline category. The variable foreign.born indicates
whether a worker was born outside of the United States. Group A consists of native workers,
and Group B of foreign-born ones. To make sure that the choice of the omitted baseline does
not affect the decomposition estimates, the formula argument also specifies that the
categorical variables denoting the education level ought to be adjusted. Bootstrapped standard
errors are calculated based on 1,000 replicates.
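A call matching this description might look as follows. It is a reconstruction from the prose above, not the original listing: it assumes the oaxaca package is installed, and that the covariate names in the chicago data frame are those mentioned in the text and figure labels (age, female, LTHS, some.college, college, advanced.degree, foreign.born).

```r
library("oaxaca")
data("chicago")

# Real wage in 2013 U.S. dollars, recovered from its natural logarithm
chicago$real.wage <- exp(chicago$ln.real.wage)

# Outcome ~ covariates | group indicator | education dummies to be adjusted
results <- oaxaca(
  real.wage ~ age + female + LTHS + some.college + college + advanced.degree |
    foreign.born |
    LTHS + some.college + college + advanced.degree,
  data = chicago,
  R = 1000  # bootstrap replicates, as stated in the text
)

# Top-level components of the returned "oaxaca"-class object
names(results)
```

A smaller value of R shortens the run time considerably when only the point estimates are of interest.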
R> results$n
$n.A
[1] 287
$n.B
[1] 379
$n.pooled
[1] 666
The n component of the resulting "oaxaca"-class object indicates that there are nA = 287
native and nB = 379 foreign-born workers in the analyzed sample. The pooled analysis
contains nA + nB = 666 observations.
R> results$y
$y.A
[1] 17.58282
$y.B
[1] 14.56725
$y.diff
[1] 3.015574
The y component of the resulting "oaxaca"-class object indicates that the mean real wage is
$17.58 for the natives (Group A) and $14.57 for foreign-born workers, leaving the difference
of approximately $3.02 to be explained by the Blinder-Oaxaca decomposition.
R> results$threefold$overall
coef(interaction) se(interaction)
-1.4342857 0.7953771
The results of the threefold decomposition suggest that, of the $3.02 difference, approximately
$1.62 can be attributed to group differences in endowments (i.e., age, gender, education), $2.83
to differences in coefficients, and the remaining -$1.43 is accounted for by the interaction
of the two. Next, I examine the endowments and coefficients components of the threefold
decomposition variable by variable. This is most easily done by using the plot() method:
Figure 1 shows the estimation results for each variable, along with error bars that indicate
95% confidence intervals. In the endowments component, most variables appear to have a
statistically insignificant (or only marginally significant) influence, with the sole exception of
LTHS. It seems that a significant portion of the native-immigrant wage gap is driven by group
differences in the proportion of individuals with less than a high school education.
R> summary(results$reg$reg.pooled.2)$coefficients["LTHS",]
[Figure 1: Bar graph of the threefold decomposition with two panels, “Endowments” and “Coefficients”. Each panel shows estimates with error bars for (Intercept), age, female, LTHS, some.college, college, advanced.degree and (Base); the horizontal axis runs from −10 to 10.]
R> results$x$x.mean.diff["LTHS"]
LTHS
-0.2693959
Individuals with less human capital tend to earn less, as can be seen from the pooled regression
coefficient on LTHS reported above. Furthermore, the value of x.mean.diff shows that a
greater proportion of foreign-born Hispanic workers have not attained a high school education.
The difference in the educational composition of native and immigrant worker groups thus
accounts for some portion of the natives’ higher wages.
Similarly, most variables are either insignificant or exhibit only marginal statistical significance
in the coefficients component. The only variable which achieves clear statistical significance
is age.
R> results$beta$beta.diff["age"]
age
0.1860063
As the difference in the age coefficients between natives and immigrants shows, the wage
payoff of an additional year of age is greater for U.S.-born Hispanic workers by almost 19
cents. As Figure 1 makes clear, differences in the regression coefficients on age account for
the decisive portion of the wage gap.
R> results$twofold$overall
For presentational ease, I focus my discussion on the Neumark (1988) decomposition, which
uses pooled regression coefficients (from a regression that does not include the group
indicator variable foreign.born) as the reference coefficient set. The Neumark decomposition
is denoted by -1 in the weights column. The results of the overall twofold decomposition
indicate that the $3.02 wage gap between native and foreign-born Hispanic workers can be
decomposed into $1.36 that can be explained by group differences in the explanatory variables
and $1.66 that is unexplained.
Let us assume that the unexplained component of the wage gap occurs due to labor
market discrimination, and that the pooled regression coefficients are non-discriminatory. The
Blinder-Oaxaca decomposition would then also indicate that $0.94 of the unexplained part
originates from discrimination in favor of native Hispanic workers (component "unexplained
A"), while $0.72 comes from discrimination against those who are born outside of the United
States (component "unexplained B"). The standard errors provide a sense of the uncertainty
that accompanies all of the point estimates.
I use a variety of plot() method arguments to customize the formatting of the resulting
bar graph. Through the components and component.labels arguments, I choose to
display only the two subparts – "unexplained A" (i.e., discrimination in favor of Group A)
and "unexplained B" (discrimination against Group B) – of the unexplained
decomposition component, and attach appropriate labels to them. Similarly, I use the variables and
variable.labels arguments to select and label the variables I examine.
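A plot() call along these lines is sketched below. The components, component.labels, variables and variable.labels arguments are those named in the text; the decomposition, group.weight and unexplained.split arguments are assumptions about the plot() method's interface and should be checked against the package documentation. The label "Against the Foreign-Born" is hypothetical, and the number of bootstrap replicates is kept small only to make the sketch quick to run.

```r
library("oaxaca")
data("chicago")
chicago$real.wage <- exp(chicago$ln.real.wage)

results <- oaxaca(
  real.wage ~ age + female + LTHS + some.college + college + advanced.degree |
    foreign.born |
    LTHS + some.college + college + advanced.degree,
  data = chicago,
  R = 30  # small number of replicates, for illustration only
)

# Twofold decomposition with pooled (Neumark) reference coefficients,
# showing only the two unexplained sub-components for three variables
plot(results,
     decomposition = "twofold",      # assumed argument name
     group.weight = -1,              # assumed: -1 selects the Neumark weights
     unexplained.split = TRUE,       # assumed argument name
     components = c("unexplained A", "unexplained B"),
     component.labels = c("unexplained A" = "In Favor of Natives",
                          "unexplained B" = "Against the Foreign-Born"),
     variables = c("age", "female", "college"),
     variable.labels = c("age" = "Years of Age",
                         "female" = "Female",
                         "college" = "College Education"))
```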
It appears that only the discrimination components for the age variable (labeled "Years of
Age" in the bar graph) achieve non-marginal statistical significance. The relative size of the
bars suggests that – if we assume that the pooled regression coefficients reflect a state of
non-discrimination – almost twice as much of the wage gap is explained by discrimination
against foreign-born workers as it is by discrimination in favor of native ones.
The comparison would be a little easier to make if the discrimination components bar charts
were presented side-by-side for each variable separately. This can be achieved by switching on
the component.left argument in the plot() method. The resulting bar graph is presented
in Figure 4.
[Figure: Bar graph of the twofold decomposition with two panels, “Explained” and “Unexplained”. Rows: (Intercept), age, female, LTHS, some.college, college, advanced.degree and (Base); the horizontal axis runs from −10 to 10.]
[Figure: Bar graph of the unexplained components for Years of Age, Female and College Education; the surviving panel title is “In Favor of Natives”.]
[Figure 4: Bar graph with one panel per variable (Years of Age, Female, College Education), each showing the unexplained components; the surviving component label is “In Favor of Natives”.]
Specific numerical values of the point estimates of the unexplained discrimination components
can, of course, be obtained directly from the "oaxaca"-class object:
To summarize, I have used the Blinder-Oaxaca decomposition to examine the wage gap
between native and foreign-born Hispanic workers in the Chicago metropolitan area. The results
of my analysis suggest that much of the gap can be explained by two facts:
• There are more workers with less than a high school education in the foreign-born group.
Workers with a lower stock of human capital tend to command lower wages in the labor
market. As a result, the relatively less-educated group of foreign-born Hispanic workers
will, on average, earn lower wages than their native counterparts.
• The returns to age are greater for native workers than for the immigrants. In other
words, even if the foreign-born workers had the same average age as the natives, the
native group would, on average, earn higher wages than immigrants. This result makes
some intuitive sense if we interpret age as potentially picking up the effect of labor
market experience. The higher returns to age among the natives may, for instance,
reflect the differential availability of more lucrative jobs with greater opportunities for
career growth.
5. Concluding remarks
In this article, I have introduced the oaxaca package for the R statistical programming
language. It is the first R package that allows researchers to estimate Blinder-Oaxaca
decompositions, a statistical method that decomposes differences in mean outcomes across two groups
into a part that is due to group differences in the levels of explanatory variables and a part
that is due to differential magnitudes of regression coefficients.
oaxaca estimates threefold and twofold Blinder-Oaxaca decompositions for linear models,
and also provides estimates for a detailed, variable-by-variable decomposition. Each point
estimate is presented with a bootstrapped standard error that measures the corresponding
estimation uncertainty.
I have demonstrated the package’s capabilities through an empirical example that examines
the wage gap between native and foreign-born Hispanic workers in the Chicago metropolitan
area. In doing so, I have also showcased the oaxaca package’s unique visualization features
that allow users to graphically summarize the results of their decompositions.
Acknowledgments
I would like to thank Kai Gehring, Becca Goldstein, Jakub Kubajek, Olivier Monso and
Sophie Saint-Philippe for helpful comments and suggestions.
References
Bauer TK, Sinning M (2008). “An Extension of the Blinder-Oaxaca Decomposition to Non-
linear Models.” Advances in Statistical Analysis, 92(2), 197–206.
Bauer TK, Sinning M (2010). “Blinder-Oaxaca Decomposition for Tobit Models.” Applied
Economics, 42(12), 1569–1575.
Blinder AS (1973). “Wage Discrimination: Reduced Form and Structural Estimates.” Journal
of Human Resources, 8(4), 436–455.
Borooah VK, Iyer S (2006). “The Decomposition of Inter-Group Differences in a Logit Model:
Extending the Oaxaca-Blinder Approach with an Application to School Enrolment in
India.” Journal of Economic and Social Measurement, 30(4), 279–293.
Bustamante AV, Fang H, Rizzo JA, Ortega AN (2009). “Heterogeneity in Health Insurance
Coverage Among US Latino Adults.” Journal of General Internal Medicine, 24(3), 561–566.
Center for Economic and Policy Research (2014). “CPS ORG Uniform Extracts, Version 1.9.”
URL http://ceprdata.org/cps-uniform-data-extracts/cps-outgoing-rotation-group/.
Cotton J (1988). “On the Decomposition of Wage Differentials.” Review of Economics and
Statistics, 70(2), 236–243.
Efron B (1979). “Bootstrap Methods: Another Look at the Jackknife.” The Annals of
Statistics, 7(1), 1–26.
Fairlie RW (2013). Example of Non-Linear Decomposition Technique for Logit Model. SAS
program, URL http://people.ucsc.edu/~rfairlie/decomposition/decompexample_v6.sas.
Holzer HJ, Hlavac M (2014). Diversity and Disparities: America Enters a New Century,
chapter A Very Uneven Road: U.S. Labor Markets in the Past Thirty Years. Russell Sage
Foundation, New York, NY, USA.
Jann B (2006). fairlie: Stata Module to Generate Nonlinear Decomposition of Binary Outcome
Differentials. Stata module, URL http://econpapers.repec.org/software/bocbocode/s456727.htm.
Jann B (2008). “The Blinder-Oaxaca Decomposition for Linear Regression Models.” Stata
Journal, 8(4), 453–479.
Kim C (2010). “Decomposing the Change in the Wage Gap Between White and Black Men
Over Time, 1980-2005: An Extension of the Blinder-Oaxaca Decomposition Method.”
Sociological Methods & Research, 38(4), 619–651.
LaLonde RJ, Topel RH (1992). Immigration and the Work Force, chapter The Assimilation
of Immigrants in the U.S. Labor Market. The University of Chicago Press, Chicago, IL,
USA.
Munn IA, Hussain A (2010). “Factors Determining Differences in Local Hunting Lease Rates:
Insights from Blinder-Oaxaca Decomposition.” Land Economics, 86(1), 66–78.
Neumark D (1988). “Employers’ Discriminatory Behavior and the Estimation of Wage Dis-
crimination.” Journal of Human Resources, 23(3), 279–295.
R Core Team (2017). R: A Language and Environment for Statistical Computing. R Foundation
for Statistical Computing, Vienna, Austria. URL http://www.R-project.org.
Reimers CW (1983). “Labor Market Discrimination Against Hispanic and Black Men.” Review
of Economics and Statistics, 65(4), 570–579.
SAS Institute (2017). SAS/STAT Software. SAS Institute Inc., Cary, NC, USA. URL
http://www.sas.com/en_us/software/analytics/stat.html.
StataCorp (2017). Stata Statistical Software: Release 13. StataCorp LP, College Station, TX,
USA. URL https://www.stata.com.
Wickham H (2009). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag, New York,
NY, USA.
Zeileis A, Croissant Y (2010). “Extended Model Formulas in R: Multiple Parts and Multiple
Responses.” Journal of Statistical Software, 34(1), 1–13. URL http://www.jstatsoft.org/v34/i01/.
Affiliation:
Marek Hlavac
Social Policy Institute, Bratislava, Slovakia