The Existence of Inefficiency: LASSO+SFA

Christopher F. Parmeter (Miami Herbert Business School)
Artem Prokhorov (University of Sydney Business School)
Valentin Zelenyuk (School of Economics and Centre for Efficiency and Productivity Analysis)

November 7, 2024
From the End

▶ We combine machine learning with stochastic frontier analysis.
▶ We establish moment/parameter redundancy for the use of post-double LASSO with MLE.
▶ The result is a simple and effective step-wise estimator that preserves efficiency and delivers valid inference.
X-inefficiency
X-inefficiency?
X-inefficiency!
Selective Attention
Stochastic Frontier Analysis (SFA)
The Stochastic Frontier Model
The stochastic frontier model we consider in this paper can be written as

    y = x′β + v − u = x′β + ε,    (1)

where y is an n-vector of output, x is a p × 1 vector of production inputs including a constant, and ε = v − u is the n-vector of error terms εᵢ composed of a Normal part vᵢ ∼ N(0, σᵥ²) and a Half-Normal inefficiency component uᵢ ∼ N⁺(0, σᵤ²).

▶ Aside from the presence of uᵢ, this is a trivial model to estimate.
▶ But we are interested in uᵢ.
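As an aside, here is a minimal simulation sketch (ours, not the paper's; Python with numpy/scipy assumed) of the composed error, confirming that the half-normal component pulls ε to the left; the scales sigma_v and sigma_u are hypothetical.

import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
n, sigma_v, sigma_u = 100_000, 0.7, 1.1      # hypothetical scales

v = rng.normal(0.0, sigma_v, n)              # noise: v ~ N(0, sigma_v^2)
u = np.abs(rng.normal(0.0, sigma_u, n))      # inefficiency: u ~ N+(0, sigma_u^2)
eps = v - u                                  # composed error

print(u.mean(), sigma_u * np.sqrt(2 / np.pi))  # E[u] = sigma_u * sqrt(2/pi)
print(skew(eps))                               # negative whenever sigma_u > 0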
ML Formulation of Frontier Model

    yᵢ = xᵢ′β + zᵢ′δ + vᵢ − uᵢ,    i = 1, …, 2n,    (2)

where xᵢ are the inputs, zᵢ are the confounders, and xᵢ′β + zᵢ′δ is the stochastic frontier.

▶ p (number of inputs) is small (and fixed).
▶ d (number of confounders) is possibly large (> 2n).
▶ β can be estimated at the O(n^{-1/2}) rate if δ can be.
▶ It is impossible to estimate δ at this rate when d is large.
Double Machine Learning

▶ Consider estimation of a treatment effect (not a frontier model):

    yᵢ = xᵢβ + zᵢ′δ + vᵢ,    i = 1, …, 2n,

  where xᵢ is a scalar treatment and zᵢ are confounders.

  1. Use any ML tool to predict E[y|z] and E[x|z], using half of the sample for each (hence the 2n).
  2. Obtain β̂ from the regression of ỹ on x̃, where w̃ = w − Ê[w|z]. (A sketch of both steps follows.)
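A minimal sketch of these two steps, assuming a cross-validated LASSO (sklearn's LassoCV) as the ML predictor and a 50/50 split for cross-fitting; the function name dml_beta is ours.

import numpy as np
from sklearn.linear_model import LassoCV

def dml_beta(y, x, Z, seed=0):
    # Split the 2n observations in half; predict E[y|z] and E[x|z] on the
    # half not used for fitting (cross-fitting), then regress y-tilde on x-tilde.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    a, b = idx[: len(y) // 2], idx[len(y) // 2 :]
    y_t, x_t = np.empty_like(y), np.empty_like(x)
    for fit, pred in ((a, b), (b, a)):
        y_t[pred] = y[pred] - LassoCV(cv=5).fit(Z[fit], y[fit]).predict(Z[pred])
        x_t[pred] = x[pred] - LassoCV(cv=5).fit(Z[fit], x[fit]).predict(Z[pred])
    return (x_t @ y_t) / (x_t @ x_t)  # OLS slope of y-tilde on x-tilde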
Double Machine Learning

▶ β̂ is √n-consistent and asymptotically Normal even if the RMSE of the estimate of z′δ has rate O(n^{-1/4}) (so it can be estimated nonparametrically).
▶ The moment conditions from which β̂ is constructed satisfy Neyman orthogonality.
Neyman Orthogonality and M/P Redundancy

▶ In the context of asymptotically optimal testing, Neyman (1959) asked when errors in nuisance functions do not carry over into β̂.
▶ Let δ denote the functional nuisance parameter and let h₁*(β, δ) be the moment function implied by the FOC for β̂:

    E[h₁*(β, δ)] = 0.
Neyman Orthogonality and M/P Redundancy

▶ We say h₁*(·, ·) is Neyman orthogonal if the moment function remains valid under perturbations in δ:

    D₁₂[δ − δ₀] = ∂_δ E[h₁*(β, δ)][δ − δ₀] = 0.    (3)

▶ D₁₂[δ − δ₀] is the Gateaux derivative of the moment function in the direction δ around the true value δ₀.
▶ Neyman orthogonality is connected to, and best understood in, a GMM framework.
Neyman Orthogonality and M/P Redundancy

▶ GMM estimation of (β, δ) is based on moment conditions assumed to hold in the population:

    [A] for β:  E[h₁(β, δ)] = 0    (4)
    [B] for δ:  E[h₂(β, δ)] = 0    (5)

▶ We assume that [A] is enough to identify β given δ.
▶ Using more knowledge, in the form of [B] and/or the true δ₀, improves statistical efficiency asymptotically.
▶ Prokhorov and Schmidt (2009) asked when it is irrelevant for the estimation of β whether we know [B] and/or δ₀.
Neyman Orthogonality and M/P Redundancy

▶ Assume a finite-dimensional δ:

    [A] for β:  E[h₁(β, δ)] = 0    (6)
    [B] for δ:  E[h₂(β, δ)] = 0    (7)

▶ When is it irrelevant for the estimation of β whether we know [B] and/or δ₀?
▶ Answer: when the asymptotic variance of GMM based on [A] with known δ equals the asymptotic variance of GMM based on [A] and [B] with unknown δ.
Neyman Orthogonality and M/P Redundancy

Define the partitioned second-moment and derivative matrices

    C = E[ h₁h₁′  h₁h₂′ ;  h₂h₁′  h₂h₂′ ] = [ C₁₁  C₁₂ ;  C₂₁  C₂₂ ]

and

    D = E[ ∇_β h₁  ∇_δ h₁ ;  ∇_β h₂  ∇_δ h₂ ] = [ D₁₁  D₁₂ ;  D₂₁  D₂₂ ].

Then

    M/P-Redundancy  ⇔  C₁₂ = 0 (moment redundancy of [B]) and D₁₂ = 0 (parameter redundancy of δ).
Neyman Orthogonality and M/P Redundancy

▶ So start by specifying

    [A] for β:  E[h₁(β, δ)] = 0    (8)
    [B] for δ:  E[h₂(β, δ)] = 0    (9)

  and look for a valid moment function h₁*(β, δ) that is uncorrelated with h₂(·, ·) and such that

    D₁₂ = E[∇_δ h₁*(β, δ)] = 0.

▶ Then we can use any slowly converging ML tool (LASSO, GRF, etc.) to obtain δ̂, plug it into h₁*(β, δ), and obtain a √n-consistent and asymptotically Normal β̂.
Return to ML Formulation of Frontier Model

    yᵢ = xᵢ′β + zᵢ′δ + vᵢ − uᵢ,    i = 1, …, 2n,    (10)

where xᵢ are the inputs, zᵢ are the confounders, and xᵢ′β + zᵢ′δ is the stochastic frontier.

▶ All ML tools give biased estimators.
▶ Inputs correlate with confounders: xᵢ = m(zᵢ) + ηᵢ.
▶ Biases in δ̂ and m̂(zᵢ) affect β̂ and ûᵢ.
▶ So what changes with the introduction of u ≥ 0 into the model, and what are the M/P-redundant moments?
Return to ML Formulation of Frontier Model

    yᵢ = xᵢ′β + zᵢ′δ + vᵢ − uᵢ,    i = 1, …, 2n.

▶ Conventional estimation (COLS), assuming u ∼ |N(0, σᵤ²)| and v symmetric with E[v] = 0:

    (β̂, δ̂) = argmin_{β,δ} Σᵢ₌₁²ⁿ (yᵢ − xᵢ′β − zᵢ′δ)²,

  accounting for E[uᵢ] = σᵤ √(2/π) > 0.
▶ Evidence of inefficiency is captured through negative skewness of the residuals ε̂ᵢ = yᵢ − xᵢ′β̂ − zᵢ′δ̂.
Return to ML Formulation of Frontier Model

    yᵢ = xᵢ′β + zᵢ′δ + vᵢ − uᵢ,    i = 1, …, 2n.

▶ Conventional estimation (maximum likelihood), assuming u ∼ |N(0, σᵤ²)|, v ∼ N(0, σᵥ²) and u ⊥ v:

    θ̂ = argmax_θ ln L(θ),    θ = (β, δ, σᵥ², σᵤ²).

▶ Evidence of inefficiency: σ̂ᵤ² ≫ 0.
Does Inefficiency Exist?

    yᵢ = 1 + 0.3x₁ᵢ + 0.4x₂ᵢ + 0.38x₃ᵢ + Σⱼ₌₁ᵈ δⱼzᵢⱼ + vᵢ − uᵢ

True δⱼ = 0, zᵢⱼ ∼ N(0, 1), d = cn, vᵢ ∼ N(0, 0.5), uᵢ ∼ |N(0, 1.2)|.
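A rough sketch of this Monte Carlo (ours; fewer replications than the paper's 1,000, and we read 0.5 and 1.2 as variances), illustrating how OLS residual skewness attenuates as d = cn grows:

import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(1)

def avg_ols_skewness(n, c, reps=200):
    d, out = int(c * n), []
    for _ in range(reps):
        x = rng.normal(size=(n, 3))
        z = rng.normal(size=(n, d))                      # true delta_j = 0
        v = rng.normal(0.0, np.sqrt(0.5), n)
        u = np.abs(rng.normal(0.0, np.sqrt(1.2), n))
        y = 1 + x @ np.array([0.3, 0.4, 0.38]) + v - u
        W = np.column_stack([np.ones(n), x, z])          # OLS keeps all z's
        resid = y - W @ np.linalg.lstsq(W, y, rcond=None)[0]
        out.append(skew(resid))
    return np.mean(out)

for c in (0.0, 0.1, 0.5, 0.9):
    print(c, round(avg_ols_skewness(400, c), 3))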
Does Inefficiency Exist?

Average skewness of OLS residuals over 1,000 simulations (columns: c, where d = cn)

    n       c = 0     0.01      0.1       0.2       0.3       0.5       0.9
    100     −0.494    −0.488    −0.420    −0.342    −0.267    −0.143    −0.001
    200     −0.525    −0.517    −0.445    −0.375    −0.299    −0.177    −0.011
    400     −0.536    −0.530    −0.454    −0.380    −0.308    −0.186    −0.012
    800     −0.547    −0.539    −0.466    −0.391    −0.319    −0.193    −0.016
    1,600   −0.549    −0.542    −0.468    −0.391    −0.319    −0.189    −0.016
Resort to ML? - Post-Single-LASSO

▶ LASSO ⇒ some elements of δ̂_LASSO are exactly 0; drop these confounders:

    (β̂_LASSO, δ̂_LASSO) = argmin_{β,δ} Σᵢ₌₁²ⁿ (yᵢ − xᵢ′β − zᵢ′δ)² + λ Σⱼ₌₁ᵈ |δⱼ|

▶ COLS using only the confounders picked by LASSO ⇒ PSL-COLS (a sketch follows):

    (β̂_PSL, δ̂_PSL) = argmin_{β,δ} Σᵢ₌₁²ⁿ (yᵢ − xᵢ′β − zᵢ′δ)²,
    s.t. δⱼ = 0 for any j ∉ supp δ̂_LASSO
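A minimal PSL-COLS sketch (ours), assuming sklearn's Lasso for the selection step; note that sklearn penalizes the x-coefficients too, whereas the display above penalizes only δ, and the tuning parameter lam is a placeholder rather than the paper's choice.

import numpy as np
from sklearn.linear_model import Lasso

def psl_cols(y, X, Z, lam=0.1):
    # Step 1: LASSO on (x, z); keep confounders with nonzero coefficients.
    sel = Lasso(alpha=lam, fit_intercept=True).fit(np.column_stack([X, Z]), y)
    keep = np.flatnonzero(sel.coef_[X.shape[1]:])        # supp(delta_hat_LASSO)
    # Step 2: refit by least squares on the selected columns only.
    W = np.column_stack([np.ones(len(y)), X, Z[:, keep]])
    coef = np.linalg.lstsq(W, y, rcond=None)[0]
    beta_hat = coef[1 : 1 + X.shape[1]]
    resid = y - W @ coef                                 # check skewness of these
    return beta_hat, resid, keep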
Resort to ML? - Post-Single-LASSO

▶ LASSO ⇒ some elements of δ̂_LASSO are exactly 0; drop these confounders:

    (β̂_LASSO, δ̂_LASSO) = argmin_{β,δ} Σᵢ₌₁²ⁿ (yᵢ − xᵢ′β − zᵢ′δ)² + λ Σⱼ₌₁ᵈ |δⱼ|

▶ Or MLE using only the confounders picked by LASSO ⇒ PSL-MLE:

    θ̂_PSL = argmax_θ ln L(θ),  s.t. δⱼ = 0 for any j ∉ supp δ̂_LASSO.
Inefficiency Exists!

Average skewness of PSL-OLS residuals over 1,000 simulations (columns: c, where d = cn)

    n       c = 0     0.01      0.1       0.2       0.3       0.5       0.9
    100     −0.503    −0.404    −0.386    −0.374    −0.367    −0.359    −0.350
    200     −0.520    −0.452    −0.436    −0.430    −0.425    −0.420    −0.413
    400     −0.536    −0.479    −0.470    −0.465    −0.463    −0.459    −0.455
    800     −0.546    −0.506    −0.500    −0.498    −0.497    −0.494    −0.492
    1,600   −0.552    −0.522    −0.519    −0.517    −0.516    −0.516    −0.514
Another Problem: Inference for PSL-MLE

    yᵢ = βxᵢ + 0.8 Σⱼ₌₁²⁰⁰ δⱼzᵢⱼ + vᵢ − uᵢ

True δⱼ = (1/j)², zᵢⱼ ∼ N(0, 1), 2n = 100, λ by CV, vᵢ ∼ N(0, 0.5), uᵢ ∼ |N(0, 1.2)|,

    xᵢ = 0.6 Σⱼ₌₁²⁰⁰ δⱼzᵢⱼ + ηᵢ,    ηᵢ ∼ N(0, 1)

[Figure: sampling distribution of standardized β̂_PSL over 1,000 simulations]
Why Does PSL Fail?

Look at MLE when yᵢ = xᵢ′β + zᵢ′δ + vᵢ − uᵢ for vᵢ ∼ N(0, σᵥ²) ⊥ uᵢ ∼ |N(0, σᵤ²)|:

    f_ε(εᵢ) = (2/σ) φ(εᵢ/σ) Φ(−λεᵢ/σ),

where σ² = σᵥ² + σᵤ² and λ = σᵤ/σᵥ;

    θ̂_MLE = argmax_θ Σᵢ₌₁²ⁿ ln f_ε(εᵢ),    where εᵢ = yᵢ − xᵢ′β − zᵢ′δ.

Recall: PSL zeros out some δⱼ's ⇒ let δ_LASSO contain 0's for those j's; then

    ξᵢ = yᵢ − xᵢ′β − zᵢ′δ_LASSO = εᵢ + zᵢ′(δ − δ_LASSO) ≠ εᵢ.
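A minimal sketch (ours) of this normal/half-normal log-likelihood and its maximization with scipy, parameterizing the standard deviations on the log scale so the optimization is unconstrained:

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def neg_loglik(theta, y, W):
    k = W.shape[1]
    beta = theta[:k]
    s_v, s_u = np.exp(theta[k]), np.exp(theta[k + 1])
    sigma, lam = np.sqrt(s_v**2 + s_u**2), s_u / s_v
    eps = y - W @ beta
    # ln f(eps) = ln(2/sigma) + ln phi(eps/sigma) + ln Phi(-lam*eps/sigma)
    ll = np.log(2 / sigma) + norm.logpdf(eps / sigma) + norm.logcdf(-lam * eps / sigma)
    return -ll.sum()

def sfa_mle(y, W):
    # OLS start for beta; zeros for the two log standard deviations.
    start = np.r_[np.linalg.lstsq(W, y, rcond=None)[0], 0.0, 0.0]
    return minimize(neg_loglik, start, args=(y, W), method="BFGS").x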
Why Does LASSO Break?

▶ For simplicity assume that (σ, λ) = (1, 1) and d = dim(δ) < 2n, and define the inverse Mills ratio rᵢ(νᵢ) = φ(νᵢ)/(1 − Φ(νᵢ)).
▶ Moment equations implied by the FOCs of MLE:

    [A] for β:  E[xᵢ′(εᵢ + rᵢ(εᵢ))] = 0
    [B] for δ:  E[zᵢ′(εᵢ + rᵢ(εᵢ))] = 0

▶ Moment equations implied by the FOCs of PSL-MLE:

    [A] for β:  E[xᵢ′(ξᵢ + rᵢ(ξᵢ))] = 0
    [B] for δ_LASSO:  E[zᵢ′(ξᵢ + rᵢ(ξᵢ))] = 0

PSL-MLE uses invalid moment conditions: since ξᵢ ≠ εᵢ, the LASSO regularization bias carries over to the estimation of β.
How to Conduct Valid Inference in SFA?

▶ Let ε̃ᵢ := yᵢ − π_y′zᵢ − (xᵢ − π_x′zᵢ)′β − zᵢ′δ.
▶ Consider the moment conditions

    [A*]  E[(xᵢ − π_x′zᵢ)′(ε̃ᵢ + rᵢ(ε̃ᵢ))] = 0
    [B*]  E[zᵢ′(ε̃ᵢ + rᵢ(ε̃ᵢ))] = 0
    [C]   E[zᵢ′(xᵢ − π_x′zᵢ)] = 0
    [D]   E[zᵢ′(yᵢ − π_y′zᵢ)] = 0

▶ Under homoskedasticity, [A*] satisfies Neyman orthogonality.
▶ Equivalently, [B*], [C] and [D] are M/P redundant for the estimation of β.
Sketch of the Argument

▶ Look at

    [A*]  E[(xᵢ − π_x′zᵢ)′(ε̃ᵢ + rᵢ(ε̃ᵢ))] = 0
    [B*]  E[zᵢ′(ε̃ᵢ + rᵢ(ε̃ᵢ))] = 0

  [A*] ⊥ [B*] ⇒ C₁₂ = 0.
▶ Expected derivative with respect to δ:

    ∂_δ E[(xᵢ − π_x′zᵢ)′(ε̃ᵢ + rᵢ(ε̃ᵢ))] = E[(xᵢ − π_x′zᵢ)′(−zᵢ + zᵢ rᵢ(ε̃ᵢ)(ε̃ᵢ + rᵢ(ε̃ᵢ)))] = 0,

  which vanishes because the partialled-out input xᵢ − π_x′zᵢ is orthogonal to zᵢ; hence D₁₂ = 0.
Note

▶ An identical result holds for both π_x and π_y.
▶ The idea is similar to partialling out in the Frisch-Waugh-Lovell theorem.
▶ [A*] and [B*] correspond to running MLE where the dependent variable is the part of yᵢ that is orthogonal to zᵢ, and the explanatory variables are zᵢ and the part of xᵢ that is orthogonal to zᵢ.
Post-Double-LASSO

▶ LASSO of yᵢ on zᵢ:

    π̂⁰_LASSO = argmin_{π⁰} Σᵢ₌₁ⁿ (yᵢ − zᵢ′π⁰)² + λ⁰ Σⱼ₌₁ᵈ |πⱼ⁰|

▶ LASSO of xᵢ (one input at a time) on zᵢ:

    π̂ˡ_LASSO = argmin_{πˡ} Σᵢ₌₁ⁿ (xᵢˡ − zᵢ′πˡ)² + λˡ Σⱼ₌₁ᵈ |πⱼˡ|,    ℓ = 1, …, p
Post-Double-LASSO

▶ MLE using the union of the confounders picked by LASSO in the first two steps ⇒ PDL-MLE:

    θ̂_PDL = argmax_θ Σᵢ₌₁²ⁿ ln f_ε(εᵢ),
    s.t. δⱼ = 0 for any j ∉ I = ∪ₗ₌₀ᵖ supp π̂ˡ_LASSO.

▶ I is called the amelioration set (Belloni, Chernozhukov and Hansen, 2013). (A sketch follows.)
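A minimal PDL-MLE sketch (ours), reusing sfa_mle from the earlier MLE sketch; LassoCV stands in for whatever tuning rule is used for λ⁰ and λˡ, and the function names are illustrative.

import numpy as np
from sklearn.linear_model import LassoCV

def amelioration_set(y, X, Z):
    # Union of LASSO supports: y on z, then each input x_l on z.
    I = set(np.flatnonzero(LassoCV(cv=5).fit(Z, y).coef_))
    for l in range(X.shape[1]):
        I |= set(np.flatnonzero(LassoCV(cv=5).fit(Z, X[:, l]).coef_))
    return sorted(I)

def pdl_mle(y, X, Z):
    I = amelioration_set(y, X, Z)
    W = np.column_stack([np.ones(len(y)), X, Z[:, I]])   # selected z's only
    return sfa_mle(y, W), I   # sfa_mle: normal/half-normal MLE sketch above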
[Figure: sampling distribution of standardized β̂_PDL over 1,000 simulations]
Empirical Example

▶ 137 dairy farms in Spain from 1999-2010 (Alvarez & Arias, 2004).
▶ y is milk production (liters).
▶ x is labor (man-equivalent units), cows, feed (kg), land (hectares) and roughage (expenses incurred to produce roughage: fertilizer, machines, seed, silage additives, etc.).
Empirical Example

▶ z includes year dummies, zone dummies, land ownership, bacteriological content of the milk, price of milk, price of feed, membership in an agricultural cooperative, milk quality indicators (fat, protein, somatic cell count), and something called AVGCOST (neither Antonio nor Carlos could remember what this variable captured).
▶ dim(z) = 50 with first-order terms [Cobb-Douglas]; dim(z) = 87 with second-order terms [translog].
Empirical Example: Cobb-Douglas
(standard errors beneath estimates)

                OLS      SFA      OLS Large  SFA Large  SFA-PSL  SFA-PDL
    Feedstuffs  0.386    0.360    0.464      0.464      0.439    0.401
                0.012    0.013    0.011      0.011      0.011    0.013
    Cows        0.595    0.642    0.467      0.467      0.546    0.560
                0.020    0.022    0.017      0.017      0.017    0.021
    Land       −0.010   −0.012    0.032      0.032      0.007    0.033
                0.009    0.009    0.009      0.009      0.008    0.010
    Labor       0.035    0.032    0.013      0.013     −0.015    0.005
                0.012    0.012    0.010      0.009      0.009    0.011
    Roughage    0.067    0.060    0.073      0.073      0.082    0.061
                0.005    0.005    0.004      0.004      0.004    0.005
    RTS         1.074    1.082    1.048      1.048      1.059    1.059
    Eff         0.930    0.892    1.000      0.999      0.999    0.926
Empirical Example: Translog
(standard errors beneath estimates)

                OLS      SFA      OLS Large  SFA Large  SFA-PSL  SFA-PDL
    Feedstuffs  0.341    0.319    0.457      0.457      0.409    0.342
                0.014    0.014    0.013      0.013      0.013    0.014
    Cows        0.633    0.676    0.454      0.454      0.574    0.618
                0.024    0.024    0.021      0.020      0.020    0.023
    Land       −0.011   −0.017   −0.017     −0.017     −0.014   −0.013
                0.010    0.010    0.009      0.009      0.009    0.010
    Labor       0.021    0.014   −0.007     −0.007     −0.033    0.000
                0.014    0.013    0.010      0.010      0.010    0.012
    Roughage    0.093    0.088    0.122      0.122      0.126    0.079
                0.008    0.008    0.007      0.007      0.007    0.007
    RTS         1.076    1.080    1.008      1.008      1.062    1.025
    Eff         0.932    0.887    1.000      0.999      0.927    0.915
Concluding remarks

▶ Neyman orthogonality is key to ensuring valid causal inference; it is equivalent to M/P redundancy.
▶ An abundance of data makes it harder to establish and address inefficiency of production.
▶ Machine learning tools are effective at reversing the spurious finding of full efficiency.
▶ Partialling out offers a way of conducting valid post-machine-learning causal inference.
▶ We derive and apply Neyman-orthogonal moment conditions for production frontier models.