Causal Inference in Python
Causal Inference in Python
As is standard in the literature, we work within the framework of Rubin’s potential outcome
model (Rubin, 1974). Let Y (0) denote the potential outcome of a subject in the absence of
treatment, and let Y (1) denote the unit’s potential outcome when it is treated. Let D denote
treatment status, with D = 1 indicating treatment and D = 0 indicating control, and let X be a K-
column vector of covariates or individual characteristics. For unit i, i = 1, 2, . . . , N, the observed
outcome can be written as Y i = (1 − Di )Yi (0) + Di Yi (1) . The set of observables (Y , D , X
i i i
In the following, we illustrate the typical flow of a causal analysis using the tools of
Causalinference and a simulated data set. In simulating the data, we specified a constant
treatment effect of 10 for simplicity, and incorporated systematic overlap issues and
nonlinearities to highlight a number of tools in the package. We focus mostly on illustrating the
use of Causalinference; for details on methodology please refer to Imbens and Rubin (2015).
2. Causalinference
Causalinference is a Python package that provides various statistical methods for causal
analysis. It is a simple package that was used for basic causal analysis learning. The main
features of these packages include:
Let’s try out the Causalinference package. For starters, we need to install the package.
After the installation finishes, we will try to implement a causal model for causal analysis. We
would use the random data that came from the causalinference package.
The CausalModel class would analyze the data. We would need to do a few more steps to
acquire important information from the model.
In [67]: print(causal.summary_stats)
Summary Statistics
In [68]: causal.summary_stats.keys()
In [69]: causal.summary_stats['X_t_mean']
In [70]: causal.summary_stats['ndiff']
array([0.70765718, 0.66358536, 0.7261009 ])
Out[70]:
In [71]: causal.summary_stats['Y_t_mean']
4.986076842982941
Out[71]:
Here rdiff refers to the difference in average observed outcomes between treatment and
control groups. ndiff , on the other hand, refers to the normalized differences in average
covariates, defined as
¯
¯¯ ¯
¯¯
x k,t −x k,t
2 2
s +s
k,t k,c
√
2
where ¯x
¯¯
k,t
and sk,t are the sample mean and sample standard deviation of the kth covariate of
the treatment group, and ¯x
¯¯
k,c and sk,c are the analogous statistics for the control group.
The normalized differences in average covariates provide a way to measure the covariate
balance between the treatment and the control groups. Unlike the t-statistic, its absolute
magnitude does not increase (in expectation) as the sample size increases.
By using the summary_stats attribute, we would acquire all the basic information of the
dataset.
The main part of causal analysis is acquiring the treatment effect information. The simplest one
to do is by using the Ordinary Least Square method.
¯
¯¯¯¯ ¯
¯¯¯¯
Yi = α + β(Xi − X ) + δYi (Xi − X ) + ϵi
To inspect any treatment effect estimates produced, we can simply invoke print on the attribute
estimates, as in below:
In [72]: causal.est_via_ols()
print(causal.estimates)
Treatment Effect Estimates: OLS
C:\Users\moham\AppData\Roaming\Python\Python310\site-packages\causalinference\estimat
ors\ols.py:21: FutureWarning: `rcond` parameter will change to the default of machine
precision times ``max(M, N)`` where M and N are the input matrix dimensions.
To use the future default and silence this warning we advise to pass `rcond=None`, to
keep using the old, explicitly pass `rcond=-1`.
olscoef = np.linalg.lstsq(Z, Y)[0]
ATE, ATC, and ATT stand for Average Treatment Effect, Average Treatment Effect for Control
and Average Treatment Effect for Treated, respectively. Using this information, we could assess
whether the treatment has an effect compared to the control.
Including interaction terms between the treatment indicator D and covariates X implies that
treatment effects can differ across individuals. In some instances we may want to assume a
constant treatment effect, and only run
¯
¯¯¯¯ ¯
¯¯¯¯
Yi = α + β(Xi − X ) + δYi (Xi − X ) + ϵi
This can be achieved by supplying a value of 1 in est via ols to the optional parameter adj (its
default value is 2). To compute the raw difference in average outcomes between treatment and
control groups, we can set adj=0 . In this example, the least squares estimates are radically
different from the true treatment effect of 10. This is the result of the nonlinearity and non-
overlap issues intentionally introduced into the data simulation process. As we shall see, several
other tools exist in Causalinference that can better deal with a lack of overlap and that will
allow us to obtain estimates that are less sensitive to functional form assumptions.
known as the propensity score, plays a central role in much of what follows. Two methods, est
propensity and est propensity s, are provided for propensity score estimation. Both involve
running a logistic regression of the treatment indicator D on functions of the covariates.
est_propensity allows the user to specify the covariates to include linearly and/or
quadratically, while est_propensity_s will make this choice automatically based on a
sequence of likelihood ratio tests. In the following, we run est_propensity_s and display the
estimation results. In this example, the specification selection algorithm decided to include both
covariates and all the interaction and quadratic terms.
Using the propensity score method, we could also get information regarding the probability of
treatment conditional on the independent variables.
In [74]: causal.est_propensity_s()
print(causal.propensity)
Using the propensity score method, we could assess the probability of the treatment given the
independent variables.
There are still many methods you could explore and learn from. I suggest you visit the
causalinference web page and learn further.
The propensity attribute is again another dictionary-like container of results. The dictionary keys
of propensity can be found by running:
In [75]: causal.propensity.keys()
In [76]: causal.propensity['lin']
[2, 0, 1]
Out[76]:
In [77]: causal.propensity['qua']
[(1, 1)]
Out[77]:
In [78]: causal.propensity['coef']
In [79]: causal.cutoff
0.1
Out[79]:
Calling causal.trim() at this point will drop every unit that has propensity score outside of the [α,
1 − α] interval. Alternatively, a procedure exists that will estimate the optimal cutoff that
minimizes the asymptotic sampling variance of the trimmed sample. The method trim s will
perform this calculation, set the cutoff to the optimal α, and then invoke trim to construct the
subsample. For our example, the optimal α was estimated to be slightly less than 0.1:
In [80]: causal.trim_s()
In [81]: causal.cutoff
0.10095500234207272
Out[81]:
The complexity of this cutoff selection algorithm is only O(N log N), so in practice there is very
little reason to not employ it.
In [82]: causal.stratify_s()
In [83]: print(causal.strata)
Stratification Summary
Under the hood, the attribute strata is actually a list-like object that contains, as each of its
elements, a full instance of the class CausalModel, with the input data being those that
correspond to the units that are in the propensity bin. We can thus, for example, access each
stratum and inspect its summary_stats attribute, or as the following illustrates, loop through
strata and estimate within-bin treatment effects using least squares.
[1.1010525059277556,
Out[100]:
1.3936541440372463,
2.004643746290025,
2.4002774533086817,
2.662716013620451,
3.0475818205042002,
3.122994358139151,
3.424260986897368,
3.7506272041228654,
3.9854869677920286,
4.443325713714909,
4.67815605365512]
In [102… stratum.estimates['ols']['att']
4.67815605365512
Out[102]:
Note that these estimates are much more stable and closer to the true value of 10 than the
within-bin raw differences in average outcomes that were reported in the stratification summary
table, highlighting the virtue of further controlling for covariates even within blocks. Taking the
sample-weighted average of the above within-bin least squares estimates results in a
propensity score matching estimator that is commonly known as the subclassification estimator
or blocking estimator. However, instead of manually looping through the strata attribute,
estimating within-bin treatment effects, and then averaging appropriately to arrive at an overall
estimate, we can also simply call est via blocking, which will perform these operations and
collect the results in the attribute estimates. We will report these estimates in the next section
along with estimates obtained from other, alternative estimators.
and ∥X i − Xj ∥ is some measure of distance between the covariate vectors X and X . The
j i
method est_via_matching implements this estimator, as well as several extensions that can be
invoked through optional arguments.
′
Yi = α + βDi + γ Xi + ϵi
where the weight for unit i is 1/p^(X) if i is in the treatment group, and 1/(1 − p^(X)) if i is in
the control group. This estimator is also sometimes called the doubly-robust estimator,
referring to the fact that this estimator is consistent if either the specification of the propensity
score is correct, or the specification of the regression function is correct. We can invoke it by
calling est via weighting. Note that under this specification the treatment effect does not differ
across units, so the ATC and the ATT are both equal to the overall ATE
In the following we invoke each of the four estimators (including least squares, since the input
data has changed now that the sample has been trimmed), and print out the resulting
estimates.
In [103… causal.est_via_ols()
In [104… causal.est_via_weighting()
C:\Users\moham\AppData\Roaming\Python\Python310\site-packages\causalinference\estimat
ors\weighting.py:23: FutureWarning: `rcond` parameter will change to the default of m
achine precision times ``max(M, N)`` where M and N are the input matrix dimensions.
To use the future default and silence this warning we advise to pass `rcond=None`, to
keep using the old, explicitly pass `rcond=-1`.
wlscoef = np.linalg.lstsq(Z_w, Y_w)[0]
In [105… causal.est_via_blocking()
In [106… causal.est_via_matching(bias_adj=True)
C:\Users\moham\AppData\Roaming\Python\Python310\site-packages\causalinference\estimat
ors\matching.py:100: FutureWarning: `rcond` parameter will change to the default of m
achine precision times ``max(M, N)`` where M and N are the input matrix dimensions.
To use the future default and silence this warning we advise to pass `rcond=None`, to
keep using the old, explicitly pass `rcond=-1`.
return np.linalg.lstsq(X, Y)[0][1:] # don't need intercept coef
In [107… print(causal.estimates)
As we can see above, despite the trimming the least squares estimates are still severely biased,
as is the weighting estimator (since neither the propensity score or the regression function is
correctly specified). The blocking and matching estimators, on the other hand, are less sensitive
to specification assumptions, and thus result in estimates that are closer to the true average
treatment effects.
References
Abadie, A., & Imbens, G. (2006). Large sample properties of matching estimators for
average treatment effects. Econometrica, 74 , 235-267.
Crump, R., Hotz, V. J., Imbens, G., & Mitnik, O. (2009). Dealing with limited overlap in
estimation of average treatment effects. Biometrika, 96 , 187-199.
Imbens, G. W., & Rubin, D. B. (2015). Causal inference in statistics, social, and biomedical
sciences: An introduction. Cambridge University Press.
Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in
observational studies for causal effects. Biometrika, 70 , 41-55.
Rubin, D. B. (1974). Estimating Causal Effects of Treatments in Randomized and
Nonrandomized Studies. Journal of Educational Psychology, 66 , 688-701
In [ ]: