
OpportunityFinder: A Framework for Automated Causal Inference

Huy Nguyen*, Prince Grover*, Devashish Khatwani
Amazon.com, USA
* Both authors contributed equally to this research.

ABSTRACT
We introduce OpportunityFinder, a code-less framework for performing a variety of causal inference studies with panel data for non-expert users. In its current state, OpportunityFinder only requires users to provide raw observational data and a configuration file. A pipeline is then triggered that inspects and processes the data and chooses the suitable algorithm(s) to execute the causal study. It returns the causal impact of the treatment on the configured outcome, together with sensitivity and robustness results. Causal inference is widely studied and used to estimate the downstream impact of individuals' interactions with products and features. It is common for these causal studies to be performed periodically by scientists and/or economists, and business stakeholders are often bottle-necked on scientist or economist bandwidth to conduct causal studies. We offer OpportunityFinder as a solution for commonly performed causal studies with four key features: (1) ease of use for both business analysts and scientists, (2) abstraction of multiple algorithms under a single I/O interface, (3) support for causal impact analysis under binary treatment with panel data, and (4) dynamic selection of the algorithm based on the scale of the data.

CCS CONCEPTS
• Computing methodologies → Artificial intelligence; Knowledge representation and reasoning; Causal reasoning and diagnostics.

KEYWORDS
causal inference, double machine learning, neural networks, panel data

1 INTRODUCTION
Automated machine learning (AutoML) frameworks for predictive machine learning (ML) have advanced significantly over the past decade with the introductions of AutoGluon [14], Auto-sklearn [15], and H2O [19]. AutoML's biggest advantage is abstracting away the implementation of the underlying algorithms and hyper-parameter tuning, making it easy for scientists and engineers to experiment with a large number of models and identify the one that works best. The demand for AutoML has risen from the fact that no single ML algorithm works best in all scenarios. This is even more challenging in the causal inference literature: different methods rely on different sets of assumptions [2] for the identification of causal treatment effects¹, such as CIA (conditional independence assumption, or unconfoundedness), propensity overlap, SUTVA (stable unit treatment value assignment), and exchangeability (the same outcome distribution would be observed if exposed and unexposed individuals were exchanged).

The causal inference framework DoWhy [22] supports explicit modeling and testing of causal assumptions, but it is still a low-level API. AutoCausality [16], which is built on top of EconML [8] and DoWhy, supports automated hyperparameter tuning, but it only focuses on the estimation part and assumes that the causal graph provided by the user accurately explains the data-generating process. Neither AutoCausality nor DoWhy supports panel data², which is mainstream in real-world problems. Most real-world causal studies have panel data of different aggregated granularities (e.g., yearly to daily levels) and at different scales (e.g., a few individuals to large populations of millions of entities). To the best of our knowledge, there is no AutoML-like causal inference framework that supports panel data and abstracts away the know-how of causal studies from the users.

In this study, we introduce OpportunityFinder (OPF), our first step in democratizing causal inference techniques. As our first contribution, OPF implements an automated causal inference framework that supports panel and cross-sectional data and offers a wide range of causal inference algorithms. The decision of which algorithm to use is automated and abstracted away from the user. Our second contribution is the automated transformation of input panel data into a list of cohort datasets when needed; cohort-based results are then aggregated into a final result. As the third contribution, OPF provides data visualization to illustrate causal impact. Combining numerical and graphical reports helps non-expert users verify the input data and reason about causal inference results.

¹ A causal effect can be defined as the difference between hypothetical outcomes that result from two or more alternative treatments, with only one outcome of a treatment being observed each time.
² Panel data contains observations collected across multiple individuals at a regular frequency, and ordered chronologically.

The current capability of OPF allows non-expert users to carry out the most common causal analysis: estimating the average treatment effect (ATE) with configurable time horizons for binary actions. At the current state, OpportunityFinder is deployed within our organization's AWS account for internal testing. We are also refactoring the OpportunityFinder source as a stand-alone library.

2 LITERATURE REVIEW
Traditional econometric techniques such as propensity score matching, instrumental variable estimation, and difference-in-differences (DiD) offer rigorous methods for estimating average treatment effects under specific assumptions, but often struggle to account for high-dimensional covariates and complex interactions [5]. The Synthetic Control Method (SCM) extends these approaches by constructing a "synthetic" control unit as a weighted combination of potential control units, providing a more flexible comparison for the treated unit [4]. The Generalized Synthetic Control (GSC) further expands SCM by incorporating interactive fixed effects models, thus accommodating multiple treated units and variable treatment periods [26].

Recently, machine learning techniques have been widely integrated into causal inference through notable works by various teams, e.g., DoubleML [7], EconML [8], and CausalML [10]. Double Machine Learning (DML) provides a flexible approach, leveraging machine learning for nuisance parameter estimation while maintaining robustness against mis-specification [12]. Beyond the average treatment effect, machine learning enables approaches to estimate individual treatment effects, e.g., the heterogeneous treatment effect estimators in EconML and uplift modeling in CausalML. Deep learning methods, such as those based on neural networks (NN), have shown promise in estimating individual treatment effects due to their ability to model complex, high-dimensional data, thus uncovering nuanced causal relationships [21].

3 FRAMEWORK DESIGN
The key contributions of our design are (1) integration of several causal models, (2) branching based on the type of observational data (cross-sectional vs. panel) and the number of treated units, and (3) execution in the users' own AWS environment, where they have access to CloudWatch logs for debugging and can visualize the progress. The current OpportunityFinder deployment allows a code-less UI without having to move data outside the AWS account, as demonstrated in Figure 1. While this design is tied to the MLOps set-up of our organization, the OpportunityFinder source code is independent of deployment platforms.

Figure 2 shows the design of OpportunityFinder. Once a user triggers a job, CloudFormation kicks off a set of AWS services including SageMaker, Lambda, and Glue jobs. The Data Validation module checks the treatment and control data for basic requirements. A SageMaker Pipeline then runs the follow-up components. Data Processing transforms panel data into cohorts (where needed), handles missing data, extracts lag/lead features, and performs optional data scaling and normalization. Causal Estimation decides the most suitable causal model given the data and executes it. Result Validation performs validation tests for sanity and sensitivity, and returns the estimated treatment effect in a standardized format to the user's S3 bucket.

The data processing can vary for different underlying models. For example, Generalized Synthetic Control (GSC) [26] works well even if there is only one treated unit, but it requires panel data with at least 7 pre-treatment periods. Double Machine Learning (DML) [12] is a better solution for large-scale data but requires breaking down treatments into cohorts of different weeks, months, or quarters, depending on the number of treated individuals in each cohort.

On completion of causal estimation, a series of sensitivity and placebo tests are applied to assess the robustness of the findings to violations of the underlying assumptions. These validations include (but are not limited to) the direction of the causal relationship, the sensitivity of the causal estimate to small variations in the observational data (e.g., down-sampling, random covariates), and variations in model hyper-parameters (e.g., the number of pre-treatment periods used for finding synthetic controls). The results of these validation tests are written to the S3 bucket for user reference.

3.1 Data Requirements
OpportunityFinder requires the user to provide two datasets and a configuration file (examples shown in Figure 1). The first dataset, i.e., the treatment data, should contain the IDs of the treated units³ and the date when the treatment happened. The second dataset, also known as the baseline observational or control data, contains the observational information about all IDs that received treatment as well as the ones that did not receive treatment during the same period. The control data should contain time-based (e.g., daily, weekly, or monthly) outcome variables (i.e., targets) of interest, such as ad spend or click count, over the historical period. At the same level of time granularity, the user is recommended to add a superset of possible variables (i.e., features) that are related to the outcome and the treatment. Among that superset of variables, the model will search for the ones that can help remove confounding and mediating effects, which is essential for accurate causal estimates.

³ Individuals, e.g., shoppers or advertisers, who activated a feature or received a treatment.
The configuration file has optional and mandatory fields. Optional fields, such as the list of features to scale, the choice of algorithm, and the choice of hyper-parameters, allow user flexibility but are not necessary and can be handled automatically by the framework. The mandatory fields include the columns that specify time, unit ID, and the outcome variable, as well as the pre/post-treatment evaluation window, e.g., 4 weeks or 6 months. Based on the user-provided configuration and data validation, the input panel data might be segmented into cohorts and feature engineering performed before passing the data to the causal analysis algorithms.
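For illustration, the sketch below instantiates these inputs end-to-end, mirroring the API sample in Figure 1. The key names inside config_dict are hypothetical placeholders for the mandatory and optional fields described above, not OPF's documented schema.

import pandas as pd
from opportunity_finder.api import OpportunityFinder

# Treatment data: one row per treated unit (cf. Figure 1, center).
treatment_df = pd.DataFrame({
    "unit_id": ["A", "B", "C"],
    "treatment_date": ["2022-07-23", "2022-07-16", "2022-10-01"],
})

# Observational/control data: one row per unit per period (cf. Figure 1, right).
observations_df = pd.DataFrame({
    "unit_id": ["A", "A", "Z", "Z"],
    "date": ["2022-07-01", "2022-07-08", "2022-07-01", "2022-07-08"],
    "impressions": [3_430_000, 3_620_000, 8_120_000, 8_420_000],
    "clicks": [400_000, 410_000, 912_000, 923_000],
    "sales": [250_000, 260_000, 10_100_000, 10_100_000],
})

# Hypothetical configuration keys, following the mandatory/optional
# split described in Section 3.1.
config_dict = {
    # mandatory: column roles and evaluation windows
    "time_col": "date",
    "unit_id_col": "unit_id",
    "outcome_col": "sales",
    "pre_treatment_window": "26W",   # e.g., 26 weeks of history
    "post_treatment_window": "4W",   # e.g., evaluate 4 weeks after treatment
    # optional: may be omitted and handled automatically
    "features_to_scale": ["impressions", "clicks"],
    "algorithm": None,               # let OPF pick via the two-stage decision path
}

opf = OpportunityFinder(treatment_df, observations_df, config_dict)
opf.estimate_causal_effect()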
3.2 Implementation Details

3.2.1 Two Stage Decision Path. The choice of causal estimation algorithm goes through two stages. The first stage is a set of rules based on factors that include the following; depending on the answers, a causal estimation algorithm is selected as a leaf node of the decision path (a minimal sketch of both stages follows the list).
• Is the total event count less than or more than 500,000?
• Is the data panel or cross-sectional?
• Is the number of treated units per cohort less than or more than 50?
• Is the number of control units less than or more than 5,000?
• Is the number of covariates, as per the causal graph, more or less than 5?
• How many periods (e.g., daily, monthly) of pre- and post-treatment data are available?
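For illustration, a minimal sketch of how such a two-stage selection could be wired together is shown below. The thresholds come from the rules above and from Section 5.1; the function names, the exact rule ordering, and the use of 1.96-standard-error bounds for the vote are our own simplifications, not OPF's actual decision tree.

def first_stage_candidates(n_events, is_panel, n_treated_per_cohort,
                           n_controls, n_pre_periods):
    """Rule-based shortlist of estimators (simplified illustration)."""
    candidates = []
    # GSC: panel data, small scale, enough pre-treatment history
    # (the paper cites >= 7 pre-treatment periods and a 500,000-event cap).
    if is_panel and n_events < 500_000 and n_pre_periods >= 7 \
            and n_treated_per_cohort < 50 and n_controls < 5_000:
        candidates.append("GSC")
    # DML and DNN estimators: better suited to larger, cohort-ized data.
    if n_events >= 500_000 or not is_panel or n_treated_per_cohort >= 50:
        candidates += ["LinearDML", "CausalForestDML",
                       "BCAUSS", "DRAGON", "TARNET", "GANITE"]
    return candidates or ["LinearDML"]

def second_stage_select(results):
    """results: {name: (ate, stderr)}. Pick the lowest-stderr estimate whose
    mean lies within the bounds of at least 2 other estimators (the voting
    mechanism described in the text below)."""
    def votes(name):
        ate = results[name][0]
        return sum(1 for other, (m, s) in results.items()
                   if other != name and m - 1.96 * s <= ate <= m + 1.96 * s)
    eligible = [n for n in results if votes(n) >= 2]
    pool = eligible or list(results)
    return min(pool, key=lambda n: results[n][1])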

# API: End-to-end
from opportunity_finder.api import OpportunityFinder

opf = OpportunityFinder(
    treatment_df,
    observations_df,
    config_dict)
opf.estimate_causal_effect()

Sample treatment data:

Unit ID   Treatment Date
A         2022-07-23
B         2022-07-16
C         2022-10-01
D         2023-01-11
E         2022-03-05
F         2023-03-21
G         2022-05-05
H         2022-10-12
I         2022-12-02
J         2022-02-25

Sample observational data:

Unit ID   Date         Impressions   Clicks   Sales
A         2022-07-01   3.43M         400K     $0.25M
A         2022-07-08   3.62M         410K     $0.26M
A         2022-07-15   3.90M         423K     $0.39M
A         2022-07-22   4.21M         431K     $0.32M
A         2022-07-29   3.52M         399K     $0.40M
Z         2022-07-01   8.12M         912K     $10.1M
Z         2022-07-08   8.42M         923K     $10.1M
Z         2022-07-15   8.55M         922K     $10.3M
Z         2022-07-22   8.21M         942K     $8.1M
Z         2022-07-29   8.12M         890K     $11.2M

Figure 1: Left: Sample of OpportunityFinder UX with Python. Center: Sample treatment data. Right: Sample observational data with a possible set of covariates.

Figure 2: Framework design of OpportunityFinder. Dotted boxes are under planned development.

In scenarios where the above set of rules yields more than one candidate estimator, the decision flow moves to the second stage. It tries out all models shortlisted in the first stage and selects the result that has the least standard error on the estimated output and lies within the lower and upper bounds of at least 2 other estimators (a voting mechanism).

3.2.2 Cohort Data. One of the key functions provided by OpportunityFinder is the transformation of panel data into cohort, i.e., cross-sectional, data, which allows techniques like double machine learning to work. Each cohort corresponds to a set of treated units that received treatment in a closed period. First, the treatment data is processed to extract the list of treatment times and the number of treated units at each time. A cohort is a set of one, two, or more consecutive treatment times, constrained by three parameters: the minimum and maximum number of treatment times, and the minimum number of treated units.

For example, if treatment happens at the day level, the first two parameters specify the lower and upper bounds of the number of days in each cohort. The third parameter says that a cohort must have at least a certain number of treated units. We sort treatment times in ascending order and, for each treatment time, keep merging it with the following times until the three conditions above are satisfied; a list of cohorts is then returned. Results from each cohort are aggregated using a weighted average with respect to the number of treated units in each cohort.
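A minimal sketch of this greedy cohort-building pass follows, assuming each treatment time is a (time, n_treated) pair; the function names and the handling of leftover treatment times are illustrative, not OPF's internal API.

def build_cohorts(treatment_counts, min_times, max_times, min_treated):
    """Greedily merge consecutive treatment times into cohorts.

    treatment_counts: list of (time, n_treated) pairs.
    A cohort should span between min_times and max_times treatment times
    and contain at least min_treated treated units (Section 3.2.2).
    """
    cohorts, current, total = [], [], 0
    for time, n_treated in sorted(treatment_counts):
        current.append(time)
        total += n_treated
        if len(current) >= min_times and total >= min_treated:
            cohorts.append((current, total))
            current, total = [], 0
        elif len(current) == max_times:
            # min_treated not reachable within max_times; emit and restart
            # (real handling of such leftovers is a design decision).
            cohorts.append((current, total))
            current, total = [], 0
    if current:
        cohorts.append((current, total))
    return cohorts

def aggregate_cohort_effects(effects):
    """effects: list of (ate, n_treated) per cohort -> weighted-average ATE."""
    total = sum(n for _, n in effects)
    return sum(ate * n for ate, n in effects) / total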
3.2.3 Causal Inference Models. OpportunityFinder implements a wide set of popular and advanced causal inference algorithms, as follows. Further details on these and other models that we considered are in Appendix A.
• Generalized Synthetic Controls: our implementation is based on the original R code [26].
• Double Machine Learning: using the EconML package [8], we employ two treatment effect estimators, LinearDML and CausalForestDML. Each estimator is stacked with any of four classes of base prediction models that predict treatment and outcome: Random Forest, Linear Regression, XGBoost, and LightGBM.
• Deep Neural Networks: we implement four state-of-the-art DNN algorithms for the estimation of treatment effects: BCAUSS [24], DRAGON [23], TARNET [21], and GANITE [27].
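As a concrete instance of the DML bullet above, a LinearDML estimator from EconML can be stacked with LightGBM base learners roughly as follows. The toy data and wrapper code are ours; only the estimator/base-learner pairing comes from the text above.

import numpy as np
from econml.dml import LinearDML
from lightgbm import LGBMRegressor, LGBMClassifier

# Toy data standing in for one processed cohort: X are covariates,
# T is a binary treatment, Y an outcome with a true effect of 2.0.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
T = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))
Y = 2.0 * T + X[:, 0] + rng.normal(size=1000)

# Base learners predict the outcome (model_y) and the treatment (model_t);
# LinearDML then regresses the residualized outcome on the residualized
# treatment (double/debiased ML, Chernozhukov et al. [12]).
est = LinearDML(model_y=LGBMRegressor(), model_t=LGBMClassifier(),
                discrete_treatment=True)
est.fit(Y, T, X=X)
print(est.ate(X=X))            # should be close to the true effect 2.0
print(est.ate_interval(X=X))   # confidence bounds on the estimate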
3.2.4 Validation Tests. DML models and their treatment effect estimates are validated through refutation tests from the DoWhy package: adding a random common cause, adding unobserved common causes, data-subset validation, and placebo treatment. For a robust causal model and a valid treatment effect, the first three tests should return a treatment effect similar to the original model, while the fourth test must yield an effect close to zero. The GSC model is validated with a suite of sensitivity tests that check for changes in the estimated causal effect under small changes in the data, such as random down-sampling, a different pre-treatment window for learning the synthetic control weights, and a reduced covariate list. The expectation is that the causal effect should not change direction under small changes in the setting. Example test results are shown in Appendix B. In the future, we will equip DNN models with validation tests as well.
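For reference, the four refuters named above map onto DoWhy's refute_estimate API roughly as sketched below. The toy data is ours, and DoWhy's built-in linear-regression estimator is used for brevity; OPF pairs these refuters with its DML estimates.

import numpy as np
import pandas as pd
from dowhy import CausalModel

# Toy stand-in for one processed cohort (placeholder column names).
rng = np.random.default_rng(0)
w = rng.normal(size=2000)
treated = rng.binomial(1, 1 / (1 + np.exp(-w)))
df = pd.DataFrame({"w": w, "treated": treated,
                   "sales": 2.0 * treated + w + rng.normal(size=2000)})

model = CausalModel(data=df, treatment="treated", outcome="sales",
                    common_causes=["w"])
estimand = model.identify_effect()
estimate = model.estimate_effect(estimand,
                                 method_name="backdoor.linear_regression")

# The four refutation tests from Section 3.2.4: the first three should
# roughly reproduce the original ATE; the placebo effect should be ~0.
for refuter in ["random_common_cause", "add_unobserved_common_cause",
                "data_subset_refuter", "placebo_treatment_refuter"]:
    print(model.refute_estimate(estimand, estimate, method_name=refuter))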
3.2.5 Data Visualization. A challenge that hinders the adoption of causal inference studies is the lack of ground-truth data, which makes the estimation error impossible to assess. OpportunityFinder addresses this by providing visualizations that naively explain the treatment effect to some extent. For example, it returns a plot that shows the trend of the outcome for treated and control units over time. The visualizations are part of the Logging and Monitoring module. While such plots do not confirm the treatment effect calculated by the causal models, they help non-expert users comprehend causal inference results. Example visualizations are shown in Appendix C.
3.3 Limitations and Risks
As of today, OpportunityFinder (OPF) does not implement causal graph generation algorithms. This also means that the tool has less flexibility for someone who wants to control covariates and experiment with different algorithms. We plan to integrate a causal discovery module in the near future.

OPF applies our best heuristics, after exploring the input data, to select the right algorithm. Due to the lack of ground-truth data in causal inference, our framework can make mistakes without knowing that the estimated effect it returns is wrong. We select a model based on standard error and ensemble by voting to mitigate this limitation to some extent. The accuracy of the estimate still depends on the observational data given by the user.

For real-world problems, OPF does not necessarily use the estimation models that gave the best scores on benchmark data. Our experiments show that simpler estimators work more reasonably than DNN models on our use-case data.

4 VALIDATION OF CAUSAL ESTIMATES BY OPPORTUNITYFINDER
We validate our causal inference algorithms on benchmark datasets using three metrics, choosing for each dataset the metric available in related research (formal definitions are sketched after the list).
(1) Average Treatment Effect (ATE): measures the causal impact of a treatment/intervention on a population by comparing the average hypothetical outcomes between receiving and not receiving the treatment, accounting for potential confounding factors.
(2) Average Treatment Effect on the Treated (ATT): the ATE measured on treated units.
(3) Mean Absolute Error (MAE): the average absolute difference between the estimated ATE and the true ATE, where available, for evaluating the accuracy of a causal estimation method.
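In potential-outcomes notation, writing Y(1) and Y(0) for the hypothetical outcomes with and without treatment and T for the treatment indicator, these metrics can be formalized as follows (our notation, consistent with the definitions above):

ATE = E[Y(1) - Y(0)]
ATT = E[Y(1) - Y(0) | T = 1]
MAE = (1/n) * sum_i |ATE_hat_i - ATE_i|

where the MAE is averaged over the n estimates (e.g., benchmark replications) for which the ground-truth ATE is known.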
4.1 IHDP (public benchmark)
The Infant Health and Development Program (IHDP) [1] is a randomized controlled study designed to evaluate the effect of specialist visits on the cognitive test scores of premature infants. This dataset is cross-sectional, with a binary treatment (specialist visits), a continuous outcome (cognitive scores), and a known ground-truth ATE. As shown in Table 1, our implementation of the DML models achieved competitive performance. Results for BCAUSS, TARNET, and DRAGON are based on our implementation and differ slightly from the reported numbers [1]. The difference arises because the DNN methods are executed within the OPF pipeline and the data are not prepared in the same way as in previous studies.

4.2 Smoking (public benchmark)
The goal of the smoking data is to analyze the causal effect of Proposition 99 on cigarette sales. This dataset is small, with just one treated unit, so causal estimation based on machine learning (DML, NN) does not apply. OpportunityFinder chooses to run GSC and does not create any cohorts. Table 2 compares results using OPF on this dataset with previous research [3, 6]. We observe that the ATE estimates range between -11.1 and -27.1, and the results from OPF are within the range of previous studies.⁴

⁴ Synthetic difference-in-differences (SDID), synthetic controls (SC), difference-in-differences (DID), matrix completion (MC), synthetic control with intercept (DIFP).

Table 1: Mean absolute error on IHDP benchmark. All models are part of OpportunityFinder.

          DNN                        +LinearDML
BCAUSS  DRAGON  TARNET   LinearReg.  Rand.Forest  XGBoost  LightGBM
 0.23    0.32    0.25       0.42        0.48        0.47     0.43

When we use the cigarette retail price as a covariate, the ATE reduces to -14.0, which is closer to SDID. The SDID paper [6] argues that its result (-15.6) is the most credible among the approaches shown in the table. We also observe a lower standard error with the OPF results. This experiment helps validate OPF on a small panel dataset.

4.3 Synthetic Data 1: Cross Sectional (synthetic)
In addition to the public datasets, we validate OpportunityFinder's outputs on two synthetic datasets with known ATE.⁵ The first synthetic dataset is a linear cross-sectional dataset that we generated using the DoWhy [22] package. We created the dataset with 2 instruments, 5 common causes, 5,000 samples, and a binary treatment with some treatment noise. Because this data is cross-sectional, OPF rejects GSC and branches off to the second stage of the decision path, where it evaluates multiple models, including DML [12] and the neural-net-based estimators BCAUSS [24], TARNET [21], DRAGON [23], and GANITE [27]. As we see in Table 3, all models (except GANITE) give an ATE close to the true ATE. OPF finally selects the model with the least standard error whose mean is within the range of 2+ other models, and ends up selecting LinearReg+LinearDML.

⁵ The synthetic datasets used in this study will be shared upon paper acceptance.
4.4 Synthetic Data 2: Large Panel (synthetic)
In the second dataset, we add a non-linear confounding effect and correlated variables to a panel dataset, to test the efficacy of the supported models in removing the bias. This data contains 52,000 rows, 3 confounders with non-linear effects on treatment and outcome, 1,000 units, 263 treated units, and 52 time periods. The properties of this dataset enable OpportunityFinder to run all implemented algorithms and select based on standard errors. As shown in Table 3, all models except GANITE perform well, in that the estimated ATEs are close to the ground truth. Due to the lowest variance in the estimates from GSC, together with its mean estimate lying between those of 2+ other estimators, OPF chooses the GSC results for the end user.

4.5 Discussion on Model Choice
While our approach to model selection is evolving, the results on synthetic and public datasets show that our current two-stage decision path works well. The two-stage decision path allows automated rejection of estimators if they are not built for the use case at hand. For example, GSC is designed for panel data but becomes computationally inefficient at data sizes above 500,000 events; OpportunityFinder does not run GSC for such large data sizes. As the research evolves, especially with neural networks for causal inference, we plan to incorporate new models as well as update the model selection criteria. For example, we will explore providing results from ensembles of models and finding the expected causal path using causal discovery algorithms.

5 APPLICATIONS ON REAL WORLD DATA
OpportunityFinder has already been used in multiple use cases. In this section, we present the two most important applications of OPF within our organization. The most commonly used downstream impact metric in real applications is uplift, which is defined as the percentage increase/decrease in the outcome attributed to the treatment over a defined period. It is calculated as the ATE or ATT divided by the average over the control units.
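As a small worked example of this definition (with made-up numbers): if the estimated ATT on monthly sales is $1,200 and control units average $24,000 over the same period, the uplift is 1,200 / 24,000 = 5%.

att = 1_200.0            # estimated treatment effect on monthly sales ($)
control_mean = 24_000.0  # average outcome over control units ($)
print(f"uplift = {att / control_mean:.1%}")  # -> uplift = 5.0%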
higher which are beyond acceptable range. ML-based models may
over-estimate when input data is small.
4.5 Discussion on Model Choice
While our approach for model selection is evolving, the results on 5.2 Opportunity for Advertisers
synthetic and public datasets show that our current 2 stage decision This study estimates the effect of advertising partners on sell-
path works well. The two stage decision path allows automated ers/vendors outcomes related to Ads business, e.g., ad spend. This
rejection of estimators if they are not built for the use case at study has been traditionally taking multi-weeks of scientist’s effort
hand. For example, GSC is supposed to be used for panel data for each refresh. In 2023, the study was expanded in both number
but becomes computationally inefficient with >500,000 data sizes. of outcome variables and numbers of groups of partners and adver-
OpportunityFinder does not run GSC for such large data sizes. As tisers. Each combination of outcome and entity group is a separate
the research evolves, especially with neural networks for causal causal study. OpportunityFinder helped accelerate the study so that
inference, we plan to incorporate the new models as well as update all experiments were completed within a month.
the models selection criteria. For example, we will explore providing With a large number of advertisers, input data is redirected to
results from ensemble of models and finding the expected causal DML and DNN causal models. Input panel data is then transformed
path using causal discovery algorithms. into cohorts, before feature engineering and model training. Based
6 Opportunity names and business metrics (see Tables 4, 5) are masked due to customer
5 Synthetic datasets used in this study will be shared upon paper acceptance. data policy.

Table 2: Estimates for ATE on Smoking data

                 SDID    SC     DID    MC     DIFP   OPF    OPF w. price
ATE             -15.6  -19.6  -27.3  -20.2  -11.1  -24.6   -14.0
Standard Error    8.4    9.9   17.7   11.5    9.5    4.9     4.7

Table 3: ATE (with std. error) on synthetic datasets using models implemented in OPF

                          Synthetic#1 (GT = 10)   Synthetic#2 (GT = 20)
BCAUSS                    10.18 (0.03)            19.00 (0.18)
DRAGON                    10.34 (0.19)            18.98 (0.24)
TARNET                    10.00 (0.09)            18.98 (0.23)
GANITE                     7.78 (n/a)              6.41 (n/a)
LinearReg. + LinearDML     9.93 (0.02)            19.04 (0.13)
Rand.Forest + LinearDML    9.98 (0.07)            19.01 (0.13)
XGBoost + LinearDML        9.70 (0.04)            18.82 (0.13)
LightGBM + LinearDML       9.75 (0.03)            18.97 (0.15)
GSC                        n/a                    18.87 (0.07)

Table 4: A sample of opportunity-for-partners studies from 2021/22 vs. 2023. The metric is uplift after 6 months of adoption.

                               Opportunity X                  Opportunity Y
                          Metric 1  Metric 2  Metric 3   Metric 1  Metric 2  Metric 3
Manual (2021/22)             5%       20%       12%         4%        4%        6%
OpportunityFinder (2023)     6%       12%       17%         8%        8%       11%

Table 5: Results of the opportunity-for-advertisers study for world-wide vendors. The metric is average monthly uplift on outcomes within 3 months after adoption.

                        DNN                            +LinearDML
Outcome    BCAUSS  DRAGON  TARNET  GANITE   LinearReg.  Rand.Forest  XGBoost  LightGBM
Metric 4    68%     45%     62%     15%        17%          14%         12%      14%
Metric 5    58%     48%     64%     16%        14%          13%         16%      12%

Based on the ATE and standard error results on validation datasets, OPF chooses Rand.Forest+LinearDML as the final model. Our results were reviewed by domain experts and are in the range of results from prior studies. In Table 5, we report the lift metric returned by all possible models on one dataset. The three DNN models over-estimate the treatment effect, and only GANITE yields numbers close to the DML models.

6 CONCLUSIONS AND FUTURE WORK
This paper presents OpportunityFinder (OPF), a code-less framework for causal inference studies, with a focus on panel data with binary treatment. Our experiments on multiple public, synthetic, and internal datasets show that OPF can handle a diverse set of scenarios and that our decision criteria for algorithm selection work well for the given use cases. We also see that in most cases, simpler algorithms like DML and GSC work well. We are able to use OPF on datasets ranging from small panel data to large data with more than one million observations.

We are actively taking feature requests from current OPF users. With a causal discovery component, we will explore how hypothesis formulation before estimation can improve the estimation capability, especially with the large sets of observational data that a non-expert user tends to provide. We also aim to provide a master list of variables that can be collected for causal inference studies within our organization, and let OPF auto-shortlist covariates using data-driven approaches for removing bias. We plan to extend OPF by incorporating more estimators such as meta-learners, implementing individual and heterogeneous treatment effects, and supporting categorical and continuous treatments. As more causal inference algorithms are integrated into OPF, we will implement additional model selection criteria, e.g., the prediction/regression accuracy of base learners. Moreover, we will experiment with model ensembles to provide the final output. Last but not least, we have refactored the OpportunityFinder source code to make it a stand-alone library independent of the AWS ecosystem.

REFERENCES
[1] [n. d.]. Causal Inference on IHDP: Benchmark. https://paperswithcode.com/sota/causal-inference-on-ihdp.
[2] [n. d.]. No Free Lunch in Causal Inference. https://p-hunermund.com/2018/06/09/no-free-lunch-in-causal-inference/.
[3] Alberto Abadie, Alexis Diamond, and Jens Hainmueller. 2010. Synthetic Control Methods for Comparative Case Studies: Estimating the Effect of California's Tobacco Control Program. J. Amer. Statist. Assoc. 105, 490 (2010), 493–505. https://doi.org/10.1198/jasa.2009.ap08746
[4] Alberto Abadie and Javier Gardeazabal. 2003. The economic costs of conflict: A case study of the Basque Country. American Economic Review (2003), 113–132.
[5] Joshua D. Angrist and Jörn-Steffen Pischke. 2009. Mostly Harmless Econometrics: An Empiricist's Companion. Princeton University Press.
[6] Dmitry Arkhangelsky, Susan Athey, David A. Hirshberg, Guido W. Imbens, and Stefan Wager. 2021. Synthetic Difference in Differences. arXiv:1812.09970 [stat.ME]
[7] Philipp Bach, Victor Chernozhukov, Malte S. Kurz, and Martin Spindler. 2022. DoubleML – An Object-Oriented Implementation of Double Machine Learning in Python. Journal of Machine Learning Research 23, 53 (2022), 1–6. http://jmlr.org/papers/v23/21-0862.html
[8] Keith Battocchi, Eleanor Dillon, Maggie Hei, Greg Lewis, Paul Oka, Miruna Oprescu, and Vasilis Syrgkanis. 2019. EconML: A Python Package for ML-Based Heterogeneous Treatment Effects Estimation. https://github.com/py-why/EconML. Version 0.x.
[9] David Card and Alan B. Krueger. 1994. Minimum wages and employment: A case study of the fast-food industry in New Jersey and Pennsylvania. The American Economic Review 84, 4 (1994), 772–793.
[10] Huigang Chen, Totte Harinen, Jeong-Yoon Lee, Mike Yung, and Zhenyu Zhao. 2020. CausalML: Python Package for Causal Machine Learning. arXiv:2002.11631 [cs.CY]
[11] Cheng Cheng and Mark Hoekstra. 2012. Does Strengthening Self-Defense Law Deter Crime or Escalate Violence? Evidence from Castle Doctrine. Working Paper 18134. National Bureau of Economic Research. https://doi.org/10.3386/w18134
[12] Victor Chernozhukov, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins. 2018. Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal 21, 1 (2018), C1–C68.
[13] Rajeev H. Dehejia and Sadek Wahba. 2002. Propensity Score-Matching Methods for Nonexperimental Causal Studies. The Review of Economics and Statistics 84, 1 (2002), 151–161. https://doi.org/10.1162/003465302317331982
[14] Nick Erickson, Jonas Mueller, Alexander Shirkov, Hang Zhang, Pedro Larroy, Mu Li, and Alexander Smola. 2020. AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data. arXiv preprint arXiv:2003.06505 (2020).
[15] Matthias Feurer, Katharina Eggensperger, Stefan Falkner, Marius Lindauer, and Frank Hutter. 2020. Auto-Sklearn 2.0: Hands-free AutoML via Meta-Learning. arXiv:2007.04074 [cs.LG] (2020).
[16] Timo Flesch, Edward Zhang, Guy Durant, Mark Harley, Wen Hao Kho, and Egor Kraev. 2022. Auto-Causality: A Python package for Automated Causal Inference model estimation and selection. https://github.com/transferwise/auto-causality. Version 0.x.
[17] Sören R. Künzel, Jasjeet S. Sekhon, Peter J. Bickel, and Bin Yu. 2019. Metalearners for Estimating Heterogeneous Treatment Effects using Machine Learning. Proceedings of the National Academy of Sciences 116, 10 (2019), 4156–4165.
[18] Robert LaLonde. 1986. Evaluating the Econometric Evaluations of Training Programs with Experimental Data. American Economic Review 76 (1986), 604–620.
[19] Erin LeDell and Sebastien Poirier. 2020. H2O AutoML: Scalable Automatic Machine Learning. 7th ICML Workshop on Automated Machine Learning (AutoML) (July 2020). https://www.automl.org/wp-content/uploads/2020/07/AutoML_2020_paper_61.pdf
[20] Paul R. Rosenbaum and Donald B. Rubin. 1983. The central role of the propensity score in observational studies for causal effects. Biometrika 70, 1 (1983), 41–55.
[21] Uri Shalit, Fredrik D. Johansson, and David Sontag. 2017. Estimating individual treatment effect: generalization bounds and algorithms. In Proceedings of the 34th International Conference on Machine Learning, Vol. 70. JMLR.org, 3076–3085.
[22] Amit Sharma, Emre Kiciman, et al. 2019. DoWhy: A Python package for causal inference. https://github.com/microsoft/dowhy.
[23] Claudia Shi, David M. Blei, and Victor Veitch. 2019. Adapting Neural Networks for the Estimation of Treatment Effects. arXiv:1906.02120 [stat.ML]
[24] Gino Tesei, Stefanos Giampanis, Jingpu Shi, and Beau Norgeot. 2023. Learning end-to-end patient representations through self-supervised covariate balancing for causal treatment effect estimation. Journal of Biomedical Informatics 140 (2023), 104339.
[25] Stefan Wager and Susan Athey. 2018. Estimation and Inference of Heterogeneous Treatment Effects using Random Forests. J. Amer. Statist. Assoc. 113, 523 (2018), 1228–1242.
[26] Yiqing Xu. 2017. Generalized Synthetic Control Method: Causal Inference with Interactive Fixed Effects Models. Political Analysis 25, 1 (2017), 57–76. https://doi.org/10.1017/pan.2016.2
[27] Jinsung Yoon, James Jordon, and Mihaela van der Schaar. 2018. GANITE: Estimation of Individualized Treatment Effects using Generative Adversarial Nets. In International Conference on Learning Representations. https://openreview.net/forum?id=ByKWUeWA-

APPENDIX

A OVERVIEW OF CAUSAL INFERENCE MODELS
The following models are considered for the auto-causal framework:
• Synthetic Control (SC) and Generalized Synthetic Control (GSC): SC allows for comparative case studies using a weighted combination of control units to create a synthetic control unit. GSC extends this by considering interactive fixed effects models. The key assumption is that the outcome of treated units is a linear function of the outcomes of the control units in the absence of treatment. GSC allows this relationship to vary over time, unlike traditional SC methods. Both methods are well suited for panel data with small sample sizes but require domain knowledge for the selection of control units. The limitation for implementing GSC in the auto-causal framework is its computational inefficiency with large observational data or with a larger number of covariates. [4, 26]
• Double Machine Learning (DML): DML leverages machine learning to estimate treatment effects in a semi-parametric manner, allowing for complex relationships. The key requirement for DML to work well is the availability of high-quality and diverse covariate data. DML can handle large datasets and does not specifically require panel data. It allows different ML models to be used in the two stages, providing versatility. [12]
• Causal Forests: Causal Forests extend random forests to estimate heterogeneous treatment effects, offering flexibility and the ability to capture complex relationships. The key assumption is the unconfoundedness or ignorability assumption. They are not inherently designed for panel data and require a relatively large sample size. The limitation for the auto-causal framework is that they do not handle panel data well, and we did not find them to work well in our experiments. [25]
• Neural Network based approaches: Several approaches utilize neural networks for causal inference, each with its unique proposition. BCAUSS, Dragonnet, and TARNet model treatment assignments and potential outcomes in a multi-task learning setup, allowing the finding of least-dissimilar treated and untreated observations. GANITE leverages the power of generative adversarial networks (GANs) to estimate individual treatment effects. These methods require relatively large and high-quality datasets, otherwise they can over- or under-estimate the treatment effects. They can handle large datasets but are not specifically designed for panel data. [21, 23, 24, 27]
• Meta Learners: Meta Learners apply machine learning methods to estimate treatment effects, offering the flexibility of using various base learners. The key assumption is that the base learners are correctly specified. They are not specifically designed for panel data and require a relatively large sample size. The limitation for the auto-causal framework lies in the choice of base learner. [17]

• Difference in Differences (DiD): DiD compares the average change in outcome over time in the treatment group to the average change over time in the control group. It is designed to handle unobserved, time-invariant confounders, and it is a simple and intuitive method for panel data, widely used in economic studies. The problem with DiD is that it relies on a strong parallel-trends assumption that is often violated in real-world settings. We do not use DiD or its variant Synthetic DiD in our implementations. [6, 9]
• Propensity Score Matching (PSM): The propensity score is the conditional probability of receiving treatment given pre-treatment characteristics. This approach has traditionally been popular because of its simplicity, interpretability, and ability to handle many covariates. But DML is based on a similar principle, overcomes the limitations that PSM has, and is more robust. Complementing DML, SC methods do not require the unconfoundedness assumption that PSM does. Therefore, we do not use PSM in OPF. [20]

B VALIDATION TESTS FOR TREATMENT EFFECT RESULTS
To demonstrate the validation tests for treatment effect results, we report the refutation test outputs for the LinearReg+LinearDML model and its ATE on the Synthetic#1 data in Table 6. Sensitivity tests for the GSC model are reported in Table 7. All refutation tests passed: the placebo ATEs are close to zero, while the ATEs from the other tests are close to the original model. This confirms that the model is robust against changes in settings and that the estimated ATE is consistent.

C VISUALIZATION OF OUTCOME VARIABLES IN DIFFERENT DATA
We display data plots generated by OpportunityFinder when running different datasets. Figure 3 plots the average outcome of treated (orange line) and control (blue line) units for a cohort; dashed red vertical bars indicate the start and end dates of the cohort. These plots are generated by our data processing module. For the Smoking and Texas data, Figures 4 and 5 are generated by our GSC model: the black line shows the time series of outcome values of the treated unit, and the dashed blue line shows that of the synthesized control.

[Figure 3: Average outcome metric of treated vs. control units over time in advertiser data. (a) Metric 4; (b) Metric 5.]

D ADDITIONAL RESULTS

D.1 Texas data
Table 8 shows the impact on black and white male incarceration of the prison expansion in Texas since 1993. The numbers represent the average percentage lift on the respective observational metric in Texas vs. other states due to the expansion. The covariates used to remove bias include poverty rates, white male incarceration, the percentage of the population between 15 and 19, income, the unemployment rate, and AIDS mortality. We see that Texas had an average of 2x (100%) more black male incarceration compared to the other states from the start of the prison expansion in 1993 until 2000. The increase was low but non-zero (38%) for white male incarceration during the same period in Texas.

D.2 NSW and Castle data
In this section, we show results on two additional datasets, NSW [18] and Castle [11]. Castle is a panel dataset with year-level information for 10 years, covering 50 states, of which 21 adopted the castle doctrine law. Castle law designates a person's abode or any legally occupied place (for example, a vehicle or home) as a place in which that person has protections and immunity permitting, in certain circumstances, the use of force (up to and including deadly force) to defend oneself against an intruder, free from legal prosecution for the consequences of the force used. The study by [11] aimed at finding the effect of the castle doctrine law on the increase in homicide. This dataset has 550 rows and 170 features (potential covariates) and, based on the researchers' outcome, an expected lift in homicide of 8% (we are assessing whether the OPF models reproduce the study). We see that at such small data sizes with a large number of potential covariates, linear models do the best, and boosted trees can be very off.

Table 6: Refutation tests for the DML model on the Synthetic#1 and #2 sets

                               Synth#1 ATE       Synth#2 ATE
Ground truth                   10.00             20.00
Model ATE                       9.93             19.04
Placebo test                   -0.08 (passed)    -0.05 (passed)
Random common cause test        9.98 (passed)    19.08 (passed)
Unobserved common cause test    9.98 (passed)    15.48 (passed)
Data-subset test                9.99 (passed)    19.00 (passed)

Table 7: Sensitivity tests for the GSC model on partner data with Metric 2. The numbers represent % uplift. All lifts are statistically significant.

                                      Opportunity X     Opportunity Y
Overall model                          7.8%              12.0%
Remove covariates test                14.3% (passed)     34.5% (passed)
Random downsample test                 7.0% (passed)     11.8% (passed)
Reduced period for SC weights test     7.8% (passed)     11.2% (passed)

Figure 4: Synthetic control fit on smoking data without covariates. The pre-treatment fit is good.

Table 8: Average percentage lift on black and white male incarceration from the prison expansion in Texas since 1993.

black-male prison    white-male prison
100%                 38%

NSW is a famous experimental dataset that is complemented with additional synthetic data in which researchers added selection bias to the control population. It has been used in multiple research works to replicate the results of randomized trials. It contains 50,000 rows and 180 features, and has 100 simulated variations. It is not a panel dataset, and we tested it with the DML variations. We observe underestimation compared to other research works, with the closest results coming from the Random Forest based model.

[Figure 5: Pre-treatment synthetic control fit and post-treatment divergence of different metrics on Texas data. Panels: (a) Black male incarceration; (b) White male incarceration; (c) AIDS per capita; (d) Poverty.]

Table 9: Comparison of % lift on the Castle dataset using DML models from OPF vs. previous research works

          Previous Research                +LinearDML
          Cheng'12 [11]    LinearReg.  Rand.Forest  XGBoost  LightGBM
% lift    8%               7%          4%           4%       50%

Table 10: Comparison of ATE on the NSW dataset using DML models from OPF vs. previous research works

       Previous Research                          +LinearDML
       Lalonde'86 [18]   Dehejia'02 [13]   LinearReg.  Rand.Forest  XGBoost  LightGBM
ATE    900               1300-1800         286         776          728      637
