Research
Artificial Intelligence—Review
Causal Inference†
Kun Kuang a,*, Lian Li b, Zhi Geng c, Lei Xu d, Kun Zhang e, Beishui Liao f, Huaxin Huang f, Peng Ding g, Wang Miao h, Zhichao Jiang i

a College of Computer Science and Technology, Zhejiang University, Hangzhou 310058, China
b Department of Computer Science and Technology, Hefei University of Technology, Hefei 230009, China
c School of Mathematical Science, Peking University, Beijing 100871, China
d Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
e Department of Philosophy, Carnegie Mellon University, Pittsburgh, PA 15213, USA
f School of Humanities, Zhejiang University, Hangzhou 310058, China
g University of California, Berkeley, Berkeley, CA 94720, USA
h Guanghua School of Management, Peking University, Beijing 100871, China
i Department of Government & Department of Statistics, Harvard University, Cambridge, MA 02138, USA
Article history:
Received 8 May 2019
Revised 31 July 2019
Accepted 26 August 2019
Available online 8 January 2020

Keywords:
Causal inference
Instrumental variables
Negative control
Causal reasoning and explanation
Causal discovery
Counterfactual inference
Treatment effect estimation

Abstract

Causal inference is a powerful modeling tool for explanatory analysis, which might enable current machine learning to become explainable. How to marry causal inference with machine learning to develop explainable artificial intelligence (XAI) algorithms is one of the key steps toward artificial intelligence 2.0. With the aim of bringing knowledge of causal inference to scholars of machine learning and artificial intelligence, we invited researchers working on causal inference to write this survey from different aspects of causal inference. This survey includes the following sections: "Estimating average treatment effect: A brief review and beyond" from Dr. Kun Kuang, "Attribution problems in counterfactual inference" from Prof. Lian Li, "The Yule–Simpson paradox and the surrogate paradox" from Prof. Zhi Geng, "Causal potential theory" from Prof. Lei Xu, "Discovering causal information from observational data" from Prof. Kun Zhang, "Formal argumentation in causal reasoning and explanation" from Profs. Beishui Liao and Huaxin Huang, "Causal inference with complex experiments" from Prof. Peng Ding, "Instrumental variables and negative controls for observational studies" from Prof. Wang Miao, and "Causal inference with interference" from Dr. Zhichao Jiang.

© 2020 THE AUTHORS. Published by Elsevier LTD on behalf of Chinese Academy of Engineering and Higher Education Press Limited Company. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
1. Estimating average treatment effect: A brief review and beyond

Machine learning methods have demonstrated great success in many fields, but most lack interpretability. Causal inference is a powerful modeling tool for explanatory analysis, which might enable current machine learning to make explainable predictions. In this article, we review two classical estimators for estimating causal effects, and discuss the remaining challenges in practice. Moreover, we present a possible way to develop explainable artificial intelligence (XAI) algorithms by marrying causal inference with machine learning.

1.1. The setup

We are interested in estimating the causal effect of a binary variable based on the potential outcome framework [1]. For each unit indexed by i = 1, 2, ..., n (n denotes the sample size), we observe a treatment $T_i$, an outcome $Y_i^{\mathrm{obs}}$, and a vector of observed variables $X \in \mathbb{R}^{p \times 1}$, where p refers to the dimension of the observed variables. The pair of potential outcomes for each unit i is $\{Y_i(1), Y_i(0)\}$, corresponding to its treatment assignment $T_i = 1$ (treated) or $T_i = 0$ (control). The observed outcome $Y_i^{\mathrm{obs}}$ is

$Y_i^{\mathrm{obs}} = Y_i(T_i) = T_i Y_i(1) + (1 - T_i) Y_i(0)$  (1)

Then, the average treatment effect is defined as follows:

$\tau = E[Y_i(1) - Y_i(0)]$  (2)

† The authors contributed equally to this work. The symbol definitions and notations of each section are relatively independent.
* Corresponding author. E-mail address: [email protected] (K. Kuang).
https://doi.org/10.1016/j.eng.2019.08.016
2095-8099/© 2020 THE AUTHORS. Published by Elsevier LTD on behalf of Chinese Academy of Engineering and Higher Education Press Limited Company. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
where $E(\cdot)$ denotes the expectation function, and the average treatment effect for the treated is defined as $\tau_t = E[Y_i(1) - Y_i(0) \mid T_i = 1]$.

To identify $\tau$ and $\tau_t$, we assume unconfoundedness—that $T_i \perp \{Y_i(1), Y_i(0)\} \mid X_i$—and assume overlap of the covariate distribution—that $0 < p(T_i = 1 \mid X_i) < 1$.

1.2. Two estimators

Here, we briefly introduce two of the most promising estimators for treatment effect estimation and discuss them for the case with many observed variables.

1.3.2. Interaction of treatments

In practice, the treatment can consist of multiple variables and their interactions. In social marketing, the combined causal effects of different advertising strategies may be of interest. More work is needed on the causal analyses of treatment combinations.

1.3.3. Unobserved confounders

The existence of unobserved confounders is equivalent to violation of the unconfoundedness assumption and is not testable. Controlling high-dimensional variables may make unconfoundedness more plausible but poses new challenges to propensity score estimation and confounder balancing.
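To make the setup of Section 1.1 concrete, here is a minimal Python sketch (our illustration, not from the original text) that estimates $\tau$ by inverse propensity weighting under the unconfoundedness and overlap assumptions; the simulated data and the logistic propensity model are illustrative assumptions only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p = 5000, 5
X = rng.normal(size=(n, p))                       # observed variables
e_true = 1 / (1 + np.exp(-X[:, 0]))               # true p(T = 1 | X)
T = rng.binomial(1, e_true)                       # treatment assignment
Y = 2.0 * T + X.sum(axis=1) + rng.normal(size=n)  # true tau = 2

# Estimate the propensity score e(X) = p(T = 1 | X).
e_hat = LogisticRegression().fit(X, T).predict_proba(X)[:, 1]

# IPW: weight treated units by 1/e(X) and control units by 1/(1 - e(X)).
tau_hat = np.mean(T * Y / e_hat - (1 - T) * Y / (1 - e_hat))
print(f"IPW estimate of tau: {tau_hat:.2f}")      # close to 2.0
```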
2. Attribution problems in counterfactual inference

of the causality of x and y. In social science or logical science, this is called the attribution problem. It is also known as the "but-for" criterion in jurisprudence. The attribution problem has a long history of being studied; however, previous methods used to address this problem have mostly been case studies, statistical analyses, experimental designs, and so forth; one example is the influential INUS theory put forward by the Australian philosopher Mackie in the 1960s [14]. These methods are basically qualitative, relying on experience and intuition. With the emergence of big data, however, data-driven quantitative study has been developed for the attribution problem, making the inference process more scientific and reasonable.

Attribution has a twin problem, which is to determine the probability that the event y would have occurred (y = 1) had the event x occurred (x = 1), given that event x did not occur (x = 0) and event y did not happen (y = 0). Eq. (7) represents this probability:

$P(y_{x=1} = 1 \mid x = 0, y = 0)$  (7)

This equation reflects the probability that event x causes event y; that is, it reflects the sufficiency of the causality of x and y.

Counterfactual inference corresponds to human introspection, which is a key feature of human intelligence. Inference allows people to predict the outcome of performing a certain action, while introspection allows people to rethink how they could have improved the outcome, given the known effect of the action. Although introspection cannot change the existing de facto situation, it can be used to correct future actions. Introspection is a mathematical model that uses past knowledge to guide future action. Unless it possesses the ability of introspection, intelligence cannot be called true intelligence.

Introspection is also important in daily life. For example, suppose Ms. Jones and Mrs. Smith both had cancer surgery. Ms. Jones also had irradiation. Eventually, both recovered. Then Ms. Jones rethought whether she would have recovered had she not taken the irradiation. Obviously, we cannot infer that Ms. Jones would have recovered had she not taken the irradiation, based on the fact that Mrs. Smith recovered without irradiation.

There is an enormous amount of this kind of problem in medical disputes, court trials, and so forth. What we are concerned with is what the real causality is, once a fact has occurred for a specific individual case. In these situations, general statistics—such as the recovery rate with irradiation—cannot provide the explanation. Calculating the necessity of causality by means of introspection and attribution inference plays a key role in these areas [14].

As yet, no general calculation method exists for Eq. (6). In cases that involve solving a practical problem, researchers introduce a monotonicity assumption that can be satisfied in most cases; that is:

$y_{x=1} \geq y_{x=0}$

The intuition of monotonicity is that the effect y of taking an action (x = 1) will not be worse than that of not taking the action (x = 0). For example, in epidemiology, the intuition of monotonicity is violated for people who would be infected (y = 0) after being quarantined (x = 1) but would remain uninfected (y = 1) without being quarantined (x = 0). Because of the monotonicity, Eq. (6) can be rewritten as follows:

$P(y_{x=0} = 0 \mid x = 1, y = 1) = \frac{P(y = 1) - P(y_{x=0} = 1)}{P(x = 1, y = 1)} = \frac{P(y = 1 \mid x = 1) - P(y = 1 \mid x = 0)}{P(y = 1 \mid x = 1)} + \frac{P(y = 1 \mid x = 0) - P(y_{x=0} = 1)}{P(x = 1, y = 1)}$  (8)

Eq. (8) has two terms. The first term is named the attributable risk fraction, or the excess risk ratio, and is well known in risk statistics. This term reflects the different risk ratios conditioning on x = 1 and x = 0. The second term is the confounding factor, which should be particularly noticed. This term reflects the effect confounded by other variables. In a natural environment, a change in y could be caused by x in two different ways: First, it could be directly caused by a change in x; or, second, it could be caused by other variables. This phenomenon is called confounding. The difference $P(y = 1 \mid x = 0) - P(y_{x=0} = 1)$ denotes the degree of confounding. In some situations, the change in x did give rise to the change in y, but x may not be the reason for the change in y (e.g., the sun rises after the cock crows). It is possible to exclude confounding by means of scientific experiments to determine the true causality of the change in y. However, scientific experiments can hardly be conducted in many social science problems, or even in some natural science problems. In such cases, only observational data can be obtained. Thus, the question of how to recognize confounding from observational data in order to determine the true causality is a fundamental problem in artificial intelligence.

In order to explain the relationship between the attributable risk fraction and the confounding factor, and their roles in the attribution problem (i.e., the necessity of causality) more specifically, we apply the example in Ref. [15]. In this example, Mr. A goes to buy a drug to relieve his pain and dies after taking the drug. The plaintiff files a lawsuit to ask the manufacturer to take responsibility. The manufacturer and the plaintiff provide the drug test results (i.e., experimental data) and survey results (i.e., nonexperimental data), respectively. The data are illustrated in Table 1, where x = 1 denotes taking the drug, while y = 1 denotes death.

The manufacturer's data come from strict drug safety experiments, while the plaintiff's data come from surveys among patients taking the drug of their own volition. The manufacturer claims that the drug was approved based on the drug distribution regulations. Although it causes a minor increase in the death rate (from 0.014 to 0.016), this increase is acceptable compared with the analgesic effect. Based on the traditional calculation of the attributable risk fraction (excess risk ratio), the responsibility taken by the manufacturer is

$\frac{P(y = 1 \mid x = 1) - P(y = 1 \mid x = 0)}{P(y = 1 \mid x = 1)} = \frac{0.016 - 0.014}{0.016} = 0.125$  (9)

The plaintiff argues that the drug test was conducted under experimental protocols, the subjects were chosen randomly, and the subjects did not take the drug of their own volition. Therefore, there is bias in the experiment, and the experimental setting differs from the actual situation. There is a huge difference between observational data and experimental data. Given the fact of the death of Mr. A, the calculation of the manufacturer's responsibility should obey the counterfactual equation. The result is

$\frac{P(y = 1 \mid x = 1) - P(y = 1 \mid x = 0)}{P(y = 1 \mid x = 1)} + \frac{P(y = 1 \mid x = 0) - P(y_{x=0} = 1)}{P(x = 1, y = 1)} = \frac{0.002 - 0.028}{0.002} + \frac{0.028 - 0.014}{0.001} = 1$  (10)

Therefore, the manufacturer should take full responsibility for the death of Mr. A.

Table 1
Experimental and non-experimental data for the example of a drug lawsuit.

Outcomes            Experimental data (number of patients)    Non-experimental data (number of patients)
                    x = 1        x = 0                        x = 1        x = 0
Deaths (y = 1)      16           14                           2            28
Survivals (y = 0)   984          986                          998          972
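The numbers in Table 1 reproduce both calculations; a short check (ours, using only the figures above):

```python
# Experimental data identify P(y_{x=0} = 1); survey data give the
# observational quantities appearing in Eq. (10).
p_death_treated_exp = 16 / 1000   # P(y = 1 | x = 1), experimental
p_death_control_exp = 14 / 1000   # P(y_{x=0} = 1) = 0.014
p_death_treated_obs = 2 / 1000    # P(y = 1 | x = 1), survey
p_death_control_obs = 28 / 1000   # P(y = 1 | x = 0), survey
p_joint = 2 / 2000                # P(x = 1, y = 1), survey

# Eq. (9): excess risk ratio on the experimental data
print((p_death_treated_exp - p_death_control_exp) / p_death_treated_exp)  # 0.125

# Eq. (10): excess risk ratio plus the confounding correction
err = (p_death_treated_obs - p_death_control_obs) / p_death_treated_obs
confounding = (p_death_control_obs - p_death_control_exp) / p_joint
print(err + confounding)                                                  # approx. 1.0
```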
A quick look shows that, based on the survey data, the death rates of taking and not taking the drug are 0.2% and 2.8%, respectively, which is in favor of the manufacturer. However, after careful analysis, the confounding factor is $P(y = 1 \mid x = 0) - P(y_{x=0} = 1) = 0.014$; that is, half of the subjects died due to reasons other than not taking the drug. This part should not be attributed to the drug, so the manufacturer's responsibility increases. Of course, there is some doubt regarding whether the manufacturer should take full responsibility, as well as regarding the rationality and scientific soundness of the calculation [16]. Nevertheless, this example demonstrates that there are confounding factors that will disturb the discovery of true causality. The question of how to determine confounding factors is, naturally, a practical problem in causal inference, and is also important in counterfactual inference.

In data science, there are simulated data and objective data, with the latter containing experimental data and observational data. Although observational data are objective, easily available, and low in cost, the confounding problems among them become an obstacle for causal inference [17]. In particular, there may be unknown variables (i.e., hidden variables) in the objective world. These variables are not observed, but may have effects on known variables—that is, the known variables may be sensitive to unmeasured confounding due to unknown variables. In this aspect, current studies on confounding are still in their infancy. Readers can refer to Ref. [18] for more detail.

3. The Yule–Simpson paradox and the surrogate paradox

An association measurement between two variables may be dramatically changed from positive to negative by omitting a third variable, Z; this is called the Yule–Simpson paradox [19,20]. The third variable, Z, is called a confounder. A numerical example is shown in Table 2. The risk difference (RD) is the difference between the proportion of lung cancer in the smoking group and that in the no-smoking group, RD = (80/200) − (100/200) = −0.10, which is negative. If the 400 persons listed in Table 2 are split into males and females, however, a dramatic change can be seen (Table 3). The RDs for both males and females are positive, at +0.10. This means that while smoking is bad for both males and females separately, smoking is good for all of these persons together.

Table 2
Smoking and lung cancer.

Condition     Cancer    No cancer    Total
Smoking       80        120          200
No smoking    100       100          200

Table 3
Smoking and lung cancer with populations stratified by gender.

Condition     Males                  Females
              Cancer    No cancer    Cancer    No cancer
Smoking       35        15           45        105
No smoking    90        60           10        40

The main difference between causal inference and other forms of statistical inference is whether the confounding bias induced by the confounder is considered. For experimental studies, it is possible to determine which variables affect the treatment or exposure; this is particularly true for a randomized experiment, in which the treatment or exposure is randomly assigned to individuals, as there is no confounder affecting the treatment. Thus, randomized experiments are the gold standard for causal inference. For observational studies, it is key to observe a sufficient set of confounders or an instrumental variable that is independent of all confounders. However, neither a sufficient confounder set nor an instrumental variable can be verified by observational data without manipulations.

In scientific studies, a surrogate variable (e.g., a biomarker) is often measured instead of an endpoint, due to the endpoint's infeasible measurement; the causal effect of a treatment on the unmeasured endpoint is then predicted by the effect on the surrogate. The surrogate paradox means that the treatment has a positive effect on the surrogate, and the surrogate has a positive effect on the endpoint, but the treatment may have a negative effect on the endpoint [21]. Numerical examples are given in Refs. [21,22]. This paradox also calls into question whether scientific knowledge is useful for policy analysis [23]. As a real example, doctors have the knowledge that an irregular heartbeat is a risk factor for sudden death. Several therapies can correct irregular heartbeats, but they increase mortality [24].

The Yule–Simpson paradox and the surrogate paradox warn that a conclusion obtained from data can be inverted due to unobserved confounders, and they emphasize the importance of using appropriate approaches to obtain data. To avoid the Yule–Simpson paradox, first, randomization is the gold standard approach for causal inference. Second, if randomization is prohibited, the use of an experimental approach to obtain data is expected, as such an approach attempts to balance all possible unobserved confounders between the two groups to be compared. Third, an encouragement-based experimental approach—in which benefits are randomly assigned to a portion of the involved persons, such that the assignment can change the probability of their exposure—can be used to design an instrumental variable. Finally, for a purely observational approach, it is necessary to verify the assumptions required for causal inference using field knowledge, and to further execute a sensitivity analysis for violations of these assumptions. The two paradoxes also point out that syllogisms and transitive reasoning may not be applicable to statistical results. Statistically speaking, smoking is bad for both males and females, and the studied population consists of these males and females; however, the statistics indicate that smoking is good for the population as a whole. Statistics may show that a new drug can correct irregular heartbeats, and it is known that a regular heartbeat can promote survival time, both statistically speaking and for individuals; however, the new drug may still shorten the survival time of these persons in terms of statistics.
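The paradox can be verified directly from Tables 2 and 3; a small illustrative check (ours, not part of the original text):

```python
def risk_difference(cases_smoking, n_smoking, cases_no_smoking, n_no_smoking):
    """RD = P(cancer | smoking) - P(cancer | no smoking)."""
    return cases_smoking / n_smoking - cases_no_smoking / n_no_smoking

print(risk_difference(80, 200, 100, 200))   # aggregated (Table 2): -0.10
print(risk_difference(35, 50, 90, 150))     # males (Table 3):      +0.10
print(risk_difference(45, 150, 10, 50))     # females (Table 3):    +0.10
```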
4. Causal potential theory

Extensive efforts have been made to detect causal direction, evaluate causal strength, and discover causal structure from observations. Examples include not only the studies based on conditional independence and directed acyclic graphs (DAGs) by Pearl, Spirtes, and many others, but also those on the Rubin causal model (RCM), structural equation model (SEM), functional causal model (FCM), additive noise model (ANM), linear non-Gaussian acyclic model (LiNGAM), post-nonlinear (PNL) model, and causal generative neural networks (CGNNs), as well as the studies that discovered star structure [25] and identified the so-called ρ-diagram [26]. To some extent, these efforts share a similar direction of thinking. First, one presumes a causal structure (e.g., merely one direction in the simplest case, or a DAG in a sophisticated situation) for a multivariate distribution, either modeled in parametric form or partly inspected via statistics, which is subject to certain constraints. Second, one uses observational data to learn the parametric model or estimate the statistics, and then examines whether the model fits the observations and the constraints are satisfied; based on this, one verifies whether the presumed causal
structure externally describes the observations well. Typically, a set of causal structures is presumed as candidates, among which the best is selected.

Causal potential theory (CPT) was recently proposed as a very different way of thinking [27]. In analogy to physics, causality is here regarded as an intrinsic kinetic nature caused by a causal potential energy. Without losing generality, this CPT is introduced by starting with the consideration of a cause–effect relation between a pair of variables x, y† in an environment U. Instead of presuming a causal structure (i.e., a specific direction), one estimates a nonparametric distribution $p_U(x, y) \triangleq p(x, y \mid U)$ from samples of x, y, and obtains the corresponding causal potential energy $E_U(x, y) \propto -\ln p_U(x, y)$ in an analogy based on the Gibbs distribution. In such a perspective of causal dynamics, an event occurring at (x, y) is associated with $E_U(x, y)$, which yields a force $[g_x, g_y]$ that causes subsequent events by the dynamics $[\dot{x}_t, \dot{y}_t] \propto [g_x, g_y]$, driving the information flow or causal process toward an area with the lowest energy or, equivalently, toward an area in which events have high chances to occur, using the notations $g_U \triangleq -\nabla_U E_U$ and $\dot{u}_t \triangleq \mathrm{d}u/\mathrm{d}t$. That is, CPT regards causality as an intrinsic nature of the dynamics $[\dot{x}_t, \dot{y}_t] \propto [g_x, g_y]$ and discovers causality by analyzing $[g_x, g_y]$.

Table 4 shows two roads for analyzing CPT causality. RoadA proceeds by testing a "Yes" or "No" answer on the mutual independence between $g_y$ and y, and on that between $g_x$ and x, resulting in four types of Y-N combinations. The first two types indicate two types of causality. The third type, Y-Y, indicates independence between x and y—that is, it indicates that there is no relation between them. The last type, N-N, indicates "unclear ?"—that is, further study is needed to determine whether a causal relation still occurs locally, or even reciprocally, in some regions of x, y, although there is no causal relation detected globally between x and y. RoadA needs an independence test. In contrast, RoadB turns the problem into supervised learning, with x, y as inputs into a neural net to fit the two gradient components $[g_x, g_y]$, each of which is fit by a different neural net, with one or both of x, y as inputs, respectively. An appropriate one is chosen according to not only fit, but also simplicity. Table 4 lists four types of outcomes based on this method [27].

It is possible to seek a certain estimator to obtain $g_x$, $g_y$ directly from samples $x_t$, $y_t$, where t = 1, ..., N and N refers to the sample size. It is also possible to obtain $g_x$, $g_y$ indirectly, by estimating $p_U(x, y)$ first; that is, by performing a kernel estimate $p_h(x, y) = \frac{1}{N}\sum_{t=1}^{N} G([x, y] \mid [x_t, y_t], h^2 I)$, where $G(\cdot \mid m, \sigma^2)$ denotes a Gaussian with mean m and variance $\sigma^2$. Alternatively, it is possible to obtain $p_U$ from one presumed causal structure, and to perform CPT analyses on this $p_U$.
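The following rough numerical sketch is our reading of this idea, not the implementation of Ref. [27]: it estimates $p_U(x, y)$ with a Gaussian kernel, forms $E_U = -\ln p_U$, and inspects the force components $g = -\nabla E_U$. The bandwidth h and the correlation-based dependence check are simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
N, h = 1000, 0.3
x = rng.normal(size=N)
y = x ** 2 + 0.3 * rng.normal(size=N)            # toy data

def force(points, samples, h):
    """g = grad log p_h at `points`, p_h a Gaussian KDE with bandwidth h."""
    diff = points[:, None, :] - samples[None, :, :]   # (M, N, 2)
    w = np.exp(-0.5 * (diff ** 2).sum(-1) / h ** 2)   # kernel weights
    w /= w.sum(axis=1, keepdims=True)
    return -(w[:, :, None] * diff).sum(axis=1) / h ** 2

pts = np.column_stack([x, y])
g = force(pts, pts, h)       # g = -grad E_U, since E_U = -log p_U
g_x, g_y = g[:, 0], g[:, 1]

# RoadA-style inspection: correlation as a crude proxy for the
# (in)dependence between g_x and x, and between g_y and y.
print(np.corrcoef(g_x, x)[0, 1], np.corrcoef(g_y, y)[0, 1])
```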
Experiments on the CauseEffectPairs (CEP) benchmark have demonstrated that a preliminary and simple implementation of CPT achieves performance comparable with that of state-of-the-art methods.

Further development is to explore the estimation of causal structure between multiple variable distributions and multiple variables, possibly along two directions. One is simply integrating the methods in Table 4 into the famous Peter–Clark (PC) algorithm [28], especially on edges that are difficult to identify by independence and conditional independence tests. The other is turning the conditions that $g_y$ is uncorrelated with (or independent of) x and that $g_x$ is uncorrelated with (or independent of) y into multivariate polynomial equations, and adding these equations to the ρ-diagram equations in Ref. [26] (e.g., Eq. (29) and Eq. (33) therein) to get an augmented group of polynomial equations. Then, the well-known Wen-Tsun Wu method may be adopted to check whether the equations have a unique solution or a finite number of solutions.

Table 4
Two roads for analyzing CPT causality.

† In this section, we reuse x, y to denote a pair of variables; their relationship might be cause and effect.

5. Discovering causal information from observational data

Causality is a fundamental notion in science, and plays an important role in explanation, prediction, decision-making, and control [28,29]. There are two essential problems to address in modern causality research. One essential problem is the identification of causal effects, that is, identifying the effects of interventions, given the partially or completely known causal structure and some observed data; this is typically known as "causal inference." For advances in this research direction, readers are referred to Ref. [29] and the references therein. In causal inference, the causal structure is assumed to be given in advance—but how can we find the causal structure if it is not given? A traditional way to discover causal relations resorts to interventions or randomized experiments, which are too expensive or time-consuming in many cases, or may even be impossible from a practical standpoint. Therefore, the other essential causality problem, which is how to reveal causal information by analyzing purely observational data, has drawn a great deal of attention [28].

In the last three decades, there has been a rapid spread of interest in principled methods for causal discovery, which has been driven in part by technological developments. These developments include the ability to collect and store big data with huge numbers of variables and sample sizes, and increases in the speed of computers. In domains containing measurements such as satellite images of weather, functional magnetic resonance imaging (fMRI) for brain imaging, gene-expression data, or single-nucleotide polymorphism (SNP) data, the number of variables can range in the millions, and there is often very limited background knowledge to reduce the space of alternative causal hypotheses. Causal discovery techniques without the aid of an automated search then appear to be hopeless. At the same time, the availability of faster computers with larger memories and disc space allows for practical implementations of computationally intensive automated algorithms to handle large-scale problems.

It is well known in statistics that "causation implies correlation, but correlation does not imply causation." Perhaps it is fairer to say that correlation does not directly imply causation; in fact, it has become clear that under suitable sets of assumptions, the causal structure (often represented by a directed graph) underlying a set of random variables can be recovered from the variables' observed data, at least to some extent. Since the 1990s, conditional
independence relationships in the data have been used for the purpose of estimating the underlying causal structure. Typical (conditional independence) constraint-based methods include the PC algorithm and fast causal inference (FCI) [28]. Under the assumption that there is no confounder (i.e., no unobserved direct common cause of two measured variables), the result of PC is asymptotically correct. FCI gives asymptotically correct results even when there are confounders. These methods are widely applicable because they can handle various types of causal relations and data distributions, given reliable conditional independence testing methods. However, they may not provide all the desired causal information, because they output (independence) equivalence classes—that is, a set of causal structures with the same conditional independence relations. The PC and FCI algorithms output graphical representations of the equivalence classes. In cases without confounders, there also exist score-based algorithms that estimate causal structure by optimizing some properly defined score function. The greedy equivalence search (GES), among them, is a widely used two-phase procedure that directly searches over the space of equivalence classes.
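As a toy illustration of the constraint-based idea, here is our sketch of the PC algorithm's edge-removal phase; the linear-Gaussian data and the partial-correlation threshold are simplifying assumptions (real implementations use proper statistical tests and larger conditioning sets).

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(2)
n = 5000
Z = rng.normal(size=n)
X = Z + rng.normal(size=n)   # Z -> X
Y = Z + rng.normal(size=n)   # Z -> Y; X and Y are independent given Z
data = {"X": X, "Y": Y, "Z": Z}

def partial_corr(a, b, cond):
    """Correlation of a and b after regressing out the variables in cond."""
    M = np.column_stack([np.ones(n)] + cond)
    ra = a - M @ np.linalg.lstsq(M, a, rcond=None)[0]
    rb = b - M @ np.linalg.lstsq(M, b, rcond=None)[0]
    return np.corrcoef(ra, rb)[0, 1]

edges = set(combinations(data, 2))
for u, v in sorted(edges):
    rest = [data[w] for w in data if w not in (u, v)]
    for size in (0, 1):                  # conditioning sets of size 0 and 1
        for cond in combinations(rest, size):
            if abs(partial_corr(data[u], data[v], list(cond))) < 0.05:
                edges.discard((u, v))    # u and v judged independent
print(sorted(edges))                     # X-Y removed; X-Z and Y-Z remain
```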
In the past 13 years, it has been further shown that algorithms based on properly constrained FCMs are able to distinguish between different causal structures in the same equivalence class, thanks to additional assumptions on the causal mechanism. An FCM represents the outcome or effect variable Y as a function of its direct causes X and some noise term E, that is, $Y = f(X, E)$, where E is independent of X. It has been shown that, without constraints on the function f, for any two variables, one of them can always be expressed as a function of the other and an independent noise term [30]. However, if the functional classes are properly constrained, it is possible to identify the causal direction between X and Y, because for wrong directions, the estimated noise and the hypothetical cause cannot be independent (although they are independent for the right direction). Such FCMs include the LiNGAM [31], where causal relations are linear and noise terms are assumed to be non-Gaussian; the post-nonlinear (PNL) causal model [32], which considers nonlinear effects of causes and possible nonlinear sensor/measurement distortion in the data; and the nonlinear ANM [33,34], in which causes have nonlinear effects and noise is additive. For a review of these models and corresponding causal discovery methods, readers are referred to Ref. [30].
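A minimal illustration of direction identification with a nonlinear ANM (our sketch of the idea behind Refs. [33,34]): regress each variable on the other and check whether the residual looks independent of the hypothetical cause. The polynomial regression and the correlation-of-features score, a crude stand-in for an HSIC-type independence test, are simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 3000
x = rng.uniform(-2, 2, size=n)
y = np.tanh(2 * x) + 0.2 * rng.normal(size=n)    # ground truth: x -> y

def residual(cause, effect, degree=5):
    """Residual of a polynomial regression of effect on cause."""
    coef = np.polyfit(cause, effect, degree)
    return effect - np.polyval(coef, cause)

def dependence(a, b):
    """Crude dependence score (near 0 means 'looks independent')."""
    feats = lambda t: (t, t ** 2, np.abs(t))
    return max(abs(np.corrcoef(fa, fb)[0, 1])
               for fa in feats(a) for fb in feats(b))

print("x -> y:", dependence(x, residual(x, y)))  # small: residual indep. of x
print("y -> x:", dependence(y, residual(y, x)))  # larger: independence fails
```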
Causal discovery exploits observational data. The data are produced not only by the underlying causal process, but also by the sampling process. In practice, for reliable causal discovery, it is necessary to consider the specific challenges posed in the causal and sampling processes, depending on the application domain. For example, for multivariate time series data such as mRNA expression series in genomics and blood-oxygenation-level-dependent (BOLD) time series in neuropsychology, finding the causal dynamics generating such data is challenging for many reasons, including nonlinear causal interactions, a much lower data-acquisition rate compared with the underlying rates of change, feedback loops in the causal model, the existence of measurement error, non-stationarity of the process, and possible unmeasured confounding causes. In clinical studies, there is often a large amount of missing data. Data collected on the Internet or in hospitals often suffer from selection bias. Some datasets involve both mixed categorical and continuous variables, which may pose difficulties in conditional independence tests and in the specification of appropriate forms of the FCM. Many of these issues have recently been considered, and corresponding methods have been proposed to address them.

Causal discovery has benefited a great deal from advances in machine learning, which provide an essential tool to extract information from data. On the other hand, causal information describes properties of the process that render a set of constraints on the data distribution, and it is able to facilitate understanding and solve a number of learning problems involving distribution shift or concerning the relationship between different factors of the joint distribution. In particular, for learning under data heterogeneity, it is naturally helpful to learn and model the properties of data heterogeneity, which then benefit from causal modeling. Such learning problems include domain adaptation (or transfer learning) [35], semi-supervised learning, and learning with positive and unlabeled examples. Leveraging causal modeling for recommender systems and reinforcement learning has become an active research field in recent years.

6. Formal argumentation in causal reasoning and explanation

In this section, we sketch why and how formal argumentation can play an important role in causal reasoning and explanation. Reasoning in argumentation is realized by constructing, comparing, and evaluating arguments [36]. An argument commonly consists of a claim that may be supported by premises, which can be observations, assumptions, or intermediate conclusions of some other arguments. The claim, the premises, and the inference relation between them may be the subject of rebuttals or counter-arguments [37]. An argument can be accepted only when it survives all attacks. In AI, formal argumentation is a general formalism for modeling defeasible reasoning. It provides a natural way for justifying and explaining causation, and is complementary to machine learning approaches, for learning, reasoning, and explaining cause-and-effect relations.

6.1. Nonmonotonicity and defeasibility

Causal reasoning is the process of identifying causality, that is, the relationship between a cause and its effect, which is often defeasible and nonmonotonic. On the one hand, causal rules are typically defeasible. A causal rule may be represented in the form "c causes e," where e is some effect and c is a possible cause. The causal connective is not a material implication, but a defeasible conditional with strength or uncertainty. For example, "turning the ignition key causes the motor to start, but it does not imply it, since there are some other factors such as there being a battery, the battery not being dead, there being gas, and so on" [38]. On the other hand, causal reasoning is nonmonotonic, in the sense that causal connections can be drawn tentatively and retracted in light of further information. It is usually the case that c causes e, but c and d jointly do not cause e. For example, an agent believes that turning the ignition key causes the motor to start, but when it knows that the battery is dead, it does not believe that turning the ignition key will cause the motor to start. In AI, this is the famous qualification problem. Since the potentially relevant factors are typically uncertain, it is not cost-effective to reason about them explicitly. So, when doing causal inference, people usually "jump" to conclusions and retract some conclusions when needed. Similarly, reasoning from evidence to cause is nonmonotonic. If an agent observes some effect e, it is allowed to hypothesize a possible cause c. The reasoning from the evidence to a cause is abductive, since for some evidence, one may accept an abductive explanation if no better explanation is available. However, when new explanations are generated, the old explanation might be discarded.

6.2. Efficiency and explainability

From a perspective of computation, monotonicity is a crucial property of classical logic, which means that each conclusion obtained by local computation using a subset of knowledge is equal to the one made by global computation using all the knowledge. This property does not hold in nonmonotonic reasoning and,
therefore, the computation could be highly inefficient. Due to the nonmonotonicity of causal reasoning, in order to improve efficiency, formal argumentation has been shown to be a good candidate, by comparing it with some other nonmonotonic formalisms such as default logic and circumscription. The reason is that in formal argumentation, computational approaches may take advantage of the divide-and-conquer strategy and maximal usage of existing computational results in terms of the reachability between nodes in an argumentation graph [39]. Another important property of causal reasoning in AI is explainability. Traditional nonmonotonic formalisms are not ideal for explanation, since their proofs are not represented in a human-understandable way. Since the purpose of explanation is to let the audience understand, the cognitive process of comparing and contrasting arguments is significant [37]. Argumentation provides such a way by exchanging arguments in terms of justification and argument dialogue [40].

6.3. Connections to machine learning approaches

In explainable AI, there are two components: the explainable model and the explanation interface. The latter includes reflexive explanations that arise directly from the model and rational explanations that come from reasoning about the user's beliefs. To realize this vision, it is natural to combine argumentation and machine learning, in the sense that knowledge is obtained by machine learning approaches, while the reasoning and explanation are realized by argumentation. Since argumentation provides a general approach for various kinds of reasoning in the context of disagreement, and can be combined with some uncertainty measures, such as probability and fuzziness, it is very flexible for modeling the knowledge learned from data. An example is when a machine learns features and produces an explanation, such as "This face is angry, because it is similar to these examples, and dissimilar from those examples." This is an argument, which might be attacked by other arguments. And, in order to measure the uncertainty described by some words such as "angry," one may choose to use possibilistic or probabilistic argumentation [41]. Different explanations may be in conflict. For instance, there could be some cases invoking specific examples or stories that support a choice, and rejections of an alternative choice that argue against less-preferred answers based on analytics, cases, and data. By using argumentation graphs, these kinds of support-and-attack relations can be conveniently modeled and can be used to compute the status of conflicting arguments for different choices.
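To make the notions of attack and acceptance concrete, here is a tiny sketch of ours, using Dung-style abstract argumentation as a simple stand-in for the richer structured formalisms discussed above: it computes the grounded extension by iteratively accepting arguments whose attackers are all defeated.

```python
def grounded_extension(arguments, attacks):
    """attacks is a set of pairs (a, b), read as 'a attacks b'."""
    accepted, defeated = set(), set()
    changed = True
    while changed:
        changed = False
        for arg in arguments:
            attackers = {a for (a, b) in attacks if b == arg}
            if arg not in accepted and attackers <= defeated:
                accepted.add(arg)        # every attacker is already out
                changed = True
        newly = {b for (a, b) in attacks if a in accepted} - defeated
        if newly:
            defeated |= newly
            changed = True
    return accepted

# A: "turning the key causes the motor to start"
# B: "the battery is dead" (attacks A)
# C: "the battery was just replaced" (attacks B)
print(grounded_extension({"A", "B", "C"}, {("B", "A"), ("C", "B")}))
# {'A', 'C'} (set order may vary): A is reinstated once B is defeated
```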
7. Causal inference with complex experiments

The potential outcomes framework for causal inference starts with a hypothetical experiment in which the experimenter can assign every unit to several treatment levels. Every unit has potential outcomes corresponding to these treatment levels. Causal effects are comparisons of the potential outcomes among the same set of units. This is sometimes called the experimentalist's approach to causal inference [42]. Readers are referred to Refs. [43–46] for textbook discussions.

7.1. Randomized factorial experiments

Splawa-Neyman [47] first formally discussed the following randomization model. In an experiment with n units, the experimenter randomly assigns $(n_1, \ldots, n_J)$ units to treatment levels $(1, \ldots, J)$, where $n = \sum_{j=1}^{J} n_j$. Unit i has potential outcomes $\{Y_i(1), \ldots, Y_i(J)\}$, with $Y_i(j)$ being the hypothetical outcome if unit i receives treatment level j. With potential outcomes, we can define causal effects; for example, the comparison between treatment levels j and j′ is $\tau(j, j') = \frac{1}{n}\sum_{i=1}^{n} [Y_i(j) - Y_i(j')]$. Let $T_i(j)$ be the indicator that unit i actually receives treatment level j, and let $Y_i = \sum_{j=1}^{J} T_i(j) Y_i(j)$ be the observed outcome of unit i. With observed data $\{T_i(1), \ldots, T_i(J), Y_i\}_{i=1}^{n}$, Splawa-Neyman [47] proposed to use $\hat{\tau}(j, j') = \frac{1}{n_j}\sum_{i=1}^{n} T_i(j) Y_i - \frac{1}{n_{j'}}\sum_{i=1}^{n} T_i(j') Y_i$ as an estimator for $\tau(j, j')$. He showed that $\hat{\tau}(j, j')$ is unbiased, with variance $\frac{S^2(j)}{n_j} + \frac{S^2(j')}{n_{j'}} - \frac{S^2(j - j')}{n}$, where $S^2(j)$, $S^2(j')$, and $S^2(j - j')$ are the sample variances of $Y_i(j)$, $Y_i(j')$, and $Y_i(j) - Y_i(j')$. Note that the randomness comes from the treatment indicators, with all the potential outcomes fixed. Splawa-Neyman [47] further discussed variance estimation and the large-sample confidence interval.

We can extend the framework from Ref. [47] to a general causal effect defined as $\tau = \frac{1}{n}\sum_{i=1}^{n} \tau_i$, where $\tau_i = \sum_{j=1}^{J} c_j Y_i(j)$ is the individual effect and the $c_j$ are contrast matrices with $\sum_{j=1}^{J} c_j = 0$. With appropriately chosen contrast matrices, the special cases include analysis of variance [48] and factorial experiments [49,50]. Furthermore, with an appropriately chosen subset of units, the special cases include subgroup analysis, post-stratification [51], and peer effects [52]. Ref. [53] provides the general forms of central limit theorems under this setting for asymptotic inference. Ref. [54] discusses split-plot designs, and Ref. [55] discusses general designs.
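A small simulation (illustrative only) of this estimator and its conservative variance for J = 2 treatment levels:

```python
import numpy as np

rng = np.random.default_rng(4)
n, n1 = 1000, 500
Y1 = rng.normal(1.0, 1.0, size=n)       # potential outcomes Y_i(1)
Y0 = rng.normal(0.0, 1.0, size=n)       # potential outcomes Y_i(0)

T = np.zeros(n, dtype=int)              # completely randomized assignment
T[rng.permutation(n)[:n1]] = 1
Y = np.where(T == 1, Y1, Y0)            # observed outcomes

tau_hat = Y[T == 1].mean() - Y[T == 0].mean()
# Neyman's variance estimate drops the unidentifiable S^2(1-0)/n term,
# so it is conservative on average.
var_hat = Y[T == 1].var(ddof=1) / n1 + Y[T == 0].var(ddof=1) / (n - n1)
print(tau_hat, var_hat ** 0.5)          # estimate near 1.0, SE near 0.06
```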
7.2. The role of covariates in the analysis of experiments

The Splawa-Neyman randomization model [47] also allows for the use of covariates to improve efficiency without strong modeling assumptions. In the case with a binary treatment, for unit i, let $\{Y_i(1), Y_i(0)\}$ be the potential outcomes, $T_i$ be the binary treatment indicator, and $x_i$ be pretreatment covariates. The average causal effect $\tau = \frac{1}{n}\sum_{i=1}^{n} \{Y_i(1) - Y_i(0)\}$ has an unbiased estimator $\hat{\tau} = \frac{1}{n_1}\sum_{i=1}^{n} T_i Y_i - \frac{1}{n_0}\sum_{i=1}^{n} (1 - T_i) Y_i$. Fisher [56] suggested using the analysis of covariance to improve efficiency; that is, running a least squares fit of $Y_i$ on $T_i$ and $x_i$ and using the coefficient of $T_i$ to estimate $\tau$. Ref. [57] uses the model from Ref. [47] to show that Fisher's analysis of covariance estimator is inferior, because it can be even less efficient than $\hat{\tau}$ and the ordinary least squares can give an inconsistent variance estimate. Ref. [58] proposes a simple correction: First, center the covariates to have mean $\bar{x} = 0$; second, run a least squares fit of $Y_i$ on $(T_i, x_i, T_i x_i)$ and use the coefficient of $T_i$ to estimate $\tau$; and third, use the Eicker–Huber–White variance estimator [59–61]. With large samples, the estimator from Ref. [58] is at least as efficient as $\hat{\tau}$, and its variance estimate is consistent for the true variance.

Ref. [62] extends this approach to the setting with high-dimensional covariates and replaces the least squares fit by the least absolute shrinkage and selection operator (LASSO) [63]. Ref. [64] examines the theoretical boundary of the estimator from Ref. [58], allowing for a diverging number of covariates. Ref. [65] investigates treatment effect heterogeneity using the least squares fit of $Y_i$ on $(T_i, x_i, T_i x_i)$. Ref. [66] discusses covariate adjustment in a factorial experiment, and Ref. [67] discusses covariate adjustment in general designs.
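A sketch of the three-step correction from Ref. [58] (assuming statsmodels is available; the simulated data are an illustrative assumption):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 1000
x = rng.normal(size=(n, 2))
T = rng.binomial(1, 0.5, size=n)
Y = 1.0 * T + x @ np.array([0.8, -0.5]) + rng.normal(size=n)  # tau = 1

xc = x - x.mean(axis=0)                        # step 1: center covariates
design = sm.add_constant(np.column_stack([T, xc, T[:, None] * xc]))
fit = sm.OLS(Y, design).fit(cov_type="HC2")    # step 3: EHW robust variance
print(fit.params[1], fit.bse[1])               # tau-hat and its robust SE
```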
7.3. The role of covariates in the design of experiments

An analyzer can use covariates to improve the estimation efficiency. As a dual, a designer can use covariates to improve the covariate balance and consequently improve the estimation efficiency. Ref. [68] hints at the idea of re-randomization—that is, only accepting random allocations that ensure covariate balance. In particular, we accept a random allocation $(T_1, \ldots, T_n)$ if and only if
8. Instrumental variables and negative controls for observational studies

...control approach. However, in contrast to the instrumental variable, negative controls require weak assumptions that are more likely to hold in practice. Refs. [107,108] provide elegant surveys on the existence of negative controls in observational studies. Refs. [105,109] point out that negative controls are widely available in time series studies, as long as no feedback effect is present, such as studies about air pollution and public health.

Refs. [107,109,110] examine the use of negative controls for confounding detection or bias reduction when solely a negative control exposure or outcome is available, but these methods are unable to achieve identification. Refs. [111,112] propose the use of multiple negative control outcomes to remove confounding in statistical genetics, but they must rest on a factor analysis model.
9. Causal inference with interference

The stable unit treatment value assumption plays an important role in the classical potential outcomes framework. It assumes that there is no interference between units [76]. However, interference is likely to be present in many experimental and observational studies, where units socially or physically interact with each other. For example, in the educational or social sciences, people enrolled in a tutoring or training program may have an effect on those not enrolled, due to the transmission of knowledge [113,114]. In epidemiology, prevention measures for infectious diseases may benefit unprotected people by reducing the probability of contagion [115,116]. In these studies, one unit's treatment can have a direct effect on its own outcome as well as a spillover effect on the outcomes of other units. The direct and spillover effects are of scientific or societal interest in real problems; they enable an understanding of the mechanism of a treatment effect, and provide guidance for policy making and implementation.
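To make the two estimands concrete, here is a toy simulation of ours, assuming randomized treatments and interference only within pairs, a special case of the partial interference assumption discussed below:

```python
import numpy as np

rng = np.random.default_rng(7)
n_pairs = 20000
own = rng.binomial(1, 0.5, size=(n_pairs, 2))   # each unit's treatment
partner = own[:, ::-1]                          # the partner's treatment

# outcome = 1.0 * own treatment (direct) + 0.3 * partner's (spillover) + noise
Y = 1.0 * own + 0.3 * partner + rng.normal(size=(n_pairs, 2))

# Direct effect: treated vs. control units whose partner is untreated.
direct = Y[(own == 1) & (partner == 0)].mean() - Y[(own == 0) & (partner == 0)].mean()
# Spillover effect: control units with treated vs. untreated partners.
spill = Y[(own == 0) & (partner == 1)].mean() - Y[(own == 0) & (partner == 0)].mean()
print(direct, spill)    # approx. 1.0 and 0.3
```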
In the presence of interference, the number of potential outcomes of a unit grows exponentially with the number of units.† As a result, it is intractable to estimate the direct and spillover effects without restrictions on the interference structure. In the literature on the estimation of treatment effects with interference, there has been a rapidly growing interest (see Ref. [117] for a recent review). A significant direction of work focuses on limited interference within non-overlapping clusters and assumes that there is no interference between clusters [52,114,118–122]. This is referred to as the partial interference assumption [114]. Recently, several researchers have considered relaxing the partial interference assumption to account for a more general structure of interference (e.g., Refs. [123–126]). Variance estimation is more complicated under interference. As pointed out in Ref. [118], it is difficult to calculate the variances for the direct and spillover effects even under partial interference. In model-free settings, a typical assumption for obtaining valid variance estimation is that the outcome of a unit depends on the treatments of other units only through a function of the treatments. Ref. [118] provides a variance estimator under the stratified interference assumption, and Ref. [124] generalizes it under a weaker assumption.

Another direction of work targets new designs to estimate treatment effects based on the interference structure. Under the partial interference assumption, Ref. [118] proposes the two-stage randomized experiment as a general experimental solution to the estimation of the direct and spillover effects. In more complex structures such as social networks, researchers have proposed several designs for the point and variance estimation of the treatment effects [127–129].

For inference under interference, Refs. [130,131] rely on models for the potential outcomes. Ref. [79] develops a conditional randomization test for the null hypothesis of no spillover effect. Ref. [80] extends this test to a larger class of hypotheses restricted to a subset of units, known as focal units. Building on this work, Ref. [132] provides a general procedure for obtaining powerful conditional tests.

Interference brings up new challenges. First, the asymptotic properties require advanced techniques to derive. Ref. [133] investigates the consistency of the difference-in-means estimator when the number of units that can be interfered with does not grow as quickly as the sample size. Ref. [134] develops the central limit theorem for direct and spillover effects under partial interference and stratified interference. Ref. [52] provides the central limit theorem for a peer effect under partial interference and stratified interference. However, under general interference, the asymptotic properties remain unsolved—even for the simplest difference-in-means estimator. Second, interference becomes even harder to deal with when data complications are present. Refs. [120,121,135,136] consider noncompliance in an interference setting. Ref. [137] examines the censoring of time-to-event data in the presence of interference. However, for other data complications such as missing data and measurement error, no methods are yet available. Third, most of the literature focuses on the direct effect and the spillover effect. However, interference may be present in other settings, such as mediation analysis (see Ref. [138] for a mediation analysis under interference) and longitudinal studies, where different quantities are of interest. As a result, it is necessary to generalize the commonly used methods in these settings to account for the interference between units.

† If the total number of units is N, then there are 2^N potential outcomes for each unit.

Compliance with ethics guidelines

Kun Kuang, Lian Li, Zhi Geng, Lei Xu, Kun Zhang, Beishui Liao, Huaxin Huang, Peng Ding, Wang Miao, and Zhichao Jiang declare that they have no conflict of interest or financial conflicts to disclose.

References

[1] Imbens GW, Rubin DB. Causal inference for statistics, social, and biomedical sciences. New York: Cambridge University Press; 2015.
[2] Bang H, Robins JM. Doubly robust estimation in missing data and causal inference models. Biometrics 2005;61(4):962–73.
[3] Kuang K, Cui P, Li B, Jiang M, Yang S, Wang F. Treatment effect estimation with data-driven variable decomposition. In: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence; 2017 Feb 4–9; San Francisco, CA, USA; 2017.
[4] Athey S, Imbens GW, Wager S. Approximate residual balancing: debiased inference of average treatment effects in high dimensions. J R Stat Soc Ser B (Stat Methodol) 2018;80(4):597–623.
[5] Kuang K, Cui P, Li B, Jiang M, Yang S. Estimating treatment effect in the wild via differentiated confounder balancing. In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2017 Aug 13–17; Halifax, NS, Canada; 2017. p. 265–74.
[6] Imai K, Van Dyk DA. Causal inference with general treatment regimes: generalizing the propensity score. J Am Stat Assoc 2004;99(467):854–66.
[7] Egami N, Imai K. Causal interaction in factorial experiments: application to conjoint analysis. J Am Stat Assoc 2019;114(526):529–40.
[8] Louizos C, Shalit U, Mooij JM, Sontag D, Zemel R, Welling M. Causal effect inference with deep latent-variable models. In: Proceedings of Advances in Neural Information Processing Systems 30; 2017 Dec 4–9; Long Beach, CA, USA; 2017. p. 6446–56.
[9] Crump RK, Hotz VJ, Imbens GW, Mitnik OA. Dealing with limited overlap in estimation of average treatment effects. Biometrika 2009;96(1):187–99.
[10] Li F, Thomas LE, Li F. Addressing extreme propensity scores via the overlap weights. Am J Epidemiol 2019;188(1):250–7.
[11] Kuang K, Cui P, Athey S, Xiong R, Li B. Stable prediction across unknown environments. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining; 2018 Aug 19–23; London, UK; 2018. p. 1617–26.
[12] Zhuang Y, Wu F, Chen C, Pan Y. Challenges and opportunities from big data to knowledge in AI 2.0. Front Inf Technol Elec Eng 2017;18(1):3–14.
[13] Pan Y. 2018 special issue on artificial intelligence 2.0: theories and applications. Front Inf Technol Elec Eng 2018;19(1):1–2.
[14] Hoerl C, McCormack T, Beck SR, editors. Understanding counterfactuals, understanding causation: issues in philosophy and psychology. New York: Oxford University Press; 2011.
[15] Pearl J, Glymour M, Jewell NP. Causal inference in statistics: a primer. Hoboken: John Wiley & Sons; 2016.
[16] Daniel RM, De Stavola BL, Vansteelandt S. Commentary: the formal approach to quantitative causal inference in epidemiology: misguided or misrepresented? Int J Epidemiol 2016;45(6):1817–29.
[17] Pearl J. Causal and counterfactual inference. Forthcoming section in the handbook of rationality. Cambridge: MIT Press; 2018.
[18] Goldfeld K. Considering sensitivity to unmeasured confounding: part 1 [Internet]. New York: Keith Goldfeld; 2019 Jan 2 [cited 2019 Jun 1]. Available from: https://www.rdatagen.net/post/what-does-it-mean-if-findings-are-sensitive-to-unmeasured-confounding/.
[19] Yule GU. Notes on the theory of association of attributes in statistics. Biometrika 1903;2(2):121–34.
[20] Simpson EH. The interpretation of interaction in contingency tables. J R Stat Soc B 1951;13(2):238–41.
[21] Chen H, Geng Z, Jia J. Criteria for surrogate end points. J R Stat Soc Series B Stat Methodol 2007;69(5):919–32.
[22] Geng Z, Liu Y, Liu C, Miao W. Evaluation of causal effects and local structure learning of causal networks. Annu Rev Stat Appl 2019;6(1):103–24.
[23] Pearl J. Is scientific knowledge useful for policy analysis? A peculiar theorem says: no. J Causal Infer 2014;2(1):109–12.
[24] Fleming TR, DeMets DL. Surrogate end points in clinical trials: are we being misled? Ann Intern Med 1996;125(7):605–13.
[25] Xu L, Pearl J. Structuring causal tree models with continuous variables. In: Proceedings of the Third Conference on Uncertainty in Artificial Intelligence. Arlington: AUAI Press; 1987. p. 170–9.
[26] Xu L. Deep bidirectional intelligence: alphazero, deep IA-search, deep IA-infer, and TPC causal learning. Appl Inf 2018;5(1):5.
[27] Xu L. Machine learning and causal analyses for modeling financial and economic data. Appl Inf 2018;5(1):11.
[28] Spirtes P, Glymour C, Scheines R. Causation, prediction, and search. 2nd ed. Cambridge: MIT Press; 2001.
[29] Pearl J. Causality: models, reasoning, and inference. Cambridge: Cambridge University Press; 2000.
[30] Spirtes P, Zhang K. Causal discovery and inference: concepts and recent methodological advances. Appl Inform 2016;3(1):3.
[31] Shimizu S, Hoyer PO, Hyvärinen A, Kerminen A. A linear non-Gaussian acyclic model for causal discovery. J Mach Learn Res 2006;7:2003–30.
[32] Zhang K, Hyvärinen A. On the identifiability of the post-nonlinear causal model. In: Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence; 2009 Jun 18–21; Montreal, QC, Canada. Arlington: AUAI Press; 2009. p. 647–55.
[33] Hoyer PO, Janzing D, Mooij JM, Peters J, Scholkopf B. Nonlinear causal discovery with additive noise models. In: Proceedings of International Conference on Neural Information Processing Systems; 2008 Dec 8–13; Vancouver, BC, Canada; 2008. p. 689–96.
[34] Zhang K, Hyvärinen A. Causality discovery with additive disturbances: an information-theoretical perspective. In: Buntine W, Grobelnik M, Mladenić D, Shawe-Taylor J, editors. Machine learning and knowledge discovery in databases. Berlin: Springer; 2009. p. 570–85.
[35] Zhang K, Schölkopf B, Muandet K, Wang Z. Domain adaptation under target and conditional shift. In: Proceedings of the 30th International Conference on Machine Learning; 2013 Jun 16–21; Atlanta, GA, USA; 2013. p. 819–27.
[36] Baroni P, Gabbay DM, Giacomin M, Van der Torre L. Handbook of formal argumentation. London: College Publications; 2018.
[37] Osborne J. Arguing to learn in science: the role of collaborative, critical discourse. Science 2010;328(5977):463–6.
[38] Shoham Y. Nonmonotonic reasoning and causation. Cogn Sci 1990;14(2):213–52.
[39] Liao B, Jin L, Koons RC. Dynamics of argumentation systems: a division-based method. Artif Intell 2011;175(11):1790–814.
[40] Sklar EI, Azhar MQ. Explanation through argumentation. In: Proceedings of the 6th International Conference on Human–Agent Interaction; 2018 Dec 15–18; Southampton, UK; 2018. p. 277–85.
[41] Fazzinga B, Flesca S, Furfaro F. Complexity of fundamental problems in probabilistic abstract argumentation: beyond independence. Artif Intell 2019;268:1–29.
[42] Pearl J. On a class of bias-amplifying variables that endanger effect estimates. In: Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence; 2010 Jul 8–11; Catalina Island, CA, USA; 2010. p. 425–32.
[43] Kempthorne O. The design and analysis of experiments. New York: Wiley; 1952.
[44] Scheffe H. The analysis of variance. New York: John Wiley & Sons; 1959.
[49] Dasgupta T, Pillai NS, Rubin DB. Causal inference from 2^K factorial designs by using potential outcomes. J R Stat Soc Series B Stat Methodol 2015;77(4):727–53.
[50] Wu J, Ding P. Randomization tests for weak null hypotheses. 2018. arXiv:1809.07419.
[51] Miratrix LW, Sekhon JS, Yu B. Adjusting treatment effect estimates by post-stratification in randomized experiments. J R Stat Soc Series B Stat Methodol 2013;75(2):369–96.
[52] Li X, Ding P, Lin Q, Yang D, Liu JS. Randomization inference for peer effects. J Am Stat Assoc 2019:1–31.
[53] Li X, Ding P. General forms of finite population central limit theorems with applications to causal inference. J Am Stat Assoc 2017;112(520):1759–69.
[54] Zhao A, Ding P, Mukerjee R, Dasgupta T. Randomization-based causal inference from split-plot designs. Ann Stat 2018;46(5):1876–903.
[55] Mukerjee R, Dasgupta T, Rubin DB. Using standard tools from finite population sampling to improve causal inference for complex experiments. J Am Stat Assoc 2018;113(522):868–81.
[56] Fisher R. Statistical methods for research workers. Edinburgh: Oliver and Boyd; 1925.
[57] Freedman DA. On regression adjustments to experimental data. Adv Appl Math 2008;40(2):180–93.
[58] Lin W. Agnostic notes on regression adjustments to experimental data: reexamining Freedman's critique. Ann Appl Stat 2013;7(1):295–318.
[59] Eicker F. Limit theorems for regressions with unequal and dependent errors. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability; 1967 Jun 21–Jul 18; Berkeley, CA, USA. Berkeley: University of California Press; 1967. p. 59–82.
[60] Huber PJ. The behavior of maximum likelihood estimates under nonstandard conditions. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability; 1967 Jun 21–Jul 18; Berkeley, CA, USA. Berkeley: University of California Press; 1967. p. 221–33.
[61] White H. A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica 1980;48(4):817–38.
[62] Bloniarz A, Liu H, Zhang CH, Sekhon JS, Yu B. Lasso adjustments of treatment effect estimates in randomized experiments. Proc Natl Acad Sci USA 2016;113(27):7383–90.
[63] Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc Series B Stat Methodol 1996;58(1):267–88.
[64] Lei L, Ding P. Regression adjustment in completely randomized experiments with a diverging number of covariates. 2018. arXiv:1806.07585.
[65] Ding P, Feller A, Miratrix L. Decomposing treatment effect variation. J Am Stat Assoc 2019;114(525):304–17.
[66] Lu J. Covariate adjustment in randomization-based causal inference for 2^K factorial designs. Stat Probab Lett 2016;119:11–20.
[67] Middleton JA. A unified theory of regression adjustment for design-based inference. 2018. arXiv:1803.06011.
[68] Cox DR. Randomization and concomitant variables in the design of experiments. In: Anderson TW, Styan GHP, Kallianpur GG, Krishnaiah PR, Ghosh JK, editors. Statistics and probability: essays in honor of CR Rao. Amsterdam: North-Holland; 1982. p. 197–202.
[69] Morgan KL, Rubin DB. Rerandomization to improve covariate balance in experiments. Ann Stat 2012;40(2):1263–82.
[70] Li X, Ding P, Rubin DB. Asymptotic theory of rerandomization in treatment-control experiments. Proc Natl Acad Sci USA 2018;115(37):9157–62.
[71] Morgan KL, Rubin DB. Rerandomization to balance tiers of covariates. J Am Stat Assoc 2015;110(512):1412–21.
[72] Branson Z, Dasgupta T, Rubin DB. Improving covariate balance in 2^K factorial designs via rerandomization with an application to a New York City department of education high school study. Ann Appl Stat 2016;10(4):1958–76.
[73] Li X, Ding P, Rubin DB. Rerandomization in 2^K factorial experiments. 2018. arXiv:1812.10911.
[74] Zhou Q, Ernst PA, Morgan KL, Rubin DB, Zhang A. Sequential rerandomization. Biometrika 2018;105(3):745–52.
[75] Fisher RA. The design of experiments. Edinburgh: Oliver and Boyd; 1935.
[76] Rubin DB. Comment on "randomization analysis of experimental data: the Fisher randomization test". J Am Stat Assoc 1980;75(371):591–3.
[77] Tukey JW. Tightening the clinical trial. Control Clin Trials 1993;14(4):266–85.
[78] Rosenbaum PR. Covariance adjustment in randomized experiments and observational studies. Stat Sci 2002;17(3):286–327.
[79] Aronow PM. A general method for detecting interference between units in randomized experiments. Sociol Methods Res 2012;41(1):3–16.
[80] Athey S, Eckles D, Imbens GW. Exact p-values for network interference. J Am Stat Assoc 2018;113(521):230–40.
[81] Basse G, Feller A, Toulis P. Exact tests for two-stage randomized designs in the presence of interference. 2017. arXiv:1709.08036.
[45] Hinkelmann K, Kempthorne O. Design and analysis of experiments: volume [82] Ding P. A paradox from randomization-based causal inference. Stat Sci
1: introduction to experimental design. 2nd ed. New York: John Wiley & 2017;32(3):331–45.
Sons; 2007. [83] Rosenbaum PR. Exact confidence intervals for nonconstant effects by
[46] Imbens GW, Rubin DB. Causal inference for statistics, social, and biomedical inverting the signed rank test. Am Stat 2003;57(2):132–8.
sciences: an introduction. New York: Cambridge University Press; 2015. [84] Rigdon J, Hudgens MG. Randomization inference for treatment effects on a
[47] Splawa-Neyman J. On the application of probability theory to agricultural binary outcome. Stat Med 2015;34(6):924–35.
experiments: essay on principles. Section 9. Stat Sci 1990;5(4):465–72. [85] Li X, Ding P. Exact confidence intervals for the average causal effect on a
[48] Ding P, Dasgupta T. A randomization-based perspective on analysis of binary outcome. Stat Med 2016;35(6):957–60.
variance: a test statistic robust to treatment effect heterogeneity. Biometrika [86] Ding P, Li F. Causal inference: a missing data perspective. Stat Sci 2018;33
2018;105(1):45–56. (2):214–37.