Under review as a conference paper at ICLR 2024

CAUSAL INFERENCE USING LLM-GUIDED DISCOVERY

Anonymous authors
Paper under double-blind review

ABSTRACT

At the core of causal inference lies the challenge of determining reliable causal graphs solely based on observational data. Since the well-known back-door criterion depends on the graph structure, any errors in the estimated graph structure affect the correctness of the estimated causal effects. In this work, to construct a valid back-door adjustment set, we propose to use a topological or causal order among graph nodes, which is easier to get from domain experts. Given a node pair, causal order is easier to elicit from domain experts compared to graph edges since determining the existence of an edge depends extensively on other variables. Interestingly, we observe that the same principle holds for Large Language Models (LLMs) such as GPT-3.5-turbo and GPT-4, motivating an automated method to obtain causal order (and hence causal effect) with LLMs acting as virtual domain experts. To this end, we employ different prompting strategies and contextual cues to propose a robust technique for obtaining causal order from LLMs. Acknowledging LLMs’ limitations, we also study possible techniques to integrate LLMs with established causal discovery algorithms, including constraint-based and score-based methods, to enhance their performance. Extensive experiments demonstrate that our approach significantly improves causal ordering accuracy as compared to discovery algorithms, highlighting the potential of LLMs to enhance causal inference across diverse fields.

1 INTRODUCTION

Causal inference plays a pivotal role across scientific disciplines, aiding researchers in uncovering fundamental causal relationships and how they affect observed phenomena. For example, causal inference is used to discern the causes of diseases and design effective interventions for diagnosis and treatment in epidemiology (Mahmood et al., 2014), to evaluate policy impact based on observational studies in economics (Imbens & Rubin, 2015), and to understand the effects of pollution on ecosystems in environmental science (Boslaugh, 2023). A key technical question for these studies is estimating the causal effect of variables on a specific outcome variable.

Inferring causal effect from observational data, however, is a challenging task because the effect estimate depends critically on the causal graph considered in the analysis. While there has been progress in graph discovery algorithms, especially for specific parametric settings (Shimizu et al., 2006; Hoyer et al., 2008b; Hyvärinen et al., 2010; Rolland et al., 2022), studies on real-world datasets such as from atmospheric science (Huang et al., 2021) and healthcare (Tu et al., 2019) show that inferring the causal graph from data remains a challenging problem in practice (Reisach et al., 2021). Hence, causal inference studies often rely on human experts to provide the causal graph.

In this paper, based on the fact that the topological/causal order over the graph variables is enough for effect inference, we leverage Large Language Models (LLMs) as virtual domain experts to propose an automated method to query causal order (and hence causal effect). Moreover, providing the order between variables is the right question to ask experts because order depends only on the variables under question, unlike existence of a graph edge that depends on which other variables are present (to account for direct and indirect effects). For example, consider the data-generating process, lung cancer → doctor visit → positive Xray. If an expert is asked whether there should be a causal edge from lung cancer to positive Xray, they would answer “Yes” (indeed, such an edge exists in the BNLearn Cancer dataset (Scutari & Denis, 2014)). However, if they are told that the set of observed variables additionally includes doctor visit, then the correct answer would be to not create a direct edge between lung cancer and positive Xray, but rather create edges mediated through doctor visit. However, note that the causal order, lung cancer ≺ positive Xray, remains the same in both settings (a ≺ b indicates that a occurs before b in a causal process).

Figure 1: The LLM-augmented causal inference process based on inferring causal order. We propose a triplet-based prompting technique to infer all three-variable subgraphs and aggregate them using majority voting to produce a causal order. The causal order can then be used to identify a valid backdoor adjustment set. Ties in causal order are broken using another LLM (e.g., GPT-4). For robustness, LLM-generated causal order may be used in conjunction with discovery algorithms like PC or CaMML.

We show that large language models (LLMs) like GPT-3.5 (Hagendorff et al., 2022) and GPT-4 (Peng et al., 2023) can be used to approximate experts’ capability to provide causal order, thereby automating the process of causal inference. Extending results using LLMs for pairwise causal discovery (Kıcıman et al., 2023), we find that LLMs can provide accurate causal order for a diverse set of benchmark graphs. To do so, we propose a novel triplet-based prompting strategy that asks the LLM to consider three variables at once, compared to the pairwise prompts employed in past work (Kıcıman et al., 2023; Willig et al., 2022; Long et al., 2023; 2022). Causal order outputted using the triplet-based prompt outperforms pairwise prompts across all benchmark datasets we considered. Importantly, it avoids cycles in the predicted order whereas LLM outputs from pairwise prompts often yield cycles.

Still, LLMs can exhibit unknown failure modes. Therefore, a more principled way is to adapt existing graph discovery algorithms to utilize LLM output. To this end, we present two algorithms based on constraint-based and score-based discovery algorithms respectively. The first uses causal order from an LLM to orient the undirected edges outputted by a constraint-based algorithm such as PC (Spirtes et al., 2000). The second algorithm utilizes the LLM causal order as a prior to a score-based algorithm like CaMML (Wallace et al., 1996). Results show that LLM-augmented algorithms outperform the base causal discovery algorithms in determining the causal order. The overall methodology is depicted in Figure 1. Our contributions can be summarized as follows.
• We argue that in causal effect estimation, querying a domain expert for a causal order is more principled than asking for exact causal structure among variables.
• We provide a novel prompting strategy based on triplets and show that LLMs like GPT-3.5 can be used to obtain causal order for a diverse range of datasets.
• We propose two algorithms combining causal discovery algorithms with LLM output and show that the final causal order is substantially more accurate than the discovery algorithms alone.

2 RELATED WORK

Combining graph discovery and causal inference. Historically, causal discovery and causal effect inference have been studied separately. Graph discovery algorithms can broadly be divided into (i) algorithms using conditional independence tests (constraint-based) (Glymour et al., 2019); (ii) algorithms using a score function to evaluate predicted graph (score-based) (Glymour et al., 2019); (iii) algorithms that determine a causal order and then infer edges (order-based) (Rolland et al., 2022; Teyssier & Koller, 2005); and (iv) deep learning-based methods that formulate an optimization problem based on acyclicity and sparsity constraints (Zheng et al., 2018; Lachapelle et al., 2020). Causal discovery methods are evaluated on error with respect to the true graph, e.g., using the structural hamming distance (SHD) (Acid & de Campos, 2003; Tsamardinos et al., 2006). In contrast, causal inference methods focus on the estimation of causal effect given a causal graph (Pearl, 2009); the graphs are assumed to be known. A natural way to combine these approaches is to use the graph outputted by discovery algorithms in inference methods, as in (Hoyer et al., 2008a; Mooij et al., 2016; Maathuis et al., 2010; Gupta et al., 2022). In this paper, we show that there exists a simpler way to combine the two approaches: only a causal order is needed instead of the full graph.

Knowledge-driven Causal Discovery: Prior knowledge has been used in causal discovery literature (Hasan & Gani, 2022; Constantinou et al., 2023; Heckerman & Geiger, 2013; Teshima & Sugiyama, 2021; O’Donnell et al., 2006; Wallace et al., 1996). These methods rely on prior knowledge such as domain expert opinions and documented knowledge from randomized controlled trials (RCT). Various priors have been studied in the literature, including priors of the form edge existence, forbidden edge, and ancestral constraints (Constantinou et al., 2023; Ban et al., 2023). Prior knowledge significantly reduces the search space over all possible causal graphs.

Recent advancements in LLMs have led to more attention towards knowledge-driven causal discovery (Kıcıman et al., 2023; Ban et al., 2023; Long et al., 2023; Willig et al., 2022). Unlike causal discovery algorithms that use statistical patterns in the data, LLM-based algorithms use metadata such as variable names. Most of these methods use only LLMs to predict the causal relationships among a set of variables (Kıcıman et al., 2023; Willig et al., 2022; Long et al., 2022). Recent work also shows how LLMs can be used as priors or imperfect experts that can be combined with different types of discovery algorithms: Long et al. (2023) use LLMs to improve the output of a constraint-based algorithm for full graph discovery by orienting undirected edges in the CPDAG, and Ban et al. (2023) use LLMs as priors for scoring-based methods. However, the focus of these works has been on minimizing graph error metrics such as SHD. Instead, we focus on the downstream causal inference task and choose causal order as the metric since it directly correlates with accuracy in effect estimation whereas SHD does not.

LLM Prompting Strategies for Causal Discovery: Existing LLM-based algorithms for graph discovery (Kıcıman et al., 2023; Long et al., 2022; Ban et al., 2023) use a pairwise prompt, essentially asking “does A cause B” with varying levels of prompt complexity. Extending this line of work, we propose a triplet-based prompt that provides more accurate answers and avoids cycles when querying relationships between variables. As a result, our triplet-based prompt may be of independent interest to improve LLM-based graph discovery. We also explore the chain-of-thought prompting strategy (Wei et al., 2022) in our experiments.

3 BACKGROUND AND PROBLEM FORMULATION

Let $G(X, E)$ be a causal directed acyclic graph (DAG) consisting of a set of variables $X = \{X_1, \dots, X_n\}$ and a set of directed edges $E$ among the variables in $X$. A directed edge $X_i \to X_j \in E$ denotes the direct causal influence of the variable $X_i$ on the variable $X_j$. Let $pa(X_i) = \{X_k \mid X_k \to X_i\}$ and $de(X_i) = \{X_k \mid X_k \leftarrow \cdots \leftarrow X_i\}$ denote the set of parents and descendants of $X_i$ respectively. We focus on a downstream application of causal graph discovery called causal effect inference, defined as follows.

Definition 3.1. (Average Causal Effect (Pearl, 2009)) The average causal effect (ACE) of a variable $X_i$ on a variable $X_j$ is defined as

\[ ACE_{X_i}^{X_j} = E[X_j \mid do(X_i = x_i)] - E[X_j \mid do(X_i = x_i^*)] \tag{1} \]

In Defn 3.1, $X_i$ is called the treatment variable and $X_j$ is called the target variable. $do(X_i = x_i)$ denotes an external intervention to the variable $X_i$ with the value $x_i$. The interventional quantity $E[X_j \mid do(X_i = x_i)]$ is different from the conditional $E[X_j \mid X_i = x_i]$ since it involves setting the value of $X_i$ rather than conditioning on it. To estimate the quantity $E[X_j \mid do(X_i = x_i)]$ from observational data, the backdoor adjustment formula is used.

Definition 3.2. (Back-door Adjustment (Pearl, 2009)) Given a DAG $G$, a set of variables $Z$ satisfies the back-door criterion relative to a pair of treatment and target variables $(X_i, X_j)$ if
(i) no variable in $Z$ is a descendant of $X_i$; and
(ii) $Z$ blocks every path between $X_i$ and $X_j$ that contains an arrow into $X_i$,
where a path in a causal DAG is a sequence of unique vertices $X_i, X_{i+1}, \dots, X_j$ with a directed edge between each pair of consecutive vertices $X_k$ and $X_{k+1}$ (either $X_k \to X_{k+1}$ or $X_{k+1} \to X_k$).

If a set of variables $Z$ satisfies the back-door criterion relative to $(X_i, X_j)$, $E[X_j \mid do(X_i = x_i)]$ can be computed using the formula $E[X_j \mid do(X_i = x_i)] = E_{z \sim Z}\, E[X_j \mid X_i = x_i, Z = z]$ (Thm. 3.3.2 of (Pearl, 2009)). To ensure that all variables in $Z$ are observed, we assume that there are no unobserved variables in the underlying causal graph.
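
As a rough illustration of the adjustment formula, the sketch below estimates $E[X_j \mid do(X_i = x_i)]$ from discrete observational data by averaging stratum-wise conditional means over the empirical distribution of $Z$. The function names, the row format (a list of dicts), and the toy data are assumptions made for this example and do not come from the paper.

```python
from collections import defaultdict

def backdoor_expectation(rows, treatment, target, adjust, x):
    """Estimate E[target | do(treatment = x)] via back-door adjustment.

    rows   : list of dicts mapping variable name -> discrete value
    adjust : names of the variables in the adjustment set Z
    """
    stratum_size = defaultdict(int)   # number of rows with Z = z
    treated_y = defaultdict(list)     # target values with Z = z and treatment = x
    for row in rows:
        z = tuple(row[v] for v in adjust)
        stratum_size[z] += 1
        if row[treatment] == x:
            treated_y[z].append(row[target])

    total = 0.0
    for z, size in stratum_size.items():
        ys = treated_y.get(z)
        if not ys:
            # Positivity fails: the conditional mean is undefined in this stratum.
            raise ValueError(f"no rows with {treatment}={x!r} in stratum {z}")
        total += (size / len(rows)) * (sum(ys) / len(ys))
    return total

def average_causal_effect(rows, treatment, target, adjust, x, x_star):
    """ACE as in Eq. (1): difference of two interventional expectations."""
    return (backdoor_expectation(rows, treatment, target, adjust, x)
            - backdoor_expectation(rows, treatment, target, adjust, x_star))

# Hypothetical toy data in which Z confounds X and Y.
toy_rows = (
    [{"Z": 0, "X": 0, "Y": 0}] * 2 + [{"Z": 0, "X": 1, "Y": 1}] * 2
    + [{"Z": 1, "X": 0, "Y": 1}] * 1 + [{"Z": 1, "X": 1, "Y": 2}] * 3
)
ace = average_causal_effect(toy_rows, "X", "Y", ["Z"], x=1, x_star=0)
```

The estimator is only defined when every stratum of $Z$ contains at least one row with the requested treatment value, which is why the sketch raises an error instead of silently skipping empty strata.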

4 CAUSAL ORDER IS SUFFICIENT FOR EFFECT ESTIMATION

Although backdoor adjustment is defined with respect to a DAG $G$, we now show that the causal order is sufficient to find a valid backdoor set. We also discuss why providing the causal order is a better task for experts than providing the graph.

4.1 CAUSAL (TOPOLOGICAL) ORDER YIELDS A VALID BACKDOOR SET

Definition 4.1. (Topological Order.) Given a causal graph $G(X, E)$, a sequence $\pi$ of variables $X$ is said to be a topological order iff for each edge $X_i \to X_j \in E$, $\pi_i < \pi_j$.

Proposition 4.2. (Pearl, 2009; Cinelli et al., 2022) Under the no confounding assumption, given a pair of treatment and target variables $(X_i, X_j)$ in $G$, $Z = \{X_k \mid \pi_k < \pi_i\}$ is a valid adjustment set relative to $(X_i, X_j)$ for any topological ordering $\pi$ of $G$.

Proofs of all propositions are in Appendix § A. Propn 4.2 states that all the variables that precede the treatment variable in a topological order $\pi$ of $G$ constitute a valid adjustment set. Note that the set $Z$ may contain variables that are not necessary to adjust for, e.g., ancestors of only treatment or only target variables. For statistical estimation, ancestors of the target variable are beneficial for precision whereas ancestors of the treatment can be harmful (Cinelli et al., 2022). On balance though, causal effect practitioners tend to include all confounders that do not violate the backdoor criterion; we are following the same principle.
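
The construction in Propn 4.2 is short enough to write down directly. The following is a minimal sketch, assuming the causal order is given as a Python list; the function name and the variable ordering used for the Cancer example are illustrative choices, not taken from the paper.

```python
def adjustment_set_from_order(order, treatment):
    """Return Z = {X_k : pi_k < pi_i}: every variable placed before the treatment."""
    position = {v: k for k, v in enumerate(order)}
    return {v for v in order if position[v] < position[treatment]}

# One plausible causal order for the Cancer variables (assumed for illustration).
cancer_order = ["pollution", "smoking", "cancer", "xray", "dyspnoea"]
z_cancer = adjustment_set_from_order(cancer_order, "cancer")
z_root = adjustment_set_from_order(cancer_order, "pollution")
```

With this order, the adjustment set for the treatment `cancer` consists of `pollution` and `smoking`, while a variable placed first in the order gets an empty adjustment set.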

In practice, however, we may not know the true order. To evaluate the goodness of a given causal order, we use the topological divergence metric from (Rolland et al., 2022) (for an example, see Figure 3). The topological divergence of an estimated topological order $\hat{\pi}$ with ground truth adjacency matrix $A$, denoted by $D_{top}(\hat{\pi}, A)$, is defined as

\[ D_{top}(\hat{\pi}, A) = \sum_{i=1}^{n} \sum_{j : \hat{\pi}_i > \hat{\pi}_j} A_{ij}, \]

where $A_{ij} = 1$ if there is a directed arrow from node $i$ to $j$, else $A_{ij} = 0$. $D_{top}(\hat{\pi}, A)$ counts the number of edges that cannot be recovered due to the estimated topological order $\hat{\pi}$.
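
A small sketch of this metric follows, assuming the true graph is supplied as a list of (cause, effect) edges instead of an adjacency matrix. The function name and the Cancer edge list are assumptions for illustration; the edge list is written from the commonly published structure of that network and should be checked against the actual benchmark file.

```python
def topological_divergence(order, true_edges):
    """Count true edges (i -> j) whose cause i is placed after its effect j."""
    position = {v: k for k, v in enumerate(order)}
    return sum(1 for i, j in true_edges if position[i] > position[j])

# Assumed edge list for the Cancer network (illustrative).
cancer_edges = [("pollution", "cancer"), ("smoking", "cancer"),
                ("cancer", "xray"), ("cancer", "dyspnoea")]

good_order = ["pollution", "smoking", "cancer", "xray", "dyspnoea"]
swapped_order = ["pollution", "smoking", "xray", "cancer", "dyspnoea"]

d_good = topological_divergence(good_order, cancer_edges)
d_swapped = topological_divergence(swapped_order, cancer_edges)
```

Placing `xray` before `cancer` violates exactly one true edge, so the second order has a divergence of one while the first has zero.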

4.2 TOPOLOGICAL DIVERGENCE IS THE CORRECT METRIC FOR EFFECT ESTIMATION

Below we show that $D_{top}$ is a valid metric to optimize for effect estimation: $D_{top} = 0$ for a topological order is equivalent to obtaining the correct backdoor adjustment set using Proposition 4.2.

Proposition 4.3. For an estimated topological order $\hat{\pi}$ and a true topological order $\pi$ of a causal DAG $G$ with the corresponding adjacency matrix $A$, $D_{top}(\hat{\pi}, A) = 0$ iff $Z = \{X_k \mid \hat{\pi}_k < \hat{\pi}_i\}$ is a valid adjustment set relative to $(X_i, X_j)$, $\forall \pi_i < \pi_j$.

We now compare $D_{top}$ to structural hamming distance (SHD), a common metric used to evaluate graph discovery algorithms. Given a true causal DAG $G$ and an estimated causal DAG $\hat{G}$, SHD counts the number of missing, falsely detected, and falsely directed edges in $\hat{G}$. Formally, $D_{top}$ acts as a lower-bound on the structural hamming distance (SHD) (Rolland et al., 2022). However, as we show below, SHD is not a good metric for evaluating downstream effect estimation accuracy. Specifically, we show that SHD can be very high even when $D_{top} = 0$ and a valid backdoor set can be inferred. This result is of significance since most estimated graphs (including those that are LLM-generated (Ban et al., 2023; Long et al., 2023)) are evaluated on SHD.

Definition 4.4. (Level Order.) Given a causal graph $G(X, E)$, the level order refers to a systematic assignment of levels to variables. This assignment begins with the set of variables $\{X_i \mid pa(X_i) = \emptyset\}$ at level 0. Subsequently, each of the remaining variables is assigned a level $i$ such that all nodes within a given level $i$ have a directed path of length $i$ from one/more nodes in level 0.

Proposition 4.5. In a causal DAG $G$ with $N$ levels in the level-ordering of variables where the level $i$ contains $n_i$ variables, $\exists\, \hat{G}$ s.t. $SHD(\hat{G}, G) \geq \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} (n_i \times n_j) - |E|$ and $D_{top}(\hat{\pi}, A) = 0$ $\forall \hat{\pi}$ of $\hat{G}$.
i=1 j=i+1

Figure 2: Variability of SHD for various graph sizes with consistent $D_{top} = 0$ within each graph.

Figure 2 shows the unsuitability of SHD for our work empirically. Given a fixed number of nodes, we sample a graph at random as the “ground-truth” and then consider all graph orientations of the same size (number of nodes) such that $D_{top} = 0$ with respect to the ground-truth graph. For this set of graphs, we compute the SHD with respect to the ground-truth graph. Notice that SHD exhibits high variance. For graphs with six nodes, SHD can vary from 0 to 14 even as $D_{top} = 0$ and backdoor set validity stays the same. Figure 3 shows this phenomenon on a real-world BNLearn dataset, Cancer. The candidate graph (right panel) has $D_{top} = 0$ with respect to the true graph (left) and yields valid backdoor identification sets. However, its SHD is high (6), showing the disconnect between SHD and causal effect identification.
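
The gap between the two metrics can be reproduced on a small example. The sketch below is not the paper's experiment: it compares an assumed Cancer edge list with a hypothetical dense estimate that connects every pair of nodes along one causal order. The SHD convention used here (one error per node pair whose edge status differs, so a reversed edge counts once) is also an assumption, since conventions vary. It relies on `graphlib`, available in Python 3.9 and later.

```python
from graphlib import TopologicalSorter
from itertools import combinations

def shd(nodes, true_edges, est_edges):
    """Count node pairs whose edge status (absent, a->b, b->a) differs."""
    true_edges, est_edges = set(true_edges), set(est_edges)
    return sum(
        ((a, b) in true_edges, (b, a) in true_edges)
        != ((a, b) in est_edges, (b, a) in est_edges)
        for a, b in combinations(nodes, 2)
    )

def d_top(order, true_edges):
    """Count true edges whose cause is placed after its effect in `order`."""
    position = {v: k for k, v in enumerate(order)}
    return sum(1 for a, b in true_edges if position[a] > position[b])

def some_topological_order(nodes, edges):
    """Return one topological order of the graph (raises CycleError if cyclic)."""
    predecessors = {v: set() for v in nodes}
    for cause, effect in edges:
        predecessors[effect].add(cause)
    return list(TopologicalSorter(predecessors).static_order())

nodes = ["pollution", "smoking", "cancer", "xray", "dyspnoea"]
# Assumed true edges of the Cancer network (illustrative).
true_edges = [("pollution", "cancer"), ("smoking", "cancer"),
              ("cancer", "xray"), ("cancer", "dyspnoea")]
# Hypothetical estimate: every pair connected, oriented along the list order.
dense_edges = list(combinations(nodes, 2))

dense_order = some_topological_order(nodes, dense_edges)
dense_shd = shd(nodes, true_edges, dense_edges)
dense_d_top = d_top(dense_order, true_edges)
```

The dense estimate adds six edges that are absent from the assumed true graph, so its SHD is 6, yet its only topological order violates no true edge and its topological divergence is 0.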
4.3 CAUSAL ORDER IS MORE SUITABLE TO ELICIT FROM EXPERTS THAN GRAPH EDGES

In addition to the favorable theoretical properties, causal order is easier to elicit from experts and can be objectively evaluated. This is because given two variables, their relative causal order does not depend on other variables whereas existence of an edge between them depends on which other variables are considered. To see this, let us continue the example from Figure 3 (left) where pollution causes dyspnoea (breathing difficulty) through the intermediary node cancer. Whether an edge exists between pollution and dyspnoea depends on whether the Cancer variable is part of the study. In case an expert is only provided pollution and dyspnoea, they may add an edge between the two, but if the Cancer node is also provided, they may not. A priori, it is difficult to know which nodes may be relevant for a pair; hence experts’ answers may not be consistent for questions about edges, but will always be consistent on causal order.

Figure 3: Left: Causal graph of Cancer dataset. Right: GPT-3.5’s estimated causal graph of Cancer dataset. GPT-3.5 gets causal order correct at the cost of higher SHD score, which is not a relevant metric for causal inference. Here $D_{top} = 0$ and SHD = 6, showing the importance of $D_{top}$ in our study as compared to SHD.

5 LARGE LANGUAGE MODELS AS VIRTUAL EXPERTS FOR CAUSAL ORDER

While causal order is a simpler construct than the graph, it still needs to be provided by a domain expert. We now study whether LLMs can be used to obtain causal order, thereby automating the process. We propose two kinds of prompting strategies: (1) pairwise and (2) triplet-based. These methods employ variable names or extra metadata without utilizing the associated data.

5.1 PROMPT TECHNIQUES BASED ON A PAIR OF VARIABLES

A natural way to elicit causal order from LLMs is to ask about each pair of variables and aggregate the result. A similar pairwise strategy has been adopted by past work on inferring graph edges (Kıcıman et al., 2023; Ban et al., 2023; Long et al., 2022). Here we augment such strategies with additional contextual information. Our hypothesis is that adding context relevant to the pair of variables may help increase the accuracy of the LLM answers.

We propose four types of pairwise prompts (see Appendix § D for full prompts).
• Basic prompt. This is the simplest technique. We directly ask the LLM to find the causal direction between a given pair of variables (Kıcıman et al., 2023).
• Iterative Context. Here we provide the previously oriented pairs as context in the prompt. Since the LLM has access to its previous decisions, we expect that it may avoid creating cycles through its predictions.
• Markov Blanket Context. Providing previously oriented pairs may become prohibitive for large graphs. Using the fact that a variable is independent of all other nodes given its Markov Blanket (Pearl, 2009), here we provide the Markov Blanket of the given node pairs as additional context in the prompt.
• Chain-of-Thought (+In-context learning). Based on encouraging results of providing in-context examples in the prompt for various tasks (Brown et al., 2020), here we include 3 examples of the ordering task that we expect the LLM to perform. Effectively, we provide example node pairs and their correct causal ordering before asking the question about the given nodes. Each example answer also contains an explanation of the answer, generated using Bing GPT-4. Adding the explanation encourages the LLM to employ chain-of-thought reasoning (Wei et al., 2022) when deciding the causal order. To avoid overfitting, we select node pairs from graphs that are not evaluated in our study. Node pairs with and without direct edges were equally chosen. Examples of the LLM’s answers (and their explanations) using the CoT prompt can be found in Table A13 and Table A14 in Appendix.
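
Whichever pairwise prompt is used, the answers have to be aggregated into a single order, and that step fails when the answers contain a cycle. The sketch below shows one way to aggregate and detect this, assuming the LLM answers have already been parsed into (cause, effect) pairs; the function name and input format are hypothetical, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import CycleError, TopologicalSorter

def order_from_pairwise(nodes, cause_effect_pairs):
    """Aggregate pairwise answers into a causal order.

    Returns (order, None) if the answers are acyclic, otherwise (None, cycle),
    where cycle is the list of nodes reported by graphlib.
    """
    predecessors = {v: set() for v in nodes}
    for cause, effect in cause_effect_pairs:
        predecessors[effect].add(cause)
    try:
        return list(TopologicalSorter(predecessors).static_order()), None
    except CycleError as err:
        # graphlib stores the detected cycle as the second exception argument.
        return None, err.args[1]

acyclic_order, no_cycle = order_from_pairwise(
    ["A", "B", "C"], [("A", "B"), ("B", "C")])
failed_order, found_cycle = order_from_pairwise(
    ["A", "B", "C"], [("A", "B"), ("B", "C"), ("C", "A")])
```

When a cycle is returned, no causal order exists for the aggregated answers and the topological divergence cannot be evaluated for them.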

5.2 PROMPT TECHNIQUE BASED ON TRIPLETS

As we shall see, while pairwise prompts are conceptually simple, they are prone to yielding cycles in the graph since they decide about each edge separately. Taking inspiration from the PC algorithm that employs constraints over three variables, we now describe a prompting technique based on iterating over all possible triplets given a set of nodes. Once the LLM has provided subgraphs for each triplet, we determine causal order between a pair by aggregating over all triplet LLM answers where the pair was included. To resolve ties, we use another LLM. The algorithm is as follows:
• From a given set of nodes in the graph, we generate all possible triplets, each triplet treated independently from the others.
• We prompt the LLM to orient nodes of each triplet group to form a Directed Acyclic Graph representing the causal relationship between the nodes of the triplet. This will result in multiple mini graphs representing the causal relationship for each triplet group.
• LLMs will be prompted to directly orient the three nodes for each triplet, hence identifying the causal relationship based on the third node, similar to how PC functions.
• Once we have DAGs representing each triplet, we focus on merging them. Our merging process can be broken down into two major steps:
    – We iterate over all node pairs, and for each combination we check what is the majority orientation between them over all the triplets containing the node pair.
    – In case there occurs a clash (same number of votes) between any two or all three possible edge orientations (A → B; B → A; no connection between A and B nodes), we resort to GPT-4 to resolve the clash by taking the final decision via the CoT prompt.

In addition, the triplet prompt uses the techniques of in-context examples and chain-of-thought from the pairwise setup. An example prompt is shown in Table A12.
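
The merging step described above can be sketched as follows. This is a minimal illustration that assumes each triplet answer has already been parsed into a list of directed (cause, effect) edges; the function name, the input format, and the vote labels are hypothetical, and the tie-breaking LLM query is left to the caller.

```python
from collections import Counter
from itertools import combinations

def merge_triplet_votes(nodes, triplet_edges):
    """Majority-vote the orientation of every node pair across triplet subgraphs.

    nodes         : list of variable names
    triplet_edges : dict mapping a 3-tuple of nodes to the directed edges
                    (cause, effect) returned for that triplet
    Returns (decided, ties): decided maps a pair (a, b) to "a->b", "b->a" or
    "none"; ties lists the pairs that need a tie-breaking query.
    """
    rank = {v: k for k, v in enumerate(nodes)}
    votes = {pair: Counter() for pair in combinations(nodes, 2)}
    for triplet, edges in triplet_edges.items():
        edges = set(edges)
        # Sort by node rank so each pair key matches the keys of `votes`.
        for a, b in combinations(sorted(triplet, key=rank.get), 2):
            if (a, b) in edges:
                votes[(a, b)]["a->b"] += 1
            elif (b, a) in edges:
                votes[(a, b)]["b->a"] += 1
            else:
                votes[(a, b)]["none"] += 1

    decided, ties = {}, []
    for pair, counter in votes.items():
        top = counter.most_common(2)
        if not top or (len(top) == 2 and top[0][1] == top[1][1]):
            ties.append(pair)  # no votes or a clash: defer to the tie-breaker
        else:
            decided[pair] = top[0][0]
    return decided, ties

# Hypothetical parsed answers for four variables.
answers = {
    ("A", "B", "C"): [("A", "B"), ("B", "C")],
    ("A", "B", "D"): [("A", "B")],
    ("A", "C", "D"): [("A", "C"), ("C", "D")],
    ("B", "C", "D"): [("B", "C"), ("C", "D")],
}
decided, ties = merge_triplet_votes(["A", "B", "C", "D"], answers)
```

In this toy input the pair (A, C) receives one vote for an edge and one vote for no connection, so it is returned as a tie for the second model to resolve, while all other pairs are decided by majority.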

6 ALGORITHMS FOR LLM-GUIDED CAUSAL DISCOVERY

LLMs using the above prompts may yield accurate causal order predictions, but may also exhibit some unknown failure modes (Kıcıman et al., 2023). To increase robustness of the final result, we now provide algorithms for combining LLM-outputted causal order with existing causal discovery paradigms: (i) constraint-based and (ii) score-based methods.

6.1 CONSTRAINT-BASED METHODS USING POST-HOC EDGE ORIENTATION BY LLMS

Constraint-based algorithms return a graph where some edges may not be oriented. Given a graph from a constraint-based algorithm like PC, we use the causal order $\hat{\pi}$ from the LLM to orient the undirected edges. Iterating over the undirected edges, we first check if both nodes of that edge occur in $\hat{\pi}$. If yes, we orient the edge according to the causal order. The LLM’s final graph might have some isolated nodes that are not in $\hat{\pi}$. Therefore, if either (or both) nodes of the undirected edge are not included in $\hat{\pi}$, we query GPT-4 using the pairwise CoT prompt (from Section 5.1) to finalise a direction between the pair.


Algorithm 1 Combining constraint-based methods and experts to get $\hat{\pi}$ for a given set of variables.
Input: LLM topological ordering $\hat{\pi}$, Expert $E_{GPT4}$, PC-CPDAG $\hat{G}$
Output: Estimated topological order $\hat{\pi}_{final}$ of $\{X_1, \dots, X_n\}$.
for $(i - j) \in$ undirected-edges($\hat{G}$) do
    If both nodes $i$ and $j$ are in $\hat{\pi}$, orient $(i - j)$ in $\hat{G}$ according to $\hat{\pi}$: as $(i \to j)$ if $\hat{\pi}_i < \hat{\pi}_j$, else as $(j \to i)$.
    Otherwise, use the expert $E_{GPT4}$ with the CoT prompt to orient the edge $(i - j)$.
end for
$\hat{\pi}_{final}$ = topological ordering of $\hat{G}$
return $\hat{\pi}_{final}$
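
A possible Python rendering of Algorithm 1 is sketched below. It assumes the CPDAG is passed as separate lists of directed and undirected edges, and that the expert is a callable returning an ordered (cause, effect) pair; that callable stands in for the GPT-4 CoT query and is purely hypothetical here. `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

def orient_with_llm_order(nodes, directed, undirected, llm_order, ask_expert):
    """Orient the undirected CPDAG edges with an LLM causal order.

    directed, undirected : lists of node pairs from the constraint-based output
    llm_order            : causal order from the LLM (may omit isolated nodes)
    ask_expert           : callable (i, j) -> (cause, effect), used as a fallback
    Returns one topological order of the fully oriented graph.
    """
    position = {v: k for k, v in enumerate(llm_order)}
    oriented = set(directed)
    for i, j in undirected:
        if i in position and j in position:
            # Both endpoints are ranked: follow the LLM causal order.
            oriented.add((i, j) if position[i] < position[j] else (j, i))
        else:
            # At least one endpoint is missing from the order: ask the expert.
            oriented.add(ask_expert(i, j))

    predecessors = {v: set() for v in nodes}
    for cause, effect in oriented:
        predecessors[effect].add(cause)
    # Raises graphlib.CycleError if the oriented graph is not a DAG.
    return list(TopologicalSorter(predecessors).static_order())

# Toy input: "dyspnoea" is missing from the LLM order, so the expert is consulted.
final_order = orient_with_llm_order(
    nodes=["smoking", "cancer", "xray", "dyspnoea"],
    directed=[],
    undirected=[("cancer", "smoking"), ("cancer", "xray"), ("dyspnoea", "cancer")],
    llm_order=["smoking", "cancer", "xray"],
    ask_expert=lambda i, j: ("cancer", "dyspnoea"),  # hypothetical stub answer
)
```

Because a DAG can have several topological orders, only the relative positions implied by the oriented edges are guaranteed in the returned list.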

6.2 SCORE-BASED METHODS USING EXPERT PRIORS

We utilize the output of the LLM as a prior in the score-based algorithms. We provide the level order of the causal graph returned by the LLM as a prior for a score-based algorithm. Unlike a similar LLM-prior approach by Ban et al. (2023), where they combine the output of the LLM and a score-based method using an ancestral constraint as a prior, ours is a sequential approach where a score-based algorithm starts with the order-based constraint, aligning with our goal of recovering causal order among variables. Optionally, we can provide a prior probability to control the influence of the prior on the algorithm. Algorithm 2 outlines the steps to combine a score-based method and expert knowledge in terms of the variables’ level order.

Algorithm 2 Combining score-based methods and experts to get $\hat{\pi}$ for a given set of variables.
Input: $D$, variables $\{X_1, \dots, X_n\}$, Expert $E$, Score-based method $S$, Prior probability $p$.
Output: Estimated topological order $\hat{\pi}$ of $\{X_1, \dots, X_n\}$.
$\hat{G} = E(X_1, \dots, X_n)$
$L$ = level order of $\hat{G}$.
for cycle $C \in \hat{G}$ do
    for node $\in C$ do
        $L$(node) = min(level($c$) $\forall c \in C$)
    end for
end for
$\hat{G} = S(D, L, p)$
$\hat{\pi}$ = topological ordering of $\hat{G}$
return $\hat{\pi}$

7 EXPERIMENTS AND RESULTS

To evaluate the accuracy of LLM-based algorithms on inferring causal order, we perform experiments on the benchmark datasets from the Bayesian network repository (Scutari & Denis, 2014): Earthquake, Cancer, Survey, Asia, Asia modified (Asia-M), and Child. Asia-M is derived from Asia by removing the node “either” since it is not a node with a semantic meaning (see Appendix § C for details). We also used a medium sized subset graph (refer Figure A6 in Appendix) from the Neuropathic dataset (Tu et al., 2019) used for pain diagnosis. Except Child (with 20 nodes) and the Neuropathic subgraph (with 22 nodes), all other graphs are small-scale graphs with <10 nodes.

7.1 $D_{top}$ CORRELATES WITH EFFECT ESTIMATION ERROR

Before comparing methods on the $D_{top}$ metric, we first show that $D_{top}$ has a strong correlation with effect estimation error and hence is the correct metric for effect inference. Specifically, we study how the error in causal effect, $\epsilon_{ACE}$, changes as values of the metrics SHD and $D_{top}$ change. For the datasets Cancer, Asia and Survey, we consider dyspnoea, dyspnoea, and Travel respectively as the target variables. In each graph, we evaluate causal effects of each variable on the target variable. We iterate through estimated causal graphs with different values of SHD and $D_{top}$ and report the mean absolute difference between estimated and true causal effects. As Table 1 shows, when $D_{top}$ is zero, effect error $\epsilon_{ACE}$ is also zero. And as $D_{top}$ increases (right panel), effect error increases. In contrast, SHD has no correlation with $\epsilon_{ACE}$.

Cancer
    SHD vs. $\epsilon_{ACE}$ | $D_{top}$ = 0        $D_{top}$ vs. $\epsilon_{ACE}$ | SHD = 2
    SHD    $\epsilon_{ACE}$                      $D_{top}$    $\epsilon_{ACE}$
    0      0.00                                  0            0.00
    2      0.00                                  1            0.25
    4      0.00                                  2            0.50
Asia
    SHD vs. $\epsilon_{ACE}$ | $D_{top}$ = 0        $D_{top}$ vs. $\epsilon_{ACE}$ | SHD = 3
    SHD    $\epsilon_{ACE}$                      $D_{top}$    $\epsilon_{ACE}$
    0      0.00                                  1            0.14
    6      0.00                                  2            0.22
    10     0.00                                  3            0.57
Survey
    SHD vs. $\epsilon_{ACE}$ | $D_{top}$ = 0        $D_{top}$ vs. $\epsilon_{ACE}$ | SHD = 2
    SHD    $\epsilon_{ACE}$                      $D_{top}$    $\epsilon_{ACE}$
    0      0.00                                  0            0.00
    2      0.00                                  1            0.25
    4      0.03                                  2            0.50

Table 1: $\epsilon_{ACE}$ vs. SHD ($D_{top}$) given $D_{top}$ (SHD)


Dataset        $D_{top}$   SHD    IN/TN    Cycles

Base Prompt
Earthquake     0           7      0/5      0
Cancer         0           6      0/5      0
Survey         3           12     0/6      0
Asia           -           21     0/8      7
Asia-M         -           15     0/7      6
Child          -           177    0/20     20
Neuropathic    -           212    0/22     22

All Directed Edges
Earthquake     1           9      0/5      0
Cancer         1           7      0/5      0
Survey         2           11     0/6      0
Asia           -           21     0/8      8
Asia-M         0           13     0/7      0
Child          -           139    0/20     18
Neuropathic    -           194    0/22     20

Markov Blanket
Earthquake     0           8      0/5      0
Cancer         0           6      0/5      0
Survey         3           12     0/6      0
Asia           -           21     0/8      5
Asia-M         0           14     0/7      0
Child          -           167    0/20     20
Neuropathic    -           204    0/22     21

Table 2: Comparison of various prompting strategies for only LLM based setups, providing different contextual cues in each setup about the graph. IN: Isolated Nodes, TN: Total Nodes.

Dataset        $D_{top}$   SHD    IN/TN    Cycles

Chain of Thought
Earthquake     1           4      0/5      0
Survey         1           6      2/6      0
Asia           1           17     0/8      0
Asia-M         1           11     0/7      0
Child          -           91     0/20     13
Neuropathic    -           64     0/22     8

Triplet Prompt
Earthquake     0           4      0/5      0
Cancer         0           4      1/5      0
Survey         0           6      0/6      0
Asia           1           7      1/8      0
Asia-M         0           3      2/7      0
Child          0           29     11/20    0
Neuropathic    2           23     16/22    0

Table 3: Triplet Prompt output using variable names with their descriptions (Cancer not included since CoT prompt has examples from this graph). IN: Isolated Nodes, TN: Total Nodes.

7.2 TRIPLET PROMPTING TECHNIQUE IS MOST ACCURATE FOR CAUSAL ORDER
Tables 2 and 3 compare the different prompting techniques. As the graph size increases, we observe limitations with pairwise prompts. In many cases, pairwise prompts yield cycles, due to which Dtop cannot be computed. In particular, for the Child dataset with 20 nodes, pairwise prompts yield anywhere from 13-79 cycles. LLM output tends to connect more edges than needed, which explains why SHD is high. Overall, among the pairwise prompts, the chain of thought prompt performs the best: it has the lowest Dtop on the four small graphs and the lowest number of cycles for the Child and Neuropathic datasets. This indicates that in-context examples and chain-of-thought reasoning help to increase the accuracy of causal order output, but other contextual cues do not matter.
Finally, the triplet prompt provides the most accurate causal order. Even for medium-size graphs like Child and Neuropathic, the LLM output includes no cycles, and SHD is fairly low, between 3 and 29 across datasets (Table 3). Moreover, Dtop is zero for all datasets, except for Asia and Neuropathic where it is 1 and 2 respectively. That said, we do see that isolated nodes in the output increase compared to the pairwise prompts (all graphs are connected, so outputting an isolated node is an error). Considering LLMs as virtual experts, this indicates that there are some nodes on which the LLM expert cannot determine the causal order. This is still a better tradeoff than outputting the wrong causal order, which can confuse downstream algorithms. Overall, therefore, we conclude that the triplet prompt provides the most robust causal order predictions.
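
The two failure modes discussed above, cycles (which make Dtop undefined) and isolated nodes, can be checked mechanically on any LLM-produced edge list. The sketch below is illustrative only: it tests whether at least one directed cycle exists rather than counting cycles as Tables 2 and 3 do, and the node names and edges are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Edge = Tuple[str, str]  # (cause, effect)


def has_cycle(nodes: Sequence[str], edges: Sequence[Edge]) -> bool:
    """Return True if the directed graph contains at least one cycle (DFS colouring)."""
    children: Dict[str, List[str]] = {n: [] for n in nodes}
    for cause, effect in edges:
        children[cause].append(effect)
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished
    colour = {n: WHITE for n in nodes}

    def visit(node: str) -> bool:
        colour[node] = GREY
        for child in children[node]:
            if colour[child] == GREY:  # back edge: child is an ancestor of node
                return True
            if colour[child] == WHITE and visit(child):
                return True
        colour[node] = BLACK
        return False

    return any(colour[n] == WHITE and visit(n) for n in nodes)


def isolated_nodes(nodes: Sequence[str], edges: Sequence[Edge]) -> List[str]:
    """Return nodes that appear in no edge at all."""
    touched = {n for edge in edges for n in edge}
    return [n for n in nodes if n not in touched]


# Hypothetical LLM outputs over four variables.
nodes = ["A", "B", "C", "D"]
cyclic_output = [("A", "B"), ("B", "C"), ("C", "A")]
acyclic_output = [("A", "B"), ("B", "C")]

print(has_cycle(nodes, cyclic_output))        # True
print(has_cycle(nodes, acyclic_output))       # False
print(isolated_nodes(nodes, acyclic_output))  # ['D']
```

Only when `has_cycle` returns False can a topological order, and hence Dtop, be derived from the output.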
7.3 LLMS IMPROVE CAUSAL ORDER ACCURACY OF EXISTING DISCOVERY ALGORITHMS
We now study whether LLM output can be used to increase the accuracy of discovery algorithms in inferring causal order. We compare with popular causal discovery methods: PC (Spirtes et al., 2000), SCORE (Rolland et al., 2022), ICA-LiNGAM (Shimizu et al., 2006), Direct-LiNGAM (Shimizu et al., 2011), NOTEARS (Zheng et al., 2018), and Causal discovery via minimum message length (CaMML) (Wallace et al., 1996), across five different sample sizes: 250, 500, 1000, 5000, 10000. For the LLM, we use the triplet prompt. Table 4 shows the Dtop metric for different algorithms and compares it to the Dtop of our combined algorithms: PC+LLM and CaMML+LLM. Among the discovery algorithms, we find that PC and CaMML perform the best, with the lowest Dtop across the five datasets. For the Neuropathic dataset, ICA-LiNGAM is also competitive.


Dataset      PC          SCORE       ICA-LiNGAM  Direct-LiNGAM  NOTEARS     CaMML       Ours (PC+LLM)  Ours (CaMML+LLM)
N = 250
Earthquake   0.30±0.44   4.00±0.00   3.20±0.39   3.00±0.00      1.80±0.74   2.00±0.00   0.00±0.00      0.00±0.00
Cancer       0.00±0.00   3.00±0.00   4.00±0.00   3.60±0.48      2.00±0.00   2.00±0.00   0.00±0.00      0.00±0.00
Survey       0.50±0.00   3.00±0.00   6.00±0.00   6.00±0.00      3.20±0.39   3.33±0.94   0.00±0.00      3.33±0.94
Asia         2.33±0.59   5.00±0.00   6.20±0.74   7.00±0.00      4.00±0.00   1.85±0.58   0.00±0.00      0.97±0.62
Asia-M       2.00±0.00   5.00±0.00   7.60±0.48   6.20±1.16      3.40±0.48   1.00±0.00   0.00±0.00      1.71±0.45
Child        8.16±1.58   8.80±2.70   12.8±0.97   13.0±0.63      15.0±1.09   3.00±0.00   4.00±0.00      3.53±0.45
Neuropathic  3.25±0.00   6.00±0.00   13.0±6.16   10.0±0.00      9.00±0.00   10.4±1.95   1.00±0.00      5.00±0.00
N = 500
Earthquake   0.85±0.65   4.00±0.00   3.20±0.39   3.40±0.48      1.20±0.40   0.00±0.00   0.4±0.89       0.00±0.00
Cancer       0.00±0.00   3.00±0.00   3.40±0.48   3.00±0.00      2.00±0.00   1.00±0.00   0.00±0.00      1.00±0.00
Survey       1.75±0.00   4.00±0.00   6.00±0.0    6.00±0.00      3.40±0.48   3.39±0.08   1.00±0.00      3.33±0.94
Asia         3.00±0.00   5.00±0.00   5.60±0.48   7.00±0.00      3.20±0.39   3.81±0.39   1.00±0.00      0.97±0.62
Asia-M       2.00±0.00   6.00±0.00   7.60±0.48   5.00±0.00      3.80±0.39   2.00±0.00   1.00±0.00      0.17±0.45
Child        9.79±1.17   6.20±1.32   12.2±0.74   10.6±1.35      15.4±0.48   2.00±0.00   4.6±1.34       2.00±0.00
Neuropathic  7.50±0.00   6.00±0.00   9.00±1.41   13.0±0.00      11.0±0.00   5.32±0.57   8.00±0.00      7.49±0.64
N = 1000
Earthquake   0.80±0.57   4.00±0.00   3.00±0.00   3.00±0.00      1.00±0.00   0.00±0.00   0.20±0.44      0.00±0.00
Cancer       0.00±0.00   3.00±0.00   3.00±0.00   3.00±0.00      2.00±0.00   1.60±0.48   0.00±0.00      0.00±0.00
Survey       1.00±0.00   4.00±0.00   5.80±0.39   5.40±0.48      3.20±0.39   2.71±0.27   1.00±0.00      2.83±0.00
Asia         3.09±1.05   4.00±0.00   6.20±0.74   6.60±0.48      3.40±0.48   1.75±0.43   1.75±0.95      0.97±0.62
Asia-M       2.50±0.00   4.00±0.00   8.00±0.00   5.20±0.39      3.40±0.48   2.04±0.51   2.00±0.00      0.65±0.47
Child        9.61±1.07   3.80±0.74   12.2±1.72   11.8±0.74      15.2±0.97   2.00±0.00   8.0±0.00       2.00±0.40
Neuropathic  -           6.00±0.00   4.00±0.81   12.0±0.00      12.0±0.00   5.54±0.75   -              10.1±2.12
N = 5000
Earthquake   0.30±0.67   4.00±0.00   2.80±0.39   3.00±0.00      1.00±0.00   0.80±0.97   0.00±0.00      0.00±0.00
Cancer       0.00±0.00   3.00±0.00   3.00±0.00   3.00±0.00      2.00±0.00   2.00±0.00   0.00±0.00      0.00±0.00
Survey       2.00±0.00   4.00±0.00   5.00±0.00   5.00±0.00      3.00±0.00   3.33±0.69   2.00±0.00      2.60±0.00
Asia         3.05±0.94   4.00±0.00   6.60±0.79   4.40±1.35      3.40±0.48   1.75±0.43   2.80±1.30      0.97±0.62
Asia-M       1.00±0.00   4.00±0.00   7.60±0.48   4.60±0.48      3.20±0.39   1.68±0.46   0.20±0.44      0.00±0.00
Child        8.42±0.75   3.00±0.00   12.6±0.79   10.8±1.72      14.2±0.40   3.00±0.00   7.00±0.00      3.00±0.00
Neuropathic  9.00±0.00   6.00±0.00   9.33±0.94   10.0±0.00      10.0±0.00   4.20±0.96   9.00±0.00      1.23±0.42
N = 10000
Earthquake   0.00±0.00   4.00±0.00   3.00±0.00   3.00±0.00      1.00±0.00   0.40±0.48   0.00±0.00      0.00±0.00
Cancer       0.00±0.00   3.00±0.00   3.00±0.00   3.00±0.00      2.00±0.00   0.60±0.80   0.00±0.00      0.00±0.00
Survey       2.00±0.00   4.00±0.00   5.00±0.00   5.00±0.00      3.00±0.00   3.60±1.35   2.00±0.00      1.83±0.00
Asia         1.95±0.41   4.00±0.00   6.00±0.00   4.40±1.35      3.00±0.00   1.40±0.48   1.20±0.83      0.34±0.47
Asia-M       1.75±0.00   4.00±0.00   8.00±0.00   4.80±0.39      3.00±0.00   2.00±0.00   0.00±0.00      0.00±0.00
Child        7.67±0.65   3.00±0.00   12.2±1.46   11.6±0.48      14.4±0.48   2.80±0.84   7.00±0.00      1.00±0.00
Neuropathic  10.00±0.00  6.00±0.00   1.00±0.00   10.0±0.00      10.0±0.00   3.00±0.00   10.00±0.00     1.00±0.00

Table 4: Comparison with existing discovery methods. Mean and std dev of Dtop over 5 runs. (For the Neuropathic subgraph with 1k samples, the PC algorithm returns cyclic graphs in the MEC.)

For the BNLearn datasets, adding LLM output decreases the Dtop of both algorithms further. Specifically, PC+LLM leads to a significant reduction in Dtop, and the gains are larger at lower sample sizes. This indicates that obtaining causal order from LLMs may matter more in limited sample settings. At a sample size of 500, the Dtop of PC is nearly double that of PC+LLM for most datasets. Going from CaMML to CaMML+LLM, we also see significant reductions in Dtop. Interestingly, CaMML+LLM yields benefits even at higher sample sizes. At a sample size of 10,000, CaMML's Dtop for Child and Asia is roughly three and four times that of CaMML+LLM, respectively (2.80 vs. 1.00 and 1.40 vs. 0.34 in Table 4).
For the Neuropathic dataset, we see a similar pattern: adding LLM to existing algorithms improves Dtop or keeps it constant, except at sample sizes 500 and 1000, where CaMML+LLM yields a worse Dtop than CaMML alone. However, as the sample size increases to 5000 and 10000, we do see that CaMML+LLM improves the Dtop substantially compared to CaMML. Overall, these results show that LLM output can significantly improve the accuracy of existing causal discovery algorithms.

8 LIMITATIONS AND CONCLUSIONS

We presented causal order as a suitable metric for evaluating the quality of causal graphs for downstream effect inference tasks. Using a novel formulation of LLM prompts based on triplets, we showed that LLMs can be useful in generating accurate causal order, both individually and in combination with existing discovery algorithms. Our results point to techniques that can automate the causal inference process.
That said, our work has limitations. We studied LLMs' utility on popular benchmarks, which may have been partially memorized. It will be interesting to extend our experiments to more diverse and larger datasets. In addition, we studied only one downstream task (effect inference). Identifying the necessary graph metrics for tasks such as causal prediction and counterfactual inference will be useful future work.


REFERENCES
Silvia Acid and Luis M de Campos. Searching for bayesian network structures in the space of restricted acyclic partially directed graphs. Journal of Artificial Intelligence Research, 18:445–490, 2003.
Taiyu Ban, Lyvzhou Chen, Xiangyu Wang, and Huanhuan Chen. From query tools to causal architects: Harnessing large language models for advanced causal discovery from data. arXiv preprint arXiv:2306.16902, 2023.
Sarah E. Boslaugh. Silent spring. 2023.
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. CoRR, abs/2005.14165, 2020. URL https://arxiv.org/abs/2005.14165.
Carlos Cinelli, Andrew Forney, and Judea Pearl. A crash course in good and bad controls. Sociological Methods & Research, pp. 00491241221099552, 2022.
Anthony C. Constantinou, Zhigao Guo, and Neville K. Kitson. The impact of prior knowledge on causal structure learning. Knowledge and Information Systems, 65(8):3385–3434, 2023.
Clark Glymour, Kun Zhang, and Peter Spirtes. Review of causal discovery methods based on graphical models. Frontiers in genetics, 10:524, 2019.
Shantanu Gupta, David Childers, and Zachary Chase Lipton. Local causal discovery for estimating causal effects. In NeurIPS 2022 Workshop on Causality for Real-world Impact, 2022.
Thilo Hagendorff, Sarah Fabi, and Michal Kosinski. Machine intuition: Uncovering human-like intuitive decision-making in gpt-3.5, 12 2022.
Uzma Hasan and Md Osman Gani. Kcrl: A prior knowledge based causal discovery framework with reinforcement learning. In Proceedings of the 7th Machine Learning for Healthcare Conference, 2022.
David Heckerman and Dan Geiger. Learning bayesian networks: a unification for discrete and gaussian domains. arXiv preprint arXiv:1302.4957, 2013.
Patrik Hoyer, Dominik Janzing, Joris M Mooij, Jonas Peters, and Bernhard Schölkopf. Nonlinear causal discovery with additive noise models. In Advances in Neural Information Processing Systems, 2008a.
Patrik Hoyer, Dominik Janzing, Joris M Mooij, Jonas Peters, and Bernhard Schölkopf. Nonlinear causal discovery with additive noise models. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou (eds.), Advances in Neural Information Processing Systems, 2008b.
Yiyi Huang, Matthäus Kleindessner, Alexey Munishkin, Debvrat Varshney, Pei Guo, and Jianwu Wang. Benchmarking of data-driven causality discovery approaches in the interactions of arctic sea ice and atmosphere. Frontiers in big Data, 4:642182, 2021.
Aapo Hyvärinen, Kun Zhang, Shohei Shimizu, and Patrik O. Hoyer. Estimation of a structural vector autoregression model using non-gaussianity. Journal of Machine Learning Research, 11(56):1709–1731, 2010.
Guido W. Imbens and Donald B. Rubin. Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge University Press, 2015. doi: 10.1017/CBO9781139025751.
Emre Kıcıman, Robert Ness, Amit Sharma, and Chenhao Tan. Causal reasoning and large language models: Opening a new frontier for causality. arXiv preprint arXiv:2305.00050, 2023.


Sébastien Lachapelle, Philippe Brouillard, Tristan Deleu, and Simon Lacoste-Julien. Gradient-based neural dag learning. In ICLR, 2020.
Stephanie Long, Tibor Schuster, and Alexandre Piché. Can large language models build causal graphs? In NeurIPS 2022 Workshop on Causality for Real-world Impact, 2022.
Stephanie Long, Alexandre Piché, Valentina Zantedeschi, Tibor Schuster, and Alexandre Drouin. Causal discovery with language models as imperfect experts. In ICML 2023 Workshop on Structured Probabilistic Inference & Generative Modeling, 2023.
Marloes H Maathuis, Diego Colombo, Markus Kalisch, and Peter Bühlmann. Predicting causal effects in large-scale systems from observational data. Nature methods, 7(4):247–248, 2010.
Syed S Mahmood, Daniel Levy, Ramachandran S Vasan, and Thomas J Wang. The framingham heart study and the epidemiology of cardiovascular disease: a historical perspective. The lancet, 383(9921):999–1008, 2014.
Joris M Mooij, Jonas Peters, Dominik Janzing, Jakob Zscheischler, and Bernhard Schölkopf. Distinguishing cause from effect using observational data: methods and benchmarks. The Journal of Machine Learning Research, 17(1):1103–1204, 2016.
Rodney T O’Donnell, Ann E Nicholson, Bin Han, Kevin B Korb, Md Jahangir Alam, and Lucas R Hope. Causal discovery with prior information. In AI 2006: Advances in Artificial Intelligence: 19th Australian Joint Conference on Artificial Intelligence, Hobart, Australia, December 4-8, 2006. Proceedings 19, pp. 1162–1167. Springer, 2006.
Judea Pearl. Causality. Cambridge university press, 2009.
Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. Instruction tuning with gpt-4, 2023.
Jonas Peters and Peter Bühlmann. Structural intervention distance for evaluating causal graphs. Neural computation, 27(3):771–799, 2015.
Alexander Reisach, Christof Seiler, and Sebastian Weichwald. Beware of the simulated dag! causal discovery benchmarks may be easy to game. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (eds.), NeurIPS, pp. 27772–27784, 2021.
Paul Rolland, Volkan Cevher, Matthäus Kleindessner, Chris Russell, Dominik Janzing, Bernhard Schölkopf, and Francesco Locatello. Score matching enables causal discovery of nonlinear additive noise models. In ICML, 2022.
M. Scutari and J.B. Denis. Bayesian Networks: With Examples in R. Chapman & Hall/CRC Texts in Statistical Science. Taylor & Francis, 2014.
Shohei Shimizu, Patrik O Hoyer, Aapo Hyvärinen, Antti Kerminen, and Michael Jordan. A linear non-gaussian acyclic model for causal discovery. JMLR, 7(10), 2006.
Shohei Shimizu, Takanori Inazumi, Yasuhiro Sogawa, Aapo Hyvarinen, Yoshinobu Kawahara, Takashi Washio, Patrik O Hoyer, Kenneth Bollen, and Patrik Hoyer. Directlingam: A direct method for learning a linear non-gaussian structural equation model. JMLR, 12(Apr):1225–1248, 2011.
Peter Spirtes, Clark N Glymour, and Richard Scheines. Causation, prediction, and search. MIT press, 2000.
Takeshi Teshima and Masashi Sugiyama. Incorporating causal graphical prior knowledge into predictive modeling via simple data augmentation. In UAI, pp. 86–96. PMLR, 2021.
Marc Teyssier and Daphne Koller. Ordering-based search: A simple and effective algorithm for learning bayesian networks. In UAI, UAI’05, pp. 584–590, 2005.
Ioannis Tsamardinos, Laura E Brown, and Constantin F Aliferis. The max-min hill-climbing bayesian network structure learning algorithm. Machine learning, 65:31–78, 2006.


Ruibo Tu, Kun Zhang, Bo Bertilson, Hedvig Kjellstrom, and Cheng Zhang. Neuropathic pain diagnosis simulator for causal discovery algorithm evaluation. Advances in Neural Information Processing Systems, 32, 2019.
Chris Wallace, Kevin B Korb, and Honghua Dai. Causal discovery via mml. In ICML, volume 96, pp. 516–524, 1996.
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
Moritz Willig, Matej Zečević, Devendra Singh Dhami, and Kristian Kersting. Probing for correlations of causal facts: Large language models and causality. 2022.
Xun Zheng, Bryon Aragam, Pradeep K Ravikumar, and Eric P Xing. Dags with no tears: Continuous optimization for structure learning. NeurIPS, 31, 2018.


APPENDIX

A PROOFS OF PROPOSITIONS

Proposition 4.2. (Pearl, 2009; Cinelli et al., 2022) Under the no confounding assumption, given a pair of treatment and target variables (Xi, Xj) in G, Z = {Xk | πk < πi} is a valid adjustment set relative to (Xi, Xj) for any topological ordering π of G.

Proof. We need to show that the set Z = {Xk | πk < πi} satisfies conditions (i) and (ii) in Defn 3.2. For any variable Xk such that πk < πi, we have Xk ∉ de(Xi), and hence condition (i) is satisfied. Additionally, for each Xk ∈ pa(Xi) we have πk < πi, and hence pa(Xi) ⊆ Z. Since pa(Xi) blocks all paths from Xi to Xj that contain an arrow into Xi (Peters & Bühlmann, 2015), Z satisfies condition (ii).
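
As a small illustration of Proposition 4.2, the sketch below builds Z from a topological order and checks the two facts used in the proof: Z contains no descendant of the treatment, and Z contains all of its parents. The helper names and the use of the standard BNLearn Cancer network are assumptions made for this example only.

```python
from typing import Dict, List, Sequence, Set


def adjustment_set(order: Sequence[str], treatment: str) -> Set[str]:
    """Return Z = {X_k : pi_k < pi_i}, i.e. every variable placed before the treatment."""
    return set(order[: list(order).index(treatment)])


def descendants(children: Dict[str, List[str]], node: str) -> Set[str]:
    """All nodes reachable from `node` by following directed edges."""
    seen: Set[str] = set()
    stack = list(children.get(node, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(children.get(current, []))
    return seen


# Assumed structure of the BNLearn Cancer network, as a child adjacency list.
cancer_children = {
    "Pollution": ["Cancer"],
    "Smoker": ["Cancer"],
    "Cancer": ["Xray", "Dyspnoea"],
}
order = ["Pollution", "Smoker", "Cancer", "Xray", "Dyspnoea"]  # one valid topological order

z = adjustment_set(order, "Cancer")
parents_of_cancer = {p for p, kids in cancer_children.items() if "Cancer" in kids}

print(sorted(z))                                           # ['Pollution', 'Smoker']
print(z.isdisjoint(descendants(cancer_children, "Cancer")))  # True: condition (i)
print(parents_of_cancer <= z)                              # True: pa(X_i) is inside Z
```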

Proposition 4.3. For an estimated topological order π̂ and a true topological order π of a causal DAG G with the corresponding adjacency matrix A, Dtop(π̂, A) = 0 iff Z = {Xk | π̂k < π̂i} is a valid adjustment set relative to (Xi, Xj), ∀πi < πj.

Proof. The statement of the proposition is of the form A ⇐⇒ B, with A being “Dtop(π̂, A) = 0” and B being “Z = {Xk | π̂k < π̂i} is a valid adjustment set relative to (Xi, Xj), ∀i, j”. We prove A ⇐⇒ B by proving (i) A =⇒ B and (ii) B =⇒ A.
(i) Proof of A =⇒ B: If Dtop(π̂, A) = 0, then for all pairs of nodes (Xi, Xj) we have π̂i < π̂j whenever πi < πj. That is, the causal order in the estimated graph is the same as the causal order in the true graph. Hence, from Propn 4.2, Z = {Xk | π̂k < π̂i} is a valid adjustment set relative to (Xi, Xj), ∀i, j.
(ii) Proof of B =⇒ A: we prove the contrapositive, ¬A =⇒ ¬B, which is logically equivalent to B =⇒ A. To this end, assume Dtop(π̂, A) ≠ 0. Then there is at least one edge Xi → Xj that cannot be oriented correctly under the estimated topological order π̂, i.e., π̂j < π̂i but πj > πi. Hence, to find the causal effect of Xi on Xl, l ≠ j, Xj is included in the back-door adjustment set Z relative to (Xi, Xl). Adding Xj to Z renders Z an invalid adjustment set because it violates condition (i) of Defn 3.2.

Proposition 4.5. In a causal DAG G with N levels in the level-ordering of variables, where level i contains $n_i$ variables, ∃ Ĝ s.t. $\mathrm{SHD}(\hat{G}, G) \geq \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} (n_i \times n_j) - |E|$ and Dtop(π̂, A) = 0 ∀π̂ of Ĝ.

Proof. Recall that SHD counts the number of missing, falsely detected, and falsely directed edges in the estimated causal graph as compared to the ground truth graph. Since we want Dtop(π̂, A) = 0 ∀π̂ of Ĝ, there cannot be an edge Xi → Xj in Ĝ such that Xi ← Xj is in G. This constraint rules out falsely directed edges in Ĝ. Consider a Ĝ with all the edges in G and, in addition, a directed edge from each variable in level i to each variable in all levels below level i. All such added edges contribute to the SHD score while still obeying the causal ordering in G. Their number equals $\sum_{i=1}^{N-1} \sum_{j=i+1}^{N} (n_i \times n_j) - |E|$. The quantity $\sum_{i=1}^{N-1} \sum_{j=i+1}^{N} (n_i \times n_j)$ is the number of possible edges from each node to every node in the levels below it. We subtract the number of existing edges |E| to count only the newly added edges that contribute to the SHD score. Now, we can remove some of the edges Xi → Xj from Ĝ such that Xi → Xj is in G while still leading to the same causal ordering of variables. This increases the SHD score due to missing edges in Ĝ. Since this only increases the SHD score, we ignore such corner cases.
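
The construction in the proof can be checked numerically on a small case. The sketch below is an illustration under stated assumptions, not part of the original proof: it assumes the standard BNLearn Cancer network with level-ordering {Pollution, Smoker}, {Cancer}, {Xray, Dyspnoea}, and it computes SHD as the size of the symmetric difference between edge sets, which is valid here only because the dense graph contains no reversed edges. The function names are hypothetical.

```python
from itertools import combinations
from typing import List, Sequence, Set, Tuple

Edge = Tuple[str, str]  # (cause, effect)


def extra_edge_bound(level_sizes: Sequence[int], num_true_edges: int) -> int:
    """Evaluate sum_{i<j} n_i * n_j - |E| from Proposition 4.5."""
    all_cross_level = sum(a * b for a, b in combinations(level_sizes, 2))
    return all_cross_level - num_true_edges


def dense_graph(levels: Sequence[Sequence[str]]) -> Set[Edge]:
    """Connect every variable to every variable in all lower levels."""
    edges: Set[Edge] = set()
    for upper, lower in combinations(levels, 2):
        edges.update((u, v) for u in upper for v in lower)
    return edges


# Assumed level-ordering and edges of the Cancer network.
levels: List[List[str]] = [["Pollution", "Smoker"], ["Cancer"], ["Xray", "Dyspnoea"]]
true_edges: Set[Edge] = {
    ("Pollution", "Cancer"),
    ("Smoker", "Cancer"),
    ("Cancer", "Xray"),
    ("Cancer", "Dyspnoea"),
}

bound = extra_edge_bound([len(level) for level in levels], len(true_edges))
g_hat = dense_graph(levels)
shd = len(g_hat ^ true_edges)  # only falsely detected edges; none missing or reversed

print(bound)  # 4  (2*1 + 2*2 + 1*2 = 8 cross-level edges, minus 4 true edges)
print(shd)    # 4
```

Every edge in the dense graph points from a higher level to a lower one, so any topological order of Ĝ respects all true edges and its Dtop is zero even though SHD is 4.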

B ADDITIONAL RESULTS

Table A1 shows the results of various prompt strategies and their improvements over no-prior methods.

                                       Ours (CoT)   Ours (CoT)     Ours (Triplet Pairwise)  Ours (Triplet Pairwise)
Dataset      PC          CaMML         (PC+LLM)     (CaMML+LLM)    (PC+LLM)                 (CaMML+LLM)
N = 250
Earthquake   0.30±0.44   2.00±0.00     0.00±0.00    2.75±0.43      0.2±0.44                 0.00±0.00
Cancer       0.00±0.00   2.00±0.00     0.00±0.00    0.00±0.00      0.00±0.00                0.00±0.00
Survey       0.50±0.00   3.33±0.94     0.00±0.00    3.33±0.94      0.00±0.00                2.66±0.94
Asia         2.33±0.59   1.85±0.58     0.00±0.00    0.97±0.62      1.00±0.00                3.39±0.62
Asia-M       2.00±0.00   1.00±0.00     0.00±0.00    1.71±0.45      0.00±0.00                1.71±0.45
Child        8.16±1.58   3.00±0.00     4.70±1.34    3.53±0.45      8.00±3.17                4.00±0.00
Neuropathic  3.25±0.00   10.4±1.95     1.00±0.00    8.00±0.00      0.00±0.00                12.63±1.92
N = 500
Earthquake   0.85±0.65   0.00±0.00     0.4±0.89     2.60±0.48      1.00±0.00                0.00±0.00
Cancer       0.00±0.00   3.00±0.00     0.00±0.00    1.00±0.00      0.00±0.00                0.00±0.00
Survey       1.75±0.00   3.39±0.08     1.00±0.00    3.33±0.94      0.60±0.00                2.66±0.94
Asia         3.00±0.00   3.81±0.39     1.00±0.00    0.97±0.62      0.00±0.00                3.28±0.64
Asia-M       2.00±0.00   2.00±0.00     1.00±0.00    0.17±0.45      1.00±0.00                1.04±0.20
Child        9.79±1.17   2.00±0.00     4.60±1.34    2.00±0.00      9.20±2.16                3.00±0.00
Neuropathic  7.50±0.00   5.32±0.57     9.00±0.00    8.90±0.00      9.00±0.00                12.1±1.56
N = 1000
Earthquake   0.80±0.57   0.00±0.00     0.20±0.44    2.00±0.00      1.40±1.31                0.00±0.00
Cancer       0.00±0.00   2.00±0.00     0.00±0.00    0.00±0.00      0.00±0.00                0.00±0.00
Survey       1.00±0.00   2.71±0.27     1.00±0.00    2.83±0.00      1.00±0.00                2.16±0.74
Asia         3.09±1.05   1.75±0.43     1.75±0.95    0.97±0.62      3.00±0.00                1.78±0.41
Asia-M       2.50±0.00   2.04±0.51     2.00±0.00    0.65±0.47      1.00±0.00                0.65±0.47
Child        9.61±1.07   2.00±0.00     8.00±0.00    2.00±0.40      6.6±1.14                 2.83±0.00
Neuropathic  -           5.54±0.75     -            6.00±0.00      -                        4.00±0.00
N = 5000
Earthquake   0.30±0.67   0.80±0.97     0.00±0.00    2.00±0.00      0.00±0.00                0.00±0.00
Cancer       0.00±0.00   2.00±0.00     0.00±0.00    0.00±0.00      0.00±0.00                0.00±0.00
Survey       2.00±0.00   3.33±0.69     2.00±0.00    2.60±0.00      2.00±0.00                1.80±0.83
Asia         3.05±0.94   1.75±0.43     2.80±1.30    0.97±0.62      0.00±0.00                1.78±0.41
Asia-M       1.00±0.00   1.68±0.46     0.20±0.44    0.00±0.00      2.00±0.00                1.00±0.00
Child        8.42±0.75   3.00±0.00     7.00±0.00    3.00±0.00      7.4±1.51                 1.00±0.00
Neuropathic  9.00±0.00   4.20±0.96     9.00±0.00    3.00±0.00      9.00±0.00                3.31±0.00
N = 10000
Earthquake   0.00±0.00   0.40±0.48     0.00±0.00    2.00±0.00      0.00±0.00                0.00±0.00
Cancer       0.00±0.00   2.00±0.00     0.60±0.80    0.00±0.00      0.00±0.00                0.00±0.00
Survey       2.00±0.00   3.60±1.35     2.00±0.00    1.83±0.00      2.00±0.00                1.08±0.64
Asia         1.95±0.41   1.40±0.48     1.20±0.83    0.34±0.47      0.00±0.00                1.78±0.41
Asia-M       1.75±0.00   2.00±0.00     0.00±0.00    0.00±0.00      0.00±0.00                1.00±0.00
Child        7.67±0.65   2.80±0.84     4.66±3.05    1.00±0.00      5.80±1.48                1.00±0.00
Neuropathic  10.00±0.00  3.00±0.00     -            2.39±0.48      10.00±0.00               1.00±0.00

Table A1: Dtop metric results. Comparison with various prompting strategies. For the Neuropathic subgraph with 1k samples, PC returns cyclic graphs in the MEC. With Neuropathic 10k samples, using the LLM CoT prior orients the undirected edges in a way that creates cyclic graphs.


B.1 LLMS USED IN POST PROCESSING FOR GRAPH DISCOVERY

We conducted some experiments where we used discovery algorithms such as PC to create skeletons of the graph and employed LLMs to orient the undirected edges. The idea was to use the LLM's ability to correctly estimate the causal direction while leveraging the PC algorithm's ability to produce a skeleton that could be oriented in a post-processing step. We saw that the LLM gave improved results compared to PC alone in most settings (Table A2).

Context          Base prompt   Past iteration orientations   Markov Blanket   PC (Avg. over MEC)
1000 samples
Dtop             8.0           5.3                           6.6              9.61
SHD              14.33         12.66                         14.0             17.0
10000 samples
Dtop             6.33          9.66                          6.0              7.67
SHD              9.0           13.33                         8.33             12.0

Table A2: PC + LLM results where the LLM is used to orient the undirected edges of the skeleton PC returns, over different data sample sizes. We show how LLMs can be used in a post-processing setup for edge orientation, besides being able to act as a strong prior for different discovery algorithms.
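
The post-processing idea in Section B.1 can be sketched as a loop over the undirected skeleton edges, with one query per edge. The code below is a hypothetical illustration rather than the pipeline used for Table A2: `ask` stands in for an LLM call, its return values follow the A/B/C answer format of the pairwise prompt in Table A4, and the canned answers and edge list are invented for the example.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Edge = Tuple[str, str]


def orient_undirected(
    undirected: Iterable[Edge], ask: Callable[[str, str], str]
) -> Tuple[List[Edge], List[Edge]]:
    """Orient each undirected edge (a, b) using an expert's answer.

    'A' means a causes b, 'B' means b causes a, anything else leaves the edge unresolved.
    """
    oriented: List[Edge] = []
    unresolved: List[Edge] = []
    for a, b in undirected:
        answer = ask(a, b)
        if answer == "A":
            oriented.append((a, b))
        elif answer == "B":
            oriented.append((b, a))
        else:
            unresolved.append((a, b))
    return oriented, unresolved


# Stub expert with canned answers; a real setup would query an LLM here.
canned: Dict[Edge, str] = {
    ("Smoker", "Cancer"): "A",
    ("Xray", "Cancer"): "B",
    ("Xray", "Dyspnoea"): "C",
}


def stub_expert(a: str, b: str) -> str:
    return canned.get((a, b), "C")


oriented, unresolved = orient_undirected(list(canned), stub_expert)
print(oriented)    # [('Smoker', 'Cancer'), ('Cancer', 'Xray')]
print(unresolved)  # [('Xray', 'Dyspnoea')]
```

Keeping unresolved edges separate, rather than forcing a direction, mirrors the observation in Section 7.2 that an abstaining expert is preferable to a wrong causal order.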

Dataset                                  Number of Nodes   Number of Edges   Description (used as a context)
Asia                                     8                 8                 Model the possible respiratory problems someone can have who has recently visited Asia and is experiencing shortness of breath
Cancer                                   5                 4                 Model the relation between various variables responsible for causing Cancer and its possible outcomes
Earthquake                               5                 5                 Model factors influencing the probability of a burglary
Survey                                   6                 6                 Model a hypothetical survey whose aim is to investigate the usage patterns of different means of transport
Child                                    20                25                Model congenital heart disease in babies
Neuropathic Pain Diagnosis (subgraph)    22                25                For neuropathic pain diagnosis

Table A3: Overview of the datasets used.


C CAUSAL GRAPHS USED IN EXPERIMENTS

Figures A1-A5 show the causal graphs and details we considered from the BNLearn repository (Scutari & Denis, 2014).

Figure A1: Earthquake Bayesian network. Abbreviations/Descriptions: Burglary: burglar entering, Earthquake: earthquake hitting, Alarm: home alarm going off in a house, JohnCalls: first neighbor to call to inform the alarm sound, MaryCalls: second neighbor to call to inform the alarm sound.

Figure A2: Cancer Bayesian network. Abbreviations/Descriptions: Pollution: exposure to pollutants, Smoker:
smoking habit, Cancer: Cancer, Dyspnoea: Dyspnoea, Xray: getting positive xray result.


Figure A3: Survey Bayesian network. Abbreviations: A=Age/Age of people using transport, S=Sex/male
or female, E=Education/up to high school or university degree, O=Occupation/employee or self-employed,
R=Residence/the size of the city the individual lives in, recorded as either small or big, T=Travel/the means of
transport favoured by the individual.

Figure A4: Asia Bayesian network. Abbreviations/Descriptions: asia=visit to Asia/visiting Asian countries with high exposure to pollutants, smoke=smoking habit, tub=tuberculosis, lung=lung cancer, either=either tuberculosis or lung cancer, bronc=bronchitis, dysp=dyspnoea, xray=getting positive xray result.


Figure A5: Child Bayesian network. Abbreviations: BirthAsphyxia: Lack of oxygen to the blood during
the infant’s birth, HypDistrib: Low oxygen areas equally distributed around the body, HypoxiaInO2: Hypoxia
when breathing oxygen, CO2: Level of carbon dioxide in the body, ChestXray: Having a chest x-ray, Grunting:
Grunting in infants, LVHreport: Report of having left ventricular hypertrophy, LowerBodyO2: Level of oxygen
in the lower body, RUQO2: Level of oxygen in the right upper quadricep muscle, CO2Report: A document
reporting high levels of CO2 levels in blood, XrayReport: Report of having a chest x-ray, Disease: Presence
of an illness, GruntingReport: Report of infant grunting, Age: Age of infant at disease presentation, LVH:
Thickening of the left ventricle, DuctFlow: Blood flow across the ductus arteriosus, CardiacMixing: Mixing of
oxygenated and deoxygenated blood, LungParench: The state of the blood vessels in the lungs, LungFlow: Low
blood flow in the lungs, Sick: Presence of an illness


Figure A6: For the Neuropathic dataset, we consider a sub-graph induced by one of the root nodes, containing the following 22 nodes and corresponding edges taken from https://observablehq.com/@turuibo/the-complete-causal-graph-of-neuropathic-pain-diagnosis: ‘right C7’, ‘right elbow trouble’, ‘left shoulder trouble’, ‘left bend of arm problem’, ‘right shoulder trouble’, ‘right hand problem’, ‘left medival elbow problems’, ‘right finger trouble’, ‘left neck problems’, ‘left wrist problems’, ‘left shoulder problem’, ‘right neck’, ‘right wrist problem’, ‘right shoulder problem’, ‘discoligment injury C6 C7’, ‘left hand problem’, ‘left C7’, ‘right arm band’, ‘left lower arm disorders’, ‘neck pain’, ‘left finger trouble’, ‘left arm’. We did not use descriptions for the nodes of the Neuropathic graph.


D VARIOUS PROMPT STRATEGIES STUDIED

Question: For a causal graph used to model relationship of various factors and outcomes related to
cancer with the following nodes: [’Pollution’, ’Cancer’, ’Smoker’, ’Xray’, ’Dyspnoea’],
Which cause-and-effect relationship is more likely between nodes ’smoker’ and ’cancer’?

A. changing the state of node ’smoker’ causally effects a change in another node ’cancer’.

B. changing the state of node ’cancer’ causally effects a change in another node ’smoker’.

C. There is no causal relation between the nodes ’cancer’ and ’smoker’.

Make sure to first provide a grounded reasoning for your answer and then provide the answer in the
following format: <Answer>A/B/C</Answer>. It is very important that you output the final Causal
graph within the tags like <Answer>A/B/C</Answer> otherwise your answer will not be processed.

Answer: The causal effect of "smoker" directing to "cancer" is based on the strong evidence from epidemiological studies linking smoking to an increased risk of developing cancer. Smoking introduces harmful substances into the respiratory system, leading to cellular damage and mutation, which significantly raises the likelihood of cancer development in the lungs or respiratory tract, subsequently impacting the occurrence of respiratory problems like shortness of breath. Therefore answer is <Answer>A</Answer>

Question: For a causal graph used to model relationship of various factors and outcomes related to cancer with the following nodes: [’Pollution’, ’Cancer’, ’Smoker’, ’Xray’, ’Dyspnoea’], Which cause-and-effect relationship is more likely between nodes ’xray’ and ’dyspnoea’?

A. changing the state of node ’xray’ causally effects a change in another node ’dyspnoea’.

B. changing the state of node ’dyspnoea’ causally effects a change in another node ’xray’.

C. There is no causal relation between the nodes ’xray’ and ’dyspnoea’.

Make sure to first provide a grounded reasoning for your answer and then provide the answer in the
following format: <Answer>A/B/C</Answer>. It is very important that you output the final Causal
graph within the tags like <Answer>A/B/C</Answer> otherwise your answer will not be processed.

Answer: Reasoning behind the lack of causal relation between X-ray and dyspnoea is that X-ray and
dyspnoea are both effects of having cancer, but they do not directly cause or affect each other. X-ray is
a diagnostic test that can help detect cancer in the lungs or other organs, while dyspnoea is a symptom
of cancer that involves feeling short of breath. Therefore, X-ray and dyspnoea are not causally related,
but they are both associated with cancer. Therefore answer is <Answer>C</Answer>

Question: For a causal graph used to model relationship of various factors and outcomes related to cancer with the following nodes: [’Pollution’, ’Cancer’, ’Smoker’, ’Xray’, ’Dyspnoea’], Which cause-and-effect relationship is more likely between nodes ’xray’ and ’cancer’?

A. changing the state of node ’xray’ causally effects a change in another node ’cancer’.

B. changing the state of node ’cancer’ causally effects a change in another node ’xray’.

C. There is no causal relation between the nodes ’xray’ and ’cancer’.

Make sure to first provide a grounded reasoning for your answer and then provide the answer in the
following format: <Answer>A/B/C</Answer>. It is very important that you output the final Causal
graph within the tags like <Answer>A/B/C</Answer> otherwise your answer will not be processed.

Table A4: Chain of Thought Prompt


Answer: The causal effect of cancer on X-ray is that X-rays are often used to diagnose or detect
cancer in different parts of the body, such as the bones, lungs, breasts, or kidneys. X-rays use low
doses of radiation to create pictures of the inside of the body and show the presence, size, and location
of tumors. X-rays can also help monitor the response to treatment or check for signs of recurrence.
Therefore, having cancer may increase the likelihood of getting an X-ray as part of the diagnostic
process or follow-up care. Therefore answer is <Answer>B</Answer>

Question: For a causal graph used to model relationship of various factors and outcomes related to
cancer with the following nodes: [’Pollution’, ’Cancer’, ’Smoker’, ’Xray’, ’Dyspnoea’], Which
cause-and-effect relationship is more likely between nodes ’pollution’ and ’cancer’?

A. changing the state of node ’pollution’ causally effects a change in another node ’cancer’.

B. changing the state of node ’cancer’ causally effects a change in another node ’pollution’.

C. There is no causal relation between the nodes ’pollution’ and ’cancer’.

Make sure to first provide a grounded reasoning for your answer and then provide the answer
in the following format: <Answer>A/B/C</Answer>. It is very important that you output the
final Causal graph within the tags like <Answer>A/B/C</Answer> otherwise your answer will not be
processed.

Answer: The causal effect of pollution on cancer is that air pollution contains carcinogens
(cancer-causing substances) that may be absorbed into the body when inhaled and damage the DNA of cells.
Another possible reasoning is that particulate matter (tiny dust-like particles) in air pollution may cause
physical damage to the cells in the lungs, leading to inflammation and oxidative stress and eventually
cell mutations. A third possible reasoning is that air pollution may create an inflamed environment in
the lungs that encourages the proliferation of cells with existing cancer-driving mutations. These are
some of the hypotheses that researchers have proposed to explain how air pollution may cause cancer,
but more studies are needed to confirm them. Therefore answer is <Answer>A</Answer>

Question: For a causal graph used to model relationship of various factors and outcomes related to
cancer with the following nodes: [’Pollution’, ’Cancer’, ’Smoker’, ’Xray’, ’Dyspnoea’], Which
cause-and-effect relationship is more likely between nodes ’pollution’ and ’smoker’?

A. changing the state of node ’pollution’ causally effects a change in another node ’smoker’.

B. changing the state of node ’smoker’ causally effects a change in another node ’pollution’.

C. There is no causal relation between the nodes ’pollution’ and ’smoker’.

Make sure to first provide a grounded reasoning for your answer and then provide the answer
in the following format: <Answer>A/B/C</Answer>. It is very important that you output the
final Causal graph within the tags like <Answer>A/B/C</Answer> otherwise your answer will not be
processed.

Answer: Reason behind the lack of causal relation between pollution and smoker is that pollution
and smoking are both independent risk factors for respiratory problems, but they do not directly cause or
affect each other. Pollution and smoking both contribute to air pollution, which can harm the health of
people and the environment. However, pollution is mainly caused by human activities such as burning
fossil fuels, deforestation, or industrial processes, while smoking is a personal choice that involves
inhaling tobacco smoke. Therefore, pollution and smoker are not causally related, but they are both
associated with respiratory problems. Therefore answer is <Answer>C</Answer>.

Table A5: Chain of Thought Prompt (continued..)


Question: For a causal graph used for modeling factors causing Coronary Heart Diseases with the
following nodes: [’Family Disease’, ’Gene’, ’Smoking’, ’Blood Pressure’, ’Coronary Heart Disease’,
’Headache’], Which cause-and-effect relationship is more likely between nodes ’Family Disease’ and
’Gene’?

A. changing the state of node ’Family Disease’ causally effects a change in another node ’Gene’.

B. changing the state of node ’Gene’ causally effects a change in another node ’Family Disease’.

C. There is no causal relation between the nodes ’Family Disease’ and ’Gene’.

Make sure to first provide a grounded reasoning for your answer and then provide the answer in the
following format: <Answer>A/B/C</Answer>. It is very important that you output the final Causal
graph within the tags like <Answer>A/B/C</Answer> otherwise your answer will not be processed.

Answer: Reason behind the causal effect of family disease on gene is that family disease is a term that
refers to diseases or health conditions that run in the family, meaning that they are influenced by genetic
factors. Gene is a term that refers to the basic unit of heredity that carries information for a specific trait
or function. Family disease can affect gene by altering the type or frequency of genes that are inherited
by the offspring from their parents. For example, some family diseases are caused by deterministic
genes, which are genes that guarantee the development of a disease if they are present in a person’s
genome. Other family diseases are influenced by risk genes, which are genes that increase the likelihood
of developing a disease but do not guarantee it. Therefore, family disease can causally direct to gene by
changing the genetic makeup of a person or a population. Therefore answer is <Answer>A</Answer>.

Question: For a causal graph used for modeling factors causing Coronary Heart Diseases with the
following nodes: [’Family Disease’, ’Gene’, ’Smoking’, ’Blood Pressure’, ’Coronary Heart Disease’,
’Headache’], Which cause-and-effect relationship is more likely between nodes ’Coronary Heart
Disease’ and ’Gene’?

A. changing the state of node ’Coronary Heart Disease’ causally effects a change in another node
’Gene’.

B. changing the state of node ’Gene’ causally effects a change in another node ’Coronary Heart
Disease’.

C. There is no causal relation between the nodes ’Coronary Heart Disease’ and ’Gene’.

Make sure to first provide a grounded reasoning for your answer and then provide the answer in the
following format: <Answer>A/B/C</Answer>. It is very important that you output the final Causal
graph within the tags like <Answer>A/B/C</Answer> otherwise your answer will not be processed.

Answer: Possible reasoning behind the causal effect of gene on coronary heart disease is that gene is
a term that refers to the basic unit of heredity that carries information for a specific trait or function.
Gene can affect coronary heart disease by influencing the structure and function of the blood vessels,
the metabolism and transport of lipids (fats) in the blood, the inflammation and clotting processes, or
the response to environmental factors such as smoking or diet. For example, some genes code for
proteins that regulate the cell cycle and growth of the cells that line the arteries, which can affect their
susceptibility to damage or plaque formation. Other genes code for proteins that control the synthesis
and clearance of cholesterol or other lipids, which can affect their levels and deposition in the arteries.
Therefore, gene can causally direct to coronary heart disease by modifying the biological pathways that
contribute to the development or progression of the disease. Therefore answer is <Answer>B</Answer>

Table A6: Chain of Thought Prompt (continued..)


Question: For a causal graph used for modeling factors causing Coronary Heart Diseases with the
following nodes: [’Family Disease’, ’Gene’, ’Smoking’, ’Blood Pressure’, ’Coronary Heart Disease’,
’Headache’], Which cause-and-effect relationship is more likely between nodes ’Blood Pressure’ and
’Smoking’?

A. changing the state of node ’Blood Pressure’ causally effects a change in another node ’Smoking’.

B. changing the state of node ’Smoking’ causally effects a change in another node ’Blood Pressure’.

C. There is no causal relation between the nodes ’Blood Pressure’ and ’Smoking’.

Make sure to first provide a grounded reasoning for your answer and then provide the answer in the
following format: <Answer>A/B/C</Answer>. It is very important that you output the final Causal
graph within the tags like <Answer>A/B/C</Answer> otherwise your answer will not be processed.

Answer: Possible reasoning behind the causal effect of smoking on blood pressure is that smoking is
a habit that involves inhaling tobacco smoke, which contains nicotine and other harmful chemicals.
Smoking can affect blood pressure by activating the sympathetic nervous system (SNS), which is the
part of the nervous system that controls the body’s response to stress or danger. When the SNS is
activated, it releases hormones such as adrenaline and noradrenaline, which cause the heart to beat faster
and harder, and the blood vessels to constrict. This results in a temporary increase in blood pressure,
which can last for 15 to 20 minutes after each cigarette. Therefore, smoking can causally direct to blood
pressure by stimulating the SNS and increasing the cardiac output and vascular resistance. Therefore
answer is <Answer>B</Answer>.

Question: For a causal graph used for modeling factors causing Coronary Heart Diseases with the
following nodes: [’Family Disease’, ’Gene’, ’Smoking’, ’Blood Pressure’, ’Coronary Heart Disease’,
’Headache’], Which cause-and-effect relationship is more likely between nodes ’Headache’ and
’Smoking’?

A. changing the state of node ’Headache’ causally effects a change in another node ’Smoking’.

B. changing the state of node ’Smoking’ causally effects a change in another node ’Headache’.

C. There is no causal relation between the nodes ’Headache’ and ’Smoking’.

Make sure to first provide a grounded reasoning for your answer and then provide the answer in the
following format: <Answer>A/B/C</Answer>. It is very important that you output the final Causal
graph within the tags like <Answer>A/B/C</Answer> otherwise your answer will not be processed.

Answer: One possible reasoning behind the lack of causal relation between headache and smoking is
that headache and smoking are both associated with various health conditions, but they do not directly
cause or affect each other. Headache is a term that refers to pain or discomfort in the head, scalp, or
neck, which can have many possible causes, such as stress, dehydration, infection, injury, or medication.
Smoking is a habit that involves inhaling tobacco smoke, which contains nicotine and other harmful
chemicals, which can increase the risk of diseases such as cancer, heart disease, stroke, and lung disease.
Therefore, headache and smoking are not causally related, but they are both linked to different health
problems. Therefore the answer is <Answer>C</Answer>

Table A7: Chain of Thought Prompt (continued..)


Question: For a causal graph used for modeling factors causing Coronary Heart Diseases with the
following nodes: [’Family Disease’, ’Gene’, ’Smoking’, ’Blood Pressure’, ’Coronary Heart Disease’,
’Headache’], Which cause-and-effect relationship is more likely between nodes ’Coronary Heart
Disease’ and ’Smoking’?

A. changing the state of node ’Smoking’ causally effects a change in another node ’Coronary Heart
Disease’.

B. changing the state of node ’Coronary Heart Disease’ causally effects a change in another node
’Smoking’.

C. There is no causal relation between the nodes ’Coronary Heart Disease’ and ’Smoking’.

Make sure to first provide a grounded reasoning for your answer and then provide the answer in the
following format: <Answer>A/B/C</Answer>. It is very important that you output the final Causal
graph within the tags like <Answer>A/B/C</Answer> otherwise your answer will not be processed.

Answer: Possible reasoning behind the causal effect of smoking on coronary heart disease is smoking
damages the heart and blood vessels by raising triglycerides, lowering HDL, increasing blood clotting,
and impairing blood flow to the heart. This can lead to plaque buildup, heart attacks, and death.
Therefore answer is <Answer>A</Answer>.

Question: For a causal graph used for {context} with the following nodes: {nodes}, Which cause-and-effect
relationship is more likely between nodes X and Y?

A. changing the state of node X causally effects a change in another node Y.

B. changing the state of node Y causally effects a change in another node X.

C. There is no causal relation between the nodes X and Y.

Make sure to first provide a grounded reasoning for your answer and then provide the answer in the
following format: <Answer>A/B/C</Answer>. It is very important that you output the final Causal
graph within the tags like <Answer>A/B/C</Answer> otherwise your answer will not be processed.

Table A8: Chain of Thought Prompt (continued..)
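
As a rough illustration of how the final query in the prompt above might be instantiated and its reply post-processed, the following Python sketch fills in the placeholders and extracts the chosen option from the <Answer> tags. The names QUESTION_TEMPLATE, build_question, and parse_choice are hypothetical and are not taken from the paper's code; reading the last tag in the reply is also an assumption.

```python
import re

# Pairwise question template, following the wording of the prompt above.
QUESTION_TEMPLATE = (
    "Question: For a causal graph used for {context} with the following nodes: "
    "{nodes}, Which cause-and-effect relationship is more likely between nodes "
    "'{x}' and '{y}'?\n\n"
    "A. changing the state of node '{x}' causally effects a change in another "
    "node '{y}'.\n\n"
    "B. changing the state of node '{y}' causally effects a change in another "
    "node '{x}'.\n\n"
    "C. There is no causal relation between the nodes '{x}' and '{y}'.\n\n"
    "Make sure to first provide a grounded reasoning for your answer and then "
    "provide the answer in the following format: <Answer>A/B/C</Answer>."
)

# Matches a single option letter wrapped in the answer tags.
ANSWER_RE = re.compile(r"<Answer>\s*([ABC])\s*</Answer>")


def build_question(context, nodes, x, y):
    """Fill the template for one ordered pair of nodes (x, y)."""
    return QUESTION_TEMPLATE.format(context=context, nodes=nodes, x=x, y=y)


def parse_choice(reply):
    """Return 'A', 'B' or 'C' from the last answer tag, or None if absent."""
    found = ANSWER_RE.findall(reply)
    return found[-1] if found else None


question = build_question(
    "modeling factors causing Coronary Heart Diseases",
    ["Gene", "Smoking", "Blood Pressure"],
    "Blood Pressure",
    "Smoking",
)
choice = parse_choice("Smoking activates the SNS. Therefore answer is <Answer>B</Answer>.")
```

Returning None for a reply without tags lets the caller decide whether to retry the query or to treat the pair as unresolved.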


Which cause-and-effect relationship is more likely?

A. changing the state of node which says X causally effects a change in another node which
says Y.

B. changing the state of node which says Y causally effects a change in another node which
says X.

C. There is no causal relationship between node X and Y.

Make sure to first output a factually grounded reasoning for your answer. X and Y are nodes
of a Causal Graph. The causal graph is sparse and acyclic in nature. So option C could be
chosen if there is some uncertainty about causal relationship between X and Y.

First give your reasoning and after that please make sure to provide your final answer within
the tags <Answer>A/B/C</Answer>.
It is very important that you output your final answer between the tags like
<Answer>A/B/C</Answer> otherwise your response will not be processed.

Table A9: Base prompt

For the nodes X and Y which form an edge in a Causal Graph, you have to identify which
cause-and-effect relationship is more likely between the nodes of the edge. This will be used
to rearrange the nodes in the edge to create a directed edge which accounts for causal relation
from one node to another in the edge.

A. changing the state of node X causally affects a change in another node Y.

B. changing the state of node Y causally affects a change in another node X.

C. There is no causal relation between the nodes X and Y.

You can also take the edges from the skeleton which have been rearranged to create a directed
edge to account for causal relationship between the nodes: directed_edges.
Make sure to first output a factually grounded reasoning for your answer. First give your
reasoning and after that please make sure to provide your final answer within the tags
<Answer>A/B/C</Answer>.
It is very important that you output your final answer between the tags like
<Answer>A/B/C</Answer> otherwise your response will not be processed.

Table A10: Iterative orientation prompt
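
One way to read the iterative orientation prompt is as a loop over the skeleton edges in which every decision made so far is passed back as the directed_edges context. The sketch below follows that reading. The function orient_iteratively, the abbreviated template, and the ask_llm callable are hypothetical stand-ins, and mapping an unparsable reply to option C is an assumption rather than something stated in the paper.

```python
import re

ANSWER_RE = re.compile(r"<Answer>\s*([ABC])\s*</Answer>")

# Abbreviated stand-in for the prompt above; only the placeholders matter here.
ITERATIVE_TEMPLATE = (
    "For the nodes {x} and {y} which form an edge in a Causal Graph, identify "
    "which cause-and-effect relationship is more likely (A: {x} causes {y}, "
    "B: {y} causes {x}, C: no causal relation). "
    "Edges oriented so far: {directed_edges}."
)


def orient_iteratively(skeleton_edges, ask_llm):
    """Orient skeleton edges one at a time, feeding earlier decisions back in."""
    directed = []
    for x, y in skeleton_edges:
        prompt = ITERATIVE_TEMPLATE.format(x=x, y=y, directed_edges=directed)
        found = ANSWER_RE.findall(ask_llm(prompt))
        choice = found[-1] if found else "C"  # assumption: no tag means no edge
        if choice == "A":
            directed.append((x, y))
        elif choice == "B":
            directed.append((y, x))
        # Option C leaves the pair without a directed edge.
    return directed


def fake_llm(prompt):
    """Stub standing in for a real model call, used only to exercise the loop."""
    if "so far: []" in prompt:
        return "No context yet. <Answer>A</Answer>"
    return "Using earlier edges. <Answer>B</Answer>"


result = orient_iteratively([("smoke", "lung"), ("dysp", "lung")], fake_llm)
```

Passing the model call in as a function keeps the loop testable without network access.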


For the following undirected edge in a Causal Graph made of nodes X and Y, you have to
identify which cause-and-effect relationship is more likely between the nodes of the edge.
This will be used to rearrange the nodes in the edge to create a directed edge which accounts
for causal relation from one node to another in the edge.

A. changing the state of node X causally effects a change in another node Y.

B. changing the state of node Y causally effects a change in another node X.

C. There is no causal relation between the nodes X and Y.

You can also take the other directed edges of nodes X: X_edges and Y: Y_edges of the Causal
graph as context to redirect the edge to account for causal effect.
Make sure to first output a factually grounded reasoning for your answer. First give your
reasoning and after that please make sure to provide your final answer within the tags
<Answer>A/B/C</Answer>.
It is very important that you output your final answer between the tags like
<Answer>A/B/C</Answer> otherwise your response will not be processed.

Table A11: Markov Blanket prompt
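
The X_edges and Y_edges placeholders in the prompt above could be filled by collecting the already directed edges that touch each endpoint. The following sketch shows one such construction. The helper names are hypothetical, the queried pair is an illustrative choice, and the exact contents the authors pass for these placeholders are not specified here, so this is an assumption.

```python
def incident_edges(node, directed_edges):
    """Return the directed edges that have `node` as cause or effect."""
    return [edge for edge in directed_edges if node in edge]


def neighbourhood_context(x, y, directed_edges):
    """Build the X_edges / Y_edges context for the undirected pair (x, y)."""
    # Exclude any edge between x and y themselves: it is the one being decided.
    others = [edge for edge in directed_edges if set(edge) != {x, y}]
    return {
        "X_edges": incident_edges(x, others),
        "Y_edges": incident_edges(y, others),
    }


# Directed edges taken from the example answers in the Chain of Thought prompt.
directed = [
    ("Gene", "Coronary Heart Disease"),
    ("Smoking", "Blood Pressure"),
    ("Smoking", "Coronary Heart Disease"),
]
context = neighbourhood_context("Blood Pressure", "Coronary Heart Disease", directed)
```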

Identify the causal relationships between the given variables and create a directed acyclic graph
to {context}. Make sure to give a reasoning for your answer and then output the directed graph
in the form of a list of tuples, where each tuple is a directed edge. The desired output should
be in the following form: [(‘A’,‘B’), (‘B’,‘C’)] where first tuple represents a directed edge from
Node ‘A’ to Node ‘B’, second tuple represents a directed edge from Node ‘B’ to Node ‘C’ and
so on.

If a node should not form any causal relationship with other nodes, then you can add it as an
isolated node of the graph by adding it separately. For example, if ‘C’ should be an isolated
node in a graph with nodes ‘A’, ‘B’, ‘C’, then the final DAG representation should be like
[(‘A’,‘B’), (‘C’)].
Use the description about the node provided with the nodes in brackets to form a better decision
about the causal direction orientation between the nodes.

It is very important that you output the final Causal graph within the tags
<Answer></Answer> otherwise your answer will not be processed.

Example:
Input: Nodes: [‘A’, ‘B’, ‘C’, ‘D’];
Description of Nodes: [(description of Node A), (description of Node B), (description of Node
C), (description of Node D)]
Output: <Answer>[(‘A’,‘B’),(‘C’,‘D’)]</Answer>
Question:
Input: Nodes: {Triplet Nodes Input}
Description of Nodes: {Description of Each Node from the Triplet}
Output:

Table A12: The triplet prompt template, which includes a concise context of the graph, the input
triplet nodes and their respective descriptions. As an example, for the Child graph, the context is
"to model congenital heart disease in babies", the three nodes may be (‘HypoxiaInO2’, ‘Grunting’,
‘GruntingReport’); and their node descriptions are ["hypoxia when breathing oxygen", "grunting in
infants", "report of infant grunting"] respectively.
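
To make the output format requested by the triplet prompt concrete, the sketch below enumerates all node triplets and parses a reply of the form [(‘A’,‘B’), (‘C’)] into directed edges and isolated nodes. The function names are hypothetical, and the handling of malformed replies (returning empty lists) is an assumption rather than the paper's procedure.

```python
import ast
import re
from itertools import combinations

ANSWER_RE = re.compile(r"<Answer>(.*?)</Answer>", re.DOTALL)

# Map typographic quotes to plain ones so the list reads as a Python literal.
QUOTE_MAP = str.maketrans("‘’“”", "''\"\"")


def triplets(nodes):
    """Enumerate every unordered group of three nodes."""
    return list(combinations(nodes, 3))


def parse_triplet_answer(reply):
    """Split the tagged list into (directed edges, isolated nodes)."""
    match = ANSWER_RE.search(reply)
    if match is None:
        return [], []
    try:
        items = ast.literal_eval(match.group(1).translate(QUOTE_MAP).strip())
    except (ValueError, SyntaxError):
        return [], []
    if not isinstance(items, list):
        return [], []
    edges, isolated = [], []
    for item in items:
        if isinstance(item, tuple) and len(item) == 2:
            edges.append(item)
        elif isinstance(item, str):
            # ('C') is a parenthesised string, not a tuple, once evaluated.
            isolated.append(item)
    return edges, isolated


groups = triplets(["A", "B", "C", "D"])
edges, isolated = parse_triplet_answer(
    "A drives B; C is unrelated. <Answer>[(‘A’,‘B’), (‘C’)]</Answer>"
)
```

Using ast.literal_eval rather than eval restricts parsing to plain literals, which is a safer choice for model-generated text.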


Input: (‘HypDistrib’, ‘LowerBodyO2’)

Answer: Low oxygen areas equally distributed around the body can affect the level of oxygen in the
lower body by reducing the amount of oxygen available for circulation. Therefore, the answer is
<Answer>A</Answer>.

Input: (‘Disease’, ‘LungFlow’)

Answer: Infant methemoglobinemia is a condition where the blood is unable to carry enough oxygen,
which can lead to low blood flow in the lungs. Therefore, the answer is <Answer>A</Answer>.

Input: (‘CardiacMixing’, ‘HypDistrib’)

Answer: Mixing of oxygenated and deoxygenated blood is a common cause of congenital heart disease
in babies. This mixing can occur due to structural defects in the heart, such as a hole between the heart
chambers or abnormal connections between the blood vessels. This mixing can lead to low oxygen
areas equally distributed around the body. Therefore, the answer is <Answer>B</Answer>.

Table A13: Example LLM (GPT-3.5-turbo) outputs for estimating the causal direction between a
given pair of nodes (with their descriptions) using the CoT prompt (see Table A4 for the prompt).
Given a pair (X, Y), option A refers to X causing Y, option B refers to Y causing X, and option C
refers to no causal relationship between the variables. The pairs shown are from the Child graph
from the BNLearn repository (Scutari & Denis, 2014), which is used to model congenital heart
disease in infants.
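
The option letters in the table above can be turned into directed edges, and from those into a causal order, roughly as follows. This is a minimal sketch: choice_to_edge and causal_order are hypothetical names, and the use of the standard-library graphlib module (Python 3.9+) is an illustrative choice, not the paper's implementation. The answers dictionary simply copies the three outputs reported in the table.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def choice_to_edge(pair, choice):
    """Map an option letter for the pair (X, Y) to a directed edge, or None."""
    x, y = pair
    if choice == "A":
        return (x, y)  # X causes Y
    if choice == "B":
        return (y, x)  # Y causes X
    return None        # option C: no causal relation


def causal_order(nodes, edges):
    """Return one topological order of the nodes; raises CycleError on cycles."""
    predecessors = {node: set() for node in nodes}
    for cause, effect in edges:
        predecessors[effect].add(cause)
    return list(TopologicalSorter(predecessors).static_order())


# The three (pair, option) outputs listed in the table.
answers = {
    ("HypDistrib", "LowerBodyO2"): "A",
    ("Disease", "LungFlow"): "A",
    ("CardiacMixing", "HypDistrib"): "B",
}
edges = [e for e in (choice_to_edge(p, c) for p, c in answers.items()) if e]
nodes = sorted({node for pair in answers for node in pair})
order = causal_order(nodes, edges)
```

If pairwise answers contradict each other, static_order raises graphlib.CycleError; how such cycles should be resolved is outside the scope of this sketch.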

Input: (‘Right C7’,‘Discoligment injury C6-C7’)

Answer: Discoligment injury C6-C7 can cause compression of the nerve roots that exit the
spinal cord at the C7 level, which can lead to symptoms such as pain, numbness, and
weakness in the right C7 dermatome. Therefore, the answer is <Answer>B</Answer>.

Input: (‘Right C7’, ‘Left C7’)

Answer: Right C7 and left C7 are both parts of the cervical spine and are not known to
directly influence each other. Therefore, the answer is <Answer>C</Answer>.

Input: (‘Right elbow trouble’, ‘Left shoulder trouble’)

Answer: There is no direct causal relationship between right elbow trouble and left shoulder
trouble. They may both be symptoms of a larger underlying condition, but they do not directly
cause or affect each other. Therefore the answer is <Answer>C</Answer>.

Table A14: Example LLM (GPT-3.5-turbo) reasoning outputs for estimating causal directionality
between different pairs of nodes using the CoT prompt (see Table A4 for the prompt) on the
Neuropathic subgraph (used for pain diagnosis).


Dataset   Samples   LLM          Ground Truth   PC (Average over MEC)
Asia      250       1.00±0.00    0.00±0.00      2.00±0.00
Asia      1000      3.00±0.00    2.00±0.00      3.00±0.00
Asia      10000     3.00±0.00    3.00±0.00      3.00±0.00
Child     250       5.00±0.00    5.00±0.00      6.50±0.00
Child     1000      6.00±0.00    6.00±0.00      8.43±0.00
Child     10000     9.00±0.00    9.00±0.00      9.75±0.00

Table A15: Comparison of D_top of the final graph when the LLM order vs. the ground-truth order is
used as a prior to the PC algorithm, for the Asia and Child graphs, averaged over 4 runs.
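
The metric is not redefined in this part of the appendix. Assuming D_top is the usual topological divergence, i.e. the number of ground-truth edges whose cause is placed after its effect in a given order, it can be computed as in the sketch below. The function name is hypothetical, and the edge list is the small cancer graph implied by the example answers in the Chain of Thought prompt, used here only for illustration.

```python
def topological_divergence(order, true_edges):
    """Count true edges (cause, effect) that the order places backwards."""
    position = {node: index for index, node in enumerate(order)}
    return sum(1 for cause, effect in true_edges if position[cause] > position[effect])


# Edges consistent with the example answers in the Chain of Thought prompt.
cancer_edges = [
    ("Pollution", "Cancer"),
    ("Smoker", "Cancer"),
    ("Cancer", "Xray"),
    ("Cancer", "Dyspnoea"),
]

valid_order = ["Pollution", "Smoker", "Cancer", "Xray", "Dyspnoea"]
one_error = ["Xray", "Pollution", "Smoker", "Cancer", "Dyspnoea"]
reversed_order = list(reversed(valid_order))

d_valid = topological_divergence(valid_order, cancer_edges)
d_one = topological_divergence(one_error, cancer_edges)
d_reversed = topological_divergence(reversed_order, cancer_edges)
```

Under this definition a value of zero means that every true edge is compatible with the order, so no edge is ruled out by using the order as a prior.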

Dataset   Samples   ϵ_ATE(S1)    ϵ_ATE(S2)    ϵ_ATE(S3)    ∆12          ∆13
Asia      250       0.70±0.40    0.70±0.39    0.69±0.39    0.00±0.00    0.00±0.00
Asia      500       0.64±0.39    0.64±0.39    0.64±0.38    0.00±0.00    0.00±0.00
Asia      1000      0.59±0.32    0.59±0.32    0.59±0.32    0.00±0.00    0.00±0.00
Asia      5000      0.59±0.30    0.59±0.30    0.59±0.29    0.00±0.00    0.00±0.00
Asia      10000     0.49±0.00    0.49±0.00    0.49±0.00    0.00±0.00    0.00±0.00

Table A16: Results on the Asia dataset. We compare the estimated causal effect of lung on dyspnoea
obtained with the backdoor set S1 = {smoke} against the effects obtained when all variables that
precede the treatment in two topological orders are used as backdoor sets: S2 = {asia, smoke} and
S3 = {asia, tub, smoke}. ∆12 and ∆13 denote the absolute differences between ϵ_ATE(S1) and
ϵ_ATE(S2), and between ϵ_ATE(S1) and ϵ_ATE(S3), respectively. The last two columns show that using
the variables that come before the treatment node in a topological order as a backdoor set does
not cause the causal effects to deviate from the ground-truth effects.
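
The observation in the caption above can be reproduced exactly on a toy model. The sketch below applies the backdoor adjustment formula, E[Y | do(T=t)] = Σ_z P(z) P(Y=1 | T=t, Z=z), to a fully specified four-variable binary network and compares a minimal adjustment set with a larger one containing an additional pre-treatment variable. All probabilities and variable names are invented for illustration; they are not the Asia network's parameters, and this is not the estimator used in the paper.

```python
from itertools import product
from math import isclose

VARS = ("a", "s", "t", "y")  # a: extra pre-treatment cause of y, s: confounder
P_A1, P_S1 = 0.1, 0.5
P_T1_GIVEN_S = {0: 0.1, 1: 0.6}


def p_y1(t, s, a):
    """P(y=1 | t, s, a); the true effect of t on y is 0.4 by construction."""
    return 0.1 + 0.4 * t + 0.3 * s + 0.1 * a


def joint():
    """Enumerate the full joint distribution over (a, s, t, y)."""
    table = {}
    for a, s, t, y in product((0, 1), repeat=4):
        pa = P_A1 if a else 1 - P_A1
        ps = P_S1 if s else 1 - P_S1
        pt = P_T1_GIVEN_S[s] if t else 1 - P_T1_GIVEN_S[s]
        py = p_y1(t, s, a) if y else 1 - p_y1(t, s, a)
        table[(a, s, t, y)] = pa * ps * pt * py
    return table


def prob(table, **fixed):
    """Marginal probability of the assignment given as keyword arguments."""
    index = {name: VARS.index(name) for name in fixed}
    return sum(p for key, p in table.items()
               if all(key[index[n]] == v for n, v in fixed.items()))


def backdoor_mean(table, t, adjust):
    """Sum over z of P(z) * P(y=1 | t, z) for the adjustment set `adjust`."""
    total = 0.0
    for values in product((0, 1), repeat=len(adjust)):
        z = dict(zip(adjust, values))
        total += prob(table, **z) * prob(table, y=1, t=t, **z) / prob(table, t=t, **z)
    return total


def ate(table, adjust):
    return backdoor_mean(table, 1, adjust) - backdoor_mean(table, 0, adjust)


table = joint()
ate_minimal = ate(table, ("s",))      # analogous to S1
ate_larger = ate(table, ("a", "s"))   # analogous to S2: adds a pre-treatment variable
ate_unadjusted = ate(table, ())       # no adjustment, confounded by s
```

In this toy model both adjustment sets recover the true effect, whereas the unadjusted contrast is biased by the confounder; adding further non-descendants of the treatment to a valid backdoor set leaves the estimand unchanged.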
