Causal Inference Using LLM-Guided Discovery

ABSTRACT
At the core of causal inference lies the challenge of determining reliable causal graphs solely based on observational data. Since the well-known back-door criterion depends on the graph structure, any errors in the estimated graph structure affect the correctness of the estimated causal effects. In this work, to construct a valid back-door adjustment set, we propose to use a topological or causal order among graph nodes, which is easier to obtain from domain experts. Given a node pair, causal order is easier to elicit from domain experts than graph edges, since determining the existence of an edge depends extensively on other variables. Interestingly, we observe that the same principle holds for Large Language Models (LLMs) such as GPT-3.5-turbo and GPT-4, motivating an automated method to obtain causal order (and hence causal effect) with LLMs acting as virtual domain experts. To this end, we employ different prompting strategies and contextual cues to propose a robust technique for obtaining causal order from LLMs. Acknowledging LLMs' limitations, we also study possible techniques to integrate LLMs with established causal discovery algorithms, including constraint-based and score-based methods, to enhance their performance. Extensive experiments demonstrate that our approach significantly improves causal ordering accuracy as compared to discovery algorithms, highlighting the potential of LLMs to enhance causal inference across diverse fields.
1 INTRODUCTION
Causal inference plays a pivotal role across scientific disciplines, aiding researchers in uncovering fundamental causal relationships and how they affect observed phenomena. For example, causal inference is used to discern the causes of diseases and design effective interventions for diagnosis and treatment in epidemiology (Mahmood et al., 2014), to evaluate policy impact based on observational studies in economics (Imbens & Rubin, 2015), and to understand the effects of pollution on ecosystems in environmental science (Boslaugh, 2023). A key technical question for these studies is estimating the causal effect of variables on a specific outcome variable.
Inferring causal effect from observational data, however, is a challenging task because the effect estimate depends critically on the causal graph considered in the analysis. While there has been progress in graph discovery algorithms, especially for specific parametric settings (Shimizu et al., 2006; Hoyer et al., 2008b; Hyvärinen et al., 2010; Rolland et al., 2022), studies on real-world datasets such as from atmospheric science (Huang et al., 2021) and healthcare (Tu et al., 2019) show that inferring the causal graph from data remains a challenging problem in practice (Reisach et al., 2021). Hence, causal inference studies often rely on human experts to provide the causal graph.
In this paper, based on the fact that the topological/causal order over the graph variables is enough for effect inference, we leverage Large Language Models (LLMs) as virtual domain experts to propose an automated method to query causal order (and hence causal effect). Moreover, providing the order between variables is the right question to ask experts because order depends only on the variables under question, unlike the existence of a graph edge, which depends on which other variables are present (to account for direct and indirect effects). For example, consider the data-generating process, lung cancer → doctor visit → positive Xray. If an expert is asked whether there should be a causal edge from lung cancer to positive Xray, they would answer "Yes" (indeed, such an edge exists in the BNLearn Cancer dataset (Scutari & Denis, 2014)). However, if they are told that the set of observed variables additionally includes doctor visit, then the correct answer would be to not create a direct edge between lung cancer and positive Xray, but rather create edges mediated through doctor visit. However, note that the causal order, lung cancer ≺ positive Xray, remains the same in both settings (a ≺ b indicates that a occurs before b in a causal process).

Under review as a conference paper at ICLR 2024

Figure 1: The LLM-augmented causal inference process based on inferring causal order. We propose a triplet-based prompting technique to infer all three-variable subgraphs and aggregate them using majority voting to produce a causal order. The causal order can then be used to identify a valid backdoor adjustment set. Ties in causal order are broken using another LLM (e.g., GPT-4). For robustness, LLM-generated causal order may be used in conjunction with discovery algorithms like PC or CaMML.
We show that large language models (LLMs) like GPT-3.5 (Hagendorff et al., 2022) and GPT-4 (Peng et al., 2023) can be used to approximate experts' capability to provide causal order, thereby automating the process of causal inference. Extending results using LLMs for pairwise causal discovery (Kıcıman et al., 2023), we find that LLMs can provide accurate causal order for a diverse set of benchmark graphs. To do so, we propose a novel triplet-based prompting strategy that asks the LLM to consider three variables at once, compared to the pairwise prompts employed in past work (Kıcıman et al., 2023; Willig et al., 2022; Long et al., 2023; 2022). Causal order output using the triplet-based prompt outperforms pairwise prompts across all benchmark datasets we considered. Importantly, it avoids cycles in the predicted order, whereas LLM outputs from pairwise prompts often yield cycles.
Still, LLMs can exhibit unknown failure modes. Therefore, a more principled approach is to adapt existing graph discovery algorithms to utilize LLM output. To this end, we present two algorithms based on constraint-based and score-based discovery algorithms respectively. The first uses the causal order from an LLM to orient the undirected edges output by a constraint-based algorithm such as PC (Spirtes et al., 2000). The second utilizes the LLM causal order as a prior for a score-based algorithm like CaMML (Wallace et al., 1996). Results show that LLM-augmented algorithms outperform the base causal discovery algorithms in determining the causal order. The overall methodology is depicted in Figure 1. Our contributions can be summarized as follows.
• We argue that in causal effect estimation, querying a domain expert for a causal order is more principled than asking for the exact causal structure among variables.
• We provide a novel prompting strategy based on triplets and show that LLMs like GPT-3.5 can be used to obtain causal order for a diverse range of datasets.
• We propose two algorithms combining causal discovery algorithms with LLM output and show that the final causal order is substantially more accurate than that of the discovery algorithms alone.
2 RELATED WORK
Combining graph discovery and causal inference. Historically, causal discovery and causal effect inference have been studied separately. Graph discovery algorithms can broadly be divided into (i) algorithms using conditional independence tests (constraint-based) (Glymour et al., 2019); (ii) algorithms using a score function to evaluate the predicted graph (score-based) (Glymour et al., 2019); (iii) algorithms that determine a causal order and then infer edges (order-based) (Rolland et al., 2022; Teyssier & Koller, 2005); and (iv) deep learning-based methods that formulate an optimization problem based on acyclicity and sparsity constraints (Zheng et al., 2018; Lachapelle et al., 2020). Causal discovery methods are evaluated on error with respect to the true graph, e.g., using the structural hamming distance (SHD) (Acid & de Campos, 2003; Tsamardinos et al., 2006). In contrast, causal inference methods focus on estimating the causal effect given a causal graph (Pearl, 2009); the graph is assumed to be known. A natural way to combine the two approaches is to use the graph output by discovery algorithms in inference methods, as in (Hoyer et al., 2008a; Mooij et al., 2016; Maathuis et al., 2010; Gupta et al., 2022). In this paper, we show that there exists a simpler way to combine the two approaches: only a causal order is needed instead of the full graph.
Knowledge-driven causal discovery. Prior knowledge has been used in the causal discovery literature (Hasan & Gani, 2022; Constantinou et al., 2023; Heckerman & Geiger, 2013; Teshima & Sugiyama, 2021; O'Donnell et al., 2006; Wallace et al., 1996). These methods rely on prior knowledge such as domain expert opinions and documented knowledge from randomized controlled trials (RCTs). Various priors have been studied in the literature, including edge existence, forbidden edges, and ancestral constraints (Constantinou et al., 2023; Ban et al., 2023). Prior knowledge significantly reduces the search space over all possible causal graphs.
Recent advancements in LLMs have led to more attention towards knowledge-driven causal discovery (Kıcıman et al., 2023; Ban et al., 2023; Long et al., 2023; Willig et al., 2022). Unlike causal discovery algorithms that use statistical patterns in the data, LLM-based algorithms use metadata such as variable names. Most of these methods use only LLMs to predict the causal relationships among a set of variables (Kıcıman et al., 2023; Willig et al., 2022; Long et al., 2022). Recent work also shows how LLMs can be used as priors or imperfect experts in combination with different types of discovery algorithms: Long et al. (2023) use LLMs to improve the output of a constraint-based algorithm for full graph discovery by orienting undirected edges in the CPDAG, and Ban et al. (2023) use LLMs as priors for score-based methods. However, the focus of these works has been on minimizing graph error metrics such as SHD. Instead, we focus on the downstream causal inference task and choose causal order as the metric, since it directly correlates with accuracy in effect estimation whereas SHD does not.
LLM prompting strategies for causal discovery. Existing LLM-based algorithms for graph discovery (Kıcıman et al., 2023; Long et al., 2022; Ban et al., 2023) use a pairwise prompt, essentially asking "does A cause B?" with varying levels of prompt complexity. Extending this line of work, we propose a triplet-based prompt that provides more accurate answers and avoids cycles when querying relationships between variables. As a result, our triplet-based prompt may be of independent interest to improve LLM-based graph discovery. We also explore the chain-of-thought prompting strategy (Wei et al., 2022) in our experiments.
In Defn 3.1, Xi is called the treatment variable and Xj the target variable. do(Xi = xi) denotes an external intervention on the variable Xi with value xi. The interventional quantity E[Xj | do(Xi = xi)] is different from the conditional E[Xj | Xi = xi] since it involves setting the value of Xi rather than conditioning on it. To estimate E[Xj | do(Xi = xi)] from observational data, the back-door adjustment formula is used.
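The back-door adjustment formula can be illustrated in code. The sketch below estimates E[Y | do(X = x)] by stratifying on a single discrete adjustment variable Z, i.e., computing Σ_z E[Y | X = x, Z = z] P(Z = z); the variable names and the plain-Python data layout are illustrative, not from the paper.

```python
from collections import defaultdict

def backdoor_estimate(samples, x):
    """Estimate E[Y | do(X = x)] from observational samples via back-door
    adjustment over one discrete variable Z:
        E[Y | do(X = x)] = sum_z E[Y | X = x, Z = z] * P(Z = z).
    `samples` is a list of dicts with keys 'X', 'Y', 'Z'."""
    n = len(samples)
    z_counts = defaultdict(int)   # marginal counts of Z
    y_sum = defaultdict(float)    # sum of Y within each (X = x, Z = z) stratum
    y_cnt = defaultdict(int)
    for s in samples:
        z_counts[s['Z']] += 1
        if s['X'] == x:
            y_sum[s['Z']] += s['Y']
            y_cnt[s['Z']] += 1
    total = 0.0
    for z, cz in z_counts.items():
        if y_cnt[z] == 0:
            continue  # stratum never observed under X = x (positivity issue)
        total += (y_sum[z] / y_cnt[z]) * (cz / n)
    return total
```

On confounded data (Z causing both X and Y), this estimate differs from the naive conditional mean E[Y | X = x], which is exactly the gap between intervention and conditioning described above.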
Definition 3.2. (Back-door Adjustment (Pearl, 2009)) Given a DAG G, a set of variables Z satisfies the back-door criterion relative to a pair of treatment and target variables (Xi, Xj) if (i) no node in Z is a descendant of Xi, and (ii) Z blocks every path between Xi and Xj that contains an arrow into Xi. If Z satisfies the back-door criterion, the causal effect is given by P(Xj | do(Xi = xi)) = Σ_z P(Xj | Xi = xi, Z = z) P(Z = z).
Proposition 4.5. In a causal DAG G with N levels in the level-ordering of variables, where level i contains n_i variables, ∃ Ĝ s.t. SHD(Ĝ, G) ≥ Σ_{i=1}^{N−1} Σ_{j=i+1}^{N} (n_i × n_j) − |E| and Dtop(π̂, A) = 0 ∀ π̂ of Ĝ.
• Iterative Context. Here we provide the previously oriented pairs as context in the prompt. Since the LLM has access to its previous decisions, we expect that it may avoid creating cycles through its predictions.
• Markov Blanket Context. Providing previously oriented pairs may become prohibitive for large graphs. Using the fact that a variable is independent of all other nodes given its Markov Blanket (Pearl, 2009), here we provide the Markov Blanket of the given node pair as additional context in the prompt.
• Chain-of-Thought (+ In-context learning). Based on encouraging results of providing in-context examples in the prompt for various tasks (Brown et al., 2020), here we include 3 examples of the ordering task that we expect the LLM to perform. Effectively, we provide example node pairs and their correct causal ordering before asking the question about the given nodes. Each example answer also contains an explanation of the answer, generated using Bing GPT-4. Adding the explanation encourages the LLM to employ chain-of-thought reasoning (Wei et al., 2022) when deciding the causal order. To avoid overfitting, we select node pairs from graphs that are not evaluated in our study. Node pairs with and without direct edges were chosen in equal proportion. Examples of the LLM's answers (and their explanations) using the CoT prompt can be found in Tables A13 and A14 in the Appendix.
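As an illustration of the chain-of-thought strategy with in-context examples, the sketch below assembles a pairwise prompt from worked examples. The prompt wording and the helper name are our own illustration, not the paper's exact prompt text.

```python
def pairwise_cot_prompt(node_a, node_b, examples):
    """Assemble a pairwise chain-of-thought prompt: worked examples
    (each a ((cause, effect) pair, answer, explanation) tuple) precede
    the actual query, encouraging step-by-step reasoning."""
    parts = []
    for (a, b), answer, why in examples:
        parts.append(
            f"Q: What is the cause-and-effect order between {a} and {b}?\n"
            f"A: {answer}. Explanation: {why}\n"
        )
    # The actual query, ending with a CoT trigger phrase.
    parts.append(
        f"Q: What is the cause-and-effect order between {node_a} and {node_b}?\n"
        "A: Let's think step by step."
    )
    return "\n".join(parts)
```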
In addition, the triplet prompt uses the techniques of in-context examples and chain-of-thought from the pairwise setup. An example prompt is shown in Table A12.
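A minimal sketch of the triplet strategy, assuming the LLM call is abstracted as an oracle that returns a set of directed edges among three variables: every triplet is queried, per-pair answers are aggregated by majority voting, and a topological sort yields the causal order. Function and variable names are illustrative.

```python
from itertools import combinations
from collections import Counter

def causal_order_from_triplets(variables, query_triplet):
    """Aggregate triplet-level answers into a causal order by majority
    voting. `query_triplet(a, b, c)` stands in for the LLM call and
    returns a set of directed pairs among the three variables."""
    votes = Counter()
    for trip in combinations(variables, 3):
        for u, v in query_triplet(*trip):
            votes[(u, v)] += 1
    # Resolve each unordered pair by majority vote; ties stay unresolved.
    edges = set()
    for a, b in combinations(variables, 2):
        if votes[(a, b)] > votes[(b, a)]:
            edges.add((a, b))
        elif votes[(b, a)] > votes[(a, b)]:
            edges.add((b, a))
    # Kahn's topological sort over the voted edges.
    indeg = {v: 0 for v in variables}
    adj = {v: [] for v in variables}
    for u, v in edges:
        adj[u].append(v)
        indeg[v] += 1
    order = []
    frontier = [v for v in variables if indeg[v] == 0]
    while frontier:
        u = frontier.pop()
        order.append(u)
        for w in adj[u]:
            indeg[w] -= 1
            if indeg[w] == 0:
                frontier.append(w)
    return order  # shorter than `variables` iff the votes contain a cycle
```

If the voted edges contain a cycle, Kahn's algorithm returns fewer nodes than it was given, which flags the inconsistency instead of silently producing an order.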
Constraint-based algorithms return a graph where some edges may not be oriented. Given a graph from a constraint-based algorithm like PC, we use the causal order π̂ from the LLM to orient the undirected edges. Iterating over the undirected edges, we first check whether both nodes of the edge occur in π̂. If yes, we orient the edge according to the causal order. Since the LLM's final graph might contain isolated nodes that do not appear in π̂, if either (or both) nodes of an undirected edge are missing from π̂, we query GPT-4 using the pairwise CoT prompt (from Section 5.1) to finalise a direction for the pair.
Algorithm 1 Combining constraint-based methods and experts to get π̂ for a given set of variables.
1: Input: LLM topological ordering π̂, expert E_GPT-4, PC-CPDAG Ĝ
2: Output: Estimated topological order π̂_final of {X1, . . . , Xn}.
3: for (i − j) ∈ undirected-edges(Ĝ) do
4:   If both nodes i and j are in π̂ and π̂_i < π̂_j, orient (i − j) as (i → j) in Ĝ.
5:   Otherwise, use the expert E_GPT-4 with the CoT prompt to orient the edge (i − j).
6: end for
7: π̂_final = topological ordering of Ĝ
8: return π̂_final
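A sketch of Algorithm 1 in Python, under the assumption that the CPDAG is given as separate lists of directed and undirected edges and that the GPT-4 expert is abstracted as a callable `expert(i, j)` returning an oriented pair:

```python
def pc_llm_order(nodes, directed, undirected, llm_order, expert):
    """Orient the undirected edges of a PC-style CPDAG using the LLM
    causal order `llm_order`, falling back to `expert(i, j)` when a
    node is missing from the order; return a topological order of the
    resulting DAG (a sketch of Algorithm 1)."""
    pos = {v: k for k, v in enumerate(llm_order)}
    edges = list(directed)
    for i, j in undirected:
        if i in pos and j in pos:
            edges.append((i, j) if pos[i] < pos[j] else (j, i))
        else:
            edges.append(expert(i, j))  # pairwise CoT query as fallback
    # Topological sort of the fully oriented graph.
    indeg = {v: 0 for v in nodes}
    for _, v in edges:
        indeg[v] += 1
    out, frontier = [], sorted(v for v in nodes if indeg[v] == 0)
    while frontier:
        u = frontier.pop(0)
        out.append(u)
        for a, b in edges:
            if a == u:
                indeg[b] -= 1
                if indeg[b] == 0:
                    frontier.append(b)
    return out
```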
7.2 TRIPLET PROMPTING TECHNIQUE IS MOST ACCURATE FOR CAUSAL ORDER
Tables 2 and 3 compare the different prompting techniques. As the graph size increases, we observe limitations with pairwise prompts: in many cases they yield cycles, due to which Dtop cannot be computed. In particular, for the Child dataset with 20 nodes, pairwise prompts yield anywhere from 13 to 79 cycles. LLM output tends to connect more edges than needed, which explains why SHD is high. Overall, among the pairwise prompts, the chain-of-thought prompt performs the best: it has the lowest Dtop on the four small graphs and the lowest number of cycles for the Child and Neuropathic datasets. This indicates that in-context examples and chain-of-thought reasoning help increase the accuracy of the causal order output, but other contextual cues do not matter.
Finally, the triplet prompt provides the most accurate causal order. Even for medium-size graphs like Child and Neuropathic, the LLM output includes no cycles and SHD is fairly low, between 4 and 29. Moreover, Dtop is zero for all datasets except Asia and Neuropathic, where it is 1 and 2 respectively. That said, we do see that isolated nodes in the output increase compared to the pairwise prompts (all graphs are connected, so outputting an isolated node is an error). Considering LLMs as virtual experts, this indicates that there are some nodes for which the LLM expert cannot determine the causal order. This is still a better tradeoff than outputting the wrong causal order, which can confuse downstream algorithms. Overall, therefore, we conclude that the triplet prompt provides the most robust causal order predictions.
7.3 LLMs IMPROVE CAUSAL ORDER ACCURACY OF EXISTING DISCOVERY ALGORITHMS
We now study whether LLM output can be used to increase the accuracy of discovery algorithms in inferring causal order. We compare with popular causal discovery methods: PC (Spirtes et al., 2000), SCORE (Rolland et al., 2022), ICA-LiNGAM (Shimizu et al., 2006), Direct-LiNGAM (Shimizu et al., 2011), NOTEARS (Zheng et al., 2018), and causal discovery via minimum message length (CaMML) (Wallace et al., 1996), across five different sample sizes: 250, 500, 1000, 5000, and 10000. For the LLM, we use the triplet prompt. Table 4 shows the Dtop metric for the different algorithms and compares it to the Dtop of our combined algorithms: PC+LLM and CaMML+LLM. Among the discovery algorithms, we find that PC and CaMML perform the best, with the lowest Dtop across the five datasets. For the Neuropathic dataset, ICA-LiNGAM is also competitive.
Table 4: Comparison with existing discovery methods. Mean and std dev of Dtop over 5 runs. (For
Neuropathic subgraph (1k samples), PC Algorithm returns cyclic graphs in the MEC)
For the BNLearn datasets, adding LLM output further decreases the Dtop of both algorithms. Specifically, PC+LLM leads to a significant reduction in Dtop, and the gains are larger at lower sample sizes, indicating that obtaining causal order from LLMs may matter more in limited-sample settings. At a sample size of 500, the Dtop of PC is nearly double that of PC+LLM for most datasets. Going from CaMML to CaMML+LLM, we also see significant reductions in Dtop. Interestingly, CaMML+LLM yields benefits even at higher sample sizes: at a sample size of 10,000, CaMML's Dtop for Child and Asia is three and five times that of CaMML+LLM, respectively.
For the Neuropathic dataset, we see a similar pattern: adding the LLM to existing algorithms improves Dtop or keeps it constant, except at sample sizes 500 and 1000, where CaMML+LLM yields a worse Dtop than CaMML alone. However, as the sample size increases to 5000 and 10000, CaMML+LLM improves Dtop substantially compared to CaMML. Overall, these results show that LLM output can significantly improve the accuracy of existing causal discovery algorithms.
REFERENCES
Silvia Acid and Luis M de Campos. Searching for bayesian network structures in the space of restricted acyclic partially directed graphs. Journal of Artificial Intelligence Research, 18:445–490, 2003.

Taiyu Ban, Lyvzhou Chen, Xiangyu Wang, and Huanhuan Chen. From query tools to causal architects: Harnessing large language models for advanced causal discovery from data. arXiv preprint arXiv:2306.16902, 2023.

Sarah E. Boslaugh. Silent spring. 2023.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. CoRR, abs/2005.14165, 2020. URL https://arxiv.org/abs/2005.14165.

Carlos Cinelli, Andrew Forney, and Judea Pearl. A crash course in good and bad controls. Sociological Methods & Research, pp. 00491241221099552, 2022.

Anthony C. Constantinou, Zhigao Guo, and Neville K. Kitson. The impact of prior knowledge on causal structure learning. Knowledge and Information Systems, 65(8):3385–3434, 2023.

Clark Glymour, Kun Zhang, and Peter Spirtes. Review of causal discovery methods based on graphical models. Frontiers in Genetics, 10:524, 2019.

Shantanu Gupta, David Childers, and Zachary Chase Lipton. Local causal discovery for estimating causal effects. In NeurIPS 2022 Workshop on Causality for Real-world Impact, 2022.

Thilo Hagendorff, Sarah Fabi, and Michal Kosinski. Machine intuition: Uncovering human-like intuitive decision-making in GPT-3.5, 12 2022.

Uzma Hasan and Md Osman Gani. KCRL: A prior knowledge based causal discovery framework with reinforcement learning. In Proceedings of the 7th Machine Learning for Healthcare Conference, 2022.

David Heckerman and Dan Geiger. Learning bayesian networks: a unification for discrete and gaussian domains. arXiv preprint arXiv:1302.4957, 2013.

Patrik Hoyer, Dominik Janzing, Joris M Mooij, Jonas Peters, and Bernhard Schölkopf. Nonlinear causal discovery with additive noise models. In Advances in Neural Information Processing Systems, 2008a.

Patrik Hoyer, Dominik Janzing, Joris M Mooij, Jonas Peters, and Bernhard Schölkopf. Nonlinear causal discovery with additive noise models. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou (eds.), Advances in Neural Information Processing Systems, 2008b.

Yiyi Huang, Matthäus Kleindessner, Alexey Munishkin, Debvrat Varshney, Pei Guo, and Jianwu Wang. Benchmarking of data-driven causality discovery approaches in the interactions of arctic sea ice and atmosphere. Frontiers in Big Data, 4:642182, 2021.

Aapo Hyvärinen, Kun Zhang, Shohei Shimizu, and Patrik O. Hoyer. Estimation of a structural vector autoregression model using non-gaussianity. Journal of Machine Learning Research, 11(56):1709–1731, 2010.

Guido W. Imbens and Donald B. Rubin. Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge University Press, 2015. doi: 10.1017/CBO9781139025751.

Emre Kıcıman, Robert Ness, Amit Sharma, and Chenhao Tan. Causal reasoning and large language models: Opening a new frontier for causality. arXiv preprint arXiv:2305.00050, 2023.
Sébastien Lachapelle, Philippe Brouillard, Tristan Deleu, and Simon Lacoste-Julien. Gradient-based neural dag learning. In ICLR, 2020.

Stephanie Long, Tibor Schuster, and Alexandre Piché. Can large language models build causal graphs? In NeurIPS 2022 Workshop on Causality for Real-world Impact, 2022.

Stephanie Long, Alexandre Piché, Valentina Zantedeschi, Tibor Schuster, and Alexandre Drouin. Causal discovery with language models as imperfect experts. In ICML 2023 Workshop on Structured Probabilistic Inference & Generative Modeling, 2023.

Marloes H Maathuis, Diego Colombo, Markus Kalisch, and Peter Bühlmann. Predicting causal effects in large-scale systems from observational data. Nature Methods, 7(4):247–248, 2010.

Syed S Mahmood, Daniel Levy, Ramachandran S Vasan, and Thomas J Wang. The framingham heart study and the epidemiology of cardiovascular disease: a historical perspective. The Lancet, 383(9921):999–1008, 2014.

Joris M Mooij, Jonas Peters, Dominik Janzing, Jakob Zscheischler, and Bernhard Schölkopf. Distinguishing cause from effect using observational data: methods and benchmarks. The Journal of Machine Learning Research, 17(1):1103–1204, 2016.

Rodney T O'Donnell, Ann E Nicholson, Bin Han, Kevin B Korb, Md Jahangir Alam, and Lucas R Hope. Causal discovery with prior information. In AI 2006: Advances in Artificial Intelligence: 19th Australian Joint Conference on Artificial Intelligence, Hobart, Australia, December 4-8, 2006. Proceedings 19, pp. 1162–1167. Springer, 2006.

Judea Pearl. Causality. Cambridge University Press, 2009.

Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. Instruction tuning with GPT-4, 2023.

Jonas Peters and Peter Bühlmann. Structural intervention distance for evaluating causal graphs. Neural Computation, 27(3):771–799, 2015.

Alexander Reisach, Christof Seiler, and Sebastian Weichwald. Beware of the simulated dag! causal discovery benchmarks may be easy to game. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (eds.), NeurIPS, pp. 27772–27784, 2021.

Paul Rolland, Volkan Cevher, Matthäus Kleindessner, Chris Russell, Dominik Janzing, Bernhard Schölkopf, and Francesco Locatello. Score matching enables causal discovery of nonlinear additive noise models. In ICML, 2022.

M. Scutari and J.B. Denis. Bayesian Networks: With Examples in R. Chapman & Hall/CRC Texts in Statistical Science. Taylor & Francis, 2014.

Shohei Shimizu, Patrik O Hoyer, Aapo Hyvärinen, Antti Kerminen, and Michael Jordan. A linear non-gaussian acyclic model for causal discovery. JMLR, 7(10), 2006.

Shohei Shimizu, Takanori Inazumi, Yasuhiro Sogawa, Aapo Hyvarinen, Yoshinobu Kawahara, Takashi Washio, Patrik O Hoyer, Kenneth Bollen, and Patrik Hoyer. DirectLiNGAM: A direct method for learning a linear non-gaussian structural equation model. JMLR, 12(Apr):1225–1248, 2011.

Peter Spirtes, Clark N Glymour, and Richard Scheines. Causation, Prediction, and Search. MIT Press, 2000.

Takeshi Teshima and Masashi Sugiyama. Incorporating causal graphical prior knowledge into predictive modeling via simple data augmentation. In UAI, pp. 86–96. PMLR, 2021.

Marc Teyssier and Daphne Koller. Ordering-based search: A simple and effective algorithm for learning bayesian networks. In UAI, UAI'05, pp. 584–590, 2005.

Ioannis Tsamardinos, Laura E Brown, and Constantin F Aliferis. The max-min hill-climbing bayesian network structure learning algorithm. Machine Learning, 65:31–78, 2006.
Ruibo Tu, Kun Zhang, Bo Bertilson, Hedvig Kjellstrom, and Cheng Zhang. Neuropathic pain diagnosis simulator for causal discovery algorithm evaluation. Advances in Neural Information Processing Systems, 32, 2019.

Chris Wallace, Kevin B Korb, and Honghua Dai. Causal discovery via mml. In ICML, volume 96, pp. 516–524, 1996.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.

Moritz Willig, Matej Zečević, Devendra Singh Dhami, and Kristian Kersting. Probing for correlations of causal facts: Large language models and causality. 2022.

Xun Zheng, Bryon Aragam, Pradeep K Ravikumar, and Eric P Xing. DAGs with no tears: Continuous optimization for structure learning. NeurIPS, 31, 2018.
APPENDIX
Proposition 4.2. (Pearl, 2009; Cinelli et al., 2022) Under the no-confounding assumption, given a pair of treatment and target variables (Xi, Xj) in G, Z = {Xk | πk < πi} is a valid adjustment set relative to (Xi, Xj) for any topological ordering π of G.

Proof. We need to show that the set Z = {Xk | πk < πi} satisfies conditions (i) and (ii) in Defn 3.2. For any variable Xk such that πk < πi, we have Xk ∉ de(Xi), and hence condition (i) is satisfied. Additionally, for each Xk ∈ pa(Xi) we have πk < πi, and hence pa(Xi) ⊆ Z. Since pa(Xi) blocks all paths from Xi to Xj that contain an arrow into Xi (Peters & Bühlmann, 2015), Z satisfies condition (ii).
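Proposition 4.2 translates directly into code: under no hidden confounding, the adjustment set for a treatment is simply every variable that precedes it in the causal order. A minimal sketch:

```python
def order_adjustment_set(order, treatment):
    """Proposition 4.2 as code: all variables preceding `treatment` in
    the topological order `order` form a valid back-door adjustment set
    (under the no-confounding assumption)."""
    return set(order[:order.index(treatment)])
```

For example, with the order [Z, X, M, Y] and treatment X, the set is {Z}; no knowledge of the edge structure is required.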
Proposition 4.3. For an estimated topological order π̂ and a true topological order π of a causal DAG G with the corresponding adjacency matrix A, Dtop(π̂, A) = 0 iff Z = {Xk | π̂k < π̂i} is a valid adjustment set relative to (Xi, Xj), ∀ πi < πj.

Proof. The statement of the proposition is of the form A ⟺ B, with A being "Dtop(π̂, A) = 0" and B being "Z = {Xk | π̂k < π̂i} is a valid adjustment set relative to (Xi, Xj), ∀ i, j". We prove A ⟺ B by proving (i) A ⟹ B and (ii) B ⟹ A.

(i) Proof of A ⟹ B: If Dtop(π̂, A) = 0, then for all pairs of nodes (Xi, Xj) we have π̂i < π̂j whenever πi < πj. That is, the causal order in the estimated graph is the same as that in the true graph. Hence, from Propn 4.2, Z = {Xk | π̂k < π̂i} is a valid adjustment set relative to (Xi, Xj), ∀ i, j.

(ii) Proof of B ⟹ A: We prove the contrapositive, ¬A ⟹ ¬B. Assume Dtop(π̂, A) ≠ 0; then there is at least one edge Xi → Xj that cannot be oriented correctly under the estimated topological order π̂, i.e., π̂j < π̂i but πj > πi. Hence, to find the causal effect of Xi on some Xl, l ≠ j, Xj is included in the back-door adjustment set Z relative to (Xi, Xl). Adding Xj to Z renders Z an invalid adjustment set because Xj is a descendant of Xi, violating condition (i) of Defn 3.2.
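For concreteness, a sketch of the Dtop metric as we understand it from Rolland et al. (2022): the number of true edges i → j that the estimated order places in the wrong direction. The edge-list representation is our own simplification of the adjacency matrix A.

```python
def d_top(order, edges):
    """Topological order divergence: count true edges (i, j) that the
    estimated `order` reverses, i.e., places j before i. Zero means
    the order is consistent with every edge of the true DAG."""
    pos = {v: k for k, v in enumerate(order)}
    return sum(1 for i, j in edges if pos[i] > pos[j])
```

A correct order over a chain A → B → C gives Dtop = 0, while the fully reversed order [C, B, A] gives Dtop = 2.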
Proposition 4.5. In a causal DAG G with N levels in the level-ordering of variables, where level i contains n_i variables, ∃ Ĝ s.t. SHD(Ĝ, G) ≥ Σ_{i=1}^{N−1} Σ_{j=i+1}^{N} (n_i × n_j) − |E| and Dtop(π̂, A) = 0 ∀ π̂ of Ĝ.
Proof. Recall that SHD counts the number of missing, falsely detected, and falsely directed edges in the estimated causal graph as compared to the ground-truth graph. Since we want Dtop(π̂, A) = 0 ∀ π̂ of Ĝ, there cannot be an edge Xi → Xj in Ĝ such that Xi ← Xj is in G. This constraint rules out falsely directed edges in Ĝ. Consider a Ĝ with all the edges in G and, in addition, a directed edge from each variable in level i to each variable in every level below level i. All such added edges contribute to the SHD score while still obeying the causal ordering in G. Their number equals Σ_{i=1}^{N−1} Σ_{j=i+1}^{N} (n_i × n_j) − |E|: the quantity Σ_{i=1}^{N−1} Σ_{j=i+1}^{N} (n_i × n_j) is the number of possible edges from each node to every node in the levels below it, and we subtract the number of existing edges |E| to count only the newly added edges that contribute to the SHD score. Now, we could additionally remove from Ĝ some edges Xi → Xj that are present in G while still preserving the causal ordering of variables; this only increases the SHD score through missing edges, so the bound continues to hold and we ignore such corner cases.
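The bound in Proposition 4.5 is easy to evaluate for a given level structure; the sketch below computes Σ_{i=1}^{N−1} Σ_{j=i+1}^{N} (n_i × n_j) − |E| from the level sizes (the function name is ours):

```python
def shd_lower_bound(level_sizes, num_edges):
    """SHD lower bound of Proposition 4.5 for a graph Ĝ that keeps
    Dtop = 0: all possible downward cross-level edges, minus the |E|
    edges already present in the true graph G."""
    total = sum(ni * nj
                for i, ni in enumerate(level_sizes)
                for nj in level_sizes[i + 1:])
    return total - num_edges
```

For example, a graph with levels of sizes (1, 2, 2) and 4 true edges admits a Ĝ with SHD at least 4 while still preserving Dtop = 0, illustrating why SHD is a poor proxy for causal-order accuracy.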
Table A1 shows the results of various prompt strategies and their improvements over no-prior methods.
Table A1: Dtop metric results. Comparison of various prompting strategies. The Neuropathic subgraph with 1k samples returns cyclic graphs in the MEC. Using the LLM CoT prior with Neuropathic 10k samples orients the undirected edges to create cyclic graphs.
We conducted experiments where we utilised discovery algorithms like PC to create skeletons of the graph and employed LLMs to orient the undirected edges. The idea was to utilise the LLM's ability to correctly estimate causal direction while leveraging the PC algorithm's ability to produce a skeleton that can be oriented in a post-processing step. We found that the LLM gave improved results compared to PC alone.
Context     Base prompt   Past iteration   Markov Blanket   PC
                          orientations                      (Avg. over MEC)
1000 samples
Dtop        8.0           5.3              6.6              9.61
SHD         14.33         12.66            14.0             17.0
10000 samples
Dtop        6.33          9.66             6.0              7.67
SHD         9.0           13.33            8.33             12.0
Table A2: PC + LLM results, where the LLM is used to orient the undirected edges of the skeleton returned by PC over different data sample sizes. We show how LLMs can be used in a post-processing setup for edge orientation, besides acting as a strong prior for different discovery algorithms.
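The post-processing loop behind these numbers can be sketched as follows; the `ask_llm` callable is a hypothetical stand-in for an actual GPT query using the pairwise prompts shown later in this appendix:

```python
def orient_skeleton(undirected_edges, ask_llm):
    """Post-process a PC skeleton: for each undirected edge {x, y},
    query an oracle standing in for the LLM and either emit a directed
    edge or drop the pair. `ask_llm(x, y)` should return 'A' (x -> y),
    'B' (y -> x), or 'C' (no causal relation)."""
    directed = []
    for x, y in undirected_edges:
        choice = ask_llm(x, y)
        if choice == 'A':
            directed.append((x, y))
        elif choice == 'B':
            directed.append((y, x))
        # 'C': leave the edge out entirely
    return directed

# Stand-in for the LLM: hard-coded domain answers for the Cancer graph.
answers = {('Smoker', 'Cancer'): 'A', ('Xray', 'Dyspnoea'): 'C'}
mock_llm = lambda x, y: answers.get((x, y), 'C')
print(orient_skeleton([('Smoker', 'Cancer'), ('Xray', 'Dyspnoea')], mock_llm))
# -> [('Smoker', 'Cancer')]
```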
Figures A1-A5 show the causal graphs and details we considered from the BNLearn repository (Scutari & Denis, 2014).
Figure A1: Earthquake Bayesian network. Abbreviations/Descriptions: Burglary: burglar entering, Earthquake: earthquake hitting, Alarm: home alarm going off in a house, JohnCalls: first neighbor calling to report the alarm sound, MaryCalls: second neighbor calling to report the alarm sound.
Figure A2: Cancer Bayesian network. Abbreviations/Descriptions: Pollution: exposure to pollutants, Smoker: smoking habit, Cancer: cancer, Dyspnoea: dyspnoea, Xray: getting a positive X-ray result.
Figure A3: Survey Bayesian network. Abbreviations: A=Age/Age of people using transport, S=Sex/male
or female, E=Education/up to high school or university degree, O=Occupation/employee or self-employed,
R=Residence/the size of the city the individual lives in, recorded as either small or big, T=Travel/the means of
transport favoured by the individual.
Figure A4: Asia Bayesian network. Abbreviations/Descriptions: asia=visit to Asia/visiting Asian countries with high exposure to pollutants, smoke=smoking habit, tub=tuberculosis, lung=lung cancer, either=either tuberculosis or lung cancer, bronc=bronchitis, dysp=dyspnoea, xray=getting a positive X-ray result.
Figure A5: Child Bayesian network. Abbreviations: BirthAsphyxia: Lack of oxygen to the blood during
the infant’s birth, HypDistrib: Low oxygen areas equally distributed around the body, HypoxiaInO2: Hypoxia
when breathing oxygen, CO2: Level of carbon dioxide in the body, ChestXray: Having a chest x-ray, Grunting:
Grunting in infants, LVHreport: Report of having left ventricular hypertrophy, LowerBodyO2: Level of oxygen
in the lower body, RUQO2: Level of oxygen in the right upper quadricep muscle, CO2Report: A document reporting high levels of CO2 in the blood, XrayReport: Report of having a chest x-ray, Disease: Presence
of an illness, GruntingReport: Report of infant grunting, Age: Age of infant at disease presentation, LVH:
Thickening of the left ventricle, DuctFlow: Blood flow across the ductus arteriosus, CardiacMixing: Mixing of
oxygenated and deoxygenated blood, LungParench: The state of the blood vessels in the lungs, LungFlow: Low
blood flow in the lungs, Sick: Presence of an illness
Question: For a causal graph used to model relationship of various factors and outcomes related to
cancer with the following nodes: [’Pollution’, ’Cancer’, ’Smoker’, ’Xray’, ’Dyspnoea’],
Which cause-and-effect relationship is more likely between nodes ’smoker’ and ’cancer’?
A. changing the state of node ’smoker’ causally effects a change in another node ’cancer’.
B. changing the state of node ’cancer’ causally effects a change in another node ’smoker’.
Make sure to first provide a grounded reasoning for your answer and then provide the answer in the
following format: <Answer>A/B/C</Answer>. It is very important that you output the final Causal
graph within the tags like <Answer>A/B/C</Answer> otherwise your answer will not be processed.
Answer: The causal effect of "smoker" directing to "cancer" is based on the strong evidence from epi-
demiological studies linking smoking to an increased risk of developing cancer. Smoking introduces
harmful substances into the respiratory system, leading to cellular damage and mutation, which sig-
nificantly raises the likelihood of cancer development in the lungs or respiratory tract, subsequently
impacting the occurrence of respiratory problems like shortness of breath. Therefore answer is <An-
swer>A</Answer>
Question: For a causal graph used to model relationship of various factors and outcomes related to
cancer with the following nodes: [’Pollution’, ’Cancer’, ’Smoker’, ’Xray’, ’Dyspnoea’], Which cause-
and-effect relationship is more likely between nodes ’xray’ and ’dyspnoea’?
A. changing the state of node ’xray’ causally effects a change in another node ’dyspnoea’.
B. changing the state of node ’dyspnoea’ causally effects a change in another node ’xray’.
Make sure to first provide a grounded reasoning for your answer and then provide the answer in the
following format: <Answer>A/B/C</Answer>. It is very important that you output the final Causal
graph within the tags like <Answer>A/B/C</Answer> otherwise your answer will not be processed.
Answer: Reasoning behind the lack of causal relation between X-ray and dyspnoea is that X-ray and
dyspnoea are both effects of having cancer, but they do not directly cause or affect each other. X-ray is
a diagnostic test that can help detect cancer in the lungs or other organs, while dyspnoea is a symptom
of cancer that involves feeling short of breath. Therefore, X-ray and dyspnoea are not causally related,
but they are both associated with cancer. Therefore answer is <Answer>C</Answer>
Question: For a causal graph used to model relationship of various factors and outcomes related to
cancer with the following nodes: [’Pollution’, ’Cancer’, ’Smoker’, ’Xray’, ’Dyspnoea’], Which cause-
and-effect relationship is more likely between nodes ’xray’ and ’cancer’?
A. changing the state of node ’xray’ causally effects a change in another node ’cancer’.
B. changing the state of node ’cancer’ causally effects a change in another node ’xray’.
Make sure to first provide a grounded reasoning for your answer and then provide the answer in the
following format: <Answer>A/B/C</Answer>. It is very important that you output the final Causal
graph within the tags like <Answer>A/B/C</Answer> otherwise your answer will not be processed.
Answer: The causal effect of cancer on X-ray is that X-rays are often used to diagnose or detect
cancer in different parts of the body, such as the bones, lungs, breasts, or kidneys. X-rays use low
doses of radiation to create pictures of the inside of the body and show the presence, size, and location
of tumors. X-rays can also help monitor the response to treatment or check for signs of recurrence.
Therefore, having cancer may increase the likelihood of getting an X-ray as part of the diagnostic
process or follow-up care. Therefore answer is <Answer>B</Answer>
Question: For a causal graph used to model relationship of various factors and outcomes related to
cancer with the following nodes: [’Pollution’, ’Cancer’, ’Smoker’, ’Xray’, ’Dyspnoea’], Which
cause-and-effect relationship is more likely between nodes ’pollution’ and ’cancer’?
A. changing the state of node ’pollution’ causally effects a change in another node ’cancer’.
B. changing the state of node ’cancer’ causally effects a change in another node ’pollution’.
Make sure to first provide a grounded reasoning for your answer and then provide the answer
in the following format: <Answer>A/B/C</Answer>. It is very important that you output the
final Causal graph within the tags like <Answer>A/B/C</Answer> otherwise your answer will not be
processed.
Answer: The causal effect of pollution on cancer is that air pollution contains carcinogens (cancer-
causing substances) that may be absorbed into the body when inhaled and damage the DNA of cells.
Another possible reasoning is that particulate matter (tiny dust-like particles) in air pollution may cause
physical damage to the cells in the lungs, leading to inflammation and oxidative stress and eventually
cell mutations. A third possible reasoning is that air pollution may create an inflamed environment in
the lungs that encourages the proliferation of cells with existing cancer-driving mutations. These are
some of the hypotheses that researchers have proposed to explain how air pollution may cause cancer,
but more studies are needed to confirm them. Therefore answer is <Answer>A</Answer>
Question: For a causal graph used to model relationship of various factors and outcomes related to
cancer with the following nodes: [’Pollution’, ’Cancer’, ’Smoker’, ’Xray’, ’Dyspnoea’], Which
cause-and-effect relationship is more likely between nodes ’pollution’ and ’smoker’?
A. changing the state of node ’pollution’ causally effects a change in another node ’smoker’.
B. changing the state of node ’smoker’ causally effects a change in another node ’pollution’.
Make sure to first provide a grounded reasoning for your answer and then provide the answer
in the following format: <Answer>A/B/C</Answer>. It is very important that you output the
final Causal graph within the tags like <Answer>A/B/C</Answer> otherwise your answer will not be
processed.
Answer: Reason behind the lack of causal relation between pollution and smoker is that pollution
and smoking are both independent risk factors for respiratory problems, but they do not directly cause or
affect each other. Pollution and smoking both contribute to air pollution, which can harm the health of
people and the environment. However, pollution is mainly caused by human activities such as burning
fossil fuels, deforestation, or industrial processes, while smoking is a personal choice that involves
inhaling tobacco smoke. Therefore, pollution and smoker are not causally related, but they are both
associated with respiratory problems. Therefore answer is <Answer>C</Answer>.
Question: For a causal graph used for modeling factors causing Coronary Heart Diseases with the
following nodes: [’Family Disease’, ’Gene’, ’Smoking’, ’Blood Pressure’, ’Coronary Heart Disease’,
’Headache’], Which cause-and-effect relationship is more likely between nodes ’Family Disease’ and
’Gene’?
A. changing the state of node ’Family Disease’ causally effects a change in another node ’Gene’.
B. changing the state of node ’Gene’ causally effects a change in another node ’Family Disease’.
C. There is no causal relation between the nodes ’Family Disease’ and ’Gene’.
Make sure to first provide a grounded reasoning for your answer and then provide the answer in the
following format: <Answer>A/B/C</Answer>. It is very important that you output the final Causal
graph within the tags like <Answer>A/B/C</Answer> otherwise your answer will not be processed.
Answer: Reason behind the causal effect of family disease on gene is that family disease is a term that
refers to diseases or health conditions that run in the family, meaning that they are influenced by genetic
factors. Gene is a term that refers to the basic unit of heredity that carries information for a specific trait
or function. Family disease can affect gene by altering the type or frequency of genes that are inherited
by the offspring from their parents. For example, some family diseases are caused by deterministic
genes, which are genes that guarantee the development of a disease if they are present in a person’s
genome. Other family diseases are influenced by risk genes, which are genes that increase the likelihood
of developing a disease but do not guarantee it. Therefore, family disease can causally direct to gene by
changing the genetic makeup of a person or a population. Therefore answer is <Answer>A</Answer>.
Question: For a causal graph used for modeling factors causing Coronary Heart Diseases with the
following nodes: [’Family Disease’, ’Gene’, ’Smoking’, ’Blood Pressure’, ’Coronary Heart Disease’,
’Headache’], Which cause-and-effect relationship is more likely between nodes ’Coronary Heart Dis-
ease’ and ’Gene’?
A. changing the state of node ’Coronary Heart Disease’ causally effects a change in another node
’Gene’.
B. changing the state of node ’Gene’ causally effects a change in another node ’Coronary Heart Dis-
ease’.
C. There is no causal relation between the nodes ’Coronary Heart Disease’ and ’Gene’.
Make sure to first provide a grounded reasoning for your answer and then provide the answer in the
following format: <Answer>A/B/C</Answer>. It is very important that you output the final Causal
graph within the tags like <Answer>A/B/C</Answer> otherwise your answer will not be processed.
Answer: Possible reasoning behind the causal effect of gene on coronary heart disease is that gene is
a term that refers to the basic unit of heredity that carries information for a specific trait or function.
Gene can affect coronary heart disease by influencing the structure and function of the blood vessels,
the metabolism and transport of lipids (fats) in the blood, the inflammation and clotting processes, or
the response to environmental factors such as smoking or diet. For example, some genes code for
proteins that regulate the cell cycle and growth of the cells that line the arteries, which can affect their
susceptibility to damage or plaque formation. Other genes code for proteins that control the synthesis
and clearance of cholesterol or other lipids, which can affect their levels and deposition in the arteries.
Therefore, gene can causally direct to coronary heart disease by modifying the biological pathways that
contribute to the development or progression of the disease. Therefore answer is <Answer>B</Answer>
Question: For a causal graph used for modeling factors causing Coronary Heart Diseases with the
following nodes: [’Family Disease’, ’Gene’, ’Smoking’, ’Blood Pressure’, ’Coronary Heart Disease’,
’Headache’], Which cause-and-effect relationship is more likely between nodes ’Blood Pressure’ and
’Smoking’?
A. changing the state of node ’Blood Pressure’ causally effects a change in another node ’Smoking’.
B. changing the state of node ’Smoking’ causally effects a change in another node ’Blood Pressure’.
C. There is no causal relation between the nodes ’Blood Pressure’ and ’Smoking’.
Make sure to first provide a grounded reasoning for your answer and then provide the answer in the
following format: <Answer>A/B/C</Answer>. It is very important that you output the final Causal
graph within the tags like <Answer>A/B/C</Answer> otherwise your answer will not be processed.
Answer: Possible reasoning behind the causal effect of smoking on blood pressure is that smoking is
a habit that involves inhaling tobacco smoke, which contains nicotine and other harmful chemicals.
Smoking can affect blood pressure by activating the sympathetic nervous system (SNS), which is the
part of the nervous system that controls the body’s response to stress or danger. When the SNS is
activated, it releases hormones such as adrenaline and noradrenaline, which cause the heart to beat faster
and harder, and the blood vessels to constrict. This results in a temporary increase in blood pressure,
which can last for 15 to 20 minutes after each cigarette. Therefore, smoking can causally direct to blood
pressure by stimulating the SNS and increasing the cardiac output and vascular resistance. Therefore
answer is <Answer>B</Answer>.
Question: For a causal graph used for modeling factors causing Coronary Heart Diseases with the
following nodes: [’Family Disease’, ’Gene’, ’Smoking’, ’Blood Pressure’, ’Coronary Heart Disease’,
’Headache’], Which cause-and-effect relationship is more likely between nodes ’Headache’ and ’Smok-
ing’?
A. changing the state of node ’Headache’ causally effects a change in another node ’Smoking’.
B. changing the state of node ’Smoking’ causally effects a change in another node ’Headache’.
Make sure to first provide a grounded reasoning for your answer and then provide the answer in the
following format: <Answer>A/B/C</Answer>. It is very important that you output the final Causal
graph within the tags like <Answer>A/B/C</Answer> otherwise your answer will not be processed.
Answer: One possible reasoning behind the lack of causal relation between headache and smoking is
that headache and smoking are both associated with various health conditions, but they do not directly
cause or affect each other. Headache is a term that refers to pain or discomfort in the head, scalp, or
neck, which can have many possible causes, such as stress, dehydration, infection, injury, or medication.
Smoking is a habit that involves inhaling tobacco smoke, which contains nicotine and other harmful
chemicals, which can increase the risk of diseases such as cancer, heart disease, stroke, and lung disease.
Therefore, headache and smoking are not causally related, but they are both linked to different health
problems. Therefore the answer is <Answer>C</Answer>
Question: For a causal graph used for modeling factors causing Coronary Heart Diseases with the
following nodes: [’Family Disease’, ’Gene’, ’Smoking’, ’Blood Pressure’, ’Coronary Heart Disease’,
’Headache’], Which cause-and-effect relationship is more likely between nodes ’Coronary Heart Dis-
ease’ and ’Smoking’?
A. changing the state of node ’Smoking’ causally effects a change in another node ’Coronary Heart
Disease’.
B. changing the state of node ’Coronary Heart Disease’ causally effects a change in another node
’Smoking’.
C. There is no causal relation between the nodes ’Coronary Heart Disease’ and ’Smoking’.
Make sure to first provide a grounded reasoning for your answer and then provide the answer in the
following format: <Answer>A/B/C</Answer>. It is very important that you output the final Causal
graph within the tags like <Answer>A/B/C</Answer> otherwise your answer will not be processed.
Answer: Possible reasoning behind the causal effect of smoking on coronary heart disease is smoking
damages the heart and blood vessels by raising triglycerides, lowering HDL, increasing blood clotting,
and impairing blood flow to the heart. This can lead to plaque buildup, heart attacks, and death. There-
fore answer is <Answer>A</Answer>.
Question: For a causal graph used for {context} with the following nodes: {nodes}, which cause-and-effect
relationship is more likely between nodes X and Y?
Make sure to first provide a grounded reasoning for your answer and then provide the answer in the
following format: <Answer>A/B/C</Answer>. It is very important that you output the final Causal
graph within the tags like <Answer>A/B/C</Answer> otherwise your answer will not be processed.
A. changing the state of node which says X causally effects a change in another node which
says Y.
B. changing the state of node which says Y causally effects a change in another node which
says X.
Make sure to first output a factually grounded reasoning for your answer. X and Y are nodes
of a Causal Graph. The causal graph is sparse and acyclic in nature. So option C could be
chosen if there is some uncertainty about the causal relationship between X and Y.
First give your reasoning and after that please make sure to provide your final answer within
the tags <Answer>A/B/C</Answer>.
It is very important that you output your final answer between the tags like <An-
swer>A/B/C</Answer> otherwise your response will not be processed.
For the nodes X and Y which form an edge in a Causal Graph, you have to identify which
cause-and-effect relationship is more likely between the nodes of the edge. This will be used
to rearrange the nodes in the edge to create a directed edge which accounts for causal relation
from one node to another in the edge.
You can also take the edges from the skeleton which have been rearranged to create a directed
edge to account for causal relationship between the nodes: directed_edges.
Make sure to first output a factually grounded reasoning for your answer. First give your
reasoning and after that please make sure to provide your final answer within the tags <An-
swer>A/B/C</Answer>.
It is very important that you output your final answer between the tags like <An-
swer>A/B/C</Answer> otherwise your response will not be processed.
For the following undirected edge in a Causal Graph made of nodes X and Y, you have to
identify which cause-and-effect relationship is more likely between the nodes of the edge.
This will be used to rearrange the nodes in the edge to create a directed edge which accounts
for causal relation from one node to another in the edge.
You can also take the other directed edges of nodes X: X_edges and Y: Y_edges of the Causal
graph as context to redirect the edge to account for causal effect.
Make sure to first output a factually grounded reasoning for your answer. First give your
reasoning and after that please make sure to provide your final answer within the tags <An-
swer>A/B/C</Answer>.
It is very important that you output your final answer between the tags like <An-
swer>A/B/C</Answer> otherwise your response will not be processed.
Identify the causal relationships between the given variables and create a directed acyclic graph
to {context}. Make sure to give a reasoning for your answer and then output the directed graph
in the form of a list of tuples, where each tuple is a directed edge. The desired output should
be in the following form: [(‘A’,‘B’), (‘B’,‘C’)] where first tuple represents a directed edge from
Node ‘A’ to Node ‘B’, the second tuple represents a directed edge from Node ‘B’ to Node ‘C’, and
so on.
If a node should not form any causal relationship with other nodes, then you can add it as an
isolated node of the graph by adding it separately. For example, if ‘C’ should be an isolated
node in a graph with nodes ‘A’, ‘B’, ‘C’, then the final DAG representation should be like
[(‘A’,‘B’), (‘C’)].
Use the description about the node provided with the nodes in brackets to form a better decision
about the causal direction orientation between the nodes.
It is very important that you output the final Causal graph within the tags <Answer></Answer> otherwise your answer will not be processed.
Example:
Input: Nodes: [‘A’, ‘B’, ‘C’, ‘D’];
Description of Nodes: [(description of Node A), (description of Node B), (description of Node
C), (description of Node D)]
Output: <Answer>[(‘A’,‘B’),(‘C’,‘D’)]</Answer>
Question:
Input: Nodes: {Triplet Nodes Input}
Description of Nodes: {Description of Each Node from the Triplet}
Output:
Table A12: The triplet prompt template, which includes a concise context of the graph, the input
triplet nodes and their respective descriptions. As an example, for the Child graph, the context is
"to model congenital heart disease in babies", the three nodes may be (‘HypoxiaInO2’, ‘Grunting’,
‘GruntingReport’); and their node descriptions are ["hypoxia when breathing oxygen", "grunting in
infants", "report of infant grunting"] respectively.
Answer: Low oxygen areas equally distributed around the body can affect the level of oxygen in the
lower body by reducing the amount of oxygen available for circulation. Therefore, the answer is <An-
swer>A</Answer>.
Answer: Infant methemoglobinemia is a condition where the blood is unable to carry enough oxygen,
which can lead to low blood flow in the lungs. Therefore, the answer is <Answer>A</Answer>.
Answer: Mixing of oxygenated and deoxygenated blood is a common cause of congenital heart disease
in babies. This mixing can occur due to structural defects in the heart, such as a hole between the heart
chambers or abnormal connections between the blood vessels. This mixing can lead to low oxygen
areas equally distributed around the body. Therefore, the answer is <Answer>B</Answer>.
Table A13: Example LLM (GPT-3.5-turbo) outputs for estimating the causal direction between a given pair of nodes (with their descriptions) using the CoT prompt (see Table A4 in the Appendix for the prompt). Given a pair (X, Y), option A refers to X causing Y, option B refers to Y causing X, and option C refers to no causal relationship between the variables. The pairs shown are from the Child graph from the BNLearn repository (Scutari & Denis, 2014), which is used for modeling congenital heart disease in infants.
Answer: Discoligment injury C6-C7 can cause compression of the nerve roots that exit the
spinal cord at the C7 level, which can lead to symptoms such as pain, numbness, and weak-
ness in the right C7 dermatome. Therefore, the answer is <Answer>B</Answer>.
Answer: Right C7 and left C7 are both parts of the cervical spine and are not known to
directly influence each other. Therefore, the answer is <Answer>C</Answer>.
Answer: There is no direct causal relationship between right elbow trouble and left shoulder
trouble. They may both be symptoms of a larger underlying condition, but they do not directly
cause or affect each other. Therefore the answer is <Answer>C</Answer>.
Table A14: Example LLM (GPT-3.5-turbo) reasoning outputs for estimating causal directionality between different pairs of nodes using the CoT prompt (see Table A4 for the prompt) for the Neuropathic subgraph (used for pain diagnosis).
Table A15: Comparing Dtop of the final graph using the LLM order vs. the ground-truth order as a prior to the PC algorithm for the Child and Asia graphs, averaged over 4 runs.
Dataset   Samples   ϵATE(S1)    ϵATE(S2)    ϵATE(S3)    ∆12         ∆13
Asia      250       0.70±0.40   0.70±0.39   0.69±0.39   0.00±0.00   0.00±0.00
          500       0.64±0.39   0.64±0.39   0.64±0.38   0.00±0.00   0.00±0.00
          1000      0.59±0.32   0.59±0.32   0.59±0.32   0.00±0.00   0.00±0.00
          5000      0.59±0.30   0.59±0.30   0.59±0.29   0.00±0.00   0.00±0.00
          10000     0.49±0.00   0.49±0.00   0.49±0.00   0.00±0.00   0.00±0.00
Table A16: Results on the Asia dataset. Here we test the difference in the estimated causal effect of lung on dyspnoea when the causal effect is estimated using the backdoor set S1 = {smoke} vs. when all variables preceding the treatment in two topological orders are used as backdoor sets: S2 = {asia, smoke} and S3 = {asia, tub, smoke}. ∆12 and ∆13 refer to the absolute differences between the pairs ϵATE(S1), ϵATE(S2) and ϵATE(S1), ϵATE(S3) respectively. From the last two columns, we observe that using the variables that come before the treatment node in a topological order as a backdoor set does not result in the deviation of causal effects from the ground truth effects.