Inductive Learning of Logical Theories
A Complexity-graded Analysis
This work presents a systematic methodology to analyse the capabilities and limitations of Large Language Models (LLMs), with feedback from a formal inference engine, on logic theory induction. The analysis is complexity-graded w.r.t. rule dependency structure, allowing quantification of specific inference challenges on LLM performance. Integrating LLMs with formal methods is a promising frontier in the Natural Language Processing field, as an important avenue for improving model inference control and explainability. In particular, inductive learning over complex sets of facts and rules poses unique challenges for current autoregressive models, as they lack explicit symbolic grounding. While they can be complemented by formal systems, the properties delivered by LLMs regarding inductive learning are not well understood and quantified. Empirical results indicate that the largest LLMs can achieve competitive results against a SOTA Inductive Logic Programming (ILP) system baseline, but also that tracking long predicate relationship chains is a more difficult obstacle for the LLMs than theory complexity.

1 Introduction

The integration of Large Language Models (LLMs) with formal methods stands out as a promising frontier in the field of Natural Language Processing. It is an avenue for improving model inference control and explainability, by both complementing the content flexibility of LLMs with the systematicity of symbolic/formal systems (Quan et al., 2024a,b) and by using well-defined formal settings to assess the underlying inference properties of the model.

Inductive Logic Programming (ILP) is a subfield of symbolic AI which focuses on methods that can derive (generalise) theories to explain observed facts and rules (Muggleton, 1991; Nienhuys-Cheng and de Wolf, 1997). Addressing inductive learning with LLMs entails combining an extensive set of structural and semantic signals to approximate the most probable answer in a generative fashion. While such means of problem solving might lack explicit symbolic grounding, LLMs can leverage their large-scale internal representations to support inductive-style inference. Still, the properties delivered by LLMs w.r.t. inductive learning, in particular regarding logic rule and theory induction, are not well understood and quantified.

This paper presents a systematic methodology to evaluate the inductive learning properties (in the context of logic theory induction) of LLMs. It is aimed at answering the following research questions (RQs):

RQ1. To what extent can the combination of an LLM with feedback from a formal inference engine compare to a SOTA inductive logic programming (ILP) system in logic theory induction, w.r.t. inference quality at different degrees of complexity?

RQ2. How does the complexity of the target theories affect the inference quality of LLMs for inductive reasoning?

In order to address these RQs, we propose a method for combining iterative theory refinement on stock LLMs (i.e., zero-shot inference), a formal ILP inference engine, and a synthetic generator for inductive reasoning datasets, in order to perform systematic evaluations of the LLM-induced theories, using state-of-the-art ILP solvers as a baseline. Moreover, to quantify the extent to which LLMs can address ILP tasks, the evaluation is graded w.r.t. the dependency complexity of the target rulesets. Figure 1 schematises the proposed approach.

This work's main contributions are as follows:

1. (Methodological) A novel method for systematic evaluation of LLM-induced logic theories with feedback from a formal inference engine.

2. (Empirical) A detailed empirical analysis of the strengths and limitations of SOTA LLMs regarding logic theory induction, according to target ruleset complexity.

3. (Resources) A reusable and extensible framework for extending and assessing the inductive capabilities of LLMs.
[Figure 1 diagram: (a) theory induction with an LLM, (b) logic program interpreter evaluation over positive/negative example instances, (c) prompt generation and refinement using evaluation feedback (accuracy, precision, recall, F1, wrongly classified examples), (d) complexity-graded theory induction evaluation over CHAIN, RDG and DRDG instance sets, parameterised by noise level, dependency, recursivity and complexity.]
Figure 1: The proposed method to evaluate theory induction with an LLM in Prolog based on background knowledge
and training examples. The process starts with a prompt generator (c) that formulates prompts for an LLM (a).
Both the background knowledge and training sets are parameterised by different noise and rule complexity levels:
Chain, Rooted Directed Graph (RDG), Disjunctive Rooted DG (DRDG), and Mixed. The LLM generates theories, which are
then evaluated by a logic program interpreter (b). The evaluation feedback, including accuracy, precision, recall,
and F1 scores, as well as wrongly classified examples, is used to refine the prompts iteratively. We analyse and
categorise the generated theories according to their complexity (d).
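To make the evaluation step (panel (b)) concrete, the sketch below shows one way a logic program interpreter could score an induced theory against the positive and negative examples. This is a minimal illustration assuming SWI-Prolog; the predicate score/3 is a hypothetical helper, and pos/1 and neg/1 mirror the example format shown in Figure 1 rather than the paper's actual implementation.

% Minimal sketch (SWI-Prolog assumed): score an induced theory, loaded together
% with the background knowledge, against pos/1 and neg/1 example facts.
:- dynamic pos/1, neg/1.

% An example is covered if the theory plus background knowledge entail it
% (assuming the induced theory defines the target predicate).
covered(Example) :- call(Example).

score(Precision, Recall, F1) :-
    aggregate_all(count, (pos(E), covered(E)), TP),      % true positives
    aggregate_all(count, (neg(E), covered(E)), FP),      % false positives
    aggregate_all(count, (pos(E), \+ covered(E)), FN),   % false negatives
    Precision is TP / max(1, TP + FP),
    Recall is TP / max(1, TP + FN),
    F1 is 2 * Precision * Recall / max(0.000001, Precision + Recall).

A refinement prompt would then include such scores together with the wrongly classified examples, as depicted in the figure.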
The remainder of the paper is organised as follows: Section 2 formalises the relevant ILP concepts, dataset generation, and the LLMs used in the paper. Section 3 describes the proposed method and algorithms. Section 4 presents the experimental procedures, results and discussion, followed by related work (Section 5) and conclusions (Section 6).

2 Inductive Learning, Complexity & Datasets

In this section we introduce the target task and the typology of inductive learning complexity classes, as well as the dataset generation process.

ILP operates directly with logical predicates and rules, using a formal representation, typically first-order logic (FOL). For example, the fact "parent(john, tom)." means that John is the parent of Tom, and "parent(john, anne)." means that John is the parent of Anne. From these facts, we can create a rule (theory) "sibling(X, Y) :- parent(P, X), parent(P, Y), X \= Y.". This rule states that if there exists a parent who has both X and Y as children, and X and Y are not identical, then X and Y are considered siblings.
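As a quick illustrative check (a hypothetical query, not from the paper), loading these clauses into a Prolog interpreter yields the expected sibling pair:

parent(john, tom).
parent(john, anne).
sibling(X, Y) :- parent(P, X), parent(P, Y), X \= Y.

% ?- sibling(tom, anne).   succeeds
% ?- sibling(tom, tom).    fails, since X \= Y excludes identical bindings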
Deriving rules from a set of examples is a process known as theory induction. This task can be formally defined as follows.

Given: background knowledge (BK), a set of logical clauses that represent prior knowledge about the domain; a set of positive examples (E+), ground facts (instances) which the learned theory should entail (i.e., examples that should be true according to the theory); and a set of negative examples (E−), ground facts which the learned theory should not entail (i.e., examples that should be false according to the theory).

Find: a hypothesis H (a set of logical clauses) such that:
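The entailment conditions completing this definition follow the standard ILP formulation (stated here as a reconstruction consistent with the definitions above):

\forall e \in E^{+} : BK \land H \models e \quad \text{(completeness: positive examples are entailed)}

\forall e \in E^{-} : BK \land H \not\models e \quad \text{(consistency: negative examples are not entailed)}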
CHAIN REC.  54.2 ± 14.1   19.6 ± 7.8    4.4 ± 2.5
RDG         60.2 ± 9.8    18.6 ± 12.3   4.2 ± 3.4
RDG REC.    63.2 ± 9.0    16.6 ± 4.7    3.4 ± 1.3
DRDG        61.2 ± 14.1   31.0 ± 26.0   8.0 ± 8.2
DRDG REC.   54.2 ± 12.5   34.0 ± 22.2   9.0 ± 6.2
MIXED       54.4 ± 18.1   24.2 ± 12.3   4.8 ± 2.7

Table 2: Statistics for each dataset category. A detailed description of each can be found in Section 2.

world scenarios, and missing facts. The mean values reported are based on the results obtained from the theory that was generated from the train set and evaluated on the test set.

The experiment used the OpenAI service for the GPT models. For Popper, Llama3-8B-Instruct, Gemma-7B-It and Mixtral-8x7B-Instruct-v0.1, it was conducted on a computer with an Intel(R) Xeon(R) Gold 5217 CPU @ 3.00GHz, 188GB RAM, and 2x NVIDIA RTX A6000 (48GB VRAM) GPUs. The software used was CUDA 12.3, PyTorch 2.2.2, and Transformers 4.41.2. The prompt templates used are included in the supplementary material (Appendix C).

Figure 2: F1 score trends across categories. Different models (GPT-4o, Llama3 8b instruct, Popper, and Mixtral-8x7B-Instruct-v0.1) under varying noise levels and categories reveal distinct performance patterns. GPT-4o demonstrates stable accuracy yet sensitivity to noise, particularly in complex rule-based categories like RDG and DRDG. Mixtral-8x7B-Instruct-v0.1 exhibits mixed results with notable variability across categories, particularly in more complex tasks. Llama3 8b instruct struggles with low scores, indicating challenges in reasoning and theory generation.

4.2 Results & Discussion

Overall results for F1 are presented in Figure 2. We additionally report on processing times as a measure of practical interest in Figure 3. We present the

[Figure 3: log(Time (s)) per category for Popper, GPT-4o, GPT-3.5-Turbo, Llama3 and Mixtral across noise levels.]
(Yang et al., 2017) proposed Neural Logic Programming, an end-to-end differentiable model integrating neural networks with logic programming. Within the LLM-Symbolic space, (Wan et al., 2024) developed LogicAsker, which evaluates and improves LLMs' logical reasoning using propositional and predicate logic. It identifies reasoning failures and enhances capabilities through in-context learning. Within the context of symbolic toolformers over LLMs, (Quan et al., 2024a,b) proposed methods for improving explanatory reasoning with the support of formal iterative cycles, using both logical solvers and theorem provers to support more controlled step-wise reasoning.

Despite these advancements at the interface of LLM-based reasoning and formal controls, it is unclear to what extent and under which conditions LLMs can perform formal reasoning (Huang and Chang, 2023). (Sinha et al., 2019) introduced CLUTRR, a benchmark assessing LLMs' structural learning by inferring kinship relations in stories, requiring relationship extraction and logical rule inference. (Zhu et al., 2024) proposed the Hypotheses-to-Theories (HtT) framework to improve LLM reasoning by learning explicit rules in two stages: generating and verifying rules (induction) and using the obtained rule library for reasoning (deduction). HtT enhances relational and numerical reasoning and concept learning.

(Madaan et al., 2023) introduces a technique for improving machine learning models through iterative refinement. This approach allows models to improve their performance by continuously evaluating and adjusting their predictions based on self-generated feedback. By critiquing their own outputs, models can identify errors and make corrections over successive iterations, leading to increased accuracy and robustness across different tasks. Our work builds upon this approach by employing a formal method to evaluate the generated theories and drive their refinement. In a related study, (Dziri et al., 2023), the authors investigate the limitations of transformer models on compositional tasks.

In contrast, our work focuses on the still under-explored area of assessing and controlling the inductive learning/inference capabilities of LLMs. These contributions integrate LLMs and formal logic for robust theory induction and allow a graded analysis of LLM capabilities with respect to theory induction complexity.

6 Conclusion

In this study we thoroughly investigate the integration of state-of-the-art formal theory induction within the context of large language models (LLMs), aiming to elucidate the extent to which LLMs can systematically perform inductive learning for theories spanning different complexity levels. At the heart of this exploration lies the recognition of relational data's inherent semantic depth, stemming from its symbolic representations. The empirical results presented here indicate the ability of LLMs to address inductive learning tasks, with the largest LLMs achieving competitive results against the algorithmic SOTA and better tolerance to higher noise levels, which can be attributed to their semantic flexibility. This flexibility, however, has certain limitations, as we found that the tested language models are more limited by their capacity to track long relationship chains of independent predicates than by the dependency complexity of the rule sets (disjunctiveness, recursivity).

As future work we plan to use larger datasets to test the scalability of the proposed approach, allowing researchers to assess its performance across a broader range of scenarios. Additionally, it is worth considering the integration of the LLM's output as an initial input for the ILP process, potentially leveraging the strengths of both approaches to overcome their respective limitations. Another avenue is the use of intermediate ILP steps, such as the bottom clause, to help the LLM induce a theory.
7 Limitations

While the proposed evaluation methodology aims to cover a wide range of logic theory induction complexity, it is limited in its resolution to the categories specified by (Cornelio and Thost, 2021), and does not quantify other ruleset characteristics, such as workspace size or unification rate in the case of Prolog (Dikovsky, 1993).

The methodology compares all models under the same inputs. Therefore it is not concerned with extracting the maximum performance of any given model, but with obtaining a relative assessment of their fundamental capabilities. This means that the scores reported in the empirical analysis should not be taken as a measure of SOTA performance.

Furthermore, the time gains demonstrated in the experiments are presented as an achievable result, conditioned on the combination of software and hardware indicated in the paper and the services provided by third parties (e.g., OpenAI). They should not be interpreted as a measure of computational efficiency.

Acknowledgements

This work was partially funded by the Swiss National Science Foundation (SNSF) project NeuMath (200021_204617) and by the Manchester Experimental Cancer Medicine Centre and the NIHR Manchester Biomedical Research Centre.

References

Yi Chu, Shaowei Cai, and Chuan Luo. 2023. NuWLS: Improving local search for (weighted) partial MaxSAT by new weighting techniques. Proceedings of the AAAI Conference on Artificial Intelligence, 37(4):3915–3923.

Cristina Cornelio and Veronika Thost. 2021. Synthetic datasets and evaluation tools for inductive neural reasoning. In Proceedings of the 30th International Conference on Inductive Logic Programming, ILP2020-21 @ IJCLR.

Andrew Cropper and Rolf Morel. 2021. Learning programs by learning from failures. Machine Learning, 110(4):801–856.

A Ja Dikovsky. 1993. On computational complexity of Prolog programs. Theoretical Computer Science, 119(1):63–102.

Nouha Dziri, Ximing Lu, Melanie Sclar, Xiang Lorraine Li, Liwei Jian, Bill Yuchen Lin, Peter West, Chandra Bhagavatula, Ronan Le Bras, Jena D Hwang, et al. 2023. Faith and fate: Limits of transformers on compositionality. arXiv preprint arXiv:2305.18654.

Jie Huang and Kevin Chen-Chuan Chang. 2023. Towards reasoning in large language models: A survey. In Findings of the Association for Computational Linguistics: ACL 2023, pages 1049–1065, Toronto, Canada. Association for Computational Linguistics.

Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. 2023. Mistral 7B. arXiv preprint arXiv:2310.06825.

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Sean Welleck, Bodhisattwa Prasad Majumder, Shashank Gupta, Amir Yazdanbakhsh, and Peter Clark. 2023. Self-refine: Iterative refinement with self-feedback. Preprint, arXiv:2303.17651.

Stephen Muggleton. 1991. Inductive logic programming. New Generation Computing, 8:295–318.

Shan-Hwei Nienhuys-Cheng and Roland de Wolf. 1997. What is inductive logic programming? Springer.

Xin Quan, Marco Valentino, Louise A. Dennis, and André Freitas. 2024a. Enhancing ethical explanations of large language models through iterative symbolic refinement. Preprint, arXiv:2402.00745.

Xin Quan, Marco Valentino, Louise A. Dennis, and André Freitas. 2024b. Verification and refinement of natural language explanations through LLM-symbolic theorem proving. Preprint, arXiv:2405.01379.

Koustuv Sinha, Shagun Sodhani, Jin Dong, Joelle Pineau, and William L. Hamilton. 2019. CLUTRR: A diagnostic benchmark for inductive reasoning from text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4506–4515, Hong Kong, China. Association for Computational Linguistics.

Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. 2023. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.

Yuxuan Wan, Wenxuan Wang, Yiliu Yang, Youliang Yuan, Jen-tse Huang, Pinjia He, Wenxiang Jiao, and Michael R. Lyu. 2024. A & B == B & A: Triggering logical reasoning failures in large language models. CoRR, abs/2401.00757.

David S Warren, Veronica Dahl, Thomas Eiter, Manuel V Hermenegildo, Robert Kowalski, and Francesca Rossi. 2023. Prolog: The Next 50 Years, volume 13900. Springer Nature.
Fan Yang, Zhilin Yang, and William W Cohen. 2017. Differentiable learning of logical rules for knowledge base reasoning. Advances in Neural Information Processing Systems, 30.

Zhaocheng Zhu, Yuan Xue, Xinyun Chen, Denny Zhou, Jian Tang, Dale Schuurmans, and Hanjun Dai. 2024. Large language models can learn rules. Preprint, arXiv:2310.07064.

Appendices

A Further theoretical background

A.1 Detailed Complexity Classes

Category Chain. In this category, every rule, except the root, deduces facts relevant to precisely one other rule. Essentially, each node has at most one parent, and each rule is associated with at most one other rule that might infer relevant facts. Recursive rules, where the predicate in the head also occurs in the body, are exceptions, as they are relevant both for themselves and for one additional rule. For example:

p5(X, Y) :- p0(X, Z), p2(Y, W).
p0(X, Y) :- p3(X, Z), p4(W, Y).
p3(X, Y) :- p6(X, Z), p7(W, Y).

According to Rule 1, p0(X, Z) is necessary for p5(X, Y). Therefore, satisfying p5(X, Y) requires p0(X, Z), which in turn requires p3(X, Z) and p4(W, Y). This creates a dependency chain where p5(X, Y) relies on p3(X, Z) and p4(W, Y).

Category Rooted DG (RDG). This category generalises the Chain category. Here, every rule can be relevant for several others, and each node can have multiple parent nodes. Furthermore, for each rule, there may be several other rules that might infer facts relevant for it. However, for each predicate occurring in the body of a rule, there must be at most one other rule with this predicate in the head. In other words, there are no alternative rules to derive facts relevant for a rule with respect to a specific body atom. For example:

p0(X0,X1) :- p1(X1,X2),p3(X0,X1).
p3(X0,X1) :- p8(X0,X1),p6(X0,X1).
p1(X1,X2) :- p7(X2,X1).

In the given example, each rule has at least one child node. For instance, p0(X0, X1) has two child nodes: p1(X1, X2) and p3(X0, X1). Each predicate in the body of a rule corresponds to at most one rule with that predicate in the head. There are no alternative rules for deriving facts related to a specific body atom. For example, p1(X1, X2) appears in the body of p0(X0, X1) and only one rule has p1(X1, X2) in the head: p1(X1, X2) :- p7(X2, X1). The same applies to p3.

Category Disjunctive Rooted DG (DRDG). Category DRDG generalises Category RDG by allowing for alternative rules, represented as children of an "OR" node. For instance:

p7(X0,X1) :- p5(X0,X1).
p5(X0,X1) :- p0(X0,X1).
p5(X0,X1) :- p8(X1,X0).

In the example, the first rule states p7(X0, X1) is true if p5(X0, X1) is true, indicating p7 depends on p5. The second rule states p5(X0, X1) is true if p0(X0, X1) is true, showing p5 depends on p0. The third rule states p5(X0, X1) is true if p8(X1, X0) is true, adding an alternative condition with swapped arguments. Thus, p5 acts as an "OR" condition in the first rule's body and the second and third rules' heads.

Category Mixed. A rule graph in this category contains connected components of the different categories mentioned above. Additionally, recursion is allowed, meaning that the head of a rule may appear in its body as well.
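As an illustration of such recursion (a hypothetical rule pair in the same notation, not drawn from the generated datasets), the head predicate reappears in the body of the second clause:

p2(X0,X1) :- p1(X0,X1).
p2(X0,X1) :- p1(X0,X2), p2(X2,X1).

Here, deriving a p2 fact may require chaining through an arbitrary number of p1 facts.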
B Further empirical data & findings

GPT-4o shows stable performance with moderate to high accuracy but is sensitive to noise, especially in RDG and DRDG. For instance, its F1 score in RDG drops from 0.75 at noise level 0.1 to 0.25 at noise level 0.3. GPT-3.5-Turbo does not perform well with complex categories like RDG and DRDG under noise, with an F1 score of 0 at noise level 0.3 in RDG.

Mixtral-8x7B-Instruct-v0.1 shows high variability w.r.t. noise, performing reasonably well in RDG (0.64 F1 at noise level 0.1, dropping to 0.43 at noise level 0.3) but with significant time consumption, especially in DRDG (130.16 seconds at noise level 0.2). It does not perform well with complex rule sets like DRDG across all noise levels.

Llama3-8B instruct has low accuracy across most categories, with slight improvement at higher noise levels but increased time consumption. At noise level 0.1, it achieves an F1 score above 0 only in RDG R.
Category Noise 0.1 Noise 0.2 Noise 0.3
F1 Time (s) F1 Time (s) F1 Time (s)
MIXED 0.51 1071.53 0.44 817.15 0.35 2418.88
CHAIN 0.80 1397.35 0.46 1254.33 0.57 3824.31
CHAIN R. 0.77 1123.25 0.41 1190.14 0.14 3646.43
RDG 0.74 1122.13 0.57 854.85 0.50 2460.90
RDG R. 0.71 1523.37 0.68 940.50 0.38 1659.27
DRDG 0.77 934.98 0.41 1089.47 0.25 1363.27
DRDG R. 0.68 927.30 0.48 882.28 0.26 820.51
Table 4: Results for different categories with theory induced by Popper and different noise levels.
Category GPT-4o - noise 0.1 GPT-4o - noise 0.2 GPT-4o - noise 0.3
F1 (avg) Time (s) F1 (avg) Time (s) F1 (avg) Time (s)
MIXED 0.70 8.57 0.52 10.73 0.62 9.81
CHAIN 0.52 8.54 0.42 11.05 0.35 8.29
CHAIN R. 0.72 11.48 0.53 8.86 0.49 8.13
RDG 0.75 8.80 0.50 11.94 0.22 20.16
RDG R. 0.74 10.83 0.55 7.35 0.49 10.79
DRDG 0.46 12.59 0.39 16.44 0.42 11.14
DRDG R. 0.83 13.11 0.32 13.29 0.12 8.45
Category GPT-3.5-Turbo - noise 0.1 GPT-3.5-Turbo - noise 0.2 GPT-3.5-Turbo - noise 0.3
F1 (avg) Time (s) F1 (avg) Time (s) F1 (avg) Time (s)
MIXED 0.20 4.32 0.35 4.11 0.32 4.20
CHAIN 0.24 3.47 0.33 7.88 0.11 3.13
CHAIN R. 0.545 2.592 0.00 7.11 0.00 3.05
RDG 0.56 4.22 0.29 3.19 0.00 3.76
RDG R. 0.20 3.48 0.02 4.73 0.00 3.98
DRDG 0.28 4.63 0.22 8.91 0.08 3.96
DRDG R. 0.31 5.01 0.15 3.06 0.01 11.14
Category Llama3-8B-it - noise 0.1 Llama3-8B-it - noise 0.2 Llama3-8B-it - noise 0.3
F1 (avg) Time (s) F1 (avg) Time (s) F1 (avg) Time (s)
MIXED 0.00 62.54 0.00 176.31 0.02 51.55
CHAIN 0.00 38.86 0.21 12.84 0.06 29.31
CHAIN R. 0.00 31.39 0.32 34.90 0.08 57.75
RDG 0.00 41.42 0.00 18.42 0.07 55.80
RDG R. 0.21 20.86 0.20 36.45 0.00 40.70
DRDG 0.00 76.70 0.08 45.48 0.04 60.10
DRDG R. 0.00 25.96 0.00 43.88 0.00 14.61
Category Mixtral-8x7B-It-v0.1 - noise 0.1 Mixtral-8x7B-It-v0.1 - noise 0.2 Mixtral-8x7B-It-v0.1 - noise 0.3
F1 (avg) Time (s) F1 (avg) Time (s) F1 (avg) Time (s)
MIXED 0.36 34.04 0.21 65.83 0.20 71.60
CHAIN 0.49 50.07 0.00 45.15 0.16 45.02
CHAIN R. 0.47 29.56 0.00 87.16 0.08 69.38
RDG 0.60 75.42 0.05 35.68 0.00 101.54
RDG R. 0.48 74.58 0.00 123.80 0.00 69.17
DRDG 0.10 90.51 0.28 130.16 0.12 82.51
DRDG R. 0.15 60.61 0.00 97.11 0.00 64.92
Table 5: Performance metrics for various categories under different noise conditions.
It often fails to produce valid theories or introduces new predicates incorrectly. For instance, the rule p(X, Y) :- p2(X, Y); p0(X, Y); p4(X, Y); p9(X, Y). is valid, but the predicate p, in the head of the rule, does not exist in the BK, nor is it the target predicate. Llama3-8B was the only model to exhibit this pattern.

The models generally present higher initial accuracy on recursive (R) categories, but are more sensitive to noise on them, leading to larger performance drops. For example, on DRDG R., GPT-4o's F1 score drops from 0.83 at noise level 0.1 to 0.12 at noise level 0.3. Non-recursive categories like CHAIN present more stable performance.

Time variance has an inverse relation w.r.t. noise levels for the LLMs when compared with Popper. As the noise increases, Popper may take less time to induce a theory based on the remaining relevant data, as indicated by the scattering pattern progression in Figure 4, while the LLMs are more likely to take longer to process it. Detailed values are presented in Tables 4 and 5.
• missing:
  – Definition: Specifies the percentage of missing data in the dataset.
  – Constraint: Must be a value in the range [0, 1].
• category:

For the refinement steps, the following prompt template was used:

% Rules
ancestor(X,Y) :- parent(X,Y).
ancestor(X,Y) :- parent(X,Z),ancestor(Z,Y).

Scenario 2: Noisy Data
Alternatively, if 20% of the data contains noise, the dataset might appear as follows:

Table 6: Main information about the models evaluated in this study. All models tested are auto-regressive decoder-only, with Mixtral-8x7B Instruct being a Sparse Mixture of Experts (SMoE). The original, non-quantised versions were used.

theory :-
    X \= Y,
    p5(X, X),
    \+ pos(p2(X, Y)),
    \+ neg(p2(X, Y)),
    asserta(pos(p2(X, X))).