
Inductive Learning of Logical Theories with LLMs:

A Complexity-graded Analysis

João Pedro Gandarela2† , Danilo S. Carvalho3 , André Freitas1,2,3


1 Department of Computer Science, University of Manchester, United Kingdom
2 Idiap Research Institute, Switzerland
3 National Biomarker Centre, CRUK-MI, Univ. of Manchester, United Kingdom
{firstname.lastname}[@manchester.ac.uk]|[@idiap.ch]†

arXiv:2408.16779v1 [cs.CL] 15 Aug 2024

Abstract

This work presents a novel systematic methodology to analyse the capabilities and limitations of Large Language Models (LLMs) with feedback from a formal inference engine, on logic theory induction. The analysis is complexity-graded w.r.t. rule dependency structure, allowing quantification of specific inference challenges on LLM performance. Integrating LLMs with formal methods is a promising frontier in the Natural Language Processing field, as an important avenue for improving model inference control and explainability. In particular, inductive learning over complex sets of facts and rules poses unique challenges for current autoregressive models, as they lack explicit symbolic grounding. While they can be complemented by formal systems, the properties delivered by LLMs regarding inductive learning are not well understood and quantified. Empirical results indicate that the largest LLMs can achieve competitive results against a SOTA Inductive Logic Programming (ILP) system baseline, but also that tracking long predicate relationship chains is a more difficult obstacle than theory complexity for the LLMs.

1 Introduction

The integration of Large Language Models (LLMs) with formal methods stands out as a promising frontier in the field of Natural Language Processing. It is an avenue for improving model inference control and explainability, by both complementing the content flexibility of LLMs with the systematicity of symbolic/formal systems (Quan et al., 2024a,b) and by using well-defined formal settings to assess the underlying inference properties of the model.

Inductive Logic Programming (ILP) is a subfield of symbolic AI which focuses on methods that can derive (generalise) theories to explain observed facts and rules (Muggleton, 1991; Nienhuys-Cheng and de Wolf, 1997). Addressing inductive learning over complex sets of facts and rules poses unique challenges for current autoregressive LLMs, as they do not operate over data symbolically, but rather combine an extensive set of structural and semantic signals to approximate the most probable answer in a generative fashion.

While such means of problem solving might lack explicit symbolic grounding, LLMs can leverage their large-scale internal representations to support inductive-style inference. Still, the properties delivered by LLMs w.r.t. inductive learning, in particular regarding logic rules and theory induction, are not well understood and quantified.

This paper presents a systematic methodology to evaluate the inductive learning properties of LLMs, in the context of logic theory induction. It is aimed at answering the following research questions (RQs):

RQ1. To what extent can the combination of an LLM with feedback from a formal inference engine compare to a SOTA inductive logic programming (ILP) system in logic theory induction, w.r.t. inference quality at different degrees of complexity?

RQ2. How does the complexity of the target theories affect the inference quality of LLMs for inductive reasoning?

In order to address these RQs, we propose a method combining iterative theory refinement on stock LLMs (i.e., zero-shot inference), a formal ILP inference engine, and a synthetic generator for inductive reasoning datasets, in order to perform systematic evaluations of the LLM-induced theories, using state-of-the-art ILP solvers as a baseline. Moreover, to quantify the extent to which LLMs can address ILP tasks, the evaluation is graded w.r.t. the dependency complexity of the target rulesets. Figure 1 schematises the proposed approach.

This work's main contributions are as follows:

1. (Methodological) A novel method for systematic evaluation of LLM-induced logic theories with feedback from a formal inference engine.

2. (Empirical) A detailed empirical analysis of the strengths and limitations of SOTA LLMs regarding logic theory induction, according to target ruleset complexity.

3. (Resources) A reusable and extensible framework for extending and assessing the inductive capabilities of LLMs.

The remainder of the paper is organised as follows: Section 2 formalises the relevant ILP concepts, dataset generation, and the LLMs used in the paper. Section 3 describes the proposed method and algorithms. Section 4 presents the experimental procedures, results and discussion, followed by related work (Section 5) and conclusions (Section 6).
Figure 1: The proposed method to evaluate theory induction with an LLM in Prolog, based on background knowledge and training examples. The process starts with a prompt generator (c) that formulates prompts for an LLM (a). Both the background knowledge and training sets are parameterised by different noise levels and rule complexity classes: Chain, Rooted Directed Graph (RDG), Disjunctive Rooted DG (DRDG), and Mixed. The LLM generates theories, which are then evaluated by a logic program interpreter (b). The evaluation feedback, including accuracy, precision, recall, and F1 scores, as well as wrongly classified examples, is used to refine the prompts iteratively. We analyse and categorise the generated theories according to their complexity (d).

2 Inductive Learning, Complexity & Datasets

In this section we introduce the target task and the typology of inductive learning complexity classes, as well as the dataset generation process.

2.1 Inductive Logic Programming

Inductive Logic Programming's main objective is to generate logical hypotheses or theories from the available background knowledge, such as facts, rules, and positive and negative examples. Unlike traditional machine learning methods, ILP works directly with logical predicates and rules, using a formal representation that is usually expressed in first-order logic (FOL). For example, the fact "parent(john, tom)." means that John is the parent of Tom, and "parent(john, anne)." means that John is the parent of Anne. From that, we can create a rule (theory) "sibling(X, Y) :- parent(P, X), parent(P, Y), X ≠ Y.". This rule states that if there exists a parent who has both X and Y as children, and X and Y are not identical, then X and Y are considered siblings. Deriving rules from a set of examples is a process known as theory induction. This task can be formally defined as follows:

Given: background knowledge (BK), a set of logical clauses that represent prior knowledge about the domain; a set of positive examples (E+), a set of ground facts (instances) which the learned theory should entail (i.e., these are examples that should be true according to the theory); and a set of negative examples (E−), a set of ground facts which the learned theory should not entail (i.e., these are examples that should be false according to the theory). Find a hypothesis H (a set of logical clauses) such that:

Completeness: For every example e ∈ E+, H ∪ BK |= e, meaning the hypothesis H, together with the background knowledge BK, should entail all positive examples.

Consistency: For every example e ∈ E−, H ∪ BK ⊭ e, meaning the hypothesis H, together with the background knowledge BK, should not entail any negative examples.

Formally, an ILP system seeks a hypothesis H that satisfies:

∀e ∈ E+, BK ∪ H |= e
∀e ∈ E−, BK ∪ H ⊭ e

The learned hypothesis H should thus be a logical theory that explains the positive examples and excludes the negative examples, based on the given background knowledge.
background knowledge.
2.2 Theory complexity

Inductive learning can be organised according to different classes of complexity, which involves a typology of the structural complexity of the problem. Moreover, variables such as the amount and type of noise within the evidence set (such as the number of incorrect or missing facts, or completeness levels) can be integrated within this setting. Following the typology of (Cornelio and Thost, 2021), four base categories of rules can be introduced based on their dependencies: Chain, Rooted Directed Graph (RDG), Disjunctive Rooted DG (DRDG), and Mixed. Each category represents a hierarchical generalisation, with each step encompassing a broader range of structures. Starting with CHAIN, which represents a linear composition of predicates, RDG is a generalisation of CHAIN, meaning it includes all chain structures but also accommodates more complex dependencies. Moving up, DRDG generalises RDG by incorporating disjunctive (alternative) rules, thus offering a more extensive representation of dependencies. Finally, MIXED contains connected components from CHAIN, RDG, and DRDG. Each progression from CHAIN to MIXED represents a step towards greater inclusivity and complexity in the types of structures captured.

The complexity classes and their characteristics are summarised in Table 1. A detailed description of each class is provided in the supplementary material (Appendix A.1).

Category     # Parents   Recursive   Alt. rules
CHAIN        1           No          No
CHAIN REC.   1           Yes         No
RDG          1–*         No          No
RDG REC.     1–*         Yes         No
DRDG         1–*         No          Yes
DRDG REC.    1–*         Yes         Yes
MIXED        1–*         Yes         Yes

Table 1: Characteristics for each dataset category. # Parents refers to the number of rules that each rule can deduce relevant facts to. Recursive refers to whether a predicate in the head of a rule can also occur in the body. Alt. rules indicates whether a predicate can be deduced by alternative rules.

2.3 Dataset synthesis

In order to generate datasets for rigorous analysis, this study employed the RuDaS tool (Cornelio and Thost, 2021) to systematically vary parameters such as noise, open-world degree, and missing data. By adjusting these factors in conjunction with the category parameter, it is possible to ensure comprehensive coverage of different structural configurations and complexity levels.

For each configuration, the following settings were applied (a sketch of the resulting configuration grid is given after this list):

• The minimum and maximum number of Directed Acyclic Graphs (DAGs) were both set to 1 (mindags = 1, maxdags = 1).

• Noise levels were systematically varied at intervals of 0.1, 0.2, and 0.3.

• The percentage of missing data (missing) and open world degree (owa) were similarly varied across 0.1, 0.2, and 0.3.

• The category parameter was set to cover all the complexity classes described in the previous section, and listed in Table 1. The distribution of each category in the synthesised dataset is an independent hyperparameter, discussed in Section 4.

Further details regarding the dataset generation can be found in Appendix E.
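The settings above can be pictured as a simple configuration grid. The sketch below is our own illustration; the generate_dataset call is a hypothetical wrapper, and RuDaS's actual interface may differ.

# Sketch of the dataset configuration grid described above (assumption:
# generate_dataset is a hypothetical wrapper around the RuDaS generator).
from itertools import product

categories = ["CHAIN", "CHAIN_REC", "RDG", "RDG_REC", "DRDG", "DRDG_REC", "MIXED"]
rates = [0.1, 0.2, 0.3]   # noise, missing and owa are varied jointly in the paper

configs = [
    {"category": cat, "mindags": 1, "maxdags": 1,
     "noise": r, "missing": r, "owa": r}
    for cat, r in product(categories, rates)
]
print(len(configs))   # 7 categories x 3 rates = 21 configurations
# for cfg in configs:
#     generate_dataset(**cfg)   # hypothetical call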
3 Proposed Approach

The proposed approach can be divided in two parts: iterative refinement and graded evaluation. They form a systematic evaluation loop, covering all the complexity classes described in the previous section, for a given set of LLMs and dataset synthesis parameters.

3.1 Iterative Refinement

This part consists of an iterative refinement loop that alternates between the generation of a theory by a language model and the evaluation of said theory through a formal interpreter. It is comprised of the following components, as illustrated in Figure 1: (a) A language model capable of generating a theory H, based on background knowledge, positive and negative examples, and hypothesis search, provided as a prompt text in a logic program language; typically an LLM with structured (e.g., code) generation capabilities. (b) A logic program interpreter. We use Prolog (Warren et al., 2023) as the logic program language. (c) A prompt generation component, which interleaves logic programming language with natural language queries designed to drive the theory induction responses. The logic programming language expresses the background knowledge and the relevant outputs of the program interpreter. (d) An evaluation module, which uses the logic program interpreter to execute the generated theory H as logical rules, and computes a set of evaluation metrics.

Given: background knowledge (BK), positive examples (E+), negative examples (E−), and assuming a language model LM which can find a hypothesis H (a set of logical clauses) satisfying the conditions of completeness (for every example e ∈ E+, H ∪ BK |= e) and consistency (for every example e ∈ E−, H ∪ BK ⊭ e), the loop proceeds as follows:

1. Context Representation: Represent the input to the language model as a combination of background knowledge and examples: Context = encode(BK, E+, E−).

2. Theory Generation: From a background knowledge set of clauses sampled from a knowledge base dataset, including positive and negative examples, a prompt is created for the LM to induce a theory as Prolog code, i.e. using the language model to generate a set of logical clauses (hypothesis H): H = LM(Theory Prompt + Context).

3. Evaluation of Hypothesis: Check the completeness and consistency conditions:

True Positives (TP): the number of positive examples correctly entailed by the hypothesis. TP = |{e ∈ E+ | BK ∪ H |= e}|

False Positives (FP): the number of negative examples incorrectly entailed by the hypothesis. FP = |{e ∈ E− | BK ∪ H |= e}|

False Negatives (FN): the number of positive examples not entailed by the hypothesis. FN = |{e ∈ E+ | BK ∪ H ⊭ e}|

True Negatives (TN): the number of negative examples correctly not entailed by the hypothesis. TN = |{e ∈ E− | BK ∪ H ⊭ e}|

From these, precision P = TP / (TP + FP), recall R = TP / (TP + FN), and F1 = 2 · (P · R) / (P + R) can be computed.

4. Theory Refinement: Following an initial evaluation, the LM is tasked to refine the induced theory iteratively. Each refinement round involves adjusting the theory based on feedback from the Prolog interpreter validation. The refinement aims to improve the theory's performance by addressing misclassifications and enhancing its predictive capabilities. If H does not satisfy completeness and consistency, update the input representation based on feedback and generate a new hypothesis using the language model, given the Feedback Context ← {FP, FN, P, R, F1} and the final prompt input Input ← (Refinement Prompt + Context + Feedback Context): H ← LM(Input).

The main loop of the algorithm continues until the evaluation metrics meet the defined thresholds or the maximum number of iterations is reached. In each iteration, the language model generates a theory based on the current prompt. This generated theory is then evaluated using the logic program interpreter, in our case Prolog, resulting in a validation set. Evaluation metrics are computed from these validation results and stored. Based on the feedback from the validation, a new prompt is generated by incorporating the initial knowledge base sample, the current theory, and the validation feedback. Our approach removes recursive rules from the LLM-induced theory before evaluation. The refinement loop is summarised in Algorithm 1. The process starts by sampling facts from the knowledge base dataset to create kb. An initial prompt is then generated using these sampled facts, denoted as prompt.

5. Termination: The process continues iteratively until a maximum number of iterations is reached.

Algorithm 1 Iterative LM theory refinement
  Define: KB as the background knowledge dataset.
  Define: PGen as the prompt generator.
  Define: Exs as the positive and negative examples.
  Define: LM as the language model.
  Define: PL as the logic program interpreter.
  Define: Eval as the evaluation module.
  Define: M as the evaluation metrics set.
  Define: Maxiter as the maximum number of iterations.
  Define: MTthresh ← Map : M → R as the evaluation metrics threshold.

  Let prompt ← PGen(KB, Exs)
  Let iter ← 0
  Let results ← Map : M → R
  while (∃m ∈ results : results[m] < MTthresh[m]) ∧ (iter < Maxiter) do
      theory ← LM(prompt)
      results ← Eval(Exs)
      prompt ← PGen(KB, theory, Exs)
      iter ← iter + 1
  end while
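The loop above can be pictured with the following minimal Python sketch. This is our own illustration under stated assumptions, not the authors' released implementation: the LLM call is any text-generation callable, the helpers make_theory_prompt, make_refinement_prompt and parse_clauses are hypothetical placeholders, and the Prolog interpreter is accessed through SWI-Prolog via the pyswip bindings.

# Minimal sketch of the iterative refinement loop (steps 1-5 / Algorithm 1).
from pyswip import Prolog

def evaluate(bk, theory, e_pos, e_neg):
    # NOTE: pyswip exposes a single global SWI-Prolog engine, so a real
    # implementation should load each candidate theory into a fresh process;
    # this sketch simply asserts BK ∪ H and queries the examples.
    pl = Prolog()
    ok = True
    for clause in bk + theory:
        try:
            pl.assertz(clause)
        except Exception:           # syntactically invalid clause (see Section 4.3)
            ok = False
    def entailed(goal):
        if not ok:
            return False
        try:
            return bool(list(pl.query(goal)))
        except Exception:           # e.g., undefined predicate in the goal
            return False
    tp = sum(entailed(e) for e in e_pos)
    fp = sum(entailed(e) for e in e_neg)
    fn = len(e_pos) - tp
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    wrong = [e for e in e_pos if not entailed(e)]
    return {"precision": precision, "recall": recall, "f1": f1, "wrong": wrong}

def refine(llm, bk, e_pos, e_neg, max_iter=4, f1_thresh=1.0):
    prompt = make_theory_prompt(bk, e_pos, e_neg)             # hypothetical helper
    best = {"f1": -1.0, "theory": []}
    for _ in range(max_iter):
        theory = parse_clauses(llm(prompt))                   # hypothetical helper
        results = evaluate(bk, theory, e_pos, e_neg)
        if results["f1"] > best["f1"]:
            best = {**results, "theory": theory}
        if results["f1"] >= f1_thresh:
            break
        prompt = make_refinement_prompt(bk, theory, results)  # hypothetical helper
    return best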
3.2 Graded evaluation

A synthetic data generator is used to control for the input parameters of the ruleset complexity, namely: categorical distribution (CHAIN, RDG, DRDG, etc.), background knowledge, positive and negative examples, as well as the amount of noise introduced within the dataset.

1. Categorised Learning Sets: Consisting of C: the set of ruleset complexity categories (e.g., CHAIN, RDG, DRDG, etc.), N: the set of noise levels, and S: the number of samples per combination of C and N. For each c ∈ C and n ∈ N, generate S datasets {D_{c,n,i} | i = 1, ..., S} where each dataset D_{c,n,i} includes:

D_{c,n,i} = (BK_{c,n,i}, E+_{c,n,i}, E−_{c,n,i}, noise_{c,n})

2. Hypothesis Generation and Evaluation: For each dataset D_{c,n,i}, use a learning algorithm to generate a hypothesis H_{c,n,i}, tracking the F1 score F_{c,n,i} and processing time T_{c,n,i} at each iteration and recording the best F1 score F1_{c,n,i} and corresponding processing time Time_{c,n,i}:

F1_{c,n,i} = max(F_{c,n,i})
Time_{c,n,i} = time until max(F_{c,n,i})

3. Aggregation: The information is then aggregated by complexity category and noise level over all the samples, averaging times and F1 scores, to obtain the complete graded evaluation statistics. For each combination of c ∈ C and n ∈ N, compute the average F1 score and average processing time:

F1_{c,n} = (1/S) Σ_{i=1..S} F1_{c,n,i}
Time_{c,n} = (1/S) Σ_{i=1..S} Time_{c,n,i}

4 Experiments

In order to answer the research questions, a set of experiments was elaborated to systematically analyse the theory-inducing capabilities of a set of the most popular open-source LLMs and two versions of GPT, with the proposed approach, having the state-of-the-art ILP system Popper (Cropper and Morel, 2021) as a baseline. The tests covered all data categories discussed in Section 2, allowing a graded analysis w.r.t. the expected level of induction complexity and tolerated noise.

4.1 Experimental Setup & Dataset

For each data category, five datasets were generated using RuDaS (Cornelio and Thost, 2021). The size of each dataset was set to XS (min = 50, max = 100, support = 3), and noise, missing, and open world were all set to 0.1, then all set to 0.2, and finally all set to 0.3. This resulted in 105 datasets in total, with 35 datasets for each rate. Subsequently, two methods were used to induce a theory for each dataset: (1) employing Popper, with NuWLS (Chu et al., 2023) and WMaxCDCL, varying its time limit parameter from 10 to 800 seconds; (2) applying the proposed iterative LM theory refinement method (Section 3), with parameters Maxiter = 4 and MTthresh = 1.0. Three different LLM models were used for (2): OpenAI's (https://openai.com) model GPT-4o (https://openai.com/index/hello-gpt-4o), Mistral AI's (https://mistral.ai) Mixtral-8x7B (Jiang et al., 2023; https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1), and Google's Gemma (Team et al., 2023; https://huggingface.co/google/gemma-7b-it).

Table 2 presents a comprehensive overview of statistical metrics pertaining to each category of data, in order of complexity (except MIXED). We computed the average F1-score for each category, taking into account the level of noise, open world scenarios, and missing facts.
The mean values reported are based on the results obtained from the theory that was generated from the train set and evaluated on the test set.

Categories    Facts         Positive      Negative
CHAIN         67.6 ± 5.4    23.6 ± 6.5    4.8 ± 2.1
CHAIN REC.    54.2 ± 14.1   19.6 ± 7.8    4.4 ± 2.5
RDG           60.2 ± 9.8    18.6 ± 12.3   4.2 ± 3.4
RDG REC.      63.2 ± 9.0    16.6 ± 4.7    3.4 ± 1.3
DRDG          61.2 ± 14.1   31.0 ± 26.0   8.0 ± 8.2
DRDG REC.     54.2 ± 12.5   34.0 ± 22.2   9.0 ± 6.2
MIXED         54.4 ± 18.1   24.2 ± 12.3   4.8 ± 2.7

Table 2: Statistics for each dataset category. A detailed description of each can be found in Section 2.

The experiments used the OpenAI service for the GPT models. For Popper, Llama3-8B-Instruct, Gemma-7B-It and Mixtral-8x7B-Instruct-v0.1, they were conducted on a computer with an Intel(R) Xeon(R) Gold 5217 CPU @ 3.00GHz, 188GB RAM, and 2x NVIDIA RTX A6000 (48GB VRAM) GPUs. The software used was CUDA 12.3, PyTorch 2.2.2, and Transformers 4.41.2. The prompt templates used are included in the supplementary material (Appendix C).

Figure 2: F1 score trends across categories. Different models (GPT-4o, Llama3 8b instruct, Popper, and Mixtral-8x7B-Instruct-v0.1) under varying noise levels and categories reveal distinct performance patterns. GPT-4o demonstrates stable accuracy yet sensitivity to noise, particularly in complex rule-based categories like RDG and DRDG. Mixtral-8x7B-Instruct-v0.1 exhibits mixed results with notable variability across categories, particularly in more complex tasks. Llama3 8b instruct struggles with low scores, indicating challenges in reasoning and theory generation.

Figure 3: Performance on time consumption trends across categories, using a logarithmic scale. The data consistently shows that the LLMs outperform Popper in all intervals. The results however do not represent a measure of efficiency, as the computational resources employed are vastly different across methods.

4.2 Results & Discussion

Overall results for F1 are presented in Figure 2. We additionally report on processing times as a measure of practical interest in Figure 3. We present the values obtained in detail in the supplementary material (Appendix B, Tables 4 and 5). Gemma-7B-It results are not included as it failed to generate valid theories. The results reveal significant insights into LLM capabilities and limitations w.r.t. theory induction, which are summarised as follows:

LLMs can achieve competitive performance against the baseline, especially at higher noise levels. The larger scale models (GPT-3.5, GPT-4o) demonstrate more resilience to noise and consistent F1 across the different categories, as indicated in Figure 2, with an average F1-score difference of ±0.25 against Popper. This answers the overall quality part of RQ1.
Inducing theories on long relation chains is the major obstacle for LLMs, rather than dependency complexity. The CHAIN category is the least complex and one of the most consistently solved by Popper, yet none of the tested LLMs was able to overcome the baseline performance at all noise levels (Figure 2, CHAIN category). This suggests that such models have a limited capacity for tracking long relationship chains of independent predicates. This is part of the answer to RQ2.

Increasing iteration limits does not monotonically improve results for LLMs. Upon increasing the iteration limits from 1 to 4, it was found that the metrics can either increase or decrease non-monotonically. Thus Maxiter was set to 4 and the iteration with the best accuracy is taken as the final result.

Performance is remarkably lower on complex rule sets at moderate noise levels. Responses for complex categories, such as RDG and DRDG, display higher variance and LLM hallucination artifacts, such as valid rules containing predicates that do not exist in the rule set. We present an error analysis in Section 4.3. For instance, a comparison of the results for the RDG category generated by GPT-4o under noise levels set to 0.1 and 0.3 reveals a significant decline in performance, with the F1-score dropping from 0.75 to 0.22. A comparable pattern is observed with GPT-3.5-Turbo for RDG and DRDG and with Mixtral for RDG in the presence of elevated noise levels, with GPT-3.5-Turbo scores going from 0.56 to 0.0 and 0.28 to 0.08, and Mixtral-8x7B going from 0.60 to 0.0. This complements the answer to RQ2. Further details are included in the supplementary material (Appendix B).

Induction capability varies substantially across models. Using the same inputs resulted in vastly different responses from different models, suggesting a critical influence of model size in parameters. Figure 2 illustrates this: when comparing GPT-4o, Mixtral-8x7B and Llama3 at noise levels set to 0.1 and 0.3 respectively, the consistency in generating a valid theory correlates with their relative size. At a noise level of 0.1, GPT-4o's F1 score is almost twice that of GPT-3.5-Turbo on average, and at a noise level of 0.3, the difference increases to a ratio of 4, indicating substantially higher noise resiliency. The performance gap is more pronounced when comparing with Llama3-8B, where GPT-4o's F1 score is 21 times higher at the lowest noise setting. Mixtral-8x7B-It-v0.1 performs similarly to GPT-3.5-Turbo at lower noise levels, scoring 13.4% higher on average at 0.1 noise. However, its performance becomes less stable at higher noise levels. It consistently outperforms Llama3-8B-it at 0.1 noise, with an F1-score 11 times higher on average.

Model size does not correlate with noise resilience. Despite being able to achieve higher scores than GPT-3.5 and Mixtral-8x7B in some of the tests (e.g., RDG-R @ noise = 0.1, CHAIN-R @ noise = 0.2) and scoring higher on intermediate noise than on low noise, Llama3-8B did not consistently generate valid theories. On the other hand, Mixtral-8x7B, a much larger model, is particularly susceptible to noise, with an average F1-score drop of over 0.8 from noise = 0.1 to noise = 0.2 and a monotonic performance reduction with the increase of the noise level.

Regarding the parameterisation of each method, some higher-level observations on the trade-off between time and inference quality can be summarised as follows:

Computational scalability allows LLMs to operate at substantially lower times. While Popper is a serial algorithm, the parallel nature of transformer-based LLMs allows them to operate at times about 3 orders of magnitude lower, given enough computational resources. This can be observed in Figure 3, for all complexity classes and all tested noise intervals.

4.3 Error analysis

The errors found in the evaluation of the generated theories can be separated into two categories: syntactic and logical.

Syntactic errors occur when the generated response does not match the logic programming language grammar. For example, the following response:

theory :-
    p(X, Y), pos(p0(X, Y)) - positive.
    p(X, Y), neg(p0(X, Y)) - negative.
    \+ p(X, Y), pos(p0(X, Y)) - false.
    \+ p(X, Y), neg(p0(X, Y)) - true.

is not valid Prolog and will fail evaluation.

Logical errors occur when the generated response has correct grammar, but cannot be induced from the examples. Consider the following Prolog theory:

theory :-
    p(X, Y) :- p1(X, Y); p3(X, Y);
        p4(X, Y); p7(X, Y); p8(X, Y);
        p0(X, Y),
    not neg(p(X, Y)),
    (pos(p(X, Y)) - true; fail).

The response contains the head of the clause "theory", as well as the predicates "p" and "pos", which do not exist in the BK. Table 3 presents a distribution of error categories for the analysed models. A more detailed analysis of the model outputs is included in the supplementary material (Appendix G).

Model          # Syntactic   Logical
GPT-4o         0%            100%
GPT-3.5        0%            100%
Llama3-8B      46%           54%
Mixtral-8x7B   20%           80%
Gemma-7B-it    100%          0%

Table 3: Error distribution for each of the evaluated models. Gemma-7B-it did not produce valid Prolog.

5 Related Work

Neural symbolic computation combines neural networks with symbolic methods to enhance AI reasoning capabilities. (Yang et al., 2017) introduced Neural Logic Programming, an end-to-end differentiable model integrating neural networks with logic programming. Within the LLM-Symbolic space, (Wan et al., 2024) developed LogicAsker, which evaluates and improves LLMs' logical reasoning using propositional and predicate logic. It identifies reasoning failures and enhances capabilities through in-context learning. Within the context of symbolic toolformers over LLMs, (Quan et al., 2024a,b) proposed methods for improving explanatory reasoning with the support of formal iterative cycles, using both logical solvers and theorem provers to support more controlled step-wise reasoning.

Despite these advancements at the interface of LLM-based reasoning and formal controls, it is unclear to what extent and under which conditions LLMs can perform formal reasoning (Huang and Chang, 2023). (Sinha et al., 2019) introduced CLUTRR, a benchmark assessing LLMs' structural learning by inferring kinship relations in stories, requiring relationship extraction and logical rule inference. (Zhu et al., 2024) proposed the Hypotheses-to-Theories (HtT) framework to improve LLM reasoning by learning explicit rules in two stages: generating and verifying rules (induction) and using the obtained rule library for reasoning (deduction). HtT enhances relational and numerical reasoning and concept learning.

(Madaan et al., 2023) introduced a technique for improving machine learning models through iterative refinement. This approach allows models to improve their performance by continuously evaluating and adjusting their predictions based on self-generated feedback. By critiquing their own outputs, models can identify errors and make corrections over successive iterations, leading to increased accuracy and robustness across different tasks. Our work builds upon this approach by employing a formal method to evaluate and then refine the induced output.

In a related study, (Dziri et al., 2023), the authors investigate the limitations of transformer models in handling composition tasks. Their results show that, despite their strengths, transformers face significant challenges in dealing with compositionality, which involves understanding and generating complex structures from simpler components. This limitation highlights the need for innovative approaches, such as self-refinement, to further enhance the capabilities of machine learning models.

In contrast, our work focuses on the still under-explored area of assessing and controlling the inductive learning/inference capabilities of LLMs. These contributions integrate LLMs and formal logic for robust theory induction and allow a graded analysis of LLM capabilities with respect to theory induction complexity.

6 Conclusion

In this study we thoroughly investigate the integration of state-of-the-art formal theory induction within the context of large language models (LLMs), aiming to elucidate the extent to which LLMs can systematically perform inductive learning for theories spanning different complexity levels. At the heart of this exploration lies the recognition of relational data's inherent semantic depth, stemming from its symbolic representations. The empirical results presented here indicate the ability of LLMs to address inductive learning tasks, with the largest LLMs achieving competitive results against the algorithmic SOTA with better tolerance to higher noise levels, which can be attributed to their semantic flexibility. This flexibility however has certain limitations, as we found that the tested language models are more limited by their capacity for tracking long relationship chains of independent predicates than by the dependency complexity of the rule sets (disjunctiveness, recursivity).

As future work we plan to utilise larger datasets to test the scalability of the proposed approach, allowing researchers to assess its performance across a broader range of scenarios. Additionally, it is worth considering the integration of the LLM's output as an initial input for the ILP process, potentially leveraging the strengths of both approaches to overcome their respective limitations. Another avenue is the use of ILP intermediate steps, such as the bottom clause, to help the LLM induce a theory.
7 Limitations

While the proposed evaluation methodology aims to cover a wide range of logic theory induction complexity, it is limited in its resolution to the categories specified by (Cornelio and Thost, 2021), and does not quantify other ruleset characteristics, such as workspace size or unification rate in the case of Prolog (Dikovsky, 1993).

The methodology compares all models under the same inputs. Therefore it is not concerned with extracting maximum performance from any given model, but with obtaining a relative assessment of their fundamental capabilities. This means that the scores reported in the empirical analysis should not be taken as a measure of SOTA performance.

Furthermore, the time gains demonstrated in the experiments are presented as an achievable result, conditioned on the combination of software and hardware indicated in the paper and the services provided by third parties (e.g., OpenAI). They should not be interpreted as a measure of computational efficiency.

Acknowledgements

This work was partially funded by the Swiss National Science Foundation (SNSF) project NeuMath (200021_204617) and by the Manchester Experimental Cancer Medicine Centre and the NIHR Manchester Biomedical Research Centre.

References

Yi Chu, Shaowei Cai, and Chuan Luo. 2023. NuWLS: Improving local search for (weighted) partial MaxSAT by new weighting techniques. Proceedings of the AAAI Conference on Artificial Intelligence, 37(4):3915–3923.

Cristina Cornelio and Veronika Thost. 2021. Synthetic datasets and evaluation tools for inductive neural reasoning. In Proceedings of the 30th International Conference on Inductive Logic Programming, ILP2020-21 @ IJCLR.

Andrew Cropper and Rolf Morel. 2021. Learning programs by learning from failures. Machine Learning, 110(4):801–856.

A. Ja. Dikovsky. 1993. On computational complexity of Prolog programs. Theoretical Computer Science, 119(1):63–102.

Nouha Dziri, Ximing Lu, Melanie Sclar, Xiang Lorraine Li, Liwei Jiang, Bill Yuchen Lin, Peter West, Chandra Bhagavatula, Ronan Le Bras, Jena D. Hwang, et al. 2023. Faith and fate: Limits of transformers on compositionality. arXiv preprint arXiv:2305.18654.

Jie Huang and Kevin Chen-Chuan Chang. 2023. Towards reasoning in large language models: A survey. In Findings of the Association for Computational Linguistics: ACL 2023, pages 1049–1065, Toronto, Canada. Association for Computational Linguistics.

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. 2023. Mistral 7B. arXiv preprint arXiv:2310.06825.

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Sean Welleck, Bodhisattwa Prasad Majumder, Shashank Gupta, Amir Yazdanbakhsh, and Peter Clark. 2023. Self-refine: Iterative refinement with self-feedback. Preprint, arXiv:2303.17651.

Stephen Muggleton. 1991. Inductive logic programming. New Generation Computing, 8:295–318.

Shan-Hwei Nienhuys-Cheng and Roland de Wolf. 1997. What is inductive logic programming? Springer.

Xin Quan, Marco Valentino, Louise A. Dennis, and André Freitas. 2024a. Enhancing ethical explanations of large language models through iterative symbolic refinement. Preprint, arXiv:2402.00745.

Xin Quan, Marco Valentino, Louise A. Dennis, and André Freitas. 2024b. Verification and refinement of natural language explanations through LLM-symbolic theorem proving. Preprint, arXiv:2405.01379.

Koustuv Sinha, Shagun Sodhani, Jin Dong, Joelle Pineau, and William L. Hamilton. 2019. CLUTRR: A diagnostic benchmark for inductive reasoning from text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4506–4515, Hong Kong, China. Association for Computational Linguistics.

Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, et al. 2023. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.

Yuxuan Wan, Wenxuan Wang, Yiliu Yang, Youliang Yuan, Jen-tse Huang, Pinjia He, Wenxiang Jiao, and Michael R. Lyu. 2024. A & B == B & A: Triggering logical reasoning failures in large language models. CoRR, abs/2401.00757.

David S. Warren, Veronica Dahl, Thomas Eiter, Manuel V. Hermenegildo, Robert Kowalski, and Francesca Rossi. 2023. Prolog: The Next 50 Years, volume 13900. Springer Nature.
Fan Yang, Zhilin Yang, and William W. Cohen. 2017. Differentiable learning of logical rules for knowledge base reasoning. Advances in Neural Information Processing Systems, 30.

Zhaocheng Zhu, Yuan Xue, Xinyun Chen, Denny Zhou, Jian Tang, Dale Schuurmans, and Hanjun Dai. 2024. Large language models can learn rules. Preprint, arXiv:2310.07064.

Appendices

A Further theoretical background

A.1 Detailed Complexity Classes

Category Chain. In this category, every rule, except the root, deduces facts relevant to precisely one other rule. Essentially, each node has at most one parent, and each rule is associated with at most one other rule that might infer relevant facts. Recursive rules, where the predicate in the head also occurs in the body, are exceptions, as they are relevant both for themselves and one additional rule. For example:

p5(X, Y) :- p0(X, Z), p2(Y, W).
p0(X, Y) :- p3(X, Z), p4(W, Y).
p3(X, Y) :- p6(X, Z), p7(W, Y).

According to Rule 1, p0(X, Z) is necessary for p5(X, Y). Therefore, satisfying p5(X, Y) requires p0(X, Z), which in turn requires p3(X, Z) and p4(W, Y). This creates a dependency chain where p5(X, Y) relies on p3(X, Z) and p4(W, Y).

Category Rooted DG (RDG). This category generalises the Chain category. Here, every rule can be relevant for several others, and each node can have multiple parent nodes. Furthermore, for each rule, there may be several other rules that might infer facts relevant for it. However, for each predicate occurring in the body of a rule, there must be at most one other rule with this predicate in the head. In other words, there are no alternative rules to derive facts relevant for a rule with respect to a specific body atom. For example:

p0(X0,X1) :- p1(X1,X2), p3(X0,X1).
p3(X0,X1) :- p8(X0,X1), p6(X0,X1).
p1(X1,X2) :- p7(X2,X1).

In the given example, each rule has at least one child node. For instance, p0(X0, X1) has two child nodes: p1(X1, X2) and p3(X0, X1). Each predicate in the body of a rule corresponds to at most one rule with that predicate in the head. There are no alternative rules for deriving facts related to a specific body atom. For example, p1(X1, X2) appears in the body of p0(X0, X1) and only one rule has p1(X1, X2) in the head: p1(X1, X2) :- p7(X2, X1). The same applies to p3.

Category Disjunctive Rooted DG (DRDG). Category DRDG generalises Category RDG by allowing for alternative rules, represented as children of an "OR" node. For instance:

p7(X0,X1) :- p5(X0,X1).
p5(X0,X1) :- p0(X0,X1).
p5(X0,X1) :- p8(X1,X0).

In the example, the first rule states p7(X0, X1) is true if p5(X0, X1) is true, indicating p7 depends on p5. The second rule states p5(X0, X1) is true if p0(X0, X1) is true, showing p5 depends on p0. The third rule states p5(X0, X1) is true if p8(X1, X0) is true, adding an alternative condition with swapped arguments. Thus, p5 acts as an "OR" condition in the first rule's body and the second and third rules' heads.

Category Mixed. A rule graph in this category contains connected components of the different categories mentioned above. Additionally, recursion is allowed, meaning that the head of a rule may appear in the body as well.

B Further empirical data & findings

GPT-4o shows stable performance with moderate to high accuracy but is sensitive to noise, especially in RDG and DRDG. For instance, its F1 score in RDG drops from 0.75 at noise level 0.1 to 0.25 at noise level 0.3. GPT-3.5-Turbo does not perform well with complex categories like RDG and DRDG under noise, with an F1 score of 0 at noise level 0.3 in RDG.

Mixtral-8x7B-Instruct-v0.1 shows high variability w.r.t. noise, performing reasonably well in RDG (0.64 F1 at noise level 0.1, dropping to 0.43 at noise level 0.3) but with significant time consumption, especially in DRDG (130.16 seconds at noise level 0.2). It does not perform well with complex rule sets like DRDG across all noise levels.

Llama3-8B instruct has low accuracy across most categories, with slight improvement at higher noise levels but increased time consumption. At noise level 0.1, it achieves an F1 score above 0 only in RDG R.
Category Noise 0.1 Noise 0.2 Noise 0.3
F1 Time (s) F1 Time (s) F1 Time (s)
MIXED 0.51 1071.53 0.44 817.15 0.35 2418.88
CHAIN 0.80 1397.35 0.46 1254.33 0.57 3824.31
CHAIN R. 0.77 1123.25 0.41 1190.14 0.14 3646.43
RDG 0.74 1122.13 0.57 854.85 0.50 2460.90
RDG R. 0.71 1523.37 0.68 940.50 0.38 1659.27
DRDG 0.77 934.98 0.41 1089.47 0.25 1363.27
DRDG R. 0.68 927.30 0.48 882.28 0.26 820.51

Table 4: Results for different categories with theory induced by Popper and different noise levels.

Category GPT-4o - noise 0.1 GPT-4o - noise 0.2 GPT-4o - noise 0.3
F1 (avg) Time (s) F1 (avg) Time (s) F1 (avg) Time (s)
MIXED 0.70 8.57 0.52 10.73 0.62 9.81
CHAIN 0.52 8.54 0.42 11.05 0.35 8.29
CHAIN R. 0.72 11.48 0.53 8.86 0.49 8.13
RDG 0.75 8.80 0.50 11.94 0.22 20.16
RDG R. 0.74 10.83 0.55 7.35 0.49 10.79
DRDG 0.46 12.59 0.39 16.44 0.42 11.14
DRDG R. 0.83 13.11 0.32 13.29 0.12 8.45
Category GPT-3.5-Turbo - noise 0.1 GPT-3.5-Turbo - noise 0.2 GPT-3.5-Turbo - noise 0.3
F1 (avg) Time (s) F1 (avg) Time (s) F1 (avg) Time (s)
MIXED 0.20 4.32 0.35 4.11 0.32 4.20
CHAIN 0.24 3.47 0.33 7.88 0.11 3.13
CHAIN R. 0.545 2.592 0.00 7.11 0.00 3.05
RDG 0.56 4.22 0.29 3.19 0.00 3.76
RDG R. 0.20 3.48 0.02 4.73 0.00 3.98
DRDG 0.28 4.63 0.22 8.91 0.08 3.96
DRDG R. 0.31 5.01 0.15 3.06 0.01 11.14
Category Llama3-8B-it - noise 0.1 Llama3-8B-it - noise 0.2 Llama3-8B-it - noise 0.3
F1 (avg) Time (s) F1 (avg) Time (s) F1 (avg) Time (s)
MIXED 0.00 62.54 0.00 176.31 0.02 51.55
CHAIN 0.00 38.86 0.21 12.84 0.06 29.31
CHAIN R. 0.00 31.39 0.32 34.90 0.08 57.75
RDG 0.00 41.42 0.00 18.42 0.07 55.80
RDG R. 0.21 20.86 0.20 36.45 0.00 40.70
DRDG 0.00 76.70 0.08 45.48 0.04 60.10
DRDG R. 0.00 25.96 0.00 43.88 0.00 14.61
Category Mixtral-8x7B-It-v0.1 - noise 0.1 Mixtral-8x7B-It-v0.1 - noise 0.2 Mixtral-8x7B-It-v0.1 - noise 0.3
F1 (avg) Time (s) F1 (avg) Time (s) F1 (avg) Time (s)
MIXED 0.36 34.04 0.21 65.83 0.20 71.60
CHAIN 0.49 50.07 0.00 45.15 0.16 45.02
CHAIN R. 0.47 29.56 0.00 87.16 0.08 69.38
RDG 0.60 75.42 0.05 35.68 0.00 101.54
RDG R. 0.48 74.58 0.00 123.80 0.00 69.17
DRDG 0.10 90.51 0.28 130.16 0.12 82.51
DRDG R. 0.15 60.61 0.00 97.11 0.00 64.92

Table 5: Performance metrics for various categories under different noise conditions.

It often fails to produce valid theories or introduces new predicates incorrectly. For instance, the rule p(X, Y) :- p2(X, Y); p0(X, Y); p4(X, Y); p9(X, Y). is valid, but the predicate p, in the head of the rule, does not exist in the BK, nor is it the target predicate. Llama3-8B was the only model to exhibit this pattern.

The models generally present higher initial accuracy on recursive (R) categories, but are more sensitive to noise on them, leading to larger performance drops. For example, on DRDG-R, GPT-4o's F1 score drops from 0.83 at noise level 0.1 to 0.12 at noise level 0.3. Non-recursive categories like CHAIN present more stable performance.

Time variance has an inverse relation w.r.t. noise levels on LLMs when compared with Popper. As the noise increases, Popper may take less time to induce a theory based on the remaining relevant data, as indicated by the scattering pattern progression in Figure 4, while the LLMs are more likely to take longer to process it. Detailed values are presented in Tables 4 and 5.
Figure 4: Relationship between the F1 score and the logarithm of processing time (in seconds) for five different models (Popper, GPT-4o, GPT-3.5-turbo, Llama3, and Mixtral) across three noise levels (0.1, 0.2, and 0.3) and each rule set category: CHAIN (•), CHAIN R. (□), RDG ( ), RDG R. (+), DRDG (⋆), DRDG R. (♢), and MIXED (X). Each subplot corresponds to a different noise level, showing how each model's performance and processing time vary with increasing noise. While Popper always takes more time to develop a theory, the other two levels (0.5-1.0, 1.0-2.5) correspond to different execution environments. Time variance changes in opposite ways w.r.t. noise on Popper vs. the LLMs.

C Prompt templates

For the iterative LM theory refinement method, the following template was used for the initial prompt:

Prompt p1:
    Induce a theory based on background knowledge, positive and negative examples. Write it in Prolog. Do not give an explanation. Answer only the theory.

    BK:
    {BK}

    Examples:
    {positive and negative examples}

For the refinement steps, the following prompt template was used:

Prompt p2 ... pn:
    The theory scored:
    accuracy = {acc} precision = {precision} recall = {recall} f1 = {f1}

    and got these examples wrongly:
    {examples that were misclassified}

    Refine the theory. Answer only the theory.

The prompt templates were designed to be objective and minimal, containing only the necessary instructions and data for solving the task. They were adjusted using a small sample of inputs, to minimise syntax errors across all models. The same prompt templates were used across all language models.
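The templates above can be filled programmatically. The following is our own sketch, not the authors' code: the template wording mirrors Prompt p1 and p2 above, while the formatting helpers and the inclusion of the current theory (as described in Section 3.1) are assumptions.

# Sketch (ours) of assembling the Appendix C prompts.
THEORY_PROMPT = (
    "Induce a theory based on background knowledge, positive and negative examples. "
    "Write it in Prolog. Do not give an explanation. Answer only the theory.\n\n"
    "BK:\n{bk}\n\nExamples:\n{examples}\n"
)

REFINEMENT_PROMPT = (
    "The theory scored:\n"
    "accuracy = {acc} precision = {precision} recall = {recall} f1 = {f1}\n\n"
    "and got these examples wrongly:\n{wrong}\n\n"
    "Refine the theory. Answer only the theory.\n"
)

def make_theory_prompt(bk, e_pos, e_neg):
    # example formatting as pos(...)/neg(...) follows the dataset style in Figure 1
    examples = "\n".join([f"pos({e})." for e in e_pos] + [f"neg({e})." for e in e_neg])
    return THEORY_PROMPT.format(bk="\n".join(f"{c}." for c in bk), examples=examples)

def make_refinement_prompt(bk, theory, results):
    # Section 3.1 states the refinement input combines the KB sample, the
    # current theory and the interpreter feedback; the exact layout is assumed.
    header = ("BK:\n" + "\n".join(f"{c}." for c in bk) + "\n\n"
              "Theory induced by LLM:\n" + "\n".join(f"{c}." for c in theory) + "\n\n")
    return header + REFINEMENT_PROMPT.format(
        acc=results.get("accuracy", "n/a"),
        precision=results["precision"], recall=results["recall"], f1=results["f1"],
        wrong="\n".join(f"pos({e})." for e in results["wrong"]),
    )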
D Reproducibility

Upon acceptance of this paper, we will release all code and data associated with our study. The raw data, along with detailed instructions for data processing, are accessible via the provided repository link. Any proprietary tools or materials used in this study are either commercially available or provided under a reasonable request. By ensuring that all aspects of this research are openly accessible, we invite the scientific community to replicate our findings and build upon this work, fostering a collaborative and transparent scientific environment.

E Dataset Generation Process

Below, we provide a detailed description of each of the parameters used for the dataset generation process, along with their specific configurations.

Parameters

• mindags:
  – Definition: Minimum number of generated DAGs. This parameter ensures that at least the specified number of DAGs is generated in the dataset.
  – Constraint: Must be greater than 0.

• maxdags:
  – Definition: Maximum number of generated DAGs. This parameter sets an upper limit on the number of DAGs to be included in the dataset.
  – Constraint: Must be greater than or equal to mindags.

• noise:
  – Definition: Represents the percentage of noise in the datasets. Noise here refers to random perturbations added to the data.
  – Constraint: Must be a value in the range [0, 1].

• owa (Open World):
  – Definition: The open-world degree indicates how many of the consequences of an initial set of relevant facts, called support facts, are missing from the dataset. In other words, it indicates the percentage of consequences missing in the dataset. This parameter simulates incomplete data scenarios by randomly omitting a portion of the data.
  – Constraint: Must be a value in the range [0, 1].

• missing:
  – Definition: Specifies the percentage of missing data in the dataset.
  – Constraint: Must be a value in the range [0, 1].

• category:
  – Definition: Determines the type of the rules to be generated. The categories include different structural patterns and combinations.
  – Values:
    * Chain
    * Rooted Directed Graph (DG)
    * Disjunctive Rooted DG
    * Mixed
    * All of them with recursion

Illustrative Example: Family Knowledge Base

To illustrate the dataset's construction, consider a simple example representing a small knowledge base about familial relationships. Here, the fact parent(john, mary). denotes that John is the parent of Mary. A corresponding rule might be expressed as: ∀X, Y (parent(X, Y) → ancestor(X, Y)).

The initial dataset could be represented as follows:

% Facts
parent(john, mary).
parent(mary, susan).
parent(john, michael).
parent(michael, robert).

% Rules
ancestor(X,Y) :- parent(X,Y).
ancestor(X,Y) :- parent(X,Z), ancestor(Z,Y).

Scenario 1: Missing Data

Assume that 20% of the data is missing. The dataset would then be:

% Facts with 20% missing
parent(john, mary).
% parent(mary, susan). % This fact is missing
parent(john, michael).
parent(michael, robert).

% Rules
ancestor(X,Y) :- parent(X,Y).
ancestor(X,Y) :- parent(X,Z), ancestor(Z,Y).

Scenario 2: Noisy Data

Alternatively, if 20% of the data contains noise, the dataset might appear as follows:

% Facts with 20% noise
parent(john, mary).
parent(mary, susan).
parent(john, michael).
parent(michael, robert).
% Below is a noisy fact
parent(michael, alice).

% Rules
ancestor(X,Y) :- parent(X,Y).
ancestor(X,Y) :- parent(X,Z), ancestor(Z,Y).

This methodical approach to dataset generation allows us to simulate a wide range of real-world conditions, providing a robust foundation for analysing the effects of noise, missing data, and structural variations on the performance of our experiments.
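The effect of a missing support fact on entailment can be reproduced directly with a Prolog interpreter. The following is our own sketch (assuming SWI-Prolog and the pyswip bindings), using the family knowledge base above.

# Sketch (ours): which ancestor/2 facts are entailed by the family KB,
# with and without the "missing" parent(mary, susan) fact.
from pyswip import Prolog

def entailed_ancestors(facts):
    pl = Prolog()
    # clear clauses from any previous call (pyswip shares one global engine)
    list(pl.query("retractall(parent(_, _))"))
    list(pl.query("retractall(ancestor(_, _))"))
    pl.assertz("ancestor(X,Y) :- parent(X,Y)")
    pl.assertz("ancestor(X,Y) :- parent(X,Z), ancestor(Z,Y)")
    for f in facts:
        pl.assertz(f)
    return {(str(r["X"]), str(r["Y"])) for r in pl.query("ancestor(X, Y)")}

full = ["parent(john, mary)", "parent(mary, susan)",
        "parent(john, michael)", "parent(michael, robert)"]
incomplete = [f for f in full if f != "parent(mary, susan)"]

print(entailed_ancestors(full))        # includes ('john', 'susan') via the recursive rule
print(entailed_ancestors(incomplete))  # ('mary','susan') and ('john','susan') are no longer entailed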
F LLMs

Table 6 provides a summary of the main information about the models used in this study.

G Models Output

The GPT-4o and GPT-3.5-turbo models have been demonstrated to consistently generate valid theories, thereby ensuring the successful execution of the Prolog code they produce. To illustrate, the following displays a theory induced by GPT-4o.

p10(A,B) :- p8(A,B).
p10(A,B) :- p1(A,B).
p10(A,B) :- p7(A,B).

Nevertheless, a recurrent pattern has been identified in the theories generated by GPT, namely the rewriting of rules in which the variables are interchanged. To illustrate, the following example is provided.

p1(A, B) :- p2(A, B).
p1(A, B) :- p2(B, A).

Furthermore, GPTs, particularly GPT-4o, are highly effective at identifying the relevant predicates for a given theory, disregarding the noise, i.e. the irrelevant facts added on purpose to the dataset. The following is an exemplar rule:

p1(X0,X1) :- p2(X0,X1).
p2(X0,X1) :- p4(X1,X2), p0(X0,X2).

The initial prompt identifies the predicate p2:

p1(A, B) :- p2(A, B).

The refinement identifies the predicate p0:

p1(A, B) :- p2(A, B).
p1(A, B) :- p0(A, B).

Subsequently, the predicate p4 is identified:

p1(A, B) :- p2(A, B).
p1(A, B) :- p0(A, B).
p1(A, B) :- p4(A, B).

However, the same degree of precision could not be obtained from Llama3-8B, which may not consistently generate Prolog code that adheres to the necessary syntactic or logical constraints, potentially leading to errors during execution. To illustrate, consider the following Prolog theory:

theory :-
    p(X, Y) :- p1(X, Y); p3(X, Y);
        p4(X, Y); p7(X, Y); p8(X, Y);
        p0(X, Y),
    not neg(p(X, Y)),
    (pos(p(X, Y)) - true; fail).

The model autonomously created the head of the clause "theory", as well as the predicates "p" and "pos", which should not exist.

Additionally, Mixtral demonstrated satisfactory performance, although it exhibited a proclivity to insert the theory at the outset of the output. Although the output was generally valid, the quality of the generated theories was not as robust as that of GPT-4o, particularly in more intricate recursive scenarios such as RDG and DRDG. It was also able to identify the relevant predicates, but their arrangement was not optimal. For example, in the same example as GPT-4o, the correct predicates were identified, but their arrangement was not optimal:

p1(X,Y) :-
    p0(X,Y),
    \+ p2(X,Y),
    \+ p4(X,Y).

Furthermore, this model produces an excessive number of rules, particularly in more intricate rule sets such as RDG, DRDG, and MIXED, both recursive and non-recursive. It also introduces pos or neg predicates that are erroneous and should not exist, in 20% of the results in these categories. The following example demonstrates a theory generated by Mixtral-8x7B with these issues:

theory :-
    dif(X, Y),
    p3(X, Y), p5(Y, Y), p7(Y, _),
    \+ pos(p2(X, Y)),
    \+ neg(p2(X, Y)),
    asserta(pos(p2(X, Y))).

theory :-
    dif(X, Y),
    p4(X, Y), p5(Y, Y), p7(Y, _),
    \+ pos(p2(X, Y)),
    \+ neg(p2(X, Y)),
    asserta(pos(p2(X, Y))).
Model Maintainer Parameters Hidden dim. # hidden layers Context size
LLaMA3-8B Instr. Meta-Llama 8B 4096 32 8K
Mixtral-8x7B Instr. Mistral AI 46.7B (sparse) 4096 32 32K
Gemma-7B-IT Google 7B 3072 28 8K
GPT-3.5 Turbo OpenAI – – – 16K
GPT-4o OpenAI – – – 128K

Table 6: Main Information about the models evaluated in this study. All models tested are auto-regressive decoder-
only, with Mixtral-8x7B Instruct being a Sparse Mixture of Experts (SMoE). The original, non-quantised versions
were used.

theory :-
    dif(X, Y),
    p6(X, Y), p5(Y, Y), p7(Y, _),
    \+ pos(p2(X, Y)),
    \+ neg(p2(X, Y)),
    asserta(pos(p2(X, Y))).

theory :-
    dif(X, Y),
    p10(X, Y), p5(Y, Y), p7(Y, _),
    \+ pos(p2(X, Y)),
    \+ neg(p2(X, Y)),
    asserta(pos(p2(X, Y))).

theory :-
    X \= Y,
    p5(X, X),
    \+ pos(p2(X, Y)),
    \+ neg(p2(X, Y)),
    asserta(pos(p2(X, X))).

Finally, Gemma 7B did not produce a valid theory. While Llama-3-8B does not entirely conform to the characteristics of a theory, it nonetheless approximates a theory to a certain degree. In comparison, Gemma 7B's output lacked the elements necessary to be considered even a preliminary valid theory. The following is an example of Gemma's output.

**Theory:**

The facts in the knowledge base indicate that the predicate p9(cX, cY) is true for the following pairs of facts:

- c41 and c17
- c13 and c52
- c54 and c7
- c62 and c61
- c71 and c75
- c24 and c48
- c79 and c67
- c50 and c46
- c70 and c60
- c52 and c51
- c81 and c71
- c2 and c14
- c30 and c44
- c78 and c72
- c81 and c35

However, the predicate p9(c55, c48) is false.