A Critical Review of Inductive Logic Programming Techniques For Explainable AI
IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 35, NO. 8, AUGUST 2024
Abstract— Despite recent advances in modern machine learning algorithms, the opaqueness of their underlying mechanisms continues to be an obstacle in adoption. To instill confidence and trust in artificial intelligence (AI) systems, explainable AI (XAI) has emerged as a response to improve modern machine learning algorithms' explainability. Inductive logic programming (ILP), a subfield of symbolic AI, plays a promising role in generating interpretable explanations because of its intuitive logic-driven framework. ILP effectively leverages abductive reasoning to generate explainable first-order clausal theories from examples and background knowledge. However, several challenges in developing methods inspired by ILP need to be addressed for their successful application in practice. For example, the existing ILP systems often have a vast solution space, and the induced solutions are very sensitive to noise and disturbances. This survey paper summarizes the recent advances in ILP and provides a discussion of statistical relational learning (SRL) and neural-symbolic algorithms, which offer synergistic views to ILP. Following a critical review of the recent advances, we delineate observed challenges and highlight potential avenues of further ILP-motivated research toward developing self-explanatory AI systems.

Index Terms— Differentiable inductive logic programming (∂ILP), explainable artificial intelligence (XAI), ILP, machine learning, meta-interpretive learning (MIL), neuro-symbolic AI, probabilistic ILP (PILP), statistical relational learning (SRL).

Manuscript received 24 November 2021; accepted 6 February 2023. Date of publication 5 April 2023; date of current version 6 August 2024. (Corresponding author: Bo Liu.) Zheng Zhang and Levent Yilmaz are with the Department of Computer Science and Software Engineering, Auburn University, Auburn, AL 36839 USA. Bo Liu is with the Department of Computer Science, Auburn University, Auburn, AL 36839 USA (e-mail: [email protected]). Digital Object Identifier 10.1109/TNNLS.2023.3246980

I. INTRODUCTION

EXPLAINABILITY has become an important research area to overcome the challenges in addressing the complexity and understandability of artificial intelligence (AI) systems [1]. Explainable AI (XAI) is an emerging field [2] in machine learning that refers to methods and techniques that enable experts to understand the decisions made by AI algorithms. The U.S. Defense Advanced Research Projects Agency describes three AI explainability criteria: prediction accuracy, decision understanding or trust, and traceability [3]. Prediction accuracy refers to explaining how specific conclusions are derived, while decision understanding involves fostering trust in the underlying mechanisms and processes. Traceability involves inspection of the actions and their cause-effect relations to improve causal and evidential reasoning and to immerse into the decision loops to influence tasks as needed. According to these criteria, an AI system should perform a specific task or recommend decisions and produce an explainable characterization of why it renders specific decisions along with the supporting rationale.

Recently, with the success of precise but largely inscrutable deep learning models, explainability has received significant attention [4]. The opaqueness of deep learning algorithms motivates AI researchers to open the black box to bring transparency and to avoid reliance only on model accuracy [1]. Explanations are essential for the explainability of a machine-learned model. Symbolic AI is the overarching framework for AI methods that are based on high-level "symbolic" (human-readable) representations of problems, plans, and solutions. Symbolic AI requires a modest amount of training and can achieve reasonable performance with limited data. Symbolic methods also grant the benefits of explainability and generalization at the concept level through inferential and inductive reasoning while offering connections to established planning algorithms [5]. Fig. 1 shows a flowchart of the symbolic AI pipeline: by learning from symbolic background knowledge and examples (positive and negative), the symbolic model can elaborate and revise logic frames in an explainable manner while enabling the provision of valid and provable answers for symbolic queries. The user could interact with the trained system as follows: the system can take action for the current task and provide an explanation to the user that justifies its action; the user can then make a decision based on the explanation and improve the knowledge base (KB) as well as the training process.

As symbolic AI achieves unprecedented impact on problem-solving via search, planning, and decision-making, one of the hallmarks of symbolic AI is inductive logic programming (ILP) [6]. Besides explainability, ILP systems are also data-efficient, whereas most machine learning models require large quantities of data to reach acceptable accuracy during learning. ILP, as a knowledge-based strategy, can provide a symbolic, goal-driven, and causal interpretation of data.

However, there are also some drawbacks to these symbolic approaches: they are inherently brittle and do not deal well with noise, it is not easy to express complex nonlinear decision surfaces in logic [7], and the guiding method for hypothesis search is questionable [8]. To improve the efficacy
2162-237X © 2023 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: Birla Inst of Technology and Science Pilani Dubai. Downloaded on September 02,2024 at 10:43:04 UTC from IEEE Xplore. Restrictions apply.
ZHANG et al.: CRITICAL REVIEW OF ILP TECHNIQUES FOR EXPLAINABLE AI 10221
of ILP methods and mitigate the challenges, several types of ILPs have been developed: the meta-interpretive learning (MIL) framework [9] has made progress in predicate invention and recursive generalizations using abduction in terms of a meta-interpreter; probabilistic ILP (PILP), introduced by De Raedt and Kersting [10], is more powerful than ILP and, in turn, than traditional attribute-value approaches due to its ability to deal explicitly with uncertainty; and differentiable ILP (∂ILP) [11] augments ILP with neural networks, making the system robust to noise and error in the training data that ILP cannot cope with alone.

The rest of this article is organized based on the flowchart of the symbolic AI pipeline. In Section II, related works in the extant literature regarding ILP are reviewed. Section III discusses three types of ILP, including PILP, MIL, and ∂ILP. Section IV introduces symbolic-based frameworks, including statistical relational learning (SRL) and neural-symbolic AI (NeSy), and their relation. In Section V, we present a user-centered explanation and a measurement of trust appropriate for ILP. The experimental evaluations of ILP systems are depicted in Section VI. Section VII delineates the challenges and highlights future research directions for academics and domain experts, and Section VIII concludes this article.

II. BACKGROUND

ILP, a classical rule-based system, is a subfield of symbolic AI that uses logic programming as a uniform representation for examples, background knowledge, and hypotheses. An ILP system will derive a hypothesized logic program, which entails all the positive and none of the negative examples, given an encoding of the general background knowledge and a set of examples representing a logical database of facts.

ILP was first introduced in [12]. The successes of ILP have been in the area of inductive construction of expert systems: MYCIN [13] and XCON [14] are built using hand coding of rules; GASOIL [15] and BMT [16] are built using software derived from Quinlan's [17] inductive decision tree building algorithm ID3. Along with the successes of this technology, the following limitations have become apparent: 1) propositional-level systems cannot be used in areas requiring essentially relational knowledge representations; 2) inability to make use of background knowledge when learning; and 3) the systems construct hypotheses within the limits of a fixed vocabulary of propositional attributes [12].

Later, Plotkin [18] and Shapiro [19] introduce computer-based inductive systems within the framework of full first-order logic (FOL) to solve these problems. Plotkin's significant contributions are as follows: 1) the introduction of relative subsumption, a relationship of generality between clauses; and 2) the inductive mechanism of relative least general generalization (RLGG). Shapiro [19] investigates an approach to Horn clause induction in which the search for hypotheses is from general to specific, rather than Plotkin's specific-to-general approach. These all become the foundation of ILP. Later, Sammut and Banerji [20] describe a system called MARVIN, which generalizes a single example at a time regarding a set of background clauses. Quinlan [21] has described a highly efficient program called Foil, which relies on a general-to-specific heuristic search guided by an information criterion related to entropy. Muggleton et al. [22] apply a "determinate" restriction to hypotheses, and their learning program, Golem, has been demonstrated [22] to have a level of efficiency similar to Quinlan's Foil but without the accompanying loss of scope. In the late 1980s, right before ILP appears, Muggleton and Buntine [23] present inverse resolution (IR) and implement it in Cigol [23]. They also introduce a mechanism for automatically inventing and generalizing first-order Horn clause predicates.

ILP produces a widely used technology with a firm theoretical foundation based on principles from both logic and statistics. On the side of statistical justification of hypotheses, Muggleton et al. [24] discuss the possible relationship between algorithmic complexity theory and probably approximately correct (PAC) learning; in terms of logic, Muggleton provides a unifying framework for IR and RLGG by rederiving RLGG in terms of IR. Great progress has been made since the birth of ILP. Muggleton introduces more comprehensive approaches to the FOL inductive theory: inverting implication and inverse entailment. Kietz [25] proves that ILP generally is not PAC-learnable. Nienhuys-Cheng and De Wolf [26] show the theoretical basis of ILP. Besides theoretical works, some ILP algorithms and implementations have also been developed: Progol is Stephen Muggleton's implementation of ILP that
combines "inverse entailment" with the "general-to-specific search" through a refinement graph; Golem, developed by Muggleton et al. [22], is based on RLGG; Lavrac et al. present an ILP system called LINUS based on propositional logic; and CLINT uses queries to eliminate irrelevant literals and raises the generality of hypotheses by proposing more complex hypothesis clauses. Along with the era of machine learning and data-driven AI, ILP naturally becomes an important part of the ML model, because the learned hypotheses of ILP are represented in a symbolic form. They are inspectable by humans and, therefore, provide transparency and comprehensibility of the machine-learned classifiers.

We conclude our introduction with Fig. 2, which depicts the overlay visualization of keywords related to ILP based on the analysis of over 300 articles. The term ILP system is associated with articles across multiple clusters. The articles that use ILP to develop systems have strong connections with the terms FOL, Aleph [27], Foil [28], and Prolog [29]. The high-order logic abduction method connects the term ILP system with the terms MIL [9], Metagol [30], and predicate invention. Common terms that combine probability with the term ILP system are SRL, Markov logic network (MLN), and Bayesian network. Keywords, including learnability, inductive inference, and classification, also have close relations to the term ILP system.

A. Concepts of ILP

We use the following example to introduce the basic concepts of the ILP system:

parent(i, a).   parent(a, b).
grandparent(X, Y) ← parent(X, Z), parent(Z, Y).

ILP is defined using FOL [31]. In FOL, a formula, parent(i, a) or parent(a, b), that contains no logical connectives is called an atom. An atom or its negation, i.e., ¬parent(i, a) and ¬parent(a, b), is called a literal. A definite clause is a many-way OR (disjunction) of literals (formula 3). a, b, and i in the example are constants; X, Y, and Z are variables; and all constants and variables are terms. A term that does not contain any free variables is called a ground term [parent(i, a)]. A Boolean-valued function P: X → {true, false} is called a predicate (parent and grandparent) on X; grandparent/2 and parent/2 denote predicates with their arity, i.e., number of arguments. Unlike a predicate, a function can take any value, and it never appears except as an argument to predicates. FOL is a structure of logic consisting of constants, variables, predicates, functions, and sentences. It uses quantified variables over nonlogical objects and allows the use of sentences that contain variables.

B. Semantics of ILP

There are two different semantics for ILP: standard (normal) and nonmonotonic semantics. For the normal semantics, given background (prior) knowledge B and examples E (E = E+ ∧ E− consists of positive examples E+ and negative examples E−), find a hypothesis H, such that the following conditions hold.

Prior Satisfiability: B ∧ E− ⊭ □.
Posterior Satisfiability: B ∧ H ∧ E− ⊭ □.
Prior Necessity: B ⊭ E+.
Posterior Sufficiency: B ∧ H ⊨ E+.

A general setting is used for the normal semantics. In most ILP systems, the definite setting is used as a simple version of the normal setting, as the background theory and hypotheses are restricted to being definite. The example setting is a special case of the definite semantics, where the examples are restricted to true and false ground facts. Notice that the example setting is equivalent to the normal semantics, where B and H are definite clauses and E is a set of ground unit
TABLE I: GRANDPARENT DATASET
TABLE II: ILP SYSTEMS WITH DIFFERENT SEARCHING METHODS
TABLE III: DEDUCTION AND INDUCTION OPERATIONS
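The grandparent example and the normal-semantics conditions of Section II can be made concrete with a small sketch. The encoding below is illustrative only: the tuple representation of atoms, the `apply_hypothesis` helper, and the particular example atoms are our own assumptions for this sketch, not the machinery of an actual ILP system.

```python
# A minimal sketch of the normal ILP semantics on the grandparent example.
# Background knowledge B: ground facts for parent/2, encoded as tuples.
background = {("parent", "i", "a"), ("parent", "a", "b")}

def apply_hypothesis(facts):
    """One forward-chaining pass of the candidate clause
    grandparent(X, Y) <- parent(X, Z), parent(Z, Y)."""
    derived = set(facts)
    for (p1, x, z1) in facts:
        for (p2, z2, y) in facts:
            if p1 == p2 == "parent" and z1 == z2:
                derived.add(("grandparent", x, y))
    return derived

def entails(background, atom):
    """B AND H entail `atom` (sufficient for this single non-recursive clause)."""
    return atom in apply_hypothesis(background)

# Examples: disjoint sets E+ and E- of ground atoms.
e_pos = {("grandparent", "i", "b")}
e_neg = {("grandparent", "a", "i")}

# Posterior sufficiency: B AND H entail every positive example.
sufficiency = all(entails(background, e) for e in e_pos)
# Posterior satisfiability: no negative example is derived.
satisfiability = not any(entails(background, e) for e in e_neg)

print(sufficiency, satisfiability)  # → True True
```

The sketch only checks one fixed candidate clause; a real ILP system (e.g., Progol or Aleph) instead searches a hypothesis space of clauses for one satisfying these conditions.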
rule corresponds to θ-subsumption among single clauses. One extension of θ-subsumption that takes into account background knowledge is called relative subsumption. Similar to θ-subsumption, it is straightforward to define relatively reduced clauses using a straightforward definition of relative clause equivalence. Relative subsumption forms a lattice over relatively reduced clauses.

2) Inverse Resolution: Inductive inference rules can be viewed as the inverse of deductive rules of inference. Since the deductive rule of resolution is complete for deduction, an inverse of resolution should be complete for induction. IR takes into account background knowledge and aims at inverting the resolution principle. Four main rules of IR are widely used.

Absorption: given q ← A and p ← A, B, infer q ← A and p ← q, B.
Identification: given p ← A, B and p ← A, q, infer q ← B and p ← A, q.
Intra-Construction: given p ← A, B and p ← A, C, infer q ← B, p ← A, q, and q ← C.
Inter-Construction: given p ← A, B and q ← A, C, infer p ← r, B, r ← A, and q ← r, C.

In these rules, lower-case letters are atoms, and upper-case letters are conjunctions of atoms. Both absorption and identification invert a single resolution step. The rules of inter- and intra-construction introduce "predicate invention," which leads to reducing the hypothesis space and the length of clauses.

3) Inverse Implication: Since the deductive inference rule is incomplete regarding implication among clauses, extensions of inductive inference under θ-subsumption have been studied under the header "inverting implication." The inability to invert implication between clauses limits the completeness of IR and RLGGs, since θ-subsumption is used in place of clause implication in both. The difference between θ-subsumption and implication between clauses C and D is only pertinent when C can self-resolve. Attempts were made to do the following: 1) extend IR and 2) use a mixture of IR and LGG [37] to solve the problem. The extended IR method suffers from problems of nondeterminacy. Due to the nondeterminacy problem, the development of algorithms regarding inverse implication in ILP is limited. Idestam-Almquist's use of LGG suffers from the standard problem of intractably large clauses. Both approaches are incomplete for inverting implication, though Idestam-Almquist's technique is complete for a restricted form of entailment called T-implication [38].

4) Inverse Entailment: The general problem specification of ILP is, given background knowledge B and examples E, to find the simplest consistent hypothesis H, such that B ∧ H ⊨ E. In general, B, H, and E could be arbitrary logic programs. Each clause in the simplest H should explain at least one example, since otherwise there is a simpler H′ that will do. Then, consider the case of H and E each being single Horn clauses. By rearranging the entailment relationship B ∧ H ⊨ E, Muggleton [35] proposed the "inverse entailment"

B ∧ ¬E ⊨ ¬H.

In particular, ¬⊥ is the (potentially infinite) conjunction of ground literals, which are true in all models of B ∧ ¬E. Since ¬H must be true in every model of B ∧ ¬E, it must contain a subset of the ground literals in ¬⊥. Therefore

B ∧ ¬E ⊨ ¬⊥ ⊨ ¬H
H ⊨ ⊥.

A subset of the solutions for H can be found by considering the clauses that θ-subsume ⊥. The complete set of candidates for H can be found by considering all clauses that θ-subsume sub-saturants of ⊥.

E. ILP Systems

A large number of logic learning systems have been developed based on the inference rules we mentioned before. Fig. 3 shows the timeline of the development of logic learning systems, including Prolog [29], MIS [19], CLINT [34], Foil [28], LINUS [36], Golem [22], Aleph [27], Progol [35], Cigol [23], Metagol [30], ProbLog [39], DeepProbLog [40], ∂ILP [11], and Popper [41]. Note that there are no commonly used inverse implication systems, because inverse implication suffers from problems of nondeterminacy.

III. VARIANTS OF ILP

Traditional ILP frameworks that we have discussed in Section II mainly have three limitations: most ILP systems 1) cannot address noisy data; 2) cannot deal with predicate invention directly and recursion effectively; and 3) cannot learn a hypothesis H efficiently due to the massive hypothesis space. Variants of ILP systems have been developed to solve the problems mentioned above. PILP [10] becomes a powerful tool when dealing explicitly with uncertainty, MIL [9] holds merits on predicate invention and recursive generalizations, and ∂ILP [11] can speed up the learning process and is also robust to noise and error. We will introduce each of them in this section.

A. Probabilistic Inductive Logic Programming

PILP is a machine learning technique based on probabilistic logic programming. It addresses one of the central questions of AI by integrating probabilistic reasoning with machine learning and first-order relational logic representations [10]. Dealing explicitly with uncertainty makes PILP more powerful than ILP and, in turn, than traditional attribute-value approaches [10]. It also provides better predictive accuracy and understanding of domains and has become a growth path in the machine learning community.

The terms used in PILP are close to those in ILP with small differences: since negative examples conflict with the usual view on learning examples in statistical learning (the probability of a failure is zero), the definition of the PILP problem uses observed and unobserved examples instead of positive and negative ones.

PILP Problem: Given a set E = E_p ∪ E_i of observed and unobserved examples E_p and E_i (with E_p ∩ E_i = ∅) over some example language L_E, a probabilistic covers relation
covers(e, H, B) = P(e | H, B), a probabilistic logical language L_H for hypotheses, and a background theory B, find a hypothesis H∗ in L_H, such that

H∗ = argmax_H score(E, H, B)

and the following constraints hold:

∀e_p ∈ E_p : covers(e_p, H∗, B) > 0
∀e_i ∈ E_i : covers(e_i, H∗, B) = 0.

The score is some objective function, usually involving the probabilistic covers relation of the observed examples, such as the observed likelihood ∏_{e_p ∈ E_p} covers(e_p, H∗, B) [10]. We denote H = (L, λ), where L represents all FOL program rules in H, and λ indicates the probabilistic parameters. Two subtasks should be considered when solving PILP learning problems: 1) parameter estimation, where it is assumed that the underlying logic program L is fixed, and the learning task consists of estimating the parameters λ that maximize the likelihood; and 2) structure learning, where both L and λ have to be learned from the data.

Similar to the ILP learning problem, the language L_E is selected for representing the examples, and the probabilistic covers relation determines different learning settings. In ILP, this leads to learning from interpretations [42], from proofs [10], and from entailment [43]. Therefore, it should be no surprise that this very same distinction also applies to probabilistic knowledge representation formalisms. There are three PILP settings as well: probabilistic learning from interpretations, from entailment, and from proofs [10]. The main idea is to lift ILP settings by associating probabilistic information with clauses and interpretations and by replacing ILP's deterministic covers relation with a probabilistic one. The large majority of PILP techniques proposed so far fall into the learning from interpretations setting, including parameter estimation of probabilistic logic programs [44], learning of probabilistic relational models (PRMs) [45], parameter estimation of relational Markov models [46], learning of object-oriented Bayesian networks [47], learning relational dependency networks (RDNs) [48], and learning logic programs with annotated disjunctions (LPAD) [49]. To define probabilities on proofs, ICL [50], Prism [51], and stochastic logic programs (SLPs) [52] attach probabilities to facts (respectively, clauses) and treat them as stochastic choices within resolution. PILP techniques that learn from proofs have been developed, including hidden Markov model induction by Bayesian model merging [53], relational Markov models [54], and logical hidden Markov models [55]. The learning from interpretations setting has been investigated for learning SLPs [56], [57] and for parameter estimation of Prism programs [58], [59] from observed examples.

Many algorithms have been developed since the PILP framework: Muggleton describes a method, based on an approximate Bayes "maximum a posteriori probability" (MAP) algorithm, for learning SLPs from examples and background knowledge [57], which is considered one of the earliest structure learning methods for PILP; De Raedt and Thon [60] upgrade rule learning to a probabilistic setting, in which both the examples themselves as well as their classification can be probabilistic. To solve the "large groundings" problem, Wang et al. [61] present a first-order probabilistic language, which is well suited to approximate "local" grounding. The algorithm "Structure LearnIng of ProbabilistiC logic progrAmS with Em over bdds" (SLIPCASE) performs a beam search in the space of the language of LPAD using the log-likelihood of the data as the guiding heuristic [62]; Fierens et al. [63] investigate how classical inference and learning tasks known from the graphical model community can be tackled for probabilistic logic programs; and the algorithm "Structure LearnIng of Probabilistic logic programs by searching OVER the clause space" (SLIPCOVER) performs a beam search in the space of probabilistic clauses and a greedy search in the space of theories, using the log-likelihood of the data as the guiding heuristic [64].

B. Meta-Interpretive Learning

MIL is a framework developed by Cropper and Muggleton [65], which uses higher-order metarules to support predicate invention and learning of recursive definitions. Metarules, second-order Horn clauses, are widely discussed [65], [66], [67], [68] as a form of declarative bias. Metarules define the structure of learnable programs, which, in turn, defines the hypothesis space. For instance, to learn the grandparent/2 relation given the parent/2 relation, the chain metarule¹ would be suitable

P(A, B) ← Q(A, C), R(C, B).

¹Commonly used metarules [30]: 1) identity: P(A, B) ← Q(A, B); 2) inverse: P(A, B) ← Q(B, A); 3) precon: P(A, B) ← Q(A), R(A, B); 4) postcon: P(A, B) ← Q(A, B), R(B); 5) chain: P(A, B) ← Q(A, C), R(C, B); and 6) recursive: P(A, B) ← Q(A, C), P(C, B).

In this metarule, the letters P, Q, and R denote existentially quantified second-order variables (variables that can be bound to predicate symbols), and the letters A, B, and C
denote universally quantified first-order variables (variables that can be bound to constant symbols). Given the chain metarule, the background parent/2 relation, and examples of the grandparent/2 relation, the learner will try to find the correct substitutions for the predicates, and one of the correct solutions would be

{P/grandparent, Q/parent, R/parent}

and the substitution result is

grandparent(A, B) ← parent(A, C), parent(C, B).

We will discuss the MIL problem after the description of metarules. Before that, we define what an MIL input is: an MIL input is a tuple (B, E+, E−, M), where B is a set of Horn clauses denoting background knowledge, E+ and E− are disjoint sets of ground atoms representing positive and negative examples, respectively, and M is a set of metarules. An MIL problem [69] can be defined from an MIL input. Given an MIL input (B, E+, E−, M), the MIL problem is to return a logic program hypothesis H, such that the following hold.

1) ∀c ∈ H, ∃m ∈ M, such that c = mθ, where θ is a substitution that grounds all the existentially quantified variables in m.
2) H ∪ B ⊨ E+.
3) H ∪ B ⊭ E−.

H can be considered as a solution to the MIL problem. Based on the equation c = mθ, MIL focuses on searching for θ instead of H through abductive reasoning in second-order logic.

As in the grandparent task shown before, MIL could be considered as an FOL rule searching problem based on metarules. Since any first-order predicate could substitute a second-order variable, the search for a logic program hypothesis H is more flexible than that of ILP: if second-order variables are replaced by predicates that do not exist in B, new predicates will be invented (predicate invention); if they are replaced by the same predicate in both the head and the body of a metarule, a recursive definition will be learned.

As we discussed before, the metarules determine the structure of permissible rules, which, in turn, defines the hypothesis space. Deciding which metarules to use for a given learning task is a major open problem and is a trade-off between efficiency and expressivity: we wish to use fewer metarules, as the hypothesis space grows given more metarules, but if we use too few metarules, we will lose expressivity. Also, the hypothesis space of MIL highly depends on the metarules we choose. For example, the identity metarule P(A, B) ← Q(A, B) cannot be learned from the chain metarule P(A, B) ← Q(A, C), R(C, B). Cropper and Muggleton [65] demonstrate that irreducible or minimal sets of metarules can be found automatically by applying Plotkin's clausal theory reduction algorithm. When this approach is applied to a set of metarules consisting of an enumeration of all metarules in a given finite hypothesis language, they show that, in some cases, as few as two metarules are complete and sufficient for generating all hypotheses. Nevertheless, for an expected hypothesis H∗, the learned model H′ is only semantically equivalent to the actual model H∗, not literally identical (H′ usually contains more hypotheses than H∗), so reducing metarules does not necessarily improve the learning efficiency of MIL.

Advances have been made to increase the performance of MIL recently. A new reduction technique, called derivation reduction [69], has been introduced to find a finite subset of a Horn theory from which the whole theory can be derived using SLD resolution. Recent work [67] also shows that adding types to MIL can improve learning performance, and type checking can reduce the MIL hypothesis space by a cubic factor, which can substantially reduce learning times. Extended MIL [66] has also been used to support learning higher-order programs by allowing higher-order definitions to be used as background knowledge. Recently, Popper [41] supports types, learning optimal solutions, learning recursive programs, reasoning about lists and infinite domains, and hypothesis constraints by combining ASP and Prolog.

C. Differentiable Inductive Logic Programming

∂ILP, which combines intuitive perceptual reasoning with conceptually interpretable reasoning, has several advantages: it is robust to noise and error, it is data-efficient, and it produces interpretable rules [11]. ∂ILP implements differentiable deduction over continuous values. The gradient of the loss with respect to the rule weights, which we use to minimize the classification loss, implements a continuous form of induction [11].

The sets P (positive examples) and N (negative examples) are used to form a new set Λ

Λ = {(γ, 1) | γ ∈ P} ∪ {(γ, 0) | γ ∈ N}.

Each pair (γ, λ) indicates whether the atom γ is in P (λ = 1) or in N (λ = 0). A differentiable model is constructed to implement the conditional probability of λ for a ground atom α

p(λ | α, W, Π, L, B)

where W is a set of clause weights, Π is a program template, L is a language frame, and B is a set of background assumptions. The goal is to match the predicted label p(λ | α, W, Π, L, B) with the actual label λ in the pair (γ, λ). Minimizing the expected negative log-likelihood is performed

loss = −E_{(γ,λ)∼Λ}[λ · log p(λ | α, W, Π, L, B) + (1 − λ) · log(1 − p(λ | α, W, Π, L, B))].

The conditional probability p(λ | α, W, Π, L, B) is calculated by four functions: f_extract, f_infer, f_convert, and f_generate

p(λ | α, W, Π, L, B) = f_extract(f_infer(f_convert(B), f_generate(Π, L), W, T), α).

The f_extract function extracts the value for an atom, f_convert converts the elements of B to 1 and others to 0, and f_generate generates a set of clauses from Π and L. The f_infer function performs T steps of forward-chaining inference, where T is part of Π. To show how inference is performed over multiple time steps, each clause c has been translated into a function
Authorized licensed use limited to: Birla Inst of Technology and Science Pilani Dubai. Downloaded on September 02,2024 at 10:43:04 UTC from IEEE Xplore. Restrictions apply.
ZHANG et al.: CRITICAL REVIEW OF ILP TECHNIQUES FOR EXPLAINABLE AI 10227
F_c : [0, 1]^n → [0, 1]^n on valuations [11]. Each intensional predicate p is defined by two clauses generated from two rule templates τ_p^1 and τ_p^2.² G_p^{j,k} ³ is the result of applying C_p^{1,j} ⁴ and C_p^{2,k} and taking the elementwise max [11]:

G_p^{j,k}(a) = x, where x[i] = max(F_p^{1,j}(a)[i], F_p^{2,k}(a)[i]).

Here, a_t depicts the conclusions after t time steps of inference, and a_0[x] equals 1 if x belongs to B, and 0 otherwise. Intuitively, c_t^{p,j,k}, which equals G_p^{j,k}(a_t), is the result of applying one step of forward-chaining inference to a_t using the clauses C_p^{1,j} and C_p^{2,k}. The weighted average of the c_t^{p,j,k} can be defined using the softmax of the weights:

b_t^p = Σ_{j,k} c_t^{p,j,k} · e^{W_p[j,k]} / Σ_{j′,k′} e^{W_p[j′,k′]}

and the successor a_{t+1} of a_t is the probabilistic sum:

a_{t+1} = a_t + Σ_{p∈P_i} b_t^p − a_t · Σ_{p∈P_i} b_t^p.

² A rule template τ describes a range of clauses that can be generated.
³ G_p^{j,k} combines the application of the two functions F_p^{1,j} and F_p^{2,k}, where F_p^{i,j} is the valuation function corresponding to the clause C_p^{i,j}.
⁴ C_p^{i,j} is the jth clause of the ith rule template τ_p^i for intensional predicate p.

Besides ∂ILP, several neural program synthesis models have been used to produce explicit human-readable programs. RobustFill [70] performs end-to-end synthesis of programs from examples by using a modified attention RNN that can encode variable-sized sets of I/O pairs. A differentiable Forth interpreter [71] is an end-to-end interpreter for the programming language Forth that lets programmers write program sketches with slots to be filled with behavior trained from program input–output data. Neural theorem provers (NTPs) [72] are end-to-end differentiable provers for basic theorems formulated as queries to a knowledge base (KB); they use Prolog's backward-chaining algorithm as a recipe for recursively constructing neural networks that can prove queries to a KB using subsymbolic representations. Payani and Fekri [73] propose a novel paradigm, called differentiable neural logic ILP (dNL-ILP), for solving ILP problems via deep recurrent neural networks. In contrast to the majority of past methods, dNL-ILP directly learns the symbolic logical predicate rules instead of searching through the space of possible FOL rules by using restrictive rule templates.

IV. SYMBOLIC-BASED INTEGRATION

ILP systems hold certain merits in explainability due to their symbolic nature, but they also show drawbacks, such as a lack of robustness, an inability to deal with big data, and difficulty expressing complex nonlinear decision surfaces. With the development of AI in other fields, integrating symbolic AI with other machine learning methods can exploit the strengths and avoid the weaknesses of each method. We present the two most important integrations in this section: SRL and NeSy.

A. Statistical Relational Learning

Most machine learning algorithms assume that the data are independently distributed. In the real world, objects have different kinds of relations between them: the objects in a dataset relate to each other, have different types, and follow multiple kinds of distribution. SRL is a subdiscipline of AI and machine learning concerned with domain models that exhibit both uncertainty (which can be dealt with using statistical methods) and complex, relational structure [74], [75].

SRL usually provides not only a better understanding of domains and better predictive accuracy but also a more complex learning and inference process. The difference between PILP and SRL is that SRL started from a statistical and probabilistic learning perspective and extended probabilistic formalisms with relational aspects, while PILP takes a different perspective, starting from ILP. As we mentioned before, SRL focuses on learning when samples are non-i.i.d. Domains where data are non-i.i.d. are widespread; examples include web search, information extraction, perception, medical diagnosis/epidemiology, molecular and systems biology, social science, security, and ubiquitous computing. In all of these domains, modeling dependencies between examples can significantly improve predictive performance and lead to a better understanding of the relevant phenomena. The knowledge representation formalisms developed in SRL use FOL to describe relational properties of a domain in a general manner (universal quantification) and draw upon probabilistic graphical models, such as Bayesian networks or Markov networks, to model the uncertainty, while others also build upon the methods of ILP.

PRMs [76] are a rich representation language for structured statistical models, such as probabilistic graphical models [77]. A PRM models the uncertainty over the attributes of objects in the domain and over the relations between the objects. The model specifies, for each attribute of an object, its (probabilistic) dependence on other attributes of that object and on attributes of related objects. Typical PRMs, including relational Bayesian networks (RBNs) [78], relational Markov networks (RMNs) [79], and RDNs [80], can deal with incomplete and inaccurate relational data. However, learning a graphical model requires both structure learning and parameter learning. Structure learning is a combinatorial optimization problem with high complexity, and because parameter learning in RMNs and RDNs converges slowly, approximation strategies are commonly used. Therefore, PRMs are suitable only for processing small-scale data.

Probabilistic logic models (PLMs) [77], which combine probabilities with FOL, can handle relational data well. Typical PLMs include probabilistic Horn abduction (PHA) [81], Bayesian logic programming (BLP) [82], and MLNs [83]. Because PLMs are based on graphical models, their learning speed is slow, so they, too, are suitable only for processing small amounts of data. PLMs can handle noisy data and provide better predictive accuracy and understanding of domains, but they suffer from the computational complexity of inference.
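Returning to ∂ILP, the forward-chaining update described earlier in this section (apply a clause pair, take the elementwise max, mix clause choices with a softmax over the weights W, and amalgamate with the probabilistic sum) can be made concrete with a small numeric sketch. The atoms, clauses, and weights below are invented for illustration only; this is not the implementation of [11]:

```python
import math

def prob_sum(x, y):
    # Probabilistic sum used for a_{t+1}: x + y - x*y, elementwise.
    return [xi + yi - xi * yi for xi, yi in zip(x, y)]

# Ground atoms (illustrative): index 0 = p(a), 1 = q(a), 2 = r(a).
a = [0.0, 1.0, 1.0]  # a_0: q(a) and r(a) are in B, p(a) is not.

# Valuation functions for two hypothetical clauses defining p:
#   F1: p(X) <- q(X)        F2: p(X) <- r(X)
def F1(a): return [a[1], 0.0, 0.0]
def F2(a): return [a[2], 0.0, 0.0]

def G(a):
    # G(a)[i] = max(F1(a)[i], F2(a)[i]): elementwise max over the pair.
    return [max(x, y) for x, y in zip(F1(a), F2(a))]

# With a single clause pair the softmax over W is trivially 1.0; with
# several pairs, b_t would be their softmax-weighted mixture.
W = [0.0]
soft = [math.exp(w) / sum(math.exp(v) for v in W) for w in W]

for t in range(2):  # T = 2 forward-chaining steps
    b = [soft[0] * g for g in G(a)]
    a = prob_sum(a, b)

print(a)  # [1.0, 1.0, 1.0]: p(a) is derived with valuation 1.0
```

After one step, p(a) already reaches valuation 1.0; the probabilistic sum keeps valuations in [0, 1] across further steps, which is what makes the chain differentiable end to end.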
10228 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 35, NO. 8, AUGUST 2024
B. Neural-Symbolic AI

Neural networks can deal with mislabeled and noisy data but suffer from problems such as data inefficiency and poor explainability. Symbolic methods are data-efficient but have drawbacks of their own, such as an inability to handle erroneous or noisy data. Blending deep neural network topologies with symbolic reasoning techniques creates a more advanced form of AI, i.e., NeSy: the hybrid model aims to combine robust learning in neural networks with reasoning and explainability via symbolic representations for network models [84].

To bring together neural networks and symbolic AI, Kautz [85] introduced a taxonomy of five different types of NeSy at AAAI-2020. In Kautz's taxonomy, Type 1 is the standard deep learning procedure in which the input and output of a neural network can be made of vectors of symbols, e.g., text in the case of classification, entity extraction, and translation. Type 2 is a neural pattern recognition subroutine within a symbolic problem solver, such as Monte Carlo tree search, e.g., AlphaGo and self-driving car paradigms. Type 3 is a hybrid system that takes symbolic rules, e.g., A ← B, as an input–output training pair (A, B); i.e., the knowledge of the symbolic rules is learned by the neural network. One example uses this integration to solve differential equations [86]. Type 4 cascades from the neural network into a symbolic reasoner, e.g., the neuro-symbolic concept learner (NS-CL) [87] and conditional theorem provers (CTPs) [88]. Type 5 embeds symbolic reasoning inside a neural engine; e.g., in business AI, when attention to concepts is very high, they are decoded into symbolic entities in an attention schema, and a goal that appears in the attention schema indicates that deliberative symbolic reasoning should be initiated.

TABLE IV
PAPERS BASED ON KAUTZ'S TAXONOMY

Table IV summarizes models for the types based on Kautz's categorization. We now briefly discuss state-of-the-art neural-symbolic models for each type of integration. Note that there are no papers fully addressing Type 5. For Type 1, the symbolic deep reinforcement learning (SDRL) [89] framework conducts high-level symbolic planning based on intrinsic goals using explicitly represented symbolic knowledge and utilizes DRL to learn a low-level control policy, leading to improved task-level interpretability and data efficiency for DRL; the neural-symbolic stack machine (NeSS) [90] is a differentiable neural network that operates a symbolic stack machine supporting general-purpose sequence-to-sequence generalization, to accomplish compositional generalization; LENSR [91] is a novel approach for improving the performance of deep models by leveraging prior symbolic knowledge. For Type 2, the generative neuro-symbolic machine (GNM) [92] combines the advantages of distributed and symbolic representation in generative latent variable models; GraIL [93] is a GNN-based framework that predicts relations between nodes unseen during training and produces state-of-the-art performance in this inductive setting; a sparse-matrix reified KB [94] is a technique for representing a symbolic KB that enables neural modules that are fully differentiable, faithful to the original semantics of the KB, expressive enough to model multihop inferences, and scalable enough to use with realistically large KBs. For Type 3, the neural-symbolic reader (NeRd) [95] is a scalable integration of distributed representations and symbolic operations for reading comprehension, consisting of a reader that encodes text into vector representations and a programmer that generates programs, which are then executed to produce the answer; Know-Evolve [96] is a novel deep evolutionary knowledge network that learns nonlinearly evolving entity representations over time; neural equivalence networks (EQNETs) [97] focus on the challenge of learning continuous semantic representations of algebraic and logical expressions. For Type 4, NS-CL [87] learns by looking at images and reading associated questions and answers, without any explicit supervision such as class labels for objects; CTP [88] is an extension of NTP that uses gradient-based optimization to learn the best rule selection technique; DrKIT [98] is a differentiable module capable of answering multihop questions directly over a large entity-linked text corpus.

C. SRL Versus NeSy

Although SRL and NeSy merge symbolic reasoning with different fundamental learning paradigms (SRL integrates logic with probability, while NeSy combines symbolic reasoning with neural networks), they have a lot in common, and there are interactions between the two fields. De Raedt et al. [99] identify seven dimensions that SRL and NeSy approaches have in common: 1) type of logic; 2) model-based versus proof-based inference; 3) directed versus undirected models; 4) logical semantics; 5) learning parameters or structure; 6) representing entities as symbols or sub-symbols; and 7) integrating logic with probability and/or neural computation. In Table V, in addition to summarizing the numerous NeSy structures from [99] along these seven dimensions, we add another dimension, SRL, to represent multiple SRL and NeSy models. Note that, in dimension 1, logic programming and FOL are mostly used in SRL, while NeSy models adopt different types of logic depending on the required complexity and expressiveness. In dimension 4, probabilistic Boolean logic is the most common semantics in SRL, whereas in NeSy, many probabilistic approaches use neural components to parameterize the underlying distribution, and fuzzy logic is used mostly for computational reasons and to relax rules. In dimension 6, for NeSy, the input and intermediate representations are sub-symbolic, while the output representations can be either
TABLE V
SRL AND NESY MODELS ALONG SEVEN DIMENSIONS

symbolic or sub-symbolic; all representations in SRL are symbolic.

V. USER-CENTERED EXPLAINABLE ILP

XAI has received a lot of attention lately as researchers try to work out what it means and what it is expected to deliver. As AI systems are used by a broader and more diverse audience, researchers have realized that it is crucial to pay sufficient attention to AI users. From a user-centered perspective, as we surveyed ILP systems and their integrations, we briefly discuss models that could apply to ILP by answering the following questions.

1) How should we create explanations for users?
2) How should a user trust a model?
3) How can we provide explanations for symbolic-based integration, e.g., NeSy?

A. User-Centered Explanation

Explainability based on model users' characteristics is essential, since generating explanations for different audiences can overcome the difficulty of fulfilling all requirements simultaneously. Ribera and Lapedriza [108] classify users into three main groups (developers or AI researchers, domain experts, and lay users) and design explanations for each group with a particular purpose and content, presented in a specific way, which is a suitable framework for ILP systems. For developers and AI researchers, model inspection and simulation with proxy models are provided. These two types of explanations are well suited to validating the system, discovering flaws, and suggesting improvements; because this audience understands code, data representation structures, and statistical variance, this mode of communication fits them well. For domain experts, who are capable of deciding when and how to question an explanation and are led to discoveries by themselves, explanations through natural language conversations or interactive visualizations are offered. For lay users, outcome explanations with several counterfactuals are used, with which users can interact to select the one most relevant to their particular case.

B. Measurement of Trust

Besides creating explanations for different users, how users trust AI systems is also crucial. Yilmaz and Liu [109] propose a measurement of trust that is also appropriate for ILP. In this measurement, three evaluation models with increasing levels of specificity can be considered: binary evaluation, quantized/discrete evaluation, and continuous/spectral evaluation (Fig. 4).

As shown in Fig. 4, binary evaluation means that model users categorize a model as either trusted or distrusted, without any middle ground; quantized evaluation sets a threshold in the unfavorable region that distinguishes distrust from a mere lack of trust; and continuous evaluation measures trust over a continuous range, giving model users discretion over the degree of trust. Table VI shows how these criteria relate to the evaluation models.

C. Explanation for Symbolic-Based Integration

In the current state of the art in XAI, models that combine connectionism and symbolism are not frequently represented. Explanation for symbolic-based integration is hard, since the connectionist methods in a hybrid system offer poor explainability. To overcome this problem, explainable neural-symbolic learning [110] learns both symbolic and deep representations, together with an explainability metric that assesses how well machine explanations align with those of human experts. Campagner and Cabitza [111] use deep learning for the automatic detection of meaningful, hand-crafted, high-level symbolic features, which are then used by a standard, more interpretable learning model. Bennetot et al. [112] present a methodology, based on the neural network's training data, that allows us to influence its learning and fix biases while providing a fair explanation of its predictions.

VI. EXPERIMENTS

In this section, we evaluate several ILP systems following the experimental procedures in [41] while varying the following: 1) optimal solution size; 2) optimal domain size; 3) example size; and 4) benchmark ILP problems. We also compare
TABLE VI
TYPES OF TRUST EVALUATION
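The three trust-evaluation models summarized in Table VI can be sketched as simple functions of a trust score in [0, 1]. The function names and thresholds below are our own illustrative choices, not values from [109]:

```python
def binary_eval(score, threshold=0.5):
    # Binary evaluation: trusted or distrusted, no middle ground.
    return "trusted" if score >= threshold else "distrusted"

def quantized_eval(score, distrust_cut=0.3, trust_cut=0.7):
    # Quantized/discrete evaluation: a threshold in the unfavorable
    # region separates outright distrust from a mere lack of trust.
    if score < distrust_cut:
        return "distrusted"
    if score < trust_cut:
        return "lack of trust"
    return "trusted"

def continuous_eval(score):
    # Continuous/spectral evaluation: the degree of trust itself.
    return float(score)

print(binary_eval(0.4), quantized_eval(0.4), continuous_eval(0.4))
# -> distrusted lack of trust 0.4
```

The three functions form increasing levels of specificity: each finer model can reproduce the coarser one by collapsing its output through the same thresholds.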
A. Robot

This comparison aims to find the optimal solution size for each ILP system. To control the solution size, we use a strategy-learning problem, robot: there is a robot in an n × n world, and the aim is to learn a general plan for moving the robot i steps to the right in the grid from an arbitrary starting location. For example, in a 5 × 5 world with start position (1, 3), the goal for the robot is to move to position (3, 3) within i = 2 steps. A solution could be

f (A, B) ← move_right(A, C), move_right(C, B).

This solution indicates that the robot can reach the final position by moving two steps to the right. We fix n to 20 and vary i, which corresponds to the optimal solution size. The settings for each system are as follows. For Aleph, the maximum number of nodes to be searched is 50 000. For Metagol, five metarules (Table VII) are used to guide the search if i < 3, and six metarules are used if i ≥ 3 (the metarules in Table VII plus another metarule for the current step size). For Popper, there are no restrictions in the experiment.

For each i in [1, 2, . . . , 10], we generate 100 positive examples (the robot has reached the correct position within i steps) and 100 negative examples (the robot has not reached the correct position within i steps). If a system fails to learn a solution within the given time, it receives the default accuracy (50%).

Fig. 5. Learning time for robot experiment.

We set a timeout of 60 s per task and repeat each experiment ten times to obtain the mean and the standard error.

Fig. 5 shows that Aleph outperforms both methods when i > 4. The learning time of Aleph and Popper increases significantly when the step size reaches 5. The largest program Popper learns is at i = 5, which takes 16 s, compared with less than 1 s for Aleph and Metagol. Popper and Aleph share comparable learning curves, while Metagol outperforms the other two, because the metarules guide the search efficiently. Note that the accuracies of all three systems are 100% for this experiment.

B. Robot2

The goal of this comparison is to find the optimal domain size for each ILP system. To control the domain size,
TABLE VII
METARULES USED IN THE ROBOT EXPERIMENT
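For illustration, example generation for the robot task in Section VI-A can be sketched as follows. The sampling scheme and helper names are our own; the paper does not specify its generator:

```python
import random

random.seed(0)
n, i = 20, 2  # grid size and number of right moves, as in the robot task

def move_right(pos):
    # The background predicate: shift the robot one cell to the right.
    x, y = pos
    return (x + 1, y)

def make_examples(k):
    # Positive: the robot ends exactly i cells to the right of the start.
    # Negative: any other end position in the grid (illustrative scheme).
    pos, neg = [], []
    while len(pos) < k:
        x, y = random.randint(1, n - i), random.randint(1, n)
        pos.append(((x, y), (x + i, y)))
    while len(neg) < k:
        x, y = random.randint(1, n - i), random.randint(1, n)
        end = (random.randint(1, n), random.randint(1, n))
        if end != (x + i, y):
            neg.append(((x, y), end))
    return pos, neg

pos, neg = make_examples(100)
start, goal = pos[0]
# Every positive example satisfies the learned clause
# f(A, B) <- move_right(A, C), move_right(C, B).
assert goal == move_right(move_right(start))
print(len(pos), len(neg))  # -> 100 100
```

A system scoring 100% must cover all 100 positive pairs and reject all 100 negative ones; failing to return a program within the timeout yields the 50% default accuracy mentioned above.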
TABLE VIII
BENCHMARK ILP PROBLEMS

TABLE IX
ACCURACIES FOR BENCHMARK ILP PROBLEMS

TABLE X
LEARNING TIMES FOR BENCHMARK ILP PROBLEMS
edge from graphs. A detailed description of each example is given in Table VIII.

The settings are as follows. For Aleph, the maximum number of nodes to be searched is 50 000. For Metagol, the metarules are the same as those in the robot experiment. The examples must also be provided in increasing sizes for Metagol to find solutions, since Metagol is sensitive to the order of examples. There are no restrictions for Popper.

For each example, we use the same strategy as in the robot problem: we generate 100 positive and 100 negative examples, so the default accuracy is again 50%. We repeat each experiment ten times to obtain the mean and the standard error.

Table IX indicates that Popper, ∂ILP, and dNL-ILP perform perfectly on all the tasks in terms of accuracy. Metagol reaches 100% predictive accuracy except on the even problem. Aleph struggles to learn solutions for the even and member problems. Table X shows that Aleph is the fastest and ∂ILP the slowest of all the systems. Note that the main difference between ∂ILP and dNL-ILP is that ∂ILP allows clauses of at most two atoms and only two rules per predicate to reduce the size of the search space, while in dNL-ILP, the membership weights of the conjunction can be directly interpreted as the flags in the satisfiability interpretation [73].

E. Feature Comparison

We compare six ILP systems in terms of the inference rules they employ. In the comparison, we consider the following important features: 1) handling recursive rules: a logic program can handle recursive rules if the same predicate appears in the head and body of a rule; 2) robustness: the ability of a system to cope with errors during execution and with erroneous input [113]; 3) predicate invention: a way of finding new theoretical terms, or abstracting new concepts, that are not directly observable in the measured data [24]; and 4) scalability: the capability of a system to cope and perform well under an increased or expanding workload or scope. Although this list could be extended with other features, we aim to include the features that we believe are important for different ILP systems. The comparison of the systems is given in Table XI.

Recursive clauses are part of the hypothesis in some learning examples. In a recursive rule, at least one atom appears in both the head and the body. The ability to generate recursive rules is important to a system, and recursion often makes it easier for an ILP system to generalize from small numbers of examples [114]. All models in Table XI can generate recursive rules, but traditional ILP systems struggle to learn recursive programs, especially from small numbers of training examples [115]. Regarding robustness, none of the traditional ILP systems, nor Metagol and Popper, are robust to noisy or mislabeled inputs. Because PILP combines uncertainty with logic, while dNL-ILP and ∂ILP connect ILP with neural networks, these systems can deal with noise and ambiguity. Predicate invention in ILP means the automatic introduction of new, hopefully useful, predicates during learning from examples. Such new predicates can then be used as part of the background knowledge in finding a definition of the target
TABLE XI
FEATURE COMPARISON FOR ILP SYSTEMS

TABLE XII
LEARNING TIMES FOR EXPERIMENTS
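As a concrete illustration of the recursive rules compared in Table XI, the even problem can be solved by naive forward chaining over a rule whose predicate appears in both head and body. This sketch is ours, not the mechanism of any surveyed system:

```python
def forward_chain_even(limit):
    # Program:  even(0).
    #           even(X + 2) <- even(X).   (recursive: 'even' in head and body)
    derived = {0}           # base fact
    changed = True
    while changed:          # iterate the rule to a fixed point
        changed = False
        for x in list(derived):
            if x + 2 <= limit and x + 2 not in derived:
                derived.add(x + 2)
                changed = True
    return sorted(derived)

print(forward_chain_even(10))  # -> [0, 2, 4, 6, 8, 10]
```

An ILP system that supports recursion must induce exactly this two-clause program from a handful of positive examples such as 0, 2, and 4; a system without recursion would need one clause per even number.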
predicate [24]. Traditional ILP systems do not support predicate invention. Since Metagol and Popper use higher-order metarules to define the hypothesis space, new predicates are invented when induced definitions require them. ∂ILP and dNL-ILP also use rules as guidance to support predicate invention.

Regarding scalability, the existing ILP systems and Metagol cannot be applied effectively to datasets with 10 000 data points. In PILP, approximative generalization can compress several data points into one example to tackle large datasets [116]. ∂ILP is restricted to small datasets because of its memory requirements: its scalability suffers from its clause constraints (only two atoms per clause are allowed in ∂ILP to limit memory consumption). It would be beneficial if the memory-consumption problem could be solved, since the gradient descent used in ∂ILP performs well under an increased workload.

F. Summary

We tested several systems in this section by adjusting the following: 1) optimal solution size; 2) optimal domain size; 3) example size; and 4) benchmark ILP problems. We broadly summarize the learning times and predictive accuracies of all models in the experiments (see Table XII). Although Aleph is faster than all the other systems on the benchmark ILP problems, it only learns accurate solutions for grandparent and undirected edge. Popper can be roughly considered an upgrade of Metagol: by learning from failure, Popper finds accurate solutions with little time consumption for most of the experiments. The performance of Popper relies heavily on its initial restrictions, while the performance of Metagol depends heavily on its metarules. ∂ILP and dNL-ILP handle only the benchmark ILP problems, although they produce reliable results on them.

VII. CHALLENGES

In this section, we share challenges facing the community based on the recent developments surveyed in this article.

A. Data Efficiency

Symbolic methods deal with small amounts of data efficiently, while the leading neural approaches need large amounts of data to achieve outstanding performance. Conversely, symbolic methods have difficulty with large datasets, which neural models handle easily. For SRL, structure learning, e.g., in MLNs, is likewise not scalable and is inefficient for large amounts of data. This has triggered the development of NeSy and SRL. Combining the benefits of each model while overcoming their complementary flaws is a promising research direction.

B. Generalization

Generalization refers to the capacity of models to adapt to new, previously unseen data drawn from the same distribution as the one used to create the model. Understanding generalization is one of the fundamental unsolved problems for all machine learning models. We found that several symbolic-based models suffer from weak generalization. For instance, Metagol sometimes cannot generate valid inference rules when the number range of the positive and negative examples in the even dataset is changed from 10 to 20. Although the search during training can be fully guided and understood, attention should be paid to failures of generalization that may be caused by only a tiny change in the input.

C. Application

From an application point of view, the usage of traditional symbolic-based applications is very limited, and applications of SRL and NeSy are rarely developed. As integrations of symbolic methods with probability and neural networks, hybrid methods can be adapted to new fields, such as computer vision and natural language processing. As explainability and logic play crucial roles in many fields, methods in SRL and NeSy will be challenging and hold great potential.
[46] B. Taskar, P. Abbeel, and D. Koller, "Discriminative probabilistic models for relational data," 2012, arXiv:1301.0604.
[47] O. B. H. Langseth and T. Nielsen, "Structural learning in object oriented domains," in Proc. 14th Florida Artif. Intell. Res. Soc. Conf., 2001, pp. 340–344.
[48] J. Neville and D. Jensen, "Dependency networks for relational data," in Proc. 4th IEEE Int. Conf. Data Mining, Apr. 2004, pp. 170–177.
[49] F. Riguzzi, "Learning logic programs with annotated disjunctions," in Proc. Int. Conf. Inductive Log. Program. Springer, 2004, pp. 270–287.
[50] D. Poole, "The independent choice logic for modelling multiple agents under uncertainty," Artif. Intell., vol. 94, nos. 1–2, pp. 7–56, Jul. 1997.
[51] T. Sato and Y. Kameya, "PRISM: A language for symbolic-statistical modeling," in Proc. IJCAI, vol. 97, 1997, pp. 1330–1339.
[52] S. Muggleton et al., "Stochastic logic programs," in Advances in Inductive Logic Programming, vol. 32, 1996, pp. 254–264.
[53] A. Stolcke and S. Omohundro, "Hidden Markov model induction by Bayesian model merging," in Proc. Adv. Neural Inf. Process. Syst., 1993, pp. 11–18.
[54] C. R. Anderson, P. Domingos, and D. S. Weld, "Relational Markov models and their application to adaptive web navigation," in Proc. 8th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, Jul. 2002, pp. 143–152.
[55] K. Kersting, L. De Raedt, and T. Raiko, "Logical hidden Markov models," J. Artif. Intell. Res., vol. 25, pp. 425–456, Apr. 2006.
[56] J. Cussens, "Parameter estimation in stochastic logic programs," Mach. Learn., vol. 44, no. 3, pp. 245–271, 2001.
[57] S. Muggleton, "Learning stochastic logic programs," Electron. Trans. Artif. Intell., vol. 4, no. B, pp. 141–153, 2000.
[58] Y. Kameya, T. Sato, and N.-F. Zhou, "Yet more efficient EM learning for parameterized logic programs by inter-goal sharing," in Proc. ECAI, vol. 16, 2004, p. 490.
[59] T. Sato and Y. Kameya, "Parameter learning of logic programs for symbolic-statistical modeling," J. Artif. Intell. Res., vol. 15, pp. 391–454, Dec. 2001.
[60] L. De Raedt and I. Thon, "Probabilistic rule learning," in Proc. Int. Conf. Inductive Log. Program. Springer, 2010, pp. 47–58.
[61] W. Y. Wang, K. Mazaitis, and W. W. Cohen, "Programming with personalized PageRank: A locally groundable first-order probabilistic logic," in Proc. 22nd ACM Int. Conf. Inf. Knowl. Manage., 2013, pp. 2129–2138.
[62] E. Bellodi and F. Riguzzi, "Learning the structure of probabilistic logic programs," in Proc. Int. Conf. Inductive Log. Program. Springer, 2011, pp. 61–75.
[63] D. Fierens et al., "Inference and learning in probabilistic logic programs using weighted Boolean formulas," Theory Pract. Log. Program., vol. 15, no. 3, pp. 358–401, May 2015.
[64] E. Bellodi and F. Riguzzi, "Structure learning of probabilistic logic programs by searching the clause space," Theory Pract. Log. Program., vol. 15, no. 2, pp. 169–212, Mar. 2015.
[65] A. Cropper and S. Muggleton, "Logical minimisation of meta-rules within meta-interpretive learning," in Inductive Logic Programming. Springer, 2015, pp. 62–75.
[66] A. Cropper, S. Morel, and R. Muggleton, "Learning higher-order logic
[75] R. A. Rossi, L. K. McDowell, D. W. Aha, and J. Neville, "Transforming graph data for statistical relational learning," J. Artif. Intell. Res., vol. 45, no. 1, pp. 363–441, 2012.
[76] D. Koller, "Probabilistic relational models," in Proc. Int. Conf. Inductive Log. Program. Springer, 1999, pp. 3–13.
[77] J. Chen, S. Muggleton, and J. Santos, "Learning probabilistic logic models from probabilistic examples," Mach. Learn., vol. 73, no. 1, pp. 55–85, Oct. 2008.
[78] M. Jaeger, "Relational Bayesian networks," in Proc. 13th Conf. Uncertainty Artif. Intell. Morgan Kaufmann Publishers Inc., 1997, pp. 266–273.
[79] B. Taskar, P. Abbeel, M.-F. Wong, and D. Koller, "Relational Markov networks," in Introduction to Statistical Relational Learning, 2007, pp. 175–200.
[80] J. Neville and D. Jensen, "Relational dependency networks," J. Mach. Learn. Res., vol. 8, pp. 653–692, Mar. 2007.
[81] D. Poole, "Probabilistic horn abduction and Bayesian networks," Artif. Intell., vol. 64, no. 1, pp. 81–129, Nov. 1993.
[82] K. Kersting and L. De Raedt, "Bayesian logic programming: Theory and tool," in Statistical Relational Learning, 2007, p. 291.
[83] M. Richardson and P. Domingos, "Markov logic networks," Mach. Learn., vol. 62, no. 1, pp. 107–136, Feb. 2006.
[84] A. D. Garcez and L. C. Lamb, "Neurosymbolic AI: The 3rd wave," 2020, arXiv:2012.05876.
[85] H. Kautz, "AAAI 2020 Robert S. Engelmore memorial award lecture," in Proc. AAAI, 2020.
[86] G. Lample and F. Charton, "Deep learning for symbolic mathematics," 2019, arXiv:1912.01412.
[87] J. Mao, C. Gan, P. Kohli, J. B. Tenenbaum, and J. Wu, "The neuro-symbolic concept learner: Interpreting scenes, words, and sentences from natural supervision," 2019, arXiv:1904.12584.
[88] P. Minervini, S. Riedel, P. Stenetorp, E. Grefenstette, and T. Rocktäschel, "Learning reasoning strategies in end-to-end differentiable proving," in Proc. Int. Conf. Mach. Learn., 2020, pp. 6938–6949.
[89] D. Lyu, F. Yang, B. Liu, and S. Gustafson, "SDRL: Interpretable and data-efficient deep reinforcement learning leveraging symbolic planning," in Proc. AAAI Conf. Artif. Intell., vol. 33, 2019, pp. 2970–2977.
[90] X. Chen, C. Liang, A. W. Yu, D. Song, and D. Zhou, "Compositional generalization via neural-symbolic stack machines," 2020, arXiv:2008.06662.
[91] Y. Xie, Z. Xu, M. S. Kankanhalli, K. S. Meel, and H. Soh, "Embedding symbolic knowledge into deep networks," 2019, arXiv:1909.01161.
[92] J. Jiang and S. Ahn, "Generative neurosymbolic machines," 2020, arXiv:2010.12152.
[93] X. Peng, C. Zhang, and K. Xu, "Document-level relation extraction via subgraph reasoning," in Proc. 31st Int. Joint Conf. Artif. Intell., Jul. 2022, pp. 9448–9457.
[94] W. W. Cohen, H. Sun, R. A. Hofer, and M. Siegler, "Scalable neural methods for reasoning with a symbolic knowledge base," 2020, arXiv:2002.06115.
[95] X. Chen, C. Liang, A. W. Yu, D. Zhou, D. Song, and Q. V. Le, "Neural symbolic reader: Scalable integration of distributed and symbolic representations for reading comprehension," in Proc. Int. Conf. Learn.
programs,” Mach. Learn., vol. 109, pp. 1–34, Jul. 2019. Represent., 2019.
[67] R. Morel, A. Cropper, and C.-H. L. Ong, “Typed meta-interpretive [96] R. Trivedi, H. Dai, Y. Wang, and L. Song, “Know-evolve: Deep
learning of logic programs,” in Proc. Eur. Conf. Logics Artif. Intell. temporal reasoning for dynamic knowledge graphs,” in Proc. Int. Conf.
Springer, 2019, pp. 198–213. Mach. Learn., 2017, pp. 3462–3471.
[68] A. Albarghouthi, P. Koutris, M. Naik, and C. Smith, “Constraint- [97] M. Allamanis, P. Chanthirasegaran, P. Kohli, and C. Sutton, “Learning
based synthesis of datalog programs,” in Proc. Int. Conf. Princ. Pract. continuous semantic representations of symbolic expressions,” in Proc.
Constraint Program. Springer, 2017, pp. 689–706. Int. Conf. Mach. Learn., 2017, pp. 80–88.
[69] A. Cropper and S. Tourret, “Logical reduction of metarules,” Mach. [98] B. Dhingra, M. Zaheer, V. Balachandran, G. Neubig, R. Salakhutdinov,
Learn., vol. 109, pp. 1–47, Jul. 2019. and W. W. Cohen, “Differentiable reasoning over a virtual knowledge
[70] J. Devlin, J. Uesato, S. Bhupatiraju, R. Singh, A.-R. Mohamed, and base,” 2020, arXiv:2002.10640.
P. Kohli, “RobustFill: Neural program learning under noisy I/O,” in [99] L. De Raedt, S. S. Dumančić, R. Manhaeve, and G. Marra, “From
Proc. 34th Int. Conf. Mach. Learn., vol. 70, 2017, pp. 990–998. statistical relational to neuro-symbolic artificial intelligence,” 2020,
[71] M. Bošnjak, T. Rocktäschel, J. Naradowsky, and S. Riedel, “Program- arXiv:2003.08316.
ming with a differentiable forth interpreter,” in Proc. 34th Int. Conf. [100] J. Xu, Z. Zhang, T. Friedman, Y. Liang, and G. Broeck, “A semantic
Mach. Learn., vol. 70, 2017, pp. 547–556. loss function for deep learning with symbolic knowledge,” in Proc. Int.
[72] T. Rocktäschel and S. Riedel, “End-to-end differentiable proving,” in Conf. Mach. Learn., 2018, pp. 5502–5511.
Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 3788–3800. [101] H. Dong, J. Mao, T. Lin, C. Wang, L. Li, and D. Zhou, “Neural logic
[73] A. Payani and F. Fekri, “Inductive logic programming via differentiable machines,” 2019, arXiv:1904.11694.
deep neural logic networks,” 2019, arXiv:1906.03523. [102] L. Serafini and A. D. Garcez, “Logic tensor networks: Deep
[74] D. Koller et al., Introduction to Statistical Relational Learning. learning and logical reasoning from data and knowledge,” 2016,
Cambridge, MA, USA: MIT Press, 2007. arXiv:1606.04422.
[103] G. Marra, M. Diligenti, F. Giannini, M. Gori, and M. Maggini, “Relational neural machines,” 2020, arXiv:2002.02193.
[104] L. Weber, P. Minervini, J. Münchmeyer, U. Leser, and T. Rocktäschel, “NLProlog: Reasoning with weak unification for question answering in natural language,” 2019, arXiv:1906.06187.
[105] L. De Raedt and A. Kimmig, “Probabilistic (logic) programming concepts,” Mach. Learn., vol. 100, no. 1, pp. 5–47, Jul. 2015.
[106] S. H. Bach, M. Broecheler, B. Huang, and L. Getoor, “Hinge-loss Markov random fields and probabilistic soft logic,” 2015, arXiv:1505.04406.
[107] P. Larrañaga and S. Moral, “Probabilistic graphical models in artificial intelligence,” Appl. Soft Comput., vol. 11, no. 2, pp. 1511–1528, Mar. 2011.
[108] M. Ribera and A. Lapedriza, “Can we do better explanations? A proposal of user-centered explainable AI,” in Proc. IUI Workshops, 2019.
[109] L. Yilmaz and B. Liu, “Model credibility revisited: Concepts and considerations for appropriate trust,” J. Simul., vol. 16, no. 3, pp. 1–14, 2020.
[110] N. Díaz-Rodríguez et al., “Explainable neural-symbolic learning (X-NeSyL) methodology to fuse deep learning representations with expert knowledge graphs: The MonuMAI cultural heritage use case,” Inf. Fusion, vol. 79, pp. 58–83, Mar. 2022.
[111] A. Campagner and F. Cabitza, “Back to the feature: A neural-symbolic perspective on explainable AI,” in Proc. Int. Cross-Domain Conf. Mach. Learn. Knowl. Extraction. Springer, 2020, pp. 39–55.
[112] A. Bennetot, J.-L. Laurent, R. Chatila, and N. Díaz-Rodríguez, “Towards explainable neural-symbolic visual reasoning,” 2019, arXiv:1909.09065.
[113] J.-C. Fernandez, L. Mounier, and C. Pachon, “A model-based approach for robustness testing,” in Proc. IFIP Int. Conf. Test. Communicating Syst. Springer, 2005, pp. 333–348.
[114] A. Cropper, A. Tamaddoni-Nezhad, and S. H. Muggleton, “Meta-interpretive learning of data transformation programs,” in Proc. Int. Conf. Inductive Log. Program. Springer, 2015, pp. 46–59.
[115] A. Cropper, S. Dumančić, and S. H. Muggleton, “Turning 30: New ideas in inductive logic programming,” 2020, arXiv:2002.11002.
[116] H. Watanabe and S. Muggleton, “Can ILP be applied to large datasets?” in Proc. Int. Conf. Inductive Log. Program. Springer, 2009, pp. 249–256.
[117] L. Ai, S. H. Muggleton, C. Hocquette, M. Gromowski, and U. Schmid, “Beneficial and harmful explanatory machine learning,” 2020, arXiv:2009.06410.

Zheng Zhang received the M.S. degree in electrical engineering from the New Jersey Institute of Technology, Newark, NJ, USA, in 2014, and the M.S. degree in computer science from Auburn University, Auburn, AL, USA, in 2018, where he is currently pursuing the Ph.D. degree with the Department of Computer Science and Software Engineering.
His research interests include neuro-symbolic AI, reinforcement learning, computer vision, and artificial intelligence.

Levent Yilmaz (Member, IEEE) received the M.S. and Ph.D. degrees in computer science from Virginia Tech, Blacksburg, VA, USA.
He is currently an Alumni Distinguished Professor of computer science and software engineering with Auburn University, Auburn, AL, USA, with a courtesy appointment in industrial and systems engineering. His research interests include theory and methodology of modeling and simulation, agent-directed simulation, cognitive systems, and model-driven science and engineering for complex adaptive systems.
Dr. Yilmaz is the Founding Organizer and General Chair of the Agent-Directed Simulation Conference Series. He is a former Editor-in-Chief of Simulation: Transactions of the Society for Modeling and Simulation International.

Bo Liu (Senior Member, IEEE) received the Ph.D. degree from the Autonomous Learning Laboratory, University of Massachusetts Amherst, Amherst, MA, USA, in 2015, co-led by Dr. Sridhar Mahadevan and Dr. Andrew Barto.
He is currently an artificial intelligence (AI)/ML Researcher. Previously, he was promoted to Associate Professor (with tenure) with the Computer Science Department, Auburn University (AU), Auburn, AL, USA. His research interests include decision-making under uncertainty, human-aided machine learning, symbolic AI, trustworthiness and interpretability in machine learning, and their applications to BIGDATA.
Dr. Liu is an Editorial Board Member of Machine Learning (MLJ). He was a recipient of the UAI’2015 Facebook Best Student Paper Award, the AAMAS’2022 OptLearnMAS Best Paper Award, the Tencent Faculty Research Award in 2017, and the Amazon Faculty Research Award in 2018, many of which were the first time granted to AU. He is a regular Area Chair/Senior PC of several flagship AI conferences, including UAI/AAAI/IJCAI, and has given several tutorials or plenary talks at various conferences, including AAMAS/ICAPS/UAI. He is an Associate Editor of IEEE Transactions on Neural Networks and Learning Systems (IEEE TNNLS).