A. Concepts of ILP

We use the following example to introduce the basic concepts of ILP:
parent(i, a)                                         (1)
parent(a, b)                                         (2)
grandparent(X, Y) ← parent(X, Z), parent(Z, Y).      (3)

ILP is defined using FOL [31]. In FOL, a formula that contains no logical connectives, such as parent(i, a) or parent(a, b), is called an atom. An atom or its negation, e.g., ¬parent(i, a) or ¬parent(a, b), is called a literal. A definite clause is a many-way OR (disjunction) of literals with exactly one positive literal, such as formula (3). In the example, a, b, and i are constants; X, Y, and Z are variables; and all constants and variables are terms. A term that contains no free variables, as in parent(i, a), is called a ground term. A Boolean-valued function P: X → {true, false}, such as parent or grandparent, is called a predicate on X; grandparent/2 and parent/2 denote predicates together with their arity, i.e., the number of arguments. Unlike a predicate, a function can take any value, and functions never appear except as arguments of predicates. FOL is a structure of logic consisting of constants, variables, predicates, functions, and sentences. It uses quantified variables over nonlogical objects and allows the use of
sentences that contain variables.

B. Semantics of ILP

There are two different semantics for ILP: the standard (normal) semantics and the nonmonotonic semantics. For the normal semantics, given background (prior) knowledge B and examples E (E = E+ ∧ E−, consisting of positive examples E+ and negative examples E−), find a hypothesis H such that the following conditions hold.
Prior Satisfiability: B ∧ E− ⊭ □.
Posterior Satisfiability: B ∧ H ∧ E− ⊭ □.
Prior Necessity: B ⊭ E+.
Posterior Sufficiency: B ∧ H ⊨ E+.
A general setting is used for the
normal semantics. In most ILP systems, the definite setting is used as a simplified version of the normal setting, as the background theory and hypotheses are restricted to being definite. The example setting is a special case of the definite setting, where the examples are restricted to true and false ground facts.
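As a minimal sketch of this example setting, the following Python fragment (facts and examples are made up, mirroring the running grandparent example) checks a candidate definite clause against ground positive and negative examples:

```python
# Hypothetical sketch of the example setting: background knowledge and
# examples are ground facts, and a candidate definite clause is tested
# by checking which ground atoms it derives.

# Background knowledge B: ground parent/2 facts.
parent = {("i", "a"), ("a", "b"), ("a", "c")}

# Candidate hypothesis H: grandparent(X, Y) <- parent(X, Z), parent(Z, Y).
def derive_grandparents(parent_facts):
    return {(x, y) for (x, z1) in parent_facts
                   for (z2, y) in parent_facts if z1 == z2}

derived = derive_grandparents(parent)

# Posterior sufficiency: B AND H must entail every positive example.
e_pos = {("i", "b"), ("i", "c")}
# Posterior satisfiability: B AND H must not entail any negative example.
e_neg = {("a", "a"), ("b", "i")}

assert e_pos <= derived
assert not (e_neg & derived)
```

The two assertions correspond directly to the posterior sufficiency and posterior satisfiability conditions of the normal semantics, restricted to ground facts.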
Notice that the example setting is equivalent to the normal semantics, where B and H are definite
clauses and E is a set of ground unit
clauses. The example setting is the main setting of ILP. It is employed by the large majority of ILP
systems. Table I shows the grandparent dataset for the ILP system. The task is to learn the grandparent
relation from various facts involving the father-of and mother-of relations. In the nonmonotonic setting
of ILP, the background theory is a set of definite clauses, the evidence is empty, and the hypotheses are
sets of general clauses expressible using the same alphabet as the background theory. The reason that
the evidence is empty is that the positive evidence is considered part of the background theory, and the
negative evidence is derived implicitly by making a kind of closed world assumption (realized by taking
the least Herbrand model [32]). The nonmonotonic semantics realizes induction by deduction. The
induction principle of the nonmonotonic setting states that the hypothesis H, which is, in a sense,
deduced from the set of observed examples E and the background theory B (using a kind of closed world
and closed domain assumption), holds for all possible sets of examples. This produces generalizations
beyond the observations. As a consequence, properties derived in the nonmonotonic setting are more
conservative than those derived in the normal setting.

C. Searching Method

An enumeration algorithm is used to solve the ILP problem. Generalization and specialization form the basis for pruning the search space. Generalization corresponds to induction and specialization to deduction, implying that induction is viewed here as the inverse of deduction. A generic ILP system can now be defined.
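As a sketch, the generic queue-based enumeration search can be written in Python; `rules`, `prune`, and `is_solution` are hypothetical placeholders standing in for the system-specific inference rules, pruning strategy, and stop criterion:

```python
from collections import deque

# Minimal sketch of a generic ILP enumeration search: keep a queue of
# candidate hypotheses, repeatedly expand one with inference rules,
# prune, and stop when a criterion is met. All names are illustrative.
def ilp_search(initial, rules, is_solution, prune, max_steps=1000):
    queue = deque([initial])              # QH := Initialize
    for _ in range(max_steps):            # repeat ...
        if not queue:
            return None
        h = queue.popleft()               # delete H from QH
        if is_solution(h):                # stop criterion
            return h
        expanded = [r(h) for r in rules]  # apply r1..rk to H
        queue.extend(prune(expanded))     # add H1..Hn to QH, then prune
    return None

# Toy usage: "hypotheses" are integers, each rule specializes by adding
# a literal (modeled as +1/+2), and any hypothesis of size >= 3 suffices.
found = ilp_search(0,
                   rules=[lambda h: h + 1, lambda h: h + 2],
                   is_solution=lambda h: h >= 3,
                   prune=lambda hs: [h for h in hs if h <= 5])
```

The toy run performs a breadth-first traversal and returns the first hypothesis satisfying the stop criterion.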
Algorithm 1 Searching Algorithm [33]
QH := Initialize
repeat
    Delete H from QH
    Choose the inference rules r1, . . . , rk ∈ R to be applied to H
    Apply the rules r1, . . . , rk to H to yield H1, . . . , Hn
    Add H1, . . . , Hn to QH
    Prune QH
until the stop criterion is satisfied

The algorithm keeps a queue of candidate hypotheses QH. It repeatedly deletes a hypothesis H from the queue and expands it into new hypotheses using inference rules. The expanded hypotheses are then added to the queue of hypotheses QH, which may be pruned to discard unpromising hypotheses from further consideration. This process continues until the stop criterion is satisfied. There are two kinds of search
methods for ILP systems. First, “specific-to-general” systems, a.k.a. bottom-up systems, start from the
examples and background knowledge and repeatedly generalize their hypothesis by applying inductive
inference rules. During the search, they take care that the hypothesis remains satisfiable (i.e., does not
imply negative examples). Second, “general-to-specific” systems, a.k.a. top-down systems, start with the
most general hypothesis (i.e., the inconsistent clause □) and repeatedly specialize the hypothesis by
applying deductive inference rules to remove inconsistencies with the negative examples. During the
search, care is taken that the hypotheses remain sufficient with regard to the positive evidence. Table II shows some related systems of both types.

D. Inductive Inference Rules

Induction can be considered as the inverse of deduction. Given the formulas B ∧ H ⊨ E+, deriving E+ from B ∧ H is deduction, and deriving H from B ∧ E+ is induction. Therefore, inductive inference rules can be obtained by inverting
deductive ones. Table III summarizes commonly used rules for both deduction and induction. Since this “inverting deduction” paradigm can be studied under various assumptions, corresponding to different assumptions about the deductive rule for ⊨ and the format of background theory B and evidence E+, different models of inductive inference are obtained. Four frameworks of inference rules, θ-subsumption, inverse resolution, inverse implication, and inverse entailment, will be described in this section. The problem specification of ILP is: given background knowledge B and examples E, find the simplest consistent hypothesis H such that B ∧ H ⊨ E. In general, B, H, and E could be arbitrary logic programs. Each clause in the simplest H should explain at least one example, since otherwise there is a simpler H′ that will do. ¬⊥ is the (potentially infinite) conjunction of ground literals that are true in all models of B ∧ ¬E. Since ¬H must be true in every model of B ∧ ¬E, it must contain a subset of the ground literals in ¬⊥. Therefore, B ∧ ¬E ⊨ ¬⊥ ⊨ ¬H, and hence H ⊨ ⊥. A subset of the solutions for H can be found by considering the clauses that θ-subsume ⊥. The complete set of candidates for H can be found by considering all clauses that θ-subsume sub-saturants of ⊥.

E. ILP Systems

A large number of logic learning systems
have been developed based on the inference rules we mentioned before. Fig. 3 shows the timeline of
the development of logic learning systems, including Prolog [29], MIS [19], CLINT [34], Foil [28], LINUS
[36], Golem [22], Aleph [27], Progol [35], Cigol [23], Metagol [30], ProbLog [39], DeepProbLog [40], ∂ILP
[11], and Popper [41]. Note that there are no commonly used inverse implication systems, because inverse implication suffers from nondeterminacy problems.

III. VARIANTS OF ILP

The traditional ILP frameworks discussed in Section II have three main limitations: most ILP systems 1) cannot address noisy data; 2) cannot deal directly with predicate invention or effectively with recursion; and 3) cannot learn a hypothesis H efficiently due to the massive hypothesis space. Variants of ILP systems have been developed to address these problems. PILP [10] is a powerful tool for dealing explicitly with uncertainty, MIL [9] has merits in predicate invention and recursive generalization, and ∂ILP [11] can speed up the learning process and is also robust to noise and error. We will introduce each of them in this section.

A. Probabilistic Inductive Logic Programming

PILP is a
machine learning technique based on probabilistic logic programming. It addresses one of the central
questions of AI by integrating probabilistic reasoning, machine learning, and first-order relational logic representations [10]. Dealing explicitly with uncertainty makes PILP more powerful than ILP and, in turn, than traditional attribute-value approaches [10]. It also provides better predictive accuracy and understanding of domains and has become a growing direction in the machine learning community. The terms used in PILP are close to those in ILP, with small differences: since negative examples conflict with the usual view of learning examples in statistical learning (the probability of a failure is zero), the definition of the PILP problem uses observed and unobserved examples instead of positive and negative ones.
PILP Problem: Given a set E = Ep ∪ Ei of observed and unobserved examples Ep and Ei (with Ep ∩ Ei = ∅), a probabilistic covers relation covers(e, H, B) = P(e | H, B), a probabilistic logical language LH for hypotheses, and a background theory B, find a hypothesis H* in LH such that H* = argmax_H score(E, H, B) and the following constraints hold: ∀ep ∈ Ep: covers(ep, H*, B) > 0 and ∀ei ∈ Ei: covers(ei, H*, B) = 0. The score is some objective function, usually involving the probabilistic covers relation of the observed examples, such as the observed likelihood ∏_{ep∈Ep} covers(ep, H*, B) [10]. We denote H = (L, λ), where L represents all FOL
program rules in H, and λ indicates probabilistic parameters. Two subtasks should be considered when
solving PILP learning problems: 1) parameter estimation, where it is assumed that the underlying logic
program L is fixed, and the learning task consists of estimating the parameters λ that maximize the
likelihood; and 2) structure learning, where both L and λ have to be learned from the data. Similar to the
ILP learning problem, the language L E is selected for representing the examples, and the probabilistic
covers relation determines different learning settings. In ILP, this leads to learning from interpretations
[42], from proofs [10], and from entailment [43]. Therefore, it should be no surprise that this very same
distinction also applies to probabilistic knowledge representation formalisms. There are three PILP
settings as well: probabilistic learning from interpretations, from entailment, and from proofs [10]. The main
idea is to lift ILP settings by associating probabilistic information with clauses and interpretations and by
replacing ILP’s deterministic covers relation with a probabilistic one. The large majority of PILP
techniques proposed so far fall into the learning from interpretations setting, including parameter
estimation of probabilistic logic programs [44], learning of probabilistic relational models (PRMs) [45],
parameter estimation of relational Markov models [46], learning of object-oriented Bayesian networks
[47], learning relational dependency networks (RDNs) [48], and learning logic programs with annotated
disjunctions (LPAD) [49]. To define probabilities on proofs, ICL [50], Prism [51], and stochastic logic
programs (SLPs) [52] attach probabilities to facts (respectively, clauses) and treat them as stochastic
choices within resolution. PILP techniques that learn from proofs have been developed, including hidden
Markov model induction by Bayesian model merging [53], relational Markov models [54], and logical
hidden Markov models [55]. The learning from entailment setting has been investigated for learning SLPs [56], [57] and for parameter estimation of Prism programs [58], [59] from observed examples.
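As an illustration of the probabilistic covers relation and the PILP score, the following sketch (Python, with a fabricated covers table; none of these names come from a real PILP system) computes the observed log-likelihood and checks the two coverage constraints of the PILP problem:

```python
import math

# Hypothetical sketch of the PILP scoring step: given a probabilistic
# covers relation covers(e, H, B) = P(e | H, B), score a hypothesis by
# the log-likelihood of the observed examples and check that observed
# examples are covered with probability > 0 while unobserved examples
# have probability 0. The covers table below is made up.
covers = {"e1": 0.8, "e2": 0.5, "e3": 0.0}  # P(e | H, B) per example
observed = ["e1", "e2"]                      # Ep
unobserved = ["e3"]                          # Ei

def log_likelihood(examples, covers):
    # log of the observed likelihood: sum of log P(e | H, B)
    return sum(math.log(covers[e]) for e in examples)

ok = (all(covers[e] > 0 for e in observed) and
      all(covers[e] == 0 for e in unobserved))
score = log_likelihood(observed, covers)     # log(0.8) + log(0.5)
```

In structure learning, candidate hypotheses would be compared by this score, subject to the two constraints holding.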
Many algorithms have been developed under the PILP framework. Muggleton describes a method, based on an approximate Bayes maximum a posteriori probability (MAP) algorithm, for learning SLPs from examples and background knowledge [57], which is considered one of the earliest structure learning methods for PILP. De Raedt and Thon [60] upgrade rule learning to a probabilistic setting, in which both the examples themselves and their classifications can be probabilistic. To solve the “large groundings” problem, Wang et al. [61] present a first-order probabilistic language that is well suited to approximate “local” grounding. The algorithm “Structure LearnIng of ProbabilistiC logic progrAmS with Em over bdds” (SLIPCASE) performs a beam search in the space of the language of LPADs, using the log-likelihood of the data as the guiding heuristic [62]. Fierens et al. [63] investigate how classical inference and learning tasks known from the graphical model community can be tackled for probabilistic logic programs. The algorithm “Structure LearnIng of Probabilistic logic programs by searching OVER the clause space” (SLIPCOVER) performs a beam search in the space of probabilistic clauses and a greedy search in the space of theories, using the log-likelihood of the data as the guiding heuristic [64].

B. Meta-Interpretive Learning

MIL is a framework developed by Cropper and Muggleton [65], which uses
higher-order metarules to support predicate invention and learning of recursive definitions. Metarules,
second-order Horn clauses, are widely discussed [65], [66], [67], [68] as a form of declarative bias.
Metarules define the structure of learnable programs, which, in turn, defines the hypothesis space. For
instance, to learn the grandparent/2 relation given the parent/2 relation, the chain metarule1 would be
suitable: P(A, B) ← Q(A, C), R(C, B). In this metarule, the letters P, Q, and R denote existentially quantified second-order variables (variables that can be bound to predicate symbols), and the letters A, B, and C denote universally quantified first-order variables (variables that can be bound to constant symbols).
1 Commonly used metarules [30]: 1) identity: P(A, B) ← Q(A, B); 2) inverse: P(A, B) ← Q(B, A); 3) precon: P(A, B) ← Q(A), R(A, B); 4) postcon: P(A, B) ← Q(A, B), R(B); 5) chain: P(A, B) ← Q(A, C), R(C, B); and 6) recursive: P(A, B) ← Q(A, C), P(C, B).
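The substitution-based use of a metarule can be sketched as follows (a hypothetical Python representation, not the machinery of any actual MIL system): the second-order variables of the chain metarule are bound to background predicate symbols, and each resulting clause is tested against the examples.

```python
from itertools import product

# Hypothetical sketch: instantiate the chain metarule
# P(A, B) <- Q(A, C), R(C, B) by substituting background predicate
# symbols for the second-order variables Q and R, keeping the
# substitutions whose derived atoms cover the positive examples.
background = {"parent": {("i", "a"), ("a", "b")}}
e_pos = {("i", "b")}  # examples of the target predicate

def chain_instances(background, e_pos):
    solutions = []
    for q, r in product(background, repeat=2):       # bind Q and R
        derived = {(a, b) for (a, c1) in background[q]
                          for (c2, b) in background[r] if c1 == c2}
        if e_pos <= derived:                          # covers positives
            solutions.append((q, r))
    return solutions

# With only parent/2 available, the single solution binds both Q and R
# to parent, yielding grandparent(A, B) <- parent(A, C), parent(C, B).
subs = chain_instances(background, e_pos)
```

This brute-force enumeration is only illustrative; real MIL systems interleave metarule instantiation with proof search rather than materializing all substitutions.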
Given the chain metarule, the background parent/2 relation, and examples of the grandparent/2
relation, the learner will try to find the correct substitutions for predicates, and one of the correct
solutions would be {P/grandparent, Q/parent, R/parent} and the substitution result is grandparent(A, B)
← parent(A, C), parent(C, B). We first define the MIL input and then the MIL problem. An MIL input is a tuple (B, E+, E−, M), where B is a set of Horn clauses denoting background knowledge, E+ and E− are disjoint sets of ground atoms representing positive and negative examples, respectively, and M is a set of metarules. An MIL problem [69] can be defined from an MIL input: given an MIL input (B, E+, E−, M), the MIL problem is to return a logic program hypothesis H such that the following hold: 1) ∀c ∈ H, ∃m ∈ M, such that c = mθ, where θ is a substitution that substitutes the existentially quantified second-order variables in m with predicate symbols; and 2) H is consistent with the examples, i.e., B ∧ H ⊨ E+ and B ∧ H ⊭ e for every e ∈ E−. Work on reducing metarule sets shows that, in some cases, as few as two metarules are complete and sufficient for generating all hypotheses. Nevertheless, for an expected hypothesis H*, the learned model H′ is only semantically equivalent to the actual model H*, not literally identical (H′ usually contains more hypotheses than H*), so reducing metarules does not necessarily improve the learning efficiency of MIL. Advances have been made to
increase the performance of MIL recently. A new reduction technique, called derivation reduction [69],
has been introduced to find a finite subset of a Horn theory from which the whole theory can be derived
using SLD resolution. Recent work [67] also shows that adding types to MIL can improve learning
performance, and type checking can reduce the MIL hypothesis space by a cubic factor, which can
substantially reduce learning times. Extended MIL [66] has also been used to support learning higher-
order programs by allowing for higher-order definitions to be used as background knowledge. Recently,
Popper [41] supports types, learning optimal solutions, learning recursive programs, reasoning about
lists and infinite domains, and hypothesis constraints by combining ASP and Prolog.

C. Differentiable Inductive Logic Programming

∂ILP, which combines intuitive perceptual reasoning with conceptually interpretable reasoning, has several advantages: it is robust to noise and error, it is data-efficient, and it produces interpretable rules [11]. ∂ILP implements differentiable deduction over continuous values. The gradient of the loss with respect to the rule weights, which we use to minimize the classification loss, implements a continuous form of induction [11]. The sets P (positive examples) and N (negative examples) are used to form a new set Λ = {(γ, 1) | γ ∈ P} ∪ {(γ, 0) | γ ∈ N}. Each pair (γ, λ) indicates whether the