
A. Concepts of ILP

We use the following example to introduce the basic concepts of an ILP system:

parent(i, a). parent(a, b). grandparent(X, Y) ← parent(X, Z), parent(Z, Y).

ILP is defined using FOL [31]. In FOL, a formula that contains no logical connectives, such as parent(i, a) or parent(a, b), is called an atom. An atom or its negation, e.g., ¬parent(i, a) or ¬parent(a, b), is called a literal. A clause is a many-way OR (disjunction) of literals, and a definite clause contains exactly one positive literal (formula 3). In the example, a, b, and i are constants; X, Y, and Z are variables; and all constants and variables are terms. A term that contains no free variables is called a ground term, e.g., parent(i, a). A Boolean-valued function P: X → {true, false} is called a predicate (here parent and grandparent) on X; grandparent/2 and parent/2 denote predicates together with their arity, i.e., their number of arguments. Unlike a predicate, a function can take any value, and functions appear only as arguments to predicates. FOL is a structure of logic consisting of constants, variables, predicates, functions, and sentences; it quantifies variables over nonlogical objects and allows sentences that contain variables.

B. Semantics of ILP

There are two different semantics for ILP: the normal (standard) semantics and the nonmonotonic semantics. For the normal semantics, given background (prior) knowledge B and examples E (E = E+ ∧ E−, consisting of positive examples E+ and negative examples E−), find a hypothesis H such that the following conditions hold.

Prior Satisfiability: B ∧ E− ⊭ □.
Posterior Satisfiability: B ∧ H ∧ E− ⊭ □.
Prior Necessity: B ⊭ E+.
Posterior Sufficiency: B ∧ H ⊨ E+.

A general setting is used for the
normal semantics. In most ILP systems, the definite setting will be used as a simple version of the normal
setting, as the background theory and hypotheses are restricted to being definite. The example setting is
a special case of the definite setting, where the examples are restricted to true and false ground facts.
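In this setting, deciding whether B ∧ H covers a ground fact amounts to computing the least Herbrand model of B ∧ H by forward chaining and testing membership. A minimal sketch under our own encoding (atoms as tuples, variables as uppercase strings; none of these helper names come from the text):

```python
def is_var(t):
    # Variables are uppercase strings; constants are lowercase.
    return isinstance(t, str) and t[:1].isupper()

def substitute(atom, theta):
    return (atom[0],) + tuple(theta.get(t, t) for t in atom[1:])

def match(pattern, fact, theta):
    """Extend theta so that pattern matches the ground fact, or return None."""
    if pattern[0] != fact[0] or len(pattern) != len(fact):
        return None
    theta = dict(theta)
    for p, f in zip(pattern[1:], fact[1:]):
        if is_var(p):
            if theta.setdefault(p, f) != f:
                return None
        elif p != f:
            return None
    return theta

def all_matches(body, model, theta):
    # Enumerate substitutions grounding every body atom against the model.
    if not body:
        yield theta
        return
    for fact in model:
        t = match(body[0], fact, theta)
        if t is not None:
            yield from all_matches(body[1:], model, t)

def least_model(facts, rules):
    """Least Herbrand model of ground facts plus definite rules (head, body)."""
    model = set(facts)
    while True:
        new = {substitute(head, theta)
               for head, body in rules
               for theta in all_matches(body, model, {})} - model
        if not new:
            return model
        model |= new

B = {("parent", "i", "a"), ("parent", "a", "b")}
H = [(("grandparent", "X", "Y"), [("parent", "X", "Z"), ("parent", "Z", "Y")])]
print(("grandparent", "i", "b") in least_model(B, H))  # → True
```

The positive example grandparent(i, b) is covered, while a ground atom such as grandparent(a, i) is not.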
Notice that the example setting is equivalent to the normal semantics, where B and H are definite
clauses and E is a set of ground unit

clauses. The example setting is the main setting of ILP. It is employed by the large majority of ILP
systems. Table I shows the grandparent dataset for the ILP system. The task is to learn the grandparent
relation from various facts involving the father-of and mother-of relations. In the nonmonotonic setting
of ILP, the background theory is a set of definite clauses, the evidence is empty, and the hypotheses are
sets of general clauses expressible using the same alphabet as the background theory. The reason that
the evidence is empty is that the positive evidence is considered part of the background theory, and the
negative evidence is derived implicitly by making a kind of closed world assumption (realized by taking
the least Herbrand model [32]). The nonmonotonic semantics realizes induction by deduction. The
induction principle of the nonmonotonic setting states that the hypothesis H, which is, in a sense,
deduced from the set of observed examples E and the background theory B (using a kind of closed world
and closed domain assumption), holds for all possible sets of examples. This produces generalizations
beyond the observations. As a consequence, properties derived in the nonmonotonic setting are more
conservative than those derived in the normal setting.

C. Searching Method

An enumeration algorithm is used to solve the ILP problem. Generalization and specialization form the basis for pruning the search space. Generalization corresponds to induction, and specialization to deduction, implying that induction is viewed here as the inverse of deduction. A generic ILP system can now be defined.

Algorithm 1 Searching Algorithm [33]
  QH := Initialize
  repeat
    Delete H from QH
    Choose the inference rules r1, ..., rk ∈ R to be applied to H
    Apply the rules r1, ..., rk to H to yield H1, ..., Hn
    Add H1, ..., Hn to QH
    Prune QH
  until stop-criterion(QH) satisfied

QH denotes a queue of candidate hypotheses.
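Rendered as a program, the loop is short. The integer "hypotheses," toy rules, pruning bound, and stop criterion below are placeholders of our own, chosen only to make the search executable:

```python
from collections import deque

def ilp_search(initial, choose_rules, apply_rule, prune, stop):
    """Generic ILP search (Algorithm 1): repeatedly expand a queue QH of
    candidate hypotheses with inference rules until a stop criterion holds."""
    QH = deque(initial)
    while not stop(QH):
        H = QH.popleft()                    # Delete H from QH
        for rule in choose_rules(H):        # Choose inference rules r1..rk
            QH.extend(apply_rule(rule, H))  # Apply them to yield H1..Hn
        QH = prune(QH)                      # Prune QH
    return list(QH)

# Toy instantiation: "hypotheses" are integers, the rules add 1 or 2,
# pruning drops values above 5, and we stop once every candidate exceeds 3.
result = ilp_search(
    initial=[0],
    choose_rules=lambda H: [1, 2],
    apply_rule=lambda r, H: [H + r],
    prune=lambda Q: deque(h for h in Q if h <= 5),
    stop=lambda Q: bool(Q) and all(h > 3 for h in Q),
)
print(result)
```

Any concrete ILP system fills in the same four slots: hypothesis representation, rule choice, rule application, and pruning.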


Algorithm 1 works as follows: it keeps track of a queue QH of candidate hypotheses and repeatedly deletes a hypothesis H from the queue, expanding it using inference rules. The expanded hypotheses are then added to QH, which may be pruned to discard unpromising hypotheses from further consideration. This process continues until the stop criterion is satisfied. There are two kinds of search methods for ILP systems. First, "specific-to-general" systems, a.k.a. bottom-up systems, start from the examples and background knowledge and repeatedly generalize their hypothesis by applying inductive inference rules. During the search, they take care that the hypothesis remains satisfiable (i.e., does not imply negative examples). Second, "general-to-specific" systems, a.k.a. top-down systems, start with the most general hypothesis (i.e., the inconsistent clause □) and repeatedly specialize the hypothesis by applying deductive inference rules to remove inconsistencies with the negative examples. During the search, care is taken that the hypotheses remain sufficient with regard to the positive evidence. Table II

shows some related systems for both types.

D. Inductive Inference Rules

Induction can be considered as the inverse of deduction. Given the formula B ∧ H ⊨ E+, deriving E+ from B ∧ H is deduction, and deriving H from B ∧ E+ is induction. Therefore, inductive inference rules can be obtained by inverting deductive ones. Table III summarizes commonly used rules for both deduction and induction. Since this "inverting deduction" paradigm can be studied under various assumptions, corresponding to different assumptions about the deductive rule for ⊨ and the format of background theory B and evidence E+, different models of inductive inference are obtained. Four frameworks of inference rules, θ-subsumption, inverse resolution, inverse implication, and inverse entailment, will be described in this section.

1) θ-Subsumption: The θ-subsumption inductive inference rule is

  θ-subsumption: c2 / c1, where c1θ ⊆ c2.

For example, grandparent(X, Z) ← parent(X, Y), parent(Y, Z) θ-subsumes grandparent(i, b) ← parent(i, a), parent(a, b) with θ = {X = i, Y = a, Z = b}. In the simplest θ-subsumption,
the background knowledge is supposed to be empty, and the deductive inference rule corresponds to θ-
subsumption among single clauses. One extension of θ-subsumption that takes into account background
knowledge is called relative subsumption. Similar to θ-subsumption, one can define relatively reduced clauses using a corresponding notion of relative clause equivalence. Relative
subsumption forms a lattice over relatively reduced clauses. 2) Inverse Resolution: Inductive inference
rules can be viewed as the inverse of deductive rules of inference. Since the deductive rule of resolution
is complete for deduction, an inverse of resolution (IR) should be complete for induction. IR takes into
account background knowledge and aims at inverting the resolution principle. Four main rules of IR are
widely used.
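Like resolution itself, the θ-subsumption test above is mechanical: c1 θ-subsumes c2 iff some substitution θ over c1's variables makes c1θ a subset of c2. A brute-force sketch that tries substitutions over the constants of c2 (the tuple encoding of literals, with a "¬" prefix for negated body atoms, is our own):

```python
from itertools import product

def is_var(t):
    # Variables are uppercase strings; constants are lowercase.
    return isinstance(t, str) and t[:1].isupper()

def apply_theta(lit, theta):
    return (lit[0],) + tuple(theta.get(t, t) for t in lit[1:])

def theta_subsumes(c1, c2):
    """Return a substitution theta with c1*theta ⊆ c2, or None if none exists."""
    vars1 = sorted({t for lit in c1 for t in lit[1:] if is_var(t)})
    consts = {t for lit in c2 for t in lit[1:]}
    for values in product(consts, repeat=len(vars1)):
        theta = dict(zip(vars1, values))
        if all(apply_theta(lit, theta) in c2 for lit in c1):
            return theta
    return None

# grandparent(X,Z) ← parent(X,Y), parent(Y,Z), written as a set of literals
# with the head as-is and each body atom negated:
c1 = {("grandparent", "X", "Z"), ("¬parent", "X", "Y"), ("¬parent", "Y", "Z")}
c2 = {("grandparent", "i", "b"), ("¬parent", "i", "a"), ("¬parent", "a", "b")}
print(theta_subsumes(c1, c2))  # → {'X': 'i', 'Y': 'a', 'Z': 'b'}
```

This recovers exactly the substitution θ = {X = i, Y = a, Z = b} from the grandparent example.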

Absorption: from {q ← A; p ← A, B} infer {q ← A; p ← q, B}.
Identification: from {p ← A, B; p ← A, q} infer {q ← B; p ← A, q}.
Intra-Construction: from {p ← A, B; p ← A, C} infer {q ← B; p ← A, q; q ← C}.
Inter-Construction: from {p ← A, B; q ← A, C} infer {p ← r, B; r ← A; q ← r, C}.

In these rules, lower-case letters are atoms, and upper-case letters are conjunctions of
atoms. Both absorption and identification invert a single resolution step. The rules of inter- and intra-
construction introduce “predicate invention,” which leads to reducing the hypothesis space and the
length of clauses. 3) Inverse Implication: Since the deductive inference rule is incomplete regarding
implication among clauses, extensions of inductive inference under θ-subsumption have been studied
under the header “inverting implication.” The inability to invert implication between clauses limits the
completeness of IR and RLGGs, since θ-subsumption is used in place of clause implication in both. The
difference between θ-subsumption and implication between clauses C and D is only pertinent when C
can self-resolve. Attempts were made to do the following: 1) extend IR and 2) use a mixture of IR and least general generalization (LGG) [37] to solve the problem. The extended IR method suffers from problems of nondeterminacy. Due
to the nondeterminacy problem, the development of algorithms regarding inverse implication in ILP is
limited. Idestam-Almquist’s use of LGG suffers from the standard problem of intractably large clauses.
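For intuition about LGG: the least general generalization of two atoms is their anti-unification, replacing each disagreeing pair of terms with a variable and reusing the same variable for the same pair. A minimal sketch for function-free atoms (the helper name and variable-naming scheme are ours):

```python
def lgg_atoms(a1, a2, table=None):
    """Least general generalization of two atoms with the same predicate.
    Each distinct pair of disagreeing terms gets one shared fresh variable."""
    if a1[0] != a2[0] or len(a1) != len(a2):
        return None                      # no common generalization
    table = {} if table is None else table
    out = [a1[0]]
    for t1, t2 in zip(a1[1:], a2[1:]):
        if t1 == t2:
            out.append(t1)
        else:
            # Reuse the same variable for a repeated pair of terms.
            out.append(table.setdefault((t1, t2), f"V{len(table)}"))
    return tuple(out)

print(lgg_atoms(("parent", "i", "a"), ("parent", "a", "b")))
# → ('parent', 'V0', 'V1')
```

Generalizing whole clauses pairs every literal of one clause with every compatible literal of the other, which is the source of the intractably large clauses noted above.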
Both approaches are incomplete for inverting implication, though Idestam-Almquist’s technique is
complete for a restricted form of entailment called T-implication [38]. 4) Inverse Entailment: The general problem specification of ILP is: given background knowledge B and examples E, find the simplest consistent hypothesis H such that B ∧ H ⊨ E. In general, B, H, and E could be arbitrary logic programs. Each clause in the simplest H should explain at least one example, since otherwise there is a simpler H′ that will do. Then, consider the case of H and E each being single Horn clauses. By rearranging the entailment relationship B ∧ H ⊨ E, Muggleton [35] proposed the "inverse entailment" B ∧ ¬E ⊨ ¬H. In particular, ¬⊥ is the (potentially infinite) conjunction of ground literals that are true in all models of B ∧ ¬E. Since ¬H must be true in every model of B ∧ ¬E, it must contain a subset of the ground literals in ¬⊥. Therefore,

  B ∧ ¬E ⊨ ¬⊥ ⊨ ¬H, and hence H ⊨ ⊥.

A subset of the solutions for H can be found by considering the clauses which θ-subsume ⊥. The complete set of candidates for H can be found by considering all
clauses which θ-subsume sub-saturants of ⊥.

E. ILP Systems

A large number of logic learning systems have been developed based on the inference rules mentioned before. Fig. 3 shows the timeline of the development of logic learning systems, including Prolog [29], MIS [19], CLINT [34], Foil [28], LINUS [36], Golem [22], Aleph [27], Progol [35], Cigol [23], Metagol [30], ProbLog [39], DeepProbLog [40], ∂ILP [11], and Popper [41]. Note that there are no commonly used inverse implication systems, because inverse implication suffers from problems of nondeterminacy.

III. VARIANTS OF ILP

The traditional ILP frameworks discussed in Section II have three main limitations: most ILP systems 1) cannot address noisy data; 2) cannot deal with predicate invention directly and recursion effectively; and 3) cannot learn a hypothesis H efficiently, owing to the massive hypothesis space. Variants of ILP systems have been developed to solve the problems mentioned above. PILP [10] is a powerful tool when dealing explicitly with uncertainty, MIL [9] has merits for predicate invention and recursive generalization, and ∂ILP [11] speeds up the learning process while remaining robust to noise and error. We
will introduce each of them in this section.

A. Probabilistic Inductive Logic Programming

PILP is a
machine learning technique based on probabilistic logic programming. It addresses one of the central
questions of AI by integrating probabilistic reasoning, machine learning, and first-order relational
logic representations [10]. Dealing explicitly with uncertainty makes PILP more powerful than ILP and, in
turn, than traditional attribute-value approaches [10]. It also provides better predictive accuracy and
understanding of domains and has become a growth path in the machine learning community. The
terms used in PILP are close to those in ILP with small differences: since negative examples conflict with
the usual view on learning examples in statistical learning (the probability of a failure is zero), the

definition of the PILP problem uses observed and unobserved examples instead of positive and negative ones.

PILP Problem: Given a set E = Ep ∪ Ei of observed and unobserved examples Ep and Ei (with Ep ∩ Ei = ∅) over some example language LE, a probabilistic covers relation covers(e, H, B) = P(e | H, B), a probabilistic logical language LH for hypotheses, and a background theory B, find a hypothesis H∗ in LH such that H∗ = argmax_H score(E, H, B) and the following constraints hold:

  ∀ep ∈ Ep: covers(ep, H∗, B) > 0 and ∀ei ∈ Ei: covers(ei, H∗, B) = 0.

The score is some objective function, usually involving the probabilistic covers relation of the observed examples, such as the observed likelihood ∏_{ep ∈ Ep} covers(ep, H∗, B) [10]. We denote H = (L, λ), where L represents all FOL

program rules in H, and λ indicates probabilistic parameters. Two subtasks should be considered when
solving PILP learning problems: 1) parameter estimation, where it is assumed that the underlying logic
program L is fixed, and the learning task consists of estimating the parameters λ that maximize the
likelihood and 2) structure learning where both L and λ have to be learned from the data. Similar to the
ILP learning problem, the language L E is selected for representing the examples, and the probabilistic
covers relation determines different learning settings. In ILP, this leads to learning from interpretations
[42], from proofs [10], and from entailment [43]. Therefore, it should be no surprise that this very same
distinction also applies to probabilistic knowledge representation formalisms. There are three PILP
settings as well: probabilistic learning from interpretations, from entailment, and from proofs [10]. The main
idea is to lift ILP settings by associating probabilistic information with clauses and interpretations and by
replacing ILP’s deterministic covers relation with a probabilistic one. The large majority of PILP
techniques proposed so far fall into the learning from interpretations setting, including parameter
estimation of probabilistic logic programs [44], learning of probabilistic relational models (PRMs) [45],
parameter estimation of relational Markov models [46], learning of object-oriented Bayesian networks
[47], learning relational dependency networks (RDNs) [48], and learning logic programs with annotated
disjunctions (LPAD) [49]. To define probabilities on proofs, ICL [50], Prism [51], and stochastic logic
programs (SLPs) [52] attach probabilities to facts (respectively, clauses) and treat them as stochastic
choices within resolution. PILP techniques that learn from proofs have been developed, including hidden
Markov model induction by Bayesian model merging [53], relational Markov models [54], and logical
hidden Markov models [55]. The learning from interpretations setting has been investigated for learning
SLPs [56], [57] and for parameter estimation of Prism programs [58], [59] from observed examples.
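The probabilistic covers relation underlying these settings can be made concrete with the distribution semantics: independent probabilistic facts define a distribution over worlds, and covers(e, H, B) sums the probability of the worlds whose least model entails e. A toy sketch (the fact probabilities and helper names are our own):

```python
from itertools import product

# Each ground fact holds independently with its attached probability.
prob_facts = {("parent", "i", "a"): 0.9, ("parent", "a", "b"): 0.8}

def entails(world, example):
    # Forward chaining with the single rule
    # grandparent(X, Y) ← parent(X, Z), parent(Z, Y).
    derived = set(world)
    for f1 in world:
        for f2 in world:
            if f1[0] == "parent" and f2[0] == "parent" and f1[2] == f2[1]:
                derived.add(("grandparent", f1[1], f2[2]))
    return example in derived

def covers(example):
    """P(example | H, B): total probability of the worlds that entail it."""
    facts = list(prob_facts)
    total = 0.0
    for choice in product([True, False], repeat=len(facts)):
        world = {f for f, on in zip(facts, choice) if on}
        p = 1.0
        for f, on in zip(facts, choice):
            p *= prob_facts[f] if on else 1 - prob_facts[f]
        if entails(world, example):
            total += p
    return total

print(round(covers(("grandparent", "i", "b")), 4))  # → 0.72, i.e., 0.9 * 0.8
```

Parameter estimation then adjusts the fact probabilities to maximize the likelihood of the observed examples; real systems avoid this exponential world enumeration with knowledge compilation or sampling.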
Many algorithms have been developed within the PILP framework: Muggleton describes a method, based on an approximate Bayes maximum a posteriori (MAP) algorithm, for learning SLPs from examples and background knowledge [57], which is considered one of the earliest structure
learning methods for PILP; De Raedt and Thon [60] upgrade rule learning to a probabilistic setting, in
which both the examples themselves and their classification can be probabilistic. To solve the "large groundings" problem, Wang et al. [61] present a first-order probabilistic language, which is well suited
to approximate “local” grounding. The algorithm “Structure LearnIng of ProbabilistiC logic progrAmS
with Em over bdds” (SLIPCASE) performs a beam search in the space of the language of LPAD using the
log-likelihood of the data as the guiding heuristics [62]; Fierens et al. [63] investigate how classical
inference and learning tasks known from the graphical model community can be tackled for probabilistic
logic programs; and the algorithm “Structure LearnIng of Probabilistic logic programs by searching OVER
the clause space” (SLIPCOVER) performs a beam search in the space of probabilistic clauses and a greedy
search in the space of theories, using the log-likelihood of the data as the guiding heuristics [64]. B.
Meta-Interpretive Learning MIL is a framework developed by Cropper and Muggleton [65], which uses
higher-order metarules to support predicate invention and learning of recursive definitions. Metarules,
second-order Horn clauses, are widely discussed [65], [66], [67], [68] as a form of declarative bias.
Metarules define the structure of learnable programs, which, in turn, defines the hypothesis space. For
instance, to learn the grandparent/2 relation given the parent/2 relation, the chain metarule¹ would be suitable:

  P(A, B) ← Q(A, C), R(C, B).

In this metarule, the letters P, Q, and R denote existentially quantified second-order variables (variables that can be bound to predicate symbols), and the letters A, B, and C denote universally quantified first-order variables (variables that can be bound to constant symbols).

¹Commonly used metarules [30]: 1) identity: P(A, B) ← Q(A, B); 2) inverse: P(A, B) ← Q(B, A); 3) precon: P(A, B) ← Q(A), R(A, B); 4) postcon: P(A, B) ← Q(A, B), R(B); 5) chain: P(A, B) ← Q(A, C), R(C, B); and 6) recursive: P(A, B) ← Q(A, C), P(C, B).
Given the chain metarule, the background parent/2 relation, and examples of the grandparent/2
relation, the learner will try to find the correct substitutions for predicates, and one of the correct
solutions would be {P/grandparent, Q/parent, R/parent} and the substitution result is grandparent(A, B)
← parent(A,C), parent(C, B). We will discuss the MIL problem after the description of metarules. Before
that, we define what an MIL input is: an MIL input is a tuple (B, E+, E−, M), where B is a set of Horn clauses denoting background knowledge, E+ and E− are disjoint sets of ground atoms representing positive and negative examples, respectively, and M is a set of metarules. An MIL problem [69] can be defined from an MIL input. Given an MIL input (B, E+, E−, M), the MIL problem is to return a logic program hypothesis H such that the following hold: 1) ∀c ∈ H, ∃m ∈ M such that c = mθ, where θ is a substitution that grounds all the existentially quantified variables in m; 2) H ∪ B ⊨ E+; and 3) H ∪ B ⊭ E−. H


can be considered as a solution to the MIL problem. Based on the equation c = mθ, MIL focuses on
searching for θ instead of H through abductive reasoning in second-order logic. As in the grandparent task shown before, MIL can be considered an FOL rule-searching problem based on metarules. Since any first-order predicate can substitute for a second-order variable, the search for a logic program hypothesis H is more flexible than that of ILP: if second-order variables are replaced by predicates that do not exist in B, new predicates are invented (predicate invention); if they are replaced by the same predicate in both the head and body of a metarule, a recursive definition is learned. As we
discussed before, the metarules determine the structure of permissible rules, which, in turn, defines the
hypothesis space. Deciding which metarules to use for a given learning task is a major open problem and
is a trade-off between efficiency and expressivity: we wish to use fewer metarules as the hypothesis
space grows given more metarules, but if we use too few metarules, we will lose expressivity. Also, the
hypothesis space of MIL highly depends on the metarules we choose. For example, the identity metarule
P(A, B) ← Q(A, B) cannot be learned from the chain metarule P(A, B) ← Q(A,C), R(C, B). Cropper and
Muggleton [65] demonstrate that irreducible or minimal sets of metarules can be found automatically
by applying Plotkin’s clausal theory reduction algorithm. When this approach is applied to a set of
metarules consisting of an enumeration of all metarules in a given finite hypothesis language, they show

that, in some cases, as few as two metarules are complete and sufficient for generating all hypotheses. Nevertheless, for an expected hypothesis H∗, the learned model H′ is only semantically equivalent to the actual model H∗, not literally identical (H′ usually contains more hypotheses than H∗), so reducing
metarules does not necessarily improve the learning efficiency of MIL. Advances have been made to
increase the performance of MIL recently. A new reduction technique, called derivation reduction [69],
has been introduced to find a finite subset of a Horn theory from which the whole theory can be derived
using SLD resolution. Recent work [67] also shows that adding types to MIL can improve learning
performance, and type checking can reduce the MIL hypothesis space by a cubic factor, which can
substantially reduce learning times. Extended MIL [66] has also been used to support learning higher-
order programs by allowing for higher-order definitions to be used as background knowledge. Recently,
Popper [41] supports types, learning optimal solutions, learning recursive programs, reasoning about
lists and infinite domains, and hypothesis constraints by combining ASP and Prolog.

C. Differentiable Inductive Logic Programming

∂ILP, which combines intuitive perceptual reasoning with conceptually interpretable reasoning, has several advantages: it is robust to noise and error, it is data-efficient, and it produces interpretable rules [11]. ∂ILP implements differentiable deduction over continuous values. The gradient of the loss with respect to the rule weights, which is used to minimize the classification loss, implements a continuous form of induction [11]. The sets P (positive examples) and N (negative examples) are used to form a new set Λ = {(γ, 1) | γ ∈ P} ∪ {(γ, 0) | γ ∈ N}. Each pair (γ, λ) indicates whether the atom γ is in P (λ = 1) or N (λ = 0). A differentiable model is constructed to implement the conditional probability of λ for a ground atom α, p(λ | α, W, Π, L, B), where W is a set of clause weights, Π is a program template, L is a language frame, and B is a set of background assumptions. The goal is to match the predicted label p(λ | α, W, Π, L, B) with the actual label λ in the pair (γ, λ). The expected negative log-likelihood is minimized:

  loss = −E_{(γ,λ)∼Λ}[λ · log p(λ | α, W, Π, L, B) + (1 − λ) · log(1 − p(λ | α, W, Π, L, B))].

The conditional probability p(λ | α, W, Π, L, B) is calculated by four functions, f_extract, f_infer, f_convert, and f_generate:

  p(λ | α, W, Π, L, B) = f_extract(f_infer(f_convert(B), f_generate(Π, L), W, T), α).

Here, f_extract extracts the value of an atom, f_convert converts the elements of B to 1 and all other atoms to 0, and f_generate generates a set of clauses from Π and L. f_infer performs T steps of forward-chaining inference, where T is part of Π. To show how inference is performed over multiple time steps, each clause c is translated into a function.
