
Inductive Logic Programming via Differentiable Deep Neural Logic Networks

Ali Payani and Faramarz Fekri
Electrical and Computer Engineering, Georgia Institute of Technology
[email protected], [email protected]
arXiv:1906.03523v1 [cs.AI] 8 Jun 2019

Abstract

We propose a novel paradigm for solving Inductive Logic Programming (ILP)
problems via deep recurrent neural networks. The proposed ILP solver is designed
based on a differentiable implementation of deduction via forward chaining. In
contrast to the majority of past methods, instead of searching through the space of
possible first-order logic rules by using some restrictive rule templates, we directly
learn the symbolic logical predicate rules by introducing a novel differentiable
Neural Logic (dNL) network. The proposed dNL network is able to learn and
represent Boolean functions efficiently and in an explicit manner. We show that
the proposed dNL-ILP solver supports desirable features such as recursion and
predicate invention. Further, we investigate the performance of the proposed ILP
solver in classification tasks involving benchmark relational datasets. In particular,
we show that our proposed method outperforms the state of the art ILP solvers in
classification tasks for Mutagenesis, Cora and IMDB datasets.

1 Introduction
Despite the tremendous success of the deep neural networks, they are still prone to some limitations.
These systems, in general, do not construct any explicit and symbolic representation of the algorithm
they learn. In particular, the learned algorithm is implicitly stored in thousands or even millions of
weights, which is typically impossible for human agents to decipher or verify. Further, MLP networks
are suitable only when a large number of training examples is available. Otherwise, they usually do not generalize
well. One of the machine learning approaches that addresses these shortcomings is Inductive Logic
Programming (ILP). In ILP, explicit rules and symbolic logical representations can be learned using
only a few training examples. Further, the solutions usually generalize well.
The idea of using neural networks for learning ILP has attracted a lot of research in recent years (
Hölldobler et al. [1999]; França et al. [2014]; Serafini and Garcez [2016]; Evans and Grefenstette
[2018]). Most neural ILP solvers work by propositionalization of the relational data and use the
neural networks for the inference tasks. As such, they usually are superior to classical ILP solvers in
handling missing or uncertain data. However, in many of the proposed neural solvers, the learning
is not explicit (e.g., the connectionist network of Bader et al. [2008]). Further, these methods do not
usually support features such as inventing new predicates and learning recursive rules for predicates.
Additionally, in almost all of the past ILP solvers, the space of possible symbolic rules for each
predicate is significantly restricted and reduced by introducing some types of rule templates before
searching through this space for possible candidates (e.g., mode declarations in Progol (Muggleton
[1995]) and meta-rules in Metagol). In fact, as stated in Evans and Grefenstette [2018], the need for
using program templates to generate a limited set of viable candidate clauses in forming the predicates
is the key weakness in all existing (past) ILP systems (neural or non-neural), severely limiting the
solution space of a problem. The contribution of this paper is as follows: we introduce a new neural

Preprint. Under review.


framework for learning ILP, by using a differentiable implementation of the forward chaining. Further,
we practically remove the need for the use of rule templates by introducing novel symbolic Boolean
function learners via multiplicative neurons. This flexibility in learning the first-order formulas
without the need for a rule template makes it possible to learn very complex recursive predicates.
Finally, as we will show in the experiments, the proposed method outperforms the state of the art ILP
solvers in relational data classification for the problems involving thousands of constants.

2 Inductive Logic Programming via dNL


Logic programming is a programming paradigm in which we use formal logic (usually first-order
logic) to describe relations between the facts and rules of a program domain. In this framework,
rules are usually written as clauses of the form:
$$H \leftarrow B_1, B_2, \ldots, B_m \qquad (1)$$
where H is called the head of the clause and B1, B2, . . . , Bm is called the body of the clause. A clause
of this form expresses that if all the atoms in the body are true, the head is necessarily true. We
assume each of the terms H and Bi is made of atoms. Each atom is created by applying an n-ary
Boolean function called a predicate to some constants or variables. A predicate states the
relation between some variables or constants in the logic program. Throughout this paper we will
use small letters for constants and capital letters (A, B, C, ...) for variables. In ILP, a problem can be
defined as a tuple (B, P, N) where B is the set of background assumptions and P and N are the sets of
positive and negative examples, respectively. Given this setting, the goal of ILP is to construct a
logic program (usually expressed as a set of definite clauses, R) such that it explains all the examples.
More precisely,
$$B, R \models e,\ \forall e \in \mathcal{P}\,; \qquad B, R \not\models e,\ \forall e \in \mathcal{N} \qquad (2)$$
Let us consider the logic program that defines the lessThan predicate over natural numbers and as-
sume that our set of constants is C = {0, 1, 2, 3, 4} and that the ordering of the natural numbers is
defined using the predicate inc (which defines increments of 1). The set of background atoms which
describes the known facts about this problem is B = {inc(0, 1), inc(1, 2), inc(2, 3), inc(3, 4)}.
Further, P = {lt(a, b)|a, b ∈ C, a < b} and N = {lt(a, b)|a, b ∈ C, a ≥ b}. It is easy to verify that
the program with the rules defined below entails all the positive examples and rejects all the
negative ones:
lessThan(A, B) ← inc(A, B)
lessThan(A, B) ← lessThan(A, C), inc(C, B) (3)
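To make the example concrete, the following is a minimal Python sketch (illustrative only, not part of the ILP solver itself) that applies the two rules above by naive forward chaining over the background inc facts and checks that exactly the positive examples are entailed:

```python
# Naive forward chaining for the lessThan example of Eq. (3).
C = [0, 1, 2, 3, 4]
inc = {(0, 1), (1, 2), (2, 3), (3, 4)}                    # background atoms B
positives = {(a, b) for a in C for b in C if a < b}       # P
negatives = {(a, b) for a in C for b in C if a >= b}      # N

lt = set()
changed = True
while changed:                                            # iterate to a fixed point
    changed = False
    derived = set(inc)                                    # rule 1: lessThan(A,B) <- inc(A,B)
    # rule 2: lessThan(A,B) <- lessThan(A,C), inc(C,B)
    derived |= {(a, b) for (a, c) in lt for (c2, b) in inc if c == c2}
    if not derived <= lt:
        lt |= derived
        changed = True

assert lt == positives and not (lt & negatives)
print("entailed atoms:", sorted(lt))
```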
In most ILP systems, the set of possible atoms that can be used in the body of each rule are generated
by using a template (e.g. mode declarations in Progol and meta-rules in Metagol (Cropper and
Muggleton [2016])). If we allow for num_var_i(p) variables in the body of the ith rule for the
predicate p (e.g., num_var_1(lt) = 2, num_var_2(lt) = 3 in the above example), the set of possible
(symbolic) atoms for the ith rule of the predicate p is given by:
$$\mathbb{I}^i_p = \bigcup_{p^* \in \mathcal{P}} \mathbb{T}(p^*, V^i_p)\,, \quad \text{where} \qquad (4)$$

$$\mathbb{T}(p, V) = \{\, p(\mathrm{arg}) \mid \mathrm{arg} \in \mathrm{Perm}(V, \mathrm{arity}(p)) \,\} \qquad (5)$$


where V^i_p is the set of variables (|V^i_p| = num_var_i(p)) for the ith rule and P is the set of all the
predicates in the program. Further, the function Perm(S, n) generates the set of all the permutations
of tuples of length n from the elements of a set S, and the function arity(p) returns the number
of arguments of predicate p. For example, in the lessThan program, V^1_lt = {A, B} and V^2_lt =
{A, B, C}. Consequently, the set of possible atoms can be enumerated as:
I^1_lt = {inc(A, A), inc(A, B), inc(B, A), inc(B, B)} ∪ {lt(A, A), lt(A, B), lt(B, A), lt(B, B)}
I^2_lt = {inc(A, A), inc(A, B), inc(A, C), . . . , inc(C, C)} ∪ {lt(A, A), lt(A, B), . . . , lt(C, C)}
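As an illustration of Eqs. (4) and (5), the following short Python sketch enumerates these candidate atom sets; here Perm is realized with itertools.product, since atoms such as inc(A, A) may repeat a variable (all names are illustrative):

```python
from itertools import product

# Sketch of Eqs. (4)-(5): enumerate candidate atoms for a rule of a predicate.
arity = {"inc": 2, "lt": 2}

def T(p, variables):
    """All atoms p(args) with args drawn from `variables` (permutations with repetition)."""
    return [(p, args) for args in product(variables, repeat=arity[p])]

def I(variables, predicates=("inc", "lt")):
    """Union of T(p, V) over all predicates, as in Eq. (4)."""
    return [atom for p in predicates for atom in T(p, variables)]

I1_lt = I(["A", "B"])        # 4 + 4 = 8 candidate atoms
I2_lt = I(["A", "B", "C"])   # 9 + 9 = 18 candidate atoms
print(len(I1_lt), len(I2_lt))  # 8 18
```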
In general, there are two main approaches to ILP. The bottom-up family of approaches (e.g., Progol)
starts by examining the provided examples, extracts specific clauses from them, and then tries to generalize
from those specific clauses. In the top-down approaches (e.g., most neural implementations as well

        (a)                     (b)
x_i  m_i  F_c            x_i  m_i  F_d
 0    0    1              0    0    0
 0    1    0              0    1    0
 1    0    1              1    0    0
 1    1    1              1    1    1

Figure 1: Truth tables of the F_c(·) and F_d(·) functions

as Metagol and dILP (Evans and Grefenstette [2018])), the possible clauses are generated via a
template and the generated clauses are tested against positive and negative examples. Since the space
of possible clauses is vast, in most of these systems very restrictive template rules are employed
to reduce the size of the search space. For example, dILP allows for clauses of at most two atoms
and only two rules per predicate. In the above example, since |I^2_lt| = 18, this corresponds
to considering only $\binom{18}{2}$ items from all the possible clauses (i.e., the power set of I^2_lt). Metagol
employs a more flexible approach by allowing the programmer to define the rule templates via some
meta-rules. However, in practice, this approach does not resolve the issue completely. Even though it
allows for more flexibility, defining those templates is itself a complicated task which requires expert
knowledge and possibly many trials, and it can still lead to an exponentially large space of possible solutions.
Later we will consider examples where these kinds of approaches are practically impossible.
Alternatively, we propose a novel approach which allows for learning any arbitrary Boolean function
involving several atoms from the set I^i_p. This is made possible via a set of differentiable neural
functions which can explicitly learn and represent Boolean functions.

2.1 Differentiable Neural Logic Networks

Any Boolean function can be learned (at least in theory) via a typical MLP network. However, since
the corresponding logic is stored implicitly in the weights of the MLP network, it is very difficult (if
not impossible) to decipher the actual learned function. Therefore, an MLP is not a good candidate to
use in our ILP solver. Our intermediate goal is to design new neuronal functions which are capable
of learning and representing Boolean functions in an explicit manner. Since any Boolean function
can be expressed in Disjunctive Normal Form (DNF) or, alternatively, in Conjunctive Normal
Form (CNF), we first introduce novel conjunctive and disjunctive neurons. We can then combine
these elementary functions to form more expressive constructs such as DNF and CNF functions.
We use the extension of Boolean values to real values in the range [0, 1], with 1 (True) and 0
(False) representing the two states of a binary variable. We also define the fuzzy unary and
dual Boolean functions of two Boolean variables x and y as:
$$\bar{x} = 1 - x\,, \qquad x \wedge y = x\,y\,, \qquad x \vee y = 1 - (1 - x)(1 - y) \qquad (6)$$
This algebraic representation of the Boolean logic allows us to manipulate the logical expressions
via algebra. Let x^n ∈ {0, 1}^n be the input vector for our logical neuron. In order to implement
the conjunction function, we need to select a subset of x^n and apply the fuzzy conjunction (i.e.,
multiplication) to the selected elements. To this end, we associate a trainable Boolean membership
weight m_i to each input element x_i of the vector x^n. Further, we define a Boolean function F_c(x_i, m_i)
with the truth table as in Fig. 1a which is able to include (exclude) each element in (out of) the
conjunction function. This design ensures the incorporation of each element x_i in the conjunction
function only when the corresponding membership weight is 1. Consequently, the neural conjunction
function f_conj can be defined as:
$$f_{conj}(\mathbf{x}^n) = \prod_{i=1}^{n} F_c(x_i, m_i)\,, \quad \text{where} \quad F_c(x_i, m_i) = \overline{\bar{x}_i\, m_i} = 1 - m_i(1 - x_i) \qquad (7)$$
To ensure the membership weights remain in the range [0, 1], we apply a sigmoid function to the corre-
sponding trainable weights w_i in the neural network, i.e., m_i = sigmoid(c·w_i), where c ≥ 1 is a
constant. Similar to perceptron layers, we can stack N conjunction neurons to create a conjunction
layer of size N. This layer has the same complexity as a typical perceptron layer without incorpo-
rating any bias term. More importantly, this implementation of the conjunction function makes it

possible to interpret the learned Boolean function directly from the values of the membership weights
m_i. The disjunctive neuron can be defined similarly, but using the function F_d with the truth table
depicted in Fig. 1b, i.e.:
$$f_{disj}(\mathbf{x}^n) = \overline{\prod_{i=1}^{n} \overline{F_d(x_i, m_i)}} = 1 - \prod_{i=1}^{n}\big(1 - F_d(x_i, m_i)\big)\,, \quad \text{where} \quad F_d(x_i, m_i) = x_i\, m_i \qquad (8)$$
We call a complex network made by combining the elementary conjunctive and disjunctive neurons
a dNL (differentiable Neural Logic) network. For example, by cascading a conjunction layer with one
disjunctive neuron we can form a dNL-DNF construct. Similarly, a dNL-CNF can be constructed.
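To make Eqs. (6)-(8) concrete, the following is a minimal NumPy sketch of the conjunctive and disjunctive neurons and of a dNL-DNF block built from them; the weight shapes, the constant c, and the random initialization are illustrative choices and not the reference TensorFlow implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class DNLDNF:
    """Sketch of a dNL-DNF block: a conjunction layer followed by one disjunctive neuron."""
    def __init__(self, n_in, n_conj, c=5.0, rng=np.random.default_rng(0)):
        self.c = c
        self.w_conj = rng.normal(size=(n_conj, n_in))  # membership logits, conjunction layer
        self.w_disj = rng.normal(size=(n_conj,))       # membership logits, disjunctive neuron

    def forward(self, x):
        """x: fuzzy Boolean input vector in [0, 1]^n_in."""
        m_c = sigmoid(self.c * self.w_conj)            # membership weights in [0, 1]
        # F_c(x_i, m_i) = 1 - m_i (1 - x_i); conjunction = product over inputs (Eq. 7)
        conj = np.prod(1.0 - m_c * (1.0 - x), axis=1)
        m_d = sigmoid(self.c * self.w_disj)
        # F_d(x_i, m_i) = x_i m_i; disjunction = 1 - prod(1 - F_d) (Eq. 8)
        return 1.0 - np.prod(1.0 - m_d * conj)

dnf = DNLDNF(n_in=8, n_conj=4)
print(dnf.forward(np.array([1, 0, 1, 1, 0, 0, 1, 0], dtype=float)))
```

After training, a membership weight close to 1 means the corresponding atom participates in the learned conjunction, which is what makes the learned function directly readable.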

2.2 ILP as a Satisfiability Problem

We associate a dNL (conjunction) function F^i_p with the ith rule of every intensional predicate p in our
logic program. Intensional predicates can use other predicates and variables, in contrast to the
extensional predicates which are entirely defined by the ground atoms. We view the membership
weights m in the conjunction neuron as Boolean flags that indicate whether each atom in a rule is
off or on. In this view, the problem of ILP can be seen as finding an assignment to these membership
Boolean flags such that the resulting rules, applied to the background facts, entail all positive examples
and reject all negative examples. However, by allowing these membership weights to be learnable
weights, we are formulating a continuous relaxation of the satisfiability problem. This approach is
in some ways similar to the approach in dILP (Evans and Grefenstette [2018]), but differs in how we
define the Boolean flags. In dILP, a Boolean flag is assigned to each of the possible combinations of
two atoms from the set I^i_p. dILP then uses a softmax network to learn the set of winning clauses and
interprets the weights in the softmax network as the Boolean flags that select one clause out
of the possible clauses. However, as mentioned earlier, in our approach the membership weights of the
conjunction (or any other logical function from dNL) can be directly interpreted as the flags in the
satisfiability interpretation.

2.3 Forward Chaining

We are now able to formulate the ILP problem as an end-to-end differentiable neural network.
We associate a (fuzzy) value vector X^(t)_p with each predicate p at time-stamp t, which holds
the (fuzzy) Boolean values of all the ground atoms involving that predicate. For the example
in consideration (i.e., lessThan), the vector X^(t)_inc includes the Boolean values for atoms in
{inc(0, 0), inc(0, 1), . . . , inc(4, 4)}. For extensional predicates, these values will be constant during
the forward chain of reasoning, but for intensional predicates such as lt, the values of X^(t)_p
change during the application of the predicate rules F^i_p at each time-stamp. Let G be the set
of all ground atoms and G_p be the subset of G associated with predicate p. For every ground atom
e ∈ G_p and for every rule F^i_p, let Θ^i_p(e) be the set of all the substitutions of the constants into the
variables V^i_p which would result in the atom e. In the lessThan program (see Eq. (3)), for example,
for the ground atom lt(0, 2), the set of all substitutions corresponding to the second rule (i.e., i = 2)
is given by Θ^2_lt(lt(0, 2)) = {{A ↦ 0, B ↦ 2, C ↦ 0}, . . . , {A ↦ 0, B ↦ 2, C ↦ 4}}. We can
now define the one-step forward inference formula as:

$$\forall e \in \mathcal{G}_p, \quad X^{(t+1)}_p[e] = F_{am}\!\big(X^{(t)}_p[e],\, \mathcal{F}(e)\big)\,, \quad \text{where} \qquad (9a)$$

$$\mathcal{F}(e) = \bigvee_{i}\ \bigvee_{\theta \in \Theta^i_p(e)} F^i_p\big(\mathbb{I}^i_p\big|_\theta\big) \qquad (9b)$$

For most practical purposes we can assume that the amalgamate function F_am is simply the
fuzzy disjunction function, but we will consider other options in Appendix B. Here, for brevity,
we did not introduce the indexing notations in (9). By X_p[e], we actually mean X_p[index(X_p, e)],
where index(X_p, e) returns the index of the corresponding element of the vector X_p. Further, each F^i_p
is the corresponding predicate rule function implemented as a differentiable dNL network (e.g., a
conjunctive neuron). For each substitution, this function is applied to the input vector I^i_p|_θ
evaluated under the substitution θ. As an example, for the ground atom lt(0, 2) in the previous example,

Figure 2: The diagram of one step of forward chaining for predicate lt, where F_lt is implemented using a
dNL-DNF network.

and for the substitution θ = {A ↦ 0, B ↦ 2} corresponding to the first rule we have:

I^1_lt|_θ = {X_inc[(0, 0)], X_inc[(0, 2)], X_inc[(2, 0)], X_inc[(2, 2)], X_lt[(0, 0)], X_lt[(0, 2)], X_lt[(2, 0)], X_lt[(2, 2)]}
Fig. 2 shows one step of forward chaining for learning the predicate lt. In this diagram the two rules are
combined and replaced by one dNL-DNF function.
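A schematic sketch of the one-step update in Eq. (9), taking the amalgamate function to be the fuzzy disjunction; the index bookkeeping and the rule functions here are simplified placeholders rather than the actual implementation:

```python
import numpy as np

def fuzzy_or(values):
    """Fuzzy disjunction of a list of values in [0, 1]."""
    values = list(values)
    return 1.0 - np.prod(1.0 - np.asarray(values)) if values else 0.0

def forward_step(X, ground_atoms, substitutions, rule_fns):
    """One step of Eq. (9) for a single intensional predicate p.

    X            : dict, ground atom e -> current fuzzy value X_p^{(t)}[e]
    ground_atoms : list of ground atoms e in G_p
    substitutions: substitutions[i][e] = list of input vectors I_p^i|theta, one per theta
    rule_fns     : rule_fns[i] = rule function F_p^i (any callable returning a value in [0, 1])
    """
    X_next = dict(X)
    for e in ground_atoms:
        per_rule = [fuzzy_or(rule_fns[i](inp) for inp in substitutions[i][e])
                    for i in range(len(rule_fns))]          # inner disjunction over theta
        # F_am(old, new) taken as the fuzzy disjunction of the old value with F(e)
        X_next[e] = fuzzy_or([X[e], fuzzy_or(per_rule)])
    return X_next
```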

2.4 Training

We obtain the initial values of the valuation vectors from the background atoms, i.e.,

$$\forall p,\ \forall e \in \mathcal{G}_p:\quad X^{(0)}_p[e] = 1 \ \text{if}\ e \in \mathcal{B},\ \text{else}\ X^{(0)}_p[e] = 0 \qquad (10)$$

We interpret the final values X^(t_max)_p[e] (after t_max steps of forward chaining) as the conditional
probability for the value of the atom given the model parameters, and we define the loss as the average
cross-entropy loss between the ground truth provided by the positive and negative examples for the
corresponding predicate p and X^(t_max)_p, which is the algorithm output after t_max forward chaining
steps. We train the model using the ADAM (Kingma and Ba [2014]) optimizer to minimize the aggregate
loss over all intensional predicates with a learning rate of 0.001 (in some cases we may increase the
rate for faster convergence). After the training is completed, a zero cross-entropy loss indicates that
the model has been able to satisfy all the examples in the positive and negative sets. However, there
might exist a few atoms with membership weights of '1' in the corresponding dNL network for a
predicate which are not necessary for the satisfiability of the solution. Since there is no
gradient at this point, those terms cannot be directly removed during the gradient descent algorithm
unless we include some penalty terms. In practice, we use a simpler approach: in the final stage of the
algorithm, we remove each atom if switching its membership variable from '1' to '0' does not change the loss
function.
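This final pruning stage can be sketched as follows; the membership container and the loss function are placeholders, and the tolerance is an illustrative choice:

```python
import numpy as np

def prune_memberships(memberships, loss_fn, tol=1e-6):
    """Greedy pruning sketch: flip each near-1 membership to 0 and keep the flip
    only if the training loss does not change (within `tol`)."""
    base_loss = loss_fn(memberships)
    for idx in np.argwhere(memberships > 0.5):
        saved = memberships[tuple(idx)]
        memberships[tuple(idx)] = 0.0          # try removing this atom from the rule
        if loss_fn(memberships) > base_loss + tol:
            memberships[tuple(idx)] = saved    # the atom is needed; restore it
    return memberships
```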

2.5 Predicate rules (F^i_p)

In the majority of ILP systems, the body of a rule is defined as the conjunction of some
atoms. However, in general the predicate rules can be defined as any arbitrary Boolean function of
the elements of the set I^i_p. One of the main reasons for restricting the form of these rules in most ILP
implementations is the vast space of possible Boolean functions that would need to be considered. For
example, by restricting the form of the rule's body to a pure Horn clause we reduce the space of possible
functions from $2^{2^L}$ to only $2^L$, where L = |I^i_p|. Most ILP systems apply much further restrictions.
For example, dILP limits the possible combinations to the $L^2$ possible combinations of terms made
of two atoms. In contrast, in our proposed framework via dNL networks, we are able to learn arbitrary
functions with any number of atoms in the formula. Though some of the $2^{2^L}$ possible
functions require an exponentially large number of terms when expressed in DNF form, for example, in
most typical scenarios a dNL-DNF function with a reasonable number of disjunction terms is
capable of learning the required logic. Further, even though our approach allows for multiple rules
per predicate, in most scenarios we can learn all the rules for a predicate as one DNF formula instead
of learning separate rules. Finally, we can easily allow for including the negation of each atom in
the formula by concatenating the vector I^i_p|_θ and its fuzzy negation, i.e., (1.0 − I^i_p|_θ), as the input to
the F^i_p function. This would only double the number of parameters of the model. In contrast, in
most other implementations of ILP, this would increase the number of parameters and the problem
complexity at much higher rates.
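As an illustration, including the negated atoms amounts to a single concatenation of the fuzzy input vector with its complement before it is fed to F^i_p (a minimal sketch, with illustrative values):

```python
import numpy as np

def with_negations(I_theta):
    """Double the input of F_p^i with the fuzzy negation 1 - x of each atom."""
    I_theta = np.asarray(I_theta, dtype=float)
    return np.concatenate([I_theta, 1.0 - I_theta])

print(with_negations([1.0, 0.2, 0.0]))  # [1.  0.2 0.  0.  0.8 1. ]
```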

2.6 Implementation and Performance

We have implemented¹ the dNL-ILP solver model using TensorFlow (Abadi et al. [2016]). In the
previous sections, we have outlined the process in a sequential manner. However, in the actual
implementation we first create index matrices using all the background facts before starting the
optimization task. Further, all the substitution operations for each predicate (at each time-stamp)
are carried out using a single gather function. Finally, at each time-stamp and for each intensional
predicate, all instances of applying (executing) the neural function F^i_p are carried out in a batch operation
and in parallel. The proposed algorithm allows for very efficient learning of arbitrarily complex
formulas and significantly reduces the complexity that arises in typical ILP systems when
increasing the number of possible atoms in each rule. Indeed, in our approach, there is usually no
need for any tuning and parameter specification other than the size of the DNF network (the total number
of rules for a predicate) and the number of existentially quantified variables for each rule.
On the other hand, since we use a propositionalization step (typical to almost all neural ILP solvers),
special care is required when the number of constants in the program is very large. While for the
extensional and target predicates we can usually define the vectors Xp corresponding only to the
provided atoms in the sets B, P and N , for the auxiliary predicates we may need to consider many
intermediate ground atoms not included in the program. In such cases, when the space of possible
atoms is very large, we may need to restrict the set of possible ground atoms.
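The single-gather evaluation mentioned above can be sketched with NumPy fancy indexing; the shapes and the random index matrix below are illustrative placeholders for the matrices built from the background facts:

```python
import numpy as np

# Sketch: a precomputed index matrix maps every (ground atom, substitution,
# body-atom position) triple to a position in the flattened valuation vector,
# so all rule inputs are fetched with one gather.
num_ground, num_subs, body_len, num_values = 25, 5, 8, 60

X = np.random.rand(num_values)                    # fuzzy values of all ground atoms
index_matrix = np.random.randint(0, num_values,   # built once, before optimization
                                 size=(num_ground, num_subs, body_len))

rule_inputs = X[index_matrix]                     # single gather, shape (25, 5, 8)
# rule_inputs can now be pushed through F_p^i for all substitutions in one batch.
```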

3 Past Works
Addressing all the important past contributions in ILP is a tall order, and given the limited space we
will focus only on a few recent approaches that are in some ways relevant to our work. Among the ILP
solvers that are capable of learning recursive predicates (in an explicit and symbolic manner), the most
notable examples are Metagol (Cropper and Muggleton [2016]) and dILP (Evans and Grefenstette
[2018]). Metagol is a powerful method that is capable of learning very complex tasks by using the
user-provided meta-rules. The main issue with Metagol is that while it allows for some flexibility
in terms of providing the meta-rules, it is not always clear how to define those meta formulas. In
practice, unless the expert already has some knowledge regarding the form of the possible solution,
it would be very difficult to use this method. dILP, on the other hand, is a neural ILP solver that,
like our method, uses propositionalization of the data and formulates a differentiable neural ILP
solver. Our proposed algorithm is in many regards similar to dILP. However, because of the way it
defines templates, dILP is limited to learning simple predicates with an arity of at most two and with at
most two atoms in each rule. CILP++ (França et al. [2014]) is another notable neural ILP
solver which also uses propositionalization, similar to our method and dILP. CILP++ is a very efficient
algorithm and is capable of learning large-scale relational datasets. However, since this algorithm
uncertain data and specially in the tasks involving classification of the relational datasets, the most
notable framework is the probabilistic ILP (PILP) (De Raedt and Kersting [2008]) and its variants and
also Markov Logic Networks (MLN) Richardson and Domingos [2006]. These types of algorithms
extend the framework of ILP to handle uncertain data by introducing a probabilistic framework. Our
proposed approach is related to PILP in that we also associate a real number to each atom and each
rule in the formula. We will compare the performance of our method to this category of statistical
relational learners later in our experiment. The methods in this category in general are not capable of
learning recursive predicates.
¹ The Python implementation of dNL-ILP is available at https://github.com/apayani/ILP

4 Experiments
The ability to learn recursive predicates is fundamental in learning a variety of algorithmic tasks
(Tamaddoni-Nezhad et al. [2015]; Cropper and Muggleton [2015]). In practice, Metagol is the only
notable ILP solver which can efficiently learn recursive predicates (via meta-rule templates). Our
evaluations² show that the proposed dNL-ILP solver can learn a variety of discrete algorithmic tasks
involving recursion very efficiently and without the need for predefined meta-rules. Here, we briefly
explore two synthetic learning tasks before considering large-scale tasks involving relational datasets.

4.1 Learning decimal multiplication

We use the dNL-ILP solver for learning the predicate mul/3 for decimal multiplication using only the
positive and negative examples. We use C = {0, 1, 2, 3, 4, 5, 6} as constants and our background
knowledge consists of the extensional predicates {zero/1, inc/2, add/3}, where inc/2 defines
an increment of one and add/3 defines addition. The target predicate is mul(A, B, C) and we allow
for using 5 variables (i.e., num_var_i(mul) = 5) in each rule. We use a dNL-DNF network with 4
disjunction terms (4 conjunctive rules) for learning F_mul. It is worth noting that since we do not
know in advance how many rules would be needed, we pick an arbitrary number and increase it
in case the ILP program cannot explain all the examples. Further, we set t_max = 8. One of the
solutions that our model finds is:
mul(A, B, C) ← zero(B), zero(C)
mul(A, B, C) ← mul(A, D, E), inc(D, B), add(E, A, C)

4.2 Sorting

The sorting task is more complex than the previous task since it requires not only the list semantics,
but also many more constants compared to the arithmetic problem. We implement the list semantics by
allowing the use of functions in defining predicates. For data of type list, we define two functions,
H and t, which allow for decomposing a list into its head and tail elements, i.e., A = [A_H | A_t]. We use
elements of {a, b, c, d} and all the ordered lists made from permutations of up to three elements as
constants in the program (i.e., |C| = 40). We use extensional predicates such as gt (greater than), eq
(equal) and lte (less than or equal) to define the ordering between the elements of lists as part of the
background knowledge. We allow for using 4 variables (and their functions) in defining the predicate
sort (i.e., num_var_i(sort) = 4). One of the solutions that our model finds is:
sort(A, B) ← sort(A_H, C), lte(C_t, A_t), eq(B_H, C), eq(A_t, B_t)
sort(A, B) ← sort(A_H, C), sort(D, B_H), gt(C_t, A_t), eq(B_t, C_t), eq(D_H, C_H), eq(A_t, D_t)

Even though the above examples involve learning tasks that may not seem very difficult on the
surface, and deal with a relatively small number of constants, they are far from trivial. To the best of
our knowledge, learning a recursive predicate for a complex algorithmic task such as sort, which
involves multiple recursive rules with 6 atoms and includes 12 variables (counting the two functions
head and tail per variable), is beyond the power of any existing ILP solver. Here, for example, the total
number of possible atoms to choose from is |I^2_sort| = 176, and for the case of choosing 6 elements
from this list we need to consider $\binom{176}{6} > 3 \times 10^{10}$ possible combinations (assuming we knew in
advance there is a need for 6 atoms). While we can somewhat reduce this large space by removing
some of the improbable clauses, no practical ILP solver is capable of learning these kinds of relations
directly from examples.

4.3 Classification for Relational Data

We evaluate the performance of our proposed ILP solver in some benchmark ILP tasks. We use
relational datasets Mutagenesis (Debnath et al. [1991]), UW-CSE (Richardson and Domingos [2006])
as well as the IMDB and Cora datasets³. Table 1 summarizes the features of these datasets. As baselines,
² Many of the symbolic tasks used in Evans and Grefenstette [2018] as well as some others are provided in
the accompanying source code.
³ Publicly available at https://relational.fit.cvut.cz/

Table 1: Dataset features

Dataset       Constants  Predicates  Examples  Target Predicate
Mutagenesis   7045       20          188       active(A)
UW-CSE        7045       15          16714     advisedBy(A, B)
Cora          3079       10          70367     sameBib(A, B)
IMDB          316        10          14505     workingUnder(A, B)

we compare our method with state-of-the-art algorithms based on Markov Logic Networks
such as GSLP (Dinh et al. [2011]), LSM (Kok and Domingos [2009]), MLN-B (boosted MLN), and B-RLR
(Ramanan et al. [2018]), as well as probabilistic ILP based algorithms such as SleepCover (Bellodi
and Riguzzi [2015]). Further, since in most of these datasets the number of negative examples
is significantly greater than the number of positive examples, we report the Area Under the Precision-Recall
(AUPR) curve as a more reliable measure of classification performance. We use 5-fold cross
validation, except for the Mutagenesis dataset for which we use 10-fold, and we report the average
AUPR over all the folds. Table 2 summarizes the classification performance for the 4 relational
datasets. As the results show, our proposed method outperforms the previous algorithms in the three

Table 2: AUPR measure for the 4 relational classification tasks

Dataset       GSLP  LSM   SleepCover  MLN-B  B-RLR  dNL-ILP
Mutagenesis   0.71  0.76  0.95        N/A    N/A    0.97
UW-CSE        0.42  0.46  0.07        0.91   0.89   0.51
Cora          0.80  0.89  N/A         N/A    N/A    0.95
IMDB          0.71  0.79  N/A         0.83   0.90   1.00

tasks: Mutagenesis, Cora and IMDB. In the case of the IMDB dataset, it reaches perfect classification
(AUROC = 1.0, AUPR = 1.0). This performance is only made possible by the ability
to learn recursive predicates. Indeed, when we disallow recursion in this model, the AUPR
performance drops to 0.76. The end-to-end design of our differentiable ILP solver makes it possible
to combine other forms of learnable functions with the dNL networks. For example, while
handling continuous data is usually difficult in most ILP solvers, we can directly learn threshold
values to create binary predicates from the continuous data (see Appendix D)⁴. We have used this
method in the Mutagenesis task to handle the continuous data in this dataset. For the case of UW-CSE,
however, our method did not perform as well. One of the reasons is arguably the fact that the number
of negative examples is significantly larger than the number of positive ones for this dataset. Indeed, in some
of the published reports (e.g., França et al. [2014]), the number of negative examples is limited
using the closed-world assumption as in Davis et al. [2005]. Because of differences in hardware, it is
difficult to directly compare the speed of algorithms. In our case, we have evaluated the models using
a 3.70 GHz CPU, 16 GB RAM and a GeForce GTX 1080 Ti graphics card. Using this setup, problems
such as IMDB and Mutagenesis are learned in just a few seconds. For Cora, the model creation takes
about one minute and the whole simulation for any fold takes less than 3 minutes.

5 Conclusion
We have introduced dNL-ILP as a new framework for learning inductive logic programming problems.
Using various experiments we showed that dNL-ILP outperforms past algorithms for learning
algorithmic and recursive predicates. Further, we demonstrated that dNL-ILP is capable of learning
from uncertain and relational data and outperforms the state of the art ILP solvers in classification
tasks for Mutagenesis, Cora and IMDB datasets.

⁴ Alternatively, we can assign learnable probabilistic functions to those variables (see Appendix C).

A Notations

Table 3: Some of the notations used in this paper


Notation       Explanation
p/n            a predicate p of arity n
C              the set of constants in the program
B              the set of background atoms
P              the set of positive examples
N              the set of negative examples
G_p            the set of ground atoms for predicate p
G              the set of all ground atoms
P              the set of all predicates in the program
X_p^(t)        the fuzzy values of all the ground atoms for predicate p at time t
A ↦ a          substitution of constant a into variable A
R              the set of all the rules in the program
Perm(S, n)     the set of all the permutations of tuples of length n from the set S
T(p, V)        the set of atoms involving predicate p and using the variables in the set V
I^i_p          the set of all atoms that can be used in generating the ith rule for predicate p

B Amalgamate Function (F_am)


In most scenarios we may set this function to the fuzzy disjunction function. However, we can modify this
function for some specific purposes (see the sketch after this list). For example:

• F_am(old, new) = old ∧ new: with this choice we can implement a notion of ∀ (for all) in
logic, which can be useful in certain programs. There is an example in the source code which
learns array indexing with the help of this option.
• F_am(old, new) = new: with this choice we can learn transient logic (an alternative approach
to the algorithm presented in Inoue et al. [2014]).
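A minimal sketch of these three choices of F_am on fuzzy values (illustrative only):

```python
def amalgamate(old, new, mode="disjunction"):
    """Sketch of the F_am variants discussed above; old, new are fuzzy values in [0, 1]."""
    if mode == "disjunction":          # default: keep an atom once it has been derived
        return 1.0 - (1.0 - old) * (1.0 - new)
    if mode == "conjunction":          # 'for all' style accumulation
        return old * new
    if mode == "overwrite":            # transient logic: the latest inference wins
        return new
    raise ValueError(mode)
```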

C DREAM4 Challenge Experiment (handling uncertain data)


Inferring the causal relationship among different genes is one of the important problems in biology.
In this experiment, we study the application of dNL-ILP to inferring the structure of gene regulatory
networks using the 10-gene time-series dataset from the DREAM4 challenge tasks (Marbach et al. [2009]).
In the 10-gene challenge, the data consists of 5 different biological systems, each composed of 10
genes. For each system a time-series containing 105 noisy readings of the gene expressions (in the
range [0, 1]) is provided. The time-series is obtained via 5 different experiments, each created using a
simulated perturbation in a subset of genes and recording the gene expressions over time. To tackle
this problem using the dNL-ILP framework, we simply assume that each gene can be in one of two
states: on (excited or perturbed) or off. The key idea here is to model each gene's state (off or on)
using two different approaches and then aim to get a consensus on the gene's state from these two
different probes. To accomplish this, for each gene G_i we define the predicate off_i which evaluates
the state of G_i using the corresponding continuous values. We also use the predicate inf_off_i which
takes the states of all the predicates off_j (j ≠ i) to infer the state of G_i. To ensure that for each
background data point inf_off_i is close to off_i, we define another auxiliary predicate aux_i with the
predicate function defined as F_aux_i = 1 − |inf_off_i − off_i|.
Since the states of genes are uncertain, we use a probabilistic approach and assume that each gene
state is conditionally distributed according to a Gaussian mixture (with 4 components), i.e., x|off ∼
GMM_off and x|on ∼ GMM_on. As such, we design F_inf_off_i such that it returns the probability
of the gene G_i being off, and let all the parameters of the mixture models be trainable weights.
For F_inf_off_i we use a dNL-CNF network with only one term. Each data point in the time
series corresponds to one background knowledge base and consists of the continuous value of each gene
expression. For each gene, we assign the 5 percent of data points with the lowest and highest absolute
distance from the mean as positive and negative examples for the predicate off_i, respectively. We interpret
the values of the membership weights in the trained dNL-CNF networks used in F_inf_off_i
as the degree of connection between two genes. Table 4 compares the performance of dNL-ILP to
two state-of-the-art algorithms, NARROMI (Zhang et al. [2012]) and MICRAT (Yang et al. [2018]), on the
10-gene classification tasks of the DREAM4 dataset.

Table 4: DREAM4 challenge scores

Metric      NARROMI  MICRAT  dNL-ILP
Accuracy    0.82     0.87    0.86
F-Score     0.35     0.32    0.36
MCC         0.24     0.33    0.35

D UCI Dataset Classification (handling continuous data)


Reasoning over continuous data has been an ongoing challenge for ILP. Most of the current ap-
proaches either model continuous data as random variables and use the probabilistic ILP framework
(De Raedt and Kersting [2008]), or use some form of discretization via iterative approaches. The
former approach cannot be applied to cases where we do not have reasonable assumptions for
the probability distributions. Further, the latter approach is usually limited to small-scale problems
(e.g., Ribeiro et al. [2017]) since the search space grows exponentially as the number of continuous
variables and boundary decisions increases. Alternatively, the end-to-end design of dNL-ILP
makes it rather easy to handle continuous data. Recall that even though we usually use dNL based
functions, the predicate functions F can be defined as any arbitrary Boolean function in our model.
Thus, for each continuous variable x we define k lower-boundary predicates gt_x_i(x, l_x_i) as well as k
upper-boundary predicates lt_x_i(x, u_x_i), where i ∈ {1, . . . , k}. We let the boundary values l_x_i and
u_x_i be trainable weights and we define the lower-boundary and upper-boundary predicate functions as:

$$F_{gt_{x_i}} = \sigma\big(c\,(x - l_{x_i})\big)\,, \qquad F_{lt_{x_i}} = \sigma\big(-c\,(x - u_{x_i})\big),$$
where σ is the sigmoid function and c ≫ 1 is a constant. To evaluate this approach we use it in a
classification task for two datasets containing continuous data, Wine and Sonar, from the UCI Machine
Learning Repository (Dua and Karra Taniskidou [2017]), and compare its performance to ALEPH
(Srinivasan [2001]), a state-of-the-art ILP system, as well as the recently proposed FOLD+LIME
algorithm (Shakerin and Gupta [2018]). The Wine classification task involves 13 continuous features
and three classes, and the Sonar task is a binary classification task involving 60 features. For each
class we define a corresponding intensional predicate via dNL-DNF and learn that predicate from
a set of lt_x_i and gt_x_i predicates corresponding to each continuous feature. We set the number of
boundaries to 6 (i.e., k = 6). The classification accuracy results for the 5-fold cross-validation setting
are depicted in Table 5.

Table 5: Classification accuracy

Task    ALEPH+LIME  FOLD+LIME  dNL-ILP
Wine    0.92        0.93       0.98
Sonar   0.74        0.78       0.85
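A minimal sketch of the trainable boundary predicates described in this appendix, following the prose definitions above (gt_x_i paired with a lower bound l_x_i and lt_x_i with an upper bound u_x_i); the values of k, c, and the initialization are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class BoundaryPredicates:
    """k lower-boundary (gt) and k upper-boundary (lt) fuzzy predicates for one feature."""
    def __init__(self, k=6, c=20.0, rng=np.random.default_rng(0)):
        self.c = c
        self.lower = rng.uniform(0.0, 1.0, size=k)   # trainable l_x_i
        self.upper = rng.uniform(0.0, 1.0, size=k)   # trainable u_x_i

    def __call__(self, x):
        gt = sigmoid(self.c * (x - self.lower))      # fuzzy 'x is above the i-th lower bound'
        lt = sigmoid(-self.c * (x - self.upper))     # fuzzy 'x is below the i-th upper bound'
        return np.concatenate([gt, lt])              # fed to the dNL-DNF class predicate

bp = BoundaryPredicates()
print(bp(0.42))
```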

References
Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu
Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: A system for
large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and
Implementation (OSDI 16), pages 265–283, 2016.
Sebastian Bader, Pascal Hitzler, and Steffen Hölldobler. Connectionist model generation: A first-order
approach. Neurocomputing, 71(13-15):2420–2432, 2008.
Elena Bellodi and Fabrizio Riguzzi. Structure learning of probabilistic logic programs by searching
the clause space. Theory and Practice of Logic Programming, 15(2):169–212, 2015.

Andrew Cropper and Stephen H Muggleton. Learning efficient logical robot strategies involving
composable objects. In Twenty-Fourth International Joint Conference on Artificial Intelligence,
2015.
Andrew Cropper and Stephen H. Muggleton. Metagol system. https://github.com/metagol/metagol,
2016.
Jesse Davis, Elizabeth Burnside, Inês de Castro Dutra, David Page, and Vítor Santos Costa. An
integrated approach to learning bayesian networks of rules. In European Conference on Machine
Learning, pages 84–95. Springer, 2005.
Luc De Raedt and Kristian Kersting. Probabilistic inductive logic programming. In Probabilistic
Inductive Logic Programming, pages 1–27. Springer, 2008.
A. K. Debnath, R. L. Lopez de Compadre, G. Debnath, A. J. Shusterman, and C. Hansch. Structure-
activity relationship of mutagenic aromatic and heteroaromatic nitro compounds. Correlation with
molecular orbital energies and hydrophobicity. Journal of medicinal chemistry, 34(2):786–797,
1991.
Quang-Thang Dinh, Matthieu Exbrayat, and Christel Vrain. Generative structure learning for markov
logic networks based on graph of predicates. In Twenty-Second International Joint Conference on
Artificial Intelligence, 2011.
Dheeru Dua and Efi Karra Taniskidou. UCI machine learning repository, 2017.
Richard Evans and Edward Grefenstette. Learning explanatory rules from noisy data. Journal of
Artificial Intelligence Research, 61:1–64, 2018.
Manoel VM França, Gerson Zaverucha, and Artur S d’Avila Garcez. Fast relational learning using
bottom clause propositionalization with artificial neural networks. Machine learning, 94(1):81–104,
2014.
Steffen Hölldobler, Yvonne Kalinke, and Hans-Peter Störr. Approximating the semantics of logic
programs by recurrent neural networks. Applied Intelligence, 11(1):45–58, 1999.
Katsumi Inoue, Tony Ribeiro, and Chiaki Sakama. Learning from interpretation transition. Machine
Learning, 94(1):51–79, 2014.
Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR,
abs/1412.6980, 2014.
Stanley Kok and Pedro Domingos. Learning markov logic network structure via hypergraph lifting.
In Proceedings of the 26th annual international conference on machine learning, pages 505–512.
ACM, 2009.
Daniel Marbach, Thomas Schaffter, Dario Floreano, Robert J Prill, and Gustavo Stolovitzky. The
dream4 in-silico network challenge. Draft, version 0.3, 2009.
Stephen Muggleton. Inverse entailment and progol. New generation computing, 13(3-4):245–286,
1995.
Nandini Ramanan, Gautam Kunapuli, Tushar Khot, Bahare Fatemi, Seyed Mehran Kazemi, David
Poole, Kristian Kersting, and Sriraam Natarajan. Structure learning for relational logistic regres-
sion: An ensemble approach. In Sixteenth International Conference on Principles of Knowledge
Representation and Reasoning, 2018.
Tony Ribeiro, Sophie Tourret, Maxime Folschette, Morgan Magnin, Domenico Borzacchiello, Fran-
cisco Chinesta, Olivier Roux, and Katsumi Inoue. Inductive learning from state transitions over
continuous domains. In International Conference on Inductive Logic Programming, pages 124–139.
Springer, 2017.
Matthew Richardson and Pedro Domingos. Markov logic networks. Machine learning, 62(1-2):107–
136, 2006.

Luciano Serafini and Artur d’Avila Garcez. Logic tensor networks: Deep learning and logical
reasoning from data and knowledge. arXiv preprint arXiv:1606.04422, 2016.
Farhad Shakerin and Gopal Gupta. Induction of non-monotonic logic programs to explain boosted
tree models using lime. arXiv preprint arXiv:1808.00629, 2018.
Ashwin Srinivasan. The aleph manual, 2001.
Alireza Tamaddoni-Nezhad, David Bohan, Alan Raybould, and Stephen Muggleton. Towards
machine learning of predictive models from ecological data. In Inductive Logic Programming,
pages 154–167. Springer, 2015.
Bei Yang, Yaohui Xu, Andrew Maxwell, Wonryull Koh, Ping Gong, and Chaoyang Zhang. Micrat:
a novel algorithm for inferring gene regulatory networks using time series gene expression data.
BMC systems biology, 12(7):115, 2018.
Xiujun Zhang, Keqin Liu, Zhi-Ping Liu, Béatrice Duval, Jean-Michel Richer, Xing-Ming Zhao,
Jin-Kao Hao, and Luonan Chen. Narromi: a noise and redundancy reduction technique improves
accuracy of gene regulatory network inference. Bioinformatics, 29(1):106–113, 2012.

