Mapping The Neuro-Symbolic AI Landscape by Architectures: A Handbook On Augmenting Deep Learning Through Symbolic Reasoning
Abstract
Integrating symbolic techniques with statistical ones is a long-standing problem in arti-
ficial intelligence. The motivation is that the strengths of either area match the weaknesses
of the other, and – by combining the two – the weaknesses of either method can be limited.
Neuro-symbolic AI focuses on this integration where the statistical methods are in partic-
ular neural networks. In recent years, there has been significant progress in this research
field, where neuro-symbolic systems outperformed logical or neural models alone. Yet,
neuro-symbolic AI is, comparatively speaking, still in its infancy and has not been widely
adopted by machine learning practitioners. In this survey, we present the first mapping of
neuro-symbolic techniques into families of frameworks based on their architectures, with
several benefits: Firstly, it allows us to link different strengths of frameworks to their re-
spective architectures. Secondly, it allows us to illustrate how engineers can augment their
neural networks while treating the symbolic methods as black-boxes. Thirdly, it allows us
to map most of the field so that future researchers can identify closely related frameworks.
1. Introduction
Over the last decades, machine learning has achieved outstanding performance in pattern
recognition across a range of applications. In particular, in the areas of computer vision
(LeCun et al., 1998; Goodfellow et al., 2020; Dosovitskiy et al., 2021; Liu et al., 2022),
natural language processing (NLP) (Hochreiter & Schmidhuber, 1997; Mikolov et al., 2013;
Vaswani et al., 2017; Devlin et al., 2019), and recommendation systems (He et al., 2017;
Zheng et al., 2018; Rashed et al., 2019), neural networks have outperformed more traditional
machine learning models. These breakthroughs have led to significant progress in fields such
as medicine (Kearnes et al., 2016; Xie et al., 2019; Huang et al., 2019), finance (Deng et al.,
2016; Bao et al., 2017), and autonomous driving (Bojarski et al., 2016; Luo et al., 2018).
Despite these strides, purely neural models still have important limitations (Marcus, 2018).
Five important limitations are particularly worth mentioning:
1. Structured reasoning. Neural networks are particularly suited for pattern recog-
nition but do not lend themselves well to hierarchical or composite reasoning and do
not differentiate between causality and correlation (Lake & Baroni, 2017).
2. Data need. To achieve robustness in the predictions of neural models, large amounts
of data are needed (Halevy et al., 2009; Ba & Caruana, 2014). However, large datasets
are unavailable in many applications, making neural networks an unviable choice.
While neural networks are arguably the better-known models to the general public, logi-
cal models have been the more prominent research direction in artificial intelligence (AI)
(Russell & Norvig, 2016) prior to the resurgence of neural networks (Schmidhuber, 2015).
In contrast to neural models, logical models are particularly suited for symbolic reasoning
tasks that depend on the ability to capture and identify relations and causality (Vennekens
et al., 2009; De Raedt & Kimmig, 2015). However, logical models have their own set of
limitations.1 The first limitation of logical models is their inability to deal with uncertainty
both in the data and in the theory as, traditionally, each proposition must either be true or
false (Pearl, 1988). The second major bottleneck, which arguably has led to logical models
falling behind neural models in popularity, is scalability, as computational complexity generally
grows exponentially with the size of the alphabet and the lengths of formulae in the logical
theory (Bradley & Manna, 2007).
To tackle the first problem, the field of statistical relational learning (SRL) aims at
unifying logical and probabilistic frameworks (Getoor & Taskar, 2007; Raedt et al., 2016).
In fact, this unification has been a long-standing goal in machine learning (Russell, 2015).
Logical notions capture objects, properties and relations, focusing on learning processes,
1. We use the term “logical models” to collectively refer to approaches for modelling systems using logic,
including knowledge representation, verification and automated planning (Bradley & Manna, 2007).
dependencies and causality. On the other hand, the underlying probabilistic theory ad-
dresses uncertainty and noisy knowledge acquisition with a focus on learning correlations.
In contrast to neural networks, SRL frameworks allow us to reason at a symbolic level,
generalise from little data, and integrate domain expertise easily. In addition, they are easy
to interpret as the logical theories are close to natural language. However, by combining
models in logic and probability theory – two independently computationally-hard problems
– SRL frameworks fail to scale well in general (Natarajan et al., 2015). As illustrated by
Table 1, the weaknesses of one area are the strengths of the other, and thus, it should come
as no surprise that further unification of these two areas is necessary.
To tackle the second limitation, building on the strengths of SRL, the area of neuro-
symbolic AI takes this unification further by combining logical models with neural networks
(d’Avila Garcez et al., 2022). This approach is motivated by the fact that one reason neural
networks have gained so much attention is their increased scalability compared to logical AI.
This scalability has been supported through improved hardware, in particular GPUs (Mittal &
Vaishay, 2019) and hardware specifically designed for neural computation (Schuman et al., 2017).
Often, as we will see throughout this survey, the symbolic component of a neuro-symbolic
system is an SRL framework. Thus, neuro-symbolic AI builds heavily on SRL. On the one
hand, neuro-symbolic AI leverages the power of deep neural networks, which offer a complex
and high-dimensional hypothesis space. On the other hand, owing to hard constraints (e.g.
in safety-critical applications), data efficiency, transferability (e.g. one-shot and zero-shot
learning) and interpretability (e.g. program induction), symbolic constructs are increasingly
seen as an explicit representation language for the output of neural networks.
[Figure 1: The top level of our map of the neuro-symbolic AI landscape, distinguishing composite frameworks (Section 4) from monolithic frameworks (Section 5).]
This section describes the contributions of this survey, from which we derive the structure
of the remainder of this work.
A gentle introduction to SRL. Our first contribution is an in-depth, yet succinct,
introduction to SRL in Section 3. To this end, we discuss various concepts of probability,
logic, and SRL. We connect the different concepts through examples, which we continuously
extend throughout this survey to illustrate the commonalities and differences between the
methods. We aim to give enough details to enable a robust understanding of the frameworks
discussed in this survey without getting lost in details that are not pertinent.
A map of the neuro-symbolic AI landscape by architectures. Our second contri-
bution is a map of the neuro-symbolic AI research area, illustrated in Figure 1, through
the lens of the architectures of the different frameworks. At the top level, we distinguish
between composite and monolithic frameworks. While composite frameworks keep the
symbolic (i.e. logical) and neural components separate, monolithic frameworks integrate
logical reasoning directly into the architecture of neural networks. The two groups are,
thus, complements of each other. At lower levels of the hierarchy, we differentiate how
symbolic and neural components are connected, as well as the types of neural models and
inference properties of the symbolic methods used in the frameworks.
A handbook for extending existing AI models with neuro-symbolic concepts.
Our third contribution is a guide for researchers on how to augment their existing neural
networks with concepts from neuro-symbolic AI. We therefore focus primarily on composite
frameworks (Section 4). As these frameworks tend to be model agnostic, they allow for a
simple extension of an existing logical or neural model.
A comprehensive account of neuro-symbolic concepts. While it is not in the scope
of this survey to discuss every paper, we aim to cover the area more broadly to position
regularisation techniques in relation to other approaches. To this end, Section 5 discusses the
complement of composite frameworks, i.e. monolithic frameworks. We keep the discussion
of such frameworks brief and refer the reader to d’Avila Garcez et al. (2019) for more details.
3. Preliminaries
This section briefly covers probabilistic models (Section 3.1), propositional and first-order
logic (Section 3.2), SRL (Section 3.3), and neural networks (Section 3.4). Since the field is
still developing, there are proposals that adapt some of these structures to fit their purpose,
and covering all variations is beyond the scope of this survey. We aim to cover the essentials
which should be sufficient for someone to get started in the field of neuro-symbolic AI.
where, assuming x = I(X), for some instantiation I, xi = I(X i ), with i ranging over
factors. Z is the partition function, i.e. a normalising constant, given by
Z = ∑_{x ∈ I(X)} ∏_{i=1}^{M} fi(xi) .  (2)
The factors can equivalently be written in log-linear form, with fi(xi) = exp(wi Fi(xi)), so that
P(X = x) = (1/Z) exp(∑_{i=1}^{M} wi Fi(xi)) ,  (3)
where Fi is a feature function and wi is its corresponding coefficient. A feature function can be any real-valued function
evaluating the state of (part of) the system. A state is a specific instantiation I of the
variables. In this survey, we consider binary features, i.e. Fi (xi ) ∈ {0, 1}, and features
mapping to the unit interval, i.e. Fi (xi ) ∈ [0, 1].
Example 1 (Factor Graph). Let us consider a set of seven Boolean RVs {X1 , . . . , X7 }, i.e.
Xi ∈ {0, 1}, and three factors {f1 , f2 , f3 }:
f1(X1, X2, X3, X4) = { 1 if X1 = X2 = X3 = 1 and X4 = 0; 2 otherwise }
f2(X3, X5, X6) = { 1 if X3 = X5 = 1 and X6 = 0; 4 otherwise }
f3(X4, X6, X7) = { 1 if X4 = X6 = 1 and X7 = 0; 3 otherwise }.
These factors can be visualised by a factor graph (Figure 2), or written in log-linear form
by assigning weights w1 = ln(2), w2 = ln(4), w3 = ln(3) and feature functions:
F1(X1, X2, X3, X4) = { 0 if X1 = X2 = X3 = 1 and X4 = 0; 1 otherwise }
F2(X3, X5, X6) = { 0 if X3 = X5 = 1 and X6 = 0; 1 otherwise }
F3(X4, X6, X7) = { 0 if X4 = X6 = 1 and X7 = 0; 1 otherwise }.
Note that using these features in the log-linear model of Equation (3) is equivalent to using
the original factors in Equation (1), e.g. exp(1 · w1 ) = exp(ln(2)) = 2 and exp(0 · w1 ) = 1.
[Figure 2: The factor graph of Example 1, with variable nodes X1–X7 connected to the factor nodes f1, f2, f3.]
Computing marginals is generally intractable. Consider, for example, a distribution over just
100 Boolean variables: to compute marginals, one would need to sum over 2^100 states.
However, note that the factorisation in Equation (1) implies that an RV only depends on
the factors it is connected to. One option to compute marginals efficiently in factor graphs
is belief propagation (Koller & Friedman, 2009). In this algorithm, messages, which
encode the node’s belief about its possible values, are passed between connected nodes. The
marginalisation then reduces to a sum of products of simpler terms compared to the full
joint distribution (which is why the algorithm is also referred to as sum-product message
passing ), thereby reducing the computational complexity.
P(X'u = x'u | Xo = xo) = P(X'u = x'u , Xo = xo) / P(Xo = xo) .  (5)
The MAP can be computed using max-product message passing, where the summation in the
messages is replaced by taking the maximum. The operation MAP(Xu = xu | Xo = xo), i.e.
predicting all unobserved RVs, is called the most probable explanation (MPE).
P(X7 = 1 | X4 = 1, X6 = 1) = P(X7 = 1, X4 = 1, X6 = 1) / P(X4 = 1, X6 = 1)
= P(X7 = 1, X4 = 1, X6 = 1) / (P(X7 = 1, X4 = 1, X6 = 1) + P(X7 = 0, X4 = 1, X6 = 1))
= f3(X7 = 1, X4 = 1, X6 = 1) / (f3(X7 = 1, X4 = 1, X6 = 1) + f3(X7 = 0, X4 = 1, X6 = 1))
= 3/4 ,
where, first, we used Equation (5), second, we used Equation (4) and third, we cancelled
contributions from f1 , f2 , and the partition function Z.
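To make this computation concrete, the following Python sketch enumerates all 2^7 states of the factor graph of Example 1 and recovers the conditional probability above. It is a minimal illustration (brute-force enumeration rather than belief propagation) and not the implementation of any particular framework.

from itertools import product

# Factors of Example 1 (states are 0/1 tuples (x1, ..., x7), 0-indexed below).
def f1(x): return 1 if x[0] == x[1] == x[2] == 1 and x[3] == 0 else 2
def f2(x): return 1 if x[2] == x[4] == 1 and x[5] == 0 else 4
def f3(x): return 1 if x[3] == x[5] == 1 and x[6] == 0 else 3

def weight(x):                       # unnormalised probability of a state
    return f1(x) * f2(x) * f3(x)

states = list(product([0, 1], repeat=7))
Z = sum(weight(x) for x in states)   # partition function, Equation (2)

# P(X7 = 1 | X4 = 1, X6 = 1), as computed in Example 2
num = sum(weight(x) for x in states if x[3] == 1 and x[5] == 1 and x[6] == 1)
den = sum(weight(x) for x in states if x[3] == 1 and x[5] == 1)
print(num / den)                     # 0.75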
Markov random fields (MRFs) (Kindermann & Snell, 1980). Markov random
fields or Markov networks are undirected probabilistic graphical models, where the nodes
of the graph represent RVs and the edges describe Markov properties. The local Markov
property states that any RV is conditionally independent of all other RVs given its neigh-
bours. There are three Markov properties (pairwise, local, and global). However, for positive
distributions (i.e. distributions with non-zero probabilities for all variables) the three are
equivalent (Koller & Friedman, 2009). Each maximal clique in the graph is associated with
a potential (in contrast to factor graphs, the potentials are not explicit in the graph), and
the MRF then defines a probability distribution equivalently to Equation (1). An MRF
can be converted to a factor graph by creating a factor node for each maximal clique and
connecting it to each RV from that clique. Figure 3 shows, on the left, the MRF that is
equivalent to the factor graph of Example 1, shown on the right.
Figure 3: The MRF (left) and the equivalent factor graph (right) for Example 1.
Bayesian networks (BNs) (Pearl, 1988). Bayesian Networks or belief networks are
directed probabilistic graphical models where each node corresponds to a random variable,
and each edge represents a direct probabilistic dependence between the corresponding random variables.
The probability distribution expressed by the BN is defined by providing the conditional
probabilities for each node given its parent nodes’ states. A BN can be converted to a
factor graph by introducing a factor fi for each RV Xi and connecting the factor with the
parent nodes and itself. fi represents the conditional probability distribution of Xi given
its parents. Figure 4 illustrates an example of a BN and an equivalent factor graph.
Figure 4: Example of a Bayesian network (left) and an equivalent factor graph (right).
PΦ(X = x) := (1/Z) ∏_{i=1}^{M} ∏_{Xj ∈ IC(Xi)} ϕi(xj) ,  (6)
where X j ⊆ X are the different sets that can be instantiated from the par-RVs X i partic-
ipating in the par-factor ϕi , and xj ⊆ x = I(X), such that xj = I(X j ).
An important goal of parameterised factor graphs is to serve as a syntactic convenience,
providing a much more compact representation of relationships between objects in prob-
abilistic domains. The same models can be represented just as well as propositional or
standard graphical models. However, as we will see in Section 3.3.1, the more succinct
representation can also lead to more efficient inference.
Example 3 (Par-factor graph). Let us consider a par-factor graph Φ with just one par-
factor ϕ1 (X1 , X2 , Y), where the par-RVs can be instantiated as follows IC (X1 ) ∈ {X1 , X2 },
IC (X2 ) ∈ {X1 , X2 }, and IC (Y) ∈ {Y11 , Y12 , Y21 , Y22 }. The constraint set C is given as: if
IC (X1 ) = Xi and IC (X2 ) = Xj , then IC (Y) = Yij . Then, we can instantiate the following
factors f11 (X1 , X1 , Y11 ), f12 (X1 , X2 , Y12 ), f21 (X2 , X1 , Y21 ) and f22 (X2 , X2 , Y22 ). Figure 5
shows the par-factor graph Φ on the left and the instantiated factor graph F on the right.
Figure 5: The par-factor graph (left) and instantiated factor graph (right) of Example 3.
3.2 Logic
TL;DR (Logic). Logic allows us to reason about connections between objects and express
rules to model worlds. The goal of logic programming is threefold:
• State what is true: Alice likes Star Wars.
• Check whether something is true: Does Alice like Star Wars?
• Check what is true: Who likes Star Wars?
This section begins by introducing propositional logic (Section 3.2.1) and first-order logic
(Section 3.2.2), and finishes with a brief introduction to logic programming (Section 3.2.3) –
a programming paradigm which allows us to reason about a database using logical theories.
We present the syntaxes of the different languages and their operations; for the semantics,
the reader is referred to Bradley and Manna (2007).
φ1 := SA ∧ SB ∧ CAB → FAB
φ2 := SA ∧ LAW → LAT (7)
φ3 := FAB ∧ LAT → LBT
All three formulae are written as logical implications and form together the theory φ.
{SA = ⊤, SB = ⊤, CAB = ⊤, FAB = ⊤, LAW = ⊤, LAT = ⊥, LBT = ⊤} is an interpretation of
the theory, but not a model, since {SA = ⊤, LAW = ⊤, LAT = ⊥} ̸|= φ2 .
A problem of propositional logic becomes apparent from this example: once we want to
generalise these rules to a large set of people, we would need to create a new logical variable
for each person. First-order logic allows us to reason at a higher level and abstract the
problem away by reasoning about groups rather than individual instances.
where Users = {alice, bob} and Items = {startrek} are sets of constants. φ3 in (7) is,
thus, equivalent to a ground or instantiated case of this rule.
The Herbrand base in logic programming consists of two sets: the abducibles A, with
α ⊆ A, and the outcomes O. The two sets are disjoint, i.e. A ∩ O = ∅. In logic
programming, we operate under the closed world assumption, i.e. any ground atom
in the abducibles that is not in the input facts is assumed to be false: ∀α ∈ A \ α : α is
False. Further, all groundings of atoms in the conclusion of the rules are in O.
The Herbrand instantiation HI (P) is the set of all ground rules obtained after
replacing the variables in each rule in ρ with terms from its Herbrand universe in every
possible way. A Herbrand interpretation or possible world I, is a mapping of the
ground atoms in the Herbrand base to truth values. For brevity, we will use the notation
α ∈ I (resp. α ̸∈ I), when a ground atom α is mapped to True (resp. False) in I. A
partial interpretation is a truth assignment to a subset of atoms of the Herbrand base.
A Herbrand interpretation I is a model M of P if all rules in ρ are satisfied. A model
M of P is minimal if no other model M′ of P has fewer atoms mapped to True than M.
Such a model is called the least Herbrand model. Each logic program has a unique least
Herbrand model, which is computed via the consequence operator.
Definition 1 (Consequence operator). For a logic program P and a partial interpretation
I of P, the consequence operator is defined by
TP(I) := {α | α ∈ I or ∃(α ← ⋀_{αp ∈ αp} αp) ∈ HI(P), s.t. ∀αp ∈ αp : αp ∈ I}
where αp denotes the ground atoms in the premise of a rule in the Herbrand instantiation.
Consider the sequence I1 = TP (∅), I2 = TP (I1 ), . . . . Let n be the smallest positive integer
such that In = TP (In ), then In is the least Herbrand model of P.
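As an illustration, the following Python sketch computes the least Herbrand model of a small, hand-grounded logic program by iterating the consequence operator of Definition 1 to a fixed point. The ground rules and facts are a simplified grounding of the recommendation example used throughout this survey; the atom strings are illustrative.

# Least Herbrand model via the consequence operator (Definition 1).
facts = {"knownlikes(alice,starwars)", "friends(alice,bob)",
         "similar(starwars,startrek)"}

# Each ground rule is (head, body): the head is derived once every body atom holds.
rules = [
    ("likes(alice,starwars)", {"knownlikes(alice,starwars)"}),
    ("likes(alice,startrek)", {"likes(alice,starwars)", "similar(starwars,startrek)"}),
    ("likes(bob,startrek)",   {"friends(alice,bob)", "likes(alice,startrek)"}),
]

def consequence(interpretation):
    """One application of T_P: keep I and add every derivable rule head."""
    derived = {head for head, body in rules if body <= interpretation}
    return interpretation | derived

model = set(facts)
while True:                          # iterate until the fixed point is reached
    nxt = consequence(model)
    if nxt == model:
        break
    model = nxt

print("likes(bob,startrek)" in model)   # True: the query is entailed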
Entailment. A program P entails a ground atom α – denoted as P |= α or ρ ∪ α |= α
– if α is True in every model of P. Note that if α is True in every model of P, then it is
True in the least Herbrand model of P.
with Users = {alice, bob}, Items = {starwars, startrek}, and a set of input facts
Rule (10) is a solution to enable partial knowledge of item preferences. It states that for
the cases where we know that a user likes an item, we can assign likes(U, I) to True. This
is necessary since the predicates in the heads of the rules need to be distinct from the
predicates in the input facts.
[Figure 6: Backward-chaining proof tree for the query likes(bob, startrek); both leaves correspond to the same abductive proof, consisting of knownlikes(alice, starwars), friends(alice, bob), and similar(starwars, startrek).]
We can check whether program P entails likes(bob, startrek) by building a proof using
backward chaining as shown in Figure 6. On the top level of the proof tree, we check
whether likes(bob, startrek) ∈ α. Since that is not the case, we examine at the second
level two ways of proving likes(bob, startrek): either bob likes a movie similar to startrek
or he is friends with someone who likes startrek. Neither case is to be found directly in
α. However, the similar and friends atoms are part of the abducibles, and specifically,
we find similar(starwars, startrek) and friends(bob, alice) in the input facts. Since
similar(starwars, startrek) is in the input, at the third level on the left branch, we check
whether there is a friend of bob who likes starwars; and at the third level on the right
branch, we check whether alice likes a movie similar to startrek, since we know that
friends(bob, alice). On the last level of the proof, the program then finds that, since we
have knownlikes(alice, starwars), likes(alice, starwars) is True; since we know that
similar(starwars, startrek), alice also likes startrek; and since alice is friends with
bob, he also likes startrek. Note that both proofs are the same. Similarly, we
can perform a non-boolean query for all possible assignments to likes(U, startrek), which
from the above proof would return likes(alice, startrek) and likes(bob, startrek).
Alternatively, one can reason about this query via forward chaining by applying the
consequence operator from Definition 1. After the first iteration,
where the new fact is derived from (10). Then, in the second step, we can derive
where the first new fact is derived from (8), and the second from (9). Finally,
where the new fact is derived from (9). Since likes(bob, startrek) ∈ TP3 (α), we can stop
the computation here and return an affirmative answer to the query.
Logical abduction (Kakas, 2017). In the logic programming community, abduction
for a query Q ⊂ O consists in finding all possible sets of input facts α ⊆ A, such that
P(ρ, α) |= Q. Abduction then returns a formula φA , which is a disjunction where each
disjunct is a conjunction of the ground atoms in α ⊆ A such that ρ ∪ α |= Q, i.e.
φ^Q_A := ⋁_{αi ⊆ A s.t. αi ∪ ρ |= Q} ⋀_{αj ∈ αi} αj .  (11)
The disjuncts, i.e. ⋀_{αj ∈ αi} αj, are called abductive proofs, and we denote the process of
finding all abductive proofs by abduce(ρ, A, Q). The leaves of Figure 6 are two abductive
proofs of abduce(ρ, A, likes(bob, startrek)).
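The sketch below illustrates abduce on a hand-grounded (and deliberately partial) version of the recommendation rules: every atom that does not occur as a rule head is treated as an abducible, and each returned set of abducibles is one abductive proof. It is a toy illustration, not the procedure used by actual logic programming engines, and the grounding and atom strings are ours.

from itertools import product

# head -> list of bodies (each body is a tuple of ground atoms)
rules = {
    "likes(alice,starwars)": [("knownlikes(alice,starwars)",)],
    "likes(alice,startrek)": [("knownlikes(alice,startrek)",),
                              ("likes(alice,starwars)", "similar(starwars,startrek)")],
    "likes(bob,starwars)":   [("knownlikes(bob,starwars)",),
                              ("friends(alice,bob)", "likes(alice,starwars)")],
    "likes(bob,startrek)":   [("knownlikes(bob,startrek)",),
                              ("likes(bob,starwars)", "similar(starwars,startrek)"),
                              ("friends(alice,bob)", "likes(alice,startrek)")],
}

def abduce(atom, seen=frozenset()):
    """Return all abductive proofs of `atom` as sets of abducible ground atoms."""
    if atom not in rules:                        # abducible: the proof is the atom itself
        return {frozenset([atom])}
    proofs = set()
    for body in rules[atom]:
        if seen & set(body):                     # avoid cyclic derivations
            continue
        sub_proofs = [abduce(b, seen | {atom}) for b in body]
        for combination in product(*sub_proofs): # one proof per combination of sub-proofs
            proofs.add(frozenset().union(*combination))
    return proofs

for proof in sorted(abduce("likes(bob,startrek)"), key=len):
    print(sorted(proof))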
where X j ⊆ X are the different sets that can be instantiated from the par-RVs X i partic-
ipating in the par-factor ϕi , and xj ⊆ x = I(X), such that xj = I(X j ). Here, X map to
atoms, X to ground atoms, and x to truth assignments of the ground atoms. Note that
X = I C (X ), i.e. the atoms of the theory are grounded in every possible way. Thus, the
outer sum iterates over the different formulae and the inner sum iterates over the different
groundings of each formula. Then, one can compute marginals and conditional probabilities
as in Equations (4) and (5).
Example 8 (Lifted graphical model). Let us construct an LGM for the theory of Example 5
with Users = {alice, bob} and Items = {startrek}.
1. We assign a confidence value w1 to the formula:
w1 : ∀U1 , U2 ∈ Users, ∀I ∈ Items : friends(U1 , U2 ) ∧ likes(U1 , I) → likes(U2 , I) (13)
3. The left-hand side of Figure 7 shows how the par-RVs are connected through par-
factors, and the right-hand side shows the instantiated factor graph for the given sets.
Note that the same graphs are obtained as in Example 3.
Figure 7: The par-factor graph representing Formula (13) (left) and an instantiated factor
graph based on the sets Users = {alice, bob} and Items = {startrek} (right). For
readability, we used f for friends, l for likes, a for alice, b for bob, and s for startrek.
Inference. Inference consists in computing the MPE for a set of unobserved variables X u ,
given assignments to observed variables X o = xo by means of a conditional distribution:
x̂u = arg max_{xu} PL(Xu = xu | Xo = xo; wL) .  (14)
Training. The weights wL are learned by maximising the log-likelihood of the training data:
ŵ^{t+1}_L = arg max_{wL} log PL(X = x; wL) .  (15)
Equation (15) works when x provides truth assignments to all variables in X. If this
assumption is violated, i.e. unobserved variables exist in the training data, training resorts
to an expectation-maximisation problem. Parameter learning in LGMs works via gradient
ascent (Richardson & Domingos, 2006; Bach et al., 2017).
In most neuro-symbolic frameworks, it is assumed that a logical theory is given and that
the only trainable parameters are wL . However, algorithms to learn the theory, known as
structure learning, exist. The general approach consists of three steps: i) finding com-
monly recurrent patterns in the data, ii) extracting formulae from the patterns as potential
candidates, iii) reducing the set of candidates to the formulae explaining the data best. The
first step helps to reduce the search space, as generally, the number of possible formulae
grows exponentially. Khot et al. (2015) use user-defined templates as a starting point to
find formulae. However, this approach still requires user input. Kok and Domingos (2010)
and Feldstein et al. (2023b) present algorithms based on random walks to find patterns in a
hypergraph representation of the relational data. However, these algorithms fail to scale past
O(10^3) relations. Feldstein et al. (2024) present a scalable (O(10^6)) algorithm that avoids
expensive inference by estimating the “usefulness” of candidates up-front, but it only finds
rules of a specific form.
Example 9 (Ground MLN). Consider mapping the atoms of the formulae in Example 4
to RVs as {SB ↦ X1 , CAB ↦ X2 , SA ↦ X3 , FAB ↦ X4 , LAW ↦ X5 , LAT ↦ X6 , LBT ↦ X7 }
and assign weights w1 = ln(2), w2 = ln(4), w3 = ln(3) to the respective formulae:
φ1 := SA ∧ SB ∧ CAB → FAB
φ2 := SA ∧ LAW → LAT (16)
φ3 := FAB ∧ LAT → LBT
Then, the factors in Example 1 implement the formulae in (16) as a ground MLN, as
illustrated in Figure 8. Firstly, the weights of the formulae match the weights of the log-
linear model, and secondly, the features in Example 1 evaluate to 1 when the respective
formulae in (16) are satisfied. For example, consider the feature F3 that would evaluate φ3
F3(X4, X6, X7) = { 0 if X4 = X6 = 1 and X7 = 0; 1 otherwise }.
Under the above mapping {FAB ↦ X4 , LAT ↦ X6 , LBT ↦ X7 }, and {FAB = True,
LAT = True, LBT = False} ̸|= φ3 , while all other truth assignments to the logical vari-
ables satisfy φ3 . Note that Figure 8 is equivalent to the MRF of Example 1 illustrated in
Figure 3, where the RVs have been replaced by the atoms. In Example 2, we computed
that P (X7 = 1 | X4 = 1, X6 = 1) = 0.75. Thus, we can conclude that, given FAB = True
and LAT = True as evidence, the MLN in this example predicts P (LBT = True) = 0.75.
[Figure 8: The ground MLN implementing the formulae in (16), whose nodes are the atoms SA, SB, CAB, FAB, LAW, LAT, and LBT.]
Probabilistic soft logic (PSL) (Bach et al., 2017). Similarly to MLNs, PSL defines,
for a tuple L(ρ; wL ), how to instantiate an MRF from the grounded formulae. However,
PSL has four major differences:
2. While in MLNs, all ground atoms take on Boolean values (True or False), in PSL,
ground atoms take soft truth values from the unit interval [0, 1]. Depending on the
application, this allows for two interpretations of likes(alice, starwars) = 0.7 and
likes(alice, startrek) = 0.5: it could either be understood as a stronger confidence
in the fact that alice likes starwars rather than startrek or it can be interpreted
as alice liking starwars more than startrek.
3. PSL uses a different feature function F to compute the rule satisfaction: For a
grounding X i of a rule ρi and an assignment xi , the satisfaction of ρi computed
by Fi (X i = xi ) in PSL is evaluated using the Lukasiewicz t-(co)norms, defined as
follows 2 :
xi ∧ xj = max{xi + xj − 1, 0}
xi ∨ xj = min{xi + xj , 1} (17)
¬x = 1 − x
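As a small illustration, the connectives in (17) can be written directly as Python functions and used to score the satisfaction of a single ground rule under soft truth values; the values below are illustrative.

# Lukasiewicz t-(co)norms of Equation (17) applied to one ground rule,
# friends(alice,bob) AND likes(alice,startrek) -> likes(bob,startrek).
def t_and(a, b): return max(a + b - 1.0, 0.0)
def t_or(a, b):  return min(a + b, 1.0)
def t_not(a):    return 1.0 - a

def implies(premise, conclusion):                # a -> b is (not a) or b
    return t_or(t_not(premise), conclusion)

friends, likes_alice, likes_bob = 0.7, 0.7, 0.2  # illustrative soft truth values
print(implies(t_and(friends, likes_alice), likes_bob))   # min(1 - 0.4 + 0.2, 1) = 0.8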
Remark 2 (PSL optimisation). Note that, because the truth values are soft and the feature
functions in PSL are continuous, in contrast to MLNs, maximising the potential functions
becomes a convex optimisation problem and not a combinatorial one, allowing the use of
standard optimisation techniques (e.g. quadratic programming). Bach et al. (2017) intro-
duced an even more efficient MAP inference based on consensus optimisation, where the
optimisation problem is divided into independent subproblems and the algorithm iterates
over the subproblems to reach a consensus on the optimum.
Example 10 (PSL). Let us compute the soft truth value of likes(bob, startrek) for the
theory of Example 8, given a partial interpretation with soft truth values:
2. Notice that we use the same symbols as in classical logic for the logical connectives for convenience.
However, the connectives here are interpreted as defined above.
Step 1 Rewrite Formula (13) in clausal form:
f(U1, U2) ∧ l(U1, I) → l(U2, I) ≡ ¬(f(U1, U2) ∧ l(U1, I)) ∨ l(U2, I)
≡ ¬f(U1, U2) ∨ ¬l(U1, I) ∨ l(U2, I)  (18)
Step 2 Find the possible groundings of Equation (18):
1. ¬f(a, b) ∨ ¬l(a, s) ∨ l(b, s)
2. ¬f(b, a) ∨ ¬l(b, s) ∨ l(a, s)
3. ¬f(a, a) ∨ ¬l(a, s) ∨ l(a, s)
4. ¬f(b, b) ∨ ¬l(b, s) ∨ l(b, s)
Step 3 Compute the potentials:
1. F (x1 ) = min{xs + (1 − 0.7) + (1 − 0.7), 1} = min{0.6 + xs , 1}
2. F (x2 ) = min{(1 − 0.7) + (1 − xs ) + 0.5, 1} = min{1.8 − xs , 1} = 1
3. F (x3 ) = min{(1 − 1) + (1 − 0.7) + 0.7, 1} = min{1, 1} = 1
4. F (x4 ) = min{(1 − 1) + (1 − xs ) + xs , 1} = min{1, 1} = 1
xi are the soft truth values of the ground atoms of the ground rule i (of Step 2), xs is the
to-be-determined soft truth value of likes(bob, startrek), and we applied (17).
Step 4 Then, the probability density function is given as
P(x) = exp(−0.5 (1 − min{0.6 + xs, 1})) / Z(x) = exp(−0.5 max{0.4 − xs, 0}) / Z(x) ,
with
Z(x) = ∫_0^1 exp(−0.5 max{0.4 − xs, 0}) dxs
= ∫_0^{0.4} exp(−0.5 (0.4 − xs)) dxs + ∫_{0.4}^1 1 dxs
= exp(−0.2) ∫_0^{0.4} exp(0.5 xs) dxs + 0.6
= exp(−0.2) (2 exp(0.5 · 0.4) − 2 exp(0)) + 0.6
≈ 0.963 .
Step 5 Compute the mean estimate for P (likes(bob, startrek)):
⟨xs⟩ = ∫_0^1 xs · P(xs) dxs
= ∫_0^{0.4} xs · exp(−0.2 + 0.5 xs) / 0.963 dxs + ∫_{0.4}^1 xs · 1/0.963 dxs
≈ 0.515 .
Note that in general Step 4 and Step 5 are more complex, and PSL resorts to inference
algorithms as per Remark 2.
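To make Steps 4 and 5 concrete, the following Python sketch evaluates the same density numerically on a grid (rather than in closed form) and recovers Z ≈ 0.963 and a mean soft truth value of roughly 0.51 for likes(bob, startrek), in line with the derivation above.

import math

def density(xs):                                 # unnormalised density of Step 4
    return math.exp(-0.5 * max(0.4 - xs, 0.0))

n = 100_000
grid = [(i + 0.5) / n for i in range(n)]         # midpoint rule on [0, 1]
Z = sum(density(x) for x in grid) / n
mean = sum(x * density(x) for x in grid) / (n * Z)

print(round(Z, 3), round(mean, 3))               # approx. 0.963 and 0.514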
TL;DR (Weighted model counting). Weighted model counting (Chavira & Darwiche, 2008)
is an extension of model counting (#SAT) with weights on literals that can be used to
represent probabilities.
Definition 2 (WMC). Let φ be a propositional theory, Xφ be the set of all Boolean variables
in φ, w : Xφ → R≥0 and w̄ : Xφ → R≥0 be two functions that assign weights to all atoms of
φ, and M be a model of φ. The WMC of φ is defined as
WMC(φ; w, w̄) := ∑_{M |= φ} ∏_{X ∈ M} w(X) ∏_{¬X ∈ M} w̄(X) .  (19)
When w captures exactly the probability of a Boolean variable X being true, i.e. w(X) ∈ [0, 1]
and w̄(X) = 1 − w(X), WMC captures the probability of a formula φ being satisfied, by
treating each Boolean variable X as a Bernoulli variable that becomes true with probability
w(X) and false with probability 1 − w(X). The outer summation in Equation (19) iterates
over all models of φ, while the product computes the probability of instantiating M which,
due to the assumption of independence, is the product of the weights of the literals in M.
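A direct, if exponential, way to evaluate Definition 2 is to enumerate all assignments. The Python sketch below does so for a single implication, FAB ∧ LAT → LBT (the formula we assume Example 11 refers to, cf. (7) and (16)), using the weights given in Example 11.

from itertools import product

# Brute-force WMC (Definition 2) for the theory FAB AND LAT -> LBT.
w = {"FAB": 0.1, "LAT": 0.9, "LBT": 0.5}         # and w_bar(X) = 1 - w(X)

def satisfies(assignment):                        # the single formula of the theory
    return not (assignment["FAB"] and assignment["LAT"] and not assignment["LBT"])

wmc = 0.0
for values in product([False, True], repeat=3):
    assignment = dict(zip(w, values))
    if satisfies(assignment):                     # sum over models of the theory only
        model_weight = 1.0
        for atom, value in assignment.items():
            model_weight *= w[atom] if value else 1.0 - w[atom]
        wmc += model_weight

print(wmc)                                        # approx. 0.955 = 1 - 0.1 * 0.9 * 0.5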
Example 11. Consider the last formula of Example 4 as the theory φ for this example:
We list all possible interpretations of φ in Table 3. Observe that I2 is the only interpretation
that is not a model of φ. We denote each interpretation Ii that is a model by Mi .
Given a weight function that assigns the following weights to the atoms of the theory,
w(FAB ) = 0.1, w(LAT ) = 0.9, w(LBT ) = 0.5, and w̄(X) = 1 − w(X) for all variables, we can compute the WMC of φ.
Exact WMC solvers are based on knowledge compilation (Darwiche & Marquis, 2002) or
exhaustive DPLL search (Sang et al., 2005). Knowledge compilation is a paradigm which
aims to transform theories to a format, such as circuits, that allows one to compute queries
such as WMC in time polynomial in the size of the new representation (Darwiche & Marquis,
2002; Van den Broeck et al., 2010; Muise et al., 2012). The compilation of such circuits is
generally #P-complete, and thus exponential in the worst case. The benefit of compiling
such circuits is that it allows for efficient repeated querying, i.e. once the circuit is compiled
one can train the weights of the model efficiently. Approximate WMC algorithms use local
search (Wei & Selman, 2005) or sampling (Chakraborty et al., 2014).
The definition we discussed above supports propositional formulae and assumes discrete
domains. Several extensions to generalise WMC to first-order logic and continuous domains
have been proposed. WFOMC (Van den Broeck et al., 2011) lifts the problem to first-order
logic and allows in the two-variable fragment (i.e. logical theories with at most two variables)
to reduce the computational cost from #P-complete to polynomial-time in the domain size.
WMI (Belle et al., 2015) extends WMC to hybrid (i.e. mixed real and Boolean variables)
domains, allowing us to reason over continuous values. WFOMI (Feldstein & Belle, 2021)
combines the two extensions to allow for lifted reasoning in hybrid domains.
Inference. As for logic programs, the set of outcomes O is disjoint from the abducibles
A, i.e. A ∩ O = ∅. The success probability of a query Q ∈ O is defined as
PP(Q | w) := ∑_{αi ∈ φ^Q_A} PP(αi; w) ,  (21)
where αi ∈ φ^Q_A is the set of facts in a disjunct of the abductive formula
φ^Q_A := ⋁_{αi ⊆ A s.t. αi ∪ ρ |= Q} ⋀_{αj ∈ αi} αj .  (22)
Intuitively, it is the sum of the probabilities of all interpretations of A that, together with
the rules ρ, entail Q. Finding all interpretations of A that entail Q is done by abduction. A full
abduction is generally intractable since there are 2^{|A|} possible worlds. However, in general,
we do not need a full abductive formula, since most facts in A have a zero probability and
can thus be ignored. One option to reason efficiently over PLPs is, thus, to only extend
branches in proof trees that have a high likelihood of being successful and ignore branches
with facts with low probabilities (Huang et al., 2021). Another option is to avoid redundant
computations in the construction of φ^Q_A (Tsamoura et al., 2023). For example, notice how
the two proofs in Example 7 are equivalent. Further, notice that the semantics of PLPs
in Equations (20) and (21) match Definition 2 of WMC, where w now encodes the
probabilities of facts and their complements. Thus, to compute the success probability we
can compile the abductive formula φ^Q_A into a circuit and compute WMC(φ^Q_A; w). Other
(fuzzy) semantics to compute the satisfaction of φ^Q_A can be used (Donadello et al., 2017).
Often, the outcomes in O are mutually exclusive, i.e. only one oi ∈ O can hold at a time.
For example, when predicting what action an autonomous agent should take, it should be
one out of a finite list. Computing the most probable outcome is known as deduction,
i.e. deduce(ρ; w) := arg max_{o ∈ O} PP(o | w).
Example 12 (PLP). Let us extend Example 7 with the following probabilistic facts:
Note that all other facts in A are assumed to have zero probability of being True. To
compute the probability of the PLP entailing likes(bob, startrek), we can:
From Example 7, we know that for likes(bob, startrek) to be True the last three facts of
(24) need to be True, whereas the other three facts are inconsequential, i.e.
Training. Learning the probabilities of facts in a PLP typically relies on circuit compi-
lation of the abductive formula and performing WMC (Fierens et al., 2015). Similarly to
LGMs, there are algorithms for learning the rules of PLPs (De Raedt & Kersting, 2004).
More broadly, the field of learning the rules of logic programs is known as inductive logic
programming (Muggleton & De Raedt, 1994), which is outside the scope of this survey
as, generally, the frameworks presented here expect a (background) logical theory to be
provided.
TL;DR (Neural networks). Neural networks are machine learning models that consist of a
set of connected units – the neurons – typically organised in layers that are connected by
(directed) edges. Each neuron transmits a signal to the neurons it is connected to in the next
layer. The parameters wN of a neural network are the weights associated with the neurons
and edges. A neural network with more than three layers is called a deep neural network
(DNN). Figure 9 shows an abstract representation of a feed-forward neural network.
Figure 9: Abstract representation of a feed-forward neural network.
Figure 10: Abstract representation of an RBM.
E(v, h) := −bv^T v − bh^T h − v^T W h .
PN(v, h) := (1/Z) exp(−E(v, h)) ,  (25)
where Z is, as usual, the partition function computed over all possible assignments, im-
plying that a lower energy state is a more probable one. Inference then consists in finding
the assignments to h that maximise the probability in Equation (25), which is generally
intractable, and thus resorts to sampling methods. For example, Gibbs sampling itera-
tively samples each variable conditioned on the current values of all other variables, cycling
through all variables until the distribution converges.
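To make the energy-based formulation concrete, here is a small NumPy sketch that evaluates Equation (25) for a toy RBM by enumerating all states and performs a single Gibbs update of the hidden units; the sizes and randomly drawn parameters are purely illustrative.

import numpy as np
from itertools import product

rng = np.random.default_rng(0)
n_v, n_h = 3, 2                                   # a toy RBM: 3 visible, 2 hidden units
W = rng.normal(size=(n_v, n_h))
b_v = rng.normal(size=n_v)
b_h = rng.normal(size=n_h)

def energy(v, h):
    return -(b_v @ v) - (b_h @ h) - v @ W @ h

# Exact partition function by enumeration (only feasible for tiny models).
states = [np.array(s) for s in product([0, 1], repeat=n_v + n_h)]
Z = sum(np.exp(-energy(s[:n_v], s[n_v:])) for s in states)

v = np.array([1, 0, 1])
p_h = 1.0 / (1.0 + np.exp(-(b_h + v @ W)))        # one Gibbs step: P(h_j = 1 | v)
h = (rng.random(n_h) < p_h).astype(int)
print(np.exp(-energy(v, h)) / Z)                  # P_N(v, h) as in Equation (25)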
4. Composite Frameworks
This section discusses neuro-symbolic architectures, where existing neural or logical models
can be plugged in. These architectures have two separate building blocks: a neural com-
ponent N and a logical component L, where, typically, the logical component is used to
regularise the neural one. At the top level, we distinguish architectures based on how the
neural network is supervised. Table 4 provides a high-level comparison of the loss functions,
which will be explained in the relevant sections.
In direct supervision frameworks (Section 4.1), the output of the overall framework
is the same as the output of the neural network. The logical component only provides
an additional supervision signal to correct the neural network’s predictions. The neural
network is trained on the logical loss but also directly on the training labels (Table 4).
These frameworks are particularly suited for applications where a neural network is already
set up to solve a task, and the goal is to improve the model further (e.g. improving the
accuracy with limited data or enforcing guarantees).
In indirect supervision frameworks (Section 4.2), the neural network identifies pat-
terns, which are passed to a logical model for high-level reasoning. For example, the neural
network identifies objects in a traffic scene (e.g. a red traffic light), and then the logical
model deduces the action an autonomous car should take (e.g. stop). The output of the
framework is the prediction of the logical model. The training labels are provided only for
the reasoning step (e.g. stop), and the neural network is trained indirectly by having to
identify patterns that allow the logical model to correctly predict the training label (e.g. a
red traffic light or a stop sign but not a green traffic light).
Figure 11: Abstract representation of parallel (left) and stratified (right) architectures.
Symbol ⊕ denotes that the output ŷ is computed by “composing” the neural predictions
with the logical predictions.
In parallel approaches (left-hand side of Figure 11), the neural and logical model solve
the same task. Thus, the input data to both models is the same or there exists a simple
mapping to pre-process the data into the respective formats required by the two models.
The output of both models maps to the same range. The difference between the predictions
of the logical and neural models is then used as an additional loss term in the training
function of the neural model. The output of the logical model is the probability PL (e.g.
Equation (12)) or the MAP (Equation (14)).
In stratified approaches (right-hand side of Figure 11), the neural model makes pre-
dictions first, and its outputs are then mapped to atoms of a logical theory. Violations of
the logical theory are penalised in the loss function of the neural model. The output of the
logical model is, generally, the SAT.
We begin by describing the big picture and the operations in parallel supervision frame-
works, abstracting away details. We will provide more details about specific frameworks
that fit this architectural pattern in Section 4.1.2.
Using the entire distribution provides more information for the training than simply using
the MAP (or SAT in stratified supervision). However, it has its disadvantages when it
comes to constraint satisfaction. If the logical rules have large weights, the KL-term in
Equation (26) encourages the neural model to satisfy these constraints but the constraints
are not enforced. A solution to enforce constraints will be discussed in Section 4.1.3.
Figure 12: Abstract representation of parallel architectures used for direct supervision:
(from left to right) Teacher-Student, Deep Probabilistic Logic, Concordia.
Teacher-Student (Hu et al., 2016). The teacher-student framework is one of the ear-
liest attempts to combine the outputs of a neural and a logical model in a parallel fashion.
Hu et al. (2016) propose to use posterior regularisation by constructing an intermediate
probability distribution q by solving a minimisation problem, where q is optimised to be as
close as possible to the neural distribution (using a KL divergence) while satisfying a set of
soft constraints ρi ∈ ρ each weighted with a parameter wi . q is constructed by optimising
min_{q, ξ≥0} KL(q, PN) + C ∑_{i,j} ξi,j , such that ∀i, j : wi (1 − Eq[ρi,j(x, y)]) ≤ ξi,j ,  (27)
where the ξi,j ≥ 0 are the slack variables of the respective rules, C ≥ 0 is the regularisation
parameter, and i loops over the different rules, while j loops over the different groundings
of each rule, with ρi,j denoting the j-th grounding of the i-th rule. The logical model,
therefore, implements soft logic but does not abide by the semantics of PSL due to the way
the constraints are encoded in the optimisation problem. In particular, when solving the
optimisation problem in Equation (27), the slack variables ξi,j should approach 0 in the
objective of the linear program so that the expectation of the rule satisfaction becomes 1
in the constraint of the linear program. However, when the rule satisfaction becomes 1, the
weight wi of each constraint has no impact. The distribution q is used as a teacher to distil
the knowledge of the constraints into the neural network during training, which consists in
optimising
w^{t+1}_N = arg min_{wN} ((1 − π) ℓ(ŷN, y) + π ℓ(ŷN, arg max_y q)) ,  (28)
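As a minimal sketch of this distillation step for a single classification sample, assuming the teacher distribution q has already been obtained by solving Equation (27), the combined objective could be written as follows; the probabilities and names are illustrative.

import numpy as np

def cross_entropy(p, label):
    return -np.log(p[label] + 1e-12)

def teacher_student_loss(p_student, q_teacher, label, pi=0.5):
    """(1 - pi) * loss on the true label + pi * loss on the teacher's hard prediction."""
    teacher_label = int(np.argmax(q_teacher))     # arg max over q, as in the update above
    return (1 - pi) * cross_entropy(p_student, label) \
           + pi * cross_entropy(p_student, teacher_label)

p_student = np.array([0.2, 0.5, 0.3])             # softmax output of the student network
q_teacher = np.array([0.1, 0.2, 0.7])             # rule-regularised teacher distribution q
print(teacher_student_loss(p_student, q_teacher, label=1))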
The auxiliary distribution q is then used to regularise each of the components by minimising
the KL divergence between q and the two individual distributions of the logical and neural
model, respectively, i.e.
The benefit is that the neural and logical components are trained jointly and learn from each
other. However, note that the joint distribution in Equation (29) is constructed assuming
independence between the logical and neural distributions, i.e. PL · PN . This assumption is
generally an oversimplification since both models receive the same input and, therefore, are
not independent. Inference is performed using only the neural model, and the logical model
is dropped after training. DPL has been used in unsupervised settings for entity-linking
tasks and cross-sentence relation extraction.
Concordia (Feldstein et al., 2023a). Concordia wires a logical and neural component
in three ways. Firstly, the neural outputs can be used as priors for the logical model. For
example, in the recommendation task, the simplest option is to add the rule
wi : dnn(U, I) → likes(U, I) ,
where dnn(U, I) is the neural prediction for likes(U, I), i.e. the logical model is told that the
neural prediction is true with some confidence wi . Note that wi ∈ wL can be retrained in
each epoch (using Equation (15)) and the confidence in the neural predictions can increase
with each training epoch. Secondly, the logical model is used to distil domain expertise into
the neural network by minimising the KL divergence between the two models during the
training of the neural model as an additional supervision signal, i.e.
where the first summand minimises the difference between the prediction and the label, and
the second summand reduces the difference between the distributions of the models.
Thirdly, Concordia is the only model in this section using the logical model for inference
via a gating network G, which is used to combine the neural and logical model in a mixture-
of-experts approach.
Here, we present how neural networks can be regularised in a stratified fashion, and the
implications of using a SAT solver rather than a KL divergence. Details of specific imple-
mentations of stratified architectures are presented in Section 4.2.1.
Inference. The logic is generally only used during training, and thus, inference is the
same as for standard neural networks (Section 3.4).
Training. The neural network’s loss function is extended with an additional regularisation
term measuring how well the neural predictions satisfy a set of formulae, i.e.
ŵ^{t+1}_N = arg min_{wN} (π ℓ(ŷN, y) + (1 − π)(1 − SAT(φ(ŷN)))) ,  (30)
where SAT(φ(ŷ N )) ∈ [0, 1] is the satisfaction of the logical theory containing the con-
straints, given ŷ N the predictions of the neural model as assignments to the atoms in φ,
and π is an optional parameter to weight the enforcement of constraints. The main differ-
ence between the frameworks in this section is the semantics used to compute the SAT.
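A minimal sketch of the regularised objective in Equation (30) for a single sample is given below, using the Łukasiewicz semantics of (17) to score one illustrative ground constraint on the neural outputs; the constraint, parameter values, and names are ours.

import torch

def lukasiewicz_implies(a, b):                    # a -> b = min(1 - a + b, 1)
    return torch.clamp(1 - a + b, max=1.0)

def regularised_loss(y_hat, y, pi=0.7):
    supervised = torch.nn.functional.binary_cross_entropy(y_hat, y)
    sat = lukasiewicz_implies(y_hat[0], y_hat[1])  # constraint: output 0 implies output 1
    return pi * supervised + (1 - pi) * (1 - sat)

y_hat = torch.tensor([0.8, 0.3], requires_grad=True)   # neural predictions
y = torch.tensor([1.0, 1.0])                            # training labels
loss = regularised_loss(y_hat, y)
loss.backward()                                         # the SAT term also produces gradients
print(loss.item(), y_hat.grad)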
Figure 13: Impact of direct supervision on constraint satisfaction. The left plot displays
probability distributions of a neural network N and a logical model L. Here, L only
constrains the output to be in [0.3, 0.6] but does not discriminate between predictions within
the interval. The middle plot illustrates a distribution of a framework using a SAT loss with
π → 0, and the right plot illustrates a framework with a KL loss with π → 0.
DFL (Van Krieken et al., 2020). Differentiable Fuzzy Logic (DFL) takes the outputs of
the neural model as inputs to a logical model and computes the fuzzy maximum satisfiability
of the theory and the derivatives of the satisfiability function w.r.t. the neural outputs.
The derivatives are then used to update the parameters of the neural model to increase the
satisfiability. The loss function is defined as
ℓDFL(φ; ŷN) = − ∑_{φi ∈ φ} wi · SAT(φi(ŷN)) ,
where ŷN are the neural predictions, φi is a propositional formula, and SAT calculates the
satisfaction of the formula given the neural predictions. Van Krieken et al. (2020) compare a
range of different implementations of fuzzy logic to compute SAT(φi(ŷN)) and how the
derivatives of the corresponding functions behave. They find that some common fuzzy
logics either do not correct the premise or the conclusion, and that they do not behave well
on very imbalanced datasets. To counteract the identified edge cases, they introduce a new
class of fuzzy logic called sigmoidal fuzzy logic, which applies a sigmoid function to the
satisfaction of a fuzzy implication. The comparison of the different fuzzy logics illustrates
the behaviour of modus ponens, modus tollens, distrust, and exception correction. However,
the experiments are limited to the MNIST dataset (Deng, 2012) and simple rules.
Semantic loss (Xu et al., 2018). In order to regularise neural networks, Xu et al.
(2018) proposed a semantic loss based on possible world semantics. The output of a neural
network is mapped to a possible world of a propositional theory φ, which expresses the
constraints on the neural network. The semantic loss is defined as
ℓSL(φ; PN) ∝ − log ∑_{x |= φ} ∏_{i : x |= Xi} PN(Xi = xi) ∏_{i : x |= ¬Xi} (1 − PN(Xi = xi)) ,
i.e. the negative log probability of generating a state that satisfies the constraint when that
state is sampled with the probabilities in PN . Interestingly, computing this loss reduces to computing WMC(φ; PN )
(Section 3.3.3), where the weights are the probabilities for the different classes as predicted
by the neural model. This loss is differentiable and syntax-independent (i.e. two formulae
that are semantically equivalent have the same loss). Semantic loss achieves near state-of-
the-art results on semi-supervised experiments on simple datasets (MNIST (Deng, 2012),
FASHION (Xiao et al., 2017), CIFAR-10 (Krizhevsky, 2009)) while using < 10% of the
training data. Further, Xu et al. (2018) show a significant improvement in constraint
satisfaction when predicting structured objects in graphs, such as finding shortest paths.
However, all comparisons are w.r.t. simple neural networks with generic formulae.
Inconsistency loss (Minervini & Riedel, 2018). The semantic loss proposed by Minervini
and Riedel (2018) (called inconsistency loss) is defined over a logical rule ρ with a premise
atom αP and a conclusion atom αC . Seeing that a logical rule is not satisfied if the premise
is true but the conclusion is false, this loss penalises instances where the neural model assigns
the premise atom a higher probability than the conclusion atom.
Riedel (2018) generate a set of adversarial examples where the neural model does not
satisfy some of the constraints and train the neural model on these examples. While the
overall performance only improves slightly, the presented experiments include comparisons
of a neural network with and without the logical constraints on adversarial examples with
positive results regarding robustness. This should be expected, as one model has specifically
been trained on adversarial examples whereas the baseline has not. However, this framework
illustrates, similarly to the other stratified frameworks, how neural models can be pushed to
satisfy constraints, which in many applications could be more valuable than only improving
the accuracy. A limitation of this framework is that it only supports a single rule with one
premise and one conclusion atom. This framework was developed for NLP tasks.
DL2 (Fischer et al., 2019). Deep Learning with Differentiable Logic (DL2) proposes
its own fuzzy logic similar to PSL in (17). Fischer et al. (2019) argue that, in some
cases, PSL might stop optimising due to local optima, while DL2 would continue optimis-
ing due to different gradient computations. DL2 supports Boolean combinations of terms
as constraints, where a term is either a constant, a variable, or a differentiable function over
variables. In contrast to Xu et al. (2018), DL2 supports real variables (enabling constraints
for regression tasks). In addition, Fischer et al. (2019) provide a language to query the neu-
ral network. This language allows, for example, to find all input-output pairs that satisfy
certain requirements or to find neurons responsible for certain decisions, which could be
of interest to explainability but has not been tested in that regard. Similarly to Xu et al.
(2018), DL2 was tested on standard benchmarks (MNIST (Deng, 2012), FASHION (Xiao
et al., 2017), CIFAR-10 (Krizhevsky, 2009)) with generic constraints, and has only been
compared to purely neural but not neuro-symbolic baselines. While prediction accuracy
slightly decreases in some experiments, the satisfaction of constraints increases significantly.
One limitation across all of the above frameworks is that they only support propositional
logic (DL2 supports arithmetic expressions but no relations and quantifiers). In general,
this is sufficient for the task at hand, as the formulae are applied to the neural predictions
and the number of possible outputs is generally limited. However, there are instances where
a FOL constraint would help. For instance, when the task is to predict the actions of several
people across a sequence of images (Example 14) numerous predictions are made. In this
case, having logical constraints connecting the different predictions would be particularly
helpful as the actions performed by the different people are likely to be linked and could be
optimised jointly. One possible solution could be to lift the semantic loss using WFOMC
(Van den Broeck et al., 2011) or WFOMI (Feldstein & Belle, 2021), neither of which has
been tested so far. Rocktäschel et al. (2015) present a solution, which implements FOL,
where the framework learns the entity and relation embeddings maximising the satisfaction
of a fixed set of FOL formulae. Here, satisfaction is defined according to fuzzy semantics
using the product t-norm (in contrast to the Lukasiewicz t-(co)norms used in PSL (17)).
However, this framework is not model-agnostic and expects matrix factorisation neural
networks.
Here, we present a unified framework for describing indirect supervision techniques and
then highlight their main differences in Section 4.2.1. We abstract indirect supervision
frameworks using the notation proposed by Tsamoura and Michael (2021).
Figure 14: Abstract comparison of direct (left) and indirect (right) stratified architectures.
Example 15 (Indirect Supervision (Adapted from Tsamoura and Michael (2021))). Let
our dataset consist of images of a chessboard and assume we have:
1. a DNN that recognises the pieces occurring in an image of a chessboard,
2. a logical theory that encodes the rules of chess, and
3. the status of the black king (i.e. the black king is in a draw, safe, or mate position).
The goal is to amend the parameters of the DNN so that the recognised pieces represent
a chessboard in which the black king is in the given status. The main challenge here is
that we are only given the status of the black king as training labels. In other words, the
complete configuration of the board has to be deduced just from the image and the rules of
the game.
Assuming that the training label for the input chessboard is “mate”, abduction would
return a logical formula encoding all possible chessboard configurations where the black
king just got mated. If the DNN recognises any of the configurations, then reasoning using
the rules of the logical theory would entail that the black king is in a mate state. Given
a target, abduction returns all the possible inputs that should be provided to the logical
theory in order to deduce the target when reasoning using the rules of the theory.
As for (probabilistic) logic programming (Section 3.2.3 and 3.3.4), let A be the set of ab-
ducibles and O the set of outcomes. For a given outcome o ⊆ O, let φ^o_A := abduce(ρ; A; o)
denote the abductive formula – the disjunction of abductive proofs (Section 3.2.3). Observe
that for any fixed outcome there might exist zero, one, or multiple abductive proofs. Let X
be the space of possible inputs and Y = [0, 1]^k be the space of possible outputs of the neural
model. We denote by PN (y | x; wN ) the probability distribution of the neural model. For
notational simplicity, we assume that there is a function µ that maps each yi ∈ Y to an
abducible µ(yi ) = αi ∈ A.
Inference. Given a logical model with a theory ρ, the inference of the neuro-symbolic
system is the process that maps an input in X to an outcome subset of O as follows: For a
given input x, the neural model computes the probabilities over y, i.e. PN (y|x; wN ). The
probability of an abducible µ(yi ) = αi ∈ A is wi = PN (yi |x; wN ), and the logical model
computes the outcome o = deduce(ρ; PN (y|x; wN )) ∈ O ∪ {⊥}. Thus, inference proceeds
by running the inference mechanism of the logical model over the inferences of the neural
network. To simplify our notation, we use hρ (x) – the hypothesis function – to mean
deduce(ρ; PN (y|x; wN )). L can also implement deduction with a non-probabilistic logic
program, where instead of using the probabilities, we use the predictions for the abducibles,
e.g. if PN (yi |x; wN ) > 0.5 then αi ∈ α, and check which outcome is entailed by P(ρ, α).
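The following sketch illustrates this inference pipeline for the non-probabilistic variant; the neural model, the theory ρ, and the `deduce` routine are placeholders rather than components of any specific framework.

```python
import torch

def h_rho(x, neural_model, deduce, rho, threshold=0.5):
    """Hypothesis function h_rho(x) = deduce(rho; P_N(y | x; w_N)).

    Sketch of the (non-probabilistic) inference pipeline: `neural_model`
    maps an input to probabilities over the abducibles, and `deduce` is a
    placeholder for the inference routine of the logical component
    (e.g. a logic program encoding the rules of chess).
    """
    with torch.no_grad():
        probs = torch.sigmoid(neural_model(x))  # P_N(y | x; w_N)
    # Keep abducible alpha_i iff P_N(y_i | x; w_N) > threshold ...
    alpha = {i for i, p in enumerate(probs.tolist()) if p > threshold}
    # ... and let the logical model entail an outcome in O (or bottom).
    return deduce(rho, alpha)
```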
Training. As in standard supervised learning, consider a set of labelled samples of the
form {xj , f (xj )}j , with f being the target function that we wish to learn, xj corresponding
to the features of the sample, and f (xj ) being the label of the sample. Training seeks
to identify, after t iterations over a training set of labelled samples, a hypothesis function
htρ that sufficiently approximates the target function f on a test set of labelled samples.
Given a fixed theory ρ for the logical model, the only part of the hypothesis function
htρ (x) = deduce(ρ; PN (y|x; wN )) that remains to be learned is the function PN (y|x; wN ) implemented by the neural component. PN (y|x; wN ) is learnt in two steps:
1. For the label f (xj ) of a given sample, viewed as a (typically singleton) subset of O,
find the abductive feedback formula abduce(ρ; A; f (x)), i.e.
φ_A^{f(x)} := ∨_{αi ⊆ A s.t. αi ∪ ρ |= f(x)} ∧_{αj ∈ αi} αj .

2. Train the neural component so that the abductive formula φ_A^{f(x)} is satisfied under PN (y|x; wN ) by minimising a loss which, as remarked in Section 3.3.4, can be computed as −WMC(φ_A^{f(x)}; PN (y|x; wN )). However, other fuzzy logic semantics to compute the satisfaction can be used as well (Tsamoura & Michael, 2021). Critically, the resulting loss function is differentiable, even if the theory ρ of the logical model is not. By differentiating the loss function, we can use backpropagation to update the neural model.
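As an illustration only, the following sketch computes the loss −WMC(φ_A^{f(x)}; PN (y|x; wN )) by brute-force enumeration over the abducibles, treating the abductive formula as a disjunction of conjunctions; practical frameworks instead rely on knowledge compilation or approximations, as discussed below.

```python
from itertools import product
import torch

def abductive_loss(proofs, probs):
    """Negative weighted model count of the abductive formula.

    `proofs` is a list of abductive proofs, each a set of abducible indices
    whose conjunction entails the target label; their disjunction is the
    abductive formula. `probs` holds P_N(y_i | x; w_N) for each abducible,
    assumed mutually independent. Brute-force enumeration: only a sketch.
    """
    n = len(probs)
    wmc = probs.new_zeros(())
    for assignment in product([0, 1], repeat=n):
        # Count this assignment only if it satisfies at least one proof.
        if any(all(assignment[i] == 1 for i in proof) for proof in proofs):
            weight = probs.new_ones(())
            for i, bit in enumerate(assignment):
                weight = weight * (probs[i] if bit else (1 - probs[i]))
            wmc = wmc + weight
    return -wmc  # minimising -WMC maximises the satisfaction probability

probs = torch.tensor([0.9, 0.2, 0.7], requires_grad=True)
loss = abductive_loss([{0, 2}, {1}], probs)  # formula (y0 AND y2) OR y1
loss.backward()  # gradients flow back into the neural parameters
```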
The frameworks in this family differ along three main dimensions:
A1 The semantics of the logical component L (e.g. a probabilistic or a classical logic program).
A2 The subset of the derived abductive formula used for training the neural classifier N (e.g. the entire abductive formula or only the most promising proofs).
A3 The loss computation used for training N based on the abductive formula (i.e. minimising −PL (f (x)|PN (y|x; wN )) in Equation (31) via WMC or fuzzy logic).
DeepProbLog (Manhaeve et al., 2018). DeepProbLog was the first framework pro-
posed in this line of research. Regarding A1, DeepProbLog relies on the semantics of PLPs,
while regarding A2 and A3, DeepProbLog uses the WMC of all the abductive proofs, i.e.
the semantic loss introduced above (Xu et al., 2018), as the loss function for training the
neural component. Computing the abductive formula is a computationally intensive task,
severely affecting the scalability of the framework (Manhaeve et al., 2018). Recently, a few
approximations to the original framework were proposed to tackle the problem of computing
all the abductive proofs and then the WMC of the corresponding formula. For example,
Scallop (Huang et al., 2021) trains the neural network considering only the top-k abduc-
tive proofs and relies on the notion of provenance semirings (Green et al., 2007). Instead of
using the top-k proofs, Manhaeve et al. (2021) present an approach that relies on geomet-
ric mean approximations. Beyond the academic interest, Huang et al. (2021) have shown
the merits of indirect supervision frameworks in training deep neural classifiers for visual
question answering, i.e. answering natural language questions about an image.
ABL (Dai et al., 2019). Abductive Learning (ABL) is another framework in that line of
research that employs an ad-hoc optimisation procedure. Regarding A1, ABL is indifferent
to the semantics of L and could, for example, also use a classical (i.e. non-probabilistic)
logic program. Regarding A2 and A3, ABL relies on a pseudo-labeling process to train the
neural component. In each training iteration over a training dataset D = {(xj , f (xj ))}j , ABL first considers different training data subsets Dt ⊂ D and, for each subset, performs the following steps: (i) compute the neural predictions for the samples in Dt; (ii) obscure (i.e. temporarily remove) some of those predictions; and (iii) abduce new values for the obscured predictions so that the revised predictions, together with the theory, entail the labels f (xj ). ABL performs these steps with different subsets of varying sizes. Let D∗ be the largest Dt satisfying the theory after obscuring and abducing. For each {(xj , f (xj ))}j ∈ D∗ , ABL trains the neural component multiple times using the obscured and abduced neural predictions.
It was shown empirically that the optimisation procedure of obscuring and abducing neural
predictions can be time-consuming and less accurate, compared to other techniques in this
section, even when only a single abductive proof is computed (Tsamoura & Michael, 2021).
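The following is a much-simplified sketch of one such pseudo-labelling pass, with `abduce_consistent` standing in for the framework-specific optimisation that obscures and revises neural predictions; all names are illustrative placeholders, not ABL's actual implementation.

```python
def abl_iteration(subset, neural_model, theory, abduce_consistent, retrain):
    """One pseudo-labelling pass over a training subset (sketch).

    For each sample, the neural predictions are obscured and revised by
    `abduce_consistent` so that, together with `theory`, they entail the
    label; the revised predictions are then used as pseudo-labels to
    retrain the neural component. All callables are illustrative stubs.
    """
    pseudo_labelled = []
    for x, label in subset:
        preds = neural_model(x)                        # raw neural predictions
        revised = abduce_consistent(theory, preds, label)
        if revised is not None:                        # consistency achieved
            pseudo_labelled.append((x, revised))
    retrain(neural_model, pseudo_labelled)             # standard supervised update
    return len(pseudo_labelled)                        # how much of the subset was usable
```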
4.3 Discussion
All composite frameworks share the idea of keeping the logical and neural components
separate. However, owing to lower-level differences in their architectures (e.g. how the components are connected or which operations the logical model performs), the benefits they achieve are quite different.
Structured reasoning and support of logic. Most frameworks presented in this sec-
tion have been implemented and tested with only one type of SRL framework. Still, since
the logical model is separate from the neural model, the logic those frameworks support is,
generally speaking, not limited. This clean separation enables the use of logical models that support complex (e.g. hierarchical and recursive) reasoning that neural networks do not support, a property that indirect supervision frameworks in particular take advantage of.
Data need. Since logical models provide more knowledge, it is intuitive that less data
is needed. By taking advantage of the entire distribution of the logical model, and thus,
even more knowledge than other frameworks, parallel direct supervision frameworks seem
particularly suited to reduce data need. However, while Concordia (Feldstein et al., 2023a)
has empirically shown a reduction in data need of up to 20%, none of the frameworks
provide theoretical results on the effect of background knowledge on sample complexity.
Guarantees. Indirect supervision can provide guarantees for the overall framework, as
the output of the framework is the prediction of the logic program, which can contain hard
constraints. Wang et al. (2024) present theoretical results in that regard. For direct super-
vision frameworks, stratified architectures are particularly well suited to enforce constraints
(Remark 3), and Xu et al. (2018) and Fischer et al. (2019) present very positive results in
that aspect. However, no theoretical guarantees have been presented. In contrast, parallel
direct supervision frameworks are not well suited to enforce guarantees.
Scalability. Parallel architectures can scale well as they build on efficient LGMs and the
resulting overhead remains small. Stratified direct supervision frameworks also seem to scale
well, as the outputs are simply checked against a propositional formula. However, most of
the frameworks have only been tested on small datasets and simple rules. In contrast,
indirect supervision frameworks have limited scalability due to their reliance on PLPs.
Explainability. While there are some improvements in explainability over purely neu-
ral models, the impact on explainability for these frameworks remains limited. Indirect
supervision offers explainability for the reasoning step, which remains a white-box system.
However, the explainability of the neural network, in all architectures discussed in this sec-
tion, remains unaffected, as the logical models only guide the neural networks but no direct
link between the background knowledge and the neural predictions has been established.
DL2’s query language (Fischer et al., 2019) could be of interest to the explainable AI com-
munity, as the language allows us to find neurons responsible for certain decisions, but the
framework has not been tested with explainability in mind.
5. Monolithic Frameworks
Up to this point, we surveyed logic-based regularisation approaches that have logical tools
and solvers as components of the overall system. A natural question is how neural mod-
els could be constructed that inherently provide the capability of logical reasoning and
instantiate expert knowledge. This leads us to the area of monolithic frameworks.
We identified two groups of monolithic frameworks. The first group, which we refer to
as logically wired neural networks (Section 5.1), gives neurons a logical meaning and
uses the edges of the neural network to implement logical rules. The second group, which
we refer to as tensorised logic programs (Section 5.2), starts from a logic program and
then maps logical symbols to differentiable objects. We refer the reader to d’Avila Garcez
et al. (2019) for a detailed survey on frameworks that fit this section.
5.1 Logically Wired Neural Networks
The earliest neuro-symbolic frameworks consist of simple neural networks whose neurons
and connections directly represent logical rules in the knowledge base. Establishing a map-
ping between neurons and logical atoms enables us to interpret the activation of a neuron
as either a positive or a negative literal. Such neural networks can then be used for logical
inference, learning with background knowledge, and knowledge extraction. We distinguish
between frameworks using directed models (e.g. feed-forward neural networks (FNNs) or recurrent neural networks (RNNs)) and those using undirected models (e.g. RBMs).
Inference. The networks use standard inference techniques (Section 3.4). Because of
the inherent interpretability of these approaches, knowledge extraction techniques are often
proposed alongside inference.
Training. The frameworks based on directed models are trained using classic backprop-
agation with minor adjustments for recursive connections and various stopping conditions.
RBM-based approaches can be trained with a variety of methods such as hybrid learn-
ing (Larochelle & Bengio, 2008) and contrastive divergence (Hinton, 2002) (Section 3.4.1).
KBANN (Towell & Shavlik, 1994). Knowledge-based artificial neural networks (KBANNs) construct a feed-forward neural network from a set of propositional rules: atoms are mapped to neurons, rules to weighted connections, and each neuron uses the sigmoid activation

1 / (1 + exp(−(wN x − θ))),   (32)

where x are the inputs to the neuron, wN are the edge weights, and θ is the neuron's bias.
C-I2 LP (d’Avila Garcez & Zaverucha, 1999). The connectionist inductive learning
and logic programming system (C-I2 LP) uses RNNs, where the atoms in the premises of
the rules are mapped to input neurons in the neural network, a hidden layer implements
the logical conjunctions of the rules, and the output layer consists of the atoms in the
conclusions of the rules. C-I2 LP also uses standard backpropagation to train the network
but, in contrast to KBANNs, uses a bipolar semi-linear activation function
2 / (1 + exp(−β wN x)) − 1,
where β is a hyperparameter to control the steepness of the activation, x are the inputs,
and wN are the edge weights. C-I2 LP has been extended to CILP++ (França et al., 2014)
to allow FOL via a propositionalisation technique inspired by ILP systems.
Example 16 (KBANN and C-I2 LP). Consider a logic program consisting of a single rule
FAB ∧ LAT → LBT , (33)
where, as in Example 4, FAB models whether Alice and Bob are friends, LAT models whether
Alice likes Star Trek, and LBT models whether Bob likes Star Trek. Suppose FAB = 1 and
LAT = 1 are fed to both networks. Figure 15 shows a KBANN and a C-I2 LP network for this program.
Figure 15: KBANN (left) and C-I2 LP (right) implementation of the logic program (33).
Both networks have a neuron for each atom, and the C-I2 LP neural network has a hidden
neuron h for the rule itself.
In the case of KBANN, each edge is initialised with a weight w (Towell and Shavlik
(1994) suggest w = 4), which can later be refined using backpropagation. The bias of the
LBT neuron, where L is the number of literals in the premise of the rule, is set to

θ = (L − 1/2) w = (3/2) w = 6.
The input to the LBT neuron is then 2w = 8, and its activation value (Equation (32)) is
1 / (1 + exp(−(8 − θ))) = 1 / (1 + exp(−2)) ≈ 0.88,
which is in [0, 1], and so a value larger than 0.5 can be interpreted as LBT being set to True.
In the case of C-I2 LP, the weight w must satisfy

w ≥ (2/β) · (ln(1 + A) − ln(1 − A)) / (M (A − 1) + A + 1) ≈ 8.47,

where we set β = 1, and thus, we set w = 8.5. The activation of the h neuron, given FAB = 1 and LAT = 1, becomes

2 / (1 + exp(−2βw)) − 1 ≈ 1.
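The two activation values in Example 16 can be reproduced directly; the short sketch below simply evaluates Equation (32) for KBANN and the bipolar semi-linear activation for C-I2 LP with the weights chosen above.

```python
import math

def kbann_activation(net_input: float, theta: float) -> float:
    # Equation (32): standard sigmoid with bias theta.
    return 1.0 / (1.0 + math.exp(-(net_input - theta)))

def cilp_activation(net_input: float, beta: float = 1.0) -> float:
    # Bipolar semi-linear activation used by C-I2LP, with output in [-1, 1].
    return 2.0 / (1.0 + math.exp(-beta * net_input)) - 1.0

w_kbann, theta = 4.0, 6.0                     # suggested weight, L = 2 literals
print(kbann_activation(2 * w_kbann, theta))   # ~0.881: L_BT interpreted as True

w_cilp = 8.5                                  # weight satisfying the C-I2LP bound (~8.47)
print(cilp_activation(2 * w_cilp))            # ~1.0: the hidden neuron h fires
```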
Li and Srikumar (2019). Instead of creating a neural network from a logic program,
one can alter a neural network (potentially introducing new neurons) in a way that biases
the network towards satisfying some logical constraints. Li and Srikumar (2019) achieve
this by recognising that the activation of some neurons (called named neurons) can be
interpreted as the degree to which a proposition is satisfied. Suppose we want the neural
network to satisfy rule Z1 ∧ · · · ∧ Zn → Y. For each atom, e.g. Y, we must first identify the
corresponding neuron y. The goal is to increase the activation of y if all zi ’s have high
activation values but do so in a differentiable manner. Suppose originally we had that

y = g(wN x),   (34)
where g is the activation function, wN are the network parameters, and x is the immediate
input to y. Then we replace Equation (34) with
y = g( wN x + wL max{ 0, Σ_{i=1}^{n} zi − n + 1 } ),   (35)

where wL is a non-negative hyperparameter controlling how strongly the constraint biases the neuron.
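A minimal sketch of the augmented neuron in Equation (35), assuming a sigmoid for the activation g and treating wL as a fixed hyperparameter:

```python
import torch

def constrained_neuron(x, z, w_n, w_l=2.0):
    """Named neuron y biased towards satisfying Z_1 AND ... AND Z_n -> Y.

    `x` is the immediate input to y, `w_n` its original weights, `z` the
    activations of the named neurons for the premise, and `w_l` a fixed
    hyperparameter; the sigmoid is an assumed choice for g (Equation (35)).
    """
    n = z.shape[-1]
    # Differentiable surrogate of the conjunction: positive only when all z_i are high.
    logical_bias = torch.clamp(z.sum(dim=-1) - n + 1, min=0.0)
    return torch.sigmoid(x @ w_n + w_l * logical_bias)
```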
DLNs (Tran & d’Avila Garcez, 2016). Deep logic networks (DLNs) are DBNs that
can be built from a (propositional) knowledge base. Alternatively, logical rules can be
extracted from a trained DLN. The rules used by DLNs, called confidence rules, declare
an atom to be equivalent to a conjunction of literals, with a confidence value that works
similarly to those in MLNs (Richardson & Domingos, 2006) and penalty logic (Pinkas,
1995). Knowledge extraction runs in O(kn2 ) time on a network with k layers, each of which
has at most n neurons. However, confidence rules are extracted separately for each RBM
in the stack, causing two issues for deep DLNs: they become less interpretable and either
less computationally tractable or lossier. The authors also note that DLNs perform hardly
better than DBNs on complex non-binary large-variance data such as images.
5.2 Tensorised Logic Programs
We review three frameworks that map (parts of) logic programs to differentiable objects:
neural theorem provers (Rocktäschel & Riedel, 2017), logic tensor networks (Serafini &
d’Avila Garcez, 2016; Serafini et al., 2017), and TensorLog (Cohen et al., 2020). These de-
velopments primarily arose from applications such as automated knowledge base construc-
tion. Here, natural language processing tools need to parse large texts to learn concepts
and categories, e.g. human, mammal, parent, gene type, etc. However, natural language
sources do not make every tuple explicit – often, they are common sense (e.g. humans are
mammals, or parents of parents are grandparents). Owing to the scalability issue of dealing
with thousands or even millions of tuples, there has been considerable effort and interest
in enabling logical reasoning in neural networks to populate instances of logical rules. For
example, embedding logical symbols, such as constants and predicates, in a vector space is
a convenient way of discovering synonymous concepts.
NTPs (Rocktäschel & Riedel, 2017). Neural theorem provers (NTPs) implement backward chaining over a knowledge base, but replace the symbolic comparison of predicates and constants during unification with a differentiable similarity between their embeddings, so that every proof receives a proof success score.
Example 17 (NTP). Reconsider the recommendation setting and suppose we want to find
all users in the knowledge base that like Star Trek, i.e. the query is likes(U, startrek).
Moreover, suppose that the knowledge base already contains the fact likes(alice, starwars).
The unification procedure between these two atoms would produce the substitution {U 7→
alice} along with a proof success score that depends on the distance between the em-
beddings of Star Trek and Star Wars. Let us assume that the distance is small, e.g. because
people who like one of them tend to also like the other. Then, alice will be output as a
probable candidate for liking Star Trek even though such a fact might not be deducible
according to the standard logic programming semantics.
Training aims to learn the embeddings of all predicates and constants. Rule templates are
used to guide the search for a logic program that best describes the data. A rule template is
a rule with placeholder predicates that have to be replaced with predicates from the data.
NTPs are trained using gradient descent by maximising the proof success scores of ground
atoms in the input knowledge base and minimising this score for randomly sampled other
ground atoms. The main advantage of NTPs is their robustness to inconsistent data, e.g.
when two predicates or constants have different names but the same meaning. However,
both training and inference are intractable for most real-world datasets (Rocktäschel &
Riedel, 2017). Furthermore, the use of neural network-based similarity measures obfuscates
the “reasoning” behind a decision, resulting in a less explainable system. Similarly, owing
to the neural machinery, there are often no guarantees about logical consistency.
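For intuition, the soft unification step of Example 17 can be sketched as follows; the RBF-style scoring function and the toy embeddings are illustrative assumptions rather than the exact formulation of Rocktäschel and Riedel (2017).

```python
import torch

def unification_score(emb_a, emb_b, mu=1.0):
    # Soft unification: close to 1 when two symbol embeddings are similar,
    # decaying with their Euclidean distance (an RBF-style kernel).
    return torch.exp(-mu * torch.norm(emb_a - emb_b))

# Toy embeddings: people who like Star Wars tend to also like Star Trek,
# so the learned embeddings end up close together.
startrek = torch.tensor([0.9, 0.1])
starwars = torch.tensor([0.8, 0.2])

# Unifying likes(U, startrek) with the fact likes(alice, starwars) binds
# U to alice with a score governed by the Star Trek / Star Wars distance.
score = unification_score(startrek, starwars)
print(score.item())  # ~0.87: alice is returned as a probable answer
```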
LTNs (Serafini & d’Avila Garcez, 2016; Serafini et al., 2017). Logic tensor net-
works (LTNs) are deep tensor networks that allow for learning in the presence of logical
constraints by implementing real logic. Real logic extends fuzzy logic with universal and
existential quantification. Universal quantification is defined as some aggregation function
(e.g. the mean) applied to the collection of all possible groundings. Existential quantifica-
tion is instead Skolemized (Huth & Ryan, 2004) – enabling LTNs to support open-world
semantics, i.e. the model does not assume that an existentially quantified constant comes
from a finite a priori-defined list. Any FOL formula can then be grounded to a real number in [0, 1]: constants are grounded as vectors (embeddings), predicates as functions that map those vectors to [0, 1], and connectives and quantifiers are interpreted via fuzzy operations and aggregations, respectively.
Such a grounding is defined by its embeddings of constants and predicates. Given a partial
grounding and a set of formulae with confidence intervals, the learning task is to extend the
partial grounding to a full function to maximise some notion of satisfiability. For example,
maximising satisfiability can be achieved by minimising the Manhattan distance between
groundings of formulae and their given confidence intervals, or by maximising the number
of satisfied formulae when not provided with confidence intervals. Successful applications of
LTNs include knowledge base completion (Serafini & d’Avila Garcez, 2016), assigning labels
to objects in images and establishing semantic relations between those objects (Donadello
et al., 2017; Serafini et al., 2017) as well as various examples using taxonomy and ancestral
datasets (Bianchi & Hitzler, 2019). Bianchi and Hitzler (2019) found LTNs to excel when
given data with little noise, i.e. where the input logical formulae are almost always satisfied,
but, similarly to other work in this section, identified scalability issues.
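The following sketch illustrates how a universally quantified formula can be grounded to a truth degree in the spirit of real logic, assuming the mean as the aggregation for the universal quantifier, a Łukasiewicz-style implication, and a small neural predicate; all of these choices are illustrative rather than LTN's exact design.

```python
import torch
import torch.nn as nn

class Predicate(nn.Module):
    """Grounds a binary predicate as a neural function into [0, 1]."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * dim, 16), nn.ReLU(), nn.Linear(16, 1))
    def forward(self, a, b):
        return torch.sigmoid(self.net(torch.cat([a, b], dim=-1))).squeeze(-1)

likes = Predicate(dim=4)
users = torch.randn(10, 4)                      # groundings of the variable U
starwars, startrek = torch.randn(4), torch.randn(4)

# Ground forall U. likes(U, starwars) -> likes(U, startrek):
# implication via a Lukasiewicz-style operator, forall via the mean.
p = likes(users, starwars.expand(10, 4))
q = likes(users, startrek.expand(10, 4))
implication = torch.clamp(1 - p + q, max=1.0)   # fuzzy implication in [0, 1]
satisfaction = implication.mean()               # aggregation for the universal quantifier
(1 - satisfaction).backward()                   # maximise satisfaction by gradient descent
```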
TensorLog (Cohen et al., 2020). TensorLog is a probabilistic logic programming lan-
guage that implements inference via matrix operations. It traces its lineage to probabilistic
relational languages from SRL, except that it is integrated with neural networks. In Tensor-
Log, queries are compiled into differentiable functions. This is achieved by interpreting the
logical rules in fuzzy logic, i.e. conjunctions become a minimisation over real-valued atoms,
and, as discussed in previous sections, fuzzy logic admits an easy integration with con-
tinuous optimisation and neural learning. In some restricted languages, TensorLog allows
query computations without an explicit grounding, unlike, for example, PSL (Bach et al.,
2017). In addition, a fragment of deductive databases, as admitted by “small” proof trees,
is allowed, which is a deliberate attempt to limit reasoning whilst maintaining traceability.
To achieve tractability, TensorLog imposes some restrictions on the supported PLPs. Ten-
sorLog deals with non-recursive PLPs without function symbols and assumes up to two
arguments per predicate. Let D denote the set of all constants in the database. TensorLog considers queries of the form p(a, X), where p is a predicate, a ∈ D is a logical constant, and X is a logical variable. Furthermore, TensorLog only considers chain-like rules of the form

p(X, Y ) ← p1 (X, Z1 ) ∧ p2 (Z1 , Z2 ) ∧ · · · ∧ pn (Zn−1 , Y ).
Given a query p(a, X), TensorLog computes the probability of each answer to p(a, X). We
first describe how constants and probabilistic facts are represented in TensorLog. Let (α, p)
be our database of probabilistic facts, where α is a set of facts (i.e. ground atoms), and
p : α → [0, 1] is a mapping from facts to their probabilities. We fix an ordering of the
constants in D and use a constant symbol a to denote the position of a in this ordering.
TensorLog treats the facts in α as entries in |D| × |D| matrices. In particular, it associates
each predicate p occurring in α with a |D| × |D| matrix M p , where
M p (a, b) = p(p(a, b)) if p(a, b) ∈ α, and M p (a, b) = 0 otherwise.
Given a query p(a, X) over a chain-like rule with body predicates p1 , . . . , pn , let v a denote the one-hot vector selecting the row of the constant a. The probabilities of the possible answers are then collected in the vector

δa = v ⊺a ∏_{i=1}^{n} M pi .   (37)
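For illustration, the sketch below evaluates Equation (37) with NumPy for a toy database and the chain-like rule grandparent(X, Y) ← parent(X, Z) ∧ parent(Z, Y); the constants, facts, and probabilities are made up.

```python
import numpy as np

constants = ["alice", "bob", "carol"]            # fixed ordering of D
idx = {c: i for i, c in enumerate(constants)}

# |D| x |D| matrix for the predicate parent, holding fact probabilities.
M_parent = np.zeros((3, 3))
M_parent[idx["alice"], idx["bob"]] = 0.9         # parent(alice, bob) with probability 0.9
M_parent[idx["bob"], idx["carol"]] = 0.8         # parent(bob, carol) with probability 0.8

# Query grandparent(alice, X) via the chain rule parent(X, Z), parent(Z, Y):
v_alice = np.zeros(3)
v_alice[idx["alice"]] = 1.0                      # one-hot vector for the constant alice
delta = v_alice @ M_parent @ M_parent            # Equation (37)
print(dict(zip(constants, delta)))               # carol scores 0.72, everyone else 0
```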
As all operations used by TensorLog are differentiable, one can easily learn the values of
some of the matrices in equation (37) while keeping others fixed. Overall, besides the
fuzzy interpretation of the logical rules in TensorLog, end-to-end differentiability offers a
degree of transparency. Compared to NTPs, TensorLog is also reasonably efficient, although
scalability becomes an issue with PLPs with a large number of constants. However, logical
reasoning needs to be unrolled explicitly into neural computations, which needs a priori,
often immutable, commitment to certain types of reasoning steps or proof depth. Moreover,
TensorLog supports only chain-like rules, and there is no mechanism for extensibility, i.e.
the ability to add new knowledge on a run-time basis.
5.3 Discussion
Monolithic frameworks implement logic through neural networks. Inherent to their ar-
chitectures, such frameworks come with different strengths and weaknesses, compared to
composite frameworks.
Explainability. Constructing neural models which emulate logic programs inherently
leads to fully explainable models. Firstly, the neural models implement logic programs
and thereby offer global explainability as the entire knowledge of the neural model can
be expressed in logical rules, which are very close to natural language. Secondly, once a
prediction has been made, one can back-trace through the neural network, identifying the
rules that impacted the network’s prediction.
Knowledge integration. Unlike in composite frameworks, the knowledge is directly inserted into the neural model and thereby into every step of the architecture. This leads to a far tighter integration compared to composite frameworks.
Scalability. Scalability is a major issue for all of these systems (and limits their applica-
bility in real-world settings) as proofs can be long, and recursively mimicking these might
lead to exponentially many networks.
Structured reasoning and support of logic. Due to the way in which logical formulae
have to be mapped to neural constructs, only limited fragments of logic are supported, e.g.
Horn clauses and chain-like rules.
Guarantees. Most of the discussed approaches offer no guarantees of logical consistency
or preserving background knowledge during training. Notable exceptions are the works of
França et al. (2014) and Tran (2017) that ensure that the initial neural network is a faithful
representation of the background knowledge.
Data need. The reduction in data need is difficult to quantify. Similarly to frameworks
in the previous section, no theoretical guarantees regarding sample complexity are provided.
Experimentally, a faithful comparison to purely neural networks is more difficult compared
to composite frameworks, where N can be compared to N + L. Towell and Shavlik (1994)
experimentally observe that KBANNs need less training data compared to pure FNNs
because background knowledge biases the network in a simplifying manner. However, this
comparison is w.r.t. an FNN with a single hidden layer. Tensorised logic programs are
mainly compared to PLPs, and thus, a sample complexity improvement compared to neural
networks cannot be quantified.
6. Related Work
The results obtained by empirical research suggest that neuro-symbolic AI has the potential
to overcome part of the limitations of techniques relying only on neural networks. Unsur-
prisingly, neuro-symbolic AI has received increasing attention in recent years, and there are already several surveys on neuro-symbolic AI reporting on its achievements: d'Avila Garcez et al. (2019) focus on what we call monolithic neuro-symbolic frameworks. Marra
et al. (2024) provide a perspective on neuro-symbolic AI through the lens of SRL, specifi-
cally PLPs, which is primarily influenced by the authors’ work on DeepProbLog (Manhaeve
et al., 2018). In contrast, Dash et al. (2022) present a more general overview of how domain
expertise can be integrated into neural networks, including different ways the loss function
of neural networks can be augmented with symbolic constraints, e.g. the semantic-loss ap-
proach (Xu et al., 2018). d'Avila Garcez et al. (2022) offer comprehensive interpretations
of neuro-symbolic AI, including the underlying motivations, challenges, and applications.
In this survey, we presented a map of the field through the lens of the architectures
of the frameworks and identified meta-level properties that result from the architectural
design choices. We primarily focused on regularisation approaches, as such models allow
for a straightforward extension of existing neural models, and thus, are particularly easy
for machine learning practitioners to adopt. We classified a large body of relevant work
in the literature based on the supported logical languages (e.g. propositional or first-order
logic) and model features (e.g. types of SRL frameworks and inference operations).
There are several benefits to having scoped our survey this way: Firstly, we provide
a map that can be used to position future research w.r.t. other frameworks and identify
closely related frameworks. Secondly, we are able to isolate and inspect the logical basis
of regularisation approaches in a systematic and technically precise manner, linking the
strengths and weaknesses of frameworks to their inherent composition. The underlying types
of logical reasoning (e.g. MAP, SAT or abduction) are often glossed over and left implicit
in the research literature despite the fact that they fundamentally affect the computational
and correctness properties of the frameworks. Thirdly, this map provides researchers and
engineers outside the area of neuro-symbolic AI with the necessary tools to navigate the
neuro-symbolic landscape and find the architectures they need based on desired properties.
7. Conclusion
The expectations for neuro-symbolic AI, as outlined in the introduction, were manifold.
While we discussed the different frameworks in terms of architecture, each architecture
also addresses different expectations and concerns. Every framework tackles the limitations
of neural networks to some extent. However, none of the architectures address all the
limitations introduced at the outset of this survey but rather provide one particular benefit.
Broadly, when structured reasoning is required and the application allows for a clean
separation of perception and reasoning, indirect supervision frameworks outperform other
methods. Since perception and reasoning are split into two steps, each step is performed
by the component best suited to the task. The neural model perceives patterns and then
passes them on as inputs to the high-level reasoner. By separating the two tasks, one can
use reasoning frameworks that support complex (e.g., hierarchical and recursive) logical
formulae, including user-defined functions (e.g., arithmetic operations), such as ProbLog
(De Raedt et al., 2007). However, scalability suffers as a consequence.
When only limited training data is available but domain expertise is provided, parallel direct supervision is likely to be the best option. Such circumstances are common in industry, as companies have vast amounts of expertise in their domain but typically little data, which is either expensive to obtain or simply not available due to privacy concerns, such as in medical AI. These frameworks ensure scalability by keeping the neural network
unchanged and using more scalable (lifted) SRL frameworks. However, such models only
improve the accuracy and reduce the data need of neural models but do not improve the explainability
or satisfaction of constraints.
In safety-critical systems (e.g. autonomous driving), where guarantees are of concern,
stratified direct supervision frameworks have the best track record, as such frameworks
check every output of a neural model against a set of (potentially hard) constraints. Similar
to parallel supervision, such frameworks can be scalable as the constraints are defined at
the outset, and limited overhead is added compared to purely neural models. However,
these frameworks come with limitations regarding complex reasoning and explainability.
If explainability is crucial to an application, the best bet is monolithic frameworks: these models offer a high level of transparency, as neural networks can be mapped to
interpretable logic programs. However, such frameworks come with limitations in scalability
and the types of logical theories they support.
model’s prediction power is retained in the student model, but similar results are lacking
for parallel direct supervision frameworks. Quantifying how domain knowledge reduces
data requirements would enable more strategic data collection and usage, making training
processes faster and less resource-intensive. Similarly, when it comes to guarantees, exper-
imental evidence of stratified direct supervision shows that neural networks can be guided
to follow constraints; however, theoretical guarantees are still missing on how closely these
constraints are adhered to. Such guarantees on how logical rules enforce output constraints
would make these systems more reliable and trustworthy, critical for applications in sensitive
and high-stakes domains.
Quantitative comparison. Finally, it is important to comment on the fact that while
this is a technical survey, we have only explored the different architectures through a math-
ematical and computational lens. Like other surveys (d’Avila Garcez et al., 2019; d’Avila
Garcez et al., 2022; Marra et al., 2024), we identified limitations and achievements in the
current state but did not quantify them. A crucial element is missing to properly assess the
current state of neuro-symbolic AI: a standardised benchmark for a quantitative analysis.
This benchmark should include a variety of datasets, each equipped with domain knowledge.
The datasets must be curated to allow evaluation of frameworks across various dimensions
such as scalability, data need, and explainability rather than simply accuracy. Such a
benchmark would streamline the evaluation of new frameworks, quantify their strengths
and weaknesses, and highlight their merits. Additionally, it would help researchers identify
gaps and guide future research. This survey could serve as a skeleton for a future survey
following the same categorisation but evaluating metrics rather than theory.
References
Ba, J., & Caruana, R. (2014). Do deep nets really need to be deep?. Advances in Neural
Information Processing Systems (NeurIPS), 27.
Bach, S. H., Broecheler, M., Huang, B., & Getoor, L. (2017). Hinge-loss Markov random
fields and probabilistic soft logic. Journal of Machine Learning Research, 18.
Bao, W., Yue, J., & Rao, Y. (2017). A deep learning framework for financial time series
using stacked autoencoders and long-short term memory. PLOS One, 12.
Belle, V., Passerini, A., & Van den Broeck, G. (2015). Probabilistic inference in hybrid
domains by weighted model integration. In Proceedings of the 24th International
Joint Conference on Artificial Intelligence (IJCAI). International Joint Conferences
on Artificial Intelligence.
Berman, D. S., Buczak, A. L., Chavis, J. S., & Corbett, C. L. (2019). A survey of deep
learning methods for cyber security. Information, 10, 122.
Bianchi, F., & Hitzler, P. (2019). On the capabilities of logic tensor networks for deductive
reasoning. In AAAI Spring Symposium: Combining machine learning with knowledge
engineering.
Bojarski, M., Testa, D. D., Dworakowski, D., Firner, B., Flepp, B., Goyal, P., Jackel, L. D.,
Monfort, M., Muller, U., Zhang, J., Zhang, X., Zhao, J., & Zieba, K. (2016). End-to-
end learning for self-driving cars. CoRR, abs/1604.07316.
Bradley, A. R., & Manna, Z. (2007). The calculus of computation: decision procedures with
applications to verification. Springer Science & Business Media.
Buffelli, D., & Tsamoura, E. (2023). Scalable theory-driven regularization of scene graph
generation models. In Proceedings of the 37th AAAI Conference on Artificial Intelli-
gence, pp. 6850–6859. Association for the Advancement of Artificial Intelligence.
Cardelli, L., Kwiatkowska, M., Laurenti, L., Paoletti, N., Patane, A., & Wicker, M. (2019).
Statistical guarantees for the robustness of Bayesian neural networks. In Kraus, S.
(Ed.), Proceedings of the 28th International Joint Conference on Artificial Intelligence
(IJCAI), pp. 5693–5700. International Joint Conferences on Artificial Intelligence.
Chakraborty, S., Fremont, D., Meel, K., Seshia, S., & Vardi, M. (2014). Distribution-aware
sampling and weighted model counting for SAT. In Proceedings of the 28th AAAI
Conference on Artificial Intelligence. Association for the Advancement of Artificial
Intelligence.
Chavira, M., Darwiche, A., & Jaeger, M. (2006). Compiling relational Bayesian networks
for exact inference. International Journal of Approximate Reasoning, 42, 4–20.
Chavira, M., & Darwiche, A. (2008). On probabilistic inference by weighted model counting.
Artificial Intelligence, 172, 772–799.
Cheng, K., Ahmed, N. K., & Sun, Y. (2023). Neural compositional rule learning for knowledge
graph reasoning. In The 11th International Conference on Learning Representations
(ICLR). OpenReview.net.
Choi, A., Kisa, D., & Darwiche, A. (2013). Compiling probabilistic graphical models using
sentential decision diagrams. In Symbolic and Quantitative Approaches to Reasoning
with Uncertainty, pp. 121–132. Springer.
Clocksin, W. F., & Mellish, C. S. (2003). Programming in Prolog. Springer Science &
Business Media.
Cohen, W. W., Yang, F., & Mazaitis, K. (2020). TensorLog: A probabilistic database imple-
mented using deep-learning infrastructure. Journal of Artificial Intelligence Research,
67, 285–325.
Cook, S. (1971). The complexity of theorem-proving procedures. In Proceedings of the 3rd
annual ACM Symposium on Theory of Computing (STOC), pp. 151–158.
Dai, W.-Z., Xu, Q., Yu, Y., & Zhou, Z.-H. (2019). Bridging machine learning and logical
reasoning by abductive learning. Advances in Neural Information Processing Systems
(NeurIPS), 32.
Darwiche, A., & Marquis, P. (2002). A knowledge compilation map. Journal of Artificial
Intelligence Research, 17, 229–264.
Dash, T., Chitlangia, S., Ahuja, A., & Srinivasan, A. (2022). A review of some techniques for
inclusion of domain-knowledge into deep neural networks. Scientific Reports, 12 (1),
1040.
d’Avila Garcez, A. S., Gori, M., Lamb, L. C., Serafini, L., Spranger, M., & Tran, S. N. (2019).
Neural-symbolic computing: An effective methodology for principled integration of
machine learning and reasoning. Journal of Applied Logics, 6 (4), 611–632.
d’Avila Garcez, A. S., & Zaverucha, G. (1999). The connectionist inductive learning and
logic programming system. Applied Intelligence, 11 (1), 59–77.
Davis, E., & Marcus, G. (2015). Commonsense reasoning and commonsense knowledge in
artificial intelligence. Communications of the ACM, 58 (9), 92–103.
De Raedt, L., & Kersting, K. (2004). Probabilistic inductive logic programming. In ALT,
Vol. 3244 of Lecture Notes in Computer Science, pp. 19–36. Springer.
De Raedt, L., & Kimmig, A. (2015). Probabilistic (logic) programming concepts. Machine
Learning, 100, 5–47.
De Raedt, L., Kimmig, A., & Toivonen, H. (2007). Problog: A probabilistic prolog and its
application in link discovery. In Proceedings of the 20th International Joint Confer-
ence on Artificial Intelligence (IJCAI). International Joint Conferences on Artificial
Intelligence.
Deng, L. (2012). The MNIST database of handwritten digit images for machine learning
research. IEEE Signal Processing Magazine, 29 (6), 141–142.
Deng, Y., Bao, F., Kong, Y., Ren, Z., & Dai, Q. (2016). Deep direct reinforcement learning
for financial signal representation and trading. IEEE transactions on neural networks
and learning systems, 28 (3), 653–664.
Devlin, J., Chang, M., Lee, K., & Toutanova, K. (2019). BERT: pre-training of deep bidi-
rectional transformers for language understanding. In Proceedings of the 2019 Confer-
ence of the North American Chapter of the Association for Computational Linguistics:
Human Language Technologies, (NAACL-HLT), pp. 4171–4186. Association for Com-
putational Linguistics.
Donadello, I., Serafini, L., & d’Avila Garcez, A. S. (2017). Logic tensor networks for seman-
tic image interpretation. In Proceedings of the 26th International Joint Conference
on Artificial Intelligence (IJCAI). International Joint Conferences on Artificial Intel-
ligence.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., De-
hghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2021).
An image is worth 16x16 words: Transformers for image recognition at scale. In The
9th International Conference on Learning Representations, (ICLR). OpenReview.net.
d’Avila Garcez, A. S., Bader, S., Bowman, H., Lamb, L. C., de Penning, L., Illuminoo, B., &
Poon, H. (2022). Neural-symbolic learning and reasoning: A survey and interpretation.
Neuro-Symbolic Artificial Intelligence: The State of the Art, 342 (1), 327.
Feldstein, J., & Belle, V. (2021). Lifted reasoning meets weighted model integration. In
Proceedings of the 37th Conference on Uncertainty in Artificial Intelligence (UAI).
PMLR.
Feldstein, J., Jurčius, M., & Tsamoura, E. (2023a). Parallel neurosymbolic integration with
Concordia. In Proceedings of the 40th International Conference on Machine Learning
(ICML), pp. 9870–9885. PMLR.
Feldstein, J., Phillips, D., & Tsamoura, E. (2023b). Principled and efficient motif finding
for structure learning of lifted graphical models. In Proceedings of the 37th AAAI Conference on Artificial Intelligence. Association for the Advancement of Artificial Intelligence.
51
Feldstein, Dilkas, Belle, & Tsamoura
52
A Handbook on Augmenting Deep Learning Through Symbolic Reasoning
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation,
9, 1735–1780.
Hu, Z., Ma, X., Liu, Z., Hovy, E., & Xing, E. (2016). Harnessing deep neural networks with
logic rules. In Proceedings of the 54th Annual Meeting of the Association for Computa-
tional Linguistics (ACL), pp. 2410–2420. Association for Computational Linguistics.
Huang, J., Li, Z., Chen, B., Samel, K., Naik, M., Song, L., & Si, X. (2021). Scallop: From
probabilistic deductive databases to scalable differentiable reasoning. Advances in
Neural Information Processing Systems (NeurIPS), 34.
Huang, K., Altosaar, J., & Ranganath, R. (2019). ClinicalBERT: Modeling clinical notes
and predicting hospital readmission. CoRR, abs/1904.05342.
Huth, M., & Ryan, M. D. (2004). Logic in computer science - modelling and reasoning about
systems. Cambridge University Press.
Kakas, A. C. (2017). Abduction. In Encyclopedia of Machine Learning and Data Mining,
pp. 1–8. Springer US, Boston, MA.
Kearnes, S., McCloskey, K., Berndl, M., Pande, V., & Riley, P. (2016). Molecular graph con-
volutions: moving beyond fingerprints. Journal of Computer-Aided Molecular Design,
30, 595–608.
Khot, T., Natarajan, S., Kersting, K., & Shavlik, J. (2015). Gradient-based boosting for sta-
tistical relational learning: the markov logic network and missing data cases. Machine
Learning, 100 (1), 75–100.
Kindermann, R., & Snell, J. L. (1980). Markov random fields and their applications, Vol. 1.
American Mathematical Society.
Kok, S., & Domingos, P. (2010). Learning Markov logic networks using structural motifs.
In Proceedings of the 27th International Conference on Machine Learning (ICML),
pp. 551–558. PMLR.
Koller, D., & Friedman, N. (2009). Probabilistic graphical models: principles and techniques.
MIT press.
Krizhevsky, A. (2009). Learning multiple layers of features from tiny images.
Kullback, S., & Leibler, R. A. (1951). On information and sufficiency. The Annals of
Mathematical Statistics, 22 (1), 79–86.
Lake, B. M., & Baroni, M. (2017). Still not systematic after all these years: On the compo-
sitional skills of sequence-to-sequence recurrent networks. CoRR, abs/1711.00350.
Larochelle, H., & Bengio, Y. (2008). Classification using discriminative restricted boltzmann
machines. In Proceedings of the 25th International Conference on Machine Learning
(ICML), pp. 536–543. PMLR.
LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied
to document recognition. Proceedings of the IEEE, 86 (11), 2278–2324.
Li, T., & Srikumar, V. (2019). Augmenting neural networks with first-order logic. In Pro-
ceedings of the 57th Annual Meeting of the Association for Computational Linguistics
(ACL). Association for Computational Linguistics.
Litjens, G., Kooi, T., Bejnordi, B. E., Setio, A. A. A., Ciompi, F., Ghafoorian, M., Van
Der Laak, J. A., Van Ginneken, B., & Sánchez, C. I. (2017). A survey on deep
learning in medical image analysis. Medical Image Analysis, 42, 60–88.
Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., & Xie, S. (2022). A convnet
for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pp. 11976–11986.
Luo, W., Yang, B., & Urtasun, R. (2018). Fast and furious: Real time end-to-end 3d detec-
tion, tracking and motion forecasting with a single convolutional net. In Proceedings
of the IEEE conference on Computer Vision and Pattern Recognition, pp. 3569–3577.
Manhaeve, R., Dumancic, S., Kimmig, A., Demeester, T., & De Raedt, L. (2018). Deep-
problog: Neural probabilistic logic programming. Advances in Neural Information
Processing Systems (NeurIPS), 31.
Manhaeve, R., Marra, G., & Raedt, L. D. (2021). Approximate inference for neural prob-
abilistic logic programming. In Proceedings of the 18th International Conference on
Principles of Knowledge Representation and Reasoning (KR), pp. 475–486.
Mao, J., Gan, C., Kohli, P., Tenenbaum, J. B., & Wu, J. (2019). The neuro-symbolic concept
learner: Interpreting scenes, words, and sentences from natural supervision. In The
7th International Conference on Learning Representations (ICLR). OpenReview.net.
Marcus, G. (2018). Deep learning: A critical appraisal. CoRR, abs/1801.00631.
Marra, G., Dumančić, S., Manhaeve, R., & De Raedt, L. (2024). From statistical relational
to neurosymbolic artificial intelligence: A survey. Artificial Intelligence, 328, 104062.
Maurer, A. (2016). A vector-contraction inequality for Rademacher complexities. In Proceedings of the 27th International Conference on Algorithmic Learning Theory (ALT), pp. 3–17. Springer-Verlag.
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word repre-
sentations in vector space. In The 1st International Conference on Learning Repre-
sentations (ICLR) Workshop Track. OpenReview.net.
Minervini, P., & Riedel, S. (2018). Adversarially regularising neural NLI models to integrate
logical background knowledge. In Proceedings of the 22nd Conference on Computa-
tional Natural Language Learning (CoNLL), pp. 65–74.
Mittal, S., & Vaishay, S. (2019). A survey of techniques for optimizing deep learning on
gpus. Journal of Systems Architecture, 99, 101635.
Muggleton, S. H. (1996). Stochastic logic programs. In Advances in Inductive Logic Pro-
gramming. IOS Press.
Muggleton, S. H., & De Raedt, L. (1994). Inductive logic programming: Theory and meth-
ods. Journal of Logic Programming, 19/20, 629–679.
Muise, C., McIlraith, S. A., Beck, J. C., & Hsu, E. I. (2012). D-sharp: fast d-dnnf compi-
lation with sharpsat. In Proceedings of the 25th Canadian Conference on Artificial
Intelligence, pp. 356–361. Springer.
Nassif, A. B., Shahin, I., Attili, I., Azzeh, M., & Shaalan, K. (2019). Speech recognition
using deep neural networks: A systematic review. IEEE Access, 7, 19143–19165.
Natarajan, S., Kersting, K., Khot, T., & Shavlik, J. (2015). Boosted statistical relational
learners: From benchmarks to data-driven medicine. Springer.
Pearl, J. (1988). Probabilistic reasoning in intelligent systems: networks of plausible infer-
ence. Morgan Kaufmann.
Pinkas, G. (1995). Reasoning, nonmonotonicity and learning in connectionist networks that
capture propositional knowledge. Artificial Intelligence, 77 (2), 203–247.
Poole, D. (1993). Logic programming, abduction and probability: A top-down anytime al-
gorithm for estimating prior and posterior probabilities. New Generation Computing,
11, 377–400.
Poole, D. (2003). First-order probabilistic inference. In Proceedings of the 18th International
Joint Conference on Artificial Intelligence (IJCAI), pp. 985–991. International Joint
Conferences on Artificial Intelligence.
Qu, M., Chen, J., Xhonneux, L.-P., Bengio, Y., & Tang, J. (2021). RNNLogic: Learning
logic rules for reasoning on knowledge graphs. In The 9th International Conference
on Learning Representations (ICLR). OpenReview.net.
Raedt, L. D., Kersting, K., Natarajan, S., & Poole, D. (2016). Statistical relational artificial
intelligence: Logic, probability, and computation. Synthesis Lectures on Artificial
Intelligence and Machine Learning, 10 (2), 1–189.
Rashed, A., Grabocka, J., & Schmidt-Thieme, L. (2019). Attribute-aware non-linear co-
embeddings of graph features. In Proceedings of the 13th ACM Conference on Rec-
ommender Systems, pp. 314–321.
Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). “Why should I trust you?” Explaining the
predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, pp. 1135–1144.
Richardson, M., & Domingos, P. (2006). Markov logic networks. Machine Learning, 62 (1),
107–136.
Rocktäschel, T., & Riedel, S. (2017). End-to-end differentiable proving. Advances in Neural
Information Processing Systems (NeurIPS), 30.
Rocktäschel, T., Singh, S., & Riedel, S. (2015). Injecting logical background knowledge
into embeddings for relation extraction. In Proceedings of the 2015 Conference of the
North American Chapter of the Association for Computational Linguistics: Human
Language Technologies, (NAACL-HLT), pp. 1119–1129.
Ruan, W., Wu, M., Sun, Y., Huang, X., Kroening, D., & Kwiatkowska, M. (2019). Global
robustness evaluation of deep neural networks with provable guarantees for the ham-
ming distance. In Proceedings of the 28th International Joint Conference on Artificial
Intelligence (IJCAI). International Joint Conferences on Artificial Intelligence Orga-
nization.
Russell, S. (2015). Unifying logic and probability. Communications of the ACM, 58 (7),
88–97.
Russell, S. J., & Norvig, P. (2016). Artificial intelligence: a modern approach. Pearson.
Samek, W., Wiegand, T., & Müller, K. (2017). Explainable artificial intelligence: Under-
standing, visualizing and interpreting deep learning models. CoRR, abs/1708.08296.
Sang, T., Beame, P., & Kautz, H. (2005). Performing Bayesian inference by weighted model counting. In Proceedings of the 20th AAAI Conference on Artificial Intelligence, pp.
475–482. Association for the Advancement of Artificial Intelligence.
Sato, T. (1995). A statistical learning method for logic programs with distribution semantics.
In Proceedings of the 12th International Conference on Logic Programming (ICLP),
pp. 715–729. Citeseer.
Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural Networks,
61, 85–117.
Schuman, C. D., Potok, T. E., Patton, R. M., Birdwell, J. D., Dean, M. E., Rose, G. S.,
& Plank, J. S. (2017). A survey of neuromorphic computing and neural networks in
hardware. CoRR, abs/1705.06963.
Serafini, L., Donadello, I., & d’Avila Garcez, A. S. (2017). Learning and reasoning in
logic tensor networks: theory and application to semantic image interpretation. In
Proceedings of the Symposium on Applied Computing, pp. 125–130.
Serafini, L., & d’Avila Garcez, A. S. (2016). Learning and reasoning with logic tensor
networks. In Conference of the Italian Association for Artificial Intelligence, pp. 334–
348. Springer.
Smolensky, P. (1986). Information processing in dynamical systems: foundations of har-
mony theory. In Parallel distributed Processing: explorations in the microstructure of
cognition, vol. 1: foundations, pp. 194–281. MIT Press.
Sterling, L., & Shapiro, E. Y. (1994). The art of Prolog: advanced programming techniques.
MIT press.
Suciu, D., Olteanu, D., Ré, C., & Koch, C. (2011). Probabilistic databases. Synthesis
Lectures on Data Management, 3 (2), 1–180.
Towell, G. G., & Shavlik, J. W. (1994). Knowledge-based artificial neural networks. Artificial
Intelligence, 70, 119–165.
Tran, S. N. (2017). Propositional knowledge representation in restricted Boltzmann ma-
chines. CoRR, abs/1705.10899.
Tran, S. N., & d’Avila Garcez, A. S. (2016). Deep logic networks: Inserting and extracting
knowledge from deep belief networks. IEEE Transactions on Neural Networks and
Learning Systems, 29 (2), 246–258.
Tsamoura, E., Lee, J., & Urbani, J. (2023). Probabilistic reasoning at scale: Trigger graphs
to the rescue. Proceedings of the ACM on Management of Data, 1 (1), 1–27.
Tsamoura, E., & Michael, L. (2021). Neural-symbolic integration: A compositional perspec-
tive. In Proceedings of the 35th AAAI Conference on Artificial Intelligence. Associa-
tion for the Advancement of Artificial Intelligence.
Van den Broeck, G., Taghipour, N., Meert, W., Davis, J., & De Raedt, L. (2011). Lifted
probabilistic inference by first-order knowledge compilation. In Proceedings of the 22nd International Joint Conference on Artificial Intelligence (IJCAI). International Joint Conferences on Artificial Intelligence.