Query Processing On Probabilistic Data: A Survey
G. Van den Broeck and D. Suciu. Query Processing on Probabilistic Data: A Survey.
Foundations and Trends® in Databases, vol. 7, no. 3-4, pp. 197–341, 2015.
DOI: 10.1561/1900000052.
1
Introduction
1993, Chaudhuri, 1998, Kossmann, 2000, Abadi et al., 2013, Ngo et al.,
2013], and many commercial or open-source query engines exist today
that implement these algorithms. However, query processing on
probabilistic data is quite a different problem, since now, in addition to
traditional data processing, we also need to perform probabilistic inference.
A typical query consists of joins, projections with duplicate removal,
grouping and aggregation, and/or negation. When the input data is
probabilistic, each tuple in the query answer must be annotated with
a probability: computing this output probability is called probabilistic
inference, and is, in general, a challenging problem. For some simple
queries, probabilistic inference is very easy: for example, when we join
two input relations, we can simply multiply the probabilities of the tu-
ples from the two inputs, assuming they are independent events. This
straightforward approach was already used by Barbará et al. [1992].
But for more complex queries probabilistic inference is challenging.
The query evaluation problem over probabilistic databases has
been studied over the last twenty years [Fuhr and Rölleke, 1997, Lak-
shmanan et al., 1997, Dalvi and Suciu, 2004, Benjelloun et al., 2006a,
Antova et al., 2007, Olteanu et al., 2009]. In general, query evalua-
tion, or, better, the probabilistic inference sub-problem of query evalu-
ation, is equivalent to weighted model counting on a Boolean formula,
a problem well studied in the theory community, as well as the AI
and automated reasoning communities. While weighted model count-
ing is known to be #P-hard in general, it has been shown that, for
certain queries, probabilistic inference can be done efficiently. Even
better, such a query can be rewritten into a (more complex) query,
which computes probabilities directly using simple operations (sum,
product, and difference). Therefore, query evaluation, including prob-
abilistic inference, can be done entirely in one of today’s relational
database engines. Such a query can benefit immediately from decades
of advances in query processing, including indexes, query optimiza-
tion, parallel processing, etc. However, for other queries, computing
their output probability is #P-hard. In this case the probabilistic in-
ference task far dominates the query evaluation cost, and these hard
queries are typically evaluated using some approximate methods for
weighted model counting. Dalvi and Suciu [2004] proved that, for a
simple class of queries, either the query can be computed in poly-
nomial time in the size of the database, by pushing the probabilistic
inference in the engine, or the query’s output probability is provably
#P-hard to compute. Thus, we have a dichotomy: every query is ei-
ther efficiently computable, or provably hard, and the distinction can
be made using static analysis on the query.
Probabilistic graphical models preceded probabilistic databases,
and were popularized in a very influential book by Pearl [1988]. In
that setting, the knowledge base is described by a graph, such as
a Bayesian or Markov network. Probabilistic inference on graphical
models is also #P-hard in the size of the graph. Soon the AI commu-
nity noticed that this graph often results from a concise relational rep-
resentation [Horsch and Poole, 1990, Poole, 1993, Jaeger, 1997, Ngo
and Haddawy, 1997, Getoor and Taskar, 2007]. Usually the relational
representation is much more compact than the resulting graph, raising
the natural question whether probabilistic inference can be performed
more efficiently by reasoning on the relational representation instead
of the grounded graphical model. This led to the notion of lifted in-
ference [Poole, 2003], whose goal is to perform inference on the high-
level relational representation without having to ground the model.
Lifted inference techniques in AI and query processing on proba-
bilistic databases were developed independently, and their connection
was established only recently [Gribkoff et al., 2014b].
This is a survey on probabilistic databases and query evaluation.
The goals of this survey are the following.
• A domain: D.
• A relational schema: R.
• A single tuple: t.
• A probabilistic database: D.
ω1 = {Researcher(Alice, Pixar), Researcher(Carol, UPenn)}   Pa(ω1) = 0.10   Pb(ω1) = 0.14   Pc(ω1) = 0.20
ω2 = {Researcher(Alice, Pixar), Researcher(Carol, INRIA)}   Pa(ω2) = 0.10   Pb(ω2) = 0.06   Pc(ω2) = 0.30
ω3 = {Researcher(Alice, Brown), Researcher(Carol, UPenn)}   Pa(ω3) = 0.60   Pb(ω3) = 0.56   Pc(ω3) = 0.40
ω4 = {Researcher(Alice, Brown), Researcher(Carol, INRIA)}   Pa(ω4) = 0.20   Pb(ω4) = 0.24   Pc(ω4) = 0.10

Figure 2.1: Four possible worlds ω1, . . . , ω4 of a schema with one relation (Researcher)
denoting research affiliations. Worlds are labeled with three different valid probabil-
ity functions, Pa, Pb and Pc. Worlds not shown have probability 0.
Readers familiar with incomplete databases will note that the col-
lection of possible worlds is precisely an incomplete database [Imielinski
and Lipski, 1984]. In other words, a probabilistic database is an incom-
plete database plus a probability distribution.
P(ω) = P(ω1 ∪ ω2 ∪ · · · ∪ ωk) = ∏_{i=1..k} P_{Ti}(ωi).    (2.1)
Researcher
Name    Expertise   Affiliation
Alice   Graphics    Pixar   0.3
                    Brown   0.7
Bob     Vision      UPenn   0.3
                    PSU     0.3
                    Brown   0.4
Carol   Databases   UPenn   0.5
                    INRIA   0.5

Figure 2.2: A probabilistic Researcher relation; each researcher's affiliation is uncertain, and the alternatives listed for a researcher are mutually exclusive.
ω1 = {(Alice, Pixar), (Bob, UPenn), (Carol, UPenn)}    P(ω1) = 0.045   (0.3 · 0.3 · 0.5)
ω2 = {(Alice, Pixar), (Bob, UPenn), (Carol, INRIA)}    P(ω2) = 0.045   (0.3 · 0.3 · 0.5)
ω3 = {(Alice, Brown), (Bob, UPenn), (Carol, UPenn)}    P(ω3) = 0.105   (0.7 · 0.3 · 0.5)
ω4 = {(Alice, Brown), (Bob, UPenn), (Carol, INRIA)}    P(ω4) = 0.105   (0.7 · 0.3 · 0.5)
ω5 = {(Alice, Pixar), (Bob, PSU), (Carol, UPenn)}      P(ω5) = 0.045   (0.3 · 0.3 · 0.5)
ω6 = {(Alice, Pixar), (Bob, PSU), (Carol, INRIA)}      P(ω6) = 0.045   (0.3 · 0.3 · 0.5)
ω7 = {(Alice, Brown), (Bob, PSU), (Carol, UPenn)}      P(ω7) = 0.105   (0.7 · 0.3 · 0.5)
ω8 = {(Alice, Brown), (Bob, PSU), (Carol, INRIA)}      P(ω8) = 0.105   (0.7 · 0.3 · 0.5)
ω9 = {(Alice, Pixar), (Bob, Brown), (Carol, UPenn)}    P(ω9) = 0.06    (0.3 · 0.4 · 0.5)
ω10 = {(Alice, Pixar), (Bob, Brown), (Carol, INRIA)}   P(ω10) = 0.06   (0.3 · 0.4 · 0.5)
ω11 = {(Alice, Brown), (Bob, Brown), (Carol, UPenn)}   P(ω11) = 0.14   (0.7 · 0.4 · 0.5)
ω12 = {(Alice, Brown), (Bob, Brown), (Carol, INRIA)}   P(ω12) = 0.14   (0.7 · 0.4 · 0.5)

Figure 2.3: Possible worlds for the probabilistic database in Figure 2.2.
Researcher
Name Expertise Affiliation p
Alice Graphics Pixar 0.3
Alice Graphics Brown 0.7
Bob Vision UPenn 0.3
Bob Vision PSU 0.3
Bob Vision Brown 0.4
Carol Databases UPenn 0.5
Carol Databases INRIA 0.5
Figure 2.4: Block-Independent Disjoint (BID) table for the relation in Figure 2.2.
et al., 2008, Mitchell et al., 2015]. The extractions are performed using
statistical machine learning, and are therefore inherently uncertain. Every
extracted tuple has a degree of confidence associated with it.
A tuple-independent relation is represented by a standard
database instance where each relation has a distinguished attribute
storing the marginal probability of that tuple. Formally, a tuple-
independent probabilistic database is a pair D = (T, p), where T is
a standard database (a set of tuples) and p : T → [0, 1] associates a
probability to each tuple in T. Any subset of tuples forms a possible
world, obtained by including randomly and independently each tuple
t ∈ T, with the probability specified in the database, p(t). Worlds that
contain tuples not found in T have probability zero. We denote by PD
the probability induced by the tuple-independent database D:
P_D(ω) = ∏_{t∈ω} p(t) · ∏_{t∈T−ω} (1 − p(t))   if ω ⊆ T,   and P_D(ω) = 0 otherwise.    (2.2)
This data model captures the marginal independence assumption of
Equation 2.1 where each partition Ti consists of a single tuple t ∈ T.
We drop the subscript and simply write P(ω) when the probabilistic
database is clear from the context. In practice, we represent D by sim-
ply extending the schema of T to include the probability p as an extra
attribute.
For example, Figure 2.5a shows a hypothetical table extracted
from the Web, consisting of (CEO,Company) pairs. The extrac-
tor is not fully confident in its extractions, so that the tuple
Manager(David, PestBye) has a confidence level of only 60%, while the
tuple Manager(Elga, KwikEMart) has a confidence level of 90%. Any
subset of the uncertain tuples is a possible world, hence there are eight
possible worlds (not shown). Here, too, the simplest way to uniquely
define the probability function is to assume independence. In that case
the probability of any possible world is the product of the probabili-
ties of the tuples in the world, times the product of one minus the
probabilities of the tuples not in the world. For example, the probabil-
ity of the world {Manager(David, PestBye), Manager(Fred, Vulgari)}
is 0.6 · (1 − 0.9) · 0.8 = 0.048.
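To make Equation 2.2 concrete, here is a minimal Python sketch, using the hypothetical Manager relation of Figure 2.5a, that reproduces the world probability 0.048 computed above:

# A sketch of Equation 2.2 for a tuple-independent database; the
# relation and its probabilities are the illustrative ones of Fig. 2.5a.
T = {
    ("David", "PestBye"): 0.6,
    ("Elga", "KwikEMart"): 0.9,
    ("Fred", "Vulgari"): 0.8,
}

def world_probability(world, p=T):
    # P_D(omega): multiply p(t) for tuples in the world and 1 - p(t)
    # for the tuples of T that are missing from it (Equation 2.2).
    if not world <= p.keys():
        return 0.0   # worlds containing tuples outside T have probability 0
    prob = 1.0
    for t, pt in p.items():
        prob *= pt if t in world else 1.0 - pt
    return prob

# The world {Manager(David, PestBye), Manager(Fred, Vulgari)}:
print(world_probability({("David", "PestBye"), ("Fred", "Vulgari")}))
# prints 0.6 * (1 - 0.9) * 0.8 = 0.048 (up to floating-point rounding)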
(a) Probabilities:
Manager
CEO    Company     p
David  PestBye     0.6
Elga   KwikEMart   0.9
Fred   Vulgari     0.8

(b) Weights:
Manager
CEO    Company     w
David  PestBye     1.5
Elga   KwikEMart   9.0
Fred   Vulgari     4.0

Figure 2.5: A tuple-independent Manager relation. For each tuple we can specify the
probability p as shown in (a), or the weight w as shown in (b).
The probabilistic databases in Figures 2.5a and 2.5b are indeed equiv-
alent: a weight w encodes the probability p = w/(1 + w).
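A one-line check of this equivalence, under the conversion p = w/(1 + w) implied by the two figures:

# Each weight w in Figure 2.5b encodes the probability p = w / (1 + w).
for ceo, w in [("David", 1.5), ("Elga", 9.0), ("Fred", 4.0)]:
    print(ceo, w / (1 + w))   # prints 0.6, 0.9, 0.8, matching Figure 2.5a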
Probabilities are a more natural and intuitive representation than
weights for tuple-independent probabilistic relations. We will explain
the rationale for considering weights in §2.5, when we introduce soft
constraints and Markov Logic Networks.
Building on the possible world semantics, this section studies the se-
mantics of a query over probabilistic databases: given a query Q in
some query language, such as SQL, datalog, or relational calculus,
what should Q return on a probabilistic database? Query semantics
on a traditional database is defined by some sort of induction on the
structure of the query expression; for example, the value of a relational
algebra expression is defined bottom up. Our semantics over proba-
bilistic databases is different, in the sense that it ignores the query ex-
pression, and instead assumes only that the query already has a well-
defined semantics over deterministic databases. Our task will be to ex-
tend this semantics to probabilistic databases. In other words, assum-
ing we know exactly how to compute Q on a traditional database T,
we want to define the meaning of Q on a probabilistic database D.
It is convenient to assume that the semantics of a query Q over
traditional databases is a mapping from database instances (2^Tup)
into d-dimensional vectors (R^d). Then, its semantics over probabilis-
tic databases is simply its expectation, which is a vector as well.
The query has arity r = 1. Since there are only three distinct expertises
in the domain of the database in Figure 2.2, we can assume d = 3, and
the expected value vector E[Q2 ] is
Graphics 0.7
Vision 0.4
Databases 0.0
The answer vector has dimension d = 5, because there are five possi-
ble affiliations in Figure 2.2. The expected value vector E[Q3 ] is
Pixar 0.3
Brown 1.1
UPenn 0.8
PSU 0.3
INRIA 0.5
Semantics
A first-order sentence is a first-order formula without free variables,
that is, one where each logical variable is associated with a univer-
sal or existential quantifier. For a given finite domain D, a first-order
formula ∆(x) can be seen as representing a set of first-order sen-
tences {δ1 , . . . , δd }, obtained by substituting the free variables x with
constants from the domain D. We will refer to these sentences as the
groundings of ∆(x). For example, a grounding of the first constraint
above is δ = Smoker(Alice) ∧ Friend(Alice, Bob) ⇒ Smoker(Bob). We
use the term grounding with some abuse, since, in general, δ may be
a sentence with quantified variables. For example, if ∆(x) = (R(x) ⇒
∃y S(x, y)), then one of its groundings is the sentence δ = (R(5) ⇒
∃yS(5, y)). The grounding of an MLN M over domain D is defined as
follows.
ground(M ) = {(wi , δ) | (wi , ∆i (x)) ∈ M and δ is a grounding of ∆i (x)}
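As a small illustration, the following Python sketch grounds a single weighted constraint over a two-element domain; the constraint is represented as a string-producing function purely for readability, and the weight 2.5 is an arbitrary illustrative value:

from itertools import product

domain = ["Alice", "Bob"]
w = 2.5   # illustrative weight

def delta(x, y):
    # The soft constraint Smoker(x) ∧ Friend(x, y) ⇒ Smoker(y),
    # instantiated for the constants x and y.
    return f"Smoker({x}) ∧ Friend({x},{y}) ⇒ Smoker({y})"

ground_M = [(w, delta(x, y)) for x, y in product(domain, repeat=2)]
for wi, d in ground_M:
    print(wi, d)
# One of the four groundings is the sentence δ from the text:
# Smoker(Alice) ∧ Friend(Alice,Bob) ⇒ Smoker(Bob)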
Next, consider a single possible world ω ⊆ Tup. Recall that the
notation ω |= δ means that the sentence δ is true in ω.
the reader to find an example where P(δ) > P(δ′).) MLNs give
great flexibility in expressing dependencies, yet also a risk of be-
ing uninterpretable, because weighted constraints can partially
undo each other.
Figure 2.6: Reductions between MLNs and various probabilistic database models (tuple-independent probabilistic databases, §2.2.3, and tuple-independent weighted databases, §2.5.2).
In other words, we create one soft constraint for every tuple in the
database, and one soft constraint with weight 0 for every tuple not in
the database. The probability distribution on possible worlds defined
Manager
CEO    Company     w
David  PestBye     u11
David  KwikEMart   u12
Elga   KwikEMart   u22

Smoker
Person  w
David   r1
Elga    r2
Fred    r3
2.5.3 Discussion
The soft constraints in MLNs create complex correlations between the
tuples in a probabilistic database. In MLNs we can also define hard
constraints, by giving them a weight of w = 0 or w = ∞. In the first
case we simply assert that the constraint is false, since all worlds satis-
fying that constraint have weight 0. In the second case, the weight of a
world that satisfies the constraint becomes ∞, and then its probability
is no longer well defined, since both numerator and denominator of
Equation 2.8 are ∞. There are two workarounds that lead to the same
Then, for any query Q, we have PM (Q) = P(Q|∆), where the latter
probability is in a tuple-independent probabilistic database.
Discussion
We end our treatment of soft constraints with two observations. Recall
that a clause is a disjunction of literals, L1 ∨ L2 ∨ . . ., where each literal
is either a positive relational atom R(x1 , x2 , . . .) or a negated relational
atom ¬R(x1 , x2 , . . .). Our first observation is that, if the soft constraint
Φ(x) is a clause, then its corresponding sentence in the hard constraint
∆ is also a clause ¬A(x) ∨ Φ(x). In many applications of MLNs, the soft
constraints are Horn clauses, B1 ∧ B2 ∧ . . . ⇒ C, and in that case the
hard constraint ∆ is a Horn clause as well: A ∧ B1 ∧ B2 ∧ . . . ⇒ C. In
other words, the formula for the hard constraint is no more complex
than the original MLN constraint.
Second, the weight of the ground tuples in each new relation Ai is
wi − 1. When wi < 1, then the tuples in Ai have a negative weight,
which, in turn, corresponds to a probability that is either < 0 or > 1.
This is inconsistent with the traditional definition of a probability, and
requires a discussion. One simple way to try to avoid this is to replace
every soft constraint (wi , Φi ) where wi < 1 with (1/wi , ¬Φi ), but the
new constraint ¬Φi is in general more complex than Φi ; for example
if Φi is a clause, then ¬Φi is a conjunctive term. However, in many
applications we do not need to avoid negative weights. The marginal
probability of any event Q over the tuple-independent probabilistic
database is still well defined, even if some tuple probabilities are < 0
or > 1. All exact probabilistic inference methods work unchanged
Figure 2.8: A simple Bayesian network for the dependencies between random vari-
ables Manager(David, PestBye), Manager(David, KwikEMart), and Smoker(David).
Manager
CEO    Company     w
David  PestBye     0.4
David  KwikEMart   1

Smoker
Person  w
David   1

Figure 2.9: A tuple-independent weighted database with soft constraints. The repre-
sented distribution is equivalent to the Bayesian network distribution in Figure 2.8.
X1 X2 X3 F
0 0 0 0
0 0 1 0
0 1 0 0
0 1 1 1
1 0 0 0
1 0 1 1
1 1 0 1
1 1 1 1
Figure 3.1: Truth table for the formula F = (X1 ∨ X2 ) ∧ (X1 ∨ X3 ) ∧ (X2 ∨ X3 ).
²It is also common to associate weights w̄i with assigning false to Xi (or equiv-
alently, to associate a weight with all literals) [Chavira and Darwiche, 2008]. The
weight of a model then becomes w(θ) = ∏_{i: θ(Xi)=1} wi · ∏_{i: θ(Xi)=0} w̄i. These defini-
tions are interchangeable by normalization as long as wi + w̄i ≠ 0.
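For concreteness, here is a brute-force weighted model counter for the formula of Figure 3.1, in Python; the weights are illustrative, chosen so that every literal has weight 1/2:

from itertools import product

def F(x1, x2, x3):
    # The formula of Figure 3.1: (X1 ∨ X2) ∧ (X1 ∨ X3) ∧ (X2 ∨ X3).
    return (x1 or x2) and (x1 or x3) and (x2 or x3)

w    = {1: 0.5, 2: 0.5, 3: 0.5}   # weight of setting Xi = 1
wbar = {1: 0.5, 2: 0.5, 3: 0.5}   # weight of setting Xi = 0 (see footnote)

count, wmc = 0, 0.0
for theta in product([0, 1], repeat=3):
    if F(*theta):
        count += 1
        weight = 1.0
        for i, v in enumerate(theta, start=1):
            weight *= w[i] if v else wbar[i]
        wmc += weight
print(count, wmc)   # 4 models; with all weights 1/2, WMC = 4/8 = 0.5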
The lineage is a DNF formula whose terms are given by the rows
returned by the following query:
select *
from Researcher r, University u
where r.affiliation = u.uname
  and u.city = 'Seattle'
order by r.expertise
The answers to the first SQL query are interpreted as unit clauses
¬Rnu , and the answers to the second SQL query are interpreted as
clauses with two literals ¬Rnu ∨ Uu . These query rewritings do not
assume any key or foreign key constraints in the database. If UName
is a key in University and Affiliation is a foreign key, then the first
SQL query above simplifies: it suffices to look up the unique city of the
researcher's affiliation, and check that it is not Seattle:

select distinct r.Name, r.affiliation
from Researcher r, University u
where u.city != 'Seattle'
  and r.expertise = 'Vision'
  and r.affiliation = u.uname
From Theorem 3.4, one can easily show that computing the probability
P(∆) is #P-hard in the size of the input database, even if the input
database is tuple-independent.
The DPLL Family of Algorithms Exact model counting algo-
rithms are based on extensions of the DPLL family of algorithms intro-
duced by Davis and Putnam [1960] and Davis et al. [1962] that were
originally designed for satisfiability search. We review them briefly
here, and refer to Gomes et al. [2009] for a survey.
A DPLL algorithm chooses a Boolean variable X, uses the Shannon
expansion formula #F = #F [X = 0] + #F [X = 1], and computes the
number of models of the two residual formulas #F [X = 0], #F [X =
1]. The DPLL algorithm was initially developed for the Satisfiability
Problem. In that case, it suffices to check if one of the two residual
formulas #F [X = 0] or #F [X = 1] is satisfiable: if the first one is
satisfiable, then there is no need to check the second one. When we
adapt it to model counting, the DPLL algorithm must perform a full
traversal of the search space.
In addition to the basic Shannon expansion step, modern DPLL-
based algorithms implement two extensions. The first consists of
caching intermediate results, to avoid repeated computations of equiva-
lent residual formulas. Before computing #F the algorithm checks if
F is in the cache; if it is in the cache then the algorithm returns #F
immediately; otherwise, it computes #F using a Shannon expansion,
then stores the result in the cache.
Cache Read Lookup the pair (F, #F ) in the cache: if found, re-
turn #F .
Cache Write Store the pair (F, #F ) in the cache, and return #F .
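A minimal Python sketch of such a DPLL-style model counter with caching; the CNF encoding and the naive variable-selection rule are our own simplifications, and real systems add component decomposition and careful heuristics:

from functools import lru_cache

# A formula is a frozenset of clauses; a clause is a frozenset of
# integer literals (v for X_v, -v for its negation).

def condition(cnf, lit):
    # The residual formula F[lit = true].
    out = []
    for clause in cnf:
        if lit in clause:
            continue                  # clause satisfied: drop it
        out.append(clause - {-lit})   # remove the falsified literal
    return frozenset(out)

@lru_cache(maxsize=None)              # the cache of intermediate counts
def count(cnf, free_vars):
    if frozenset() in cnf:
        return 0                      # an empty clause: unsatisfiable
    if not cnf:
        return 2 ** len(free_vars)    # every remaining assignment is a model
    x = abs(next(iter(next(iter(cnf)))))   # pick any remaining variable
    rest = free_vars - {x}
    # Shannon expansion: #F = #F[X=0] + #F[X=1]
    return count(condition(cnf, -x), rest) + count(condition(cnf, x), rest)

# Figure 3.1: F = (X1 ∨ X2) ∧ (X1 ∨ X3) ∧ (X2 ∨ X3)
F = frozenset({frozenset({1, 2}), frozenset({1, 3}), frozenset({2, 3})})
print(count(F, frozenset({1, 2, 3})))   # prints 4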
where the probability is taken over the random choices of the algo-
rithm. The meaning of C̃ is that of an estimated value of the exact
count C = #F. We say that the algorithm is an approximation algorithm
for model counting, and define its complexity in terms of the size of
the input formula F and the quantities 1/ε, 1/δ. When the algorithm
runs in polynomial time in all three parameters, we call it a Fully
Polynomial Time Randomized Approximation Scheme (FPTRAS).
These definitions for model counting carry over naturally to the
probability computation problem, or to the weighted model counting
problem, and we omit the straightforward definitions.
We will describe below an approximation algorithm based on
Monte Carlo simulations. In general, this algorithm is not an FP-
TRAS. Karp and Luby [1983] have shown that DNF formulas ad-
mit an FPTRAS consisting of a modified Monte Carlo simulation.
Roth [1996] and Vadhan [2001] proved essentially that no FPTRAS
can exist for CNF formulas. More precisely, they proved the follow-
ing result (building on previous results by Jerrum, Valiant, Vazirani
and later by Sinclair): for any fixed ε > 0, given a bipartite graph
E ⊆ [n] × [n], it is NP-hard to approximate #F within a factor n^ε,
where F = ⋀_{(i,j)∈E} (Xi ∨ Xj).
Proof. For any UCQ Q, the lineage F_{Q,T} is a DNF, hence the first part
follows from Karp and Luby [1983]. By setting P(S(i, j)) = 0 when
(i, j) ∈ E and P(S(i, j)) = 1 when (i, j) ∉ E, the lineage of ∆ is,
essentially, ⋀_{(i,j)∈E} (R(i) ∨ R(j)), hence the second part follows from
Vadhan [2001].
The Monte Carlo estimator draws N independent samples and returns
C̃ = ∑_{i=1..N} Y_i / N, where Y_i indicates whether the i-th sample satisfies the formula.
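A naive Monte Carlo estimator of this kind, sketched in Python on a small illustrative formula (the variable names and probabilities are made up):

import random

p = {"x1": 0.6, "x2": 0.9, "x3": 0.8}   # tuple probabilities, illustrative

def F(theta):
    return theta["x1"] and (theta["x2"] or theta["x3"])   # example formula

def monte_carlo(F, p, N=100_000, rng=random.Random(0)):
    hits = 0
    for _ in range(N):
        # Sample a possible world by flipping one coin per tuple.
        theta = {x: rng.random() < px for x, px in p.items()}
        hits += F(theta)              # Y_i ∈ {0, 1}
    return hits / N                   # the estimate C̃ = Σ_i Y_i / N

print(monte_carlo(F, p))   # exact value: 0.6 · (1 − 0.1 · 0.2) = 0.588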
Dissociation
Gatterbauer and Suciu [2014] describe an approximation method
called dissociation, which gives guaranteed upper and lower bounds
on the probability of a Boolean formula. Fix a formula F with variables
X. A dissociation, F′, is obtained, intuitively, by choosing a variable X
and replacing its occurrences with fresh, independent copies X1, . . . , Xk,
with probabilities p′(X1), . . . , p′(Xk).

(1) Suppose the dissociation is conjunctive, meaning that we can
write F′ = F′1 ∧ · · · ∧ F′k such that Xi occurs only in F′i (and in no
other F′j, for j ≠ i). If p′(Xi) ≤ p(X) for all i, then P(F′) ≤ P(F); if
∏_i p′(Xi) ≥ p(X), then P(F′) ≥ P(F).

(2) Suppose the dissociation is disjunctive, meaning that we can
write F′ = F′1 ∨ · · · ∨ F′k such that Xi occurs only in F′i. If p′(Xi) ≥ p(X)
for all i, then P(F′) ≥ P(F).
Approximate DPLL
Olteanu et al. [2010] describe a simple heuristics that terminates the
DPLL early, and returns lower and upper bounds on the probability
rather than the exact probability. When it terminates, it approximates
the probability of the unprocessed residual formula by using these
bounds:
max_{i=1..n} P(Fi)  ≤  P(F1 ∨ · · · ∨ Fn)  ≤  ∑_{i=1..n} P(Fi)

∑_{i=1..n} P(Fi) − (n − 1)  ≤  P(F1 ∧ · · · ∧ Fn)  ≤  min_{i=1..n} P(Fi)
Extensional Operators
A query plan can be extended to compute probabilities. Assume that
each intermediate relation in the query plan has an attribute p rep-
resenting the probability that the tuple appears in that intermediate
relation. Then, each relational operator is extended to combine these
probabilities; such operators are called extensional operators.
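As a sketch, here are the two extensional operators used in the examples below (an independent join and an independent project with duplicate elimination), in Python, with our own encoding of tables as lists of dictionaries carrying the probability attribute p:

def ext_join(r, s, attr):
    # Independent join: multiply the probabilities of the joined tuples.
    return [{**t, **u, "p": t["p"] * u["p"]}
            for t in r for u in s if t[attr] == u[attr]]

def ext_project(r, attrs):
    # Independent project with duplicate elimination: the output
    # probability of a group is 1 - prod(1 - p_i) over its input tuples.
    groups = {}
    for t in r:
        key = tuple(t[a] for a in attrs)
        groups[key] = groups.get(key, 1.0) * (1.0 - t["p"])
    return [dict(zip(attrs, key), p=1.0 - q) for key, q in groups.items()]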
Safe Plans
Fix a relational schema for a probabilistic database, which specifies
for each table name whether it is a tuple-independent table or a BID
table; in the latter case the schema also specifies the key(s) defining
the groups of disjoint tuples (cf. §2.2.1).
Definition 4.1. Let Q be a set-valued query. A safe plan for a query Q is
an extensional query plan that computes correct output probabilities
(according to Definition 2.2), on any input probabilistic database.
We illustrate with an example.
Example 4.1. Assume both relations R, S are tuple-independent, and
consider the query:
Q(z):-R(z,x) ∧ S(x,y)
Since these two events are independent (they refer to disjoint sets
of independent tuples), the probability that c is in the answer is:
1 − [1 − p1 (1 − (1 − q1 )(1 − q2 ))] · [1 − p2 (1 − (1 − q3 )(1 − q4 )(1 − q5 ))]
To compute the query on a traditional database, any modern query
engine would produce a query plan like that in Figure 4.1b: it first joins
R and S on the attribute x, then projects the result on the attribute z.
We assume that the project operator includes duplicate elimination.
If we extend each operator to manipulate explicitly the probability at-
tribute p, then we obtain the intermediate result and final result shown
in the figure (the actual tuples are dropped and only the probabilities
are shown, to reduce clutter). As we can see the final result is wrong.
The reason is that duplicate elimination incorrectly assumed that all
tuples that have z = c are independent: in fact the first two such tu-
ples depend on (c, a1 ) and the next three depend on (c, a2 ) (see the
repeated probabilities p1 , p1 and p2 , p2 , p2 in the figure). Thus, the plan
in Figure 4.1b is not safe.
In contrast, the plan in Figure 4.1c is safe, because it computes the
output probabilities correctly; the figure shows the computation only
(a) The probabilistic database:

R:  z  x   p        S:  x   y   p
    c  a1  p1           a1  b1  q1
    c  a2  p2           a1  b2  q2
    c  a3  p3           a2  b3  q3
                        a2  b4  q4
                        a2  b5  q5

(b) The plan Πz(R ⋈x S): the join returns five tuples with probabilities
p1q1, p1q2, p2q3, p2q4, p2q5, and the projection on z outputs

1 − (1 − p1q1)(1 − p1q2)(1 − p2q3)(1 − p2q4)(1 − p2q5)

(c) The plan Πz(R ⋈x Πx(S)): the inner projection returns the probabilities
1 − (1 − q1)(1 − q2) (for x = a1) and 1 − (1 − q3)(1 − q4)(1 − q5) (for x = a2);
the join returns p1[1 − (1 − q1)(1 − q2)] and p2[1 − (1 − q3)(1 − q4)(1 − q5)],
and the final projection outputs

1 − {1 − p1[1 − (1 − q1)(1 − q2)]} · {1 − p2[1 − (1 − q3)(1 − q4)(1 − q5)]}

Figure 4.1: Example query plans. Intermediate partial results show the probabilities.
on our toy database, but the plan is correct on any input database. It
starts by projecting out the redundant attribute y in S(x, y) and doing
duplicate elimination, before joining the result with R, then contin-
ues like the previous plan. The two plans (b) and (c) are equivalent
over deterministic databases, but when the join and projection opera-
tors are extended to manipulate probabilities, then they are no longer
equivalent. Notice that a SQL engine would normally not choose (c)
over (b), because the extra cost of the duplicate elimination is not jus-
tified. Over probabilistic databases, however, the two plans are dif-
ferent, and the latter returns the correct probability, as shown in the
figure. We invite the reader to verify that this plan returns the correct
output probabilities for any tuple independent probabilistic relations
R and S.
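The following Python sketch replays the two plans of Figure 4.1 on its toy database, with illustrative numbers substituted for p1, . . . , p3 and q1, . . . , q5; plan (b) indeed returns a larger, incorrect probability:

# Toy database of Figure 4.1a, with all pi = 0.5 and all qi = 0.4.
R = [("c", "a1", 0.5), ("c", "a2", 0.5), ("c", "a3", 0.5)]
S = [("a1", "b1", 0.4), ("a1", "b2", 0.4),
     ("a2", "b3", 0.4), ("a2", "b4", 0.4), ("a2", "b5", 0.4)]

def proj(rows, key):
    # Extensional projection with duplicate elimination.
    g = {}
    for *t, p in rows:
        k = key(t)
        g[k] = g.get(k, 1.0) * (1.0 - p)
    return [(k, 1.0 - q) for k, q in g.items()]

# Unsafe plan (b): join first, then project on z.
join_b = [(z, x, y, p * q) for z, x, p in R for x2, y, q in S if x == x2]
print(proj([(z, p) for z, _, _, p in join_b], key=lambda t: t[0]))
# [('c', 0.67232)]

# Safe plan (c): project S on x first, then join, then project on z.
s_x = proj([(x, q) for x, _, q in S], key=lambda t: t[0])
join_c = [(z, p * q) for z, x, p in R for x2, q in s_x if x == x2]
print(proj(join_c, key=lambda t: t[0]))
# [('c', 0.58656)] -- the correct probability; plan (b) over-estimates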
Unsafe Plans
If we evaluate an extensional, unsafe plan, are the resulting probabili-
ties of any use? Somewhat surprisingly, Gatterbauer and Suciu [2015]
show that every extensional plan for a conjunctive query without self-
joins returns an upper bound on the true probabilities. In other words,
even if the plan is unsafe, it always returns an upper bound on the true
probability. This follows from Theorem 3.7 (2) by observing that every
extensional plan computes the exact probability of some dissociation
of the lineage. Indeed, any projection operator treats repeated copies
of the same tuple as distinct random variables, which means that it
dissociates the random variable associated to the tuple. For example,
both plans P1 and P2 above return upper bounds on the probability of
H0 , and obviously can be computed in polynomial time in the size of
the input database. In general, by considering all plans for the query,
and taking the minimum of their probabilities one obtains an even
tighter upper bound on the true probabilities than a single plan. Some
plans are redundant and can be eliminated from the enumeration of
all plans, because they are dominated by tighter plans. For example,
H0 admits a third plan, P3 := Π∅(R(x) ⋈ S(x, y) ⋈ T(y)), but Theo-
rem 3.7 (2) implies that eval(P1) ≤ eval(P3) and eval(P2) ≤ eval(P3),
because the lineage of P3 is a dissociation of the lineage of both P1
and P2 ; hence P3 is dominated by P1 and P2 and does not need to
be considered when computing the minimum of all probabilities. The
technique described in Gatterbauer and Suciu [2015] is quite effective
for conjunctive queries without self-joins, but it is open whether it can
be extended to more general queries.
To summarize, some queries admit safe plans and can be com-
puted as efficiently as queries over standard databases; others do not
admit safe plans. The practical question is: which queries admit safe
plans? And if a query admits a safe plan, how can we find it? We ad-
dress this issue in the rest of this chapter.
P(Q′), until we reach ground atoms, and then return the probability
of the ground atom, which is obtained directly from the database. If
we reach a sentence Q′ where no rule applies, then we are stuck, and
lifted inference failed on our query.
We illustrate with a simple example, computing the probability of
the constraint Γ = ∀x∀y(S(x, y) ⇒ R(x)):
P(∀x∀y(S(x, y) ⇒ R(x)))
    = ∏_{a∈D} P(∀y(S(a, y) ⇒ R(a)))
    = ∏_{a∈D} P(∀y(¬S(a, y) ∨ R(a)))
    = ∏_{a∈D} P((∀y ¬S(a, y)) ∨ R(a))
    = ∏_{a∈D} [1 − (1 − P(∀y ¬S(a, y)))(1 − p(R(a)))]
    = ∏_{a∈D} [1 − (1 − ∏_{b∈D} (1 − p(S(a, b))))(1 − p(R(a)))]
In the last expression all probabilities are for ground atoms, and these
can be obtained using the function p, or, in practice, looked up in an
attribute of the relation. We invite the reader to check that in both steps
where we applied the universal quantifier rule, the variable that we
eliminated was a syntactic separator variable.
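The closed-form expression above translates directly into a doubly nested loop; a Python sketch with an illustrative domain and tuple probabilities:

domain = ["a", "b", "c"]
pR = {a: 0.9 for a in domain}                        # p(R(a)), illustrative
pS = {(a, b): 0.3 for a in domain for b in domain}   # p(S(a,b)), illustrative

prob = 1.0
for a in domain:
    p_no_S = 1.0
    for b in domain:
        p_no_S *= 1.0 - pS[(a, b)]   # P(∀y ¬S(a,y))
    prob *= 1.0 - (1.0 - p_no_S) * (1.0 - pR[a])
print(prob)   # runs in time O(|domain|^2), as promised by the rules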
When the rules succeed, then we compute P(Q) in polynomial
time in the size of the input domain: more precisely in time O(n^k),
where n is the size of the domain and k is the total number of variables.
Therefore, the lifted inference rules will not work on queries whose
complexity is #P-hard. For example, the query H0 = ∃x∃yR(x) ∧
S(x, y) ∧ T (y) is #P-hard (by Theorem 3.4). It is easy to check that no
lifted inference rule applies here: the query is not the conjunction of
two sentences, so we cannot apply the join-rule, and neither x nor y
is a separator variable (because none of them occurs in all atoms). In
general, it will be the case that for some first-order sentences the lifted
inference rules will not be able to compute the query.
An important question is to characterize the class of queries whose
probability can be computed using only the lifted inference rules. We
say that such a query is liftable. A query can be computed using lifted
inference rules iff it admits a safe query plan, and therefore we also call
such a query a safe query. When Q is liftable, then P(Q) can be com-
puted in polynomial time in the size of the database. If the converse
also holds, then the rules are complete. We describe below some frag-
ments of first-order sentences for which the rules are complete: more
precisely, if the rules fail to compute the probability of some query Q,
then P(Q) is provably #P-hard in the size of the input database.
s.t. at(x′), at(y′) overlap but neither contains the other. Thus,
Q has three atoms R′ ∈ at(x′) − at(y′), S′ ∈ at(x′) ∩ at(y′), T′ ∈
at(y′) − at(x′). The three atoms are R′(x′, . . .) ∧ S′(x′, y′, . . .) ∧ T′(y′, . . .),
i.e. R′ contains the variable x′ and possibly others, but does not con-
tain the variable y′, and similarly for T′, while S′ contains both x′ and y′.
Given the input probabilistic database R(x), S(x, y), T(y), we construct an in-
stance for R′(x′, . . .) by extending R(x) with the extra attributes and
filling them with some fixed constant; similarly, we construct S′, T′ from
S, T, where the extended attributes are filled with the same constant.
The new relations R′, S′, T′ are therefore in 1-1 correspondence with
R, S, T: they have the same cardinalities, and corresponding tuples
have the same probabilities. For all other relations in Q, define their
instance to be the Cartesian product of the domain, and set their prob-
abilities to be 1. It is easy to check that P(Q) = P(H0), proving that the
query evaluation problem for H0 can be reduced to that for Q; hence
Q is #P-hard.
and that all occurrences of the ¬ operator are pushed down to the
atoms, using De Morgan's laws. We will consider various fragments
of FO. If S ⊆ {¬, ∃, ∀, ∨, ∧}, then FO^S denotes the fragment of FO
restricted to the operations in S. For example, FO^{∃,∧} is equivalent
to Conjunctive Queries (CQ), FO^{∃,∨,∧} is the positive, existential frag-
ment of FO, and equivalent to Unions of Conjunctive Queries (UCQ),
while FO^{∀,∨,∧} is the positive universal fragment of FO, and FO^{¬,∃,∀,∨,∧} is
full first-order logic, which we denote FO.
Call a first-order formula unate if every relational symbol occurs
either only positively, or only negated. For example, the first sentence
below is unate, the second is not:
Lemma 4.3. (1) If Q1, Q2 are two shattered, ranked and minimal sen-
tences in FO^{¬un,∃,∨,∧} (or FO^{¬un,∀,∨,∧}), then syntactic independence is
equivalent to independence. (2) If Q in FO^{¬un,∃,∨,∧} (or FO^{¬un,∀,∨,∧})
is shattered, ranked, and minimal, and has a single free variable x,
then x is a syntactic separator variable iff it is a separator variable.
The rewritten query Q′2 is ranked (using the variable order x, y) and is
equivalent to Q2, hence P(Q2) = P(Q′2) and
4.5 Negation
We invite the reader to pause and reflect about this statement. In par-
ticular, note that the statement is false if one of the relations Ri is not
unary: for example if Ri is binary, then a particular instance of Ri rep-
resents a graph, and the probability of Q depends not just on the num-
ber of edges ki , but also on the entire degree sequence, and on much
more [Grohe, 2017, pp.105]. Thus, de Finetti’s theorem allows us to
prove Equation 4.1 only when all relations Ri are unary. From here we
derive:
P(Q) = ∑_{k_S=0..n, S⊆[ℓ]}  ∏_{S⊆[ℓ]} C(n, k_S)  ∏_{i∈[ℓ]} p_i^{k_{{i}}} (1 − p_i)^{n−k_{{i}}}  ·  P(Q | (k_S)_{S⊆[ℓ]})

where C(n, k) denotes the binomial coefficient.
P(H0) = ∑_{kR,kT=0..n} C(n, kR) C(n, kT) · pR^{kR} (1 − pR)^{n−kR} · pT^{kT} (1 − pT)^{n−kT} · pS^{n² − kR·kT}
The formula holds because the clause (R(x) ∨ S(x, y) ∨ T(y)) is already
satisfied by the kR kT tuples x ∈ R, y ∈ T, thus S must contain all
remaining n² − kR kT tuples. In this simple example the conditional
probability depends only on the cardinalities kR, kT, and there was no
need to consider all four cells R ∩ T, R ∩ T̄, R̄ ∩ T, R̄ ∩ T̄.
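Evaluating this sum takes O(n²) arithmetic steps instead of enumerating the 2^{n²+2n} possible worlds. A Python sketch that evaluates the displayed formula as written, with illustrative symmetric probabilities:

from math import comb

def p_sym(n, pR, pS, pT):
    total = 0.0
    for kR in range(n + 1):
        for kT in range(n + 1):
            total += (comb(n, kR) * comb(n, kT)
                      * pR**kR * (1 - pR)**(n - kR)
                      * pT**kT * (1 - pT)**(n - kT)
                      * pS**(n * n - kR * kT))
    return total

print(p_sym(4, 0.5, 0.5, 0.5))   # (n+1)^2 = 25 terms instead of 2^24 worlds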
4.7 Extensions
result by Dalvi and Suciu [2007b] covers such cases but only for con-
junctive queries without self-joins. The case of more general queries
is open. When the database consists of both symmetric probabilistic
relations and asymmetric deterministic relations, then queries that are
liftable on symmetric databases remain liftable, as long as the Boolean
rank of the deterministic relations is bounded [Van den Broeck and
Darwiche, 2013].
The paper describes two results. The first result describes a class
of tractable queries. We say that a query is max-one if, for every atom,
at most one variable in that atom occurs in an inequality predicate.
The paper shows that, for every max-one query, its probability on
tuple-independent databases can be computed in PTIME. The proof
by Olteanu and Huang [2009] consists of showing how to construct
a polynomial-size OBDD (which we define in the next chapter). Here
we prove that a max-one query can be computed in polynomial time
using dynamic programming. For simplicity, we illustrate the proof
on the query q1 above: the general case follows immediately. Choose
any variable that is maximal under the predicates <: such a variable
must exist, otherwise the inequality predicates form a cycle, and the
query is unsatisfiable. For q1, let's choose z. Suppose w.l.o.g. that the
values in the z-column of the relation T are 1, 2, . . . , n. Thus, n is the
largest possible value of the variable z, and we may assume w.l.o.g.
that all values of x and y in the database are < n (since we can re-
move values ≥ n without affecting the query). On any possible world,
there are two cases: (1) ∃u T(n, u): in that case the query is true iff the
following residual query is true, q′1 = R(x), S(y), K(v, w), y < v. (2)
¬∃u T(n, u): in that case q1 is true iff it is true on the world obtained
by removing all tuples of the form T(n, −) from the database. Thus,
denoting by Pk(−) the probability on the subset of the database where
all values of z are ≤ k, and all values of x, y are < k, we have:

Pn(q1) = P(∃u T(n, u)) · Pn(q′1) + (1 − P(∃u T(n, u))) · Pn−1(q1)

Each term on the right-hand side is either the probability of a simpler
query (which is computed similarly), or is Pn−1(q1), which is over a
smaller domain.
Second, Olteanu and Huang [2009] consider queries that are not
max-one and describe a large class of #P-hard queries. We refer the
reader to Olteanu and Huang [2009] for the rather technical defi-
nition of this class, and instead will prove that the query q2 is #P-
hard, by reduction from the query H0 = ∃x∃yR(x), S(x, y), T (y). Con-
sider a probabilistic database instance R, S, T , and assume w.l.o.g. that
all constants in the database are even numbers. Define the instance
K, S, M as follows: the relation S is the same, K = {(i − 1, i + 1) |
i ∈ R}, M = {(j − 1, j + 1) | j ∈ T}. It follows that the probabilities of
H0 and q2 are equal.
Beyond these two results, the complexity of queries with inequali-
ties is open.
The first query checks if there are at least k distinct values x satisfying
the body of the conjunctive query, the second checks if there are at
least k distinct values y, and similarly for the third.
The paper proves trichotomy results for various combinations of
aggregate functions and comparison operators. In particular, it shows
that the query q1 is in PTIME, q2 is #P-hard but admits an FPTRAS,
while q3 admits a reduction from #BIS, which is the problem: given a
bipartite graph (X, Y, E), compute the fraction of subsets of X ∪ Y that are
independent sets. #BIS is a complete problem w.r.t. approximation-
preserving reductions, and thus it is believed to be hard to approxi-
mate. In other words, q3 is believed to be hard to approximate.
The complexity of queries with a HAVING clause is open beyond
conjunctive queries without self-joins.
where u0 and u1 are its 0-child and 1-child respectively. Then the
FBDD denotes the formula F := Fr, where r is the root of the DAG.
Notice that the definition in Equation 5.1 corresponds precisely to a
Shannon expansion. In other words, an FBDD encodes a sequence of
Shannon expansions.
An equivalent way to define the Boolean formula F represented by the
FBDD F is as a program that computes the formula. Let θ be an
assignment to the variables X. To compute θ(F), follow a path in the
DAG F, as follows. Set the current node to be the root node. If X is the
variable labeling the current node, then read its value θ(X): if θ(X) = 0
then continue with the 0-child, otherwise continue with the 1-child.
When reaching a sink node, return θ(F) as the label of that sink node
(0 or 1). We invite the reader to check that the two definitions are
equivalent.
called the reduced OBDD, or the canonical OBDD for the given vari-
able order Π. Reducing an OBDD to obtain the reduced (canonical)
OBDD is a process similar to minimizing a deterministic automaton.
Notice that checking formula equivalence Fu = Fv is co-NP complete,
and therefore reducing an OBDD is not an effective procedure. While
for theoretical analysis we always assume that a Π-OBDD is reduced,
in practice systems use some heuristics with false negatives for the
equivalence test, resulting in an OBDD that is not fully reduced.
If every path reads all n variables (in the order Π) then we call the
OBDD complete. The OBDD can then be partitioned into layers, where
layer i reads the variable X_{Π^{-1}(i)} and both its children belong to layer
i + 1. Notice that a canonical OBDD is not necessarily layered, since
edges may skip layers. Every OBDD can be converted into a complete
OBDD by introducing dummy test nodes at skipped layers, with at
most a factor n increase in the size. The width of a complete OBDD is
the largest number of nodes in any layer. The number of nodes in the
OBDD is ≤ nw, where n is the number of Boolean variables and w is
the width.
An important property of complete OBDDs, which we will use for
query compilation, is that one can synthesize an OBDD for a Boolean
formula F = F1 op F2 from OBDDs for its sub-formulas, where op
is one of ∨ or ∧. Let F1 , F2 be complete Π-OBDDs computing F1 , F2 ,
and let w1 , w2 be their widths. We construct a Π-OBDD F for F =
F1 op F2 , with a width at most w1 w2 , as follows. The nodes of F are
all pairs (u, v) where u and v are two nodes at the same layer i of F1
and F2 respectively. The 0-child and 1-child of (u, v) are (u0 , v0 ) and
(u1 , v1 ) respectively, where u0 , u1 are the children of u in F1 and v0 , v1
are the children of v in F2 . Since both F1 and F2 are complete, all
nodes u0 , u1 , v0 , v1 are at the same layer i + 1, hence the construction
is well defined. The root of F is (r1 , r2 ), where r1 , r2 are the roots of
F1, F2, and the sink nodes of F will be pairs (u, v) where u, v are sink
nodes of F1 and F2 respectively; the label of (u, v) is u op v. For example,
if we want to compute F1 ∨ F2, then we obtain 4 kinds of sink nodes:
(0, 0), (0, 1), (1, 0), (1, 1). The first becomes a 0-sink in F, the last three
become 1-sinks.
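A Python sketch of this synthesis; the encoding of a complete Π-OBDD as one dictionary per layer, mapping a node id to its pair of children in the next layer (with integer sinks 0 and 1), is our own:

def synthesize(layers1, layers2, op):
    # Product construction for two complete OBDDs over the same order;
    # the nodes of the result are pairs (u, v), so its width is at most
    # the product of the two widths. Both OBDDs name their root "r".
    out = []
    frontier = {("r", "r")}
    for l1, l2 in zip(layers1, layers2):
        layer, nxt = {}, set()
        for (u, v) in frontier:
            children = ((l1[u][0], l2[v][0]), (l1[u][1], l2[v][1]))
            layer[(u, v)] = children
            nxt.update(children)
        out.append(layer)
        frontier = nxt
    out.append({(u, v): op(u, v) for (u, v) in frontier})   # u, v ∈ {0, 1}
    return out

# Width-2 OBDDs for X1 ∨ X2 and X1 ∧ X2, over the order X1, X2:
or_bdd  = [{"r": ("n0", "n1")}, {"n0": (0, 1), "n1": (1, 1)}]
and_bdd = [{"r": ("n0", "n1")}, {"n0": (0, 0), "n1": (0, 1)}]
prod = synthesize(or_bdd, and_bdd, lambda a, b: int(a and b))
# prod computes (X1 ∨ X2) ∧ (X1 ∧ X2) = X1 ∧ X2.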
The trace of a DPLL Algorithm Recall from §3.4 that today's exact
model counting algorithms are based on the DPLL family of algorithms.
Huang and Darwiche [2005] observed that the trace of any DPLL al-
gorithm is a compilation target. More precisely:
Query compilation for some query Q means first computing the lineage
FQ,n of the query on the domain [n], then compiling FQ,n into one of
the compilation targets described in the previous section. The main
problem that we study is the size of the resulting compilation, as a
function of the domain size n. If this size is large, e.g. exponential in
n, then any DPLL-based algorithm whose trace is that type of com-
pilation will also run in exponential time. If the size is small, then in
most cases it turns out that we can also design a specialized DPLL
algorithm to compute the query with that running time.
We will present several lower and upper bounds for the com-
pilation size. Most of these results were presented for Unions of
Conjunctive Queries (UCQ), but they carry over immediately to
Unate First Order Logic with a single type of quantifiers (∃ or ∀),
similarly to Chapter 4. To keep the discussion simple, we present
these results in the context of Unions of Conjunctive Queries. Recall
that we denoted by FO^{∃,∨,∧} the positive, existential fragment of FO.
FO^{∃,∨,∧} is equivalent to Boolean UCQ queries, which are usually
written as a disjunction of conjunctive queries ⋁_m Qm. The conversion
from FO^{∃,∨,∧} to an expression of the form ⋁_m Qm may lead to an
exponentially larger expression.
Theorem 5.1 (Jha and Suciu [2013]). Let Q be a UCQ query. Then the
following conditions are equivalent.
We invite the reader to prove that every query over a unary vocabu-
lary is liftable using the inference rules in §4.2; this proves that Q is
liftable. To prove that its lineage is not read-once in general, it suffices
to check that the primal graph of the lineage FQ,2 has an induced P4
path (see Golumbic et al. [2006]).
The first query is inversion free, because in both atoms S the variables
occur in the same order as the existential quantifiers that introduced
them: ∃x1 ∃y1 S(x1 , y1 ) and ∃x2 ∃y2 S(x2 , y2 ). The second query is not
inversion-free because the variables in the two atoms S use reverse
orders relative to the existential quantifiers, i.e. ∃x1 ∃y1 S(x1 , y1 ) and
∃y2 ∃x2 S(x2 , y2 ). We cannot swap the quantifier order ∃y2 ∃x2 to ∃x2 ∃y2
because the context for T (y2 ) consists only of ∃y2 . For a more subtle
example, consider the query H2 (introduced in §4.4):
y1 (we cannot swap ∃x1 ∃y1 for that reason). But that conflicts with
the order required by the second S2 , which needs y2 to be introduced
before x2 . Hence, H2 too has an inversion. It is easy to check that all
queries Hk , k ≥ 0 in §4.4 have an inversion. On the other hand, every
hierarchical, non-repeating query expression is inversion-free.
We give an equivalent definition of inversion-free queries, perhaps
more intuitive. We will assume that Q is a hierarchical query expres-
sion. Recall from §4.3 that at(x) denotes the set of atoms that con-
tain the variable x. The unification graph of Q is defined as follows.
Its nodes consists of pairs of variables (x, y) that occur in a common
atom. And there is an undirected edge from (x, y) to (x0 , y 0 ) if there ex-
ists two atoms containing x, y and x0 , y 0 respectively, such that the two
atoms can be unified and their most general unifier sets x = x0 and
y = y 0 . Then an inversion is a path (x0 , y0 ), (x1 , y1 ), . . . , (xk , yk ) where
at(x0 ) ⊃ at(y0 ) and at(xk ) ⊂ at(yk ). The length of the inversion is de-
fined as k. (We can assume w.l.o.g. that at(xi ) = at(yi ) for i = 1, k − 1:
otherwise, if for example at(xi ) ⊃ at(yi ), then we consider the shorter
inversion from (xi , yi ) to (xk , yk ).) One can check that the query ex-
pression Q is inversion-free (as defined earlier) iff it has no inversion
(as defined here). For example, every query Hk has an inversion of
length k, namely (x0 , y0 ), (x1 , y1 ), · · · , (xk , yk ).
Fix a domain size n, and recall that Tup([n]) denotes the set of
ground tuples over the domain [n].
Theorem 5.3 (Jha and Suciu [2013]). Let Q be a shattered and ranked
UCQ. Then the following hold:
• If Q is inversion free, then for all n there exists an order Π on
Tup([n]) such that the reduced Π-OBDD for the lineage F_{Q,n} has
width ≤ 2^{|Q|} = O(1), and size O(n^k), where k is the maximum
arity of any relation in Q.
in the number of Boolean variables |Tup(n)|, and those for which the
OBDD is exponential in the size of the domain.
Proof. We sketch only the proof of the first item, by showing how
to convert an inversion-free query into an OBDD. We start from an
inversion-free query Q and a domain size n. Recall that the database
schema is R = (R1, R2, . . . , Rℓ). Consider the set S = {R1, . . . , Rℓ} ∪ [n]
with the total order R1 < R2 < · · · < Rℓ < 1 < 2 < · · · < n. By
definition, each relational symbol R ∈ R is associated with a permu-
tation π_R; define ρ_R := (π_R)^{-1}. We associate each ground tuple
t = R(i1, i2, . . . , ik) ∈ Tup([n]) with the following sequence in S*:
(i_{ρ(1)}, i_{ρ(2)}, . . . , i_{ρ(k)}, R). Define the order Π on Tup([n]) as the lexi-
cographic order of the corresponding sequences. In other words, if A
is an atom in the query, then we order the ground tuples gr(A) by
their first existential variable, then by their second, and so on; the fact
that the query is inversion-free ensures that there is no conflict in this
ordering, i.e. we get the same order for gr(A′) where A′ is a different
atom that refers to the same relational symbol as A. We claim that the
width of the Π-OBDD for Q is w ≤ 2^{|Q|} = O(1), which implies that the
size of the Π-OBDD is O(|Tup|) = O(n^k). We prove the claim by induc-
tion on the sentence Q. If Q is a single ground atom t, then the width
of the complete Π-OBDD is 2 (all levels below t must have two nodes
to remember if t was true or false). If Q = Q1 ∧ Q2 or Q = Q1 ∨ Q2 then
we first construct complete OBDDs for Q1 , Q2 . Since both Q1 , Q2 use
the same variable order Π, we can use the OBDD synthesis described
earlier to derive a complete OBDD for Q, whose width is at most the
product of the widths of the two OBDDs. If Q = ∃x Q1, then we first
construct n OBDDs Gi for Q1[i/x], for i = 1, 2, . . . , n. Let T := Tup([n]).
Partition T into n sets T = T1 ∪ · · · ∪ Tn, where Ti consists of all ground
atoms R(i1, . . . , ik) where R ∈ R and i_{ρ_R(1)} = i. Then, for each i, the
lineage of Q1[i/x] uses only Boolean variables from Ti. Moreover, the
order Π places all tuples in Ti before those in Tj, for all i < j. There-
fore, we can construct a Π-OBDD for ⋁_i F_{Q1[i/x],n} as follows: take the
union of all OBDDs G1, . . . , Gn, and reroute the 0-sink node of Gi−1 to the
root node of Gi. The resulting Π-OBDD is not complete yet, because
the 1-sink node of Gi−1 stops early, skipping all levels in the OBDDs
Gi , . . . , Gn . We complete the OBDD (since we need the OBDD to be
complete at each step of the induction) by introducing one new node
wi,j for each layer j of each Gi , i > 1. The 1-sink node of Gi−1 , i < n
will be re-routed to wi,1 ; both the 0-child and 1-child of wi,j are wi,j+1 if
j is not the last layer of Gi , or both are wi+1,1 if i < n, or both are the 1-
sink node otherwise. The new nodes increase the width only by 1.
There are known examples of UCQ queries whose FBDDs have size
polynomial in the input domain, and there are known examples of
queries whose FBDDs are exponential. We will illustrate both. The ex-
act separation between them is open.
Theorem 5.4 (Beame et al. [2014, 2017]). (1) Any FBDD for Hk has
≥ (2^n − 1)/n nodes, where n is the size of the domain. (2) Let
F(Z0, Z1, . . . , Zk) be any monotone Boolean formula that depends
on all variables Zi, i = 0, . . . , k. Then any FBDD for the query Q =
F(Hk0, Hk1, . . . , Hkk) has size 2^{Ω(n)}.
The theorem implies that any FBDD for QW is exponential in the size
of the domain. However, QW can be computed in PTIME using lifted
inference, by applying the inclusion/exclusion formula:
Several large datasets have been reported in the literature that are an-
notated with probabilistic values. We describe some of them below,
and summarize them in Table 6.1.
</m/02mjmr, /people/person/place_of_birth, /m/02hrh0_, 0.98>
¹The actual confidence of this triple is not reported in Dong et al. [2014].
6.3 Applications
[2016] describe a system that uses lifted inference for the inner loop
of the learning task. The authors report that the lifted learning algo-
rithm results in more accurate models than several competing approx-
imate approaches. In related work, Jaimovich et al. [2007] and Ahmadi
et al. [2012] speed up Markov logic network parameter learning by
performing approximate lifted message passing inference.
Image Retrieval Zhu et al. [2015] use probabilistic databases for im-
age retrieval within the DeepDive system: given a textual query, such
as “Find photos of me sea kayaking last Halloween in my photo al-
bum”, the task is to show images that fit the description. Probabilistic
databases are particularly suited to answer queries that require joint
reasoning about several (probabilistic) image features and meta-data.
Zhu et al. [2014] learn a more specific MLN knowledge base from im-
ages, in order to reason about object affordances.
7
Conclusions and Open Problems

Acknowledgements
Guy Van den Broeck was partially supported by NSF grants IIS-
1657613, IIS-1633857 and DARPA XAI grant N66001-17-2-4032. Dan
Suciu was supported by NSF grant III-1614738.
References
Arthur Choi and Adnan Darwiche. An edge deletion semantics for belief
propagation and its practical impact on approximation quality. In Proceed-
ings of the National Conference on Artificial Intelligence, volume 21, page 1107.
AAAI Press, 2006.
Arthur Choi and Adnan Darwiche. Relax, compensate and then recover.
In JSAI International Symposium on Artificial Intelligence, pages 167–180.
Springer, 2010.
Arthur Choi, Doga Kisa, and Adnan Darwiche. Compiling probabilistic
graphical models using sentential decision diagrams. In ECSQARU, pages
121–132, 2013.
Jaesik Choi, Rodrigo de Salvo Braz, and Hung H. Bui. Efficient methods for
lifted inference with aggregate factors. In AAAI, 2011.
Bruno Courcelle. The monadic second-order logic of graphs. I. Recognizable
sets of finite graphs. Inf. Comput., 85(1):12–75, 1990.
Fabio G. Cozman. Credal networks. AIJ, 120(2):199–233, 2000.
Fabio Gagliardi Cozman and Denis Deratani Mauá. Bayesian networks spec-
ified using propositional and relational constructs: Combined, data, and
domain complexity. In AAAI, pages 3519–3525, 2015.
Fabio Gagliardi Cozman and Denis Deratani Mauá. Probabilistic graphical
models specified by probabilistic logic programs: Semantics and complex-
ity. In Conference on Probabilistic Graphical Models, pages 110–122, 2016.
Nilesh N. Dalvi and Dan Suciu. Efficient query evaluation on probabilistic
databases. In VLDB, pages 864–875, 2004.
Nilesh N. Dalvi and Dan Suciu. Management of probabilistic data: foun-
dations and challenges. In Proceedings of the Twenty-Sixth ACM SIGACT-
SIGMOD-SIGART Symposium on Principles of Database Systems, June 11-13,
2007, Beijing, China, pages 1–12, 2007a.
Nilesh N. Dalvi and Dan Suciu. Efficient query evaluation on probabilistic
databases. VLDB J., 16(4):523–544, 2007b.
Nilesh N. Dalvi and Dan Suciu. The dichotomy of probabilistic inference for
unions of conjunctive queries. J. ACM, 59(6):30, 2012.
Adnan Darwiche. Any-space probabilistic inference. In Proceedings of the
Sixteenth conference on Uncertainty in artificial intelligence, pages 133–142.
Morgan Kaufmann Publishers Inc., 2000.
Adnan Darwiche. Decomposable negation normal form. J. ACM, 48(4):608–
647, 2001.
Pedro Domingos and Daniel Lowd. Markov Logic: An Interface Layer for Arti-
ficial Intelligence. Morgan & Claypool Publishers, 2009.
Xin Dong, Evgeniy Gabrilovich, Geremy Heitz, Wilko Horn, Ni Lao, Kevin
Murphy, Thomas Strohmann, Shaohua Sun, and Wei Zhang. Knowledge
vault: a web-scale approach to probabilistic knowledge fusion. In The 20th
ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining, KDD ’14, New York, NY, USA - August 24 - 27, 2014, pages 601–
610, 2014.
Stefano Ermon, Carla P Gomes, Ashish Sabharwal, and Bart Selman. Opti-
mization with parity constraints: from binary codes to discrete integration.
In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial In-
telligence, pages 202–211. AUAI Press, 2013.
Oren Etzioni, Michele Banko, Stephen Soderland, and Daniel S. Weld. Open
information extraction from the web. Commun. ACM, 51(12):68–74, 2008.
Anthony Fader, Stephen Soderland, and Oren Etzioni. Identifying relations
for open information extraction. In Proceedings of the 2011 Conference on
Empirical Methods in Natural Language Processing, EMNLP 2011, 27-31 July
2011, John McIntyre Conference Centre, Edinburgh, UK, A meeting of SIGDAT,
a Special Interest Group of the ACL, pages 1535–1545, 2011.
Ronald Fagin, Joseph Y. Halpern, and Nimrod Megiddo. A logic for reason-
ing about probabilities. Inf. Comput., 87(1/2):78–128, 1990.
Daan Fierens, Hendrik Blockeel, Maurice Bruynooghe, and Jan Ramon. Logi-
cal Bayesian networks and their relation to other probabilistic logical mod-
els. Inductive Logic Programming, pages 121–135, 2005.
Daan Fierens, Guy Van den Broeck, Joris Renkens, Dimitar Sht. Shterionov,
Bernd Gutmann, Ingo Thon, Gerda Janssens, and Luc De Raedt. Infer-
ence and learning in probabilistic logic programs using weighted Boolean
formulas. TPLP, 15(3):358–401, 2015.
Robert Fink and Dan Olteanu. A dichotomy for non-repeating queries
with negation in probabilistic databases. In Proceedings of the 33rd ACM
SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems,
PODS ’14, pages 144–155, New York, NY, USA, 2014. ACM.
Nir Friedman, Lise Getoor, Daphne Koller, and Avi Pfeffer. Learning prob-
abilistic relational models. In Proceedings of the Sixteenth International Joint
Conference on Artificial Intelligence, IJCAI 99, Stockholm, Sweden, July 31 - Au-
gust 6, 1999. 2 Volumes, 1450 pages, pages 1300–1309, 1999.
Todd J. Green and Val Tannen. Models for incomplete and probabilistic in-
formation. IEEE Data Eng. Bull., 29(1):17–24, 2006.
Eric Gribkoff and Dan Suciu. Slimshot: In-database probabilistic inference
for knowledge bases. PVLDB, 9(7):552–563, 2016.
Eric Gribkoff, Guy Van den Broeck, and Dan Suciu. Understanding the com-
plexity of lifted inference and asymmetric weighted model counting. In
UAI, pages 280–289, 2014a.
Eric Gribkoff, Dan Suciu, and Guy Van den Broeck. Lifted probabilistic in-
ference: A guide for the database researcher. IEEE Data Eng. Bull., 37(3):
6–17, 2014b.
Eric Gribkoff, Guy Van den Broeck, and Dan Suciu. The most probable
database problem. In Proceedings of the First International Workshop on Big
Uncertain Data (BUDA), June 2014c.
Martin Grohe. Descriptive Complexity, Canonisation, and Definable Graph Struc-
ture Theory. LNCS, 2017. (to appear).
Joseph Y. Halpern. An analysis of first-order logics of probability. Artificial
Intelligence, 46(3):311–350, 1990.
Joseph Y. Halpern. Reasoning about uncertainty. MIT Press, 2003.
Jochen Heinsohn. Probabilistic description logics. In Proceedings of the Tenth
international conference on Uncertainty in artificial intelligence, pages 311–318.
Morgan Kaufmann Publishers Inc., 1994.
Joseph M. Hellerstein, Peter J. Haas, and Helen J. Wang. Online aggregation.
In Proc. of SIGMOD, pages 171–182, 1997.
W. Hoeffding. Probability inequalities for sums of bounded random vari-
ables. Journal of the American Statistical Association, 58(301):13–30, 1963.
Johannes Hoffart, Fabian M. Suchanek, Klaus Berberich, and Gerhard
Weikum. Yago2: A spatially and temporally enhanced knowledge base
from wikipedia. Artif. Intell., 194:28–61, 2013.
Michael C. Horsch and David L. Poole. A dynamic approach to probabilistic
inference. In Proceedings of UAI, 1990.
Jinbo Huang and Adnan Darwiche. DPLL with a trace: From SAT to knowledge
compilation. In IJCAI, pages 156–162, 2005.
Edward Hung, Lise Getoor, and VS Subrahmanian. Probabilistic interval
XML. In International Conference on Database Theory, pages 361–377. Springer,
2003.
Abhay Kumar Jha and Dan Suciu. Probabilistic databases with MarkoViews.
PVLDB, 5(11):1160–1171, 2012.
Abhay Kumar Jha and Dan Suciu. Knowledge compilation meets database
theory: Compiling queries to decision diagrams. Theory Comput. Syst., 52
(3):403–440, 2013.
Bhargav Kanagal, Jian Li, and Amol Deshpande. Sensitivity analysis and
explanations for robust query evaluation in probabilistic databases. In
Proceedings of the ACM SIGMOD International Conference on Management of
Data, SIGMOD 2011, Athens, Greece, June 12-16, 2011, pages 841–852, 2011.
Srikanth Kandula, Anil Shanbhag, Aleksandar Vitorovic, Matthaios Olma,
Robert Grandl, Surajit Chaudhuri, and Bolin Ding. Quickr: Lazily approx-
imating complex ad hoc queries in big data clusters. In Proc. of SIGMOD,
pages 631–646, 2016.
Richard M. Karp and Michael Luby. Monte-Carlo algorithms for enumera-
tion and reliability problems. In 24th Annual Symposium on Foundations of
Computer Science, Tucson, Arizona, USA, 7-9 November 1983, pages 56–64,
1983.
Seyed Mehran Kazemi and David Poole. Elimination ordering in first-order
probabilistic inference. In AAAI, 2014.
Seyed Mehran Kazemi and David Poole. Knowledge compilation for lifted
probabilistic inference: Compiling to a low-level language. In KR, 2016.
Seyed Mehran Kazemi, Angelika Kimmig, Guy Van den Broeck, and David
Poole. New liftable classes for first-order probabilistic inference. In Ad-
vances in Neural Information Processing Systems 29 (NIPS), December 2016.
Gabriele Kern-Isberner and Thomas Lukasiewicz. Combining probabilistic
logic programming with the power of maximum entropy. Artificial Intelli-
gence, 157(1-2):139–202, 2004.
Kristian Kersting and Luc De Raedt. Bayesian logic programs. arXiv preprint
cs/0111058, 2001.
Kristian Kersting, Babak Ahmadi, and Sriraam Natarajan. Counting belief
propagation. In Proceedings of the Twenty-Fifth Conference on Uncertainty in
Artificial Intelligence, pages 277–284. AUAI Press, 2009.
Benny Kimelfeld and Yehoshua Sagiv. Matching twigs in probabilistic XML.
In Proceedings of the 33rd International Conference on Very Large Data Bases,
pages 27–38. VLDB Endowment, 2007.
Angelika Kimmig, Stephen Bach, Matthias Broecheler, Bert Huang, and Lise
Getoor. A short introduction to probabilistic soft logic. In Proceedings of the
NIPS Workshop on Probabilistic Programming: Foundations and Applications,
pages 1–4, 2012.
Timothy Kopp, Parag Singla, and Henry Kautz. Lifted symmetry detection
and breaking for MAP inference. In NIPS, pages 1315–1323, 2015.
Donald Kossmann. The state of the art in distributed query processing. ACM
Comput. Surv., 32(4):422–469, 2000.
Laks V. S. Lakshmanan, Nicola Leone, Robert B. Ross, and V. S. Subrahma-
nian. ProbView: A flexible probabilistic database system. ACM Trans.
Database Syst., 22(3):419–469, 1997.
C. Y. Lee. Representation of switching circuits by binary-decision programs.
Bell System Technical Journal, 38:985–999, 1959.
Jian Li and Amol Deshpande. Consensus answers for queries over prob-
abilistic databases. In Proceedings of the Twenty-Eighth ACM SIGMOD-
SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2009,
June 19 - July 1, 2009, Providence, Rhode Island, USA, pages 259–268, 2009.
Jian Li, Barna Saha, and Amol Deshpande. A unified approach to ranking in
probabilistic databases. VLDB J., 20(2):249–275, 2011.
Leonid Libkin. Elements of Finite Model Theory. Springer, 2004.
Leonid Libkin. SQL’s three-valued logic and certain answers. In 18th Interna-
tional Conference on Database Theory, ICDT 2015, March 23-27, 2015, Brussels,
Belgium, pages 94–109, 2015.
Daniel Lowd and Pedro Domingos. Efficient weight learning for Markov
logic networks. In European Conference on Principles of Data Mining and
Knowledge Discovery, pages 200–211. Springer, 2007.
Thomas Lukasiewicz. Expressive probabilistic description logics. Artificial
Intelligence, 172(6):852–883, 2008.
Andrew McCallum, Karl Schultz, and Sameer Singh. FACTORIE: Probabilistic
programming via imperatively defined factor graphs. In Neural Informa-
tion Processing Systems (NIPS), 2009.
Lilyana Mihalkova and Raymond J. Mooney. Bottom-up learning of Markov
logic network structure. In Proceedings of the 24th international conference on
Machine learning, pages 625–632. ACM, 2007.
Gerome Miklau and Dan Suciu. A formal analysis of information disclosure
in data exchange. J. Comput. Syst. Sci., 73(3):507–534, 2007.
Davide Nitti, Tinne De Laet, and Luc De Raedt. Probabilistic logic program-
ming for hybrid relational domains. Machine Learning, 103(3):407–449,
2016.
Feng Niu, Christopher Ré, AnHai Doan, and Jude W. Shavlik. Tuffy: Scaling
up statistical inference in Markov logic networks using an RDBMS. PVLDB,
4(6):373–384, 2011.
Dan Olteanu and Jiewen Huang. Using OBDDs for efficient query evalua-
tion on probabilistic databases. In Scalable Uncertainty Management, Second
International Conference, SUM 2008, Naples, Italy, October 1-3, 2008. Proceed-
ings, pages 326–340, 2008.
Dan Olteanu and Jiewen Huang. Secondary-storage confidence computation
for conjunctive queries with inequalities. In SIGMOD Conference, pages
389–402, 2009.
Dan Olteanu, Jiewen Huang, and Christoph Koch. SPROUT: Lazy vs. eager
query plans for tuple-independent probabilistic databases. In ICDE, pages
640–651, 2009.
Dan Olteanu, Jiewen Huang, and Christoph Koch. Approximate confidence
computation in probabilistic databases. In Proceedings of the 26th Inter-
national Conference on Data Engineering, ICDE 2010, March 1-6, 2010, Long
Beach, California, USA, pages 145–156, 2010.
M. A. Paskin. Maximum entropy probabilistic logic. Technical Report
UCB/CSD-01-1161, Computer Science Division, University of California,
Berkeley, 2002.
Judea Pearl. Probabilistic reasoning in intelligent systems: networks of plausible
inference. Morgan Kaufmann series in representation and reasoning. Mor-
gan Kaufmann, 1988.
Frederick E. Petry. Fuzzy databases: principles and applications, volume 5.
Springer Science & Business Media, 2012.
Avi Pfeffer. Figaro: An object-oriented probabilistic programming language.
Charles River Analytics Technical Report, 2009.
David Poole. Probabilistic Horn abduction and Bayesian networks. Artificial
intelligence, 64(1):81–129, 1993.
David Poole. The independent choice logic for modelling multiple agents
under uncertainty. Artificial Intelligence, 94(1):7–56, 1997.
David Poole. First-order probabilistic inference. In IJCAI, volume 3, pages
985–991, 2003.
J. Scott Provan and Michael O. Ball. The complexity of counting cuts and of
computing the probability that a graph is connected. SIAM J. Comput., 12
(4):777–788, 1983.
Christopher Ré and Dan Suciu. Materialized views in probabilistic databases
for information exchange and query optimization. In Proceedings of the
33rd International Conference on Very Large Data Bases, University of Vienna,
Austria, September 23-27, 2007, pages 51–62, 2007.
Christopher Ré and Dan Suciu. Managing probabilistic data with MystiQ:
The can-do, the could-do, and the can’t-do. In Scalable Uncertainty Man-
agement, Second International Conference, SUM 2008, Naples, Italy, October
1-3, 2008. Proceedings, pages 5–18, 2008.
Christopher Ré and Dan Suciu. The trichotomy of HAVING queries on a
probabilistic database. VLDB J., 18(5):1091–1116, 2009.
Christopher Ré, Nilesh N. Dalvi, and Dan Suciu. Efficient top-k query evalu-
ation on probabilistic data. In Proceedings of the 23rd International Conference
on Data Engineering, ICDE 2007, The Marmara Hotel, Istanbul, Turkey, April
15-20, 2007, pages 886–895, 2007.
Raymond Reiter. On closed world data bases. Logic and Data Bases, pages
55–76, 1978.
Joris Renkens, Guy Van den Broeck, and Siegfried Nijssen. k-optimal: a novel
approximate inference algorithm for ProbLog. Machine Learning, 89(3):215–
231, 2012.
Joris Renkens, Angelika Kimmig, Guy Van den Broeck, and Luc De Raedt.
Explanation-based approximate weighted model counting for probabilis-
tic logics. In AAAI Workshop: Statistical Relational Artificial Intelligence, 2014.
Matthew Richardson and Pedro M. Domingos. Markov logic networks. Ma-
chine Learning, 62(1-2):107–136, 2006.
Fabrizio Riguzzi. A top down interpreter for LPAD and CP-logic. In Congress
of the Italian Association for Artificial Intelligence, pages 109–120. Springer,
2007.
Dan Roth. On the hardness of approximate reasoning. Artif. Intell., 82(1-2):
273–302, 1996.
Stuart J. Russell. Unifying logic and probability. Commun. ACM, 58(7):88–97,
2015.
Tian Sang, Fahiem Bacchus, Paul Beame, Henry A. Kautz, and Toniann
Pitassi. Combining component caching and clause learning for effective
model counting. In SAT, 2004.
Taisuke Sato. A statistical learning method for logic programs with distri-
bution semantics. In Proceedings of the 12th International Conference on Logic
Programming (ICLP), pages 715–729. MIT Press, 1995.
Taisuke Sato and Yoshitaka Kameya. PRISM: a language for symbolic-
statistical modeling. In Proceedings of the International Joint Conference on
Artificial Intelligence, volume 15, pages 1330–1339, 1997.
Stefan Schoenmackers, Jesse Davis, Oren Etzioni, and Daniel S. Weld. Learn-
ing first-order Horn clauses from web text. In Proceedings of the 2010 Confer-
ence on Empirical Methods in Natural Language Processing, EMNLP 2010, 9-11
October 2010, MIT Stata Center, Massachusetts, USA, A meeting of SIGDAT, a
Special Interest Group of the ACL, pages 1088–1098, 2010.
Prithviraj Sen, Amol Deshpande, and Lise Getoor. Bisimulation-based ap-
proximate lifted inference. In Proceedings of the Twenty-Fifth Conference on
Uncertainty in Artificial Intelligence, pages 496–505. AUAI Press, 2009.
Pierre Senellart and Serge Abiteboul. On the complexity of managing proba-
bilistic XML data. In Proceedings of the Twenty-Sixth ACM SIGMOD-SIGACT-
SIGART Symposium on Principles of Database Systems, pages 283–292. ACM,
2007.
Jaeho Shin, Sen Wu, Feiran Wang, Christopher De Sa, Ce Zhang, and Christo-
pher Ré. Incremental knowledge base construction using DeepDive.
PVLDB, 8(11):1310–1321, 2015.
Sarvjeet Singh, Chris Mayfield, Sagar Mittal, Sunil Prabhakar, Susanne E.
Hambrusch, and Rahul Shah. Orion 2.0: native support for uncertain
data. In Proceedings of the ACM SIGMOD International Conference on Man-
agement of Data, SIGMOD 2008, Vancouver, BC, Canada, June 10-12, 2008,
pages 1239–1242, 2008.
Parag Singla and Pedro M. Domingos. Lifted first-order belief propagation.
In AAAI, volume 8, pages 1094–1099, 2008.
Richard P. Stanley. Enumerative Combinatorics. Cambridge University Press,
1997.
Dan Suciu, Dan Olteanu, Christopher Ré, and Christoph Koch. Probabilistic
Databases. Synthesis Lectures on Data Management. Morgan & Claypool
Publishers, 2011.
Nima Taghipour, Jesse Davis, and Hendrik Blockeel. First-order decompo-
sition trees. In Advances in Neural Information Processing Systems, pages
1052–1060, 2013.