Algebraic Data Types for Object-oriented Datalog

MAX SCHÄFER, PAVEL AVGUSTINOV, OEGE DE MOOR, Semmle

Datalog is a popular language for implementing program analyses: not only is it an elegant formalism for concisely specifying
least fixpoint algorithms, which are the bread and butter of program analysis, but these declarative specifications can also be
executed efficiently. However, plain Datalog can only work with atomic values and offers no first-class support for structured
data of any kind. This makes it cumbersome to express algorithms that need even very simple data structures like pairs, and
impossible to express those that need trees or lists. Hence, non-trivial analyses tend to rely on extra-logical features that
allow creating new values to represent compound data on the fly. We propose a more high-level solution: we extend QL, an
object-oriented dialect of Datalog, with a notion of algebraic data types that offer the usual combination of products, disjoint
unions and recursion. In addition, the branches of an algebraic data type can be full-fledged QL predicates, which may be
recursive not only with other data types but with arbitrary other predicates, enabling very fine-grained control over the
structure of the data type. The new types integrate smoothly with QL’s existing notions of classes and virtual dispatch, the
latter playing the role of a pattern matching construct. We have implemented our proposal by extending the QL evaluator
with a low-level operator for creating fresh values at runtime, and translating algebraic data types into applications of this
operator. To demonstrate the practical usefulness of our approach, we discuss three case studies tackling problems from the
general area of program analysis that were previously difficult or impossible to solve in QL.

CCS Concepts: • Software and its engineering → Abstract data types; Object oriented languages; Constraint and logic languages;

ACM Reference format:
Max Schäfer, Pavel Avgustinov, Oege de Moor. 2017. Algebraic Data Types for Object-oriented Datalog. 1, 1, Article 1 (April 2017), 24 pages.
DOI: 10.1145/nnnnnnn.nnnnnnn

1 INTRODUCTION
It has been said (Wirth 1976) that “algorithms + data structures = programs”. In program analysis, many of the
most important algorithms are least fixpoint computations on subset lattices. The logic programming language
Datalog is a natural choice for expressing such algorithms: being a first-order logic with recursion, it is rich
enough to allow elegant, declarative specifications of fixpoint algorithms, yet simple enough to admit aggressive
optimisation and efficient evaluation on relational database systems (Aref et al. 2015; Semmle 2017a).

Consequently, Datalog-based program analysis has a long research pedigree, and has recently seen a revival,
with systems such as Doop (Bravenboer and Smaragdakis 2009) and the Semmle platform (Avgustinov et al. 2016)
demonstrating its viability for real-world analysis tasks. Usually, an extractor (not written in Datalog) is first used
to create a database with a representation of the program to be analysed, for example in the form of three-address
code as used by Doop, or by encoding the entire AST structure as in the case of Semmle. The analyses themselves
are then written as Datalog queries that are evaluated over this database and yield relations representing the
analysis results. For example, the result of a Datalog-based pointer analysis might be a binary relation between
program variables and abstract objects they may point to.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that
copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page.
Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
© 2017 Copyright held by the owner/author(s). XXXX-XXXX/2017/4-ART1 $15.00
DOI: 10.1145/nnnnnnn.nnnnnnn
, Vol. 1, No. 1, Article 1. Publication date: April 2017.
While Datalog has proved its mettle in expressing program analysis algorithms, data structures are another
matter: plain Datalog simply offers no support at all for expressing and working with structured data. Programs
can only use atomic values, typically including primitive values like numbers or strings, as well as any entity
values that appear in the underlying database. Entity values can be used to encode references and thereby
represent complex data structures (Avgustinov et al. 2016), but this can only be done at database creation time.
The program itself operates in a fixed universe of values: any structured value that isn’t already available in the
database is simply not denotable.¹

Other logic programming languages, such as Prolog, come with built-in support for structured values, but
this tends to complicate their semantics and makes them more difficult to implement efficiently. While high-
performance Prolog engines are an active area of research (Hermenegildo et al. 2012; Swift and Warren 2012), we
are not aware of any implementations that are as stable and fast as their Datalog counterparts.

We briefly discuss three typical examples of program analyses that need structured data of one kind or another.
First, many analyses work on a control flow graph (CFG) representation of the program. CFG edges can be
very easily and naturally computed in Datalog from, say, the program AST, but there is no way of creating new
entities representing the CFG nodes. It is tempting to recycle AST nodes to represent CFG nodes, but this is
problematic since AST and CFG do not correspond cleanly to each other: some AST nodes are simply syntax
without CFG semantics, while conversely CFG entry and exit nodes can only be mapped onto the AST with
difficulty. Alternatively, the extractor could create entities for representing CFG nodes at database creation time,
but this causes an awkward split of the CFG construction across different analysis phases that is difficult to work
with. In particular, it introduces an undesirable dependence of the extractor on the analysis, since changes to the
CFG construction might now necessitate changes to the way the database is created.

As a second example, consider converting the program under analysis to static single assignment (SSA) form.
In SSA form, each source variable in the original program is split up into multiple SSA variables, each of which
has precisely one definition, and variable uses are renamed to refer to the most recently defined SSA variable.
Crucially, this requires introducing phi nodes, which are pseudo-assignments that merge the values of multiple
SSA variables at join points in the CFG. While the placement of phi nodes can be beautifully expressed in Datalog,
there is no way of creating new entities to represent them. Phi nodes can be characterised as a pair (n, x) of a
CFG node n and a (source) variable x, so a Datalog program could deal with SSA variables by carrying around n
and x in separate (Datalog) variables, but this is tedious and error-prone. Alternatively, the extractor could be
pressed into service to create entities for representing phi nodes, but this basically amounts to doing full SSA
conversion in the extractor, losing the benefits of a high-level, declarative specification.

Our final example is context-sensitive points-to analysis. Here, structured values are needed to express abstract
values and to express contexts. As an example of the latter, a 2-CFA analysis deals with contexts that are pairs of
call sites (c1, c2) such that c1 may call the function containing c2, which in turn may call the function currently
being analysed. Similarly, abstract values in points-to analysis are generally pairs of the form (k, a), where k is a
context (itself, as we have seen, a structured value), and a is an allocation site that may be analysed in context k.
Again, these compound values can be represented in Datalog by using separate (Datalog) variables to hold the
individual components, which we have already argued is error-prone. Allocating entities for all possible contexts
and abstract values in the extractor is not a viable choice, since, for example, not all pairs of call sites are valid
contexts, and the set of contexts actually needed during the analysis is smaller still.

In summary, all these examples show the need for structured values in program analysis. In some cases these
values can be emulated by explicitly tracking their components, and in other cases the extractor can enrich
the database sufficiently to introduce entities for representing the structured values ahead of time, but neither
solution is generally applicable, and both diminish the attractiveness of implementing the analysis in Datalog.

¹ At a higher level, of course, Datalog programs do create structured values, in that they define relations, which are sets of tuples.
But relations are not first-class values, and cannot be operated on by the program itself (for example, relations cannot take other relations as arguments).

In practice, Datalog-based program analysis systems tend not to restrict themselves to pure Datalog: some
only express core algorithms in Datalog and fall back on other languages for the rest of the analysis (Lhoták and
Hendren 2004; Whaley et al. 2005); others, notably Doop (Bravenboer and Smaragdakis 2009), are implemented
end-to-end in Datalog, but rely on language extensions to emulate structured values.

In particular, Doop makes crucial use of constructors, an extension to Datalog that allows inventing new values
during program execution. A predicate that is declared as a constructor has a special output parameter that is
not defined or referenced in the predicate body; instead, at runtime for each tuple v of values that the ordinary
parameters of the predicate are bound to by its body, a fresh value is created and assigned to the output parameter.
This value uniquely identifies the tuple v and can, for all intents and purposes, be used as if it were that tuple.

For example, to represent 2-CFA contexts one could define² a constructor predicate TwoCfaContext(c1, c2, k)
where c1 and c2 are ordinary parameters defined by the body of TwoCfaContext to range over the set of
pairs that should be considered as contexts, and k is the special output parameter. At runtime, the engine
introduces for each tuple (v1, v2) that satisfies the body of TwoCfaContext a fresh value k(v1, v2), and adds the
tuple (v1, v2, k(v1, v2)) to the relation TwoCfaContext.
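
The behaviour of a constructor predicate can be mimicked in a few lines of Python. The sketch below is an illustration under our own assumptions, not the LogicBlox implementation: the helper make_constructor, the call-site names and the identifiers k0, k1 are all hypothetical. It memoises one fresh identifier per distinct argument tuple, which is the property that lets the identifier stand in for the tuple.

```python
from itertools import count


def make_constructor():
    """Return a function that maps each distinct tuple to one fresh value."""
    fresh = count()   # source of identifiers that have not been used before
    table = {}        # argument tuple -> identifier already assigned to it

    def construct(*args):
        if args not in table:
            table[args] = f"k{next(fresh)}"
        return table[args]

    return construct


# Hypothetical result of evaluating the body of TwoCfaContext:
# the pairs of call sites that should be considered as contexts.
context_pairs = {("c1", "c2"), ("c2", "c3")}

new_context = make_constructor()

# The relation TwoCfaContext(c1, c2, k): one row, and one fresh k, per pair.
two_cfa_context = {(c1, c2, new_context(c1, c2)) for (c1, c2) in context_pairs}
```

Asking for the same pair twice yields the same identifier, so the third column is functionally determined by the first two and vice versa.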
While this is sufficient for encoding simple compound values, constructor predicates cannot be recursive
(either with themselves or with other predicates), so more complex tree or list structures cannot be expressed.

We propose to instead extend Datalog with algebraic data types, an approach for specifying structured data
types that has proved its worth in the functional programming community. In its general form, an algebraic data
type is a union of one or more branch types, each of which, in turn, is a tuple type. The union is kept disjoint
by tagging tuples from a branch type with the name of the branch, so even if two branches are the same when
considered as tuple types their values will not be merged. Moreover, the component types of a branch type can
(directly or indirectly) reference the enclosing algebraic data type, which allows representing recursive data
structures such as lists or trees.
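
To make the role of tagging concrete, the following Python sketch models a value of an algebraic data type as a pair of a branch name and a component tuple. This encoding, the helper inject and the branch names Left, Right, Nil and Cons are assumptions chosen for illustration; they do not appear in our proposal.

```python
def inject(branch, *components):
    """Build a value of a branch: the branch name tags the component tuple."""
    return (branch, components)


# Two hypothetical branches with identical component types.
left = inject("Left", 1, 2)
right = inject("Right", 1, 2)

# A recursive type: a list is either Nil or Cons(head, tail).
nil = inject("Nil")
one_two = inject("Cons", 1, inject("Cons", 2, nil))


def to_python_list(value):
    """Unfold a Cons/Nil value into a Python list by inspecting its tag."""
    items = []
    while value[0] == "Cons":
        head, value = value[1]
        items.append(head)
    return items
```

Because the tag is part of the value, left and right remain distinct even though their component tuples coincide.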
Unlike functional or imperative programming languages, where types are meta-level entities that belong to a
different conceptual realm than the programs they describe, typed dialects of Datalog tend to view types simply
as a kind of unary relation, which may either be defined in the underlying database or by the program itself. We
adopt this view for our algebraic data types and their branch types.

This, however, immediately raises a problem: recursive data types are generally infinite, so trying to interpret
them as unary predicate definitions and computing them in full will lead to non-termination. To avoid this, we
could handle them specially and introduce some sort of lazy evaluation mechanism to only construct as many
of their values as needed, but this would be quite a disruptive extension to the language semantics and would
mostly negate the advantage of conceptual simplicity we gain from treating types as unary predicates.

Instead, we observe that while a data type may be infinite, a given (terminating) program only ever uses a finite
subset of it, so we add another feature to our algebraic data types (besides union, tupling and recursion) which
provides fine-grained control over the extent of the data type: each branch type can have a branch body that
restricts the set of tuples that go into the branch type. By providing branch bodies, algebraic data types can be
restricted to only contain those values that the program actually needs, thus ensuring that they are finite and can
be evaluated like any other predicate. Another interesting feature of branch bodies is that they can, naturally,
be recursive, not only with other branch bodies, but with arbitrary other predicates. As we will see, this allows
specifying data types that are much more finely structured than algebraic data types in functional languages.

One important feature of algebraic data types in functional languages that we do not support is polymorphism:
under Datalog’s types-as-predicates approach a polymorphic type would correspond to a higher-order predicate,
which is far beyond the expressive power of Datalog.

² We use a simplified syntax for expository purposes; see Section 8.5 of the LogicBlox Reference Manual (LogicBlox 2017) for full details.

We have implemented our proposal as an extension of QL (Avgustinov et al. 2016), an object-oriented dialect of
Datalog with classes and virtual dispatch that compiles down to plain Datalog without classes. At the language
level, algebraic data types are introduced as a new kind of types. While orthogonal to classes, the two can be
combined freely and naturally, with virtual dispatch playing the role of a pattern matching construct. To provide
runtime support, we have extended the Datalog evaluator underlying QL with a tuple numbering operator, which
is similar to LogicBlox’s constructors, but permits recursion. We show how algebraic data types can be compiled
to applications of this tuple numbering operator.

We briefly study the metatheory of tuple numbering, showing that it fits smoothly into Datalog’s least-fixpoint
semantics and interacts well with common optimisations. It also provides a dramatic boost to expressiveness,
making plain Datalog without primitive types, which can only express polynomial algorithms, Turing-complete.

Moving from theoretical considerations to practical experience, we report on three case studies tackling
problems from the general area of program analysis: we discuss an implementation of the Cartesian Product
Algorithm, a context sensitivity strategy that employs very precise list-structured contexts; a library for building
control flow graphs for Java from an AST representation; and a parser for regular expressions that produces
ASTs. All three problems are hard or impossible to solve without language support for structured values.

In summary, our contributions are as follows:

• We propose an extension of QL, a dialect of Datalog, with monomorphic algebraic data types.
• We demonstrate how these data types can be implemented by translating them into applications of a
  low-level tuple numbering operator.
• We show that tuple numbering is Turing-complete, yet semantically well-behaved.
• We present three case studies demonstrating the practical usefulness of our proposal.

In the rest of the paper, we will motivate the need for algebraic data types in more detail by means of an
extended example (Section 2), then describe their syntax and semantics (Section 3) and explore their theoretical
properties (Section 4). After a brief discussion of our implementation (Section 5) we present three case studies
showing practical applications in Section 6 before surveying related work in Section 7 and concluding in Section 8.

2 BACKGROUND AND MOTIVATION
This section introduces QL by example, and motivates the need for algebraic data types. As our running example
we show how to implement SSA conversion (Cytron et al. 1991).

Assume we have encoded a flow-graph representation of a program using the three binary relations described
by the schema in Figure 1: succ is the successor relation between nodes, while def and use record definitions
and uses of variables, respectively. The columns of these relations are typed using the entity types @cfg_node
and @variable, meaning that the values contained in these columns should be viewed as entity values, that is,
opaque identifiers modelling some external entities (in this case, flow graph nodes and variables).

The relations succ, def and use are called extensional relations, since they are defined explicitly by storing
their extent (that is, the tuples they contain) in the database. This contrasts with intensional relations that are
defined implicitly by QL predicates and evaluated on top of the database.

The entity types @cfg_node and @variable are also extensional relations: they are unary relations, i.e. sets,
whose elements are entity values. Annotating a column of an extensional relation with an entity type means that any
value stored in that column must be contained in the entity type.

This demonstrates two key principles of QL: types (with the exception of built-in types like int and string) are
unary relations, and for a program entity to be of a type means that all its potential values are contained in the
type. Like ordinary predicates, types can be either extensional or intensional: extensional types are entity types,
of which we have already seen examples, and intensional types are classes, which we will encounter below.
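
The types-as-unary-relations view can be sketched in Python by modelling entity types as sets and extensional relations as sets of tuples. The concrete node and variable names, the helper respects_schema and the derived set defining_node are hypothetical and serve only as an illustration.

```python
# Extensional types: unary relations, modelled as sets of entity values.
cfg_node = {"n0", "n1", "n2"}
variable = {"x", "y"}

# Extensional relations with typed columns (cf. the schema of Figure 1).
succ = {("n0", "n1"), ("n1", "n2")}
def_ = {("n0", "x"), ("n2", "y")}


def respects_schema(relation, *column_types):
    """True if every value in every column is contained in that column's type."""
    return all(
        len(row) == len(column_types)
        and all(value in ty for value, ty in zip(row, column_types))
        for row in relation
    )


# An intensional type: a unary relation computed from the database
# rather than stored in it (here, the nodes that define some variable).
defining_node = {n for (n, _) in def_}
```

Checking a column annotation is then plain set membership, and an intensional type is simply a set that is computed instead of stored.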
succ(@cfg_node m, @cfg_node n);
def(@cfg_node n, @variable v);
use(@cfg_node n, @variable v);

Fig. 1. Extensional relations encoding a flow-graph representation of a program

predicate startsBB(@cfg_node n) {
  not succ(_, n) or
  exists(@cfg_node p, @cfg_node q | succ(p, n) and succ(q, n) and p != q) or
  exists(@cfg_node p, @cfg_node q | succ(p, n) and succ(p, q) and n != q)
}

class BasicBlock extends @cfg_node {
  BasicBlock() { startsBB(this) }
  @cfg_node getNode(int i) {
    i = 0 and result = this or
    succ(getNode(i-1), result) and not startsBB(result)
  }
  BasicBlock getAPredecessor() { exists(int i | succ(result.getNode(i), this)) }
  predicate dominates(BasicBlock that) { /* implementation omitted */ }
  predicate inDominanceFrontierOf(BasicBlock that) {
    that.dominates(getAPredecessor()) and not (that.dominates(this) and this != that)
  }
}

Fig. 2. A basic-block program representation defined in QL

As a first step towards SSA conversion, we abstract the node-based flow graph into a graph of basic blocks. So
we first define an (intensional) QL predicate startsBB shown at the top of Figure 2 that picks out those nodes
that start a new basic block: entry nodes (that is, nodes without predecessors), join nodes with more than one
predecessor, and branch successor nodes that have a predecessor with more than one successor.
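
For readers less familiar with QL, the three disjuncts of startsBB can be transliterated into Python over an explicit edge set. This is a sketch only: the function name starts_bb and the diamond-shaped example graph are our own assumptions, and every edge endpoint is assumed to be listed in nodes.

```python
def starts_bb(nodes, succ):
    """Nodes that start a basic block, following the disjuncts of startsBB."""
    preds = {n: {p for (p, m) in succ if m == n} for n in nodes}
    succs = {n: {m for (p, m) in succ if p == n} for n in nodes}
    return {
        n for n in nodes
        if not preds[n]                               # entry node
        or len(preds[n]) > 1                          # join node
        or any(len(succs[p]) > 1 for p in preds[n])   # branch successor
    }


# Hypothetical flow graph: n0 branches to n1 and n2, which join at n3.
nodes = {"n0", "n1", "n2", "n3", "n4"}
succ = {("n0", "n1"), ("n0", "n2"), ("n1", "n3"), ("n2", "n3"), ("n3", "n4")}

block_starts = starts_bb(nodes, succ)
```

On this graph n4 is the only node that does not start a block: its single predecessor n3 has no other successor.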
Next, we define a QL class to model basic blocks. In QL, a class is simply a set of values for which a set of
predicates (that we might think of as methods) with a distinguished parameter this are defined. Each class
extends one or more other types (possibly themselves classes), and may define a characteristic predicate, which
can also refer to this and syntactically looks like a no-argument constructor in Java. The extent of a class contains
precisely those values for this that are contained in all supertype extents and in the characteristic predicate.

Our class BasicBlock is shown in the bottom half of Figure 2: it has @cfg_node as its only supertype and its
characteristic predicate requires that startsBB(this). In other words, the extent of BasicBlock consists of exactly
those @cfg_nodes n for which startsBB(n) holds.

Now we define four member predicates: getNode, which looks up a node in a basic block by index;
getAPredecessor, which navigates the basic block-level flow graph; dominates, which determines
dominance between basic blocks and whose implementation we have omitted for space reasons; and finally
inDominanceFrontierOf to compute dominance frontiers, which will be used later to place phi nodes.

Predicates getNode and getAPredecessor use QL’s functional syntax, which allows us to (syntactically)
treat them as functions returning a result, as shown in the recursive call getNode(i-1). Inside the body of
these predicates, an implicitly declared result variable is available that is used to refer to the return value. The
functional syntax is purely syntactic sugar, and QL neither assumes nor guarantees that a predicate using this
syntax is, in fact, a function: while getNode happens to be one, getAPredecessor is not. Such “multi-valued”
functions are, however, very natural to use in a logic programming language.
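
The difference between a functional and a multi-valued predicate shows up clearly once both are written as relations. In the Python sketch below, which rests on our own assumptions (the function names, the example graph and the hand-computed starts set are hypothetical), get_node returns the set of (index, node) pairs of one block and get_a_predecessor returns a set that may hold zero, one or several blocks.

```python
def get_node(block, succ, starts):
    """The getNode relation of one basic block, as a set of (index, node) pairs."""
    relation = {(0, block)}
    worklist = [(0, block)]
    while worklist:
        i, node = worklist.pop()
        for (p, m) in succ:
            # Extend the block along edges that do not enter a new block.
            if p == node and m not in starts and (i + 1, m) not in relation:
                relation.add((i + 1, m))
                worklist.append((i + 1, m))
    return relation


def get_a_predecessor(block, succ, starts):
    """All basic blocks containing a node with an edge into `block`."""
    return {
        b for b in starts
        if any((n, block) in succ for (_, n) in get_node(b, succ, starts))
    }


succ = {("n0", "n1"), ("n0", "n2"), ("n1", "n3"), ("n2", "n3"), ("n3", "n4")}
starts = {"n0", "n1", "n2", "n3"}  # hand-computed startsBB extent for this graph
```

Here each index of a block maps to exactly one node, whereas the join block n3 has two predecessor blocks and the entry block n0 has none.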
predicate ssaDef(BasicBlock bb, @variable v) { def(bb.getNode(_), v) or phi(bb, v) }

predicate phi(BasicBlock bb, @variable v) {
  exists(BasicBlock defbb | ssaDef(defbb, v) and bb.inDominanceFrontierOf(defbb))
}

Fig. 3. Phi node placement in QL

Note that QL supports arithmetic (cf. predicate getNode); in combination with recursion, this makes it
possible to write infinite, and hence non-terminating, predicates. QL also supports negation (cf. predicate
inDominanceFrontierOf), but restricts its use in recursive predicates by requiring parity stratification, that is,
any recursive cycle between predicates must go through an even number of negations.

Having established a basic block representation of our program, we now proceed to implement SSA conversion
proper. In SSA form, each program variable is split into one or more SSA variables, each of which has a single
definition. A definition of an SSA variable is either an explicit definition of a program variable, or an implicit phi
node that is inserted into the flow graph at points where two or more definitions of a variable are merged.

As is well known, a phi node for a variable v needs to be inserted at the beginning of each basic block bb that
is in the dominance frontier of another basic block defbb that defines v, either by an explicit definition or by a
previously inserted phi node. Inserting a phi node may in turn trigger the insertion of other phi nodes.

QL’s least fixpoint semantics allows a very succinct implementation of phi node placement, shown in Figure 3:
predicate phi(bb, v) determines if a phi node for v is needed at the beginning of bb using the dominance
frontier criterion, and ssaDef(bb, v) records the fact that basic block bb contains an SSA definition of v.
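
The mutual recursion between phi and ssaDef can be read operationally as a naive least fixpoint iteration. The Python sketch below illustrates that reading and is not a description of how the QL evaluator works; the function place_phis, the block names and the dominance frontier relation in_df_of (which we assume to be given, since dominates is omitted above) are hypothetical.

```python
def place_phis(explicit_defs, in_df_of):
    """Iterate phi/ssaDef to a least fixpoint.

    explicit_defs: set of (bb, v), block bb explicitly defines variable v
    in_df_of:      set of (bb, defbb), bb is in the dominance frontier of defbb
    """
    phi = set()
    while True:
        ssa_def = explicit_defs | phi
        new_phi = {
            (bb, v)
            for (defbb, v) in ssa_def
            for (bb, d) in in_df_of
            if d == defbb
        }
        if new_phi <= phi:      # nothing new was derived: fixpoint reached
            return phi, ssa_def
        phi |= new_phi


# Hypothetical input: B1 defines x; B3 is in the frontier of B1 and B2,
# and B0 is in the frontier of B3.
explicit_defs = {("B1", "x")}
in_df_of = {("B3", "B1"), ("B3", "B2"), ("B0", "B3")}

phi, ssa_def = place_phis(explicit_defs, in_df_of)
```

The phi node at B3 is derived in the first round and in turn triggers the one at B0 in the second, mirroring the observation that inserting a phi node may trigger further insertions.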
Elegant as this implementation is, it does not give us a good representation of SSA definitions. The best we can
do is to treat SSA definitions as tuples (bb, v) for which ssaDef(bb, v) holds. This is awkward, since tuples
are not first-class values in QL, so every variable that may hold SSA definitions would have to be split up into
two auxiliary variables to hold the components of the tuple. Care has to be taken to carry around these variables
in unison and not to accidentally mix up components from different tuples. With algebraic data types, we can
represent tuples as first-class values, which solves this problem.

Another problem is that representing explicit definitions as pairs (bb, v) is too imprecise: a single basic
block may contain multiple definitions of the same variable, which we would often like to distinguish, but they
are conflated in the pair representation. We could include the index of the defining node in our representation,
talking about triples (bb, i, v) instead of pairs (bb, v), but this representation is not very suitable for phi
nodes, which do not correspond to actual flow nodes. We could assign them a dummy index, say -1, but that is a
workaround rather than a solution. At the end of the day, the most natural thing to do is to represent explicit
definitions by triples, and phi nodes by pairs. With algebraic data types, values arising from different branches of
the type can have different arities, which solves this problem.

Borrowing Haskell syntax, we might consider representing SSA definitions using an algebraic data type SsaDef
with two branches Def and Phi, defined like this:

  data SsaDef = Def BasicBlock int @variable | Phi BasicBlock @variable

However, this does not fit very well into the conceptual model of QL, where types are just unary predicates:
SsaDef contains infinitely many values (since the second component of Def can be any integer), and thus cannot
be evaluated like a normal predicate. We could make special provisions for lazily evaluating algebraic data
types, but this would substantially complicate the language semantics and introduce a jarring mismatch between
algebraic data types and other QL types.³

³ Of course, primitive types like int have similar problems, and they are indeed treated specially in QL, but primitive types are built into the
language and there are very few of them, while algebraic data types are user-defined.

newtype SsaDef =
  Def(BasicBlock bb, int i, @variable v) { def(bb.getNode(i), v) }
  or Phi(BasicBlock bb, @variable v) {
    exists(SsaDefinition def | def.getVariable() = v and bb.inDominanceFrontierOf(def.getBasicBlock()))
  }

class SsaDefinition extends SsaDef {
  abstract BasicBlock getBasicBlock();
  abstract @variable getVariable();
}

class ExplicitDefinition extends SsaDefinition, Def {
  BasicBlock getBasicBlock() { this = Def(result, _, _) }
  @variable getVariable() { this = Def(_, _, result) }
}

class PhiNode extends SsaDefinition, Phi {
  BasicBlock getBasicBlock() { this = Phi(result, _) }
  @variable getVariable() { this = Phi(_, result) }
}

Fig. 4. An algebraic data type for describing SSA variables

Instead, we observe that while there are infinitely many SsaDef values, we are only interested in finitely many
of them, namely those that represent actual SSA definitions. If we allow the branches of algebraic data types to
restrict the possible values of their parameters so as to construct only those values that are actually needed, then
algebraic data types can be evaluated like any other predicates and harmony is restored.

To this end, the branches of an algebraic data type in QL may have a body that computes the set of tuples that
the branch ranges over, as shown at the top of Figure 4: branch Def of type SsaDef is defined as

  Def(BasicBlock bb, int i, @variable v) { def(bb.getNode(i), v) }

meaning that it ranges over those tuples (bb, i, v) for which def(bb.getNode(i), v) holds, and no other
tuples. The branch body of Phi implements the phi node placement algorithm discussed above.
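
The effect of the branch body on the extent of Def can be sketched in Python as follows. The code is illustrative only and makes several assumptions of its own: the getNode relation is passed in as a set of (bb, i, n) triples, values are represented as pairs ("Def", k), and numbering the sorted tuples is merely one simple way of assigning distinct values, not the numbering scheme of our evaluator.

```python
def def_branch(get_node, defs):
    """Rows (bb, i, v, value) for branch Def, one value per body tuple.

    get_node: set of (bb, i, n), node n has index i in basic block bb
    defs:     set of (n, v), node n defines variable v
    """
    # Evaluate the branch body: def(bb.getNode(i), v).
    body = {(bb, i, v) for (bb, i, n) in get_node for (m, v) in defs if m == n}
    # Assign one distinct, branch-tagged value to each tuple of the body.
    return {(bb, i, v, ("Def", k)) for k, (bb, i, v) in enumerate(sorted(body))}


# Hypothetical block B0 with three nodes, two of which define x.
get_node = {("B0", 0, "n0"), ("B0", 1, "n1"), ("B0", 2, "n2")}
defs = {("n0", "x"), ("n2", "x")}

def_rows = def_branch(get_node, defs)
def_type = {value for (_, _, _, value) in def_rows}   # the branch type Def
```

Although int is infinite, only the indices 0 and 2 satisfy the body, so the branch type is finite; the two definitions of x in the same block receive different values.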
To make it easier to implement, we define classes SsaDefinition, ExplicitDefinition and PhiNode that
correspond, respectively, to the algebraic data type SsaDef as a whole, and to the two branch types Def and
Phi. These classes define member predicates getBasicBlock and getVariable for extracting the relevant bits
of information from SSA definitions, which are used in Phi to determine basic blocks that need a phi node.

Note that we use a branch name like Def for two distinct purposes: it can act as a branch type, that is, a unary
predicate, or as an injector predicate with four parameters (three explicitly declared ones and an implicit result
parameter). In fact, these two are different predicates that happen to both be called Def. In QL syntax, there is
never any ambiguity between the two.

The branch type Def is a subtype of SsaDef and can be used in declarations, such as the extends clause of
its corresponding class ExplicitDefinition. The injector predicate Def can either be thought of as a value
“constructor” that creates elements of the branch type Def given values for its parameters bb, i and v, or as a
“destructor” that extracts the components bb, i and v from a given value of Def. It is in this latter role that Def is
used in the member predicate definitions of class ExplicitDefinition.

Unlike the distinction between branch types and injector predicates, however, the distinction between
“constructors” and “destructors” is purely pedagogical; there is only one injector predicate.
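
That a single injector predicate serves both roles becomes apparent when it is viewed as a four-column relation that can be queried from either side. The following Python sketch uses a hypothetical, hand-written extent DEF and hypothetical value names d0 and d1; the three functions differ only in which columns they treat as given.

```python
# Hypothetical extent of the injector predicate Def: rows (bb, i, v, result).
DEF = {("B0", 0, "x", "d0"), ("B0", 2, "x", "d1")}


def def_value(bb, i, v):
    """'Constructor' reading: components given, result column looked up."""
    return {r for (b, j, w, r) in DEF if (b, j, w) == (bb, i, v)}


def get_basic_block(this):
    """'Destructor' reading, as in this = Def(result, _, _)."""
    return {b for (b, _, _, r) in DEF if r == this}


def get_variable(this):
    """'Destructor' reading, as in this = Def(_, _, result)."""
    return {w for (_, _, w, r) in DEF if r == this}
```

A lookup such as def_value("B0", 1, "x") yields the empty set: the "constructor" cannot create a value for a tuple outside the extent computed by the branch body.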
Finally, we note that branch bodies can be recursive with each other and with normal predicates, and this is
indeed the case in our example: Phi calls SsaDefinition.getVariable, which is overridden by class PhiNode
to call Phi. As noted above, language extensions for modelling structured data in other Datalog dialects do not
permit such recursion, but we have found it to be a very useful and powerful tool in practice.
prog ::= cd td pd                program
td   ::= newtype A = bd          algebraic data type definition
bd   ::= B(T x) { f }            branch definition
f, g ::= ... | y = B(x)          formula
S, T ::= ... | A | B             type reference

Fig. 5. Extensions to CoreQL (Avgustinov et al. 2016) to incorporate algebraic data types

3 SYNTAX AND SEMANTICS
We now proceed to give a precise description of the syntax and semantics of our algebraic data types. Building
on a previous description of the semantics of QL (Avgustinov et al. 2016) in terms of a core calculus CoreQL and
its translation to untyped Datalog, we extend CoreQL with algebraic data types and show how it translates to
Datalog with a novel tuple numbering operator for creating new values at runtime. For the convenience of the
reader, we reproduce the definition of CoreQL in Appendix A.

3.1 Syntax and validity rules
Figure 5 shows the syntax of our algebraic data types as an extension of CoreQL: in addition to classes and
predicates, programs can also declare algebraic data types. Each such declaration associates a type name A with
a list of branch declarations bd. Each branch, in turn, has a name B, a list of parameters T x and a body, which is
a formula f. In full QL, the body may be omitted, in which case it defaults to the always-true formula any().

We also add a new kind of formula of the form y = B(x), where y is a variable name, B is a branch name, and x is a
list of variable names. In full QL, we instead introduce a new kind of expression B(e), where e is a list of argument
expressions, which can be desugared into its CoreQL counterpart by introducing temporary variables.

Finally, data type names A and branch names B are added to the set of type names, so they can appear in
variable declarations and the extends clauses of classes.

In addition to CoreQL’s syntactic validity requirements (cf. Appendix A), we require that no two types (whether
they be classes, data types or branches) and no two parameters of the same branch have the same name, and that
each data type have at least one branch; type references to data types and branches must correspond to a type or
branch definition; and for each formula y = B(x) there must be a branch named B with the appropriate arity.

Additionally, we introduce consistency rules that ensure programs do not mix up values from different algebraic
data types. Besides preventing logic errors, this also allows the implementation to reuse identifiers across tuple
numberings. We introduce a universe U_A for each algebraic data type A, which is the set of all values of that type,
and one additional universe U_0 containing all values that do not belong to algebraic data types.

Definition 3.1 (Universe assignment). To each type in a translatable CoreQL program (cf. again Appendix A),
we assign at most one universe:
• The universe of an algebraic data type A is U_A.
• The universe of a branch type B is the universe of its enclosing algebraic data type.
• The universe of an entity type @b is U_0.
• For a class C with supertypes T, if all supertypes are from the same universe U, then that is also the
  universe of C and C.domain.

Note that the last clause is well-defined, since translatable CoreQL programs have acyclic type hierarchies.
We require CoreQL programs to be universe consistent in the following sense:

Definition 3.2 (Universe consistency). A translatable CoreQL program is universe consistent if:
Algebraic Data Types for Object-oriented Datalog • 1:9

(1) For each class C with supertypes T, all supertypes are from the same universe U (thus guaranteeing that
    each class has a universe).
(2) For each call p(x) or y.p(x), the type of each argument variable x_i is from the same universe as the type
    of the corresponding parameter z_i of the called predicate.
(3) For each formula y = B(x), the type of y is from the same universe as B, and the type of each argument
    variable x_i is from the same universe as the type of the corresponding parameter z_i of B.
(4) For each member predicate p(S x) that overrides another predicate p(T z), each S_i is from the same
    universe as the corresponding T_i.
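
Definitions 3.1 and 3.2 can be read as a small bottom-up computation over the type hierarchy. The following Python sketch is only an illustration and not part of the QL compiler: the function names, the string encoding of universes, and the example hierarchy (including the class Mixed and the entity type @expr) are assumptions made for this example. It covers the universe assignment and clause (1) of universe consistency only.

```python
U0 = "U0"  # universe of all values that do not belong to an algebraic data type


def assign_universes(adts, branches, entity_types, classes):
    """Assign at most one universe to each type (cf. Definition 3.1).

    adts: set of data type names; branches: dict branch -> enclosing data type;
    entity_types: set of entity type names; classes: dict class -> supertypes.
    Classes whose supertypes disagree are left without a universe.
    """
    universe = {a: "U_" + a for a in adts}
    universe.update({b: "U_" + a for b, a in branches.items()})
    universe.update({e: U0 for e in entity_types})
    pending = dict(classes)
    progress = True
    while pending and progress:
        progress = False
        for cls, supers in list(pending.items()):
            if any(s in pending for s in supers):
                continue  # wait for supertypes; the hierarchy is assumed acyclic
            del pending[cls]
            progress = True
            found = {universe.get(s) for s in supers}
            if len(found) == 1 and None not in found:
                universe[cls] = found.pop()
    return universe


def inconsistent_classes(universe, classes):
    """Classes violating clause (1) of Definition 3.2."""
    return sorted(c for c in classes if c not in universe)
```

In this sketch a class that mixes a branch type with an entity type simply receives no universe, which is how clause (1) would be reported as violated.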
In our implementation for full QL we use the QL compiler’s type inference mechanism (Schäfer and de Moor
2010) to detect additional type errors. In general, the QL compiler considers any part of the program that it can
show to be unsatisfiable as a type error (even if there is no consistency violation). Its type inference algorithm
is parameterised over a type hierarchy that allows stating relationships between types as arbitrary monadic
first-order formulas. For an algebraic data type A with branch types B_1, ..., B_n, we augment the type hierarchy
with inclusion facts ∀x : B_i(x) ⟹ A(x) and disjointness facts ¬∃x : B_i(x) ∧ B_j(x) (where i ≠ j). This allows us,
for instance, to detect code that erroneously attempts to treat a value from one branch as belonging to a different
branch, which will never yield any results at runtime.

3.2 Datalog with tuple numbering
The target language of our translation is an untyped variant of Datalog extended with a tuple numbering operator,
which we now describe in more detail.

A Datalog program is a set of rules of the form p(x) ← φ where p belongs to the set I of intensional relation
symbols, each of which is associated with an arity; x is a vector drawn from the set V of element variables, whose
length is the same as the arity of p; and φ is a formula of first-order logic. The set of free variables of φ must be
exactly x, so every parameter of p is free in the body and vice versa. φ may make use of constant symbols (but
no function symbols) and refer to relations both from I and the set E of extensional relation symbols (which is
disjoint from I), subject to parity stratification. It may also use equality and the usual connectives and quantifiers
of first-order logic. Additionally, φ may contain sub-formulas of the form z = #r(y), where y and z are element
variables and r ∈ I ∪ E is a relation symbol of the appropriate arity.

The semantics of a Datalog program is computed over a structure ⟨D, E, #_i⟩, where D is a non-empty set (also
called the domain); E is an interpretation that assigns to each n-ary extensional relation symbol e ∈ E a set of
n-tuples over D; and #_i is a family of injective tuple-numbering functions from D^i to D, one for each natural
number i. Note that we do not require the ranges of different tuple-numbering functions to be disjoint.

To define the meaning of formulas φ, we additionally need a relation assignment I and a variable assignment σ;
the former is similar to E in that it assigns sets of domain tuples to relation symbols, but for intensional relation
symbols from I; the latter maps element variables to elements of D. A satisfaction judgment ⟨D, E, #_i⟩ ⊨_{I,σ} φ
can now be defined by structural induction on φ in the usual way, using I and E to look up relation symbols
and σ for element variables. The only new case is for tuple numbering: writing σ[y] for the n-tuple of values
assigned to y by σ, we define ⟨D, E, #_i⟩ ⊨_{I,σ} z = #r(y) to hold if σ[y] ∈ (I ∪ E)(r) and σ(z) = #_n(σ[y]).

Assuming that the program is stratified, rule bodies can be interpreted as monotonic maps over their free
relation variables. This is a well-known result for standard Datalog, and our definition of the semantics of
tuple-numbering is monotonic as well, as we will discuss in more detail below. Hence, intensional predicates can
be semantically interpreted as the least fixpoints of their defining rules, yielding the overall semantics of the
Datalog program.
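
To make the new case of the satisfaction judgment concrete, here is a minimal Python sketch. It is an illustration under stated assumptions rather than the Semmle engine's implementation: the class TupleNumbering, the function holds, and the representation of identifiers as tagged pairs are all invented for this example. The numbering is realised as a pool that hands out a fresh identifier the first time a tuple is seen, which makes it injective by construction.

```python
class TupleNumbering:
    """An injective map from tuples over the domain to fresh domain elements."""

    def __init__(self):
        self._ids = {}

    def number(self, tup):
        tup = tuple(tup)
        if tup not in self._ids:
            # Identifiers are tagged pairs so they cannot clash with base values.
            self._ids[tup] = ("id", len(self._ids))
        return self._ids[tup]


def holds(numbering, relations, r, sigma, y, z):
    """Check z = #r(y) under the variable assignment sigma.

    relations maps relation symbols to sets of tuples (playing the role of I ∪ E);
    y is a list of variable names and z a single variable name.
    """
    values = tuple(sigma[v] for v in y)
    return values in relations[r] and sigma[z] == numbering.number(values)
```

Enlarging the relation assigned to r can only turn a false judgment into a true one, never the reverse, which is the monotonicity property the text relies on.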

3.3 Translating algebraic data types to Datalog
Finally, we show the translation from CoreQL with algebraic data types to Datalog with tuple numbering. For ease
of reference, Figures 10 and 11 in Appendix A reproduce the translation from plain CoreQL to Datalog (Avgustinov
et al. 2016), on which we base our definitions.

Intuitively, the idea is to first treat each branch B of a data type A as a normal predicate B_dom that computes all
tuples that satisfy the branch body. Then we tuple-number B_dom to obtain a predicate B_# that assigns identifiers
to the tuples in B_dom. We cannot directly gather up the identifiers produced by the B_# predicates for the various
branches to obtain A, since different tuple numberings are not guaranteed to produce disjoint identifiers, so
two branches B_# and C_# might produce overlapping identifiers. Instead, we define a predicate A_dom consisting
of all pairs (b, i), where i is an identifier produced by some B_#, and b is a constant uniquely representing that
branch B among all other branches of A. For concreteness, we will choose the string “B” for this purpose, but
any other constant would do just as well. Finally, we tuple-number A_dom, yielding a predicate A.A whose output
enumerates the set representing A. The two steps of this encoding process correspond to the two type-forming
operations of product and disjoint sum types that together form the basis of algebraic data types.

Thinking operationally for a moment, to “construct” a value B(v) of A we first use B_# to compute an inner
identifier i for v, and then apply A.A to the pair (“B”, i) to obtain its outer identifier, which is the value representing
B(v) as an element of A. Conversely, to “destruct” an element of A we can apply A.A in reverse to decode it into
a pair (“B”, i) that tells us which branch it came from and what its inner identifier in that branch is, at which
point we can use B_# to recover the underlying tuple. In a logic programming language, of course, predicates are
not “applied” forwards or backwards; they statically describe a relation that can be navigated in any direction.
Hence the same predicates can be viewed as constructors or destructors, depending on context.

Making our informal description precise, each branch definition B(T x){ f } of a data type A gives rise to four
Datalog predicates: B_dom, which interprets f as a normal predicate body; B_#, which tuple-numbers B_dom to obtain
inner identifiers for all its tuples; B.B, which maps those inner identifiers to outer identifiers belonging to the
enclosing data type A; and B, which projects B.B onto its co-domain and hence contains those elements of A that
are generated by B. Formally, this looks as follows:⁴

    B_dom(x)  ← T_b(f, ⟨x_i := T_i⟩).
    B_#(x, y) ← y = #B_dom(x).
    B.B(x, z) ← ∃y : B_#(x, y) ∧ A.A(“B”, y, z).
    B(z)      ← ∃x : B.B(x, z).

Each data type definition A, in turn, induces three Datalog predicates: A_dom collects the tuple numbers assigned
by the B_# predicates of the branches into one set, tagging each with the name of the branch it came from. A.A
tuple-numbers A_dom to obtain identifiers for these tagged inner tuple numbers, and A again projects A.A down to
its last column, yielding the set of all elements in A.

    A_dom(b, y)  ← ⋁_B (b = “B” ∧ ∃x : B_#(x, y)).
    A.A(b, y, z) ← z = #A_dom(b, y).
    A(z)         ← ∃b, y : A.A(b, y, z).

Finally, we define how to translate y = B(x) to Datalog:

    T_b(y = B(x), Γ) = B.B(x, y)

⁴ See Appendix A for the definition of the translation function T_b(f, Γ), taken from (Avgustinov et al. 2016), which translates a QL formula f
to a corresponding Datalog formula, with the type environment Γ mapping QL variables to their declared types.

The existing rules for translating variable declarations in CoreQL already ensure that a variable with type A or
B is constrained to range over the elements in the unary predicates of the same name, so no special rules are
needed to handle variables declared to be of an algebraic data type or branch type.
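
The two-level encoding of Section 3.3 can be sketched in Python for the simple case of a non-recursive data type whose branch bodies have already been evaluated. This is an illustration only: the function translate, the identifier format, and the example branches Circle and Square are assumptions for this example, and a real evaluator would compute these relations as part of a least fixpoint. A single shared numbering is used on purpose, so that equal tuples in different branches receive the same inner identifier and the need for the branch tag becomes visible.

```python
def translate(branches):
    """branches maps each branch name B to the set of tuples in B_dom.

    Returns (b_hash, b_b, a), where b_hash[B] is B_# as a dict from tuple to
    inner identifier, b_b[B] is B.B as a dict from tuple to outer identifier,
    and a is the unary relation A.
    """
    pool = {}

    def number(tup):
        # One injective numbering shared by all relations.
        return pool.setdefault(tup, "#%d" % len(pool))

    # B_#: tuple-number the domain of each branch.
    b_hash = {b: {t: number(t) for t in sorted(dom)} for b, dom in branches.items()}
    # A_dom: inner identifiers tagged with the name of their branch.
    a_dom = {(b, y) for b, rel in b_hash.items() for y in rel.values()}
    # A.A: tuple-number the tagged pairs to obtain outer identifiers.
    a_a = {pair: number(pair) for pair in sorted(a_dom)}
    # B.B: compose B_# with A.A; A: project A.A onto its last column.
    b_b = {b: {t: a_a[(b, y)] for t, y in rel.items()} for b, rel in b_hash.items()}
    return b_hash, b_b, set(a_a.values())
```

With two branches that both contain the tuple (1,), the inner identifiers coincide but the outer identifiers differ, which is why the identifiers of the B_# predicates cannot simply be collected into A.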
It is perhaps worth noting that non-recursive algebraic data types can be encoded directly in Datalog without
the need for a tuple numbering operator: assuming that all branches of the data type A have the same arity n
(which can be achieved by padding with dummy values or repeating tuple components), each variable x of type
A can be represented as n + 1 component variables x_0, x_1, ..., x_n, where x_1, ..., x_n represent a tuple of values
and x_0 is a tag indicating which branch it is from. This does not work if A is recursive, as in that case one of the
components could itself be of type A (or another type depending on A).
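
The padding scheme just described can be sketched as follows. The helper names encode and decode and the use of None as the dummy value are assumptions for this illustration; the example type with branches Normal (arity 0) and Throw (arity 1) mirrors the Completion type of Section 6.2.

```python
def encode(branch_arities, branch, values, pad=None):
    """Represent a branch value as a row (tag, x_1, ..., x_n) of fixed width."""
    n = max(branch_arities.values())
    if len(values) != branch_arities[branch]:
        raise ValueError("wrong number of components for branch " + branch)
    return (branch,) + tuple(values) + (pad,) * (n - len(values))


def decode(branch_arities, row):
    """Recover the branch name and its components, dropping the padding."""
    branch = row[0]
    return branch, row[1:1 + branch_arities[branch]]
```

Every row has the same width, so it fits into ordinary Datalog columns. A recursive type has no such fixed width, which is where tuple numbering becomes necessary.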
3.4 Algebraic data types and classes
Algebraic data types and classes are semantically completely orthogonal. QL classes do not create new values;
they simply describe subsets of already existing values and provide an interface for working with them. Algebraic
data types, on the other hand, do create new values, but offer no data abstraction features. Indeed, they do not
need to, since we can simply define a class that extends the type and defines member predicates on it.

In practice, a common pattern is to have one class for each branch type and a superclass for the overall type as
in the SSA example of Section 2. The latter usually declares abstract predicates which are then implemented by
the former, with one implementation per branch. Sometimes multiple branches use the same implementation,
which can be accommodated by factoring out an intermediate class to hold the shared predicate.

Because QL classes can overlap, they can implement different interfaces for the same set of values. This allows
us to implement pattern matching on algebraic data types using virtual dispatch, similar to Scala’s case
classes (Odersky and Zenger 2005).

Assume we want to match on a value a from an algebraic data type A, with clauses f_1, ..., f_n corresponding
to the branches B_1, ..., B_n of A. Each clause f_i is a formula that may refer to the branch parameters of B_i, but
initially we assume it has no other free variables. To encode this in QL, we first define a new subclass of A that
declares a single abstract predicate representing the pattern matching:

    class AMatcher extends A { abstract predicate match(); }

Then, for each branch B_i(T_1 x_1, ...) we define a subclass of AMatcher that overrides match to apply f_i:

    class B_iMatcher extends AMatcher, B_i { predicate match() { exists(T_1 x_1, ... | this = B_i(x_1, ...) and f_i) } }

The matching can now be encoded as a QL formula a.(AMatcher).match():⁵ at runtime, QL’s normal dispatch
machinery will choose the implementation of match from the most specific subclass of AMatcher that contains
a, so if a belongs to branch type B_i, the implementation B_iMatcher.match, which simply wraps f_i, will be evaluated.

Additional free variables in match clauses can be accommodated by lifting them to parameters of match. If we
want to add a catch-all clause f_0 that applies if no other branch matches, we can simply turn AMatcher.match
into a concrete predicate with body f_0; dispatch semantics ensures that f_0 is only evaluated if no more specific
definition applies. Note that our encoding does not provide exhaustiveness checking, which would need special
support from the compiler.
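
The behaviour of this encoding, including the catch-all clause, can be emulated in a few lines of Python. This is only a model of the outcome and not of QL's dispatch machinery: the function dispatch_match, the representation of values as (branch, arguments) pairs, and the example clauses are assumptions made for this illustration.

```python
def dispatch_match(value, branch_of, clauses, default=None):
    """Evaluate the clause belonging to the branch that contains value.

    branch_of decodes a value into (branch name, argument tuple); clauses maps
    branch names to functions over the branch parameters. The catch-all clause
    default (playing the role of f_0) is used only if no branch-specific clause
    exists for the value's branch.
    """
    branch, args = branch_of(value)
    if branch in clauses:
        return clauses[branch](*args)
    if default is not None:
        return default()
    # An abstract match() without an applicable override yields no results.
    return False
```

As in the QL encoding, nothing here checks that every branch has a clause; a missing clause simply produces no result.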
⁵ Recall that QL uses postfix casts, so a.(AMatcher) means “the value of a, considered as an element of class AMatcher”.

4 METATHEORY
In this section, we prove a few results about the tuple numbering operator we have added to Datalog: we show
that tuple numbering is monotonic and hence needs no special semantic treatment; it admits context-pushing
optimisations; and it is Turing complete.

Theorem 4.1. Tuple numbering is monotonic in the sense that if ⟨D, E, #_i⟩ ⊨_{I,σ} z = #r(y) holds and I′ is an
assignment such that I(r) ⊆ I′(r) for all r ∈ dom(I), then ⟨D, E, #_i⟩ ⊨_{I′,σ} z = #r(y) holds as well.

Proof. Immediate from the definition. As mentioned above, this property is important because it means that
rules in Datalog with tuple numbering are monotonic maps over assignments like in plain Datalog, so they have
a well-defined least fixpoint semantics that can be computed by bottom-up evaluation as usual. □

The QL compiler performs a variety of whole-program optimisations on the Datalog program it generates.
Most of these optimisations amount to logical rewrites, so we need to clarify the interaction between tuple
numbering and other logical operators.
Our first result concerns the interaction between conjunction and tuple numbering.

Theorem 4.2. Tuple numbering commutes with conjunction: replacing a formula φ ∧ z = #r(y), where the free
variables of φ are contained in y, with z = #r′(y), where r′ is a newly defined intensional predicate r′(y) ← φ ∧ r(y),
does not change program semantics.

Proof. Without loss of generality, we assume that φ ∧ z = #r(y) is itself the body of a rule defining an
intensional predicate. Then it is easy to check that the least-fixpoint model of the old program can be extended
to the least-fixpoint model of the new program by assigning r′ all those tuples that satisfy φ ∧ r(y) under the
model of the old program. Both models assign the same meaning to all relation symbols except for r′, which does
not exist in the old program. □

This is important because many of the most important whole-program optimisations the QL compiler performs
rely on context pushing: a predicate q that is used in a conjunction together with some other predicate p can be
specialised by pushing the call to p into the body of q, thereby making q smaller and less expensive to compute.
The theorem states that this is safe even if q uses tuple numbering.

However, the same is not true of other logical operators, which, fortunately, are not used for inter-procedural
optimisations by the QL compiler.

Theorem 4.3. Tuple numbering does not commute with disjunction, negation, or existential quantification.

Proof. To ease notation, we write y = #φ for arbitrary formulas φ, meaning the program obtained by lifting φ
into a new intensional predicate.
Counterexample for disjunction: ⊤ ∨ x = #⊤ ≡ ⊤ ≢ x = #⊤; for negation: ¬(x = #⊤) ≢ ⊥ ≡ x = #(¬⊤).
For existential quantification, let p(x) be a predicate that holds for at least two values of x. Then ∃x : y = #p(x)
holds for at least two values of y, while y = #(∃x : p(x)) ≡ y = #⊤ holds for only one value of y. □
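
The counterexample for existential quantification can be checked mechanically. The Python sketch below is an illustration under assumptions: the helper number_all and the two-element relation p are invented for this example, and the nullary relation for ∃x : p(x) is modelled as a set containing at most the empty tuple.

```python
def number_all(relation, pool):
    """All identifiers z such that z = #r(y) for some tuple y in relation."""
    return {pool.setdefault(t, len(pool)) for t in relation}


p = {("a",), ("b",)}   # a predicate that holds for two values of x
pool = {}              # shared injective numbering of tuples

# ∃x : y = #p(x): number each tuple of p, then project away x.
lhs = number_all(p, pool)

# y = #(∃x : p(x)): first project p to a nullary relation, then number it.
exists_p = {()} if p else set()
rhs = number_all(exists_p, pool)
```

Here lhs contains two identifiers and rhs only one, so moving the quantifier across the numbering operator changes the result.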

Next, we investigate the expressive power of tuple numbering. Recall that pure Datalog can only express
polynomial algorithms (and is, in fact, PTIME-complete). As it turns out, adding tuple numbering has a rather
dramatic impact on its expressiveness:

Theorem 4.4. Tuple numbering makes Datalog (without primitive types, equality or negation) Turing complete.

Proof. Figure 6 shows how to implement SK combinators in QL with algebraic data types. The implementation
is parameterised over a binary predicate initial(l, r) that encodes any input term l r (that is, the application
of term l to term r) that we want to reduce.

Type Term represents combinator terms, including the combinators S and K themselves, as well as a judiciously
chosen set of applicative terms l r that is just large enough to include the input term and all its reducts (remember
that we cannot just include all applicative terms in Term, as that would make it infinite). Predicate red implements
one-step reduction, while eval is multi-step reduction of a term to its normal form, if it exists.

The precise statement of Turing completeness thus is: given an SK combinator term l r, we can construct a
QL program using algebraic data types (and hence a Datalog program using tuple numbering) such that reduction
of l r terminates at some normal form n if and only if the iterative bottom-up evaluation of the QL program
terminates with (an encoding of) the same normal form.

newtype Term = S() or K() or App(Term l, Term r) {
  initial(l, r)                                                           // inject input terms
  or exists(Term x, Term y, Term z | exists(App(App(App(S(), x), y), z)) | // if 'S x y z' is a term ...
       l = x and r = z                                                    // ... then so is 'x z' ...
       or l = y and r = z                                                 // ... and 'y z' ...
       or l = App(x, z) and r = App(y, z))                                // ... and 'x z (y z)'
  or exists(Term lprev | exists(App(lprev, r)) | l = red(lprev))          // congruence closure on the left
  or exists(Term rprev | exists(App(l, rprev)) | r = red(rprev))          // congruence closure on the right
}

Term red(Term t) {
  exists(Term x, Term y | t = App(App(K(), x), y) and result = x)         // 'K x y' reduces to 'x'
  or exists(Term x, Term y, Term z | t = App(App(App(S(), x), y), z) and result = App(App(x, z), App(y, z)))
  or exists(Term l, Term r | t = App(l, r) and (result = App(red(l), r) or result = App(l, red(r))))
}

Term eval(Term t) { result = eval(red(t)) or (not exists(red(t)) and result = t) }

Fig. 6. SK combinators in QL with algebraic data types; the predicate initial encodes the term to reduce


As an example, assume we want to reduce the term K S K. We encode it by providing an appropriate
implementation of initial (the first disjunct guarantees the existence of the terms used in the second disjunct):

    predicate initial(Term l, Term r) {
      l = K() and r = S() or
      l = App(K(), S()) and r = K()
    }

Now we can compute eval(App(App(K(), S()), K())), which yields the result S(), as expected.

Note that our implementation does not use primitive types. While it uses equality in a few places, most of
these equalities are syntactic and disappear when translating to Datalog, except for the equality result = t in
eval. However, it is easy to define a predicate equals(Term s, Term t) that computes equality of terms, so we
can eliminate this equality as well. Finally, the single use of negation can also be eliminated by implementing a
predicate nf(Term t) that holds for exactly those terms t that are in normal form, which can be done without
using equality or negation. □
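
The reduction rules used in the proof can be cross-checked with a direct Python sketch. It is not a translation of Figure 6: terms are plain nested tuples rather than members of a restricted Term type, and the names app, red and normal_forms as well as the step limit are assumptions of this illustration. The limit stands in for the fact that reduction of an arbitrary term need not terminate.

```python
S, K = ("S",), ("K",)


def app(l, r):
    return ("App", l, r)


def red(t):
    """Set of all one-step reducts of t."""
    out = set()
    if t[0] == "App":
        _, l, r = t
        if l[0] == "App" and l[1] == K:
            out.add(l[2])                                  # K x y -> x
        if l[0] == "App" and l[1][0] == "App" and l[1][1] == S:
            x, y, z = l[1][2], l[2], r
            out.add(app(app(x, z), app(y, z)))             # S x y z -> x z (y z)
        out |= {app(l2, r) for l2 in red(l)}               # reduce on the left
        out |= {app(l, r2) for r2 in red(r)}               # reduce on the right
    return out


def normal_forms(t, limit=10000):
    """Normal forms reachable from t, exploring at most about limit terms."""
    seen, todo, result = {t}, [t], set()
    while todo and len(seen) <= limit:
        u = todo.pop()
        nexts = red(u)
        if not nexts:
            result.add(u)
        for v in nexts - seen:
            seen.add(v)
            todo.append(v)
    return result
```

For the example term K S K, the only reachable normal form is S, in agreement with the result stated above.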
5 IMPLEMENTATION
In this section, we briefly describe how we have extended the Semmle runtime system to support tuple numbering.

The theoretically cleanest way of implementing tuple numberings would be as Gödel numberings. Hash
functions could be used as a pragmatic alternative, but sufficiently strong hashes are too long for the engine to
operate on them directly; for example, a SHA-1 hash needs 160 bits, while the Semmle engine expects primitive
values to be 32-bit or 64-bit quantities.

For strings, this problem is solved by maintaining a string pool that maps strings to unique 32-bit identifiers.
We follow the same strategy for tuple numberings, allocating tuple pools that map tuples of values to 32-bit
identifiers. To avoid exhausting the space of available identifiers, we use one tuple pool per universe signature,
where the universe signature of a relation is the ordered list of universes its parameters belong to. Thus, when
tuple numbering two relations p(x, y) and q(x′, y′) we will use the same tuple pool if x is from the same universe
as x′ and y from the same universe as y′. This overlap is not observable by universe-consistent QL programs.

Non-recursive tuple numbering can be implemented much more easily by computing the relation to be
numbered in full, sorting its tuples, and then assigning tuple identifiers in order. This does not work for the
recursive case where tuple elements might themselves be from the relation to be numbered or from another
relation that depends on it, so identifiers have to be assigned on the fly as tuples are added to the relation.
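
Both strategies can be sketched in Python. The classes TuplePool and PoolRegistry and the function number_sorted are hypothetical names for this illustration, universes are represented as plain strings, and none of this is meant to describe the actual data structures of the Semmle engine.

```python
class TuplePool:
    """Maps tuples to small integer identifiers, assigned on first use."""

    def __init__(self):
        self._ids = {}

    def intern(self, tup):
        return self._ids.setdefault(tuple(tup), len(self._ids))


class PoolRegistry:
    """Keeps one tuple pool per universe signature."""

    def __init__(self):
        self._pools = {}

    def pool_for(self, signature):
        key = tuple(signature)
        if key not in self._pools:
            self._pools[key] = TuplePool()
        return self._pools[key]


def number_sorted(relation):
    """Non-recursive case: compute the relation in full, sort it, number in order."""
    return {t: i for i, t in enumerate(sorted(relation))}
```

Relations with the same universe signature share a pool and may therefore share identifiers. The on-the-fly variant is the one that still works when new tuples mention identifiers that were handed out earlier in the same fixpoint computation.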
6 CASE STUDIES
To demonstrate the usefulness of our proposed algebraic data types, we now present three case studies that
put them to work on three practical analysis problems: context-sensitive flow analysis for JavaScript using
the Cartesian Product Algorithm as an example of a classic program analysis algorithm; control flow graph
construction for Java as an example of a supporting algorithm; and regular expression parsing as a somewhat
unconventional application.

For space reasons, we only describe the most salient parts of each case study in detail; links to full implementations
are provided on our website (Semmle 2017b).

6.1 Implementing the Cartesian Product Algorithm
A typical use case for structured values in Datalog is the implementation of context-sensitive flow analyses, a
particularly interesting example of which is the Cartesian Product Algorithm (CPA) (Agesen 1995). CPA contexts
are tuples of abstract values representing the arguments passed at some call site. To analyse a call f(e_1, ..., e_n),
we (i) analyse the argument expressions e_1, ..., e_n yielding abstract values v_1, ..., v_n; (ii) analyse the body of f
in the context (v_1, ..., v_n), which means that we assume each parameter x_i to have the corresponding abstract
value v_i; and (iii) use the abstract return value of f in this context as the abstract value of the call. As the analysis
proceeds, more possible abstract argument values v_i may be discovered, which may induce more possible contexts
for f, possibly yielding new abstract return values. These changes are monotonic, so the analysis will terminate.
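
The name of the algorithm comes from the way contexts arise: if several abstract values are possible for an argument, every combination is a context. A minimal Python sketch follows; the function name contexts and the string encoding of abstract values are assumptions for this example.

```python
from itertools import product


def contexts(arg_values):
    """All CPA contexts for a call site.

    arg_values is a list with one set of abstract values per argument position;
    the result is the Cartesian product of these sets.
    """
    return set(product(*arg_values))


before = contexts([{"Number"}, {"Number"}])
# Discovering a further abstract value for the second argument only adds contexts.
after = contexts([{"Number"}, {"Number", "Undefined"}])
```

The set of contexts after the discovery is a superset of the set before, which is the monotonic growth referred to above.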

We outline an implementation of CPA for JavaScript in QL, concentrating on the handling of contexts and
omitting most of the rules for handling individual language constructs, which are not interesting for our purposes.

At the heart of the analysis is a QL predicate eval that maps pairs of a Context and an Expr (that is, a JavaScript
expression) to one or more abstract values, represented by the algebraic data type AbstractValue, which is a
straightforward enumeration of the various kinds of values tracked by the analysis:

    newtype AbstractValue = Undefined() or Number() or AbstractFunction(Function f) or ...

Undefined is the abstract value representing the JavaScript undefined value; Number represents all numeric
values; while AbstractFunction defines one abstract value for each function f, representing all concrete function
objects created by evaluating f. The remaining branches are similar and have been omitted for brevity.

A context is a tuple of abstract values, which we represent as a cons-list:

    newtype Context = Nil() or Cons(AbstractValue car, Context cdr) { evalStep(_, _, _, _, car, cdr) }

The branch body of Cons restricts it to only construct lists that correspond to arguments for an actually
observed call site, using the predicate evalStep which we will discuss next. If Cons were left unrestricted, the
set of contexts would become infinite, and the analysis would never terminate.

Predicate evalStep(ctxt, c, f, i, car, cdr) models one step in the left-to-right evaluation of the
arguments of call site c; it holds if c in context ctxt may call function f, its ith argument evaluates to car, and
the remaining arguments to cdr. It is implemented by mutual recursion with another predicate evalArgs that
iteratively builds up the context corresponding to some call site c under a context ctxt, and relies on an auxiliary
predicate calls that models the call graph:

    predicate evalStep(Context ctxt, CallExpr c, Function f, int i, AbstractValue car, Context cdr) {
      car = eval(ctxt, c.getArgument(i)) and cdr = evalArgs(ctxt, c, f, i+1)
    }

    Context evalArgs(Context ctxt, CallExpr c, Function f, int i) {
      calls(ctxt, c, f) and i = c.getNumArgument() and result = Nil()
      or exists(AbstractValue car, Context cdr | evalStep(ctxt, c, f, i, car, cdr) and result = Cons(car, cdr))
    }

Next, we define a helper predicate appliesTo that determines whether a context applies to a function, and a
predicate lookup that looks up the abstract value corresponding to some parameter p in a context, using another
auxiliary predicate get(ctxt, i) that retrieves the ith element of context ctxt:

    predicate appliesTo(Context ctxt, Function f) { ctxt = evalArgs(_, _, f, 0) }

    AbstractValue lookup(Context ctxt, Parameter p) {
      exists(Function f, int i | appliesTo(ctxt, f) and p = f.getParameter(i) and result = get(ctxt, i))
    }
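
The auxiliary predicate get is not shown in the text. As a rough guide to what it computes, here is a Python sketch over a cons-list encoded as nested tuples; the encoding, the zero-based indexing (suggested by the use of evalArgs(_, _, f, 0) above) and the choice to return None for a missing element are all assumptions of this illustration, not the QL definition.

```python
NIL = ("Nil",)


def cons(car, cdr):
    return ("Cons", car, cdr)


def get(ctxt, i):
    """Return the i-th element (counting from 0) of a cons-list, or None."""
    while ctxt[0] == "Cons":
        if i == 0:
            return ctxt[1]
        ctxt, i = ctxt[2], i - 1
    return None


example = cons("Number", cons("Undefined", NIL))
```

In QL the corresponding predicate would simply have no result where this sketch returns None.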

With these preparations out of the way, we now show three representative clauses of the eval predicate:
analysis of numeric literals as an example of a simple base case (where we use appliesTo to restrict the range
of the otherwise unused parameter ctxt), and the two crucial cases of parameter and return value passing. To
analyse a use of a parameter p we look it up in the context; to analyse a function call, we construct the appropriate
calling context, and use the auxiliary predicate retval to determine the possible return values of the callee in
that context. That predicate, in turn, considers the return statements in the function to determine possible
return values:

    AbstractValue eval(Context ctxt, Expr e) {
      e instanceof NumberLiteral and appliesTo(ctxt, e.getEnclosingFunction()) and result = Number()
      or exists(SimpleParameter p | e = p.getVariable().getAnAccess() and result = lookup(ctxt, p))
      or exists(Function f | calls(ctxt, e, f) and result = retval(evalArgs(ctxt, e, f, 0), f))
      or ...
    }

    AbstractValue retval(Context ctxt, Function f) {
      exists(ReturnStmt ret | ret = f.getAReturnStmt() and result = eval(ctxt, ret.getExpr()))
    }

Extending these predicates to model all of ECMAScript 2016 is a non-trivial task, but takes no more than about
500 lines of QL. We want to emphasise, however, that we do not mean to claim that CPA is a silver bullet for the
analysis of JavaScript, the challenges of which are manifold and extensively documented in the literature (Jensen
et al. 2009; Kashyap et al. 2014; Park and Ryu 2015; Schäfer et al. 2013; Sridharan et al. 2012). We simply use it as
an example of a non-trivial context sensitivity policy that nicely demonstrates the use of recursive algebraic data
types: since contexts can be lists of arbitrary length, it is not clear how they could be represented in plain QL.

6.2 Constructing control flow graphs
As our next example, we show how to construct intra-procedural control flow graphs for programs written in
a very small subset of Java, comprising just four kinds of statements: expression statements, throw, try with
catch (but no finally) and block statements. We ignore any control flow resulting from expression evaluation.

The CFG contains one node for each statement, plus one entry node and one exit node for each callable (that
is, method or constructor):

    newtype CfgNode = StmtNode(Stmt s) or EntryNode(Callable c) or ExitNode(Callable c)

CFG edges are labelled with completions that indicate the reason for the flow. We have two kinds of completions:
45
the normal completion indicating normal termination of a statement; and throw completions, indicating that a
46
statement has thrown an exception:
47
48
, Vol. 1, No. 1, Article 1. Publication date: April 2017.
1:16 • Max Schäfer, Pavel Avgustinov, Oege de Moor

    predicate final(Stmt s, CfgNode f, Completion c) {
      // a throw statement is its own final node, completing with the thrown type
      s instanceof ThrowStmt and f = StmtNode(s) and c = Throw(s.(ThrowStmt).getExpr().getType())
      // blocks: final nodes of the last statement, plus abnormal final nodes of any statement
      or exists(Block b | b = s | final(b.getLastStmt(), f, c) or final(b.getAStmt(), f, c) and c != Normal())
      // try: final nodes of the body that are not definitely caught, plus final nodes of catch clauses
      or exists(TryStmt ts | ts = s |
        final(ts.getBlock(), f, c) and
        not exists(RefType thr | c = Throw(thr) | thr.getASupertype*() = ts.getACatchClause().getACaughtType())
        or final(ts.getACatchClause(), f, c)
      )
    }

    predicate succ(CfgNode s, CfgNode t) {
      // entry node to body, body to exit node
      exists(Callable c |
        s = EntryNode(c) and t = initial(c.getBody()) or
        final(c.getBody(), s, _) and t = ExitNode(c)
      )
      // normal completion of the i-th statement of a block flows to the (i+1)-th
      or exists(Block blk, int i | final(blk.getStmt(i), s, Normal()) and t = initial(blk.getStmt(i+1)))
      // a throw completion of a try body flows to every catch clause that may catch it
      or exists(TryStmt ts, RefType thr, CatchClause cc |
        final(ts.getBlock(), s, Throw(thr)) and cc = ts.getACatchClause() and
        exists(cc.getACaughtType().commonSubtype(thr)) and t = initial(cc.getBlock())
      )
    }

Fig. 7. Excerpts from a library for computing control flow graphs for Java.

Completions are modelled by the following data type:

    newtype Completion = Normal() or Throw(RefType t) { t.getASupertype*().hasQualifiedName("java.lang", "Throwable") }

Note that we use the branch body of Throw to restrict its parameter t to legal exception types. In full Java, we
additionally need break and continue completions (optionally including labels), and a return completion.

The CFG is now computed bottom-up, constructing partial CFGs for subtrees of the AST and then gradually
combining them into larger CFGs. The partial CFG for a subtree t has a unique initial node and one or more final
nodes, each associated with a completion.

The initial node initial(s) of a statement s is simply the corresponding CFG node StmtNode(s). Final nodes
are computed by a predicate final(s, f, c), parts of which are shown in Figure 7, that holds if f is a final
node of the subtree s when terminating with completion c. In particular:

  • Expression statements and throw statements are their own final nodes, the former having a normal
    completion, the latter the Throw completion of its thrown exception.
  • For blocks, any final node of the last statement is a final node of the block, as are the final nodes of other
    statements with throw completions; this models throw breaking out of a block.
  • For try statements, any final node of a catch block is a final node of the try, as is any final node of its
    body, except if it throws an exception that is a subtype of the type declared by a catch clause (and hence
    guaranteed to be caught).

Finally, Figure 7 also shows parts of the implementation of predicate succ, which matches up final nodes with
initial nodes to compute the CFG successor relation itself:

  • The successor of an entry node is the initial node of the callable’s body; the successor of any final node
    of the body (regardless of its completion) is the exit node.
  • For blocks, the successor of a final node of the i-th statement with a normal completion is the initial node
    of the (i + 1)-th statement.
  • For try statements, the successor of a final node in the body that may throw an exception is the initial
    node of any catch clause that may catch this exception at runtime.

    alt  ::= cat "|" alt | cat
    cat  ::= term cat | term
    term ::= atom "*" | atom
    atom ::= plainchar | "(" alt ")"

Fig. 8. A context-free grammar for a simple regular expression language

This approach generalises to full languages: we have implemented CFG construction for all of Java 8 (500 LoC)
and C# 6 (800 LoC) in QL, modelling not only statement-level control flow as shown here, but also expression-level
flow. For Java, we use AST nodes as CFG nodes, while the C# library has proper CFG nodes as shown here. Both
libraries scale to real-world code bases with millions of lines of code and are in production use.

Unlike the previous example, this approach to CFG construction could be implemented without algebraic data
types: an early version of the Java CFG library did so by encoding completions as integers and overloading AST
nodes to serve as CFG nodes. We found, however, that switching to algebraic data types made the code much
easier to understand and maintain.

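To make the bottom-up scheme of this section concrete outside QL, the following Python sketch mimics the final and succ rules on a hand-built AST. It is only an illustration under stated assumptions, not the library described above: the statement classes, the SUPER table of exception types, and the two edges from a block or try node to its first child are all hypothetical additions (the latter are needed to connect the graph but are not part of the excerpt in Figure 7), and blocks are assumed to be non-empty.

```python
from dataclasses import dataclass

# Hypothetical exception hierarchy: type name -> direct supertype.
SUPER = {"Throwable": None, "Exception": "Throwable", "IOException": "Exception"}

def supertypes(t):
    """Reflexive-transitive supertypes of t (the analogue of getASupertype*())."""
    chain = []
    while t is not None:
        chain.append(t)
        t = SUPER[t]
    return chain

@dataclass(frozen=True)
class Expr:            # expression statement
    name: str

@dataclass(frozen=True)
class Throw:           # throw statement; exc is the static type of the thrown value
    name: str
    exc: str

@dataclass(frozen=True)
class Block:           # non-empty block (an assumption of this sketch)
    name: str
    stmts: tuple

@dataclass(frozen=True)
class Try:             # try with catch clauses, each a (caught type, Block) pair
    name: str
    body: Block
    catches: tuple

NORMAL = ("normal",)   # a completion is NORMAL or ("throw", type)

def initial(s):
    return ("stmt", s.name)

def final(s):
    """All (final node, completion) pairs of the partial CFG for s."""
    if isinstance(s, Expr):
        return {(initial(s), NORMAL)}
    if isinstance(s, Throw):
        return {(initial(s), ("throw", s.exc))}
    if isinstance(s, Block):
        result = set(final(s.stmts[-1]))
        for st in s.stmts:  # throw completions break out of the block
            result |= {(f, c) for f, c in final(st) if c != NORMAL}
        return result
    # Try: drop body completions that are definitely caught, add catch-block finals.
    caught = {t for t, _ in s.catches}
    result = {(f, c) for f, c in final(s.body)
              if c == NORMAL or not caught & set(supertypes(c[1]))}
    for _, handler in s.catches:
        result |= final(handler)
    return result

def may_catch(caught, thrown):
    # With single inheritance, two types share a subtype iff one extends the other.
    return caught in supertypes(thrown) or thrown in supertypes(caught)

def succ(callable_name, body):
    """CFG successor relation as a set of (source, target) pairs."""
    edges = {(("entry", callable_name), initial(body))}
    edges |= {(f, ("exit", callable_name)) for f, _ in final(body)}

    def walk(s):
        if isinstance(s, Block):
            edges.add((initial(s), initial(s.stmts[0])))   # assumed rule, not in Fig. 7
            for a, b in zip(s.stmts, s.stmts[1:]):
                edges.update((f, initial(b)) for f, c in final(a) if c == NORMAL)
            for st in s.stmts:
                walk(st)
        elif isinstance(s, Try):
            edges.add((initial(s), initial(s.body)))       # assumed rule, not in Fig. 7
            for f, c in final(s.body):
                for t, handler in s.catches:
                    if c != NORMAL and may_catch(t, c[1]):
                        edges.add((f, initial(handler)))
            walk(s.body)
            for _, handler in s.catches:
                walk(handler)

    walk(body)
    return edges

# try { e1; throw IOException } catch (Exception) { e2 }  e3
body = Block("b0", (
    Try("t", Block("b1", (Expr("e1"), Throw("th", "IOException"))),
        (("Exception", Block("b2", (Expr("e2"),))),)),
    Expr("e3"),
))
edges = succ("m", body)
```

In this small program the IOException thrown inside the try is a subtype of the caught type, so it does not escape: the only final node of the whole body is e3 with a normal completion, and the throw statement has an edge to the catch block.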
6.3 Parsing regular expressions

Our last example is a parser for regular expressions that produces an AST representation which can be used
for writing analyses. Regular expressions are ubiquitous in modern programming languages and can be a
source of bugs, ranging from simple logical errors such as using a start-of-input assertion “^” at a position where
it cannot possibly match to more complex problems such as regular expressions that are prone to exponential
backtracking, which can leave an application vulnerable to ReDoS attacks (Kirrage et al. 2013).

Admittedly, parsing regular expressions is not usually thought of as an analysis task, and certainly not a
problem to be solved with Datalog. In languages with built-in regular expression literals such as JavaScript or
Perl, the extractor can easily parse them and store an AST representation in the database. In other languages,
however, regular expression support is provided by a library, so the extractor cannot easily know which string
literals should be interpreted as regular expressions. In fact, in a dynamically typed language such as Python it
may take non-trivial points-to analysis to detect calls to the regular expression library in the first place.

With algebraic data types, we can implement the parser in QL instead, with its input coming either from an
extensional database relation or from some ancillary analysis; in fact, the parser could even be recursive with the
analysis computing its input, if desired. In what follows, we assume that RegExp is a suitably defined QL class
comprising all strings that may represent regular expressions.

As with the other examples, we only discuss a small part of the implementation in detail and refer to our
website (Semmle 2017b) for the full version. We restrict ourselves to those regular expressions described by the
context-free grammar in Figure 8, comprising alternation, concatenation, Kleene star and grouping. The terminal
symbol plainchar represents any character other than the operators “|”, “*”, “(” and “)”; each plainchar
represents itself, except for the anchors “^” and “$” which represent start and end of input, respectively. As
a matter of terminology, we will call a rule whose right hand side consists of a single non-terminal, such as
alt ::= cat, a chain rule, and a more complex rule a production rule.

Implementing a recogniser, that is, a parser that does not produce any output, is straightforward: for each
non-terminal $n$ with rules $r_1, \ldots, r_m$, define ternary predicates $r_i(s, b, e)$ corresponding to the rules and another
ternary predicate $n(s, b, e)$ corresponding to the non-terminal itself, which is simply the disjunction of the rule
predicates. Each rule predicate $r_i(s, b, e)$ is defined in such a way that it holds if the string $s$, which is the input to
be parsed, contains a substring from index $b$ (inclusive) to index $e$ (exclusive) that can be derived using rule $r_i$.
For example, the rule alt ::= cat "|" alt is implemented as a Datalog predicate dis:

    predicate dis(RegExp s, int b, int e) { exists(int m | cat(s, b, m) and s.charAt(m) = "|" and alt(s, m+1, e)) }

This rule forms one disjunct of predicate alt, which represents the non-terminal of the same name:

    predicate alt(RegExp s, int b, int e) { dis(s, b, e) or cat(s, b, e) }

Predicates for the other non-terminals can be defined similarly, yielding a recogniser that checks whether a
regular expression can be parsed according to our grammar. However, it does not reify the results of the parse in
any useful way, and hence is of limited use for further analysis.

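For readers who want to experiment with the recogniser outside QL, here is a minimal Python sketch of the same idea: one boolean function per non-terminal over a span (s, b, e), with dis as the rule predicate for alternation. Note the differences, which are assumptions of this sketch rather than part of the paper: Python evaluates the predicates top-down with memoisation (functools.lru_cache) instead of the bottom-up fixpoint a Datalog engine would compute, and the helper name recognises is hypothetical.

```python
from functools import lru_cache

OPERATORS = set("|*()")  # every other character is a plainchar

@lru_cache(maxsize=None)
def atom(s, b, e):
    # atom ::= plainchar | "(" alt ")"
    if e - b == 1:
        return s[b] not in OPERATORS
    return e - b >= 3 and s[b] == "(" and s[e - 1] == ")" and alt(s, b + 1, e - 1)

@lru_cache(maxsize=None)
def term(s, b, e):
    # term ::= atom "*" | atom
    return atom(s, b, e) or (e - b >= 2 and s[e - 1] == "*" and atom(s, b, e - 1))

@lru_cache(maxsize=None)
def cat(s, b, e):
    # cat ::= term cat | term
    return term(s, b, e) or any(term(s, b, m) and cat(s, m, e) for m in range(b + 1, e))

@lru_cache(maxsize=None)
def dis(s, b, e):
    # alt ::= cat "|" alt, with the "|" at index m
    return any(cat(s, b, m) and s[m] == "|" and alt(s, m + 1, e)
               for m in range(b + 1, e - 1))

@lru_cache(maxsize=None)
def alt(s, b, e):
    # the non-terminal is the disjunction of its rule predicates
    return dis(s, b, e) or cat(s, b, e)

def recognises(s):
    """True if the whole string s can be derived from alt."""
    return alt(s, 0, len(s))
```

For example, recognises("(a|b)*c") holds, while recognises("a||b") does not, because the grammar has no rule deriving the empty string between the two bars. As in the QL version, the result is a plain yes/no answer with no record of how the string was parsed.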
We convert it to a proper parser in five steps:

(1) Turn the predicates for production rules into branches of a new data type Pattern like this:

        newtype Pattern =
          MkDis(RegExp s, int b, int e) { exists(int m | cat(s, b, m) and s.charAt(m) = "|" and alt(s, m+1, e)) } or ...

(2) Introduce a corresponding (concrete) QL class with a predicate at for accessing s, b and e:

        class Dis extends MkDis { predicate at(RegExp s, int b, int e) { this = MkDis(s, b, e) } }

(3) For each non-terminal, introduce an abstract class that declares the at predicate and is extended both
    by the concrete classes corresponding to its production rules and the abstract classes corresponding to
    its chain rules. In our example, we introduce abstract classes Alt and Cat such that Cat and Dis both
    extend Alt, making Alt the union of Cat and Dis.

(4) Each call to a non-terminal n in a production rule predicate is replaced by a call to x.at, where x is an
    existentially quantified variable of type n:

        MkDis(RegExp s, int b, int e) {
          exists(Cat l, int m, Alt r | l.at(s, b, m) and s.charAt(m) = "|" and r.at(s, m+1, e))
        }

    Note how the declared types of l and r enforce the correct precedence and associativity: l is restricted
    to be an element of Cat and not, say, of Dis, ensuring that alternation is given lower precedence than
    concatenation, and associates to the right.

(5) Promote the existentially quantified variables thus introduced to parameters of the branch predicate and
    define getters for them on the QL class, turning it into a full-fledged AST class:

        newtype Pattern =
          MkDis(RegExp s, int b, int e, Cat l, Alt r) {
            exists(int m | l.at(s, b, m) and s.charAt(m) = "|" and r.at(s, m+1, e))
          } or ...

        class Dis extends MkDis, Alt {
          predicate at(RegExp s, int b, int e) { this = MkDis(s, b, e, _, _) }
          Cat getLeft() { this = MkDis(_, _, _, result, _) }
          Alt getRight() { this = MkDis(_, _, _, _, result) }
        }

Apart from the classes introduced in this way, we can define additional classes to model semantic properties;
for instance, anchors are different from other atoms, so we might want to subclass the class CharAtom (which we
assume has a member predicate getText to extract its source text) as follows:

    abstract class Anchor extends CharAtom {}
    class Caret extends Anchor { Caret() { getText() = "^" } }
    class Dollar extends Anchor { Dollar() { getText() = "$" } }

This refinement will be useful in the example analysis we discuss next: assume that we want to find caret
assertions which are preceded by at least one term that cannot match the empty string, and hence are unmatchable.

Starting with the latter condition, we define a member predicate isNullable that holds for those Patterns that
can match the empty string. For example, a disjunction is nullable if either of its children is, so Dis.isNullable
is defined as getLeft().isNullable() or getRight().isNullable().

Character atoms are not nullable, except for anchors. This is elegantly expressed by overriding isNullable
once in CharAtom with body none() to model the fact that most character atoms are not nullable, and then again
in Anchor with body any() to model the fact that anchors are an exception.

Now we define a recursive predicate pred(l, r) that holds if l is matched immediately before r:

    predicate pred(Pattern l, Pattern r) {
      exists(Seq s | l = s.getLeft() and r = s.getRight()) or
      exists(Dis d | pred(l, d) and (r = d.getLeft() or r = d.getRight())) or
      exists(Group g | pred(l, g) and r = g.getBody())
    }

The unmatchable caret assertions can now be identified by looking for Caret nodes that are transitively
preceded by a non-nullable pattern (note that in QL p+ denotes the transitive closure of predicate p):

    from Caret c, Pattern p
    where pred+(p, c) and not p.isNullable()
    select c, "This assertion can never match."

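The analysis above can be replayed on a hand-built AST in Python. The sketch below is illustrative only: the class names mirror those in the text, is_nullable is overridden in CharAtom and Anchor exactly as described, and pred is computed as the least fixpoint of the three rules shown. The nullability rules for Star, Group and Seq are the standard ones and are assumptions here, since the text only spells out the rule for Dis; the Binary helper class and the function names are hypothetical.

```python
class Pattern:
    def is_nullable(self):
        raise NotImplementedError
    def children(self):
        return ()

class CharAtom(Pattern):
    def __init__(self, text):
        self.text = text
    def is_nullable(self):
        return False               # most character atoms consume a character

class Anchor(CharAtom):
    def is_nullable(self):
        return True                # anchors are the exception

class Caret(Anchor):
    def __init__(self):
        super().__init__("^")

class Star(Pattern):
    def __init__(self, body):
        self.body = body
    def is_nullable(self):
        return True                # assumed: zero repetitions match the empty string
    def children(self):
        return (self.body,)

class Group(Pattern):
    def __init__(self, body):
        self.body = body
    def is_nullable(self):
        return self.body.is_nullable()   # assumed
    def children(self):
        return (self.body,)

class Binary(Pattern):             # hypothetical helper for two-child nodes
    def __init__(self, left, right):
        self.left, self.right = left, right
    def children(self):
        return (self.left, self.right)

class Seq(Binary):
    def is_nullable(self):
        return self.left.is_nullable() and self.right.is_nullable()   # assumed

class Dis(Binary):
    def is_nullable(self):
        return self.left.is_nullable() or self.right.is_nullable()

def descendants(p):
    yield p
    for child in p.children():
        yield from descendants(child)

def pred(root):
    """Least fixpoint of the three rules of predicate pred."""
    rel = {(s.left, s.right) for s in descendants(root) if isinstance(s, Seq)}
    changed = True
    while changed:
        changed = False
        for l, r in list(rel):
            if isinstance(r, Dis):
                new = {(l, r.left), (l, r.right)}
            elif isinstance(r, Group):
                new = {(l, r.body)}
            else:
                new = set()
            if not new <= rel:
                rel |= new
                changed = True
    return rel

def transitive_closure(rel):
    closure = set(rel)
    while True:
        extra = {(a, d) for a, b in closure for c, d in closure if b is c} - closure
        if not extra:
            return closure
        closure |= extra

def unmatchable_carets(root):
    """Carets transitively preceded by a non-nullable pattern (pred+ in the query)."""
    return {c for p, c in transitive_closure(pred(root))
            if isinstance(c, Caret) and not p.is_nullable()}

# a(^|b): the caret inside the group is preceded by the non-nullable atom "a"
inner = Caret()
flagged = unmatchable_carets(Seq(CharAtom("a"), Group(Dis(inner, CharAtom("b")))))
# ^a: a leading caret is fine
leading = unmatchable_carets(Seq(Caret(), CharAtom("a")))
```

Nodes are compared by identity, so two occurrences of the same character in a pattern remain distinct AST nodes, much as distinct (s, b, e) positions yield distinct Pattern values in the QL encoding.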
As it stands, our parser is quite inefficient, since it effectively uses a brute-force bottom-up approach that wastes
a lot of time building partial ASTs that are not part of a successful parse. To improve performance, the b and e
parameters of the rule predicates have to be restricted to candidates where a successful parse is possible. Kanazawa
(2007) observed that this can be achieved by applying the well-known magic sets transformation (Abiteboul et al.
1995) to push calling contexts into predicates. This transforms what is essentially a CYK parser into an Earley
parser.

Based on these techniques, the miniature parser outlined above can be extended to a full-fledged parser for
JavaScript regular expressions, totalling about 600 LoC.

7 RELATED WORK

As mentioned in the introduction, LogiQL’s constructor predicates (LogicBlox 2017) are closely related to our
algebraic data types. Essentially, they provide a non-recursive variant of tuple numbering. Multiple constructor
predicates can contribute values to the same type, so there is no need for a two-stage encoding like the one we
have presented. Constructor predicates are heavily used by Doop (Bravenboer and Smaragdakis 2009), a points-to
analysis framework for Java implemented in LogiQL. The absence of recursion, however, makes it impossible to
encode some of the more complex examples of algebraic data types shown in Section 6; in particular, Doop does
not support CPA, and we conjecture that it would need to make use of other extra-logical language features of
LogiQL in order to implement it.

Tuple numbering can be viewed as an extension of Datalog with existential rules, where the head of the rule
may existentially quantify some of its variables (as opposed to normal rules, where each variable in the head
has to appear in the body at least once). Such rules were first studied in the database community to express
tuple-generating dependencies (Abiteboul et al. 1995), a very general class of integrity constraints on extensional
databases that allow asserting the existence of database entities based on logical conditions. For example, in a
database that encodes the AST of a Java program we might want to assert that for every entity representing an
if statement there is an entity representing its “then” branch. Tuple-generating dependencies have been the
subject of much theoretical investigation, chiefly concerned with problems such as repairing databases that fail
to satisfy a dependency, or optimising query execution based on known constraints. More recently, existential
rules have been studied in ontology-based reasoning (Baget et al. 2011; Calì et al. 2009). Besides a difference in
focus, the key difference between tuple numbering and more general existential rules is that the latter do not
require the existentially quantified variables to be instantiated with freshly constructed values. As such, they are
more general, but not suited for representing structured values.
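
To illustrate why freshness matters for structured values, the following Python sketch models tuple numbering as a table that hands out one fresh identifier per distinct tuple and returns the same identifier when the same tuple is seen again. This is only a mental model under our own assumptions: the class name TupleNumbering and its number method are hypothetical, and the sketch says nothing about how the operator is actually implemented or evaluated.

```python
class TupleNumbering:
    """Hypothetical model: an injective map from (branch, arguments) to fresh ids."""

    def __init__(self):
        self._ids = {}

    def number(self, branch, *args):
        key = (branch, args)
        if key not in self._ids:
            # A tuple not seen before gets a fresh identifier.
            self._ids[key] = len(self._ids)
        return self._ids[key]

tn = TupleNumbering()
nil = tn.number("Nil")
one = tn.number("Cons", 1, nil)
two = tn.number("Cons", 2, one)   # arguments may themselves be structured values
```

Because equal tuples always receive equal identifiers and distinct tuples distinct ones, the identifier can stand in for the structured value itself. A general existential rule gives no such guarantee, since any existing value may be chosen as the witness.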

Algebraic data types, first introduced in Hope (Burstall et al. 1980), have established themselves as an essential
feature of many functional programming languages, notably ML (Milner et al. 1997) and Haskell (Hudak et al.
2007). The OCaml dialect of ML (Doligez et al. 2016) additionally supports standard object-oriented features,
which, however, are orthogonal to algebraic data types. Scala (Odersky and Zenger 2005) also has both object
orientation and algebraic data types and uses the former to express the latter, viewing algebraic data types as
classes and expressing pattern matching through virtual dispatch similar to what we have shown in Section 3.
In QL, the two concepts are independent, but the great flexibility of classes makes it easy to combine them in
fruitful ways, as we have shown in our examples.

We have mentioned above that our algebraic data types do not support polymorphism, which is offered by
both Haskell and ML. Haskell supports further extensions like higher-kinded type parameters and generalised
algebraic data types that bring it a step closer to the very powerful inductive data types supported by theorem
provers such as Coq (Bertot and Castéran 2004) or Agda (Norell 2008). These data types offer not only a very
general notion of polymorphism, but also dependent types, which gives them much of the flexibility that QL’s
branch bodies provide, but with stronger static guarantees. Agda in particular has pioneered the practical use of
induction-recursion (Dybjer 2000), whereby the definition of an inductive data type uses a recursively defined
function over that same data type. In QL, such recursion between data types and other predicates is a natural
consequence of viewing types as unary predicates.

We are not aware of any previous work on adding algebraic data types to Datalog, but several other logic
programming languages such as SWI Prolog (Wielemaker et al. 2012) and hybrid functional-logic languages
such as Mercury (Somogyi et al. 1994) and Datafun (Arntzenius and Krishnaswami 2016) do have support for
them. Their notion of types, however, embodies the more usual notion of types as meta-level descriptions of
sets, rather than QL’s view of types as unary predicates defined within the language itself. Datafun uses types to
track monotonicity and establish finiteness of predicates, so programs can freely construct and use structured
values as long as they can be typed, thereby proving that only a finite number of values is constructed in any
given execution. SWI Prolog and Mercury, on the other hand, follow a Prolog-style semantics, which supports
arbitrary structured values.

As mentioned above, QL’s type system (Schäfer and de Moor 2010) can accommodate algebraic data types
without any changes, since it supports arbitrary types defined by unary predicates as long as their mutual
relationship can be described in terms of first-order statements such as inclusion and disjointness, which is the
case for our monomorphic algebraic data types. LogiQL’s type system (Zook et al. 2009) also supports inclusion
constraints between types, but not disjointness, which is important for algebraic data types.

8 CONCLUSION

We have presented an extension of QL with monomorphic algebraic data types that allow programs to work with
structured values, which was previously impossible. Like their counterparts in functional languages, these types
offer a flexible way of describing tree-structured data using disjoint union and tupling, and they can be recursive.
Like other types in QL, algebraic data types are just unary predicates, so they can be recursive not just with each
other but with other predicates as well. While the new types are orthogonal to QL’s existing class system, the
two can be mixed freely, effortlessly combining object-oriented and algebraic programming idioms.

Algebraic data types can be implemented by compiling QL to plain Datalog extended with a tuple numbering
operator that manages the creation of fresh identifiers for structured values. This operator adds considerable
expressive power to Datalog, yet is easy to implement and interacts well with common optimisations.

Algebraic data types bring the simplicity and elegance of QL to bear on problems that previously were out of its
scope, as shown by the case studies. In future, we are particularly interested in further exploring their applications
in context-sensitive analyses like CPA and analysis support for DSLs like regular expressions.

REFERENCES

Serge Abiteboul, Richard Hull, and Victor Vianu. 1995. Foundations of Databases: The Logical Level. Addison-Wesley Longman, Boston, MA, USA.
Ole Agesen. 1995. The Cartesian Product Algorithm: Simple and Precise Type Inference Of Parametric Polymorphism. In ECOOP.
Molham Aref, Balder ten Cate, Todd J. Green, Benny Kimelfeld, Dan Olteanu, Emir Pasalic, Todd L. Veldhuizen, and Geoffrey Washburn. 2015. Design and Implementation of the LogicBlox System. In SIGMOD.
Michael Arntzenius and Neelakantan R. Krishnaswami. 2016. Datafun: a Functional Datalog. In ICFP.
Pavel Avgustinov, Oege de Moor, Michael Peyton Jones, and Max Schäfer. 2016. QL: Object-oriented Queries on Relational Data. In ECOOP.
Jean-François Baget, Michel Leclère, Marie-Laure Mugnier, and Eric Salvat. 2011. On Rules with Existential Variables: Walking the Decidability Line. Artificial Intelligence 175, 9-10 (2011).
Yves Bertot and Pierre Castéran. 2004. Interactive Theorem Proving and Program Development. Springer, Berlin Heidelberg.
Martin Bravenboer and Yannis Smaragdakis. 2009. Strictly Declarative Specification of Sophisticated Points-to Analyses. In OOPSLA.
R. M. Burstall, D. B. MacQueen, and D. T. Sannella. 1980. HOPE: An Experimental Applicative Language. In LFP.
Andrea Calì, Georg Gottlob, and Thomas Lukasiewicz. 2009. A General Datalog-Based Framework for Tractable Query Answering over Ontologies. In PODS.
Ron Cytron, Jeanne Ferrante, Barry K. Rosen, Mark N. Wegman, and F. Kenneth Zadeck. 1991. Efficiently Computing Static Single Assignment Form and the Control Dependence Graph. TOPLAS 13, 4 (Oct. 1991).
Damien Doligez, Alain Frisch, Jacques Garrigue, Didier Rémy, and Jérôme Vouillon. 2016. The OCaml System: Documentation and User’s Manual. (2016). http://caml.inria.fr/pub/docs/manual-ocaml/
Peter Dybjer. 2000. A General Formulation of Simultaneous Inductive-Recursive Definitions in Type Theory. Journal of Symbolic Logic 65, 2 (2000).
Manuel V. Hermenegildo, Francisco Bueno, Manuel Carro, Pedro López-García, Edison Mera, José F. Morales, and Germán Puebla. 2012. An overview of Ciao and its design philosophy. TPLP 12, 1-2 (2012).
Paul Hudak, John Hughes, Simon Peyton Jones, and Philip Wadler. 2007. A History of Haskell: Being Lazy with Class. In HOPL.
Simon Holm Jensen, Anders Møller, and Peter Thiemann. 2009. Type Analysis for JavaScript. In SAS.
Makoto Kanazawa. 2007. Parsing and Generation as Datalog Queries. In ACL.
Vineeth Kashyap, Kyle Dewey, Ethan A. Kuefner, John Wagner, Kevin Gibbons, John Sarracino, Ben Wiedermann, and Ben Hardekopf. 2014. JSAI: A Static Analysis Platform for JavaScript. In FSE.
James Kirrage, Asiri Rathnayake, and Hayo Thielecke. 2013. Static Analysis for Regular Expression Denial-of-Service Attacks. In NSS.
Ondrej Lhoták and Laurie J. Hendren. 2004. Jedd: A BDD-based relational extension of Java. In PLDI.
LogicBlox. 2017. LogicBlox 4 Reference Manual. (2017). https://developer.logicblox.com/content/docs4/core-reference/webhelp/
Robin Milner, Mads Tofte, and David Macqueen. 1997. The Definition of Standard ML. MIT Press, Cambridge, MA, USA.
Ulf Norell. 2008. Dependently Typed Programming in Agda. In AFP.
Martin Odersky and Matthias Zenger. 2005. Scalable Component Abstractions. In OOPSLA.
Changhee Park and Sukyoung Ryu. 2015. Scalable and Precise Static Analysis of JavaScript Applications via Loop-Sensitivity. In ECOOP.
Max Schäfer and Oege de Moor. 2010. Type Inference for Datalog with Complex Type Hierarchies. In POPL.
Max Schäfer, Manu Sridharan, Julian Dolby, and Frank Tip. 2013. Dynamic Determinacy Analysis. In PLDI.
Semmle. 2017a. Code Exploration. (2017). https://semmle.com/products/semmle-ql
Semmle. 2017b. Publications Page. (2017). https://semmle.com/publications
Zoltan Somogyi, Fergus Henderson, and Thomas C. Conway. 1994. The Implementation of Mercury, an Efficient Purely Declarative Logic Programming Language. In ILPS.
Manu Sridharan, Julian Dolby, Satish Chandra, Max Schäfer, and Frank Tip. 2012. Correlation Tracking for Points-To Analysis of JavaScript. In ECOOP.
Terrance Swift and David S. Warren. 2012. XSB: Extending Prolog with Tabled Logic Programming. TPLP 12, 1-2 (2012).
John Whaley, Dzintars Avots, Michael Carbin, and Monica S. Lam. 2005. Using Datalog with Binary Decision Diagrams for Program Analysis. In APLAS.
Jan Wielemaker, Tom Schrijvers, Markus Triska, and Torbjörn Lager. 2012. SWI-Prolog. Theory and Practice of Logic Programming 12, 1-2 (2012).
Niklaus Wirth. 1976. Algorithms + Data Structures = Programs. Prentice Hall, Upper Saddle River, NJ, USA.
David Zook, Emir Pasalic, and Beata Sarna-Starosta. 2009. Typed Datalog. In PADL.

A COREQL

To make our presentation self-contained, we reproduce the definition of the syntax and semantics of CoreQL
from Avgustinov et al. (2016): Figure 9 gives the syntax, Figure 10 and Figure 11 the translation from CoreQL to
Datalog.

The following definitions establish notation used in the figures:

Definition A.1 (Relation specifiers). A relation specifier $C.p/n$ consists of a class name $C$ and a pair $p/n$, where $p$
is a predicate name and $n$ a natural number.

Definition A.2 (Subtyping). The subtyping relation $S <: T$ is the smallest relation such that for every class $C$ we
have $C <: C.\texttt{domain}$, and if $C$ extends $T$, then $C.\texttt{domain} <: T$.
As usual, $S <:^{+} T$ denotes the transitive closure of this relation.

Definition A.3 (Overriding). $C.p/n$ overrides $D.p/n$, written $C.p/n \prec D.p/n$, if $C <:^{+} D$. We write $C.p/n \preceq D.p/n$
to mean that either $C = D$ or $C.p/n \prec D.p/n$. If $D.p/n$ overrides no other member relation, it is a rootdef. We
write $\rho(C.p/n)$ for the set of all rootdefs $D.p/n$ such that $C.p/n \preceq D.p/n$.

Definition A.4 (Member predicate lookup). We define a lookup function $\lambda(S, p, n)$ that looks up a member
predicate in a type given a name and its arity and returns a set of candidates:

\[
\lambda(S, p, n) =
\begin{cases}
\{C.p/n\} & \text{if } S = C \text{ and } C.p/n \text{ is valid} \\
\bigcup_{S <: T} \lambda(T, p, n) & \text{otherwise}
\end{cases}
\]

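As a small worked illustration of Definitions A.2 and A.4, the Python sketch below computes the lookup function over a hypothetical class table. Two points are assumptions of the sketch: we read "C.p/n is valid" as "class C itself declares a member predicate p of arity n", and the table CLASSES with its classes A to E is invented for the example.

```python
# Hypothetical class table:
#   class name -> (types listed in its extends clause, declared member predicates as (name, arity))
CLASSES = {
    "A": (["@int"], {("foo", 0)}),
    "B": (["A"], set()),
    "C": (["A"], {("foo", 0)}),
    "D": (["B", "C"], set()),
    "E": (["B", "C"], {("foo", 0)}),
}

def direct_supertypes(s):
    """Direct supertypes of a type reference under Definition A.2."""
    if s.endswith(".domain"):
        # C.domain <: T for every T that C extends
        return CLASSES[s[: -len(".domain")]][0]
    if s in CLASSES:
        # C <: C.domain
        return [s + ".domain"]
    return []  # primitive types @b have no supertypes

def lookup(s, p, n):
    """The lookup function of Definition A.4, returning a set of relation specifiers."""
    if s in CLASSES and (p, n) in CLASSES[s][1]:
        return {(s, p, n)}
    result = set()
    for t in direct_supertypes(s):
        result |= lookup(t, p, n)
    return result
```

Here lookup("B", "foo", 0) yields the single candidate A.foo/0, whereas lookup("D", "foo", 0) yields both A.foo/0 (inherited via B) and C.foo/0, so class D inherits foo ambiguously. Class E resolves the ambiguity by declaring its own foo, and its lookup returns just E.foo/0. The recursion terminates as long as the subtyping relation is acyclic.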
Definition A.5 (Syntactic validity). In order for a Core QL program to be syntactically valid, the following
conditions have to be satisfied:

  • No two classes and no two toplevel predicates with the same arity may have the same name; no two
    member predicates of the same class with the same arity, and no two parameters of the same predicate
    may have the same name.
  • Every extends clause must list at least one type.
  • Every characteristic predicate must have the same name as its enclosing class.
  • No predicate parameter may have the name this.
  • For every variable name appearing in a formula, there must either be an enclosing exists declaring a
    variable of that name, or the enclosing predicate must have a parameter of that name, or the variable
    name is this and it appears in a member predicate or character. In particular, every variable name can be
    associated with a declared type.
  • Similarly, for every class name appearing in a type reference there must be a class of the same name, and
    for every predicate name appearing in a call to a toplevel predicate, there must be a toplevel predicate of
    that name with the appropriate arity.
  • super calls may only appear in member predicates.

Definition A.6 (Translatability). A syntactically valid Core QL program is translatable if the following conditions
are met:

  • It is not the case that $T <:^{+} T$ for some type $T$; that is, the subtyping relation is acyclic.
  • For every (not necessarily valid) relation specifier $C.p/n$, we have $|\lambda(C, p, n)| \le 1$; in other words, classes
    must override ambiguously inherited predicates.
  • For every member predicate call $x.p(\overline{y})$ where $x$ has type $T$ we have $\lambda(T, p, |\overline{y}|) \neq \emptyset$, i.e., all calls can be
    resolved to a static target.
  • Similarly, for every call $D.\texttt{super}.p(\overline{x})$ in a member predicate of a class $C$, we must have $C <:^{+} D$ and
    $\lambda(D, p, |\overline{x}|) \neq \emptyset$.

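\[
\begin{array}{rcll}
\mathit{prog} & ::= & \overline{cd}\ \overline{pd} & \text{program} \\
cd & ::= & \texttt{abstract}^{?}\ \texttt{class}\ C\ \texttt{extends}\ \overline{T}\ \{\, C()\ \{ f \}\ \overline{pd} \,\} & \text{class definition} \\
pd & ::= & \texttt{predicate}\ p(\overline{T\ x})\ \{ f \} & \text{predicate definition} \\
f, g & ::= & p(\overline{x}) \mid x.p(\overline{y}) \mid C.\texttt{super}.p(\overline{x}) \mid \texttt{not}\ f & \text{formula} \\
 & \mid & f\ \texttt{and}\ g \mid f\ \texttt{or}\ g \mid \texttt{exists}(T\ x\ \texttt{|}\ f) & \\
S, T & ::= & C \mid @b \mid C.\texttt{domain} & \text{type reference}
\end{array}
\]

Fig. 9. Syntax of Core QL; $\overline{\,\cdot\,}$ denotes (possibly empty) sequences, $\cdot^{?}$ optional elements
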
Translation of a class definition $cd \equiv \texttt{abstract}^{?}\ \texttt{class}\ C\ \texttt{extends}\ \overline{T}\ \{\, C()\ \{ f \}\ \overline{pd} \,\}$:

\[
T_c(cd) := \left\{
\begin{array}{l}
C.\texttt{domain}(this) \leftarrow \bigwedge_{C <: B} B.B(this) \land \bigwedge_{C <: @b} @b(this). \\
C.C(this) \leftarrow T_b(f, \langle this := C.\texttt{domain} \rangle). \\
C(this) \leftarrow K(cd). \\
T_m(pd_i, C)
\end{array}
\right.
\]

\[
K(cd) :=
\begin{cases}
\bigvee_{D <: C} D(this) & \text{if } cd \text{ is abstract} \\
C.C(this) & \text{if } cd \text{ is concrete}
\end{cases}
\]

Translation of a toplevel predicate definition $pd \equiv \texttt{predicate}\ p(\overline{T\ x})\ \{ f \}$:

\[
T_p(pd) := p(\overline{x}) \leftarrow T_b(f, \langle x_i := T_i \rangle).
\]

Translation of a member predicate definition $pd \equiv \texttt{predicate}\ p(\overline{T\ x})\ \{ f \}$:

\[
T_m(pd, C) := \left\{
\begin{array}{l}
C.p(this, \overline{x}) \leftarrow T_b(f, \langle this := C,\ x_i := T_i \rangle). \\
C.p_{disp}(this, \overline{x}) \leftarrow \bigl(\bigwedge_{D.p \prec C.p} \lnot D(this)\bigr) \land C.p(this, \overline{x}).
\end{array}
\right.
\]

Fig. 10. Translation from Core QL predicates to Datalog; for readability, we write $C <: T$ to mean $C.\texttt{domain} <: T$

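Translation of a predicate or character body $f$:

\[
T_b(f, \Gamma) := \Bigl(\bigwedge_{(x, S) \in \Gamma} S(x)\Bigr) \land T_f(f, \Gamma)
\]

Translation of a predicate call:

\[
\begin{array}{rcl}
T_f(p(\overline{x}), \Gamma) & := & p(\overline{x}) \\
T_f(x.p(\overline{y}), \Gamma) & := & \bigvee_{R.p \in \rho(D.p)} \Bigl(\bigvee_{B.p \preceq^{*} R.p} B.p_{disp}(x, \overline{y})\Bigr) \quad \text{where } D.p := \lambda(\Gamma(x), p, |\overline{y}|) \\
T_f(C.\texttt{super}.p(\overline{x}), \Gamma) & := & D.p(this, \overline{x}) \quad \text{where } D.p := \lambda(C, p, |\overline{x}|)
\end{array}
\]

Translation of other formulas:

\[
\begin{array}{rcl}
T_f(\texttt{not}\ f, \Gamma) & := & \lnot T_f(f, \Gamma) \\
T_f(f\ \texttt{and}\ g, \Gamma) & := & T_f(f, \Gamma) \land T_f(g, \Gamma) \\
T_f(f\ \texttt{or}\ g, \Gamma) & := & T_f(f, \Gamma) \lor T_f(g, \Gamma) \\
T_f(\texttt{exists}(C\ x\ \texttt{|}\ f), \Gamma) & := & \exists x : \bigl(C(x) \land T_f(f, \Gamma[x := C])\bigr)
\end{array}
\]

Fig. 11. Translation from Core QL formulas to Datalog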