One Parser To Rule Them All

Ali Afroozeh    Anastasia Izmaylova

Abstract

Despite the long history of research in parsing, constructing parsers for real programming languages remains a difficult and painful task. In the last decades, different parser generators emerged to allow the construction of parsers from a BNF-like specification. However, still today, many parsers are handwritten, or are only partly generated, and include various hacks to deal with different peculiarities in programming languages. The main problem is that current declarative syntax definition techniques are based on pure context-free grammars, while many constructs found in programming languages require context information.

In this paper we propose a parsing framework that embraces context information in its core. Our framework is based on data-dependent grammars, which extend context-free grammars with arbitrary computation, variable binding and constraints. We present an implementation of our framework on top of the Generalized LL (GLL) parsing algorithm, and show how common idioms in syntax of programming languages such as (1) lexical disambiguation filters, (2) operator precedence, (3) indentation-sensitive rules, and (4) conditional preprocessor directives can be mapped to data-dependent grammars. We demonstrate the initial experience with our framework, by parsing more than 20 000 Java, C#, Haskell, and OCaml source files.

Categories and Subject Descriptors D.3.1 [Programming Languages]: Formal Definitions and Theory—Syntax; D.3.4 [Programming Languages]: Processors—Parsing

Keywords Parsing, data-dependent grammars, GLL, disambiguation, operator precedence, offside rule, preprocessor directives, scannerless parsing, context-aware scanning

Copyright © ACM, 2015. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in Onward! '15, October 25–30, 2015, Pittsburgh, PA, USA. http://dx.doi.org/10.1145/2814228.2814242.

1. Introduction

Parsing is a well-researched topic in computer science, and it is common to hear from fellow researchers in the field of programming languages that parsing is a solved problem. This statement mostly originates from the success of Yacc [18] and its underlying theory that was developed in the 70s. Since Knuth's seminal paper on LR parsing [25], and DeRemer's work on practical LR parsing (LALR) [6], there is a linear parsing technique that covers most syntactic constructs in programming languages. Yacc, and its various ports to other languages, enabled the generation of efficient parsers from a BNF specification. Still, research papers and tools on parsing in the last four decades show an ongoing effort to develop new parsing techniques.

A central goal in research in parsing has been to enable language engineers (i.e., language designers and tool builders) to declaratively build a parser for a real programming language from an (E)BNF-like specification. Nevertheless, still today, many parsers are hand-written or are only partially generated and include many hacks to deal with peculiarities in programming languages. The reason is that grammars of programming languages in their simple and readable form are often not deterministic and also often ambiguous. Moreover, many constructs found in programming languages are not context-free, e.g., indentation rules in Haskell. Parser generators based on pure context-free grammars cannot natively deal with such constructs, and require ad-hoc extensions or hacks in the lexer. Therefore, additional means are necessary outside of the power of context-free grammars to address these issues.

General parsing algorithms [7, 33, 35] support all context-free grammars, therefore the language engineer is not limited by a specific deterministic class, and there are known declarative disambiguation constructs to address the problem of ambiguity in general parsing [12, 36, 38]. However, implementing disambiguation constructs is notoriously hard and requires thorough knowledge of the underlying parsing technology. This means that it is costly to declaratively build a parser for a given programming language in the wild if the required disambiguation constructs are not already supported. Perhaps surprisingly, examples of such languages are not only the legacy languages, but also modern languages such as Haskell, Python, OCaml and C#.
In this paper we propose a parsing framework that is able to deal with many challenges in parsing existing and new programming languages. We embrace the need for context information at runtime, and base our framework on data-dependent grammars [16]. Data-dependent grammars are an extension of context-free grammars that allows arbitrary computation, variable binding and constraints. These features allow us to simulate hand-written parsers and to implement disambiguation constructs.

To demonstrate the concept of data-dependent grammars we use the IMAP protocol [29]. In network protocol messages it is common to send the length of data before the actual data. In IMAP, these messages are called literals, and are described by the following (simplified) context-free rule:

    L8 ::= '~{' Number '}' Octets

Here Octets recognizes a list of octet (any 8-bit) values. An example of L8 is ~{6}aaaaaa. As can be seen, there is no data dependency in this context-free grammar, but the IMAP specification says that the number of Octets is determined by the value parsed by Number. Using data-dependent grammars, we can specify such a data dependency as:

    L8 ::= '~{' nm:Number {n=toInt(nm.yield)} '}' Octets(n)
    Octets(n) ::= [n > 0] Octets(n - 1) Octet
                | [n == 0] ε

In the data-dependent version, nm provides access to the value parsed by Number. We retrieve the substring of the input parsed by Number via nm.yield, which is converted to an integer using toInt. This integer value is bound to variable n, and is passed to Octets. Octets takes an argument that specifies the number of iterations. Conditions [n > 0] and [n == 0] specify which alternative is selected at each iteration.

It is possible to parse IMAP using a general parser, and then remove the derivations that violate data dependencies post parse. However, such an approach would be slow. Without enforcing the dependency on the length of Octets during parsing, given the nondeterministic nature of general parsing, all possible lengths of Octets will be tried.

There are many common grammar and disambiguation idioms that can be desugared into data-dependent grammars. Examples of these idioms are operator precedence, longest match, and the offside rule. Expecting the language engineer to write low-level data-dependent grammars for such cases would be wasteful. Instead, we describe a number of such idioms, and provide high-level notation for them and their desugaring to data-dependent grammars. For example, using our high-level notation the indentation rules in Haskell can be expressed as follows:

    Decls ::= align (offside Decl)*
            | ignore('{' Decl (';' Decl)* '}')

This definition clearly and concisely specifies that either all declarations in the list are aligned, and each Decl is offsided with regard to its first token (first alternative), or indentation is ignored inside curly braces (second alternative).

Our vision is a parsing framework that provides the right level of abstraction for both the language engineer, who designs new languages, and the tool builder, who needs a parsing technology as part of her toolset. From the language engineer's perspective, our parsing framework provides an out-of-the-box set of high-level constructs for most common idioms in syntax of programming languages. The language engineer can also always express her needs directly using data-dependent grammars. From the tool builder's perspective, our framework provides open, extensible means to define higher-level syntactic notation, without requiring knowledge of the internal workings of a parsing technology.

The contributions of this paper are:

• We provide a unified perspective on many important challenges in parsing real programming languages.
• We present several high-level syntactic constructs, and their mappings to data-dependent grammars.
• We provide an implementation of data-dependent grammars on top of Generalized LL (GLL) parsing [33] that runs over ATN grammars [40]. The implementation is part of the Iguana parsing framework¹ [2].
• We demonstrate the initial results of our parsing framework, by parsing 20 363 real source files of Java, C#, Haskell (91% success rate), and excerpts from OCaml.

The rest of this paper is organized as follows. In Section 2 we describe the landscape of parsing programming languages. In Section 3 we present data-dependent grammars, our high-level syntactic notation, and the mapping to data-dependent grammars. Section 4 discusses the extension of GLL parsing with data dependency. In Section 5 we demonstrate the initial results of our parsing framework using grammars of real programming languages. We discuss related work in Section 6. A conclusion and discussion of future work is given in Section 7.

¹ https://github.com/iguana-parser

2. The Landscape of Parsing Programming Languages

In this section we discuss well-known features of programming languages that make them hard to parse. These features motivate our design decisions.

2.1 General Parsing for Programming Languages

Grammars of programming languages in their natural form are not deterministic, and are often ambiguous. A well-known example is the if-then-else construct found in many programming languages. This construct, when written in its natural form, is ambiguous (the dangling-else ambiguity), and therefore cannot be deterministic. Some nondeterministic (and ambiguous) syntactic constructs, such as if-then-else, can be rewritten to be deterministic and unambiguous. However, such grammar rewriting in general is not trivial, the resulting grammar is hard to read, maintain and evolve, and the resulting parse trees are different from the original ones the grammar writer had in mind.

Instead of rewriting a grammar, it is common to use an ambiguous grammar, and rely on some implicit behavior of a parsing technology for disambiguation. For example, the dangling-else ambiguity is often resolved using a longest match scheme provided by the underlying parsing technology. Relying on implicit behavior of a parsing technology to achieve determinism can make it quite difficult to reason about the accepted language. Seemingly correct sentences may be rejected by the parser because at a nondeterministic point, a wrong path was chosen. For example, Yacc is an LALR parser generator, but can accept any context-free grammar by automatically resolving all shift/reduce and reduce/reduce conflicts. Using Yacc, the language engineer should manually check the resolved conflicts in case of unexpected behavior.

A common theme in research in parsing has been to increase the recognition power of deterministic parsing techniques such as LL(k) or LR(k). One of the widely used general parsing techniques for programming languages is the Generalized LR (GLR) algorithm [35]. GLR parsers support all context-free grammars and can produce a parse forest containing all derivation trees in the form of a Shared Packed Parse Forest (SPPF) in cubic time and space [34]. Note that the cubic bound is for the worst-case, highly ambiguous grammars. As GLR is a generalization of LR, a GLR parser runs linearly on LR parts of the grammar, and as the grammars of real programming languages are in most parts near-deterministic, one can expect near-linear performance using GLR for parsing programming languages. GLR parsing has successfully been used in source code analysis and developing domain-specific languages [12, 22].

General parsing enables the language engineer to use the most natural version of a grammar, but leaves open the problem of ambiguity. In declarative syntax definition [12, 23], it is common to use declarative disambiguation constructs, e.g., for operator precedence or the longest match. As a general parser is able to return all ambiguities in the form of a parse forest, it is possible to apply the disambiguation rules post-parse, removing the undesired derivations from the parse forest. However, such post-parse disambiguation is not practical in cases where the grammar is highly ambiguous. For example, parsing expression grammars without applying operator precedence during parsing is only feasible for small inputs. Therefore, it is necessary to resolve ambiguity while parsing to achieve near-linear performance.

Implementing disambiguation mechanisms that are executed during parsing is difficult. This is because the implementation of such disambiguation mechanisms requires knowledge of the internal workings of a parsing technology. Therefore, the choice of the general parsing technology becomes very important when considering parsing programming languages. For example, GLR parsers operate on LR automata, and have a rather complicated execution model, as a parsing state corresponds to multiple grammar positions.

The Generalized LL (GLL) parsing algorithm [33] is a new generalization of recursive-descent parsing that supports all context-free grammars, including left-recursive ones. GLL parsers produce a parse forest in cubic time and space in the worst case, and are linear on LL parts of the grammar. GLL parsers are attractive because they have the close relationship with the grammar that recursive-descent parsers have. From the end user's perspective, GLL parsers can produce better error messages, and can be debugged in a programming language IDE.

To deal with left-recursive rules and to keep the cubic bound, a GLL parser uses a graph-structured stack (GSS) to handle multiple call stacks. While the execution model of a GLL parser is close to recursive-descent parsing, the underlying machinery is much more complicated, and still an in-depth knowledge of GLL is required to implement disambiguation constructs. In this paper, we propose a parser-independent framework for parsing programming languages based on data-dependent grammars. We use GLL parsing as the basis for our data-dependent parsing framework, as it allows an intuitive way to implement components of data-dependent grammars, such as environment threading, and enables an implementation that is very close to the stack-evaluation based semantics of data-dependent grammars [16].

2.2 On the Interaction between Lexer and Parser

Conventional parsing techniques use a separate lexing phase before parsing to transform a stream of characters into a stream of tokens. In particular, whitespace and comments are discarded by the lexer to reduce the amount of lookahead in the parsing phase, and to enable deterministic parsing.

The main problem with a separate lexing phase is that without having access to the parsing context, i.e., the applicable grammar rules, the lexer cannot unambiguously determine the type of some tokens. An example is >>, which can either be parsed as a right shift operator, or as two closing angle brackets of a generic type, e.g., List<List<String>> in Java. Some handwritten parsers deal with this issue by rewriting the token stream. For example, when the javac parser reads a >> token and is in a parsing state that expects only one >, e.g., when matching the closing angle bracket of a generic type, it only consumes the first > and puts the second one back to prevent a parse error when matching the next angle bracket.

To resolve the problems of a separate lexing phase, we need to expose the parsing context to the lexer. To achieve this, the separate lexing phase is abandoned, and the lexing phase is effectively integrated into the parsing phase. We call this model single-phase parsing. There are two options to achieve single-phase parsing. The first option is called scannerless parsing [32, 38], where lexical definitions are treated as context-free rules. In scannerless parsing, grammars are defined down to the level of characters. The second option is context-aware scanning [37], where the parser calls the lexer on demand. At each parsing state, the lexer is called with the expected set of terminals at that state.

In almost all modern programming languages longest match (maximal munch) is applied, and keywords are excluded from being recognized as identifiers. These disambiguation rules are conventionally embedded in the lexer. In single-phase parsing—scannerless or context-aware—longest match and keyword exclusion have to be applied during parsing, by using lexical disambiguation filters such as follow restrictions [32, 36]. These disambiguation filters have parser-specific implementations [2, 38]. In Section 3 we show how these filters can be mapped to data-dependent grammars. Also note that although a context-aware scanner employs longest match, for example by implementing the Kleene star (*) as a greedy operator, in some cases we still need to use explicit disambiguation filters, see Section 3.2.

2.3 Operator Precedence

Expressions are an integral part of virtually every programming language. In reference manuals of programming languages it is common to specify the semantics of expressions using the priority and associativity of operators. However, the implementation of expression grammars can considerably deviate from such a precedence specification.

It is possible to encode operator precedence by rewriting the grammar: a new nonterminal is created for each precedence level. The rewriting is not trivial for real programming languages, and the resulting grammar becomes large. This rewriting is particularly problematic in parsing techniques that do not support left recursion. The left-recursion removal transformation disfigures the grammar and adds extra complexity in transforming the trees to the intended ones. Figure 1 shows three versions of the same expression grammar.

    E ::= '-' E        E ::= E '+' T        E  ::= T E1
      | E '*' E          | T                E1 ::= '+' T E1 | ε
      | E '+' E        T ::= T '*' F       T  ::= F T1
      | 'a'              | F                T1 ::= '*' F T1 | ε
                       F ::= '-' F         F  ::= '-' F
                         | 'a'               | 'a'

Figure 1. Three grammars that accept the same language: the natural, ambiguous grammar (left), the grammar with precedence encoding (middle), and the grammar after left-recursion removal (right).

In the 70s, Aho et al. [4] presented a technique in which a parser is constructed from an ambiguous expression grammar accompanied with a set of precedence rules. This work can be seen as the starting point for declarative disambiguation using operator precedence rules. Aho et al.'s approach is implemented by modifying LALR parse tables to resolve shift/reduce conflicts based on the operator precedence. However, the semantics of operator precedence in this approach is bound to the internal workings of LR parsing. There have been other solutions to build parsers from declarative operator precedence, which we discuss in Section 6. In Section 3.4 we provide a mapping from operator precedence rules to data-dependent grammars.

2.4 Offside Rule

In most programming languages, indentation of code blocks does not play a role in the syntactic structure. Rather, explicit delimiters, such as begin and end or { and }, are used to specify blocks of statements. Landin introduced the offside rule [26], which serves as a basis for indentation-sensitive languages. The offside rule says that all the tokens of an expression should be indented to the right of the first token. Haskell and Python are two examples of popular programming languages that use a variation of the offside rule.

Figure 2 shows two examples of the offside rule in Haskell.

    f x = case x of          g x = case x of
            0 -> 1                   0 -> 1
            _ -> do                  _ -> x + 2
                   let y = 2              + 4
                   y + z
      where z = 3            g x = case x of
                                     0 -> 1
                                     _ -> x + 2
                               + 4

Figure 2. Examples of indentation rules in Haskell.

The keywords do, let, of, and where signal the start of a block where the starting tokens of the statements should be aligned, and each statement should be offsided with regard to its first token. In Figure 2 (left), case has two alternatives which are aligned, and the second alternative that spans several lines is offsided with regard to its first token, i.e., _. Figure 2 (right) shows two examples that look the same, but the indentation of the last part, + 4, is different. In the top declaration + 4 belongs to the last alternative, but in the bottom declaration, + 4 belongs to the expression on the right-hand side of =.

Indentation sensitivity in programming languages cannot be expressed by pure context-free grammars, and has often been implemented by hacks in the lexer. For example, in Haskell and Python, indentation is dealt with in the lexing phase, and the context-free part is written as if no indentation sensitivity exists. Both GHC and CPython, the popular implementations of Haskell and Python, use LALR parser generators. In Python, the lexer maintains a stack and emits INDENT and DEDENT tokens when indentation changes. In Haskell, the lexer translates indentation information into curly braces and semicolons based on the rules specified by the L function [28].

In Section 3.5 we show how data-dependent grammars can be used for single-phase parsing of indentation-sensitive programming languages in a declarative way. As data-dependent grammars are rather low-level for such solutions, we introduce three high-level constructs: align, offside, and ignore, which are desugared to data-dependent grammars.
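The stack-based INDENT/DEDENT scheme used by Python's lexer can be sketched as follows. This is a minimal illustration under our own simplifications (the function name is hypothetical, blank lines are skipped, and tabs, bracket continuation, and inconsistent dedents are not handled, unlike CPython's actual tokenizer):

```python
def indent_tokens(lines):
    """Emit INDENT/DEDENT tokens derived from leading whitespace,
    using a stack of currently open indentation levels."""
    stack = [0]                   # indentation levels currently open
    tokens = []
    for line in lines:
        if not line.strip():      # skip blank lines
            continue
        indent = len(line) - len(line.lstrip(' '))
        if indent > stack[-1]:    # deeper indentation: open a block
            stack.append(indent)
            tokens.append('INDENT')
        while indent < stack[-1]: # shallower: close enclosing blocks
            stack.pop()
            tokens.append('DEDENT')
        tokens.append(('LINE', line.strip()))
    while stack[-1] > 0:          # close blocks still open at EOF
        stack.pop()
        tokens.append('DEDENT')
    return tokens

print(indent_tokens(["if x:", "    y = 1", "z = 2"]))
```

With this token stream, the context-free part of the grammar can treat INDENT and DEDENT like opening and closing braces, which is exactly why the indentation hack lives in the lexer.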
    void test()
    {
    #if Debug
        System.Console.WriteLine("Debug")
    }
    #else
    }
    #endif

    #if X
    /*
    #else
    /* */ class Q { }
    #endif

Figure 3. Problematic cases of using C# directives [30].

    static void Main() {
        System.Console.WriteLine(@"hello,
    #if Debug
    world
    #else
    Nebraska
    #endif
    ");
    }

Figure 4. C# multi-line string containing directives [30].
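Figure 4 shows why directives cannot be handled by a simple line-oriented pass: a line starting with # may lie inside a verbatim string literal (or, as in Figure 3, inside a comment), where it is ordinary text. The following sketch (our own illustration, not part of the framework) shows a naive scanner flagging the lines inside the string of Figure 4 as directives:

```python
# Naive line-based directive detection, oblivious to string literals.
# On the Figure 4 snippet it reports the #if/#else/#endif that occur
# INSIDE the verbatim string literal, which are all false positives.
SOURCE = '''static void Main() {
  System.Console.WriteLine(@"hello,
#if Debug
world
#else
Nebraska
#endif
");
}'''

def naive_directives(src):
    return [line for line in src.splitlines()
            if line.lstrip().startswith('#')]

print(naive_directives(SOURCE))  # ['#if Debug', '#else', '#endif']
```

Deciding correctly requires knowing whether the current position is inside a string or comment, i.e., exactly the kind of context information a single-phase, context-aware parser has.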
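Before the formal description of data-dependent grammars, it is instructive to see what the Octets example from the introduction corresponds to in a hand-written parser: read the length, then consume exactly that many octets. A minimal sketch (our own illustration; the function name is hypothetical):

```python
def parse_literal(s, i=0):
    """Hand-written equivalent of
       L8 ::= '~{' nm:Number {n=toInt(nm.yield)} '}' Octets(n):
    parse the length, then consume exactly n octets."""
    assert s.startswith('~{', i)
    i += 2
    j = i
    while j < len(s) and s[j].isdigit():  # Number
        j += 1
    n = int(s[i:j])                       # {n = toInt(nm.yield)}
    assert s[j] == '}'
    j += 1
    octets = s[j:j + n]                   # Octets(n): exactly n iterations
    if len(octets) != n:
        raise ValueError('input too short for declared length')
    return octets, j + n

print(parse_literal('~{6}aaaaaa'))  # ('aaaaaa', 10)
```

The data-dependent rule expresses the same dependency declaratively: the constraints [n > 0] and [n == 0] replace the explicit loop bound.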
Data-dependent grammars introduce parametrized nonterminals, arbitrary computation via an expression language, constraints, and variable binding. Here, we assume that the expression language e is a simple functional programming language with immutable values and no side effects. In a data-dependent grammar a rule is of the form A(p) ::= α, where p is a formal parameter of A. Here, for simplicity of presentation and without loss of generality, we assume that a nonterminal can have at most one parameter. The body of a rule, α, can now contain the following additional symbols:

• x = l : A(e) is a labeled call to A with argument e, label l, and variable x bound to the value returned by A(e);
• l : a is a labeled terminal a with label l;
• [e] is a constraint;
• {x = e} is a variable binding;
• {e} is a return expression (only as the last symbol in α);
• e ? α : β is a conditional selection.

The symbols above are presented in their general forms. For example, labels, variables to hold return values, and return expressions are optional.

Our data-dependent grammars are very similar to the ones introduced in [16], with four additions. First, terminals and nonterminals can be labeled, and labels refer to properties associated with the result of parsing a terminal or nonterminal. These properties are the start input index (or left extent), the end input index (or right extent), and the parsed substring. Properties can be accessed using dot notation, e.g., for labeled nonterminal b : B, b.l gives the left extent, b.r the right extent, and b.yield the substring.

Second, nonterminals can return arbitrary values (return expressions) which can be bound to variables. In several cases, we found this feature very practical, as we could express data dependency without changing the shape of the original specification grammar. Specifically, cases where a global table needs to be maintained along a parse (C# conditional directives discussed in Section 3.6 and C typedef declarations in Section 3.7), or where semantic information needs to be propagated upwards from a complicated syntactic structure (Declarator of the C grammar in Section 3.7). In some cases a data-dependent grammar that uses return values can be rewritten to one without return values. However, in general, whether return values enlarge the class of languages expressible with the original data-dependent grammars is an open question for future work.

Third, we support regular expression operators (EBNF constructs) *, +, and ? by desugaring them to data-dependent rules as follows: A* ::= A+ | ε; A+ ::= A+ A | A; and A? ::= A | ε. In the data-dependent setting, this translation must also account for variable binding. For example, if symbol ([e] A)* appears in a rule, and x is a free variable in e, captured from the scope of the rule, our translation lifts this variable, introducing a parameter x to the new nonterminal. In addition, EBNF constructs introduce new scopes: variables declared inside an EBNF construct, e.g., (l : A [e])*, are not visible outside, e.g., in the rule that uses it.

Finally, we also introduce a conditional selection symbol e ? α : β, which selects α if e evaluates to true, otherwise β, i.e., it introduces deterministic choice. Similar to EBNF constructs, we implement conditional selection by desugaring it into a data-dependent grammar. For example, A ::= α (e ? X : Y) β is translated to A ::= α C(e) β, where C(b) ::= [b] X | [!b] Y. We illustrate the use of conditional selection when discussing C# directives in Section 3.6.

3.2 Single-phase Parsing Strategy

We implement our data-dependent grammars on top of the Generalized LL (GLL) parsing algorithm [33]. As general parsers can deal with any context-free grammar, lexical definitions can be specified down to the level of characters. For example, Comment in the C# specification [30] is defined as:

    Comment ::= SingleLineComment | DelimitedComment
    SingleLineComment ::= "//" InputCharacter*
    InputCharacter ::= ![\r \n]
    DelimitedComment ::= "/*" DelimitedCommentSect* [*]+ "/"
    DelimitedCommentSect ::= "/" | [*]* NotSlashOrAsterisk
    NotSlashOrAsterisk ::= ![/ *]

Such character-level grammars, however, lead to very large parse forests. These parse forests reflect the full structure of lexical definitions, which is not needed in most cases. We provide the option to use an on-demand context-aware scanner, where terminals are defined using regular expressions. For example, Comment in C# can be compiled to a regular expression. In cases where the structure is needed, or it is not possible to use a regular expression, e.g., recursive definitions of nested comments, the user can use character-level grammars.

Our support for context-aware scanning borrows many ideas from the original work by Van Wyk and Schwerdfeger [37], but because of the top-down nature of GLL parsing, there are some differences. The original context-aware scanning approach [37] is based on LR parsing, and as each LR state corresponds to multiple grammar rules, there may be several terminals that are valid at a state. The set of valid terminals in a parsing state is called the valid lookahead set [37]. In GLL parsing, in contrast, the parser is at a single grammar position at each time, i.e., either before a nonterminal or before a terminal in a single grammar rule. Therefore, in GLL parsing, the valid lookahead set of a terminal grammar position contains only one element, which allows us to directly call the matcher of the regular expression of that terminal.

We use our simple context-aware scanning model for better performance, see Section 5.1. The implementation of the context-aware scanner in [37] is more sophisticated. The scanner is composed of all terminal definitions, as a composite DFA. This enables a longest match scheme across terminals in the same context, for example in programming languages where one terminal is a prefix of another, e.g., 'fun' and 'function' in OCaml.
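Because the valid lookahead set of a terminal grammar position is a singleton in GLL, on-demand scanning amounts to running that one terminal's matcher at the current input index. A minimal sketch of this idea (the names and table are our own illustration, not Iguana's actual interface):

```python
import re

# Terminal definitions compiled to regular expressions (illustrative).
TERMINALS = {
    'Id':      re.compile(r'[A-Za-z][A-Za-z0-9]*'),
    'Number':  re.compile(r'[0-9]+'),
    'Comment': re.compile(r'//[^\r\n]*|/\*.*?\*/', re.DOTALL),
}

def match_terminal(name, inp, i):
    """On-demand context-aware scanning: at a terminal grammar
    position, run only that terminal's matcher at input index i.
    Returns the new index, or None if the terminal does not match."""
    m = TERMINALS[name].match(inp, i)
    return m.end() if m else None

# The parser, standing before Comment at index 0, consults only
# Comment's matcher; no global tokenization is ever performed:
print(match_terminal('Comment', '/* hi */ x', 0))  # 8
print(match_terminal('Id', '/* hi */ x', 0))       # None
```

The composite-DFA approach of [37] instead matches all terminals of a state at once, which is what enables its cross-terminal longest match.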
To enforce longest match across terminals we use follow/precede restrictions, in this case a follow restriction on 'fun' or a precede restriction on identifiers. Moreover, keyword reservation in [37] is done by giving priority to keywords at matching states of the composite DFA. In our model, keyword exclusion should be explicitly applied in the grammar rules using an exclude disambiguation filter. We explain follow/precede restrictions and keyword reservation in Section 3.3.

In single-phase parsing, layout (whitespace and comments) is treated the same way as other lexical definitions. Because layout is almost always needed in parsing programming languages, we support automatic layout insertion into the rules. There are two approaches to deal with layout insertion: a layout nonterminal can be inserted exactly before or after each terminal node [20, 37]. Another way is to insert layout between the symbols in a rule, as in SDF [12]. We use SDF-style layout insertion: if X ::= x1 x2 ... xn is a rule, and L is a nonterminal defining layout, after the layout insertion the rule becomes X ::= x1 L x2 L ... L xn. A benefit of SDF-style layout insertion is that no symbol definition accidentally ends or starts with layout, provided that the layout is defined greedily (see Section 3.3). This is helpful when defining the offside rule (see Section 3.5).

3.3 Lexical Disambiguation Filters

Common lexical disambiguation filters [36], such as follow restrictions, precede restrictions and keyword exclusion, can be mapped to data-dependent grammars without further extensions to the parser generator or parsing algorithm. These disambiguation filters are common in scannerless parsing [32] and have been implemented for various generalized parsers [2, 38].

A follow restriction (!>>) specifies which characters cannot immediately appear after a symbol in a rule. This restriction is used to locally define longest match (as opposed to a global longest match in the lexer). For example, to enforce longest match on identifiers we write Id ::= [A-Za-z]+ !>> [A-Za-z]. A precede restriction (!<<) is similar to a follow restriction, but specifies the characters that cannot immediately precede a symbol in a rule. Precede restrictions can be used to implement longest match on keywords. For example, [A-Za-z] !<< Id disallows an identifier from starting immediately after a letter. This disallows, for example, recognizing intx as the keyword 'int' followed by the identifier x. Finally, exclusion (\) is usually used to implement keyword reservation. For example, Id \ 'int' excludes the keyword int from being recognized as Id.

    A ::= α B !>> c      A ::= α b:B [input.at(b.r) != c]
    A ::= α c !<< B      A ::= α b:B [input.at(b.l-1) != c]
    A ::= α B \ s        A ::= α b:B [input.sub(b.l,b.r) != s]

Figure 5. Mapping of lexical disambiguation filters.

Figure 5 shows the mapping from character-level disambiguation filters to data-dependent grammars. The mapping is straightforward: each restriction is translated into a condition that operates on the input. A note should be made regarding the condition implementing precede restrictions. This condition only depends on the left extent, b.l, which permits its application before parsing B. We consider this optimization in the implementation of our parsing framework, permitting application of such conditions before parsing labeled nonterminals or terminals.

The restrictions of Figure 5 are just examples and can be extended in many ways. For example, instead of defining the restriction using a single character, we can use regular expressions or character classes. One can also define similar restrictions for related disambiguation purposes. For example, consider the cast expression in C#:

    cast-exp ::= '(' type ')' unary-exp

An expression such as (x)-y is ambiguous, and can be interpreted as either a type cast of -y to the type x, or a subtraction of y from (x). In the C# language specification, it is stated that this ambiguity is resolved during parsing based on the character that comes after the closing parenthesis: if the character following the closing parenthesis is ~, !, (, an identifier, a literal, or a keyword, the expression should be interpreted as a cast. We can implement this rule as follows:

    cast-exp ::= '(' type ')' >>> [~!(A-Za-z0-9] unary-exp

The >>> notation specifies that the next character after the closing parenthesis should be an element of the specified character class. The implementation of >>> is similar to that of >>, with an additional aspect: it adds the condition on the automatically inserted layout nonterminal after ')' instead.

These examples show how more syntactic sugar can be added to the existing framework for various common lexical disambiguation tasks in programming languages without changes to the underlying parsing technology.

3.4 Operator Precedence and Associativity

Expression grammars in their natural form are often ambiguous. Consider the expression grammar in Figure 6 (left). For this grammar, the input string a+a*a is ambiguous, with two derivation trees that correspond to the following groupings: (a+(a*a)) and ((a+a)*a). Given that * normally has higher precedence than +, the first derivation tree is desirable. We use >, left, and right to define priority and left- and right-associativity, respectively [3]. Figure 6 (right) shows the disambiguated version of this grammar by specifying > and left, where - has the highest precedence, and * and + are left-associative.

Ambiguity in expression grammars is caused by derivations from the left- or right-recursive ends in a rule, i.e., E ::= αE and E ::= Eβ. We use >, left, and right to specify which derivations from the left- and right-recursive ends are not valid with respect to operator precedence. For example, E ::= '-' E > E '*' E specifies that E in the '-'-rule
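A follow restriction can be read operationally as a plain predicate on the character after a match. The sketch below (our own illustration, with the hypothetical helper `match_keyword`; not Iguana's implementation) shows 'int' !>> [A-Za-z]:

```python
# Sketch of a follow restriction ('int' !>> [A-Za-z]) as a check on the
# character immediately after a match: the keyword 'int' must not be
# followed by a letter, so it does not match inside "intx".
def match_keyword(s, i, kw):
    if not s.startswith(kw, i):
        return None                         # keyword does not occur here
    end = i + len(kw)
    if end < len(s) and s[end].isalpha():   # follow restriction violated
        return None
    return end                              # input position after the match

print(match_keyword("int x", 0, "int"))  # → 3
print(match_keyword("intx", 0, "int"))   # → None
```

A precede restriction is the mirror image: the same kind of check on the character at position i-1 before attempting the match.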
    E ::= '-' E                        E ::= '-' E
      | E '*' E                          > E '*' E   left
      | E '+' E                          > E '+' E   left
      | 'if' E 'then' E 'else' E         > 'if' E 'then' E 'else' E
      | a                                | a

Figure 6. An ambiguous expression grammar (left), and the same grammar disambiguated with > and left (right).

    E ::= E '*' E   left               E(p) ::= [2 >= p] E(2) '*' E(3)  //2
      > E '+' E   left                   | [1 >= p] E(0) '+' E(2)       //1
      | '(' E ')'                        | '(' E(0) ')'
      | a                                | a

Figure 7. An expression grammar with > and left (left), and its translation to data-dependent grammars (right).

    E(l,r) ::= [4 >= l] '-' E(l,4)                          //4
      | [3 >= r, 3 >= l] E(3,3) '*' E(l,4)                  //3
      | [2 >= r, 2 >= l] E(2,2) '+' E(l,3)                  //2
      | [1 >= l] 'if' E(0,0) 'then' E(0,0) 'else' E(0,0)    //1
      | a

Figure 8. Operator precedence with data-dependent grammars (binary and unary operators).

(parent) should not derive the '*'-rule (child). The > construct only restricts the right-recursive end of a parent rule when the child rule is left-recursive, and vice versa. For example, in Figure 6 (right) the right E in the '+'-rule is not restricted because the 'if'-rule is not left-recursive. This is to avoid a parse error on inputs that are not ambiguous, e.g., a + if a then a else a. Note that 'if' E 'then' E 'else' in the 'if'-rule acts as a unary operator. In addition, the > operator is transitive for all the alternatives of an expression nonterminal. Finally, left and right only affect binary recursive rules and only at the left- and right-recursive ends.

Although > is defined as a relationship between a parent rule and a child rule, its application may need to be arbitrarily deep in a derivation tree. For example, consider the input string a * if a then a else a + a for the grammar in Figure 6 (right). This sentence is ambiguous with two derivation trees that correspond to the following groupings:

    (a * (if a then a else a)) + a
    a * (if a then a else (a + a))

The first grouping is not valid: it lets 'if' bind stronger than '+', but we defined '+' to have higher priority than 'if'. This example shows that restricting derivations only at one level cannot disambiguate such cases. A correct implementation of > thus also restricts the derivation of the 'if'-rule from the right-recursive end of the '*'-rule if the '*'-rule is derived from the left-recursive end of the '+'-rule.

We now show how to implement an operator precedence disambiguation scheme using data-dependent grammars. We first demonstrate the basic translation scheme using binary operators only, and then discuss the translation of the example in Figure 6. Figure 7 (left) shows a simple example of an expression grammar that defines two left-associative binary operators * and +, where * is of higher precedence than +. Figure 7 (right) shows the result of the translation into the data-dependent counterpart. The basic idea behind the translation is to assign a number, a precedence level, to each left- and/or right-recursive rule of nonterminal E, to parameterize E with a precedence level, and, based on the precedence level passed to E, to exclude alternatives that will lead to derivation trees that violate the operator precedence.

In Figure 7 (right) each left- and right-recursive rule in the grammar gets a precedence level (shown in comments), which is the reverse of the alternative number in the definition of E. The precedence counter starts from 1 and increments for each encountered > in the definition. The number 0 is reserved for the unrestricted use of E, illustrated using the round bracket rule. Nonterminal E gets parameter p to pass the precedence level, and for each left- and right-recursive rule, a predicate is added at the beginning of the rule to exclude rules by comparing the precedence level of the rule with the precedence level passed to the parent E. Finally, for each use of E in a rule, an argument is passed.

In the '*'-rule, its precedence level (2) is passed to the left E, and its precedence level plus one (3) is passed to the right E. This allows us to exclude the rules of lower precedence from the left E, and to exclude the rules of lower precedence and the '*'-rule itself from the right E. Excluding the '*'-rule itself allows only the left-associative derivations, e.g., (a*a)*a, as specified by left. In the '+'-rule, its precedence level plus one (2) is passed to the right E, excluding the '+'-rule. The value 0 is passed to the left E, permitting any rule. Note that passing 0 instead of 1 to the left E of the '+'-rule achieves the same effect but enables better sharing of calls to E, as the sharing of calls (using GSS) is done based on the name of the nonterminal and the list of arguments. In the round bracket rule, 0 is passed to E as the use of E is neither left- nor right-recursive, hence the precedence does not apply.

Now we discuss the translation of the example shown in Figure 6, which contains both binary and unary operators. For this we need to distinguish between the rules that should be excluded from the left and from the right E. This is achieved as follows. First, E gets two parameters, l and r (Figure 8), to distinguish between the precedence level passed from the left and from the right, respectively. Second, a separate condition on l is added to a rule when the rule can be excluded from the right E (i.e., rules for binary operators and unary postfix operators). A separate condition on r is added to a rule when the rule can be excluded from the left E (i.e., rules for binary operators and unary prefix operators). Third, l- and r-arguments are determined for the left and right E's as follows. An l-argument to the left E and an r-argument to the right E are determined as in the example of Figure 7. For example, E(3,_) '*' E(_,4), where 3 is the precedence level of the '*'-rule, and 4 is the precedence level plus one. Note that r=4 does not exclude the unary operators of E. Now, an l-argument to the right E's is propagated from the
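The parameterized grammar E(p) of Figure 7 (right) can be mimicked by an ordinary deterministic parser in which the predicates [k >= p] become runtime checks. The sketch below is our own illustration, not the paper's GLL machinery; it returns a fully parenthesized string to make the chosen grouping visible:

```python
# Sketch: the predicates [2 >= p] and [1 >= p] of Figure 7 (right)
# become runtime checks; the result string shows the chosen grouping.
def parse(tokens):
    pos = 0
    def E(p):
        nonlocal pos
        if tokens[pos] == '(':              # '(' E(0) ')': precedence reset
            pos += 1
            left = '(' + E(0) + ')'
            assert tokens[pos] == ')'
            pos += 1
        else:                               # 'a'
            left = tokens[pos]
            pos += 1
        while pos < len(tokens):
            if tokens[pos] == '*' and 2 >= p:    # [2 >= p] E(2) '*' E(3)
                pos += 1
                left = '(' + left + '*' + E(3) + ')'
            elif tokens[pos] == '+' and 1 >= p:  # [1 >= p] E(0) '+' E(2)
                pos += 1
                left = '(' + left + '+' + E(2) + ')'
            else:
                break
        return left
    return E(0)

print(parse(list("a+a*a")))  # → (a+(a*a)) : * binds tighter than +
print(parse(list("a*a*a")))  # → ((a*a)*a) : left-associative
```

Passing 3 (the '*'-level plus one) to the right operand is exactly what forbids a right-hand '*' and yields left associativity.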
    Decls ::= align (offside Decl)*            Decls ::= a0:Star1(a0.l)
      | ignore('{' Decl (';' Decl)* '}')         | ignore('{' Decl Star2 '}')
    Decl ::= FunLHS RHS                        Decl ::= FunLHS RHS
    RHS ::= '=' Exp 'where' Decls              RHS ::= '=' Exp 'where' Decls
The second parameter, fst, which is either 0 or 1, is used to identify and skip the leftmost terminal that should not be constrained. The value 1 is passed at the application site of offside and propagated down to the first nonterminal of each reachable rule if the rule starts with a nonterminal. The value 0 is passed to any other nonterminal of a reachable rule when the first symbol of the rule is not nullable. Our translation also accounts for nullable nonterminals (not shown here), and in such cases the value of fst also depends on a dynamic check whether the left and right extents of the node corresponding to a nullable nonterminal are equal.

Finally, each terminal reachable from Decl gets a label (labels starting with o), to refer to its start index, and a constraint, encoded as a call to boolean function f. Note that in the definition of f, condition i == -1 corresponds to the case when Decl appears in a context where the offside rule does not apply or is ignored, and condition fst == 1 to the case of the leftmost terminal.

The offside, align and ignore constructs are examples of reasonably complex desugarings to data-dependent grammars. Their existence, and their aptness for describing the syntax of Haskell, is a witness to the power of data-dependent grammars and the parsing architecture we propose.

    global defs = {}

    Layout ::= (Whitespace | Comment | Decl | If | Gbg)*
               !>> [\ \t\n\r\f] !>> '/*' !>> '//' !>> '#'

    Decl ::= '#' 'define' id:Id
             {defs=put(defs,id.yield,true)} PpNL
           | '#' 'undef' id:Id
             {defs=put(defs,id.yield,false)} PpNL

    If   ::= '#' 'if' v=Exp(defs) [v] ? Layout
                                      : (Skipped (Elif|Else|PpEndif))
    Elif ::= '#' 'elif' v=Exp(defs) [v] ? Layout
                                        : (Skipped (Elif|Else|PpEndif))
    Else ::= '#' 'else' Layout

    Gbg     ::= GbgElif* GbgElse? '#' 'endif'
    GbgElif ::= '#' 'elif' Skipped
    GbgElse ::= '#' 'else' Skipped

    Skipped ::= Part+
    Part    ::= PpCond | PpLine | ... // etc.
    PpCond  ::= PpIf PpElif* PpElse? PpEndif
    PpIf    ::= '#' 'if' PpExp PpNL Skipped?
    PpElif  ::= '#' 'elif' PpExp PpNL Skipped?
    PpElse  ::= '#' 'else' PpNL Skipped?
    PpEndif ::= '#' 'endif' PpNL

Figure 12. The grammar of conditional directives in C#.
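The net effect of the If/Elif/Else rules in Figure 12 can be modeled outside the parser as picking the first branch whose condition holds. The toy model below uses hypothetical helper names of our own; the real PpExp grammar supports full boolean expressions, whereas this sketch only looks up bare identifiers in defs:

```python
# Toy model of Figure 12's branch selection: '#define'/'#undef' update
# defs; the first #if/#elif whose condition holds contributes its text
# (consumed as layout), an #else branch (condition None) always holds,
# and all remaining branches are skipped.
def eval_cond(defs, expr):
    return defs.get(expr, False)    # bare identifiers only in this sketch

def select_branch(defs, branches):
    """branches: list of (condition, text); condition None means #else."""
    for cond, text in branches:
        if cond is None or eval_cond(defs, cond):
            return text
    return ""                       # no branch taken: everything skipped

defs = {"DEBUG": True}              # after '#define DEBUG'
print(select_branch(defs, [("DEBUG", "dbg();"), (None, "rel();")]))  # → dbg();
defs["DEBUG"] = False               # after '#undef DEBUG'
print(select_branch(defs, [("DEBUG", "dbg();"), (None, "rel();")]))  # → rel();
```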
    Element ::= STag Content ETag
    STag ::= '<' Name Attribute* '>'
    ETag ::= '</' Name '>'

    Element ::= s=STag Content ETag(s)
    STag ::= '<' n:Name Attribute* '>' {n.yield}
    ETag(s) ::= '</' n:Name [n.yield == s] '>'

Figure 13. Context-free grammar of XML elements (top) and the data-dependent version (bottom).

while parsing, Exp uses data dependency and extends the PpExp rules, found in the C# specification, with return values and boolean computation. If the expression evaluates to true (note the use of conditional selection), the parser first continues consuming layout, including the nested directives, and then, after no layout can be consumed, the parser returns to the next symbol in the alternative.

If the expression evaluates to false, the parser consumes part of the input as a list of valid C# tokens (Skipped) until it finds the corresponding #elif-, #else- or #endif-part. Note that Skipped also consumes nested #if-directives (PpCond), if any, but in this case, conditions are not evaluated. The definition of Skipped also allows consuming invalid C# structure (valid token-wise only) when the condition is false, see Figure 3 (right). Finally, when all #if, #elif and #else directives are present, there will be dangling #elif, #else, and #endif parts remaining if one of the conditions evaluates to true. These dangling parts should also be consumed by the layout. The Gbg (garbage) nonterminal, defined as part of layout, does exactly this.

3.7 Miscellaneous Features

In this section we discuss the use of data-dependent grammars for parsing XML and resolving the infamous typedef ambiguity in C. XML has a relatively straightforward syntax. Figure 13 (top) shows the context-free definition of Element in XML, where Content allows a list of nested elements. The problem with this definition is that it can recognize inputs with unbalanced start and end tags, for example:

    <note>
    <to>Bob</from>
    <from>Alice</to>
    </note>

Using data-dependent grammars, the solution to match start and end tags is very intuitive. Figure 13 (bottom) shows a data-dependent grammar for XML elements. As can be seen, inside a starting tag, STag, the result of parsing Name is bound to n, and the respective substring, n.yield, is returned from the rule. The returned value is assigned to s in the Element rule, and is passed to the end tag, ETag. Finally, in the ETag, the name of the end tag is checked against the name of the starting tag. If the name of the starting tag is not equal to the name of the end tag, i.e., n.yield == s does not hold, the parsing pass dies.

    global defs = [{}]

    Declaration ::= x=Specifiers Declarators(x)
    Declarators(x) ::= s=Declarator {h=put(head(defs),s,x);
                                     defs=list(h,tail(defs))}
                       ("," Declarators(x))*
    Declarator ::= id:Identifier {id.yield}
                 | x=Declarator "(" ParameterTypeList ")" {x}
                 | ...
    Expr ::= Expr "-" Expr
           | "(" n:TypeName [isType(defs,n.yield)] ")" Expr
           | "(" Expr ")"
           | ...
           | n:Identifier [!isType(defs,n.yield)]

Figure 14. Resolving typedef ambiguity in C.

Now we consider the problem of typedef ambiguity in C. For example, the expression (T)+1 can have two meanings, depending on the meaning of T in the context: a cast to type T with +1 being a subexpression, or an addition with two operands (T) and 1. If T is a type, declared using typedef, the first parse is valid, otherwise the second one.

To resolve the typedef ambiguity, type names should be distinguished from other identifiers, such as variables and function names, during parsing. In addition, the scoping rules of C should be taken into account. For example, consider the following C program:

    typedef int T;
    main() {
      int T = 0, n = (T)+1;
    }

In this example, T is first declared as a type alias to int and then redeclared as a variable of type int in the inner scope introduced by the main function.

Figure 14 shows a simplified excerpt of our data-dependent C grammar. The excerpt shows the declaration and expression parts of the C grammar. As can be seen, a C declaration consists of a list of specifiers followed by a list of declarators. Each declarator declares one identifier. The keyword typedef can appear in the list of specifiers, for example, along with the declared type. A declarator can be either a simple identifier or a more complicated syntactic structure, e.g., array and function declarators, nesting the identifier. It is important to note that an identifier should enter the current scope when its declarator is complete. The expression part of Figure 14 shows the cast expression rule (the second rule from the top), and the primary expression rule (the last one). Note that to resolve the typedef ambiguity, illustrated in our running example, an identifier should be accepted as an expression only if it is not declared as a type name.

To distinguish between type names and other identifiers, we record names encountered as part of declarators, and associate a boolean value with each name: true for type names
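The data dependency used for XML — bind the start-tag name, pass it along, and constrain the end tag — can be sketched as a recursive matcher. This is our own illustration, not the Iguana grammar; attributes are omitted:

```python
import re

# Sketch of Figure 13 (bottom): the start-tag name is bound (n.yield),
# passed to the end-tag check, and compared there ([n.yield == s]).
def element(s, i=0):
    m = re.match(r'<(\w+)>', s[i:])
    if not m:
        return None
    name = m.group(1)                       # bound start-tag name
    i += m.end()
    while i < len(s) and s[i:i+2] != '</':  # content: text or nested elements
        if s[i] == '<':
            i = element(s, i)               # nested element
            if i is None:
                return None
        else:
            i += 1                          # character data
    m = re.match(r'</(\w+)>', s[i:])
    if not m or m.group(1) != name:         # constraint fails: path dies
        return None
    return i + m.end()                      # position after the element

print(element('<note><to>Bob</to></note>'))    # → 25
print(element('<note><to>Bob</from></note>'))  # → None
```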
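A minimal sketch of this scope-tracking discipline, with hypothetical helper names of our own (the paper's defs and isType operate inside grammar actions; here they are modeled as plain functions over a mutable list):

```python
# defs is a list of maps, one per scope (head = innermost); a name maps
# to True for type names and False for other identifiers.
defs = [{}]                              # global scope

def open_scope():                        # on '{'
    defs.insert(0, {})

def close_scope():                       # on '}'
    defs.pop(0)

def declare(name, is_type_name):         # after a declarator completes
    defs[0][name] = is_type_name

def is_type(name):                       # isType: inner-to-outer lookup
    for scope in defs:
        if name in scope:
            return scope[name]
    return False

declare("T", True)                       # typedef int T;
open_scope()                             # main() {
declare("T", False)                      # int T = 0;  shadows the typedef
print(is_type("T"))                      # → False, so (T)+1 is an addition
close_scope()                            # }
print(is_type("T"))                      # → True again in the outer scope
```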
and false otherwise. To maintain this information during parsing, we introduce the global variable defs, holding a list of maps to properly account for scoping. At the beginning of parsing, defs is a list containing a single, empty map. At the beginning of a new scope, i.e., when "{" is encountered, an empty map is prepended to the current list, resulting in a new list which is assigned to defs (not shown in the figure). At the end of the current scope, i.e., when "}" is encountered, the head of the current list is dropped by taking the tail of the list and assigning it to defs.

To communicate the presence of typedef in a list of specifiers, we extend each rule of Specifier to return a boolean value: the "typedef"-rule returns true, and the other rules return false. Specifiers computes the disjunction of the values associated with the specifiers in the resulting list. This information is passed via variable x to Declarators. We also extend the rules of Declarator to return the declared name, id.yield. After a declarator is parsed, the declared name can be stored in defs: the pair (s,x) is added to the map taken from the head of the current list, and a new list, with the resulting map as its head, is created and assigned to defs.

Finally, the isType function is used to check whether the current identifier is a type name in the current scope or not: isType iterates over the elements in defs, starting from the first element, to look up the given name. If the name is not found in the current map, isType continues the search with the next element, representing the outer scope. If the name is found, isType returns the boolean value associated with the name. If none of the maps contains the name, isType returns false.

In our running example, after parsing the second declaration of T, which appears in the scope of the main function, the pair ("T",false) will be added to the map in the head of defs, effectively shadowing the previous typedef declaration of T, and causing the condition in the cast expression rule to fail.

4. Data-dependent GLL Parsing

In this section we present our extension of the GLL parsing algorithm [33] to support data-dependent grammars. GLL parsing is a generalization of recursive-descent parsing that supports all context-free grammars, and produces a binarized Shared Packed Parse Forest (SPPF) in cubic time and space. GLL uses a Graph-Structured Stack (GSS) [35] to handle multiple function calls in recursive-descent parsing. The problem of left recursion is solved by allowing cycles in the GSS. As GLL parsers are recursive-descent like, the handling of parameters and environments is intuitive, and the implementation remains very close to the stack-based semantics, which eases reasoning about the runtime behavior of the parser. More information on GLL parsing over ATN, GSS, and SPPF is provided in Appendix A.

We use a variation of GLL parsing that uses a more efficient GSS [2]. GLL parsing can be seen as a grammar traversal process that is guided by the input. At each point during parsing, a GLL parser is at a grammar slot (a grammar position before or after a symbol in a rule) and executes the code corresponding to this slot. Because of the nondeterministic nature of general parsing, a GLL parser needs to record all possible paths and process them later, and at the same time eliminate duplicate jobs. The unit of work in GLL parsing is a descriptor, which captures a parsing state. Descriptors allow a serialized, discrete view of the tasks performed during parsing. GLL parsing has a main loop, in a trampolined style, that executes the descriptors one at a time until no more descriptors are left.

The standard way of implementing a GLL parser is to generate code for each grammar slot [33]. Such an implementation relies on dynamic gotos to allow arbitrary jumps to the main loop or other grammar slots. In our GLL implementation, a grammar is modeled as a connected graph of grammar slots. This model of context-free grammars resembles Woods' Recursive Augmented Transition Network (ATN) [40] grammars. As such, our implementation of GLL over ATN grammars (Appendix A) provides an interpreter version of GLL parsing.

[ATN state diagram: one chain of states per alternative, labeled E '+' E, '-' E, and a.]
Figure 15. ATN grammar for E ::= E + E | '-' E | a.

4.1 ATN Grammars

ATN grammars are an automaton formalism developed in the 70s to parse natural languages, and are similar to nondeterministic finite automata. An ATN grammar is a tuple (Q, F, →) where

• Q is a finite set of states;
• F ⊆ Q is a finite set of states representing final grammar slots; and
• → is a transition relation, with transitions labeled by a nonterminal (A), a terminal (t), or epsilon (ε).

For example, the ATN grammar for E ::= E + E | '-' E | a is shown in Figure 15. In an ATN, there is a one-to-many relation, S ⊆ String × Q, from a nonterminal name to a set of start states, each representing the initial state of an alternative.

Constructing an ATN grammar from a CFG is straightforward. For each nonterminal in the grammar, and for each alternative of the nonterminal, a pair consisting of the nonterminal's name and a state representing the start state of the alternative is added to S. Then, for each symbol in the alternative, a next state is created, and a transition, labeled with the symbol, from the previous state to this state is added. The last state of the alternative is marked as a final grammar slot.
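The construction just described can be sketched directly (an illustration of the idea, not Iguana's internal representation):

```python
# Sketch of the CFG-to-ATN construction: one chain of fresh states per
# alternative; S maps a nonterminal name to its start states, and the
# last state of each alternative is a final grammar slot.
def build_atn(grammar):
    """grammar: dict mapping nonterminal -> list of alternatives,
    where each alternative is a list of symbols."""
    S, transitions, final = {}, [], set()
    counter = 0
    def fresh():
        nonlocal counter
        counter += 1
        return f"q{counter}"
    for nt, alts in grammar.items():
        for alt in alts:
            state = fresh()
            S.setdefault(nt, set()).add(state)   # start state of this alternative
            for sym in alt:
                nxt = fresh()
                transitions.append((state, sym, nxt))
                state = nxt
            final.add(state)                     # final grammar slot
    return S, transitions, final

S, T, F = build_atn({"E": [["E", "+", "E"], ["-", "E"], ["a"]]})
print(len(S["E"]), len(T), len(F))  # → 3 6 3
```

For E ::= E + E | - E | a this yields three start states, six labeled transitions, and three final slots, matching the shape of Figure 15.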
[Data-dependent ATN diagram for E(l,r), with transitions [2 >= l, 2 >= r] E(2,2) '+' E(l,3); [1 >= l] '-' E(0,0); and a.]
Figure 16. Data-dependent ATN grammar for E ::= E + E > '-' E | a after desugaring operator precedence.

4.2 Data-dependent ATN Grammars

To support data-dependent grammars, we extend ATN grammars with the following forms of transitions:

• transitions labeled x=l:A(e) (parameterized, labeled nonterminals) and l:t (labeled terminals);
• transitions labeled x=e (variable binding), [e] (constraint), and e (return expression).

Two additional mappings are maintained: L, X : Q → String, which map a state representing a grammar slot after a labeled nonterminal to the nonterminal's label (l) and to the nonterminal's variable (x), respectively. Here, as in Section 3.1, for simplicity of presentation and without loss of generality, we assume that nonterminals can have at most one parameter. We also only consider the cases where terminals and nonterminals are labeled, and where a return expression is present. Finally, we assume that the expression language e is a simple functional programming language with immutable values and no side-effects, that labels and variables are scoped to the rules they are introduced in, and that labels and variables introduced by desugaring have unique names in their scopes.

An example of a data-dependent ATN is shown in Figure 16. This ATN grammar is the disambiguated version of the grammar shown in Figure 15 after desugaring operator precedence.

4.3 Data Dependency in GLL Parsing

In the following, p, q, s represent ATN states in Q, i is an input index, u, u' represent GSS nodes, and w, n, y represent SPPF nodes. To support data-dependent grammars, we introduce an environment, E, into GLL parsing. Here, we assume that E is an immutable map of variable names to values. In the data-dependent setting, a descriptor, the unit of work in GLL parsing, is of the form (p, i, E, u, w). Now a descriptor contains an environment E that has to be stored and later used whenever the parser selects the descriptor to continue from this point. The GSS is also extended to store additional data. A GSS node and a GSS edge are now of the forms (A, i, v) and (u, p, w, E, u'), respectively. That is, in addition to the current input index i, a GSS node stores an argument v, passed to a nonterminal A, to fully identify the call. A GSS edge additionally stores an environment E, to capture the state of the parser before a call to a nonterminal is made.

Finally, a GLL parser constructs a binarized SPPF (Appendix A.1), creating terminal nodes (nodeT), and nonterminal and intermediate nodes (nodeP). In GLL parsing intermediate nodes are essential. In particular, they allow the parser to carry a single node at a time by grouping the symbols of a rule in a left-associative manner. Nonterminal and intermediate nodes can be ambiguous. To properly handle ambiguities under nonterminal and intermediate nodes, we include environments and return values in the SPPF construction. Specifically, arguments to nonterminals and return values are part of nonterminal nodes, and environments are part of intermediate nodes.

Figure 17 presents the semantics of GLL parsing over ATN, defining it as a transition relation on configurations (R, U, G, P), where the elements are the four main structures maintained by a GLL parser:

• R is a set of pending descriptors to be processed;
• U is a set of descriptors created during parsing. This set is maintained to eliminate duplicate descriptors;
• G is a GSS, represented by a set of GSS edges;
• P is a set of parsing results (SPPF nodes created for nonterminals) associated with GSS nodes, i.e., a set of elements of the form (u, w).

During parsing, a descriptor is selected and removed from R, represented as {(p, i, E, u, w)} ∪ R, and, given the rules, a deterministic choice is made based on the next transition in the ATN. The simplest rules are Eps, Cond-1, Cond-2 and Bind. Eps creates the ε-node (via a call to nodeT) and an intermediate node (via a call to nodeP), and adds a descriptor for the next grammar slot. Cond-1 and Cond-2 depend on the evaluation of expression e in a constraint. If the expression evaluates to true, a new descriptor is added to continue with the next symbol in the rule (Cond-1); otherwise no descriptor is added (Cond-2). Bind evaluates the expression in an assignment and creates a new environment containing the respective binding. This environment is used to create the new descriptor added to R.

Term-1 and Term-2 deal with labeled terminals. If terminal t matches (Term-1) the input string (represented by an array I) starting from input position i, a terminal node is created (assuming t is of length 1). Then the properties, i.e., the left and right extents, and the respective substring, are computed from the resulting node (props(y)). Finally, a new environment, containing the binding [l = props(y)], is created and used to construct an intermediate node and a new descriptor. If the terminal does not match (Term-2), no descriptor is added.

Call-1 and Call-2 deal with labeled calls to nonterminals. First, argument e is evaluated, where E1 allows the use of the left extent in e (lprop constructs properties with only the left extent). If a GSS node representing this call already exists (Call-1), the parsing results associated with this GSS node are reused, and a possibly empty set of new descriptors (D)
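One practical consequence of adding E to descriptors is that descriptors must remain comparable for the duplicate check d ∉ U. A sketch (our own illustration, not the paper's implementation) using an immutable representation:

```python
from dataclasses import dataclass

# Sketch: a descriptor (p, i, E, u, w) with the environment E stored as
# a sorted tuple of (name, value) pairs, so descriptors are hashable
# and the set U can discard duplicates (the d ∉ U checks in Figure 17).
@dataclass(frozen=True)
class Descriptor:
    state: str    # ATN state p
    pos: int      # input index i
    env: tuple    # environment E as sorted (name, value) pairs
    gss: str      # GSS node u (simplified to a name here)
    sppf: str     # SPPF node w (simplified)

U = set()
d1 = Descriptor("q1", 3, (("l", (0, 3)),), "u0", "w0")
d2 = Descriptor("q1", 3, (("l", (0, 3)),), "u0", "w0")
U.add(d1)
U.add(d2)          # equal field by field: recognized as a duplicate
print(len(U))      # → 1
```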
    (R, U, G, P) ⇒ (R', U', G', P')

    Eps:
      p -ε-> q
      n = nodeP(q, w, nodeT(ε, i, i), E)    d = (q, i, E, u, n)
      ({(p, i, E, u, w)} ∪ R, U, G, P) ⇒ (R ∪ {d}, U, G, P)

    Term-1:
      p -l:t-> q    I[i] = t
      y = nodeT(t, i, i+1)    E1 = E[l = props(y)]
      n = nodeP(q, w, y, E1)    d = (q, i+1, E1, u, n)
      ({(p, i, E, u, w)} ∪ R, U, G, P) ⇒ (R ∪ {d}, U, G, P)

    Term-2:
      p -l:t-> q    I[i] ≠ t
      ({(p, i, E, u, w)} ∪ R, U, G, P) ⇒ (R, U, G, P)

    Call-1:
      p -x=l:A(e)-> q
      E1 = E[l = lprop(i)]    [[e]]E1 = v    u' = (A, i, v) ∈ N(G)
      D = {d | (u', y) ∈ P, E2 = E[l = props(y), x = val(y)],
               d = (q, rext(y), E2, u, nodeP(q, w, y, E2)), d ∉ U}
      ({(p, i, E, u, w)} ∪ R, U, G, P) ⇒
          (R ∪ D, U ∪ D, G ∪ {(u', q, w, E, u)}, P)

    Call-2:
      p -x=l:A(e)-> q
      E1 = E[l = lprop(i)]    [[e]]E1 = v    u' = (A, i, v) ∉ N(G)
      D = {(s, i, [p0 = v], u', $) | s ∈ S(A)}
      ({(p, i, E, u, w)} ∪ R, U, G, P) ⇒
          (R ∪ D, U, G ∪ {(u', q, w, E, u)}, P)

    Ret:
      p -e-> q    q ∈ F
      [[e]]E = v    n = nodeP(q, w, arg(u), v)
      D = {d | (u, s, y, E1, u') ∈ G, E2 = E1[L(s) = props(n), X(s) = v],
               d = (s, i, E2, u', nodeP(s, y, n, E2)), d ∉ U}
      ({(p, i, E, u, w)} ∪ R, U, G, P) ⇒ (R ∪ D, U ∪ D, G, P ∪ {(u, n)})

    Cond-1:
      p -[e]-> q    [[e]]E = true    d = (q, i, E, u, w)
      ({(p, i, E, u, w)} ∪ R, U, G, P) ⇒ (R ∪ {d}, U, G, P)

    Cond-2:
      p -[e]-> q    [[e]]E = false
      ({(p, i, E, u, w)} ∪ R, U, G, P) ⇒ (R, U, G, P)

    Bind:
      p -x=e-> q    [[e]]E = v    d = (q, i, E[x = v], u, w)
      ({(p, i, E, u, w)} ∪ R, U, G, P) ⇒ (R ∪ {d}, U, G, P)

Figure 17. GLL for data-dependent ATN grammars.

is created. Each descriptor in the set corresponds to a result, a nonterminal node y, retrieved from P, such that the index of the descriptor is the right extent of y (rext), its environment contains the bindings [l = props(y)] and [x = val(y)] (val retrieves the value from y), and its SPPF node is a new intermediate node. Note that d ∉ U ensures that no duplicate descriptors are added at this point. If the corresponding GSS node does not exist, Call-2 creates one descriptor for each start state of the nonterminal (s ∈ S(A)). Each descriptor gets a new environment with the binding [p0 = v], where p0 is the nonterminal's parameter, which we assume to have a unique name in the scope of a rule. Both Call-1 and Call-2 add a new GSS edge capturing the previous environment to G.

Finally, in the Ret rule, the return expression is evaluated, and the nonterminal node is created, which stores both the argument of the current GSS node (arg(u)) and the return value. This node is recorded in P as a result associated with the GSS node. For each GSS edge directly reachable from the current GSS node, a new descriptor is created. Note that labels and variables at call sites, represented by the current GSS node, are retrieved via the mappings L and X, respectively.

5. Evaluation

Our data-dependent parsing framework is implemented as an extension of the Iguana parsing framework [2]. The addition of data dependency is at the moment a prototype, and most of the effort was put into correctness rather than performance optimization. As a frontend for writing data-dependent grammars, we extended the syntax definition of Rascal [24], a programming language for meta-programming and source code analysis, and provided a mapping to Iguana's internal representation of data-dependent grammars.

In Section 2 we enumerated a number of challenges in parsing programming languages, and in Section 3 we provided solutions based on data-dependent grammars (directly or via desugaring) that address these challenges. For each challenge we selected a programming language that exhibits it, and wrote a data-dependent grammar³, derived from the specification grammar of the language. For evaluation, we parsed real source files from the source distribution of the language and some popular open source libraries, see Table 2. Table 1 summarizes the evaluation results. In the following we discuss these results in detail, and provide an analysis of the expected performance in practice.

Java. To evaluate the correctness of our declarative operator precedence solution using data-dependent grammars, we used the grammar of Java 7 from the main part of the Java language specification [11]. This grammar contains an unambiguous left-recursive expression grammar, in a similar style to the expression grammar in Figure 1 (middle).

We replaced the expression part (consisting of about 30 nonterminals) of the Java specification grammar with a single Expression nonterminal that declaratively expresses operator precedence using >, left and right. The resulting grammar, which we refer to as the natural grammar, is much more concise and readable, see Table 1. The resulting parser parsed all 8067 files successfully and without ambiguity.

The natural grammar of Java produces different parse trees compared to the original specification grammar, and therefore it is not possible to directly compare the parse trees. To test the correctness of the parsers resulting from the desugaring of >, left, and right to data-dependent grammars, we tested their resulting parse trees against a GLL parser for the same natural grammar of Java, using our previous work on rewriting operator precedence rules [3]. Both parsers, using desugaring to data-dependent grammars and rewriting of operator precedence rules, produced the same

³ https://github.com/iguana-parser/grammars
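The Cond and Bind rules can also be read operationally. The stripped-down sketch below (our own illustration; GSS and SPPF are elided, and a descriptor is reduced to a state, a position, and an environment) shows a constraint either scheduling the next slot or killing the path, and a binding extending the environment:

```python
# Sketch of the Cond-1/Cond-2 and Bind rules over simplified descriptors.
def step(descriptor, transition, R, U):
    p, i, env = descriptor                  # SPPF node and GSS node elided
    kind, payload, q = transition
    if kind == "cond":                      # p -[e]-> q
        if payload(env):                    # Cond-1: continue to q
            add(R, U, (q, i, env))          # Cond-2: otherwise drop silently
    elif kind == "bind":                    # p -x=e-> q
        x, e = payload
        add(R, U, (q, i, {**env, x: e(env)}))

def add(R, U, d):                           # the d ∉ U duplicate check
    key = (d[0], d[1], tuple(sorted(d[2].items())))
    if key not in U:
        U.add(key)
        R.append(d)

R, U = [], set()
step(("p0", 0, {}), ("bind", ("v", lambda env: 42), "p1"), R, U)
step(R[0], ("cond", lambda env: env["v"] > 0, "p2"), R, U)
print([d[0] for d in R])  # → ['p1', 'p2']
```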
Table 1. Summary of the results of parsing with character-level data-dependent grammars of programming languages.

    Language  Challenge                Solution                              Spec. Grammar       Data-dep. Grammar   # Files  % Success
                                                                             # Nont.  # Rules    # Nont.  # Rules
    Java      Operator precedence      >, left and right                     200      485        169      435       8067     100% (8067)
    C#        Conditional directives   global variables and dynamic layout   387      1000       395      1013      5839     99% (5838)
    Haskell   Indentation sensitivity  align, offside and ignore             143      431        152      452       6457     72% (4657)

parse trees for all Java files, providing evidence that our desugaring of operator precedence to data-dependent grammars implements the same semantics as the rewriting in [3].

Despite its prototype status, the data-dependent parser is at the moment on average only 25% slower than the rewritten one. The main reason for the performance difference is that in the rewriting technique [3] the precedence information is statically encoded in the grammar, and therefore there is no runtime overhead, while in the data-dependent version passing arguments and handling environments is done at runtime. The problem with the rewriting technique is that the rewriting process itself is rather slow and the resulting grammar is very large.

Table 2. Summary of the projects used in the evaluation.

    Lang.    Project        Version         Description
    Java     JDK            1.7.0_60-b19    Java Development Kit
             JUnit          4.12            Unit testing framework
             SLF4J          1.7.12          A Java logging framework
    C#       Roslyn         build-preview   .NET Compiler Platform
             MVC            6.0.0-beta5     ASP.NET MVC Framework
             EntityFramew.  7.0.0-beta5     Data access for .NET
    Haskell  GHC            7.8             Glasgow Haskell Compiler
             Cabal          1.22.4.0        Build System for Haskell
             Git-annex      5.20150710      File manager based on Git
             Fay            0.23.1.6        Haskell to JavaScript compiler
[Figure 18: three log-log panels (Java, C#, Haskell) plotting CPU time (milliseconds, log10) against input size (#characters, log10), each with a regression line: y = 1.212x − 3.181 (Java), y = 1.098x − 3 (C#), and y = 0.95x − 1.616 (Haskell).]
Figure 18. Running time of the character-level parsers for Java, C#, and Haskell against the input size (number of characters), plotted as log-log base 10. The red line is the linear regression fit; the goodness of each fit is indicated by the adjusted R² value in each plot. Each equation describes a power-law relation in the original data, and since all the coefficients (1.212, 1.098, 0.950) are close to one, we conclude that the running time is near-linear on these grammars.
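The fit itself is ordinary least squares on the log-transformed data. As a small illustration of the method (a sketch under our own assumptions, not the authors' analysis script), fitting log10(time) = a·log10(size) + b recovers the exponent a of the power relation time = 10^b · size^a; a slope close to one means near-linear running time:

```python
import math

def loglog_fit(sizes, times):
    """Least-squares fit of log10(time) = a*log10(size) + b; returns (a, b)."""
    xs = [math.log10(s) for s in sizes]
    ys = [math.log10(t) for t in times]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

# Synthetic data where time is exactly proportional to size: the fitted
# slope is 1 and the intercept is log10 of the constant factor.
sizes = [10 ** k for k in range(2, 7)]
times = [1e-3 * s for s in sizes]
a, b = loglog_fit(sizes, times)
```

On real measurements the slope will deviate from one exactly as the coefficients in Figure 18 do.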
and character-level parsers on all the source files. Figure 19 shows the relative performance gain (speedup) of a context-aware parser compared to a character-level parser for each file. For better visualization we omitted the outliers from the box plots. The median and maximum speedup is (2.45, 15.1) for Java, (2.45, 4) for C#, and (1.9, 3) for Haskell. The precise impact of context-aware scanning on general parsing and data-dependent grammars is future work, but our preliminary investigation revealed that using character-level grammars for parsing layout is very expensive, as it is a very common operation, see Section 3.2.

6. Related Work

Parsing is a well-researched topic, and many features of our parsing framework are related in one way or another to other existing systems. Throughout this paper we have discussed some related work, which we do not repeat here. In this section we discuss directly related work and our inspirations.

Data dependency implementation  Data-dependent grammars have many similarities with attribute grammars [27] and attribute-directed parsing [39]. A detailed discussion of related systems is provided by Jim et al. [16]. From the implementation perspective, Jim et al. present the Yakker parser generator [16], which is based on Earley's algorithm [7], whereas we have a GLL-based interpretation of data-dependent grammars. We also extend the SPPF creation functionality of GLL parsing (taking environments into account), while SPPF creation is not discussed in Jim et al.'s approach. Another difference between our implementation and Yakker is that Yakker directly supports regular operators, by applying longest match. We, however, believe that all ambiguities should be returned by the parser, and avoid such implicit heuristics. Therefore, we desugar regular operators to data-dependent BNF rules.

We use an interpretative model of parsing based on Woods' ATN grammars [40]. Woods used an explicit stack to run ATN grammars, similar to a pushdown automaton. However, as with any top-down parser, such execution of ATN grammars does not terminate in the presence of left recursion. Jim et al.'s data-dependent framework operates on a data-dependent automaton [15], which is a variation of ATN grammars interpreted with Earley's algorithm.

Indentation-sensitive parsing  Besides modifications to the lexer, which have been used in GHC and CPython, there are a number of other systems that provide a solution for indentation-sensitive parsing. Parser combinators [13] are higher-order functions that are used to define grammars in terms of constructs such as alternation and sequence. This approach has been used for parsing indentation-sensitive languages [14]. Traditional parser combinators do not support left recursion and can have exponential runtime. Another main difference between parser combinators and our approach is that we do not give the end user access to the internal workings of the parser. Since parser combinators are normal functions, the user can modify them. Our approach provides an external DSL for defining parsers, while parser combinators provide an internal DSL. Therefore, our approach provides more control over the syntax definition than parser combinators.

Erdweg et al. present an extension of SDF to define layout constraints on grammar rules [8]. This constraint-based approach is implemented by modifying the underlying SGLR [38] parser. Most constraints can be solved during parsing; constraints that are not resolved lead to ambiguity, which can be removed by post-parse filtering. Adams presents the notion of indentation-sensitive grammars [1], where symbols in a rule are annotated with their position relative to their immediate parents. This technique is implemented for LR(k) parsing.

We do not offer a customized solution for indentation sensitivity for a specific parsing technology; rather, we use the general data-dependent grammars framework and map indentation rules to it. In addition, we define high-level constructs such as align, offside, and ignore, which are desugared to lower-level data-dependent grammars. This enables a syntax definition model that is closer to what the user has in mind. We think the use of high-level constructs leads to cleaner, more maintainable grammars.

Operator precedence  SDF2 uses a parser-independent semantics of operator precedence which is based on a parent-child relationship on derivation trees [38]. This semantics is implemented in SGLR parsing [38] by modifying parse tables. Although the SDF2 semantics for operator precedence works for most cases, in some cases it is too strong, i.e., it rejects valid sentences, and in some cases it cannot disambiguate the expression grammar.

In earlier work [3] we discussed the precedence ambiguity problem, and proposed a grammar rewriting that takes an ambiguous grammar with a set of precedence rules and produces a grammar that does not allow precedence-invalid derivations. Our current solution has the same semantics: it does not remove sentences when there is no precedence ambiguity, and can deal with corner cases found in programming languages such as OCaml. In addition, our operator precedence solution is desugared to data-dependent grammars; thus it is independent of the underlying parsing technology.

Conditional directives  Recent work on parsing conditional directives targets all variations [10, 21]. Gazzillo and Grimm [10] give an extensive overview of related work in this area. However, to the best of our knowledge, none of the existing systems employs a single-phase parsing scheme; rather, they use a separate scanner and annotate the tokens based on the conditional directives they appear in. Our approach of using data-dependent grammars to evaluate conditional directives is new. The treatment of other features of preprocessors, such as macros, is future work.
7. Conclusion

We have presented our vision of a parsing framework that is able to address many challenges of declarative parsing of real programming languages. We have built an implementation of data-dependent grammars based on the GLL parsing algorithm. We have also shown how to map common idioms in the syntax of programming languages, such as lexical disambiguation filters, operator precedence, indentation sensitivity, and conditional directives, to data-dependent grammars. These mappings provide the language engineer with a set of out-of-the-box constructs, while at the same time new high-level constructs can be added. The preliminary experiments with our parsing framework show that it can be efficient and practical. To fully realize our vision we will explore more syntactic features, and further optimize the implementation of our framework.

[11] J. Gosling, B. Joy, G. Steele, G. Bracha, and A. Buckley. The Java Language Specification, Java SE 7 Edition, 2013.
[12] J. Heering, P. R. H. Hendriks, P. Klint, and J. Rekers. The Syntax Definition Formalism SDF – Reference Manual. SIGPLAN Not., 24(11):43–75, Nov. 1989.
[13] G. Hutton. Higher-order Functions for Parsing. Journal of Functional Programming, 2(3):323–343, July 1992.
[14] G. Hutton and E. Meijer. Monadic Parsing in Haskell. Journal of Functional Programming, 8(4):437–444, 1998.
[15] T. Jim and Y. Mandelbaum. Efficient Earley Parsing with Regular Right-hand Sides. Electronic Notes in Theoretical Computer Science, 253(7):135–148, 2010. LDTA '09.
[16] T. Jim, Y. Mandelbaum, and D. Walker. Semantics and Algorithms for Data-dependent Grammars. In Principles of Programming Languages, POPL '10, pages 417–430. ACM, 2010.
[17] M. Johnson. The Computational Complexity of GLR Parsing. In Generalized LR Parsing, pages 35–42. Springer US, 1991.
[32] D. J. Salomon and G. V. Cormack. Scannerless NSLR(1) Parsing of Programming Languages. In Programming Language Design and Implementation, PLDI '89, pages 170–178, 1989.
[33] E. Scott and A. Johnstone. GLL Parse-tree Generation. Science of Computer Programming, 78(10):1828–1844, Oct. 2013.
[34] E. Scott, A. Johnstone, and R. Economopoulos. BRNGLR: A Cubic Tomita-style GLR Parsing Algorithm. Acta Informatica, 44(6):427–461, 2007.
[35] M. Tomita. Efficient Parsing for Natural Language. Kluwer Academic Publishers, USA, 1985. ISBN 0898382025.
[36] M. G. J. van den Brand, J. Scheerder, J. J. Vinju, and E. Visser. Disambiguation Filters for Scannerless Generalized LR Parsers. In Compiler Construction, CC '02, pages 143–158. Springer, 2002.
[37] E. R. Van Wyk and A. C. Schwerdfeger. Context-aware Scanning for Parsing Extensible Languages. In GPCE '07, pages 63–72. ACM, 2007.
[38] E. Visser. Syntax Definition for Language Prototyping. PhD thesis, University of Amsterdam, 1997.
[39] D. A. Watt. Rule Splitting and Attribute-directed Parsing. In Semantics-Directed Compiler Generation, pages 363–392. 1980.
[40] W. A. Woods. Transition Network Grammars for Natural Language Analysis. Commun. ACM, 13(10):591–606, 1970.

A. GLL Parsing

In this section we first describe GLL parsing, SPPF construction, and GSS. Then, we define the semantics of GLL parsing over ATN grammars as a transition relation.

A.1 SPPF

It is known that any parsing algorithm that constructs a Tomita-style SPPF is of unbounded polynomial complexity [17]. To achieve parsing in cubic time and space, GLL uses a binarized SPPF [33] format, which has additional intermediate nodes. Intermediate nodes allow grouping of the symbols of a rule in a left-associative manner, thus allowing the parser to always carry a single node at a time, instead of a list of nodes. This is the key to preserving the cubic bound. The use of intermediate nodes effectively achieves the same as restricting a grammar to rules of length at most two, but without rewriting the original grammar and transforming the resulting derivation trees back to those of the original grammar.

Definition 1. A binarized SPPF is a compact representation of a parse forest that has the following types of nodes:
• nonterminal nodes of the form (A, i, j), where A is a nonterminal, and i and j are the left and right extents;
• terminal nodes of the form (t, i, j), where t is a terminal, and i and j are the left and right extents;
• packed nodes of the form (L, k), where L is a grammar slot and k is the pivot of the node; and
• intermediate nodes of the form (L, i, j), where L is the grammar slot, and i and j are the left and right extents.

The left and right extents of a node identify the substring of the input associated with the node. As GLL parsing is context-free, nodes with the same label, the same left extent, and the same right extent can be shared. Nonterminal and intermediate nodes have packed nodes as their children. Packed nodes represent a derivation, and can have at most two children, which are non-packed nodes. If a non-packed node is ambiguous, it will have more than one packed node. The pivot of a packed node is the right extent of its left child, and is used to distinguish between packed nodes under a non-packed node.

[Figure 20: the binarized SPPF (left) with nonterminal nodes (E,0,4), (E,1,4), (E,0,3), (E,0,2), (E,1,3), (E,3,4), (E,1,2) and terminal nodes ('-',0,1), ('a',1,2), ('+',2,3), ('a',3,4), and the GSS (right) with nodes (E,0), (E,1), (E,3).]
Figure 20. SPPF (left) and GSS (right) for the input -a+a.

The binarized SPPF resulting from parsing the input string -a+a with the grammar E ::= - E | E + E | a is shown in Figure 20 (left), where packed nodes are depicted as small circles. For better visualization, we have omitted the labels of packed nodes. The input is ambiguous and has the following two derivations: (-(a+a)) and ((-a)+a). This can be observed from the presence of two packed nodes under the root node. The left and right packed nodes under the root node correspond to the first and second alternatives, respectively.

SPPF construction is delegated to two functions, nodeT and nodeP. The nodeT(t, i, j) function takes a terminal t and two integer values i and j (the left and right extents), and returns an existing node with these properties, otherwise a new node. nodeP(L, w, z) takes a grammar slot L and two non-packed nodes w and z. nodeP returns an existing non-packed node labeled L with two children w and z. If no such node exists, a non-packed node labeled L is created, and w and z are connected to the newly created non-packed node via a packed node. The details of GLL parse tree construction are discussed in [33], and implementation techniques for efficient sharing of nodes are presented in [2, 19].

A.2 GSS

At the core of GLL parsing is the Graph-Structured Stack data structure. We use a variation of GLL parsing that uses a more efficient GSS [2].
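The sharing behavior of nodeT and nodeP can be sketched with two lookup tables keyed by label and extents. This is a minimal illustration with hypothetical class and table names; it ignores the dummy node and the distinction between terminal, intermediate, and nonterminal nodes:

```python
class NonPackedNode:
    """A shared SPPF node; more than one packed child means ambiguity."""
    def __init__(self, label, left, right):
        self.label, self.left, self.right = label, left, right
        self.packed = []  # packed children: (pivot, left child, right child)

terminal_nodes = {}   # (t, i, j) -> shared terminal node
nonpacked_nodes = {}  # (L, i, j) -> shared non-packed node

def nodeT(t, i, j):
    # Return the existing terminal node with these properties, otherwise a new one.
    return terminal_nodes.setdefault((t, i, j), NonPackedNode(t, i, j))

def nodeP(L, w, z):
    # Look up (or create) the non-packed node labeled L spanning w and z,
    # then attach a packed node whose pivot is the right extent of w.
    n = nonpacked_nodes.setdefault((L, w.left, z.right),
                                   NonPackedNode(L, w.left, z.right))
    entry = (w.right, w, z)  # the pivot distinguishes packed nodes under n
    if entry not in n.packed:
        n.packed.append(entry)
    return n
```

Two calls to nodeP with the same slot and extents but different pivots yield the same shared node with two packed children, which is exactly how the ambiguity under the root node of Figure 20 is represented.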
Definition 2. A Graph-Structured Stack (GSS) in GLL parsing is a directed graph where
• nodes are of the form (A, i), where A is a nonterminal and i is an input position; and
• edges are of the form (u, L, w, v), where u and v are GSS nodes, L is a grammar slot, and w is an SPPF node recorded on the edge.

GSS was originally developed by Tomita [35] for GLR parsing to merge different LR stacks. Although GLL parsing uses the same term, there are two main differences between GSS in GLL parsing and in GLR. First, in GLL parsing the GSS represents function calls in recursive-descent parsing, similar to memoization of functions in functional programming, and therefore a GSS node records the input position at which the nonterminal is called. Second, in GLL parsing the GSS allows cycles in the graph, which solve the problem of left recursion in recursive-descent parsing.

The GSS resulting from parsing -a+a using the grammar E ::= - E | E + E | a is shown in Figure 20 (right). As can be seen, there is a cycle on all nodes, as they represent the left-recursive calls to E at different input positions. In case of indirect left recursion, there will be a cycle in the GSS involving multiple nodes.

A.3 GLL Parsing over ATN Grammars

In this section, we define GLL parsing over ATN grammars as a transition relation. In contrast to the imperative style used in [2, 33], we use the declarative rules of Figure 21.

Figure 21. GLL parsing over ATN grammars, as a transition relation (R, U, G, P) ⇒ (R', U', G', P'):

Eps:
  p →ε q    n = nodeP(q, w, nodeT(ε, i, i))
  ──────────────────────────────────────────────
  ({(p, i, u, w)} ∪ R, U, G, P) ⇒ (R ∪ {(q, i, u, n)}, U, G, P)

Term-1:
  p →t q    I[i] = t    n = nodeP(q, w, nodeT(t, i, i+1))
  ──────────────────────────────────────────────
  ({(p, i, u, w)} ∪ R, U, G, P) ⇒ (R ∪ {(q, i+1, u, n)}, U, G, P)

Term-2:
  p →t q    I[i] ≠ t
  ──────────────────────────────────────────────
  ({(p, i, u, w)} ∪ R, U, G, P) ⇒ (R, U, G, P)

Call-1:
  p →A q    v = (A, i) ∈ N(G)
  D = {d | (v, y) ∈ P, d = (q, rext(y), u, nodeP(q, w, y)), d ∉ U}
  ──────────────────────────────────────────────
  ({(p, i, u, w)} ∪ R, U, G, P) ⇒ (R ∪ D, U ∪ D, G ∪ {(v, q, w, u)}, P)

Call-2:
  p →A q    v = (A, i) ∉ N(G)    D = {(s, i, v, $) | s ∈ S(A)}
  ──────────────────────────────────────────────
  ({(p, i, u, w)} ∪ R, U, G, P) ⇒ (R ∪ D, U, G ∪ {(v, q, w, u)}, P)

Ret:
  p ∈ F
  D = {d | (u, q, y, v) ∈ G, d = (q, i, v, nodeP(q, y, w)), d ∉ U}
  ──────────────────────────────────────────────
  ({(p, i, u, w)} ∪ R, U, G, P) ⇒ (R ∪ D, U ∪ D, G, P ∪ {(u, w)})

Such a GLL formulation is concise and easy to extend to support data-dependent grammars. The rules in Figure 21 use notation similar to that in [2, 33].

The unit of work of a GLL parser is a descriptor. A descriptor is of the form (p, i, u, w), where p is an ATN state representing a grammar slot, u is a GSS node, i is an input position, and w is an SPPF (non-packed) node. A GLL parser maintains a set U that holds descriptors created during parsing and is used to eliminate duplicate descriptors. In addition to U, a set R is used to hold pending descriptors that are to be processed. Note that GLL parsing does not impose any order in which the descriptors in R are processed. Figure 21 defines the semantics of GLL parsing over ATN grammars as a transition relation on configurations (R, U, G, P), where G represents the GSS (a set of GSS edges), such that N(G) gives the set of GSS nodes, and P is a set of parsing results associated with GSS nodes, i.e., a set of elements of the form (u, w).

During parsing a descriptor is selected and removed from R, represented as {(p, i, u, w)} ∪ R, and given the rules, a deterministic choice is made based on the next transition in the ATN. The first three rules of Figure 21 are straightforward. An ε transition creates an ε-node (via a call to nodeT) and an intermediate node⁴ (via a call to nodeP), and adds a descriptor for the next grammar slot. The terminal rules (Term-1 and Term-2) try to match terminal t at the current input position, where I is an array representing the input string. If there is a match (Term-1), a terminal node (via nodeT) and an intermediate node (via nodeP) are created, and a descriptor for the next grammar slot is added. If there is no match (Term-2), no descriptor is added.

Call-1 and Call-2 correspond to nonterminal transitions p →A q. Similar to calling a memoized function, a GLL parser first checks if a GSS node (A, i) exists. If such a node exists (Call-1), the parsing results associated with this GSS node are reused. These results are retrieved from P, and for each result, a nonterminal node y, a descriptor d is created (rext returns the right extent of y), and if the same descriptor has not been processed before (d ∉ U), it is added to R. If the GSS node does not exist (Call-2), the call to the nonterminal is made, i.e., for each start state of the nonterminal (s ∈ S(A)), a descriptor is added. Both Call-1 and Call-2 add a new GSS edge to G.

Finally, Ret corresponds to a final grammar slot (a final state in the ATN), in which the parser returns from the current nonterminal call. First, the tuple with the current GSS node and the current SPPF node is added to P. Second, for each outgoing GSS edge of the current GSS node, a descriptor is created and, if the same descriptor has not been processed before (d ∉ U), it is added to R.

⁴ In fact, when the next state is an end state, nodeP creates a nonterminal node instead of an intermediate node. However, this is not essential for the current discussion; therefore, we always refer to the result of nodeP as an intermediate node.