
One Parser to Rule Them All

Ali Afroozeh, Anastasia Izmaylova


Centrum Wiskunde & Informatica, Amsterdam, The Netherlands
{ali.afroozeh, anastasia.izmaylova}@cwi.nl

Abstract

Despite the long history of research in parsing, constructing parsers for real programming languages remains a difficult and painful task. In the last decades, different parser generators emerged to allow the construction of parsers from a BNF-like specification. However, still today, many parsers are handwritten, or are only partly generated, and include various hacks to deal with different peculiarities in programming languages. The main problem is that current declarative syntax definition techniques are based on pure context-free grammars, while many constructs found in programming languages require context information.

In this paper we propose a parsing framework that embraces context information in its core. Our framework is based on data-dependent grammars, which extend context-free grammars with arbitrary computation, variable binding and constraints. We present an implementation of our framework on top of the Generalized LL (GLL) parsing algorithm, and show how common idioms in the syntax of programming languages, such as (1) lexical disambiguation filters, (2) operator precedence, (3) indentation-sensitive rules, and (4) conditional preprocessor directives, can be mapped to data-dependent grammars. We demonstrate the initial experience with our framework by parsing more than 20,000 Java, C#, Haskell, and OCaml source files.

Categories and Subject Descriptors D.3.1 [Programming Languages]: Formal Definitions and Theory—Syntax; D.3.4 [Programming Languages]: Processors—Parsing

Keywords Parsing, data-dependent grammars, GLL, disambiguation, operator precedence, offside rule, preprocessor directives, scannerless parsing, context-aware scanning

Copyright ©ACM, 2015. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in Onward!'15, October 25-30, 2015, Pittsburgh, PA, USA. http://dx.doi.org/10.1145/2814228.2814242

1. Introduction

Parsing is a well-researched topic in computer science, and it is common to hear from fellow researchers in the field of programming languages that parsing is a solved problem. This statement mostly originates from the success of Yacc [18] and its underlying theory, which was developed in the 70s. Since Knuth's seminal paper on LR parsing [25] and DeRemer's work on practical LR parsing (LALR) [6], there is a linear parsing technique that covers most syntactic constructs in programming languages. Yacc, and its various ports to other languages, enabled the generation of efficient parsers from a BNF specification. Still, research papers and tools on parsing in the last four decades show an ongoing effort to develop new parsing techniques.

A central goal of research in parsing has been to enable language engineers (i.e., language designers and tool builders) to declaratively build a parser for a real programming language from an (E)BNF-like specification. Nevertheless, still today, many parsers are hand-written or are only partially generated, and include many hacks to deal with peculiarities in programming languages. The reason is that grammars of programming languages in their simple and readable form are often not deterministic and also often ambiguous. Moreover, many constructs found in programming languages are not context-free, e.g., indentation rules in Haskell. Parser generators based on pure context-free grammars cannot natively deal with such constructs, and require ad-hoc extensions or hacks in the lexer. Therefore, additional means outside the power of context-free grammars are necessary to address these issues.

General parsing algorithms [7, 33, 35] support all context-free grammars, so the language engineer is not limited to a specific deterministic class, and there are known declarative disambiguation constructs to address the problem of ambiguity in general parsing [12, 36, 38]. However, implementing disambiguation constructs is notoriously hard and requires thorough knowledge of the underlying parsing technology. This means that it is costly to declaratively build a parser for a given programming language in the wild if the required disambiguation constructs are not already supported. Perhaps surprisingly, examples of such languages are not only legacy languages, but also modern languages such as Haskell, Python, OCaml and C#.
In this paper we propose a parsing framework that is able to deal with many challenges in parsing existing and new programming languages. We embrace the need for context information at runtime, and base our framework on data-dependent grammars [16]. Data-dependent grammars are an extension of context-free grammars that allow arbitrary computation, variable binding and constraints. These features allow us to simulate hand-written parsers and to implement disambiguation constructs.

To demonstrate the concept of data-dependent grammars we use the IMAP protocol [29]. In network protocol messages it is common to send the length of data before the actual data. In IMAP, these messages are called literals, and are described by the following (simplified) context-free rule:

L8 ::= '~{' Number '}' Octets

Here Octets recognizes a list of octet (any 8-bit) values. An example of L8 is ~{6}aaaaaa. As can be seen, there is no data dependency in this context-free grammar, but the IMAP specification says that the number of Octets is determined by the value parsed by Number. Using data-dependent grammars, we can specify such a data dependency as:

L8 ::= '~{' nm:Number {n=toInt(nm.yield)} '}' Octets(n)
Octets(n) ::= [n > 0] Octets(n - 1) Octet
            | [n == 0] ε

In the data-dependent version, nm provides access to the value parsed by Number. We retrieve the substring of the input parsed by Number via nm.yield, which is converted to an integer using toInt. This integer value is bound to variable n, and is passed to Octets. Octets takes an argument that specifies the number of iterations. Conditions [n > 0] and [n == 0] specify which alternative is selected at each iteration.

It is possible to parse IMAP using a general parser, and then remove the derivations that violate data dependencies post-parse. However, such an approach would be slow. Without enforcing the dependency on the length of Octets during parsing, given the nondeterministic nature of general parsing, all possible lengths of Octets will be tried.
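To make the data dependency concrete, the following is a minimal hand-written sketch in Java (our own illustration, not the generated Iguana parser) of how a recursive-descent parser exploits the value parsed for Number to consume exactly n octets:

// A hand-written sketch of the same data dependency: the value parsed for
// Number decides how many octets follow.
final class ImapLiteralSketch {
    private final String input;
    private int pos = 0;

    ImapLiteralSketch(String input) { this.input = input; }

    // L8 ::= '~{' Number '}' Octets(n)   where n = toInt(nm.yield)
    String parseL8() {
        expect("~{");
        int start = pos;
        while (pos < input.length() && Character.isDigit(input.charAt(pos))) pos++;
        int n = Integer.parseInt(input.substring(start, pos));  // nm.yield converted by toInt
        expect("}");
        // Octets(n) ::= [n > 0] Octets(n - 1) Octet | [n == 0] ε
        String octets = input.substring(pos, pos + n);           // consume exactly n octets
        pos += n;
        return octets;
    }

    private void expect(String s) {
        if (!input.startsWith(s, pos)) throw new RuntimeException("expected " + s + " at " + pos);
        pos += s.length();
    }

    public static void main(String[] args) {
        System.out.println(new ImapLiteralSketch("~{6}aaaaaa").parseL8());  // prints aaaaaa
    }
}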
There are many common grammar and disambiguation idioms that can be desugared into data-dependent grammars. Examples of these idioms are operator precedence, longest match, and the offside rule. Expecting the language engineer to write low-level data-dependent grammars for such cases would be wasteful. Instead, we describe a number of such idioms, provide high-level notation for them, and give their desugaring to data-dependent grammars. For example, using our high-level notation the indentation rules in Haskell can be expressed as follows:

Decls ::= align (offside Decl)*
        | ignore('{' Decl (';' Decl)* '}')

This definition clearly and concisely specifies that either all declarations in the list are aligned and each Decl is offsided with regard to its first token (first alternative), or indentation is ignored inside curly braces (second alternative).

Our vision is a parsing framework that provides the right level of abstraction for both the language engineer, who designs new languages, and the tool builder, who needs a parsing technology as part of her toolset. From the language engineer's perspective, our parsing framework provides an out-of-the-box set of high-level constructs for the most common idioms in the syntax of programming languages. The language engineer can also always express her needs directly using data-dependent grammars. From the tool builder's perspective, our framework provides open, extensible means to define higher-level syntactic notation, without requiring knowledge of the internal workings of a parsing technology.

The contributions of this paper are:

• We provide a unified perspective on many important challenges in parsing real programming languages.
• We present several high-level syntactical constructs, and their mappings to data-dependent grammars.
• We provide an implementation of data-dependent grammars on top of Generalized LL (GLL) parsing [33] that runs over ATN grammars [40]. The implementation is part of the Iguana parsing framework [2] (https://github.com/iguana-parser).
• We demonstrate the initial results of our parsing framework, by parsing 20,363 real source files of Java, C#, Haskell (91% success rate), and excerpts from OCaml.

The rest of this paper is organized as follows. In Section 2 we describe the landscape of parsing programming languages. In Section 3 we present data-dependent grammars, our high-level syntactic notation, and the mapping to data-dependent grammars. Section 4 discusses the extension of GLL parsing with data dependency. In Section 5 we demonstrate the initial results of our parsing framework using grammars of real programming languages. We discuss related work in Section 6. A conclusion and discussion of future work is given in Section 7.

2. The Landscape of Parsing Programming Languages

In this section we discuss well-known features of programming languages that make them hard to parse. These features motivate our design decisions.

2.1 General Parsing for Programming Languages

Grammars of programming languages in their natural form are not deterministic, and are often ambiguous. A well-known example is the if-then-else construct found in many programming languages. This construct, when written in its natural form, is ambiguous (the dangling-else ambiguity), and therefore cannot be deterministic. Some nondeterministic (and ambiguous) syntactic constructs, such as if-then-else, can be rewritten to be deterministic and unambiguous.
However, such grammar rewriting is in general not trivial, the resulting grammar is hard to read, maintain and evolve, and the resulting parse trees are different from the original ones the grammar writer had in mind.

Instead of rewriting a grammar, it is common to use an ambiguous grammar and rely on some implicit behavior of a parsing technology for disambiguation. For example, the dangling-else ambiguity is often resolved using a longest match scheme provided by the underlying parsing technology. Relying on implicit behavior of a parsing technology to achieve determinism can make it quite difficult to reason about the accepted language. Seemingly correct sentences may be rejected by the parser because, at a nondeterministic point, a wrong path was chosen. For example, Yacc is an LALR parser generator, but can accept any context-free grammar by automatically resolving all shift/reduce and reduce/reduce conflicts. Using Yacc, the language engineer should manually check the resolved conflicts in case of unexpected behavior.

A common theme in research in parsing has been to increase the recognition power of deterministic parsing techniques such as LL(k) or LR(k). One of the widely used general parsing techniques for programming languages is the Generalized LR (GLR) algorithm [35]. GLR parsers support all context-free grammars and can produce a parse forest containing all derivation trees in the form of a Shared Packed Parse Forest (SPPF) in cubic time and space [34]. Note that the cubic bound is for the worst-case, highly ambiguous grammars. As GLR is a generalization of LR, a GLR parser runs linearly on LR parts of the grammar, and as the grammars of real programming languages are in most parts near-deterministic, one can expect near-linear performance using GLR for parsing programming languages. GLR parsing has successfully been used in source code analysis and in developing domain-specific languages [12, 22].

General parsing enables the language engineer to use the most natural version of a grammar, but leaves open the problem of ambiguity. In declarative syntax definition [12, 23], it is common to use declarative disambiguation constructs, e.g., for operator precedence or the longest match. As a general parser is able to return all ambiguities in the form of a parse forest, it is possible to apply the disambiguation rules post-parse, removing the undesired derivations from the parse forest. However, such post-parse disambiguation is not practical in cases where the grammar is highly ambiguous. For example, parsing expression grammars without applying operator precedence during parsing is limited to only small inputs. Therefore, it is required to resolve ambiguity while parsing to achieve near-linear performance.

Implementing disambiguation mechanisms that are executed during parsing is difficult. This is because the implementation of such disambiguation mechanisms requires knowledge of the internal workings of a parsing technology. Therefore, the choice of the general parsing technology becomes very important when considering parsing programming languages. For example, GLR parsers operate on LR automata, and have a rather complicated execution model, as a parsing state corresponds to multiple grammar positions.

The Generalized LL (GLL) parsing algorithm [33] is a new generalization of recursive-descent parsing that supports all context-free grammars, including left-recursive ones. GLL parsers produce a parse forest in cubic time and space in the worst case, and are linear on LL parts of the grammar. GLL parsers are attractive because they have the close relationship with the grammar that recursive-descent parsers have. From the end user's perspective, GLL parsers can produce better error messages, and can be debugged in a programming language IDE.

To deal with left-recursive rules and to keep the cubic bound, a GLL parser uses a graph-structured stack (GSS) to handle multiple call stacks. While the execution model of a GLL parser is close to recursive-descent parsing, the underlying machinery is much more complicated, and an in-depth knowledge of GLL is still required to implement disambiguation constructs. In this paper, we propose a parser-independent framework for parsing programming languages based on data-dependent grammars. We use GLL parsing as the basis for our data-dependent parsing framework, as it allows an intuitive way to implement components of data-dependent grammars, such as environment threading, and enables an implementation that is very close to the stack-evaluation based semantics of data-dependent grammars [16].

2.2 On the Interaction between Lexer and Parser

Conventional parsing techniques use a separate lexing phase before parsing to transform a stream of characters into a stream of tokens. In particular, whitespace and comments are discarded by the lexer to reduce the amount of lookahead needed in the parsing phase, and to enable deterministic parsing.

The main problem with a separate lexing phase is that, without access to the parsing context, i.e., the applicable grammar rules, the lexer cannot unambiguously determine the type of some tokens. An example is >>, which can either be parsed as a right-shift operator or as two closing angle brackets of a generic type, e.g., List<List<String>> in Java. Some handwritten parsers deal with this issue by rewriting the token stream. For example, when the javac parser reads a >> token and is in a parsing state that expects only one >, e.g., when matching the closing angle bracket of a generic type, it only consumes the first > and puts the second one back to prevent a parse error when matching the next angle bracket.
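As an illustration of this token-stream rewriting (our own sketch, not javac's actual code), a parser driving a simplified token stream could split >> on demand:

// Hypothetical sketch of the token-splitting trick described above: when the
// parser expects a single '>' but the lexer produced '>>', split the token and
// push the remaining '>' back, instead of failing.
import java.util.ArrayDeque;
import java.util.Deque;

final class AngleBracketSplitter {
    private final Deque<String> tokens;   // a simplified token stream
    AngleBracketSplitter(Deque<String> tokens) { this.tokens = tokens; }

    // Called at positions where the grammar expects exactly one '>'.
    void acceptClosingAngle() {
        String t = tokens.poll();
        if (">".equals(t)) return;
        if (">>".equals(t)) { tokens.push(">"); return; }   // consume one '>', put one back
        throw new RuntimeException("expected '>' but found " + t);
    }

    public static void main(String[] args) {
        Deque<String> ts = new ArrayDeque<>(java.util.List.of(">>", ";"));
        AngleBracketSplitter p = new AngleBracketSplitter(ts);
        p.acceptClosingAngle();   // List<List<String>> : first '>'
        p.acceptClosingAngle();   // second '>'
        System.out.println(ts);   // [;]
    }
}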
To resolve the problems of a separate lexing phase, we need to expose the parsing context to the lexer. To achieve this, the separate lexing phase is abandoned, and lexing is effectively integrated into the parsing phase. We call this model single-phase parsing. There are two options to achieve single-phase parsing. The first option is scannerless parsing [32, 38], where lexical definitions are treated as context-free rules.
In scannerless parsing, grammars are defined down to the level of characters. The second option is context-aware scanning [37], where the parser calls the lexer on demand. At each parsing state, the lexer is called with the expected set of terminals at that state.

In almost all modern programming languages longest match (maximal munch) is applied, and keywords are excluded from being recognized as identifiers. These disambiguation rules are conventionally embedded in the lexer. In single-phase parsing—scannerless or context-aware—longest match and keyword exclusion have to be applied during parsing, by using lexical disambiguation filters such as follow restrictions [32, 36]. These disambiguation filters have parser-specific implementations [2, 38]. In Section 3 we show how these filters can be mapped to data-dependent grammars. Also note that although a context-aware scanner employs longest match, for example by implementing the Kleene star (*) as a greedy operator, in some cases we still need to use explicit disambiguation filters, see Section 3.2.

2.3 Operator Precedence

Expressions are an integral part of virtually every programming language. In reference manuals of programming languages it is common to specify the semantics of expressions using the priority and associativity of operators. However, the implementation of expression grammars can considerably deviate from such a precedence specification.

It is possible to encode operator precedence by rewriting the grammar: a new nonterminal is created for each precedence level. The rewriting is not trivial for real programming languages, and the resulting grammar becomes large. This rewriting is particularly problematic in parsing techniques that do not support left recursion. The left-recursion removal transformation disfigures the grammar and adds extra complexity in transforming the trees to the intended ones. Figure 1 shows three versions of the same expression grammar.

E ::= '-' E        E ::= E '+' T      E ::= T E1
    | E '*' E          | T            E1 ::= '+' T E1 | ε
    | E '+' E      T ::= T '*' F      T ::= F T1
    | 'a'              | F            T1 ::= '*' F T1 | ε
                   F ::= '-' F        F ::= '-' F
                       | 'a'              | 'a'

Figure 1. Three grammars that accept the same language: the natural, ambiguous grammar (left), the grammar with precedence encoding (middle), and the grammar after left-recursion removal (right).

In the 70s, Aho et al. [4] presented a technique in which a parser is constructed from an ambiguous expression grammar accompanied by a set of precedence rules. This work can be seen as the starting point for declarative disambiguation using operator precedence rules. Aho et al.'s approach is implemented by modifying LALR parse tables to resolve shift/reduce conflicts based on the operator precedence. However, the semantics of operator precedence in this approach is bound to the internal workings of LR parsing. There have been other solutions to build parsers from declarative operator precedence, which we discuss in Section 6. In Section 3.4 we provide a mapping from operator precedence rules to data-dependent grammars.

2.4 Offside Rule

In most programming languages, indentation of code blocks does not play a role in the syntactic structure. Rather, explicit delimiters, such as begin and end or { and }, are used to specify blocks of statements. Landin introduced the offside rule [26], which serves as a basis for indentation-sensitive languages. The offside rule says that all the tokens of an expression should be indented to the right of the first token. Haskell and Python are two examples of popular programming languages that use a variation of the offside rule.

Figure 2 shows two examples of the offside rule in Haskell. The keywords do, let, of, and where signal the start of a block where the starting tokens of the statements should be aligned, and each statement should be offsided with regard to its first token. In Figure 2 (left), case has two alternatives which are aligned, and the second alternative, which spans several lines, is offsided with regard to its first token, i.e., _. Figure 2 (right) shows two examples that look the same, but the indentation of the last part, + 4, is different. In the top declaration + 4 belongs to the last alternative, but in the bottom declaration, + 4 belongs to the expression on the right-hand side of =.

f x = case x of          g x = case x of
        0 -> 1                   0 -> 1
        _ -> do                  _ -> x + 2
          let y = 2                     + 4
          y + z
  where z = 3            g x = case x of
                                 0 -> 1
                                 _ -> x + 2
                           + 4

Figure 2. Examples of indentation rules in Haskell.

Indentation sensitivity in programming languages cannot be expressed by pure context-free grammars, and has often been implemented by hacks in the lexer. For example, in Haskell and Python, indentation is dealt with in the lexing phase, and the context-free part is written as if no indentation sensitivity exists. Both GHC and CPython, the popular implementations of Haskell and Python, use LALR parser generators. In Python, the lexer maintains a stack and emits INDENT and DEDENT tokens when indentation changes. In Haskell, the lexer translates indentation information into curly braces and semicolons based on the rules specified by the L function [28].
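As a concrete illustration of the Python-style approach just mentioned, the following is a small sketch of ours (simplified: spaces only, no handling of blank lines, tabs or brackets) of a lexer that keeps a stack of indentation levels and emits INDENT and DEDENT pseudo-tokens:

// A simplified sketch of a lexer that tracks indentation with a stack and emits
// INDENT/DEDENT pseudo-tokens when the indentation level changes.
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class IndentLexerSketch {
    static List<String> indentTokens(String source) {
        Deque<Integer> levels = new ArrayDeque<>();
        levels.push(0);
        List<String> tokens = new ArrayList<>();
        for (String line : source.split("\n")) {
            int indent = 0;
            while (indent < line.length() && line.charAt(indent) == ' ') indent++;
            if (indent > levels.peek()) { levels.push(indent); tokens.add("INDENT"); }
            while (indent < levels.peek()) { levels.pop(); tokens.add("DEDENT"); }
            tokens.add("LINE:" + line.trim());
        }
        while (levels.peek() > 0) { levels.pop(); tokens.add("DEDENT"); }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(indentTokens("if x:\n  y = 1\nz = 2"));
        // [LINE:if x:, INDENT, LINE:y = 1, DEDENT, LINE:z = 2]
    }
}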
In Section 3.5 we show how data-dependent grammars can be used for single-phase parsing of indentation-sensitive programming languages in a declarative way. As data-dependent grammars are rather low-level for such solutions, we introduce three high-level constructs, align, offside, and ignore, which are desugared to data-dependent grammars.
2.5 Conditional Directives

Many programming languages allow compiler pragmas that specify how the compiler (or the interpreter) processes parts of the input. The C family of programming languages, i.e., C, C++ and C#, allows preprocessor directives such as #if and #define. GHC also allows various compiler pragmas [28, §12.3]. For example, it is possible to enable C preprocessor directives in Haskell using {-# LANGUAGE CPP #-}.

Preprocessor directives pose considerable difficulty in parsing programming languages. The main reason is that they are not part of the grammar of a language, but can appear anywhere in the source code. In this regard, preprocessor directives are similar to whitespace and comments. However, conditional directives may affect the syntactic structure of a program, and cannot simply be ignored as a special kind of whitespace. This is especially important if we consider single-phase parsing, where no lexing/preprocessing is available. We need a mechanism to allow the parser to switch between the preprocessor mode and the main grammar, and to evaluate conditional directives to select the right branch.

Figure 3 (left) shows a C# example where ignoring the conditional directive will lead to a parse error, as the closing bracket of the test method is in the conditional directive, and one of the branches should be included in the input. The example in Figure 3 (right) shows another aspect of directives in C#. If X is true, "/* #else /* */ class Q {}" will be considered as part of the source code. If X is false, only the else-part will be considered: "/* */ class Q {}". Note that when X is false, the if-part does not have to be syntactically correct, in this case an unclosed multi-line comment.

void test()                              #if X
{                                        /*
#if Debug                                #else
  System.Console.WriteLine("Debug")      /* */ class Q { }
}                                        #endif
#else
}
#endif

Figure 3. Problematic cases of using C# directives [30].

Among the family of C languages we selected C#, as parsing C# is more manageable. The problematic part of parsing C with directives is textual macros. Without a preprocessor to expand macros before parsing, we need to deal with macros at runtime. Parsing C without a preprocessor is future work. In C#, #define does not define a macro; rather, it only sets a boolean variable. It should also be noted that C# supports multi-line strings, where directives should not be processed. Figure 4 shows a C# example that uses conditional directives in a multi-line string. In single-phase parsing, however, multi-line strings are not a problem, as we effectively parse each terminal in the context where it appears.

static void Main() {
  System.Console.WriteLine(@"hello,
#if Debug
world
#else
Nebraska
#endif
");
}

Figure 4. C# multi-line string containing directives [30].

2.6 Miscellaneous Features

There are many other peculiarities in programming languages and data formats that cannot be expressed by context-free grammars. There has been considerable effort to build declarative parsers for data formats, e.g., PADS [9], and one of the main motivations for data-dependent grammars [16] is indeed to enable parsing data formats.

Examples of languages that require data-dependent parsing are data protocols, such as IMAP and HTTP, and tag-based languages such as XML. In programming languages, data-dependent grammars can be used to implement some language-specific disambiguation mechanisms, for example to maintain a table of type definitions in C to allow resolving the infamous typedef ambiguity, e.g., in x * y, which can be interpreted either as a variable of pointer type x or as a multiplication, depending on the type of x. We give an example of parsing XML and resolving the typedef ambiguity in C in Section 3.7.

3. Parsing Programming Languages with Data-dependent Grammars

In this section we describe data-dependent grammars [16], discuss our single-phase parsing strategy, and demonstrate how various high-level, declarative syntax definition constructs can be desugared into data-dependent grammars.

3.1 Data-dependent Grammars

Data-dependent grammars are defined as an extension of context-free grammars (CFGs), where a CFG is, as usual, a tuple (N, T, P, S) where

• N is a finite set of nonterminals;
• T is a finite set of terminals;
• P is a finite set of rules. A rule (production) is written as A ::= α, where A (head) is a nonterminal, and α (body) is a string in (N ∪ T)*;
• S ∈ N is the start symbol of the grammar.

We use A, B, C to range over nonterminals, and a, b, c to range over terminals. We use α, β, γ for a possibly empty sequence of terminals and nonterminals, and ε represents the empty sequence. It is common to group rules with the same head and write them as A ::= α1 | α2 | ... | αn. In this representation, each αi is an alternative of A.
Data-dependent grammars introduce parametrized nonterminals, arbitrary computation via an expression language, constraints, and variable binding. Here, we assume that the expression language e is a simple functional programming language with immutable values and no side effects. In a data-dependent grammar a rule is of the form A(p) ::= α, where p is a formal parameter of A. Here, for simplicity of presentation and without loss of generality, we assume that a nonterminal can have at most one parameter. The body of a rule, α, can now contain the following additional symbols:

• x = l : A(e) is a labeled call to A with argument e, label l, and variable x bound to the value returned by A(e);
• l : a is a labeled terminal a with label l;
• [e] is a constraint;
• {x = e} is a variable binding;
• {e} is a return expression (only as the last symbol in α);
• e ? α : β is a conditional selection.

The symbols above are presented in their general forms. For example, labels, variables to hold return values, and return expressions are optional.

Our data-dependent grammars are very similar to the ones introduced in [16], with four additions. First, terminals and nonterminals can be labeled, and labels refer to properties associated with the result of parsing a terminal or nonterminal. These properties are the start input index (or left extent), the end input index (or right extent), and the parsed substring. Properties can be accessed using dot notation, e.g., for labeled nonterminal b : B, b.l gives the left extent, b.r the right extent, and b.yield the substring.

Second, nonterminals can return arbitrary values (return expressions), which can be bound to variables. In several cases we found this feature very practical, as we could express data dependency without changing the shape of the original specification grammar, specifically in cases where a global table needs to be maintained along a parse (C# conditional directives, discussed in Section 3.6, and C typedef declarations in Section 3.7), or where semantic information needs to be propagated upwards from a complicated syntactic structure (Declarator of the C grammar in Section 3.7). In some cases a data-dependent grammar that uses return values can be rewritten to one without return values. However, whether return values in general enlarge the class of languages expressible with the original data-dependent grammars is an open question for future work.

Third, we support regular expression operators (EBNF constructs) *, +, and ?, by desugaring them to data-dependent rules as follows: A* ::= A+ | ε; A+ ::= A+ A | A; and A? ::= A | ε. In the data-dependent setting, this translation must also account for variable binding. For example, if the symbol ([e] A)* appears in a rule, and x is a free variable in e, captured from the scope of the rule, our translation lifts this variable, introducing a parameter x to the new nonterminal. In addition, EBNF constructs introduce new scopes: variables declared inside an EBNF construct, e.g., (l : A [e])*, are not visible outside, e.g., in the rule that uses it.

Finally, we also introduce a conditional selection symbol e ? α : β, which selects α if e evaluates to true, and β otherwise, i.e., it introduces deterministic choice. Similar to EBNF constructs, we implement conditional selection by desugaring it into a data-dependent grammar. For example, A ::= α e ? X : Y β is translated to A ::= α C(e) β, where C(b) ::= [b] X | [!b] Y. We illustrate the use of conditional selection when discussing C# directives in Section 3.6.

3.2 Single-phase Parsing Strategy

We implement our data-dependent grammars on top of the generalized LL (GLL) parsing algorithm [33]. As general parsers can deal with any context-free grammar, lexical definitions can be specified down to the level of characters. For example, comment in the C# specification [30] is defined as:

Comment ::= SingleLineComment | DelimitedComment
SingleLineComment ::= "//" InputCharacter*
InputCharacter ::= ![\r \n]
DelimitedComment ::= "/*" DelimitedCommentSect* [*]+ "/"
DelimitedCommentSect ::= "/" [*]* NotSlashOrAsterisk
NotSlashOrAsterisk = ![/ *]

Such character-level grammars, however, lead to very large parse forests. These parse forests reflect the full structure of lexical definitions, which is not needed in most cases. We provide the option to use an on-demand context-aware scanner, where terminals are defined using regular expressions. For example, Comment in C# can be compiled to a regular expression. In cases where the structure is needed, or it is not possible to use a regular expression, e.g., recursive definitions of nested comments, the user can use character-level grammars.

Our support for context-aware scanning borrows many ideas from the original work by Van Wyk and Schwerdfeger [37], but because of the top-down nature of GLL parsing, there are some differences. The original context-aware scanning approach [37] is based on LR parsing, and as each LR state corresponds to multiple grammar rules, there may be several terminals that are valid at a state. The set of valid terminals in a parsing state is called the valid lookahead set [37]. In GLL parsing, in contrast, the parser is at a single grammar position at each time, i.e., either before a nonterminal or before a terminal in a single grammar rule. Therefore, in GLL parsing, the valid lookahead set of a terminal grammar position contains only one element, which allows us to directly call the matcher of the regular expression of that terminal.
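The on-demand matching described above, i.e., running exactly one terminal's regular expression at the current input position, can be sketched as follows (an illustration of ours using java.util.regex, not Iguana's actual API):

// Sketch of matching a single terminal's regular expression at the current
// input position; returns the end index of the match, or -1 if it does not match.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class OnDemandScannerSketch {
    static int match(Pattern terminal, String input, int position) {
        Matcher m = terminal.matcher(input);
        m.region(position, input.length());
        return m.lookingAt() ? m.end() : -1;   // anchored at 'position'
    }

    public static void main(String[] args) {
        Pattern comment = Pattern.compile("//[^\r\n]*");   // cf. SingleLineComment above
        System.out.println(match(comment, "x = 1; // trailing", 7));   // 18
    }
}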
We use our simple context-aware scanning model for better performance, see Section 5.1. The implementation of the context-aware scanner in [37] is more sophisticated. The scanner is composed of all terminal definitions, as a composite DFA. This enables a longest match scheme across terminals in the same context, for example in programming languages where one terminal is a prefix of another, e.g., 'fun' and 'function' in OCaml. To enforce longest match across such terminals we use follow/precede restrictions, in this case a follow restriction on 'fun' or a precede restriction on identifiers. Moreover, keyword reservation in [37] is done by giving priority to keywords at matching states of the composite DFA. In our model, keyword exclusion should be explicitly applied in the grammar rules using an exclude disambiguation filter. We explain follow/precede restrictions and keyword reservation in Section 3.3.

In single-phase parsing, layout (whitespace and comments) is treated the same way as other lexical definitions. Because layout is almost always needed in parsing programming languages, we support automatic layout insertion into the rules. There are two approaches to layout insertion: a layout nonterminal can be inserted exactly before or after each terminal node [20, 37]. Another way is to insert layout between the symbols in a rule, like in SDF [12]. We use SDF-style layout insertion: if X ::= x1 x2 ... xn is a rule, and L is a nonterminal defining layout, after layout insertion the rule becomes X ::= x1 L x2 L ... L xn. A benefit of SDF-style layout insertion is that no symbol definition accidentally ends or starts with layout, provided that the layout is defined greedily (see Section 3.3). This is helpful when defining the offside rule (see Section 3.5).

3.3 Lexical Disambiguation Filters

Common lexical disambiguation filters [36], such as follow restrictions, precede restrictions and keyword exclusion, can be mapped to data-dependent grammars without further extensions to the parser generator or parsing algorithm. These disambiguation filters are common in scannerless parsing [32] and have been implemented for various generalized parsers [2, 38].

A follow restriction (!>>) specifies which characters cannot immediately appear after a symbol in a rule. This restriction is used to locally define longest match (as opposed to a global longest match in the lexer). For example, to enforce longest match on identifiers we write Id ::= [A-Za-z]+ !>> [A-Za-z]. A precede restriction (!<<) is similar to a follow restriction, but specifies the characters that cannot immediately precede a symbol in a rule. Precede restrictions can be used to implement longest match on keywords. For example, [A-Za-z] !<< Id disallows an identifier from starting immediately after a letter. This disallows, for example, recognizing intx as the keyword 'int' followed by the identifier x. Finally, exclusion (\) is usually used to implement keyword reservation. For example, Id \ 'int' excludes the keyword int from being recognized as Id.

A ::= α B !>> c β      A ::= α b:B [input.at(b.r) != c] β
A ::= α c !<< B β      A ::= α b:B [input.at(b.l-1) != c] β
A ::= α B \ s β        A ::= α b:B [input.sub(b.l,b.r) != s] β

Figure 5. Mapping of lexical disambiguation filters.

Figure 5 shows the mapping from character-level disambiguation filters to data-dependent grammars. The mapping is straightforward: each restriction is translated into a condition that operates on the input. A note should be made regarding the condition implementing precede restrictions. This condition only depends on the left extent, b.l, which permits its application before parsing B. We consider this optimization in the implementation of our parsing framework, permitting the application of such conditions before parsing labeled nonterminals or terminals.
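The input operations used in the conditions of Figure 5 (input.at and input.sub) and the restriction checks built from them can be sketched as follows (class and method names are ours, not Iguana's):

// Illustrative sketch of the input operations and the follow/precede/exclude
// checks of Figure 5, evaluated against left and right extents.
final class InputSketch {
    private final String chars;
    InputSketch(String chars) { this.chars = chars; }

    int at(int i) { return (i < 0 || i >= chars.length()) ? -1 : chars.charAt(i); }  // -1 = out of bounds
    String sub(int left, int right) { return chars.substring(left, right); }

    // B !>> c : the character right after B (right extent r) must not be c
    boolean followRestriction(int r, char c) { return at(r) != c; }
    // c !<< B : the character right before B (left extent l) must not be c
    boolean precedeRestriction(int l, char c) { return at(l - 1) != c; }
    // B \ s : the substring matched by B must not be the keyword s
    boolean exclude(int l, int r, String s) { return !sub(l, r).equals(s); }

    public static void main(String[] args) {
        InputSketch input = new InputSketch("intx = 1");
        System.out.println(input.followRestriction(3, 'x'));   // false: 'int' is followed by 'x'
        System.out.println(input.exclude(0, 4, "int"));        // true:  'intx' is not the keyword 'int'
    }
}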
The restrictions of Figure 5 are just examples and can be extended in many ways. For example, instead of defining a restriction using a single character, we can use regular expressions or character classes. One can also define similar restrictions for related disambiguation purposes. For example, consider the cast expression in C#:

cast-exp ::= '(' type ')' unary-exp

An expression such as (x)-y is ambiguous, and can be interpreted either as a type cast of -y to the type x, or as a subtraction of y from (x). The C# language specification states that this ambiguity is resolved during parsing based on the character that comes after the closing parenthesis: if the character following the closing parenthesis is ~, '!', '(', an identifier, a literal or a keyword, the expression should be interpreted as a cast. We can implement this rule as follows:

cast-exp ::= '(' type ')' >>> [~!(A-Za-z0-9] unary-exp

The >>> notation specifies that the next character after the closing parenthesis should be an element of the specified character class. The implementation of >>> is similar to that of !>>, with an additional aspect: it adds the condition on the automatically inserted layout nonterminal after ')' instead.

These examples show how more syntactic sugar can be added to the existing framework for various common lexical disambiguation tasks in programming languages, without changes to the underlying parsing technology.

3.4 Operator Precedence and Associativity

Expression grammars in their natural form are often ambiguous. Consider the expression grammar in Figure 6 (left). For this grammar, the input string a+a*a is ambiguous, with two derivation trees that correspond to the following groupings: (a+(a*a)) and ((a+a)*a). Given that * normally has higher precedence than +, the first derivation tree is desirable. We use >, left, and right to define priority and left- and right-associativity, respectively [3]. Figure 6 (right) shows the disambiguated version of this grammar obtained by specifying > and left, where - has the highest precedence, and * and + are left-associative.

E ::= '-' E                          E ::= '-' E
    | E '*' E                            > E '*' E   left
    | E '+' E                            > E '+' E   left
    | 'if' E 'then' E 'else' E           > 'if' E 'then' E 'else' E
    | a                                  | a

Figure 6. An ambiguous expression grammar (left), and the same grammar disambiguated with > and left (right).

Ambiguity in expression grammars is caused by derivations from the left- or right-recursive ends in a rule, i.e., E ::= αE and E ::= Eβ. We use >, left, and right to specify which derivations from the left- and right-recursive ends are not valid with respect to operator precedence. For example, E ::= '-' E > E '*' E specifies that E in the '-'-rule (parent) should not derive the '*'-rule (child).
The > construct only restricts the right-recursive end of a parent rule when the child rule is left-recursive, and vice versa. For example, in Figure 6 (right) the right E in the '+'-rule is not restricted, because the 'if'-rule is not left-recursive. This is to avoid parse errors on inputs that are not ambiguous, e.g., a + if a then a else a. Note that 'if' E 'then' E 'else' in the 'if'-rule acts as a unary operator. In addition, the > operator is transitive for all the alternatives of an expression nonterminal. Finally, left and right only affect binary recursive rules and only at the left- and right-recursive ends.

Although > is defined as a relationship between a parent rule and a child rule, its application may need to be arbitrarily deep in a derivation tree. For example, consider the input string a * if a then a else a + a for the grammar in Figure 6 (right). This sentence is ambiguous, with two derivation trees that correspond to the following groupings:

(a * (if a then a else a)) + a
a * (if a then a else (a + a))

The first grouping is not valid, as there 'if' binds stronger than '+', but we defined '+' to have higher priority than 'if'. This example shows that restricting derivations only at one level cannot disambiguate such cases. A correct implementation of > thus also restricts the derivation of the 'if'-rule from the right-recursive end of the '*'-rule if the '*'-rule is derived from the left-recursive end of the '+'-rule.

We now show how to implement an operator precedence disambiguation scheme using data-dependent grammars. We first demonstrate the basic translation scheme using binary operators only, and then discuss the translation of the example in Figure 6. Figure 7 (left) shows a simple example of an expression grammar that defines two left-associative binary operators * and +, where * is of higher precedence than +. Figure 7 (right) shows the result of the translation into the data-dependent counterpart. The basic idea behind the translation is to assign a number, a precedence level, to each left- and/or right-recursive rule of nonterminal E, to parameterize E with a precedence level, and, based on the precedence level passed to E, to exclude alternatives that will lead to derivation trees that violate the operator precedence.

E ::= E '*' E   left        E(p) ::= [2 >= p] E(2) '*' E(3)   //2
    > E '+' E   left               | [1 >= p] E(0) '+' E(2)   //1
    | '(' E ')'                    | '(' E(0) ')'
    | a                            | a

Figure 7. An expression grammar with > and left (left), and its translation to data-dependent grammars (right).

In Figure 7 (right) each left- and right-recursive rule in the grammar gets a precedence level (shown in comments), which is the reverse of the alternative number in the definition of E. The precedence counter starts from 1 and increments for each encountered > in the definition. The number 0 is reserved for the unrestricted use of E, illustrated using the round bracket rule. Nonterminal E gets parameter p to pass the precedence level, and for each left- and right-recursive rule, a predicate is added at the beginning of the rule to exclude rules by comparing the precedence level of the rule with the precedence level passed to the parent E. Finally, for each use of E in a rule, an argument is passed.

In the '*'-rule, its precedence level (2) is passed to the left E, and its precedence level plus one (3) is passed to the right E. This allows excluding the rules of lower precedence from the left E, and excluding the rules of lower precedence and the '*'-rule itself from the right E. Excluding the '*'-rule itself allows only the left-associative derivations, e.g., (a*a)*a, as specified by left. In the '+'-rule, its precedence level plus one (2) is passed to the right E, excluding the '+'-rule. The value 0 is passed to the left E, permitting any rule. Note that passing 0 instead of 1 to the left E of the '+'-rule achieves the same effect but enables better sharing of calls to E, as the sharing of calls (using the GSS) is done based on the name of the nonterminal and the list of arguments. In the round bracket rule, 0 is passed to E as the use of E is neither left- nor right-recursive, hence the precedence does not apply.
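The translation in Figure 7 (right) can also be read operationally. The following is a sketch of ours, in the spirit of precedence climbing, of a recursive-descent parser for the grammar of Figure 7 that threads the precedence level p as an argument and builds a fully parenthesised string instead of a tree:

// An operational reading of Figure 7 (right): the precedence level p is passed
// down, and an alternative is taken only if its level satisfies [level >= p].
final class PrecedenceSketch {
    private final String in;
    private int pos = 0;
    PrecedenceSketch(String in) { this.in = in; }

    // E(p) for  E ::= E '*' E left  >  E '+' E left  |  '(' E ')'  |  'a'
    String E(int p) {
        String left = primary();
        while (pos < in.length()) {
            char op = in.charAt(pos);
            int level = (op == '*') ? 2 : (op == '+') ? 1 : 0;
            if (level == 0 || level < p) break;   // [level >= p] fails: alternative excluded
            pos++;
            String right = E(level + 1);          // level + 1 on the right gives left associativity
            left = "(" + left + op + right + ")";
        }
        return left;
    }

    private String primary() {
        if (in.charAt(pos) == '(') { pos++; String e = E(0); pos++; return e; }  // '(' E(0) ')'
        pos++; return "a";                                                       // 'a'
    }

    public static void main(String[] args) {
        System.out.println(new PrecedenceSketch("a+a*a").E(0));   // (a+(a*a))
        System.out.println(new PrecedenceSketch("a*a*a").E(0));   // ((a*a)*a)
    }
}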
Now we discuss the translation of the example shown in Figure 6, which contains both binary and unary operators. For this we need to distinguish between the rules that should be excluded from the left and from the right E. This is achieved as follows. First, E gets two parameters, l and r (Figure 8), to distinguish between the precedence level passed from the left and from the right, respectively. Second, a separate condition on l is added to a rule when the rule can be excluded from the right E (i.e., rules for binary operators and unary postfix operators). A separate condition on r is added to a rule when the rule can be excluded from the left E (i.e., rules for binary operators and unary prefix operators). Third, l- and r-arguments are determined for the left and right E's as follows. An l-argument to the left E and an r-argument to the right E are determined as in the example of Figure 7. For example, E(3,_) '*' E(_,4), where 3 is the precedence level of the '*'-rule, and 4 is the precedence level plus one. Note that r=4 does not exclude the unary operators of E. Now, an l-argument to the right E's is propagated from the parent E.

E(l,r) ::= [4 >= l] '-' E(l,4)                                 //4
         | [3 >= r, 3 >= l] E(3,3) '*' E(l,4)                  //3
         | [2 >= r, 2 >= l] E(2,2) '+' E(l,3)                  //2
         | [1 >= l] 'if' E(0,0) 'then' E(0,0) 'else' E(0,0)    //1
         | a

Figure 8. Operator precedence with data-dependent grammars (binary and unary operators).
This effectively excludes a unary prefix rule from the right E's when the parent E is the left E of a rule of higher precedence than the unary operator. Finally, given that there are no unary postfix operators, an r-argument to the left E's is not propagated from the parent E and can be the same as the respective l-argument.

We have also extended this approach to grammars that allow rules of the same precedence and/or associativity groups. For example, binary + and - operators have the same precedence, but are left-associative with respect to each other.

Our translation of operator precedence to data-dependent grammars resembles the precedence climbing technique [5, 31]. In contrast to precedence climbing, which requires a non-left-recursive grammar, our approach works in the presence of both left- and right-recursive rules.

3.5 Indentation-sensitive Constructs

In this section we show how the offside rule can be translated into data-dependent grammars. We use Haskell as the running example, but our approach is also applicable to other programming languages that implement the offside rule.

In Haskell, one can write a where clause consisting of a block of declarations, where the structure of the block is defined by using either explicit delimiters or indentation (column number). For example, the structure of the following blocks, one written with explicit delimiters, such as curly braces and semicolons (left), and the other written using indentation (right), is the same:

{ x = 1 * 2 + 3; y = x + 4 }        x = 1 * 2
                                          + 3
                                    y = x + 4

Figure 9 shows a simplified excerpt of the Haskell grammar, defined using our parsing framework. The first alternative explicitly enforces indentation constraints on a declaration block. First, it requires that all declarations of a block are aligned (align) with respect to each other, i.e., each declaration starts with the same indentation. Second, it requires that the offside rule applies to each declaration, i.e., all non-whitespace tokens of a declaration are strictly indented to the right of its first non-whitespace token. In contrast, the second alternative of Decls enforces the use of curly braces and semicolons, and explicitly ignores (ignore) indentation constraints even when imposed by an outer scope.

Decls ::= align (offside Decl)*
        | ignore('{' Decl (';' Decl)* '}')
Decl ::= FunLHS RHS
RHS ::= '=' Exp 'where' Decls

Figure 9. Simplified version of Haskell's Decls.

In our meta-notation, align only affects regular definitions (EBNF constructs) such as lists and sequences, offside affects nonterminals, and ignore applies to a sequence of symbols. The translation of these high-level constructs into data-dependent grammars is illustrated in Figures 10 and 11.

The basic idea of translating align is to use the start index of a declaration list, and to constrain the start index of each declaration in the list by an equality check on indentation at the respective indices. Figure 10 shows the result after first desugaring align and then translating EBNF constructs (Section 3.1). Desugaring align alone results in:

Decls ::= a0:(offside a1:Decl [col(a1.l) == col(a0.l)])*

Labels a0 and a1 are introduced to refer to the start index of a declaration list, a0.l, and to the start index of each declaration in the list, a1.l, respectively, and the constraint checks whether the respective column numbers (given tabs of 8 characters) are equal. As in the case of precede restrictions in Section 3.3, this constraint only depends on the start indices and can be applied before parsing Decl. The EBNF translation introduces nonterminals for each EBNF construct, where Star1 and Plus1 also get parameter v, as the use of a0 has to be lifted during the translation.

Decls ::= a0:Star1(a0.l)
        | ignore('{' Decl Star2 '}')
Decl ::= FunLHS RHS
RHS ::= '=' Exp 'where' Decls

Star1(v) ::= Plus1(v) | ε
Plus1(v) ::= offside a1:Decl [col(a1.l) == col(v)]
           | Plus1(v) offside a1:Decl [col(a1.l) == col(v)]
Star2 ::= Plus2 | ε
Plus2 ::= Plus2 Seq2 | Seq2
Seq2 ::= ';' Decl

Figure 10. Desugaring of align.
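The col(.) function used in these constraints maps an input index to a column number. A small sketch of ours, counting columns from 1 and using tab stops of 8 characters as assumed in the text (Iguana's actual implementation may differ):

// Column of an input index: reset at newlines, advance tabs to the next
// multiple-of-8 stop, count every other character as one column.
final class ColumnSketch {
    static int col(String input, int index) {
        int column = 1;
        for (int i = 0; i < index; i++) {
            char c = input.charAt(i);
            if (c == '\n') column = 1;
            else if (c == '\t') column += 8 - ((column - 1) % 8);
            else column++;
        }
        return column;
    }

    public static void main(String[] args) {
        String src = "x = 1\n  y = 2";
        System.out.println(col(src, 0));   // 1: 'x'
        System.out.println(col(src, 8));   // 3: 'y' on the second line
    }
}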
Decls(i,fst) ::= a0:Star1(a0.l,i,fst)
               | '{' Decl(-1,0) Star2 '}'
Decl(i,fst) ::= FunLHS(i,fst) RHS(i,0)
RHS(i,fst) ::= o0:'=' [f(i,fst,o0.l)] Exp(i,0)
               o1:'where' [f(i,0,o1.l)] Decls(i,0)

Star1(v,i,fst) ::= Plus1(v,i,fst) | ε
Plus1(v,i,fst)
  ::= Plus1(v,i,fst) a1:Decl(a1.l,1)
      [col(a1.l) == col(v), f(i,0,a1.l)]
    | a1:Decl(a1.l,1) [col(a1.l) == col(v), f(i,fst,a1.l)]
Star2 ::= Plus2 | ε
Plus2 ::= Plus2 Seq2 | Seq2
Seq2 ::= ';' Decl(-1,0)

f(i,fst,l) = i == -1 || fst == 1 || col(l) > col(i);

Figure 11. Desugaring of offside and ignore.

Figure 11 shows the result of desugaring offside and ignore from Figure 10. The basic idea is to pass down Decl's start index and to constrain the indentation of any non-whitespace terminal that can appear under the Decl node, except for the leftmost one, to be greater than the indentation of Decl's start index. Two parameters, i and fst, are introduced to Decl and to all nonterminals reachable from it. The first parameter is used to pass Decl's start index, calculated at the offside application site (a1.l), to any nonterminal reachable from Decl, and to constrain terminals reachable from Decl.
The second parameter, fst, which is either 0 or 1, is used to identify and skip the leftmost terminal, which should not be constrained. The value 1 is passed at the application site of offside and propagated down to the first nonterminal of each reachable rule if the rule starts with a nonterminal. The value 0 is passed to any other nonterminal of a reachable rule when the first symbol of the rule is not nullable. Our translation also accounts for nullable nonterminals (not shown here), and in such cases the value of fst also depends on a dynamic check whether the left and right extents of the node corresponding to a nullable nonterminal are equal.

Finally, each terminal reachable from Decl gets a label (labels starting with o), to refer to its start index, and a constraint, encoded as a call to the boolean function f. Note that in the definition of f, the condition i == -1 corresponds to the case when Decl appears in a context where the offside rule does not apply or is ignored, and the condition fst == 1 to the case of the leftmost terminal.

The offside, align and ignore constructs are examples of reasonably complex desugarings to data-dependent grammars. Their existence and their aptness to describe the syntax of Haskell are a witness to the power of data-dependent grammars and the parsing architecture we propose.

3.6 Conditional Directives

In this section we present our solution for parsing conditional directives in C#. As discussed in Section 2.5, most directives can be regarded as comments, but conditional directives have to be evaluated during parsing, as they may affect the syntactic structure of a program.

Conditional directives can appear anywhere in a program. Therefore, it is natural to define them as part of the layout nonterminal. Figure 12 shows relevant parts of the layout definition (Layout) we used to parse C# (follow restrictions enforce longest match; for readability, uses of Whitespace? before and after the terminal '#', and uses of Whitespace after the terminals 'define', 'undef', 'if' and 'elif', are omitted from the figure). In addition to whitespace characters (Whitespace) and comments (starting with '/*' or '//'), the layout consists of declaration directives (Decl) and conditional directives (If).

global defs = {}

Layout ::= (Whitespace | Comment | Decl | If | Gbg)*
           !>> [\ \t\n\r\f] !>> '/*' !>> '//' !>> '#'

Decl ::= '#' 'define' id:Id
         {defs=put(defs,id.yield,true)} PpNL
       | '#' 'undef' id:Id
         {defs=put(defs,id.yield,false)} PpNL

If   ::= '#' 'if' v=Exp(defs) [v] ? Layout
                                  : (Skipped (Elif|Else|PpEndif))
Elif ::= '#' 'elif' v=Exp(defs) [v] ? Layout
                                    : (Skipped (Elif|Else|PpEndif))
Else ::= '#' 'else' Layout

Gbg ::= GbgElif* GbgElse? '#' 'endif'
GbgElif ::= '#' 'elif' Skipped
GbgElse ::= '#' 'else' Skipped

Skipped ::= Part+
Part ::= PpCond | PpLine | ... // etc.
PpCond ::= PpIf PpElif* PpElse? PpEndif
PpIf ::= '#' 'if' PpExp PpNL Skipped?
PpElif ::= '#' 'elif' PpExp PpNL Skipped?
PpElse ::= '#' 'else' PpNL Skipped?
PpEndif ::= '#' 'endif' PpNL

Figure 12. The grammar of conditional directives in C#.

According to the C# language specification, the scope of the symbols introduced by the declaration directives #define and #undef is the file they appear in. Therefore, we need to maintain a global symbol table defs to declare (see Decl) and access (see If and Elif) symbol definitions while parsing. In C# one can define/undefine a symbol, but a value cannot be assigned to a symbol. Thus, the symbol table needs to associate a boolean value with a symbol.

To enable global definitions, our parsing framework supports global variables that can be declared using the global keyword, e.g., defs in Figure 12. In our parsing framework, a global variable is implemented by using parameters and return values to thread a value through a parse. In this case, each nonterminal that directly or indirectly accesses a global variable gets an extra parameter, and each nonterminal that can directly or indirectly update a global variable returns the new value of the variable if the variable is used after an occurrence of the nonterminal in a rule. Note that, assuming immutable values, such an implementation of global variables properly accounts for the nondeterministic nature of generalized parsing. This way, updates to a variable made along one parse do not interfere with updates made along an alternative parse.
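The essence of this threading can be sketched as follows (our illustration, with hypothetical function names): each parse function receives the current defs table and returns a possibly updated immutable copy, so updates made along one alternative never leak into another.

// Threading a global variable through a parse via parameters and return values.
import java.util.HashMap;
import java.util.Map;

final class GlobalDefsSketch {
    // '#' 'define' Id : returns a new table, the caller's table is unchanged
    static Map<String, Boolean> parseDefine(Map<String, Boolean> defs, String id) {
        Map<String, Boolean> updated = new HashMap<>(defs);   // copy instead of mutate
        updated.put(id, true);
        return updated;
    }

    // '#' 'if' Exp : only reads the table that was threaded to this point
    static boolean parseIf(Map<String, Boolean> defs, String symbol) {
        return defs.getOrDefault(symbol, false);
    }

    public static void main(String[] args) {
        Map<String, Boolean> defs = Map.of();                  // global defs = {}
        Map<String, Boolean> afterDefine = parseDefine(defs, "Debug");
        System.out.println(parseIf(afterDefine, "Debug"));     // true
        System.out.println(parseIf(defs, "Debug"));            // false: original table untouched
    }
}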
enforce longest match). In addition to whitespace charac-
The basic idea of single-phase parsing of C# in presence
ters (Whitespace) and comments (starting with '/*' or '//'),
of conditional directives is as follows. Recall that in our pars-
the layout consists of declaration directives (Decl) and con-
ing strategy (Section 3.2) layout is inserted between symbols
ditional directives (If).
in a grammar rule. Conditional directives are evaluated as
According to the C# language specification, the scope
part of the layout nonterminal, and based on the result of the
of symbols introduced by declaration directives #define and
evaluation, the next lines of source code are either treated as
#undef is the file they appear in. Therefore, we need to main-
the actual source code (true case), or as a sequence of valid
tain a global symbol table defs to declare (see Decl) and ac-
C# tokens (false case), also consuming directives that should
cess (see If and Elif) symbol definitions while parsing. In
not be evaluated. To achieve this, the grammar of Figure 12
C# one can define/undefine a symbol, but a value cannot be
uses two different definitions for #if, #elif and #else. The
assigned to a symbol. Thus, the symbol table needs to asso-
bottom definition (PpIf, PpElif and PpElse), which is found
ciate a boolean value with a symbol.
in the C# specification, simply defines directives as part of
To enable global definitions, our parsing framework sup-
valid C# tokens (Skipped), while the top definition (If, Elif
ports global variables that can be declared using the global
and Else) uses data dependency. Note that conditional direc-
keyword. e.g., defs in Figure 12. In our parsing framework,
tives can be nested. This is expressed by using Layout in If,
a global variable is implemented by using parameters and
Elif and Else, and Skipped in PpIf, PpElif and PpElse.
return values to thread a value through a parse. In this case,
Whenever an #if-directive and its expression are parsed
2 For readability reasons, we omit uses of Whitespace? (optional whites-
as part of If, the expression is evaluated using the symbol
pace) before and after terminal '#', and uses of Whitespace after termi- table (defs). Exp (not shown in Figure 12) defines a sim-
nals 'define', 'undef', 'if', 'elif'. ple boolean expression. To enable evaluation of expressions

To enable evaluation of expressions while parsing, Exp uses data dependency and extends the PpExp rules, found in the C# specification, with return values and boolean computation. If the expression evaluates to true (note the use of conditional selection), the parser first continues consuming layout, including the nested directives, and then, after no layout can be consumed, the parser returns to the next symbol in the alternative.
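As a rough illustration of the kind of boolean computation these extended Exp rules perform, consider the following Java sketch. It is not the grammar itself and is not part of the Iguana framework; the class and method names (PpExprEval, eval) are ours, but it evaluates a simplified directive expression against a set of defined symbols in the same spirit.

    import java.util.Set;

    // Illustrative evaluation of a simplified #if expression over the symbol table:
    // a bare symbol evaluates to whether it is currently defined; 'true', 'false',
    // '!', '&&', '||' and parentheses behave as usual. This mirrors the boolean
    // computation attached to the data-dependent Exp rules, not their actual syntax.
    final class PpExprEval {
        static boolean eval(String expr, Set<String> defs) {
            return new PpExprEval(expr, defs).orExpr();
        }
        private final String s; private final Set<String> defs; private int pos = 0;
        private PpExprEval(String s, Set<String> defs) { this.s = s; this.defs = defs; }

        private boolean orExpr()  { boolean v = andExpr(); while (eat("||")) v |= andExpr(); return v; }
        private boolean andExpr() { boolean v = notExpr(); while (eat("&&")) v &= notExpr(); return v; }
        private boolean notExpr() { return eat("!") ? !notExpr() : atom(); }
        private boolean atom() {
            if (eat("(")) { boolean v = orExpr(); eat(")"); return v; }
            String sym = symbol();
            if (sym.equals("true"))  return true;
            if (sym.equals("false")) return false;
            return defs.contains(sym);              // a defined symbol evaluates to true
        }
        private boolean eat(String token) {
            skipSpaces();
            if (s.startsWith(token, pos)) { pos += token.length(); return true; }
            return false;
        }
        private String symbol() {
            skipSpaces();
            int start = pos;
            while (pos < s.length() && (Character.isLetterOrDigit(s.charAt(pos)) || s.charAt(pos) == '_')) pos++;
            return s.substring(start, pos);
        }
        private void skipSpaces() { while (pos < s.length() && s.charAt(pos) == ' ') pos++; }
    }

For instance, with only DEBUG defined, an expression such as DEBUG && !TRACE evaluates to true, which is exactly the information the If and Elif rules need in order to choose between the true and false branches.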
If the expression evaluates to false, the parser consumes part of the input as a list of valid C# tokens (Skipped) until it finds the corresponding #elif-, #else- or #endif-part. Note that Skipped also consumes nested #if-directives (PpCond), if any, but in this case, conditions are not evaluated. The definition of Skipped also allows consuming invalid C# structure (only valid token-wise) when the condition is false, see Figure 3 (right). Finally, when all #if, #elif and #else directives are present, there will be dangling #elif, #else, and #endif parts remaining if one of the conditions evaluates to true. These dangling parts should also be consumed by the layout. The Gbg (garbage) nonterminal, defined as part of the layout, does exactly this.

3.7 Miscellaneous Features

In this section we discuss the use of data-dependent grammars for parsing XML and resolving the infamous typedef ambiguity in C. XML has a relatively straightforward syntax. Figure 13 (top) shows the context-free definition of Element in XML, where Content allows a list of nested elements. The problem with this definition is that it can recognize inputs with unbalanced start and end tags, for example:

<note>
<to>Bob</from>
<from>Alice</to>
</note>

Element ::= STag Content ETag
STag    ::= '<' Name Attribute* '>'
ETag    ::= '</' Name '>'

Element ::= s=STag Content ETag(s)
STag    ::= '<' n:Name Attribute* '>' { n.yield }
ETag(s) ::= '</' n:Name [n.yield == s] '>'

Figure 13. Context-free grammar of XML elements (top) and the data-dependent version (bottom).

Using data-dependent grammars, the solution to match start and end tags is very intuitive. Figure 13 (bottom) shows a data-dependent grammar for XML elements. As can be seen, inside a starting tag, STag, the result of parsing Name is bound to n, and the respective substring, n.yield, is returned from the rule. The returned value is assigned to s in the Element rule, and is passed to the end tag, ETag. Finally, in ETag, the name of the end tag is checked against the name of the starting tag. If the name of the starting tag is not equal to the name of the end tag, i.e., n.yield == s does not hold, the parsing pass dies.

Now, we consider the problem of typedef ambiguity in C. For example, the expression (T)+1 can have two meanings, depending on the meaning of T in the context: a cast to type T with +1 being a subexpression, or an addition with two operands (T) and 1. If T is a type, declared using typedef, the first parse is valid, otherwise the second one.

To resolve the typedef ambiguity, type names should be distinguished from other identifiers, such as variables and function names, during parsing. In addition, the scoping rules of C should be taken into account. For example, consider the following C program:

typedef int T;
main() {
  int T = 0, n = (T)+1;
}

In this example, T is first declared as a type alias to int and then redeclared as a variable of type int in the inner scope introduced by the main function.

global defs = [{}]

Declaration ::= x=Specifiers Declarators(x)

Specifiers ::= x=Specifier y=Specifiers {x || y}
             | x=Specifier {x}
Specifier  ::= "typedef" {true} | ...

Declarators(x) ::= s=Declarator {h=put(head(defs),s,x);
                                 defs=list(h,tail(defs))}
                   ("," Declarators(x))*
Declarator ::= id:Identifier {id.yield}
             | x=Declarator "(" ParameterTypeList ")" {x}
             | ...

Expr ::= Expr "-" Expr
       | "(" n:TypeName [isType(defs,n.yield)] ")" Expr
       | "(" Expr ")"
       | ...
       | Identifier [!isType(defs,n.yield)]

Figure 14. Resolving typedef ambiguity in C.

Figure 14 shows a simplified excerpt of our data-dependent C grammar. The excerpt shows the declaration and expression parts of the C grammar. As can be seen, a C declaration consists of a list of specifiers followed by a list of declarators. Each declarator declares one identifier. The keyword typedef can appear in the list of specifiers, for example, along with the declared type. A declarator can be either a simple identifier or a more complicated syntactic structure, e.g., array and function declarators, nesting the identifier. It is important to note that an identifier should enter the current scope when its declarator is complete. The expression part of Figure 14 shows the cast expression rule (the second rule from the top), and the primary expression rule (the last one). Note that to resolve the typedef ambiguity, illustrated in our running example, an identifier should be accepted as an expression if it is not declared as a type name.
To distinguish between type names and other identifiers, we record names encountered as part of declarators, and associate a boolean value with each name: true for type names and false otherwise. To maintain this information during parsing, we introduce the global variable defs, holding a list of maps to properly account for scoping. At the beginning of parsing, defs is a list containing a single, empty map. At the beginning of a new scope, i.e., when "{" is encountered, an empty map is prepended to the current list, resulting in a new list which is assigned to defs (not shown in the figure). At the end of the current scope, i.e., when "}" is encountered, the head of the current list is dropped by taking the tail of the list and assigning it to defs.

To communicate the presence of typedef in a list of specifiers, we extend each rule of Specifier to return a boolean value: the "typedef" rule returns true, and the other rules return false. Specifiers computes the disjunction of the values associated with the specifiers in the resulting list. This information is passed via variable x to Declarators. We also extend the rules of Declarator to return the declared name, id.yield. After a declarator is parsed, the declared name can be stored in defs: the pair (s,x) is added to the map taken from the head of the current list, and a new list, with the resulting map as its head, is created and assigned to defs.

Finally, the isType function is used to check whether the current identifier is a type name in the current scope or not: isType iterates over the elements of defs, starting from the first element, to look up the given name. If the name is not found in the current map, isType continues the search with the next element, representing the outer scope. If the name is found, isType returns the boolean value associated with the name. If none of the maps contains the name, isType returns false.

In our running example, after parsing the second declaration of T, which appears in the scope of the main function, the pair ("T",false) is added to the map at the head of defs, effectively shadowing the previous typedef declaration of T, and causing the condition in the cast expression rule to fail.
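A small Java sketch of this scope-chain bookkeeping is shown below; it is only illustrative (the names Scopes, enterScope, exitScope, declare and isType are ours, not the Iguana API), but it mirrors the list-of-maps discipline described above.

    import java.util.*;

    // Illustrative sketch: defs as an immutable chain of scopes, each scope mapping a
    // declared name to true (type name) or false (other identifier).
    final class Scopes {
        private final List<Map<String, Boolean>> defs;

        Scopes() { this(List.of(Map.of())); }                 // start of parsing: one empty map
        private Scopes(List<Map<String, Boolean>> defs) { this.defs = defs; }

        Scopes enterScope() {                                  // "{" : prepend an empty map
            List<Map<String, Boolean>> xs = new ArrayList<>(defs);
            xs.add(0, Map.of());
            return new Scopes(List.copyOf(xs));
        }

        Scopes exitScope() {                                   // "}" : drop the head map
            return new Scopes(List.copyOf(defs.subList(1, defs.size())));
        }

        Scopes declare(String name, boolean isTypedef) {       // declarator complete: update head map
            Map<String, Boolean> head = new HashMap<>(defs.get(0));
            head.put(name, isTypedef);
            List<Map<String, Boolean>> xs = new ArrayList<>(defs);
            xs.set(0, Map.copyOf(head));
            return new Scopes(List.copyOf(xs));
        }

        boolean isType(String name) {                          // innermost declaration wins
            for (Map<String, Boolean> scope : defs)
                if (scope.containsKey(name)) return scope.get(name);
            return false;
        }
    }

With this sketch, declaring T as a typedef at the top level, entering the scope of main, and redeclaring T as a variable makes isType("T") return false, matching the behavior described for the running example.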
4. Implementation

In this section we present our extension of the GLL parsing algorithm [33] to support data-dependent grammars. GLL parsing is a generalization of recursive-descent parsing that supports all context-free grammars, and produces a binarized Shared Packed Parse Forest (SPPF) in cubic time and space. GLL uses a Graph-Structured Stack (GSS) [35] to handle multiple function calls in recursive-descent parsing. The problem of left recursion is solved by allowing cycles in the GSS. As GLL parsers are recursive-descent like, the handling of parameters and environments is intuitive, and the implementation remains very close to the stack-based semantics, which eases reasoning about the runtime behavior of the parser. More information on GLL parsing over ATN, GSS, and SPPF is provided in Appendix A.

We use a variation of GLL parsing that uses a more efficient GSS [2]. GLL parsing can be seen as a grammar traversal process that is guided by the input. At each point during parsing, a GLL parser is at a grammar slot (a grammar position before or after a symbol in a rule) and executes the code corresponding to this slot. Because of the nondeterministic nature of general parsing, a GLL parser needs to record all possible paths and process them later, and at the same time eliminate duplicate jobs. The unit of work in GLL parsing is a descriptor, which captures a parsing state. Descriptors allow a serialized, discrete view of the tasks performed during parsing. GLL parsing has a main loop, in a trampolined style, that executes the descriptors one at a time until no more descriptors are left.
The problem of left recursion is solved by allowing cycles relation, S ⇢ String ⇥ Q, from a nonterminal name to a
in GSS. As GLL parsers are recursive-descent like, the han- set of start states, each representing the initial state of an
dling of parameters and environment is intuitive, and the im- alternative.
plementation remains very close to the stack-based seman- Constructing an ATN grammar from a CFG is straight-
tics, which eases the reasoning about the runtime behavior forward. For each nonterminal in the grammar, and for each
of the parser. More information on GLL parsing over ATN, alternative of the nonterminal, a pair consisting of the non-
GSS, and SPPF is provided in Appendix A. terminal’s name and a state representing the start state of the
We use a variation of GLL parsing that uses a more effi- alternative is added to S. Finally, for each symbol in the al-
cient GSS [2]. GLL parsing can be seen as a grammar traver- ternative, a next state is created, and a transition, labeled with
sal process that is guided by the input. At each point during the symbol, from the previous state to this state is added. The
parsing, a GLL parser is at a grammar slot (grammar posi- last state of the alternative is marked as a final grammar slot.

4.2 Data-dependent ATN Grammars

To support data-dependent grammars, we extend ATN grammars with the following forms of transitions:
• p ─x=l:A(e)→ q (parameterized, labeled nonterminals) and p ─l:t→ q (labeled terminals);
• p ─x=e→ q (variable binding), p ─[e]→ q (constraint) and p ─e→ q (return expression).

Two additional mappings are maintained, L, X : Q → String, that map a state, representing a grammar slot after a labeled nonterminal, to the nonterminal's label (l) and to the nonterminal's variable (x), respectively. Here, as in Section 3.1, for simplicity of presentation and without loss of generality, we assume that nonterminals can have at most one parameter. We also only consider cases of labeled terminals and nonterminals, and when a return expression is present. Finally, we assume that the expression language e is a simple functional programming language with immutable values and no side effects, that labels and variables are scoped to the rules they are introduced in, and that labels and variables introduced by desugaring have unique names in their scopes.

Figure 16. Data-dependent ATN grammar for E ::= E + E > - E | a after desugaring operator precedence.

An example of a data-dependent ATN is shown in Figure 16. This ATN grammar is the disambiguated version of the grammar shown in Figure 15 after desugaring operator precedence.

4.3 Data Dependency in GLL Parsing

In the following, p, q, s represent ATN states in Q, i is an input index, u, u′ represent GSS nodes, and w, n, y represent SPPF nodes. To support data-dependent grammars, we introduce an environment, E, into GLL parsing. Here, we assume that E is an immutable map of variable names to values. In the data-dependent setting, a descriptor, the unit of work in GLL parsing, is of the form (p, i, E, u, w). Now, a descriptor contains an environment E that has to be stored and later used whenever the parser selects the descriptor to continue from this point. GSS is also extended to store additional data. A GSS node and a GSS edge are now of the forms (A, i, v) and (u, p, w, E, u′), respectively. That is, in addition to the current input index i, a GSS node stores an argument v, passed to a nonterminal A, to fully identify the call. A GSS edge additionally stores an environment E, to capture the state of the parser before a call to a nonterminal is made.
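The extended structures can be summarized as plain records; the following Java sketch uses our own illustrative names, with Object as a placeholder for ATN states, values and SPPF nodes, and is not the concrete Iguana representation.

    import java.util.Map;

    // Illustrative data-dependent GLL structures: a descriptor carries an environment E,
    // a GSS node additionally carries the argument of the call, and a GSS edge carries
    // the environment at the call site.
    final class DataDependentStructures {
        record Descriptor(Object slot, int inputIndex, Map<String, Object> env,
                          GssNode gssNode, Object sppfNode) {}                  // (p, i, E, u, w)

        record GssNode(String nonterminal, int inputIndex, Object argument) {}  // (A, i, v)

        record GssEdge(GssNode source, Object returnSlot, Object sppfNode,
                       Map<String, Object> env, GssNode target) {}              // (u, p, w, E, u')
    }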
Finally, a GLL parser constructs a binarized SPPF (Appendix A.1), creating terminal nodes (nodeT), and nonterminal and intermediate nodes (nodeP). In GLL parsing intermediate nodes are essential. In particular, they allow the parser to carry a single node at a time by grouping the symbols of a rule in a left-associative manner. Nonterminal and intermediate nodes can be ambiguous. To properly handle ambiguities under nonterminal and intermediate nodes, we include the environment and return values in the SPPF construction. Specifically, arguments to nonterminals and return values are part of nonterminal nodes, and the environment is part of intermediate nodes.

Figure 17 presents the semantics of GLL parsing over ATN, defining it as a transition relation on a configuration (R, U, G, P) whose elements are the four main structures maintained by a GLL parser:
• R is a set of pending descriptors to be processed;
• U is a set of descriptors created during parsing. This set is maintained to eliminate duplicate descriptors;
• G is a GSS, represented by a set of GSS edges;
• P is a set of parsing results (SPPF nodes created for nonterminals) associated with GSS nodes, i.e., a set of elements of the form (u, w).

During parsing, a descriptor is selected and removed from R, represented as {(p, i, E, u, w)} ∪ R, and given the rules, a deterministic choice is made based on the next transition in the ATN. The simplest rules are Eps, Cond-1, Cond-2 and Bind. Eps creates the ε-node (via a call to nodeT) and an intermediate node (via a call to nodeP), and adds a descriptor for the next grammar slot. Cond-1 and Cond-2 depend on the evaluation of expression e in a constraint. If the expression evaluates to true, a new descriptor is added to continue with the next symbol in the rule (Cond-1), otherwise no descriptor is added (Cond-2). Bind evaluates the expression in an assignment and creates a new environment containing the respective binding. This environment is used to create the new descriptor added to R.

Term-1 and Term-2 deal with labeled terminals. If terminal t matches (Term-1) the input string (represented by an array I) starting from input position i, a terminal node is created (assuming t is of length 1). Then, the properties, i.e., the left and right extents, and the respective substring, are computed from the resulting node (props(y)). Finally, a new environment, containing the binding [l = props(y)], is created and used to construct an intermediate node and a new descriptor. If the terminal does not match (Term-2), no descriptor is added.

Call-1 and Call-2 deal with labeled calls to nonterminals. First, the argument e is evaluated, where E1 allows the use of the left extent in e (lprop constructs properties with only the left extent). If a GSS node representing this call already exists (Call-1), the parsing results associated with this GSS node are reused, and a possibly empty set of new descriptors (D) is created.
Each descriptor in the set corresponds to a result, a nonterminal node y, retrieved from P, such that the index of the descriptor is the right extent of y (rext), its environment contains the bindings [l = props(y)] and [x = val(y)] (val retrieves the value from y), and its SPPF node is a new intermediate node. Note that d ∉ U ensures that no duplicate descriptors are added at this point. If the corresponding GSS node does not exist, Call-2 creates one descriptor for each start state of the nonterminal (s ∈ S(A)). Each descriptor gets a new environment with the binding [p0 = v], where p0 is the nonterminal's parameter, which we assume to have a unique name in the scope of a rule. Both Call-1 and Call-2 add a new GSS edge, capturing the previous environment, to G.

Finally, in the Ret rule, the return expression is evaluated, and the nonterminal node is created which stores both the argument of the current GSS node (arg(u)) and the return value. This node is recorded in P as a result associated with the GSS node. For each GSS edge directly reachable from the current GSS node, a new descriptor is created. Note that labels and variables at call sites, represented by the current GSS node, are retrieved via the mappings L and X, respectively.

                        (R, U, G, P) ⇒ (R′, U′, G′, P′)

Eps:
  p ─ε→ q    n = nodeP(q, w, nodeT(ε, i, i), E)    d = (q, i, E, u, n)
  ──────────────────────────────────────────────────────────────────
  ({(p, i, E, u, w)} ∪ R, U, G, P) ⇒ (R ∪ {d}, U, G, P)

Term-1:
  p ─l:t→ q    I[i] = t    y = nodeT(t, i, i+1)    E1 = E[l = props(y)]
  n = nodeP(q, w, y, E1)    d = (q, i+1, E1, u, n)
  ──────────────────────────────────────────────────────────────────
  ({(p, i, E, u, w)} ∪ R, U, G, P) ⇒ (R ∪ {d}, U, G, P)

Term-2:
  p ─l:t→ q    I[i] ≠ t
  ──────────────────────────────────────────────────────────────────
  ({(p, i, E, u, w)} ∪ R, U, G, P) ⇒ (R, U, G, P)

Call-1:
  p ─x=l:A(e)→ q    E1 = E[l = lprop(i)]    [[e]]E1 = v    u′ = (A, i, v) ∈ N(G)
  D = {d | (u′, y) ∈ P, E2 = E[l = props(y), x = val(y)],
           d = (q, rext(y), E2, u, nodeP(q, w, y, E2)), d ∉ U}
  ──────────────────────────────────────────────────────────────────
  ({(p, i, E, u, w)} ∪ R, U, G, P) ⇒ (R ∪ D, U ∪ D, G ∪ {(u′, q, w, E, u)}, P)

Call-2:
  p ─x=l:A(e)→ q    E1 = E[l = lprop(i)]    [[e]]E1 = v    u′ = (A, i, v) ∉ N(G)
  D = {(s, i, [p0 = v], u′, $) | s ∈ S(A)}
  ──────────────────────────────────────────────────────────────────
  ({(p, i, E, u, w)} ∪ R, U, G, P) ⇒ (R ∪ D, U, G ∪ {(u′, q, w, E, u)}, P)

Ret:
  p ─e→ q    q ∈ F    [[e]]E = v    n = nodeP(q, w, arg(u), v)
  D = {d | (u, s, y, E1, u′) ∈ G, E2 = E1[L(s) = props(n), X(s) = v],
           d = (s, i, E2, u′, nodeP(s, y, n, E2)), d ∉ U}
  ──────────────────────────────────────────────────────────────────
  ({(p, i, E, u, w)} ∪ R, U, G, P) ⇒ (R ∪ D, U ∪ D, G, P ∪ {(u, n)})

Cond-1:
  p ─[e]→ q    [[e]]E = true    d = (q, i, E, u, w)
  ──────────────────────────────────────────────────────────────────
  ({(p, i, E, u, w)} ∪ R, U, G, P) ⇒ (R ∪ {d}, U, G, P)

Cond-2:
  p ─[e]→ q    [[e]]E = false
  ──────────────────────────────────────────────────────────────────
  ({(p, i, E, u, w)} ∪ R, U, G, P) ⇒ (R, U, G, P)

Bind:
  p ─x=e→ q    [[e]]E = v    d = (q, i, E[x = v], u, w)
  ──────────────────────────────────────────────────────────────────
  ({(p, i, E, u, w)} ∪ R, U, G, P) ⇒ (R ∪ {d}, U, G, P)

Figure 17. GLL for data-dependent ATN grammars.

5. Evaluation

Our data-dependent parsing framework is implemented as an extension of the Iguana parsing framework [2]. The addition of data dependency is at the moment a prototype, and most of the effort was put into correctness rather than performance optimization. As a frontend to write data-dependent grammars, we extended the syntax definition of Rascal [24], a programming language for meta-programming and source code analysis, and provided a mapping to Iguana's internal representation of data-dependent grammars.

In Section 2 we enumerated a number of challenges in parsing programming languages, and in Section 3 we provided solutions based on data-dependent grammars (directly or via desugaring) that address these challenges. For each challenge we selected a programming language that exhibits it, and wrote a data-dependent grammar³, derived from the specification grammar of the language. For evaluation, we parsed real source files from the source distribution of the language and some popular open source libraries, see Table 2. Table 1 summarizes the evaluation results. In the following we discuss these results in detail, and provide an analysis of the expected performance in practice.

³ https://github.com/iguana-parser/grammars

Java  To evaluate the correctness of our declarative operator precedence solution using data-dependent grammars, we used the grammar of Java 7 from the main part of the Java language specification [11]. This grammar contains an unambiguous left-recursive expression grammar, in a similar style to the expression grammar in Figure 1 (middle).

We replaced the expression part (consisting of about 30 nonterminals) of the Java specification grammar with a single Expression nonterminal that declaratively expresses operator precedence using >, left and right. The resulting grammar, which we refer to as the natural grammar, is much more concise and readable, see Table 1. The resulting parser parsed all 8067 files successfully and without ambiguity.

The natural grammar of Java produces different parse trees compared to the original specification grammar, and therefore it is not possible to directly compare the parse trees. To test the correctness of parsers resulting from the desugaring of >, left, and right to data-dependent grammars, we tested their resulting parse trees against a GLL parser for the same natural grammar of Java, using our previous work on rewriting operator precedence rules [3]. Both parsers, using desugaring to data-dependent grammars and rewriting operator precedence rules, produced the same
parse trees for all Java files, providing evidence that our desugaring of operator precedence to data-dependent grammars implements the same semantics as the rewriting in [3].

Table 1. Summary of the results of parsing with character-level data-dependent grammars of programming languages.

Language  Challenge                 Solution                               Spec. Grammar       Data-dep. Grammar    # Files  % Success
                                                                           # Nont.  # Rules    # Nont.  # Rules
Java      Operator precedence       >, left and right                      200      485        169      435         8067     100% (8067)
C#        Conditional directives    global variables and dynamic layout   387      1000       395      1013        5839     99% (5838)
Haskell   Indentation sensitivity   align, offside and ignore              143      431        152      452         6457     72% (4657)

Table 2. Summary of the projects used in the evaluation.

Lang.    Projects        Version         Description
Java     JDK             1.7.0_60-b19    Java Development Kit
         JUnit           4.12            Unit testing framework
         SLF4J           1.7.12          A Java logging framework
C#       Roslyn          build-preview   .NET Compiler Platform
         MVC             6.0.0-beta5     ASP.NET MVC Framework
         EntityFramew.   7.0.0-beta5     Data access for .NET
Haskell  GHC             7.8             Glasgow Haskell Compiler
         Cabal           1.22.4.0        Build System for Haskell
         Git-annex       5.20150710      File manager based on Git
         Fay             0.23.1.6        Haskell to JavaScript compiler

Despite its prototype status, the data-dependent parser is at the moment on average only 25% slower than the rewritten one. The main reason for the performance difference is that in the rewriting technique [3] the precedence information is statically encoded in the grammar, and therefore there is no runtime overhead, while in the data-dependent version passing arguments and handling the environment is done at runtime. The problem with the rewriting technique is that the rewriting process itself is rather slow and the resulting grammar is very large.

C#  To evaluate our data-dependent framework on parsing conditional directives, we used the grammar of C# 5 from the C# language specification [30]. As mentioned in Section 2.5, existing C# compilers resolve preprocessor directives in the lexing phase, and the parser is not aware of directives. However, the C# language specification has context-free rules that describe the syntax of directives. Our solution to parsing conditional directives (Section 3.6) leverages layout that is automatically inserted between symbols in grammar rules. We used the context-free syntax of directives in C# as the starting point. We extended the layout definition to include directives. Then, the conditional directive rules were modified to allow parse-time evaluation of conditions and selection of the corresponding path.

The resulting data-dependent grammar differs from the specification grammar only in the layout definition, and the difference is minimal. As can be seen in Table 1 there are only 8 additional nonterminals and 13 additional rules (about 1.3% of the whole grammar). Using the character-level grammar of C# we could parse 5838 files out of 5839. The parser timed out after 30 seconds on a very large source file from the Roslyn framework. The file, which appears to be automatically generated, contains 156033 lines of code and is 4.8 MB in size.

Although the grammar of C# is near deterministic, the reason for the time out is that character-level grammars generate very large parse trees, with a node for each character. Nevertheless, this file could be parsed using a context-aware parser for C#. We discuss the performance gain of using a context-aware scanner in Section 5.1.

Haskell  To evaluate our parsing framework for indentation-sensitive programming languages, we used the grammar of Haskell [28]. The specification grammar of Haskell is written using explicit blocks, as if no indentation sensitivity exists, and the lexer translates indentation to physical block delimiters. We took the Haskell grammar as written in the specification as the starting point and added extra rules that specify layout sensitivity using the align, offside and ignore constructs. As shown in Table 1, the data-dependent version has only 21 additional rules (about 4% of the whole grammar). From the total number of 6457 Haskell files, we could successfully parse 4657 files (72%). The reason for the parse errors in the other files is that they contained syntactic constructs from GHC extensions that we do not support yet.

Besides numerous undocumented GHC extensions we found in the source files, many Haskell files contained CPP directives, which were resolved by running the C preprocessor, cpp, before parsing. In the future, we plan to deal with C directives during parsing, the same way we did for C#. One last issue about parsing Haskell is that indentation rules alone are not sufficient to unambiguously parse Haskell, and there is a need for a syntactic longest match that uses indentation information. For example, the following input string is ambiguous, where both derivations are correct regarding the indentation rules:

f x = do print x
          + 1

In the first derivation, the right-hand side is an infix plus-expression, consisting of a do-expression and 1. The second derivation consists of only a do-expression that has print x + 1 as its subexpression. According to the Haskell language specification the second interpretation is valid, as in do-expressions longest match should be applied.

We resolved this issue by defining a follow restriction, similar to Section 3.3, that bypasses the layout and checks for the indentation level of the next non-whitespace token when the token is not a keyword or a delimiter.

OCaml  We used excerpts of OCaml source files to test our operator precedence translation against deep and problematic operator precedence cases. OCaml, in contrast to the other three programming languages we used for the evaluation, uses a natural, ambiguous expression grammar in its language specification. The data-dependent grammar of OCaml is basically the same as the reference manual, where the alternative operator in the expression part is replaced with > and additional left and right operators are added. We are not yet able to unambiguously parse full OCaml programs, as they contain operator precedence ambiguities across indirect nonterminals. An example is pattern-matching, which can derive expr on its right-most end:

expr ::= expr '+' expr
       | 'function' pattern-matching
pattern-matching ::= pattern '->' expr

For example, the input string function x -> x + 1 is ambiguous with the following derivations: (function x -> x) + 1 or function x -> (x + 1). As the function alternative has pattern-matching and not expr on its right-most end, the operator precedence rules do not apply in our current scheme. The translation of operator precedence in the presence of indirect nonterminals to data-dependent grammars seems possible with an additional analysis of indirect nonterminals, but is left as future work.

5.1 Running Time and Performance

Data-dependent grammars [16] provide a pay-as-you-go model. If a pure context-free grammar is specified, the worst-case complexity of the underlying parsing technology is retained. However, in the general case no guarantees can be made. Our data-dependent parsers, implemented on top of GLL parsing, are worst-case O(n³) on pure context-free grammars [33]. The more practical question, however, is how data dependency affects the runtime performance of parsing real programming languages.

In this section, we provide empirical results showing that parsers for data-dependent grammars can behave nearly linearly on grammars of real programming languages. We ran the experiments on a machine with a quad-core Intel Core i7 2.6 GHz CPU and 16 GB of physical memory running Mac OS X 10.10.4. We used a 64-Bit Oracle HotSpot™ JVM version 1.8.0_25. Each file was executed 10 times and the mean running time (CPU user time) was reported. The first three runs of each file were skipped to allow for JIT optimizations.

Figure 18 shows the log-log plots (log base 10) of running time (ms) against file size (number of characters) for all the files we parsed (see Tables 1 and 2). For showing the linear behavior we used the character-level grammars, as they exhibit the relationship between the running time and the number of characters better than the context-aware version. As can be seen, all three parsers exhibit near-linear behavior on grammars of programming languages.

Java: y = 1.212x − 3.181, R² = 0.9395.   C#: y = 1.098x − 3, R² = 0.9821.   Haskell: y = 0.95x − 1.616, R² = 0.95.
(Axes: x = size (#characters) in log10; y = CPU time (milliseconds) in log10.)

Figure 18. Running time of the character-level parsers for Java, C#, and Haskell against the input size (number of characters) plotted as log-log base 10. The red line is the linear regression fit. The goodness of each fit is indicated by the adjusted R² value in each log-log plot. The equation in each plot describes a power relation with the original data, and as all the coefficients (1.212, 1.098, 0.950) are close to one, we can conclude the running time is near-linear on these grammars.
model. If a pure context-free grammar is specified, the ber of characters better than the context-aware version. As
worst-case complexity of the underlying parsing technol- can be seen, all three parsers exhibit near-linear behavior on
ogy is retained. However, in the general case no guarantees grammars of programming languages.
can be made. Our data-dependent parsers, implemented on To compare the performance difference between character-
top of GLL parsing, are worst-case O(n3 ) on pure context- level and context-aware parsing, we ran both context-aware

16
and character-level parsers on all the source files. Figure 19 shows the relative performance gain (speedup) using a context-aware parser compared to a character-level parser for each file. For a better visualization we omitted the outliers from the box plots. The median and maximum speedup for Java is (2.45, 15.1), for C# (2.45, 4) and for Haskell (1.9, 3). The precise impact of context-aware scanning for general parsing and data-dependent grammars is future work, but our preliminary investigation revealed that using character-level grammars for parsing layout is very expensive, as it is a very common operation, see Section 3.2.

Java / C# / Haskell box plots; x-axis: Speedup (1 to 15).

Figure 19. The relative speedup using context-aware scanning instead of character-level grammars.

6. Related Work

Parsing is a well-researched topic, and many features of our parsing framework are related in one way or another to other existing systems. Throughout this paper we have discussed some related work, which we do not repeat here. In this section we discuss directly related work and our inspirations.

Data dependency implementation  Data-dependent grammars have many similarities with attribute grammars [27] and attribute-directed parsing [39]. A detailed discussion of related systems is provided by Jim et al. [16]. From the implementation perspective, Jim et al. present the Yakker parser generator [16], which is based on Earley's algorithm [7], but we have a GLL-based interpretation of data-dependent grammars. We also extend the SPPF creation functionality of GLL parsing (taking environments into account), while SPPF creation is not discussed in Jim et al.'s approach. Another difference between our implementation and Yakker is that Yakker directly supports regular operators, by applying longest match. We, however, believe that all ambiguities should be returned by the parser, and avoid such implicit heuristics. Therefore, we desugar regular operators to data-dependent BNF rules.

We use an interpretative model of parsing based on Woods' ATN grammars [40]. Woods used an explicit stack to run ATN grammars, similar to a pushdown automaton. However, as with any top-down parser, such execution of ATN grammars does not terminate in the presence of left recursion. Jim et al.'s data-dependent framework operates on a data-dependent automaton [15], which is a variation of ATN grammars interpreted with Earley's algorithm.

Indentation-sensitive parsing  Besides modification of the lexer, which has been used in GHC and CPython, there are a number of other systems that provide a solution for indentation-sensitive parsing. Parser combinators [13] are higher-order functions that are used to define grammars in terms of constructs such as alternation and sequence. This approach has been used in parsing indentation-sensitive languages [14]. Traditional parser combinators do not support left recursion and can have exponential runtime. Another main difference between parser combinators and our approach is that we do not give the end user access to the internal workings of the parser. Since parser combinators are normal functions, the user can modify them. Our approach provides an external DSL for defining parsers, while parser combinators provide an internal DSL. Therefore, our approach, compared to parser combinators, provides more control over the syntax definition.

Erdweg et al. present an extension of SDF to define layout constraints on grammar rules [8]. This constraint-based approach is implemented by modifying the underlying SGLR [38] parser. Most constraints can be solved during parsing. Constraints that are not resolved will lead to ambiguity, which can be removed by post-parse filtering. Adams presents the notion of indentation-sensitive grammars [1], where symbols in a rule are annotated by their relative position to the immediate parents. This technique is implemented for LR(k) parsing.

We do not offer a customized solution for indentation sensitivity for a specific parsing technology; rather, we use the general data-dependent grammars framework, and map indentation rules to it. In addition, we define high-level constructs such as align, offside, and ignore which are desugared to lower-level data-dependent grammars. This enables a syntax definition model that is closer to what the user has in mind. We think the use of high-level constructs leads to cleaner, more maintainable grammars.

Operator precedence  SDF2 uses a parser-independent semantics of operator precedence which is based on a parent-child relationship on derivation trees [38]. This semantics is implemented in SGLR parsing [38] by modifying parse tables. Although the SDF2 semantics for operator precedence works for most cases, in some cases it is too strong, i.e., rejecting valid sentences, and in some cases it cannot disambiguate the expression grammar.

In earlier work [3] we discussed the precedence ambiguity, and proposed a grammar rewriting that takes an ambiguous grammar with a set of precedence rules and produces a grammar that does not allow precedence-invalid derivations. Our current solution has the same semantics: it does not remove sentences when there is no precedence ambiguity, and can deal with corner cases found in programming languages such as OCaml. In addition, our operator precedence solution is desugared to data-dependent grammars, thus it is independent of the underlying parsing technology.

Conditional directives  Recent work on parsing conditional directives targets all variations [10, 21]. Gazzillo and Grimm [10] give an extensive overview of related work in this area. However, to the best of our knowledge, none of the existing systems employ a single-phase parsing scheme; rather, they use a separate scanner and annotate the tokens based on the conditional directives they appear in. Our approach of using data-dependent grammars to evaluate the conditional directives is new. The treatment of other features of preprocessors, such as macros, is future work.
7. Conclusion

We have presented our vision of a parsing framework that is able to address many challenges of declarative parsing of real programming languages. We have built an implementation of data-dependent grammars based on the GLL parsing algorithm. We have also shown how to map common idioms in the syntax of programming languages, such as lexical disambiguation filters, operator precedence, indentation sensitivity, and conditional directives, to data-dependent grammars. These mappings provide the language engineer with a set of out-of-the-box constructs, while at the same time new high-level constructs can be added. The preliminary experiments with our parsing framework show that it can be efficient and practical. To fully realize our vision we will explore more syntactic features, and further optimize the implementation of our framework.

Acknowledgments

We are thankful to Jurgen Vinju, Paul Klint, and the anonymous reviewers for their constructive feedback on earlier versions of this paper.

References

[1] M. D. Adams. Principled Parsing for Indentation-sensitive Languages: Revisiting Landin's Offside Rule. In Principles of Programming Languages, POPL '13, pages 511–522. ACM, 2013.
[2] A. Afroozeh and A. Izmaylova. Faster, Practical GLL Parsing. In Compiler Construction, 24th International Conference, CC '15, pages 89–108. Springer, 2015.
[3] A. Afroozeh, M. van den Brand, A. Johnstone, E. Scott, and J. J. Vinju. Safe Specification of Operator Precedence Rules. In Software Language Engineering, SLE '13, pages 137–156. Springer, 2013.
[4] A. V. Aho, S. C. Johnson, and J. D. Ullman. Deterministic Parsing of Ambiguous Grammars. In Principles of Programming Languages, POPL '73, pages 1–21, 1973.
[5] K. Clarke. The Top-down Parsing of Expressions. Technical report, Dept. of Computer Science and Statistics, Queen Mary College, London, June 1986.
[6] F. L. DeRemer. Practical Translators for LR(k) Languages. PhD thesis, Massachusetts Institute of Technology, 1969.
[7] J. Earley. An Efficient Context-free Parsing Algorithm. Commun. ACM, 13(2):94–102, Feb. 1970. ISSN 0001-0782.
[8] S. Erdweg, T. Rendel, C. Kästner, and K. Ostermann. Layout-Sensitive Generalized Parsing. In Software Language Engineering, SLE '12, pages 244–263. Springer, 2012.
[9] K. Fisher and R. Gruber. PADS: A Domain-specific Language for Processing Ad Hoc Data. In Programming Language Design and Implementation, PLDI '05, pages 295–304. ACM, 2005.
[10] P. Gazzillo and R. Grimm. SuperC: Parsing All of C by Taming the Preprocessor. In Programming Language Design and Implementation, PLDI '12, pages 323–334. ACM, 2012.
[11] J. Gosling, B. Joy, G. Steele, G. Bracha, and A. Buckley. The Java Language Specification, Java SE 7 Edition, 2013.
[12] J. Heering, P. R. H. Hendriks, P. Klint, and J. Rekers. The Syntax Definition Formalism SDF–Reference Manual–. SIGPLAN Not., 24(11):43–75, Nov. 1989.
[13] G. Hutton. Higher-order Functions for Parsing. Journal of Functional Programming, 2(3):323–343, July 1992.
[14] G. Hutton and E. Meijer. Monadic Parsing in Haskell. J. Funct. Program., 8(4):437–444, 1998.
[15] T. Jim and Y. Mandelbaum. Efficient Earley Parsing with Regular Right-hand Sides. 253(7):135–148, 2010. LDTA '09.
[16] T. Jim, Y. Mandelbaum, and D. Walker. Semantics and Algorithms for Data-dependent Grammars. In Principles of Programming Languages, POPL '10, pages 417–430. ACM, 2010.
[17] M. Johnson. The Computational Complexity of GLR Parsing. In Generalized LR Parsing, pages 35–42. Springer US, 1991.
[18] S. C. Johnson. Yacc: Yet Another Compiler-Compiler. Technical report, AT&T Bell Laboratories, 1979.
[19] A. Johnstone and E. Scott. Modelling GLL Parser Implementations. In Software Language Engineering, 3rd International Conference, SLE '10, pages 42–61, 2010.
[20] A. Johnstone, E. Scott, and M. van den Brand. Modular Grammar Specification. Sci. Comput. Prog., 87:23–43, 2014.
[21] C. Kästner, P. G. Giarrusso, T. Rendel, S. Erdweg, K. Ostermann, and T. Berger. Variability-aware Parsing in the Presence of Lexical Macros and Conditional Compilation. In Object Oriented Programming Systems Languages and Applications, OOPSLA '11, pages 805–824, 2011.
[22] L. C. Kats and E. Visser. The Spoofax Language Workbench: Rules for Declarative Specification of Languages and IDEs. In Object Oriented Programming Systems Languages and Applications, OOPSLA '10, pages 444–463. ACM, 2010.
[23] L. C. Kats, E. Visser, and G. Wachsmuth. Pure and Declarative Syntax Definition: Paradise Lost and Regained. In Object Oriented Programming Systems Languages and Applications, OOPSLA '10, pages 918–932. ACM, 2010.
[24] P. Klint, T. van der Storm, and J. Vinju. RASCAL: a Domain Specific Language for Source Code Analysis and Manipulation. SCAM '09. IEEE, 2009.
[25] D. E. Knuth. On the Translation of Languages from Left to Right. Information and Control, 8(6):607–639, 1965.
[26] P. J. Landin. The Next 700 Programming Languages. Commun. ACM, 9(3):157–166, Mar. 1966.
[27] P. M. Lewis, D. J. Rosenkrantz, and R. E. Stearns. Attributed Translations. J. Comput. Syst. Sci., 9(3):279–307, 1974.
[28] S. Marlow. Haskell 2010 Language Report, 2010.
[29] A. Melnikov. Collected Extensions to IMAP4 ABNF, 2006.
[30] Microsoft Corp. C# Language Specification 5.0. 2013.
[31] T. Parr, S. Harwell, and K. Fisher. Adaptive LL(*) Parsing: The Power of Dynamic Analysis. In Object Oriented Programming Systems Languages and Applications, OOPSLA '14, pages 579–598. ACM, 2014.

[32] D. J. Salomon and G. V. Cormack. Scannerless NSLR(1) Parsing of Programming Languages. In Programming Language Design and Implementation, PLDI '89, pages 170–178, 1989.
[33] E. Scott and A. Johnstone. GLL Parse-tree Generation. Science of Computer Programming, 78(10):1828–1844, Oct. 2013.
[34] E. Scott, A. Johnstone, and R. Economopoulos. BRNGLR: A Cubic Tomita-style GLR Parsing Algorithm. Acta Informatica, 44(6):427–461, 2007.
[35] M. Tomita. Efficient Parsing for Natural Language. Kluwer Academic Publishers, USA, 1985. ISBN 0898382025.
[36] M. G. J. van den Brand, J. Scheerder, J. J. Vinju, and E. Visser. Disambiguation Filters for Scannerless Generalized LR Parsers. In Compiler Construction, CC '02, pages 143–158. Springer, 2002.
[37] E. R. Van Wyk and A. C. Schwerdfeger. Context-aware Scanning for Parsing Extensible Languages. GPCE '07, pages 63–72. ACM, 2007.
[38] E. Visser. Syntax Definition for Language Prototyping. PhD thesis, University of Amsterdam, 1997.
[39] D. A. Watt. Rule Splitting and Attribute-directed Parsing. In Semantics-Directed Compiler Generation, pages 363–392. 1980.
[40] W. A. Woods. Transition Network Grammars for Natural Language Analysis. Commun. ACM, 13(10):591–606, 1970.

A. GLL Parsing

In this section we first describe GLL parsing, SPPF construction and GSS. Then, we define the semantics of GLL parsing over ATN grammars as a transition relation.

A.1 SPPF

It is known that any parsing algorithm that constructs Tomita-style SPPFs is of unbounded polynomial complexity [17]. To achieve parsing in cubic time and space, GLL uses a binarized SPPF [33] format, which has additional intermediate nodes. Intermediate nodes allow grouping of the symbols of a rule in a left-associative manner, thus allowing the parser to always carry a single node at a time, instead of a list of nodes. This is the key to preserving the cubic bound. The use of intermediate nodes effectively achieves the same as restricting a grammar to have rules of length at most two, but without requiring rewriting the original grammar, and transforming back the resulting derivation trees to the ones of the original grammar.

Definition 1. A binarized SPPF is a compact representation of a parse forest that has the following types of nodes:
• nonterminal nodes of the form (A, i, j) where A is a nonterminal, and i and j are the left and right extents;
• terminal nodes of the form (t, i, j) where t is a terminal, and i and j are the left and right extents;
• packed nodes of the form (L, k) where L is a grammar slot and k is the pivot of the node; and
• intermediate nodes of the form (L, i, j) where L is the grammar slot, and i and j are the left and right extents.

The left and right extents of a node represent the substring in the input associated with the node. As GLL parsing is context-free, nodes with the same label, the same left and the same right extents can be shared. Nonterminal and intermediate nodes have packed nodes as their children. Packed nodes represent a derivation, and can have at most two children, which are non-packed nodes. If a non-packed node is ambiguous, it will have more than one packed node. The pivot of a packed node is the right extent of its left child, and is used to distinguish between packed nodes under a non-packed node.

Figure 20. SPPF (left) and GSS (right) for the input -a+a.

The binarized SPPF resulting from parsing the input string -a+a with the grammar E ::= - E | E + E | a is shown in Figure 20 (left), where packed nodes are depicted with small circles. For a better visualization, we have omitted the labels of packed nodes. The input is ambiguous and has the following two derivations: (-(a+a)) or ((-a)+a). This can be observed by the presence of two packed nodes under the root node. The left and right packed nodes under the root node correspond to the first and second alternatives, respectively.

SPPF construction is delegated to two functions, nodeT and nodeP. The nodeT(t, i, j) function takes a terminal t and two integer values i and j (the left and right extents) and returns an existing node with these properties, or otherwise a new node. nodeP(L, w, z) takes a grammar slot L and two non-packed nodes w and z. nodeP returns an existing non-packed node labeled L with two children w and z. If no such node exists, then a non-packed node labeled L will be created, and w and z are connected to the newly created non-packed node via a packed node. The details of GLL parse tree construction are discussed in [33], and implementation techniques for efficient sharing of nodes are presented in [2, 19].
nonterminal, and i and j are the left and right extents;
• terminal nodes of the form (t, i, j) where t is a terminal, A.2 GSS
and i and j are the left and right extents; At the core of GLL parsing is the Graph-Structured Stack
• packed nodes of the form (L, k) where L is a grammar data structure. We use a variation of GLL parsing that uses a
slot and k is the pivot of the node; and more efficient GSS [2].

Definition 2. A Graph-Structured Stack (GSS) in GLL parsing is a directed graph where
• nodes are of the form (A, i), where A is a nonterminal and i is an input position; and
• edges are of the form (u, L, w, v), where u and v are GSS nodes, L is a grammar slot, and w is an SPPF node recorded on the edge.

GSS was originally developed by Tomita [35] for GLR parsing to merge different LR stacks. Although GLL parsing uses the same term, there are two main differences between GSS in GLL parsing and GLR. First, in GLL parsing GSS represents function calls in recursive-descent parsing, similar to memoization of functions in functional programming, and therefore has the input position at which the nonterminal is called. Second, in GLL parsing GSS allows cycles in the graph that solve the problem of left recursion in recursive-descent parsing.

The GSS resulting from parsing -a+a using the grammar E ::= - E | E + E | a is shown in Figure 20 (right). As can be seen, there is a cycle on all nodes, as they represent the left-recursive calls to E at different input positions. In case of indirect left recursion, there will be a cycle in the GSS involving multiple nodes.

A.3 GLL Parsing over ATN Grammars

In this section, we define GLL parsing over ATN grammars as a transition relation. In contrast to the imperative style used in [2, 33], we use the declarative rules of Figure 21. Such a GLL formulation is concise and easy to extend to support data-dependent grammars. The rules in Figure 21 use notation similar to the one in [2, 33].

                        (R, U, G, P) ⇒ (R′, U′, G′, P′)

Eps:
  p ─ε→ q    n = nodeP(q, w, nodeT(ε, i, i))
  ──────────────────────────────────────────────────────────────────
  ({(p, i, u, w)} ∪ R, U, G, P) ⇒ (R ∪ {(q, i, u, n)}, U, G, P)

Term-1:
  p ─t→ q    I[i] = t    n = nodeP(q, w, nodeT(t, i, i+1))
  ──────────────────────────────────────────────────────────────────
  ({(p, i, u, w)} ∪ R, U, G, P) ⇒ (R ∪ {(q, i+1, u, n)}, U, G, P)

Term-2:
  p ─t→ q    I[i] ≠ t
  ──────────────────────────────────────────────────────────────────
  ({(p, i, u, w)} ∪ R, U, G, P) ⇒ (R, U, G, P)

Call-1:
  p ─A→ q    v = (A, i) ∈ N(G)
  D = {d | (v, y) ∈ P, d = (q, rext(y), u, nodeP(q, w, y)), d ∉ U}
  ──────────────────────────────────────────────────────────────────
  ({(p, i, u, w)} ∪ R, U, G, P) ⇒ (R ∪ D, U ∪ D, G ∪ {(v, q, w, u)}, P)

Call-2:
  p ─A→ q    v = (A, i) ∉ N(G)    D = {(s, i, v, $) | s ∈ S(A)}
  ──────────────────────────────────────────────────────────────────
  ({(p, i, u, w)} ∪ R, U, G, P) ⇒ (R ∪ D, U, G ∪ {(v, q, w, u)}, P)

Ret:
  p ∈ F    D = {d | (u, q, y, v) ∈ G, d = (q, i, v, nodeP(q, y, w)), d ∉ U}
  ──────────────────────────────────────────────────────────────────
  ({(p, i, u, w)} ∪ R, U, G, P) ⇒ (R ∪ D, U ∪ D, G, P ∪ {(u, w)})

Figure 21. GLL parsing over ATN grammars.

The unit of work of a GLL parser is a descriptor. A descriptor is of the form (p, i, u, w), where p is an ATN state representing a grammar slot, u is a GSS node, i is an input position, and w is an SPPF (non-packed) node. A GLL parser maintains a set U that holds descriptors created during parsing and is used to eliminate duplicate descriptors. In addition to U, a set R is used to hold pending descriptors that are to be processed. Note that GLL parsing does not impose any order in which the descriptors in R are processed. Figure 21 defines the semantics of GLL parsing over ATN grammars as a transition relation on a configuration (R, U, G, P), where G represents the GSS (a set of GSS edges), such that N(G) gives the set of GSS nodes, and P is a set of parsing results that are associated with GSS nodes, i.e., a set of elements of the form (u, w).

During parsing, a descriptor is selected and removed from R, represented as {(p, i, u, w)} ∪ R, and given the rules, a deterministic choice is made based on the next transition in the ATN. The first three rules of Figure 21 are straightforward. An ε-transition creates an ε-node (via a call to nodeT) and an intermediate node⁴ (via a call to nodeP), and adds a descriptor for the next grammar slot. The terminal rules (Term-1 and Term-2) try to match terminal t at the current input position, where I is an array representing the input string. If there is a match (Term-1), a terminal node (via nodeT) and an intermediate node (via nodeP) are created, and a descriptor for the next grammar slot is added. If there is no match (Term-2), no descriptor is added.

Call-1 and Call-2 correspond to nonterminal transitions ─A→. Similar to calling a memoized function, a GLL parser first checks if a GSS node (A, i) exists. If such a node exists (Call-1), the parsing results associated with this GSS node are reused. These results are retrieved from P, and for each result, a nonterminal node y, a descriptor d is created (rext returns the right extent of y), and if the same descriptor has not been processed before (d ∉ U), it is added to R. If the GSS node does not exist (Call-2), the call to the nonterminal is made, i.e., for each start state of the nonterminal (s ∈ S(A)), a descriptor is added. Both Call-1 and Call-2 add a new GSS edge to G.

Finally, Ret corresponds to a final grammar slot (a final state in the ATN) in which the parser returns from the current nonterminal call. First, the tuple with the current SPPF node and the current GSS node is added to P. Second, for each outgoing GSS edge of the current GSS node, a descriptor is created and, if the same descriptor has not been processed before (d ∉ U), it is added to R.

⁴ In fact, when the next state is an end state, nodeP creates a nonterminal node, instead of an intermediate node. However, in the current discussion, this is not essential; therefore, we always refer to the result of nodeP as an intermediate node.
used in [2, 33], we use the declarative rules of Figure 21. intermediate node.

20
