Combinator Parsing: A Short Tutorial: S. Doaitse Swierstra
Combinator Parsing: A Short Tutorial: S. Doaitse Swierstra
Tutorial
S. Doaitse Swierstra
1
Combinator Parsing: A Short Tutorial
S. Doaitse Swierstra
January 5, 2009
Abstract
There are numerous ways to implement a parser for a given syntax;
using parser combinators is a powerful approach to parsing which derives
much of its power and expressiveness from the type system and seman-
tics of the host programming language. This tutorial begins with the
construction of a small library of parsing combinators. This library intro-
duces the basics of combinator parsing and, more generally, demonstrates
how domain specific embedded languages are able to leverage the facili-
ties of the host language. After having constructed our small combinator
library, we investigate some shortcomings of the naı̈ve implementation in-
troduced in the first part, and incrementally develop an implementation
without these problems. Finally we discuss some further extensions of the
presented library and compare our approach with similar libraries.
1 Introduction
Parser combinators [2, 4, 8, 13] occupy a unique place in the field of parsing;
they make its possible to write expressions which look like grammars, but ac-
tually describe parsers for these grammars. Most mature parsing frameworks
entail voluminous preprocessing, which read in the syntax at hand, analyse it,
and produce target code for the input grammar. By contrast, a relatively small
parser combinator library can achieve comparable parsing power by harnessing
the facilities of the language. In this tutorial we develop a mature parser com-
binator library, which rivals the power and expressivity of other frameworks in
only a few hundred lines of code. Furthermore it is easily extended if desired
to do so. These advantages follow from the fact that we have chosen to embed
context-free grammar notation into a general purpose programming language,
by taking the Embedded Domain Specific Language (EDSL) approach.
For many areas special purpose programming languages have been defined. The
implementation of such a language can proceed along several different lines.
On the one hand one can construct a completely new compiler, thus having
complete freedom in the choice of syntax, scope rules, type system, commenting
2
conventions, and code generation techniques. On the other hand one can try
to build on top of work already done by extending an existing host language.
Again, here one can pursue several routes; one may extend an existing compiler,
or one can build a library implementing the new concepts. In the latter case
one automatically inherits –but is also limited to– the syntax, type system and
code generation techniques from the existing language and compilers for that
language. The success of this technique thus depends critically on the properties
of the host language.
With the advent of modern functional languages like Haskell [11] this approach
has become a really feasible one. By applying the approach to build a combi-
nator parsing library we show how Haskell’s type system (with the associated
class system) makes this language an ideal platform for describing EDSLs . Be-
sides being a kind of user manual for the constructed library this tutorial also
serves as an example of how to use Haskell concepts in developing such a library.
Lazy evaluation plays a very important role in the semantics of the constructed
parsers; thus, for those interested in a better understanding of lazy evaluation,
the tutorial also provides many examples. A major concern with respect to
combinator parsing is the ability or need to properly define and use parser com-
binators so that functional values (trees, unconsumed tokens, continuations,
etc.) are correctly and efficiently manipulated.
In Sect. 2 we develop, starting from a set of basic combinators, a parser combina-
tor library, the expressive power of which extends well above what is commonly
found in EBNF -like formalisms. In Sect. 3 we present a case study, describing
a sequence of ever more capable pocket calculators. Each subsequent version
gives us a chance to introduce some further combinators with an example of
their use.
Sect. 4 starts with a discussion of the shortcomings of the naı̈ve implementation
which was introduced in Sect.2, and we present solutions for all the identified
problems, while constructing an alternative, albeit much more complicated li-
brary. One of the remarkable points here is that the basic interface which was
introduced in Sect. 2 does not have to change, and that –thanks to the facilities
provided by the Haskell class system– all our derived combinators can be used
without having to be modified.
In Section 5 we investigate how we can use the progress information, which we
introduced to keep track of the progress of parsing process, introduced in Sect.
4 to control the parsing process and how to deal with ambiguous grammars. In
Sect. 6 we show how to use the Haskell type and class system to combine parsers
which use different scanner and symbol type intertwined. In Sect. 7 we extend
our combinators with error reporting properties and the possibility to continue
with the parsing process in case of erroneous input. In Sect. 8 we introduce a
class and a set of instances which enables us to make our expressions denoting
parsers resemble the corresponding grammars even more. In Sect. 9 we touch
upon some important extensions to our system which are too large to deal with
in more detail, in Sect. 10 we provide a short comparison with other similar
3
libraries and conclude.
and when embedding grammars things are no different. The basic grammat-
ical concepts are terminal and non-terminal symbols, or terminals and non-
terminals for short. They can be combined by sequential composition (multiple
constructs occurring one after another) or by alternative composition ( a choice
from multiple alternatives).
Note that one may argue that non-terminals are actually not primitive, but
result from the introduction of a naming scheme; we will see that in the case
of parser combinators, non-terminals are not introduced as a separate concept,
but just are Haskell names referring to values which represent parsers.
4
Parsers do not need to consume the entire input list. Thus, apart from the tree,
they also need to return the part of the input string that was not consumed:
type Parser s t = [s ] → (t, [s ])
The symbol list [s ] can be thought of as a state that is transformed by the
function while building the tree result.
Parsers can be ambiguous: there may be multiple ways to parse a string. Instead
of a single result, we therefore have a list of possible results, each consisting of
a parser tree and unconsumed input:
type Parser s t = [s ] → [(t, [s ])]
This idea was dubbed by Wadler [17] as the “list of successes” method, and it
underlies many backtracking applications. An added benefit is that a parser
can return the empty list to indicate failure (no successes). If there is exactly
one solution, the parser returns a singleton list.
Wrapping the type with a constructor P in a newtype definition we get the
actual Parser type that we will use in the following sections:
newtype Parser s t = P ([s ] → [(t, [s ])])
unP (P p) = p
5
otherwise → []
)
Since we want to inspect elements from the input with terminal symbols of type
s, we have added the Eq s constraint, which gives us access to equality (≡) for
values of type s. Note that the function pSym by itself is strictly speaking not a
parser, but a function which returns a parser. Since the argument is a run-time
value it thus becomes possible to construct parsers at run-time.
One might wonder why we have incorporated the value of s in the result, and
not the value a? The answer lies in the use of Eq s in the type of Parser ; one
should keep in mind that when ≡ returns True this does not imply that the
compared values are guaranteed to be bit-wise equal. Indeed, it is very common
for a scanner –which pre-processes a list of characters into a list of tokens to be
recognised– to merge different tokens into a single class with an extra attribute
indicating which original token was found. Consider e.g. the following Token
type:
data Token = Identifier -- terminal symbol used in parser
| Ident String -- token constructed by scanner
| Number Int
| If Symbol
| Then Symbol
Here, the first alternative corresponds to the terminal symbol as we find it in
our grammar: we want to see an identifier and from the grammar point of view
we do not care which one. The second alternative is the token which is returned
by the scanner, and which contains extra information about which identifier was
actually scanned; this is the value we want to use in further semantic processing,
so this is the value we return as witness from the parser. That these symbols
are the same, as far as parsing is concerned, is expressed by the following line
in the definition of the function ≡:
instance Eq Token where
(Ident ) ≡ Identifier = True
...
If we now define:
pIdent = pSym Identifier
we have added a special kind of terminal symbol.
The second basic combinator we introduce in this subsection is pReturn, which
corresponds to the -production. The function always succeeds and as a witness
returns its parameter; as we will see the function will come in handy when
building composite witnesses out of more basic ones. The name was chosen to
resemble the monadic return function, which injects values into the monadic
computation:
pReturn :: a → Parser s a
6
pReturn a = P (λinp → [(a, inp)])
We could have chosen to let this function always return a specific value (e.g.
()), but as it will turn out the given definition provides a handy way to inject
values into the result of the overall parsing process.
The final basic parser we introduce is the one which always fails:
pFail = P (const [ ])
One might wonder why one would need such a parser, but that will become
clear in the next section, when we introduce pChoice.
The resulting function returns all possible values v1 v2 with remaining state ss 2 ,
where v1 is a witness value returned by parser p1 with remaining state ss 1 . The
state ss 1 is used as the starting state for the parser p2 , which in its turn returns
7
the witnesses v2 and the corresponding final states ss 2 . Note how the types of
the parsers were chosen in such a way that the value of type v1 v2 matches the
witness type of the composed parser.
As a very simple example, we give a parser which recognises the letter ’a’ twice,
and if it succeeds returns the string "aa":
pString aa = (pReturn (:) <∗> pLettera)
<∗>
(pReturn (λx → [x ]) <∗> pLettera)
Let us take a look at the types. The type of (:) is a → [a ] → [a ], and
hence the type of pReturn (:) is Parser s (a → [a ] → [a ]). Since the type
of pLettera is Parser Char Char , the type of pReturn (:) <∗> pLettera is
Parser Char ([Char ] → [Char ]). Similarly the type of the right hand side
operand is Parser Char [Char ], and hence the type of the complete expression
is Parser Char [Char ]. Having chosen <∗> to be left associative, the first pair
of parentheses may be left out. Thus, many of our parsers will start out by
producing some function, followed by a sequence of parsers each providing an
argument to this function.
Besides sequential composition we also need choice. Since we are using lists to
return all possible ways to succeed, we can directly define the operator <|> by
returning the concatenation of all possible ways in which either of its arguments
can succeed:
(<|>) :: Parser s a → Parser s a → Parser s a
P p1 <|> P p2 = P (λinp → p1 inp ++ p2 inp)
Now we have seen the definition of <|>, note that pFail is both a left and a
right unit for this operator:
pFail <|> p ≡ p ≡ p <|> pFail
which will play a role in expressions like
pChoice ps = foldr (<|>) pFail ps
One of the things left open thus far is what precedence level these newly in-
troduced operators should have. It turns out that the following minimises the
number of parentheses:
infixl 5 <∗>
infixr 3 <|>
As an example to see how this all fits together, we write a function which recog-
nises all correctly nested parentheses – such as "()(()())"– and returns the
maximal nesting depth occurring in the sequence. The language is described by
the grammar S → ’(’ S ’)’ S | , and its transcription into parser combinators
reads:
parens :: Parser Char Int
8
parens = pReturn (λ b d → (1 + b) ‘max ‘ d )
<∗> pSym ’(’ <∗> parens <∗> pSym ’)’ <∗> parens
<|> pReturn 0
Since the pattern pReturn ... <∗> will occur quite often, we introduce a third
combinator, to be defined in terms of the combinators we have seen already.
The combinator <$> takes a function of type b → a, and a parser of type
Parser s b, and builds a value of type Parser s a, by applying the function to
the witness returned by the parser. Its definition is:
infix 7 <$>
(<$>) :: (b → a) → (Parser s b) → Parser s a
f <$> p = pReturn f <∗> p
Using this new combinator we can rewrite the above example into:
parens = (λ b d → (1 + b) ‘max ‘ d )
<$> pSym ’(’ <∗> parens <∗> pSym ’)’ <∗> parens
<|> pReturn 0
Notice that left argument of the <$> occurrence has type a → (Int → (b →
(Int → Int))), which is a function taking the four results returned by the parsers
to the right of the <$> and constructs the result sought; this all works because
we have defined <∗> to associate to the left.
Although we said we would restrict ourselves to productions of length 2, in fact
we can just write productions containing an arbitrary number of elements. Each
extra occurrence of the <∗> operator introduces an anonymous non-terminal,
which is used only once.
Before going into the development of our library, there is one nasty point to
be dealt with. For the grammar above, we could have just as well chosen
S → S ’(’ S ’)’ | , but unfortunately the direct transcription into a parser
would not work. Why not? Because the resulting parser is left recursive: the
parser parens will start by calling itself, and this will lead to a non-terminating
parsing process. Despite the elegance of the parsers introduced thus far, this
is a serious shortcoming of the approach taken. Often, one has to change the
grammar considerably to get rid of the left-recursion. Also, one might write left-
recursive grammars without being aware of it, and it will take time to debug the
constructed parser. Since we do not have an off-line grammar analysis, extra
care has to be taken by the programmer since the system just does not work as
expected, without giving a proper warning; it may just fail to produce a result
at all, or it may terminate prematurely with a stack-overflow.
9
2.4 Special Versions of Basic Combinators: <∗, ∗>, <$ and
opt.
As we see in the parens program the values witnessing the recognition of the
parentheses themselves are not used in the computation of the result. As this
situation is very common we introduce specialised versions of <$> and <∗>: in
the new operators <$, <∗ and ∗>, the missing bracket indicates which witness
value is not to be included in the result:
infixl 3 ‘opt‘
infixl 5 <∗, ∗>
infixl 7 <$
f <$ p = const <$> pReturn f <∗> p
p <∗ q = const <$> p <∗> q
p ∗> q = id <$ p <∗> q
We use this opportunity to introduce two further useful functions, opt and
pParens and reformulate the parens function:
pParens :: Parser s a → Parser s a
pParens p = id <$ pSym ’(’ <∗> p <∗ pSym ’)’
opt :: Parser s a → a → Parser s a
p ‘opt‘ v = p <|> pReturn v
parens = (max .(1+)) <$> pParens parens <∗> parens ‘opt‘ 0
As a final combinator, which we will use in the next section, we introduce a
combinator which creates the parser for a specific keyword given as its param-
eter:
pSyms [ ] = pReturn [ ]
pSyms (x : xs) = (:) <$> pSym x <∗> pSyms xs
10
3.1 Recognising a Digit
Our first calculator is extremely simple; it requires a digit as input and returns
this digit. As a generalisation of the combinator pSym we introduce the com-
binator pSatisfy: it checks whether the current input token satisfies a specific
predicate instead of comparing it with the expected symbol:
pDigit = pSatisfy (λx → ’0’ 6 x ∧ x 6 ’9’)
pSatisfy :: (s → Bool ) → Parser s s
pSatisfy p = P (λinp → case inp of
(x : xs) | p x → [(x , xs)]
otherwise → []
)
pSym a = pSatisfy (≡ a)
A demo run now reads:
In the next version we slightly change the type of the parser such that it returns
an Int instead of a Char , using the combinator <$>:
pDigitAsInt :: Parser Char Int
pDigitAsInt = (λc → fromEnum c − fromEnum ’0’) <$> pDigit
11
into the Int value:
pNatural :: Parser Char Int
pNatural = foldl (λa b → a ∗ 10 + b) 0 <$> pMany1 pDigitAsInt
From here it is only a small step to recognising signed numbers. A − sign in
front of the digits is mapped onto the function negate, and if it is absent we use
the function id :
pInteger :: Parser Char Int
pInteger = (negate <$ (pSyms "-") ‘opt‘ id ) <∗> pNatural
12
flip (+) <$ pSyms "+"
) <∗> pInteger
)
Since we will use this pattern often we abstract from it and introduce a parser
combinator pChainL, which takes two arguments:
13
p = q <∗∗> (flip f <$> r <|> flip g <$> s)
In some cases, the element s is missing from the second alternative, and for such
situations we have the combinator <??>:
(<??>) :: Parser s a → Parser s (a → a) → Parser s a
p <??> q = p <∗∗> (q ‘opt‘ id )
Let us now return to the code for pChainR. Our first attempt reads:
pChainR op p = id <$> p
<|> flip ($) <$> p <∗> (flip <$> op <∗> pChainR op p)
which can, using the refactoring method, be expressed more elegantly by:
pChainR op p = r where r = p <??> (flip <$> op <∗> r )
14
definition of pPlusMinusTimes:
pPlusMinusTimes = pChainL addops (pChainL mulops pInteger )
in which we recognise a foldr :
pPlusMinusTimes = foldr pChainL pInteger [addops, mulops ]
Now it has become straightforward to add new operators: just add the new
operator, with its semantics, to the corresponding level of precedence. If its
precedence lies between two already existing precedences, then just add a new
list between these two levels. To complete the parsing of expressions we add the
recognition of parentheses:
pPack :: Eq s ⇒ [s ] → Parser s a → [s ] → Parser s a
pPack o p c = pSyms o ∗> p <∗ pSyms c
pExpr = foldr pChainL pFactor [addops, mulops ]
pFactor = pInteger <|> pPack "(" pExpr ")"
15
3.7 Monadic Interface: Monad and pTimes
The parsers described thus far have the expressive power of context-free gram-
mars. We have introduced extra combinators to capture frequently occurring
grammatical patterns such as in the EBNF extensions. Because parsers are
normal Haskell values, which are computed at run-time, we can however go be-
yond the conventional context-free grammar expressiveness by using the result
of one parser to construct the next one. An example of this can be found in the
recognition of XML-based input. We assume the input be a tree-like structure
with tagged nodes, and we want to map our input onto the data type XML.
To handle situations like this we make our parser type an instance of the class
Monad :
instance Monad (Parser s) where
return = pReturn
P pa >>= a2pb = P (λinput → [b input 00 | (a, input 0 ) ← pa input
, b input 00 ← unP (a2pb a) input 0 ]
)
data XML = Tag String [XML] | Leaf String
pXML = do t ← pOpenTag
Tag t <$> pMany pXML <∗ pCloseTag t
<|> Leaf <$> pLeaf
pTagged p = pPack "<" p ">"
pOpenTag = pTagged pIdent
pCloseTag t = pTagged (pSym ’/’ ∗> pSyms t)
pLeaf = ...
pIdent = pMany1 (pSatisfy (λc → ’a’ 6 c ∧ c 6 ’z’))
A second example of the use of monads is in the recognition of the language
{an bn cn |n >= 0}, which is well known not to be context-free. Here, we use the
number of ’a’’s recognised to build parsers that recognise exactly that number
of ’b’’s and ’c’’s. For the result, we return the original input, which has now
been checked to be an element of the language:
pABC = do as ← pMany (pSym ’a’)
let n = length as
bs ← p n Times n (pSym ’b’)
cs ← p n Times n (pSym ’c’)
return (as ++ bs ++ cs)
p n Times :: Int → Parser s a → Parser s [a ]
p n Times 0 p = pReturn [ ]
p n Times n p = (:) <$> p <∗> p n Times (n − 1) p
3.8 Conclusions
16
We have now come to the end of our introductory section, in which we have
introduced the idea of a combinator language and have constructed a small
library of basic and non-basic combinators. It should be clear by now that
there is no end to the number of new combinators that can be defined, each
capturing a pattern recurring in some input to be recognised. We finish this
section by summing up the advantages of using an EDSL.
17
forwardly in a lazily evaluated functional language; inherited attributes
become parameters and synthesized attributes become part of the result
of the functions giving the semantics of a construct [15, 14].
Of course there are also downsides to the embedding approach. Although the
programmer is thinking he writes a program in the embedded language, he is
still programming in the host language. As a result of this, error messages from
the type system, which can already be quite challenging in Haskell, are phrased
in terms of the host language constructs too, and without further measures the
underlying implementation shines through. In the case of our parser combi-
nators, this has as a consequence that the user is not addressed in terms of
terminals, non-terminals, keywords, and productions, but in terms of the types
implementing these constructs.
There are several ways in which this problem can be alleviated. In the first
place, we can try to hide the internal structure as much as possible by using a
lot of newtype constructors, and thus defining the parser type by:
newtype Parser 0 s a = Parser 0 ([s ] → [(a, [s ])])
A second approach is to extend the type checker of Haskell such that the gen-
erated error messages can be tailored by the programmer. Now, the library
designer not only designs his library, but also the domain specific error mes-
sages that come with the library. In the Helium compiler [5], which handles a
subset of Haskell, this approach has been implemented with good results. As
an example, one might want to compare the two error messages given for the
incorrect program in Fig. 1. In Fig. 2 we see the error message generated by a
version of Hugs, which does not even point near the location of the error, and
in which the internal representation of the parsers shines through. In Fig. 3,
taken from [6], we see that Helium, by using a specialised version of the type
rules which are provided by the programmer of the library, manages to address
the application programmer in terms of the embedded language; it uses the
word parser and explains that the types do not match, i.e. that a component
is missing in one of the alternatives. A final option in the Helium compiler is
the possibility to program the search for possible corrections, e.g. by listing
functions which are likely to be confused by the programmer (such as <∗> and
<∗ in programming parsers, or : and ++ by beginning Haskell programmers).
As we can see in Fig. 4 we can now pinpoint the location of the mistake even
better and suggest corrective actions.
4 Improved Implementations
Since the simple implementation which was used in section 2 has quite a number
of shortcomings we develop in this section a couple of alternative implementa-
tions of the basic interface. Before doing so we investigate the problems to be
solved, and then deal with them one by one.
18
data Expr = Lambda Patterns Expr -- can contain more alternatives
type Patterns = [Pattern]
type Pattern = String
Compiling Example.hs
(7,6): The result types of the parsers in the operands of <|> don’t match
left parser : pAndPrioExpr
result type : Expr
right parser : Lambda <$ pSyms "\\" <*> many pVarid <* pSyms "->"
<* pExpr
result type : Expr -> Expr
Compiling Example.hs
(11,13): Type error in the operator <*
probable fix: use <*> instead
Figure 4: Helium, version 1.1 (type rules extension and sibling functions)
19
4.1 Shortcomings
4.1.1 Error reporting
One of the first things someone notices when starting to use the library is that
when erroneous input is given to the parser the result is [ ], indicating that it is
not possible to get a correct parse. This might be acceptable in situations where
the input was generated by another program and is expected to be correct, but
for a library to be used by many in many different situations this is unacceptable.
At least one should be informed about the position in the input where the parser
got stuck, and what symbols were expected.
Although this is nowadays less common, it would be nice if the parser could
apply (mostly small) error repairing steps, such as inserting a missing closing
parenthesis or end symbol. Also spurious tokens in the input stream might be
deleted. Of course the user should be properly informed about the steps which
were taken in order to be able to proceed parsing.
20
4.1.4 Space Consumption
4.1.5 Conclusions
Although the problems at first seem rather unrelated they are not. If we want to
have an online result this implies that we want to start processing a result with-
out knowing whether a complete parse can be found. If we add error correction
we actually change our parsers from parsers which may fail to parsers which
will always succeed (i.e. return a result), but probably with an error message.
In solving the problems mentioned we will start with the space consumption
problem, and next we change the implementation to produce online results. As
we will see special measures have to be taken to make the described parsers
instances of the class Monad .
We will provide the full code in this tutorial. Unfortunately when we add error
reporting and error correction our way of presenting code in an incremental way
leads to code duplication. So we will deal with the last two issues separately in
Sect. 7.
21
4.2.1 Applicative
Since the basic interface is useable beyond the basic parser combinators from
Sect. 3 we introduce a class for it: Applicative. 1
class Applicative p where
(<∗>) :: p (b → a) → p b → p a
(<|>) :: p a →pa→pa
(<$>) :: (b → a) → p b → p a
pReturn :: a →pa
pFail :: pa
f <$> p = pReturn f <∗> p
instance Applicative p ⇒ Functor p where
fmap = (<$>)
Although for parsing the input is just a sequence of terminal symbols, in practice
the situation is somewhat different. We assume our grammars are defined in
terms of terminal symbols, whereas we can split our input state into the next
token and a a new state. A token may contain extra position information or more
detailed information which is not relevant for the parsing process. We have seen
an example already of the latter; when parsing we may want to see an identifier,
but it is completely irrelevant which identifier is actually recognised. Hence we
want check whether the current token matches with an expected symbol. Of
course these values do not have to be of the same type. We capture the relation
between input tokens and terminal symbols by the class Describes:
class symbol ‘Describes‘ token where
eqSymTok :: symbol → token → Bool
The function pSym takes as parameter a terminal symbol , but returns a parser
which has as its witness an input token. Because we again will have many
different implementations we make pSym a member of a class too.
class Symbol p symbol token where
pSym :: symbol → p token
1 We do not use the class Applicative from the module Control.Applicative, since it pro-
vides standard implementations for some operations for which we want to give optimized
implementations, as the possibility arises.
22
4.2.4 Generalising the Input: Provides
In the previous section we have taken the input to be a list of tokens. In reality
this may also be a too simple approach. We may e.g. want to maintain position
information, or extra state which can be manipulated by special combinators.
From the parsing point of view the thing that matters is that the input state
can provide a token on demand if possible:
class Provides state symbol token | state symbol → token where
splitState :: symbol → state → Maybe (token, state)
We have decided to pass the expected symbol to the function splitState. Since
we will also be able to switch state type we have decided to add a functional
dependecy state symbol → token, stating that the state together with the ex-
pected symbol type determines how a token is to be produced. We can thus
switch from one scanning stategy to another by passing a symbol of a different
type to pSym!
We will often have to check whether we have read the complete input, and
thus we introduce a class containing the function eof (end-of-file) which tells us
whether more tokens have to be recognised:
class Eof state where
eof :: state → Bool
Because our parsers will all have different interfaces we introduce a function
parse which knows how to call a specific parser and how to retrieve the result:
class Parser p where
parse :: p state a → state → a
The instances of this class will serve as a typical example of how to use a
parser of type p from within a Haskell program. For specific implementations
of p, and in specific circumstances one may want to vary on the given standard
implementations.
23
4. the type Pm (‘monad parsers’) in subsection 4.7.
All four types will be polymorphic, having two type parameters: the type of the
state, and the type of the witness of the correct parse. This is a digression from
the parser type in Sect. 2, which was polymorphic in the symbol type and the
witness type.
All four types will be functions, which operate on a state rather than a list
of symbols. The state type must be an instance of Provides together with a
symbol and a token type, and the symbol and the token must be an instance of
Describes.
A further digression from section 2 is that the parsers in this section are not
ambiguous. Instead of a list of successes, they return a single result.
As a final digression, the result type of the parsers is not a pair of a witness
and a final state, but a witness only wrapped in a Steps datatype. The Steps
datatype will be introduced below. It is an encoding of whether there is failure
or success, and in the case of success, how much input was consumed.
As we explained before, the list-of-successes method basically is a depth-first
search technique. If we manage to change this depth-first approach into a
breath-first approach, then there is no need to hang onto the complete input
until we are finished parsing. If we manage to run all alternative parsers in
parallel we can discard the current input token once it has been inspected by
all active parsers, since it will never be inspected again.
Haskell’s lazy evaluation provides a nice way to drive all the active alternatives
in a step by step fashion. The main ingredient for this process is the data type
Steps, which plays a crucial role in all our implementations, and describes the
type of values constructed by all parsers to come. It can be seen as a lazily
constructed trace representing the progress of the parsing process.
data Steps a where
Step :: Steps a → Steps a
Fail :: Steps a
Done :: a → Steps a
Instead of returning just a witness from the parsing process we will return a
nested application of Step’s, which has eventually a Fail constructor indicating
a failed branch in our breadth-first search, or a Done constructor which indicates
that parsing completed successfully and presents the witness of that parse. For
each successfully recognised symbol we get a Step constructor in the resulting
Steps sequence; thus the number of Step constructors in the result of a parser
tells us up to which point in the input we have successfully proceeded, and more
specifically if the sequence ends in a Fail the number of Step-constructors tell
us where this alternative failed to proceed.
The function driving our breadth-first behaviour is the function best, which
compares two Steps sequences and returns the “best” one:
24
best :: Steps a → Steps a → Steps a
Fail ‘best‘ r =r
l ‘best‘ Fail =l
(Step l ) ‘best‘ (Step r ) = Step (l ‘best‘ r )
‘best‘ = error "incorrect parser"
The last alternative covers all the situations, where either one parser completes
and another is still active (Step ‘best‘Done,Done ‘best‘Step), or where two active
parsers complete at the same time (Done / Done) as a result of an ambiguity in
the grammar. For the time being we assume that such situations will not occur.
The alternative which takes care of the conversion from depth-first to breadth-
first is the one in which both arguments of the best function start with a Step
constructor. In this case we discover that both alternatives can make progress,
so the combined parser can make progress by immediately returning a Step
constructor; we do however not decide nor reveal yet which alternative even-
tually will be chosen. The expression l ‘best‘ r in the right hand side is lazily
evaluated, and only unrolled further when needed, i.e. when further pattern
matching takes place on this value, and that is when all Step constructors cor-
responding to the current input position have been merged into a single Step.
The sequence associated with this Step constructor is internally an expression,
consisting of further calls to the function best. Later we will introduce more
elaborate versions of this type Steps, but the idea will remain the same, and
they will all exhibit the breadth-first behaviour.
In order to retrieve a value from a Steps value we write a function eval which
retrieves the value remembered by the Done at the end of the sequence, provided
it exists:∗
eval :: Steps a → a
eval (Step l ) = eval l
eval (Done v ) = v
eval Fail = error "should not happen"
4.4 Recognisers
After the preparatory work introducing the Steps data type, we introduce our
first ‘parser’ type, which we will dubb recogniser since it will not present a
witness; we concentrate on the recognition process only. The type of R is
polymorphic in two type parameters: st for the state, and a for the witness of the
correct parse. Basically a recogniser is a function taking a state and returning
Steps. This Steps value starts with the steps produced by the recogniser itself,
but ends with the steps produced by a continuation which is passed as the first
argument to the recogniser:
newtype R st a = R (∀ r .(st → Steps r ) → st → Steps r )
unR (R p) = p
25
Note that the type a is not used in the right hand side of the definition. To
make sure that the recognisers and the parsers have the same kind we have
included this type parameter here too; besides making it possible to to make
use of all the calsses we introduce for parsers it also introduces extra check on
the wellformedness of recognisers. Furthermore we can now, by provinding a
top level type specification use the same expression to just recognise something
or to parse with building a result.
We can now make R an instance of Applicative, that is implement the five classic
parser combinators for it. Note that the parameter f of the operator <$> is
irnored, since it does not play a role in the reognition process, and the same
holds for the parameter a of pReturn.
instance Applicative (R st) where
R p <∗> R q = R (λk st → p (q k ) st)
R p <|> R q = R (λk st → p k st ‘best‘ q k st)
f <$> R p = R p
pReturn a = R (λk st → k st)
pFail = R (λk st → Fail )
We have abstained from giving point-free definitions, but one can easily see that
sequential composition is essentially function composition, and that pReturn is
the identity function wrapped in a constructor.
Next we provide the implementation of pSym, which resembles the definition
in the basic library. Note that when a symbol is succesfully recognised this is
reflected by prefixing the result of the call to the continuation with a Step:
instance (symbol ‘Describe‘ s token, Provides state symbol token)
⇒ Symbol (R state) symbol token where
pSym a = R (λk h st → case splitState a st of
Just (t, ss) → if a ‘eqSymTok ‘ t
then Step (k ss)
else Fail
Nothing → Fail )
26
passed the history extended with the newly recognised witness, to produce the
final result from the rest of the input.
In the type Ph , we have local type variables for the type of the history h, and
the type of the witness of the final result r :
newtype Ph st a = Ph (∀ r h . ((h, a) → st → Steps r )
→ h → st → Steps r )
unP h (Ph p) = p
We can now make Ph an instance of Applicative, that is, implement the five
classic parser combinators for it.
In the definition of pReturn, we encode that the history parameter is indeed a
stack, growing to the right and implemented as nested pairs. The new witness
is pushed on the history stack, and passed on to the continuation k .
In the definition of f <$>, the continuation is modified to sneak in the application
of the function f .
In the definition of alternative composition <|>, we call both parsers and exploit
the fact that they both return Steps, of which we can take the best. Of course,
best only lazily unwraps both Steps up to the point where one of them fails.
In the definition of sequential composition <∗>, the continuation-passing style
is again exploited: we call p, passing it q as continuation, which in turn takes a
modification of the original continuation k . The modification is that two values
are popped from the history stack: the witness b from parser q, and the witness
b2a from parser p; and a new value b2a b is pushed onto the history stack which
is passed to the orginal continuation k :
instance Applicative (Ph state) where
Ph p <∗> Ph q = Ph (λk → p (q apply h )
where apply h = λ((h, b2a), b) → k (h, b2a b))
Ph p <|> Ph q = Ph (λk h st → p k h st ‘best‘ q k h st)
f <$> Ph p = Ph (λk → p $ λ(h, a) → k (h, f a))
pFail = Ph (λk → Fail )
pReturn a = Ph (λk h → k (h, a) )
Note that we have given a new definition for <$>, which is slightly more efficient
than the default one; instead of pushing the function f on the stack with a
pReturn and popping it off later, we just apply it directly to recognised result of
the parser p. In Fig. 5 we have given a pictorial representation of the flow of data
associated with this parser type. The top arrows, flowing right, correspond to
the accumulated history, and the arrows directly below them to the state which
is passed on. The bottom arrows, flowing left, correspond to the final result
which is returned through all the continuation calls.
In a slightly different formulation the stack may be respresented implicitly using
extra continuation functions. From now on we will use a somewhat simpler
type for P − h and thus we also provide a new instance definition for the class
27
h b2a (h, b2a) b ((h, b2a), b) (h, b2a b)
p q apply
Steps r Steps r Steps r
Since we will later be adding error recovery to the parsers constructed in this
chapter, which will turn every illegal input into a legal one, we will assume in
this section that there exists always precisely one way of parsing the input. If
there is more than one way then we have to deal with ambiguities, which we
will also show how to deal with in section 5.
28
4.6 Producing Results Online
The next problem we are attacking is producting the result online. The history
parser accumulates its result in an extra argument, only to be inserted at the end
of the parsing process with the Done constructor. In this section we introduce
the counterpart of the history parser, the future parser, which is named this
way because the “stack” we are maintaining contains elements which still have
to come into existence. The type of future parsers is:
newtype Pf st a = Pf (∀ r .(st → Steps r ) → st → Steps (a, r ))
unP f (Pf p) = p
We see that the history parameter has disappeared and that the parameter of
the Steps type now changes; instead of just passing the result constructed by
the call to the continuation unmodified to the caller, the constructed witness a
is pushed onto the stack of results constructed by the continuation; this stack is
made an integral part of the data type Steps by not only representing progress
information but also constructed values in this sequence.
In our programs we will make the stack grow from the right to the left; this
maintains the suggestion introduced by the history parsers that the values to
the right correspond to input parts which are located further towards the end
of the input stream (assuming we read the stream from left to right). One
way of pushing such a value on the stack would be to traverse the whole future
sequence until we reach the Done constructor and then adding the value there,
but that makes no sense since then the result again will not be available online.
Instead we extend our Steps data type with an extra constructor. We remove
the Done constructor, since it can be simulated with the new Apply constructor.
The Apply constructor makes it possible to store function values in the progress
sequence:
data Steps a where
Step :: Steps a → Steps a
Fail :: Steps a
Apply :: (b → a) → Steps b → Steps a
eval :: Steps a → a
eval (Step l ) = eval l
eval (Fail ls) = error "no result"
eval (Apply f l ) = f (eval l )
As we have seen in the case of the history parsers there are two operations we
perform on the stack: pushing a value, and popping two values, applying the
one to the other and pushing the result back. For this we define two auxiliary
functions:
push :: v → Steps r → Steps (v , r )
push v = Apply (λs → (v , s))
apply f :: Steps (b → a, (b, r )) → Steps (a, r )
29
apply f = Apply (λ(b2a, ˜(b, r )) → (b2a b, r ))
One should not confuse the Apply constructor with the apply f function. Keep
in mind that the Apply constructor is a very generally applicable construct
changing the value (and possibly the type) represented by the sequence by
prefixing the sequence with a function value, whereas the apply f function takes
care of combining the values of two sequentially composed parsers by applying
the result of the first one to the result of the second one. An important rôle
is played by the ˜-symbol. Normally, Haskell evaluates arguments to functions
far enough to check that it indeed matches the pattern. The tilde prevents
this by making Haskell assume that the pattern always matches. Evaluation
of the argument is thus slightly more lazy, which is critically needed here: the
function b2a can already return that part of the result for which evaluation of
its argument is not needed!
The code for the the function best now is a bit more involved, since there are
extra cases to be taken care of: a Steps sequence may start with an Apply step.
So before calling the actual function best we make sure that the head of the
stream is one of the constructors that indicates progress, i.e. a Step or Fail
constructor. This is taken care of by the function norm which pushes Apply
steps forward into the progress stream until a progress step is encountered:
norm :: Steps a → Steps a
norm (Apply f (Step l )) = Step (Apply f l )
norm (Apply f Fail ) = Fail
norm (Apply f (Apply g l )) = norm (Apply (f .g) l )
norm steps = steps
Our new version of best now reads:
l ‘best‘ r = norm l ‘best 0 ‘ norm r
where Fail ‘best 0 ‘ r =r
l ‘best 0 ‘ Fail =l
(Step l ) ‘best 0 ‘ (Step r ) = Step (l ‘best‘ r )
‘best 0 ‘ = Fail
We as well make Pf an instance of Applicative:
instance Applicative (Pf st) where
Pf p <∗> Pf q = Pf (λk st → apply f (p (q k ) st))
Pf p <|> Pf q = Pf (λk st → p k st ‘best‘ q k st )
pReturn a = Pf (λk st → push a (k st) )
pFail = Pf (λ → Fail )
Just as we did for the history parsers we again provide a pictorial representation
of the data flow in case of a sequential composition <∗> in Fig. 6:
Also the definitions of pSym and parse pose no problems. The only question is
what to take as the initial value of the Steps sequence. We just take ⊥, since
the types guarantee that it will never be evaluated. Notice that if the parser
constructs the value b, then the result of the call to the parser in the function
30
(b2a b, f ) apply (b2a, (b, f )) p b2a (b, f ) q b f
parse will be (b, ⊥) of which we select the first component after converting the
returned sequence to the value represented by it.
instance (symbol ‘Describes‘ token, state ‘Provides‘ token)
⇒ Symbol (Pf state) symbol token where
pSym a = Pf (λk st → case splitState a st of
Just (t, ss) → if a ‘eqSymTok ‘ t
then Step (push t (k ss))
else Fail
Nothing → Fail
)
instance Eof state ⇒ Parser (Pf state)
where
parse (Pf p) = fst.eval .p (λinp → if eof inp then ⊥ else error "end")
31
pv
(qv , f ) p pv (qv , f ) q qv
)
return = pReturn
Unfortunately execution of the above code may lead to a black hole, i.e. a non-
terminating computation, as we will explain with the help of Fig. 7. Problems
occur when inside p we have a call to the function best which starts to compare
two result sequences. Now suppose that in order to make a choice the parser p
does not provide enough information. In that case the continuation q is called
once for each branch of the choice process, in order to provide further steps of
which we hope they will lead us to a decision. If we are lucky the value of pv is
not needed by q pv in order to provide the extra needed progress information.
But if we are unlucky the value is needed; however the Apply steps contributing
to pv will have been propagated into the sequence returned by q. Now we have
constructed a loop in our computation: pv depends on the outcome of best, best
depends on the outcome of q pv , and q pv depends on the value of pv .
The problem is caused by the fact that each branch taken in p has its own call
to the continuation q, and that each branch may lead to a different value for
pv , but we get only one in our hands: the one which belongs to the successful
alternative. So we are stuck.
Fortunately we remember just in time that we have introduced a different kind
of parser, the history based ones, which have the property that they pass the
value produced along the path taken inside them to the continuation. Each
path splitting somewhere in p can thus call the continuation with the value
which will be produced if this alternative wins eventually. That is why their
implementation of Monad ’s operations is perfectly fine. This brings us to the
following insight: the reason we moved on from history based parsers to future
based parsers was that we wanted to have an online result. But the result of
the left-hand side of a monadic bind is not used at all in the construction of
the result. Instead it is removed from the result stack in order to be used as a
parameter to the right hand side operand of the monadic bind. So the solution
to our problem lies in using a history based parser as the left hand side of a
monadic bind, and a future based parser at the right hand side. Of course we
32
a1
q1 b f
a
f)
(b ,
(b, f ) p
(b ,
f )
q2 b f
a2
have to make sure that they share the Steps data type used for storing the
result. In Fig. 8 we have given a pictorial representation of the associated data
flow.
Unfortunately this does not work out as expected, since the type of the >>=
operator is Monad m ⇒ m b → (b → m a) → m a, and hence requires the left
and right hand side operands to be based upon the same functor m. A solution
is to introduce a class GenMod , which takes two functor parameters instead of
one:
infixr 1 >>>=
class GenMonad m 1 m 2 where
(>>>=) :: m 1 b → (b → m 2 a) → m 2 a
Now we can create two instances of GenMonad . In both cases the left hand
side operand is the history parser, and the right hand side operand is either a
history or a future based parser:
instance Monad (Ph state)
⇒ GenMonad (Ph state) (Ph state) where
(>>>=) = (>>=) -- the monadic bind defined before
instance GenMonad (Ph state) (Pf state) where
(Ph p) >>>= pv2q = Pf (λk → p (λpv → unP h (pv2q pv ) k ))
Unfortunately we are now no longer able to use the do notation because that is
designed for Monad expressions rather than for GenMonad expressions which
33
was introduced for monadic expressions, and thus we still cannot replace the
implementation in the basic library by the more advanced one we are developing.
Fortunately there is a trick which makes this still possible: we pair the two
implementations, and select the one which we need :
data Pm state a = Pm (Ph state a) (Pf state a)
unP m h (Pm (Ph h) )=h
unP m f (Pm (Pf f )) = f
Our first step is to make this new type again instance of Applicative:
instance ( Applicative (Ph st), Applicative (Pf st))
⇒ Applicative (Pm st) where
(Pm hp fp) <∗> ˜(Pm hq fq) = Pm (hp <∗> hq) (fp <∗> fq)
(Pm hp fp) <|> (Pm hq fq) = Pm (hp <|> hq) (fp <|> fq)
pReturn a = Pm (pReturn a) (pReturn a)
pFail = Pm pFail pFail
instance (symbol ‘Describes‘ token, state ‘Provides‘ token)
⇒ Symbol (Pm state) symbol token where
pSym a = Pm (pSym a) (pSym a)
instance Eof state ⇒ Parser (Pm state) where
parse (Pm (Pf fp))
= fst.eval .fp (λrest → if eof rest then ⊥
else error "parse")
This new type can now be made into a monad by:
instance Applicative (Pm st) ⇒ Monad (Pm st) where
(Pm (Ph p) ) >>= a2q =
Pm (Ph (λk → p (λa → unP m h (a2q a) k )))
(Pf (λk → p (λa → unP m f (a2q a) k )))
return = pReturn
Special attention has to be paid to the occurrence of the ˜ symbol in the left
hand side pattern for the <∗> combinator. The need for it comes from recursive
definitions like:
pMany p = (:) <$> p <∗> pMany p ‘opt‘ [ ]
If we match the second operand of the <∗> occurrence strictly this will force the
evaluation of the call pMany p, thus leading to an infinite recursion!
34
expressive power of context-free grammars. Because both our history and future
parsers now operate on the same Steps data type we will focus on extensions to
that data type only.
35
(<<< |>) :: p a → p a → p a
try :: pa→pa
instance Try (Pf state) where
Pf p <<< |> Pf q = Pf (λk st → let l = p k st
in maybe (l ‘best‘ q k st) id (hasSuccess id l )
)
where hasSuccess f (Step l ) = hasSuccess (f .Step) l
hasSuccess f (Apply g l ) = hasSuccess (f .Apply g) l
hasSuccess f (Success l ) = Just (f l )
hasSuccess f (Fail ) = Nothing
try (Pf p) = Pf (p.(Success.))
The function try does little more than inserting a Success marker in the result
sequence, once its argument parser has completed successfully. The function
hasSuccess tries to find such a marker. If found then the marker is removed
and success (Just) reported, otherwise failure (Nothing) is returned. In the
latter case our good old friend best takes its turn to compare both sequences in
parallel as before. One might be inclined to think that in case of failure of the
first alternative we should just take the second, but that is a bit too optimistic;
the right hand side alternative might fail even earlier.
Unfortunately this simple approach has its drawback: what happens if the pro-
grammer forgets to mark an initial part of the left hand side alternative with try?
In that case the function will never find a Success constructor, and our parsing
process fails. We can solve this problem by introducing yet another parser type
which guarantees that try has been used and thus that such a Success construc-
tor may occur. We will not pursue this alternative here any further, since it will
make our code even more involved.
36
The first question to be answered is what to choose for the result of an ambiguous
parser. We decide to return a list of all produced witnesses, and introduce
a function amb which is used to label ambiguous non-terminals; the type of
the parser that is returned by amb reflects that more than one result can be
expected.
class Ambiguous p where
amb :: p a → p [a ]
For its implementation we take inspiration from the parse functions we have
seen thus far. For history parsers we discovered that a grammar was ambiguous
by simultaneously encountering a Done marker in the left and right operand
of a call to best. So we model our amb implementation in the same way, and
introduce a new marker End h which becomes yet an extra alternative in our
result type:
data Steps a where
...
| End h :: ([a ], [a ] → Steps r ) → Steps (a, r ) → Steps (a, r )
...
To recognise the end of a potentially ambiguous parse we insert an End h mark
in the result sequence, which indicates that at this position a parse for the
ambiguous non-terminal was completed and we should continue with the call to
the continuation. Since we want to evaluate the call to the common continuation
only once we bind the current continuation k and the current state in the value
of type [a ] → Steps r ; the argument of this function will be the list of all
witnesses recognised at the point corresponding to the occurrence of the End h
constructor in the sequence:
instance Ambiguous (Ph state) where
amb (Ph p) =
Ph (λk → removeEnd h .p (λa st 0 → End h ([a ], λas → k as st 0 ) noAlts))
noAlts = Fail
We thus postpone the call to the continuation itself. The second parameter
of the End h constructor represents the other parsing alternatives that branch
within the ambiguous parser, but have not yet completed and thus contain and
End h marker further down the sequence.
All parses which reach their End h constructor at the same point are collected
in a common End h constructor. We only provide the interesting alternatives in
the new function best:
End h (as, k st) l ‘best 0 ‘ End h (bs, ) r = End h (as ++ bs, k st)
(l ‘best‘ r )
End h as l ‘best 0 ‘ r = End h as (l ‘best‘ r )
l ‘best 0 ‘ End h bs r = End h bs (l ‘best‘ r )
If an ambiguous parser succeeds at least once it will return a sequence of Step’s
37
which has the length of input consumed, followed by an End h constructor which
holds all the results and continuations of the parses that completed successfully
at this point, and a sequence representing the best result for all other parses
which were successful up-to this point. Note that all the continuations which
are stored are the same by construction.
The expression kas st 0 binds the ingredients of the continuation; it can im-
mediately be called once we have constructed the complete list containing the
witnesses of all successful parses. The tricky work is done by the function
removeEnd , which hunts down the result sequence in order to locate the End h
constructors, and to resume the best computation which was temporarily post-
poned until we had collected all successful parses with their common continua-
tions.
removeEnd h :: Steps (a, r ) → Steps r
removeEnd h (Fail ) = Fail
removeEnd h (Step l ) = Step (removeEnd h l )
removeEnd h (Apply f l ) = error "not in history parsers"
removeEnd h (End h (as, k st) r ) = k st as ‘best‘ removeEnd h r
In the last alternative the function removeEnd h has forced the evaluation of
all alternatives which are active up to this point. The result of the completed
parsers here have been collected in the value as, which can now be passed to
the function, thus resuming the parsing process at this point. Other parsers for
the ambiguous non-terminal which have not completed yet are all represented
by the second component. So the function removeEnd h still has to force further
evaluation of these sequences, and remove the End h constructor. The parsers
terminating at this point of course still have to compete wih the still active
parsers to finally reach a decision.
Without making this explicit we have gradually moved from a situation were the
calls to the function best immediately construct a single sequence, to a situation
where we have markers in the sequence which may be used to stop and start
evaluation.
The situation for the online parsers is a bit different, since we want to keep as
much of the online behaviour as possible. As an example we look at the following
set of definitions, where the parser r is marked as an ambiguous parser:
p <+
+> q = (+ +) <$> p <∗> q
a = (:[ ]) <$> pSym ’a’
a2 = a <+ +> a
a3 = a <+ +> a <+
+> a
r = amb (a <+ +> (a2 <+
+> a3 <|> a3 <++> a2 )
In section 7 we will introduce error repair, which will guarantee that each parser
always constructs a result sequence when forced to do so. This has as a con-
sequence that if we access the value computed by an ambiguous parser we can
be sure that this value has a length of at least 1, and thus we should be able to
38
match, in the case of the parser r above, the resulting value successfully against
the pattern ((a : ) : ) as soon as parsing has seen the first ’a’ in the input.
As before we add yet another marker type to the type Steps:
data Steps a where
...
End f :: [Steps a ] → Steps a → Steps a
...
We now give the code for the Pf case:
instance Ambiguous (Pf state) where
amb (Pf p) = Pf (λk inp → combineValues.removeEnd f $
p (λst → End f [k st ] noAlts) inp)
removeEnd_f :: Steps r → Steps [r ]
removeEnd_f (Fail)      = Fail
removeEnd_f (Step l)    = Step (removeEnd_f l)
removeEnd_f (Apply f l) = Apply (map' f) (removeEnd_f l)
removeEnd_f (End_f (s : ss) r) = Apply (: map eval ss) s
                                 ‘best‘
                                 removeEnd_f r

combineValues :: Steps [(a, r)] → Steps ([a ], r)
combineValues lar = Apply (λlar' → (map fst lar', snd (head lar'))) lar

map' f ˜(x : xs) = f x : map f xs
The hard work is again done in the last alternative of removeEnd_f, where we
apply the function eval to all the remaining sequences. Fortunately this eval is
again evaluated lazily, so not much work is done yet. The case of Apply is also
interesting, since it covers the case of the first a in the example; the map' f adds
this value to all successful parses. We cannot use the normal map, since that
function is strict in the list constructor of its second argument, and we may
already want to expose the call to f (e.g. to produce the value (’a’:)) without
proceeding with the match. The function map' exploits the fact that its list
argument is guaranteed to be non-empty, as a result of the error correction to
be introduced. Finally we use the function combineValues to collect the values
recognised by the ambiguous parser, and to combine the result with the sequence
produced by the continuation. This all looks rather expensive, but thanks to lazy
evaluation much of the work is never actually performed; in particular the
continuation will be evaluated only once, since the function fst does not force
evaluation of the second component of its argument tuple.
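To see why, consider the following small standalone illustration (our own, not
library code): map fst leaves the second components of the pairs completely
untouched, so whatever expensive computation hides there is shared and never
demanded.

pairs :: [(Int, String)]
pairs = [(1, error "never forced"), (2, error "never forced")]

main = print (map fst pairs) -- prints [1,2]; the errors are never evaluated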
5.3 Micro-steps
Besides the greedy parsing strategy which just looks at the next symbol in order
to decide which alternative to choose, we sometimes want to give precedence to
one parse over the other. An example of this is when we use the combinators
to construct a scanner. The string "name" should be recognised as an identifier,
whereas the string "if" should be recognised as a keyword, so this alternative
has precedence over the interpretation as an identifier. We can easily get
the desired effect by introducing an extra kind of step, which loses from Step
but wins from Fail. The occurrence of such a step can be seen as an indication
that a small penalty has to be paid for taking this alternative, but that we are
happy to pay this price if no other alternatives are available. We extend the
type Steps with a step Micro and add the alternatives:
(Micro l)  ‘best'‘ r@(Step _) = r
l@(Step _) ‘best'‘ (Micro _)  = l
(Micro l)  ‘best'‘ (Micro r)  = Micro (l ‘best'‘ r)
...
The only thing still to be done is to add a combinator which inserts this small
step into a progress sequence:
class Micro p where
micro :: p a → p a
instance Micro (Pf state) where
micro (Pf p) = Pf (p.(Micro.))
The other instances follow a similar pattern. Of course there are endless vari-
ations possible here. One might add a small integer cost to the micro step, in
order to describe even finer grained disambiguation strategies.
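A minimal self-contained sketch of that variation (our own simplified stand-in
for Steps, not the library's type) could look as follows: micro steps carry an
Int penalty, and the cheaper penalty wins.

data Steps' = Fail' | Step' Steps' | Micro' Int Steps'

best'' :: Steps' → Steps' → Steps'
best'' Fail' r = r
best'' l Fail' = l
best'' (Step' l) (Step' r) = Step' (best'' l r)
best'' (Micro' _ _) r@(Step' _) = r -- a real step still wins from any micro step
best'' l@(Step' _) (Micro' _ _) = l
best'' (Micro' i l) (Micro' j r)
  | i < j     = Micro' i l            -- the smaller penalty wins outright
  | i > j     = Micro' j r
  | otherwise = Micro' i (best'' l r) -- equal penalties: keep comparing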
6 Embedding Parsers
With the introduction of the function splitState we have moved the responsibility
for the scanning process, which converts the input into a stream of tokens, to
the state type. Usually one is satisfied to have just a single way of scanning the
input, but sometimes one may want to use a parser for one language as sub-
parser in the parser for another language. An example of this is when one has a
Haskell parser and wants to recognise a String value. Of course one could offload
the recognition of string values to the tokeniser, but wouldn’t it be nice if we
could just call the parser for strings as a sub-parser, which uses single characters
as its token type? A second example arises when one extends a language like
Java with a sub-language like AspectJ, which again has Java as a sub-language.
Normally this creates all kinds of problems with the scanning process, but if we
are able to switch between scanner types, many of these problems disappear.
In order to enable such an embedding we introduce the following class:
class Switch p where
pSwitch :: (st1 → (st2 , st2 → st1 )) → p st2 a → p st1 a
It provides a new parser combinator pSwitch that can temporarily parse with
a different state type st2 by providing it with a splitting function which splits
the original state of type st1 into a state of type st2 and a function which will
convert the final value of type st2 back into a value of type st1 :
instance Switch Ph where
  pSwitch split (Ph p) = Ph (λk st1 → let (st2, b) = split st1
                                      in p (λst2' → k (b st2')) st2)
instance Switch Pf where
  pSwitch split (Pf p) = Pf (λk st1 → let (st2, b) = split st1
                                      in p (λst2' → k (b st2')) st2)
instance Switch Pm where
pSwitch split (Pm (p, q)) = Pm (pSwitch split p, pSwitch split q)
Using the function pSwitch we can map the state to a different state and back;
by providing different instances we can thus use different versions of splitState.
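As an illustration, the following hypothetical fragment uses pSwitch to recognise
a Haskell string literal with a character-level sub-parser while the surrounding
parser works on tokens; TokenState, toCharState, fromCharState and pQuotedString
are all assumed helpers (fromCharState st1 converts the character-level state
back into a token-level one), not part of the library:

pStringLit :: Ph TokenState String
pStringLit = pSwitch (λst1 → (toCharState st1, fromCharState st1)) pQuotedString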
A subtle point to be addressed concerns the breadth-first strategy; if we have
two alternatives working on the same piece of input, but are using different
scanning strategies, the two alternatives may get out of sync by accepting a
different number of tokens for the same piece of input. Although this may not
be critical for the breadth-first process, it may spoil our recognition process
for ambiguous parsers, which depend on the fact that when End markers meet
the corresponding input positions are the same. We thus adapt the function
splitState such that it not only returns the next token, but also an Int value
indicating how much input was consumed. We also adapt the Step alternative
to record the progress made:
type Progress = Int
data Steps a where
...
Step :: Progress → Steps a → Steps a
Of course the function best' needs to be adapted too. We again only show
the relevant changes:
Step n l ‘best'‘ Step m r
  | n ≡ m = Step n (l ‘best'‘ r)
  | n < m = Step n (l ‘best'‘ Step (m − n) r)
  | n > m = Step m (Step (n − m) l ‘best'‘ r)
The changes to all other functions, such as eval , are straightforward.
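As an illustration, a sketch of the adapted eval (our own rendering, using only
the Steps constructors introduced so far; the End cases are omitted) merely has
to ignore the recorded Progress:

eval :: Steps a → a
eval (Step _ l)  = eval l       -- the progress count does not influence the witness
eval (Apply f l) = f (eval l)
eval Fail        = error "no successful parse"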
7.1 Error Reporting
An important feature of proper error reporting is an indication of the longest
valid prefix of the input, and which symbols were expected at that point. We
have seen already that the number of Step constructors provides the former. So
we will focus on the latter. For this we change the Fail alternative of the Steps
data type, in order to record symbols that were expected at the point of failure:
data Steps a where
...
Fail :: [String ] → Steps a
...
In the functions pSym we replace the occurrences of Fail with the expression
Fail [show a ], where a is the symbol we were looking for, i.e. the argument
of pSym. We have chosen to represent the information as a collection of
Strings because this makes it possible to combine Fail steps from parsers with
different symbol types, which arises when we embed one parser into another.
In the function best we have to change the corresponding lines accordingly; the
most interesting line is the one where two failing alternatives are merged, which
in the new situation becomes:
Fail ls ‘best‘ Fail rs = Fail (ls ++ rs)
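The following standalone toy (our own simplified type, not the library's)
demonstrates the effect of this rule:

data S = SFail [String ] | SStep S deriving Show

bestS :: S → S → S
bestS (SFail ls)  (SFail rs)  = SFail (ls ++ rs) -- merge the expectations
bestS l@(SStep _) (SFail _)   = l                -- progress wins from failure
bestS (SFail _)   r@(SStep _) = r
bestS (SStep l)   (SStep r)   = SStep (bestS l r)

-- bestS (SFail ["’a’"]) (SFail ["’b’"]) evaluates to SFail ["’a’","’b’"]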
7.2 Error Repair
Our overall strategy is to construct a collection of possible
repairs, each with an associated cost, and then to select the best one out of these,
using a limited look-ahead.
To get an impression of the kind of repairs we will be implementing consider
the following program:
test p inp = parse ((,) <$> p <∗> pEnd) (listToStr inp)
The function test runs its parameter parser, followed by a call to pEnd, which
returns the list of errors constructed and deletes possibly unconsumed input.
The constructor (,) pairs the result of the parser with the error messages, and the
function listToStr converts a list of characters into an appropriate input stream
type.
We define the following small parsers to be tested, including an ambiguous
parser and a monadic parser to show the effects of the error correction:
a = (λa → [a ]) <$> pSym ’a’
b = (λa → [a ]) <$> pSym ’b’
p <++> q = (++) <$> p <∗> q
a2 = a <++> a
a3 = a <++> a2
pMany p = (λa b → b + 1) <$> p <∗> pMany p <<|> pReturn 0
pCount 0 p = pReturn [ ]
pCount n p = p <++> pCount (n − 1) p
Now we have three calls to the function test, all with erroneous inputs:
main = do print (test a2 "bbab")
          print (test (do {l ← pMany a; pCount l b }) "aaacabbb")
          print (test (amb ((++) <$> a2 <∗> a3
                        <|> (++) <$> a3 <∗> a2)) "aaabaa")
Running the program produces for each test a result tuple containing the
constructed witness and a list of error messages, each reporting the correcting
action, the position in the input where it was performed, and the set of expected
symbols.
Before showing the new parser code we have to answer the question of how we are
going to communicate the repair steps. To allow for maximal flexibility we have
decided to let the state keep track of the accumulated error messages, which
can be retrieved (and reset) by the special parser pErrors. We also add an
extra parser pEnd, which is to be called as the last parser, and which deletes
superfluous tokens at the end of the input:
class p ‘AsksFor‘ errors where
pErrors :: p errors
pEnd :: p errors
class Eof state where
eof :: state → Bool
deleteAtEnd :: state → Maybe (Cost, state)
In order to cater for the most common case we introduce a new class Stores,
which represents the retrieval of errors, and extend the class Provides with two
more functions which report the corrective actions taken to the state:
class state ‘Stores‘ errors where
  getErrors :: state → (errors, state)
class Provides state symbol token where
  splitState :: symbol → state → Maybe (token, state, Progress)
  insertSym  :: symbol → state → Strings → Maybe (Cost, token, state)
  deleteTok  :: token → state → state → Strings → Maybe (Cost, state)
The function getErrors returns the accumulated error messages and resets the
maintained set. The function insertSym takes as argument the symbol to be
inserted, the current state and a set of strings describing what was expected at
this location. If the state decides that the symbol is acceptable for insertion, it
returns the costs associated with the insertion, a token which should be used as
the witness for the successful insertion action, and a new state. The function
deleteTok takes as arguments the token to be deleted, the new state that was
returned from splitState, and the old state which was passed to splitState –which
may e.g. contain the position at which the token to be deleted is located. It
returns the cost of the deletion, and the new state with the associated error
message included.
In Fig. 9 we give a reference implementation which lifts, using listToStr , a list
of tokens to a state which has the required interface and provides a stream of
tokens. One fine point remains to be discussed, which is the commutativity of
insert and delete actions. Inserting a symbol and then deleting the current token
has the same effect as first deleting the token and then inserting a symbol. This
is why the function deleteTok returns a Maybe; if it is called on a state into which
just a symbol has been inserted it should return Nothing. The data type Error
represents the error messages which are stored in the state, and pos maintains
the current input position. Note also that the function splitState returns the
extra integer, which represents how far the input state was “advanced”; here
the value is always 1.
Given the defined interfaces we can now define the proper instances for the
parser classes we have introduced. Since the code is quite similar we only give
the version for Pf . The occurrence of the Fail constructor is a bit more involved
Figure 9 (reference implementation of the Str state type):
instance Eq a ⇒ Describes a a where
  eqSymTok = (≡)

data Error t s pos = Inserted s pos Strings
                   | Deleted t pos Strings
                   | DeletedAtEnd t
                   deriving Show

data Str t = Str { input    :: [t ]
                 , msgs     :: [Error t t Int ]
                 , pos      :: !Int
                 , deleteOk :: !Bool }

listToStr ls = Str ls [ ] 0 True

instance Provides (Str a) a a where
  splitState _ (Str [ ] _ _ _)           = Nothing
  splitState _ (Str (t : ts) msgs pos ok) = Just (t, Str ts msgs (pos + 1) True, 1)

instance Eof (Str a) where
  eof (Str i _ _ _) = null i
  deleteAtEnd (Str (i : ii) msgs pos ok)
    = Just (5, Str ii (msgs ++ [DeletedAtEnd i ]) pos ok)
  deleteAtEnd _
    = Nothing

instance Corrects (Str a) a a where
  insertSym s (Str i msgs pos ok) exp
    = Just (5, s, Str i (msgs ++ [Inserted s pos exp ]) pos False)
  deleteTok i (Str ii _ pos True)
              (Str _ msgs pos' True) exp
    = Just (5, Str ii (msgs ++ [Deleted i pos' exp ]) pos True)
  deleteTok _ _ _ _
    = Nothing

instance Stores (Str a) [Error a a Int ] where
  getErrors (Str inp msgs pos ok) = (msgs, Str inp [ ] pos ok)
than expected, and will be explained soon. The function pErrors uses getErrors
to retrieve the error messages, which are inserted into the result sequence using a
push. The function pEnd uses the recursive function del to remove any remain-
ing tokens from the input, and to produce error messages for these deletions.
Having reached the end of the input it retrieves all pending error messages and
hands them over to the result:
instance (Eof state, Stores state errors) ⇒ AsksFor (Pf state) errors where
pErrors = Pf (λk inp → let (errs, inp') = getErrors inp
                       in push errs (k inp'))
pEnd = Pf (λk inp →
         let del inp = case deleteAtEnd inp of
                         Nothing → let (errors, state) = getErrors inp
                                   in push errors (k state)
                         Just (i, inp') → Fail [ ] [const (Just (i, del inp'))]
         in del inp
      )
Of course, if we want to base any decision about how to proceed with parsing on
the errors that have been produced thus far, the Ph version of pErrors should be
used. If we just want to decide whether to proceed or not, the fact that results
are produced online can be used too. If we find a non-empty error message
embedded in the resulting value, we may decide not to inspect the rest of the
returned value at all; and since we do not inspect it, parsing will not produce it
either.
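A small sketch of this last idea (our own; it reuses the Error type of Fig. 9 and
assumes the witness/error pair produced by the function test above):

report :: Show a ⇒ (a, [Error Char Char Int ]) → IO ()
report (witness, errs)
  | null errs = print witness -- only now is the witness demanded
  | otherwise = putStrLn (show (length errs) ++ " corrections made; witness not inspected")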
Figure 10 (the final definition of pSym for the Pf case):
instance (Show symbol, Describes symbol token, Corrects state symbol token)
       ⇒ Symbol (Pf state) symbol token where
  pSym a = Pf (
    let p = λk inp →
          let ins ex = case insertSym a inp ex of
                         Just (c_i, v, st_i) → Just (c_i, push v (k st_i))
                         Nothing             → Nothing
              del s ss ex
                     = case deleteTok s ss inp ex of
                         Just (c_d, st_d) → Just (c_d, p k st_d)
                         Nothing          → Nothing
          in case splitState a inp of
               Just (s, ss, pr) → if a ‘eqSymTok‘ s
                                  then Step pr (push s (k ss))
                                  else Fail [show a ] [ins, del s ss ]
               Nothing → Fail [show a ] [ins ]
    in p)
In Fig. 10 we give the final definition of pSym for the Pf case. The
local functions del and ins take care of the deletion of the current token and
the insertion of the expected symbol, and are returned where appropriate if
recognition of the expected symbol a fails.
In the definition just given we see that, when it gets stuck, the parser stops
working and just collects information about how to proceed. Now it becomes the
task of the function eval to start the suspended parsing process:
eval (Fail ss fs) = eval (getCheapest 3 (catMaybes $ map ($ss) fs))
Once eval is called we know that all expected symbols and all information about
how to proceed have been merged into a single Fail constructor. So we can
construct all possible ways to proceed by applying the elements of fs to the set
of expected symbols ss, and selecting those cases where something can actually
be repaired. The returned progress sequences themselves can of course contain
further Fail constructors, and thus each alternative actually represents a tree of
ways to proceed; the branches of such a tree are either Step's, with which
we associate cost 0, or further repair steps, each with its own cost. For each
tree we compute the cheapest path up to n steps away from the root using the
function traverse, and use the result to select the progress sequence containing
the path with the lowest accumulated cost. The first parameter of traverse is
the number of tree levels still to be inspected, the second the tree, the
third the accumulated cost from the root up to the current node,
and the last the best value found for a tree thus far, which is used to
prune the search process.
getCheapest :: Int → [(Int, Steps a)] → Steps a
getCheapest _ [ ] = error "no correcting alternative found"
getCheapest n l   = snd $ foldr (λ(w, ll) btf@(c, l) →
                                   if w < c
                                   then let new = traverse n ll w c
                                        in if new < c then (new, ll) else btf
                                   else btf
                                ) (maxBound, error "getCheapest") l
traverse :: Int → Steps a → Int → Int → Int
traverse 0 _             = λv c → v
traverse n (Step ps l)   = traverse (n − 1) l
traverse n (Apply _ l)   = traverse n l
traverse n (Fail m m2ls) =
  λv c → foldr (λ(w, l) c' → if v + w < c'
                             then traverse (n − 1) l (v + w) c'
                             else c'
               ) c (catMaybes $ map ($m) m2ls)
traverse n (End_h ((a, lf ) : _) r) = traverse n (lf [a ] ‘best‘ removeEnd_h r)
traverse n (End_f (l : _) r)        = traverse n (l ‘best‘ r)
8 An Idiomatic Interface
McBride and Paterson [10] investigate the Applicative interface we have been
using throughout this tutorial. Since this extension of the pattern of sequential
composition is so common, they propose an intriguing use of functional depen-
dencies to enable a very elegant way of writing applicative expressions. Here we
briefly re-introduce the idea, and give a specialised version for the type Parser
we introduced for the basic library.
Looking at the examples of parsers written with the applicative interface we
see that if we want to inject a function into the result then we will always
do this with a pReturn, and if we recognise a keyword then we always throw
away the result. Hence the question arises whether we can use the types of the
components of the right hand side of a parser to decide how to incorporate it
into the result. The overall aim of this exercise is that we will be able to replace
an expression like:
choose <$ pSyms "if" <∗> pExpr <∗ pSyms "then" <∗> pExpr
       <∗ pSyms "else" <∗> pExpr
by the much shorter expression:
start choose "if" pExpr "then" pExpr "else" pExpr stop
or by nicely formatting the start and stop tokens as a pair of brackets by:
[: choose "if" pExpr "then" pExpr "else" pExpr :]
The core idea of the trick lies in the function idiomatic which takes two argu-
ments: an accumulating argument in which it constructs a parser, and the next
element from the expression. Based on the type of this next element we decide
what to do with it: if it is a parser too then we combine it with the parser
constructed thus far by using sequential composition, and if it is a String then
we build a keyword parser out of it which we combine in such a way with the
thus far constructed parser that the witness is thrown away. We implement the
choice based on the type by defining a collection of suitable instances for the
class Idiomatic:
class Idiomatic f g | g → f where
idiomatic :: Parser Char f → g
We start by discussing the standard case:
instance Idiomatic f r ⇒ Idiomatic (a → f ) (Parser Char a → r ) where
idiomatic isf is = idiomatic (isf <∗> is)
which is to be read as follows: if the next element in the sequence is a parser re-
turning a witness of type a, and the parser we have constructed thus far expects
a value of that type a to build a parser of type f , and we know how to combine
the rest of g with this parser of type f , then we combine the accumulated parser
recognising a value of type a → f and the argument parser recognising an a,
and call the function idiomatic available from the context to consume further
elements from the expression.
If we encounter the stop marker, we return the accumulated parser. For this
marker we introduce a special type Stop, and declare an instance which recog-
nises this Stop and returns the accumulated parser.
data Stop = Stop
stop = Stop
instance Idiomatic x (Stop → Parser Char x ) where
idiomatic ix Stop = ix
Now let us assume that the next element in the input is a function instead of
a parser. In this case the Parser Char a in the previous instance declaration
is replaced by a function of some type a → b, and we expect our thus far
constructed parser to accept such a value. Hence we get:
instance Idiomatic f g ⇒ Idiomatic ((a → b) → f ) ((a → b) → g) where
idiomatic isf a2b = idiomatic (isf <∗> pReturn a2b)
Once we have this instance it is easy to define the function start. Since
we can prefix every parser with an id <$> fragment, we can define start as the
initialisation of the accumulated parser with a parser which always succeeds
with id :
start :: ∀ a g.(Idiomatic (a → a) g) ⇒ g
start = idiomatic (pReturn id )
Finally we can provide extra instances at will, as long as we do not give more
than one for a specific type; otherwise we would get an overloading ambiguity.
As an example we define two further cases, one for recognising a keyword and
one for recognising a single character:
instance Idiomatic f g ⇒ Idiomatic f (String → g) where
idiomatic isf str = idiomatic (isf <∗ pKey str )
instance Idiomatic f g ⇒ Idiomatic f (Char → g) where
idiomatic isf c = idiomatic (isf <∗ pSym c)
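To see the instances at work, consider the following small example, in which
pLetter :: Parser Char Char is an assumed parser for a single letter:

pPair :: Parser Char (Char, Char)
pPair = start (,) pLetter ’x’ pLetter stop
-- unfolds, instance by instance, into:
-- pReturn id <∗> pReturn (,) <∗> pLetter <∗ pSym ’x’ <∗> pLetter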
9 Further Extensions
In the previous sections we have developed a library which provides a lot of
basic functionality. Unfortunately space restrictions prevent us from describing
many more extensions to the library in detail, so we will sketch them here.
Most of them are efficiency improvements, but we will also show an example of
how to use the library to dynamically generate large grammars, thus providing
solutions to problems which are infeasible when done by traditional means, such
as parser generators.
9.1 Recognisers
In the basic library we had operators which discarded part of the recognised
result since it was not needed for constructing the final witness; typical examples
of this are e.g. recognised keywords, separating symbols such as commas and
semicolons, and bracketing symbols. The only reason for their presence in the
input is to make the program readable and unambiguously parseable.
Of course it is not such a great idea to first perform a lot of work in constructing
a result, only to have to do even more work to get rid of it again. Fortunately we
have already introduced the recognisers, which can be combined with the other
types of parsers Ph, Pf and Pm. We introduce yet another class:
class Applicative p ⇒ ExtApplicative p st where
  (<∗) :: p a → R st b → p a
  (∗>) :: R st b → p a → p a
  (<$) :: a → R st b → p a
The instances of this class again follow the common pattern. We only give the
implementation for Ph :
instance ExtApplicative (Ph st) st where
Ph p <∗ R r = Ph (p . (r .))
R r ∗> Ph p = Ph (r . p)
f    <$ R r = Ph (r . ($f ))
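As a hypothetical example of their use, assume a recogniser
pKeyR :: String → R st String for keywords, with pExpr and choose as in the
earlier example; the keywords are then checked, but no witness is ever built for
them:

pIf = choose <$ pKeyR "if" <∗> pExpr <∗ pKeyR "then" <∗> pExpr
             <∗ pKeyR "else" <∗> pExpr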
9.2 Parsing Permutation Phrases
A nice example of the power of parser combinators is when we want to recognise
a sequence of elements of different types, in which the order in which they appear
in the input does not matter; examples of such a situation are the recognition
of a BibTeX entry or of the attribute declarations allowed at a specific node in an
XML-tree. In [1] we show how to proceed in such a case, so here we only sketch
the idea, which depends heavily on lazy evaluation.
We start out by building a data structure which represents all possible permuta-
tions of the parsers for the individual elements to be recognised. This structure
is a tree, in which each path from the root to a leaf represents one of the possible
permutations. From this tree we generate a parser, which initially is prepared to
accept any of the elements; after having recognised the first element it continues
to recognise a permutation of the remaining elements, as described by the ap-
propriate subtree. Since the tree describing all the permutations and the parser
corresponding to it are constructed lazily, only the parsers corresponding to a
permutation actually occurring in the input will be generated. All the chosen
alternative has to do in the end is to put the elements in some canonical order.
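A minimal sketch of such a permutation tree, in the spirit of [1] (the types, the
names and the use of the standard Alternative interface are ours, not the actual
library's):

data Perms p a  = Perms (Maybe a) [Branch p a ]
data Branch p a = ∀ b . Branch (p b) (Perms p (b → a))

-- Each Branch fixes one parser as the first element; its subtree holds the
-- permutations of the remaining ones. Converting the tree into a parser:
pPerms :: Alternative p ⇒ Perms p a → p a
pPerms (Perms def bs) = foldr (<|>) (maybe empty pure def)
                              [flip ($) <$> p <∗> pPerms t | Branch p t ← bs ]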
10 Conclusions
We have come to end of a fairly long tutorial, which we hope to extend in the
future with sections describing the yet missing abstract interpretations. We
hope nevertheless that the reader has gained a good insight in the possibilities
of using Haskell as a tool for embedding domain specific languages. There are
a few final remarks we should like to make.
In the first place we claim that the library we have developed can be used outside
the context of parsing. Basically we have set up a very general infrastructure for
describing search algorithms, in which a tree is generated representing the possible
solutions. Our combinators can be used for building such trees and for searching
them for possible solutions in a breadth-first way.
In the second place, the library we have described is by far not the only one in
existence. Many different (Haskell) libraries are floating around, some more mature
than others, some more dangerous to use than others, some resulting in faster
parsers, and so on. One of the most widely used libraries is Parsec [9], originally
constructed by Daan Leijen, which gained its popularity by being packaged with the
distribution of the GHC compiler. That library distinguishes itself from our
approach in that the underlying technique is the more conventional backtracking
technique, as described in the first part of our tutorial. In order to alleviate some
of the mentioned disadvantages of that approach, the programmer has the possibility
to commit the search process at specific points, thus cutting away branches of
the search tree. Although this technique can be very effective, it is also more
dangerous: branches which should remain alive may unintentionally be pruned
away. The programmer really has to be aware of how his grammar is parsed
in order to know where the annotations can safely be put; but if he knows what
he is doing, fast parsers can be constructed. Another simplifying aspect is that
Parsec just stops if it cannot make further progress; a single error message is
produced, describing what was expected at the farthest point reached.
A relatively new library, constructed by Malcolm Wallace [19], incorporates
many of the aspects we have been dealing with: building results online, and
combining a monadic interface with an applicative one. It does, however, not
perform error correction.
Another library which implements a breadth-first strategy is Koen Claessen's
parallel parsers [3], which are currently used in the implementation of the
GHC read functions. They are based on a rewriting process, and as a result do
not lend themselves well to an optimising implementation.
Concluding, we may say that parser combinators provide an everlasting
source of inspiration for research into Haskell programming patterns, and have
given us a lot of insight into how to implement Embedded Domain Specific
Languages in Haskell.
Acknowledgements I thank current and past members of the Software Technology
group at Utrecht University for commenting on earlier versions of this paper, and
for trying out the library described here. I want to thank Alesya Sheremet for
working out some details of the monadic implementation, the anonymous referee
for his/her comments, and Magnus Carlsson for many suggestions for improving
the code.
References
[1] Arthur I. Baars, Andres Löh, and S. Doaitse Swierstra. Parsing permutation
phrases. J. Funct. Program., 14(6):635–646, 2004.
[3] Koen Claessen. Parallel parsing processes. J. Funct. Program., 14(6):741–757,
2004.
[8] Graham Hutton and Erik Meijer. Monadic parsing in Haskell. J. Funct.
Program., 8(4):437–444, 1998.
[9] Daan Leijen. Parsec, a fast combinator parser. Technical Report UU-CS-
2001-26, Institute of Information and Computing Sciences, Utrecht University,
2001.
[10] Conor McBride and Ross Paterson. Applicative programming with effects.
J. Funct. Program., 18(1):1–13, 2008.
[12] Niklas Röjemo. Garbage collection and memory efficiency in lazy functional
languages. PhD thesis, Chalmers University of Technology, 1995.
[13] S. D. Swierstra and L. Duponcheel. Deterministic, error-correcting combi-
nator parsers. In John Launchbury, Erik Meijer, and Tim Sheard, editors,
Advanced Functional Programming, volume 1129 of LNCS-Tutorial, pages
184–207. Springer-Verlag, 1996.
[14] S. Doaitse Swierstra, Arthur Baars, Andres Löh, and Arie Middelkoop.
UUAG – Utrecht University Attribute Grammar system.
[15] S.D. Swierstra, P.R. Azero Alocer, and J. Saraiva. Designing and imple-
menting combinator languages. In S. D. Swierstra, Pedro Henriques, and
José Oliveira, editors, Advanced Functional Programming, Third Interna-
tional School, AFP’98, volume 1608 of LNCS, pages 150–206. Springer-
Verlag, 1999.
[16] Marcos Viera, S. Doaitse Swierstra, and Eelco Lempsink. Haskell, do you
read me?: constructing and composing efficient top-down parsers at run-
time. In Haskell ’08: Proceedings of the first ACM SIGPLAN symposium
on Haskell, pages 63–74, New York, NY, USA, 2008. ACM.
[17] Philip Wadler. How to replace failure with a list of successes. In Functional
Programming Languages and Computer Architecture, volume 201 of LNCS,
pages 113–128. Springer-Verlag, 1985.
[19] Malcolm Wallace. Partial parsing: combining choice with commitment. In
Implementation and Application of Functional Languages (IFL 2007), volume
5083 of LNCS, pages 93–110. Springer-Verlag, 2008.