Introduction To PEG (Parsing Expression Grammar) in Python
roadmap (34 mins)

Motivation           04 mins
PEG theory           05 mins
pyparsing            16 mins
PyMeta               07 mins
PyPy rlib/parsing    01 min
Closing              01 min
motivation
How to parse texts with PEGs
* Natural languages (NLTK)
* Mini languages (DSLs)
* Structured / unstructured file formats
4 thoughts:
i. Aren't structured formats like JSON, XML, HTML well served by existing parsers?
ii. Parsing log files & configuration files is easy with Python.
iii. Regular expressions are good enough.
iv. What is wrong with the classical way of writing parsers?
[Figure: two AST diagrams (#1 and #2) of a conditional expression: an IfExp whose test is a Compare of Name 'x' against Num 5, with Str values 'limbo' and 'y' in the body/orelse branches.]
Definitions
Parse trees vs AST
Parse tree:
* concrete
* keeps whitespace, braces, semicolons
* nodes are nonterminals from the grammar

AST:
* abstract
* uses tree nodes specific to language constructs
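As a concrete illustration (my addition, not from the slides), Python's own stdlib `ast` module shows the "abstract" side of this distinction: the tree it returns keeps only language constructs, with no trace of whitespace or punctuation.

```python
import ast

# Parse a conditional expression; the AST keeps IfExp/Compare nodes,
# not the parentheses or whitespace a concrete parse tree would carry.
tree = ast.parse("'limbo' if x > 5 else 'y'", mode='eval')
node = tree.body
print(type(node).__name__)        # IfExp
print(type(node.test).__name__)   # Compare
```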
Top-down
* begin with the start nonterminal
* work down the parse tree

vs

Bottom-up
* identify terminals
* infer nonterminals
* climb the parse tree
Definitions (2)
Recursive descent parsing
* A top-down parser constructed from recursive functions.
* Each function represents a rule in the grammar.
version ::= <digit> '.' <digit>
digit   ::= '0' | '1' | ... | '9'

def version(source, position=0):
    digit(source, position)
    period(source, position + 1)
    digit(source, position + 2)
Run: (pymeta) nose --nocapture -v test_rdp_list.py
def digit(source, position):
    fn = (lambda t: t in string.digits, this_rule())
    expect(source, position, fn)

def period(source, position):
    fn = (lambda t: t == '.', this_rule())
    expect(source, position, fn)
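The slides leave `expect` and `this_rule` undefined. A minimal, self-contained sketch (Python 3), assuming `expect` takes a (predicate, rule-name) pair and raises on mismatch; `ParseError` here is a stand-in, not a name from the talk:

```python
import string

class ParseError(Exception):
    pass

def expect(source, position, fn):
    # fn is a (predicate, rule-name) pair, mirroring the slide
    predicate, name = fn
    if position >= len(source) or not predicate(source[position]):
        raise ParseError('%s expected at position %d' % (name, position))

def digit(source, position):
    expect(source, position, (lambda t: t in string.digits, 'digit'))

def period(source, position):
    expect(source, position, (lambda t: t == '.', 'period'))

def version(source, position=0):
    # version ::= <digit> '.' <digit>
    digit(source, position)
    period(source, position + 1)
    digit(source, position + 2)

version('2.7')   # succeeds silently; version('2x7') would raise ParseError
```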
3. Parsing phase
parser ( grammar, stream-of-tokens) => parse tree / AST
PEG
* Formalized by Bryan Ford in 2002-2004.
* Grammar mimics a recursive descent parser (+ backtracking).
* Scanner-less.
* A PEG grammar consists of a set of parsing expressions of the form A <- e.
* One expression is denoted the starting expression.
PEG operators:
e1 / e2      ordered choice
e1 e2        sequence
e*  e+  e?   repetition
&e  !e       predicates
!= EBNF
PEG vs CFG
                                    PEG         CFG
Syntax definition philosophy        Analytical  Generative
Choice e1/e2                        Ordered     Unordered alternation
Handles ambiguous grammars?         No          Yes
Requires a lexical analysis phase?  No          Typically yes
Left recursion                      No          Yes
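The "ordered choice" row is the key practical difference, and it can be sketched in a few lines of plain Python (illustrative names, no library): the first alternative that succeeds wins, so a PEG grammar cannot be ambiguous by construction.

```python
def lit(s):
    # returns a parser: (text, pos) -> new position on success, None on failure
    def parse(text, pos):
        return pos + len(s) if text.startswith(s, pos) else None
    return parse

def ordered_choice(*alts):
    def parse(text, pos):
        for alt in alts:          # try alternatives in order; first hit wins
            end = alt(text, pos)
            if end is not None:
                return end
        return None
    return parse

# 'a' / 'ab' on input "ab": the first alternative matches, so the
# second is never tried, even though it also could match.
g = ordered_choice(lit('a'), lit('ab'))
print(g('ab', 0))   # 1 (consumed just 'a'), not 2
```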
HEISEI               1989 Jan 8  - present
SHOWA   (Hirohito)   1926 Dec 25 - 1989 Jan 7
TAISHOU (Yoshihito)  1912 Jul 30 - 1926 Dec 24
MEIJI   (Mutsuhito)  1868 Sep 8  - 1912 Jul 29
from pyparsing import Literal, Word, nums

year       = Literal(u'\u5e74')        # 年
month      = Literal(u'\u6708')        # 月
day        = Literal(u'\u65e5')        # 日
heisei_era = Literal(u'\u5e73\u6210')  # 平成
integer    = Word(nums)
# a Word(nums, exact=2) pattern for two-digit fields also appeared here
western_year  = integer('yyyy') + year
imperial_year = heisei_era + western_year
year_spec     = (imperial_year('imperial') | western_year('western'))
grammar       = year_spec + month_spec + day_spec
pyparsing : introduction
* Easy-to-use PEG-based text parser.
* Grammar definitions in Python.
* Framework distributed as one file:
pyparsing.py
* Runs on both Python 2.x & 3.x. Future releases after 1.5.x will focus on Python 3.x only.
* Not classified as recursive descent!
pyparsing
e1 + e2  ==  And([e1, e2])
e1 | e2  ==  MatchFirst([e1, e2])
pyparsing : backtracking
Or forces the parser to make an exhaustive search of all the alternatives, which is no better than non-PEG parsers. With MatchFirst, tweak the order of alternatives & put the most probable (e.g. by frequency of occurrence) first. This avoids wasteful backtracking.
pyparsing : backtracking
Ballon d'Or 2011 example
p1, p2, p3, p4, p5 = map(Literal, ['ronaldo', 'messi', 'park-ji-sung', 'xavi', 'iniesta'])
first  = p2 + p1 + p4
second = p2 + p1 + p5
third  = p2 + p1 + p3
grammar = first | second | third
print grammar.parseString("messi ronaldo park-ji-sung")
pyparsing : packrat
Memoization must be manually turned on.
ParserElement.enablePackrat()
Caches:
a. ParseResults
b. Exceptions thrown
Caveat emptor: a grammar whose parse actions have side effects does not always play well with memoization turned on.
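What enablePackrat() does can be sketched in plain Python (illustrative only, not pyparsing's actual implementation): cache each rule's result per input position so the rule body runs at most once there. The counter below is a side effect that fires only once under memoization, which is exactly the caveat above.

```python
def packrat(rule):
    cache = {}                    # position -> cached result
    def wrapped(text, pos):
        if pos not in cache:
            cache[pos] = rule(text, pos)
        return cache[pos]
    return wrapped

calls = {'n': 0}

def digits(text, pos):
    # count invocations to show the cache working
    calls['n'] += 1
    end = pos
    while end < len(text) and text[end].isdigit():
        end += 1
    return end if end > pos else None

fast = packrat(digits)
fast('123', 0)
fast('123', 0)      # second call is served from the cache
print(calls['n'])   # 1
```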
Run: python select_parser.py
Usage:
ParserElement.setParseAction(*fn)
ParserElement.addParseAction(*fn)
Uses:
1. Perform validation (see ParseException).
2. Process the matched token(s) & modify them. Returning a value overwrites the matched token(s).
3. Annotate with custom types (corollary of #2).
Show: japan_simple.py
@traceParseAction
def convert_kanji_year(toks):
    if 'imperial' in toks.keys():
        year = toks.imperial.yearZero + toks.imperial.yy
        toks['era'] = toks.imperial.type_
        toks['yyyy'] = year
    elif 'western' in toks.keys():
        year = toks.yyyy
    try:
        toks['modernDate'] = date(year, toks.mm, toks.dd)
    except ValueError, error:
        raise ParseException(error.args[0])
Show: japan_dates.py
Supports a tiny subset of the full grammar:
from:(<sender>)
label:inbox     -label:sent
yyyyy           -yyyyy
zzzzz           -zzzzz
email = (emailpartial | emailfull)
squeeze = lambda t: ' '.join(t[0].split())
name = (ZeroOrMore(Word(alphanums + ' '))
        .setParseAction(squeeze))
label_rhs = delimitedList(Word(alphanums), delim='-', combine=True)
label_include = Combine(Suppress('label') + colon + label_rhs)
label_exclude = Combine(hyphen + label_include)
label_all = MatchFirst([
    label_exclude.setResultsName('labels.exclude', listAllMatches=True),
    label_include('labels.include*')])  # the expr('name*') shorthand needs pyparsing 1.5.6
grammar_label = ZeroOrMore(label_all)
Question. Will this grammar work if the user entered LABEL instead of label?
Answer. No. Use CaselessLiteral('label') instead.
Question. If the user entered single instead of double quotes, will it conform to the grammar ?
Answer. Yes
result = gmail.parseString('love label:writing-tips "bird by bird" '
    'from:(Anne Lamott) -"dalai lama" -label:macchu-pichu '
    'from:([email protected]) -label:french-guiana -"epictetus" '
    'label:yoga "bugle podcast" label from:(@microsoft.com)')
print result.dump()

nested = opener + Group(enclosed) + closer
pyparsing: Recursion
A grammar is recursive when there exists a nonterminal which has itself in the right-hand side of its production rule.

number ::= digit rest
rest   ::= digit rest | empty

digit = Word(nums, exact=1).setName('1-digit')
rest = Forward()
rest << Optional(digit + rest)
number = Combine(digit + rest, adjacent=False)('digit-list')
grammar = number.setParseAction(lambda t: int(t[0])) + Suppress(restOfLine)
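The same right-recursive shape written as plain recursive functions (a sketch independent of pyparsing; function names are mine):

```python
def parse_rest(text, pos):
    # rest ::= digit rest | empty -- recursion stops at the first non-digit
    if pos < len(text) and text[pos].isdigit():
        return parse_rest(text, pos + 1)
    return pos

def parse_number(text, pos=0):
    # number ::= digit rest -- the leading digit is mandatory
    if pos >= len(text) or not text[pos].isdigit():
        return None, pos
    end = parse_rest(text, pos + 1)
    return int(text[pos:end]), end

print(parse_number('1972 rest of line'))   # (1972, 4)
```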
Run
(nil,4,nil)
((nil,2,(nil,3,nil)),4,((nil,5,(nil,6,nil)),7,nil))

[Figure: the second input drawn as a binary tree rooted at 4, with subtree 2-3 on the left and subtree 5-6-7 on the right.]
Code
left, right, comma = map(Suppress, '(),')
empty = (CaselessLiteral('nil')
         .setParseAction(replaceWith(None)))
tree = Forward()
value = Word(nums).setParseAction(lambda t: int(t[0]))
tree << ((left + Group(tree) + bookend(value) + Group(tree) + right)
         | empty)
Run
((nil,2,(nil,3,nil)),4,((nil,5,(nil,6,nil)),7,nil))
Output :
[[[None],2,[[None],3,[None]]],4,[[[None],5, [[None],6,[None]]],7,[None]]]
Group(tree)
class TreeGroup(TokenConverter):
    def postParse(self, instring, loc, tokenlist):
        if len(tokenlist) == 1 and tokenlist[0] is None:
            return tokenlist
        else:
            return [tokenlist]
PyMeta : introduction
* OMeta is a language prototyping system (PEG).
* Implemented in several programming languages.
* Packrat memoization.
* Grammar: BNF dialect (with host-language snippets).
* Object-oriented: inheritance, overriding rules.
lowercase ::= <char_range 'a' 'z'>

def rule_lowercase():
    # ..body..
* <anything> consumes one object from the input stream (cf. '.' in regexes).
* Built-in rules: <letter>, <digit>, <letterOrDigit>, <token '?'>
PyMeta
e1 e2        sequence
e1 | e2      ordered choice
e*  e+  e?   repetition
~~e == &e    and-predicate
~e  == !e    not-predicate
* Grammar to recognize western dates
* era_heisei.py : Grammar to recognize heisei dates
* japan_date_parser.py : Final grammar
baseGrammar = """
range_num :min :max ::= <digit>+:m
    ?(int(join(m)) >= min & int(join(m)) <= max) => m
rest_of_line   ::= <anything>* <token '\n'>? => None
empty_line     ::= <spaces> <rest_of_line> => None
python_comment ::= <token '#'> <rest_of_line> => None
"""

def join(x):
    return ''.join(x)

JapanCommonParser = OMeta.makeGrammar(baseGrammar, globals(), "JapanCommonParser")
def parse_file(filename):
    # iterate through each line
    # .... snipped ...
    parser = JapanDateParser(line)
    result, error = parser.apply('grammar')
    # .... snipped ...

results = parse_file('imperial.utf8')
results = parse_file('western.utf8')
Run
Run
g = OMeta.makeGrammar(listGrammar, {})
parser = g([['600', '+', '66']])   # the input can be any iterable of objects
result, error = parser.apply('interp')
>>> result
666
>>> error
ParseError(2, [])
>>> import compiler
>>> print compiler.parse('import ctypes')
Module(None, Stmt([Import([('ctypes', None)])]))
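The compiler module shown above is Python 2 only; as a side note (my addition, not from the talk), the modern stdlib equivalent is ast:

```python
import ast

# ast.dump renders the same kind of tree the old compiler module printed
dumped = ast.dump(ast.parse('import ctypes'))
print(dumped)
```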
pyparsing vs PyMeta
                                 pyparsing                    PyMeta
Whitespace sensitive?            No, but turned on via        Yes; use the <spaces> rule
                                 leaveWhitespace()            to eat whitespace
Left recursion                   No                           Yes
Packrat memoization              Yes (enablePackrat())        Yes, only no-arg rules
Operates on character streams    Yes                          Yes
Operates on object streams       No                           Yes
Syntactic predicates             Yes                          Yes
Semantic predicates              Yes                          Yes
Semantic actions                 Yes                          Yes
Regex support                    Yes                          No
PyPy rlib/parsing
Library for generating tokenizers & parsers in RPython.
Consists of:
* regex / packrat parser
* tree structure / EBNF parser
Sample JSON ebnf:
NUMBER: "\-?(0|[1-9][0-9]*)(\.[0-9]+)?([eE][\+\-]?[0-9]+)?";
value: <STRING> | <NUMBER> | <object> | <array>
     | <"null"> | <"true"> | <"false">;
array: ["["] (value [","])* value ["]"];
entry: STRING [":"] value;
Resulting parse tree can be transformed or traversed with custom visitors, or visualized via graphviz (dot).
Usage of syntactic predicates:
* Parsing grammars of mathematical expressions in order to preserve operator precedence.
* Handling indents/dedents in order to parse indentation-sensitive languages.
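A not-predicate (!e) can be sketched in plain Python to show how predicates match without consuming input, e.g. to demand end-of-input (all names below are illustrative, not from any library):

```python
def lit(s):
    def parse(text, pos):
        return pos + len(s) if text.startswith(s, pos) else None
    return parse

def negate(rule):
    # !e : succeed, consuming nothing, only where e fails
    def parse(text, pos):
        return pos if rule(text, pos) is None else None
    return parse

def seq(*rules):
    def parse(text, pos):
        for r in rules:
            pos = r(text, pos)
            if pos is None:
                return None
        return pos
    return parse

# end-of-input is simply !<anything>:
anything = lambda text, pos: pos + 1 if pos < len(text) else None
end_of_input = negate(anything)

g = seq(lit('ab'), end_of_input)
print(g('ab', 0))    # 2 : matched 'ab' at end of input
print(g('abc', 0))   # None : trailing input rejected by the predicate
```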
Resources
pyparsing
http://pyparsing.wikispaces.com/ https://github.com/marcua/tweeql
PyMeta
http://www.tinlizzie.org/ometa/ http://gitorious.org/python-decompiler/python_rewriter