2016-01-20 20:17:11 - sivabudh - Learning how to use nestedExpr
2016-01-23 02:24:38 - Jeroen537 - getName() returns name for unnamed element?
2016-01-26 19:38:16 - ldelossa - Parsing HAProxy config, new line issues
2016-02-04 21:51:15 - palmer1979 - Pickle 3.5.1
2016-02-05 07:23:39 - Spida2 - Parsing LaTeX
2016-02-06 09:02:39 - palmer1979 - Pickle pyparsing in Python 2.7.11 versus 3.5.1
2016-02-18 15:20:04 - Spida2 - Escaped Comment Indicator (LaTeX)
2016-02-28 09:46:58 - sivabudh - Parsing a simple nested expression
2016-03-07 08:16:30 - Jeroen537 - Obtaining name for pattern using API?
2016-03-15 12:27:33 - fantomasdnb - Redundant result for Each(&) statement
2016-03-16 03:44:07 - Jeroen537 - Python 3 issue?
2016-03-16 06:04:16 - fantomasdnb - OneOrMore, ZeroOrMore + Optional hangs
2016-03-20 21:51:41 - aeiro1 - Each doesn't work unless used with Optionals
2016-03-21 15:53:02 - toddreed - Parse action is not always called
2016-03-22 03:20:54 - LawfulEvil - Performance with parsing braces "()"
2016-04-12 07:21:05 - Williamzjc - just a question about infixNotation
2016-04-28 08:54:05 - webpentest - Pyparsing hangs on parsing dblQuotedString
2016-04-29 07:52:17 - Cloudo - parsing cyrillic keywords
2016-05-03 10:19:49 - pjc42 - import Upcase error for v 2.1.1 python 3.5
2016-05-09 02:24:06 - janoglop - Help getting started
2016-05-22 02:07:59 - AndreWin - How to check if word is in code block?
2016-06-03 10:51:26 - ankur2002 - Url works without '/' at the end but not if enclosed in quotes
2016-07-14 11:33:43 - yurivkhan - Constructing ParseResults
2016-08-04 20:06:32 - Akshay7790 - Generate dynamic regex from input data
2016-08-22 10:05:07 - nileshp - Defining BNF grammar for CLI
2016-09-22 09:04:01 - lhughes42 - odd error with double back slash and using setDebug
2016-10-20 12:45:49 - rcrowe123 - How to grab the key:value from parseResults and send to setParseAction
2016-10-26 10:15:50 - Amoghavarsha - Parsed nested text using pyparsing
2016-10-27 09:58:04 - infecto - Parsing Expressions for Data Retrieval and Evaluation
2016-10-28 09:11:49 - susdu - pyparsing nested structure not working as expected
2016-12-07 18:50:12 - delisson - Can't get the simplest expression to work
2016-12-29 20:57:49 - Friudorian - Forward() and recursion error
Hi,
I have a simple parser as follow:
I got it 80% done. E.g., when I run it, the output will be like this:
What I can't really figure out is how to parse the content inside the nested expression.
I tried changing the nestedExpr expression to parse content inside nestedExpr:
attributes = Word(printables) + ':' + Word(printables)
command = nestedExpr('<%', '%>', content=OneOrMore(attributes))
but instead got everything out as plain text:
What I really want to do is to turn each attribute in the nestedExpr into a dictionary. For example, the first nestedExpr should return:
# Python dict
{
    'module': 'create-variable',
    'name': 'traffic_policy_name',
}
If anyone could please share what I'm doing wrong, that'd be great. Thank you so much.
If all you want to do is to process the various '<%...%>' tags, then you are better off using scanString or searchString instead of parseString. parseString forces you to define a grammar for the complete input string, but scanString and searchString will work on just the sub-pattern that you are looking for.
Also, none of your expressions actually nest groups - you don't have '<%... <%...%> ... %>'. If you don't actually have nesting, it's easier to skip nestedExpr and just define the pattern with a leading '<%' and a trailing '%>'.
Lastly, your definition of attributes uses 'Word(printables)' as the value part of the key-value, but in your sample code, one of the attributes has a quoted string as the value. This is easily fixed with 'attributes = Word(printables) + ':' + (quotedString | Word(printables))'. (You may have to come back to this and be a little more specific - 'Word(printables)' will match anything that is not whitespace.)
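A minimal sketch of the scanString approach (the sample string, the excludeChars choices, and the attribute values here are my own illustration, not taken from the poster's code):

```python
from pyparsing import (Group, OneOrMore, Suppress, Word, printables,
                       quotedString, removeQuotes)

# hypothetical sample input; the real tags come from the poster's template
data = 'text <% module : create-variable name : "traffic policy" %> more text'

# key and value exclude '%' so the Word tokens stop before the closing '%>'
attr_value = quotedString.setParseAction(removeQuotes) | Word(printables, excludeChars='%')
attribute = Group(Word(printables, excludeChars=':%') + Suppress(':') + attr_value)
tag = Suppress('<%') + OneOrMore(attribute) + Suppress('%>')

# scanString finds each tag without needing a grammar for the whole input
found = [{k: v for k, v in toks} for toks, s, e in tag.scanString(data)]
```

Each match comes back as key/value Groups that convert naturally to a Python dict.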
Wow! Thanks for such a prompt response, Paul! I'm sorry I didn't make things clear, I think this Gist will sum up what I actually wanted to achieve. I'm not sure if it's expressible in Pyparsing or not.
Basically, I need to process everything in the input string.
Thanks so much for a detailed response!
This code gives the output you are looking for, by using scanString to locate the attribute groups, and processing the intervening text for the unstructured statements. The statements are best handled in a separate pass outside the main parser - expressions like 'OneOrMore(Word(printables))'
can easily slurp in way more than is wanted, so processing a line at a time, ignoring Python comments, keeps this under control.
It uses the input variable 'data' from your Gist:
from pyparsing import *
# set up parser for matching command key-value groups, and creating a Python dict
identifier = Word(alphas+'_', alphanums+'_')
attr_value = quotedString.setParseAction(removeQuotes) | Word(printables)
attribute = Group(identifier + Suppress(':') + attr_value)
command_group = Suppress('<%') + Dict(OneOrMore(attribute))('data') + Suppress('%>')
command_group.ignore(pythonStyleComment)
command_group.setParseAction(lambda t: t.data.asDict())
def extract_statements(s):
    ''' Function to extract statements that do not match the command parser,
    stripping off any Python comments
    '''
    lines = filter(None, map(str.strip, s.splitlines()))
    stmt = OneOrMore(Word(printables).ignore(pythonStyleComment))
    out = [' '.join(stmt.parseString(line)) for line in lines]
    return out
# iterate over matches, extracting statements from unmatched text using scanString
# scanString returns the matched tokens, AND the start and end locations of the match
last = 0
output = []
for t, s, e in command_group.scanString(data):
    if s > last:
        # get everything since the last match
        unmatched = data[last:s]
        output.extend(extract_statements(unmatched))
    last = e
    # add the matched command, which has already been converted to a dict
    output.append(t[0])
# add any other text in the input string after the last matched command
unmatched = data[last:]
output.extend(extract_statements(unmatched))
# what did we get?
import pprint
pprint.pprint(output)
Gives:
['undo traffic-policy inbound',
{'module': 'create-variable', 'name': 'display blah'},
'quit']
I hope that gets you going.
-- Paul
For a parser/expression manipulator I am developing for the SPARQL language, I want to be able to access subexpressions of the parsed string by name so as to analyze and modify them.
I ran into a difficulty, whereby a name for a subexpression seems to be assigned not only to this subexpression, but also to the enveloping expression. A simple demonstrative example:
a = Word(nums)
test = a('left') + a('right')
r = test.parseString('123 456')
print(r)
# ['123', '456'] (as expected)
print(list(r.keys()))
# ['right', 'left'] (as expected)
print(r.left, r.right)
# 123 456 (as expected)
print(r.getName())
# left (???)
I would expect no name to be assigned to r.
For my purpose, this makes analyzing the parse result difficult. Is this intended behaviour, and if so, is there a workaround?
Thanks!
Jeroen.
I think I found a solution that works for me:
a = Word(nums)
test = Group((a('left') + a('right')))('fullexp')
r = test.parseString('123 456')
print(r)
print(list(r.keys()))
print(list(r.fullexp.keys()))
print(r.fullexp.left, r.fullexp.right)
print(r.getName())
Gives when run:
[['123', '456']]
['fullexp']
['right', 'left']
123 456
fullexp
so that there is no confusion between the names of nested levels. So I think I can go on.
My question whether the behaviour is intentional remains and I would be interested to learn the answer.
Whatever that may be, thanks for an awesome package.
Jeroen.
Hey guys,
I have the following code:
This works fine; however, for the life of me I can't get the results for the embedded 'defaults' block. Could anyone give me a hand? I've tried several things, but I always get back an 'expected defaults' message.
Since you have redefined the allowed whitespace to not accept newlines (which is common in parsers of line-oriented text, so this is not itself the problem), you have to put them into the parser yourself. You have done so for the most part, but things fail when you hit the empty line before the 'defaults' block. It looks like your blocks are separated by empty lines. If so, and since the block bodies are just being parsed as lines of words, you could write something like this:
new_line = LineEnd()
word = Word(alphanums+'/-:._') # or maybe just use Word(printables)?
block = Group(OneOrMore(Group(OneOrMore(word) + new_line.suppress())))
blocks = delimitedList(block, delim=OneOrMore(new_line))
blocks.parseString(config).pprint()
Note that I had to modify the definition of 'word' to add '_' in order to parse the key 'default_backend'.
But this parser is not really doing much for you, not any more than you could do just with splitlines and split. Do you have plans to make this parser more syntax-aware?
This works, but what I was hoping to do is create several lists, one each for global, defaults, frontend, and backend, and make separate Python dictionaries from them. I can then programmatically add to the dictionary and re-write the configuration file with the new entries, effectively making changes. I was under the same impression that the blank line was causing an issue, so I tried this:
and many variations of that, and I always get an error saying either 'expecting end of line' or 'expecting defaults'.
Looking at your pastebin, you are very close. But this:
default_block.parseString(config)
won't work because config doesn't start with a default block, it starts with a global block.
Try this instead:
(global_block + default_block).parseString(config).pprint()
BTW, the code I posted does create several lists.
result = blocks.parseString(config)
for section in result:
    print(section)
prints
[['global'], ['log', '127.0.0.1', 'local1', 'notice'], ['chroot', '/var/lib/haproxy'], ['user', 'haproxy'], ['group', 'haproxy'], ['daemon'], ['stats', 'socket', '/var/run/haproxysock', 'level', 'admin']]
[['defaults'], ['log', 'global'], ['mode', 'http'], ['option', 'httplog'], ['option', 'dontlognull'], ['option', 'forwardfor'], ['option', 'http-server-close'], ['timeout', 'connect', '5000'], ['timeout', 'client', '50000'], ['timeout', 'server', '50000']]
[['frontend', 'frontend'], ['bind', '127.0.0.1:80'], ['option', 'tcplog'], ['default_backend', 'backend']]
[['backend', 'backend'], ['balance', 'roundrobin'], ['server', 'redirect01', '192.168.122.112:80', 'check'], ['server', 'redirect02', '192.168.122.202:80', 'check']]
[['listen', 'stats', '127.0.0.1:1936'], ['mode', 'http'], ['stats', 'enable'], ['stats', 'uri', '/'], ['stats', 'hide-version'], ['stats', 'auth', 'user:user']]
Also, when you find yourself repeating chunks like:
OneOrMore(Group(OneOrMore(word) + new_line.suppress()))
assign it to its own variable like:
block_body = OneOrMore(Group(OneOrMore(word) + new_line.suppress()))
Then your various sections will start to look cleaner:
NL = new_line.suppress()
global_block = Keyword('global') + NL + block_body
default_block = Keyword('defaults') + NL + block_body
frontend_block = Keyword('frontend')*2 + NL + block_body
backend_block = Keyword('backend')*2 + NL + block_body
listen_block = Keyword('listen') + restOfLine + NL + block_body
SEP = OneOrMore(NL)
parser = (global_block + SEP +
          default_block + SEP +
          frontend_block + SEP +
          backend_block + SEP +
          listen_block)
Thank you for this ptmcg, very helpful. But now I'm curious,
The Haproxy config will almost always have Global, then default blocks, but after that, it could be any combination of backend and frontend blocks, e.g.
Global block{}
defaults block{}
frontend{}
frontend{}
backend{}
backend{}
Or it could be
frontend{}
backend{}
frontend{}
backend{}
Or any permutation of that. If I need to specify which blocks to parse, how can I tell pyparsing to parse a 'frontend' block OR a 'backend' block in whatever order they may be specified? (Technically I could do the differentiation between frontend and backend after I have the list, but this is a task for me to get to know pyparsing better.)
Also, and I'm not sure if this is legal, but how could I also handle the defaults block coming before the global block?
You are looking for the alternative version of And (with operator '+') which is Each (with operator '&'). Each will look for all the given expressions, in any order. This way you can write a flexible parser without having to spell out all the combinations of different orders that things can occur. Here is how your parser looks using Each (and with a little lambda to make the block definitions easier to define):
new_line = LineEnd()
word = Word(alphanums+'/-:._')
block_body = OneOrMore(Group(OneOrMore(word) + new_line.suppress()))
NL = new_line.suppress()
make_block = lambda expr: Group(expr + NL + block_body)
global_block = make_block(Keyword('global'))
default_block = make_block(Keyword('defaults'))
frontend_block = make_block(Keyword('frontend')*2)
backend_block = make_block(Keyword('backend')*2)
listen_block = make_block(Keyword('listen') + restOfLine)
SEP = OneOrMore(NL)
parser = (global_block & SEP &
          default_block &
          frontend_block &
          backend_block &
          listen_block)
parser.parseString(config).pprint()
Now try running this with blocks in different orders. An interesting side note - you don't need to specify the separator multiple times, since it is already a OneOrMore. Each will look for one or more of these, even if they are not together.
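For readers without an HAProxy config handy, here is a tiny self-contained illustration of the '&' operator (the block names are borrowed from the thread, everything else is made up):

```python
from pyparsing import Each, Group, Keyword

frontend = Group(Keyword('frontend'))
backend = Group(Keyword('backend'))
parser = frontend & backend   # '&' builds an Each: both required, in any order

# both input orders parse successfully
r1 = parser.parseString('frontend backend').asList()
r2 = parser.parseString('backend frontend').asList()
```

Sorting the results makes the comparison order-independent; both inputs yield the same pair of groups.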
Is it possible to pickle ParseResult objects in Python 3.5.1?
I have code that worked perfectly in Python 2.7.11 and now fails under 3.5.1:
import grammar
schema_data = grammar.entity_decl.parseString(schema)
print(schema_data[0].entity_id) # prints the entity id -> everything parses fine
pickle.dump(schema_data, open(storage_file, 'wb')) # -> ok
pickle.load(open(storage_file, 'rb')) # -> error
The error message is as follows:
TypeError: __new__() missing 1 required positional argument: 'toklist'
What changed??
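One version-independent workaround (my suggestion, not from the thread): convert the ParseResults to plain Python structures with asDict() or asList() before pickling, since plain dicts and lists round-trip under any Python. The grammar below is a throwaway stand-in for grammar.entity_decl:

```python
import pickle
from pyparsing import Word, alphas

# throwaway grammar standing in for the poster's grammar.entity_decl
greeting = Word(alphas)('salutation') + Word(alphas)('greetee')
result = greeting.parseString('Hello World')

# pickle a plain dict instead of the ParseResults object itself
payload = pickle.dumps(result.asDict())
restored = pickle.loads(payload)
```

The restored object is an ordinary dict, so no pyparsing internals are involved in unpickling.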
I am trying to parse LaTeX (or at least some subset). I assume that every LaTeX command starts with a backslash, followed by an arbitrary number of parameters and arguments enclosed in [] and {}, respectively. I am also successfully ignoring both comments introduced by the percent character and block comments in \begin{comment}...\end{comment}:
import string
from pyparsing import *
test = '''\documentclass[opt1,opt2][foo]{a4paper}
\\newcommand{foo}{\secondarg[opt]{arg}}
\\begin{document}
\chapter{Intro}
Chapter Intro Introduction %comment at end of line
\section{First Section}
Section Content text
escaped backslash \\\\ in text
escaped percent \\% in text
This is a test
\\begin{comment}
\section{Commented Section}
\end{comment}
\end{document}
'''
class FoundCommand:
    def __init__(self, st, locn, toks):
        self.st = st
        self.locn = locn
        self.toks = toks
        print '>>> Tex Command \'%s\' at line %s'%(self.toks[0][0], lineno(self.locn, self.st))

class FoundParameter:
    def __init__(self, st, locn, toks):
        self.st = st
        self.locn = locn
        self.toks = toks
        print '  [ Tex Command Parameter %s at line %s'%(self.toks, lineno(self.locn, self.st))

class FoundArgument:
    def __init__(self, st, locn, toks):
        self.st = st
        self.locn = locn
        self.toks = toks
        print '  { Tex Command Argument %s at line %s'%(self.toks, lineno(self.locn, self.st))

#class FoundLineComment:
#    def __init__(self, st, locn, toks):
#        self.st = st
#        self.locn = locn
#        self.toks = toks
#        print '*** Line Comment \'%s\' at line %s'%(self.toks[0], lineno(self.locn, self.st))

#class FoundBlockComment:
#    def __init__(self, st, locn, toks):
#        toks = ''.join(toks[0])
#        self.st = st
#        self.locn = locn
#        self.toks = toks
#        print '*** Block Comment at line %s'%(lineno(locn, self.st))
#        #print '*** Block Comment \'%s\' at line %s'%(self.toks, lineno(self.locn, self.st))

class FoundText:
    def __init__(self, st, locn, toks):
        toks = ''.join(toks)
        self.st = st
        self.locn = locn
        self.toks = toks
        print '... Text \'%s\' at line %s'%(self.toks, lineno(self.locn, self.st))
# Characters
backslash = '\\'
percent = '%'
bracketleft = '['
bracketright = ']'
curlyleft = '{'
curlyright = '}'
special_chars = backslash + percent + bracketleft + bracketright + curlyleft + curlyright
standard_chars = printables.translate(string.maketrans('', ''), special_chars)
esc_percent = Literal(backslash + percent)
esc_backslash = Literal(backslash + backslash)
esc_bracketleft = Literal(backslash + bracketleft)
esc_bracketright = Literal(backslash + bracketright)
esc_curlyleft = Literal(backslash + curlyleft)
esc_curlyright = Literal(backslash + curlyright)
escape = esc_percent | esc_backslash | esc_bracketleft | esc_bracketright | esc_curlyleft | esc_curlyright
escape.setParseAction(lambda st, locn, toks: toks[0][1])
# Tex commands
text = Forward()
commandname = Word(alphas)
parametervalue = Word(alphas)
#parametervalue = text
parameter = Suppress(Literal(bracketleft)) + parametervalue + Suppress(Literal(bracketright))
parameter.setParseAction(FoundParameter)
argumentvalue = Word(alphas)
#argumentvalue = text
argument = Suppress(Literal(curlyleft)) + argumentvalue + Suppress(Literal(curlyright))
argument.setParseAction(FoundArgument)
texcommand = Group(Suppress(backslash) + commandname + ZeroOrMore(parameter) + ZeroOrMore(argument))
texcommand.setParseAction(FoundCommand)
# Text
text = OneOrMore(White() | Word(standard_chars) | escape)
text.setParseAction(FoundText)
tex = OneOrMore(texcommand)
linecomment = Suppress(percent) + restOfLine
#linecomment.setParseAction(FoundLineComment)
blockcomment = nestedExpr('\\begin{comment}', '\\end{comment}')
#blockcomment.setParseAction(FoundBlockComment)
comment = linecomment | blockcomment
texgrammar = OneOrMore(text | tex)
texgrammar.ignore(comment)
if __name__ == '__main__':
    print '== scan =========================='
    for i in texgrammar.scanString(test):
        pass
    print '== parse ========================='
    info = texgrammar.parseString(test, parseAll=True)
However, the above code seems to work with scanString, but not with parseString:
[code]
Please delete this post. Formatting broken, old version.
Hi, I have the following test code. It works using Python 2.7.11 and fails using 3.5.1.
import pyparsing as pp
import pickle
class Greeting():
    def __init__(self, toks):
        self.salutation = toks[0]
        self.greetee = toks[1]
word = pp.Word(pp.alphas + "'.")
salutation = pp.OneOrMore(word)
comma = pp.Literal(',')
greetee = pp.OneOrMore(word)
endpunc = pp.oneOf('! ?')
greeting = salutation + pp.Suppress(comma) + greetee + pp.Suppress(endpunc)
greeting.setParseAction(Greeting)
string = 'Good morning, Miss Crabtree!'
g = greeting.parseString(string)
pkl = 'test.pkl'
pickle.dump(g, open(pkl, 'wb'))
pickle.load(open(pkl, 'rb'))
Please help, I really need to store my results, as my real grammar and input string take minutes to parse.
Sorry, the formatting is messed up in the Greeting class.
I'm just seeing this post now - has this been cleared up with the latest Pyparsing release (2.1.0)?
Hi Paul, I had posted this question on stackoverflow, too. You already replied. Problem solved. Sorry about the double post. Maybe you could put a big bold statement on this page to tell people to use stackoverflow? Arne
I am trying to parse LaTeX, which uses the percent sign as comment indicator. While this works, I cannot include the percent sign escaped by a backslash into the text:
import string
from pyparsing import *
teststring = r'''
escaped backslash \\ in text
escaped percent \% in text
'''
backslash = '\\'
percent = '%'
curlyleft = '{'
curlyright = '}'
special_chars = percent + backslash + curlyleft + curlyright
standard_chars = printables.translate(string.maketrans('', ''), special_chars)
argumentvalue = Forward()
argument = Suppress(Literal(curlyleft)) + Optional(argumentvalue) + Suppress(Literal(curlyright))
commandname = Word(alphas)
command = Suppress(backslash) + commandname + argument
text = Combine(OneOrMore(White() | Word(standard_chars)))
comment = Suppress(percent) + restOfLine
escapebackslash = Literal(backslash) + Literal(backslash)
escapepercent = Literal(backslash) + Literal(percent)
escape = escapebackslash | escapepercent
commandortext = command | escape | text
argumentvalue << OneOrMore(commandortext)
texgrammar = OneOrMore(commandortext)
texgrammar.ignore(comment)
try:
    res = texgrammar.parseString(teststring, parseAll=True)
except ParseException, err:
    print ''
    print 'Parse Error:'
    print err
    print err.line
    print ' ' * (err.column - 1) + '^'
Error message:
Parse Error:
Expected end of text (at char 46), (line:3, col:17)
escaped percent \% in text
^
I'm trying to parse a very simple list of attributes as can be seen by this gist.
However, I'm having trouble trying to parse the 'then:' key.
Ideally, I want the output to return the list of commands inside the braces { }. Eg. in Python:
{
'then': [
'[pe_cir] so go go',
'hello my blah [pe_cir]',
'another command to try out',
]
}
I'm not sure what I'm doing wrong. I got all other parts right except for 'then.'
Btw, I saw the author's plea about 'Getting Started with PyParsing' bootleg and him needing help to pay for his son's college tuition. Well, I just went ahead and bought the 'Getting Started with Pyparsing' from O'Reilly. ;-)
Thank you for buying the book! :) I've been pretty busy with a new job lately, will get to look at your issue this weekend.
Hi,
As far as I can see for now, literal_with_braces won't parse a line like '{abc [abcd] [abc] asdf}' or similar.
value_with_bracket = Literal('[') + Word(alphas + '_-') + Literal(']')
line_with_braces = (Suppress('{') + OneOrMore(Word(alphanums) | value_with_bracket) + Suppress('}'))
There is no need to add a space to the chars, because the parser breaks words on whitespace by itself. I don't really know how the parser treats an unwrapped string like '[', so I'd add a Literal wrapper. (It turns out this change is not necessary.) The main change is this one.
Also as far as I understand
OneOrMore(Word(alphanums)) | OneOrMore(value_with_bracket)
will parse a series of OneOrMore alphanums and then a series of OneOrMore bracket-values, but you want alphanums and bracketed values in any order, so I suggest
OneOrMore(Word(alphanums) | value_with_bracket)
If you add .ignore(pythonStyleComment) for testing, just that part will go through.
And still the command_group parser didn't parse it until I added
line_with_braces.ignore(pythonStyleComment)
I think the 'upper' pattern doesn't propagate ignores to nested ones. I haven't gotten to use it much yet.
Let me know if it doesn't work or something.
Resulting gist:
For a parsing framework I am developing with pyparsing as the parsing engine, I want to have access to the name of a pattern, as set by setName(). (I know that setName() is meant for debugging, but I have additional use for it.) I have not found a method in the API to get to the name. However, 'pattern.name' seems to work. Would it be OK for me to consider this to be part of the public API?
Thanks!
As long as the attribute does not start with an underscore '_', go ahead and use it. I look forward to seeing your parsing framework. You might also get some hints by looking at an old project, pyparsing_helper, developed by Catherine Devlin.
Thanks for replying. I will certainly share it with you once the basis is stable.
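For reference, a minimal sketch of the attribute in action (a throwaway pattern, not from the SPARQL framework being discussed):

```python
from pyparsing import Word, nums

# setName() stores a display name for the pattern
integer = Word(nums).setName('integer')

# reading it back via the .name attribute works, per the answer above
name = integer.name
```

The name round-trips exactly as set, so it can double as metadata for a framework.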
I'll simply show the output
>>> pp = Optional(Literal('a')) & Optional(Literal('b'))
>>> pp.parseString('a')
(['a'], {})
>>> pp.parseString('b')
(['b'], {})
>>> pp.parseString('a b')
(['a', 'b'], {})
>>> pp.parseString('b a')
(['b', 'a'], {})
>>> pp.parseString('a b b')
(['a', 'b', 'b'], {})
Why does it parse the last 'b'? It can also be 'a a b' or 'a b a'.
I just tried this on my development version of pyparsing, and this runs correctly. If you are not running 2.1.0, please update and retry. I should get 2.1.1 out in the next week or so.
pip didn't offer 2.1.0 to install. After installing manually, the problem seems to be solved. Thanks.
While debugging my code I came across the following exception. The situation is rather complicated and I am not sure that once I resolve my problem it will not go away, but I report it since it looks like a possible Python 3 issue to me.
Best regards, Jeroen.
Error evaluating: thread_id: pid56297_seq2
frame_id: 4342372520
scope: FRAME
attrs: parseresults
Traceback (most recent call last):
File '/eclipse/plugins/org.python.pydev_4.4.0.201510052309/pysrc/pydevd_vars.py', line 238, in resolveCompoundVariable
return resolver.getDictionary(var)
File '/.../eclipse/plugins/org.python.pydev_4.4.0.201510052309/pysrc/pydevd_resolver.py', line 105, in getDictionary
return self._getPyDictionary(var)
File '/.../eclipse/plugins/org.python.pydev_4.4.0.201510052309/pysrc/pydevd_resolver.py', line 171, in _getPyDictionary
names = dir(var)
File '/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/pyparsing.py', line 574, in __dir__
return dir(super(ParseResults,self)) + self.keys()
TypeError: can only concatenate list (not 'dict_keys') to list
Error evaluating: thread_id: pid56297_seq2
frame_id: 4342372520
scope: FRAME
attrs: t
Traceback (most recent call last):
File '/.../eclipse/plugins/org.python.pydev_4.4.0.201510052309/pysrc/pydevd_vars.py', line 238, in resolveCompoundVariable
return resolver.getDictionary(var)
File '/.../eclipse/plugins/org.python.pydev_4.4.0.201510052309/pysrc/pydevd_resolver.py', line 105, in getDictionary
return self._getPyDictionary(var)
File '/.../eclipse/plugins/org.python.pydev_4.4.0.201510052309/pysrc/pydevd_resolver.py', line 171, in _getPyDictionary
names = dir(var)
File '/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/pyparsing.py', line 574, in __dir__
return dir(super(ParseResults,self)) + self.keys()
TypeError: can only concatenate list (not 'dict_keys') to list
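The failing line concatenates a list with a dict_keys view, which Python 3 no longer allows; the incompatibility (and the list() fix) can be reproduced outside pyparsing:

```python
# pyparsing's __dir__ effectively did:  dir(...) + self.keys()
# under Python 3, dict.keys() returns a view object, not a list
d = {'left': '123', 'right': '456'}

try:
    dir(object) + d.keys()        # Python 3: TypeError
    raised = False
except TypeError:
    raised = True

# wrapping the view in list() restores the Python 2 behaviour
merged = dir(object) + list(d.keys())
```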
Yes, this sounds like a Python 3 issue - will get a fix checked in by the weekend (have family plans for next 2 days).
Oh, I just checked my latest code version, this has already been fixed - should be in SVN, if not already released in 2.1.0.
OK. Thanks for replying!
The statement: OneOrMore(Optional('a') + Optional('b'))
with this example: ex.parseString('a b')
hangs forever
This is a simplified example from my project, but in my case it gets to parsing an empty string! (instring = '') Is there a way to rewrite it somehow? I want a sequence of one or more 'a', 'b' or 'a b'. Isn't that the correct pattern?
Look at this example:
>>> print((Optional('a') + Optional('b')).parseString(''))
[]
The problem here is that the body of your OneOrMore expression will also match nothing, so after matching 'a' and 'b', the OneOrMore will run forever at the end of the input string, matching and matching the empty string after the 'b'.
You can fix this in a couple of ways. You can explicitly add a lookahead inside OneOrMore, to not advance if at the end of the input string:
ex = OneOrMore(~LineEnd() + Optional('a') + Optional('b'))
Or you can add a lookahead that will only advance if there is something more printable to look at:
ex = OneOrMore(FollowedBy(Word(printables)) + Optional('a') + Optional('b'))
But both of these will fail if you change your input to 'a b c', for the same reason as before: the parser is still at a place in the input string where it is successfully matching the lookahead, but has nothing to match that will let it advance.
The solution is to drop back and revisit why you are making these expressions optional in the first place. If you want to match a string containing one or more 'a' or 'b' characters, then just do:
ex = OneOrMore(Literal('a') | Literal('b'))
-- Paul
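The last suggestion can be checked in isolation (a short sketch, not part of the original project):

```python
from pyparsing import Literal, OneOrMore

# no Optionals inside the repetition, so it can never match an empty string
# and the parser always advances
ex = OneOrMore(Literal('a') | Literal('b'))
result = ex.parseString('a b b a').asList()
```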
foobar = Literal('foo') & Literal('bar')
foobar.parseString('foo bar') # Throws exception
foobar = Optional('foo') & Optional('bar')
foobar.parseString('foo bar') # works fine (but also matches 'foo' and 'bar' separately)
Why is this?
I think it's a bug. I ran through the source code with a debugger, and the parser actually finds the 'foo' part, but when parsing 'bar' the start location is 0, so it starts from the beginning of the string.
Also, I think it would match them separately anyway, because they are two different Literals (or any elements in general). To have these parts connected, use Group (list) or Combine (string).
And I think the host of this website should fix the annoying comment doubling.
Yes, this was a bug in Each. It has been fixed in the latest release, which I just pushed out this week - do 'pip install -U pyparsing' to get 2.1.1.
Yes, the wikispaces discussion threads are a pain, I'm sorry. But it is a free wiki space, and pyparsing does not generate tons of revenue, so we live with what we can afford.
Thanks for writing - good luck in future pyparsing work! -- Paul
I just started using pyparsing and encountered a behaviour that I believe is a bug. The following code (which uses a fragment of the sample SQL SELECT grammar) demonstrates the issue: the action function ordering_term_action() is only called once even though the ordering_term rule is matched twice. When removing the 'order_by_terms' result name, ordering_term_action() is called twice as expected. (I'm able to work around the issue by wrapping the grammar fragment in a Group().)
Is this a bug or expected behaviour?
I'm using pyparsing 2.1.0 with Python 3.5.0.
from pyparsing import *
(ORDER, BY, ASC, DESC) = map(CaselessKeyword, 'ORDER BY ASC DESC'.split())
COMMA = Suppress(',')
keyword = MatchFirst((ORDER, BY))
identifier = ~keyword + Word(alphas, alphanums + '_')
ordering_term = (identifier + Optional(ASC | DESC))
# This definition of select, ordering_term_action is called once (unexpected)
select = ORDER + BY + delimitedList(ordering_term)('order_by_terms')
# This definition of select, ordering_term_action is called twice (expected)
#select = ORDER + BY + delimitedList(ordering_term)
# This definition of select, ordering_term_action is called twice (expected)
#select = ORDER + BY + Group(delimitedList(ordering_term)('order_by_terms'))
def ordering_term_action(tokens):
    print('ordering_term_action called')
ordering_term.setParseAction(ordering_term_action)
result = select.parseString('order by author asc, title desc', True)
print(result.dump())
Are you sure that the 3rd case works as expected? When I run it, it only gets called once, just like the first case. The common issue is that both the 1st and 3rd versions use results names in their definition.
In order to support reusing the same expression multiple times in a parser with different results names, pyparsing has to make a copy of that expression. This is the case whether the name is attached directly to the expression, or to a OneOrMore or other container containing the expression. Since pyparsing has made a copy of ordering_term, then your later call to setParseAction only modifies the original ordering_term, and not its copy.
The solution is to attach the parse action to ordering_term immediately after you have defined it, and before it is used elsewhere, especially with results names. When I move the setParseAction call up to right below the initial definition of ordering_term, then I get consistent double-calling of the parse action. Not a bug, but definitely a gotcha when defining your grammar.
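A sketch of the recommended ordering (the names and the counting lambda are my own illustration): attach the parse action first, then reuse the expression with results names, and both calls happen.

```python
from pyparsing import Word, nums

calls = []
term = Word(nums)
# attach the parse action BEFORE reusing the expression with results names,
# so the copies pyparsing makes carry the action along with them
term.setParseAction(lambda t: calls.append(t[0]))

select = term('left') + term('right')
select.parseString('123 456')
```

Moving the setParseAction call below the definition of select would leave the copies without the action, reproducing the single-call behaviour described above.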
Right, the last case should have been:
select = ORDER + BY + Group(delimitedList(ordering_term))('order_by_terms')
(the result of a last-minute edit that I didn't test).
Thanks for the explanation!
Hi, I have written a simple parser for logical expressions following an example, and added operators <, >, =, etc. (infixNotation). It works well, but parsing a string takes anywhere from 0.2 to 2 s. I read about associating parse actions with the concrete tokens and will try that; for now I work around it by caching parse results, so that part is not the problem. My problem is that when there are parentheses '()' in the expression, the parsing time increases to 20 seconds! E.g.: ({a.b} < 1 or {a.d} = 2) and k in [1, 2, 3] - '{}' is my quoted string and I have no problem with it if there are no parentheses in the expression. If I remove the parentheses, parsing is fast (with the correct calculated result). I have not written any special token definitions to parse the parentheses, and I need advice on how to improve the infix parser to handle them. Do I have to add any definitions to speed up recognition of the grouping parentheses?
P.S. Calculating the result from the parsed expression is very fast!
Thanks
If you have defined a long list of operators in infixNotation, you should probably enable packrat parsing. See docs here:
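For example (the operator set and names here are just an illustration):

```python
from pyparsing import ParserElement, infixNotation, opAssoc, Word, alphas, oneOf

# enable packrat memoization once, before parsing
ParserElement.enablePackrat()

operand = Word(alphas)
bool_expr = infixNotation(operand, [
    (oneOf('< > ='), 2, opAssoc.LEFT),
    ('and', 2, opAssoc.LEFT),
    ('or', 2, opAssoc.LEFT),
])
print(bool_expr.parseString('(a < b or c = d) and e > f'))
```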
Thanks! That helped! It improved parsing time more than tenfold! I haven't paid much attention to how packrat parsing works, but I gather it's some kind of cache, keeping some items in memory. Is there any best practice for using this functionality? I mean, how large can this cache grow, and what does its size depend on?
The caching for packrat parsing was completely rewritten in pyparsing 2.1.6, and is much more efficient in terms of both CPU and memory. The cache uses a size-limited OrderedDict, so it will not grow without bound. Please upgrade to the latest version to get these and other updates.
In the infixNotation function, there are expressions such as FollowedBy(lastExpr + opExpr) + Group(lastExpr + OneOrMore(opExpr)). But I think expressions like FollowedBy(e) + e are theoretically equivalent to e. So why do you add the FollowedBy expression at the beginning of these expressions? What am I missing?
This is to avoid the grouping of bare terms that are not followed by operators. That is, parsing '3+2' returns [['3','+','2']], but parsing '3' just returns ['3']. If I recall correctly, the Group term also calls any attached parse action - using the FollowedBy lookahead also prevents calling the parse action if there is no operator.
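The difference is easy to see in a small sketch:

```python
from pyparsing import infixNotation, opAssoc, Word, nums

arith = infixNotation(Word(nums), [('+', 2, opAssoc.LEFT)])

print(arith.parseString('3+2').asList())  # [['3', '+', '2']]
print(arith.parseString('3').asList())    # ['3'] - no extra Group
```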
I see. But the parsing result of '3' is indeed ['3'] even after deleting the FollowedBy parsers (except the 5th), and experiments showed that in some cases it is no slower without them (with parse actions in the oplist).
I was just rerunning my unit tests with this change - unit test InfixNotationGrammarTest3 specifically tests for this. When removing the FollowedBy's, I get the parse action called many times. I was able to reproduce your behavior if packrat parsing is enabled - this makes sense, as pyparsing will cache the parsing done by the FollowedBy's, and then when the actual expressions are matched, the parsing part is skipped and the values just fetched from the cache.
Is it possible (or meaningful, or necessary) to redefine the And expression (or define another type of And expression) which does not call actions until it has parsed all sub-expressions successfully?
The minimal example is as follows: (hangs forever)
from pyparsing import dblQuotedString
parser = dblQuotedString
instring = '"' + '\xff' * 500
parser.parseString(instring)
Tested using pyparsing 2.1.1 on py 2.7
Thank you for reporting this! This is clearly a case of catastrophic regex backtracking. I have a very simple solution that I will check in later this evening, after getting a chance to run against unit tests, etc.
Change has been checked into SVN - thanks again for this test!
Thanks! I can confirm that the fix works. Is there any estimate on when the 2.1.2 will be pushed to pypi?
Is this an urgent point? From your test, this looks like an infrequent corner case. When quoted strings in the input are properly terminated, this backtracking doesn't really happen. (Btw, I also tuned up the C and C++ style comments - but again, it only applies to poorly formed input, which starts out looking like a comment, but fails to have the closing '*/'.)
Well, this is a real issue we were facing in our application that uses pyparsing - reproduced at least a couple of times.
The test is a reduced and simplified version of our real-life case.
Just pushed 2.1.2 with this fix out to pypi and sourceforge. Let me know if there are any issues.
Thanks! I checked the pypi version and it passes our tests. Great job!
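For the record, the failing input is easy to reproduce; on versions with the fix, the unterminated string fails fast with a ParseException instead of hanging in regex backtracking (the timing threshold here is just illustrative):

```python
import time
from pyparsing import dblQuotedString, ParseException

instring = '"' + '\xff' * 500  # opening quote, never terminated
start = time.time()
raised = False
try:
    dblQuotedString.parseString(instring)
except ParseException:
    raised = True
elapsed = time.time() - start
print('raised ParseException: %s in %.3fs' % (raised, elapsed))
```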
I'm a newbie so this is almost too obvious to be real, but I cannot import Upcase from pyparsing 2.1.1 in Python 3.5. It works fine if I downgrade to 2.0.3. A simple dir(pyparsing) of the 2.1.1 version shows the Upcase class is not there. Am I missing something totally obvious, or is this actually an issue with version 2.1.1?
Upcase was deprecated in version 1.3.3, Sept. 2005, and finally removed once and for all in 2.1.0. In place of using Upcase, you can use the parse action upcaseTokens. (keepOriginalText was also removed in 2.1.0).
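For example, a minimal migration sketch, using a plain parse action equivalent to upcaseTokens:

```python
from pyparsing import Word, alphas

word = Word(alphas)
# equivalent of the removed Upcase class: a parse action that
# uppercases each matched token (upcaseTokens does the same)
word.setParseAction(lambda t: [s.upper() for s in t])
print(word.parseString('hello'))
```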
Thank you, got it. I appreciate the help.
Hello, I feel really lost with this module. I'm not a very experienced programmer and I'm not sure if I need to use this. I'm not able to work out what is going on from the docs. Can someone help me and give me a clue? Maybe some IRC channel or other live chat? This module is the hardest to read of all the modules I have used before.
What I would like to build: Program which can extract the imports, classes, functions and some other stuff from python source codes and put it into a new file as an outline.
full python file:
'''doc string of python module'''
import module1
from module2 import func, Class
def helper_function(param1,param2):
    print(param1,param2,'logic')
class FullModule():
    def __init__(self):
        print('init logic')
outlined_file.py:
'''doc string of python module'''
import module1
from module2 import func, Class
def helper_function(param1,param2):
    pass
class FullModule():
    def __init__(self):
        pass
full python source code file
'''doc string of python module'''
import module1
from module2 import func, Class
def helper_function(param1,param2):
    print(param1,param2,'logic')
    def local_helper(func):
        print(func)
class FullModule:
    def __init__(self):
        print('init logic')
outlined_file.py:
'''doc string of python module'''
import module1
from module2 import func, Class
def helper_function(param1,param2):
    # with option to outline local functions
    pass
class FullModule:
    def __init__(self):
        pass
Writing a Python structure extractor is pretty ambitious for a first-time pyparsing project. In the past, I've done simple parsers that people have been struggling with, but this really goes quite beyond that. As a first cut, why don't you try just getting all the import statements from the Python code? This is a pretty straightforward problem, with a few special cases to make it interesting. If you are struggling with the pyparsing part, then at least write up a BNF of the import statement, and then from there I can help you do a little pyparsing parser to extract them.
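To show the flavor of it, here is a rough sketch of such an import-statement parser ('as' aliases and parenthesized name lists intentionally left out; identifier rules simplified):

```python
from pyparsing import Keyword, Word, alphas, alphanums, delimitedList

ident = Word(alphas + '_', alphanums + '_')
dotted_name = delimitedList(ident, delim='.', combine=True)

import_stmt = (Keyword('from') + dotted_name('module')
               + Keyword('import') + delimitedList(ident)('names')
               | Keyword('import') + delimitedList(dotted_name)('modules'))

print(import_stmt.parseString('from module2 import func, Class').dump())
print(import_stmt.parseString('import module1').dump())
```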
Hello!
I'm trying to write a converter from wiki markup to markdown. I need to convert ''inline_code'' to `inline_code`. I wrote:
inlineCode = pp.QuotedString("''")('content').setParseAction(lambda t: '`' + t.content + '`')
This works fine, but it also converts text inside code blocks...
How can I turn off parsing in code blocks or check if my word is in it?
Best regards, Andrey.
I just found out about tokenization. That solves my problems.
Thanks a lot.
Best regards, Andrey.
So this one is kind of a strange one. I am trying to verify a particular file by specifying each of its elements. Everything was working fine until I came across a line such as ''. Now here is the strange part: if I just use the following expression to match, it works:
policies_url = Combine(scheme + colon + delimiter + delimiter + OneOrMore(name + dot) + OneOrMore(delimiter & name) + Optional(index) + restOfLine)
But if I use something like
policies_url = Combine(scheme + colon + delimiter + delimiter + OneOrMore(name + dot) + OneOrMore(delimiter & name) + Optional(index))
dir_url = Combine(quote + policies_url + quote + restOfLine)
to match ''
it will not work; however, if I simply add a / at the end, such as '', then it works fine.
So my question is: why does this expression need a closing '/' when quote, i.e. quote = Literal('''), is added to the expression?
Regards, AB
I have trouble creating ParseResults in my program.
Versions: Python 3.5.1 and pyparsing 2.0.3 as packaged in Ubuntu 16.04.
Suppose a grammar very much like C structures. Basically, a structure has a name and a bunch of members. A member has a type and a name. For simplicity, assume that types and names are arbitrary words. As a convenience, several consecutive members of the same type can be introduced by listing their names delimited with commas.
struct Foo {
    int x, y;
    float z;
};
The following pyparsing grammar naturally follows:
from pyparsing import *
word = Word(alphas)
members = Group(word('type')
+ Group(delimitedList(word))('names')
+ Suppress(';'))('members')
structKeyword = Suppress(Keyword('struct'))
struct = Group(
structKeyword + word('name') + Suppress('{')
+ Group(ZeroOrMore(members))('members') + Suppress('}')
+ Suppress(';'))('struct')
This grammar produces ParseResults of the following kind:
<root>
  <struct>
    <name>Foo</name>
    <members>
      <members>
        <type>int</type>
        <names>
          <ITEM>x</ITEM>
          <ITEM>y</ITEM>
        </names>
      </members>
      <members>
        <type>float</type>
        <names>
          <ITEM>z</ITEM>
        </names>
      </members>
    </members>
  </struct>
</root>
However, this is a nuisance to work with later. I would like to desugar the comma-separated definitions, to get the following tree:
<root>
  <struct>
    <name>Foo</name>
    <members>
      <member>
        <type>int</type>
        <name>x</name>
      </member>
      <member>
        <type>int</type>
        <name>y</name>
      </member>
      <member>
        <type>float</type>
        <name>z</name>
      </member>
    </members>
  </struct>
</root>
I could do that as a postprocessing step, by walking the ParseResults and building a data structure of my own. This is straightforward but boring, especially considering that in the real program there are quite a few more grammar rules.
The next obvious way is to add a parsing action, and that's where I get stumped.
I imagine the action needs to be attached to the members grammar rule. It receives a 'list' of one element which is a 'dictionary' whose one key is type and the other is names. type is a string while names is a list of strings. The action needs to return a 'list' of 'dictionaries', one for each name in the original's names.
The following kind of works:
def expand(tokens):
    return [{'type': token.type, 'name': name}
            for token in tokens
            for name in token.names]
Namely, it produces the following structure:
<root>
  <struct>
    <name>Foo</name>
    <members>
      <members>{'name': 'x', 'type': 'int'}</members>
      <ITEM>{'name': 'y', 'type': 'int'}</ITEM>
      <members>{'name': 'z', 'type': 'float'}</members>
    </members>
  </struct>
</root>
Notice how individual member definitions are rendered as a text representation of a Python dictionary. When accessed as Python dictionaries (x['type']), they work as intended. But they cannot be accessed as namespaces (x.type) or lists (x[0]), and the XML rendition is ugly.
It becomes clear that I have to construct a proper ParseResults structure. I sort of managed to do this:
import pyparsing
def expand(tokens):
    items = []
    for name in tokens[0].names:
        item = ParseResults([tokens[0].type, name], 'member')
        item['type'] = tokens[0].type
        item['name'] = pyparsing._ParseResultsWithOffset(name, 1)
        items.append(item)
    return ParseResults(items)
I don't like it because (1) I have to duplicate the element values in the constructor call and in subsequent item assignments, and (2) I am forced to delve into undocumented private implementation details (_ParseResultsWithOffset).
So what I'd like to ask is:
- Is my goal (to apply structural transformations during parsing, while keeping the whole tree accessible as `ParseResults`) sane? Or should I fall back to transforming the complete parsed AST to a different data structure after the fact?
- If it is sane, what is the proper approach that does not suffer from the deficiencies outlined above?
#!/usr/bin/python3
import pyparsing
from pyparsing import *
def expand1(tokens):
    return [{'type': token.type, 'name': name}
            for token in tokens
            for name in token.names]
def expand(tokens):
    items = []
    for name in tokens[0].names:
        item = ParseResults([tokens[0].type, name], 'member')
        item['type'] = tokens[0].type
        item['name'] = pyparsing._ParseResultsWithOffset(name, 1)
        items.append(item)
    return ParseResults(items)
word = Word(alphas)
members = Group(word('type')
+ Group(delimitedList(word))('names')
+ Suppress(';'))('members').setParseAction(expand)
structKeyword = Suppress(Keyword('struct'))
struct = Group(
structKeyword + word('name') + Suppress('{')
+ Group(ZeroOrMore(members))('members') + Suppress('}')
+ Suppress(';'))('struct')
testString = '''
struct Foo {
    int x, y;
    float z;
};
'''
result = struct.parseString(testString, parseAll=True)
print(result.asXML('root'))
Are you absolutely tied to using asXML() to list out the contents of your parsed data? I think I am going to deprecate this method, as it really is much less reliable than using dump() (having to match up results values with results names after-the-fact).
Using dump() with your original code gives this:
[['Foo', [['int', 'x'], ['int', 'y'], ['float', 'z']]]]
- struct: ['Foo', [['int', 'x'], ['int', 'y'], ['float', 'z']]]
- members: [['int', 'x'], ['int', 'y'], ['float', 'z']]
- members: [['float', 'z']]
[0]:
['float', 'z']
- member: ['float']
- name: z
- type: float
- name: Foo
We can see all the expanded type-name pairs in the list of members, but they aren't in the named sub list (only 'z' is there, the last matching member). This usually indicates that multiple expressions are being matched with the same name, and only the last one is being kept. When using the old .setResultsName() form, this would be remedied using listAllMatches=True. With the new callable short form, you can fix by appending a '*' to the name (which I will change to 'member' from 'members'):
members = Group(word('type')
+ Group(delimitedList(word))('names')
+ Suppress(';'))('member*').setParseAction(expand)
This now gets us closer:
[['Foo', [['int', 'x'], ['int', 'y'], ['float', 'z']]]]
- struct: ['Foo', [['int', 'x'], ['int', 'y'], ['float', 'z']]]
- members: [['int', 'x'], ['int', 'y'], ['float', 'z']]
- member: [[['int', 'x'], ['int', 'y']], [['float', 'z']]]
[0]:
[['int', 'x'], ['int', 'y']]
[0]:
['int', 'x']
- member: ['int']
- name: x
- type: int
[1]:
['int', 'y']
- member: ['int']
- name: y
- type: int
[1]:
[['float', 'z']]
[0]:
['float', 'z']
- member: ['float']
- name: z
- type: float
- name: Foo
But now the expanded 'int x,y' to [['int', 'x'], ['int', 'y']] is buried within the 0'th element of members, instead of being the first 2 of a 3-element members list. At this point, it seems that the solution is to attach the expand() parse action not to the individual member expression, but to the collective members expression:
def expand(tokens):
    ret = []
    for token in tokens:
        for name in token.names:
            mem_pr = ParseResults([token.type, name])
            mem_pr['type'] = token.type
            mem_pr['name'] = name
            ret.append(mem_pr)
    return ParseResults(ret)
word = Word(alphas)
members = Group(word('type')
+ Group(delimitedList(word))('names')
+ Suppress(';'))
structKeyword = Suppress(Keyword('struct'))
struct = Group(
structKeyword + word('name') + Suppress('{')
+ Group(ZeroOrMore(members).setParseAction(expand))('members') + Suppress('}')
+ Suppress(';'))('struct')
Now parsing your test string and printing out the results using dump() gives:
[['Foo', [['int', 'x'], ['int', 'y'], ['float', 'z']]]]
- struct: ['Foo', [['int', 'x'], ['int', 'y'], ['float', 'z']]]
- members: [['int', 'x'], ['int', 'y'], ['float', 'z']]
[0]:
['int', 'x']
- name: x
- type: int
[1]:
['int', 'y']
- name: y
- type: int
[2]:
['float', 'z']
- name: z
- type: float
- name: Foo
Which looks closer to your desired expanded struct. If you absolutely need XML output from this, then I would write a custom XML serializer for this structure, which will be much more reliable in picking out names, members, member types and member names than the guessing game that asXML() uses.
-- Paul
Here is an après-parse converter for these ParseResults to XML:
import xml.etree.ElementTree as ET
def to_struct_XML(pr):
    ret = ET.Element(pr.struct.name)
    members = ET.Element('members')
    for member in pr.struct.members:
        member_element = ET.Element('member')
        type_element = ET.Element('type')
        type_element.text = member.type
        name_element = ET.Element('name')
        name_element.text = member.name
        member_element.append(type_element)
        member_element.append(name_element)
        members.append(member_element)
    ret.append(members)
    return ret
import io
out = io.BytesIO()
ET.ElementTree(to_struct_XML(result)).write(out)
xml = out.getvalue().decode('UTF-8')
xml = xml.replace('><', '>\n<')
print(xml)
Gives:
<Foo>
  <members>
    <member>
      <type>int</type>
      <name>x</name>
    </member>
    <member>
      <type>int</type>
      <name>y</name>
    </member>
    <member>
      <type>float</type>
      <name>z</name>
    </member>
  </members>
</Foo>
Thank you for the reply.
I am not particularly attached to asXML, I just found its output easier to read initially. Something in its element name assignment did in fact strike me as strange, but I didn't expect it to be downright misleading.
Now that I don't have to assign results names in a way that makes asXML output pretty, I find that it is sufficient to drop the results name on members (which I didn't like anyway) and do away with the _ParseResultsWithOffset invocation. Moving the action up to ZeroOrMore(members) seems to be unnecessary, which is just as well because in the actual grammar I'm implementing, members can be interleaved with other rules.
Iteration over tokens does look nicer than accessing tokens[0], although it does not affect operation if I keep the action attached to members.
Here is the final implementation I ended up with:
#!/usr/bin/python3
from pyparsing import *
def expand(tokens):
    items = []
    for token in tokens:
        for name in token.names:
            item = ParseResults([token.type, name])
            item['type'] = token.type
            item['name'] = name
            items.append(item)
    return ParseResults(items)
word = Word(alphas)
members = Group(word('type')
+ Group(delimitedList(word))('names')
+ Suppress(';')).setParseAction(expand)
structKeyword = Suppress(Keyword('struct'))
struct = Group(
structKeyword + word('name') + Suppress('{')
+ Group(ZeroOrMore(members))('members') + Suppress('}')
+ Suppress(';'))('struct')
testString = '''
struct Foo {
    int x, y;
    float z;
};
'''
result = struct.parseString(testString, parseAll=True)
print(result.dump())
Here are some style suggestions for your parser - use any or none as you prefer:
- changed a few names to be a little more explicit
- some idioms to simplify creating keyword expressions, instead of line after line of 'KEYWORD = Keyword('keyword')'
- changed 'word' to 'ident', and expanded to support more common identifier form
- add support for nested struct definition
- ignore comments
#!/usr/bin/python3
from pyparsing import *
def expand_multiple_member_names(tokens):
    items = []
    for token in tokens:
        for name in token.names:
            item = ParseResults([token.type, name])
            item['type'] = token.type
            item['name'] = name
            items.append(item)
    return ParseResults(items)
# suppressable punctuation
SEMI,LBRACE,RBRACE = map(Suppress, ';{}')
# define keyword expressions like STRUCT=Keyword('struct'), etc.
keywords = 'union,struct,typedef'.split(',')
for kw in keywords:
    globals()[kw.upper()] = Keyword(kw)
    # or use exec if you prefer:
    # exec("{} = Keyword('{}')".format(kw.upper(), kw))
# generic identifier - if using latest pyparsing version, use
# ident = pyparsing_common.identifier
ident = Word(alphas+'_', alphanums+'_')
# define a Forward for types, since they can be recursive
type_decl = Forward()('type')
struct_members_decl = Group(type_decl
+ Group(delimitedList(ident))('names')
+ SEMI).setParseAction(expand_multiple_member_names)
struct_type = Group(
STRUCT + Optional(ident, '<none>')('name') + LBRACE
+ Group(ZeroOrMore(struct_members_decl))('members') + RBRACE
)
# expand as necessary to include '*'s, '&'s, etc.
type_decl <<= Group(struct_type('struct')) | ident
struct_decl = struct_type('struct') + SEMI
# skip over comments, wherever they occur - only need to
# make this call once, at the topmost level, will propagate down
# to all embedded expressions
struct_decl.ignore(cppStyleComment)
testString = '''
struct Foo {
    struct {
        float a,b,c;
    } values;
    int x, y;
    float z;
    // char* s;
};
'''
result = struct_decl.parseString(testString, parseAll=True)
print(result.dump())
Gives:
[['struct', 'Foo', [[[['struct', '<none>', [['float', 'a'], ['float', 'b'], etc. ...
- struct: ['struct', 'Foo', [[[['struct', '<none>', [['float', 'a'], etc. ...
- members: [[[['struct', '<none>', [['float', 'a'], ['float', 'b'], etc. ...
[0]:
[[['struct', '<none>', [['float', 'a'], ['float', 'b'], ['float', 'c']]]], 'values']
- name: values
- type: [['struct', '<none>', [['float', 'a'], ['float', 'b'], ['float', 'c']]]]
- struct: ['struct', '<none>', [['float', 'a'], ['float', 'b'], ['float', 'c']]]
- members: [['float', 'a'], ['float', 'b'], ['float', 'c']]
[0]:
['float', 'a']
- name: a
- type: float
[1]:
['float', 'b']
- name: b
- type: float
[2]:
['float', 'c']
- name: c
- type: float
- name: <none>
[1]:
['int', 'x']
- name: x
- type: int
[2]:
['int', 'y']
- name: y
- type: int
[3]:
['float', 'z']
- name: z
- type: float
- name: Foo
Thanks but I'm not actually parsing C. I just chose a small part of it as a simplified view on the problem I was facing.
The actual project will have next to no keywords, a line- and indentation-based grammar with shell-style line-end comments, and the thing for which I substituted word is in fact an Or of a double-quoted single-line string, a triple-double-quoted multiline string, and an unquoted non-empty sequence of non-whitespace Unicode characters except for #, ", and ,. I did not include any of that in the example because it wasn't relevant to the question.
Hi, I want to write a script which will take some structured data (a csv file) as input, read every line from it, and generate the minimum number of regexes as output that will match the whole input set. Can I kindly know if this can be done using PyParsing?
What you are asking is not really in pyparsing's domain. This link might give you some better leads: . Or google for 'generate regex from examples'.
thank you for your response :)
Hi All, I am working on parsing a text file that has a bunch of commands (CLIs as they're fondly called in some circles) in BNF format. I need to parse each elements in a set of groups into a list and process them and convert them to XML with appropriate tags (as per the DITA architecture). Here are some examples of commands:
a b c { de | ed | fg }
a b c [ de | ed | fg ]
a b c { de <variable> | ed | fg }
The a, b, c, de, ed, fg are all words and can be alphanumeric. I'm struggling to define the grammar for the {} and [] parts. The arguments must be enclosed in either {} or [] and must be separated by a |. How do we define a grammar like this? I gave this a shot, but I'm pretty sure it's not right.
Group(OneOrMore(Word(alpha))) + OneOf(Literal('{ [')) + Group(OneOrMore(Word(alpha))) ^ Group(OneOrMore(alpha))) + OneOf('|') + OneOf(Literal('} ]'))
Any help would be greatly appreciated.
Regards, Nilesh
The delimitedList helper will make it much easier to define those '|'-delimited lists inside the braces:
from pyparsing import *
LBRACE,RBRACE,LBRACK,RBRACK = map(Suppress,'{}[]')
wd = Word(alphas)
wd_list = delimitedList(wd, delim='|')
brace_expr = Group(LBRACE + wd_list + RBRACE | LBRACK + wd_list + RBRACK)
expr = OneOrMore(wd) + brace_expr
expr.runTests('''
a b c { de | ed | fg }
a b c [ de | ed | fg ]
a b c { de | fg | hi | qr | ed | fg }
''')
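If you also need the <variable> placeholders from the third example, here is one way to extend that (a sketch; the angle-bracket handling is an assumption about your syntax):

```python
from pyparsing import (Word, alphanums, Suppress, Group, Optional,
                       OneOrMore, delimitedList)

LBRACE, RBRACE, LBRACK, RBRACK = map(Suppress, '{}[]')
wd = Word(alphanums)
# <variable> placeholders from the third example; brackets stripped
variable = Suppress('<') + wd + Suppress('>')
arg = Group(wd + Optional(variable('var')))
arg_list = delimitedList(arg, delim='|')
brace_expr = Group(LBRACE + arg_list + RBRACE | LBRACK + arg_list + RBRACK)
expr = OneOrMore(wd) + brace_expr

print(expr.parseString('a b c { de <variable> | ed | fg }').asList())
```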
Thanks a lot! This works like a charm!
This is surely my stupidity but I thought I would mention it. I am trying to parse out '//' (I am aware of dblSlashComment). If I do it this way, things are fine:
[[pat = Literal('')result = pat.parseString(simple_str)print result]]
-- But if i do it this way with a setDebug:
I get 'TypeError: 'str' object is not callable'
I don't get this with other literal patterns, only with the double slash. Tore my hair out for a while till I traced it to the setDebug.
Not sure why my square brackets failed for the first code chunk. ignore them please. Should have read like this:
simple_str = '// #command choice'
pat = Literal('//')
result = pat.parseString(simple_str)
print result
Damn, the double slashes keep getting removed from my code when I post. Arrgh. OK, so the string is, in English: '# FORWARDSLASH FORWARDSLASH #command choice', and pat = Literal('FORWARDSLASH FORWARDSLASH') (what fun it is to deal with forward slashes).
Maybe you can use a pastebin and post the link? But in your first example, you are calling setDebug on the parsed results, not on the expression. ParseResults will accept just about any attribute name as a potential results name, and if the name was not defined in the grammar, will return ''. So your code is trying to retrieve the 'setDebug' name of some parsed result, which gives '', and then you are trying to call that with (), giving the Python error 'str' object is not callable. I think you want 'pat.setDebug()'.
From your follow-up messages, I think you tried to do this:
simple_str = '// #command choice'
pat = Literal('//').setDebug()
result = pat.parseString(simple_str)
print(result)
which prints:
Match '//' at loc 0(1,1)
Matched '//' -> ['//']
['//']
Does that get you any closer?
-- Paul
Thanks for the quick reply. My bad for not reading carefully about 'True' for debug. Thank you. And again great work (I sent you a contribution).
It is much appreciated, thanks!
2016-10-20 12:45:49 - rcrowe123 - How to grab the key:value from parseResults and send to setParseAction
I've inherited some code that uses a lot of PyParsing in it. I have never used PP before and need to make a few changes to the code. I am trying to pull the key:value from my parse string and send them to a function using setParseAction. Using the tokens. I have this parser defined:
sni_entry=Combine(Literal('sni')+Word(alphas)+Literal('vector')+Literal('[32:0]:').suppress(),adjacent=False)+Word(printables)
and when I parse the text I get this:
([(['snifaultvector', '0x00000000'], {}), (['snistuckvector', '0x00100000'], {}), (['sniundvector', '0x00000000'], {}), (['sniovrvector', '0x00000000'], {}), (['sniemptyvector', '0x3fffffff'], {})], {})
I want to send the name and value to the function:
my_function('snifaultvector','0x00000000')
How do i do that using the setParserAction?
I've tried something like this, but it seems that only the value is in the token.
.setParseAction(lambda t: store_key_val(t[0]))
I was expecting t[0] to contain ['snifaultvector', '0x00000000'], but seems it only contains '0x00000000'. This seems like it should be easy, but I just can't get it to work. I could easily process the output of the parsing and pass k:v to the function, but wanted to get it to work using setParseAction. Is it possible?
Change your parse action from a one-line lambda into a full method definition, taking an argument 'tokens', and then insert 'print(tokens.dump())' as the first line and see what kind of parsed results are being passed to your parse action.

Pyparsing creates ParseResults objects out of its tokens, which support list, dict, and attribute semantics. As such, the repr format is somewhat complex, and does not reflect the actual access paths to the parsed content. In this case it looks like you are getting a series of one-entry ParseResults, each with a different key. You should be able to combine these into a single mapping object using 'tokens = sum(tokens)' - then you should be able to get at the contents using dict style (tokens['snifaultvector']) or attribute style (tokens.snifaultvector).

The names you are seeing are for the name-based accesses. If you access this using list semantics, then it will be like you are indexing into a list of unnamed tokens, which is what you have already observed. Please look over some of the new online docs and examples - I've added about 1000 lines of inline documentation, hopefully to help clarify some of these points. If not, please post back, so I can expand on these important elements. -- Paul
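A small sketch of the dump() diagnostic and the two access styles (the grammar here is a simplified stand-in for your sni_entry):

```python
from pyparsing import Word, alphas, hexnums, Group

entry = Group(Word(alphas)('name') + Word(hexnums + 'x')('value'))
result = entry.parseString('snifaultvector 0x00000000')

print(result.dump())           # shows list contents plus named fields
print(result[0]['name'])       # dict-style access
print(result[0].value)         # attribute-style access
```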
Thanks for the reply. My biggest problem was using the lambda function instead of just sending the full tokens to the function. And with my lack of knowledge of Pyparsing, I just followed what the previous developer was doing. Anyway, I got it to work by grouping the combined results, and that gave me both the name and value in the token.
Ah, I see - it would not be difficult to convert this named data, but it sounds like it is doing what you need. But feel free to write back if you want to make other enhancements to this parser.
Hello.
I was trying to parse the following text but with no success. text = '''
{
Hello, /* This is a comment */
My_name, /* This is also a comment */
{
'Foo123', /* Another Comment */
Bar559, /* Comment */
},
'Foo556', /* Another Comment123 */
Bar459, /* Comment112 */
},
}
} ''' My code is as follows:
LBRACE,RBRACE,COMMA = map(Suppress,'{},')
name = Word(alphanums + "._'")
value = Forward()
entry = name + COMMA + cppStyleComment
struct = LBRACE + value + RBRACE + COMMA
value << ( ZeroOrMore(entry) | struct )
result = OneOrMore(struct).parseString(sample)
pprint(result.asList())
Any help will be appreciated.
I think you just need to change value's definition to value << ZeroOrMore(entry | struct)
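Put together as a runnable sketch (with the sample trimmed to a brace-balanced subset):

```python
from pyparsing import (Suppress, Word, alphanums, Forward, ZeroOrMore,
                       OneOrMore, cppStyleComment)

sample = '''
{
    Hello, /* This is a comment */
    My_name, /* This is also a comment */
    {
        'Foo123', /* Another Comment */
        Bar559, /* Comment */
    },
},
'''

LBRACE, RBRACE, COMMA = map(Suppress, '{},')
name = Word(alphanums + "._'")
value = Forward()
entry = name + COMMA + cppStyleComment
struct = LBRACE + value + RBRACE + COMMA
value <<= ZeroOrMore(entry | struct)
result = OneOrMore(struct).parseString(sample)
print(result.asList())
```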
Thanks for such a powerful library!
Prototyping a system that does the below and just wanted a check since I am still having some difficulty seeing if there is a better way to do this in pyparsing.
- Take the string '(Metric1 + Metric2) * 2'
- Metric1 and Metric2 represent UIDs that need to be retrieved from a datastore. In this case this is a time series, so we will need to evaluate this expression for each point in time required.
So far I have reviewed the fourFn.py example and it makes a lot of sense. I have also created some examples where I create a grammar for what a metric is, and it is returned by parseString as a MetricClass. My current approach then was to retrieve all metric class types via a loop, but I am curious if there is a better approach?
In general, I usually steer people to the more current pyparsing mechanism for parsing arithmetic notations, 'infixNotation'. The SimpleArith.py and SimpleBool.py examples show how this is used, and you can still attach parse actions or classes to individual levels, and to operands, to support evaluation of the parsed expression.
You might see references to infixNotation under its former (deprecated) name of operatorPrecedence. Here is a link to the online doc:
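A sketch of that approach with infixNotation, using a hypothetical in-memory metric lookup in place of the datastore (a single point in time; the Metric class and evaluator are illustrative, not part of pyparsing):

```python
import operator
from pyparsing import infixNotation, opAssoc, Word, alphas, alphanums, nums

# hypothetical stand-in for the datastore lookup of metric UIDs
METRICS = {'Metric1': 10.0, 'Metric2': 5.0}

class Metric:
    def __init__(self, tokens):
        self.name = tokens[0]
    def value(self):
        return METRICS[self.name]

metric = Word(alphas, alphanums).setParseAction(Metric)
number = Word(nums).setParseAction(lambda t: float(t[0]))
operand = metric | number

expr = infixNotation(operand, [
    (Word('*/', exact=1), 2, opAssoc.LEFT),
    (Word('+-', exact=1), 2, opAssoc.LEFT),
])

OPS = {'+': operator.add, '-': operator.sub,
       '*': operator.mul, '/': operator.truediv}

def evaluate(node):
    if isinstance(node, Metric):
        return node.value()
    if isinstance(node, float):
        return node
    # a grouped infix level: operand (op operand)+
    result = evaluate(node[0])
    for op, rhs in zip(node[1::2], node[2::2]):
        result = OPS[op](result, evaluate(rhs))
    return result

parsed = expr.parseString('(Metric1 + Metric2) * 2')[0]
print(evaluate(parsed))  # (10.0 + 5.0) * 2 -> 30.0
```

For a time series, you would call evaluate once per timestamp, with the Metric lookup returning the value at that point in time.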
I'm trying to parse a simple JSON-like structure into Python dicts and then turn it into a proper JSON structure. The block is as follows:
###################################################
# HEADER TEXT
# HEADER TEXT
###################################################
NAME => {
NAME => VALUE,
NAME => VALUE,
NAME => VALUE,
NAME => {
NAME => {
NAME => VALUE, NAME => VALUE, NAME => VALUE,
},
} # comment
}, # more comments
and repeating. Rules:
NAME = alphanums and _
VALUE = decimal(6) | hex (0xA) | list of hex ([0x1,0x2]) | text in brackets([A]) | string('A')
I set up the following grammar:
comment = Literal('#') + restOfLine
header = OneOrMore('#').suppress()
equals = Literal('=>').suppress()
lbrace = Literal('{').suppress()
rbrace = Literal('}').suppress()
comma = Literal(',').suppress()
name = Word(alphas+'_', alphanums+'_')
dec_number = Word(nums)
hex_number = Word(nums+'x')
block = Literal('[') + Word(alphas+'_', alphanums+'_') + Literal(']')
listing = Literal('[') + OneOrMore(hex_number) + Literal(']')
value = Forward()
entry = Group(name + equals + value)
entry.ignore(comment)
struct = Group(lbrace + ZeroOrMore(entry) + rbrace)
value << (quotedString | struct | hex_number | dec_number | block)
result = OneOrMore(entry).parseString(test_string)
pprint(result.asList())
but I'm getting pyparsing.ParseException: Expected {quotedString using single or double quotes
when trying to run. I'm new to pyparsing so still trying to figure it out.
OK, I've managed to improve the grammar:
cfgName = Word(alphanums+'_')
cfgString = dblQuotedString().setParseAction(removeQuotes)
cfgNumber = Word('0123456789ABCDEFx')
LBRACK, RBRACK, LBRACE, RBRACE = map(Suppress, '[]{}')
EQUAL = Literal('=>').suppress()
cfgObject = Forward()
cfgValue = Forward()
cfgElements = delimitedList(cfgValue)
cfgArray = Group(LBRACK + Optional(cfgElements, []) + RBRACK)
cfgValue << (cfgString | cfgNumber | cfgArray | cfgName | Group(cfgObject))
memberDef = Group(cfgName + EQUAL + cfgValue)
cfgMembers = delimitedList(memberDef)
cfgObject << Dict(LBRACE + Optional(cfgMembers) + RBRACE)
cfgComment = pythonStyleComment
cfgObject.ignore(cfgComment)
The problem is proper JSON is
{member,member,member}
however my structure is:
{member,member,member,}
i.e. the last member in every nested structure is followed by a trailing comma, and I don't know how to account for that in the grammar.
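One possible way to tolerate the trailing comma (a sketch of my own, not the StackOverflow answer) is to follow each delimitedList with an optional suppressed comma; delimitedList backtracks over a comma that is not followed by another item, leaving it for the Optional to consume:

```python
# Sketch: same grammar shape as above, with an optional trailing
# comma allowed after each delimitedList of members/elements.
from pyparsing import (Word, alphanums, Suppress, Forward, Group, Optional,
                       delimitedList, Dict, quotedString, removeQuotes,
                       pythonStyleComment)

cfgName = Word(alphanums + '_')
cfgString = quotedString.setParseAction(removeQuotes)
cfgNumber = Word('0123456789ABCDEFx')
LBRACK, RBRACK, LBRACE, RBRACE = map(Suppress, '[]{}')
EQUAL = Suppress('=>')
COMMA = Suppress(',')

cfgObject = Forward()
cfgValue = Forward()
cfgElements = delimitedList(cfgValue) + Optional(COMMA)   # trailing comma OK
cfgArray = Group(LBRACK + Optional(cfgElements) + RBRACK)
cfgValue <<= cfgString | cfgNumber | cfgArray | cfgName | Group(cfgObject)
memberDef = Group(cfgName + EQUAL + cfgValue)
cfgMembers = delimitedList(memberDef) + Optional(COMMA)   # trailing comma OK
cfgObject <<= Dict(LBRACE + Optional(cfgMembers) + RBRACE)
cfgObject.ignore(pythonStyleComment)

result = cfgObject.parseString("{ A => 1, B => { C => 0x2, }, }")
print(result['A'], result['B']['C'])
```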
This was reposted and answered on StackOverflow:
Hello all, I'm starting to use pyparsing for a DSL I'm creating, and am stuck at the simplest beginner code:
w = delimitedList(Word(alphas), delim=' ')
w.parseString('set', parseAll=True) # Ok...
w.parseString('set endianness', parseAll=True) # ERROR! Expected end of text (at char 3), (line:1, col:4)
Can anyone give me a hint in the general direction for what I'm missing? Thanks!
One thing that is important about pyparsing is that, unlike regex, you focus on the parts of the input that aren't whitespace. So to use pyparsing to parse a list of words delimited by spaces isn't really consistent with pyparsing philosophy. Instead, you would use 'OneOrMore(Word(alphas))' - pyparsing skips over whitespace implicitly. Please look over the examples on the Examples wiki page, to see various approaches to parsing lists, command syntax, etc. Welcome to pyparsing! -- Paul
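The suggestion above can be sketched in two lines:

```python
# OneOrMore(Word(alphas)) matches a run of words, relying on
# pyparsing's implicit whitespace skipping between expressions.
from pyparsing import OneOrMore, Word, alphas

w = OneOrMore(Word(alphas))
print(w.parseString('set endianness', parseAll=True).asList())  # -> ['set', 'endianness']
```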
Thanks, that was awesome! Now I have a more complicated parser, but for the life of me I can't figure out why it won't accept more than one input! I'm using OneOrMore to accept multiple 'proto' definitions in one proto file, but it gives me a parse error: pyparsing.ParseException: Expected end of text.
Parser definition:
attribute = pp.Group(pp.Suppress('[') + pp.OneOrMore(pp.Word(pp.alphas)) + pp.Suppress(']'))
member = pp.Group(pp.Word(pp.printables) + pp.Word(pp.printables.replace(';', '')) + pp.Suppress(';'))
msgHeader = pp.ZeroOrMore(attribute)
msgName = pp.Word(pp.printables)
msgContent = pp.Suppress('{') + pp.OneOrMore(member) + pp.Suppress('}')
msgWithContent = pp.Keyword('message').suppress() + msgName + msgContent
defaultMsg = pp.Keyword('default').suppress() + pp.Suppress(';')
protoStatement = msgHeader + pp.MatchFirst([msgWithContent, defaultMsg])
protoFileParser = pp.ZeroOrMore(protoStatement)
protoFileParser.parseString(fileContents, parseAll=True)
Test file:
[endianness network]
[namespace protoOne]
message JumpPacket {
uint64 PlayerID;
}
[identifier uint64 19023789]
[endianness network]
[namespace protoOne]
message MovementPacket {
uint16 X;
uint16 Y;
uint64 PlayerID;
string Msg;
}
[identifier uint64 19023789]
The line '[identifier uint64 19023789]' in your test file does not match your expression for attribute: Word(pp.alphas) accepts only letters, and '19023789' contains digits.
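One possible fix (a sketch based on my reading of the test file, not a confirmed answer from the thread) is to widen the attribute words from alphas to alphanums so numeric tokens can match:

```python
# Sketch: attribute words widened from alphas to alphanums so that
# numeric tokens inside '[...]' (e.g. '19023789') are accepted.
import pyparsing as pp

attribute = pp.Group(pp.Suppress('[') + pp.OneOrMore(pp.Word(pp.alphanums)) + pp.Suppress(']'))
print(attribute.parseString('[identifier uint64 19023789]').asList())
```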
Hello all, I'm trying to understand how to use Forward with this simple example:
variable = Word(alphanums)
p1 = Word('(')
p2 = Word(')')
v = Word(',')
name = Word(alphanums)
funct = Forward()
arg = funct | variable
funct << (name + p1 + arg + ZeroOrMore(v + arg) + p2)
But this code doesn't work when I try to parse strings like 'a(a(a))'. Can you help me? Thank you!
I've found the error! I should use
p1=Literal('(')
p2=Literal(')')
v=Literal(',')
Glad you found your error.
Pyparsing includes many helper methods. A common pattern is to parse comma-delimited lists using 'expr + ZeroOrMore(',' + expr)'. pyparsing has a method that allows you to replace this with 'delimitedList(expr)'. The delimiters themselves are removed from the parsed tokens, and delimiters other than ',' can be specified using 'delimitedList(expr, delim='|')' (if using '|' as the delimiter, for instance). If items are just delimited by whitespace, then you can simply use 'OneOrMore(expr)', since pyparsing implicitly skips over whitespace.
Welcome to pyparsing!
-- Paul
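Putting the two points together, a sketch (my restatement, not code from the thread) of the recursive function-call grammar above, using Suppress for the punctuation and delimitedList for the comma-separated arguments:

```python
# Sketch: recursive function-call grammar with suppressed punctuation
# and delimitedList handling the comma-separated argument list.
from pyparsing import Word, alphanums, Suppress, Forward, delimitedList

name = Word(alphanums)
LPAR, RPAR = Suppress('('), Suppress(')')

funct = Forward()
arg = funct | name
funct <<= name + LPAR + delimitedList(arg) + RPAR

print(funct.parseString('a(a(a))', parseAll=True).asList())  # -> ['a', 'a', 'a']
```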