[Note: these entries are fairly old, and predate many new features of pyparsing, and are predominantly coded using Python 2. They are captured here for historical benefit, but may not contain the most current practices or features. We will try to add editor notes to entries to indicate when discussions have been overtaken by development events.]
2012-01-07 06:08:59 - DiaaFayed - promote nested elements ...
2012-01-09 05:40:45 - DiaaFayed - rearrange parsed tree
2012-01-11 08:39:18 - DiaaFayed - a letter play two roles
2012-01-11 08:39:54 - DiaaFayed - how make the parser does not stop
2012-01-11 08:43:19 - DiaaFayed - add logger class
2012-01-14 08:16:21 - DiaaFayed - parsing a list of strings
2012-01-14 08:29:02 - DiaaFayed - parseResultsSumExample.py
2012-01-16 23:11:45 - 0xLeFF - Parsing nested c/c++ blocks
2012-01-19 08:18:31 - masura-san - How to tag parsed elements
2012-01-20 08:44:42 - DiaaFayed - file contains lines for parsed.
2012-01-29 11:47:37 - Phxsawdust - Creating manual ParseResults
2012-01-31 18:24:11 - oafilipoai - Catch-all pattern
2012-02-04 15:55:57 - lamakaha - error with setParseAction
2012-02-05 14:44:27 - karulis - bug + patch for ParseResults.dir in python3
2012-02-07 16:34:33 - oafilipoai - Finding the end location for a matched expression
2012-02-14 03:49:57 - ror6ax - Trying to parse a file.
2012-02-17 10:10:07 - DiaaFayed - comma separates outside paranthesis
2012-02-18 19:57:34 - lamakaha - how to ignore blank lines in line oriented parser?
2012-02-21 05:54:17 - ror6ax - parsing tables
2012-02-25 13:19:21 - johnmudd - long output, is ths right?
2012-02-29 00:17:37 - lesnar56 - Extending Keyword Classes
2012-03-06 06:04:21 - rrian - Unexpected results with name
2012-03-12 12:46:53 - tarruda - Need help in parsing part of python grammar
2012-03-14 12:00:22 - keirian - Recursion Help
2012-03-21 11:58:03 - maxime-esa - ambiguous grammar not detected?
2012-03-25 00:33:47 - nimbiotics - problems with delimitedList
2012-03-29 14:30:05 - nimbiotics - How can I group this?
2012-04-04 14:58:25 - HumbertMason - Parsing a list of structures line by line
2012-04-06 10:11:25 - pepinocho9 - Help with parseactions and Morgan's Law
2012-04-16 10:33:55 - takluyver - Skip optional part if following part matches
2012-04-27 12:16:42 - larapsodia - Question about "Or" statement
2012-04-28 06:48:11 - charles_w - working to understand pyparsing, setResultsName, and setParseAction
2012-05-01 01:14:04 - robintw - Labelling of results when using Each
2012-05-08 11:32:30 - side78 - Parsing nested blocks without any deterministic end
2012-05-09 18:23:50 - Caffeinix - C++ qualified types
2012-05-21 12:08:56 - dGRp - Building AST for n-op abstract algebras
2012-05-23 04:27:27 - Madan2 - TypeError: unsupported operand type(s) for ^: 'NoneType' and 'NoneType'
2012-05-25 05:33:30 - dGRp - Some thoughts and questions on improvement
2012-05-26 14:51:08 - BrenBarn - get original text from ParseResults
2012-06-07 09:56:55 - Madan2 - combine - space bn tokens
2012-06-10 06:12:29 - bsr203 - Rules for Repeating sections of data
2012-06-10 14:23:11 - ofuhrer - Replace only locally
2012-06-11 02:51:37 - willem179 - ParseResults inconsistent after setitem
2012-06-25 11:31:52 - Madan2 - Dealing with "" in data
2012-06-26 16:47:14 - chlim - parsing identical strings and multi-lines
2012-07-02 07:16:21 - DiaaFayed - dynamic extractor statement
2012-07-02 12:52:58 - DiaaFayed - a new feature of the Python eval and exec commands
2012-07-03 13:30:03 - BrenBarn - Copying ParseResults attributes
2012-07-04 08:25:29 - einar77 - Parsing multi-line records
2012-07-11 17:36:06 - chlim - svn syntax
2012-07-16 08:45:49 - DiaaFayed - how can we restore setParseAction results ?
2012-07-18 07:37:10 - paulelastic - Parsing Expression Grammar (PEG)
2012-07-18 07:46:50 - paulelastic - Visual debugger for pyparsing
2012-08-09 07:35:47 - Leevi3 - multiple parse actions
2012-08-17 14:36:15 - script_lover - operatorPrecedence generate rules that cannot be validated
2012-08-17 14:48:13 - script_lover - Avoid duplicating rules
2012-08-22 05:56:40 - Leevi3 - nesting depth of operator precedence parse results
2012-08-25 05:11:18 - simbera.jan - Getting a ParseResults line number
2012-09-10 02:50:10 - acjackson5 - Help with datetime conversion
2012-09-10 05:09:15 - darkest_star - Parse a logfile and detect repetitve textblocks
2012-09-12 15:11:17 - sravet - need help with verilog parser
2012-09-30 23:19:48 - cqqhzxgh - match multiple lines
2012-10-01 23:38:09 - cqqhzxgh - question with scanString
2012-10-05 14:19:50 - dlwatey - Beginner Question
2012-10-07 21:20:38 - RunSilent - parseString works, scanString fails?
2012-10-08 08:22:01 - DiaaFayed - How can we transform pyparsing string to pyparsing expression
2012-10-15 15:14:28 - jsy1972 - question re: indentation and grouped stmts
2012-10-17 13:30:37 - DiaaFayed - plz give explain and examples ...
2012-10-18 13:50:33 - chlim - Parsing single and multiple records
2012-10-22 07:30:51 - DiaaFayed - can we simulate the caret and dollar sign functions .?
2012-10-22 12:25:04 - DiaaFayed - simulate caret and dollar sign in regular expression
2012-10-22 12:34:59 - tvn1981 - Getting line number where error occurs
2012-10-25 00:26:43 - kmbt - Match only at selected lines
2012-10-25 00:50:16 - ranjith19 - How to get a perticular behaviour with a function parser?
2012-10-26 04:57:27 - pypetey - buildout: Couldn't find a setup script
2012-10-26 11:46:25 - dlwatey - Handling special Characters
2012-10-29 09:04:41 - dlwatey - Getting closer and clearer
2012-11-06 13:39:41 - tvn1981 - Very slow parsing a simple First order logic form
2012-11-13 00:24:07 - DiaaFayed - expressions matcher module
2012-11-13 01:56:45 - DiaaFayed - the space and paranthesis
2012-11-15 05:55:16 - DiaaFayed - Questions about scaneExamples.py
2012-11-15 06:03:12 - DiaaFayed - Question about Copy()
2012-11-22 09:15:57 - cadourian - How to improve parser performance
2012-11-26 10:12:33 - DiaaFayed - Design Pattern: Chain of Responsibility
2012-11-26 17:38:27 - rogersanchez75 - Arithmetic evaluation with variables
2012-11-28 20:37:48 - rogersanchez75 - Eval functions in arith expressions
2012-11-29 21:07:56 - torfat - parsing C function calls
2012-12-05 14:38:02 - Demolishun - New to pyparser and impressed by capabilities
2012-12-05 21:14:00 - Demolishun - Working on Literal Identification
2012-12-10 15:58:57 - rogersanchez75 - Further DSL and function parsing development
2012-12-12 00:30:30 - Demolishun - Trouble with moving beyond basic pattern matching.
2012-12-22 20:41:51 - rogersanchez75 - Control flow
2012-12-24 11:59:11 - catdude - Clarification regarding building a search sting
plz sir
when capturing the structure of a match as an XML tree,
how can we promote deeply nested elements to higher levels?
thanks
if I have a parse tree like
item
--------item1
--------item2
----------------item2.1
----------------item2.2
------------------------item2.2.1
------------------------item2.2.2
----------------item2.3
--------item3
--------item4
and I need to reorganize the tree into this shape
item
--------item1
--------item2
----------------item2.1
----------------item2.2
--------item2.2.1
--------item2.2.2
----------------item2.3
--------item3
--------item4
the purpose is to put the information into a relational table with columns
<item1, item2, item2.2.1, item2.2.2, item3, item4>
could I have the second tree while parsing?
In one Arabic string (Unicode or UTF-8), the author sometimes uses the ASCII dash character as a separator, and sometimes uses the tatweel character as a separator. The problem is that the tatweel letter is also one of the alphas that constitute a word, so parsing raises errors. The fact that tatweel plays two roles is the cause of the error.
Is the tatweel really a valid alpha? Or is it included with alphas because alphas is locale-sensitive? Can you define your own subset of alphas that omits tatweel?
Let's say 'x' was like tatweel. Here is how to define a word of alphas with 'x' as separators:
aword = Word(alphas, excludeChars='x')
expr = delimitedList(aword, delim='x')
print expr.parseString('sldkjfzxlskaopweiurxlkszaxlsdf')
prints:
['sldkjfz', 'lskaopweiur', 'lksza', 'lsdf']
thanks for all replies, but plz see the example: assume the dash '-' is used as a separator and can also be a letter of a word
dash = '-'
word = Word(alphas + dash)
sentence1 = 'diaa fayed- - engineer'
sentence = OneOrMore(word) + Suppress('-') + OneOrMore(word)
print sentence.parseString(sentence1)
the error
print sentence.parseString(sentence1)
File 'C:\Python26\lib\site-packages\pyparsing.py', line 1032, in parseString
raise exc
ParseException: Expected '-' (at char 22), (line:1, col:23)
I need the result to be
['diaa fayed-', 'engineer']
Thank you for providing an example that displays adequately in ASCII characters. I'll refer to '-' in your example as a stand-in for tatweel.
What distinguishes the lone '-' as a separator, instead of being a single-character word? Can words start with a '-'? If '-' is a separator, does it have to have whitespace on either side?
Try this:
dash = Keyword('-')
word = Word(alphas+'-')
sentence = delimitedList(OneOrMore(~dash+word), dash)
This assumes that '-' is a separator if and only if it passes the test of being a standalone keyword. The OneOrMore does a lookahead to ensure that it does not accidentally read the lone '-' as a word.
In cases like this where there is some ambiguity, you must ask yourself questions as if you were playing the part of the parser. How can you tell the difference between a tatweel that is a word character and a tatweel that is a separator? If some sort of lookahead is required, then implement that with ~ or FollowedBy. Be aware that '-' will match Word(alphas+'-'). Also be aware that OneOrMore will match repetitions as long as it can, even if the next expression in the parser would also match - pyparsing does NO implicit lookahead. In this way, it is unlike regular expressions.
-- Paul
thanks very much, plz let me suggest some ideas for your parser:
- adding more documentation and examples for rarely used methods
- the discussions lead to new ideas that you can add to the code regularly
- you and we can suggest new problems for visitors to implement, and then add them to the examples
plz sir
if I parse a list of strings, how do I make the parser not stop if there is an error in one of the strings, but instead log the error and then continue parsing?
I suggest adding a logger class to the pyparsing library. This would facilitate the debugging process; the logger should save exceptions to text files, the screen, etc.
Pyparsing supports debugging on individual expressions, by calling setDebug. If an expression has been set for debugging, then every time the grammar tries to evaluate that expression, the location of the parse is logged, followed by either the successfully parsed tokens or the resulting exception:
from pyparsing import Word, alphas, nums, OneOrMore

s = 'ABC DEF 123 XYZ'
aword = Word(alphas)
integer = Word(nums)
aword.setDebug()
OneOrMore(aword | integer).parseString(s)
prints
Match W:(abcd...) at loc 0(1,1)
Matched W:(abcd...) -> ['ABC']
Match W:(abcd...) at loc 3(1,4)
Matched W:(abcd...) -> ['DEF']
Match W:(abcd...) at loc 7(1,8)
Exception raised:Expected W:(abcd...) (at char 8), (line:1, col:9)
Match W:(abcd...) at loc 11(1,12)
Matched W:(abcd...) -> ['XYZ']
Match W:(abcd...) at loc 15(1,16)
Exception raised:Expected W:(abcd...) (at char 15), (line:1, col:16)
(['ABC', 'DEF', '123', 'XYZ'], {})
when parsing a list of strings,
how do I make pyparsing not stop parsing if one of the strings has an error,
but only print or log the error and resume parsing?
Pyparsing raises ordinary Python exceptions, so if you are parsing one string at a time in a list, just call parseString inside a try-except block:
from pyparsing import Word, alphas, ParseException

strings = [
    'ABC',
    'DEF',
    '123',
    'xyz',
    ]
for s in strings:
    try:
        print Word(alphas).parseString(s)
    except ParseException as pe:
        print s, pe
prints
['ABC']
['DEF']
123 Expected W:(abcd...) (at char 0), (line:1, col:1)
['xyz']
in the example
parseResultsSumExample.py
samplestr1 = 'garbage;DOB 10-10-2010;more garbage\nID PARI12345678;more garbage'
samplestr2 = 'garbage;ID PARI12345678;more garbage\nDOB 10-10-2010;more garbage'
samplestr3 = 'garbage;DOB 10-10-2010'
samplestr4 = 'garbage;ID PARI12345678;more garbage- I am cool'
from pyparsing import *
dob_ref = 'DOB' + Regex(r'\d{2}-\d{2}-\d{4}')('dob')
id_ref = 'ID' + Word(alphanums,exact=12)('id')
info_ref = '-' + restOfLine('info')
person_data = dob_ref | id_ref | info_ref
for test in (samplestr1, samplestr2, samplestr3, samplestr4):
    person = sum(person_data.searchString(test))
    print person.id
    print person.dump()
    print
if we assume one of the strings has an error and raises an
exception, how do we make pyparsing continue to parse the remaining strings?
Again, this is just standard Python exception handling - see my answer to the other message you posted.
hello... I need to parse nested c/c++ like blocks of code like this:
{
int A1 = 100;
int A2 = 200;
int B1 = 100;
int B2 = 200;
{
int _A1 = 100;
int _A2 = 200;
int _B1 = 100;
int _B2 = 200;
}
}
and I'd like to take this kind of output from the parser:
['int A1 = 100;\nint A2=200;\n', 'int B1 = 100;\nint B2 = 200\n', ['int _A1 = 100;\nint _A2 = 200;\n', 'int _B1 = 100;\nint _B2 = 200;']]
I hope I spelled it right)) could you tell me what is the best way to do it? I tried using nestedExpr, but failed))
PS: there can be any valid c/c++ code in the place of 'int A1 = 100';
I made this example for the sake of simplicity))
when I'm parsing only inner code blocks like this:
Txt = '''
const int A1 = 100;
const int A2 = 200;
const int B1 = 100;
const int B2 = 200;
const int C1 = 100;
const int C2 = 200;
'''
EmptyLine = Suppress(lineEnd + lineEnd)
CodeBlock = ZeroOrMore(SkipTo(EmptyLine) + Optional(EmptyLine))
print(CodeBlock.parseString(Txt))
I get the desired results, but when I'm trying to use nestedExpr I get an infinite loop:
print(nestedExpr('{', '}', CodeBlock).parseString(Txt))
where '{' and '}' were added to Txt variable...
See if these give you any ideas on things to try:
code = '''\
{
int A1 = 100;
int A2 = 200;
int B1 = 100;
int B2 = 200;
{
int _A1 = 100;
int _A2 = 200;
int _B1 = 100;
int _B2 = 200;
}
}
'''
from pyparsing import *
p1 = nestedExpr('{','}')
print p1.parseString(code)
# prints
# [['int', 'A1', '=', '100;', 'int', 'A2', '=', '200;', 'int', 'B1', '=', '100;', 'int', 'B2', '=', '200;', ['int', '_A1', '=', '100;', 'int', '_A2', '=', '200;', 'int', '_B1', '=', '100;', 'int', '_B2', '=', '200;']]]
cStatement = ~oneOf('{ }') + SkipTo(';') + ';'
content = originalTextFor(OneOrMore(cStatement))
p2 = nestedExpr('{','}', content=content)
print p2.parseString(code)
# prints
# [['int A1 = 100;\nint A2 = 200;\n \nint B1 = 100;\nint B2 = 200;', ['int _A1 = 100;\nint _A2 = 200;\n \nint _B1 = 100;\nint _B2 = 200;']]]
cStatement = Forward()
cStatement << (originalTextFor(~oneOf('{ }') + SkipTo(';') + ';') |
nestedExpr('{','}', content=cStatement))
p3 = OneOrMore(cStatement)
print p3.parseString(code)
# prints
# [['int A1 = 100;', 'int A2 = 200;', 'int B1 = 100;', 'int B2 = 200;', ['int _A1 = 100;', 'int _A2 = 200;', 'int _B1 = 100;', 'int _B2 = 200;']]]
ParserElement.setDefaultWhitespaceChars(' \t')
EOL = LineEnd()
cStatement = SkipTo(';', failOn=oneOf('{ }')|EOL) + ';'
content = originalTextFor(OneOrMore(cStatement + EOL)) | (Empty()+EOL).suppress()
p4 = nestedExpr('{','}', content=content)
print p4.parseString(code)
# prints
# [['int A1 = 100;\nint A2 = 200;\n', 'int B1 = 100;\nint B2 = 200;\n', ['int _A1 = 100;\nint _A2 = 200;\n', 'int _B1 = 100;\nint _B2 = 200;\n']]]
-- Paul
Hi,
I've been working on a grammar that parses spec files. My grammar works, but the output I'm getting is not how I expect it to be. The specs can be very complex: they have 3 levels, and on each level it should be possible to have different sorts of items. I also need to parse everything; I need a full in-memory representation of the spec, which means I also need to know the location of empty lines.
I made a small sample to illustrate my problem. It's not very representative, though; it's hard to make my problem clear. I know you can tag elements using the setResultsName method, but sometimes the result is a list of different kinds of items, and then I can only get the attributes of those items.
It's possible to determine the type of an element based on its attributes, but that's not a good solution. So I have a few questions:
Does my explanation make any sense? :) Is there a way to tag items in another way?
Should I solve this problem by defining a parse action for each type of item that adds an attribute 'itemType' to every token of a certain type?
spec = '''\
# lorem ipsum
# lorem ipsum
[version:1.0]
# lorem ipsum
'''
# set spaces and tabs as parser default white space
ParserElement.setDefaultWhitespaceChars(' \t')
lineEnd = emptyLine = Suppress(LineEnd()('emptyLine'))
numberSign = Suppress(Literal('#').setName('number sign (#)'))
leftSquareBracket = Suppress(Literal('[').setName('left square bracket ([)'))
rightSquareBracket = Suppress(Literal(']').setName('right square bracket (])'))
colon = Suppress(Literal(':').setName('colon (:)'))
singleLineComment = Group(numberSign + SkipTo(lineEnd) + lineEnd)
singleLineComment = singleLineComment.setResultsName('comment')
singleLineComment.setName('comment')
versionLiteral = Suppress(Literal('version'))
versionLiteral.setResultsName('version')
versionLiteral.setName('version literal')
singleDigitNumber = Word(nums, exact=1)  # missing from the original post; reconstructed
period = Literal('.')                    # missing from the original post; reconstructed
releaseNumber = Combine(singleDigitNumber + period + singleDigitNumber)
releaseNumber.setResultsName('releaseNumber')
releaseNumber.setName('release number')
version = Group(leftSquareBracket + versionLiteral + colon + releaseNumber + rightSquareBracket + lineEnd)
version = version.setResultsName('version')
version.setName('version')
grammar = OneOrMore(emptyLine | singleLineComment | version)
results = grammar.parseString(spec, parseAll=True)
Instead of tagging the parse results with a type, I suggest using the parse results to construct an object. Here is a sample of creating Shape objects from simple format strings:
class Shape(object):
    def __init__(self, tokens):
        self.__dict__.update(tokens.asDict())
    def area(self):
        raise NotImplementedError()
    def __str__(self):
        return '<%s>: %s' % (self.__class__.__name__, self.__dict__)

class Square(Shape):
    def area(self):
        return self.side**2

class Rectangle(Shape):
    def area(self):
        return self.width * self.height

class Circle(Shape):
    def area(self):
        return 3.14159 * self.radius**2
from pyparsing import *
number = Regex(r'-?\d+(\.\d*)?').setParseAction(lambda t:float(t[0]))
# Shape expressions:
# square : S <centerx> <centery> <side>
# rectangle: R <centerx> <centery> <width> <height>
# circle : C <centerx> <centery> <diameter>
squareDefn = 'S' + number('centerx') + number('centery') + number('side')
rectDefn = 'R' + number('centerx') + number('centery') + number('width') + number('height')
circleDefn = 'C' + number('centerx') + number('centery') + number('diameter')
squareDefn.setParseAction(Square)
rectDefn.setParseAction(Rectangle)
def computeRadius(tokens):
    tokens['radius'] = tokens.diameter/2.0
circleDefn.setParseAction(computeRadius, Circle)
shapeExpr = squareDefn | rectDefn | circleDefn
tests = '''\
C 0 0 100
R 10 10 20 50
S -1 5 10'''.splitlines()
for t in tests:
    shape = shapeExpr.parseString(t)[0]
    print shape
    print 'Area:', shape.area()
    print
prints
<Circle>: {'diameter': 100.0, 'radius': 50.0, 'centerx': 0.0, 'centery': 0.0}
Area: 7853.975
<Rectangle>: {'width': 20.0, 'height': 50.0, 'centerx': 10.0, 'centery': 10.0}
Area: 1000.0
<Square>: {'side': 10.0, 'centerx': -1.0, 'centery': 5.0}
Area: 100.0
You can see another example on the Examples page, SimpleBool.py.
-- Paul
I have a file that contains lines, and I want to parse each line according to a grammar, and also write each parsed line to a file accompanied by its line number. This way I want to identify the successful lines and the failed lines by writing a logger or output file.
How do I preserve the line number?
Diaa -
Your code is iterating through the file line by line, so pyparsing does not really have visibility to the separate line numbers. But your code does. You can iterate through the file and keep a line number variable yourself, or wrap the file iterator in 'enumerate' and get the line number and line for each line in the file.
Diaa, the questions you are asking are very basic, and I fear that you really need more programming experience before trying to write a pyparsing application.
-- Paul
for python, yes I need more experience.
for this question specifically, I was waiting for you to talk about parseFile and using LineEnd without reading the file line by line. I really can read line by line and then use parseString(), but I wanted to understand parseFile and LineEnd in order to use a callback function that returns col, lineno, and tokens, to write this information to the output file in one shot.
Ah, now I have a clearer picture of what you are asking. I have to head to work now, but I will write up some examples when I get home this evening.
You can use a parse action to add the current line number to individual tokens or whole lines. Or just attach a parse action to LineEnd that returns the line number. See the following code with embedded comments:
text = '''\
Lorem ipsum dolor sit amet, consectetur
adipisicing elit, sed do eiusmod
tempor incididunt ut labore et dolore
magna aliqua. Ut enim ad minim veniam,
quis nostrud exercitation ullamco
laboris nisi ut aliquip ex ea
commodo consequat. Duis aute irure
dolor in reprehenderit in voluptate
velit esse cillum dolore eu fugiat
nulla pariatur. Excepteur sint occaecat
cupidatat non proident, sunt in culpa
qui officia deserunt mollit anim id
est laborum.'''
from pyparsing import *
# add line and col to each word
word = Word(alphas)
word.setParseAction(lambda s,l,t: (t[0], lineno(l,s), col(l,s)))
# use transformString since the input text contains non-words too (like '.' and ',')
print word.transformString(text)
print
# another approach - add line number to each line
# remove \n from the list of default whitespace
ParserElement.setDefaultWhitespaceChars(' \t')
word = Word(alphas)
punc = oneOf('. ,')
eol = LineEnd()
textline = OneOrMore(word | punc) + eol
textline.setParseAction(lambda s,l,t: [str(lineno(l,s)),] + t.asList())
corpus = OneOrMore(Group(textline))
# create in-memory file-like object using StringIO
# could have just as easily used parseString(text), but you asked
# specifically about parseFile
from cStringIO import StringIO
textfile = StringIO(text)
lines = corpus.parseFile(textfile)
for l in lines:
    print l
prints:
('Lorem', 1, 1) ('ipsum', 1, 7) ('dolor', 1, 13) ('sit', 1, 19) ('amet', 1, 23), ('consectetur', 1, 29)
('adipisicing', 2, 1) ('elit', 2, 13), ('sed', 2, 19) ('do', 2, 23) ('eiusmod', 2, 26)
('tempor', 3, 1) ('incididunt', 3, 8) ('ut', 3, 19) ('labore', 3, 22) ('et', 3, 29) ('dolore', 3, 32)
('magna', 4, 1) ('aliqua', 4, 7). ('Ut', 4, 15) ('enim', 4, 18) ('ad', 4, 23) ('minim', 4, 26) ('veniam', 4, 32),
('quis', 5, 1) ('nostrud', 5, 6) ('exercitation', 5, 14) ('ullamco', 5, 27)
('laboris', 6, 1) ('nisi', 6, 9) ('ut', 6, 14) ('aliquip', 6, 17) ('ex', 6, 25) ('ea', 6, 28)
('commodo', 7, 1) ('consequat', 7, 9). ('Duis', 7, 20) ('aute', 7, 25) ('irure', 7, 30)
('dolor', 8, 1) ('in', 8, 7) ('reprehenderit', 8, 10) ('in', 8, 24) ('voluptate', 8, 27)
('velit', 9, 1) ('esse', 9, 7) ('cillum', 9, 12) ('dolore', 9, 19) ('eu', 9, 26) ('fugiat', 9, 29)
('nulla', 10, 1) ('pariatur', 10, 7). ('Excepteur', 10, 17) ('sint', 10, 27) ('occaecat', 10, 32)
('cupidatat', 11, 1) ('non', 11, 11) ('proident', 11, 15), ('sunt', 11, 25) ('in', 11, 30) ('culpa', 11, 33)
('qui', 12, 1) ('officia', 12, 5) ('deserunt', 12, 13) ('mollit', 12, 22) ('anim', 12, 29) ('id', 12, 34)
('est', 13, 1) ('laborum', 13, 5).
['1', 'Lorem', 'ipsum', 'dolor', 'sit', 'amet', ',', 'consectetur', '\n']
['2', 'adipisicing', 'elit', ',', 'sed', 'do', 'eiusmod', '\n']
['3', 'tempor', 'incididunt', 'ut', 'labore', 'et', 'dolore', '\n']
['4', 'magna', 'aliqua', '.', 'Ut', 'enim', 'ad', 'minim', 'veniam', ',', '\n']
['5', 'quis', 'nostrud', 'exercitation', 'ullamco', '\n']
['6', 'laboris', 'nisi', 'ut', 'aliquip', 'ex', 'ea', '\n']
['7', 'commodo', 'consequat', '.', 'Duis', 'aute', 'irure', '\n']
['8', 'dolor', 'in', 'reprehenderit', 'in', 'voluptate', '\n']
['9', 'velit', 'esse', 'cillum', 'dolore', 'eu', 'fugiat', '\n']
['10', 'nulla', 'pariatur', '.', 'Excepteur', 'sint', 'occaecat', '\n']
['11', 'cupidatat', 'non', 'proident', ',', 'sunt', 'in', 'culpa', '\n']
['12', 'qui', 'officia', 'deserunt', 'mollit', 'anim', 'id', '\n']
['13', 'est', 'laborum', '.']
thanks very much. I am still trying the second approach, but there are some errors. I am using asXML() to write the output to an XML file, and I am using Unicode for Arabic. I am trying to adjust your technique to work with asXML.
the reason for using asXML is to use the xml.etree.ElementTree module to restructure and rearrange the output tree for the whole file, in order to convert the output to a relational database for later postprocessing.
the advantage of asXML is that the output is an explicit string, so it is easier to write out to a text file.
I am working on a parsing project where I need to inject some manually created ParseResults into a parsed token. I have attached a parse action at the appropriate place in my code, and I seem to have succeeded in creating a custom-made ParseResults to add back into my larger grammar. dump() and asXML() seem to output correctly, but other parts of my code (trying to access the created results by name) have issues. I can access by list position, but not by assigned name. It is entirely possible that my limited Python knowledge is messing me up somewhere, but since I have not been able to find an example of creating a ParseResults quite this way, I thought I would start here. Here is my ParseResults creation code. tripHeaderCustomFields is attached as a parse action. If a particular value is parsed (ie. 'TripCode'), then some custom ParseResults are created and added back in to the final result.
If anyone has tried to create manual ParseResults like this, could you please look over my code and tell me if you see any glaring problems? It took hours of trial and error to get this version to work, and I would not be surprised if there is a better or more correct way.
def addCustomField( self, group, name, datatype, value ):
    '''
    custom fields:
    Group: ie, specific airline or category - 'USAir', 'general'
    Name: name of field, ie 'linecheck', 'Medical', 'Deadhead', 'IV Pay'
    DataType: string, int, date, time
    Value: value of field, ie. 'checked by joe shmo, #2345', or '1st class medical - bryman'
    '''
    #TODO: Need to ask for help, some logic problem somewhere. losing string name somewhere, but xml prints ok!
    prGroup = ParseResults( group, self.NAME.CFGROUP )
    prName = ParseResults( name, self.NAME.CFNAME )
    prDataType = ParseResults( datatype, self.NAME.CFDATATYPE )
    prValue = ParseResults( value, self.NAME.CFVAULE )
    prList = ParseResults( [] )
    prList += prGroup
    prList += prName
    prList += prDataType
    prList += prValue
    customField = ParseResults( [prList], self.NAME.CUSTOMFIELD )
    return customField

def tripHeaderCustomFields( self, tokens ):
    parseSegment = tokens
    if 'TripCode' in parseSegment:
        customField = self.addCustomField( 'USAir', 'PairingCode', 'String', parseSegment['TripCode'] )
        if self.NAME.CUSTOMFIELDS in parseSegment:
            parseSegment[self.NAME.CUSTOMFIELDS] += customField
        else:
            parseSegment += ParseResults( [customField], self.NAME.CUSTOMFIELDS )
    if 'Charter' in parseSegment[self.NAME.EFFECTIVEDOWS]:
        customField = self.addCustomField( 'USAir', 'Charter', 'Boolean', 'True' )
        if self.NAME.CUSTOMFIELDS in parseSegment:
            parseSegment[self.NAME.CUSTOMFIELDS] += customField
        else:
            parseSegment += ParseResults( [customField], self.NAME.CUSTOMFIELDS )
    return tokens
returns a seemingly correct token,
<CustomFields>
  <CustomField>
    <Group>USAir</Group>
    <Name>EquipmentChange</Name>
    <DataType>Boolean</DataType>
    <Value>True</Value>
  </CustomField>
  <CustomField>
    <Group>USAir</Group>
    <Name>EquipmentChange</Name>
    <DataType>Boolean</DataType>
    <Value>True</Value>
  </CustomField>
</CustomFields>
that goes into a bigger result:
<Trip>
  <TripNumber>8510</TripNumber>
  <EffectiveDOWs>
    <EXCPT>EXCPT</EXCPT>
    <DayOfWeek>MO</DayOfWeek>
    <DayOfWeek>TH</DayOfWeek>
    <DayOfWeek>FR</DayOfWeek>
  </EffectiveDOWs>
  <ReportTime>
    <Hours>21</Hours>
    <Minutes>40</Minutes>
  </ReportTime>
  <TripCode>N</TripCode>
  <EffectiveDateStart>
    <Month>APR</Month>
    <Day>02</Day>
  </EffectiveDateStart>
  <EffectiveDateEnd>
    <Month>APR</Month>
    <Day>27</Day>
  </EffectiveDateEnd>
  <CustomFields>
    <CustomField>
      <Group>USAir</Group>
      <Name>PairingCode</Name>
      <DataType>String</DataType>
      <Value>N</Value>
    </CustomField>
  </CustomFields>
  <RequiredCrew>
    <Captain>1</Captain>
    <FO>1</FO>
  </RequiredCrew>
  .....snip....
</Trip>
Cross posted to
I can't see anything wrong with what you are doing. You are essentially implementing in your parse action what would have happened in the parser if those fields had been in the input stream, and that is just fine. The only comment I can make is that, since you are modifying the tokens object directly, it is not necessary to return it from the routine; you can just return None or not return anything. Pyparsing interprets a None return from a parse action as 'use the current tokens object'. I do this all the time.
Very clever technique to inject extra 'marker' values. Even though it took you hours to figure out, in the end, I think your code looks pretty direct.
-- Paul
Thanks for looking it over. I don't have the background to be able to read your code and figure out how it works, so I really appreciate your feedback. I feel a bit like the caveman with the TV remote: with perseverance I can find ESPN, but I'm not quite sure how I got there... Bad news is I still have a problem with my code somewhere (else). I'll have to dive back in and see if I can tease it out.
I have reworked my custom ParseResults code, and it now works as expected. I wish I had thought of doing it this way the first time, as it was much easier to figure out. :) I do tend to reinvent the wheel... tripHeaderCustomFields is attached as a ParseAction, and the new ParseResults are added to the parent ParseResults
def tripHeaderCustomFields( self, tokens ):
    parseSegment = tokens
    if 'TripCode' in parseSegment:
        customField = self.addCustomField( 'USAir', 'PairingCode', 'String', parseSegment['TripCode'], parseSegment )
    if 'Charter' in parseSegment[self.NAME.EFFECTIVEDOWS]:
        customField = self.addCustomField( 'USAir', 'Charter', 'Boolean', 'True', parseSegment )

def buildCustomFieldString( self, group, name, datatype, value ):
    #TODO: replace any stray '|' that might be in input strings
    text = group + '|' + name + '|' + datatype + '|' + value
    return text

def addCustomField( self, group, name, datatype, value, token ):
    '''
    custom fields:
    Group: ie, specific airline or category - 'USAir', 'general'
    Name: name of field, ie 'linecheck', 'Medical', 'Deadhead', 'IV Pay'
    DataType: string, int, date, time
    Value: value of field, ie. 'checked by joe shmo, #2345', or '1st class medical - bryman'
    <CustomFields>
      <CustomField>
        <Group>USAir</Group>
        <Name>EquipmentChange</Name>
        <DataType>Boolean</DataType>
        <Value>True</Value>
      </CustomField>
      <CustomField>
        <Group>USAir</Group>
        <Name>EquipmentChange</Name>
        <DataType>Boolean</DataType>
        <Value>True</Value>
      </CustomField>
    </CustomFields>
    '''
    pGroup = Word( alphanums )( self.NAME.CFGROUP )
    pName = Word( alphanums )( self.NAME.CFNAME )
    pDatatype = Word( alphanums )( self.NAME.CFDATATYPE )
    pValue = Word( alphanums )( self.NAME.CFVAULE )
    delim = Suppress( '|' )
    customField = Group( pGroup + delim + pName + delim + pDatatype + delim + pValue )( self.NAME.CUSTOMFIELD )
    text = self.buildCustomFieldString( group, name, datatype, value )
    if self.NAME.CUSTOMFIELDS in token:
        token[self.NAME.CUSTOMFIELDS] += customField.parseString( text )
    else:
        token += Group( customField )( self.NAME.CUSTOMFIELDS ).parseString( text )
I'm trying to parse a list of statements enclosed in {} brackets. I only care about some of the statements and I want to avoid writing an exhaustive grammar for all possible statement types.
import pyparsing as pp
s = 'my_keyword(){known1 unknown0 known2 unknown1 unknown2 }'
known = pp.Regex('known.').setResultsName('known')
other = pp.Word(pp.alphanums)
pat = (
    pp.Keyword('my_keyword')
    + pp.nestedExpr(opener='(', closer=')')
    + pp.nestedExpr(opener='{', closer='}', content=pp.OneOrMore(known | other))
    )
for x in pat.scanString(s):
    print x
This works fine, as the 'other' pattern matches all the unknown statements.
However, if I modify the input string as shown below, scanString does not return any output.
s = 'my_keyword(){known1 unknown0 ; known2 unknown1 unknown2 }'
This is obviously because ';' is not an alphanumerical character. Is there a catch-all pattern I can use to match everything not matched by the known pattern? Alternatively, is there a better way of extracting only the known statements from code enclosed between curly brackets?
Try pp.Word(pp.printables, excludeChars='{}')
Thanks for the reply.
I have another question. I'm trying to match a block of text enclosed between curly brackets which contains one known token and other unknown stuff:
import pyparsing as pp
s = 'dummy {dummy0 test dummy1} {dummy2, dummy3}'
other = pp.Word(initChars=pp.printables, excludeChars='{}')
pat=pp.nestedExpr(opener='{', closer='}', content=pp.Each(pp.Literal('test'), pp.ZeroOrMore(other)))
for x in pat.scanString(s):
    print x
I would expect the above pattern would match the first block of text included in {} and not the second one. However, nothing is matched. What is the best way to accomplish my goal?
DIAGNOSIS:
I rewrote your program as follows, creating an expression for CONTENT, which I could then name and enable debugging:
s = 'dummy {dummy0 test dummy1} {dummy2, dummy3}'
other = pp.Word(initChars=pp.printables, excludeChars='{}')
TEST = pp.Literal('test')
CONTENT = TEST & pp.ZeroOrMore(other)
CONTENT.setName('content')
CONTENT.setDebug()
pat=pp.nestedExpr(opener='{', closer='}', content=CONTENT)
for x in pat.scanString(s):
    print x
Outputs:
Match content at loc 7(1,8)
Exception raised:Missing one or more required elements ('test') (at char 7), (line:1, col:8)
Match content at loc 28(1,29)
Exception raised:Missing one or more required elements ('test') (at char 28), (line:1, col:29)
EXPLANATION:
The literal 'test' also matches the definition of other, so when the Each expression failed to find 'test' at the beginning of the bracketed group, it tried to find zero or more others. Since 'test' matches the pattern defined in other, it got read as part of the ZeroOrMore.
SOLUTION:
Define an expression for 'test' and exclude it from the repetition in the ZeroOrMore:
CONTENT = TEST & pp.ZeroOrMore(~TEST + other)
After removing the debugging code, the output is:
(([(['dummy0', 'test', 'dummy1'], {})], {}), 6, 26)
Pyparsing does not implicitly try to do any lookahead or expression filtering or mind-reading - we have to put that in ourselves, as I did by saying that 'test' should not be included as part of the ZeroOrMore repetition of the other expression. Also, please try using setName and setDebug to start troubleshooting these problems, and you will get a better feel for where pyparsing can go astray.
Thanks for the detailed reply
I didn't mean to be flip with my 'mind-reading' comment. In fact, to debug some of these parsers, I often play 'Be The Parser', and try to mentally step through each expression just following the grammar, and not using my own assumptions about how something should be parsed. You have to work hard to set aside your own human pattern matching machinery, which is much more powerful than pyparsing.
Good luck, and write back if you have more questions.
hello - I'm getting an error executing parseString when the following seemingly basic parse action is added - when no parse action is set, it parses without any issue. Any insight? I'm using the latest pyparsing-1.5.6 and Python 3.2
Series_Code=OneOrMore(Word(alphanums+'-'))('Series_Code')
Series_Code.setParseAction( lambda tokens : ''.join(tokens))
test = 'Series_Code: Series 1-1|'
topIDs = Suppress('Series_Code:') + Series_Code + Suppress('|')
parsed = (topIDs).parseString(test)
File 'C:\Python32\lib\site-packages\pyparsing.py', line 689, in wrapper
return func(*args[limit:])
UnboundLocalError: local variable 'limit' referenced before assignment
this is the code referenced by the error
def _trim_arity(func, maxargs=2):
    limit = maxargs
    def wrapper(*args):
        #~ nonlocal limit
        while 1:
            try:
                return func(*args[limit:])
            except TypeError:
                if limit:
                    limit -= 1
                    continue
                raise
    return wrapper
Please make sure you are using the version of pyparsing that is compatible with Python 3. The commented-out nonlocal statement tells me that you are using the Python 2-compatible version.
Thanks for your response - it helped. I re-ran setup.py a few more times after clearing the pyparsing files from Python32\Lib\site-packages, but it kept copying the Python 2 version of pyparsing into my Python32 folder. So I ended up just manually dropping the pyparsing_py3 file into Python32\Lib\site-packages and renaming it to pyparsing, and it seems to be working ok. Is this all the installation really does (outside of compiling it to bytecode, which I understand happens anyway on the first import)? Or, if I do it this way, am I missing some important installation step?
No, that is really all the installation does. pyparsing is just the one Python file, and you are correct, it will be compiled to bytecode the first time you import it. (Keeping it to one file was intentional on my part, to simplify its inclusion in other projects.)
I am disappointed that setup.py is not picking up your Python version though, I'll have to go back and look at that to see where it is going wrong. Thanks for writing, and good luck with pyparsing!
-- Paul
Hi,
I have noticed that ParseResults.__dir__ tries to concatenate a list to dict_keys (in Python 3.x, dict.keys() returns a dict_keys iterable view).
Here is patch with simplest solution:
Index: src/pyparsing_py3.py
===================================================================
--- src/pyparsing_py3.py (revision 216)
+++ src/pyparsing_py3.py (working copy)
@@ -568,7 +568,7 @@
self.__parent = None
def __dir__(self):
- return dir(super(ParseResults,self)) + self.keys()
+ return dir(super(ParseResults,self)) + list(self.keys())
collections.MutableMapping.register(ParseResults)
That will work but its not thread-safe. Thread-safe version would be:
def __dir__(self):
return dir(super(ParseResults,self)) + list(self.keys())
I noticed this issue when using pydev's debugger :)
Awesome, nice catch! Thanks for the patch, I'll include it in 1.5.7.
(I don't see the difference between the patch and the thread-safe version, though.)
-- Paul
Yep, no difference there (copy/paste bug ;) ), it should be:
def __dir__(self):
return dir(super()) + list(self.copy().keys() )
I have noticed that there is already a bug filed for this: TypeError: can only concatenate list to list - ID: 3483740
I have an input string which is a collection of nested structures like this:
id1{
info_level1
id2{
info_level2
}
....
}
.....
Each 'id1' block contains some identifying information and one or more 'id2' blocks
I need to change and replace in the original string all blocks of type 'id2' which meet certain criteria for both info_level1 and info_level2. For this purpose I'm trying to use pyparsing to extract the start and end locations of these 'id2' blocks.
So far I built the pyparsing expression for both 'id2' and 'id1' (which contains 'id2') and I did something like this:
id2_pattern.setParseAction(lambda s,loc,toks: <store location here>)
id1_pattern.searchString(s)
The problem is that with setParseAction I only get the starting location of the matching id2 block. If I use searchString on the id2_pattern, I can't incorporate info_level1 in the search pattern.
I could call searchString on the content of id1 blocks and derive the locations in the input string, but I'm hoping there is an easier way to get the start/end locations for 'id2' blocks
See if one of the transformString examples looks like a better fit. transformString applies changes to the tokens made during a parse action and replaces the matched tokens. It takes care of the replacement within the start/end locations. -- Paul
I ended up using the originalTextFor method and computing the end location from the length of the matched token:
id2_pattern = pp.originalTextFor(old_id2_pattern)
id2_pattern.addParseAction(lambda s,loc,toks: <store (loc, loc + len(toks[0])>)
As a side note there seems to be some undocumented behavior related to this method (I could only find references to it on this forum but not in the pyparsing docs): After using originalTextFor one needs to use addParseAction as opposed to setParseAction.
Got this error :
raise ParseException(instring, loc, self.errmsg, self)
The code is
from pyparsing import Word,alphas
f=open('pinged.txt')
# define grammar
greet = Word( alphas ) + '.' + Word( alphas ) + '.' + Word('com')
# input string
hello = f.read()
# parse input string
output=greet.parseString( hello )
print(output)
Traceback (most recent call last):
File 'C:\module1.py', line 27, in <module>
output=greet.parseString( hello )
File 'C:\Python32\lib\site-packages\pyparsing.py', line 969, in parseString
raise exc
File 'C:\Python32\lib\site-packages\pyparsing.py', line 959, in parseString
loc, tokens = self._parse( instring, 0 )
File 'C:\Python32\lib\site-packages\pyparsing.py', line 833, in _parseNoCache
loc,tokens = self.parseImpl( instring, preloc, doActions )
File 'C:\Python32\lib\site-packages\pyparsing.py', line 2214, in parseImpl
loc, exprtokens = e._parse( instring, loc, doActions )
File 'C:\Python32\lib\site-packages\pyparsing.py', line 837, in _parseNoCache
loc,tokens = self.parseImpl( instring, preloc, doActions )
File 'C:\Python32\lib\site-packages\pyparsing.py', line 1435, in parseImpl
raise ParseException(instring, loc, self.errmsg, self)
pyparsing.ParseException: Expected '.' (at char 5), (line:1, col:6)
Please post the first few lines of the input file. I suspect that Word(alphas) is not sufficient to describe the first part of the file - contains numbers or '_'s perhaps? or starts with 'http://'? Difficult to give much more help without seeing the input string.
-- Paul
A few more tips:
- the last line of the exception tells you where to look in the input string for what you thought would be an alpha, but what pyparsing thought should be one of the '.'s. Look at character number 6 of the first line (numbering starting at 1).
- If you catch the exception by wrapping your call to parseString with try/except ParseException as pe, print pe.markInputline() and you should get a visually marked version of the input line, with '>!<' in the string where the parsing error occurred.
Ping request could not find host lalala.balm.com . Please check the name and try again.
Ping request could not find host lalala.balm.com . Please check the name and try again.
Ping statistics for 11.11.11.111:
Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
Minimum = 57ms, Maximum = 58ms, Average = 57ms
Pinging lalala.balm.com [11.11.11.12] with 32 bytes of data:
Reply from 11.11.11.12: bytes=32 time=69ms TTL=247
Reply from 11.11.11.12: bytes=32 time=70ms TTL=247
Reply from 11.11.11.12: bytes=32 time=69ms TTL=247
Reply from 11.11.11.12: bytes=32 time=69ms TTL=247
Basically, I need to get every IP address this file contains. My idea was to look for text divided by dots, which makes up the IP addresses.
Try using searchString instead of parseString. parseString assumes that your grammar definition fully describes the input string; searchString and scanString will search for matches. (Not all addresses end in .com, some end in .org, .edu, etc., and many contain more than 3 dotted elements. This will be a good learning experience for you.)
I want the comma to act as a separator only when it is outside of parentheses.
input to parser
ition (government, cabinet, etc), harmonious, harmonic
girl(s), women
I need the output:
['ition (government, cabinet, etc)', 'harmonious', 'harmonic']
['girl(s)', 'women']
the erroneous output
['ition (government', 'cabinet', 'etc)', 'harmonious', 'harmonic']
['girl(s)', 'women']
Try this:
listitem = originalTextFor(OneOrMore(Word(alphas) | nestedExpr()))
tests = [
    'ition (government, cabinet, etc), harmonious, harmonic',
    'girl(s), women',
    ]
for t in tests:
    print delimitedList(listitem).parseString(t).asList()
In the future, it would help if you also posted the parser that you have tried, so I can give you some instructive suggestions. As it is, all I can do is spoon-feed you the answer.
-- Paul
thanks very much. I tried to simulate nestedExpr() using
parentheses = Regex(ur'\([^()]+\)')
but this did not solve the problem and I do not know the reason. thanks very much
Just found this post, it's very useful. I'd written a method to parse function argument lists that contain nested calls to functions that might also contain a list of arguments. For example the following argument string:
'A1*atan2( A2, A3), power( 10, A4), ( B1 + B2 )/C10'
Parses to:
['A1*atan2( A2, A3)', 'power( 10, A4)', '( B1 + B2 )/C10']
Now I can replace my method with just two lines:
import pyparsing as pyp

ArithExp = pyp.Word( pyp.alphanums + '+-*/^' )
listitem = pyp.originalTextFor( pyp.OneOrMore( ArithExp | pyp.nestedExpr() ) )
tests = ['A1*sin(pi/8), cos(pi/8)/B1',
         'atan2( 2*X25, -Y25 ), A1*sin( A2/3 )',
         'A1*atan2( A2, A3), power( 10, A4), ( B1 + B2 )/C10']
for t in tests:
    print pyp.delimitedList( listitem ).parseString( t ).asList()
hello - I'm building a line-oriented parser, but at the same time blank lines, including lines containing only spaces and tabs, need to be ignored. Any suggestions?
from pyparsing import *
ParserElement.setDefaultWhitespaceChars(' \t')
text ='''
# there's a space here
From_Date: 10/1/2011
'''
EOL = LineEnd().suppress()
SOL = LineStart().suppress()
blankline = SOL + EOL
headerBlock =Suppress('From_Date:') + Word(nums+'/')('OpeningDate')
headerBlock.ignore(blankline)
print(headerBlock.parseString(text).dump())
You realize that this fails because the input string does not start with 'From_Date:' but with '\n # there's a space'? Maybe parseString is not what you want, try searchString or scanString.
Paul - I'm very sorry - the test string I posted is wrong - it should have been like this (there's a blank or tab in the second line and nothing else except EOD
text ='''
From_Date: 10/1/2011 '''
I have a work-around like this:
headerBlock = SkipTo(Suppress('From_Date:') + Word(nums+'/')('OpeningDate'))
but I need to find a more general solution
again, my problem is this - I'm trying to parse lines, so EOD is significant, but empty lines could come at any time and in any number. These 'empty' lines could occasionally contain tabs and spaces, and I was hoping these would be ignored due to the 2 provisions:
- tabs and spaces are set to be the default whitespace
- I defined blank lines and made them 'ignored'
actually my work-around looks more like this:
headerBlock = SkipTo(Suppress('From_Date:'), include=True) + Word(nums+'/')('OpeningDate')
but that's beside the point - I'm still not happy with this, as I have more cases in my parser where SkipTo would skip over pieces that I'm interested in
Sometimes LineStart() does not match as well as we would expect. Try defining blankline as just plain EOL. Or define EOL as OneOrMore(LineEnd()), and get rid of the ignore.
-- Paul
'define EOL as OneOrMore(LineEnd()), and get rid of the ignore' - worked really well - issue resolved - thank you!!
Hi there. I need to parse every line in a table which contains a specific string in it. Is there a way of doing it with pyparsing? Many thanks in advance.
Very likely. What kind of table are you talking about? An HTML table, or a text-formatted table with '+'s, '-'s, and '|'s? Or just tabular data with columns of values lined up in nice straight columns? Or something else? A sample would help. (and please enclose in [[code]] tags, before and after, each tag on a line by itself.)
-- Paul
Actually a PDF with a text layer... I guess that makes a text-formatted table in the end. Getting text-only input should be easy, but the main question stands.
Ok, cool. Can you post a sample of what your PDF table looks like, as plain text? Or is it stored as a compressed block? If so, you'll need to first extract the binary compressed data, then expand it to text, and then parse the text. You might also look at what PDF-processing tools are available from reportlab. They might even have something that will directly parse your table - pyparsing isn't always the solution to everything :) .
Sample C typedef parser:
The output is incredibly long for such a small example. Is this as expected, or am I doing something wrong when creating my parser?
Wow, that is an ugly mess! Try using this instead to see the results of your parse:
print parser_result.dump()
This will just show you the tokens and any parse results names. Using eval as you are doing dumps out all the internal structures that go into a ParseResults object; even I am surprised at everything you are getting! -- Paul
Hi, I want to extend the behaviour of the Keyword class.
import pyparsing as P

def call_back(p):
    print p
    print 'calling ....'

class MyKeyword(P.Keyword):
    x = P.StringStart() + P.Word(P.alphas) + P.StringEnd()
    x.setParseAction(call_back)
    parseImpl = x.parseImpl
    setParseAction = x.setParseAction

if __name__ == '__main__':
    t = MyKeyword('ABCD')
    t.setParseAction(call_back)
    print t.parseString('ABCD')
I am able to parse the string, but the setParseAction is not working. Can anyone tell me where I am going wrong?
#!/usr/bin/python
from pyparsing import Word,nums
va = 'VA'+Word(nums)
vb = 'VB'+Word(nums)
vc = 'VC'+Word(nums)
v = (va | vb | vc)('vvv')
k = v.parseString('VB 15')
print k.vvv
I was expecting the result to be
['VB', '15']
but am instead getting
VB
If I replace the expression with simply
v = (vb)('vvv')
I get the expected result. Any thoughts?
Thanks!
I'm not sure I've seen this particular problem before, but try changing va to va = Group('VA' + Word(nums)), etc.
This will keep these tokens together when they are saved in the parsed results.
Thanks! That's exactly what I am doing to get around the problem. But it is not as clean because Group creates a list.
If you just want a string 'VA 100' or whatever, then use Combine. Since there may be space between the VA and then number, you'll have to add 'adjacent=False' when combining these tokens.
As an exercise to learning pyparsing and how to write grammars in general, I'm writing a parser for a python-like language.
So far I'm almost finished with the expressions part, but I can't seem to properly define primaries.
The part I'm having trouble with is found in the official documentation here:
Here's the code I'm having trouble with:
primary << (
atom
| attributeref # If I comment this line, everything is fine
)
attributeref << (primary + DOT + IDENTIFIER)
pow_expr << (primary + Optional(POW_OP + unary_expr))
#
# Many other expressions defined
#
expr << conditional_expr # This expression will eventually try to parse a 'pow_expr' and consequently a 'primary', just like defined in python docs
The string I'm trying to parse is as simple as this:
'somevar.someattr'
In other words, I'm trying to parse an attribute access. The problem is that when I call expr.parseString('somevar.someattr', parseAll=True), it will successfully parse 'somevar' as an identifier, but then fail since the remainder starts with a '.' (DOT). If I turn off the parseAll flag, it will only parse 'somevar', which is not what I want. If I swap 'atom' and 'attributeref' in the 'primary' definition (so it will try to match an attributeref first), it will enter infinite recursion and break (obviously).
I have already tried this:
primary << (atom + ~DOT | attributeref)
But this will also fall into infinite recursion since it will keep trying to match the 'attributeref' when it meets the 'DOT' token.
What can I do to work around this problem?
Ok, I've been trying to get my head around recursion for a while, but I'm still having trouble with it.
Would someone please demonstrate how to implement a grammar to match something like '(1 or (2 and 3) or (4 or 5) and 6)'? Note that this may also be represented as '(1 (2 and 3) (4 5) and 6)', where the absence of an operator is an implicit 'or'.
Here is my (embarrassing) attempt:
LPAREN = Literal('(')
RPAREN = Literal(')')
and_ = Keyword('and', caseless=True)
or_ = Keyword('or',caseless=True)
operator = and_ | or_
reference = Word(nums)
statement = Forward()
item = reference | (LPAREN + statement + RPAREN)
statement << (OneOrMore(item + Optional(operator)) + ZeroOrMore(Optional(operator) + statement))
element = Group(operator | statement)
Any guidance would be greatly appreciated. Thanks!
-Keirian
(Already posted in the bugtrack but I think it is more relevant for discussion here)
Dear all,
I tried to feed Pyparsing with the following grammar:
from pyparsing import *
expr = Literal('EXPR')
end = Literal ('endif')
stmt = Forward()
cond = stmt | end
stmt << 'if' + expr + 'then' + cond + Optional('else' + cond)
cond.validate()
stmt.validate()
cond.parseString ('if EXPR then if EXPR then endif else endif')
(['if', 'EXPR', 'then', 'if', 'EXPR', 'then', 'endif', 'else', 'endif'], {})
I am not a parser expert and maybe there is something I do not understand, but I would have expected pyparsing to raise a warning when parsing this string (or accepting the grammar): I see no way to know if the 'else' belongs to the first 'if' or to the second one.
Help?
Thanks!
validate() looks for left-recursion in a grammar, and there is none in the one you posted.
Change stmt to:
stmt << Group('if' + expr + 'then' + cond + Optional('else' + cond))
to better see how the string is parsed into statements.
[
['if', 'EXPR', 'then',
['if', 'EXPR', 'then', 'endif', 'else', 'endif']
]
]
I was a little surprised though to see 'endif' as a valid statement in your grammar. In fact, adding 'endif' as a required terminator to your if statement syntax completely disambiguates any else clause.
Here is a slight rework of your grammar adding two more simple statements, and making 'endif' a required terminator of 'if' (instead of being a statement on its own). Also, I'm using Keyword instead of Literal for your keywords, to avoid accidentally parsing a variable name that just happens to start with a keyword, such as 'iffy'.
from pyparsing import *

# keywords
IF,THEN,ELSE,ENDIF,PASS,PRINT = map(Keyword,
                                    'if then else endif pass print'.split())

# placeholder for boolean expression
expr = Keyword('EXPR')

# statements
stmt = Forward()
pass_stmt = PASS
print_stmt = PRINT + quotedString
if_stmt = (IF + expr + THEN + stmt +
           Optional(ELSE + stmt) +
           ENDIF)
stmt << Group(pass_stmt | print_stmt | if_stmt)

stmt.validate()
print stmt.parseString("if EXPR then if EXPR then print 'hi' endif else print 'bye' endif")
prints:
[
 ['if', 'EXPR', 'then',
  ['if', 'EXPR', 'then',
   ['print', "'hi'"],
   'endif'],
  'else',
  ['print', "'bye'"],
  'endif']
]
In fact, by adding 'endif' as your terminator, you can easily support multiple statements in your then and else blocks. Here is a little more expanded grammar, expanding the print statement (just to make things a little more interesting) and using the multiplication syntax as an alternative to OneOrMore:
ident = Word(alphas, alphanums+'_')
print_stmt = PRINT + (quotedString | ident)
if_stmt = (IF + expr + THEN + Group(stmt*(1,)) +
           Optional(ELSE + Group(stmt*(1,))) +
           ENDIF)
stmt << Group(pass_stmt | print_stmt | if_stmt)
print stmt.parseString('''
    if EXPR then
        if EXPR then
            print 'hi'
            print name
        endif
    else
        print 'bye'
        print name
    endif
    ''')
Gives
[['if', 'EXPR', 'then',
  [['if', 'EXPR', 'then',
    [
     ['print', "'hi'"],
     ['print', 'name']
    ],
    'endif']],
  'else',
  [
   ['print', "'bye'"],
   ['print', 'name']
  ],
  'endif']]
HTH, -- Paul
Thank you very much for your detailed answer, which, if proof were needed, shows how elegant Python and PyParsing are when properly used.
However, with all due respect, I am not completely satisfied yet :-)
What you propose is a workaround: knowing that there is an ambiguity, you fixed it.
I am more concerned by the fact that the ambiguity is not detected by the parser itself, and wonder if (and how) you could detect it automatically.
As an exercise to illustrate what I mean, I wrote an 'equivalent' grammar using ANTLR, and tried to compile it:
cond : stmt | END;
stmt : 'if' EXPR 'then' cond ('else' cond)* ;
EXPR : 'EXPR';
END : 'endif';
But this won't pass the semantic check:
warning(200): test.g:18:42: Decision can match input such as ''else'' using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
error(201): test.g:18:42: The following alternatives can never be matched: 2
I have no idea how much effort it would represent to have this level of analysis.
But you see the point: as soon as there is recursion, there is a risk of such situations, which cannot easily be spotted by a human reader.
What do you think?
I'm sorry, but I don't foresee being able to add this kind of analysis to pyparsing any time soon.
The current philosophy of pyparsing is pretty strictly left-to-right, working through the parsing grammar expression by expression. The implementation of validate() just walks this grammar the same way, looking for cycles in any recursive paths, but even this is not 100% reliable.
If you need this capability, you can still use pyparsing for quick prototyping, but it sounds like ANTLR gives you more value in terms of validation (and runtime performance as well).
-- Paul
Hello everyone. I'm new to Python and even newer to pyparsing. I pasted my code online; I'm having trouble with the definition of 'grant' at line 100 (the error message is shown at the end of the paste). Here is the EBNF of what I want:
<grant> ::= 'grant', <permission>, <user>, [',', <user>]0, <journal>, [',', <journal>]0
Can someone please explain to me what I am doing wrong here? TIA!
<grant> ::= 'grant', <permission>, <user>, [',', <user>]0, <journal>, [',', <journal>]0
The problem you have is that you define journalName to include ',' as a possible character. This consumes the ',' that would be the delimitedList's delimiter, leaving the rest of the list of journal names unparsed.
How to figure this out for yourself? First, look at the exception. I see that you already gave yourself a nice ruler of column numbers. So you can see that the exception reports a problem at column 55, which is the 2nd journal name in the command's list of journals. Now try adding 'setDebug()' on your journalName expression, and you will get output like this:
Match W:(abcd...) at loc 43(1,44)
Matched W:(abcd...) -> ['journal_1,']
Traceback (most recent call last):
File 'k9.py', line 138, in <module>
linea = syntax.parseString(test)
File 'c:\python26\lib\site-packages\pyparsing-1.5.6-py2.6.egg\pyparsing.py', line 1032, in parseString
raise exc
pyparsing.ParseException: Expected end of text (at char 54), (line:1, col:55)
You'll see output from pyparsing every time journalName is attempted, followed by either the matched tokens (if successful) or the exception text (if not). See that your first entry was matched as 'journal_1,', and that unwanted trailing comma is your culprit.
If you need to permit commas in your journal names, then you will need to use a different delimiter, which you can specify to delimitedList in a second argument.
Welcome to pyparsing! -- Paul
You Rock!
I was about to give up and use a 'special' character at the beginning of journalName. Not very elegant and, as per your explanation, it wouldn't have worked anyway.
Thanks a lot!!!
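[Editor's note: for reference, the second argument to delimitedList mentioned above looks like this - a minimal sketch, with hypothetical journal names that are allowed to contain commas and '|' chosen as the delimiter:]
from pyparsing import Word, alphanums, delimitedList
journalName = Word(alphanums + ',_')
journals = delimitedList(journalName, delim='|')
print journals.parseString('journal_1,a|journal_2').asList()
# -> ['journal_1,a', 'journal_2']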
In the following definition, is there any way to obtain the amounts, dates and references grouped in a list or dictionary? I tried Group() and it doesn't work:
single = (StringStart() +
delimitedList(Amount('amount') +
Optional(Date)('date') +
Optional(QuotedString(quoteChar="'"))('reference')) +
journalName('journalName') +
Optional(Comments)('comments') +
StringEnd())
Group the expression within the delimitedList:
single = (StringStart() +
delimitedList(Group(Amount('amount') +
Optional(Date)('date') +
Optional(QuotedString(quoteChar="'"))('reference'))) +
journalName('journalName') +
Optional(Comments)('comments') +
StringEnd()
)
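[Editor's note: a stripped-down illustration of the Group fix above, using placeholder expressions in place of Amount and the reference - each delimited item now comes back as its own sublist:]
from pyparsing import Word, nums, alphas, Group, delimitedList
amount = Word(nums)('amount')
reference = Word(alphas)('reference')
item = Group(amount + reference)
print delimitedList(item).parseString('1 a, 2 b').asList()
# -> [['1', 'a'], ['2', 'b']]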
Hello, I am trying to parse a file containing some simple structure declarations. It should be something like this:
BeginStructure house
color = white
size = big
EndStructure
BeginStructure car
speed = fast
price = 15000
EndStructure
and so on
I'd like to detect every structure declaration and print or save it in a list. I have written something like this:
SingleLine = Word(alphanums) + '=' + Word(alphanums)
MultipleValues = OneOrMore(Group(SingleLine + Suppress(';')))
Structure = Suppress('beginFact') + Word(alphanums) + MultipleValues + Suppress('endFact')
but it can't work if I read the file line by line because I should do something like this:
with open('file.txt') as f:
    for line in f:
        print 'line = ' + line
        print grammarFact.parseString(line)
and obviously it can't match the entire grammar if it considers a single line. Any suggestions? Thank you!
I can't edit my post: sorry, instead of beginFact and endFact they should be BeginStructure and EndStructure.
You should read the entire file into a string and parse it all at once. Unless you are parsing gigabyte-sized files, this is perfectly acceptable.
with open('file.txt') as f:
    print grammarFact.parseString(f.read())
Oh ok. Thank you :)
One last question: would you parse an entire programming language by loading the entire file into a string or there're better ways to do this? If yes, what would be the 'main idea'? Thank you
For parsing a complex grammar like an entire programming language, look at the Verilog parser example, which loads the entire file into a string and parses it. That parser just parses the Verilog source; very little beyond that is implemented. To actually implement a compiler or interpreter, I would define processable classes associated with the various language constructs - see the SimpleBool.py example for how this might be done, and how the classes are used once built.
-- Paul
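[Editor's note: for completeness, here is a sketch of the poster's grammar adjusted to the sample data shown earlier (BeginStructure/EndStructure keywords, no semicolons), parsing the whole input at once as suggested:]
from pyparsing import Word, alphanums, Group, OneOrMore, Suppress
SingleLine = Group(Word(alphanums) + Suppress('=') + Word(alphanums))
Structure = Group(Suppress('BeginStructure') + Word(alphanums)('name') +
                  Group(OneOrMore(SingleLine))('fields') +
                  Suppress('EndStructure'))
grammarFact = OneOrMore(Structure)
sample = '''BeginStructure house
color = white
size = big
EndStructure
BeginStructure car
speed = fast
price = 15000
EndStructure'''
# prints one Group per structure, each with its name and list of fields
print grammarFact.parseString(sample)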
I am trying to write a program that evaluates whether a propositional logic formula is valid or invalid, using the semantic tree method.
I managed to evaluate if a formula is well formed or not so far:
from pyparsing import *
from string import lowercase
def fbf():
    atom = Word(lowercase, max=1) #alphabet
    op = oneOf('^ V => <=>') #Operators
    identOp = oneOf('( [ {')
    identCl = oneOf(') ] }')
    form = Forward()
    #Grammar
    form << ( (Group(Literal('~') + form)) | ( Group(identOp + form + op + form + identCl) ) | ( Group(identOp + form + identCl) ) | (atom) )
    return form
#Parsing
entrada = raw_input('Entrada: ')
try:
    print fbf().parseString(entrada, parseAll=True)
except ParseException as error:
    print error.markInputline()
    print error
    print
Now I need to convert a negated formula ~(form) according to De Morgan's laws. The BNF of De Morgan's laws is something like this:
~((form) V (form)) = (~(form) ^ ~(form))
~((form) ^ (form)) = (~(form) V ~(form))
Parsing must be recursive. I was reading about parse actions, but I don't really understand them; I'm new to Python and very unskilled.
Can somebody help me on how to get this to work?
Thank you very much!.
PD. ptmcg, again greetings from Mexico and thank you for all your help :)
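[Editor's note: this question went unanswered here. For illustration, a minimal sketch of one way to apply De Morgan's laws with a parse action, assuming the token shapes produced by the grammar above (a negation parses as ['~', inner], a parenthesized binary formula as ['(', lhs, op, rhs, ')']); a full solution would also recurse into the rewritten halves:]
from pyparsing import Forward, Group, Literal, Word, oneOf
from string import lowercase
atom = Word(lowercase, max=1)
op = oneOf('^ V => <=>')
identOp = oneOf('( [ {')
identCl = oneOf(') ] }')
form = Forward()
negation = Group(Literal('~') + form)
def demorgan(tokens):
    # tokens[0] is ['~', inner]; rewrite only ~(lhs V rhs) and ~(lhs ^ rhs)
    inner = tokens[0][1]
    if not isinstance(inner, basestring) and len(inner) == 5 and inner[2] in ('V', '^'):
        flip = {'V': '^', '^': 'V'}
        lhs, rhs = inner[1], inner[3]
        return [['(', ['~', lhs], flip[inner[2]], ['~', rhs], ')']]
negation.setParseAction(demorgan)
form << (negation |
         Group(identOp + form + op + form + identCl) |
         Group(identOp + form + identCl) |
         atom)
print form.parseString('~(aVb)', parseAll=True)
# -> [['(', ['~', 'a'], '^', ['~', 'b'], ')']]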
I'm trying to match a pattern with an optional repeating part, followed by a compulsory single part which is a superset of the optional part. So 'A B CDE' has optional parts 'A', 'B', and compulsory part 'CDE'. But in 'A B C', C represents the compulsory part.
This code, and all the variants I've tried involving Optional and OneOrMore, fails for 'A B C', because the optional part is too greedy:
pattern = ZeroOrMore(Word(upper, max=1)) + Word(upper)
I could do the whole chunk with a regex, though I'd prefer not to:
re.compile('([A-Z] )*[A-Z]+')
Is there a way to achieve this in pyparsing?
Thanks!
What version of pyparsing are you using? With version 1.5.6, I get this:
>>> single = Word(alphas,max=1)
>>> mult = Word(alphas)
>>> expr = ZeroOrMore(single) + mult
>>> expr.parseString('A B CDE')
(['A', 'B', 'CDE'], {})
Yep, that one works, but try this:
>>> expr.parseString('A B C')
...big 'orrible traceback
ParseException: Expected W:(abcd...) (at char 5), (line:1, col:6)
I'm running 1.5.6, on Python 3.2.
Sorry, I misread your question.
So a single isn't just a single letter: it's a single letter that is followed by at least one more letter, or conversely, a letter that is not followed by the end of the line.
This kind of lookahead is the only way to tell pyparsing to look beyond just the current character to what comes next as well. It helps to have a kind of self-imposed tunnel vision while writing a pyparsing grammar: unless you explicitly spell out any required lookahead, it's not going to happen.
Using FollowedBy is a way to see if something is coming up, without consuming that something from the input string. So you could implement either of these lookaheads:
single = Word(alphas,max=1) + FollowedBy(Word(alphas))
or
single = Word(alphas,max=1) + ~FollowedBy(LineEnd())
The first only matches a single if there is at least something after it, which could be a single or a mult. The second only matches a single if it is not the last thing on the current line.
Change single to either one of these, and 'A B C' parses just fine.
-- Paul
Thanks, Paul, that seems to be doing the trick.
I'm just starting out with pyparsing and having a bit of a hard time trying to get it to do what I want. I'm trying to parse words into 'prefix', 'stem' and 'suffix'. Prefixes and suffixes are optional, and can be made up of several parts (i.e., a prefix could be 'conjunction' + 'definite article'). Here's the grammar:
endOfString = StringEnd()
conjunction = oneOf('w f')
preposition = oneOf('l b')
def_art = oneOf('al l')
noun_prefix = Group( Optional(conjunction('conjunction')) + \
Optional(preposition('preposition')) + \
Optional(def_art('article')))
noun_suffix = oneOf('y na k km w h ha hm') + FollowedBy(endOfString)
poss_noun = Optional( Optional(conjunction) + \
Optional(preposition) )('prefixes') + \
SkipTo(noun_suffix | endOfString)('stem') + \
Optional(noun_suffix)('suffix')
def_noun = Optional( Optional(conjunction) + \
Optional(preposition) + Optional(def_art) )('prefixes') + \
SkipTo(endOfString)('stem')
noun = Or( [poss_noun, def_noun ] )('noun')
My problem is that I'd like to get the maximum parse (i.e., breaking the word up into as many pieces as possible), not necessarily the longest result.
For example, I have nouns defined as:
noun = Or( [def_noun, poss_noun ] )('noun')
in order to enforce a rule that if a noun has a definite article, it can't also have a possessive ending. The problem I'm having is that the parser matches whichever pattern is first in the Or statement, and doesn't seem to try the other one.
Here's what it does (with the parse that I would have preferred added as a comment to the right):
>>> noun = Or( [def_noun, poss_noun ] )('noun')
>>> for word in wordlist:
... noun.parseString(word).asList()
...
['al', 'dar'] # correct
['b', 'al', 'blad'] # correct
['al', 'blad'] # correct
['b', 'ytw'] # b + yt + w
['b', 'al', 'Hq'] # correct
['l', 'bytw'] # l + byt + w
>>> noun = Or( [poss_noun, def_noun ] )('noun')
>>> for word in wordlist:
... noun.parseString(word).asList()
...
['aldar'] # al + dar
['b', 'alblad'] # b + al + blad
['alblad'] # al + blad
['b', 'yt', 'w'] # correct
['b', 'alHq'] # b + al + Hq
['l', 'byt', 'w'] # correct
So it's matching whichever pattern is first, instead of which pattern is the best match. What am I doing wrong?
Thanks, Karen
No, 'Or' actually tests both (or all if more than two) cases, with the expectation that the 'better' match is the one that matches the longest input string - and if two or more parse the same amount of input text, then the first one given in the Or expression will win out.
You can confirm this for yourself by adding setDebug() to each expression in the Or. setDebug() will report when an expression is about to be used in an attempt to parse the next position in the input, followed by either the success (and matching tokens) or failure (with failure message) of the parse. Change Or to MatchFirst to see the difference.
As a matter of personal style, I prefer 'a ^ b' over 'Or([a,b])', but the two are equivalent.
I'll give your question a little more thought to see what I can come up with to answer your underlying question: how to prefer the more complex parse over the simpler one when they parse the same length of input.
-- Paul
Yes, I guess that is my question. Thank you -- I would appreciate any help.
I can work around it -- I can just put all the prefixes and suffixes in one definition of a noun, but I would like to enforce the definiteness constraint, if possible.
This might be what you were trying to avoid, but creating one comprehensive expression for noun seems best to me:
from pyparsing import *
endOfString = StringEnd()
conjunction = oneOf('w f')
preposition = oneOf('l b')
def_art = oneOf('al l')
noun_prefix = ( Optional(conjunction('conjunction')) + \
Optional(preposition('preposition')) + \
Optional(def_art('article')))
noun_suffix = oneOf('y na k km w h ha hm') + FollowedBy(endOfString)
noun = ( Optional( (noun_prefix) )('prefixes') +
(SkipTo(noun_suffix)('stem') + noun_suffix('suffix') |
SkipTo(endOfString)('stem') ))
wordlist = 'aldar balblad alblad bytw balHq lbytw'.split()
for word in wordlist:
    print word
    print noun.parseString(word).dump()
    print
prints:
aldar
['al', 'dar']
- article: al
- prefixes: ['al']
  - article: al
- stem: dar

balblad
['b', 'al', 'blad']
- article: al
- prefixes: ['b', 'al']
  - article: al
  - preposition: b
- preposition: b
- stem: blad

alblad
['al', 'blad']
- article: al
- prefixes: ['al']
  - article: al
- stem: blad

bytw
['b', 'yt', 'w']
- prefixes: ['b']
  - preposition: b
- preposition: b
- stem: ['yt']
- suffix: ['w']

balHq
['b', 'al', 'Hq']
- article: al
- prefixes: ['b', 'al']
  - article: al
  - preposition: b
- preposition: b
- stem: Hq

lbytw
['l', 'byt', 'w']
- prefixes: ['l']
  - preposition: l
- preposition: l
- stem: ['byt']
- suffix: ['w']
-- Paul
Thank you very much for your help, Paul. I ended up writing some code that would try every type of parse and keep the one with the shortest stem, so that I was able to maintain my constraints:
def word_parse(word):
    word_types = [poss_noun, def_noun, pres_verb, past_verb]
    parses = []
    for type in word_types:
        try:
            parse = type.parseString(word)
            # stems should be at least two letters
            if len(parse.stem) < 2:
                continue
            parses.append((parse.asList(), len(parse.stem)))
        except:
            pass
    try:
        #sort by second value of tuple, to get the parse with
        #the shortest stem
        top_parse = sorted(parses, key=lambda x: x[1])[0][0]
    except:
        top_parse = word
    parse_string = '+'.join(top_parse)
    return parse_string
This seems to work pretty well:
>>> wordlist = ['aldar','balblad','alblad','bytw', 'balHq', 'lbytw']
>>> for word in wordlist:
... print final.word_parse(word)
...
al+dar
b+al+blad
al+blad
b+yt+w
b+al+Hq
l+byt+w
I parsed a small section of authentic text by hand, then tested the parser against it and got an accuracy of 77%. I'm going to do some fancy statistical stuff now to increase that -- but considering that's just a first shot, with a still pretty primitive grammar, it seems pretty damn good.
Thank you again for all your help!
~Karen
2012-04-28 06:48:11 - charles_w - working to understand pyparsing, setResultsName, and setParseAction
Continuing from two StackOverflow questions I posted earlier.
I have since been able to at least get some traction using setResultsName. Here is the current complete code.
from pyparsing import *
#first are the basic elements of the expression
#number at the beginning of the line, unique for each line
#top-level category for a sentiment
#semicolon should eventually become a line break
lineId = Word(nums)
topicString = Word(alphanums+'-'+' '+"'")
semicolon = Literal(';')
#call variable early to allow for recursion
#recursive function allowing for a line id at first, then the topic,
#then any subtopics, and so on. Finally, optional semicolon and repeat.
#the lineId and the semicolon are read but not printed
#set results name lineId.lineId here
expr = Forward()
expr << Optional(lineId.setResultsName('lineId')) + topicString + \
Optional(nestedExpr(content=delimitedList(expr))) + \
Optional(Suppress(semicolon) + expr)
#open files for read and write
input = open('parserinput.txt')
output = open('parseroutput.txt', 'w')
#defining functions
#takes nested list output of parser grammer and translates it into
#strings suited for the final output
def format_tree(tree):
    prefix = ''
    for node in tree:
        if isinstance(node, basestring):
            prefix = node
            yield node
        else:
            for elt in format_tree(node):
                yield prefix + '_' + elt
#function for passing tokens from setResultsName
def id_number(tokens):
    #print tokens.dump()
    lineId = tokens
    lineId['lineId'] = lineId.lineId
#function for splitting line at semicolon and appending numberId
#not currently in use
def split_and_prepend(tokens):
    return '\n' + final.lineId
#setting parse actions
lineId.setParseAction(id_number)
#reads each line in the input file
#calls the grammar expressed in 'expr' and uses it to read the line
#outputs it as a list (changed to allow result names for operations)
#applies the format tree function
for line in input:
    final = delimitedList(expr).parseString(line) #.asList()
    newline = '\n' + final.lineId + ' = '
    final_string = newline.join(format_tree(final))
    print final_string
    output.write(final_string)
The good news is that I'm making progress both toward my desired functionality and toward a better understanding of how this all works.
This version has some bugs/flaws/badly-hacked compromises, however.
- Currently I'm trying to bring in the lineId through the join() method, but this requires that I place the lineId next to the newline. Ideally, the lineId could be placed at the front of each line, which probably means taking a different approach.
- While one possible alternative seems to be to use the lineId attribute variable to construct the kind of line that I want after the parsing, rather than having the parser return the lineId as part of the string, I found that using Suppress on the lineId token to take it out of the results also precluded me from passing that token as an attribute name.
I'm going to continue trying to figure out some other approaches, but I wanted to go ahead and post over here so that someone can warn me if I'm going down rabbit trails and steer me toward a more productive line of questioning.
Okay - I've made a little more progress, possibly.
I'm now stuck at a different challenge en route to an alternate approach to the one I was attempting before.
from pyparsing import *
#first are the basic elements of the expression
#number at the beginning of the line, unique for each line
#top-level category for a sentiment
#semicolon should eventually become a line break
lineId = Word(nums)
topicString = Word(alphanums+'-'+' '+"'")
semicolon = Literal(';')
#call variable early to allow for recursion
#recursive function allowing for a line id at first, then the topic,
#then any subtopics, and so on. Finally, optional semicolon and repeat.
#the lineId and the semicolon are read but not printed
#set results name lineId.lineId here
expr = Forward()
expr << Optional(lineId.setResultsName('lineId')) + topicString.setResultsName('topicString') + \
Optional(nestedExpr(content=delimitedList(expr))).setResultsName('parenthetical') + \
Optional(Suppress(semicolon).setResultsName('semicolon') + expr.setResultsName('subsequentlines'))
notid = Suppress(lineId) + topicString + \
Optional(nestedExpr(content=delimitedList(expr))) + \
Optional(Suppress(semicolon) + expr)
#naming the parenthetical portion for independent reference later
parenthetical = nestedExpr(content=delimitedList(expr))
#open files for read and write
input = open('parserinput.txt')
output = open('parseroutput.txt', 'w')
#defining functions
#takes nested list output of parser grammer and translates it into
#strings suited for the final output
def format_tree(tree):
    prefix = ''
    for node in tree:
        if isinstance(node, basestring):
            prefix = node
            yield node
        else:
            for elt in format_tree(node):
                yield prefix + '_' + elt
#function for passing tokens from setResultsName
def id_number(tokens):
    #print tokens.dump()
    lineId = tokens
    lineId['lineId'] = lineId.lineId

def topic_string(tokens):
    topicString = tokens
    topicString['topicString'] = topicString.topicString

def parenthetical_fun(tokens):
    parenthetical = tokens
    parenthetical['parenthetical'] = parenthetical.parenthetical
#function for splitting line at semicolon and appending numberId
#not currently in use
def split_and_prepend(tokens):
    return '\n' + final.lineId
#setting parse actions
lineId.setParseAction(id_number)
topicString.setParseAction(topic_string)
parenthetical.setParseAction(parenthetical_fun)
#reads each line in the input file
#calls the grammar expressed in 'expr' and uses it to read the line
#outputs it as a list (changed to allow result names for operations)
#applies the format tree function
for line in input:
    final = delimitedList(expr).parseString(line)
    notid = delimitedList(notid).parseString(line)
    dash_tags = ', '.join(format_tree(final))
    print final.lineId + ': ' + dash_tags
The problem is that for multi-line inputs, I get the error 'no such attribute _ParseResults__tokdict'. Commenting out either of the two parsing lines at the end removes the error, and inputs with only one line also avoid the error.
Hmm - I can't edit my post, but the penultimate line is supposed to read:
dash_tags = ', '.join(format_tree(notid))
Because then I can get the tokens without the lineId token on the front, which I'm already getting from final.lineId.
Posted the new problem to SO.
Your post sounds like a traceback in pyparsing itself, so possibly a bug. I'll run your posted code and see.
You have written and posted a lot of code on this project, but have you written a BNF? I've seen some simple examples, but you mentioned that they are just part of a larger project. I'm really having difficulty understanding your code without seeing the bigger picture.
Interesting. I would never have considered the possibility of a bug.
As for a BNF, I have not written one (had to look up the term), but I'll try to describe what I'm doing. I'll try to write a BNF if you recommend it after reading this.
I have nearly ten thousand open-ended survey question responses that I'll have to review and tag in a way that best captures the content of the response. This reviewing and tagging I will be doing manually in Excel.
This manual tagging will follow a simple grammar:
- Parentheses are used to show a parent-child relationship between ideas. These can nest.
- Commas are used to delimit tags at the same level describing the same idea
- Semicolons are used to delimit one major idea from another
For example, given this response: 'Pyparsing is an amazing module because it is so powerful and simple to use. Wikispaces is a good site.'
The tag would be this:
pyparsing(compliment(powerful, easy to use)); wikispaces(compliment)
There are dozens of such tags, and responses can cover arbitrarily many topics in a single response and can go to arbitrary depth.
I then want to use Pyparsing to read my tagging grammar and print out the tags, converting the nested relationships into single tokens that represent each parent-child relationship. At a semicolon, make a new line to indicate a sufficiently discrete change in topic.
So for the above example, this is the desired output:
pyparsing, pyparsing-compliment, pyparsing-compliment-powerful, pyparsing-compliment-easy to use
wikispaces, wikispaces-compliment
Each survey response also has a unique ID number that I want to attach to the front of the lines for that response, as follows.
4934 pyparsing, pyparsing-compliment, pyparsing-compliment-powerful, pyparsing-compliment-easy to use
4934 wikispaces, wikispaces-compliment
The tokens in this final output will be put into fields in Excel alongside the number that serves as an index. Then one can search for fields that equal a given tag whether for a top-level issue (pyparsing) or a narrower category (pyparsing-compliment).
I'll reiterate that I am not experienced in programming, Python, or parsing, so there is at least some stuff in my code that is extraneous. It doesn't all necessarily signify some intention.
I'll also add that while this is a project with practical applications, the primary goal is for it to be a learning experience with Python and Pyparsing, which it certainly has been so far.
While there may be a bug somewhere in pyparsing, my particular problem seems to have been caused by using the same name 'notid' both for the parser expression early in the code and for the parse results in the final section of the code. Someone on SO caught the mistake for me at the link above.
After much trial and error, I've arrived at something that will take the inputs I want and deliver the outputs that I want.
Thanks again, Paul, for your patience and your help. This has been a great learning exercise for me.
from pyparsing import *
data = '''\
1200 price(margin, happy), channel, friend; friend, channel, price
'''
def memorize(t):
    memorize.idnum = t[0]

def endblock(t):
    return '\n' + memorize.idnum
expr = Forward()
expr << Optional(Word(nums).setParseAction(memorize)) + OneOrMore(delimitedList(Word(alphanums+'-'+' '+"'") + Optional(nestedExpr(content=delimitedList(expr))))) + Optional(Suppress(Literal(';')).setParseAction(endblock))
lines = ZeroOrMore(expr)
parsed = lines.parseString(data)
print parsed
def format_tree(tree):
    print tree
    prefix = ''
    for node in tree:
        if node[0].isdigit():
            yield node
        elif isinstance(node, basestring):
            prefix = node
            yield node
        else:
            for elt in format_tree(node):
                yield prefix + '_' + elt
print ', '.join(format_tree(parsed))
Hi,
I've been struggling with labelling of individual results when I am using some Optional elements within an Each clause.
Firstly, is this possible to do? I'd assume it was, but I wanted to check.
Secondly, what am I doing wrong here?
I've posted a StackOverflow question about this which has all of the code etc.; if anyone could help that'd be brilliant!
Cheers,
Robin
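[Editor's note: the StackOverflow link has been lost. For reference, a minimal sketch of labelled results inside Each with Optional elements - the 'color' and 'size' fields are hypothetical, and the results names are attached to the inner expressions rather than to the Optional wrappers:]
from pyparsing import Word, alphas, nums, Optional, Each
color = Word(alphas)('color')
size = Word(nums)('size')
spec = Each([Optional(color), Optional(size)])
print spec.parseString('red 12').dump()   # both fields present, any order
print spec.parseString('12').dump()       # a missing Optional is simply absent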
I'm trying to parse command output that unfortunately isn't very nicely structured. The output is loosely structured as a series of outer blocks, containing zero or more inner blocks. It's possible to determine the start of different blocks but not necessarily the end of them, except to potentially match on a list of block starting matches.
I'm not sure if I'm using the right approach to parsing this. What I've done below has the problem that it doesn't break when it reaches the second outer block, and thus all the inner blocks inside the second block are included in the results for the first outer block. If I try to break by adding
Suppress(SkipTo(inner_block_begin | outer_block_begin))
to the end of the inner_block construct, it seems to greedily slurp until the end (I think).
What's an appropriate way to handle this kind of parsing using pyparsing (if any)?
from pyparsing import *
text = '''
Outer 100
Text to be skipped
Some parm to match 199
Text to be skipped may contain keyword Inner
Inner 101
Text to be skipped
Text to be skipped may contain keyword Inner
Some prefixes:
Text to be skipped
Inner 102
Text to be skipped
Text to be skipped may contain keyword Inner
Some prefixes:
102.1.1.1/24 text
102.2.2.2/24 text
Text to be skipped
Outer 200
Text to be skipped
Text to be skipped may contain keyword Inner
Inner 201
Text to be skipped
Text to be skipped may contain keyword Inner
Some prefixes:
201.1.1.1/24 text text
Text to be skipped
'''
ipv4 = Combine(((Word(nums, max=3) + '.') * 3) + Word(nums, max=3))
ipv4_prefix = Combine(ipv4 + '/' + Word(nums, max=2))
outer_block_begin = lineStart + Literal('Outer')
outer_block_id = Word(nums)
outer_block_misc = Suppress(Literal('Some parm to match')) + Word(nums)
inner_block_begin = lineStart + Literal('Inner')
inner_block_id = Word(nums)
inner_block_prefix = ipv4_prefix + Suppress(restOfLine)
inner_block = \
Suppress(SkipTo(inner_block_begin, include=True)) + \
inner_block_id + \
Suppress(SkipTo(Literal('Some prefixes:'), include=True)) + \
Group(ZeroOrMore(inner_block_prefix))
outer_block = \
Suppress(SkipTo(outer_block_begin, include=True)) + \
outer_block_id + \
Suppress(SkipTo(outer_block_misc)) + outer_block_misc + \
Group(ZeroOrMore(inner_block))
print outer_block.searchString(text)
side78,
Here is a pragmatic approach. Your basic search definition works on a single block, so pre-parse the long string into a list of strings, each representing a block, then apply outer_block.searchString() to each block string in the list. Additionally, I had to add Optional() to the outer_block_misc definition, since it is not always present in a text block. Hope this helps,
Mike
ipv4 = Combine(((Word(nums, max=3) + '.') * 3) + Word(nums, max=3))
ipv4_prefix = Combine(ipv4 + '/' + Word(nums, max=2))
outer_block_begin = lineStart + Literal('Outer')
outer_block_id = Word(nums)
outer_block_misc = Optional( Suppress(Literal('Some parm to match')) + Word(nums) )
inner_block_begin = lineStart + Literal('Inner')
inner_block_id = Word(nums)
inner_block_prefix = ipv4_prefix + Suppress(restOfLine)
inner_block = \
Suppress(SkipTo(inner_block_begin, include=True)) + \
inner_block_id + \
Suppress(SkipTo(Literal('Some prefixes:'), include=True)) + \
Group(ZeroOrMore(inner_block_prefix))
outer_block = \
Suppress(outer_block_begin) + \
outer_block_id + \
Suppress(SkipTo(outer_block_misc)) + outer_block_misc + \
Group(ZeroOrMore(inner_block))
def isOuterBegin( line ):
    bList = outer_block_begin.searchString(line).asList()
    if len( bList ) > 0:
        return True
    return False
strList = []
blockStr = ''
inBlock = False
for line in text.splitlines():
    if isOuterBegin(line): # Start new outer block
        if len( blockStr ) > 0 and inBlock: # Close out previous block
            strList.append( blockStr )
            blockStr = ''
        else:
            inBlock = True
    if inBlock:
        blockStr += line + '\n'
if inBlock and len( blockStr ) > 0: # Close out final block
    strList.append( blockStr )
pList = []
for blockStr in strList:
    bList = outer_block.searchString( blockStr ).asList()[0]
    pList.append( bList )
print( pList )
I'm trying to use pyparsing to chop namespaces off of C++ qualified types in error output. A qualified type looks like this:
namespace::namespace::Type
This alone is pretty easy to do using pyparsing:
name = Word(alphas + "_", alphanums + "_")
namespace = name.suppress() + FollowedBy("::")
identifier = (Group(delimitedList(namespace, delim="::")) +
              Literal("::").suppress() +
              name)
print identifier.transformString(input)
The trouble comes when I try to also match methods on qualified types:
namespace::namespace::Type::MethodName()
In the first example, my output is simply Type, which is what I want. But in the second example, I want the output to be Type::MethodName(), not just MethodName(). I can find no way of making pyparsing do this.
As far as I've been able to figure out, the trouble is that I need backtracking for this. There's no way to tell that a token is a type and not a namespace without knowing what comes after it. In the case of a qualified type, that's okay, because I can just use a lookahead to ensure that there's another '::' after the token (i.e., it's not the last token). But detecting a method would require me to ensure that there's both a type name and method name after the token before declaring it a namespace. I've tried something like:
function = name + '('
method = name + '::' + function
namespace = name.suppress() + FollowedBy('::' + method)
But this doesn't match at all.
I feel intuitively that this grammar is simple enough for pyparsing to deal with, but I am missing a key insight. Anyone have any idea what it is?
In general, whenever you are trying to match a trailing string after some repetition, the pattern is:
- define what you want to match (expr)
- define the repetition expression (rep)
- prefix rep with a negative lookahead, rep = ~expr + rep
- define the overall expression using something like ZeroOrMore(rep) + expr
You can see this in the code below:
from pyparsing import *
name = Word(alphas + '_', alphanums + '_')
# namespace = name.suppress() + FollowedBy('::')
# identifier = Group(delimitedList(namespace, delim='::') + Literal('::').suppress() + name)
COLONS = Literal('::')
LPAR,RPAR = map(Literal,'()')
METHOD_ARGS = LPAR+RPAR | nestedExpr('(',')')
# define what your trailing string looks like
methodcall = (name + METHOD_ARGS)
typestring = (name + (COLONS + methodcall | ~COLONS))
# use negative lookahead to avoid matching your type text as a namespace
namespace = name + COLONS
namespace = ~typestring + namespace
# just using ZeroOrMore instead of delimitedList
identifier = (ZeroOrMore(namespace).suppress() + typestring).leaveWhitespace()
tests = '''
namespace::namespace::Type
namespace::namespace::Type::MethodName()
'''
print identifier.transformString(tests)
Which gives this result
Type
Type::MethodName()
Some other bits:
- I usually try to stay in definition space when defining the component expressions, that is, how to recognize a particular pattern vs. another, and leave things like suppression or results names to later steps, when the smaller pieces all get assembled into an overall grammar.
- Once I use a particular literal string more than a couple of times in a grammar, I'll create a separate expression for it (see the use of COLONS).
- Note that I'm trying to anticipate a method call that has arguments, by using nestedExpr as a cheap way to handle an argument list which itself might contain function calls. This may have a downside in that, if there are namespace references within the method args, they may not be stripped the way you want them. If that happens, you'll need a more rigorous definition of METHOD_ARGS.
- I added leaveWhitespace so that spaces before the namespace references won't be consumed.
Hope this gets you further along, -- Paul
Wow! Thanks Paul. This is really helpful indeed. I'm still trying to wrap my mind around your recursive definition of namespace: I think it's basically saying 'what you think is a namespace is only actually a namespace if it does not also match the definition of a typestring.' Is that right? I had no idea you could restrictively redefine like that. Is that the preferred method of setting the priority with which definitions are matched?
Hello,
I am writing with a question regarding my project. I am writing a parser for a process algebra; until now I used ANTLR with Java, but I want to switch the project to Python. I have a question regarding AST creation in pyparsing. In ANTLR, rules could return values and take arguments, and there was also a tree grammar. This made it easy for me (not a programmer) to create functional parsers. In pyparsing there is only the setParseAction() method. How can I create an AST in the example below?
Basically, what I want to parse is something like 4-op arithmetic, but I do not want to evaluate it, since all terms are abstract. I want to create an AST and later traverse it in order to generate something (in my case, a state space).
Example:
P = a.P + b.P1;
P1 = c.P;
I want AST to be:
= (P + (. (a P) . (b P1)))
= (P1 . (c P))
prefix_op = Literal('.')
choice_op = Literal('+')
parallel = Literal('||')
ident = Word(alphas, alphanums+'_')
lpar = Literal('(').suppress()
rpar = Literal(')').suppress()
define = Literal('=')
semicol = Literal(';').suppress()
col = Literal(',').suppress()
sync = Word('<').suppress() + ident + ZeroOrMore(col + ident) + Word('>').suppress()
coop_op = parallel | sync
# PA grammar
expression = Forward()
process = lpar + ident + rpar | ident | lpar + expression + rpar
prefix = (process + ZeroOrMore(prefix_op + process))
choice = prefix + ZeroOrMore(choice_op + prefix)
expression << choice + ZeroOrMore(coop_op + choice)
rmdef = (ident + define + expression + semicol)
I ask because I could not find a solution anywhere.
Sorry for [code] formatting, do not know what happened and now I cannot edit.
Ok, so I solved my problem. First of all I discovered that in fun (setParseAction(fun)) I can return an object. This way I can propagate the information I want through the rule chain.
Here is example:
class Node(object):
    right = None
    left = None
    data = ''

def createNode(toks):
    n = Node()
    return n
...
rule = (Word('Hi') + Literal(';').suppress()).setParseAction(createNode)
Before you get much further down this path, please read this thread from the pyparsing users list.
AST's are a good intermediate step, but pyparsing can help you build more active objects as output from your parsing process. See how this is done in SimpleBool.py, for example (on the wiki Examples page).
-- Paul
Hey,
Sorry for not responding earlier.
My problem area is such that it is not possible for me to work without an AST. I need to create an AST and re-walk it to create something different from it. Even the legacy reference implementation for my problem does it this way.
The only way to omit the AST is to use an LR parser instead of LL (I assume pyparsing uses LL(1)?).
BTW, is there a way to control the k parameter in LL(k) in pyparsing?
I suspect you have not really read the link that I posted, as in that thread, I describe to that poster how to create an AST using pyparsing's Group class.
Of course, you can approach your problem in whatever way you choose, but from what you've described so far, you are definitely doing this the hard way.
It is very rare in Python to have to implement your own linked list, with next and prev pointers. Python includes its own list structure, to which new items can be freely appended, and which can be easily iterated over. A Python list can contain as an element another, nested Python list, so that a hierarchical structure can be easily represented. And using pyparsing, you don't even have to build your own list, as pyparsing accumulates your parsed tokens for you into a very rich ParseResults object - ParseResults can be treated as a list, or with named tokens that can be accessed by name lookup or as object attributes (again, see and study the simple example in the linked thread).
Lookahead can be LL(as much as you care to define) in pyparsing, using the FollowedBy lookahead class. Within the definition of your grammar, you can specify something as complex as FollowedBy(validTimeStamp) or FollowedBy(socialsecuritynumber+zipcode) or FollowedBy(zipcode*100). FollowedBy will not consume the given expression from the input string, but it will verify that, at the current parsing position, the next parts will or won't match the given expression.
-- Paul
Thanks for clarifying. Of course I read the link you provided; in future projects I will surely try the Python way. This was more of an exercise for myself in Python, so I created my own Node objects. When the whole project is complete I will post some info on the forum, if you are interested.
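[Editor's note: a tiny illustration of the arbitrary lookahead described above, using a hypothetical zipcode expression - the second number is checked but not consumed:]
from pyparsing import Word, nums, FollowedBy
zipcode = Word(nums, exact=5)
head = zipcode + FollowedBy(zipcode)
print head.parseString('12345 67890')
# -> ['12345'] - parsing stopped before the second, unconsumed zipcode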
2012-05-23 04:27:27 - Madan2 - TypeError: unsupported operand type(s) for ^: 'NoneType' and 'NoneType'
Hi,
I'm new to Python and pyparsing and am trying to parse a program. Initially I had success, but when the expressions get complicated I try to build tokens and reuse them to define other tokens/expressions, and this is where I'm getting the errors.
I have code like this
I have code like this
LBRACE = Suppress(Literal('('))
RBRACE = Suppress(Literal(')'))
get_KW = Suppress(Literal('get'))
simple_punc = '-./_:*+=|[~!%]<>?$'
token_char = alphanums + simple_punc
tokenz = Word(token_char)
QuotedString = quotedString
init_constant = Group (LBRACE + Suppress(tokenz) + LBRACE + tokenz ('operator') + LBRACE + Suppress(tokenz) + LBRACE + Suppress(tokenz) + tokenz('variable_name') +Suppress(integer) + LBRACE + Suppress(tokenz)+ Suppress(tokenz) + RBRACE + RBRACE + RBRACE + integer ('init_value') + RBRACE +RBRACE) ('set_expression')
sub_str = Group (LBRACE + tokenz + tokenz + integer + integer + RBRACE) ('set_expression')
fn_call = Group (LBRACE + tokenz ('function name') + OneOrMore(tokenz('argument'))+RBRACE)
get_exp = (LBRACE + get_KW + tokenz+ RBRACE)
trim_get_exp_1 = (LBRACE + tokenz + (get_exp) + QuotedString + RBRACE)
trim_get_exp_2 = (LBRACE + tokenz + (Group(trim_get_exp_1)) + QuotedString + RBRACE)
copy_input_frmt2 = (LBRACE + fmt_KW + QuotedString + trim_get_exp_2+ RBRACE)
copy_exp = Group (LBRACE + copy_KW + (QuotedString ^ inputpath ^ copy_input_frmt1 ^ copy_input_frmt2) + outputpath + QuotedString + RBRACE)('copy_expression')
set_exp = Group (LBRACE + set_KW + tokenz('variable_name') + (fn_call ^ sub_str ^ init_constant ^ integer ^ QuotedString ^ trim_get_exp_1 ^ trim_get_exp_2) + RBRACE)('set_exp')
I get exceptions like
Parser3.py:42: SyntaxWarning: Cannot combine element of type <class 'type'> with ParserElement
trim_get_exp_1 = (LBRACE + tokenz + (get_exp) + QuotedString + RBRACE)
Parser3.py:42: SyntaxWarning: Cannot combine element of type <class 'NoneType'> with ParserElement
trim_get_exp_1 = (LBRACE + tokenz + (get_exp) + QuotedString + RBRACE)
Parser3.py:43: SyntaxWarning: Cannot combine element of type <class 'type'> with ParserElement
trim_get_exp_2 = (LBRACE + tokenz + (Group(trim_get_exp_1)) + QuotedString + RBRACE)
Parser3.py:43: SyntaxWarning: Cannot combine element of type <class 'NoneType'> with ParserElement
trim_get_exp_2 = (LBRACE + tokenz + (Group(trim_get_exp_1)) + QuotedString + RBRACE)
Parser3.py:61: SyntaxWarning: Cannot combine element of type <class 'NoneType'> with ParserElement
copy_input_frmt2 = (LBRACE + fmt_KW + QuotedString + trim_get_exp_2+ RBRACE)
Parser3.py:65: SyntaxWarning: Cannot combine element of type <class 'NoneType'> with ParserElement
copy_exp = Group (LBRACE + copy_KW + (QuotedString ^ inputpath ^ copy_input_frmt1 ^ copy_input_frmt2) + outputpath + QuotedString + RBRACE)('copy_expres
sion')
Parser3.py:79: SyntaxWarning: Cannot combine element of type <class 'NoneType'> with ParserElement
set_exp = Group (LBRACE + set_KW + tokenz('variable_name') + (fn_call ^ sub_str ^ init_constant ^ integer ^ QuotedString ^ trim_get_exp_1 ^ tri
m_get_exp_2) + RBRACE)('set_exp')
Traceback (most recent call last):
File 'Parser3.py', line 79, in <module>
set_exp = Group (LBRACE + set_KW + tokenz('variable_name') + (fn_call ^ sub_str ^ init_constant ^ integer ^ QuotedString ^ trim_get_exp_1 ^ tri
m_get_exp_2) + RBRACE)('set_exp')
TypeError: unsupported operand type(s) for ^: 'NoneType' and 'NoneType'
Can you please tell us what's wrong with this and how to resolve the error? I did see earlier posted messages and tried to group things with braces, but it didn't work.
Thank you!
I extracted your posted code, and it would not build until I stubbed in the following expression definitions:
integer = Word(nums)
fmt_KW = Forward()
copy_KW = Forward()
set_KW = Forward()
inputpath = Forward()
copy_input_frmt1 = Forward()
outputpath = Forward()
After that, the code runs okay. So I think there is probably something wrong with the way these expressions are defined in your larger parser. For example, you may have left out the arguments to construct an object, like accidentally entering:
inputpath = Word
If I do this, then I get these warnings, very similar to what you are getting:
x.py:33: SyntaxWarning: Cannot combine element of type <type 'type'> with ParserElement
copy_exp = Group (LBRACE + copy_KW + (QuotedString ^ inputpath ^ copy_input_frmt1 ^ copy_input_frmt2) + outputpath + QuotedString + RBRACE)('copy_expression')
x.py:33: SyntaxWarning: Cannot combine element of type <type 'NoneType'> with ParserElement
copy_exp = Group (LBRACE + copy_KW + (QuotedString ^ inputpath ^ copy_input_frmt1 ^ copy_input_frmt2) + outputpath + QuotedString + RBRACE)('copy_expression')
-- Paul
Thanks very much. Indeed I was making a mistake. Thanks again!!
Here are some loose thoughts on improving lovely pyparsing.
- Add actions to rules that are fired when a rule is entered, not when it is resolved, similar to ANTLR's init block. It could be some kind of setPreAction(str, loc)?
- Give the possibility to control k in LL(k).
- How can we achieve lookup? By ^?
It would also be nice to be able to pass context to rules (arguments for rules).
- Pyparsing has debugging actions that fire before match, after successful match, and after failed match. Perhaps the before match action could do this for you.
- See my discussion of FollowedBy in your other post.
- 'lookup'? Perhaps you mean results names? Check out setResultsName.
-- Paul
To pass context to rules, look at how this is done in some of pyparsing's helper methods, like withAttribute, or replaceWith.
Sorry for replying late. Thank you for your reply; I will surely look into it.
Is there any way to get the overall text matched by a ParseResults object? I know about originalTextFor, but that replaces the matched text at match time. What I want is a way to incrementally slice the full nested parse tree to get the text of any partial match. So I want to be able to do, like, myGrammar.parseString('...').someSubElement.otherSubElement.originalText() and get the original text that matched that particular nested bit of the overall grammar. How can I do this?
Short of monkeypatching the ParserElement class, there is no generic way to do this for any and every subelement. You might try writing a wrapper class that captures the ending position and adds an 'originalText' named result to any returned ParseResults. Then wrap your expressions with that wrapper class. Something like Forward that just contains another expression - maybe even just subclassing Forward would work. -- Paul
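[Editor's note: a minimal sketch of the wrapper idea suggested above, using location-marker parse actions rather than a Forward subclass; withOriginalText and its 'originalText' result name are illustrative, not part of the pyparsing API:]
from pyparsing import Empty, Word, nums
def withOriginalText(expr, name='originalText'):
    # record the parse locations just before and after expr, then
    # attach the matching slice of the input string as a named result
    startLoc = Empty().setParseAction(lambda s, loc, t: loc)
    endLoc = startLoc.copy()
    wrapped = startLoc('_start') + expr + endLoc('_end')
    def attachText(s, l, t):
        t[name] = s[t._start:t._end]
    wrapped.setParseAction(attachText)
    return wrapped
# the integer location markers remain in the token list, but the raw
# matched text is now available by name:
numbers = withOriginalText(Word(nums) + Word(nums))
print numbers.parseString('12  34').originalText
# -> 12  34 (including the original internal whitespace)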
Hi,
I'm using Combine. It works fine, but it strips off the whitespace between the tokens. Is there any way to retain the spaces, OR introduce one space between each token?
actual text : ((i 0 (+ i 1)))
Combine((printLBRACE + tokenz + tokenz + printLBRACE + tokenz + tokenz + tokenz + printRBRACE + printRBRACE),adjacent=False)
gives me (i0(+i1))
I would like to have '((i 0 (+ i 1)))'
or with at least one space between each token.
Thankyou!
Try this:
Combine((...exprs...), joinStr=' ', adjacent=False)
Thank you Sir! Much appreciated for the instant reply.
it's actually joinString not joinStr
Guess I should read the docs. :)
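[Editor's note: a self-contained example of the corrected keyword, with simple placeholder expressions:]
from pyparsing import Combine, Literal, Word, alphanums
expr = Combine(Literal('(') + Word(alphanums) + Word(alphanums) + Literal(')'),
               joinString=' ', adjacent=False)
print expr.parseString('(i 0)')
# -> ['( i 0 )'] - the tokens are combined with a single space between each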
Hello group,
I am new to Python and am writing a program to parse EDI files. I came across pyparsing and it looks like a good fit for my use case. The input data file has a well defined syntax, but some of the lines may be optional, and there can be blocks of lines which repeat. My main question: is there a way to map a repeating block of data (a loop) into a list? For example, here is a sample file.
ISA*00* *00* *01*987654321 *01*123456789 *020917*0913*U*00400*000000901*0*P*>@
GS*SH*987654321*123456789*20020917*0913*965*X*004010@
ST*856*0001@
BSN*00*260784*20020917*0913@
DTM*011*20020917*0913*@
HL*01**S@
MEA*PD*G*355*LB@
TD1*CTN90*3@
TD5*B*2*RDWY*LT****@
TD3*TL*RDWY*1234567890@
REF*BM*260784@
N1*MI**01*987654321@
N1*SU**01*123456789@
N1*ST**01*987654321@
HL*02*01*O@
LIN**BP*PARTNUMBER1*PO*P012345*RE*001@ SN1**40*PC*1084@ //Block 1
HL*03*02*I@
REF*DK*DOCKA@
CLD*1*40*CTN90@
REF*LS*S562896@
HL*04*01*O@
LIN**BP*PARTNUMBER2*PO*P012316*RE*004@ //Block 2
SN1**100*PC*32400@
HL*05*04*I@
REF*DK*DOCKB@
CLD*2*50*CTN90@
REF*LS*S562897@
REF*LS*S562898@
CTT*2@ SE*29*0001@
GE*1*965@
IEA*1*000000901@
So the data has a header section, two blocks of data, and a trailing section. Instead of 2, there could be hundreds of blocks, and some lines in a block may be optional.
- Is there a way to write a grammar which has some context? I ask because the 'REF' tag appears both in the header section and within a block (say, a block whose beginning is marked by the LIN tag).
- Is there a way to capture a repeating block of data (note that the number of lines varies, as some lines are optional) into a Python list?
Thank you very much.
I suggest you start small. Here is a prototype parser with simplified line markers for tags. You can use this little prototype grammar to experiment with repetition, grouping, and ordered/out-of-order data.
Here is the first example, with a header line starting with 'A' and a terminator line starting with 'Z'. You can see in the definition of AZ_rec that between these two there can be 0 or more B, C, or D lines. The line groups that have values are also labeled with their key: A, B, C, or D.
from pyparsing import *
ParserElement.setDefaultWhitespaceChars(' \t')
NL = LineEnd().suppress()
integer = Word(nums).setParseAction(lambda t:int(t[0]))
STAR = Suppress('*')
A_line = Group('A' + STAR + integer('value') + NL)
B_line = Group('B' + STAR + integer('value') + NL)
C_line = Group('C' + STAR + integer('value') + NL)
D_line = Group('D' + STAR + integer('value') + NL)
Z_line = 'Z' + STAR + NL
AZ_rec = (A_line('A') +
Group(ZeroOrMore(B_line))('B') +
Group(ZeroOrMore(C_line))('C') +
Group(ZeroOrMore(D_line))('D') +
Z_line)
Here is some sample data, and a test routine to parse and output the results:
data1 = '''\
A*100
B*10
C*11
D*12
Z*'''
data2 = '''\
A*101
B*10
B*11
B*12
Z*'''
def testParse(expr,s):
    data = expr.parseString(s)
    print data.asList()
    print 'A', data.A.value
    for key in 'BCD':
        if data[key]:
            print key, '-', ','.join(str(d.value) for d in data[key])
    print
testParse(AZ_rec, data1)
testParse(AZ_rec, data2)
Giving:
[['A', 100], [['B', 10]], [['C', 11]], [['D', 12]], 'Z']
A 100
B - 10
C - 11
D - 12
[['A', 101], [['B', 10], ['B', 11], ['B', 12]], [], [], 'Z']
A 101
B - 10,11,12
Notice how the values are extracted by name from the parsed data (data.A.value, or d.value for d in data[key]). With pyparsing's named results, you don't have to count up indexes into lists of tokens (which can break when the grammar evolves in the future and new fields are introduced in the middle of existing ones).
Now here is the slightest variation on the previous parser, in which the B, C, and D internal records might not occur in nice B-C-D order, but might be C-B-D, or C-D-B, D-B, etc. The difference is that this grammar uses the '&' operator instead of '+' to join the inner 3 record types.
AZ_rec2 = (A_line('A') +
(Group(ZeroOrMore(B_line))('B') &
Group(ZeroOrMore(C_line))('C') &
Group(ZeroOrMore(D_line))('D')) +
Z_line)
data3 = '''\
A*102
C*11
C*21
C*31
B*10
Z*'''
testParse(AZ_rec2, data3)
Giving:
[['A', 102], [['B', 10]], [['C', 11], ['C', 21], ['C', 31]], [], [['B', 10]], [], [], 'Z']
A 102
B - 10
C - 11,21,31
See if you can make some headway from here.
Welcome to Python and pyparsing! -- Paul
Paul. Thank you very much for the detailed answer and encouragement. I appreciate your effort supporting pyparsing through this forum and SO. After some XML processing, I will get back and update my progress. Thanks again.
Hi,
I've written a parser for Fortran namelists, which have the general format...
&name
a = 1,
b = 2,
c = 3, ! this is a comment
/
The parser works perfectly. The key=value pairs are recognized as such and given appropriate names with setParseAction(). After having checked the correct syntax (i.e. correct parsing) of the namelist, I would like to check for the presence of a specific key=value pair. If the key is present, I would like to change the value and output the namelist OTHERWISE UNTOUCHED, meaning with all the whitespace, comments, etc. that were present in the original unparsed version. Any hints on how to achieve that would be appreciated.
Cheers, Oli
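[Editor's note: this question went unanswered here. One possible approach, as a minimal sketch: transformString only rewrites the text of actual matches, so a pattern that matches just the one key=value pair leaves all other whitespace and comments byte-for-byte untouched. The setNamelistValue helper and the numeric-only value expression are illustrative assumptions, and a comment containing 'key = 123' would also be rewritten:]
from pyparsing import Keyword, Word, nums
def setNamelistValue(text, key, newValue):
    pair = Keyword(key) + '=' + Word(nums)
    pair.setParseAction(lambda s, loc, t: '%s = %s' % (key, newValue))
    return pair.transformString(text)
source = '''&name
 a = 1,
 b = 2,
 c = 3, ! this is a comment
/'''
# prints the namelist unchanged except that 'b = 2' becomes 'b = 42'
print setNamelistValue(source, 'b', '42')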
Suppose I want to change a number embedded in a alpha-string using transformString. Like, for instance, changing 123 to 456 embedded in a string of x's:
xxxxxxxx123xxx -> xxxxxxxx456xxx
The code below shows two ways to do this:
from pyparsing import Group, ZeroOrMore, Word, nums, ParseResults

def changeNum1 (res):
    res[8] = 456 # how to know that the number is at position 8?
    print 'dict:', res.number, 'list:', res[8],

def changeNum2 (res): # this does *not* change the returned parse result
    res.number = 456
    print 'dict:', res.number, 'list:', res[8],

exp = ZeroOrMore ('x') + Word (nums)('number') + ZeroOrMore ('x')
exp1 = exp.copy().setParseAction (changeNum1)
exp2 = exp.copy().setParseAction (changeNum2)
print '--> ' + exp1.transformString ('xxxxxxxx123xxx')
print '--> ' + exp2.transformString ('xxxxxxxx123xxx')
This example produces the following output:
dict: 123 list: 456 --> xxxxxxxx456xxx
dict: 456 list: 123 --> xxxxxxxx123xxx
The first transformation (exp1) works; the second (exp2) does not. In the first we change the parse result through its list representation, in the second through its dict representation. When the dict representation is changed, this is not reflected in the list representation: the two are not kept in sync. This seems unfortunate, because changing the dict is far more user friendly (position independent) than using an index into a list.
The __setitem__ method of the ParseResults class does not keep the list representation synchronized with the dictionary representation. Changing one does not change the other, and only the list representation seems to be propagated to the final result. (Unfortunately, I do not yet understand why __setitem__ leaves the list and dict representations of a parse result in an inconsistent state.)
I have searched this discussion forum, and only found one earlier thread where the solution is a crooked way (using the internal Python id of an object!) to get to the list index of a named item, clearly showing the lack of this info in the API. In my opinion, the list index(es) of (all matches of) a named item should be accessible through the API (the information is held in __tokdict).
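[Editor's note: pending any change to __setitem__, one position-independent workaround is to locate the named token's position from the list itself, and update both views - a sketch, assuming the named value occurs only once in the token list:]
from pyparsing import Word, nums, ZeroOrMore
def changeNum(res):
    i = list(res).index(res.number)   # find the named token's list position
    res[i] = '456'                    # update the list view (used by transformString)
    res['number'] = '456'             # keep the dict view in step as well
exp = ZeroOrMore('x') + Word(nums)('number') + ZeroOrMore('x')
exp.setParseAction(changeNum)
print exp.transformString('xxxxxxxx123xxx')
# -> xxxxxxxx456xxx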
Hi,
While parsing I'm encountering a "" in the data, and the parse errors out. Is there any way to overcome this?
I've cut and pasted the data below:
(set! mMiscStringTwo (string-tokens (cdr (assoc mFreightClass (vector->list mItemVector))) #\~))
Help is much appreciated.
Thank you! Madan
It is pretty much impossible to help you much on this question looking at just the input string - can you post the pyparsing grammar, or at least the part that is failing?
I'm a newbie and have spent quite a bit of time trying to get this to work, but I am still unable to parse the following text with multiple records. I think pyparsing will do this for me. Appreciate your help in advance. Thank you.
- there are 2 'parent:' fields that I need to process separately as parent 1 and parent 2. Currently it seems like only 'pcs2_group' is matched.
- the 'description' field is multi-lined, starting on the next line.
I am not sure if searchString can do it or not now...
sample = r'''changeset: 2916:cbeb5f68b46b725ebeb0192e4b6852db6c9bd6f3
parent: 2914:ab2526b29654115d3327c4ae31243e019f4739c5
parent: -1:0000000000000000000000000000000000000000
description:
Bug 123: blah line 1
Bug 455: blah line 2
changeset: 2915:b21b281f5bf00350823aadd64730efb18f62150f
... another record ...
'''
SkipToNextRecord = SkipTo( 'changeset:', include=False )
SkipToKey = SkipTo( Word(alphas), include=False )
cset = Word(nums).setResultsName('revId') + Suppress(':') + Word(alphanums).setResultsName('rev')
changesetStmt = Group( 'changeset:' + cset('changeset_group') ) + SkipToKey
parCset = Word(nums).setResultsName('revId') + Suppress(':') + Word(alphanums).setResultsName('rev')
parCsetStmt = Group( 'parent:' + cset('pcs_group') ) + SkipToKey
pcs2cset = ZeroOrMore('-') + Word(nums).setResultsName('revId') + Suppress(':') + Word(alphanums).setResultsName('rev')
par2CsetStmt = Group( 'parent:' + pcs2cset('pcs2_group') ) + SkipToKey
changesetLine = 'changeset:' + SkipTo( Literal('\n').suppress() )
descLine = Word(alphanums)
descrDef = OneOrMore( ~changesetLine )
descrStmt = Group( 'description:' + descrDef('DESCR') ) + changesetLine
changesetDef = Dict( changesetStmt
+ ZeroOrMore(parCsetStmt)
+ ZeroOrMore(par2CsetStmt)
+ ZeroOrMore(descrStmt).setDebug()
) + SkipToNextRecord
for csetDict in changesetDef.searchString(sample):
    print csetDict.dump()
    print '-' * 8
I'm getting closer... I'll probably post the solution when I have it.
Can we compose a dynamic extractor statement? For example, take this scanString() extractor example:
#################
print 'Example of an extractor'
print '----------------------'
# simple grammar to match #define's
ident = Word(alphas, alphanums+'_')
macroDef = Literal('#define') + ident.setResultsName('name') + '=' + restOfLine.setResultsName('value')
for t,s,e in macroDef.scanString( testData ):
    print t.name,':', t.value
# or a quick way to make a dictionary of the names and values
# (return only key and value tokens, and construct dict from key-value pairs)
# - empty ahead of restOfLine advances past leading whitespace, does implicit lstrip during parsing
macroDef = Suppress('#define') + ident + Suppress('=') + empty + restOfLine
macros = dict(list(macroDef.searchString(testData)))
print 'macros =', macros
print
I need to have the variables ident, name, value, and restOfLine read dynamically, and the extractor statement composed at run time. Each iteration, the values will change.
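[Editor's note: if 'composed at run time' means building the extractor from values read at run time, ordinary Python assembly works, since pyparsing expressions are plain Python objects. A sketch, with a hypothetical makeExtractor helper:]
from pyparsing import Word, alphas, alphanums, Suppress, empty, restOfLine
def makeExtractor(keyword, sep):
    ident = Word(alphas, alphanums + '_')
    # empty advances past leading whitespace, as in the example above
    return (Suppress(keyword) + ident('name') + Suppress(sep) +
            empty + restOfLine('value'))
macroDef = makeExtractor('#define', '=')
for t, s, e in macroDef.scanString('#define X = 42\n#define Y = 99\n'):
    print t.name, ':', t.value
# X : 42
# Y : 99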
I need to understand this entry: September 9, 2007 - Pyparsing Recipe in the Python Cookbook
Diaa - there is not a lot of information here. Can you post a URL to the Python Cookbook recipe you mean? -- Paul
September 9, 2007 - Pyparsing Recipe in the Python Cookbook: Kevin Atkinson has submitted this recipe to the Python Cookbook. It uses a new feature of the Python eval and exec commands to implement custom behavior when a symbol is not found. So instead of this:
B = Forward()
C = Forward()
A = B + C
B << Literal('b')
C << Literal('c')
D_list = Forward()
D = Forward()
D_list << (D | (D + D_list))
D << Literal('d')
you can just write this:
A = B + C
B = Literal('b')
C = Literal('c')
D_list = D | (D + D_list)
D = Literal('d')
Diaa -
Do not use this recipe, especially since you are still working through basic pyparsing and Python concepts. All this recipe does is allow you to avoid the pre-definition of Forward instances, at the risk of introducing other dependencies and errors. It does not make pyparsing any easier or faster, and does not conform to any of the docs or examples.
-- Paul
I'm finding that pyparsing does not handle ParseResults attributes in a consistent way. The pyparsing code seems to cavalierly treat ParseResults objects and lists as equivalent, even though they're not: the ParseResults objects can have attributes (i.e., subexpressions with resultNames), but the lists don't.
I'm writing some utility parsing expressions. I have one that acts like the builtin Group() but also saves some additional information about the parsed object in some attributes. However, when in postParse I wrap the result in a list, the attributes are lost.
I can't find anything in the ParseResults object that provides an easy way to copy the attributes from one ParseResults object to another. What am I supposed to do if I want to create a new ParseResults object that 'wraps' another, retaining its attributes?
Here's an example. In pyparsing's Group class, it has this:
def postParse( self, instring, loc, tokenlist ):
    return [tokenlist]
How could this be modified to create a Group that retains the attributes of its subexpressions? The only place that I can see attribute-copying code is in the __iadd__ method of ParseResults, but this again assumes that I want to retain the same token list. I don't. I want to make a new ParseResults object from an existing one, and wind up with a DIFFERENT token list but the SAME attributes as the original.
Not sure if these qualify as solutions to your problem, but might give you some ideas on how to workaround current limitations.
-- Paul
from pyparsing import *
# a simple key-value grammar to load up some ParseResults
ident = Word(alphas, alphanums)
integer = Word(nums)
table = dictOf(ident, integer)
res1 = table.parseString('A 3 B 7 D 22')
print res1.dump()
res2 = table.parseString('Z 12 Y 17')
print res2.dump()
# delete all entries from a PR list, but leave attributes intact
del res2[:]
print res2.dump()
# ParseResults don't have append, but they do iadd
res2 += ParseResults(list('ABCDEF'))
print res2.dump()
# use dict-style assignment to copy key-values from one PR to another
res3 = ParseResults(list('ABCDEFG'))
for k,v in res1.items():
    res3[k] = v
print res3.dump()
Sorry, forgot the output results:
[['A', '3'], ['B', '7'], ['D', '22']]
- A: 3
- B: 7
- D: 22
[['Z', '12'], ['Y', '17']]
- Y: 17
- Z: 12
[]
- Y: 17
- Z: 12
['A', 'B', 'C', 'D', 'E', 'F']
- Y: 17
- Z: 12
['A', 'B', 'C', 'D', 'E', 'F', 'G']
- A: 3
- B: 7
- D: 22
Oh, one other thing - when you say that you 'wrap one PR inside a list, and the attributes are lost', I don't think they are lost; they are just held as attributes of the [0]th element in the list, kind of like wrapping a dict in a list. Group has to do this, so that an expression with results names that occurs multiple times can be wrapped in Group, and the different parse results will be kept from stepping on each other.
Hello,
as seen on , I'm trying to get pyparsing to properly parse records such as
DRUG D09347 Fostamatinib (USAN)
D09348 Fostamatinib disodium (USAN)
D09692 Veliparib (USAN/INN)
D09730 Olaparib (JAN/INN)
D09913 Iniparib (USAN/INN)
So far what I did is
from pyparsing import *
punctuation = ",.'`&-"
special_chars = '\()[]'
drug = Keyword('DRUG')
drug_content = Word(alphanums) + originalTextFor(OneOrMore(Word(
alphanums + special_chars))) + ZeroOrMore(LineEnd())
drug_lines = OneOrMore(drug_content)
drug_parser = drug + drug_lines
However OneOrMore is too greedy, and grabs the following lines as well (example with 3 entries):
['DRUG', ['D09347', 'Fostamatinib (USAN)
D09348 Fostamatinib disodium (USAN)
D09692 Veliparib (USAN']]
What I'd like instead would be to have
['DRUG', [['D09347', 'Fostamatinib (USAN)'], ['D09348', 'Fostamatinib disodium (USAN)'],
['D09692', ' Veliparib (USAN)']]]
I'm fairly sure I'm getting things wrong here. Where?
I usually like to try to 'be the parser'. When you see the word 'DRUG', how do you know that it is not supposed to match the leading Word(alphanums) in drug_content? Probably because 'DRUG' is a special word in your syntax. So you specifically want to exclude 'DRUG' from matching that first word - the easiest way is to use a negative lookahead, using NotAny or the ~ operator. Try this:
drug_content = ~drug + Word(alphanums) + originalTextFor(OneOrMore(Word(
alphanums + special_chars))) + ZeroOrMore(LineEnd())
You could also attach a validating parse action to the leading word to check the column number, and if it is 1, raise a ParseException.
-- Paul
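[Editor note: a sketch of the column-number check suggested above; the idea is that a continuation line's leading word must not start in column 1 (the names here are illustrative):]
from pyparsing import Word, alphanums, col, ParseException
def not_in_column_1(s, loc, tokens):
    if col(loc, s) == 1:
        raise ParseException(s, loc, 'continuation lines must be indented')
leading_word = Word(alphanums).setParseAction(not_in_column_1)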
Apparently it's not working yet:
In [515]: drug_parser.parseString(contents)
Out[515]: (['DRUG', 'D09347', 'Fostamatinib (USAN)\n D09348 Fostamatinib disodium (USAN)\n D09692 Veliparib (USAN'], {})
The only change I did was to switch drug_content, the rest was like in my example.
Sorry, I misread your 'what I want' example. If you just want to read from 'D0####' to the end of the line, then this version of drug_content will give you a separate entry for each line:
drug_content = Group(~drug + Word(alphanums) + empty + restOfLine)
Group tells pyparsing to create a sublist for each drug_content. empty advances the parser to the next non-whitespace character, and restOfLine gets everything up to the next newline.
Hope this gets you closer, -- Paul
Thanks, this gets one of the hardest bits sorted. Now I'm having a similar record, but the issue here is that I need everything on the line up to a certain point.
The first bit is identical to the other, with just a different keyword
GENE 3932 LCK; lymphocyte-specific protein tyrosine kinase (EC:2.7.10.2) [KO:K05856] [EC:2.7.10.2]
And it's a multiline record like the DRUG one (so the same things apply). However, here I need to parse just part of the line, up to the word with the first parenthesis; the rest should be ignored. In other words, the bit that interests me is
GENE 3932 LCK; lymphocyte-specific protein tyrosine kinase
The part after the ;, however, is variable-length. So far I got the first two bits OK
gene_line = ~gene + Word(nums) + empty + Word(alphanums, excludeChars=';')
But how can I tell pyparsing to parse 'up to a certain point'?
I forgot to add: the 'good' text may also contain parentheses, so I can't just ignore text starting from the first parens('(') onwards.
Hmm, your last statement kind of confounds the issue. I was going to suggest using SkipTo('('), but that won't work if you have ()'s in your desired text as well.
It looks like what you want is to read up to something matching this: '(EC:2.7.10.2)'. Create an expression that matches that with some rigor - not just '(' + Word(alphanums+':.') + ')', which might still match some text that you actually want, but more like
Combine('(' + Word(alphas.upper()) + ':' + delimitedList(Word(nums),'.') + ')')
or, if you prefer, Regex(r'\([A-Z]{2}:\d+(\.\d+)+\)'). This looks like some kind of reference to me, so I'll call it 'reference'. You can then use SkipTo(reference) to read the desired text.
A side note: when you are parsing the first word ('LCK' in your sample), you are using this expression:
Word(alphanums, excludeChars=';')
There is no point in using excludeChars here, since ';' is not one of the characters in alphanums. The purpose of excludeChars is to simplify defining words like 'any printable character except '/' or '.'' as Word(printables, excludeChars='/.'). Before I added excludeChars, you had to write something like Word(printables.replace('/','').replace('.','')). In your case, Word(alphanums) will work just fine, and still not read the trailing ';'.
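[Editor note: putting the suggestions above together, a sketch of the whole line parser; the 'reference' expression is the one suggested above, and is an assumption about the data:]
from pyparsing import *
gene = Keyword('GENE')
reference = Regex(r'\([A-Z]+:\d+(\.\d+)*\)')   # e.g. (EC:2.7.10.2)
gene_line = (~gene + Word(nums)('gene_id') + empty +
             Word(alphanums)('symbol') + Suppress(';') +
             SkipTo(reference)('description'))
test = '3932 LCK; lymphocyte-specific protein tyrosine kinase (EC:2.7.10.2) [KO:K05856]'
print gene_line.parseString(test).dump()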
I'm having some trouble parsing this
convert_revision=svn:9171d42e-b04d-0410-96dc-cb0bc40dcdda/realstore/trunk@2222
Something simple like this didn't work.
svn = 'svn' + Word(alphanums + ':' + '-' + '/@' )
Am I missing something here?
This parses everything after 'convert_revision=' just fine. To parse the whole string, try something like:
expr = Word(alphas,alphanums+'_') + '=' + svn
You can also create more specific subparsers for the uuid, path, '@' and trailing integer, instead of just blending them all into one Word expression.
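[Editor note: a sketch of the more specific subparsers mentioned above; the field layout is inferred from the single sample string, so treat it as an assumption:]
from pyparsing import *
uuid = Regex(r'[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}')
path = Word(alphanums + '/_-')
svn = Suppress('svn:') + uuid('uuid') + path('path') + Suppress('@') + Word(nums)('revision')
expr = Word(alphas, alphanums+'_')('key') + Suppress('=') + svn
test = 'convert_revision=svn:9171d42e-b04d-0410-96dc-cb0bc40dcdda/realstore/trunk@2222'
print expr.parseString(test).dump()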
Please help.
To simplify the grammar of my problem, I build some grammars that process matched tokens in setParseAction().
My question: how can I return the results of parsing in the setParseAction, to concatenate to the main parsing results?
Example (pseudocode):
def minorgrammar(toks):
    grammar2 = expression
    for t,s,e in grammar2.scanString(toks[0],maxMatches=1):
        ...
    return (t,s,e)
# main grammar
grammar1 = expression.setParseAction(minorgrammar)
x = grammar1.parseString(data)
I want the final returned parse tree to contain x and the result of the minor grammar.
Also, this procedure will be executed for each line in a text file, and all the results will be sent to one big XML file.
Sorry, I cannot solve the problem in one grammar, only in a cascaded fashion. Also, I cannot post my grammar code, because it is for my security study. Thanks.
Try something like this:
from pyparsing import *
# a hypothetical outer parser, with an unparsed SkipTo element
color = oneOf('red orange yellow green blue purple')
expression = SkipTo('XXX') + 'XXX' + color('color')
data = 'JUNK 100 200 10 XXX green'
print expression.parseString(data).dump()
# main grammar
def minorgrammar(toks):
    # a simple inner grammar
    integer = Word(nums)
    grammar2 = integer('A') + integer('B') + integer('C')
    # use scanString to find the inner grammar
    # (since we just want the first occurrence, we can use next
    # instead of a for loop with a break)
    t,s,e = next(grammar2.scanString(toks[0],maxMatches=1))
    # remove 0'th element from toks
    del toks[0]
    # return a new ParseResults, the sum of t and everything
    # in toks after toks[0] was removed
    return t + toks
grammar1 = expression.setParseAction(minorgrammar)
x = grammar1.parseString(data)
print x.dump()
prints:
['JUNK 100 200 10 ', 'XXX', 'green']
- color: green
['100', '200', '10', 'XXX', 'green']
- A: 100
- B: 200
- C: 10
- color: green
-- Paul
Thanks very very much
I will build on that since there are many setParseAction functions
I will also study how to traverse the resulting tree and rearrange it for each parsed line.
Please, another question:
- I produce the results of both the main grammar and the minor grammar as XML.
- I need to insert the minor XML results as elements in between the main XML tree elements.
- I have two options to produce this result: (a) traverse the asXML result tree and insert the minor grammar XML tree, or (b) produce one XML file for the main grammar tree and another for the minor grammar tree, and finally merge the two XML files by building cross-references.
For now I am testing the first option, but the problem is that the asXML() result cannot be traversed or edited at run time to insert the minor grammar XML tree.
Hi!
I have a conceptual question.
I've been using pyparsing for years, and recently I came across this notion of a Parsing Expression Grammar (PEG), which I understand to be a stricter/deterministic Context-free grammar (CFG).
It seems to me that it boils down to the ability to make an ordered choice, as well as the inclusion of a packrat parser.
If I'm not mistaken, the 7 operators specified in PEG have pyparsing analogues:
- Sequence:
e1 e2 -> e1 + e2
- Ordered Choice:
e1 / e2 -> e1 | e2
- Zero-or-More:
e* -> ZeroOrMore(e)
- One-or-More:
e+ -> OneOrMore(e)
- Optional:
e? -> Optional(e)
- And-predicate:
&e -> FollowedBy(e)
- Not-predicate:
!e -> ~e
And packrat parsing in pyparsing is turned on using:
import pyparsing as pp
pp.ParserElement.enablePackrat()
Would I then be correct in saying that pyparsing can be used to implement PEGs, as long as the above constraints are followed?
Thanks!
Yes, I would say so.
I'm surprised, though, that the canonical PEG definition omits:
- no match -> NoMatch
- any match -> Empty
- unordered choice -> e1 & e2
Also, there are several possible implementations of Ordered Choice:
- match first (MatchFirst)
- match longest (Or)
- match all and select most successful overall parse (not implemented in pyparsing - can be extremely slow)
And I'm not sure that packratting is a necessary part of PEGs, but more like implementations of PEGs lend themselves to supporting packratting.
-- Paul
Thanks for the helpful answer, Paul!
After some research, I believe there are a few more subtleties to PEG (such as greedy matching), and Empty is defined. Unordered choice is intentionally eschewed in PEGs because it can lead to ambiguities. But it does seem that the PEG languages can be entirely represented in pyparsing.
I was merely curious because everyone's talking about PEGs -- I guess one of the big reasons for all the buzz is that they make linear-time packrat parsing possible.
I'm just wondering, how hard would it be to write a visual debugger for pyparsing, sort of in the vein of the Regex Coach?
Has anyone attempted such a thing? Would it just be a matter of programmatically adding color codes (e.g. ANSI, HTML, etc.) around matched tokens using .setParseAction?
How do folks here debug their pyparsing expressions?
Thanks!
Catherine Devlin posted on her blog about pyparsing_helper - you can install it with easy_install.
Thanks for heads up, Paul!
pyparsing_helper is really nice. However, what I had in mind was something that goes a little beyond that and does syntax highlighting for individual parse expressions.
For instance:
var = Word(alphas,alphanums)
cmp = Literal('<=') | Literal('>=')
float = Word(nums) + Optional(Literal('.') + Word(nums))
statement = Optional(float + cmp) + var + cmp + float
results = statement.parseString('2.4 <= x1 <= 4.2')
I'd like to build a tool where if I hover my mouse over the parse expression 'Optional(float + cmp)' in 'statement', it will highlight '2.4 <=' in the input string. Or if I hover over the float, it will highlight '2.4' and '4.2' in the input string.
To do this, there would need to be some mechanism for each matched token to return--upon parsing--a tuple containing its start and end position in the input string (or a list of them in the case of ZeroOrMore/OneOrMore's).
Is this hard to do?
I'm just thinking this would help greatly in debugging ambiguous grammars, where I'm always wondering where a parse expression match stops matching, and the next parse expression starts to match.
What you describe does sound helpful. If I were writing this, I would use scanString, which returns a generator that yields (tokens, start, end) tuples. The debugger would take the selected grammar fragment, run fragment.scanString, and highlight all the start-end regions. (Watch out for embedded tab characters.)
But where would you mouse-hover in that expression for your debugger to detect that you wanted to highlight matches for 'Optional(float + cmp)'? If the mouse was just over the word 'float', would you want every float highlighted? How would you hover over 'float + cmp'? Maybe you would have to actually select a region of the grammar so that your scanner could comprehend when you are trying to match larger pieces of the grammar.
And how would the debugger know that '2.4 <= x1 <= 4.2' was the string in which to highlight the matches? You might have to do something more like pyparsing_helper, where you put input text into a separate panel.
Your project sounds interesting, and maybe pyparsing_helper could be a place to start. I seem to remember from Catherine's blog that, after she wrote this 0.1.0 version, she found another interactive utility that was more general purpose, but did most everything she wanted. So maybe this other utility might be more fully featured in the way you want. Or just write to Catherine directly, and see what she uses now for debugging her pyparsing programs.
HTH, -- Paul
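[Editor note: a minimal sketch of the scanString-based highlighting described above, using ANSI escape codes; the grammar fragment and input are made up for illustration:]
from pyparsing import *
def highlight(fragment, text):
    # collect (start, end) spans for every match of the fragment
    spans = [(s, e) for t, s, e in fragment.scanString(text)]
    out, last = [], 0
    for s, e in spans:
        out.append(text[last:s])
        out.append('\033[7m' + text[s:e] + '\033[0m')   # reverse-video highlight
        last = e
    out.append(text[last:])
    return ''.join(out)
number = Combine(Word(nums) + Optional('.' + Word(nums)))
print highlight(number, '2.4 <= temp <= 4.2')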
I did read on Catherine's blog that she was using reInteract to debug pyparsing. It's kind of a worksheet REPL for Python in general, and it's quite wonderful, but it still doesn't provide the kind of visualization that I envisage.
scanString is a good idea. I have to think about that a bit.
Yes, if I hover over the definition of float, all the float lexemes should be highlighted. But if I highlight over the float contained within the Optional(), only those floats inside that production rule should be highlighted. And yes, the input string will have to be provided to the highlighting code.
I'm wondering if a pseudo-lex-then-parse method may be able to provide more information in this case. Once an input string is tokenized, the highlighter obtains a map from the tokens to the lexemes. So if a user highlights a token, the program will highlight the part of the string that matches the token.
Similarly, the highlighter can match BNF production rules to tokenlists, and when the user highlights a production rule, the part of the string gets highlighted.
Of course, all this can get quite complicated with ZeroOrMore's and OneOrMore's -- so maybe my ideas are out of left field here. But I'm just wondering aloud.
Anyway, I'll leave it at that. I'll think about this some more.
Hi, as this is my first post I want to thank the developer(s) for this great package.
My question refers to the statement 'Multiple functions can be attached to a ParserElement by specifying multiple arguments to setParseAction, or by calling setParseAction multiple times.' in the documentation.
I want to add 2 parse actions to an element. Adding two functions in one call of setParseAction works fine, but doing it one after the other seems to only attach the most recent action to the element. For example:
import pyparsing as PYP
def one():
    print 'one',
def two():
    print 'two',
atom = PYP.Word(PYP.alphas, PYP.alphanums, 1)
atom.setParseAction(one)
atom.setParseAction(two)
test = PYP.delimitedList(atom)
s = 'ji2n, d292, o33, ok3'
test.parseString(s)
returns
two two two two
where I would expect
one two one two one two one two
which is what I get if I add both functions at once:
atom.setParseAction(one, two)
My question is: why is 'one' not printed in the first case? To make this question more general:
Is there a way of accessing the actions currently attached to an element? (for example to change the grammar during execution of a program).
Thanks for any replies, L.
Sorry - the docs should read to 'call addParseAction multiple times', not setParseAction. Will fix.
Whenever I have had to define a dynamic grammar (which changes based on data that has been parsed so far), I use a Forward expression, and then inject new expressions into it at parse time using a parse action. See how this is done in the countedArray
helper.
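[Editor note: a sketch of the countedArray-style technique Paul mentions - a parse action on the leading integer injects a new expression into a Forward at parse time. This is a simplified version; the real countedArray helper also handles a count of zero:]
from pyparsing import *
intExpr = Word(nums).setParseAction(lambda t: int(t[0]))
arrayExpr = Forward()
def defineArray(s, loc, tokens):
    # redefine the Forward to expect exactly tokens[0] words
    arrayExpr << And([Word(alphas)] * tokens[0])
countedList = intExpr.copy().addParseAction(defineArray) + arrayExpr
print countedList.parseString('3 ab cd ef')
# -> [3, 'ab', 'cd', 'ef']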
Thanks for the super fast reply, it works. Are there further ways of viewing, or removing the actions currently attached to an element, or is cumulative adding the way of using actions?
Thanks again for this great package!
Look at the expression's parseAction attribute - it's just a regular list, so you can pop or add functions to it (but this will make your code more obscure when you need to maintain it later). You can also clear the list using setParseAction(None).
Welcome to pyparsing, glad you are enjoying it!
Take the example of ; if I run boolExpr.validate(), it reports that a left recursion is found.
Is this correct?
Thanks.
I just replicated your results using a simple 4-function arithmetic expression. It's probably a bug in validate(), since I know that operatorPrecedence expressions do run successfully without recursing infinitely.
Thanks, -- Paul
Hi,
I run into the following situation:
A production called 'expression' is built on top of a more basic production called, 'element.'
For example:
expression = element + OneOrMore(oneOf('* /') + element) | element
My problem is that I want the production 'element' to have different definitions under different context.
For example, let's say the syntax '.variable' is not a valid element under normal context, but is valid within a WITH block. So:
a * b #is a valid expression
a * .member # is not a valid expression
WITH my_struct
a * .member #is a valid expression
END
One way to do this is to copy and paste every single production that depend on 'element,' make a new production 'element2' that fits the new context and build the production expression2 that depends only on 'element2.' But that is extremely verbose and error-prone. I wonder if there is a way to reuse similar productions under different contexts.
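[Editor note: no reply was recorded for this question. One common approach is a factory function that builds the dependent productions for a given 'element', so each context reuses the same definitions; a sketch with made-up names:]
from pyparsing import *
def make_expression(element):
    # build all the productions that depend on 'element'
    return element + ZeroOrMore(oneOf('* /') + element)
identifier = Word(alphas, alphanums + '_')
member_ref = Combine('.' + identifier)
expression = make_expression(identifier)                    # normal context
with_expression = make_expression(identifier | member_ref)  # inside WITH blocks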
Hi,
this question is about the nesting depth of parse results - in particular results obtained from an operator precedence element.
I want to pass a parsed logical expression (usual operators and, or, not) as an argument to a function. I have noticed that if the logical expression contains at least one operator, then the parse result is nested twice, but otherwise only once.
For example:
import pyparsing as PYP
And = PYP.CaselessKeyword('and')
Not = PYP.CaselessKeyword('not')
Or = PYP.CaselessKeyword('or')
Expression = PYP.operatorPrecedence(PYP.Literal('a'), [(Not, 1, PYP.opAssoc.RIGHT),
(And, 2, PYP.opAssoc.LEFT),
(Or, 2, PYP.opAssoc.LEFT)])
print Expression.parseString('a or a and a').asList()
print Expression.parseString('a').asList()
returns
>> [['a', 'or', ['a', 'and', 'a']]]
>> ['a']
where I would expect to see
>> [['a']]
If I want to pass the parsed expression to a function, I will have to pop() once in the first case. In the second case I would pass the results as they are.
My question is: Is this difference intended, and if so what is the advantage?
A quick fix in my grammar would of course be
FixedExpression = PYP.Group(PYP.Literal('a')) ^ Expression
Thanks for any info, Leevi.
Unfortunately, if you retained the nesting in operator precedence, then every atomic operand would be buried inside a nested list as deep as the number of levels defined in the precedence list. Rather than wade through these nestings, you might be better off using parse actions to construct a hierarchy of evaluatable objects - see how this is done in the SimpleBool.py example. The values in any binary operation are (value, operator, value, operator, etc.), where value itself can be an atomic operand, or a nested object. Write back if you want to discuss this in more detail.
-- Paul
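[Editor note: a condensed sketch of the SimpleBool.py approach referred to above - each precedence level attaches a class as its parse action, so parsing builds evaluatable objects instead of nested lists. The class names here are illustrative:]
from pyparsing import *
class BoolAnd(object):
    def __init__(self, tokens):
        # tokens[0] is [operand, 'and', operand, 'and', ...]
        self.args = tokens[0][0::2]
    def __nonzero__(self):
        return all(bool(a) for a in self.args)
class BoolOr(BoolAnd):
    def __nonzero__(self):
        return any(bool(a) for a in self.args)
TRUE = Keyword('True').setParseAction(lambda: [True])
FALSE = Keyword('False').setParseAction(lambda: [False])
boolExpr = operatorPrecedence(TRUE | FALSE, [
    (CaselessKeyword('and'), 2, opAssoc.LEFT, BoolAnd),
    (CaselessKeyword('or'),  2, opAssoc.LEFT, BoolOr),
])
print bool(boolExpr.parseString('True and False or True')[0])   # -> True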
Hi Paul,
thanks for the advice. I spent most of today rewriting my program. I adapted it to your example for Boolean formulas. It works fine and the code is much clearer.
My task here was to parse a string into a function. Just like your Boolean example, but without having fixed values for the atoms. Rather I can call the parsed object with a value assignment. Also, I am not evaluating Boolean expression but my own predicates that themselves consist of a mini-language.
I devised a three-step procedure: (1) parse the input string, (2) initialize the parsed object with a Model that determines how my custom predicates are to be interpreted, and (3) return a callable function. Now I can test whether a (parameter) assignment is compatible with a model.
Thanks again for your advice. Especially attaching objects as parse actions turned out clean and readable.
Till then, Hannes
Btw: My hope is that, when I am finished with this project I might appear on your Pyparsing Examples page.
Hi, I've been working on a parser for nested expressions that are then translated according to external specifications. I use the pyparsing-based parser to produce a document tree and then construct a hierarchy of various objects from it. The parser has been tremendously helpful so far, but now I have the following problem: I want each ParseResults object in the tree to carry its line number, so that when an error during the construction of the hierarchy occurs, I can easily report it.
I tried using parse actions to insert the line number into the passed token list (such as tokens.lineNumber = pyparsing.lineno(loc, string), and then returning tokens - but I tried dictionary access as well). It might work when I print the line number using a following parse action, but when I try it on the final results from parseString, there is nothing stored (when retrieving tokens.lineNumber, I get an empty string).
Could anybody show me a way to accomplish this (a clean way, preferably)? Thanks, Jan
Not sure why using dictionary-style access to add the location to the tokens didn't work for you. Here is an example, looking through the classic 'lorem ipsum' text for words starting with a vowel.
from pyparsing import *
text = '''Lorem ipsum dolor sit amet, consectetur adipisicing
elit, sed do eiusmod tempor incididunt ut labore et dolore magna
aliqua. Ut enim ad minim veniam, quis nostrud exercitation
ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis
aute irure dolor in reprehenderit in voluptate velit esse cillum
dolore eu fugiat nulla pariatur. Excepteur sint occaecat
cupidatat non proident, sunt in culpa qui officia deserunt
mollit anim id est laborum'''
# find all words beginning with a vowel
vowels = 'aeiouAEIOU'
initialVowelWord = Word(vowels,alphas)
# Unfortunately, searchString will advance character by character through
# the input text, so it will detect that the initial 'Lorem' is not an
# initialVowelWord, but then it will test 'orem' and think that it is. So
# we need to add a do-nothing term that will match the words that start with
# consonants, but we will just throw them away when we match them. The key is
# that, in having been matched, the parser will skip over them entirely when
# looking for initialVowelWords.
consonants = ''.join(c for c in alphas if c not in vowels)
initialConsWord = Word(consonants, alphas).suppress()
# add parse action to store the current location in the parsed tokens
# (you said you tried this, not sure why it didn't work for you)
def addLocnToTokens(s,l,t):
    t['locn'] = l
    t['word'] = t[0]
initialVowelWord.setParseAction(addLocnToTokens)
for ivowelInfo in (initialConsWord | initialVowelWord).searchString(text):
    if not ivowelInfo:
        continue
    print ivowelInfo.locn, ':', ivowelInfo.word
The parse action addLocnToTokens embellishes the parsed tokens with new results names 'locn' and 'word'.
Alternatively, you can define your own placeholder using an Empty, and add any kind of behavior to it you want, such as in this case, saving the current parse location:
# alternative - add an Empty that will save the current location
def location(name):
    return Empty().setParseAction(lambda s,l,t: t.__setitem__(name,l))
locateInitialVowels = location('locn') + initialVowelWord('word')
# search through the input text
for ivowelInfo in (initialConsWord | locateInitialVowels).searchString(text):
    if not ivowelInfo:
        continue
    print ivowelInfo.locn, ':', ivowelInfo.word
This will give the same results as the previous example.
Good luck!
Thanks a lot! Problem solved. It was actually quite a small thing that confused me - I expected the ParseResults objects to behave symmetrically, i.e. when I set something as an attribute or a dictionary value, it can be retrieved the same way. So, thanks again. Pyparsing is a great library - it saved me a lot of time and is as intuitive and Pythonic as it can be. Yours, Jan
Great, thanks for the props! Good luck with your parser!
-- Paul
Hi,
I have some problems with datetime: I want to parse a string into an actual datetime string, but it doesn't work the way I want it to.
def generateDateString(tokens):
    try:
        # example: Thu Jun 14
        tokens[0] = datetime.datetime.strptime(tokens[0], '%a %b %d')
    except ValueError, ve:
        raise ParseException('Invalid date string (%s)' % tokens[0])
date.setParseAction(generateDateString)
If I print the results as XML it is displayed as datetime object, but if I want to work with that datetime object, I get the error message: Attribute error: 'str' object has no attribute 'strftime'
Change this line:
tokens[0] = datetime.datetime.strptime(tokens[0], '%a %b %d')
to:
return datetime.datetime.strptime(tokens[0], '%a %b %d')
You didn't post the other parts of your parser, so I don't know how you are using results names to access the datetime object.
Hi,
I'm very new to pyparsing and have not found an example which solves the following issue: I have created a logging mechanism which runs as a cron job every 60 seconds and logs some important Linux system values, like 'date', 'meminfo' and 'loadavg'. Please see the sample string below in my code snippet.
from pyparsing import *
sample = '''
Date:
Thu Sep 6 22:15:01 CEST 2012
who -r:
run-level 3 Sep 6 21:59 last=S
/proc/meminfo:
MemTotal: 12191888 kB
MemFree: 11558472 kB
Buffers: 13068 kB
Cached: 218592 kB
SwapCached: 0 kB
Active: 114388 kB
Inactive: 174244 kB
Active(anon): 74500 kB
Inactive(anon): 12764 kB
Active(file): 39888 kB
Inactive(file): 161480 kB
Unevictable: 14148 kB
Mlocked: 14148 kB
SwapTotal: 16777208 kB
SwapFree: 16777208 kB
Dirty: 640 kB
Writeback: 0 kB
AnonPages: 71064 kB
Mapped: 37944 kB
Shmem: 21020 kB
Slab: 249580 kB
SReclaimable: 14708 kB
SUnreclaim: 234872 kB
KernelStack: 2720 kB
PageTables: 7880 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 22873152 kB
Committed_AS: 381536 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 316004 kB
VmallocChunk: 34359408868 kB
HardwareCorrupted: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 7680 kB
DirectMap2M: 12566528 kB
/proc/loadavg:
0.71 0.52 0.34 2/339 3794
Date:
Thu Sep 6 22:16:01 CEST 2012
who -r:
run-level 3 Sep 6 21:59 last=S
/proc/meminfo:
MemTotal: 12191888 kB
MemFree: 11502920 kB
Buffers: 19296 kB
Cached: 257780 kB
SwapCached: 0 kB
Active: 151340 kB
Inactive: 183008 kB
Active(anon): 74792 kB
Inactive(anon): 12764 kB
Active(file): 76548 kB
Inactive(file): 170244 kB
Unevictable: 14148 kB
Mlocked: 14148 kB
SwapTotal: 16777208 kB
SwapFree: 16777208 kB
Dirty: 160 kB
Writeback: 0 kB
AnonPages: 71328 kB
Mapped: 38248 kB
Shmem: 21020 kB
Slab: 259952 kB
SReclaimable: 24656 kB
SUnreclaim: 235296 kB
KernelStack: 2720 kB
PageTables: 7784 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 22873152 kB
Committed_AS: 379928 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 316004 kB
VmallocChunk: 34359408868 kB
HardwareCorrupted: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 7680 kB
DirectMap2M: 12566528 kB
/proc/loadavg:
0.68 0.54 0.35 1/336 8232
Date:
Thu Sep 6 22:17:01 CEST 2012
who -r:
run-level 3 Sep 6 21:59 last=S
/proc/meminfo:
MemTotal: 12191888 kB
MemFree: 11499912 kB
Buffers: 19500 kB
Cached: 259832 kB
SwapCached: 0 kB
Active: 152120 kB
Inactive: 185172 kB
Active(anon): 75480 kB
Inactive(anon): 12764 kB
Active(file): 76640 kB
Inactive(file): 172408 kB
Unevictable: 14148 kB
Mlocked: 14148 kB
SwapTotal: 16777208 kB
SwapFree: 16777208 kB
Dirty: 208 kB
Writeback: 0 kB
AnonPages: 71680 kB
Mapped: 39428 kB
Shmem: 21020 kB
Slab: 259924 kB
SReclaimable: 24700 kB
SUnreclaim: 235224 kB
KernelStack: 2728 kB
PageTables: 7564 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 22873152 kB
Committed_AS: 381292 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 316004 kB
VmallocChunk: 34359408868 kB
HardwareCorrupted: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 7680 kB
DirectMap2M: 12566528 kB
/proc/loadavg:
0.52 0.52 0.36 1/337 12341
'''
# macros
integer = Word(nums)
#date
weekdays = 'Mon Thu Wed Tue Fri Sat Sun'
months = 'Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec'
date = Group(Literal ('Date:').suppress() + \
oneOf(weekdays).suppress() + \
oneOf(months)('month') + \
integer('day') + \
Combine(integer + ':' + integer + ':' + integer)('time') + \
Word(alphas).suppress()('timezone') + \
integer('year') \
)('date')
#who
who = Group(Literal('who -r:').suppress() + \
Word(alphas+'-').suppress() + \
integer('runlevel') + \
oneOf(months)('month') + \
integer('day') + \
Combine(integer + ':' + integer)('time') + \
Suppress('last=') + \
oneOf('0 1 2 3 4 5 6 S')('prerunlevel') \
)('who')
#meminfo
meminfo = Group(Literal('/proc/meminfo:').suppress() + \
Dict(OneOrMore(Group(Word(alphanums+'('+')'+'_') + Suppress(':') + Combine(integer + Optional(' kB'))))) \
)('meminfo')
#loadavg
loadavg = Group(Literal('/proc/loadavg:').suppress() + \
OneOrMore(Word(nums+'.'+'/')) \
)('loadavg')
record = Forward()
#record << date
#record << who + meminfo + loadavg
record << date + loadavg
# or: record << date + meminfo
# or: record << date + loadavg + meminfo
# parse input string
records=record.searchString( sample )
print 'Number of records:',
print len(records)
print records
Each logging block starts with a special 'Keyword'. E.g. the date section starts with 'Date:' and the meminfo section starts with '/proc/meminfo:'.
Now my goal is to parse this logfile and create an array of records, each of which includes (one or more) interesting system values for a certain date. Each of the four sections (date, who, meminfo, loadavg) parses perfectly alone. But I have problems if I want to combine them. That is: if I combine date + loadavg (to see the loadavg values for each timestamp), it doesn't find any results anymore.
Has somebody solved a similar problem in the past and could point me to a code example? Btw, the logfile structure is fixed and should be treated as given, because I have already been running it on many systems for a long time.
Thank you very much in advance!
Regards, Uwe
I forgot to mention: My sample logfile ('sample') contains 3 of these log sets collected in 3 minutes and is only a short example.
I have probably found a solution, or rather a good workaround... ;-) If I define the pyparsing grammar as follows, I get the desired output:
# record all values
record << date + SkipTo(who).suppress() + who + SkipTo(meminfo).suppress() + meminfo + SkipTo(loadavg).suppress() + loadavg
or
# record only date+meminfo
record << date + SkipTo(meminfo).suppress() + meminfo
Uwe
I have a small testcase that fails to parse:
`timescale 100 ps/100 ps
module mb1_uA ();
wire abc ;
wire a;
xyz_top u_xyz_top (
.a(a),
.\abc (abc )
);
Note that there is a portname that is an escaped identifier. I have real RTL like this that was produced by a commercial EDA tool.
Parsing produces this:
Exception raised:Expected ')' (at char 91), (line:7, col:8)
I looked into the verilog BNF and from what I can tell it looks OK. The named port convention looks for an identifier as the port name
port = portExpr | Group( ( '.' + identifier + '(' + portExpr + ')' ) )
and identifier includes in part escaped identifiers
identifier2 = Regex(r'\\\S+').setParseAction(lambda t:t[0][1:]).setName('escapedIdent')
That's about the extent of what I can tell, I'm new to python. Any suggestions?
thanks, --steve
The escaped identifier is not the problem. The problem is this definition in the parser:
inst_args = Group( '(' + (delimitedList( modulePortConnection ) |
delimitedList( namedPortConnection )) + ')').setName('inst_args')
should be changed to:
inst_args = Group( '(' + (delimitedList( namedPortConnection ) |
delimitedList( modulePortConnection )) + ')').setName('inst_args')
Also, your input file is missing a terminating endmodule.
Thanks for reporting this, I'll update the next released version!
-- Paul
Thanks Paul. In my real example there are many ports that are connected and the error pointed to the one with the escaped identifier. And, when I modified the port name to get rid of the escape it worked correctly, hence my assumption that it was the problem. Anyway, I put in your fix and it's working fine now. Many thanks, I hope to get a lot of good use from this parser.
In my pyparsing grammar definition, there are some expressions which will match strings that span multiple lines. If I use the API like:
PyGrammar.parseString(open('file_name').read())
It will behave in the correct way.
However if I want to use the iterator to read the file like
with open('file_name') as f:
    for line in f:
        PyGrammar.parseString(line)
the parser will break
Is there a way to work around this case? Thanks...
No, pyparsing must have the full source string read into a local string variable for it to parse.
If your top-level grammar looks something like OneOrMore(expr), and a single ParseResults containing all the expr results is slow to create or too large in memory, you could switch from using parseString to using scanString with the repeated expression. That is, convert OneOrMore(expr).parseString(inputstring) to expr.scanString(inputString).
scanString returns a generator that gives the matched tokens, start, and end location of each match. Perhaps this will help address what I assume are memory issues.
-- Paul
Hi Paul, it is really nice of you to provide such a prompt reply. Based on your reply, can I say that the second way, where I pass each line to the pyparsing grammar, is wrong?
Currently I use pyparsing to parse some log files, which are stuffed with irrelevant info that I just want to ignore, so I created my grammar structure to look like this:
ZeroOrMore(SkipTo((expr1 | expr2 | expr3 |....).setParseAction(my_call_back_function),include=True))
In my_call_back_function I will generate corresponding objects, store them in a db, and delete the ParseResults like
del tokens
So I will not be using the big ParseResults returned from parseString. I guess scanString will perform a similar function in this case.
My concern with using open('file_name').read() is that I guess Python will load the entire file into memory, which exceeds 200MB. It consumes too much memory in this case. This is especially true if I intend to run multiple pyparsing parsers together. Can you enlighten me on this? BTW, I am not sure if I made myself clear, and I am not sure if I am structuring the grammar correctly either. Sorry for my poor English.
I think you are in a good place to just switch over to using scanString. Define your grammar as:
grammar = (expr1 | expr2 | expr3).setParseAction(my_call_back_function)
for tokens, start, end in grammar.scanString(sourceText):
    # do something with tokens
    # no need to even call del tokens here, because the
    # name will be rebound on the next iteration
Do you actually use the text that gets skipped over? Let me know, and I'll show you how to get at it when using scanString instead of parseString.
-- Paul
Yes, I tried scanString and it works awesomely. And I figured out how to print the skipped-over text as well. BTW, I have another lingering question: you mentioned that the parser needs to see all the source text before parsing. But I want to use pyparsing in a real-time case, where I cannot get all the text at one time.
Or, if you prefer, I can open another post for the above question.
I've thought about rewriting pyparsing to accept an input stream, but it will be a pretty radical change. So, unfortunately, for the foreseeable future, you'll have to pass to pyparsing a complete string of data to parse.
What you could do is wrap your own code around a call to scanString, something like this:
# set up a generator to yield a line of text at a time
linegenerator = open('big_hairy_file.txt')
# buffer will accumulate lines until a fully parseable piece is found
buffer = ''
for line in linegenerator:
    buffer += line
    match = next(grammar.scanString(buffer), None)
    while match:
        tokens, start, end = match
        print tokens.asList()
        buffer = buffer[end:]
        match = next(grammar.scanString(buffer), None)
Write back and let me know how that works out.
-- Paul
Looks like a solution. I will test it out and let you know. But now I have another question, and I opened a new post for that.
I have some code that looks like
parser = (exper1 + ( exper2 | expre3| ...))
data = open('file_name').read()
for tokens, start, end in parser.scanString(data):
    print 'token: %s\n' % tokens.dump()
    print 'start: %d \n' % start
    print 'end: %d \n' % end
    print 'match: %s\n' % data[start:end]
The function returns the correct matched tokens, but the start and end positions are wrong; they are shifted forward by some number. Is there a bug here?
Sorry, forgot to mention that there is a lot of unmatched text before the first match is found. If I delete some unmatched text in the file, the parser works fine. But with a large amount of unmatched text present, the start and end do not align well.
I have used restOfLine in the expr; it may cause this problem. I will dig some more.
Problem found!! The log files contain both spaces and tabs, which causes the location error. But help is still needed.
Oops, I figured it out by looking at the source code, which comes with good documentation. Calling parseWithTabs did the trick. Thanks Paul.
Terrific! I'm glad my docs were helpful here. If you download the full source .zip from Sourceforge, you'll find a full htmldoc of all classes and helper methods in pyparsing, and a HowToUsePyparsing.html. The htmldoc is also available online at
Good luck in your pyparsings, -- Paul
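[Editor note: a small demonstration of the tab issue resolved above. By default, pyparsing expands tabs before parsing, so the reported locations index into the expanded string, not the original; parseWithTabs() suppresses the expansion:]
from pyparsing import *
data = 'junk\tjunk\tMATCH more'
expr = Literal('MATCH')
for t, s, e in expr.scanString(data):
    print 'without parseWithTabs:', s, repr(data[s:e])   # location is off
for t, s, e in expr.parseWithTabs().scanString(data):
    print 'with parseWithTabs:   ', s, repr(data[s:e])   # 10 'MATCH'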
Hi Paul, I'm using Reinteract to debug my parser and I've got a situation where parseString returns a result but scanString doesn't. I am parsing a simple string with a (document) revision and title. Here is (hopefully) the BNF:
#<revision> ::= <whitespace>+
# ['Rev'['.'] | 'REV'['.']]
# <alpha_rev> | <alpha_rev><alpha_rev>
# [<num_rev> | <num_rev><num_rev>]
# |
# <space>+
# <alpha_rev> | <alpha_rev><alpha_rev>
# [<num_rev> | <num_rev><num_rev>] |
#
#<whitespace> ::= ' |\t|\r|\n'
#<space> ::= ' '
#<alpha_rev> ::= 'a|...|z|A|...|Z'
#<num_rev> ::= '1|...|9'
Here is a cut and paste from Reinteract:
alpha_revs = pyp.Word(pyp.alphas, min=1, max=2)('alpha_revs')
num_revs = pyp.Word('123456789', min=1, max=2)('num_revs')
space = pyp.White(ws=' ').setName('spaces')('spaces')
space = space.suppress()
revisionExpr = (
pyp.StringStart().leaveWhitespace() +
pyp.White().suppress() +
pyp.oneOf('rev rev.', caseless=True).suppress() +
pyp.Combine(alpha_revs +
pyp.Optional(num_revs)('rev'))
|
pyp.StringStart().leaveWhitespace() +
space +
pyp.Combine(alpha_revs +
pyp.Optional(num_revs)('rev'))
)
revisionTokens = revisionExpr.parseWithTabs().parseString(rev_string)
for match_str, start, end in (
        revisionExpr.parseWithTabs().scanString(rev_string, maxMatches=1)):
    print match_str
print match_str
I'll show two examples with the input string and the results
rev_string = ' Rev. K This is the title'
parseString: ['K']
scanString: NameError: name 'match_str' is not defined
rev_string = ' A This is the title'
parseString: ['A']
scanString: NameError: name 'match_str' is not defined
The input strings are not well formatted, people have been very creative and not every document has a revision listed. The only reliable way I can see to not grab an incorrect rev string is to require that it be at the beginning of the input (where it should be). Without that, I had situations where it was finding part of the title instead. It is a safe assumption that the revision is preceded by a space if it is just a alphanumeric string.
I don't understand why scanString isn't giving me the same result?
Thanks, Eric
Here is some annotated Python code, walking through your expression creation and parsing steps:
# parsing a revision and a description
from pyparsing import *
numeric_rev = Word(nums)
alpha_rev = Word(alphas, max=2)
rev_label = CaselessLiteral('Rev') + Optional('.')
revision_expr = rev_label + (numeric_rev | alpha_rev) + restOfLine
rev_string = ' Rev. K This is the title'
print revision_expr.parseString(rev_string)
# prints
# ['Rev', '.', 'K', ' This is the title']
# I don't like the leading whitespace, but restOfLine doesn't skip
# whitespace. An Empty *does* skip whitespace, and returns nothing.
revision_expr = rev_label + (numeric_rev | alpha_rev) + Empty() + restOfLine
print revision_expr.parseString(rev_string)
# Now we get
# ['Rev', '.', 'K', 'This is the title']
# If we wanted, we could just look at results[2] and results[3]
# to get the interesting fields, but this is error prone - what if
# the optional '.' is left out? We *could* use results[-2] and
# results[-1], using negative indexes to read from the right
# instead. But I recommend you get used to naming the results, like
# this:
revision_expr = (rev_label + (numeric_rev | alpha_rev)('revision') +
Empty() + restOfLine('description'))
results = revision_expr.parseString(rev_string)
print results
# Hmm, we still get:
# ['Rev', '.', 'K', 'This is the title']
# what's the difference?
print results.dump()
# By calling dump(), we see not only the list of tokens, but
# any named fields:
# ['Rev', '.', 'K', 'This is the title']
# - description: This is the title
# - revision: K
# How do you get just the named fields? Access them as if they
# were values in a dict, or attributes on an object (for attribute
# access to work, the results name has to be a valid Python
# identifier)
print results['revision']
print results.description
-- Paul
Hi Paul, thanks very much for the detailed answer. I need to work on this a bit more (it is a nighttime project), since I think I'm stuck with considering whitespace. Some of the strings don't have revisions in them, but they do have two-letter strings in the title ('WI') which match in your example.
I do have one quick(?) question, this sequence works:
alpha_rev = pyp.Word(pyp.alphas, max=2)
num_rev = pyp.Word('123456789', max=2)
space = pyp.White(ws=' ').suppress()
revisionExpr = (
pyp.StringStart().leaveWhitespace() +
space +
pyp.Combine(alpha_rev +
pyp.Optional(num_rev)('rev'))
)
rev_string = ' K This is the title'
for match_str, start, end in (
        revisionExpr.scanString(rev_string, maxMatches=1)):
    print match_str
['K']
Adding a second parse expression causes the scanString to fail:
revisionExpr = (
pyp.StringStart().leaveWhitespace() +
space +
pyp.Combine(alpha_rev +
pyp.Optional(num_rev)('rev'))
|
pyp.CaselessLiteral('Rev') + pyp.Optional('.') +
pyp.Combine(alpha_rev +
pyp.Optional(num_rev)('rev'))
)
I thought the '|' would still allow the first expression to match even if the second one fails?
Changing the rev_string:
rev_string = ' Rev. K This is the title'
Matches with the last revisionExpr even though the first expression fails on its own? scanString hates StringStart?
If this is getting too convoluted I can probably just work with parseString since it keeps working.
Thanks again, Eric
IGNORE = pp.Group(pp.ZeroOrMore(
pp.Or(map(pp.Literal, ['TIDAL Scheduler','Dependency Cross Reference', 'Job Name', 'Printed']))
+ pp.SkipTo(pp.LineEnd())
)).suppress()
END = pp.Literal('** End of Report **') + pp.SkipTo(pp.LineEnd())
LBRACKET = pp.Keyword('[ ', identChars='[')
RBRACKET = pp.Group(pp.Optional(pp.LineStart()).suppress() + pp.Keyword(']', identChars=']') + pp.Optional(pp.LineEnd()).suppress())
CRAP = pp.Group('[t' + pp.Optional(pp.oneOf('- +') + pp.Optional(pp.LineEnd()).suppress() + pp.oneOf('1 2 3 4 5 6 7 8 9')) + ']')
NAMEPART = IGNORE + pp.OneOrMore(pp.Word(pp.printables, excludeChars='[]')) + IGNORE
NAME = pp.Combine(NAMEPART + pp.Optional(CRAP + NAMEPART), adjacent=False, joinString=' ')
JOB = NAME + pp.FollowedBy(LBRACKET)
GROUP = pp.Group(LBRACKET + pp.Optional(pp.LineEnd()).suppress() + NAME + pp.Optional(pp.LineEnd()).suppress() + RBRACKET)
JOBNAME = pp.Combine(JOB + GROUP, adjacent=False, joinString=' ')
INDENT = (pp.Literal(' ') * 36).leaveWhitespace().suppress() + pp.FollowedBy(JOBNAME)
JOBDEP = INDENT + JOBNAME
JOBTREE = JOBNAME.setResultsName('job') + pp.Group(pp.ZeroOrMore(JOBDEP)).setResultsName('deps')
here's some sample data
In [47]: print test2
BAR [
BAZ:\BLAH\BLAH ]
Printed 10/12/2012 11:36 AM Page 386
Job Name Direct Dependents Indirect Dependents
FOOBAR [
BLAH:BLAH\BLAH\BLAH_
]
I can parse out the job dep correctly when using that expression individually:
list(JOBDEP.scanString(test2))
[((['FOOBAR [ ...'], {}), ...)]
also just using the JOBNAME expression gives me the 'correct' results, but the JOBTREE expression fails and instead gives me 2 'job's instead of a job and a jobdep...
can someone help me fix this?
I should note that the 'JOBTREE' expression works everywhere else in my dataset except for the case where the page number and column headings follow right after a job definition line...
Tried to use indentedBlock, but had difficulty interpreting the function definition, also some jobs don't have dependencies.
In the book 'Getting Started with Pyparsing', please give an explanation and examples of the following:
page 17
'Parse actions can also be used to perform additional validation checks, such as testing whether a matched word exists in a list of valid words, and raising a ParseException if not. Parse actions can also return a constructed list or application object, essentially compiling the input text into a series of executable or callable user objects. Parse actions can be a powerful tool when designing a parser with pyparsing.'
Here is a simple parser to match a date string of the form 'YYYY/MM/DD', and return it as a datetime, or raise an exception if not a valid date.
from datetime import datetime
from pyparsing import *
# define an integer string, and a parse action to convert it
# to an integer at parse time
integer = Word(nums)
def convertToInt(tokens):
    return int(tokens[0])
integer.setParseAction(convertToInt)
# or can be written as one line as
#integer = Word(nums).setParseAction(lambda t: int(t[0]))
# define a pattern for a year/month/day date
date = integer('year') + '/' + integer('month') + '/' + integer('day')
def convertToDatetime(s,loc,tokens):
    try:
        return datetime(tokens.year, tokens.month, tokens.day)
    except Exception as ve:
        errmsg = "'%d/%d/%d' is not a valid date, %s" % \
            (tokens.year, tokens.month, tokens.day, ve)
        raise ParseException(s, loc, errmsg)
date.setParseAction(convertToDatetime)
def test(s):
    try:
        print date.parseString(s)
    except ParseException as pe:
        print pe
test('2000/1/1')
test('2000/13/1') # invalid month
test('1900/2/29') # 1900 was not a leap year
test('2000/2/29') # but 2000 was
prints
[datetime.datetime(2000, 1, 1, 0, 0)]
'2000/13/1' is not a valid date, month must be in 1..12 (at char 0), (line:1, col:1)
'1900/2/29' is not a valid date, day is out of range for month (at char 0), (line:1, col:1)
[datetime.datetime(2000, 2, 29, 0, 0)]
Thanks Paul. I had hoped to have each sentence answered separately. Anyway, the most important one is: 'compiling the input text into a series of executable or callable user objects.'
This is done in the online example SimpleBool.py - the classes are used in the expressions for operatorPrecedence, but that is the same as calling setParseAction on each intermediate expression that operatorPrecedence creates for each level of operations.
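[Editor note: a tiny sketch of 'compiling the input text into callable user objects' - the parse action returns an object built from the tokens instead of the matched strings. The Assignment class and the input here are made up for illustration:]
from pyparsing import *
class Assignment(object):
    def __init__(self, tokens):
        self.name = tokens['name']
        self.value = int(tokens['value'])
    def __call__(self, env):
        env[self.name] = self.value
stmt = Word(alphas)('name') + Suppress('=') + Word(nums)('value')
stmt.setParseAction(Assignment)
env = {}
for cmd, s, e in stmt.scanString('x = 5\ny = 7'):
    cmd[0](env)   # each match is now a callable object
print env   # -> {'x': 5, 'y': 7} (in some order)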
Thanks very much. The ideas in these discussions are very important. It would be good to fold these ideas into the documentation, or to collect them into a tutorial.
Hi, I wonder if someone can help me. I have scenarios where I need to parse and grab the description field (the word description + ':' plus one or more returns and then multiple lines of description that can contain anything).
First scenario is just to be able to parse a single record like this input string:
product: soap
description:
foo
bar
Second scenario (input has multiple records and these are the last two records):
product: soap
description:
blah blah
foo foo
bar bar
product: towel
description:
blah blah
abc !@#%&foo 1234
abc !@#%&foo 1234
abc !@#%&foo 1234
<end-of-file>
I'm trying to do something like this to get it into a dictionary:
for prodDict in prodParser.prodDef.searchString(prodFile):
prodResults.append(prodDict)
The problem I'm having is that I can't get the description (a variable number of lines). How is that done? If one record is followed by another record, I think the approach is to SkipTo the next line beginning with 'product:'. But the last record would not have another line beginning with 'product:', and it's the same when I'm given a single record. I can't just say Regex('.*') for any number of lines after matching 'description:'.
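[Editor note: no reply was recorded for this one. A sketch of one way to do it - SkipTo with an alternative end marker, either the next 'product:' line or the end of the input; the names are illustrative, reusing the poster's prodFile variable:]
from pyparsing import *
product = Suppress('product:') + restOfLine('product')
description = (Suppress('description:') +
               SkipTo(Literal('product:') | StringEnd())('description'))
record = Group(product + description)
for rec in record.searchString(prodFile):
    print rec[0]['product'].strip(), '->', rec[0]['description'].strip()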
Can we simulate the caret and dollar sign functions of regular expressions in a pyparsing expression, without using Regex?
Thanks, I have found StringStart() and StringEnd().
Hi, I am looking at this code which is close to what I want to do.
If I have something unexpected in the test string, is there a way to output a message saying something like 'Unrecognized syntax S at line X'?
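[Editor note: no reply was recorded here. ParseException carries the failure location, so a message like that can be produced in the except clause; 'grammar' and 'test_string' below are placeholders:]
from pyparsing import ParseException
try:
    results = grammar.parseString(test_string)
except ParseException as pe:
    print "Unrecognized syntax at line %d, column %d: %r" % (pe.lineno, pe.col, pe.line)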
Hi
I have a text file consisting of multiple lines. I would like to write an expression which would work like so:
Given a set of line numbers, match only at those lines.
Can you help?
Cheers, kmbt
This is easily done by using a parse action, and pyparsing's lineno (line number) method (the sample text below includes the line number as part of its content, but this is just to illustrate the output in the matched tokens, it is not used in the filtering condition):
from pyparsing import Word, nums, alphas, lineno, ParseException
text = '''\
1 some text
2 some more
3 blah blah
4 lorem ipsem
5 the end'''
line_pattern = Word(nums) + Word(alphas) + Word(alphas)
print '\nshow all matches'
for line_data in line_pattern.searchString(text):
    print line_data
desired_lines = set([2,3,4])
def only_match_on_desired_lines(s,locn,tokens):
    if lineno(locn,s) not in desired_lines:
        raise ParseException(s,locn,'not one of the desired lines')
line_pattern.setParseAction(only_match_on_desired_lines)
print '\nmatch only desired_lines'
for line_data in line_pattern.searchString(text):
    print line_data
prints
show all matches
['1', 'some', 'text']
['2', 'some', 'more']
['3', 'blah', 'blah']
['4', 'lorem', 'ipsem']
['5', 'the', 'end']
match only desired_lines
['2', 'some', 'more']
['3', 'blah', 'blah']
['4', 'lorem', 'ipsem']
The pattern will match all of the lines, but the parse action will apply the additional filter of the particular line numbers that you want.
-- Paul
Thank you. Your solution was very helpful to me.
Here is my code so far:
I am writing a DSL with support for following:
- variables: all of them begin with v_
- Unary operators: +, -
- Binary operators: +,-,*,/,%
- Constant numbers
- Functions, like normal functions. They need to have this behaviour: foo(v_1+v_2) = foo(v_1) + foo(v_2). It should be the case for any binary operation.
I am able to get to point 4, but I am not able to understand how to make point 5 happen. I need some help.
I have asked the same question at as well; it is formatted better there. Hope you guys do not mind.
See my reply on SO.
-- Paul
I am trying to install pyparsing under Windows from buildout, and I am getting an error all the time: sqget_dist\download An error occurred when trying to install pyparsing 1.5.6. Look above this message for any errors that were output by easy_install. While: Installing django. Getting distribution for 'pyparsing>=1.5.5'. Error: Couldn't install: pyparsing 1.5.6
Is there a way to fix it?
I don't understand why you would get this error, pyparsing most definitely includes a setup script.
You should also be able to install pyparsing using easy_install. Or just download the source package from SourceForge, and pull out the single pyparsing.py file. Pyparsing is packaged as just a single Python source file, so it should be easy to put wherever you want.
-- Paul
My input stream has several special characters ('\000' and the like). The parser stops and emits an error whenever these characters are encountered. How can I tell the parser to simply ignore them?
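[Editor note: no reply was recorded. One option is the ignore() method, which tells an expression to skip matches of another expression wherever they occur; the set of junk characters below is an assumption:]
from pyparsing import Word, printables, OneOrMore
junk = Word('\x00\x01\x02\x7f')   # whatever control characters appear in the stream
parser = OneOrMore(Word(printables))
parser.ignore(junk)
print parser.parseString('abc \x00\x00 def')   # -> ['abc', 'def']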
OK, with Sandy bearing down on DC, I am stuck at home with nothing but time to try and finish the parsing exercise I have been banging my head against for several weeks.
I have an export file created from a Lotus Notes database containing about 12,000 records. Each record has a set of attributes and a blob of text. My goal is to be able to parse the export file such that I can store it into another database, preserving the original attributes and enriching them with others via some natural language processing.
The export file contains two different structures. The first is for fielded information, and the pattern is 'field : value', where field starts at the beginning of a line and value may be of arbitrary length and format (the value sometimes has repeating information, but that is not important right now). The second structure is the blob of text, which always follows a specific field, '$Revisions : '.
Here is an example of my test harness including sample input:
test = '''
$FILE:
EXTERNALLINKEDUNID: 35F30F8BBDF4F0CE85257AA000745A3B
$Links:
DTU: 10/24/2012 12:00:00 AM
Document_ID: 2012-1728
Document_Type: Tax Alerts
Document_Subtype: Internal Tax Alerts
LPlanning: (n/a)
LProvision: (n/a)
LCompliance: (n/a)
LControversy: (n/a)
Drafter: Brittenham, J.A.
Turnaround: 24 Hours
CopyrightNoticeFirstLine: Copyright ' 1996 ' 2012, Ernst & Young LLP.
CopyrightNotice: All rights reserved. No part of this document may be reproduced, retransmitted or otherwise redistributed in any form or by any means, electronic or mechanical, including by photocopying, facsimile transmission, recording, rekeying, or using any information storage and retrieval system, without written permission from Ernst & Young LLP.
DocAuthor: CN=Darryl Hudson/OU=ESS/O=EYLLP/C=US
DocComposed: 10/23/2012 05:10:52 PM
InactiveIssue:
Title: Service releases some inflation adjustments for 2013, defers releasing others
Display: No
Search: No
External_Distribution: No
Supertopic_1: Supertopic 2\Personal Finance
DOCID: 2012-1728
$UpdatedBy: CN=Darryl Hudson/OU=ESS/O=EYLLP/C=US,CN=Joan D. Osborne/OU=NOhioEMichigan/OU=TAX/O=EYLLP/C=US
$Revisions: 10/23/2012 05:10:51 PM,10/23/2012 05:10:52 PM,10/23/2012 05:10:55 PM,10/23/2012 05:11:15 PM
The Service has issued Revenue Procedure 2012-41, which contains the inflation adjustments to various exemptions, exclusions and limitation amounts for income, estate and gift tax purposes that take effect in 2013. Revenue Procedure 2012-41 does not include some of the annual inflation adjustments, including those for the tax rate tables, the standard deduction, the personal exemption, and the overall limitation on itemized deductions. The Service stated that it will address those items in future guidance.
Important inflation adjustments under Revenue Procedure 2012-41
Kiddie Tax
For tax years beginning in 2013, the first $1,000 of income of a child subject to the kiddie tax will generally not be subject to tax, and the next $1,000 will be taxable at the child's own bracket. Unearned income in excess of $2,000 will be taxed to the child at the parent's tax rate.
Expatriation to avoid tax
For 2013, an individual with 'average annual net income tax' of more than $155,000 (up from $151,000 in 2012) for the five tax years ending before the date of the loss of US citizenship under Section 877(a)(2)(A) is subject to tax under Section 877A(g)(1).
Tax responsibilities of expatriation
For 2013, the amount that would be includible in the gross income of a covered expatriate by reason of Section 877A(a)(1) is reduced (but not below zero) by $668,000.
Annual gift tax exclusion
The annual gift tax exclusion amount under Section 2503 is $14,000 for 2012 (up from $13,000 in 2012).
The annual exclusion permitted by Section 2523(i)(2) for transfers to a noncitizen spouse has been increased from $139,000 in 2012 to $143,000 in 2013.
Special use valuation
The Section 2032A ceiling on special use valuation for an estate of a decedent dying in 2013 is $1,070,000 (up from $1,040,000 in 2012).
Interest on the portion of estate tax payable in installments
To calculate Section 6601(j) interest for an estate of a decedent dying in 2013, the dollar amount used to determine the '2% portion' of the estate tax payable in installments under Section 6166 is $1,430,000 (up from $1,390,000 in 2012).
Large gifts received from foreign persons
For tax years beginning in 2013, the threshold for reporting gifts from foreign persons under Section 6039F is $15,102 (up from $14,723 in 2012).
Contact Information
For additional information concerning this Alert, please contact:
Personal Financial Services
' Kim McFarlane
(330) 255-5247
This Alert was prepared to present time-sensitive information affecting our clients. Recipients of this publication should promptly review and consider the effect of its contents on the clients they serve.
'''
recordStart = '$FILE:'
recordEnd = Literal('\n')
colon = Literal(':')
f = Combine(Word(alphas+'$', alphanums+'_') + colon.suppress())('name')
v = restOfLine('value')
r = Dict(OneOrMore(Group(f + Optional(v))))('attrs')
record = ''
try:
    record = r.parseString(test, parseAll=False)
except ParseException, err:
    print '>>>>>>>>', err.line
    print '>>>>>>>>', ' '*(err.column-1) + '^'
    print '>>>>>>>>', err

print '========================'
print 'record = ', record
print ' Document_ID = ', record.attrs.Document_ID
print '-----------------------'
print 'raw input = ', test
print '======================='
print 'parsed attributes and values'
print ' '
for attr in record.attrs:
    print '-----|', attr.name, ':', attr.value
here is the corresponding output:
=======================
parsed attributes and values
-----| $FILE :
-----| EXTERNALLINKEDUNID : 35F30F8BBDF4F0CE85257AA000745A3B
-----| $Links :
-----| DTU : 10/24/2012 12:00:00 AM
-----| Document_ID : 2012-1728
-----| Document_Type : Tax Alerts
-----| Document_Subtype : Internal Tax Alerts
-----| LPlanning : (n/a)
-----| LProvision : (n/a)
-----| LCompliance : (n/a)
-----| LControversy : (n/a)
-----| Drafter : Brittenham, J.A.
-----| Turnaround : 24 Hours
-----| CopyrightNoticeFirstLine : Copyright ' 1996 ' 2012, Ernst & Young LLP.
-----| CopyrightNotice : All rights reserved. No part of this document may be reproduced, retransmitted or otherwise redistributed in any form or by any means, electronic or mechanical, including by photocopying, facsimile transmission, recording, rekeying, or using any information storage and retrieval system, without written permission from Ernst & Young LLP.
-----| DocAuthor : CN=Darryl Hudson/OU=ESS/O=EYLLP/C=US
-----| DocComposed : 10/23/2012 05:10:52 PM
-----| InactiveIssue :
-----| Title : Service releases some inflation adjustments for 2013, defers releasing others
-----| Display : No
-----| Search : No
-----| External_Distribution : No
-----| Supertopic_1 : Supertopic 2\Personal Finance
-----| DOCID : 2012-1728
-----| $UpdatedBy : CN=Darryl Hudson/OU=ESS/O=EYLLP/C=US,CN=Joan D. Osborne/OU=NOhioEMichigan/OU=TAX/O=EYLLP/C=US
-----| $Revisions : 10/23/2012 05:10:51 PM,10/23/2012 05:10:52 PM,10/23/2012 05:10:55 PM,10/23/2012 05:11:15 PM
I am not sure how to parse the blob of text from the input string. I have tried to treat it as part of the grammar but was not successful. I ended up not knowing how to identify the start of the blob, and the text of the blob ended up being parsed as part of the 'field : value' pattern whenever any ':'s occurred in the blob.
I am considering the possibility of creating a 2nd grammar, or even more simply segmenting the input string when reading the text file so that I can isolate the blob. So far this has been equal parts maddening and pure joy. The maddening part is that there are not a lot of examples and resources for a novice like me; the joy is that pyparsing holds such great promise for all types of tasks if I can just get my head around it and learn it better. Any suggestions would be greatly appreciated.
Try this:
recordStart = Literal('$FILE:')
colon = Literal(':')
f = ~recordStart + Combine(Word(alphas+'$', alphanums+'_') + colon.suppress())('name')
v = restOfLine('value')
blob = Group(~recordStart + ~StringEnd() +
             Empty().setParseAction(replaceWith('$Body'))('name') +
             SkipTo(recordStart | StringEnd())('value'))
r = Group(recordStart + (Dict(OneOrMore(Group(f + Optional(v)))) + blob)('attrs'))

try:
    records = OneOrMore(r).parseString(test, parseAll=False)
except ParseException, err:
    print '>>>>>>>>', err.line
    print '>>>>>>>>', ' '*(err.column-1) + '^'
    print '>>>>>>>>', err

for record in records:
    print '========================'
    print 'record = ', record
    print ' Document_ID = ', record.attrs.Document_ID
    print ' DOCID = ', record.attrs.DOCID
    #~ print '-----------------------'
    #~ print 'raw input = ', test
    print '======================='
    print 'parsed attributes and values'
    print ' '
    for attr in record.attrs:
        print '-----|', attr.name, ':', attr.value
Note the definition of blob as a very open-ended catch-all term, defined using SkipTo. blob has to come after you have looked for more attr definitions and not found them. There is also a bit of pyparsing magic in using an Empty with the parse action replaceWith to inject the '$Body' label, so that the blob will have a nice attribute name.
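To see that Empty/replaceWith trick in isolation, here is a minimal stand-alone sketch (separate from the grammar above):

from pyparsing import Empty, Word, alphas, replaceWith

# Empty always matches without consuming any input; replaceWith makes it
# emit a constant token, injecting a label into the results
labeled = (Empty().setParseAction(replaceWith('$Body'))('name')
           + Word(alphas)('value'))
print labeled.parseString('sometext').dump()
# prints something like:
# ['$Body', 'sometext']
# - name: $Body
# - value: sometext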
Parsing in general is a mix of the maddening and the joyful. Thanks for sticking this out, I hope this does the job with your Lotus export file.
-- Paul
Hi, I try to parse logical expressions such as these
x
FALSE
NOT x
(x = 5) AND (y >= 10) OR NOT (z < 100 OR w)
(A=True OR NOT (G < 8) => S = J) => ((P = A) AND not (P = 1) AND (B = O)) => (S = T)
and the parsing code I've written below is very slow (e.g. on the last test input above, 'A=True ...'). Am I missing something, or is there something I can do to make it faster?
LPAR,RPAR = map(Suppress,['(',')'])
number = Word(nums)
tf = oneOf('TRUE FALSE')
identifier = Word(alphas, alphanums + '_')
fol = Forward()
term = tf | identifier | number
op_prec = [(oneOf('= >= <= > <'),2,opAssoc.RIGHT,),
(CaselessLiteral('not'), 1, opAssoc.RIGHT,),
(CaselessLiteral('and'), 2, opAssoc.LEFT,),
(CaselessLiteral('or'), 2, opAssoc.LEFT,),
('=>',2,opAssoc.RIGHT,),
]
fol << operatorPrecedence(term,op_prec)
I have designed a match-expressions module as follows:
input:
- a list of strings
- a list of pyparsing matching expressions; each expression has a specific parse action method
logic:
for each string:
    for each expression:
        if it matches, execute the relevant action method and go to the next string
        else go to the next expression
The questions:
- How can I assign a specific method to each matching expression?
- Is there another, more elegant design for that module?
thanks
I could still use some more information here.
After finding the first match, do you continue on to all the rest of the expressions, or do you stop?
Are the action methods just arbitrary methods that take a string and do something, or are they intended to be chained one to the next (that is, they return the input string or a modified form of the input string)?
I would not have this function take a list of strings, but just a single string. Have the caller take care of looping over all the strings.
Can you put together an example of a list of expressions and corresponding functions, so I have a clearer idea of what your intention is for what this function is to do?
-- Paul
Oh I reread your pseudo code, and I see that you answered my first question - after a match, you are then just done with the string.
Still curious about the expressions and functions that you are going to pass. If these are just regular parse actions, then attach them to the expressions before calling the function. Your function then is nothing more than a MatchFirst.
expr1 = ...
expr2 = ...
expr3 = ...
expr1.setParseAction(func1)
expr2.setParseAction(func2)
expr3.setParseAction(func3)
# list of expressions, with parse action associated with each
exprs = [expr1, expr2, expr3]
# some sample strings
strings = "I'd gladly pay you Tuesday for a hamburger today".split()
for s in strings:
    print MatchFirst(exprs).parseString(s)
What you have described as a separate module is already part of pyparsing, it is how MatchFirst works.
-- Paul
- stopping after matching, and then processing the next string.
- the methods are arbitrary; they take a string and do a transformation.
- the data that should be processed is a list of strings. I extract information and transform it into another format.
Example:
def eng_word():
    excluded_chars = u'?!:;,()'
    english_alphas = u''.join(unichr(x) for x in range(0x0021, 0x007F))
    word = Word(english_alphas, excludeChars=excluded_chars)
    return word
data = [
    'diaa (mohamed) fayed',
    'diaa(mohamed)fayed',
    'diaa(fayed)',
    'diaa (fayed)',
    '(diaa)fayed',
    '(diaa) fayed',
]
expressions = [
    eng_word()('left') + Literal('(').suppress() + eng_word()('middle') + Literal(')').suppress() + eng_word()('right'),
    eng_word()('left') + Literal('(').suppress() + eng_word()('middle') + Literal(')').suppress(),
    Literal('(').suppress() + eng_word()('middle') + Literal(')').suppress() + eng_word()('right'),
]
method1
input: the output of expression1
output: [ 'diaa fayed', 'diaa mohamed fayed' ]
method2
input: the output of expression2
output: [ 'diaa', 'diaa fayed' ]
method3
input: the output of expression3
output: [ 'fayed', 'diaa fayed' ]
....
This is an example, but the real expressions and examples may be longer and different.
here are sample of the action methods
# Action Methods
def f01(toks):
    return ' '.join(toks) + ',' + ' '.join([toks[0], toks[2]])

def f02(toks):
    return ' '.join(toks) + ',' + ' '.join([toks[0]])

def f03(toks):
    return ' '.join(toks) + ',' + ' '.join([toks[1]])
the output
['diaa mohamed fayed,diaa fayed']
['diaa mohamed fayed,diaa fayed']
['diaa fayed,diaa']
['diaa fayed,diaa']
['diaa fayed,fayed']
['diaa fayed,fayed']
Please:
- see if you have better reasoning about the action methods.
- I will try to remove duplication by applying your note from my question about 'parenthesis and space'.
Given the expressions
name1 = '(' + Word(alphas) + ')' + Word(alphas)
name2 = Word(alphas) + '(' + Word(alphas) + ')'
name3 = Word(alphas) + '(' + Word(alphas) + ')' + Word(alphas)
and
example1 = (diaa)fayed
example2 = (diaa) fayed
this code parses the two examples to
['diaa', 'fayed']
but I need it to match only the first example
in the same way the following examples
diaa(fayed)
diaa (fayed)
diaa(mohamed)fayed
diaa (mohamed) fayed
i.e. I want to match only the examples without spaces
Read up on 'Combine' and 'leaveWhitespace' to see how to control how pyparsing skips or doesn't skip over whitespace.
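For example, here is a minimal sketch using leaveWhitespace (copies are used so the shared expressions are not modified globally):

from pyparsing import Word, alphas, Literal, ParseException

word = Word(alphas)
LPAR = Literal('(')
RPAR = Literal(')')

# each element after the first gets leaveWhitespace(), so no whitespace
# may appear between the pieces
name1 = (LPAR + word.copy().leaveWhitespace()
         + RPAR.copy().leaveWhitespace()
         + word.copy().leaveWhitespace())

print name1.parseString('(diaa)fayed')    # matches: ['(', 'diaa', ')', 'fayed']
try:
    name1.parseString('(diaa) fayed')
except ParseException:
    print 'no match - space between ) and the name is not allowed'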
-- Paul
Questions about scanExamples.py:
- What is the role of empty? Is it necessary?
macroDef = Suppress('#define') + ident + Suppress('=') + empty + restOfLine
- How is the dictionary constructed in
macros = dict(list(macroDef.searchString(testData)))
when the dict constructor needs a list of tuples? The output of list(macroDef.searchString(testData)) is
[(['MAX_LOCS', '100'], {}), (['USERNAME', "'floyd'"], {}), (['PASSWORD', "'swordfish'"], {})]
For example, the term (['MAX_LOCS', '100'], {}) looks like a tuple, but the second item is an empty dictionary.
- The structure of the ParseResults object, as a tuple of a list and a dictionary, is not understandable to me. Why that structure, for example (['MAX_LOCS', '100'], {})?
- I encourage you to experiment with these things on your own. Take out the empty and see how the values of the macros are different.
- Don't get hung up on how a ParseResults looks, focus on what it does. ParseResults is there to act like a list, an object, and a dict, all depending on how you access it. In the case you show, there are 2 elements in the term, 'MAX_LOCS' and '100'. The empty dict indicates that there are no results names defined. If you access the term as a list, you'll find term[0] contains 'MAX_LOCS' and term[1] contains '100'. The dict constructor is not limited to taking a list of tuples as its argument, but will take a sequence of any 2-element sequences. A ParseResults containing 2-element ParseResults will work just as well.
Here is a little console example doing tuple unpacking directly against the ParseResults object returned from parseString:
>>> patt = Word(alphas) + Suppress('=') + empty + restOfLine
>>> macro = patt.parseString('A = 3.14159')
>>> key,value = macro
>>> print key
A
>>> print value
3.14159
That is similar to what is happening inside the dict constructor.
-- Paul
copy() - returns a copy of a ParserElement; can be used to use the same parse expression in different places in a grammar, with different parse actions attached to each
Please, kindly give us an example?
from pyparsing import Suppress, Word, nums, ParseException

SLASH = Suppress('/')
integer = Word(nums)

def validate_year_range(t):
    val = int(t[0])
    if not 1800 <= val <= 2099:
        raise ParseException('invalid value for year')
year_integer = integer.copy().setParseAction(validate_year_range)

def validate_month_range(t):
    val = int(t[0])
    if not 1 <= val <= 12:
        raise ParseException('invalid value for month')
month_integer = integer.copy().setParseAction(validate_month_range)

def validate_day_range(t):
    val = int(t[0])
    if not 1 <= val <= 31:
        raise ParseException('invalid value for day')
day_integer = integer.copy().setParseAction(validate_day_range)

date = year_integer + SLASH + month_integer + SLASH + day_integer

print date.parseString('2012/12/31') # valid date
print date.parseString('2012/2/30')  # not a valid date, but passes the range check
print date.parseString('2012/14/4')  # no such thing as month 14

# added exercise for the reader: add a parse action to date to verify that the
# day_integer value is within the correct range for the given month and year
Thanks very much.
Please, when you have time, also see my previous post.
Hello,
I have been working with pyparsing for some time and have written a parser to parse a special programming language. All is working fine with the parser, except that I need to make it go faster and I wanted to know what techniques can help doing that.
For example, here's what I see when I run the profiler on the code.
ncalls tottime percall cumtime percall filename:lineno(function)
4087463/1284 28.829 0 96.989 0.076 pyparsing.py:909(_parseNoCache)
1959051/1543061 8.517 0 19.598 0 pyparsing.py:291(__init__)
1380 7.235 0.005 7.235 0.005 {_omnipy.invoke}
393192/7383 7.167 0 95.522 0.013 pyparsing.py:2524(parseImpl)
Line 909 (_parseNoCache) is taking about 29 seconds of its own time. ParseResults.__init__ on line 291 is taking 8.5 seconds, etc.
I wish to know what techniques are available to make things go faster.
- Compilation?
- Rewrite the parsers?
etc.
Thanks for any suggestions.
Chah'
Follow up on the question above,
how can I trace which parser rule is being called the most and/or taking the most total time? With the profiler I used, I can see that ParseResults and _parseNoCache are taking the most time, but if I can trace the problem to a specific parsing rule, I'm already ahead.
Tx
Try calling ParserElement.enablePackrat() before calling parseString - this will do internal memoization of parser matches/exceptions. Also, a common easy-to-change speedup is to replace low-level tokens that are built up using Combine(lots of other pyparsing bits) with a Regex - floating-point numbers matched using Regex(r'\d+\.\d*') will be much faster than Combine(Word(nums) + '.' + Optional(Word(nums))), with little loss of readability.
You can add your own custom debug action to all of your expressions of interest to keep a tally of attempts, matches, and exceptions.
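For instance, here is a minimal sketch of such a tally-keeping debug action (the counter names here are my own, not part of pyparsing):

from collections import defaultdict
from pyparsing import Word, nums

tries = defaultdict(int)
matches = defaultdict(int)
fails = defaultdict(int)

def count_try(instring, loc, expr):
    tries[str(expr)] += 1
def count_match(instring, startloc, endloc, expr, toks):
    matches[str(expr)] += 1
def count_fail(instring, loc, expr, exc):
    fails[str(expr)] += 1

# attach to any expression you want to keep stats on
integer = Word(nums)
integer.setDebugActions(count_try, count_match, count_fail)

integer.searchString('abc 123 def 456')
print dict(tries), dict(matches), dict(fails)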
Hi Paul,
ok. Let me see if I understand how to do this.
For example, if I have the following
STRING = Combine(Literal('"') + ZeroOrMore(S_CHAR | S_ESCAPE) + Literal('"'))
Can I just say
STRING = Regex(r'"[my escape characters]*"')
and that would work within the parser wherever STRING was used?
Second question. The enablePackRat
can I just use something like (given the pyparsing STRING def above),
STRING = Combine(Literal('"') + ZeroOrMore(S_CHAR | S_ESCAPE) + Literal('"'))
STRING.enablePackRat()
and later on use STRING.parseString(...)
as before?
Thanks
For your definition of STRING, just use QuotedString('"', escChar='\\'), which will internally generate its own regex.
enablePackrat has to be globally enabled so that all expressions get memoized. After importing pyparsing, do 'ParserElement.enablePackrat()'. More info here:
Didn't mean to stomp on your post - your regex sample is on the right track too, if you prefer to use that over QuotedString.
And after calling enablePackrat, the rest of your program works with no additional changes - the memoizing just happens internally to the pyparsing code.
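A minimal usage sketch:

from pyparsing import ParserElement, Word, nums

# one global call, made before parsing - everything else stays unchanged
ParserElement.enablePackrat()

integer = Word(nums)
print integer.parseString('12345')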
What about PyPy? Since this is pure Python.
Yes, I have tried pyparsing with PyPy - it is 2-8 times as fast as CPython, depending on the complexity of the grammar.
I want to use the design pattern Chain of Responsibility.
Remember the last discussion about MatchFirst and the pattern matcher module? I want to design the pattern matcher as a Chain of Responsibility, such that:
- each line passes over a set of pyparsing expressions
- each expression will process the line and extract matched information
- the extracted information will be collected in a structure, to be used to fill database structures
This module will be the Information Extractor module.
Please let me know if you have any thoughts about that.
I like Chain Of Responsibility (or CoR for short) in one respect, and dislike it in another. I like the notion of setting up the chain of handlers, and then running an object through the chain to be processed by the first eligible handler. What I dislike is that the handlers know they are part of a chain - they have a successor member variable, and each link's implementation code can't just be a clean 'handle object' but is instead 'try handling object, but if I can't, pass it to the next handler in the chain'.
To get the best of both worlds, and still call this Chain Of Responsibility, you could create a wrapper class containing the handler instance that just does object handling, and have the wrapper contain the next pointer and have the wrapper implement the 'try my handler, if fail, pass it to the next'. Now the handlers stay very clean, and the CoR pattern happens in a generic wrapper:
class CoRLink(object):
    def __init__(self, handler):
        self._handler = handler
        self._next = None
        self._handled = False

    def setSuccessor(self, nextHandler):
        self._next = nextHandler

    def handle(self, obj):
        # version of handle where handler.handle() returns
        # True or False if the object was handled
        if not self._handler.handle(obj):
            if self._next:
                self._next.handle(obj)

    def handle(self, obj):
        # version of handle where handler.handle() raises
        # an exception if the object was not handled
        self._handled = False
        try:
            self._handler.handle(obj)
        except Exception:
            pass
        else:
            self._handled = True
        if not self._handled and self._next:
            self._next.handle(obj)

    def addHandler(self, newHandler):
        if self._next:
            self._next.addHandler(newHandler)
        else:
            self._next = newHandler

    def wasHandled(self):
        return (self._handled or
                (self._next and self._next.wasHandled()))

h1 = HandlerType1()
h2 = HandlerType2()
h3 = HandlerType3()
handlerChain = CoRLink(h1)
handlerChain.addHandler(CoRLink(h2))
handlerChain.addHandler(CoRLink(h3))

# pass an object to the head of the chain, and one of the
# handlers might handle it
handlerChain.handle(objectToBeHandled)
But this strikes me as unnecessarily clever code when a simple container class for the chain itself can cleanly implement the iteration logic, breaking out on the first successful handler. Because the 'try this handler, but if fail move on to the next' logic is implemented outside of the handlers themselves, this strictly speaking isn't an example of the CoR pattern - but I think it is more readable, and has an API that is just as clean.
class HandlerChain(object):
    def __init__(self):
        self._handlers = []
        self._handled = False

    def addHandler(self, newHandler):
        self._handlers.append(newHandler)

    def handle(self, obj):
        # version of handle where handler.handle() returns
        # True or False if the object was handled
        self._handled = False
        for handler in self._handlers:
            if handler.handle(obj):
                self._handled = True
                break

    def handle(self, obj):
        # or implement this in a single line using the builtin
        # method 'any'
        # (any will automatically stop processing the list once
        # it gets the first True value)
        self._handled = any(handler.handle(obj)
                            for handler in self._handlers)

    def handle(self, obj):
        # version of handle where handler.handle() raises
        # an exception if the object was not handled
        self._handled = False
        for handler in self._handlers:
            try:
                handler.handle(obj)
            except Exception:
                pass
            else:
                self._handled = True
                break

    def wasHandled(self):
        return self._handled

h1 = HandlerType1()
h2 = HandlerType2()
h3 = HandlerType3()
handlerChain = HandlerChain()
handlerChain.addHandler(h1)
handlerChain.addHandler(h2)
handlerChain.addHandler(h3)

# pass an object to the head of the chain, and one of the
# handlers might handle it
handlerChain.handle(objectToBeHandled)
I am definitely a big fan of design patterns - you see heavy use of the Template and Strategy patterns and this modified CoR in pyparsing, and the Command pattern in many of my pyparsing examples. But it is also important to understand that the context of the original GoF Design Patterns work was C, C++, and Java, languages that did not have language-native containers like Python's list and tuple or the flexibility of Python's duck-typing. The design patterns that were identified in the early 1990's were defined by the problem space and the language choices and features of that time. For instance, you don't ever see a Factory implementation in Python, since this is pretty much done implicitly for you by the language itself.
Here is a link to a good presentation by Alex Martelli on Design Patterns in Python: , and some videos of Alex presenting this material:
How does this all relate to pyparsing? Look again at this simple example:
floatExpr = Regex(r'\d+\.\d*')
intExpr = Word(nums)
parser = floatExpr | intExpr | quotedString
# or to be more CoR-like, add each expression (or handler) to the
# parser
parser = floatExpr
parser |= intExpr
parser |= quotedString
data = '''100
3.14159
"blah blah"'''

for line in data.splitlines():
    print parser.parseString(line)
parser is a MatchFirst of three pyparsing expressions. MatchFirst's implementation is essentially the same as the third HandlerChain.handle method shown above. For the line containing '100', parser first tries to evaluate the floatExpr (and fails); then the intExpr, succeeds, and stops trying to match any further expressions.
Write back and let me know what you think about this discussion.
-- Paul
After some further thought, I recall now that many of the examples for uses of CoR were in cases involving UI controls and widgets, which naturally contain pointers to parent container and child contained widgets. The classic CoR is implemented to do the handling of UI events, such as a mouse click for instance. The click event is passed to the innermost widget's handler. If not handled there, the widget calls the handle method on its parent, and so on up the chain of contained UI objects, until the control or container that is responsible for handling that event is reached, and it handles the event.
In this case, there is no addition of next pointers for the purpose of implementing CoR - just the opposite, CoR can take advantage of the fact that the pointers are there to begin with, implementing the UI controls hierarchy.
Out of curiosity, what was it that made you want to implement CoR in the first place? Did you think it would make a good way to model the alternative parsing options for parsing a line of data? If you broaden your concept of just how CoR is implemented, to include the iteration over a list of possible handlers, I would say that using a MatchFirst as I showed in the previous comment (the '|' operator creates MatchFirsts) is doing just that.
-- Paul
Thanks very much, and I am sorry for the late reply.
- I agree with you about simplifying the pattern as in the second example.
- I want only the concept of the CoR pattern, not its exact implementation.
- I want one modification to the concept of the pattern: the line of data can be processed by more than one handler or pyparsing expression.
Example: the line of data could contain two pieces of information that need to be extracted by two pyparsing expressions or handlers.
- I had hoped you would implement my concept with pyparsing classes as much as possible, not in pure Python, so that the code would be homogeneous.
With your adoption of my simplification of CoR iteration being done in a container, and with your modification of processing all handlers and not just the first, this pretty much ceases to be CoR.
Your description of a pyparsing element that reprocesses the input multiple times is not consistent with any other part of pyparsing. I don't think you really need a specialized pyparsing class, you just need to write some Python code that iterates over a series of expressions and accumulates the matched data into a single ParseResults object. Here is some compact sample code, using Python's sum builtin and searchString:
>>> a_s = Word('A')
>>> b_s = Word('Bb')('B')
>>> c_s = Word('C')
>>> exprs = [a_s, b_s, c_s]
(iterate over all expressions, and accumulate the returned ParseResults using the Python sum builtin)
>>> instr = 'AS;LKJFASDBWEL;CCDBEawe;lkb'
>>> total = sum(sum(expr.searchString(instr)) for expr in exprs)
>>> print total.asList()
['A', 'A', 'B', 'B', 'b', 'CC']
>>> print total.dump()
['A', 'A', 'B', 'B', 'b', 'CC']
- B: b
At this point, I must ask you to stop just asking for stuff, and start working out simple examples and asking for specific help on what is not working for you. What I read in this whole thread is 'I want...' over and over. I do my best to help out beginners, but you have to bring some effort to the process too.
-- Paul
Thanks very much.
I will do that: really try first, and then ask.
Started using pyparsing and found it phenomenal for my project. I am trying to build a DSL for simple calculations on data fields extracted from a mongodb database. Using the SimpleCalc and eval_arith examples I was able to put together a calculator that evaluates an expression that combines field references and simple operators.
I am now trying to add a layer of recursion to support variables, but having trouble sorting through how to do this most elegantly. I have the following questions:
- When using the recursive pattern as shown in SimpleCalc, is it the case that I no longer need to use operatorPrecedence as was being set up in eval_arith? The examples like SimpleCalc seem to do away with this.
- From the standpoint of evaluating expressions containing variables, if I wanted to preserve the use of classes like EvalConstant such that they also handle variables, would I simply extend the EvalConstant class so that it contains a list of variables and their values, and then invoke a method within it when I encounter a variable?
In general, what would be awesome and I think useful to the pyparsing community is to expand the eval_arith example such that it shows the use of nested variables, something like:
a = 3
b = 8
c = a + b
2*c - b
Thanks for the help, I can post some code if needed
Getting closer but struggling to get the variable parsing working. I think there is something wrong with my grammar. Code below:
# Define parser, accounting for the fact that some fields contain whitespace
integer = Word(nums)
variable = Word(alphas)
real = Combine(Word(nums) + '.' + Word(nums))
field = Combine(Word(alphas) + ':' + Word(printables) + Optional(' ' + Word(alphas) + ' ' + Word(alphas)))
operand = real | integer | field | variable
signop = oneOf('+ -')
multop = oneOf('* /')
plusop = oneOf('+ -')
# Use parse actions to attach Eval constructors to sub-expressions
operand.setParseAction(EvalConstant)
arith_expr = operatorPrecedence(operand,
[
(signop, 1, opAssoc.RIGHT, EvalSignOp),
(multop, 2, opAssoc.LEFT, EvalMultOp),
(plusop, 2, opAssoc.LEFT, EvalAddOp),
])
comparisonop = oneOf('< <= > >= != ==')
comp_expr = operatorPrecedence(arith_expr,
[
(comparisonop, 2, opAssoc.LEFT, EvalComparisonOp),
])
assignment = variable('varname') + '=' + (arith_expr | comp_expr)('varvalue')
assignment.setParseAction(StoreVariable)
My problem seems to be that parse action StoreVariable is not being called when evaluating an expression like:
var1 = Person:Height
Rather, EvalConstant is being invoked on the token 'var1' and fails. Here is my code for EvalConstant:
class EvalConstant(object):
    tests_ = {}
    fields_ = {}
    person_ = None

    def __init__(self, tokens):
        self.value = tokens[0]

    def eval(self):
        v = self.value
        # Determine if this is a database reference and if so get field value
        if ':' in v:
            fieldRef = v.split(':')
            source = fieldRef[0]
            field = fieldRef[1]
            if source not in EvalConstant.tests_:
                raise NameError('Syntax error: cannot find source ' + source + ' in test list')
            elif field not in EvalConstant.fields_:
                raise NameError('Syntax error: cannot find field ' + field + ' in fields list')
            # Fetch the value from the database
            rec = db[source].find_one({'Name' : self.person_}, { '_id' : 0, field : 1})
            if rec is not None:
                return rec[field]
            else:
                return 0
        # Must be a number
        else:
            return float(self.value)
Any idea where my grammar is broken?
Ah, well! What you have done is to introduce a new type of operand, a DBReference. Move this code out of EvalConstant and define a new class EvalDBReference. Create a grammar element dbReference something like:
dbReference = variable('table') + ':' + variable('column')
Then expand operand to:
operand = (real | integer | field | variable).setParseAction(EvalConstant) | dbReference.setParseAction(EvalDBReference)
This will keep your code from getting too complicated in EvalConstant.
-- Paul
Thanks - in essence I already have a dbReference; it's the grammar I called 'field' in my definitions. I see what you are saying, though, in terms of adding this to the list of operands. Will give it a shot!
Ok, stuck again. I just cannot seem to get the assignment parse action to get executed, and I am not sure what EvalConstant is supposed to do when it encounters a variable. In your code suggestion you showed setting parse action for variable to EvalConstant but I am not clear how this case should be handled. Should I be creating a dictionary in EvalConstant to hold the vars, and if so how do I get them to evaluate fully?
I've pasted the full source for my example here:
again its pretty similar to your eval_arith except I use dbReferences and want to have variables in a dictionary so that formulas can reference prior formulas.
really appreciate the help, as you can tell I'm a hack software developer but am the only one around that can work on this project
My mistake, change:
operand = (real | integer | variable).setParseAction(EvalConstant) | dbRef.setParseAction(EvalDBref)
to
operand = dbRef.setParseAction(EvalDBref) | (real | integer | variable).setParseAction(EvalConstant)
Can you see why? (hint: '|' means 'match first' - must take care not to interpret the 'xxx' of 'xxx:yyy' as a variable named 'xxx')
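Here is a tiny illustration of why the ordering matters (a sketch with simplified expressions):

from pyparsing import Word, alphas, Combine

var = Word(alphas)
dbRef = Combine(Word(alphas) + ':' + Word(alphas))

# var gets first crack and matches just 'xxx'; dbRef is never tried
print (var | dbRef).parseString('xxx:yyy')   # -> ['xxx']
# dbRef gets first crack and matches the whole reference
print (dbRef | var).parseString('xxx:yyy')   # -> ['xxx:yyy']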
Yes, understood on the ordering. I changed the operand parsing around but still having an issue: when testing simple expressions such as
a = 2.5
b = Person:Height
I still am ending up with EvalConstant being the first parse action that is called. It does not seem like the assignment grammar is being executed so I never end up with the StoreVariable function being executed:
assignment = variable('varname') + '=' + (arith_expr | comp_expr)('varvalue')
assignment.setParseAction(StoreVariable)
I think there is still something not right with the grammar. Should I be using the Optional object like this:
assign = Optional((variable+assign).setParseAction(StoreVariable)) + comp_expr
Hi guys, I am still working on this problem and struggling. My parseaction for assignment is being eaten by the operand action .. so I never am able to assign the variable value. Please have a look and let me know what I have missed here:
expr = Forward()
chars = Word(alphanums + '_-/')
integer = Word(nums).setParseAction(EvalConstant)
real = Combine(Word(nums) + '.' + Word(nums)).setParseAction(EvalConstant)
var = Word(alphanums)
# Handle database field references that are coming out of Mongo
dbRef = Combine(chars + OneOrMore(':') + chars)
dbRef.setParseAction(EvalDBref)
# Handle function calls
functionCall = (Keyword('Rank') | Keyword('ZS') | Keyword('Ntile')) + '[' + dbRef + ']'
functionCall.setParseAction(EvalFunction)
assign = var('varname') + '=' + expr('varvalue')
assign.setParseAction(assign_var)
operand = functionCall | dbRef | (var | real | integer).setParseAction(EvalConstant)
signop = oneOf('+ -')
multop = oneOf('* /')
plusop = oneOf('+ -')
# Use parse actions to attach Eval constructors to sub-expressions
expr << operatorPrecedence(operand,
[
(signop, 1, opAssoc.RIGHT, EvalSignOp),
(multop, 2, opAssoc.LEFT, EvalMultOp),
(plusop, 2, opAssoc.LEFT, EvalAddOp),
])
assign_var is never being called, EvalConstant is being invoked on the token 'Var' for a test expression of this type:
Var=People::Height + People::Weight
Help!
Are you doing this:
(expr | assign).parseString(inputstring)
If so, reverse the order of the expressions to:
(assign | expr).parseString(inputstring)
-- Paul
Thanks, I am now doing parseString in the order you described and assign_var is being called correctly. Unfortunately now I end up with an exception during evaluation:
AttributeError: 'str' object has no attribute 'eval'
Here is my EvalConstant
class EvalConstant(object):
    var_ = {}

    def __init__(self, tokens):
        self.value = tokens[0]

    def eval(self):
        v = self.value
        if v in self.var_:
            return self.var_[v]
        else:
            return float(self.value)
And here is my formula evaluation call
ret = (assign | expr).parseString(line)
print line + ' --> ' + str(ret.eval())
The value of ret is parsed to 'Var' when testing the formula:
Var=People::Height + People::Weight
I just can't puzzle out why parseString only returns the token Var. I believe the problem lies in my operand grammar, but when I try to shuffle the operands around I get other problems where the variable grammar eats my dbRef.
Addendum: after playing around some more, I have found that parseString returns the following tokens in my example:
['Var', '=', <__main__.EvalAddOp object at 0x1007c32d0>]
The exception is thrown after this, as 'Var' cannot be evaluated. I think I am extremely close!
An assign statement is not the same as an expression. If an assign is parsed, then you don't need to eval anything - the expression on the right-hand side has already been eval'ed and stored into the variable by the parse action you attached to assign. If the first token of the results is a string, then there is nothing to do, or you can print out a diagnostic like "print ret[0], '<-', EvalConstant.var_[ret[0]]". Yes, I think you are very close.
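A small sketch of that check, using the names from your own code above (assuming assign_var stores values into EvalConstant.var_):

ret = (assign | expr).parseString(line)
if isinstance(ret[0], basestring):
    # an assignment was parsed - the parse action already eval'ed and
    # stored the right-hand side, so there is nothing left to do
    print ret[0], '<-', EvalConstant.var_[ret[0]]
else:
    # a bare expression - evaluate it now
    print line, '-->', str(ret[0].eval())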
IT LIVES
got it all functioning .. thanks for the help. That last tip about an assignment not being an expression should have been obvious but I am just not Being The Parser!
Next step is function evaluation for me!
Just to simplify my question, my primary request would be to see an example of how eval_arith would be extended to handle nested variables and recursion while retaining its same general class structure.
Thanks!
I don't think there's much recursion here - operatorPrecedence and pyparsing should already take care of any recursion in the parsing process.
To extend eval_arith, you would first need to change the comparison operator '=' to '==' so as not to confuse assignment with comparison. Then, expand the parser from just comp_expr to comp_expr | assignment_statement, and define assignment_statement to be
assignment_statement = variable('varname') + '=' + (arith_expr | comp_expr)('varvalue')
and add a parse action to assignment_statement that looks like:
def store_variable_value(tokens):
    EvalConstant.vars_[tokens.varname] = tokens.varvalue.eval()
assignment_statement.setParseAction(store_variable_value)
This is the general idea, I haven't tested this, but it should get you in the ballpark.
-- Paul
Thanks for the tip - I will try it. How would I then incorporate the store_variable_value parse action into the operatorPrecedence for arith_expr?
Could you explain the usage of operatorPrecedence a little more, and why I wouldn't use the pattern of having an EvalStack[] and VarStack[] together with the recursive function EvaluateStack() as shown in SimpleCalc.py?
I am sure my question is totally newbish, but I am hoping to get a threshold level of understanding so that I can self serve going forward.
It seems like the grammar definition in SimpleCalc is more complex, using atom, factor, term, and expression definitions, while when using operatorPrecedence the structure is simplified.
Thanks again
By the way, what is the meaning of this grammar:
assignment_statement = variable('varname') + '=' + (arith_expr | comp_exp)('varvalue')
It is not clear to me what the function variable() is.
Ignore my last question on variable - I realized this was for me to define.
Following up on my quest ..
I am working on a simple DSL to transform data extracted from MongoDB. I am using Python and pyparsing and have gotten reasonably far in creating a grammar that works for basic operators like +, -, *, /, starting from the examples provided. I am currently stuck on how to get my program to evaluate functions of the form Rank[databaseField]. I can retrieve and operate on dbFields through the simple operators, but something is not working with my recursion in evaluating functions.
Here is the grammar and associated setParseActions:
# Define parser, accounting for the fact that some fields contain whitespace
chars = Word(alphanums + '_-/')
expr = Forward()
integer = Word(nums).setParseAction(EvalConstant)
real = Combine(Word(nums) + '.' + Word(nums)).setParseAction(EvalConstant)
# Handle database field references that are coming out of Mongo
dbRef = Combine(chars + OneOrMore(':') + chars)
dbRef.setParseAction(EvalDBref)
# Handle function calls
functionCall = (Keyword('Rank') | Keyword('ZS') | Keyword('Ntile')) + '[' + dbRef + ']'
functionCall.setParseAction(EvalFunction)
operand = (real | integer) | functionCall | dbRef
signop = oneOf('+ -')
multop = oneOf('* /')
plusop = oneOf('+ -')
# Use parse actions to attach Eval constructors to sub-expressions
expr << operatorPrecedence(operand,
[
(signop, 1, opAssoc.RIGHT, EvalSignOp),
(multop, 2, opAssoc.LEFT, EvalMultOp),
(plusop, 2, opAssoc.LEFT, EvalAddOp),
])
formulas = ['Rank[Person:Height]']
for f in formulas:
    ret = expr.parseString(f)[0]
    print f + ' --> ' + str(ret.eval())
Here is the relevant code for my evaluation class:
# Executes functions contained in expressions
class EvalFunction(object):
    def __init__(self, tokens):
        self.value = tokens[0]
    def eval(self):
        func = self.value
        if func == 'Rank':
            # How to evaluate the token that is the arg of the function?
            return 'Rank Found'
I think I just need a nudge in the right direction to get to the next stage ..
As an update, I got this figured out and working. Ended up with an EvalFunction class that looks like this:
class EvalFunction(object):
    pop_ = {}
    def __init__(self, tokens):
        self.func_ = tokens.funcname
        self.field_ = tokens.arg
    def eval(self):
        # Get the name of the requested field and source db.
        # Functions can only be called on a dbRef, so this is always done
        v = self.field_.value
        fieldRef = v.split(':')
        source = fieldRef[0]
        field = fieldRef[1]
        # Evaluate the dbRef (get the value from the db)
        val = self.field_.eval()
        if self.func_ == 'Avg':
            rec = db['Stats'].find_one({'_id' : field})
            return rec['value']['avg']
        elif self.func_ == 'Root':
            return math.sqrt(val)
and a grammar that is:
functionCall = funcNames('funcname') + '[' + dbRef('arg') + ']'
functionCall.setParseAction(EvalFunction)
Hi Paul, I'm trying to parse C function calls like this:
from pyparsing import Word, alphas, alphanums, oneOf, OneOrMore, \
commaSeparatedList, Suppress, Forward, Group, Optional, \
delimitedList, Regex, operatorPrecedence, opAssoc, quotedString, \
dblQuotedString, Literal
testData = '''
funcName('paramOne', &paramTwo, fTwo(p0, p1), paramFour);
'''
expr = Forward()
LPAR, RPAR, SEMI = map(Suppress, '();')
identifier = Word(alphas+'_', alphanums+'_')
function_call = identifier.setResultsName('name') + LPAR + Group(Optional(delimitedList(expr))) + RPAR
integer = Regex(r'-?\d+')
real = Regex(r'-?\d+\.\d*')
operand = (function_call | identifier | real | integer | quotedString )
expop = Literal('^')
signop = oneOf('+ -')
multop = oneOf('* /')
plusop = oneOf('+ -')
factop = Literal('!')
derefop = OneOrMore('*')
addrop = oneOf('&')
expr << operatorPrecedence( operand,
[(derefop, 1, opAssoc.RIGHT),
(addrop, 1, opAssoc.RIGHT),
(factop, 1, opAssoc.LEFT),
(expop, 2, opAssoc.RIGHT),
(signop, 1, opAssoc.RIGHT),
(multop, 2, opAssoc.LEFT),
(plusop, 2, opAssoc.LEFT),]
)
for t, s, e in function_call.scanString(testData):
    print t[0], len(t[1]), 'Parameters:', t[1]
it returns this:
funcName 5 Parameters: ["'paramOne'", ['&', 'paramTwo'], 'fTwo', ['p0', 'p1'], 'paramFour']
I want the output to show 4 parameters, not 5. I need to know that fTwo is a function and that its parameters are p0 and p1. What are my options? TIA
Wow, a C parser is very ambitious, you're doing pretty good so far it seems. To group all the tokens together for a function call (a good idea) change:
function_call = identifier.setResultsName('name') + LPAR + Group(Optional(delimitedList(expr))) + RPAR
to
function_call = Group(identifier.setResultsName('name') + LPAR + Group(Optional(delimitedList(expr))) + RPAR)
Now you should see your function call argument wrapped as its own subgroup.
Good luck! -- Paul
Paul, thanks for the quick response! I am not hoping to write a complete C parser. My task at hand is to go through the source and replace some functions with new names and parameters derived from the old parameters. For example, change
funcName('paramOne', &paramTwo, fTwo(p0, p1), paramFour);
to something like this:
newFuncName('PARAMONE', &paramTwo, paramThree, fTwo(p0, p1), paramFour);
In the above example, the string in the first param is transformed to upper case, paramThree is inserted, and the rest are preserved as original. Is 'transformString' the right tool for this? I am having a hard time understanding transformString and setParseAction. To start, I can't even get it to print out the original function.
def substituteFunc(s, l, t):
    s = t[0] + '(' + ', '.join(t[1]) + ')'
    return s

function_call.setParseAction(substituteFunc)
print function_call.transformString(testData)
Please advise. Thanks!
I'm pretty sure transformString is exactly the tool for this job. You define a pattern to be matched, and in the parse action perform the desired transformation and return the modified string. transformString will then reassemble the unmatched pieces and the transformed strings back into a single output string. Suppressed expressions will also be stripped out when using transformString. You can see some examples in this code:
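Here is a minimal sketch of my own of the idea (a simplified grammar, not the referenced example code and not your full C function-call parser):

from pyparsing import Word, alphas, alphanums, quotedString

identifier = Word(alphas + '_', alphanums + '_')
# for this sketch, match just calls of the form name('...')
call = identifier('name') + '(' + quotedString('first') + ')'

def rewrite(s, l, t):
    # rename the function and upper-case the quoted first argument
    return 'newFuncName(' + t.first.upper() + ')'
call.setParseAction(rewrite)

print call.transformString("funcName('paramOne');")
# -> newFuncName('PARAMONE');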
I have done some limited parsing in the past using Construct (a Python binary parser package) and regexes. I really do not like the latter. Construct seems to be similar to pyparsing in that you build up your parser a step at a time. I used it in a project to parse simple text messages coming from a server stream. It radically improved the readability of the code and the reliability of the message parsing.
Now I need a more full-featured text-based parser, so I found pyparsing. I need to take an existing scripting language with a well-defined C-like syntax and convert it to Javascript and/or Python. So this will definitely be a challenge. I have not done anything with traditional compiler tools, especially since using Construct has kind of spoiled me.
Would it be better to just start from scratch, or should I look at adapting the existing pyparsing C-code parsers? The language I want to parse is Torque Script, which uses a simple C-like syntax. I think due to the syntax of the language it should be easier to parse. The syntax is also similar to Javascript in that it uses keywords for function definitions.
Here is my first take at variables:
Notation:
+ = 1 or more
* = 0 or more
| = or
<> = used to identify entities
<alpha> ::= <+'a'...'z'> | <+'A'...'Z'>
<numeric> ::= <+'0'...'9'>
<variable> ::= <'$'|'%'> <+alpha>|'_' <*alpha|*numeric>
<local_variable> ::= <'%'> <+alpha>|'_' <*alpha|*numeric>
<global_variable> ::= <'$'> <+alpha>|'_' <*alpha|*numeric>
Variables are defined as having a '$' or '%' at the beginning with the normal rules for variables following that. I know there are already functions to help with this I just want to work through the process.
Eh, I messed up alpha. It should be like this:
, = group sets
<alpha> ::= <+'a'...'z','A'...'Z'>
Okay, so I have successfully parsed variables:
# define variable parser
var_start = oneOf('$ %')
#identifier = OneOrMore('_'+alphas)+ZeroOrMore('_'+alphas+nums)
identifier = Word(alphas+'_', alphanums+'_')
variable = var_start+identifier
One question on this:
Why doesn't the commented-out identifier work? To me it looks equivalent to the Word()-based version, unless I am misunderstanding how OneOrMore and ZeroOrMore are supposed to be used.
alphas, nums, and alphanums are not pyparsing expressions, they are just plain old strings. pyparsing does auto-promotion of strings to Literals in many cases, so that you can easily write:
socSecNumber = Word(nums,exact=3) + '-' + Word(nums,exact=2) + '-' + Word(nums,exact=4)
alphas, nums, etc. are convenience string constants for defining Words, so that you don't have to constantly define 'alphas = 'abcdefghi...etc.'' in all your pyparsing code. But because alphas is a string, and pyparsing auto-promotes strings to Literals, what you actually wrote is 'OneOrMore(Literal('_abcdefghi...etc.')) + ZeroOrMore(Literal('_abcdefghi...0123456789'))', so the repetition is looking for those actual strings - not using them as Word does, as the definition of allowed leading and optional body characters. Also, Word does not allow any intervening whitespace, but looks for contiguous 'words' composed from the given leading and body characters, whereas by default 'expr + expr' (where 'expr' is any pyparsing expression) will allow whitespace between the two expressions and still match.
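A short demonstration of the difference (a sketch):

from pyparsing import Word, Literal, alphas, ParseException

word_expr = Word(alphas)    # alphas used as the set of allowed characters
lit_expr = Literal(alphas)  # alphas promoted to one long literal string

print word_expr.parseString('hello')    # -> ['hello']
try:
    lit_expr.parseString('hello')
except ParseException:
    print "Literal(alphas) only matches the full 'abcd...XYZ' string itself"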
Glad you find pyparsing to be a promising toolkit for you - it does take some mental adjustment to work with, but I hope it's not too big of a hurdle!
-- Paul
Yeah, I was starting to wonder after reading through pyparsing.py. I thought it might be looking at alphas and alphanums as strings rather than objects. Also, good catch on using Word to keep there from being whitespace in there. That would have been bad.
Hey, I picked up your manual from Oreilly for $10. Is that going to be updated at some point? I saw it was referencing 2.3 or 2.4 of Python.
Probably not going to get updated, I'm afraid, unless I self-publish. O'Reilly isn't getting enough sales volume for them to be interested in doing a 2nd edition.
-- Paul
identifier = Word(alphas+'_', alphanums+'_')
variable = Combine(oneOf('$ %')+identifier)
qstring = QuotedString('"', escChar='\\', multiline=False)
tstring = QuotedString("'", escChar='\\', multiline=False)
#qstring = dblQuotedString # leaves quotes in result
#tstring = sglQuotedString # leaves quotes in result
integer_literal = Combine(Optional('-')+Word(nums))
hex_literal = Combine(oneOf('0x 0X')+Word(hexnums))
float_literal = Combine(Word(nums)+'.'+Word(nums))
scinot_literal = Combine(Word(nums)+oneOf('e- e+',caseless=True)+Word(nums))
num_literal = scinot_literal|hex_literal|float_literal|integer_literal
I am working on the literal identification and am wondering about testing for the start and end of each word. I see in the examples in the O'Reilly text that the start and end are not necessarily checked, as the context provides verification of the literal. Am I thinking about this right? Or should I be making sure each literal is bounded properly?
The reason I ask is because I can get false positives using the above rules by adding chars to the beginning and end of the patterns. I know if I tell it to parse it will fail, so maybe this is a non-issue, as it would be caught there.
I've been working on building out my DSL with pyparsing and have made excellent progress. My first milestone was to evaluate expressions that contain arithmetic operators, database field references and a set of functions (Avg, Stdev, etc). In addition, I implemented assignment of expressions to variables so as to be able to build up complex expressions in a modular way. So far so good.
I have now hit my next major snag, when trying to calculate functions on variables as arguments. Specifically, my database references (which are the building block on which calcs are performed) require specifying a Person as a dimension of the query. I don't know the best way to force re-evaluation of the expressions assigned to these variables when they are contained within a function. A specific example that has problems:
CustomAvg = Avg[Height] + Avg[Weight]
Avg[CustomAvg]
In these scenarios, I have a list of People that I iterate over to calculate the components of CustomAvg. However, when I evaluate Avg[CustomAvg] the value of CustomAvg is coming from my variable lookup dict rather than being evaluated, so effectively I am iterating over a constant value. What is the best way to introduce 'awareness' into my evaluation so that the variables used as arguments within a function are re-evaluated rather than sourced from the lookup table? Here is the streamlined relevant code:
class EvalConstant(object):
    var_ = {}
    def __init__(self, tokens):
        self.value = tokens[0]
    def eval(self):
        v = self.value
        if self.var_.has_key(v):
            return self.var_[v]
        else:
            return float(v)

class EvalDBref(object):
    person_ = None
    def __init__(self, tokens):
        self.value = tokens[0]
    def eval(self):
        v = self.value
        fieldRef = v.split(':')
        source = fieldRef[0]
        field = fieldRef[1]
        rec = db[source].find_one({'Name' : self.person_}, { '_id' : 0, field : 1})
        return rec[field]

class EvalFunction(object):
    pop_ = {}
    def __init__(self, tokens):
        self.func_ = tokens.funcname
        self.field_ = tokens.arg
        self.pop_ = POPULATION
    def eval(self):
        v = self.field_.value
        fieldRef = v.split(':')
        source = fieldRef[0]
        field = fieldRef[1]
        val = self.field_.eval()
        if self.func_ == 'ZS':
            # If using zscore then fetch the field aggregates from stats
            rec = db['Stats'].find_one({'_id' : field})
            stdev = rec['value']['stddev']
            avg = rec['value']['avg']
            return (val - avg)/stdev
        elif self.func_ == 'Ptile':
            recs = list(db[source].find({'Name' : { '$in' : self.pop_}}, {'_id' : 0, field : 1}))
            recs = [r[field] for r in recs]
            return percentileofscore(recs, val)

def assign_var(tokens):
    ev = tokens.varvalue.eval()
    EvalConstant.var_[tokens.varname] = ev
#--------------------
expr = Forward()
chars = Word(alphanums + '_-/')
integer = Word(nums)
real = Combine(Word(nums) + '.' + Word(nums))
var = Word(alphas)
assign = var('varname') + '=' + expr('varvalue')
assign.setParseAction(assign_var)
dbRef = Combine(chars + OneOrMore(':') + chars)
dbRef.setParseAction(EvalDBref)
funcNames = Keyword('ZS') | Keyword('Avg') | Keyword('Stdev')
functionCall = funcNames('funcname') + '[' + expr('arg') + ']'
functionCall.setParseAction(EvalFunction)
operand = dbRef | functionCall | (real | integer| var).setParseAction(EvalConstant)
signop = oneOf('+ -')
multop = oneOf('* /')
plusop = oneOf('+ -')
expr << operatorPrecedence(operand,
[
(signop, 1, opAssoc.RIGHT, EvalSignOp),
(multop, 2, opAssoc.LEFT, EvalMultOp),
(plusop, 2, opAssoc.LEFT, EvalAddOp),
])
EvalDBref.person_ = 'John Smith'
ret = (assign | expr).parseString(line)[0]
Hey ptmcg, I am having some issues understanding how to go from simple pattern matching to fully parsing a large grammar like C. Can you recommend some resources that discuss the theory behind the construction of grammar processors? I have the 'Dragon Book' on order, which has theory and examples of using lex and yacc; hopefully that will help. I actually have a well-defined grammar for the language I want to parse, but it is in lex and yacc source files and I am having trouble understanding how to apply the rules to work with pyparsing. I am using pyparsing right now to parse messages coming from a VectorNav NV-100 and it works great for that. I like it because it cleans up the code to verify the messages in a very structured and easily documented way. I did find a bunch of links on lex and yacc to study, but I get the feeling your parser uses more modern approaches. I am just not finding references on the techniques you are using.
Thanks, Frank
Frank -
I am certainly happy that pyparsing is helping you be productive in your VectorNav application. You can see an example of a more extensive language parser by looking at the Verilog parser.
As rich a library as pyparsing is, it does cut some corners and make guesses in some cases of ambiguity, something that you really don't want a language parser doing. Pyparsing's cavalier approach to whitespace, for example, while appropriate in many everyday cases, is not really as rigorous as a language parser ought to be.
C presents its own special complexities in parsing, because of some syntax 'specialness', like leading '*'s for dereferencing pointers, typedefs - and don't even get me started on the macro preprocessor! If you want to parse a full language, try something like Pascal, whose syntax was designed up front to be parseable in a single pass. In contrast, most C compilers take several passes over the input source to do their parsing.
So if you are scouting about for languages to write a parser for using pyparsing, stick with those that can be parsed in a single pass. A syntax that can be parsed in 1 pass is more likely to be tractable using pyparsing's plodding left-to-right processing, with a minimum of lookahead and backtracking.
For more information on pyparsing's style of parsing, look for PEGs (parsing expression grammars).
-- Paul
I have developed a DSL for manipulating database content using pyparsing. I am now considering trying to add control flow (if-then-else in particular) to my language, which currently only supports direct expression evaluation with arithmetic operators and basic calc functions.
I don't really know where to start, or whether this is even a good idea to try to do with an external DSL and pyparsing. Anyone have experience with this, or advice?
A while back I wrote an article for Python Magazine describing a Brainfuck compiler/interpreter written with pyparsing. The main concept was to compile the code into a corresponding structure of executable objects (similar to what is done in the online example SimpleBool.py, but with objects for IfStatement, AssignStatement, etc.). Design a Virtual Machine in which these objects can be run, possibly something as simple as a dict of variable values. Implement for each object class a method execute(vm).
Then associate each class with the corresponding statement's expression in your parser as a parse action. When you have parsed successfully, you will get a ParseResults containing executable objects - create an empty VM and then call object.execute(vm) for each object you have parsed. For control flow (like if-then-else or for/while loops), implement the control flow in that statement's execute method. What was fun about making this a 'compiler' was that the parsed code could be pickled and saved to a file. This could then be unpickled and run directly, without having to reparse the original DSL source.
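A minimal sketch of the idea (the class and token names here are mine, not from the article):

# each parsed statement becomes an object that knows how to run itself
class AssignStatement(object):
    def __init__(self, tokens):
        self.name = tokens.varname
        self.value = tokens.varvalue
    def execute(self, vm):
        # the VM can be as simple as a dict of variable values
        vm[self.name] = self.value.execute(vm)

class IfStatement(object):
    def __init__(self, tokens):
        self.cond = tokens.condition
        self.body = tokens.body
    def execute(self, vm):
        # control flow lives inside the statement's execute method
        if self.cond.execute(vm):
            for stmt in self.body:
                stmt.execute(vm)

# after attaching these classes as parse actions and parsing:
#   program = parser.parseString(source)
#   vm = {}
#   for stmt in program:
#       stmt.execute(vm)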
HTH, -- Paul
Cool. I've seen that article referenced in a few places and tried to find the back issue online, to no avail. Do you have any idea where it might be available?
This is my first attempt to use pyparsing for serious work. I have a script that currently uses regular expressions to parse log files, looking for specific types of log lines. In my current testing, I'm being given a message containing 'Accepted publickey for nagios from 10.70.50.101 port 43382 ssh2'. I had been parsing this with the regex '(Accepted publickey for ([a-zA-Z0-9\.]+) from [0-9\.]+).*'.
If I use :
a = 'Accepted publickey for'
user = Word(alphanums + '-')
ip = Word(nums+'.')
string = a + user + Literal('from') + ip + Regex('.*')
I get:
string.searchString(msg)
([(['Accepted publickey for', 'nagios', 'from', '10.70.50.101', 'port 43382 ssh2'], {})], {})
But if I do:
string2 = Literal('Accepted publickkey for')+ Word(alphanums + '_') + Literal('from') + Word(nums + '.') + Regex('.*')
I get :
string2.searchString(msg)
([], {})
So my question is, why do I get a parsed output when I build the pieces of the match string piece by piece then combine them, but not when I build the match string in place?
Or am I just too tired and not seeing a typo?
You have a typo in string2, 'publickkey' instead of 'publickey'. Fix the typo and it should work just fine.
Welcome to pyparsing!
-- Paul
I was afraid it was something simple like that. I looked at my code a bunch of times and never saw that. Thanks!