Pyparsing Wikispaces Discussion - 2012

[Note: these entries are fairly old, and predate many new features of pyparsing, and are predominantly coded using Python 2. They are captured here for historical benefit, but may not contain the most current practices or features. We will try to add editor notes to entries to indicate when discussions have been overtaken by development events.]

2012-01-07 06:08:59 - DiaaFayed - promote nested elements ...
2012-01-09 05:40:45 - DiaaFayed - rearrange parsed tree
2012-01-11 08:39:18 - DiaaFayed - a letter plays two roles
2012-01-11 08:39:54 - DiaaFayed - how to make the parser not stop
2012-01-11 08:43:19 - DiaaFayed - add logger class
2012-01-14 08:16:21 - DiaaFayed - parsing a list of strings
2012-01-14 08:29:02 - DiaaFayed - parseResultsSumExample.py
2012-01-16 23:11:45 - 0xLeFF - Parsing nested c/c++ blocks
2012-01-19 08:18:31 - masura-san - How to tag parsed elements
2012-01-20 08:44:42 - DiaaFayed - file contains lines to be parsed
2012-01-29 11:47:37 - Phxsawdust - Creating manual ParseResults
2012-01-31 18:24:11 - oafilipoai - Catch-all pattern
2012-02-04 15:55:57 - lamakaha - error with setParseAction
2012-02-05 14:44:27 - karulis - bug + patch for ParseResults.dir in python3
2012-02-07 16:34:33 - oafilipoai - Finding the end location for a matched expression
2012-02-14 03:49:57 - ror6ax - Trying to parse a file.
2012-02-17 10:10:07 - DiaaFayed - comma separates outside parentheses
2012-02-18 19:57:34 - lamakaha - how to ignore blank lines in line oriented parser?
2012-02-21 05:54:17 - ror6ax - parsing tables
2012-02-25 13:19:21 - johnmudd - long output, is this right?
2012-02-29 00:17:37 - lesnar56 - Extending Keyword Classes
2012-03-06 06:04:21 - rrian - Unexpected results with name
2012-03-12 12:46:53 - tarruda - Need help in parsing part of python grammar
2012-03-14 12:00:22 - keirian - Recursion Help
2012-03-21 11:58:03 - maxime-esa - ambiguous grammar not detected?
2012-03-25 00:33:47 - nimbiotics - problems with delimitedList
2012-03-29 14:30:05 - nimbiotics - How can I group this?
2012-04-04 14:58:25 - HumbertMason - Parsing a list of structures line by line
2012-04-06 10:11:25 - pepinocho9 - Help with parseactions and Morgan's Law
2012-04-16 10:33:55 - takluyver - Skip optional part if following part matches
2012-04-27 12:16:42 - larapsodia - Question about "Or" statement
2012-04-28 06:48:11 - charles_w - working to understand pyparsing, setResultsName, and setParseAction
2012-05-01 01:14:04 - robintw - Labelling of results when using Each
2012-05-08 11:32:30 - side78 - Parsing nested blocks without any deterministic end
2012-05-09 18:23:50 - Caffeinix - C++ qualified types
2012-05-21 12:08:56 - dGRp - Building AST for n-op abstract algebras
2012-05-23 04:27:27 - Madan2 - TypeError: unsupported operand type(s) for ^: 'NoneType' and 'NoneType'
2012-05-25 05:33:30 - dGRp - Some thoughts and questions on improvement
2012-05-26 14:51:08 - BrenBarn - get original text from ParseResults
2012-06-07 09:56:55 - Madan2 - combine - space bn tokens
2012-06-10 06:12:29 - bsr203 - Rules for Repeating sections of data
2012-06-10 14:23:11 - ofuhrer - Replace only locally
2012-06-11 02:51:37 - willem179 - ParseResults inconsistent after setitem
2012-06-25 11:31:52 - Madan2 - Dealing with "" in data
2012-06-26 16:47:14 - chlim - parsing identical strings and multi-lines
2012-07-02 07:16:21 - DiaaFayed - dynamic extractor statement
2012-07-02 12:52:58 - DiaaFayed - a new feature of the Python eval and exec commands
2012-07-03 13:30:03 - BrenBarn - Copying ParseResults attributes
2012-07-04 08:25:29 - einar77 - Parsing multi-line records
2012-07-11 17:36:06 - chlim - svn syntax
2012-07-16 08:45:49 - DiaaFayed - how can we restore setParseAction results ?
2012-07-18 07:37:10 - paulelastic - Parsing Expression Grammar (PEG)
2012-07-18 07:46:50 - paulelastic - Visual debugger for pyparsing
2012-08-09 07:35:47 - Leevi3 - multiple parse actions
2012-08-17 14:36:15 - script_lover - operatorPrecedence generate rules that cannot be validated
2012-08-17 14:48:13 - script_lover - Avoid duplicating rules
2012-08-22 05:56:40 - Leevi3 - nesting depth of operator precedence parse results
2012-08-25 05:11:18 - simbera.jan - Getting a ParseResults line number
2012-09-10 02:50:10 - acjackson5 - Help with datetime conversion
2012-09-10 05:09:15 - darkest_star - Parse a logfile and detect repetitve textblocks
2012-09-12 15:11:17 - sravet - need help with verilog parser
2012-09-30 23:19:48 - cqqhzxgh - match multiple lines
2012-10-01 23:38:09 - cqqhzxgh - question with scanString
2012-10-05 14:19:50 - dlwatey - Beginner Question
2012-10-07 21:20:38 - RunSilent - parseString works, scanString fails?
2012-10-08 08:22:01 - DiaaFayed - How can we transform pyparsing string to pyparsing expression
2012-10-15 15:14:28 - jsy1972 - question re: indentation and grouped stmts
2012-10-17 13:30:37 - DiaaFayed - plz give explain and examples ...
2012-10-18 13:50:33 - chlim - Parsing single and multiple records
2012-10-22 07:30:51 - DiaaFayed - can we simulate the caret and dollar sign functions .?
2012-10-22 12:25:04 - DiaaFayed - simulate caret and dollar sign in regular expression
2012-10-22 12:34:59 - tvn1981 - Getting line number where error occurs
2012-10-25 00:26:43 - kmbt - Match only at selected lines
2012-10-25 00:50:16 - ranjith19 - How to get a perticular behaviour with a function parser?
2012-10-26 04:57:27 - pypetey - buildout: Couldn't find a setup script
2012-10-26 11:46:25 - dlwatey - Handling special Characters
2012-10-29 09:04:41 - dlwatey - Getting closer and clearer
2012-11-06 13:39:41 - tvn1981 - Very slow parsing a simple First order logic form
2012-11-13 00:24:07 - DiaaFayed - expressions matcher module
2012-11-13 01:56:45 - DiaaFayed - the space and paranthesis
2012-11-15 05:55:16 - DiaaFayed - Questions about scaneExamples.py
2012-11-15 06:03:12 - DiaaFayed - Question about Copy()
2012-11-22 09:15:57 - cadourian - How to improve parser performance
2012-11-26 10:12:33 - DiaaFayed - Design Pattern: Chain of Responsibility
2012-11-26 17:38:27 - rogersanchez75 - Arithmetic evaluation with variables
2012-11-28 20:37:48 - rogersanchez75 - Eval functions in arith expressions
2012-11-29 21:07:56 - torfat - parsing C function calls
2012-12-05 14:38:02 - Demolishun - New to pyparser and impressed by capabilities
2012-12-05 21:14:00 - Demolishun - Working on Literal Identification
2012-12-10 15:58:57 - rogersanchez75 - Further DSL and function parsing development
2012-12-12 00:30:30 - Demolishun - Trouble with moving beyond basic pattern matching.
2012-12-22 20:41:51 - rogersanchez75 - Control flow
2012-12-24 11:59:11 - catdude - Clarification regarding building a search sting


2012-01-07 06:08:59 - DiaaFayed - promote nested elements ...

Please, sir:

when capturing the structure of a match in an XML tree structure,

how can we promote deeply nested elements to higher levels?

thanks


2012-01-09 05:40:45 - DiaaFayed - rearrange parsed tree

if I have a parse tree like

item
--------item1
--------item2
----------------item2.1
----------------item2.2
------------------------item2.2.1
------------------------item2.2.2
----------------item2.3
--------item3
--------item4

and need to reorganize the tree to shape

item
--------item1
--------item2
----------------item2.1
----------------item2.2
--------item2.2.1
--------item2.2.2
----------------item2.3
--------item3
--------item4

the purpose is to export the information to a relational table with columns

<item1, item2, item2.2.1, item2.2.2, item3, item4>

could I get the second tree while parsing?


2012-01-11 08:39:18 - DiaaFayed - a letter plays two roles

In an Arabic string (Unicode or UTF-8), the author sometimes uses the dash (ASCII) character as a separator, or uses the tatweel character as a separator. The problem is that the tatweel is also one of the alphas that make up a word, so the parsing raises errors. The tatweel playing two roles is the cause of the error.

2012-01-16 20:17:37 - ptmcg

Is the tatweel really a valid alpha? Or is it included with alphas because alphas is locale-sensitive? Can you define your own subset of alphas that omits tatweel?

Let's say 'x' was like tatweel. Here is how to define a word of alphas with 'x' as separators:

aword = Word(alphas, excludeChars='x')

expr = delimitedList(aword, delim='x')

print expr.parseString('sldkjfzxlskaopweiurxlkszaxlsdf')

prints:

['sldkjfz', 'lskaopweiur', 'lksza', 'lsdf']

2012-01-17 10:22:59 - DiaaFayed

Thanks for your reply, but please see the example. Assume the dash '-' is used as a separator and can also be a letter of a word:

dash = '-'
word = Word(alphas + dash)
sentence1 = 'diaa fayed- - engineer'
sentence = OneOrMore(word) + Suppress('-') + OneOrMore(word)
print sentence.parseString(sentence1)

the error

print sentence.parseString(sentence1)
  File 'C:\Python26\lib\site-packages\pyparsing.py', line 1032, in parseString
    raise exc
ParseException: Expected '-' (at char 22), (line:1, col:23)

I need the result to be

['diaa fayed-', 'engineer']

2012-01-18 21:29:03 - ptmcg

Thank you for providing an example that displays adequately in ASCII characters. I'll refer to '-' in your example as a stand-in for tatweel.

What distinguishes the lone '-' as a separator, instead of being a single-character word? Can words start with a '-'? If '-' is a separator, does it have to have whitespace on either side?

Try this:

dash = Keyword('-')
word = Word(alphas+'-')
sentence = delimitedList(OneOrMore(~dash+word), dash)

This assumes that '-' is a separator if and only if it passes the test of being a standalone keyword. The OneOrMore does a lookahead to ensure that it does not accidentally read the lone '-' as a word.

In cases like this where there is some ambiguity, you must ask yourself questions as if you were playing the part of the parser. How can you tell the difference between a tatweel that is a word character and a tatweel that is a separator? If some sort of lookahead is required, then implement that with ~ or FollowedBy. Be aware that '-' will match Word(alphas+'-'). Also be aware that OneOrMore will match repetitions as long as it can, even if the next expression in the parser is a match also - pyparsing does NO implicit lookahead. In this way, it is unlike regular expressions.

-- Paul
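
Editor's note: here is Paul's suggestion applied to the '-' sample as a runnable sketch; the originalTextFor wrapper is an addition here (not in the reply above) to keep each run of words together as one string:

from pyparsing import Keyword, Word, alphas, OneOrMore, delimitedList, originalTextFor

dash = Keyword('-')
word = Word(alphas + '-')
# each list item is one or more non-separator words, rejoined as original text
sentence = delimitedList(originalTextFor(OneOrMore(~dash + word)), dash)
print sentence.parseString('diaa fayed- - engineer')

# prints:
# ['diaa fayed-', 'engineer']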

2012-01-20 06:40:34 - DiaaFayed

Thanks very much. Please let me suggest some ideas for your parser:

  1. Add more documentation and examples for rarely used methods.

  2. The discussions lead to new ideas that you can add to the code regularly.

  3. You and we can suggest new problems for visitors to implement, and then add them to the examples.


2012-01-11 08:39:54 - DiaaFayed - how to make the parser not stop

Please, sir:

if I parse a list of strings, how do I make the parser not stop when there is an error in one of the strings, but instead log the error and then continue parsing?


2012-01-11 08:43:19 - DiaaFayed - add logger class

I suggest adding a logger class to the pyparsing library. This would facilitate the debugging process. The logger should save exceptions to text files, the screen, or ...

2012-01-16 20:09:27 - ptmcg

Pyparsing supports debugging on individual expressions, by calling setDebug. If an expression has been set for debugging, then every time the grammar tries to evaluate that expression, the location of the parse is logged, followed by either the successfully parsed tokens or the resulting exception:

s = 'ABC DEF 123 XYZ'

aword = Word(alphas)
integer = Word(nums)

aword.setDebug()

OneOrMore(aword | integer).parseString(s)

prints

Match W:(abcd...) at loc 0(1,1)
Matched W:(abcd...) -> ['ABC']
Match W:(abcd...) at loc 3(1,4)
Matched W:(abcd...) -> ['DEF']
Match W:(abcd...) at loc 7(1,8)
Exception raised:Expected W:(abcd...) (at char 8), (line:1, col:9)
Match W:(abcd...) at loc 11(1,12)
Matched W:(abcd...) -> ['XYZ']
Match W:(abcd...) at loc 15(1,16)
Exception raised:Expected W:(abcd...) (at char 15), (line:1, col:16)
(['ABC', 'DEF', '123', 'XYZ'], {})

2012-01-14 08:16:21 - DiaaFayed - parsing a list of strings

When parsing a list of strings,

how do I make pyparsing not stop parsing if one of the strings has an error,

but only print or log the error and resume parsing?

2012-01-16 20:04:10 - ptmcg

Pyparsing raises ordinary Python exceptions, so if you are parsing one string at a time in a list, just call parseString inside a try-except block:

from pyparsing import Word, alphas, ParseException

strings = [
    'ABC',
    'DEF',
    '123',
    'xyz',
    ]

for s in strings:
    try:
        print Word(alphas).parseString(s)
    except ParseException as pe:
        print s, pe

prints

['ABC']
['DEF']
123 Expected W:(abcd...) (at char 0), (line:1, col:1)
['xyz']

2012-01-14 08:29:02 - DiaaFayed - parseResultsSumExample.py

in the example

parseResultsSumExample.py

samplestr1 = 'garbage;DOB 10-10-2010;more garbage\nID PARI12345678;more garbage'
samplestr2 = 'garbage;ID PARI12345678;more garbage\nDOB 10-10-2010;more garbage'
samplestr3 = 'garbage;DOB 10-10-2010'
samplestr4 = 'garbage;ID PARI12345678;more garbage- I am cool'

from pyparsing import *
dob_ref = 'DOB' + Regex(r'\d{2}-\d{2}-\d{4}')('dob')
id_ref = 'ID' + Word(alphanums,exact=12)('id')
info_ref = '-' + restOfLine('info')

person_data = dob_ref | id_ref | info_ref

for test in (samplestr1,samplestr2,samplestr3,samplestr4,):
    person = sum(person_data.searchString(test))
    print person.id
    print person.dump()
    print

If we assume one of the strings has an error and raises an exception, how do we make pyparsing continue to parse the remaining strings?

2012-01-16 20:10:43 - ptmcg

Again, this is just standard Python exception handling - see my answer to the other message you posted.


2012-01-16 23:11:45 - 0xLeFF - Parsing nested c/c++ blocks

Hello... I need to parse nested C/C++-like blocks of code like this:

{
int A1 = 100;
int A2 = 200;

int B1 = 100;
int B2 = 200;
{
int _A1 = 100;
int _A2 = 200;

int _B1 = 100;
int _B2 = 200;
}
}

and I'd like to get this kind of output from the parser:

['int A1 = 100;\nint A2=200;\n', 'int B1 = 100;\nint B2 = 200\n', ['int _A1 = 100;\nint _A2 = 200;\n', 'int _B1 = 100;\nint _B2 = 200;']]

I hope I spelled it right. Could you tell me, what is the best way to do it? I tried using nestedExpr, but failed.

PS: there can be any valid C/C++ code in the place of 'int A1 = 100'; I made this example for the sake of simplicity.

2012-01-16 23:43:55 - 0xLeFF

when I'm parsing only inner code blocks like this:

Txt = '''
const int A1 = 100;
const int A2 = 200;

const int B1 = 100;
const int B2 = 200;

const int C1 = 100;
const int C2 = 200;
'''

EmptyLine = Suppress(lineEnd + lineEnd)
CodeBlock = ZeroOrMore(SkipTo(EmptyLine) + Optional(EmptyLine))

print(CodeBlock.parseString(Txt))

I get the desired results, but when I'm trying to use nestedExpr I get an infinite loop:

print(nestedExpr('{', '}', CodeBlock).parseString(Txt))

where '{' and '}' were added to Txt variable...

2012-01-17 00:38:31 - ptmcg

See if these give you any ideas on things to try:

code = '''\
{
int A1 = 100;
int A2 = 200;

int B1 = 100;
int B2 = 200;
{
int _A1 = 100;
int _A2 = 200;

int _B1 = 100;
int _B2 = 200;
}
}
'''
from pyparsing import *

p1 = nestedExpr('{','}')
print p1.parseString(code)

# prints
# [['int', 'A1', '=', '100;', 'int', 'A2', '=', '200;', 'int', 'B1', '=', '100;', 'int', 'B2', '=', '200;', ['int', '_A1', '=', '100;', 'int', '_A2', '=', '200;', 'int', '_B1', '=', '100;', 'int', '_B2', '=', '200;']]]

cStatement = ~oneOf('{ }') + SkipTo(';') + ';'
content = originalTextFor(OneOrMore(cStatement))
p2 = nestedExpr('{','}', content=content)
print p2.parseString(code)

# prints
# [['int A1 = 100;\nint A2 = 200;\n \nint B1 = 100;\nint B2 = 200;', ['int _A1 = 100;\nint _A2 = 200;\n \nint _B1 = 100;\nint _B2 = 200;']]]

cStatement = Forward()
cStatement << (originalTextFor(~oneOf('{ }') + SkipTo(';') + ';') |
                nestedExpr('{','}', content=cStatement))
p3 = OneOrMore(cStatement)
print p3.parseString(code)

# prints
# [['int A1 = 100;', 'int A2 = 200;', 'int B1 = 100;', 'int B2 = 200;', ['int _A1 = 100;', 'int _A2 = 200;', 'int _B1 = 100;', 'int _B2 = 200;']]]

ParserElement.setDefaultWhitespaceChars(' \t')
EOL = LineEnd()
cStatement = SkipTo(';', failOn=oneOf('{ }')|EOL) + ';'
content = originalTextFor(OneOrMore(cStatement + EOL)) | (Empty()+EOL).suppress()
p4 = nestedExpr('{','}', content=content)
print p4.parseString(code)

# prints
# [['int A1 = 100;\nint A2 = 200;\n', 'int B1 = 100;\nint B2 = 200;\n', ['int _A1 = 100;\nint _A2 = 200;\n', 'int _B1 = 100;\nint _B2 = 200;\n']]]

-- Paul


2012-01-19 08:18:31 - masura-san - How to tag parsed elements

Hi,

I've been working on a grammar that parses spec files. My grammar works, but the output I'm getting is not what I expect. The specs can be very complex: they have 3 levels, and on each level it should be possible to have different sorts of items. Also, I need to parse everything - a full in-memory representation of the spec - which means I also need to know the locations of empty lines.

I made a small sample to illustrate my problem. It's not very clear though; it's hard to make my problem clear. I know you can tag elements using the setResultsName method, but sometimes the result is a list of different kinds of items, and then I can only get the attributes of those items.

It's possible to determine the type of element based on the attributes, but that's not a good solution. So I have a few questions:

Does my explanation make any sense? :) Is there a way to tag items in another way?

Should I solve this problem by defining a parse action for each type of item that adds an attribute 'itemType' to every token of a certain type?

spec = '''\
# lorem ipsum

# lorem ipsum
[version:1.0]
# lorem ipsum

'''

from pyparsing import *

# set spaces and tabs as parser default white space
ParserElement.setDefaultWhitespaceChars(' \t')

# definitions missing from the original post (reconstructed so the sample runs)
singleDigitNumber = Word(nums, exact=1)
period = Literal('.')

lineEnd = emptyLine = Suppress(LineEnd()('emptyLine'))
numberSign = Suppress(Literal('#').setName('number sign (#)'))
leftSquareBracket =  Suppress(Literal('[').setName('left square bracket ([)'))
rightSquareBracket = Suppress(Literal(']').setName('right square bracket (])'))
colon = Suppress(Literal(':').setName('colon (:)'))

singleLineComment = Group(numberSign + SkipTo(lineEnd) + lineEnd)
singleLineComment = singleLineComment.setResultsName('comment')
singleLineComment.setName('comment')

versionLiteral = Suppress(Literal('version'))
versionLiteral.setResultsName('version')
versionLiteral.setName('version literal')

releaseNumber = Combine(singleDigitNumber + period + singleDigitNumber)
releaseNumber.setResultsName('releaseNumber')
releaseNumber.setName('release number')

version = Group(leftSquareBracket + versionLiteral + colon + releaseNumber + rightSquareBracket + lineEnd)
version = version.setResultsName('version')
version.setName('version')

grammar = OneOrMore(emptyLine | singleLineComment | version)
results = grammar.parseString(spec, parseAll=True)

2012-01-22 21:35:24 - ptmcg

Instead of tagging the parse results with a type, I suggest using the parse results to construct an object. Here is a sample of creating Shape objects from simple format strings:

class Shape(object):
    def __init__(self, tokens):
        self.__dict__.update(tokens.asDict())

    def area(self):
        raise NotImplementedError()

    def __str__(self):
        return '<%s>: %s' % (self.__class__.__name__, self.__dict__)

class Square(Shape):
    def area(self):
        return self.side**2

class Rectangle(Shape):
    def area(self):
        return self.width * self.height

class Circle(Shape):
    def area(self):
        return 3.14159 * self.radius**2

from pyparsing import *

number = Regex(r'-?\d+(\.\d*)?').setParseAction(lambda t:float(t[0]))

# Shape expressions:
#   square : S <centerx> <centery> <side>
#   rectangle: R <centerx> <centery> <width> <height>
#   circle : C <centerx> <centery> <diameter>

squareDefn = 'S' + number('centerx') + number('centery') + number('side')
rectDefn = 'R' + number('centerx') + number('centery') + number('width') + number('height')
circleDefn = 'C' + number('centerx') + number('centery') + number('diameter')

squareDefn.setParseAction(Square)
rectDefn.setParseAction(Rectangle)

def computeRadius(tokens):
    tokens['radius'] = tokens.diameter/2.0
circleDefn.setParseAction(computeRadius, Circle)

shapeExpr = squareDefn | rectDefn | circleDefn

tests = '''\
C 0 0 100
R 10 10 20 50
S -1 5 10'''.splitlines()

for t in tests:
    shape = shapeExpr.parseString(t)[0]
    print shape
    print 'Area:', shape.area()
    print

prints

<Circle>: {'diameter': 100.0, 'radius': 50.0, 'centerx': 0.0, 'centery': 0.0}
Area: 7853.975

<Rectangle>: {'width': 20.0, 'height': 50.0, 'centerx': 10.0, 'centery': 10.0}
Area: 1000.0

<Square>: {'side': 10.0, 'centerx': -1.0, 'centery': 5.0}
Area: 100.0

You can see another example on the Examples page, SimpleBool.py.

-- Paul


2012-01-20 08:44:42 - DiaaFayed - file contains lines for parsed.

I have a file that contains lines, and I want to parse each line according to a grammar, and also write each parsed line to a file accompanied by its line number. This way I want to identify the successful and failed lines by writing a logger or output file.

How do I preserve the line number?

2012-01-22 19:06:31 - ptmcg

Diaa -

Your code is iterating through the file line by line, so pyparsing does not really have visibility to the separate line numbers. But your code does. You can iterate through the file and keep a line number variable yourself, or wrap the file iterator in 'enumerate' and get the line number and line for each line in the file.

Diaa, the questions you are asking are very basic, and I fear that you really need more programming experience before trying to write a pyparsing application.

-- Paul

2012-01-23 02:51:37 - DiaaFayed

For Python, yes, I need more experience.

For this question specifically, I was waiting for you to talk about parseFile and using LineEnd without reading the file line by line. I really can read the file line by line and then use parseString(), but I wanted to understand parseFile and LineEnd in order to use a callback function that returns col, lineno, and tokens, to write this information to the output file in one shot.

2012-01-23 05:54:11 - ptmcg

Ah, now I have a clearer picture of what you are asking. I have to head to work now, but I will write up some examples when I get home this evening.

2012-01-24 02:16:17 - ptmcg

You can use a parse action to add the current line number to individual tokens or whole lines. Or just attach a parse action to LineEnd that returns the line number. See the following code with embedded comments:

text = '''\
Lorem ipsum dolor sit amet, consectetur 
adipisicing elit, sed do eiusmod 
tempor incididunt ut labore et dolore 
magna aliqua. Ut enim ad minim veniam, 
quis nostrud exercitation ullamco 
laboris nisi ut aliquip ex ea 
commodo consequat. Duis aute irure 
dolor in reprehenderit in voluptate 
velit esse cillum dolore eu fugiat 
nulla pariatur. Excepteur sint occaecat 
cupidatat non proident, sunt in culpa 
qui officia deserunt mollit anim id 
est laborum.'''

from pyparsing import *

# add line and col to each word
word = Word(alphas)
word.setParseAction(lambda s,l,t: (t[0], lineno(l,s), col(l,s)))

# use transformString since the input text contains non-words too (like '.' and ',')
print word.transformString(text)
print

# another approach - add line number to each line

# remove \n from the list of default whitespace
ParserElement.setDefaultWhitespaceChars(' \t')
word = Word(alphas)
punc = oneOf('. ,')
eol = LineEnd()
textline = OneOrMore(word | punc) + eol
textline.setParseAction(lambda s,l,t: [str(lineno(l,s)),] + t.asList())
corpus = OneOrMore(Group(textline))

# create inmemory file-like object using StringIO
# could have just as easily used parseString(text), but you asked
# specifically about parseFile
from cStringIO import StringIO
textfile = StringIO(text)

lines = corpus.parseFile(textfile)
for l in lines:
    print l

prints:

('Lorem', 1, 1) ('ipsum', 1, 7) ('dolor', 1, 13) ('sit', 1, 19) ('amet', 1, 23), ('consectetur', 1, 29) 
('adipisicing', 2, 1) ('elit', 2, 13), ('sed', 2, 19) ('do', 2, 23) ('eiusmod', 2, 26) 
('tempor', 3, 1) ('incididunt', 3, 8) ('ut', 3, 19) ('labore', 3, 22) ('et', 3, 29) ('dolore', 3, 32) 
('magna', 4, 1) ('aliqua', 4, 7). ('Ut', 4, 15) ('enim', 4, 18) ('ad', 4, 23) ('minim', 4, 26) ('veniam', 4, 32), 
('quis', 5, 1) ('nostrud', 5, 6) ('exercitation', 5, 14) ('ullamco', 5, 27) 
('laboris', 6, 1) ('nisi', 6, 9) ('ut', 6, 14) ('aliquip', 6, 17) ('ex', 6, 25) ('ea', 6, 28) 
('commodo', 7, 1) ('consequat', 7, 9). ('Duis', 7, 20) ('aute', 7, 25) ('irure', 7, 30) 
('dolor', 8, 1) ('in', 8, 7) ('reprehenderit', 8, 10) ('in', 8, 24) ('voluptate', 8, 27) 
('velit', 9, 1) ('esse', 9, 7) ('cillum', 9, 12) ('dolore', 9, 19) ('eu', 9, 26) ('fugiat', 9, 29) 
('nulla', 10, 1) ('pariatur', 10, 7). ('Excepteur', 10, 17) ('sint', 10, 27) ('occaecat', 10, 32) 
('cupidatat', 11, 1) ('non', 11, 11) ('proident', 11, 15), ('sunt', 11, 25) ('in', 11, 30) ('culpa', 11, 33) 
('qui', 12, 1) ('officia', 12, 5) ('deserunt', 12, 13) ('mollit', 12, 22) ('anim', 12, 29) ('id', 12, 34) 
('est', 13, 1) ('laborum', 13, 5).

['1', 'Lorem', 'ipsum', 'dolor', 'sit', 'amet', ',', 'consectetur', '\n']
['2', 'adipisicing', 'elit', ',', 'sed', 'do', 'eiusmod', '\n']
['3', 'tempor', 'incididunt', 'ut', 'labore', 'et', 'dolore', '\n']
['4', 'magna', 'aliqua', '.', 'Ut', 'enim', 'ad', 'minim', 'veniam', ',', '\n']
['5', 'quis', 'nostrud', 'exercitation', 'ullamco', '\n']
['6', 'laboris', 'nisi', 'ut', 'aliquip', 'ex', 'ea', '\n']
['7', 'commodo', 'consequat', '.', 'Duis', 'aute', 'irure', '\n']
['8', 'dolor', 'in', 'reprehenderit', 'in', 'voluptate', '\n']
['9', 'velit', 'esse', 'cillum', 'dolore', 'eu', 'fugiat', '\n']
['10', 'nulla', 'pariatur', '.', 'Excepteur', 'sint', 'occaecat', '\n']
['11', 'cupidatat', 'non', 'proident', ',', 'sunt', 'in', 'culpa', '\n']
['12', 'qui', 'officia', 'deserunt', 'mollit', 'anim', 'id', '\n']
['13', 'est', 'laborum', '.']

2012-01-27 06:13:25 - DiaaFayed

Thanks very much. I am still trying the second approach, but there are some errors. I am using asXML() to write the output to an XML file, and I am using Unicode for Arabic. I am trying to adjust your technique to work with asXML.

2012-01-27 06:16:28 - DiaaFayed

The reason for using asXML is to use the xml.etree.ElementTree module to restructure and arrange the output tree for the whole file, in order to convert the output file to a relational database for future postprocessing.

2012-01-27 06:19:03 - DiaaFayed

The advantage of asXML is that the output is an explicit string, so it is easier to write out to a text file.
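
Editor's note: a minimal sketch of the asXML()-to-ElementTree round trip described here, using the DOB pattern from the parseResultsSumExample thread above. (asXML() existed in pyparsing at the time, but was removed in later releases.)

import xml.etree.ElementTree as ET
from pyparsing import Regex

dob_ref = 'DOB' + Regex(r'\d{2}-\d{2}-\d{4}')('dob')
result = dob_ref.parseString('DOB 10-10-2010')

# asXML() renders the results as an XML string, which ElementTree can load
root = ET.fromstring(result.asXML('record'))
print ET.tostring(root)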


2012-01-29 11:47:37 - Phxsawdust - Creating manual ParseResults

I am working on a parsing project where I need to inject some manually created ParseResults into a parsed token. I have attached a parse action at the appropriate place in my code, and I seem to have succeeded in creating a custom-made ParseResults to add back into my larger grammar. dump() and asXML() seem to output correctly, but other parts of my code (trying to access the created results by name) have issues. I can access by list position, but not by assigned name. It is entirely possible that my limited Python knowledge is tripping me up somewhere, but since I have not been able to find an example of creating a ParseResults quite this way, I thought I would start here. Here is my ParseResults creation code. tripHeaderCustomFields is attached as a parse action. If a particular value is parsed (i.e. 'TripCode'), then some custom ParseResults are created and added back into the final result.

If anyone has tried to create manual parseresults like this, could you please look over my code and tell me if you see any glaring problems? It took hours of trial and error to get this version to work, and I would not be surprised if there is a better or more correct way.

    def addCustomField( self, group, name, datatype, value ):
        '''
        custom fields:
        Group: ie, specific airline or category - 'USAir, 'general'
        Name: name of field, ie 'linecheck', 'Medical', 'Deadhead', 'IV Pay'
        DataType: string, int, date, time
        Value: value of field, ie. 'checked by joe shmo, #2345', or '1st class medical - bryman'
        '''
        #TODO: Need to ask for help, some logic problem somewhere. losing string name somewhere, but xml prints ok!


        prGroup = ParseResults( group, self.NAME.CFGROUP )
        prName = ParseResults( name, self.NAME.CFNAME )
        prDataType = ParseResults( datatype, self.NAME.CFDATATYPE )
        prValue = ParseResults( value, self.NAME.CFVAULE )

        prList = ParseResults( [] )
        prList += prGroup
        prList += prName
        prList += prDataType
        prList += prValue

        customField = ParseResults( [prList], self.NAME.CUSTOMFIELD )


        return customField


    def tripHeaderCustomFields( self, tokens ):
        parseSegment = tokens
        if 'TripCode' in parseSegment:
            customField = self.addCustomField( 'USAir', 'PairingCode', 'String', parseSegment['TripCode'] )
            if self.NAME.CUSTOMFIELDS in parseSegment:
                parseSegment[self.NAME.CUSTOMFIELDS] += customField
            else :
                parseSegment += ParseResults( [customField], self.NAME.CUSTOMFIELDS )
        if 'Charter' in parseSegment[self.NAME.EFFECTIVEDOWS]:
            customField = self.addCustomField( 'USAir', 'Charter', 'Boolean', 'True' )
            if self.NAME.CUSTOMFIELDS in parseSegment:
                parseSegment[self.NAME.CUSTOMFIELDS] += customField
            else :
                parseSegment += ParseResults( [customField], self.NAME.CUSTOMFIELDS )
        return tokens

returns a seemingly correct token,

<CustomFields>
  <CustomField>
    <Group>USAir</Group>
    <Name>EquipmentChange</Name>
    <DataType>Boolean</DataType>
    <Value>True</Value>
  </CustomField>
  <CustomField>
    <Group>USAir</Group>
    <Name>EquipmentChange</Name>
    <DataType>Boolean</DataType>
    <Value>True</Value>
  </CustomField>
</CustomFields>

that goes into a bigger result:

<Trip>
    <TripNumber>8510</TripNumber>
    <EffectiveDOWs>
      <EXCPT>EXCPT</EXCPT>
      <DayOfWeek>MO</DayOfWeek>
      <DayOfWeek>TH</DayOfWeek>
      <DayOfWeek>FR</DayOfWeek>
    </EffectiveDOWs>
    <ReportTime>
      <Hours>21</Hours>
      <Minutes>40</Minutes>
    </ReportTime>
    <TripCode>N</TripCode>
    <EffectiveDateStart>
      <Month>APR</Month>
      <Day>02</Day>
    </EffectiveDateStart>
    <EffectiveDateEnd>
      <Month>APR</Month>
      <Day>27</Day>
    </EffectiveDateEnd>
    <CustomFields>
      <CustomField>
        <Group>USAir</Group>
        <Name>PairingCode</Name>
        <DataType>String</DataType>
        <Value>N</Value>
      </CustomField>
    </CustomFields>
    <RequiredCrew>
      <Captain>1</Captain>
      <FO>1</FO>
    </RequiredCrew>

    .....snip....

</Trip>

2012-01-29 17:16:17 - Phxsawdust

Cross posted to

2012-01-30 20:20:03 - ptmcg

I can't see anything wrong with what you are doing. You are essentially implementing in your parse action what would have happened in the parser if those fields had been in the input stream, and that is just fine. The only comment I can make is that, since you are modifying the tokens object directly, it is not necessary to return it from the routine; you can just return None or not return anything. Pyparsing interprets a None return from a parse action as 'use the current tokens object'. I do this all the time.

Very clever technique to inject extra 'marker' values. Even though it took you hours to figure out, in the end, I think your code looks pretty direct.

-- Paul

2012-02-01 14:33:38 - Phxsawdust

Thanks for looking it over. I don't have the background to be able to read your code and figure out how it works, so I really appreciate your feedback. I feel a bit like the caveman with the TV remote: with perseverance I can find ESPN, but I'm not quite sure how I got there.... The bad news is I still have a problem with my code somewhere (else). I'll have to dive back in and see if I can tease it out.

2012-02-08 12:18:12 - Phxsawdust

I have reworked my custom ParseResults code, and it now works as expected. I wish I had thought of doing it this way the first time, as it was much easier to figure out. :) I do tend to reinvent the wheel... tripHeaderCustomFields is attached as a parse action, and the new ParseResults are added to the parent ParseResults.

    def tripHeaderCustomFields( self, tokens ):
        parseSegment = tokens
        if 'TripCode' in parseSegment:
            customField = self.addCustomField( 'USAir', 'PairingCode', 'String', parseSegment['TripCode'], parseSegment )
        if 'Charter' in parseSegment[self.NAME.EFFECTIVEDOWS]:
            customField = self.addCustomField( 'USAir', 'Charter', 'Boolean', 'True', parseSegment )

    def buildCustomFieldString( self, group, name, datatype, value ):
        #TODO: replace any stray '|' that might be in input strings

        text = group + '|' + name + '|' + datatype + '|' + value
        return text

    def addCustomField( self, group, name, datatype, value, token ):
        '''
        custom fields:
        Group: ie, specific airline or category - 'USAir, 'general'
        Name: name of field, ie 'linecheck', 'Medical', 'Deadhead', 'IV Pay'
        DataType: string, int, date, time
        Value: value of field, ie. 'checked by joe shmo, #2345', or '1st class medical - bryman'
           <CustomFields>
              <CustomField>
                <Group>USAir</Group>
                <Name>EquipmentChange</Name>
                <DataType>Boolean</DataType>
                <Value>True</Value>
              </CustomField>
              <CustomField>
                <Group>USAir</Group>
                <Name>EquipmentChange</Name>
                <DataType>Boolean</DataType>
                <Value>True</Value>
              </CustomField>
            </CustomFields>
        '''
        pGroup = Word( alphanums )( self.NAME.CFGROUP )
        pName = Word( alphanums )( self.NAME.CFNAME )
        pDatatype = Word( alphanums )( self.NAME.CFDATATYPE )
        pValue = Word( alphanums )( self.NAME.CFVAULE )
        delim = Suppress( '|' )
        customField = Group( pGroup + delim + pName + delim + pDatatype + delim + pValue )( self.NAME.CUSTOMFIELD )
        text = self.buildCustomFieldString( group, name, datatype, value )
        if self.NAME.CUSTOMFIELDS in token:
            token[self.NAME.CUSTOMFIELDS] += customField.parseString( text )
        else :
            token += Group( customField )( self.NAME.CUSTOMFIELDS ).parseString( text )

2012-01-31 18:24:11 - oafilipoai - Catch-all pattern

I'm trying to parse a list of statements enclosed in {} brackets. I only care about some of the statements and I want to avoid writing an exhaustive grammar for all possible statement types.

import pyparsing as pp

s = 'my_keyword(){known1 unknown0  known2  unknown1  unknown2 }'
known = pp.Regex('known.').setResultsName('known')
other = pp.Word(pp.alphanums)
pat = (
        pp.Keyword('my_keyword')
        + pp.nestedExpr(opener='(', closer=')')
        + pp.nestedExpr(opener='{', closer='}', content=pp.OneOrMore(known | other) )
        )
for x in pat.scanString(s):
    print x

This works fine, as the 'other' pattern matches all the unknown statements.

However if I modify the input string as shown below the scanString does not return any output.

s = 'my_keyword(){known1 unknown0 ; known2  unknown1  unknown2 }'

This is obviously because ';' is not an alphanumeric character. Is there a catch-all pattern I can use to match everything not matched by the known pattern? Alternatively, is there a better way of extracting only the known statements from code enclosed between curly brackets?

2012-01-31 23:04:19 - ptmcg

Try pp.Word(pp.printables, excludeChars='{}')
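
Editor's note: the suggested catch-all dropped into the original example, as a runnable sketch:

import pyparsing as pp

s = 'my_keyword(){known1 unknown0 ; known2  unknown1  unknown2 }'
known = pp.Regex('known.').setResultsName('known')
# matches any run of printable characters except the braces themselves
other = pp.Word(pp.printables, excludeChars='{}')
pat = (pp.Keyword('my_keyword')
       + pp.nestedExpr(opener='(', closer=')')
       + pp.nestedExpr(opener='{', closer='}', content=pp.OneOrMore(known | other)))
for x in pat.scanString(s):
    print x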

2012-02-02 22:03:14 - oafilipoai

Thanks for the reply.

I have another question. I'm trying to match a block of text enclosed between curly brackets which contains one known token and other unknown stuff:

import pyparsing as pp
s = 'dummy {dummy0 test dummy1} {dummy2, dummy3}'
other = pp.Word(initChars=pp.printables, excludeChars='{}')
pat = pp.nestedExpr(opener='{', closer='}', content=pp.Each([pp.Literal('test'), pp.ZeroOrMore(other)]))

for x in pat.scanString(s):
    print x

I would expect the above pattern to match the first block of text included in {} and not the second one. However, nothing is matched. What is the best way to accomplish my goal?

2012-02-05 23:01:11 - ptmcg

DIAGNOSIS:

I rewrote your program as follows, creating an expression for CONTENT, which I could then name and enable debugging:

s = 'dummy {dummy0 test dummy1} {dummy2, dummy3}'
other = pp.Word(initChars=pp.printables, excludeChars='{}')
TEST = pp.Literal('test')
CONTENT = TEST & pp.ZeroOrMore(other)
CONTENT.setName('content')
CONTENT.setDebug()
pat=pp.nestedExpr(opener='{', closer='}', content=CONTENT)

for x in pat.scanString(s):
    print x

Outputs:

Match content at loc 7(1,8)
Exception raised:Missing one or more required elements ('test') (at char 7), (line:1, col:8)
Match content at loc 28(1,29)
Exception raised:Missing one or more required elements ('test') (at char 28), (line:1, col:29)

EXPLANATION:

The literal 'test' also matches the definition of other, so when the Each expression failed to find 'test' at the beginning of the bracketed group, it tried to find 0 or more others. Since 'test' matches the pattern defined in other, it got read as part of the ZeroOrMore.

SOLUTION:

Define an expression for 'test' and exclude it from the repetition in the ZeroOrMore:

CONTENT = TEST & pp.ZeroOrMore(~TEST + other)

After removing the debugging code, the output is:

(([(['dummy0', 'test', 'dummy1'], {})], {}), 6, 26)

Pyparsing does not implicitly try to do any lookahead or expression filtering or mind-reading - we have to put that in ourselves, as I did by saying that 'test' should not be included as part of the ZeroOrMore repetition of the other expression. Also, please try using setName and setDebug to start troubleshooting these problems, and you will get a better feel for where pyparsing can go astray.

2012-02-06 12:35:21 - oafilipoai

Thanks for the detailed reply

2012-02-06 12:55:58 - ptmcg

I didn't mean to be flip with my 'mind-reading' comment. In fact, to debug some of these parsers, I often play 'Be The Parser', and try to mentally step through each expression just following the grammar, and not using my own assumptions about how something should be parsed. You have to work hard to set aside your own human pattern matching machinery, which is much more powerful than pyparsing.

Good luck, and write back if you have more questions.


2012-02-04 15:55:57 - lamakaha - error with setParseAction

hello - I'm getting an error executing parseString when the following seemingly basic parse action is added; when there's no parse action set, it parses without any issue. Any insight? I'm using the latest pyparsing-1.5.6 and Python 3.2.

Series_Code=OneOrMore(Word(alphanums+'-'))('Series_Code')
Series_Code.setParseAction( lambda tokens : ''.join(tokens))
test = 'Series_Code: Series 1-1|'
topIDs =  Suppress('Series_Code:') + Series_Code + Suppress('|') 
parsed = (topIDs).parseString(test)


  File 'C:\Python32\lib\site-packages\pyparsing.py', line 689, in wrapper
    return func(*args[limit:])
UnboundLocalError: local variable 'limit' referenced before assignment

this is the code referenced by the error

def _trim_arity(func, maxargs=2):
    limit = maxargs
    def wrapper(*args):
        #~ nonlocal limit
        while 1:
            try:
                return func(*args[limit:])
            except TypeError:
                if limit:
                    limit -= 1
                    continue
                raise
    return wrapper

2012-02-05 22:39:12 - ptmcg

Please make sure you are using the version of pyparsing that is compatible with Python 3. The commented-out nonlocal statement tells me that you are using the Python 2-compatible version.

2012-02-08 02:19:43 - lamakaha

Thanks for your response - it helped. I have re-run setup.py a few more times after clearing the pyparsing files from Python32\Lib\site-packages, but it kept copying the Python 2 version of pyparsing into my Python32 folder. So I ended up just manually dropping the pyparsing_py3 file into Python32\Lib\site-packages and renaming it to pyparsing, and it seems to be working OK. Is this all the installation really does (outside of compiling it to bytecode, which I understand happens anyway on the first import)? Or, if I do it this way, am I missing some important installation step?

2012-02-08 05:23:30 - ptmcg

No, that is really all the installation does. pyparsing is just the one Python file, and you are correct, it will be compiled to bytecode the first time you import it. (Keeping it to one file was intentional on my part, to simplify its inclusion in other projects.)

I am disappointed that setup.py is not picking up your Python version, though; I'll have to go back and look at that to see where it is going wrong. Thanks for writing, and good luck with pyparsing!

-- Paul


2012-02-05 14:44:27 - karulis - bug + patch for ParseResults.dir in python3

Hi

I have noticed that ParseResults.__dir__ tries to concatenate a list to dict_keys (in Python 3.x, dict.keys() returns a dict_keys iterable view).

Here is patch with simplest solution:

Index: src/pyparsing_py3.py
===================================================================
--- src/pyparsing_py3.py    (revision 216)
+++ src/pyparsing_py3.py    (working copy)
@@ -568,7 +568,7 @@
             self.__parent = None

     def __dir__(self):
-        return dir(super(ParseResults,self)) + self.keys()
+        return dir(super(ParseResults,self)) + list(self.keys())

 collections.MutableMapping.register(ParseResults)

That will work but it's not thread-safe. A thread-safe version would be:

def __dir__(self):
    return dir(super(ParseResults,self)) + list(self.keys())

I noticed this issue when using pydev's debugger :)

2012-02-05 22:42:55 - ptmcg

Awesome, nice catch! Thanks for the patch, I'll include it in 1.5.7.

(I don't see the difference between the patch and the thread-safe version, though.)

-- Paul

2012-02-06 13:11:40 - karulis

Yep, no difference there (copy/paste bug ;)). It should be:

def __dir__(self):
    return dir(super()) + list(self.copy().keys())

I have noticed that there is already a bug filed for it: TypeError: can only concatenate list to list - ID: 3483740


2012-02-07 16:34:33 - oafilipoai - Finding the end location for a matched expression

I have an input string which is a collection of nested structures like this:

id1{
    info_level1
    id2{
        info_level2
    }
    ....
}
.....

Each 'id1' block contains some identifying information and one or more 'id2' blocks.

I need to change and replace, in the original string, all blocks of type 'id2' which meet certain criteria for both info_level1 and info_level2. For this purpose I'm trying to extract, using pyparsing, the start and end locations of these 'id2' blocks.

So far I built the pyparsing expression for both 'id2' and 'id1' (which contains 'id2') and I did something like this:

id2_pattern.setParseAction(lambda s,loc,toks: <store location here>)
id1_pattern.searchString(s)

The problem is that with setParseAction I only get the starting location of the matching id2 block. If I use searchString on the id2_pattern, I can't incorporate info_level1 in the search pattern.

I could call searchString on the content of id1 blocks and derive the locations in the input string, but I'm hoping there is an easier way to get the start/end locations for 'id2' blocks

2012-02-07 17:08:34 - ptmcg

See if one of the transformString examples looks like a better fit. transformString applies the changes made to the tokens during a parse action, and replaces the matched tokens in the output. It takes care of the replacement within the start/end locations.

-- Paul

2012-02-08 15:10:16 - oafilipoai

I ended up using the originalTextFor method and computing the end location from the length of the matched token:

id2_pattern = pp.originalTextFor(old_id2_pattern)
id2_pattern.addParseAction(lambda s,loc,toks: <store (loc, loc + len(toks[0]))>)

As a side note, there seems to be some undocumented behavior related to this method (I could only find references to it on this forum, but not in the pyparsing docs): after using originalTextFor, one needs to use addParseAction as opposed to setParseAction.
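
Editor's note: a minimal runnable sketch of this end-location technique; the id2 pattern below is a toy stand-in for the poster's real one:

import pyparsing as pp

old_id2_pattern = pp.Literal('id2') + pp.nestedExpr('{', '}')
id2_pattern = pp.originalTextFor(old_id2_pattern)

spans = []
# loc is the start of the match; the end is the start plus the matched text's length
id2_pattern.addParseAction(lambda s, loc, toks: spans.append((loc, loc + len(toks[0]))))

s = 'id1{ info_level1 id2{ info_level2 } }'
id2_pattern.searchString(s)
print spans

# prints:
# [(17, 35)]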


2012-02-14 03:49:57 - ror6ax - Trying to parse a file.

Got this error:

raise ParseException(instring, loc, self.errmsg, self)

The code is

from pyparsing import Word,alphas
f=open('pinged.txt')
# define grammar
greet = Word( alphas ) + '.' + Word( alphas ) + '.' + Word('com')

# input string
hello = f.read()

# parse input string
output=greet.parseString( hello )
print(output)

2012-02-14 03:51:09 - ror6ax

Traceback (most recent call last):
  File 'C:\module1.py', line 27, in <module>
    output=greet.parseString( hello )
  File 'C:\Python32\lib\site-packages\pyparsing.py', line 969, in parseString
    raise exc
  File 'C:\Python32\lib\site-packages\pyparsing.py', line 959, in parseString
    loc, tokens = self._parse( instring, 0 )
  File 'C:\Python32\lib\site-packages\pyparsing.py', line 833, in _parseNoCache
    loc,tokens = self.parseImpl( instring, preloc, doActions )
  File 'C:\Python32\lib\site-packages\pyparsing.py', line 2214, in parseImpl
    loc, exprtokens = e._parse( instring, loc, doActions )
  File 'C:\Python32\lib\site-packages\pyparsing.py', line 837, in _parseNoCache
    loc,tokens = self.parseImpl( instring, preloc, doActions )
  File 'C:\Python32\lib\site-packages\pyparsing.py', line 1435, in parseImpl
    raise ParseException(instring, loc, self.errmsg, self)
pyparsing.ParseException: Expected '.' (at char 5), (line:1, col:6)

2012-02-14 05:18:51 - ptmcg

Please post the first few lines of the input file. I suspect that Word(alphas) is not sufficient to describe the first part of the file - contains numbers or '_'s perhaps? or starts with 'http://'? Difficult to give much more help without seeing the input string.

-- Paul

2012-02-14 05:23:52 - ptmcg

A few more tips:

  • the last line of the exception tells you where to look in the input string for what you thought would be an alpha, but what pyparsing thought should be one of the '.'s. Look at character number 6 of the first line (numbering starting at 1).
  • If you catch the exception by wrapping your call to parseString with try/except ParseException as pe, print pe.markInputline() and you should get a visually marked version of the input line, with '>|<' in the string where the parsing error occurred.

2012-02-14 05:26:23 - ror6ax

Ping request could not find host lalala.balm.com . Please check the name and try again.
Ping request could not find host lalala.balm.com . Please check the name and try again.


Ping statistics for 11.11.11.111:
    Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
    Minimum = 57ms, Maximum = 58ms, Average = 57ms

Pinging lalala.balm.com [11.11.11.12] with 32 bytes of data:
Reply from 11.11.11.12: bytes=32 time=69ms TTL=247
Reply from 11.11.11.12: bytes=32 time=70ms TTL=247
Reply from 11.11.11.12: bytes=32 time=69ms TTL=247
Reply from 11.11.11.12: bytes=32 time=69ms TTL=247

2012-02-14 06:09:53 - ror6ax

Basically, I need to get every IP address this file contains. My idea was to look for text separated by dots, which would make up the IP addresses.

2012-02-14 06:51:16 - ptmcg

Try using searchString instead of parseString. parseString assumes that your grammar definition fully describes the input string; searchString and scanString will search for matches. (Not all addresses end in .com, some end in .org, .edu, etc., and many contain more than 3 dotted elements. This will be a good learning experience for you.)
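
Editor's note: a sketch of the searchString approach for pulling IP addresses out of the ping output; the Regex pattern here is illustrative, not a validated IPv4 matcher:

from pyparsing import Regex

ping_output = '''\
Ping statistics for 11.11.11.111:
Reply from 11.11.11.12: bytes=32 time=69ms TTL=247
'''

ip_address = Regex(r'\d{1,3}(\.\d{1,3}){3}')
for match in ip_address.searchString(ping_output):
    print match[0]

# prints:
# 11.11.11.111
# 11.11.11.12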


2012-02-17 10:10:07 - DiaaFayed - comma separates outside parentheses

I want the comma to separate only when it is outside of parentheses.

input to parser

ition (government, cabinet, etc), harmonious, harmonic
girl(s), women

I need the output:

['ition (government, cabinet, etc)', 'harmonious', 'harmonic']
['girl(s)', 'women']

the erroneous output

['ition (government', 'cabinet', 'etc)', 'harmonious', 'harmonic']
['girl(s)', 'women']

2012-02-17 19:36:28 - ptmcg

Try this:

from pyparsing import *

tests = ['ition (government, cabinet, etc), harmonious, harmonic', 'girl(s), women']
listitem = originalTextFor(OneOrMore(Word(alphas) | nestedExpr()))
for t in tests:
    print delimitedList(listitem).parseString(t).asList()

In the future, it would help if you also posted the parser that you have tried, so I can give you some instructive suggestions. As it is, all I can do is spoon-feed you the answer.

-- Paul

2012-03-02 05:56:06 - DiaaFayed

Thanks very much. I tried to simulate nestedExpr() using

parentheses = Regex(ur'\([^()]+\)')

but this did not solve the problem, and I do not know the reason. Thanks very much.

2012-05-10 18:57:05 - Mike01915

Just found this post, it's very useful. I'd written a method to parse function argument lists that contain nested calls to functions that might also contain a list of arguments. For example the following argument string:

'A1*atan2( A2, A3), power( 10, A4), ( B1 + B2 )/C10'

Parses to:

['A1*atan2( A2, A3)', 'power( 10, A4)', '( B1 + B2 )/C10']

Now I can replace my method with just two lines:

import pyparsing as pyp

ArithExp = pyp.Word( pyp.alphanums + '+-*/^' )
listitem = pyp.originalTextFor( pyp.OneOrMore( ArithExp | pyp.nestedExpr() ) )

tests = ['A1*sin(pi/8), cos(pi/8)/B1',
         'atan2( 2*X25, -Y25 ), A1*sin( A2/3 )',
         'A1*atan2( A2, A3), power( 10, A4), ( B1 + B2 )/C10']

for t in tests:
    print pyp.delimitedList( listitem ).parseString( t ).asList()


2012-02-18 19:57:34 - lamakaha - how to ignore blank lines in line oriented parser?

hello - I'm building a line-oriented parser, but at the same time blank lines, including lines with only spaces and tabs, need to be ignored. Any suggestions?

from pyparsing import *
ParserElement.setDefaultWhitespaceChars(' \t')

text ='''
  # there's a space here
From_Date: 10/1/2011
'''

EOL = LineEnd().suppress()
SOL = LineStart().suppress()
blankline = SOL + EOL

headerBlock =Suppress('From_Date:') + Word(nums+'/')('OpeningDate') 

headerBlock.ignore(blankline)
print(headerBlock.parseString(text).dump())

2012-02-20 17:43:34 - ptmcg

You realize that this fails because the input string does not start with 'From_Date:' but with '\n # there's a space'? Maybe parseString is not what you want, try searchString or scanString.

2012-02-20 18:42:26 - lamakaha

Paul - I'm very sorry - the test string I posted is wrong. It should have been like this (there's a blank or tab in the second line, and nothing else except the EOL):

text ='''

From_Date: 10/1/2011 '''

I have a work-around like this:

headerBlock = SkipTo(Suppress('From_Date:') + Word(nums+'/')('OpeningDate'))

but I need to find a more general solution.

again, my problem is this - I'm trying to parse lines, so EOL is significant, but empty lines could come at any time and in any number. These 'empty' lines could occasionally contain tabs and spaces, and I was hoping these would be ignored due to the 2 provisions:

  1. tabs and spaces are set to be the default white space
  2. I defined blank lines and made them to be 'ignored'

2012-02-20 18:48:08 - lamakaha

actually my work-around looks more like this

headerBlock =SkipTo(Suppress('From_Date:'),include=True) + Word(nums+'/')('OpeningDate') 

but it's besides the point - still not happy with this as I have more cases in my parser where SkipTo would skip over pieces which I'm interested in

2012-02-20 18:57:54 - ptmcg

Sometimes LineStart() does not match as well as we would expect. Try defining blankline as just plain EOL. Or define EOL as OneOrMore(LineEnd()), and get rid of the ignore.

-- Paul

2012-02-20 19:15:12 - lamakaha

'define EOL as OneOrMore(LineEnd()), and get rid of the ignore' - worked really well - issue resolved - thank you!!
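
Editor's note: a minimal sketch of the resolution, assuming a line-oriented grammar where EOL separates lines:

from pyparsing import *
ParserElement.setDefaultWhitespaceChars(' \t')

text = '\n   \nFrom_Date: 10/1/2011\n'

# OneOrMore(LineEnd()) swallows any run of line breaks, including
# 'blank' lines that contain only spaces or tabs
EOL = OneOrMore(LineEnd()).suppress()
headerBlock = Optional(EOL) + Suppress('From_Date:') + Word(nums+'/')('OpeningDate') + Optional(EOL)
print(headerBlock.parseString(text).dump())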


2012-02-21 05:54:17 - ror6ax - parsing tables

Hi there. I need to parse every line in a table which contains a specific string in it. Is there a way of doing it with pyparsing? Many thanks in advance.

2012-02-21 05:58:56 - ptmcg

Very likely. What kind of table are you talking about? An HTML table, or a text-formatted table with '+'s, '-'s, and '|'s? Or just tabular data with columns of values lined up in nice straight columns? Or something else? A sample would help. (and please enclose in [[code]] tags, before and after, each tag on a line by itself.)

-- Paul

2012-02-21 06:02:48 - ror6ax

Actually, a PDF with a text layer... I guess that makes it a text-formatted table in the end. Getting text-only input should be easy, but the main question stands.

2012-02-21 07:27:01 - ptmcg

Ok, cool. Can you post a sample of what your PDF table looks like, as plain text? Or is it stored as a compressed block? If so, you'll need to first extract the binary compressed data, then expand to text, and then parse the text. You might also look at what PDF-processing tools are available from reportlab (). They might even have something that will directly parse your table - pyparsing isn't always the solution to everything :) .


2012-02-25 13:19:21 - johnmudd - long output, is this right?

Sample C typedef parser:

The output is incredibly long for such a small example. Is this as expected, or am I doing something wrong when creating my parser?

2012-02-25 13:40:43 - ptmcg

Wow, that is an ugly mess! Try using this instead to see the results of your parse:

print parser_result.dump()

This will just show you the tokens and any parse results names. Using eval as you are doing dumps out all the internal structures that go into a ParseResults object - even I am surprised at what all you are getting!

-- Paul
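
Editor's note: a small illustration of dump() for readers new to it (the expression here is made up for the example):

from pyparsing import Word, alphas, nums

greeting = Word(alphas)('word') + Word(nums)('number')
print greeting.parseString('Hello 123').dump()

# prints the token list followed by one line per results name, e.g.:
# ['Hello', '123']
# - number: 123
# - word: Hello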


2012-02-29 00:17:37 - lesnar56 - Extending Keyword Classes

Hi, I want to extend the behaviour of the Keyword class.

import pyparsing as P
def call_back(p):
    print p
    print 'calling ....'

class MyKeyword(P.Keyword):
    x = P.StringStart() + P.Word(P.alphas) + P.StringEnd()
    x.setParseAction(call_back)
    parseImpl = x.parseImpl
    setParseAction = x.setParseAction 

if __name__ == '__main__':
    t = MyKeyword('ABCD')
    t.setParseAction(call_back)
    print t.parseString('ABCD')

I am able to parse the string, but the setParseAction is not working. Can anyone tell me where I am going wrong?
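
Editor's note: no reply was posted, but the likely culprit is that the class body rebinds parseImpl and setParseAction to the unrelated expression x, so t.setParseAction(call_back) attaches the action to x rather than to the MyKeyword instance. A subclass is usually unnecessary; a small factory function that builds a Keyword with a standard parse action is a simpler sketch:

import pyparsing as P

def call_back(t):
    print t
    print 'calling ....'

def my_keyword(s):
    # build an ordinary Keyword and attach the parse action to the instance
    kw = P.Keyword(s)
    kw.setParseAction(call_back)
    return kw

print my_keyword('ABCD').parseString('ABCD')

# prints:
# ['ABCD']
# calling ....
# ['ABCD']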


2012-03-06 06:04:21 - rrian - Unexpected results with name

#!/usr/bin/python

from pyparsing import Word,nums

va = 'VA'+Word(nums)
vb = 'VB'+Word(nums)
vc = 'VC'+Word(nums)
v = (va | vb | vc)('vvv')
k = v.parseString('VB 15')
print k.vvv

I was expecting the result to be

['VB', '15']

but am instead getting

VB 

If I replace the expression with simply

v = (vb)('vvv')

I get the expected result. Any thoughts?

Thanks!

2012-03-13 22:14:27 - ptmcg

I'm not sure I've seen this particular problem before, but try changing va to va = Group('VA' + Word(nums)), etc. This will keep these tokens together when they are saved in the parsed results.

2012-03-14 21:00:21 - rrian

Thanks! That's exactly what I am doing to get around the problem. But it is not as clean because Group creates a list.

2012-03-15 02:27:24 - ptmcg

If you just want a string 'VA 100' or whatever, then use Combine. Since there may be space between the VA and the number, you'll have to add 'adjacent=False' when combining these tokens.
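
Editor's note: a sketch of the Combine alternative Paul describes, keeping 'VB 15' as a single string under the results name:

from pyparsing import Word, nums, Combine

# adjacent=False allows whitespace between the pieces being combined
vb = Combine('VB' + Word(nums), joinString=' ', adjacent=False)
v = vb('vvv')
print v.parseString('VB 15').vvv

# prints:
# VB 15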


2012-03-12 12:46:53 - tarruda - Need help in parsing part of python grammar

As an exercise in learning pyparsing and how to write grammars in general, I'm writing a parser for a Python-like language.

So far I'm almost finished with the expressions part, but I can't seem to properly define primaries.

The part I'm having trouble with is found in the official documentation here:

Here's the code I'm having trouble with:

primary << (
    atom 
    | attributeref  # If I comment this line, everything is fine
) 
attributeref << (primary + DOT + IDENTIFIER)
pow_expr << (primary + Optional(POW_OP + unary_expr))
#
# Many other expressions defined
#
expr << conditional_expr  # This expression will eventually try to parse a 'pow_expr'  and consequently a 'primary', just like defined in python docs

The string I'm trying to parse is simple as this

'somevar.someattr'

In other words, I'm trying to parse an attribute access. The problem is that when I call expr.parseString('somevar.someattr', parseAll=True), it will successfully parse 'somevar' as an identifier, but then fail since there's a remaining '.' (DOT). If I turn off the parseAll flag, it will only parse 'somevar', which is not what I want. If I swap 'atom' and 'attributeref' in the 'primary' definition (so it will try to match an attributeref first), it will enter infinite recursion and break (obviously).

I have already tried this:

primary << (atom + ~DOT | attributeref)

But this will also fall into infinite recursion since it will keep trying to match the 'attributeref' when it meets the 'DOT' token.

What can I do to work around this problem?
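
[Editor's note: this question went unanswered in the archive. The usual workaround for a left-recursive rule like 'primary' is to match an atom first, then zero or more trailing '.' attribute accesses. A minimal sketch, with simplified identifiers standing in for the poster's grammar:]

from pyparsing import Word, ZeroOrMore, Suppress, alphas, alphanums

IDENTIFIER = Word(alphas + '_', alphanums + '_')
DOT = Suppress('.')

# 'primary' without left recursion: an atom followed by
# any number of '.attr' accesses
atom = IDENTIFIER
primary = atom + ZeroOrMore(DOT + IDENTIFIER)

print primary.parseString('somevar.someattr', parseAll=True)
# ['somevar', 'someattr']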


2012-03-14 12:00:22 - keirian - Recursion Help

Ok, I've been trying to get my head around recursion for a while, but I'm still having trouble with it.

Would someone please demonstrate how to implement a grammar to match something like '(1 or (2 and 3) or (4 or 5) and 6)'? Note that this may also be represented as '(1 (2 and 3) (4 5) and 6)', where the absence of an operator is an implicit 'or'.

Here is my (embarrassing) attempt:

LPAREN = Literal('(')
RPAREN = Literal(')') 
and_ = Keyword('and', caseless=True)
or_ = Keyword('or',caseless=True)
operator = and_ | or_
reference = Word(nums)
statement = Forward()
item = reference | (LPAREN + statement + RPAREN)
statement << (OneOrMore(item + Optional(operator)) + ZeroOrMore(Optional(operator) + statement))
element = Group(operator | statement)

Any guidance would be greatly appreciated. Thanks!

-Keirian
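
[Editor's note: this question went unanswered in the archive. A minimal sketch of one way such a grammar can be written, treating a missing operator between adjacent items as an implicit 'or':]

from pyparsing import (Forward, Group, Keyword, Optional, Suppress, Word,
                       ZeroOrMore, nums)

LPAREN = Suppress('(')
RPAREN = Suppress(')')
and_ = Keyword('and', caseless=True)
or_ = Keyword('or', caseless=True)
operator = and_ | or_
reference = Word(nums)

statement = Forward()
# an item is a number, or a parenthesized sub-statement
item = reference | Group(LPAREN + statement + RPAREN)
# adjacent items with no explicit operator are an implicit 'or'
statement << (item + ZeroOrMore(Optional(operator) + item))

print statement.parseString('(1 or (2 and 3) or (4 or 5) and 6)').asList()
# [['1', 'or', ['2', 'and', '3'], 'or', ['4', 'or', '5'], 'and', '6']]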


2012-03-21 11:58:03 - maxime-esa - ambiguous grammar not detected?

(Already posted in the bugtrack but I think it is more relevant for discussion here)

Dear all,

I tried to feed Pyparsing with the following grammar:

from pyparsing import *
expr = Literal('EXPR')
end = Literal('endif')
stmt = Forward()
cond = stmt | end
stmt << 'if' + expr + 'then' + cond + Optional('else' + cond)
cond.validate()
stmt.validate()
cond.parseString('if EXPR then if EXPR then endif else endif')
# -> (['if', 'EXPR', 'then', 'if', 'EXPR', 'then', 'endif', 'else', 'endif'], {})

I am not a parser expert and maybe there is something I do not understand, but I would have expected pyparsing to raise a warning when parsing this string (or accepting the grammar): I see no way to know if the 'else' belongs to the first 'if' or to the second one.

Help?

Thanks!

2012-04-05 01:33:43 - ptmcg

validate() looks for left-recursion in a grammar, and there is none in the one you posted.

Change stmt to:

stmt << Group('if' + expr + 'then' + cond + Optional('else' + cond))

to better see how the string is parsed into statements.

[ 
  ['if', 'EXPR', 'then', 
    ['if', 'EXPR', 'then', 'endif', 'else', 'endif']
  ] 
]

I was a little surprised though to see 'endif' as a valid statement in your grammar. In fact, adding 'endif' as a required terminator to your if statement syntax completely disambiguates any else clause.

Here is a slight rework of your grammar adding two more simple statements, and making 'endif' a required terminator of 'if' (instead of being a statement on its own). Also, I'm using Keyword instead of Literal for your keywords, to avoid accidentally parsing a variable name that just happens to start with a keyword, such as 'iffy'.

from pyparsing import *

# keywords
IF,THEN,ELSE,ENDIF,PASS,PRINT = map(Keyword,
    'if then else endif pass print'.split())

# placeholder for boolean expression
expr = Keyword('EXPR')

# statements
stmt = Forward()
pass_stmt = PASS
print_stmt = PRINT + quotedString
if_stmt = (IF + expr + THEN + stmt + 
            Optional(ELSE + stmt) + 
            ENDIF)

stmt << Group(pass_stmt | print_stmt | if_stmt)
stmt.validate()
print stmt.parseString("if EXPR then if EXPR then print 'hi' endif else print 'bye' endif")

prints:

[
  ['if', 'EXPR', 'then', 
     ['if', 'EXPR', 'then', 
        ['print', "'hi'"], 
     'endif'], 
  'else', 
     ['print', "'bye'"], 
  'endif']
]

In fact, by adding 'endif' as your terminator, you can easily support multiple statements in your then and else blocks. Here is a little more expanded grammar, expanding the print statement (just to make things a little more interesting) and using the multiplication syntax as an alternative to OneOrMore:

ident = Word(alphas, alphanums+'_')
print_stmt = PRINT + (quotedString | ident)
if_stmt = (IF + expr + THEN + Group(stmt*(1,)) + 
            Optional(ELSE + Group(stmt*(1,))) + 
            ENDIF)
stmt << Group(pass_stmt | print_stmt | if_stmt)

print stmt.parseString ('''
    if EXPR then 
        if EXPR then 
            print 'hi' 
            print name
        endif 
    else 
        print 'bye' 
        print name
    endif
''')

Gives

[['if', 'EXPR', 'then',
  [['if', 'EXPR', 'then', 
    [
    ['print', "'hi'"], 
    ['print', 'name']
    ], 
  'endif']],
'else',
  [
  ['print', "'bye'"], 
  ['print', 'name']
  ],
'endif']]

HTH, -- Paul

2012-04-05 04:29:13 - maxime-esa

Thank you very much for your detailed answer, which, if needed, shows how Python and PyParsing are elegant when properly used.

However, with all due respect, I am not completely satisfied yet :-)

What you propose is a workaround: knowing that there is an ambiguity, you fixed it.

I am more concerned by the fact that the ambiguity is not detected by the parser itself, and wonder if (and how) you could detect it automatically.

As an exercise to illustrate what I mean, I wrote an 'equivalent' grammar using ANTLR, and tried to compile it:

cond : stmt | END;

stmt : 'if' EXPR 'then' cond ('else' cond)* ;

EXPR : 'EXPR';
END : 'endif';

But this won't pass the semantic check:

warning(200): test.g:18:42: Decision can match input such as ''else'' using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
error(201): test.g:18:42: The following alternatives can never be matched: 2

I have no idea how much effort it would represent to have this level of analysis.

But see the point: as soon as there is recursion, there is a risk of having such situations, which cannot easily be spotted by a human reader.

What do you think?

2012-04-05 09:22:18 - ptmcg

I'm sorry, but I don't foresee being able to add this kind of analysis to pyparsing any time soon.

The current philosophy of pyparsing is pretty strictly left-to-right, working through the parsing grammar expression by expression. The implementation of validate() just walks this grammar the same way, looking for cycles in any recursive paths, but even this is not 100% reliable.

If you need this capability, you can still use pyparsing for quick prototyping, but it sounds like ANTLR gives you more value in terms of validation (and runtime performance as well).

-- Paul


2012-03-25 00:33:47 - nimbiotics - problems with delimitedList

Hello everyone. I'm new to python and even newer to pyparsing. I pasted my code at I'm having trouble with the definition of 'grant' at line 100 (the error message is shown at the end of the paste). Here is the ebnf of what i want:

<grant> ::= <> 'grant', <permission>, <user>, [',', <user>]0, <journal>, [',', <journal>]0

Can someone please explain what I am doing wrong here? TIA!

2012-03-25 00:34:56 - nimbiotics

<grant> ::= 'grant', <permission>, <user>, [',', <user>]0, <journal>, [',', <journal>]0

2012-03-25 05:35:01 - ptmcg

The problem you have is that you define journalName to include ',' as a possible character. This consumes the ',' that would be the delimitedList's delimiter, leaving the rest of the list of journal names unparsed.

How to figure this out for yourself? First, look at the exception. I see that you already gave yourself a nice ruler of column numbers. So you can see that the exception reports a problem at column 55, which is the 2nd journal name in the command's list of journals. Now try adding 'setDebug()' on your journalName expression, and you will get output like this:

Match W:(abcd...) at loc 43(1,44)
Matched W:(abcd...) -> ['journal_1,']
Traceback (most recent call last):
  File 'k9.py', line 138, in <module>
    linea = syntax.parseString(test)
  File 'c:\python26\lib\site-packages\pyparsing-1.5.6-py2.6.egg\pyparsing.py', line 1032, in parseString
    raise exc
pyparsing.ParseException: Expected end of text (at char 54), (line:1, col:55)

You'll see output from pyparsing every time journalName is attempted to be parsed, followed by either the matched tokens (if successful) or the exception text (if not successful). See that your first entry was matched as 'journal_1,', and that unwanted trailing comma is your culprit.

If you need to permit commas in your journal names, then you will need to use a different delimiter, which you can specify to delimitedList in a second argument.

Welcome to pyparsing! -- Paul
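
[Editor's note: a quick sketch of the alternate-delimiter suggestion, with hypothetical journal names:]

from pyparsing import Word, alphanums, alphas, delimitedList

# journal names may contain commas, so use ';' as the list delimiter instead
journalName = Word(alphas, alphanums + '_,')
journalList = delimitedList(journalName, delim=';')
print journalList.parseString('journal_1,old; journal_2').asList()
# ['journal_1,old', 'journal_2']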

2012-03-25 19:34:00 - nimbiotics

You Rock!

I was about to give up and use a 'special' character at the beginning of journalName. Not very elegant and, as per your explanation, it wouldn't have worked anyway.

Thanks a lot!!!


2012-03-29 14:30:05 - nimbiotics - How can I group this?

In the following definition, is there any way to obtain the amounts, dates and references grouped in a list or dictionary? I tried Group() and it doesn't work.

single = (StringStart() +
          delimitedList(Amount('amount') +
                        Optional(Date)('date') +
                        Optional(QuotedString(quoteChar="'"))('reference')) +
          journalName('journalName') +
          Optional(Comments)('comments') +
          StringEnd())

2012-04-05 00:52:01 - ptmcg

Group the expression within the delimitedList:

single = (StringStart() +
         delimitedList(Group(Amount('amount') +
                       Optional(Date)('date') +
                       Optional(QuotedString(quoteChar="'"))('reference'))) +
         journalName('journalName') +
         Optional(Comments)('comments') +
         StringEnd()
         )

2012-04-04 14:58:25 - HumbertMason - Parsing a list of structures line by line

Hello, I am trying to parse a file containing some simple structure declarations. It should be something like this:

BeginStructure house
color = white
size = big
EndStructure

BeginStructure car
speed = fast
price = 15000
EndStructure

and so on

I'd like to detect every structure declaration and print or save it in a list. I have written something like this:

SingleLine = Word(alphanums) + '=' + Word(alphanums)
MultipleValues = OneOrMore(Group(SingleLine + Suppress(';')))
Structure = Suppress('beginFact') + Word(alphanums) + MultipleValues + Suppress('endFact')

but it can't work if I read the file line by line because I should do something like this:

with open('file.txt') as f:
    for line in f:
        print 'line = ' + line
        print grammarFact.parseString(line)

and obviously it can't match the entire grammar if it considers a single line. Any suggestions? Thank you!

2012-04-04 15:04:21 - HumbertMason

I can't edit my post: sorry, instead of beginFact and endFact they should be BeginStructure and EndStructure.

2012-04-05 00:47:25 - ptmcg

You should read the entire file into a string and parse it all at once. Unless you are parsing gigabyte-sized files, this is perfectly acceptable.

with open('file.txt') as f:
    print grammarFact.parseString(f.read())

2012-04-05 05:26:13 - HumbertMason

Oh ok. Thank you :)

2012-04-05 06:26:03 - HumbertMason

One last question: would you parse an entire programming language by loading the entire file into a string, or are there better ways to do this? If so, what would be the 'main idea'? Thank you

2012-04-05 09:25:47 - ptmcg

For parsing a complex grammar like an entire programming language, look at the Verilog parser example, which loads the entire file into a string and parses it. That parser just parses the Verilog source; very little beyond that is implemented. To actually implement a compiler or interpreter, I would define processable classes associated with the various language constructs - see the SimpleBool.py example for how this might be done, and how the result is used once built.

-- Paul


2012-04-06 10:11:25 - pepinocho9 - Help with parseactions and Morgan's Law

I am trying to write a program that evaluates whether a propositional logic formula is valid or invalid using the semantic tree method.

I managed to evaluate if a formula is well formed or not so far:

from pyparsing import *
from string import lowercase

def fbf():

    atom = Word(lowercase, max=1) #alphabet
    op = oneOf('^ V => <=>') #Operators
    identOp = oneOf('( [ {')
    identCl = oneOf(') ] }')
    form = Forward() 
    #Grammar
    form << ( (Group(Literal('~') + form)) | ( Group(identOp + form + op + form + identCl) ) | ( Group(identOp + form + identCl) ) | (atom) )

    return form

#Parsing
entrada = raw_input('Entrada: ')
try:
    print fbf().parseString(entrada, parseAll=True)
except ParseException as error:
    print error.markInputline()
    print error
print

Now I need to convert the negated formula ~(form) according to De Morgan's laws. The BNF of De Morgan's laws is something like this:

~((form) V (form))  =  (~(form) ^ ~(form))
~((form) ^ (form))  =  (~(form) V ~(form))

Parsing must be recursive. I was reading about parse actions, but I don't really understand them; I'm new to Python and very unskilled.

Can somebody help me on how to get this to work?

Thank you very much!.

PD. ptmcg, again greetings from Mexico and thank you for all your help :)
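
[Editor's note: this question went unanswered in the archive. The sketch below shows one hedged way a parse action could rewrite a negated binary formula; it uses a simplified grammar with suppressed parentheses rather than the original poster's, so output shapes differ from theirs:]

from pyparsing import (Forward, Group, Literal, ParseResults, Suppress, Word,
                       oneOf)
from string import lowercase

atom = Word(lowercase, max=1)
op = oneOf('^ V')
LP, RP = Suppress('('), Suppress(')')

form = Forward()
negated = Group(Literal('~') + form)

def as_tree(x):
    # normalize nested ParseResults to plain lists for readable output
    return x.asList() if isinstance(x, ParseResults) else x

def push_negation(tokens):
    t = tokens[0]                 # contents of the Group: ['~', inner]
    inner = t[1]
    # only rewrite ~(lhs op rhs); leave ~atom alone (return None keeps tokens)
    if not isinstance(inner, basestring) and len(inner) == 3:
        lhs, op_, rhs = inner
        flipped = {'V': '^', '^': 'V'}[op_]
        return [[['~', as_tree(lhs)], flipped, ['~', as_tree(rhs)]]]

negated.setParseAction(push_negation)
form << (negated
         | Group(LP + form + op + form + RP)
         | Group(LP + form + RP)
         | atom)

print form.parseString('~((a)V(b))', parseAll=True).asList()
# [[['~', ['a']], '^', ['~', ['b']]]]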


2012-04-16 10:33:55 - takluyver - Skip optional part if following part matches

I'm trying to match a pattern with an optional repeating part, followed by a compulsory single part which is a superset of the optional part. So 'A B CDE' has optional parts 'A', 'B', and compulsory part 'CDE'. But in 'A B C', C represents the compulsory part.

This code, and all the variants I've tried involving Optional and OneOrMore, fails for 'A B C', because the optional part is too greedy:

pattern = ZeroOrMore(Word(upper, max=1)) + Word(upper)

I could do the whole chunk with a regex, though I'd prefer not to:

re.compile('([A-Z] )*[A-Z]+')

Is there a way to achieve this in pyparsing?

Thanks!

2012-04-16 11:04:25 - ptmcg

What version of pyparsing are you using? With version 1.5.6, I get this:

>>> single = Word(alphas,max=1)
>>> mult = Word(alphas)
>>> expr = ZeroOrMore(single) + mult
>>> expr.parseString('A B CDE')
(['A', 'B', 'CDE'], {})

2012-04-16 11:23:13 - takluyver

Yep, that one works, but try this:

>>> expr.parseString('A B C')
...big 'orrible traceback
ParseException: Expected W:(abcd...) (at char 5), (line:1, col:6)

I'm running 1.5.6, on Python 3.2.

2012-04-16 16:57:35 - ptmcg

Sorry, I misread your question.

So a single isn't just a single letter, it's a single letter that is followed by at least one more letter, or conversely, it's a letter that is not followed by the end of the line.

This kind of lookahead is the only way to tell pyparsing to look beyond just the current expression, at what comes next as well. It helps to have a kind of self-imposed tunnel vision while writing a pyparsing grammar, because unless you explicitly spell out any required lookahead, it's not going to happen.

Using FollowedBy is a way to see if something is coming up, without consuming that something from the input string. So you could implement either of these lookaheads:

single = Word(alphas,max=1) + FollowedBy(Word(alphas))

or

single = Word(alphas,max=1) + ~FollowedBy(LineEnd())

The first only matches a single if there is at least something after it, which could be a single or a mult. The second only matches a single if it is not the last thing on the current line.

Change single to either one of these, and 'A B C' parses just fine.

-- Paul
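
[Editor's note: a quick runnable check of the first variant:]

from pyparsing import FollowedBy, Word, ZeroOrMore, alphas

single = Word(alphas, max=1) + FollowedBy(Word(alphas))
mult = Word(alphas)
expr = ZeroOrMore(single) + mult

print expr.parseString('A B CDE')   # ['A', 'B', 'CDE']
print expr.parseString('A B C')     # ['A', 'B', 'C']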

2012-04-17 04:35:48 - takluyver

Thanks, Paul, that seems to be doing the trick.


2012-04-27 12:16:42 - larapsodia - Question about "Or" statement

I'm just starting out with pyparsing and having a bit of a hard time trying to get it to do what I want. I'm trying to parse words into 'prefix', 'stem' and 'suffix'. Prefixes and suffixes are optional, and can be made up of several parts (i.e., a prefix could be 'conjunction' + 'definite article'). Here's the grammar:

endOfString = StringEnd()
conjunction = oneOf('w f')
preposition = oneOf('l b')
def_art = oneOf('al l')
noun_prefix = Group(Optional(conjunction('conjunction')) +
                    Optional(preposition('preposition')) +
                    Optional(def_art('article')))
noun_suffix = oneOf('y na k km w h ha hm') + FollowedBy(endOfString)
poss_noun = (Optional(Optional(conjunction) +
                      Optional(preposition))('prefixes') +
             SkipTo(noun_suffix | endOfString)('stem') +
             Optional(noun_suffix)('suffix'))
def_noun = (Optional(Optional(conjunction) +
                     Optional(preposition) +
                     Optional(def_art))('prefixes') +
            SkipTo(endOfString)('stem'))
noun = Or([poss_noun, def_noun])('noun')

My problem is that I'd like to get the maximum parse (i.e., breaking the word up into as many pieces as possible), not necessarily the longest result.

For example, I have nouns defined as:

noun = Or( [def_noun, poss_noun ] )('noun') 

in order to enforce a rule that if a noun has a definite article, it can't also have a possessive ending. The problem I'm having is that the parser matches whichever pattern is first in the Or statement, and doesn't seem to try the other one.

Here's what it does (with the parse that I would have preferred added as a comment to the right):

>>> noun = Or( [def_noun, poss_noun ] )('noun') 
>>> for word in wordlist:
...   noun.parseString(word).asList() 
... 

['al', 'dar']                 # correct 
['b', 'al', 'blad']          # correct 
['al', 'blad']                # correct 
['b', 'ytw']                 # b + yt + w 
['b', 'al', 'Hq']            # correct 
['l', 'bytw']                # l + byt + w 

>>> noun = Or( [poss_noun, def_noun ] )('noun')
>>> for word in wordlist:
...   noun.parseString(word).asList() 
... 

['aldar']               # al + dar 
['b', 'alblad']        # b + al + blad 
['alblad']              # al + blad 
['b', 'yt', 'w']         # correct 
['b', 'alHq']           # b + al + Hq 
['l', 'byt', 'w']         # correct 

So it's matching whichever pattern is first, instead of which pattern is the best match. What am I doing wrong?

Thanks, Karen

2012-04-28 14:37:52 - ptmcg

No, 'Or' actually tests both (or all if more than two) cases, with the expectation that the 'better' match is the one that matches the longest input string - and if two or more parse the same amount of input text, then the first one given in the Or expression will win out.

You can confirm this for yourself by adding setDebug() to each expression in the Or. setDebug() will report when an expression is about to be used in an attempt to parse the next position in the input, followed by either the success (and matching tokens) or failure (with failure message) of the parse. Change Or to MatchFirst to see the difference.

As a matter of personal style, I prefer 'a ^ b' over 'Or([a,b])', but the two are equivalent.

I'll give your question a little more thought to see what I can come up with to answer your underlying question, how to prefer the more complex parse over the simpler, when they parse the same length.

-- Paul
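
[Editor's note: a tiny sketch contrasting the two, with made-up expressions:]

from pyparsing import Word, nums

short = Word(nums, max=1)
longer = Word(nums)

print (short | longer).parseString('123')   # MatchFirst: ['1']
print (short ^ longer).parseString('123')   # Or (longest match): ['123']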

2012-04-29 17:45:13 - larapsodia

Yes, I guess that is my question. Thank you -- I would appreciate any help.

I can work around it -- I can just put all the prefixes and suffixes in one definition of a noun, but I would like to enforce the definiteness constraint, if possible.

2012-04-29 19:13:26 - ptmcg

This might be what you were trying to avoid, but creating one comprehensive expression for noun seems best to me:

from pyparsing import *

endOfString = StringEnd() 
conjunction = oneOf('w f') 
preposition = oneOf('l b') 
def_art = oneOf('al l')

noun_prefix = ( Optional(conjunction('conjunction')) + \
                     Optional(preposition('preposition')) + \
                     Optional(def_art('article'))) 
noun_suffix = oneOf('y na k km w h ha hm') + FollowedBy(endOfString) 
noun = (Optional(noun_prefix)('prefixes') + 
        (SkipTo(noun_suffix)('stem') + noun_suffix('suffix') | 
         SkipTo(endOfString)('stem')))

wordlist = 'aldar balblad alblad bytw balHq lbytw'.split()
for word in wordlist:
    print word
    print noun.parseString(word).dump()
    print

prints:

aldar
['al', 'dar']
- article: al
- prefixes: ['al']
  - article: al
- stem: dar

balblad
['b', 'al', 'blad']
- article: al
- prefixes: ['b', 'al']
  - article: al
  - preposition: b
- preposition: b
- stem: blad

alblad
['al', 'blad']
- article: al
- prefixes: ['al']
  - article: al
- stem: blad

bytw
['b', 'yt', 'w']
- prefixes: ['b']
  - preposition: b
- preposition: b
- stem: ['yt']
- suffix: ['w']

balHq
['b', 'al', 'Hq']
- article: al
- prefixes: ['b', 'al']
  - article: al
  - preposition: b
- preposition: b
- stem: Hq

lbytw
['l', 'byt', 'w']
- prefixes: ['l']
  - preposition: l
- preposition: l
- stem: ['byt']
- suffix: ['w']

-- Paul

2012-04-30 20:44:43 - larapsodia

Thank you very much for your help, Paul. I ended up writing some code that would try every type of parse, and keep the one with the shortest stem, so that I was able to maintain my constraints:

def word_parse(word):
    word_types = [poss_noun, def_noun, pres_verb, past_verb]
    parses = []
    for type in word_types:
        try:
            parse = type.parseString(word)
            # stems should be at least two letters
            if len(parse.stem) < 2:
                continue
            parses.append((parse.asList(), len(parse.stem)))
        except:
            pass
    try:
        #sort by second value of tuple, to get the parse with
        #the shortest stem
        top_parse = sorted(parses, key=lambda x: x[1])[0][0]
    except:
        top_parse = word
    parse_string = '+'.join(top_parse)
    return parse_string

This seems to work pretty well:

>>> wordlist = ['aldar','balblad','alblad','bytw', 'balHq', 'lbytw']
>>> for word in wordlist:
...   print final.word_parse(word)
... 
al+dar
b+al+blad
al+blad
b+yt+w
b+al+Hq
l+byt+w

I parsed a small section of authentic text by hand, then tested the parser against it and got an accuracy of 77%. I'm going to do some fancy statistical stuff now to increase that -- but considering that's just a first shot, with a still pretty primitive grammar, it seems pretty damn good.

Thank you again for all your help!

~Karen


2012-04-28 06:48:11 - charles_w - working to understand pyparsing, setResultsName, and setParseAction

Continuing from these two StackOverflow questions:


I have since been able to at least get some traction using setResultsName. Here is the current complete code.

from pyparsing import *

#first are the basic elements of the expression

#number at the beginning of the line, unique for each line
#top-level category for a sentiment
#semicolon should eventually become a line break

lineId = Word(nums)
topicString = Word(alphanums+'-'+' '+"'")
semicolon = Literal(';')

#call variable early to allow for recursion
#recursive function allowing for a line id at first, then the topic,
#then any subtopics, and so on. Finally, optional semicolon and repeat.
#the lineId and the semicolon are read but not printed
#set results name lineId.lineId here
expr = Forward()
expr << Optional(lineId.setResultsName('lineId')) + topicString + \
Optional(nestedExpr(content=delimitedList(expr))) + \
Optional(Suppress(semicolon) + expr)


#open files for read and write
input = open('parserinput.txt')
output = open('parseroutput.txt', 'w')

#defining functions

#takes nested list output of parser grammar and translates it into
#strings suited for the final output
def format_tree(tree):                                                                                            
    prefix = ''
    for node in tree:
        if isinstance(node, basestring):
            prefix = node
            yield node
        else:
            for elt in format_tree(node):
                yield prefix + '_' + elt

#function for passing tokens from setResultsName
def id_number(tokens):
    #print tokens.dump()
    lineId = tokens
    lineId['lineId'] = lineId.lineId

#function for splitting line at semicolon and appending numberId
#not currently in use
def split_and_prepend(tokens):
    return '\n' + final.lineId


#setting parse actions
lineId.setParseAction(id_number)


#reads each line in the input file
#calls the grammar expressed in 'expr' and uses it to read the line
#outputs it as a list (changed to allow result names for operations)
#applies the format tree function
for line in input:
    final = delimitedList(expr).parseString(line)#.asList()
    newline = '\n' + final.lineId + ' = '
    final_string = newline.join(format_tree(final))
    print final_string
    output.write(final_string)

The good news is that I'm making progress both toward my desired functionality and toward a better understanding of how this all works.

This version has some bugs/flaws/badly-hacked compromises, however.

  • Currently I'm trying to bring in the lineId through the join() method, but this requires that I place the lineId near the carriage return. Ideally, the lineId could be placed at the front of each line, which probably means taking a different approach.
  • While one possible alternative seems to be to use the lineId attribute variable to construct the kind of line that I want after the parsing rather than having the parser return the lineId as part of the string, I found that using Suppress on the lineId token to take it out of the results also precluded me from passing that token as an attribute name.
I feel like I'm very close to the desired functionality and can even get something usable if I do the final bit of manipulation in Excel, but of course I'd like to lick this problem the proper way rather than doing something like that.

I'm going to continue trying to figure out some other approaches, but I wanted to go ahead and post over here so that someone can warn me if I'm going down rabbit trails and steer me toward a more productive line of questioning.

2012-04-28 15:14:25 - charles_w

Okay - I've made a little more progress, possibly.

I'm now stuck at a different challenge en route to an alternate approach to the one I was attempting before.

from pyparsing import *

#first are the basic elements of the expression

#number at the beginning of the line, unique for each line
#top-level category for a sentiment
#semicolon should eventually become a line break

lineId = Word(nums)
topicString = Word(alphanums+'-'+' '+"'")
semicolon = Literal(';')

#call variable early to allow for recursion
#recursive function allowing for a line id at first, then the topic,
#then any subtopics, and so on. Finally, optional semicolon and repeat.
#the lineId and the semicolon are read but not printed
#set results name lineId.lineId here
expr = Forward()
expr << Optional(lineId.setResultsName('lineId')) + topicString.setResultsName('topicString') + \
Optional(nestedExpr(content=delimitedList(expr))).setResultsName('parenthetical') + \
Optional(Suppress(semicolon).setResultsName('semicolon') + expr.setResultsName('subsequentlines'))

notid = Suppress(lineId) + topicString + \
Optional(nestedExpr(content=delimitedList(expr))) + \
Optional(Suppress(semicolon) + expr)



#naming the parenthetical portion for independent reference later
parenthetical = nestedExpr(content=delimitedList(expr))


#open files for read and write
input = open('parserinput.txt')
output = open('parseroutput.txt', 'w')

#defining functions

#takes nested list output of parser grammar and translates it into
#strings suited for the final output
def format_tree(tree):                                                                                            
    prefix = ''
    for node in tree:
        if isinstance(node, basestring):
            prefix = node
            yield node
        else:
            for elt in format_tree(node):
                yield prefix + '_' + elt

#function for passing tokens from setResultsName
def id_number(tokens):
    #print tokens.dump()
    lineId = tokens
    lineId['lineId'] = lineId.lineId

def topic_string(tokens):
    topicString = tokens
    topicString['topicString'] = topicString.topicString

def parenthetical_fun(tokens):
    parenthetical = tokens
    parenthetical['parenthetical'] = parenthetical.parenthetical

#function for splitting line at semicolon and appending numberId
#not currently in use
def split_and_prepend(tokens):
    return '\n' + final.lineId


#setting parse actions
lineId.setParseAction(id_number)
topicString.setParseAction(topic_string)
parenthetical.setParseAction(parenthetical_fun)


#reads each line in the input file
#calls the grammar expressed in 'expr' and uses it to read the line
#outputs it as a list (changed to allow result names for operations)
#applies the format tree function
for line in input:
    final = delimitedList(expr).parseString(line)
    notid = delimitedList(notid).parseString(line)
    dash_tags = ', '.join(format_tree(final))

    print final.lineId + ': ' + dash_tags

The problem is that for multi-line inputs, I get the error 'no such attribute _ParseResults__tokdict'.

Commenting out either of the lines at the end that are doing the parsing removes the error. And inputs with only one line removes the error.

2012-04-28 15:23:23 - charles_w

Hmm - I can't edit my post, but the penultimate line is supposed to read:

dash_tags = ', '.join(format_tree(notid))

Because then I can get the tokens without the lineId token on the front, which I'm already getting from final.lineId.

2012-04-28 15:57:07 - charles_w

Posted the new problem to SO here:

2012-04-28 20:23:58 - ptmcg

Your post sounds like a traceback in pyparsing itself, so possibly a bug. I'll run your posted code and see.

You have written and posted a lot of code on this project, but have you written a BNF? I've seen some simple examples, but you mentioned that they are just part of a larger project. I'm really having difficulty understanding your code without seeing the bigger picture.

2012-04-28 23:00:20 - charles_w

Interesting. I would never have considered the possibility of a bug.

As for a BNF, I have not written one (had to look up the term), but I'll try to describe what I'm doing. I'll try to write a BNF if you recommend it after reading this.

I have nearly ten thousand open-ended survey question responses that I'll have to review and tag in a way that best captures the content of the response. This reviewing and tagging I will be doing manually in Excel.

This manual tagging will follow a simple grammar:

  • Parentheses are used to show a parent-child relationship between ideas. These can nest.
  • Commas are used to delimit tags at the same level describing the same idea
  • Semicolons are used to delimit one major idea from another
So if a survey response was like this:
'Pyparsing is an amazing module because it is so powerful and simple to use. Wikispaces is a good site.'

The tag would be this:

pyparsing(compliment(powerful, easy to use)); wikispaces(compliment)

There are dozens of such tags, and responses can cover arbitrarily many topics in a single response and can go to arbitrary depth.

I then want to use Pyparsing to read my tagging grammar and print out the tags, converting the nested relationships into single tokens that represent each parent-child relationship. At a semicolon, make a new line to indicate a sufficiently discrete change in topic.

So for the above example, this is the desired output:

pyparsing, pyparsing-compliment, pyparsing-compliment-powerful, pyparsing-compliment-easy to use

wikispaces, wikispaces-compliment

Each survey response also has a unique ID number that I want to attach to the front of the lines for that response, as follows.

4934 pyparsing, pyparsing-compliment, pyparsing-compliment-powerful, pyparsing-compliment-easy to use

4934 wikispaces, wikispaces-compliment

The tokens in this final output will be put into fields in Excel alongside the number that serves as an index. Then one can search for fields that equal a given tag whether for a top-level issue (pyparsing) or a narrower category (pyparsing-compliment).

I'll reiterate that I am not experienced in programming, Python, or parsing, so there is at least some stuff in my code that is extraneous; it doesn't necessarily all signify some intention.

I'll also add that while this is a project with practical applications, the primary goal is for it to be a learning experience with Python and Pyparsing, which it certainly has been so far.

2012-04-29 14:28:35 - charles_w

While there may be a bug somewhere in pyparsing, my particular problem seems to have been caused by using the same name 'notid' for the parser expression early in the code and for the parse results in the final section of the code.

Someone on SO caught the mistake for me at the link above.

2012-04-30 07:09:26 - charles_w

After much trial and error, I've arrived at something that will take the inputs I want and deliver the outputs that I want.

Thanks again, Paul, for your patience and your help. This has been a great learning exercise for me.

from pyparsing import *

data = '''\
1200 price(margin, happy), channel, friend; friend, channel, price
'''


def memorize(t):
    memorize.idnum = t[0]

def endblock(t):
    return '\n' + memorize.idnum


expr = Forward()
expr << Optional(Word(nums).setParseAction(memorize)) + OneOrMore(delimitedList(Word(alphanums+'-'+' '+"'") + Optional(nestedExpr(content=delimitedList(expr))))) + Optional(Suppress(Literal(';')).setParseAction(endblock))
lines = ZeroOrMore(expr)

parsed = lines.parseString(data)

print parsed


def format_tree(tree):
    print tree                                                                                           
    prefix = ''
    for node in tree:
        if node[0].isdigit():
            yield node
        elif isinstance(node, basestring):
            prefix = node
            yield node
        else:
            for elt in format_tree(node):
                yield prefix + '_' + elt

print ', '.join(format_tree(parsed))

2012-05-01 01:14:04 - robintw - Labelling of results when using Each

Hi,

I've been struggling with labelling of individual results when I am using some Optional elements within an Each clause.

Firstly, is this possible to do? I'd assume it was, but I wanted to check.

Secondly, what am I doing wrong here?

I've posted a StackOverflow question about this, which is at - it has all of the code etc, and if anyone could help that'd be brilliant!

Cheers,

Robin


2012-05-08 11:32:30 - side78 - Parsing nested blocks without any deterministic end

I'm trying to parse command output that unfortunately isn't very nicely structured. The output is loosely structured as a series of outer blocks, containing zero or more inner blocks. It's possible to determine the start of different blocks but not necessarily the end of them, except to potentially match on a list of block starting matches.

I'm not sure if I'm using the right approach to parsing this. What I've done below has the problem that it doesn't break when it reaches the second outer block, and thus all the inner blocks inside the second block are included in the results for the first outer block. If I try to break by adding Suppress(SkipTo(inner_block_begin | outer_block_begin)) to the end of the inner_block construct it seems to greedily slurp until the end (I think).

What's an appropriate way to handle this kind of parsing using pyparsing (if any)?

from pyparsing import *

text = '''
Outer 100
  Text to be skipped
  Some parm to match 199
  Text to be skipped may contain keyword Inner
  Inner 101
    Text to be skipped
    Text to be skipped may contain keyword Inner
    Some prefixes:
    Text to be skipped
  Inner 102
    Text to be skipped
    Text to be skipped may contain keyword Inner
    Some prefixes:
      102.1.1.1/24 text
      102.2.2.2/24 text
    Text to be skipped
Outer 200
  Text to be skipped
  Text to be skipped may contain keyword Inner
  Inner 201
    Text to be skipped
    Text to be skipped may contain keyword Inner
    Some prefixes:
      201.1.1.1/24 text text
    Text to be skipped
'''

ipv4 = Combine(((Word(nums, max=3) + '.') * 3) + Word(nums, max=3))
ipv4_prefix = Combine(ipv4 + '/' + Word(nums, max=2))

outer_block_begin = lineStart + Literal('Outer')
outer_block_id = Word(nums)
outer_block_misc = Suppress(Literal('Some parm to match')) + Word(nums)

inner_block_begin = lineStart + Literal('Inner')
inner_block_id = Word(nums)
inner_block_prefix = ipv4_prefix + Suppress(restOfLine)

inner_block = \
    Suppress(SkipTo(inner_block_begin, include=True)) + \
    inner_block_id + \
    Suppress(SkipTo(Literal('Some prefixes:'), include=True)) + \
    Group(ZeroOrMore(inner_block_prefix))

outer_block = \
    Suppress(SkipTo(outer_block_begin, include=True)) + \
    outer_block_id + \
    Suppress(SkipTo(outer_block_misc)) + outer_block_misc + \
    Group(ZeroOrMore(inner_block))

print outer_block.searchString(text)

2012-05-09 17:30:33 - Mike01915

side78,

Here is a pragmatic approach. Your basic search definition works on a single block, so pre-parse the long string into a list of strings, each representing a block. Then apply outer_block.searchString() to each block string in the list. Additionally, I had to add Optional() to the outer_block_misc definition since it is not always present in a text block. Hope this helps,

Mike

ipv4 = Combine(((Word(nums, max=3) + '.') * 3) + Word(nums, max=3))
ipv4_prefix = Combine(ipv4 + '/' + Word(nums, max=2))

outer_block_begin = lineStart + Literal('Outer')
outer_block_id = Word(nums)
outer_block_misc = Optional( Suppress(Literal('Some parm to match')) + Word(nums) )

inner_block_begin = lineStart + Literal('Inner')
inner_block_id = Word(nums)
inner_block_prefix = ipv4_prefix + Suppress(restOfLine)

inner_block = \
    Suppress(SkipTo(inner_block_begin, include=True)) + \
    inner_block_id + \
    Suppress(SkipTo(Literal('Some prefixes:'), include=True)) + \
    Group(ZeroOrMore(inner_block_prefix))

outer_block = \
    Suppress(outer_block_begin) + \
    outer_block_id + \
    Suppress(SkipTo(outer_block_misc)) + outer_block_misc + \
    Group(ZeroOrMore(inner_block))

def isOuterBegin( line ):
    bList = outer_block_begin.searchString(line).asList()
    if len( bList ) > 0:
        return True
    return False

strList = []
blockStr = ''
inBlock = False
for line in text.splitlines():    
    if isOuterBegin(line):  # Start new outer block
        if len( blockStr ) > 0 and inBlock:  # Close out previous block
            strList.append( blockStr )
            blockStr = ''  
        else:
            inBlock = True
    if inBlock:        
        blockStr += line + '\n'

if inBlock and len( blockStr ) > 0:  # Close out final block
    strList.append( blockStr )

pList = []
for blockStr in strList:
    bList = outer_block.searchString( blockStr ).asList()[0]
    pList.append( bList )

print( pList ) 

2012-05-09 18:23:50 - Caffeinix - C++ qualified types

I'm trying to use pyparsing to chop namespaces off of C++ qualified types in error output. A qualified type looks like this:

namespace::namespace::Type

This alone is pretty easy to do using pyparsing:

name = Word(alphas + "_", alphanums + "_")
namespace = name.suppress() + FollowedBy("::")
identifier = Group(
    delimitedList(namespace, delim="::") +
    Literal("::").suppress() +
    name)
print identifier.transformString(input)

The trouble comes when I try to also match methods on qualified types:

namespace::namespace::Type::MethodName()

In the first example, my output is simply Type, which is what I want. But in the second example, I want the output to be Type::MethodName(), not just MethodName(). I can find no way of making pyparsing do this.

As far as I've been able to figure out, the trouble is that I need backtracking for this. There's no way to tell that a token is a type and not a namespace without knowing what comes after it. In the case of a qualified type, that's okay, because I can just use a lookahead to ensure that there's another '::' after the token (i.e., it's not the last token). But detecting a method would require me to ensure that there's both a type name and method name after the token before declaring it a namespace. I've tried something like:

function = name + '('
method = name + '::' + function
namespace = name.suppress() + FollowedBy('::' + method)

But this doesn't match at all.

I feel intuitively that this grammar is simple enough for pyparsing to deal with, but I am missing a key insight. Anyone have any idea what it is?

2012-05-14 04:29:47 - ptmcg

In general, whenever you are trying to match a trailing string after some repetition, the pattern is:

  • define what you want to match (expr)
  • define the repetition expression (rep)
  • prefix rep with a negative lookahead, rep = ~expr + rep
  • define the overall expression as something like ZeroOrMore(rep) + expr

You can see this in the code below:

from pyparsing import *

name = Word(alphas + '_', alphanums + '_')
# namespace = name.suppress() + FollowedBy('::')
# identifier = Group(delimitedList(namespace, delim='::') + Literal('::').suppress() + name)

COLONS = Literal('::')
LPAR,RPAR = map(Literal,'()')
METHOD_ARGS = LPAR+RPAR | nestedExpr('(',')')

# define what your trailing string looks like
methodcall = (name + METHOD_ARGS)
typestring = (name + (COLONS + methodcall | ~COLONS))

# use negative lookahead to avoid matching your type text as a namespace
namespace = name + COLONS
namespace = ~typestring + namespace

# just using ZeroOrMore instead of delimitedList
identifier = (ZeroOrMore(namespace).suppress() + typestring).leaveWhitespace()

tests = '''
namespace::namespace::Type
namespace::namespace::Type::MethodName()
'''

print identifier.transformString(tests)

Which gives this result

Type
Type::MethodName()

Some other bits:

  • I usually try to stay in definition space when defining the component expressions, that is how to recognize a particular pattern vs. another, and leave the things like suppression or results names to later steps when the smaller pieces all get assembled into an overall grammar.

  • Once I use a particular literal string more than a couple of times in a grammar, I'll create a separate expression for it (see use of COLONS).

  • Note that I'm trying to anticipate a method call that has arguments, by using nestedExpr as a cheap way to handle an argument list which itself might contain function calls. This may have a downside in that, if there are namespace references within the method args, they may not be stripped the way you want them. If that happens, you'll need to do a more rigorous definition of METHOD_ARGS.

  • I added leaveWhitespace so that spaces before the namespace references won't be consumed.

Hope this gets you further along, -- Paul

2012-05-14 17:38:30 - Caffeinix

Wow! Thanks Paul. This is really helpful indeed. I'm still trying to wrap my mind around your recursive definition of namespace: I think it's basically saying 'what you think is a namespace is only actually a namespace if it does not also match the definition of a typestring.' Is that right? I had no idea you could restrictively redefine like that. Is that the preferred method of setting the priority with which definitions are matched?


2012-05-21 12:08:56 - dGRp - Building AST for n-op abstract algebras

Hello,

I am writing with a question regarding my project. I am writing a parser for a process algebra; until now I used ANTLR with Java, but I want to switch the project to Python. I have a question regarding AST creation in pyparsing. In ANTLR, rules could return values and take arguments, and there was also a tree grammar. This made it easy for me (not a programmer) to create functional parsers. In pyparsing there is only the setParseAction() method. How can I create an AST in the example below?

Basically what I want to parse is something like 4-op arithmetic, but I do not want to evaluate it, since all terms are abstract. I want to create an AST and later traverse it in order to generate something (in my case, a state space).

Example:

P = a.P + b.P1;
P1 = c.P;

I want the AST to be:

= (P + (. (a P) . (b P1)))
= (P1 . (c P))

prefix_op = Literal('.')
choice_op = Literal('+')
parallel = Literal('||')
ident = Word(alphas, alphanums+'_')
lpar = Literal('(').suppress()
rpar = Literal(')').suppress()
define = Literal('=')
semicol = Literal(';').suppress()
col = Literal(',').suppress()
sync = Word('<').suppress() + ident + ZeroOrMore(col + ident) + Word('>').suppress()
coop_op = parallel | sync
# PA grammar
expression = Forward()
process = lpar + ident + rpar | ident | lpar + expression + rpar
prefix = (process + ZeroOrMore(prefix_op + process))
choice = prefix + ZeroOrMore(choice_op + prefix)
expression << choice + ZeroOrMore(coop_op + choice)
rmdef = (ident + define + expression + semicol)

I ask because I could not find a solution anywhere.

2012-05-21 12:10:39 - dGRp

Sorry for the [code] formatting; I don't know what happened, and now I cannot edit.

2012-05-22 10:23:26 - dGRp

OK, so I solved my problem. First of all, I discovered that the function passed to setParseAction(fun) can return an object. This way I can propagate the information I want through the chain of rules.

Here is example:

class Node(object):
    right = None
    left = None
    data = ''


def createNode(toks):
    n = Node()
    return n
...

rule = (Word('Hi') + Literal(';').suppress()).setParseAction(createNode)

2012-05-23 09:36:38 - ptmcg

Before you get much further down this path, please read this thread from the pyparsing users list:

AST's are a good intermediate step, but pyparsing can help you build more active objects as output from your parsing process. See how this is done in SimpleBool.py, for example (on the wiki Examples page).

-- Paul

2012-05-25 05:29:24 - dGRp

Hey,

Sorry for not responding earlier.

My problem area is such that it is not possible for me to work without an AST. I need to create an AST, then re-walk it and create something different from it. Even the legacy reference implementation for my problem does it this way.

The only way to omit the AST is to use an LR parser instead of LL (I assume pyparsing uses LL(1)?).

BTW. Is there a way to control k param in LL(k) in pyparsing?

2012-05-25 06:32:22 - ptmcg

I suspect you have not really read the link that I posted, as in that thread, I describe to that poster how to create an AST using pyparsing's Group class.

Of course, you can approach your problem in whatever way you choose, but from what you've described so far, you are definitely doing this the hard way.

It is very rare in Python to have to implement your own linked list, with next and prev pointers. Python includes its own list structure, to which new items can be freely appended, and which can be easily iterated over. A Python list can contain as an element another, nested Python list, so that a hierarchical structure can be easily represented. And using pyparsing, you don't even have to build your own list, as pyparsing accumulates your parsed tokens for you, into a very rich ParseResults object - ParseResults can be treated as a list, or with named tokens that can be accessed by name lookup or as object attributes (again, see and study the simple example in the linked thread).

Lookahead can be LL(as much as you care to define) in pyparsing, using the FollowedBy lookahead class. Within the definition of your grammar, you can specify something as complex as FollowedBy(validTimeStamp) or FollowedBy(socialsecuritynumber+zipcode) or FollowedBy(zipcode*100). FollowedBy will not consume the given expression from the input string, but it will verify that, at the current parsing position, the next parts will or won't match the given expression.

-- Paul
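
[Editor's note: a small illustration of multi-token lookahead with FollowedBy; the city/zip-code expressions are hypothetical:]

from pyparsing import FollowedBy, Word, alphas, nums

zipcode = Word(nums, exact=5)
city = Word(alphas)
# match a city name only when a 5-digit zip code follows, without consuming it
city_then_zip = city + FollowedBy(zipcode)
print city_then_zip.parseString('Austin 78701')   # ['Austin']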

2012-05-26 11:12:19 - dGRp

Thanks for clarifying. Of course I read the link you provided; in my next projects I will surely try the Python way. This time it was more an exercise for myself in Python, so I created my own Node objects. When the whole project is complete I will provide some info on the forum if you are interested.


2012-05-23 04:27:27 - Madan2 - TypeError: unsupported operand type(s) for ^: 'NoneType' and 'NoneType'

Hi,

I'm new to Python and pyparsing and am trying to parse a program. Initially I had success, but I run into trouble when the expressions get complicated.

I'm trying to build tokens and reuse them to define other tokens / expressions. This is where I'm getting the errors.

I have code like this

LBRACE         = Suppress(Literal('('))
RBRACE         = Suppress(Literal(')'))
get_KW         = Suppress(Literal('get'))

simple_punc    = '-./_:*+=|[~!%]<>?$'
token_char     = alphanums + simple_punc
tokenz         = Word(token_char)

QuotedString   = quotedString

init_constant  = Group(LBRACE + Suppress(tokenz) + LBRACE + tokenz('operator') + LBRACE + Suppress(tokenz) + LBRACE + Suppress(tokenz) + tokenz('variable_name') + Suppress(integer) + LBRACE + Suppress(tokenz) + Suppress(tokenz) + RBRACE + RBRACE + RBRACE + integer('init_value') + RBRACE + RBRACE)('set_expression')
sub_str        = Group(LBRACE + tokenz + tokenz + integer + integer + RBRACE)('set_expression')
fn_call        = Group(LBRACE + tokenz('function name') + OneOrMore(tokenz('argument')) + RBRACE)

get_exp        = (LBRACE + get_KW + tokenz + RBRACE)
trim_get_exp_1 = (LBRACE + tokenz + (get_exp) + QuotedString + RBRACE)
trim_get_exp_2 = (LBRACE + tokenz + (Group(trim_get_exp_1)) + QuotedString + RBRACE)


copy_input_frmt2 = (LBRACE + fmt_KW + QuotedString + trim_get_exp_2 + RBRACE)

copy_exp       = Group(LBRACE + copy_KW + (QuotedString ^ inputpath ^ copy_input_frmt1 ^ copy_input_frmt2) + outputpath + QuotedString + RBRACE)('copy_expression')

set_exp        = Group(LBRACE + set_KW + tokenz('variable_name') + (fn_call ^ sub_str ^ init_constant ^ integer ^ QuotedString ^ trim_get_exp_1 ^ trim_get_exp_2) + RBRACE)('set_exp')

I get exceptions like

Parser3.py:42: SyntaxWarning: Cannot combine element of type <class 'type'> with ParserElement
  trim_get_exp_1 = (LBRACE + tokenz + (get_exp) + QuotedString + RBRACE)
Parser3.py:42: SyntaxWarning: Cannot combine element of type <class 'NoneType'> with ParserElement
  trim_get_exp_1 = (LBRACE + tokenz + (get_exp) + QuotedString + RBRACE)
Parser3.py:43: SyntaxWarning: Cannot combine element of type <class 'type'> with ParserElement
  trim_get_exp_2 = (LBRACE + tokenz + (Group(trim_get_exp_1)) + QuotedString + RBRACE)
Parser3.py:43: SyntaxWarning: Cannot combine element of type <class 'NoneType'> with ParserElement
  trim_get_exp_2 = (LBRACE + tokenz + (Group(trim_get_exp_1)) + QuotedString + RBRACE)
Parser3.py:61: SyntaxWarning: Cannot combine element of type <class 'NoneType'> with ParserElement
  copy_input_frmt2 = (LBRACE + fmt_KW + QuotedString + trim_get_exp_2 + RBRACE)
Parser3.py:65: SyntaxWarning: Cannot combine element of type <class 'NoneType'> with ParserElement
  copy_exp = Group(LBRACE + copy_KW + (QuotedString ^ inputpath ^ copy_input_frmt1 ^ copy_input_frmt2) + outputpath + QuotedString + RBRACE)('copy_expression')
Parser3.py:79: SyntaxWarning: Cannot combine element of type <class 'NoneType'> with ParserElement
  set_exp = Group(LBRACE + set_KW + tokenz('variable_name') + (fn_call ^ sub_str ^ init_constant ^ integer ^ QuotedString ^ trim_get_exp_1 ^ trim_get_exp_2) + RBRACE)('set_exp')
Traceback (most recent call last):
  File 'Parser3.py', line 79, in <module>
    set_exp = Group(LBRACE + set_KW + tokenz('variable_name') + (fn_call ^ sub_str ^ init_constant ^ integer ^ QuotedString ^ trim_get_exp_1 ^ trim_get_exp_2) + RBRACE)('set_exp')
TypeError: unsupported operand type(s) for ^: 'NoneType' and 'NoneType'

Can you please tell us what's wrong with this and how to resolve the error? I did see earlier posted messages and tried to group the expressions with braces, but it didn't work.

ThankYou!

2012-05-23 05:15:12 - ptmcg

I extracted your posted code, and it would not build until I stubbed in the following expression definitions:

integer = Word(nums)
fmt_KW = Forward()
copy_KW = Forward()
set_KW = Forward()
inputpath = Forward()
copy_input_frmt1 = Forward()
outputpath = Forward()

After that, the code runs okay. So I think there is probably something wrong with the way these expressions are defined in your larger parser. For example, you may have left out the arguments to construct an object, like accidentally entering:

inputpath = Word

If I do this, then I get these warnings, very similar to what you are getting:

x.py:33: SyntaxWarning: Cannot combine element of type <type 'type'> with ParserElement
  copy_exp = Group (LBRACE + copy_KW + (QuotedString ^ inputpath ^ copy_input_frmt1 ^ copy_input_frmt2) + outputpath + QuotedString + RBRACE)('copy_expression')
x.py:33: SyntaxWarning: Cannot combine element of type <type 'NoneType'> with ParserElement
  copy_exp = Group (LBRACE + copy_KW + (QuotedString ^ inputpath ^ copy_input_frmt1 ^ copy_input_frmt2) + outputpath + QuotedString + RBRACE)('copy_expression')

-- Paul

2012-06-07 09:45:43 - Madan2

Thanks very much. Indeed, I was making a mistake. Thanks again!!


2012-05-25 05:33:30 - dGRp - Some thoughts and questions on improvement

Here are some loose thought on improving lovely pyparsing.

  1. Add actions to rules that are fired when a rule is entered, not when it is resolved, similar to ANTLR's init block. It could be some kind of setPreAction(str, loc)?

  2. Give possibility to control k in LL(k).

  3. How can we achieve lookup? By ^?

2012-05-25 06:27:51 - dGRp

It would be also nice to be able to pass context to rules (arguments for rules).

2012-05-25 06:34:41 - ptmcg

  1. Pyparsing has debugging actions that fire before match, after successful match, and after failed match. Perhaps the before match action could do this for you.
  2. See my discussion of FollowedBy in your other post.
  3. 'lookup'? Perhaps you mean results names? Check out setResultsName.

-- Paul

2012-05-25 06:36:58 - ptmcg

To pass context to rules, look at how this is done in some of pyparsing's helper methods, like withAttribute, or replaceWith.

2012-05-26 11:10:47 - dGRp

Sorry for the late reply. Thank you for your answer; I will surely look into it.


2012-05-26 14:51:08 - BrenBarn - get original text from ParseResults

Is there any way to get the overall text matched by a ParseResults object? I know about originalTextFor, but that replaces the matched text at match time. What I want is a way to incrementally slice the full nested parse tree to get the text of any partial match. So I want to be able to do, like, myGrammar.parseString('...').someSubElement.otherSubElement.originalText() and get the original text that matched that particular nested bit of the overall grammar. How can I do this?

2012-05-26 17:32:21 - ptmcg

Short of monkeypatching the ParserElement class, there is no generic way to do this for any and every subelement. You might try writing a wrapper class that captures the ending position and adds an 'originalText' named result to any returned ParseResults. Then wrap your expressions with that wrapper class. Something like Forward that just contains another expression - maybe even just subclassing Forward would work. -- Paul
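
[Editor's note: a rough sketch along the lines Paul describes, using a hypothetical helper built from Empty() location markers; for simplicity the integer location markers are left in the token list:]

from pyparsing import Empty, Group, Word, alphas, nums

def with_original_text(expr):
    # record the start and end parse locations around expr, then attach
    # the matched slice of the input as a named 'originalText' result
    marker = lambda name: Empty().setParseAction(lambda s, loc, t: loc)(name)
    marked = marker('_start') + expr + marker('_end')
    def attach(s, l, t):
        t['originalText'] = s[t._start:t._end]
    marked.addParseAction(attach)
    return Group(marked)

inner = Word(alphas)('word') + Word(nums)('num')
r = with_original_text(inner).parseString('abc   123')[0]
print r.originalText    # abc   123
print r.word, r.num     # abc 123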


2012-06-07 09:56:55 - Madan2 - combine - space bn tokens

Hi,

I'm using Combine. It works fine, but it strips off the whitespace in between the tokens. Is there any way to retain the space, OR introduce one space between each token?

actual text : ((i 0 (+ i 1)))

Combine((printLBRACE + tokenz + tokenz + printLBRACE + tokenz + tokenz + tokenz + printRBRACE + printRBRACE),adjacent=False)

gives me (i0(+i1))

I would like to have '((i 0 (+ i 1)))', or at least one space between each token.

Thank you!

2012-06-07 11:06:41 - ptmcg

Try this:

Combine((...exprs...), joinStr=' ', adjacent=False)

2012-06-08 09:52:01 - Madan2

Thank you Sir! Much appreciated for the instant reply.

It's actually joinString, not joinStr.

2012-06-10 09:33:39 - ptmcg

Guess I should read the docs. :)


2012-06-10 06:12:29 - bsr203 - Rules for Repeating sections of data

Hello group,

I am new to Python and writing a program to parse EDI files. I came across pyparsing and it looks like a good fit for my use case. The input data file has a well-defined syntax, but some of the lines may be optional, and there can be blocks of lines which repeat. My main question is: is there a way to map a repeating block of data (loop) into a list? For example, here is a sample file.

ISA*00* *00* *01*987654321 *01*123456789 *020917*0913*U*00400*000000901*0*P*>@ 
GS*SH*987654321*123456789*20020917*0913*965*X*004010@
ST*856*0001@
BSN*00*260784*20020917*0913@
DTM*011*20020917*0913*@
HL*01**S@
MEA*PD*G*355*LB@
TD1*CTN90*3@
TD5*B*2*RDWY*LT****@ 
TD3*TL*RDWY*1234567890@
REF*BM*260784@
N1*MI**01*987654321@
N1*SU**01*123456789@
N1*ST**01*987654321@
HL*02*01*O@ 
LIN**BP*PARTNUMBER1*PO*P012345*RE*001@ SN1**40*PC*1084@     //Block 1
HL*03*02*I@
REF*DK*DOCKA@
CLD*1*40*CTN90@
REF*LS*S562896@
HL*04*01*O@ 
LIN**BP*PARTNUMBER2*PO*P012316*RE*004@  //Block 2
SN1**100*PC*32400@
HL*05*04*I@ 
REF*DK*DOCKB@ 
CLD*2*50*CTN90@ 
REF*LS*S562897@ 
REF*LS*S562898@ 
CTT*2@ SE*29*0001@ 
GE*1*965@ 
IEA*1*000000901@

So the data has a header section, two blocks of data, and a trailing section. Instead of 2, there could be hundreds of blocks, and some lines in each block may be optional.

  1. Is there a way to write a grammar which has some context? I ask because the 'REF' tag appears both in the header section and within a block (say, a block whose beginning is marked by the LIN tag).

  2. Is there a way to capture the repeating block of data (note that the number of lines varies, as some of them are optional) into a Python list?

Thank you very much.

2012-06-10 09:23:15 - ptmcg

I suggest you start small; here is a prototype parser with simplified line markers for tags. You can use this little prototype grammar to experiment with repetition, grouping, and ordered/out-of-order data.

Here is the first example, with a header line starting with 'A' and a terminator line starting with 'Z'. You can see in the definition of AZ_rec that between these two there can be 0 or more B, C, or D lines. The line groups that have values are also labeled with their key: A, B, C, or D.

from pyparsing import *
ParserElement.setDefaultWhitespaceChars(' \t')
NL = LineEnd().suppress()

integer = Word(nums).setParseAction(lambda t:int(t[0]))
STAR = Suppress('*')
A_line = Group('A' + STAR + integer('value') + NL)
B_line = Group('B' + STAR + integer('value') + NL)
C_line = Group('C' + STAR + integer('value') + NL)
D_line = Group('D' + STAR + integer('value') + NL)
Z_line = 'Z' + STAR + NL

AZ_rec = (A_line('A') + 
            Group(ZeroOrMore(B_line))('B') + 
            Group(ZeroOrMore(C_line))('C') +
            Group(ZeroOrMore(D_line))('D') +
            Z_line)

Here is some sample data, and a test routine to parse and output the results:

data1 = '''\
A*100
B*10
C*11
D*12
Z*'''

data2 = '''\
A*101
B*10
B*11
B*12
Z*'''

def testParse(expr,s):
    data = expr.parseString(s)
    print data.asList()
    print 'A', data.A.value
    for key in 'BCD':
        if data[key]:
            print key, '-', ','.join(str(d.value) for d in data[key])
    print

testParse(AZ_rec, data1)
testParse(AZ_rec, data2)

Giving:

[['A', 100], [['B', 10]], [['C', 11]], [['D', 12]], 'Z']
A 100
B - 10
C - 11
D - 12

[['A', 101], [['B', 10], ['B', 11], ['B', 12]], [], [], 'Z']
A 101
B - 10,11,12

Notice how the values are extracted by name from the parsed data (data.A.value, or d.value for d in data[key]). With pyparsing's named results, you don't have to count up indexes into lists of tokens (which can break when the grammar evolves in the future and new fields are introduced in the middle of existing ones).

Now here is a slight variation on the previous parser, in which the B, C, and D internal records might not occur in nice B-C-D order, but might be C-B-D, or C-D-B, D-B, etc. The difference is that this grammar uses the '&' operator instead of '+' to join the inner 3 record types.

AZ_rec2 = (A_line('A') + 
            (Group(ZeroOrMore(B_line))('B') &
             Group(ZeroOrMore(C_line))('C') &
             Group(ZeroOrMore(D_line))('D')) +
            Z_line)

data3 = '''\
A*102
C*11
C*21
C*31
B*10
Z*'''

testParse(AZ_rec2, data3)

Giving:

[['A', 102], [['B', 10]], [['C', 11], ['C', 21], ['C', 31]], [], [['B', 10]], [], [], 'Z']
A 102
B - 10
C - 11,21,31

See if you can make some headway from here.

Welcome to Python and pyparsing! -- Paul

2012-06-11 03:56:26 - bsr203

Paul, thank you very much for the detailed answer and encouragement. I appreciate your effort supporting pyparsing through this forum and SO. After some XML processing, I will get back and update my progress. Thanks again.


2012-06-10 14:23:11 - ofuhrer - Replace only locally

Hi,

I've written a parser for Fortran namelist which have the general format...

&name
 a = 1,
 b = 2,
 c = 3, ! this is a comment
/

The parser works perfectly. The key=value pairs are recognized as such and given appropriate names with setParseAction(). After having checked the correct syntax (i.e. correct parsing) of the namelist, I would like to check for the presence of a specific key=value pair. If the key is present, I would like to change the value and output the namelist OTHERWISE UNTOUCHED, meaning with all the whitespace, comments etc. that were present in the original unparsed version. Any hints on how to achieve that would be appreciated.

Cheers, Oli
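
One way to do this (a sketch, assuming the key to change is known in advance; names here are illustrative) is to match only that key=value pair, and let transformString copy everything else, whitespace and comments included, through untouched:

from pyparsing import Keyword, Regex, Suppress

namelist_text = '''&name
 a = 1,
 b = 2,
 c = 3, ! this is a comment
/'''

# match just the target pair; transformString copies all other text
# (whitespace, comments, other pairs) through unchanged
target = Keyword('b') + Suppress('=') + Regex(r'[^,!\n]*')
target.setParseAction(lambda t: 'b = 99')

print target.transformString(namelist_text)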


2012-06-11 02:51:37 - willem179 - ParseResults inconsistent after setitem

Suppose I want to change a number embedded in an alpha-string using transformString. Like, for instance, changing 123 to 456 embedded in a string of x's:

xxxxxxxx123xxx -> xxxxxxxx456xxx

The code below shows two ways to do this:

from pyparsing import Group, ZeroOrMore, Word, nums, ParseResults

def changeNum1 (res):
    res[8] = 456      # how to know that the number is at position 8?
    print 'dict:', res.number, 'list:', res[8],

def changeNum2 (res): # this does *not* change the returned parse result
    res.number = 456
    print 'dict:', res.number, 'list:', res[8],

exp = ZeroOrMore ('x') + Word (nums)('number') + ZeroOrMore ('x')
exp1 = exp.copy().setParseAction (changeNum1)
exp2 = exp.copy().setParseAction (changeNum2)

print '--> ' + exp1.transformString ('xxxxxxxx123xxx')
print '--> ' + exp2.transformString ('xxxxxxxx123xxx')

This example produces the following output:

dict: 123 list: 456 --> xxxxxxxx456xxx
dict: 456 list: 123 --> xxxxxxxx123xxx

The first transformation (exp1) works, the second (exp2) does not. In the first we change the parse result via its list representation; in the second we use the dict representation. When the dict representation is changed, this is not reflected in the list representation - the two are not kept in sync. This seems unfortunate, because changing the dict is far more user friendly (position independent) than using an index into a list.

The __setitem__ method of the ParseResults class does not keep the list representation synchronized with the dictionary representation. Changing one does not change the other. And only the list representation seems to be propagated to the final result. (Unfortunately I do not yet understand the reason why __setitem__ leaves the list and dict representation of a parse result in an inconsistent state)

I have searched this discussion forum, and only found an earlier post where the solution is a crooked way (using the internal Python id of an object!) to get to the list index of a named item, clearly showing that this info is missing from the API. In my opinion the list index(ices) of (all matches of) a named item should be accessible through the API (it is held in __tokdict).


2012-06-25 11:31:52 - Madan2 - Dealing with "" in data

Hi,

While parsing I'm encountering a '' in the data and getting an error. Is there any way to overcome this?

I've cut and pasted the data below:

(set! mMiscStringTwo (string-tokens (cdr (assoc mFreightClass (vector->list mItemVector))) #\~))

Help is much appreciated.

Thank you! Madan

2012-06-27 21:30:24 - ptmcg

It is pretty much impossible to help you much on this question looking at just the input string - can you post the pyparsing grammar, or at least the part that is failing?


2012-06-26 16:47:14 - chlim - parsing identical strings and multi-lines

I'm a newbie and have spent quite a bit of time trying, still unsuccessfully, to parse the following text with multiple records. I think pyparsing will do this for me. Appreciate your help in advance. Thank you.

  1. there are 2 'parent:' fields that I need to process separately as parent 1 and parent 2. Currently it seems like only 'pcs2_group' is matched.
  2. the 'description' field is multi-line, starting on the next line.

I am not sure if searchString can do it or not now...

sample = r'''changeset:  2916:cbeb5f68b46b725ebeb0192e4b6852db6c9bd6f3
parent:      2914:ab2526b29654115d3327c4ae31243e019f4739c5
parent:      -1:0000000000000000000000000000000000000000
description:
Bug 123: blah line 1
Bug 455: blah line 2

changeset:   2915:b21b281f5bf00350823aadd64730efb18f62150f
... another record ...

'''
SkipToNextRecord = SkipTo( 'changeset:', include=False )
SkipToKey = SkipTo( Word(alphas), include=False )
cset = Word(nums).setResultsName('revId') + Suppress(':') + Word(alphanums).setResultsName('rev')
changesetStmt = Group( 'changeset:' + cset('changeset_group') ) + SkipToKey
parCset = Word(nums).setResultsName('revId') + Suppress(':') + Word(alphanums).setResultsName('rev')
parCsetStmt = Group( 'parent:' + cset('pcs_group') ) + SkipToKey
pcs2cset = ZeroOrMore('-') + Word(nums).setResultsName('revId') + Suppress(':') + Word(alphanums).setResultsName('rev')
par2CsetStmt = Group( 'parent:' + pcs2cset('pcs2_group') ) + SkipToKey
changesetLine = 'changeset:' + SkipTo( Literal('\n').suppress() )

descLine = Word(alphanums)
descrDef = OneOrMore( ~changesetLine )
descrStmt = Group( 'description:' +  descrDef('DESCR') ) + changesetLine
changesetDef = Dict(   changesetStmt 
                     + ZeroOrMore(parCsetStmt)
                     + ZeroOrMore(par2CsetStmt)
                     + ZeroOrMore(descrStmt).setDebug()
                    ) + SkipToNextRecord

for csetDict in changesetDef.searchString(sample):
    print csetDict.dump()
    print  '-' * 8

2012-07-11 17:33:48 - chlim

I'm getting closer... I'll probably post the solution when I have it.


2012-07-02 07:16:21 - DiaaFayed - dynamic extractor statement

Can we compose a dynamic extractor statement? For example, take the scanString() example below:

#################
print 'Example of an extractor'
print '----------------------'

# simple grammar to match #define's
ident = Word(alphas, alphanums+'_')
macroDef = Literal('#define') + ident.setResultsName('name') + '=' + restOfLine.setResultsName('value')
for t,s,e in macroDef.scanString( testData ):
    print t.name,':', t.value

# or a quick way to make a dictionary of the names and values 
# (return only key and value tokens, and construct dict from key-value pairs)
# - empty ahead of restOfLine advances past leading whitespace, does implicit lstrip during parsing
macroDef = Suppress('#define') + ident + Suppress('=') + empty + restOfLine
macros = dict(list(macroDef.searchString(testData)))
print 'macros =', macros
print

I need the variables ident, name, value, and restOfLine to be read dynamically, and the extractor statement composed at run time.

Each iteration, the values will change.

2012-07-02 11:43:36 - DiaaFayed


2012-07-02 12:52:58 - DiaaFayed - a new feature of the Python eval and exec commands

I need to understand this item:

September 9, 2007 - Pyparsing Recipe in the Python Cookbook

2012-07-03 03:07:23 - ptmcg

Diaa - there is not a lot of information here. Can you post a URL to the Python Cookbook recipe you mean? -- Paul

2012-07-03 08:59:51 - DiaaFayed

September 9, 2007 - Pyparsing Recipe in the Python Cookbook: Kevin Atkinson has submitted this recipe to the Python Cookbook. It uses a new feature of the Python eval and exec commands to implement custom behavior when a symbol is not found. So instead of this:

B = Forward()
C = Forward()
A = B + C
B << Literal('b')
C << Literal('c')
D_list = Forward()
D = Forward()
D_list << (D | (D + D_list))
D << Literal('d')

you can just write this:

A = B + C
B = Literal('b') 
C = Literal('c')
D_list = D | (D + D_list)
D = Literal('d')

2012-07-03 09:22:45 - ptmcg

Diaa -

Do not use this recipe, especially since you are still working through basic pyparsing and Python concepts. All this recipe does is allow you to avoid the pre-definition of Forward instances, at the risk of introducing other dependencies and errors. It does not make pyparsing any easier or faster, and does not conform to any of the docs or examples.

-- Paul


2012-07-03 13:30:03 - BrenBarn - Copying ParseResults attributes

I'm finding that pyparsing does not handle ParseResults attributes in a consistent way. The pyparsing code seems to cavalierly treat ParseResults objects and lists as equivalent, even though they're not: ParseResults objects can have attributes (i.e., subexpressions with results names), but lists don't.

I'm writing some utility parsing expressions. I have one that acts like the builtin Group() but also saves some additional information about the parsed object in some attributes. However, when in postParse I wrap the result in a list, the attributes are lost.

I can't find anything in the ParseResults object that provides an easy way to copy the attributes from one ParseResults object to another. What am I supposed to do if I want to create a new ParseResults object that 'wraps' another, retaining its attributes?

Here's an example. In pyparsing's Group class, it has this:

def postParse( self, instring, loc, tokenlist ):
        return [tokenlist]

How could this be modified to create a Group that retains the attributes of its subexpressions? The only place that I can see attribute-copying code is in the __iadd__ method of ParseResults, but this again assumes that I want to retain the same token list. I don't. I want to make a new ParseResults object from an existing one, and wind up with a DIFFERENT token list but the SAME attributes as the original.

2012-07-03 14:11:53 - ptmcg

Not sure if these qualify as solutions to your problem, but they might give you some ideas on how to work around current limitations.

-- Paul

from pyparsing import *

# a simple key-value grammar to load up some ParseResults
ident = Word(alphas, alphanums)
integer = Word(nums)
table = dictOf(ident, integer)

res1 = table.parseString('A 3 B 7 D 22')
print res1.dump()

res2 = table.parseString('Z 12 Y 17')
print res2.dump()

# delete all entries from a PR list, but leave attributes intact
del res2[:]
print res2.dump()

# ParseResults don't have append, but they do iadd
res2 += ParseResults(list('ABCDEF'))
print res2.dump()

# use dict-style assignment to copy key-values from one PR to another
res3 = ParseResults(list('ABCDEFG'))
for k,v in res1.items():
    res3[k] = v
print res3.dump()

2012-07-03 14:12:38 - ptmcg

Sorry, forgot the output results:

[['A', '3'], ['B', '7'], ['D', '22']]
- A: 3
- B: 7
- D: 22
[['Z', '12'], ['Y', '17']]
- Y: 17
- Z: 12
[]
- Y: 17
- Z: 12
['A', 'B', 'C', 'D', 'E', 'F']
- Y: 17
- Z: 12
['A', 'B', 'C', 'D', 'E', 'F', 'G']
- A: 3
- B: 7
- D: 22

2012-07-03 14:16:05 - ptmcg

Oh, one other thing - when you say that you 'wrap one PR inside a list, and the attributes are lost', I don't think they are lost; they are just held as attributes of the [0]th element in the list, kind of like wrapping a dict in a list. Group has to do this, so that an expression with results names that occurs multiple times can be wrapped in Group, and the different parse results will be kept from stepping on each other.


2012-07-04 08:25:29 - einar77 - Parsing multi-line records

Hello,

I'm trying to get pyparsing to properly parse records such as

DRUG        D09347  Fostamatinib (USAN)
            D09348  Fostamatinib disodium (USAN)
            D09692  Veliparib (USAN/INN)
            D09730  Olaparib (JAN/INN)
            D09913  Iniparib (USAN/INN)

So far what I did is

from pyparsing import *
punctuation = ",.'`&-"
special_chars = r'\()[]'

drug = Keyword('DRUG')
drug_content = Word(alphanums) + originalTextFor(OneOrMore(Word(
    alphanums + special_chars))) + ZeroOrMore(LineEnd())
drug_lines = OneOrMore(drug_content)
drug_parser = drug + drug_lines

However OneOrMore is too greedy, and grabs the following lines as well (example with 3 entries):

['DRUG', ['D09347', 'Fostamatinib (USAN)
        D09348  Fostamatinib disodium      (USAN)
        D09692  Veliparib (USAN']]

What I'd like instead would be to have

['DRUG', [['D09347', 'Fostamatinib (USAN)'], ['D09348', 'Fostamatinib disodium (USAN)'],
           ['D09692', ' Veliparib (USAN)']]]

I'm fairly sure I'm getting things wrong here. Where?

2012-07-04 10:11:25 - ptmcg

I usually like to try to 'be the parser'. When you see the word 'DRUG', how do you know that it is not supposed to match the leading Word(alphanums) in drug_content? Probably because 'DRUG' is a special word in your syntax. So you specifically want to exclude 'DRUG' from matching that first word - the easiest way is to use a negative lookahead, using NotAny or its shorthand, the ~ operator. Try this:

drug_content = ~drug + Word(alphanums) + originalTextFor(OneOrMore(Word(
      alphanums + special_chars))) + ZeroOrMore(LineEnd())

You could also attach a validating parse action to the leading word to check the column number, and if it is 1, raise a ParseException.

-- Paul

2012-07-05 09:06:04 - einar77

Apparently it's not working yet:

In [515]: drug_parser.parseString(contents)
Out[515]: (['DRUG', 'D09347', 'Fostamatinib (USAN)\n            D09348  Fostamatinib disodium (USAN)\n            D09692  Veliparib (USAN'], {})

The only change I did was to switch drug_content, the rest was like in my example.

2012-07-05 09:19:09 - ptmcg

Sorry, I misread your 'what I want' example. If you just want to read from 'D0####' to the end of the line, then this version of drug_content will give you a separate entry for each line:

drug_content = Group(~drug + Word(alphanums) + empty + restOfLine)

Group tells pyparsing to create a sublist for each drug_content. empty advances the parser to the next non-whitespace character, and restOfLine gets everything up to the next newline.

Hope this gets you closer, -- Paul

2012-07-06 00:58:09 - einar77

Thanks, this gets one of the hardest bits sorted. Now I have a similar record, but the issue here is that I need everything on the line up to a certain point.

The first bit is identical to the other, with just a different keyword

GENE        3932  LCK; lymphocyte-specific protein tyrosine kinase (EC:2.7.10.2) [KO:K05856] [EC:2.7.10.2]

And it's a multi-line record like the DRUG one (so the same things apply). However, here I need to parse just part of the line, up to the word with the first parenthesis; the rest should be ignored. In other words, the bit that interests me is

GENE        3932  LCK; lymphocyte-specific protein tyrosine kinase

The part after the ;, however, is variable-length. So far I got the first two bits OK

gene_line = ~gene + Word(nums) + empty + Word(alphanums, excludeChars=';')

But how can I tell pyparsing to parse 'up to a certain point'?

2012-07-06 01:11:43 - einar77

I forgot to add: the 'good' text may also contain parentheses, so I can't just ignore text starting from the first paren ('(') onwards.

2012-07-07 07:51:35 - ptmcg

Hmm, your last statement kind of confounds the issue. I was going to suggest using SkipTo('('), but that won't work if you have ()'s in your desired text as well.

It looks like what you want is to read up to something matching this: '(EC:2.7.10.2)'. Create an expression that matches that with some rigor, not just '(' + Word(alphanums+':.') + ')', which might still match some text that you actually want, but more like Combine('(' + Word(alphas.upper()) + ':' + delimitedList(Word(nums),'.') + ')') or, if you prefer, Regex(r'\([A-Z]{2}:\d+(\.\d+)+\)'). This looks like some kind of reference to me, so I'll call it 'reference'. You can then use SkipTo(reference) to read the desired text.
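
A quick sketch of that approach (results names here are illustrative, and it assumes the reference always terminates the interesting text):

from pyparsing import Regex, SkipTo, Word, nums

reference = Regex(r'\([A-Z]{2}:\d+(\.\d+)+\)')
gene_line = Word(nums)('geneId') + SkipTo(reference)('name')

sample = '3932  LCK; lymphocyte-specific protein tyrosine kinase (EC:2.7.10.2) [KO:K05856]'
print gene_line.parseString(sample).dump()
# 'name' holds 'LCK; lymphocyte-specific protein tyrosine kinase' (with a
# trailing space, which a parse action could strip)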

A side note: when you are parsing the first word ('LCK' in your sample), you are using this expression:

Word(alphanums, excludeChars=';')

There is no point in using excludeChars here, since ';' is not one of the characters in alphanums. The purpose of excludeChars is to simplify defining words like 'any printable character except '/' or '.'' as Word(printables, excludeChars='/.') . Before I added excludeChars, you had to write something like Word(printables.replace('/','').replace('.','')) . In your case, Word(alphanums) will work just fine, and still not read the trailing ';'.


2012-07-11 17:36:06 - chlim - svn syntax

I'm having some trouble parsing this

convert_revision=svn:9171d42e-b04d-0410-96dc-cb0bc40dcdda/realstore/trunk@2222

Something simple like this didn't work.

svn = 'svn' + Word(alphanums +  ':' + '-' + '/@' )

Am I missing something here?

2012-07-11 18:49:55 - ptmcg

This parses everything after 'convert_revision=' just fine. To parse the whole string, try something like:

expr = Word(alphas,alphanums+'_') + '=' + svn

You can also create more specific subparsers for the uuid, path, '@' and trailing integer, instead of just blending them all into one Word expression.
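
A sketch of that more specific version might look like this (names are illustrative):

from pyparsing import Combine, Suppress, Word, alphas, alphanums, hexnums, nums

key = Word(alphas, alphanums + '_')('key')
uuid = Combine(Word(hexnums) + ('-' + Word(hexnums)) * 4)('uuid')
path = Word(alphanums + '/_')('path')
revnum = Word(nums)('revnum')
svn = Suppress('svn:') + uuid + Suppress('/') + path + Suppress('@') + revnum
expr = key + Suppress('=') + svn

test = 'convert_revision=svn:9171d42e-b04d-0410-96dc-cb0bc40dcdda/realstore/trunk@2222'
print expr.parseString(test).dump()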


2012-07-16 08:45:49 - DiaaFayed - how can we restore setParseAction results ?

Please:

To simplify the grammar of my problem, I build some grammars that process matched tokens in setParseAction().

My question: how can I return the results of the parsing done inside setParseAction, so they are concatenated to the main parsing results?

Example (pseudocode):

def minorgrammar(toks):
    grammar2 = expression
    for t,s,e in grammar2.scanString(toks[0], maxMatches=1):
        # ... process the match ...
        return (t,s,e)


# main grammar
grammar1 = expression.setParseAction(minorgrammar)
x = grammar1.parseString(data)

I want the final returned parse tree to contain x plus the result of the minor grammar.

Also, this procedure will be executed for each line in a text file, and all the results will be sent to one big XML file.


Sorry, I cannot solve the problem in one grammar, only in a cascaded fashion.

Also, I cannot post my grammar code, for security reasons related to my study.

thanks

2012-07-16 20:47:20 - ptmcg

Try something like this:

from pyparsing import *

# a hypothetical outer parser, with an unparsed SkipTo element
color = oneOf('red orange yellow green blue purple')
expression = SkipTo('XXX') + 'XXX' + color('color')

data = 'JUNK 100 200 10 XXX green'
print expression.parseString(data).dump()

# main grammar
def minorgrammar(toks):
    # a simple inner grammar
    integer = Word(nums)
    grammar2 = integer('A') + integer('B') + integer('C')

    # use scanString to find the inner grammar
    # (since we just want the first occurrence, we can use next
    # instead of a for loop with a break)
    t,s,e = next(grammar2.scanString(toks[0],maxMatches=1))

    # remove 0'th element from toks
    del toks[0]

    # return a new ParseResults, the sum of t and everything 
    # in toks after toks[0] was removed
    return t + toks

grammar1 = expression.setParseAction(minorgrammar)
x = grammar1.parseString(data)
print x.dump()

prints:

['JUNK 100 200 10 ', 'XXX', 'green']
- color: green
['100', '200', '10', 'XXX', 'green']
- A: 100
- B: 200
- C: 10
- color: green

-- Paul

2012-07-24 16:07:28 - DiaaFayed

Thanks very, very much.

I will build on that, since there are many setParseAction functions.

I will also study how to traverse the resulting tree and rearrange it for each parsed line.

2012-07-25 05:39:10 - DiaaFayed

Please, another question:

  1. I produce the results of both the main grammar and the minor grammar as XML (via asXML()).

  2. I need to place the minor XML results as elements in between the main XML tree elements.

  3. I have two options to produce this result: (a) traverse the asXML() result tree and insert the minor grammar's XML tree into it, or (b) produce one XML file for the main grammar tree and another for the minor grammar, then merge the two XML files by building cross-references.

Now I am testing the first option, but the problem is that the asXML() output cannot be traversed or edited at run time to insert the minor grammar's XML tree.


2012-07-18 07:37:10 - paulelastic - Parsing Expression Grammar (PEG)

Hi!

I have a conceptual question.

I've been using pyparsing for years, and recently I came across this notion of a Parsing Expression Grammar (PEG), which I understand to be a stricter/deterministic Context-free grammar (CFG).

It seems to me that it boils down to the ability to make an ordered choice, as well as the inclusion of a packrat parser.

If I'm not mistaken, the 7 operators specified in PEG have pyparsing analogues:

  • Sequence: e1 e2 -> e1 + e2
  • Ordered Choice: e1 / e2 -> e1 | e2
  • Zero-or-More: e* -> ZeroOrMore(e)
  • One-or-More: e+ -> OneOrMore(e)
  • Optional: e? -> Optional(e)
  • And-predicate: &e -> FollowedBy(e)
  • Not-predicate: !e -> ~e

And packrat parsing in pyparsing is turned on using:

import pyparsing as pp
pp.ParserElement.enablePackrat()

Would I then be correct in saying that pyparsing can be used to implement PEGs, as long as the above constraints are followed?

Thanks!

2012-07-19 19:40:42 - ptmcg

Yes, I would say so.

I'm surprised, though, that the canonical PEG definition omits:

  • no match -> NoMatch
  • any match -> Empty
  • unordered choice -> e1 & e2

Also, there are several possible implementations of Ordered Choice:

  • match first (MatchFirst)
  • match longest (Or)
  • match all and select most successful overall parse (not implemented in pyparsing - can be extremely slow)
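
A tiny sketch of how the first two behave differently:

from pyparsing import Word, nums

integer = Word(nums)
pair = integer + '.' + integer

# '|' (MatchFirst) takes the first alternative that matches;
# '^' (Or) takes the longest
print (integer | pair).parseString('3.14').asList()   # -> ['3']
print (integer ^ pair).parseString('3.14').asList()   # -> ['3', '.', '14']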

And I'm not sure that packratting is a necessary part of PEGs; it's more that implementations of PEGs lend themselves to supporting packratting.

-- Paul

2012-07-26 11:46:37 - paulelastic

Thanks for the helpful answer, Paul!

After some research, I believe there are a few more subtleties to PEG (such as greedy matching), and Empty is defined. Unordered choice is intentionally eschewed in PEGs because it can lead to ambiguities. But it does seem that the PEG languages can be entirely represented in pyparsing.

I was merely curious because everyone's talking about PEGs -- I guess one of the big reasons there's so much buzz is due to the fact that it makes linear time packrat parsing possible.


2012-07-18 07:46:50 - paulelastic - Visual debugger for pyparsing

I'm just wondering, how hard would it be to write a visual debugger for pyparsing, sort of in the vein of the Regex Coach?

Has anyone attempted such a thing? Would it just be a matter of programmatically adding color codes (e.g. ANSI, HTML, etc.) around matched tokens using .setParseAction?

How do folks here debug their pyparsing expressions?

Thanks!

2012-07-18 18:02:12 - ptmcg

Catherine Devlin posted on her blog about pyparsing_helper - you can install it with easy_install.

2012-07-19 20:54:58 - paulelastic

Thanks for heads up, Paul!

pyparsing_helper is really nice. However, what I had in mind was something that goes a little beyond that and does syntax highlighting for individual parse expressions.

For instance:

var = Word(alphas,alphanums)
cmp = Literal('<=') | Literal('>=')
float = Word(nums) + Optional(Literal('.') + Word(nums))
statement = Optional(float + cmp) + var + cmp + float

results = statement.parseString('2.4 <= x1 <= 4.2')

I'd like to build a tool where if I hover my mouse over the parse expression 'Optional(float + cmp)' in 'statement', it will highlight '2.4 <=' in the input string. Or if I hover over the float, it will highlight '2.4' and '4.2' in the input string.

To do this, there would need to be some mechanism for each matched token to return--upon parsing--a tuple containing its start and end position in the input string (or a list of them in the case of ZeroOrMore/OneOrMore's).

Is this hard to do?

I'm just thinking this would help greatly in debugging ambiguous grammars, where I'm always wondering where a parse expression match stops matching, and the next parse expression starts to match.

2012-07-20 02:33:38 - ptmcg

What you describe does sound helpful. If I were writing this, I would use scanString, which returns a generator that yields (tokens, start, end) tuples. The debugger would take the selected grammar fragment, run fragment.scanString, and highlight all the start-end regions. (Watch out for embedded tab characters.)
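
A toy sketch of that idea, with bracket characters standing in for color codes:

from pyparsing import Combine, Word, nums

def highlight(fragment, text):
    # mark every region matched by the fragment, using the
    # (tokens, start, end) tuples that scanString yields
    out, last = [], 0
    for _, s, e in fragment.scanString(text):
        out.append(text[last:s])
        out.append('[' + text[s:e] + ']')
        last = e
    out.append(text[last:])
    return ''.join(out)

floatnum = Combine(Word(nums) + '.' + Word(nums))
print highlight(floatnum, '2.4 <= x1 <= 4.2')
# -> [2.4] <= x1 <= [4.2]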

But where would you mouse-hover in that expression for your debugger to detect that you wanted to highlight matches for 'Optional(float + cmp)'? If the mouse was just over the word 'float', would you want every float highlighted? How would you hover over 'float + cmp'? Maybe you would have to actually select a region of the grammar so that your scanner could comprehend when you are trying to match larger pieces of the grammar.

And how would the debugger know that '2.4 <= x1 <= 4.2' was the string in which to highlight the matches? You might have to do something more like pyparsing_helper, where you put input text into a separate panel.

Your project sounds interesting, and maybe pyparsing_helper could be a place to start. I seem to remember from Catherine's blog that, after she wrote this 0.1.0 version, she found another interactive utility that was more general purpose, but did most everything she wanted. So maybe this other utility might be more fully featured in the way you want. Or just write to Catherine directly, and see what she uses now for debugging her pyparsing programs.

HTH, -- Paul

2012-07-26 12:00:24 - paulelastic

I did read on Catherine's blog that she was using reInteract to debug pyparsing. It's kind of a worksheet REPL for Python in general, and it's quite wonderful, but it still doesn't provide the kind of visualization that I envisage.

scanString is a good idea. I have to think about that a bit.

Yes, if I hover over the definition of float, all the float lexemes should be highlighted. But if I hover over the float contained within the Optional(), only those floats inside that production rule should be highlighted. And yes, the input string will have to be provided to the highlighting code.

I'm wondering if a pseudo-lex-then-parse method may be able to provide more information in this case. Once an input string is tokenized, the highlighter obtains a map from the tokens to the lexemes. So if a user highlights a token, the program will highlight the part of the string that matches the token.

Similarly, the highlighter can match BNF production rules to tokenlists, and when the user highlights a production rule, the part of the string gets highlighted.

Of course, all this can get quite complicated with ZeroOrMore's and OneOrMore's -- so maybe my ideas are out of left field here. But I'm just wondering aloud.

Anyway, I'll leave it at that. I'll think about this some more.


2012-08-09 07:35:47 - Leevi3 - multiple parse actions

Hi, as this is my first post I want to thank the developer(s) for this great package.

My question refers to the statement 'Multiple functions can be attached to a ParserElement by specifying multiple arguments to setParseAction, or by calling setParseAction multiple times.' in the documentation.

I want to add 2 parse actions to an element. Adding two functions in one call of setParseAction works fine, but doing it one after the other seems to only attach the most recent action to the element. For example:

import pyparsing as PYP

def one():
    print 'one',

def two():
    print 'two',

atom = PYP.Word(PYP.alphas, PYP.alphanums, 1)
atom.setParseAction(one)
atom.setParseAction(two)
test = PYP.delimitedList(atom)
s = 'ji2n, d292, o33, ok3'
test.parseString(s)

returns

two two two two

where I would expect

one two one two one two one two

which is what I get if I add both functions at once:

atom.setParseAction(one, two)

My question is: why is 'one' not printed in the first case? To make this question more general:

Is there a way of accessing the actions currently attached to an element? (for example to change the grammar during execution of a program).

Thanks for any replies, L.

2012-08-09 07:44:00 - ptmcg

Sorry - the docs should read 'call addParseAction multiple times', not setParseAction. Will fix.

Whenever I have had to define a dynamic grammar (which changes based on data that has been parsed so far), I use a Forward expression, and then inject new expressions into it at parse time using a parse action. See how this is done in the countedArray helper.
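
A small sketch of that Forward-injection idea, modeled on (not copied from) countedArray; names here are illustrative and it assumes the count is at least 1:

from pyparsing import And, Forward, Group, Word, alphas, nums

contents = Forward()

def defineContents(s, l, t):
    n = int(t[0])
    # inject the just-determined expression into the Forward
    contents << Group(And([Word(alphas)] * n))
    return []   # suppress the count token itself

countedWords = Word(nums).setParseAction(defineContents) + contents

print countedWords.parseString('3 ay bee cee').asList()
# -> [['ay', 'bee', 'cee']]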

2012-08-09 07:50:20 - Leevi3

Thanks for the super fast reply - it works. Are there further ways of viewing or removing the actions currently attached to an element, or is cumulative adding the intended way of using actions?

Thanks again for this great package!

2012-08-09 08:55:43 - ptmcg

Look at the expression's parseAction attribute - it's just a regular list, so you can pop or add functions to it (but this will make your code more obscure when you need to maintain it later). You can also clear the list using setParseAction(None).

Welcome to pyparsing, glad you are enjoying it!


2012-08-17 14:36:15 - script_lover - operatorPrecedence generate rules that cannot be validated

Take the operatorPrecedence example: if I run boolExpr.validate(), it reports that a left recursion is found.

Is this correct?

Thanks.

2012-08-27 21:01:12 - ptmcg

I just replicated your results using a simple 4-function arithmetic expression. It's probably a bug in validate(), since I know that operatorPrecedence expressions do run successfully without recursing infinitely.

Thanks, -- Paul


2012-08-17 14:48:13 - script_lover - Avoid duplicating rules

Hi,

I run into the following situation:

A production called 'expression' is built on top of a more basic production called, 'element.'

For example:

expression = element + OneOrMore(oneOf('* /') + element) | element

My problem is that I want the production 'element' to have different definitions in different contexts.

For example, let's say the syntax '.variable' is not a valid element in a normal context, but is valid within a WITH block. So:

a * b  #is a valid expression
a * .member # is not a valid expression

WITH my_struct
     a * .member   #is a valid expression
END

One way to do this is to copy and paste every single production that depends on 'element', make a new production 'element2' that fits the new context, and build a production 'expression2' that depends only on 'element2'. But that is extremely verbose and error-prone. I wonder if there is a way to reuse similar productions under different contexts.
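
One common idiom (a sketch, with illustrative names) is to build the dependent productions in a function parameterized on the element expression, so each context gets its own copy without cut-and-paste:

from pyparsing import Combine, OneOrMore, Word, alphas, oneOf

def make_expression(element):
    # every production built on 'element' lives here, once
    return element + OneOrMore(oneOf('* /') + element) | element

ident = Word(alphas)
member = Combine('.' + ident)

normal_expr = make_expression(ident)          # a * b
with_expr = make_expression(ident | member)   # a * .member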


2012-08-22 05:56:40 - Leevi3 - nesting depth of operator precedence parse results

Hi,

this question is about the nesting depth of parse results - in particular results obtained from an operator precedence element.

I want to pass a parsed logical expression (usual operators and, or, not) as an argument to a function. I have noticed that if the logical expression contains at least one operator, then the parse result is nested twice, but otherwise only once.

For example:

import pyparsing as PYP

And = PYP.CaselessKeyword('and')
Not = PYP.CaselessKeyword('not')
Or = PYP.CaselessKeyword('or')
Expression = PYP.operatorPrecedence(PYP.Literal('a'), [(Not, 1, PYP.opAssoc.RIGHT),
                                                       (And,  2, PYP.opAssoc.LEFT),
                                                       (Or, 2, PYP.opAssoc.LEFT)])

print Expression.parseString('a or a and a').asList()
print Expression.parseString('a').asList()

returns

>> [['a', 'or', ['a', 'and', 'a']]]
>> ['a']

where I would expect to see

>> [['a']]

If I want to pass the parsed expression to a function, I will have to pop() once in the first case. In the second case I would pass the results as they are.

My question is: Is this difference intended, and if so what is the advantage?

A quick fix in my grammar would of course be

FixedExpression = PYP.Group(PYP.Literal('a')) ^ Expression

Thanks for any info, Leevi.

2012-08-27 20:55:17 - ptmcg

Unfortunately, if you retained the nesting in operator precedence, then every atomic operand would be buried inside a nested list as deep as the number of levels defined in the precedence list. Rather than wade through these nestings, you might be better off using parse actions to construct a hierarchy of evaluatable objects - see how this is done in the SimpleBool.py example. The values in any binary operation are (value, operator, value, operator, etc.), where value itself can be an atomic operand, or a nested object. Write back if you want to discuss this in more detail.

-- Paul
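
For reference, a pared-down sketch of that SimpleBool approach (class names are illustrative; bool() on Python 2 calls __nonzero__):

from pyparsing import CaselessKeyword, opAssoc, operatorPrecedence

TRUE = CaselessKeyword('true').setParseAction(lambda t: True)
FALSE = CaselessKeyword('false').setParseAction(lambda t: False)

class BoolNot(object):
    def __init__(self, t):
        self.arg = t[0][1]
    def __nonzero__(self):
        return not bool(self.arg)

class BoolAnd(object):
    def __init__(self, t):
        self.args = t[0][0::2]   # operands are every other token
    def __nonzero__(self):
        return all(bool(a) for a in self.args)

class BoolOr(BoolAnd):
    def __nonzero__(self):
        return any(bool(a) for a in self.args)

boolExpr = operatorPrecedence(TRUE | FALSE, [
    (CaselessKeyword('not'), 1, opAssoc.RIGHT, BoolNot),
    (CaselessKeyword('and'), 2, opAssoc.LEFT,  BoolAnd),
    (CaselessKeyword('or'),  2, opAssoc.LEFT,  BoolOr),
])

print bool(boolExpr.parseString('true and not false')[0])   # -> True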

2012-08-28 16:06:32 - Leevi3

Hi Paul,

thanks for the advice. I spent most of today rewriting my program. I adapted it to your example for Boolean formulas. It works fine and the code is much clearer.

My task here was to parse a string into a function. Just like your Boolean example, but without having fixed values for the atoms. Rather I can call the parsed object with a value assignment. Also, I am not evaluating Boolean expression but my own predicates that themselves consist of a mini-language.

I devised a three-step procedure: (1) parse the input string, (2) initialize the parsed object with a Model that determines how my custom predicates are to be interpreted, and (3) return a callable function. Now I can test whether a (parameter) assignment is compatible with a model.

Thanks again for your advice. Attaching objects as parse actions in particular turned out clean and readable.

Till then, Hannes

Btw: My hope is that, when I am finished with this project I might appear on your Pyparsing Examples page.


2012-08-25 05:11:18 - simbera.jan - Getting a ParseResults line number

Hi, I've been working on a parser for nested expressions that are then translated according to external specifications. I use the pyparsing-based parser to produce a document tree and then construct a hierarchy of various objects from it. The parser has been tremendously helpful so far, but now I have the following problem: I want each ParseResults object in the tree to carry its line number, so that when an error during the construction of the hierarchy occurs, I can easily report it.

I tried using parse actions to insert the line number into the passed token list (such as tokens.lineNumber = pyparsing.lineno(loc, string) and then returning tokens; I tried dictionary access as well). It might work when I print the line number using a following parse action, but when I try it on the final results from parseString, there is nothing stored (when retrieving tokens.lineNumber, I get an empty string).

Could anybody show me a way to accomplish this (a clean way, preferably)? Thanks, Jan

2012-08-25 08:30:49 - ptmcg

Not sure why using dictionary-style access to add the location to the tokens didn't work for you. Here is an example, looking through the classic 'lorem ipsum' text for words starting with a vowel.

from pyparsing import *

text = '''Lorem ipsum dolor sit amet, consectetur adipisicing 
elit, sed do eiusmod tempor incididunt ut labore et dolore magna 
aliqua. Ut enim ad minim veniam, quis nostrud exercitation 
ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis 
aute irure dolor in reprehenderit in voluptate velit esse cillum 
dolore eu fugiat nulla pariatur. Excepteur sint occaecat 
cupidatat non proident, sunt in culpa qui officia deserunt 
mollit anim id est laborum'''

# find all words beginning with a vowel
vowels = 'aeiouAEIOU'
initialVowelWord = Word(vowels,alphas)

# Unfortunately, searchString will advance character by character through
# the input text, so it will detect that the initial 'Lorem' is not an
# initialVowelWord, but then it will test 'orem' and think that it is. So
# we need to add a do-nothing term that will match the words that start with
# consonants, but we will just throw them away when we match them. The key is
# that, in having been matched, the parser will skip over them entirely when
# looking for initialVowelWords.
consonants = ''.join(c for c in alphas if c not in vowels)
initialConsWord = Word(consonants, alphas).suppress()

# add parse action to store the current location in the parsed tokens
# (you said you tried this, not sure why it didn't work for you)
def addLocnToTokens(s,l,t):
    t['locn'] = l
    t['word'] = t[0]
initialVowelWord.setParseAction(addLocnToTokens)

for ivowelInfo in (initialConsWord | initialVowelWord).searchString(text):
    if not ivowelInfo:
        continue
    print ivowelInfo.locn, ':', ivowelInfo.word

The parse action addLocnToTokens embellishes the parsed tokens with new results names 'locn' and 'word'.

Alternatively, you can define your own placeholder using an Empty, and add any kind of behavior to it you want, such as in this case, saving the current parse location:

# alternative - add an Empty that will save the current location
def location(name):
    return Empty().setParseAction(lambda s,l,t: t.__setitem__(name,l))
locateInitialVowels = location('locn') + initialVowelWord('word')

# search through the input text
for ivowelInfo in (initialConsWord | locateInitialVowels).searchString(text):
    if not ivowelInfo:
        continue
    print ivowelInfo.locn, ':', ivowelInfo.word

This will give the same results as the previous example.

Good luck!

2012-08-25 11:10:20 - simbera.jan

Thanks a lot! Problem solved. It was actually quite a small thing that confused me - I expected the ParseResults objects to behave symmetrically, i.e. when I set something as an attribute or a dictionary value, it can be retrieved the same way. So, thanks again. Pyparsing is a great library - it saved me a lot of time and is as intuitive and Pythonic as it can be. Yours, Jan

2012-08-26 19:47:30 - ptmcg

Great, thanks for the props! Good luck with your parser!

-- Paul


2012-09-10 02:50:10 - acjackson5 - Help with datetime conversion

Hi,

I have some problems with datetime: I want to parse a string into an actual datetime object, but it doesn't work the way I want it to.

import datetime
from pyparsing import ParseException

def generateDateString(tokens):
    try:
        # example: Thu Jun 14
        tokens[0] = datetime.datetime.strptime(tokens[0], '%a %b %d')
    except ValueError, ve:
        raise ParseException('Invalid date string (%s)' % tokens[0])
date.setParseAction(generateDateString)

If I print the results as XML, the value is displayed as a datetime object, but if I try to work with that datetime object, I get the error message: AttributeError: 'str' object has no attribute 'strftime'

2012-09-13 06:15:35 - ptmcg

Change this line:

tokens[0] = datetime.datetime.strptime(tokens[0], '%a %b %d')

to:

return datetime.datetime.strptime(tokens[0], '%a %b %d')

You didn't post the other parts of your parser, so I don't know how you are using results names to access the datetime object.


2012-09-10 05:09:15 - darkest_star - Parse a logfile and detect repetitve textblocks

Hi,

I'm very new to pyparsing and have not found an example which solves my following issue: I have created a logging mechanism, which runs as a cron job every 60 seconds and logs some important Linux system values, like 'date', 'meminfo' and 'loadavg'. Please see the sample string below in my code snippet.

from pyparsing import *

sample = '''
Date:
Thu Sep  6 22:15:01 CEST 2012

who -r:
         run-level 3  Sep  6 21:59                   last=S

/proc/meminfo:
MemTotal:       12191888 kB
MemFree:        11558472 kB
Buffers:           13068 kB
Cached:           218592 kB
SwapCached:            0 kB
Active:           114388 kB
Inactive:         174244 kB
Active(anon):      74500 kB
Inactive(anon):    12764 kB
Active(file):      39888 kB
Inactive(file):   161480 kB
Unevictable:       14148 kB
Mlocked:           14148 kB
SwapTotal:      16777208 kB
SwapFree:       16777208 kB
Dirty:               640 kB
Writeback:             0 kB
AnonPages:         71064 kB
Mapped:            37944 kB
Shmem:             21020 kB
Slab:             249580 kB
SReclaimable:      14708 kB
SUnreclaim:       234872 kB
KernelStack:        2720 kB
PageTables:         7880 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    22873152 kB
Committed_AS:     381536 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      316004 kB
VmallocChunk:   34359408868 kB
HardwareCorrupted:     0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:        7680 kB
DirectMap2M:    12566528 kB

/proc/loadavg:
0.71 0.52 0.34 2/339 3794

Date:
Thu Sep  6 22:16:01 CEST 2012

who -r:
         run-level 3  Sep  6 21:59                   last=S

/proc/meminfo:
MemTotal:       12191888 kB
MemFree:        11502920 kB
Buffers:           19296 kB
Cached:           257780 kB
SwapCached:            0 kB
Active:           151340 kB
Inactive:         183008 kB
Active(anon):      74792 kB
Inactive(anon):    12764 kB
Active(file):      76548 kB
Inactive(file):   170244 kB
Unevictable:       14148 kB
Mlocked:           14148 kB
SwapTotal:      16777208 kB
SwapFree:       16777208 kB
Dirty:               160 kB
Writeback:             0 kB
AnonPages:         71328 kB
Mapped:            38248 kB
Shmem:             21020 kB
Slab:             259952 kB
SReclaimable:      24656 kB
SUnreclaim:       235296 kB
KernelStack:        2720 kB
PageTables:         7784 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    22873152 kB
Committed_AS:     379928 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      316004 kB
VmallocChunk:   34359408868 kB
HardwareCorrupted:     0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:        7680 kB
DirectMap2M:    12566528 kB

/proc/loadavg:
0.68 0.54 0.35 1/336 8232

Date:
Thu Sep  6 22:17:01 CEST 2012

who -r:
         run-level 3  Sep  6 21:59                   last=S

/proc/meminfo:
MemTotal:       12191888 kB
MemFree:        11499912 kB
Buffers:           19500 kB
Cached:           259832 kB
SwapCached:            0 kB
Active:           152120 kB
Inactive:         185172 kB
Active(anon):      75480 kB
Inactive(anon):    12764 kB
Active(file):      76640 kB
Inactive(file):   172408 kB
Unevictable:       14148 kB
Mlocked:           14148 kB
SwapTotal:      16777208 kB
SwapFree:       16777208 kB
Dirty:               208 kB
Writeback:             0 kB
AnonPages:         71680 kB
Mapped:            39428 kB
Shmem:             21020 kB
Slab:             259924 kB
SReclaimable:      24700 kB
SUnreclaim:       235224 kB
KernelStack:        2728 kB
PageTables:         7564 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    22873152 kB
Committed_AS:     381292 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      316004 kB
VmallocChunk:   34359408868 kB
HardwareCorrupted:     0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:        7680 kB
DirectMap2M:    12566528 kB

/proc/loadavg:
0.52 0.52 0.36 1/337 12341
'''

# macros
integer = Word(nums)

#date
weekdays = 'Mon Thu Wed Tue Fri Sat Sun'
months = 'Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec'
date = Group(Literal ('Date:').suppress() + \
             oneOf(weekdays).suppress() + \
             oneOf(months)('month') + \
             integer('day') + \
             Combine(integer + ':' + integer + ':' + integer)('time') + \
             Word(alphas).suppress()('timezone') + \
             integer('year') \
             )('date')
#who
who = Group(Literal('who -r:').suppress() + \
            Word(alphas+'-').suppress() + \
            integer('runlevel') + \
            oneOf(months)('month') + \
            integer('day') + \
            Combine(integer + ':' + integer)('time') + \
            Suppress('last=') + \
            oneOf('0 1 2 3 4 5 6 S')('prerunlevel') \
            )('who')

#meminfo
meminfo = Group(Literal('/proc/meminfo:').suppress() + \
                Dict(OneOrMore(Group(Word(alphanums+'('+')'+'_') + Suppress(':') + Combine(integer + Optional(' kB'))))) \
                )('meminfo')

#loadavg
loadavg = Group(Literal('/proc/loadavg:').suppress() + \
                OneOrMore(Word(nums+'.'+'/')) \
                )('loadavg')

record = Forward()
#record << date
#record << who + meminfo + loadavg

record << date + loadavg
# or: record << date + meminfo
# or: record << date + loadavg + meminfo

# parse input string
records=record.searchString( sample )
print 'Number of records:',
print len(records)
print records

Each logging block starts with a special 'Keyword'. E.g. the date section starts with 'Date:' and the meminfo section starts with '/proc/meminfo:'.

Now my goal is to parse this logfile and create an array of records, each of which includes (one or more) interesting system values for a certain Date. Each of the 4 sections (date, who, meminfo, loadavg) parses perfectly alone, but I have problems when I want to combine them. That is: if I combine date + loadavg (to see the loadavg values for each timestamp), it doesn't find any results anymore.

Has somebody already solved a similar problem in the past and could point me to a code example? Btw, the logfile structure is fixed and should be treated as given, because I have already been running it on many systems for a long time.

Thank you very much in advance!

Regards, Uwe

2012-09-10 05:13:49 - darkest_star

I forgot to mention: my sample logfile ('sample') contains 3 of these log sets, collected over 3 minutes, and is only a short example.

2012-09-11 01:35:04 - darkest_star

Probably I have found a solution, or at least a good workaround... ;-) If I define the pyparsing grammar as follows, I get the desired output:

# record all values
record << date + SkipTo(who).suppress() + who + SkipTo(meminfo).suppress() + meminfo + SkipTo(loadavg).suppress() + loadavg

or

# record only date+meminfo
record << date + SkipTo(meminfo).suppress() + meminfo

Uwe


2012-09-12 15:11:17 - sravet - need help with verilog parser

I have a small testcase that fails to parse:

`timescale 100 ps/100 ps
module mb1_uA (); 
wire abc  ;
wire a;
xyz_top u_xyz_top (
  .a(a),
  .\abc (abc )
);

Note that there is a portname that is an escaped identifier. I have real RTL like this that was produced by a commercial EDA tool.

Parsing produces this:

Exception raised:Expected ')' (at char 91), (line:7, col:8)

I looked into the Verilog BNF and from what I can tell it looks OK. The named port convention looks for an identifier as the port name:

port = portExpr | Group( ( '.' + identifier + '(' + portExpr + ')' ) )

and identifier includes, in part, escaped identifiers:

identifier2 = Regex(r'\\\S+').setParseAction(lambda t:t[0][1:]).setName('escapedIdent')

That's about the extent of what I can tell; I'm new to Python. Any suggestions?

thanks, --steve

2012-09-14 06:38:05 - ptmcg

The escaped identifier is not the problem. The problem is this definition in the parser:

inst_args = Group( '(' + (delimitedList( modulePortConnection ) |
                    delimitedList( namedPortConnection )) + ')').setName('inst_args')

should be changed to:

inst_args = Group( '(' + (delimitedList( namedPortConnection ) |
                    delimitedList( modulePortConnection )) + ')').setName('inst_args')

Also, your input file is missing a terminating endmodule.

Thanks for reporting this, I'll update the next released version!

-- Paul

2012-09-20 15:09:49 - sravet

Thanks Paul. In my real example there are many ports that are connected and the error pointed to the one with the escaped identifier. And, when I modified the port name to get rid of the escape it worked correctly, hence my assumption that it was the problem. Anyway, I put in your fix and it's working fine now. Many thanks, I hope to get a lot of good use from this parser.


2012-09-30 23:19:48 - cqqhzxgh - match multiple lines

In the definition of my pyparsing grammar, there are some rules which will match strings that span multiple lines. If I use the API like:

PyGrammar.parseString(open('file_name').read())

It will behave in the correct way.

However if I want to use the iterator to read the file like

with open('file_name') as f:
   for line in f:
      PyGrammar.parseString(line)

the parser will break.

Is there a way to work around this case? Thanks...

2012-10-01 01:56:08 - ptmcg

No, pyparsing must have the full source string read into a local string variable for it to parse.

If your top-level grammar looks something like OneOrMore(expr), and a single ParseResults containing all the expr results is slow to create or too large in memory, you could switch from parseString to scanString with the repeated expression. That is, convert OneOrMore(expr).parseString(inputstring) to expr.scanString(inputstring). scanString returns a generator that gives the matched tokens, start, and end location of each match. Perhaps this will help address what I assume are memory issues.

-- Paul

2012-10-01 04:34:11 - cqqhzxgh

Hi Paul, it is really nice of you to provide such a prompt reply. Based on your reply, can I say the second way, where I pass each line to the pyparsing grammar, is wrong?

Currently I use pyparsing to parse some log files, which are stuffed with irrelevant info that I just want to ignore, so I create my grammar structure to look like this:

ZeroOrMore(SkipTo((expr1 | expr2 | expr3 |....).setParseAction(my_call_back_function),include=True))

In my_call_back_function I will generate corresponding objects, store them in a db, and delete the ParseResults like

del tokens

So I will not be using the big ParseResults returned from parseString. I guess scanString will perform a similar function in this case.

My concern with using open('file_name').read() is that Python will load the entire file into memory, which exceeds 200 MB. That consumes too much memory in this case, especially if I intend to run multiple pyparsing parsers together. Can you enlighten me on this? BTW, I am not sure if I have made myself clear, or if I have structured the grammar correctly. Sorry for my poor English.

2012-10-01 06:00:16 - ptmcg

I think you are in a good place to just switch over to using scanString. Define your grammar as:

grammar = (expr1 | expr2 | expr3).setParseAction(my_call_back_function)

for tokens, start, end in grammar.scanString(sourceText):
    # do something with tokens
    # no need to even call del tokens here, because the
    # name will be rebound on the next iteration

Do you actually use the text that gets skipped over? Let me know, and I'll show you how to get at it when using scanString instead of parseString.

-- Paul
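
One way to get at that skipped text, sketched with the start/end offsets scanString yields (continuing the names from the snippet above):

lastEnd = 0
for tokens, start, end in grammar.scanString(sourceText):
    skipped = sourceText[lastEnd:start]   # the unmatched text before this match
    # ... process tokens and skipped here ...
    lastEnd = end
trailing = sourceText[lastEnd:]           # unmatched text after the last match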

2012-10-01 18:46:37 - cqqhzxgh

Yes, I tried scanString; it works awesome. And I figured out how to print the skipped-over text as well. BTW, I have another question lingering: you mentioned that the parser needs to see all the source text before parsing, but I want to use pyparsing in a real-time case, where I cannot get all the text at one time.

2012-10-01 18:47:17 - cqqhzxgh

Or, do you prefer to have another post for the above question?

2012-10-01 20:36:12 - ptmcg

I've thought about rewriting pyparsing to accept an input stream, but it will be a pretty radical change. So, unfortunately, for the foreseeable future, you'll have to pass to pyparsing a complete string of data to parse.

What you could do is wrap your own code around a call to scanString, something like this:

# set up a generator to yield a line of text at a time
linegenerator = open('big_hairy_file.txt')

# buffer will accumulate lines until a fully parseable piece is found
buffer = ''

for line in linegenerator:
    buffer += line

    match = next(grammar.scanString(buffer), None)
    while match:
        tokens, start, end = match
        print tokens.asList()

        buffer = buffer[end:]
        match = next(grammar.scanString(buffer), None)

Write back and let me know how that works out.

-- Paul

2012-10-01 23:32:11 - cqqhzxgh

looks like a solution. I will test it out and let you know. But now I have another question, and I opened a new post for that.


2012-10-01 23:38:09 - cqqhzxgh - question with scanString

I have some code looks like

parser = (expr1 + (expr2 | expr3 | ...))
data =  open('file_name').read()
for tokens, start, end in parser.scanString(data):
       print 'token: %s\n' % tokens.dump()
       print 'start: %d \n'%start
       print 'end: %d \n'% end
       print 'match: %s\n'% data[start:end]

The function returns the correct tokens matched, but the start and end positions are wrong. They were shifted forward by some amount. Is there a bug here?

2012-10-02 00:03:26 - cqqhzxgh

sorry, forgot to mention that there is a lot of unmatched text before the first match is found. If I delete some unmatched text from the file, the parser works fine. But with a large amount of unmatched text present, the start and end do not align well.

2012-10-02 00:40:26 - cqqhzxgh

Because I have used restOfLine in the expr, it may be causing this problem. I will dig some more.

2012-10-02 01:33:59 - cqqhzxgh

problem found!! the log files contain both spaces and tabs, which causes the location error. But help is still needed.

2012-10-02 01:50:49 - cqqhzxgh

oops, I figured it out by looking at the source code, which comes with good documentation. Calling parseWithTabs did the trick. Thanks Paul.
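A minimal sketch of that fix (the expression and text here are illustrative, not from the original log files): calling parseWithTabs() tells pyparsing to leave tabs in place instead of expanding them, so the start/end locations reported by scanString line up with the raw input.

from pyparsing import Word, alphas, nums

entry = Word(alphas)('key') + Word(nums)('value')
entry.parseWithTabs()  # keep tabs as-is so match locations align with the raw text
data = 'a\t1\tb\t2'
for tokens, start, end in entry.scanString(data):
    print tokens.asList(), start, end, repr(data[start:end])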

2012-10-02 03:55:11 - ptmcg

Terrific! I'm glad my docs were helpful here. If you download the full source .zip from Sourceforge, you'll find a full htmldoc of all classes and helper methods in pyparsing, and a HowToUsePyparsing.html. The htmldoc is also available online at

Good luck in your pyparsings, -- Paul


2012-10-05 14:19:50 - dlwatey - Beginner Question


2012-10-07 21:20:38 - RunSilent - parseString works, scanString fails?

Hi Paul, I'm using Reinteract to debug my parser and I've got a situation where parseString returns a result but scanString doesn't. I am parsing a simple string with a (document) revision and title. Here is (hopefully) the BNF:

#<revision> ::= <whitespace>+
# ['Rev'['.'] | 'REV'['.']]
# <alpha_rev> | <alpha_rev><alpha_rev>
# [<num_rev> | <num_rev><num_rev>]
# |
# <space>+
# <alpha_rev> | <alpha_rev><alpha_rev>
# [<num_rev> | <num_rev><num_rev>] |
#
#<whitespace> ::= ' |\t|\r|\n'
#<space> ::= ' '
#<alpha_rev> ::= 'a|...|z|A|...|Z'
#<num_rev> ::= '1|...|9'

Here is a cut and paste from Reinteract:

alpha_revs = pyp.Word(pyp.alphas, min=1, max=2)('alpha_revs')
num_revs = pyp.Word('123456789', min=1, max=2)('num_revs')
space = pyp.White(ws=' ').setName('spaces')('spaces')
space = space.suppress()

revisionExpr = (
    pyp.StringStart().leaveWhitespace() + 
    pyp.White().suppress() +
    pyp.oneOf('rev rev.', caseless=True).suppress() + 
    pyp.Combine(alpha_revs + 
    pyp.Optional(num_revs)('rev'))
    |
    pyp.StringStart().leaveWhitespace() + 
    space + 
    pyp.Combine(alpha_revs + 
    pyp.Optional(num_revs)('rev'))
    )

revisionTokens = revisionExpr.parseWithTabs().parseString(rev_string)

for match_str, start, end in (
    revisionExpr.parseWithTabs().scanString(rev_string, maxMatches=1)):
    print match_str

print match_str

I'll show two examples with the input string and the results

rev_string = '    Rev. K        This is the title'

parseString: ['K']
scanString: NameError: name 'match_str' is not defined

rev_string = ' A This is the title'
parseString: ['A']
scanString: NameError: name 'match_str' is not defined

The input strings are not well formatted; people have been very creative, and not every document has a revision listed. The only reliable way I can see to avoid grabbing an incorrect rev string is to require that it be at the beginning of the input (where it should be). Without that, I had situations where it was finding part of the title instead. It is a safe assumption that the revision is preceded by a space if it is just an alphanumeric string.

I don't understand why scanString isn't giving me the same result?

Thanks, Eric

2012-10-09 18:00:58 - ptmcg

Here is some annotated Python code, walking through your expression creation and parsing steps:

# parsing a revision and a description
from pyparsing import *

numeric_rev = Word(nums)
alpha_rev = Word(alphas, max=2)

rev_label = CaselessLiteral('Rev') + Optional('.')

revision_expr = rev_label + (numeric_rev | alpha_rev) + restOfLine



rev_string = '    Rev. K        This is the title'

print revision_expr.parseString(rev_string)

# prints
# ['Rev', '.', 'K', '        This is the title']


# I don't like the leading whitespace, but restOfLine doesn't skip
# whitespace. An Empty *does* skip whitespace, and returns nothing.

revision_expr = rev_label + (numeric_rev | alpha_rev) + Empty() + restOfLine

print revision_expr.parseString(rev_string)

# Now we get
# ['Rev', '.', 'K', 'This is the title']
# If we wanted, we could just look at results[2] and results[3]
# to get the interesting fields, but this is error prone - what if
# the optional '.' is left out? We *could* use results[-2] and 
# results[-1], using negative indexes to read from the right 
# instead. But I recommend you get used to naming the results, like
# this:

revision_expr = (rev_label + (numeric_rev | alpha_rev)('revision') + 
                    Empty() + restOfLine('description'))

results = revision_expr.parseString(rev_string)
print results

# Hmm, we still get:
# ['Rev', '.', 'K', 'This is the title']
# what's the difference?
print results.dump()

# By calling dump(), we see not only the list of tokens, but
# any named fields:
#  ['Rev', '.', 'K', 'This is the title']
#  - description: This is the title
#  - revision: K

# How do you get just the named fields? Access them as if they
# were values in a dict, or attributes on an object (for attribute
# access to work, the results name has to be a valid Python
# identifier)
print results['revision']
print results.description

-- Paul

2012-10-09 23:28:43 - RunSilent

Hi Paul, thanks very much for the detailed answer. I need to work on this a bit more (it is a nighttime project) since I think I'm stuck with considering whitespace. Some of the strings don't have revisions in them, but they do have two-letter strings in the title ('WI') which match in your example.

I do have one quick(?) question, this sequence works:

alpha_rev = pyp.Word(pyp.alphas, max=2)
num_rev =  pyp.Word('123456789', max=2)
space = pyp.White(ws=' ').suppress()

revisionExpr = (
    pyp.StringStart().leaveWhitespace() +
    space +
    pyp.Combine(alpha_rev + 
    pyp.Optional(num_rev)('rev'))
    )

rev_string = ' K        This is the title'

for match_str, start, end in (
    revisionExpr.scanString(rev_string, maxMatches=1)):
    print match_str

['K']

Adding a second parse expression causes the scanString to fail:

revisionExpr = (
    pyp.StringStart().leaveWhitespace() +
    space +
    pyp.Combine(alpha_rev + 
    pyp.Optional(num_rev)('rev'))
    |
    pyp.CaselessLiteral('Rev') + pyp.Optional('.') + 
    pyp.Combine(alpha_rev + 
    pyp.Optional(num_rev)('rev'))
    )

I thought the '|' would still allow the first expression to match even if the second one fails?

Changing the rev_string:

rev_string = '    Rev. K        This is the title'

Matches with the last revisionExpr even though the first expression fails on its own? scanString hates StringStart?

If this is getting too convoluted I can probably just work with parseString since it keeps working.

Thanks again, Eric


2012-10-08 08:22:01 - DiaaFayed - How can we transform pyparsing string to pyparsing expression


2012-10-15 15:14:28 - jsy1972 - question re: indentation and grouped stmts

IGNORE = pp.Group(pp.ZeroOrMore(
    pp.Or(map(pp.Literal, ['TIDAL Scheduler','Dependency Cross Reference', 'Job Name', 'Printed']))
    + pp.SkipTo(pp.LineEnd())
    )).suppress()

END = pp.Literal('** End of Report **') + pp.SkipTo(pp.LineEnd())

LBRACKET = pp.Keyword('[ ', identChars='[')

RBRACKET = pp.Group(pp.Optional(pp.LineStart()).suppress() + pp.Keyword(']', identChars=']') + pp.Optional(pp.LineEnd()).suppress())

CRAP = pp.Group('[t' + pp.Optional(pp.oneOf('- +') + pp.Optional(pp.LineEnd()).suppress() + pp.oneOf('1 2 3 4 5 6 7 8 9')) + ']')

NAMEPART =  IGNORE + pp.OneOrMore(pp.Word(pp.printables, excludeChars='[]')) + IGNORE

NAME = pp.Combine(NAMEPART + pp.Optional(CRAP + NAMEPART), adjacent=False, joinString=' ')

JOB = NAME + pp.FollowedBy(LBRACKET)

GROUP = pp.Group(LBRACKET + pp.Optional(pp.LineEnd()).suppress() + NAME + pp.Optional(pp.LineEnd()).suppress() + RBRACKET)

JOBNAME = pp.Combine(JOB + GROUP, adjacent=False, joinString=' ')

INDENT = (pp.Literal(' ') * 36).leaveWhitespace().suppress() + pp.FollowedBy(JOBNAME)

JOBDEP = INDENT + JOBNAME

JOBTREE = JOBNAME.setResultsName('job') + pp.Group(pp.ZeroOrMore(JOBDEP)).setResultsName('deps')

here's some sample data

In [47]: print test2
          BAR [
          BAZ:\BLAH\BLAH ]
     Printed 10/12/2012 11:36 AM                                  Page 386
          Job Name                  Direct Dependents         Indirect Dependents
                                    FOOBAR [
                                    BLAH:BLAH\BLAH\BLAH_
                                    ]

I can parse out the job dep correctly when using that expression individually:

list(JOBDEP.scanString(test2))

[((['FOOBAR [ ...'], {}), ...)]

also, just using the JOBNAME expression gives me the 'correct' results, but the JOBTREE expression fails and instead gives me two 'job's instead of a job and a jobdep...

can someone help me fix this?

2012-10-15 15:18:06 - jsy1972

I should note that the 'JOBTREE' expression works everywhere else in my dataset except for the case where the page number and column headings follow right after a job definition line...

Tried to use indentedBlock, but had difficulty interpreting the function definition, also some jobs don't have dependencies.


2012-10-17 13:30:37 - DiaaFayed - plz give explain and examples ...

In the book 'Getting Started with Pyparsing', please give explanations and examples of the following

page 17

'Parse actions can also be used to perform additional validation checks, such as testing whether a matched word exists in a list of valid words, and raising a ParseException if not. Parse actions can also return a constructed list or application object, essentially compiling the input text into a series of executable or callable user objects. Parse actions can be a powerful tool when designing a parser with pyparsing.'

2012-10-17 13:31:37 - DiaaFayed

'Parse actions can also return a constructed list or application object, essentially compiling the input text into a series of executable or callable user objects. Parse actions can be a powerful tool when designing a parser with pyparsing.'

2012-10-18 01:29:48 - ptmcg

Here is a simple parser to match a date string of the form 'YYYY/MM/DD', and return it as a datetime, or raise an exception if not a valid date.

from datetime import datetime
from pyparsing import *

# define an integer string, and a parse action to convert it
# to an integer at parse time
integer = Word(nums)
def convertToInt(tokens):
    return int(tokens[0])
integer.setParseAction(convertToInt)
# or can be written as one line as
#integer = Word(nums).setParseAction(lambda t: int(t[0]))

# define a pattern for a year/month/day date
date = integer('year') + '/' + integer('month') + '/' + integer('day')

def convertToDatetime(s,loc,tokens):
    try:
        return datetime(tokens.year, tokens.month, tokens.day)
    except Exception as ve:
        errmsg = "'%d/%d/%d' is not a valid date, %s" % \
            (tokens.year, tokens.month, tokens.day, ve)
        raise ParseException(s, loc, errmsg)
date.setParseAction(convertToDatetime)


def test(s):
    try:
        print date.parseString(s)
    except ParseException as pe:
        print pe

test('2000/1/1')
test('2000/13/1') # invalid month
test('1900/2/29') # 1900 was not a leap year
test('2000/2/29') # but 2000 was

prints

[datetime.datetime(2000, 1, 1, 0, 0)]
'2000/13/1' is not a valid date, month must be in 1..12 (at char 0), (line:1, col:1)
'1900/2/29' is not a valid date, day is out of range for month (at char 0), (line:1, col:1)
[datetime.datetime(2000, 2, 29, 0, 0)]

2012-10-18 05:06:59 - DiaaFayed

thanks Paul, I hoped each sentence would be answered separately

anyway

the most important one is

'compiling the input text into a series of executable or callable user objects.'

2012-10-18 05:51:32 - ptmcg

This is done in the online example SimpleBool.py - the classes are used in the expressions for operatorPrecedence, but that is the same as calling setParseAction on each intermediate expression that operatorPrecedence creates for each level of operations.
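As a minimal sketch of that idea (not taken from SimpleBool.py itself; the Num and AddOp classes here are illustrative): when a parse action is a class constructor, the matched tokens are replaced by an instance of that class, so parsing 'compiles' the input into objects that can be evaluated later.

from pyparsing import Word, nums, oneOf

class Num(object):
    def __init__(self, tokens):
        self.value = int(tokens[0])
    def eval(self):
        return self.value

class AddOp(object):
    def __init__(self, tokens):
        self.lhs, self.op, self.rhs = tokens
    def eval(self):
        if self.op == '+':
            return self.lhs.eval() + self.rhs.eval()
        return self.lhs.eval() - self.rhs.eval()

num = Word(nums).setParseAction(Num)
expr = (num + oneOf('+ -') + num).setParseAction(AddOp)

compiled = expr.parseString('2 + 3')[0]  # an AddOp instance, not a string
print compiled.eval()                    # -> 5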

2012-10-18 10:47:12 - DiaaFayed

thanks very much

The ideas in these discussions are very important. It would be better to augment the documentation with these ideas, or collect them into a tutorial.


2012-10-18 13:50:33 - chlim - Parsing single and multiple records

Hi, I wonder if someone can help me. I have scenarios where I need to parse and grab the description field (the word 'description' + ':', plus one or more returns, and then multiple lines of description that can contain anything).

First scenario is just to be able to parse a single record like this input string:

product: soap
description:
foo
bar

Second scenario (input has multiple records and these are the last two records):

product: soap
description:
blah blah
foo foo
bar bar

product: towel
description:
blah blah
abc !@#%&foo 1234
abc !@#%&foo 1234
abc !@#%&foo 1234

<end-of-file>

I'm trying to do something like this to get it into a dictionary:

for prodDict in prodParser.prodDef.searchString(prodFile):
    prodResults.append(prodDict)

The problem I'm having is that I can't get the description (a variable number of lines). How is that done? If one record has another subsequent record, I think the approach is to make it SkipTo the next line with 'product:'. But if it was the last record, it would not have another line beginning with 'product:'. It's the same when I'm given a single record. I cannot just say Regex('.*') for any number of lines after matching 'description:'
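One possible shape for this (a sketch, not from the original thread; the field names come from the sample records above): SkipTo accepts an alternation, so the description can end either at the next 'product:' line or at the end of the input.

from pyparsing import Literal, SkipTo, StringEnd, OneOrMore, Group, restOfLine

product = Literal('product:').suppress() + restOfLine('name')
description = (Literal('description:').suppress() +
               SkipTo(Literal('product:') | StringEnd())('description'))
record = Group(product + description)

data = '''product: soap
description:
blah blah
foo foo

product: towel
description:
abc !@#%&foo 1234
'''
for rec in OneOrMore(record).parseString(data):
    print rec.name.strip(), '->', repr(rec.description.strip())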


2012-10-22 07:30:51 - DiaaFayed - can we simulate the caret and dollar sign functions .?

can we simulate the caret and dollar sign functions of regular expressions in a pyparsing expression?


2012-10-22 12:25:04 - DiaaFayed - simulate caret and dollar sign in regular expression

can we simulate the caret and dollar signs of regular expressions in a pyparsing expression, without using Regex?

2012-10-24 11:58:23 - DiaaFayed

Thanks, I have found

StringStart() and StringEnd()
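A small illustration (a sketch, not from the original post): StringStart and StringEnd anchor a match to the ends of the entire input, much as ^ and $ do for a whole string; LineStart and LineEnd are the per-line analogues.

from pyparsing import StringStart, StringEnd, Word, alphas, ParseException

whole = StringStart() + Word(alphas) + StringEnd()
print whole.parseString('hello')      # -> ['hello']
try:
    whole.parseString('hello world')  # the StringEnd ('$' analogue) rejects trailing text
except ParseException:
    print 'no match'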


2012-10-22 12:34:59 - tvn1981 - Getting line number where error occurs

Hi, I am looking at this code which is close to what I want to do.

If I have something unexpected in the test string, is there a way to output a message saying something like 'Unrecognized syntax S at line X'?
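One possibility (a sketch; since the code referenced in the post is not shown here, the grammar below is illustrative): a ParseException carries line, lineno, and col attributes describing the failure point.

from pyparsing import Word, nums, ParseException

integer = Word(nums)
try:
    integer.parseString('abc')
except ParseException as pe:
    # pe.line is the text of the failing line, pe.lineno its line number
    print "Unrecognized syntax '%s' at line %d" % (pe.line, pe.lineno)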


2012-10-25 00:26:43 - kmbt - Match only at selected lines

Hi

I have a text file consisting of multiple lines. I would like to write an expression which would work like so:

Given a set of line numbers, match only at those lines.

Can you help?

Cheers, kmbt

2012-10-25 01:45:43 - ptmcg

This is easily done by using a parse action, and pyparsing's lineno (line number) method (the sample text below includes the line number as part of its content, but this is just to illustrate the output in the matched tokens, it is not used in the filtering condition):

from pyparsing import Word, nums, alphas, lineno, ParseException


text = '''\
1 some text
2 some more
3 blah blah
4 lorem ipsem
5 the end'''


line_pattern = Word(nums) + Word(alphas) + Word(alphas)

print '\nshow all matches'
for line_data in line_pattern.searchString(text):
    print line_data


desired_lines = set([2,3,4])
def only_match_on_desired_lines(s,locn,tokens):
    if lineno(locn,s) not in desired_lines:
        raise ParseException(s,locn,'not one of the desired lines')
line_pattern.setParseAction(only_match_on_desired_lines)

print '\nmatch only desired_lines'
for line_data in line_pattern.searchString(text):
    print line_data

prints

show all matches
['1', 'some', 'text']
['2', 'some', 'more']
['3', 'blah', 'blah']
['4', 'lorem', 'ipsem']
['5', 'the', 'end']

match only desired_lines
['2', 'some', 'more']
['3', 'blah', 'blah']
['4', 'lorem', 'ipsem']

The pattern will match all of the lines, but the parse action will apply the additional filter of the particular line numbers that you want.

-- Paul

2013-03-05 00:22:39 - kmbt

Thank you. Your solution was very helpful to me.


2012-10-25 00:50:16 - ranjith19 - How to get a particular behaviour with a function parser?

Here is my code so far:

I am writing a DSL with support for following:

  1. Variables: all of them begin with v_
  2. Unary operators: +, -
  3. Binary operators: +,-,*,/,%
  4. Constant numbers
  5. Functions, like normal functions. They need to have this behaviour: foo(v_1+v_2) = foo(v_1) + foo(v_2). It should be the case for any binary operation

I am able to get to point 4, but I am not able to understand how to make point 5 happen. I need some help.

2012-10-25 02:10:02 - ranjith19

I have asked the same question at as well. It is formatted better there. Hope you guys do not mind

2012-10-25 03:20:55 - ptmcg

See my reply on SO.

-- Paul


2012-10-26 04:57:27 - pypetey - buildout: Couldn't find a setup script

I am trying to install pyparsing under Windows from buildout and I am getting an error all the time:

An error occurred when trying to install pyparsing 1.5.6. Look above this message for any errors that were output by easy_install.
While: Installing django. Getting distribution for 'pyparsing>=1.5.5'.
Error: Couldn't install: pyparsing 1.5.6

Is there a way to fix it?

2012-10-30 02:44:37 - ptmcg

I don't understand why you would get this error, pyparsing most definitely includes a setup script.

You should also be able to install pyparsing using easy_install. Or just download the source package from SourceForge, and pull out the single pyparsing.py file. Pyparsing is packaged as just a single Python source file, so it should be easy to put wherever you want.

-- Paul


2012-10-26 11:46:25 - dlwatey - Handling special Characters

My input stream has several special characters ('\000', etc.). The parser stops and emits an error whenever these characters are encountered. How can I tell the parser to simply ignore them?


2012-10-29 09:04:41 - dlwatey - Getting closer and clearer

Ok with Sandy bearing down on DC I am stuck at home with nothing but time to try and finish the parsing exercise I have been banging my head against for several weeks.

I have an export file created from a Lotus Notes database containing about 12,000 records. Each record has a set of attributes and a blob of text. My goal is to be able to parse the export file so that I can store it in another database, preserving the original attributes and enriching it with others via some natural language processing.

The export file contains two different structures. The 1st is for fielded information, and the pattern is 'field : value', where field starts at the beginning of a line and value may be of arbitrary length and format. (The value sometimes has repeating information, but that is not important right now.) The 2nd structure is the blob of text, which always follows a specific field, '$Revisions : '.

Here is an example of my test harness including sample input:

test = '''
$FILE:
EXTERNALLINKEDUNID: 35F30F8BBDF4F0CE85257AA000745A3B
$Links:
DTU: 10/24/2012 12:00:00 AM
Document_ID: 2012-1728
Document_Type: Tax Alerts
Document_Subtype: Internal Tax Alerts
LPlanning: (n/a)
LProvision: (n/a)
LCompliance: (n/a)
LControversy: (n/a)
Drafter: Brittenham, J.A.
Turnaround: 24 Hours
CopyrightNoticeFirstLine: Copyright © 1996 - 2012, Ernst & Young LLP.
CopyrightNotice: All rights reserved. No part of this document may be reproduced, retransmitted or otherwise redistributed in any form or by any means, electronic or mechanical, including by photocopying, facsimile transmission, recording, rekeying, or using any information storage and retrieval system, without written permission from Ernst & Young LLP.
DocAuthor: CN=Darryl Hudson/OU=ESS/O=EYLLP/C=US
DocComposed: 10/23/2012 05:10:52 PM
InactiveIssue:
Title: Service releases some inflation adjustments for 2013, defers releasing others
Display: No
Search: No
External_Distribution: No
Supertopic_1: Supertopic 2\Personal Finance
DOCID: 2012-1728
$UpdatedBy: CN=Darryl Hudson/OU=ESS/O=EYLLP/C=US,CN=Joan D. Osborne/OU=NOhioEMichigan/OU=TAX/O=EYLLP/C=US
$Revisions: 10/23/2012 05:10:51 PM,10/23/2012 05:10:52 PM,10/23/2012 05:10:55 PM,10/23/2012 05:11:15 PM

The Service has issued Revenue Procedure 2012-41, which contains the inflation adjustments to various exemptions, exclusions and limitation amounts for income, estate and gift tax purposes that take effect in 2013. Revenue Procedure 2012-41 does not include some of the annual inflation adjustments, including those for the tax rate tables, the standard deduction, the personal exemption, and the overall limitation on itemized deductions. The Service stated that it will address those items in future guidance.
Important inflation adjustments under Revenue Procedure 2012-41

Kiddie Tax
For tax years beginning in 2013, the first $1,000 of income of a child subject to the kiddie tax will generally not be subject to tax, and the next $1,000 will be taxable at the child's own bracket. Unearned income in excess of $2,000 will be taxed to the child at the parent's tax rate.
Expatriation to avoid tax
For 2013, an individual with 'average annual net income tax' of more than $155,000 (up from $151,000 in 2012) for the five tax years ending before the date of the loss of US citizenship under Section 877(a)(2)(A) is subject to tax under Section 877A(g)(1).
Tax responsibilities of expatriation
For 2013, the amount that would be includible in the gross income of a covered expatriate by reason of Section 877A(a)(1) is reduced (but not below zero) by $668,000.
Annual gift tax exclusion
The annual gift tax exclusion amount under Section 2503 is $14,000 for 2012 (up from $13,000 in 2012).
The annual exclusion permitted by Section 2523(i)(2) for transfers to a noncitizen spouse has been increased from $139,000 in 2012 to $143,000 in 2013.
Special use valuation
The Section 2032A ceiling on special use valuation for an estate of a decedent dying in 2013 is $1,070,000 (up from $1,040,000 in 2012).
Interest on the portion of estate tax payable in installments
To calculate Section 6601(j) interest for an estate of a decedent dying in 2013, the dollar amount used to determine the '2% portion' of the estate tax payable in installments under Section 6166 is $1,430,000 (up from $1,390,000 in 2012).
Large gifts received from foreign persons

For tax years beginning in 2013, the threshold for reporting gifts from foreign persons under Section 6039F is $15,102 (up from $14,723 in 2012).



Contact Information
For additional information concerning this Alert, please contact:





Personal Financial Services
' Kim McFarlane
(330) 255-5247


This Alert was prepared to present time-sensitive information affecting our clients. Recipients of this publication should promptly review and consider the effect of its contents on the clients they serve.

'''



recordStart = '$FILE:'
recordEnd = Literal('\n')
colon  = Literal(':')


f = Combine(Word(alphas+'$', alphanums+'_') + colon.suppress())('name')
v = restOfLine('value')
r = Dict(OneOrMore(Group(f + Optional(v))))('attrs')


record = ''
try: 
       record = r.parseString(test, parseAll=False)  
except ParseException, err:
       print '>>>>>>>>',err.line
       print '>>>>>>>>',' '*(err.column-1) + '^'
       print '>>>>>>>>',err   


print '========================'
print 'record = ',record
print '     Document_ID = ',record.attrs.Document_ID
print '-----------------------'
print 'raw input = ', test
print '======================='
print 'parsed attributes and values'
print ' '
for attr in record.attrs:
    print '-----|',attr.name,':', attr.value

here is the corresponding output:

=======================
parsed attributes and values

-----| $FILE : 
-----| EXTERNALLINKEDUNID :  35F30F8BBDF4F0CE85257AA000745A3B
-----| $Links : 
-----| DTU :  10/24/2012 12:00:00 AM
-----| Document_ID :  2012-1728
-----| Document_Type :  Tax Alerts
-----| Document_Subtype :  Internal Tax Alerts
-----| LPlanning :  (n/a)
-----| LProvision :  (n/a)
-----| LCompliance :  (n/a)
-----| LControversy :  (n/a)
-----| Drafter :  Brittenham, J.A.
-----| Turnaround :  24 Hours
-----| CopyrightNoticeFirstLine :  Copyright © 1996 - 2012, Ernst & Young LLP.
-----| CopyrightNotice :  All rights reserved. No part of this document may be reproduced, retransmitted or otherwise redistributed in any form or by any means, electronic or mechanical, including by photocopying, facsimile transmission, recording, rekeying, or using any information storage and retrieval system, without written permission from Ernst & Young LLP.
-----| DocAuthor :  CN=Darryl Hudson/OU=ESS/O=EYLLP/C=US
-----| DocComposed :  10/23/2012 05:10:52 PM
-----| InactiveIssue : 
-----| Title :  Service releases some inflation adjustments for 2013, defers releasing others
-----| Display :  No
-----| Search :  No
-----| External_Distribution :  No
-----| Supertopic_1 :  Supertopic 2\Personal Finance
-----| DOCID :  2012-1728
-----| $UpdatedBy :  CN=Darryl Hudson/OU=ESS/O=EYLLP/C=US,CN=Joan D. Osborne/OU=NOhioEMichigan/OU=TAX/O=EYLLP/C=US
-----| $Revisions :  10/23/2012 05:10:51 PM,10/23/2012 05:10:52 PM,10/23/2012 05:10:55 PM,10/23/2012 05:11:15 PM

I am not sure how to parse the blob of text from the input string. I have tried to treat it as part of the grammar, but was not successful. I ended up not knowing how to identify the start of the blob, and the text of the blob ended up being parsed as part of the 'field : value' pattern if any ':'s occurred in the blob.

I am considering the possibility of creating a 2nd grammar, or even more simply, segmenting the input string when reading the text file so that I can isolate the blob. So far this has been equal parts maddening and pure joy. The maddening part is there are not a lot of examples and resources for a novice like me; the joy is that pyparsing holds such great promise for all types of tasks if I can just get my head around it and learn it better. Any suggestions would be greatly appreciated.

2012-10-30 02:28:39 - ptmcg

Try this:

recordStart = Literal('$FILE:')
colon  = Literal(':')


f = ~recordStart + Combine(Word(alphas+'$', alphanums+'_') + colon.suppress())('name')
v = restOfLine('value')
blob = Group(~recordStart + ~StringEnd() + 
            Empty().setParseAction(replaceWith('$Body'))('name') + 
            SkipTo(recordStart | StringEnd())('value'))
r = Group(recordStart + (Dict(OneOrMore(Group(f + Optional(v))))+blob)('attrs'))


record = ''
try: 
       records = OneOrMore(r).parseString(test, parseAll=False)  
except ParseException, err:
       print '>>>>>>>>',err.line
       print '>>>>>>>>',' '*(err.column-1) + '^'
       print '>>>>>>>>',err   


for record in records:
    print '========================'
    print 'record = ',record
    print '     Document_ID = ',record.attrs.Document_ID
    print '     DOCID = ',record.attrs.DOCID
    #~ print '-----------------------'
    #~ print 'raw input = ', test
    print '======================='
    print 'parsed attributes and values'
    print ' '
    for attr in record.attrs:
        print '-----|',attr.name,':', attr.value

Note the definition of blob as a very open-ended catch-all term, defined using SkipTo. blob has to come after you have looked for more attr definitions and not found them. There is also a bit of pyparsing magic in using an Empty with the parse action replaceWith to inject the '$Body' label so that the blob will have a nice attribute name.
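A tiny isolated demo of that injection trick (illustrative, separate from the grammar above):

from pyparsing import Empty, Word, alphas, replaceWith

labeled = Empty().setParseAction(replaceWith('$Body'))('name') + Word(alphas)('value')
print labeled.parseString('sometext').dump()
# ['$Body', 'sometext']
# - name: $Body
# - value: sometext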

Parsing in general is a mix of the maddening and the joyful. Thanks for sticking this out, I hope this does the job with your Lotus export file.

-- Paul


2012-11-06 13:39:41 - tvn1981 - Very slow parsing a simple First order logic form

Hi, I try to parse logical expressions such as these

x
    FALSE
    NOT x
    (x = 5) AND (y >= 10) OR NOT (z < 100 OR w)
    (A=True OR NOT (G < 8) => S = J) => ((P = A) AND not (P = 1) AND (B = O)) => (S = T)

and the parsing code I've written below is very slow (e.g. on the last test input above, 'A=True ...'). Am I missing something, or is there something I can do to make it faster?

LPAR,RPAR = map(Suppress, ['(',')'])
number = Word(nums)
tf = oneOf('TRUE FALSE')
identifier = Word(alphas, alphanums + '_')
fol = Forward()
term = tf | identifier | number
op_prec = [(oneOf('= >= <= > <'), 2, opAssoc.RIGHT,),
           (CaselessLiteral('not'), 1, opAssoc.RIGHT,),
           (CaselessLiteral('and'), 2, opAssoc.LEFT,),
           (CaselessLiteral('or'), 2, opAssoc.LEFT,),
           ('=>', 2, opAssoc.RIGHT,),
           ]
fol << operatorPrecedence(term, op_prec)
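A likely first speedup (an assumption, not from this thread, but see the performance discussion later on this page): enable packrat parsing, since operatorPrecedence grammars re-try the same sub-expressions at the same locations many times.

from pyparsing import ParserElement
ParserElement.enablePackrat()  # call once, right after importing pyparsing; memoizes match results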

2012-11-13 00:24:07 - DiaaFayed - expressions matcher module

I have designed a match-expressions module as follows

input: a list of strings
input: a list of pyparsing matching expressions; each expression has a specific parse action method

for each string:
    for each expression:
        if it matches, execute the relevant action method and go to the next string
        else go to the next expression

the question

  1. How can I assign a specific method to each matching expression?
  2. Is there another, more elegant design for that module?

thanks

2012-11-13 02:52:10 - ptmcg

I could still use some more information here.

After finding the first match, do you continue on to all the rest of the expressions, or do you stop?

Are the action methods just arbitrary methods that take a string and do something, or are they intended to be chained one to the next (that is, they return the input string or a modified form of the input string)?

I would not have this function take a list of strings, but just a single string. Have the caller take care of looping over all the strings.

Can you put together an example of a list of expressions and corresponding functions, so I have a clearer idea of what your intention is for what this function is to do?

-- Paul

2012-11-13 02:59:20 - ptmcg

Oh I reread your pseudo code, and I see that you answered my first question - after a match, you are then just done with the string.

Still curious about the expressions and functions that you are going to pass. If these are just regular parse actions, then attach them to the expressions before calling the function. Your function then is nothing more than a MatchFirst.

expr1 = ...
expr2 = ...
expr3 = ...
expr1.setParseAction(func1)
expr2.setParseAction(func2)
expr3.setParseAction(func3)
# list of expressions, with parse action associated with each
exprs = [expr1, expr2, expr3]

# some sample strings
strings = "I'd gladly pay you Tuesday for a hamburger today".split()

for s in strings:
    print MatchFirst(exprs).parseString(s)

What you have described as a separate module is already part of pyparsing, it is how MatchFirst works.

-- Paul

2012-11-13 03:27:13 - DiaaFayed

  1. stopping after matching, and then processing the next string.

  2. the methods are arbitrary methods that take a string and do a transformation.

  3. the data to be processed is a list of strings. I extract information and transform it into another format.

Example:

def eng_word():
    excluded_chars  = u'?!:;,()'
    english_alphas  = u''.join(unichr(x) for x in range(0x0021, 0x007F))
    word = Word(english_alphas, excludeChars=excluded_chars)
    return word

data = [
    'diaa (mohamed) fayed',
    'diaa(mohamed)fayed',
    'diaa(fayed)',
    'diaa (fayed)',
    '(diaa)fayed',
    '(diaa) fayed',

]

expressions = [
eng_word()('left') + Literal('(').suppress() + eng_word()('middle') + Literal(')').suppress() + eng_word()('right'),
eng_word()('left') + Literal('(').suppress() + eng_word()('middle') + Literal(')').suppress(),
Literal('(').suppress() + eng_word()('middle') + Literal(')').suppress() + eng_word()('right')
]

method1 - input: the output of expression1; output: ['diaa fayed', 'diaa mohamed fayed']

method2 - input: the output of expression2; output: ['diaa', 'diaa fayed']

method3 - input: the output of expression3; output: ['fayed', 'diaa fayed']

....

this is an example, but the real expressions and examples may be longer and different

2012-11-13 08:17:26 - DiaaFayed

here are samples of the action methods

# Actions Methods
def f01(toks):
    return ' '.join(toks) + ',' + ' '.join([toks[0], toks[2]])

def f02(toks):
    return ' '.join(toks) + ',' + ' '.join([toks[0]])

def f03(toks):
    return ' '.join(toks) + ',' + ' '.join([toks[1]])

the output

['diaa mohamed fayed,diaa fayed']
['diaa mohamed fayed,diaa fayed']
['diaa fayed,diaa']
['diaa fayed,diaa']
['diaa fayed,fayed']
['diaa fayed,fayed']

please

  1. see if you have better reasoning about the action methods.

  2. I will try to remove duplication by applying your note in my question about 'parenthesis and space'


2012-11-13 01:56:45 - DiaaFayed - the space and paranthesis

given the expressions

name1 = '(' + Word(alphas) + ')' + Word(alphas)
name2 = Word(alphas) + '(' + Word(alphas) + ')'
name3 = Word(alphas) + '(' + Word(alphas) + ')' + Word(alphas)

and

example1 = (diaa)fayed
example2 = (diaa) fayed 

this code parses the two examples to

['diaa', 'fayed'] 

but I need it to parse only the first example

in the same way the following examples

diaa(fayed)
diaa (fayed)

diaa(mohamed)fayed
diaa (mohamed) fayed

i.e. I want to parse only the examples without spaces

2012-11-13 02:43:11 - ptmcg

Read up on 'Combine' and 'leaveWhitespace' to see how to control how pyparsing skips or doesn't skip over whitespace.

-- Paul
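A small illustration of the leaveWhitespace approach (a sketch, not from the original reply):

from pyparsing import Word, alphas, Literal, ParseException

name = Word(alphas) + Literal('(').leaveWhitespace() + Word(alphas) + ')'
print name.parseString('diaa(fayed)')  # -> ['diaa', '(', 'fayed', ')']
try:
    name.parseString('diaa (fayed)')
except ParseException:
    print 'no match when a space precedes the parenthesis'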


2012-11-15 05:55:16 - DiaaFayed - Questions about scanExamples.py

questions about scanExamples.py

  1. What is the role of empty? Is it necessary?

    macroDef = Suppress('#define') + ident + Suppress('=') + empty + restOfLine

  2. How is the dictionary constructed in

    macros = dict(list(macroDef.searchString(testData)))

although the dictionary needs a list of tuples according to

the output of list(macroDef.searchString(testData)) is

[(['MAX_LOCS', '100'], {}), (['USERNAME', "'floyd'"], {}), (['PASSWORD', "'swordfish'"], {})]

for example, the term (['MAX_LOCS', '100'], {}) is a tuple, but the second item is an empty dictionary

  3. the structure of the ParseResults object, a tuple of a list and a dictionary, is not understandable

Why that structure?

for example, (['MAX_LOCS', '100'], {})

2012-11-15 09:28:21 - ptmcg

  1. I encourage you to experiment with these things on your own. Take out the empty and see how the values of the macros are different.

  2. Don't get hung up on how a ParseResults looks, focus on what it does. ParseResults is there to act like a list, an object, and a dict, all depending on how you access it. In the case you show, there are 2 elements in the term, 'MAX_LOCS' and '100'. The empty dict indicates that there are no results names defined. If you access term as a list, you'll get term[0] contains 'MAX_LOCS' and term[1] contains '100'. The dict constructor is not limited to taking a list of tuples as its argument, but will take a sequence of any 2-element sequences. A ParseResults containing 2-element ParseResults will work just as well.

Here is a little console example doing tuple unpacking directly against the ParseResults object returned from parseString:

>>> patt = Word(alphas) + Suppress('=') + empty + restOfLine
>>> macro = patt.parseString('A = 3.14159')
>>> key,value = macro
>>> print key
A
>>> print value
3.14159

That is similar to what is happening inside the dict constructor.

-- Paul


2012-11-15 06:03:12 - DiaaFayed - Question about Copy()

copy() - returns a copy of a ParserElement; can be used to use the same parse expression in different places in a grammar, with different parse actions attached to each

please, kindly give us an example?

2012-11-15 07:33:40 - ptmcg

from pyparsing import Suppress, Word, nums, ParseException

SLASH = Suppress('/')
integer = Word(nums)
def validate_year_range(t):
    val = int(t[0])
    if not 1800 <= val <= 2099:
        raise ParseException('invalid value for year')
year_integer = integer.copy().setParseAction(validate_year_range)

def validate_month_range(t):
    val = int(t[0])
    if not 1 <= val <= 12:
        raise ParseException('invalid value for month')
month_integer = integer.copy().setParseAction(validate_month_range)

def validate_day_range(t):
    val = int(t[0])
    if not 1 <= val <= 31:
        raise ParseException('invalid value for day')
day_integer = integer.copy().setParseAction(validate_day_range)

date = year_integer + SLASH + month_integer + SLASH + day_integer

print date.parseString('2012/12/31') # valid date
print date.parseString('2012/2/30') # not a valid date, but passes the range check
print date.parseString('2012/14/4') # no such thing as month 14

# added exercise for the reader: add a parse action to date to verify that the 
# day_integer value is within the correct range for the given month and year

2012-11-15 07:58:05 - DiaaFayed

Thanks very much

Please, when you have time, see my previous post


2012-11-22 09:15:57 - cadourian - How to improve parser performance

Hello,

I have been working with pyparsing for some time and have written a parser to parse a special programming language. All is working fine with the parser, except that I need to make it go faster and I wanted to know what techniques can help doing that.

For example, here's what I see when I run the profiler on the code.

ncalls           tottime    percall    cumtime    percall    filename:lineno(function)
4087463/1284      28.829          0     96.989      0.076    pyparsing.py:909(_parseNoCache)
1959051/1543061    8.517          0     19.598          0    pyparsing.py:291(__init__)
1380               7.235      0.005      7.235      0.005    {_omnipy.invoke}
393192/7383        7.167          0     95.522      0.013    pyparsing.py:2524(parseImpl)

Line 909 (_parseNoCache) is taking about 29 seconds. ParseResults on line 291 is taking 8.5 seconds, etc.

I wish to know what techniques are available to make things go faster.

  • Compilation?
  • Rewrite the parsers?

etc.

Thanks for any suggestions.

Chah'

2012-11-22 09:32:58 - cadourian

Follow up on the question above,

how can I trace which parser rule is being called the most and/or taking the most total time? With the profiler I used, I can see that ParseResults and _parseNoCache are taking the most time, but if I can trace the problem to a specific parsing rule, I'm already ahead.

Tx

2012-11-22 10:37:38 - ptmcg

Try calling ParserElement.enablePackrat() before calling parseString - this will do internal memoization of parser matches/exceptions. Also, a common easy speedup: low-level tokens that are built up using Combine(lots of other pyparsing bits) can be sped up by replacing them with a Regex - floating-point numbers matched using Regex(r'\d+\.\d*') will be much faster than Combine(Word(nums)+'.'+Optional(Word(nums))), with little loss of readability.

You can add your own custom debug action to all of your expressions of interest to keep a tally of attempts, matches, and exceptions.
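A rough sketch of such a tally (the counter functions and names here are illustrative assumptions, not from the original reply):

from collections import defaultdict
from pyparsing import Word, nums

tally = defaultdict(lambda: [0, 0, 0])  # expr name -> [attempts, matches, exceptions]

def note_try(instring, loc, expr):
    tally[expr.name][0] += 1
def note_match(instring, startloc, endloc, expr, toks):
    tally[expr.name][1] += 1
def note_fail(instring, loc, expr, exc):
    tally[expr.name][2] += 1

integer = Word(nums).setName('integer')
integer.setDebugActions(note_try, note_match, note_fail)

integer.searchString('12 abc 34')
print dict(tally)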

2012-11-22 10:47:35 - cadourian

Hi Paul,

ok. Let me see if I understand how to do this.

For example, if I have the following

STRING = Combine(Literal('"') + ZeroOrMore(S_CHAR | S_ESCAPE) + Literal('"'))

Can I just say

STRING = Regex('"[my escape characters]*"')

and that would work within the parser wherever STRING was used?

Second question: the enablePackrat.

can I just use something like (given the pyparsing STRING def above),

STRING = Combine(Literal('"') + ZeroOrMore(S_CHAR | S_ESCAPE) + Literal('"'))
STRING.enablePackrat()

and later on use STRING.parseString(...) as before?

Thanks

2012-11-22 11:10:59 - ptmcg

For your definition of STRING, just use QuotedString('"', escChar='\\'), which will internally generate its own regex.
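For instance (a quick sketch, not from the original reply):

from pyparsing import QuotedString

STRING = QuotedString('"', escChar='\\')
print STRING.parseString('"hello \\"world\\""')
# -> ['hello "world"']  (quotes stripped, escapes undone)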

enablePackrat has to be globally enabled so that all expressions get memoized. After importing pyparsing, do 'ParserElement.enablePackrat()'. More info here:

2012-11-22 11:16:58 - ptmcg

Didn't mean to stomp on your post - your regex sample is on the right track too, if you prefer to use that over QuotedString.

And after calling enablePackrat, the rest of your program works with no additional changes - the memoizing just happens internally to the pyparsing code.

2012-12-05 14:40:04 - Demolishun

What about PyPy? Since this is pure Python.

2012-12-05 17:44:16 - ptmcg

Yes, I have tried pyparsing with PyPy - it is 2-8 times as fast as CPython, depending on the complexity of the grammar.


2012-11-26 10:12:33 - DiaaFayed - Design Pattern: Chain of Responsibility

I want to implement the design pattern:

Chain of Responsibility

remember the last discussion about MatchFirst and the pattern matcher module

I want to design the pattern matcher as a Chain of Responsibility,

such that

  • each line passes over a set of pyparsing expressions
  • each expression processes the line and extracts the matched information.
  • the extracted information is collected in a structure used to fill a database structure

This module will be an Information Extractor module

Please share any reasoning you have about that.

2012-11-26 21:38:02 - ptmcg

I like Chain Of Responsibility (or CoR for short) in one respect, and dislike it in another. I like the notion of setting up the chain of handlers, and then running an object through the chain to be processed by the first eligible handler. What I dislike is that the handlers know they are part of a chain - they have a successor member variable, and each link's implementation code can't just be a clean 'handle object' but is instead 'try handling object, but if I can't, pass it to the next handler in the chain'.

To get the best of both worlds, and still call this Chain Of Responsibility, you could create a wrapper class containing the handler instance that just does object handling, and have the wrapper contain the next pointer and have the wrapper implement the 'try my handler, if fail, pass it to the next'. Now the handlers stay very clean, and the CoR pattern happens in a generic wrapper:

class CoRLink(object):
    def __init__(self, handler):
        self._handler = handler
        self._next = None
        self._handled = False

    def setSuccessor(self, nextHandler):
        self._next = nextHandler

    def handle(self, obj):
        # version of handle where handler.handle() returns
        # True or False if the object was handled
        if not self._handler.handle(obj):
            if self._next:
                self._next.handle(obj)

    def handle(self, obj):
        # version of handle where handler.handle() raises
        # an exception if the object was not handled
        self._handled = False
        try:
            self._handler.handle(obj)
        except Exception:
            pass
        else:
            self._handled = True

        if not self._handled and self._next:
            self._next.handle(obj)

    def addHandler(self, newHandler):
        if self._next:
            self._next.addHandler(newHandler)
        else:
            self._next = newHandler

    def wasHandled(self):
        return (self._handled or
                (self._next and self._next.wasHandled())
                )

h1 = HandlerType1()
h2 = HandlerType2()
h3 = HandlerType3()

handlerChain = CoRLink(h1)
handlerChain.addHandler(CoRLink(h2))
handlerChain.addHandler(CoRLink(h3))

# pass an object to the head of the chain, and one of the
# handlers might handle it
handlerChain.handle(objectToBeHandled)

But this strikes me as unnecessarily clever code when a simple container class for the chain itself can cleanly implement the iteration logic, breaking out on the first successful handler. Because the 'try this handler, but if fail move on to the next' logic is implemented outside of the handlers themselves, this strictly speaking isn't an example of the CoR pattern - but I think it is more readable, and has an API that is just as clean.

class HandlerChain(object):
    def __init__(self):
        self._handlers = []
        self._handled = False

    def addHandler(self, newHandler):
        self._handlers.append(newHandler)

    def handle(self, obj):
        # version of handle where handler.handle() returns
        # True or False if the object was handled
        self._handled = False
        for handler in self._handlers:
            if handler.handle(obj):
                self._handled = True
                break

    def handle(self, obj):
        # or implement this in a single line using the builtin
        # method 'any'
        # (any will automatically stop processing the list once 
        # it gets the first True value)
        self._handled = any(handler.handle(obj) 
                                for handler in self._handlers)

    def handle(self, obj):
        # version of handle where handler.handle() raises
        # an exception if the object was not handled
        self._handled = False
        for handler in self._handlers:
            try:
                handler.handle(obj)
            except Exception:
                pass
            else:
                self._handled = True

    def wasHandled(self):
        return self._handled


h1 = HandlerType1()
h2 = HandlerType2()
h3 = HandlerType3()

handlerChain = HandlerChain()
handlerChain.addHandler(h1)
handlerChain.addHandler(h2)
handlerChain.addHandler(h3)

# pass an object to the head of the chain, and one of the
# handlers might handle it
handlerChain.handle(objectToBeHandled)            

I am definitely a big fan of design patterns - you see heavy use of the Template and Strategy patterns and this modified CoR in pyparsing, and the Command pattern in many of my pyparsing examples. But it is also important to understand the context of the original GoF Design Patterns work was with C, C++, and Java, languages that did not have language-native containers like Python's list and tuple or the flexibility of Python's duck-typing. The Design Patterns that were identified in the early 1990's were defined by the problem space, the language choices and features, of that time. For instance, you don't ever even see a Factory implementation in Python, since this is pretty implicitly done for you by the language itself.

Here is a link to a good presentation by Alex Martelli on Design Patterns in Python: , and some videos of Alex presenting this material:

How does this all relate to pyparsing? Look again at this simple example:

floatExpr = Regex(r'\d+\.\d*')
intExpr = Word(nums)

parser = floatExpr | intExpr | quotedString

# or to be more CoR-like, add each expression (or handler) to the
# parser
parser = floatExpr
parser |= intExpr
parser |= quotedString

data = '''
100
3.14159
"blah blah"'''

for line in data.splitlines():
    print parser.parseString(line)

parser is a MatchFirst of three pyparsing expressions. MatchFirst's implementation is essentially the same as the third HandlerChain.handle method shown above. For the line containing '100', parser first tries to evaluate the floatExpr (and fails); then the intExpr, succeeds, and stops trying to match any further expressions.

Write back and let me know what you think about this discussion.

-- Paul

2012-11-27 01:22:45 - ptmcg

After some further thought, I recall now that many of the examples for uses of CoR were in cases involving UI controls and widgets, which naturally contain pointers to parent container and child contained widgets. The classic CoR is implemented to do the handling of UI events, such as a mouse click for instance. The click event is passed to the innermost widget's handler. If not handled there, the widget calls the handle method on its parent, and so on up the chain of contained UI objects, until the control or container that is responsible for handling that event is reached, and it handles the event.

In this case, there is no addition of next pointers for the purpose of implementing CoR - just the opposite, CoR can take advantage of the fact that the pointers are there to begin with, implementing the UI controls hierarchy.

Out of curiosity, what was it that made you want to implement CoR in the first place? Did you think it would make a good way to model the alternative parsing options for parsing a line of data? If you broaden your concept of just how CoR is implemented, to include the iteration over a list of possible handlers, I would say that using a MatchFirst as I showed in the previous comment (the '|' operator creates MatchFirsts) is doing just that.

-- Paul

2012-11-28 06:36:35 - DiaaFayed

Thanks very much, I am sorry for late

  1. I agree with you on simplifying the pattern, as in the second example.

  2. I want only the concept of the CoR pattern, not its exact implementation.

  3. I want some modification to the concept of the pattern: a line of data can be processed by more than one handler or pyparsing expression

example:

the line of data could contain two pieces of information that need to be extracted by two pyparsing expressions or handlers.

  4. I hoped you would implement my concept with pyparsing classes as much as possible, not pure Python, so that the code will be homogeneous

2012-11-28 07:00:19 - ptmcg

With your adoption of my simplification of CoR iteration being done in a container, and with your modification of processing all handlers and not just the first, this pretty much ceases to be CoR.

Your description of a pyparsing element that reprocesses the input multiple times is not consistent with any other part of pyparsing. I don't think you really need a specialized pyparsing class, you just need to write some Python code that iterates over a series of expressions and accumulates the matched data into a single ParseResults object. Here is some compact sample code, using Python's sum builtin and searchString:

>>> a_s = Word('A')
>>> b_s = Word('Bb')('B')
>>> c_s = Word('C')
>>> exprs = [a_s, b_s, c_s]

(iterate over all expressions, and accumulate the returned ParseResults using Python's sum builtin)
>>> instr = 'AS;LKJFASDBWEL;CCDBEawe;lkb'
>>> total = sum(sum(expr.searchString(instr)) for expr in exprs)

>>> print total.asList()
['A', 'A', 'B', 'B', 'b', 'CC']

>>> print total.dump()
['A', 'A', 'B', 'B', 'b', 'CC']
- B: b

At this point, I must ask you to stop just asking for stuff, and start working out simple examples and asking for specific help on what is not working for you. What I read in this whole thread is 'I want...' over and over. I do my best to help out beginners, but you have to bring some effort to the process too.

-- Paul

2012-11-28 07:52:49 - DiaaFayed

Thanks very much

I will do that - I will really try first, and then ask at the same time


2012-11-26 17:38:27 - rogersanchez75 - Arithmetic evaluation with variables

Started using pyparsing and found it phenomenal for my project. I am trying to build a DSL for simple calculations on data fields extracted from a mongodb database. Using the SimpleCalc and eval_arith examples I was able to put together a calculator that evaluates an expression that combines field references and simple operators.

I am now trying to add a layer of recursion to support variables, but having trouble sorting through how to do this most elegantly. Have the following questions:

  1. When using the recursive pattern as shown in SimpleCalc, is it the case that I no longer need to use operatorPrecedence, as was being set up in eval_arith? Examples like SimpleCalc seem to do away with this.

  2. From the standpoint of evaluating expressions containing variables, if I wanted to preserve the use of classes like EvalConstant so that they also handle variables, would I simply extend the EvalConstant class so that it contains a list of variables and their values, and then invoke a method within it when I encounter a variable?

In general, what would be awesome and I think useful to the pyparsing community is to expand the eval_arith example such that it shows the use of nested variables, something like:

a = 3
b = 8
c = a + b
2*c - b

Thanks for the help, I can post some code if needed
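As a compact sketch of that requested expansion (not from the original thread; the Operand and BinOp classes are illustrative, in the spirit of eval_arith): assignments store values into a dict, and bare names are looked up in that dict at evaluation time.

from pyparsing import Word, alphas, nums, oneOf, operatorPrecedence, opAssoc

variables = {}

class Operand(object):
    def __init__(self, tokens):
        self.value = tokens[0]
    def eval(self):
        if self.value in variables:
            return variables[self.value]
        return float(self.value)

class BinOp(object):
    def __init__(self, tokens):
        self.tokens = tokens[0]
    def eval(self):
        ops = {'+': lambda a, b: a + b, '-': lambda a, b: a - b,
               '*': lambda a, b: a * b, '/': lambda a, b: a / b}
        result = self.tokens[0].eval()
        for op, rhs in zip(self.tokens[1::2], self.tokens[2::2]):
            result = ops[op](result, rhs.eval())
        return result

operand = (Word(nums) | Word(alphas)).setParseAction(Operand)
expr = operatorPrecedence(operand,
        [(oneOf('* /'), 2, opAssoc.LEFT, BinOp),
         (oneOf('+ -'), 2, opAssoc.LEFT, BinOp)])

assign = Word(alphas)('name') + '=' + expr('value')

for line in ['a = 3', 'b = 8', 'c = a + b', '2*c - b']:
    try:
        result = assign.parseString(line, parseAll=True)
        variables[result.name] = result.value.eval()
        print line, '->', variables[result.name]
    except Exception:
        print line, '->', expr.parseString(line, parseAll=True)[0].eval()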

2012-11-27 07:51:21 - rogersanchez75

Getting closer but struggling to get the variable parsing working. I think there is something wrong with my grammar. Code below:

# Define parser, accounting for the fact that some fields contain whitespace

integer = Word(nums)
variable = Word(alphas)
real = Combine(Word(nums) + '.' + Word(nums))
field = Combine(Word(alphas) + ':' + Word(printables) + Optional(' ' + Word(alphas) + ' ' + Word(alphas)))

operand = real | integer | field | variable

signop = oneOf('+ -')
multop = oneOf('* /')
plusop = oneOf('+ -')

# Use parse actions to attach Eval constructors to sub-expressions

operand.setParseAction(EvalConstant)
arith_expr = operatorPrecedence(operand,

    [
     (signop, 1, opAssoc.RIGHT, EvalSignOp),
     (multop, 2, opAssoc.LEFT, EvalMultOp),
     (plusop, 2, opAssoc.LEFT, EvalAddOp),
    ])

comparisonop = oneOf('< <= > >= != == t')

comp_expr = operatorPrecedence(arith_expr,

    [
    (comparisonop, 2, opAssoc.LEFT, EvalComparisonOp),
    ])

assignment = variable('varname') + '=' + (arith_expr | comp_expr)('varvalue')

assignment.setParseAction(StoreVariable)

My problem seems to be that parse action StoreVariable is not being called when evaluating an expression like:

var1 = Person:Height

Rather, EvalConstant is being invoked on the token 'var1' and fails. Here is my code for EvalConstant:

class EvalConstant(object):

    tests_ = {}
    fields_ = {}
    person_ = None

    def __init__(self, tokens):
        self.value = tokens[0]

    def eval(self):
        v = self.value

        # Determine if this is a database reference and if so get field value

        if ':' in v:
            fieldRef = v.split(':')
            source = fieldRef[0]
            field = fieldRef[1]

            if source not in EvalConstant.tests_:

                raise NameError('Syntax error: cannot find source ' + source + ' in test list')

            elif field not in EvalConstant.fields_:
                raise NameError('Syntax error: cannot find field ' + source + ' in fields list')

            # Fetch the value from the database

            rec = db[source].find_one({'Name' : self.person_}, { '_id' : 0, field : 1})

            if rec is not None:
                return rec[field]
            else:
                return 0

        # Must be a number
        else:
            return float(self.value)

Any idea where my grammar is broken?

2012-11-27 08:05:13 - ptmcg

Ah, well! What you have done is to introduce a new type of operand, a DBReference. Move this code out of EvalConstant and define a new class EvalDBReference. Create a grammar element dbReference something like:

dbReference = variable('table') + ':' + variable('column')

Then expand operand to:

operand = (real | integer | field | variable).setParseAction(EvalConstant) | dbReference.setParseAction(EvalDBReference)

This will keep your code from getting too complicated in EvalConstant.

-- Paul

2012-11-27 08:11:27 - rogersanchez75

Thanks - in essence I already have a dbReference, it's the grammar I called 'field' in my definitions. I see what you are saying though in terms of adding this to the list of operands. Will give it a shot!

2012-11-27 08:57:26 - rogersanchez75

Ok, stuck again. I just cannot seem to get the assignment parse action to get executed, and I am not sure what EvalConstant is supposed to do when it encounters a variable. In your code suggestion you showed setting parse action for variable to EvalConstant but I am not clear how this case should be handled. Should I be creating a dictionary in EvalConstant to hold the vars, and if so how do I get them to evaluate fully?

I've pasted the full source for my example here:

again its pretty similar to your eval_arith except I use dbReferences and want to have variables in a dictionary so that formulas can reference prior formulas.

really appreciate the help, as you can tell I'm a hack software developer but am the only one around that can work on this project

2012-11-27 09:22:24 - ptmcg

My mistake, change:

operand = (real | integer | variable).setParseAction(EvalConstant) | dbRef.setParseAction(EvalDBref)

to

operand = dbRef.setParseAction(EvalDBref) | (real | integer | variable).setParseAction(EvalConstant)

Can you see why? (hint: '|' means 'match first' - must take care not to interpret the 'xxx' of 'xxx:yyy' as a variable named 'xxx')
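
[Editor's note: the ordering issue is easy to demonstrate in isolation - a small sketch using simplified stand-ins for variable and dbRef:]

from pyparsing import Word, Combine, alphas

variable = Word(alphas)
dbRef = Combine(variable + ':' + variable)

# variable first: 'Person' matches and ':Height' is never consumed
print (variable | dbRef).parseString('Person:Height')   # -> ['Person']

# dbRef first: the longer alternative gets its chance before plain variable
print (dbRef | variable).parseString('Person:Height')   # -> ['Person:Height']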

2012-11-27 09:34:30 - rogersanchez75

Yes, understood on the ordering. I changed the operand parsing around but am still having an issue: when testing simple expressions such as

a = 2.5
b = Person:Height

I still am ending up with EvalConstant being the first parse action that is called. It does not seem like the assignment grammar is being executed so I never end up with the StoreVariable function being executed:

assignment = variable('varname') + '=' + (arith_expr | comp_expr)('varvalue')
assignment.setParseAction(StoreVariable)

I think there is still something not right with the grammar. Should I be using the Optional object like this:

assign = Optional((variable+assign).setParseAction(StoreVariable)) + comp_expr

2012-12-01 15:27:53 - rogersanchez75

Hi guys, I am still working on this problem and struggling. My parse action for assignment is being eaten by the operand action, so I am never able to assign the variable value. Please have a look and let me know what I have missed here:

expr = Forward()
chars = Word(alphanums + '_-/')
integer = Word(nums).setParseAction(EvalConstant)
real = Combine(Word(nums) + '.' + Word(nums)).setParseAction(EvalConstant)
var = Word(alphanums)

# Handle database field references that are coming out of Mongo
dbRef = Combine(chars + OneOrMore(':') + chars)
dbRef.setParseAction(EvalDBref)

# Handle function calls
functionCall = (Keyword('Rank') | Keyword('ZS') | Keyword('Ntile')) + '[' + dbRef + ']'
functionCall.setParseAction(EvalFunction)

assign = var('varname') + '=' + expr('varvalue')
assign.setParseAction(assign_var)

operand = functionCall | dbRef | (var | real | integer).setParseAction(EvalConstant) 

signop = oneOf('+ -')
multop = oneOf('* /')
plusop = oneOf('+ -')

# Use parse actions to attach Eval constructors to sub-expressions

expr << operatorPrecedence(operand,
    [
     (signop, 1, opAssoc.RIGHT, EvalSignOp),
     (multop, 2, opAssoc.LEFT, EvalMultOp),
     (plusop, 2, opAssoc.LEFT, EvalAddOp),
    ])

assign_var is never being called, EvalConstant is being invoked on the token 'Var' for a test expression of this type:

Var=People::Height + People::Weight

Help!

2012-12-01 16:44:03 - ptmcg

Are you doing this:

(expr | assign).parseString(inputstring)

If so, reverse the order of the expressions to:

(assign | expr).parseString(inputstring)

-- Paul

2012-12-01 18:48:02 - rogersanchez75

Thanks, I am now doing parseString in the order you described and assign_var is being called correctly. Unfortunately now I end up with an exception during evaluation:

AttributeError: 'str' object has no attribute 'eval'

Here is my EvalConstant

class EvalConstant(object):
    var_ = {}
    def __init__(self, tokens):
        self.value = tokens[0]
    def eval(self):
        v = self.value
        # look up previously assigned variables by name
        if v in self.var_:
            return self.var_[v]
        else:
            return float(self.value)

And here is my formula evaluation call

ret = (assign | expr).parseString(line)
print line + ' --> ' + str(ret.eval())

The value of ret is parsed to 'Var' when testing the formula:

Var=People::Height + People::Weight

I just can't puzzle out why parseString only returns the token Var. I believe the problem lies in my operand grammar, but when I try to shuffle the operands around I get other problems where the variable grammar eats my dbRef.

2012-12-01 21:55:54 - rogersanchez75

Addendum: after playing around some more, I have found that parseString returns the following tokens in my example:

['Var', '=', <__main__.EvalAddOp object at 0x1007c32d0>]

The exception is thrown after this as Var cannot be evaluated. I think I am extremely close!

2012-12-02 12:48:14 - ptmcg

An assign statement is not the same as an expression. If an assign is parsed, then you don't need to eval anything - the expression on the right-hand side has already been eval'ed and stored into the variable by the parse action you attached to assign. If the first token of the results is a string, then there is nothing to do, or you can print out a diagnostic like: print ret[0], '<-', EvalConstant.var_[ret[0]]. Yes, I think you are very close.
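
[Editor's note: in driver-loop form, the suggestion above amounts to something like this sketch - untested, with names taken from the posts in this thread:]

ret = (assign | expr).parseString(line)
if isinstance(ret[0], basestring):
    # an assignment was parsed - assign_var has already stored the value
    print ret[0], '<-', EvalConstant.var_[ret[0]]
else:
    # a bare expression - evaluate it now
    print line + ' --> ' + str(ret[0].eval())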

2012-12-02 16:54:00 - rogersanchez75

IT LIVES

got it all functioning .. thanks for the help. That last tip about an assignment not being an expression should have been obvious but I am just not Being The Parser!

Next step is function evaluation for me!

2012-11-26 17:40:36 - rogersanchez75

Just to simplify my question, my primary request would be to see an example of how eval_arith would be extended to handle nested variables and recursion while retaining its same general class structure.

Thanks!

2012-11-26 22:10:20 - ptmcg

I don't think there's much recursion here - operatorPrecedence and pyparsing should already take care of any recursion in the parsing process.

To extend eval_arith, you would first need to change the comparison operator '=' to '==' so as not to confuse assignment with comparison. Then, expand the parser from just comp_expr to comp_expr | assignment_statement, and define assignment_statement to be

assignment_statement = variable('varname') + '=' + (arith_expr | comp_expr)('varvalue')

and add a parse action to assignment_statement that looks like:

def store_variable_value(tokens):
    EvalConstant.vars_[tokens.varname] = tokens.varvalue.eval()
assignment_statement.setParseAction(store_variable_value)

This is the general idea, I haven't tested this, but it should get you in the ballpark.

-- Paul

2012-11-27 06:46:44 - rogersanchez75

Thanks for the tip - I will try it. How would I then incorporate the store_variable_value parse action into the operatorPrecedence for arith_expr?

Could you explain the usage of operatorPrecedence a little more, and why I wouldn't use the pattern of having an EvalStack[] and VarStack[] together with the recursive function EvaluateStack(), as shown in SimpleCalc.py?

I am sure my question is totally newbish, but I am hoping to get a threshold level of understanding so that I can self serve going forward.

It seems like the grammar definition in SimpleCalc is more complex, using atom, factor, term, and expression definitions, while with operatorPrecedence the structure is simplified.

Thanks again
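
[Editor's note: one way to see what operatorPrecedence buys you is to run it without any Eval classes and inspect the grouping it produces - a minimal sketch:]

from pyparsing import Word, nums, oneOf, operatorPrecedence, opAssoc

operand = Word(nums)
expr = operatorPrecedence(operand,
    [(oneOf('* /'), 2, opAssoc.LEFT),
     (oneOf('+ -'), 2, opAssoc.LEFT)])

print expr.parseString('1 + 2 * 3')
# -> [['1', '+', ['2', '*', '3']]]
# the atom/factor/term layering of SimpleCalc.py is generated for you,
# which is why the higher-precedence '*' ends up grouped innermost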

2012-11-27 06:57:44 - rogersanchez75

By the way, what is the meaning of this grammar:

assignment_statement = variable('varname') + '=' + (arith_expr | comp_expr)('varvalue')

It's not clear to me what the function variable() is.

2012-11-27 07:09:24 - rogersanchez75

Ignore my last question on variable - I realized this was for me to define.


2012-11-28 20:37:48 - rogersanchez75 - Eval functions in arith expressions

Following up on my quest ..

I am working on a simple DSL to transform data extracted from MongoDB. I am using Python and pyparsing and have gotten reasonably far in creating a grammar that works for basic operators like +, -, *, and /, starting from the examples provided. I am currently stuck on how to get my program to evaluate functions of the form Rank[databaseField]. I can retrieve and operate on dbFields through the simple operators, but something is not working with my recursion in evaluating functions.

Here is the grammar and associated setParseActions:

# Define parser, accounting for the fact that some fields contain whitespace
chars = Word(alphanums + '_-/')
expr = Forward()
integer = Word(nums).setParseAction(EvalConstant)
real = Combine(Word(nums) + '.' + Word(nums)).setParseAction(EvalConstant)

# Handle database field references that are coming out of Mongo
dbRef = Combine(chars + OneOrMore(':') + chars)

dbRef.setParseAction(EvalDBref)

# Handle function calls
functionCall = (Keyword('Rank') | Keyword('ZS') | Keyword('Ntile')) + '[' + dbRef + ']'
functionCall.setParseAction(EvalFunction)
operand = (real | integer) | functionCall | dbRef 

signop = oneOf('+ -')
multop = oneOf('* /')
plusop = oneOf('+ -')

# Use parse actions to attach Eval constructors to sub-expressions

expr << operatorPrecedence(operand,
    [
     (signop, 1, opAssoc.RIGHT, EvalSignOp),
     (multop, 2, opAssoc.LEFT, EvalMultOp),
     (plusop, 2, opAssoc.LEFT, EvalAddOp),
    ])

formulas = ['Rank[Person:Height]']

for f in formulas:
    ret = expr.parseString(f)[0]
    print f + ' --> ' + str(ret.eval())

Here is the relevant code for my evaluation class:

# Executes functions contained in expressions
class EvalFunction(object): 
    def __init__(self, tokens): 
        self.value = tokens[0]
    def eval(self):
        func = self.value
        if func == 'Rank':
            # How to evaluate the token that is arg of Function?
            return 'Rank Found';

I think I just need a nudge in the right direction to get to the next stage ..

2012-12-04 08:05:45 - rogersanchez75

As an update, I got this figured out and working. Ended up with an EvalFunction class that looks like this:

class EvalFunction(object):
    pop_ = {}
    def __init__(self, tokens):
        self.func_ = tokens.funcname
        self.field_ = tokens.arg
    def eval(self):
        # Get the name of the requested field and source db
        # Functions can only be called on dbRef, so this is always done
        v = self.field_.value
        fieldRef = v.split(':')
        source = fieldRef[0]
        field = fieldRef[1]

        # Evaluate the dbRef (get the value from the db)
        val = self.field_.eval()

        if self.func_ == 'Avg':
            rec = db['Stats'].find_one({'_id' : field})
            return rec['value']['avg']
        elif self.func_ == 'Root':
            return math.sqrt(val)

and a grammar that is:

functionCall = funcNames('funcname') + '[' + dbRef('arg') + ']'
functionCall.setParseAction(EvalFunction)

2012-11-29 21:07:56 - torfat - parsing C function calls

Hi Paul, I'm trying to parse C function calls like this:

from pyparsing import Word, alphas, alphanums, oneOf, OneOrMore, \
        commaSeparatedList, Suppress, Forward, Group, Optional, \
        delimitedList, Regex, operatorPrecedence, opAssoc, quotedString, \
        dblQuotedString, Literal


testData = '''
funcName('paramOne', &paramTwo, fTwo(p0, p1), paramFour);
'''

expr = Forward()

LPAR, RPAR, SEMI = map(Suppress, '();')
identifier = Word(alphas+'_', alphanums+'_')
function_call = identifier.setResultsName('name') + LPAR + Group(Optional(delimitedList(expr))) + RPAR
integer = Regex(r'-?\d+')
real = Regex(r'-?\d+\.\d*')

operand = (function_call | identifier | real | integer | quotedString )
expop = Literal('^')
signop = oneOf('+ -')
multop = oneOf('* /')
plusop = oneOf('+ -')
factop = Literal('!')
derefop = OneOrMore('*')
addrop = oneOf('&')

expr << operatorPrecedence( operand,
    [(derefop, 1, opAssoc.RIGHT),
     (addrop, 1, opAssoc.RIGHT),
     (factop, 1, opAssoc.LEFT),
     (expop, 2, opAssoc.RIGHT),
     (signop, 1, opAssoc.RIGHT),
     (multop, 2, opAssoc.LEFT),
     (plusop, 2, opAssoc.LEFT),]
    )

for t,s,e in function_call.scanString( testData ):
        print t[0], len(t[1]), 'Parameters:', t[1]

it returns this:

funcName 5 Parameters: ["'paramOne'", ['&', 'paramTwo'], 'fTwo', ['p0', 'p1'], 'paramFour']

I want the output to show 4 parameters, not 5. I need to know that fTwo is a function and that its parameters are p0 and p1. What are my options? TIA

2012-11-29 22:09:20 - ptmcg

Wow, a C parser is very ambitious, you're doing pretty good so far it seems. To group all the tokens together for a function call (a good idea) change:

function_call = identifier.setResultsName('name') + LPAR + Group(Optional(delimitedList(expr))) + RPAR

to

function_call = Group(identifier.setResultsName('name') + LPAR + Group(Optional(delimitedList(expr))) + RPAR)

Now you should see your function call argument wrapped as its own subgroup.

Good luck! -- Paul
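
[Editor's note: with the Group in place, each match comes back as one nested token group, so the print loop needs a small adjustment - a sketch of roughly what to expect:]

for t, s, e in function_call.scanString(testData):
    func = t[0]                      # the whole call is now one group
    print func.name, len(func[1]), 'Parameters:', func[1]

# expected output, approximately:
# funcName 4 Parameters: ["'paramOne'", ['&', 'paramTwo'], ['fTwo', ['p0', 'p1']], 'paramFour']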

2012-12-01 00:23:19 - torfat

Paul, thanks for the quick response! I am not hoping to write a complete C parser. My task at hand is to go through the source and replace some functions with new names and parameters derived from the old parameters. For example, change

funcName('paramOne', &paramTwo, fTwo(p0, p1), paramFour);

to something like this:

newFuncName('PARAMONE', &paramTwo, paramThree,  fTwo(p0, p1), paramFour);

In the above example, the string in the first param is transformed to upper case, paramThree is inserted, and the rest are preserved as in the original. Is 'transformString' the right tool for this? I am having a hard time understanding transformString and setParseAction. To start, I can't even get it to print out the original function.

def substituteFunc(s,l,t):
        s = t[0] + '(' + ', '.join(t[1]) + ')'
        return s

function.setParseAction( substituteFunc )
print function.transformString( testData )

Please advise. Thanks!

2012-12-01 08:52:31 - ptmcg

I'm pretty sure transformString is exactly the tool for this job. You define a pattern to be matched, and in the parse action perform the desired transformation and return the modified string. transformString will then reassemble the unmatched pieces and the transformed strings back into a single output string. Suppressed expressions will also be stripped out when using transformString. You can see some examples in this code:
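
[Editor's note: the code link above did not survive the wiki capture. Below is a minimal sketch of the transform described in this thread, building on the expr, LPAR, RPAR, and delimitedList definitions already posted; the originalTextFor wrapper keeps nested calls like fTwo(p0, p1) intact on the way back out. Untested, and the function/parameter names are just the poster's example:]

from pyparsing import originalTextFor, Literal, Group, Optional, delimitedList

# capture each argument as its original source text
arg = originalTextFor(expr)
call = Literal('funcName')('name') + LPAR + Group(Optional(delimitedList(arg)))('args') + RPAR

def substituteFunc(s, l, t):
    args = list(t.args)
    args[0] = args[0].upper()        # 'paramOne' -> 'PARAMONE'
    args.insert(2, 'paramThree')     # insert the new parameter
    return 'newFuncName(' + ', '.join(args) + ')'

call.setParseAction(substituteFunc)
print call.transformString(testData)
# -> newFuncName('PARAMONE', &paramTwo, paramThree, fTwo(p0, p1), paramFour);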


2012-12-05 14:38:02 - Demolishun - New to pyparser and impressed by capabilities

I have done some limited parsing in the past using Construct (a Python binary parsing package) and regexes. I really do not like the latter. Construct seems to be similar to pyparsing in that you build up your parser a step at a time. I used it in a project to parse simple text messages coming from a server stream. It radically improved the readability of the code and the reliability of the message parsing.

Now I need a more full-featured text-based parser, so I found pyparsing. I need to take an existing scripting language with a well-defined C-like syntax and convert it to Javascript and/or Python, so this will definitely be a challenge. I have not done anything with traditional compiler tools, especially since using Construct kind of spoiled me.

Would it be better to just start from scratch, or should I look at adapting the existing pyparsing parsers for C code? The language I want to parse is Torque Script, which uses a simple C-like syntax. I think due to the syntax of the language it should be easier to parse. The syntax is also similar to Javascript in that it uses keywords for function definitions.

2012-12-05 15:41:34 - Demolishun

Here is my first take at variables:

Notation:
    + = 1 or more
    * = 0 or more
    | = or
    <> = used to identify entities

<alpha> ::= <+'a'...'z'> | <+'A'...'Z'>
<numeric> ::= <+'0'...'9'>

<variable> ::= <'$'|'%'> <+alpha>|'_' <*alpha|*numeric>
<local_variable> ::= <'%'> <+alpha>|'_' <*alpha|*numeric>
<global_variable> ::= <'$'> <+alpha>|'_' <*alpha|*numeric>

Variables are defined as having a '$' or '%' at the beginning, with the normal rules for variables following that. I know there are already functions to help with this; I just want to work through the process.

2012-12-05 15:43:46 - Demolishun

Eh, I messed up alpha. It should be like this:

, = group sets

<alpha> ::= <+'a'...'z','A'...'Z'>

2012-12-05 16:43:49 - Demolishun

Okay, so I have successfully parsed variables:

# define variable parser
var_start = oneOf('$ %')
#identifier = OneOrMore('_'+alphas)+ZeroOrMore('_'+alphas+nums)
identifier = Word(alphas+'_', alphanums+'_')
variable = var_start+identifier

One question on this:
Why doesn't the commented-out identifier work? To me it looks equivalent to the Word()-based version. Unless I am misunderstanding how OneOrMore and ZeroOrMore are supposed to be used.

2012-12-05 17:42:44 - ptmcg

alphas, nums, and alphanums are not pyparsing expressions, they are just plain old strings. pyparsing does auto-promotion of strings to Literals in many cases, so that you can easily write:

socSecNumber = Word(nums,exact=3) + '-' + Word(nums,exact=2) + '-' + Word(nums,exact=4)

alphas, nums, etc. are convenience string constants for defining Words, so that you don't have to constantly define 'alphas = 'abcdefghi...etc.'' in all your pyparsing code. But because alphas is a string, and pyparsing auto-promotes strings to Literals when they are used where an expression is expected, what you are actually writing is 'OneOrMore(Literal('_abcdefghi...etc.')) + ZeroOrMore(Literal('_abcdefghi...0123456789'))' - so the repetition is looking for those actual strings, rather than using them the way Word does, as the definition of allowed leading and optional body characters. Also, Word does not allow any intervening whitespace, but looks for contiguous 'words' composed of the given leading and body characters; by default, 'expr + expr' (where 'expr' is any pyparsing expression) will still match if whitespace is found between the two expressions.

Glad you find pyparsing to be a promising toolkit for you - it does take some mental adjustment to work with, but I hope it's not too big of a hurdle!

-- Paul
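
[Editor's note: a short demonstration of the difference described above:]

from pyparsing import Word, OneOrMore, ZeroOrMore, alphas, alphanums, nums

identifier = Word(alphas + '_', alphanums + '_')
print identifier.parseString('my_var1')      # -> ['my_var1']

# '_' + alphas is plain string concatenation, so this is really
# OneOrMore(Literal('_abc...XYZ')) + ZeroOrMore(Literal('_abc...XYZ0123456789'))
broken = OneOrMore('_' + alphas) + ZeroOrMore('_' + alphas + nums)
# broken.parseString('my_var1') raises ParseException - it looks for
# that entire literal string, not for characters drawn from it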

2012-12-05 19:24:03 - Demolishun

Yeah, I was starting to wonder after reading through pyparsing.py. I thought it might be looking at alphas and alphanums as strings rather than objects. Also, good catch on using Word to keep whitespace from getting in there. That would have been bad.

Hey, I picked up your manual from O'Reilly for $10. Is that going to be updated at some point? I saw it was referencing Python 2.3 or 2.4.

2012-12-06 18:36:45 - ptmcg

Probably not going to get updated, I'm afraid, unless I self-publish. O'Reilly isn't getting enough sales volume for them to be interested in doing a 2nd edition.

-- Paul


2012-12-05 21:14:00 - Demolishun - Working on Literal Identification

identifier = Word(alphas+'_', alphanums+'_')
variable = Combine(oneOf('$ %')+identifier)
qstring = QuotedString('"', escChar='\\', multiline=False)
tstring = QuotedString("'", escChar='\\', multiline=False)
#qstring = dblQuotedString  # leaves quotes in result
#tstring = sglQuotedString  # leaves quotes in result
integer_literal = Combine(Optional('-')+Word(nums))
hex_literal = Combine(oneOf('0x 0X')+Word(hexnums))
float_literal = Combine(Word(nums)+'.'+Word(nums))
scinot_literal = Combine(Word(nums)+oneOf('e- e+',caseless=True)+Word(nums))
num_literal = scinot_literal|hex_literal|float_literal|integer_literal

I am working on the literal identification and am wondering about testing for the start and end of each word. I see in the examples in the O'Reilly text that the start and end are not necessarily checked, as the context provides verification of the literal. Am I thinking about this right? Or should I be making sure each literal is bounded properly?

The reason I ask is because I can get false positives using the above rules by adding chars to the beginning and end of the patterns. I know if I tell it to parse it will fail, so maybe this is a non-issue as it would be caught there.


2012-12-10 15:58:57 - rogersanchez75 - Further DSL and function parsing development

I've been working on building out my DSL with pyparsing and have made excellent progress. My first milestone was to evaluate expressions that contain arithmetic operators, database field references and a set of functions (Avg, Stdev, etc). In addition, I implemented assignment of expressions to variables so as to be able to build up complex expressions in a modular way. So far so good.

I have now hit my next major snag when trying to calculate functions on variables as arguments. Specifically, my database references (which are the building block on which calcs are performed) require specifying a Person as a dimension of the query. I don't know the best way to force re-evaluation of the expressions assigned to these variables when they are contained within a function. Here is a specific example that has problems:

CustomAvg = Avg[Height] + Avg[Weight]
Avg[CustomAvg]

In these scenarios, I have a list of People that I iterate over to calculate the components of CustomAvg. However, when I evaluate Avg[CustomAvg] the value of CustomAvg is coming from my variable lookup dict rather than being evaluated, so effectively I am iterating over a constant value. What is the best way to introduce 'awareness' in my evaluation so that the variables used as arguments within a function are re-evaluated rather than sourced from the lookup table? Here is the streamlined relevant code:

class EvalConstant(object):
    var_ = {}
    def __init__(self, tokens):
        self.value = tokens[0]

    def eval(self):
        v = self.value
        if self.var_.has_key(v):
            return self.var_[v]
        else:
            return float(v)

class EvalDBref(object):
    person_ = None
    def __init__(self, tokens):
        self.value = tokens[0]

    def eval(self):
        v = self.value
        fieldRef = v.split(':')
        source = fieldRef[0]
        field = fieldRef[1]
        rec = db[source].find_one({'Name' : self.person_}, { '_id' : 0, field : 1})
        return rec[field]

class EvalFunction(object):
    pop_ = {}
    def __init__(self, tokens):
        self.func_ = tokens.funcname
        self.field_ = tokens.arg
        self.pop_ = POPULATION

    def eval(self):
        v = self.field_.value
        fieldRef = v.split(':')
        source = fieldRef[0]
        field = fieldRef[1]
        val = self.field_.eval()

        if self.func_ == 'ZS':
            # If using zscore then fetch the field aggregates from stats
            rec = db['Stats'].find_one({'_id' : field})
            stdev = rec['value']['stddev']
            avg = rec['value']['avg']
            return (val - avg)/stdev

        elif self.func_ == 'Ptile':
            recs = list(db[source].find({'Name' : { '$in' : self.pop_}},{'_id' : 0, field : 1}))
            recs = [r[field] for r in recs]
            return percentileofscore(recs, val)

def assign_var(tokens):
    ev = tokens.varvalue.eval()
    EvalConstant.var_[tokens.varname] = ev

#--------------------
expr = Forward()
chars = Word(alphanums + '_-/')
integer = Word(nums)
real = Combine(Word(nums) + '.' + Word(nums))
var = Word(alphas)

assign = var('varname') + '=' + expr('varvalue')
assign.setParseAction(assign_var)

dbRef = Combine(chars + OneOrMore(':') + chars)
dbRef.setParseAction(EvalDBref)

funcNames = Keyword('ZS') | Keyword('Avg') | Keyword('Stdev')

functionCall = funcNames('funcname') + '[' + expr('arg') + ']'
functionCall.setParseAction(EvalFunction)

operand = dbRef | functionCall | (real | integer | var).setParseAction(EvalConstant)

signop = oneOf('+ -')
multop = oneOf('* /')
plusop = oneOf('+ -')

expr << operatorPrecedence(operand,
    [
     (signop, 1, opAssoc.RIGHT, EvalSignOp),
     (multop, 2, opAssoc.LEFT, EvalMultOp),
     (plusop, 2, opAssoc.LEFT, EvalAddOp),
    ])

EvalDBref.person_ = 'John Smith'

ret = (assign | expr).parseString(line)[0]
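
[Editor's note: one untested possibility, staying within the class structure above: have assign_var store the parsed expression object itself rather than its computed value, so that each lookup re-evaluates it under the current person/population context.]

def assign_var(tokens):
    # keep the unevaluated expression tree instead of calling eval() now
    EvalConstant.var_[tokens.varname] = tokens.varvalue

class EvalConstant(object):
    var_ = {}
    def __init__(self, tokens):
        self.value = tokens[0]
    def eval(self):
        v = self.value
        if v in self.var_:
            # re-evaluate the stored expression with the current context
            return self.var_[v].eval()
        return float(v)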

2012-12-12 00:30:30 - Demolishun - Trouble with moving beyond basic pattern matching.

Hey ptmcg, I am having some issues understanding where to go from simple pattern matching to fully parsing a large grammar like C. Can you recommend some resources that discuss the theory behind construction of grammar processors? I have the 'Dragon Book' on order, which has theory and examples of using lex and yacc; hopefully that will help. I actually have a well-defined grammar for the language I want to parse, but it is in lex and yacc source files, and I am having trouble understanding how to apply the rules to work with pyparsing. I am using pyparsing right now to parse messages coming from a VectorNav NV-100 and it works great for that. I like it because it cleans up the code to verify the messages in a very structured and easily documented way. I did find a bunch of links on lex and yacc to study, but I get the feeling your parser uses more modern approaches. I am just not understanding where to find references on the techniques you are using.

Thanks, Frank

2012-12-13 20:19:48 - ptmcg

Frank -

I am certainly happy that pyparsing is helping you be productive in your VectorNav application. You can see an example of a more extensive language parser by looking at the Verilog parser ().

As rich a library as pyparsing is, it does cut some corners and makes guesses in some cases of ambiguity, something that you really don't want a language parser doing. Pyparsing's cavalier approach to whitespace, for example, while appropriate in many everyday cases, is not really as rigorous as a language parser ought to be.

C presents its own special complexities in parsing, because of some syntax 'specialness', like leading '*'s for dereferencing pointers, typedefs - and don't even get me started on the macro preprocessor! If you want to parse a full language, try something like Pascal, whose syntax was designed up front to be parseable in a single pass. In contrast, most C compilers take several passes over the input source to do their parsing.

So if you are scouting about for languages to write a parser for using pyparsing, stick with those that can be parsed in a single pass. A syntax that can be parsed in 1 pass is more likely to be tractable using pyparsing's plodding left-to-right processing, with a minimum of lookahead and backtracking.

For more information on pyparsing's style of parsing, look for PEGs (parsing expression grammars).

-- Paul


2012-12-22 20:41:51 - rogersanchez75 - Control flow

I have developed a DSL for manipulating database content using pyparsing. I am now considering trying to add control flow (if-then-else in particular) to my language, which currently only supports direct expression evaluation with arithmetic operators and basic calc functions.

I don't really know where to start and whether this is even a good idea to try to do with an external DSL and pyparsing. Anyone have experience in this or advice?

2012-12-23 02:44:39 - ptmcg

A while back I wrote an article for Python Magazine describing a Brainfuck compiler/interpreter written with pyparsing. The main concept was to compile the code into a corresponding structure of executable objects (similar to what is done in the online example SimpleBool.py, but with objects for IfStatement, AssignStatement, etc.).

Design a Virtual Machine in which these objects can be run, possibly something as simple as a dict of variable values. Implement for each object class a method execute(vm), then associate each class with the corresponding statement expression in your parser as a parse action. When you have parsed successfully, you will get a ParseResults containing executable objects - create an empty VM and then call object.execute(vm) for each object you have parsed. For control flow (like if-then-else or for/while loops), implement the control flow in that statement's execute function.

What was fun about making this a 'compiler' was that the parsed code could be pickled and saved to a file. This could then be unpickled and run directly, without having to reparse the original DSL source.

HTH, -- Paul
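
[Editor's note: a bare-bones sketch of the compile-to-executable-objects pattern described above. Class and method names here are illustrative, not taken from the article:]

class AssignStatement(object):
    def __init__(self, tokens):
        self.name = tokens.varname
        self.rhs = tokens.varvalue
    def execute(self, vm):
        vm[self.name] = self.rhs.eval(vm)

class IfStatement(object):
    def __init__(self, tokens):
        self.cond = tokens.condition
        self.then_body = tokens.then_body
        self.else_body = tokens.else_body
    def execute(self, vm):
        # control flow lives inside execute()
        branch = self.then_body if self.cond.eval(vm) else self.else_body
        for stmt in branch:
            stmt.execute(vm)

# after a successful parse, 'program' is a ParseResults of executable objects
vm = {}                      # the 'virtual machine' is just a dict of variables
for stmt in program:
    stmt.execute(vm)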

2012-12-23 07:56:48 - rogersanchez75

Cool. I've seen that article referenced in a few places and tried to find the back issue online, to no avail. Do you have any idea where it might be available?


2012-12-24 11:59:11 - catdude - Clarification regarding building a search string

This is my first attempt to use pyparsing for serious work. I have a script that currently uses regular expressions to parse log files, looking for specific types of log lines. In my current testing, I'm being given a message containing 'Accepted publickey for nagios from 10.70.50.101 port 43382 ssh2'. I had been parsing this with the regex '(Accepted publickey for ([a-zA-Z0-9\.]+) from [0-9\.]+).*'.

If I use :

a = 'Accepted publickey for'
user = Word(alphanums + '-')
ip = Word(nums+'.')
string = a + user + Literal('from') + ip + Regex('.*')

I get:

string.searchString(msg)
([(['Accepted publickey for', 'nagios', 'from', '10.70.50.101', 'port 43382 ssh2'], {})], {})

But if I do:

string2 = Literal('Accepted publickkey for')+ Word(alphanums + '_') + Literal('from') + Word(nums + '.') + Regex('.*')

I get :

string2.searchString(msg)
([], {})

So my question is, why do I get a parsed output when I build the pieces of the match string piece by piece then combine them, but not when I build the match string in place?

Or am I just too tired and not seeing a typo?

2012-12-24 12:21:02 - ptmcg

You have a typo in string2, 'publickkey' instead of 'publickey'. Fix the typo and it should work just fine.

Welcome to pyparsing!

-- Paul
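
[Editor's note: for hunting down this kind of mismatch, ParserElement.setDebug() can help - it logs every match attempt for the expression it is called on, so a literal that never matches stands out quickly:]

# enable match logging on the suspect piece of the grammar
keyphrase = Literal('Accepted publickkey for').setDebug()
string2 = keyphrase + Word(alphanums + '_') + Literal('from') + Word(nums + '.') + Regex('.*')
string2.searchString(msg)
# the debug trace shows the Literal failing at every position - the typo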

2012-12-24 12:22:33 - catdude

I was afraid it was something simple like that. I looked at my code a bunch of times and never saw that. Thanks!