Python re Module
Python is a high-level, open source scripting language. Python's built-in "re" module provides excellent support for regular expressions, with a modern and complete regex flavor. The only significant features missing from Python's regex syntax are atomic grouping, possessive quantifiers and Unicode properties. The first thing to do is to import the module into your script with the statement import re.
Python strings also use the backslash to escape characters. The regex \\, which matches a single backslash, and the regex \w, which matches a word character, are therefore written as the Python strings "\\\\" and "\\w". Confusing indeed. Fortunately, Python also has "raw strings", which do not apply special treatment to backslashes. As raw strings, these two regexes become r"\\" and r"\w". The only limitation of raw strings is that the delimiter you're using for the string must not appear in the regular expression, as raw strings do not offer a means to escape it. You can still use \n and \t in raw strings: though raw strings do not support these escapes, the regular expression engine does, so the end result is the same.
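As a minimal sketch of the escaping rules above, the following compares the ordinary and raw spellings of the same two regexes. The sample path and the variable names are made up for illustration.

```python
import re

# The regex \\ (one literal backslash), spelled as an ordinary and as a raw string.
assert "\\\\" == r"\\"
# The regex \w (one word character), spelled both ways.
assert "\\w" == r"\w"

backslash = re.compile(r"\\")
word_char = re.compile(r"\w")

path = "C:\\temp"  # hypothetical sample: the seven characters C:\temp
backslash_count = len(backslash.findall(path))
first_word_char = word_char.search(path).group()

# r"\n" holds two characters (backslash and n); the regex engine
# itself interprets that pair as a newline.
raw_newline_length = len(r"\n")
newline_found = re.search(r"\n", "line1\nline2") is not None
```

Both spellings compile to the same regex; the raw form simply keeps the pattern readable.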
Unicode
Python's re module does not support any Unicode regular expression tokens. However, Python Unicode strings do support the \uFFFF notation, and Python's re module can use Unicode strings. So you could pass the Unicode string u"\u00E0\\d" to the re module to match à followed by a digit. Note that the backslash for \d was escaped, while the one for \u was not. That's because \d is a regular expression token, and a regular expression backslash needs to be escaped. \u00E0 is a Python string token that shouldn't be escaped. The string u"\u00E0\\d" is seen by the regular expression engine as à\d. If you did put another backslash in front of the \u, the regex engine would see \u00E0\d. The regex engine doesn't support the \u token. It will try to match the literal text u00E0 followed by a digit instead. To avoid this confusion, just use Unicode raw strings like ur"\u00E0\d". Then backslashes don't need to be escaped. Python does interpret Unicode escapes in raw strings.
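The first case described above can be sketched as follows. The example is written for Python 3, where every string literal is already a Unicode string, so the u prefix is left out. The subject string is invented for illustration.

```python
import re

# "\u00E0" is resolved by Python into the single character à, while
# "\\d" reaches the regex engine as the two-character token \d.
pattern = "\u00E0\\d"
pattern_length = len(pattern)  # à, backslash, d

# Hypothetical subject: the word "voilà" followed by the digit 7.
match = re.search(pattern, "voil\u00E07")
matched_text = match.group()
```

The regex engine never sees the \u escape here; it receives the character à directly.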
Splitting Strings
re.split(regex, subject) returns an array of strings. The array contains the parts of subject between all the regex matches in the subject. Adjacent regex matches will cause empty strings to appear in the array. The regex matches themselves are not included in the array. If the regex contains capturing groups, then the text matched by the capturing groups is included in the array. The capturing groups are inserted between the substrings that appeared to the left and right of the regex match. If you don't want the capturing groups in the array, convert them into non-capturing groups. The re.split() function does not offer an option to suppress capturing groups. You can specify an optional third parameter to limit the number of times the subject string is split. Note that this limit controls the number of splits, not the number of strings that will end up in the array. The unsplit remainder of the subject is added as the final string to the array. If there are no capturing groups, the array will contain limit+1 items.
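The splitting rules can be illustrated with a short sketch. The subject string is made up, and the limit is passed through the maxsplit keyword that re.split() uses for its third parameter.

```python
import re

subject = "a,b,,c"  # hypothetical input with two adjacent delimiters

# Adjacent matches produce an empty string between them.
plain = re.split(",", subject)

# A capturing group inserts each matched delimiter into the result.
captured = re.split("(,)", subject)

# A non-capturing group leaves the delimiters out again.
non_capturing = re.split("(?:,)", subject)

# The limit counts splits, so two splits yield three items;
# the unsplit remainder becomes the last item.
limited = re.split(",", subject, maxsplit=2)
```

Here plain is ['a', 'b', '', 'c'], captured is ['a', ',', 'b', ',', '', ',', 'c'], and limited is ['a', 'b', ',c'].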
Match Details
re.search() and re.match() return a Match object, while re.finditer() returns an iterator that yields one Match object for each match. This object holds lots of useful information about the regex match. I will use m to signify a Match object in the discussion below. m.group() returns the part of the string matched by the entire regular expression. m.start() returns the offset in the string of the start of the match. m.end() returns the offset of the character beyond the match. m.span() returns a 2-tuple of m.start() and m.end(). You can use m.start() and m.end() to slice the subject string: subject[m.start():m.end()]. If you want the results of a capturing group rather than the overall regex match, specify the name or number of the group as a parameter. m.group(3) returns the text matched by the third capturing group. m.group('groupname') returns the text matched by a named group 'groupname'. If the group did not participate in the overall match, m.group() returns None, while m.start() and m.end() return -1. If you want to do a regular expression based search-and-replace without using re.sub(), call m.expand(replacement) to compute the replacement text. The function returns the replacement string with backreferences etc. substituted.
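A minimal sketch of the Match object methods discussed in this section follows. The subject string, the regex and the group name year are invented for illustration, and the results noted in the comments are those of current Python 3.

```python
import re

subject = "Order 2023-11 shipped"  # hypothetical subject string
# Group 1 is named "year", group 2 is the month, and the optional
# group 3 (a day) does not participate in this match.
m = re.search(r"(?P<year>\d{4})-(\d{2})(-\d{2})?", subject)

whole = m.group()                     # text matched by the entire regex
offsets = m.span()                    # 2-tuple of m.start() and m.end()
sliced = subject[m.start():m.end()]   # same text, obtained by slicing

year = m.group("year")                # lookup by group name
month = m.group(2)                    # lookup by group number

day = m.group(3)                      # None: the group did not participate
day_start = m.start(3)                # -1 for a non-participating group

# Compute replacement text from this match without calling re.sub().
swapped = m.expand(r"\2/\g<year>")
```

With this subject, whole is "2023-11", offsets is (6, 13), and swapped is "11/2023".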