Sundeep Agarwal
Understanding Python re(gex)?
Preface
Prerequisites
Conventions
Acknowledgements
Feedback and Errata
Author info
License
Book version
Why is it needed?
How this book is organized
re introduction
re module documentation
re.search()
re.search() in conditional expressions
re.sub()
Compiling regular expressions
bytes
re(gex)? playground
Cheatsheet and Summary
Exercises
Anchors
String anchors
re.fullmatch()
Line anchors
Word anchors
Cheatsheet and Summary
Exercises
Alternation and Grouping
Alternation
Grouping
Precedence rules
Cheatsheet and Summary
Exercises
Escaping metacharacters
Escaping with backslash
re.escape()
Escape sequences
Cheatsheet and Summary
Exercises
Dot metacharacter and Quantifiers
Dot metacharacter
re.split()
Greedy quantifiers
Conditional AND
What does greedy mean?
Non-greedy quantifiers
Possessive quantifiers
Catastrophic Backtracking
Cheatsheet and Summary
Exercises
Interlude: Tools for debugging and visualization
regex101
debuggex
re(gex)? playground
re(gex)? exercises
regexcrossword
Summary
Working with matched portions
re.Match object
Assignment expressions
Using functions in the replacement section
Using dict in the replacement section
re.findall()
re.finditer()
re.split() with capture groups
re.subn()
Cheatsheet and Summary
Exercises
Character class
Custom character sets
Range of characters
Negating character sets
Matching metacharacters literally
Escape sequence sets
Numeric ranges
Cheatsheet and Summary
Exercises
Groupings and backreferences
Backreference
Non-capturing groups
Named capture groups
Atomic grouping
Conditional groups
Match.expand()
Cheatsheet and Summary
Exercises
Interlude: Common tasks
CommonRegex
Summary
Lookarounds
Conditional expressions
Negative lookarounds
Positive lookarounds
Capture groups inside positive lookarounds
Conditional AND with lookarounds
Variable length lookbehind
Negated groups
Cheatsheet and Summary
Exercises
Flags
re.IGNORECASE
re.DOTALL
re.MULTILINE
re.VERBOSE
Inline comments
Inline flags
Cheatsheet and Summary
Exercises
Unicode
re.ASCII
Codepoints and Unicode escapes
\N escape sequence
Cheatsheet and Summary
Exercises
regex module
Subexpression calls
Set the start of matching portion with \K
Variable length lookbehind
\G anchor
Recursive matching
Named character sets
Set operations
Unicode character sets
Skipping matches
\m and \M word anchors
Overlapped matches
regex.REVERSE flag
\X vs dot metacharacter
Cheatsheet and Summary
Exercises
Gotchas
Escape sequences
Line anchors with \n as the last character
Zero-length matches
Capture group with quantifiers
Converting re to regex module
Optional arguments syntax
Summary
Further Reading
Preface
Scripting and automation tasks often need to extract
particular portions of text from input data or modify them
from one format to another. This book will help you
understand Regular Expressions, a mini-programming
language for all sorts of text processing needs.
Prerequisites
You should be familiar with programming basics. You
should also have a working knowledge of Python syntax
and features like string formats, string methods and list
comprehensions.
Conventions
The examples presented here have been tested with
Python version 3.13.1 and include features not
available in earlier versions.
Code snippets shown are copy pasted from the Python
REPL shell and modified for presentation purposes.
Some commands are preceded by comments to provide
context and explanations. Blank lines have been added
to improve readability. Error messages are shortened.
import statements are skipped after initial use. And so
on.
Unless otherwise noted, all examples and explanations
are meant for ASCII characters.
External links are provided throughout the book for you
to explore certain topics in more depth.
The py_regular_expressions repo has all the code
snippets and exercises used in the book. A solutions
file is also provided. If you are not familiar with the git
command, click the Code button on the webpage to get
the files.
Acknowledgements
Python documentation — manuals and tutorials
/r/learnpython/, /r/Python/ and /r/regex/ — helpful
forums for beginners and experienced programmers
alike
stackoverflow — for getting answers to pertinent
questions on Python and regular expressions
tex.stackexchange — for help on pandoc and tex
related questions
canva — cover image
Warning and Info icons by Amada44 under public
domain
oxipng, pngquant and svgcleaner — optimizing images
David Cortesi for helpful feedback on both the technical
content and grammar issues
Kye and gmovchan for spotting a typo
Hugh's email exchanges helped me significantly to
improve the presentation of concepts and exercises
Christopher Patti for reviewing the book, providing
feedback and brightening the day with kind words
Users 73tada, DrBobHope, nlomb and others for
feedback in this reddit thread
Feedback and Errata
Issue Manager: https://github.com/learnbyexample/py_regular_expressions/issues
E-mail: [email protected]
Twitter: https://twitter.com/learn_byexample
Author info
Sundeep Agarwal is a lazy being who prefers to work just
enough to support his modest lifestyle. He accumulated
vast wealth working as a Design Engineer at Analog
Devices and retired from the corporate world at the ripe
age of twenty-eight. Unfortunately, he squandered his
savings within a few years and had to scramble trying to
earn a living. Against all odds, selling programming ebooks
saved his lazy self from having to look for a job again. He
can now afford all the fantasy ebooks he wants to read and
spends an unhealthy amount of time browsing the internet.
License
This work is licensed under a Creative Commons
Attribution-NonCommercial-ShareAlike 4.0 International
License.
re introduction
Anchors
Alternation and Grouping
Escaping metacharacters
Dot metacharacter and Quantifiers
Interlude: Tools for debugging and visualization
Working with matched portions
Character class
Groupings and backreferences
Interlude: Common tasks
Lookarounds
Flags
Unicode
regex module
Gotchas
Further Reading
re module documentation
It is always a good idea to know where to find the
documentation. The default offering for Python regular
expressions is the re standard library module. Visit
docs.python: re for information on available methods,
syntax, features, examples and more. Here's a quote:
re.search()
Normally you'd use the in operator to test whether a string
is part of another string or not. For regular expressions, use
the re.search() function whose argument list is shown
below.
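re.search(pattern, string, flags=0)

For example (an illustrative snippet, not necessarily the book's own):

>>> sentence = 'This is a sample string'
# check if 'ring' is present anywhere in the sentence
>>> bool(re.search(r'ring', sentence))
True
>>> bool(re.search(r'words', sentence))
False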
re.sub()
For normal search and replace, you'd use the
str.replace() method. For regular expressions, use the
re.sub() function, whose argument list is shown below.
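re.sub(pattern, repl, string, count=0, flags=0)

For example (an illustrative snippet, not necessarily the book's own):

>>> greeting = 'Have a nice weekend'
# replace all occurrences of 'e' with 'E'
>>> re.sub(r'e', 'E', greeting)
'HavE a nicE wEEkEnd'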
re.compile(pattern, flags=0)
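For example (illustrative):

>>> pet = re.compile(r'dog')
>>> type(pet)
<class 're.Pattern'>
# compiled patterns provide methods like search() and sub()
>>> bool(pet.search('They bought a dog'))
True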
bytes
To work with the bytes data type, the RE must be specified
as bytes as well. Similar to the str RE, use raw format to
construct a bytes RE.
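For example (illustrative):

>>> re.sub(rb'is', rb'X', b'This is a sample')
b'ThX X a sample'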
re(gex)? playground
To make it easier to experiment, I wrote an interactive TUI
app. See PyRegexPlayground repo for installation
instructions and usage guide. A sample screenshot is shown
below:
Cheatsheet and Summary
Note                                     Description
re.search(pattern, string, flags=0)      check if the pattern is present anywhere in the input string; returns a re.Match object if found, None otherwise
re.compile(pattern, flags=0)             compile a pattern for reuse; returns a re.Pattern object
re.IGNORECASE or re.I                    flag to ignore case while matching
Exercises
4) For the given list, filter all elements that do not contain
e.
7) For the given input string, display all lines not containing
start irrespective of case.
9) For the given list, filter all elements that contain both e
and n.
10) For the given string, replace 0xA0 with 0x7F and 0xC0
with 0x1F.
String anchors
This restriction is about qualifying a RE to match only at
the start or the end of an input string. These provide
functionality similar to the str methods startswith() and
endswith(). First up is the escape sequence \A, which
restricts the matching to the start of the string.
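For example (illustrative):

>>> bool(re.search(r'\Acat', 'cater'))
True
# 'cat' is present, but not at the start of the string
>>> bool(re.search(r'\Acat', 'concatenation'))
False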
# appending text
>>> re.sub(r'\Z', 'er', 'cat')
'cater'
>>> re.sub(r'\Z', 'er', 'hack')
'hacker'
re.fullmatch()
Combining both the start and end string anchors, you can
restrict the matching to the whole string. The effect is
similar to comparing strings using the == operator.
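The word_pat object isn't defined in this excerpt; it could be, for instance:

>>> word_pat = re.compile(r'cat', flags=re.I)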
>>> bool(word_pat.fullmatch('Cat'))
True
>>> bool(word_pat.fullmatch('Scatter'))
False
Line anchors
A string input may contain single or multiple lines, with
the newline character \n as the line separator. There are
two line anchors: the ^ metacharacter matches the start of
a line and $ matches the end of a line. If there are no
newline characters in the input string, these will behave
exactly the same as \A and \Z respectively.
>>> pets = 'cat and dog'
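For example (an illustrative continuation):

>>> bool(re.search(r'^cat', pets))
True
>>> bool(re.search(r'dog$', pets))
True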
Just like string anchors, you can use the line anchors by
themselves as a pattern.
Word anchors
The third type of restriction is word anchors. Alphabets
(irrespective of case), digits and the underscore character
qualify as word characters. You might wonder why digits
and underscores are included as well. Why not just alphabets?
This comes from variable and function naming conventions
— typically alphabets, digits and underscores are allowed.
So, the definition is more oriented to programming
languages than natural ones.
The escape sequence \b denotes a word boundary. This
works for both the start and end of word anchoring. Start
of word means either the character prior to the word is a
non-word character or there is no character (start of
string). Similarly, end of word means the character after
the word is a non-word character or no character (end of
string). This implies that you cannot have word boundary
\b without a word character.
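For example (illustrative):

>>> words = 'par spar apparent spare part'
# replace 'par' only if it is a whole word
>>> re.sub(r'\bpar\b', 'X', words)
'X spar apparent spare part'
# replace 'par' only at the start of a word
>>> re.sub(r'\bpar', 'X', words)
'X spar apparent spare Xt'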
Note                                       Description
re.fullmatch(pattern, string, flags=0)     ensure that the pattern matches the whole input string
re.MULTILINE or re.M                       flag to treat the input as a multiline string
Exercises
1) Check if the given strings start with be.
>>> bool(pat.search(line1))
True
>>> bool(pat.search(line2))
False
>>> bool(pat.search(line3))
True
>>> bool(pat.search(line4))
False
2) For the given input string, change only the whole word
red to brown.
>>> words = 'bred red spread credible red.'
3) For the given input list, filter all elements that contain
42 surrounded by word characters.
4) For the given input list, filter all elements that start with
den or end with ly.
8) For the given input list, replace hand with X for all
elements that start with hand followed by at least one word
character.
9) For the given input list, filter all elements starting with
h. Additionally, replace e with X for these filtered elements.
Alternation
A conditional expression combined with logical OR
evaluates to True if any of the conditions is satisfied.
Similarly, in regular expressions, you can use the |
metacharacter to combine multiple patterns to indicate
logical OR. The matching will succeed if any of the
alternate patterns is found in the input string. These
alternatives have the full power of a regular expression, for
example they can have their own independent anchors.
Here are some examples.
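For example (illustrative):

>>> pet = 'I like cats'
# check if 'cat' or 'dog' is present anywhere in the string
>>> bool(re.search(r'cat|dog', pet))
True
>>> bool(re.search(r'cat|dog', 'I like parrots'))
False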
You might infer from the above examples that there can be
cases where many alternations are required. The join
string method can be used to build the alternation list
automatically from an iterable of strings.
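For example (illustrative):

>>> '|'.join(['car', 'jeep'])
'car|jeep'
>>> terms = ['cat', 'dog', 'fox']
>>> re.sub('|'.join(terms), 'mammal', 'cat and dog and parrot')
'mammal and mammal and parrot'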
# without grouping
>>> re.sub(r'reform|rest', 'X', 'red reform read arrest')
'red X read arX'
# with grouping
>>> re.sub(r're(form|st)', 'X', 'red reform read arrest')
'red X read arX'

# without grouping
>>> re.sub(r'\bpar\b|\bpart\b', 'X', 'par spare part party')
'X spare X party'
# taking out common anchors
>>> re.sub(r'\b(par|part)\b', 'X', 'par spare part party')
'X spare X party'
# taking out common characters as well
# you'll later learn a better technique instead of using an empty alternation
Precedence rules
There are tricky situations when using alternation. There is
no ambiguity if it is used to get a boolean result by testing
a match against a string input. However, for cases like
string replacement, it depends on a few factors. Say, you
want to replace either are or spared — which one should
get precedence? The bigger word spared or the substring
are inside it or based on something else?
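Here's an illustrative example (not the book's own):

>>> words = 'lion elephant are rope not'
# the alternative matching earliest in the input string wins
>>> re.search(r'on|ant', words)[0]
'on'
>>> re.search(r'ant|on', words)[0]
'on'
# for the same starting location, the alternative listed earlier in the RE wins
>>> re.search(r'year|years', 'best years')[0]
'year'
>>> re.search(r'years|year', 'best years')[0]
'years'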
Note                                                 Description
|                                                    multiple RE combined as conditional OR; tie-breaker is left-to-right if the patterns have the same starting location
'|'.join(iterable)                                   programmatically combine multiple RE
()                                                   group pattern(s)
'|'.join(sorted(iterable, key=len, reverse=True))    sort by length so that longer alternatives get precedence
Exercises
1) For the given list, filter all elements that start with den
or end with ly.
4) For the given strings, replace all matches from the list
words with A.
5) Filter all whole elements from the input list items based
on elements listed in words.
# match ( or ) literally
>>> re.sub(r'\(|\)', '', '(a*b) + c')
'a*b + c'
re.escape()
Okay, what if you have a string variable that must be used
to construct a RE — how to escape all the metacharacters?
Relax, the re.escape() function has got you covered. No
need to manually take care of all the metacharacters or
worry about changes in future versions.
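The pat1 and pat2 objects aren't defined in this excerpt; judging by the outputs below, they could have been constructed along these lines (an assumption):

>>> terms = ['a_42', '(a^b)', '2|3']
>>> pat1 = re.compile('|'.join(re.escape(s) for s in terms))
>>> pat2 = re.compile('|'.join(terms))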
>>> print(pat1.pattern)
a_42|\(a\^b\)|2\|3
>>> print(pat2.pattern)
a_42|(a^b)|2|3
Escape sequences
Certain characters like tab and newline can be expressed
using escape sequences as \t and \n respectively. These
are similar to how they are treated in normal string literals.
However, \b is for word boundaries as seen earlier,
whereas it stands for the backspace character in normal
string literals.
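For example (illustrative):

>>> re.sub(r'\t', ':', 'a\tb\tc')
'a:b:c'
# use \\ to match the backslash character literally
>>> re.sub(r'\\', '/', 'a\\b\\c')
'a/b/c'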
Note Description
\\ to match \ literally
Exercises
1) Transform the given input strings to the expected output
using the same logic on both strings.
>>> str1 = '(9-2)*5+qty/3-(9-2)*7'
>>> str2 = '(qty+4)/2-(9-2)*5+pq/4'
>>> ip = '123\b456'
>>> ip
'123\x08456'
>>> print(ip)
12456
>>> ip = '3-(a^b)+2*(a^b)-(a/b)+3'
>>> eqns = ['(a^b)', '(a/b)', '(a^b)+2']
Dot metacharacter
The dot metacharacter serves as a placeholder to match
any character except the newline character.
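For example (illustrative):

# the dot matches any character other than the newline character
>>> re.sub(r'c.t', 'X', 'cat cot cuts c_t c\nt')
'X X Xs X c\nt'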
re.split()
This chapter will additionally use the re.split() function
to illustrate examples. For normal strings, you'd use the
str.split() method. For regular expressions, use the
re.split() function, whose argument list is shown below.
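re.split(pattern, string, maxsplit=0, flags=0)

For example (illustrative):

>>> re.split(r'-', 'apple-85-mango-70')
['apple', '85', 'mango', '70']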
Greedy quantifiers
Quantifiers have functionality like the string repetition
operator and the range() function. They can be applied to
characters and groupings (and more, as you'll see in later
chapters). Apart from the ability to specify exact quantity
and bounded range, these can also match unbounded
varying quantities. If the input string can satisfy a pattern
with varying quantities in multiple ways, you can choose
among three types of quantifiers to narrow down
possibilities. In this section, greedy type of quantifiers is
covered.
Quantifier    Description
*             match zero or more times
+             match one or more times
?             match zero or one time
{m,n}         match m to n times
{m,}          match at least m times
{,n}          match up to n times (including zero times)
{m}           match exactly m times
Non-greedy quantifiers
As the name implies, these quantifiers will try to match as
minimally as possible. Also known as lazy or reluctant
quantifiers. Appending a ? to greedy quantifiers makes
them non-greedy.
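For example (illustrative):

# greedy: extends up to the last 'a'
>>> re.sub(r't.*a', 'X', 'that is quite a fabricated tale', count=1)
'Xle'
# non-greedy: stops at the first 'a'
>>> re.sub(r't.*?a', 'X', 'that is quite a fabricated tale', count=1)
'Xt is quite a fabricated tale'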
Possessive quantifiers
Before Python 3.11, you had to use alternatives like the
third-party regex module for possessive quantifiers. The
difference between greedy and possessive quantifiers is
that possessive will not backtrack to find a match. In other
words, possessive quantifiers will always consume every
character that matches the pattern on which it is applied.
Syntax wise, you need to append + to greedy quantifiers to
make them possessive (similar to adding ? for the non-greedy
case).
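For example (illustrative; possessive quantifiers need Python 3.11 or later):

>>> bool(re.search(r'b.*at', 'babble bat'))
True
# .*+ consumes everything after the 'b' and won't give anything back,
# so 'at' can never match afterwards
>>> bool(re.search(r'b.*+at', 'babble bat'))
False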
>>> ip = 'fig:mango:pineapple:guava:apples:orange'
Catastrophic Backtracking
Backtracking can become significantly time consuming for
certain corner cases, which is why some regular expression
engines avoid it altogether, at the cost of not supporting
some features like lookarounds. If your application accepts
user defined RE, you might need to protect against such
catastrophic patterns. From wikipedia: ReDoS:
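The quoted definition isn't reproduced in this excerpt. A classic illustration of the problem is a RE with nested quantifiers, where a failing match forces exponential backtracking:

# WARNING: can take a very long time to finish, left commented out on purpose
# re.search(r'(a+)+b', 'a' * 30 + 'c')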
Note                                              Description
non-greedy quantifiers                            append ? to greedy quantifiers; match as minimally as possible
possessive quantifiers                            append + to greedy quantifiers; like greedy, but no backtracking once characters are consumed
re.split(pattern, string, maxsplit=0, flags=0)    split the input string based on the pattern matches
Exercises
2) For the list items, filter all elements starting with hand
and ending immediately with at most one more character
or le.
# wrong output
>>> change.sub('X', words)
'plXk XcomXg tX wXer X cautX sentient'
# expected output
>>> change = re.compile()      ##### add your solution here
>>> change.sub('X', words)
'plX XmX tX wX X cautX sentient'
? is same as
* is same as
+ is same as
10) For the input list words, filter all elements starting with
s and containing e and t in any order.
11) For the input list words, remove all elements having
less than 6 characters.
12) For the input list words, filter all elements starting with
s or t and having a maximum of 6 characters.
13) Can you reason out why this code results in the output
shown? The aim was to remove all <characters> patterns
but not the <> ones. The expected result was 'a 1<> b 2<>
c'.
>>> s1 = 'appleabcabcabcapricot'
>>> s2 = 'bananabcabcabcdelicious'
# wrong output
>>> pat = re.compile(r'(abc)+a')
>>> bool(pat.search(s1))
True
>>> bool(pat.search(s2))
True
# expected output
# 'abc' shouldn't be considered when trying to match
>>> pat = re.compile()      ##### add your solution here
>>> bool(pat.search(s1))
True
>>> bool(pat.search(s2))
False
# wrong output
>>> re.sub(rf'{c}{3,}', c, cast)
'dragon-unicorn--centaur---mage----healer'
# expected output
>>> re.sub(rf'', c, cast)      ##### add your solution here
'dragon-unicorn--centaur-mage-healer'
Interlude: Tools for debugging and visualization
As your RE gets complicated, it can get difficult to debug
when you run into issues. Building your RE step by step
from scratch and testing against input strings will go a long
way in correcting the problem. To aid in such a process, you
could use various online tools.
regex101
regex101 is a popular site to test your RE. You'll have to
first choose the flavor as Python. Then you can add your RE,
input strings, choose flags and an optional replacement
string. Matching portions will be highlighted and
explanations are offered in separate panels.
debuggex
Another useful tool is debuggex which converts your RE to a
railroad diagram, thus providing a visual aid to
understanding the pattern.
re(gex)? playground
As already mentioned in the introduction chapter, I wrote
an interactive TUI app to make it easier to experiment with
regular expressions. See the PyRegexPlayground repo for
installation instructions and a usage guide.
re(gex)? exercises
I wrote another TUI app to help you solve exercises from
this book interactively. See PyRegexExercises repo for
installation steps and app_guide.md for instructions on
using this app.
Summary
This chapter briefly presented tools that can help you with
understanding and interactively solving/debugging regular
expressions. Syntax and features can vary, sometimes
significantly, between various tools and programming
languages. So, ensure that the program you are using
supports the flavor of regular expressions you are using.
Working with matched portions
You have already seen a few features that can match
varying text. In this chapter, you'll learn how to extract and
work with those matching portions. First, the re.Match
object will be discussed in detail. And then you'll learn
about re.findall() and re.finditer() functions to get
all the matches instead of just the first match. You'll also
learn a few tricks like using functions in the replacement
section of re.sub(). And finally, some examples for the
re.subn() function.
re.Match object
The re.search() and re.fullmatch() functions return a
re.Match object from which various details can be
extracted like the matched portion of string, location of the
matched portion, etc. Note that you'll get the details only
for the first match. Working with multiple matches will be
covered later in this chapter. Here are some examples with
re.Match output.
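For example (illustrative):

>>> re.search(r'so+n', 'too soon a song snatch')
<re.Match object; span=(4, 8), match='soon'>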
The details in the output above are for quick reference only.
There are methods and attributes that you can apply on the
re.Match object to get only the exact information you need.
Use the span() method to get the starting index and one
past the ending index of the matching portion.
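The m object used below isn't defined in this excerpt; based on the outputs, it could be, for instance:

>>> m = re.fullmatch(r'aw(.*)me', 'awesome')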
>>> m.span(1)
(2, 5)
>>> m.start()
0
>>> m.end(1)
5
re.findall()
The re.findall() function returns all the matched
portions as a list of strings.
>>> s = 'green:3.14:teal::brown:oh!:blue'
>>> re.findall(r':.*:', s)
[':3.14:teal::brown:oh!:']
>>> re.findall(r':.*?:', s)
[':3.14:', '::', ':oh!:']
>>> re.findall(r':.*+:', s)
[]
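When capture groups are used, re.findall() returns only the captured portions: a list of strings for a single group and a list of tuples for multiple groups. For example (illustrative):

>>> re.findall(r'(\d{4})-\d{2}', '2024-04,1986-03')
['2024', '1986']
>>> re.findall(r'(\d{4})-(\d{2})', '2024-04,1986-03')
[('2024', '04'), ('1986', '03')]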
For both cases, any pattern outside the capture groups will
not be represented in the output. Also, you'll get an empty
string if a particular capture group didn't match any
character.
re.finditer()
You can use the re.finditer() function to get an iterator
object with each element as re.Match objects for the
matched portions.
re.finditer(pattern, string, flags=0)
Here's an example:
>>> d = '2023/04/25,1986/Mar/02,77/12/31'
>>> m_iter = re.finditer(r'(.*?),', d)
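The iterator could then be consumed like this (an illustrative continuation, as the original follow-up isn't part of this excerpt):

>>> [m[1] for m in m_iter]
['2023/04/25', '1986/Mar/02']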
Note                                      Description
re.findall(pattern, string, flags=0)      returns all the matched portions as a list of strings (or tuples if capture groups are used)
re.finditer(pattern, string, flags=0)     returns an iterator with a re.Match object for each matched portion
Exercises
1) For the given strings, extract the matching portion from
the first is to the last t.
>>> s1 = 'first-3.14'
>>> s2 = 'next-123'
>>> pat.findall(row1)
[('-2', '5'), ('4', '+3'), ('+42', '-53'), ('4356
>>> pat.findall(row2)
[('1.32', '-3.14'), ('634', '5.63'), ('63.3e3', '
>>> ip = '42:no-output;1000:car-tr:u-ck;SQEX49801
12) For the given list of strings, change the elements into a
tuple of original element and the number of times t occurs
in that element.
>>> ip = 'TWXA42:JWPA:NTED01:'
# all digits
>>> re.findall(r'[0-9]+', 'Sample123string42with777numbers')
['123', '42', '777']
# all non-digits
>>> re.findall(r'[^0-9]+', 'Sample123string42with777numbers')
['Sample', 'string', 'with', 'numbers']
Numeric ranges
Character classes can also be used to construct numeric
ranges. However, it is easy to miss corner cases and some
ranges are complicated to design.
# numbers between 10 and 29
>>> re.findall(r'\b[12]\d\b', '23 154 12 26 98234')
['23', '12', '26']
Exercises
1) For the list items, filter all elements starting with hand
and ending immediately with s or y or le.
7) For the list words, filter all elements not starting with e
or p or u.
>>> pat.split(str1)
['lion', 'Ink', 'onion', 'Nice']
>>> pat.split(str2)
['**', 'star', '**']
>>> print('known\nmood\nknow\npony\ninns')
known
mood
know
pony
inns
13) For the given list, filter all elements containing any
number sequence greater than 624.
>>> ip.split()
['so', 'pole', 'lit', 'in', 'to']
##### add your solution here
['so', 'pole', 'lit', 'in', 'to']
Backreference
Backreferences are like variables in a programming
language. You have already seen how to use a re.Match
object to refer to the text captured by groups.
Backreferences provide the same functionality, with the
advantage that these can be directly used in RE definition
as well as the replacement section without having to invoke
re.Match objects. Another advantage is that you can apply
quantifiers to backreferences.
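For example (illustrative):

# \1 in the replacement section refers to the text captured by group 1
>>> re.sub(r'\[(\d+)\]', r'\1', '[52] apples and [31] mangoes')
'52 apples and 31 mangoes'
# \1 in the RE definition refers to the text already matched by group 1
>>> re.sub(r'\b(\w+) \1\b', r'\1', 'hello hello world world')
'hello world'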
Non-capturing groups
Grouping has many uses like applying quantifiers on a RE
portion, creating terse RE by factoring common portions
and so on. It also affects the behavior of functions like
re.findall() and re.split() as seen in the Working with
matched portions chapter.
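For example (illustrative):

>>> words = 'cost akin more east run against'
# with a capture group, re.findall() returns only the captured portions
>>> re.findall(r'\b\w*(st|in)\b', words)
['st', 'in', 'st', 'st']
# with a non-capturing group, the entire matched portions are returned
>>> re.findall(r'\b\w*(?:st|in)\b', words)
['cost', 'akin', 'east', 'against']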
# single match
>>> details = '2018-10-25,car,2346'
>>> re.search(r'(?P<date>[^,]+),(?P<product>[^,]+)', details).groupdict()
{'date': '2018-10-25', 'product': 'car'}

# multiple matches
>>> s = 'good,bad 42,24'
>>> [m.groupdict() for m in re.finditer(r'(?P<fw>\w+),(?P<sw>\w+)', s)]
[{'fw': 'good', 'sw': 'bad'}, {'fw': '42', 'sw': '24'}]
Atomic grouping
(?>pat) is an atomic group, where pat is the pattern you
want to safeguard from further backtracking. You can think
of it as a special group that is isolated from the other parts
of the regular expression.
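For example (illustrative; atomic grouping needs Python 3.11 or later):

>>> bool(re.search(r'fo.*ot', 'foot'))
True
# fo.* consumes the whole string and the atomic group prevents backtracking,
# so 'ot' can never match afterwards
>>> bool(re.search(r'(?>fo.*)ot', 'foot'))
False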
>>> ip = 'fig::mango::pineapple::guava::apples::o
Conditional groups
This special grouping allows you to add a condition that
depends on whether a capture group succeeded in
matching. You can also add an optional else condition. The
syntax as per the docs is shown below.
(?(id/name)yes-pattern|no-pattern)
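For example (illustrative):

>>> words = ['"hi"', 'bye', 'bad"', '"good"', '42', '"3']
# if group 1 (the optional double quote) matched, require a closing quote too
>>> pat = re.compile(r'(")?\w+(?(1)")')
>>> [w for w in words if pat.fullmatch(w)]
['"hi"', 'bye', '"good"', '42']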
Match.expand()
The expand() method on a re.Match object accepts syntax
similar to the replacement section of the re.sub()
function. The difference is that the expand() method
returns only the string after backreference expansion,
instead of the entire input string with the modified content.
# re.sub vs Match.expand
>>> re.sub(r'w(.*)m', r'[\1]', 'awesome')
'a[eso]e'
>>> re.search(r'w(.*)m', 'awesome').expand(r'[\1]')
'[eso]'
Note                     Description
(?P<name>pat)            named capture group; refer to its text as (?P=name) in the RE definition
(?(id/name)yes|no)       conditional group; match yes-pattern if the backreferenced group succeeded, else match no-pattern
Exercises
1) Replace the space character that occurs after a word
ending with a or r with a newline character.
3) Replace all whole words with X that start and end with
the same word character (irrespective of case). Single
character word should get replaced with X too, as it
satisfies the stated condition.
>>> ip = 'firecatlioncatcatcatbearcatcatparrot'
11) For the given input string, find all occurrences of digit
sequences with at least one repeating sequence. For
example, 232323 and 897897. If the repeats end
prematurely, for example 12121, it should not be matched.
>>> ip = '( S:12 E:5 S:4 and E:123 ok S:100 & E:1
# wrong output
>>> re.findall(r'S:\d+.*?E:\d{2,}', ip)
['S:12 E:5 S:4 and E:123', 'S:100 & E:10', 'S:1 -
# expected output
##### add your solution here
['S:4 and E:123', 'S:100 & E:10', 'S:42 E:43']
Interlude: Common tasks
Tasks like matching phone numbers, ip addresses, dates,
etc are so common that you can often find them collected
as a library. This chapter shows some examples for the
CommonRegex module. The re module documentation also
has a section on tasks like docs.python: tokenizer. See also
Awesome Regex: Collections.
CommonRegex
You can either install commonregex as a module or go
through commonregex.py and choose the regular
expression you need. There are several ways to use the
patterns, see CommonRegex: Usage for details. Here's an
example for matching ip addresses:
# wrong matches
>>> ip.findall(data)
['23.14.2.4', '255.21.255.22', '67.12.2.1']
# corrected usage
>>> [e for e in data.split() if ip.fullmatch(e)]
['255.21.255.22']
Summary
Some patterns are quite complex and not easy to build and
validate from scratch. Libraries like CommonRegex are
helpful to reduce your time and effort needed for commonly
known tasks. However, you do need to test the solution for
your use cases. See also stackoverflow: validating email
addresses.
Lookarounds
You've already seen how to create custom character classes
and various avatars of special groupings. In this chapter
you'll learn more groupings, known as lookarounds, that
help to create custom anchors and add conditions within
RE definition. These assertions are also known as zero-
width patterns because they add restrictions similar to
anchors and are not part of the matched portions. Also, you
will learn how to negate a grouping similar to negated
character sets.
Conditional expressions
Before you get used to lookarounds too much, it is good to
remember that Python is a programming language. You
have control structures and you can combine multiple
conditions using logical operators, functions like all(),
any(), etc. Also, do not forget that re is only one of the
tools available for text processing.
Negative lookarounds
Lookaround assertions can be added in two ways —
lookbehind and lookahead. Each of these can be a
positive or a negative assertion. Syntax wise, lookbehind
has an extra < compared to the lookahead version.
Negative lookarounds can be identified by the use of !
whereas = is used for positive lookarounds. This section is
about negative lookarounds, whose syntax is shown below:
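(?!pat)     negative lookahead assertion
(?<!pat)    negative lookbehind assertion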
# overlap example
# the final _ was replaced as well as played a part in the next match
>>> re.sub(r'(?<!_)cat.', 'dog', 'cats _cater 42cat_cats')
'dog _cater 42dogcats'
Positive lookarounds
Unlike negative lookarounds, absence of something will not
satisfy positive lookarounds. Instead, for the condition to
satisfy, the pattern has to match actual characters and/or
zero-width assertions. Positive lookarounds can be
identified by use of = in the grouping. Syntax is shown
below:
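(?=pat)     positive lookahead assertion
(?<=pat)    positive lookbehind assertion

The variable s used in the examples below isn't defined in this excerpt; judging by the outputs, it could be something like:

>>> s = 'pore42 tar3 dare7 care5'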
# allowed
>>> re.findall(r'(?<=(?:po|da)re)\d+', s)
['42', '7']
>>> re.findall(r'(?<=\b[a-z]{4})\d+', s)
['42', '7', '5']
# not allowed
>>> re.findall(r'(?<!tar|dare)\d+', s)
re.PatternError: look-behind requires fixed-width
>>> re.findall(r'(?<=\b[pd][a-z]*)\d+', s)
re.PatternError: look-behind requires fixed-width
>>> re.sub(r'(?<=\A|,)(?=,|\Z)', 'NA', ',1,,,two,
re.PatternError: look-behind requires fixed-width
Negated groups
Some of the variable length negative lookbehind cases can
be simulated by using a negative lookahead (which doesn't
have a restriction on variable length). The trick is to assert
the negative lookahead one character at a time and apply
quantifiers on such a grouping to satisfy the variable length
requirement. This will only work if you have well defined
conditions before the negated group.
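A sketch of the idea (illustrative):

>>> words = 'fox,cat,dog,parrot'
# match 'at' followed later by 'par', but only if 'go' doesn't occur in between
>>> bool(re.search(r'at((?!go).)*par', words))
True
>>> bool(re.search(r'at((?!do).)*par', words))
False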
Exercises
Please use lookarounds for solving the following
exercises even if you can do it without
lookarounds. The exception is cases where lookarounds
cannot be used, for example variable length lookbehinds.
>>> ip = 'Poke,on=-=so_good:ink.to/is(vast)ever2-
>>> ip = 'Poke,on=-=so_good:ink.to/is(vast)ever2-
>>> ip = 'Poke,on=-=so_good:ink.to/is(vast)ever2-
>>> ip = 'Poke,on=-=so_good:ink.to/is(vast)ever2-
12) For the given string, surround all whole words with {}
except for whole words par and cat and apple.
14) For the given input strings, extract all overlapping two
character sequences.
>>> s1 = 'apple'
>>> s2 = '1.2-3:4'
>>> s1 = '42:cat'
>>> s2 = 'twelve:a2b'
>>> s3 = 'we:be:he:0:a:b:bother'
>>> s4 = 'apple:banana-42:cherry:'
>>> s5 = 'dragon:unicorn:centaur'
>>> pat.sub() ##### add your solution here
'42'
>>> pat.sub() ##### add your solution here
'twelve:a2b'
>>> pat.sub() ##### add your solution here
'we:be:he:0:a:b'
>>> ip = '::very--at<=>row|in.a_b#b2c=>lion----ea
>>> bool(neg.search(str1))
True
>>> bool(neg.search(str2))
False
>>> bool(neg.search(str3))
False
>>> bool(neg.search(str4))
True
>>> bool(neg.search(str5))
False
>>> bool(neg.search(str6))
True
19) The given input string has comma separated fields and
some of them can occur more than once. For the duplicated
fields, retain only the rightmost one. Assume that there are
no empty fields.
>>> row = '421,cat,2425,42,5,cat,6,6,42,61,6,6,sc
re.IGNORECASE
First up, the flag to ignore case while matching alphabets.
When the flags argument is used, this can be specified as
the re.I or re.IGNORECASE constant.
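For example (illustrative):

>>> bool(re.search(r'cat', 'Cat', flags=re.IGNORECASE))
True
>>> re.findall(r'c.t', 'Cat cot CATALOG', flags=re.I)
['Cat', 'cot', 'CAT']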
re.DOTALL
Use re.S or re.DOTALL to allow the . metacharacter to
match newline characters as well.
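For example (illustrative):

>>> re.findall(r'a.b', 'a_b a\nb')
['a_b']
>>> re.findall(r'a.b', 'a_b a\nb', flags=re.S)
['a_b', 'a\nb']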
re.MULTILINE
As seen earlier, re.M or re.MULTILINE flag would allow the
^ and $ anchors to work line wise.
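For example (illustrative):

>>> re.findall(r'^\w+', 'cat\ndog\nfox')
['cat']
>>> re.findall(r'^\w+', 'cat\ndog\nfox', flags=re.M)
['cat', 'dog', 'fox']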
re.VERBOSE
The re.X or re.VERBOSE flag is another provision like
named capture groups to help add clarity to RE definitions.
This flag allows you to use literal whitespaces for aligning
purposes and add comments after the # character to break
down complex RE into multiple lines.
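A small sketch of the idea (not the book's own example):

>>> pat = re.compile(r'''
...         (?P<year>\d{4})    # 4-digit year
...         -
...         (?P<month>\d{2})   # 2-digit month
...         ''', flags=re.X)
>>> pat.search('2024-09').groupdict()
{'year': '2024', 'month': '09'}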
Inline comments
Comments can also be added using the (?#comment)
special group. This is independent of the re.X flag.
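For example (illustrative):

>>> re.sub(r'\((?#open)|\)(?#close)', '', '(a+b)*c')
'a+b*c'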
Inline flags
To apply flags to specific portions of a RE, specify them
inside a special grouping syntax. This will override the flags
applied to the entire RE definition, if any. The syntax
variations are:
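(?flags:pat)     apply flags only for this portion of the RE, for example (?i:cat)
(?-flags:pat)    negate flags only for this portion of the RE
(?flags)         apply flags for the whole RE, can be used only at the start of the RE

A quick illustration (not the book's own examples):

>>> bool(re.search(r'(?i)cat', 'CONCATENATION'))
True
>>> re.findall(r'\b(?i:s)pare\b', 'spare SPARE Spare sparE')
['spare', 'Spare']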
Note Description
re.IGNORECASE
flag to ignore case
or re.I
Exercises
1) Remove from the first occurrence of hat to the last
occurrence of it for the given input strings. Match these
markers case insensitively.
hi there
42
bye
>>> bool(pat.search(s1))
True
>>> bool(pat.search(s2))
False
>>> bool(pat.search(s3))
True
>>> bool(pat.search(s4))
True
>>> bool(pat.search(s5))
False
>>> bool(pat.search(s6))
False
>>> bool(pat.search(s1))
True
>>> bool(pat.search(s2))
True
>>> bool(pat.search(s3))
False
>>> bool(pat.search(s4))
False
5) Explore what the re.DEBUG flag does. Here are some
example patterns to check out.
re.compile(r'\Aden|ly\Z', flags=re.DEBUG)
re.compile(r'\b(0x)?[\da-f]+\b', flags=re.DEBUG)
re.compile(r'\b(?:0x)?[\da-f]+\b', flags=re.I|re.DEBUG)
Unicode
The examples so far had input strings made up of ASCII
characters only. However, the re module's matching works
on Unicode by default. See docs.python: Unicode for a
tutorial on Unicode support in Python. This chapter will
briefly discuss a few things related to Unicode matching.
re.ASCII
Flags can be used to override the default Unicode setting.
The re.A or re.ASCII flag will change \b, \w, \d, \s and
their opposites to match only based on ASCII characters.
# \w is Unicode aware
>>> re.findall(r'\w+', 'fox:αλεπού')
['fox', 'αλεπού']
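With the re.ASCII flag, the same example matches only the ASCII word characters (an illustrative continuation):

>>> re.findall(r'\w+', 'fox:αλεπού', flags=re.A)
['fox']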
\N escape sequence
You can also specify a Unicode character using the
\N{name} escape sequence. See unicode: NamesList for a
full list of names. From the Python docs:
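The quoted documentation isn't reproduced in this excerpt. As a small illustration:

>>> '\N{EM DASH}'
'—'
>>> re.sub(r'\N{EM DASH}', ':', 'cat—dog')
'cat:dog'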
Note                    Description
docs.python: Unicode    tutorial on Unicode support in Python
\uXXXX                  codepoint defined using 4 hexadecimal digits
\UXXXXXXXX              codepoint defined using 8 hexadecimal digits
\N{name}                Unicode character defined by its name; see unicode: NamesList for the full list
>>> regex.search(r'(?P<date>\d{4}-\d{2}-\d{2}).*(?&d
'2008-03-24,food,2012-08-12'
Set the start of matching portion with
\K
Some of the positive lookbehind cases can be solved by adding
\K as a suffix to the pattern to be asserted. The text consumed
until \K won't be part of the matching portion. In other words,
\K determines the starting point. The pattern before \K can be
variable length too.
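A small sketch (illustrative, not the book's own example):

>>> import regex
# only the portion after \K is treated as the matched portion
>>> regex.sub(r'fish\K,', ':', 'fish,cat fish,dog')
'fish:cat fish:dog'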
\G anchor
The \G anchor matches the start of the input string, just like
the \A anchor. In addition, it will also match at the end of the
previous match. This helps you to mark a particular location in
the input string and continue from there instead of having the
pattern to always check for the specific location. This is best
understood with examples.
\G matches the start of the string but the input string doesn't
start with a space character. So the regular expression can be
satisfied only after the other alternative is matched. Consider
the first pattern where Mina is the other alternative. Once that
string is found, a space and digit characters will satisfy the
rest of the RE. Ending of the match, i.e. Mina 89 in this case,
will now be the \G anchoring position. This will allow 85 and
84 to be matched subsequently. After that, J fails the \d
pattern and no more matches are possible (as Mina isn't found
another time).
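The example being described isn't reproduced in this excerpt; here's an illustrative reconstruction consistent with the description above (the exact input string is an assumption):

>>> import regex
>>> marks = 'Joe 75 88 Mina 89 85 84 John 90'
>>> regex.findall(r'(?:Mina|\G) \d+', marks)
['Mina 89', ' 85', ' 84']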
Recursive matching
The subexpression call special group was introduced as
analogous to function calls. And similar to functions, it
supports recursion, which is useful to match nested patterns
(something that is usually not recommended to be done with
regular expressions). Indeed, you should use a proper parser
library for file formats like html, xml, json, csv, etc. But for
some cases, a parser might not be available and using RE
might be simpler than writing one from scratch.
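The patterns and input strings used below aren't defined in this excerpt; based on the outputs, they could have been constructed along these lines (an assumption):

>>> import regex
>>> eqn0 = 'a + (b * c) - (d / e)'
>>> eqn1 = '((f+x)^y-42)*((3-g)^z+2)'
>>> eqn2 = 'a + (b) + ((c)) + (((d)))'
# one additional level of nesting allowed
>>> lvl2 = regex.compile(r'\((?:[^()]++|\([^()]++\))++\)')
# any level of nesting via recursion
>>> lvln = regex.compile(r'\((?:[^()]++|(?0))++\)')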
>>> lvl2.findall(eqn1)
['((f+x)^y-42)', '((3-g)^z+2)']
>>> lvl2.findall(eqn2)
['(b)', '((c))', '((d))']
>>> lvln.findall(eqn0)
['(b * c)', '(d / e)']
>>> lvln.findall(eqn1)
['((f+x)^y-42)', '((3-g)^z+2)']
>>> lvln.findall(eqn2)
['(b)', '((c))', '(((d)))']
Set operations
Set operators can be used inside a character class, between
sets. They are mostly used to get the intersection or
difference between two sets, where one or both of them is a
character range or a predefined character set. To aid in such
definitions, you can use [] in a nested fashion. The four
operators, in increasing order of precedence, are:
|| union
~~ symmetric difference
&& intersection
-- difference
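For example (illustrative; set operations require the VERSION1 behavior):

>>> import regex
# lowercase alphabets except vowels
>>> regex.findall(r'\b[a-z--[aeiou]]+\b', 'tryst fun glyph pity why', flags=regex.V1)
['tryst', 'glyph', 'why']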
Skipping matches
Sometimes, you want to change or extract all matches except
particular portions. Usually, there are common characteristics
between the two types of matches that make it hard or
impossible to define a RE only for the required matches. For
example, changing field values unless it is a particular name,
or perhaps don't touch double quoted values and so on. To use
the skipping feature, define the matches to be ignored suffixed
by (*SKIP)(*FAIL) and then put the required matches as part
of an alternation list. (*F) can also be used instead of
(*FAIL).
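For example (illustrative):

>>> import regex
>>> words = 'tiger imp goat eagle ant important'
# surround whole words with (), except for 'imp' and 'ant'
>>> regex.sub(r'\b(?:imp|ant)\b(*SKIP)(*F)|\w+', r'(\g<0>)', words)
'(tiger) imp (goat) (eagle) ant (important)'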
Overlapped matches
You can use overlapped=True to get overlapped matches.
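For example (illustrative):

>>> import regex
>>> regex.findall(r'\w{2}', 'apple', overlapped=True)
['ap', 'pp', 'pl', 'le']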
regex.REVERSE flag
The regex.R or regex.REVERSE flag will result in right-to-left
processing instead of the usual left-to-right order.
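For example (illustrative):

>>> import regex
>>> regex.sub(r'cat', 'X', 'cat cater', count=1)
'X cater'
# with the REVERSE flag, the rightmost match is found first
>>> regex.sub(r'cat', 'X', 'cat cater', count=1, flags=regex.R)
'cat Xer'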
Note                                       Description
regex.V1                                   flag for VERSION1 behavior; regex.DEFAULT_VERSION = regex.VERSION1 can also be used
r'\((?:[^()]++|(?0))++\)'                  matches nested sets of parentheses
regex.findall(r'\G\d+-?', '12-34 42')      gives ['12-', '34']
[[:^digit:]]                               to indicate \D
\P{L} or \p{^L}                            match characters other than the \p{L} set
pat(*SKIP)(*F)                             ignore text matched by pat
Exercises
1) List the two regex module constants that affect the
compatibility with the re module. Also specify their
corresponding inline flags.
>>> ip = '::very--at<=>row|in.a_b#b2c=>lion----east'
>>> ip = 'vast:a2b2:ride:in:awe:b2b:3list:end'
>>> pat.findall(row1)
['vast']
>>> pat.findall(row2)
['um', 'no', 'low']
>>> pat.findall(row3)
[]
>>> pat.findall(row4)
['Dragon', 'Unicorn', 'Wizard-Healer']
>>> pat.search(ip1)[0]
'if(3-(k*3+4)/12-(r+2/3))'
>>> pat.search(ip2)[0]
'if(a(b)c(d(e(f)1)2)3)'
8) Read about the POSIX flag from
https://pypi.org/project/regex/. Is the following code snippet
showing the correct output?
12) For the given input strings, construct a word that is made
up of the last characters of all the words in the input. Use the
last character of the last word as the first character, last
character of the last but one word as the second character
and so on.
>>> s1 = 'Sample123string42with777numbers'
>>> s2 = '12apples'
>>> pat.split(s1)
['Sample123string42with', '777', 'numbers']
>>> pat.split(s2)
['', '12', 'apples']
>>> bool(pat.fullmatch('CaT'))
True
>>> bool(pat.fullmatch('scat'))
False
>>> bool(pat.fullmatch('ca.'))
True
>>> bool(pat.fullmatch('ca#'))
True
>>> bool(pat.fullmatch('c#t'))
True
>>> bool(pat.fullmatch('at'))
False
>>> bool(pat.fullmatch('act'))
False
>>> bool(pat.fullmatch('2a1'))
False
>>> pat.findall(row1)
['ride', 'in', 'awe', 'b2b', '3list', 'end']
>>> pat.findall(row2)
['s4w', 'seer']
>>> pat.findall(row3)
['apple', 'banana', 'fig']
>>> pat.findall(row4)
[]
Gotchas
Regular expressions can get quite complicated and cryptic.
So, it is natural to assume you have made a mistake if
something isn't working as expected. However, sometimes
it might just be one of the quirky corner cases discussed in
this chapter.
Escape sequences
Some RE engines match characters literally if an escape
sequence is not defined. Python raises an exception for
such cases. Apart from sequences defined for RE (for
example \d), these are allowed: \a \b \f \n \N \r \t \u
\U \v \x \\ where \b means backspace only in character
classes. Also, \u and \U are valid only in Unicode patterns.
Zero-length matches
Beware of empty matches. See also regular-expressions:
Zero-Length Matches.
Optional arguments syntax
The optional arguments of the re module functions are
accepted positionally, so a flag can silently end up being
treated as the count argument. Note that flag constants like
re.I are just integers:

>>> +re.I
2
$ python3.13
>>> import re
>>> re.sub(r'key', r'(\g<0>)', 'KEY portkey oKey Keyed', re.I)
<python-input-1>:1: DeprecationWarning: 'count' is passed as positional argument
  re.sub(r'key', r'(\g<0>)', 'KEY portkey oKey Keyed', re.I)
'KEY port(key) oKey Keyed'
Summary
Hope you have found Python regular expressions an
interesting topic to learn. Sooner or later, you'll need to use
them if your project has text processing tasks. At the same
time, knowing when to use normal string methods and
knowing when to reach for other text parsing modules like
json is important. Happy coding!
Further Reading
Note that some of these resources are not specific to
Python. So you'll have to adapt them to Python's syntax.