Py Regex
Preface
Scripting and automation tasks often need to extract particular portions of text from input data or transform them from one format to another. This book will help you learn Regular Expressions, a mini programming language for all sorts of text processing needs.
The book heavily leans on examples to present features of regular expressions one by one. It is recommended that
you manually type each example and experiment with them. Understanding both the nature of sample input string
and the output produced is essential. As an analogy, consider learning to ride a bike or a car - no matter how much
you read about them or listen to explanations, you need to practice a lot and infer your own conclusions. Should you feel that copy-paste is ideal for you, code snippets are available chapter-wise on GitHub.
The examples presented here have been tested with Python version 3.7.1 and may include features not available
in earlier versions. Unless otherwise noted, all examples and explanations are meant for ASCII characters only.
The examples are copy-pasted from the Python REPL shell, but modified slightly for presentation purposes (like adding comments and blank lines, shortening error messages, skipping import statements, etc.).
Prerequisites
Prior experience working with Python is expected. You should know concepts like string formatting, string methods, list comprehensions and so on.
If you have prior experience with a programming language but are new to Python, check out my GitHub repository on Python Basics before starting this book.
Acknowledgements
Special thanks to Al Sweigart, for introducing me to Python with his awesome automatetheboringstuff book and
video course.
I would highly appreciate it if you'd let me know how you felt about this book; it would help to improve this book as well as my future attempts. Also, please do let me know if you spot any errors or typos.
E-mail: [email protected]
Twitter: https://twitter.com/learn_byexample
Author info
Sundeep Agarwal is a freelance trainer, author and mentor. His previous experience includes working as a Design
Engineer at Analog Devices for more than 5 years. You can find his other works, primarily focused on Linux command
line, text processing, scripting languages and curated lists, at https://github.com/learnbyexample.
License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License
Book version
1.0
Why is it needed?
Regular Expressions have become synonymous with text processing. Most programming languages that are used for scripting purposes come with a regular expression module as part of their standard library. If not, you can usually find third-party library support. The syntax and features of regular expressions vary from language to language. Python's offering is similar to that of the Perl language, but there are significant differences.
The str class comes loaded with a variety of methods to deal with text. So, what's so special about regular expressions and why would you need them? For learning and understanding purposes, one can view regular expressions as a mini programming language in itself, specialized for text processing. Parts of a regular expression can be saved for future use, analogous to variables and functions. There are ways to perform AND, OR and NOT conditionals, as well as operations similar to the range function, the string repetition operator and so on.
Further Reading
• The true power of regular expressions - it also includes a nice explanation of what regular means
• softwareengineering: Is it a must for every programmer to learn regular expressions?
• softwareengineering: When you should NOT use Regular Expressions?
• Regular Expressions: Now You Have Two Problems
• wikipedia: Regular expression - this article includes discussion on regular expressions as a formal language
as well as details on various implementations
Regular Expression modules
In this chapter, you'll get an introduction to two regular expression modules. For some examples, the equivalent normal string method is shown for comparison. Regular expression features will be covered from the next chapter onwards.
re module
It is always a good idea to know where to find the documentation. The default offering for Python regular expressions
is the re standard library module. Visit docs.python: re for information on available methods, syntax, features,
examples and more. Here’s a quote:
A regular expression (or RE) specifies a set of strings that matches it; the functions in this module let
you check if a particular string matches a given regular expression
First up, a simple example to test whether a string is part of another string or not. Normally, you'd use the in operator. For regular expressions, use the re.search function. Pass the RE as the first argument and the string to test against as the second argument. As a good practice, always use raw strings to construct the RE, unless other formats are required (this will become clearer in coming chapters).
>>> sentence = 'This is a sample string'
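Here's a minimal runnable sketch contrasting the in operator with re.search (the xyz pattern is my own addition for illustration):

```python
import re

sentence = 'This is a sample string'

# substring check using the 'in' operator
assert 'is' in sentence
assert 'xyz' not in sentence

# equivalent checks using re.search; it returns a Match object or None
assert bool(re.search(r'is', sentence)) is True
assert bool(re.search(r'xyz', sentence)) is False
```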
Before using the re module, you need to import it. Further example snippets will assume that the module is already loaded. The return value of the re.search function is a re.Match object when a match is found and None otherwise (this will be discussed further in a later chapter). For presentation purposes, the examples will use the bool function to show True or False depending on whether the RE pattern matched or not.
bool is not needed for conditional expressions; the output of re.search can be used directly.
>>> if re.search(r'ring', sentence):
... print('mission success')
...
mission success
Compiling regular expressions
Regular expressions can be compiled using the re.compile function, which gives back a re.Pattern object. The top-level re module functions are all available as methods on this object. Compiling a regular expression helps if the RE has to be used in multiple places or called upon multiple times inside a loop (speed benefit). By default, Python maintains a small cache of recently used patterns, so the speed benefit doesn't apply for trivial use cases.
>>> pet = re.compile(r'dog')
>>> type(pet)
<class 're.Pattern'>
The methods available for compiled patterns also offer a few more features than those available for the top-level functions of the re module. For example, the search method on a compiled pattern has two optional arguments to specify start and end indexes. Similar to the range function and slicing notation, the ending index has to be specified as 1 greater than the desired index.
>>> sentence = 'This is a sample string'
>>> word = re.compile(r'is')
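A sketch of those optional start/end index arguments in action (the specific index values are my own illustration):

```python
import re

sentence = 'This is a sample string'
word = re.compile(r'is')

# search only from index 4 onwards; skips the 'is' inside 'This'
m = word.search(sentence, 4)
assert m.span() == (5, 7)

# search only within index 2 to 4 (end index is exclusive)
assert word.search(sentence, 2, 4).span() == (2, 4)

# no match beyond index 7
assert word.search(sentence, 7) is None
```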
bytes
To work with the bytes data type, the RE must be of the bytes data type as well. Similar to str REs, use the raw form to construct a bytes RE.
>>> sentence = b'This is a sample string'
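A minimal sketch, assuming the same sample string (the xyz pattern and the TypeError check are my own additions):

```python
import re

sentence = b'This is a sample string'

# pattern must also be a bytes object for bytes input
assert bool(re.search(rb'is', sentence)) is True
assert bool(re.search(rb'xyz', sentence)) is False

# mixing a str pattern with bytes input raises TypeError
try:
    re.search(r'is', sentence)
    mixed_ok = True
except TypeError:
    mixed_ok = False
assert mixed_ok is False
```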
regex module
The third-party regex module (https://pypi.org/project/regex/) is backward compatible with the standard re module and offers additional features. This module is a lot closer to Perl regular expressions in terms of features than the re module.
To install the module from the command line, you can either use pip install regex in a virtual environment or use python3.7 -m pip install --user regex for system-wide accessibility.
>>> import regex
>>> sentence = 'This is a sample string'
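Since regex is backward compatible with re, a sketch like the following works with either module (the fallback import is my own convenience for when regex isn't installed, not something from the book):

```python
# regex is backward compatible with re, so fall back if it's not installed
try:
    import regex as re_mod
except ImportError:
    import re as re_mod

sentence = 'This is a sample string'

# basic usage mirrors the re module
assert bool(re_mod.search(r'sample', sentence)) is True
assert bool(re_mod.search(r'xyz', sentence)) is False
```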
You might wonder why two regular expression modules are being presented in this book. The re module is good enough for most use cases. But if text processing occupies a large share of your work, the extra features of the regex module would certainly come in handy. It would also make it easier to adapt from/to other programming languages. You can also consider always using the regex module for your projects instead of having to decide which one to use depending on the features required.
Exercises
Refer to exercises folder for input files required to solve the exercises.
a) For the given input file, print all lines containing the string two
# note that expected output shown here is wrapped to fit pdf width
>>> filename = 'programming_quotes.txt'
>>> word = re.compile() ##### add your solution here
>>> with open(filename, 'r') as ip_file:
... for ip_line in ip_file:
... if word.search(ip_line):
... print(ip_line, end='')
...
"Some people, when confronted with a problem, think - I know, I'll use regular expressions.
Now they have two problems" by Jamie Zawinski
"So much complexity in software comes from trying to make one thing do two things" by Ryan Singer
b) For the given input string, print all lines NOT containing the string 2
>>> purchases = '''\
... apple 24
... mango 50
... guava 42
... onion 31
... water 10'''
>>> num = re.compile() ##### add your solution here
>>> for line in purchases.split('\n'):
... if not num.search(line):
... print(line)
...
mango 50
onion 31
water 10
Anchors
In this chapter, you’ll be learning about qualifying a pattern. Instead of matching anywhere in the given input string,
restrictions can be specified. For now, you’ll see the ones that are already part of re module. In later chapters,
you’ll get to know how to define your own rules for restriction.
These restrictions are made possible by assigning special meaning to certain characters and escape sequences. The characters with special meaning are known as metacharacters in regular expression parlance. In case you need to match those characters literally, you need to escape them with a \ (discussed in a later chapter).
String anchors
This restriction is about qualifying a RE to match only at the start or end of an input string. These provide functionality similar to the str methods startswith and endswith. First up, \A, which restricts the match to the start of the string.
# \A is placed as a prefix to the pattern
>>> bool(re.search(r'\Acat', 'cater'))
True
>>> bool(re.search(r'\Acat', 'concatenation'))
False
Combining the start and end of string anchors, you can restrict the matching to the whole string, similar to comparing strings using the == operator.
>>> pat = re.compile(r'\Acat\Z')
>>> bool(pat.search('cat'))
True
>>> bool(pat.search('cater'))
False
>>> bool(pat.search('concatenation'))
False
Use the optional start/end index arguments of the search method with caution. They are not equivalent to string slicing. For example, specifying a start index greater than 0 for a RE with a start of string anchor is always going to return False.
>>> pat = re.compile(r'\Aat\Z')
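A sketch of why the start index differs from slicing (the index values are my own illustration):

```python
import re

pat = re.compile(r'\Aat\Z')

# the start index is not the same as slicing: \A still anchors
# to the true start of the string, so this can never match
assert bool(pat.search('cat', 1)) is False

# slicing creates a new string, so \A matches its start
assert bool(pat.search('cat'[1:])) is True
```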
The anchors can be used by themselves as a pattern. This helps to insert text at the start or end of a string, emulating string concatenation operations. These might not feel like a useful capability, but combined with other regular expression features they become quite a handy tool. For this illustration, the re.sub function is used, which performs a search and replace operation similar to the normal replace string method.
# first argument is search RE
# second argument is replace RE
# third argument is string to be acted upon
>>> re.sub(r'\A', r're', 'live')
'relive'
>>> re.sub(r'\A', r're', 'send')
'resend'
The meaning of a RE is widely different when used as the search argument vs the replacement argument. This will be discussed separately in a later chapter; for now, only normal strings will be used as the replacement. A common mistake, not specific to re.sub, is forgetting that strings are immutable in Python.
>>> word = 'cater'
# this will return a string object, won't modify 'word' variable
>>> re.sub(r'\Acat', r'hack', word)
'hacker'
>>> word
'cater'
Line anchors
A string input may contain a single line or multiple lines. The line separator is the newline character \n. So, if you are dealing with Windows OS based text files, you'll have to convert \r\n line endings to \n first. This is made easier by Python in many cases - for example, you can specify which line ending to use for the open function, the split string method handles all whitespace by default and so on. Or, you can handle \r as an optional character with quantifiers (covered later).
There are two line anchors: one for the start of line and the other for the end of line. The ^ metacharacter restricts the matching to the start of line and $ is used for the end of line. If there are no newline characters in the input string, these will behave the same as \A and \Z respectively.
>>> pets = 'cat and dog'
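A minimal sketch using the pets string (the specific patterns are my own illustration):

```python
import re

pets = 'cat and dog'

assert bool(re.search(r'^cat', pets)) is True    # 'cat' at start of line
assert bool(re.search(r'^dog', pets)) is False
assert bool(re.search(r'dog$', pets)) is True    # 'dog' at end of line
assert bool(re.search(r'^dog$', pets)) is False  # whole line must be 'dog'
```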
By default, the input string is considered as a single line, even if multiple newline characters are present. In such cases, the $ metacharacter can match both at the end of string and just before the last newline character. However, \Z will always match only at the end of string, irrespective of what characters are present.
>>> greeting = 'hi there\nhave a nice day\n'
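A sketch contrasting $ and \Z on the greeting string (the patterns are my own illustration):

```python
import re

greeting = 'hi there\nhave a nice day\n'

# $ can match just before the trailing newline character
assert bool(re.search(r'day$', greeting)) is True

# \Z matches only at the true end of string
assert bool(re.search(r'day\Z', greeting)) is False
assert bool(re.search(r'day\n\Z', greeting)) is True
```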
To indicate that the input string should be treated as multiple lines, you need to use the re.MULTILINE flag (or,
re.M short form). The flags optional argument will be covered in more detail later.
# check if any line in the string starts with 'tap'
>>> bool(re.search(r'^tap', "hi hello\ntop spot", flags=re.M))
False
Just like string anchors, you can use the line anchors by themselves as a pattern.
>>> ip_lines = "catapults\nconcatenate\ncat"
>>> print(re.sub(r'^', r'* ', ip_lines, flags=re.M))
* catapults
* concatenate
* cat
Word anchors
The third type of restriction is word anchors. A word character is any letter (irrespective of case), digit or the underscore character. You might wonder why digits and underscores are included as well - why not only letters? This comes from variable and function naming conventions - typically letters, digits and underscores are allowed. So, the definition is more programming oriented than natural language based.
The escape sequence \b denotes a word boundary. This works for both start of word and end of word anchoring.
Start of word means either the character prior to the word is a non-word character or there is no character (start
of string). Similarly, end of word means the character after the word is a non-word character or no character (end
of string). This implies that you cannot have word boundary without a word character.
>>> words = 'par spar apparent spare part'
# replace 'par' only at start of word
>>> re.sub(r'\bpar', r'X', words)
'X spar apparent spare Xt'
# replace 'par' only at end of word
>>> re.sub(r'par\b', r'X', words)
'X sX apparent spare part'
# replace 'par' only if it is not part of another word
>>> re.sub(r'\bpar\b', r'X', words)
'X spar apparent spare part'
You can get a lot more creative with using the word boundary as a pattern by itself:
# space separated words to double quoted csv
# note the use of 'replace' string method, 'translate' method can also be used
>>> print(re.sub(r'\b', r'"', words).replace(' ', ','))
"par","spar","apparent","spare","part"
The word boundary has an opposite anchor too. \B matches wherever \b doesn’t match. This duality will be
seen with some other escape sequences too. Negative logic is handy in many text processing situations. But use it
with care, you might end up matching things you didn’t intend!
>>> words = 'par spar apparent spare part'
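Here's a sketch of \B on the words string (the specific patterns and expected outputs are my own, worked out by hand):

```python
import re

words = 'par spar apparent spare part'

# replace 'par' only when it is NOT at the start of a word
assert re.sub(r'\Bpar', r'X', words) == 'par sX apXent sXe part'

# replace 'par' only when it is surrounded by word characters
assert re.sub(r'\Bpar\B', r'X', words) == 'par spar apXent sXe part'
```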
Here’s some standalone pattern usage to compare and contrast the two word anchors:
>>> re.sub(r'\b', r':', 'copper')
':copper:'
>>> re.sub(r'\B', r':', 'copper')
'c:o:p:p:e:r'
In this chapter, you've begun to see the building blocks of regular expressions and how they can be used in interesting ways. But at the same time, regular expression is but another tool in the land of text processing. Often, you'd get a simpler solution by combining regular expressions with other string methods and comprehensions. Practice, experience and imagination would help you construct creative solutions. In coming chapters, you'll see more applications of anchors as well as the \G anchor, which is best understood in combination with other regular expression features.
Exercises
a) For the given url, count the total number of lines that contain is or the as whole words. Note that each
line in the for loop will be of bytes data type.
>>> import urllib.request
>>> scarlet_pimpernel_link = r'https://www.gutenberg.org/cache/epub/60/pg60.txt'
>>> word1 = re.compile() ##### add your solution here
>>> word2 = re.compile() ##### add your solution here
>>> count = 0
>>> with urllib.request.urlopen(scarlet_pimpernel_link) as ip_file:
... for line in ip_file:
... if word1.search(line) or word2.search(line):
... count += 1
...
>>> print(count)
3737
b) For the given input string, change only whole word red to brown
>>> words = 'bred red spread credible'
c) For the given input list, filter all elements that contain 42 surrounded by word characters.
>>> words = ['hi42bye', 'nice1423', 'bad42', 'cool_42a', 'fake4b']
d) For the given input list, filter all elements that start with den or end with ly
>>> foo = ['lovely', '1 dentist', '2 lonely', 'eden', 'fly away', 'dent']
e) For the given input string, change whole word mall only if it is at start of line.
>>> para = '''\
... ball fall wall tall
... mall call ball pall
... wall mall ball fall'''
Alternation and Grouping
Often, you'd want to search for multiple terms. In a conditional expression, you can use the logical operators to combine multiple conditions. With regular expressions, the | metacharacter is similar to logical OR. The RE will match if any of the expressions separated by | is satisfied. These can have their own independent anchors as well.
# match either 'cat' or 'dog'
>>> bool(re.search(r'cat|dog', 'I like cats'))
True
>>> bool(re.search(r'cat|dog', 'I like dogs'))
True
>>> bool(re.search(r'cat|dog', 'I like parrots'))
False
You might infer from the above examples that there can be cases where lots of alternation is required. The join string method can be used to build the alternation list automatically from an iterable of strings.
>>> '|'.join(['car', 'jeep'])
'car|jeep'
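A minimal sketch that compiles and uses such a programmatically built alternation (the sample sentences are my own):

```python
import re

terms = ['car', 'jeep']
pat = re.compile('|'.join(terms))  # same as re.compile(r'car|jeep')

assert bool(pat.search('I drive a jeep')) is True
assert bool(pat.search('I ride a bike')) is False
```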
Often, there are some common things among the RE alternatives. It could be common characters or qualifiers
like the anchors. In such cases, you can group them using a pair of parentheses metacharacters. Similar to
a(b+c) = ab+ac in maths, you get a(b|c) = ab|ac in RE.
# without grouping
>>> re.sub(r'reform|rest', r'X', 'red reform read arrest')
'red X read arX'
# with grouping
>>> re.sub(r're(form|st)', r'X', 'red reform read arrest')
'red X read arX'
# without grouping
>>> re.sub(r'\bpar\b|\bpart\b', r'X', 'par spare part party')
'X spare X party'
# taking out common anchors
>>> re.sub(r'\b(par|part)\b', r'X', 'par spare part party')
'X spare X party'
# taking out common characters as well
# you'll later learn a better technique instead of using empty alternate
>>> re.sub(r'\bpar(|t)\b', r'X', 'par spare part party')
'X spare X party'
There are a lot more features to grouping than just forming a terser RE. For now, this is a good place to show how to incorporate normal strings (could be a variable, result from an expression, etc.) while building a regular expression. For example, adding anchors to an alternation list created using the join method.
>>> words = ['cat', 'par']
>>> '|'.join(words)
'cat|par'
# without word boundaries, any matching portion will be replaced
>>> re.sub('|'.join(words), r'X', 'cater cat concatenate par spare')
'Xer X conXenate X sXe'
In the above examples with the join method, the iterable elements do not contain any special regular expression characters. How to deal with strings that have special characters will be discussed in a later chapter.
Precedence rules
There are some tricky situations when using alternation. If it is used for testing a match to get True/False against a string input, there is no ambiguity. However, for other things like string replacement, it depends on a few factors. Say, you want to replace either are or spared - which one should get precedence? The bigger word spared or the substring are inside it, or something else?
In Python, the alternative which matches earliest in the input string gets precedence.
>>> words = 'lion elephant are rope not'
# starting index of 'on' < index of 'ant' for given string input
# so 'on' will be replaced irrespective of order
# count optional argument here restricts no. of replacements to 1
>>> re.sub(r'on|ant', r'X', words, count=1)
'liX elephant are rope not'
>>> re.sub(r'ant|on', r'X', words, count=1)
'liX elephant are rope not'
What happens if alternatives match starting at the same index? The precedence is then left to right, in the order of declaration.
>>> mood = 'best years'
>>> re.search(r'year', mood)
<re.Match object; span=(5, 9), match='year'>
>>> re.search(r'years', mood)
<re.Match object; span=(5, 10), match='years'>
Another example (without count restriction) to drive home the issue:
>>> words = 'ear xerox at mare part learn eye'
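A sketch of how substrings can sabotage replacements on this string (the alternation patterns and outputs are my own, worked out by hand):

```python
import re

words = 'ear xerox at mare part learn eye'

# leftmost earliest match wins; at a given index, alternatives
# are tried in the order they are declared, so 'ar' wins over 'are'/'art'
assert re.sub(r'ar|are|art', r'X', words) == 'eX xerox at mXe pXt leXn eye'

# listing the longest alternatives first changes 'mare' and 'part'
assert re.sub(r'art|are|ar', r'X', words) == 'eX xerox at mX pX leXn eye'
```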
If you do not want substrings to sabotage your replacements, a robust workaround is to sort the alternations based
on length, longest first.
>>> words = ['hand', 'handy', 'handful']
>>> '|'.join(sorted(words, key=len, reverse=True))
'handful|handy|hand'
>>> alt = re.compile('|'.join(sorted(words, key=len, reverse=True)))
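A sketch using the compiled longest-first alternation (the sample strings are my own):

```python
import re

words = ['hand', 'handy', 'handful']
alt = re.compile('|'.join(sorted(words, key=len, reverse=True)))

# longest alternative is tried first, so whole words get replaced
assert alt.sub(r'X', 'hands handy handful') == 'Xs X X'

# without sorting, the shorter 'hand' wins and leaves 'ful' behind
assert re.sub('|'.join(words), r'X', 'handful') == 'Xful'
```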
So, this chapter was about specifying one or more alternate matches within the same RE using the | metacharacter, which can further be simplified using () grouping if the alternations have common aspects. Among the alternations, the earliest matching pattern gets precedence. Left to right ordering is used as a tie-breaker if multiple alternations match starting from the same location. You also learnt ways to programmatically construct a RE.
Exercises
a) For the given input list, filter all elements that start with den or end with ly
>>> foo = ['lovely', '1 dentist', '2 lonely', 'eden', 'fly away', 'dent']
b) For the given url, count the total number of lines that contain removed or rested or received or replied
or refused or retired as whole words. Note that each line in the for loop will be of bytes data type.
>>> import urllib.request
>>> scarlet_pimpernel_link = r'https://www.gutenberg.org/cache/epub/60/pg60.txt'
>>> words = re.compile() ##### add your solution here
>>> count = 0
>>> with urllib.request.urlopen(scarlet_pimpernel_link) as ip_file:
... for line in ip_file:
... if words.search(line):
... count += 1
...
>>> print(count)
83
Escaping metacharacters
You have seen a few metacharacters and escape sequences that help to compose a RE. To match the metacharacters literally, i.e. to remove their special meaning, prefix those characters with a \ character. To indicate a literal \ character, use \\. This assumes the RE is specified as a raw string, not a normal string.
# even though ^ is not being used as anchor, it won't be matched literally
>>> bool(re.search(r'b^2', 'a^2 + b^2 - C*3'))
False
# escaping will work
>>> bool(re.search(r'b\^2', 'a^2 + b^2 - C*3'))
True
# match ( or ) literally
>>> re.sub(r'\(|\)', r'', '(a*b) + c')
'a*b + c'
As emphasized earlier, a regular expression is just another tool to process text. Some examples and exercises presented in this book can be solved using normal string methods as well. For real world use cases, ask yourself first if a regular expression is needed at all.
>>> eqn = 'f*(a^b) - 3*(a^b)'
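A sketch of solving this with a plain string method versus an escaped RE (my own illustration):

```python
import re

eqn = 'f*(a^b) - 3*(a^b)'

# a plain string method does the job; no regular expression needed
assert eqn.replace('(a^b)', 'c') == 'f*c - 3*c'

# the equivalent RE requires escaping every metacharacter
assert re.sub(r'\(a\^b\)', 'c', eqn) == 'f*c - 3*c'
```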
Okay, what if you have a string variable that must be used to construct a RE - how to escape all the metacharacters? Relax, the re.escape function has got you covered. No need to manually take care of all the metacharacters or worry about changes in future versions.
>>> expr = '(a^b)'
# print used here to show results similar to raw string
>>> print(re.escape(expr))
\(a\^b\)
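A sketch of putting re.escape to use inside re.sub (the eqn string is my own illustration):

```python
import re

expr = '(a^b)'
eqn = 'f*(a^b) - 3*(a^b)'

# re.escape makes the variable safe to embed in a RE
assert re.sub(re.escape(expr), 'c', eqn) == 'f*c - 3*c'

# without escaping, the pattern means something entirely different
# (a group with an impossible mid-string ^ anchor), so nothing matches
assert re.sub(expr, 'c', eqn) == eqn
```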
Exercises
a) Transform given input strings to expected output using same logic on both strings.
>>> str1 = '(9-2)*5+qty/3'
>>> str2 = '(qty+4)/2-(9-2)*5+pq/4'
b) Replace any matching item from given list with X for given input strings.
>>> items = ['a.b', '3+n', r'x\y\z', 'qty||price', '{n}']
>>> alt_re = re.compile() ##### add your solution here
Dot metacharacter and Quantifiers
As an analogy, alternation provides logical OR. Combining the dot metacharacter . and quantifiers (and alternation if needed) paves the way to perform logical AND. For example, to check if a string matches two patterns with any number of characters in between. The dot metacharacter serves as a placeholder to match any character except the newline character. In later chapters, you'll learn how to include the newline character, as well as how to define your own custom placeholder for a limited set of characters.
# matches character 'c', any character and then character 't'
>>> re.sub(r'c.t', r'X', 'tac tin cat abc;tuv acute')
'taXin X abXuv aXe'
# matches character 'r', any two characters and then character 'd'
>>> re.sub(r'r..d', r'X', 'breadth markedly reported overrides')
'bXth maXly repoX oveXes'
Greedy quantifiers
Quantifiers are like the string repetition operator and the range function. They can be applied to both characters and groupings. Apart from the ability to specify exact quantity and bounded ranges, these can also match unbounded varying quantities. If the input string can satisfy a pattern with varying quantities in multiple ways, you can choose among three types of quantifiers to narrow down the possibilities. In this section, the greedy type of quantifiers is covered.
First up, the ? metacharacter, which quantifies a character or group to match 0 or 1 times. This helps to define optional patterns and build a terser RE compared to groupings in some cases.
# same as: r'ear|ar'
>>> re.sub(r'e?ar', r'X', 'far feat flare fear')
'fX feat flXe fX'
The * metacharacter quantifies a character or group to match 0 or more times. There is no upper bound; more details will be discussed at the end of this section.
# match 't' followed by zero or more of 'a' followed by 'r'
>>> re.sub(r'ta*r', r'X', 'tr tear tare steer sitaara')
'X tear Xe steer siXa'
# match 't' followed by zero or more of 'e' or 'a' followed by 'r'
>>> re.sub(r't(e|a)*r', r'X', 'tr tear tare steer sitaara')
'X X Xe sX siXa'
# match zero or more of '1' followed by '2'
>>> re.sub(r'1*2', r'X', '3111111111125111142')
'3X511114X'
Time to introduce the re.split function:
# last element is empty because there is nothing between 511114 and 2
>>> re.split(r'1*2', '3111111111125111142')
['3', '511114', '']
The + metacharacter quantifies a character or group to match 1 or more times. Similar to the * quantifier, there is no upper bound. More importantly, this doesn't have surprises like matching an empty string in between patterns or at the start/end of string.
>>> re.sub(r'ta+r', r'X', 'tr tear tare steer sitaara')
'tr tear Xe steer siXa'
>>> re.sub(r't(e|a)+r', r'X', 'tr tear tare steer sitaara')
'tr X Xe sX siXa'
You can specify a range of integer numbers, both bounded and unbounded, using the {} metacharacters. There are four ways to use this quantifier, as listed below:
• {m,n} match m to n times
• {m,} match at least m times
• {,n} match up to n times (including 0 times)
• {n} match exactly n times
Note: The {} metacharacters have to be escaped to match them literally. However, unlike the () metacharacters, these have a lot more leeway. For example, escaping { alone is enough, or if it doesn't conform strictly to any of the four forms listed above, escaping is not needed at all.
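A sketch of the {} quantifier in action (the sample strings and helper function are my own):

```python
import re

repeats = ['ac', 'abc', 'abbc', 'abbbc', 'abbbbbc']

def sub_all(pat):
    # apply the same substitution to every sample string
    return [re.sub(pat, 'X', s) for s in repeats]

# {1,3} matches 'b' 1 to 3 times
assert sub_all(r'ab{1,3}c') == ['ac', 'X', 'X', 'X', 'abbbbbc']
# {2,} matches 'b' at least 2 times
assert sub_all(r'ab{2,}c') == ['ac', 'abc', 'X', 'X', 'X']
# {,2} matches 'b' up to 2 times, including 0 times
assert sub_all(r'ab{,2}c') == ['X', 'X', 'X', 'abbbc', 'abbbbbc']
# {3} matches 'b' exactly 3 times
assert sub_all(r'ab{3}c') == ['ac', 'abc', 'abbc', 'X', 'abbbbbc']
```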
Next up, how to construct an AND conditional using the dot metacharacter and quantifiers.
# match 'Error' followed by zero or more characters followed by 'valid'
>>> bool(re.search(r'Error.*valid', 'Error: not a valid input'))
True
To allow matching in any order, you’ll have to bring in alternation as well. That is somewhat manageable for 2 or 3
patterns. In a later chapter, you’ll learn how to use lookarounds for a comparatively easier approach.
>>> seq1 = 'cat and dog'
>>> seq2 = 'dog and cat'
>>> bool(re.search(r'cat.*dog|dog.*cat', seq1))
True
>>> bool(re.search(r'cat.*dog|dog.*cat', seq2))
True
So, how much do these greedy quantifiers match? When you are using ?, how does Python decide to match 0 or 1 times, if both quantities can satisfy the RE? For example, consider the expression re.sub(r'f.?o', r'X', 'foot') - should foo be replaced or fo? It will always replace foo, because these are greedy quantifiers, meaning the longest match wins.
>>> re.sub(r'f.?o', r'X', 'foot')
'Xt'
But wait, how did the r'Error.*valid' example work? Shouldn't .* consume all the characters after Error? Good question. The regular expression engine actually does consume all the characters. Then, realizing that the RE fails, it gives back one character from the end of string and checks again if the RE is satisfied. This process is repeated until a match is found or failure is confirmed. In regular expression parlance, this is called backtracking. It can be quite time consuming for certain corner cases.
>>> sentence = 'that is quite a fabricated tale'
# matching first 't' to last 'a' for t.*a won't work for these cases
# the engine backtracks until .*q matches and so on
>>> re.sub(r't.*a.*q.*f', r'X', sentence, count=1)
'Xabricated tale'
>>> re.sub(r't.*a.*u', r'X', sentence, count=1)
'Xite a fabricated tale'
Non-greedy quantifiers
As the name implies, these quantifiers will try to match as minimally as possible. Also known as lazy or reluctant
quantifiers. Appending a ? to greedy quantifiers makes them non-greedy.
>>> re.sub(r'f.??o', r'X', 'foot', count=1)
'Xot'
Like greedy quantifiers, lazy quantifiers will try to satisfy the overall RE.
>>> sentence = 'that is quite a fabricated tale'
# matching first 't' to first 'a' for t.*?a won't work for this case
# so, engine will move forward until .*?f matches and so on
>>> re.sub(r't.*?a.*?f', r'X', sentence, count=1)
'Xabricated tale'
Possessive quantifiers
Note: This feature is not present in re module, but is offered by the regex module.
Appending a + to greedy quantifiers makes them possessive. These are like greedy quantifiers, but without the backtracking. So, something like r'Error.*+valid' will never match, because .*+ will consume all the remaining characters. If both the greedy and possessive quantifier versions are functionally equivalent, then possessive is preferred because it will fail faster for non-matching cases. In a later chapter, you'll see an example where a RE will only work with a possessive quantifier, but not if a greedy quantifier is used.
>>> import regex
>>> demo = ['abc', 'ac', 'adc', 'abbc', 'xabbbcz', 'bbb', 'bc', 'abbbbbc']
# different results
>>> regex.sub(r'f(a|e)*at', r'X', 'feat ft feaeat')
'X ft X'
# (a|e)*+ would match 'a' or 'e' as much as possible
# no backtracking, so another 'a' can never match
>>> regex.sub(r'f(a|e)*+at', r'X', 'feat ft feaeat')
'feat ft feaeat'
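The demo list defined above can be put to use as follows - a sketch of filtering with a possessive quantifier (the whole-string pattern is my own; the fallback to re assumes Python 3.11+, where re also supports possessive quantifiers):

```python
# fall back to re, which supports possessive quantifiers from Python 3.11
try:
    import regex as mod
except ImportError:
    import re as mod

demo = ['abc', 'ac', 'adc', 'abbc', 'xabbbcz', 'bbb', 'bc', 'abbbbbc']

# whole-string match: 'a', then zero or more 'b' (possessive), then 'c'
matched = [w for w in demo if mod.search(r'\Aab*+c\Z', w)]
assert matched == ['abc', 'ac', 'abbc', 'abbbbbc']
```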
The effect of possessive quantifier can also be expressed using atomic grouping. The syntax is (?>RE) - in later
chapters you’ll see more such special groupings.
# same as: r'(b|o)++'
>>> regex.sub(r'(?>(b|o)+)', r'X', 'abbbc foooooot')
'aXc fXt'
This chapter introduced the concept of specifying a placeholder instead of a fixed string. Combined with quantifiers, you've seen a glimpse of how a simple RE can match a wide range of text. In coming chapters, you'll learn how to create your own restricted set of placeholder characters.
Exercises
Note that some exercises are intentionally designed to be complicated to solve with regular expressions alone. Try to use normal string methods, break down the problem into multiple steps, etc. Some exercises will become easier to solve with techniques presented in chapters to come. Going through the exercises a second time after finishing the entire book will be fruitful as well.
a) Use regular expression to get the output as shown for the given strings.
>>> eqn1 = 'a+42//5-c'
>>> eqn2 = 'pressure*3+42/5-14256'
>>> eqn3 = 'r*42-5/3+42///5-42/53+a'
c) Remove leading/trailing whitespaces from all the individual fields of these csv strings.
>>> csv1 = ' comma ,separated ,values '
>>> csv2 = 'good bad,nice ice , 42 , , stall small'
# wrong output
>>> change.sub(r'X', words)
'plXk XcomXg tX wXer X cautX sentient'
# expected output
>>> change = re.compile() ##### add your solution here
>>> change.sub(r'X', words)
'plX XmX tX wX X cautX sentient'
e) For the given greedy quantifiers, what would be the equivalent form using {m,n} representation?
• ? is same as {0,1}
• * is same as {0,}
• + is same as {1,}
Working with matched portions
Having seen a few features that can match varying text, you’ll learn how to extract and work with those matching
portions in this chapter.
re.Match object
The re.search function returns a re.Match object from which various details can be extracted like the matched
portion of string, location of matched portion, etc. See docs.python: Match Objects for details.
>>> re.search(r'ab*c', 'abc ac adc abbbc')
<re.Match object; span=(0, 3), match='abc'>
The RE grouping inside () is also known as a capture group. It has multiple uses, one of which is the ability
to work with matched portions of those groups. When capture groups are used with re.search , they can be
retrieved using an index on the re.Match object. The first element is always the entire matched portion and rest
of the elements are for capture groups if they are present. The leftmost ( will get group number 1 , second
leftmost ( will get group number 2 and so on.
>>> re.search(r'b.*d', 'abc ac adc abbbc')
<re.Match object; span=(1, 9), match='bc ac ad'>
# retrieving entire matched portion
>>> re.search(r'b.*d', 'abc ac adc abbbc')[0]
'bc ac ad'
# can also pass an index by calling the 'group' method on the Match object
>>> re.search(r'b.*d', 'abc ac adc abbbc').group(0)
'bc ac ad'
Functions can be used in the replacement section of re.sub instead of a string. A re.Match object will be passed
to the function as the argument. In later chapters, you’ll see a way to directly reference the matches in the replacement
section string.
# m[0] will contain entire matched portion
# a^2 and b^2 for the two matches in this example
>>> re.sub(r'(a|b)\^2', lambda m: m[0].upper(), 'a^2 + b^2 - C*3')
'A^2 + B^2 - C*3'
re.findall
The re.findall function returns all matched portions of the input string as a list. It is useful for debugging
purposes as well, for example to see what is going on under the hood before applying a substitution.
>>> re.findall(r't.*a', 'that is quite a fabricated tale')
['that is quite a fabricated ta']
If capture groups are used, each element of output will be a tuple of strings of all the capture groups. Text matched
by the RE outside of capture groups won’t be present in the output list. If there is only one capture group, tuple
won’t be used and each element will be the matched portion of that capture group.
>>> re.findall(r'a(b*)c', 'abc ac adc abbc xabbbcz bbb bc abbbbbc')
['b', '', 'bb', 'bbb', 'bbbbb']
re.finditer
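The description for this section seems to have been lost in this excerpt. As a sketch based on the standard library’s documented behavior, re.finditer returns an iterator of re.Match objects for every non-overlapping match, which is handy when both the matched portion and its location are needed (shown here as a plain script rather than a REPL session):

```python
import re

# re.finditer yields a re.Match object for each non-overlapping match,
# giving access to the matched text as well as its span
matches = [(m[0], m.span()) for m in re.finditer(r'ab*c', 'abc ac adc abbbc')]
print(matches)
# [('abc', (0, 3)), ('ac', (4, 6)), ('abbbc', (11, 16))]
```

Unlike re.findall, nothing is flattened here, so capture groups do not change the shape of the output.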
This chapter introduced different ways to work with various matching portions of input string. You learnt another
use of groupings and you’ll see even more uses of groupings later on.
Exercises
a) For the given strings, extract the matching portion from the first is to the last t
>>> str1 = 'What is the biggest fruit you have seen?'
>>> str2 = 'Your mission is to read and practice consistently'
>>> expr = re.compile() ##### add your solution here
Character class
To create a custom placeholder for limited set of characters, enclose them inside [] metacharacters. It is similar
to using single character alternations inside a grouping, but without the drawbacks of a capture group. In addition,
character classes have their own versions of metacharacters and provide special predefined sets for common use
cases. Quantifiers are also applicable to character classes.
>>> words = ['cute', 'cat', 'cot', 'coat', 'cost', 'scuttle']
# same as: r'cot|cut' or r'c(o|u)t'
>>> [w for w in words if re.search(r'c[ou]t', w)]
['cute', 'cot', 'scuttle']
Metacharacters
Character classes have their own metacharacters to help define the sets succinctly. Metacharacters outside of
character classes like ^ , $ , () etc either don’t have special meaning or have a completely different one inside
the character classes. First up, the - metacharacter, which helps to define a range of characters instead of having
to specify them all individually.
# all digits
>>> re.findall(r'[0-9]+', 'Sample123string42with777numbers')
['123', '42', '777']
# whole words made up of lowercase alphabets, but starting with 'p' to 'z'
>>> re.findall(r'\b[p-z][a-z]*\b', 'coat tin food put stoop best')
['tin', 'put', 'stoop']
# whole words made up of only 'a' to 'f' and 'p' to 't' lowercase alphabets
>>> re.findall(r'\b[a-fp-t]+\b', 'coat tin food put stoop best')
['best']
Character classes can also be used to construct numeric ranges. However, it is easy to miss corner cases and some
ranges are complicated to design.
# numbers between 10 to 29
>>> re.findall(r'\b[12][0-9]\b', '23 154 12 26 98234')
['23', '12', '26']
If numeric range is difficult to construct, better to convert the matched portion to appropriate numeric format first.
# numbers < 350
>>> m_iter = re.finditer(r'[0-9]+', '45 349 651 593 4 204')
>>> [m[0] for m in m_iter if int(m[0]) < 350]
['45', '349', '4', '204']
# note that return value is string and s[0] is used to get matched portion
>>> def num_range(s):
... return '1' if 200 <= int(s[0]) <= 650 else '0'
...
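The snippet that actually applies num_range appears to have been elided in this excerpt. A minimal sketch of how such a function could be plugged into re.sub, reusing the input string from the example above:

```python
import re

def num_range(s):
    # s is a re.Match object; s[0] gives the matched portion
    return '1' if 200 <= int(s[0]) <= 650 else '0'

# replace each number with 1 or 0 based on the range check
result = re.sub(r'[0-9]+', num_range, '45 349 651 593 4 204')
print(result)
# 0 1 0 1 0 1
```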
The next metacharacter is ^ which has to be specified as the first character of the character class. It negates the set
of characters, so all characters other than those specified will be matched. As highlighted earlier, handle negative
logic with care; you might end up matching more than you wanted. Also, the examples below are all excellent
places to use the possessive quantifier, as there is no backtracking involved.
# all non-digits
>>> re.findall(r'[^0-9]+', 'Sample123string42with777numbers')
['Sample', 'string', 'with', 'numbers']
Sometimes, it is easier to use a positive character class and negate the re.search result instead of using a
negated character class.
>>> words = ['tryst', 'fun', 'glyph', 'pity', 'why']
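The follow-up example seems to be missing from this excerpt. As an illustrative sketch, filtering the words that do not contain any vowel is simpler with a positive class and a negated re.search result than with a negated class plus anchors:

```python
import re

words = ['tryst', 'fun', 'glyph', 'pity', 'why']
# keep words that do NOT contain any vowel
no_vowels = [w for w in words if not re.search(r'[aeiou]', w)]
print(no_vowels)
# ['tryst', 'glyph', 'why']
```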
Similar to other metacharacters, prefix \ to character class metacharacters to match them literally. Some of them
can be achieved by different placement as well.
# - should be first or last character or escaped using \
>>> re.findall(r'\b[a-z-]{2,}\b', 'ab-cd gh-c 12-423')
['ab-cd', 'gh-c']
>>> re.findall(r'\b[a-z\-0-9]{2,}\b', 'ab-cd gh-c 12-423')
['ab-cd', 'gh-c', '12-423']
# [ can be escaped with \ or placed as last character
# ] can be escaped with \ or placed as first character
>>> re.search(r'[a-z\[\]0-9]+', 'words[5] = tea')
<re.Match object; span=(0, 8), match='words[5]'>
# \ should be escaped using \
>>> print(re.search(r'[a\\b]+', r'5ba\babc2')[0])
ba\bab
• \w is similar to [a-zA-Z0-9_] for matching word characters (recall the definition for word boundaries)
• \d is similar to [0-9] for matching digit characters
• \s is similar to [ \t\n\r\f\v] for matching whitespace characters
These escape sequences can be used as standalone or inside a character class. Also, these would behave differently
depending on flags used (covered in a later chapter). For now, as mentioned before, the examples and description
will assume input made up of ASCII characters only.
>>> re.split(r'\d+', 'Sample123string42with777numbers')
['Sample', 'string', 'with', 'numbers']
>>> re.findall(r'\d+', 'foo=5, bar=3; x=83, y=120')
['5', '3', '83', '120']
And negative logic strikes again: use \W , \D and \S respectively for their negated character classes.
>>> re.sub(r'\D+', r'-', 'Sample123string42with777numbers')
'-123-42-777-'
This chapter focussed on how to use and create custom placeholders for a limited set of characters. Grouping and
character classes can be considered as two levels of abstraction. On the one hand, you can have character sets
inside [] and on the other, you can have multiple alternations grouped inside () , including character classes. As
anchoring and quantifiers can be applied to both these abstractions, you can begin to see how regular expressions
form a mini-programming language. In coming chapters, you’ll even see how to negate groupings similar to
negated character classes in certain scenarios.
Exercises
>>> remove_parentheses.sub('', str3)
'Hi there. Nice day(a'
b) Extract all hex character sequences, with optional prefix. Match the characters case insensitively, and the
sequences shouldn’t be surrounded by other word characters.
>>> hex_seq = re.compile() ##### add your solution here
c) Output True/False depending upon input string containing any number sequence that is greater than 624
>>> str1 = 'hi0000432abcd'
##### add your solution here
False
d) Split the given strings based on consecutive sequence of digit or whitespace characters.
>>> str1 = 'lion \t Ink32onion Nice'
>>> str2 = '**1\f2\n3star\t7 77\r**'
>>> expr = re.compile() ##### add your solution here
>>> expr.split(str1)
['lion', 'Ink', 'onion', 'Nice']
>>> expr.split(str2)
['**', 'star', '**']
Groupings and backreferences
You’ve been patiently hearing that more awesome stuff is to come regarding groupings. Well, here it comes, in
various forms. And some more will come in later chapters!
First up, saving (i.e. capturing) RE to use it later, similar to variables and functions in a programming language.
You have already seen how to use Match object to refer to text captured by groups. In a similar manner, you can use
backreference \N where N is the capture group you want. Backreferences can be used within the RE definition
itself as well as in replacement section whereas saved Match objects can be used in later instructions.
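A short sketch of backreferencing in both the RE definition and the replacement section (the sample string here is an assumed illustration, not from the book):

```python
import re

# \1 inside the RE refers to the text matched by the first capture group;
# here the whole pattern detects a word repeated consecutively, and the
# replacement \1 keeps just one copy of it
s = 'hello hello world to to to earth'
result = re.sub(r'\b(\w+)( \1)+\b', r'\1', s)
print(result)
# hello world to earth
```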
Non-capturing groups
Grouping has many uses like applying quantifier on a RE portion, creating terse RE by factoring common portions
and so on. It also affects behavior of functions like re.findall and re.split .
# without capture group
>>> re.split(r'\d+', 'Sample123string42with777numbers')
['Sample', 'string', 'with', 'numbers']
# with capture group
>>> re.split(r'(\d+)', 'Sample123string42with777numbers')
['Sample', '123', 'string', '42', 'with', '777', 'numbers']
When backreferencing is not required, you can use a non-capturing group to avoid the behavior change of re.findall
and re.split . It also helps to avoid keeping track of capture group numbers when a particular group is not
needed for backreferencing. The syntax is (?:RE) to define a non-capturing group. More such special groups
starting with the (? syntax will be discussed later on.
# normal capture group will hinder ability to get whole match
# non-capturing group to the rescue
>>> re.findall(r'\b\w*(?:st|in)\b', 'cost akin more east run against')
['cost', 'akin', 'east', 'against']
However, there are situations where capture groups cannot be avoided. In such cases, you’d need to manually work
with Match objects to get desired results.
>>> words = 'effort flee facade oddball rat tool'
# whole words containing at least one consecutive repeated character
>>> repeat_char = re.compile(r'\b\w*(\w)\1\w*\b')
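Continuing the example above, one way to work with the Match objects (a sketch, since findall would return only the capture group portion rather than the whole word):

```python
import re

words = 'effort flee facade oddball rat tool'
repeat_char = re.compile(r'\b\w*(\w)\1\w*\b')
# findall would return only the single repeated character captured by
# the group, so use finditer and take the whole matched portion instead
matched_words = [m[0] for m in repeat_char.finditer(words)]
print(matched_words)
# ['effort', 'flee', 'oddball', 'tool']
```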
RE can get cryptic and difficult to maintain, even for seasoned programmers. There are a few constructs to help add
clarity. One such construct is naming the capture groups and using that name for backreferencing instead of plain numbers.
The syntax is (?P<name>RE) for naming the capture groups. The name used should be a valid Python identifier. Use
m['name'] or m.group('name') on Match objects, \g<name> in the replacement section and (?P=name) in the RE
definition for backreferencing. These will still behave as normal capture groups, so \N or \g<N> numbering can be used as well.
# giving names to first and second captured words
>>> re.sub(r'(?P<fw>\w+),(?P<sw>\w+)', r'\g<sw>,\g<fw>', 'a,b 42,24')
'b,a 24,42'
'apple'
>>> m.group('fruit')
'apple'
Subexpression calls
It may be obvious, but it should be noted that backreference will provide the string that was matched, not the RE
that was inside the capture group. For example, if ([0-9][a-f]) matches 3b , then backreferencing will give
3b and not any other valid match of RE like 8f , 0a etc. This is akin to how variables behave in programming,
only the result of expression stays after variable assignment, not the expression itself.
The regex module provides a way to refer to the expression itself, using (?1) , (?2) etc. This is applicable
only in RE definition, not in replacement sections. This behavior is similar to function call, and like functions this
can simulate recursion as well (will be discussed later).
>>> import re, regex
>>> row = 'today,2008-03-24,food,2012-08-12,nice,5632'
Named capture group can be used as well and called using (?&name) syntax:
>>> import regex
>>> row = 'today,2008-03-24,food,2012-08-12,nice,5632'
This chapter covered many more features related to grouping - backreferencing to get both variable and function
like behavior, and naming the groups to add clarity. When backreference is not needed for a particular group, use
non-capturing group.
Exercises
a) The given string has fields separated by : and each field has a floating point number followed by a , and a
string. If the floating point number has only one digit precision, append 0 and swap the strings separated by ,
for that particular field.
>>> row = '3.14,hi:42.5,bye:1056.1,cool:00.9,fool'
##### add your solution here
'3.14,hi:bye,42.50:cool,1056.10:fool,00.90'
b) Count the number of words that have at least two consecutive repeated alphabets, for ex: words like stillness
and Committee but not words like root or readable or rotational . Consider word to be as defined in
regular expression parlance and word split across two lines as two different words.
>>> import urllib.request
>>> scarlet_pimpernel_link = r'https://www.gutenberg.org/cache/epub/60/pg60.txt'
>>> word_expr = re.compile() ##### add your solution here
>>> count = 0
>>> with urllib.request.urlopen(scarlet_pimpernel_link) as ip_file:
... for line in ip_file:
... for word in re.findall(rb'\w+', line):
... if word_expr.search(word):
... count += 1
...
>>> print(count)
219
c) Convert the given markdown headers to corresponding anchor tag. Consider the input to start with one or
more # characters followed by space and word characters. The name attribute is constructed by converting the
header to lowercase and replacing spaces with hyphens. Can you do it without using a capture group?
>>> header1 = '# Regular Expressions'
>>> header2 = '## Compiling regular expressions'
e) Use appropriate regular expression function to get the expected output for given string.
>>> str1 = 'price_42 roast:\t\n:-ice==cat\neast'
##### add your solution here
['price_42', ' ', 'roast', ':\t\n:-', 'ice', '==', 'cat', '\n', 'east']
Lookarounds
Having seen how to create custom character classes and various avatars of groupings, it is time to learn how
to create custom anchors and add conditions to a pattern within the RE definition. These assertions are also known as
zero-width patterns because they add restrictions similar to anchors and are not part of matched portions. Also,
you will learn how to negate a grouping similar to negated character sets.
Negative lookarounds
Lookaround assertions can be added to a pattern in two ways - as a prefix known as lookbehind and as a suffix
known as lookahead. Syntax wise, these two ways are differentiated by adding a < for the lookbehind version.
Negative lookaround uses ! to indicate negated logic. The complete syntax looks like:
• (?!RE) for negative lookahead assertion
• (?<!RE) for negative lookbehind assertion
As mentioned earlier, lookarounds are not part of matched portions and do not capture the matched text.
# change 'foo' only if it is not followed by a digit character
# note that end of string satisfies the given assertion
# 'foofoo' has two matches as the assertion doesn't consume characters
>>> re.sub(r'foo(?!\d)', r'baz', 'hey food! foo42 foot5 foofoo')
'hey bazd! foo42 bazt5 bazbaz'
# overlap example
# the final _ was replaced as well as played a part in assertion
>>> re.sub(r'(?<!_)foo.', r'baz', 'food _fool 42foo_foot')
'baz _fool 42bazfoot'
These can be mixed with existing anchors and other features to define truly powerful restrictions:
# change whole word only if it is not preceded by : or -
>>> re.sub(r'(?<![:-])\b\w+\b', r'X', ':cart <apple -rest ;tea')
':cart <X -rest ;X'
Positive lookarounds
Positive lookaround syntax uses = similar to ! for negative lookaround. The complete syntax looks like:
• (?=RE) for positive lookahead assertion
• (?<=RE) for positive lookbehind assertion
# except first and last fields
>>> re.findall(r'(?<=,)[^,]+(?=,)', '1,two,3,four,5')
['two', '3', 'four']
Even though lookarounds are not part of matched portions, capture groups can be used inside them.
>>> print(re.sub(r'(\S+\s+)(?=(\S+)\s)', r'\1\2\n', 'a b c d e'))
a b
b c
c d
d e
AND conditional
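The body of this section appears to be missing from this excerpt. Based on the heading, here is a sketch of the usual technique: since lookarounds are zero-width, multiple lookaheads chained at the same position behave like an AND condition, with each assertion checked independently from that location:

```python
import re

words = ['sequoia', 'subtle', 'exhibit', 'asset', 'tests']
# match words containing both 'b' and 't', in any order:
# each lookahead restarts checking from the beginning of the string
both_b_and_t = [w for w in words if re.search(r'\A(?=.*b)(?=.*t)', w)]
print(both_b_and_t)
# ['subtle', 'exhibit']
```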
When using a lookbehind assertion (either positive or negative), the RE inside the assertion cannot imply matching
a variable length of text. Using a fixed length quantifier is allowed. Alternations of different lengths are not allowed,
even if each of the alternations is of fixed length. Here are some examples to clarify these points:
# allowed
>>> re.findall(r'(?<=(?:po|ca)re)\d+', 'pore42 car3 pare7 care5')
['42', '5']
>>> re.findall(r'(?<=\b[a-z]{4})\d+', 'pore42 car3 pare7 care5')
['42', '7', '5']
# not allowed
>>> re.findall(r'(?<!car|pare)\d+', 'pore42 car3 pare7 care5')
re.error: look-behind requires fixed-width pattern
>>> re.findall(r'(?<=\b[a-z]+)\d+', 'pore42 car3 pare7 care5')
re.error: look-behind requires fixed-width pattern
>>> re.sub(r'(?<=\A|,)(?=,|\Z)', r'NA', ',1,,,two,3,,,')
re.error: look-behind requires fixed-width pattern
Variable length lookbehind can be addressed in multiple ways using the regex module. Some of the variable
length positive lookbehind cases can be simulated by using \K as a suffix to the RE that is needed as lookbehind
assertion.
>>> import regex
If \K doesn’t work out for some reason, the regex module allows using variable length lookbehind as is.
>>> regex.findall(r'(?<=\b[a-z]+)\d+', 'pore42 car3 pare7 care5')
['42', '3', '7', '5']
Negated groups
Variable length negative lookbehind can also be simulated using negative lookahead (which doesn’t have restriction
on variable length) inside a grouping and applying quantifier to match characters one by one. This also showcases
how grouping can be negated in certain cases.
# note the use of \A anchor to force matching all characters up to 'dog'
# also note that regex module is not needed here
>>> bool(re.search(r'\A((?!cat).)*dog', 'fox,cat,dog,parrot'))
False
>>> bool(re.search(r'\A((?!parrot).)*dog', 'fox,cat,dog,parrot'))
True
As lookarounds do not consume characters, you cannot use variable length lookbehind (assuming regex module)
between two patterns. Use negated groups instead.
# match if 'do' is not there between 'at' and 'par'
>>> bool(re.search(r'at((?!do).)*par', 'fox,cat,dog,parrot'))
False
# match if 'go' is not there between 'at' and 'par'
>>> bool(re.search(r'at((?!go).)*par', 'fox,cat,dog,parrot'))
True
>>> re.search(r'at((?!go).)*par', 'fox,cat,dog,parrot')[0]
'at,dog,par'
In this chapter, you learnt how to use lookarounds to create custom restrictions and also how to use negated
grouping. With this, most of the powerful features of regular expressions have been covered. The special groupings
seem never ending though; there are some more of them in coming chapters!
Exercises
a) Remove leading and trailing whitespaces from all the individual fields of these csv strings.
>>> csv1 = ' comma ,separated ,values '
>>> csv2 = 'good bad,nice ice , 42 , , stall small'
c) Match strings if it doesn’t contain whitespace or the string error between the strings qty and price
>>> str1 = '23,qty,price,42'
>>> str2 = 'qty price,oh'
>>> str3 = '3.14,qty,6,errors,9,price,3'
>>> str4 = 'qty-6,apple-56,price-234'
Flags
Just like options change the default behavior of commands used from a terminal, flags are used to change aspects of
RE. The Anchors chapter already introduced one of them. Flags can be applied to the entire RE using the flags optional
argument, or to a particular portion of RE using special groups. And both of these forms can be mixed as well.
In regular expression parlance, flags are also known as modifiers.
First up, the flag to ignore case while matching alphabets. When flags argument is used, this can be specified
as re.I or re.IGNORECASE constants.
>>> bool(re.search(r'cat', 'Cat'))
False
>>> bool(re.search(r'cat', 'Cat', flags=re.IGNORECASE))
True
As seen earlier, the re.M or re.MULTILINE flag would allow ^ and $ anchors to match line wise instead of the
whole string.
# check if any line in the string starts with 'top'
>>> bool(re.search(r'^top', "hi hello\ntop spot", flags=re.M))
True
The re.X or re.VERBOSE flag is another provision, like named capture groups, to help add clarity to RE
definitions. This flag allows using literal whitespaces for aligning purposes and adding comments after the #
character to break down a complex RE into multiple lines with comments.
# same as: rex = re.compile(r'\A((?:[^,]+,){3})([^,]+)')
# note the use of triple quoted string
>>> rex = re.compile(r'''
... \A( # group-1, captures first 3 columns
... (?:[^,]+,){3} # non-capturing group to get the 3 columns
... )
... ([^,]+) # group-2, captures 4th column
... ''', flags=re.X)
For precise definition, here’s the relevant quote from documentation:
Whitespace within the pattern is ignored, except when in a character class, or when preceded by an
unescaped backslash, or within tokens like *? , (?: or (?P<...> . When a line contains a #
that is not in a character class and is not preceded by an unescaped backslash, all characters from the
leftmost such # through the end of the line are ignored.
>>> bool(re.search(r't a', 'cat and dog', flags=re.X))
False
>>> bool(re.search(r't\ a', 'cat and dog', flags=re.X))
True
>>> bool(re.search(r't[ ]a', 'cat and dog', flags=re.X))
True
>>> bool(re.search(r't\x20a', 'cat and dog', flags=re.X))
True
To apply flags to specific portions of RE, specify them inside a special grouping syntax. This will override the flags
applied to the entire RE definition, if any. The syntax variations are:
• (?flags:RE) to apply flags only for this portion of RE
• (?-flags:RE) to negate flags only for this portion of RE
• (?flags-flags:RE) to apply and negate particular flags only for this portion of RE
• (?flags) to apply flags for the whole RE, accepted only at the start of the RE definition
In these ways, flags can be specified precisely only where they are needed. The flags are to be given as the single
letter lowercase version of the short form constants - for ex: i for re.I and so on, except L for re.L or
re.LOCALE (will be discussed later). And as can be observed from the examples below, these are not capture groups.
# case-sensitive for whole RE definition
>>> re.findall(r'Cat[a-z]*\b', 'Cat SCatTeR CATER cAts')
['Cat']
# case-insensitive only for '[a-z]*' portion
>>> re.findall(r'Cat(?i:[a-z]*)\b', 'Cat SCatTeR CATER cAts')
['Cat', 'CatTeR']
This chapter showed some of the flags that can be used to change default behavior of RE definition. And more
special groupings were covered.
Exercises
a) Delete from the string start if it is at beginning of a line up to the next occurrence of the string end at end
of a line. Match these keywords irrespective of case.
>>> para = '''\
... good start
... start working on that
... project you always wanted
... to, do not let it end
... hi there
... start and end the end
... 42
... Start and try to
... finish the End
... bye'''
hi there
42
bye
b) Explore what the re.DEBUG flag does. Here’s some examples, check their output:
• re.compile(r'\Aden|ly\Z', flags=re.DEBUG)
• re.compile(r'\b(0x)?[\da-f]+\b', flags=re.DEBUG)
• re.compile(r'\b(?:0x)?[\da-f]+\b', flags=re.I|re.DEBUG)
Unicode
So far in the book, all examples were meant for strings made up of ASCII characters only. However, re module
matching is Unicode by default. See docs.python: Unicode for a tutorial on Unicode support in Python.
Flags can be used to override the default setting. For example, the re.A or re.ASCII flag will change \b ,
\w , \d , \s and their opposites to match only ASCII characters. Use re.L or re.LOCALE to work based on
locale settings for the bytes data type.
# \w is Unicode aware
>>> re.findall(r'\w+', 'fox:αλεπού')
['fox', 'αλεπού']
However, the four characters shown below are also matched when re.I is used without re.A
>>> bool(re.search(r'[a-zA-Z]', 'İıſK'))
False
>>> bool(re.search(r'[a-zA-Z]', 'İıſK', flags=re.I))
True
Similar to named character classes and escape sequences, the regex module supports \p{} construct that
offers various predefined sets to work with Unicode strings. See regular-expressions: Unicode for details.
# extract all consecutive letters
>>> regex.findall(r'\p{L}+', 'fox:αλεπού,eagle:αετός')
['fox', 'αλεπού', 'eagle', 'αετός']
# extract all consecutive Greek letters
>>> regex.findall(r'\p{Greek}+', 'fox:αλεπού,eagle:αετός')
['αλεπού', 'αετός']
For generic Unicode character ranges, specify a 4-hexdigit codepoint using \u or an 8-hexdigit codepoint using \U
# to get codepoints for ASCII characters
>>> [hex(ord(c)) for c in 'fox']
['0x66', '0x6f', '0x78']
# to get codepoints for Unicode characters
>>> [c.encode('unicode_escape') for c in 'αλεπού']
[b'\\u03b1', b'\\u03bb', b'\\u03b5', b'\\u03c0', b'\\u03bf', b'\\u03cd']
>>> [c.encode('unicode_escape') for c in 'İıſK']
[b'\\u0130', b'\\u0131', b'\\u017f', b'\\u212a']
# character range example using \u
# all english lowercase letters
>>> re.findall(r'[\u0061-\u007a]+', 'fox:αλεπού,eagle:αετός')
['fox', 'eagle']
A comprehensive discussion on RE usage with Unicode characters is out of scope for this book. Resources like
regular-expressions: unicode and Programmers introduction to Unicode are recommended for further study.
Exercises
a) Output True or False depending on whether the input string is made up of ASCII characters or not. Consider the
input to be non-empty strings; any character that isn’t part of the 7-bit ASCII set should give False
>>> str1 = '123—456'
>>> str2 = 'good fοοd'
>>> str3 = 'happy learning!'
>>> str4 = 'İıſK'
Miscellaneous
This chapter will cover some more features and useful tricks. Except for the first two sections, the rest are all features
provided by the regex module.
Using dict
Using a function in replacement section, you can specify a dict variable to determine the replacement string
based on the matched text.
# one to one mappings
>>> d = { '1': 'one', '2': 'two', '4': 'four' }
>>> re.sub(r'[124]', lambda m: d[m[0]], '9234012')
'9two3four0onetwo'
# if the matched text doesn't exist as a key, default value will be used
>>> re.sub(r'\d', lambda m: d.get(m[0], 'X'), '9234012')
'XtwoXfourXonetwo'
For swapping two or more portions without using intermediate result, using a dict is recommended.
>>> swap = { 'cat': 'tiger', 'tiger': 'cat' }
>>> words = 'cat tiger dog tiger cat'
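The substitution itself seems to have been elided here. A sketch of how such a swap works in a single pass:

```python
import re

swap = {'cat': 'tiger', 'tiger': 'cat'}
words = 'cat tiger dog tiger cat'
# both alternatives are handled in one pass, so there is no
# intermediate string and no accidental double-swap
result = re.sub(r'cat|tiger', lambda m: swap[m[0]], words)
print(result)
# tiger cat dog cat tiger
```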
For dicts that have many entries and are likely to undergo changes during development, building the alternation list
manually is not a good choice. Also, recall that as per precedence rules, the longest length string should come first.
>>> d = { 'hand': 1, 'handy': 2, 'handful': 3, 'a^b': 4 }
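A hedged sketch of building such an alternation programmatically: sort the keys longest first, and escape them so that keys like 'a^b' containing metacharacters get matched literally:

```python
import re

d = {'hand': 1, 'handy': 2, 'handful': 3, 'a^b': 4}
# longest keys first so 'handful' isn't shadowed by 'hand';
# re.escape handles metacharacters like '^' inside the keys
alt = '|'.join(re.escape(k) for k in sorted(d, key=len, reverse=True))
result = re.sub(alt, lambda m: str(d[m[0]]), 'handful handy hands a^b')
print(result)
# 3 2 1s 4
```

Note how 'hands' became '1s': only the 'hand' portion matched, since the longer keys failed there.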
re.subn
The re.subn function returns a tuple: the modified string after substitution and the number of substitutions made. This
can be used to perform conditional operations based on whether the substitution was successful. Or, the value of the
count itself may be needed for solving the given problem.
>>> word = 'coffining'
# recursively delete 'fin'
>>> while True:
... word, cnt = re.subn(r'fin', r'', word)
... if cnt == 0:
... break
...
>>> word
'cog'
Here’s an example that won’t work if greedy quantifier is used instead of possessive quantifier.
>>> row = '421,foo,2425,42,5,foo,6,6,42'
\G anchor
The \G anchor (provided by the regex module) restricts matching from the start of string like the \A anchor. In
addition, after a match is done, the ending of that match is considered as the new anchor location. This process is
repeated and continues until the given RE fails to match (assuming multiple matches with sub , findall
etc).
>>> import regex
Recursive matching
The subexpression call special group was introduced as analogous to a function call. And in typical function fashion,
it supports recursion as well. This is useful to match nested patterns, which is usually not recommended to be done
with regular expressions. Indeed, if you are looking to parse file formats like html, xml, json, csv, etc., use a proper
parser library. But for some cases, a parser might not be available and using RE might be simpler than writing a
parser from scratch.
First up, a RE to match a set of parentheses that is not nested (termed as level-one RE for reference).
# note the use of possessive quantifier
>>> eqn0 = 'a + (b * c) - (d / e)'
>>> regex.findall(r'\([^()]++\)', eqn0)
['(b * c)', '(d / e)']
>>> regex.findall(r'\([^()]++\)', eqn1)
['(f+x)', '(3-g)']
Next, matching a set of parentheses which may optionally contain any number of non-nested sets of parentheses
(termed as level-two RE for reference).
>>> eqn1 = '((f+x)^y-42)*((3-g)^z+2)'
# note the use of non-capturing group
>>> regex.findall(r'\((?:[^()]++|\([^()]++\))++\)', eqn1)
['((f+x)^y-42)', '((3-g)^z+2)']
That looks very cryptic. It is better to use the re.X flag, both for clarity and for comparing against the recursive
version. Breaking down the RE, you can see that ( and ) have to be matched literally. Inside that, a valid string
is made up of either non-parentheses characters or a non-nested parentheses sequence (the level-one RE).
>>> lvl2 = regex.compile(r'''
... \( #literal (
... (?: #start of non-capturing group
... [^()]++ #non-parentheses characters
... | #OR
... \([^()]++\) #level-one RE
... )++ #end of non-capturing group, 1 or more times
... \) #literal )
... ''', flags=re.X)
>>> lvl2.findall(eqn1)
['((f+x)^y-42)', '((3-g)^z+2)']
>>> lvl2.findall(eqn2)
['(b)', '((c))', '((d))']
To recursively match any number of nested sets of parentheses, use a capture group and call it within the capture
group itself. Since entire RE needs to be called here, you can use the default zeroth capture group (this also helps
to avoid having to use finditer ). Comparing with level-two RE, the only change is that (?0) is used instead of
the level-one RE in the second alternation.
>>> lvln = regex.compile(r'''
... \( #literal (
... (?: #start of non-capturing group
... [^()]++ #non-parentheses characters
... | #OR
... (?0) #recursive call
... )++ #end of non-capturing group, 1 or more times
... \) #literal )
... ''', flags=re.X)
>>> lvln.findall(eqn0)
['(b * c)', '(d / e)']
>>> lvln.findall(eqn1)
['((f+x)^y-42)', '((3-g)^z+2)']
>>> lvln.findall(eqn2)
['(b)', '((c))', '(((d)))']
Named character sets
A named character set is defined by a name enclosed between [: and :] and has to be used within a character
class [] , along with any other characters as needed. Using [:^ instead of [: will negate the named character
set. See regular-expressions: POSIX Bracket for the full list, and refer to pypi: regex for notes on Unicode.
# similar to: r'\d+' or r'[0-9]+'
>>> regex.split(r'[[:digit:]]+', 'Sample123string42with777numbers')
['Sample', 'string', 'with', 'numbers']
# similar to: r'[a-zA-Z]+'
>>> regex.findall(r'[[:alpha:]]+', 'Sample123string42with777numbers')
['Sample', 'string', 'with', 'numbers']
There are two versions provided by the regex module - by default version 0 is used, which is meant for compatibility
with the re module. Many features, like set operations, require version 1 to be enabled. That can be done by
assigning regex.VERSION1 to regex.DEFAULT_VERSION (permanent) or using the (?V1) flag (temporary). To get
back the compatible version, use regex.VERSION0 or (?V0)
Set operations can be applied inside a character class between sets. They are mostly used to get the intersection or
difference between two sets, where one or both of them is a character range or a predefined character set. To aid in such
definitions, you can use [] in a nested fashion. The four operators, in increasing order of precedence, are:
• || union
• ~~ symmetric difference
• && intersection
• -- difference
Skipping matches
Sometimes, you want to change or extract all matches except particular ones. Usually, there are common
characteristics between the two types of matches that make it hard or impossible to define a RE only for the required
matches. For example: changing field values unless it is a particular name, or leaving double quoted values untouched,
and so on. To use the skipping feature, define the matches to be ignored suffixed by (*SKIP)(*FAIL) and then
define the matches required as part of an alternation. (*F) can also be used instead of (*FAIL) .
# change lowercase words other than imp or rat
>>> words = 'tiger imp goat eagle rat'
>>> regex.sub(r'\b(?:imp|rat)\b(*SKIP)(*F)|[a-z]++', r'(\g<0>)', words)
'(tiger) imp (goat) (eagle) rat'
This is a miscellaneous chapter, not able to think of a good catchy summary to write. Here’s a suggestion - write a
summary in your own words based on notes you’ve made for this chapter.
Exercises
a) Count the maximum depth of nested braces for the given string. Unbalanced or wrongly ordered braces should
return -1
>>> def max_nested_braces(ip):
##### add your solution here
>>> max_nested_braces('a*b')
0
>>> max_nested_braces('}a+b{')
-1
>>> max_nested_braces('a*b+{}')
1
>>> max_nested_braces('{{a+2}*{b+c}+e}')
2
>>> max_nested_braces('{{a+2}*{b+{c*d}}+e}')
3
>>> max_nested_braces('{{a+2}*{\n{b+{c*d}}+e*d}}')
4
>>> max_nested_braces('a*{b+c*{e*3.14}}}')
-1
b) Replace the string par with spar, spare with extra and park with garden
>>> str1 = 'apartment has a park'
##### add your solution here for str1
'aspartment has a garden'
c) Read about the POSIX flag in the regex module documentation. Is the following code snippet showing the correct
output?
>>> words = 'plink incoming tint winter in caution sentient'
>>> change = regex.compile(r'int|in|ion|ing|inco|inter|ink', flags=regex.POSIX)
>>> change.sub(r'X', words)
'plX XmX tX wX X cautX sentient'
d) For the given markdown file, replace all occurrences of the string python (irrespective of case) with the string
Python . However, any match within code blocks that start with the whole line ```python and end with the whole line
``` shouldn’t be replaced. Consider the input file to be small enough to fit in memory.
Gotchas
REs can get quite complicated and cryptic a lot of the time. But sometimes, if something is not working as expected,
it could be because of quirky corner cases.
Some RE engines match a character literally if an escape sequence is not defined for it. Python raises an exception in
such cases. Apart from the sequences defined for REs, these are allowed: \a \b \f \n \r \t \u \U \v \x \\ where
\b means backspace only inside character classes, and \u and \U are valid only in Unicode patterns.
>>> bool(re.search(r'\t', 'cat\tdog'))
True
>>> bool(re.search(r'\c', 'cat\tdog'))
re.error: bad escape \c at position 0
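The \b nuance can be verified with the re module itself:

```python
import re

# \b outside a character class is a word boundary
print(bool(re.search(r'\bcat\b', 'hi cat!')))  # True
# inside a character class, \b matches the backspace character
m = re.search(r'[\b]', 'a\bz')
print(m[0] == '\x08')  # True
```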
If line anchors are used as a standalone pattern, there is an additional start/end of line match after the last newline
character. The end of line match after the newline is straightforward to understand, as $ matches both the end of a line
and the end of the string.
# note also the use of special group for enabling multiline flag
>>> print(re.sub(r'(?m)^', r'foo ', '1\n2\n'))
foo 1
foo 2
foo
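Similarly for $ as a standalone pattern, note the extra match at the end of the string:

```python
import re

# $ matches before each newline as well as at the end of the string
print(re.sub(r'(?m)$', ' .', '1\n2\n'))
```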
Referring to the text matched by a capture group with a quantifier will give only the last repetition, not the entire match.
Use a non-capturing group inside a capture group to get the entire matched portion.
>>> re.sub(r'\A([^,]+,){3}([^,]+)', r'\1(\2)', '1,2,3,4,5,6,7', count=1)
'3,(4),5,6,7'
>>> re.sub(r'\A((?:[^,]+,){3})([^,]+)', r'\1(\2)', '1,2,3,4,5,6,7', count=1)
'1,2,3,(4),5,6,7'
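To see the difference in isolation:

```python
import re

m = re.search(r'([a-z]+,){3}', 'ab,cd,ef,')
print(m[0])  # ab,cd,ef,  (the whole match)
print(m[1])  # ef,  (only the last repetition of the capture group)
```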
When using the flags option with the regex module, the constants should also be used from the regex module. A
typical workflow is shown below:
# Using re module, unsure if a feature is available
>>> re.findall(r'[[:word:]]+', 'fox:αλεπού,eagle:αετός', flags=re.A)
__main__:1: FutureWarning: Possible nested set at position 1
[]
# Ok, convert re to regex
# Oops, output is still wrong
>>> regex.findall(r'[[:word:]]+', 'fox:αλεπού,eagle:αετός', flags=re.A)
['fox', 'αλεπού', 'eagle', 'αετός']
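The last step of that workflow is to take the flag constant from the regex module as well (a sketch, assuming regex.A restricts word matching to ASCII like re.A does for re):

```python
import regex

# with the regex module's own ASCII constant, only ASCII words match
result = regex.findall(r'[[:word:]]+', 'fox:αλεπού,eagle:αετός', flags=regex.A)
print(result)  # ['fox', 'eagle']
```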
Speaking of flags , try to always pass it as a keyword argument. Using it as a positional argument leads to a common
mistake between re.findall and re.sub due to the difference in placement. Their signatures, as per the docs, are
shown below:
re.findall(pattern, string, flags=0)
re.sub(pattern, repl, string, count=0, flags=0)
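Here's a sketch of the mistake with re.sub, where a flag passed positionally lands in the count argument:

```python
import re

# intended: case-insensitive replacement of every match
# actual: re.I ends up as the count argument, the flag is never applied,
# and the case-sensitive pattern matches nothing
print(re.sub(r'[a-z]+', 'X', 'AB-CD-EF', re.I))        # AB-CD-EF
print(re.sub(r'[a-z]+', 'X', 'AB-CD-EF', flags=re.I))  # X-X-X
```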
Hope you have found Python regular expressions an interesting topic to learn. Sooner or later, you’ll need them
if you face plenty of text processing tasks. At the same time, knowing when to use normal string methods
and when to reach for other text parsing modules is important. Happy coding!
Further Reading
Note that most of these resources are not specific to Python, so use them with caution and check if they apply to
Python’s syntax and features.