Regular Expression

This document discusses using regular expressions to find patterns of text more efficiently than manually specifying checks. It provides an example of a function that checks if a string is a phone number by manually validating the format, which requires 17 lines of code but only works for one pattern. Regular expressions can describe text patterns more concisely and flexibly, allowing programs to find matches for multiple patterns with less code. The document then demonstrates finding phone numbers using regular expressions.

Uploaded by

cocodarshi2022
Copyright
© © All Rights Reserved

7
Pattern Matching with Regular Expressions

You may be familiar with searching for text by pressing CTRL-F and typing in the words you’re looking for. Regular expressions go one step further: They allow you to specify a pattern of text to search for. You may not know a business’s exact phone number, but if you live in the United States or Canada, you know it will be three digits, followed by a hyphen, and then four more digits (and optionally, a three-digit area code at the start). This is how you, as a human, know a phone number when you see it: 415-555-1234 is a phone number, but 4,155,551,234 is not.

Regular expressions are helpful, but not many non-programmers know about them even though most modern text editors and word processors, such as Microsoft Word or OpenOffice, have find and find-and-replace features that can search based on regular expressions. Regular expressions are huge time-savers, not just for software users but also for programmers. In fact, tech writer Cory Doctorow argues that even before teaching programming, we should be teaching regular expressions:

“Knowing [regular expressions] can mean the difference between solving a problem in 3 steps and solving it in 3,000 steps. When you’re a nerd, you forget that the problems you solve with a couple keystrokes can take other people days of tedious, error-prone work to slog through.”1

In this chapter, you’ll start by writing a program to find text patterns without using regular expressions and then see how to use regular expressions to make the code much less bloated. I’ll show you basic matching with regular expressions and then move on to some more powerful features, such as string substitution and creating your own character classes. Finally, at the end of the chapter, you’ll write a program that can automatically extract phone numbers and email addresses from a block of text.

1. Cory Doctorow, “Here’s what ICT should really teach kids: how to do regular expressions,” Guardian, December 4, 2012, http://www.theguardian.com/technology/2012/dec/04/ict-teach-kids-regular-expressions/.

Finding Patterns of Text Without Regular Expressions

Say you want to find a phone number in a string. You know the pattern: three numbers, a hyphen, three numbers, a hyphen, and four numbers. Here’s an example: 415-555-4242.

Let’s use a function named isPhoneNumber() to check whether a string matches this pattern, returning either True or False. Open a new file editor window and enter the following code; then save the file as isPhoneNumber.py:

def isPhoneNumber(text):
    if len(text) != 12:                # (1)
        return False
    for i in range(0, 3):
        if not text[i].isdecimal():    # (2)
            return False
    if text[3] != '-':                 # (3)
        return False
    for i in range(4, 7):
        if not text[i].isdecimal():    # (4)
            return False
    if text[7] != '-':                 # (5)
        return False
    for i in range(8, 12):
        if not text[i].isdecimal():    # (6)
            return False
    return True                        # (7)

print('415-555-4242 is a phone number:')
print(isPhoneNumber('415-555-4242'))
print('Moshi moshi is a phone number:')
print(isPhoneNumber('Moshi moshi'))

When this program is run, the output looks like this:

415-555-4242 is a phone number:
True
Moshi moshi is a phone number:
False

The isPhoneNumber() function has code that does several checks to see whether the string in text is a valid phone number. If any of these checks fail, the function returns False. First the code checks that the string is exactly 12 characters (1). Then it checks that the area code (that is, the first three characters in text) consists of only numeric characters (2). The rest of the function checks that the string follows the pattern of a phone number: The number must have the first hyphen after the area code (3), three more numeric characters (4), then another hyphen (5), and finally four more numbers (6). If the program execution manages to get past all the checks, it returns True (7).

Calling isPhoneNumber() with the argument '415-555-4242' will return True. Calling isPhoneNumber() with 'Moshi moshi' will return False; the first test fails because 'Moshi moshi' is not 12 characters long.

You would have to add even more code to find this pattern of text in a larger string. Replace the last four print() function calls in isPhoneNumber.py with the following:

message = 'Call me at 415-555-1011 tomorrow. 415-555-9999 is my office.'
for i in range(len(message)):
    chunk = message[i:i+12]        # (1)
    if isPhoneNumber(chunk):       # (2)
        print('Phone number found: ' + chunk)
print('Done')

When this program is run, the output will look like this:

Phone number found: 415-555-1011
Phone number found: 415-555-9999
Done

On each iteration of the for loop, a new chunk of 12 characters from message is assigned to the variable chunk (1). For example, on the first iteration, i is 0, and chunk is assigned message[0:12] (that is, the string 'Call me at 4'). On the next iteration, i is 1, and chunk is assigned message[1:13] (the string 'all me at 41').

You pass chunk to isPhoneNumber() to see whether it matches the phone number pattern (2), and if so, you print the chunk.

Continue to loop through message, and eventually the 12 characters in chunk will be a phone number. The loop goes through the entire string, testing each 12-character piece and printing any chunk it finds that satisfies isPhoneNumber(). Once we’re done going through message, we print Done.

While the string in message is short in this example, it could be millions of characters long and the program would still run in less than a second. A similar program that finds phone numbers using regular expressions would also run in less than a second, but regular expressions make it quicker to write these programs.

Finding Patterns of Text with Regular Expressions

The previous phone number–finding program works, but it uses a lot of code to do something limited: The isPhoneNumber() function is 17 lines but can find only one pattern of phone numbers. What about a phone number formatted like 415.555.4242 or (415) 555-4242? What if the phone number had an extension, like 415-555-4242 x99? The isPhoneNumber() function would fail to validate them. You could add yet more code for these additional patterns, but there is an easier way.

Regular expressions, called regexes for short, are descriptions for a pattern of text. For example, a \d in a regex stands for a digit character, that is, any single numeral 0 to 9. The regex \d\d\d-\d\d\d-\d\d\d\d is used by Python to match the same text the previous isPhoneNumber() function did: a string of three numbers, a hyphen, three more numbers, another hyphen, and four numbers. Any other string would not match the \d\d\d-\d\d\d-\d\d\d\d regex.

But regular expressions can be much more sophisticated. For example, adding a 3 in curly brackets ({3}) after a pattern is like saying, “Match this pattern three times.” So the slightly shorter regex \d{3}-\d{3}-\d{4} also matches the correct phone number format.

Creating Regex Objects

All the regex functions in Python are in the re module. Enter the following into the interactive shell to import this module:

>>> import re

NOTE Most of the examples that follow in this chapter will require the re module, so remember to import it at the beginning of any script you write or any time you restart IDLE. Otherwise, you’ll get a NameError: name 're' is not defined error message.

Passing a string value representing your regular expression to re.compile() returns a Regex pattern object (or simply, a Regex object).

To create a Regex object that matches the phone number pattern, enter the following into the interactive shell. (Remember that \d means “a digit character” and \d\d\d-\d\d\d-\d\d\d\d is the regular expression for the correct phone number pattern.)



>>> phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')

Now the phoneNumRegex variable contains a Regex object.

Passing Raw Strings to re.compile()

Remember that escape characters in Python use the backslash (\). The string value '\n' represents a single newline character, not a backslash followed by a lowercase n. You need to enter the escape character \\ to print a single backslash. So '\\n' is the string that represents a backslash followed by a lowercase n. However, by putting an r before the first quote of the string value, you can mark the string as a raw string, which does not escape characters.

Since regular expressions frequently use backslashes in them, it is convenient to pass raw strings to the re.compile() function instead of typing extra backslashes. Typing r'\d\d\d-\d\d\d-\d\d\d\d' is much easier than typing '\\d\\d\\d-\\d\\d\\d-\\d\\d\\d\\d'.

Matching Regex Objects

A Regex object’s search() method searches the string it is passed for any matches to the regex. The search() method will return None if the regex pattern is not found in the string. If the pattern is found, the search() method returns a Match object. Match objects have a group() method that will return the actual matched text from the searched string. (I’ll explain groups shortly.) For example, enter the following into the interactive shell:

>>> phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
>>> mo = phoneNumRegex.search('My number is 415-555-4242.')
>>> print('Phone number found: ' + mo.group())
Phone number found: 415-555-4242

The mo variable name is just a generic name to use for Match objects. This example might seem complicated at first, but it is much shorter than the earlier isPhoneNumber.py program and does the same thing.

Here, we pass our desired pattern to re.compile() and store the resulting Regex object in phoneNumRegex. Then we call search() on phoneNumRegex and pass search() the string we want to search for a match. The result of the search gets stored in the variable mo. In this example, we know that our pattern will be found in the string, so we know that a Match object will be returned. Knowing that mo contains a Match object and not the null value None, we can call group() on mo to return the match. Writing mo.group() inside our print statement displays the whole match, 415-555-4242.

Review of Regular Expression Matching

While there are several steps to using regular expressions in Python, each step is fairly simple.

1. Import the regex module with import re.
2. Create a Regex object with the re.compile() function. (Remember to use a raw string.)
3. Pass the string you want to search into the Regex object’s search() method. This returns a Match object.
4. Call the Match object’s group() method to return a string of the actual matched text.

NOTE While I encourage you to enter the example code into the interactive shell, you should also make use of web-based regular expression testers, which can show you exactly how a regex matches a piece of text that you enter. I recommend the tester at http://regexpal.com/.

More Pattern Matching with Regular Expressions

Now that you know the basic steps for creating and finding regular expression objects with Python, you’re ready to try some of their more powerful pattern-matching capabilities.

Grouping with Parentheses

Say you want to separate the area code from the rest of the phone number. Adding parentheses will create groups in the regex: (\d\d\d)-(\d\d\d-\d\d\d\d). Then you can use the group() match object method to grab the matching text from just one group.

The first set of parentheses in a regex string will be group 1. The second set will be group 2. By passing the integer 1 or 2 to the group() match object method, you can grab different parts of the matched text. Passing 0 or nothing to the group() method will return the entire matched text. Enter the following into the interactive shell:

>>> phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
>>> mo = phoneNumRegex.search('My number is 415-555-4242.')
>>> mo.group(1)
'415'
>>> mo.group(2)
'555-4242'
>>> mo.group(0)
'415-555-4242'
>>> mo.group()
'415-555-4242'
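
The four review steps (import re, compile a Regex object from a raw string, call search(), then call group()) can also be put together in a script rather than typed into the interactive shell. The following is only a minimal sketch under stated assumptions: describePhoneNumber() is a hypothetical helper name invented for illustration, and, unlike the shell examples, it checks for None before calling group() so that text without a phone number does not raise an error.

```python
import re

# Step 2: compile a Regex object from a raw string.
# The two sets of parentheses create group 1 (area code) and group 2 (the rest).
phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')


def describePhoneNumber(text):
    """Hypothetical helper: summarize the first phone number found in text.

    Returns a dictionary of the matched parts, or None if nothing matches.
    """
    # Step 3: search() returns a Match object, or None when there is no match.
    mo = phoneNumRegex.search(text)
    if mo is None:
        return None
    # Step 4: group() returns the whole match; group(1) and group(2) the parts.
    return {
        'whole': mo.group(),
        'areaCode': mo.group(1),
        'mainNumber': mo.group(2),
    }


print(describePhoneNumber('My number is 415-555-4242.'))
print(describePhoneNumber('Moshi moshi'))
```

Checking for None first is a design choice for scripts: calling group() directly on the result of a failed search() would raise an AttributeError, because None has no group() method.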



If you would like to retrieve all the groups at once, use the groups() method (note the plural form for the name).

>>> mo.groups()
('415', '555-4242')
>>> areaCode, mainNumber = mo.groups()
>>> print(areaCode)
415
>>> print(mainNumber)
555-4242

Since mo.groups() returns a tuple of multiple values, you can use the multiple-assignment trick to assign each value to a separate variable, as in the previous areaCode, mainNumber = mo.groups() line.

Parentheses have a special meaning in regular expressions, but what do you do if you need to match a parenthesis in your text? For instance, maybe the phone numbers you are trying to match have the area code set in parentheses. In this case, you need to escape the ( and ) characters with a backslash. Enter the following into the interactive shell:

>>> phoneNumRegex = re.compile(r'(\(\d\d\d\)) (\d\d\d-\d\d\d\d)')
>>> mo = phoneNumRegex.search('My phone number is (415) 555-4242.')
>>> mo.group(1)
'(415)'
>>> mo.group(2)
'555-4242'

The \( and \) escape characters in the raw string passed to re.compile() will match actual parenthesis characters.

Matching Multiple Groups with the Pipe

The | character is called a pipe. You can use it anywhere you want to match one of many expressions. For example, the regular expression r'Batman|Tina Fey' will match either 'Batman' or 'Tina Fey'.

When both Batman and Tina Fey occur in the searched string, the first occurrence of matching text will be returned as the Match object. Enter the following into the interactive shell:

>>> heroRegex = re.compile(r'Batman|Tina Fey')
>>> mo1 = heroRegex.search('Batman and Tina Fey.')
>>> mo1.group()
'Batman'

>>> mo2 = heroRegex.search('Tina Fey and Batman.')
>>> mo2.group()
'Tina Fey'

NOTE You can find all matching occurrences with the findall() method that’s discussed in “The findall() Method” later in this chapter.

You can also use the pipe to match one of several patterns as part of your regex. For example, say you wanted to match any of the strings 'Batman', 'Batmobile', 'Batcopter', and 'Batbat'. Since all these strings start with Bat, it would be nice if you could specify that prefix only once. This can be done with parentheses. Enter the following into the interactive shell:

>>> batRegex = re.compile(r'Bat(man|mobile|copter|bat)')
>>> mo = batRegex.search('Batmobile lost a wheel')
>>> mo.group()
'Batmobile'
>>> mo.group(1)
'mobile'

The method call mo.group() returns the full matched text 'Batmobile', while mo.group(1) returns just the part of the matched text inside the first parentheses group, 'mobile'. By using the pipe character and grouping parentheses, you can specify several alternative patterns you would like your regex to match.

If you need to match an actual pipe character, escape it with a backslash, like \|.

Optional Matching with the Question Mark

Sometimes there is a pattern that you want to match only optionally. That is, the regex should find a match whether or not that bit of text is there. The ? character flags the group that precedes it as an optional part of the pattern. For example, enter the following into the interactive shell:

>>> batRegex = re.compile(r'Bat(wo)?man')
>>> mo1 = batRegex.search('The Adventures of Batman')
>>> mo1.group()
'Batman'

>>> mo2 = batRegex.search('The Adventures of Batwoman')
>>> mo2.group()
'Batwoman'

The (wo)? part of the regular expression means that the pattern wo is an optional group. The regex will match text that has zero instances or one instance of wo in it. This is why the regex matches both 'Batwoman' and 'Batman'.

Using the earlier phone number example, you can make the regex look for phone numbers that do or do not have an area code. Enter the following into the interactive shell:

>>> phoneRegex = re.compile(r'(\d\d\d-)?\d\d\d-\d\d\d\d')
>>> mo1 = phoneRegex.search('My number is 415-555-4242')
>>> mo1.group()
'415-555-4242'



>>> mo2 = phoneRegex.search('My number is 555-4242')
>>> mo2.group()
'555-4242'

You can think of the ? as saying, “Match zero or one of the group preceding this question mark.”

If you need to match an actual question mark character, escape it with \?.

Matching Zero or More with the Star

The * (called the star or asterisk) means “match zero or more”: the group that precedes the star can occur any number of times in the text. It can be completely absent or repeated over and over again. Let’s look at the Batman example again.

>>> batRegex = re.compile(r'Bat(wo)*man')
>>> mo1 = batRegex.search('The Adventures of Batman')
>>> mo1.group()
'Batman'

>>> mo2 = batRegex.search('The Adventures of Batwoman')
>>> mo2.group()
'Batwoman'

>>> mo3 = batRegex.search('The Adventures of Batwowowowoman')
>>> mo3.group()
'Batwowowowoman'

For 'Batman', the (wo)* part of the regex matches zero instances of wo in the string; for 'Batwoman', the (wo)* matches one instance of wo; and for 'Batwowowowoman', (wo)* matches four instances of wo.

If you need to match an actual star character, prefix the star in the regular expression with a backslash, \*.

Matching One or More with the Plus

While * means “match zero or more,” the + (or plus) means “match one or more.” Unlike the star, which does not require its group to appear in the matched string, the group preceding a plus must appear at least once. It is not optional. Enter the following into the interactive shell, and compare it with the star regexes in the previous section:

>>> batRegex = re.compile(r'Bat(wo)+man')
>>> mo1 = batRegex.search('The Adventures of Batwoman')
>>> mo1.group()
'Batwoman'

>>> mo2 = batRegex.search('The Adventures of Batwowowowoman')
>>> mo2.group()
'Batwowowowoman'

>>> mo3 = batRegex.search('The Adventures of Batman')
>>> mo3 == None
True

The regex Bat(wo)+man will not match the string 'The Adventures of Batman' because at least one wo is required by the plus sign.

If you need to match an actual plus sign character, prefix the plus sign with a backslash to escape it: \+.

Matching Specific Repetitions with Curly Brackets

If you have a group that you want to repeat a specific number of times, follow the group in your regex with a number in curly brackets. For example, the regex (Ha){3} will match the string 'HaHaHa', but it will not match 'HaHa', since the latter has only two repeats of the (Ha) group.

Instead of one number, you can specify a range by writing a minimum, a comma, and a maximum in between the curly brackets. For example, the regex (Ha){3,5} will match 'HaHaHa', 'HaHaHaHa', and 'HaHaHaHaHa'.

You can also leave out the first or second number in the curly brackets to leave the minimum or maximum unbounded. For example, (Ha){3,} will match three or more instances of the (Ha) group, while (Ha){,5} will match zero to five instances. Curly brackets can help make your regular expressions shorter. These two regular expressions match identical patterns:

(Ha){3}
(Ha)(Ha)(Ha)

And these two regular expressions also match identical patterns:

(Ha){3,5}
((Ha)(Ha)(Ha))|((Ha)(Ha)(Ha)(Ha))|((Ha)(Ha)(Ha)(Ha)(Ha))

Enter the following into the interactive shell:

>>> haRegex = re.compile(r'(Ha){3}')
>>> mo1 = haRegex.search('HaHaHa')
>>> mo1.group()
'HaHaHa'

>>> mo2 = haRegex.search('Ha')
>>> mo2 == None
True

Here, (Ha){3} matches 'HaHaHa' but not 'Ha'. Since it doesn’t match 'Ha', search() returns None.

Greedy and Nongreedy Matching

Since (Ha){3,5} can match three, four, or five instances of Ha in the string 'HaHaHaHaHa', you may wonder why the Match object’s call to group() in the previous curly bracket example returns 'HaHaHaHaHa' instead of the shorter possibilities. After all, 'HaHaHa' and 'HaHaHaHa' are also valid matches of the regular expression (Ha){3,5}.

Python’s regular expressions are greedy by default, which means that in ambiguous situations they will match the longest string possible. The nongreedy version of the curly brackets, which matches the shortest string possible, has the closing curly bracket followed by a question mark.

Enter the following into the interactive shell, and notice the difference between the greedy and nongreedy forms of the curly brackets searching the same string:

>>> greedyHaRegex = re.compile(r'(Ha){3,5}')
>>> mo1 = greedyHaRegex.search('HaHaHaHaHa')
>>> mo1.group()
'HaHaHaHaHa'

>>> nongreedyHaRegex = re.compile(r'(Ha){3,5}?')
>>> mo2 = nongreedyHaRegex.search('HaHaHaHaHa')
>>> mo2.group()
'HaHaHa'

Note that the question mark can have two meanings in regular expressions: declaring a nongreedy match or flagging an optional group. These meanings are entirely unrelated.

The findall() Method

In addition to the search() method, Regex objects also have a findall() method. While search() will return a Match object of the first matched text in the searched string, the findall() method will return the strings of every match in the searched string. To see how search() returns a Match object only on the first instance of matching text, enter the following into the interactive shell:

>>> phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
>>> mo = phoneNumRegex.search('Cell: 415-555-9999 Work: 212-555-0000')
>>> mo.group()
'415-555-9999'

On the other hand, findall() will not return a Match object but a list of strings, as long as there are no groups in the regular expression. Each string in the list is a piece of the searched text that matched the regular expression. Enter the following into the interactive shell:

>>> phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d') # has no groups
>>> phoneNumRegex.findall('Cell: 415-555-9999 Work: 212-555-0000')
['415-555-9999', '212-555-0000']

If there are groups in the regular expression, then findall() will return a list of tuples. Each tuple represents a found match, and its items are the matched strings for each group in the regex. To see findall() in action, enter the following into the interactive shell (notice that the regular expression being compiled now has groups in parentheses):

>>> phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)') # has groups
>>> phoneNumRegex.findall('Cell: 415-555-9999 Work: 212-555-0000')
[('415', '555', '9999'), ('212', '555', '0000')]

To summarize what the findall() method returns, remember the following:

1. When called on a regex with no groups, such as \d\d\d-\d\d\d-\d\d\d\d, the method findall() returns a list of string matches, such as ['415-555-9999', '212-555-0000'].
2. When called on a regex that has groups, such as (\d\d\d)-(\d\d\d)-(\d\d\d\d), the method findall() returns a list of tuples of strings (one string for each group), such as [('415', '555', '9999'), ('212', '555', '0000')].

Character Classes

In the earlier phone number regex example, you learned that \d could stand for any numeric digit. That is, \d is shorthand for the regular expression (0|1|2|3|4|5|6|7|8|9). There are many such shorthand character classes, as shown in Table 7-1.

Table 7-1: Shorthand Codes for Common Character Classes

Shorthand character class   Represents
\d                          Any numeric digit from 0 to 9.
\D                          Any character that is not a numeric digit from 0 to 9.
\w                          Any letter, numeric digit, or the underscore character. (Think of this as matching “word” characters.)
\W                          Any character that is not a letter, numeric digit, or the underscore character.
\s                          Any space, tab, or newline character. (Think of this as matching “space” characters.)
\S                          Any character that is not a space, tab, or newline.

Character classes are nice for shortening regular expressions. The character class [0-5] will match only the numbers 0 to 5; this is much shorter than typing (0|1|2|3|4|5).

For example, enter the following into the interactive shell:

>>> xmasRegex = re.compile(r'\d+\s\w+')
>>> xmasRegex.findall('12 drummers, 11 pipers, 10 lords, 9 ladies, 8 maids, 7 swans, 6 geese, 5 rings, 4 birds, 3 hens, 2 doves, 1 partridge')
['12 drummers', '11 pipers', '10 lords', '9 ladies', '8 maids', '7 swans', '6 geese', '5 rings', '4 birds', '3 hens', '2 doves', '1 partridge']



The regular expression \d+\s\w+ will match text that has one or more numeric digits (\d+), followed by a whitespace character (\s), followed by one or more letter/digit/underscore characters (\w+). The findall() method returns all matching strings of the regex pattern in a list.

Making Your Own Character Classes

There are times when you want to match a set of characters but the shorthand character classes (\d, \w, \s, and so on) are too broad. You can define your own character class using square brackets. For example, the character class [aeiouAEIOU] will match any vowel, both lowercase and uppercase. Enter the following into the interactive shell:

>>> vowelRegex = re.compile(r'[aeiouAEIOU]')
>>> vowelRegex.findall('RoboCop eats baby food. BABY FOOD.')
['o', 'o', 'o', 'e', 'a', 'a', 'o', 'o', 'A', 'O', 'O']

You can also include ranges of letters or numbers by using a hyphen. For example, the character class [a-zA-Z0-9] will match all lowercase letters, uppercase letters, and numbers.

Note that inside the square brackets, the normal regular expression symbols are not interpreted as such. This means you do not need to escape the ., *, ?, or () characters with a preceding backslash. For example, the character class [0-5.] will match digits 0 to 5 and a period. You do not need to write it as [0-5\.].

By placing a caret character (^) just after the character class’s opening bracket, you can make a negative character class. A negative character class will match all the characters that are not in the character class. For example, enter the following into the interactive shell:

>>> consonantRegex = re.compile(r'[^aeiouAEIOU]')
>>> consonantRegex.findall('RoboCop eats baby food. BABY FOOD.')
['R', 'b', 'C', 'p', ' ', 't', 's', ' ', 'b', 'b', 'y', ' ', 'f', 'd', '.', ' ', 'B', 'B', 'Y', ' ', 'F', 'D', '.']

Now, instead of matching every vowel, we’re matching every character that isn’t a vowel.

The Caret and Dollar Sign Characters

You can also use the caret symbol (^) at the start of a regex to indicate that a match must occur at the beginning of the searched text. Likewise, you can put a dollar sign ($) at the end of the regex to indicate the string must end with this regex pattern. And you can use the ^ and $ together to indicate that the entire string must match the regex; that is, it’s not enough for a match to be made on some subset of the string.

For example, the r'^Hello' regular expression string matches strings that begin with 'Hello'. Enter the following into the interactive shell:

>>> beginsWithHello = re.compile(r'^Hello')
>>> beginsWithHello.search('Hello world!')
<_sre.SRE_Match object; span=(0, 5), match='Hello'>
>>> beginsWithHello.search('He said hello.') == None
True

The r'\d$' regular expression string matches strings that end with a numeric character from 0 to 9. Enter the following into the interactive shell:

>>> endsWithNumber = re.compile(r'\d$')
>>> endsWithNumber.search('Your number is 42')
<_sre.SRE_Match object; span=(16, 17), match='2'>
>>> endsWithNumber.search('Your number is forty two.') == None
True

The r'^\d+$' regular expression string matches strings that both begin and end with one or more numeric characters. Enter the following into the interactive shell:

>>> wholeStringIsNum = re.compile(r'^\d+$')
>>> wholeStringIsNum.search('1234567890')
<_sre.SRE_Match object; span=(0, 10), match='1234567890'>
>>> wholeStringIsNum.search('12345xyz67890') == None
True
>>> wholeStringIsNum.search('12 34567890') == None
True

The last two search() calls in the previous interactive shell example demonstrate how the entire string must match the regex if ^ and $ are used.

I always confuse the meanings of these two symbols, so I use the mnemonic “Carrots cost dollars” to remind myself that the caret comes first and the dollar sign comes last.

The Wildcard Character

The . (or dot) character in a regular expression is called a wildcard and will match any character except for a newline. For example, enter the following into the interactive shell:

>>> atRegex = re.compile(r'.at')
>>> atRegex.findall('The cat in the hat sat on the flat mat.')
['cat', 'hat', 'sat', 'lat', 'mat']



Remember that the dot character will match just one character, which
is why the match for the text flat in the previous example matched only lat.
To match an actual dot, escape the dot with a backslash: \..

Matching Everything with Dot-Star
Sometimes you will want to match everything and anything. For example,
say you want to match the string 'First Name:', followed by any and all text,
followed by 'Last Name:', and then followed by anything again. You can
use the dot-star (.*) to stand in for that “anything.” Remember that the
dot character means “any single character except the newline,” and the
star character means “zero or more of the preceding character.”
Enter the following into the interactive shell:

>>> nameRegex = re.compile(r'First Name: (.*) Last Name: (.*)')
>>> mo = nameRegex.search('First Name: Al Last Name: Sweigart')
>>> mo.group(1)
'Al'
>>> mo.group(2)
'Sweigart'

The dot-star uses greedy mode: It will always try to match as much text as
possible. To match any and all text in a nongreedy fashion, use the dot, star,
and question mark (.*?). Like with curly brackets, the question mark tells
Python to match in a nongreedy way.
Enter the following into the interactive shell to see the difference
between the greedy and nongreedy versions:

>>> nongreedyRegex = re.compile(r'<.*?>')
>>> mo = nongreedyRegex.search('<To serve man> for dinner.>')
>>> mo.group()
'<To serve man>'

>>> greedyRegex = re.compile(r'<.*>')
>>> mo = greedyRegex.search('<To serve man> for dinner.>')
>>> mo.group()
'<To serve man> for dinner.>'

Both regexes roughly translate to “Match an opening angle bracket,
followed by anything, followed by a closing angle bracket.” But the string
'<To serve man> for dinner.>' has two possible matches for the closing angle
bracket. In the nongreedy version of the regex, Python matches the shortest
possible string: '<To serve man>'. In the greedy version, Python matches the
longest possible string: '<To serve man> for dinner.>'.

Matching Newlines with the Dot Character
The dot-star will match everything except a newline. By passing re.DOTALL as
the second argument to re.compile(), you can make the dot character match
all characters, including the newline character.
Enter the following into the interactive shell:

>>> noNewlineRegex = re.compile('.*')
>>> noNewlineRegex.search('Serve the public trust.\nProtect the innocent.\nUphold the law.').group()
'Serve the public trust.'

>>> newlineRegex = re.compile('.*', re.DOTALL)
>>> newlineRegex.search('Serve the public trust.\nProtect the innocent.\nUphold the law.').group()
'Serve the public trust.\nProtect the innocent.\nUphold the law.'

The regex noNewlineRegex, which did not have re.DOTALL passed to the
re.compile() call that created it, will match everything only up to the first
newline character, whereas newlineRegex, which did have re.DOTALL passed to
re.compile(), matches everything. This is why the newlineRegex.search() call
matches the full string, including its newline characters.

Review of Regex Symbols
This chapter covered a lot of notation, so here’s a quick review of what
you learned:

• The ? matches zero or one of the preceding group.
• The * matches zero or more of the preceding group.
• The + matches one or more of the preceding group.
• The {n} matches exactly n of the preceding group.
• The {n,} matches n or more of the preceding group.
• The {,m} matches 0 to m of the preceding group.
• The {n,m} matches at least n and at most m of the preceding group.
• {n,m}? or *? or +? performs a nongreedy match of the preceding group.
• ^spam means the string must begin with spam.
• spam$ means the string must end with spam.
• The . matches any character, except newline characters.
• \d, \w, and \s match a digit, word, or space character, respectively.
• \D, \W, and \S match anything except a digit, word, or space character,
  respectively.
• [abc] matches any character between the brackets (such as a, b, or c).
• [^abc] matches any character that isn’t between the brackets.
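
If you would rather check a few of the symbols reviewed above in a script than in the interactive shell, a minimal sketch might look like the following. The variable names (text, three_digits, lines) and the short test strings are made up for this illustration; only the re functions and the '<To serve man> for dinner.>' string come from the chapter.

```python
import re

# Greedy versus nongreedy dot-star on the string used earlier.
text = '<To serve man> for dinner.>'
greedy = re.compile(r'<.*>').search(text).group()
nongreedy = re.compile(r'<.*?>').search(text).group()
print(greedy)     # <To serve man> for dinner.>
print(nongreedy)  # <To serve man>

# ^ and $ anchor the pattern to the whole string; {3} means exactly three.
three_digits = re.compile(r'^\d{3}$')
print(three_digits.search('123') is not None)   # True
print(three_digits.search('1234') is not None)  # False

# A character class and its negation (illustrative input string).
in_class = re.compile(r'[abc]').findall('a cab')
not_in_class = re.compile(r'[^abc]').findall('a cab')
print(in_class)      # ['a', 'c', 'a', 'b']
print(not_in_class)  # [' ']

# The dot stops at a newline unless re.DOTALL is passed.
lines = 'one\ntwo'
without_dotall = re.compile('.*').search(lines).group()
with_dotall = re.compile('.*', re.DOTALL).search(lines).group()
print(repr(without_dotall))  # 'one'
print(repr(with_dotall))     # 'one\ntwo'
```

Each print() call shows one rule from the list, so you can change a pattern and rerun the script to see how the match changes.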



Case-Insensitive Matching
Normally, regular expressions match text with the exact casing you specify.
For example, the following regexes match completely different strings:

>>> regex1 = re.compile('RoboCop')
>>> regex2 = re.compile('ROBOCOP')
>>> regex3 = re.compile('robOcop')
>>> regex4 = re.compile('RobocOp')

But sometimes you care only about matching the letters without worrying
whether they’re uppercase or lowercase. To make your regex case-insensitive,
you can pass re.IGNORECASE or re.I as a second argument to re.compile().
Enter the following into the interactive shell:

>>> robocop = re.compile(r'robocop', re.I)
>>> robocop.search('RoboCop is part man, part machine, all cop.').group()
'RoboCop'

>>> robocop.search('ROBOCOP protects the innocent.').group()
'ROBOCOP'

>>> robocop.search('Al, why does your programming book talk about robocop so much?').group()
'robocop'

Substituting Strings with the sub() Method
Regular expressions can not only find text patterns but can also substitute
new text in place of those patterns. The sub() method for Regex objects is
passed two arguments. The first argument is a string to replace any matches.
The second is the string to search, that is, the text the regular expression
is applied to. The sub() method returns a string with the substitutions applied.
For example, enter the following into the interactive shell:

>>> namesRegex = re.compile(r'Agent \w+')
>>> namesRegex.sub('CENSORED', 'Agent Alice gave the secret documents to Agent Bob.')
'CENSORED gave the secret documents to CENSORED.'

Sometimes you may need to use the matched text itself as part of the
substitution. In the first argument to sub(), you can type \1, \2, \3, and so
on, to mean “Enter the text of group 1, 2, 3, and so on, in the substitution.”
For example, say you want to censor the names of the secret agents by
showing just the first letters of their names. To do this, you could use the
regex Agent (\w)\w* and pass r'\1****' as the first argument to sub(). The \1
in that string will be replaced by whatever text was matched by group 1,
that is, the (\w) group of the regular expression.

>>> agentNamesRegex = re.compile(r'Agent (\w)\w*')
>>> agentNamesRegex.sub(r'\1****', 'Agent Alice told Agent Carol that Agent Eve knew Agent Bob was a double agent.')
'A**** told C**** that E**** knew B**** was a double agent.'

Managing Complex Regexes
Regular expressions are fine if the text pattern you need to match is simple.
But matching complicated text patterns might require long, convoluted
regular expressions. You can mitigate this by telling the re.compile() function
to ignore whitespace and comments inside the regular expression string.
This “verbose mode” can be enabled by passing the variable re.VERBOSE as
the second argument to re.compile().
Now instead of a hard-to-read regular expression like this:

phoneRegex = re.compile(r'((\d{3}|\(\d{3}\))?(\s|-|\.)?\d{3}(\s|-|\.)\d{4}(\s*(ext|x|ext\.)\s*\d{2,5})?)')

you can spread the regular expression over multiple lines with comments
like this:

phoneRegex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?              # area code
    (\s|-|\.)?                      # separator
    \d{3}                           # first 3 digits
    (\s|-|\.)                       # separator
    \d{4}                           # last 4 digits
    (\s*(ext|x|ext\.)\s*\d{2,5})?   # extension
    )''', re.VERBOSE)

Note how the previous example uses the triple-quote syntax (''') to
create a multiline string so that you can spread the regular expression
definition over many lines, making it much more legible.
The comment rules inside the regular expression string are the same as
regular Python code: The # symbol and everything after it to the end of the
line are ignored. Also, the extra spaces inside the multiline string for the
regular expression are not considered part of the text pattern to be matched.
This lets you organize the regular expression so it’s easier to read.

Combining re.IGNORECASE, re.DOTALL, and re.VERBOSE
What if you want to use re.VERBOSE to write comments in your regular
expression but also want to use re.IGNORECASE to ignore capitalization?
Unfortunately, the re.compile() function takes only a single value as its
second argument. You can get around this limitation by combining the
re.IGNORECASE, re.DOTALL, and re.VERBOSE variables using the pipe character (|),
which in this context is known as the bitwise or operator.
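
As a rough illustration of what the bitwise or does with these flag values, the sketch below combines all three into one value and passes it to re.compile(). The names flags and pattern and the two-line test string are invented for this example; the re constants themselves are the ones named above.

```python
import re

# Combine three options into a single value for the second argument.
flags = re.IGNORECASE | re.DOTALL | re.VERBOSE

# re.VERBOSE lets the pattern carry whitespace and comments.
pattern = re.compile(r'''
    robocop   # the name, in any casing thanks to re.IGNORECASE
    .*        # everything after it; re.DOTALL lets the dot cross newlines
    ''', flags)

match = pattern.search('ROBOCOP protects\nthe innocent.')
print(repr(match.group()))  # 'ROBOCOP protects\nthe innocent.'

# Each option can still be detected inside the combined value.
print(bool(flags & re.DOTALL))  # True
```

The combined value behaves like a set of switches: or-ing adds a switch, and the regex engine checks each one independently.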



So if you want a regular expression that’s case-insensitive and lets the
dot character match newlines, you would form your re.compile() call
like this:

>>> someRegexValue = re.compile('foo', re.IGNORECASE | re.DOTALL)

All three options for the second argument will look like this:

>>> someRegexValue = re.compile('foo', re.IGNORECASE | re.DOTALL | re.VERBOSE)

This syntax is a little old-fashioned and originates from early versions
of Python. The details of the bitwise operators are beyond the scope of this
book, but check out the resources at http://nostarch.com/automatestuff/ for
more information. You can also pass other options for the second argument;
they’re uncommon, but you can read more about them in the resources, too.

Project: Phone Number and Email Address Extractor
Say you have the boring task of finding every phone number and email
address in a long web page or document. If you manually scroll through
the page, you might end up searching for a long time. But if you had a
program that could search the text in your clipboard for phone numbers and
email addresses, you could simply press ctrl-A to select all the text, press
ctrl-C to copy it to the clipboard, and then run your program. It could
replace the text on the clipboard with just the phone numbers and email
addresses it finds.
Whenever you’re tackling a new project, it can be tempting to dive right
into writing code. But more often than not, it’s best to take a step back and
consider the bigger picture. I recommend first drawing up a high-level plan
for what your program needs to do. Don’t think about the actual code yet;
you can worry about that later. Right now, stick to broad strokes.
For example, your phone and email address extractor will need to do
the following:

• Get the text off the clipboard.
• Find all phone numbers and email addresses in the text.
• Paste them onto the clipboard.

Now you can start thinking about how this might work in code. The
code will need to do the following:

• Use the pyperclip module to copy and paste strings.
• Create two regexes, one for matching phone numbers and the other for
  matching email addresses.
• Find all matches, not just the first match, of both regexes.
• Neatly format the matched strings into a single string to paste.
• Display some kind of message if no matches were found in the text.

This list is like a road map for the project. As you write the code, you
can focus on each of these steps separately. Each step is fairly manageable
and expressed in terms of things you already know how to do in Python.

Step 1: Create a Regex for Phone Numbers
First, you have to create a regular expression to search for phone numbers.
Create a new file, enter the following, and save it as phoneAndEmail.py:

#! python3
# phoneAndEmail.py - Finds phone numbers and email addresses on the clipboard.

import pyperclip, re

phoneRegex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?                # area code
    (\s|-|\.)?                        # separator
    (\d{3})                           # first 3 digits
    (\s|-|\.)                         # separator
    (\d{4})                           # last 4 digits
    (\s*(ext|x|ext\.)\s*(\d{2,5}))?   # extension
    )''', re.VERBOSE)

# TODO: Create email regex.

# TODO: Find matches in clipboard text.

# TODO: Copy results to the clipboard.

The TODO comments are just a skeleton for the program. They’ll be
replaced as you write the actual code.
The phone number begins with an optional area code, so the area code
group is followed with a question mark. Since the area code can be just
three digits (that is, \d{3}) or three digits within parentheses (that is, \(\d{3}\)),
you should have a pipe joining those parts. You can add the regex comment
# area code to this part of the multiline string to help you remember what
(\d{3}|\(\d{3}\))? is supposed to match.
The phone number separator character can be a space (\s), hyphen (-),
or period (.), so these parts should also be joined by pipes. The next few
parts of the regular expression are straightforward: three digits, followed
by another separator, followed by four digits. The last part is an optional
extension made up of any number of spaces followed by ext, x, or ext.,
followed by two to five digits.

Step 2: Create a Regex for Email Addresses
You will also need a regular expression that can match email addresses.
Make your program look like the following:

#! python3
# phoneAndEmail.py - Finds phone numbers and email addresses on the clipboard.



import pyperclip, re

phoneRegex = re.compile(r'''(
--snip--

# Create email regex.
emailRegex = re.compile(r'''(
    [a-zA-Z0-9._%+-]+      # ❶ username
    @                      # ❷ @ symbol
    [a-zA-Z0-9.-]+         # ❸ domain name
    (\.[a-zA-Z]{2,4})      # dot-something
    )''', re.VERBOSE)

# TODO: Find matches in clipboard text.

# TODO: Copy results to the clipboard.

The username part of the email address ❶ is one or more characters
that can be any of the following: lowercase and uppercase letters, numbers,
a dot, an underscore, a percent sign, a plus sign, or a hyphen. You can put
all of these into a character class: [a-zA-Z0-9._%+-].
The domain and username are separated by an @ symbol ❷. The
domain name ❸ has a slightly less permissive character class with only
letters, numbers, periods, and hyphens: [a-zA-Z0-9.-]. And last will be
the “dot-com” part (technically known as the top-level domain), which can
really be dot-anything. This is between two and four characters.
The format for email addresses has a lot of weird rules. This regular
expression won’t match every possible valid email address, but it’ll match
almost any typical email address you’ll encounter.

Step 3: Find All Matches in the Clipboard Text
Now that you have specified the regular expressions for phone numbers
and email addresses, you can let Python’s re module do the hard work of
finding all the matches on the clipboard. The pyperclip.paste() function
will get a string value of the text on the clipboard, and the findall() regex
method will return a list of tuples.
Make your program look like the following:

#! python3
# phoneAndEmail.py - Finds phone numbers and email addresses on the clipboard.

import pyperclip, re

phoneRegex = re.compile(r'''(
--snip--

# Find matches in clipboard text.
text = str(pyperclip.paste())
matches = []                                   # ❶
for groups in phoneRegex.findall(text):        # ❷
    phoneNum = '-'.join([groups[1], groups[3], groups[5]])
    if groups[8] != '':
        phoneNum += ' x' + groups[8]
    matches.append(phoneNum)
for groups in emailRegex.findall(text):        # ❸
    matches.append(groups[0])

# TODO: Copy results to the clipboard.

There is one tuple for each match, and each tuple contains strings for
each group in the regular expression. Remember that group 0 matches the
entire regular expression, so the group at index 0 of the tuple is the one you
are interested in.
As you can see at ❶, you’ll store the matches in a list variable named
matches. It starts off as an empty list, and a couple of for loops append to it.
For the email addresses, you append group 0 of each match ❸. For the matched
phone numbers, you don’t want to just append group 0. While the program detects
phone numbers in several formats, you want the phone number appended
to be in a single, standard format. The phoneNum variable contains a string
built from groups 1, 3, 5, and 8 of the matched text ❷. (These groups are
the area code, first three digits, last four digits, and extension.)

Step 4: Join the Matches into a String for the Clipboard
Now that you have the email addresses and phone numbers as a list of strings
in matches, you want to put them on the clipboard. The pyperclip.copy()
function takes only a single string value, not a list of strings, so you call the
join() method on matches.
To make it easier to see that the program is working, let’s print any
matches you find to the terminal. And if no phone numbers or email
addresses were found, the program should tell the user this.
Make your program look like the following:

#! python3
# phoneAndEmail.py - Finds phone numbers and email addresses on the clipboard.

--snip--
for groups in emailRegex.findall(text):
    matches.append(groups[0])

# Copy results to the clipboard.
if len(matches) > 0:
    pyperclip.copy('\n'.join(matches))
    print('Copied to clipboard:')
    print('\n'.join(matches))
else:
    print('No phone numbers or email addresses found.')
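
If you want to try the matching logic without touching the clipboard, one option is to wrap the two loops in a function that takes the text as an argument. The sketch below is an assumption-laden variation, not the project file itself: the extract_contacts() function, the snake_case variable names, and the sample string (with its made-up numbers and example.com address) are all hypothetical, and the dot in ext. is escaped so that it matches only a literal period.

```python
import re

phone_regex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?                # area code
    (\s|-|\.)?                        # separator
    (\d{3})                           # first 3 digits
    (\s|-|\.)                         # separator
    (\d{4})                           # last 4 digits
    (\s*(ext|x|ext\.)\s*(\d{2,5}))?   # extension
    )''', re.VERBOSE)

email_regex = re.compile(r'''(
    [a-zA-Z0-9._%+-]+      # username
    @                      # @ symbol
    [a-zA-Z0-9.-]+         # domain name
    (\.[a-zA-Z]{2,4})      # dot-something
    )''', re.VERBOSE)


def extract_contacts(text):
    """Return phone numbers (in one standard format) and emails found in text."""
    matches = []
    for groups in phone_regex.findall(text):
        # Indexes 1, 3, and 5 hold the area code, first 3, and last 4 digits.
        phone_num = '-'.join([groups[1], groups[3], groups[5]])
        if groups[8] != '':
            phone_num += ' x' + groups[8]  # index 8 holds the extension digits
        matches.append(phone_num)
    for groups in email_regex.findall(text):
        matches.append(groups[0])  # index 0 holds the whole email address
    return matches


# Hypothetical sample text standing in for the clipboard contents.
sample = 'Call 415-555-1234 ext 42 or 800.555.0199, or write to info@example.com.'
found = extract_contacts(sample)
print(found)  # ['415-555-1234 x42', '800-555-0199', 'info@example.com']
```

Because the function returns a list instead of copying to the clipboard, you can feed it any string and inspect the result directly; the clipboard calls can then stay in a thin layer around it.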



Running the Program
For an example, open your web browser to the No Starch Press contact page
at http://www.nostarch.com/contactus.htm, press ctrl-A to select all the text on
the page, and press ctrl-C to copy it to the clipboard. When you run this
program, the output will look something like this:

Copied to clipboard:
800-420-7240
415-863-9900
415-863-9950
[email protected]
[email protected]
[email protected]
[email protected]

Ideas for Similar Programs
Identifying patterns of text (and possibly substituting them with the sub()
method) has many different potential applications.

• Find website URLs that begin with http:// or https://.
• Clean up dates in different date formats (such as 3/14/2015, 03-14-2015,
  and 2015/3/14) by replacing them with dates in a single, standard format.
• Remove sensitive information such as Social Security or credit card
  numbers.
• Find common typos such as multiple spaces between words, accidentally
  accidentally repeated words, or multiple exclamation marks at the
  end of sentences. Those are annoying!!

Summary
While a computer can search for text quickly, it must be told precisely what
to look for. Regular expressions allow you to specify the precise patterns of
characters you are looking for. In fact, some word processing and spreadsheet
applications provide find-and-replace features that allow you to search
using regular expressions.
The re module that comes with Python lets you compile Regex objects.
These values have several methods: search() to find a single match, findall()
to find all matching instances, and sub() to do a find-and-replace
substitution of text.
There’s a bit more to regular expression syntax than is described in
this chapter. You can find out more in the official Python documentation
at http://docs.python.org/3/library/re.html. The tutorial website
http://www.regular-expressions.info/ is also a useful resource.
Now that you have expertise manipulating and matching strings, it’s
time to dive into how to read from and write to files on your computer’s
hard drive.

Practice Questions
1. What is the function that creates Regex objects?
2. Why are raw strings often used when creating Regex objects?
3. What does the search() method return?
4. How do you get the actual strings that match the pattern from a Match
object?
5. In the regex created from r'(\d\d\d)-(\d\d\d-\d\d\d\d)', what does
group 0 cover? Group 1? Group 2?
6. Parentheses and periods have specific meanings in regular expression
syntax. How would you specify that you want a regex to match actual
parentheses and period characters?
7. The findall() method returns a list of strings or a list of tuples of
strings. What makes it return one or the other?
8. What does the | character signify in regular expressions?
9. What two things does the ? character signify in regular expressions?
10. What is the difference between the + and * characters in regular
expressions?
11. What is the difference between {3} and {3,5} in regular expressions?
12. What do the \d, \w, and \s shorthand character classes signify in regular
expressions?
13. What do the \D, \W, and \S shorthand character classes signify in regular
expressions?
14. How do you make a regular expression case-insensitive?
15. What does the . character normally match? What does it match if
re.DOTALL is passed as the second argument to re.compile()?
16. What is the difference between .* and .*??
17. What is the character class syntax to match all numbers and lowercase
letters?
18. If numRegex = re.compile(r'\d+'), what will numRegex.sub('X', '12 drummers,
11 pipers, five rings, 3 hens') return?
19. What does passing re.VERBOSE as the second argument to re.compile()
allow you to do?
20. How would you write a regex that matches a number with commas for
every three digits? It must match the following:
• '42'
• '1,234'
• '6,368,745'
but not the following:
• '12,34,567' (which has only two digits between the commas)
• '1234' (which lacks commas)



21. How would you write a regex that matches the full name of someone
whose last name is Nakamoto? You can assume that the first name that
comes before it will always be one word that begins with a capital letter.
The regex must match the following:
• 'Satoshi Nakamoto'
• 'Alice Nakamoto'
• 'RoboCop Nakamoto'
but not the following:
• 'satoshi Nakamoto' (where the first name is not capitalized)
• 'Mr. Nakamoto' (where the preceding word has a nonletter character)
• 'Nakamoto' (which has no first name)
• 'Satoshi nakamoto' (where Nakamoto is not capitalized)
22. How would you write a regex that matches a sentence where the first
word is either Alice, Bob, or Carol; the second word is either eats, pets, or
throws; the third word is apples, cats, or baseballs; and the sentence ends
with a period? This regex should be case-insensitive. It must match the
following:
• 'Alice eats apples.'
• 'Bob pets cats.'
• 'Carol throws baseballs.'
• 'Alice throws Apples.'
• 'BOB EATS CATS.'
but not the following:
• 'RoboCop eats apples.'
• 'ALICE THROWS FOOTBALLS.'
• 'Carol eats 7 cats.'

Practice Projects
For practice, write programs to do the following tasks.

Strong Password Detection


Write a function that uses regular expressions to make sure the password
string it is passed is strong. A strong password is defined as one that is at
least eight characters long, contains both uppercase and lowercase charac-
ters, and has at least one digit. You may need to test the string against mul-
tiple regex patterns to validate its strength.

Regex Version of strip()


Write a function that takes a string and does the same thing as the strip()
string method. If no other arguments are passed other than the string to
strip, then whitespace characters will be removed from the beginning and
end of the string. Otherwise, the characters specified in the second argu-
ment to the function will be removed from the string.
