Regular Expression
7
You may be familiar with searching for text by pressing ctrl-F and typing in the words you’re looking for. Regular expressions go one step further: They allow you to specify a pattern of text to search for. You may not know a business’s exact phone number, but if you live in the United States or Canada, you know it will be three digits, followed by a hyphen, and then four more digits (and optionally, a three-digit area code at the start). This is how you, as a human, know a phone number when you see it: 415-555-1234 is a phone number, but 4,155,551,234 is not.

Regular expressions are helpful, but not many non-programmers know about them even though most modern text editors and word processors, such as Microsoft Word or OpenOffice, have find and find-and-replace features that can search based on regular expressions. Regular expressions are huge time-savers, not just for software users but also for [...] keystrokes can take other people days of tedious, error-prone work to slog through.”1

1. Cory Doctorow, “Here’s what ICT should really teach kids: how to do regular expressions,” Guardian, December 4, 2012, http://www.theguardian.com/technology/2012/dec/04/ict-teach-kids-regular-expressions/.

def isPhoneNumber(text):
    if len(text) != 12:  # (1)
        return False
    for i in range(0, 3):
        if not text[i].isdecimal():  # (2)
            return False
    if text[3] != '-':  # (3)
        return False
    for i in range(4, 7):
        if not text[i].isdecimal():  # (4)
            return False
    if text[7] != '-':  # (5)
        return False
    for i in range(8, 12):
        if not text[i].isdecimal():  # (6)
            return False
    return True  # (7)

print('415-555-4242 is a phone number:')
print(isPhoneNumber('415-555-4242'))
print('Moshi moshi is a phone number:')
print(isPhoneNumber('Moshi moshi'))

When this program is run, the output looks like this:

415-555-4242 is a phone number:
True
Moshi moshi is a phone number:
False

The isPhoneNumber() function has code that does several checks to see whether the string in text is a valid phone number. If any of these checks fail, the function returns False. First the code checks that the string is exactly 12 characters (1). Then it checks that the area code (that is, the first three characters in text) consists of only numeric characters (2). The rest of the function checks that the string follows the pattern of a phone number: The number must have the first hyphen after the area code (3), three more numeric characters (4), then another hyphen (5), and finally four more numbers (6). If the program execution manages to get past all the checks, it returns True (7).

Calling isPhoneNumber() with the argument '415-555-4242' will return True. Calling isPhoneNumber() with 'Moshi moshi' will return False; the first test fails because 'Moshi moshi' is not 12 characters long.

You would have to add even more code to find this pattern of text in a larger string. Replace the last four print() function calls in isPhoneNumber.py with the following:

message = 'Call me at 415-555-1011 tomorrow. 415-555-9999 is my office.'
for i in range(len(message)):
    chunk = message[i:i+12]  # (1)
    if isPhoneNumber(chunk):  # (2)
        print('Phone number found: ' + chunk)
print('Done')

When this program is run, the output will look like this:

Phone number found: 415-555-1011
Phone number found: 415-555-9999
Done

On each iteration of the for loop, a new chunk of 12 characters from message is assigned to the variable chunk (1). For example, on the first iteration, i is 0, and chunk is assigned message[0:12] (that is, the string 'Call me at 4'). On the next iteration, i is 1, and chunk is assigned message[1:13] (the string 'all me at 41').

You pass chunk to isPhoneNumber() to see whether it matches the phone number pattern (2), and if so, you print the chunk.

Continue to loop through message, and eventually the 12 characters in chunk will be a phone number. The loop goes through the entire string, testing each 12-character piece and printing any chunk it finds that satisfies isPhoneNumber(). Once we’re done going through message, we print Done.

While the string in message is short in this example, it could be millions of characters long and the program would still run in less than a second. A similar program that finds phone numbers using regular expressions would also run in less than a second, but regular expressions make it quicker to write these programs.

Finding Patterns of Text with Regular Expressions
The previous phone number–finding program works, but it uses a lot of code to do something limited: The isPhoneNumber() function is 17 lines but can find only one pattern of phone numbers. What about a phone number formatted like 415.555.4242 or (415) 555-4242? What if the phone number had an extension, like 415-555-4242 x99? The isPhoneNumber() function would fail to validate them. You could add yet more code for these additional patterns, but there is an easier way.

Regular expressions, called regexes for short, are descriptions for a pattern of text. For example, a \d in a regex stands for a digit character, that is, any single numeral 0 to 9. The regex \d\d\d-\d\d\d-\d\d\d\d is used by Python to match the same text the previous isPhoneNumber() function did: a string of three numbers, a hyphen, three more numbers, another hyphen, and four numbers. Any other string would not match the \d\d\d-\d\d\d-\d\d\d\d regex.

But regular expressions can be much more sophisticated. For example, adding a 3 in curly brackets ({3}) after a pattern is like saying, “Match this pattern three times.” So the slightly shorter regex \d{3}-\d{3}-\d{4} also matches the correct phone number format.

Creating Regex Objects
All the regex functions in Python are in the re module. Enter the following into the interactive shell to import this module:

>>> import re

NOTE Most of the examples that follow in this chapter will require the re module, so remember to import it at the beginning of any script you write or any time you restart IDLE. Otherwise, you’ll get a NameError: name 're' is not defined error message.

Passing a string value representing your regular expression to re.compile() returns a Regex pattern object (or simply, a Regex object).

To create a Regex object that matches the phone number pattern, enter the following into the interactive shell. (Remember that \d means “a digit character” and \d\d\d-\d\d\d-\d\d\d\d is the regular expression for the correct phone number pattern.)
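
For comparison with the isPhoneNumber() loop, here is a minimal sketch that compiles the phone number pattern with re.compile() and finds both numbers in the same message string. The variable names (phone_regex, short_regex, first, found) are illustrative choices; only re.compile(), search(), findall(), and group() come from the standard re module.

```python
import re

# Illustrative names; the two patterns describe the same phone number format.
phone_regex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
short_regex = re.compile(r'\d{3}-\d{3}-\d{4}')  # same pattern using {n} repetition

message = 'Call me at 415-555-1011 tomorrow. 415-555-9999 is my office.'

first = phone_regex.search(message)   # Match object for the first hit, or None
found = phone_regex.findall(message)  # list of every matching substring

print(first.group())                          # 415-555-1011
print(found)                                  # ['415-555-1011', '415-555-9999']
print(short_regex.findall(message) == found)  # True
```

The raw-string prefix (r'...') keeps the backslashes in \d from being treated as Python string escapes.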
You can think of the ? as saying, “Match zero or one of the group preceding this question mark.”

If you need to match an actual question mark character, escape it with \?.

Matching Zero or More with the Star
The * (called the star or asterisk) means “match zero or more”: the group that precedes the star can occur any number of times in the text. It can be completely absent or repeated over and over again. Let’s look at the Batman example again.

>>> batRegex = re.compile(r'Bat(wo)*man')
>>> mo1 = batRegex.search('The Adventures of Batman')
>>> mo1.group()
'Batman'

>>> mo2 = batRegex.search('The Adventures of Batwoman')
>>> mo2.group()
'Batwoman'

>>> mo3 = batRegex.search('The Adventures of Batwowowowoman')
>>> mo3.group()
'Batwowowowoman'

For 'Batman', the (wo)* part of the regex matches zero instances of wo in the string; for 'Batwoman', the (wo)* matches one instance of wo; and for 'Batwowowowoman', (wo)* matches four instances of wo.

If you need to match an actual star character, prefix the star in the regular expression with a backslash, \*.

Matching One or More with the Plus
While * means “match zero or more,” the + (or plus) means “match one or more.” Unlike the star, which does not require its group to appear in the matched string, the group preceding a plus must appear at least once. It is not optional. Enter the following into the interactive shell, and compare it with the star regexes in the previous section:

>>> batRegex = re.compile(r'Bat(wo)+man')
>>> mo1 = batRegex.search('The Adventures of Batwoman')
>>> mo1.group()
'Batwoman'

The regex Bat(wo)+man will not match the string 'The Adventures of Batman' because at least one wo is required by the plus sign.

If you need to match an actual plus sign character, prefix the plus sign with a backslash to escape it: \+.

Matching Specific Repetitions with Curly Brackets
If you have a group that you want to repeat a specific number of times, follow the group in your regex with a number in curly brackets. For example, the regex (Ha){3} will match the string 'HaHaHa', but it will not match 'HaHa', since the latter has only two repeats of the (Ha) group.

Instead of one number, you can specify a range by writing a minimum, a comma, and a maximum in between the curly brackets. For example, the regex (Ha){3,5} will match 'HaHaHa', 'HaHaHaHa', and 'HaHaHaHaHa'.

You can also leave out the first or second number in the curly brackets to leave the minimum or maximum unbounded. For example, (Ha){3,} will match three or more instances of the (Ha) group, while (Ha){,5} will match zero to five instances. Curly brackets can help make your regular expressions shorter. These two regular expressions match identical patterns:

(Ha){3}
(Ha)(Ha)(Ha)

And these two regular expressions also match identical patterns:

(Ha){3,5}
((Ha)(Ha)(Ha))|((Ha)(Ha)(Ha)(Ha))|((Ha)(Ha)(Ha)(Ha)(Ha))

Enter the following into the interactive shell:

>>> haRegex = re.compile(r'(Ha){3}')
>>> mo1 = haRegex.search('HaHaHa')
>>> mo1.group()
'HaHaHa'

>>> mo2 = haRegex.search('Ha')
>>> mo2 == None
True

Here, (Ha){3} matches 'HaHaHa' but not 'Ha'. Since it doesn’t match 'Ha', search() returns None.
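
To see the repetition symbols from these sections side by side, the following sketch applies the star, the plus, curly-bracket ranges, and a backslash escape to a few sample strings. The helper name matched_text and the sample strings are illustrative assumptions, not part of the chapter's own code.

```python
import re

def matched_text(pattern, text):
    """Return the text matched by pattern in text, or None if nothing matches."""
    mo = re.compile(pattern).search(text)
    return mo.group() if mo is not None else None

# * allows zero occurrences of the group; + requires at least one.
print(matched_text(r'Bat(wo)*man', 'The Adventures of Batman'))      # Batman
print(matched_text(r'Bat(wo)+man', 'The Adventures of Batman'))      # None
print(matched_text(r'Bat(wo)+man', 'The Adventures of Batwowoman'))  # Batwowoman

# {3,5} accepts three to five repeats and, by default, takes as many as it can.
print(matched_text(r'(Ha){3,5}', 'HaHaHaHa'))  # HaHaHaHa
print(matched_text(r'(Ha){3,}', 'HaHa'))       # None

# A backslash turns a repetition symbol into a literal character.
print(matched_text(r'1\+1', 'What is 1+1?'))   # 1+1
```

Wrapping search() in a helper that returns None avoids the AttributeError that calling group() on a failed match would raise.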
On the other hand, findall() will not return a Match object but a list of strings, as long as there are no groups in the regular expression. Each string in the list is a piece of the searched text that matched the regular expression. Enter the following into the interactive shell:

>>> phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d') # has no groups
>>> phoneNumRegex.findall('Cell: 415-555-9999 Work: 212-555-0000')
['415-555-9999', '212-555-0000']

If there are groups in the regular expression, then findall() will return a list of tuples. Each tuple represents a found match, and its items are the matched strings for each group in the regex.

Character classes are nice for shortening regular expressions. The character class [0-5] will match only the numbers 0 to 5; this is much shorter than typing (0|1|2|3|4|5).

For example, enter the following into the interactive shell:

>>> xmasRegex = re.compile(r'\d+\s\w+')
>>> xmasRegex.findall('12 drummers, 11 pipers, 10 lords, 9 ladies, 8 maids, 7 swans, 6 geese, 5 rings, 4 birds, 3 hens, 2 doves, 1 partridge')
['12 drummers', '11 pipers', '10 lords', '9 ladies', '8 maids', '7 swans', '6 geese', '5 rings', '4 birds', '3 hens', '2 doves', '1 partridge']
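
A short sketch of both findall() behaviours, with and without groups, followed by a check that the [0-5] character class agrees with the longer (0|1|2|3|4|5) alternation. The variable names and sample text are illustrative, and the (?:...) non-capturing group is an addition not covered in this part of the chapter; it is used here only so that findall() keeps returning plain strings.

```python
import re

text = 'Cell: 415-555-9999 Work: 212-555-0000'

# No groups: findall() returns a list of the matched strings.
no_groups = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
print(no_groups.findall(text))
# ['415-555-9999', '212-555-0000']

# With groups: findall() returns a list of tuples, one item per group.
with_groups = re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)')
print(with_groups.findall(text))
# [('415', '555', '9999'), ('212', '555', '0000')]

# [0-5] and the alternation 0|1|2|3|4|5 pick out the same characters.
# (?:...) is a non-capturing group (an assumption beyond the text above).
digits = '0123456789'
same = re.findall(r'[0-5]', digits) == re.findall(r'(?:0|1|2|3|4|5)', digits)
print(same)  # True
```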
You can also include ranges of letters or numbers by using a hyphen. For example, the character class [a-zA-Z0-9] will match all lowercase letters, uppercase letters, and numbers.

Note that inside the square brackets, the normal regular expression symbols are not interpreted as such. This means you do not need to escape the ., *, ?, or () characters with a preceding backslash. For example, the character class [0-5.] will match digits 0 to 5 and a period. You do not need to write it as [0-5\.].

By placing a caret character (^) just after the character class’s opening bracket, you can make a negative character class. A negative character class will match all the characters that are not in the character class. For example, enter the following into the interactive shell:

>>> consonantRegex = re.compile(r'[^aeiouAEIOU]')
>>> consonantRegex.findall('RoboCop eats baby food. BABY FOOD.')
['R', 'b', 'C', 'p', ' ', 't', 's', ' ', 'b', 'b', 'y', ' ', 'f', 'd', '.', ' ', 'B', 'B', 'Y', ' ', 'F', 'D', '.']

Now, instead of matching every vowel, we’re matching every character that isn’t a vowel.

The Caret and Dollar Sign Characters
You can also use the caret symbol (^) at the start of a regex to indicate that a match must occur at the beginning of the searched text. Likewise, you can put a dollar sign ($) at the end of the regex to indicate the string must end with this regex pattern. And you can use the ^ and $ together to indicate that the entire string must match the regex; that is, it’s not enough for a match to be made on some subset of the string.

The r'^\d+$' regular expression string matches strings that consist of one or more numeric characters from beginning to end. Enter the following into the interactive shell:

>>> wholeStringIsNum = re.compile(r'^\d+$')
>>> wholeStringIsNum.search('1234567890')
<_sre.SRE_Match object; span=(0, 10), match='1234567890'>
>>> wholeStringIsNum.search('12345xyz67890') == None
True
>>> wholeStringIsNum.search('12 34567890') == None
True

The last two search() calls in the previous interactive shell example demonstrate how the entire string must match the regex if ^ and $ are used.

I always confuse the meanings of these two symbols, so I use the mnemonic “Carrots cost dollars” to remind myself that the caret comes first and the dollar sign comes last.

The Wildcard Character
The . (or dot) character in a regular expression is called a wildcard and will match any character except for a newline. For example, enter the following into the interactive shell:

>>> atRegex = re.compile(r'.at')
>>> atRegex.findall('The cat in the hat sat on the flat mat.')
['cat', 'hat', 'sat', 'lat', 'mat']
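
As an illustrative sketch, the snippet below exercises the caret, the dollar sign, and the dot wildcard on a few sample strings. The pattern names and the sample strings are assumptions chosen for illustration.

```python
import re

# ^ anchors the match at the start of the string, $ at the end.
starts_with_hello = re.compile(r'^Hello')
ends_with_digit = re.compile(r'\d$')
whole_string_is_num = re.compile(r'^\d+$')

print(starts_with_hello.search('Hello world!') is not None)    # True
print(starts_with_hello.search('He said Hello.') is not None)  # False
print(ends_with_digit.search('Your number is 42').group())     # 2
print(whole_string_is_num.search('12345xyz67890') is None)     # True

# The dot matches any single character except a newline, so the
# final 'at' (preceded only by '\n') is not part of any match.
dot_matches = re.findall(r'.at', 'The cat sat.\nat')
print(dot_matches)  # ['cat', 'sat']
```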
Practice Projects
For practice, write programs to do the following tasks.