Module 4 RegEX
When this program is run, the output will look like this:
On each iteration of the for loop, a new chunk of 12 characters from message is assigned to the
variable chunk.
For example, on the first iteration, i is 0, and chunk is assigned message[0:12] (that is, the string
'Call me at 4').
On the next iteration, i is 1, and chunk is assigned message[1:13] (the string 'all me at 41').
In other words, on each iteration of the for loop, chunk takes on the following values:
'Call me at 4'
'all me at 41'
'll me at 415'
'l me at 415-'... and so on
The program passes chunk to isPhoneNumber() to see whether it matches the phone number pattern,
and if so, prints the chunk. It continues to loop through message, and eventually the 12 characters
in chunk will be a phone number. The loop goes through the entire string, testing each
12-character piece and printing any chunk that satisfies isPhoneNumber().
Once the loop has gone through all of message, the program prints Done.
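The loop described above can be sketched as follows. The body of isPhoneNumber() and the full text of message are not shown in this section, so both are assumptions here: the check assumes a DDD-DDD-DDDD pattern, and the message is a hypothetical string that merely starts with the characters quoted above.

```python
def isPhoneNumber(text):
    """Return True if text has the form DDD-DDD-DDDD (assumed pattern)."""
    if len(text) != 12:
        return False
    for i, ch in enumerate(text):
        if i in (3, 7):
            # Positions 3 and 7 must be hyphens.
            if ch != '-':
                return False
        elif not ch.isdecimal():
            # Every other position must be a digit.
            return False
    return True


# Hypothetical message; only its first few characters appear in the text.
message = 'Call me at 415-555-1011 tomorrow. 415-555-9999 is my office.'

found = []  # collected only so the result is easy to inspect
for i in range(len(message)):
    chunk = message[i:i + 12]  # 12-character window starting at index i
    if isPhoneNumber(chunk):
        print('Phone number found: ' + chunk)
        found.append(chunk)
print('Done')
```

Chunks near the end of the string are shorter than 12 characters, so the length check rejects them without any special handling.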
While the string in message is short in this example, it could be millions of characters long and
the program would still run in less than a second. A similar program that finds phone numbers
using regular expressions would also run in less than a second, but regular expressions make it
quicker to write these programs.
Passing a string value representing your regular expression to re.compile() returns a Regex
pattern object (or simply, a Regex object).
To create a Regex object that matches the phone number pattern, enter the following into the
interactive shell.
Now the phoneNumRegex variable contains a Regex object.
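The shell input is missing from these notes. A minimal sketch of what it might look like, assuming the phone number pattern is three digits, a hyphen, three digits, a hyphen, and four digits:

```python
import re

# A raw string (r'...') keeps the backslashes in \d from being treated
# as Python string escapes.
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')

print(phoneNumRegex.pattern)
```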
b) Matching Regex Objects
A Regex object’s search() method searches the string it is passed for any matches to the regex.
The search() method will return None if the regex pattern is not found in the string. If the
pattern is found, the search() method returns a Match object, which has a group() method that
will return the actual matched text from the searched string. For example, enter the following
into the interactive shell:
The mo variable name is just a generic name to use for Match objects. In this program, we
pass our desired pattern to re.compile() and store the resulting Regex object in
phoneNumRegex. Then we call search() on phoneNumRegex and pass search() the string we
want to match for during the search. The result of the search gets stored in the variable mo. In
this example, we know that our pattern will be found in the string, so we know that a Match
object will be returned. Knowing that mo contains a Match object and not the null value None,
we can call group() on mo to return the match. Writing mo.group() inside our print() function
call displays the whole match, 415-555-4242.
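A sketch of the session described above. The searched sentence is an assumption; only the matched number 415-555-4242 is given in the text.

```python
import re

phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')

# search() returns a Match object here because the pattern is present.
mo = phoneNumRegex.search('My number is 415-555-4242.')
print('Phone number found: ' + mo.group())  # Phone number found: 415-555-4242
```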
c) Review of Regular Expression Matching
There are several steps to using regular expressions in Python. They are:
1. Import the regex module with import re.
2. Create a Regex object with the re.compile() function.
3. Pass the string you want to search into the Regex object’s search() method. This
returns a Match object (or None if the pattern is not found).
4. Call the Match object’s group() method to return a string of the actual matched text.
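The four steps can be wrapped in a small helper that also handles the case where nothing matches. The function name first_phone_number and the pattern are illustrative assumptions, not part of the original notes.

```python
import re  # step 1: import the regex module

# Step 2: create a Regex object.
phone_regex = re.compile(r'\d{3}-\d{3}-\d{4}')


def first_phone_number(text):
    """Return the first phone number in text, or None if there is none."""
    mo = phone_regex.search(text)  # step 3: search the string
    if mo is None:
        return None                # no match, so there is no group() to call
    return mo.group()              # step 4: return the matched text


print(first_phone_number('Call 415-555-4242 now'))
print(first_phone_number('No number here'))
```

Checking for None before calling group() avoids an AttributeError when the pattern is absent.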
Characters such as parentheses have a special meaning in regular expressions. If you want to
detect these characters as part of your text pattern, you need to escape them with a backslash:
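As an illustration, here is a sketch that matches literal parentheses around an area code. The pattern and the sample sentence are assumptions chosen for the example.

```python
import re

# \( and \) match literal parentheses; the unescaped ( ) create groups.
phoneNumRegex = re.compile(r'(\(\d\d\d\)) (\d\d\d-\d\d\d\d)')
mo = phoneNumRegex.search('My phone number is (415) 555-4242.')

print(mo.group(1))  # (415)
print(mo.group(2))  # 555-4242
```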
The method call mo.group() returns the full matched text 'Batmobile', while mo.group(1)
returns just the part of the matched text inside the first parentheses group, 'mobile'. If we need
to match an actual pipe character, escape it with a backslash, like \|.
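The code that produced 'Batmobile' is not shown in these notes. One way to get that result, assuming a group of alternatives separated by pipes after the prefix Bat:

```python
import re

# The pipe separates alternatives inside the group.
batRegex = re.compile(r'Bat(man|mobile|copter|bat)')
mo = batRegex.search('Batmobile lost a wheel')

print(mo.group())   # Batmobile
print(mo.group(1))  # mobile

# An escaped pipe matches the literal character.
pipeRegex = re.compile(r'a\|b')
print(pipeRegex.search('a|b').group())  # a|b
```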
Optional Matching with the Question Mark
• Sometimes there is a pattern that you want to match only optionally.
• The ? character flags the group that precedes it as an optional part of the pattern.
• Enter the following into the interactive shell:
• The (wo)? part of the regular expression means that the pattern wo is an optional group.
• The regex will match text that has zero instances or one instance of wo in it.
• This is why the regex matches both 'Batwoman' and 'Batman'.
• Enter the following phone number example into the interactive shell:
The ? can be thought of as saying, “Match zero or one of the group preceding this question
mark.”
If we need to match an actual question mark character, escape it with \?.
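Both examples mentioned above can be sketched like this. The sample sentences and the exact phone pattern (an optional area code followed by a seven-digit number) are assumptions.

```python
import re

batRegex = re.compile(r'Bat(wo)?man')
mo1 = batRegex.search('The Adventures of Batman')
mo2 = batRegex.search('The Adventures of Batwoman')
print(mo1.group())  # Batman
print(mo2.group())  # Batwoman

# The area code, including its trailing hyphen, is optional.
phoneRegex = re.compile(r'(\d\d\d-)?\d\d\d-\d\d\d\d')
mo3 = phoneRegex.search('My number is 415-555-4242')
mo4 = phoneRegex.search('My number is 555-4242')
print(mo3.group())  # 415-555-4242
print(mo4.group())  # 555-4242
```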
Matching Zero or More with the Star
The * (called the star or asterisk) means “match zero or more”.
The group that precedes the star can occur any number of times in the text.
It can be completely absent or repeated over and over again.
For example,
For 'Batman', the (wo)* part of the regex matches zero instances of wo in the string; for
'Batwoman', the (wo)* matches one instance of wo; and for 'Batwowowowoman', (wo)*
matches four instances of wo.
To match an actual star character, prefix the star in the regular expression with a
backslash, \*.
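A sketch of the three cases described above. The surrounding sentence text is an assumption.

```python
import re

batRegex = re.compile(r'Bat(wo)*man')

mo1 = batRegex.search('The Adventures of Batman')
mo2 = batRegex.search('The Adventures of Batwoman')
mo3 = batRegex.search('The Adventures of Batwowowowoman')

print(mo1.group())  # Batman
print(mo2.group())  # Batwoman
print(mo3.group())  # Batwowowowoman
```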
Matching One or More with the Plus
The + (or plus) means “match one or more.”
The group preceding a plus must appear at least once. It is not optional.
Enter the following into the interactive shell:
The regex Bat(wo)+man will not match the string 'The Adventures of Batman', because at least
one wo is required by the plus sign.
To match an actual plus sign character, prefix the plus sign with a backslash to escape it: \+.
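The same strings can be tried with the plus in place of the star. The sentence text other than 'The Adventures of Batman' is an assumption.

```python
import re

batRegex = re.compile(r'Bat(wo)+man')

mo1 = batRegex.search('The Adventures of Batwoman')
mo2 = batRegex.search('The Adventures of Batwowowowoman')
mo3 = batRegex.search('The Adventures of Batman')

print(mo1.group())  # Batwoman
print(mo2.group())  # Batwowowowoman
print(mo3 is None)  # True
```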
The question mark can have two meanings in regular expressions: declaring a non-greedy
match or flagging an optional group.
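The two meanings can be compared side by side. The (Ha){3,5} pattern below is an illustrative assumption, since this section does not include its own example.

```python
import re

text = 'HaHaHaHaHa'

# Greedy: the {3,5} quantifier takes as many repetitions as it can.
greedyHaRegex = re.compile(r'(Ha){3,5}')
print(greedyHaRegex.search(text).group())     # HaHaHaHaHa

# Non-greedy: a ? after the quantifier takes as few as it can.
nongreedyHaRegex = re.compile(r'(Ha){3,5}?')
print(nongreedyHaRegex.search(text).group())  # HaHaHa

# Optional group: a ? directly after a group means "zero or one".
optionalRegex = re.compile(r'Bat(wo)?man')
print(optionalRegex.search('Batman').group())  # Batman
```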
The findall() Method
In addition to the search() method, Regex objects also have a findall() method. While search()
will return a Match object of the first matched text in the searched string, the findall() method
will return the strings of every match in the searched string.
For example, enter the following into the interactive shell:
findall() will return a list of strings as long as there are no groups in the regular expression.
Each string in the list is a piece of the searched text that matched the regular expression.
Enter the following into the interactive shell:
If there are two or more groups in the regular expression, then findall() will return a list of tuples
(with exactly one group, it returns a list of that group's strings).
Each tuple has items that are the matched strings for each group in the regex.
For example, enter the following into the interactive shell:
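A sketch contrasting search() with findall(), with and without groups. The sample text and the phone patterns are assumptions.

```python
import re

text = 'Cell: 415-555-9999 Work: 212-555-0000'

# No groups: search() gives the first match, findall() gives all of them.
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
first = phoneNumRegex.search(text).group()
all_numbers = phoneNumRegex.findall(text)
print(first)        # 415-555-9999
print(all_numbers)  # ['415-555-9999', '212-555-0000']

# Several groups: findall() returns one tuple per match.
groupedRegex = re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)')
all_parts = groupedRegex.findall(text)
print(all_parts)    # [('415', '555', '9999'), ('212', '555', '0000')]
```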
Character Classes
\d is shorthand for the regular expression (0|1|2|3|4|5|6|7|8|9).
There are many such shorthand character classes, as shown in Table.
The r'^\d+$' regular expression string matches strings that both begin and end with one or more
numeric characters, with nothing but digits in between.
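A short sketch of how the ^ and $ anchors behave with \d+. The variable name and test strings are assumptions.

```python
import re

wholeStringIsNum = re.compile(r'^\d+$')

# Only digits from start to end: match.
print(wholeStringIsNum.search('1234567890').group())          # 1234567890

# A non-digit anywhere in the string prevents a match.
print(wholeStringIsNum.search('12345xyz67890') is None)        # True
print(wholeStringIsNum.search('12 34567890') is None)          # True
```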
The Wildcard Character
The . (or dot) character in a regular expression is called a wildcard and will match any
character except for a newline.
For example, enter the following into the interactive shell:
The dot character will match just one character. This is why the match for the text ‘flat’ in
the example matched only ‘lat’.
To match an actual dot, escape the dot with a backslash: \.
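The example containing 'flat' is missing from these notes. A sketch that reproduces the behaviour described, assuming the pattern is a dot followed by at:

```python
import re

atRegex = re.compile(r'.at')
matches = atRegex.findall('The cat in the hat sat on the flat mat.')
print(matches)  # ['cat', 'hat', 'sat', 'lat', 'mat']

# An escaped dot matches only a literal period.
dotRegex = re.compile(r'\.')
print(dotRegex.findall('a.b.c'))  # ['.', '.']
```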
Matching Everything with Dot-Star
The dot-star (.*) is used to match everything and anything.
The dot character means “any single character except the newline,” and the star character
means “zero or more of the preceding character.”
Enter the following into the interactive shell:
The dot-star uses greedy mode: It will always try to match as much text as possible.
To match any and all text in a non-greedy fashion, use the dot, star, and question mark (.*?).
Enter the following into the interactive shell:
The string '<To serve man> for dinner.>' has two possible matches for the closing angle
bracket.
In the non-greedy version of the regex, Python matches the shortest possible string: '<To
serve man>'.
In the greedy version, Python matches the longest possible string: '<To serve man> for
dinner.>'.
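Both dot-star behaviours can be sketched as follows. The angle-bracket patterns follow the description above; the name pattern and its sample string are assumptions used to show (.*) capturing arbitrary text.

```python
import re

# Dot-star inside groups captures whatever text appears in those positions.
nameRegex = re.compile(r'First Name: (.*) Last Name: (.*)')
mo = nameRegex.search('First Name: Al Last Name: Sweigart')
print(mo.group(1))  # Al
print(mo.group(2))  # Sweigart

text = '<To serve man> for dinner.>'

# Non-greedy: stops at the first closing angle bracket.
nongreedyRegex = re.compile(r'<.*?>')
print(nongreedyRegex.search(text).group())  # <To serve man>

# Greedy: runs to the last closing angle bracket.
greedyRegex = re.compile(r'<.*>')
print(greedyRegex.search(text).group())     # <To serve man> for dinner.>
```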
The regex noNewlineRegex, which did not have re.DOTALL passed to the re.compile() call
that created it, will match everything only up to the first newline character, whereas
newlineRegex, which had re.DOTALL passed to re.compile(), matches everything.
This is why the newlineRegex.search() call matches the full string, including its newline
characters.
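A sketch of the two regexes being compared. The multi-line sample string is an assumption.

```python
import re

text = 'Serve the public trust.\nProtect the innocent.\nUphold the law.'

# Without re.DOTALL the dot stops at the first newline.
noNewlineRegex = re.compile('.*')
print(noNewlineRegex.search(text).group())  # Serve the public trust.

# With re.DOTALL the dot also matches newline characters.
newlineRegex = re.compile('.*', re.DOTALL)
print(newlineRegex.search(text).group() == text)  # True
```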
Review of Regex Symbols
The ? matches zero or one of the preceding group.
The * matches zero or more of the preceding group.
The + matches one or more of the preceding group.
The {n} matches exactly n of the preceding group.
The {n,} matches n or more of the preceding group.
The {,m} matches 0 to m of the preceding group.
The {n,m} matches at least n and at most m of the preceding group.
{n,m}? or *? or +? performs a non-greedy match of the preceding group.
^spam means the string must begin with spam.
spam$ means the string must end with spam.
The . matches any character, except newline characters.
\d, \w, and \s match a digit, word, or space character, respectively.
\D, \W, and \S match anything except a digit, word, or space character, respectively.
[abc] matches any character between the brackets (such as a, b, or c).
[^abc] matches any character that isn’t between the brackets.
Case-Insensitive Matching
Regular expressions match text with the exact casing that is specified.
For example, the following regexes match completely different strings:
But to make the regex case insensitive, re.IGNORECASE or re.I can be passed as a
second argument to re.compile().
Enter the following into the interactive shell:
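The examples for this section are missing from the notes. A sketch, assuming a pattern that spells robocop in different casings:

```python
import re

# Case-sensitive: each regex matches only its own exact casing.
regex1 = re.compile('RoboCop')
regex2 = re.compile('ROBOCOP')
print(regex1.search('ROBOCOP protects the innocent.') is None)  # True
print(regex2.search('ROBOCOP protects the innocent.').group())  # ROBOCOP

# Case-insensitive: re.I is shorthand for re.IGNORECASE.
robocop = re.compile(r'robocop', re.I)
print(robocop.search('RoboCop is part man, part machine.').group())  # RoboCop
print(robocop.search('ROBOCOP protects the innocent.').group())      # ROBOCOP
```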
Substituting Strings with the sub() Method
Regular expressions can not only find text patterns but can also substitute new text in place of
those text patterns.
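The sub() method for Regex objects is passed two arguments. The first argument is a string to
replace any matches. The second is the string to search. The sub() method returns a string with
the substitutions applied.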
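A sketch of sub() in use. The pattern and sentence are assumptions chosen for illustration.

```python
import re

namesRegex = re.compile(r'Agent \w+')

# First argument: replacement text. Second argument: string to search.
result = namesRegex.sub('CENSORED',
                        'Agent Alice gave the secret documents to Agent Bob.')
print(result)  # CENSORED gave the secret documents to CENSORED.
```

The original string is not modified; sub() returns a new string.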
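The re.compile() function takes only a single value as its second argument.
This limitation can be overcome by combining multiple option values (such as re.IGNORECASE and
re.DOTALL) using the pipe character (|). In this context it is known as the bitwise or operator.
Including all three options in the second argument will look like this: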
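The original example is missing here. The sketch below assumes the three options are re.IGNORECASE, re.DOTALL, and re.VERBOSE; the third is not named in this section, and the pattern itself is made up for illustration.

```python
import re

# re.VERBOSE lets the pattern span lines and carry comments.
regex = re.compile(r'''
    first   # a literal word, matched in any casing (IGNORECASE)
    .*      # any text, including newlines (DOTALL)
    last    # another literal word
''', re.IGNORECASE | re.DOTALL | re.VERBOSE)

text = 'FIRST line\nsecond line\nLAST'
mo = regex.search(text)
print(mo.group() == text)  # True
```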