
Module 4 RegEX

Module 4 covers pattern matching using regular expressions (regex) in Python, detailing how to find and validate text patterns such as phone numbers. It explains the creation of regex objects, various matching techniques, and the use of special characters and methods like findall() for efficient pattern searching. The module emphasizes the advantages of regex for simplifying code compared to traditional string matching methods.

Uploaded by

vivekgowda2006
Copyright
© © All Rights Reserved

MODULE 4

Chapter 1: Pattern Matching with Regular Expressions


Finding Patterns of Text without Regular Expressions
Finding Patterns of Text with Regular Expressions
More Pattern Matching with Regular Expressions
Greedy and Nongreedy Matching
The findall() Method
Character Classes
Making Your Own Character Classes
The Caret and Dollar Sign Characters
The Wildcard Character
Review of Regex Symbols
Case-Insensitive Matching
Substituting Strings with the sub() Method
Managing Complex Regexes
Combining re.IGNORECASE, re.DOTALL, and re.VERBOSE
Finding Patterns of Text Without Regular Expressions
Consider finding an American phone number in a string. The pattern is three numbers, a
hyphen, three numbers, a hyphen, and four numbers. For example: 415-555-4242.
Let’s use a function named isPhoneNumber() to check whether a string matches
this pattern, returning either True or False. Enter the following code:
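The listing is missing from this copy. The sketch below is a reconstruction based on the description that follows (length check, digit checks, hyphen positions), so treat the exact code as an assumption and not the original listing:

```python
def isPhoneNumber(text):
    # A number such as 415-555-4242 is exactly 12 characters long.
    if len(text) != 12:
        return False
    # Area code: the first three characters must be digits.
    for i in range(0, 3):
        if not text[i].isdecimal():
            return False
    # First hyphen after the area code.
    if text[3] != '-':
        return False
    # Three more digits.
    for i in range(4, 7):
        if not text[i].isdecimal():
            return False
    # Second hyphen.
    if text[7] != '-':
        return False
    # Final four digits.
    for i in range(8, 12):
        if not text[i].isdecimal():
            return False
    return True

print('415-555-4242 is a phone number:')
print(isPhoneNumber('415-555-4242'))   # True
print('Moshi moshi is a phone number:')
print(isPhoneNumber('Moshi moshi'))    # False
```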

When this program is run, the output looks like this:


The isPhoneNumber() function has code that does several checks to see whether the string in
text is a valid phone number. If any of these checks fail, the function returns False. First the
code checks that the string is exactly 12 characters. Then it checks that the area code (that is,
the first three characters in text) consists of only numeric characters.
The rest of the function checks that the string follows the pattern of a phone number: the
number must have the first hyphen after the area code, three more numeric characters, then
another hyphen, and finally four more numbers.
If the program execution manages to get past all the checks, it returns True.
Calling isPhoneNumber() with the argument '415-555-4242' will return True.
Calling isPhoneNumber() with 'Moshi moshi' will return False; the first test fails because
'Moshi moshi' is not 12 characters long.
If you wanted to find a phone number within a larger string, you would have to add even more
code to find the phone number pattern.
Replace the last four print() function calls in the isPhoneNumber program with the following:
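The replacement code is also missing here. A minimal sketch, assuming a sample message whose first 12 characters are 'Call me at 4' (the message text and the found list are illustrative assumptions, and isPhoneNumber() is repeated in compact form only so the snippet runs on its own):

```python
def isPhoneNumber(text):
    # Compact stand-in for the function described earlier.
    if len(text) != 12 or text[3] != '-' or text[7] != '-':
        return False
    digits = text[0:3] + text[4:7] + text[8:12]
    return digits.isdecimal()

# Assumed sample message, chosen to agree with the chunks listed in the text.
message = 'Call me at 415-555-1011 tomorrow. 415-555-9999 is my office.'

found = []
for i in range(len(message)):
    chunk = message[i:i + 12]          # a 12-character window starting at i
    if isPhoneNumber(chunk):
        found.append(chunk)
        print('Phone number found: ' + chunk)
print('Done')
```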

When this program is run, the output will look like this:

On each iteration of the for loop, a new chunk of 12 characters from message is assigned to the
variable chunk.
For example, on the first iteration, i is 0, and chunk is assigned message[0:12] (that is, the string
'Call me at 4').
On the next iteration, i is 1, and chunk is assigned message[1:13] (the string 'all me at 41').
In other words, on each iteration of the for loop, chunk takes on the following values:
'Call me at 4'
'all me at 41'
'll me at 415'
'l me at 415-'... and so on
The program passes chunk to isPhoneNumber() to see whether it matches the phone number
pattern, and if so, prints the chunk. As the loop continues through message, the 12 characters in
chunk will eventually be a phone number. The loop goes through the entire string, testing each
12-character piece and printing any chunk it finds that satisfies isPhoneNumber().
Once we’re done going through message, we print Done.
While the string in message is short in this example, it could be millions of characters long and
the program would still run in less than a second. A similar program that finds phone numbers
using regular expressions would also run in less than a second, but regular expressions make it
quicker to write these programs.

Finding Patterns of Text with Regular Expressions


The previous phone number–finding program works, but it uses a lot of code to do something
limited: the isPhoneNumber() function is 17 lines but can find only one pattern of phone
numbers.
What about a phone number formatted like 415.555.4242 or (415) 555-4242? What if the phone
number had an extension, like 415-555-4242 x99? The isPhoneNumber() function would fail
to validate them.
Regular expressions, called regexes for short, are descriptions for a pattern of text.
For example, a \d in a regex stands for a digit character—that is, any single numeral from 0 to
9. The regex \d\d\d-\d\d\d-\d\d\d\d is used by Python to match the same text pattern the previous
isPhoneNumber() function did: a string of three numbers, a hyphen, three more numbers,
another hyphen, and four numbers. Any other string would not match the \d\d\d-\d\d\d-\d\d\d\d
regex.
For example, adding a 3 in braces ({3}) after a pattern is like saying, “Match this pattern three
times.” So the slightly shorter regex \d{3}-\d{3}-\d{4} also matches the correct phone number
format.
a) Creating Regex Objects:
All the regex functions in Python are in the re module. Enter the following into the interactive
shell to import this module:

Passing a string value representing your regular expression to re.compile() returns a Regex
pattern object (or simply, a Regex object).
To create a Regex object that matches the phone number pattern, enter the following into the
interactive shell.
Now the phoneNumRegex variable contains a Regex object.
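The two shell entries referred to above (importing the module and compiling the pattern) are missing from this copy. A minimal sketch using the pattern given earlier:

```python
import re

# A raw string (r'...') stops Python from treating the backslashes as string escapes.
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')

# The compiled object remembers the pattern it was built from.
print(phoneNumRegex.pattern)
```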
b) Matching Regex Objects
A Regex object’s search() method searches the string it is passed for any matches to the regex.
The search() method will return None if the regex pattern is not found in the string. If the
pattern is found, the search() method returns a Match object, which has a group() method that
will return the actual matched text from the searched string. For example, enter the following
into the interactive shell:
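The shell session is missing here. A minimal sketch consistent with the explanation that follows (the sample sentences and the variable name missing are assumptions):

```python
import re

phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')

# search() returns a Match object when the pattern is found.
mo = phoneNumRegex.search('My number is 415-555-4242.')
print('Phone number found: ' + mo.group())

# search() returns None when nothing matches, so check before calling group().
missing = phoneNumRegex.search('No phone number here.')
print(missing is None)
```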

The mo variable name is just a generic name to use for Match objects. In this program, we
pass our desired pattern to re.compile() and store the resulting Regex object in
phoneNumRegex. Then we call search() on phoneNumRegex and pass search() the string we
want to search for a match. The result of the search gets stored in the variable mo. In
this example, we know that our pattern will be found in the string, so we know that a Match
object will be returned. Knowing that mo contains a Match object and not the null value None,
we can call group() on mo to return the match. Writing mo.group() inside our print() function
call displays the whole match, 415-555-4242.
c) Review of Regular Expression Matching
There are several steps to using regular expressions in Python. They are:
1. Import the regex module with import re.
2. Create a Regex object with the re.compile() function.
3. Pass the string you want to search into the Regex object’s search() method. This
returns a Match object.
4. Call the Match object’s group() method to return a string of the actual matched text.

More Pattern Matching with Regular Expressions


Grouping with Parentheses: Suppose you want to separate the area code from the rest of the
phone number. Adding parentheses will create groups in the regex: (\d\d\d)-(\d\d\d-\d\d\d\d).
Then use the group() match object method to grab the matching text from just one group. The
first set of parentheses in a regex string will be group 1. The second set will be group 2. By
passing the integer 1 or 2 to the group() match object method, you can grab different parts of
the matched text. Passing 0 or nothing to the group() method will return the entire matched text.
Enter the following into the interactive shell:
To retrieve all the groups at once, use the groups() method.
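The shell sessions for group() and groups() are missing here. A short sketch of both, using the grouped pattern from the text (the sample sentence and the names areaCode and mainNumber are assumptions):

```python
import re

phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
mo = phoneNumRegex.search('My number is 415-555-4242.')

print(mo.group(1))   # '415'
print(mo.group(2))   # '555-4242'
print(mo.group(0))   # '415-555-4242'
print(mo.group())    # same as group(0)

# groups() returns a tuple of all groups, which can be unpacked in one step.
print(mo.groups())   # ('415', '555-4242')
areaCode, mainNumber = mo.groups()
```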
In regular expressions, the following characters have special meanings:
.  ^  $  *  +  ?  {  }  [  ]  \  |  (  )

If you want to detect these characters as part of your text pattern, you need to escape them with a
backslash:
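The example is missing from this copy. One way to illustrate escaping, assuming a phone number written with a parenthesized area code as mentioned earlier in the module:

```python
import re

# \( and \) match literal parentheses; the unescaped ( ) still create groups.
phoneNumRegex = re.compile(r'(\(\d\d\d\)) (\d\d\d-\d\d\d\d)')
mo = phoneNumRegex.search('My phone number is (415) 555-4242.')

print(mo.group(1))   # '(415)'
print(mo.group(2))   # '555-4242'
```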

Matching Multiple Groups with the Pipe


• The | character is called a pipe.
• It can be used anywhere to match one of many expressions.
• For example, the regular expression r'Batman|Tina Fey' will match either 'Batman' or
'Tina Fey'.
• When both Batman and Tina Fey occur in the searched string, the first occurrence of
matching text will be returned as the Match object.
• Enter the following into the interactive shell:
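The shell session is missing here. A sketch of the pipe behaviour described in the bullets above (the search strings and variable names are assumptions):

```python
import re

heroRegex = re.compile(r'Batman|Tina Fey')

# Whichever alternative occurs first in the searched string is returned.
mo1 = heroRegex.search('Batman and Tina Fey')
print(mo1.group())   # 'Batman'

mo2 = heroRegex.search('Tina Fey and Batman')
print(mo2.group())   # 'Tina Fey'
```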
The pipe can also be used to match one of several patterns as part of the regex. For example,
consider matching any of the strings 'Batman', 'Batmobile', 'Batcopter', and 'Batbat'. Since all
these strings start with Bat, it is sufficient to specify that prefix only once. This can be done
with parentheses. Enter the following into the interactive shell:
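The listing is missing here. A sketch consistent with the results discussed next (the searched sentence is an assumption):

```python
import re

# The shared prefix Bat is written once; the group lists the possible endings.
batRegex = re.compile(r'Bat(man|mobile|copter|bat)')
mo = batRegex.search('Batmobile lost a wheel')

print(mo.group())    # 'Batmobile'
print(mo.group(1))   # 'mobile'
```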

The method call mo.group() returns the full matched text 'Batmobile', while mo.group(1)
returns just the part of the matched text inside the first parentheses group, 'mobile'. If we need
to match an actual pipe character, escape it with a backslash, like \|.
Optional Matching with the Question Mark
• Sometimes there is a pattern that you want to match only optionally.
• The ? character flags the group that precedes it as an optional part of the pattern.
• Enter the following into the interactive shell:

• The (wo)? part of the regular expression means that the pattern wo is an optional group.
• The regex will match text that has zero instances or one instance of wo in it.
• This is why the regex matches both 'Batwoman' and 'Batman'.
• Enter the following phone number example into the interactive shell:
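Both shell sessions for this subsection are missing. The sketch below covers the Batman/Batwoman case and a phone number with an optional area code (the sample sentences and the phone pattern are assumptions chosen to fit the description):

```python
import re

# (wo)? makes the group wo optional: zero or one occurrence.
batRegex = re.compile(r'Bat(wo)?man')
mo1 = batRegex.search('The Adventures of Batman')
mo2 = batRegex.search('The Adventures of Batwoman')
print(mo1.group())   # 'Batman'
print(mo2.group())   # 'Batwoman'

# Optional area code: the group (\d\d\d-) may or may not be present.
phoneRegex = re.compile(r'(\d\d\d-)?\d\d\d-\d\d\d\d')
mo3 = phoneRegex.search('My number is 415-555-4242')
mo4 = phoneRegex.search('My number is 555-4242')
print(mo3.group())   # '415-555-4242'
print(mo4.group())   # '555-4242'
```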

? can be thought of as saying, “Match zero or one of the group preceding this question
mark.”
If we need to match an actual question mark character, escape it with \?.
Matching Zero or More with the Star
The * (called the star or asterisk) means “match zero or more”.
The group that precedes the star can occur any number of times in the text.
It can be completely absent or repeated over and over again.
For example, enter the following into the interactive shell:
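The listing is missing from this copy. A sketch matching the three cases discussed next (the sentences around the names are assumptions):

```python
import re

# (wo)* allows the group wo to appear zero or more times.
batRegex = re.compile(r'Bat(wo)*man')

mo1 = batRegex.search('The Adventures of Batman')
mo2 = batRegex.search('The Adventures of Batwoman')
mo3 = batRegex.search('The Adventures of Batwowowowoman')

print(mo1.group())   # 'Batman'
print(mo2.group())   # 'Batwoman'
print(mo3.group())   # 'Batwowowowoman'
```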

For 'Batman', the (wo)* part of the regex matches zero instances of wo in the string; for
'Batwoman', the (wo)* matches one instance of wo; and for 'Batwowowowoman', (wo)*
matches four instances of wo.
To match an actual star character, prefix the star in the regular expression with a
backslash, \*.
Matching One or More with the Plus
The + (or plus) means “match one or more.”
The group preceding a plus must appear at least once. It is not optional.
Enter the following into the interactive shell:
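The listing is missing here. A sketch of the plus behaviour, including the non-matching case mentioned next (the other sample sentences are assumptions):

```python
import re

# (wo)+ requires the group wo at least once.
batRegex = re.compile(r'Bat(wo)+man')

mo1 = batRegex.search('The Adventures of Batwoman')
mo2 = batRegex.search('The Adventures of Batwowowowoman')
mo3 = batRegex.search('The Adventures of Batman')

print(mo1.group())   # 'Batwoman'
print(mo2.group())   # 'Batwowowowoman'
print(mo3 is None)   # True: no wo, so no match
```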

The regex Bat(wo)+man will not match the string 'The Adventures of Batman', because at least
one wo is required by the plus sign.
To match an actual plus sign character, prefix the plus sign with a backslash to escape it: \+.

Matching Specific Repetitions with Braces


To repeat a specific number of times, follow the group in your regex with a number in braces.
For example, the regex (Ha){3} will match the string 'HaHaHa', but it will not match 'HaHa',
since the latter has only two repeats of the (Ha) group.
Instead of one number, you can specify a range by writing a minimum, a comma, and a
maximum in between the braces.
For example, the regex (Ha){3,5} will match 'HaHaHa', 'HaHaHaHa', and 'HaHaHaHaHa'.
Also, (Ha){3,} will match three or more instances of the (Ha) group, while (Ha){,5} will match
zero to five instances.
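Braces can help make regular expressions shorter. These two regular expressions match
identical patterns:
(Ha){3}
(Ha)(Ha)(Ha)
And these two regular expressions also match identical patterns:
(Ha){3,5}
((Ha)(Ha)(Ha))|((Ha)(Ha)(Ha)(Ha))|((Ha)(Ha)(Ha)(Ha)(Ha))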

Enter the following into the interactive shell:
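The listing is missing here. A sketch consistent with the explanation that follows:

```python
import re

haRegex = re.compile(r'(Ha){3}')

mo1 = haRegex.search('HaHaHa')
print(mo1.group())   # 'HaHaHa'

# Only one Ha is present, so the pattern cannot match.
mo2 = haRegex.search('Ha')
print(mo2 is None)   # True
```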


Here, (Ha){3} matches 'HaHaHa' but not 'Ha'. Since it doesn’t match 'Ha', search() returns
None.

Greedy and Non-greedy Matching


In the string 'HaHaHaHaHa', the substrings 'HaHaHa', 'HaHaHaHa', and 'HaHaHaHaHa' are all valid matches of the regular expression (Ha){3,5}.
Python’s regular expressions are greedy by default, which means that in ambiguous
situations they will match the longest string possible.
The nongreedy (also called lazy) version of the braces, which matches the shortest string
possible, has the closing brace followed by a question mark.
The difference between the greedy and non-greedy forms of the braces searching the same
string can be seen in the code below:
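The code referred to above is missing from this copy. A minimal sketch of the comparison (variable names are assumptions):

```python
import re

# Greedy: takes as many repeats as allowed (up to five).
greedyHaRegex = re.compile(r'(Ha){3,5}')
mo1 = greedyHaRegex.search('HaHaHaHaHa')
print(mo1.group())   # 'HaHaHaHaHa'

# Non-greedy: the trailing ? makes it stop at the minimum (three).
nongreedyHaRegex = re.compile(r'(Ha){3,5}?')
mo2 = nongreedyHaRegex.search('HaHaHaHaHa')
print(mo2.group())   # 'HaHaHa'
```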

The question mark can have two meanings in regular expressions: declaring a non-greedy
match or flagging an optional group.
The findall() Method
In addition to the search() method, Regex objects also have a findall() method. While search()
will return a Match object of the first matched text in the searched string, the findall() method
will return the strings of every match in the searched string.
For example, enter the following into the interactive shell:
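The listing is missing here. A sketch showing that search() reports only the first match (the sample string is an assumption):

```python
import re

phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
mo = phoneNumRegex.search('Cell: 415-555-9999 Work: 212-555-0000')

# Only the first phone number is returned, even though two are present.
print(mo.group())   # '415-555-9999'
```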

findall() will return a list of strings as long as there are no groups in the regular expression.
Each string in the list is a piece of the searched text that matched the regular expression.
Enter the following into the interactive shell:
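The listing is missing here. A sketch of findall() on a regex with no groups (the sample string is an assumption):

```python
import re

phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')   # has no groups
numbers = phoneNumRegex.findall('Cell: 415-555-9999 Work: 212-555-0000')

# With no groups, findall() returns a list of matched strings.
print(numbers)   # ['415-555-9999', '212-555-0000']
```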

If there are groups in the regular expression, then findall() will return a list of tuples.
Each tuple has items that are the matched strings for each group in the regex.
For example, enter the following into the interactive shell:
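The listing is missing here. A sketch of findall() on a regex with groups (the sample string is an assumption):

```python
import re

phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)')   # has groups
parts = phoneNumRegex.findall('Cell: 415-555-9999 Work: 212-555-0000')

# With groups, each match becomes a tuple with one string per group.
print(parts)   # [('415', '555', '9999'), ('212', '555', '0000')]
```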
Character Classes
\d is shorthand for the regular expression (0|1|2|3|4|5|6|7|8|9).
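There are many such shorthand character classes, as shown in the table below.

Shorthand   Represents
\d          Any numeric digit from 0 to 9
\D          Any character that is not a numeric digit from 0 to 9
\w          Any letter, numeric digit, or the underscore character
\W          Any character that is not a letter, numeric digit, or the underscore character
\s          Any space, tab, or newline character
\S          Any character that is not a space, tab, or newline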

Character classes are nice for shortening regular expressions.


The character class [0-5] will match only the numbers 0 to 5; this is much shorter than typing
(0|1|2|3|4|5).
For example, enter the following into the interactive shell:
The regular expression \d+\s\w+ will match text that has one or more numeric digits (\d+),
followed by a whitespace character (\s), followed by one or more letter/digit/underscore
characters (\w+).
The findall() method returns all matching strings of the regex pattern in a list.
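The listing for the \d+\s\w+ example above is missing. A short sketch (the searched string is an assumption):

```python
import re

# One or more digits, one whitespace character, one or more word characters.
xmasRegex = re.compile(r'\d+\s\w+')
gifts = xmasRegex.findall('12 drummers, 11 pipers, 10 lords')

print(gifts)   # ['12 drummers', '11 pipers', '10 lords']
```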

Making Your Own Character Classes


A character class can be defined using square brackets.
For example, the character class [aeiouAEIOU] will match any vowel, both lowercase and
uppercase.
Enter the following into the interactive shell:
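The listing is missing here. A sketch of the vowel character class (the sample sentence is an assumption):

```python
import re

vowelRegex = re.compile(r'[aeiouAEIOU]')
vowels = vowelRegex.findall('RoboCop eats baby food. BABY FOOD.')

print(vowels)
# ['o', 'o', 'o', 'e', 'a', 'a', 'o', 'o', 'A', 'O', 'O']
```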

Ranges of letters or numbers can also be included by using a hyphen.


For example, the character class [a-zA-Z0-9] will match all lowercase letters, uppercase letters,
and numbers.
By placing a caret character (^) just after the character class’s opening bracket, one can make
a negative character class.
A negative character class will match all the characters that are not in the character class.
For example, enter the following into the interactive shell:
Here, instead of matching every vowel, every character that isn’t a vowel is being matched.
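The listing for the negative character class is missing. A sketch on a short assumed sample string:

```python
import re

# The caret right after [ negates the class: match anything that is not a vowel.
consonantRegex = re.compile(r'[^aeiouAEIOU]')
others = consonantRegex.findall('RoboCop eats.')

# Spaces and punctuation are matched too, since they are not vowels.
print(others)   # ['R', 'b', 'C', 'p', ' ', 't', 's', '.']
```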
The Caret and Dollar Sign Characters
A caret symbol (^) can be used at the start of a regex to indicate that a match must occur at the
beginning of the searched text.
A dollar sign ($) can be used at the end of the regex to indicate the string must end with this
regex pattern.
The ^ and $ can be used together to indicate that the entire string must match the regex.
The r'\d$' regular expression string matches strings that end with a numeric character from 0 to
9.
Enter the following into the interactive shell:

The r'^\d+$' regular expression string matches strings that consist entirely of one or more
numeric characters, from beginning to end.
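The shell sessions for this section are missing. The sketch below exercises the three anchors described above (the sample strings and variable names are assumptions):

```python
import re

# ^ anchors the match to the start of the string.
beginsWithHello = re.compile(r'^Hello')
print(beginsWithHello.search('Hello, world!').group())      # 'Hello'
print(beginsWithHello.search('He said Hello.') is None)     # True

# $ anchors the match to the end of the string.
endsWithNumber = re.compile(r'\d$')
print(endsWithNumber.search('Your number is 42').group())   # '2'
print(endsWithNumber.search('Your number is forty two.') is None)  # True

# ^ and $ together: the whole string must be digits.
wholeStringIsNum = re.compile(r'^\d+$')
print(wholeStringIsNum.search('1234567890').group())        # '1234567890'
print(wholeStringIsNum.search('12345xyz67890') is None)     # True
print(wholeStringIsNum.search('12 34567890') is None)       # True
```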
The Wildcard Character
The . (or dot) character in a regular expression is called a wildcard and will match any
character except for a newline.
For example, enter the following into the interactive shell:
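The listing is missing here. A sketch consistent with the 'flat' remark that follows (the sample sentence is an assumption):

```python
import re

# The dot stands for exactly one character (other than a newline).
atRegex = re.compile(r'.at')
hits = atRegex.findall('The cat in the hat sat on the flat mat.')

print(hits)   # ['cat', 'hat', 'sat', 'lat', 'mat']
```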


The dot character will match just one character. This is why the match for the text 'flat' in
the example matched only 'lat'.
To match an actual dot, escape the dot with a backslash: \.
Matching Everything with Dot-Star
The dot-star (.*) is used to match everything and anything.
The dot character means “any single character except the newline,” and the star character
means “zero or more of the preceding character.”
Enter the following into the interactive shell:
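The listing is missing here. One way to illustrate dot-star, assuming a string with labelled first and last names (the labels and the name are assumptions):

```python
import re

# Each (.*) captures whatever text sits between or after the fixed labels.
nameRegex = re.compile(r'First Name: (.*) Last Name: (.*)')
mo = nameRegex.search('First Name: Al Last Name: Sweigart')

print(mo.group(1))   # 'Al'
print(mo.group(2))   # 'Sweigart'
```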

The dot-star uses greedy mode: It will always try to match as much text as possible.
To match any and all text in a non-greedy fashion, use the dot, star, and question mark (.*?).
Enter the following into the interactive shell:
The string '<To serve man> for dinner.>' has two possible matches for the closing angle
bracket.
In the non-greedy version of the regex, Python matches the shortest possible string:
'<To serve man>'.
In the greedy version, Python matches the longest possible string:
'<To serve man> for dinner.>'.
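The shell session for this comparison is missing. A minimal sketch using the string quoted above (variable names are assumptions):

```python
import re

text = '<To serve man> for dinner.>'

# Non-greedy: stops at the first closing angle bracket.
nongreedyRegex = re.compile(r'<.*?>')
mo1 = nongreedyRegex.search(text)
print(mo1.group())   # '<To serve man>'

# Greedy: runs on to the last closing angle bracket.
greedyRegex = re.compile(r'<.*>')
mo2 = greedyRegex.search(text)
print(mo2.group())   # '<To serve man> for dinner.>'
```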

Matching Newlines with the Dot Character


The dot-star will match everything except a newline.
By passing re.DOTALL as the second argument to re.compile(), one can make the dot
character match all characters, including the newline character.
Enter the following into the interactive shell:

The regex noNewlineRegex, which did not have re.DOTALL passed to the re.compile() call
that created it, will match everything only up to the first newline character, whereas
newlineRegex, which had re.DOTALL passed to re.compile(), matches everything.
This is why the newlineRegex.search() call matches the full string, including its newline
characters.
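The shell session for noNewlineRegex and newlineRegex is missing. A sketch of the behaviour described above (the three-line sample text is an assumption):

```python
import re

text = 'Serve the public trust.\nProtect the innocent.\nUphold the law.'

# Without re.DOTALL the dot stops at the first newline.
noNewlineRegex = re.compile('.*')
first = noNewlineRegex.search(text).group()
print(first)   # 'Serve the public trust.'

# With re.DOTALL the dot also matches newline characters.
newlineRegex = re.compile('.*', re.DOTALL)
everything = newlineRegex.search(text).group()
print(everything == text)   # True
```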
Review of Regex Symbols
• The ? matches zero or one of the preceding group.
• The * matches zero or more of the preceding group.
• The + matches one or more of the preceding group.
• The {n} matches exactly n of the preceding group.
• The {n,} matches n or more of the preceding group.
• The {,m} matches 0 to m of the preceding group.
• The {n,m} matches at least n and at most m of the preceding group.
• {n,m}? or *? or +? performs a non-greedy match of the preceding group.
• ^spam means the string must begin with spam.
• spam$ means the string must end with spam.
• The . matches any character, except newline characters.
• \d, \w, and \s match a digit, word, or space character, respectively.
• \D, \W, and \S match anything except a digit, word, or space character, respectively.
• [abc] matches any character between the brackets (such as a, b, or c).
• [^abc] matches any character that isn’t between the brackets.

Case-Insensitive Matching
• Regular expressions match text with the exact casing that is specified.
• For example, the following regexes match completely different strings:

• But to make the regex case insensitive, re.IGNORECASE or re.I can be passed as a
second argument to re.compile().
• Enter the following into the interactive shell:
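Both listings for this section are missing. The sketch below first shows case-sensitive patterns that differ only in casing, then a case-insensitive one (the word RoboCop and the sample sentences are assumptions):

```python
import re

# Case-sensitive by default: each of these matches only its exact casing.
regex1 = re.compile('RoboCop')
regex2 = re.compile('ROBOCOP')
print(regex1.search('ROBOCOP') is None)   # True

# re.I (short for re.IGNORECASE) ignores casing.
robocop = re.compile(r'robocop', re.I)
m1 = robocop.search('RoboCop is part man, part machine, all cop.')
m2 = robocop.search('ROBOCOP protects the innocent.')
print(m1.group())   # 'RoboCop'
print(m2.group())   # 'ROBOCOP'
```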
Substituting Strings with the sub() Method
Regular expressions can not only find text patterns but can also substitute new text in place of
those text patterns.
The sub() method for Regex objects is passed two arguments.
The first argument is a string to replace any matches.
The second is the string to be searched, in which every match of the regular expression is replaced.
The sub() method returns a new string with the substitutions applied.


For example, enter the following into the interactive shell:
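The listing is missing here. A sketch of sub(), assuming a pattern that hides agent names (the pattern, replacement text, and sentence are assumptions):

```python
import re

namesRegex = re.compile(r'Agent \w+')

# First argument: the replacement text. Second argument: the string to search.
censored = namesRegex.sub('CENSORED', 'Agent Alice gave the secret documents to Agent Bob.')

print(censored)   # 'CENSORED gave the secret documents to CENSORED.'
```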

Combining re.IGNORECASE, re.DOTALL, and re.VERBOSE


The re.compile() function takes only a single value as its second argument.

This limitation can be overcome by combining the re.IGNORECASE, re.DOTALL, and re.VERBOSE
values using the pipe character (|).
In this context the pipe is known as the bitwise or operator.

So if a regular expression is required that is case-insensitive and lets the dot character
match newlines, the re.compile() call can be formed as:

Including all three options in the second argument will look like this:
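Both listings are missing from this copy. A sketch of the two calls, with hypothetical patterns and test strings chosen only to make the effect of each flag visible:

```python
import re

# Two flags: ignore case, and let the dot match newlines.
twoFlags = re.compile('foo.bar', re.IGNORECASE | re.DOTALL)
print(twoFlags.search('FOO\nBAR') is not None)   # True

# Three flags: re.VERBOSE additionally allows whitespace and comments in the pattern.
threeFlags = re.compile(r'''
    hello    # a word, matched in any casing because of re.IGNORECASE
    .*       # anything, including newlines because of re.DOTALL
    world
''', re.IGNORECASE | re.DOTALL | re.VERBOSE)
print(threeFlags.search('HELLO\nthere\nWorld') is not None)   # True
```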
