Unit 4 - Regular Expressions
import re
Example
In this example, we define a function named match_example that takes
a regular expression pattern and a text string as arguments. Inside the
function, re.match() checks whether the pattern matches at the
beginning of the text. The pattern r'\d+' matches one or more digits.
When the function is called with the example text, it finds "100" at the
start of the text and reports that the pattern is present.
import re
def match_example(pattern, text):
    matched = re.match(pattern, text)
    if matched:
        print(f"Pattern '{pattern}' found at the beginning of the text.")
    else:
        print(f"Pattern '{pattern}' not found at the beginning of the text.")
# Example usage
pattern = r'\d+'  # one or more digits
text = "100 is the product code."
match_example(pattern, text)
Output
Pattern '\d+' found at the beginning of the text.
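Because re.match() only checks the start of the string, the same pattern fails when the digits appear later in the text. A minimal sketch of that contrast follows; the helper name starts_with_pattern and the second sample sentence are made up for illustration and are not part of the original example.

```python
import re

def starts_with_pattern(pattern, text):
    """Return True if the pattern matches at the very start of text."""
    return re.match(pattern, text) is not None

# Digits at the start: re.match() succeeds.
print(starts_with_pattern(r'\d+', "100 is the product code."))   # True
# Digits later in the string: re.match() returns None.
print(starts_with_pattern(r'\d+', "The product code is 100."))   # False
```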
Like the search() function, the match() function accepts flags that
modify the behavior of the regular expression. One such flag is
re.IGNORECASE, which makes the match case-insensitive, as the
following example shows:
Example
import re
def case_insensitive_match(pattern, text):
    matched = re.match(pattern, text, re.IGNORECASE)
    if matched:
        print(f"Pattern '{pattern}' found (case-insensitive) at the beginning of the text.")
    else:
        print(f"Pattern '{pattern}' not found at the beginning of the text.")
# Example usage
pattern = r'\bhello\b'
text = "Hello, World! Welcome to the Hello World program."
case_insensitive_match(pattern, text)
Like the search() function, the match() function can also capture
specific parts of the matched text by using groups. Groups are portions
of the pattern enclosed in parentheses; they let us extract specific
information from the matched text.
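Example
import re
# Strings to be matched, stored in a list.
listString = ["string stay", "stringers sit", "string dope"]
# Two words that each start with "s", separated by one non-word character.
pattern = r"(s\w+)\W(s\w+)"
# Loop through the list and check each string for a match.
for text in listString:
    matched = re.match(pattern, text)
    if matched:
        print(matched.groups())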
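To make the role of each group concrete, here is a small sketch that applies the pattern above to a single string and reads the groups back one by one. It uses only the first list entry; the variable name matched is an illustrative choice.

```python
import re

# Same pattern as above: two words starting with "s",
# separated by one non-word character.
pattern = r"(s\w+)\W(s\w+)"
matched = re.match(pattern, "string stay")

if matched:
    print(matched.group(0))   # whole match: 'string stay'
    print(matched.group(1))   # first group: 'string'
    print(matched.group(2))   # second group: 'stay'
    print(matched.groups())   # all groups as a tuple
```

A string such as "string dope" produces no match object at all, because the second word does not start with "s".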
The span() method of the match object allows us to retrieve the position
(start and end indices) of the matched text within the input string. This
information can be instrumental in further processing or highlighting
matched substrings.
import re
def retrieve_match_position(pattern, text):
    matched = re.match(pattern, text)
    if matched:
        matched_text = matched.group()
        # span() returns (start, end), where end is exclusive
        start_index, end_index = matched.span()
        print(f"Pattern '{pattern}' found at indices {start_index} to {end_index - 1}.")
        print(f"Matched text: '{matched_text}'")
    else:
        print(f"Pattern '{pattern}' not found at the beginning of the text.")
# Example usage
# The text starts with "The", not a digit, so re.match() finds nothing here
# and the "not found" branch runs.
pattern = r'\b\d+\b'
text = "The price of the product is $100. The discounted price is $50."
retrieve_match_position(pattern, text)
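As a sketch of the "highlighting" use mentioned above, the indices from span() can be used to slice the input string. Note the assumptions: this version swaps in re.search() so that the number can be found anywhere in the text, and the helper name highlight_first_match and the square-bracket markers are invented for illustration.

```python
import re

def highlight_first_match(pattern, text):
    """Wrap the first match of pattern in square brackets, using span()."""
    found = re.search(pattern, text)
    if found is None:
        return text
    start, end = found.span()   # end is exclusive, like a slice bound
    return text[:start] + "[" + text[start:end] + "]" + text[end:]

text = "The price of the product is $100. The discounted price is $50."
print(highlight_first_match(r'\b\d+\b', text))
# The price of the product is $[100]. The discounted price is $50.
```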
Using match() with Multiline Text
Example
# Example usage
pattern = r'^python'
text = "Python is an amazing language.\npython is a snake.\nPYTHON is great."
match_multiline_text(pattern, text)
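The definition of match_multiline_text is not shown in this section. The sketch below is one possible shape for it, assuming it was meant to pass re.MULTILINE to re.match(); the body and the extra findall() comparison are assumptions, not the original code.

```python
import re

def match_multiline_text(pattern, text):
    # Assumed implementation: the original definition is not shown.
    matched = re.match(pattern, text, re.MULTILINE)
    if matched:
        print(f"Pattern '{pattern}' found at the beginning of the text.")
    else:
        print(f"Pattern '{pattern}' not found at the beginning of the text.")
    return matched

pattern = r'^python'
text = "Python is an amazing language.\npython is a snake.\nPYTHON is great."

# re.match() still looks only at position 0, even with re.MULTILINE,
# and 'python' does not match 'Python' without re.IGNORECASE.
result = match_multiline_text(pattern, text)

# re.findall() with re.MULTILINE checks the start of every line.
line_starts = re.findall(pattern, text, re.MULTILINE | re.IGNORECASE)
print(line_starts)   # ['Python', 'python', 'PYTHON']
```

The point of the comparison is that re.MULTILINE changes what ^ means, but it does not make re.match() scan past the start of the string.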
Syntax:
re.search(pattern, string, flags=0)
Parameters
pattern − The regular expression pattern to search for.
string − The input string in which to search for the pattern.
flags (optional) − Additional flags that modify the behavior of the
search.
Return Value
A match object for the first location where the pattern matches, or
None if no match is found.
Example
In this example, we import the 're' module to work with regular
expressions. The input string 'text' contains the phrase "I have
an apple and a banana." The regular expression pattern 'r"apple"'
looks for the exact sequence "apple" within 'text'.
We then call 're.search()' with the pattern and the text as
arguments. If a match is found, the function returns a match object;
otherwise it returns None.
Finally, the code checks the result and prints "Pattern found!" if a
match was found, or "Pattern not found." otherwise.
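import re
def basic_search_example():
    text = "I have an apple and a banana."
    pattern = r"apple"
    result = re.search(pattern, text)
    if result:
        print("Pattern found!")
    else:
        print("Pattern not found.")
# Example usage
basic_search_example()
Output
Pattern found!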
Example
In this example, the input string 'text' contains the phrase "I
have an Apple and a banana." The pattern 'r"apple"' is unchanged,
but this time we pass the 're.IGNORECASE' flag as the third
argument to 're.search()'.
The 're.IGNORECASE' flag tells 'search()' to perform a
case-insensitive search, so the pattern matches both "apple" and
"Apple" in the input string.
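import re
def ignore_case_search_example():
    text = "I have an Apple and a banana."
    pattern = r"apple"
    result = re.search(pattern, text, re.IGNORECASE)
    if result:
        print("Pattern found!")
    else:
        print("Pattern not found.")
# Example usage
ignore_case_search_example()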
Example
In this example, the input string 'email' holds the email address
"[email protected]". The regular expression pattern 'r"@(.+)$"'
extracts the domain name from the email address.
The '@' symbol matches the "@" character in the email address.
The parentheses '()' create a group that captures the domain name.
The '.+' part of the pattern matches one or more characters
(excluding a newline).
The '$' symbol matches the end of the string.
When 're.search()' finds a match, it returns a match object. We then
call 'group(1)' on the match object to extract the content of the
first (and only) group, which is the domain name.
import re
def extract_domain_example():
    email = "[email protected]"
    pattern = r"@(.+)$"
    result = re.search(pattern, email)
    if result:
        domain = result.group(1)
        print(f"Domain: {domain}")
    else:
        print("Pattern not found.")
# Example usage
extract_domain_example()
Output
Domain: example.com
Finding Multiple Occurrences of a Pattern
Example
In this example, the input string 'text' comprises the phrase "I have an
apple, and she has an apple too." The regular expression pattern
'r"apple"' remains unchanged.
By leveraging the 're.findall()' function with the pattern and 'text' as
arguments, we obtain a list containing all occurrences of the pattern
in the text. If no match is found, an empty list is returned.
The code checks the result, and if occurrences are detected, it prints
the list of occurrences.
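import re
def find_all_occurrences_example():
    text = "I have an apple, and she has an apple too."
    pattern = r"apple"
    results = re.findall(pattern, text)
    if results:
        print(f"Occurrences of 'apple': {results}")
    else:
        print("No occurrences found.")
find_all_occurrences_example()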
Output
Occurrences of 'apple': ['apple', 'apple']
Example
In this example, the input string 'text' contains the phrase "The cat ran
on the mat." The regular expression pattern 'r"\b...\b"' is used to find
three-character sequences that begin and end at a word boundary.
The '\b' represents a word boundary.
The '...' matches any three characters except a newline, not only
letters, which is why the output below includes ' on' (a space followed
by two letters).
The 're.findall()' function returns a list of all such matches. If no
match is found, an empty list is returned.
The code checks the result and prints the list if any matches are
found.
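import re
def dot_metacharacter_example():
    text = "The cat ran on the mat."
    pattern = r"\b...\b"
    results = re.findall(pattern, text)
    if results:
        print(f"Three-letter words: {results}")
    else:
        print("No three-letter words found.")
# Example usage
dot_metacharacter_example()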
Output
Three-letter words: ['The', 'cat', 'ran', ' on', 'the', 'mat']
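The ' on' entry appears because the dot also matches a space. A minimal sketch of one way to restrict the match to word characters, using \w{3} in place of the three dots (this variant is an illustration, not part of the original example):

```python
import re

text = "The cat ran on the mat."

# '.' matches any character, so a space plus two letters also qualifies.
with_dots = re.findall(r"\b...\b", text)
# \w{3} restricts each of the three characters to word characters.
with_word_chars = re.findall(r"\b\w{3}\b", text)

print(with_dots)        # ['The', 'cat', 'ran', ' on', 'the', 'mat']
print(with_word_chars)  # ['The', 'cat', 'ran', 'the', 'mat']
```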
import re
Output
The9rain9in9Spain
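The code that produced the output above is not shown in this section. One way to obtain it, assuming the input was the string "The rain in Spain" and that re.sub() was used to replace each whitespace character with the digit 9 (both are assumptions):

```python
import re

# Assumed input string; the original code is not shown.
txt = "The rain in Spain"
# Replace every whitespace character with the digit 9.
result = re.sub(r"\s", "9", txt)
print(result)   # The9rain9in9Spain
```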
Regular Expression Modifiers
S.No. Modifier & Description
1  re.I
   Performs case-insensitive matching.
2  re.L
   Interprets words according to the current locale. This
   interpretation affects the alphabetic group (\w and \W), as
   well as word boundary behavior (\b and \B).
3  re.M
   Makes $ match the end of a line (not just the end of the
   string) and makes ^ match the start of any line (not just the
   start of the string).
4  re.S
   Makes a period (dot) match any character, including a
   newline.
5  re.U
   Interprets letters according to the Unicode character set. This
   flag affects the behavior of \w, \W, \b, \B.
6  re.X
   Permits more readable ("verbose") regular expression syntax. It
   ignores whitespace (except inside a set [] or when escaped by a
   backslash) and treats unescaped # as a comment marker.
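The effect of several of these modifiers can be seen in a short sketch. The sample strings and the name date_pattern are made up for illustration.

```python
import re

text = "first line\nSecond line"

# re.I: case-insensitive matching.
print(re.findall(r"second", text, re.I))                  # ['Second']
# re.M: ^ also matches at the start of each line.
print(re.findall(r"^\w+", text, re.M))                    # ['first', 'Second']
# re.S: '.' also matches the newline character.
print(re.search(r"line.Second", text, re.S) is not None)  # True
# re.X: whitespace and # comments inside the pattern are ignored.
date_pattern = re.compile(r"""
    (\d{4})   # year
    -
    (\d{2})   # month
""", re.X)
print(date_pattern.match("2024-05").groups())             # ('2024', '05')
```

Flags can be combined with the | operator, for example re.I | re.M.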
S.No. Pattern & Description
1  ^
   Matches beginning of the string; with re.M, also the beginning
   of each line.
2  $
   Matches end of the string (or just before a trailing newline);
   with re.M, also the end of each line.
3  .
   Matches any single character except newline. Using the re.S
   option allows it to match newline as well.
4  [...]
   Matches any single character in brackets.
5  [^...]
   Matches any single character not in brackets.
6  re*
   Matches 0 or more occurrences of preceding expression.
7  re+
   Matches 1 or more occurrences of preceding expression.
8  re?
   Matches 0 or 1 occurrence of preceding expression.
9  re{n}
   Matches exactly n occurrences of preceding expression.
10 re{n,}
   Matches n or more occurrences of preceding expression.
11 re{n,m}
   Matches at least n and at most m occurrences of preceding
   expression.
12 a|b
   Matches either a or b.
13 (re)
   Groups regular expressions and remembers matched text.
14 (?imx)
   Turns on i, m, or x options. In Python these apply to the whole
   pattern and must appear at the start of the expression.
15 (?-imx)
   Turns off i, m, or x options. Python's re module accepts this
   only in the scoped form (?-imx:re).
16 (?:re)
   Groups regular expressions without remembering matched
   text.
17 (?imx:re)
   Temporarily toggles on i, m, or x options within parentheses.
18 (?-imx:re)
   Temporarily toggles off i, m, or x options within parentheses.
19 (?#...)
   Comment.
20 (?=re)
   Positive lookahead: asserts that re matches at this position
   without consuming any characters.
21 (?!re)
   Negative lookahead: asserts that re does not match at this
   position, without consuming any characters.
22 (?>re)
   Matches independent pattern without backtracking (supported by
   Python's re module from version 3.11).
23 \w
   Matches word characters.
24 \W
   Matches nonword characters.
25 \s
   Matches whitespace. Equivalent to [ \t\n\r\f\v] for ASCII text.
26 \S
   Matches nonwhitespace.
27 \d
   Matches digits. Equivalent to [0-9] for ASCII text.
28 \D
   Matches nondigits.
29 \A
   Matches beginning of string.
30 \Z
   Matches only at the end of the string.
31 \z
   Matches end of string. Python's re module accepts \z only in
   recent versions (as a synonym for \Z); older versions require \Z.
32 \G
   Matches point where last match finished. Not supported by
   Python's re module.
33 \b
   Matches word boundaries when outside brackets. Matches
   backspace (0x08) when inside brackets.
34 \B
   Matches nonword boundaries.
36 \1...\9
   Matches nth grouped subexpression.
37 \10
   Matches nth grouped subexpression if it matched already.
   Otherwise refers to the octal representation of a character
   code.
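A few of the patterns in the table can be tried directly with re.findall() and re.search(). The sample strings below are invented for illustration.

```python
import re

# Quantifier {n,m}: between 2 and 3 digits, as a whole word.
print(re.findall(r"\b\d{2,3}\b", "7 42 123 4567"))      # ['42', '123']
# Alternation inside a non-capturing group, followed by an optional 's'.
print(re.findall(r"(?:cat|dog)s?", "cats and dog"))     # ['cats', 'dog']
# Positive lookahead: digits only if followed by 'px' ('px' is not consumed).
print(re.findall(r"\d+(?=px)", "10px 20em 30px"))       # ['10', '30']
# Backreference: \1 repeats whatever group 1 matched.
repeated = re.search(r"\b(\w+) \1\b", "this is is a test")
print(repeated.group())                                  # is is
```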
compile() Function
The built-in compile() function returns the specified source as a code
object, ready to be executed. It is separate from re.compile(), which
compiles a regular expression pattern into a pattern object.
Syntax
compile(source, filename, mode, flags, dont_inherit, optimize)
Parameter Values
Parameter Description
filename Required. The name of the file that the source comes from. If
the source does not come from a file, you can write whatever
you like.
Example
55
55
88
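The code for this example is not shown; only its output survives. A minimal sketch that would print the same three lines with the built-in compile(), assuming two small snippets and the arbitrary filename label 'test' (all of which are assumptions):

```python
# Built-in compile(): turn source text into a code object, then run it.
single = compile('print(55)', 'test', 'eval')
exec(single)        # prints 55

several = compile('print(55)\nprint(88)', 'test', 'exec')
exec(several)       # prints 55, then 88
```

The mode argument is 'eval' for a single expression and 'exec' for a block of statements.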
Readability
Python internally compiles and caches regexes whenever you use them
(including in calls to search() or match()), so by calling re.compile()
you only change when the regex gets compiled.
Compiling a pattern signals that the regular expression will be used
many times and is meant to be kept around.
By compiling once and re-using the same regex multiple times, we
reduce the possibility of typos.
When you are using lots of different regexes, you should keep your
compiled expressions for those which are used multiple times, so
they’re not flushed out of the regex cache when the cache is full.
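A minimal sketch of the compile-once, reuse-many-times idea with re.compile(). The names NUMBER and first_number and the sample strings are hypothetical.

```python
import re

# Compile the pattern once and keep the pattern object around.
NUMBER = re.compile(r"\d+")

def first_number(text):
    """Return the first run of digits in text, or None if there is none."""
    found = NUMBER.search(text)
    return found.group() if found else None

for line in ["100 is the product code.", "no digits here", "price: 50"]:
    print(line, "->", first_number(line))
```

Because the pattern is written in one place, a change to it automatically applies to every call that uses NUMBER.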