Unit 4 - Regular Expressions

Used for MCA student refer to the best for your future

Uploaded by

nivithaswathi
Copyright © All Rights Reserved

Regular Expression

match() function - search() function - Search and Replace - Regular
Expression Modifiers: Option Flags - Regular Expression Patterns -
findall() method - compile() method.

Regular expressions are sequences of characters that define search
patterns. They are widely used to match and manipulate strings according to
specific rules, offering a concise and flexible way to perform complex text
searches and replacements.

Purpose of the match() Function

The match() function in Python's 're' module performs pattern matching only
at the beginning of a given string. Unlike search(), which looks for the
pattern anywhere in the string, match() tries to find the pattern only at
the very start. If the pattern is found there, match() returns a match
object representing that match; otherwise, it returns None.
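
The difference between match() and search() can be seen in a minimal
sketch. The sample string and variable names here are illustrative
assumptions, not part of the original notes.

```python
import re

text = "The rain in Spain"  # illustrative sample string

# match() only tries the pattern at index 0 of the string.
at_start = re.match(r"rain", text)

# search() scans forward and reports the first place the pattern fits.
anywhere = re.search(r"rain", text)

print(at_start)         # None, because the string starts with "The"
print(anywhere.span())  # (4, 8)
```

The same pattern gives two different results only because of where each
function is allowed to look.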

Syntax of the match() Function

re.match(pattern, string, flags=0)

Where −
• pattern − The regular expression pattern to be matched at the beginning
  of the string. Common character classes used in patterns:
    o \d − Matches any decimal digit; equivalent to the class [0-9].
    o \D − Matches any non-digit character; equivalent to the class [^0-9].
    o \s − Matches any whitespace character; equivalent to the class
      [ \t\n\r\f\v].
    o \S − Matches any non-whitespace character; equivalent to the class
      [^ \t\n\r\f\v].
    o \w − Matches any alphanumeric character or underscore; equivalent to
      the class [a-zA-Z0-9_].
    o \W − Matches any character that is not alphanumeric or underscore;
      equivalent to the class [^a-zA-Z0-9_].
  These equivalences hold exactly for ASCII matching (re.A). By default,
  Python 3 str patterns also match the corresponding Unicode characters.
• string − The input string in which the match is attempted.
• flags (optional) − Flags that modify the behavior of the regular
  expression, typically specified using constants from the 're' module.
    o re.I: Performs a case-insensitive search.
    o re.L: Causes words to be interpreted according to the current
      locale.
    o re.S: Makes a period (dot) match any character, including a
      newline.
    o re.U: Interprets letters according to the Unicode character set.
      This affects the behavior of \w, \W, \b, and \B. In Python 3 this
      is already the default for str patterns.
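
A short sketch of how the character classes and two of the flags above
behave in practice. The sample strings and variable names are made up for
illustration.

```python
import re

# Character classes on small, made-up sample strings.
digits = re.findall(r"\d", "a1 b22")      # every decimal digit
words = re.findall(r"\w+", "hi, there!")  # runs of word characters

# re.I: the lowercase pattern also matches uppercase text.
ci = re.match(r"python", "PYTHON rocks", re.I)

# re.S: without it the dot stops at a newline; with it the dot matches "\n".
no_dotall = re.match(r"a.b", "a\nb")
dotall = re.match(r"a.b", "a\nb", re.S)

print(digits)     # ['1', '2', '2']
print(words)      # ['hi', 'there']
print(ci.group()) # PYTHON
print(no_dotall)  # None
```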

The following call returns a match object:

import re

txt = "The rain in Spain"
x = re.match("Th", txt)
print(x)

Output: <re.Match object; span=(0, 2), match='Th'>

If there is no match, the value None is returned instead of the match
object.

Basic Usage of match()

Let's begin with a basic example that demonstrates the match() function −

Example
In this example, we define a function named match_example, which takes
a regular expression pattern and a text string as arguments. Inside the
function, we use re.match() to look for the pattern at the beginning of
the text. The pattern r'\d+' matches one or more digits. When the function
is called with the example text, it finds "100" at the start of the text
and reports the pattern's presence.

import re

def match_example(pattern, text):
    # re.match() returns a match object or None
    matched = re.match(pattern, text)
    if matched:
        print(f"Pattern '{pattern}' found at the beginning of the text.")
    else:
        print(f"Pattern '{pattern}' not found at the beginning of the text.")

# Example usage
pattern = r'\d+'  # one or more digits
text = "100 is the product code."
match_example(pattern, text)

Output
Pattern '\d+' found at the beginning of the text.

Flags in the match() Function

Similar to the search() function, the match() function permits the use of
flags to modify the behavior of the regular expression. An example of such
a flag is the re.IGNORECASE flag, which renders the match case-
insensitive. Let's explore this flag in the following example −

Using the re.IGNORECASE Flag

In this example, we define a function named case_insensitive_match, which
takes a regular expression pattern and a text string as arguments. By
calling re.match() with the re.IGNORECASE flag, we perform a
case-insensitive match for the pattern at the beginning of the text. The
pattern r'\bhello\b' stands for the word "hello" delimited by word
boundaries. When the function is called with the example text, it detects
the word "Hello" at the start of the text, confirming the pattern's
presence regardless of case.

Example
import re

def case_insensitive_match(pattern, text):
    # re.IGNORECASE lets "hello" match "Hello", "HELLO", and so on
    matched = re.match(pattern, text, re.IGNORECASE)
    if matched:
        print(f"Pattern '{pattern}' found (case-insensitive) at the beginning of the text.")
    else:
        print(f"Pattern '{pattern}' not found at the beginning of the text.")

# Example usage
pattern = r'\bhello\b'
text = "Hello, World! Welcome to the Hello World program."
case_insensitive_match(pattern, text)

Output
Pattern '\bhello\b' found (case-insensitive) at the beginning of the text.

Capturing Matched Text Using Groups

Similar to the search() function, the match() function also affords us the
opportunity to capture specific parts of the matched text by employing
groups. Groups constitute portions of the pattern enclosed within
parentheses, allowing us to extract specific information from the matched
text.

Example:
import re

# Strings to be matched are stored in a list.
listString = ["string stay", "stringers sit", "string dope"]

# Two words that each start with "s", separated by one non-word character.
pattern = r"(s\w+)\W(s\w+)"

# Loop through the list and check each string for a match.
for string in listString:
    match = re.match(pattern, string)
    if match:
        # groups() returns the captured groups as a tuple
        print(match.groups())
        # group() returns the entire matched text as a single string
        print(match.group())

O/P:
('string', 'stay')
string stay
('stringers', 'sit')
stringers sit
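
Individual groups can also be read one at a time. The sketch below reuses
the same pattern but adds group names ("first" and "second"), which are an
illustrative assumption and not part of the original example.

```python
import re

# Same pattern as above, with hypothetical names attached to the two groups.
pattern = r"(?P<first>s\w+)\W(?P<second>s\w+)"
m = re.match(pattern, "string stay")

whole = m.group(0)              # the entire match
first = m.group(1)              # first group, by number
second = m.group("second")      # second group, by name
as_dict = m.groupdict()         # all named groups as a dictionary

print(whole, first, second, as_dict)
```

Numbered access works for any pattern with parentheses; named access
requires the (?P<name>...) syntax.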

Using the span() Method for Match Position

The span() method of the match object allows us to retrieve the position
(start and end indices) of the matched text within the input string. This
information can be instrumental in further processing or highlighting
matched substrings.

In this example, we define a function named retrieve_match_position,
which takes a regular expression pattern and a text string as arguments.
Using re.match(), we attempt to match the pattern at the beginning of the
text. The pattern r'\b\d+\b' matches one or more digits delimited by word
boundaries. When a match is found, the function prints the start index and
the last index of the match (end_index - 1, because span() returns an
exclusive end) together with the matched text obtained from group(). For
the example text, however, match() returns None: the text begins with
"The", not with digits, so the numbers "100" and "50" that occur later in
the string are not found. Locating them requires search() or finditer(),
which scan the whole string.

import re

def retrieve_match_position(pattern, text):
    matched = re.match(pattern, text)
    if matched:
        matched_text = matched.group()
        # span() returns (start, end) with an exclusive end index
        start_index, end_index = matched.span()
        print(f"Pattern '{pattern}' found at indices {start_index} to {end_index - 1}.")
        print(f"Matched text: '{matched_text}'")
    else:
        print(f"Pattern '{pattern}' not found at the beginning of the text.")

# Example usage
pattern = r'\b\d+\b'
text = "The price of the product is $100. The discounted price is $50."
retrieve_match_position(pattern, text)

Output
Pattern '\b\d+\b' not found at the beginning of the text.
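
Because match() looks only at the start of the string, it returns None for
this text, which begins with "The". A minimal sketch of how the positions
of the numbers could be collected instead, assuming re.finditer() is
swapped in for re.match():

```python
import re

text = "The price of the product is $100. The discounted price is $50."
pattern = r"\b\d+\b"

# match() fails: the text does not begin with a digit.
at_start = re.match(pattern, text)

# finditer() yields one match object per occurrence anywhere in the text.
positions = [(m.group(), m.span()) for m in re.finditer(pattern, text)]

print(at_start)   # None
print(positions)  # [('100', (29, 32)), ('50', (59, 61))]
```

Each span is (start, end) with an exclusive end, so "100" occupies indices
29 to 31 and "50" occupies indices 59 to 60.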
Using match() with Multiline Text

The match() function always anchors at the beginning of the input string,
even when the text spans several lines. The re.MULTILINE flag makes '^'
match at the start of every line, but it does not change where match()
starts looking, so match() still tests only the beginning of the first
line. To find a pattern at the start of any line, use search() or
findall() with re.MULTILINE. The following example illustrates this
behavior.

Example

In this example, we define a function named match_multiline_text, which
takes a regular expression pattern and a text string as arguments. By
calling re.match() with the re.MULTILINE flag, we test the pattern
r'^python', which stands for the word "python" at the beginning of a line.
The example text begins with "Python" (capital "P"), so the case-sensitive
match at the start of the string fails and the function reports that the
pattern was not found, even though the second line begins with "python".

import re

def match_multiline_text(pattern, text):
    # Even with re.MULTILINE, match() only tests the start of the string
    matched = re.match(pattern, text, re.MULTILINE)
    if matched:
        print(f"Pattern '{pattern}' found at the beginning of the text.")
    else:
        print(f"Pattern '{pattern}' not found at the beginning of the text.")

# Example usage
pattern = r'^python'
text = "Python is an amazing language.\npython is a snake.\nPYTHON is great."
match_multiline_text(pattern, text)

Output
Pattern '^python' not found at the beginning of the text.
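
Note that re.match() tests only the start of the whole string, even with
re.MULTILINE. A minimal sketch of matching at the start of each line,
assuming search() and findall() are used in place of match():

```python
import re

text = "Python is an amazing language.\npython is a snake.\nPYTHON is great."

# match() fails: the string itself starts with "Python", not "python".
at_start = re.match(r"^python", text, re.MULTILINE)

# search() with re.MULTILINE finds the first line that starts with "python".
first = re.search(r"^python", text, re.MULTILINE)

# Adding re.IGNORECASE picks up the start of all three lines.
all_lines = re.findall(r"^python", text, re.MULTILINE | re.IGNORECASE)

print(at_start)      # None
print(first.span())  # (31, 37), the start of the second line
print(all_lines)     # ['Python', 'python', 'PYTHON']
```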

The 'search()' Function in Python


An indispensable part of the 're' module, the 'search()' function enables the
search for specified patterns within a given string.

Syntax:
re.search(pattern, string, flags=0)

Parameters
• pattern − The regular expression pattern to be sought.
• string − The input string within which the pattern is to be found.
• flags (optional) − Additional flags to modify the behavior of the
  search.

Return Value

The 'search()' function returns a match object when the pattern is
found within the string; otherwise, it returns None.

Fundamental Usage of 'search()'

To demonstrate the basic use of the 'search()' function, let us consider a
simple example. Our aim is to search for the word "apple" in a given
string.

Example
• In this example, we import the 're' module to use regular expressions.
  The input string 'text' holds the phrase "I have an apple and a
  banana." The regular expression pattern r"apple" specifies that we are
  looking for the exact text "apple" within 'text'.
• We then call 're.search()' with the pattern and 'text' as arguments.
  When a match is found, the function returns a match object. If the
  pattern is not found, the function returns None.
• Finally, the code checks the result and prints "Pattern found!" if a
  match is found, or "Pattern not found." otherwise.

import re

def basic_search_example():
    text = "I have an apple and a banana."
    pattern = r"apple"
    # search() scans the whole string for the first occurrence
    result = re.search(pattern, text)
    if result:
        print("Pattern found!")
    else:
        print("Pattern not found.")

# Example usage
basic_search_example()

O/P: Pattern found!

Ignoring Case Sensitivity with Flags

The 'search()' function can be adapted through flags. One of the most
useful is 're.IGNORECASE', which enables case-insensitive searches. Let's
revisit the previous example, this time ignoring case while searching for
the word "apple."

Example
• Here the input string 'text' holds the phrase "I have an Apple and a
  banana." The pattern r"apple" remains unchanged, but this time we pass
  the 're.IGNORECASE' flag as the third argument to 're.search()'.
• The 're.IGNORECASE' flag instructs 'search()' to carry out a
  case-insensitive search, so the pattern matches both "apple" and
  "Apple".

import re

def ignore_case_search_example():
    text = "I have an Apple and a banana."
    pattern = r"apple"
    # re.IGNORECASE makes "apple" match "Apple" as well
    result = re.search(pattern, text, re.IGNORECASE)
    if result:
        print("Pattern found!")
    else:
        print("Pattern not found.")

# Example usage
ignore_case_search_example()

Extracting a Substring using Groups

Regular expressions also make it possible to extract substrings from
matched patterns through groups. Parentheses '()' define groups within the
pattern. Let's illustrate this by extracting the domain name from an email
address using the 'search()' function.

Example
• In this example, the input string 'email' holds an email address at the
  domain example.com (the local part "user" is a placeholder). The
  regular expression pattern r"@(.+)$" extracts the domain name from the
  email address.
• The '@' symbol matches the "@" character in the email address.
• The parentheses '()' create a group that captures the domain name.
• The '.+' part of the pattern matches one or more characters
  (excluding a newline).
• The '$' symbol represents the end of the string.
• Once 're.search()' finds a match, it returns a match object. We then
  call 'group(1)' on the match object to extract the content of the first
  (and only) group, which is the domain name.

import re

def extract_domain_example():
    # "user" is a placeholder local part; the domain matches the output below
    email = "user@example.com"
    pattern = r"@(.+)$"
    result = re.search(pattern, email)
    if result:
        # group(1) holds the text captured by the parentheses
        domain = result.group(1)
        print(f"Domain: {domain}")
    else:
        print("Pattern not found.")

# Example usage
extract_domain_example()

Output
Domain: example.com
Finding Multiple Occurrences of a Pattern

The 'search()' function returns only the first occurrence of a pattern
within a string, so it cannot report all occurrences. For that purpose,
the 're' module offers the 'findall()' function. Let's identify all
occurrences of the word "apple" in a given text.

Example
• In this example, the input string 'text' holds the phrase "I have an
  apple, and she has an apple too." The regular expression pattern
  r"apple" remains unchanged.
• Calling 're.findall()' with the pattern and 'text' as arguments returns
  a list containing all occurrences of the pattern in the text. If no
  match is found, an empty list is returned.
• The code checks the result and, if occurrences are found, prints the
  list of occurrences.

import re

def find_all_occurrences_example():
    text = "I have an apple, and she has an apple too."
    pattern = r"apple"
    # findall() returns every non-overlapping match as a list of strings
    results = re.findall(pattern, text)
    if results:
        print(f"Occurrences of 'apple': {results}")
    else:
        print("Pattern not found.")

# Example usage
find_all_occurrences_example()

Output
Occurrences of 'apple': ['apple', 'apple']

Using the Dot Metacharacter

The dot '.' in regular expressions is a metacharacter that matches any
character except a newline. The following example uses the dot to look
for three-character sequences in a given text.

Example
• In this example, the input string 'text' holds the phrase "The cat ran
  on the mat." The regular expression pattern r"\b...\b" looks for
  three-character sequences that begin and end at a word boundary.
• The '\b' represents a word boundary.
• The '...' matches any three characters except a newline; they need not
  be letters.
• 're.findall()' returns a list of all matches. If no match is found, an
  empty list is returned.
• Because the dot also matches a space, the result includes ' on' (a
  space followed by "on") alongside the three-letter words.

import re

def dot_metacharacter_example():
    text = "The cat ran on the mat."
    # Each dot matches any single character, including a space
    pattern = r"\b...\b"
    results = re.findall(pattern, text)
    if results:
        print(f"Three-letter words: {results}")
    else:
        print("Pattern not found.")

# Example usage
dot_metacharacter_example()

Output
Three-letter words: ['The', 'cat', 'ran', ' on', 'the', 'mat']
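
The ' on' entry in the output appears because each dot also matches a
space. A minimal sketch of one way to keep only genuine three-letter
words, assuming \w{3} is substituted for the three dots:

```python
import re

text = "The cat ran on the mat."

# Three dots: any three characters between word boundaries, spaces included.
with_dots = re.findall(r"\b...\b", text)

# \w{3}: exactly three word characters, so " on" no longer qualifies.
words_only = re.findall(r"\b\w{3}\b", text)

print(with_dots)   # ['The', 'cat', 'ran', ' on', 'the', 'mat']
print(words_only)  # ['The', 'cat', 'ran', 'the', 'mat']
```

Since \w also matches digits and the underscore, [A-Za-z]{3} would be the
stricter choice when only letters should count.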

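The sub() Function (Search and Replace)

The sub() function replaces every match of the pattern with the text of
your choice and returns the resulting string:

Example

Replace every white-space character with the number 9:

import re

txt = "The rain in Spain"
# A raw string keeps the backslash in \s from being treated as a string escape
x = re.sub(r"\s", "9", txt)
print(x)

O/p: The9rain9in9Spain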
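
re.sub() also accepts a count argument and group backreferences in the
replacement text, and re.subn() additionally reports how many replacements
were made. A brief sketch with illustrative strings and variable names:

```python
import re

txt = "The rain in Spain"

# count limits how many matches are replaced (here, only the first two).
first_two = re.sub(r"\s", "9", txt, count=2)

# \1 and \2 in the replacement refer to the captured groups.
swapped = re.sub(r"(\w+) (\w+)", r"\2 \1", "hello world")

# subn() returns the new string together with the number of replacements.
replaced, how_many = re.subn(r"\s", "9", txt)

print(first_two)           # The9rain9in Spain
print(swapped)             # world hello
print(replaced, how_many)  # The9rain9in9Spain 3
```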
Regular Expression Modifiers

Regular expression functions accept optional modifiers that control
various aspects of matching. The modifiers are passed as the optional
flags argument. You can combine multiple modifiers using the bitwise OR
operator (|), for example re.I | re.M.

Sr.No. Modifier & Description

1   re.I
    Performs case-insensitive matching.

2   re.L
    Interprets words according to the current locale. This
    interpretation affects the alphabetic group (\w and \W), as
    well as word boundary behavior (\b and \B). In Python 3 it can
    be used only with bytes patterns.

3   re.M
    Makes $ match the end of a line (not just the end of the
    string) and makes ^ match the start of any line (not just the
    start of the string).

4   re.S
    Makes a period (dot) match any character, including a
    newline.

5   re.U
    Interprets letters according to the Unicode character set. This
    flag affects the behavior of \w, \W, \b, \B. In Python 3 it is
    the default for str patterns.

6   re.X
    Permits more readable ("verbose") regular expression syntax. It
    ignores whitespace (except inside a set [] or when escaped by a
    backslash) and treats unescaped # as a comment marker.
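
A small sketch of combining several modifiers with the | operator. The
pattern, text, and variable names are illustrative assumptions.

```python
import re

# re.X allows the pattern to be spread over lines with comments.
pattern = r"""
    ^(\w+)    # first word of a line (captured)
    \s+is\b   # followed by the word 'is'
"""

text = "Python is fun.\nRUBY IS too."

# re.I ignores case, re.M lets ^ match at each line start, re.X enables
# the verbose layout used above.
matches = re.findall(pattern, text, re.I | re.M | re.X)

print(matches)  # ['Python', 'RUBY']
```

Without re.M only the first line would match, and without re.I the
uppercase "IS" on the second line would be missed.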

Regular Expression Patterns


The following table lists regular expression syntax commonly described
for Python; entries that Python's re module does not support are marked −

S.No. Pattern & Description

1   ^
    Matches the beginning of the string; with re.M, also the
    beginning of each line.

2   $
    Matches the end of the string (or just before a trailing
    newline); with re.M, also the end of each line.

3   .
    Matches any single character except newline. Using the s option
    (re.S) allows it to match newline as well.

4   [...]
    Matches any single character in brackets.

5   [^...]
    Matches any single character not in brackets.

6   re*
    Matches 0 or more occurrences of preceding expression.

7   re+
    Matches 1 or more occurrences of preceding expression.

8   re?
    Matches 0 or 1 occurrence of preceding expression.

9   re{n}
    Matches exactly n occurrences of preceding expression.

10  re{n,}
    Matches n or more occurrences of preceding expression.

11  re{n,m}
    Matches at least n and at most m occurrences of preceding
    expression.

12  a|b
    Matches either a or b.

13  (re)
    Groups regular expressions and remembers matched text.

14  (?imx)
    Turns on the i, m, or x options for the regular expression. In
    Python this form must appear at the start of the pattern and
    applies to the whole pattern.

15  (?-imx)
    Turns off the i, m, or x options. Python supports this only in
    the scoped form (?-imx:re) listed in row 18.

16  (?:re)
    Groups regular expressions without remembering matched text.

17  (?imx:re)
    Turns on the i, m, or x options within the parentheses only.

18  (?-imx:re)
    Turns off the i, m, or x options within the parentheses only.

19  (?#...)
    Comment.

20  (?=re)
    Positive lookahead. Asserts that re matches at this position
    without consuming any characters.

21  (?!re)
    Negative lookahead. Asserts that re does not match at this
    position, without consuming any characters.

22  (?>re)
    Matches independent pattern without backtracking (atomic
    group). Supported by the re module from Python 3.11.

23  \w
    Matches word characters.

24  \W
    Matches nonword characters.

25  \s
    Matches whitespace. Equivalent to [ \t\n\r\f\v].

26  \S
    Matches nonwhitespace.

27  \d
    Matches digits. Equivalent to [0-9].

28  \D
    Matches nondigits.

29  \A
    Matches beginning of string.

30  \Z
    Matches end of string. In Python it matches only at the very
    end of the string, not before a trailing newline.

31  \z
    Matches end of string in Perl-style engines. Python's re module
    has traditionally not recognized \z; use \Z instead.

32  \G
    Matches point where last match finished. Not supported by
    Python's re module.

33  \b
    Matches word boundaries when outside brackets. Matches
    backspace (0x08) when inside brackets.

34  \B
    Matches nonword boundaries.

35  \n, \t, etc.
    Matches newlines, carriage returns, tabs, etc.

36  \1...\9
    Matches nth grouped subexpression.

37  \10
    Matches nth grouped subexpression if it matched already.
    Otherwise refers to the octal representation of a character
    code.
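
A few of the patterns from the table can be tried out directly. The sketch
below uses made-up sample strings; the variable names are illustrative.

```python
import re

# re{n,m}: runs of two to three digits.
counted = re.findall(r"\d{2,3}", "7 42 12345")

# a|b inside a non-capturing group (?:re), followed by an optional "s".
animals = re.findall(r"(?:cat|dog)s?", "cats and dog")

# (?=re): words that are immediately followed by "!" (the "!" is not consumed).
shouted = re.findall(r"\w+(?=!)", "hey! you there!")

# \1: a word followed by a space and the same word again.
doubled = re.search(r"\b(\w+) \1\b", "it is is fine").group()

print(counted)  # ['42', '123', '45']
print(animals)  # ['cats', 'dog']
print(shouted)  # ['hey', 'there']
print(doubled)  # is is
```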

compile() Function

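Python's built-in compile() function returns the specified source as a
code object, ready to be executed. It compiles Python source code and is
distinct from re.compile(), which compiles a regular expression pattern
into a pattern object.

Syntax
compile(source, filename, mode, flags=0, dont_inherit=False, optimize=-1)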

Parameter Values

Parameter      Description

source         Required. The source to compile; can be a string, a bytes
               object, or an AST object

filename       Required. The name of the file that the source comes from.
               If the source does not come from a file, you can write
               whatever you like

mode           Required. Legal values:
               eval - if the source is a single expression
               exec - if the source is a block of statements
               single - if the source is a single interactive statement

flags          Optional. How to compile the source. Default 0

dont_inherit   Optional. How to compile the source. Default False

optimize       Optional. Defines the optimization level of the compiler.
               Default -1

Example

Compile text as code, and then execute it:

x = compile('print(55)', 'test', 'eval')
exec(x)

55

Compile more than one statement, and then execute it:

x = compile('print(55)\nprint(88)', 'test', 'exec')
exec(x)

55
88

Why and when to use re.compile()


Performance improvement

Compiling a regular expression into a pattern object is useful and
efficient when the expression will be used several times in a single
program, because the pattern is parsed only once.

Readability

Another benefit is readability. Using re.compile() you can separate the
definition of the regex from its use.

Python internally compiles and caches regexes whenever you use them
(including calls to search() or match()), so by calling compile() you are
mainly changing when the regex gets compiled. Compiling a regex explicitly
is still useful in the following situations.

• It signals that the compiled regular expression will be used often and
  is meant to be kept around.
• By compiling once and reusing the same pattern object, we reduce the
  possibility of typos from retyping the pattern.
• When you use many different regexes, keeping compiled objects for those
  that are used repeatedly ensures they are not flushed out of the regex
  cache when the cache is full.
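
A minimal sketch of compiling a pattern once with re.compile(pattern,
flags=0) and reusing the resulting pattern object. The pattern, the sample
lines, and the variable names are illustrative assumptions.

```python
import re

# Define the regex once, separately from where it is used.
word_before_is = re.compile(r"\b(\w+) is\b", re.IGNORECASE)

lines = ["Python is fun", "Nothing here", "Regex IS handy"]

# The pattern object offers the same methods as the module-level functions
# (search, match, findall, sub), without passing the pattern each time.
subjects = []
for line in lines:
    m = word_before_is.search(line)
    if m:
        subjects.append(m.group(1))

print(subjects)                # ['Python', 'Regex']
print(word_before_is.pattern)  # the original pattern string
```

Keeping the compiled object in a variable means every call reuses the same
pattern text and flags.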
