Pattern matching in Python with Regex

What is Regular Expression?

In the real world, string parsing in most programming languages is handled by regular expression. Regular expression in a python programming language is a method used for matching text pattern.

The “re” module which comes with every python installation provides regular expression support.

In python, a regular expression search is typically written as:

match = re.search(pattern, string)

The re.search() method takes two arguments, a regular expression pattern and a string and searches for that pattern within the string. If the pattern is found within the string, search() returns a match object or None otherwise. So in a regular expression, given a string, determine whether that string matches a given pattern, and, optionally, collect substrings that contain relevant information. A regular expression can be used to answer questions like −

Is this string a valid URL?
Which users in /etc/passwd are in a given group?
What is the date and time of all warning messages in a log file?
What username and document were requested by the URL a visitor typed?

Matching patterns

Regular expressions are complicated mini-language. They rely on special characters to match unknown strings, but let's start with literal characters, such as letters, numbers, and the space character, which always match themselves. Let's see a basic example:

#Need module 're' for regular expression
import re
#
search_string = "TutorialsPoint"
pattern = "Tutorials"
match = re.match(pattern, search_string)
#If-statement after search() tests if it succeeded
if match:
   print("regex matches: ", match.group())
else:
   print('pattern not found')

Result

regex matches: Tutorials

Matching a string

The “re” module of python has numerous method, and to test whether a particular regular expression matches a specific string, you can use re.search(). The re.MatchObject provides additional information like which part of the string the match was found.

Syntax

matchObject = re.search(pattern, input_string, flags=0)

Example

#Need module 're' for regular expression
import re
# Lets use a regular expression to match a date string.
regex = r"([a-zA-Z]+) (\d+)"
if re.search(regex, "Jan 2"):
   match = re.search(regex, "Jan 2")
   # This will print [0, 5), since it matches at the beginning and end of the
   # string
   print("Match at index %s, %s" % (match.start(), match.end()))
   # The groups contain the matched values. In particular:
   # match.group(0) always returns the fully matched string
   # match.group(1), match.group(2), ... will return the capture
   # groups in order from left to right in the input string  
   # match.group() is equivalent to match.group(0)
   # So this will print "Jan 2"
   print("Full match: %s" % (match.group(0)))
   # So this will print "Jan"
   print("Month: %s" % (match.group(1)))
   # So this will print "2"
   print("Day: %s" % (match.group(2)))
else:
   # If re.search() does not match, then None is returned
   print("Pattern not Found! ")

Result

Match at index 0, 5
Full match: Jan 2
Month: Jan
Day: 2

As the above method stops after the first match, so is better suited for testing a regular expression than extracting data.

Capturing Groups

If the pattern includes two or more parenthesis, then the end result will be a tuple instead of a list of string, with the help of parenthesis() group mechanism and finall(). Each pattern matched is represented by a tuple and each tuple contains group(1), group(2).. data.

import re
regex = r'([\w\.-]+)@([\w\.-]+)'
str = ('hello [email protected], [email protected], hello [email protected]')
matches = re.findall(regex, str)
print(matches)
for tuple in matches:
   print("Username: ",tuple[0]) #username
   print("Host: ",tuple[1]) #host

Result

[('john', 'hotmail.com'), ('hello', 'Tutorialspoint.com'), ('python', 'gmail.com')]
Username: john
Host: hotmail.com
Username: hello
Host: Tutorialspoint.com
Username: python
Host: gmail.com

Finding and replacing string

Another common task is to search for all the instances of the pattern in the given string and replace them, the re.sub(pattern, replacement, string) will exactly do that. For example to replace all instances of an old email domain

Code

# requid library
import re
#given string
str = ('hello [email protected], [email protected], hello [email protected], Hello World!')
#pattern to match
pattern = r'([\w\.-]+)@([\w\.-]+)'
#replace the matched pattern from string with,
replace = r'\[email protected]'
   ## re.sub(pat, replacement, str) -- returns new string with all replacements,
   ## \1 is group(1), \2 group(2) in the replacement
print (re.sub(pattern, replace, str))

Result

hello [email protected], [email protected], hello [email protected], Hello World!

Re options flags

In the python regular expression like above, we can use different options to modify the behavior of the pattern match. These extra arguments, optional flag is added to the search() or findall() etc. function, for example re.search(pattern, string, re.IGNORECASE).

IGNORECASE −
As the name indicates, it makes the pattern case insensitive(upper/lowercase), with this, strings containing ‘a’ and ‘A’ both matches.
DOTALL
The re.DOTALL allows dot(.) metacharacter to match all character including newline (\n).
MULTILINE
The re.MULTILINE allows matching the start(^) and end($) of each line of a string. However, generally, ^ and & would just match the start and end of the whole string.