Regular Expression (RegEx) is a powerful tool used to search, match, validate, extract or modify text based on specific patterns. In Python, the built-in re module provides support for using RegEx. It allows you to define patterns using special characters like \d for digits, ^ for the beginning of a string and many more.
Python
import re
txt = 'GeeksforGeeks: A computer science portal for geeks'
match = re.search(r'portal', txt)
if match:
    print(match.group())
    print("Start:", match.start(), "End:", match.end())
else:
    print("No match")
Output
portal
Start: 34 End: 40
Explanation: re.search(r'portal', txt) finds the first occurrence of "portal" in the string. match.group() returns "portal", start() is 34 and end() is 40 (spaces included in count).
Why use RegEx?
Regular expressions are widely used in various fields involving text manipulation and analysis. Below are some common use cases:
Use Case | Description |
---|
Data Mining | Quickly extract emails, phone numbers, URLs, etc. from large text blocks. |
Validation | Validate user inputs like email addresses, passwords, dates, etc. |
Text Processing | Replace or reformat strings to match required formats (e.g., date reformatting). |
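As a rough sketch of the three use cases in the table, the snippet below extracts, validates, and reformats text. The sample string and the simplified EMAIL pattern are assumptions made for illustration; the pattern is far too loose for production email validation.

```python
import re

# Hypothetical sample text, invented for this illustration.
text = "Contact alice@example.com or bob@example.org before 2020-08-26."

# Simplified email pattern (an assumption, not a complete validator).
EMAIL = r"[\w.+-]+@[\w-]+\.[\w.]+"

# Data mining: extract every email-like substring.
emails = re.findall(EMAIL, text)

# Validation: fullmatch succeeds only if the whole input fits the pattern.
is_valid = re.fullmatch(EMAIL, "alice@example.com") is not None
is_rejected = re.fullmatch(EMAIL, "not an email") is None

# Text processing: reformat yyyy-mm-dd as dd/mm/yyyy.
reformatted = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", text)

print(emails)
print(is_valid, is_rejected)
print(reformatted)
```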
Metacharacters are characters that carry a special meaning inside a pattern. They are the building blocks of the patterns passed to the functions of the re module. Below is the list of metacharacters.
MetaCharacters | Description |
---|
\ | Used to drop the special meaning of character following it |
[] | Represent a character class |
^ | Matches the beginning |
$ | Matches the end |
. | Matches any character except newline |
| | Means OR (matches any one of the patterns separated by it) |
? | Matches zero or one occurrence |
* | Any number of occurrences (including 0 occurrences) |
+ | One or more occurrences |
{} | Indicate the number of occurrences of a preceding RegEx to match. |
() | Enclose a group of RegEx |
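A few of the metacharacters from the table can be tried out as follows. This is only a small sketch; the sample strings are made up for the illustration.

```python
import re

# \ drops the special meaning of ".", so only literal dots match.
dots = re.findall(r"\.", "a.b.c")

# | means OR: either alternative may match.
pets = re.findall(r"cat|dog", "cat, dog, bird")

# ^ and $ anchor the pattern to the start and end of the string.
whole = re.search(r"^geeks$", "geeks") is not None
partial = re.search(r"^geeks$", "geeks for geeks") is not None

# () groups "ab" and {2} requires exactly two repetitions of that group.
repeated = re.search(r"(ab){2}", "xababx").group()

print(dots, pets, whole, partial, repeated)
```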
group(), start() and end() methods are commonly used to access matched substrings and their positions.
Special Sequences
Special sequences in Python RegEx begin with a backslash (\) and are used to match specific character types or positions in a string. They simplify complex patterns and enhance readability.
Special Sequence | Description | Example Pattern | Example Matches |
---|
\A | Matches only at the start of the string | \Afor | for geeks, for the world |
\b | Matches at a word boundary. \b(string) checks that a word begins with the given string, and (string)\b checks that a word ends with it | \bge | geeks, get |
\B | The opposite of \b: matches only at a position that is not a word boundary, so \Bge matches ge only when it does not start a word | \Bge | together, forge |
\d | Matches any decimal digit; equivalent to the class [0-9] for ASCII text | \d | 123, gee1 |
\D | Matches any non-digit character; equivalent to the class [^0-9] for ASCII text | \D | geeks, geek1 |
\s | Matches any whitespace character | \s | gee ks, a bc a |
\S | Matches any non-whitespace character | \S | a bd, abcd |
\w | Matches any word character (letter, digit, or underscore); equivalent to the class [a-zA-Z0-9_] for ASCII text | \w | 123, geeKs4 |
\W | Matches any non-word character | \W | >$, gee<> |
\Z | Matches only at the end of the string | ab\Z | abcdab, abababab |
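The special sequences above can be checked against a single sample string. The sketch below uses a sentence invented for this purpose and combines each sequence with \w* or + so that whole words are returned.

```python
import re

# Hypothetical sample sentence for this illustration.
text = "for geeks, together we forge 123"

# \A: the string must start with "for".
starts_with_for = re.search(r"\Afor", text) is not None

# \b: words that begin with "ge".
ge_at_start = re.findall(r"\bge\w*", text)

# \B: words that contain "ge" somewhere other than the start.
ge_inside = re.findall(r"\w*\Bge\w*", text)

# \d: runs of digits.
digits = re.findall(r"\d+", text)

# \Z: the string must end with "123".
ends_with_123 = re.search(r"123\Z", text) is not None

print(starts_with_for, ge_at_start, ge_inside, digits, ends_with_123)
```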
Basic RegEx Patterns
Let's understand some of the basic regular expressions. They are as follows:
1. Character Classes
Character classes allow matching any one character from a specified set. They are enclosed in square brackets [].
Python
import re
print(re.findall(r'[Gg]eeks', 'GeeksforGeeks: \
A computer science portal for geeks'))
Output
['Geeks', 'Geeks', 'geeks']
2. Ranges
In RegEx, a range allows matching characters or digits within a span using - inside []. For example, [0-9] matches digits, [A-Z] matches uppercase letters.
Python
import re
print('Range',re.search(r'[a-zA-Z]', 'x'))
Output
Range <re.Match object; span=(0, 1), match='x'>
3. Negation
Negation in a character class is specified by placing a ^ at the beginning of the brackets, meaning match anything except those characters.
Syntax:
[^a-z]
Example:
Python
import re
print(re.search(r'[^a-z]', 'c'))
print(re.search(r'G[^e]', 'Geeks'))
Output
None
None
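Both calls above return None: 'c' contains no character outside a-z, and in 'Geeks' the G is followed by e. For contrast, here is a small sketch of negated classes that do match. The sample strings are assumptions chosen for the illustration.

```python
import re

# Runs of characters that are NOT digits.
non_digits = re.findall(r"[^0-9]+", "A-118, Sector-136")

# G followed by any character other than e: matches "Go", skips "Ge".
g_not_e = re.search(r"G[^e]", "Go Geeks").group()

# ^ negates only when it is the first character inside the brackets;
# elsewhere in the class it is a literal caret.
literal_caret = re.findall(r"[a^z]", "a^b")

print(non_digits, g_not_e, literal_caret)
```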
3. Shortcuts
Shortcuts are shorthand representations for common character classes. Let's discuss some of the shortcuts provided by the regular expression engine.
- \w - matches a word character
- \d - matches digit character
- \s - matches whitespace character (space, tab, newline, etc.)
- \b - matches a word boundary (a zero-width position between a word character and a non-word character, not a character itself)
Python
import re
print('Geeks:', re.search(r'\bGeeks\b', 'Geeks'))
print('GeeksforGeeks:', re.search(r'\bGeeks\b', 'GeeksforGeeks'))
Output
Geeks: <re.Match object; span=(0, 5), match='Geeks'>
GeeksforGeeks: None
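The other shortcuts can be exercised in the same way. The following sketch uses a made-up string containing a tab, and re.split to break the text on whitespace.

```python
import re

# Hypothetical sample string; "\t" is a tab character.
text = "Order 66 shipped\ton 2020-08-26"

# \w+ : runs of word characters (letters, digits, underscore).
words = re.findall(r"\w+", text)

# \d+ : runs of digits.
numbers = re.findall(r"\d+", text)

# \s+ : runs of whitespace, used here as the separator for re.split.
fields = re.split(r"\s+", text)

# \b : "on" as a whole word only.
whole_word = re.findall(r"\bon\b", text)

print(words)
print(numbers)
print(fields)
print(whole_word)
```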
4. Beginning and End of String
The ^ character anchors a match to the beginning of a string and the $ character anchors it to the end of a string.
Python
import re
# Beginning of String
match = re.search(r'^Geek', 'Campus Geek of the month')
print('Beg. of String:', match)
match = re.search(r'^Geek', 'Geek of the month')
print('Beg. of String:', match)
# End of String
match = re.search(r'Geeks$', 'Computer science portal-GeeksforGeeks')
print('End of String:', match)
Output
Beg. of String: None
Beg. of String: <re.Match object; span=(0, 4), match='Geek'>
End of String: <re.Match object; span=(32, 37), match='Geeks'>
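By default, ^ and $ refer to the whole string. As a brief sketch, the re.MULTILINE flag makes them also match at the start and end of each line. The three-line sample text is an assumption for the illustration.

```python
import re

# Hypothetical three-line sample text.
text = "Geek one\nGeek two\nnot a Geek"

# Without a flag, ^ matches only at the very start of the string.
default_starts = re.findall(r"^Geek", text)

# With re.MULTILINE, ^ also matches right after each newline.
line_starts = re.findall(r"^Geek", text, flags=re.MULTILINE)

# With re.MULTILINE, $ matches at the end of each line; only the
# last line ends with "Geek".
line_ends = re.findall(r"Geek$", text, flags=re.MULTILINE)

print(default_starts, line_starts, line_ends)
```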
5. Any Character
Outside a bracketed character class, the . character matches any single character except a newline.
Python
import re
print('Any Character', re.search(r'p.th.n', 'python 3'))
Output
Any Character <re.Match object; span=(0, 6), match='python'>
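The newline exception and the behaviour inside brackets can be seen in a short sketch. It uses the re.DOTALL flag, which lets the dot match a newline as well; the sample strings are made up.

```python
import re

# By default the dot does not match a newline.
no_newline = re.search(r"a.b", "a\nb")

# With re.DOTALL the dot matches any character, including "\n".
with_dotall = re.search(r"a.b", "a\nb", flags=re.DOTALL)

# Inside a character class the dot is a literal ".".
literal_hit = re.search(r"a[.]b", "a.b")
literal_miss = re.search(r"a[.]b", "axb")

print(no_newline, with_dotall, literal_hit, literal_miss)
```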
6. Optional Characters
The regular expression engine lets you mark a character as optional using the ? character. The preceding character or character class may then appear once or not at all. Let's consider the example of a word with an alternative spelling: color or colour.
Python
import re
print('Color',re.search(r'colou?r', 'color'))
print('Colour',re.search(r'colou?r', 'colour'))
Output
Color <re.Match object; span=(0, 5), match='color'>
Colour <re.Match object; span=(0, 6), match='colour'>
7. Repetition
Repetition lets you match the same character or character class a fixed number of times. Consider a date that consists of a day, month, and year. Let's use a regular expression to identify a date in dd-mm-yyyy format.
Python
import re
print('Date{dd-mm-yyyy}:', re.search(r'[\d]{2}-[\d]{2}-[\d]{4}',
                                   '18-08-2020'))
Output
Date{dd-mm-yyyy}: <re.Match object; span=(0, 10), match='18-08-2020'>
Here, the regular expression engine checks for two consecutive digits. After finding them, it moves on to the hyphen character. It then checks for the next two consecutive digits, and the process repeats for the rest of the pattern.
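Note that a pattern like this checks only the shape of the text, not whether the date exists. One possible way to combine both checks is sketched below. The helper name is_real_date is hypothetical, and the format string assumes the day comes first, as in the example value 18-08-2020.

```python
import re
from datetime import datetime

# Same shape as the pattern above: two digits, two digits, four digits.
DATE_SHAPE = re.compile(r"\d{2}-\d{2}-\d{4}")

def is_real_date(value):
    """Hypothetical helper: shape check first, then a calendar check."""
    if DATE_SHAPE.fullmatch(value) is None:
        return False
    try:
        # Assumes day-month-year order.
        datetime.strptime(value, "%d-%m-%Y")
    except ValueError:
        return False
    return True

print(is_real_date("18-08-2020"))
print(is_real_date("99-99-9999"))  # right shape, impossible date
print(is_real_date("18-08-20"))    # wrong shape
```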
Let's discuss three other regular expressions under repetition.
7.1 Repetition ranges
The repetition range is useful when you have to accept one or more formats. Consider a scenario where both three digits, as well as four digits, are accepted. Let's have a look at the regular expression.
Python
import re
print('Three Digit:', re.search(r'[\d]{3,4}', '189'))
print('Four Digit:', re.search(r'[\d]{3,4}', '2145'))
Output
Three Digit: <re.Match object; span=(0, 3), match='189'>
Four Digit: <re.Match object; span=(0, 4), match='2145'>
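Two details are worth checking here: the range is greedy, and re.search will happily find three or four digits inside a longer number. A minimal sketch, using re.fullmatch when the whole input must be three or four digits:

```python
import re

# Greedy: with six digits available, {3,4} takes four of them.
greedy = re.search(r"\d{3,4}", "123456").group()

# search finds a match inside a longer run of digits...
inside_longer = re.search(r"\d{3,4}", "12345") is not None

# ...while fullmatch requires the entire string to fit the pattern.
exact_five = re.fullmatch(r"\d{3,4}", "12345")
exact_three = re.fullmatch(r"\d{3,4}", "189")

print(greedy, inside_longer, exact_five, exact_three)
```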
7.2 Open-Ended Ranges
There are scenarios where there is no limit on how many times a character may repeat. In such cases, you can leave the upper limit open by omitting it, as in {1,}. A common example is matching the numbers in street addresses. Let's have a look.
Python
import re
print(re.search(r'[\d]{1,}','5th Floor, A-118,\
Sector-136, Noida, Uttar Pradesh - 201305'))
Output
<re.Match object; span=(0, 1), match='5'>
7.3 Shorthand
Shorthand characters allow you to use the + character to specify one or more occurrences ({1,}) and the * character to specify zero or more occurrences ({0,}).
Python
import re
print(re.search(r'[\d]+', '5th Floor, A-118,\
Sector-136, Noida, Uttar Pradesh - 201305'))
Output
<re.Match object; span=(0, 1), match='5'>
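re.search stops at the first match, which is why only '5' is reported. As a short sketch, re.findall returns every run of digits in the same address, and a small made-up string shows the difference between * and +.

```python
import re

address = "5th Floor, A-118, Sector-136, Noida, Uttar Pradesh - 201305"

# + : one or more digits, every occurrence in the string.
all_numbers = re.findall(r"\d+", address)

# * allows zero occurrences of "b"; + requires at least one.
zero_or_more = re.findall(r"ab*", "a ab abb")
one_or_more = re.findall(r"ab+", "a ab abb")

print(all_numbers)
print(zero_or_more, one_or_more)
```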
8. Grouping
Grouping is the process of separating an expression into groups by using parentheses, and it allows you to fetch each individual matching group.
Python
import re
grp = re.search(r'([\d]{2})-([\d]{2})-([\d]{4})', '26-08-2020')
print(grp)
Output
<re.Match object; span=(0, 10), match='26-08-2020'>
Let's see some of its functionality.
8.1 Return the entire match
The re module allows you to return the entire match using the group() method.
Python
import re
grp = re.search(r'([\d]{2})-([\d]{2})-([\d]{4})','26-08-2020')
print(grp.group())
Output
26-08-2020
8.2 Return a tuple of matched groups
You can use the groups() method to return a tuple that holds the individual matched groups.
Python
import re
grp = re.search(r'([\d]{2})-([\d]{2})-([\d]{4})','26-08-2020')
print(grp.groups())
Output
('26', '08', '2020')
8.3 Retrieve a single group
Upon passing the index to a group method, you can retrieve just a single group.
Python
import re
grp = re.search(r'([\d]{2})-([\d]{2})-([\d]{4})','26-08-2020')
print(grp.group(3))
Output
2020
8.4 Name your groups
The re module allows you to name your groups. Let's look into the syntax.
Python
import re
match = re.search(r'(?P<dd>[\d]{2})-(?P<mm>[\d]{2})-(?P<yyyy>[\d]{4})',
'26-08-2020')
print(match.group('mm'))
Output
08
8.5 Individual match as a dictionary
We have seen how a match object provides a tuple of individual groups. It can also return the named groups as a dictionary, in which the name of each group acts as the dictionary key.
Python
import re
match = re.search(r'(?P<dd>[\d]{2})-(?P<mm>[\d]{2})-(?P<yyyy>[\d]{4})',
'26-08-2020')
print(match.groupdict())
Output
{'dd': '26', 'mm': '08', 'yyyy': '2020'}
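Because the groups come back as strings, a typical next step is to convert or rearrange them. The sketch below, which uses a sample sentence invented for the illustration, turns the dictionary values into integers and rebuilds the date in yyyy-mm-dd order.

```python
import re

# Hypothetical sample sentence containing a dd-mm-yyyy date.
text = "Released on 26-08-2020"
match = re.search(r"(?P<dd>\d{2})-(?P<mm>\d{2})-(?P<yyyy>\d{4})", text)

# Convert each named group from str to int.
parts = {name: int(value) for name, value in match.groupdict().items()}

# Rearrange the original strings into yyyy-mm-dd order.
iso = "{yyyy}-{mm}-{dd}".format(**match.groupdict())

# Group names also work with start(), end() and span().
year_span = match.span("yyyy")

print(parts, iso, year_span)
```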
9. Lookahead
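A negated character class still has to consume a character, so it cannot match when there is no character left to check. For example, n[^e] fails on 'Python' because nothing follows the n. We can overcome this by using a lookahead, which accepts or rejects a match based on the presence or absence of the content that follows, without consuming it. The negative lookahead (?!e) succeeds whenever the next character is not e, including at the end of the string.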
Python
import re
print('negation:', re.search(r'n[^e]', 'Python'))
print('lookahead:', re.search(r'n(?!e)', 'Python'))
Output
negation: None
lookahead: <re.Match object; span=(5, 6), match='n'>
A lookahead can also require that the match is followed by a particular character and reject it otherwise. This is called a positive lookahead, and it is written by replacing the ! character with the = character.
Python
import re
print('positive lookahead', re.search(r'n(?=e)', 'jasmine'))
Output
positive lookahead <re.Match object; span=(5, 6), match='n'>
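Since a lookahead checks the following text without including it in the match, it is handy for picking out values by their context. The sketch below uses an invented price string and an invented password rule (at least 8 characters, one digit, one lowercase letter); both are assumptions for the illustration.

```python
import re

prices = "cost 100 USD, 200 EUR"

# Positive lookahead: numbers followed by " USD"; the unit is not captured.
usd = re.findall(r"\d+(?= USD)", prices)

# Negative lookahead: whole numbers NOT followed by " USD".
not_usd = re.findall(r"\b\d+\b(?! USD)", prices)

# Hypothetical password rule: each lookahead scans ahead from the start
# for one required kind of character, then .{8,} checks the length.
RULE = re.compile(r"(?=.*\d)(?=.*[a-z]).{8,}")
ok = RULE.fullmatch("geeks4geeks") is not None
no_digit = RULE.fullmatch("geeksgeeks") is not None
too_short = RULE.fullmatch("g4") is not None

print(usd, not_usd, ok, no_digit, too_short)
```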
10. Substitution
The re.sub method replaces every match of a regular expression in a string and returns the modified string. It is useful when you want to strip characters such as /, -, ., etc. from a value before storing it in a database. It takes three required arguments:
- the regular expression
- the replacement string
- the source string being searched
Let's have a look at the code below, which removes the - characters from a credit card number.
Python
import re
print(re.sub(r'([\d]{4})-([\d]{4})-([\d]{4})-([\d]{4})',r'\1\2\3\4',
             '1111-2222-3333-4444'))
Output
1111222233334444
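re.sub also accepts an optional count argument and a function in place of the replacement string. The following sketch shows both on the same card number; the helper name mask is hypothetical.

```python
import re

card = "1111-2222-3333-4444"

# Simplest form: replace every hyphen with an empty string.
digits_only = re.sub(r"-", "", card)

# count=1 limits the substitution to the first match only.
first_only = re.sub(r"-", "", card, count=1)

def mask(match):
    """Hypothetical helper: replace the matched digits with asterisks."""
    return "*" * len(match.group())

# Mask every block of four digits that is followed by a hyphen,
# which leaves the final block visible.
masked = re.sub(r"\d{4}(?=-)", mask, card)

print(digits_only, first_only, masked)
```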
Compiled Regular Expression
In Python, the re.compile() function allows you to compile a regular expression pattern into a RegEx object. This compiled object can then be reused for multiple operations like search, match, sub, etc.
Python
import re
regex = re.compile(r'([\d]{2})-([\d]{2})-([\d]{4})')
# search method
print('compiled reg expr', regex.search('26-08-2020'))
# sub method
print(regex.sub(r'\1.\2.\3', '26-08-2020'))
Output
compiled reg expr <re.Match object; span=(0, 10), match='26-08-2020'>
26.08.2020
Explanation:
- re.compile(...) creates a reusable regular expression pattern to match dates in the DD-MM-YYYY format.
- regex.search() finds the date match and regex.sub(r'\1.\2.\3', ...) replaces '26-08-2020' with '26.08.2020'.
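A compiled pattern is most convenient when the same expression is applied to many strings. A minimal sketch, assuming a small list of made-up log lines:

```python
import re

# Compile once, reuse for every line.
DATE = re.compile(r"(\d{2})-(\d{2})-(\d{4})")

# Hypothetical input lines for this illustration.
lines = ["start 26-08-2020", "no date here", "end 01-09-2020"]

found = []
for line in lines:
    match = DATE.search(line)
    if match:  # search returns None when a line has no date
        found.append(match.group())

# sub leaves lines without a match unchanged.
dotted = [DATE.sub(r"\1.\2.\3", line) for line in lines]

print(found)
print(dotted)
print(DATE.pattern)  # the original pattern string is kept on the object
```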