RegEx in Python (4)

Uploaded by

Yash Verma

REGULAR EXPRESSIONS (REGEX) IN PYTHON:

Regular Expressions (RegEx) are a powerful tool for pattern matching and text manipulation. In Python, regex
functionality is implemented through the re module.

APPLICATIONS OF REGEX
● Data validation
● Data extraction
● Input sanitization (data cleaning)

This document explains regex basics, syntax, functions, and practical examples with improved clarity and structure.

What is a Regular Expression?


A Regular Expression is a sequence of characters that defines a search pattern. It can be used to match strings,
validate formats, or extract information.

COMMON USE CASES OF REGEX THAT ARE ALSO COVERED IN THIS ARTICLE WITH DETAILED EXPLANATION:

● Extracting email addresses


● Extracting timestamps from logs
● Extracting URLs
● Validating phone numbers or dates
● Searching for words or patterns in text
● Validating passwords

Regex Syntax in Python


To use regex, you define a pattern (a regular expression) made up of literal characters, special characters, and sequences that together describe what to look for in a text.
Here are the most common components of regex syntax:

1. SPECIAL CHARACTERS
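Character   Description
.           Matches any single character except a newline.
^           Matches the start of the string.
$           Matches the end of the string.
*           Matches 0 or more repetitions of the preceding element.
+           Matches 1 or more repetitions of the preceding element.
?           Matches 0 or 1 occurrence of the preceding element.
{n}         Matches exactly n occurrences of the preceding element.
{n,}        Matches n or more occurrences of the preceding element.
{n,m}       Matches between n and m occurrences (inclusive) of the preceding element.
\           Escapes a special character so that it is matched literally.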
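
As a rough illustration of the characters in the table above, the sketch below checks a few made-up strings with re.fullmatch, which succeeds only when the whole string fits the pattern. The helper name fits and all sample strings are assumptions chosen for this example.

```python
import re

# fullmatch() succeeds only if the entire string fits the pattern.
def fits(pattern, text):
    return re.fullmatch(pattern, text) is not None

checks = {
    "dot": fits(r"c.t", "cat"),             # '.' stands for any one character
    "star": fits(r"ab*", "a"),              # '*' allows zero 'b' characters
    "plus": fits(r"ab+", "a"),              # '+' needs at least one 'b'
    "optional": fits(r"colou?r", "color"),  # '?' makes the 'u' optional
    "exact": fits(r"\d{3}", "123"),         # exactly three digits
    "range": fits(r"\d{2,4}", "12345"),     # five digits exceed the 2-4 range
    "escaped": fits(r"3\.14", "3x14"),      # '\.' matches a literal dot only
}

# '^' and '$' anchor a search to the start and end of the string.
starts_with_hello = re.search(r"^Hello", "Hello world") is not None
ends_with_world = re.search(r"world$", "Hello world") is not None

print(checks)
print(starts_with_hello, ends_with_world)
```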

Created by: Anjali Garg | Data Scientist | Aspiring ML Engineer | https://www.linkedin.com/in/anjali-garg-2a7747222/


2. CHARACTER CLASSES
Syntax        Description
[arn]         Matches any one of the characters a, r, or n.
[a-n]         Matches any lowercase letter from a to n.
[^arn]        Matches any character except a, r, or n.
[0123]        Matches any one of the digits 0, 1, 2, or 3.
[0-9]         Matches any digit from 0 to 9.
[0-5][0-9]    Matches any two-digit number from 00 to 59.
[a-zA-Z]      Matches any letter, lowercase or uppercase.
[+]           Inside a set, most special characters lose their special meaning, so this matches a literal '+'. The exceptions are ^ (in first position), - (between two characters), ] and \.
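
The character classes above can be tried out with re.findall. This is a minimal sketch; the sample strings are invented for illustration.

```python
import re

text = "Train 42 arrives at 09:57 + 3 min"

in_set = re.findall(r"[arn]", "banana")       # any of a, r, n
not_in_set = re.findall(r"[^arn]", "banana")  # anything except a, r, n
two_digits = re.findall(r"[0-5][0-9]", text)  # two-digit values 00-59
plus_signs = re.findall(r"[+]", text)         # '+' is literal inside a set

print(in_set)      # ['a', 'n', 'a', 'n', 'a']
print(not_in_set)  # ['b']
print(two_digits)  # ['42', '09', '57']
print(plus_signs)  # ['+']
```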

3. PREDEFINED SEQUENCES
Sequence   Description
\A         Matches only at the start of the string.
\b         Matches at a word boundary, i.e. where the specified characters are at the beginning or at the end of a word.
\B         Matches where the specified characters are present, but NOT at the beginning or at the end of a word.
\d         Matches any digit (0-9).
\D         Matches any character that is not a digit.
\s         Matches any whitespace character.
\S         Matches any character that is not a whitespace character.
\w         Matches any word character, i.e. a-z, A-Z, 0-9, and underscore.
\W         Matches any character that is not a word character.
\Z         Matches only at the end of the string.
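
A short sketch of how these sequences behave in practice, using a made-up sample string. The difference between \b and \B is the easiest one to get wrong, so it is shown first.

```python
import re

text = "cat concatenate cat_2 42"

whole_word = re.findall(r"\bcat\b", text)   # 'cat' as a standalone word
inside_word = re.findall(r"\Bcat\B", text)  # 'cat' strictly inside a word
digits = re.findall(r"\d+", text)           # runs of digits
words = re.findall(r"\w+", text)            # runs of word characters
starts = re.search(r"\Acat", text) is not None  # anchored at string start
ends = re.search(r"42\Z", text) is not None     # anchored at string end

print(whole_word)   # ['cat'] (the first word only; 'cat_2' continues with '_')
print(inside_word)  # ['cat'] (the one inside 'concatenate')
print(digits)       # ['2', '42']
print(words)        # ['cat', 'concatenate', 'cat_2', '42']
```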

4. GROUPING AND CAPTURING

Parentheses () are used to group parts of a regex pattern and capture matches. Capturing groups save the matched
content for later use, while non-capturing groups allow grouping without saving the matched content.

CAPTURING GROUP
A capturing group matches the specified pattern and saves the matched content for reference. For example:

import re

pattern = r"(\d{3})-(\d{2})-(\d{4})"
text = "123-45-6789"
match = re.match(pattern, text)  # re.match anchors at the start of the string
print(match.groups()) # Output: ('123', '45', '6789')

NON-CAPTURING GROUP
A non-capturing group groups the pattern without saving the matched content. Use (?:...) to create a non-
capturing group. For example:

import re

pattern = r"(?:\d{3})-(\d{2})-(\d{4})"
text = "123-45-6789"
match = re.match(pattern, text)  # the first group is matched but not saved
print(match.groups()) # Output: ('45', '6789')
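
The choice between capturing and non-capturing groups also changes what re.findall returns, which matters when extracting data. The sketch below uses an invented sample string to show both behaviours, plus access to individual groups on a match object.

```python
import re

text = "IDs: 123-45-6789 and 987-65-4321"

# With capturing groups, findall() returns one tuple of groups per match.
captured = re.findall(r"(\d{3})-(\d{2})-(\d{4})", text)

# With only non-capturing groups, findall() returns the full matched text.
whole = re.findall(r"(?:\d{3})-(?:\d{2})-(?:\d{4})", text)

# A match object exposes groups individually; group(0) is the full match.
m = re.search(r"(\d{3})-(\d{2})-(\d{4})", text)

print(captured)    # [('123', '45', '6789'), ('987', '65', '4321')]
print(whole)       # ['123-45-6789', '987-65-4321']
print(m.group(0))  # 123-45-6789
print(m.group(2))  # 45
```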

PRACTICAL EXAMPLES
1. MATCHING EMAIL ADDRESSES
Example: [email protected]
● The username part, i.e., the part before @:
Can contain letters a-z and A-Z, digits 0-9, dot ., and hyphen -. Some providers, unlike Gmail, also allow underscore _ and other special characters such as +. Spaces are not allowed.
Pattern: "[a-zA-Z0-9._+-]+" : one or more occurrences of these characters. The hyphen is placed last so that it is read as a literal character and not as a range.
● The domain part, i.e., the part after @:
Can contain subdomains, a domain, and domain extensions, and must end with an extension of at least 2 letters.
Pattern for subdomains and domain: "[a-zA-Z0-9.-]+"
Pattern for the final extension: "\.[a-zA-Z]{2,}"
# Complete regex:
r"[a-zA-Z0-9._+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
# Near-equivalent regex using \w:
r"[\w.+-]+@[\w.-]+\.[a-zA-Z]{2,}"
# (\w matches any letter, digit, or underscore; {2,} means 2 or more occurrences)
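
A minimal sketch of how an email pattern of this shape might be used for both validation and extraction. The pattern places the hyphen last in each set so it is read literally. The helper names is_email and find_emails and the addresses are hypothetical, and the pattern is a practical approximation, not a full implementation of the email address standard.

```python
import re

# Hyphen placed last in each set so it is treated as a literal character.
EMAIL = r"[a-zA-Z0-9._+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"

def is_email(text):
    """Validation: the whole string must be an email address."""
    return re.fullmatch(EMAIL, text) is not None

def find_emails(text):
    """Extraction: return every email-like substring in the text."""
    return re.findall(EMAIL, text)

valid = is_email("jane.doe+news@mail.example.com")
no_extension = is_email("john.doe@com")
no_at_sign = is_email("noatsymbol.com")
found = find_emails("Contact jane.doe@example.org or sales@example.co.uk.")

print(valid, no_extension, no_at_sign)  # True False False
print(found)  # ['jane.doe@example.org', 'sales@example.co.uk']
```

Using re.fullmatch for validation avoids accepting strings that merely contain an address somewhere inside them.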

2. MATCHING QUESTIONS
Examples:
- Is this your final answer?
- "Python is a snake" - is this statement correct?
- Why is the sky blue during the day?
● Start of a question: alphanumeric characters or quotation marks: r"[a-zA-Z0-9\"']+"
● Middle part of a question: r"[a-zA-Z0-9\"' ,-_-+]*"
(Inside the set, ,-_ is read as a range from , to _, so it also covers characters such as . / : and ?. You can include more special characters if they're allowed in the questions, or you can use [^?\n] to match every character except a question mark and a newline.)
● End of a question: r"\?"

# Complete regex:
r"[\w\"']+[\w\"',-_+ ]*\?"
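
Inside a character set, a hyphen between two characters defines a range, which is easy to trigger by accident. The sketch below shows the effect on invented sample strings, and also tries the [^?\n] alternative mentioned above. The name QUESTION is an assumption made for this example.

```python
import re

# ',-_' is a range from ',' (0x2C) to '_' (0x5F), not three separate characters.
as_range = re.findall(r"[,-_]", "a.b/c:d?E9")

# Placing '-' last keeps it literal, so only ',', '_' and '-' match.
as_literals = re.findall(r"[,_-]", "a.b/c:d?E9,_-")

# A reversed range (start after end) is rejected when the pattern is compiled.
try:
    re.compile(r"[_-+]")
    reversed_range_ok = True
except re.error:
    reversed_range_ok = False

# Question body written with [^?\n]: anything except '?' and newline.
QUESTION = r"[\w\"'][^?\n]*\?"
questions = re.findall(QUESTION, "Is this fine?\nNo.\nWhy is the sky blue?")

print(as_range)     # ['.', '/', ':', '?', 'E', '9']
print(as_literals)  # [',', '_', '-']
print(questions)    # ['Is this fine?', 'Why is the sky blue?']
```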

3. MATCHING URLS
Example:
- https://www.example.com?query_param1=value1&query_param2=value2

Many special characters are allowed in a URL, but some are not. For example, a white space is encoded as %20, and non-ASCII characters are also encoded using word characters and a few special characters.

Components of a URL:
● Scheme (http/https) followed by ://: r"https?:\/\/"
● Subdomain, domain, top-level domain: r"(?:[a-zA-Z0-9-]+\.)+[a-zA-Z]{2,}"
● Port number (optional non-capturing group): r"(?::[0-9]{1,5})?"
● Path (optional non-capturing group): r"(?:\/[^\s?#]*)?"
● Query separator and parameters (optional non-capturing group): r"(?:\?[a-zA-Z0-9%._\-~+=&]*)?"
● Fragment (optional non-capturing group): r"(?:#[^\s]*)?"

# Complete regex:
r"https?:\/\/(?:[a-zA-Z0-9-]+\.)+[a-zA-Z]{2,}(?::[0-9]{1,5})?(?:\/[^\s?#]*)?(?:\?[a-zA-Z0-9%._\-~+=&]*)?(?:#[^\s]*)?"
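
A long pattern like this can be easier to read when compiled with the re.VERBOSE flag, which ignores whitespace and allows comments inside the pattern. The sketch below is one possible layout of the same components; the sample text is invented. Note that in verbose mode a literal # outside a character set must be escaped.

```python
import re

URL = re.compile(r"""
    https?://                          # scheme: http or https
    (?:[a-zA-Z0-9-]+\.)+[a-zA-Z]{2,}   # subdomains, domain, top-level domain
    (?::[0-9]{1,5})?                   # optional port
    (?:/[^\s?#]*)?                     # optional path
    (?:\?[a-zA-Z0-9%._\-~+=&]*)?       # optional query string
    (?:\#[^\s]*)?                      # optional fragment
""", re.VERBOSE)

text = (
    "See https://www.example.com?query_param1=value1&query_param2=value2 "
    "and http://example.org/resource#top but not ftp://wrong.protocol.com"
)

# All groups are non-capturing, so findall() returns the full URLs.
urls = URL.findall(text)
print(urls)
```

Because every group is non-capturing, findall returns complete URLs instead of tuples of components.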

4. MATCHING IPV4 ADDRESSES


An IPv4 address consists of four octets, separated by dots (.), where each octet is a number between 0 and 255.
Logic behind regex to match a number between 0-255:
● Number between 0-9: [0-9]
● Number between 10-99: [1-9][0-9]
● Number between 0-99: [0-9][0-9]?
● Number between 0-199: [0-1]?[0-9][0-9]? (this also accepts leading zeros, e.g. 007)
● Number between 200-249: 2[0-4][0-9]
● Number between 250-255: 25[0-5]

Regex for a number between 0 and 255: r"(?:25[0-5]|2[0-4][0-9]|[0-1]?[0-9][0-9]?)"


# Complete regex:
r"(?:(?:25[0-5]|2[0-4][0-9]|[0-1]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[0-1]?[0-9][0-9]?)"
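
Numeric ranges are easy to get subtly wrong in regex, so it can help to test an octet pattern against every candidate number. The sketch below does that for one stricter variant (an assumption of this example: it rejects leading zeros). It also shows why a shorthand such as 2[0-5][0-5] does not cover all of 200-255. The names OCTET and IPV4 are hypothetical, and \b is added so that a match cannot start or end in the middle of a longer number.

```python
import re

# One alternative per sub-range: 250-255, 200-249, 100-199, 0-99.
OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])"
IPV4 = re.compile(rf"\b(?:{OCTET}\.){{3}}{OCTET}\b")

# Exhaustive check: exactly the numbers 0-255 should be accepted.
accepted = [n for n in range(300) if re.fullmatch(OCTET, str(n))]

# 2[0-5][0-5] restricts the LAST digit to 0-5 as well, so it misses e.g. 206.
missed = [n for n in range(200, 256) if not re.fullmatch(r"2[0-5][0-5]", str(n))]

found = IPV4.findall("192.168.1.1 127.0.0.1 256.256.256.256 192.168.1.256")

print(accepted == list(range(256)))  # True
print(missed[:4], len(missed))       # [206, 207, 208, 209] 20
print(found)                         # ['192.168.1.1', '127.0.0.1']
```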

Python’s re Module
The re module provides built-in functions for regex operations.

COMMON FUNCTIONS
Function: re.findall
Description: Returns a list containing all matches in the order they are found. If no match, empty list.
Syntax: x = re.findall("regex_expression", text)
Return Value (x): List of all matched strings

Function: re.search
Description: Returns a match object for the first match found. Returns None if no match is found.
Syntax: x = re.search("regex_expression", text)
Return Value (x): Match object (if found) or None

Function: re.split
Description: Splits a string into a list at each match. Optionally, limit the splits with maxsplit.
Syntax: x = re.split("regex_expression", text, [maxsplit])
Return Value (x): List of separated strings

Function: re.sub
Description: Replaces one or more matches with a given string. Optionally limit replacements with count.
Syntax: x = re.sub("regex_expression", "replacement_string", text, count)
Return Value (x): A new string with substitutions applied
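
The four functions can be compared side by side on a single string. This is a small sketch; the log line is invented sample data, and maxsplit and count are passed as keyword arguments for readability.

```python
import re

log = "2024-01-15 ERROR disk full; 2024-01-16 INFO ok; 2024-01-17 ERROR timeout"

# findall: every date, in order of appearance.
dates = re.findall(r"\d{4}-\d{2}-\d{2}", log)

# search: first match only; None when nothing matches.
first_error = re.search(r"ERROR (\w+)", log)
no_warning = re.search(r"WARN", log)

# split: cut at each separator, here limited to one split.
entries = re.split(r";\s*", log, maxsplit=1)

# sub: replace matches, here limited to the first two.
masked = re.sub(r"\d{4}-\d{2}-\d{2}", "<date>", log, count=2)

print(dates)                 # ['2024-01-15', '2024-01-16', '2024-01-17']
print(first_error.group(1))  # disk
print(no_warning)            # None
print(entries)
print(masked)
```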

CODE:
import re

# Sample text with correct and incorrect examples


sample_text = """
Correct Examples:
[email protected]
[email protected]
Is this your final answer?
"Python is a snake" - is this statement correct?
https://www.example.com?query_param1=value1&query_param2=value2
http://example.org/resource
192.168.1.1
127.0.0.1

Incorrect Examples:
john.doe@com
noatsymbol.com
Is this even correct..
ftp://wrong.protocol.com
256.256.256.256
999.999.999.999
"""

# Regex patterns (each raw string must stay on a single line)
patterns = {
    "Email Address": r"[a-zA-Z0-9._+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}",
    "Question": r"[a-zA-Z0-9\"'][a-zA-Z0-9\"',-_-+ ]*\?",
    "URL": r"https?:\/\/(?:[a-zA-Z0-9-]+\.)+[a-zA-Z]{2,}(?::[0-9]{1,5})?(?:\/[^\s?#]*)?(?:\?[a-zA-Z0-9%._\-~+=&]*)?(?:#[^\s]*)?",
    "IPv4 Address": r"(?:(?:25[0-5]|2[0-4][0-9]|[0-1]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[0-1]?[0-9][0-9]?)",
}

def test_regex(pattern_name, pattern, text):
    # Print every non-overlapping match of the pattern in the text
    print(f"\nTesting: {pattern_name}")
    matches = re.findall(pattern, text)
    print("Matches:")
    for match in matches:
        print(f" - {match}")

# Testing all patterns
for name, regex in patterns.items():
    test_regex(name, regex, sample_text)

OUTPUT:
Testing: Email Address
Matches:
- [email protected]
- [email protected]
Testing: Question
Matches:
- Is this your final answer?
- "Python is a snake" - is this statement correct?
- https://www.example.com?
Testing: URL
Matches:
- https://www.example.com?query_param1=value1&query_param2=value2
- http://example.org/resource
Testing: IPv4 Address
Matches:
- 192.168.1.1
- 127.0.0.1

Theory References:
https://www.w3schools.com/python/python_regex.asp
https://www.geeksforgeeks.org/components-of-a-url/
