Regular Expressions in Python
Introduction to Regular Expressions
• Regular expressions are sequences of
characters that define a search pattern.
• Used for string matching, searching, and
manipulation.
• In Python, handled using the 're' module.
Importing the re Module
• To work with regex in Python, import the 're'
module:
• Example:
• import re
Basic Functions in re Module
• re.match(): Checks for a match only at the
beginning of the string.
• re.search(): Searches for a match anywhere in the
string.
• re.findall(): Returns a list of all matches.
• re.sub(): Replaces one or many matches with a
string.
• re.split(): Splits a string wherever the regex
pattern matches and returns the parts as a
list.
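The difference between these functions is easiest to see side by side. The following is a minimal sketch; the sample string and variable names are made up for illustration. Note that re.match() and re.search() return None when nothing matches, so the result should be checked before calling .group().

```python
import re

text = 'Hello World'  # sample string chosen for illustration

# re.match() only looks at the beginning of the string
at_start = re.match(r'World', text)      # None: 'World' is not at position 0

# re.search() scans the whole string
anywhere = re.search(r'World', text)     # Match object covering 'World'

# Both return None on failure, so test the result before calling .group()
found = anywhere.group() if anywhere else None

# re.split() and re.findall() always return lists
parts = re.split(r'\s+', text)           # split on runs of whitespace
digits = re.findall(r'\d+', text)        # no digits here, so an empty list

print(at_start, found, parts, digits)
```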
Example: re.match()
• import re
• pattern = r'Hello'
• result = re.match(pattern, 'Hello World')
• print(result) # Matches at start
• Output:
• <re.Match object; span=(0, 5), match='Hello'>
• print(result.group())
• # Hello
Example: re.search()
• import re
• pattern = r'World'
• result = re.search(pattern, 'Hello World')
• print(result) # Found match anywhere
• Output:
• <re.Match object; span=(6, 11), match='World'>
Example: re.findall()
• import re
• pattern = r'\d+'
• result = re.findall(pattern, 'There are 12 apples and 34 oranges.')
• print(result) # ['12', '34']
Example: re.sub()
• import re
• pattern = r'apples'
• result = re.sub(pattern, 'bananas', 'I like apples.')
• print(result) # I like bananas.
• re.sub(pattern, replacement, string) searches
for the pattern in string and replaces all
occurrences with replacement.
• print(re.sub(r'apples', 'bananas', 'apples are red, apples are tasty'))
• Output:
• bananas are red, bananas are tasty
• Pass count to limit the number of replacements:
• print(re.sub(r'apples', 'bananas', 'apples are red, apples are tasty', count=1))
• Output:
• bananas are red, apples are tasty
Practice Problem -1
Karan is developing a text-splitting tool for his
application that uses regular expressions to
separate items in a comma-separated string. He
inputs a single line of text where items are divided
by commas and wants the program to split the line
using a regular expression.
Write a program that uses a regular expression to
split the string at each comma and prints the
resulting parts as a list.
import re
text = input("Enter a comma-separated string: ")
parts = re.split(r',', text)
print(parts)
• Enter a comma-separated string:
apple,banana,cherry
• ['apple', 'banana', 'cherry']
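Real input often has spaces around the commas. One possible variation, assuming the surrounding whitespace should be discarded, is to let the pattern absorb it. The helper name split_items is hypothetical and not part of the original exercise.

```python
import re

def split_items(line):
    """Split a comma-separated line, ignoring spaces around each comma."""
    # \s* on both sides of the comma absorbs optional whitespace;
    # strip() removes whitespace at the very start and end of the line
    return re.split(r'\s*,\s*', line.strip())

print(split_items('apple, banana ,cherry'))  # ['apple', 'banana', 'cherry']
```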
Special Characters in Regex
• . : Any character except newline
• ^ : Start of string
• $ : End of string
• * : 0 or more repetitions
• + : 1 or more repetitions
• ? : 0 or 1 occurrence
• {n} : Exactly n repetitions
• {n,} : n or more repetitions
• {n,m} : Between n and m repetitions
Symbol  Meaning                                       Example Pattern                                Example Match
.       Matches any single character except newline  re.findall(r"a.c", "abc a_c a-c")              ['abc', 'a_c', 'a-c']
^       Matches the start of the string               re.findall(r"^Hello", "Hello World")           ['Hello']
$       Matches the end of the string                 re.findall(r"World$", "Hello World")           ['World']
*       Matches 0 or more repetitions                 re.findall(r"ab*", "a ab abb abbb")            ['a', 'ab', 'abb', 'abbb']
+       Matches 1 or more repetitions                 re.findall(r"ab+", "a ab abb abbb")            ['ab', 'abb', 'abbb']
?       Matches 0 or 1 occurrence                     re.findall(r"ab?", "a ab abb")                 ['a', 'ab', 'ab']
{n}     Matches exactly n repetitions                 re.findall(r"\d{3}", "123 45 6789")            ['123', '678']
{n,}    Matches n or more repetitions                 re.findall(r"\d{2,}", "1 12 123 1234")         ['12', '123', '1234']
{n,m}   Matches between n and m repetitions           re.findall(r"\d{2,4}", "1 12 123 1234 12345")  ['12', '123', '1234', '1234']
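Anchors and quantifiers are often combined to check a whole string rather than to find pieces inside it. As a small sketch, the hypothetical helper is_pin below accepts strings made of four to six digits; the name and the length limits are assumptions chosen for illustration.

```python
import re

def is_pin(value):
    """Return True if value looks like a run of 4 to 6 digits."""
    # ^ and $ pin the pattern to both ends of the string,
    # {4,6} limits how many digits may appear between them
    return re.search(r'^\d{4,6}$', value) is not None

print(is_pin('1234'))     # True
print(is_pin('123'))      # False: too short
print(is_pin('1234567'))  # False: too long
print(is_pin('12a4'))     # False: contains a non-digit
```

One caveat: in Python, $ also matches just before a single trailing newline, so input read from a file may need to be stripped first.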
Character Classes
• [abc] : Matches a, b, or c
• [^abc] : Matches any character except a, b, or c
• [a-z] : Matches any lowercase letter
• [A-Z] : Matches any uppercase letter
• [0-9] : Matches any digit
• \d : Matches any digit
• \D : Matches any non-digit
• \s : Matches any whitespace
• \S : Matches any non-whitespace
• \w : Matches any alphanumeric character or underscore
• \W : Matches any character that is not alphanumeric or an underscore
Pattern Meaning                                       Example Code                                Output
[abc]   Matches a, b, or c                            re.findall(r"[abc]", "apple bat cat dog")   ['a', 'b', 'a', 'c', 'a']
[^abc]  Matches any character except a, b, or c       re.findall(r"[^abc]", "apple bat cat dog")  ['p', 'p', 'l', 'e', ' ', 't', ' ', 't', ' ', 'd', 'o', 'g']
[a-z]   Matches lowercase a–z                         re.findall(r"[a-z]", "Hello123")            ['e', 'l', 'l', 'o']
[A-Z]   Matches uppercase A–Z                         re.findall(r"[A-Z]", "Hello123")            ['H']
[0-9]   Matches any digit                             re.findall(r"[0-9]", "Age 25")              ['2', '5']
\d      Matches any digit (same as [0-9] for ASCII)   re.findall(r"\d", "Room 101")               ['1', '0', '1']
\D      Matches any non-digit                         re.findall(r"\D", "Room 101")               ['R', 'o', 'o', 'm', ' ']
\s      Matches whitespace (space, tab, newline)      re.findall(r"\s", "Hello World")            [' ']
\S      Matches any non-whitespace                    re.findall(r"\S", "Hello World")            ['H', 'e', 'l', 'l', 'o', 'W', 'o', 'r', 'l', 'd']
\W      Matches any character not matched by \w       re.findall(r"\W", "Hello_123!")             ['!']
\w      Matches alphanumeric + underscore             re.findall(r"\w", "Hello_123!")             ['H', 'e', 'l', 'l', 'o', '_', '1', '2', '3']
Example: Using Character Classes
• import re
• pattern = r'[A-Za-z]+'
• result = re.findall(pattern, 'Python 3 is fun!')
• print(result) # ['Python', 'is', 'fun']
Grouping and Capturing
• () : Groups patterns
• \1, \2, ... : Refers to captured groups
• Example:
• pattern = r'(\d+)-(\d+)'
• result = re.match(pattern, '123-456')
• print(result.groups()) # ('123', '456')
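Backreferences such as \1 can be used both inside a pattern and inside a replacement string. The two uses can be sketched as follows; the sample strings are made up for illustration.

```python
import re

# Inside a pattern: \1 must repeat whatever group 1 captured,
# so this finds a word that is immediately repeated
repeated = re.search(r'\b(\w+) \1\b', 'this is is a test')
print(repeated.group(1))   # is

# Inside a replacement: \1 and \2 insert the captured groups,
# here swapping the two numbers around the hyphen
swapped = re.sub(r'(\d+)-(\d+)', r'\2-\1', '123-456')
print(swapped)             # 456-123
```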
Practice Problem -2
• Grisa, an assistant coach for a school-level cricket
camp, is preparing flashcards to help young players
improve their vocabulary. The cards will only include
words that are exactly four letters long, as they are
easier for the juniors to memorize and pronounce. She
receives raw text from various cricket-related stories
and interviews, and needs a program to extract all such
four-letter words.
• Help her to implement the task.
import re
text = input()
words = re.findall(r'\b\w{4}\b', text)
print("Words:", words)
Input - This is a test
Output - Words: ['This', 'test']
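Note that \w also matches digits and the underscore, so the pattern above would accept a token such as 2024. If only alphabetic words are wanted, one possible variation, assuming ASCII letters are sufficient, replaces \w with an explicit letter class. The sample sentence is made up for illustration.

```python
import re

text = 'Bats in 2024 were used well by team_1'

# \w{4} accepts any four word characters, including digits
loose = re.findall(r'\b\w{4}\b', text)

# [A-Za-z]{4} accepts four ASCII letters only
strict = re.findall(r'\b[A-Za-z]{4}\b', text)

print(loose)   # ['Bats', '2024', 'were', 'used', 'well']
print(strict)  # ['Bats', 'were', 'used', 'well']
```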
Practice Problem -3
• Mihir is building a parser that detects dates in the
format DD-MM-YYYY within natural language
text. He needs to extract all date substrings that
match this pattern using regular expressions.
• Write a program that extracts and displays all
dates in the format DD-MM-YYYY from a line of
text.
import re
text = input()
dates = re.findall(r'\b\d{2}-\d{2}-\d{4}\b', text)
print("Dates:", dates)
Input - My DOB is 04-08-1995
Output - Dates: ['04-08-1995']
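When the individual parts of each date are needed, the same pattern can be given capturing groups; re.findall() then returns one tuple per match. This is a sketch on a made-up sentence, and it checks only the shape of the date, not whether the day and month are valid.

```python
import re

text = 'Joined on 04-08-1995 and left on 31-12-2001.'

# With three groups, findall() returns (day, month, year) tuples
parts = re.findall(r'\b(\d{2})-(\d{2})-(\d{4})\b', text)

for day, month, year in parts:
    print(day, month, year)
```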
Practice Problem -4
• Arjun is building a script to extract the domain
name from an email address present in a given
line of text. He needs to identify the part that
follows the @ symbol in an email address.
• Write a program that extracts and prints the
domain name from the first valid email address
found in the input using regular expressions.
import re
text = input()
match = re.search(r'@([\w.-]+)', text)
domain = match.group(1) if match else ''
print("Domain:", domain)
Input - [email protected]
Output - Domain: gmail.com
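The same logic can be wrapped in a function so that it is easy to try on fixed strings. The function name extract_domain and the address user@example.com are placeholders invented for this sketch; the pattern is the one used above.

```python
import re

def extract_domain(text):
    """Return the word characters, dots, and hyphens after the first '@'."""
    match = re.search(r'@([\w.-]+)', text)
    # re.search() returns None when there is no '@' followed by such characters
    return match.group(1) if match else ''

print(extract_domain('Write to user@example.com today'))  # example.com
print(repr(extract_domain('no address here')))            # ''
```

Because the pattern only looks at what follows the @ symbol, it does not verify that the text before it forms a valid address.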