Manipulating Text with Regular Expressions in Python
Regular expressions (regex) are a powerful tool for text manipulation in Python. They let
you search, match, split, and replace text using patterns that plain string methods cannot
express. Python's built-in re module provides the functions for working with them.
Special Characters
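. (Dot): Matches any character except a newline (any character at all when the re.DOTALL flag is set).
^ (Caret): Matches the start of the string; with re.MULTILINE, also the start of each line.
$ (Dollar Sign): Matches the end of the string or just before a trailing newline; with re.MULTILINE, also the end of each line.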
[] (Square Brackets): Matches any one of the characters inside the brackets.
\ (Backslash): Escapes special characters or signals a particular sequence.
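The special characters listed above can be tried out with a few short calls. This is a minimal sketch; the sample strings are made up for illustration.

```python
import re

# "." matches any single character except a newline.
dot_matches = re.findall(r"c.t", "cat cot c\nt cut")
print(dot_matches)  # ['cat', 'cot', 'cut']

# "^" and "$" anchor the pattern to the start and end of the string.
starts_with_hello = bool(re.search(r"^Hello", "Hello, world!"))
ends_with_world = bool(re.search(r"world$", "Hello, world!"))
print(starts_with_hello, ends_with_world)  # True False

# "[]" matches any one character from the set inside the brackets.
vowels = re.findall(r"[aeiou]", "regex")
print(vowels)  # ['e', 'e']

# "\" escapes a special character so that it is matched literally.
literal_dots = re.findall(r"\.", "a.b.c")
print(literal_dots)  # ['.', '.']
```

The "world$" check is False because the string ends with "!", not with "world".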
Special Sequences
\d: Matches any digit.
\D: Matches any non-digit character.
\s: Matches any whitespace character.
\S: Matches any non-whitespace character.
\w: Matches any alphanumeric character or an underscore.
\W: Matches any character that is not alphanumeric or an underscore.
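To see how the special sequences divide up a string, the sketch below runs each one over the same short sample. The sample text is an assumption chosen to contain a digit, an underscore, a space, and punctuation.

```python
import re

sample = "id_7 ok!"

print(re.findall(r"\d", sample))  # ['7']
print(re.findall(r"\D", sample))  # ['i', 'd', '_', ' ', 'o', 'k', '!']
print(re.findall(r"\s", sample))  # [' ']
print(re.findall(r"\S", sample))  # ['i', 'd', '_', '7', 'o', 'k', '!']
print(re.findall(r"\w", sample))  # ['i', 'd', '_', '7', 'o', 'k']
print(re.findall(r"\W", sample))  # [' ', '!']
```

Note that the underscore is matched by \w, not by \W.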
Quantifiers
*: Matches 0 or more repetitions of the preceding pattern.
+: Matches 1 or more repetitions of the preceding pattern.
?: Matches 0 or 1 repetition of the preceding pattern.
{n}: Matches exactly n repetitions of the preceding pattern.
{n,}: Matches n or more repetitions of the preceding pattern.
{n,m}: Matches between n and m repetitions of the preceding pattern.
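One way to compare the quantifiers is to test each against the same set of candidate strings. In this sketch the helper fully_matching is a hypothetical name introduced for illustration; it relies on re.fullmatch(), which succeeds only when the whole string matches the pattern.

```python
import re

candidates = ["a", "ab", "abb", "abbb", "abbbb"]

def fully_matching(pattern):
    """Return the candidates that the pattern matches in their entirety."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

print(fully_matching(r"ab*"))      # ['a', 'ab', 'abb', 'abbb', 'abbbb']
print(fully_matching(r"ab+"))      # ['ab', 'abb', 'abbb', 'abbbb']
print(fully_matching(r"ab?"))      # ['a', 'ab']
print(fully_matching(r"ab{2}"))    # ['abb']
print(fully_matching(r"ab{2,}"))   # ['abb', 'abbb', 'abbbb']
print(fully_matching(r"ab{2,3}"))  # ['abb', 'abbb']
```

In each pattern the quantifier applies only to the "b" immediately before it, not to the whole "ab".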
Basic Functions
Matching Patterns
To check if a pattern exists within a string, you can use re.match() or re.search().
re.match() checks for a match only at the beginning of the string.
re.search() checks for a match anywhere in the string.
import re
text = "Hello, world!"
# Match at the beginning
match = re.match(r'Hello', text)
if match:
    print("Match found:", match.group())  # Output: Match found: Hello
# Search anywhere in the string
search = re.search(r'world', text)
if search:
    print("Search found:", search.group())  # Output: Search found: world
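The difference between the two functions is easiest to see when the pattern does not sit at the start of the string. The sketch below also shows re.fullmatch(), which is not covered above and succeeds only when the entire string matches.

```python
import re

text = "Hello, world!"

# re.match() only tries position 0, so "world" is not found.
at_start = re.match(r"world", text)
print(at_start)  # None

# re.search() scans forward and reports where the match sits.
anywhere = re.search(r"world", text)
print(anywhere.span())  # (7, 12)

# re.fullmatch() requires the pattern to cover the whole string.
whole = re.fullmatch(r"Hello", text)
print(whole)  # None
```

Each of these functions returns a match object on success and None otherwise, which is why the result is normally tested with an if statement before calling group().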
Finding All Matches
To find all occurrences of a pattern in a string, use re.findall().
text = "The rain in Spain stays mainly in the plain."
# Find all occurrences of 'ain'
matches = re.findall(r'ain', text)
print("Find all matches:", matches) # Output: Find all matches: ['ain', 'ain', 'ain', 'ain']
# (one match each from "rain", "Spain", "mainly", and "plain")
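When the surrounding word or the position of each match matters, two variations are worth sketching. The pattern \w*ain\w* is an assumption chosen here to widen each match to the whole word; re.finditer() yields match objects instead of plain strings.

```python
import re

text = "The rain in Spain stays mainly in the plain."

# Widen the pattern so that each match covers the whole word.
words = re.findall(r"\w*ain\w*", text)
print(words)  # ['rain', 'Spain', 'mainly', 'plain']

# finditer() gives match objects, which carry position information.
starts = [m.start() for m in re.finditer(r"ain", text)]
print(starts)  # [5, 14, 25, 40]
```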
Splitting Strings
To split a string by a pattern, use re.split().
text = "one1two2three3four4"
# Split by digits
split_result = re.split(r'\d', text)
print("Split result:", split_result) # Output: Split result: ['one', 'two', 'three', 'four', '']
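The trailing empty string appears because the text ends with a delimiter. The sketch below shows three common adjustments: dropping empty pieces, limiting the number of splits with maxsplit, and keeping the delimiters by wrapping the pattern in a capturing group.

```python
import re

text = "one1two2three3four4"

# Drop empty pieces produced by a delimiter at the end of the string.
non_empty = [part for part in re.split(r"\d", text) if part]
print(non_empty)  # ['one', 'two', 'three', 'four']

# Stop after two splits; the remainder is returned unsplit.
limited = re.split(r"\d", text, maxsplit=2)
print(limited)  # ['one', 'two', 'three3four4']

# A capturing group keeps the delimiters in the result.
with_delimiters = re.split(r"(\d)", text)
print(with_delimiters)  # ['one', '1', 'two', '2', 'three', '3', 'four', '4', '']
```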
Replacing Substrings
To replace substrings that match a pattern, use re.sub().
text = "The rain in Spain."
# Replace 'rain' with 'sun'
replace_result = re.sub(r'rain', 'sun', text)
print("Replace result:", replace_result) # Output: Replace result: The sun in Spain.
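re.sub() accepts more than a fixed replacement string. As a rough sketch, the examples below limit the number of replacements with count, reuse captured text through backreferences, and compute the replacement with a function. The sample sentences are made up for illustration.

```python
import re

text = "The rain in Spain."

# count=1 replaces only the first occurrence.
first_only = re.sub(r"ain", "AIN", text, count=1)
print(first_only)  # The rAIN in Spain.

# \1 and \2 refer to the text captured by the first and second groups.
swapped = re.sub(r"(\w+) in (\w+)", r"\2 in \1", text)
print(swapped)  # The Spain in rain.

# A function receives each match object and returns the replacement text.
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), "3 apples and 10 pears")
print(doubled)  # 6 apples and 20 pears
```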
Capturing Groups
Capturing groups allow you to extract specific parts of a match.
text = "My phone number is 123-456-7890."
# Capture groups for area code, prefix, and line number
match = re.search(r'(\d{3})-(\d{3})-(\d{4})', text)
if match:
    area_code, prefix, line_number = match.groups()
    print("Area code:", area_code)      # Output: Area code: 123
    print("Prefix:", prefix)            # Output: Prefix: 456
    print("Line number:", line_number)  # Output: Line number: 7890
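Groups can also be given names with the (?P<name>...) syntax, which avoids relying on their position. The sketch below repeats the phone number example; the group names area, prefix, and line are labels chosen here for illustration.

```python
import re

text = "My phone number is 123-456-7890."

# Each group is named, so it can be looked up by label instead of by index.
pattern = r"(?P<area>\d{3})-(?P<prefix>\d{3})-(?P<line>\d{4})"
match = re.search(pattern, text)

if match:
    print(match.group(0))       # 123-456-7890 (the entire match)
    print(match.group("area"))  # 123
    print(match.groupdict())    # {'area': '123', 'prefix': '456', 'line': '7890'}
```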
Examples
Here are some more examples to illustrate the use of regular expressions for text
manipulation:
# Example 1: Validate an email address (a simplified check, not full address validation)
email = "user@example.com"  # placeholder address for illustration
is_valid = re.match(r'^[\w\.-]+@[\w\.-]+\.\w+$', email)
print("Valid email:", bool(is_valid)) # Output: Valid email: True
# Example 2: Extract all hashtags from a tweet
tweet = "Loving the new features in #Python3.9! #coding #programming"
hashtags = re.findall(r'#\w+', tweet)
print("Hashtags:", hashtags) # Output: Hashtags: ['#Python3', '#coding', '#programming']
# Example 3: Replace multiple spaces with a single space
text = "This   is  an example    with   irregular spacing."
normalized_text = re.sub(r'\s+', ' ', text)
print("Normalized text:", normalized_text)
# Output: Normalized text: This is an example with irregular spacing.
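In Example 2 the match for #Python3.9 stops at the dot, because \w does not match ".". If dotted tags should be kept whole, one possible variant is sketched below. The rule it encodes (word characters, optionally followed by dot-separated groups of word characters) is an assumption, not a standard definition of a hashtag. It also uses re.compile() so that the pattern can be reused.

```python
import re

# Assumed rule: "#" plus word characters, optionally followed by
# dot-separated groups of word characters. A trailing dot is not included.
HASHTAG = re.compile(r"#\w+(?:\.\w+)*")

tweet = "Loving the new features in #Python3.9! #coding #programming."
hashtags = HASHTAG.findall(tweet)
print(hashtags)  # ['#Python3.9', '#coding', '#programming']
```

The group is non-capturing, written (?:...), so findall() still returns each whole match and not just the group's contents.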