Regex Summary
In the name of Allah, and peace and blessings be upon the Messenger of Allah, may Allah's peace and blessings be upon him.
Functions:
search(): Searches through the entire string and finds the first match of the
regex.
findall(): Finds all the matches of the regex in the string and returns them as a
list.
sub(): Replaces every occurrence of a regex pattern with a replacement string
(or only the first N occurrences when the count argument is given).
split(): Splits the string into a list based on the occurrences of the regex
pattern.
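A minimal sketch of the four functions side by side. The sample string and the variable names are made up for illustration; the count argument of re.sub() is what limits the number of replacements.

```python
import re

# Hypothetical sample string, chosen only for illustration.
text = "cat bat rat"
pattern = r'[a-z]at'

first = re.search(pattern, text)                    # Match object for the first match, or None
all_matches = re.findall(pattern, text)             # list of every match
replaced_all = re.sub(pattern, 'X', text)           # every match is replaced by default
replaced_one = re.sub(pattern, 'X', text, count=1)  # count=1 replaces only the first match
parts = re.split(r'\s', text)                       # split on each whitespace character

print(first.group())   # cat
print(all_matches)     # ['cat', 'bat', 'rat']
print(replaced_all)    # X X X
print(replaced_one)    # X bat rat
print(parts)           # ['cat', 'bat', 'rat']
```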
Metacharacters:
\d : Matches any digit (equivalent to [0-9] )
\b : Matches a word boundary (the position between a word character and a
non-word character)
Q1.
import re
text = "The quick brown fox jumps over the lazy dog"
pattern = r'\b\w{5}\b'
matches = re.findall(pattern, text)
print(matches)
# The Output: ['quick', 'brown', 'jumps']
Here the pattern is wrapped in word boundaries \b and matches a word of exactly 5
word characters \w{5} . So findall() returns a list of all the five-character
words in the text.
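To see what the boundaries contribute, the sketch below runs \w{5} with and without \b . The sentence is a made-up example that contains words longer than five characters.

```python
import re

# Hypothetical sentence with words longer than five characters.
text = "seven elephants quickly cross rivers"

with_boundaries = re.findall(r'\b\w{5}\b', text)
without_boundaries = re.findall(r'\w{5}', text)

print(with_boundaries)     # ['seven', 'cross']
print(without_boundaries)  # ['seven', 'eleph', 'quick', 'cross', 'river']
```

Without \b the pattern also picks up the first five characters of longer words.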
Q2.
import re
text = "My phone number is 100"
pattern = r'\d+'
replacement = "XXX"
new_text = re.sub(pattern, replacement, text)
print(new_text)
# The Output: My phone number is XXX
Here the pattern \d+ matches only digits, so sub() finds the 100 and replaces it
with XXX.
That's fine, but what about the little + sign after the \d ?
The plus sign means one or more digits, so in this example the pattern takes the
whole 100 as a single match and replaces it with XXX. If we removed the plus
sign, the pattern would match only one digit at a time, so it would treat 100 as
3 separate matches and the output would be XXXXXXXXX, which is XXX * 3.
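A quick check of both variants. re.subn() is used here only because it also reports how many replacements were made; the variable names are illustrative.

```python
import re

text = "My phone number is 100"

# \d+ treats the run "100" as one match, so one replacement is made.
with_plus, count_plus = re.subn(r'\d+', 'XXX', text)

# \d matches each digit separately, so three replacements are made.
without_plus, count_single = re.subn(r'\d', 'XXX', text)

print(with_plus, count_plus)       # My phone number is XXX 1
print(without_plus, count_single)  # My phone number is XXXXXXXXX 3
```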
Q3.
import re
text = "apple, orange, banana, grape"
pattern = r',\s*'
result = re.split(pattern, text)
print(result)
# The Output: ['apple', 'orange', 'banana', 'grape']
Here the pattern ,\s* matches a comma followed by zero or more whitespace
characters, so split() breaks the string at each comma and drops any spaces
after it.
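For comparison, a small sketch of what happens when the \s* part is left out (variable names are illustrative):

```python
import re

text = "apple, orange, banana, grape"

comma_only = re.split(r',', text)           # spaces stay attached to the items
comma_and_spaces = re.split(r',\s*', text)  # spaces are consumed by the separator

print(comma_only)        # ['apple', ' orange', ' banana', ' grape']
print(comma_and_spaces)  # ['apple', 'orange', 'banana', 'grape']
```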
Q4.
import re
text = "abc123def456"
pattern = r'\d+'
matches= re.findall(pattern, text)
print(matches)
# The Output: ['123', '456']
The pattern matches one or more consecutive digits, so findall() returns a list
of all the digit runs found in the text, and the output is ['123', '456'].
Again, just to be clear: if the plus sign is removed, the pattern matches only
one digit at a time, so the output becomes ['1', '2', '3', '4', '5', '6'].
Last but not least, make sure you understand the difference between findall()
and search() . The question uses findall() to return all the matches of the
regex in the text. search() returns only the first match, so in this case the
matched text would be only 123.
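The three cases described above, side by side (variable names are illustrative):

```python
import re

text = "abc123def456"

all_runs = re.findall(r'\d+', text)          # every run of digits
single_digits = re.findall(r'\d', text)      # every digit on its own
first_run = re.search(r'\d+', text).group()  # only the first run, as a string

print(all_runs)       # ['123', '456']
print(single_digits)  # ['1', '2', '3', '4', '5', '6']
print(first_run)      # 123
```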
Q5.
import re
text = "Please contact us at info@example.com or support@example.com"
pattern = r'\b\w+@\w+\.\w+\b'
matches = re.search(pattern, text)
print(matches.group())
# The Output: info@example.com
Here the pattern starts with the word boundary \b , which ensures that the match
begins at the start of a complete word, not in the middle of one.
The first \w+ matches one or more word characters [letters, digits, underscores],
which is the part of the email before the @ sign: info or support .
The second \w+ matches the domain name part of the email: example .
The \. matches a literal dot. It needs to be escaped with a backslash
because the dot is a metacharacter in regex.
One more \w+ matches the TLD (Top Level Domain) part of the email
address, which is com .
Lastly, the pattern ends with \b again, which ensures the match also stops at
the end of a complete word.
The question uses the search() function, so only the first match is returned,
which in this case is info@example.com, and the support email is not returned.
One more detail: search() returns a match object (or None when nothing
matches), not a list like findall() , so we call matches.group() to extract the
matched string.
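Because search() returns None when nothing matches, calling group() directly can raise an AttributeError. A minimal sketch of a guard follows; first_email is a hypothetical helper name and the sample strings are made up.

```python
import re

pattern = r'\b\w+@\w+\.\w+\b'

def first_email(text):
    """Return the first substring matched by the pattern, or None if there is none."""
    match = re.search(pattern, text)
    if match is None:  # nothing matched, so there is no group() to call
        return None
    return match.group()

found = first_email("write to info@example.com today")
missing = first_email("no address here")

print(found)    # info@example.com
print(missing)  # None
```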
Q2. Extract all dates in the format dd-mm-yyyy from a given string.
import re
text = "Today's date is 20-01-2025"
pattern = r'\d{2}-\d{2}-\d{4}'
match = re.findall(pattern, text)
print(match)
# The Output: ['20-01-2025']
Q3. Extract all words that start with a vowel from a given string.
import re
text = "an apple and an orange or a banana and a mango"
pattern = r'\b[aeiouAEIOU]\w*\b'
matches = re.findall(pattern, text)
print(matches)
# The Output: ['an', 'apple', 'and', 'an', 'orange', 'or', 'a', 'and', 'a']
This question asks for only the words that start with a vowel. Before we start,
recall that the English vowels are ['a', 'e', 'i', 'o', 'u'].
The [aeiouAEIOU] part matches a lowercase or uppercase vowel at the beginning
of the word. After it comes \w* , which matches word characters [letters,
digits, underscores]; the * means zero or more of them, so a one-letter word
such as a still matches.
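The effect of the * is easiest to see by swapping it for + (one or more), as in this small comparison on the same text (variable names are illustrative):

```python
import re

text = "an apple and an orange or a banana and a mango"

zero_or_more = re.findall(r'\b[aeiouAEIOU]\w*\b', text)
one_or_more = re.findall(r'\b[aeiouAEIOU]\w+\b', text)

print(zero_or_more)  # ['an', 'apple', 'and', 'an', 'orange', 'or', 'a', 'and', 'a']
print(one_or_more)   # ['an', 'apple', 'and', 'an', 'orange', 'or', 'and']
```

With + the single-letter word a is dropped, because at least one more word character is required after the vowel.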
import re
text = "Hi I'm a s/tu*de_nt at FC-I S&C$U"
pattern = r'[^a-zA-Z0-9\s]'
replacement = ''
result = re.sub(pattern, replacement, text)
print(result)
# The Output: Hi Im a student at FCI SCU
When the caret ( ^ ) is placed first inside square brackets, the pattern matches
any character not in the set.
Inside the square brackets we list a-z lowercase letters, A-Z uppercase letters,
0-9 digits, and \s whitespace characters [spaces, tabs, newlines], which are
exactly the characters we want to keep. With the ^ in front, the set matches
everything else: characters that are neither alphanumeric nor whitespace (this
includes the underscore).
Using the sub() function, we replace these characters with an empty string.
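A tempting shortcut is [^\w\s] , but \w also counts the underscore as a word character, so the two patterns are not equivalent. A short comparison (variable names are illustrative):

```python
import re

text = "Hi I'm a s/tu*de_nt at FC-I S&C$U"

explicit_set = re.sub(r'[^a-zA-Z0-9\s]', '', text)  # underscore is removed
word_class = re.sub(r'[^\w\s]', '', text)           # underscore survives, since \w includes it

print(explicit_set)  # Hi Im a student at FCI SCU
print(word_class)    # Hi Im a stude_nt at FCI SCU
```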
import re
# "user" is a placeholder local part (an assumption); the domain comes from the output below.
text = "user@example.com"
pattern = r'\b\w+\.\w+\b'
matches= re.search(pattern, text)
print(matches.group())
# The Output: example.com
This one is very similar to Q5 in the final exam, and even easier.
import re
text = "apple,orange banana,grape;mango"
pattern = r'[\s,;]'
result = re.split(pattern, text)
print(result)
# The Output: ['apple', 'orange', 'banana', 'grape', 'mango']
This is an easy one: we split the string with the split() function, and the
character set in the pattern contains \s for whitespace characters [spaces,
tabs, newlines], ; for semicolons, and , for commas.
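One thing to watch for: when two delimiters sit next to each other, a single-character set produces empty strings in the result. Adding + after the set is one way to avoid that. The input below is a made-up example.

```python
import re

# Hypothetical input with adjacent delimiters (comma + space, semicolon + two spaces).
text = "apple, orange;  banana"

single = re.split(r'[\s,;]', text)    # splits at every delimiter character
grouped = re.split(r'[\s,;]+', text)  # treats a run of delimiters as one separator

print(single)   # ['apple', '', 'orange', '', '', 'banana']
print(grouped)  # ['apple', 'orange', 'banana']
```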
import re
text = "Tomorrow   is  the    final exam!"
pattern = r'\s+'
replacement = ' '
result = re.sub(pattern, replacement, text)
print(result)
# The Output: Tomorrow is the final exam!
Here \s+ matches one or more consecutive whitespace characters, and sub()
replaces each run with a single space.
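Since \s covers tabs and newlines as well as spaces, the same substitution also flattens mixed whitespace. A small sketch with a made-up input:

```python
import re

# Hypothetical input mixing spaces, a tab, and a newline.
text = "Tomorrow   is\tthe \n final exam!"

collapsed = re.sub(r'\s+', ' ', text)

print(collapsed)  # Tomorrow is the final exam!
```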