Regular Expressions
LinkedIn: https://www.linkedin.com/in/shubham-verma-3968a5119
In [1]: import re
abc
abc
None
(3, 6) 3 6
(12, 15) 12 15
localhost:8888/nbconvert/html/RegEx.ipynb?download=false 1/11
9/8/22, 4:14 PM RegEx
abc
abc
1 2 3 a b c 4 5 6 7 8 9 a b c 1 2 3 A B C
pattern = re.compile(r"\.")
matches = pattern.finditer(test_string1)
pattern = re.compile(r"^123")
matches = pattern.finditer(test_string)
for match in matches:
    print(match)
# no match
# no match
<re.Match object; span=(0, 1), match='h'>
<re.Match object; span=(1, 2), match='e'>
<re.Match object; span=(2, 3), match='l'>
<re.Match object; span=(3, 4), match='l'>
<re.Match object; span=(4, 5), match='o'>
<re.Match object; span=(6, 7), match='1'>
<re.Match object; span=(7, 8), match='2'>
<re.Match object; span=(8, 9), match='3'>
<re.Match object; span=(9, 10), match='_'>
<re.Match object; span=(11, 12), match='h'>
<re.Match object; span=(12, 13), match='e'>
<re.Match object; span=(13, 14), match='y'>
<re.Match object; span=(14, 15), match='h'>
<re.Match object; span=(15, 16), match='o'>
<re.Match object; span=(17, 18), match='h'>
<re.Match object; span=(18, 19), match='o'>
<re.Match object; span=(19, 20), match='h'>
<re.Match object; span=(20, 21), match='e'>
<re.Match object; span=(21, 22), match='y'>
Sets
1. A set is a group of characters inside a pair of square brackets [] with a special meaning. Multiple conditions can be appended back to back, e.g. [aA-Z].
2. A ^ (caret) as the first character inside a set negates the set.
3. A - (dash) in a set specifies a range when it sits between two characters; at the start or end of the set it matches the dash itself.
Examples:
1. [arn] Returns a match where one of the specified characters (a, r, or n) is present
2. [a-n] Returns a match for any lower case character, alphabetically between a and n
3. [^arn] Returns a match for any character EXCEPT a, r, and n
4. [0123] Returns a match where any of the specified digits (0, 1, 2, or 3) is present
5. [0-9] Returns a match for any digit between 0 and 9
6. [0-5][0-9] Returns a match for any two-digit number from 00 to 59
7. [a-zA-Z] Returns a match for any character alphabetically between a and z, lower case OR upper case
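The set rules above can be tried out with a short sketch. The sample string and variable names below are assumptions made for illustration; they are not taken from the original notebook cells.

```python
import re

# Hypothetical sample text, chosen only for illustration.
text = "hello HELLO-123 at 07:45"

# [a-n]: lower-case letters from a to n, so "o" and "t" are skipped.
a_to_n = re.findall(r"[a-n]", text)

# [^a-zA-Z ]: a leading caret negates the set (no letters, no spaces).
not_letters = re.findall(r"[^a-zA-Z ]", text)

# [0-5][0-9]: two sets back to back match two-digit numbers 00 to 59.
two_digits = re.findall(r"[0-5][0-9]", text)

# [-0-3]: a dash at the start of the set is literal, while 0-3 is a range.
dash_or_0_to_3 = re.findall(r"[-0-3]", text)

print(a_to_n)          # ['h', 'e', 'l', 'l', 'a']
print(not_letters)     # ['-', '1', '2', '3', '0', '7', ':', '4', '5']
print(two_digits)      # ['12', '07', '45']
print(dash_or_0_to_3)  # ['-', '1', '2', '3', '0']
```

Note that [0-5][0-9] also picks up "12" from inside "123", because a set on its own does not check what surrounds the match.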
<re.Match object; span=(0, 1), match='h'>
<re.Match object; span=(1, 2), match='e'>
<re.Match object; span=(2, 3), match='l'>
<re.Match object; span=(3, 4), match='l'>
<re.Match object; span=(4, 5), match='o'>
<re.Match object; span=(5, 6), match='H'>
<re.Match object; span=(6, 7), match='E'>
<re.Match object; span=(7, 8), match='L'>
<re.Match object; span=(8, 9), match='L'>
<re.Match object; span=(9, 10), match='O'>
<re.Match object; span=(11, 12), match='1'>
<re.Match object; span=(12, 13), match='2'>
<re.Match object; span=(13, 14), match='3'>
Quantifier
1. * : 0 or more repetitions
2. + : 1 or more repetitions
3. ? : 0 or 1 repetition, used when a character is optional
4. {4} : exactly 4 repetitions
5. {4,6} : between 4 and 6 repetitions (min, max)
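A minimal sketch of each quantifier listed above. The sample string is an assumption made for illustration and does not come from the notebook.

```python
import re

# Hypothetical sample text, chosen only for illustration.
text = "a ab abb abbb color colour 2022 123456789"

zero_or_more = re.findall(r"ab*", text)     # "a" followed by 0 or more "b"
one_or_more = re.findall(r"ab+", text)      # "a" followed by 1 or more "b"
optional = re.findall(r"colou?r", text)     # the "u" may be absent
exactly_four = re.findall(r"\d{4}", text)   # exactly 4 digits per match
four_to_six = re.findall(r"\d{4,6}", text)  # 4 to 6 digits, as many as possible

print(zero_or_more)  # ['a', 'ab', 'abb', 'abbb']
print(one_or_more)   # ['ab', 'abb', 'abbb']
print(optional)      # ['color', 'colour']
print(exactly_four)  # ['2022', '1234', '5678']
print(four_to_six)   # ['2022', '123456']
```

Quantifiers are greedy by default, which is why \d{4,6} takes six digits from "123456789" and leaves "789" unmatched.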
# no match
<re.Match object; span=(5, 6), match='1'>
<re.Match object; span=(6, 7), match='2'>
<re.Match object; span=(7, 8), match='3'>
In [87]: # Task 1
dates = """
hello
11.04.2022
2022.04.21
2022-04-30
2022-05-23
2022-06-12
2022-07-15
2022-08-19
2022/04/22
2022_04_04
"""
# An unescaped "." matches any character, so use explicit digit counts instead
pattern = re.compile(r"\d{4}-\d{2}-\d{2}")
matches = pattern.finditer(dates)
In [90]: # 2. Find all dates in the 2020-04-01 and 2020/04/02 formats
pattern = re.compile(r"\d\d\d\d[-/]\d\d[-/]\d\d")
matches = pattern.finditer(dates)
for match in matches:
    print(match)
Conditions
Use | for an either/or condition.
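A minimal sketch of alternation with |. The names and the pattern below are assumptions made for illustration, since the notebook cell that produced the matches is not part of this export.

```python
import re

# Hypothetical list of names, chosen only for illustration.
names = """
Mr Curry
Mrs Curry
Mr. Thompson
Ms Green
Dr House
"""

# (Mr|Ms|Mrs) accepts any one of the three titles, \.? allows an optional
# dot, \s requires one whitespace character and \w+ matches the name.
pattern = re.compile(r"(Mr|Ms|Mrs)\.?\s\w+")

# m.group() with no argument returns the whole match, not just the title.
titled = [m.group() for m in pattern.finditer(names)]
print(titled)  # ['Mr Curry', 'Mrs Curry', 'Mr. Thompson', 'Ms Green']
```

"Dr House" is skipped because "Dr" is not one of the listed alternatives.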
<re.Match object; span=(29, 37), match='Mr Curry'>
<re.Match object; span=(38, 47), match='Mrs Curry'>
<re.Match object; span=(48, 60), match='Mr. Thompson'>
<re.Match object; span=(61, 70), match='Mrs Green'>
<re.Match object; span=(71, 76), match='Mr. T'>
Grouping
Parentheses ( ) group part of a pattern so that the substring it matched can be retrieved separately from the full match.
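A short sketch of how groups are read back from a match object. The address and the pattern are hypothetical, chosen only to show group(), groups() and the numbering of groups.

```python
import re

# Hypothetical address, used only for illustration.
email = "jane.doe@example.org"

# Three groups: the user name, the domain and the top-level domain.
pattern = re.compile(r"([\w.]+)@(\w+)\.(\w+)")
match = pattern.search(email)

whole = match.group(0)   # group 0 is always the entire match
user = match.group(1)    # groups are numbered from 1, left to right
domain = match.group(2)
tld = match.group(3)
all_groups = match.groups()  # every numbered group as a tuple

print(whole)       # jane.doe@example.org
print(all_groups)  # ('jane.doe', 'example', 'org')
```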
Modifying strings
1. split() : Split the string into a list, splitting it wherever the RE matches
2. sub() : Find all substrings where the RE matches, and replace them with a different string
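A minimal sketch of both methods. The input string is an assumption: it is one string that would produce the list printed in this section, and it is not recovered from the original cell.

```python
import re

# Assumed input, not taken from the original cell.
test_string = "123abc456789abc123ABC"
pattern = re.compile(r"abc")

# split(): cut the string wherever the pattern matches.
parts = pattern.split(test_string)

# sub(): replace every match with another string.
replaced = pattern.sub("-", test_string)

# sub() with count=1 replaces only the first match.
replaced_once = pattern.sub("-", test_string, count=1)

print(parts)          # ['123', '456789', '123ABC']
print(replaced)       # 123-456789-123ABC
print(replaced_once)  # 123-456789abc123ABC
```

The upper-case "ABC" is left alone because matching is case-sensitive unless a flag says otherwise.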
['123', '456789', '123ABC']
In [16]: # Task 2
urls = """
http://zara-fashion.com
https://www.world-healthorganisation.org
http://www.iNeuron.ai
"""
# return url
zara-fashion
world-healthorganisation
iNeuron
hello
hello
hello
In [21]: subbed_url = pattern.sub(r"\2\3", urls)  # \2 and \3 refer to capture groups 2 and 3
print(subbed_url)
zara-fashioncom
world-healthorganisationorg
iNeuronai
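The cell that defines the pattern for Task 2 is not part of this export. The sketch below uses one pattern that is consistent with the printed outputs; it is an assumption and may differ from the pattern used in the notebook.

```python
import re

urls = """
http://zara-fashion.com
https://www.world-healthorganisation.org
http://www.iNeuron.ai
"""

# Assumed pattern: group 1 is the optional "www.", group 2 the domain name
# and group 3 the top-level domain (the dot between them is not captured).
pattern = re.compile(r"https?://(www\.)?([a-zA-Z-]+)\.([a-zA-Z]+)")

# Collect only the domain name of each URL.
domains = [m.group(2) for m in pattern.finditer(urls)]

# Replace each URL with its domain name followed by its top-level domain.
subbed_url = pattern.sub(r"\2\3", urls)

print(domains)  # ['zara-fashion', 'world-healthorganisation', 'iNeuron']
print(subbed_url)
```

Because the dot sits outside group 3, the substitution joins the two groups directly, which gives "zara-fashioncom" rather than "zara-fashion.com".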
Compilation Flags
1. ASCII, A : Makes several escapes like \w, \b, \s and \d match only on ASCII characters with the respective property.
2. DOTALL, S : Make . match any character, including newlines.
3. IGNORECASE, I : Do case-insensitive matches.
4. LOCALE, L : Do a locale-aware match.
5. MULTILINE, M : Multi-line matching, affecting ^ and $.
6. VERBOSE, X (for ‘extended’) : Enable verbose REs, which can be organized more cleanly and understandably.
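A short sketch of four of these flags in use. The sample strings are assumptions made for illustration and are not the notebook's my_string.

```python
import re

# Hypothetical sample text, chosen only for illustration.
text = "Hello\nworld\nHELLO World"

# IGNORECASE: "hello" matches regardless of case.
any_case = re.findall(r"hello", text, re.IGNORECASE)

# MULTILINE: ^ matches at the start of every line, not only of the string.
start_default = re.findall(r"^world", text)
start_multiline = re.findall(r"^world", text, re.MULTILINE)

# DOTALL: . also matches the newline between "Hello" and "world".
dot_default = re.search(r"Hello.world", text)
dot_all = re.search(r"Hello.world", text, re.DOTALL)

# VERBOSE: whitespace and comments inside the pattern are ignored.
date_pattern = re.compile(r"""
    \d{4}   # year
    [-/]    # separator
    \d{2}   # month
    [-/]    # separator
    \d{2}   # day
""", re.VERBOSE)
found_dates = date_pattern.findall("2022-04-30 and 2022/04/22")

print(any_case)         # ['Hello', 'HELLO']
print(start_default)    # []
print(start_multiline)  # ['world']
print(dot_default)      # None
print(found_dates)      # ['2022-04-30', '2022/04/22']
```

Several flags can be combined with |, for example re.MULTILINE | re.IGNORECASE.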
pattern = re.compile(r"world")
matches = pattern.finditer(my_string)
for match in matches:
    print(match)