Structuring With Regex
2/12/2022
Match Characters
• Some characters are special metacharacters, and don’t match
themselves. Instead, they signal that some out-of-the-ordinary
thing should be matched, or they affect other portions of the RE
by repeating them or changing their meaning. Much of this
document is devoted to discussing various metacharacters and
what they do.
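A minimal sketch of the point above, using the dot as the metacharacter. The input string is a made-up illustration, not taken from the slides.

```python
import re

# An unescaped dot is a metacharacter: it matches any character except a newline.
loose = re.findall(r"3.5", "3.5 and 3x5")

# A backslash turns the dot into a literal character.
strict = re.findall(r"3\.5", "3.5 and 3x5")

print(loose)   # ['3.5', '3x5']
print(strict)  # ['3.5']
```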
• \D
Matches any non-digit character; this is equivalent to the class
[^0-9].
• \s
Matches any whitespace character; this is equivalent to the class
[ \t\n\r\f\v].
• \w
Matches any alphanumeric character; this is equivalent to the class
[a-zA-Z0-9_].
• \W
Matches any non-alphanumeric character; this is equivalent to the class
[^a-zA-Z0-9_].
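The classes above can be tried out as follows. The sample strings are hypothetical. One caveat worth knowing: for ordinary (str) patterns in Python 3, \w also matches non-ASCII letters, so the [a-zA-Z0-9_] equivalence holds exactly only with the re.ASCII flag.

```python
import re

sample = "Room 42_b,\tfloor 3\n"  # hypothetical input

non_digits = re.findall(r"\D", "a1b2")    # every character that is not 0-9
whitespace = re.findall(r"\s", sample)    # space, tab, space, newline
word_chars = re.findall(r"\w", "a_1-!")   # letters, digits, underscore
non_word = re.findall(r"\W", "a_1-!")     # everything else

# \w on a str pattern is Unicode-aware unless re.ASCII is passed.
unicode_default = re.findall(r"\w", "caf\u00e9")
ascii_only = re.findall(r"\w", "caf\u00e9", flags=re.ASCII)

print(non_digits)       # ['a', 'b']
print(whitespace)       # [' ', '\t', ' ', '\n']
print(word_chars)       # ['a', '_', '1']
print(non_word)         # ['-', '!']
print(len(unicode_default), len(ascii_only))  # 4 3
```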
.
(Dot.) In the default mode, this matches any character except a newline.
If the DOTALL flag has been specified, this matches any character
including a newline.
^
(Caret.) Matches the start of the string, and in MULTILINE mode also
matches immediately after each newline.
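A short sketch of how the DOTALL and MULTILINE flags change the behaviour of the dot and the caret. The two-line string is a hypothetical input.

```python
import re

text = "first line\nsecond line"  # hypothetical two-line input

# Default mode: the dot refuses to match the newline after "line".
dot_default = re.findall(r"line.", text)
# DOTALL: the dot may match the newline as well.
dot_all = re.findall(r"line.", text, flags=re.DOTALL)

# Default mode: the caret matches only at the start of the string.
caret_default = re.findall(r"^\w+", text)
# MULTILINE: the caret also matches right after each newline.
caret_multi = re.findall(r"^\w+", text, flags=re.MULTILINE)

print(dot_default)    # []
print(dot_all)        # ['line\n']
print(caret_default)  # ['first']
print(caret_multi)    # ['first', 'second']
```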
Module Contents
• The module defines several functions, constants, and an
exception. Some of the functions are simplified versions of the
full-featured methods for compiled regular expressions.
• Most non-trivial applications always use the compiled form.
Module Contents
• re.compile(pattern, flags=0)
• Compile a regular expression pattern into a regular expression
object, which can be used for matching using its match(), search()
and other methods, described below.
Module Contents
• The sequence
>> prog = re.compile(pattern)
>> result = prog.match(string)
• is equivalent to
>> result = re.match(pattern, string)
• but using re.compile() and saving the resulting regular expression
object for reuse is more efficient when the expression will be used
several times in a single program.
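The equivalence can be checked with a small sketch. The pattern and the list of strings are hypothetical, and span_or_none is a helper introduced only for this illustration.

```python
import re

pattern = r"[a-z]+"
strings = ["tempo", "123", "abc def"]  # hypothetical inputs


def span_or_none(match):
    """Return the matched span, or None when there was no match."""
    return match.span() if match else None


# Compile once, then reuse the pattern object for every string.
prog = re.compile(pattern)
compiled_spans = [span_or_none(prog.match(s)) for s in strings]

# Module-level function: same result, pattern passed on every call.
direct_spans = [span_or_none(re.match(pattern, s)) for s in strings]

print(compiled_spans)                  # [(0, 5), None, (0, 3)]
print(compiled_spans == direct_spans)  # True
```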
Module Contents
• Performing Matches
Once you have an object representing a compiled regular expression,
what do you do with it? Pattern objects have several methods and
attributes. Only the most significant ones will be covered here;
consult the re docs for a complete listing.
Module Contents
>>> import re
>>> p = re.compile('[a-z]+')
>>> p
re.compile('[a-z]+')
Module Contents
• Now, you can try matching various strings against the RE [a-z]+.
An empty string shouldn’t match at all, since + means ‘one or
more repetitions’. match() should return None in this case:
>>> p.match("")
>>> print(p.match(""))
None
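For contrast, a sketch of a string that does match. The input 'tempo' is a hypothetical choice; group(), start(), end() and span() are standard match-object methods.

```python
import re

p = re.compile('[a-z]+')

# A non-empty run of lowercase letters matches, so a match object is returned.
m = p.match('tempo')

print(m.group())  # 'tempo'  (the matched text)
print(m.start())  # 0        (index where the match begins)
print(m.end())    # 5        (index just past the match)
print(m.span())   # (0, 5)
```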
Regex Function
• The re module offers a set of functions that allows us to search a
string for a match:
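The list that followed this sentence is missing from the slide. As a sketch, the four functions this deck goes on to cover (findall, search, split, sub) behave as follows on one sample string; the choice of patterns is an assumption made for illustration.

```python
import re

txt = "8 times before 11:45 AM"

all_numbers = re.findall(r"\d+", txt)  # list of every match
first = re.search(r"\d+", txt)         # match object for the first match, or None
parts = re.split(r"\s", txt)           # string split at each match
masked = re.sub(r"\d", "#", txt)       # every match replaced

print(all_numbers)                   # ['8', '11', '45']
print(first.group(), first.span())   # 8 (0, 1)
print(parts)                         # ['8', 'times', 'before', '11:45', 'AM']
print(masked)                        # # times before ##:## AM
```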
Metacharacters
• Metacharacters are characters with a special meaning:
Metacharacters
• []
>> import re
>> txt = "The rain in Spain"
>> #Find all lower case characters alphabetically between "a" and "m":
>> x = re.findall("[a-m]", txt)
>> print(x)
Metacharacters
• \
>> import re
>> txt = "That will be 59 dollars"
>> #Find all digit characters:
>> x = re.findall(r"\d", txt)
>> print(x)
Metacharacters
• .
>> import re
>> txt = "hello planet"
>> #Search for a sequence that starts with "he", followed by two (any) characters, and an "o":
>> x = re.findall("he..o", txt)
>> print(x)
Output: ['hello']
Metacharacters
• ^
>> import re
>> txt = "hello planet"
>> #Check if the string starts with 'hello':
>> x = re.findall("^hello", txt)
>> if x:
>>     print("Yes, the string starts with 'hello'")
>> else:
>>     print("No match")
Metacharacters
• $
>> import re
>> txt = "hello planet"
>> #Check if the string ends with 'planet':
>> x = re.findall("planet$", txt)
>> if x:
>>     print("Yes, the string ends with 'planet'")
>> else:
>>     print("No match")
Metacharacters
• *
>> import re
>> txt = "hello planet"
>> #Search for a sequence that starts with "he", followed by 0 or more (any) characters, and an "o":
>> x = re.findall("he.*o", txt)
>> print(x)
Metacharacters
• +
>> import re
>> txt = "hello planet"
>> #Search for a sequence that starts with "he", followed by 1 or more (any) characters, and an "o":
>> x = re.findall("he.+o", txt)
>> print(x)
Output: ['hello']
Metacharacters
• ?
>> import re
>> txt = "hello planet"
>> #Search for a sequence that starts with "he", followed by 0 or 1 (any) character, and an "o":
>> x = re.findall("he.?o", txt)
>> print(x)
>> #This time we get no match, because there are not zero, not one, but two characters between "he" and the "o"
Output: []
Metacharacters
• {}
>> import re
>> txt = "hello planet"
>> #Search for a sequence that starts with "he", followed by exactly 2 (any) characters, and an "o":
>> x = re.findall("he.{2}o", txt)
>> print(x)
Output: ['hello']
Metacharacters
• |
>> import re
>> txt = "The rain in Spain falls mainly in the plain!"
>> #Check if the string contains either "falls" or "stays":
>> x = re.findall("falls|stays", txt)
>> print(x)
>> if x:
>>     print("Yes, there is at least one match!")
>> else:
>>     print("No match")
Special Sequences
• A special sequence is a \ followed by one of the characters in the list below,
and has a special meaning:
Special Sequences
• \A
>> import re
>> txt = "The rain in Spain"
>> #Check if the string starts with "The":
>> x = re.findall(r"\AThe", txt)
>> print(x)
>> if x:
>>     print("Yes, there is a match!")
>> else:
>>     print("No match")
Output:
['The']
Yes, there is a match!
Special Sequences
• \b
>> import re
>> txt = "The rain in Spain"
>> #Check if "ain" is present at the end of a WORD:
>> x = re.findall(r"ain\b", txt)
>> print(x)
>> if x:
>>     print("Yes, there is at least one match!")
>> else:
>>     print("No match")
Special Sequences
• \b
>> import re
>> txt = "The rain in Spain"
>> #Check if "ain" is present at the beginning of a WORD:
>> x = re.findall(r"\bain", txt)
>> print(x)
>> if x:
>>     print("Yes, there is at least one match!")
>> else:
>>     print("No match")
Special Sequences
• \B
>> import re
>> txt = "The rain in Spain"
>> #Check if "ain" is present, but NOT at the beginning of a word:
>> x = re.findall(r"\Bain", txt)
>> print(x)
>> if x:
>>     print("Yes, there is at least one match!")
>> else:
>>     print("No match")
Special Sequences
• \d
>> import re
>> txt = "The rain in Spain"
>> #Check if the string contains any digits (numbers from 0-9):
>> x = re.findall(r"\d", txt)
>> print(x)
>> if x:
>>     print("Yes, there is at least one match!")
>> else:
>>     print("No match")
Special Sequences
• \D
>> import re
>> txt = "The rain in Spain"
>> #Return a match at every non-digit character:
>> x = re.findall(r"\D", txt)
>> print(x)
>> if x:
>>     print("Yes, there is at least one match!")
>> else:
>>     print("No match")
Special Sequences
• \s
>> import re
>> txt = "The rain in Spain"
>> #Return a match at every white-space character:
>> x = re.findall(r"\s", txt)
>> print(x)
>> if x:
>>     print("Yes, there is at least one match!")
>> else:
>>     print("No match")
Special Sequences
• \S
>> import re
>> txt = "The rain in Spain"
>> #Return a match at every NON white-space character:
>> x = re.findall(r"\S", txt)
>> print(x)
>> if x:
>>     print("Yes, there is at least one match!")
>> else:
>>     print("No match")
Special Sequences
• \w
>> import re
>> txt = "The rain in Spain"
>> #Return a match at every word character (characters from a to Z, digits from 0-9, and the underscore _ character):
>> x = re.findall(r"\w", txt)
>> print(x)
>> if x:
>>     print("Yes, there is at least one match!")
>> else:
>>     print("No match")
Special Sequences
• \W
>> import re
>> txt = "The rain in Spain"
>> #Return a match at every NON word character (characters NOT between a and Z, such as "!", "?", white-space, etc.):
>> x = re.findall(r"\W", txt)
>> print(x)
>> if x:
>>     print("Yes, there is at least one match!")
>> else:
>>     print("No match")
Special Sequences
• \Z
>> import re
>> txt = "The rain in Spain"
>> #Check if the string ends with "Spain":
>> x = re.findall(r"Spain\Z", txt)
>> print(x)
>> if x:
>>     print("Yes, there is a match!")
>> else:
>>     print("No match")
Set
• A set is a set of characters inside a pair of square brackets [] with a
special meaning:
Set
• [arn]
>> import re
>> txt = "The rain in Spain"
>> #Check if the string has any a, r, or n characters:
>> x = re.findall("[arn]", txt)
>> print(x)
>> if x:
>>     print("Yes, there is at least one match!")
>> else:
>>     print("No match")
Output:
['r', 'a', 'n', 'n', 'a', 'n']
Yes, there is at least one match!
Set
• [a-n]
>> import re
>> txt = "The rain in Spain"
>> #Check if the string has any characters between a and n:
>> x = re.findall("[a-n]", txt)
>> print(x)
>> if x:
>>     print("Yes, there is at least one match!")
>> else:
>>     print("No match")
Output:
['h', 'e', 'a', 'i', 'n', 'i', 'n', 'a', 'i', 'n']
Yes, there is at least one match!
Set
• [^arn]
>> import re
>> txt = "The rain in Spain"
>> #Check if the string has other characters than a, r, or n:
>> x = re.findall("[^arn]", txt)
>> print(x)
>> if x:
>>     print("Yes, there is at least one match!")
>> else:
>>     print("No match")
Output:
['T', 'h', 'e', ' ', 'i', ' ', 'i', ' ', 'S', 'p', 'i']
Yes, there is at least one match!
Set
• [0123]
>> import re
>> txt = "The rain in Spain"
>> #Check if the string has any 0, 1, 2, or 3 digits:
>> x = re.findall("[0123]", txt)
>> print(x)
>> if x:
>>     print("Yes, there is at least one match!")
>> else:
>>     print("No match")
Output:
[]
No match
Set
• [0-9]
>> import re
>> txt = "8 times before 11:45 AM"
>> #Check if the string has any digits:
>> x = re.findall("[0-9]", txt)
>> print(x)
>> if x:
>>     print("Yes, there is at least one match!")
>> else:
>>     print("No match")
Output:
['8', '1', '1', '4', '5']
Yes, there is at least one match!
Set
• [0-5][0-9]
>> import re
>> txt = "8 times before 11:45 AM"
>> #Check if the string has any two-digit numbers, from 00 to 59:
>> x = re.findall("[0-5][0-9]", txt)
>> print(x)
>> if x:
>>     print("Yes, there is at least one match!")
>> else:
>>     print("No match")
Output:
['11', '45']
Yes, there is at least one match!
Set
• [a-zA-Z]
>> import re
>> txt = "8 times before 11:45 AM"
>> #Check if the string has any characters from a to z lower case, and A to Z upper case:
>> x = re.findall("[a-zA-Z]", txt)
>> print(x)
>> if x:
>>     print("Yes, there is at least one match!")
>> else:
>>     print("No match")
Output:
['t', 'i', 'm', 'e', 's', 'b', 'e', 'f', 'o', 'r', 'e', 'A', 'M']
Yes, there is at least one match!
Set
• [+]
>> import re
>> txt = "8 times before 11:45 AM"
>> #Check if the string has any + characters (inside a set, + is a literal plus sign):
>> x = re.findall("[+]", txt)
>> print(x)
>> if x:
>>     print("Yes, there is at least one match!")
>> else:
>>     print("No match")
Output:
[]
No match
Python RegEx
• Python has a module named re to work with regular expressions.
To use it, we need to import the module.
• import re
• The module defines several functions and constants to work with
RegEx.
Python RegEx
• re.findall()
• The re.findall() method returns a list of strings containing all
matches.
Python RegEx
• Example 1: re.findall()
>> # Program to extract numbers from a string
>> import re
>> string = 'hello 12 hi 89. Howdy 34'
>> pattern = r'\d+'
>> result = re.findall(pattern, string)
>> print(result)
• # Output: ['12', '89', '34']
• If the pattern is not found, re.findall() returns an empty list.
Python RegEx
• Example 2: re.findall()
>> import re
Output:
['ai', 'ai']
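The code for this example is missing from the slide. A minimal sketch that yields the output shown, assuming the sample string "The rain in Spain" used elsewhere in the deck and the literal pattern "ai" (both are assumptions, not recovered from the original):

```python
import re

# Assumed input: the sample string used throughout the deck.
txt = "The rain in Spain"

# Assumed pattern: the literal text "ai", which occurs in "rain" and "Spain".
x = re.findall("ai", txt)
print(x)  # ['ai', 'ai']
```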
Python RegEx
• Example 3: re.findall()
>> import re
>> txt = "The rain in Spain"
>> #Check if "Portugal" is in the string:
>> x = re.findall("Portugal", txt)
>> print(x)
>> if x:
>>     print("Yes, there is at least one match!")
>> else:
>>     print("No match")
Output:
[]
No match
Python RegEx
• re.split()
• The re.split method splits the string where there is a match and
returns a list of strings where the splits have occurred.
Python RegEx
• Example 1: re.split()
>> import re
>> string = 'Twelve:12 Eighty nine:89.'
>> pattern = r'\d+'
>> result = re.split(pattern, string)
>> print(result)
• # Output: ['Twelve:', ' Eighty nine:', '.']
• If the pattern is not found, re.split() returns a list containing the
original string.
Python RegEx
• Example 2: re.split()
>> import re
>> #Split the string at every white-space character:
>> txt = "The rain in Spain"
>> x = re.split(r"\s", txt)
>> print(x)
Output:
['The', 'rain', 'in', 'Spain']
Python RegEx
• Example 3: re.split()
>> import re
>> #Split the string at the first white-space character:
>> txt = "The rain in Spain"
>> x = re.split(r"\s", txt, maxsplit=1)
>> print(x)
Output:
['The', 'rain in Spain']
Python RegEx
• You can pass the maxsplit argument to the re.split() method. It's the maximum
number of splits that will occur.
>> import re
>> string = 'Twelve:12 Eighty nine:89 Nine:9.'
>> pattern = r'\d+'
>> # maxsplit = 1
>> # split only at the first occurrence
>> result = re.split(pattern, string, maxsplit=1)
>> print(result)
• # Output: ['Twelve:', ' Eighty nine:89 Nine:9.']
• By the way, the default value of maxsplit is 0; meaning all possible splits.
Python RegEx
• re.sub()
• The syntax of re.sub() is:
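The syntax itself is missing from the slide. In the standard library the signature is re.sub(pattern, repl, string, count=0, flags=0), and the call returns a new string with the matches replaced. A minimal sketch, with a hypothetical date string as input:

```python
import re

text = "2021-10-05"  # hypothetical input

# Replace every match of the pattern (count defaults to 0, meaning "all").
all_replaced = re.sub(r"-", "/", text)

# Replace only the first match.
first_only = re.sub(r"-", "/", text, count=1)

print(all_replaced)  # 2021/10/05
print(first_only)    # 2021/10-05
```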
Python RegEx
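• Example 1: re.sub()
>> # Program to remove all whitespaces
>> import re
>> # multiline string
>> string = 'abc 12\
>> de 23 \n f45 6'
>> # matches all whitespace characters
>> pattern = r'\s+'
>> # empty string
>> replace = ''
>> new_string = re.sub(pattern, replace, string)
>> print(new_string)
# Output: abc12de23f456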
Python RegEx
• Example 2: re.sub()
>> import re
>> #Replace all white-space characters with the digit "9":
>> txt = "The rain in Spain"
>> x = re.sub(r"\s", "9", txt)
>> print(x)
Output:
The9rain9in9Spain
Python RegEx
• Example 3: re.sub()
>> import re
>> #Replace the first two occurrences of a white-space character with the digit 9:
>> txt = "The rain in Spain"
>> x = re.sub(r"\s", "9", txt, count=2)
>> print(x)
Output:
The9rain9in Spain
Python RegEx
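You can pass count as a fourth parameter to the re.sub() method. If omitted, it
defaults to 0, which replaces all occurrences.
>> import re
>> # multiline string
>> string = 'abc 12\
>> de 23 \n f45 6'
>> # matches all whitespace characters
>> pattern = r'\s+'
>> replace = ''
>> # count = 1: replace only the first match
>> new_string = re.sub(pattern, replace, string, count=1)
>> print(new_string)
# Output:
# abc12de 23
# f45 6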
Python RegEx
• re.subn()
• The re.subn() is similar to re.sub() except it returns a tuple of 2 items
containing the new string and the number of substitutions made.
• Example 4: re.subn()
>> # matches all whitespace characters
>> pattern = r'\s+'
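Most of this example is missing from the slide. A sketch of how re.subn() could be used with that pattern, assuming the same input string as the re.sub() whitespace example (written here on one line) and hypothetical variable names:

```python
import re

# Assumed input: same value as the multiline string in the re.sub() example.
string = 'abc 12de 23 \n f45 6'

# matches all whitespace characters
pattern = r'\s+'

# subn() returns a tuple: (new string, number of substitutions made).
new_string, substitutions = re.subn(pattern, '', string)

print(new_string)     # abc12de23f456
print(substitutions)  # 4
```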
Python RegEx
• re.search()
• The re.search() method takes two arguments: a pattern and a string. The
method looks for the first location where the RegEx pattern produces a
match with the string.
Python RegEx
• Example 1: re.search()
>> import re
>> string = "Python is fun"
>> # check if 'Python' is at the beginning
>> match = re.search(r'\APython', string)
>> if match:
>>     print("pattern found inside the string")
>> else:
>>     print("pattern not found")
• # Output: pattern found inside the string
Python RegEx
• Example 2: re.search()
>> import re
Output:
The first white-space character is located in position: 3
Python RegEx
• Example 3: re.search()
>> import re
• Output:
• None
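The code for Examples 2 and 3 is missing from the slides. A minimal sketch that reproduces both outputs, assuming the sample string "The rain in Spain" and the patterns shown (all of which are assumptions, not recovered from the original):

```python
import re

txt = "The rain in Spain"  # assumed input

# Example 2 (sketch): locate the first white-space character.
first_space = re.search(r"\s", txt)
print("The first white-space character is located in position:", first_space.start())

# Example 3 (sketch): search() returns None when nothing matches.
no_match = re.search("Portugal", txt)
print(no_match)  # None
```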
Python RegEx
• Match object
• You can list the methods and attributes of a match object using the dir() function.
• Some of the commonly used methods and attributes of match objects are:
• match.group()
• The group() method returns the part of the string where there is a match.
Python RegEx
• Example 2: Match object
>> import re
>> string = '39801 356, 2102 1111'
>> # Three digit number followed by space followed by two digit number
>> pattern = r'(\d{3}) (\d{2})'
>> # match variable contains a Match object.
>> match = re.search(pattern, string)
>> if match:
>>     print(match.group())
>> else:
>>     print("pattern not found")
• # Output: 801 35
Python RegEx
• Our pattern (\d{3}) (\d{2}) has two subgroups (\d{3}) and (\d{2}). You can
get the part of the string of these parenthesized subgroups. Here's how:
>>> match.group(1)
'801'
>>> match.group(2)
'35'
>>> match.group(1, 2)
('801', '35')
>>> match.groups()
('801', '35')
Python RegEx
• Using r prefix before RegEx
• When the r or R prefix is used before a regular expression, it marks a raw string.
For example, '\n' is a new line whereas r'\n' means two characters: a
backslash \ followed by n.
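The difference can be checked directly. A small sketch:

```python
# A normal string literal turns \n into a single newline character.
normal = '\n'
# A raw string literal keeps the backslash and the n as two characters.
raw = r'\n'

print(len(normal))  # 1
print(len(raw))     # 2

# A raw pattern is the same string as the doubled-backslash spelling.
same_pattern = (r'\d+' == '\\d+')
print(same_pattern)  # True
```

This is why regex patterns containing sequences such as \d or \s are usually written with the r prefix: the backslash reaches the re module unchanged.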
Python RegEx
• Example 7: Raw string using r prefix
>> import re
>> string = '\n and \r are escape sequences.'
>> result = re.findall(r'[\n\r]', string)
>> print(result)
Output: ['\n', '\r']
Example of Regex
Example 1 - Regex
>> import pandas as pd
>> import numpy as np
>> df = pd.read_csv('titanic_train.csv')
>> df['Title'] = df['Name'].str.extract(r'([A-Za-z]+\.)', expand=False)
>> df['Age'] = df['Age'].fillna(df.groupby('Title')['Age'].transform('mean'))
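To see what the title pattern captures without loading the CSV, here is a sketch using only the re module. The three names are illustrative sample values in the "Surname, Title. Given names" layout, which is an assumption about how the Name column is formatted.

```python
import re

# Illustrative sample values (assumed "Surname, Title. Given names" layout).
names = [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Cumings, Mrs. John Bradley",
]

# Same pattern as in the slide: a run of letters followed by a literal dot.
title_re = re.compile(r'([A-Za-z]+\.)')

titles = []
for name in names:
    m = title_re.search(name)  # first match only, like str.extract
    titles.append(m.group(1) if m else None)

print(titles)  # ['Mr.', 'Miss.', 'Mrs.']
```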
Example 2 - Regex
• Remove the unnecessary characters from columns
>> import pandas as pd
>> import numpy as np
>> dfW = pd.read_csv('D:\\Teaching Subject\\Data Science\\Fall 2021\\Lectures\\Structuring and Regex Example\\weather_data.csv')
>> dfW['temperature'] = dfW['temperature'].replace('[^0-9-]', '', regex=True)
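The effect of the pattern [^0-9-] can be sketched with plain re.sub on a few made-up temperature strings (hypothetical values, not taken from weather_data.csv). Every character that is not a digit or a minus sign is removed.

```python
import re

# Hypothetical raw values, not taken from the actual data file.
raw_temps = ["32 F", "-5C", "temp: 18"]

# Keep only digits and the minus sign.
cleaned = [re.sub(r'[^0-9-]', '', t) for t in raw_temps]

print(cleaned)  # ['32', '-5', '18']
```

The cleaned values are still strings; converting them to numbers is a separate step.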
Example 3 - Regex
• pakistan_intellectual_capital
• Practice structuring problem. See the Jupyter notebook or PDF.