Structuring With Regix

The document provides an introduction to regular expressions (regex) for data structuring and feature engineering, detailing metacharacters and their functions. It explains how to use regex for matching characters, including special sequences and repeating qualifiers. Various examples illustrate the application of regex in Python for string matching and searching.

Uploaded by

Atiya Falak


2/12/2022

Big Data Analytics

Structuring with Regex
Muhammad Affan Alim

Regex for Structuring (Data Wrangling)


What is Feature Engineering - Introduction

• The regular expression language is relatively small and
restricted, so not all possible string processing tasks can be
done using regular expressions

Match Characters
• Some characters are special metacharacters, and don’t match
themselves. Instead, they signal that some out-of-the-ordinary
thing should be matched, or they affect other portions of the RE
by repeating them or changing their meaning. Much of this
document is devoted to discussing various metacharacters and
what they do.


Here’s a complete list of the metacharacters

.^$*+?{}[]\|()

• The first metacharacters we’ll look at are [ and ]. They’re used for
specifying a character class, which is a set of characters that you
wish to match.
• For example, [abc] will match any of the characters a, b, or c; this
is the same as [a-c], which uses a range to express the same set of
characters. If you wanted to match only lowercase letters, your RE
would be [a-z].


Here’s a complete list of the metacharacters

• Metacharacters are not active inside classes. For example, [akm$]
will match any of the characters 'a', 'k', 'm', or '$'; '$' is usually a
metacharacter, but inside a character class it’s stripped of its
special nature.

Here’s a complete list of the metacharacters

• You can match the characters not listed within the class by
complementing the set.
• This is indicated by including a '^' as the first character of the
class.
• For example, [^5] will match any character except '5'. If the caret
appears elsewhere in a character class, it does not have special
meaning. For example: [5^] will match either a '5' or a '^'.
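
The character-class rules above can be checked directly. This is a minimal sketch; the sample strings are made up for illustration and are not from the slides.

```python
import re

text = "a5^k$m"  # illustrative sample string

# [abc] and [a-c] describe the same set of characters.
same_set = re.findall(r"[abc]", "cabbage") == re.findall(r"[a-c]", "cabbage")

# Inside a class, '$' is a literal character, not an end-of-string anchor.
literal_dollar = re.findall(r"[akm$]", text)

# A leading '^' complements the class: every character except '5'.
not_five = re.findall(r"[^5]", text)

# A '^' that is not the first character is just a literal caret.
five_or_caret = re.findall(r"[5^]", text)

print(same_set, literal_dollar, not_five, five_or_caret)
```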


Here’s a complete list of the metacharacters

• Perhaps the most important metacharacter is the backslash, \.
• As in Python string literals, the backslash can be followed by
various characters to signal various special sequences.
• It’s also used to escape the metacharacters so you can still match them
in patterns; for example, if you need to match a [ or \, you can precede
them with a backslash to remove their special meaning: \[ or \\.
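
As a small illustration of escaping, the sketch below matches a literal bracket and a literal backslash. The path string is a made-up example; re.escape() is a standard-library helper that adds the backslashes for you.

```python
import re

# Hypothetical file name containing a literal backslash and brackets.
path = r"C:\data[1].csv"

# \[ matches a literal '[' instead of opening a character class.
brackets = re.findall(r"\[", path)

# \\ matches a single literal backslash.
backslashes = re.findall(r"\\", path)

# re.escape() escapes metacharacters in arbitrary text.
escaped = re.escape("[1]")
found = re.search(escaped, path) is not None

print(brackets, backslashes, escaped, found)
```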

Here’s a complete list of the metacharacters

• Some of the special sequences beginning with '\' represent
predefined sets of characters that are often useful.
• For example, \w matches any alphanumeric character. If
the regex pattern is expressed in bytes, this is equivalent to the
class [a-zA-Z0-9_]. For str patterns, \w also matches Unicode word
characters unless the re.ASCII flag is given.


Here’s a complete list of the metacharacters

• \d
Matches any decimal digit; this is equivalent to the class [0-9].

• \D
Matches any non-digit character; this is equivalent to the class
[^0-9].

• \s
Matches any whitespace character; this is equivalent to the class
[ \t\n\r\f\v].

Here’s a complete list of the metacharacters

• \S
Matches any non-whitespace character; this is equivalent to the class
[^ \t\n\r\f\v].

• \w
Matches any alphanumeric character; this is equivalent to the class
[a-zA-Z0-9_].

• \W
Matches any non-alphanumeric character; this is equivalent to the class
[^a-zA-Z0-9_].
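
The predefined classes can be tried on one string to see what each picks out. This is a sketch with an invented sample string; the last two lines show that, for str patterns, \w is wider than [a-zA-Z0-9_] unless re.ASCII is passed.

```python
import re

sample = "Room 42,\tfloor_3!"  # illustrative sample string

digits = re.findall(r"\d", sample)      # decimal digits
spaces = re.findall(r"\s", sample)      # whitespace characters
non_word = re.findall(r"\W", sample)    # anything that is not a word character
words = re.findall(r"\w+", sample)      # runs of word characters

# With str patterns, \w also matches non-ASCII letters by default.
unicode_word = re.findall(r"\w+", "caf\u00e9")
ascii_word = re.findall(r"\w+", "caf\u00e9", flags=re.ASCII)

print(digits, spaces, non_word, words, unicode_word, ascii_word)
```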


Here’s a complete list of the metacharacters

• These sequences can be included inside a character class. For
example, [\s,.] is a character class that will match any whitespace
character, or ',' or '.'.

• The final metacharacter in this section is '.'. It matches anything
except a newline character, and there’s an alternate mode
(re.DOTALL) where it will match even a newline. '.' is often used
where you want to match “any character”.
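
A short sketch of the two points on this slide: a special sequence used inside a character class, and the effect of re.DOTALL on the dot. The input strings are invented for the example.

```python
import re

# [\s,.] matches any whitespace character, a comma, or a period.
tokens = re.split(r"[\s,.]+", "one, two.three four")

text = "a\nb"

# By default '.' does not match a newline, so nothing is found.
default_dot = re.findall(r"a.b", text)

# With re.DOTALL, '.' matches the newline as well.
dotall_dot = re.findall(r"a.b", text, flags=re.DOTALL)

print(tokens, default_dot, dotall_dot)
```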

Here’s a complete list of the metacharacters

• Repeating Things
• Kleene * and Kleene +: * matches the preceding item zero or more
times, and + matches it one or more times.
• There are two more repeating qualifiers. The question mark
character, ?, matches either once or zero times; you can think of it
as marking something as being optional. For example, home-?brew
matches either 'homebrew' or 'home-brew'.
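
The qualifiers *, + and ? can be compared side by side. The sketch below uses re.fullmatch() so that the whole string has to fit the pattern; the "ca*t" and "ca+t" patterns are illustrative choices, not taken from the slides.

```python
import re

# '?' makes the hyphen optional: zero or one occurrence.
optional = [bool(re.fullmatch(r"home-?brew", s))
            for s in ["homebrew", "home-brew", "home--brew"]]

# '*' allows zero or more 'a' characters.
star = [bool(re.fullmatch(r"ca*t", s)) for s in ["ct", "cat", "caaat"]]

# '+' requires at least one 'a' character.
plus = [bool(re.fullmatch(r"ca+t", s)) for s in ["ct", "cat", "caaat"]]

print(optional, star, plus)
```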


Here’s a complete list of the metacharacters

• The most complicated repeated qualifier is {m,n}, where m and n
are decimal integers (no space after the comma; with a space, Python's
re module treats the braces as literal text).
• This qualifier means there must be at least m repetitions, and at
most n.
• For example, a/{1,3}b will match 'a/b', 'a//b', and 'a///b'.
• It won’t match 'ab', which has no slashes, or 'a////b', which has
four.
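
The a/{1,3}b example can be verified with a few candidate strings. This is a minimal sketch using re.fullmatch() so that each string is tested as a whole.

```python
import re

pattern = re.compile(r"a/{1,3}b")
candidates = ["ab", "a/b", "a//b", "a///b", "a////b"]

# True where the string has between one and three slashes between 'a' and 'b'.
results = {s: bool(pattern.fullmatch(s)) for s in candidates}

print(results)
```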

Here’s a complete list of the metacharacters

• You can omit either m or n; in that case, a reasonable value is
assumed for the missing value.
• Omitting m is interpreted as a lower limit of 0, while omitting n
results in an upper bound of infinity.


Here’s a complete list of the metacharacters

• Readers of a reductionist bent may notice that the three other
qualifiers can all be expressed using this notation. {0,} is the same
as *, {1,} is equivalent to +, and {0,1} is the same as ?.
• It’s better to use *, +, or ? when you can, simply because they’re
shorter and easier to read.
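
The equivalences can be spot-checked on a sample string. This is only a sketch on one invented input, not a proof; it also shows what happens when m or n is omitted.

```python
import re

sample = "b ab aab aaab"  # illustrative sample string

# Each short qualifier paired with its {m,n} spelling.
pairs = [(r"a*b", r"a{0,}b"), (r"a+b", r"a{1,}b"), (r"a?b", r"a{0,1}b")]
equivalent = [re.findall(short, sample) == re.findall(long, sample)
              for short, long in pairs]

# Omitting m gives a lower bound of 0; omitting n removes the upper bound.
at_most_two = [bool(re.fullmatch(r"a{,2}", s)) for s in ["", "a", "aa", "aaa"]]
at_least_two = [bool(re.fullmatch(r"a{2,}", s)) for s in ["a", "aa", "aaaaa"]]

print(equivalent, at_most_two, at_least_two)
```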

Here’s a complete list of the metacharacters

.
(Dot.) In the default mode, this matches any character except a newline.
If the DOTALL flag has been specified, this matches any character
including a newline.

^
(Caret.) Matches the start of the string, and in MULTILINE mode also
matches immediately after each newline.


Here’s a complete list of the metacharacters

$
Matches the end of the string or just before the newline at the end
of the string, and in MULTILINE mode also matches before a newline.
foo matches both ‘foo’ and ‘foobar’, while the regular expression
foo$ matches only ‘foo’. More interestingly, searching for foo.$ in
'foo1\nfoo2\n' matches ‘foo2’ normally, but ‘foo1’ in MULTILINE
mode; searching for a single $ in 'foo\n' will find two (empty)
matches: one just before the newline, and one at the end of the
string.
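
The anchor behaviour described above can be reproduced as follows. The first three checks use the strings from the slide; the "one\ntwo" input for ^ is an added illustration.

```python
import re

text = "foo1\nfoo2\n"

# Default mode: $ matches only at the end (or just before a final newline).
default_hit = re.search(r"foo.$", text).group()

# MULTILINE mode: $ also matches before every newline.
multiline_hit = re.search(r"foo.$", text, flags=re.MULTILINE).group()

# A bare $ finds two empty matches in 'foo\n'.
empty_matches = re.findall(r"$", "foo\n")

# ^ matches only at the string start unless MULTILINE is set.
string_start = re.findall(r"^\w+", "one\ntwo")
line_starts = re.findall(r"^\w+", "one\ntwo", flags=re.MULTILINE)

print(default_hit, multiline_hit, empty_matches, string_start, line_starts)
```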

Module Contents
• The module defines several functions, constants, and an
exception. Some of the functions are simplified versions of the full
featured methods for compiled regular expressions.
• Most non-trivial applications always use the compiled form.


Module Contents
• re.compile(pattern, flags=0)
• Compile a regular expression pattern into a regular expression
object, which can be used for matching using its match(), search()
and other methods, described below.

Module Contents
• The sequence
>>> prog = re.compile(pattern)
>>> result = prog.match(string)
• is equivalent to
>>> result = re.match(pattern, string)
• but using re.compile() and saving the resulting regular expression
object for reuse is more efficient when the expression will be used
several times in a single program.
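
A minimal sketch of that equivalence: the compiled pattern and the module-level function give the same answers. The pattern and the test strings are illustrative assumptions.

```python
import re

pattern = r"[a-z]+"
prog = re.compile(pattern)  # compile once, reuse many times

strings = ["tempo", "Tempo", "", "abc123"]

# match() only succeeds if the pattern matches at the start of the string.
compiled = [bool(prog.match(s)) for s in strings]
direct = [bool(re.match(pattern, s)) for s in strings]

print(compiled, direct)
```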

Module Contents
• Performing Matches
Once you have an object representing a compiled regular expression,
what do you do with it? Pattern objects have several methods and
attributes. Only the most significant ones will be covered here;
consult the re docs for a complete listing.


Module Contents
>>> import re
>>> p = re.compile('[a-z]+')
>>> p
re.compile('[a-z]+')

Module Contents
• Now, you can try matching various strings against the RE [a-z]+.
An empty string shouldn’t match at all, since + means ‘one or
more repetitions’. match() should return None in this case.
>>> p.match("")
>>> print(p.match(""))
None


Regex Function
• The re module offers a set of functions that allows us to search a
string for a match:


Metacharacters
• Metacharacters are characters with a special meaning:

Metacharacters
• []
>>> import re
>>> txt = "The rain in Spain"
>>> # Find all lowercase characters alphabetically between "a" and "m":
>>> x = re.findall("[a-m]", txt)
>>> print(x)

Metacharacters
• \
>>> import re
>>> txt = "That will be 59 dollars"
>>> # Find all digit characters (raw string keeps the backslash intact):
>>> x = re.findall(r"\d", txt)
>>> print(x)

Metacharacters

• .
>>> import re
>>> txt = "hello planet"
>>> # Search for a sequence that starts with "he", followed by two
>>> # (any) characters, and an "o":
>>> x = re.findall("he..o", txt)
>>> print(x)
Output: ['hello']

Metacharacters

• ^
>>> import re
>>> txt = "hello planet"
>>> # Check if the string starts with 'hello':
>>> x = re.findall("^hello", txt)
>>> if x:
...     print("Yes, the string starts with 'hello'")
... else:
...     print("No match")

Metacharacters
• $
>>> import re
>>> txt = "hello planet"
>>> # Check if the string ends with 'planet':
>>> x = re.findall("planet$", txt)
>>> if x:
...     print("Yes, the string ends with 'planet'")
... else:
...     print("No match")

Metacharacters
• *
>>> import re
>>> txt = "hello planet"
>>> # Search for a sequence that starts with "he", followed by 0 or more (any)
>>> # characters, and an "o":
>>> x = re.findall("he.*o", txt)
>>> print(x)

Metacharacters
• +
>>> import re
>>> txt = "hello planet"
>>> # Search for a sequence that starts with "he", followed by 1 or more (any)
>>> # characters, and an "o":
>>> x = re.findall("he.+o", txt)
>>> print(x)
Output: ['hello']

Metacharacters
• ?
>>> import re
>>> txt = "hello planet"
>>> # Search for a sequence that starts with "he", followed by 0 or 1 (any)
>>> # character, and an "o":
>>> x = re.findall("he.?o", txt)
>>> print(x)
>>> # This time we get no match, because there are not zero, not one, but two
>>> # characters between "he" and the "o".
Output: []

Metacharacters
• {}
>>> import re
>>> txt = "hello planet"
>>> # Search for a sequence that starts with "he", followed by exactly 2
>>> # (any) characters, and an "o":
>>> x = re.findall("he.{2}o", txt)
>>> print(x)
Output: ['hello']

Metacharacters
• |
>>> import re
>>> txt = "The rain in Spain falls mainly in the plain!"
>>> # Check if the string contains either "falls" or "stays":
>>> x = re.findall("falls|stays", txt)
>>> print(x)
>>> if x:
...     print("Yes, there is at least one match!")
... else:
...     print("No match")

Special Sequences
• A special sequence is a \ followed by one of the characters in the list below,
and has a special meaning:

Special Sequences
• \A

>>> import re
>>> txt = "The rain in Spain"
>>> # Check if the string starts with "The":
>>> x = re.findall(r"\AThe", txt)
>>> print(x)
>>> if x:
...     print("Yes, there is a match!")
... else:
...     print("No match")
Output:
['The']
Yes, there is a match!

Special Sequences
• \b
>>> import re
>>> txt = "The rain in Spain"
>>> # Check if "ain" is present at the end of a WORD:
>>> x = re.findall(r"ain\b", txt)
>>> print(x)
>>> if x:
...     print("Yes, there is at least one match!")
... else:
...     print("No match")

Special Sequences
• \b
>>> import re
>>> txt = "The rain in Spain"
>>> # Check if "ain" is present at the beginning of a WORD:
>>> x = re.findall(r"\bain", txt)
>>> print(x)
>>> if x:
...     print("Yes, there is at least one match!")
... else:
...     print("No match")

Special Sequences
• \B
>>> import re
>>> txt = "The rain in Spain"
>>> # Check if "ain" is present, but NOT at the beginning of a word:
>>> x = re.findall(r"\Bain", txt)
>>> print(x)
>>> if x:
...     print("Yes, there is at least one match!")
... else:
...     print("No match")

Special Sequences
• \d
>>> import re
>>> txt = "The rain in Spain"
>>> # Check if the string contains any digits (numbers from 0-9):
>>> x = re.findall(r"\d", txt)
>>> print(x)
>>> if x:
...     print("Yes, there is at least one match!")
... else:
...     print("No match")

Special Sequences
• \D
>>> import re
>>> txt = "The rain in Spain"
>>> # Return a match at every non-digit character:
>>> x = re.findall(r"\D", txt)
>>> print(x)
>>> if x:
...     print("Yes, there is at least one match!")
... else:
...     print("No match")

Special Sequences
• \s
>>> import re
>>> txt = "The rain in Spain"
>>> # Return a match at every white-space character:
>>> x = re.findall(r"\s", txt)
>>> print(x)
>>> if x:
...     print("Yes, there is at least one match!")
... else:
...     print("No match")

Special Sequences
• \S
>>> import re
>>> txt = "The rain in Spain"
>>> # Return a match at every NON white-space character:
>>> x = re.findall(r"\S", txt)
>>> print(x)
>>> if x:
...     print("Yes, there is at least one match!")
... else:
...     print("No match")

Special Sequences
• \w
>>> import re
>>> txt = "The rain in Spain"
>>> # Return a match at every word character (characters from a to Z,
>>> # digits from 0-9, and the underscore _ character):
>>> x = re.findall(r"\w", txt)
>>> print(x)
>>> if x:
...     print("Yes, there is at least one match!")
... else:
...     print("No match")

Special Sequences
• \W
>>> import re
>>> txt = "The rain in Spain"
>>> # Return a match at every NON word character (characters NOT between
>>> # a and Z, such as "!", "?", white-space, etc.):
>>> x = re.findall(r"\W", txt)
>>> print(x)
>>> if x:
...     print("Yes, there is at least one match!")
... else:
...     print("No match")

Special Sequences
• \Z
>>> import re
>>> txt = "The rain in Spain"
>>> # Check if the string ends with "Spain":
>>> x = re.findall(r"Spain\Z", txt)
>>> print(x)
>>> if x:
...     print("Yes, there is a match!")
... else:
...     print("No match")

Set
• A set is a set of characters inside a pair of square brackets [] with a
special meaning:

Set
• [arn]
>>> import re
>>> txt = "The rain in Spain"
>>> # Check if the string has any a, r, or n characters:
>>> x = re.findall("[arn]", txt)
>>> print(x)
>>> if x:
...     print("Yes, there is at least one match!")
... else:
...     print("No match")
Output:
['r', 'a', 'n', 'n', 'a', 'n']
Yes, there is at least one match!

Set
• [a-n]
>>> import re
>>> txt = "The rain in Spain"
>>> # Check if the string has any characters between a and n:
>>> x = re.findall("[a-n]", txt)
>>> print(x)
>>> if x:
...     print("Yes, there is at least one match!")
... else:
...     print("No match")
Output:
['h', 'e', 'a', 'i', 'n', 'i', 'n', 'a', 'i', 'n']
Yes, there is at least one match!

Set
• [^arn]
>>> import re
>>> txt = "The rain in Spain"
>>> # Check if the string has other characters than a, r, or n:
>>> x = re.findall("[^arn]", txt)
>>> print(x)
>>> if x:
...     print("Yes, there is at least one match!")
... else:
...     print("No match")
Output:
['T', 'h', 'e', ' ', 'i', ' ', 'i', ' ', 'S', 'p', 'i']
Yes, there is at least one match!

Set
• [0123]
>>> import re
>>> txt = "The rain in Spain"
>>> # Check if the string has any 0, 1, 2, or 3 digits:
>>> x = re.findall("[0123]", txt)
>>> print(x)
>>> if x:
...     print("Yes, there is at least one match!")
... else:
...     print("No match")
Output:
[]
No match

Set
• [0-9]
>>> import re
>>> txt = "8 times before 11:45 AM"
>>> # Check if the string has any digits:
>>> x = re.findall("[0-9]", txt)
>>> print(x)
>>> if x:
...     print("Yes, there is at least one match!")
... else:
...     print("No match")
Output:
['8', '1', '1', '4', '5']
Yes, there is at least one match!

Set
• [0-5][0-9]
>>> import re
>>> txt = "8 times before 11:45 AM"
>>> # Check if the string has any two-digit numbers, from 00 to 59:
>>> x = re.findall("[0-5][0-9]", txt)
>>> print(x)
>>> if x:
...     print("Yes, there is at least one match!")
... else:
...     print("No match")
Output:
['11', '45']
Yes, there is at least one match!

Set
• [a-zA-Z]
>>> import re
>>> txt = "8 times before 11:45 AM"
>>> # Check if the string has any characters from a to z lower case,
>>> # and A to Z upper case:
>>> x = re.findall("[a-zA-Z]", txt)
>>> print(x)
>>> if x:
...     print("Yes, there is at least one match!")
... else:
...     print("No match")
Output:
['t', 'i', 'm', 'e', 's', 'b', 'e', 'f', 'o', 'r', 'e', 'A', 'M']
Yes, there is at least one match!

Set
• [+]
>>> import re
>>> txt = "8 times before 11:45 AM"
>>> # Check if the string has any + characters
>>> # (inside a set, + is a literal plus sign):
>>> x = re.findall("[+]", txt)
>>> print(x)
>>> if x:
...     print("Yes, there is at least one match!")
... else:
...     print("No match")
Output:
[]
No match

Python RegEx
• Python has a module named re to work with regular expressions.
To use it, we need to import the module.
• import re
• The module defines several functions and constants to work with
RegEx.


Python RegEx
• re.findall()
• The re.findall() method returns a list of strings containing all
matches.

Python RegEx
• Example 1: re.findall()
>>> # Program to extract numbers from a string
>>> import re
>>> string = 'hello 12 hi 89. Howdy 34'
>>> pattern = r'\d+'
>>> result = re.findall(pattern, string)
>>> print(result)
Output: ['12', '89', '34']
• If the pattern is not found, re.findall() returns an empty list.

Python RegEx
• Example 2: re.findall()
>>> import re
>>> # Return a list containing every occurrence of "ai":
>>> txt = "The rain in Spain"
>>> x = re.findall("ai", txt)
>>> print(x)

Output:
['ai', 'ai']

Python RegEx
• Example 3: re.findall()
>>> import re
>>> txt = "The rain in Spain"
>>> # Check if "Portugal" is in the string:
>>> x = re.findall("Portugal", txt)
>>> print(x)
>>> if x:
...     print("Yes, there is at least one match!")
... else:
...     print("No match")
Output:
[]
No match

Python RegEx
• re.split()
• The re.split method splits the string where there is a match and
returns a list of strings where the splits have occurred.

Python RegEx
• Example 1: re.split()
>>> import re
>>> string = 'Twelve:12 Eighty nine:89.'
>>> pattern = r'\d+'
>>> result = re.split(pattern, string)
>>> print(result)
Output: ['Twelve:', ' Eighty nine:', '.']
• If the pattern is not found, re.split() returns a list containing the
original string.

Python RegEx
• Example 2: re.split()
>>> import re
>>> # Split the string at every white-space character:
>>> txt = "The rain in Spain"
>>> x = re.split(r"\s", txt)
>>> print(x)

Output:
['The', 'rain', 'in', 'Spain']

Python RegEx
• Example 3: re.split()
>>> import re
>>> # Split the string at the first white-space character:
>>> txt = "The rain in Spain"
>>> x = re.split(r"\s", txt, maxsplit=1)
>>> print(x)

Output:
['The', 'rain in Spain']

Python RegEx
• You can pass the maxsplit argument to re.split(). It is the maximum
number of splits that will occur.
>>> import re
>>> string = 'Twelve:12 Eighty nine:89 Nine:9.'
>>> pattern = r'\d+'
>>> # maxsplit = 1
>>> # split only at the first occurrence
>>> result = re.split(pattern, string, maxsplit=1)
>>> print(result)
Output: ['Twelve:', ' Eighty nine:89 Nine:9.']
• The default value of maxsplit is 0, meaning all possible splits.

Python RegEx
• re.sub()
• The syntax of re.sub() is:

• re.sub(pattern, replace, string)

• The method returns a string where matched occurrences are replaced with
the content of replace variable.


Python RegEx
• Example 1: re.sub()
>>> # Program to remove all whitespace
>>> import re
>>> # multiline string
>>> string = 'abc 12\
... de 23 \n f45 6'
>>> # matches all whitespace characters
>>> pattern = r'\s+'
>>> # empty string
>>> replace = ''
>>> new_string = re.sub(pattern, replace, string)
>>> print(new_string)
Output: abc12de23f456

If the pattern is not found, re.sub() returns the original string.

Python RegEx
• Example 2: re.sub()
>>> import re
>>> # Replace all white-space characters with the digit "9":
>>> txt = "The rain in Spain"
>>> x = re.sub(r"\s", "9", txt)
>>> print(x)

Output:
The9rain9in9Spain

Python RegEx
• Example 3: re.sub()
>>> import re
>>> # Replace the first two occurrences of a white-space character with the digit 9:
>>> txt = "The rain in Spain"
>>> x = re.sub(r"\s", "9", txt, count=2)
>>> print(x)

Output:
The9rain9in Spain

Python RegEx
You can pass count as a fourth parameter to re.sub(). If omitted, it
defaults to 0, which replaces all occurrences.

>>> import re
>>> # multiline string
>>> string = 'abc 12\
... de 23 \n f45 6'
>>> # matches all whitespace characters
>>> pattern = r'\s+'
>>> replace = ''
>>> # replace only the first run of whitespace
>>> new_string = re.sub(pattern, replace, string, count=1)
>>> print(new_string)

Output:
abc12de 23
 f45 6

Python RegEx
• re.subn()
• re.subn() is similar to re.sub(), except that it returns a tuple of two
items containing the new string and the number of substitutions made.

• Example 4: re.subn()

>>> # Program to remove all whitespace
>>> import re
>>> # multiline string
>>> string = 'abc 12\
... de 23 \n f45 6'
>>> # matches all whitespace characters
>>> pattern = r'\s+'
>>> # empty string
>>> replace = ''
>>> new_string = re.subn(pattern, replace, string)
>>> print(new_string)
Output: ('abc12de23f456', 4)

Python RegEx
• re.search()
• re.search() takes two arguments: a pattern and a string. It looks for
the first location where the RegEx pattern produces a match in the string.

• If the search is successful, re.search() returns a match object; if not,
it returns None.

>>> match = re.search(pattern, string)

Python RegEx
• Example 1: re.search()
>>> import re
>>> string = "Python is fun"
>>> # check if 'Python' is at the beginning
>>> match = re.search(r'\APython', string)
>>> if match:
...     print("pattern found inside the string")
... else:
...     print("pattern not found")
Output: pattern found inside the string

Python RegEx
• Example 2: re.search()
>>> import re
>>> txt = "The rain in Spain"
>>> x = re.search(r"\s", txt)
>>> print("The first white-space character is located in position:", x.start())

Output:
The first white-space character is located in position: 3

Python RegEx
• Example 3: re.search()
>>> import re
>>> txt = "The rain in Spain"
>>> x = re.search("Portugal", txt)
>>> print(x)

Output:
None

Python RegEx
• Match object

• You can get methods and attributes of a match object using dir() function.
• Some of the commonly used methods and attributes of match objects are:
• match.group()
• The group() method returns the part of the string where there is a match.


Python RegEx
• Example 2: Match object
>>> import re
>>> string = '39801 356, 2102 1111'
>>> # Three-digit number followed by a space followed by a two-digit number
>>> pattern = r'(\d{3}) (\d{2})'
>>> # match variable contains a Match object.
>>> match = re.search(pattern, string)
>>> if match:
...     print(match.group())
... else:
...     print("pattern not found")
Output: 801 35

• Here, the match variable contains a match object.

Python RegEx
• Our pattern (\d{3}) (\d{2}) has two subgroups (\d{3}) and (\d{2}). You can
get the part of the string of these parenthesized subgroups. Here's how:
>>> match.group(1)
'801'
>>> match.group(2)
'35'
>>> match.group(1, 2)
('801', '35')
>>> match.groups()
('801', '35')

Python RegEx
• Using the r prefix before RegEx
• When the r or R prefix is used before a string literal, it denotes a raw
string. For example, '\n' is a newline, whereas r'\n' is two characters: a
backslash \ followed by n.

• The backslash \ is used to escape various characters, including all
metacharacters. With the r prefix, Python itself leaves \ as an ordinary
character, so the backslash reaches the regex engine unchanged.
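
The difference is easy to see by measuring string lengths. This is a small sketch; the sample inputs are invented for illustration.

```python
import re

# A normal literal turns \n into one newline character.
normal_len = len('\n')

# A raw literal keeps the backslash and the 'n' as two characters.
raw_len = len(r'\n')

# Without the r prefix the backslash must be doubled to reach the regex engine.
doubled = re.findall('\\d+', 'a1b22')
raw = re.findall(r'\d+', 'a1b22')

# The regex engine itself understands \n, so a raw pattern still matches a newline.
newline_hits = re.findall(r'\n', 'a\nb')

print(normal_len, raw_len, doubled, raw, newline_hits)
```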


Python RegEx
• Example 7: Raw string using r prefix
>>> import re
>>> string = '\n and \r are escape sequences.'
>>> result = re.findall(r'[\n\r]', string)
>>> print(result)

Output: ['\n', '\r']

Example of Regex


Example - Regex
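>>> import pandas as pd
>>> import numpy as np
>>> df = pd.read_csv('titanic_train.csv')
>>> # Extract the title (for example "Mr.") from each name
>>> df['Title'] = df['Name'].str.extract(r'([A-Za-z]+\.)', expand=False)
>>> # Fill missing ages with the mean age of passengers sharing the same title
>>> df['Age'] = df['Age'].fillna(df.groupby('Title')['Age'].transform('mean'))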
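
To see what the title pattern does without needing the CSV file, the same idea can be sketched with the re module alone. The passenger rows below are invented for illustration (they are not from titanic_train.csv), and the plain-Python grouping is only a stand-in for the pandas groupby/transform step.

```python
import re
from statistics import mean

# Hypothetical rows in the "Surname, Title. Given names" layout.
passengers = [
    {"Name": "Doe, Mr. John", "Age": 22.0},
    {"Name": "Roe, Mrs. Jane", "Age": 38.0},
    {"Name": "Poe, Miss. Ann", "Age": 26.0},
    {"Name": "Moe, Mr. Sam", "Age": None},   # missing age to be filled
    {"Name": "Loe, Mr. Tom", "Age": 35.0},
]

# Same pattern as above: a run of letters followed by a literal period.
title_re = re.compile(r"([A-Za-z]+\.)")
for p in passengers:
    m = title_re.search(p["Name"])
    p["Title"] = m.group(1) if m else None

# Collect the known ages per title and compute each title's mean age.
known = {}
for p in passengers:
    if p["Age"] is not None:
        known.setdefault(p["Title"], []).append(p["Age"])
mean_age = {title: mean(ages) for title, ages in known.items()}

# Fill each missing age with the mean age of the same title.
for p in passengers:
    if p["Age"] is None:
        p["Age"] = mean_age.get(p["Title"])

titles = [p["Title"] for p in passengers]
filled_age = passengers[3]["Age"]
print(titles, filled_age)
```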


Example 2 - Regex
• Remove the unnecessary characters from columns
>>> import pandas as pd
>>> import numpy as np
>>> dfW = pd.read_csv('D:\\Teaching Subject\\Data Science\\Fall 2021\\Lectures\\Structuring and Regex Example\\weather_data.csv')
>>> # Keep only digits and the minus sign in the temperature column
>>> dfW['temperature'] = dfW['temperature'].replace('[^0-9-]', '', regex=True)
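
The effect of the [^0-9-] pattern can be previewed on a few strings with re.sub(). The raw values below are made up for illustration and are not taken from weather_data.csv. Note that the pattern also strips decimal points, which may or may not be wanted.

```python
import re

# Hypothetical raw temperature strings.
raw_values = ["23°C", "-5 C", "temp: 18", "12.5"]

# Remove every character that is not a digit or a minus sign.
cleaned = [re.sub(r"[^0-9-]", "", value) for value in raw_values]

print(cleaned)
```

If decimals matter, one option (an assumption, not something the slides show) is to add the period to the class, as in [^0-9.-].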

Example 3 - Regex
• pakistan_intellectual_capital
• Practice structuring problem; see the Jupyter notebook or PDF.

6 pages
Result Chem GT (CH # 2, 5) MDCAT
No ratings yet
Result Chem GT (CH # 2, 5) MDCAT
1 page
HasnainAnsari (0 10)
No ratings yet
HasnainAnsari (0 10)
1 page
Gabriel Baker Resume
No ratings yet
Gabriel Baker Resume
2 pages
Haste Makes Waste Hurry Makes Curry
No ratings yet
Haste Makes Waste Hurry Makes Curry
1 page
Bubble Sort
No ratings yet
Bubble Sort
3 pages
Ch7-Project-Sachev Satheesh Kumar
No ratings yet
Ch7-Project-Sachev Satheesh Kumar
3 pages
Assignment
No ratings yet
Assignment
2 pages
Txt_2005 NED Entry Test Physics - Full MCQs Solution (Recreated) - ECAT & MDCAT Preparation
No ratings yet
Txt_2005 NED Entry Test Physics - Full MCQs Solution (Recreated) - ECAT & MDCAT Preparation
36 pages
Learn C++
From Everand
Learn C++
Durgesh
4.5/5 (9)
Introduction to PHP, Part 2, Second Edition
From Everand
Introduction to PHP, Part 2, Second Edition
Adam Majczak
No ratings yet
Ian Talks Regex A-Z
From Everand
Ian Talks Regex A-Z
Ian Eress
No ratings yet
Python Regular Expressions Explained: A Practical Guide with Examples
From Everand
Python Regular Expressions Explained: A Practical Guide with Examples
William E. Clark
No ratings yet
Java: Best Practices to Programming Code with Java: Java Computer Programming, #3
From Everand
Java: Best Practices to Programming Code with Java: Java Computer Programming, #3
Charlie Masterson
No ratings yet

Here’s a complete list of the metacharacters

• The final metacharacter in this section is ‘.’. It matches anything except a newline character; in the alternate mode re.DOTALL it matches a newline as well.
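As a minimal sketch of how the dot behaves, the snippet below runs the same pattern with and without the re.DOTALL flag. The sample string and variable names are assumptions chosen for illustration; by default `.` skips the newline, and with re.DOTALL it matches it too.

```python
import re

# Assumed sample input: two letters separated by a newline.
text = "a\nb"

# By default '.' matches any character except a newline.
default_matches = re.findall(r".", text)

# With re.DOTALL, '.' matches the newline as well.
dotall_matches = re.findall(r".", text, flags=re.DOTALL)

print(default_matches)  # ['a', 'b']
print(dotall_matches)   # ['a', '\n', 'b']
```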

>> #Return a list containing every occurrence of "ai":

>> txt = "The rain in Spain"
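The two lines above look like the start of a re.findall() example. A possible completion, assuming re.findall() is the intended call (the name `matches` is hypothetical), is:

```python
import re

# Return a list containing every occurrence of "ai"
txt = "The rain in Spain"
matches = re.findall("ai", txt)

print(matches)  # ['ai', 'ai']
```

re.findall() returns the matches in the order they are found, and an empty list if there are none.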

• re.sub(pattern, replace, string)

>> # matches all whitespace characters

>> new_string = re.sub(pattern, replace, string)

>> # Output: abc12de23f456

If the pattern is not found, re.sub() returns the original string.
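A short sketch of re.sub() covering both cases: the pattern is found and replaced, and the pattern is absent so the string comes back unchanged. The input strings here are assumed samples, picked so that the first result agrees with the output quoted above.

```python
import re

# Assumed sample input containing spaces and a newline.
string = "abc 12de 23 \n f45 6"

pattern = r"\s+"  # matches one or more whitespace characters
replace = ""      # empty string: delete every match

new_string = re.sub(pattern, replace, string)
print(new_string)  # abc12de23f456

# When the pattern does not occur, the original string is returned.
unchanged = re.sub(r"\d+", "#", "no digits here")
print(unchanged)  # no digits here
```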

>> # multiline string

>> new_string = re.sub(r'\s+', replace, string, count=1)
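The fourth argument of re.sub() limits how many matches are replaced. The sketch below, on an assumed sample string, passes it by keyword as `count=1`, so only the first run of whitespace is removed.

```python
import re

# Assumed sample input containing spaces and a newline.
string = "abc 12de 23 \n f45 6"
replace = ""

# count=1 replaces only the first match; the rest of the string is untouched.
first_only = re.sub(r"\s+", replace, string, count=1)

print(repr(first_only))  # 'abc12de 23 \n f45 6'
```

Passing `count` by keyword keeps the call readable and avoids confusing it with the `flags` argument that follows it in the signature.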

>> # Program to remove all whitespaces

>> # empty string

>> new_string = re.subn(pattern, replace, string)

>> # Output: ('abc12de23f456', 4)
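re.subn() works like re.sub() but returns a tuple of the new string and the number of substitutions made. A minimal sketch, using an assumed sample string chosen to agree with the output shown above:

```python
import re

# Assumed sample input with four separate runs of whitespace.
string = "abc 12de 23 \n f45 6"

pattern = r"\s+"  # matches one or more whitespace characters
replace = ""      # empty string

result = re.subn(pattern, replace, string)
new_string, n_subs = result

print(result)  # ('abc12de23f456', 4)
```

The count is useful for checking whether a cleaning step actually changed anything.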

• If the search is successful, re.search() returns a match object; if not, it returns None.

• match = re.search(pattern, string)

• txt = "The rain in Spain"

• print("The first white-space character is located in position:", x.start())
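Put together, the re.search() fragments above can be sketched as follows. The `\s` pattern and the `missing` variable are assumptions added for illustration; because a failed search returns None, the result is checked before start() is called.

```python
import re

txt = "The rain in Spain"

# Search for the first whitespace character.
x = re.search(r"\s", txt)

if x is not None:
    print("The first white-space character is located in position:", x.start())

# A pattern that does not occur gives None instead of a match object.
missing = re.search("Portugal", txt)
print(missing)  # None
```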

>> txt = "The rain in Spain"

>> txt = "The rain in Spain"

• Here, the match variable contains a match object.

• Backslash \ is used to escape various characters including all metacharacters.

• # Output: ['\n', '\r']
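The sketch below illustrates both uses of the backslash: matching escape sequences such as \n and \r, and escaping metacharacters so they are matched literally. The input strings are assumed samples; raw strings (the `r` prefix) keep Python from interpreting the backslashes before the regex engine sees them.

```python
import re

# Assumed sample input containing a newline and a carriage return.
string = "\n and \r are escape sequences."
result = re.findall(r"[\n\r]", string)
print(result)  # ['\n', '\r']

# '$' and '.' are metacharacters; a backslash makes them literal.
price = "Total: $5.00"
literal = re.findall(r"\$\d+\.\d+", price)
print(literal)  # ['$5.00']
```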
