Regular Expressions: Python For Everybody
Chapter 11
http://en.wikipedia.org/wiki/Regular_expression
Regular Expressions
Really clever “wild card” expressions for matching
and parsing strings
Really smart “Find” or “Search”
Understanding Regular Expressions
• You can use re.findall() to extract portions of a string that match your
regular expression, similar to a combination of find() and slicing:
var[5:10]
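As a rough illustration of that comparison, the sketch below pulls the same number out of a string twice, once with find() and slicing and once with re.findall(). The sample string and the variable names are made up for this example.

```python
import re

# Hypothetical sample string, chosen only for illustration.
text = 'Total cost: 1250 dollars'

# find() plus slicing: locate the delimiters, then cut out the piece between them.
start = text.find(': ') + 2
end = text.find(' ', start)
by_slicing = text[start:end]

# re.findall(): describe what the piece looks like and let the regex engine locate it.
by_regex = re.findall('[0-9]+', text)

print(by_slicing)  # 1250
print(by_regex)    # ['1250']
```

Note that findall() returns a list of strings, while slicing returns a single string.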
Using re.search() Like find()
With find():

hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    if line.find('From:') >= 0:
        print(line)

With re.search():

import re
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    if re.search('From:', line) :
        print(line)
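To try the comparison without the mbox-short.txt file, here is a small sketch that runs both tests over a list of strings. The sample lines are hypothetical stand-ins for lines in the mailbox file.

```python
import re

# Hypothetical stand-ins for a few lines of mbox-short.txt.
lines = [
    'From: someone@example.com',
    'Subject: a note that mentions From: in the middle',
    'X-DSPAM-Result: Innocent',
]

# Both tests ask the same question: does 'From:' occur anywhere in the line?
with_find = [ln for ln in lines if ln.find('From:') >= 0]
with_search = [ln for ln in lines if re.search('From:', ln)]

print(with_find == with_search)  # True
print(len(with_search))          # 2
```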
Using re.search() Like startswith()
With startswith():

hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    if line.startswith('From:') :
        print(line)

With re.search():

import re
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    if re.search('^From:', line) :
        print(line)
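A minimal sketch of what the ^ changes, again using hypothetical sample lines instead of the mailbox file: with the caret, a line that only mentions 'From:' in the middle is no longer selected.

```python
import re

# Hypothetical sample lines, for illustration only.
lines = [
    'From: someone@example.com',
    'Subject: a note that mentions From: in the middle',
]

unanchored = [ln for ln in lines if re.search('From:', ln)]
anchored = [ln for ln in lines if re.search('^From:', ln)]
with_startswith = [ln for ln in lines if ln.startswith('From:')]

print(len(unanchored))              # 2
print(anchored)                     # ['From: someone@example.com']
print(anchored == with_startswith)  # True
```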
^X.*:

^    Match the start of the line
.    Match any character
*    Many times

Lines that match:
X-Sieve: CMU Sieve 2.3
X-DSPAM-Result: Innocent
X-Plane is behind schedule: two weeks
X-: Very short
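The pattern can be checked against the four sample lines directly. This sketch adds one extra line that does not start with X (a made-up line, not from the slides) to show what gets rejected.

```python
import re

lines = [
    'X-Sieve: CMU Sieve 2.3',
    'X-DSPAM-Result: Innocent',
    'X-Plane is behind schedule: two weeks',
    'X-: Very short',
    'Received: from somewhere',  # hypothetical non-matching line
]

# ^X  : the line starts with X
# .*  : any character, zero or more times
# :   : a literal colon
matched = [ln for ln in lines if re.search('^X.*:', ln)]

for ln in matched:
    print(ln)
```

All four X- lines are selected, including the two that are probably not real mail headers.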
Fine-Tuning Your Match
Depending on how “clean” your data is and the purpose of your
application, you may want to narrow your match down a bit
^X-\S+:

^     Match the start of the line
\S    Match any non-whitespace character
+     One or more times

Lines that match:
X-Sieve: CMU Sieve 2.3
X-DSPAM-Result: Innocent

Lines that no longer match:
X-: Very Short
X-Plane is behind schedule: two weeks
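A short sketch of the narrower pattern applied to the same four lines. The pattern is written as a raw string (the r prefix) so the backslash reaches the regex engine unchanged; that prefix is a choice made for this example and is not in the slides.

```python
import re

lines = [
    'X-Sieve: CMU Sieve 2.3',
    'X-DSPAM-Result: Innocent',
    'X-: Very Short',
    'X-Plane is behind schedule: two weeks',
]

# ^X-  : the line starts with X-
# \S+  : one or more non-whitespace characters
# :    : a literal colon
matched = [ln for ln in lines if re.search(r'^X-\S+:', ln)]

print(matched)  # ['X-Sieve: CMU Sieve 2.3', 'X-DSPAM-Result: Innocent']
```

'X-: Very Short' fails because nothing sits between the dash and the colon, and 'X-Plane is behind schedule: two weeks' fails because a space comes before the colon.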
Matching and Extracting Data
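• re.search() returns a match object if the string matches the regular
expression and None if it does not, so it works as a True/False test
• To extract the matching strings themselves, use re.findall(), which
returns them as a list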
>>> import re
>>> x = 'My 2 favorite numbers are 19 and 42'
>>> y = re.findall('[0-9]+',x)
>>> print(y)
['2', '19', '42']
>>> y = re.findall('[AEIOU]+',x)
>>> print(y)
[]
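To make the difference between the two functions concrete, this sketch calls both on the same string and looks at what comes back. The variable names m and no_match are introduced here for illustration.

```python
import re

x = 'My 2 favorite numbers are 19 and 42'

# re.search() stops at the first match and returns a match object (or None).
m = re.search('[0-9]+', x)
print(m is not None)  # True
print(m.group())      # 2

# No uppercase vowels appear in x, so search() returns None.
no_match = re.search('[AEIOU]+', x)
print(no_match)       # None

# re.findall() returns every non-overlapping match as a list of strings.
print(re.findall('[0-9]+', x))  # ['2', '19', '42']
```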
Warning: Greedy Matching
The repeat characters (* and +) push outward in both directions (greedy)
to match the largest possible string
>>> import re
>>> x = 'From: Using the : character'
>>> y = re.findall('^F.+:', x)
>>> print(y)
['From: Using the :']

^F.+:
.+    One or more characters
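For contrast, Python's re module also accepts a ? after * or + to make the repeat non-greedy, so it stops at the first place the rest of the pattern can match. The sketch below runs both forms on the same string; the non-greedy pattern is an addition for comparison and does not appear in the example above.

```python
import re

x = 'From: Using the : character'

# Greedy: .+ grabs as much as it can, so the match runs to the LAST colon.
greedy = re.findall('^F.+:', x)

# Non-greedy: .+? grabs as little as it can, so the match ends at the FIRST colon.
non_greedy = re.findall('^F.+?:', x)

print(greedy)      # ['From: Using the :']
print(non_greedy)  # ['From:']
```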
>>> y = re.findall('\S+@\S+',x)
>>> print(y)
['[email protected]']
^From (\S+@\S+)
>>> y = re.findall('^From (\S+@\S+)',x)
>>> print(y)
['[email protected]']
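A minimal sketch of what the parentheses do. The address jane.doe@example.com is a hypothetical placeholder, and the second line is made up to show a case where the two patterns disagree.

```python
import re

# Hypothetical sample lines; the addresses are placeholders.
from_line = 'From jane.doe@example.com Sat Jan  5 09:14:16 2008'
cc_line = 'Cc: john.roe@example.org'

# Without parentheses, findall() returns the whole match.
print(re.findall(r'\S+@\S+', from_line))  # ['jane.doe@example.com']
print(re.findall(r'\S+@\S+', cc_line))    # ['john.roe@example.org']

# With parentheses, the line must start with 'From ' for the pattern to match,
# but only the part inside the parentheses is returned.
anchored = r'^From (\S+@\S+)'
print(re.findall(anchored, from_line))    # ['jane.doe@example.com']
print(re.findall(anchored, cc_line))      # []
```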
String Parsing Examples…
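Using find(): the at sign is at position 21 and the next space is at position 31; slicing between them gives the host name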
['uct.ac.za']
'@([^ ]*)'
['uct.ac.za']
'^From .*@([^ ]*)'
Starting at the beginning of the line, look for the string 'From '
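The three approaches can be put side by side. This sketch uses a hypothetical placeholder address, so the extracted host is example.com; the variable names are chosen here for illustration.

```python
import re

# Hypothetical sample line; the address is a placeholder.
data = 'From jane.doe@example.com Sat Jan  5 09:14:16 2008'

# 1. find() and slicing: locate the at sign, then the first space after it.
atpos = data.find('@')
sppos = data.find(' ', atpos)
host_by_slicing = data[atpos + 1:sppos]

# 2. Regex: find an at sign, then capture the run of non-blank characters after it.
host_by_regex = re.findall('@([^ ]*)', data)

# 3. Stricter regex: same capture, but only on lines that start with 'From '.
host_by_strict_regex = re.findall('^From .*@([^ ]*)', data)

print(host_by_slicing)       # example.com
print(host_by_regex)         # ['example.com']
print(host_by_strict_regex)  # ['example.com']
```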
Even Cooler Regex Version
From [email protected] Sat Jan 5 09:14:16 2008
import re
lin = 'From [email protected] Sat Jan 5 09:14:16 2008'
y = re.findall('^From .*@([^ ]*)',lin)
print(y)
['uct.ac.za']
'^From .*@([^ ]*)'
Start extracting at the opening parenthesis
Stop extracting at the closing parenthesis
'^From .*@([^ ]+)'
Same pattern, but + requires at least one non-blank character after the at sign
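The two spellings of the capture group, ([^ ]*) and ([^ ]+), behave the same on a well-formed line and differ only when nothing follows the at sign. A small sketch, using made-up sample lines:

```python
import re

# Hypothetical sample lines: one well formed, one with nothing after the at sign.
good = 'From jane.doe@example.com Sat Jan  5 09:14:16 2008'
broken = 'From jane.doe@ Sat Jan  5 09:14:16 2008'

star = '^From .*@([^ ]*)'  # zero or more non-blank characters
plus = '^From .*@([^ ]+)'  # one or more non-blank characters

print(re.findall(star, good))    # ['example.com']
print(re.findall(plus, good))    # ['example.com']

# With *, the empty host still counts as a match; with +, the line is rejected.
print(re.findall(star, broken))  # ['']
print(re.findall(plus, broken))  # []
```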
Spam Confidence
import re
hand = open('mbox-short.txt')
numlist = list()
for line in hand:
    line = line.rstrip()
    stuff = re.findall('^X-DSPAM-Confidence: ([0-9.]+)', line)
    if len(stuff) != 1 : continue
    num = float(stuff[0])
    numlist.append(num)
print('Maximum:', max(numlist))

Sample input line:
X-DSPAM-Confidence: 0.8475

python ds.py
Maximum: 0.9907
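The same loop can be tried without the mailbox file by feeding it a list of strings. In this sketch the two confidence values come from the slide; the other lines are hypothetical filler that the pattern should skip.

```python
import re

# Stand-in for the file: two confidence lines plus hypothetical filler.
lines = [
    'X-DSPAM-Confidence: 0.8475\n',
    'X-DSPAM-Probability: 0.0000\n',
    'Subject: hello\n',
    'X-DSPAM-Confidence: 0.9907\n',
]

numlist = []
for line in lines:
    line = line.rstrip()
    # The parentheses capture only the number, not the header name.
    stuff = re.findall('^X-DSPAM-Confidence: ([0-9.]+)', line)
    if len(stuff) != 1:
        continue  # not a confidence line
    numlist.append(float(stuff[0]))

print('Maximum:', max(numlist))  # Maximum: 0.9907
```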
Escape Character
If you want a special regular expression character to just behave
normally (most of the time) you prefix it with '\'
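As a small illustration, the dollar sign normally means "end of the line" in a regular expression, so it has to be escaped to match a literal $. The sample sentence below is an assumption made for this sketch.

```python
import re

# Hypothetical sample sentence.
x = 'We just received $10.00 for cookies.'

# \$ matches a real dollar sign; [0-9.]+ matches one or more digits or periods.
escaped = re.findall(r'\$[0-9.]+', x)

# Unescaped, $ anchors at the end of the string, so nothing can follow it.
unescaped = re.findall('$[0-9.]+', x)

print(escaped)    # ['$10.00']
print(unescaped)  # []
```

Inside the square brackets the period needs no backslash, because most special characters lose their special meaning inside a character set.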