Module 4 - Regular Expressions1
Regular Expressions
import re

# Print every line that starts with "From" or "from" and ends with "edu".
pattern = r'^[Ff]rom.*edu$'
fhand = open('mbox-short.txt')
for line in fhand:
    line = line.rstrip()
    if re.search(pattern, line):
        print(line)
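The same pattern can be tried without the mbox-short.txt file. The following is a minimal sketch; the sample lines and addresses are made up for illustration and are not taken from the original file.

```python
import re

# Hypothetical sample lines standing in for the contents of mbox-short.txt.
sample_lines = [
    "From: carol@cs.example.edu",
    "from dave@uni.example.edu",
    "From erin@example.com",
    "To: frank@example.edu",
]

pattern = r'^[Ff]rom.*edu$'

# Keep only lines that start with From/from and end with edu.
edu_from_lines = [line for line in sample_lines if re.search(pattern, line)]
print(edu_from_lines)
```

Only the first two lines are kept: the third does not end with edu, and the fourth does not start with From.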
Pattern to extract lines ending with any digit
pattern = r'[0-9]$'
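A short sketch of this pattern in use, assuming a small list of made-up lines instead of a file:

```python
import re

# Hypothetical sample lines, chosen only to illustrate the pattern.
sample_lines = ["Room 101", "No digits here", "Total: 42", "7 apples"]

pattern = r'[0-9]$'  # a single digit immediately before the end of the line

ending_with_digit = [line for line in sample_lines if re.search(pattern, line)]
print(ending_with_digit)
```

"7 apples" contains a digit but does not end with one, so it is not selected.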
Character matching in regular expressions
Search for lines that start with From and have an at sign
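One possible pattern for this search is sketched below. The pattern ^From.*@ and the sample lines are assumptions for illustration; the slides may have used a slightly different expression.

```python
import re

# Hypothetical sample lines; the addresses are made up.
sample_lines = [
    "From alice@example.com Sat Jan  5 09:14:16 2008",
    "From: bob@example.org",
    "Subject: meeting at noon",
    "X-From-Host: mail.example.com",
]

# ^From anchors the match at the start of the line,
# .* allows any characters in between, and @ requires an at sign.
pattern = r'^From.*@'

from_with_at = [line for line in sample_lines if re.search(pattern, line)]
print(from_with_at)
```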
The findall() Method
• search() returns a match object for the first match in the searched
string, or None if nothing matches.
• findall() returns a list containing the string of every match in the
searched string.
• Each string in the list is a piece of the searched text that matched the
regular expression.
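The difference between the two calls can be seen in a small sketch. The text and pattern below are made up for illustration.

```python
import re

# Hypothetical text containing two phone-like numbers.
text = "Call 555-1234 or 555-9876 after 5 pm"
pattern = r'[0-9]+-[0-9]+'

first = re.search(pattern, text)         # match object for the first match only
all_matches = re.findall(pattern, text)  # list of every matching string

print(first.group())
print(all_matches)
```

search() stops at 555-1234, while findall() collects both numbers into a list.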
The findall() Method
pattern = r'^[^a-z0-9]+'
Here, the first ^ anchors the match at the beginning of the line. The ^
inside the square brackets negates the set: it matches any single character
that is not listed in the brackets. Hence, the whole pattern means that the
line must start with one or more characters other than lowercase letters
and digits.
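A minimal sketch of how this pattern behaves with findall(), using made-up sample lines:

```python
import re

pattern = r'^[^a-z0-9]+'

# Hypothetical sample lines, chosen only to illustrate the pattern.
sample_lines = ["HELLO world", "hello WORLD", "42 is the answer", "--- separator"]

# Map each line to the prefix (if any) made of characters
# that are neither lowercase letters nor digits.
prefixes = {line: re.findall(pattern, line) for line in sample_lines}
for line, found in prefixes.items():
    print(repr(line), found)
```

Note that a space is also "not a lowercase letter or digit", so the matched prefix of the first line is 'HELLO ' including the trailing space.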
Lines that start with an uppercase letter and end with a digit
pattern = '^[A-Z].*[0-9]$'
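A short sketch of this pattern applied to a few made-up lines:

```python
import re

pattern = r'^[A-Z].*[0-9]$'

# Hypothetical sample lines, chosen only to illustrate the pattern.
sample_lines = ["Room 101", "room 101", "Room A", "A1"]

upper_to_digit = [line for line in sample_lines if re.search(pattern, line)]
print(upper_to_digit)
```

"A1" is accepted because .* is allowed to match zero characters between the first letter and the last digit.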
Here, we would like to extract only the hour, 09. That is, we want only
the two digits representing the hour. Hence, we need to modify our
expression as:
Here, [0-9][0-9] indicates that exactly two consecutive digits must appear.
The alternative way of writing this would be:
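As a minimal sketch of extracting a two-digit hour, assume a hypothetical mail header line with an HH:MM:SS timestamp. Both patterns below are illustrative assumptions and not necessarily the expressions used in the slides. Parentheses mark the part that findall() should return, and {2} is one way to say "exactly two" instead of repeating [0-9].

```python
import re

# Hypothetical header line; the address is made up.
line = "From alice@example.com Sat Jan  5 09:14:16 2008"

# Two digits preceded by a space and followed by a colon;
# only the parenthesized part is returned.
hours = re.findall(r'^From .* ([0-9][0-9]):', line)

# Same idea, using the {2} repetition count.
hours_alt = re.findall(r'^From .* ([0-9]{2}):', line)

print(hours)
print(hours_alt)
```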
y = re.findall(r'\$[0-9.]+', x)
Output:
['$10.00']
Note that the grep command does not support the non-blank character \S, so
we need to use [^ ], which matches any character other than a space.
Matching Zero or More with the Star
• The group that precedes the star can occur any number of times in the
text, including zero.
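A small sketch of the star applied to a group. The pattern and test strings are assumptions chosen for illustration.

```python
import re

# (wo)* means the group "wo" may appear zero or more times.
pattern = r'Bat(wo)*man'

# Hypothetical test strings.
texts = ["Batman", "Batwoman", "Batwowowoman", "Batwman"]

results = {text: bool(re.search(pattern, text)) for text in texts}
for text, matched in results.items():
    print(text, matched)
```

"Batman" matches with zero repetitions of the group, while "Batwman" does not match because "w" alone is not a complete "wo".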