Module 4 - Regular Expressions1


Regular Expressions

In computing, a regular expression, also referred to as “regex” or
“regexp”, provides a concise and flexible means for matching
strings of text, such as particular characters, words, or patterns of
characters.
A regular expression is written in a formal language that can
be interpreted by a regular expression processor.
REGULAR EXPRESSIONS

output would be:


hello, how are you?
how about you?
REGULAR EXPRESSIONS

Regular expressions use special characters that have specific
meanings. For example, the caret (^) symbol matches the
beginning of a line.
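A minimal sketch of the caret in action. The sample lines are hypothetical and were chosen only to show that the same text matches at the start of a line but not in the middle.

```python
import re

# Hypothetical sample lines, chosen only for illustration.
lines = ["From: alice", "Reply sent From: bob", "from: carol"]

# '^From:' matches only when 'From:' sits at the very start of the line.
starts_with_from = [line for line in lines if re.search(r"^From:", line)]
print(starts_with_from)  # ['From: alice']
```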
Character Matching in Regular Expressions
Python provides a set of metacharacters for matching search strings. The table shows the
details of a few important metacharacters.
Table: List of Important Metacharacters
Table: Examples for Regular Expressions
Example

• The most commonly used metacharacter is the dot (.), which matches any single character
except a newline.

• For example, the regular expression ^I..m searches for lines that start with I, followed by
any two characters (each represented by a dot), and then the character m.
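The description above corresponds to the pattern `^I..m`. The following sketch applies it to a few hypothetical lines to show which ones qualify.

```python
import re

# Hypothetical sample lines, chosen only for illustration.
lines = ["I am here", "Item list", "I'm fine", "Iron"]

# ^ anchors at the start, I is literal, each dot is any one character, m is literal.
dot_matches = [line for line in lines if re.search(r"^I..m", line)]
print(dot_matches)  # ['I am here', 'Item list']
```

Note that a space counts as a character, which is why "I am here" matches.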
Example

If we don’t know the exact number of characters between two
characters (or strings), we can use the dot and + symbols
together: .+ matches one or more of any character.
Pattern to extract lines starting with the word From (or from) and ending with edu.

import re

fhand = open('mbox-short.txt')
for line in fhand:
    line = line.rstrip()
    # The line must start with From or from and end with edu
    pattern = '^[Ff]rom.*edu$'
    if re.search(pattern, line):
        print(line)
Pattern to extract lines ending with any digit

Replace the pattern with the following string; the rest of the
program remains the same.

pattern = '[0-9]$'
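To try the two patterns above without the mbox-short.txt file, the sketch below runs them over a small in-memory list. The sample lines and the example.edu and example.com addresses are hypothetical.

```python
import re

# Hypothetical sample lines standing in for the contents of a mail file.
lines = [
    "From alice@example.edu",
    "from bob@example.edu",
    "From carol@example.com",
    "Room 42",
]

# Lines that start with From or from and end with edu.
edu_lines = [line for line in lines if re.search(r"^[Ff]rom.*edu$", line)]

# Lines whose last character is a digit.
digit_lines = [line for line in lines if re.search(r"[0-9]$", line)]

print(edu_lines)    # ['From alice@example.edu', 'from bob@example.edu']
print(digit_lines)  # ['Room 42']
```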
Character matching in regular expressions
Search for lines that start with From and have an at sign
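The slide's own code is not available here, so the following is only one plausible pattern for this search, assuming the lines begin with the literal text "From:". The sample lines are hypothetical.

```python
import re

# Hypothetical sample lines, chosen only for illustration.
lines = [
    "From: alice@example.com",
    "From: no address here",
    "To: bob@example.com",
]

# 'From:' at the start, then one or more of any character, then an at sign.
from_with_at = [line for line in lines if re.search(r"^From:.+@", line)]
print(from_with_at)  # ['From: alice@example.com']
```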
The findall() Method

• search() returns a match object for the first match in the
searched string.

• findall() returns the strings of every match in the
searched string.

• findall() does not return a match object; it returns a list of strings.

• Each string in the list is a piece of the searched text that matched the
regular expression.
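The difference between the two functions can be seen in a short sketch. The sample text is hypothetical.

```python
import re

text = "cat bat rat"  # hypothetical sample text

# search() stops at the first match and returns a match object (or None).
match = re.search(r"[a-z]at", text)
first = match.group() if match else None

# findall() scans the whole string and returns a list of matched strings.
every = re.findall(r"[a-z]at", text)

print(first)  # cat
print(every)  # ['cat', 'bat', 'rat']
```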
The findall() Method

• If the regular expression contains more than one group, findall()
returns a list of tuples. With a single group, it returns a list of
the strings matched by that group.

• Each tuple represents a found match.

• Its items are the matched strings for each group in the
regex.
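A small sketch of how groups change the result of findall(). The date-like sample text is hypothetical.

```python
import re

text = "2008-01-05 and 2009-12-31"  # hypothetical sample text

# Three groups: each match becomes a tuple with one item per group.
date_parts = re.findall(r"([0-9]{4})-([0-9]{2})-([0-9]{2})", text)

# One group: each match becomes just the string captured by that group.
years = re.findall(r"([0-9]{4})-[0-9]{2}-[0-9]{2}", text)

print(date_parts)  # [('2008', '01', '05'), ('2009', '12', '31')]
print(years)       # ['2008', '2009']
```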
Extracting data using regular expressions
Search for lines that have an @ sign between characters
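One way to express "an @ sign between characters" is `\S+@\S+`, that is, one or more non-whitespace characters on each side of the at sign. This is a sketch rather than the slide's original code, and the addresses are hypothetical.

```python
import re

# Hypothetical sample line.
line = "From alice@example.com to bob@example.org on Saturday"

# \S+ is one or more non-whitespace characters, so each match stops at a space.
addresses = re.findall(r"\S+@\S+", line)
print(addresses)  # ['alice@example.com', 'bob@example.org']
```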
Combining searching and extracting
Example
Search for lines that start with 'X', followed by any non-whitespace
characters and ':', followed by a space and any number. The number
can include a decimal point. Then print the number if it is greater than
zero.
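A possible sketch of this search, with hypothetical header lines. It reads "greater than zero" as "at least one match was found", which is an assumption about what the slide intended.

```python
import re

# Hypothetical header lines, chosen only for illustration.
lines = [
    "X-DSPAM-Confidence: 0.8475",
    "X-DSPAM-Probability: 0.0000",
    "X-Mailer: Sakai",
    "Subject: 3.5",
]

extracted = []
for line in lines:
    # The parentheses select only the number as the part to extract.
    found = re.findall(r"^X\S*: ([0-9.]+)", line)
    if len(found) > 0:  # assumption: keep the result when something matched
        extracted.append(found[0])

print(extracted)  # ['0.8475', '0.0000']
```

The extracted values are strings; convert them with float() before doing arithmetic.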
Example
Search for lines that start with 'Details: rev=' followed
by numbers and '.'. Then print the number if it is greater
than zero.
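A sketch under two assumptions: that other text may sit between 'Details:' and 'rev=', and that "greater than zero" means at least one match was found. The URL in the sample line is hypothetical.

```python
import re

# Hypothetical sample line.
line = "Details: http://example.com/viewsvn/?view=rev&rev=39772"

# .* skips over any text between 'Details:' and the last 'rev=' on the line.
revisions = re.findall(r"^Details:.*rev=([0-9.]+)", line)
if len(revisions) > 0:
    print(revisions)  # ['39772']
```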
Example
Search for lines that start with From and a character followed by a
two-digit number between 00 and 99 followed by ':'. Then print the
number if it is greater than zero.
Using Not

pattern = '^[^a-z0-9]+'
Here, the first ^ indicates that the match must begin at the start of
a line. The ^ inside the square brackets negates the set: match any single
character that is not listed in the brackets. Hence, the pattern means that the line
must start with one or more characters other than lower-case letters and digits.
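The two roles of ^ can be checked with a short sketch. The sample lines are hypothetical.

```python
import re

# Hypothetical sample lines, chosen only for illustration.
lines = ["Hello world", "hello world", "9 lives", "#hashtag"]

# Outside the brackets ^ anchors to the line start; inside it negates the set.
negated = [line for line in lines if re.search(r"^[^a-z0-9]+", line)]
print(negated)  # ['Hello world', '#hashtag']
```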
Start with an upper-case letter and end with a digit

pattern = '^[A-Z].*[0-9]$'

Here, the line must start with a capital letter, followed by zero
or more characters, and must end with a digit.
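A quick check of this pattern on hypothetical lines. Because .* can match nothing at all, a two-character line such as "R2" also qualifies.

```python
import re

# Hypothetical sample lines, chosen only for illustration.
lines = ["Room 42", "room 42", "Room forty-two", "R2"]

upper_to_digit = [line for line in lines if re.search(r"^[A-Z].*[0-9]$", line)]
print(upper_to_digit)  # ['Room 42', 'R2']
```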


The file mbox-short.txt has lines like:

From [email protected] Sat Jan 5 09:14:16 2008

Here, we would like to extract only the hour 09. That is, we would like
only two digits representing hour. Hence, we need to modify our
expression as:

x = re.findall('^From .* ([0-9][0-9]):', line)

Here, [0-9][0-9] matches exactly two consecutive digits.
An alternative way of writing this is:

x = re.findall('^From .* ([0-9]{2}):', line)
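Both forms can be compared on a single line in the same format. The address user@example.com is a hypothetical stand-in for the one in the file.

```python
import re

# Hypothetical line in the same format as the mbox-short.txt example.
line = "From user@example.com Sat Jan  5 09:14:16 2008"

# Two explicit digit sets versus the {2} repetition count.
hour_a = re.findall(r"^From .* ([0-9][0-9]):", line)
hour_b = re.findall(r"^From .* ([0-9]{2}):", line)

print(hour_a)  # ['09']
print(hour_b)  # ['09']
```

Only 09 is extracted because it is the only two-digit number that is preceded by a space and followed by a colon.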


Escape Character
import re

x = 'We just received $10.00 for cookies.'

# \$ matches a literal dollar sign; the raw string keeps the backslash intact
y = re.findall(r'\$[0-9.]+', x)
print(y)

Output:

['$10.00']

Here, we want to extract only the price $10.00. Because the $ symbol is
a metacharacter, we need to put \ before it so that $ is
treated as part of the string to match rather than as a metacharacter.
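To see why the backslash matters, this sketch compares the unescaped and escaped patterns on the same sentence. It also uses re.escape() from the standard library, which the slides do not mention, as an alternative way to escape a literal string.

```python
import re

text = "We just received $10.00 for cookies."

# Unescaped, $ is an end-of-string anchor, so nothing can follow it.
unescaped = re.findall(r"$[0-9.]+", text)

# Escaped by hand, \$ matches a literal dollar sign.
escaped = re.findall(r"\$[0-9.]+", text)

# re.escape() builds an escaped pattern from a literal string.
literal = re.findall(re.escape("$10.00"), text)

print(unescaped)  # []
print(escaped)    # ['$10.00']
print(literal)    # ['$10.00']
```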
Unix/Linux Users
Support for searching files using regular expressions is built into the Unix OS.

There is a command-line program built into Unix, grep, that behaves much like the
search() function. Its name comes from the ed editor command g/re/p (globally search
for a regular expression and print).

Note that some versions of grep do not support the non-whitespace character \S, hence
we need to use [^ ], which matches any character other than a space.
Matching Zero or More with the Star

• The * (called the star or asterisk) means “match zero or more”.

• The group that precedes the star can occur any number of times in the text.

• It can be completely absent or repeated over and over again.
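A brief sketch of the star, using hypothetical strings and the pattern `ab*c`, where the star applies to the letter b.

```python
import re

candidates = ["ac", "abc", "abbbc", "adc"]  # hypothetical test strings

# b* allows zero or more b characters between a and c.
star_matches = [s for s in candidates if re.fullmatch(r"ab*c", s)]
print(star_matches)  # ['ac', 'abc', 'abbbc']
```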


Matching One or More with the Plus

• The + (or plus) means “match one or more”.

• The group preceding a plus must appear at least once.
• The group preceding a plus is not optional.
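The same hypothetical strings show how the plus differs from the star: with `ab+c`, at least one b is required.

```python
import re

candidates = ["ac", "abc", "abbbc"]  # hypothetical test strings

# b+ requires at least one b, so "ac" no longer matches.
plus_matches = [s for s in candidates if re.fullmatch(r"ab+c", s)]
print(plus_matches)  # ['abc', 'abbbc']
```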
Character classes
• Character classes are used for shortening regular expressions.

• The character class [0-5] matches only the digits 0 to 5.
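A short sketch of the [0-5] class on a hypothetical string, compared with the longer alternation it replaces.

```python
import re

text = "a1b6c3d9e0"  # hypothetical sample text

# [0-5] is shorthand for (0|1|2|3|4|5): any single digit from 0 to 5.
low_digits = re.findall(r"[0-5]", text)
same_result = re.findall(r"0|1|2|3|4|5", text)

print(low_digits)  # ['1', '3', '0']
```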

