Module 4 - Regular Expressions

pattern matching
Regular Expressions

In computing, a regular expression, also referred to as “regex” or “regexp”, provides a concise and flexible means for matching strings of text, such as particular characters, words, or patterns of characters.
A regular expression is written in a formal language that can be interpreted by a regular expression processor.
The output would be:

hello, how are you?
how about you?
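
The program that produced this output is not shown here. A minimal sketch that would print the same two lines, assuming a hypothetical three-line input and the search word how (both are assumptions, not taken from the original):

```python
import re

# Hypothetical input lines; the original input file is not shown.
lines = ["hello, how are you?", "I am fine.", "how about you?"]

# Keep every line that contains the text "how" anywhere in it.
matched = [line for line in lines if re.search("how", line)]

for line in matched:
    print(line)
```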
Regular expressions make use of special characters with specific meanings. In the following example, we make use of the caret (^) symbol, which indicates the beginning of the line.
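
A small sketch of the caret in use, assuming two made-up sample lines. Only the line that begins with how is kept, even though both lines contain that word:

```python
import re

# Sample lines invented for illustration.
lines = ["how about you?", "hello, how are you?"]

# "^how" matches only when "how" sits at the very start of the line.
starts_with_how = [line for line in lines if re.search("^how", line)]
print(starts_with_how)
```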
Character Matching in Regular Expressions
Python provides a set of metacharacters for matching search strings. The table shows the details of a few important metacharacters.
Table: List of Important Meta-Characters
Table: Examples for Regular Expressions
Example

• The most commonly used metacharacter is the dot, which matches any single character.
• Consider the following example, where the regular expression searches for lines that start with I, followed by any two characters (each dot stands for one character), followed by the character m.
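
One way to write this pattern is '^I..m'. The sketch below tries it on a few sample lines invented for illustration:

```python
import re

# Sample lines made up for this sketch.
lines = ["I am here", "Item one", "I'll move", "It's me"]

# ^I   -> line starts with I
# ..   -> exactly two characters of any kind (a space counts)
# m    -> then the letter m
pattern = "^I..m"
matched = [line for line in lines if re.search(pattern, line)]
print(matched)
```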
Example

If we don’t know the exact number of characters between two characters (or strings), we can make use of the dot and + symbols together.
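
As a hedged illustration, the pattern 'x.+y' below requires at least one character between x and y. The sample strings are invented:

```python
import re

# Invented sample strings.
samples = ["xy", "x-y", "x123y"]

# .+ means "one or more of any character", so "xy" (nothing in between) fails.
pattern = "x.+y"
matched = [s for s in samples if re.search(pattern, s)]
print(matched)
```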
Pattern to extract lines starting with the word From (or from) and ending with edu.

import re

fhand = open('mbox-short.txt')
for line in fhand:
    line = line.rstrip()           # drop the trailing newline
    pattern = '^[Ff]rom.*edu$'     # starts with From/from, ends with edu
    if re.search(pattern, line):
        print(line)
Pattern to extract lines ending with any digit

Replace the pattern with the following string; the rest of the program remains the same.

pattern = '[0-9]$'
Character matching in regular expressions
Search for lines that start with From and have an at sign
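
A possible pattern for this search is '^From.+@'. This is a sketch only; the sample lines and the address user@example.com are hypothetical stand-ins for lines of mbox-short.txt:

```python
import re

# Hypothetical lines standing in for the contents of mbox-short.txt.
lines = [
    "From user@example.com Sat Jan  5 09:14:16 2008",
    "From the committee",
    "To: user@example.com",
]

# ^From -> line starts with From; .+@ -> one or more characters, then an at sign.
pattern = "^From.+@"
matched = [line for line in lines if re.search(pattern, line)]
print(matched)
```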
The findall() Method

• search() returns a match object for the first matched text in the searched string.

• The findall() method returns the strings of every match in the searched string.

• findall() does not return a match object but a list of strings.

• Each string in the list is a piece of the searched text that matched the regular expression.
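
The difference between the two calls can be sketched on a short string invented for the purpose:

```python
import re

text = "cat bat rat"

# search() stops at the first match and returns a match object.
first = re.search("[a-z]at", text)
print(first.group())

# findall() returns every non-overlapping match as a list of strings.
every = re.findall("[a-z]at", text)
print(every)
```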
The findall() Method

• If the regular expression contains more than one group, findall() returns a list of tuples.
• Each tuple represents one match found.
• Its items are the matched strings for each group in the regex.
• With exactly one group, findall() returns a list of strings holding that group's text.
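
A short sketch of how the number of groups changes the result of findall(). The text and addresses are made up:

```python
import re

text = "alice@example.com bob@example.org"

# Three groups: each match becomes a tuple with one item per group.
parts = re.findall(r"(\w+)@(\w+)\.(\w+)", text)
print(parts)

# One group: the result is a flat list of that group's text.
names = re.findall(r"(\w+)@", text)
print(names)

# No groups: the result is a list of the whole matched strings.
whole = re.findall(r"\w+@\w+", text)
print(whole)
```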
Extracting data using regular expressions
Search for lines that have an @ sign between characters
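
One way to express "an @ sign between characters" is the pattern \S+@\S+, which needs at least one non-whitespace character on each side. A sketch on an invented line:

```python
import re

line = "Mail from alice@example.com to bob@example.org sent @ noon"

# \S+ is one or more non-whitespace characters; the lone "@" has none
# on either side, so it is not extracted.
addresses = re.findall(r"\S+@\S+", line)
print(addresses)
```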
Combining searching and extracting
Example
Search for lines that start with 'X' followed by any non-whitespace characters and ':' followed by a space and any number. The number can include a decimal point. Then print the number if it is greater than zero.
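
A sketch of this search, assuming the pattern '^X\S*: ([0-9.]+)' and a few invented header lines. It reads "greater than zero" as a test on the extracted value; if the intent was "at least one match was found", the check would be on len() of the list instead:

```python
import re

# Invented header lines for illustration.
lines = [
    "X-DSPAM-Confidence: 0.8475",
    "X-DSPAM-Probability: 0.0000",
    "X-Mailer: none",
    "Subject: 5",
]

# The parentheses form a group, so findall() returns only the number part.
pattern = r"^X\S*: ([0-9.]+)"

positives = []
for line in lines:
    for number_text in re.findall(pattern, line):
        # float() would raise ValueError on odd input such as "1.2.3".
        value = float(number_text)
        if value > 0:
            positives.append(value)
            print(value)
```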
Example
Search for lines that start with 'Details: rev=' followed
by numbers and '.' Then print the number if it is greater
than zero
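
A sketch following the description literally, assuming the pattern '^Details: rev=([0-9.]+)' and made-up input lines:

```python
import re

# Made-up lines; real input may carry other text before "rev=".
lines = ["Details: rev=39772", "Details: rev=0", "Details: none"]

pattern = r"^Details: rev=([0-9.]+)"

revisions = []
for line in lines:
    for rev_text in re.findall(pattern, line):
        rev = float(rev_text)
        if rev > 0:          # keep only values greater than zero
            revisions.append(rev)
            print(rev)
```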
Example
Search for lines that start with From and a character followed by a
two digit number between 00 and 99 followed by ':' Then print the
number if it is greater than zero
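
This can be sketched with the pattern '^From .* ([0-9][0-9]):'. The two sample lines and the address user@example.com are hypothetical:

```python
import re

# Hypothetical lines in the style of mbox-short.txt.
lines = [
    "From user@example.com Sat Jan  5 09:14:16 2008",
    "From user@example.com Sun Jan  6 00:05:00 2008",
]

# The group captures two digits that follow a space and precede a colon.
pattern = r"^From .* ([0-9][0-9]):"

hours = []
for line in lines:
    for hour_text in re.findall(pattern, line):
        hour = int(hour_text)
        if hour > 0:         # "00" converts to 0 and is skipped
            hours.append(hour)
            print(hour)
```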
Escape character
Start with upper case letters and end with digits

pattern = '^[A-Z].*[0-9]$'

Here, the line should start with a capital letter, followed by zero or more characters, and must end with a digit.
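
A quick check of this pattern on a few invented lines:

```python
import re

# Invented sample lines.
lines = ["Room 101", "room 101", "Room A", "X9"]

# ^[A-Z] -> first character is an upper-case letter
# .*     -> zero or more characters of any kind
# [0-9]$ -> last character is a digit
pattern = "^[A-Z].*[0-9]$"
matched = [line for line in lines if re.search(pattern, line)]
print(matched)
```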


The file mbox-short.txt has lines like:

From [email protected] Sat Jan 5 09:14:16 2008

Here, we would like to extract only the hour 09. That is, we would like
only two digits representing hour. Hence, we need to modify our
expression as:

x = re.findall('^From .* ([0-9][0-9]):', line)

Here, [0-9][0-9] indicates that exactly two digits must appear in a row.
An alternative way of writing this is:

x = re.findall('^From .* ([0-9]{2}):', line)
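
Both spellings can be compared on a sample line. The address user@example.com is a hypothetical stand-in for the one in the file:

```python
import re

# Hypothetical line in the same layout as the one quoted above.
line = "From user@example.com Sat Jan  5 09:14:16 2008"

# Two ways of asking for exactly two digits before the colon.
x_repeated = re.findall("^From .* ([0-9][0-9]):", line)
x_counted = re.findall("^From .* ([0-9]{2}):", line)

print(x_repeated, x_counted)
```

Only 09 is captured because it is the only two-digit run that has a space before it and a colon after it.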


Unix/Linux Users
Support for searching files using regular expressions was built into the Unix OS.

There is a command-line program built into Unix, grep, that behaves much like the search() function. Its name comes from the ed editor command g/re/p (globally search for a regular expression and print), although it is sometimes expanded as "Generalized Regular Expression Parser".

Note that traditional grep syntax does not support the non-blank character \S, hence we need to use [^ ], which matches any character other than a space.
Matching Zero or More with the Star

• The * (called the star or asterisk) means “match zero or more”.

• The group that precedes the star can occur any number of times in the text.

• It can be completely absent or repeated over and over again.
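
A small sketch of the star applied to a group. The pattern and the sample words are chosen for illustration and are not taken from the slides:

```python
import re

# (wo)* lets the group "wo" appear zero, one, or many times.
pattern = "Bat(wo)*man"

words = ["Batman", "Batwoman", "Batwowowoman", "Batwman"]
star_results = [re.search(pattern, w) is not None for w in words]
print(star_results)
```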


Matching One or More with the Plus

• The + (or plus) means “match one or more”.
• The group preceding a plus must appear at least once.
• The group preceding a plus is not optional.
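
A sketch of the plus applied to a group, using an illustrative pattern and invented sample words. Unlike the star, the group must occur at least once:

```python
import re

# (wo)+ requires the group "wo" one or more times.
pattern = "Bat(wo)+man"

words = ["Batman", "Batwoman", "Batwowowoman"]
plus_results = [re.search(pattern, w) is not None for w in words]
print(plus_results)
```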
Character classes
• Character classes are used for shortening regular expressions.

• The character class [0-5] matches only the digits 0 to 5.
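
To see the shortening effect, the sketch below compares [0-5] with the longer alternation it replaces, on a made-up string:

```python
import re

text = "a1b6c5d9e0"

# The class [0-5] matches any single digit from 0 through 5.
short_form = re.findall("[0-5]", text)

# The same match written out as an alternation (non-capturing group).
long_form = re.findall("(?:0|1|2|3|4|5)", text)

print(short_form, long_form)
```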

