Module5 RegularExpressions

This document provides an overview of regular expressions in Python, detailing their syntax and usage for searching and extracting patterns from strings. It includes examples of using the 're' module, various metacharacters, and practical applications for extracting data such as email addresses and specific line formats from text files. The document also covers advanced topics like grouping and escaping characters in regular expressions.

Notes for Programming IN Python (Open Elective - 21CS751)

Module -5
5.1 REGULAR EXPRESSIONS
Searching for patterns and extracting only the lines or words that match them
is a very common programming task. We have done such tasks earlier using string
slicing and string methods such as split() and find(). Because searching and
extracting is so common, Python provides a powerful module, re, that implements
regular expressions and handles these tasks elegantly. Although the syntax of
regular expressions is fairly complicated, they provide a compact and efficient
way of searching for patterns.

Regular expressions are themselves little programs to search and parse
strings. To use them in our program, the module re must be imported.
The search() function in this module is used to find a particular
substring within a string. Consider the following example –

import re
fhand = open('myfile.txt')
for line in fhand:
    line = line.rstrip()
    if re.search('how', line):
        print(line)

By referring to file myfile.txt that has been discussed in previous Chapters, the
output would be –

hello, how are you?

how about you?

In the above program, the search() function is used to find the lines
containing the word how.

One can observe that the above program is not much different from a program
that uses the find() method of strings. But regular expressions make use of
special characters with specific meanings. In the following example, we make use
of the caret (^) symbol, which indicates the beginning of the line.

import re
hand = open('myfile.txt')
for line in hand:
    line = line.rstrip()
    if re.search('^how', line):
        print(line)

The output would be –

how about you?
Here, we have searched for a line which starts with the string how. Again, this
program does not make full use of regular expressions, because it could have
been written using the string method startswith(). Hence, in the next section,
we will understand the true usage of regular expressions.
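
The equivalence described above can be checked directly. The following is a minimal sketch; the three sample lines are an assumption standing in for the contents of myfile.txt, and are held in a list so that no file is needed.

```python
import re

# Assumed sample lines standing in for myfile.txt (illustration only)
lines = ["hello, how are you?", "I am doing fine.", "how about you?"]

# Plain substring search: re.search() and the "in" operator agree
with_regex = [ln for ln in lines if re.search("how", ln)]
with_in = [ln for ln in lines if "how" in ln]

# Anchored search: '^how' and str.startswith() agree
anchored_regex = [ln for ln in lines if re.search("^how", ln)]
anchored_str = [ln for ln in lines if ln.startswith("how")]

print(with_regex)       # lines containing "how" anywhere
print(anchored_regex)   # lines beginning with "how"
```

For these simple patterns the string methods are just as good; regular expressions pay off once the pattern contains metacharacters.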

Character Matching in Regular Expressions

Python provides a set of metacharacters for building search patterns. Table 3.1
shows the details of a few important metacharacters. Some examples for
quick and easy understanding of regular expressions are given in Table 3.2.

Table 3.1 List of Important Meta-Characters

Character    Meaning
^ (caret)    Matches beginning of the line
$            Matches end of the line
. (dot)      Matches any single character except newline. With the
             re.DOTALL flag, newline is matched as well
[…]          Matches any single character in brackets
[^…]         Matches any single character NOT in brackets
re*          Matches 0 or more occurrences of preceding expression.
re+          Matches 1 or more occurrences of preceding expression.
re?          Matches 0 or 1 occurrence of preceding expression.
re{n}        Matches exactly n occurrences of preceding expression.
re{n,}       Matches n or more occurrences of preceding expression.
re{n,m}      Matches at least n and at most m occurrences of preceding
             expression.
a|b          Matches either a or b.
(re)         Groups regular expressions and remembers matched text.
\d           Matches digits. For ASCII text, equivalent to [0-9].
\D           Matches non-digits.
\w           Matches word characters (letters, digits and underscore).
\W           Matches non-word characters.
\s           Matches whitespace. For ASCII text, equivalent to
             [ \t\n\r\f\v].
\S           Matches non-whitespace.
\A           Matches beginning of string.
\Z           Matches only at the end of the string.
\z           Matches end of string. Older Python versions do not accept
             \z; use \Z there.
\b           Matches the empty string, but only at the start or end of a word.
\B           Matches the empty string, but not at the start or end of a word.
( )          When parentheses are added to a regular expression, they are
             ignored for the purpose of matching, but allow you to extract a
             particular subset of the matched string rather than the
             whole string when using findall()

Table 3.2 Examples for Regular Expressions

Expression     Description
[Pp]ython      Match "Python" or "python"
rub[ye]        Match "ruby" or "rube"
[aeiou]        Match any one lowercase vowel
[0-9]          Match any digit; same as [0123456789]
[a-z]          Match any lowercase ASCII letter
[A-Z]          Match any uppercase ASCII letter
[a-zA-Z0-9]    Match any uppercase letter, lowercase letter or digit
[^aeiou]       Match anything other than a lowercase vowel
[^0-9]         Match anything other than a digit
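
A few of the table entries can be tried out on a single string. This is only a sketch: the sample text is made up for illustration, and the patterns are written as raw strings (r"...") so that the backslashes reach the re module unchanged.

```python
import re

# Made-up sample text (an assumption for illustration)
text = "Order 66 shipped on 2024-01-15 to ruby or rube fans"

digits = re.findall(r"\d+", text)                # one or more digits
date = re.findall(r"\d{4}-\d{2}-\d{2}", text)    # exact repetition counts
rub = re.findall(r"rub[ye]", text)               # character class
either = re.findall(r"shipped|fans", text)       # alternation
o_words = re.findall(r"\b[oO]\w*", text)         # words beginning with o or O

# $ also matches just before a trailing newline; \Z matches only at the very end
dollar = re.search(r"fans$", text + "\n") is not None
big_z = re.search(r"fans\Z", text + "\n") is not None

print(digits, date, rub, either, o_words, dollar, big_z)
```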

The most commonly used metacharacter is the dot, which matches any single
character except a newline.

Consider the following example, where the regular expression searches for
lines which start with I, followed by any two characters (each dot stands for
one character), followed by the character m.

import re
fhand = open('myfile.txt')
for line in fhand:
    line = line.rstrip()
    if re.search('^I..m', line):
        print(line)

The output would be –

I am doing fine.

Note that, the regular expression ^I..m not only matches ‘I am’, but it can
match ‘Isdm’, ‘I*3m’ and so on. That is, between I and m, there can be any two
characters.
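
To see which strings the pattern ^I..m accepts, it can be run against a handful of candidates. The candidate strings below are assumptions chosen for illustration, including some that should fail.

```python
import re

# Candidate strings (made up); the last two should NOT match
candidates = ["I am doing fine.", "Isdm", "I*3m", "I am", "I'm fine", "  I am"]

# Keep the candidates where I, any two characters and m appear at the start
matches = [c for c in candidates if re.search("^I..m", c)]
print(matches)
```

"I'm fine" fails because only one character separates I and m, and "  I am" fails because the line does not begin with I.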

In the previous program, we knew that there are exactly two characters between
I and m. Hence, we were able to give two dots. But when we do not know the
exact number of characters between two characters (or strings), we can use
the dot and + symbols together. Consider the program given below –

import re
hand = open('myfile.txt')
for line in hand:
    line = line.rstrip()
    if re.search('^h.+u', line):
        print(line)

The output would be –

hello, how are you?
how about you?

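Observe the regular expression ^h.+u here. It indicates that the line should
start with h, followed by one or more arbitrary characters (.+), followed by u.
The matched text ends with u, although the line itself may continue after it
(both output lines above end with a question mark).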
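
The text actually matched by ^h.+u can be inspected through the match object returned by search(). A small sketch, using one of the output lines above as input:

```python
import re

line = "hello, how are you?"

# group() returns the part of the line that the pattern matched
m = re.search("^h.+u", line)
matched = m.group()
print(matched)

# .+ needs at least one character between h and u
too_short = re.search("^h.+u", "hu")
just_enough = re.search("^h.+u", "hou")
print(too_short, just_enough.group())
```

The match stops at the last u in the line, so the trailing question mark is not part of it.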

A few examples:
To understand the behavior of a few basic metacharacters, we will see some
examples. The file used for these examples is mbox-short.txt, which can be
downloaded from –
https://www.py4e.com/code3/mbox-short.txt

Use this as input and try the following examples –

• Pattern to extract lines starting with the word From (or from) and ending with edu:

import re
fhand = open('mbox-short.txt')
for line in fhand:
    line = line.rstrip()
    pattern = '^[Ff]rom.*edu$'
    if re.search(pattern, line):
        print(line)

Here the pattern given for regular expression indicates that the line should
start with either From or from. Then there may be 0 or more characters, and
later the line should end with edu.

• Pattern to extract lines ending with any digit:

Replace the pattern with the following string; the rest of the program
remains the same.
pattern = '[0-9]$'

• Using Not:
pattern = '^[^a-z0-9]+'

Here, the first ^ indicates that we want something to match at the beginning of
a line. Then, the ^ inside the square brackets indicates that any single
character listed in the brackets must not be matched. Hence, the whole meaning
would be: the line must start with one or more characters that are neither
lowercase letters nor digits. In other words, the line should not start with a
lowercase letter or a digit.

• Start with an upper case letter and end with a digit:

pattern = '^[A-Z].*[0-9]$'

Here, the line should start with a capital letter, followed by 0 or more
characters, and must end with a digit.
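
The four patterns above can be compared side by side on a few lines held in memory. This is a sketch only: the sample lines and the dictionary keys are made up for illustration and are not taken from mbox-short.txt.

```python
import re

# Made-up sample lines (assumptions, not real mbox-short.txt content)
samples = [
    "From: someone@example.edu",
    "from the archive at example.org",
    "Received: batch 42",
    "X-Count: 7",
    "total 9",
]

# The four patterns from the examples above, keyed by a descriptive label
patterns = {
    "from_edu": r"^[Ff]rom.*edu$",
    "ends_with_digit": r"[0-9]$",
    "not_lower_or_digit_start": r"^[^a-z0-9]+",
    "upper_start_digit_end": r"^[A-Z].*[0-9]$",
}

# For each pattern, collect the sample lines it matches
results = {
    name: [s for s in samples if re.search(p, s)]
    for name, p in patterns.items()
}

for name, matched in results.items():
    print(name, matched)
```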

Extracting Data using Regular Expressions

The re module provides a function findall() to extract all of the substrings
matching a regular expression. This function returns a list of all
non-overlapping matches in the string. If no match is found, the function
returns an empty list. Consider an example of extracting anything that looks
like an email address from a line.

import re
s = 'A message from [email protected] to [email protected] about meeting @2PM'
lst = re.findall(r'\S+@\S+', s)
print(lst)

The output would be –

['[email protected]', '[email protected]']

Here, the pattern indicates at least one non-whitespace character (\S) before @
and at least one non-whitespace character after @. Hence, it will not match
@2PM, because there is a whitespace before @.
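
The behavior of findall() with this pattern can be reproduced with any addresses. In the sketch below the two addresses are hypothetical placeholders chosen for illustration.

```python
import re

# Hypothetical addresses, used only to illustrate the pattern
s = "A message from alice@example.com to bob@example.org about meeting @2PM"

# At least one non-whitespace character on each side of @
lst = re.findall(r"\S+@\S+", s)
print(lst)

# No match at all gives an empty list, not None
nothing = re.findall(r"\S+@\S+", "no address here")
print(nothing)
```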

Now, we can write a complete program to extract all email-ids from the file.

import re
fhand = open('mbox-short.txt')
for line in fhand:
    line = line.rstrip()
    x = re.findall(r'\S+@\S+', line)
    if len(x) > 0:
        print(x)

Here, the condition len(x) > 0 is checked because we want to print only the
lines which contain an email-ID. If a line has no match for the given pattern,
the findall() function returns an empty list. The length of an empty list is
zero, and hence we print only the lists with length greater than 0.

The output of above program will be something as below –

['[email protected]']
['<[email protected]>']
['<[email protected]>']
['<[email protected]>;']
['<[email protected]>;']
['<[email protected]>;']
['apache@localhost)']
……………………………….
………………………………..

Note that, apart from the email-IDs themselves, the output contains additional
characters (<, >, ; etc.) attached to the extracted pattern. To remove them,
refine the pattern. That is, we want the email-ID to start with a letter or a
digit, and to end with a letter. Hence, the statement would be –

x = re.findall(r'[a-zA-Z0-9]\S*@\S*[a-zA-Z]', line)
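
The effect of the refinement can be seen by running both patterns on the same line. This is a sketch under assumptions: the line below is made up to resemble a mail header, and the address alice@example.com is hypothetical.

```python
import re

# Made-up header-like line with punctuation stuck to the addresses
line = "Received: from <alice@example.com>; by apache@localhost) (dummy)"

# Original pattern: keeps the surrounding punctuation
loose = re.findall(r"\S+@\S+", line)

# Refined pattern: must start with a letter or digit and end with a letter
tight = re.findall(r"[a-zA-Z0-9]\S*@\S*[a-zA-Z]", line)

print(loose)
print(tight)
```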

Combining Searching and Extracting

Assume that we need to extract the data in a particular syntax. For example,
we need to extract the lines containing following format –

X-DSPAM-Confidence: 0.8475
X-DSPAM-Probability: 0.0000

The line should start with X-, followed by zero or more characters. Then we
need a colon and a white-space, which are written as they are. Then there must
be a number made up of one or more digits, with or without a decimal point.
Note that we want the dot as a literal part of the number here, not as a
metacharacter; inside square brackets a dot stands for itself. The pattern
for the regular expression would be –
^X-.*: [0-9.]+

The complete program is –

import re
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    # X, then non-whitespace characters, then ': ' and a number
    if re.search(r'^X\S*: [0-9.]+', line):
        print(line)

The output lines will be as below –

X-DSPAM-Confidence: 0.8475
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.6178
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.6961
X-DSPAM-Probability: 0.0000
……………………………………………………
……………………………………………………

Assume that we want only the numbers (representing confidence, probability
etc.) in the above output. We can use the split() function on the extracted
string. But it is better to refine the regular expression. To do so, we need
the help of parentheses.

When we add parentheses to a regular expression, they are ignored when
matching the string. But when we are using findall(), parentheses indicate
that while we want the whole expression to match, we are only interested in
extracting the portion of the substring that matches the part in parentheses.

import re
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    x = re.findall(r'^X-\S*: ([0-9.]+)', line)
    if len(x) > 0:
        print(x)

Because of the parentheses around the last part of the pattern, findall() still
requires the whole pattern starting with X- to match, but returns only the
numeric portion. Now, the output would be –
['0.8475']
['0.0000']
['0.6178']
['0.0000']
['0.6961']
…………………
………………..
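
The difference the parentheses make can be shown on a single line taken from the output above. The sketch also converts the extracted text to a float and, as an extra illustration not used in the notes, shows that two groups yield a list of tuples.

```python
import re

line = "X-DSPAM-Confidence: 0.8475"

# Without parentheses findall() returns the whole matched text
whole = re.findall(r"^X-\S*: [0-9.]+", line)

# With parentheses it returns only the text matched inside them
number = re.findall(r"^X-\S*: ([0-9.]+)", line)
value = float(number[0])

# With two groups each match becomes a tuple (header name, number)
pairs = re.findall(r"^(X-\S*): ([0-9.]+)", line)

print(whole, number, value, pairs)
```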

Another example of similar form: The file mbox-short.txt contains lines like –

Details: http://source.sakaiproject.org/viewsvn/?view=rev&rev=39772

We may be interested in extracting only the revision numbers mentioned at
the end of these lines. Then, we can write the statement –

x = re.findall('^Details:.*rev=([0-9.]+)', line)

The regex here indicates that the line must start with Details:, followed by
anything, then rev= and then the digits. As we want only those digits, we put
parentheses around that portion of the expression. Note that the expression
[0-9.]+ is greedy: it keeps grabbing digits (or dots) until it finds any other
character, so it captures the number however long it is. The output of the
above regular expression is a list of revision numbers for each matching
line, as given below –
['39772']
['39771']
['39770']
['39769']
………………………
………………………
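
Greediness can be made visible by comparing + with its non-greedy form +?, which the re module also supports but which is not otherwise used in these notes. A minimal sketch on the sample line shown above:

```python
import re

line = "Details: http://source.sakaiproject.org/viewsvn/?view=rev&rev=39772"

# Greedy: + takes as many digits (or dots) as it can
greedy = re.findall(r"^Details:.*rev=([0-9.]+)", line)

# Non-greedy: +? stops as soon as the pattern is satisfied (one character)
lazy = re.findall(r"^Details:.*rev=([0-9.]+?)", line)

print(greedy)
print(lazy)
```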

Consider another example – we may be interested in knowing the time of day of
each email. The file mbox-short.txt has lines like –
From [email protected] Sat Jan 5 09:14:16 2008

Here, we would like to extract only the hour 09. That is, we would like only
the two digits representing the hour. Hence, we need to modify our expression
as –
x = re.findall('^From .* ([0-9][0-9]):', line)

Here, [0-9][0-9] matches exactly two consecutive digits. An alternative way of
writing this would be –

x = re.findall('^From .* ([0-9]{2}):', line)

The number 2 within curly braces indicates that the preceding expression should
appear exactly two times. Hence [0-9]{2} matches exactly two digits. Now, the
output would be –

['09']
['18']
['16']
['15']
…………………
…………………
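
That the two ways of writing the pattern behave identically can be checked on a few lines. This is a sketch only: the sample lines follow the From line format shown above, but the addresses are hypothetical.

```python
import re

# Sample lines in the mbox "From " format (addresses are made up)
lines = [
    "From alice@example.com Sat Jan  5 09:14:16 2008",
    "From bob@example.org Fri Jan  4 18:10:48 2008",
    "From: alice@example.com",   # header line, has no time stamp
]

hours_a = []
hours_b = []
for line in lines:
    # extend() adds nothing when findall() returns an empty list
    hours_a.extend(re.findall("^From .* ([0-9][0-9]):", line))
    hours_b.extend(re.findall("^From .* ([0-9]{2}):", line))

print(hours_a)
print(hours_b)
```

Only the hour is captured because it is the only two-digit field that is preceded by a space and followed by a colon.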

Escape Character
As we have discussed till now, characters like dot, plus, question mark,
asterisk, dollar etc. are metacharacters in regular expressions. Sometimes we
need these characters themselves as a part of the matching string. Then we need
to escape them using a backslash. For example,

import re
x = 'We just received $10.00 for cookies.'
y = re.findall(r'\$[0-9.]+', x)
print(y)

Output:
['$10.00']

Here, we want to extract only the price $10.00. As the $ symbol is a
metacharacter, we need to use \ before it, so that $ is treated as a part of
the matching string and not as a metacharacter.
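
What happens without the backslash can be checked on the same string. The sketch below also uses re.escape(), a function of the re module that is not mentioned in these notes, to escape every metacharacter in a literal string automatically.

```python
import re

x = "We just received $10.00 for cookies."

# Escaped: \$ matches a literal dollar sign
escaped = re.findall(r"\$[0-9.]+", x)

# Unescaped: $ means end of line, so digits can never follow it
unescaped = re.findall(r"$[0-9.]+", x)

# re.escape() builds a pattern that matches the given text literally
literal = re.findall(re.escape("$10.00"), x)

print(escaped, unescaped, literal)
```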

Bonus Section for Unix/Linux Users

Support for searching files using regular expressions was built into the Unix OS.
There is a command-line program built into Unix called grep (named after the ed
editor command g/re/p, "globally search for a regular expression and print")
that behaves similarly to the search() function.

$ grep '^From:' mbox-short.txt

Output:
From: [email protected]
From: [email protected]
From: [email protected]
From: [email protected]

Note that not every grep implementation supports the non-blank character class
\S; the portable alternative is [^ ], which means any character other than a
space.
