Python Module-3 Notes (21EC646)_final
Module -3
Regular Expression (Pattern Matching) & Reading and Writing Files
Syllabus:
Pattern Matching:
• Pattern Matching with Regular Expressions,
• Finding Patterns of Text Without Regular Expressions,
• Finding Patterns of Text with Regular Expressions,
• More Pattern Matching with Regular Expressions,
• The findall() Method,
• Character Classes,
• Making Your Own Character Classes,
• The Caret and Dollar Sign Characters,
• The Wildcard Character,
• Review of Regex Symbols.
PATTERN MATCHING
Pattern matching is the process of finding specific text (a pattern) within a larger body of text.
Let's search for the word "Dhoni" in the given paragraph using Python without regular
expressions. Instead, we'll use basic string tools such as the find() method and the in operator.
Python Program:
# Paragraph
text = """Mahendra Singh Dhoni, commonly known as MS Dhoni, is one of the most
successful captains in the history of Indian cricket. Known for his calm demeanor and
exceptional leadership skills, Dhoni has led India to numerous victories, including the
ICC World T20 in 2007, the ICC Cricket World Cup in 2011, and the ICC Champions
Trophy in 2013."""
Note: Here, if and in are keywords; text and pattern are variables.
Explanation:
1. text is a variable that holds the text data (a multi-line string).
2. pattern is a variable that holds the string or word to be searched for, "Dhoni".
3. in operator: We use the in operator to check whether the pattern exists in the
text. If it does, the expression evaluates to True and the body of the if
statement is executed.
4. Print the result: Based on the outcome of the in operator, the program prints the result.
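The steps above can be put together as the following minimal sketch. The paragraph is shortened here for brevity, and the printed messages are illustrative rather than taken from the original program.

```python
# Shortened version of the paragraph, used only for illustration
text = """Mahendra Singh Dhoni, commonly known as MS Dhoni, is one of the most
successful captains in the history of Indian cricket."""

pattern = "Dhoni"  # the word to search for

# The in operator evaluates to True when pattern occurs anywhere in text
found = pattern in text
if found:
    print(f"'{pattern}' found in the text")
else:
    print(f"'{pattern}' not found in the text")

# find() returns the index of the first occurrence, or -1 if it is absent
position = text.find(pattern)
print("First occurrence at index:", position)
```

The in operator only answers yes or no, while find() also tells us where the first occurrence starts.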
Example 2: Finding Phone Number from a Given Text (Imp_as per syllabus)
Aim:
To find phone numbers with the format (XXX-XXX-XXXX) within a given text.
The phone number format consists of 12 characters: the first 3 are digits, followed by a hyphen,
then 3 digits, another hyphen, and finally 4 digits.
Program:
def PhoneNumber(text):
    # A valid number has exactly 12 characters
    if len(text) != 12:
        return False
    # First 3 characters must be digits
    for i in range(0, 3):
        if not text[i].isdecimal():
            return False
    if text[3] != '-':
        return False
    # Next 3 characters must be digits
    for i in range(4, 7):
        if not text[i].isdecimal():
            return False
    if text[7] != '-':
        return False
    # Last 4 characters must be digits
    for i in range(8, 12):
        if not text[i].isdecimal():
            return False
    return True
# Function call
text = '415-555-4242'
Z = PhoneNumber(text)
print(Z)
Output: True
text = 'Dhoni-123'
Z = PhoneNumber(text)
print(Z)
Output: False
Explanation
1. Function Definition: PhoneNumber
2. Check if the string length is 12 characters. If not, return False.
3. Verify the first 3 characters are digits. If not, return False.
4. Check if the 4th character is a hyphen ('-'). If not, return False.
5. Verify characters 5 to 7 are digits. If not, return False.
6. Check if the 8th character is a hyphen ('-'). If not, return False.
7. Verify characters 9 to 12 are digits. If not, return False.
8. Output: If all are True, return True.
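# Check every 12-character chunk of message (sample text, for illustration)
message = 'Call me at 415-555-1011 tomorrow. 415-555-9999 is my office.'
for i in range(len(message)):
    chunk = message[i:i+12]
    if PhoneNumber(chunk):
        print(f'Phone number found: {chunk}')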
Syntax of Regex
Characters:
2. Anchors:
o ^ : Matches the start of a string.
o $ : Matches the end of a string.
3. Quantifiers:
o * : Matches 0 or more repetitions.
o + : Matches 1 or more repetitions.
o ? : Matches 0 or 1 repetition.
o {n} : Matches exactly n repetitions.
o {n,} : Matches n or more repetitions.
o {n,m} : Matches between n and m repetitions.
4. Character Classes:
o [abc] : Matches any single character a, b, or c.
o [^abc] : Matches any single character except a, b, or c.
o [a-z] : Matches any single character from a to z.
6. Escaping:
o \ : Escape character, used to match special characters literally (e.g., \., \*).
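The symbols listed above can be tried out with a short sketch such as the one below. The sample strings and variable names are illustrative and are not part of the original notes.

```python
import re

# Anchors: ^ ties the match to the start of the string, $ to the end
starts_with_hello = bool(re.search(r'^Hello', 'Hello world'))
ends_with_world = bool(re.search(r'world$', 'Hello world'))

# Quantifiers: {n} repeats the preceding item exactly n times
three_digits = re.search(r'\d{3}', 'abc12345').group()
# + needs at least one repetition, * also accepts zero repetitions
one_or_more = re.search(r'a+', 'caaat').group()
zero_or_more = re.search(r'ca*t', 'ct').group()

# Character classes: a set, and a negated set
vowels = re.findall(r'[aeiou]', 'regex')
non_vowels = re.findall(r'[^aeiou]', 'regex')

# Escaping: \. matches a literal dot instead of "any character"
literal_dots = re.findall(r'\.', 'a.b.c')

print(starts_with_hello, ends_with_world)        # True True
print(three_digits, one_or_more, zero_or_more)   # 123 aaa ct
print(vowels, non_vowels, literal_dots)
```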
Program:
import re
#print result
print(mo_g)
1. Creating Groups:
o Putting parentheses () around parts of the regex pattern creates groups.
o Example: (\d\d\d)-(\d\d\d-\d\d\d\d) creates two groups:
(\d\d\d)=> group 1 for the area code
(\d\d\d-\d\d\d\d)=> group 2 for the main number.
2. Accessing Groups:
o After performing a search with the search() method, use the group() method on the returned match object to
retrieve specific groups:
▪ mo.group(0) (equivalent to mo.group()) retrieves the entire matched text.
▪ mo.group(1) retrieves the text matched by the first group.
▪ mo.group(2) retrieves the text matched by the second group.
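Program:
import re
text = 'My number is 415-555-4242.'  # sample text (illustrative)
comp = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
mo = comp.search(text)
area_code = mo.group(1)    # group 1: area code
main_number = mo.group(2)  # group 2: main number
entire_no = mo.group(0)    # entire matched text
#print result
print(f"Area code : {area_code}\nMain number : {main_number}\nEntire number : {entire_no}")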
o If both 'Ramesh' and 'Mahesh' are in the string, the first occurrence is matched.
import re
Str1 = 'Ramesh Mahesh Suresh Kalmesh'
Str2 = 'Mahesh Ramesh Suresh Kalmesh'
pattern = r'Ramesh|Mahesh'
comp = re.compile(pattern)
mo1 = comp.search(Str1)
mo1_g = mo1.group() if mo1 else 'No match'
print(mo1_g) # Output: Ramesh
mo2 = comp.search(Str2)
mo2_g = mo2.group() if mo2 else 'No match'
print(mo2_g) # Output: Mahesh
pattern = r'Bat(man|mobile|copter|bat)'
• This way, we can specify the common prefix 'Bat' only once.
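Program:
import re
comp = re.compile(r'Bat(man|mobile|copter|bat)')
mo1 = comp.search('Batmobile lost a wheel')  # sample text (illustrative)
# Retrieve the full matched text and the part inside parentheses
mo1_g = mo1.group()    # Full match: 'Batmobile'
mo1_g2 = mo1.group(1)  # Part inside parentheses: 'mobile'
print(mo1_g, mo1_g2)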
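import re
# (wo)? is an optional group: it matches zero or one time
comp = re.compile(r'Bat(wo)?man')
# Sample strings (illustrative)
Str1 = 'The Adventures of Batman'
Str2 = 'The Adventures of Batwoman'
Str3 = 'The Adventures of Batwowowowoman'
mo1 = comp.search(Str1)
mo1_g = mo1.group() if mo1 else 'no match'
print(mo1_g) # Output: Batman
mo2 = comp.search(Str2)
mo2_g = mo2.group() if mo2 else 'no match'
print(mo2_g) # Output: Batwoman
mo3 = comp.search(Str3)
mo3_g = mo3.group() if mo3 else 'no match'
print(mo3_g) # Output: no match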
Example 2
import re
try:
    # for string 1
    # Use the search method to search for the pattern
    mo1 = compl.search(String1)
    mo1_g = mo1.group() if mo1 else 'no match'
    print(mo1_g)
    mo2 = compl.search(String2)
    # group() raises AttributeError when search() returned None
    mo2_g = mo2.group()
    print(mo2_g)
except AttributeError:
    print('None')
import re
# (wo)* matches the group zero or more times
comp = re.compile(r'Bat(wo)*man')
# Sample strings (illustrative)
Str1 = 'The Adventures of Batman'
Str2 = 'The Adventures of Batwoman'
Str3 = 'The Adventures of Batwowowowoman'
# Use the search method to search for the pattern
mo1 = comp.search(Str1)
mo1_g = mo1.group() if mo1 else 'no match'
print(mo1_g) # Output: Batman
mo2 = comp.search(Str2)
mo2_g = mo2.group() if mo2 else 'no match'
print(mo2_g) # Output: Batwoman
mo3 = comp.search(Str3)
mo3_g = mo3.group() if mo3 else 'no match'
print(mo3_g) # Output: Batwowowowoman
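import re
comp = re.compile(r'Bat(wo)+man')  # (wo)+ matches the group one or more times
Str1, Str2, Str3 = 'Batman', 'Batwoman', 'Batwowowowoman'  # sample strings
mo1 = comp.search(Str1)
mo1_g = mo1.group() if mo1 else 'No match'
print(mo1_g) # Output: No match
mo2 = comp.search(Str2)
mo2_g = mo2.group() if mo2 else 'No match'
print(mo2_g) # Output: Batwoman
mo3 = comp.search(Str3)
mo3_g = mo3.group() if mo3 else 'No match'
print(mo3_g) # Output: Batwowowowoman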
3. Equivalent Patterns:
o (Ha){3} is the same as (Ha)(Ha)(Ha).
o (Ha){3,5} is the same as ((Ha)(Ha)(Ha))|((Ha)(Ha)(Ha)(Ha))|((Ha)(Ha)(Ha)(Ha)(Ha)).
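import re
# Example string: 'Ha' repeated six times
Str = 'HaHaHaHaHaHa'
# Patterns assumed from the variable names
comp_specific = re.compile(r'(Ha){3}')   # exactly 3 repetitions
comp_range = re.compile(r'(Ha){3,5}')    # 3 to 5 repetitions
comp_no_first = re.compile(r'(Ha){,4}')  # 0 to 4 repetitions
comp_no_last = re.compile(r'(Ha){3,}')   # 3 or more repetitions
# Search
mo1 = comp_specific.search(Str)
mo1_g = mo1.group()
print(mo1_g) # Output: HaHaHa
mo2 = comp_range.search(Str)
mo2_g = mo2.group()
print(mo2_g) # Output: HaHaHaHaHa
mo3 = comp_no_first.search(Str)
mo3_g = mo3.group()
print(mo3_g) # Output: HaHaHaHa
mo4 = comp_no_last.search(Str)
mo4_g = mo4.group()
print(mo4_g) # Output: HaHaHaHaHaHa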
2. Nongreedy Matching:
o We can use the ? after the curly bracket to make the regex nongreedy.
o It matches the shortest string possible.
o Example: (Ha){3,5}? Here it matches 'HaHaHa'
3. Two Meanings of ?:
o Represents a nongreedy match when used after a quantifier.
o Represents an optional group when used directly after a group.
import re
str='HaHaHaHaHa'
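A minimal sketch of greedy versus nongreedy matching, and of the two meanings of ?, is given below. The variable names and the second sample string are assumptions made for illustration.

```python
import re

text = 'HaHaHaHaHa'  # 'Ha' repeated five times

# Greedy (default): takes as many repetitions as the quantifier allows
greedy = re.search(r'(Ha){3,5}', text).group()

# Nongreedy: a ? placed after the quantifier takes as few as allowed
nongreedy = re.search(r'(Ha){3,5}?', text).group()

# The other meaning of ?: an optional group (zero or one occurrence)
optional = re.search(r'Bat(wo)?man', 'Batman and Batwoman').group()

print(greedy)     # HaHaHaHaHa
print(nongreedy)  # HaHaHa
print(optional)   # Batman
```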
Key Points:
1. Without Groups:
▪ If the regex pattern has no groups, findall() returns a list of strings.
▪ List contains matched strings in the text.
2. With Groups:
o If the pattern contains groups (denoted by parentheses), findall() returns a list of tuples.
o Each tuple contains matched strings for each group
Without Groups
import re
With Groups:
import re
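The two cases can be compared with a sketch like the following. The sample text and the phone-number patterns are assumptions chosen for illustration.

```python
import re

text = 'Cell: 415-555-9999 Work: 212-555-0000'  # sample text (illustrative)

# Without groups: findall() returns a list of matched strings
without_groups = re.findall(r'\d{3}-\d{3}-\d{4}', text)

# With groups: findall() returns a list of tuples, one string per group
with_groups = re.findall(r'(\d{3})-(\d{3})-(\d{4})', text)

print(without_groups)  # ['415-555-9999', '212-555-0000']
print(with_groups)     # [('415', '555', '9999'), ('212', '555', '0000')]
```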
Character Classes
• Character classes simplify regular expressions by using shorthand for common groups of
characters.
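• Shorthand codes for common character classes:
o \d : Any numeric digit from 0 to 9.
o \D : Any character that is not a numeric digit.
o \w : Any letter, numeric digit, or the underscore character.
o \W : Any character that is not a letter, numeric digit, or the underscore character.
o \s : Any space, tab, or newline character.
o \S : Any character that is not a space, tab, or newline character.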
• Character classes make regular expressions more concise. For example, [0-5] matches only
the numbers 0 to 5, which is shorter than typing (0|1|2|3|4|5).
Example:
To use character classes to find all instances of a number followed by a word in a string:
import re
In this example, the regular expression \d+\s\w+ matches text with one or more numeric digits
(\d+), followed by a whitespace character (\s), followed by one or more word characters (\w+). The
findall() method returns all matching strings of the regex pattern in a list.
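The description above corresponds to a sketch along these lines. The sample text is an assumption, since the original listing is not shown.

```python
import re

# Sample text (illustrative): each item is a number followed by a word
text = '12 drummers, 11 pipers, 10 lords, 9 ladies'

# \d+ : one or more digits, \s : one whitespace, \w+ : one or more word characters
matches = re.findall(r'\d+\s\w+', text)

print(matches)  # ['12 drummers', '11 pipers', '10 lords', '9 ladies']
```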
• Including Ranges:
• We can include ranges of characters using a hyphen.
• For example, [a-zA-Z0-9] matches all lowercase letters, uppercase letters, and numbers.
Program:
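import re
# Text to search
text = 'India won the twenty twenty world cup'
# Define pattern
Pattern1 = r'[aeiouAEIOU]'  # any vowel, either case
Pattern2 = r'[a-zA-Z0-9]'   # any letter or digit
Comp1 = re.compile(Pattern1)
Comp2 = re.compile(Pattern2)
#for pattern1
Mo1 = Comp1.findall(text)
print(Mo1) # Output: ['I', 'i', 'a', 'o', 'e', 'e', 'e', 'o', 'u']
#for pattern2
Mo2 = Comp2.findall(text)
print(Mo2)
# Output: ['I', 'n', 'd', 'i', 'a', 'w', 'o', 'n', 't', 'h', 'e', 't', 'w', 'e', 'n', 't', 'y',
# 't', 'w', 'e', 'n', 't', 'y', 'w', 'o', 'r', 'l', 'd', 'c', 'u', 'p']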
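import re
Str1 = 'Hello world!'
Str2 = 'He said hello.'
pattern = r'^Hello'
comp = re.compile(pattern)
mo1 = comp.search(Str1)
mo1_g = mo1.group() if mo1 else 'No match'
print(mo1_g) # Output: Hello
mo2 = comp.search(Str2)
mo2_g = mo2.group() if mo2 else 'No match'
print(mo2_g) # Output: No match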
import re
Str1 = 'Hello world!'
Str2 = 'He said Hello'
pattern = r'Hello$'
comp = re.compile(pattern)
mo1 = comp.search(Str1)
mo1_g = mo1.group() if mo1 else 'None'
print(mo1_g) # Output: None
mo2 = comp.search(Str2)
mo2_g = mo2.group() if mo2 else 'None'
print(mo2_g) # Output: Hello
import re
Str1 = '1234567890'
Str2 = '12345xyz67890'
pattern = r'^\d+$'
comp = re.compile(pattern)
mo1 = comp.search(Str1)
mo1_g = mo1.group() if mo1 else 'No match'
print(mo1_g) # Output: 1234567890
mo2 = comp.search(Str2)
mo2_g = mo2.group() if mo2 else 'No match'
print(mo2_g) # Output: No match
▪ These symbols are essential for defining precise patterns in text matching using regular
expressions.
The dot (.) is a wildcard that matches any character except a newline.
import re
Example:
import re
# Define pattern
pattern = r'First Name: (.*) Last Name: (.*)'
The dot-star (.*) matches everything except a newline. By passing re.DOTALL as the second
argument to re.compile(), we can make the dot character match all characters, including the
newline character.
Program
import re
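The behaviour of the dot-star, with and without re.DOTALL, can be sketched as follows. The sample strings and variable names are assumptions chosen for illustration; only the 'First Name ... Last Name' pattern comes from the notes.

```python
import re

# Dot-star captures whatever sits between the fixed labels
name_regex = re.compile(r'First Name: (.*) Last Name: (.*)')
mo = name_regex.search('First Name: Virat Last Name: Kohli')
first_name, last_name = mo.group(1), mo.group(2)

text = 'Line one\nLine two\nLine three'

# Without re.DOTALL the dot stops at the first newline
no_newline = re.compile(r'.*').search(text).group()

# With re.DOTALL the dot also matches newline characters
with_newline = re.compile(r'.*', re.DOTALL).search(text).group()

print(first_name, last_name)  # Virat Kohli
print(repr(no_newline))       # 'Line one'
print(repr(with_newline))     # 'Line one\nLine two\nLine three'
```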
Program:
import re
1. Example:
o Replace all instances of "Agent [name]" with "CENSORED"
import re
# text or string
text = 'Agent Virat gave the secret documents to Agent Dhoni.'
pattern = r'Agent \w+'  # 'Agent' followed by a name
comp = re.compile(pattern)
result = comp.sub('CENSORED', text)  # replace every match
print(result)
# Output: CENSORED gave the secret documents to CENSORED.
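▪ Censor the agent names, showing only the first letter and replacing the rest with asterisks:
import re
# text or string
text = 'Agent Virat gave the secret documents to Agent Dhoni.'
pattern = r'Agent (\w)\w*'  # group 1 captures the first letter of the name
comp = re.compile(pattern)
result = comp.sub(r'\1****', text)  # \1 inserts group 1; four asterisks are an illustrative choice
print(result)
# Output: V**** gave the secret documents to D****.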
▪ Regular expressions can get complicated when dealing with complex text patterns.
▪ To make them more readable, we can use "verbose mode" with re.VERBOSE, which allows
for whitespace and comments inside the regex string.
Program:
import re
# String or Text
text = 'Call me at (123) 456-7890 or 123-456-7890 ext. 1234.'
# Define the regex pattern with comments and spread over multiple lines
pattern = r'''(
(\d{3} | \(\d{3}\))? # area code
(\s |- |\.)? # separator
\d{3} # first 3 digits
(\s |- |\.) # separator
\d{4} # last 4 digits
(\s*(ext|x|ext\.)\s*\d{2,5})? # extension
)'''
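# Compile in verbose mode and find all matches
comp = re.compile(pattern, re.VERBOSE)
matches = comp.findall(text)
# Print matches (each match is a tuple of groups)
for match in matches:
    print(match)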
1. Combine Flags: Use the bitwise OR operator (|) to combine re.IGNORECASE, re.DOTALL, and
re.VERBOSE.
2. Compile with Combined Flags: Pass the combined flags as the second argument to
re.compile().
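Program
import re
# Text to search
text = "Hi\nHI\nHi"
pattern = r'hi'  # assumed pattern
# Combine the three flags with the bitwise OR operator
comp = re.compile(pattern, re.IGNORECASE | re.DOTALL | re.VERBOSE)
matches = comp.findall(text)
# Print matches
print(matches) # Output: ['Hi', 'HI', 'Hi']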
Program:
import pyperclip
import re
matches.append(groups[0])
• Path Structure:
o Folders (or directories) can contain files and other folders.
o Example: project.docx is in Documents, which is in Python, which is in Users.
o The root folder: in Windows it is C:\ (the C: drive); in Linux it is /.
o Path Separators: Windows uses backslashes (\), while OS X and Linux use forward slashes (/).
os.path.join()
os.path.join() is used to build paths in Python in a way that works on every operating system.
• Creating Paths:
os.path.join('usr', 'bin', 'spam')
▪ It creates usr\bin\spam on Windows
▪ usr/bin/spam on OS X/Linux.
▪ Useful for constructing file paths programmatically.
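A short sketch of constructing paths programmatically is shown below. The folder name 'reports' and the file names are hypothetical and serve only as an illustration.

```python
import os

# os.path.join() inserts the separator that is correct for the current OS
path = os.path.join('usr', 'bin', 'spam')
print(path)  # usr\bin\spam on Windows, usr/bin/spam on OS X/Linux

# Joining one folder with several file names (hypothetical names)
files = ['accounts.txt', 'details.csv', 'invite.docx']
full_paths = [os.path.join('reports', name) for name in files]
for p in full_paths:
    print(p)
```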
o Changes to C:\Windows\System32.
o It creates a new folder hierarchy. This creates all intermediate folders, even if they don't exist.
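Assuming the two points above refer to os.chdir() and os.makedirs(), their behaviour can be sketched as follows. The folder names are hypothetical, and a temporary directory is used so that nothing is created in a real location.

```python
import os
import tempfile

original_cwd = os.getcwd()  # remember where we started

# Build a nested path inside a temporary folder (hypothetical folder names)
base = tempfile.mkdtemp()
nested = os.path.join(base, 'delicious', 'walnut', 'waffles')

# makedirs() creates every missing intermediate folder in one call
os.makedirs(nested)

try:
    os.chdir(nested)      # change the current working directory
    inside = os.getcwd()  # now reports the new folder
    print(inside)
finally:
    os.chdir(original_cwd)  # always return to the starting folder
```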
Overview:
o The os.path module is a submodule of the os module in Python.
o It offers functions for working with file paths and filenames.
o It ensures compatibility across different operating systems.
o Importing the Module: We can import the module using import os.
os.path.abspath(path):
o It converts a relative path to an absolute path.
▪ Example:
import os
os.path.abspath('.')
Output: 'C:\\Python34'.
os.path.isabs(path):
• It checks, whether the given path is absolute.
• Output is True if the given path is an absolute path and False if it is a relative path.
▪ Example:
os.path.isabs('.')
Output: False.
o os.path.dirname(path):
▪ It returns everything before the last component of the path (directory).
▪ Example: os.path.dirname('C:\\Windows\\System32\\calc.exe')
▪ Output: 'C:\\Windows\\System32'.
o os.path.split(path):
▪ It returns both the directory and filename as a tuple.
▪ Example: os.path.split('C:\\Windows\\System32\\calc.exe')
▪ Output: ('C:\\Windows\\System32', 'calc.exe').
o os.listdir(path):
▪ It returns a list of filenames in the directory specified by path.
▪ Example: os.listdir('C:\\Windows\\System32')
▪ Output: list of filenames in that directory.
o os.path.isfile(path):
▪ It checks if the path is a file.
▪ Example: os.path.isfile('C:\\Windows\\System32\\calc.exe')
▪ Output: True.
o os.path.isdir(path):
▪ Checks if the path is a directory.
▪ Example: os.path.isdir('C:\\Windows\\System32')
▪ Output: True.
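The functions above can be exercised on any operating system with a sketch like this one. It works in a temporary folder, and the file name calc.txt is a hypothetical stand-in for the Windows paths used in the examples.

```python
import os
import tempfile

# Create a temporary folder containing one small file
folder = tempfile.mkdtemp()
file_path = os.path.join(folder, 'calc.txt')
with open(file_path, 'w') as f:
    f.write('demo')

absolute = os.path.abspath('.')         # absolute form of the current folder
is_abs = os.path.isabs(absolute)        # True: already absolute
is_rel_abs = os.path.isabs('.')         # False: '.' is relative

directory = os.path.dirname(file_path)  # everything before the last component
head, tail = os.path.split(file_path)   # (directory, filename)
names = os.listdir(folder)              # names of entries inside the folder
file_check = os.path.isfile(file_path)  # True for a file
dir_check = os.path.isdir(folder)       # True for a folder

print(head, tail, names, file_check, dir_check)
```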
Here, the arguments are:
filename: The name of the file to be opened. It can be given with or without the
path; the path is optional if the file is in the current working directory.
f1: When we open a file using the open() function, Python returns a file object. This file object
is stored in the variable f1 and is also called a handle. A file object represents a file on our
computer; it is simply another type of value in Python, much like lists or dictionaries.
The file object can then be used to read the contents of the file or to perform other operations
on it.
mode: Specifies the purpose of opening the file (for reading, writing, etc.).
Reading Files
• Consider the text file with name “myfile” written in notepad and stored in current directory.
Output:
• In this case, read() reads the entire file at once and stores it in the variable "d1". This
method is therefore suitable only when the file is small.
Output:
['Hello python \n',
'Welcome python\n',
'Hello India \n',
'How are you \n']
Note: readlines() method returns a list of string values from the file, one string for each line of text
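The difference between read() and readlines() can be seen in a small sketch. It writes a sample file into a temporary folder first, so the file contents here are an assumption and not the original myfile.

```python
import os
import tempfile

# Create a small sample file (contents are illustrative)
folder = tempfile.mkdtemp()
path = os.path.join(folder, 'myfile.txt')
with open(path, 'w') as f:
    f.write('Hello python\nWelcome python\n')

# read() returns the whole file as one string
with open(path, 'r') as f1:
    d1 = f1.read()

# readlines() returns a list with one string per line
with open(path, 'r') as f1:
    lines = f1.readlines()

print(repr(d1))  # 'Hello python\nWelcome python\n'
print(lines)     # ['Hello python\n', 'Welcome python\n']
```

The with statement closes the file automatically, so no explicit close() call is needed.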
Writing Files
• We can use the write() method to write data into a file.
• write() can be used when the file is opened in one of two modes: "w" or "a".
"a" - Append - will append to the end of the file
"w" - Write - will overwrite any existing content
• If the file does not exist, then a new file (with the given name) will be created.
Example 1: To append
This method will append text to the end of the existing file. No overwriting in case of append
Open the file "myfile.txt" and append content to the file:
f = open("myfile.txt", "a")
f.write("Hello Virat \n")
f.write("Hello Dhoni")
f.close()
Example 2: To write
This method overwrites the data that already exists in the file.
#Open the file "myfile.txt" and overwrite the content:
f = open("myfile.txt", "w")
f.write("Hello Pandya")
f.close()
#open and read the file after the writing :
f = open("myfile.txt", "r")
print(f.read())
To retrieve data
import shelve
shelf_file = shelve.open('mydata')
retrieved_cats = shelf_file['cats'] # Retrieve the list using the key 'cats'
print(retrieved_cats) # Output: ['Zophie', 'Pooka', 'Simon']
shelf_file.close()
To convert to list
• Just like dictionaries, shelf values have keys() and values().
• These methods return list-like values, not true lists.
• To get true lists, pass the returned values to the list() function.
print("Keys:", keys_list)
print("Values:", values_list)
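The complete round trip (store, retrieve, convert to lists) might look like the sketch below. It saves the shelf in a temporary folder, which is an assumption made so that the example leaves no files behind.

```python
import os
import shelve
import tempfile

# Keep the shelf files in a temporary folder (illustrative choice)
folder = tempfile.mkdtemp()
shelf_path = os.path.join(folder, 'mydata')

# Store a list under the key 'cats'
with shelve.open(shelf_path) as shelf_file:
    shelf_file['cats'] = ['Zophie', 'Pooka', 'Simon']

# Reopen the shelf, retrieve the data and convert keys/values to true lists
with shelve.open(shelf_path) as shelf_file:
    retrieved_cats = shelf_file['cats']
    keys_list = list(shelf_file.keys())
    values_list = list(shelf_file.values())

print("Keys:", keys_list)      # Keys: ['cats']
print("Values:", values_list)  # Values: [['Zophie', 'Pooka', 'Simon']]
```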
• The pprint.pformat() function from the pprint module allows us to convert complex
data structures into a formatted string, which is easy to read.
Program:
import pprint
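A minimal sketch of pprint.pformat() follows. The cats data and the use of ast.literal_eval() to read the string back are assumptions added for illustration.

```python
import ast
import pprint

# Sample data structure (illustrative)
cats = [{'name': 'Zophie', 'desc': 'chubby'},
        {'name': 'Pooka', 'desc': 'fluffy'}]

# pformat() returns the formatted text as a string instead of printing it
formatted = pprint.pformat(cats)
print(formatted)

# The string is valid Python literal syntax, so it can be parsed back
restored = ast.literal_eval(formatted)
```

Because the result is an ordinary string, it can also be written to a .py or .txt file with write() and loaded again later.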
This program will create 35 unique quizzes and their corresponding answer keys, each with
randomized questions and answer choices.
Program :
import random
# The quiz data: keys are states and values are their capitals.
capitals = {
'Alabama': 'Montgomery', 'Alaska': 'Juneau', 'Arizona': 'Phoenix',
'Arkansas': 'Little Rock', 'California': 'Sacramento', 'Colorado': 'Denver',
'Connecticut': 'Hartford', 'Delaware': 'Dover', 'Florida': 'Tallahassee',
'Georgia': 'Atlanta', 'Hawaii': 'Honolulu', 'Idaho': 'Boise',
'Illinois': 'Springfield', 'Indiana': 'Indianapolis', 'Iowa': 'Des Moines',
'Kansas': 'Topeka', 'Kentucky': 'Frankfort', 'Louisiana': 'Baton Rouge',
'Maine': 'Augusta', 'Maryland': 'Annapolis', 'Massachusetts': 'Boston',
'Michigan': 'Lansing', 'Minnesota': 'Saint Paul', 'Mississippi': 'Jackson',
'Missouri': 'Jefferson City', 'Montana': 'Helena', 'Nebraska': 'Lincoln',
'Nevada': 'Carson City', 'New Hampshire': 'Concord', 'New Jersey': 'Trenton',
'New Mexico': 'Santa Fe', 'New York': 'Albany', 'North Carolina': 'Raleigh',
'North Dakota': 'Bismarck', 'Ohio': 'Columbus', 'Oklahoma': 'Oklahoma City',
'Oregon': 'Salem', 'Pennsylvania': 'Harrisburg', 'Rhode Island': 'Providence',
'South Carolina': 'Columbia', 'South Dakota': 'Pierre', 'Tennessee': 'Nashville',
'Texas': 'Austin', 'Utah': 'Salt Lake City', 'Vermont': 'Montpelier',
'Virginia': 'Richmond', 'Washington': 'Olympia', 'West Virginia': 'Charleston',
'Wisconsin': 'Madison', 'Wyoming': 'Cheyenne'
}
# Write the question and the answer options to the quiz file.
quizFile.write(f'{questionNum + 1}. What is the capital of {states[questionNum]}?\n')
for i in range(4):
    quizFile.write(f' {"ABCD"[i]}. {answerOptions[i]}\n')
quizFile.write('\n')
quizFile.close()
answerKeyFile.close()
# The quiz data: keys are states and values are their capitals.
capitals = {
'Andhra Pradesh': 'Amaravati', 'Arunachal Pradesh': 'Itanagar', 'Assam': 'Dispur',
'Bihar': 'Patna', 'Chhattisgarh': 'Raipur', 'Goa': 'Panaji', 'Gujarat': 'Gandhinagar',
'Haryana': 'Chandigarh', 'Himachal Pradesh': 'Shimla', 'Jharkhand': 'Ranchi',
'Karnataka': 'Bengaluru', 'Kerala': 'Thiruvananthapuram', 'Madhya Pradesh':
'Bhopal',
'Maharashtra': 'Mumbai', 'Manipur': 'Imphal', 'Meghalaya': 'Shillong',
'Mizoram': 'Aizawl', 'Nagaland': 'Kohima', 'Odisha': 'Bhubaneswar',
'Punjab': 'Chandigarh', 'Rajasthan': 'Jaipur', 'Sikkim': 'Gangtok',
'Tamil Nadu': 'Chennai', 'Telangana': 'Hyderabad', 'Tripura': 'Agartala',
'Uttar Pradesh': 'Lucknow', 'Uttarakhand': 'Dehradun', 'West Bengal': 'Kolkata',
'Andaman and Nicobar Islands': 'Port Blair', 'Chandigarh (Union Territory)':
'Chandigarh',
'Dadra and Nagar Haveli and Daman and Diu': 'Daman', 'Lakshadweep':
'Kavaratti',
'Delhi': 'New Delhi', 'Puducherry': 'Puducherry', 'Jammu and Kashmir': 'Srinagar',
'Ladakh': 'Leh'
}
# Write the question and the answer options to the quiz file.
quizFile.write(f'{questionNum + 1}. What is the capital of {states[questionNum]}?\n')
for i in range(4):
    quizFile.write(f' {"ABCD"[i]}. {answerOptions[i]}\n')
quizFile.write('\n')
quizFile.close()
answerKeyFile.close()
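The loop that surrounds the fragment above is not shown in these notes. One possible way to complete it is sketched below; the function name make_quizzes, the file names, the reduced dictionary, and the temporary output folder are all assumptions made for illustration.

```python
import os
import random
import tempfile

# Small subset of the quiz data, to keep the sketch short
capitals = {'Karnataka': 'Bengaluru', 'Kerala': 'Thiruvananthapuram',
            'Bihar': 'Patna', 'Goa': 'Panaji', 'Assam': 'Dispur',
            'Sikkim': 'Gangtok'}

def make_quizzes(capitals, num_quizzes, out_dir):
    """Write num_quizzes quiz files and matching answer keys into out_dir."""
    for quizNum in range(num_quizzes):
        quiz_path = os.path.join(out_dir, f'capitalsquiz{quizNum + 1}.txt')
        key_path = os.path.join(out_dir, f'capitalsquiz_answers{quizNum + 1}.txt')
        with open(quiz_path, 'w') as quizFile, open(key_path, 'w') as answerKeyFile:
            # Shuffle the order of the states for this quiz
            states = list(capitals.keys())
            random.shuffle(states)
            for questionNum in range(len(states)):
                correctAnswer = capitals[states[questionNum]]
                # Distinct capitals other than the correct one
                wrongAnswers = [c for c in set(capitals.values()) if c != correctAnswer]
                # Three random wrong options plus the correct one, shuffled
                answerOptions = random.sample(wrongAnswers, 3) + [correctAnswer]
                random.shuffle(answerOptions)
                # Write the question and the answer options to the quiz file
                quizFile.write(f'{questionNum + 1}. What is the capital of {states[questionNum]}?\n')
                for i in range(4):
                    quizFile.write(f'    {"ABCD"[i]}. {answerOptions[i]}\n')
                quizFile.write('\n')
                # Write the letter of the correct option to the answer key
                letter = 'ABCD'[answerOptions.index(correctAnswer)]
                answerKeyFile.write(f'{questionNum + 1}. {letter}\n')

out_dir = tempfile.mkdtemp()
make_quizzes(capitals, 2, out_dir)
created = sorted(os.listdir(out_dir))
print(created)
```

Passing the full capitals dictionary and 35 as num_quizzes would produce the 35 quizzes described earlier. Building wrongAnswers from a set avoids offering the same capital twice when two states share one, as Haryana and Punjab do.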