Lab 5
Objectives
• Be able to craft and apply regular expressions to find matching text and solve problems.
• Be able to read from text files on the hard drive.
Preparation
• Launch the Jupyter notebook.
• Rename the notebook page as “lab5”.
• The solution to each problem should occupy one cell.
Please provide solutions to the problems below.
Problem 1
From your Lab module, download the file emails.txt and place it in the same folder as the current
notebook. Write a Python function called extract_emails_from_file that takes the path to a text
file as its input (in your case, emails.txt). A given text file can contain a mixture of text and email
addresses scattered across different lines. Your function’s task is to:
i) read the text file,
ii) use a regular expression to find all the email addresses, and
iii) return them as a list.
Ensure that your function can handle a variety of email formats, including those with dots, dashes,
and underscores in the local part and domain name.
Hints:
• Make sure the text file is in the same folder as the notebook.
• You may test your regular expression on regexr.com.
Use the following snippet of code to test your function:
emails = extract_emails_from_file('emails.txt')
for email in emails:
    print(email)
Sample Outputs:
[email protected]
[email protected]
[email protected]
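One way to approach the matching step is sketched below. It works on an in-memory string rather than on emails.txt, and both the pattern and the helper name find_emails are assumptions made for illustration, not the expected solution. The pattern accepts letters, digits, dots, dashes, and underscores on both sides of the @ sign and requires a final dot followed by at least two letters.

```python
import re

# Assumed pattern: the local part and the domain may contain letters, digits,
# dots, dashes, and underscores; the last label must be at least two letters.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9._-]+\.[A-Za-z]{2,}")


def find_emails(text):
    """Return every substring of text that looks like an email address."""
    return EMAIL_PATTERN.findall(text)


# Hypothetical sample text, not taken from emails.txt.
sample = "Contact first.last@example.com or sales_team@my-shop.example.org today."
found = find_emails(sample)
print(found)
```

To turn this into a file-based function, read the whole file into a string first (for example with open(path) and read()) and pass that string to the pattern.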
Problem 2
From your Lab module, download the file regex_sum_42.txt and place it in the same folder as the
current notebook. Create a Python function called compute_sum_from_text that takes the path to
a text file as its input (in your case, regex_sum_42.txt) and returns the sum. Your function’s task
is to:
i) read through and parse a file with text and numbers,
ii) extract all the numbers in the file, and
iii) compute the sum of the numbers.
Hints:
• Make sure the text file is in the same folder as the notebook.
• You may test your regular expression on regexr.com.
Use the following snippet of code to test your function:
mySum = compute_sum_from_text('regex_sum_42.txt')
print(mySum)
Sample Outputs:
445833
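The three steps above (read, extract, sum) can be sketched as follows. This is a minimal illustration under stated assumptions: the helper name sum_numbers_in_file and the demonstration file are hypothetical, and a "number" is taken to mean any run of decimal digits, so signs and decimal points are ignored.

```python
import re
import tempfile
from pathlib import Path


def sum_numbers_in_file(path):
    """Add up every run of digits found in the file at path."""
    total = 0
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            # re.findall returns the digit runs as strings, e.g. ['3', '42'].
            total += sum(int(number) for number in re.findall(r"[0-9]+", line))
    return total


# Hypothetical three-line file created only for this demonstration.
with tempfile.TemporaryDirectory() as folder:
    demo_path = Path(folder) / "demo_numbers.txt"
    demo_path.write_text(
        "Why 3 should you learn 42 to\nwrite programs 7\nno numbers here\n",
        encoding="utf-8",
    )
    demo_total = sum_numbers_in_file(demo_path)

print(demo_total)  # 3 + 42 + 7 = 52
```

Reading line by line keeps memory use small for large files; reading the whole file at once and calling re.findall a single time would give the same total.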
Problem 3
From your Lab module, download the file server_errors.log and place it in the same folder as the
current notebook. Write a Python function named parse_log that takes the path to a log file as its
input. The function should:
i) read the file,
ii) use regular expressions to extract error messages and their corresponding timestamps, and
iii) return a list of tuples.
Each tuple should contain the timestamp of the error and the error message.
Hints:
• Make sure the log file is in the same folder as the notebook.
• Assume the log format is "YYYY-MM-DD HH:MM:SS: Error: ErrorMessage".
• The function should return a list of tuples like this (timestamp, error_message).
Use the following snippet of code to test your function:
errors = parse_log('server_errors.log')
for error in errors:
    print(error)
Sample Outputs:
[('2023-04-01 12:00:00', 'Failed to connect to the database.')]
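The extraction step can be sketched with two capture groups, one for the timestamp and one for the message. The sketch below assumes the log format given in the hints and works on an in-memory string; the helper name find_errors, the sample lines, and the "Info" entry are hypothetical and not taken from server_errors.log.

```python
import re

# Group 1 captures the timestamp, group 2 the message after "Error: ".
# re.MULTILINE lets ^ and $ match at the start and end of every line.
LOG_PATTERN = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}): Error: (.*)$",
    re.MULTILINE,
)


def find_errors(text):
    """Return a list of (timestamp, error_message) tuples found in text."""
    # With two groups in the pattern, findall returns a list of 2-tuples.
    return LOG_PATTERN.findall(text)


# Hypothetical log content used only for this demonstration.
sample_log = (
    "2023-04-01 12:00:00: Error: Failed to connect to the database.\n"
    "2023-04-01 12:05:00: Info: Retrying connection.\n"
    "2023-04-01 12:06:30: Error: Connection timed out.\n"
)
sample_errors = find_errors(sample_log)
for entry in sample_errors:
    print(entry)
```

Lines that do not contain "Error:" directly after the timestamp are skipped, which is why the hypothetical "Info" line does not appear in the result.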
Problem 4
From your Lab module, download the file insurance_claims.txt and place it in the same folder as
the current notebook. In the field of actuarial science, accurate data analysis is crucial. You are
provided with a dataset containing raw insurance claim entries that include various pieces of
information such as the claim date, claim amount, policy number, and claimant comments.
However, the data is in a highly inconsistent format with lots of noise, including unnecessary
spaces, special characters, and varying data entry conventions. Your task is to write a Python
function named parse_insurance_claims to process this dataset and extract structured information.
Your Function Should:
• Accept the path to a dataset file as its input (in your case, insurance_claims.txt).
• Use regular expressions to robustly parse each claim entry.
• Extract and return a list of dictionaries. Each dictionary represents a claim entry with the
following keys: claim_date, claim_amount, policy_number, and claimant_comments.
Hints:
• The claim_date always follows the YYYY-MM-DD format, but it might be surrounded by
different characters.
• The claim_amount is prefixed with a dollar sign and may include commas for thousands, but it
could be surrounded by various symbols.
• The policy_number starts with a hash symbol followed by alphanumeric characters, but it might
be encased in different symbols or formats.
• The claimant_comments describe the claim reason and may contain any characters, usually
following the policy_number.
• Your regex must be versatile enough to handle different formats and noise within the claim
entries, accurately extracting the required information despite the inconsistencies.
Sample outputs (for the given input file):
[{'claim_date': '2024-03-01', 'claim_amount': 2500, 'policy_number': 'AB123',
'claimant_comments': 'Water damage in basement'},
{'claim_date': '2024-03-02', 'claim_amount': 1500, 'policy_number': 'CD456',
'claimant_comments': 'Stolen bicycle'},
{'claim_date': '2024-03-03', 'claim_amount': 5000, 'policy_number': 'EF789',
'claimant_comments': 'Car accident'},
{'claim_date': '2024-03-04', 'claim_amount': 300, 'policy_number': 'GH012',
'claimant_comments': 'Broken window'}
]
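One possible strategy for noisy entries is to search for each field independently instead of writing a single large pattern. The sketch below illustrates that idea under several assumptions: each claim sits on one line, the comment is whatever follows the policy number, and the sample lines are invented for the demonstration because the actual layout of insurance_claims.txt is not shown here. The helper name parse_claim_line is hypothetical.

```python
import re

DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")    # YYYY-MM-DD anywhere in the line
AMOUNT_RE = re.compile(r"\$\s*(\d[\d,]*)")    # dollar sign, optional spaces, digits and commas
POLICY_RE = re.compile(r"#\s*([A-Za-z0-9]+)")  # hash followed by alphanumerics


def parse_claim_line(line):
    """Return a dict for one claim line, or None if a field is missing."""
    date = DATE_RE.search(line)
    amount = AMOUNT_RE.search(line)
    policy = POLICY_RE.search(line)
    if not (date and amount and policy):
        return None
    # Assumption: the comment is the text after the policy number, with
    # surrounding punctuation and whitespace trimmed from both ends.
    comment = re.sub(r"^\W+|\W+$", "", line[policy.end():])
    return {
        "claim_date": date.group(),
        "claim_amount": int(amount.group(1).replace(",", "")),
        "policy_number": policy.group(1),
        "claimant_comments": comment,
    }


# Hypothetical noisy entries, not copied from insurance_claims.txt.
demo_lines = [
    "**2024-03-01** ; [$2,500] ; <#AB123> -- Water damage in basement !!",
    "  2024-03-02   $ 1500  (#CD456): Stolen bicycle  ",
    "this line has no claim data",
]
parsed = [claim for claim in map(parse_claim_line, demo_lines) if claim]
print(parsed)
```

Searching field by field tolerates fields that are wrapped in different symbols, at the cost of assuming that the first date, amount, and policy number on a line belong to the same claim.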