(Assignment 1 & 2) Regular Expression
1. Given a list of URLs, write a regular expression to extract the domain name from
each URL. Provide examples and explain the logic behind your regex.
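(?:https?://)?(?:www\.)?([a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*\.[a-zA-Z]{2,})
The optional groups skip the scheme and a leading www.; the capture group takes one or more dot-separated labels ending in an alphabetic top-level domain.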
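A minimal sketch of how a domain-extraction pattern of this kind could be applied in Python. The pattern, the helper name extract_domain, and the sample URLs are illustrative assumptions, not part of the assignment text.

```python
import re

# Optional http/https scheme, optional "www.", then the captured host name.
DOMAIN_RE = re.compile(
    r'^(?:https?://)?(?:www\.)?'
    r'([a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*\.[a-zA-Z]{2,})'
)

def extract_domain(url):
    """Return the host part of url in lowercase, or None if nothing matches."""
    match = DOMAIN_RE.match(url)
    return match.group(1).lower() if match else None

# Hypothetical sample URLs.
urls = [
    "https://www.example.com/path?q=1",
    "http://blog.example.co.uk",
    "example.org/about",
]
domains = [extract_domain(u) for u in urls]
print(domains)
```

The pattern keeps subdomains such as blog. and only strips a leading www.; reducing a host to its registrable domain would need a public-suffix list, which a regex alone cannot provide.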
2. You are given a text document containing various dates in different formats (e.g.,
MM/DD/YYYY, DD-MM-YYYY, YYYY/MM/DD). Write a Python script that uses
regular expressions to extract and format all the dates in a consistent manner.
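import re
from dateutil.parser import parse as dateutil_parser
# Sample input; replace with the contents of the text document.
dates = "Due 12/25/2023, shipped 25-12-2023, archived 2023/12/25."
# Raw string; the four-digit-year alternatives come before the two-digit-year ones.
datePattern = r'\d{4}-\d{2}-\d{2}|\d{2}-\d{2}-\d{4}|\d{2}-\d{2}-\d{2}|\d{4}/\d{2}/\d{2}|\d{2}/\d{2}/\d{4}|\d{2}/\d{2}/\d{2}'
dateMatches = re.findall(datePattern, dates)
consistentDateFormat = "%Y-%m-%d"
for date in dateMatches:
    # dateutil reads ambiguous dates such as 03-04-2023 month-first by default.
    print(dateutil_parser(date).strftime(consistentDateFormat))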
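Because a generic parser has to guess whether 03-04-2023 is day-first or month-first, one alternative is to tie each regex to an explicit strptime format. The sketch below uses only the standard library and assumes, following the formats named in the question, that slashes with a trailing year mean MM/DD/YYYY and dashes mean DD-MM-YYYY. The names FORMATS and normalize_dates and the sample string are hypothetical.

```python
import re
from datetime import datetime

# Each regex is paired with the strptime format it is assumed to represent.
FORMATS = [
    (re.compile(r'\b\d{4}/\d{2}/\d{2}\b'), "%Y/%m/%d"),
    (re.compile(r'\b\d{2}/\d{2}/\d{4}\b'), "%m/%d/%Y"),
    (re.compile(r'\b\d{2}-\d{2}-\d{4}\b'), "%d-%m-%Y"),
]

def normalize_dates(text):
    """Return every recognized date as YYYY-MM-DD, in order of appearance."""
    found = []
    for pattern, fmt in FORMATS:
        for match in pattern.finditer(text):
            try:
                parsed = datetime.strptime(match.group(), fmt)
            except ValueError:
                # Right shape but not a real date (e.g. month 13): skip it.
                continue
            found.append((match.start(), parsed.strftime("%Y-%m-%d")))
    return [date for _, date in sorted(found)]

sample = "Due 12/25/2023, shipped 03-04-2023, archived 2023/12/25."
result = normalize_dates(sample)
print(result)
```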
3. Suppose you have a large dataset of product descriptions. Write a regular expression
to find and extract all the prices mentioned in the descriptions, considering different
currency formats (e.g., $100, €50, ¥5000). Explain how your regex works.
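\p{Sc}[0-9]+(?:[.,][0-9]+)*
\p{Sc} matches any Unicode currency symbol (supported by PCRE, Java, and the third-party Python regex module, but not by Python's built-in re). At least one digit must follow, and further digit groups may be joined by . or , as thousands or decimal separators.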
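Python's built-in re module has no \p{...} property classes, so a sketch that stays within the standard library has to list the currency symbols explicitly. The symbol set below covers only the three symbols named in the question, and the sample description is made up for illustration.

```python
import re

# Currency symbol, optional space, digits, then optional ./, separated groups.
PRICE_RE = re.compile(r'[$€¥]\s?\d+(?:[.,]\d+)*')

# Hypothetical product description.
text = "Now $100, was €50,99; import price ¥5000 and $1,299.00."
prices = PRICE_RE.findall(text)
print(prices)
```

Requiring a digit after each separator keeps sentence punctuation, such as the comma after $100, out of the match.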
4. Given a block of HTML code, write a regular expression to extract all the hyperlinks
(URLs) contained within the HTML <a> tags. Explain the steps and groups in your
regex pattern.
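<a\s+(?:[^>]*?\s)?href=["']([^"']*)["'][^>]*>
The non-capturing group skips any attributes that precede href, group 1 captures the URL between the quotes, and [^>]* consumes the remaining attributes up to the end of the opening tag.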
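A short sketch of applying a link-extraction pattern with re.findall, which returns only the captured group when the pattern contains exactly one. The pattern and the HTML snippet are illustrative assumptions; for real-world HTML a parser is generally more robust than a regex.

```python
import re

# <a, optional attributes before href, quoted URL (group 1), remaining attributes.
LINK_RE = re.compile(
    r'<a\s+(?:[^>]*?\s)?href=["\']([^"\']*)["\'][^>]*>',
    re.IGNORECASE,
)

# Hypothetical HTML fragment; the last anchor has no href and is ignored.
html = (
    '<p><a href="https://example.com">Home</a> '
    '<a class="nav" href=\'/docs\' target="_blank">Docs</a> '
    '<a name="top">Top</a></p>'
)
links = LINK_RE.findall(html)
print(links)
```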
5. Implement a regular expression that can detect and correct common spelling
mistakes in a text document. Provide examples and explain the substitution logic
used in your regex.
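import re
text = "Thier are two many peple here. I definately want to go."
# Keys are regex patterns; \b anchors each misspelling at word boundaries.
corrections = {
    r'\bthier\b': 'there',
    r'\bpeple\b': 'people',
    r'\bdefinately\b': 'definitely'
}
for pattern, replacement in corrections.items():
    # IGNORECASE lets 'thier' also match the capitalized 'Thier'.
    text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
print(text)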
6. Design a regular expression to extract street addresses from a text document,
considering variations in address formats (e.g., 123 Main St, Apt 4B vs. 456 Elm
Avenue). Discuss the challenges and strategies for handling different address
structures.
\d+(?:[ \t][\w,-]+)*
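One challenge with an open-ended pattern is that it keeps consuming words after the address ends. A possible strategy is to anchor the match on a known street-type suffix and treat the unit as an optional tail. The sketch below assumes a small, hypothetical suffix list and capitalized street names, so it will miss many real formats.

```python
import re

# House number, capitalized street words, a street-type suffix, optional unit.
ADDRESS_RE = re.compile(
    r'\d+\s+(?:[A-Z][a-z]+\s+)+'
    r'(?:Street|St|Avenue|Ave|Road|Rd|Boulevard|Blvd|Drive|Dr|Lane|Ln)\b'
    r'(?:,\s*(?:Apt|Suite|Unit)\s*\w+)?'
)

# Example text taken from the question.
text = "123 Main St, Apt 4B vs. 456 Elm Avenue"
addresses = ADDRESS_RE.findall(text)
print(addresses)
```

The suffix list acts as the stopping condition: without it there is no reliable way to tell where the street name ends and ordinary prose resumes.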
7. Create a regular expression that identifies and extracts hexadecimal color codes
(e.g., #FFAABB) from a CSS stylesheet. Explain the pattern you use to capture these
codes.
#([A-Fa-f0-9]{6}|[A-Fa-f0-9]{3})
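A brief sketch of running a hex-color pattern over a stylesheet. It uses a non-capturing group so that re.findall returns the full code including the #, and adds a trailing \b (an assumption beyond the pattern above) so that malformed five-digit values are not partially matched. The CSS snippet is made up for illustration.

```python
import re

# '#' followed by exactly six or exactly three hex digits.
HEX_COLOR_RE = re.compile(r'#(?:[A-Fa-f0-9]{6}|[A-Fa-f0-9]{3})\b')

# Hypothetical stylesheet; #12345 is invalid and should be skipped.
css = "body { color: #FFAABB; background: #fff; border-color: #12345; }"
colors = HEX_COLOR_RE.findall(css)
print(colors)
```

The six-digit alternative is listed first so that a full code is not cut short to its first three digits.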
1. Develop an algorithm that can resolve context-related ambiguities when extracting
patterns, e.g., disambiguating between Apple as a fruit and Apple as a company.
2. Sentiment analysis and sentiment visualization using non-negative matrix
factorization.
3. Evaluate the performance of your above algorithm in terms of accuracy, precision
and recall.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
# Sample text data and corresponding labels (replace with your data)
text_data = [
"I want to eat apple. They are delicious",
"I want to deposit money in the bank.",
"I prefer to use apple products. They are great",
"The axe fell down near to the bank of the river",
]
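# Assumed reconstruction of the elided steps: TF-IDF features, then NMF with two topics.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(text_data)
nmf = NMF(n_components=2, random_state=0)
doc_topics = nmf.fit_transform(tfidf)
# Placeholder names; NMF topics are unlabeled, so inspect the top terms before naming them.
topic_labels = {0: "Topic A", 1: "Topic B"}
for sample, weights in zip(text_data, doc_topics):
    sample_sentiment = int(np.argmax(weights))
    sentiment_label = topic_labels[sample_sentiment]
    print(f"Text: {sample}")
    print(f"Predicted Sentiment: {sentiment_label} (Topic {sample_sentiment})")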
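For the evaluation asked for in question 3, accuracy, precision and recall can be computed directly from predicted and gold labels. The sketch below is plain Python with no library dependencies; the function name evaluate and the label lists are hypothetical placeholders, not results from the model above.

```python
def evaluate(y_true, y_pred, positive):
    """Return (accuracy, precision, recall), treating `positive` as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Hypothetical gold labels and predictions for four sentences.
y_true = ["company", "fruit", "company", "fruit"]
y_pred = ["company", "fruit", "fruit", "fruit"]
accuracy, precision, recall = evaluate(y_true, y_pred, positive="company")
print(accuracy, precision, recall)
```

With real data, y_true would come from manually annotated labels and y_pred from the topic assigned to each sentence.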