(Assignment 1 & 2) Regular Expression

This document discusses regular expressions (regex) and provides example patterns for the following tasks:

1. Extract domain names from URLs
2. Extract and standardize dates in different formats
3. Extract prices from product descriptions, considering different currency formats
4. Extract hyperlinks from HTML code
5. Correct common spelling mistakes in text
6. Extract street addresses, considering variations in address formats
7. Identify and extract hexadecimal color codes from CSS

It also covers an algorithm for resolving context-related ambiguities when extracting patterns, and sentiment analysis using non-negative matrix factorization.


2023MSCS229

Assignment 1: Regular Expressions

1. Given a list of URLs, write a regular expression to extract the domain name from
each URL. Provide examples and explain the logic behind your regex.

[a-zA-Z0-9-]+\.[a-zA-Z0-9-]+\.[a-zA-Z]{2,}|[a-zA-Z0-9-]+\.[a-zA-Z]{2,}

The first alternative matches three-part domains such as www.example.com (subdomain, domain, and a top-level domain of at least two letters); the second matches two-part domains such as example.com. The longer alternative comes first so that subdomains are not truncated, and + (rather than *) prevents empty labels from matching.
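A minimal usage sketch in Python; the URL list below is an invented example:

import re

domain_pattern = r'[a-zA-Z0-9-]+\.[a-zA-Z0-9-]+\.[a-zA-Z]{2,}|[a-zA-Z0-9-]+\.[a-zA-Z]{2,}'

# Hypothetical sample URLs for illustration
urls = [
    "https://www.example.com/path?query=1",
    "https://docs.python.org/3/library/re.html",
    "http://example.org",
]

for url in urls:
    match = re.search(domain_pattern, url)
    if match:
        print(match.group())  # www.example.com, docs.python.org, example.org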

2. You are given a text document containing various dates in different formats (e.g.,
MM/DD/YYYY, DD-MM-YYYY, YYYY/MM/DD). Write a Python script that uses
regular expressions to extract and format all the dates in a consistent manner.

import re
from dateutil.parser import parse as dateutil_parser

# Sample input text containing dates in different formats (replace with your data)
dates = "Report filed 2023/09/05, reviewed 14-09-2023, archived on 2023-10-01."

datePattern = r'\d{4}-\d{2}-\d{2}|\d{2}-\d{2}-\d{4}|\d{2}-\d{2}-\d{2}|\d{4}/\d{2}/\d{2}|\d{2}/\d{2}/\d{4}|\d{2}/\d{2}/\d{2}'
dateMatches = re.findall(datePattern, dates)

consistentDateFormat = "%Y-%m-%d"
for date in dateMatches:
    # dateutil guesses the field order; pass dayfirst=True if the input uses DD-MM-YYYY
    print(dateutil_parser(date).strftime(consistentDateFormat))

3. Suppose you have a large dataset of product descriptions. Write a regular expression
to find and extract all the prices mentioned in the descriptions, considering different
currency formats (e.g., $100, €50, ¥5000). Explain how your regex works.

[$€¥£]\d+(?:[.,]\d+)?

A currency symbol, followed by one or more digits, with an optional decimal part separated by a period or comma. (Python's built-in re module does not support \p{S}; with the third-party regex package, \p{Sc} could be used to match any Unicode currency symbol instead of an explicit symbol class.)
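A short usage sketch with the standard library; the product descriptions are invented examples:

import re

price_pattern = r'[$€¥£]\d+(?:[.,]\d+)?'

# Hypothetical product descriptions for illustration
descriptions = [
    "Wireless mouse, now only $19.99!",
    "Premium headphones for €50",
    "Mechanical keyboard priced at ¥5000",
]

for description in descriptions:
    print(re.findall(price_pattern, description))
# prints ['$19.99'], then ['€50'], then ['¥5000']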

4. Given a block of HTML code, write a regular expression to extract all the hyperlinks
(URLs) contained within the HTML <a> tags. Explain the steps and groups in your
regex pattern.

<a\s+href="(.*?)">
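Group 1 captures the URL between the double quotes, matched non-greedily so it stops at the first closing quote. A brief usage sketch, assuming href is the first attribute and is double-quoted (for arbitrary real-world markup, an HTML parser such as BeautifulSoup is more reliable):

import re

# Hypothetical HTML snippet for illustration
html = '<p>Visit <a href="https://www.example.com">Example</a> or <a href="https://docs.python.org/3/">the docs</a>.</p>'

# findall returns the contents of group 1, i.e. the URLs themselves
links = re.findall(r'<a\s+href="(.*?)">', html)
print(links)  # ['https://www.example.com', 'https://docs.python.org/3/']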

5. Implement a regular expression that can detect and correct common spelling
mistakes in a text document. Provide examples and explain the substitution logic
used in your regex.

text = "Thier are two many peple here. I bdefinately want to go."

corrections = {
'bthier': 'there',
'peple': 'people',
'bdefinately': 'definitely'
}

for pattern, replacement in corrections.items():


text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)

text
6. Design a regular expression to extract street addresses from a text document,
considering variations in address formats (e.g., 123 Main St, Apt 4B vs. 456 Elm
Avenue). Discuss the challenges and strategies for handling different address
structures.

\d+(?:[ \t][\w,-]+)*
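The main challenge is that a leading number followed by a run of words is inherently ambiguous: the pattern cannot tell where an address ends and ordinary prose resumes, and unit designators (Apt 4B), abbreviations (St vs. Street vs. Avenue), and punctuation all vary. A minimal sketch showing the pattern on invented sentences, including the over-matching problem:

import re

address_pattern = r'\d+(?:[ \t][\w,-]+)*'

# Hypothetical sentences for illustration
samples = [
    "Ship the order to 123 Main St, Apt 4B before Friday.",
    "Our office is at 456 Elm Avenue.",
]

for sample in samples:
    print(re.findall(address_pattern, sample))
# The first match is '123 Main St, Apt 4B before Friday' - the trailing words are
# swallowed too, which is why a keyword list (St, Ave, Apt, ...) or an NLP-based
# approach is usually needed to locate the real address boundary.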

7. Create a regular expression that identifies and extracts hexadecimal color codes
(e.g., #FFAABB) from a CSS stylesheet. Explain the pattern you use to capture these
codes.

#([A-Fa-f0-9]{6}|[A-Fa-f0-9]{3})
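The pattern matches a '#' followed by either six or three hexadecimal digits, upper- or lower-case. A quick usage sketch on an invented stylesheet snippet; note that re.findall returns the contents of the capturing group, i.e. the digits without the leading '#':

import re

hex_pattern = r'#([A-Fa-f0-9]{6}|[A-Fa-f0-9]{3})'

# Hypothetical CSS snippet for illustration
css = """
body { background: #FFAABB; color: #000; }
a:hover { color: #1a2b3c; }
"""

print(re.findall(hex_pattern, css))  # ['FFAABB', '000', '1a2b3c']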

Assignment 2: Regular Expressions

1. Develop an algorithm that can resolve context-related ambiguities when extracting
patterns, e.g., disambiguating between "Apple" the fruit and "Apple" the company.
2. Sentiment analysis and sentiment visualization using non-negative matrix
factorization.
3. Evaluate the performance of your algorithm above in terms of accuracy, precision,
and recall.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

# Sample text data (replace with your own data)
text_data = [
"I want to eat apple. They are delicious",
"I want to deposit money in the bank.",
"I prefer to use apple products. They are great",
"The axe fell down near to the bank of the river",
]

# Create a TF-IDF document-term matrix
vectorizer = TfidfVectorizer()
dtm = vectorizer.fit_transform(text_data)

# Apply NMF for topic modeling (assuming 3 topics)
num_topics = 3
nmf = NMF(n_components=num_topics)
W = nmf.fit_transform(dtm)

# Labels for the topics (here the three word senses; replace with sentiment labels
# for a sentiment-analysis setup).
# In practice, you might use a more sophisticated labeling approach.
topic_sentiments = ["fruit", "company", "place"]

# Perform sentiment analysis for each text sample
for i, sample in enumerate(text_data):
    topic_weights = W[i]  # Topic weights for the current sample
    sample_sentiment = np.argmax(topic_weights)  # Choose the dominant topic
    sentiment_label = topic_sentiments[sample_sentiment]  # Map to sentiment label

    print(f"Text: {sample}")
    print(f"Predicted Sentiment: {sentiment_label} (Topic {sample_sentiment})")
