NLP-pyth

Uploaded by

oaboalwafa75
NLP in Python

Dr. Loai Alnemer

1. Regular Expression

a. Example 1: Match an Email Address

import re

pattern = r'\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b'

text = "Contact us at [email protected]"

emails = re.findall(pattern, text)

print(emails)

- `\b`: Word boundary to ensure we match the whole email.

- `[a-zA-Z0-9._%+-]+`: Matches the username part, which can include letters, digits, and
special characters.

- `@`: The "@" symbol, required in every email.

- `[a-zA-Z0-9.-]+`: Matches the domain name, which can include letters, digits, dots, and
hyphens.

- `\.[a-zA-Z]{2,}`: Matches the top-level domain (e.g., `.com`, `.org`), requiring at least two
letters.
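To see the pattern pick up several addresses in one pass, here is a minimal sketch. The sample sentence and the `example.com` / `example.org` addresses are placeholders invented for illustration, and the name `EMAIL_RE` is an assumption, not part of the original notes.

```python
import re

# Same pattern as above, compiled once so it can be reused.
EMAIL_RE = re.compile(r'\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b')

# Hypothetical sample text; the addresses are placeholders.
sample = "Write to alice@example.com or bob.smith@mail.example.org; not-an-email@ is ignored."

# findall returns every non-overlapping match as a list of strings.
found = EMAIL_RE.findall(sample)
print(found)  # ['alice@example.com', 'bob.smith@mail.example.org']
```

The fragment `not-an-email@` is skipped because nothing that fits the domain part follows the `@`.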

b. Example 2: Match a Phone Number

pattern = r'\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b'

text = "Call me at 123-456-7890 or 987.654.3210"

phones = re.findall(pattern, text)

print(phones)

- `\b`: Word boundary.

- `\d{3}`: Matches three digits (area code).

- `[-.\s]?`: Matches an optional separator (dash, dot, or space).

- `\d{3}`: Matches the next three digits (exchange).

- `[-.\s]?`: Matches another optional separator.

- `\d{4}`: Matches the last four digits (subscriber number).
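If each digit group is wrapped in parentheses, `re.findall` returns the groups as tuples, which makes it easy to rewrite every number in one format. The sketch below is one possible way to do that; the names `PHONE_RE`, `parts` and `normalized` are assumptions for illustration.

```python
import re

# Same shape as the pattern above, with each digit group captured.
PHONE_RE = re.compile(r'\b(\d{3})[-.\s]?(\d{3})[-.\s]?(\d{4})\b')

text = "Call me at 123-456-7890 or 987.654.3210"

# With capturing groups, findall returns one tuple of groups per match.
parts = PHONE_RE.findall(text)

# Join the three groups with a dash to get a single, consistent format.
normalized = ["-".join(p) for p in parts]
print(normalized)  # ['123-456-7890', '987-654-3210']
```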

c. Example 3: Match a Date (MM/DD/YYYY)

pattern = r'\b\d{2}/\d{2}/\d{4}\b'

text = "Today's date is 05/12/2024"

dates = re.findall(pattern, text)

print(dates)

- `\b`: Word boundary.

- `\d{2}`: Matches exactly two digits (month).

- `/`: The separator.

- `\d{2}`: Matches two digits (day).

- `/`: The separator.

- `\d{4}`: Matches four digits (year).
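The pattern checks only the shape of the string, so something like `13/45/2024` also matches. One way to filter such cases, sketched here under the assumption that the dates are meant as MM/DD/YYYY, is to pass each candidate to `datetime.strptime`. The helper name `valid_dates` and the sample sentence are hypothetical.

```python
import re
from datetime import datetime

DATE_RE = re.compile(r'\b\d{2}/\d{2}/\d{4}\b')

def valid_dates(text):
    """Return regex matches that are also real calendar dates (MM/DD/YYYY)."""
    result = []
    for candidate in DATE_RE.findall(text):
        try:
            datetime.strptime(candidate, "%m/%d/%Y")
        except ValueError:
            continue  # right shape, but not a real date (e.g. month 13)
        result.append(candidate)
    return result

print(valid_dates("Due 05/12/2024, not 13/45/2024"))  # ['05/12/2024']
```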


d. Search for the first match of email:

import re

text = "My email is [email protected] and My alternative email is [email protected]"

pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'

match = re.search(pattern, text)

if match:
    print(f"Found email: {match.group()}")

`re.search` scans the whole string and returns only the first match, so the second address is ignored.
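The object returned by `re.search` carries more than the matched text, and the call returns `None` when nothing matches. A minimal sketch follows; the addresses and sample strings are placeholders, not taken from the notes.

```python
import re

pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
text = "Primary: first@example.com, backup: second@example.com"  # placeholder addresses

match = re.search(pattern, text)
first = match.group()   # the matched text: 'first@example.com'
span = match.span()     # (start, end) indices of the match inside text

# When nothing matches, search returns None, so test before calling .group().
no_match = re.search(pattern, "no address here")

print(first, span, no_match)
```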

e. Match function

`re.match` succeeds only if the pattern matches at the start of the string.

import re

text = "Hello World!"
pattern = r'Hello'
match = re.match(pattern, text)
if match:
    print("Match found at the beginning!")
else:
    print("No match at the beginning.")
# Output: Match found at the beginning!

text = "123 apples"
pattern = r'\d+'
match = re.match(pattern, text)
if match:
    print("Match found:", match.group())
else:
    print("No match.")
# Output: Match found: 123

text = "abc123"
pattern = r'^[a-zA-Z]+\d+$'
match = re.match(pattern, text)
if match:
    print("String starts with letters and ends with digits!")
else:
    print("No match.")
# Output: String starts with letters and ends with digits!
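For comparison, the sketch below runs `re.match`, `re.search` and `re.fullmatch` on the same string to show where each one is allowed to match. The variable names are chosen for illustration only.

```python
import re

text = "abc123"

at_start = re.match(r'\d+', text)            # None: text does not begin with a digit
anywhere = re.search(r'\d+', text)           # finds '123' later in the string
whole = re.fullmatch(r'[a-zA-Z]+\d+', text)  # the entire string must fit the pattern

print(at_start)          # None
print(anywhere.group())  # 123
print(whole.group())     # abc123
```

With `re.fullmatch`, the `^` and `$` anchors used in the last example above are not needed.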

f. sub function

text = "Please contact us at [email protected]."


pattern = r'\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b'
replacement = "REDACTED"
new_text = re.sub(pattern, replacement, text)
print(new_text)
# Output: Please contact us at REDACTED.
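The replacement passed to `re.sub` can also refer to captured groups, or be a function that receives each match. The sketch below masks only the user name and keeps the domain; the address is a placeholder and `mask_user` is a hypothetical helper.

```python
import re

# Group 1 is the user name, group 2 the domain including the top-level domain.
EMAIL_RE = re.compile(r'\b([a-zA-Z0-9._%+-]+)@([a-zA-Z0-9.-]+\.[a-zA-Z]{2,})\b')

text = "Please contact us at info@example.com."  # placeholder address

# \2 in the replacement string inserts the second captured group.
masked = EMAIL_RE.sub(r'***@\2', text)
print(masked)  # Please contact us at ***@example.com.

def mask_user(m):
    """Keep the first character of the user name and star out the rest."""
    user = m.group(1)
    return user[0] + "*" * (len(user) - 1) + "@" + m.group(2)

partly_masked = EMAIL_RE.sub(mask_user, text)
print(partly_masked)  # Please contact us at i***@example.com.
```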

g. Split function

text = "Please contact us at [email protected]."

pattern = r'\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b'
# Split the text around every email address the pattern matches
parts = re.split(pattern, text)
print(parts)
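As an additional, hedged illustration of `re.split`, the sketch below splits a string on any run of commas, semicolons or whitespace. The sample string and variable names are invented for the example.

```python
import re

text = "apples, oranges;bananas  grapes"

# Any run of commas, semicolons or whitespace counts as a single separator.
tokens = re.split(r'[,;\s]+', text)
print(tokens)  # ['apples', 'oranges', 'bananas', 'grapes']

# maxsplit limits the number of splits; the remainder is returned unsplit.
head, rest = re.split(r'[,;\s]+', text, maxsplit=1)
print(head)  # apples
print(rest)  # oranges;bananas  grapes
```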

h. Finditer function

text = "The rain in Spain stays mainly in the plain."

# 'ain' followed by a word boundary, i.e. words that end in "ain"
pattern = r'ain\b'
matches = re.finditer(pattern, text)
for match in matches:
    print(f"Match found at position {match.start()}: '{match.group()}'")
# Output:
# Match found at position 5: 'ain'
# Match found at position 14: 'ain'
# Match found at position 40: 'ain'
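Because `re.finditer` yields match objects, the position and the text of each match are available together. The sketch below widens the pattern (an assumption for illustration, not the pattern used above) to capture every whole word that contains "ain".

```python
import re

text = "The rain in Spain stays mainly in the plain."

# \w* on both sides extends the match to the whole word containing "ain".
WORD_RE = re.compile(r'\b\w*ain\w*\b')

# Collect (start index, word) pairs from the match objects.
hits = [(m.start(), m.group()) for m in WORD_RE.finditer(text)]
print(hits)  # [(4, 'rain'), (12, 'Spain'), (24, 'mainly'), (38, 'plain')]
```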

2. Stemming
a. PorterStemmer
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
words = ["running", "ran", "easily", "fairly"]
stems = [stemmer.stem(word) for word in words]
print(stems)
# Output: ['run', 'ran', 'easili', 'fairli']
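Stems such as `easili` and `fairli` trace back to a rule in Porter's algorithm (step 1c) that rewrites a final `y` as `i` when the rest of the word contains a vowel. The toy function below sketches that single rule in simplified form. It is not NLTK's implementation and not a full stemmer, and the name `step_1c` is hypothetical.

```python
VOWELS = set("aeiou")

def step_1c(word):
    """Toy version of Porter's step 1c: final 'y' becomes 'i' if the stem has a vowel."""
    stem = word[:-1]
    if word.endswith("y") and any(ch in VOWELS for ch in stem):
        return stem + "i"
    return word

for w in ["easily", "fairly", "sky"]:
    print(w, "->", step_1c(w))
# easily -> easili
# fairly -> fairli
# sky -> sky
```

The result is a stem rather than a dictionary word, which is why the output need not be valid English.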
b. Lancaster Stemmer
from nltk.stem import LancasterStemmer
# Initialize the Lancaster Stemmer
lancaster_stemmer = LancasterStemmer()
# Example words to stem
words = ["running", "ran", "easily", "fairly", "happiness"]
# Apply stemming
lancaster_stems = [lancaster_stemmer.stem(word) for word in words]
print("Lancaster Stemmer:", lancaster_stems)
# Output: Lancaster Stemmer: ['run', 'ran', 'easy', 'fair', 'happy']

c. Snowball stemmer
from nltk.stem import SnowballStemmer
# Initialize the Snowball Stemmer for English
snowball_stemmer = SnowballStemmer("english")
# Example words to stem
words = ["running", "ran", "easily", "fairly", "happiness"]
# Apply stemming
snowball_stems = [snowball_stemmer.stem(word) for word in words]
print("Snowball Stemmer:", snowball_stems)
# Output: Snowball Stemmer: ['run', 'ran', 'easili', 'fair', 'happi']

d. Compare all

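# Stem the same word list with Porter so all three lists have equal length
porter_stems = [stemmer.stem(word) for word in words]

# Combine results for comparison
results = {
    "Word": words,
    "Porter": porter_stems,
    "Lancaster": lancaster_stems,
    "Snowball": snowball_stems,
}
# Print the results as a table
print(f"{'Word':<10} | {'Porter':<10} | {'Lancaster':<10} | {'Snowball':<10}")
for i in range(len(words)):
    print(f"{results['Word'][i]:<10} | {results['Porter'][i]:<10} | "
          f"{results['Lancaster'][i]:<10} | {results['Snowball'][i]:<10}")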

Word       | Porter     | Lancaster  | Snowball
running    | run        | run        | run
ran        | ran        | ran        | ran
easily     | easili     | easy       | easili
fairly     | fairli     | fair       | fair
happiness  | happi      | happy      | happi
