NLP in Python

Dr. Loai Alnemer

1. Regular Expression

a. Example 1: Match an Email Address

import re

pattern = r'\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b'

text = "Contact us at [email protected]"

emails = re.findall(pattern, text)

print(emails)

- `\b`: Word boundary to ensure we match the whole email.

- `[a-zA-Z0-9._%+-]+`: Matches the username part, which can include letters, digits, and
the characters `.`, `_`, `%`, `+`, and `-`.

- `@`: The "@" symbol, required in every email.

- `[a-zA-Z0-9.-]+`: Matches the domain name, which can include letters, digits, dots, and
hyphens.

- `\.[a-zA-Z]{2,}`: Matches the top-level domain (e.g., `.com`, `.org`), requiring at least two
letters.
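
To see what the pattern accepts and rejects, the sketch below runs it on a made-up sentence. The addresses are hypothetical and chosen only for illustration; the pattern itself is the one from Example 1.

```python
import re

# Same pattern as in Example 1, compiled once for reuse.
EMAIL_PATTERN = re.compile(r'\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b')

# Hypothetical sample text (addresses invented for illustration).
text = "Write to alice.smith@example.com or bob+news@mail.example.org, not to carol@localhost."

emails = EMAIL_PATTERN.findall(text)
print(emails)
# ['alice.smith@example.com', 'bob+news@mail.example.org']
```

The last address is skipped because it has no dot followed by at least two letters, so the top-level-domain part of the pattern cannot match.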

b. Example 2: Match a Phone Number

pattern = r'\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b'

text = "Call me at 123-456-7890 or 987.654.3210"

phones = re.findall(pattern, text)

print(phones)

- `\b`: Word boundary.

- `\d{3}`: Matches three digits (area code).

- `[-.\s]?`: Matches an optional separator (dash, dot, or space).

- `\d{3}`: Matches the next three digits (exchange).

- `[-.\s]?`: Matches another optional separator.

- `\d{4}`: Matches the last four digits (subscriber number).
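
Because the separators are optional and can differ, matched numbers come back in mixed formats. A possible way to normalize them is to wrap each digit run in a capture group. This is a sketch: the groups and the `normalized` name are additions for illustration, not part of the original pattern.

```python
import re

# Pattern from Example 2, with a capture group around each run of digits.
PHONE_PATTERN = re.compile(r'\b(\d{3})[-.\s]?(\d{3})[-.\s]?(\d{4})\b')

text = "Call me at 123-456-7890 or 987.654.3210"

# With groups in the pattern, findall returns one tuple of groups per match.
normalized = ["-".join(parts) for parts in PHONE_PATTERN.findall(text)]
print(normalized)
# ['123-456-7890', '987-654-3210']
```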

c. Example 3: Match a Date (MM/DD/YYYY)

pattern = r'\b\d{2}/\d{2}/\d{4}\b'

text = "Today's date is 05/12/2024"

dates = re.findall(pattern, text)

print(dates)

- `\b`: Word boundary.

- `\d{2}`: Matches exactly two digits (month).

- `/`: The separator.

- `\d{2}`: Matches two digits (day).

- `/`: The separator.

- `\d{4}`: Matches four digits (year).
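
The pattern checks only the shape of the string, so something like `13/45/2024` also matches. One way to keep only real calendar dates is to pass each candidate through `datetime.strptime`. This is a sketch; `find_valid_dates` is a hypothetical helper name, and the second date in the sample text is invented.

```python
import re
from datetime import datetime

# Same pattern as in Example 3.
DATE_PATTERN = re.compile(r'\b\d{2}/\d{2}/\d{4}\b')

def find_valid_dates(text):
    """Return date objects for MM/DD/YYYY strings that are real calendar dates."""
    valid = []
    for candidate in DATE_PATTERN.findall(text):
        try:
            valid.append(datetime.strptime(candidate, "%m/%d/%Y").date())
        except ValueError:
            # Right shape, but not a real date (for example, month 13).
            continue
    return valid

dates = find_valid_dates("Today's date is 05/12/2024, not 13/45/2024")
print(dates)
# [datetime.date(2024, 5, 12)]
```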


d. Search for the first match of email:

import re

text = "My email is [email protected] and My alternative email is [email protected]"

pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'

match = re.search(pattern, text)

if match:

    print(f"Found email: {match.group()}")

`re.search` scans the whole string and returns only the first match, or `None` if nothing matches.
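
The match object returned by `re.search` also carries the position of the match. The sketch below contrasts it with `re.findall` on the same text. Both addresses are hypothetical and stand in for real ones.

```python
import re

pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
# Hypothetical addresses, invented for illustration.
text = "My email is first@example.com and my alternative email is second@example.org"

match = re.search(pattern, text)
first = match.group() if match else None
all_found = re.findall(pattern, text)

print(first)         # first@example.com
print(match.span())  # (12, 29)
print(all_found)     # ['first@example.com', 'second@example.org']
```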

e. Match function

import re

# re.match succeeds only if the pattern matches at the start of the string.
text = "Hello World!"
pattern = r'Hello'
match = re.match(pattern, text)
if match:
    print("Match found at the beginning!")
else:
    print("No match at the beginning.")
# Output: Match found at the beginning!

text = "123 apples"
pattern = r'\d+'
match = re.match(pattern, text)
if match:
    print("Match found:", match.group())
else:
    print("No match.")
# Output: Match found: 123

text = "abc123"
pattern = r'^[a-zA-Z]+\d+$'
match = re.match(pattern, text)
if match:
    print("String starts with letters and ends with digits!")
else:
    print("No match.")
# Output: String starts with letters and ends with digits!
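
The difference between `re.match`, `re.search`, and `re.fullmatch` is easiest to see on a single string. The following is a small sketch using only standard `re` functions; the variable names are chosen for illustration.

```python
import re

text = "abc123"

# re.match anchors at the start of the string only.
starts_with_digits = re.match(r'\d+', text) is not None
# re.search scans forward and finds the first occurrence anywhere.
contains_digits = re.search(r'\d+', text) is not None
# re.fullmatch requires the whole string to match the pattern.
only_letters = re.fullmatch(r'[a-zA-Z]+', text) is not None
letters_then_digits = re.fullmatch(r'[a-zA-Z]+\d+', text) is not None

print(starts_with_digits, contains_digits, only_letters, letters_then_digits)
# False True False True
```

With `re.fullmatch`, the explicit `^` and `$` anchors are not needed.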

f. sub function

text = "Please contact us at [email protected]."


pattern = r'\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b'
replacement = "REDACTED"
new_text = re.sub(pattern, replacement, text)
print(new_text)
# Output: Please contact us at REDACTED
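
The replacement passed to `re.sub` can also refer back to captured groups, and `re.subn` additionally reports how many substitutions were made. The sketch below assumes a hypothetical address and adds a capture group around the domain, which the original pattern does not have.

```python
import re

# Email pattern from above, with the domain captured in group 1.
pattern = r'\b[a-zA-Z0-9._%+-]+@([a-zA-Z0-9.-]+\.[a-zA-Z]{2,})\b'
text = "Please contact us at help@example.com."  # hypothetical address

# \1 in the replacement inserts whatever group 1 matched.
masked = re.sub(pattern, r'***@\1', text)
print(masked)
# Please contact us at ***@example.com.

# subn returns the new string together with the number of replacements.
new_text, count = re.subn(pattern, "REDACTED", text)
print(new_text, count)
# Please contact us at REDACTED. 1
```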

g. Split function

text = "Please contact us at [email protected]."

# Split the text at every email address; the matched addresses are dropped.
pattern = r'\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b'
parts = re.split(pattern, text)
print(parts)
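
A common use of `re.split` is to break a string on several different delimiters at once, which `str.split` cannot do. A minimal sketch, with an invented input string:

```python
import re

text = "apples, oranges;bananas   grapes"  # hypothetical input

# Split on any run of commas, semicolons, or whitespace.
tokens = re.split(r'[,;\s]+', text)
print(tokens)
# ['apples', 'oranges', 'bananas', 'grapes']

# maxsplit limits the number of splits; the remainder stays in one piece.
limited = re.split(r'[,;\s]+', text, maxsplit=1)
print(limited)
# ['apples', 'oranges;bananas   grapes']
```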

h. Finditer function

text = "The rain in Spain stays mainly in the plain."

# 'ain\b' matches "ain" only at the end of a word
pattern = r'ain\b'
matches = re.finditer(pattern, text)
for match in matches:
    print(f"Match found at position {match.start()}: '{match.group()}'")
# Output:
# Match found at position 5: 'ain'
# Match found at position 14: 'ain'
# Match found at position 40: 'ain'
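
Since `re.finditer` yields match objects, each hit can be stored with its start and end offsets. The sketch below uses a different pattern, `\b\w*ain\b`, chosen here for illustration, to collect whole words ending in "ain".

```python
import re

text = "The rain in Spain stays mainly in the plain."

# Collect (word, start, end) for every whole word that ends in "ain".
hits = [(m.group(), m.start(), m.end()) for m in re.finditer(r'\b\w*ain\b', text)]
print(hits)
# [('rain', 4, 8), ('Spain', 12, 17), ('plain', 38, 43)]
```

"mainly" is not reported because "ain" is followed by a letter there, so the closing `\b` does not match.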

2. Stemming
a. PorterStemmer
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
words = ["running", "ran", "easily", "fairly"]
stems = [stemmer.stem(word) for word in words]
print(stems)
# Output: ['run', 'ran', 'easili', 'fairli']
b. Lancaster Stemmer
from nltk.stem import LancasterStemmer
# Initialize the Lancaster Stemmer
lancaster_stemmer = LancasterStemmer()
# Example words to stem
words = ["running", "ran", "easily", "fairly", "happiness"]
# Apply stemming
lancaster_stems = [lancaster_stemmer.stem(word) for word in words]
print("Lancaster Stemmer:", lancaster_stems)
# Output: Lancaster Stemmer: ['run', 'ran', 'easy', 'fair', 'happy']

c. Snowball stemmer
from nltk.stem import SnowballStemmer
# Initialize the Snowball Stemmer for English
snowball_stemmer = SnowballStemmer("english")
# Example words to stem
words = ["running", "ran", "easily", "fairly", "happiness"]
# Apply stemming
snowball_stems = [snowball_stemmer.stem(word) for word in words]
print("Snowball Stemmer:", snowball_stems)
# Output: Snowball Stemmer: ['run', 'ran', 'easili', 'fair', 'happi']

d. Compare all

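# Recompute the Porter stems on the same five-word list used above
porter_stems = [stemmer.stem(word) for word in words]

# Combine results for comparison
results = {
    "Word": words,
    "Porter": porter_stems,
    "Lancaster": lancaster_stems,
    "Snowball": snowball_stems,
}
# Print the results as a table
print(f"{'Word':<10} | {'Porter':<10} | {'Lancaster':<10} | {'Snowball':<10}")
for i in range(len(words)):
    print(f"{results['Word'][i]:<10} | {results['Porter'][i]:<10} | "
          f"{results['Lancaster'][i]:<10} | {results['Snowball'][i]:<10}")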

Word       | Porter     | Lancaster  | Snowball
running    | run        | run        | run
ran        | ran        | ran        | ran
easily     | easili     | easy       | easili
fairly     | fairli     | fair       | fair
happiness  | happi      | happy      | happi
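
To pick out only the words on which the stemmers disagree, each stemmer's output can be collected per word and compared. The helper below is a sketch: `find_disagreements` is a hypothetical name, and the two toy functions are simple stand-ins (not real stemming algorithms) so that the block runs without NLTK installed.

```python
def find_disagreements(words, stemmers):
    """Return {word: {name: stem}} for words whose stems differ across stemmers.

    `stemmers` maps a label to any callable that takes a word and returns its stem.
    """
    disagreements = {}
    for word in words:
        stems = {name: stem(word) for name, stem in stemmers.items()}
        if len(set(stems.values())) > 1:
            disagreements[word] = stems
    return disagreements


# Toy stand-ins (NOT real stemmers), used only to make the sketch runnable.
toy_stemmers = {
    "strip_ly": lambda w: w[:-2] if w.endswith("ly") else w,
    "strip_ing": lambda w: w[:-3] if w.endswith("ing") else w,
}

diff = find_disagreements(["running", "ran", "fairly"], toy_stemmers)
print(diff)
```

With the NLTK objects created earlier, the same helper could be called as `find_disagreements(words, {"Porter": stemmer.stem, "Lancaster": lancaster_stemmer.stem, "Snowball": snowball_stemmer.stem})`.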
